Compare commits
13 Commits
Author | SHA1 | Date |
---|---|---|
Kasianov Nikolai Alekseevich | 722f3fb536 | 2 years ago |
Kasianov Nikolai Alekseevich | c91986d42d | 2 years ago |
Kasianov Nikolai Alekseevich | c2ec2073dc | 2 years ago |
Kasianov Nikolai Alekseevich | 812fd2adf7 | 2 years ago |
Kasianov Nikolai Alekseevich | b256d8a83e | 2 years ago |
Kasianov Nikolai Alekseevich | e5af2939cc | 2 years ago |
Kasianov Nikolai Alekseevich | 6fab9031b1 | 2 years ago |
Kasianov Nikolai Alekseevich | fd484c665e | 2 years ago |
Kasianov Nikolai Alekseevich | 00bc33d5de | 2 years ago |
Kasianov Nikolai Alekseevich | f96bad448a | 2 years ago |
Kasianov Nikolai Alekseevich | d877a483a2 | 2 years ago |
Kasianov Nikolai Alekseevich | 023c2e5a19 | 2 years ago |
Kasianov Nikolai Alekseevich | 1771d19b82 | 2 years ago |
19 changed files with 12927 additions and 472 deletions
```diff
@@ -1,38 +1,91 @@
-# Wecr - simple web crawler
+# Wecr - versatile WEb CRawler
 
 ## Overview
 
-Just a simple HTML web spider with minimal dependencies. It is possible to search for pages with a text on them or for the text itself, extract images and save pages that satisfy the criteria along the way.
+A simple HTML web spider with no dependencies. It is possible to search for pages with certain text on them or for the text itself, extract images, video and audio, and save pages that satisfy the criteria along the way.
 
-## Configuration
+## Configuration Overview
 
-The flow of work fully depends on the configuration file. By default `conf.json` is used as a configuration file, but the name can be changed via `-conf` flag. The default configuration is embedded in the program so on the first launch or by simply deleting the file, a new `conf.json` will be created in the same directory as the executable itself unless the `-wDir` (working directory) flag is set to some other value. To see al available flags run `wecr -h`.
+The flow of work fully depends on the configuration file. By default `conf.json` is used as the configuration file, but the name can be changed via the `-conf` flag. The default configuration is embedded in the program, so on the first launch, or after simply deleting the file, a new `conf.json` will be created in the working directory unless the `-wdir` (working directory) flag points somewhere else, in which case that directory takes precedence. To see all available flags run `wecr -h`.
```
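For illustration, a minimal Go sketch of the first-launch behavior described above (not wecr's actual code; it assumes the embedded default is a single `conf.json` file present at build time):

```go
package main

import (
	_ "embed" // required for //go:embed on a plain []byte
	"fmt"
	"os"
	"path/filepath"
)

// the default configuration shipped inside the binary;
// conf.json must exist next to this source at build time
//go:embed conf.json
var defaultConf []byte

// writeDefaultConf mirrors the described behavior: create conf.json
// in the working directory only if it does not exist yet
func writeDefaultConf(workingDir string) error {
	confPath := filepath.Join(workingDir, "conf.json")
	if _, err := os.Stat(confPath); err == nil {
		return nil // a configuration file is already present
	}
	return os.WriteFile(confPath, defaultConf, 0644)
}

func main() {
	if err := writeDefaultConf("."); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```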
```diff
 
 The configuration is split into different branches like `requests` (how requests are made, ie: request timeout, wait time, user agent), `logging` (use logs, output to a file), `save` (output file|directory, save pages or not) or `search` (use regexp, query string), each of which contains tweakable parameters. There are global ones as well, such as `workers` (working threads that make requests in parallel) and `depth` (literally, how deep the recursive search should go). The names are simple and self-explanatory, so no attribute-by-attribute explanation is needed for most of them.
 
 The parsing starts from `initial_pages` and goes deeper while ignoring the pages on domains that are in `blacklisted_domains` or are NOT in `allowed_domains`. If all initial pages happen to be on blacklisted domains or are not in the allowed list - the program will get stuck. It is important to note that `*_domains` should be specified with an existing scheme (ie: https://en.wikipedia.org). Subdomains and ports **matter**: `https://unbewohnte.su:3000/` and `https://unbewohnte.su/` are **different**.
 
+Previous versions stored the entire visit queue in memory, resulting in gigabytes of memory usage, but as of `v0.2.4` it is possible to offload the queue to persistent storage via the `in_memory_visit_queue` option (`false` by default).
+
+You can change the search `query` at **runtime** via the web dashboard if `launch_dashboard` is set to `true`.
 
 ### Search query
 
-There are some special `query` values:
+There are some special `query` values to control the flow of work:
 
 - `email` - tells wecr to scrape email addresses and output to `output_file`
 - `images` - find all images on pages and output to the corresponding directory in `output_dir` (**IMPORTANT**: set `content_fetch_timeout_ms` to `0` so the images (and other content below) load fully)
 - `videos` - find and fetch files that look like videos
 - `audio` - find and fetch files that look like audio
-- `everything` - find and fetch images, audio and video
+- `documents` - find and fetch files that look like a document
+- `everything` - find and fetch images, audio, video, documents and email addresses
+- `archive` - no text to be searched, save every visited page
 
+When `is_regexp` is enabled, the `query` is treated as a regexp string (in Go "flavor") and pages will be scanned for matches that satisfy it.
```
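Since the regexp flavor is Go's, a pattern you test with Go's `regexp` package behaves the same way as a `query`. A small self-contained sketch (the page text here is made up):

```go
package main

import (
	"fmt"
	"regexp"
)

func main() {
	// the same kind of pattern you would put into "query"
	// when "is_regexp" is set to true
	query := regexp.MustCompile(`(sequence to search)|(other sequence)`)

	pageText := "a page containing other sequence somewhere in its text"

	// wecr scans fetched pages for all matches that satisfy the expression
	for _, match := range query.FindAllString(pageText, -1) {
		fmt.Println(match)
	}
}
```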
```diff
 
-When `is_regexp` is enabled, the `query` is treated as a regexp string and pages will be scanned for matches that satisfy it.
+### Data Output
 
-### Output
+If the query is not something of special value, all text matches will be written to the `found_text.json` file as separate continuous JSON objects in `output_dir`; if `save_pages` is set to `true` and|or `query` is set to `images`, `videos`, `audio`, etc. - the additional contents will also be put in the corresponding directories inside `output_dir`, which is neatly created in the working directory or, if the `-wdir` flag is set - there. If `output_dir` happens to be empty - contents will be written directly to the working directory.
 
-By default, if the query is not something of special values all the matches and other data will be outputted to `output.json` file as separate continuous JSON objects, but if `save_pages` is set to `true` and|or `query` is set to `images`, `videos`, `audio`, etc. - the additional contents will be put in the corresponding directories inside `output_dir`, which is neatly created by the executable's side.
+The output almost certainly contains some duplicates and is not easy to work with programmatically, so you can use `-extractData` with the output JSON file argument (like `found_text.json`, which is the default output file name for simple text searches) to extract the actual data, filter out the duplicates and put each entry on its own line in a new text file.
```
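A minimal sketch of the kind of de-duplication `-extractData` performs; this is an illustration, not wecr's implementation, and the `data` field name is a hypothetical stand-in for whatever the output objects actually contain:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// entry mirrors the shape of one output object; the "data"
// field name is a hypothetical stand-in for illustration
type entry struct {
	Data string `json:"data"`
}

func main() {
	file, err := os.Open("found_text.json")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	defer file.Close()

	// the output file holds separate continuous JSON objects,
	// so decode them one by one from the same stream
	decoder := json.NewDecoder(file)
	seen := make(map[string]bool)
	for decoder.More() {
		var e entry
		if err := decoder.Decode(&e); err != nil {
			break
		}
		if !seen[e.Data] {
			seen[e.Data] = true
			fmt.Println(e.Data) // one unique entry per line
		}
	}
}
```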
```diff
 
 ## Build
 
 If you're on *nix - it's as easy as `make`.
 
-Otherwise - `go build` in the `src` directory to build `wecr`.
+Otherwise - `go build` in the `src` directory to build `wecr`. No dependencies.
 
+## Examples
+
+See [a page on my website](https://unbewohnte.su/wecr) for some basic examples.
+
+Dump of a basic configuration:
+
+```json
+{
+    "search": {
+        "is_regexp": true,
+        "query": "(sequence to search)|(other sequence)"
+    },
+    "requests": {
+        "request_wait_timeout_ms": 2500,
+        "request_pause_ms": 100,
+        "content_fetch_timeout_ms": 0,
+        "user_agent": ""
+    },
+    "depth": 90,
+    "workers": 30,
+    "initial_pages": [
+        "https://en.wikipedia.org/wiki/Main_Page"
+    ],
+    "allowed_domains": [
+        "https://en.wikipedia.org/"
+    ],
+    "blacklisted_domains": [
+        ""
+    ],
+    "in_memory_visit_queue": false,
+    "web_dashboard": {
+        "launch_dashboard": true,
+        "port": 13370
+    },
+    "save": {
+        "output_dir": "scraped",
+        "save_pages": false
+    },
+    "logging": {
+        "output_logs": true,
+        "logs_file": "logs.log"
+    }
+}
+```
 
 ## License
-AGPLv3
+wecr is distributed under the AGPLv3 license.
```
@@ -0,0 +1,161 @@

```go
/*
	Wecr - crawl the web for data
	Copyright (C) 2023 Kasyanov Nikolay Alexeyevich (Unbewohnte)

	This program is free software: you can redistribute it and/or modify
	it under the terms of the GNU Affero General Public License as published by
	the Free Software Foundation, either version 3 of the License, or
	(at your option) any later version.

	This program is distributed in the hope that it will be useful,
	but WITHOUT ANY WARRANTY; without even the implied warranty of
	MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
	GNU Affero General Public License for more details.

	You should have received a copy of the GNU Affero General Public License
	along with this program. If not, see <https://www.gnu.org/licenses/>.
*/

package dashboard

import (
	"embed"
	"encoding/json"
	"fmt"
	"html/template"
	"io"
	"io/fs"
	"net/http"
	"unbewohnte/wecr/config"
	"unbewohnte/wecr/logger"
	"unbewohnte/wecr/worker"
)

type Dashboard struct {
	Server *http.Server
}

//go:embed res
var resFS embed.FS

type PageData struct {
	Conf  config.Conf
	Stats worker.Statistics
}

type PoolStop struct {
	Stop bool `json:"stop"`
}

func NewDashboard(port uint16, webConf *config.Conf, pool *worker.Pool) *Dashboard {
	mux := http.NewServeMux()
	res, err := fs.Sub(resFS, "res")
	if err != nil {
		logger.Error("Failed to Sub embedded dashboard FS: %s", err)
		return nil
	}

	mux.Handle("/static/", http.FileServer(http.FS(res)))

	mux.HandleFunc("/", func(w http.ResponseWriter, req *http.Request) {
		template, err := template.ParseFS(res, "*.html")
		if err != nil {
			logger.Error("Failed to parse embedded dashboard FS: %s", err)
			return
		}

		template.ExecuteTemplate(w, "index.html", nil)
	})

	mux.HandleFunc("/stop", func(w http.ResponseWriter, req *http.Request) {
		var stop PoolStop

		requestBody, err := io.ReadAll(req.Body)
		if err != nil {
			http.Error(w, "Failed to read request body", http.StatusInternalServerError)
			logger.Error("Failed to read stop|resume signal from dashboard request: %s", err)
			return
		}
		defer req.Body.Close()

		err = json.Unmarshal(requestBody, &stop)
		if err != nil {
			http.Error(w, "Failed to unmarshal stop|resume signal", http.StatusInternalServerError)
			logger.Error("Failed to unmarshal stop|resume signal from dashboard UI: %s", err)
			return
		}

		if stop.Stop {
			// stop worker pool
			pool.Stop()
			logger.Info("Stopped worker pool via request from dashboard")
		} else {
			// resume work
			pool.Work()
			logger.Info("Resumed work via request from dashboard")
		}
	})

	mux.HandleFunc("/stats", func(w http.ResponseWriter, req *http.Request) {
		jsonStats, err := json.MarshalIndent(pool.Stats, "", " ")
		if err != nil {
			http.Error(w, "Failed to marshal statistics", http.StatusInternalServerError)
			logger.Error("Failed to marshal stats to send to the dashboard: %s", err)
			return
		}
		w.Header().Add("Content-type", "application/json")
		w.Write(jsonStats)
	})

	mux.HandleFunc("/conf", func(w http.ResponseWriter, req *http.Request) {
		switch req.Method {
		case http.MethodPost:
			var newConfig config.Conf

			defer req.Body.Close()
			newConfigData, err := io.ReadAll(req.Body)
			if err != nil {
				http.Error(w, "Failed to read request body", http.StatusInternalServerError)
				logger.Error("Failed to read new configuration from dashboard request: %s", err)
				return
			}
			err = json.Unmarshal(newConfigData, &newConfig)
			if err != nil {
				http.Error(w, "Failed to unmarshal new configuration", http.StatusInternalServerError)
				logger.Error("Failed to unmarshal new configuration from dashboard UI: %s", err)
				return
			}

			// DO NOT blindly replace global configuration. Manually check and replace values
			webConf.Search.IsRegexp = newConfig.Search.IsRegexp
			if len(newConfig.Search.Query) != 0 {
				webConf.Search.Query = newConfig.Search.Query
			}

			webConf.Logging.OutputLogs = newConfig.Logging.OutputLogs

		default:
			jsonConf, err := json.MarshalIndent(webConf, "", " ")
			if err != nil {
				http.Error(w, "Failed to marshal configuration", http.StatusInternalServerError)
				logger.Error("Failed to marshal current configuration to send to the dashboard UI: %s", err)
				return
			}
			w.Header().Add("Content-type", "application/json")
			w.Write(jsonConf)
		}
	})

	server := &http.Server{
		Addr:    fmt.Sprintf(":%d", port),
		Handler: mux,
	}

	return &Dashboard{
		Server: server,
	}
}

func (board *Dashboard) Launch() error {
	return board.Server.ListenAndServe()
}
```
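Any HTTP client can drive these endpoints. A self-contained sketch that pauses the crawl via `/stop`, changes the query via `/conf` (the same JSON shapes the dashboard UI sends), then reads the configuration back; the port `13370` is taken from the example configuration above:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

func main() {
	// pause the crawl; sending {"stop": false} resumes it
	resp, err := http.Post(
		"http://localhost:13370/stop",
		"application/json",
		strings.NewReader(`{"stop": true}`),
	)
	if err != nil {
		fmt.Println(err)
		return
	}
	resp.Body.Close()

	// change the search query at runtime, just like the dashboard UI does
	resp, err = http.Post(
		"http://localhost:13370/conf",
		"application/json",
		strings.NewReader(`{"search": {"is_regexp": false, "query": "new query"}}`),
	)
	if err != nil {
		fmt.Println(err)
		return
	}
	resp.Body.Close()

	// read the current configuration back
	resp, err = http.Get("http://localhost:13370/conf")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer resp.Body.Close()
	conf, _ := io.ReadAll(resp.Body)
	fmt.Println(string(conf))
}
```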
@@ -0,0 +1,223 @@

```html
<!DOCTYPE html>
<html lang="en">

<head>
    <meta charset="utf-8">
    <title>Wecr dashboard</title>
    <!-- <link rel="icon" href="/static/icon.png"> -->
    <link rel="stylesheet" href="/static/bootstrap.css">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
</head>

<body class="d-flex flex-column h-100">
    <div class="container">
        <header class="d-flex flex-wrap justify-content-center py-3 mb-4 border-bottom">
            <a href="/" class="d-flex align-items-center mb-3 mb-md-0 me-md-auto text-dark text-decoration-none">
                <svg class="bi me-2" width="40" height="32">
                    <use xlink:href="#bootstrap"></use>
                </svg>
                <strong class="fs-4">Wecr</strong>
            </a>

            <ul class="nav nav-pills">
                <li class="nav-item"><a href="/stats" class="nav-link">Stats</a></li>
                <li class="nav-item"><a href="/conf" class="nav-link">Config</a></li>
            </ul>
        </header>
    </div>

    <div class="container">
        <h1>Dashboard</h1>

        <div style="height: 3rem;"></div>

        <div class="container">
            <h2>Statistics</h2>
            <div id="statistics">
                <ol class="list-group list-group-numbered">
                    <li class="list-group-item d-flex justify-content-between align-items-start">
                        <div class="ms-2 me-auto">
                            <div class="fw-bold">Pages visited</div>
                        </div>
                        <span class="badge bg-primary rounded-pill" id="pages_visited">0</span>
                    </li>
                    <li class="list-group-item d-flex justify-content-between align-items-start">
                        <div class="ms-2 me-auto">
                            <div class="fw-bold">Matches found</div>
                        </div>
                        <span class="badge bg-primary rounded-pill" id="matches_found">0</span>
                    </li>
                    <li class="list-group-item d-flex justify-content-between align-items-start">
                        <div class="ms-2 me-auto">
                            <div class="fw-bold">Pages saved</div>
                        </div>
                        <span class="badge bg-primary rounded-pill" id="pages_saved">0</span>
                    </li>
                    <li class="list-group-item d-flex justify-content-between align-items-start">
                        <div class="ms-2 me-auto">
                            <div class="fw-bold">Start time</div>
                        </div>
                        <span class="badge bg-primary rounded-pill" id="start_time_unix">0</span>
                    </li>
                    <li class="list-group-item d-flex justify-content-between align-items-start">
                        <div class="ms-2 me-auto">
                            <div class="fw-bold">Stopped</div>
                        </div>
                        <span class="badge bg-primary rounded-pill" id="stopped">false</span>
                    </li>
                </ol>
            </div>

            <button class="btn btn-primary" id="btn_stop">Stop</button>
            <button class="btn btn-primary" id="btn_resume" disabled>Resume</button>
        </div>

        <div style="height: 3rem;"></div>

        <div class="container">
            <h2>Configuration</h2>
            <div>
                <b>Make runtime changes to configuration</b>
                <table class="table table-borderless">
                    <tr>
                        <th>Key</th>
                        <th>Value</th>
                    </tr>
                    <tr>
                        <th>Query</th>
                        <th>
                            <input type="text" id="conf_query">
                        </th>
                    </tr>
                    <tr>
                        <th>Is regexp</th>
                        <th>
                            <input type="text" id="conf_is_regexp">
                        </th>
                    </tr>
                </table>
                <button class="btn btn-primary" id="config_apply_button">
                    Apply
                </button>
            </div>

            <div style="height: 3rem;"></div>

            <pre id="conf_output"></pre>
        </div>
    </div>
</body>

<script>
    window.onload = function () {
        let confOutput = document.getElementById("conf_output");
        let pagesVisitedOut = document.getElementById("pages_visited");
        let matchesFoundOut = document.getElementById("matches_found");
        let pagesSavedOut = document.getElementById("pages_saved");
        let startTimeOut = document.getElementById("start_time_unix");
        let stoppedOut = document.getElementById("stopped");
        let applyConfButton = document.getElementById("config_apply_button");
        let confQuery = document.getElementById("conf_query");
        let confIsRegexp = document.getElementById("conf_is_regexp");
        let buttonStop = document.getElementById("btn_stop");
        let buttonResume = document.getElementById("btn_resume");

        buttonStop.addEventListener("click", (event) => {
            buttonStop.disabled = true;
            buttonResume.disabled = false;

            // stop worker pool
            let signal = {
                "stop": true,
            };

            fetch("/stop", {
                method: "POST",
                headers: {
                    "Content-type": "application/json",
                },
                body: JSON.stringify(signal),
            });
        });

        buttonResume.addEventListener("click", (event) => {
            buttonResume.disabled = true;
            buttonStop.disabled = false;

            // resume worker pool's work
            let signal = {
                "stop": false,
            };

            fetch("/stop", {
                method: "POST",
                headers: {
                    "Content-type": "application/json",
                },
                body: JSON.stringify(signal),
            });
        });

        applyConfButton.addEventListener("click", (event) => {
            let query = String(confQuery.value);

            // accept both "0"|"1" and "false"|"true" as boolean input
            let isRegexp = false;
            if (confIsRegexp.value === "1" || confIsRegexp.value === "true") {
                isRegexp = true;
            }

            let newConf = {
                "search": {
                    "is_regexp": isRegexp,
                    "query": query,
                },
            };

            fetch("/conf", {
                method: "POST",
                headers: {
                    "Content-type": "application/json",
                },
                body: JSON.stringify(newConf),
            });
        });

        const interval = setInterval(function () {
            // update statistics
            fetch("/stats")
                .then((response) => response.json())
                .then((statistics) => {
                    pagesVisitedOut.innerText = statistics.pages_visited;
                    matchesFoundOut.innerText = statistics.matches_found;
                    pagesSavedOut.innerText = statistics.pages_saved;
                    startTimeOut.innerText = new Date(1000 * statistics.start_time_unix);
                    stoppedOut.innerText = statistics.stopped;
                });

            // update config
            fetch("/conf")
                .then((response) => response.text())
                .then((config) => {
                    // "print" whole configuration
                    confOutput.innerText = config;

                    // update values in the change table if they're empty
                    let confJSON = JSON.parse(config);
                    if (confQuery.value == "") {
                        confQuery.value = confJSON.search.query;
                    }
                    if (confIsRegexp.value == "") {
                        confIsRegexp.value = confJSON.search.is_regexp;
                    }
                });
        }, 650);
    };
</script>

</html>
```
File diff suppressed because it is too large

File diff suppressed because one or more lines are too long
@@ -0,0 +1,54 @@

```go
package queue

import (
	"encoding/json"
	"io"
	"os"
	"unbewohnte/wecr/web"
)

// PopLastJob scans the queue file backwards from the end, decodes the
// last complete job entry, truncates it off the file and returns it.
func PopLastJob(queue *os.File) (*web.Job, error) {
	stats, err := queue.Stat()
	if err != nil {
		return nil, err
	}

	if stats.Size() == 0 {
		return nil, nil
	}

	// find the last job in the queue
	var job web.Job
	var offset int64 = -1
	for {
		currentOffset, err := queue.Seek(offset, io.SeekEnd)
		if err != nil {
			return nil, err
		}

		decoder := json.NewDecoder(queue)
		err = decoder.Decode(&job)
		if err != nil || job.URL == "" || job.Search.Query == "" {
			// not at the start of a complete JSON object yet;
			// step one byte further back and try again
			offset -= 1
			continue
		}

		queue.Truncate(currentOffset)
		return &job, nil
	}
}

// InsertNewJob appends a new job entry to the end of the queue file.
func InsertNewJob(queue *os.File, newJob web.Job) error {
	_, err := queue.Seek(0, io.SeekEnd)
	if err != nil {
		return err
	}

	encoder := json.NewEncoder(queue)
	err = encoder.Encode(&newJob)
	if err != nil {
		return err
	}

	return nil
}
```
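A minimal usage sketch of the file-backed queue (hypothetical surrounding code; the `queue` import path and the `URL`/`Search.Query` fields of `web.Job` are inferred from the checks in `PopLastJob`):

```go
package main

import (
	"fmt"
	"os"

	"unbewohnte/wecr/queue"
	"unbewohnte/wecr/web"
)

func main() {
	// the queue lives in a plain file, so it survives restarts
	queueFile, err := os.OpenFile("visit_queue", os.O_CREATE|os.O_RDWR, 0644)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	defer queueFile.Close()

	// push a job; PopLastJob skips entries with an empty URL
	// or query, so both fields must be set
	var job web.Job
	job.URL = "https://en.wikipedia.org/wiki/Main_Page"
	job.Search.Query = "some query"
	if err := queue.InsertNewJob(queueFile, job); err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}

	// ...and pop it back off the end of the file
	popped, err := queue.PopLastJob(queueFile)
	if err != nil || popped == nil {
		return
	}
	fmt.Println("popped:", popped.URL)
}
```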
@@ -0,0 +1,45 @@

```go
/*
	Wecr - crawl the web for data
	Copyright (C) 2023 Kasyanov Nikolay Alexeyevich (Unbewohnte)

	This program is free software: you can redistribute it and/or modify
	it under the terms of the GNU Affero General Public License as published by
	the Free Software Foundation, either version 3 of the License, or
	(at your option) any later version.

	This program is distributed in the hope that it will be useful,
	but WITHOUT ANY WARRANTY; without even the implied warranty of
	MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
	GNU Affero General Public License for more details.

	You should have received a copy of the GNU Affero General Public License
	along with this program. If not, see <https://www.gnu.org/licenses/>.
*/

package web

import (
	"net/url"
)

// Tries to find docs' URLs on the page
func FindPageDocuments(pageBody []byte, from url.URL) []url.URL {
	var urls []url.URL

	// for every element that has "src" attribute
	for _, link := range FindPageSrcLinks(pageBody, from) {
		if HasDocumentExtention(link.EscapedPath()) {
			urls = append(urls, link)
		}
	}

	// for every "a" element as well
	for _, link := range FindPageLinks(pageBody, from) {
		if HasDocumentExtention(link.EscapedPath()) {
			urls = append(urls, link)
		}
	}

	// return discovered doc urls
	return urls
}
```
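`HasDocumentExtention` (and the `FindPageSrcLinks`/`FindPageLinks` helpers) are referenced here but not part of this diff. A purely hypothetical sketch of what the extension check might look like; the actual list of document extensions in wecr may differ:

```go
package main

import (
	"fmt"
	"strings"
)

// hasDocumentExtention is a hypothetical stand-in for the
// HasDocumentExtention helper referenced above; the extension
// list is an assumption, not wecr's actual one
func hasDocumentExtention(path string) bool {
	documentExtentions := []string{
		".pdf", ".doc", ".docx", ".odt", ".rtf", ".txt",
		".xls", ".xlsx", ".ppt", ".pptx",
	}

	lowered := strings.ToLower(path)
	for _, extention := range documentExtentions {
		if strings.HasSuffix(lowered, extention) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(hasDocumentExtention("/files/report.pdf")) // true
	fmt.Println(hasDocumentExtention("/images/photo.png")) // false
}
```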