Compare commits
25 Commits
26 changed files with 13560 additions and 346 deletions
@@ -1,33 +1,91 @@
-# Wecr - simple web crawler
+# Wecr - versatile WEb CRawler
 
 ## Overview
 
-Just a simple HTML web spider with minimal dependencies. It is possible to search for pages with a text on them or for the text itself, extract images and save pages that satisfy the criteria along the way.
+A simple HTML web spider with no dependencies. It is possible to search for pages containing given text or for the text itself, extract images, video and audio, and save pages that satisfy the criteria along the way.
 
-## Configuration
+## Configuration Overview
 
-The flow of work fully depends on the configuration file. By default `conf.json` is used as a configuration file, but the name can be changed via the `-conf` flag. The default configuration is embedded in the program, so on the first launch, or after simply deleting the file, a new `conf.json` will be created in the same directory as the executable itself unless the `wDir` (working directory) flag is set to some other value.
+The flow of work fully depends on the configuration file. By default `conf.json` is used as a configuration file, but the name can be changed via the `-conf` flag. The default configuration is embedded in the program, so on the first launch, or after simply deleting the file, a new `conf.json` will be created in the working directory unless the `-wdir` (working directory) flag points elsewhere. To see all available flags run `wecr -h`.
 
 The configuration is split into different branches like `requests` (how requests are made, i.e. request timeout, wait time, user agent), `logging` (use logs, output to a file), `save` (output file|directory, save pages or not) or `search` (use regexp, query string), each of which contains tweakable parameters. There are global ones as well, such as `workers` (working threads that make requests in parallel) and `depth` (literally, how deep the recursive search should go). The names are simple and self-explanatory, so no attribute-by-attribute explanation is needed for most of them.
 
-The parsing starts from `initial_pages` and goes deeper while ignoring the pages on domains that are in `blacklisted_domains`. If all initial pages happen to be blacklisted - the program will end.
+The parsing starts from `initial_pages` and goes deeper while ignoring pages on domains that are in `blacklisted_domains` or are NOT in `allowed_domains`. If all initial pages happen to be on blacklisted domains, or none of them are in the allowed list, the program will get stuck. It is important to note that `*_domains` should be specified with an existing scheme (e.g. https://en.wikipedia.org). Subdomains and ports **matter**: `https://unbewohnte.su:3000/` and `https://unbewohnte.su/` are **different**.
 
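The scheme/subdomain/port rule above can be illustrated with a short sketch. This is not Wecr's actual matching code; it only shows why two URLs that differ in port (or subdomain) count as different domains when `net/url` parsing is used, since `URL.Host` includes the port.

```go
package main

import (
	"fmt"
	"net/url"
)

// sameDomain reports whether two page URLs share scheme and host:port --
// the granularity at which *_domains entries are distinguished.
func sameDomain(a, b string) bool {
	ua, err := url.Parse(a)
	if err != nil {
		return false
	}
	ub, err := url.Parse(b)
	if err != nil {
		return false
	}
	// URL.Host keeps the port, so "unbewohnte.su:3000" != "unbewohnte.su"
	return ua.Scheme == ub.Scheme && ua.Host == ub.Host
}

func main() {
	fmt.Println(sameDomain("https://unbewohnte.su:3000/", "https://unbewohnte.su/")) // false: ports differ
	fmt.Println(sameDomain("https://en.wikipedia.org/wiki/Main_Page", "https://en.wikipedia.org/")) // true
}
```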
+Previous versions stored the entire visit queue in memory, resulting in gigabytes of memory usage, but as of `v0.2.4` it is possible to offload the queue to persistent storage via the `in_memory_visit_queue` option (`false` by default).
 
+You can change the search `query` at **runtime** via the web dashboard if `launch_dashboard` is set to `true`.
 
 ### Search query
 
-if `is_regexp` is `false`, then `query` is the text to be searched for, but there are some special values:
+There are some special `query` values to control the flow of work:
 
+- `email` - tells wecr to scrape email addresses and output them to `output_file`
+- `images` - find all images on pages and output them to the corresponding directory in `output_dir` (**IMPORTANT**: set `content_fetch_timeout_ms` to `0` so the images (and other content below) load fully)
+- `videos` - find and fetch files that look like videos
+- `audio` - find and fetch files that look like audio
+- `documents` - find and fetch files that look like documents
+- `everything` - find and fetch images, audio, video, documents and email addresses
+- `archive` - no text to be searched, save every visited page
 
+When `is_regexp` is enabled, the `query` is treated as a regexp string (in Go "flavor") and pages will be scanned for matches that satisfy it.
 
+### Data Output
 
+If the query is not one of the special values, all text matches will be written to the `found_text.json` file as separate continuous JSON objects in `output_dir`; if `save_pages` is set to `true` and|or `query` is set to `images`, `videos`, `audio`, etc., the additional contents will also be put in the corresponding directories inside `output_dir`, which is created in the working directory (or in the directory given by the `-wdir` flag, if set). If `output_dir` happens to be empty, contents will be written directly to the working directory.
 
+The output almost certainly contains some duplicates and is not easy to work with programmatically, so you can use `-extractData` with the output JSON file as an argument (like `found_text.json`, which is the default output file name for simple text searches) to extract the actual data, filter out the duplicates and put each entry on its own line in a new text file.
 
+## Build
 
-- `links` - tells `webscrape` to search for all links there are on the page
+If you're on *nix - it's as easy as `make`.
-- `images` - find all image links and output to the `output_dir` (**IMPORTANT**: set `wait_timeout_ms` to `0` so the images load fully)
 
-When `is_regexp` is enabled, the `query` is treated as a regexp string and pages will be scanned for matches that satisfy it.
+Otherwise - `go build` in the `src` directory to build `wecr`. No dependencies.
 
-### Output
+## Examples
 
-By default, if the query is not `images` all the matches and other data will be outputted to `output.json` file as separate continuous JSON objects, but if `save_pages` is set to `true` and|or `query` is set to `images` - the additional contents will be put in the `output_dir` directory neatly created by the executable's side.
+See [a page on my website](https://unbewohnte.su/wecr) for some basic examples.
 
-## TODO
+Dump of a basic configuration:
 
-- **PARSE HTML WITH REGEXP (_EVIL LAUGH_)**
+```json
+{
+    "search": {
+        "is_regexp": true,
+        "query": "(sequence to search)|(other sequence)"
+    },
+    "requests": {
+        "request_wait_timeout_ms": 2500,
+        "request_pause_ms": 100,
+        "content_fetch_timeout_ms": 0,
+        "user_agent": ""
+    },
+    "depth": 90,
+    "workers": 30,
+    "initial_pages": [
+        "https://en.wikipedia.org/wiki/Main_Page"
+    ],
+    "allowed_domains": [
+        "https://en.wikipedia.org/"
+    ],
+    "blacklisted_domains": [
+        ""
+    ],
+    "in_memory_visit_queue": false,
+    "web_dashboard": {
+        "launch_dashboard": true,
+        "port": 13370
+    },
+    "save": {
+        "output_dir": "scraped",
+        "save_pages": false
+    },
+    "logging": {
+        "output_logs": true,
+        "logs_file": "logs.log"
+    }
+}
+```
 
 ## License
 
-AGPLv3
+Wecr is distributed under the AGPLv3 license.
@@ -0,0 +1,161 @@
/*
	Wecr - crawl the web for data
	Copyright (C) 2023 Kasyanov Nikolay Alexeyevich (Unbewohnte)

	This program is free software: you can redistribute it and/or modify
	it under the terms of the GNU Affero General Public License as published by
	the Free Software Foundation, either version 3 of the License, or
	(at your option) any later version.

	This program is distributed in the hope that it will be useful,
	but WITHOUT ANY WARRANTY; without even the implied warranty of
	MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
	GNU Affero General Public License for more details.

	You should have received a copy of the GNU Affero General Public License
	along with this program. If not, see <https://www.gnu.org/licenses/>.
*/

package dashboard

import (
	"embed"
	"encoding/json"
	"fmt"
	"html/template"
	"io"
	"io/fs"
	"net/http"
	"unbewohnte/wecr/config"
	"unbewohnte/wecr/logger"
	"unbewohnte/wecr/worker"
)

type Dashboard struct {
	Server *http.Server
}

//go:embed res
var resFS embed.FS

type PageData struct {
	Conf  config.Conf
	Stats worker.Statistics
}

type PoolStop struct {
	Stop bool `json:"stop"`
}

func NewDashboard(port uint16, webConf *config.Conf, pool *worker.Pool) *Dashboard {
	mux := http.NewServeMux()
	res, err := fs.Sub(resFS, "res")
	if err != nil {
		logger.Error("Failed to Sub embedded dashboard FS: %s", err)
		return nil
	}

	mux.Handle("/static/", http.FileServer(http.FS(res)))

	mux.HandleFunc("/", func(w http.ResponseWriter, req *http.Request) {
		template, err := template.ParseFS(res, "*.html")
		if err != nil {
			logger.Error("Failed to parse embedded dashboard FS: %s", err)
			return
		}
		template.ExecuteTemplate(w, "index.html", nil)
	})

	mux.HandleFunc("/stop", func(w http.ResponseWriter, req *http.Request) {
		var stop PoolStop

		requestBody, err := io.ReadAll(req.Body)
		if err != nil {
			http.Error(w, "Failed to read request body", http.StatusInternalServerError)
			logger.Error("Failed to read stop|resume signal from dashboard request: %s", err)
			return
		}
		defer req.Body.Close()

		err = json.Unmarshal(requestBody, &stop)
		if err != nil {
			http.Error(w, "Failed to unmarshal stop|resume signal", http.StatusInternalServerError)
			logger.Error("Failed to unmarshal stop|resume signal from dashboard UI: %s", err)
			return
		}

		if stop.Stop {
			// stop worker pool
			pool.Stop()
			logger.Info("Stopped worker pool via request from dashboard")
		} else {
			// resume work
			pool.Work()
			logger.Info("Resumed work via request from dashboard")
		}
	})

	mux.HandleFunc("/stats", func(w http.ResponseWriter, req *http.Request) {
		jsonStats, err := json.MarshalIndent(pool.Stats, "", " ")
		if err != nil {
			http.Error(w, "Failed to marshal statistics", http.StatusInternalServerError)
			logger.Error("Failed to marshal stats to send to the dashboard: %s", err)
			return
		}
		w.Header().Add("Content-type", "application/json")
		w.Write(jsonStats)
	})

	mux.HandleFunc("/conf", func(w http.ResponseWriter, req *http.Request) {
		switch req.Method {
		case http.MethodPost:
			var newConfig config.Conf

			defer req.Body.Close()
			newConfigData, err := io.ReadAll(req.Body)
			if err != nil {
				http.Error(w, "Failed to read request body", http.StatusInternalServerError)
				logger.Error("Failed to read new configuration from dashboard request: %s", err)
				return
			}
			err = json.Unmarshal(newConfigData, &newConfig)
			if err != nil {
				http.Error(w, "Failed to unmarshal new configuration", http.StatusInternalServerError)
				logger.Error("Failed to unmarshal new configuration from dashboard UI: %s", err)
				return
			}

			// DO NOT blindly replace global configuration. Manually check and replace values
			webConf.Search.IsRegexp = newConfig.Search.IsRegexp
			if len(newConfig.Search.Query) != 0 {
				webConf.Search.Query = newConfig.Search.Query
			}

			webConf.Logging.OutputLogs = newConfig.Logging.OutputLogs

		default:
			jsonConf, err := json.MarshalIndent(webConf, "", " ")
			if err != nil {
				http.Error(w, "Failed to marshal configuration", http.StatusInternalServerError)
				logger.Error("Failed to marshal current configuration to send to the dashboard UI: %s", err)
				return
			}
			w.Header().Add("Content-type", "application/json")
			w.Write(jsonConf)
		}
	})

	server := &http.Server{
		Addr:    fmt.Sprintf(":%d", port),
		Handler: mux,
	}

	return &Dashboard{
		Server: server,
	}
}

func (board *Dashboard) Launch() error {
	return board.Server.ListenAndServe()
}
@@ -0,0 +1,223 @@
<!DOCTYPE html>
<html lang="en">

<head>
    <meta charset="utf-8">
    <title>Wecr dashboard</title>
    <!-- <link rel="icon" href="/static/icon.png"> -->
    <link rel="stylesheet" href="/static/bootstrap.css">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
</head>

<body class="d-flex flex-column h-100">
    <div class="container">
        <header class="d-flex flex-wrap justify-content-center py-3 mb-4 border-bottom">
            <a href="/" class="d-flex align-items-center mb-3 mb-md-0 me-md-auto text-dark text-decoration-none">
                <svg class="bi me-2" width="40" height="32">
                    <use xlink:href="#bootstrap"></use>
                </svg>
                <strong class="fs-4">Wecr</strong>
            </a>

            <ul class="nav nav-pills">
                <li class="nav-item"><a href="/stats" class="nav-link">Stats</a></li>
                <li class="nav-item"><a href="/conf" class="nav-link">Config</a></li>
            </ul>
        </header>
    </div>

    <div class="container">
        <h1>Dashboard</h1>

        <div style="height: 3rem;"></div>

        <div class="container">
            <h2>Statistics</h2>
            <div id="statistics">
                <ol class="list-group list-group-numbered">
                    <li class="list-group-item d-flex justify-content-between align-items-start">
                        <div class="ms-2 me-auto">
                            <div class="fw-bold">Pages visited</div>
                        </div>
                        <span class="badge bg-primary rounded-pill" id="pages_visited">0</span>
                    </li>
                    <li class="list-group-item d-flex justify-content-between align-items-start">
                        <div class="ms-2 me-auto">
                            <div class="fw-bold">Matches found</div>
                        </div>
                        <span class="badge bg-primary rounded-pill" id="matches_found">0</span>
                    </li>
                    <li class="list-group-item d-flex justify-content-between align-items-start">
                        <div class="ms-2 me-auto">
                            <div class="fw-bold">Pages saved</div>
                        </div>
                        <span class="badge bg-primary rounded-pill" id="pages_saved">0</span>
                    </li>
                    <li class="list-group-item d-flex justify-content-between align-items-start">
                        <div class="ms-2 me-auto">
                            <div class="fw-bold">Start time</div>
                        </div>
                        <span class="badge bg-primary rounded-pill" id="start_time_unix">0</span>
                    </li>
                    <li class="list-group-item d-flex justify-content-between align-items-start">
                        <div class="ms-2 me-auto">
                            <div class="fw-bold">Stopped</div>
                        </div>
                        <span class="badge bg-primary rounded-pill" id="stopped">false</span>
                    </li>
                </ol>
            </div>

            <button class="btn btn-primary" id="btn_stop">Stop</button>
            <button class="btn btn-primary" id="btn_resume" disabled>Resume</button>
        </div>

        <div style="height: 3rem;"></div>

        <div class="container">
            <h2>Configuration</h2>
            <div>
                <b>Make runtime changes to configuration</b>
                <table class="table table-borderless">
                    <tr>
                        <th>Key</th>
                        <th>Value</th>
                    </tr>
                    <tr>
                        <th>Query</th>
                        <th>
                            <input type="text" id="conf_query">
                        </th>
                    </tr>
                    <tr>
                        <th>Is regexp</th>
                        <th>
                            <input type="text" id="conf_is_regexp">
                        </th>
                    </tr>
                </table>
                <button class="btn btn-primary" id="config_apply_button">
                    Apply
                </button>
            </div>

            <div style="height: 3rem;"></div>

            <pre id="conf_output"></pre>
        </div>
    </div>
</body>

<script>
    window.onload = function () {
        let confOutput = document.getElementById("conf_output");
        let pagesVisitedOut = document.getElementById("pages_visited");
        let matchesFoundOut = document.getElementById("matches_found");
        let pagesSavedOut = document.getElementById("pages_saved");
        let startTimeOut = document.getElementById("start_time_unix");
        let stoppedOut = document.getElementById("stopped");
        let applyConfButton = document.getElementById("config_apply_button");
        let confQuery = document.getElementById("conf_query");
        let confIsRegexp = document.getElementById("conf_is_regexp");
        let buttonStop = document.getElementById("btn_stop");
        let buttonResume = document.getElementById("btn_resume");

        buttonStop.addEventListener("click", (event) => {
            buttonStop.disabled = true;
            buttonResume.disabled = false;

            // stop worker pool
            let signal = {
                "stop": true,
            };

            fetch("/stop", {
                method: "POST",
                headers: {
                    "Content-type": "application/json",
                },
                body: JSON.stringify(signal),
            });
        });

        buttonResume.addEventListener("click", (event) => {
            buttonResume.disabled = true;
            buttonStop.disabled = false;

            // resume worker pool's work
            let signal = {
                "stop": false,
            };

            fetch("/stop", {
                method: "POST",
                headers: {
                    "Content-type": "application/json",
                },
                body: JSON.stringify(signal),
            });
        });

        applyConfButton.addEventListener("click", (event) => {
            let query = String(confQuery.value);

            // declared explicitly (it was an implicit global in the original)
            let isRegexp = false;
            if (confIsRegexp.value === "0") {
                isRegexp = false;
            } else if (confIsRegexp.value === "1") {
                isRegexp = true;
            }
            if (confIsRegexp.value === "false") {
                isRegexp = false;
            } else if (confIsRegexp.value === "true") {
                isRegexp = true;
            }

            let newConf = {
                "search": {
                    "is_regexp": isRegexp,
                    "query": query,
                },
            };

            fetch("/conf", {
                method: "POST",
                headers: {
                    "Content-type": "application/json",
                },
                body: JSON.stringify(newConf),
            });
        });

        const interval = setInterval(function () {
            // update statistics
            fetch("/stats")
                .then((response) => response.json())
                .then((statistics) => {
                    pagesVisitedOut.innerText = statistics.pages_visited;
                    matchesFoundOut.innerText = statistics.matches_found;
                    pagesSavedOut.innerText = statistics.pages_saved;
                    startTimeOut.innerText = new Date(1000 * statistics.start_time_unix);
                    stoppedOut.innerText = statistics.stopped;
                });
            // update config
            fetch("/conf")
                .then((response) => response.text())
                .then((config) => {
                    // "print" whole configuration
                    confOutput.innerText = config;

                    // update values in the change table if they're empty
                    let confJSON = JSON.parse(config);
                    if (confQuery.value == "") {
                        confQuery.value = confJSON.search.query;
                    }
                    if (confIsRegexp.value == "") {
                        confIsRegexp.value = confJSON.search.is_regexp;
                    }
                });
        }, 650);
    }();
</script>

</html>
File diff suppressed because it is too large
File diff suppressed because one or more lines are too long
@@ -1,2 +0,0 @@
-golang.org/x/net v0.4.0 h1:Q5QPcMlvfxFTAPV0+07Xz/MpK9NTXu2VDUuy0FeMfaU=
-golang.org/x/net v0.4.0/go.mod h1:MBQ8lrhLObU/6UmLb4fmbmk5OcyYmqtbGd/9yIeKjEE=
@@ -0,0 +1,54 @@
package queue

import (
	"encoding/json"
	"io"
	"os"

	"unbewohnte/wecr/web"
)

func PopLastJob(queue *os.File) (*web.Job, error) {
	stats, err := queue.Stat()
	if err != nil {
		return nil, err
	}

	if stats.Size() == 0 {
		return nil, nil
	}

	// find the last job in the queue
	var job web.Job
	var offset int64 = -1
	for {
		currentOffset, err := queue.Seek(offset, io.SeekEnd)
		if err != nil {
			return nil, err
		}

		decoder := json.NewDecoder(queue)
		err = decoder.Decode(&job)
		if err != nil || job.URL == "" || job.Search.Query == "" {
			offset -= 1
			continue
		}

		queue.Truncate(currentOffset)
		return &job, nil
	}
}

func InsertNewJob(queue *os.File, newJob web.Job) error {
	_, err := queue.Seek(0, io.SeekEnd)
	if err != nil {
		return err
	}

	encoder := json.NewEncoder(queue)
	err = encoder.Encode(&newJob)
	if err != nil {
		return err
	}

	return nil
}
@@ -0,0 +1,78 @@
/*
	Wecr - crawl the web for data
	Copyright (C) 2023 Kasyanov Nikolay Alexeyevich (Unbewohnte)
	(AGPLv3 license header identical to the one in the first file)
*/

package utilities

import (
	"encoding/json"
	"fmt"
	"io"
	"os"

	"unbewohnte/wecr/web"
)

// Extracts data from the output JSON file and puts it in a new file with separators between each entry
func ExtractDataFromOutput(inputFilename string, outputFilename string, separator string, keepDuplicates bool) error {
	inputFile, err := os.Open(inputFilename)
	if err != nil {
		return err
	}
	defer inputFile.Close()

	outputFile, err := os.Create(outputFilename)
	if err != nil {
		return err
	}
	defer outputFile.Close()

	var processedData []string

	decoder := json.NewDecoder(inputFile)
	for {
		var result web.Result

		err := decoder.Decode(&result)
		if err == io.EOF {
			break
		}
		if err != nil {
			return err
		}

		for _, dataEntry := range result.Data {
			var skip = false
			if !keepDuplicates {
				for _, processedEntry := range processedData {
					if dataEntry == processedEntry {
						skip = true
						break
					}
				}

				if skip {
					continue
				}
				processedData = append(processedData, dataEntry)
			}

			outputFile.WriteString(fmt.Sprintf("%s%s", dataEntry, separator))
		}
	}

	return nil
}
@@ -0,0 +1,44 @@
/*
	Wecr - crawl the web for data
	Copyright (C) 2023 Kasyanov Nikolay Alexeyevich (Unbewohnte)
	(AGPLv3 license header identical to the one in the first file)
*/

package web

import (
	"net/url"
)

// Tries to find audio URLs on the page
func FindPageAudio(pageBody []byte, from url.URL) []url.URL {
	var urls []url.URL

	// for every element that has "src" attribute
	for _, link := range FindPageSrcLinks(pageBody, from) {
		if HasAudioExtention(link.EscapedPath()) {
			urls = append(urls, link)
		}
	}

	// for every "a" element as well
	for _, link := range FindPageLinks(pageBody, from) {
		if HasAudioExtention(link.EscapedPath()) {
			urls = append(urls, link)
		}
	}

	return urls
}
@@ -0,0 +1,45 @@
/*
	Wecr - crawl the web for data
	Copyright (C) 2023 Kasyanov Nikolay Alexeyevich (Unbewohnte)
	(AGPLv3 license header identical to the one in the first file)
*/

package web

import (
	"net/url"
)

// Tries to find docs' URLs on the page
func FindPageDocuments(pageBody []byte, from url.URL) []url.URL {
	var urls []url.URL

	// for every element that has "src" attribute
	for _, link := range FindPageSrcLinks(pageBody, from) {
		if HasDocumentExtention(link.EscapedPath()) {
			urls = append(urls, link)
		}
	}

	// for every "a" element as well
	for _, link := range FindPageLinks(pageBody, from) {
		if HasDocumentExtention(link.EscapedPath()) {
			urls = append(urls, link)
		}
	}

	// return discovered doc urls
	return urls
}
@ -0,0 +1,174 @@ |
|||||||
|
/* |
||||||
|
Wecr - crawl the web for data |
||||||
|
Copyright (C) 2023 Kasyanov Nikolay Alexeyevich (Unbewohnte) |
||||||
|
|
||||||
|
This program is free software: you can redistribute it and/or modify |
||||||
|
it under the terms of the GNU Affero General Public License as published by |
||||||
|
the Free Software Foundation, either version 3 of the License, or |
||||||
|
(at your option) any later version. |
||||||
|
|
||||||
|
This program is distributed in the hope that it will be useful, |
||||||
|
but WITHOUT ANY WARRANTY; without even the implied warranty of |
||||||
|
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the |
||||||
|
GNU Affero General Public License for more details. |
||||||
|
|
||||||
|
You should have received a copy of the GNU Affero General Public License |
||||||
|
along with this program. If not, see <https://www.gnu.org/licenses/>.
|
||||||
|
*/ |
||||||
|
|
||||||
|
package web |
||||||
|
|
||||||
|
import "strings" |
||||||
|
|
||||||
|
var AudioExtentions = []string{ |
||||||
|
".3gp", |
||||||
|
".aa", |
||||||
|
".aac", |
||||||
|
".aax", |
||||||
|
".act", |
||||||
|
".aiff", |
||||||
|
".alac", |
||||||
|
".amr", |
||||||
|
".ape", |
||||||
|
".au", |
||||||
|
".flac", |
||||||
|
".m4a", |
||||||
|
".mp3", |
||||||
|
".mpc", |
||||||
|
".msv", |
||||||
|
".ogg", |
||||||
|
".oga", |
||||||
|
".mogg", |
||||||
|
".opus", |
||||||
|
".tta", |
||||||
|
".wav", |
||||||
|
".cda", |
||||||
|
} |
||||||
|
|
||||||
|
var ImageExtentions = []string{
	".jpeg",
	".jpg",
	".jpe",
	".jfif",
	".png",
	".ppm",
	".svg",
	".gif",
	".tiff",
	".bmp",
	".webp",
	".ico",
	".kra",
	".bpg",
	".drw",
	".tga",
}
var VideoExtentions = []string{
	".webm",
	".mkv",
	".flv",
	".wmv",
	".avi",
	".yuv",
	".mp2",
	".mp4",
	".mpeg",
	".mpg",
	".mpv",
	".m4v",
	".3gp",
	".3g2",
	".nsv",
	".vob",
	".ogv",
}
var DocumentExtentions = []string{
	".pdf",
	".doc",
	".docx",
	".epub",
	".fb2",
	".pub",
	".ppt",
	".pptx",
	".txt",
	".tex",
	".odt",
	".bib",
	".ps",
	".dwg",
	".lyx",
	".key",
	".ott",
	".odf",
	".odc",
	".ppg",
	".xlc",
	".latex",
	".c",
	".cpp",
	".sh",
	".go",
	".java",
	".cs",
	".rs",
	".lua",
	".php",
	".py",
	".pl",
	".kt",
	".rb",
	".asm",
	".rar",
	".tar",
	".db",
	".7z",
	".zip",
	".gbr",
	".ttf",
	".ttc",
	".woff",
	".otf",
	".exif",
}
func HasImageExtention(urlPath string) bool {
	for _, extention := range ImageExtentions {
		if strings.HasSuffix(urlPath, extention) {
			return true
		}
	}
	return false
}

func HasDocumentExtention(urlPath string) bool {
	for _, extention := range DocumentExtentions {
		if strings.HasSuffix(urlPath, extention) {
			return true
		}
	}
	return false
}

func HasVideoExtention(urlPath string) bool {
	for _, extention := range VideoExtentions {
		if strings.HasSuffix(urlPath, extention) {
			return true
		}
	}
	return false
}

func HasAudioExtention(urlPath string) bool {
	for _, extention := range AudioExtentions {
		if strings.HasSuffix(urlPath, extention) {
			return true
		}
	}
	return false
}
@ -0,0 +1,44 @@
/*
	Wecr - crawl the web for data
	Copyright (C) 2023 Kasyanov Nikolay Alexeyevich (Unbewohnte)

	This program is free software: you can redistribute it and/or modify
	it under the terms of the GNU Affero General Public License as published by
	the Free Software Foundation, either version 3 of the License, or
	(at your option) any later version.

	This program is distributed in the hope that it will be useful,
	but WITHOUT ANY WARRANTY; without even the implied warranty of
	MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
	GNU Affero General Public License for more details.

	You should have received a copy of the GNU Affero General Public License
	along with this program. If not, see <https://www.gnu.org/licenses/>.
*/

package web

import (
	"net/url"
)

// Tries to find videos' URLs on the page
func FindPageVideos(pageBody []byte, from url.URL) []url.URL {
	var urls []url.URL

	// for every element that has "src" attribute
	for _, link := range FindPageSrcLinks(pageBody, from) {
		if HasVideoExtention(link.EscapedPath()) {
			urls = append(urls, link)
		}
	}

	// for every "a" element as well
	for _, link := range FindPageLinks(pageBody, from) {
		if HasVideoExtention(link.EscapedPath()) {
			urls = append(urls, link)
		}
	}

	return urls
}