Compare commits
No commits in common. 'master' and 'v0.1.2' have entirely different histories.
26 changed files with 384 additions and 13557 deletions
@@ -1,91 +1,33 @@
-# Wecr - versatile WEb CRawler
+# Wecr - simple web crawler

## Overview

-A simple HTML web spider with no dependencies. It is possible to search for pages with a text on them or for the text itself, extract images, video, audio and save pages that satisfy the criteria along the way.
+Just a simple HTML web spider with minimal dependencies. It is possible to search for pages with a text on them or for the text itself, extract images and save pages that satisfy the criteria along the way.

-## Configuration Overview
+## Configuration

-The flow of work fully depends on the configuration file. By default `conf.json` is used as the configuration file, but the name can be changed via the `-conf` flag. The default configuration is embedded in the program, so on the first launch, or after simply deleting the file, a new `conf.json` will be created in the working directory unless the `-wdir` (working directory) flag is set to some other value, in which case that takes precedence. To see all available flags run `wecr -h`.
+The flow of work fully depends on the configuration file. By default `conf.json` is used as the configuration file, but the name can be changed via the `-conf` flag. The default configuration is embedded in the program, so on the first launch, or after simply deleting the file, a new `conf.json` will be created in the same directory as the executable itself unless the `wDir` (working directory) flag is set to some other value.

The configuration is split into different branches like `requests` (how requests are made, i.e. request timeout, wait time, user agent), `logging` (use logs, output to a file), `save` (output file/directory, save pages or not) or `search` (use regexp, query string), each of which contains tweakable parameters. There are global ones as well, such as `workers` (working threads that make requests in parallel) and `depth` (literally, how deep the recursive search should go). The names are simple and self-explanatory, so no attribute-by-attribute explanation is needed for most of them.

-The parsing starts from `initial_pages` and goes deeper while ignoring pages on domains that are in `blacklisted_domains` or are NOT in `allowed_domains`. If all initial pages happen to be on blacklisted domains or are not in the allowed list, the program will get stuck. It is important to note that `*_domains` should be specified with an existing scheme (i.e. https://en.wikipedia.org). Subdomains and ports **matter**: `https://unbewohnte.su:3000/` and `https://unbewohnte.su/` are **different**.
+The parsing starts from `initial_pages` and goes deeper while ignoring pages on domains that are in `blacklisted_domains` or are NOT in `allowed_domains`. If all initial pages happen to be on blacklisted domains or are not in the allowed list, the program will get stuck.

-Previous versions stored the entire visit queue in memory, resulting in gigabytes of memory usage, but as of `v0.2.4` it is possible to offload the queue to persistent storage via the `in_memory_visit_queue` option (`false` by default).
-
-You can change the search `query` at **runtime** via the web dashboard if `launch_dashboard` is set to `true`.
-
### Search query

-There are some special `query` values to control the flow of work:
+If `is_regexp` is `false`, then `query` is the text to be searched for, but there are some special values:

-- `email` - tells wecr to scrape email addresses and output to `output_file`
-- `images` - find all images on pages and output to the corresponding directory in `output_dir` (**IMPORTANT**: set `content_fetch_timeout_ms` to `0` so the images (and other content below) load fully)
-- `videos` - find and fetch files that look like videos
-- `audio` - find and fetch files that look like audio
-- `documents` - find and fetch files that look like a document
-- `everything` - find and fetch images, audio, video, documents and email addresses
-- `archive` - no text to be searched, save every visited page
+- `links` - tells `wecr` to search for all links there are on the page
+- `images` - find all image links and output to the `output_dir` (**IMPORTANT**: set `wait_timeout_ms` to `0` so the images load fully)

-When `is_regexp` is enabled, the `query` is treated as a regexp string (in Go "flavor") and pages will be scanned for matches that satisfy it.
+When `is_regexp` is enabled, the `query` is treated as a regexp string and pages will be scanned for matches that satisfy it.

-### Data Output
+### Output

-If the query is not something of special value, all text matches will be output to the `found_text.json` file as separate continuous JSON objects in `output_dir`; if `save_pages` is set to `true` and/or `query` is set to `images`, `videos`, `audio`, etc., the additional contents will also be put in the corresponding directories inside `output_dir`, which is created in the working directory or, if the `-wdir` flag is set, there. If `output_dir` happens to be empty, contents will be output directly to the working directory.
+By default, if the query is not `images`, all the matches and other data will be output to the `output.json` file as separate continuous JSON objects, but if `save_pages` is set to `true` and/or `query` is set to `images`, the additional contents will be put in the `output_dir` directory neatly created next to the executable.

-The output almost certainly contains some duplicates and is not easy to work with programmatically, so you can use `-extractData` with the output JSON file argument (like `found_text.json`, which is the default output file name for simple text searches) to extract the actual data, filter out the duplicates and put each entry on its own line in a new text file.
-
-## Build
-
-If you're on *nix - it's as easy as `make`.
-
-Otherwise - `go build` in the `src` directory to build `wecr`. No dependencies.
-
-## Examples
-
-See [a page on my website](https://unbewohnte.su/wecr) for some basic examples.
-
-Dump of a basic configuration:
-
-```json
-{
-	"search": {
-		"is_regexp": true,
-		"query": "(sequence to search)|(other sequence)"
-	},
-	"requests": {
-		"request_wait_timeout_ms": 2500,
-		"request_pause_ms": 100,
-		"content_fetch_timeout_ms": 0,
-		"user_agent": ""
-	},
-	"depth": 90,
-	"workers": 30,
-	"initial_pages": [
-		"https://en.wikipedia.org/wiki/Main_Page"
-	],
-	"allowed_domains": [
-		"https://en.wikipedia.org/"
-	],
-	"blacklisted_domains": [
-		""
-	],
-	"in_memory_visit_queue": false,
-	"web_dashboard": {
-		"launch_dashboard": true,
-		"port": 13370
-	},
-	"save": {
-		"output_dir": "scraped",
-		"save_pages": false
-	},
-	"logging": {
-		"output_logs": true,
-		"logs_file": "logs.log"
-	}
-}
-```
+## TODO
+
+- **PARSE HTML WITH REGEXP (_EVIL LAUGH_)**

## License

-wecr is distributed under the AGPLv3 license
+AGPLv3
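The regexp-based `query` mode described in the README boils down to compiling the query and collecting every match on a page body. A minimal standalone sketch of that idea (the `findMatches` helper is illustrative, not Wecr's actual internals):

```go
package main

import (
	"fmt"
	"regexp"
)

// findMatches scans a page body for every substring matching the
// compiled query, mirroring what an is_regexp search might do.
func findMatches(pageBody []byte, query string) ([]string, error) {
	re, err := regexp.Compile(query)
	if err != nil {
		return nil, err
	}
	var matches []string
	for _, m := range re.FindAll(pageBody, -1) {
		matches = append(matches, string(m))
	}
	return matches, nil
}

func main() {
	body := []byte("contact us at mail@example.org or admin@example.org")
	matches, err := findMatches(body, `[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+`)
	if err != nil {
		panic(err)
	}
	fmt.Println(matches) // [mail@example.org admin@example.org]
}
```

Since Go's `regexp` package uses RE2 syntax, alternations like `(sequence to search)|(other sequence)` work, but backreferences do not.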
@@ -1,161 +0,0 @@
/*
	Wecr - crawl the web for data
	Copyright (C) 2023 Kasyanov Nikolay Alexeyevich (Unbewohnte)

	This program is free software: you can redistribute it and/or modify
	it under the terms of the GNU Affero General Public License as published by
	the Free Software Foundation, either version 3 of the License, or
	(at your option) any later version.

	This program is distributed in the hope that it will be useful,
	but WITHOUT ANY WARRANTY; without even the implied warranty of
	MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
	GNU Affero General Public License for more details.

	You should have received a copy of the GNU Affero General Public License
	along with this program. If not, see <https://www.gnu.org/licenses/>.
*/

package dashboard

import (
	"embed"
	"encoding/json"
	"fmt"
	"html/template"
	"io"
	"io/fs"
	"net/http"
	"unbewohnte/wecr/config"
	"unbewohnte/wecr/logger"
	"unbewohnte/wecr/worker"
)

type Dashboard struct {
	Server *http.Server
}

//go:embed res
var resFS embed.FS

type PageData struct {
	Conf  config.Conf
	Stats worker.Statistics
}

type PoolStop struct {
	Stop bool `json:"stop"`
}

func NewDashboard(port uint16, webConf *config.Conf, pool *worker.Pool) *Dashboard {
	mux := http.NewServeMux()
	res, err := fs.Sub(resFS, "res")
	if err != nil {
		logger.Error("Failed to Sub embedded dashboard FS: %s", err)
		return nil
	}

	mux.Handle("/static/", http.FileServer(http.FS(res)))

	mux.HandleFunc("/", func(w http.ResponseWriter, req *http.Request) {
		template, err := template.ParseFS(res, "*.html")
		if err != nil {
			logger.Error("Failed to parse embedded dashboard FS: %s", err)
			return
		}

		template.ExecuteTemplate(w, "index.html", nil)
	})

	mux.HandleFunc("/stop", func(w http.ResponseWriter, req *http.Request) {
		var stop PoolStop

		requestBody, err := io.ReadAll(req.Body)
		if err != nil {
			http.Error(w, "Failed to read request body", http.StatusInternalServerError)
			logger.Error("Failed to read stop|resume signal from dashboard request: %s", err)
			return
		}
		defer req.Body.Close()

		err = json.Unmarshal(requestBody, &stop)
		if err != nil {
			http.Error(w, "Failed to unmarshal stop|resume signal", http.StatusInternalServerError)
			logger.Error("Failed to unmarshal stop|resume signal from dashboard UI: %s", err)
			return
		}

		if stop.Stop {
			// stop worker pool
			pool.Stop()
			logger.Info("Stopped worker pool via request from dashboard")
		} else {
			// resume work
			pool.Work()
			logger.Info("Resumed work via request from dashboard")
		}
	})

	mux.HandleFunc("/stats", func(w http.ResponseWriter, req *http.Request) {
		jsonStats, err := json.MarshalIndent(pool.Stats, "", " ")
		if err != nil {
			http.Error(w, "Failed to marshal statistics", http.StatusInternalServerError)
			logger.Error("Failed to marshal stats to send to the dashboard: %s", err)
			return
		}
		w.Header().Add("Content-type", "application/json")
		w.Write(jsonStats)
	})

	mux.HandleFunc("/conf", func(w http.ResponseWriter, req *http.Request) {
		switch req.Method {
		case http.MethodPost:
			var newConfig config.Conf

			defer req.Body.Close()
			newConfigData, err := io.ReadAll(req.Body)
			if err != nil {
				http.Error(w, "Failed to read request body", http.StatusInternalServerError)
				logger.Error("Failed to read new configuration from dashboard request: %s", err)
				return
			}
			err = json.Unmarshal(newConfigData, &newConfig)
			if err != nil {
				http.Error(w, "Failed to unmarshal new configuration", http.StatusInternalServerError)
				logger.Error("Failed to unmarshal new configuration from dashboard UI: %s", err)
				return
			}

			// DO NOT blindly replace global configuration. Manually check and replace values
			webConf.Search.IsRegexp = newConfig.Search.IsRegexp
			if len(newConfig.Search.Query) != 0 {
				webConf.Search.Query = newConfig.Search.Query
			}

			webConf.Logging.OutputLogs = newConfig.Logging.OutputLogs

		default:
			jsonConf, err := json.MarshalIndent(webConf, "", " ")
			if err != nil {
				http.Error(w, "Failed to marshal configuration", http.StatusInternalServerError)
				logger.Error("Failed to marshal current configuration to send to the dashboard UI: %s", err)
				return
			}
			w.Header().Add("Content-type", "application/json")
			w.Write(jsonConf)
		}
	})

	server := &http.Server{
		Addr:    fmt.Sprintf(":%d", port),
		Handler: mux,
	}

	return &Dashboard{
		Server: server,
	}
}

func (board *Dashboard) Launch() error {
	return board.Server.ListenAndServe()
}
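The `/conf` handler above deliberately merges only whitelisted fields from the posted JSON instead of overwriting the whole configuration. That selective-merge pattern can be sketched standalone; the `Conf`/`Search` shapes here are simplified stand-ins for `config.Conf`, not its real definition:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Search mirrors the shape of the "search" branch of the config.
type Search struct {
	IsRegexp bool   `json:"is_regexp"`
	Query    string `json:"query"`
}

// Conf is a simplified stand-in for the full configuration.
type Conf struct {
	Search Search `json:"search"`
}

// applyRuntimeUpdate copies only the fields the dashboard is allowed
// to change, ignoring everything else in the posted JSON. An empty
// query is treated as "leave the current query alone".
func applyRuntimeUpdate(current *Conf, body []byte) error {
	var incoming Conf
	if err := json.Unmarshal(body, &incoming); err != nil {
		return err
	}
	current.Search.IsRegexp = incoming.Search.IsRegexp
	if incoming.Search.Query != "" {
		current.Search.Query = incoming.Search.Query
	}
	return nil
}

func main() {
	conf := Conf{Search: Search{Query: "old"}}
	_ = applyRuntimeUpdate(&conf, []byte(`{"search":{"is_regexp":true,"query":""}}`))
	fmt.Println(conf.Search.Query, conf.Search.IsRegexp) // old true
}
```

The design choice matters because a blind `json.Unmarshal` into the live config would zero out every field the dashboard form did not send.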
@@ -1,223 +0,0 @@
<!DOCTYPE html>
<html lang="en">

<head>
    <meta charset="utf-8">
    <title>Wecr dashboard</title>
    <!-- <link rel="icon" href="/static/icon.png"> -->
    <link rel="stylesheet" href="/static/bootstrap.css">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
</head>

<body class="d-flex flex-column h-100">
    <div class="container">
        <header class="d-flex flex-wrap justify-content-center py-3 mb-4 border-bottom">
            <a href="/" class="d-flex align-items-center mb-3 mb-md-0 me-md-auto text-dark text-decoration-none">
                <svg class="bi me-2" width="40" height="32">
                    <use xlink:href="#bootstrap"></use>
                </svg>
                <strong class="fs-4">Wecr</strong>
            </a>

            <ul class="nav nav-pills">
                <li class="nav-item"><a href="/stats" class="nav-link">Stats</a></li>
                <li class="nav-item"><a href="/conf" class="nav-link">Config</a></li>
            </ul>
        </header>
    </div>

    <div class="container">
        <h1>Dashboard</h1>

        <div style="height: 3rem;"></div>

        <div class="container">
            <h2>Statistics</h2>
            <div id="statistics">
                <ol class="list-group list-group-numbered">
                    <li class="list-group-item d-flex justify-content-between align-items-start">
                        <div class="ms-2 me-auto">
                            <div class="fw-bold">Pages visited</div>
                        </div>
                        <span class="badge bg-primary rounded-pill" id="pages_visited">0</span>
                    </li>
                    <li class="list-group-item d-flex justify-content-between align-items-start">
                        <div class="ms-2 me-auto">
                            <div class="fw-bold">Matches found</div>
                        </div>
                        <span class="badge bg-primary rounded-pill" id="matches_found">0</span>
                    </li>
                    <li class="list-group-item d-flex justify-content-between align-items-start">
                        <div class="ms-2 me-auto">
                            <div class="fw-bold">Pages saved</div>
                        </div>
                        <span class="badge bg-primary rounded-pill" id="pages_saved">0</span>
                    </li>
                    <li class="list-group-item d-flex justify-content-between align-items-start">
                        <div class="ms-2 me-auto">
                            <div class="fw-bold">Start time</div>
                        </div>
                        <span class="badge bg-primary rounded-pill" id="start_time_unix">0</span>
                    </li>
                    <li class="list-group-item d-flex justify-content-between align-items-start">
                        <div class="ms-2 me-auto">
                            <div class="fw-bold">Stopped</div>
                        </div>
                        <span class="badge bg-primary rounded-pill" id="stopped">false</span>
                    </li>
                </ol>
            </div>

            <button class="btn btn-primary" id="btn_stop">Stop</button>
            <button class="btn btn-primary" id="btn_resume" disabled>Resume</button>
        </div>

        <div style="height: 3rem;"></div>

        <div class="container">
            <h2>Configuration</h2>
            <div>
                <b>Make runtime changes to configuration</b>
                <table class="table table-borderless">
                    <tr>
                        <th>Key</th>
                        <th>Value</th>
                    </tr>
                    <tr>
                        <th>Query</th>
                        <th>
                            <input type="text" id="conf_query">
                        </th>
                    </tr>
                    <tr>
                        <th>Is regexp</th>
                        <th>
                            <input type="text" id="conf_is_regexp">
                        </th>
                    </tr>
                </table>
                <button class="btn btn-primary" id="config_apply_button">
                    Apply
                </button>
            </div>

            <div style="height: 3rem;"></div>

            <pre id="conf_output"></pre>
        </div>
    </div>
</body>

<script>
    window.onload = function () {
        let confOutput = document.getElementById("conf_output");
        let pagesVisitedOut = document.getElementById("pages_visited");
        let matchesFoundOut = document.getElementById("matches_found");
        let pagesSavedOut = document.getElementById("pages_saved");
        let startTimeOut = document.getElementById("start_time_unix");
        let stoppedOut = document.getElementById("stopped");
        let applyConfButton = document.getElementById("config_apply_button");
        let confQuery = document.getElementById("conf_query");
        let confIsRegexp = document.getElementById("conf_is_regexp");
        let buttonStop = document.getElementById("btn_stop");
        let buttonResume = document.getElementById("btn_resume");

        buttonStop.addEventListener("click", (event) => {
            buttonStop.disabled = true;
            buttonResume.disabled = false;

            // stop worker pool
            let signal = {
                "stop": true,
            };

            fetch("/stop", {
                method: "POST",
                headers: {
                    "Content-type": "application/json",
                },
                body: JSON.stringify(signal),
            });
        });

        buttonResume.addEventListener("click", (event) => {
            buttonResume.disabled = true;
            buttonStop.disabled = false;

            // resume worker pool's work
            let signal = {
                "stop": false,
            };

            fetch("/stop", {
                method: "POST",
                headers: {
                    "Content-type": "application/json",
                },
                body: JSON.stringify(signal),
            });
        });

        applyConfButton.addEventListener("click", (event) => {
            let query = String(confQuery.value);

            let isRegexp = false;
            if (confIsRegexp.value === "0") {
                isRegexp = false;
            } else if (confIsRegexp.value === "1") {
                isRegexp = true;
            };
            if (confIsRegexp.value === "false") {
                isRegexp = false;
            } else if (confIsRegexp.value === "true") {
                isRegexp = true;
            };

            let newConf = {
                "search": {
                    "is_regexp": isRegexp,
                    "query": query,
                },
            };

            fetch("/conf", {
                method: "POST",
                headers: {
                    "Content-type": "application/json",
                },
                body: JSON.stringify(newConf),
            });
        });

        const interval = setInterval(function () {
            // update statistics
            fetch("/stats")
                .then((response) => response.json())
                .then((statistics) => {
                    pagesVisitedOut.innerText = statistics.pages_visited;
                    matchesFoundOut.innerText = statistics.matches_found;
                    pagesSavedOut.innerText = statistics.pages_saved;
                    startTimeOut.innerText = new Date(1000 * statistics.start_time_unix);
                    stoppedOut.innerText = statistics.stopped;
                });
            // update config
            fetch("/conf")
                .then((response) => response.text())
                .then((config) => {
                    // "print" whole configuration
                    confOutput.innerText = config;

                    // update values in the change table if they're empty
                    let confJSON = JSON.parse(config);
                    if (confQuery.value == "") {
                        confQuery.value = confJSON.search.query;
                    }
                    if (confIsRegexp.value == "") {
                        confIsRegexp.value = confJSON.search.is_regexp;
                    }
                });
        }, 650);
    }();
</script>

</html>
File diff suppressed because it is too large
File diff suppressed because one or more lines are too long
@@ -1,3 +1,5 @@
 module unbewohnte/wecr
 
 go 1.18
+
+require golang.org/x/net v0.4.0
@@ -0,0 +1,2 @@
+golang.org/x/net v0.4.0 h1:Q5QPcMlvfxFTAPV0+07Xz/MpK9NTXu2VDUuy0FeMfaU=
+golang.org/x/net v0.4.0/go.mod h1:MBQ8lrhLObU/6UmLb4fmbmk5OcyYmqtbGd/9yIeKjEE=
@@ -1,54 +0,0 @@
package queue

import (
	"encoding/json"
	"io"
	"os"
	"unbewohnte/wecr/web"
)

func PopLastJob(queue *os.File) (*web.Job, error) {
	stats, err := queue.Stat()
	if err != nil {
		return nil, err
	}

	if stats.Size() == 0 {
		return nil, nil
	}

	// find the last job in the queue
	var job web.Job
	var offset int64 = -1
	for {
		currentOffset, err := queue.Seek(offset, io.SeekEnd)
		if err != nil {
			return nil, err
		}

		decoder := json.NewDecoder(queue)
		err = decoder.Decode(&job)
		if err != nil || job.URL == "" || job.Search.Query == "" {
			offset -= 1
			continue
		}

		queue.Truncate(currentOffset)
		return &job, nil
	}
}

func InsertNewJob(queue *os.File, newJob web.Job) error {
	_, err := queue.Seek(0, io.SeekEnd)
	if err != nil {
		return err
	}

	encoder := json.NewEncoder(queue)
	err = encoder.Encode(&newJob)
	if err != nil {
		return err
	}

	return nil
}
@@ -1,78 +0,0 @@
/*
	Wecr - crawl the web for data
	Copyright (C) 2023 Kasyanov Nikolay Alexeyevich (Unbewohnte)

	This program is free software: you can redistribute it and/or modify
	it under the terms of the GNU Affero General Public License as published by
	the Free Software Foundation, either version 3 of the License, or
	(at your option) any later version.

	This program is distributed in the hope that it will be useful,
	but WITHOUT ANY WARRANTY; without even the implied warranty of
	MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
	GNU Affero General Public License for more details.

	You should have received a copy of the GNU Affero General Public License
	along with this program. If not, see <https://www.gnu.org/licenses/>.
*/

package utilities

import (
	"encoding/json"
	"fmt"
	"io"
	"os"
	"unbewohnte/wecr/web"
)

// Extracts data from the output JSON file and puts it in a new file with separators between each entry
func ExtractDataFromOutput(inputFilename string, outputFilename string, separator string, keepDuplicates bool) error {
	inputFile, err := os.Open(inputFilename)
	if err != nil {
		return err
	}
	defer inputFile.Close()

	outputFile, err := os.Create(outputFilename)
	if err != nil {
		return err
	}
	defer outputFile.Close()

	var processedData []string

	decoder := json.NewDecoder(inputFile)
	for {
		var result web.Result

		err := decoder.Decode(&result)
		if err == io.EOF {
			break
		}
		if err != nil {
			return err
		}

		for _, dataEntry := range result.Data {
			var skip = false
			if !keepDuplicates {
				for _, processedEntry := range processedData {
					if dataEntry == processedEntry {
						skip = true
						break
					}
				}

				if skip {
					continue
				}
				processedData = append(processedData, dataEntry)
			}

			outputFile.WriteString(fmt.Sprintf("%s%s", dataEntry, separator))
		}
	}

	return nil
}
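The duplicate filter in `ExtractDataFromOutput` rescans the whole `processedData` slice for every entry, which is O(n²) over large outputs; a map-based set performs the same check in roughly constant time per entry. A minimal standalone sketch of the order-preserving deduplication (the `dedup` helper is illustrative):

```go
package main

import "fmt"

// dedup keeps the first occurrence of each entry, preserving order,
// the same effect as running the extractor with keepDuplicates=false.
func dedup(entries []string) []string {
	seen := make(map[string]struct{}, len(entries))
	var out []string
	for _, e := range entries {
		if _, ok := seen[e]; ok {
			continue // already emitted, skip the duplicate
		}
		seen[e] = struct{}{}
		out = append(out, e)
	}
	return out
}

func main() {
	fmt.Println(dedup([]string{"a", "b", "a", "c", "b"})) // [a b c]
}
```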
@@ -1,44 +0,0 @@
/*
	Wecr - crawl the web for data
	Copyright (C) 2023 Kasyanov Nikolay Alexeyevich (Unbewohnte)

	This program is free software: you can redistribute it and/or modify
	it under the terms of the GNU Affero General Public License as published by
	the Free Software Foundation, either version 3 of the License, or
	(at your option) any later version.

	This program is distributed in the hope that it will be useful,
	but WITHOUT ANY WARRANTY; without even the implied warranty of
	MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
	GNU Affero General Public License for more details.

	You should have received a copy of the GNU Affero General Public License
	along with this program. If not, see <https://www.gnu.org/licenses/>.
*/

package web

import (
	"net/url"
)

// Tries to find audio URLs on the page
func FindPageAudio(pageBody []byte, from url.URL) []url.URL {
	var urls []url.URL

	// for every element that has "src" attribute
	for _, link := range FindPageSrcLinks(pageBody, from) {
		if HasAudioExtention(link.EscapedPath()) {
			urls = append(urls, link)
		}
	}

	// for every "a" element as well
	for _, link := range FindPageLinks(pageBody, from) {
		if HasAudioExtention(link.EscapedPath()) {
			urls = append(urls, link)
		}
	}

	return urls
}
@@ -1,45 +0,0 @@
/*
	Wecr - crawl the web for data
	Copyright (C) 2023 Kasyanov Nikolay Alexeyevich (Unbewohnte)

	This program is free software: you can redistribute it and/or modify
	it under the terms of the GNU Affero General Public License as published by
	the Free Software Foundation, either version 3 of the License, or
	(at your option) any later version.

	This program is distributed in the hope that it will be useful,
	but WITHOUT ANY WARRANTY; without even the implied warranty of
	MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
	GNU Affero General Public License for more details.

	You should have received a copy of the GNU Affero General Public License
	along with this program. If not, see <https://www.gnu.org/licenses/>.
*/

package web

import (
	"net/url"
)

// Tries to find docs' URLs on the page
func FindPageDocuments(pageBody []byte, from url.URL) []url.URL {
	var urls []url.URL

	// for every element that has "src" attribute
	for _, link := range FindPageSrcLinks(pageBody, from) {
		if HasDocumentExtention(link.EscapedPath()) {
			urls = append(urls, link)
		}
	}

	// for every "a" element as well
	for _, link := range FindPageLinks(pageBody, from) {
		if HasDocumentExtention(link.EscapedPath()) {
			urls = append(urls, link)
		}
	}

	// return discovered doc urls
	return urls
}
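Both finders above reduce to a suffix check of the URL path against an extension list. Note that `strings.HasSuffix` is case-sensitive, so a path ending in `.MP3` would not match `.mp3` unless the path is normalised first. A small standalone sketch with lowercase normalisation (the `hasExtension` helper is illustrative, not the project's actual `Has*Extention` code):

```go
package main

import (
	"fmt"
	"strings"
)

// hasExtension reports whether urlPath ends with one of the given
// extensions, comparing case-insensitively.
func hasExtension(urlPath string, extensions []string) bool {
	lower := strings.ToLower(urlPath)
	for _, ext := range extensions {
		if strings.HasSuffix(lower, ext) {
			return true
		}
	}
	return false
}

func main() {
	audio := []string{".mp3", ".flac", ".ogg"}
	fmt.Println(hasExtension("/music/track.MP3", audio)) // true
	fmt.Println(hasExtension("/index.html", audio))      // false
}
```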
@ -1,174 +0,0 @@ |
|||||||
/* |
|
||||||
Wecr - crawl the web for data |
|
||||||
Copyright (C) 2023 Kasyanov Nikolay Alexeyevich (Unbewohnte) |
|
||||||
|
|
||||||
This program is free software: you can redistribute it and/or modify |
|
||||||
it under the terms of the GNU Affero General Public License as published by |
|
||||||
the Free Software Foundation, either version 3 of the License, or |
|
||||||
(at your option) any later version. |
|
||||||
|
|
||||||
This program is distributed in the hope that it will be useful, |
|
||||||
but WITHOUT ANY WARRANTY; without even the implied warranty of |
|
||||||
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the |
|
||||||
GNU Affero General Public License for more details. |
|
||||||
|
|
||||||
You should have received a copy of the GNU Affero General Public License |
|
||||||
along with this program. If not, see <https://www.gnu.org/licenses/>.
|
|
||||||
*/ |
|
||||||
|
|
||||||
package web |
|
||||||
|
|
||||||
import "strings" |
|
||||||
|
|
||||||
var AudioExtentions = []string{
	".3gp",
	".aa",
	".aac",
	".aax",
	".act",
	".aiff",
	".alac",
	".amr",
	".ape",
	".au",
	".flac",
	".m4a",
	".mp3",
	".mpc",
	".msv",
	".ogg",
	".oga",
	".mogg",
	".opus",
	".tta",
	".wav",
	".cda",
}
var ImageExtentions = []string{
	".jpeg",
	".jpg",
	".jpe",
	".jfif",
	".png",
	".ppm",
	".svg",
	".gif",
	".tiff",
	".bmp",
	".webp",
	".ico",
	".kra",
	".bpg",
	".drw",
	".tga",
}
var VideoExtentions = []string{
	".webm",
	".mkv",
	".flv",
	".wmv",
	".avi",
	".yuv",
	".mp2",
	".mp4",
	".mpeg",
	".mpg",
	".mpv",
	".m4v",
	".3gp",
	".3g2",
	".nsv",
	".vob",
	".ogv",
}
var DocumentExtentions = []string{
	".pdf",
	".doc",
	".docx",
	".epub",
	".fb2",
	".pub",
	".ppt",
	".pptx",
	".txt",
	".tex",
	".odt",
	".bib",
	".ps",
	".dwg",
	".lyx",
	".key",
	".ott",
	".odf",
	".odc",
	".ppg",
	".xlc",
	".latex",
	".c",
	".cpp",
	".sh",
	".go",
	".java",
	".cs",
	".rs",
	".lua",
	".php",
	".py",
	".pl",
	".kt",
	".rb",
	".asm",
	".rar",
	".tar",
	".db",
	".7z",
	".zip",
	".gbr",
	".ttf",
	".ttc",
	".woff",
	".otf",
	".exif",
}
func HasImageExtention(urlPath string) bool {
	for _, extention := range ImageExtentions {
		if strings.HasSuffix(urlPath, extention) {
			return true
		}
	}
	return false
}

func HasDocumentExtention(urlPath string) bool {
	for _, extention := range DocumentExtentions {
		if strings.HasSuffix(urlPath, extention) {
			return true
		}
	}
	return false
}

func HasVideoExtention(urlPath string) bool {
	for _, extention := range VideoExtentions {
		if strings.HasSuffix(urlPath, extention) {
			return true
		}
	}
	return false
}

func HasAudioExtention(urlPath string) bool {
	for _, extention := range AudioExtentions {
		if strings.HasSuffix(urlPath, extention) {
			return true
		}
	}
	return false
}
@ -1,44 +0,0 @@
/*
	Wecr - crawl the web for data
	Copyright (C) 2023 Kasyanov Nikolay Alexeyevich (Unbewohnte)

	This program is free software: you can redistribute it and/or modify
	it under the terms of the GNU Affero General Public License as published by
	the Free Software Foundation, either version 3 of the License, or
	(at your option) any later version.

	This program is distributed in the hope that it will be useful,
	but WITHOUT ANY WARRANTY; without even the implied warranty of
	MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
	GNU Affero General Public License for more details.

	You should have received a copy of the GNU Affero General Public License
	along with this program. If not, see <https://www.gnu.org/licenses/>.
*/

package web

import (
	"net/url"
)
// Tries to find videos' URLs on the page
func FindPageVideos(pageBody []byte, from url.URL) []url.URL {
	var urls []url.URL

	// for every element that has "src" attribute
	for _, link := range FindPageSrcLinks(pageBody, from) {
		if HasVideoExtention(link.EscapedPath()) {
			urls = append(urls, link)
		}
	}

	// for every "a" element as well
	for _, link := range FindPageLinks(pageBody, from) {
		if HasVideoExtention(link.EscapedPath()) {
			urls = append(urls, link)
		}
	}

	return urls
}