Compare commits
No commits in common. 'master' and 'v0.1.2' have entirely different histories.
26 changed files with 384 additions and 13557 deletions
@@ -1,91 +1,33 @@

# Wecr - versatile WEb CRawler
# Wecr - simple web crawler

## Overview

A simple HTML web spider with no dependencies. It is possible to search for pages with certain text on them or for the text itself, extract images, video and audio, and save pages that satisfy the criteria along the way.
Just a simple HTML web spider with minimal dependencies. It is possible to search for pages with certain text on them or for the text itself, extract images and save pages that satisfy the criteria along the way.

## Configuration Overview
## Configuration

The flow of work fully depends on the configuration file. By default `conf.json` is used as the configuration file, but the name can be changed via the `-conf` flag. The default configuration is embedded in the program, so on the first launch, or after simply deleting the file, a new `conf.json` will be created in the working directory unless the `-wdir` (working directory) flag is set to some other value, in which case that takes precedence. To see all available flags, run `wecr -h`.
The flow of work fully depends on the configuration file. By default `conf.json` is used as the configuration file, but the name can be changed via the `wDir` flag. The default configuration is embedded in the program, so on the first launch, or after simply deleting the file, a new `conf.json` will be created in the same directory as the executable itself unless the `wDir` (working directory) flag is set to some other value.

The configuration is split into different branches like `requests` (how requests are made, i.e. request timeout, wait time, user agent), `logging` (use logs, output to a file), `save` (output file/directory, whether to save pages) and `search` (use regexp, query string), each of which contains tweakable parameters. There are global ones as well, such as `workers` (worker threads that make requests in parallel) and `depth` (literally, how deep the recursive search should go). The names are simple and self-explanatory, so no attribute-by-attribute explanation is needed for most of them.

The parsing starts from `initial_pages` and goes deeper while ignoring pages on domains that are in `blacklisted_domains` or are NOT in `allowed_domains`. If all initial pages happen to be on blacklisted domains or are not in the allowed list, the program will get stuck. It is important to note that `*_domains` should be specified with an existing scheme (i.e. https://en.wikipedia.org). Subdomains and ports **matter**: `https://unbewohnte.su:3000/` and `https://unbewohnte.su/` are **different**.

Previous versions stored the entire visit queue in memory, resulting in gigabytes of memory usage, but as of `v0.2.4` it is possible to offload the queue to persistent storage via the `in_memory_visit_queue` option (`false` by default).

You can change the search `query` at **runtime** via the web dashboard if `launch_dashboard` is set to `true`.
The parsing starts from `initial_pages` and goes deeper while ignoring pages on domains that are in `blacklisted_domains` or are NOT in `allowed_domains`. If all initial pages happen to be on blacklisted domains or are not in the allowed list, the program will get stuck.

### Search query

There are some special `query` values to control the flow of work:

- `email` - tells wecr to scrape email addresses and output them to `output_file`
- `images` - find all images on pages and output them to the corresponding directory in `output_dir` (**IMPORTANT**: set `content_fetch_timeout_ms` to `0` so the images (and other content below) load fully)
- `videos` - find and fetch files that look like videos
- `audio` - find and fetch files that look like audio
- `documents` - find and fetch files that look like documents
- `everything` - find and fetch images, audio, video, documents and email addresses
- `archive` - no text to be searched; save every visited page

When `is_regexp` is enabled, the `query` is treated as a regexp string (in Go "flavor") and pages will be scanned for matches that satisfy it.

### Data Output

If the query is not one of the special values, all text matches will be written to the `found_text.json` file as separate continuous JSON objects in `output_dir`; if `save_pages` is set to `true` and/or `query` is set to `images`, `videos`, `audio`, etc., the additional contents will also be put in the corresponding directories inside `output_dir`, which is created in the working directory or, if the `-wdir` flag is set, there. If `output_dir` happens to be empty, contents will be written directly to the working directory.

The output almost certainly contains some duplicates and is not easy to work with programmatically, so you can use `-extractData` with the output JSON file as the argument (like `found_text.json`, which is the default output file name for simple text searches) to extract the actual data, filter out the duplicates and put each entry on its own line in a new text file.
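The duplicate filtering that `-extractData` performs can be sketched as a first-occurrence-wins pass over the collected entries. This is a simplified, self-contained illustration, not the tool's actual implementation (which streams JSON results, as the `ExtractDataFromOutput` code later in this diff shows):

```go
package main

import "fmt"

// dedupe keeps only the first occurrence of each entry,
// preserving the original order of the input.
func dedupe(entries []string) []string {
	seen := make(map[string]bool)
	var unique []string
	for _, e := range entries {
		if seen[e] {
			continue
		}
		seen[e] = true
		unique = append(unique, e)
	}
	return unique
}

func main() {
	matches := []string{"a@b.su", "c@d.su", "a@b.su"}
	for _, entry := range dedupe(matches) {
		fmt.Println(entry)
	}
}
```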

## Build
If `is_regexp` is `false`, then `query` is the text to be searched for, but there are some special values:

If you're on *nix - it's as easy as `make`.
- `links` - tells `wecr` to search for all links there are on the page
- `images` - find all image links and output them to the `output_dir` (**IMPORTANT**: set `wait_timeout_ms` to `0` so the images load fully)

Otherwise - `go build` in the `src` directory to build `wecr`. No dependencies.
When `is_regexp` is enabled, the `query` is treated as a regexp string and pages will be scanned for matches that satisfy it.

## Examples
### Output

See [a page on my website](https://unbewohnte.su/wecr) for some basic examples.
By default, if the query is not `images`, all the matches and other data will be written to the `output.json` file as separate continuous JSON objects, but if `save_pages` is set to `true` and/or `query` is set to `images`, the additional contents will be put in the `output_dir` directory created next to the executable.

Dump of a basic configuration:
## TODO

```json
{
    "search": {
        "is_regexp": true,
        "query": "(sequence to search)|(other sequence)"
    },
    "requests": {
        "request_wait_timeout_ms": 2500,
        "request_pause_ms": 100,
        "content_fetch_timeout_ms": 0,
        "user_agent": ""
    },
    "depth": 90,
    "workers": 30,
    "initial_pages": [
        "https://en.wikipedia.org/wiki/Main_Page"
    ],
    "allowed_domains": [
        "https://en.wikipedia.org/"
    ],
    "blacklisted_domains": [
        ""
    ],
    "in_memory_visit_queue": false,
    "web_dashboard": {
        "launch_dashboard": true,
        "port": 13370
    },
    "save": {
        "output_dir": "scraped",
        "save_pages": false
    },
    "logging": {
        "output_logs": true,
        "logs_file": "logs.log"
    }
}
```
- **PARSE HTML WITH REGEXP (_EVIL LAUGH_)**

## License
wecr is distributed under the AGPLv3 license
AGPLv3
@@ -1,161 +0,0 @@

/*
	Wecr - crawl the web for data
	Copyright (C) 2023 Kasyanov Nikolay Alexeyevich (Unbewohnte)

	This program is free software: you can redistribute it and/or modify
	it under the terms of the GNU Affero General Public License as published by
	the Free Software Foundation, either version 3 of the License, or
	(at your option) any later version.

	This program is distributed in the hope that it will be useful,
	but WITHOUT ANY WARRANTY; without even the implied warranty of
	MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
	GNU Affero General Public License for more details.

	You should have received a copy of the GNU Affero General Public License
	along with this program. If not, see <https://www.gnu.org/licenses/>.
*/

package dashboard

import (
	"embed"
	"encoding/json"
	"fmt"
	"html/template"
	"io"
	"io/fs"
	"net/http"
	"unbewohnte/wecr/config"
	"unbewohnte/wecr/logger"
	"unbewohnte/wecr/worker"
)

type Dashboard struct {
	Server *http.Server
}

//go:embed res
var resFS embed.FS

type PageData struct {
	Conf  config.Conf
	Stats worker.Statistics
}

type PoolStop struct {
	Stop bool `json:"stop"`
}

func NewDashboard(port uint16, webConf *config.Conf, pool *worker.Pool) *Dashboard {
	mux := http.NewServeMux()
	res, err := fs.Sub(resFS, "res")
	if err != nil {
		logger.Error("Failed to Sub embedded dashboard FS: %s", err)
		return nil
	}

	mux.Handle("/static/", http.FileServer(http.FS(res)))

	mux.HandleFunc("/", func(w http.ResponseWriter, req *http.Request) {
		template, err := template.ParseFS(res, "*.html")
		if err != nil {
			logger.Error("Failed to parse embedded dashboard FS: %s", err)
			return
		}

		template.ExecuteTemplate(w, "index.html", nil)
	})

	mux.HandleFunc("/stop", func(w http.ResponseWriter, req *http.Request) {
		var stop PoolStop

		requestBody, err := io.ReadAll(req.Body)
		if err != nil {
			http.Error(w, "Failed to read request body", http.StatusInternalServerError)
			logger.Error("Failed to read stop|resume signal from dashboard request: %s", err)
			return
		}
		defer req.Body.Close()

		err = json.Unmarshal(requestBody, &stop)
		if err != nil {
			http.Error(w, "Failed to unmarshal stop|resume signal", http.StatusInternalServerError)
			logger.Error("Failed to unmarshal stop|resume signal from dashboard UI: %s", err)
			return
		}

		if stop.Stop {
			// stop worker pool
			pool.Stop()
			logger.Info("Stopped worker pool via request from dashboard")
		} else {
			// resume work
			pool.Work()
			logger.Info("Resumed work via request from dashboard")
		}
	})

	mux.HandleFunc("/stats", func(w http.ResponseWriter, req *http.Request) {
		jsonStats, err := json.MarshalIndent(pool.Stats, "", " ")
		if err != nil {
			http.Error(w, "Failed to marshal statistics", http.StatusInternalServerError)
			logger.Error("Failed to marshal stats to send to the dashboard: %s", err)
			return
		}
		w.Header().Add("Content-type", "application/json")
		w.Write(jsonStats)
	})

	mux.HandleFunc("/conf", func(w http.ResponseWriter, req *http.Request) {
		switch req.Method {
		case http.MethodPost:
			var newConfig config.Conf

			defer req.Body.Close()
			newConfigData, err := io.ReadAll(req.Body)
			if err != nil {
				http.Error(w, "Failed to read request body", http.StatusInternalServerError)
				logger.Error("Failed to read new configuration from dashboard request: %s", err)
				return
			}
			err = json.Unmarshal(newConfigData, &newConfig)
			if err != nil {
				http.Error(w, "Failed to unmarshal new configuration", http.StatusInternalServerError)
				logger.Error("Failed to unmarshal new configuration from dashboard UI: %s", err)
				return
			}

			// DO NOT blindly replace global configuration. Manually check and replace values
			webConf.Search.IsRegexp = newConfig.Search.IsRegexp
			if len(newConfig.Search.Query) != 0 {
				webConf.Search.Query = newConfig.Search.Query
			}

			webConf.Logging.OutputLogs = newConfig.Logging.OutputLogs

		default:
			jsonConf, err := json.MarshalIndent(webConf, "", " ")
			if err != nil {
				http.Error(w, "Failed to marshal configuration", http.StatusInternalServerError)
				logger.Error("Failed to marshal current configuration to send to the dashboard UI: %s", err)
				return
			}
			w.Header().Add("Content-type", "application/json")
			w.Write(jsonConf)
		}
	})

	server := &http.Server{
		Addr:    fmt.Sprintf(":%d", port),
		Handler: mux,
	}

	return &Dashboard{
		Server: server,
	}
}

func (board *Dashboard) Launch() error {
	return board.Server.ListenAndServe()
}
@@ -1,223 +0,0 @@

<!DOCTYPE html>
<html lang="en">

<head>
    <meta charset="utf-8">
    <title>Wecr dashboard</title>
    <!-- <link rel="icon" href="/static/icon.png"> -->
    <link rel="stylesheet" href="/static/bootstrap.css">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
</head>

<body class="d-flex flex-column h-100">
    <div class="container">
        <header class="d-flex flex-wrap justify-content-center py-3 mb-4 border-bottom">
            <a href="/" class="d-flex align-items-center mb-3 mb-md-0 me-md-auto text-dark text-decoration-none">
                <svg class="bi me-2" width="40" height="32">
                    <use xlink:href="#bootstrap"></use>
                </svg>
                <strong class="fs-4">Wecr</strong>
            </a>

            <ul class="nav nav-pills">
                <li class="nav-item"><a href="/stats" class="nav-link">Stats</a></li>
                <li class="nav-item"><a href="/conf" class="nav-link">Config</a></li>
            </ul>
        </header>
    </div>

    <div class="container">
        <h1>Dashboard</h1>

        <div style="height: 3rem;"></div>

        <div class="container">
            <h2>Statistics</h2>
            <div id="statistics">
                <ol class="list-group list-group-numbered">
                    <li class="list-group-item d-flex justify-content-between align-items-start">
                        <div class="ms-2 me-auto">
                            <div class="fw-bold">Pages visited</div>
                        </div>
                        <span class="badge bg-primary rounded-pill" id="pages_visited">0</span>
                    </li>
                    <li class="list-group-item d-flex justify-content-between align-items-start">
                        <div class="ms-2 me-auto">
                            <div class="fw-bold">Matches found</div>
                        </div>
                        <span class="badge bg-primary rounded-pill" id="matches_found">0</span>
                    </li>
                    <li class="list-group-item d-flex justify-content-between align-items-start">
                        <div class="ms-2 me-auto">
                            <div class="fw-bold">Pages saved</div>
                        </div>
                        <span class="badge bg-primary rounded-pill" id="pages_saved">0</span>
                    </li>
                    <li class="list-group-item d-flex justify-content-between align-items-start">
                        <div class="ms-2 me-auto">
                            <div class="fw-bold">Start time</div>
                        </div>
                        <span class="badge bg-primary rounded-pill" id="start_time_unix">0</span>
                    </li>
                    <li class="list-group-item d-flex justify-content-between align-items-start">
                        <div class="ms-2 me-auto">
                            <div class="fw-bold">Stopped</div>
                        </div>
                        <span class="badge bg-primary rounded-pill" id="stopped">false</span>
                    </li>
                </ol>
            </div>

            <button class="btn btn-primary" id="btn_stop">Stop</button>
            <button class="btn btn-primary" id="btn_resume" disabled>Resume</button>
        </div>

        <div style="height: 3rem;"></div>

        <div class="container">
            <h2>Configuration</h2>
            <div>
                <b>Make runtime changes to configuration</b>
                <table class="table table-borderless">
                    <tr>
                        <th>Key</th>
                        <th>Value</th>
                    </tr>
                    <tr>
                        <th>Query</th>
                        <th>
                            <input type="text" id="conf_query">
                        </th>
                    </tr>
                    <tr>
                        <th>Is regexp</th>
                        <th>
                            <input type="text" id="conf_is_regexp">
                        </th>
                    </tr>
                </table>
                <button class="btn btn-primary" id="config_apply_button">
                    Apply
                </button>
            </div>

            <div style="height: 3rem;"></div>

            <pre id="conf_output"></pre>
        </div>
    </div>
</body>

<script>
    window.onload = function () {
        let confOutput = document.getElementById("conf_output");
        let pagesVisitedOut = document.getElementById("pages_visited");
        let matchesFoundOut = document.getElementById("matches_found");
        let pagesSavedOut = document.getElementById("pages_saved");
        let startTimeOut = document.getElementById("start_time_unix");
        let stoppedOut = document.getElementById("stopped");
        let applyConfButton = document.getElementById("config_apply_button");
        let confQuery = document.getElementById("conf_query");
        let confIsRegexp = document.getElementById("conf_is_regexp");
        let buttonStop = document.getElementById("btn_stop");
        let buttonResume = document.getElementById("btn_resume");

        buttonStop.addEventListener("click", (event) => {
            buttonStop.disabled = true;
            buttonResume.disabled = false;

            // stop worker pool
            let signal = {
                "stop": true,
            };

            fetch("/stop", {
                method: "POST",
                headers: {
                    "Content-type": "application/json",
                },
                body: JSON.stringify(signal),
            });
        });

        buttonResume.addEventListener("click", (event) => {
            buttonResume.disabled = true;
            buttonStop.disabled = false;

            // resume worker pool's work
            let signal = {
                "stop": false,
            };

            fetch("/stop", {
                method: "POST",
                headers: {
                    "Content-type": "application/json",
                },
                body: JSON.stringify(signal),
            });
        });

        applyConfButton.addEventListener("click", (event) => {
            let query = String(confQuery.value);

            if (confIsRegexp.value === "0") {
                isRegexp = false;
            } else if (confIsRegexp.value === "1") {
                isRegexp = true;
            };
            if (confIsRegexp.value === "false") {
                isRegexp = false;
            } else if (confIsRegexp.value === "true") {
                isRegexp = true;
            };

            let newConf = {
                "search": {
                    "is_regexp": isRegexp,
                    "query": query,
                },
            };

            fetch("/conf", {
                method: "POST",
                headers: {
                    "Content-type": "application/json",
                },
                body: JSON.stringify(newConf),
            });
        });

        const interval = setInterval(function () {
            // update statistics
            fetch("/stats")
                .then((response) => response.json())
                .then((statistics) => {
                    pagesVisitedOut.innerText = statistics.pages_visited;
                    matchesFoundOut.innerText = statistics.matches_found;
                    pagesSavedOut.innerText = statistics.pages_saved;
                    startTimeOut.innerText = new Date(1000 * statistics.start_time_unix);
                    stoppedOut.innerText = statistics.stopped;
                });
            // update config
            fetch("/conf")
                .then((response) => response.text())
                .then((config) => {
                    // "print" whole configuration
                    confOutput.innerText = config;

                    // update values in the change table if they're empty
                    let confJSON = JSON.parse(config);
                    if (confQuery.value == "") {
                        confQuery.value = confJSON.search.query;
                    }
                    if (confIsRegexp.value == "") {
                        confIsRegexp.value = confJSON.search.is_regexp;
                    }
                });
        }, 650);
    }();
</script>

</html>
File diff suppressed because it is too large
File diff suppressed because one or more lines are too long
@@ -1,3 +1,5 @@

module unbewohnte/wecr

go 1.18

require golang.org/x/net v0.4.0
@@ -0,0 +1,2 @@

golang.org/x/net v0.4.0 h1:Q5QPcMlvfxFTAPV0+07Xz/MpK9NTXu2VDUuy0FeMfaU=
golang.org/x/net v0.4.0/go.mod h1:MBQ8lrhLObU/6UmLb4fmbmk5OcyYmqtbGd/9yIeKjEE=
@@ -1,54 +0,0 @@

package queue

import (
	"encoding/json"
	"io"
	"os"
	"unbewohnte/wecr/web"
)

func PopLastJob(queue *os.File) (*web.Job, error) {
	stats, err := queue.Stat()
	if err != nil {
		return nil, err
	}

	if stats.Size() == 0 {
		return nil, nil
	}

	// find the last job in the queue
	var job web.Job
	var offset int64 = -1
	for {
		currentOffset, err := queue.Seek(offset, io.SeekEnd)
		if err != nil {
			return nil, err
		}

		decoder := json.NewDecoder(queue)
		err = decoder.Decode(&job)
		if err != nil || job.URL == "" || job.Search.Query == "" {
			offset -= 1
			continue
		}

		queue.Truncate(currentOffset)
		return &job, nil
	}
}

func InsertNewJob(queue *os.File, newJob web.Job) error {
	_, err := queue.Seek(0, io.SeekEnd)
	if err != nil {
		return err
	}

	encoder := json.NewEncoder(queue)
	err = encoder.Encode(&newJob)
	if err != nil {
		return err
	}

	return nil
}
@@ -1,78 +0,0 @@

/*
	Wecr - crawl the web for data
	Copyright (C) 2023 Kasyanov Nikolay Alexeyevich (Unbewohnte)

	This program is free software: you can redistribute it and/or modify
	it under the terms of the GNU Affero General Public License as published by
	the Free Software Foundation, either version 3 of the License, or
	(at your option) any later version.

	This program is distributed in the hope that it will be useful,
	but WITHOUT ANY WARRANTY; without even the implied warranty of
	MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
	GNU Affero General Public License for more details.

	You should have received a copy of the GNU Affero General Public License
	along with this program. If not, see <https://www.gnu.org/licenses/>.
*/

package utilities

import (
	"encoding/json"
	"fmt"
	"io"
	"os"
	"unbewohnte/wecr/web"
)

// Extracts data from the output JSON file and puts it in a new file with separators between each entry
func ExtractDataFromOutput(inputFilename string, outputFilename string, separator string, keepDuplicates bool) error {
	inputFile, err := os.Open(inputFilename)
	if err != nil {
		return err
	}
	defer inputFile.Close()

	outputFile, err := os.Create(outputFilename)
	if err != nil {
		return err
	}
	defer outputFile.Close()

	var processedData []string

	decoder := json.NewDecoder(inputFile)
	for {
		var result web.Result

		err := decoder.Decode(&result)
		if err == io.EOF {
			break
		}
		if err != nil {
			return err
		}

		for _, dataEntry := range result.Data {
			var skip = false
			if !keepDuplicates {
				for _, processedEntry := range processedData {
					if dataEntry == processedEntry {
						skip = true
						break
					}
				}

				if skip {
					continue
				}
				processedData = append(processedData, dataEntry)
			}

			outputFile.WriteString(fmt.Sprintf("%s%s", dataEntry, separator))
		}
	}

	return nil
}
@@ -1,44 +0,0 @@

/*
	Wecr - crawl the web for data
	Copyright (C) 2023 Kasyanov Nikolay Alexeyevich (Unbewohnte)

	This program is free software: you can redistribute it and/or modify
	it under the terms of the GNU Affero General Public License as published by
	the Free Software Foundation, either version 3 of the License, or
	(at your option) any later version.

	This program is distributed in the hope that it will be useful,
	but WITHOUT ANY WARRANTY; without even the implied warranty of
	MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
	GNU Affero General Public License for more details.

	You should have received a copy of the GNU Affero General Public License
	along with this program. If not, see <https://www.gnu.org/licenses/>.
*/

package web

import (
	"net/url"
)

// Tries to find audio URLs on the page
func FindPageAudio(pageBody []byte, from url.URL) []url.URL {
	var urls []url.URL

	// for every element that has "src" attribute
	for _, link := range FindPageSrcLinks(pageBody, from) {
		if HasAudioExtention(link.EscapedPath()) {
			urls = append(urls, link)
		}
	}

	// for every "a" element as well
	for _, link := range FindPageLinks(pageBody, from) {
		if HasAudioExtention(link.EscapedPath()) {
			urls = append(urls, link)
		}
	}

	return urls
}
@@ -1,45 +0,0 @@

/*
	Wecr - crawl the web for data
	Copyright (C) 2023 Kasyanov Nikolay Alexeyevich (Unbewohnte)

	This program is free software: you can redistribute it and/or modify
	it under the terms of the GNU Affero General Public License as published by
	the Free Software Foundation, either version 3 of the License, or
	(at your option) any later version.

	This program is distributed in the hope that it will be useful,
	but WITHOUT ANY WARRANTY; without even the implied warranty of
	MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
	GNU Affero General Public License for more details.

	You should have received a copy of the GNU Affero General Public License
	along with this program. If not, see <https://www.gnu.org/licenses/>.
*/

package web

import (
	"net/url"
)

// Tries to find docs' URLs on the page
func FindPageDocuments(pageBody []byte, from url.URL) []url.URL {
	var urls []url.URL

	// for every element that has "src" attribute
	for _, link := range FindPageSrcLinks(pageBody, from) {
		if HasDocumentExtention(link.EscapedPath()) {
			urls = append(urls, link)
		}
	}

	// for every "a" element as well
	for _, link := range FindPageLinks(pageBody, from) {
		if HasDocumentExtention(link.EscapedPath()) {
			urls = append(urls, link)
		}
	}

	// return discovered doc urls
	return urls
}
@@ -1,174 +0,0 @@

/*
	Wecr - crawl the web for data
	Copyright (C) 2023 Kasyanov Nikolay Alexeyevich (Unbewohnte)

	This program is free software: you can redistribute it and/or modify
	it under the terms of the GNU Affero General Public License as published by
	the Free Software Foundation, either version 3 of the License, or
	(at your option) any later version.

	This program is distributed in the hope that it will be useful,
	but WITHOUT ANY WARRANTY; without even the implied warranty of
	MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
	GNU Affero General Public License for more details.

	You should have received a copy of the GNU Affero General Public License
	along with this program. If not, see <https://www.gnu.org/licenses/>.
*/

package web

import "strings"

var AudioExtentions = []string{
	".3gp",
	".aa",
	".aac",
	".aax",
	".act",
	".aiff",
	".alac",
	".amr",
	".ape",
	".au",
	".flac",
	".m4a",
	".mp3",
	".mpc",
	".msv",
	".ogg",
	".oga",
	".mogg",
	".opus",
	".tta",
	".wav",
	".cda",
}

var ImageExtentions = []string{
	".jpeg",
	".jpg",
	".jpe",
	".jfif",
	".png",
	".ppm",
	".svg",
	".gif",
	".tiff",
	".bmp",
	".webp",
	".ico",
	".kra",
	".bpg",
	".drw",
	".tga",
	".kra",
}

var VideoExtentions = []string{
	".webm",
	".mkv",
	".flv",
	".wmv",
	".avi",
	".yuv",
	".mp2",
	".mp4",
	".mpeg",
	".mpg",
	".mpv",
	".m4v",
	".3gp",
	".3g2",
	".nsv",
	".vob",
	".ogv",
}

var DocumentExtentions = []string{
	".pdf",
	".doc",
	".docx",
	".epub",
	".fb2",
	".pub",
	".ppt",
	".pptx",
	".txt",
	".tex",
	".odt",
	".bib",
	".ps",
	".dwg",
	".lyx",
	".key",
	".ott",
	".odf",
	".odc",
	".ppg",
	".xlc",
	".latex",
	".c",
	".cpp",
	".sh",
	".go",
	".java",
	".cs",
	".rs",
	".lua",
	".php",
	".py",
	".pl",
	".lua",
	".kt",
	".rb",
	".asm",
	".rar",
	".tar",
	".db",
	".7z",
	".zip",
	".gbr",
	".tex",
	".ttf",
	".ttc",
	".woff",
	".otf",
	".exif",
}

func HasImageExtention(urlPath string) bool {
	for _, extention := range ImageExtentions {
		if strings.HasSuffix(urlPath, extention) {
			return true
		}
	}
	return false
}

func HasDocumentExtention(urlPath string) bool {
	for _, extention := range DocumentExtentions {
		if strings.HasSuffix(urlPath, extention) {
			return true
		}
	}
	return false
}

func HasVideoExtention(urlPath string) bool {
	for _, extention := range VideoExtentions {
		if strings.HasSuffix(urlPath, extention) {
			return true
		}
	}
	return false
}

func HasAudioExtention(urlPath string) bool {
	for _, extention := range AudioExtentions {
		if strings.HasSuffix(urlPath, extention) {
			return true
		}
	}
	return false
}
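One property of the extension checks above worth noting: `strings.HasSuffix` is case-sensitive, so the lowercase extension lists only match lowercase URL paths. A small self-contained sketch using a trimmed local copy of the image list (not the package's full list):

```go
package main

import (
	"fmt"
	"strings"
)

// A shortened stand-in for the ImageExtentions list above.
var imageExtentions = []string{".jpeg", ".jpg", ".png", ".gif", ".webp"}

// hasImageExtention mirrors HasImageExtention: a plain suffix check.
// Because strings.HasSuffix is case-sensitive, uppercase extensions
// such as ".PNG" are not matched by the lowercase list.
func hasImageExtention(urlPath string) bool {
	for _, extention := range imageExtentions {
		if strings.HasSuffix(urlPath, extention) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(hasImageExtention("/pics/cat.png")) // true
	fmt.Println(hasImageExtention("/pics/CAT.PNG")) // false
}
```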
@@ -1,44 +0,0 @@

/*
	Wecr - crawl the web for data
	Copyright (C) 2023 Kasyanov Nikolay Alexeyevich (Unbewohnte)

	This program is free software: you can redistribute it and/or modify
	it under the terms of the GNU Affero General Public License as published by
	the Free Software Foundation, either version 3 of the License, or
	(at your option) any later version.

	This program is distributed in the hope that it will be useful,
	but WITHOUT ANY WARRANTY; without even the implied warranty of
	MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
	GNU Affero General Public License for more details.

	You should have received a copy of the GNU Affero General Public License
	along with this program. If not, see <https://www.gnu.org/licenses/>.
*/

package web

import (
	"net/url"
)

// Tries to find videos' URLs on the page
func FindPageVideos(pageBody []byte, from url.URL) []url.URL {
	var urls []url.URL

	// for every element that has "src" attribute
	for _, link := range FindPageSrcLinks(pageBody, from) {
		if HasVideoExtention(link.EscapedPath()) {
			urls = append(urls, link)
		}
	}

	// for every "a" element as well
	for _, link := range FindPageLinks(pageBody, from) {
		if HasVideoExtention(link.EscapedPath()) {
			urls = append(urls, link)
		}
	}

	return urls
}