Compare commits
16 Commits
Author | SHA1 | Date |
---|---|---|
Kasianov Nikolai Alekseevich | 722f3fb536 | 2 years ago |
Kasianov Nikolai Alekseevich | c91986d42d | 2 years ago |
Kasianov Nikolai Alekseevich | c2ec2073dc | 2 years ago |
Kasianov Nikolai Alekseevich | 812fd2adf7 | 2 years ago |
Kasianov Nikolai Alekseevich | b256d8a83e | 2 years ago |
Kasianov Nikolai Alekseevich | e5af2939cc | 2 years ago |
Kasianov Nikolai Alekseevich | 6fab9031b1 | 2 years ago |
Kasianov Nikolai Alekseevich | fd484c665e | 2 years ago |
Kasianov Nikolai Alekseevich | 00bc33d5de | 2 years ago |
Kasianov Nikolai Alekseevich | f96bad448a | 2 years ago |
Kasianov Nikolai Alekseevich | d877a483a2 | 2 years ago |
Kasianov Nikolai Alekseevich | 023c2e5a19 | 2 years ago |
Kasianov Nikolai Alekseevich | 1771d19b82 | 2 years ago |
Kasianov Nikolai Alekseevich | 5150edc41c | 2 years ago |
Kasianov Nikolai Alekseevich | d4888dab92 | 2 years ago |
Kasianov Nikolai Alekseevich | 793c2b2a70 | 2 years ago |
21 changed files with 12941 additions and 476 deletions
@@ -1,38 +1,91 @@
# Wecr - simple web crawler

# Wecr - versatile WEb CRawler

## Overview

Just a simple HTML web spider with minimal dependencies. It is possible to search for pages with a text on them or for the text itself, extract images and save pages that satisfy the criteria along the way.

A simple HTML web spider with no dependencies. It is possible to search for pages containing given text or for the text itself, extract images, video and audio, and save pages that satisfy the criteria along the way.

## Configuration

## Configuration Overview

The flow of work fully depends on the configuration file. By default `conf.json` is used as a configuration file, but the name can be changed via `-conf` flag. The default configuration is embedded in the program so on the first launch or by simply deleting the file, a new `conf.json` will be created in the same directory as the executable itself unless the `-wDir` (working directory) flag is set to some other value. To see al available flags run `wecr -h`.

The flow of work fully depends on the configuration file. By default `conf.json` is used as the configuration file, but the name can be changed via the `-conf` flag. The default configuration is embedded in the program, so on the first launch, or after simply deleting the file, a new `conf.json` will be created in the working directory unless the `-wdir` (working directory) flag points somewhere else, in which case that directory takes precedence. To see all available flags run `wecr -h`.

The configuration is split into different branches like `requests` (how requests are made, i.e. request timeout, wait time, user agent), `logging` (use logs, output to a file), `save` (output file/directory, save pages or not) or `search` (use regexp, query string), each of which contains tweakable parameters. There are global ones as well, such as `workers` (working threads that make requests in parallel) and `depth` (literally, how deep the recursive search should go). The names are simple and self-explanatory, so no attribute-by-attribute explanation is needed for most of them.

The parsing starts from `initial_pages` and goes deeper while ignoring the pages on domains that are in `blacklisted_domains` or are NOT in `allowed_domains`. If all initial pages happen to be on blacklisted domains or are not in the allowed list, the program will get stuck. It is important to note that `*_domains` should be specified with an existing scheme (i.e. https://en.wikipedia.org). Subdomains and ports **matter**: `https://unbewohnte.su:3000/` and `https://unbewohnte.su/` are **different**.

Previous versions stored the entire visit queue in memory, resulting in gigabytes of memory usage, but as of `v0.2.4` it is possible to offload the queue to persistent storage via the `in_memory_visit_queue` option (`false` by default).

You can change the search `query` at **runtime** via the web dashboard if `launch_dashboard` is set to `true`.

### Search query

There are some special `query` values:

There are some special `query` values to control the flow of work:

- `email` - tells wecr to scrape email addresses and output to `output_file`
- `images` - find all images on pages and output to the corresponding directory in `output_dir` (**IMPORTANT**: set `content_fetch_timeout_ms` to `0` so the images (and other content below) load fully)
- `videos` - find and fetch files that look like videos
- `audio` - find and fetch files that look like audio
- `everything` - find and fetch images, audio and video
- `documents` - find and fetch files that look like documents
- `everything` - find and fetch images, audio, video, documents and email addresses
- `archive` - no text to be searched, save every visited page

When `is_regexp` is enabled, the `query` is treated as a regexp string (in Go "flavor") and pages will be scanned for matches that satisfy it.

When `is_regexp` is enabled, the `query` is treated as a regexp string and pages will be scanned for matches that satisfy it.
### Data Output

### Output

If the query is not one of the special values, all text matches will be written to the `found_text.json` file as separate continuous JSON objects in `output_dir`; if `save_pages` is set to `true` and/or `query` is set to `images`, `videos`, `audio`, etc., the additional contents will also be put in the corresponding directories inside `output_dir`, which is created in the working directory or, if the `-wdir` flag is set, there. If `output_dir` happens to be empty, contents will be written directly to the working directory.

By default, if the query is not something of special values all the matches and other data will be outputted to `output.json` file as separate continuous JSON objects, but if `save_pages` is set to `true` and|or `query` is set to `images`, `videos`, `audio`, etc. - the additional contents will be put in the corresponding directories inside `output_dir`, which is neatly created by the executable's side.

The output almost certainly contains some duplicates and is not easy to work with programmatically, so you can use `-extractData` with the output JSON file argument (like `found_text.json`, which is the default output file name for simple text searches) to extract the actual data, filter out the duplicates, and put each entry on its own line in a new text file.

## Build

If you're on *nix - it's as easy as `make`.

Otherwise - `go build` in the `src` directory to build `wecr`.

Otherwise - `go build` in the `src` directory to build `wecr`. No dependencies.

## Examples

See [a page on my website](https://unbewohnte.su/wecr) for some basic examples.

Dump of a basic configuration:

```json
{
    "search": {
        "is_regexp": true,
        "query": "(sequence to search)|(other sequence)"
    },
    "requests": {
        "request_wait_timeout_ms": 2500,
        "request_pause_ms": 100,
        "content_fetch_timeout_ms": 0,
        "user_agent": ""
    },
    "depth": 90,
    "workers": 30,
    "initial_pages": [
        "https://en.wikipedia.org/wiki/Main_Page"
    ],
    "allowed_domains": [
        "https://en.wikipedia.org/"
    ],
    "blacklisted_domains": [
        ""
    ],
    "in_memory_visit_queue": false,
    "web_dashboard": {
        "launch_dashboard": true,
        "port": 13370
    },
    "save": {
        "output_dir": "scraped",
        "save_pages": false
    },
    "logging": {
        "output_logs": true,
        "logs_file": "logs.log"
    }
}
```

## License

AGPLv3

wecr is distributed under the AGPLv3 license
@@ -0,0 +1,161 @@
/*
	Wecr - crawl the web for data
	Copyright (C) 2023 Kasyanov Nikolay Alexeyevich (Unbewohnte)

	This program is free software: you can redistribute it and/or modify
	it under the terms of the GNU Affero General Public License as published by
	the Free Software Foundation, either version 3 of the License, or
	(at your option) any later version.

	This program is distributed in the hope that it will be useful,
	but WITHOUT ANY WARRANTY; without even the implied warranty of
	MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
	GNU Affero General Public License for more details.

	You should have received a copy of the GNU Affero General Public License
	along with this program. If not, see <https://www.gnu.org/licenses/>.
*/

package dashboard

import (
	"embed"
	"encoding/json"
	"fmt"
	"html/template"
	"io"
	"io/fs"
	"net/http"

	"unbewohnte/wecr/config"
	"unbewohnte/wecr/logger"
	"unbewohnte/wecr/worker"
)

type Dashboard struct {
	Server *http.Server
}

//go:embed res
var resFS embed.FS

type PageData struct {
	Conf  config.Conf
	Stats worker.Statistics
}

type PoolStop struct {
	Stop bool `json:"stop"`
}

func NewDashboard(port uint16, webConf *config.Conf, pool *worker.Pool) *Dashboard {
	mux := http.NewServeMux()
	res, err := fs.Sub(resFS, "res")
	if err != nil {
		logger.Error("Failed to Sub embedded dashboard FS: %s", err)
		return nil
	}

	mux.Handle("/static/", http.FileServer(http.FS(res)))

	mux.HandleFunc("/", func(w http.ResponseWriter, req *http.Request) {
		// named tmpl so the html/template package name is not shadowed
		tmpl, err := template.ParseFS(res, "*.html")
		if err != nil {
			logger.Error("Failed to parse embedded dashboard FS: %s", err)
			return
		}

		err = tmpl.ExecuteTemplate(w, "index.html", nil)
		if err != nil {
			logger.Error("Failed to execute dashboard template: %s", err)
		}
	})

	mux.HandleFunc("/stop", func(w http.ResponseWriter, req *http.Request) {
		var stop PoolStop

		requestBody, err := io.ReadAll(req.Body)
		if err != nil {
			http.Error(w, "Failed to read request body", http.StatusInternalServerError)
			logger.Error("Failed to read stop|resume signal from dashboard request: %s", err)
			return
		}
		defer req.Body.Close()

		err = json.Unmarshal(requestBody, &stop)
		if err != nil {
			http.Error(w, "Failed to unmarshal stop|resume signal", http.StatusInternalServerError)
			logger.Error("Failed to unmarshal stop|resume signal from dashboard UI: %s", err)
			return
		}

		if stop.Stop {
			// stop worker pool
			pool.Stop()
			logger.Info("Stopped worker pool via request from dashboard")
		} else {
			// resume work
			pool.Work()
			logger.Info("Resumed work via request from dashboard")
		}
	})

	mux.HandleFunc("/stats", func(w http.ResponseWriter, req *http.Request) {
		jsonStats, err := json.MarshalIndent(pool.Stats, "", " ")
		if err != nil {
			http.Error(w, "Failed to marshal statistics", http.StatusInternalServerError)
			logger.Error("Failed to marshal stats to send to the dashboard: %s", err)
			return
		}
		w.Header().Add("Content-Type", "application/json")
		w.Write(jsonStats)
	})

	mux.HandleFunc("/conf", func(w http.ResponseWriter, req *http.Request) {
		switch req.Method {
		case http.MethodPost:
			var newConfig config.Conf

			defer req.Body.Close()
			newConfigData, err := io.ReadAll(req.Body)
			if err != nil {
				http.Error(w, "Failed to read request body", http.StatusInternalServerError)
				logger.Error("Failed to read new configuration from dashboard request: %s", err)
				return
			}
			err = json.Unmarshal(newConfigData, &newConfig)
			if err != nil {
				http.Error(w, "Failed to unmarshal new configuration", http.StatusInternalServerError)
				logger.Error("Failed to unmarshal new configuration from dashboard UI: %s", err)
				return
			}

			// DO NOT blindly replace the global configuration. Manually check and replace values
			webConf.Search.IsRegexp = newConfig.Search.IsRegexp
			if len(newConfig.Search.Query) != 0 {
				webConf.Search.Query = newConfig.Search.Query
			}

			webConf.Logging.OutputLogs = newConfig.Logging.OutputLogs

		default:
			jsonConf, err := json.MarshalIndent(webConf, "", " ")
			if err != nil {
				http.Error(w, "Failed to marshal configuration", http.StatusInternalServerError)
				logger.Error("Failed to marshal current configuration to send to the dashboard UI: %s", err)
				return
			}
			w.Header().Add("Content-Type", "application/json")
			w.Write(jsonConf)
		}
	})

	server := &http.Server{
		Addr:    fmt.Sprintf(":%d", port),
		Handler: mux,
	}

	return &Dashboard{
		Server: server,
	}
}

func (board *Dashboard) Launch() error {
	return board.Server.ListenAndServe()
}
@@ -0,0 +1,223 @@
<!DOCTYPE html>
<html lang="en">

<head>
    <meta charset="utf-8">
    <title>Wecr dashboard</title>
    <!-- <link rel="icon" href="/static/icon.png"> -->
    <link rel="stylesheet" href="/static/bootstrap.css">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
</head>

<body class="d-flex flex-column h-100">
    <div class="container">
        <header class="d-flex flex-wrap justify-content-center py-3 mb-4 border-bottom">
            <a href="/" class="d-flex align-items-center mb-3 mb-md-0 me-md-auto text-dark text-decoration-none">
                <svg class="bi me-2" width="40" height="32">
                    <use xlink:href="#bootstrap"></use>
                </svg>
                <strong class="fs-4">Wecr</strong>
            </a>

            <ul class="nav nav-pills">
                <li class="nav-item"><a href="/stats" class="nav-link">Stats</a></li>
                <li class="nav-item"><a href="/conf" class="nav-link">Config</a></li>
            </ul>
        </header>
    </div>

    <div class="container">
        <h1>Dashboard</h1>

        <div style="height: 3rem;"></div>

        <div class="container">
            <h2>Statistics</h2>
            <div id="statistics">
                <ol class="list-group list-group-numbered">
                    <li class="list-group-item d-flex justify-content-between align-items-start">
                        <div class="ms-2 me-auto">
                            <div class="fw-bold">Pages visited</div>
                        </div>
                        <span class="badge bg-primary rounded-pill" id="pages_visited">0</span>
                    </li>
                    <li class="list-group-item d-flex justify-content-between align-items-start">
                        <div class="ms-2 me-auto">
                            <div class="fw-bold">Matches found</div>
                        </div>
                        <span class="badge bg-primary rounded-pill" id="matches_found">0</span>
                    </li>
                    <li class="list-group-item d-flex justify-content-between align-items-start">
                        <div class="ms-2 me-auto">
                            <div class="fw-bold">Pages saved</div>
                        </div>
                        <span class="badge bg-primary rounded-pill" id="pages_saved">0</span>
                    </li>
                    <li class="list-group-item d-flex justify-content-between align-items-start">
                        <div class="ms-2 me-auto">
                            <div class="fw-bold">Start time</div>
                        </div>
                        <span class="badge bg-primary rounded-pill" id="start_time_unix">0</span>
                    </li>
                    <li class="list-group-item d-flex justify-content-between align-items-start">
                        <div class="ms-2 me-auto">
                            <div class="fw-bold">Stopped</div>
                        </div>
                        <span class="badge bg-primary rounded-pill" id="stopped">false</span>
                    </li>
                </ol>
            </div>

            <button class="btn btn-primary" id="btn_stop">Stop</button>
            <button class="btn btn-primary" id="btn_resume" disabled>Resume</button>
        </div>

        <div style="height: 3rem;"></div>

        <div class="container">
            <h2>Configuration</h2>
            <div>
                <b>Make runtime changes to configuration</b>
                <table class="table table-borderless">
                    <tr>
                        <th>Key</th>
                        <th>Value</th>
                    </tr>
                    <tr>
                        <th>Query</th>
                        <th>
                            <input type="text" id="conf_query">
                        </th>
                    </tr>
                    <tr>
                        <th>Is regexp</th>
                        <th>
                            <input type="text" id="conf_is_regexp">
                        </th>
                    </tr>
                </table>
                <button class="btn btn-primary" id="config_apply_button">
                    Apply
                </button>
            </div>

            <div style="height: 3rem;"></div>

            <pre id="conf_output"></pre>
        </div>
    </div>
</body>

<script>
    window.onload = function () {
        let confOutput = document.getElementById("conf_output");
        let pagesVisitedOut = document.getElementById("pages_visited");
        let matchesFoundOut = document.getElementById("matches_found");
        let pagesSavedOut = document.getElementById("pages_saved");
        let startTimeOut = document.getElementById("start_time_unix");
        let stoppedOut = document.getElementById("stopped");
        let applyConfButton = document.getElementById("config_apply_button");
        let confQuery = document.getElementById("conf_query");
        let confIsRegexp = document.getElementById("conf_is_regexp");
        let buttonStop = document.getElementById("btn_stop");
        let buttonResume = document.getElementById("btn_resume");

        buttonStop.addEventListener("click", (event) => {
            buttonStop.disabled = true;
            buttonResume.disabled = false;

            // stop worker pool
            let signal = {
                "stop": true,
            };

            fetch("/stop", {
                method: "POST",
                headers: {
                    "Content-type": "application/json",
                },
                body: JSON.stringify(signal),
            });
        });

        buttonResume.addEventListener("click", (event) => {
            buttonResume.disabled = true;
            buttonStop.disabled = false;

            // resume worker pool's work
            let signal = {
                "stop": false,
            };

            fetch("/stop", {
                method: "POST",
                headers: {
                    "Content-type": "application/json",
                },
                body: JSON.stringify(signal),
            });
        });

        applyConfButton.addEventListener("click", (event) => {
            let query = String(confQuery.value);

            // accept "1"/"true" as true and anything else as false;
            // declared with let instead of leaking an implicit global
            let isRegexp = confIsRegexp.value === "1" || confIsRegexp.value === "true";

            let newConf = {
                "search": {
                    "is_regexp": isRegexp,
                    "query": query,
                },
            };

            fetch("/conf", {
                method: "POST",
                headers: {
                    "Content-type": "application/json",
                },
                body: JSON.stringify(newConf),
            });
        });

        const interval = setInterval(function () {
            // update statistics
            fetch("/stats")
                .then((response) => response.json())
                .then((statistics) => {
                    pagesVisitedOut.innerText = statistics.pages_visited;
                    matchesFoundOut.innerText = statistics.matches_found;
                    pagesSavedOut.innerText = statistics.pages_saved;
                    startTimeOut.innerText = new Date(1000 * statistics.start_time_unix);
                    stoppedOut.innerText = statistics.stopped;
                });
            // update config
            fetch("/conf")
                .then((response) => response.text())
                .then((config) => {
                    // "print" the whole configuration
                    confOutput.innerText = config;

                    // update values in the change table if they're empty
                    let confJSON = JSON.parse(config);
                    if (confQuery.value == "") {
                        confQuery.value = confJSON.search.query;
                    }
                    if (confIsRegexp.value == "") {
                        confIsRegexp.value = confJSON.search.is_regexp;
                    }
                });
        }, 650);
    };
</script>

</html>
File diff suppressed because it is too large
File diff suppressed because one or more lines are too long
@@ -0,0 +1,54 @@
package queue

import (
	"encoding/json"
	"io"
	"os"

	"unbewohnte/wecr/web"
)

func PopLastJob(queue *os.File) (*web.Job, error) {
	stats, err := queue.Stat()
	if err != nil {
		return nil, err
	}

	if stats.Size() == 0 {
		return nil, nil
	}

	// find the last job in the queue by seeking backwards, byte by byte,
	// until a complete JSON-encoded job decodes successfully
	var offset int64 = -1
	for {
		currentOffset, err := queue.Seek(offset, io.SeekEnd)
		if err != nil {
			return nil, err
		}

		// decode into a fresh job on each attempt so leftover fields
		// from a partially successful decode are not reused
		var job web.Job
		decoder := json.NewDecoder(queue)
		err = decoder.Decode(&job)
		if err != nil || job.URL == "" || job.Search.Query == "" {
			offset -= 1
			continue
		}

		err = queue.Truncate(currentOffset)
		if err != nil {
			return nil, err
		}
		return &job, nil
	}
}

func InsertNewJob(queue *os.File, newJob web.Job) error {
	_, err := queue.Seek(0, io.SeekEnd)
	if err != nil {
		return err
	}

	encoder := json.NewEncoder(queue)
	err = encoder.Encode(&newJob)
	if err != nil {
		return err
	}

	return nil
}
@@ -0,0 +1,45 @@
/*
	Wecr - crawl the web for data
	Copyright (C) 2023 Kasyanov Nikolay Alexeyevich (Unbewohnte)

	This program is free software: you can redistribute it and/or modify
	it under the terms of the GNU Affero General Public License as published by
	the Free Software Foundation, either version 3 of the License, or
	(at your option) any later version.

	This program is distributed in the hope that it will be useful,
	but WITHOUT ANY WARRANTY; without even the implied warranty of
	MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
	GNU Affero General Public License for more details.

	You should have received a copy of the GNU Affero General Public License
	along with this program. If not, see <https://www.gnu.org/licenses/>.
*/

package web

import (
	"net/url"
)

// Tries to find docs' URLs on the page
func FindPageDocuments(pageBody []byte, from url.URL) []url.URL {
	var urls []url.URL

	// for every element that has "src" attribute
	for _, link := range FindPageSrcLinks(pageBody, from) {
		if HasDocumentExtention(link.EscapedPath()) {
			urls = append(urls, link)
		}
	}

	// for every "a" element as well
	for _, link := range FindPageLinks(pageBody, from) {
		if HasDocumentExtention(link.EscapedPath()) {
			urls = append(urls, link)
		}
	}

	// return discovered doc urls
	return urls
}