Compare commits

...

4 Commits

Files changed:

  1. README.md (17 changes)
  2. src/config/config.go (1 change)
  3. src/dashboard/dashboard.go (56 changes)
  4. src/dashboard/res/index.html (43 changes)
  5. src/main.go (130 changes)
  6. src/web/audio.go (90 changes)
  7. src/web/documents.go (107 changes)
  8. src/web/extentions.go (38 changes)
  9. src/web/images.go (90 changes)
  10. src/web/text.go (86 changes)
  11. src/web/videos.go (90 changes)
  12. src/worker/pool.go (4 changes)
  13. src/worker/worker.go (196 changes)

README.md (17 changes)

@@ -4,9 +4,9 @@
A simple HTML web spider with no dependencies. It is possible to search for pages with a text on them or for the text itself, extract images, video, audio and save pages that satisfy the criteria along the way.
## Configuration
## Configuration Overview
The flow of work fully depends on the configuration file. By default `conf.json` is used as a configuration file, but the name can be changed via `-conf` flag. The default configuration is embedded in the program so on the first launch or by simply deleting the file, a new `conf.json` will be created in the same directory as the executable itself unless the `-wdir` (working directory) flag is set to some other value. To see al available flags run `wecr -h`.
The flow of work fully depends on the configuration file. By default `conf.json` is used as the configuration file, but the name can be changed via the `-conf` flag. The default configuration is embedded in the program, so on the first launch, or after simply deleting the file, a new `conf.json` will be created in the working directory unless the `-wdir` (working directory) flag is set to some other value, in which case that directory takes precedence. To see all available flags run `wecr -h`.
The configuration is split into different branches like `requests` (how requests are made, i.e. request timeout, wait time, user agent), `logging` (use logs, output to a file), `save` (output file|directory, save pages or not) or `search` (use regexp, query string), each of which contains tweakable parameters. There are global ones as well, such as `workers` (working threads that make requests in parallel) and `depth` (literally, how deep the recursive search should go). The names are simple and self-explanatory, so no attribute-by-attribute explanation is needed for most of them.
@@ -18,7 +18,7 @@ You can change search `query` at **runtime** via web dashboard if `launch_dashbo
### Search query
There are some special `query` values:
There are some special `query` values to control the flow of work:
- `email` - tells wecr to scrape email addresses and output to `output_file`
- `images` - find all images on pages and output to the corresponding directory in `output_dir` (**IMPORTANT**: set `content_fetch_timeout_ms` to `0` so the images (and other content below) load fully)
@@ -26,12 +26,13 @@ There are some special `query` values:
- `audio` - find and fetch files that look like audio
- `documents` - find and fetch files that look like a document
- `everything` - find and fetch images, audio, video, documents and email addresses
- `archive` - no text to be searched, save every visited page
When `is_regexp` is enabled, the `query` is treated as a regexp string and pages will be scanned for matches that satisfy it.
When `is_regexp` is enabled, the `query` is treated as a regexp string (in Go "flavor") and pages will be scanned for matches that satisfy it.
### Output
### Data Output
By default, if the query is not something of special values all the matches and other data will be outputted to `output.json` file as separate continuous JSON objects, but if `save_pages` is set to `true` and|or `query` is set to `images`, `videos`, `audio`, etc. - the additional contents will be put in the corresponding directories inside `output_dir`, which is neatly created by the executable's side.
If the query is not one of the special values, all text matches will be outputted to the `found_text.json` file in `output_dir` as separate continuous JSON objects; if `save_pages` is set to `true` and|or `query` is set to `images`, `videos`, `audio`, etc. - the additional contents will also be put in the corresponding directories inside `output_dir`, which is neatly created in the working directory (or in the directory set via the `-wdir` flag). If `output_dir` happens to be empty - contents will be outputted directly to the working directory.
The output almost certainly contains some duplicates and is not easy to work with programmatically, so you can use `-extractData` with the output JSON file argument (like `found_text.json`, which is the default output file name for simple text searches) to extract the actual data, filter out the duplicates and put each entry on its own line in a new text file.
@@ -43,7 +44,7 @@ Otherwise - `go build` in the `src` directory to build `wecr`. No dependencies.
## Examples
See [page on my website](https://unbewohnte.su/wecr) for some basic examples.
See [a page on my website](https://unbewohnte.su/wecr) for some basic examples.
Dump of a basic configuration:
@@ -87,4 +88,4 @@ Dump of a basic configuration:
```
## License
AGPLv3
wecr is distributed under the AGPLv3 license

src/config/config.go (1 change)

@@ -31,6 +31,7 @@ const (
QueryEmail string = "email"
QueryDocuments string = "documents"
QueryEverything string = "everything"
QueryArchive string = "archive"
)
const (

src/dashboard/dashboard.go (56 changes)

@@ -1,3 +1,21 @@
/*
Wecr - crawl the web for data
Copyright (C) 2023 Kasyanov Nikolay Alexeyevich (Unbewohnte)
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU Affero General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU Affero General Public License for more details.
You should have received a copy of the GNU Affero General Public License
along with this program. If not, see <https://www.gnu.org/licenses/>.
*/
package dashboard
import (
@@ -25,7 +43,11 @@ type PageData struct {
Stats worker.Statistics
}
func NewDashboard(port uint16, webConf *config.Conf, statistics *worker.Statistics) *Dashboard {
type PoolStop struct {
Stop bool `json:"stop"`
}
func NewDashboard(port uint16, webConf *config.Conf, pool *worker.Pool) *Dashboard {
mux := http.NewServeMux()
res, err := fs.Sub(resFS, "res")
if err != nil {
@@ -34,6 +56,7 @@ func NewDashboard(port uint16, webConf *config.Conf, statistics *worker.Statisti
}
mux.Handle("/static/", http.FileServer(http.FS(res)))
mux.HandleFunc("/", func(w http.ResponseWriter, req *http.Request) {
template, err := template.ParseFS(res, "*.html")
if err != nil {
@@ -44,8 +67,37 @@ func NewDashboard(port uint16, webConf *config.Conf, statistics *worker.Statisti
template.ExecuteTemplate(w, "index.html", nil)
})
mux.HandleFunc("/stop", func(w http.ResponseWriter, req *http.Request) {
var stop PoolStop
requestBody, err := io.ReadAll(req.Body)
if err != nil {
http.Error(w, "Failed to read request body", http.StatusInternalServerError)
logger.Error("Failed to read stop|resume signal from dashboard request: %s", err)
return
}
defer req.Body.Close()
err = json.Unmarshal(requestBody, &stop)
if err != nil {
http.Error(w, "Failed to unmarshal stop|resume signal", http.StatusInternalServerError)
logger.Error("Failed to unmarshal stop|resume signal from dashboard UI: %s", err)
return
}
if stop.Stop {
// stop worker pool
pool.Stop()
logger.Info("Stopped worker pool via request from dashboard")
} else {
// resume work
pool.Work()
logger.Info("Resumed work via request from dashboard")
}
})
mux.HandleFunc("/stats", func(w http.ResponseWriter, req *http.Request) {
jsonStats, err := json.MarshalIndent(statistics, "", " ")
jsonStats, err := json.MarshalIndent(pool.Stats, "", " ")
if err != nil {
http.Error(w, "Failed to marshal statistics", http.StatusInternalServerError)
logger.Error("Failed to marshal stats to send to the dashboard: %s", err)

src/dashboard/res/index.html (43 changes)

@@ -68,6 +68,9 @@
</li>
</ol>
</div>
<button class="btn btn-primary" id="btn_stop">Stop</button>
<button class="btn btn-primary" id="btn_resume" disabled>Resume</button>
</div>
<div style="height: 3rem;"></div>
@@ -117,6 +120,44 @@
let applyConfButton = document.getElementById("config_apply_button");
let confQuery = document.getElementById("conf_query");
let confIsRegexp = document.getElementById("conf_is_regexp");
let buttonStop = document.getElementById("btn_stop");
let buttonResume = document.getElementById("btn_resume");
buttonStop.addEventListener("click", (event) => {
buttonStop.disabled = true;
buttonResume.disabled = false;
// stop worker pool
let signal = {
"stop": true,
};
fetch("/stop", {
method: "POST",
headers: {
"Content-type": "application/json",
},
body: JSON.stringify(signal),
});
});
buttonResume.addEventListener("click", (event) => {
buttonResume.disabled = true;
buttonStop.disabled = false;
// resume worker pool's work
let signal = {
"stop": false,
};
fetch("/stop", {
method: "POST",
headers: {
"Content-type": "application/json",
},
body: JSON.stringify(signal),
});
});
applyConfButton.addEventListener("click", (event) => {
let query = String(confQuery.value);
@@ -139,8 +180,6 @@
},
};
console.log(newConf);
fetch("/conf", {
method: "POST",
headers: {

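The stop/resume buttons above post a small JSON signal to the dashboard's `/stop` endpoint; the same signal can be sent from any client. A minimal Go sketch (the port `13370` is a placeholder assumption - use the dashboard port from your configuration):

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// PoolStop mirrors the struct the dashboard unmarshals on /stop.
type PoolStop struct {
	Stop bool `json:"stop"`
}

// stopSignal builds the JSON body for a stop (true) or resume (false) request.
func stopSignal(stop bool) []byte {
	body, _ := json.Marshal(PoolStop{Stop: stop})
	return body
}

func main() {
	// Port 13370 is a placeholder; set it to your configured dashboard port.
	resp, err := http.Post("http://localhost:13370/stop", "application/json",
		bytes.NewReader(stopSignal(true)))
	if err != nil {
		fmt.Println("dashboard not reachable:", err)
		return
	}
	resp.Body.Close()
}
```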
src/main.go (130 changes)

@@ -19,7 +19,6 @@
package main
import (
"encoding/json"
"flag"
"fmt"
"io"
@@ -40,7 +39,7 @@ import (
"unbewohnte/wecr/worker"
)
const version = "v0.3.2"
const version = "v0.3.5"
const (
configFilename string = "conf.json"
@@ -68,7 +67,7 @@ var (
extractDataFilename = flag.String(
"extractData", "",
"Set filename for output JSON file and extract data from it, put each entry nicely on a new line in a new file, then exit",
"Specify previously outputted JSON file and extract data from it, put each entry nicely on a new line in a new file, exit afterwards",
)
workingDirectory string
@@ -108,12 +107,12 @@ func init() {
if *wDir != "" {
workingDirectory = *wDir
} else {
exePath, err := os.Executable()
wdir, err := os.Getwd()
if err != nil {
logger.Error("Failed to determine executable's path: %s", err)
logger.Error("Failed to determine working directory path: %s", err)
return
}
workingDirectory = filepath.Dir(exePath)
workingDirectory = wdir
}
logger.Info("Working in \"%s\"", workingDirectory)
@@ -157,17 +156,6 @@ func main() {
}
logger.Info("Successfully opened configuration file")
// Prepare global statistics variable
statistics := worker.Statistics{}
// open dashboard if needed
var board *dashboard.Dashboard = nil
if conf.Dashboard.UseDashboard {
board = dashboard.NewDashboard(conf.Dashboard.Port, conf, &statistics)
go board.Launch()
logger.Info("Launched dashboard at http://localhost:%d", conf.Dashboard.Port)
}
// sanitize and correct inputs
if len(conf.InitialPages) == 0 {
logger.Error("No initial page URLs have been set")
@@ -306,6 +294,8 @@ func main() {
logger.Info("Looking for audio (%+s)", web.AudioExtentions)
case config.QueryDocuments:
logger.Info("Looking for documents (%+s)", web.DocumentExtentions)
case config.QueryArchive:
logger.Info("Archiving every visited page")
case config.QueryEverything:
logger.Info("Looking for email addresses, images, videos, audio and various documents (%+s - %+s - %+s - %+s)",
web.ImageExtentions,
@@ -321,33 +311,6 @@ func main() {
}
}
// create logs if needed
if conf.Logging.OutputLogs {
if conf.Logging.LogsFile != "" {
// output logs to a file
logFile, err := os.Create(filepath.Join(workingDirectory, conf.Logging.LogsFile))
if err != nil {
logger.Error("Failed to create logs file: %s", err)
return
}
defer logFile.Close()
logger.Info("Outputting logs to %s", conf.Logging.LogsFile)
logger.SetOutput(logFile)
} else {
// output logs to stdout
logger.Info("Outputting logs to stdout")
logger.SetOutput(os.Stdout)
}
} else {
// no logging needed
logger.Info("No further logs will be outputted")
logger.SetOutput(nil)
}
jobs := make(chan web.Job, conf.Workers*5)
results := make(chan web.Result, conf.Workers*5)
// create visit queue file if not turned off
var visitQueueFile *os.File = nil
if !conf.InMemoryVisitQueue {
@@ -364,6 +327,7 @@ func main() {
}
// create initial jobs
initialJobs := make(chan web.Job, conf.Workers*5)
if !conf.InMemoryVisitQueue {
for _, initialPage := range conf.InitialPages {
var newJob web.Job = web.Job{
@@ -380,7 +344,7 @@ func main() {
visitQueueFile.Seek(0, io.SeekStart)
} else {
for _, initialPage := range conf.InitialPages {
jobs <- web.Job{
initialJobs <- web.Job{
URL: initialPage,
Search: conf.Search,
Depth: conf.Depth,
@@ -388,8 +352,11 @@ func main() {
}
}
// Prepare global statistics variable
statistics := worker.Statistics{}
// form a worker pool
workerPool := worker.NewWorkerPool(jobs, results, conf.Workers, &worker.WorkerConf{
workerPool := worker.NewWorkerPool(initialJobs, conf.Workers, &worker.WorkerConf{
Search: &conf.Search,
Requests: &conf.Requests,
Save: &conf.Save,
@@ -399,22 +366,42 @@ func main() {
VisitQueue: visitQueueFile,
Lock: &sync.Mutex{},
},
EmailsOutput: emailsOutputFile,
TextOutput: textOutputFile,
}, &statistics)
logger.Info("Created a worker pool with %d workers", conf.Workers)
// set up graceful shutdown
sig := make(chan os.Signal, 1)
signal.Notify(sig, os.Interrupt)
go func() {
<-sig
logger.Info("Received interrupt signal. Exiting...")
// open dashboard if needed
var board *dashboard.Dashboard = nil
if conf.Dashboard.UseDashboard {
board = dashboard.NewDashboard(conf.Dashboard.Port, conf, workerPool)
go board.Launch()
logger.Info("Launched dashboard at http://localhost:%d", conf.Dashboard.Port)
}
// stop workers
workerPool.Stop()
// create and redirect logs if needed
if conf.Logging.OutputLogs {
if conf.Logging.LogsFile != "" {
// output logs to a file
logFile, err := os.Create(filepath.Join(workingDirectory, conf.Logging.LogsFile))
if err != nil {
logger.Error("Failed to create logs file: %s", err)
return
}
defer logFile.Close()
// close results channel
close(results)
}()
logger.Info("Outputting logs to %s", conf.Logging.LogsFile)
logger.SetOutput(logFile)
} else {
// output logs to stdout
logger.Info("Outputting logs to stdout")
logger.SetOutput(os.Stdout)
}
} else {
// no logging needed
logger.Info("No further logs will be outputted")
logger.SetOutput(nil)
}
// launch concurrent scraping !
workerPool.Work()
@@ -441,27 +428,12 @@ func main() {
}()
}
// get text text results and write it to the output file (found files are handled by each worker separately)
var outputFile *os.File
for {
result, ok := <-results
if !ok {
break
}
// as it is possible to change configuration "on the fly" - it's better to not mess up different outputs
if result.Search.Query == config.QueryEmail {
outputFile = emailsOutputFile
} else {
outputFile = textOutputFile
}
// set up graceful shutdown
sig := make(chan os.Signal, 1)
signal.Notify(sig, os.Interrupt)
<-sig
logger.Info("Received interrupt signal. Exiting...")
// each entry in output file is a self-standing JSON object
entryBytes, err := json.MarshalIndent(result, " ", "\t")
if err != nil {
continue
}
outputFile.Write(entryBytes)
outputFile.Write([]byte("\n"))
}
// stop workers
workerPool.Stop()
}

src/web/audio.go (90 changes)

@@ -20,99 +20,25 @@ package web
import (
"net/url"
"strings"
)
func HasAudioExtention(url string) bool {
for _, extention := range AudioExtentions {
if strings.HasSuffix(url, extention) {
return true
}
}
return false
}
// Tries to find audio URLs on the page
func FindPageAudio(pageBody []byte, from *url.URL) []string {
var urls []string
func FindPageAudio(pageBody []byte, from url.URL) []url.URL {
var urls []url.URL
// for every element that has "src" attribute
for _, match := range tagSrcRegexp.FindAllString(string(pageBody), -1) {
var linkStartIndex int
var linkEndIndex int
linkStartIndex = strings.Index(match, "\"")
if linkStartIndex == -1 {
linkStartIndex = strings.Index(match, "'")
if linkStartIndex == -1 {
continue
}
linkEndIndex = strings.LastIndex(match, "'")
if linkEndIndex == -1 {
continue
}
} else {
linkEndIndex = strings.LastIndex(match, "\"")
if linkEndIndex == -1 {
continue
}
}
if linkEndIndex <= linkStartIndex+1 {
continue
}
link, err := url.Parse(match[linkStartIndex+1 : linkEndIndex])
if err != nil {
continue
}
linkResolved := ResolveLink(link, from.Host)
if HasAudioExtention(linkResolved) {
urls = append(urls, linkResolved)
for _, link := range FindPageSrcLinks(pageBody, from) {
if HasAudioExtention(link.EscapedPath()) {
urls = append(urls, link)
}
}
// for every "a" element as well
for _, match := range tagHrefRegexp.FindAllString(string(pageBody), -1) {
var linkStartIndex int
var linkEndIndex int
linkStartIndex = strings.Index(match, "\"")
if linkStartIndex == -1 {
linkStartIndex = strings.Index(match, "'")
if linkStartIndex == -1 {
continue
}
linkEndIndex = strings.LastIndex(match, "'")
if linkEndIndex == -1 {
continue
}
} else {
linkEndIndex = strings.LastIndex(match, "\"")
if linkEndIndex == -1 {
continue
}
}
if linkEndIndex <= linkStartIndex+1 {
continue
}
link, err := url.Parse(match[linkStartIndex+1 : linkEndIndex])
if err != nil {
continue
}
linkResolved := ResolveLink(link, from.Host)
if HasAudioExtention(linkResolved) {
urls = append(urls, linkResolved)
for _, link := range FindPageLinks(pageBody, from) {
if HasAudioExtention(link.EscapedPath()) {
urls = append(urls, link)
}
}
// return discovered mutual video urls
return urls
}

src/web/documents.go (107 changes)

@@ -1,97 +1,42 @@
/*
Wecr - crawl the web for data
Copyright (C) 2023 Kasyanov Nikolay Alexeyevich (Unbewohnte)
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU Affero General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU Affero General Public License for more details.
You should have received a copy of the GNU Affero General Public License
along with this program. If not, see <https://www.gnu.org/licenses/>.
*/
package web
import (
"net/url"
"strings"
)
func HasDocumentExtention(url string) bool {
for _, extention := range DocumentExtentions {
if strings.HasSuffix(url, extention) {
return true
}
}
return false
}
// Tries to find docs' URLs on the page
func FindPageDocuments(pageBody []byte, from *url.URL) []string {
var urls []string
func FindPageDocuments(pageBody []byte, from url.URL) []url.URL {
var urls []url.URL
// for every element that has "src" attribute
for _, match := range tagSrcRegexp.FindAllString(string(pageBody), -1) {
var linkStartIndex int
var linkEndIndex int
linkStartIndex = strings.Index(match, "\"")
if linkStartIndex == -1 {
linkStartIndex = strings.Index(match, "'")
if linkStartIndex == -1 {
continue
}
linkEndIndex = strings.LastIndex(match, "'")
if linkEndIndex == -1 {
continue
}
} else {
linkEndIndex = strings.LastIndex(match, "\"")
if linkEndIndex == -1 {
continue
}
}
if linkEndIndex <= linkStartIndex+1 {
continue
}
link, err := url.Parse(match[linkStartIndex+1 : linkEndIndex])
if err != nil {
continue
}
linkResolved := ResolveLink(link, from.Host)
if HasDocumentExtention(linkResolved) {
urls = append(urls, linkResolved)
for _, link := range FindPageSrcLinks(pageBody, from) {
if HasDocumentExtention(link.EscapedPath()) {
urls = append(urls, link)
}
}
// for every "a" element as well
for _, match := range tagHrefRegexp.FindAllString(string(pageBody), -1) {
var linkStartIndex int
var linkEndIndex int
linkStartIndex = strings.Index(match, "\"")
if linkStartIndex == -1 {
linkStartIndex = strings.Index(match, "'")
if linkStartIndex == -1 {
continue
}
linkEndIndex = strings.LastIndex(match, "'")
if linkEndIndex == -1 {
continue
}
} else {
linkEndIndex = strings.LastIndex(match, "\"")
if linkEndIndex == -1 {
continue
}
}
if linkEndIndex <= linkStartIndex+1 {
continue
}
link, err := url.Parse(match[linkStartIndex+1 : linkEndIndex])
if err != nil {
continue
}
linkResolved := ResolveLink(link, from.Host)
if HasDocumentExtention(linkResolved) {
urls = append(urls, linkResolved)
for _, link := range FindPageLinks(pageBody, from) {
if HasDocumentExtention(link.EscapedPath()) {
urls = append(urls, link)
}
}

src/web/extentions.go (38 changes)

@@ -18,6 +18,8 @@
package web
import "strings"
var AudioExtentions = []string{
".3gp",
".aa",
@@ -134,3 +136,39 @@ var DocumentExtentions = []string{
".otf",
".exif",
}
func HasImageExtention(urlPath string) bool {
for _, extention := range ImageExtentions {
if strings.HasSuffix(urlPath, extention) {
return true
}
}
return false
}
func HasDocumentExtention(urlPath string) bool {
for _, extention := range DocumentExtentions {
if strings.HasSuffix(urlPath, extention) {
return true
}
}
return false
}
func HasVideoExtention(urlPath string) bool {
for _, extention := range VideoExtentions {
if strings.HasSuffix(urlPath, extention) {
return true
}
}
return false
}
func HasAudioExtention(urlPath string) bool {
for _, extention := range AudioExtentions {
if strings.HasSuffix(urlPath, extention) {
return true
}
}
return false
}
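The four new `Has*Extention` helpers above all follow one pattern: a suffix check of the URL path against an extension list. A standalone sketch of that pattern (the `extention` spelling matches the codebase's identifiers):

```go
package main

import (
	"fmt"
	"strings"
)

// hasExtention reports whether a URL path ends in one of the given
// extensions, the same suffix check the Has*Extention helpers perform.
func hasExtention(urlPath string, extentions []string) bool {
	for _, ext := range extentions {
		if strings.HasSuffix(urlPath, ext) {
			return true
		}
	}
	return false
}

func main() {
	imageExtentions := []string{".png", ".jpg", ".gif"}
	fmt.Println(hasExtention("/pics/cat.png", imageExtentions))
	fmt.Println(hasExtention("/index.html", imageExtentions))
}
```

Note that checking the path (rather than the full URL) is what lets the callers pass `link.EscapedPath()` and ignore query strings.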

src/web/images.go (90 changes)

@@ -20,99 +20,25 @@ package web
import (
"net/url"
"strings"
)
func HasImageExtention(url string) bool {
for _, extention := range ImageExtentions {
if strings.HasSuffix(url, extention) {
return true
}
}
return false
}
// Tries to find images' URLs on the page
func FindPageImages(pageBody []byte, from *url.URL) []string {
var urls []string
func FindPageImages(pageBody []byte, from url.URL) []url.URL {
var urls []url.URL
// for every element that has "src" attribute
for _, match := range tagSrcRegexp.FindAllString(string(pageBody), -1) {
var linkStartIndex int
var linkEndIndex int
linkStartIndex = strings.Index(match, "\"")
if linkStartIndex == -1 {
linkStartIndex = strings.Index(match, "'")
if linkStartIndex == -1 {
continue
}
linkEndIndex = strings.LastIndex(match, "'")
if linkEndIndex == -1 {
continue
}
} else {
linkEndIndex = strings.LastIndex(match, "\"")
if linkEndIndex == -1 {
continue
}
}
if linkEndIndex <= linkStartIndex+1 {
continue
}
link, err := url.Parse(match[linkStartIndex+1 : linkEndIndex])
if err != nil {
continue
}
linkResolved := ResolveLink(link, from.Host)
if HasImageExtention(linkResolved) {
urls = append(urls, linkResolved)
for _, link := range FindPageSrcLinks(pageBody, from) {
if HasImageExtention(link.EscapedPath()) {
urls = append(urls, link)
}
}
// for every "a" element as well
for _, match := range tagHrefRegexp.FindAllString(string(pageBody), -1) {
var linkStartIndex int
var linkEndIndex int
linkStartIndex = strings.Index(match, "\"")
if linkStartIndex == -1 {
linkStartIndex = strings.Index(match, "'")
if linkStartIndex == -1 {
continue
}
linkEndIndex = strings.LastIndex(match, "'")
if linkEndIndex == -1 {
continue
}
} else {
linkEndIndex = strings.LastIndex(match, "\"")
if linkEndIndex == -1 {
continue
}
}
if linkEndIndex <= linkStartIndex+1 {
continue
}
link, err := url.Parse(match[linkStartIndex+1 : linkEndIndex])
if err != nil {
continue
}
linkResolved := ResolveLink(link, from.Host)
if HasImageExtention(linkResolved) {
urls = append(urls, linkResolved)
for _, link := range FindPageLinks(pageBody, from) {
if HasImageExtention(link.EscapedPath()) {
urls = append(urls, link)
}
}
// return discovered mutual image urls from <img> and <a> tags
return urls
}

src/web/text.go (86 changes)

@@ -36,28 +36,28 @@ var tagSrcRegexp *regexp.Regexp = regexp.MustCompile(`(?i)(src)[\s]*=[\s]*("|')(
var emailRegexp *regexp.Regexp = regexp.MustCompile(`[A-Za-z0-9._%+\-!%&?~^#$]+@[A-Za-z0-9.\-]+\.[a-zA-Z]{2,4}`)
// var emailRegexp *regexp.Regexp = regexp.MustCompile("[a-zA-Z0-9.!#$%&'*+\\/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*")
// Fix relative link and construct an absolute one. Does nothing if the URL already looks alright
func ResolveLink(url *url.URL, fromHost string) string {
if !url.IsAbs() {
if url.Scheme == "" {
func ResolveLink(link url.URL, fromHost string) url.URL {
var resolvedURL url.URL = link
if !resolvedURL.IsAbs() {
if resolvedURL.Scheme == "" {
// add scheme
url.Scheme = "http"
resolvedURL.Scheme = "https"
}
if url.Host == "" {
if resolvedURL.Host == "" {
// add host
url.Host = fromHost
resolvedURL.Host = fromHost
}
}
return url.String()
return resolvedURL
}
// Find all links on page that are specified in <a> tag
func FindPageLinks(pageBody []byte, from *url.URL) []string {
var urls []string
// Find all links on page that are specified in href attribute. Do not resolve links. Return URLs as they are on the page
func FindPageLinksDontResolve(pageBody []byte) []url.URL {
var urls []url.URL
for _, match := range tagHrefRegexp.FindAllString(string(pageBody), -1) {
var linkStartIndex int
@@ -90,9 +90,69 @@ func FindPageLinks(pageBody []byte, from *url.URL) []string {
continue
}
urls = append(urls, ResolveLink(link, from.Host))
urls = append(urls, *link)
}
return urls
}
// Find all links on page that are specified in href attribute
func FindPageLinks(pageBody []byte, from url.URL) []url.URL {
urls := FindPageLinksDontResolve(pageBody)
for index := 0; index < len(urls); index++ {
urls[index] = ResolveLink(urls[index], from.Host)
}
return urls
}
// Find all links on page that are specified in "src" attribute. Do not resolve URLs, return them as they are on the page
func FindPageSrcLinksDontResolve(pageBody []byte) []url.URL {
var urls []url.URL
for _, match := range tagSrcRegexp.FindAllString(string(pageBody), -1) {
var linkStartIndex int
var linkEndIndex int
linkStartIndex = strings.Index(match, "\"")
if linkStartIndex == -1 {
linkStartIndex = strings.Index(match, "'")
if linkStartIndex == -1 {
continue
}
linkEndIndex = strings.LastIndex(match, "'")
if linkEndIndex == -1 {
continue
}
} else {
linkEndIndex = strings.LastIndex(match, "\"")
if linkEndIndex == -1 {
continue
}
}
if linkEndIndex <= linkStartIndex+1 {
continue
}
link, err := url.Parse(match[linkStartIndex+1 : linkEndIndex])
if err != nil {
continue
}
urls = append(urls, *link)
}
return urls
}
// Find all links on page that are specified in "src" attribute
func FindPageSrcLinks(pageBody []byte, from url.URL) []url.URL {
urls := FindPageSrcLinksDontResolve(pageBody)
for index := 0; index < len(urls); index++ {
urls[index] = ResolveLink(urls[index], from.Host)
}
return urls
}

src/web/videos.go (90 changes)

@@ -20,99 +20,25 @@ package web
import (
"net/url"
"strings"
)
func HasVideoExtention(url string) bool {
for _, extention := range VideoExtentions {
if strings.HasSuffix(url, extention) {
return true
}
}
return false
}
// Tries to find videos' URLs on the page
func FindPageVideos(pageBody []byte, from *url.URL) []string {
var urls []string
func FindPageVideos(pageBody []byte, from url.URL) []url.URL {
var urls []url.URL
// for every element that has "src" attribute
for _, match := range tagSrcRegexp.FindAllString(string(pageBody), -1) {
var linkStartIndex int
var linkEndIndex int
linkStartIndex = strings.Index(match, "\"")
if linkStartIndex == -1 {
linkStartIndex = strings.Index(match, "'")
if linkStartIndex == -1 {
continue
}
linkEndIndex = strings.LastIndex(match, "'")
if linkEndIndex == -1 {
continue
}
} else {
linkEndIndex = strings.LastIndex(match, "\"")
if linkEndIndex == -1 {
continue
}
}
if linkEndIndex <= linkStartIndex+1 {
continue
}
link, err := url.Parse(match[linkStartIndex+1 : linkEndIndex])
if err != nil {
continue
}
linkResolved := ResolveLink(link, from.Host)
if HasVideoExtention(linkResolved) {
urls = append(urls, linkResolved)
for _, link := range FindPageSrcLinks(pageBody, from) {
if HasVideoExtention(link.EscapedPath()) {
urls = append(urls, link)
}
}
// for every "a" element as well
for _, match := range tagHrefRegexp.FindAllString(string(pageBody), -1) {
var linkStartIndex int
var linkEndIndex int
linkStartIndex = strings.Index(match, "\"")
if linkStartIndex == -1 {
linkStartIndex = strings.Index(match, "'")
if linkStartIndex == -1 {
continue
}
linkEndIndex = strings.LastIndex(match, "'")
if linkEndIndex == -1 {
continue
}
} else {
linkEndIndex = strings.LastIndex(match, "\"")
if linkEndIndex == -1 {
continue
}
}
if linkEndIndex <= linkStartIndex+1 {
continue
}
link, err := url.Parse(match[linkStartIndex+1 : linkEndIndex])
if err != nil {
continue
}
linkResolved := ResolveLink(link, from.Host)
if HasVideoExtention(linkResolved) {
urls = append(urls, linkResolved)
for _, link := range FindPageLinks(pageBody, from) {
if HasVideoExtention(link.EscapedPath()) {
urls = append(urls, link)
}
}
// return discovered mutual video urls
return urls
}

src/worker/pool.go (4 changes)

@@ -48,7 +48,7 @@ type Pool struct {
}
// Create a new worker pool
func NewWorkerPool(jobs chan web.Job, results chan web.Result, workerCount uint, workerConf *WorkerConf, stats *Statistics) *Pool {
func NewWorkerPool(initialJobs chan web.Job, workerCount uint, workerConf *WorkerConf, stats *Statistics) *Pool {
var newPool Pool = Pool{
workersCount: workerCount,
workers: nil,
@@ -61,7 +61,7 @@ func NewWorkerPool(jobs chan web.Job, results chan web.Result, workerCount uint,
var i uint
for i = 0; i < workerCount; i++ {
newWorker := NewWorker(jobs, results, workerConf, &newPool.visited, newPool.Stats)
newWorker := NewWorker(initialJobs, workerConf, &newPool.visited, newPool.Stats)
newPool.workers = append(newPool.workers, &newWorker)
}

src/worker/worker.go (196 changes)

@@ -19,12 +19,16 @@
package worker
import (
"bytes"
"encoding/json"
"fmt"
"io"
"net/url"
"os"
"path"
"path/filepath"
"regexp"
"strings"
"sync"
"time"
"unbewohnte/wecr/config"
@@ -46,12 +50,13 @@ type WorkerConf struct {
BlacklistedDomains []string
AllowedDomains []string
VisitQueue VisitQueue
TextOutput io.Writer
EmailsOutput io.Writer
}
// Web worker
type Worker struct {
Jobs chan web.Job
Results chan web.Result
Conf *WorkerConf
visited *visited
stats *Statistics
@@ -59,10 +64,9 @@ type Worker struct {
}
// Create a new worker
func NewWorker(jobs chan web.Job, results chan web.Result, conf *WorkerConf, visited *visited, stats *Statistics) Worker {
func NewWorker(jobs chan web.Job, conf *WorkerConf, visited *visited, stats *Statistics) Worker {
return Worker{
Jobs: jobs,
Results: results,
Conf: conf,
visited: visited,
stats: stats,
@@ -70,8 +74,8 @@ func NewWorker(jobs chan web.Job, results chan web.Result, conf *WorkerConf, vis
}
}
func (w *Worker) saveContent(links []string, pageURL *url.URL) {
var alreadyProcessedUrls []string
func (w *Worker) saveContent(links []url.URL, pageURL *url.URL) {
var alreadyProcessedUrls []url.URL
for count, link := range links {
// check if this URL has been processed already
var skip bool = false
@@ -89,29 +93,29 @@ func (w *Worker) saveContent(links []string, pageURL *url.URL) {
}
alreadyProcessedUrls = append(alreadyProcessedUrls, link)
var fileName string = fmt.Sprintf("%s_%d_%s", pageURL.Host, count, path.Base(link))
var fileName string = fmt.Sprintf("%s_%d_%s", pageURL.Host, count, path.Base(link.Path))
var filePath string
if web.HasImageExtention(link) {
if web.HasImageExtention(link.Path) {
filePath = filepath.Join(w.Conf.Save.OutputDir, config.SaveImagesDir, fileName)
} else if web.HasVideoExtention(link) {
} else if web.HasVideoExtention(link.Path) {
filePath = filepath.Join(w.Conf.Save.OutputDir, config.SaveVideosDir, fileName)
} else if web.HasAudioExtention(link) {
} else if web.HasAudioExtention(link.Path) {
filePath = filepath.Join(w.Conf.Save.OutputDir, config.SaveAudioDir, fileName)
} else if web.HasDocumentExtention(link) {
} else if web.HasDocumentExtention(link.Path) {
filePath = filepath.Join(w.Conf.Save.OutputDir, config.SaveDocumentsDir, fileName)
} else {
filePath = filepath.Join(w.Conf.Save.OutputDir, fileName)
}
err := web.FetchFile(
link,
link.String(),
w.Conf.Requests.UserAgent,
w.Conf.Requests.ContentFetchTimeoutMs,
filePath,
)
if err != nil {
logger.Error("Failed to fetch file at %s: %s", link, err)
logger.Error("Failed to fetch file located at %s: %s", link.String(), err)
return
}
@@ -120,22 +124,115 @@ func (w *Worker) saveContent(links []string, pageURL *url.URL) {
}
}
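saveContent above fans fetched files out into per-type subdirectories by testing the URL path's extension. A standalone sketch of that routing idea — the extension sets and directory names here are illustrative, not the `HasImageExtention`-family helpers or the `config.Save*Dir` constants:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// dirFor maps a URL path to an output subdirectory by file extension,
// mirroring the image/video/audio/document chain in saveContent.
// The extension lists and directory names are assumptions for the sketch.
func dirFor(urlPath string) string {
	switch strings.ToLower(filepath.Ext(urlPath)) {
	case ".jpg", ".jpeg", ".png", ".gif", ".webp":
		return "images"
	case ".mp4", ".webm", ".mkv", ".avi":
		return "videos"
	case ".mp3", ".ogg", ".wav", ".flac":
		return "audio"
	case ".pdf", ".doc", ".docx", ".txt":
		return "documents"
	default:
		return "." // unknown types land in the output root
	}
}

func main() {
	fmt.Println(dirFor("/pics/cat.PNG"))      // images
	fmt.Println(dirFor("/files/report.pdf")) // documents
}
```

Keying the switch on `filepath.Ext` of the URL *path* (not the full URL) matters, since a query string would otherwise pollute the extension.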
// Save page to the disk with a corresponding name
func (w *Worker) savePage(baseURL *url.URL, pageData []byte) {
if w.Conf.Save.SavePages && w.Conf.Save.OutputDir != "" {
var pageName string = fmt.Sprintf("%s_%s.html", baseURL.Host, path.Base(baseURL.String()))
pageFile, err := os.Create(filepath.Join(w.Conf.Save.OutputDir, config.SavePagesDir, pageName))
// Save page to the disk with a corresponding name; Download any src files, stylesheets and JS along the way
func (w *Worker) savePage(baseURL url.URL, pageData []byte) {
var findPageFileContentURLs func([]byte) []url.URL = func(pageBody []byte) []url.URL {
var urls []url.URL
for _, link := range web.FindPageLinksDontResolve(pageBody) {
if strings.Contains(link.Path, ".css") ||
strings.Contains(link.Path, ".scss") ||
strings.Contains(link.Path, ".js") ||
strings.Contains(link.Path, ".mjs") {
urls = append(urls, link)
}
}
urls = append(urls, web.FindPageSrcLinksDontResolve(pageBody)...)
return urls
}
var cleanLink func(url.URL, url.URL) url.URL = func(link url.URL, from url.URL) url.URL {
resolvedLink := web.ResolveLink(link, from.Host)
cleanLink, err := url.Parse(resolvedLink.Scheme + "://" + resolvedLink.Host + resolvedLink.Path)
if err != nil {
logger.Error("Failed to create page of \"%s\": %s", baseURL.String(), err)
return resolvedLink
}
return *cleanLink
}
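The cleanLink closure above normalizes a resolved link by rebuilding it from scheme, host, and path only, discarding any query string and fragment. The same idea in isolation, with a plain `url.Parse` standing in for `web.ResolveLink` (whose signature isn't shown in this diff):

```go
package main

import (
	"fmt"
	"net/url"
)

// stripToPath rebuilds a URL from scheme, host, and path only,
// dropping the query and fragment -- the normalization cleanLink
// applies before using a link as a local file name.
func stripToPath(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	clean := url.URL{Scheme: u.Scheme, Host: u.Host, Path: u.Path}
	return clean.String(), nil
}

func main() {
	out, err := stripToPath("https://example.com/assets/app.js?v=42#top")
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // https://example.com/assets/app.js
}
```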
// Create directory with all file content on the page
var pageFilesDirectoryName string = fmt.Sprintf(
"%s_%s_files",
baseURL.Host,
strings.ReplaceAll(baseURL.Path, "/", "_"),
)
err := os.MkdirAll(filepath.Join(w.Conf.Save.OutputDir, config.SavePagesDir, pageFilesDirectoryName), os.ModePerm)
if err != nil {
logger.Error("Failed to create directory to store file contents of %s: %s", baseURL.String(), err)
return
}
defer pageFile.Close()
pageFile.Write(pageData)
// Save files on page
srcLinks := findPageFileContentURLs(pageData)
for _, srcLink := range srcLinks {
web.FetchFile(srcLink.String(),
w.Conf.Requests.UserAgent,
w.Conf.Requests.ContentFetchTimeoutMs,
filepath.Join(
w.Conf.Save.OutputDir,
config.SavePagesDir,
pageFilesDirectoryName,
path.Base(srcLink.String()),
),
)
}
// Redirect old content URLs to local files
for _, srcLink := range srcLinks {
cleanLink := cleanLink(srcLink, baseURL)
pageData = bytes.ReplaceAll(
pageData,
[]byte(srcLink.String()),
[]byte("./"+filepath.Join(pageFilesDirectoryName, path.Base(cleanLink.String()))),
)
}
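The loop above redirects each asset reference in the saved page to its local copy with a byte-level substitution. A minimal sketch of that rewrite (the directory name is illustrative):

```go
package main

import (
	"bytes"
	"fmt"
)

// localizeURL replaces every occurrence of an absolute asset URL in an
// HTML page with a relative path into the saved-files directory --
// the same bytes.ReplaceAll substitution savePage performs.
func localizeURL(page []byte, absURL, localPath string) []byte {
	return bytes.ReplaceAll(page, []byte(absURL), []byte("./"+localPath))
}

func main() {
	page := []byte(`<script src="https://example.com/app.js"></script>`)
	out := localizeURL(page, "https://example.com/app.js", "example.com__files/app.js")
	fmt.Println(string(out)) // <script src="./example.com__files/app.js"></script>
}
```

Plain byte replacement is simple but positional-blind: it rewrites the URL wherever it appears, including inside visible text, which is usually acceptable for archival snapshots.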
// Create page output file
pageName := fmt.Sprintf(
"%s_%s.html",
baseURL.Host,
strings.ReplaceAll(baseURL.Path, "/", "_"),
)
outfile, err := os.Create(filepath.Join(
filepath.Join(w.Conf.Save.OutputDir, config.SavePagesDir),
pageName,
))
if err != nil {
logger.Error("Failed to create output file: %s", err)
return
}
defer outfile.Close()
outfile.Write(pageData)
logger.Info("Saved \"%s\"", pageName)
w.stats.PagesSaved++
}
const (
textTypeMatch = iota
textTypeEmail = iota
)
// Save text result to an appropriate file
func (w *Worker) saveResult(result web.Result, textType int) {
// write result to the output file
var output io.Writer
switch textType {
case textTypeEmail:
output = w.Conf.EmailsOutput
default:
output = w.Conf.TextOutput
}
// each entry in output file is a self-standing JSON object
entryBytes, err := json.MarshalIndent(result, " ", "\t")
if err != nil {
return
}
output.Write(entryBytes)
output.Write([]byte("\n"))
}
// Launch scraping process on this worker
@@ -236,7 +333,7 @@ func (w *Worker) Work() {
}
// find links
pageLinks := web.FindPageLinks(pageData, pageURL)
pageLinks := web.FindPageLinks(pageData, *pageURL)
go func() {
if job.Depth > 1 {
// decrement depth and add new jobs
@@ -246,9 +343,9 @@ func (w *Worker) Work() {
// add to the visit queue
w.Conf.VisitQueue.Lock.Lock()
for _, link := range pageLinks {
if link != job.URL {
if link.String() != job.URL {
err = queue.InsertNewJob(w.Conf.VisitQueue.VisitQueue, web.Job{
URL: link,
URL: link.String(),
Search: *w.Conf.Search,
Depth: job.Depth,
})
@@ -262,9 +359,9 @@ func (w *Worker) Work() {
} else {
// add to the in-memory channel
for _, link := range pageLinks {
if link != job.URL {
if link.String() != job.URL {
w.Jobs <- web.Job{
URL: link,
URL: link.String(),
Search: *w.Conf.Search,
Depth: job.Depth,
}
@@ -280,9 +377,12 @@ func (w *Worker) Work() {
var savePage bool = false
switch job.Search.Query {
case config.QueryArchive:
savePage = true
case config.QueryImages:
// find image URLs, output images to the file while not saving already outputted ones
imageLinks := web.FindPageImages(pageData, pageURL)
imageLinks := web.FindPageImages(pageData, *pageURL)
if len(imageLinks) > 0 {
w.saveContent(imageLinks, pageURL)
savePage = true
@@ -291,7 +391,7 @@ func (w *Worker) Work() {
case config.QueryVideos:
// search for videos
// find video URLs, output videos to the files while not saving already outputted ones
videoLinks := web.FindPageVideos(pageData, pageURL)
videoLinks := web.FindPageVideos(pageData, *pageURL)
if len(videoLinks) > 0 {
w.saveContent(videoLinks, pageURL)
savePage = true
@@ -300,7 +400,7 @@ func (w *Worker) Work() {
case config.QueryAudio:
// search for audio
// find audio URLs, output audio to the file while not saving already outputted ones
audioLinks := web.FindPageAudio(pageData, pageURL)
audioLinks := web.FindPageAudio(pageData, *pageURL)
if len(audioLinks) > 0 {
w.saveContent(audioLinks, pageURL)
savePage = true
@@ -309,7 +409,7 @@ func (w *Worker) Work() {
case config.QueryDocuments:
// search for various documents
// find documents URLs, output docs to the file while not saving already outputted ones
docsLinks := web.FindPageDocuments(pageData, pageURL)
docsLinks := web.FindPageDocuments(pageData, *pageURL)
if len(docsLinks) > 0 {
w.saveContent(docsLinks, pageURL)
savePage = true
@@ -319,11 +419,11 @@ func (w *Worker) Work() {
// search for email
emailAddresses := web.FindPageEmailsWithCheck(pageData)
if len(emailAddresses) > 0 {
w.Results <- web.Result{
w.saveResult(web.Result{
PageURL: job.URL,
Search: job.Search,
Data: emailAddresses,
}
}, textTypeEmail)
w.stats.MatchesFound += uint64(len(emailAddresses))
savePage = true
}
@@ -332,29 +432,29 @@ func (w *Worker) Work() {
// search for everything
// files
var contentLinks []string
contentLinks = append(contentLinks, web.FindPageImages(pageData, pageURL)...)
contentLinks = append(contentLinks, web.FindPageAudio(pageData, pageURL)...)
contentLinks = append(contentLinks, web.FindPageVideos(pageData, pageURL)...)
contentLinks = append(contentLinks, web.FindPageDocuments(pageData, pageURL)...)
var contentLinks []url.URL
contentLinks = append(contentLinks, web.FindPageImages(pageData, *pageURL)...)
contentLinks = append(contentLinks, web.FindPageAudio(pageData, *pageURL)...)
contentLinks = append(contentLinks, web.FindPageVideos(pageData, *pageURL)...)
contentLinks = append(contentLinks, web.FindPageDocuments(pageData, *pageURL)...)
w.saveContent(contentLinks, pageURL)
if len(contentLinks) > 0 {
savePage = true
}
// email
emailAddresses := web.FindPageEmailsWithCheck(pageData)
if len(emailAddresses) > 0 {
w.Results <- web.Result{
w.saveResult(web.Result{
PageURL: job.URL,
Search: job.Search,
Data: emailAddresses,
}
}, textTypeEmail)
w.stats.MatchesFound += uint64(len(emailAddresses))
savePage = true
}
if len(contentLinks) > 0 || len(emailAddresses) > 0 {
savePage = true
}
default:
// text search
switch job.Search.IsRegexp {
@@ -368,11 +468,11 @@ func (w *Worker) Work() {
matches := web.FindPageRegexp(re, pageData)
if len(matches) > 0 {
w.Results <- web.Result{
w.saveResult(web.Result{
PageURL: job.URL,
Search: job.Search,
Data: matches,
}
}, textTypeMatch)
logger.Info("Found matches: %+v", matches)
w.stats.MatchesFound += uint64(len(matches))
savePage = true
@@ -380,11 +480,11 @@ func (w *Worker) Work() {
case false:
// just text
if web.IsTextOnPage(job.Search.Query, true, pageData) {
w.Results <- web.Result{
w.saveResult(web.Result{
PageURL: job.URL,
Search: job.Search,
Data: []string{job.Search.Query},
}
}, textTypeMatch)
logger.Info("Found \"%s\" on page", job.Search.Query)
w.stats.MatchesFound++
savePage = true
@@ -393,8 +493,8 @@ func (w *Worker) Work() {
}
// save page
if savePage {
w.savePage(pageURL, pageData)
if savePage && w.Conf.Save.SavePages {
w.savePage(*pageURL, pageData)
}
pageData = nil
pageURL = nil
