Surf the web for data recursively

Kasianov Nikolai Alekseevich 722f3fb536 Fixed emails being saved to a wrong file under query=everything; improved page saving process; fixed pages being saved not considering the actual setting; Added non-link resolving variation of FindPageLinks; Added query=archive functionality; working directory is now an actual working directory instead an executables directory		2 years ago

Wecr - versatile WEb CRawler

Overview

A simple HTML web spider with no dependencies. It can search for pages that contain a given text or for the text itself, extract images, video and audio, and save the pages that satisfy the criteria along the way.

Configuration Overview

The flow of work depends entirely on the configuration file. By default conf.json is used, but the name can be changed via the -conf flag. The default configuration is embedded in the program, so on first launch, or whenever the file has been deleted, a new conf.json is created in the working directory; if the -wdir (working directory) flag is set, that directory takes precedence. To see all available flags, run wecr -h.

The configuration is split into branches such as requests (how requests are made, e.g. request timeout, wait time, user agent), logging (use logs, output to a file), save (output file or directory, save pages or not) and search (use regexp, query string), each of which contains tweakable parameters. There are global options as well, such as workers (worker threads that make requests in parallel) and depth (literally, how deep the recursive search should go). The names are simple and self-explanatory, so most of them need no attribute-by-attribute explanation.

Parsing starts from initial_pages and goes deeper, ignoring pages on domains that are in blacklisted_domains or are NOT in allowed_domains. If all initial pages happen to be on blacklisted domains or are not in the allowed list, the program will get stuck. It is important to note that *_domains entries must be specified with their scheme (e.g. https://en.wikipedia.org). Subdomains and ports matter: https://unbewohnte.su:3000/ and https://unbewohnte.su/ are different.
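To illustrate why the scheme, subdomain and port all matter, here is a minimal sketch of a domain filter that compares `scheme://host[:port]` exactly. It is an assumption about how such a check might look, not wecr's implementation: `origin` and `isAllowed` are hypothetical helpers, and treating an allow-list with no usable entries as "no restriction" is likewise an assumption.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// origin reduces a URL to "scheme://host[:port]". It returns "" when the
// URL cannot be parsed or lacks a scheme or host (e.g. an empty entry).
func origin(raw string) string {
	u, err := url.Parse(raw)
	if err != nil || u.Scheme == "" || u.Host == "" {
		return ""
	}
	return u.Scheme + "://" + strings.ToLower(u.Host)
}

// isAllowed reports whether page may be visited. Blacklisted origins are
// always rejected. An allow-list without any usable entry is treated as
// "no restriction" (assumption, not confirmed by the README).
func isAllowed(page string, allowed, blacklisted []string) bool {
	po := origin(page)
	if po == "" {
		return false
	}
	for _, b := range blacklisted {
		if origin(b) == po {
			return false
		}
	}
	restricted := false
	for _, a := range allowed {
		ao := origin(a)
		if ao == "" {
			continue // skip empty or scheme-less entries
		}
		restricted = true
		if ao == po {
			return true
		}
	}
	return !restricted
}

func main() {
	allowed := []string{"https://en.wikipedia.org/"}
	blacklisted := []string{""}
	fmt.Println(isAllowed("https://en.wikipedia.org/wiki/Main_Page", allowed, blacklisted)) // true

	// Same host, different port: treated as a different domain.
	fmt.Println(isAllowed("https://unbewohnte.su:3000/", []string{"https://unbewohnte.su/"}, nil)) // false
}
```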

Previous versions stored the entire visit queue in memory, resulting in gigabytes of memory usage. As of v0.2.4 the queue can be offloaded to persistent storage via the in_memory_visit_queue option (false by default).

You can change the search query at runtime via the web dashboard if launch_dashboard is set to true.

Search query

There are some special query values to control the flow of work:

email - tells wecr to scrape email addresses and output to output_file
images - find all images on pages and output to the corresponding directory in output_dir (IMPORTANT: set content_fetch_timeout_ms to 0 so the images (and other content below) load fully)
videos - find and fetch files that look like videos
audio - find and fetch files that look like audio
documents - find and fetch files that look like a document
everything - find and fetch images, audio, video, documents and email addresses
archive - no text to be searched, save every visited page

When is_regexp is enabled, the query is treated as a regexp string (in Go "flavor") and pages will be scanned for matches that satisfy it.
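As a rough illustration of what Go-flavored matching means, the sketch below uses the standard `regexp` package, which implements RE2 syntax (no backreferences or lookaround). The helper `findMatches` is hypothetical and only shows how a query string might be compiled and applied to page text; it is not taken from wecr.

```go
package main

import (
	"fmt"
	"regexp"
)

// findMatches (hypothetical helper) compiles pattern as a Go regexp and
// returns every non-overlapping match found in text.
func findMatches(pattern, text string) ([]string, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	return re.FindAllString(text, -1), nil
}

func main() {
	matches, err := findMatches(`(gopher)|(crawler)`, "a gopher wrote a crawler")
	if err != nil {
		fmt.Println("invalid query:", err)
		return
	}
	fmt.Println(matches) // [gopher crawler]
}
```

A query that relies on features outside RE2, such as the lookahead `(?=x)`, fails at compile time rather than silently matching nothing.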

Data Output

If the query is not one of the special values, all text matches are written to the found_text.json file in output_dir as separate, consecutive JSON objects. If save_pages is set to true and/or the query is set to images, videos, audio, etc., the additional contents are also put in the corresponding directories inside output_dir, which is created in the working directory (or in the directory given by the -wdir flag, if it is set). If output_dir is empty, contents are written directly to the working directory.

The output almost certainly contains some duplicates and is not easy to work with programmatically, so you can use -extractData with the output JSON file as its argument (such as found_text.json, the default output file name for simple text searches) to extract the actual data, filter out the duplicates and put each entry on its own line in a new text file.

Build

If you're on *nix - it's as easy as make.

Otherwise, run go build in the src directory to build wecr. No dependencies.

Examples

See a page on my website for some basic examples.

Dump of a basic configuration:

{
	"search": {
		"is_regexp": true,
		"query": "(sequence to search)|(other sequence)"
	},
	"requests": {
		"request_wait_timeout_ms": 2500,
		"request_pause_ms": 100,
		"content_fetch_timeout_ms": 0,
		"user_agent": ""
	},
	"depth": 90,
	"workers": 30,
	"initial_pages": [
		"https://en.wikipedia.org/wiki/Main_Page"
	],
	"allowed_domains": [
		"https://en.wikipedia.org/"
	],
	"blacklisted_domains": [
		""
	],
	"in_memory_visit_queue": false,
	"web_dashboard": {
		"launch_dashboard": true,
		"port": 13370
	},
	"save": {
		"output_dir": "scraped",
		"save_pages": false
	},
	"logging": {
		"output_logs": true,
		"logs_file": "logs.log"
	}
}
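For readers who want to generate or validate such a file from their own tooling, the dump above maps naturally onto a Go struct with JSON tags. This is a sketch under assumptions: only the JSON keys come from the dump, while the type name `Conf`, the Go field names, the numeric types and the helper `parseConf` are hypothetical and may differ from wecr's source.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Conf mirrors the JSON keys of the dump above. Go names and numeric
// types are assumptions made for this example.
type Conf struct {
	Search struct {
		IsRegexp bool   `json:"is_regexp"`
		Query    string `json:"query"`
	} `json:"search"`
	Requests struct {
		RequestWaitTimeoutMs  uint64 `json:"request_wait_timeout_ms"`
		RequestPauseMs        uint64 `json:"request_pause_ms"`
		ContentFetchTimeoutMs uint64 `json:"content_fetch_timeout_ms"`
		UserAgent             string `json:"user_agent"`
	} `json:"requests"`
	Depth              uint     `json:"depth"`
	Workers            uint     `json:"workers"`
	InitialPages       []string `json:"initial_pages"`
	AllowedDomains     []string `json:"allowed_domains"`
	BlacklistedDomains []string `json:"blacklisted_domains"`
	InMemoryVisitQueue bool     `json:"in_memory_visit_queue"`
	WebDashboard       struct {
		LaunchDashboard bool   `json:"launch_dashboard"`
		Port            uint16 `json:"port"`
	} `json:"web_dashboard"`
	Save struct {
		OutputDir string `json:"output_dir"`
		SavePages bool   `json:"save_pages"`
	} `json:"save"`
	Logging struct {
		OutputLogs bool   `json:"output_logs"`
		LogsFile   string `json:"logs_file"`
	} `json:"logging"`
}

// parseConf (hypothetical helper) decodes a configuration document.
func parseConf(data []byte) (*Conf, error) {
	var c Conf
	if err := json.Unmarshal(data, &c); err != nil {
		return nil, err
	}
	return &c, nil
}

func main() {
	raw := []byte(`{"search":{"is_regexp":true,"query":"a|b"},"depth":90,"workers":30,"web_dashboard":{"launch_dashboard":true,"port":13370}}`)
	c, err := parseConf(raw)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(c.Search.Query, c.Depth, c.Workers, c.WebDashboard.Port) // a|b 90 30 13370
}
```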

License

wecr is distributed under the AGPLv3 license.