Wecr - simple web crawler

Overview

A simple HTML web spider with minimal dependencies. It can search pages for given text (or extract the matching text itself), collect images, audio, and video files, and save pages that satisfy the criteria along the way.

Configuration

The program's behavior is fully driven by the configuration file. By default conf.json is used, but another file can be specified via the -conf flag. The default configuration is embedded in the program, so on the first launch (or after deleting the file) a new conf.json is created in the same directory as the executable, unless the -wDir (working directory) flag points somewhere else.

The configuration is split into branches such as requests (how requests are made, e.g. request timeout, wait time, user agent), logging (whether to log and whether to write logs to a file), save (output file/directory, whether to save pages) and search (regexp mode, query string), each of which contains tweakable parameters. There are global ones as well, such as workers (the number of worker threads making requests in parallel) and depth (how deep the recursive search should go). The names are simple and self-explanatory, so most of them need no attribute-by-attribute explanation.
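As a rough illustration, the overall layout might look like the sketch below. Only the branch names and the fields explicitly mentioned in this README (workers, depth, output_dir, save_pages, is_regexp, query) are taken from it; the remaining field names and values are hypothetical placeholders, so consult the generated conf.json for the real ones.

    {
        "workers": 20,
        "depth": 5,
        "requests": {
            "request_timeout_ms": 2500,
            "wait_ms": 100,
            "user_agent": "wecr"
        },
        "logging": {
            "use_logs": true,
            "log_to_file": true
        },
        "save": {
            "output_dir": "output",
            "save_pages": false
        },
        "search": {
            "is_regexp": false,
            "query": "some text"
        }
    }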

Parsing starts from initial_pages and goes deeper, ignoring pages on domains that are listed in blacklisted_domains or are NOT in allowed_domains. If all initial pages happen to be on blacklisted domains, or none of them are in the allowed list, the program will get stuck. Note that *_domains entries must be specified with a scheme (e.g. https://en.wikipedia.org). Subdomains and ports matter: https://unbewohnte.su:3000/ and https://unbewohnte.su/ are different.
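For example, the page and domain lists might be filled in as follows. The field names are the ones mentioned above, but their exact position inside the configuration is not specified here, so the snippet only shows the fields themselves; note the explicit schemes and the distinct port.

    "initial_pages": ["https://en.wikipedia.org"],
    "allowed_domains": ["https://en.wikipedia.org"],
    "blacklisted_domains": ["https://unbewohnte.su:3000/"]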

Search query

There are some special query values:

  • links - tells wecr to collect all links found on the page
  • images - find all images on pages and save them to the corresponding directory inside output_dir (IMPORTANT: set content_fetch_timeout_ms to 0 so that images, and the other content types below, can download fully)
  • videos - find and fetch files that look like videos
  • audio - find and fetch files that look like audio

When is_regexp is enabled, the query is treated as a regular expression and pages are scanned for matches that satisfy it.
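For instance, a search branch set to grab all images, and an alternative one using a regexp query, might look roughly like this (the regexp shown is just a hypothetical example; remember to set content_fetch_timeout_ms to 0 for the media queries, as noted above):

    "search": {
        "is_regexp": false,
        "query": "images"
    }

    "search": {
        "is_regexp": true,
        "query": "[a-z0-9._%+-]+@[a-z0-9.-]+\\.[a-z]{2,}"
    }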

Output

By default, if the query is not one of the special values, all matches and other data are written to the output.json file as separate, concatenated JSON objects. If save_pages is set to true and/or the query is set to images, videos, audio, etc., the additional content is placed in the corresponding directories inside output_dir, which the executable creates for you.
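Because output.json holds a stream of concatenated JSON objects rather than a single array, it can be consumed with a streaming decoder. Below is a minimal sketch in Go under that assumption; since the exact structure of each object is not documented here, entries are decoded into generic maps.

    package main

    import (
        "encoding/json"
        "fmt"
        "io"
        "os"
    )

    func main() {
        // Open the crawler's output file (assumed to be in the working directory).
        f, err := os.Open("output.json")
        if err != nil {
            panic(err)
        }
        defer f.Close()

        // A json.Decoder reads one JSON value at a time from the stream,
        // which matches the "separate continuous JSON objects" layout.
        dec := json.NewDecoder(f)
        for {
            var entry map[string]interface{}
            if err := dec.Decode(&entry); err == io.EOF {
                break
            } else if err != nil {
                panic(err)
            }
            fmt.Println(entry)
        }
    }

Reading entries one at a time this way means even a large output.json does not have to fit in memory at once.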

TODO

  • PARSE HTML WITH REGEXP (EVIL LAUGH) - [x]
  • Search for videos - [x]
  • Search for audio - [x]
  • Search for documents - [ ]

License

AGPLv3