
Wecr

Scrape the web for data recursively: text, videos, audio, images...

As someone who somehow ended up a data hoarder, it was only natural for me to write my own web spider and tailor such a delicate task to my needs. So here we have it.

Capabilities

As of version 0.2.1:

Search for text (static strings or regular expressions, plus an email address preset), images, videos and audio
Request control
Saving pages on which matching content has been found
Blacklisting and whitelisting of domains
Configurable search depth
Configurable number of parallel workers
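
To make the depth and domain options a bit more concrete, here is a minimal Python sketch of a depth-limited crawl with a whitelist and a blacklist. It is not Wecr's actual code: the names `FAKE_WEB`, `is_allowed` and `crawl` are made up for illustration, and the "web" is a small in-memory dictionary so the example runs without any network access.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory "web": page URL -> links found on that page.
FAKE_WEB = {
    "https://a.example/": ["https://a.example/about", "https://b.example/"],
    "https://a.example/about": ["https://a.example/team", "https://c.example/"],
    "https://a.example/team": ["https://a.example/deep"],
    "https://b.example/": ["https://a.example/"],
}


def is_allowed(url, allowed=None, blocked=None):
    """Blacklist wins; a non-empty whitelist restricts everything else."""
    host = urlparse(url).hostname or ""
    if blocked and host in blocked:
        return False
    if allowed:
        return host in allowed
    return True


def crawl(start_urls, max_depth, allowed=None, blocked=None):
    """Breadth-first crawl that stops following links at max_depth."""
    start = list(dict.fromkeys(start_urls))  # drop duplicate start pages
    seen = set(start)
    queue = deque((url, 0) for url in start)
    visited = []
    while queue:
        url, depth = queue.popleft()
        if not is_allowed(url, allowed, blocked):
            continue
        visited.append(url)
        if depth >= max_depth:
            continue  # deep enough: do not follow links from this page
        for link in FAKE_WEB.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


if __name__ == "__main__":
    print(crawl(["https://a.example/"], max_depth=1, allowed={"a.example"}))
```

A real spider would fetch each page over HTTP and hand the work to several parallel workers; the sketch only shows how the depth limit and the domain lists decide which pages get visited.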

Documentation

For detailed instructions see README.md on the project page.

Example

Create a new empty directory

Create a default configuration file in the current directory

The generated config

Match everything on these 2 initial pages, only these 2 domains are allowed, save pages with content

Launch the search

New files and directories have been created

The log file shows the latest activity

Found text can be examined in output.json

Other content can be found in the corresponding directory inside scraped

Stop the scrape with CTRL+C and extract the text data

This is what we've ended up with (each entry is on a new line, no duplicates)
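
The "one entry per line, no duplicates" result can be pictured with a tiny Python sketch. This is only an illustration of the idea, not how Wecr does the extraction; the helper name `unique_lines` is hypothetical, and the real structure of output.json is not assumed here.

```python
def unique_lines(entries):
    """Strip whitespace, drop empty entries and duplicates, keep first-seen order."""
    cleaned = (entry.strip() for entry in entries)
    return list(dict.fromkeys(entry for entry in cleaned if entry))


if __name__ == "__main__":
    found = ["a@x.org", "b@x.org", "a@x.org", " ", "b@x.org\n"]
    # Each surviving entry goes on its own line.
    print("\n".join(unique_lines(found)))
```

Using `dict.fromkeys` keeps the order in which entries were first seen, which a plain `set` would not guarantee.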

Now let's search for some text with regular expressions

Found matches

Extracted output
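
For readers who have not used regular expressions for this kind of search, here is a rough Python sketch of what matching text on a page looks like. The email pattern below is a simplified assumption written for this example, not the preset Wecr ships with, and `find_matches` is a hypothetical helper.

```python
import re

# Simplified email pattern, assumed for illustration only.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_matches(text, pattern):
    """Return every non-overlapping match of pattern in text, in order."""
    return [match.group(0) for match in re.finditer(pattern, text)]


if __name__ == "__main__":
    page_text = "Write to admin@example.org or sales@example.com."
    print(find_matches(page_text, EMAIL_PATTERN))
    print(find_matches("id-42 and id-7", r"id-\d+"))
```

The same function accepts either a compiled pattern or a plain pattern string, so a preset and a user-supplied expression can go through one code path.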

Code is

[Categories:Programming,Utilities:] [Date:January 2023:]