
Wecr

Scrape the web for data recursively: text, videos, audio, images...

As someone who somehow ended up a data hoarder, I found it only natural to write my own web spider to automate such a delicate task and tailor it to my needs. So here we have it.

Capabilities

As of version 0.2.1:

Documentation

For detailed instructions see README.md on the project page.

Example

Create a new empty directory
Create a default configuration file in the current directory
The generated config
Match everything on these 2 initial pages; only these 2 domains are allowed, and pages with content are saved
Launch the search
New files and directories have been created
The log file shows the latest activity
Found text can be examined in output.json
Other content can be found in the corresponding directory inside scraped
Stop the scrape with CTRL+C, then extract the text data
This is what we've ended up with (each entry is on a new line, no duplicates)
Now let's search for some text with regular expressions
Found matches
Extracted output
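
For readers who want the gist of the walkthrough without running the tool, here is a minimal Python sketch of two of the ideas shown above: following only links on allowed domains, and collecting regular-expression matches with no duplicates, one per line. This is not Wecr's code and does not mirror its configuration. The names `LinkCollector`, `allowed_links` and `unique_matches` are hypothetical, as are the example domains, and the sketch works on HTML strings rather than fetching anything.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects absolute URLs from every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def allowed_links(html, base_url, allowed_domains):
    """Return the links in html whose host is in allowed_domains."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return [u for u in collector.links if urlparse(u).hostname in allowed_domains]


def unique_matches(text, pattern):
    """Return regex matches in first-seen order, without duplicates."""
    return list(dict.fromkeys(m.group(0) for m in re.finditer(pattern, text)))


if __name__ == "__main__":
    page = '<a href="/about">About</a> <a href="https://other.test/x">X</a>'
    print(allowed_links(page, "https://example.test/", {"example.test"}))
    # One entry per line, no duplicates.
    print("\n".join(unique_matches("a@b.test c@d.test a@b.test", r"\S+@\S+")))
```

A recursive crawl would feed each allowed link back into the same steps while keeping a set of already visited URLs, which is left out here to keep the sketch short.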

Code is here

[Categories:Programming,Utilities:] [Date:January 2023:]