Wecr
Scrape the web for data recursively: text, videos, audio, images...
As someone who somehow ended up a data hoarder, it was only natural for me to write my own web spider and tailor such a delicate task to my needs. So here we have it.
Capabilities
As of version 0.2.1:
- Search for: text (static strings or regular expressions, plus an email address preset), images, videos and audio
- Requests-control
- Save pages on which needed content has been found
- Blacklisting, whitelisting domains
- Depth of search
- Parallel worker amount
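To give a feel for how a few of these options interact, here is a minimal sketch of a depth-limited crawl with domain whitelisting/blacklisting and an email regular expression. It is not Wecr's actual code or configuration format: the `PAGES` dictionary, the `crawl` and `domain_allowed` functions and the `EMAIL_RE` pattern are all hypothetical names made up for illustration, and the "web" is an in-memory dictionary so the example runs without network access. Parallel workers and request control are left out to keep it short.

```python
import re
from collections import deque
from urllib.parse import urljoin, urlparse

# Hypothetical in-memory "web" (assumption: stands in for real HTTP requests).
PAGES = {
    "https://a.example/": '<a href="/about">About</a> <a href="https://b.example/">B</a>',
    "https://a.example/about": 'Contact: admin@a.example <a href="/deep">deep</a>',
    "https://a.example/deep": "hidden@a.example",
    "https://b.example/": "spam@b.example",
}

# Simplified email pattern for illustration only, not a full address validator.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
LINK_RE = re.compile(r'href="([^"]+)"')


def domain_allowed(url, whitelist=None, blacklist=()):
    """Blacklist always wins; an empty whitelist means 'allow everything else'."""
    host = urlparse(url).hostname or ""
    if host in blacklist:
        return False
    return not whitelist or host in whitelist


def crawl(start, max_depth, pattern=EMAIL_RE, whitelist=None, blacklist=()):
    """Breadth-first crawl; returns {url: [matches]} for pages with matches."""
    found = {}
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        body = PAGES.get(url)
        if body is None or not domain_allowed(url, whitelist, blacklist):
            continue
        matches = pattern.findall(body)
        if matches:
            found[url] = matches
        if depth >= max_depth:
            continue  # depth limit reached: search this page, do not follow its links
        for href in LINK_RE.findall(body):
            link = urljoin(url, href)
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return found


if __name__ == "__main__":
    print(crawl("https://a.example/", max_depth=1, blacklist={"b.example"}))
```

Raising `max_depth` lets the crawl reach pages further from the start URL, while the whitelist and blacklist decide which hosts are visited at all.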
Documentation
For detailed instructions, see README.md on the project page.
Example
![](/res/wecr/wecr0.png)
![](/res/wecr/wecr1.png)
![](/res/wecr/wecr2.png)
![](/res/wecr/wecr3.png)
![](/res/wecr/wecr4.png)
![](/res/wecr/wecr6.png)
![](/res/wecr/wecr5.png)
![](/res/wecr/wecr7.png)
output.json
![](/res/wecr/wecr8.png)
scraped
![](/res/wecr/wecr9.png)
CTRL+C, and now extracting the text data
![](/res/wecr/wecr10.png)
![](/res/wecr/wecr11.png)
![](/res/wecr/wecr12.png)
![](/res/wecr/wecr13.png)