
No more unified text output file. Text searches of different kinds go into their own files

master · v0.3.2 · commit b256d8a83e
4 changed files:

1. README.md (53 lines changed)
2. src/config/config.go (2 lines changed)
3. src/main.go (64 lines changed)
4. src/worker/worker.go (4 lines changed)

README.md

````diff
@@ -1,12 +1,12 @@
-# Wecr - simple web crawler
+# Wecr - versatile WEb CRawler
 ## Overview
-Just a simple HTML web spider with no dependencies. It is possible to search for pages with a text on them or for the text itself, extract images, video, audio and save pages that satisfy the criteria along the way.
+A simple HTML web spider with no dependencies. It is possible to search for pages with a text on them or for the text itself, extract images, video, audio and save pages that satisfy the criteria along the way.
 ## Configuration
-The flow of work fully depends on the configuration file. By default `conf.json` is used as a configuration file, but the name can be changed via `-conf` flag. The default configuration is embedded in the program so on the first launch or by simply deleting the file, a new `conf.json` will be created in the same directory as the executable itself unless the `-wDir` (working directory) flag is set to some other value. To see al available flags run `wecr -h`.
+The flow of work fully depends on the configuration file. By default `conf.json` is used as a configuration file, but the name can be changed via `-conf` flag. The default configuration is embedded in the program so on the first launch or by simply deleting the file, a new `conf.json` will be created in the same directory as the executable itself unless the `-wdir` (working directory) flag is set to some other value. To see al available flags run `wecr -h`.
 The configuration is split into different branches like `requests` (how requests are made, ie: request timeout, wait time, user agent), `logging` (use logs, output to a file), `save` (output file|directory, save pages or not) or `search` (use regexp, query string) each of which contain tweakable parameters. There are global ones as well such as `workers` (working threads that make requests in parallel) and `depth` (literally, how deep the recursive search should go). The names are simple and self-explanatory so no attribute-by-attribute explanation needed for most of them.
@@ -14,6 +14,8 @@ The parsing starts from `initial_pages` and goes deeper while ignoring the pages
 Previous versions stored the entire visit queue in memory, resulting in gigabytes of memory usage but as of `v0.2.4` it is possible to offload the queue to the persistent storage via `in_memory_visit_queue` option (`false` by default).
+You can change search `query` at **runtime** via web dashboard if `launch_dashboard` is set to `true`
 ### Search query
 There are some special `query` values:
@@ -31,17 +33,58 @@ When `is_regexp` is enabled, the `query` is treated as a regexp string and pages
 By default, if the query is not something of special values all the matches and other data will be outputted to `output.json` file as separate continuous JSON objects, but if `save_pages` is set to `true` and|or `query` is set to `images`, `videos`, `audio`, etc. - the additional contents will be put in the corresponding directories inside `output_dir`, which is neatly created by the executable's side.
-The output almost certainly contains some duplicates and is not easy to work with programmatically, so you can use `-extractData` with the output JSON file argument (like `output.json`, which is the default output file name) to extract the actual data, filter out the duplicates and put each entry on its new line in a new text file.
+The output almost certainly contains some duplicates and is not easy to work with programmatically, so you can use `-extractData` with the output JSON file argument (like `found_text.json`, which is the default output file name for simple text searches) to extract the actual data, filter out the duplicates and put each entry on its new line in a new text file.
 ## Build
 If you're on *nix - it's as easy as `make`.
-Otherwise - `go build` in the `src` directory to build `wecr`.
+Otherwise - `go build` in the `src` directory to build `wecr`. No dependencies.
 ## Examples
 See [page on my website](https://unbewohnte.su/wecr) for some basic examples.
+Dump of a basic configuration:
+```json
+{
+    "search": {
+        "is_regexp": true,
+        "query": "(sequence to search)|(other sequence)"
+    },
+    "requests": {
+        "request_wait_timeout_ms": 2500,
+        "request_pause_ms": 100,
+        "content_fetch_timeout_ms": 0,
+        "user_agent": ""
+    },
+    "depth": 90,
+    "workers": 30,
+    "initial_pages": [
+        "https://en.wikipedia.org/wiki/Main_Page"
+    ],
+    "allowed_domains": [
+        "https://en.wikipedia.org/"
+    ],
+    "blacklisted_domains": [
+        ""
+    ],
+    "in_memory_visit_queue": false,
+    "web_dashboard": {
+        "launch_dashboard": true,
+        "port": 13370
+    },
+    "save": {
+        "output_dir": "scraped",
+        "save_pages": false
+    },
+    "logging": {
+        "output_logs": true,
+        "logs_file": "logs.log"
+    }
+}
+```
 ## License
 AGPLv3
````
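The `-extractData` step the README describes boils down to reading a stream of self-standing JSON objects, deduplicating their data, and writing one entry per line. The real implementation is `utilities.ExtractDataFromOutput`; the following is only a minimal sketch of the same idea, assuming each entry carries its matches in a `data` array (the actual field names of `web.Result` are not shown in this diff):

```go
// Sketch only: approximates what -extractData does; not the actual
// utilities.ExtractDataFromOutput implementation. The "data" field
// name is an assumption.
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"os"
)

// entry mirrors (by assumption) the shape of objects in found_text.json
type entry struct {
	Data []string `json:"data"`
}

func main() {
	in, err := os.Open("found_text.json")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer in.Close()

	// the output file is a stream of self-standing JSON objects,
	// so a json.Decoder can consume them one after another
	dec := json.NewDecoder(in)
	seen := map[string]bool{}
	for {
		var e entry
		err := dec.Decode(&e)
		if err == io.EOF {
			break
		}
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		for _, d := range e.Data {
			if !seen[d] { // filter out duplicates
				seen[d] = true
				fmt.Println(d) // one entry per line
			}
		}
	}
}
```

Run against `found_text.json`, this prints each unique match once, which is what the prettified `extracted_data.txt` contains.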

src/config/config.go

```diff
@@ -48,7 +48,6 @@ type Search struct {
 type Save struct {
 	OutputDir string `json:"output_dir"`
-	OutputFile string `json:"output_file"`
 	SavePages bool `json:"save_pages"`
 }
@@ -94,7 +93,6 @@ func Default() *Conf {
 		Save: Save{
 			OutputDir: "scraped",
 			SavePages: false,
-			OutputFile: "scraped.json",
 		},
 		Requests: Requests{
 			UserAgent: "",
```

src/main.go

```diff
@@ -40,13 +40,14 @@ import (
 	"unbewohnte/wecr/worker"
 )
-const version = "v0.3.1"
+const version = "v0.3.2"
 const (
-	defaultConfigFile           string = "conf.json"
-	defaultOutputFile           string = "output.json"
-	defaultPrettifiedOutputFile string = "extracted_data.txt"
-	defaultVisitQueueFile       string = "visit_queue.tmp"
+	configFilename               string = "conf.json"
+	prettifiedTextOutputFilename string = "extracted_data.txt"
+	visitQueueFilename           string = "visit_queue.tmp"
+	textOutputFilename           string = "found_text.json"
+	emailsOutputFilename         string = "found_emails.json"
 )
 var (
@@ -61,15 +62,10 @@ var (
 	)
 	configFile = flag.String(
-		"conf", defaultConfigFile,
+		"conf", configFilename,
 		"Configuration file name to create|look for",
 	)
-	outputFile = flag.String(
-		"out", defaultOutputFile,
-		"Output file name to output information into",
-	)
 	extractDataFilename = flag.String(
 		"extractData", "",
 		"Set filename for output JSON file and extract data from it, put each entry nicely on a new line in a new file, then exit",
@@ -77,7 +73,6 @@ var (
 	workingDirectory string
 	configFilePath   string
-	outputFilePath   string
 )
 func init() {
@@ -126,20 +121,17 @@ func init() {
 	// extract data if needed
 	if strings.TrimSpace(*extractDataFilename) != "" {
 		logger.Info("Extracting data from %s...", *extractDataFilename)
-		err := utilities.ExtractDataFromOutput(*extractDataFilename, defaultPrettifiedOutputFile, "\n", false)
+		err := utilities.ExtractDataFromOutput(*extractDataFilename, prettifiedTextOutputFilename, "\n", false)
 		if err != nil {
 			logger.Error("Failed to extract data from %s: %s", *extractDataFilename, err)
 			os.Exit(1)
 		}
-		logger.Info("Outputted \"%s\"", defaultPrettifiedOutputFile)
+		logger.Info("Outputted \"%s\"", prettifiedTextOutputFilename)
 		os.Exit(0)
 	}
 	// global path to configuration file
 	configFilePath = filepath.Join(workingDirectory, *configFile)
-	// global path to output file
-	outputFilePath = filepath.Join(workingDirectory, *outputFile)
 }
 func main() {
@@ -249,7 +241,7 @@ func main() {
 		logger.Warning("User agent is not set. Forced to \"%s\"", conf.Requests.UserAgent)
 	}
-	// create output directories and corresponding specialized ones
+	// create output directory and corresponding specialized ones, text output files
 	if !filepath.IsAbs(conf.Save.OutputDir) {
 		conf.Save.OutputDir = filepath.Join(workingDirectory, conf.Save.OutputDir)
 	}
@@ -289,6 +281,20 @@ func main() {
 		return
 	}
+	textOutputFile, err := os.Create(filepath.Join(conf.Save.OutputDir, textOutputFilename))
+	if err != nil {
+		logger.Error("Failed to create text output file: %s", err)
+		return
+	}
+	defer textOutputFile.Close()
+	emailsOutputFile, err := os.Create(filepath.Join(conf.Save.OutputDir, emailsOutputFilename))
+	if err != nil {
+		logger.Error("Failed to create email addresses output file: %s", err)
+		return
+	}
+	defer emailsOutputFile.Close()
 	switch conf.Search.Query {
 	case config.QueryEmail:
 		logger.Info("Looking for email addresses")
@@ -315,14 +321,6 @@ func main() {
 		}
 	}
-	// create output file
-	outputFile, err := os.Create(outputFilePath)
-	if err != nil {
-		logger.Error("Failed to create output file: %s", err)
-		return
-	}
-	defer outputFile.Close()
 	// create logs if needed
 	if conf.Logging.OutputLogs {
 		if conf.Logging.LogsFile != "" {
@@ -354,14 +352,14 @@ func main() {
 	var visitQueueFile *os.File = nil
 	if !conf.InMemoryVisitQueue {
 		var err error
-		visitQueueFile, err = os.Create(filepath.Join(workingDirectory, defaultVisitQueueFile))
+		visitQueueFile, err = os.Create(filepath.Join(workingDirectory, visitQueueFilename))
 		if err != nil {
 			logger.Error("Could not create visit queue temporary file: %s", err)
 			return
 		}
 		defer func() {
 			visitQueueFile.Close()
-			os.Remove(filepath.Join(workingDirectory, defaultVisitQueueFile))
+			os.Remove(filepath.Join(workingDirectory, visitQueueFilename))
 		}()
 	}
@@ -443,13 +441,21 @@ func main() {
 		}()
 	}
-	// get text results and write them to the output file (files are handled by each worker separately)
+	// get text text results and write it to the output file (found files are handled by each worker separately)
+	var outputFile *os.File
 	for {
 		result, ok := <-results
 		if !ok {
 			break
 		}
+		// as it is possible to change configuration "on the fly" - it's better to not mess up different outputs
+		if result.Search.Query == config.QueryEmail {
+			outputFile = emailsOutputFile
+		} else {
+			outputFile = textOutputFile
+		}
 		// each entry in output file is a self-standing JSON object
 		entryBytes, err := json.MarshalIndent(result, " ", "\t")
 		if err != nil {
```
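The consumer loop at the bottom is the heart of the change: instead of one unified `output.json`, each result is routed to a per-kind file based on the query it was produced with. A self-contained sketch of that pattern, where `Result` and `QueryEmail` are simplified stand-ins for `web.Result` and `config.QueryEmail` (the real check is `result.Search.Query == config.QueryEmail`):

```go
// Sketch of the routing pattern this commit introduces: one writer loop,
// one output file per kind of result. Not the real wecr code.
package main

import (
	"encoding/json"
	"log"
	"os"
)

const QueryEmail = "email"

type Result struct {
	PageURL string
	Query   string
	Data    []string
}

func main() {
	textOut, err := os.Create("found_text.json")
	if err != nil {
		log.Fatal(err)
	}
	defer textOut.Close()

	emailsOut, err := os.Create("found_emails.json")
	if err != nil {
		log.Fatal(err)
	}
	defer emailsOut.Close()

	results := make(chan Result, 2)
	results <- Result{"https://example.org", QueryEmail, []string{"someone@example.org"}}
	results <- Result{"https://example.org", "wiki", []string{"wiki"}}
	close(results)

	for result := range results {
		// route by the query the result was produced with, not by the
		// current configuration: the query can change at runtime
		out := textOut
		if result.Query == QueryEmail {
			out = emailsOut
		}
		entry, err := json.MarshalIndent(result, " ", "\t")
		if err != nil {
			log.Println(err)
			continue
		}
		out.Write(entry)
	}
}
```

Routing by the result's own query rather than by the current configuration matters because the web dashboard can change the query at runtime; results produced by an earlier email search still land in `found_emails.json`.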

src/worker/worker.go

```diff
@@ -375,7 +375,6 @@ func (w *Worker) Work() {
 				}
 				logger.Info("Found matches: %+v", matches)
 				w.stats.MatchesFound += uint64(len(matches))
-				savePage = true
 			}
 		case false:
@@ -384,11 +383,10 @@ func (w *Worker) Work() {
 				w.Results <- web.Result{
 					PageURL: job.URL,
 					Search:  job.Search,
-					Data:    nil,
+					Data:    []string{job.Search.Query},
 				}
 				logger.Info("Found \"%s\" on page", job.Search.Query)
 				w.stats.MatchesFound++
-				savePage = true
 			}
 		}
```
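Two things change for plain-text (non-regexp) matches: the worker now reports the query itself as `Data` instead of `nil`, so entries in `found_text.json` carry an actual string for `-extractData` to pull out, and the removed `savePage = true` lines mean a match alone no longer forces a page save (presumably leaving that to the `save_pages` option). A sketch of the resulting entry shape, with hypothetical JSON tags since the diff does not show `web.Result`'s full definition:

```go
// Sketch of what a plain-text match now emits; field names follow the
// worker.go diff, the JSON tags are assumptions.
package main

import (
	"encoding/json"
	"fmt"
)

type Search struct {
	IsRegexp bool   `json:"is_regexp"`
	Query    string `json:"query"`
}

type Result struct {
	PageURL string   `json:"page_url"`
	Search  Search   `json:"search"`
	Data    []string `json:"data"`
}

func main() {
	r := Result{
		PageURL: "https://en.wikipedia.org/wiki/Main_Page",
		Search:  Search{IsRegexp: false, Query: "wiki"},
		Data:    []string{"wiki"}, // was nil before this commit
	}
	b, err := json.MarshalIndent(r, " ", "\t")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```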
