Using the command-line tool
Once you've successfully installed zavod, you can use the built-in command-line tool to run parts of the system:
# Before everything else, flush away cached source data. If you don't
# do this, you'll essentially work in developer mode where a local
# cached copy of the source data is used instead of fetching fresh
# files.
$ zavod clear datasets/_global/icij_offshoreleaks/icij_offshoreleaks.yml
# Crawl the ICIJ OffshoreLeaks database:
$ zavod crawl datasets/_global/icij_offshoreleaks/icij_offshoreleaks.yml
# You can also export a dataset without re-crawling the sources:
$ zavod export datasets/_global/icij_offshoreleaks/icij_offshoreleaks.yml
# You can publish a dataset to the archive:
$ zavod publish --latest datasets/_global/icij_offshoreleaks/icij_offshoreleaks.yml
# Combine crawl, export and publish in one command:
$ zavod run --latest datasets/_global/icij_offshoreleaks/icij_offshoreleaks.yml
When you are developing a crawler, it can be handy to rerun it several times against the cached source data, and then export the data while rebuilding the intermediate storage.
First, run the crawler without --clear until you are ready to export.
Then run the exporter with --clear to ensure the latest statements are included in the output.
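The two steps above could look like this in a shell session. This is a minimal sketch, not an authoritative recipe: the dataset path is reused from the earlier examples, the DATASET variable is only a convenience introduced here, and placing --clear directly after the export subcommand is an assumption about option order (zavod export --help should confirm it). The guard simply skips the commands when zavod is not on the PATH.

```shell
# Illustrative variable (not part of zavod): the dataset you are working on.
DATASET="datasets/_global/icij_offshoreleaks/icij_offshoreleaks.yml"

if command -v zavod >/dev/null 2>&1; then
    # Step 1: rerun as often as needed. Without --clear, the cached
    # source data is reused instead of being fetched again.
    zavod crawl "$DATASET"

    # Step 2: once the crawler output looks right, export with --clear
    # (flag placement assumed) so the latest statements are included.
    zavod export --clear "$DATASET"
else
    echo "zavod not found on PATH; install it first" >&2
fi
```

Keeping the dataset path in one variable makes it less likely that the crawl and the export accidentally point at different datasets while you iterate.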
Debugging Crawlers in VSCode
It is possible to debug crawlers through the Python debugger that comes with the standard VSCode install. To enable it, either rename .vscode/launch.json.example to .vscode/launch.json, or copy the launch configuration you find in it into your own launch.json file.
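For the first option, a small shell helper along these lines may be convenient. It is only a sketch: the function name install_launch_config is hypothetical, the two file paths come from the paragraph above, and it copies the example instead of renaming it so the original stays in the checkout. It deliberately refuses to overwrite an existing launch.json, since in that case the configuration has to be merged by hand.

```shell
# Hypothetical helper: copy the example launch configuration into place.
# Usage: install_launch_config [repository-root]   (defaults to ".")
install_launch_config() {
    root="${1:-.}"
    example="$root/.vscode/launch.json.example"
    target="$root/.vscode/launch.json"

    if [ ! -f "$example" ]; then
        echo "No $example found; is $root the repository root?" >&2
        return 1
    fi

    if [ -e "$target" ]; then
        # Keep an existing launch.json; merge the example by hand instead.
        echo "$target already exists; copy the configuration over manually" >&2
        return 1
    fi

    # Copy rather than rename so the example file remains available.
    cp "$example" "$target"
}
```

From the repository root, calling install_launch_config with no argument (or with "." as the argument) should leave a .vscode/launch.json in place.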
You should now be able to run crawlers by navigating to their .yaml file and running the "Debug: Crawl of current .YAML" launch configuration.