Developing Enforcements crawlers
Limit enforcements to the enforcement age limit
Enforcement pages often go back decades, listing historical notices of questionable relevance that can add a significant maintenance burden, both in crawler code and in review effort.
We have a standard support period defined for enforcement actions and a helper to check whether an action date is within scope:
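from zavod import helpers as h
from zavod.shed import enforcements

for row in h.parse_html_table(table):
    enforcement_date = h.element_text(row["date"])
    # Stops at the first out-of-scope row, which assumes newest-first ordering.
    if not enforcements.within_max_age(context, enforcement_date):
        return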
zavod.shed.enforcements.within_max_age(context, date, max_age_days=MAX_ENFORCEMENT_DAYS)
Check if an enforcement date is within the maximum age of enforcement actions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| context | Context | The runner context with dataset metadata. | required |
| date | datetime \| str | The enforcement date to check. | required |
| max_age_days | int | The maximum age of enforcement actions in days, if different from the default. | MAX_ENFORCEMENT_DAYS |
Source code in zavod/shed/enforcements.py
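To make the age check concrete, here is a minimal, self-contained sketch of how such a check could work. It is not the zavod implementation: the function name `is_within_max_age`, the explicit `now` argument, the constant `EXAMPLE_MAX_AGE_DAYS` and its value are all assumptions made for illustration, and the sketch only accepts ISO 8601 date strings.

```python
from datetime import datetime, timedelta
from typing import Union

# Hypothetical limit for illustration only; the real default is
# MAX_ENFORCEMENT_DAYS, whose value is not stated here.
EXAMPLE_MAX_AGE_DAYS = 5 * 365


def is_within_max_age(
    date: Union[datetime, str],
    now: datetime,
    max_age_days: int = EXAMPLE_MAX_AGE_DAYS,
) -> bool:
    """Return True if `date` is at most `max_age_days` days before `now`."""
    if isinstance(date, str):
        # Assumes an ISO 8601 date such as "2024-03-01".
        date = datetime.fromisoformat(date)
    return now - date <= timedelta(days=max_age_days)


# A fixed reference time keeps the example reproducible.
now = datetime(2025, 1, 1)
recent = is_within_max_age("2024-03-01", now)
old = is_within_max_age("2010-03-01", now)
```

In a crawler, the reference time would come from the runner context rather than from a hard-coded value, which is why the real helper takes `context` as its first argument.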
Create Article and Documentation entities
Create an Article for each notice or press release, and a Documentation for each distinct entity emitted based on that article.
The Article is not normally meant to carry the content of the notice. Its purpose is to make it easy to find all the significant entities mentioned in the same document.
zavod.helpers.articles.make_article(context, url, key_extra=None, title=None, published_at=None)
Create an article entity based on the URL where it was published.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| context | Context | The runner context with dataset metadata. | required |
| url | str | The URL where the article was published. | required |
| key_extra | Optional[str] | An optional value to be included in the generated Article ID hash. | None |
| title | Optional[str] | The title of the article. | None |
| published_at | Optional[str] | The publication date of the article. | None |
Source code in zavod/helpers/articles.py
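The role of `key_extra` is easiest to see with a small sketch of a URL-based ID. The code below is an assumption for illustration, not how zavod derives Article IDs: the function `example_article_id`, the `article-` prefix and the use of SHA-1 are all hypothetical. It only shows that the same URL yields the same ID, and that an extra key separates two articles published at one URL.

```python
import hashlib
from typing import Optional


def example_article_id(url: str, key_extra: Optional[str] = None) -> str:
    """Derive a stable, hypothetical article ID from a URL and an optional extra key."""
    digest = hashlib.sha1()
    digest.update(url.encode("utf-8"))
    if key_extra is not None:
        # A separator byte keeps (url, key_extra) pairs from colliding
        # when their concatenations happen to be equal.
        digest.update(b"\x00")
        digest.update(key_extra.encode("utf-8"))
    return "article-" + digest.hexdigest()


# Hypothetical URL used only for the example.
listing_url = "https://example.org/enforcement/notices"
first = example_article_id(listing_url, key_extra="notice-1")
second = example_article_id(listing_url, key_extra="notice-2")
```

This matters when several notices share one listing page URL: without a distinguishing value, they would all collapse into a single Article.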
zavod.helpers.articles.make_documentation(context, entity, article, key_extra=None, date=None)
Creates a documentation entity to link an article to a related entity. The article's publishedAt date is added to the Documentation date property unless the date argument is provided.
This is useful to link one or more entities to an article they were mentioned in.
Create a distinct Documentation entity for each entity-article pair.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| context | Context | The runner context with dataset metadata. | required |
| entity | Entity | The entity related to the article. | required |
| article | Entity | The related article. | required |
| key_extra | Optional[str] | An optional value to be included in the generated Documentation ID hash. | None |
| date | Optional[str] | The publication date of the article, added to the Documentation date property. | None |
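
The two rules described above, one Documentation per entity-article pair and the date falling back to the article's publication date, can be modelled in a few lines. This is a simplified sketch using plain dataclasses rather than zavod entities. The names `ExampleArticle`, `ExampleDocumentation` and `link_entities` are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ExampleArticle:
    url: str
    published_at: Optional[str] = None


@dataclass
class ExampleDocumentation:
    entity_id: str
    article_url: str
    date: Optional[str]


def link_entities(
    article: ExampleArticle,
    entity_ids: List[str],
    date: Optional[str] = None,
) -> List[ExampleDocumentation]:
    """Build one documentation record per entity mentioned in the article."""
    # An explicit date wins; otherwise fall back to the article's publication date.
    effective_date = date if date is not None else article.published_at
    return [
        ExampleDocumentation(entity_id, article.url, effective_date)
        for entity_id in entity_ids
    ]


# Hypothetical notice naming two entities.
notice = ExampleArticle("https://example.org/notices/42", published_at="2024-03-01")
links = link_entities(notice, ["company-a", "person-b"])
overridden = link_entities(notice, ["company-a"], date="2024-02-15")
```

Because every entity gets its own record pointing at the same article, all parties named in one notice can later be found through that shared article.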