Developing Enforcements crawlers
Limit enforcements to the enforcement age limit
Enforcement pages often go back decades, covering historical notices of questionable relevance that can add a significant maintenance burden (in crawler code and review effort).
We have a standard support period defined for enforcement actions and a helper to check whether an action date is within scope:
    for row in h.parse_html_table(table):
        enforcement_date = h.element_text(row["date"])
        # Skip actions older than the enforcement age limit.
        if not h.within_max_age(context, enforcement_date):
            continue
zavod.helpers.dates.within_max_age(context, date, max_age_days=MAX_ENFORCEMENT_DAYS)
Check if the given date is within a specified maximum age, defaulting to MAX_ENFORCEMENT_DAYS.
This is useful for filtering out all but the most recent items, e.g. sanctions announcements or enforcement actions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `context` | `Context` | The runner context with dataset metadata. | required |
| `date` | `datetime \| str` | The date to check. | required |
| `max_age_days` | `int` | The maximum allowable age in days, if different from the default. | `MAX_ENFORCEMENT_DAYS` |
Source code in zavod/helpers/dates.py
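To make the age check concrete, here is a minimal standalone sketch of the kind of comparison such a helper performs. It is not zavod's implementation: the function name `is_within_max_age`, the constant `EXAMPLE_MAX_AGE_DAYS` and its value, the `today` parameter, and the assumption that string dates are ISO 8601 are all hypothetical and chosen for illustration.

```python
from datetime import date, datetime, timedelta
from typing import Optional, Union

# Hypothetical limit for illustration only; zavod defines its own MAX_ENFORCEMENT_DAYS.
EXAMPLE_MAX_AGE_DAYS = 5 * 365


def is_within_max_age(
    value: Union[datetime, str],
    max_age_days: int = EXAMPLE_MAX_AGE_DAYS,
    today: Optional[date] = None,
) -> bool:
    """Return True if `value` is no older than `max_age_days` before `today`."""
    if isinstance(value, str):
        # Assumption: string dates are ISO 8601, e.g. "2024-03-15".
        value = datetime.fromisoformat(value)
    reference = today or date.today()
    cutoff = reference - timedelta(days=max_age_days)
    return value.date() >= cutoff


rows = [
    {"name": "Recent notice", "date": "2024-03-15"},
    {"name": "Old notice", "date": "2001-06-30"},
]
# A fixed reference day keeps the example deterministic.
reference_day = date(2025, 1, 1)
in_scope = [
    row["name"]
    for row in rows
    if is_within_max_age(row["date"], today=reference_day)
]
print(in_scope)  # prints ['Recent notice']
```

In a real crawler, the call to `h.within_max_age(context, enforcement_date)` would take the place of this sketch.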
Create Article and Documentation entities
Create an Article for each notice or press release, and a Documentation for each distinct entity emitted based on that article.
The purpose of the Article is normally not to publish the article's content, but to make it easy to find all the significant entities mentioned in the same document.
zavod.helpers.articles.make_article(context, url, key_extra=None, title=None, published_at=None)
Create an article entity based on the URL where it was published.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `context` | `Context` | The runner context with dataset metadata. | required |
| `url` | `str` | The URL where the article was published. | required |
| `key_extra` | `Optional[str]` | An optional value to be included in the generated Article ID hash. | `None` |
| `title` | `Optional[str]` | The title of the article. | `None` |
| `published_at` | `Optional[str]` | The publication date of the article. | `None` |
Source code in zavod/helpers/articles.py
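The sketch below illustrates how a URL-based ID with an optional extra key can behave. It assumes a SHA-1 hash and an `article-` prefix purely for illustration. The function `example_article_id` is hypothetical, and the actual ID format produced by `make_article` may differ.

```python
import hashlib
from typing import Optional


def example_article_id(url: str, key_extra: Optional[str] = None) -> str:
    """Hypothetical ID scheme: hash the URL plus an optional extra key."""
    parts = [url] if key_extra is None else [url, key_extra]
    digest = hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()
    return f"article-{digest}"


# Placeholder URL on a reserved example domain.
url = "https://regulator.example/notices/2024"

# The same URL always yields the same ID, so re-running a crawler is stable.
same_id = example_article_id(url) == example_article_id(url)

# An extra key separates several notices published under one URL.
first = example_article_id(url, key_extra="notice-1")
second = example_article_id(url, key_extra="notice-2")
```

One plausible use of `key_extra`, assumed here rather than taken from the source, is a page that lists several notices under a single URL.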
zavod.helpers.articles.make_documentation(context, entity, article, key_extra=None, date=None)
Creates a documentation entity to link an article to a related entity. The article's publishedAt date is added to the Documentation date property unless the date argument is provided.
This is useful to link one or more entities to an article they were mentioned in.
Create a distinct Documentation entity for each entity-article pair.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `context` | `Context` | The runner context with dataset metadata. | required |
| `entity` | `Entity` | The entity related to the article. | required |
| `article` | `Entity` | The related article. | required |
| `key_extra` | `Optional[str]` | An optional value to be included in the generated Documentation ID hash. | `None` |
| `date` | `Optional[str]` | The publication date of the article, added to the Documentation date property. | `None` |
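As a rough model of the relationship described above, the following standalone sketch creates one Documentation record per entity-article pair and falls back to the article's publication date when no `date` is given. It does not use zavod. The `Article` and `Documentation` dataclasses, the `link_entity_to_article` function, and the hash-based IDs are hypothetical stand-ins for the real entities.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Article:
    id: str
    url: str
    published_at: Optional[str] = None


@dataclass(frozen=True)
class Documentation:
    id: str
    entity_id: str
    article_id: str
    date: Optional[str]


def link_entity_to_article(
    entity_id: str, article: Article, date: Optional[str] = None
) -> Documentation:
    """Build one link record for a single entity-article pair."""
    # Hypothetical ID: derived from both sides of the pair so each pair is distinct.
    raw = f"{entity_id}|{article.id}"
    doc_id = "doc-" + hashlib.sha1(raw.encode("utf-8")).hexdigest()
    # An explicit date wins; otherwise reuse the article's publication date.
    return Documentation(doc_id, entity_id, article.id, date or article.published_at)


article = Article(
    id="article-1",
    url="https://regulator.example/press/42",
    published_at="2024-03-15",
)
# Two entities mentioned in the same article get two separate link records.
docs = [link_entity_to_article(eid, article) for eid in ("company-a", "person-b")]
override = link_entity_to_article("company-a", article, date="2024-04-01")
```

In a zavod crawler, the equivalent step would be to call `make_documentation` once per emitted entity and emit each result alongside the entity and the article.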