Developing Enforcements crawlers

Limit enforcements to the enforcement age limit

Enforcement pages often reach back decades, listing historical notices of questionable relevance that can add a significant maintenance burden (in crawler code and review effort).

We have a standard support period defined for enforcement actions and a helper to check whether an action date is within scope:

    for row in h.parse_html_table(table):
        enforcement_date = h.element_text(row["date"])
        if not enforcements.within_max_age(context, enforcement_date):
            return
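
The early `return` in the loop above stops at the first out-of-scope row, which presumably suits tables listed newest first. The following standard-library sketch contrasts stopping with skipping; `MAX_AGE_DAYS`, `is_recent`, and the sample rows are hypothetical stand-ins and not zavod names.

```python
from datetime import date, timedelta

# Hypothetical cutoff; the real default lives in MAX_ENFORCEMENT_DAYS.
MAX_AGE_DAYS = 5 * 365


def is_recent(iso_date: str, today: date, max_age_days: int = MAX_AGE_DAYS) -> bool:
    """Return True if iso_date (YYYY-MM-DD) is newer than the cutoff."""
    cutoff = today - timedelta(days=max_age_days)
    return date.fromisoformat(iso_date) > cutoff


def crawl_sorted(rows: list[str], today: date) -> list[str]:
    """Rows are newest first, so stop at the first row that is too old."""
    kept = []
    for row_date in rows:
        if not is_recent(row_date, today):
            break
        kept.append(row_date)
    return kept


def crawl_unsorted(rows: list[str], today: date) -> list[str]:
    """Row order is unknown, so skip old rows but keep scanning."""
    return [row_date for row_date in rows if is_recent(row_date, today)]


today = date(2025, 1, 1)
rows = ["2024-06-01", "2010-03-15", "2023-11-30"]
sorted_result = crawl_sorted(rows, today)
unsorted_result = crawl_unsorted(rows, today)
```

If a source does not guarantee ordering, skipping (the equivalent of `continue`) is likely the safer choice, at the cost of scanning every row.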

zavod.shed.enforcements.within_max_age(context, date, max_age_days=MAX_ENFORCEMENT_DAYS)

Check if an enforcement date is within the maximum age of enforcement actions.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| context | Context | The runner context with dataset metadata. | required |
| date | datetime \| str | The enforcement date to check. | required |
| max_age_days | int | The maximum age of enforcement actions in days, if different from the default. | MAX_ENFORCEMENT_DAYS |

Source code in zavod/shed/enforcements.py
def within_max_age(
    context: Context,
    date: datetime | str,
    max_age_days: int = MAX_ENFORCEMENT_DAYS,
) -> bool:
    """
    Check if an enforcement date is within the maximum age of enforcement actions.

    Args:
        context: The runner context with dataset metadata.
        date: The enforcement date to check.
        max_age_days: The maximum age of enforcement actions in days, if different from the default.
    """
    if isinstance(date, str):
        date = date.strip()
    cleaned_date = h.extract_date(context.dataset, date, fallback_to_original=False)[0]
    return cleaned_date > h.backdate(RUN_TIME, max_age_days)
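
The final comparison appears to operate on date strings rather than `datetime` objects, assuming `h.extract_date` and `h.backdate` both yield ISO 8601 strings (an assumption here, not something the listing confirms). The sketch below shows why such a comparison can work: zero-padded ISO dates sort lexicographically in chronological order. `backdate_sketch` is a hypothetical stand-in, not the zavod helper.

```python
from datetime import datetime, timedelta


def backdate_sketch(run_time: datetime, days: int) -> str:
    """Hypothetical stand-in: the date `days` before run_time as YYYY-MM-DD."""
    return (run_time - timedelta(days=days)).date().isoformat()


run_time = datetime(2025, 1, 1, 12, 0, 0)
cutoff = backdate_sketch(run_time, 365)

# Zero-padded ISO strings compare in the same order as the dates they encode.
recent_ok = "2024-06-30" > cutoff
old_ok = "2023-12-31" > cutoff

# A year-only string is a prefix of any full date in that year, so it sorts lower.
partial_ok = "2024" > cutoff
```

Under plain string comparison, a partial date such as a bare year would be treated as older than any full date in that year, which may matter for sources that publish only a year.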

Create Article and Documentation entities

Create an Article for each notice or press release, and a Documentation for each distinct entity emitted based on that article.

The Article is not normally meant to carry the content of the notice; its purpose is to make it easy to find all the significant entities mentioned in the same document.

zavod.helpers.articles.make_article(context, url, key_extra=None, title=None, published_at=None)

Create an article entity based on the URL where it was published.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| context | Context | The runner context with dataset metadata. | required |
| url | str | The URL where the article was published. | required |
| key_extra | Optional[str] | An optional value to be included in the generated Article ID hash. | None |
| title | Optional[str] | The title of the article. | None |
| published_at | Optional[str] | The publication date of the article. | None |

Source code in zavod/helpers/articles.py
def make_article(
    context: Context,
    url: str,
    key_extra: Optional[str] = None,
    title: Optional[str] = None,
    published_at: Optional[str] = None,
) -> Entity:
    """
    Create an article entity based on the URL where it was published.

    Args:
        context: The runner context with dataset metadata.
        url: The URL where the article was published.
        key_extra: An optional value to be included in the generated Article ID hash.
        title: The title of the article.
        published_at: The publication date of the article.
    """

    article = context.make("Article")
    article.id = context.make_id("Article", url, key_extra)
    article.add("sourceUrl", url)
    article.add("title", title)
    h.apply_date(article, "publishedAt", published_at)

    return article
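
The Article ID is derived from the URL and the optional `key_extra`, so repeated runs should yield the same ID for the same notice, and `key_extra` can separate several notices published at one URL. The sketch below illustrates that idea with a hypothetical `make_id_sketch` built on `hashlib`; it is not zavod's `context.make_id`, whose exact hashing and prefixing are not shown here.

```python
import hashlib
from typing import Optional


def make_id_sketch(*parts: Optional[str]) -> Optional[str]:
    """Hypothetical stand-in for an ID helper: hash the non-empty parts."""
    cleaned = [p.strip() for p in parts if p is not None and p.strip()]
    if not cleaned:
        return None
    digest = hashlib.sha1("|".join(cleaned).encode("utf-8")).hexdigest()
    return f"sketch-{digest}"


url = "https://example.org/enforcement/notices"  # placeholder URL

# Same inputs give the same ID on every run.
id_a = make_id_sketch("Article", url, None)
id_b = make_id_sketch("Article", url, None)

# Two notices on one page need key_extra to stay distinct.
id_notice_1 = make_id_sketch("Article", url, "notice-1")
id_notice_2 = make_id_sketch("Article", url, "notice-2")
```

When every notice has its own URL, `key_extra` can probably be left at its default; it becomes useful when a single listing page carries several notices.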

zavod.helpers.articles.make_documentation(context, entity, article, key_extra=None, date=None)

Creates a documentation entity to link an article to a related entity. The article's publishedAt date is added to the Documentation date property unless the date argument is provided.

This is useful to link one or more entities to an article they were mentioned in.

Create a distinct Documentation entity for each entity-article pair.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| context | Context | The runner context with dataset metadata. | required |
| entity | Entity | The entity related to the article. | required |
| article | Entity | The related article. | required |
| key_extra | Optional[str] | An optional value to be included in the generated Documentation ID hash. | None |
| date | Optional[str] | The publication date of the article, added to the Documentation date property. | None |

Source code in zavod/helpers/articles.py
def make_documentation(
    context: Context,
    entity: Entity,
    article: Entity,
    key_extra: Optional[str] = None,
    date: Optional[str] = None,
) -> Entity:
    """
    Creates a documentation entity to link an article to a related entity.
    The article's publishedAt date is added to the Documentation date property
    unless the date argument is provided.

    This is useful to link one or more entities to an article they were mentioned in.

    Create a distinct Documentation entity for each entity-article pair.

    Args:
        context: The runner context with dataset metadata.
        entity: The entity related to the article.
        article: The related article.
        key_extra: An optional value to be included in the generated Documentation ID hash.
        date: The publication date of the article, added to the Documentation date property.
    """

    documentation = context.make("Documentation")
    assert entity.id is not None
    assert article.id is not None
    documentation.id = context.make_id(
        "Documentation", entity.id, article.id, key_extra
    )
    documentation.add("entity", entity)
    documentation.add("document", article)

    if date:
        h.apply_date(documentation, "date", date)
    else:
        documentation.set("date", article.get("publishedAt"))
    return documentation
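
To illustrate the rule of one Documentation per entity-article pair and the date fallback, here is a simplified model using plain dictionaries. The `documentation_sketch` function and the sample IDs are hypothetical; a real crawler would call `make_article` and `make_documentation` with a zavod `Context` instead.

```python
from typing import Optional


def documentation_sketch(
    entity_id: str,
    article: dict,
    date: Optional[str] = None,
) -> dict:
    """Hypothetical model of a Documentation link between entity and article."""
    return {
        # The key combines both IDs, so each pair gets its own record.
        "key": (entity_id, article["id"]),
        "entity": entity_id,
        "document": article["id"],
        # An explicit date wins; otherwise fall back to the article's date.
        "date": date if date else article.get("publishedAt"),
    }


article = {"id": "article-1", "publishedAt": "2024-05-02"}
entities = ["person-a", "company-b"]

# One Documentation per entity mentioned in the article.
docs = [documentation_sketch(entity_id, article) for entity_id in entities]
overridden = documentation_sketch("person-a", article, date="2024-04-30")
```

Keying each record on both IDs means that the same entity mentioned in two articles, or two entities mentioned in one article, never share a Documentation.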