# Data Reviews

When we don't expect automated extraction to be sufficiently accurate on its own, we can use Data Reviews to have human reviewers check, and if necessary fix, extraction results before they are accepted.
## Context

- Re-prompting an LLM with 1000 text attachments costs perhaps around $10.
- Re-reviewing 1000 items is a significant amount of human work.
We want the following properties from reviews:

- We want to be notified when there are new reviews that need attention.
- If the source data changes, we can evaluate whether the change should trigger re-extraction and/or re-reviewing.
    - We want to be notified when this occurs.
- If we change the data model, e.g. to extract additional fields, we want to be able to decide whether existing reviews should be redone or can be treated as backward compatible.
    - We want incompatible data model changes to fail early and loudly.
- We want user-edited data changes to be validated early (ideally in the UI) to avoid a painfully slow review/editing turnaround.
## Implementation

The basic workflow is:

1. Define a pydantic model for the data.
2. Check if a review exists for the given review key (with a minimum compatible version).
3. If the review exists and it is accepted, use the extracted data in the crawler.
4. If not:
    1. Perform some automated data extraction, perhaps using an LLM.
    2. Request a review (optionally with `accepted=True` if the risk of bad extraction is extremely low).
5. Assert that all the reviews related to a given crawler are accepted, opting either to emit a warning or to raise an exception. An exception is useful to prevent publishing a partial dataset, if we would prefer to hold off publishing a new release until the data has been accepted.
For example, imagine a crawler crawling web pages with regulatory notices.
```python
from typing import List, Literal

from lxml.etree import _Element, tostring
from pydantic import BaseModel, Field

from zavod import Context
from zavod.shed.gpt import run_typed_text_prompt
# assert_all_accepted is assumed to live alongside the other review helpers.
from zavod.stateful.review import assert_all_accepted, get_review, request_review

VERSION = 1
MIN_VERSION = 1

Schema = Literal["Person", "Company", "LegalEntity"]


class Defendant(BaseModel):
    entity_schema: Schema = Field(
        description=(
            "Use LegalEntity if it isn't clear whether the entity is a person or a company."
        )
    )
    name: str


class Defendants(BaseModel):
    defendants: List[Defendant]


PROMPT = """
Extract the defendants in the attached article. Only include names mentioned
in the article text.
"""


def crawl_page(context: Context, url: str, page: _Element) -> None:
    article = page.xpath(".//article")[0]
    html = tostring(article, pretty_print=True).decode("utf-8")
    # Hash of the normalised article text, used to detect source changes
    # (html_to_text_hash is a crawler helper; see "Detecting changes" below).
    text_hash = html_to_text_hash(article)

    # This tries to fetch a review and update its last-seen version. It will
    # validate existing data against the provided Defendants model.
    # Use the same key here as in request_review below.
    review = get_review(context, Defendants, notice_id(url), MIN_VERSION)
    if review is None:
        prompt_result = run_typed_text_prompt(
            context, PROMPT, html, response_type=Defendants
        )
        # This requests a review, replacing an existing review if one exists
        # (e.g. with an older crawler version)
        review = request_review(
            context,
            # notice_id() is a crawler-specific helper; see "Review keys" below.
            key=notice_id(url),
            source_value=html,
            source_data_hash=text_hash,
            source_mime_type="text/html",
            source_label="Enforcement Action Notice",
            source_url=url,
            orig_extraction_data=prompt_result,
            crawler_version=VERSION,
        )

    if not review.accepted:
        return

    for item in review.extracted_data.defendants:
        entity = context.make(item.entity_schema)
        entity.id = context.make_id(item.name)
        entity.add("name", item.name)
        context.emit(entity)


def crawl(context: Context) -> None:
    ...
    for url in urls:
        ...
        crawl_page(context, url, page)
    # This will raise an exception unless all the reviews fetched or
    # validated during a given crawl have `accepted == True`.
    assert_all_accepted(context)
```
## Best practices

### Review keys

The key should uniquely identify a given piece of data extraction/review content. Ideally it should stay consistent in spite of changes to the content, but this isn't always possible. Key input gets slugified by the review functions.

For free text in a CSV cell that doesn't have a consistent identifier, e.g. `John, Sally, and partners`, just use the string as the key.

For web pages, e.g. of regulatory notices, try to use the notice ID if one can be extracted reliably, rather than the web page URL: the URL can change if the site is reorganised, while the notice ID can likely still be extracted consistently despite such a change.
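The `notice_id` helper used in the crawler sketch above is crawler-specific; a minimal, hypothetical version might pull a stable identifier out of the URL and fall back to the URL itself:

```python
import re


def notice_id(url: str) -> str:
    # Hypothetical: assumes notice URLs embed an identifier like "enf-2024-123".
    match = re.search(r"enf-\d{4}-\d+", url)
    if match:
        return match.group(0)
    # Fall back to the URL if no stable identifier can be found; the review
    # functions will slugify whatever we return.
    return url
```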
### Model Documentation

Use model documentation (e.g. `fieldname: MyEnum = Field(description="...")`) to explain how fields should be extracted. This gets included in the JSON schema, so it is made available to both the LLM and the human reviewer.
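As a small standalone illustration (with a pared-down model), the description passed to `Field` ends up in the JSON schema that pydantic generates:

```python
from pydantic import BaseModel, Field


class Defendant(BaseModel):
    name: str = Field(
        description="The defendant's name exactly as written in the article."
    )


# The description is part of the generated JSON schema, which is what gets
# shown to both the LLM and the human reviewer.
schema = Defendant.model_json_schema()
print(schema["properties"]["name"]["description"])
```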
### Handling changes

The following changes are anticipated:

| Change | Example | Strategy |
|---|---|---|
| The source data changes | A spelling mistake in a name was fixed | Find the review in the review UI by its key, mark it as unaccepted, and let the next crawl re-request a review. |
| The source data changes | They changed the markup on their site | We're not sure how consistently LLMs will extract data. We've seen variation from one query to the next for the same prompt and data when more complex data was requested, e.g. the relationship between entities in a document. For simple things like splitting 100 names in a CSV, it's probably safe to just throw it at GPT and review the changes that come up. For thousands of enforcement press releases, consider detecting changes before triggering re-reviews, to understand the scale. See Detecting changes below. |
| We change the data model in a backward-compatible way | We add an optional field | Make sure to define a default value in the model (e.g. `None`) so that the existing data can be used to create valid instances of the model (see the sketch after this table). |
| We change the data model in an incompatible way | We add a new required field | Increment `MIN_VERSION` to trigger re-extraction and re-review. |
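For the backward-compatible case, here is a sketch of adding an optional field, using pared-down versions of the models above; the default value keeps previously stored data valid:

```python
from typing import List, Optional

from pydantic import BaseModel


class Defendant(BaseModel):
    entity_schema: str
    name: str
    # Newly added optional field: the default means data extracted and
    # reviewed before this field existed still validates.
    nationality: Optional[str] = None


class Defendants(BaseModel):
    defendants: List[Defendant]


# Review data stored by an earlier crawler version, without "nationality":
old_data = {"defendants": [{"entity_schema": "Person", "name": "Jane Doe"}]}
Defendants.model_validate(old_data)  # still valid; nationality defaults to None
```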
### Detecting changes

You could take a couple of strategies to detect the scale of changes in source and extracted data before allowing the crawler to trigger massive sets of re-reviews:

#### Warn about all changes without publishing

Until we know how stable GPT extraction is for a given crawler, we might want to be notified of changes in the source and changes in the GPT extraction response. We can do that by hashing the data (appropriately normalised) and comparing it to the stored versions.
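The `html_to_text_hash` and `model_hash` helpers used below are not part of the review API shown above; a minimal sketch, assuming we hash the normalised text content and the serialised model respectively, could look like this:

```python
import hashlib

from lxml.html import HtmlElement
from pydantic import BaseModel


def html_to_text_hash(element: HtmlElement) -> str:
    # Hash only the whitespace-normalised text content so that markup-only
    # changes don't register as source changes.
    text = " ".join(element.text_content().split())
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def model_hash(model: BaseModel) -> str:
    # Hash the JSON serialisation of the extracted data model.
    return hashlib.sha1(model.model_dump_json().encode("utf-8")).hexdigest()
```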
```python
import lxml.html

something_changed = False


# Sketch: a variant of crawl_page/crawl from the example above, with most
# details elided.
def crawl_page():
    for page in pages:
        ...
        # Compare the normalised text of the article as it was when the review
        # was requested with the version seen in this crawl.
        seen_article = lxml.html.fromstring(review.source_value)
        if html_to_text_hash(seen_article) != html_to_text_hash(article_element):
            global something_changed
            something_changed = True
            prompt_result = run_typed_text_prompt(
                context, PROMPT, article_html_string, Defendants
            )
            if model_hash(prompt_result) != model_hash(review.orig_extraction_data):
                context.log.warning(
                    "The extracted data has changed",
                    url=url,
                    orig_extraction_data=review.orig_extraction_data.model_dump(),
                    prompt_result=prompt_result.model_dump(),
                )
            else:
                context.log.warning(
                    "The source content has changed but the extracted data has not",
                    url=url,
                    seen_source_value=review.source_value,
                    new_source_value=article_html_string,
                )
            # Skip the rest of the processing for this page (e.g. emitting entities).
            continue


def crawl():
    ...
    assert not something_changed, "See what changed to determine whether to trigger re-review."
```
#### Request re-review if the extraction result changed

If we've seen an abundance of evidence that

- the automated extraction won't miss real data changes, e.g. corrections by the source, and
- the automated extraction doesn't falsely give different responses when queried with effectively the same input, e.g. a bulk markup change with no content change,

then we can allow the crawler to trigger re-extraction and re-reviews automatically if the data changed:
```python
for page in pages:
    ...
    seen_article = lxml.html.fromstring(review.source_value)
    if html_to_text_hash(seen_article) != html_to_text_hash(article_element):
        prompt_result = run_typed_text_prompt(
            context, PROMPT, html, response_type=Defendants
        )
        if model_hash(prompt_result) != model_hash(review.orig_extraction_data):
            # Replace the existing review with the new extraction result so
            # that it gets reviewed again.
            review = request_review(...)
```