# Data Reviews
When we don't believe that automated extraction will be sufficiently accurate, we can use Data Reviews to have human reviewers check, and where necessary fix, extraction results before they are accepted.
## Context
We want the following properties from reviews:

- We want to be notified when there are new reviews that need attention.
- Data removed from the source should also drop out of the dataset.
- If the source data changes in a way that changes the automated extraction, e.g. for a correction, we want to update and re-review the data.
- If we change the data model, e.g. to extract additional fields, we want to be able to decide whether existing reviews should be redone or can remain backward compatible.
- We want incompatible data model changes to fail early and loudly.
- We want user-edited data changes to be validated early (ideally in the UI) to avoid a painfully slow review/editing turnaround.
## Implementation
The basic workflow is:

- Define a pydantic model for the data.
- Perform the automated extraction.
- Call `review = review_extraction(...)`.
- If `review.accepted` is true, use `review.extracted_data`.
- Assert that all the reviews related to a given crawler are accepted using `assert_all_accepted`, opting to either emit a warning or raise an exception. An exception is useful to prevent publishing a partial dataset, i.e. if we would prefer to hold off publishing a new release until all the data has been accepted.
- After the crawler has run, reviewers can review the data in Zavod UI and correct/accept extraction results. The crawler can then use the accepted data in its next run.
For example, imagine a crawler that crawls web pages containing regulatory notices:
```python
from typing import List, Literal

from lxml.etree import _Element
from pydantic import BaseModel, Field

from zavod import Context
from zavod.shed import gpt
from zavod.shed.gpt import run_typed_text_prompt
from zavod.stateful.review import (
    HtmlSourceValue,
    assert_all_accepted,
    review_extraction,
)

Schema = Literal["Person", "Company", "LegalEntity"]

schema_field = Field(
    description=(
        "Use LegalEntity if it isn't clear whether the entity is a person or a company."
    )
)


class Defendant(BaseModel):
    entity_schema: Schema = schema_field
    name: str


class Defendants(BaseModel):
    defendants: List[Defendant]


PROMPT = f"""
Extract the defendants in the attached article. ONLY include names mentioned
in the article text.

Instructions for specific fields:
  - entity_schema: {schema_field.description}
"""


def crawl_page(context: Context, url: str, page: _Element) -> None:
    article = page.xpath(".//article")[0]
    source_value = HtmlSourceValue(
        # notice_id() is assumed to be a crawler-specific helper that extracts
        # a stable notice reference from the URL.
        key_parts=notice_id(url),
        label="Notice of regulatory action taken",
        element=article,
        url=url,
    )
    prompt_result = run_typed_text_prompt(
        context,
        prompt=PROMPT,
        string=source_value.value_string,
        response_type=Defendants,
    )
    # If a review has previously been requested for the same
    # source_value.key_parts, it'll be found here.
    review = review_extraction(
        context,
        source_value=source_value,
        original_extraction=prompt_result,
        origin=gpt.DEFAULT_MODEL,
    )
    # Only use the data once it's been accepted by a reviewer.
    if not review.accepted:
        return
    for item in review.extracted_data.defendants:
        entity = context.make(item.entity_schema)
        entity.id = context.make_id(item.name)
        entity.add("name", item.name, origin=review.origin)
        context.emit(entity)


def crawl(context: Context) -> None:
    ...
    for url in urls:
        ...
        crawl_page(context, url, page)
    # This will raise an exception unless all the reviews fetched
    # during a given crawl have `accepted == True`.
    assert_all_accepted(context)
```
`zavod.stateful.review.review_extraction(context, source_value, original_extraction, origin, crawler_version=1, default_accepted=False)`

Ensures a Review exists for the given key to allow human review of automated data extraction from a source value.

- If it's new, `extracted_data` will default to `original_extraction` and `accepted` to `default_accepted`. `last_seen_version` is always updated to the current crawl version.
- If it's not accepted yet, `original_extraction` and `extracted_data` will be updated.
- If both `source_value` and `original_extraction` have changed, or `crawler_version` has been bumped, all values are reset as if it's new.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `context` | `Context` | The runner context with dataset metadata. | *required* |
| `source_value` | `SourceValue` | The source value for the extracted data. | *required* |
| `original_extraction` | `ModelType` | An instance of a pydantic model of data extracted from the source. Any reviewer changes to the data will be validated against this model. Initially, `extracted_data` defaults to this. | *required* |
| `origin` | `str` | A short string indicating the origin of the extraction, e.g. the model name or "lookups" if it's backfilled from datapatches lookups. | *required* |
| `crawler_version` | `int` | A version number a crawler can use as a checkpoint for changes requiring re-extraction and/or re-review. Don't bump this for every crawler change, only for backward incompatible data model changes or to force re-review for some other reason. | `1` |
| `default_accepted` | `bool` | Whether the data should be marked as accepted on creation or reset. | `False` |
Source code in zavod/stateful/review.py
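To make the update rules above concrete, here is a minimal stdlib-only sketch of the same lifecycle logic. This is a hypothetical illustration, not the actual implementation in `zavod/stateful/review.py`; the field and parameter names mirror the description above.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Review:
    original_extraction: Any
    extracted_data: Any
    accepted: bool
    crawler_version: int
    last_seen_version: str


def upsert_review(
    existing: Optional[Review],
    source_changed: bool,
    original_extraction: Any,
    crawler_version: int,
    current_crawl_version: str,
    default_accepted: bool = False,
) -> Review:
    # Reset as if new when both the source and the extraction changed,
    # or when the crawler_version checkpoint has been bumped.
    reset = existing is not None and (
        (source_changed and original_extraction != existing.original_extraction)
        or crawler_version > existing.crawler_version
    )
    if existing is None or reset:
        return Review(
            original_extraction=original_extraction,
            extracted_data=original_extraction,
            accepted=default_accepted,
            crawler_version=crawler_version,
            last_seen_version=current_crawl_version,
        )
    if not existing.accepted:
        # Not accepted yet: keep tracking the latest automated extraction.
        existing.original_extraction = original_extraction
        existing.extracted_data = original_extraction
    # last_seen_version is always updated, so data that drops out of the
    # source stops being marked as seen.
    existing.last_seen_version = current_crawl_version
    return existing
```

Walking a review through a few crawls with this sketch shows that accepted data survives extraction changes, while a `crawler_version` bump forces a re-review.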
`zavod.stateful.review.assert_all_accepted(context, *, raise_on_unaccepted=True)`

Raises an exception or logs a warning, with the number of unaccepted items, if any extraction entries for the current dataset and version are not accepted yet.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `context` | `Context` | The runner context with dataset metadata. | *required* |
| `raise_on_unaccepted` | `bool` | Whether to raise an exception if there are unaccepted items. If False, a warning will be logged instead. | `True` |
          
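The warn-versus-raise choice can be sketched with a stdlib-only toy. This is hypothetical: the real function checks the review store for the current dataset and version rather than taking flags as an argument.

```python
import warnings
from typing import Iterable


def assert_all_accepted_toy(
    accepted_flags: Iterable[bool], raise_on_unaccepted: bool = True
) -> None:
    # Count the reviews that are still unaccepted.
    unaccepted = sum(1 for accepted in accepted_flags if not accepted)
    if unaccepted == 0:
        return
    message = f"{unaccepted} review(s) not accepted yet."
    if raise_on_unaccepted:
        # Fail loudly: useful to avoid publishing a partial dataset.
        raise AssertionError(message)
    # Otherwise log a warning and let the crawl complete.
    warnings.warn(message)
```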
`zavod.stateful.review.HtmlSourceValue`

Bases: `SourceValue`

`__init__(key_parts, label, element, url)`

Sets `value_string` as the serialized HTML of the element.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `key_parts` | `str \| List[str]` | Information from the source that uniquely and consistently identifies the review within the dataset. For an enforcement action, that might be an action reference number or publication URL (for lack of a consistent identifier). | *required* |
`zavod.stateful.review.TextSourceValue`

Bases: `SourceValue`

`__init__(key_parts, label, text, url=None)`

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `key_parts` | `str \| List[str]` | Information from the source that uniquely and consistently identifies the review within the dataset. For a string of names that rarely changes, that string itself might work. | *required* |
`matches(review)`

Performs the same normalisation as `review_key`, so that multiple source values normalising to the same key are considered a match.
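To illustrate why two source values can match even when their raw keys differ, here is a hypothetical sketch of key slugification. The real `review_key` in `zavod.stateful.review` may normalise differently; this only shows the general idea.

```python
import re
from typing import List, Union


def review_key(key_parts: Union[str, List[str]]) -> str:
    # Hypothetical slugification: join the parts, lowercase them, and
    # collapse runs of non-alphanumeric characters into single dashes.
    if isinstance(key_parts, str):
        key_parts = [key_parts]
    joined = " ".join(key_parts)
    return re.sub(r"[^a-z0-9]+", "-", joined.lower()).strip("-")
```

Under this sketch, `"John, Sally, and partners"` and `"john sally and partners"` normalise to the same key and would therefore match.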
## Best practices
### Review keys
The key should uniquely identify a given piece of data extraction/review content. Ideally it should remain consistent despite changes to the content, but this isn't always possible. Key input gets slugified by `review_extraction`.

For free text in a CSV cell that doesn't have a consistent identifier, e.g. `John, Sally, and partners`, just use the string itself as the key.

For web pages, e.g. of regulatory notices, try to use the notice ID if one can be extracted reliably, rather than the web page URL: the URL can change if the website is reorganised, while the notice ID can likely still be extracted consistently despite such a change.
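A hypothetical `notice_id` helper shows the idea: the extracted reference survives a website reorganisation, while the URL would not. The URL shapes and reference format here are invented for illustration.

```python
import re


def notice_id(url: str) -> str:
    # Pull a stable reference like "2021-034" out of the URL so the review
    # key survives a website reorganisation.
    match = re.search(r"notice[-/](\d{4}-\d{3})", url)
    if match is None:
        # Fall back to the URL itself when no stable ID can be found.
        return url
    return match.group(1)
```

The same key is produced before and after a hypothetical site restructure, so existing reviews keep matching.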
### Model Documentation
Use model documentation (e.g. `fieldname: MyEnum = Field(description="...")`) to explain how fields should be extracted. This gets included in the JSON schema, which makes it available to the human reviewer in Zavod UI.

OpenAI's structured output API doesn't seem to support JSON schema `description` properties yet, so also include the description explicitly in the prompt. See the example above.