HTTP Operations

Making requests

Use context methods for HTTP operations, as they handle common tasks such as caching, using request sessions with sensible retry defaults, and checking the response status.

text = context.fetch_text(context.data_url, method="POST", headers=headers, data=body, cache_days=cache_days)

Handling bot blocking

Many sites employ bot blocking strategies. We believe this is primarily to mitigate Denial of Service attacks and manage server load, rather than to protect the content from extraction, since the purpose of the sites we scrape is to disseminate their block lists. As long as we are sensitive to our impact on their service and identifiable in our requests, we believe it is OK to work around their bot blocking strategies.

Blocking might result in error statuses like 403; redirects to error pages; or 200 status responses but with different content from what you've seen in the browser.
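The three symptoms above can be turned into a quick check while debugging a crawler. The following is only a sketch: `looks_blocked` and its arguments are hypothetical names, not part of zavod, and treating every redirect as suspicious is an assumption that will not hold for every site.

```python
from typing import Optional


def looks_blocked(
    status_code: int,
    requested_url: str,
    final_url: str,
    body: str,
    expected_marker: str,
) -> Optional[str]:
    """Return a short reason if the response looks blocked, else None."""
    # Symptom 1: an error status such as 403.
    if status_code >= 400:
        return f"error status {status_code}"
    # Symptom 2: we ended up somewhere other than the page we asked for.
    if final_url.rstrip("/") != requested_url.rstrip("/"):
        return f"redirected to {final_url}"
    # Symptom 3: a 200 response that lacks text seen in the browser.
    if expected_marker not in body:
        return "expected content missing"
    return None
```

Pick `expected_marker` from text that you have confirmed in the browser and that would not appear on an error or challenge page.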

Header-based restrictions

If a request using zavod fails but your browser succeeds, try setting a more browser-like user-agent header.

http:
  user_agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36 (zavod; opensanctions.org)

If that doesn't work, try more of the common headers sent by browsers:

HEADERS = {
    "origin": "https://www.interpol.int",
    "referer": "https://www.interpol.int/",
    "sec-fetch-mode": "navigate",
    "sec-fetch-site": "none",
    "sec-fetch-user": "?1",
    "upgrade-insecure-requests": "1",
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36 (zavod; opensanctions.org)",
}
context.fetch_...(url, headers=HEADERS)

Network/geo-blocking

If a request fails in production but not locally, the site might be blocking the production network IP range. Websites intended only for humans commonly block hosting provider networks. It's also common to block requests from countries other than the publisher's; if the request works using a VPN exit point in that country, pass that country to the geolocation argument.

In both cases, route the request through the Zyte API (zavod.extract.zyte_api), which proxies and unblocks requests. fetch_text, fetch_json and fetch_resource use the httpResponseBody scrape type, which is faster and cheaper than browserHtml; fetch_html defaults to browserHtml but accepts html_source="httpResponseBody". Prefer httpResponseBody unless the page needs JavaScript (see JavaScript challenges below).

JavaScript challenges

If a page works in the browser but you see different content when fetching it with zavod or curl, there might be a JavaScript challenge that checks whether a full browser is rendering the page. This usually sets a cookie so the browser doesn't have to complete the challenge on each request. These challenges can also be intermittent.

Use zyte_api.fetch_html, whose html_source defaults to "browserHtml": it renders the page in a browser, executes the JavaScript, and returns the resulting DOM as parsed HTML. It requires an unblock_validator XPath that matches only when unblocking succeeded, so challenge pages aren't cached. If you need to wait for specific content or click to reveal data, use the actions argument.
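To illustrate why the validator must match only on success, here is a small sketch. It uses the standard library's ElementTree on well-formed markup as a stand-in; zavod itself evaluates the validator with lxml's full XPath support, so a real validator would look more like `//table[@id="results"]`. The pages, the `is_unblocked` helper and the element ids are made up for the example.

```python
import xml.etree.ElementTree as ET


def is_unblocked(markup: str, validator: str) -> bool:
    """True if the validator path matches at least one element."""
    root = ET.fromstring(markup)
    return len(root.findall(validator)) > 0


REAL_PAGE = (
    "<html><body><table id='results'>"
    "<tr><td>Jane Doe</td></tr></table></body></html>"
)
CHALLENGE_PAGE = (
    "<html><body><div id='challenge'>Checking your browser</div></body></html>"
)

# Too loose: every page has a <body>, so a challenge page would pass.
LOOSE = ".//body"
# Specific: only the real page contains the results table.
SPECIFIC = ".//table[@id='results']"

loose_accepts_challenge = is_unblocked(CHALLENGE_PAGE, LOOSE)
specific_accepts_challenge = is_unblocked(CHALLENGE_PAGE, SPECIFIC)
specific_accepts_real = is_unblocked(REAL_PAGE, SPECIFIC)
```

A validator that targets the element holding the data you intend to parse tends to be the safest choice, because a page without it is useless to the crawler anyway.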

The Zyte API

The helpers in zavod.extract.zyte_api route requests through the Zyte API, which handles proxying, geolocation, and browser rendering. They require the OPENSANCTIONS_ZYTE_API_KEY environment variable, so crawlers that use them set ci_test: false. Most take an optional geolocation (e.g. "US") and cache_days. For requests that need a POST body, custom headers, cookies, or browser actions, build a ZyteAPIRequest and call the lower-level fetch.
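To make the mapping from request options to the Zyte API payload concrete, the sketch below mirrors the payload-building logic from the `fetch` source listed further down this page. `build_zyte_payload` is a hypothetical standalone function written for illustration; in a crawler you would build a ZyteAPIRequest and call `fetch` instead. The URL, headers and body are placeholders.

```python
from base64 import b64encode
from typing import Any, Dict, Optional


def build_zyte_payload(
    url: str,
    scrape_type: str = "httpResponseBody",
    method: Optional[str] = None,
    body: Optional[bytes] = None,
    headers: Optional[Dict[str, str]] = None,
    geolocation: Optional[str] = None,
) -> Dict[str, Any]:
    """Build the JSON payload for one Zyte API request."""
    payload: Dict[str, Any] = {"url": url, "httpResponseHeaders": True}
    if method is not None:
        payload["httpRequestMethod"] = method
    if body is not None:
        # Request bodies are sent base64-encoded.
        payload["httpRequestBody"] = b64encode(body).decode("utf-8")
    if headers is not None:
        # Headers are sent as a list of name/value pairs.
        payload["customHttpRequestHeaders"] = [
            {"name": k, "value": v} for k, v in headers.items()
        ]
    if geolocation is not None:
        payload["geolocation"] = geolocation
    # The scrape type is requested by setting its key to True.
    payload[scrape_type] = True
    return payload


payload = build_zyte_payload(
    "https://example.org/search",
    method="POST",
    body=b'{"q": "test"}',
    headers={"Content-Type": "application/json"},
    geolocation="US",
)
```

Because the cache fingerprint is computed from this payload, changing any option (for example the geolocation) results in a separate cache entry.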

zavod.extract.zyte_api.fetch_text(context, url, geolocation=None, cache_days=None, expected_media_type=None, expected_charset=None)

Fetch a text document using the Zyte API.

The content type and charset can be used to assert expected types (and successful unblocking by Zyte) and for appropriate text decoding when the encoding can vary. Do so via the expected_ arguments unless more logic is required.

Parameters:

- context (Context, required): The context object.
- url (str, required): The URL of the text document.
- geolocation (Optional[str], default None): The country the request should appear to come from, e.g. "US".
- cache_days (Optional[int], default None): The allowed age of a cache hit.
- expected_media_type (Optional[str], default None): If set, assert that the media type in the response content-type header matches this value.
- expected_charset (Optional[str], default None): If set, assert that the charset in the response content-type header matches this value.

Returns:

Tuple[bool, str | None, str | None, str]: A tuple of:

- A boolean indicating whether the text was cached.
- The media type of the response, None if cached.
- The charset of the response, None if cached.
- The text content.
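The media type and charset come from the response's content-type header. As a rough illustration of that split, the sketch below parses a header list shaped like the name/value pairs used elsewhere in this module. `parse_content_type` is a hypothetical helper, not zavod's own `get_content_type`, and the assumption that both values come back lowercased applies only to this sketch.

```python
from email.message import Message
from typing import Dict, List, Optional, Tuple


def parse_content_type(
    headers: List[Dict[str, str]],
) -> Tuple[Optional[str], Optional[str]]:
    """Return (media_type, charset) from a list of name/value headers."""
    for header in headers:
        if header.get("name", "").lower() == "content-type":
            msg = Message()
            msg["content-type"] = header.get("value", "")
            # Both accessors normalise their result to lower case.
            return msg.get_content_type(), msg.get_content_charset()
    # No content-type header at all.
    return None, None


csv_type = parse_content_type(
    [{"name": "Content-Type", "value": "text/csv; charset=ISO-8859-1"}]
)
json_type = parse_content_type(
    [{"name": "content-type", "value": "application/json"}]
)
```

When a source is inconsistent about its encoding, inspecting the returned charset yourself is the "more logic" case in which the expected_ arguments are not enough.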

Source code in zavod/extract/zyte_api.py
def fetch_text(
    context: Context,
    url: str,
    geolocation: Optional[str] = None,
    cache_days: Optional[int] = None,
    expected_media_type: Optional[str] = None,
    expected_charset: Optional[str] = None,
) -> Tuple[bool, str | None, str | None, str]:
    """
    Fetch a text document using the Zyte API.

    The content type and charset can be used to assert expected
    types (and successful unblocking by Zyte) and for appropriate
    text decoding when the encoding can vary. Do so via the expected_
    arguments unless more logic is required.

    Args:
        context: The context object.
        url: The URL of the text document.
        geolocation: The country the request should appear to come from, e.g. "US".
        cache_days: The allowed age of a cache hit.
        expected_media_type: If set, assert that the media type in the
            response content-type header matches this value.
        expected_charset: If set, assert that the charset in the response
            content-type header matches this value.

    Returns:
        A tuple of:
            - A boolean indicating whether the text was cached.
            - The media type of the response, None if cached.
            - The charset of the response, None if cached.
            - The text content.
    """
    zyte_result = fetch(
        context,
        ZyteAPIRequest(
            scrape_type=ZyteScrapeType.HTTP_RESPONSE_BODY,
            url=url,
            geolocation=geolocation,
        ),
        cache_days=cache_days,
    )
    if expected_media_type and zyte_result.media_type:
        assert zyte_result.media_type == expected_media_type, (
            zyte_result.media_type,
            zyte_result.charset,
            url,
        )
    if expected_charset and zyte_result.charset:
        assert zyte_result.charset == expected_charset, (
            zyte_result.media_type,
            zyte_result.charset,
            url,
        )

    if not zyte_result.from_cache and cache_days is not None:
        context.cache.set(zyte_result.cache_fingerprint, zyte_result.response_text)

    return (
        zyte_result.from_cache,
        zyte_result.media_type,
        zyte_result.charset,
        zyte_result.response_text,
    )

zavod.extract.zyte_api.fetch_json(context, url, cache_days=None, expected_media_type='application/json', geolocation=None)

Returns:

Any: A JSON document.

Source code in zavod/extract/zyte_api.py
def fetch_json(
    context: Context,
    url: str,
    cache_days: Optional[int] = None,
    expected_media_type: Optional[str] = "application/json",
    geolocation: Optional[str] = None,
) -> Any:
    """
    Returns:
        A JSON document.
    """

    zyte_result = fetch(
        context,
        ZyteAPIRequest(
            scrape_type=ZyteScrapeType.HTTP_RESPONSE_BODY,
            url=url,
            headers={"Accept": "application/json"},
            geolocation=geolocation,
        ),
        cache_days=cache_days,
    )
    if (
        expected_media_type
        and zyte_result.media_type
        and zyte_result.media_type != expected_media_type
    ):
        msg = f"Expected media type {expected_media_type} but got {zyte_result.media_type} for {url}"
        context.log.error(
            msg,
            expected_media_type=expected_media_type,
            media_type=zyte_result.media_type,
            charset=zyte_result.charset,
            response_text=zyte_result.response_text,
        )
        raise AssertionError(msg)

    doc = json.loads(zyte_result.response_text)

    if not zyte_result.from_cache and cache_days is not None:
        context.cache.set(zyte_result.cache_fingerprint, zyte_result.response_text)
    return doc

zavod.extract.zyte_api.fetch_resource(context, filename, url, expected_media_type=None, expected_charset=None, geolocation=None, method=None, body=None, headers=None)

Fetch a resource using Zyte API and save to filesystem.

The content type and charset can be used to assert expected types (and successful unblocking by Zyte) and for appropriate text decoding when the encoding can vary. Do so via the expected_ arguments unless more logic is required.

Parameters:

- context (Context, required): The context object.
- filename (str, required): The name to use when saving the file.
- url (str, required): The URL of the resource.
- expected_media_type (Optional[str], default None): If set, assert that the media type in the response content-type header matches this value. Not enforced when the file already exists locally.
- expected_charset (Optional[str], default None): If set, assert that the charset in the response content-type header matches this value. Not enforced when the file already exists locally.

Returns:

Tuple[bool, str | None, str | None, Path]: A tuple of:

- A boolean indicating whether the file was cached.
- The media type of the response, None if cached.
- The charset of the response, None if cached.
- The path to the saved file.
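The "cached" flag here refers to the file already existing on disk, not to the HTTP cache. The sketch below shows that short-circuit together with the base64 decoding of the response body, using a fake API call so it runs offline. `save_once` and `fake_api` are hypothetical names for illustration only.

```python
from base64 import b64decode, b64encode
from pathlib import Path
from tempfile import TemporaryDirectory
from typing import Any, Callable, Dict, List, Tuple


def save_once(
    out_path: Path, call_api: Callable[[], Dict[str, Any]]
) -> Tuple[bool, Path]:
    """Write the decoded body to out_path unless the file already exists."""
    if out_path.exists():
        # Treated as cached: the API is not called and nothing is asserted.
        return True, out_path
    response = call_api()
    out_path.parent.mkdir(parents=True, exist_ok=True)
    # The API returns the body base64-encoded; store the raw bytes.
    out_path.write_bytes(b64decode(response["httpResponseBody"]))
    return False, out_path


calls: List[int] = []


def fake_api() -> Dict[str, Any]:
    """Stand-in for the real API call, counting how often it is used."""
    calls.append(1)
    return {"httpResponseBody": b64encode(b"id,name\n1,Jane\n").decode("utf-8")}


with TemporaryDirectory() as tmp:
    target = Path(tmp) / "data" / "source.csv"
    first_cached, _ = save_once(target, fake_api)
    second_cached, _ = save_once(target, fake_api)
    saved = target.read_bytes()
```

This is why the expected media type and charset are not enforced on later runs: once the file exists, there is no response left to inspect.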

Source code in zavod/extract/zyte_api.py
def fetch_resource(
    context: Context,
    filename: str,
    url: str,
    expected_media_type: Optional[str] = None,
    expected_charset: Optional[str] = None,
    geolocation: Optional[str] = None,
    method: Optional[str] = None,
    body: Optional[bytes] = None,
    headers: Optional[Dict[str, str]] = None,
) -> Tuple[bool, str | None, str | None, Path]:
    """
    Fetch a resource using Zyte API and save to filesystem.

    The content type and charset can be used to assert expected
    types (and successful unblocking by Zyte) and for appropriate
    text decoding when the encoding can vary. Do so via the expected_
    arguments unless more logic is required.

    Args:
        context: The context object.
        filename: The name to use when saving the file.
        url: The URL of the resource.
        expected_media_type: If set, assert that the media type in the
            response content-type header matches this value. Not enforced
            when the file already exists locally.
        expected_charset: If set, assert that the charset in the response
            content-type header matches this value. Not enforced
            when the file already exists locally.

    Returns:
        A tuple of:
            - A boolean indicating whether the file was cached.
            - The media type of the response, None if cached.
            - The charset of the response, None if cached.
            - The path to the saved file.
    """
    data_path = dataset_data_path(context.dataset.name)
    out_path = data_path.joinpath(filename)
    if out_path.exists():
        return True, None, None, out_path

    if settings.ZYTE_API_KEY is None:
        raise RuntimeError("OPENSANCTIONS_ZYTE_API_KEY is not set")

    context.log.info("Fetching file", url=url)
    # This repeats a lot of what's in fetch(), but fetch() focuses on decoding either
    # a text response or a base64-encoded response to text, whereas here we want to
    # save the raw bytes to a file. Letting fetch cover both cases seems more complex than
    # just constructing the right request here.
    zyte_data: Dict[str, Any] = {
        "httpResponseBody": True,
        "httpResponseHeaders": True,
    }
    if method is not None:
        zyte_data["httpRequestMethod"] = method
    if body is not None:
        zyte_data["httpRequestBody"] = b64encode(body).decode("utf-8")
    if headers is not None:
        zyte_data["customHttpRequestHeaders"] = [
            {"name": k, "value": v} for k, v in headers.items()
        ]
    if geolocation is not None:
        zyte_data["geolocation"] = geolocation
    context.log.debug(f"Zyte API request: {url}", data=zyte_data)
    zyte_data["url"] = url
    out_path.parent.mkdir(parents=True, exist_ok=True)
    configure_session(context.http)

    api_response = context.http.post(
        ZYTE_API_URL,
        auth=(settings.ZYTE_API_KEY, ""),
        json=zyte_data,
    )
    api_response.raise_for_status()

    file_base64 = api_response.json()["httpResponseBody"]
    with open(out_path, "wb") as fh:
        fh.write(b64decode(file_base64))
    media_type, charset = get_content_type(api_response.json()["httpResponseHeaders"])

    if expected_media_type:
        assert media_type == expected_media_type, (media_type, charset, url)
    if expected_charset:
        assert charset == expected_charset, (media_type, charset, url)

    return False, media_type, charset, out_path

zavod.extract.zyte_api.fetch_html(context, url, unblock_validator, actions=[], html_source='browserHtml', javascript=None, geolocation=None, cache_days=None, retries=3, backoff_factor=3, previous_retries=0, absolute_links=False)

Fetch a web page using the Zyte API.

Parameters:

- unblock_validator (str, required): XPath matching at least one element if and only if unblocking was successful. This is important to ensure we don't cache pages that weren't actually unblocked successfully.
- html_source (str, default 'browserHtml'): browserHtml | httpResponseBody
- retries (int, default 3): The number of times to retry if unblocking fails.
- backoff_factor (int, default 3): Factor to scale the pause between retries.
- absolute_links (bool, default False): Whether to convert relative links to absolute links. Doesn't take redirects into account.

Returns:

_Element: The parsed HTML document serialized from the DOM.
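To get a feel for how long a failing fetch_html call can block, the pauses can be worked out from the formula in the source listing that follows, `backoff_factor * 2 ** (previous_retries + 1)`. The helper name `retry_pauses` is hypothetical and exists only for this calculation.

```python
from typing import List


def retry_pauses(retries: int = 3, backoff_factor: int = 3) -> List[int]:
    """Seconds slept before each retry, in order."""
    return [backoff_factor * (2 ** (n + 1)) for n in range(retries)]


default_pauses = retry_pauses()
total_wait = sum(default_pauses)
```

With the defaults this gives pauses of 6, 12 and 24 seconds, so a page that never unblocks costs four requests and 42 seconds of sleeping before UnblockFailedException is raised.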

Source code in zavod/extract/zyte_api.py
def fetch_html(
    context: Context,
    url: str,
    unblock_validator: str,
    actions: list[Dict[str, Any]] = [],
    html_source: str = "browserHtml",
    javascript: Optional[bool] = None,
    geolocation: Optional[str] = None,
    cache_days: Optional[int] = None,
    retries: int = 3,
    backoff_factor: int = 3,
    previous_retries: int = 0,
    absolute_links: bool = False,
) -> etree._Element:
    """
    Fetch a web page using the Zyte API.

    Args:
        unblock_validator: XPath matching at least one element if and only if
            unblocking was successful. This is important to ensure we don't cache
            pages that weren't actually unblocked successfully.
        html_source: browserHtml | httpResponseBody
        retries: The number of times to retry if unblocking fails.
        backoff_factor: Factor to scale the pause between retries.
        absolute_links: Whether to convert relative links to absolute links.
            Doesn't take redirects into account.

    Returns:
        The parsed HTML document serialized from the DOM.
    """
    zyte_result = fetch(
        context,
        ZyteAPIRequest(
            scrape_type=ZyteScrapeType(html_source),
            url=url,
            geolocation=geolocation,
            actions=actions,
            javascript=javascript,
        ),
        cache_days=cache_days,
    )

    doc = html.fromstring(zyte_result.response_text)
    if absolute_links and isinstance(doc, html.HtmlElement):
        cast(html.HtmlElement, doc).make_links_absolute(url)

    matches = doc.xpath(unblock_validator)
    if not isinstance(matches, list) or not len(matches) > 0:
        # If we've cached a response that no longer passes validation (likely because the code changed),
        # invalidate it so that we don't just get the same cached response on retry.
        zyte_result.invalidate_cache(context)

        if previous_retries < retries:
            pause = backoff_factor * (2 ** (previous_retries + 1))
            context.log.debug(
                f"Unblocking failed, sleeping {pause}s then retrying",
                url=url,
                retries=retries,
                previous_retries=previous_retries,
            )
            sleep(pause)
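            return fetch_html(
                context,
                url,
                unblock_validator,
                actions,
                html_source=html_source,
                javascript=javascript,
                # Pass geolocation and absolute_links through so that retries
                # behave the same as the first attempt.
                geolocation=geolocation,
                cache_days=cache_days,
                retries=retries,
                backoff_factor=backoff_factor,
                previous_retries=previous_retries + 1,
                absolute_links=absolute_links,
            )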
        context.log.debug("Unblocking failed", url=url, html=zyte_result.response_text)
        raise UnblockFailedException(url, unblock_validator)

    if not zyte_result.from_cache and cache_days is not None:
        context.cache.set(zyte_result.cache_fingerprint, zyte_result.response_text)
    return doc

zavod.extract.zyte_api.fetch(context, zyte_request, cache_days=None)

Fetch using the Zyte API.

Note that this function uses the cache, but does not set the cache. This should be done by callers after verifying that the content is valid and worthy of being cached.

Parameters:

- context (Context, required): The context object.
- zyte_request (ZyteAPIRequest, required): The request to send.
- cache_days (Optional[int], default None): The allowed age of a cache hit.

Returns: A ZyteResult
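The read-then-validate-then-write pattern described above can be sketched without zavod by using a plain dict as the cache. Everything here is hypothetical (`fetch_validated`, the `download` and `is_valid` callables, the fingerprint string); it only illustrates why the caller, not the fetch function, decides what gets cached.

```python
from typing import Callable, Dict, Optional


def fetch_validated(
    cache: Dict[str, str],
    fingerprint: str,
    download: Callable[[], str],
    is_valid: Callable[[str], bool],
) -> str:
    """Read from the cache, but only write to it after validation."""
    text: Optional[str] = cache.get(fingerprint)
    from_cache = text is not None
    if text is None:
        text = download()
    if not is_valid(text):
        # Drop a stale cached entry so a retry does not see it again.
        cache.pop(fingerprint, None)
        raise ValueError("response failed validation")
    if not from_cache:
        cache[fingerprint] = text
    return text


cache: Dict[str, str] = {}
try:
    fetch_validated(
        cache, "fp", lambda: "Checking your browser", lambda t: "results" in t
    )
except ValueError:
    pass
cached_after_block = dict(cache)

page = fetch_validated(
    cache, "fp", lambda: "<table id='results'>", lambda t: "results" in t
)
```

If the fetch function wrote to the cache itself, the challenge page from the first call would be served from the cache on every later run until it expired.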

Source code in zavod/extract/zyte_api.py
def fetch(
    context: Context,
    zyte_request: ZyteAPIRequest,
    cache_days: Optional[int] = None,
) -> ZyteResult:
    """
    Fetch using the Zyte API.

    Note that this function uses the cache, but does not set the cache. This should be done by
    callers after verifying that the content is valid and worthy of being cached.

    Args:
        context: The context object.
        zyte_request: The request to send
        cache_days: The allowed age of a cache hit.
    Returns:
        A ZyteResult
    """

    if settings.ZYTE_API_KEY is None:
        raise RuntimeError("OPENSANCTIONS_ZYTE_API_KEY is not set")

    zyte_data: Dict[str, Any] = {
        "url": zyte_request.url,
        "httpResponseHeaders": True,
    }
    if zyte_request.method is not None:
        zyte_data["httpRequestMethod"] = zyte_request.method
    if zyte_request.body is not None:
        zyte_data["httpRequestBody"] = b64encode(zyte_request.body).decode("utf-8")

    if zyte_request.headers is not None:
        zyte_data["customHttpRequestHeaders"] = [
            {"name": k, "value": v} for k, v in zyte_request.headers.items()
        ]
    if zyte_request.geolocation is not None:
        zyte_data["geolocation"] = zyte_request.geolocation
    if zyte_request.actions is not None:
        zyte_data["actions"] = zyte_request.actions
    if zyte_request.javascript is not None:
        zyte_data["javascript"] = zyte_request.javascript
    if zyte_request.response_cookies:
        zyte_data["responseCookies"] = True
    zyte_data[zyte_request.scrape_type.value] = True

    fingerprint = get_cache_fingerprint(zyte_data)

    if cache_days is not None:
        text = context.cache.get(fingerprint, max_age=cache_days)
        if text is not None:
            context.log.debug(
                "HTTP cache hit", url=zyte_request.url, fingerprint=fingerprint
            )
            return ZyteResult(
                response_text=text,
                status_code=None,
                from_cache=True,
                cache_fingerprint=fingerprint,
            )

    context.log.debug(f"Zyte API request: {zyte_request.url}", data=zyte_data)
    configure_session(context.http)

    api_response = context.http.post(
        ZYTE_API_URL,
        auth=(settings.ZYTE_API_KEY, ""),
        json=zyte_data,
    )
    api_response.raise_for_status()

    text = api_response.json()[zyte_request.scrape_type.value]
    assert text is not None
    media_type, charset = get_content_type(
        api_response.json().get("httpResponseHeaders", [])
    )
    if zyte_request.scrape_type == ZyteScrapeType.HTTP_RESPONSE_BODY:
        b64_text = b64decode(text)
        text = b64_text.decode(charset) if charset is not None else b64_text.decode()

    cookies = (
        api_response.json().get("responseCookies")
        if zyte_request.response_cookies
        else None
    )
    return ZyteResult(
        status_code=api_response.json()["statusCode"],
        response_text=text,
        media_type=media_type,
        charset=charset,
        from_cache=False,
        cache_fingerprint=fingerprint,
        cookies=cookies,
    )

zavod.extract.zyte_api.ZyteAPIRequest dataclass

Container dataclass for possible arguments to the Zyte API.

Source code in zavod/extract/zyte_api.py
@dataclass
class ZyteAPIRequest:
    """Container dataclass for possible arguments to the Zyte API."""

    url: str
    method: Optional[str] = None  # Defaults to GET server-side
    body: Optional[bytes] = None

    scrape_type: ZyteScrapeType = ZyteScrapeType.HTTP_RESPONSE_BODY
    actions: Optional[List[Dict[str, Any]]] = None
    headers: Optional[Dict[str, str]] = None
    geolocation: Optional[str] = None
    # Forces JavaScript execution on a browser request to be enabled
    javascript: Optional[bool] = None
    # Request that response cookies be included in the ZyteResult
    response_cookies: bool = False