HTTP Operations

Making requests

Use context methods for HTTP operations, as they handle common tasks such as caching, using request sessions with sensible retry defaults, and checking the response status.

text = context.fetch_text(context.data_url, method="POST", headers=headers, data=body, cache_days=cache_days)

Handling bot blocking

Many sites employ bot-blocking strategies. We believe this is primarily to mitigate denial-of-service attacks and manage server load, rather than to protect the content from extraction, since the purpose of the sites we scrape is to disseminate their block lists. As long as we are sensitive to our impact on their service and identifiable in our requests, we believe it is OK to work around their bot-blocking strategies.

Blocking might show up as an error status such as 403, a redirect to an error page, or a 200 response whose content differs from what you see in the browser.
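
While debugging, it can help to check both the status code and a piece of text you expect in the genuine page. The sketch below is only an illustration: `looks_blocked` and its parameters are hypothetical names, not part of zavod.

```python
def looks_blocked(status_code: int, body: str, expected_marker: str) -> bool:
    """Return True if a response looks like it came from a bot blocker.

    Hypothetical helper: `expected_marker` is any text you saw in the
    browser and expect to find in the genuine page.
    """
    # Error statuses such as 403 are the obvious case.
    if status_code >= 400:
        return True
    # A successful response (possibly after a redirect) that lacks the
    # expected content is likely an error or challenge page.
    return expected_marker not in body


print(looks_blocked(403, "Forbidden", "Sanctions list"))  # True
print(looks_blocked(200, "<p>Checking your browser...</p>", "Sanctions list"))  # True
print(looks_blocked(200, "<h1>Sanctions list</h1>", "Sanctions list"))  # False
```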

Header-based restrictions

If a request using zavod fails but your browser succeeds, try setting a more browser-like user-agent header.

http:
  user_agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36 (zavod; opensanctions.org)

If that doesn't work, try more of the common headers sent by browsers:

HEADERS = {
    "origin": "https://www.interpol.int",
    "referer": "https://www.interpol.int/",
    "sec-fetch-mode": "navigate",
    "sec-fetch-site": "none",
    "sec-fetch-user": "?1",
    "upgrade-insecure-requests": "1",
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36 (zavod; opensanctions.org)",
}
context.fetch_...(url, headers=HEADERS)

Network/geo-blocking

If a request fails in production but not locally, the site might be blocking the production network's IP range. It's common to block hosting provider networks for websites intended for humans only. It's also common to block requests from countries other than the publisher's; if the request works using a VPN exit point in that country, pass that country to the geolocation argument.

In both cases, route the request through the Zyte API (zavod.extract.zyte_api), which proxies and unblocks requests. The fetch_* helpers default to the httpResponseBody scrape type, which is faster and cheaper than browserHtml. Prefer it unless the page needs JavaScript (see below).

JavaScript challenges

If it works in the browser but you see different content when fetching using zavod or curl, there might be a JavaScript challenge that checks whether a full browser is rendering the page. This usually sets a cookie so the browser doesn't have to complete the challenge on each request. These challenges can also be intermittent.

Use zyte_api.fetch_html, whose html_source defaults to "browserHtml": it renders the page in a browser, executes the JavaScript, and returns the resulting DOM as parsed HTML. It requires an unblock_validator XPath that matches only when unblocking succeeded, so challenge pages aren't cached. If you need to wait for specific content or click to reveal data, use the actions argument.
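
To make the "matches only when unblocking succeeded" requirement concrete, here is a minimal sketch. It uses the standard library's `xml.etree.ElementTree` on well-formed snippets as a stand-in for the lxml document that fetch_html parses; the pages, the validator expressions, and `is_unblocked` are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical pages: the genuine page has a results table, the challenge page does not.
REAL_PAGE = (
    "<html><body><table id='results'>"
    "<tr><td>Jane Doe</td></tr></table></body></html>"
)
CHALLENGE_PAGE = (
    "<html><body><div id='challenge'>Checking your browser</div></body></html>"
)

GOOD_VALIDATOR = ".//table[@id='results']"  # present only in the genuine page
TOO_BROAD_VALIDATOR = ".//body"  # also matches the challenge page


def is_unblocked(page: str, validator: str) -> bool:
    """True if the validator matches at least one element in the page."""
    doc = ET.fromstring(page)
    return len(doc.findall(validator)) > 0


print(is_unblocked(REAL_PAGE, GOOD_VALIDATOR))  # True
print(is_unblocked(CHALLENGE_PAGE, GOOD_VALIDATOR))  # False
print(is_unblocked(CHALLENGE_PAGE, TOO_BROAD_VALIDATOR))  # True: would let a challenge page be cached
```

A validator that targets the data you actually intend to parse tends to be the safest choice, since it fails as soon as that data is missing.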

The Zyte API

The helpers in zavod.extract.zyte_api route requests through the Zyte API, which handles proxying, geolocation, and browser rendering. They require the OPENSANCTIONS_ZYTE_API_KEY environment variable, so crawlers that use them set ci_test: false. Most take an optional geolocation (e.g. "US") and cache_days. For requests that need a POST body, custom headers, cookies, or browser actions, build a ZyteAPIRequest and call the lower-level fetch.

`zavod.extract.zyte_api.fetch_text(context, url, geolocation=None, cache_days=None, expected_media_type=None, expected_charset=None)`

Fetch a text document using the Zyte API.

The content type and charset can be used to assert expected types (and successful unblocking by Zyte) and for appropriate text decoding when the encoding can vary. Do so via the expected_ arguments unless more logic is required.

Parameters:

Name	Type	Description	Default
`context`	`Context`	The context object.	required
`url`	`str`	The URL of the text document.	required
`geolocation`	`str \| None`	The country to make the request from, e.g. "US".	`None`
`cache_days`	`int \| None`	The allowed age of a cache hit.	`None`
`expected_media_type`	`str \| None`	If set, assert that the media type in the response content-type header matches this value.	`None`
`expected_charset`	`str \| None`	If set, assert that the charset in the response content-type header matches this value.	`None`

Returns:

Type	Description
`tuple[bool, str \| None, str \| None, str]`	A tuple of: - A boolean indicating whether the text was cached. - The media type of the response, None if cached. - The charset of the response, None if cached. - The text content.
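
The media type and charset checks in fetch_text can be pictured with the following sketch. `content_type_from_headers` and `check_expected` are hypothetical stand-ins built on the standard library, not zavod's own helpers, and the real parsing may normalize values differently. The header list uses the name/value dict shape that the Zyte API returns for httpResponseHeaders.

```python
from email.message import Message
from typing import Optional


def content_type_from_headers(
    headers: list[dict[str, str]],
) -> tuple[Optional[str], Optional[str]]:
    """Extract (media_type, charset) from a list of name/value header dicts."""
    for header in headers:
        if header["name"].lower() == "content-type":
            msg = Message()
            msg["Content-Type"] = header["value"]
            # Both values come back lowercased; charset is None if absent.
            return msg.get_content_type(), msg.get_content_charset()
    return None, None


def check_expected(
    media_type: Optional[str],
    charset: Optional[str],
    expected_media_type: Optional[str] = None,
    expected_charset: Optional[str] = None,
) -> None:
    """Mirror fetch_text: a check is skipped when the actual value is None,
    which is the case for cache hits."""
    if expected_media_type and media_type:
        assert media_type == expected_media_type, (media_type, charset)
    if expected_charset and charset:
        assert charset == expected_charset, (media_type, charset)


headers = [{"name": "Content-Type", "value": "text/html; charset=UTF-8"}]
media_type, charset = content_type_from_headers(headers)
print(media_type, charset)  # text/html utf-8
check_expected(media_type, charset, expected_media_type="text/html")  # passes
check_expected(None, None, expected_media_type="application/json")  # skipped
```

An unexpected media type, for example text/html where you expected text/csv, is often the first sign that an error or challenge page came back instead of the document.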

Source code in zavod/extract/zyte_api.py

def fetch_text(
    context: Context,
    url: str,
    geolocation: str | None = None,
    cache_days: int | None = None,
    expected_media_type: str | None = None,
    expected_charset: str | None = None,
) -> tuple[bool, str | None, str | None, str]:
    """
    Fetch a text document using the Zyte API.

    The content type and charset can be used to assert expected
    types (and successful unblocking by Zyte) and for appropriate
    text decoding when the encoding can vary. Do so via the expected_
    arguments unless more logic is required.

    Args:
        context: The context object.
        url: The URL of the text document.
        geolocation: The country to make the request from, e.g. "US".
        cache_days: The allowed age of a cache hit.
        expected_media_type: If set, assert that the media type in the
            response content-type header matches this value.
        expected_charset: If set, assert that the charset in the response
            content-type header matches this value.

    Returns:
        A tuple of:
            - A boolean indicating whether the text was cached.
            - The media type of the response, None if cached.
            - The charset of the response, None if cached.
            - The text content.
    """
    zyte_result = fetch(
        context,
        ZyteAPIRequest(
            scrape_type=ZyteScrapeType.HTTP_RESPONSE_BODY,
            url=url,
            geolocation=geolocation,
        ),
        cache_days=cache_days,
    )
    if expected_media_type and zyte_result.media_type:
        assert zyte_result.media_type == expected_media_type, (
            zyte_result.media_type,
            zyte_result.charset,
            url,
        )
    if expected_charset and zyte_result.charset:
        assert zyte_result.charset == expected_charset, (
            zyte_result.media_type,
            zyte_result.charset,
            url,
        )

    if not zyte_result.from_cache and cache_days is not None:
        context.cache.set(zyte_result.cache_fingerprint, zyte_result.response_text)

    return (
        zyte_result.from_cache,
        zyte_result.media_type,
        zyte_result.charset,
        zyte_result.response_text,
    )

`zavod.extract.zyte_api.fetch_json(context, url, cache_days=None, expected_media_type='application/json', geolocation=None)`

Returns:

Type	Description
`Any`	A JSON document.

Source code in zavod/extract/zyte_api.py

def fetch_json(
    context: Context,
    url: str,
    cache_days: int | None = None,
    expected_media_type: str | None = "application/json",
    geolocation: str | None = None,
) -> Any:
    """
    Returns:
        A JSON document.
    """

    zyte_result = fetch(
        context,
        ZyteAPIRequest(
            scrape_type=ZyteScrapeType.HTTP_RESPONSE_BODY,
            url=url,
            headers={"Accept": "application/json"},
            geolocation=geolocation,
        ),
        cache_days=cache_days,
    )
    if (
        expected_media_type
        and zyte_result.media_type
        and zyte_result.media_type != expected_media_type
    ):
        msg = f"Expected media type {expected_media_type} but got {zyte_result.media_type} for {url}"
        context.log.error(
            msg,
            expected_media_type=expected_media_type,
            media_type=zyte_result.media_type,
            charset=zyte_result.charset,
            response_text=zyte_result.response_text,
        )
        raise AssertionError(msg)

    doc = json.loads(zyte_result.response_text)

    if not zyte_result.from_cache and cache_days is not None:
        context.cache.set(zyte_result.cache_fingerprint, zyte_result.response_text)
    return doc

`zavod.extract.zyte_api.fetch_resource(context, filename, url, expected_media_type=None, expected_charset=None, geolocation=None, method=None, body=None, headers=None)`

Fetch a resource using Zyte API and save to filesystem.

The content type and charset can be used to assert expected types (and successful unblocking by Zyte) and for appropriate text decoding when the encoding can vary. Do so via the expected_ arguments unless more logic is required.

Parameters:

Name	Type	Description	Default
`context`	`Context`	The context object.	required
`filename`	`str`	The name to use when saving the file.	required
`url`	`str`	The URL of the resource.	required
`expected_media_type`	`str \| None`	If set, assert that the media type in the response content-type header matches this value. Not enforced when the file already exists locally.	`None`
`expected_charset`	`str \| None`	If set, assert that the charset in the response content-type header matches this value. Not enforced when the file already exists locally.	`None`

Returns:

Type	Description
`tuple[bool, str \| None, str \| None, Path]`	A tuple of: - A boolean indicating whether the file was cached. - The media type of the response, None if cached. - The charset of the response, None if cached. - The path to the saved file.
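
The "already exists locally" behaviour is worth seeing in isolation, because it means a stale file is reused without any checks. The sketch below is a simplified, hypothetical `save_once` helper that imitates only the file handling of fetch_resource; it makes no network request.

```python
import tempfile
from base64 import b64decode, b64encode
from pathlib import Path
from typing import Any


def save_once(out_path: Path, api_json: dict[str, Any]) -> bool:
    """Write the decoded httpResponseBody to out_path unless it already exists.

    Returns True if an existing file was reused, False if a new one was written.
    """
    if out_path.exists():
        return True
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_bytes(b64decode(api_json["httpResponseBody"]))
    return False


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "data" / "source.csv"
    first = {"httpResponseBody": b64encode(b"name\nJane Doe\n").decode("utf-8")}
    second = {"httpResponseBody": b64encode(b"changed").decode("utf-8")}
    reused_first = save_once(path, first)  # False: file written
    reused_second = save_once(path, second)  # True: existing file kept
    content = path.read_bytes()

print(reused_first, reused_second, content)
```

If you need a fresh copy, the existing file has to be removed first.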

Source code in zavod/extract/zyte_api.py

def fetch_resource(
    context: Context,
    filename: str,
    url: str,
    expected_media_type: str | None = None,
    expected_charset: str | None = None,
    geolocation: str | None = None,
    method: str | None = None,
    body: bytes | None = None,
    headers: dict[str, str] | None = None,
) -> tuple[bool, str | None, str | None, Path]:
    """
    Fetch a resource using Zyte API and save to filesystem.

    The content type and charset can be used to assert expected
    types (and successful unblocking by Zyte) and for appropriate
    text decoding when the encoding can vary. Do so via the expected_
    arguments unless more logic is required.

    Args:
        context: The context object.
        filename: The name to use when saving the file.
        url: The URL of the resource.
        expected_media_type: If set, assert that the media type in the
            response content-type header matches this value. Not enforced
            when the file already exists locally.
        expected_charset: If set, assert that the charset in the response
            content-type header matches this value. Not enforced
            when the file already exists locally.

    Returns:
        A tuple of:
            - A boolean indicating whether the file was cached.
            - The media type of the response, None if cached.
            - The charset of the response, None if cached.
            - The path to the saved file.
    """
    data_path = dataset_data_path(context.dataset.name)
    out_path = data_path.joinpath(filename)
    if out_path.exists():
        return True, None, None, out_path

    if settings.ZYTE_API_KEY is None:
        raise RuntimeError("OPENSANCTIONS_ZYTE_API_KEY is not set")

    context.log.info("Fetching file", url=url)
    # This repeats a lot of what's in fetch(), but fetch() focuses on decoding either
    # a text response or a base64-encoded response to text, whereas here we want to
    # save the raw bytes to a file. Letting fetch cover both cases seems more complex than
    # just constructing the right request here.
    zyte_data: dict[str, Any] = {
        "httpResponseBody": True,
        "httpResponseHeaders": True,
    }
    if method is not None:
        zyte_data["httpRequestMethod"] = method
    if body is not None:
        zyte_data["httpRequestBody"] = b64encode(body).decode("utf-8")
    if headers is not None:
        zyte_data["customHttpRequestHeaders"] = [
            {"name": k, "value": v} for k, v in headers.items()
        ]
    if geolocation is not None:
        zyte_data["geolocation"] = geolocation
    context.log.debug(f"Zyte API request: {url}", data=zyte_data)
    zyte_data["url"] = url
    out_path.parent.mkdir(parents=True, exist_ok=True)
    configure_session(context.http)

    timeout = context.dataset.http.timeout
    api_response = context.http.post(
        ZYTE_API_URL,
        auth=(settings.ZYTE_API_KEY, ""),
        json=zyte_data,
        timeout=(timeout, timeout),
    )
    api_response.raise_for_status()

    file_base64 = api_response.json()["httpResponseBody"]
    with open(out_path, "wb") as fh:
        fh.write(b64decode(file_base64))
    media_type, charset = get_content_type(api_response.json()["httpResponseHeaders"])

    if expected_media_type:
        assert media_type == expected_media_type, (media_type, charset, url)
    if expected_charset:
        assert charset == expected_charset, (media_type, charset, url)

    return False, media_type, charset, out_path

`zavod.extract.zyte_api.fetch_html(context, url, unblock_validator, actions=[], html_source='browserHtml', javascript=None, geolocation=None, request_cookies=None, cache_days=None, retries=3, backoff_factor=3, previous_retries=0, absolute_links=False)`

Fetch a web page using the Zyte API.

Parameters:

Name	Type	Description	Default
`unblock_validator`	`str`	XPath matching at least one element if and only if unblocking was successful. This is important to ensure we don't cache pages that weren't actually unblocked successfully.	required
`html_source`	`str`	browserHtml \| httpResponseBody	`'browserHtml'`
`retries`	`int`	The number of times to retry if unblocking fails.	`3`
`backoff_factor`	`int`	Factor to scale the pause between retries.	`3`
`absolute_links`	`bool`	Whether to convert relative links to absolute links. Doesn't take redirects into account.	`False`

Returns:

Type	Description
`_Element`	The parsed HTML document serialized from the DOM.
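
The pause between retries grows exponentially: the source computes it as `backoff_factor * 2 ** (previous_retries + 1)`. A small sketch of the resulting schedule, where `retry_pauses` is an illustrative helper and not part of zavod:

```python
def retry_pauses(retries: int = 3, backoff_factor: int = 3) -> list[int]:
    """Seconds slept before each retry, using the formula in fetch_html."""
    return [
        backoff_factor * (2 ** (previous_retries + 1))
        for previous_retries in range(retries)
    ]


pauses = retry_pauses()
print(pauses)  # [6, 12, 24]
print(sum(pauses))  # 42
```

With the defaults, a page that never unblocks is attempted four times and spends 42 seconds sleeping before UnblockFailedException is raised, so consider lowering retries while developing a validator.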

Source code in zavod/extract/zyte_api.py

def fetch_html(
    context: Context,
    url: str,
    unblock_validator: str,
    actions: list[dict[str, Any]] = [],
    html_source: str = "browserHtml",
    javascript: bool | None = None,
    geolocation: str | None = None,
    request_cookies: list[dict[str, Any]] | None = None,
    cache_days: int | None = None,
    retries: int = 3,
    backoff_factor: int = 3,
    previous_retries: int = 0,
    absolute_links: bool = False,
) -> etree._Element:
    """
    Fetch a web page using the Zyte API.

    Args:
        unblock_validator: XPath matching at least one element if and only if
            unblocking was successful. This is important to ensure we don't cache
            pages that weren't actually unblocked successfully.
        html_source: browserHtml | httpResponseBody
        retries: The number of times to retry if unblocking fails.
        backoff_factor: Factor to scale the pause between retries.
        absolute_links: Whether to convert relative links to absolute links.
            Doesn't take redirects into account.

    Returns:
        The parsed HTML document serialized from the DOM.
    """
    zyte_result = fetch(
        context,
        ZyteAPIRequest(
            scrape_type=ZyteScrapeType(html_source),
            url=url,
            geolocation=geolocation,
            actions=actions,
            javascript=javascript,
            request_cookies=request_cookies,
        ),
        cache_days=cache_days,
    )

    doc = html.fromstring(zyte_result.response_text)
    if absolute_links and isinstance(doc, html.HtmlElement):
        cast(html.HtmlElement, doc).make_links_absolute(url)

    matches = doc.xpath(unblock_validator)
    if not isinstance(matches, list) or not len(matches) > 0:
        # If we've cached a response that no longer passes validation (likely because the code changed),
        # invalidate it so that we don't just get the same cached response on retry.
        zyte_result.invalidate_cache(context)

        if previous_retries < retries:
            pause = backoff_factor * (2 ** (previous_retries + 1))
            context.log.debug(
                f"Unblocking failed, sleeping {pause}s then retrying",
                url=url,
                retries=retries,
                previous_retries=previous_retries,
            )
            sleep(pause)
            return fetch_html(
                context,
                url,
                unblock_validator,
                actions,
                html_source=html_source,
                javascript=javascript,
                geolocation=geolocation,
                request_cookies=request_cookies,
                cache_days=cache_days,
                retries=retries,
                backoff_factor=backoff_factor,
                previous_retries=previous_retries + 1,
                absolute_links=absolute_links,
            )
        context.log.debug("Unblocking failed", url=url, html=zyte_result.response_text)
        raise UnblockFailedException(url, unblock_validator)

    if not zyte_result.from_cache and cache_days is not None:
        context.cache.set(zyte_result.cache_fingerprint, zyte_result.response_text)
    return doc

`zavod.extract.zyte_api.fetch(context, zyte_request, cache_days=None)`

Fetch using the Zyte API.

Note that this function uses the cache, but does not set the cache. This should be done by callers after verifying that the content is valid and worthy of being cached.

Parameters:

Name	Type	Description	Default
`context`	`Context`	The context object.	required
`zyte_request`	`ZyteAPIRequest`	The request to send	required
`cache_days`	`int \| None`	The allowed age of a cache hit.	`None`

Returns: A ZyteResult
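
The split between reading the cache in fetch and writing it in the caller can be sketched as follows. This is a simplified illustration with a plain dict as the cache; `fetch_validated`, `download`, and `is_valid` are hypothetical names rather than zavod functions.

```python
from typing import Callable


def fetch_validated(
    cache: dict[str, str],
    fingerprint: str,
    download: Callable[[], str],
    is_valid: Callable[[str], bool],
) -> str:
    """Read from the cache if possible, but only store content that validates."""
    text = cache.get(fingerprint)
    from_cache = text is not None
    if text is None:
        text = download()
    if not is_valid(text):
        # Drop any cached copy so a retry doesn't return the same bad content.
        cache.pop(fingerprint, None)
        raise ValueError("content failed validation")
    if not from_cache:
        cache[fingerprint] = text
    return text


cache: dict[str, str] = {}
try:
    fetch_validated(cache, "fp", lambda: "challenge page", lambda t: "results" in t)
except ValueError:
    print(cache)  # {}: the invalid response was never cached

print(fetch_validated(cache, "fp", lambda: "results: 3", lambda t: "results" in t))
print(cache)  # {'fp': 'results: 3'}
```

This is the same reason fetch_html requires an unblock_validator: caching before validation would lock in a challenge page for the whole cache period.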

Source code in zavod/extract/zyte_api.py

def fetch(
    context: Context,
    zyte_request: ZyteAPIRequest,
    cache_days: int | None = None,
) -> ZyteResult:
    """
    Fetch using the Zyte API.

    Note that this function uses the cache, but does not set the cache. This should be done by
    callers after verifying that the content is valid and worthy of being cached.

    Args:
        context: The context object.
        zyte_request: The request to send
        cache_days: The allowed age of a cache hit.
    Returns:
        A ZyteResult
    """

    if settings.ZYTE_API_KEY is None:
        raise RuntimeError("OPENSANCTIONS_ZYTE_API_KEY is not set")

    zyte_data: dict[str, Any] = {
        "url": zyte_request.url,
        "httpResponseHeaders": True,
    }
    if zyte_request.method is not None:
        zyte_data["httpRequestMethod"] = zyte_request.method
    if zyte_request.body is not None:
        zyte_data["httpRequestBody"] = b64encode(zyte_request.body).decode("utf-8")

    if zyte_request.headers is not None:
        zyte_data["customHttpRequestHeaders"] = [
            {"name": k, "value": v} for k, v in zyte_request.headers.items()
        ]
    if zyte_request.geolocation is not None:
        zyte_data["geolocation"] = zyte_request.geolocation
    if zyte_request.actions is not None:
        zyte_data["actions"] = zyte_request.actions
    if zyte_request.javascript is not None:
        zyte_data["javascript"] = zyte_request.javascript
    if zyte_request.request_cookies is not None:
        zyte_data["requestCookies"] = zyte_request.request_cookies
    if zyte_request.response_cookies:
        zyte_data["responseCookies"] = True
    zyte_data[zyte_request.scrape_type.value] = True

    fingerprint = get_cache_fingerprint(zyte_data)

    if cache_days is not None:
        text = context.cache.get(fingerprint, max_age=cache_days)
        if text is not None:
            context.log.debug(
                "HTTP cache hit", url=zyte_request.url, fingerprint=fingerprint
            )
            return ZyteResult(
                response_text=text,
                status_code=None,
                from_cache=True,
                cache_fingerprint=fingerprint,
            )

    context.log.debug(f"Zyte API request: {zyte_request.url}", data=zyte_data)
    configure_session(context.http)

    timeout = context.dataset.http.timeout
    api_response = context.http.post(
        ZYTE_API_URL,
        auth=(settings.ZYTE_API_KEY, ""),
        json=zyte_data,
        timeout=(timeout, timeout),
    )
    api_response.raise_for_status()

    text = api_response.json()[zyte_request.scrape_type.value]
    assert text is not None
    media_type, charset = get_content_type(
        api_response.json().get("httpResponseHeaders", [])
    )
    if zyte_request.scrape_type == ZyteScrapeType.HTTP_RESPONSE_BODY:
        b64_text = b64decode(text)
        # The Content-Type header often omits the charset, leaving the encoding
        # declared only in an HTML <meta> tag (e.g. legacy windows-1257 pages).
        # Assuming UTF-8 then raises on the first non-ASCII byte, so detect the
        # encoding from the bytes when the header doesn't state it.
        encoding = charset if charset is not None else predict_encoding(b64_text)
        text = b64_text.decode(encoding)

    cookies = (
        api_response.json().get("responseCookies")
        if zyte_request.response_cookies
        else None
    )
    return ZyteResult(
        status_code=api_response.json()["statusCode"],
        response_text=text,
        media_type=media_type,
        charset=charset,
        from_cache=False,
        cache_fingerprint=fingerprint,
        cookies=cookies,
    )

`zavod.extract.zyte_api.ZyteAPIRequest` `dataclass`

Container dataclass for possible arguments to the Zyte API.

Source code in zavod/extract/zyte_api.py

@dataclass
class ZyteAPIRequest:
    """Container dataclass for possible arguments to the Zyte API."""

    url: str
    method: str | None = None  # Defaults to GET server-side
    body: bytes | None = None

    scrape_type: ZyteScrapeType = ZyteScrapeType.HTTP_RESPONSE_BODY
    actions: list[dict[str, Any]] | None = None
    headers: dict[str, str] | None = None
    geolocation: str | None = None
    # Forces JavaScript execution on a browser request to be enabled
    javascript: bool | None = None
    # Cookies sent with the request, e.g. [{"name": "x", "value": "y", "domain": ".example.com"}]
    request_cookies: list[dict[str, Any]] | None = None
    # Request that response cookies be included in the ZyteResult
    response_cookies: bool = False
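
To see how these fields end up in the JSON sent to the Zyte API, the sketch below reproduces the mapping from fetch for a POST request. `RequestSketch` is a trimmed, hypothetical stand-in for ZyteAPIRequest (with scrape_type as a plain string) so that the example runs without zavod installed; the URL and body are made up.

```python
from base64 import b64decode, b64encode
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class RequestSketch:
    """Trimmed, hypothetical stand-in for ZyteAPIRequest."""

    url: str
    method: Optional[str] = None
    body: Optional[bytes] = None
    headers: Optional[dict[str, str]] = None
    geolocation: Optional[str] = None
    scrape_type: str = "httpResponseBody"


def to_payload(request: RequestSketch) -> dict[str, Any]:
    """Build the request JSON the same way fetch does for these fields."""
    payload: dict[str, Any] = {"url": request.url, "httpResponseHeaders": True}
    if request.method is not None:
        payload["httpRequestMethod"] = request.method
    if request.body is not None:
        # The body is sent base64-encoded.
        payload["httpRequestBody"] = b64encode(request.body).decode("utf-8")
    if request.headers is not None:
        # Headers become a list of name/value dicts.
        payload["customHttpRequestHeaders"] = [
            {"name": k, "value": v} for k, v in request.headers.items()
        ]
    if request.geolocation is not None:
        payload["geolocation"] = request.geolocation
    payload[request.scrape_type] = True
    return payload


payload = to_payload(
    RequestSketch(
        url="https://example.com/search",
        method="POST",
        body=b'{"q": "doe"}',
        headers={"Content-Type": "application/json"},
        geolocation="US",
    )
)
print(payload["customHttpRequestHeaders"])
print(b64decode(payload["httpRequestBody"]))  # b'{"q": "doe"}'
```

Fields left as None are omitted from the payload, so the API's own defaults apply, such as GET for the method.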
