HTTP Operations

Making requests

Use context methods for HTTP operations, as they handle common tasks such as caching, using request sessions with sensible retry defaults, and checking the response status.

text = context.fetch_text(context.data_url, method="POST", headers=headers, data=body, cache_days=cache_days)

Handling bot blocking

Many sites employ bot blocking strategies. We believe this is primarily to mitigate Denial of Service attacks and manage server load, rather than to protect the content from extraction, since the purpose of the sites we scrape is dissemination of their block lists. As long as we are sensitive to our impact on their service and identifiable in our requests, we believe it is OK to work around their bot blocking strategies.

Blocking might result in error statuses like 403; redirects to error pages; or 200 status responses but with different content from what you've seen in the browser.
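The three symptoms above can be turned into a quick sanity check. The following is a minimal sketch, not part of zavod: `FetchResult` and `blocking_reason` are hypothetical names, and the `marker` argument is assumed to be a string you have seen in the page when loading it in a browser.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FetchResult:
    """Hypothetical stand-in for whatever your HTTP client returns."""

    status: int
    final_url: str
    body: str


def blocking_reason(
    result: FetchResult, requested_url: str, marker: str
) -> Optional[str]:
    """Return a short reason if the response looks blocked, else None."""
    # Symptom 1: an error status such as 403.
    if result.status >= 400:
        return f"error status {result.status}"
    # Symptom 2: we ended up somewhere other than the URL we asked for.
    if result.final_url.rstrip("/") != requested_url.rstrip("/"):
        return f"redirected to {result.final_url}"
    # Symptom 3: a 200 response whose content lacks what the browser showed.
    if marker not in result.body:
        return f"expected marker {marker!r} missing from body"
    return None
```

The redirect check is deliberately strict and will also flag harmless redirects (for example to a canonical URL), so treat a non-empty reason as a prompt to inspect the response rather than as proof of blocking.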

Header-based restrictions

If a request using zavod fails but your browser succeeds, try setting a more browser-like user-agent header.

http:
  user_agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36 (zavod; opensanctions.org)

If that doesn't work, try more of the common headers sent by browsers:

HEADERS = {
    "origin": "https://www.interpol.int",
    "referer": "https://www.interpol.int/",
    "sec-fetch-mode": "navigate",
    "sec-fetch-site": "none",
    "sec-fetch-user": "?1",
    "upgrade-insecure-requests": "1",
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36 (zavod; opensanctions.org)",
}
}
context.fetch_...(url, headers=HEADERS)
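When several sources need the same treatment, the origin and referer values can be derived from the target URL instead of being hard-coded. This is a sketch under that assumption; `browser_like_headers` is a hypothetical helper, not a zavod function, and the header set simply mirrors the one shown above.

```python
from typing import Dict
from urllib.parse import urlsplit

# Browser-like user agent that still identifies us, as in the example above.
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36 "
    "(zavod; opensanctions.org)"
)


def browser_like_headers(url: str, user_agent: str = USER_AGENT) -> Dict[str, str]:
    """Build browser-style headers whose origin/referer match the URL's site."""
    parts = urlsplit(url)
    origin = f"{parts.scheme}://{parts.netloc}"
    return {
        "origin": origin,
        "referer": origin + "/",
        "sec-fetch-mode": "navigate",
        "sec-fetch-site": "none",
        "sec-fetch-user": "?1",
        "upgrade-insecure-requests": "1",
        "user-agent": user_agent,
    }
```

The resulting dictionary can be passed as the `headers` argument in the same way as `HEADERS` above.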

Network/geo-blocking

If a request fails in production but not locally, the site might be blocking our production network IP range. It's common for websites intended only for humans to block hosting provider networks.

Use Zyte with the httpResponseBody approach (the default in the zavod.shed.zyte_api.fetch_* functions, except fetch_html, whose html_source defaults to browserHtml). httpResponseBody is faster and cheaper than browserHtml.

It's also common to block requests from countries other than the publisher's. If the request works through a VPN exit point in that country, also try Zyte with the geolocation argument.
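To make the two options concrete, here is a sketch of what an httpResponseBody request with a geolocation looks like at the level of the raw Zyte API payload. It is not how the zavod.shed.zyte_api helpers are implemented, and `build_extract_payload` and `decode_body` are hypothetical names. It assumes the Zyte API fields `url`, `httpResponseBody` and `geolocation` (a two-letter country code), and that the body comes back base64-encoded.

```python
import base64
from typing import Any, Dict, Optional


def build_extract_payload(
    url: str, geolocation: Optional[str] = None
) -> Dict[str, Any]:
    """Build a JSON payload asking for the raw HTTP response body."""
    payload: Dict[str, Any] = {"url": url, "httpResponseBody": True}
    if geolocation is not None:
        # Two-letter country code of the desired exit point, e.g. "DE".
        payload["geolocation"] = geolocation
    return payload


def decode_body(response_json: Dict[str, Any]) -> bytes:
    """Decode the base64-encoded body from the API's JSON response."""
    return base64.b64decode(response_json["httpResponseBody"])
```

In practice, prefer the zavod helpers, which take care of authentication and caching; the sketch only shows which knobs are being turned.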

JavaScript challenges

If it works in the browser but you see different content when fetching using zavod or curl, there might be a JavaScript challenge that checks whether a full browser is rendering the page. This usually sets a cookie so the browser doesn't have to complete the challenge on each request. These challenges can also be intermittent.

For HTML, try requesting using zyte_api.fetch_html with html_source="browserHtml" (the default). This will render the page in a browser, execute any JavaScript, then turn the DOM back into HTML and return that.

If tricks like waiting for specific content or clicking on something to render the data are needed, look at the actions argument.
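As an illustration, an action list that clicks a button and then waits for a table to appear might be built as follows. This is a sketch assuming the Zyte API action schema (`click` and `waitForSelector` actions with a `selector` object and a timeout in seconds); the helper names and CSS selectors are hypothetical, so check the Zyte documentation for the exact fields before relying on it.

```python
from typing import Any, Dict, List


def css_selector(value: str) -> Dict[str, str]:
    """Selector object identifying an element by CSS."""
    return {"type": "css", "value": value}


def click(css: str) -> Dict[str, Any]:
    """Action that clicks the first element matching the CSS selector."""
    return {"action": "click", "selector": css_selector(css)}


def wait_for(css: str, timeout: int = 15) -> Dict[str, Any]:
    """Action that waits until an element matching the selector appears."""
    return {
        "action": "waitForSelector",
        "selector": css_selector(css),
        "timeout": timeout,
    }


# Hypothetical selectors: click a "load" button, then wait for the results.
ACTIONS: List[Dict[str, Any]] = [
    click("button#load-results"),
    wait_for("table.results"),
]
```

A list like `ACTIONS` would then be passed as the actions argument of the fetch call.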