Welcome to scrapy-proxy-headers’s documentation!¶
The scrapy-proxy-headers package is designed for adding proxy headers to HTTPS requests in Scrapy.
The Problem¶
In normal usage, custom headers set in request.headers cannot be read by a proxy when you make an HTTPS request, because they are encrypted and sent through the proxy tunnel along with the rest of the request. You can read more about this at Proxy Server Requests over HTTPS.
┌──────────┐     CONNECT      ┌───────┐     Encrypted     ┌────────────┐
│  Scrapy  │ ───────────────► │ Proxy │ ════════════════► │ Target URL │
└──────────┘  (unencrypted)   └───────┘     (tunnel)      └────────────┘
                    │                           │
              Proxy headers              request.headers
                 go HERE               go here (encrypted)
Scrapy does not have a good way to pass custom headers to a proxy when you make HTTPS requests, so we at ProxyMesh made this extension to support our customers who use Scrapy and want to use custom headers to control our proxy behavior. The extension can also handle custom headers through any other proxy.
Installation¶
To use this extension, do the following:
Install the package:
pip install scrapy-proxy-headers
In your Scrapy settings.py, add the following:
DOWNLOAD_HANDLERS = {
    "https": "scrapy_proxy_headers.HTTP11ProxyDownloadHandler"
}
This configures Scrapy to use our custom download handler for HTTPS requests, which enables proxy header support.
Quick Start¶
Sending Proxy Headers¶
When you want to make a request with a custom proxy header, instead of using request.headers, use request.meta["proxy_headers"]:
import scrapy

class MySpider(scrapy.Spider):
    name = "my_spider"

    def start_requests(self):
        yield scrapy.Request(
            url="https://api.ipify.org?format=json",
            meta={
                "proxy": "http://PROXYHOST:PORT",
                "proxy_headers": {"X-ProxyMesh-Country": "US"}
            }
        )

    def parse(self, response):
        # Access proxy response headers
        proxy_ip = response.headers.get("X-ProxyMesh-IP")
        self.logger.info(f"Proxy IP: {proxy_ip}")
        yield {"ip": response.json()["ip"], "proxy_ip": proxy_ip}
Receiving Proxy Response Headers¶
Any response headers that come from the proxy will be available in response.headers:
def parse(self, response):
    # Proxy response headers are merged into response.headers
    proxy_ip = response.headers.get("X-ProxyMesh-IP")
    print(f"Request was made through IP: {proxy_ip}")
Proxy Headers Overview¶
Proxy headers are custom HTTP headers that can be used to communicate with proxy servers. They allow you to:
Control proxy behavior: Send headers like X-ProxyMesh-Country to select a specific country for your proxy connection
Receive proxy information: Get headers like X-ProxyMesh-IP to know which IP address was assigned to your request
Maintain session consistency: Use headers like X-ProxyMesh-IP to ensure you get the same IP address across multiple requests
The exact headers available depend on your proxy provider. Check your proxy provider’s documentation for the specific headers they support.
Complete Spider Example¶
Here’s a complete example spider that uses proxy headers:
import scrapy

class ProxyHeadersSpider(scrapy.Spider):
    name = "proxy_headers_example"
    custom_settings = {
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_proxy_headers.HTTP11ProxyDownloadHandler"
        }
    }

    def start_requests(self):
        # Request with proxy headers to select US country
        yield scrapy.Request(
            url="https://api.ipify.org?format=json",
            meta={
                "proxy": "http://us.proxymesh.com:31280",
                "proxy_headers": {"X-ProxyMesh-Country": "US"}
            },
            callback=self.parse_ip
        )

    def parse_ip(self, response):
        data = response.json()
        proxy_ip = response.headers.get(b"X-ProxyMesh-IP")
        self.logger.info(f"Public IP: {data['ip']}")
        if proxy_ip:
            self.logger.info(f"Proxy IP: {proxy_ip.decode()}")
        yield {
            "public_ip": data["ip"],
            "proxy_ip": proxy_ip.decode() if proxy_ip else None
        }
Extension Classes¶
The scrapy_proxy_headers package provides several extension classes that work together to enable proxy header support in Scrapy.
HTTP11ProxyDownloadHandler¶
The main entry point for using proxy headers with Scrapy. This class extends scrapy.core.downloader.handlers.http11.HTTP11DownloadHandler and should be configured in your Scrapy settings.
DOWNLOAD_HANDLERS = {
    "https": "scrapy_proxy_headers.HTTP11ProxyDownloadHandler"
}
The handler:
Creates a ScrapyProxyHeadersAgent for each download request
Manages a cache of proxy response headers by proxy URL (_proxy_headers_by_proxy)
Ensures proxy response headers are available even when tunnel connections are reused
Why header caching is needed: When Scrapy reuses a proxy tunnel connection for multiple requests, the proxy response headers are only available in the first response (when the tunnel is established). The handler caches these headers by proxy URL so they can be added to subsequent responses that reuse the same tunnel.
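The caching idea described above can be illustrated with a minimal sketch. This is not the package's implementation: ProxyHeaderCache, remember, and apply are hypothetical names, plain dicts stand in for Scrapy's header objects, and the IP address is a made-up documentation value. The choice not to overwrite headers already present on the response is also an assumption of this sketch.

```python
class ProxyHeaderCache:
    """Hypothetical stand-in for a per-proxy header cache (not the package's code)."""

    def __init__(self):
        # Maps proxy URL -> headers captured from that proxy's CONNECT response.
        self._by_proxy = {}

    def remember(self, proxy_url, proxy_headers):
        # Only update the cache when the tunnel handshake produced headers.
        if proxy_headers:
            self._by_proxy[proxy_url] = dict(proxy_headers)

    def apply(self, proxy_url, response_headers):
        # Fill in cached proxy headers without clobbering headers already present.
        merged = dict(response_headers)
        for name, value in self._by_proxy.get(proxy_url, {}).items():
            merged.setdefault(name, value)
        return merged


cache = ProxyHeaderCache()
proxy = "http://PROXYHOST:PORT"

# First request: the tunnel is established, so the proxy headers are seen.
cache.remember(proxy, {"X-ProxyMesh-IP": "203.0.113.7"})

# Later request on the reused tunnel: no CONNECT response, so nothing new to cache.
cache.remember(proxy, {})
later = cache.apply(proxy, {"Content-Type": "application/json"})
```

In this sketch, later still carries X-ProxyMesh-IP even though the second request never saw a CONNECT response, which is the effect the handler's cache is meant to achieve.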
Methods:
download_request(request, spider) - Downloads a request using the custom agent and ensures proxy response headers are properly cached and applied to responses.
ScrapyProxyHeadersAgent¶
Extends scrapy.core.downloader.handlers.http11.ScrapyAgent to use our custom tunneling agent for HTTPS requests through proxies.
from scrapy_proxy_headers.agent import ScrapyProxyHeadersAgent
The agent:
Checks if the request has both a proxy and proxy_headers in its meta
For HTTPS requests, configures the tunneling agent with the custom proxy headers
After the response body is received, merges any proxy response headers into the response
Class Attributes:
_TunnelingAgent - Set to TunnelingHeadersAgent to use our custom tunneling implementation
Methods:
_get_agent(request, timeout) - Returns an agent configured with proxy headers from request.meta["proxy_headers"]
_cb_bodydone(result, *args) - Callback that merges proxy response headers into the final response (compatible with Scrapy 2.14 and 2.15+)
TunnelingHeadersAgent¶
Extends scrapy.core.downloader.handlers.http11.TunnelingAgent to support custom proxy headers in HTTPS tunnel establishment.
from scrapy_proxy_headers.agent import TunnelingHeadersAgent
The agent maintains proxy headers and creates endpoints that include them in the CONNECT request.
Methods:
set_proxy_headers(proxy_headers) - Sets the proxy headers dictionary to be sent with CONNECT requests
_getEndpoint(uri) - Creates a TunnelingHeadersTCP4ClientEndpoint configured with the proxy headers
TunnelingHeadersTCP4ClientEndpoint¶
Extends scrapy.core.downloader.handlers.http11.TunnelingTCP4ClientEndpoint to include custom headers in the CONNECT request and capture proxy response headers.
from scrapy_proxy_headers.agent import TunnelingHeadersTCP4ClientEndpoint
This is the lowest-level class that actually handles the tunnel establishment.
Constructor Parameters:
All standard TunnelingTCP4ClientEndpoint parameters, plus:
**proxy_headers - Keyword arguments for additional headers to send in the CONNECT request
Methods:
requestTunnel(protocol) - Sends the CONNECT request with custom proxy headers using tunnel_request_data_with_headers()
processProxyResponse(data) - Parses the proxy’s CONNECT response and captures any response headers into _proxy_response_headers
Attributes:
_proxy_headers - Dictionary of headers to send to the proxy (includes Proxy-Authorization if configured)
_proxy_response_headers - scrapy.http.Headers object containing headers from the proxy’s CONNECT response
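To make the header capture step more concrete, here is a rough sketch of how a raw CONNECT response could be split into a status code and a header mapping. It is an illustration only and does not reproduce processProxyResponse(): parse_connect_response is a hypothetical helper, it returns a plain dict rather than a scrapy.http.Headers object, and the sample response bytes are invented.

```python
def parse_connect_response(data: bytes):
    """Split a raw CONNECT response into (status_code, headers dict)."""
    # Everything before the first blank line is the status line plus headers.
    head, _, _ = data.partition(b"\r\n\r\n")
    lines = head.split(b"\r\n")
    # The status line looks like: HTTP/1.1 200 Connection established
    status_code = int(lines[0].split(b" ", 2)[1])
    headers = {}
    for line in lines[1:]:
        name, sep, value = line.partition(b":")
        if sep:  # skip malformed lines that have no colon
            headers[name.strip().decode("latin-1")] = value.strip().decode("latin-1")
    return status_code, headers


# Invented sample of what a proxy might send back when the tunnel opens.
raw = (
    b"HTTP/1.1 200 Connection established\r\n"
    b"X-ProxyMesh-IP: 203.0.113.7\r\n"
    b"\r\n"
)
status, proxy_response_headers = parse_connect_response(raw)
```

The sketch keeps only the last value for a repeated header name, which a real header container would typically handle differently.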
Helper Functions¶
tunnel_request_data_with_headers¶
Builds the binary content of a CONNECT request with custom headers.
from scrapy_proxy_headers.agent import tunnel_request_data_with_headers
# Basic CONNECT request
data = tunnel_request_data_with_headers("example.com", 8080)
# Returns: b'CONNECT example.com:8080 HTTP/1.1\r\nHost: example.com:8080\r\n\r\n'
# CONNECT request with custom headers
data = tunnel_request_data_with_headers(
    "example.com", 8080,
    **{"X-ProxyMesh-Country": "US"}
)
# Returns: b'CONNECT example.com:8080 HTTP/1.1\r\nHost: example.com:8080\r\nX-ProxyMesh-Country: US\r\n\r\n'
Parameters:
host (str) - The target host for the tunnel
port (int) - The target port for the tunnel
**proxy_headers - Additional headers to include in the CONNECT request
Returns:
bytes - The complete CONNECT request as bytes, ready to send to the proxy
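The byte layout shown in the examples above can be reproduced with a few lines of plain Python. The following is a sketch of that layout, not the package's source: build_connect_request is a hypothetical name, and the real helper may handle encoding, header ordering, or authentication differently.

```python
def build_connect_request(host: str, port: int, **proxy_headers: str) -> bytes:
    """Hypothetical re-creation of the documented CONNECT request layout."""
    target = f"{host}:{port}"
    lines = [f"CONNECT {target} HTTP/1.1", f"Host: {target}"]
    # Each extra header becomes its own "Name: value" line.
    lines.extend(f"{name}: {value}" for name, value in proxy_headers.items())
    # A blank line terminates the header section.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("latin-1")


# Header names containing hyphens must be passed via dict unpacking.
request_bytes = build_connect_request(
    "example.com", 8080, **{"X-ProxyMesh-Country": "US"}
)
```

Because these bytes are sent to the proxy before the TLS handshake with the target, the proxy can read the extra header lines, which is what makes proxy headers possible for HTTPS requests.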
How It Works¶
The extension classes work together in the following flow:
HTTP11ProxyDownloadHandler receives a download request and creates a ScrapyProxyHeadersAgent
ScrapyProxyHeadersAgent checks for proxy and proxy_headers in the request meta, and configures the tunneling agent
TunnelingHeadersAgent creates a TunnelingHeadersTCP4ClientEndpoint with the proxy headers
TunnelingHeadersTCP4ClientEndpoint sends a CONNECT request with the custom headers using tunnel_request_data_with_headers()
When the proxy responds to the CONNECT request, processProxyResponse() captures any response headers
After the request completes, the proxy response headers are merged into the final Response object
HTTP11ProxyDownloadHandler caches the proxy headers by proxy URL for reuse with subsequent requests on the same tunnel
This allows proxy response headers to be transparently available in your spider’s parse methods without any special handling.
Test Harness¶
A test harness is included in the repository to verify proxy header functionality works correctly with your proxy configuration.
Running the Test¶
# Basic test
PROXY_URL=http://your-proxy:port python test_proxy_headers.py
# With custom response header to check
PROXY_URL=http://your-proxy:port PROXY_HEADER=X-ProxyMesh-IP python test_proxy_headers.py
# Send a custom header to the proxy
PROXY_URL=http://your-proxy:port \
SEND_PROXY_HEADER=X-ProxyMesh-Country \
SEND_PROXY_VALUE=US \
python test_proxy_headers.py
# Verbose output (shows header values)
python test_proxy_headers.py -v
Environment Variables¶
| Variable | Description | Default |
|---|---|---|
| PROXY_URL | Proxy URL (also checks | Required |
| | URL to request through the proxy | |
| PROXY_HEADER | Response header to check for | |
| SEND_PROXY_HEADER | Header name to send to proxy | Optional |
| SEND_PROXY_VALUE | Value for the send header | Optional |
Expected Output¶
On success:
Testing scrapy-proxy-headers
============================
Proxy URL: http://your-proxy:port
Test URL: https://api.ipify.org?format=json
Checking for header: X-ProxyMesh-IP
[PASS] Received header X-ProxyMesh-IP
With verbose flag (-v):
[PASS] Received header X-ProxyMesh-IP: 192.168.1.1
Troubleshooting¶
Headers Not Being Received¶
If you’re not receiving proxy response headers:
Verify the proxy supports custom headers: Not all proxies send response headers in the CONNECT response
Check the header name: Header names are case-insensitive but the exact spelling matters
Ensure HTTPS URL: Proxy headers only work with HTTPS URLs (HTTP requests don’t use CONNECT tunneling)
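A spider can guard against the last point before it attaches proxy headers. The sketch below uses only the standard library; uses_connect_tunnel and proxy_meta are hypothetical helpers, and dropping proxy_headers for non-HTTPS URLs is just one possible policy, not something the package does for you.

```python
from urllib.parse import urlsplit


def uses_connect_tunnel(url: str) -> bool:
    """Return True when the target URL is HTTPS, i.e. would be tunneled via CONNECT."""
    return urlsplit(url).scheme == "https"


def proxy_meta(url: str, proxy: str, proxy_headers: dict) -> dict:
    """Build request meta, attaching proxy_headers only for HTTPS targets."""
    meta = {"proxy": proxy}
    if uses_connect_tunnel(url):
        meta["proxy_headers"] = dict(proxy_headers)
    return meta


https_meta = proxy_meta(
    "https://api.ipify.org?format=json",
    "http://PROXYHOST:PORT",
    {"X-ProxyMesh-Country": "US"},
)
http_meta = proxy_meta(
    "http://example.com",
    "http://PROXYHOST:PORT",
    {"X-ProxyMesh-Country": "US"},
)
```

The resulting dict can be passed as the meta argument of scrapy.Request, and a missing proxy_headers key is a quick signal that the target URL was not HTTPS.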
Headers Only Available on First Request¶
This is expected behavior when Scrapy reuses tunnel connections. The HTTP11ProxyDownloadHandler automatically caches headers by proxy URL to ensure they’re available on subsequent requests.
Request Failing with Connection Errors¶
Check proxy URL format: Should be http://host:port or http://user:pass@host:port
Verify proxy is accessible: Test with curl -x http://your-proxy:port https://example.com
Check firewall rules: Ensure your environment can connect to the proxy
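The URL format check can be automated with the standard library before a crawl starts. This is a minimal sketch under the assumption that the proxy URL should follow the http://host:port form given above; check_proxy_url is a hypothetical helper and the messages are illustrative.

```python
from urllib.parse import urlsplit


def check_proxy_url(proxy_url: str) -> list:
    """Return a list of problems found in a proxy URL (empty when it looks fine)."""
    problems = []
    parts = urlsplit(proxy_url)
    if parts.scheme != "http":
        problems.append("scheme should be http")
    if not parts.hostname:
        problems.append("missing host")
    try:
        if parts.port is None:
            problems.append("missing port")
    except ValueError:
        # Accessing .port raises ValueError for a non-numeric or out-of-range port.
        problems.append("port is not a valid number")
    return problems


ok = check_proxy_url("http://user:pass@proxy.example.com:8080")
placeholder = check_proxy_url("http://your-proxy:port")
```

A check like this also catches placeholder values such as http://your-proxy:port that were copied from documentation without being filled in.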
Use Cases¶
Geographic Targeting¶
Route requests through specific countries:
yield scrapy.Request(
    url="https://example.com",
    meta={
        "proxy": "http://proxy.example.com:8080",
        "proxy_headers": {"X-ProxyMesh-Country": "US"}
    }
)
Session Consistency¶
Request the same IP across multiple requests:
# First, capture the assigned IP
proxy_ip = response.headers.get(b"X-ProxyMesh-IP")
# Then request that same IP for subsequent requests
yield scrapy.Request(
    url="https://example.com/page2",
    meta={
        "proxy": "http://proxy.example.com:8080",
        "proxy_headers": {"X-ProxyMesh-IP": proxy_ip.decode()}
    }
)
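If the proxy does not return the header, response.headers.get() yields None and calling .decode() on it fails. A small helper can absorb that case. The sketch below is an assumption-laden illustration: sticky_proxy_headers is a hypothetical name, and the sample IP is a documentation address.

```python
def sticky_proxy_headers(raw_ip, header_name="X-ProxyMesh-IP"):
    """Build proxy_headers that ask for a previously assigned IP, if one is known."""
    if raw_ip is None:
        # No IP was reported, e.g. the proxy did not send the header.
        return {}
    # Scrapy header values are bytes; accept str too for convenience.
    ip = raw_ip.decode("latin-1") if isinstance(raw_ip, bytes) else raw_ip
    return {header_name: ip}


with_ip = sticky_proxy_headers(b"203.0.113.7")
without_ip = sticky_proxy_headers(None)
```

The returned dict can be used as the proxy_headers value in the follow-up request's meta; an empty dict simply sends no extra header.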
Debugging and Logging¶
Log proxy information for debugging:
def parse(self, response):
    proxy_ip = response.headers.get(b"X-ProxyMesh-IP")
    self.logger.info(f"Request to {response.url} via proxy IP: {proxy_ip}")