URL Consistency Annotator
The TLS fingerprint arms race: when IP blocking fails, impersonate Chrome at the handshake
Chrome 124
TLS Impersonation
30+5+5
Concurrency
3×
Ad-Rotation Checks
$0
LLM API Cost
Bot detection has gone through three generations: IP blocking (beaten by rotating proxies), JavaScript challenges like Cloudflare Bot Management (beaten by headless browsers), and now TLS fingerprinting (2022+). Most Python HTTP libraries (requests, aiohttp, even httpx) produce distinctive TLS ClientHello signatures that CDNs with fingerprint-based detection can flag on the first request. The Python ecosystem's response is curl_cffi: Python bindings for curl-impersonate, a curl build that reproduces a specific browser's TLS handshake.
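A minimal sketch of what an impersonated fetch can look like with curl_cffi. The function name `fetch_as_chrome` and the timeout are illustrative assumptions, not the project's actual code, and the `chrome124` target requires a curl_cffi release that ships it.

```python
IMPERSONATE_TARGET = "chrome124"  # assumed target name; check your curl_cffi version


def fetch_as_chrome(url: str, timeout: float = 20.0):
    """Fetch `url` with a Chrome-like TLS handshake.

    Returns (status_code, final_url, html). Hypothetical helper for illustration.
    """
    # Imported lazily so this module still loads where curl_cffi is not installed.
    from curl_cffi import requests as cffi_requests

    resp = cffi_requests.get(
        url,
        impersonate=IMPERSONATE_TARGET,  # selects the browser handshake to mimic
        timeout=timeout,
        allow_redirects=True,
    )
    return resp.status_code, resp.url, resp.text
```

Usage would be along the lines of `status, final_url, html = fetch_as_chrome("https://example.com/")`, with the final URL kept for later redirect comparison.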
Chrome 124's TLS fingerprint includes specific cipher suite ordering, ALPN protocol negotiation, session ticket support, and extension sequencing. curl_cffi replicates these, so to a server that fingerprints the handshake the client looks like a real Chrome browser. For the ~60% of pages that still need JS rendering after TLS bypass, Playwright handles full DOM execution with 5 concurrent pages, which is enough for the annotation throughput we need without exhausting memory.
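The 5-page cap can be sketched with an asyncio semaphore around Playwright's async API. This is one plausible way to do it, not necessarily how the annotator is built: `bounded_gather`, `render_all`, the `networkidle` wait, and the 30-second timeout are all assumptions.

```python
import asyncio

MAX_PAGES = 5  # concurrency cap stated in the text


async def bounded_gather(worker, items, limit=MAX_PAGES):
    """Run worker(item) for every item with at most `limit` calls in flight."""
    sem = asyncio.Semaphore(limit)

    async def guarded(item):
        async with sem:
            return await worker(item)

    # Results come back in the same order as `items`.
    return await asyncio.gather(*(guarded(item) for item in items))


async def render_all(urls):
    """Render each URL in headless Chromium, at most MAX_PAGES pages at a time."""
    # Imported lazily so bounded_gather is usable without Playwright installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as pw:
        browser = await pw.chromium.launch(headless=True)

        async def render(url):
            page = await browser.new_page()
            try:
                await page.goto(url, wait_until="networkidle", timeout=30_000)
                return url, await page.content()
            finally:
                await page.close()  # release the page even if navigation fails

        try:
            return await bounded_gather(render, urls)
        finally:
            await browser.close()
```

A call such as `asyncio.run(render_all(urls))` returns `(url, html)` pairs. In this sketch a single navigation error propagates out of `gather`; a production version would likely catch it per page.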
The local LLM choice (Qwen3:1.7b via Ollama) was deliberate. Cloud LLM APIs add latency, cost, and GDPR complexity: every page content chunk you send to OpenAI is a potential data exposure. Running inference locally means zero marginal cost, no data leaving the machine, and no rate-limit surprises. At 1.7B parameters, Qwen3 is small enough to run on a MacBook Pro but accurate enough to distinguish 'same product different URL structure' from 'different product same domain'.
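As a rough sketch, the same-or-different check can be a single call to Ollama's local HTTP endpoint. The prompt wording, the page dictionary shape (`url`, `title`, `text`), and the helper names are assumptions for illustration; the original prompt is not shown in this write-up.

```python
import json
import re
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL = "qwen3:1.7b"


def build_prompt(a: dict, b: dict) -> str:
    """Build a one-word-answer prompt from two hypothetical page summaries."""
    return (
        "Do these two pages describe the same product? "
        "Answer with exactly one word: SAME or DIFFERENT.\n\n"
        f"Page A: {a['url']}\nTitle: {a['title']}\n{a['text'][:1500]}\n\n"
        f"Page B: {b['url']}\nTitle: {b['title']}\n{b['text'][:1500]}\n"
    )


def parse_verdict(raw: str) -> str:
    """Map raw model output to 'same', 'different', or 'unknown'."""
    # Qwen3 may emit a <think>...</think> block before answering; drop it first.
    cleaned = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL)
    match = re.search(r"\b(SAME|DIFFERENT)\b", cleaned.upper())
    return match.group(1).lower() if match else "unknown"


def classify_pair(a: dict, b: dict, timeout: float = 60.0) -> str:
    """Ask the local model whether two pages show the same product."""
    payload = json.dumps({
        "model": MODEL,
        "prompt": build_prompt(a, b),
        "stream": False,                 # return one JSON object, not a stream
        "options": {"temperature": 0},   # keep the verdict as repeatable as possible
    }).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return parse_verdict(json.loads(resp.read())["response"])
```

Nothing here leaves localhost, which is the point of the design: the page text only ever travels to the Ollama process on the same machine.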
The rotating-ad detection is the edge case that kills naive URL deduplication: ad networks serve different landing pages on sequential requests to the same URL. We re-fetch each URL 3 times with fresh TLS sessions and flag any case where the final redirect domain changes. Those are rotating ad slots, not duplicate content.
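The rotation check reduces to comparing the final hostnames of three independent fetches. The sketch below assumes hostname comparison with a leading `www.` stripped (a stricter version would compare registrable domains), and all function names are hypothetical.

```python
from urllib.parse import urlsplit

CHECKS = 3  # number of re-fetches per URL, as described above


def final_domain(url: str) -> str:
    """Lower-cased hostname of a final URL, without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host.removeprefix("www.")  # Python 3.9+


def is_rotating(final_urls) -> bool:
    """True when the fetches did not all land on the same domain."""
    return len({final_domain(u) for u in final_urls}) > 1


def probe(url: str, fetch_final_url, checks: int = CHECKS):
    """Fetch `url` `checks` times; return (rotating?, list of final URLs)."""
    finals = [fetch_final_url(url) for _ in range(checks)]
    return is_rotating(finals), finals


def fetch_final_url_fresh(url: str, timeout: float = 20.0) -> str:
    """Follow redirects in a brand-new session so no TLS session is reused."""
    # Imported lazily so the pure helpers above work without curl_cffi.
    from curl_cffi import requests as cffi_requests

    with cffi_requests.Session() as session:
        resp = session.get(
            url, impersonate="chrome124", timeout=timeout, allow_redirects=True
        )
        return resp.url
```

A call like `probe(url, fetch_final_url_fresh)` returns the flag together with the observed final URLs, so flagged slots can be excluded from deduplication and inspected later.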