After building and operating a web crawler that processes millions of pages, here are the architectural decisions that matter most.
The crawler uses a multi-worker architecture: a coordinator distributes URLs from a priority queue, and workers fetch pages concurrently. Each worker has three rendering strategies: fast HTTP (curl-cffi), headless browser (Playwright for JS-heavy sites), and fallback (httpx with retry logic).
Content extraction uses trafilatura for article text, with custom extractors for PDF, DOCX, and XLSX files. Metadata extraction captures OG tags, JSON-LD structured data, meta descriptions, and canonical URLs.
The canonical URL check is critical: if a page’s canonical URL differs from the crawled URL, we skip indexing it. This prevents duplicate content from paginated pages, tracking URLs, and www/non-www variants.
Anti-bot detection (Cloudflare challenges, CAPTCHAs) is handled by the Playwright rendering daemon, which maintains persistent browser contexts with shared cookies. We detect challenge pages by looking for specific HTML patterns and JavaScript challenges.
Embedding generation happens at flush time: when the buffer reaches 100 documents, we batch-embed them using E5-large-instruct (1024 dimensions) before sending to Solr. The MAX_EMBED_PAYLOAD_CHARS limit (40,000) prevents API timeouts.

Leave a Reply