Tag: web-crawler

  • Web Crawling at Scale: Architecture and Optimization Strategies

    Building a web crawler that can process millions of pages requires careful architectural decisions. This article covers the multi-worker architecture pattern, in which a coordinator distributes URLs from a priority queue while workers fetch and process pages concurrently. Key optimization strategies include DNS caching, connection pooling, adaptive politeness delays, and incremental crawling using ETag and Last-Modified headers. Content extraction is covered through DOM-based article extraction, boilerplate removal, and structured data parsing from JSON-LD and microdata. Error handling patterns cover retry strategies, circuit breakers for unresponsive hosts, and graceful degradation when facing rate limits or CAPTCHAs. The article includes benchmarks showing how these optimizations reduced crawl time from 48 hours to 6 hours for a 500,000-page site.
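
    As a rough illustration of the incremental crawling idea mentioned above, the sketch below builds conditional request headers from validators stored on a previous crawl. The function names `conditional_headers` and `needs_reprocessing` are hypothetical and not taken from the article; only the `If-None-Match` / `If-Modified-Since` headers and the 304 status code are standard HTTP.

```python
from typing import Dict, Optional


def conditional_headers(etag: Optional[str],
                        last_modified: Optional[str]) -> Dict[str, str]:
    """Build conditional GET headers from validators saved on the last crawl.

    Hypothetical helper: the stored values are assumed to be the raw ETag and
    Last-Modified response header values from the previous fetch.
    """
    headers: Dict[str, str] = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def needs_reprocessing(status_code: int) -> bool:
    """A 304 Not Modified response means the stored copy is still current."""
    return status_code != 304
```

    With headers like these attached to the request, a server that supports validators can answer 304 with an empty body, so an unchanged page costs a round trip but no download or re-extraction.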

  • Building a Web Crawler from Scratch: Architecture and Lessons Learned

    After building and operating a web crawler that processes millions of pages, we found that these are the architectural decisions that matter most.

    The crawler uses a multi-worker architecture: a coordinator distributes URLs from a priority queue, and workers fetch pages concurrently. Each worker can use one of three fetch strategies: fast HTTP (curl-cffi), headless browser (Playwright for JS-heavy sites), and fallback (httpx with retry logic).
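
    The coordinator/worker split can be sketched with `asyncio` as below. This is a minimal sketch under stated assumptions, not the crawler's actual code: the fetchers are stubs standing in for curl-cffi and Playwright, the order in which strategies are tried is assumed, and the names `fetch_with_fallback`, `worker`, and `crawl` are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Optional, Tuple

Fetcher = Callable[[str], Awaitable[Optional[str]]]
Result = Tuple[Optional[str], Optional[str]]


async def fetch_with_fallback(url: str,
                              strategies: List[Tuple[str, Fetcher]]) -> Result:
    """Try each strategy in order; return (strategy_name, html) on first success."""
    for name, fetch in strategies:
        try:
            html = await fetch(url)
        except Exception:
            continue  # assumption: any error simply moves on to the next strategy
        if html is not None:
            return name, html
    return None, None


async def worker(queue: asyncio.PriorityQueue,
                 strategies: List[Tuple[str, Fetcher]],
                 results: Dict[str, Result]) -> None:
    """Pull URLs off the shared queue until cancelled."""
    while True:
        _priority, url = await queue.get()
        try:
            results[url] = await fetch_with_fallback(url, strategies)
        finally:
            queue.task_done()


async def crawl(urls: List[Tuple[int, str]],
                strategies: List[Tuple[str, Fetcher]],
                num_workers: int = 4) -> Dict[str, Result]:
    """Coordinator: fill the priority queue (lower number first), run workers."""
    queue: asyncio.PriorityQueue = asyncio.PriorityQueue()
    for priority, url in urls:
        queue.put_nowait((priority, url))
    results: Dict[str, Result] = {}
    tasks = [asyncio.create_task(worker(queue, strategies, results))
             for _ in range(num_workers)]
    await queue.join()  # wait until every queued URL has been processed
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return results


# Stub fetchers (hypothetical stand-ins for the real HTTP and browser clients).
async def fast_http(url: str) -> Optional[str]:
    return None if "js-app" in url else "<html>fast</html>"


async def headless_browser(url: str) -> Optional[str]:
    return "<html>rendered</html>"
```

    Calling `asyncio.run(crawl([(0, "https://example.com/js-app")], [("fast_http", fast_http), ("browser", headless_browser)]))` would route that URL through the browser stub, because the fast stub returns `None` for it.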

    Content extraction uses trafilatura for article text, with custom extractors for PDF, DOCX, and XLSX files. Metadata extraction captures OG tags, JSON-LD structured data, meta descriptions, and canonical URLs.

    The canonical URL check is critical: if a page’s canonical URL differs from the crawled URL, we skip indexing it. This prevents duplicate content from paginated pages, tracking URLs, and www/non-www variants.
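
    A canonical check along these lines could look like the sketch below, which uses only the standard library. The helper names are hypothetical, and the normalization (dropping the fragment and a trailing slash before comparing) is an assumption; the post does not say how the two URLs are compared.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urldefrag, urljoin


class _CanonicalParser(HTMLParser):
    """Collects the href of the first <link rel="canonical"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_tokens and attr.get("href"):
            self.canonical = attr["href"]


def extract_canonical(html: str, base_url: str) -> Optional[str]:
    """Return the absolute canonical URL declared in the page, if any."""
    parser = _CanonicalParser()
    parser.feed(html)
    if parser.canonical is None:
        return None
    return urljoin(base_url, parser.canonical)


def _normalize(url: str) -> str:
    # Assumption: fragments and a trailing slash are not meaningful differences.
    return urldefrag(url).url.rstrip("/")


def should_index(html: str, crawled_url: str) -> bool:
    """Index only when the page has no canonical or it matches the crawled URL."""
    canonical = extract_canonical(html, crawled_url)
    if canonical is None:
        return True
    return _normalize(canonical) == _normalize(crawled_url)
```

    Under this sketch, a page crawled at `https://example.com/post?utm_source=x` that declares `https://example.com/post` as canonical is skipped, and the same holds for a `www` variant.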

    Anti-bot challenges (Cloudflare challenges, CAPTCHAs) are handled by the Playwright rendering daemon, which maintains persistent browser contexts with shared cookies. We detect challenge pages by matching known HTML patterns and JavaScript challenge markers in the response.
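
    Pattern-based challenge detection might be sketched as follows. The specific markers in `CHALLENGE_PATTERNS` are illustrative assumptions chosen from commonly seen challenge and CAPTCHA markup; the post does not list the patterns it actually uses, and `looks_like_challenge` is a hypothetical name.

```python
import re
from typing import List, Pattern

# Illustrative markers only (assumed, not taken from the post):
# an interstitial title, a challenge script path, and CAPTCHA widget classes.
CHALLENGE_PATTERNS: List[Pattern[str]] = [
    re.compile(r"<title>\s*Just a moment", re.IGNORECASE),
    re.compile(r"/cdn-cgi/challenge-platform/", re.IGNORECASE),
    re.compile(r"class=[\"'][^\"']*\bg-recaptcha\b", re.IGNORECASE),
    re.compile(r"class=[\"'][^\"']*\bh-captcha\b", re.IGNORECASE),
]


def looks_like_challenge(html: str) -> bool:
    """Return True if the response body matches any known challenge marker."""
    return any(pattern.search(html) for pattern in CHALLENGE_PATTERNS)
```

    A worker could call a check like this on the fast HTTP response and hand the URL to the browser daemon when it returns `True`.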

    Embedding generation happens at flush time: when the buffer reaches 100 documents, we batch-embed them using E5-large-instruct (1024 dimensions) before sending to Solr. The MAX_EMBED_PAYLOAD_CHARS limit (40,000) prevents API timeouts.
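
    The flush-time batching can be sketched as below. The threshold of 100 and the `MAX_EMBED_PAYLOAD_CHARS` value of 40,000 come from the post; everything else is assumed for illustration, including the `IndexBuffer` class, the injected `embed_batch` and `send_to_index` callables (stand-ins for the embedding API and the Solr client), and the choice to apply the character limit per document by truncation.

```python
from typing import Any, Callable, Dict, List

FLUSH_THRESHOLD = 100             # buffer size that triggers a flush (from the post)
MAX_EMBED_PAYLOAD_CHARS = 40_000  # character cap (from the post); per-document use is assumed

Doc = Dict[str, Any]


class IndexBuffer:
    """Buffers documents and embeds them in one batch when flushed."""

    def __init__(self,
                 embed_batch: Callable[[List[str]], List[List[float]]],
                 send_to_index: Callable[[List[Doc]], None],
                 flush_threshold: int = FLUSH_THRESHOLD) -> None:
        self._embed_batch = embed_batch      # hypothetical embedding client
        self._send_to_index = send_to_index  # hypothetical index client
        self._flush_threshold = flush_threshold
        self._docs: List[Doc] = []

    def add(self, doc: Doc) -> None:
        self._docs.append(doc)
        if len(self._docs) >= self._flush_threshold:
            self.flush()

    def flush(self) -> None:
        if not self._docs:
            return
        batch, self._docs = self._docs, []
        # Truncate each text so a single huge page cannot stall the embedding call.
        texts = [doc["text"][:MAX_EMBED_PAYLOAD_CHARS] for doc in batch]
        vectors = self._embed_batch(texts)
        for doc, vector in zip(batch, vectors):
            doc["embedding"] = vector
        self._send_to_index(batch)
```

    Embedding at flush time means one embedding request per batch of documents instead of one per page, at the cost of holding up to a full buffer of documents in memory.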