Tag: devops

  • Containerizing Search: Docker and Kubernetes for Solr Deployments

    Container orchestration has transformed how we deploy and manage search infrastructure. This guide covers Docker best practices for Apache Solr, including image optimization, volume management for index persistence, and health check configuration. We then move to Kubernetes deployments using StatefulSets for Solr nodes, persistent volume claims for index storage, and horizontal pod autoscaling based on query load. Advanced topics include implementing rolling updates with zero downtime, configuring resource limits and requests for predictable performance, and setting up monitoring with Prometheus and Grafana. Production patterns cover multi-AZ deployments, backup strategies using Kubernetes CronJobs, and disaster recovery procedures.
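The StatefulSet-plus-PVC pattern described above can be sketched in a minimal manifest. This is an illustrative fragment, not the guide's full deployment: the resource figures, replica count, and health-check path are example values you would tune for your cluster (the health endpoint shown assumes a recent Solr with the built-in health check handler enabled).

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: solr
spec:
  serviceName: solr-headless    # headless Service gives each pod a stable DNS name
  replicas: 3
  selector:
    matchLabels:
      app: solr
  template:
    metadata:
      labels:
        app: solr
    spec:
      containers:
        - name: solr
          image: solr:9
          ports:
            - containerPort: 8983
          resources:            # requests/limits for predictable performance
            requests:
              cpu: "2"
              memory: 4Gi
            limits:
              cpu: "4"
              memory: 8Gi
          readinessProbe:       # keep traffic off nodes that are not yet serving
            httpGet:
              path: /solr/admin/info/health
              port: 8983
            initialDelaySeconds: 20
            periodSeconds: 10
          volumeMounts:
            - name: data
              mountPath: /var/solr
  volumeClaimTemplates:         # one PersistentVolumeClaim per pod for index storage
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi
```

Because a StatefulSet gives each replica a stable identity and its own claim, indexes survive pod rescheduling and rolling updates proceed one pod at a time.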

  • Building a Web Crawler from Scratch: Architecture and Lessons Learned

    After building and operating a web crawler that processes millions of pages, here are the architectural decisions that matter most.

    The crawler uses a multi-worker architecture: a coordinator distributes URLs from a priority queue, and workers fetch pages concurrently. Each worker has three rendering strategies: fast HTTP (curl-cffi), headless browser (Playwright for JS-heavy sites), and fallback (httpx with retry logic).
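The coordinator/worker split can be sketched with the standard library alone. The three fetch functions below are stand-in stubs for the real strategies (curl-cffi, Playwright, httpx), and the `needs_js` heuristic is a hypothetical placeholder:

```python
import asyncio
import heapq

class PriorityFrontier:
    """Priority queue of (priority, url); lower numbers are fetched first."""
    def __init__(self):
        self._heap = []

    def push(self, priority, url):
        heapq.heappush(self._heap, (priority, url))

    def pop(self):
        return heapq.heappop(self._heap) if self._heap else None

async def fetch_fast(url):        # stand-in for the curl-cffi path
    return f"fast:{url}"

async def fetch_browser(url):     # stand-in for the Playwright path
    return f"browser:{url}"

async def fetch_fallback(url):    # stand-in for httpx with retry logic
    return f"fallback:{url}"

async def worker(frontier, results, needs_js):
    # Each worker drains the shared frontier, picking a strategy per URL.
    while (item := frontier.pop()) is not None:
        _, url = item
        if needs_js(url):
            page = await fetch_browser(url)
        else:
            try:
                page = await fetch_fast(url)
            except Exception:
                page = await fetch_fallback(url)
        results.append(page)

async def crawl(urls, n_workers=3):
    frontier = PriorityFrontier()
    for prio, url in urls:
        frontier.push(prio, url)
    results = []
    needs_js = lambda url: "spa" in url   # hypothetical JS-heavy heuristic
    await asyncio.gather(*(worker(frontier, results, needs_js)
                           for _ in range(n_workers)))
    return results
```

In the real system the coordinator would refill the frontier as new links are discovered; here the queue is seeded up front to keep the sketch short.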

    Content extraction uses trafilatura for article text, with custom extractors for PDF, DOCX, and XLSX files. Metadata extraction captures OG tags, JSON-LD structured data, meta descriptions, and canonical URLs.
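The metadata-capture step can be sketched with only the standard library (the article's pipeline uses trafilatura for body text; that part is omitted here):

```python
from html.parser import HTMLParser

class MetadataExtractor(HTMLParser):
    """Collect OG tags, the meta description, and the canonical link."""
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta":
            if a.get("property", "").startswith("og:"):
                self.meta[a["property"]] = a.get("content", "")
            elif a.get("name") == "description":
                self.meta["description"] = a.get("content", "")
        elif tag == "link" and a.get("rel") == "canonical":
            self.meta["canonical"] = a.get("href", "")

def extract_metadata(html):
    parser = MetadataExtractor()
    parser.feed(html)
    return parser.meta
```

JSON-LD extraction would follow the same pattern, capturing the contents of `<script type="application/ld+json">` blocks and parsing them with the `json` module.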

    The canonical URL check is critical: if a page’s canonical URL differs from the crawled URL, we skip indexing it. This prevents duplicate content from paginated pages, tracking URLs, and www/non-www variants.
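The gate can be sketched as below. The normalization rules (lowercased host, stripped `www.`, trimmed trailing slash) are illustrative assumptions, not the crawler's exact code; the query string is deliberately kept so that pagination and tracking URLs fail the comparison:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    """Light normalization so cosmetic variants compare equal."""
    parts = urlsplit(url)
    host = parts.netloc.lower().removeprefix("www.")
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))

def should_index(crawled_url, canonical_url):
    if not canonical_url:          # no canonical declared: index as-is
        return True
    return normalize(crawled_url) == normalize(canonical_url)
```

With these rules, `https://www.example.com/post/` passes against a canonical of `https://example.com/post`, while `https://example.com/post?page=2` is skipped.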

    Anti-bot detection (Cloudflare challenges, CAPTCHAs) is handled by the Playwright rendering daemon, which maintains persistent browser contexts with shared cookies. We detect challenge pages by looking for specific HTML patterns and JavaScript challenges.
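A minimal sketch of the pattern-based detection: the fingerprints below are common markers of Cloudflare and CAPTCHA interstitials, offered as illustrative examples rather than the crawler's actual (or an exhaustive) list:

```python
import re

# Illustrative challenge-page fingerprints; real deployments maintain a
# larger, regularly updated set.
CHALLENGE_PATTERNS = [
    re.compile(r"cf-browser-verification", re.I),
    re.compile(r"Checking your browser before accessing", re.I),
    re.compile(r"cf_chl_", re.I),
    re.compile(r"g-recaptcha", re.I),
    re.compile(r"h-captcha", re.I),
]

def is_challenge_page(html: str) -> bool:
    """Return True if the HTML looks like an anti-bot interstitial."""
    return any(p.search(html) for p in CHALLENGE_PATTERNS)
```

When a challenge is detected, the fetch would be routed to the Playwright daemon, whose persistent browser context can carry the clearance cookies forward to later requests.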

    Embedding generation happens at flush time: when the buffer reaches 100 documents, we batch-embed them using E5-large-instruct (1024 dimensions) before sending to Solr. The MAX_EMBED_PAYLOAD_CHARS limit (40,000) prevents API timeouts.
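The flush-time batching can be sketched as follows. The two thresholds come from the article; `embed_batch` and `send_to_solr` are stand-ins for the real embedding and Solr clients:

```python
FLUSH_SIZE = 100                    # flush when the buffer reaches 100 docs
MAX_EMBED_PAYLOAD_CHARS = 40_000    # cap per embedding request to avoid timeouts

class DocumentBuffer:
    def __init__(self, embed_batch, send_to_solr):
        self.docs = []
        self.embed_batch = embed_batch      # texts -> list of vectors
        self.send_to_solr = send_to_solr    # docs with vectors -> Solr

    def add(self, doc):
        self.docs.append(doc)
        if len(self.docs) >= FLUSH_SIZE:
            self.flush()

    def flush(self):
        if not self.docs:
            return
        # Split into sub-batches that stay under the payload limit so a
        # single oversized request cannot time out the embedding API.
        batch, size = [], 0
        for doc in self.docs:
            n = len(doc["text"])
            if batch and size + n > MAX_EMBED_PAYLOAD_CHARS:
                self._embed_and_send(batch)
                batch, size = [], 0
            batch.append(doc)
            size += n
        if batch:
            self._embed_and_send(batch)
        self.docs = []

    def _embed_and_send(self, batch):
        vectors = self.embed_batch([d["text"] for d in batch])
        for doc, vec in zip(batch, vectors):
            doc["vector"] = vec
        self.send_to_solr(batch)
```

Batching at flush time amortizes model overhead across 100 documents while the character cap keeps any single embedding request bounded.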