Category: Technology

  • How to Build a High-Performance Search Engine with Apache Solr

    Building a high-performance search engine requires careful attention to indexing strategy, query optimization, and infrastructure design. Apache Solr provides a robust foundation with features like inverted indexes, faceted search, and near-real-time indexing. This guide covers schema design, including field types and analyzers for multilingual content. We explore SolrCloud for distributed search across multiple shards, replication strategies for high availability, and caching configurations that reduce latency for repeated queries. Performance tuning tips include: enable docValues on fields used for sorting and faceting, minimize stored fields, move frequently used constraints into filter queries (fq) so they are cached independently of the main query, and configure warming queries so new searchers do not start with cold caches. Real-world benchmarks show that a properly tuned Solr cluster can handle 10,000+ queries per second with sub-100 ms latency.
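
    As a rough illustration of two of these tips, the sketch below builds a Schema API add-field command for a sort/facet field (docValues on, stored off) and a request whose constraints go into separate fq parameters. The field names, field types, and filter values are hypothetical; only the payload shapes are meant to be informative.

```python
from urllib.parse import urlencode


def add_field_command(name, field_type, *, stored=False, doc_values=True):
    """Build the JSON body for a Schema API 'add-field' command.

    docValues=True keeps a column-oriented copy of the field for sorting
    and faceting; stored=False avoids keeping a second copy for retrieval.
    """
    return {
        "add-field": {
            "name": name,
            "type": field_type,
            "indexed": True,
            "stored": stored,
            "docValues": doc_values,
        }
    }


def search_params(user_query, filters, *, sort=None, rows=10):
    """Build request parameters with one fq per constraint.

    Each fq is cached on its own, so a constraint shared by many requests
    can be reused regardless of the main query.
    """
    params = [("q", user_query), ("defType", "edismax"), ("rows", str(rows))]
    params += [("fq", f) for f in filters]
    if sort:
        params.append(("sort", sort))
    return params


# Hypothetical field and filters, for illustration only.
command = add_field_command("price", "pfloat")
query_string = urlencode(
    search_params("solr tuning", ["category:books", "in_stock:true"], sort="price asc")
)
```

    The command body would be POSTed to the collection's /schema endpoint and the query string sent to /select; both steps are left out so the sketch stays free of network calls.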

  • Understanding Hybrid Search: Combining Vector and Lexical Approaches

    Hybrid search represents a paradigm shift in information retrieval. By combining traditional lexical (keyword-based) search with modern vector (semantic) search, we can achieve results that are both precise and contextually relevant.

    Lexical search excels at exact matches — when a user searches for “PHP 8.3 migration guide”, lexical search finds documents containing those exact terms. However, it fails at understanding intent. A search for “how to upgrade my scripting language” won’t match documents about PHP migration.

    Vector search solves this by encoding queries and documents into high-dimensional vector spaces using embedding models like E5-large-instruct. Semantically similar content clusters together, so “upgrade scripting language” lands near “PHP migration” in vector space.

    The {!bool} query parser in Apache Solr combines both approaches in a single request: the edismax lexical query and the KNN vector query become clauses of one Boolean query, and the scores of the matching clauses are summed, with configurable weights controlling the balance. Union mode (should clauses) surfaces hits from either signal; intersection mode (must clauses) requires both.
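
    A minimal sketch of how such a request might be assembled, assuming union maps to should clauses and intersection to must clauses. The field names (vector, title, content) and the helper parameters lexq, vecq, and userq are hypothetical, and per-signal weighting is left out to keep the example short.

```python
def hybrid_params(user_query, query_vector, *, mode="union",
                  vector_field="vector", top_k=100, qf="title content"):
    """Build Solr request parameters for a {!bool} hybrid query.

    mode="union" uses 'should' clauses (a hit from either signal is enough);
    mode="intersection" uses 'must' clauses (both signals have to match).
    """
    if mode not in ("union", "intersection"):
        raise ValueError(f"unknown mode: {mode!r}")
    clause = "should" if mode == "union" else "must"
    vector_literal = "[" + ", ".join(str(float(x)) for x in query_vector) + "]"
    return {
        # Top-level Boolean query referencing the two sub-queries by name.
        "q": f"{{!bool {clause}=$lexq {clause}=$vecq}}",
        # Lexical side: edismax over the given fields, user text passed by reference.
        "lexq": f"{{!edismax qf='{qf}' v=$userq}}",
        "userq": user_query,
        # Semantic side: KNN over the dense vector field.
        "vecq": f"{{!knn f={vector_field} topK={top_k}}}{vector_literal}",
    }


params = hybrid_params("php 8.3 migration guide", [0.1, 0.2, 0.3], top_k=50)
```

    Passing the user text through a separate parameter keeps it out of the local-params string, so quotes or braces typed by the user cannot break the query syntax.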

    Key tuning parameters include: lexical_weight (0.1 = semantic-dominant, 1.0 = full lexical), vector_topk (candidate pool size), mm (minimum match), and quality_boost (content richness scoring).
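
    To make the effect of lexical_weight concrete, here is a small score-fusion sketch in plain Python. It assumes both score sets are already normalized to the range 0 to 1 and that the vector side receives a weight of 1 - lexical_weight; the text does not spell out either detail, so treat both as assumptions rather than a description of the actual implementation.

```python
def fuse_scores(lexical, vector, lexical_weight=0.5, mode="union"):
    """Combine per-document lexical and vector scores into one ranking.

    lexical, vector: dicts mapping document id -> normalized score.
    Assumption: the vector weight is 1 - lexical_weight.
    """
    if not 0.0 <= lexical_weight <= 1.0:
        raise ValueError("lexical_weight must be between 0 and 1")
    if mode == "union":
        doc_ids = set(lexical) | set(vector)
    elif mode == "intersection":
        doc_ids = set(lexical) & set(vector)
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    vector_weight = 1.0 - lexical_weight
    fused = {
        doc_id: lexical_weight * lexical.get(doc_id, 0.0)
        + vector_weight * vector.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    # Highest score first; ties are broken by document id for stable output.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


lexical_scores = {"a": 1.0, "b": 0.5}
vector_scores = {"b": 1.0, "c": 0.8}
full_lexical = fuse_scores(lexical_scores, vector_scores, lexical_weight=1.0)
both_required = fuse_scores(lexical_scores, vector_scores, mode="intersection")
```

    With lexical_weight=1.0 the vector-only document c still appears in union mode but scores zero, which matches the "full lexical" end of the range described above.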

  • Apache Solr vs Elasticsearch: A 2026 Comparison for Enterprise Search

    The search engine landscape in 2026 has evolved significantly. Both Apache Solr and Elasticsearch remain dominant players, but their strengths have diverged.

    Apache Solr, now with native KNN vector search and the {!bool} query parser for hybrid search, excels in structured data scenarios. Its faceting capabilities remain unmatched — nested facets, pivot facets, range facets with stats, and hierarchical drill-down navigation are all first-class features.

    Elasticsearch has invested heavily in its ML infrastructure with ELSER (Elastic Learned Sparse EncodeR) and vector search via dense_vector fields. Its strength lies in observability, log analytics, and the ELK stack ecosystem.

    For e-commerce and content search with faceted navigation, Solr’s combination of edismax, function queries, and the QueryElevation component provides a more flexible and performant foundation. The ability to pin/exclude results per query, boost by content quality, and apply complex mm (minimum match) rules gives search engineers fine-grained control.

    Cost considerations: Solr is Apache-licensed and runs on commodity hardware without licensing fees. OpenSearch, the open-source fork of Elasticsearch, competes on price, while Elastic’s proprietary features require a subscription.

  • Building a Web Crawler from Scratch: Architecture and Lessons Learned

    After building and operating a web crawler that processes millions of pages, here are the architectural decisions that matter most.

    The crawler uses a multi-worker architecture: a coordinator distributes URLs from a priority queue, and workers fetch pages concurrently. Each worker has three fetch strategies: fast HTTP (curl-cffi), headless browser (Playwright for JS-heavy sites), and fallback (httpx with retry logic).
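
    The coordinator/worker split can be sketched with asyncio as follows. The fetchers are injected as plain async callables so the example needs no network access, and the order in which strategies are tried is an assumption; the real crawler may select a strategy per site instead of falling through a chain.

```python
import asyncio


async def fetch_with_fallback(url, strategies):
    """Try each (name, fetch) pair in order and return the first success."""
    last_error = None
    for name, fetch in strategies:
        try:
            return name, await fetch(url)
        except Exception as exc:  # any failure moves on to the next strategy
            last_error = exc
    raise RuntimeError(f"all strategies failed for {url}") from last_error


async def worker(queue, strategies, results):
    """Pull URLs off the shared queue until cancelled."""
    while True:
        _priority, url = await queue.get()
        try:
            results[url] = await fetch_with_fallback(url, strategies)
        except RuntimeError:
            results[url] = None  # record the failure, keep the worker alive
        finally:
            queue.task_done()


async def crawl(urls_with_priority, strategies, num_workers=4):
    """Coordinator: fill the priority queue, start workers, wait for drain."""
    queue = asyncio.PriorityQueue()
    for priority, url in urls_with_priority:
        queue.put_nowait((priority, url))  # lower number = fetched earlier
    results = {}
    workers = [
        asyncio.create_task(worker(queue, strategies, results))
        for _ in range(num_workers)
    ]
    await queue.join()
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results
```

    In a real deployment the strategies list would wrap the three fetchers named above; here any async function that takes a URL and returns HTML will do.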

    Content extraction uses trafilatura for article text, with custom extractors for PDF, DOCX, and XLSX files. Metadata extraction captures OG tags, JSON-LD structured data, meta descriptions, and canonical URLs.

    The canonical URL check is critical: if a page’s canonical URL differs from the crawled URL, we skip indexing it. This prevents duplicate content from paginated pages, tracking URLs, and www/non-www variants.
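
    A minimal version of this check using only the standard library might look like the sketch below. The normalization is deliberately light (lowercase scheme and host, empty path treated as "/", fragment dropped) and is an assumption; the production rules are not described here. Pages without a canonical link are treated as indexable, which is also an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class CanonicalParser(HTMLParser):
    """Collect the href of the first <link rel="canonical"> in a page."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if "canonical" in rels and attr.get("href"):
            self.canonical = attr["href"]


def normalize(url):
    """Lowercase scheme and host, default the path to '/', drop the fragment."""
    parts = urlsplit(url)
    path = parts.path or "/"
    query = f"?{parts.query}" if parts.query else ""
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{path}{query}"


def should_index(crawled_url, html):
    """Return False when the page declares a different canonical URL."""
    parser = CanonicalParser()
    parser.feed(html)
    if parser.canonical is None:
        return True
    # Canonical hrefs may be relative, so resolve against the crawled URL.
    canonical = urljoin(crawled_url, parser.canonical)
    return normalize(canonical) == normalize(crawled_url)
```

    Because the host is compared as-is, a www variant that points its canonical link at the non-www URL is skipped, as is a tracking URL whose query string the canonical link omits.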

    Anti-bot detection (Cloudflare challenges, CAPTCHAs) is handled by the Playwright rendering daemon, which maintains persistent browser contexts with shared cookies. We detect challenge pages by looking for specific HTML patterns and JavaScript challenges.
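
    Pattern-based challenge detection can be as simple as the sketch below. The three patterns are illustrative guesses at common markers, not the list the crawler actually uses, and a page that legitimately embeds a CAPTCHA widget (a login form, for instance) would be flagged too, so a real detector would need more context.

```python
import re

# Illustrative markers only; the production pattern list is not shown here.
CHALLENGE_PATTERNS = [
    re.compile(r"<title>\s*just a moment", re.IGNORECASE),
    re.compile(r"/cdn-cgi/challenge-platform/", re.IGNORECASE),
    re.compile(r"g-recaptcha|h-captcha", re.IGNORECASE),
]


def looks_like_challenge(html, scan_chars=20_000):
    """Return True if the start of the document matches a known marker.

    Only the first scan_chars characters are checked, since challenge
    pages are short and their markers appear early in the document.
    """
    head = html[:scan_chars]
    return any(pattern.search(head) for pattern in CHALLENGE_PATTERNS)
```

    A positive result would route the URL to the browser-based renderer instead of indexing the challenge page itself.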

    Embedding generation happens at flush time: when the buffer reaches 100 documents, we batch-embed them using E5-large-instruct (1024 dimensions) before sending to Solr. The MAX_EMBED_PAYLOAD_CHARS limit (40,000) prevents API timeouts.
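
    The flush-time batching could be sketched like this. The embed and send callables are placeholders for the embedding API client and the Solr update call, and applying MAX_EMBED_PAYLOAD_CHARS by truncating each document's text is an assumption; the limit might equally apply to the whole batch.

```python
FLUSH_SIZE = 100
MAX_EMBED_PAYLOAD_CHARS = 40_000


class IndexBuffer:
    """Collect documents and embed them in one batch when the buffer fills."""

    def __init__(self, embed, send, flush_size=FLUSH_SIZE):
        self.embed = embed  # callable: list of texts -> list of vectors
        self.send = send    # callable: list of documents -> None
        self.flush_size = flush_size
        self.docs = []

    def add(self, doc):
        self.docs.append(doc)
        if len(self.docs) >= self.flush_size:
            self.flush()

    def flush(self):
        if not self.docs:
            return
        batch, self.docs = self.docs, []
        # Assumption: the character limit is enforced per document.
        texts = [doc["content"][:MAX_EMBED_PAYLOAD_CHARS] for doc in batch]
        vectors = self.embed(texts)
        if len(vectors) != len(batch):
            raise ValueError("embedding count does not match batch size")
        for doc, vector in zip(batch, vectors):
            doc["vector"] = list(vector)
        self.send(batch)
```

    Calling flush() once more at the end of a crawl sends whatever is left in the buffer, so a final partial batch is not lost.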

  • WordPress Plugin Development Best Practices: Security, Performance, and Standards

    Building a WordPress plugin that passes the WordPress.org review requires strict adherence to coding standards, security best practices, and performance optimization.

    Security essentials: 1) Nonces on every form (wp_nonce_field/wp_verify_nonce). 2) Capability checks (current_user_can) on every admin action. 3) Sanitize ALL input: sanitize_text_field(), absint(), esc_url_raw(). 4) Escape ALL output: esc_html(), esc_attr(), esc_url(), wp_kses(). 5) Never use eval(), never trust $_GET/$_POST without sanitization.

    Performance: Enqueue scripts/styles only where needed (check the current page before loading). Use transients for caching API responses. Minimize database queries by batching operations instead of issuing per-item queries. Use the WordPress HTTP API (wp_remote_get(), wp_remote_post()) instead of calling cURL directly; it respects WordPress proxy settings.

    Coding standards: use tabs for indentation, not spaces. Yoda conditions: if ( 'value' === $var ). Use snake_case for function and variable names, and capitalized words separated by underscores for class names (as in WP_Error). File naming: class-name-here.php. Prefix everything with your plugin slug to avoid conflicts.

    The WordPress Settings API handles option storage, validation, and nonce verification in one place. Use register_setting() with a sanitize_callback for validation. Group related options in a single array option to reduce database queries.