Opensolr Search

Tag: api

Building a Web Crawler from Scratch: Architecture and Lessons Learned

After building and operating a web crawler that processes millions of pages, here are the architectural decisions that matter most.

The crawler uses a multi-worker architecture: a coordinator distributes URLs from a priority queue, and workers fetch pages concurrently. Each worker has three rendering strategies: fast HTTP (curl-cffi), headless browser (Playwright for JS-heavy sites), and fallback (httpx with retry logic).

Content extraction uses trafilatura for article text, with custom extractors for PDF, DOCX, and XLSX files. Metadata extraction captures OG tags, JSON-LD structured data, meta descriptions, and canonical URLs.

The canonical URL check is critical: if a page’s canonical URL differs from the crawled URL, we skip indexing it. This prevents duplicate content from paginated pages, tracking URLs, and www/non-www variants.

Anti-bot detection (Cloudflare challenges, CAPTCHAs) is handled by the Playwright rendering daemon, which maintains persistent browser contexts with shared cookies. We detect challenge pages by looking for specific HTML patterns and JavaScript challenges.

Embedding generation happens at flush time: when the buffer reaches 100 documents, we batch-embed them using E5-large-instruct (1024 dimensions) before sending to Solr. The MAX_EMBED_PAYLOAD_CHARS limit (40,000) prevents API timeouts.

April 7, 2026
The Complete Guide to Search Analytics: From Query Logs to Business Insights

Search analytics transforms raw query logs into actionable business intelligence. Every search query is a signal of user intent — understanding these signals drives product decisions, content strategy, and revenue optimization.

Key metrics to track: Query volume (trending up = growing engagement), No-results rate (content gaps to fill), Click-through rate per query (relevance quality), Average result position of clicks (are users finding answers quickly?), and Unique visitor patterns (new vs returning searchers).

The analytics pipeline: 1) Log every query with timestamp, results count, response time, and IP hash (SHA-256 for privacy). 2) Track clicks with query context, result URL, position, and timestamp. 3) Aggregate daily for dashboard visualizations. 4) Identify patterns: which queries have 0 results? Which results are never clicked despite appearing?

Click-through rate analysis reveals relevance issues. If a query returns 50 results but users consistently click only the 5th result, your ranking needs tuning. If they click nothing and refine their query, the results aren’t matching intent.

No-results queries are your content roadmap. Every “0 results” query is a user telling you what they want but can’t find. Group them by topic, prioritize by volume, and create content to fill those gaps.

April 7, 2026
WordPress Plugin Development Best Practices: Security, Performance, and Standards

Building a WordPress plugin that passes the WordPress.org review requires strict adherence to coding standards, security best practices, and performance optimization.

Security essentials: 1) Nonces on every form (wp_nonce_field/wp_verify_nonce). 2) Capability checks (current_user_can) on every admin action. 3) Sanitize ALL input: sanitize_text_field(), absint(), esc_url_raw(). 4) Escape ALL output: esc_html(), esc_attr(), esc_url(), wp_kses(). 5) Never use eval(), never trust $_GET/$_POST without sanitization.

Performance: Enqueue scripts/styles only where needed (check the current page before loading). Use transients for caching API responses. Minimize database queries — batch operations instead of per-item queries. Use wp_remote_post() instead of cURL for HTTP requests (respects WordPress proxy settings).

Coding standards: TABS for indentation (not spaces!). Yoda conditions: if ( ‘value’ === $var ). Snake_case for functions, PascalCase for classes. File naming: class-name-here.php. Prefix everything with your plugin slug to avoid conflicts.

The WordPress Settings API handles option storage, validation, and nonce verification in one place. Use register_setting() with a sanitize_callback for validation. Group related options in a single array option to reduce database queries.

April 7, 2026