Step-by-step guide to configuring the Opensolr Web Crawler for your website. Covers: sitemap registration, crawl modes (1-6), thread configuration, relax delay tuning, and firewall whitelisting.
Mode 5 (Shallow Host) is recommended for most sites. It crawls the URLs listed in the sitemap and follows links found on those pages, but stays on the same hostname. Mode 2 (Sitemap Only) crawls only the URLs explicitly listed in the sitemap and follows no links.
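The same-hostname rule can be illustrated with a minimal Python sketch. This is not the crawler's actual code: the function name `same_host_links` is hypothetical, and it assumes "same hostname" means an exact hostname match, so subdomains count as different hosts.

```python
from urllib.parse import urljoin, urlparse


def same_host_links(page_url, hrefs):
    """Resolve hrefs against page_url and keep those on the same hostname.

    Assumption: an exact hostname match is required, so a subdomain such as
    blog.example.com is treated as a different host from example.com.
    """
    host = urlparse(page_url).hostname
    kept = []
    for href in hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        parsed = urlparse(absolute)
        # Skip non-web schemes (mailto:, javascript:, ...) and other hosts.
        if parsed.scheme in ("http", "https") and parsed.hostname == host:
            kept.append(absolute)
    return kept


links = same_host_links(
    "https://example.com/docs/",
    ["/a", "b.html", "https://other.example.org/x", "mailto:me@example.com"],
)
print(links)
```

In a sitemap-only crawl (Mode 2), a filter like this would never run, because no links are followed at all.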
Content extraction: the crawler uses trafilatura to extract text from HTML pages and Apache Tika for binary documents (PDF, DOCX, XLSX). Metadata is extracted from Open Graph (OG) tags, JSON-LD, and meta tags.
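As a rough illustration of the metadata step, the sketch below collects Open Graph and plain meta tags using only the Python standard library. It is an assumption-laden stand-in, not the crawler's implementation: the names `MetaTagParser` and `extract_meta` are hypothetical, JSON-LD is not handled, and the first value seen for a key is kept.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collects <meta property=...> (Open Graph) and <meta name=...> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Open Graph uses property="og:..."; classic meta tags use name="...".
        key = attr.get("property") or attr.get("name")
        content = attr.get("content")
        if key and content is not None:
            # Assumption: the first occurrence of a key wins.
            self.meta.setdefault(key, content)


def extract_meta(html):
    parser = MetaTagParser()
    parser.feed(html)
    return parser.meta


page = (
    '<html><head>'
    '<meta property="og:title" content="Hello">'
    '<meta name="description" content="A page"/>'
    '</head><body></body></html>'
)
print(extract_meta(page))
```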
Embedding: documents are batch-embedded using the E5-large-instruct model (1024 dimensions) at flush time. The MAX_EMBED_PAYLOAD_CHARS limit is 40,000 characters per document.
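How a per-document character limit and batching might interact can be sketched as follows. Only the constant name and its value come from the text above; the function `prepare_batches`, the batch size of 32, and the choice to truncate over-long documents (rather than skip or split them) are assumptions made for illustration. The embedding call itself is omitted.

```python
MAX_EMBED_PAYLOAD_CHARS = 40_000  # per-document limit stated above


def prepare_batches(texts, batch_size=32):
    """Clip each document to the character limit, then group into batches.

    Assumptions: over-long documents are truncated, and batch_size=32 is an
    arbitrary illustrative value, not a documented crawler setting.
    """
    clipped = [text[:MAX_EMBED_PAYLOAD_CHARS] for text in texts]
    return [
        clipped[i:i + batch_size]
        for i in range(0, len(clipped), batch_size)
    ]


docs = ["a" * 50_000, "short document", "another one"]
batches = prepare_batches(docs, batch_size=2)
print(len(batches), [len(doc) for batch in batches for doc in batch])
```

Each batch would then be passed to the embedding model in a single call, which is the usual reason for embedding at flush time instead of per document.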
