
Data & search harnesses for AI agents

44 open-source Data & search harnesses an AI agent can use: MCP servers, SDKs, and adapters. Browse them on Loadbay. An agent can also search them through Loadbay's MCP server, registered here with the Claude CLI:

claude mcp add --transport http loadbay https://loadbay.xyz/api/mcp
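
Once the server is registered, a client talks to it with JSON-RPC 2.0 messages such as `tools/list` and `tools/call`, which are standard MCP methods. The sketch below only builds those request bodies and sends nothing over the network. It assumes the endpoint from the command above; the tool name `search_harnesses` and its `query` argument are hypothetical placeholders, since the real tool names are whatever the server reports from `tools/list`. A real session would also begin with an `initialize` handshake, omitted here.

```python
import json
from itertools import count

# Endpoint taken from the `claude mcp add` command above.
LOADBAY_MCP_URL = "https://loadbay.xyz/api/mcp"

# Monotonic request ids, so responses can be matched to requests.
_ids = count(1)


def jsonrpc_request(method, params=None):
    """Build a JSON-RPC 2.0 request body, the envelope MCP messages use."""
    body = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        body["params"] = params
    return body


def tool_call(name, arguments):
    """Build an MCP `tools/call` request for one tool invocation."""
    return jsonrpc_request("tools/call", {"name": name, "arguments": arguments})


if __name__ == "__main__":
    # Ask the server which tools it exposes.
    print(json.dumps(jsonrpc_request("tools/list"), indent=2))
    # Hypothetical tool name and argument, for illustration only.
    print(json.dumps(tool_call("search_harnesses", {"query": "vector database"}), indent=2))
```

In practice an MCP client library or the agent runtime handles this exchange; the sketch is only meant to show the shape of the messages.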

→ Best Data & search harnesses (top picks, ranked)

markitdown — Python tool that converts Office documents, PDFs, and other files to Markdown for LLM ingestion.
Firecrawl — Search, scrape, and crawl the web at scale and get clean, structured content. The data layer behind a lot of agents.
RAGFlow — Open-source RAG engine built on deep document understanding, with grounded citations and an end-to-end retrieval pipeline.
crawl4ai — Open-source LLM-friendly web crawler and scraper that outputs clean markdown and structured data for agents.
MinerU — Converts PDFs and documents into machine-readable Markdown and JSON, extracting formulas, tables, and reading order for RAG.
Scrapy — Fast asynchronous Python framework for large-scale web crawling and structured data extraction.
docling — Document parsing and ingestion toolkit that converts PDFs, Office files, and images into structured data for gen AI.
AnythingLLM — All-in-one app turning documents into a private RAG chatbot, bundling ingestion, vector storage, and retrieval; MCP-compatible.
mem0 — Universal memory layer that gives AI agents persistent long-term memory across sessions via SDK and MCP.
Meilisearch — Fast, typo-tolerant open-source search engine with built-in hybrid keyword and vector/semantic search.
llama_index — Data framework for building document agents and RAG pipelines that connect LLMs to private and external data sources.
milvus — High-performance, cloud-native vector database built for scalable approximate-nearest-neighbor (ANN) search over billions of vectors.
Faiss — Meta library for efficient similarity search and clustering of dense vectors; the de facto ANN index under many vector DBs.
Marker — Fast, high-accuracy converter of PDF, EPUB, and docs to Markdown and JSON with table, equation, and layout handling.
graphrag — Modular graph-based retrieval-augmented generation system that builds knowledge graphs from documents for agent queries.
qdrant — High-performance, massive-scale vector database and vector search engine with REST and gRPC APIs.
searxng — Free, self-hostable metasearch engine that aggregates results from many services without tracking users.
chroma — Open-source embedding database and search infrastructure for building AI apps with retrieval and memory.
graphiti — Framework for building real-time temporal knowledge graphs as memory for AI agents, with an MCP server.
Scrapegraph-ai — AI-powered Python scraper that uses LLMs and graph pipelines to extract data from websites and documents.
Typesense — Open-source typo-tolerant search engine with native vector and hybrid semantic search, a lightweight retrieval backend.
haystack — Orchestration framework for production LLM applications with modular pipelines for retrieval, RAG, and agent workflows.
Crawlee — Web scraping and browser-automation library (Node and Python) built to extract data for AI, LLMs, and RAG.
letta — Platform (formerly MemGPT) for stateful agents with advanced memory that learn and self-improve over time.
pgvector — Postgres extension adding vector types and HNSW/IVFFlat similarity search, for vector retrieval inside a relational DB.
cognee — Open-source AI memory platform giving agents persistent long-term memory via a self-hosted knowledge-graph engine.
weaviate — Open-source vector database storing objects and vectors, combining vector search with structured filtering.
unstructured — Open-source ETL library that transforms complex documents into clean structured data for language models.
PentestGPT — Automated penetration-testing agentic framework powered by LLMs that guides and runs offensive-security workflows.
txtai — All-in-one embeddings framework for semantic search, RAG, and language-model workflows over your own data.
lancedb — Developer-friendly embedded retrieval library and vector database for multimodal AI search.
hexstrike-ai — MCP server that lets AI agents autonomously run 150+ cybersecurity tools for pentesting, vuln discovery, and bug-bounty automation.
MindSearch — An open multi-agent web-search framework, in the spirit of Perplexity Pro, that plans queries and synthesizes answers.
Integuru — AI agent that reverse-engineers a platform's internal APIs from browser traffic to build permissionless integrations.
exa-mcp-server — MCP server letting agents perform web search and crawling through the Exa neural search API.
AI-Infra-Guard — Full-stack AI red-teaming platform for agent scanning, MCP scanning, AI infrastructure scanning, and LLM jailbreak evaluation.
DBHub — Zero-dependency database MCP server for Postgres, MySQL, SQL Server, and more, for querying your data in natural language.
agent-scan — Security scanner for AI agents, MCP servers, and agent skills that detects vulnerabilities and misconfigurations.
modelcontextprotocol — Official Perplexity MCP server that gives AI assistants web-wide search and answers through the Perplexity API.
tavily-mcp — Production MCP server giving agents real-time web search, extract, map, and crawl via the Tavily API.
mcp-server-qdrant — Official Qdrant MCP server exposing vector storage and semantic search as a memory layer for agents.
brave-search-mcp-server — Official Brave Search MCP server providing web, image, video, news, and local search.
mcp-filesystem-server — Filesystem MCP server that gives an agent scoped read/write access to files and directories.
mcp-google-map — MCP server for Google Maps including geocoding, place search, directions, and distance calculations.

Browse all 370+ harnesses on Loadbay · this domain as JSON