Data & search harnesses for AI agents
44 open-source data and search harnesses an AI agent can use: MCP servers, SDKs, and adapters. Browse them on Loadbay. An agent can also search this catalog through Loadbay's MCP server:
claude mcp add --transport http loadbay https://loadbay.xyz/api/mcp
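For a client other than the one registered by the command above, the following is a minimal sketch of the JSON-RPC messages an MCP client would typically send over HTTP to discover a server's tools. It assumes the Loadbay endpoint follows the standard MCP lifecycle (`initialize`, then `notifications/initialized`, then `tools/list`). The client name, the version strings, and the helper functions are hypothetical, and the tools Loadbay actually exposes are not documented here. The sketch only builds the messages and does not contact the server.

```python
import json

# Endpoint taken from the command above. Assumption: it speaks standard
# MCP JSON-RPC 2.0 over HTTP; this sketch never actually connects to it.
LOADBAY_MCP_URL = "https://loadbay.xyz/api/mcp"

# Headers an MCP HTTP client typically sends with each POST.
HTTP_HEADERS = {
    "Content-Type": "application/json",
    "Accept": "application/json, text/event-stream",
}


def jsonrpc_request(request_id, method, params=None):
    """Build a JSON-RPC 2.0 request object (hypothetical helper)."""
    message = {"jsonrpc": "2.0", "id": request_id, "method": method}
    if params is not None:
        message["params"] = params
    return message


def build_handshake(client_name="example-agent", client_version="0.1.0"):
    """Return the messages for a basic MCP session: init, ack, list tools."""
    initialize = jsonrpc_request(1, "initialize", {
        "protocolVersion": "2025-03-26",  # assumed protocol revision
        "capabilities": {},
        "clientInfo": {"name": client_name, "version": client_version},
    })
    # Notifications carry no "id" because no response is expected.
    initialized = {"jsonrpc": "2.0", "method": "notifications/initialized"}
    list_tools = jsonrpc_request(2, "tools/list")
    return [initialize, initialized, list_tools]


if __name__ == "__main__":
    for message in build_handshake():
        print(json.dumps(message))
```

Each message would be sent as the body of an HTTP POST to the endpoint. The response to `tools/list` is where a client would learn which search tools the server offers.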
→ Best Data & search harnesses (top picks, ranked)
- markitdown — Python tool that converts Office documents, PDFs, and other files to Markdown for LLM ingestion.
- Firecrawl — Search, scrape, and crawl the web at scale and get clean, structured content. The data layer behind a lot of agents.
- RAGFlow — Open-source RAG engine built on deep document understanding, with grounded citations and an end-to-end retrieval pipeline.
- crawl4ai — Open-source LLM-friendly web crawler and scraper that outputs clean markdown and structured data for agents.
- MinerU — Converts PDFs and documents into machine-readable Markdown and JSON, extracting formulas, tables, and reading order for RAG.
- Scrapy — Fast asynchronous Python framework for large-scale web crawling and structured data extraction.
- docling — Document parsing and ingestion toolkit that converts PDFs, Office files, and images into structured data for gen AI.
- AnythingLLM — All-in-one app turning documents into a private RAG chatbot, bundling ingestion, vector storage, and retrieval; MCP-compatible.
- mem0 — Universal memory layer that gives AI agents persistent long-term memory across sessions via SDK and MCP.
- Meilisearch — Fast, typo-tolerant open-source search engine with built-in hybrid keyword and vector/semantic search.
- llama_index — Data framework for building document agents and RAG pipelines that connect LLMs to private and external data sources.
- milvus — High-performance, cloud-native vector database built for scalable approximate nearest neighbor (ANN) search over billions of vectors.
- Faiss — Library from Meta for efficient similarity search and clustering of dense vectors; the de facto ANN index under many vector DBs.
- Marker — Fast, high-accuracy converter of PDF, EPUB, and docs to Markdown and JSON with table, equation, and layout handling.
- graphrag — Modular graph-based retrieval-augmented generation system that builds knowledge graphs from documents for agent queries.
- qdrant — High-performance, massive-scale vector database and vector search engine with REST and gRPC APIs.
- searxng — Free, self-hostable metasearch engine that aggregates results from many services without tracking users.
- chroma — Open-source embedding database and search infrastructure for building AI apps with retrieval and memory.
- graphiti — Framework for building real-time temporal knowledge graphs as memory for AI agents, with an MCP server.
- Scrapegraph-ai — AI-powered Python scraper that uses LLMs and graph pipelines to extract data from websites and documents.
- Typesense — Open-source typo-tolerant search engine with native vector and hybrid semantic search, a lightweight retrieval backend.
- haystack — Orchestration framework for production LLM applications with modular pipelines for retrieval, RAG, and agent workflows.
- Crawlee — Web scraping and browser-automation library (Node and Python) built to extract data for AI, LLMs, and RAG.
- letta — Platform for stateful agents with advanced memory (formerly MemGPT) that learn and self-improve over time.
- pgvector — Postgres extension adding vector types and HNSW/IVFFlat similarity search, for vector retrieval inside a relational DB.
- cognee — Open-source AI memory platform giving agents persistent long-term memory via a self-hosted knowledge-graph engine.
- weaviate — Open-source vector database storing objects and vectors, combining vector search with structured filtering.
- unstructured — Open-source ETL library that transforms complex documents into clean structured data for language models.
- PentestGPT — Automated penetration-testing agentic framework powered by LLMs that guides and runs offensive-security workflows.
- txtai — All-in-one embeddings framework for semantic search, RAG, and language-model workflows over your own data.
- lancedb — Developer-friendly embedded retrieval library and vector database for multimodal AI search.
- hexstrike-ai — MCP server that lets AI agents autonomously run 150+ cybersecurity tools for pentesting, vuln discovery, and bug-bounty automation.
- MindSearch — An open multi-agent web-search framework, in the spirit of Perplexity Pro, that plans queries and synthesizes answers.
- Integuru — AI agent that reverse-engineers a platform's internal APIs from browser traffic to build permissionless integrations.
- exa-mcp-server — MCP server letting agents perform web search and crawling through the Exa neural search API.
- AI-Infra-Guard — Full-stack AI red-teaming platform for scanning agents, MCP servers, and AI infrastructure, and for evaluating LLM jailbreaks.
- DBHub — Zero-dependency database MCP server for Postgres, MySQL, SQL Server, and more that lets agents query data in natural language.
- agent-scan — Security scanner for AI agents, MCP servers, and agent skills that detects vulnerabilities and misconfigurations.
- modelcontextprotocol — Official Perplexity MCP server that gives AI assistants web-wide search and answers through the Perplexity API.
- tavily-mcp — Production MCP server giving agents real-time web search, extract, map, and crawl via the Tavily API.
- mcp-server-qdrant — Official Qdrant MCP server exposing vector storage and semantic search as a memory layer for agents.
- brave-search-mcp-server — Official Brave Search MCP server providing web, image, video, news, and local search.
- mcp-filesystem-server — Filesystem MCP server that gives an agent scoped read/write access to files and directories.
- mcp-google-map — MCP server for Google Maps, covering geocoding, place search, directions, and distance calculations.