Voice & media harnesses for AI agents
46 open-source Voice & media harnesses an AI agent can use — MCP servers, SDKs, and adapters. Browse them on Loadbay. An agent can search these over Loadbay's MCP:
claude mcp add --transport http loadbay https://loadbay.xyz/api/mcp
→ Best Voice & media harnesses (top picks, ranked)
- stable-diffusion-webui — The most widely used Stable Diffusion web UI with extensions and an API endpoint agents can call for text-to-image.
- ComfyUI — Modular node-graph diffusion GUI, API, and backend for image and video generation that agents can drive via workflows.
- Whisper — OpenAI's robust multilingual speech-to-text model and the de facto open standard for transcription and translation.
- screenshot-to-code — Converts a screenshot into clean HTML, Tailwind, React, or Vue code using vision models.
- whisper.cpp — High-performance C/C++ port of Whisper for fast local and on-device speech-to-text with no Python runtime.
- Fooocus — Streamlined Stable Diffusion image generator focused on prompting with minimal configuration.
- TTS — Deep-learning text-to-speech and voice-cloning toolkit with many pretrained multilingual models.
- ChatTTS — Generative speech model optimized for natural conversational dialogue in English and Chinese.
- bark — Text-prompted generative audio model that produces speech, music, sound effects, and nonverbal sounds from text.
- OpenVoice — Instant voice-cloning audio model that copies tone color and controls style across languages from one reference clip.
- Real-ESRGAN — Practical algorithms for general image and video super-resolution and restoration; the standard open upscaler.
- diffusers — Hugging Face library of state-of-the-art diffusion models and pipelines for image, video, and audio generation in PyTorch.
- Fish Speech — State-of-the-art open multilingual text-to-speech with voice cloning and low-latency synthesis.
- Open-Sora — Open-source text-to-video generation framework aiming to democratize Sora-style video with full training and inference code.
- FaceFusion — Open face-swapping and face-manipulation platform with CLI and headless modes for automated media pipelines.
- InvokeAI — Creative engine and WebUI for Stable Diffusion with a node workflow system and REST API for generating visual media.
- Stability generative-models — Stability AI's official repo for SDXL, SD 2.x, and Stable Video Diffusion, the canonical Stable Diffusion model family.
- FLUX.1 — Official inference for the FLUX.1 family, the leading open-weight text-to-image diffusion models.
- Chatterbox — Resemble AI's open state-of-the-art text-to-speech model with expressive zero-shot voice cloning and emotion control.
- audiocraft — Meta's library for audio generation featuring the MusicGen music model and the EnCodec audio tokenizer.
- blender-mcp — MCP server that connects Claude to Blender for prompt-driven 3D modeling, scene creation, and rendering.
- WhisperX — Whisper with word-level timestamps, voice activity detection, and speaker diarization for aligned, speaker-attributed transcripts.
- LivePortrait — Efficient portrait animation that drives a single source image with video, audio, or image-derived motion.
- Wan2.1 — Open large-scale video generative models from Alibaba supporting text-to-video and image-to-video synthesis.
- F5-TTS — Flow-matching text-to-speech model for fast, fluent zero-shot voice cloning from a short reference clip.
- CSM — Sesame's open conversational speech generation model for natural, context-aware spoken dialogue and voice agents.
- Hunyuan3D-2 — High-resolution image-to-3D and text-to-3D asset generation using large-scale Hunyuan3D diffusion models.
- SadTalker — Generates stylized talking-head videos from a single portrait image and an audio clip.
- draw-a-ui — Generates working HTML from a low-fidelity mockup drawn on a canvas, using a vision model.
- Wav2Lip — Lip-sync model that accurately matches a face video to any target speech audio in the wild.
- pipecat — Open-source framework for building realtime voice and multimodal conversational AI agents with pluggable STT, LLM, and TTS.
- CogVideo — Text-to-video and image-to-video generation models including CogVideoX with open weights and inference code.
- HunyuanVideo — Tencent's framework and open weights for large-scale text-to-video generation.
- AnimateDiff — Official implementation that animates personalized text-to-image diffusion models into short videos.
- agents — Framework for building realtime voice and video AI agents that join LiveKit rooms with streaming STT, LLM, and TTS pipelines.
- kokoro — Lightweight 82M-parameter text-to-speech model delivering high-quality multilingual voices with low compute.
- riffusion-hobby — Real-time music generation using Stable Diffusion applied to audio spectrogram images.
- mochi — Open text-to-video generation model from Genmo with code and weights for high-fidelity motion synthesis.
- whisper_streaming — Realtime streaming wrapper around Whisper for long-form, low-latency speech-to-text transcription and translation.
- Generative-Media-Skills — Multimodal generative-media skills (image, video, audio) you mount into a coding or agent runtime.
- elevenlabs-python — Official Python SDK for the ElevenLabs API, giving agents text-to-speech, voice cloning, and audio generation.
- MiniMax-MCP — MiniMax's official MCP server, giving an agent text-to-speech, voice cloning, and image and video generation.
- ElevenLabs MCP — Official ElevenLabs MCP server exposing text-to-speech, voice cloning, speech-to-text, and audio tools to MCP clients.
- Bolna — Open framework for building conversational voice AI agents, including telephony.
- spotify-mcp — MCP server that connects an LLM to Spotify to control playback, search the catalog, and manage queues and playlists.
- mcp-server-youtube-transcript — MCP server that fetches YouTube video transcripts so an AI assistant can read and summarize video content.
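
For agents that do not go through the `claude mcp add` command near the top of this page, the sketch below shows roughly what an MCP client would send to discover the tools behind an endpoint. It only builds the JSON-RPC 2.0 messages defined by the MCP specification (`initialize`, `notifications/initialized`, `tools/list`) and does not contact the network. The helper names, the client name and version, and the chosen protocol revision are assumptions for illustration; the tools Loadbay actually exposes are not documented here, so none are named.

```python
import json

# Endpoint taken from the command near the top of this page.
LOADBAY_MCP_URL = "https://loadbay.xyz/api/mcp"


def jsonrpc_request(method, params=None, request_id=1):
    """Build a JSON-RPC 2.0 request body (hypothetical helper)."""
    message = {"jsonrpc": "2.0", "id": request_id, "method": method}
    if params is not None:
        message["params"] = params
    return message


def build_handshake(client_name="example-agent", client_version="0.1.0"):
    """Return the messages a client sends before it can list tools.

    The client name and version are placeholders, and the protocol
    revision is an assumption; a real client should negotiate it.
    """
    initialize = jsonrpc_request(
        "initialize",
        {
            "protocolVersion": "2025-03-26",
            "capabilities": {},
            "clientInfo": {"name": client_name, "version": client_version},
        },
        request_id=1,
    )
    # Notifications carry no "id" because no response is expected.
    initialized = {"jsonrpc": "2.0", "method": "notifications/initialized"}
    list_tools = jsonrpc_request("tools/list", request_id=2)
    return [initialize, initialized, list_tools]


if __name__ == "__main__":
    # Print the bodies that would be POSTed, in order, to LOADBAY_MCP_URL.
    for message in build_handshake():
        print(json.dumps(message))
```

In practice an MCP client library would handle this handshake, including the HTTP headers and session handling of the transport; the sketch is only meant to show the shape of the exchange.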