
Voice & media harnesses for AI agents

46 open-source Voice & media harnesses that an AI agent can use: MCP servers, SDKs, and adapters. Browse them on Loadbay. An agent can also search them over Loadbay's MCP server:

claude mcp add --transport http loadbay https://loadbay.xyz/api/mcp
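
For agents that do not use the Claude CLI, the messages an MCP client sends over HTTP can be sketched by hand. The following is a minimal sketch, assuming the Loadbay endpoint follows the standard MCP lifecycle (`initialize`, then `notifications/initialized`, then `tools/list`) and accepts protocol version `2025-03-26`. Only the URL comes from the command above; the client name, the helper functions, and the protocol version are illustrative assumptions, and the tools the server actually exposes are not documented here. The sketch only builds the messages and does not send them.

```python
import json
from itertools import count

# URL taken from the command above; everything else is an illustrative assumption.
LOADBAY_MCP_URL = "https://loadbay.xyz/api/mcp"

_ids = count(1)  # JSON-RPC request ids must be unique within a session


def jsonrpc_request(method, params=None):
    """Build a JSON-RPC 2.0 request, the message format MCP uses."""
    msg = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        msg["params"] = params
    return msg


def initialize_request(client_name="example-agent", client_version="0.1.0"):
    """First message of an MCP session (client name/version are hypothetical)."""
    return jsonrpc_request("initialize", {
        "protocolVersion": "2025-03-26",  # assumed; the server may negotiate another
        "capabilities": {},
        "clientInfo": {"name": client_name, "version": client_version},
    })


def initialized_notification():
    """Notification sent after the server answers initialize (no id, no reply)."""
    return {"jsonrpc": "2.0", "method": "notifications/initialized"}


def list_tools_request():
    """Ask the server which tools it exposes."""
    return jsonrpc_request("tools/list")


def http_headers():
    """Headers an MCP HTTP client sends with each POST."""
    return {
        "Content-Type": "application/json",
        "Accept": "application/json, text/event-stream",
    }


if __name__ == "__main__":
    for msg in (initialize_request(), initialized_notification(), list_tools_request()):
        print(json.dumps(msg))
```

Each message would be sent as the body of a POST to `LOADBAY_MCP_URL` with the headers shown. The reply to `tools/list` names the search tools an agent can then invoke with `tools/call`.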

→ Best Voice & media harnesses (top picks, ranked)

stable-diffusion-webui — The most widely used Stable Diffusion web UI with extensions and an API endpoint agents can call for text-to-image.
ComfyUI — Modular node-graph diffusion GUI, API, and backend for image and video generation that agents can drive via workflows.
Whisper — OpenAI's robust multilingual speech-to-text model and the de facto open standard for transcription and translation.
screenshot-to-code — Converts a screenshot into clean HTML, Tailwind, React, or Vue code using vision models.
whisper.cpp — High-performance C/C++ port of Whisper for fast local and on-device speech-to-text with no Python runtime.
Fooocus — Streamlined Stable Diffusion image generator focused on prompting with minimal configuration.
TTS — Deep-learning text-to-speech and voice-cloning toolkit with many pretrained multilingual models.
ChatTTS — Generative speech model optimized for natural conversational dialogue in English and Chinese.
bark — Text-prompted generative audio model that produces speech, music, sound effects, and nonverbal sounds from text.
OpenVoice — Instant voice-cloning audio model that copies tone color and controls style across languages from one reference clip.
Real-ESRGAN — Practical algorithms for general image and video super-resolution and restoration; the standard open upscaler.
diffusers — Hugging Face library of state-of-the-art diffusion models and pipelines for image, video, and audio generation in PyTorch.
Fish Speech — State-of-the-art open multilingual text-to-speech with voice cloning and low-latency synthesis.
Open-Sora — Open-source text-to-video generation framework aiming to democratize Sora-style video with full training and inference code.
FaceFusion — Open face-swapping and face-manipulation platform with CLI and headless modes for automated media pipelines.
InvokeAI — Creative engine and WebUI for Stable Diffusion with a node workflow system and REST API for generating visual media.
Stability generative-models — Stability AI's official repo for SDXL, SD 2.x, and Stable Video Diffusion, the canonical Stable Diffusion model family.
FLUX.1 — Official inference for the FLUX.1 family, the leading open-weight text-to-image diffusion models.
Chatterbox — Resemble AI's open state-of-the-art text-to-speech model with expressive zero-shot voice cloning and emotion control.
audiocraft — Meta's library for audio generation featuring the MusicGen music model and the EnCodec audio tokenizer.
blender-mcp — MCP server that connects Claude to Blender for prompt-driven 3D modeling, scene creation, and rendering.
WhisperX — Whisper with word-level timestamps, voice activity detection, and speaker diarization for aligned, speaker-attributed transcripts.
LivePortrait — Efficient portrait animation that drives a single source image with video, audio, or image-derived motion.
Wan2.1 — Open large-scale video generative models from Alibaba supporting text-to-video and image-to-video synthesis.
F5-TTS — Flow-matching text-to-speech model for fast, fluent zero-shot voice cloning from a short reference clip.
CSM — Sesame's open conversational speech generation model for natural, context-aware spoken dialogue and voice agents.
Hunyuan3D-2 — High-resolution image-to-3D and text-to-3D asset generation using large-scale Hunyuan3D diffusion models.
SadTalker — Generates stylized talking-head videos from a single portrait image and an audio clip.
draw-a-ui — Draw a low-fidelity mockup on a canvas and generate working HTML from it with a vision model.
Wav2Lip — Lip-sync model that accurately matches a face video to any target speech audio in the wild.
pipecat — Open-source framework for building realtime voice and multimodal conversational AI agents with pluggable STT, LLM, and TTS.
CogVideo — Text-to-video and image-to-video generation models including CogVideoX with open weights and inference code.
HunyuanVideo — Tencent's framework and open weights for large-scale text-to-video generation.
AnimateDiff — Official implementation that animates personalized text-to-image diffusion models into short videos.
agents — Framework for building realtime voice and video AI agents that join LiveKit rooms with streaming STT, LLM, and TTS pipelines.
kokoro — Lightweight 82M-parameter text-to-speech model delivering high-quality multilingual voices with low compute.
riffusion-hobby — Real-time music generation using Stable Diffusion applied to audio spectrogram images.
mochi — Open text-to-video generation model from Genmo with code and weights for high-fidelity motion synthesis.
whisper_streaming — Realtime streaming wrapper around Whisper for long-form, low-latency speech-to-text transcription and translation.
Generative-Media-Skills — Multimodal generative-media skills (image, video, audio) you mount into a coding or agent runtime.
elevenlabs-python — Official Python SDK for the ElevenLabs API, giving agents text-to-speech, voice cloning, and audio generation.
MiniMax-MCP — MiniMax's official MCP server that gives an agent text-to-speech, voice cloning, and image and video generation.
ElevenLabs MCP — Official ElevenLabs MCP server exposing text-to-speech, voice cloning, speech-to-text, and audio tools to MCP clients.
Bolna — An open framework for building conversational voice AI agents, including telephony.
spotify-mcp — MCP server that connects an LLM to Spotify to control playback, search the catalog, and manage queues and playlists.
mcp-server-youtube-transcript — MCP server that fetches YouTube video transcripts so an AI assistant can read and summarize video content.
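
As one illustration of how an agent might call a harness from this list, here is a minimal sketch for the stable-diffusion-webui HTTP API. It assumes the web UI was started with its `--api` flag and listens at the default local address `http://127.0.0.1:7860`, where a POST to `/sdapi/v1/txt2img` returns JSON whose `images` field holds base64-encoded images. The helper names and default parameter values are hypothetical. The sketch builds the request body and decodes a response without making a network call.

```python
import base64

# Assumed default local address of a web UI started with --api.
WEBUI_TXT2IMG_URL = "http://127.0.0.1:7860/sdapi/v1/txt2img"


def txt2img_payload(prompt, steps=20, width=512, height=512):
    """Build the JSON body for a text-to-image request (defaults are illustrative)."""
    return {"prompt": prompt, "steps": steps, "width": width, "height": height}


def decode_images(response_json):
    """Decode the base64 strings in a txt2img response into raw image bytes."""
    images = []
    for b64 in response_json.get("images", []):
        # Tolerate an optional "data:image/png;base64," prefix before the payload.
        images.append(base64.b64decode(b64.split(",", 1)[-1]))
    return images


if __name__ == "__main__":
    print(txt2img_payload("a lighthouse at dusk"))
```

An agent would POST the payload as JSON to `WEBUI_TXT2IMG_URL`, pass the parsed response to `decode_images`, and write each result to a file.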

Browse all 370+ harnesses on Loadbay · this domain as JSON