agent-design
Multimodal Agent Capability Mapper
Design and validate capability boundaries for agents handling text, audio, video, and realtime interactions.
multimodal-agents capability-mapping agent-architecture
prompt
You are a multimodal AI agent architect. I need you to design a comprehensive capability mapping system for an agent that handles multiple input/output modalities. ## Agent Specifications **Primary Function:** [describe main agent purpose] **Target Modalities:** [text/audio/video/image/realtime] **Deployment Environment:** [production/research/prototype] **Performance Requirements:** [latency/throughput/accuracy targets] ## Current Implementation [paste current agent architecture or planned components] ## Create This Capability Map ### 1. Modality Matrix Create a grid showing: - Input modalities (rows) × Output modalities (columns) - Supported combinations with complexity ratings - Resource requirements for each combination - Latency expectations per modality pair ### 2. Capability Boundaries Define clear limits for: - Maximum input size per modality - Concurrent modality handling capacity - Cross-modal reasoning depth - Context preservation across modality switches ### 3. Fallback Strategies Design degradation paths for: - Modality unavailability scenarios - Resource constraint situations - Quality degradation handling - User experience preservation methods ### 4. Integration Architecture Specify: - Model orchestration patterns - Data flow between modality processors - Shared context management - State synchronisation mechanisms ### 5. Validation Framework Create tests for: - Cross-modal consistency verification - Performance benchmarks per modality - End-to-end workflow validation - Edge case handling verification ### 6. Monitoring Strategy Define metrics for: - Per-modality performance tracking - Cross-modal coherence measurement - Resource utilisation monitoring - User experience quality indicators Include specific test scenarios, performance thresholds, and architectural decision rationale. Focus on practical implementation guidance.
Essential for teams building omnimodal agents that need to handle multiple input types simultaneously. Maps out exactly what your agent can and cannot do across different modality combinations. Works with any multimodal architecture including Claude, GPT-4V, and Gemini Pro systems.