agent-design

Multimodal Agent Capability Mapper

Design and validate capability boundaries for agents handling text, audio, video, and realtime interactions.

multimodal-agents capability-mapping agent-architecture
prompt
You are a multimodal AI agent architect. I need you to design a comprehensive capability mapping system for an agent that handles multiple input/output modalities.

## Agent Specifications
**Primary Function:** [describe main agent purpose]
**Target Modalities:** [text/audio/video/image/realtime]
**Deployment Environment:** [production/research/prototype]
**Performance Requirements:** [latency/throughput/accuracy targets]

## Current Implementation
[paste current agent architecture or planned components]

## Create This Capability Map

### 1. Modality Matrix
Create a grid showing:
- Input modalities (rows) × Output modalities (columns)
- Supported combinations with complexity ratings
- Resource requirements for each combination
- Latency expectations per modality pair
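
For reference, one possible shape for this matrix is sketched below in Python. It is a minimal illustration, not a required design: the modality list, the 1-to-5 complexity scale, the latency budgets, and the names `PairSpec`, `build_matrix`, and `render` are all placeholder assumptions to be replaced with values that fit the agent described above.

```python
from dataclasses import dataclass
from itertools import product

# Placeholder modality set; substitute the agent's real target modalities.
MODALITIES = ["text", "audio", "image", "video"]

@dataclass(frozen=True)
class PairSpec:
    supported: bool
    complexity: int         # illustrative scale: 1 (simple) to 5 (hard)
    latency_budget_ms: int  # placeholder target, not a measured value

# Only supported input -> output pairs are listed (hypothetical values).
SUPPORTED = {
    ("text", "text"): PairSpec(True, 1, 500),
    ("text", "audio"): PairSpec(True, 2, 1200),
    ("audio", "text"): PairSpec(True, 2, 1500),
    ("image", "text"): PairSpec(True, 3, 2000),
}
UNSUPPORTED = PairSpec(False, 0, 0)

def build_matrix():
    """Return a dict covering every (input, output) pair, defaulting to unsupported."""
    return {pair: SUPPORTED.get(pair, UNSUPPORTED)
            for pair in product(MODALITIES, repeat=2)}

def render(matrix):
    """Render the matrix as text: rows are inputs, columns are outputs."""
    lines = ["in/out".ljust(8) + "".join(m.ljust(8) for m in MODALITIES)]
    for src in MODALITIES:
        cells = []
        for dst in MODALITIES:
            spec = matrix[(src, dst)]
            label = f"C{spec.complexity}" if spec.supported else "-"
            cells.append(label.ljust(8))
        lines.append(src.ljust(8) + "".join(cells))
    return "\n".join(lines)

if __name__ == "__main__":
    print(render(build_matrix()))
```

Listing only the supported pairs and defaulting the rest to unsupported keeps the gaps in coverage explicit, which is the point of the matrix.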

### 2. Capability Boundaries
Define clear limits for:
- Maximum input size per modality
- Concurrent modality handling capacity
- Cross-modal reasoning depth
- Context preservation across modality switches
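
As a hedged illustration of how such limits could be made checkable, the sketch below validates a request against per-modality boundaries. The byte and concurrency limits are invented placeholders, and `ModalityLimit` and `check_request` are hypothetical names rather than part of any existing library.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModalityLimit:
    max_input_bytes: int
    max_concurrent_streams: int

# Placeholder numbers for illustration only; derive real ones from testing.
LIMITS = {
    "text": ModalityLimit(200_000, 8),
    "audio": ModalityLimit(25_000_000, 2),
    "video": ModalityLimit(100_000_000, 1),
}

def check_request(modality, size_bytes, active_streams):
    """Return a list of boundary violations; an empty list means within limits."""
    limit = LIMITS.get(modality)
    if limit is None:
        return [f"unsupported modality: {modality}"]
    violations = []
    if size_bytes > limit.max_input_bytes:
        violations.append(
            f"{modality} input exceeds {limit.max_input_bytes} bytes")
    if active_streams >= limit.max_concurrent_streams:
        violations.append(
            f"{modality} concurrency limit of {limit.max_concurrent_streams} reached")
    return violations
```

Returning every violation, rather than stopping at the first, makes it easier to report the full reason a request was rejected.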

### 3. Fallback Strategies
Design degradation paths for:
- Modality unavailability scenarios
- Resource constraint situations
- Quality degradation handling
- User experience preservation methods
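
One simple way to express a degradation path is an ordered fallback chain per modality. The sketch below is a minimal example under that assumption. The chains themselves and the names `FALLBACK_CHAINS` and `choose_output` are hypothetical, and a real design would also account for resource and quality signals.

```python
# Hypothetical degradation order: richest output first, plainest last.
FALLBACK_CHAINS = {
    "video": ["video", "image", "text"],
    "audio": ["audio", "text"],
    "text": ["text"],
}

def choose_output(preferred, available):
    """Return the first modality in the preferred chain that is available, else None."""
    # A modality with no configured chain falls back only to itself.
    for candidate in FALLBACK_CHAINS.get(preferred, [preferred]):
        if candidate in available:
            return candidate
    return None
```

For example, `choose_output("video", {"text", "image"})` degrades to `"image"`, and a `None` result signals that the request cannot be served in any acceptable form.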

### 4. Integration Architecture
Specify:
- Model orchestration patterns
- Data flow between modality processors
- Shared context management
- State synchronisation mechanisms

### 5. Validation Framework
Create tests for:
- Cross-modal consistency verification
- Performance benchmarks per modality
- End-to-end workflow validation
- Edge case handling verification

### 6. Monitoring Strategy
Define metrics for:
- Per-modality performance tracking
- Cross-modal coherence measurement
- Resource utilisation monitoring
- User experience quality indicators
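
To make per-modality performance tracking concrete, the following sketch records latencies and flags modalities whose 95th-percentile latency exceeds a threshold. It assumes a nearest-rank percentile and in-memory storage. The class name `LatencyTracker` and the thresholds in the usage note are illustrative assumptions, and a production system would more likely rely on an existing metrics stack.

```python
class LatencyTracker:
    """Track latency samples per modality and report p95 threshold breaches."""

    def __init__(self, thresholds_ms):
        self.thresholds_ms = dict(thresholds_ms)  # modality -> p95 budget in ms
        self.samples = {}                         # modality -> list of latencies

    def record(self, modality, latency_ms):
        self.samples.setdefault(modality, []).append(latency_ms)

    def p95(self, modality):
        """Nearest-rank 95th percentile, or None if no samples exist."""
        values = sorted(self.samples.get(modality, []))
        if not values:
            return None
        # Integer form of ceil(0.95 * n), avoiding floating-point rounding.
        rank = (95 * len(values) + 99) // 100
        return values[rank - 1]

    def breaches(self):
        """Return {modality: p95} for every modality over its threshold."""
        result = {}
        for modality, threshold in self.thresholds_ms.items():
            observed = self.p95(modality)
            if observed is not None and observed > threshold:
                result[modality] = observed
        return result
```

A tracker created with `LatencyTracker({"text": 500, "audio": 1500})` would report a breach for `"text"` once its p95 rises above 500 ms, while modalities with no samples are skipped.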

Include specific test scenarios, performance thresholds, and architectural decision rationale. Focus on practical implementation guidance.

Intended for teams building omnimodal agents that must handle several input types at once. The resulting map documents what your agent can and cannot do across modality combinations. The prompt is model-agnostic and can be used with multimodal systems such as Claude, GPT-4V, and Gemini Pro.