Multimodal
AI-Lib supports multimodal inputs and outputs (text combined with images, audio, and video) through the same unified API. The V2 protocol provides comprehensive multimodal capabilities with format validation, provider-aware modality checking, and additive video-generation output contracts.
Supported Capabilities
| Capability | Direction | Providers |
|---|---|---|
| Vision (images) | Input | OpenAI, Anthropic, Gemini, Qwen, DeepSeek, Doubao |
| Image generation | Output | OpenAI (DALL-E), select providers |
| Audio input | Input | Gemini, Qwen (omni_mode), Doubao |
| Audio output | Output | OpenAI (TTS), Qwen (omni_mode), Doubao, select providers |
| Video input | Input | Gemini, Qwen |
| Video generation contract | Output schema | V2 multimodal.output.video |
| Omni mode | Input + Output | Qwen (simultaneous text + audio) |
Sending Images
Rust

```rust
use ai_lib_rust::{AiClient, Message, ContentBlock};

let client = AiClient::new("openai/gpt-4o").await?;

let message = Message::user_with_content(vec![
    ContentBlock::Text("What's in this image?".into()),
    ContentBlock::ImageUrl {
        url: "https://example.com/photo.jpg".into(),
    },
]);

let response = client.chat()
    .messages(vec![message])
    .execute()
    .await?;

println!("{}", response.content);
```

Python
```python
from ai_lib_python import AiClient, Message, ContentBlock

client = await AiClient.create("openai/gpt-4o")

message = Message.user_with_content([
    ContentBlock.text("What's in this image?"),
    ContentBlock.image_url("https://example.com/photo.jpg"),
])

response = await client.chat() \
    .messages([message]) \
    .execute()

print(response.content)
```

TypeScript
```typescript
import { AiClient, Message, ContentBlock } from '@hiddenpath/ai-lib-ts';

const client = await AiClient.new('openai/gpt-4o');

const message = Message.userWithContent([
  ContentBlock.text("What's in this image?"),
  ContentBlock.imageUrl('https://example.com/photo.jpg'),
]);

const response = await client.chat().messages([message]).execute();

console.log(response.content);
```

Go

```go
import "github.com/ailib-official/ai-lib-go/client"

message := client.NewUserMessageWithContent([]client.ContentBlock{
    client.NewTextContentBlock("What's in this image?"),
    client.NewImageUrlContentBlock("https://example.com/photo.jpg"),
})

response, _ := aiClient.Chat().
    Messages([]client.Message{message}).
    Execute(ctx)

fmt.Println(response.Content)
```

Base64 Images
For local images, use base64 encoding:
Rust

```rust
use base64::Engine; // brings the `encode` method into scope (base64 0.21+)

let image_data = std::fs::read("photo.jpg")?;
let base64 = base64::engine::general_purpose::STANDARD.encode(&image_data);

let message = Message::user_with_content(vec![
    ContentBlock::Text("Describe this".into()),
    ContentBlock::ImageBase64 {
        data: base64,
        media_type: "image/jpeg".into(),
    },
]);
```

Python
```python
import base64

with open("photo.jpg", "rb") as f:
    image_data = base64.b64encode(f.read()).decode()

message = Message.user_with_content([
    ContentBlock.text("Describe this"),
    ContentBlock.image_base64(image_data, "image/jpeg"),
])
```

TypeScript
```typescript
import { readFileSync } from 'fs';

const imageBuffer = readFileSync('photo.jpg');
const imageData = imageBuffer.toString('base64');

const message = Message.userWithContent([
  ContentBlock.text('Describe this'),
  ContentBlock.imageBase64(imageData, 'image/jpeg'),
]);
```

Go

```go
imageBytes, _ := os.ReadFile("photo.jpg")
base64Data := base64.StdEncoding.EncodeToString(imageBytes)

message := client.NewUserMessageWithContent([]client.ContentBlock{
    client.NewTextContentBlock("Describe this"),
    client.NewImageBase64ContentBlock(base64Data, "image/jpeg"),
})
```

V2 Multimodal Capabilities
The V2 protocol provides a `MultimodalCapabilities` module that validates content against provider declarations before sending requests.
Modality Detection
The runtime automatically detects modalities in your content blocks:
Rust

```rust
use ai_lib_rust::multimodal::{detect_modalities, Modality};

let modalities = detect_modalities(&content_blocks);
// Returns: {Text, Image} or {Text, Audio, Video} etc.
```

Python

```python
from ai_lib_python.multimodal import detect_modalities, Modality

modalities = detect_modalities(content_blocks)
# Returns: {Modality.TEXT, Modality.IMAGE}
```

TypeScript

```typescript
import { detectModalities, Modality } from '@hiddenpath/ai-lib-ts/multimodal';

const modalities = detectModalities(contentBlocks);
// Returns: Set { Modality.TEXT, Modality.IMAGE }
```

Format Validation
The runtime validates formats against what the provider supports:
Rust

```rust
use ai_lib_rust::multimodal::MultimodalCapabilities;

let caps = MultimodalCapabilities::from_config(&manifest.multimodal);
assert!(caps.validate_image_format("png"));
assert!(caps.validate_audio_format("wav"));
```

Python

```python
from ai_lib_python.multimodal import MultimodalCapabilities

caps = MultimodalCapabilities.from_config(manifest_multimodal)
assert caps.validate_image_format("png")
assert caps.validate_audio_format("wav")
```

TypeScript

```typescript
import { MultimodalCapabilities } from '@hiddenpath/ai-lib-ts/multimodal';

const caps = MultimodalCapabilities.fromConfig(manifestMultimodal);
console.assert(caps.validateImageFormat('png'));
console.assert(caps.validateAudioFormat('wav'));
```

Content Validation
Before sending a request, validate that the provider supports all modalities in the content:
Rust

```rust
use ai_lib_rust::multimodal::validate_content_modalities;

match validate_content_modalities(&blocks, &caps) {
    Ok(()) => { /* all modalities supported */ }
    Err(unsupported) => {
        eprintln!("Provider doesn't support: {:?}", unsupported);
    }
}
```

Python

```python
from ai_lib_python.multimodal import validate_content_modalities

# Validate content blocks against provider capabilities.
# The call shape mirrors the Rust/TypeScript examples; check the library
# docs for the exact error-reporting behavior.
validate_content_modalities(blocks, caps)
```

TypeScript

```typescript
import { validateContentModalities } from '@hiddenpath/ai-lib-ts/multimodal';

try {
  validateContentModalities(blocks, caps);
  // all modalities supported
} catch (unsupported) {
  console.error(`Provider doesn't support: ${unsupported}`);
}
```

How It Works
Section titled “How It Works”- The runtime constructs a multimodal message with mixed content blocks
- V2 validation:
MultimodalCapabilitieschecks that all content modalities are supported by the provider - The protocol manifest maps content blocks to the provider’s format
- Different providers use different structures:
- OpenAI:
contentarray withtype: "image_url"objects - Anthropic:
contentarray withtype: "image"objects - Gemini:
partsarray withinline_dataobjects (supports videoparts)
- OpenAI:
- The protocol handles all format differences automatically
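To make that mapping concrete, the sketch below hand-writes simplified request payloads for the same image message in each provider's style. It is illustrative only: the field names are approximations of the public OpenAI, Anthropic, and Gemini formats, not output captured from ai-lib or its manifests.

```rust
use serde_json::json;

// Illustrative only: simplified payloads for one image message per provider style.
// ai-lib derives the real payloads from the V2 manifest at runtime.
let openai_style = json!({
    "role": "user",
    "content": [
        { "type": "text", "text": "What's in this image?" },
        { "type": "image_url", "image_url": { "url": "https://example.com/photo.jpg" } }
    ]
});

let anthropic_style = json!({
    "role": "user",
    "content": [
        { "type": "text", "text": "What's in this image?" },
        { "type": "image", "source": { "type": "url", "url": "https://example.com/photo.jpg" } }
    ]
});

let gemini_style = json!({
    "role": "user",
    "parts": [
        { "text": "What's in this image?" },
        { "inline_data": { "mime_type": "image/jpeg", "data": "<base64 bytes>" } }
    ]
});
```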
Provider Multimodal Matrix
The V2 manifest declares each provider’s multimodal capabilities explicitly:
| Provider | Image In | Audio In | Video In | Image Out | Audio Out | Video Out | Omni |
|---|---|---|---|---|---|---|---|
| OpenAI | ✅ png, jpg, gif, webp | ✅ mp3, wav, flac | — | ✅ | ✅ | declared (currently false) | — |
| Anthropic | ✅ png, jpg, gif, webp | — | — | — | — | declared (currently false) | — |
| Gemini | ✅ png, jpg, gif, webp | ✅ wav, mp3, flac | ✅ mp4, avi | ✅ | — | declared (currently false) | — |
| Qwen | ✅ png, jpg | ✅ wav, mp3 | ✅ mp4, webm | ✅ | ✅ | declared (currently false) | ✅ |
| DeepSeek | ✅ png, jpg | — | — | — | — | declared (currently false) | — |
| Doubao | ✅ png, jpg | ✅ mp3, wav | — | ✅ | ✅ | declared (currently false) | — |
Check the `multimodal.input` and `multimodal.output` sections in the V2 provider manifest for the complete declaration.
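As a rough sketch of what such a declaration might contain, the snippet below mirrors the OpenAI row of the matrix. The field names under `input` and `output` are assumptions made for illustration; the shipped manifest is the authoritative reference.

```rust
use serde_json::json;

// Hypothetical shape of a provider's multimodal declaration (field names assumed).
let multimodal_decl = json!({
    "input": {
        "image": { "supported": true,  "formats": ["png", "jpg", "gif", "webp"] },
        "audio": { "supported": true,  "formats": ["mp3", "wav", "flac"] },
        "video": { "supported": false, "formats": [] }
    },
    "output": {
        "image": { "supported": true },
        "audio": { "supported": true },
        "video": { "supported": false } // declared, currently false
    }
});
```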
Async Video Generation Testing (Mock)
For integration and compliance pipelines, `ai-protocol-mock` provides an async video generation lifecycle:
- `POST /v1/video/generations` with `{"async": true}` returns `202` with a `job_id`
- `GET /v1/video/generations/{job_id}` transitions `queued -> running -> succeeded`
- Failure simulation headers are available:
  - `X-Mock-Status`
  - `X-Mock-Timeout-Ms`
  - `X-Mock-Invalid-Content-Type`
This enables deterministic testing for polling behavior, timeout handling, and terminal-state parsing before enabling real provider endpoints.
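A minimal polling sketch against the mock could look like the following. It assumes `reqwest` and `tokio`, and the request/response field names beyond `{"async": true}` and `job_id` (for example `prompt` and `status`) are inferred from the lifecycle above rather than taken from the mock's schema, so adjust them as needed.

```rust
use serde_json::{json, Value};
use std::time::Duration;

// Minimal sketch: submit an async video job to the mock server and poll it to a
// terminal state. Field names such as "status" and "prompt" are assumptions.
async fn poll_mock_video_job(base: &str) -> Result<Value, reqwest::Error> {
    let http = reqwest::Client::new();

    // POST /v1/video/generations with {"async": true} -> 202 + job_id
    let job: Value = http
        .post(format!("{base}/v1/video/generations"))
        .json(&json!({ "async": true, "prompt": "a short test clip" }))
        .send()
        .await?
        .json()
        .await?;
    let job_id = job["job_id"].as_str().unwrap_or_default().to_string();

    // GET /v1/video/generations/{job_id} until queued -> running -> succeeded
    loop {
        let status: Value = http
            .get(format!("{base}/v1/video/generations/{job_id}"))
            .send()
            .await?
            .json()
            .await?;
        match status["status"].as_str() {
            Some("queued") | Some("running") => {
                tokio::time::sleep(Duration::from_millis(250)).await;
            }
            _ => return Ok(status), // succeeded or a simulated failure
        }
    }
}
```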