
Multimodal

AI-Lib supports multimodal input, combining text and images, through the same unified API.

  • ai-protocol now includes the generative capability declarations for Qwen and Doubao in the V2 manifests.
  • The multimodal schema adds a multimodal.output.video field for a unified video-generation capability declaration (most providers currently report supported: false, but the contract is in place).
  • ai-protocol-mock adds Gemini generateContent / streamGenerateContent routes to simplify consistency verification across the three runtimes.
| Capability | Direction | Providers |
| --- | --- | --- |
| Vision (images) | Input | OpenAI, Anthropic, Gemini, Qwen, DeepSeek |
| Image generation | Output | OpenAI (DALL-E), some providers |
| Audio input | Input | Gemini, Qwen (omni_mode) |
| Audio output | Output | Qwen (omni_mode), some providers |
| Video input | Input | Gemini |
| Omni mode | Input + Output | Qwen (synchronized text + audio) |
```rust
use ai_lib_rust::{AiClient, Message, ContentBlock};

let client = AiClient::from_model("openai/gpt-4o").await?;

let message = Message::user_with_content(vec![
    ContentBlock::Text("What's in this image?".into()),
    ContentBlock::ImageUrl {
        url: "https://example.com/photo.jpg".into(),
    },
]);

let response = client.chat()
    .messages(vec![message])
    .execute()
    .await?;

println!("{}", response.content);
```

```python
from ai_lib_python import AiClient, Message, ContentBlock

client = await AiClient.create("openai/gpt-4o")

message = Message.user_with_content([
    ContentBlock.text("What's in this image?"),
    ContentBlock.image_url("https://example.com/photo.jpg"),
])

response = await client.chat() \
    .messages([message]) \
    .execute()

print(response.content)
```

```typescript
import { AiClient, Message, ContentBlock } from '@hiddenpath/ai-lib-ts';

const client = await AiClient.new('openai/gpt-4o');

const message = Message.userWithContent([
  ContentBlock.text("What's in this image?"),
  ContentBlock.imageUrl('https://example.com/photo.jpg'),
]);

const response = await client.chat().messages([message]).execute();
console.log(response.content);
```

```go
import "github.com/ailib-official/ai-lib-go/client"

message := client.NewUserMessageWithContent([]client.ContentBlock{
    client.NewTextContentBlock("What's in this image?"),
    client.NewImageUrlContentBlock("https://example.com/photo.jpg"),
})

response, _ := aiClient.Chat().
    Messages([]client.Message{message}).
    Execute(ctx)

fmt.Println(response.Content)
```

For local images, use base64 encoding:

```rust
use base64::Engine as _; // brings the `encode` method into scope

let image_data = std::fs::read("photo.jpg")?;
let base64 = base64::engine::general_purpose::STANDARD.encode(&image_data);

let message = Message::user_with_content(vec![
    ContentBlock::Text("Describe this".into()),
    ContentBlock::ImageBase64 {
        data: base64,
        media_type: "image/jpeg".into(),
    },
]);
```

```python
import base64

with open("photo.jpg", "rb") as f:
    image_data = base64.b64encode(f.read()).decode()

message = Message.user_with_content([
    ContentBlock.text("Describe this"),
    ContentBlock.image_base64(image_data, "image/jpeg"),
])
```

```typescript
import { readFileSync } from 'fs';

const imageBuffer = readFileSync('photo.jpg');
const imageData = imageBuffer.toString('base64');

const message = Message.userWithContent([
  ContentBlock.text('Describe this'),
  ContentBlock.imageBase64(imageData, 'image/jpeg'),
]);
```

```go
imageBytes, _ := os.ReadFile("photo.jpg")
base64Data := base64.StdEncoding.EncodeToString(imageBytes)

message := client.NewUserMessageWithContent([]client.ContentBlock{
    client.NewTextContentBlock("Describe this"),
    client.NewImageBase64ContentBlock(base64Data, "image/jpeg"),
})
```

The V2 protocol provides the MultimodalCapabilities module, which validates content against provider declarations before a request is sent.

The runtime automatically detects the modalities present in the content blocks:

```rust
use ai_lib_rust::multimodal::{detect_modalities, Modality};

let modalities = detect_modalities(&content_blocks);
// Returns: {Text, Image}, {Text, Audio, Video}, etc.
```

```python
from ai_lib_python.multimodal import detect_modalities, Modality

modalities = detect_modalities(content_blocks)
# Returns: {Modality.TEXT, Modality.IMAGE}
```

```typescript
import { detectModalities, Modality } from '@hiddenpath/ai-lib-ts/multimodal';

const modalities = detectModalities(contentBlocks);
// Returns: Set { Modality.TEXT, Modality.IMAGE }
```

The runtime validates formats against what the provider supports:

```rust
use ai_lib_rust::multimodal::MultimodalCapabilities;

let caps = MultimodalCapabilities::from_config(&manifest.multimodal);
assert!(caps.validate_image_format("png"));
assert!(caps.validate_audio_format("wav"));
```

```python
from ai_lib_python.multimodal import MultimodalCapabilities

caps = MultimodalCapabilities.from_config(manifest_multimodal)
assert caps.validate_image_format("png")
assert caps.validate_audio_format("wav")
```

```typescript
import { MultimodalCapabilities } from '@hiddenpath/ai-lib-ts/multimodal';

const caps = MultimodalCapabilities.fromConfig(manifestMultimodal);
console.assert(caps.validateImageFormat('png'));
console.assert(caps.validateAudioFormat('wav'));
```

Before sending a request, verify that the provider supports every modality in the content:

```rust
use ai_lib_rust::multimodal::validate_content_modalities;

match validate_content_modalities(&blocks, &caps) {
    Ok(()) => { /* all modalities supported */ }
    Err(unsupported) => {
        eprintln!("Provider does not support: {:?}", unsupported);
    }
}
```

```python
from ai_lib_python.multimodal import validate_content_modalities

# Validate content blocks against the provider's capabilities
validate_content_modalities(blocks, caps)
```

```typescript
import { validateContentModalities } from '@hiddenpath/ai-lib-ts/multimodal';

try {
  validateContentModalities(blocks, caps);
  // all modalities supported
} catch (unsupported) {
  console.error(`Provider does not support: ${unsupported}`);
}
```
  1. The runtime builds a multimodal message from mixed content blocks
  2. V2 validation: MultimodalCapabilities checks that the provider supports every modality in the content
  3. The protocol manifest maps content blocks to the provider's format
  4. Providers use different structures:
    • OpenAI: a content array with type: "image_url" objects
    • Anthropic: a content array with type: "image" objects
    • Gemini: a parts array with inline_data objects (video parts supported)
  5. The protocol handles all format differences automatically
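The provider-specific structures listed above can be illustrated as plain dictionary transforms. This is a standalone sketch: the neutral block shape and the `to_*` helper names are assumptions for illustration, not the library's internals.

```python
# Illustrative sketch of mapping one neutral content-block list to the three
# provider wire shapes described above (block shape is an assumption).
blocks = [
    {"kind": "text", "text": "What's in this image?"},
    {"kind": "image_url", "url": "https://example.com/photo.jpg"},
]

def to_openai(blocks):
    # OpenAI: content array with type: "image_url" objects
    out = []
    for b in blocks:
        if b["kind"] == "text":
            out.append({"type": "text", "text": b["text"]})
        else:
            out.append({"type": "image_url", "image_url": {"url": b["url"]}})
    return out

def to_anthropic(blocks):
    # Anthropic: content array with type: "image" objects
    out = []
    for b in blocks:
        if b["kind"] == "text":
            out.append({"type": "text", "text": b["text"]})
        else:
            out.append({"type": "image", "source": {"type": "url", "url": b["url"]}})
    return out

def to_gemini(blocks):
    # Gemini: parts array; non-text media becomes its own part
    parts = []
    for b in blocks:
        if b["kind"] == "text":
            parts.append({"text": b["text"]})
        else:
            parts.append({"file_data": {"file_uri": b["url"]}})
    return parts

print(to_openai(blocks)[1]["type"])     # image_url
print(to_anthropic(blocks)[1]["type"])  # image
```

The point is that callers only ever build the neutral block list; the per-provider divergence stays behind the protocol layer.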

The V2 manifests explicitly declare each provider's multimodal capabilities in configuration:

| Provider | Image In | Audio In | Video In | Image Out | Audio Out | Omni |
| --- | --- | --- | --- | --- | --- | --- |
| OpenAI | ✅ png, jpg, gif, webp | | | | | |
| Anthropic | ✅ png, jpg, gif, webp | | | | | |
| Gemini | ✅ png, jpg, gif, webp | ✅ wav, mp3, flac | ✅ mp4, avi | | | |
| Qwen | ✅ png, jpg | ✅ wav, mp3 | | | | |
| DeepSeek | ✅ png, jpg | | | | | |

See the multimodal.input and multimodal.output sections of the V2 provider manifests for the complete declarations.
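As a rough sketch of what such a declaration encodes, the Gemini row of the table above could be modeled as follows. The field layout here is an assumption for illustration, not the actual V2 schema, and the `validate_image_format` helper only mimics what MultimodalCapabilities does:

```python
# Hypothetical shape of one provider's multimodal declaration, mirroring the
# Gemini row above; the real V2 manifest schema may differ.
gemini_multimodal = {
    "input": {
        "image": {"supported": True, "formats": ["png", "jpg", "gif", "webp"]},
        "audio": {"supported": True, "formats": ["wav", "mp3", "flac"]},
        "video": {"supported": True, "formats": ["mp4", "avi"]},
    },
    "output": {
        # multimodal.output.video exists even when unsupported, so the
        # contract is ready before a provider ships the capability.
        "video": {"supported": False},
    },
}

def validate_image_format(decl, fmt):
    """Check an image format against the declaration (illustrative only)."""
    image = decl["input"].get("image", {"supported": False})
    return image["supported"] and fmt in image.get("formats", [])

print(validate_image_format(gemini_multimodal, "png"))   # True
print(validate_image_format(gemini_multimodal, "tiff"))  # False
```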

ai-protocol-mock now includes an asynchronous video-generation lifecycle for integration testing:

  • POST /v1/video/generations with {"async": true} returns 202 and a job_id
  • GET /v1/video/generations/{job_id} progresses queued -> running -> succeeded to a terminal state
  • Failure injection via request headers:
    • X-Mock-Status
    • X-Mock-Timeout-Ms
    • X-Mock-Invalid-Content-Type

This makes it possible to verify the polling flow, timeout handling, and terminal-state parsing before wiring up a real provider.
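A polling loop against this lifecycle could look like the sketch below. The HTTP layer is replaced by a stub that replays queued -> running -> succeeded, so only the polling, deadline, and terminal-state logic are the point; the response shape and `fetch_job_status` helper are assumptions, and a real client would GET /v1/video/generations/{job_id} instead.

```python
import itertools
import time

# Stub transport replaying the mock server's job states in order
# (assumption: a real call would hit GET /v1/video/generations/{job_id}).
_states = itertools.chain(["queued", "running"], itertools.repeat("succeeded"))

def fetch_job_status(job_id):
    return {"id": job_id, "status": next(_states)}

TERMINAL = {"succeeded", "failed", "cancelled"}

def poll_until_terminal(job_id, timeout_s=5.0, interval_s=0.01):
    """Poll the job until it reaches a terminal state or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        job = fetch_job_status(job_id)
        if job["status"] in TERMINAL:
            return job
        time.sleep(interval_s)
    raise TimeoutError(f"job {job_id} did not finish within {timeout_s}s")

job = poll_until_terminal("job-123")
print(job["status"])  # succeeded
```

Against the mock server, the same loop can then be exercised with X-Mock-Status or X-Mock-Timeout-Ms set on the request to drive the failure and timeout branches.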