# AI Media Processing CLI

`media processing` `image generation` `video generation` `audio transcription` `AI tools` `CLI`

# agent-media

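A media-processing command-line tool built for AI agents.

- **Image**: generate, edit, remove background, upscale (super-resolution), resize, convert format, extend canvas, crop
- **Video**: generate (text-to-video and image-to-video)
- **Audio**: extract from video, transcribe (with speaker identification)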

## Installation

### Global install

```bash
npm install -g agent-media@latest
```

### Install from source

```bash
git clone https://github.com/agntswrm/agent-media
cd agent-media
pnpm install && pnpm build && pnpm link --global
```

### Run with bunx / npx

Run it directly without installing:

```bash
bunx agent-media@latest --help
npx agent-media@latest --help
```

### Install skills for AI agents

Install the agent-media skills into your coding agent (Claude Code, Cursor, Codex, etc.):

```bash
npx skills add agntswrm/agent-media
```

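This adds media-processing skills that your AI agent can invoke automatically. The available skills are:
- `agent-media` - Overview of all features
- `image-generate` - Generate images from text
- `image-edit` - Edit images with a text prompt
- `image-resize` - Resize images
- `image-convert` - Convert image formats
- `image-extend` - Extend the image canvas with padding
- `image-remove-background` - Remove backgrounds
- `image-crop` - Crop images to specific dimensions
- `image-upscale` - Upscale images with AI super-resolution
- `audio-extract` - Extract audio from video
- `audio-transcribe` - Transcribe audio to text
- `video-generate` - Generate video from text or an image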

## Quick start

```bash
# Generate an image
agent-media image generate --prompt "a robot" --out rob.png

# Remove the background
agent-media image remove-background --in rob.png --out rob_nobg.png

# Edit the image
agent-media image edit --in rob_nobg.png --prompt "the robot is sitting on a bench next to a cat, in the background you can see the Eiffel Tower in Paris" --out rob_cat_paris.png

# Generate a video with audio (the cat meows and the robot speaks!)
agent-media video generate --in rob_cat_paris.png --prompt "the cat meows and the robot says: \"Yes, me too.\"" --audio --out rob_cat_video.mp4

# Extract the audio from the video
agent-media audio extract --in rob_cat_video.mp4 --out rob_cat_audio.mp3

# Transcribe the audio
agent-media audio transcribe --in rob_cat_audio.mp3
```

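## Requirements

- Node.js >= 18.0.0
- Cloud AI features require an API key from [fal.ai](https://fal.ai/dashboard/keys), [Replicate](https://replicate.com/account/api-tokens), [Runpod](https://www.runpod.io/console/user/settings), or [AI Gateway](https://vercel.com/ai-gateway)

**Local processing** (no API key needed): resize, convert, extend, crop, upscale, audio extract, remove background, transcribe

**Cloud processing** (API key required): image generate, image edit, upscale, video generate, remove background, transcribe

> **Note**: Local background removal, upscaling, and transcription may print a `mutex lock failed` error. You can ignore it: if the JSON output shows `"ok": true`, the output is fine.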
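Since stderr noise such as `mutex lock failed` does not indicate failure, a script can decide success from the JSON on stdout instead. The sketch below uses a hypothetical helper, `json_ok` (not part of agent-media), that does a plain `grep` match on the `"ok"` field, assuming the JSON layout shown under Output format; a real JSON parser is more robust.

```bash
# Hypothetical helper: succeed only if the JSON text passed in reports "ok": true.
# This is a simple pattern match, not a full JSON parser.
json_ok() {
  printf '%s\n' "$1" | grep -q '"ok"[[:space:]]*:[[:space:]]*true'
}

# Usage (stderr, including any "mutex lock failed" message, is discarded):
# result=$(agent-media image upscale --in sunset-mountains.jpg 2>/dev/null)
# json_ok "$result" && echo "upscale succeeded"
```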

---

## image

```bash
agent-media image resize --in <path> [options]
agent-media image convert --in <path> --format <format>
agent-media image extend --in <path> --padding <px> --color <hex>
agent-media image crop --in <path> --width <px> --height <px>
agent-media image generate --prompt <text>
agent-media image edit --in <path...> --prompt <text>
agent-media image remove-background --in <path>
agent-media image upscale --in <path>
```

### resize

*Local processing*

```bash
agent-media image resize --in sunset-mountains.jpg --width 800
agent-media image resize --in sunset-mountains.jpg --height 600
agent-media image resize --in https://ytrzap04kkm0giml.public.blob.vercel-storage.com/sunset-mountains.jpg --width 800
```

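| Option | Description |
|------|------|
| `--in <path>` | Input file path or URL (required) |
| `--width <px>` | Target width in pixels (at least one of `--width` or `--height` is required) |
| `--height <px>` | Target height in pixels (at least one of `--width` or `--height` is required) |
| `--out <path>` | Output path, filename, or directory (default: ./) |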
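Sharp, which powers the local commands, scales the missing dimension to preserve the aspect ratio when only a width or only a height is given. Assuming agent-media forwards a single dimension to Sharp unchanged, the output height of a `--width`-only resize can be estimated with the hypothetical helper below (`expected_height` is not part of agent-media, and Sharp's exact rounding may differ by a pixel).

```bash
# Hypothetical helper: estimate the output height of a --width-only resize,
# assuming the aspect ratio is preserved (round half up; Sharp may differ by 1 px).
expected_height() {
  local src_w=$1 src_h=$2 target_w=$3
  echo $(( (src_h * target_w + src_w / 2) / src_w ))
}

expected_height 1600 1200 800   # prints: 600
```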

### convert

*Local processing*

```bash
agent-media image convert --in sunset-mountains.png --format webp
agent-media image convert --in sunset-mountains.jpg --format png
agent-media image convert --in https://ytrzap04kkm0giml.public.blob.vercel-storage.com/sunset-mountains.png --format jpg --quality 90
```

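| Option | Description |
|------|------|
| `--in <path>` | Input file path or URL (required) |
| `--format <format>` | Output format: png, jpg, webp (required) |
| `--quality <n>` | Quality for lossy formats, 1-100 (default: 80) |
| `--out <path>` | Output path, filename, or directory (default: ./) |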
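Batch processing is still on the roadmap, so converting many files currently means one call per file. A minimal shell loop, assuming only the documented `image convert` flags and the exit-code convention (`0` on success, `1` on error); `convert_all` is a hypothetical helper name:

```bash
# Hypothetical helper: convert each input file with a separate agent-media call
# and print how many conversions failed. Returns non-zero if any failed.
convert_all() {
  local format=$1 failed=0 f
  shift
  for f in "$@"; do
    # The per-file JSON result is discarded here; only the exit code is checked.
    if ! agent-media image convert --in "$f" --format "$format" > /dev/null; then
      failed=$((failed + 1))
    fi
  done
  echo "$failed"
  [ "$failed" -eq 0 ]
}

# Usage: convert_all webp *.png
```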

### extend

*Local processing*

Pads the image on all sides with a solid color to enlarge the canvas.

```bash
agent-media image extend --in sunset-mountains.jpg --padding 50 --color "#E4ECF8"
agent-media image extend --in https://ytrzap04kkm0giml.public.blob.vercel-storage.com/sunset-mountains.png --padding 100 --color "#FFFFFF"
```

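| Option | Description |
|------|------|
| `--in <path>` | Input file path or URL (required) |
| `--padding <px>` | Pixels of padding to add on each side (required) |
| `--color <hex>` | Background color of the extended area (required); also used to flatten transparency |
| `--dpi <n>` | DPI/resolution of the output image (default: 300) |
| `--out <path>` | Output path, filename, or directory (default: ./) |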
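Because `--padding` is added to every side, each dimension grows by twice the padding. The hypothetical helper below (not part of agent-media) does that arithmetic, for example to check the resulting size before extending:

```bash
# Hypothetical helper: canvas size after `image extend` with the given --padding,
# which is added to all four sides.
extended_size() {
  local w=$1 h=$2 pad=$3
  echo "$((w + 2 * pad))x$((h + 2 * pad))"
}

extended_size 1280 720 50   # prints: 1380x820
```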

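### crop

*Local processing*

Crops the image to the given size around a focal point. The crop region is centered on the focal point while staying within the image bounds.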

```bash
agent-media image crop --in sunset-mountains.jpg --width 800 --height 600
agent-media image crop --in sunset-mountains.jpg --width 800 --height 600 --focus-x 20 --focus-y 30
agent-media image crop --in https://ytrzap04kkm0giml.public.blob.vercel-storage.com/sunset-mountains.jpg --width 400 --height 400
```

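| Option | Description |
|------|------|
| `--in <path>` | Input file path or URL (required) |
| `--width <px>` | Width of the crop area in pixels (required) |
| `--height <px>` | Height of the crop area in pixels (required) |
| `--focus-x <n>` | Focal point X position, 0-100 (default: 50 = center) |
| `--focus-y <n>` | Focal point Y position, 0-100 (default: 50 = center) |
| `--dpi <n>` | DPI/resolution of the output image (default: 300) |
| `--out <path>` | Output path, filename, or directory (default: ./) |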
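The crop placement described above can be sketched as: convert the 0-100 focal point to pixels, center the crop window on it, then clamp the window to the image. This is one plausible reading of the description, not agent-media's actual code; `crop_origin` is hypothetical, integer rounding is an assumption, and the crop is assumed to be no larger than the image.

```bash
# Hypothetical helper: print the top-left corner ("left top") of a crop window
# centered on a focal point and clamped to the image bounds.
# Args: image_w image_h crop_w crop_h focus_x focus_y (focus values are 0-100)
crop_origin() {
  local img_w=$1 img_h=$2 crop_w=$3 crop_h=$4 fx=$5 fy=$6
  # Focal point in pixels, shifted so the window is centered on it.
  local left=$((img_w * fx / 100 - crop_w / 2))
  local top=$((img_h * fy / 100 - crop_h / 2))
  # Clamp so the window stays inside the image (assumes crop <= image size).
  if [ "$left" -gt $((img_w - crop_w)) ]; then left=$((img_w - crop_w)); fi
  if [ "$top" -gt $((img_h - crop_h)) ]; then top=$((img_h - crop_h)); fi
  if [ "$left" -lt 0 ]; then left=0; fi
  if [ "$top" -lt 0 ]; then top=0; fi
  echo "$left $top"
}

crop_origin 1920 1080 800 600 50 50   # prints: 560 240 (centered)
crop_origin 1920 1080 800 600 20 30   # prints: 0 24 (clamped at the left edge)
```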

### generate

*Requires an API key*

```bash
agent-media image generate --prompt "a cat wearing a hat"
agent-media image generate --prompt "sunset over mountains" --width 1024 --height 768
```

| Option | Description |
|------|------|
| `--prompt <text>` | Text description (required) |
| `--width <px>` | Width (default: 1280) |
| `--height <px>` | Height (default: 720) |
| `--out <path>` | Output path, filename, or directory (default: ./) |
| `--provider <name>` | Provider (fal, replicate, runpod, ai-gateway) |
| `--model <name>` | Model override (e.g. `fal-ai/flux-2`, `bfl/flux-2-pro`) |

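### edit

*Requires an API key*

Edits one or more images using a text prompt (image-to-image). Multiple input images can be supplied to combine styles, subjects, or scenes.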

```bash
agent-media image edit --in sunset-mountains.jpg --prompt "make the sky more vibrant"
agent-media image edit --in https://ytrzap04kkm0giml.public.blob.vercel-storage.com/man-portrait.png --prompt "add sunglasses"
agent-media image edit --in style.jpg content.jpg --prompt "apply the style of the first image to the second"
```

| Option | Description |
|------|------|
| `--in <path...>` | One or more input file paths or URLs (required) |
| `--prompt <text>` | Description of the desired edit (required) |
| `--out <path>` | Output path, filename, or directory (default: ./) |
| `--provider <name>` | Provider (fal, replicate, runpod, ai-gateway) |
| `--model <name>` | Model override (e.g. `fal-ai/flux-2/edit`, `google/gemini-3-pro-image`) |

### remove-background

*Local or cloud processing*

```bash
agent-media image remove-background --in man-portrait.png
agent-media image remove-background --in https://ytrzap04kkm0giml.public.blob.vercel-storage.com/man-portrait.png
```

| Option | Description |
|------|------|
| `--in <path>` | Input file path or URL (required) |
| `--out <path>` | Output path, filename, or directory (default: ./) |
| `--provider <name>` | Provider (local, fal, replicate) |

### upscale

*Local or cloud processing*

Upscales an image with AI super-resolution, generating new detail as the resolution increases.

```bash
agent-media image upscale --in sunset-mountains.jpg
agent-media image upscale --in sunset-mountains.jpg --scale 4 --provider fal
agent-media image upscale --in https://ytrzap04kkm0giml.public.blob.vercel-storage.com/sunset-mountains.jpg --provider replicate
```

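| Option | Description |
|------|------|
| `--in <path>` | Input file path or URL (required) |
| `--scale <n>` | Scale factor: 2 or 4 (default: 2). Local mode always outputs 4x. |
| `--out <path>` | Output path, filename, or directory (default: ./) |
| `--provider <name>` | Provider (local, fal, replicate) |
| `--model <name>` | Model override |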
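Assuming the output is exactly the input size times the scale factor, the expected dimensions follow from the `--scale` rule above, with local mode forcing 4x. `upscaled_size` is a hypothetical helper for illustration only:

```bash
# Hypothetical helper: expected output size of `image upscale`, assuming the
# result is exactly the input size times the scale factor.
# Args: width height provider [scale]  (scale defaults to 2; local always uses 4)
upscaled_size() {
  local w=$1 h=$2 provider=$3 scale=${4:-2}
  if [ "$provider" = "local" ]; then scale=4; fi
  echo "$((w * scale))x$((h * scale))"
}

upscaled_size 640 480 fal       # prints: 1280x960
upscaled_size 640 480 fal 4     # prints: 2560x1920
upscaled_size 640 480 local 2   # prints: 2560x1920 (local ignores --scale)
```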

---

## video

```bash
# Text-to-video
agent-media video generate --prompt <text>

# Image-to-video (animate an image)
agent-media video generate --in <image> --prompt <text>
```

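### generate

*Requires an API key*

Generates a video from a text prompt. Optionally, pass an input image to animate it (image-to-video). The prompt describes what should happen in the video.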

```bash
# Text-to-video
agent-media video generate --prompt "a cat walking through a garden"

# Image-to-video (animate an image)
agent-media video generate --in woman-portrait.png --prompt "person smiles and waves hello"

# With audio/speech (runpod)
agent-media video generate --in woman-portrait.png --prompt "The woman says: \"Hello, welcome to our channel!\"" --audio --provider runpod

# With ambient sound effects (fal)
agent-media video generate --prompt "fireworks in the night sky" --audio --duration 10 --provider fal

# Higher resolution
agent-media video generate --prompt "ocean waves" --resolution 1080p
```

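| Option | Description |
|------|------|
| `--prompt <text>` | Description of the video (required) |
| `--in <path>` | Input image for image-to-video (optional) |
| `--duration <seconds>` | Duration (default: 5 s for runpod, 6 s for other providers) |
| `--resolution <res>` | Resolution: 720p, 1080p (default: 720p) |
| `--fps <n>` | Frame rate: 25, 50 (default: 25) |
| `--audio` | Generate an audio track (with runpod, text in quotes is spoken) |
| `--out <path>` | Output path, filename, or directory (default: ./) |
| `--provider <name>` | Provider (fal, replicate, runpod) |
| `--model <name>` | Model override |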
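With runpod and `--audio`, quoted text in the prompt is spoken, so the prompt itself has to contain double quotes. Building it with `printf` and single-quoted arguments avoids the backslash escaping used in the examples above. `speech_prompt` is a hypothetical helper, and its `<speaker> says: "<line>"` phrasing simply mirrors the documented example:

```bash
# Hypothetical helper: build a prompt whose quoted part is meant to be spoken.
speech_prompt() {
  # $1 = who speaks, $2 = the line to speak; printf inserts the literal quotes.
  printf '%s says: "%s"' "$1" "$2"
}

prompt=$(speech_prompt 'The woman' 'Hello, welcome to our channel!')
echo "$prompt"   # prints: The woman says: "Hello, welcome to our channel!"
# agent-media video generate --in woman-portrait.png --prompt "$prompt" --audio --provider runpod
```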

---

## audio

```bash
# Extract audio from a video
agent-media audio extract --in <video>

# Transcribe audio to text
agent-media audio transcribe --in <audio>
```

### extract

*Local processing*

Extracts the audio track from a video file.

```bash
agent-media audio extract --in woman-greeting.mp4
agent-media audio extract --in woman-greeting.mp4 --format wav
agent-media audio extract --in https://ytrzap04kkm0giml.public.blob.vercel-storage.com/woman-greeting.mp4
```

| Option | Description |
|------|------|
| `--in <path>` | Input video file path or URL (required) |
| `--format <format>` | Output format: mp3, wav (default: mp3) |
| `--out <path>` | Output path, filename, or directory (default: ./) |

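### transcribe

*Local or cloud processing (speaker identification requires cloud)*

Transcribes audio to timestamped text. Speaker identification (diarization) requires a cloud provider.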

```bash
agent-media audio transcribe --in woman-greeting.mp3
agent-media audio transcribe --in woman-greeting.mp3 --diarize --speakers 2
agent-media audio transcribe --in https://ytrzap04kkm0giml.public.blob.vercel-storage.com/woman-greeting.mp3
```

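| Option | Description |
|------|------|
| `--in <path>` | Input audio file path or URL (required) |
| `--diarize` | Enable speaker identification (cloud only) |
| `--language <code>` | Language code (auto-detected if omitted) |
| `--speakers <n>` | Hint for the number of speakers |
| `--out <path>` | Output path, filename, or directory (default: ./) |
| `--provider <name>` | Provider (local, fal, replicate) |
| `--model <name>` | Model override |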

---

## Output format

All commands print JSON to stdout:

```json
{
  "ok": true,
  "media_type": "image",
  "action": "resize",
  "provider": "local",
  "output_path": "resized_123_abc.png",
  "mime": "image/png",
  "bytes": 45678
}
```

On error:

```json
{
  "ok": false,
  "error": {
    "code": "INVALID_INPUT",
    "message": "At least one of width or height must be specified"
  }
}
```

The exit code is `0` on success and `1` on error.
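Because every command reports its result as JSON on stdout and signals failure through the exit code, a script can chain steps by reading `output_path` from one result and passing it to the next command. A minimal sketch, assuming the JSON layout shown above; `media_run` is a hypothetical helper, and the `sed` extraction is a stand-in for a proper JSON parser such as `jq`:

```bash
# Hypothetical helper: run agent-media with the given arguments, stop on a
# non-zero exit code, and print the "output_path" value from the JSON result.
media_run() {
  local json
  if ! json=$(agent-media "$@"); then
    printf '%s\n' "$json" >&2   # forward the error JSON to stderr
    return 1
  fi
  # Naive extraction: works for the layouts shown above, but is not a JSON parser.
  printf '%s\n' "$json" |
    sed -n 's/.*"output_path"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Usage: chain two steps through the generated file path.
# img=$(media_run image generate --prompt "a robot") &&
#   media_run image remove-background --in "$img"
```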

## Providers

### Default models

| Provider | resize | convert | extend | crop | image generate | image edit | remove background | upscale | video generate | transcribe |
|--------|--------|---------|--------|------|----------|----------|----------|------|----------|------|
| **local** | ✓* | ✓* | ✓* | ✓* | - | - | `Xenova/modnet`** | `Xenova/swin2SR`** | - | `moonshine-base`** |
| **fal** | - | - | - | - | `fal-ai/flux-2` | `fal-ai/flux-2/edit` | `fal-ai/birefnet/v2` | `fal-ai/esrgan` | `fal-ai/ltx-2` | `fal-ai/wizper` |
| **replicate** | - | - | - | - | `black-forest-labs/flux-2-dev` | `black-forest-labs/flux-kontext-dev` | `men1scus/birefnet` | `nightmareai/real-esrgan` | `lightricks/ltx-video` | `whisper-diarization` |
| **runpod** | - | - | - | - | `alibaba/wan-2.6` | `google/nano-banana-pro-edit` | - | - | `wan-2.6` | - |
| **ai-gateway** | - | - | - | - | `bfl/flux-2-pro` | `google/gemini-3-pro-image` | - | - | - | - |

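\* Powered by [Sharp](https://sharp.pixelplumbing.com/) for fast image processing

\** Powered by [Transformers.js](https://huggingface.co/docs/transformers.js) for local ML inference (models are downloaded automatically on first use)

Use `--model <name>` to override the default model for any command.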

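### Provider selection

1. **Explicit flag** (highest priority): `--provider fal`
2. **Auto-detection from environment variables**: setting `FAL_API_KEY`, `REPLICATE_API_TOKEN`, `RUNPOD_API_KEY`, or `AI_GATEWAY_API_KEY` automatically selects the corresponding provider
3. **Fallback to local**: resize/convert use local processing when no provider is specified
4. **First supporting provider**: used for generate/remove-background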
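The precedence above can be sketched as a shell function. This is a hypothetical model, not agent-media's source: the `supports` table is transcribed from the Default models table (with `generate`/`edit` meaning image actions, `video` meaning video generation, and `extract` the local-only audio extraction), and both the order in which API keys are checked and the skipping of providers that cannot run the requested action are assumptions.

```bash
# Hypothetical model of the documented provider precedence (not agent-media's code).
# supports <provider> <action>: exit status 0 if the provider can run the action.
supports() {
  case "$1:$2" in
    local:resize|local:convert|local:extend|local:crop|local:extract) return 0 ;;
    local:remove-background|local:upscale|local:transcribe) return 0 ;;
    fal:generate|fal:edit|fal:remove-background|fal:upscale|fal:video|fal:transcribe) return 0 ;;
    replicate:generate|replicate:edit|replicate:remove-background) return 0 ;;
    replicate:upscale|replicate:video|replicate:transcribe) return 0 ;;
    runpod:generate|runpod:edit|runpod:video) return 0 ;;
    ai-gateway:generate|ai-gateway:edit) return 0 ;;
    *) return 1 ;;
  esac
}

# pick_provider <action> [explicit-provider]: print the provider that would be used.
pick_provider() {
  local action=$1 p
  # 1. An explicit --provider flag always wins.
  if [ -n "${2:-}" ]; then echo "$2"; return 0; fi
  # 2. Auto-detect from API keys (check order is an assumption), skipping
  #    providers that cannot run the action.
  if [ -n "${FAL_API_KEY:-}" ] && supports fal "$action"; then echo fal; return 0; fi
  if [ -n "${REPLICATE_API_TOKEN:-}" ] && supports replicate "$action"; then echo replicate; return 0; fi
  if [ -n "${RUNPOD_API_KEY:-}" ] && supports runpod "$action"; then echo runpod; return 0; fi
  if [ -n "${AI_GATEWAY_API_KEY:-}" ] && supports ai-gateway "$action"; then echo ai-gateway; return 0; fi
  # 3./4. Otherwise use the first provider that supports the action; local is
  #       listed first, so resize/convert fall back to local processing.
  for p in local fal replicate runpod ai-gateway; do
    if supports "$p" "$action"; then echo "$p"; return 0; fi
  done
  return 1
}
```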

## Environment variables

| Variable | Description | Get a key |
|------|------|----------|
| `FAL_API_KEY` | fal.ai API key | [fal.ai](https://fal.ai/dashboard/keys) |
| `REPLICATE_API_TOKEN` | Replicate API token | [replicate.com](https://replicate.com/account/api-tokens) |
| `RUNPOD_API_KEY` | Runpod API key | [runpod.io](https://www.runpod.io/console/user/settings) |
| `AI_GATEWAY_API_KEY` | AI Gateway API key | [vercel.com](https://vercel.com/ai-gateway) |
| `AGENT_MEDIA_DIR` | Output directory (default: current directory) | - |
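A typical setup exports a key for one provider and, optionally, an output directory, for example in a shell profile. The values below are placeholders, not real keys or required paths:

```bash
# Placeholder values: replace with your own key and preferred output directory.
export FAL_API_KEY="your-fal-api-key"
export AGENT_MEDIA_DIR="$HOME/agent-media-output"
```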

## Roadmap

- [x] Local background removal (no API key needed)
- [x] Local transcription (no API key needed)
- [x] Video generation (text-to-video and image-to-video)
- [ ] Batch processing support