Evidence-based Evaluation · May 2026

Step 3.7 Flash
Capability & Experience Report

This version of the report shifts the focus from the toolchain to Step 3.7 Flash itself. It asks a specific question: across developers' real-world experience, public platform data, and official capability claims, how does the model actually perform, where are the risks, and how should it be used?

01 · Overall Verdict

More nuanced than “strong launch” or “community complaints”

Core verdict

Step 3.7 Flash is a capability jump that developers can clearly feel: speed, multimodality, open deployment, and application potential are all strong. It is not yet a production-grade Agent that can be trusted to run fully unattended. The main blockers are tool-call determinism, complex code repair, structured output stability, and transparency around concurrency, caching, and rules.
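One common way to work around unstable structured output is to validate every reply before acting on it and to retry a bounded number of times. The sketch below illustrates that pattern only; it is not taken from the report or from any Step API. The `generate` callable, the `REQUIRED_KEYS` set, and the retry limit are all hypothetical placeholders for whatever client and schema a given workflow uses.

```python
import json
from typing import Callable, Optional

# Hypothetical schema: the keys this illustrative workflow expects in every reply.
REQUIRED_KEYS = {"action", "arguments"}


def parse_structured(raw: str) -> Optional[dict]:
    """Return the parsed object if raw is a JSON object with the required keys, else None."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict) or not REQUIRED_KEYS.issubset(obj):
        return None
    return obj


def call_with_retry(generate: Callable[[str], str], prompt: str, max_attempts: int = 3) -> dict:
    """Call generate(prompt) until the reply validates or the attempt budget runs out."""
    for _ in range(max_attempts):
        obj = parse_structured(generate(prompt))
        if obj is not None:
            return obj
    raise ValueError(f"no valid structured output after {max_attempts} attempts")
```

Bounding the retries keeps token spend predictable, and raising on exhaustion hands the failure back to the caller instead of letting an agent proceed on malformed output.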

Why the group is stricter

Group members are not casual chat users. They wire the model into OpenClaw, CLI tools, MCP, PPT generation, GUI automation, code repositories, and long-context tasks. What they care about is not whether an answer sounds smart, but whether the model can close the loop, avoid damaging the environment, avoid wasting tokens, and recover on its own after a failure.

02 · Capability Breakdown

Six dimensions of Step 3.7 Flash

03 · Private Group Evidence

WeChat group feedback reads more like a stress test of real workflows

04 · Public Web Evaluation

Past-month public signal: platform data is rich, independent reviews are sparse

Public pages read through web_scan show that the verifiable evaluations from the past month come mainly from the official release and from platforms such as GitHub, Hugging Face, OpenRouter, and NVIDIA. Independent, in-depth community reviews remain sparse, which suggests that public opinion is still at the early deployment and early adopter stage.

05 · Official Narrative vs Group Experience

The biggest gap: tool use means different things in benchmarks and real workflows

What official/platform data proves

Official and platform data show that Step 3.7 Flash is competitive in multimodality, search, coding benchmarks, open-source deployment, platform throughput, and ecosystem integration. OpenRouter's throughput and latency data also support the impression that the model feels fast.

What group experience adds

Group evidence exposes problems that benchmarks rarely capture: whether tool parameters are actually chosen correctly, whether commands are guessed rather than known, whether the model works with OpenClaw and Kimi CLI, whether bug fixes introduce more bugs than they resolve, and whether users can trust the model on long tasks.
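Tool parameter selection is one of the few items in this list that can be checked mechanically before a call reaches the environment. The following is a minimal sketch of such a pre-execution check, assuming tools are described by a simple name-plus-typed-parameters spec. `ToolSpec`, `check_tool_call`, and the `read_file` tool in the usage note are hypothetical names for illustration and do not correspond to any OpenClaw, Kimi CLI, or MCP API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class ToolSpec:
    """Hypothetical tool description: parameter name -> expected Python type."""
    name: str
    required: Dict[str, type]
    optional: Dict[str, type] = field(default_factory=dict)


def check_tool_call(spec: ToolSpec, name: str, args: dict) -> List[str]:
    """Return a list of problems with a proposed tool call; an empty list means it passes."""
    if name != spec.name:
        return [f"unknown tool: {name}"]
    problems = []
    allowed = {**spec.required, **spec.optional}
    # Every required parameter must be present.
    for key in spec.required:
        if key not in args:
            problems.append(f"missing required argument: {key}")
    # Every supplied parameter must be declared and have the declared type.
    for key, value in args.items():
        if key not in allowed:
            problems.append(f"unexpected argument: {key}")
        elif not isinstance(value, allowed[key]):
            problems.append(f"wrong type for {key}: expected {allowed[key].__name__}")
    return problems
```

With `spec = ToolSpec("read_file", {"path": str}, {"max_bytes": int})`, a call carrying `{"path": "a.txt"}` passes, while one carrying an invented `file` argument is rejected before anything runs. A check like this only catches malformed calls; it says nothing about whether a well-formed call was the right one to make.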

06 · Public Sources

Public sources used

07 · Recommendations

How to evaluate it, and how to use it