The Framework: Designing for a Moving Target
Every day the AI model improves. I started by asking, "At what point does the AI chime in, and what does the human ultimately do?"
So the LLM team and I co-designed an AI maturity framework: three modes routed by confidence score, unified under one schema and one interaction model. These three constants are what let the platform absorb model improvement without a redesign. The schema doesn't change. The patterns don't change. Only the routing does.
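A minimal sketch of that separation, under stated assumptions: the mode names (`auto_accept`, `review`, `manual_entry`), the fields on `Extraction`, and every threshold value are hypothetical, since the case study does not spell them out. The only point is that the record shape and the routing function stay fixed while the thresholds move as the model improves.

```python
from dataclasses import dataclass


# Hypothetical shared schema: every AI-extracted value has the same shape,
# no matter which mode ends up handling it.
@dataclass(frozen=True)
class Extraction:
    field: str
    value: str
    confidence: float  # assumed to lie in [0.0, 1.0]
    rationale: str
    source: str


# Hypothetical routing config: the only part expected to change over time.
@dataclass(frozen=True)
class Routing:
    auto_accept_at: float
    review_at: float


def route(item: Extraction, routing: Routing) -> str:
    """Return the (assumed) mode name for one extracted value."""
    if item.confidence >= routing.auto_accept_at:
        return "auto_accept"
    if item.confidence >= routing.review_at:
        return "review"
    return "manual_entry"


# Illustrative values only, not real data.
item = Extraction(
    field="maturity_date",
    value="2027-06-30",
    confidence=0.88,
    rationale="Found in the schedule of investments",
    source="example filing, page 41",
)

today = Routing(auto_accept_at=0.95, review_at=0.60)
later = Routing(auto_accept_at=0.85, review_at=0.50)  # same schema, looser routing

print(route(item, today))  # review
print(route(item, later))  # auto_accept
```

The same `Extraction` lands in a different mode once the routing loosens, without the schema or the function that consumes it changing.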
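What makes an experience genuinely usable? (Design Principle)
A researcher opens the app and sees only 10 items in her queue that need her review. For each item, within 3 seconds she sees the confidence, the AI rationale, and the before/after, and makes a decision. She never has to switch views and never has to open a new tab to verify a source. By the end of a session she has processed 200 items without a single context loss.
Usable = short distance to decision + context never broken.
Not "beautiful". Not "feature-rich". Not "the AI is smart". Just these two things.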
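How do I use first principles to keep correcting course?
First principles is not "breaking the rules"; it is going back to the constraints themselves and reasoning again from there.
The first principles I used on this project:
Constraint 1: AI is imperfect. So a human fallback has to be designed in. But the fallback cannot make the AI pointless, so it is confidence-based routing, not uniform review.
Constraint 2: 30+ tools cannot become one overnight. So unification is not "one big rewrite"; it is "build the core schema first, then let the other verticals migrate onto it gradually".
Constraint 3: Researchers fear being replaced. So the goal is not just to keep them in the loop; it is to make their judgment the model's training signal. Their expertise is encoded, not discarded.
Whenever the direction was in doubt, I went back to one of these three constraints. If a decision violated any of them, I redid it. That is what first principles does here: it is not about being clever, it is about not drifting.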
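Which features are worth building, and which are not?
Worth building:
Confidence routing (the core mechanism)
Inline resolution (the experience of the highest-frequency action)
Undo + re-queue (the backstop for trust)
Shared schema + pattern library (the most leverage)
Not worth building:
A detailed dashboard showing researchers 80 confidence scores. They don't need that much information; they need a decision signal.
Elaborate AI explainability visualizations. A simple one-sentence rationale is enough. Anything more is noise.
A unique UI for every data type. Reusing a pattern is worth far more than designing specially for one type.
The test: does this feature shorten the researcher's distance to a decision? If yes, build it. If it increases the amount of information the researcher has to process, don't.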
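How did this reduce shipping friction?
Three things:
First, tier AI confidence so that high-confidence data passes automatically, with no human review. This removes the friction of "a person has to confirm everything".
Second, move QC enforcement from "a person catches it after submit" to "the system blocks it before submit". Errors no longer propagate downstream, and shipping cost drops.
Third, inline resolution: the researcher doesn't need to switch views to fix an error. One action completes the fix and ships the correction.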
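The second point, blocking before submit rather than catching afterwards, can be sketched as a gate in front of the canonical dataset. This is an illustration under assumptions, not the platform's actual QC logic: the `Row` fields and both rules are hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Row:
    borrower: str     # hypothetical field
    principal: float  # hypothetical field


def qc_errors(rows: List[Row]) -> List[str]:
    """Hypothetical QC rules, run before anything is submitted."""
    errors = []
    for i, row in enumerate(rows):
        if not row.borrower.strip():
            errors.append(f"row {i}: borrower is empty")
        if row.principal < 0:
            errors.append(f"row {i}: principal is negative")
    return errors


def submit(rows: List[Row], dataset: List[Row]) -> Tuple[bool, List[str]]:
    """Append rows to the dataset only if every QC rule passes."""
    errors = qc_errors(rows)
    if errors:
        return False, errors  # blocked: nothing reaches the dataset
    dataset.extend(rows)
    return True, []


# Illustrative values only, not real data.
canonical: List[Row] = []
ok, errors = submit([Row("Acme Holdings", 1_000_000.0), Row("", -5.0)], canonical)
print(ok, errors, len(canonical))
```

Because the batch is rejected as a whole, a bad row never lands downstream; the returned messages are what an inline resolution UI would surface next to the offending row.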
The framework didn't arrive whole. My first sketches had only two modes, because I was assuming AI confidence was a single threshold.
When our LLM team showed me that data accuracy varied significantly by dataset, I realized that treating the AI confidence score as a single threshold made no sense.
That changed how I thought about the model itself. The AI was strong at structured field extraction from filings. Its weaknesses were per-dataset accuracy variance, a lack of historical context, and a tendency to be confidently wrong.
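One way to see why a single threshold breaks down is to derive the cut-off per dataset from already-reviewed samples. The sketch below is an assumption about how that could be done, not a description of what the LLM team built: `pick_threshold`, the 0.9 accuracy target, and both sample sets are invented for illustration.

```python
from typing import List, Optional, Tuple


def pick_threshold(
    samples: List[Tuple[float, bool]], target_accuracy: float
) -> Optional[float]:
    """Lowest confidence cut-off whose accepted samples meet target_accuracy.

    Each sample is (confidence, was_correct). Returns None if no cut-off works.
    """
    for cutoff in sorted({conf for conf, _ in samples}):
        accepted = [ok for conf, ok in samples if conf >= cutoff]
        if sum(accepted) / len(accepted) >= target_accuracy:
            return cutoff
    return None


# Made-up review results for two hypothetical datasets.
well_behaved = [(0.70, True), (0.80, True), (0.90, True), (0.95, True)]
confidently_wrong = [(0.70, False), (0.80, True), (0.90, False), (0.95, True)]

print(pick_threshold(well_behaved, 0.9))       # 0.7
print(pick_threshold(confidently_wrong, 0.9))  # 0.95
```

The same confidence score of 0.90 is safe to auto-accept in the first dataset and not in the second, which is the behavior a single global threshold cannot express.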
03 · Quality Check · Review Mode
Where Review Mode fits in the pipeline
Researcher Dashboard
Filing queue and personal stats. Pick up the next BDC filing and start a task.
Editor
AI-assisted data entry. Every value has a confidence score and a traceable source.
Quality Check
Review Mode — resolve flagged errors inline, with a side panel for the full queue.
Submission
Clean, validated data lands in the canonical dataset. One click to ship.
Optional flow
Borrower Resolution
Side flow when a reported borrower doesn't match any canonical PitchBook entity.