How We Built a Visual Diff to Make AI Editing Transparent

Matcha Resume Team · 6 min read

Most AI tools show you a “before” and “after” but never explain what changed or why. When we started building a resume editor powered by AI, we realized this transparency problem was going to kill trust before we could demonstrate value. So we built a visual diff system that highlights every word an LLM touches. Turns out, making AI transparent is considerably harder than making it powerful.


The Trust Problem

I’ve been using GitHub Copilot for about two years now. I accept maybe 30% of its suggestions. Why? Because I need to understand what it’s proposing before I commit code. Resume writing felt similar, except the stakes are higher—people are job-hunting with this content.

Early user testing revealed that people would stare at AI-generated text for minutes, trying to mentally diff it against what they wrote. They couldn’t trust output they didn’t understand.

The technical challenge became obvious: how do you diff unstructured text in a way that’s actually meaningful to users?


The Technical Challenge: Diffing Natural Language

Code diffing is solved. Git uses Myers’ algorithm, and it works beautifully because code has structure—functions, classes, syntax trees. Natural language doesn’t have those affordances. You can’t rely on indentation or delimiters to signal semantic boundaries.

We experimented with three approaches:

  1. Character-level diffing — Too granular.
    If AI changed “managed” to “led,” you’d see:
    ~~manag~~**l**ed
    Noisy and hard to parse at a glance.

  2. Line-level diffing — Too coarse.
    If AI restructured a sentence within a paragraph, the entire paragraph showed as changed. Users couldn’t see what specifically was modified.

  3. Word-level diffing with semantic grouping — This is what we settled on.
    Using Myers’ diff algorithm with custom tokenization (splitting on whitespace and punctuation), we could highlight individual words.
    But we added a crucial layer: grouping consecutive changes to show intent.

Example:

  • Original:
    “Managed team of 5 engineers”
  • AI version:
    “Led cross-functional team of 5 engineers delivering customer-facing features”
  • Diff view:
    [Managed → Led] [Added: cross-functional] team of 5 engineers [Added: delivering customer-facing features]
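To make the grouping concrete, here is a self-contained sketch that reproduces the diff view above. A plain LCS diff stands in for Myers' algorithm (and for any package), and the names DiffOp, diffTokens, and renderDiff are ours for illustration, not our production API:

```dart
// Token-level diff with grouping, rendered as the bracketed view above.
// LCS-based diff used as a simple stand-in for Myers' algorithm.

enum Kind { equal, insert, delete }

class DiffOp {
  final Kind kind;
  final String token;
  DiffOp(this.kind, this.token);
}

/// Classic LCS-based token diff: emits equal/insert/delete ops.
List<DiffOp> diffTokens(List<String> a, List<String> b) {
  final m = a.length, n = b.length;
  final lcs = List.generate(m + 1, (_) => List.filled(n + 1, 0));
  for (var i = m - 1; i >= 0; i--) {
    for (var j = n - 1; j >= 0; j--) {
      lcs[i][j] = a[i] == b[j]
          ? lcs[i + 1][j + 1] + 1
          : (lcs[i + 1][j] > lcs[i][j + 1] ? lcs[i + 1][j] : lcs[i][j + 1]);
    }
  }
  final ops = <DiffOp>[];
  var i = 0, j = 0;
  while (i < m && j < n) {
    if (a[i] == b[j]) {
      ops.add(DiffOp(Kind.equal, a[i])); i++; j++;
    } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
      ops.add(DiffOp(Kind.delete, a[i])); i++;
    } else {
      ops.add(DiffOp(Kind.insert, b[j])); j++;
    }
  }
  while (i < m) ops.add(DiffOp(Kind.delete, a[i++]));
  while (j < n) ops.add(DiffOp(Kind.insert, b[j++]));
  return ops;
}

/// Groups consecutive changed ops and renders the bracketed diff view.
String renderDiff(List<DiffOp> ops) {
  final parts = <String>[];
  final dels = <String>[], inss = <String>[];
  void flush() {
    final paired = dels.length < inss.length ? dels.length : inss.length;
    if (paired > 0) {
      parts.add('[${dels.take(paired).join(" ")} → ${inss.take(paired).join(" ")}]');
    }
    if (dels.length > paired) {
      parts.add('[Removed: ${dels.skip(paired).join(" ")}]');
    }
    if (inss.length > paired) {
      parts.add('[Added: ${inss.skip(paired).join(" ")}]');
    }
    dels.clear();
    inss.clear();
  }
  for (final op in ops) {
    switch (op.kind) {
      case Kind.equal:
        flush();
        parts.add(op.token);
        break;
      case Kind.delete:
        dels.add(op.token);
        break;
      case Kind.insert:
        inss.add(op.token);
        break;
    }
  }
  flush();
  return parts.join(' ');
}

void main() {
  final a = 'Managed team of 5 engineers'.split(' ');
  final b =
      'Led cross-functional team of 5 engineers delivering customer-facing features'
          .split(' ');
  print(renderDiff(diffTokens(a, b)));
  // [Managed → Led] [Added: cross-functional] team of 5 engineers [Added: delivering customer-facing features]
}
```

Pairing deletions with insertions one-for-one is what turns "Managed"/"Led" into a substitution while the leftover "cross-functional" surfaces as a separate addition.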

Implementation Details

We’re using Myers’ diff at the core (there’s a good Dart implementation in the diff package), but with custom tokenization to treat punctuation intelligently. The key insight was that diff algorithms work on sequences—and we just needed the right sequence unit.

// Simplified version of our tokenizer. Runs of whitespace act as
// separators; the zero-width lookarounds also split immediately before
// and after sentence punctuation, so each mark becomes its own token
// ("Hello, world" -> ["Hello", ",", "world"]).
List<String> tokenize(String text) {
  return text
      .split(RegExp(r'\s+|(?<=[.,;!?])|(?=[.,;!?])'))
      .where((token) => token.trim().isNotEmpty)
      .toList();
}

Then we apply Myers’ diff to the token sequences and classify changes:

  • Word swap: Single-word substitution (different meaning/tone)
  • Addition: New content inserted
  • Deletion: Content removed
  • Restructure: Multiple consecutive changes (sentence rework)
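A minimal sketch of that classification, assuming each group is one run of consecutive non-equal diff ops (ChangeGroup and ChangeType are illustrative names, not from any package):

```dart
// Classify a run of consecutive changes into one of the four
// change types described above.

enum ChangeType { wordSwap, addition, deletion, restructure }

class ChangeGroup {
  final List<String> deleted;
  final List<String> inserted;
  ChangeGroup(this.deleted, this.inserted);
}

ChangeType classify(ChangeGroup g) {
  if (g.deleted.isEmpty) return ChangeType.addition; // only new content
  if (g.inserted.isEmpty) return ChangeType.deletion; // only removals
  if (g.deleted.length == 1 && g.inserted.length == 1) {
    return ChangeType.wordSwap; // single-word substitution
  }
  return ChangeType.restructure; // multiple consecutive changes
}

void main() {
  print(classify(ChangeGroup(['managed'], ['led']))); // ChangeType.wordSwap
  print(classify(ChangeGroup([], ['cross-functional']))); // ChangeType.addition
}
```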

Color-coding helps:
🟩 Green for additions
🟥 Red for deletions
🟨 Yellow for substitutions

We render this in the preview pane so users see exactly what the AI touched.


Performance Considerations

Resume text is typically 500–2000 words. Myers’ diff is O(ND) where N is the sum of sequence lengths and D is the size of the shortest edit script. For our use case, this is fast enough—diffs compute in under 50ms on typical documents. But we still made it async to avoid blocking the UI during generation.
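In Flutter proper you would typically reach for compute(); the plain-Dart sketch below uses Isolate.run from dart:isolate (Dart 2.19+) to the same effect. countChanges is a naive stand-in for the real diff, not our actual implementation:

```dart
import 'dart:isolate';

// Even a fast diff runs off the main isolate so the UI never blocks.

// Naive placeholder diff: counts tokens present in one version
// but not the other.
int countChanges(List<String> a, List<String> b) {
  final aSet = a.toSet(), bSet = b.toSet();
  return a.where((t) => !bSet.contains(t)).length +
      b.where((t) => !aSet.contains(t)).length;
}

Future<int> diffAsync(String original, String revised) {
  final a = original.split(' ');
  final b = revised.split(' ');
  // Runs on a background isolate; the result is sent back when done.
  return Isolate.run(() => countChanges(a, b));
}

void main() async {
  final n = await diffAsync(
      'Managed team of 5 engineers', 'Led team of 5 engineers');
  print(n); // 2: "Managed" removed, "Led" added
}
```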

The trickier performance issue was re-rendering. Every time we highlight changes, we’re rebuilding the document tree in Flutter. For large documents with hundreds of diffs, this got sluggish. We ended up batching diff highlights by paragraph to limit re-render scope.
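A rough sketch of that batching idea, assuming paragraphs align by index (a real version would diff the paragraph lists first so a mid-document insertion doesn't mark everything below it as changed):

```dart
// Limit re-render scope: find which paragraphs actually differ,
// so only those subtrees need rebuilding.

List<String> paragraphs(String text) => text.split(RegExp(r'\n\s*\n'));

/// Returns the indices of paragraphs whose text changed.
List<int> changedParagraphs(String original, String revised) {
  final a = paragraphs(original);
  final b = paragraphs(revised);
  final changed = <int>[];
  final shared = a.length < b.length ? a.length : b.length;
  for (var i = 0; i < shared; i++) {
    if (a[i] != b[i]) changed.add(i);
  }
  // Trailing added or removed paragraphs count as changed too.
  final longest = a.length > b.length ? a.length : b.length;
  for (var i = shared; i < longest; i++) {
    changed.add(i);
  }
  return changed;
}

void main() {
  const before = 'Managed team of 5.\n\nShipped the app.';
  const after = 'Led team of 5.\n\nShipped the app.';
  print(changedParagraphs(before, after)); // [0]
}
```

Only the paragraphs in that index list get re-diffed and re-rendered; the untouched ones keep their existing widget subtrees.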


Beyond Git Diff: Why Text Is Different

Code has syntax trees to lean on. Text doesn’t. This creates challenges:

  • Context matters more — Changing “managed” to “led” might improve one bullet point but weaken another, depending on the broader narrative. Pure diff doesn’t capture this.
  • Tone is invisible — AI might make text more formal or more casual, but word-level diff won’t show that shift explicitly. You have to infer it from the substitutions.
  • Readability can degrade — Sometimes AI optimizes for keywords at the expense of flow. Diff shows the changes, but not whether the result reads well.

We’re experimenting with annotations to address this: showing not just what changed, but why the AI made that choice.

“Added ‘Python’ keyword mentioned 3× in job description.”


Open Questions We’re Still Solving

  • How granular should diffs be?
    Too detailed and you’re overwhelmed by yellow highlighting. Too coarse and you miss important nuances.
    We currently let users toggle between summary (major changes only) and detailed (every word). Not sure this is the right UX.

  • When does transparency become noise?
    If the AI changed 200 words in a 500-word resume, highlighting everything might just stress users out.
    Should we summarize? Show top N changes? Surface only “risky” edits?

  • Can we show confidence levels?
    Some AI suggestions are safe (adding a keyword). Others are subjective (reframing an achievement).
    We’re exploring whether to surface confidence scores per change, but we don’t want to introduce decision fatigue.

  • Balancing transparency with usability
    The diff view is informative but adds cognitive load. Some users want magic: “Just make my resume better.”
    Others want full control. How do you serve both audiences without forking the product?


Practical Takeaways

If you’re building an AI tool that modifies user-generated content:

  • Word-level diffing with semantic grouping works well for natural language.
    Character-level is too noisy; line-level too coarse.
  • Myers’ algorithm is fast enough for documents under 10K words.
    Profile before optimizing—your bottleneck is probably rendering, not diffing.
  • Color-coding by change type helps users parse diffs quickly.
    Consistent visual language matters.
  • Async diff computation prevents UI blocking, even if the computation is fast.
    Users perceive the app as more responsive.
  • Transparency has a UX cost.
    Not everyone wants to see how the sausage is made. Design for both power users and casual users.

If you’re interested in trying the diff system or have thoughts on the approach, we’ve got a working implementation at
👉 www.matcharesume.com
(still pretty rough around the edges, and we’re actively iterating based on feedback).


Discussion

For those who’ve built AI tools:

  • How do you handle transparency without overwhelming users?
  • What’s the right balance between showing everything the AI changed vs. summarizing major edits?
  • Has anyone found better algorithms than Myers’ diff for natural language, or is the real challenge in tokenization strategy?