Text Diffing Algorithms Explained: LCS, Myers, and More

The Problem: What Changed?

Given two sequences (lines of text, characters, tokens), a diff algorithm finds the minimal edit script — the smallest set of insertions and deletions that transforms one sequence into the other. This sounds simple, but the search space is enormous. For two 1000-line files, the number of possible edit scripts is astronomical. Efficient diff algorithms use dynamic programming and heuristics to find the minimal edit in linear or near-linear time.

Longest Common Subsequence (LCS)

The classic foundation of text diffing is the Longest Common Subsequence problem. Given sequences A and B, find the longest subsequence that appears in both (not necessarily contiguous). Lines not in the LCS are additions or deletions.

The standard LCS algorithm uses dynamic programming with an O(mn) time and space complexity (m and n are sequence lengths). For large files, this is too slow and memory-intensive.

// Classic LCS (O(mn) — works for small files)
function lcs(a: string[], b: string[]): string[] {
  const m = a.length, n = b.length
  const dp: number[][] = Array.from({length: m+1}, () => new Array(n+1).fill(0))

  for (let i = 1; i <= m; i++)
    for (let j = 1; j <= n; j++)
      dp[i][j] = a[i-1] === b[j-1] ? dp[i-1][j-1] + 1 : Math.max(dp[i-1][j], dp[i][j-1])

  // Backtrack to find the actual subsequence...
  return []
}

Myers Diff Algorithm

Myers diff (1986) is the algorithm used by Git, GNU diff, and most modern diff tools. Its key insight: instead of computing the full LCS, find the shortest edit script directly using a depth-first search on an "edit graph."

Myers achieves O(ND) time complexity, where N is the sum of sequence lengths and D is the number of differences. For files that are mostly similar (small D), it's far faster than LCS. It also uses O(D) space rather than O(mn).

# The edit graph insight:
# Every path from (0,0) to (m,n) represents a sequence of edits
# Diagonal moves = match (no edit)
# Right moves = insert from B
# Down moves = delete from A
# Goal: find shortest path (fewest non-diagonal moves)

Git uses Myers by default. You can see it in action with git diff --diff-algorithm=myers.

Histogram Diff

Histogram diff is Git's alternative to Myers (use with git diff --diff-algorithm=histogram). It prioritizes matching lines that appear infrequently in both files — the idea being that rare lines are more likely to be genuine anchors for the comparison.

In practice, histogram diff produces more readable diffs for code refactors where blocks of code are moved or rewritten. Myers is theoretically optimal (shortest edit script), but histogram's output is often easier for humans to understand:

# Use histogram in git globally
git config --global diff.algorithm histogram

# Or per-invocation
git diff --diff-algorithm=histogram HEAD~1

Patience Diff

Patience diff (developed by Bram Cohen) takes a different approach: find unique lines that appear exactly once in both files, and use those as anchors. Then recursively diff the regions between anchors. This produces very clean diffs for code where functions were reordered or indentation changed.

git diff --diff-algorithm=patience HEAD~1

Patience is particularly good at avoiding the "common line problem" — where generic lines like { or } get matched incorrectly, making the diff harder to read.

Word-Level and Character-Level Diff

Line-level diff is a special case. The same LCS/Myers algorithms apply to any sequence — words, characters, or tokens. Word-level diff is useful for prose and variable renaming:

git diff --word-diff=color HEAD~1

# Output:
# The [-old-]{+new+} variable name is more [-descriptive-]{+accurate+}.

Semantic / Structural Diff

For programming languages, tree-sitter-based tools like difftastic parse source code into syntax trees before diffing. This means moving a function body appears as a move, not as hundreds of deleted and added lines. This is the frontier of diff algorithm research — transforming the problem from sequence diffing to tree diffing.

Text Diffing Algorithms Explained: LCS, Myers, and More

The Problem: What Changed?

Longest Common Subsequence (LCS)

Myers Diff Algorithm

Histogram Diff

Patience Diff

Word-Level and Character-Level Diff

Semantic / Structural Diff

Related Articles

How to Compare JSON Files Online: A Complete Guide

Understanding the Unified Diff Format

How to Compare XML Files: A Complete Guide