Home/Blog/Text Diffing Algorithms Explained: LCS, Myers, and More
Back to blog
Programming10 min read

Text Diffing Algorithms Explained: LCS, Myers, and More

A deep dive into the algorithms behind text diff — Longest Common Subsequence, Myers diff, histogram diff, and how they handle edge cases.

PS

Priya Sharma

Platform Engineer

#algorithms#diff#computer-science#programming

The Problem: What Changed?

Given two sequences (lines of text, characters, tokens), a diff algorithm finds the minimal edit script — the smallest set of insertions and deletions that transforms one sequence into the other. This sounds simple, but the search space is enormous. For two 1000-line files, the number of possible edit scripts is astronomical. Efficient diff algorithms use dynamic programming and heuristics to find the minimal edit in linear or near-linear time.

Longest Common Subsequence (LCS)

The classic foundation of text diffing is the Longest Common Subsequence problem. Given sequences A and B, find the longest subsequence that appears in both (not necessarily contiguous). Lines not in the LCS are additions or deletions.

The standard LCS algorithm uses dynamic programming with an O(mn) time and space complexity (m and n are sequence lengths). For large files, this is too slow and memory-intensive.

// Classic LCS (O(mn) — works for small files)
function lcs(a: string[], b: string[]): string[] {
  const m = a.length, n = b.length
  const dp: number[][] = Array.from({length: m+1}, () => new Array(n+1).fill(0))

  for (let i = 1; i <= m; i++)
    for (let j = 1; j <= n; j++)
      dp[i][j] = a[i-1] === b[j-1] ? dp[i-1][j-1] + 1 : Math.max(dp[i-1][j], dp[i][j-1])

  // Backtrack to find the actual subsequence...
  return []
}

Myers Diff Algorithm

Myers diff (1986) is the algorithm used by Git, GNU diff, and most modern diff tools. Its key insight: instead of computing the full LCS, find the shortest edit script directly using a depth-first search on an "edit graph."

Myers achieves O(ND) time complexity, where N is the sum of sequence lengths and D is the number of differences. For files that are mostly similar (small D), it's far faster than LCS. It also uses O(D) space rather than O(mn).

# The edit graph insight:
# Every path from (0,0) to (m,n) represents a sequence of edits
# Diagonal moves = match (no edit)
# Right moves = insert from B
# Down moves = delete from A
# Goal: find shortest path (fewest non-diagonal moves)

Git uses Myers by default. You can see it in action with git diff --diff-algorithm=myers.

Histogram Diff

Histogram diff is Git's alternative to Myers (use with git diff --diff-algorithm=histogram). It prioritizes matching lines that appear infrequently in both files — the idea being that rare lines are more likely to be genuine anchors for the comparison.

In practice, histogram diff produces more readable diffs for code refactors where blocks of code are moved or rewritten. Myers is theoretically optimal (shortest edit script), but histogram's output is often easier for humans to understand:

# Use histogram in git globally
git config --global diff.algorithm histogram

# Or per-invocation
git diff --diff-algorithm=histogram HEAD~1

Patience Diff

Patience diff (developed by Bram Cohen) takes a different approach: find unique lines that appear exactly once in both files, and use those as anchors. Then recursively diff the regions between anchors. This produces very clean diffs for code where functions were reordered or indentation changed.

git diff --diff-algorithm=patience HEAD~1

Patience is particularly good at avoiding the "common line problem" — where generic lines like { or } get matched incorrectly, making the diff harder to read.

Word-Level and Character-Level Diff

Line-level diff is a special case. The same LCS/Myers algorithms apply to any sequence — words, characters, or tokens. Word-level diff is useful for prose and variable renaming:

git diff --word-diff=color HEAD~1

# Output:
# The [-old-]{+new+} variable name is more [-descriptive-]{+accurate+}.

Semantic / Structural Diff

For programming languages, tree-sitter-based tools like difftastic parse source code into syntax trees before diffing. This means moving a function body appears as a move, not as hundreds of deleted and added lines. This is the frontier of diff algorithm research — transforming the problem from sequence diffing to tree diffing.

Share this article

Was this article helpful?

Ready to try it? Start a free comparison →

PS

Priya Sharma

Platform Engineer

Priya Sharma writes about developer tools, software engineering best practices, and productivity for the DiffChecker Pro blog. With extensive experience in software development, Priya focuses on practical guides that help developers work more effectively.

Related Articles

Tutorials

Understanding the Unified Diff Format

A deep dive into the unified diff format — how to read @@ headers, interpret +/- lines, understand context lines, and work with patch files.

Priya Sharma6 min read
XML Tools

How to Compare XML Files: A Complete Guide

Learn how to compare XML files accurately using online and CLI tools. Covers attribute vs element comparison, namespace handling, and best practices.

Priya Sharma7 min read