Atomic, Not Binary: Opaque Blobs in VCSs

a bit of background

One of the most commonly discussed issues with git1 is the storage of binary files.

There are, broadly, two halves to this problem.

The first half is that binary files tend to be large, and git stores a full snapshot of a file every time it changes, so frequently changing files can cause a repository to balloon in size. This is especially inconvenient if most people cloning the repository do not care about those files.

git-lfs, a git plugin written by GitHub, attempts to resolve this, but it's a bit clunky, and it ends up being somewhat difficult to use when multiple hosts are involved. New work is being done in git to alleviate the need for plugins and help make git more usable with large files natively.

But, crucially, this is not only a problem for binary files -- large text-based files suffer from it too (consider, for example, large generated json files). Nor is it inherent to binary files -- small binary files work exactly the same as small text files here.

So, we come to the second half: diffing. Despite git's internals working with snapshots, git's user interface largely deals with diffs. Diffs are how we review changes to files when sharing them, how we check what's in a commit, and how we remind ourselves what we were working on when we get bored with a project and abandon it for 2 years, only to pick it back up as a way to cope with anxiety about the clusterfuck going on outside 2.

A common talking point is that binary files are not diffable. This presents a problem for a dVCS: if you can't diff a file, you can't merge the changes to the file if two or more people make changes independently. While traditional cVCSs3 use file locking to get around this problem 4 5, file locking isn't feasible in a dVCS; one of the main points of a dVCS is that you don't need or have a centralized authority to mediate changes and hold file locks 6.

And so, if you have binary files in your repository, you're left choosing between abandoning your own changes or overwriting someone else's when applying conflicting patches. Or manually coordinating with the people who might also be working on those files. Which... sucks. It's not great.

But! this is also not a problem inherent to binary files. It just happens that our main tool for diffing is... outdated.

let's talk terminology

Let's take a step back. When we say a file is diffable, what do we mean? A naive answer would be to say:

Given a file with 2 revisions (A and B), we call that file "diffable" if and only if we can use our diff tool to produce a patch that describes the changes to get from A to B.

This is a clumsy definition, but it's a start. Let's see if we can improve it a bit. First of all, that last clause "changes to get from A to B" has a bit of a hole in it: technically, a diff that says "replace the entire contents of the file" meets this criterion. So instead, we'll talk about the minimal set of changes required. For our purposes, we'll define that as "the smallest possible patch" 7.

Given a file with 2 revisions (A and B), we call that file "diffable" if and only if we can use our diff tool to produce a minimal patch that describes the changes to get from A to B.

Getting better, but there's another hole here: we have an ambient dependency in our definition: "our diff tool". Let's close that up: instead of saying that a file is "diffable", we'll say that it's "diffable under a tool":

Given a diff tool diff and a file with 2 revisions (A and B), we call that file "diffable under diff" if and only if we can use diff to produce a minimal patch that describes the changes to get from A to B.

Alright, now our definition is shaping up. Notice that we no longer talk about whether a file is diffable in the abstract -- instead, we talk about how it relates to a specific tool. Let's do one final revision, though: let's define how we measure minimal. We'll say that our diff has a "unit" (for example, a line), and each piece of the diff cannot be smaller than the unit. Then:

Given a diff tool diff, a "smallest possible change" δ, and a file with 2 revisions (A and B), we say that a minimal patch is a patch consisting of the fewest multiples of δ needed to describe the changes to get from A to B, and we call that file "diffable under diff measuring with δ" if and only if we can use diff to produce such a minimal patch.
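To make that a bit more concrete, here's a rough Python sketch of measuring a patch in line units, in the spirit of the change(range-from, range-to, contents) patch language from footnote 7 (patch_size_in_lines is an invented name, and counting both removed and added lines is just one possible accounting):

```python
import difflib

def patch_size_in_lines(a: str, b: str) -> int:
    """Size of a line-wise patch from revision A to revision B,
    measured in our unit delta = one line: each deleted, inserted,
    or replaced line counts as one multiple of delta."""
    a_lines, b_lines = a.splitlines(), b.splitlines()
    matcher = difflib.SequenceMatcher(None, a_lines, b_lines)
    size = 0
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            # one change operation: it removes (i2 - i1) old lines
            # and carries (j2 - j1) new lines as its contents
            size += (i2 - i1) + (j2 - j1)
    return size

# replacing one line out of three costs 2 units, not the whole file:
print(patch_size_in_lines("a\nb\nc\n", "a\nB\nc\n"))  # 2
```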

Now, you might be saying "why is that important?". Well, a) it's ✨foreshadowing✨, but b) assume we're still talking about only text. Suppose we have a diff tool that produces line-by-line diffs (like we're used to with the standard git-diff output) but only understands windows line endings 8. Per our definition above, files written on linux would not be considered diffable under that tool; diff would consider the whole file to be one giant line, and produce a non-minimal diff consisting of the entire file 9.
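Here's a quick sketch of that failure mode in Python (our windows-only diff tool is imaginary, so its line-splitting behavior is simulated):

```python
# a unix-flavored file: three lines terminated with plain \n
revision = 'fn main() {\n    println!("hello");\n}\n'

# our imagined windiff only knows \r\n, so it sees the whole file
# as one giant "line"...
print(len(revision.split("\r\n")))   # 1

# ...while a tool that also understands \n sees three lines, the
# unit we actually want to measure minimality in
print(len(revision.splitlines()))    # 3
```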

There's another problem with our definition that's lurking just outside the bounds of our terminology: we need a way to compose diffs, and that involves merging, and merging involves both humans and computers.

let's talk about people

Consider a giant json lockfile that's not well-formatted for humans.

We could produce a diff of this using our standard git-diff, and it might very well be minimal while measuring lines, or even close-to-minimal while measuring with json nodes. but! such a patch would not be particularly readable, and given the types of automated changes made to such a file, it's probably not particularly mergeable line-wise by computers with our typical merge algorithms. Doubly so for humans. Nor does it particularly make sense to merge it as such -- it's generated by a tool, so it likely just makes sense to regenerate it 10.

Now, this is a bit fuzzy, but all of computing is a bit fuzzy, because at the end of the day it's humans using the computers.

Consider an SVG file or a .ma file. It might be formatted nicely, and we might be able to get a minimal diff measuring both line-wise and element-wise. But merging it doesn't really make sense: changes to the text of a graphics file format are not particularly reviewable as text, and merging two sets of changes often produces nonsensical results -- the relationship between parts of the file is visuospatial, not textual.

These files are diffable under git-diff under our previous definitions, but diffing them that way doesn't really make a ton of sense.

We need some new terms.

let's talk terminology, again

Let's consider a rust source file or a blog post in markdown, and compare them to our json lockfile and svg file.

Both types of file are diffable, but the source file and markdown file can be broken up into chunks that we can safely re-combine as humans, while the lockfile and svg cannot.

Let's call the source file and markdown file compound files, and the svg an atomic file (we'll get to the lockfile in a moment).

We'll say:

Given a file with two revisions A and B, we call the file "compound" if it can be split into chunks such that we can combine chunks from A with chunks from B and still produce a coherent file. An atomic file is any file that is not compound.
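As a toy illustration of that definition in Python (the file contents and the choice of chunk boundaries are invented for the sketch):

```python
import ast

# two revisions of the same (made-up) source file
rev_a = "def greet():\n    return 'hello'\n\ndef part():\n    return 'bye'\n"
rev_b = "def greet():\n    return 'hi'\n\ndef part():\n    return 'farewell'\n"

# take greet() from revision A and part() from revision B; the result
# still parses, so the chunks recombine into a coherent file
combined = rev_a.split("\n\n")[0] + "\n\n" + rev_b.split("\n\n")[1]
ast.parse(combined)  # no SyntaxError
```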

Let's consider another example: a palette of tiles for a 2d video game. In theory, it is safe to work on one tile while someone else works on another, even though all tiles are stored in a single file. Thus, we can say that a tileset file is compound, with its chunks being the individual tiles. Doing so, however, would require a specialized diff tool that works with tileset images. Recall our definition of "diffable" -- a tileset is diffable (measured in tiles) only under a diff tool that understands tileset images.
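Such a tool doesn't have to be fancy. Here's a minimal sketch, assuming an uncompressed 8-bit grayscale tileset made of 16x16 tiles -- the format, the tile size, and every name here are made up for illustration:

```python
import hashlib

TILE = 16  # tile edge length in pixels; an assumption for this sketch

def tile_hashes(pixels: bytes, width: int) -> dict[tuple[int, int], str]:
    """Split a raw 8-bit grayscale tileset into TILE x TILE tiles and
    hash each one, keyed by its (column, row) in the tile grid."""
    height = len(pixels) // width
    hashes = {}
    for ty in range(height // TILE):
        for tx in range(width // TILE):
            tile = b"".join(
                pixels[(ty * TILE + row) * width + tx * TILE:
                       (ty * TILE + row) * width + tx * TILE + TILE]
                for row in range(TILE)
            )
            hashes[(tx, ty)] = hashlib.sha256(tile).hexdigest()
    return hashes

def tile_diff(a: bytes, b: bytes, width: int) -> list[tuple[int, int]]:
    """A minimal patch, measured in tiles: just the tiles that changed."""
    ha, hb = tile_hashes(a, width), tile_hashes(b, width)
    return [pos for pos in ha if ha[pos] != hb[pos]]
```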

Let's revisit our lockfile: it's not particularly easy to combine changes text-wise, so usually we'd just regenerate it when a merge conflict happened. However, it'd be nice if our lockfile generator could read the old lockfile and leave any relevant locked packages in place, only deviating when a package specifier was changed, added, or deleted from the main dependency file. That'd get us a mostly reasonable solution, but what if, instead, it could take the old files from both revisions, and only complain if the revisions specified mutually incompatible package revisions? Then we'd get a perfect new lockfile, without any risk of accidentally introducing package revisions we did not intend to.

Such a program is a merge tool -- instead of considering changes textually, it interprets them semantically. Similarly to how we said that a compound file is diffable under a given diff tool, diffs for a compound file can also be mergeable under a given merge tool.
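To sketch what such a merge tool might look like in Python -- with a made-up one-pin-per-line lockfile format and invented names; a real tool would parse a real lockfile and reason about version ranges:

```python
def parse_lockfile(text: str) -> dict[str, str]:
    """Toy lockfile format, invented for this sketch: one
    'package==version' pin per line."""
    return dict(line.split("==") for line in text.splitlines() if line)

def merge_lockfiles(base: str, ours: str, theirs: str) -> str:
    """Merge two revisions of a lockfile semantically: keep pins both
    sides agree on, take either side's change relative to the base,
    and complain only when the two sides pinned genuinely
    incompatible versions."""
    b, o, t = parse_lockfile(base), parse_lockfile(ours), parse_lockfile(theirs)
    merged = {}
    for pkg in sorted(o.keys() | t.keys()):
        ov, tv = o.get(pkg), t.get(pkg)
        if ov == tv:
            merged[pkg] = ov
        elif tv == b.get(pkg):
            merged[pkg] = ov            # only our side changed this pin
        elif ov == b.get(pkg):
            merged[pkg] = tv            # only their side changed this pin
        else:
            raise ValueError(f"{pkg}: incompatible pins {ov!r} vs {tv!r}")
        if merged[pkg] is None:         # the changing side deleted the pin
            del merged[pkg]
    return "".join(f"{p}=={v}\n" for p, v in sorted(merged.items()))
```

Conveniently, git already has an extension point for plugging in a tool like this: custom merge drivers, selected per file pattern via .gitattributes.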

a case for why this matters

Our tileset example above might be a bit academic at this point, given the existing processes and issues around gamedev SCM, but changing how we think about diffing and merging is also applicable to everyday programming.

Merging lockfiles is a task that causes issues today, and while line-wise diff/merge tooling largely works for many programming languages, AST-based tools like difftastic can provide a much nicer experience.

Furthermore, abandoning the preconception of "binary means non-diffable and non-mergeable" paves the way for us to think about programming languages which store binary ASTs instead of text files, annotated with additional information (consider a language that pre-computes and stores typechecking results like Unison does, but does not need a whole specialized VCS/SCM to manage it, just some standard plugins).

Just like how it's become common for programming languages to ship language servers as a standard piece of dev tooling, we could have a future in which syntax-aware diff and merge tools ship standard as well, and where our VCSs handle compound binary files (using standard plugins shipped with the tooling to edit those files) just as easily as they handle compound text today.

Our tools could be a lot better -- we just need to think in the right frame of reference.

1

and often other dVCSs

2

what, don't look at me like that, you know it's true

3

e.g. Perforce, the favorite VCS of gamedevs.

4

For example, this is how Perforce works (permalink)

5

i have been informed by one of my awesome gamedev friends that file locking is also used in gamedev to prevent changes to approved assets. this is an interesting workflow to me, because it is somewhat antithetical to the way i am used to thinking about code and releases -- in my usual realm, approval comes from reviews, and approval management is done with owners/permissions lists (and it's just assumed that if you're an owner you're going to do a reasonable job). but! my understanding is that gamedevs often work under a very different set of constraints, and it's always interesting to see the ways that affects tooling and processes.

6

there is an argument that in this day and age GitHub effectively functions as such a centralized authority. Whether or not that's true, the mentality with which users use Git remains distributed: anybody can go off and start working on a patch on their own, then create a PR or send a patch in -- they don't need to somehow be blessed with write access to a repository to do so 11.

7

yes, yes, i know i'm still handwaving a bit. One definition here might be: if we say that we have a patch language consisting of an operation change(range-from, range-to, contents) that changes chunks of the file at the given ranges, we can say that the total size of the patch is the sum of the sizes of the contents of each operation in the patch. There's still a hole here, but we'll get to it in a minute.

8

that is, \r\n instead of just the \n used on linux, or the \r used on some very old systems. the distinction these days is mostly 12 a historical artifact.

9

"wait", i hear you asking, "wouldn't that patch still be minimal if our 'lower bound' is still lines?". Sure, but crucially, the lower bound of our measurement might not be the lower bound our diff tool uses. Here, we consider a line terminated by any newline to be our lower bound for measuring the minimality of a patch, while our imagined windiff only considers \r\n, so windiff does not perform favorably.

10

with the caveat of "regenerating it might produce slightly different results than some theoretical merging", but often the results are identical or close enough, since the regeneration can be done by feeding in one of the two original files as a base. Ideally, we'd be able to feed in both files. We're getting there.

11

...except in some corporate uses of GitHub, where forking is turned off. Then you need write access. That's a bad model that discourages transparency and collaboration between teams at a company. But that's a blog post for another day.

12

...mostly. brief aside: \r stands for "carriage return" (sometimes written CR), and \n stands for "line feed" (sometimes written LF). Back in typewriter days, carriage return moved the typewriter carriage back to its starting position, while line feed scrolled the paper down a line. This was carried over into the teletype era 13, and then into the modern era. These days, very few things treat \r and \n separately. However, the sixel graphics API that originally appeared on DEC terminals retains a separate concept of carriage returns and line feeds -- you write all the pixel positions of a single color in one go, and then CR back to the beginning of the line and write pixel positions for a new color, and so on, only LF-ing when you're done with a line.

13

per the vim book, page 120, apparently part of the reason this was carried over was to prevent smudging -- the extra time it took to transmit the LF after the CR gave the printer time to finish returning the print head. CR was also apparently involved in overwriting to make bold text.