Reading a delta created by xdelta3

2025-08-14

This version: 2025-08-14
First version: 2025-06-05

Attention conservation notice: These are informal notes (distantly) related to git packfiles. They’re primarily a reference note for the author. Corrections are welcome.

We know about the benefits of delta encoding. Let’s look at the implementation details.

We have a big text file containing War and Peace. It weighs 3.2 MiB has the filename is war-and-peace-v1.txt.

Let’s create another “edition” by creating a copy of the first version and then adding the following 8 bytes: The End. (no newline). Now we have a second file that’s also 3.2 MiB. It has the filename war-and-peace-v2.txt.

Here’s what’s in our directory right now:

1$ wc -c *.txt
23265324 war-and-peace-v1.txt
33265332 war-and-peace-v2.txt

Before we create the delta, let’s guess how big the patch is going to be. We have the new 8 bytes (The End.). We’ll also likely need to indicate the length of the new data, the 8 bytes (~1 byte). And we’ll need to encode a COPY instruction (~1 byte) and an ADD instruction (~1 byte). We’ll also need to put the offsets for these instructions in the patch. The COPY instruction will need to mention the number of bytes to copy from the original. We know this number is 3265324 (~3 bytes) since that’s the length of war-and-peace-v1.txt. And perhaps the ADD instruction needs this number too since it needs to know where to start inserting (~3 bytes). We probably also need the starting point for the COPY operation (offset 0) (~1 byte). So maybe a total between 18 and 24 bytes?

Let’s see how close we are.

We’ll create the patch using xdelta3, software that should be available on any good Linux distribution.

1$ xdelta3 -S none -s war-and-peace-v1.txt war-and-peace-v2.txt > war-and-peace-v2.xdelta3

(The -S none disables secondary compression and the -s war-and-peace-v1.txt indicates the source file.)

Let’s look at our directory now:

1$ wc -c war-and-peace-v1.txt war-and-peace-v2.txt war-and-peace-v2.xdelta3
23265324 war-and-peace-v1.txt
33265332 war-and-peace-v2.txt
4     83 war-and-peace-v2.xdelta3

83 bytes is a lot more than our guess! It turns out, however, that the VCDIFF header—which contains the 20 byte filenames, among other things—is 49 bytes. And there’s a second window header that takes up 19 bytes, leaving 15 bytes for the instructions themselves. Although the second header includes information that’s not part of the delta, such as a 4-byte checksum of the target file, some of its information is required. For example, the instructions aren’t really useful without information about where the data-to-be-added starts and ends [@AR: in the delta, right?]. So our guess of 18–24 bytes was, in fact, rather accurate.

Let’s dig into the 15-byte instructions portion of the delta, starting at offset 0x44 (byte 69).

1$ xxd -s 0x44 war-and-peace-v2.xdelta3
200000044: 5468 6520 456e 642e 1381 c7a6 2c09 00    The End.....,..

Let’s see if we can figure out what’s going on. We know the following:

The new text, The End. is 5468 6520 456e 642e
The number of bytes to COPY is 3 265 324. If we encode this number using VCDIFF’s variable-length encoding scheme we get 81c7 a62c.¹

We also know that VCDIFF works with three queues (aka streams) which we know should look like the following:

Instruction queue: [COPY(size=3265324), ADD(size=8)]
Address queue: [0]
Data queue: ["The End."]

And we can indeed find these things in the delta shown above. Here’s my annotated version:

00000044: 5468 6520 456e 642e  # "The End."
0000004c: 13                   # Code 19 (0x13) COPY
0000004d: 81c7 a62c            # 3 265 324 bytes to be copied
00000051: 09                   # Code 9 (0x09) ADD
00000052: 00                   # Address 00 (used by the COPY)

The order in which VCDIFF packs information about these queues isn’t easy to remember. But we can verify that we’ve got it right by looking at the output of xdelta printhdr. The output includes three lines that reminds us of the order and gives us the lengths for each section:

VCDIFF data section length:   8
VCDIFF inst section length:   6
VCDIFF addr section length:   1

This information is stored in the window header. As expected, it adds up to 15 bytes.

At this point perhaps we could imagine writing a variant of VCDIFF that produced a delta file of about 18 bytes (rather than 83 bytes) by bundling together critical information in the window header with the operations.

But fighting overhead in this manner is misguided. We’ll see in a subsequent post that the overhead diminishes dramatically when you’re dealing with a delta that includes several COPY and ADD operations.

Next Steps

To be determined!

Supplementary Notes

Python script to encode integers using VCDIFF’s variable length scheme

 1def vcdiff_encode_int(value):
 2    if value == 0:
 3        return bytes([0])
 4
 5    # Convert to base 128 digits (7-bit values)
 6    digits = []
 7    while value > 0:
 8        digits.append(value & 0x7F)  # lower 7 bits
 9        value >>= 7
10
11    # Reverse to get most significant digit first
12    digits.reverse()
13
14    # Set most significant bit (MSB) for all bytes except the last one
15    result = []
16    for i, digit in enumerate(digits):
17        if i < len(digits) - 1:
18            result.append(digit | 0x80)
19        else:
20            result.append(digit)
21
22    return bytes(result)
23
24def vcdiff_decode_int(data, offset=0):
25    value = 0
26    bytes_consumed = 0
27
28    while offset + bytes_consumed < len(data):
29        byte = data[offset + bytes_consumed]
30        bytes_consumed += 1
31
32        # Add the 7-bit value to our result
33        value = (value << 7) | (byte & 0x7F)
34
35        # If MSB is not set, this is the last byte
36        if (byte & 0x80) == 0:
37            return value, bytes_consumed
38
39    raise ValueError("Incomplete integer encoding (missing terminating byte)")
40
41# quick test
42encoded = vcdiff_encode_int(3265324)
43import binascii
44assert binascii.hexlify(bytes) == b'81c7a62c'

Reproducing the Adler32 checksum

1$ python3 -c "import zlib; print(hex(zlib.adler32(open('war-and-peace-v2.txt', 'rb').read())))"
20xe4ccfc51

Notes about the VCDIFF window header

The window header essentially tells the decoder: “You’re about to process 27 bytes of delta data that will transform 3,265,324 source bytes into 3,265,332 target bytes, with checksum verification.”

Breakdown from offset 0x31:

Offset   Bytes              Meaning
0x31:    05                 Window indicator (VCD_SOURCE | VCD_ADLER32)
0x32:    81 c7 a6 2c        Source window length: 3,265,324
0x36:    00                 Source window position: 0
0x37:    1b                 Delta encoding length: 27
0x38:    81 c7 a6 34        Target window length: 3,265,332
0x3c:    08                 Data section length: 8
0x3d:    06                 Instruction section length: 6
0x3e:    01                 Address section length: 1
0x3f:    e4 cc fc 51        Adler32 checksum

VBE Decoding Examples

3,265,332 encoded as 81 c7 a6 34:

0x81 → continue, bits 0000001
0xc7 → continue, bits 1000111
0xa6 → continue, bits 0100110
0x34 → stop, bits 0110100
Result: 0000001|1000111|0100110|0110100 = 3,265,332

Disclaimer: These are my views and my views only. They are not the views of my employer, any stakeholders, or civil society organisations. I’m not an expert in anything, I get a lot of things wrong.

“This encoding treats an integer as a number in base 128. Then, each digit in this representation is encoded in the lower seven bits of a byte. Except for the least significant byte, other bytes have their most significant bit turned on to indicate that there are still more digits in the encoding.” Source: https://www.rfc-editor.org/rfc/rfc3284#section-2. ↩︎