Out of Bounds

Overview of a git packfile: loose ends

Note: part 4 in a series (previous post)

We haven’t seen an example of the “ADD” instruction. Let’s look at one.

For this example we have a packfile with four objects but we will only be looking at two of them. These are the files we are going to examine:

$ git hash-object -w war-and-peace-v2.txt
181744139cf7f1d93dc93e49bacaeb414b59b238
$ git hash-object -w war-and-peace-v4.txt
e5e07aa754ae469e220b4dde333790bea5bfceb2

And here’s the packfile:

181744139cf7f1d93dc93e49bacaeb414b59b238 blob   3265332 1167272 12
21233513be7ec86c57894180d1f1c5535cd5980d blob   109 116 1167284 1 181744139cf7f1d93dc93e49bacaeb414b59b238
e5e07aa754ae469e220b4dde333790bea5bfceb2 blob   263 198 1167400 1 181744139cf7f1d93dc93e49bacaeb414b59b238
569f86320f0ffc487708a197bebf8ca5f44283bb blob   214 133 1167598 2 e5e07aa754ae469e220b4dde333790bea5bfceb2
non delta: 1 object
chain length = 1: 2 objects
chain length = 2: 1 object
mydeltas/objects/pack/pack-06a4fcb3a1e19394d828ee8b5b52e854bd184dcc.pack: ok

We can see that war-and-peace-v4.txt (e5e07a…) is stored in deltified form with base object war-and-peace-v2.txt (181744…).
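To make the columns explicit, here is a minimal Python sketch that splits one line of the listing above. The column order (object name, type, size, size in packfile, offset in packfile, then depth and base object name for deltified entries) is the one documented for git verify-pack -v; the PackEntry type and parse_verify_pack_line helper are names made up for this illustration.

```python
from typing import NamedTuple, Optional


class PackEntry(NamedTuple):
    sha: str
    kind: str
    size: int                # for a deltified entry this is the size of the delta
    size_in_pack: int        # bytes the entry occupies in the packfile
    offset: int              # where the entry starts in the packfile
    depth: Optional[int]     # delta chain depth, None for non-delta entries
    base: Optional[str]      # base object name, None for non-delta entries


def parse_verify_pack_line(line: str) -> PackEntry:
    """Split one object line of the verify-pack listing (hypothetical helper)."""
    fields = line.split()
    sha, kind = fields[0], fields[1]
    size, size_in_pack, offset = (int(f) for f in fields[2:5])
    if len(fields) == 7:  # deltified: depth and base object name follow
        return PackEntry(sha, kind, size, size_in_pack, offset,
                         int(fields[5]), fields[6])
    return PackEntry(sha, kind, size, size_in_pack, offset, None, None)


line = ("e5e07aa754ae469e220b4dde333790bea5bfceb2 blob   263 198 1167400 1 "
        "181744139cf7f1d93dc93e49bacaeb414b59b238")
entry = parse_verify_pack_line(line)
print(entry.offset, entry.depth, entry.base[:6])
```

A small sanity check falls out of this: the entry's offset plus its size in the packfile (1167400 + 198) gives 1167598, which is exactly where the next object in the listing starts.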

The contents of war-and-peace-v4.txt are identical to the contents of war-and-peace-v2.txt except for the following:

  1. the first chapter has been removed, and
  2. the line “Lorem ipsum dolorem sint” has been added after line 10.

Let’s look at the delta. We know from the output of git verify-pack that the delta starts at offset 1167400.

$ xxd --binary -s 1167400 -l 6 -g 1 mydeltas/objects/pack/pack-*.pack
0011d028: 11110111 00010000 00011000 00010111 01000100 00010011  ....D.

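Recall we’ve got the type followed by the size. Here the first two bytes (11110111 00010000) encode type 7 (OBJ_REF_DELTA) and size 263. The 20 bytes after them are the name of the base object; we can see it begin with 18 17 44 13. The compressed delta therefore starts 22 bytes in, at offset 1167422.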
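As a quick check of that header, here is a small Python sketch that decodes the type and size from the first bytes of the entry. It follows the encoding described in man gitformat-pack (three type bits and four size bits in the first byte, then seven size bits per continuation byte). The function name parse_object_header is made up for this post, and the 20-byte base name assumes a SHA-1 repository like this one.

```python
def parse_object_header(data: bytes):
    """Return (type, size, header_length) for a packed object header."""
    byte = data[0]
    obj_type = (byte >> 4) & 0b111   # bits 6..4 of the first byte
    size = byte & 0b1111             # low 4 bits start the size
    shift = 4
    used = 1
    while byte & 0x80:               # MSB set: another size byte follows
        byte = data[used]
        size |= (byte & 0x7F) << shift
        shift += 7
        used += 1
    return obj_type, size, used


OBJ_REF_DELTA = 7

# The first two bytes from the xxd output above.
header = bytes([0b11110111, 0b00010000])
obj_type, size, used = parse_object_header(header)
print(obj_type == OBJ_REF_DELTA, size, used)   # True 263 2

# A REF_DELTA entry stores the 20-byte base object name right after the header,
# so the zlib stream starts at: entry offset + header length + 20.
zlib_start = 1167400 + used + 20
print(zlib_start)                              # 1167422
```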

As we did before, let’s uncompress the compressed sequence of instructions.

$ dd if=mydeltas/objects/pack/pack-06a4fcb3a1e19394d828ee8b5b52e854bd184dcc.pack bs=1 skip=1167422 2> /dev/null | python3 -c "import sys, zlib; sys.stdout.buffer.write(zlib.decompress(sys.stdin.buffer.read()))" > uncompressed-delta.dat
$ wc -c uncompressed-delta.dat
263 uncompressed-delta.dat

Looks good! Now let’s look at the first few bytes of the 263-byte delta.

$ xxd --binary -l 66 uncompressed-delta.dat
00000000: 10110100 10100110 11000111 00000001 11011011 11001010  ......
00000006: 11000110 00000001 00001101 00100000 01000011 01001000  ... CH
0000000c: 01000001 01010000 01010100 01000101 01010010 00100000  APTER
00000012: 01001001 01001001 00001010 00001010 10010111 11100001  II....
00000018: 01000000 00100110 00010011 10010111 11101110 00000100  @&....
0000001e: 00011101 00001100 10110011 00011111 00101110 11110100  ......
00000024: 00000110 00011000 01001100 01101111 01110010 01100101  ..Lore
0000002a: 01101101 00100000 01101001 01110000 01110011 01110101  m ipsu
00000030: 01101101 00100000 01100100 01101111 01101100 01101111  m dolo
00000036: 01110010 01100101 01101101 00100000 01110011 01101001  rem si
0000003c: 01101110 01110100 10000011 00010001 00110101 10000111  nt..5.

As expected, we see “Lorem ipsum dolorem sint”. But we also see something unexpected: the literal text “ CHAPTER II\n\n” (13 bytes, counting the leading space and the two newlines). This should have been copied from the base object!

Let’s parse it as we did in the previous post.

Recall from man gitformat-pack,

The delta data starts with the size of the base object and the size of the object to be reconstructed. These sizes are encoded using the size encoding from above.
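To see what those two sizes are for our delta, here is a rough Python sketch of that size encoding: each byte contributes its low seven bits, least significant group first, and a set high bit means another byte follows. The helper name read_size is invented for this example.

```python
def read_size(data: bytes, pos: int):
    """Decode one variable-length size starting at pos; return (size, next_pos)."""
    size = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        size |= (byte & 0x7F) << shift   # low 7 bits, least significant first
        shift += 7
        if not byte & 0x80:              # MSB clear: this was the last byte
            return size, pos


# The first 8 bytes of uncompressed-delta.dat from the xxd output above.
delta_start = bytes([
    0b10110100, 0b10100110, 0b11000111, 0b00000001,
    0b11011011, 0b11001010, 0b11000110, 0b00000001,
])
base_size, pos = read_size(delta_start, 0)
result_size, pos = read_size(delta_start, pos)
print(base_size, result_size, pos)       # 3265332 3253595 8
```

The first number matches the 3265332 bytes that git verify-pack reported for war-and-peace-v2.txt, and both sizes together take up 8 bytes.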

Now skipping those 8 bytes we can start identifying the instructions:

# ADD 13 bytes
00001101
00100000 01000011 01001000 01000001 01010000 01010100 01000101 01010010 00100000 01001001 01001001 00001010 00001010

# COPY
10010111 11100001 01000000 00100110 00010011

# COPY
10010111 11101110 00000100 00011101 00001100

# COPY
10110011 00011111 00101110 11110100 00000110

# ADD 24 bytes (grouped by eight bytes)
00011000
01001100 01101111 01110010 01100101 01101101 00100000 01101001 01110000
01110011 01110101 01101101 00100000 01100100 01101111 01101100 01101111
01110010 01100101 01101101 00100000 01110011 01101001 01101110 01110100

# COPY
10000011 00010001 00110101
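The same walk can be done mechanically. Below is a Python sketch of an instruction parser that follows the layout in man gitformat-pack: a set high bit means COPY, with the low seven bits saying which offset and size bytes are present, while a clear high bit means ADD, with the low seven bits giving the length. The function name parse_instructions is made up here, and the input is just the 57 instruction bytes listed above, not the whole delta.

```python
def parse_instructions(data: bytes):
    """Decode delta instructions into ("ADD", bytes) / ("COPY", offset, size)."""
    pos = 0
    out = []
    while pos < len(data):
        cmd = data[pos]
        pos += 1
        if cmd & 0x80:                        # COPY from the base object
            offset = 0
            size = 0
            for i in range(4):                # bits 0..3: offset bytes present
                if cmd & (1 << i):
                    offset |= data[pos] << (8 * i)
                    pos += 1
            for i in range(3):                # bits 4..6: size bytes present
                if cmd & (0x10 << i):
                    size |= data[pos] << (8 * i)
                    pos += 1
            if size == 0:                     # size 0 means 0x10000
                size = 0x10000
            out.append(("COPY", offset, size))
        elif cmd:                             # ADD: the next cmd bytes are literal
            out.append(("ADD", data[pos:pos + cmd]))
            pos += cmd
        else:
            raise ValueError("instruction byte 0 is reserved")
    return out


# Bytes 8..64 of uncompressed-delta.dat, i.e. the instructions listed above.
instructions = (
    b"\x0d CHAPTER II\n\n"
    b"\x97\xe1\x40\x26\x13"
    b"\x97\xee\x04\x1d\x0c"
    b"\xb3\x1f\x2e\xf4\x06"
    b"\x18Lorem ipsum dolorem sint"
    b"\x83\x11\x35"
)
for ins in parse_instructions(instructions):
    print(ins)
```

This prints the two ADDs with their literal text and four COPYs of 19, 12, 1780 and 65536 bytes (the last one carries no size bytes, so it falls back to 0x10000).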

I’m not sure why git emits an ADD for “ CHAPTER II\n\n” here instead of copying those bytes from the base object.

But we can wrap things up since we have achieved what we set out to do. We’ve seen two ADD/INSERT instructions. They’re far easier to interpret than the COPY instruction.

One observation that seems useful to keep hold of: a single ADD instruction can add at most 127 bytes, because its length lives in the low 7 bits of the instruction byte. Put another way, if you’re adding many bytes to a base object you’ll incur one byte of overhead for every 127 bytes you add.
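To put a number on that overhead, here is a small Python sketch that chops a payload into ADD instructions of at most 127 bytes each. It only illustrates the arithmetic; encode_adds is a hypothetical helper, not a function taken from git.

```python
MAX_ADD = 127  # largest length that fits in the low 7 bits of an ADD byte


def encode_adds(payload: bytes) -> bytes:
    """Encode payload as consecutive ADD instructions (illustration only)."""
    out = bytearray()
    for i in range(0, len(payload), MAX_ADD):
        chunk = payload[i:i + MAX_ADD]
        out.append(len(chunk))   # instruction byte: MSB clear, length in low 7 bits
        out += chunk             # followed by the literal bytes
    return bytes(out)


encoded = encode_adds(b"x" * 300)
print(len(encoded) - 300)        # 3 instruction bytes for chunks of 127 + 127 + 46
```

Running it on “Lorem ipsum dolorem sint” gives back the 00011000 instruction byte followed by the 24 bytes of text, which is what we saw in the dump.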