Out of Bounds

Overview of a git packfile: undeltified objects

Note: part 2 in a series (previous post)

Between the packfile’s header and the trailer we have object entries. These come in two varieties: undeltified and deltified. We’ll parse an undeltified object here.

Recall our packfile:

$ git verify-pack -v mydeltas/objects/pack/pack-*.idx
181744139cf7f1d93dc93e49bacaeb414b59b238 blob   3265332 1167272 12
21233513be7ec86c57894180d1f1c5535cd5980d blob   109 116 1167284 1 181744139cf7f1d93dc93e49bacaeb414b59b238
non delta: 1 object
chain length = 1: 1 object

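A reminder about the columns: object name, type, size, size in the packfile, and offset in the packfile. Deltified objects add two more: the delta depth and the base object’s name.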

So we know the first object entry is undeltified and starts at offset 12 (that is, the 13th byte). We also know the object entry starts with some variable-length metadata, in which a set MSB indicates that another byte follows. Let’s look at the metadata:

$ xxd --binary -s 12 -l 4 -g 1 mydeltas/objects/pack/pack-*.pack
0000000c: 10110100 10110011 10111010 00001100                    ....

The first three bytes have their MSB set and the fourth does not, so we have 4 bytes’ worth of metadata to inspect.

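The type is 3 (0b011, the three bits after the MSB of the first byte), which is OBJ_BLOB according to man gitformat-pack and the output of git verify-pack above. Then we have the 25-bit (4 + 3*7 bits) length of the uncompressed object. Decoding it in least-significant-chunk-first order: 0b0100 + (0b0110011 << 4) + (0b0111010 << 11) + (0b0001100 << 18) = 4 + 816 + 118784 + 3145728 = 3265332, which matches the size reported by git verify-pack.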

(Aside: “Unpacking Git packfiles” gets this wrong, I believe.)

Here are the precise decoding instructions from man gitformat-pack:

1-byte size extension bit (MSB)
       type (next 3 bit)
       size0 (lower 4-bit)
n-byte sizeN (as long as MSB is set, each 7-bit)
        size0..sizeN form 4+7+7+..+7 bit integer, size0
        is the least significant part, and sizeN is the
        most significant part.
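The rule quoted above can be sketched in a few lines of Python. This is only an illustration of the layout described in the man page, not code taken from Git; the function name decode_entry_header and its return shape are my own choices.

```python
def decode_entry_header(data, pos=0):
    """Decode a packfile object entry header starting at data[pos].

    Returns (type, size, next_pos), where next_pos is the index of the
    first byte after the header.
    """
    byte = data[pos]
    pos += 1
    obj_type = (byte >> 4) & 0b111   # the 3 bits after the MSB of the first byte
    size = byte & 0b1111             # size0: the low 4 bits of the first byte
    shift = 4
    while byte & 0x80:               # MSB set: another size byte follows
        byte = data[pos]
        pos += 1
        size |= (byte & 0x7F) << shift   # each later byte contributes 7 bits
        shift += 7
    return obj_type, size, pos


# The four bytes printed by xxd at offset 12.
header = bytes([0b10110100, 0b10110011, 0b10111010, 0b00001100])
print(decode_entry_header(header))  # (3, 3265332, 4)
```

The returned position (4) says the header is four bytes long, so the compressed data starts at offset 12 + 4 = 16.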

Now let’s uncompress the object and verify it has the expected length. The compressed data starts right after the 4 bytes of metadata, at offset 12 + 4 = 16. As “Unpacking Git packfiles” notes, we don’t have to know the length of the compressed data because zlib will stop decompressing once it hits the end of the compressed stream. (If we want to know the length of the compressed data, we can derive it from git verify-pack: it is the size-in-packfile column minus the length of the metadata.)

$ dd if=mydeltas/objects/pack/pack-455eb389423645dd31c71b560d292bf065657271.pack bs=1 skip=16 2> /dev/null | python3 -c "import sys, zlib; sys.stdout.buffer.write(zlib.decompress(sys.stdin.buffer.read()))" | wc -c
3265332
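If we also want the length of the compressed data without doing arithmetic on the git verify-pack output, Python’s zlib.decompressobj exposes the bytes left over after the end of the stream. Here is a minimal sketch; the helper name inflate_at is made up for illustration, and the buffer is a synthetic stand-in rather than the real packfile.

```python
import zlib


def inflate_at(buf, start):
    """Inflate the zlib stream beginning at buf[start].

    Returns (data, compressed_length). Bytes after the end of the stream
    are not decompressed; they end up in unused_data.
    """
    d = zlib.decompressobj()
    data = d.decompress(buf[start:])
    if not d.eof:
        raise ValueError("truncated zlib stream")
    consumed = len(buf) - start - len(d.unused_data)
    return data, consumed


# Synthetic stand-in for a packfile: 16 filler bytes, one zlib stream,
# then trailing bytes playing the role of the next object entry.
payload = b"hello packfile" * 10
stream = zlib.compress(payload)
buf = b"\x00" * 16 + stream + b"next object entry"

data, consumed = inflate_at(buf, 16)
print(len(data), consumed == len(stream))
```

Run against the real packfile with start=16, I’d expect the compressed length to come out as 1167272 - 4 = 1167268, that is, the size-in-packfile from git verify-pack minus the 4 bytes of metadata.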

That’s the first object. The next object is a deltified object so we will look at that in a later post.

But let’s whet our appetite by looking at the metadata. Grabbing the offset from git verify-pack, we’ll print the bits using xxd:

$ xxd --binary -s 1167284 -l 6 -g 1 mydeltas/objects/pack/pack-*.pack
0011cfb4: 11111101 00000110 00011000 00010111 01000100 00010011  ....D.

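So it looks like we have two bytes of metadata: the first byte has its MSB set and the second does not. The type is 0b111 == 7, which indicates that we have an OBJ_REF_DELTA. The size decodes to 0b1101 + (0b0000110 << 4) = 13 + 96 = 109, matching git verify-pack, and the bytes that follow (0x18 0x17 0x44 0x13) are the start of the base object’s name, 18174413...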
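As a quick sanity check, here is a small Python sketch that picks those six bytes apart. It hardcodes the two-byte header we just observed, so it is not a general parser, and the variable names are mine.

```python
# The six bytes printed by xxd at offset 1167284.
entry = bytes([0b11111101, 0b00000110, 0b00011000,
               0b00010111, 0b01000100, 0b00010011])

first, second = entry[0], entry[1]
obj_type = (first >> 4) & 0b111           # 7 == OBJ_REF_DELTA
# The second byte has its MSB clear, so the header ends after two bytes.
size = (first & 0b1111) | (second << 4)
# For OBJ_REF_DELTA the header is followed by the 20-byte base object name;
# we only printed its first 4 bytes.
base_prefix = entry[2:].hex()

print(obj_type, size, base_prefix)  # 7 109 18174413
```

The prefix 18174413 agrees with the base object 181744139cf7f1d93dc93e49bacaeb414b59b238 in the last column of the git verify-pack output.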

Note to author: Because we only have a single packfile we expected OBJ_OFS_DELTA not OBJ_REF_DELTA.


UPDATE: Found the reason why it doesn’t use OBJ_OFS_DELTA in man git-pack-objects. I need to pass the option --delta-base-offset! In modern Git, porcelain commands such as git gc and git repack pass this option by default.

From man git-pack-objects:

--delta-base-offset
    A packed archive can express the base object of a delta as either a 20-byte object name or as an offset in the stream, but ancient
    versions of Git don’t understand the latter. By default, git pack-objects only uses the former format for better compatibility.
    This option allows the command to use the latter format for compactness. Depending on the average delta chain length, this option
    typically shrinks the resulting packfile by 3-5 per-cent.

    Note: Porcelain commands such as git gc (see git-gc(1)), git repack (see git-repack(1)) pass this option by default in modern Git
    when they put objects in your repository into pack files. So does git bundle (see git-bundle(1)) when it creates a bundle.