Overview of a git packfile: undeltified objects
Note: part 2 in a series (previous post)
Between the packfile’s header and the trailer we have object entries. These come in two varieties: undeltified and deltified. We’ll parse an undeltified object here.
Recall our packfile:
$ git verify-pack -v mydeltas/objects/pack/pack-*.idx
181744139cf7f1d93dc93e49bacaeb414b59b238 blob 3265332 1167272 12
21233513be7ec86c57894180d1f1c5535cd5980d blob 109 116 1167284 1 181744139cf7f1d93dc93e49bacaeb414b59b238
non delta: 1 object
chain length = 1: 1 object
A reminder about the columns:
- non-deltified:
SHA-1 type size size-in-packfile offset-in-packfile
- deltified:
SHA-1 type size size-in-packfile offset-in-packfile depth base-SHA-1
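To make the column layout concrete, here is a minimal Python sketch that splits an object line from git verify-pack -v into named fields. The names PackEntry and parse_verify_pack_line are made up for illustration and are not part of git; the sketch assumes object lines have exactly 5 or 7 whitespace-separated fields, as in the output above.

```python
from typing import NamedTuple, Optional


class PackEntry(NamedTuple):
    """One object line from `git verify-pack -v` (hypothetical helper type)."""
    sha1: str
    obj_type: str
    size: int
    size_in_pack: int
    offset: int
    depth: Optional[int] = None      # only present for deltified objects
    base_sha1: Optional[str] = None  # only present for deltified objects


def parse_verify_pack_line(line: str) -> PackEntry:
    """Parse a 5-column (non-deltified) or 7-column (deltified) object line."""
    fields = line.split()
    if len(fields) not in (5, 7):
        raise ValueError(f"not an object line: {line!r}")
    sha1, obj_type = fields[0], fields[1]
    size, size_in_pack, offset = (int(f) for f in fields[2:5])
    if len(fields) == 5:
        return PackEntry(sha1, obj_type, size, size_in_pack, offset)
    return PackEntry(sha1, obj_type, size, size_in_pack, offset,
                     int(fields[5]), fields[6])


first = parse_verify_pack_line(
    "181744139cf7f1d93dc93e49bacaeb414b59b238 blob 3265332 1167272 12")
second = parse_verify_pack_line(
    "21233513be7ec86c57894180d1f1c5535cd5980d blob 109 116 1167284 1 "
    "181744139cf7f1d93dc93e49bacaeb414b59b238")

# The first entry ends exactly where the second one begins.
print(first.offset + first.size_in_pack == second.offset)  # True
```

The final check shows that size-in-packfile and offset-in-packfile line up: 12 + 1167272 = 1167284, the offset of the second object.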
So we know the first object entry is undeltified and starts at offset 12, i.e. the 13th byte of the file. We know the object entry starts with a few bytes of variable-length metadata, and that MSB == 1 on a metadata byte means another metadata byte follows. Let’s look at the metadata:
$ xxd --binary -s 12 -l 4 -g 1 mydeltas/objects/pack/pack-*.pack
0000000c: 10110100 10110011 10111010 00001100 ....
The first three bytes have their MSB set and the fourth does not, so we have 4 bytes worth of metadata to inspect.
The type is 3 (0b011, the three bits after the MSB of the first byte), which is OBJ_BLOB according to man gitformat-pack and the output of git verify-pack above.
Then we have the 25-bit length (4 + 3*7 bits) of the uncompressed object.
Decoding it in least significant chunk first order:
- 0100 - 4 bits = 4
- 0110011 - 7 bits = 51
- 0111010 - 7 bits = 58
- 0001100 - 7 bits = 12
4 + (51 << (4 + 7*0)) + (58 << (4 + 7*1)) + (12 << (4 + 7*2)) = 4 + 816 + 118784 + 3145728 = 3265332 (matches git verify-pack)
(Aside: “Unpacking Git packfiles” gets this wrong, I believe.)
Here are the precise decoding instructions from man gitformat-pack:
1-byte size extension bit (MSB)
type (next 3 bit)
size0 (lower 4-bit)
n-byte sizeN (as long as MSB is set, each 7-bit)
size0..sizeN form 4+7+7+..+7 bit integer, size0
is the least significant part, and sizeN is the
most significant part.
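Those instructions translate fairly directly into code. Below is a minimal Python sketch of the decoding, checked against the bytes we dumped with xxd; the function name decode_entry_header is hypothetical and this is not how git itself is implemented.

```python
from typing import Tuple


def decode_entry_header(data: bytes, pos: int = 0) -> Tuple[int, int, int]:
    """Return (type, size, header_length) for the entry header at data[pos]."""
    byte = data[pos]
    obj_type = (byte >> 4) & 0b111  # the three bits after the MSB
    size = byte & 0b1111            # size0: the low four bits
    shift = 4                       # the next chunk lands above size0
    length = 1
    while byte & 0x80:              # MSB set: another size byte follows
        byte = data[pos + length]
        size |= (byte & 0x7F) << shift  # each further byte adds 7 bits
        shift += 7
        length += 1
    return obj_type, size, length


# The four metadata bytes of the first object entry (offset 12).
header = bytes([0b10110100, 0b10110011, 0b10111010, 0b00001100])
print(decode_entry_header(header))  # (3, 3265332, 4)
```

The result is type 3 (OBJ_BLOB), size 3265332, and a header length of 4 bytes, which agrees with the manual decoding above.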
Now let’s decompress the object and verify it has the expected length. The compressed data starts at byte offset 16 (the entry’s offset of 12 plus its 4 bytes of metadata), hence skip=16 below. As “Unpacking Git packfiles” notes, we don’t have to know the length of the compressed data because zlib will stop decompressing once it hits the end of the compressed stream. (If we want to know the length of the compressed data, we can derive it from the size-in-packfile column of git verify-pack if we know the length of the metadata.)
$ dd if=mydeltas/objects/pack/pack-455eb389423645dd31c71b560d292bf065657271.pack bs=1 skip=16 2> /dev/null | python3 -c "import sys, zlib; sys.stdout.buffer.write(zlib.decompress(sys.stdin.buffer.read()))" | wc -c
3265332
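If we also want zlib to tell us how long the compressed data was, a decompression object can report the bytes it did not consume. The sketch below shows the idea on synthetic in-memory data rather than the real packfile; inflate_entry is a hypothetical helper name, and the 16 padding bytes merely stand in for the packfile header plus the entry metadata.

```python
import zlib
from typing import Tuple


def inflate_entry(data: bytes, start: int) -> Tuple[bytes, int]:
    """Inflate the zlib stream at data[start:]; return (payload, compressed_length)."""
    d = zlib.decompressobj()
    payload = d.decompress(data[start:])
    payload += d.flush()
    if not d.eof:
        raise ValueError("truncated zlib stream")
    # Whatever zlib left untouched comes after the end of the compressed stream.
    compressed_length = len(data) - start - len(d.unused_data)
    return payload, compressed_length


# Synthetic stand-in for a packfile: 16 bytes of padding, one zlib stream,
# then trailing bytes playing the role of the next entry and the trailer.
blob = b"hello packfile\n" * 100
stream = zlib.compress(blob)
fake_pack = b"\x00" * 16 + stream + b"next entry and trailer"

payload, compressed_length = inflate_entry(fake_pack, 16)
print(len(payload) == len(blob), compressed_length == len(stream))  # True True
```

For the first object in our packfile, and assuming size-in-packfile covers the metadata plus the compressed data, the same figure should come out as 1167272 - 4 = 1167268 bytes.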
That’s the first object. The next object is a deltified object so we will look at that in a later post.
But let’s whet our appetite by looking at the metadata. Grabbing the offset from git verify-pack, we’ll print the bits using xxd:
$ xxd --binary -s 1167284 -l 6 -g 1 mydeltas/objects/pack/pack-*.pack
0011cfb4: 11111101 00000110 00011000 00010111 01000100 00010011 ....D.
So it looks like we have two bytes of metadata (the first byte has its MSB set, the second does not). The type is 0b111 == 7, which indicates that we have an OBJ_REF_DELTA.
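As a small preview, here is a Python sketch that reads those two metadata bytes and then the base object name. It assumes the layout described in man gitformat-pack, where an OBJ_REF_DELTA entry stores the 20-byte base object name right after the type and size bytes; the name read_ref_delta_prefix is hypothetical. The xxd output is consistent with this: the four bytes after the metadata are 18 17 44 13, the start of the base SHA-1 that git verify-pack printed.

```python
from typing import Tuple

OBJ_REF_DELTA = 7


def read_ref_delta_prefix(data: bytes, pos: int = 0) -> Tuple[int, str, int]:
    """Return (size, base_sha1_hex, offset_of_compressed_delta) for a ref-delta entry."""
    byte = data[pos]
    if (byte >> 4) & 0b111 != OBJ_REF_DELTA:
        raise ValueError("not an OBJ_REF_DELTA entry")
    size = byte & 0b1111      # size0: the low four bits
    shift = 4
    pos += 1
    while byte & 0x80:        # MSB set: another size byte follows
        byte = data[pos]
        size |= (byte & 0x7F) << shift
        shift += 7
        pos += 1
    base = data[pos:pos + 20]  # assumed: raw 20-byte SHA-1 of the base object
    if len(base) != 20:
        raise ValueError("truncated base object name")
    return size, base.hex(), pos + 20


# The two metadata bytes from xxd, followed by the base SHA-1 from verify-pack.
entry = bytes([0b11111101, 0b00000110]) + bytes.fromhex(
    "181744139cf7f1d93dc93e49bacaeb414b59b238")
print(read_ref_delta_prefix(entry))
# (109, '181744139cf7f1d93dc93e49bacaeb414b59b238', 22)
```

The decoded size of 109 matches the size column that git verify-pack reports for this object.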
Note to author: Because we only have a single packfile, we expected OBJ_OFS_DELTA, not OBJ_REF_DELTA.
UPDATE: Found the reason why it doesn’t use OBJ_OFS_DELTA in man git pack-objects. I need to pass the option --delta-base-offset!
Modern Git’s porcelain commands (git gc, git repack, git bundle) pass this option by default.
From man git pack-objects:
--delta-base-offset
A packed archive can express the base object of a delta as either a 20-byte object name or as an offset in the stream, but ancient
versions of Git don’t understand the latter. By default, git pack-objects only uses the former format for better compatibility.
This option allows the command to use the latter format for compactness. Depending on the average delta chain length, this option
typically shrinks the resulting packfile by 3-5 per-cent.
Note: Porcelain commands such as git gc (see git-gc(1)), git repack (see git-repack(1)) pass this option by default in modern Git
when they put objects in your repository into pack files. So does git bundle (see git-bundle(1)) when it creates a bundle.