Overview of a git packfile: deltified objects
Note: part 3 in a series (previous post)
Let’s look at the second object in our packfile. This one is deltified. In our previous post we looked at the first object which was a comparatively simple non-deltified object.
Recall our packfile:
1$ git verify-pack -v mydeltas/objects/pack/pack-*.idx
2181744139cf7f1d93dc93e49bacaeb414b59b238 blob 3265332 1167272 12
321233513be7ec86c57894180d1f1c5535cd5980d blob 109 116 1167284 1 181744139cf7f1d93dc93e49bacaeb414b59b238
4non delta: 1 object
5chain length = 1: 1 object
A reminder about the columns:
- non-deltified:
SHA-1 type size size-in-packfile offset-in-packfile
- deltified:
SHA-1 type size size-in-packfile offset-in-packfile depth base-SHA-1
We know from git verify-pack
that the second object is deltified and starts at offset 1167284. If we peek at it we see there are two bytes of metadata:
1$ xxd --binary -s 1167284 -l 6 -g 1 mydeltas/objects/pack/pack-*.pack
20011cfb4: 11111101 00000110 00011000 00010111 01000100 00010011 ....D.
The uncompressed size is 0b1101 + (0b00000110 « 4) == 109.
Since it’s a OBJ_REF_DELTA, the next 20 bytes are the sha1 hash. So the actual delta starts at offset 1167284 + 2 + 20 = 1167306. And it’s compressed.
Let’s try to decompress it and, if it succeeds, we can check that the size is correct. (Recall that zlib automatically detects the end of the compressed data.)
1$ dd if=mydeltas/objects/pack/pack-455eb389423645dd31c71b560d292bf065657271.pack bs=1 skip=1167306 2> /dev/null | python3 -c "import sys, zlib; sys.stdout.buffer.write(zlib.decompress(sys.stdin.buffer.read()))" > uncompressed-delta.dat
uncompressed-delta.dat
is 109 bytes which is what we expect.
So what are the instructions in these 109 bytes?
Let’s go look at gitformat-pack:
The delta data starts with the size of the base object and the size of the object to
be reconstructed. These sizes are encoded using the size encoding from above. The
remainder of the delta data is a sequence of instructions to reconstruct the object
from the base object. If the base object is deltified, it must be converted to
canonical form first. Each instruction appends more and more data to the target
object until it’s complete. There are two supported instructions so far: one for
copying a byte range from the source object and one for inserting new data embedded
in the instruction itself.
Each instruction has variable length. Instruction type is determined by the seventh
bit of the first octet. The following diagrams follow the convention in RFC 1951
(Deflate compressed data format).
Instruction to copy from base object
+----------+---------+---------+---------+---------+-------+-------+-------+
| 1xxxxxxx | offset1 | offset2 | offset3 | offset4 | size1 | size2 | size3 |
+----------+---------+---------+---------+---------+-------+-------+-------+
This is the instruction format to copy a byte range from the source object. It
encodes the offset to copy from and the number of bytes to copy. Offset and size
are in little-endian order.
All offset and size bytes are optional. This is to reduce the instruction size
when encoding small offsets or sizes. The first seven bits in the first octet
determine which of the next seven octets is present. If bit zero is set, offset1
is present. If bit one is set offset2 is present and so on.
Note that a more compact instruction does not change offset and size encoding.
For example, if only offset2 is omitted like below, offset3 still contains bits
16-23. It does not become offset2 and contains bits 8-15 even if it’s right next
to offset1.
+----------+---------+---------+
| 10000101 | offset1 | offset3 |
+----------+---------+---------+
In its most compact form, this instruction only takes up one byte (0x80) with
both offset and size omitted, which will have default values zero. There is
another exception: size zero is automatically converted to 0x10000.
Instruction to add new data
+----------+============+
| 0xxxxxxx | data |
+----------+============+
This is the instruction to construct the target object without the base object.
The following data is appended to the target object. The first seven bits of the
first octet determine the size of data in bytes. The size must be non-zero.
So we do need to break up the delta into a sequence of instructions. This part of the man page spells it out quite cleanly:
Each instruction appends more and more data to the target object until it’s complete. There are two supported instructions so far: one for copying a byte range from the source object and one for inserting new data embedded in the instruction itself.
We know that git made a delta that’s “backwards” as it records war-and-peace-v2 as the base object and war-and-peace-v1 as deltified. Given that this is the case, we expect there only to be one instruction: a copy instruction. Critically, this copy instruction will not include the final eight bytes The End.
.
So let’s go unpack things.
Let’s read the instructions contained in uncompressed-delta.dat
. We expect a single COPY instruction – although this can’t be right because there’s 109 bytes.
And we can’t possibly be using that many bytes for a single instruction.
1$ xxd --binary -g 1 uncompressed-delta.dat
200000000: 10110100 10100110 11000111 00000001 10101100 10100110 ......
300000006: 11000111 00000001 10000000 10000100 00000001 10000100 ......
40000000c: 00000010 10000100 00000011 10000100 00000100 10000100 ......
500000012: 00000101 10000100 00000110 10000100 00000111 10000100 ......
600000018: 00001000 10000100 00001001 10000100 00001010 10000100 ......
70000001e: 00001011 10000100 00001100 10000100 00001101 10000100 ......
800000024: 00001110 10000100 00001111 10000100 00010000 10000100 ......
90000002a: 00010001 10000100 00010010 10000100 00010011 10000100 ......
1000000030: 00010100 10000100 00010101 10000100 00010110 10000100 ......
1100000036: 00010111 10000100 00011000 10000100 00011001 10000100 ......
120000003c: 00011010 10000100 00011011 10000100 00011100 10000100 ......
1300000042: 00011101 10000100 00011110 10000100 00011111 10000100 ......
1400000048: 00100000 10000100 00100001 10000100 00100010 10000100 .!.".
150000004e: 00100011 10000100 00100100 10000100 00100101 10000100 #.$.%.
1600000054: 00100110 10000100 00100111 10000100 00101000 10000100 &.'.(.
170000005a: 00101001 10000100 00101010 10000100 00101011 10000100 ).*.+.
1800000060: 00101100 10000100 00101101 10000100 00101110 10000100 ,.-...
1900000066: 00101111 10000100 00110000 10110100 00110001 00101100 /.0.1,
200000006c: 11010011
It starts with two sizes that are easy to interpret:
The delta data starts with the size of the base object and the size of the object to be reconstructed. These sizes are encoded using the size encoding from above. The remainder of the delta data is a sequence of instructions to reconstruct the object from the base object. …
1# reading off these values from the xxd output
2# size of the base object
3vals = [int(bitstring, 2) & 0x7f for bitstring in "10110100 10100110 11000111 00000001".split()]
4assert sum(val << (i * 7) for i, val in enumerate(vals)) == 3265332
5# size of the object to be reconstructed
6bitstrings = "10101100 10100110 11000111 00000001".split()
7vals = [int(bitstring, 2) & 0x7f for bitstring in bitstrings]
8assert sum(val << (i * 7) for i, val in enumerate(vals)) == 3265324
All checks out so far. Now we reach the instructions. Instead of one COPY instruction we see 50 COPY instructions. I’ve formatted the instructions nicely, one per line:
10000000
10000100 00000001
10000100 00000010
10000100 00000011
10000100 00000100
10000100 00000101
10000100 00000110
10000100 00000111
10000100 00001000
10000100 00001001
10000100 00001010
10000100 00001011
10000100 00001100
10000100 00001101
10000100 00001110
10000100 00001111
10000100 00010000
10000100 00010001
10000100 00010010
10000100 00010011
10000100 00010100
10000100 00010101
10000100 00010110
10000100 00010111
10000100 00011000
10000100 00011001
10000100 00011010
10000100 00011011
10000100 00011100
10000100 00011101
10000100 00011110
10000100 00011111
10000100 00100000
10000100 00100001
10000100 00100010
10000100 00100011
10000100 00100100
10000100 00100101
10000100 00100110
10000100 00100111
10000100 00101000
10000100 00101001
10000100 00101010
10000100 00101011
10000100 00101100
10000100 00101101
10000100 00101110
10000100 00101111
10000100 00110000
10110100 00110001 00101100 11010011
These aren’t too difficult to decipher once you understand the encoding:
The 1st instruction is 10000000: COPY starting from offset 0 and copying 0x10000 == 65536 bytes.
The 2nd instruction is 10000100 00000001: COPY from offset 0x00 0x00 0x01 == 65536 and copying 0x10000 == 65536 bytes
The 3rd instruction is 10000100 00000010: COPY from offset 0x00 0x00 0x02 == 131072 and copying 0x10000 == 65536 bytes
The 3rd instruction is 10000100 00000011: COPY from offset 0x00 0x00 0x03 and copying 0x10000 == 65536 bytes
...
The 49th instruction is 10000100 00110000: COPY from offset 0x00 0x00 0x30 and copying 0x10000 == 65536 bytes
The 50th instruction is 10000100 00110000: COPY from offset 0x00 0x00 0x31 and copying 0b00101100 0b11010011 == 44 + (211 << 8) == 54060 bytes
Let’s count bytes to make sure we end up in the right place:
# first instruction, 48 identical instructions, final instruction
# 3265324 byte count of war-and-peace-v1.txt
assert 0x10000 + 48 * 0x10000 + 44 + (211 << 8) == 3265324
Basically, the packfile is telling us to copy from the base object 64K (2¹⁶) at a time.
I assume git is doing this to avoid having to read the base object into memory. After all, it could be a truly massive file.
So we’re done. We’re at the end of the packfile.
It would have been nice to see an ADD instruction but perhaps we can see that another time.