Out of Bounds

Overview of a git packfile

We know that a git packfile starts with a 12-byte header containing a magic number, a version number, and a count of the objects in the packfile. These values are easy to read off the bytes:

$ xxd -l 12 -g 1 mydeltas/objects/pack/pack-455eb389423645dd31c71b560d292bf065657271.pack
00000000: 50 41 43 4b 00 00 00 02 00 00 00 02              PACK........
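
The same three fields can also be read programmatically. Here is a minimal Python sketch, assuming only the 12 header bytes shown in the hex dump; the function name parse_pack_header is made up for illustration. The version and object count are 4-byte integers in network (big-endian) byte order, so a single struct format string covers the whole header.

```python
import struct

def parse_pack_header(header: bytes) -> tuple:
    """Split a 12-byte packfile header into (magic, version, object count)."""
    if len(header) != 12:
        raise ValueError("a packfile header is exactly 12 bytes")
    # ">" = big-endian; "4s" = 4 raw bytes; "I" = unsigned 32-bit integer
    magic, version, count = struct.unpack(">4sII", header)
    if magic != b"PACK":
        raise ValueError(f"unexpected magic number: {magic!r}")
    return magic, version, count

# The 12 bytes from the xxd output: "PACK", version 2, 2 objects.
header = bytes.fromhex("5041434b" "00000002" "00000002")
print(parse_pack_header(header))  # (b'PACK', 2, 2)
```

For our packfile this gives version 2 and an object count of 2.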

At the end of the packfile there’s a 20-byte SHA-1 checksum of everything that precedes it.

Between the header and the trailer there’s the meat of the packfile: a series of object entries, each of which takes one of two forms:

  1. Undeltified representation: (type, length, compressed data)
  2. Deltified representation: (type, length, negative relative offset to the base object, compressed delta data)

(There’s another variant of the deltified representation that’s relevant if you have multiple packfiles; we ignore this variant because we assume a single packfile.)


Decoding the variable-length size and offset encodings will wait for another day. For today, let’s end by verifying that the packfile has the right trailer. The SHA-1 of everything except the final 20 bytes should equal those final 20 bytes:

$ head -c -20 mydeltas/objects/pack/pack-*.pack | sha1sum
455eb389423645dd31c71b560d292bf065657271  -
$ tail -c 20 mydeltas/objects/pack/pack-*.pack | xxd -p
455eb389423645dd31c71b560d292bf065657271
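
The same check can be written without the shell. Below is a minimal Python sketch, assuming the whole packfile fits in memory. The helper name trailer_matches is hypothetical, and the demonstration runs on a tiny hand-built byte string (a header claiming zero objects) rather than on the packfile from this post.

```python
import hashlib

def trailer_matches(pack: bytes) -> bool:
    """Return True if the last 20 bytes are the SHA-1 of everything before them."""
    body, trailer = pack[:-20], pack[-20:]
    return hashlib.sha1(body).digest() == trailer

# Hand-built stand-in for a packfile (an assumption for illustration):
# a 12-byte header with version 2 and zero objects, followed by its SHA-1.
body = b"PACK" + (2).to_bytes(4, "big") + (0).to_bytes(4, "big")
fake_pack = body + hashlib.sha1(body).digest()

# Flip every bit of the final byte to simulate a damaged trailer.
corrupted = fake_pack[:-1] + bytes([fake_pack[-1] ^ 0xFF])

print(trailer_matches(fake_pack))  # True
print(trailer_matches(corrupted))  # False
```

To run it against a real packfile, read the file in binary mode and pass the bytes to trailer_matches.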

Appendix: Further Reading

From gitformat-pack

pack-*.pack files have the following format:
       •   A header appears at the beginning and consists of the following:

               4-byte signature:
                   The signature is: {'P', 'A', 'C', 'K'}

               4-byte version number (network byte order):
                   Git currently accepts version number 2 or 3 but
                   generates version 2 only.

               4-byte number of objects contained in the pack (network byte order)

               Observation: we cannot have more than 4G versions ;-) and
               more than 4G objects in a pack.

       •   The header is followed by a number of object entries, each of which looks like this:

               (undeltified representation)
               n-byte type and length (3-bit type, (n-1)*7+4-bit length)
               compressed data

               (deltified representation)
               n-byte type and length (3-bit type, (n-1)*7+4-bit length)
               base object name if OBJ_REF_DELTA or a negative relative
                   offset from the delta object's position in the pack if this
                   is an OBJ_OFS_DELTA object
               compressed delta data

               Observation: the length of each object is encoded in a variable
               length format and is not constrained to 32-bit or anything.

       •   The trailer records a pack checksum of all of the above.
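
The “n-byte type and length” field is the least obvious part of this description. Here is a rough Python sketch of one way to decode it, assuming the layout that gitformat-pack describes elsewhere. The first byte holds a continuation bit, the 3-bit type, and the low 4 bits of the length. Each following byte holds a continuation bit and 7 more length bits, least significant group first. The name decode_type_and_length is made up, and the sample bytes are hand-encoded for illustration, not read out of our packfile.

```python
def decode_type_and_length(data: bytes, pos: int = 0):
    """Decode an object entry's type-and-length field starting at data[pos].

    Returns (type, length, position of the first byte after the field).
    """
    byte = data[pos]
    pos += 1
    obj_type = (byte >> 4) & 0x07   # bits 6..4: 3-bit object type
    length = byte & 0x0F            # bits 3..0: low 4 bits of the length
    shift = 4
    while byte & 0x80:              # bit 7 set: another byte follows
        byte = data[pos]
        pos += 1
        length |= (byte & 0x7F) << shift  # 7 more length bits per byte
        shift += 7
    return obj_type, length, pos

# Hand-encoded example (an assumption, not bytes taken from the packfile):
# type 3 (blob) with length 3265332.
sample = bytes([0xB4, 0xB3, 0xBA, 0x0C])
print(decode_type_and_length(sample))  # (3, 3265332, 4)
```

With n bytes this yields 4 + 7(n-1) length bits, which matches the “(n-1)*7+4-bit length” in the quoted text.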

From “Unpacking Git packfiles”

From Unpacking Git packfiles by Aditya Mukerjee:

The packfile starts with 12 bytes of meta-information and ends with a 20-byte checksum, all of which we can use to verify our results. The first four bytes spell “PACK” and the next four bytes contain the version number – in our case, [0, 0, 0, 2]. The next four bytes tell us the number of objects contained in the pack. Therefore, a single packfile cannot contain more than 2^32 objects, although a single repository may contain multiple packfiles. The final 20 bytes of the file are a SHA-1 checksum of all the previous data in the file.

Reminder about our packfile

$ git verify-pack -v mydeltas/objects/pack/pack-*.idx
181744139cf7f1d93dc93e49bacaeb414b59b238 blob   3265332 1167272 12
21233513be7ec86c57894180d1f1c5535cd5980d blob   109 116 1167284 1 181744139cf7f1d93dc93e49bacaeb414b59b238
non delta: 1 object
chain length = 1: 1 object

Columns are:

* non-deltified: SHA-1 type size size-in-packfile offset-in-packfile
* deltified: SHA-1 type size size-in-packfile offset-in-packfile depth base-SHA-1
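
To make the column layout concrete, here is a small Python sketch that splits the two object lines into named fields and checks how the entries sit in the file. The function parse_verify_pack_line and the dictionary keys are names chosen for illustration. It assumes the lines are whitespace-separated exactly as printed above.

```python
def parse_verify_pack_line(line: str) -> dict:
    """Turn one object line of `git verify-pack -v` output into a dict."""
    fields = line.split()
    entry = {
        "sha1": fields[0],
        "type": fields[1],
        "size": int(fields[2]),
        "size_in_packfile": int(fields[3]),
        "offset_in_packfile": int(fields[4]),
    }
    if len(fields) == 7:  # deltified entries carry two extra columns
        entry["depth"] = int(fields[5])
        entry["base_sha1"] = fields[6]
    return entry

lines = [
    "181744139cf7f1d93dc93e49bacaeb414b59b238 blob   3265332 1167272 12",
    "21233513be7ec86c57894180d1f1c5535cd5980d blob   109 116 1167284 1 "
    "181744139cf7f1d93dc93e49bacaeb414b59b238",
]
base, delta = (parse_verify_pack_line(line) for line in lines)

# The second entry starts exactly where the first one ends: 12 + 1167272.
end_of_base = base["offset_in_packfile"] + base["size_in_packfile"]
print(end_of_base == delta["offset_in_packfile"])  # True
# The delta's base is the first object.
print(delta["base_sha1"] == base["sha1"])          # True
```

The first object starts at offset 12, immediately after the header, and the deltified object follows it with no gap.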