What is the file format of a git commit object data structure?

Create a minimal example and reverse engineer the format

Create a simple repository, and before any packfiles are created (git gc, git config gc.auto, git-prune-packed ...), unpack a commit object with one of the methods from: How to DEFLATE with a command line tool to extract a git object?

export GIT_AUTHOR_DATE="1970-01-01T00:00:00+0000"
export GIT_AUTHOR_EMAIL="[email protected]"
export GIT_AUTHOR_NAME="Author Name" \
export GIT_COMMITTER_DATE="2000-01-01T00:00:00+0000" \
export GIT_COMMITTER_EMAIL="[email protected]" \
export GIT_COMMITTER_NAME="Committer Name" \

git init

# First commit.
echo
touch a
git add a
git commit -m 'First message'
python -c "import zlib,sys;sys.stdout.write(zlib.decompress(sys.stdin.read()))" \
  <.git/objects/45/3a2378ba0eb310df8741aa26d1c861ac4c512f | hd

# Second commit.
echo
touch b
git add b
git commit -m 'Second message'
python -c "import zlib,sys;sys.stdout.write(zlib.decompress(sys.stdin.read()))" \
  <.git/objects/74/8e6f7e22cac87acec8c26ee690b4ff0388cbf5 | hd

The output is:

Initialized empty Git repository in /home/ciro/test/git/.git/

[master (root-commit) 453a237] First message
 Author: Author Name <[email protected]>
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 a
00000000  63 6f 6d 6d 69 74 20 31  37 34 00 74 72 65 65 20  |commit 174.tree |
00000010  34 39 36 64 36 34 32 38  62 39 63 66 39 32 39 38  |496d6428b9cf9298|
00000020  31 64 63 39 34 39 35 32  31 31 65 36 65 31 31 32  |1dc9495211e6e112|
00000030  30 66 62 36 66 32 62 61  0a 61 75 74 68 6f 72 20  |0fb6f2ba.author |
00000040  41 75 74 68 6f 72 20 4e  61 6d 65 20 3c 61 75 74  |Author Name <aut|
00000050  68 6f 72 40 65 78 61 6d  70 6c 65 2e 63 6f 6d 3e  |[email protected]>|
00000060  20 30 20 2b 30 30 30 30  0a 63 6f 6d 6d 69 74 74  | 0 +0000.committ|
00000070  65 72 20 43 6f 6d 6d 69  74 74 65 72 20 4e 61 6d  |er Committer Nam|
00000080  65 20 3c 63 6f 6d 6d 69  74 74 65 72 40 65 78 61  |e <committer@exa|
00000090  6d 70 6c 65 2e 63 6f 6d  3e 20 39 34 36 36 38 34  |mple.com> 946684|
000000a0  38 30 30 20 2b 30 30 30  30 0a 0a 46 69 72 73 74  |800 +0000..First|
000000b0  20 6d 65 73 73 61 67 65  0a                       | message.|
000000ba

[master 748e6f7] Second message
 Author: Author Name <[email protected]>
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 b
00000000  63 6f 6d 6d 69 74 20 32  32 33 00 74 72 65 65 20  |commit 223.tree |
00000010  32 39 36 65 35 36 30 32  33 63 64 63 30 33 34 64  |296e56023cdc034d|
00000020  32 37 33 35 66 65 65 38  63 30 64 38 35 61 36 35  |2735fee8c0d85a65|
00000030  39 64 31 62 30 37 66 34  0a 70 61 72 65 6e 74 20  |9d1b07f4.parent |
00000040  34 35 33 61 32 33 37 38  62 61 30 65 62 33 31 30  |453a2378ba0eb310|
00000050  64 66 38 37 34 31 61 61  32 36 64 31 63 38 36 31  |df8741aa26d1c861|
00000060  61 63 34 63 35 31 32 66  0a 61 75 74 68 6f 72 20  |ac4c512f.author |
00000070  41 75 74 68 6f 72 20 4e  61 6d 65 20 3c 61 75 74  |Author Name <aut|
00000080  68 6f 72 40 65 78 61 6d  70 6c 65 2e 63 6f 6d 3e  |[email protected]>|
00000090  20 30 20 2b 30 30 30 30  0a 63 6f 6d 6d 69 74 74  | 0 +0000.committ|
000000a0  65 72 20 43 6f 6d 6d 69  74 74 65 72 20 4e 61 6d  |er Committer Nam|
000000b0  65 20 3c 63 6f 6d 6d 69  74 74 65 72 40 65 78 61  |e <committer@exa|
000000c0  6d 70 6c 65 2e 63 6f 6d  3e 20 39 34 36 36 38 34  |mple.com> 946684|
000000d0  38 30 30 20 2b 30 30 30  30 0a 0a 53 65 63 6f 6e  |800 +0000..Secon|
000000e0  64 20 6d 65 73 73 61 67  65 0a                    |d message.|
000000eb

Then we deduce that the format is as follows:

Top level:
```
commit {size}\0{content}
```
where {size} is the number of bytes in {content}.

This follows the same pattern for all object types.
{content}:
```
tree {tree_sha}
{parents}
author {author_name} <{author_email}> {author_date_seconds} {author_date_timezone}
committer {committer_name} <{committer_email}> {committer_date_seconds} {committer_date_timezone}

{commit message}
```
where:
- {tree_sha}: SHA of the tree object this commit points to.
  
  This represents the top-level Git repo directory.
  
  That SHA comes from the format of the tree object: What is the internal format of a git tree object?
- {parents}: optional list of parent commit objects of form:
```
parent {parent1_sha}
parent {parent2_sha}
...
```
  The list can be empty if there are no parents, e.g. for the first commit in a repo.
  
  Two parents happen in regular merge commits.
  
  More than two parents are possible with git merge -Xoctopus, but this is not a common workflow. Here is an example: https://github.com/cirosantilli/test-octopus-100k
- {author_name}: e.g.: Ciro Santilli. Cannot contain <, \n
- {author_email}: e.g.: [email protected]. Cannot contain >, \n
- {author_date_seconds}: seconds since 1970, e.g. 946684800 is the first second of year 2000
- {author_date_timezone}: e.g.: +0000 is UTC
- committer fields: analogous to author fields
- {commit message}: arbitrary.

I've made a minimal Python script that generates a git repo with a few commits at: https://github.com/cirosantilli/test-git-web-interface/blob/864d809c36b8f3b232d5b0668917060e8bcba3e8/other-test-repos/util.py#L83

I've used that for fun things like:

Who is the user with the longest streak on GitHub?
https://www.quora.com/Which-GitHub-repo-has-the-most-commits/answer/Ciro-Santilli
https://github.com/isaacs/github/issues/1344

Here is an analogous analysis of the tag object format: What is the format of a git tag object and how to calculate its SHA?

Before you head down this path much further, I might recommend that you read through the section in the Git Manual about its internals. I find that knowing the contents of this chapter is usually the difference between liking Git and hating it. Understanding why Git is doing things the way it does often makes all of the sort of weird commands it has for things make more sense.

To answer your question, the gibberish that you are seeing is the data for the object after it has been compressed using zlib. If you look under the heading "Object Storage" in the link above you can see some details about how this works. This is the short version of how files are stored in git:

Create a git specific header for the content.
Generate a hash of the concatenation of the header + content.
Compress the concatenation of the header + content.
Store the compressed data to disk in a folder with a name equal to the first two characters of the data's hash and a file name with the remaining 38 characters.

So that answers your second question, a folder will contain all of the compressed objects that begin with the same two characters, regardless of their contents.

If you want to see the contents of a blob, all you have to do is decompress it. If you just want to view the contents of the file, this can be done easily enough in most programming languages. I would warn you against trying to modify data, however. Modifying even a single byte in a file will change it's hash. All of the metadata in git (namely, directory structures and commits) are stored using references to hashes, so modifying a single file means that you must also update all objects downstream from that file that reference that file's hash. Then you have to update all the objects that reference those hashes. And on, and on, and on... Trying to achieve this becomes very, very complicated very quickly. You'll save your self a lot of time and heartache by just learning git's built in commmands.

What is the file format of a git commit object data structure?

Related

Recent Posts