Using Github

By default Github will use SSH to clone repositories. It defaults to using git@github.com as a user with the default id_rsa key. To override this we can add the following to our git config:

  38   │ Host github
  39   │     User git
  40   │     Hostname github.com
  41   │     IdentityFile ~/.ssh/github
  42   │     IdentitiesOnly yes

This way, all clone/commit commands will be done with git clone github:user/repo rather than git clone git@github.com:user/repo

Git repositories

In git, repositories are made up of two things: a “work tree” where files meant to be version controlled live and a “git directory” where Git stores its own data. So if we are working on a specific project, the “work tree” will generally be the file structure where we do our work - the files we keep and what we generally interact with.

On the other hand, the “git directory” is a child directory of the “work tree” and it’s the place where objects and tracking data is stored (after all, the files we work on themselves will probably not be equipped to handle version controls).

A git directory,then, contains the following:

# .git/            - the git directory itself
# .git/objects/    - the object store
# .git/refs/       - the reference store, which contains 'heads' and 'tags'
# .git/HEAD        - a reference to current HEAD
# .git/config      - the repo's config file
# .git/description - the repo's description file

In most git programs, running init will create a .git directory with all of the above.

The description file is only used by the GitWeb program to generate a description of the repository when browsing through a web interface. The config file contains project-specific configuration options.

The objects, refs, and HEAD are the core part of a git repository. ./objects/ stores all the content of the database, while ./refs./ stores pointers to commit objects in that data (things like branches, tags, etc.). Lastly, HEAD is a file that simply points to the current checked out branch.

There is another file called index that can be created to store staging area information.

What are Git objects?

There are four types of git objects: blobs, tags, trees, and commits.

At its core, Git is a “content-addressed filesystem”. This means that the name (or address) of any given objects is derived from the content of the file itself through a mathematical algorithm; this is unlike regular filesystems that can have arbitrary file names.

The implication of this is that when even a single byte of data changes within a file the internal name of that file will also change. As a result, files are not technically modified but always created in a different location. And so, git objects are simply files in a git repository whose paths are determined by their contents. It is also worth noting that most other items in git are objects too, such as commits and tags.

When creating a git object, the storage format will look like so:

00000000  63 6f 6d 6d 69 74 20 31  30 38 36 00 74 72 65 65  |commit 1086.tree|

Here we see the type header (00000000), an ASCII space (0x20), then the size of the object in ASCII (63 6f 6d 6d 69 74 20 31 30 38 36 00 74 72 65 65) - in this case 1086 - and then a null byte (0x00). The last four bytes are the beginning of that objects contents, tree in this case. If converted from hex (with 0x removed) to ASCII, the line above is converted to everything past the 65 byte:

00000000  63 6f 6d 6d 69 74 20 31  30 38 36 00 74 72 65 65 # Byte stream
����  c o m m i t   1  0 8 6 � t r e e  # Byte stream converted in ASCII

The path to any particular object is computed by using the SHA-1 hash of that file’s content. Once the hash is computed, git will use the first two characters as a directory name and the rest as a file name - which allows for 256 intermediate directories, and it is used to avoid having too many files in a single directory. The directory path will then be the first two characters of the hash, a directory limiter, and then the remaining part:

# path to object `e673d1b7eaa0aa01b5bc2442d570a765bdaae751` is
# `.git/objects/e6/73d1b7eaa0aa01b5bc2442d570a765bdaae751`

*Note: As of Git v2.13.0 Git does not use SHA-1 but a hardened version called New Hash that produces the exact same results while avoiding collisions generated by SHAttered attacks

To write software that runs a git program, it must be able to handle git objects. In general, all git objects share the same retrieval/storage mechanisms as well as the same header format. However, because there are multiple object types, we must implement object-specific functionality on a case by case basis. As a result, we can start with an abstracted version of a GitObject that itself can form the basis for other objects:

class GitObject(object):
    """A generic git object that can then be subclasses for specific object types, since
    every object has the same storage/retrieval mechanism."""
    repo = None
 
    def __init__(self, repo, data=None):
        self.repo = repo
 
        if data != None:
            self.deserialize(data)
 
    def serialize(self):
        """Implemented by subclasses. Reads object's contents from self.data -  a byte string - 
        and converts it to a meaningful representation, which itself depends on the subclass"""
        raise Exception("Unimplemented.")
 
    def deserialize(self, data):
        raise Exception("Unimplemented.")

Reading objects

Once we have a set of objects, a function must be created in order to retrieve and read as necessary. Consider the following function:

def object_read(repo, sha):
    """Read object {object_id} from Git repository {repo}. Return
    a GitObject whose exact type depends on the object"""
 
    path = repo_file(repo, "objects", sha[0:2], sha[2:])
 
    with open(path, "rb") as f:
        raw = zlib.decompress(f.read())
 
        # Read object type
        x = raw.find(b' ')
        fmt = raw[0:x]
 
        # Read and validate object type
        y = raw.find(b'\x00', x)
        size = int(raw[x:y].decode("ascii"))
        if size != len(raw)-y-1:
            raise Exception(f"Malformed object {sha}: bad lenght.")
 
        # Pick constructor
        if   fmt==b'commit' : c=GitCommit
        elif fmt==b'tree'   : c=GitTree
        elif fmt==b'tag'    : c=GitTag
        elif fmt==b'blob'   : c=GitBlob
        else:
            raise Exception(f"Unkown type {fmt.decode'ascii'} for object {sha}")
 
        # Call constructor and return object
        return c(repo, raw[y+1:])

The function begins by creating a path to a specific object based on the SHA-1 hash of that object. It uses the repo_file function defined earlier in the code, which itself is simply generating a path by joining the repo’s root directory, the objects directory and then a directory based on the first two characters of the objects hash as a directory plus the rest as a file name.

Once found, this function will open that specific file in rb mode - it will be read-only and it will be read as bytes instead of trying to apply a utf-8 or other encoding to represent the string of bytes. With this, the function then uses zlib to decompress the object so that in can be processed in the following steps:

  1. Find the first empty space, b" " (0x20 in ASCII), in order to get the object’s headers and format.
  2. Then we find the null separator, b"\x00" in the byte string starting immediately after the empty space we found before.
  3. Once we found the null separator, we take everything after the headers to null and check the size as an ASCII-decoded int. This value should match the length of the commit itself
  4. Once validated we can select which object constructor subclass to use.

Writing Objects

Writing an object works in the same way as reading it, but in reverse:

def object_write(obj, actually_write=True):
    # Serialize object data
    data = obj.serialize()
    # Add header
    result = obj.fmt + b' ' + str(len(data)).encode() + b'\x00' + data
    # Compute hash
    sha = hashlib.sha1(result).hexdigest()
 
    if actually_write:
        # Compute path
        path=repo_file(obj.repo, "objects", sha[0:2], sha[2:], mkdir=actually_write)
 
        with open(path, 'wb') as f:
            # Compress and write
            f.write(zlib.compress(result))
 
    return sha
  1. We start by taking our GitObject and serialising it into a byte stream (this is to have a byte-by-byte identical copy of the object saved on the file system).
  2. We then prepend the serialised data with our format header, the empty space, b' ', the encoded length of our data, and the null separator
  3. We then take the resulting byte string and hash it with the SHA-1 algorithm
  4. Finally, we find an appropriate path for the new file and we open said file in wb (write bytes) mode. Here, we simply write the compressed result.

Packfiles

The object store that is generated with the above code is known as “loose object” storage. While good, proper Git software will implement a more efficient and complex storage mechanism called a packfile. These are a complication of various lose objects, with the caveat that some of those objects are stored as deltas (or transformations of other objects).

These packfiles are stored in .git/objects/pack/ and are written with a .pack extension along side an index of the same name with a .idx extension.

Blob objects

A blob is a “binary large object”

The simplest type of git object is a blob because they have not actual format: they are just user content without predefined syntax or constraints. In other words, every file we place in git is a blob.

As a result of this, implementing a GitBlob is fairly straight forward: it simply needs to store and return its own input unmodified:

class GitBlob(GitObject):
    fmt=b'blob'
 
    def serialize(self):
        return self.blobdata
 
    def deserialize(self, data):
        self.blobdata = data

Parsing commits

A commit object (when uncompressed and without headers) format is a simplified version of mail messages from RFC2811. The object itself looks like this (this commit message is taken from wyag):

tree 29ff16c9c14e2652b22f8b78bb08a5a07930c147
parent 206941306e8a8af65b66eaaaea388a7ae24d49a0
author Thibault Polge <thibault@thb.lt> 1527025023 +0200
committer Thibault Polge <thibault@thb.lt> 1527025044 +0200
gpgsig -----BEGIN PGP SIGNATURE-----

 iQIzBAABCAAdFiEExwXquOM8bWb4Q2zVGxM2FxoLkGQFAlsEjZQACgkQGxM2FxoL
 kGQdcBAAqPP+ln4nGDd2gETXjvOpOxLzIMEw4A9gU6CzWzm+oB8mEIKyaH0UFIPh
 rNUZ1j7/ZGFNeBDtT55LPdPIQw4KKlcf6kC8MPWP3qSu3xHqx12C5zyai2duFZUU
 wqOt9iCFCscFQYqKs3xsHI+ncQb+PGjVZA8+jPw7nrPIkeSXQV2aZb1E68wa2YIL
 3eYgTUKz34cB6tAq9YwHnZpyPx8UJCZGkshpJmgtZ3mCbtQaO17LoihnqPn4UOMr
 V75R/7FjSuPLS8NaZF4wfi52btXMSxO/u7GuoJkzJscP3p4qtwe6Rl9dc1XC8P7k
 NIbGZ5Yg5cEPcfmhgXFOhQZkD0yxcJqBUcoFpnp2vu5XJl2E5I/quIyVxUXi6O6c
 /obspcvace4wy8uO0bdVhc4nJ+Rla4InVSJaUaBeiHTW8kReSFYyMmDCzLjGIu1q
 doU61OM3Zv1ptsLu3gUE6GU27iWYj2RWN3e3HE4Sbd89IFwLXNdSuM0ifDLZk7AQ
 WBhRhipCCgZhkj9g2NEk7jRVslti1NdN5zoQLaJNqSwO1MtxTmJ15Ksk3QP6kfLB
 Q52UWybBzpaP9HEd4XnR+HuQ4k2K0ns2KgNImsNvIyFwbpMUyUWLMPimaV1DWUXo
 5SBjDB/V/W2JBFR+XKHFJeFwYhj7DD/ocsGr4ZMx/lgc8rjIBkI=
 =lgTX
 -----END PGP SIGNATURE-----

Create first draft

The format begins with a series of key-value pairs separate by a space and ends with a commit message. The fields are as follows:

  • tree is a reference to a tree object. It maps blobs IDs to filesystem locations and describes a state of the worktree. It is the actual content of the commit: files and where they should be.
  • parent is the parent of the current commit. This may be repeated, as is the case for merge commits that may have multiple commits, or may be entirely absent as would be the case for an initial commit.
  • author and committer are separate values because the author may not be the same person who can create the commit
  • gpgsig is the PGP signature of this object

The code to parse a message like this is as follows. Note that the function is called kvlm_parse because it is designed to parse key-value lists with messages. The same format is used for both commits and tags:

def kvlm_parse(raw, start=0, dct=None):
    if not dct:
        dct = collections.OrderedDict()
        # You CANNOT declare the argument as dct=OrderedDict() or all
        # call to the functions will endlessly grow the same dict.
 
    # We search for the next space and the next newline.
    spc = raw.find(b' ', start)
    nl = raw.find(b'\n', start)
 
    # If space appears before newline, we have a keyword.
 
    # Base case
    # =========
    # If newline appears first (or there's no space at all, in which
    # case find returns -1), we assume a blank line.  A blank line
    # means the remainder of the data is the message.
    if (spc < 0) or (nl < spc):
        assert(nl == start)
        dct[b''] = raw[start+1:]
        return dct
 
    # Recursive case
    # ==============
    # we read a key-value pair and recurse for the next.
    key = raw[start:spc]
 
    # Find the end of the value.  Continuation lines begin with a
    # space, so we loop until we find a "\n" not followed by a space.
    end = start
    while True:
        end = raw.find(b'\n', end+1)
        if raw[end+1] != ord(' '): break
 
    # Grab the value
    # Also, drop the leading space on continuation lines
    value = raw[spc+1:end].replace(b'\n ', b'\n')
 
    # Don't overwrite existing data contents
    if key in dct:
        if type(dct[key]) == list:
            dct[key].append(value)
        else:
            dct[key] = [ dct[key], value ]
    else:
        dct[key]=value
 
    return kvlm_parse(raw, start=end+1, dct=dct)

The function here starts by creating an OrderedDict, a Python structure that works as a dictionary but retains the order of all items in the dict (Note: possibly redundant as of Python 3.6+, since this behaviour exists in regular dicts).

We then establish where the next space is and the next newline: on the same line, the data before the space should be a key and after the space should be a value. However, if there is no space and only a newline we can assume we are at the message part of the data.

The log command and graphviz

Commits

When thinking about commits, it is worth remembering that work tree contents are just part of a commit, in addition to everything else: contents, authors, and parents. And thus, because the SHA-1 hash of a commit is made up from all the contents of a commit, we then have immutable commits: changing the author or anything about the author, contents, etc. will generate a new, different object with its own SHA-1 hash.

In other words, a commit ID not only identifies some file contents, but also binds the commit itself to its whole history and the whole repository. Notably, too, each commit is linked to its own past, but not to future commits.

The integral parts of a commit are:

  • A tree object; that is, the contents of a worktree, files, and directories
  • Zero, one, or more parents
  • An author identity (name and email)
  • A committer identity (name and email)
  • An optional PGP signature
  • A message

All of these things then hashed together in a SHA-1 ID.

Trees

A tree describes the content of the worktree. That is, it associates blobs to paths, and it is composed of an array of three-element tuples made up of a file mode, a path, and a SHA-1. A typical tree might look like this:

ModePathSHA-1
100644.gitignore894a44cc066a027465cd26d634948d56d13af9af
100644LICENSE94a9ed024d3859793618152ea559a168bbcbb5e2
100644README.mdbab489c4f4600a38ce6dbfd652b90383a4aa3e45
100644src6d208e47659a2a10f5f8640e0155d9276a2130a9
040000testse7445b03aea61ec801b20d6ab62f076208b7d097
040000main.cd5ec863f17f3a2e92aa8f6b66ac18f7b09fd1b38

Mode refers to the file’s file system permission mode, and path to it’s location relative to the tree, while the SHA-1 refers to either a blob or another tree object: when its a blob, the path is a file, and when its a tree, the path is a directory.

In the case of the above tree contents, if we want to instantiate the tree we would begin by loading the object associated with the first path, .gitignore in this case, and check its type. Since it is a blob, we’ll create a file called .gitignore with the blob’s contents, and likewise for LICENSE and README.md. The next path, src, is a directory, and we will therefore create a src directory and repeat the same operation inside the new tree.

Note: a path is a single filesystem entry and it identifies exactly one object. If there are five levels of nested directories we will need five tree objects recursively referring to one another.

Parsing trees

Unlike tags and commits, tree objects are binary objects. Their format is a concatenation of the format’s records:

[mode] 0x20 [path] 0x00 [sha-1]
  • mode is an ASCII representation of a file mode, of size up to six bytes. For example, 100644 is encoded as 49 48 48 54 52 52
  • It is followed by an ASCII space, 0x20
  • Then we have the null-terminated path
  • And finally the SHA-1 in binary encoding on 20 bytes.

The code to parse trees is fairly straightforward. We create a class to represent each leaf of the tree that simply contains the mode, path, and SHA of a particular object. We then use the following parsers:

def tree_parse_one(raw, start=0):
    # Find the space terminator of the mode
    x = raw.find(b' ', start)
    assert(x-start == 5 or x-start==6)
 
    # Read the mode
    mode = raw[start:x]
 
    # Find the NULL terminator of the path
    y = raw.find(b'\x00', x)
    # and read the path
    path = raw[x+1:y]
 
    # Read the SHA and convert to an hex string
    sha = hex(
        int.from_bytes(
            raw[y+1:y+21], "big"))[2:] # hex() adds 0x in front,
                                           # we don't want that.
    return y+21, GitTreeLeaf(mode, path, sha)
 
def tree_parse(raw):
    pos = 0
    max = len(raw)
    ret = list()
    while pos < max:
        pos, data = tree_parse_one(raw, pos)
        ret.append(data)
 
    return ret

The first parser reads a single line of the tree. It separates the data by way of the space and null terminators, converts the SHA into hex (and removes the 0x from hex notation) and simply returns a leaf object.

We then use a more general tree parser that simply goes line by line an returns all the leaves together.

Refs

Git references, or refs, live in subdirectories of .git/refs. They are text files that contain a hexadecimal representation of an object’s hash, encoded in ASCII. They look like this:

6071c08bcb4757d8c89a30d9755d2466cef8c1de

They can also refer to another reference and thereby indirectly to an object. In this case, they look like this:

ref: refs/remotes/origin/master

To work with refs we start with a resolver that takes a ref name and follow recursive references until it returns a SHA-1:

def ref_resolve(repo, ref):
    with open(repo_file(repo, ref), 'r') as fp:
        data = fp.read()[:-1]
        # Drop final \n ^^^^^
    if data.startswith("ref: "):
        return ref_resolve(repo, data[5:])
    else:
        return data

Once we can resolve a single ref, the rest of the code simply involves collecting the relevant refs into a data structure and printing them in order.

Tags

The simplest use of refs is tags (in fact, tags live at .git/refs/tags). These are just user-defined names for an object. They can be used to identify specific releases; rather than referring to a particular commit by a hash, say 4324c09, we can tag it with v3.11.75. The result is that the following two commands become equivalent:

git checkout 4324c09
git checkout v3.11.75

Note that we can use tags to name anything, including blobs.

There are two types of tags: “lightweight tags”, which are regular refs to a commit, tree, or blob, and “tag objects” which are regular tags to a tag object. The latter have authors, a date, an optional PGP signature, and optional annotation, and they have the same format as a commit object.

Because they use the commit object, implementing tags is as easy as:

class GitTag(GitCommit):
	fmt = b'tag'

In Git, the tag command does two things: either it creates a new tag or it lists existing tags.

Branches

A branch is just a reference to a specific commit. In a way, a branch is just a name for a commit, and therefore almost exactly the same as a tag. In fact, branches live at .git/refs/heads.

There are some differences:

  1. Branches are references to a commit, while a tag can refer to any object.
  2. The branch ref is updated at each commit. In other words, whenever we make a commit:
    1. a new commit object is created, with the current branch’s ID as its parent
    2. the commit object is hashed and stored
    3. the branch ref is updated to refer to the new commit’s hash

The current branch is simply a ref file outside the refs hierarchy, located at .git/HEAD. It is formatted as an indirect reference.

Resolving names

Git uses a name resolution algorithm to resolve names such as HEAD, hashes, short hashes, as well as tags and branches matching a given name.

The way it works is as follows:

  1. If name is HEAD, it will resolve .git/HEAD
  2. If name is a full hash, return the hash unmodified
  3. If name looks like a short hash, return objects whose full hash start with this short hash
  4. Resolve tags and branches matching name

Aside: Short hashes

Short hashes are simply a convenience provided by Git to refer to specific objects. For example, 5bd254aa973646fa16f66d702a5826ea14a3eb45 can be referred to as 5bd254


Once a specific hash or name has been found we can follow it until we find an object of the required type, assuming a type argument was provided:

  • If we have a tag and fmt is anything else, we follow the tag
  • If we have a commit and fmt is tree, we return this commit’s tree object
  • Otherwise we abort

Staging and the index

When creating a new commit, we first stage our files and then commit. The way files are staged is making making use of an index file: it represents a set of changes from the HEAD commit; when committing, the changes on the index file must be combined to create a new commit.

The index is made up of three parts:

  • A header with a signature and some basic info (such as the number of entries it holds)
  • A series of sorted entries where each represents a change
  • A series of optional extensions