Duplicated repository has the same commit IDs

We need a duplicate GitHub repository that we will use for future development work. We want to leave the original repository intact and untouched while keeping its commit history in the new repository.

After following the mirroring steps, the duplicate repository has all the code of the original repository, but even the commit IDs are the same.

  1. Is it all right to have the same commit IDs in both repositories? Are there any downsides? I do not want work in the new repository to have any impact on the original one.

Solution 1:

This is all perfectly normal.

The commit hash ID is the commit, in a very real and important sense: If two commits have the same hash ID, they have the same author, committer, log message, and source-tree snapshot. That is, their contents are identical if and only if their hash IDs are also identical.1
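To make "the hash ID is the commit" concrete, here is a minimal Python sketch of how a SHA-1 based repository derives an object ID from an object's content. The helper names (git_object_id, commit_body) and the author identity and timestamp are made up for illustration, and the commit layout is simplified (no GPG signature or extra headers). Repositories created with the SHA-256 object format use a different hash function, but the principle is the same.

```python
import hashlib

# Well-known ID of the empty tree in SHA-1 repositories.
EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"

def git_object_id(obj_type, body):
    """Hash '<type> <size>\\0' followed by the raw content, as Git does."""
    header = f"{obj_type} {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()

def commit_body(tree, parents, author, committer, message):
    """Build the text of a simplified commit object."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {author}", f"committer {committer}", "", message]
    return ("\n".join(lines) + "\n").encode()

# Hypothetical identity and timestamp, used for both author and committer.
who = "A U Thor <author@example.com> 1700000000 +0000"

in_original = commit_body(EMPTY_TREE, [], who, who, "initial commit")
in_duplicate = commit_body(EMPTY_TREE, [], who, who, "initial commit")
reworded = commit_body(EMPTY_TREE, [], who, who, "Initial commit")

# Same content in two repositories gives the same ID.
print(git_object_id("commit", in_original) == git_object_id("commit", in_duplicate))  # True
# Changing a single character of the message gives a different ID.
print(git_object_id("commit", in_original) == git_object_id("commit", reworded))  # False
```

Nothing about the repository's name or location goes into the hash, which is why a mirrored repository necessarily ends up with the same commit IDs.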

Since no commit, once made, can ever be changed, this means that if you have, in your Git repository, a commit with hash ID a123456..., and the other Git repository contains a commit with the same hash ID a123456..., you and they have the same commit. This means your repository doesn't need to get their commit, nor vice versa.

While you can change a "branch"—I put this word in quotes on purpose here—you can't change any commits. What you actually change here is not the commits, but the hash ID stored in a branch name. That is, when using Git, you must be aware of several facts:

  • Commits are numbered, contain snapshots and metadata, and are immutable. The numbers are universal across all Git repositories.
  • The metadata in any given commit can hold the commit-numbers of other commits.
  • Branch names are local to one specific Git repository. A branch name holds the hash ID of one commit.

To update the set of commits in a repository, what we do in general is add one or many new commits. We keep all the existing commits. No commit ever disappears.2 Then we have Git create or update a branch name.
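For the question at hand, the important consequence is that new work in the duplicate cannot touch the original. The toy model below is only a sketch of that idea, not of Git's actual storage: the Repo class, its methods, and the way it hashes commits are all invented for illustration.

```python
import hashlib

class Repo:
    """Toy model: an append-only commit store plus repository-local branch names."""

    def __init__(self):
        self.commits = {}   # hash ID -> (parent IDs, message)
        self.branches = {}  # branch name -> hash ID of the tip commit

    def commit(self, branch, message):
        """Add one new commit, then move the branch name to point at it."""
        parent = self.branches.get(branch)
        parents = (parent,) if parent else ()
        # Simplified stand-in for Git's real object hashing.
        commit_id = hashlib.sha1(repr((parents, message)).encode()).hexdigest()
        self.commits[commit_id] = (parents, message)
        self.branches[branch] = commit_id
        return commit_id

    def mirror(self):
        """Copy every commit and every name into a separate repository."""
        copy = Repo()
        copy.commits = dict(self.commits)
        copy.branches = dict(self.branches)
        return copy

original = Repo()
original.commit("main", "initial")
shared_tip = original.commit("main", "second")

duplicate = original.mirror()
new_tip = duplicate.commit("main", "work in the duplicate")

print(shared_tip in duplicate.commits)          # True: same commit, same ID
print(original.branches["main"] == shared_tip)  # True: original name did not move
print(new_tip in original.commits)              # False: new commit exists only in the duplicate
```

The original would only change if someone explicitly transferred commits into it, for example by pushing to it.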

Since each commit contains a list of previous commit hash IDs (the stored IDs in a commit's metadata must be those of existing, valid commits, not those of future not-yet-made commits3), we can have Git work backwards from a later commit to an earlier commit. In a sense, this process of working backwards is what really makes a "branch" in Git. The name by which we find the last commit is also a "branch". Some of these names are "branch names" and some of them are other kinds of names. These other kinds of names may or may not be "branch names" depending on who you ask and what they happen to have in mind at that moment.

So what this all boils down to is that the word branch, in Git, is kind of meaningless. If possible, try to use something more specific, such as branch name, or remote-tracking (branch) name, or set of commits as found by a branch name, or something along those lines. But humans being human, we will casually use the word "branch" and expect other humans to guess which of these we mean.

An illustration or two can help clarify all of this. In any one given repository, we have some number of branch names. We say that each one points to a commit. We say also that each commit points to some previous commit. For a very simple case, the result looks like this:

... <-F <-G <-H   <--branch-name

Here the name branch-name points to a commit whose hash we represent with the letter H. That commit points to earlier commit G, which points to earlier commit F, and so on.
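The backwards walk can be sketched in a few lines of Python. The single letters stand in for real hash IDs, the function name walk is arbitrary, and F is treated as the first commit here only to keep the example finite (in the drawing it points further back).

```python
# Each commit lists the commits it points back to.
parents = {"H": ["G"], "G": ["F"], "F": []}

# A branch name holds the hash ID of exactly one commit.
branches = {"branch-name": "H"}

def walk(tip, parents):
    """Return the tip commit and everything reachable by following parent links."""
    seen, stack, order = set(), [tip], []
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        order.append(commit)
        stack.extend(parents[commit])
    return order

print(walk(branches["branch-name"], parents))  # ['H', 'G', 'F']
```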

In a more complex repository, we might have:

          I--J   <-- br1
         /
...--G--H
         \
          K--L   <-- br2

which shows some branching going on: commits I-J are only on branch br1 and commits K-L are only on branch br2. Commits up through and including H, however, are on both branches.

When we make a merge commit, by doing git checkout br1 && git merge br2 for instance, we end up with this:

          I--J
         /    \
...--G--H      M   <-- br1 (HEAD)
         \    /
          K--L   <-- br2

where we add the new merge commit M to branch br1. The name br1 now points to the new commit M. Commit M points backwards not just to the one commit J, but also to the other commit L. So commits K-L have been added to branch br1 as well, because we can find them by starting at M and working backwards.
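The same kind of toy graph shows why K-L end up "on" br1 after the merge. This is a sketch with letters standing in for hash IDs; the reachable helper is illustrative, and G is treated as the first commit to keep the graph finite.

```python
parents = {
    "G": [], "H": ["G"],
    "I": ["H"], "J": ["I"],
    "K": ["H"], "L": ["K"],
}
branches = {"br1": "J", "br2": "L"}

def reachable(tip, parents):
    """Return the set of commits found by starting at tip and working backwards."""
    seen, stack = set(), [tip]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return seen

before = reachable(branches["br1"], parents)

# The merge adds one new commit with two parents, then moves the name br1.
parents["M"] = ["J", "L"]
branches["br1"] = "M"

after = reachable(branches["br1"], parents)

print(sorted(after - before))                       # ['K', 'L', 'M']
print(sorted(reachable(branches["br2"], parents)))  # ['G', 'H', 'K', 'L']
```

Note that br2 is unaffected: its name still points to L, and M is not reachable from it.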

Adding a normal everyday commit N to br1 produces:

          I--J
         /    \
...--G--H      M--N   <-- br1 (HEAD)
         \    /
          K--L   <-- br2

Note how the name br1 changed (to point to N instead of M, or earlier, to point to M instead of J), but no commit has changed. We merely added new commits. Commits always (and only) point backwards to previous commits.

The history in a repository is nothing more or less than the set of commits in that repository. We find the commits using branch names (and tag names and other such names), but it's the commits that matter. No commit can ever be changed, so if you have the hash ID of some commit, you simply find some repository that has a commit with that hash ID, and you get that commit. This is why we can say that the hash ID is the commit.


1Mathematically, we can prove that this idea—that every commit always gets a unique hash ID—is unworkable, via the pigeonhole principle. Once two different commits get the same hash ID, Git stops working (well, sort of: much of Git still works, it's just some things that break). The sheer size of the commit hash puts off the Day of Doom long enough that with any luck, we'll all have been dead for millions of years, and won't care.
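To put a rough number on "long enough", the usual birthday approximation for n objects and a b-bit hash is p ≈ 1 - exp(-n(n-1) / 2^(b+1)). The sketch below assumes 160-bit SHA-1 IDs that behave like uniformly random values, so it covers accidental collisions only, not deliberately constructed ones. The function name is made up.

```python
import math

def collision_probability(n_objects, hash_bits=160):
    """Birthday-bound estimate that at least two of n_objects share a hash ID."""
    exponent = n_objects * (n_objects - 1) / (2 * 2 ** hash_bits)
    # -expm1(-x) computes 1 - exp(-x) without losing precision for tiny x.
    return -math.expm1(-exponent)

# A trillion objects: far more than any real repository holds.
print(collision_probability(10 ** 12))

# Only around 2**80 objects does the estimate become substantial.
print(collision_probability(2 ** 80))
```

The pigeonhole principle only guarantees a collision once there are more than 2^160 distinct objects; the birthday estimate shows that the odds stay negligible until the object count approaches 2^80.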

2It is possible to drop commits, but you won't normally see this happen. We won't cover how to make it happen here either.

3Future commit hash IDs are unpredictable. We could in theory make a commit and just throw some random hash ID into it, but if we do that, Git will object to that commit (that's what things like "checking connectivity" messages are about). So Git forbids us from putting bogus commit hash IDs into repositories like this.