git clone and Hard Links

I once came across a very large git repository:

[aababilov@git ~]$ du -sh $HUGE_REPO
2.2G nova.git/

The repository was full of leftover commits from different users, and git gc had never been run on it. I wanted to make a copy of the huge repo and run git gc on the copy.

[aababilov@git ~]$ time cp -a $HUGE_REPO nova.git
^C

real 0m37.345s
user 0m0.042s
sys 0m2.664s
[aababilov@git ~]$ du -sh nova.git
958M nova.git
[aababilov@git ~]$ rm -rf nova.git

(I interrupted the copy here: after 37 seconds it was still less than half done.)

Let’s try cloning:

[aababilov@git ~]$ time git clone $HUGE_REPO
Initialized empty Git repository in /home/aababilov/nova/.git/

real 0m5.323s
user 0m0.359s
sys 0m0.204s
[aababilov@git ~]$ du -sh nova
2.2G nova

Surprise! Only 5 seconds – and the new repo is ready. But what is the trick?

The heaviest part of the repo is the objects/pack directory:

[aababilov@git ~]$ ls $HUGE_REPO/objects/pack/ -ilh |head -5
total 2.2G
545693 -r--r--r-- 3 osc-robot openstack-core 2.2M Oct 18 17:59 pack-0516d44fabeede6deb0c2d2976995cd86f7887e8.idx
545692 -r--r--r-- 3 osc-robot openstack-core 86M Oct 18 17:59 pack-0516d44fabeede6deb0c2d2976995cd86f7887e8.pack
546881 -r--r--r-- 3 osc-robot openstack-core 2.2M Sep 29 2011 pack-0b59aec9b3c497894dbd87a93a7cc6fcf9959f8a.idx
546880 -r--r--r-- 3 osc-robot openstack-core 86M Sep 29 2011 pack-0b59aec9b3c497894dbd87a93a7cc6fcf9959f8a.pack
[aababilov@git ~]$ ls nova/.git/objects/pack/ -ilh |head -5
total 2.2G
545693 -r--r--r-- 3 osc-robot openstack-core 2.2M Oct 18 17:59 pack-0516d44fabeede6deb0c2d2976995cd86f7887e8.idx
545692 -r--r--r-- 3 osc-robot openstack-core 86M Oct 18 17:59 pack-0516d44fabeede6deb0c2d2976995cd86f7887e8.pack
546881 -r--r--r-- 3 osc-robot openstack-core 2.2M Sep 29 2011 pack-0b59aec9b3c497894dbd87a93a7cc6fcf9959f8a.idx
546880 -r--r--r-- 3 osc-robot openstack-core 86M Sep 29 2011 pack-0b59aec9b3c497894dbd87a93a7cc6fcf9959f8a.pack
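
The first column of both listings is the inode number, and it is identical on both sides. The same comparison can be reproduced on a throwaway repository. What follows is only a sketch: the directory names (source, linked, copied) and the demo identity are made up for illustration, and it assumes everything lives on one filesystem that supports hard links. The second clone uses git clone's --no-hardlinks option for contrast.

```shell
#!/bin/sh
# Sketch: compare pack inodes between a repo and its local clones.
# All names below are hypothetical; nothing touches a real repository.
set -eu
scratch=$(mktemp -d)
cd "$scratch"

git init -q source
(
    cd source
    echo 'hello' > file.txt
    git add file.txt
    git -c user.name=demo -c user.email=demo@example.com \
        commit -q -m 'initial commit'
    git repack -a -d -q      # make sure there is a pack file to look at
)

git clone -q source linked                   # plain local path
git clone -q --no-hardlinks source copied    # force a real copy

pack=$(basename "$(ls source/.git/objects/pack/*.pack)")

# Print the inode number of one file.
inode_of() { ls -i "$1" | awk '{print $1}'; }

src_inode=$(inode_of "source/.git/objects/pack/$pack")
linked_inode=$(inode_of "linked/.git/objects/pack/$pack")
copied_inode=$(inode_of "copied/.git/objects/pack/$pack")

echo "source: $src_inode"
echo "linked: $linked_inode"
echo "copied: $copied_inode"

cd /
rm -rf "$scratch"
```

If the default clone shares its inode with the source while the --no-hardlinks clone does not, the filesystem behaves the way the listings above suggest.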

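When the source is a local path on the same filesystem, git clone does not copy the object database. It makes hard links to the files under objects/ instead, which is why the inode numbers in the first column are the same in both listings. (du still reports 2.2G for the clone because it counts the linked files there too, but no additional disk space is used.) Pack files are immutable, so they won’t be changed accidentally through the cloned repository. What is more, if one repo is removed, only its own hard links are deleted and the others keep working. That is the reason why symbolic links would not be suitable here.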
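
The difference between the two kinds of link can be seen without git at all. Here is a minimal sketch using ordinary files in a temporary directory; the file names are invented for the example and only stand in for a pack file.

```shell
#!/bin/sh
# Sketch: a hard link survives removal of the original name, a symlink does not.
# File names are hypothetical stand-ins for a pack file.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

printf 'packed data\n' > original.pack
ln original.pack hardlink.pack       # a second name for the same inode
ln -s original.pack symlink.pack     # a name that points at the first name

# Both hard-linked names report the same inode number.
inode_a=$(ls -i original.pack | awk '{print $1}')
inode_b=$(ls -i hardlink.pack | awk '{print $1}')

rm original.pack                     # remove one name only

hard_result=$(cat hardlink.pack)     # the data is still reachable
if cat symlink.pack >/dev/null 2>&1; then
    sym_result=readable
else
    sym_result=dangling              # the symlink target is gone
fi

echo "inodes: $inode_a $inode_b"
echo "hard link content: $hard_result"
echo "symlink: $sym_result"

cd /
rm -rf "$workdir"
```

The data disappears from disk only when the last hard link to it is removed, which is the property a cloned repo relies on.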

That is the most beautiful use of hard links I’ve ever seen!

Finally, let’s run garbage collection.

[aababilov@git ~]$ (cd nova/ && git gc)
Counting objects: 82992, done.
Compressing objects: 100% (16224/16224), done.
Writing objects: 100% (82992/82992), done.
Total 82992 (delta 66287), reused 82328 (delta 65814)
[aababilov@git ~]$ du -sh nova
127M nova
[aababilov@git ~]$ ls nova/.git/objects/pack/ -ilh |head
total 115M
285369 -r--r--r-- 1 aababilov griddynamics 2.3M Apr 12 10:51 pack-dd0555ed660fb2aceaf2a588fee0b9c42b5c4606.idx
285343 -r--r--r-- 1 aababilov griddynamics 113M Apr 12 10:51 pack-dd0555ed660fb2aceaf2a588fee0b9c42b5c4606.pack

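Voilà! There is a brand new pack. It has its own inode and a link count of 1, so the clone no longer shares its pack data with the original repo.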
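
Besides du, git can report its own object statistics through git count-objects -v, which makes the effect of git gc visible as numbers of loose and packed objects. The sketch below runs on a throwaway repository with a single commit; the repo name, the demo identity and the field helper are assumptions made for the example.

```shell
#!/bin/sh
# Sketch: object statistics before and after git gc on a throwaway repo.
# The repository and the helper function are hypothetical.
set -eu
scratch=$(mktemp -d)
cd "$scratch"

git init -q demo
cd demo
echo 'hello' > file.txt
git add file.txt
git -c user.name=demo -c user.email=demo@example.com \
    commit -q -m 'initial commit'

# Pull one numeric field (e.g. "count", "packs") out of count-objects.
field() { git count-objects -v | awk -v key="$1:" '$1 == key {print $2}'; }

loose_before=$(field count)      # loose objects before gc
git gc -q
loose_after=$(field count)       # loose objects after gc
packs_after=$(field packs)       # number of pack files after gc
in_pack_after=$(field in-pack)   # objects stored in packs after gc

echo "loose objects: $loose_before -> $loose_after"
echo "packs: $packs_after, objects in packs: $in_pack_after"

cd /
rm -rf "$scratch"
```

On a real repository, adding -H to git count-objects -v prints the sizes in human-readable units, much like du -h.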
