Archive for January, 2010

Git Is Your Friend not a Foe Vol. 3: Refs and Index

Let’s take a walk through the Git repository structure. Its central square is the Git object database. Objects reference each other by unique 160-bit IDs with certain semantics (for example, a commit object references its parent commit(s) and the tree that corresponds to the project’s root directory; a tree object references blob objects that hold file contents and further tree objects that correspond to subdirectories; etc., see gittutorial-2(7) for details). For the sake of simplicity, let’s forget about trees and blobs for now and look at commits only.

Git with only commits

We now have a bunch of commits that know who their parents are. We can trace history from any given commit back to the very beginning. But how do we know the current state of things? Which commit is the latest in the history? To answer that, let’s look at Git refs (short for references). They are basically named references to Git commits. There are two major types of refs: tags and heads. Tags are fixed references that mark a specific point in history, for example v2.6.29. Heads, on the contrary, are moved continually to reflect the current position of project development.
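If you don’t believe me, look for yourself: a ref is just a tiny file with a commit ID inside. Here is a minimal sketch (it builds a throwaway repository in a temporary directory; all the names are invented):

```shell
#!/bin/sh
# Minimal sketch: a tag and a head are both just files holding commit IDs.
# Everything happens in a throwaway repository; names are invented.
set -e
cd "$(mktemp -d)"
git init -q
git -c user.name=Demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "first commit"
git branch -M master            # pin the branch name across Git versions
git tag v0.1                    # a tag: a fixed reference
cat .git/refs/heads/master      # the head: a 40-hex commit ID
cat .git/refs/tags/v0.1         # the tag points at the same commit (for now)
```

Commit more, and the head file changes while the tag file stays put.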

Git with refs

Now we know what is happening in the project. But to know what is happening right here, right now, there is a special reference called HEAD. It serves two major purposes: it tells Git which commit to take files from when you check out, and it tells Git where to put new commits when you commit. When you run git checkout ref, Git points HEAD to the ref you’ve designated and extracts files from the corresponding commit. When you run git commit, it creates a new commit object, which becomes a child of the current HEAD. Normally HEAD points to one of the heads, so everything works out just fine.
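Both jobs of HEAD are easy to watch in a scratch repository. A sketch (invented names; the helper function just creates empty commits):

```shell
#!/bin/sh
# Sketch: HEAD stores the current branch name, and committing moves
# that branch forward. Throwaway repository, invented names.
set -e
cd "$(mktemp -d)"
git init -q
commit() { git -c user.name=Demo -c user.email=demo@example.com \
               commit -q --allow-empty -m "$1"; }
commit "one"
git branch -M master                # pin the branch name
cat .git/HEAD                       # "ref: refs/heads/master"
before=$(git rev-parse master)
commit "two"
after=$(git rev-parse master)
test "$before" != "$after"                    # the branch reference moved...
test "$(git rev-parse $after^)" = "$before"   # ...to a child of the old tip
```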

Git with branch HEAD

But if you check out a specific commit instead of a branch, HEAD starts pointing directly at that commit. This is referred to as a detached HEAD, and you may be told that you are not on a branch (git branch says “(no branch)”). This is perfectly fine, but if you create commits in this state, no ref will point to them, so if you check out another branch afterwards, you can lose them.
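A quick sketch of detaching and re-attaching HEAD (throwaway repository again, invented names):

```shell
#!/bin/sh
# Sketch: detaching HEAD by checking out a commit ID directly.
set -e
cd "$(mktemp -d)"
git init -q
commit() { git -c user.name=Demo -c user.email=demo@example.com \
               commit -q --allow-empty -m "$1"; }
commit "one"; commit "two"
git branch -M master
git checkout -q "$(git rev-parse master^)"   # check out a commit, not a branch
cat .git/HEAD                                # a bare commit ID: detached HEAD
git checkout -q master                       # safely back on a branch
cat .git/HEAD                                # "ref: refs/heads/master" again
```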

Git with detached HEAD

Speaking of committing, I can’t help stopping by the process of committing itself. You may already know that Git’s “add” operation differs from that of almost every other VCS: you have to “add” not only files that are not yet known to Git, but also files that you have just modified. This is because Git takes the content for the next commit not from your working copy, but from a special temporary area called the index. This allows finer control over what is going to be committed. You can exclude not only whole files from a commit, but even particular pieces of files (try git add -i). This helps developers stick to the atomic commits principle.
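Here is the index at work: both files get modified, but only the staged one makes it into the commit. A sketch in a throwaway repository (file names invented):

```shell
#!/bin/sh
# Sketch: commits are built from the index, not the working copy.
set -e
cd "$(mktemp -d)"
git init -q
echo draft > a.txt; echo draft > b.txt
git add a.txt b.txt
git -c user.name=Demo -c user.email=demo@example.com \
    commit -q -m "initial"
echo final > a.txt; echo final > b.txt     # modify both files...
git add a.txt                              # ...but stage only a.txt
git -c user.name=Demo -c user.email=demo@example.com \
    commit -q -m "update a only"
git diff-tree --no-commit-id --name-only -r HEAD   # prints a.txt only
```

b.txt is still modified in the working copy, untouched by the commit.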

And if you have the inhuman ability to create only perfect commits and need a stupid VCS only to obey your orders, then you can just use the “-a” option of git-commit. And I envy you.

Another special kind of ref is the remote. Whenever you run git fetch, it asks the remote repository what heads and tags it has, downloads the missing objects (if any) and stores the remote refs under the refs/remotes prefix. The remote heads are displayed if you run git branch -r.
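A sketch of remotes in action, with two throwaway repositories standing in for “yours” and “theirs” (all names invented):

```shell
#!/bin/sh
# Sketch: fetched branches end up under refs/remotes/.
set -e
cd "$(mktemp -d)"
git init -q upstream
( cd upstream
  git -c user.name=Demo -c user.email=demo@example.com \
      commit -q --allow-empty -m "first"
  git branch -M master )
git clone -q upstream local
cd local
git fetch -q                    # ask upstream for its heads and tags
git branch -r                   # lists origin/master (and origin/HEAD)
git rev-parse refs/remotes/origin/master
```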

Some of your branches (notably master) may be so-called tracking branches, meaning that such a branch “tracks” its remote counterpart. In practice that means that when you run git pull on that branch, the corresponding remote branch gets automatically merged into your local branch. Fairly recent versions of Git set up tracking automatically when you check out a remote branch (for example, git checkout -b stable origin/stable). Note, however, that sometimes it’s better to rebase instead of merging.

But that’s a whole new story…

Previous posts:

Next post:

All posts about Git


Git Is Your Friend not a Foe Vol. 2: Branches

So, if you have worked with a version control system for a bit, you’ve probably heard of a concept called branches. It is quite a simple concept: you can run several lines of development in parallel without them interfering with each other. Most projects use branches for experimental features that could set hell loose, and for backporting bugfixes to older releases. Subversion and CVS people usually dislike branches, because there branches involve lots of uninteresting and painful work that nobody wants to do. That is easily explained by the way branches are implemented in those systems.

As you might know, branches in SVN are implemented in a very interesting fashion: they are not, in fact, implemented at all. An SVN branch is just a folder that is created when a branch is started. If you want to merge it back, you need to remember the revision number at which you created the branch and use that magical number in a complex “svn merge” command. And still, the SVN project history remains a straight line.

SVN History

What’s wrong with this way of interpreting the branch concept? Nothing. It’s completely fine, if you don’t want to work with branches. And you should want to work with branches, because they are actually awesome! Especially if implemented as in Git.

Instead of SVN-y linear development history, Git’s commit history is a more complex structure: each commit can have multiple parents and multiple children. In computer science this is called a directed acyclic graph (and if this rings any bells, you may want to read Tv’s article “Git for Computer Scientists”). In practice that means that you are not restricted to developing on top of the latest revision in the project’s history. Instead, you can take any existing commit and start creating commits off it. If you later want to join two lines of development, you create a commit that is a child of two commits (such commits are called merge commits).
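The whole thing is easy to reproduce in a throwaway repository: two lines of development diverge from a common ancestor and are then joined by a merge commit (a sketch; all names invented):

```shell
#!/bin/sh
# Sketch: two diverging lines of development joined by a merge commit,
# i.e. a commit with two parents.
set -e
cd "$(mktemp -d)"
git init -q
commit() { git -c user.name=Demo -c user.email=demo@example.com \
               commit -q --allow-empty -m "$1"; }
commit "common ancestor"
git branch -M master
git checkout -qb experiment        # start a second line of development
commit "experimental work"
git checkout -q master
commit "mainline work"
git -c user.name=Demo -c user.email=demo@example.com \
    merge -q --no-edit experiment  # creates a merge commit...
git log -1 --format=%p             # ...which lists two parent IDs
```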

This way you get a graph with several commits that have no children (let’s call them branch heads for now). Every commit has a reference to its parent, so if we take a branch head, we can trace the project history back to the very beginning. This is why a Git branch is simply a reference to its head (go ahead and look at the files in the .git/refs/heads directory of any of your Git repositories).

Git History

Most of the time you will have a branch checked out (the special reference HEAD points to the current branch; see .git/HEAD for an example). When you commit something, your commit is attached to that branch and the branch reference is moved to the new commit. Simple. But sometimes your HEAD may point to something other than a branch head (for example, when you check out an older revision by its ID or tag). This is called a detached HEAD. It is a very simple, very important and very confusing situation. There’s nothing wrong with it, but it hides a peril: if you commit on a detached HEAD, Git creates a new commit and attaches it to the current commit, forming a branch. But this branch has no name! It will just grow sideways like a normal branch, only without a name. Here’s what’s wrong with that:

  • it is confusing, because commits do not go to master, or whatever branch you had checked out before;
  • if you check out another branch, you won’t be able to return to this branch by name: it simply doesn’t have one.

Note that Git won’t let you lose data easily and won’t force you to do unneeded work. Let me tell you what to do in case you have committed on a detached HEAD. Suppose you have created just one commit on a detached HEAD and now just sit and look at it. You have two options:

  • create a new branch with the current commit as a head: git branch branchname,
  • attach the new commit to another branch (suppose it is master): remember the ID of the new commit, check out the required branch (git checkout master), then cherry-pick your commit onto it (git cherry-pick id).

If you find yourself in the second situation (you’ve just committed on a detached HEAD, checked out another branch and don’t remember the commit ID), you may use Git’s reflog (git reflog, or git log -g). It lists the history of your HEAD (checkouts, commits and such), where you can find the commit ID and use it wisely.
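The whole rescue can be sketched end to end in a throwaway repository: commit on a detached HEAD, “lose” the commit, then fish it out of the reflog and cherry-pick it (names invented; the commits are empty for brevity, hence the --allow-empty flags):

```shell
#!/bin/sh
# Sketch: recovering a commit made on a detached HEAD via the reflog.
set -e
cd "$(mktemp -d)"
git init -q
commit() { git -c user.name=Demo -c user.email=demo@example.com \
               commit -q --allow-empty -m "$1"; }
commit "base"
git branch -M master
git checkout -q "$(git rev-parse master)"  # detach HEAD
commit "stranded work"                     # no ref points at this commit
git checkout -q master                     # the commit is now nameless...
lost=$(git rev-parse 'HEAD@{1}')           # ...but the reflog remembers it
git -c user.name=Demo -c user.email=demo@example.com \
    cherry-pick --allow-empty "$lost"      # reattach it to master
git log --oneline
```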

Merging is an important part of the Git workflow. In fact, you will do merges frequently even if you use no branches other than master, provided you use more than one repository. That is because the master of one repository and the master of another repository are, in general, different branches. So when you push to or pull from another repository, you do a merge. Git distinguishes two merge types (suppose you attempt to merge branch B into branch A):

  • fast-forward merge. This happens when B is a direct descendant of A. This is resolved trivially: Git simply moves reference A to point to B,
    Git Fast-Forward
  • non fast-forward merge. This covers all the remaining cases and requires a merge commit to be created (a merge commit is a commit with at least two parents).
    Git Fast-Forward

This distinction is important because a fast-forward merge can be performed automatically, without human intervention. That’s why it is the only merge possible during a Git push. A non fast-forward merge may result in edit conflicts (the situation when two lines of development changed the same line of the same file differently), so human intervention may be required. This is what the (not immediately clear) git-push message “remote rejected: non fast-forward” means: sorry, I can’t push your modifications, because the remote branch has diverged; please resolve this manually. Most often this occurs when another developer managed to push his changes first. In that case just run “git pull”, resolve conflicts (if any), then run “git push”. Less often it occurs when a remote branch has been rewritten completely (for example, the pu branch of Git’s own repository changes very frequently and is not supposed to be developed upon). That means that either you or the remote repository owner screwed up, so you’d better talk to each other. Sometimes it occurs when you try to push to a completely unrelated repository. So just be careful there.
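The typical scenario can be sketched with three throwaway repositories: a shared bare one and two developers. Everything here is invented (the pull.rebase=false setting just spells out “merge, don’t rebase” for newer Git versions):

```shell
#!/bin/sh
# Sketch: a non fast-forward push gets rejected, then succeeds after a pull.
set -e
cd "$(mktemp -d)"
git init -q --bare upstream
git clone -q upstream alice 2>/dev/null
commit() { git -c user.name="$1" -c user.email="$1@example.com" \
               commit -q --allow-empty -m "$2"; }
( cd alice; commit alice "base"; git branch -M master
  git push -q origin master )
git clone -q upstream bob
( cd bob; commit bob "bob's change"; git push -q origin master )
cd alice
commit alice "alice's change"
git push -q origin master 2>/dev/null \
    || echo "rejected: non fast-forward"     # bob got there first
git -c user.name=alice -c user.email=alice@example.com \
    -c pull.rebase=false \
    pull -q --no-edit origin master          # merge bob's work...
git push -q origin master                    # ...now the push fast-forwards
```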

I should note here that the “--force” option of git-push, along with the +refspec notation, is not going to solve your problems automagically. It will simply destroy the remote history, replacing it with your own. So you should never use it unless you know exactly what you are doing.

Next up: rebasing and staging area.



Git Is Your Friend not a Foe Vol. 1: Distributed

Recently I’ve been preaching Git to everyone who uses inferior version control software (like SVN or, pardon me, CVS). But for some reason the main obstacle I run into is that these people are so used to the SVN workflow that they don’t see the magnificence and flexibility Git offers. At most, they are able to read and acknowledge the fact that more and more projects have been switching over to it.

But still, many of them don’t grasp the benefits Git gives, falling back to the classic centralized edit-and-commit-to-server workflow of SVN and whining that “this stupid Git didn’t commit changes in that file; this stupid Git complains about ‘non fast-forward’; this stupid Git ate my kittens; etc.”. I would like to clear things up and introduce them to a better world.

First of all, Git is a distributed version control system. What does that mean? In a classic VCS you have a single holy place called The Repository, where all the project’s history is kept. Developers get only a small fraction of the information from it: the actual files of the latest revision (termed the “working copy”, which is obviously an exaggeration). Basically, the only thing an SVN client is able to do is compare your files with the latest revision and send the diff to the server. In SVN, communication is possible only between The Repository and the puny client with its working copy.


In contrast, Git does not differentiate His Holiness The Repository from mere mortal working copies. Everyone gets a repository of his own. Everyone can do anything they want with it. Each developer can communicate with any other developer. This gives a developer so much freedom that he often doesn’t know what to do with it, and simply asks this:

Uhm, an entire development history? With every working copy? Man, that will eat a lot of disk space! And I even can’t imagine how long it will take to checkout that repository!

Well, first of all, not checkout, but clone. Checkout in Git is a somewhat different operation, and that is the Git club entry fee: you need to shed your centralized VCS habits and get used to new terms and ways. This can be painful at first, but it pays off in the end. You’ll thank me later.

So, back to the repository size. Yes, Git requires you to have the whole repository on your person. Yes, it does increase your project directory size. But Git is extremely efficient at packing stuff, so the increase should not hurt you. In fact, a whole Git repository (with full project history) is known to take less space than an SVN checkout of the same project. And SVN’s checkout process is so inefficient that for most projects git clone takes less time than svn checkout.

Okay, now the next question is: what is so cool about having the whole repository along with project files? Well, the most basic advantage is that a developer can do everything without access to the server, i.e.:

  • view the revision log starting from the very first commits;
  • browse old versions of the project;
  • and more importantly, commit his changes.

Being able to browse the history without Internet access is a nice feature for people on a slow link, or for people who travel a lot. But being able to commit things without asking anyone’s permission is so important that it’s worth a separate paragraph. Here it goes.

Most software teams recognize two simple principles that a developer should follow: keep commits atomic, and don’t commit bad stuff. The problem is that centralized VCSs make these principles incompatible. People just don’t work in a linear, discrete fashion; instead, they tend to steer between several things: a touch there, a refactor here, an occasional stupid bug fix. In the end you get a working tree with a bunch of unrelated, uncommitted and untested changes. In Git you can commit as often as you want, because commits are local to your repository and no one sees them but you. You can commit total rubbish and test everything later; you can edit every single commit without fear of embarrassment and humiliation. You can find out that the way you started to implement that killer feature everyone wanted is totally wrong and start from scratch, without spoiling the project’s version history.
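For example, rewriting the latest local commit takes a single command, git commit --amend. A sketch in a throwaway repository (file names and messages invented):

```shell
#!/bin/sh
# Sketch: a sloppy local commit is rewritten before anyone sees it.
set -e
cd "$(mktemp -d)"
git init -q
echo "quick hack, maybe broken" > feature.txt
git add feature.txt
git -c user.name=Demo -c user.email=demo@example.com \
    commit -q -m "wip, do not look"
echo "clean, tested implementation" > feature.txt
git add feature.txt
git -c user.name=Demo -c user.email=demo@example.com \
    commit -q --amend -m "add the feature"
git log --oneline      # one tidy commit; the embarrassing draft is gone
```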

The second advantage is that developers can exchange their revisions with each other without a central server. Imagine John having reworked the main loop of a nuclear reactor coolant control computer. He doesn’t want to deploy this change to a live system, so he asks Fred to download the respective changes from his repository and test them on his nuclear plant in a less populated area. Having heard no loud explosions, John knows that at least one plant survived the change.

You can benefit from this even if you are the only developer. Imagine you have several different computers (for example a Mac, a Linux x86 box and a Linux amd64 box). You have developed something on your Mac, tested it thoroughly, and are ready to push it to the main repository. But you may also push it first to your Linux boxes and test it there. In SVN you would have to generate a patch, transfer it to the boxes and apply it, everything manually. So you most probably wouldn’t bother at all, and you would discover that nasty bug that occurs only on 64-bit machines two months later, and lose your job.


Finally, the concept of a “central repository” may be eliminated altogether. Every developer gets a “public” repository, where he keeps the stuff he is not ashamed of, and a private repository, where he works as he wants. Or a bunch of private repositories. The developers exchange their work by pulling commits from each other’s public repositories. Or they can have a single lead developer who collects the good commits and whose repository is used as the “blessed” repository. The lead developer either watches for changes in the other public repositories or waits for a “merge request”. A merge request is a message (traditionally an e-mail) that says something along the lines of “Hey, Sam, I’ve implemented the automatic road crosser for blind one-legged homosexuals, ‘git pull git:// crosser’, love, Dave”. Sam copies and pastes the command, gets a new branch, tests it, and then pushes it to his blessed public repository.

For large projects (for example, Linux) the lead developer has several people responsible for specific subsystems (the so-called lieutenants). They collect small commits from their fellow developers, test them and forward them to Linus, who aggregates all the good stuff in his own repository. This ensures that the code is seen by at least one other person before it gets stored in the repository and completely forgotten.

The aforementioned site has a nice section about different Git workflows (see under Any workflow) with pictures.

Also, a nice side effect of Git being a distributed system is that every repository is essentially a backup of the main repository. It doesn’t mean you should not do backups (you should!); it just means that in case everything crashes and burns, any developer can provide you with the full revision history, not just the recent project files.

There are some more things that confuse novice users, especially branches and staging area. I shall cover them in following posts, stay tuned!



Trac Whine Script

Have you ever wanted a script that sends an email to every one of your Trac users, reminding them of all their open, unsolved tickets? Have no fear, here it is:


#!/bin/sh
# Trac Whiner: mails every user a list of their open tickets.
# The first three variables are placeholders: alter them to match
# your installation.

fromaddr="trac@example.com"
dbpath="/path/to/trac/db/trac.db"
tracurl="http://trac.example.com/myproject"

sendmail=/usr/sbin/sendmail
lockfile=/tmp/tracwhine.lock
tmpfile=/tmp/tracwhine.tmp
mailfile=/tmp/tracwhine.mail

# Pluralize the ticket count for the mail text
ending() {
    case $1 in
        1)
            echo open ticket
            ;;
        *)
            echo open tickets
            ;;
    esac
}

from="Trac Whiner <$fromaddr>"
sqlite=`which sqlite3`

# Don't run two whiners at once
[ -e $lockfile ] && exit 0
touch $lockfile

# For every open ticket, select the owner's email, ticket id and summary
request="SELECT value, id, summary
          FROM ticket
          JOIN session_attribute
          WHERE status IN ( 'new', 'accepted' )
           AND name='email'
           AND sid=owner;"

$sqlite "$dbpath" "$request" > $tmpfile

# One reminder mail per unique email address
cut -d\| -f 1 $tmpfile | sort -u | while read email; do
    count=`grep -c "^$email" $tmpfile`

    cat >> $mailfile <<ENDOFLINE
From: "${from}"
To: ${email}
Content-Type: text/plain; charset=utf-8
Subject: [Trac Whine] You have $count $(ending $count).

Howdy there!

Don't want to bother you, but you still have $count $(ending $count):

ENDOFLINE

    # One line per ticket: URL and summary
    grep "^$email" $tmpfile | while read data; do
        number=`echo "$data" | cut -d\| -f 2`
        summary=`echo "$data" | cut -d\| -f 3-`

        echo "$tracurl/ticket/$number ($summary)" >> $mailfile
    done

    cat >> $mailfile <<ENDOFLINE

Have a nice work day!

Automatic Trac Whiner <$tracurl>
ENDOFLINE

    $sendmail -t -f "$fromaddr" < $mailfile

    rm $mailfile
done

rm $tmpfile
rm $lockfile

Don’t forget to alter “fromaddr”, “dbpath” and “tracurl” variables to your liking and enjoy!
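Since the script exits quietly when its lock file is already present, it is safe to run unattended. A crontab entry along these lines could nag everyone each workday morning (the install path is, of course, an assumption):

```
0 9 * * 1-5 /usr/local/bin/trac-whine.sh
```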


Simple Tools for Simple Tasks

I’d like to tell you today about some simple tools that can make your life easier. Of course, a hammered screw holds better than a screwed nail, but it is much more fun to use the appropriate tool for each job. In the long run, at least.

Imagine you want to create a simple photo gallery for your site to tell your readers where you have been this summer, what you have seen and what you have done. Ideally it would be a simple page with a title (the place you visited) and large photos with your short comments. You start with simple HTML, like this:

  <p>Wow, there’s an awful lot of Zenith shops there.</p>
  <img src="zenith01.jpg" />
  <p>The saleswoman seems like a gifted businessman.</p>
  <img src="businessman01.jpg" />

But you are not a mere mortal, you are a fine software developer! Or another kind of brilliant person. So you immediately spot a major design flaw: whenever you need to change the design or layout of this page, or perhaps add a GPS link to each photo, you will have to edit the whole page (or all the individual photo entries) manually.

You could also use some kind of dynamic photo gallery. But that is obvious overkill: web applications are much more CPU- and RAM-hungry. Java programmers may stop reading at this point: efficiency problems never stop them. Also, web galleries usually have cumbersome point-and-click interfaces, which involve too much, well, pointing and clicking for simple tasks. Enterprise programmers may stop reading at this point and join their Java colleagues.

So you would write a simple, relatively fast web application that generates something like the above code from textual descriptions of the photos and a template page. That’s your only option, right? Wrong! It might not even be an option at all: some sysadmins still disable CGI and the like on their web servers (that includes university homepages and some of the cheapest hostings). And even if your web app is extremely fast, static pages still come out way better.

Here’s how you can have your cake and eat it too. For nearly ten years now there has existed a technology called XSL Transformations, or XSLT. In short, it is a language that describes how to convert one XML file into another. Imagine an XML file, say peter.xml:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="album.xslt"?>
<album>
 <title>Saint Petersburg</title>
 <photo>
  <comment>Wow, there’s an awful lot of Zenith shops there.</comment>
  <image>zenith01.jpg</image>
 </photo>
 <photo>
  <comment>The saleswoman seems like a gifted businessman.</comment>
  <image>businessman01.jpg</image>
 </photo>
</album>

It looks exactly like gazillions of other XML files in the world, except for the second line, which in English says: “Oh, hi, you want to show this file to a user? Just apply the stylesheet at album.xslt and he won’t go mad. Thanks!” Here’s the album.xslt:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:template match="/">
  <html>
   <head><title><xsl:value-of select="album/title" /></title></head>
   <body>
    <h1><xsl:value-of select="album/title" /></h1>
    <xsl:for-each select="album/photo">
     <p><xsl:value-of select="comment" /></p>
     <img><xsl:attribute name="src">
      <xsl:value-of select="image" />
     </xsl:attribute></img>
    </xsl:for-each>
   </body>
  </html>
 </xsl:template>
</xsl:stylesheet>
If you look closely at it, you will see that it basically tells you (or your browser) how to make peter.html from peter.xml. And that’s it. So here are the bonuses you get:
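If you don’t want to rely on the browser, you can run the very same transformation offline. A sketch using xsltproc (the command-line tool from libxslt), with a stripped-down album generated on the fly (the content here is invented):

```shell
#!/bin/sh
# Sketch: running an XSL transformation offline with xsltproc.
set -e
cd "$(mktemp -d)"
cat > peter.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<album>
 <title>Saint Petersburg</title>
 <photo><comment>An awful lot of Zenith shops.</comment>
        <image>zenith01.jpg</image></photo>
</album>
EOF
cat > album.xslt <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:template match="/">
  <html><body>
   <h1><xsl:value-of select="album/title" /></h1>
   <xsl:for-each select="album/photo">
    <p><xsl:value-of select="comment" /></p>
    <img src="{image}" />
   </xsl:for-each>
  </body></html>
 </xsl:template>
</xsl:stylesheet>
EOF
xsltproc album.xslt peter.xml > peter.html   # static in, static out
```

The resulting peter.html is a plain static page, ready to upload anywhere.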

  • both files are static, you don’t need to have a muscular web server or a friendly (or non-muscular) sysadmin;
  • if you have hundreds of photos on a single page and at some moment decide that comments should go below the photo, you’d need to swap two lines, not hundreds of lines;
  • if you visited thousands of places and at some moment decide that every page should have a copyright footer or be redesigned completely, you’d need to change one single file, not thousands of them;
  • you can even have different styles for the same set of photos. For example, you may want to show smaller images when viewed from a handheld device. Or bigger images when viewed on your 60” plasma;
  • you don’t increase the butthurterol level in your blood.

Next time I shall probably show you some of my deployment tricks. See you later!