git-rebase

Edit: You should probably also read Understanding the Git Workflow. Much more succinct explanation of the point that I dance around a lot below. Almost as if we’d had roughly the same ideas, but he squashed the messages more eloquently ;)

A recent article on git-merge vs git-rebase kicked off a very interesting conversation on Twitter the other day among a bunch of us nerds.

It started with Marco saying

Good article on merge vs. rebase. It’s pretty balanced, but I’m still firmly against rebase after reading. blog.sourcetreeapp.com/2012/08/21/mer…
— Marco Rogers (@polotek) December 8, 2012

A few of us objected.

@polotek I wouldn’t recommend being “firmly” against rebase. That’s like being “firmly” against a particular coding style. Forest for trees.
— Jason Smith (@_jhs) December 9, 2012

Then there were like a bazillion more replies back and forth, and it got too much for Twitter.

@techwraith @polotek @rwaldron @nexxylove @oscargodson I’ll reply in blog format. ENOSPC.
— isaacs (@izs) December 9, 2012

So here I am writing a blog post now.

Git is an Editor

I’m pretty sure I stole this line from someone, but I couldn’t find any reference earlier than mine, so maybe I made this up. It’s a way of thinking about git that I really like:

cvs and svn are remote backups that you use to save your changes.
git is an editor that you use to write your code’s biography.
— isaacs (@izs) May 14, 2010

So my reply to Marco was:

@polotek Git is an editor. “Firmly against rebase” is like being firmly against backspace.
— isaacs (@izs) December 9, 2012

I hope that some of the implications of that become clearer by the time I get to the end of this post.

“tidy”

A lot of the discussion of git-rebase fixates on keeping the commit history “tidy”.

@techwraith @izs @polotek rebasing into a release branch is clean and tidy. You still have the data in your feature branch.
— Emily Rose (@nexxylove) December 9, 2012

@nexxylove @techwraith @izs what does clean and tidy buy besides a sense of satisfaction?
— Marco Rogers (@polotek) December 9, 2012

The workflow that everyone is discussing seems to be: develop a feature on its own branch, then either merge that branch in when it’s done, or rebase it onto the target branch so it lands as a clean, linear series of commits.

The article that started the conversation really only discussed this single use of rebase.

However, rebase is much more versatile than this. Rebase is a tool that can be used to arbitrarily edit the commit history. If that sounds universally scary or bad, I’d argue that your understanding of “edit” and “history” is perhaps a bit limited.

Tidying Up History

Consider the following two stories:

Story 1

I ran out of milk.

I walked down Adams Street to Whole Foods so I could buy milk.

I mean, no, Whole Foods isn’t on Adams, it’s on Bay Place, so I walked down Bay Place to get to Whole Foods to buy milk.

Oh, but first, I had to walk down Grand Ave, then take a right on Bay Place, then parked my car–shit.

I got in my car, drove down Grand Ave, took a right on Bay Place, parked my car at Whole Foods, and needed my wallet, so… right, ok, so before I drove in my car, I got my wallet, drove to Whole Foods on Bay Place by going down Grand, and then bought some milk.

No, wait, I didn’t go to Whole Foods, that’s right, because they’re crazy expensive. I went to the Grand Lake Safeway.

Story 2

I ran out of milk.

I drove to the Grand Lake Safeway and bought some milk.

This seems silly, but the benefit of a tool that can tidy up history is that you get to write whatever crazy awful untidy history you want, while you’re experimenting and editing code, and then organize it later. When I’m actually writing code, my commit history looks a lot like Story 1.

Typically I work on a blahblah-feature-name-wip branch, and then rebase it down to a blahblah-feature-name branch when I’ve got it into a working state and understand how it’s supposed to go. The -wip branches are a godawful mess of “kinda working, except breaks all the fs tests” commits and Revert "kinda working, except..." commits. The benefit of tidiness, then, is that I can track my work in a more or less play-by-play fashion, and still end up with something readable.
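In practice that dance looks something like this (branch names made up for illustration):

git checkout -b feature-name-wip master
# ...hack away, commit constantly, break things, revert, repeat...

# once it works and the shape is clear, cut a tidy branch from the same work:
git checkout -b feature-name feature-name-wip
git rebase -i master    # squash, reword, and drop the play-by-play commits

# feature-name-wip stays untouched, so the forensic record sticks around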

Is it better, then, to push all those play-by-play commits?

What is the purpose of a git history? Is it a forensic record of every edit to a file? Or is it a way for others to determine the reason for those edits?

My answer is: It’s both, depending on context.

I keep the -wip branch around for as long as the forensics are interesting. There’s a lot of times that I notice something is broken in the “real” feature branch, and think, “Oh, I could’ve sworn that was working yesterday.. what was I doing..?” At moments like those, the forensic record of my thoughts is invaluable.

But if someone, years later, is looking at a bit of code that seems strange, and they run git blame on the file, and track the edit back to Revert "Revert "hmm this seems broken but it sorta works i dunno"" then they’ll be rightly upset with me for letting such useless garbage into the repository.

One possible answer is: “Don’t write code like that.” Write a test for the feature, then write the code to make the test pass, and never make mistakes.

The problem is that I’m not smart enough to not make mistakes and know exactly what code to write before I start writing it. Also, I enjoy the liberty of messing around, tracking every bit, and still sharing an elegant useful story with my future collaborators (including future-isaacs.)

Forensics - Platform vs Service

It’s very interesting to me, but not at all surprising, that the people in this conversation who were most in favor of “merge only, never rebase” are generally working on a production web service (Yammer, most of them), and the people who favored rebase seemed to be working primarily on platforms (Node and npm in my case; jQuery, Johnny-Five, and others in the case of Rick Waldron). A few folks (mikeal, Jason Smith) work on both sorts of systems, and of course this breakdown is imprecise.

I’ve seen a similar reaction to rebase-vs-merge in coworkers at Yahoo, Kakai, and Joyent. (Basically ever since I’ve been using git, and exposed to this sort of conversation.) People who spend most of their day debugging production issues want the history to be as detailed and “forensically accurate” as possible. People who spend most of their day debugging platforms or libraries tend to want the history to be the most “elegant story” possible.

For the purpose of this discussion, “Yammer” is a service, and “node” is a platform. You install a platform and then build your program on it, and you install updates yourself. You use a service in production on someone else’s computers, and it is updated by other people. (From this point of view, “PaaS” is actually a Service, not a Platform. Pedants please direct all complaints to /dev/null.)

There is of course some overlap in the concerns. I definitely do not want any changes whatsoever to the history of release branches in the joyent/node repository, even if it means the occasional messy revert. If something has gone out as a part of a release, then it is definitely off-limits, and anyone who rebases that into some other shape will be ruthlessly punished (perhaps with the removal of commit access, if the infraction is repeated).

I think that this split comes down to a simple question: When encountering a new problem, are you likely to use git-bisect to track it down, or are you going to revert to some known-good state and go from there?

Git’s Killer Feature: Bisect

If you have never used git bisect, then you don’t fully understand why git is useful. Git bisect is one of those tools that, once you use it, you realize that there is simply no way you could have ever debugged without it. And, the first time you try to bisect over a commit history littered with merge commits and reverts and broken states, you’ll see why “tidiness” is so powerful.

For the uninitiated, here’s a run-down of what bisect does, in broad strokes (a sample session is sketched in commands after the list):

  1. You run your tests, and realize “Oh, shit, it’s busted.”
  2. You know that it was working yesterday (or whenever).
  3. You run git bisect bad. (Answer “yes” to the prompt.)
  4. You do git checkout <known-good-state>. (Run the test again just to make sure that you’re not mis-remembering.)
  5. Run git bisect good.
  6. Git will proceed to hop you to various commits, at which point you run your test and do git bisect good if it’s good, or git bisect bad if it’s bad. (If necessary, you can also check out commits manually, and mark them as good/bad.)
  7. Shockingly quickly, git tells you which commit was the first bad one.
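In command form, a typical session looks something like this (the known-good commit and the test command are placeholders):

git bisect start
git bisect bad                   # the current checkout is busted
git checkout <known-good-state>
# run the test again here, just to be sure it really passes
git bisect good

# git now hops you to a commit roughly halfway in between; for each one:
make test                        # or whatever your "test" is
git bisect good                  # ...or git bisect bad, to match the result

# when it announces the first bad commit:
git bisect reset                 # go back to where you started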

Because the “test” can be literally anything, I often use this to track down which commit in libuv broke something in node, if the node bisect shows that the culprit was an “update libuv to deadbeef” type of commit. I run the git bisect in the libuv repo, and the “test” is putting the libuv code in node’s deps/uv folder and running the node test.
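One way to script that cross-repo dance is with git bisect run; the paths, commits, and test invocation below are placeholders, not an exact recipe:

cd ~/src/libuv
git bisect start
git bisect bad <the-new-libuv-commit>      # what node just updated to
git bisect good <the-old-libuv-commit>     # what node had before
git bisect run sh -c '
  rsync -a --delete --exclude=.git ./ ~/src/node/deps/uv/ &&
  cd ~/src/node &&
  make -j4 &&
  python tools/test.py <the-failing-test>  # or however you run the one failing test
'

git bisect run treats a zero exit code as “good” and a non-zero one as “bad”, so the && chain does the right thing.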

To make bisect even more interesting, you can add something like this to your ~/.gitconfig file:

[alias]
  lg = log --graph --pretty=format:'%Cred%h%Creset %C(yellow)%an%d%Creset %s %Cgreen(%cr)%Creset' --date=relative

Then running git lg will show you a very terse listing of the history, which shows all the bisect tags.

As you can imagine, this works most efficiently when you have a relatively straightforward history. If you have messy back and forth play-by-play commits, where tests are breaking and un-breaking repeatedly, or the build is failing occasionally, then bisect is essentially worthless. Bisect can go over merge commits, but it becomes a lot less trivial to track what it’s doing, and I’ve found a lot of times the “first bad commit” is the merge commit, which is pretty much useless.

Tidiness makes bisect even more powerful. I’d use git without bisect, because it’s the de facto standard of the open source world. But without bisect, I would love git half as much. If I could make bisect even more powerful by rubbing grease on my elbows, I’d do it.

This is not about a sense of satisfaction. Tidiness is about taking 20 minutes to track down a bug, instead of taking all day.

Upstream Motion and Large Feature Branches

Occasionally you have a “Feature” branch that is really a major refactor of some sort, which cannot be accomplished in a single burst of coding. My most recent example is streams2, which I’ve been working on for a few weeks. (There’s been a lot of input and feedback from many others, and the code has been used as a userland module, so this isn’t a code-dump in progress.)

In order to not end up in a state where we have a zillion conflicts to resolve, and to be able to bisect out problems, I’ve been continually rebasing streams2 on top of the upstream master branch.
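That part is just a routine loop, something like this (assuming the upstream repo is a remote named upstream):

git fetch upstream
git rebase upstream/master streams2    # re-apply the streams2 commits on top of the new master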

This is especially important when a change comes into the upstream master that breaks my feature branch, but doesn’t break master. I’m not sure how I’d even go about testing that if I were merging master into streams2 (unless I were to merge each commit individually, which is kind of absurd).

Here’s how I found the source of the problem (sketched out in commands after the list):

  1. Check out master into a new “clean” workspace. (NB: I didn’t have to download anything again. I could just do cd ..; git clone ./node ./node-clean, and then git pull origin master, since “origin” is the “streams2” workspace.)
  2. Run git bisect bad.
  3. Check out master from 4 weeks ago, which I’m pretty sure worked. (Could have probably been a bit more conservative, but who cares? Bisect uses the magic of a binary search, so it’s probably just one extra test!)
  4. Just to make sure, in the first workspace, rebase streams2 onto that 4-week-old commit. (Because streams2 was already based on master, I did this via git rebase -i <old-commit> and then deleted all the lines that were above the first streams2 commit.) Sure enough, the test passes fine.
  5. Back in the “clean” workspace, run git bisect good.
  6. Repeat step 4 in the “streams2” workspace for each commit that bisect checks out, then run git bisect good or git bisect bad in the “clean” workspace to match the result, until bisect identifies the first bad commit.
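Roughly, in commands, with placeholder paths and commits:

# one-time setup of the "clean" workspace, next to the streams2 workspace
cd ..
git clone ./node ./node-clean    # "origin" in the clone is the streams2 workspace
cd node-clean
git pull origin master

# bisect over master in the clean workspace
git bisect start
git bisect bad                          # streams2 on top of current master fails
git checkout <master-from-4-weeks-ago>
# over in ../node: git rebase <master-from-4-weeks-ago> streams2, run the test, confirm it passes
git bisect good

# then, for each commit <C> that bisect lands on here:
#   in ../node:  git rebase <C> streams2, and run the test
#   back here:   git bisect good or git bisect bad, to match the result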

I’m sure that there was probably a simpler way to do this. Maybe even some argument to bisect to tell it to rebase onto the commit, or a shell script that’d do it all for me. But this worked, was fast, and totally got the job done. I don’t even know how I would have figured that out otherwise.

(In this case, the culprit was a libuv update. Bisecting through libuv using the technique I described above tracked it down to a “remove libev” mega-commit, so the moral of the story is that, even with great tools, dependencies suck sometimes.)

Seriously. If you’re doing a big refactor that takes some time, and the upstream root is changing, how do you manage to find these kinds of issues? If I were merging in, all I’d know is that the merge commit made the test fail, but the test doesn’t fail on master. In this case it’s only the combination that is problematic.

I don’t know how to find or solve that sort of problem without rebase. I’m sure that there’s some way, because I vaguely remember handling these kinds of issues when I used CVS and SVN, but it’s all hazy now that I’m spoiled by bisect’s awesome mightiness and the power of a mutable history.

Taking Patches and “Ideal Workflows”

As the maintainer of several open source projects, a few of which get a lot of outside contributions, I am faced with a choice:

  1. Make every potential contributor follow the ideal workflow (for some value of “ideal”), and reject a lot of patches because they are “incorrect” in some trivial way.
  2. Be ok with using git am and git rebase occasionally.

That to me is a no-brainer. My approach to contributions is closer to the “email a patch” model than the “click the green merge button” approach. Having that many merge commits would fuck up my precious bisect, and I always want to test the commit locally before pushing it live.

This doesn’t mean that I don’t use GitHub’s pull request features. On the contrary, I love them! They’re a great lightweight way to do discussion and code reviews, they integrate well with email, and when it’s ready to accept, I can do:

curl https://github.com/joyent/node/pull/12345.patch | git am

Or to pull just one commit:

curl https://github.com/john-q-contributor/node/commit/deadbeef00.patch | git am

Did the user have a long commit message that is in broken English and will be confusing later? No problem. git rebase -i HEAD^, and reword it. Did they make some minor lint mistake? No matter. make jslintfix && gci -am 'lint fixup' (gci being an alias for git commit), and then squash the commits together. Did they put 3 features in one commit? Easy, just do a git rebase -i, mark the commit for editing, then split it up into 3. They get credit for the work that they did, and I get a history that’s easy to read with a minimum of clutter. In the most extreme case, I can even use git format-patch to output a patch file, edit it manually, and then git am that puppy.
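For concreteness, here’s one way to do each of those edits once git am has landed the patch (the <base> commit is a placeholder):

# reword the incoming commit's message
git rebase -i HEAD^           # mark it "reword" (or just: git commit --amend)

# squash a trivial lint fix into the commit before it
git commit -am 'lint fixup'
git rebase -i HEAD~2          # mark the 'lint fixup' line as "fixup" or "squash"

# split one commit into three
git rebase -i <base>          # mark the big commit as "edit"
git reset HEAD^               # keep its changes in the working tree, uncommitted
git add -p && git commit      # stage and commit each piece, three times over
git rebase --continue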

Rebase, am, and apply allow a project lead to be liberal in what they accept, and strict in what they send.

Yes, this can all be abused. So, don’t abuse it. Great power great responsibility blah blah blah. In fact it’s pretty easy to not fuck up, far easier than actually writing perfect code or getting contributors to make perfect commits. If you do mess it up, oh well, there are plenty of forks; just reset and start over.

It’s About History

It would be unreasonable to say that you should always use rebase instead of merge. For example, we routinely merge node’s stable branch into master. Rebasing master onto the latest stable branch would destroy the history of all our previous unstable releases, and would be much harder to manage.

In the case of that “big feature branch” for streams2, even if it ends up being a fast-forward due to being rebased onto the latest master, I’m going to merge it in with --no-ff to force a merge commit, so that it’s easy to pluck off if it turns out that the feature is actually crap.
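In other words, something like:

git checkout master
git merge --no-ff streams2    # force a merge commit even though it would fast-forward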

Git is a tool for managing content. There are a lot of ways to manage content. If something has been pushed live (or even “pushed upstream”) then you probably want to maintain forensic-level control over it.

I think of “history” in terms of levels of granularity.

Take, for example, the assassination of Abraham Lincoln, an extremely significant event in the history of the United States.

At the lowest level of granularity, there’s a lot of first-hand data: the eye-witness accounts that were given to police at the scene, the notes from the doctors in the ambulance and at the hospital where Lincoln was rushed after being shot, the descriptions and reports of other forensic evidence at the scene, the writings of John Wilkes Booth and the testimony of his conspirators (who all failed in their missions, for various reasons). There’s also the context of the Civil War being in the process of ending (but that being in some dispute at the time). There’s the account of the Union soldiers who tracked Booth down to a barn, set it on fire, then shot him as he fled, and the account of John St. Helen, who claimed to actually be John Wilkes Booth, having escaped and lived under a pseudonym in Texas for years, before committing suicide.

You don’t learn any of that in history class, because reading police reports makes it significantly more difficult to understand the story. Some context is great, but reading 25 conflicting eye-witness reports is not actually that useful over a hundred years later. Sure, we keep them around, just in case, but what you really want to know is just what actually happened. So, on top of those low-level accounts, we have various different accounts that go into different levels of detail, resolving and synthesizing the inconsistencies into a coherent story. At the highest level of history, you have the kids’ book version: “Abraham Lincoln was the president during the Civil War. He abolished slavery, and then got shot in a theater.”

The -wip branches are my version of the conflicting police reports. The feature branch is a tidied-up version which is actually useful, but probably still too detailed for most people to want to parse through. The ChangeLog is the effective “what happened when” sort of history, and at the highest level, you’ll have a tweet from @nodejs that says “Version 0.9.5 Released - streams2 support!” with a link to go download it.

Git is an Editor

One objection to the “git is an editor” mindset is that it’s actually a database.

@izs @polotek Git isn’t an editor, it’s a database. It should keep a record of everything that happens with tracked files. Rebase loses data
— Daniel Erickson (@TechWraith) December 9, 2012

Of course, the problem with that is that a text file can be a database also, and you still use an editor to edit it; and a database is just a fancy editor for manipulating and retrieving blocks of data off a disk, and provides some ways to manipulate that data that take advantage of its regularity.

Many “databases” have mechanisms to delete data. (In fact, all that I know of have something like this.) Losing data is only a problem if you lose data that you care about; losing data that you don’t care about increases the overall understanding and makes the data that you do care about easier to get to.

Use the tools wisely. Use the editor to tell a good story, the right kind of story that your application needs told. If you need forensic details of every change to your production service, well good news, git can do that! If you need a clear explanation of the reasoning behind features in your library, git is great for that, also. Like a good editor, it can be used to tell a variety of different kinds of stories. Like a good database, it gives you the tools to organize your data into the shape that is most useful to you.

The backspace key is very dangerous. You can delete an entire file with it! But the trade-offs are worth the benefit. For the same reason that you delete code rather than comment it out (because it’s in the git history, so why keep the clutter?) it’s also great to rebase work-in-progress branches into a good state for their final merge. Some aspects of history are ok to lose. In many situations, not losing that bit of history makes the story harder to follow.