Category: Theory and Ramblings
Branching and Merging and Branch Types
Ever since SCCS, the first version control system, came out of AT&T's Unix distribution over 30 years ago, version control systems have allowed you to branch your code. Merging, however, was a completely different story.
The biggest problem with merging is that it can be quite difficult. In order to merge two different files, you not only need the two versions being merged, but you also have to identify the base to use for the merge. Selecting this base is very difficult because the result of the merge depends heavily upon which base is selected. Plus, any merging algorithm has to take into account the chunk size that will be used for the merge, whether or not the algorithm will recognize line movements, whitespace issues, and even the syntax of the code.
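To see why the choice of base matters so much, here is a minimal sketch of a three-way merge in Python. It is a deliberate simplification: it assumes all three versions have the same number of lines (no insertions or deletions), and the function name merge3 and the sample file contents are made up for illustration. Real merge tools are considerably more sophisticated.

```python
def merge3(base, ours, theirs):
    """Line-by-line three-way merge for inputs of equal length (a simplification)."""
    if not (len(base) == len(ours) == len(theirs)):
        raise ValueError("this sketch only handles files with the same number of lines")
    merged, conflicts = [], []
    for index, (b, o, t) in enumerate(zip(base, ours, theirs)):
        if o == t:
            merged.append(o)          # both sides agree
        elif o == b:
            merged.append(t)          # only "theirs" changed this line
        elif t == b:
            merged.append(o)          # only "ours" changed this line
        else:
            merged.append(o)          # both changed it differently: keep ours,
            conflicts.append(index)   # and record the line as a conflict
    return merged, conflicts


ours = ["a = 1", "b = 2", "c = 3"]
theirs = ["a = 1", "b = 20", "c = 3"]    # "theirs" changed the second line

true_base = ["a = 1", "b = 2", "c = 3"]   # the real common ancestor
wrong_base = ["a = 1", "b = 20", "c = 3"]  # a badly chosen base

result_true, conflicts_true = merge3(true_base, ours, theirs)
result_wrong, conflicts_wrong = merge3(wrong_base, ours, theirs)

print(result_true)
print(result_wrong)
```

With the real common ancestor as the base, the merge keeps the change made in theirs. With the wrong base, the same two files still merge without a reported conflict, but the change is silently thrown away.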
Whitespace means things like tabs, spaces, and line endings. Windows and Unix systems use different line endings (CR LF on Windows, LF alone on Unix), so if one version of a program was edited on Windows and another on Unix, every line could differ even though not a single line of code was changed. Another problem is that Python scripts and Makefiles are whitespace-sensitive file formats, but most programming languages, like C++, Java, and Perl, are not. Most merging algorithms get around this issue by allowing you to decide whether or not to ignore whitespace.
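The line-ending problem is easy to reproduce. The following sketch uses Python's standard difflib module; the helper names and the sample text are my own, and this is only an illustration of the idea, not how any particular version control system does it.

```python
import difflib


def count_differing_lines(a_lines, b_lines):
    """Count lines that a line-based comparison reports as changed."""
    matcher = difflib.SequenceMatcher(a=a_lines, b=b_lines, autojunk=False)
    return sum(
        max(i2 - i1, j2 - j1)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    )


def collapse_whitespace(line):
    """Squeeze runs of spaces and tabs into one space and trim both ends.

    This also discards leading indentation, so it is NOT safe for
    whitespace-sensitive formats such as Python or Makefiles.
    """
    return " ".join(line.split())


unix_text = "int main() {\n    return 0;\n}\n"
windows_text = "int main() {\r\n    return 0;\r\n}\r\n"

# Splitting on "\n" only leaves a stray "\r" on every Windows line.
raw_count = count_differing_lines(unix_text.split("\n"), windows_text.split("\n"))

# str.splitlines() understands both conventions, so the files compare equal.
normalized_count = count_differing_lines(unix_text.splitlines(), windows_text.splitlines())

print(raw_count, normalized_count)
```

Normalizing line endings is almost always safe. Ignoring other whitespace is a choice you have to make per file type, which is why most tools leave it as an option.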
Then, there is the chunk size of the code to consider. What if only a single character on a line changes? Do you treat that character as the change, or the whole line? How about formats that are not line oriented, like HTML or XML? Should your merging algorithm recognize various language syntaxes? Most merging algorithms simply look for line changes since that works with almost all files. However, ClearCase handles not only regular text files, but XML diffs too.
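Here is a small sketch of the chunk size question, again using Python's difflib. The same one-character edit is reported once at line granularity and once at character granularity. The function names and the sample lines are invented for the example.

```python
import difflib


def line_changes(old_lines, new_lines):
    """Report changes at line granularity: any edit marks the whole line."""
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    return [
        (tag, old_lines[i1:i2], new_lines[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def char_changes(old_line, new_line):
    """Report changes at character granularity within a single line."""
    matcher = difflib.SequenceMatcher(a=old_line, b=new_line, autojunk=False)
    return [
        (tag, old_line[i1:i2], new_line[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


old = ["retries = 3", "timeout = 30"]
new = ["retries = 3", "timeout = 60"]

by_line = line_changes(old, new)          # the whole second line is "changed"
by_char = char_changes(old[1], new[1])    # only the "3" became a "6"

print(by_line)
print(by_char)
```

Line granularity is coarser, but it works on nearly any text file without knowing anything about the language, which is probably why it is the usual default.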
So, most version control systems let you branch, but avoid merging. Enter ClearCase to change all of this. When Atria created ClearCase, they spent a lot of time and effort making sure that merging two versions of a file is a smooth and mostly automatic operation. It is such a rare feat that it takes developers quite a long time to get used to the idea of allowing a program to handle their merging automatically. ClearCase does such a great job that branches proliferate in places that use ClearCase. In most places, each developer has their own branch off of the main line of code. They merge the main line of code to their branch, and then, when they are ready for the world to see their efforts, merge this code back into the main line. A single development shop could have hundreds of active branches at once, and it isn't too unusual for a single developer to use multiple branches at once.
Perforce also worked hard at merging. You have to if you want your software to be considered a world class version control system. However, Perforce looked at branches in a somewhat different light than ClearCase. In Perforce's view, developers work directly on the main branch, and not on development branches as one would do in ClearCase. Branching is mainly done to prevent code freezes.
A code freeze happens when you are almost ready to deliver a piece of software. You send the release candidate to your Quality Assurance (QA) organization. QA tests the release and sends you back a list of defects. Your developers fix the defects, then send a new release to QA for testing. Adding new features at this point may add new defects, so all new development is forbidden and only defects are fixed. Meanwhile, you are now paying your developers to twiddle their thumbs while they wait for QA to finish testing.
In order to use your development workforce more efficiently, you create a release branch. All new development continues on the main line while only defects are fixed on the release branch. Now, most of your developers can continue working without worrying that they might be adding new defects to the release candidate. Defects that are found on the release branch can be fixed by only one or two members of your development team while the rest continue their work on the main line.
If a defect is discovered on the release branch, Perforce allows you to fix that defect on the release branch, and merge that change back to the main line development branch.
Although Perforce's merging algorithm works great on release branches, it doesn't work very well when developers use branching the way they do in ClearCase. Most developers have discovered that Perforce does not handle the back-and-forth merging required by the rebase/deliver method used in most ClearCase branches, and ask, "Why doesn't Perforce just do what ClearCase does?"
At the Perforce European Users Conference, Laura Wingerd gave an excellent presentation about branching strategies. She explained the two different types of branching and why Perforce and ClearCase tend to handle branching differently.
ClearCase handles what she called Convergent Branching. That is, the differences between the two branches tend to converge over time. A developer creates a branch from the main line development, and uses it to do their development. Every once in a while, they rebase by merging any new changes on the main line branch to their development branch. After a while, they do one final rebase, and then deliver their changes by merging their development branch back to the main line branch. ClearCase is perfectly tuned for this rebase/deliver algorithm.
This type of branching is called convergent because, as they merge the code back and forth between the branches, the two branches will remain pretty much in sync. After all, the developer wants to start off with a copy of what the main line development looks like. And, when the developer delivers their code, they pretty much want the main line branch to match what is on their branch. Although the branches can differ, the differences will generally disappear over time.
Perforce, on the other hand, handles what Laura Wingerd calls divergent branching. That is, the differences between the two branches tend to diverge over time. In my example above, the code on the release branch is generally frozen while the main line keeps developing. Within six months, the differences between the main line and the release branch will be fairly large. In two years, they'll be even larger.
However, not all divergent branching involves release branches. For example, you might have a Unix product that you port over to Windows. This involves quite a few changes in the code because of the way the OSs handle internal calls, and probably differences in the API calls. Later on, you want features developed for one OS to be merged onto the other. However, the differences between the two branches will never be 100% reconciled and may grow over time.
Perforce merges were designed to handle divergent branches. ClearCase always picks as the base the version of the file that is the most recent common ancestor of the two branches being merged. Perforce, on the other hand, finds the ancestor with the richest history in common between the two versions. Perforce also tracks not only that a merge was done, but whether it was a copy or a merge, and whether the merge or copy operation was done without modifications or the developer had to edit the results. And, Perforce tracks exactly which versions of a file were considered for a merge operation. ClearCase can only track that a merge took place between two versions of a file.
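The "most recent common ancestor" idea can be sketched in a few lines of Python. This is a toy model of my own, not ClearCase's actual implementation: the version graph is a plain dictionary mapping each version to its parents (including recorded merges), and the version names are hypothetical.

```python
def ancestors(graph, version):
    """Return the version itself plus everything reachable through parent links."""
    seen = set()
    stack = [version]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph.get(current, []))
    return seen


def merge_base_candidates(graph, left, right):
    """Common ancestors that are not themselves ancestors of another common ancestor."""
    common = ancestors(graph, left) & ancestors(graph, right)
    return {
        candidate
        for candidate in common
        if not any(
            candidate != other and candidate in ancestors(graph, other)
            for other in common
        )
    }


# dev was branched from main/1, then rebased once from main/2 (recorded as a merge).
graph = {
    "main/1": [],
    "main/2": ["main/1"],
    "main/3": ["main/2"],
    "dev/1": ["main/1"],
    "dev/2": ["dev/1", "main/2"],   # second parent is the recorded rebase merge
    "dev/3": ["dev/2"],
}

# The same history, but with the rebase merge left unrecorded.
graph_without_merge_record = dict(graph, **{"dev/2": ["dev/1"]})

base_with_record = merge_base_candidates(graph, "dev/3", "main/3")
base_without_record = merge_base_candidates(graph_without_merge_record, "dev/3", "main/3")

print(base_with_record, base_without_record)
```

When the rebase is recorded, the base moves forward to main/2, so changes that were already merged are not considered again. Without that record, the base falls all the way back to the original branch point.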
As a result, ClearCase has difficulties with divergent merging, and most developers simply perform divergent merges in ClearCase manually. However, for all of its smarts, Perforce has problems when it comes to convergent merges. Most of the time, Perforce simply picks the wrong base to use for convergent merges because it over-analyzes the situation and doesn't realize that in convergent branches, most of the differences between the branches have already been accounted for in the last merge.
To get around its problems with convergent merges, Perforce recommends what they call a Merge Down/Copy Up strategy. That is, you merge changes from the parent branch to the child branch for rebase operations, but you copy the files from the child branch back to your parent branch for delivery. Although the rebase step in Perforce is fairly straightforward (you do the merge the standard way), the delivery operation becomes a complex seven-step dance which involves running the merge operation twice.
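The reason the order matters in Merge Down/Copy Up can be shown with a toy model. Here each branch is just a Python set of change names, which is my own simplification and has nothing to do with how Perforce stores anything. The function and change names are hypothetical.

```python
def merge_down(parent, child):
    """Rebase: the child picks up every change the parent has."""
    return child | parent


def copy_up(parent, child):
    """Deliver: the parent becomes an exact copy of the child.

    Copying overwrites the parent, so it is only safe when the child
    already contains everything the parent has.
    """
    missing = parent - child
    if missing:
        raise ValueError(f"merge down first; child lacks {sorted(missing)}")
    return set(child)


main = {"change-1", "change-2"}
dev = {"change-1", "feature-a"}      # branched before change-2 landed

try:
    copy_up(main, dev)               # delivering without a rebase would lose change-2
    copied_without_rebase = True
except ValueError:
    copied_without_rebase = False

dev = merge_down(main, dev)          # final rebase
main = copy_up(main, dev)            # delivery is now a plain copy

print(sorted(main))
```

After the final merge down, the child is a superset of the parent, so the delivery needs no merging at all. That is the whole point of the strategy.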
Perforce recognizes this issue and promises to simplify the copy up operation, and perhaps to automate it completely, in future versions. Meanwhile, Laura Wingerd's presentation has helped answer the question, "Why doesn't Perforce just do what ClearCase does?" when it comes to merging.
Versioning Builds
There are two schools of thought on this subject:
- Why save something when you can just rebuild it? According to this school of thought, you should never version the binary results of a build. It simply wastes space.
Version control systems typically save space on text files by storing only the differences between versions. With a line-oriented text file, that is pretty easy to do. Imagine you have a 100-line program in which you modified three lines, deleted two lines, and added four new lines. If I have the original 100-line version, all I need to store are the instructions on how to get from the older version to the newer version: just nine changes. If I store two versions of a binary file, however, I must store a complete copy of each version. If I produce 500 megabytes of binaries with every build, and I am doing five builds per week, I am adding 2.5 gigabytes of storage per week, or about 125 gigabytes of storage per year. I can quickly overwhelm any version control system if I don't have a policy that helps me reclaim obsolete binaries. This becomes another management headache.
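Both numbers in that paragraph can be checked with a short Python sketch. The file contents are placeholders, the helper name is mine, and the yearly figure assumes roughly 50 working weeks and 1,000 megabytes to the gigabyte, which is what makes the total come out to about 125.

```python
import difflib

# A 100-line "program" (the contents are placeholders).
original = [f"line {n}" for n in range(1, 101)]

revised = list(original)
for n in (10, 20, 30):                      # modify three lines
    revised[n - 1] = f"line {n} (modified)"
del revised[59]                             # delete line 60 first ...
del revised[49]                             # ... then line 50, so indexes stay valid
revised += [f"new line {n}" for n in range(1, 5)]   # add four new lines


def count_changes(old, new):
    """Count modified, deleted, and added lines between two versions."""
    counts = {"modified": 0, "deleted": 0, "added": 0}
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            counts["modified"] += max(i2 - i1, j2 - j1)
        elif tag == "delete":
            counts["deleted"] += i2 - i1
        elif tag == "insert":
            counts["added"] += j2 - j1
    return counts


delta = count_changes(original, revised)
total_changes = sum(delta.values())

# Storage growth for full binary copies.
MB_PER_BUILD = 500
BUILDS_PER_WEEK = 5
WEEKS_PER_YEAR = 50          # assumption: about 50 working weeks in a year

gb_per_week = MB_PER_BUILD * BUILDS_PER_WEEK / 1000
gb_per_year = gb_per_week * WEEKS_PER_YEAR

print(delta, total_changes, gb_per_week, gb_per_year)
```

The text file grows by nine small change records, while the binaries grow by a full copy every single build.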
Plus, saving build binaries can lead to, or cover up, bad build practices. If I know I have to rebuild my compiled code each and every time, I make sure that my build practices allow me to do just that.
The Subversion development team does not believe in storing products of the build process, and this shows up in the design philosophy behind Subversion: Subversion has no command or easy mechanism for removing versions of files.
In fact, much of the open source community is against storing products of the build process, which is why most open source software is distributed strictly as source.
- Why build something when you can save it under version control? I personally lean toward this camp, and the reason may be that I work mainly in non-open source environments with very large teams of developers and multiple products. Our developers may depend upon the pre-built libraries created by their fellow developers. As part of my build process, I compile these libraries and allow other developers to use them. Yes, these developers could build their own libraries, but why should I assume that the developer will select the right code and version of the files to build?
By distributing the binaries of the libraries, I can ensure that each developer is using the same set of files. Another advantage is that storing the output of the build process means everyone knows where the official copy of the release is stored. Plus, I can use the power of my version control system to keep up with the binaries.
This is one of the areas where ClearCase excels. In ClearCase, I can take a built binary, and ClearCase will give me not only the names of the files used in the build, but also the versions, build scripts, and environmental settings. But that is only true if the file never leaves ClearCase's storage area.
Another advantage is that I am pretty sure my version control area is being backed up. A storage area for the built files may not be backed up.
The big disadvantage is that you have to keep cleaning out obsolete and unimportant versions of your builds. For example, you might not be interested in built files more than two weeks old as long as those files haven't been sent to clients or to QA. This means determining which versions you want to keep and which to throw out. Again, ClearCase makes this very easy. Under ClearCase, the rmver (remove version) command won't by default delete a version of a file if it is labeled. If you're using ClearCase, all you have to do is delete the labels that are no longer important to you, and run the rmver command.
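The retention policy described above (throw away builds more than two weeks old unless a label says they matter) can be sketched as a simple filter. This is a hypothetical Python model: it does not call ClearCase, and the class, the label names, and the dates are all invented for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class BuildVersion:
    name: str
    built_at: datetime
    labels: set = field(default_factory=set)   # e.g. "SENT_TO_QA" (hypothetical)


def prunable(versions, now, max_age=timedelta(days=14)):
    """Versions that are older than max_age and carry no label at all."""
    return [
        version
        for version in versions
        if not version.labels and now - version.built_at > max_age
    ]


now = datetime(2024, 1, 31)     # fixed date so the example is repeatable
versions = [
    BuildVersion("build-101", datetime(2024, 1, 1)),                     # old, unlabeled
    BuildVersion("build-102", datetime(2024, 1, 2), {"SENT_TO_QA"}),     # old, but labeled
    BuildVersion("build-110", datetime(2024, 1, 25)),                    # recent, unlabeled
]

to_remove = [version.name for version in prunable(versions, now)]
print(to_remove)
```

The labels carry all of the policy. Cleaning up becomes a matter of deciding which labels still matter, not of inspecting each build.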
Perforce allows you to remove obsolete versions of a file via the obliterate command. The problem is that Perforce will delete interesting and uninteresting versions of files alike. The best way to handle this is to use branches for environments. Any version that is QA'd should be moved onto the QA branch. Any built version that is ready for customer usage should be placed on the distribution branch. This way, you can remove old builds from the development versions without worrying that you might delete a version that is sitting on 353 customer sites.
Another, more philosophical question is how you can send something out to production when a different version of the file is in QA. If you store your build output, and QA tests what is stored and likes it, you know exactly what was tested. If you have to rebuild your output for production, how can you be sure that it is what QA really tested?
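One way to make "is this what QA really tested?" a question with a definite answer is to compare checksums. The sketch below uses Python's standard hashlib module. The function names and the byte strings standing in for build output are made up, and it assumes you recorded a fingerprint of the artifact at the time QA signed off.

```python
import hashlib


def fingerprint(artifact: bytes) -> str:
    """SHA-256 digest of the artifact's contents, as a hex string."""
    return hashlib.sha256(artifact).hexdigest()


def matches_tested_build(tested_fingerprint: str, candidate: bytes) -> bool:
    """True only if the candidate is byte-for-byte what QA tested."""
    return fingerprint(candidate) == tested_fingerprint


# Placeholder contents; a real build might embed a timestamp or build number.
qa_tested = b"app binary, build stamp A"
stored_copy = qa_tested                       # pulled back out of version control
rebuilt_copy = b"app binary, build stamp B"   # rebuilt later from the same source

recorded = fingerprint(qa_tested)

stored_ok = matches_tested_build(recorded, stored_copy)
rebuilt_ok = matches_tested_build(recorded, rebuilt_copy)

print(stored_ok, rebuilt_ok)
```

A stored build passes this check trivially. A rebuilt one passes only if your build is perfectly reproducible, which is a much harder thing to guarantee.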
In an era of cheap disk space, storing the results of a build is not a terrible waste. Yes, you take on one more management headache: finding the obsolete revisions and removing them on a regular basis. But it isn't that difficult to implement such a task.