Part 4: Scale-up - Analyzing Patterns in the Evolution of Linux
In this article we’ll explore how software evolution helps us make sense of large codebases. We’ll use the Linux Kernel as a practical case study. By analyzing patterns in the evolution of the Linux Kernel we’re able to break down millions of lines of code, authored by thousands of developers, into a set of specific and focused refactoring tasks that are likely to give you the most bang for your efforts.
X-Ray reveals that most modifications to existing code are unevenly distributed across the functions in a file.
This is why I recommend that we guide our refactorings by data. Data that is based on our own behavioral patterns in code. You see, refactoring legacy code is both expensive and high risk. With X-Ray as our guide we know that we spend our efforts where they are likely to be needed the most.
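To make the idea of an uneven change distribution concrete, here is a minimal sketch that counts how often each function in a file has been touched. The commit data and the function names f, g and h are hypothetical, and the counting is a simplification for illustration rather than how X-Ray itself works.

```python
from collections import Counter


def change_frequencies(commits):
    """Count how many commits touched each function, most-changed first."""
    counts = Counter()
    for changed in commits:
        # A function counts once per commit, however many lines changed.
        counts.update(set(changed))
    return counts.most_common()


# Hypothetical history: the functions changed in each commit to one file.
commits = [{"f"}, {"f", "g"}, {"f"}, {"f", "h"}, {"g"}]
ranking = change_frequencies(commits)

# Share of all function-level changes that landed in the top function.
top_function, top_count = ranking[0]
top_share = top_count / sum(count for _, count in ranking)
```

In this made-up history a single function attracts more than half of the changes, which is the kind of skew that tells you where a refactoring is most likely to pay off.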
Find Implicit Dependencies Between Functions
After this detour into a discussion of change frequency distributions we may still struggle with the refactoring of our main Hotspot. I briefly mentioned that Hotspots often have structural problems. In my previous blog post I showed how X-Ray detects patterns in how a Hotspot grows. More specifically, I showed how temporal coupling lets you identify the functions in a Hotspot that tend to be modified together in the same commit. A temporal coupling analysis like that may help you detect some structural problems. Let’s see how it looks on our intel_display.c.
X-Ray detects how often the functions inside a Hotspot are modified in the same commit.
As you see in the picture above, there are several functions inside intel_display.c that are modified together all the time. For example, the top row shows that intel_finish_page_flip_cs and intel_finish_page_flip_mmio have been modified together in every single commit that touched one of them. That implies that these two functions are intimately related.
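If you want to experiment with the idea yourself, the sketch below computes a simple coupling measure between functions. It assumes you have already extracted, for each commit, the set of functions it modified. The commit data and the name other_function are hypothetical, and the measure (shared commits divided by commits touching either function) is one reasonable choice, not necessarily the formula CodeScene uses.

```python
from collections import Counter
from itertools import combinations


def temporal_coupling(commits):
    """Map each function pair to the share of commits touching either
    function that touched both of them."""
    revisions = Counter()  # commits per function
    shared = Counter()     # commits per pair of functions
    for changed in commits:
        functions = sorted(set(changed))
        revisions.update(functions)
        shared.update(combinations(functions, 2))
    return {
        (a, b): count / (revisions[a] + revisions[b] - count)
        for (a, b), count in shared.items()
    }


# Hypothetical history: the functions modified in each commit.
commits = [
    {"intel_finish_page_flip_cs", "intel_finish_page_flip_mmio"},
    {"intel_finish_page_flip_cs", "intel_finish_page_flip_mmio", "other_function"},
    {"other_function"},
]
coupling = temporal_coupling(commits)
```

With this measure, a pair that always changes together scores 1.0, while a pair that only occasionally shares a commit scores close to 0.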
The Similarity column in the table above shows the result of a clone detection algorithm. You see, a common reason that code changes together is that it contains duplication (either on the code level or in terms of knowledge). In our case, we note a similarity of 99% between the functions intel_finish_page_flip_cs and intel_finish_page_flip_mmio. Let’s click on the Compare button in CodeScene to inspect the code:
X-Ray detects software clones inside a Hotspot.
You need to look carefully at the code above since there’s only a single character that differs between the clones. A single negation (!) is all the difference there is. That leaves us with a clear case of code duplication. We also know that the duplication matters since the temporal coupling tells us that these two clones evolve together. Software clones like these are good in the sense that they’re low-hanging fruit: factor out the commonalities and you get an immediate drop in the amount of code you have in your Hotspot.
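A rough similarity score between two function bodies is easy to approximate. The sketch below uses Python's standard difflib as a stand-in; the article doesn't describe CodeScene's clone detection algorithm, so treat this as an illustration only. The two function bodies are hypothetical snippets that differ by a single negation, not the actual kernel code.

```python
from difflib import SequenceMatcher


def similarity(body_a, body_b):
    """Percentage of matching characters between two function bodies."""
    return 100.0 * SequenceMatcher(None, body_a, body_b).ratio()


# Hypothetical function bodies that differ by a single negation.
clone_a = "if (ready(dev)) { finish(dev); }"
clone_b = "if (!ready(dev)) { finish(dev); }"
score = similarity(clone_a, clone_b)
```

A score just below 100 for a pair that is also temporally coupled is a strong hint that the shared logic can be factored out into a single function that takes the differing condition as a parameter.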
Rinse and Repeat for your Main Suspects
Once we’ve inspected our top Hotspot we just rinse and repeat the process with the other main suspects. And here you’ll note another advantage of configuring separate analyses for the different sub-systems. Inspecting a Hotspot is so much easier if you’re familiar with the domain and the code. So if we manage to align the scope of each analysis with the expertise of the team that acts upon it we’re in a good place.
This is something I experience each time I present an analysis of a commercial codebase to its developers. As I present the Hotspot analyses of the different parts of the codebase I usually get approving nods from different parts of the audience depending on their area of expertise. Hotspots put numbers on your gut feelings.
Explore the Social Dimension of Software Design
So far we’ve learned the basics of how software evolution helps us uncover potential technical problems in a large codebase. But software evolution also helps you understand the social dimension of code. Since CodeScene uses version-control data for the analyses, CodeScene is able to detect patterns in how people work and collaborate.
Now, there’s a difference between the open source model used in the Linux project as compared to the collaboration mechanisms you usually see in closed source commercial projects. In the latter case you typically have several distinct teams, often co-located on the same site. And improving the coordination and communication between these teams is often even more important than addressing the immediate technical debt in your code.
CodeScene comes with a set of analyses that help you uncover such team-productivity bottlenecks. An example is code that has to be concurrently worked on by members of different teams. Since the Linux project doesn’t have a formal organization we’ll limit our social analyses to individuals.
Detect Excess Parallel Development
The way we choose to organize influences the kind of code we write. There’s a strong difference between code developed by a single individual and code that’s more of a shared effort by multiple programmers. The quality risks are not so much about how many developers have to work with a particular piece of code; it’s more about how diffused their contributions are.
In Your Code As A Crime Scene I wrote that “[..]the ownership proportion of the main developer is a good predictor of the quality of the code! The higher the ownership proportion of the main developer, the fewer defects in the code”.
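The ownership proportion of the main developer is straightforward to compute from version-control data. Here is a minimal sketch that assumes contributions are measured as the number of commits per author to a file; the author names and the history are hypothetical, and other measures (such as lines added) would work just as well.

```python
from collections import Counter


def main_developer_ownership(authors_of_changes):
    """Return the main developer of a file and their share of its changes."""
    counts = Counter(authors_of_changes)
    author, contributions = counts.most_common(1)[0]
    return author, contributions / sum(counts.values())


# Hypothetical authors of six commits to a single file.
history = ["ann", "ann", "ann", "bob", "ann", "cy"]
owner, share = main_developer_ownership(history)
```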
Again, open source development may be different and encourage contributions to all parts of the code. However, there’s evidence that suggests that this comes with another cost. One study on Linux itself claims that code written by many developers is more likely to have security flaws (A. Meneely & L. Williams, 2009. Secure open source collaboration: an empirical study of Linus’ law). Wouldn’t it be great if we had an analysis that helps us identify those parts of the code?
Detecting code written by many developers is precisely what CodeScene’s Parallel Development analysis does. In particular, it doesn’t look at the number of authors, but at how diffused their contributions to each file are. Let’s see it in action:
A Parallel Development analysis shows you code that's modified by many authors.
You interpret the visualization above by looking at the color of each file: the more red, the more diffused the work on that code. And if you’d like more information, you just click on one of the files to reveal its Fractal Figure:
Fractal Figures show the diffusion of the contributing authors' work on a single file.
The Fractal Figure above is based on a simple model: each developer is assigned a unique color, and the more that developer has contributed to the code, the larger their area of the fractal.
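The same model can be expressed in a few lines of code. The sketch below derives each developer's share of a file (the relative area they would occupy in the figure) and then condenses the shares into a single diffusion number, computed as one minus the sum of the squared shares. That formula is an assumption chosen for illustration, since the article doesn't state how CodeScene scores diffusion, and the author names and histories are hypothetical.

```python
from collections import Counter


def contribution_shares(authors_of_changes):
    """Map each author to their share of the changes to a file."""
    counts = Counter(authors_of_changes)
    total = sum(counts.values())
    return {author: count / total for author, count in counts.items()}


def diffusion(authors_of_changes):
    """0.0 for a single author; approaches 1.0 as the work spreads
    evenly over many authors."""
    shares = contribution_shares(authors_of_changes).values()
    return 1.0 - sum(share ** 2 for share in shares)


# Hypothetical histories: one dominant author versus five equal authors.
focused = ["ann"] * 9 + ["bob"]
diffused = ["ann", "bob", "cy", "dee", "eve"] * 2

focused_score = diffusion(focused)
diffused_score = diffusion(diffused)
```

Note that the score depends on how evenly the work is spread rather than on the raw number of authors, which matches the intent of the Parallel Development analysis described above.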
Embrace the History of your Code
This concludes our exploration of the evolution of the Linux codebase for this time. My goal was to show you how to make sense of a large codebase by utilizing the information of how the system was built. By embracing the history of the code, we were able to identify patterns like Hotspots and implicit dependencies. That is, information that is invisible in the code itself. We also had a brief look at how we can uncover social and organizational information that helps us understand another important dimension of large-scale systems.
Run the Analyses on Your Own Codebase
The best way to learn more is to try CodeScene on your own codebases. CodeScene is available as a service that’s free for open source projects: https://codescene.io/.
Empear also provides CodeScene on-premise. The on-premise version is feature complete with all analyses used in this article. You get an on-premise version here.
Read the Earlier Parts of the Series
Software (r)Evolution is a series of articles that explore novel approaches to understanding and improving large-scale codebases. Along the way we’ll use modern data science to uncover both problematic code and the behavioral patterns of the developers that build your software. This combination lets you identify the parts of your system that benefit the most from improvements, detect organizational issues and ensure that the suggested improvements give you a real return on your investment.