GitNStats: A Git History Analyzer to Help Identify Code Hotspots

GitNStats is a cross-platform git history analyzer that identifies files in a git repository which are frequently updated. High churn can serve as a proxy for files that may have poor implementation quality, lack tests, or be missing a layer of abstraction.

Below I will provide basic instructions for getting and using GitNStats. We'll also look at two of my projects to review high-churn files and their git history. By reviewing the history of these files, we can identify potential problem areas, refactoring projects, and development process improvements.

Table of Contents:

  1. Getting GitNStats
  2. Usage
  3. Client Project Analysis
  4. Jenkins Pipeline Library Analysis
  5. Further Reading

Getting GitNStats

The best place to download GitNStats is the repository's Releases page. Pre-packaged 64-bit releases are provided for OSX 10.12, Ubuntu 14.04, Ubuntu 16.04, and Windows.

To install GitNStats:

  1. Download one of the pre-packaged releases.
  2. Create a home for GitNStats, such as within /usr/local/share or your home directory.
  3. Unzip the release package to the target directory.
  4. Link the gitnstats binary to a location in your path, such as /usr/local/bin or /bin.
    1. Alternatively, you can add the target directory to your PATH variable.

Example workflow included in the README:

# Download release (replace version and runtime accordingly)
cd ~/Downloads
wget <archive-for-your-platform.zip>

# Create directory to keep package
mkdir -p ~/bin/gitnstats

# unzip
unzip osx.10.12-x64.zip -d ~/bin/gitnstats

# Create symlink (point at the directory created above, not a hard-coded user path)
ln -s ~/bin/gitnstats/gitnstats /usr/local/bin/gitnstats
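If you prefer the alternative from step 4 (adding the target directory to PATH instead of creating a symlink), a minimal sketch could look like the following. The GITNSTATS_HOME variable name is my own choice for illustration, and the location assumes the ~/bin/gitnstats directory used in the workflow above.

```shell
# Assumed install location from the workflow above; adjust it to wherever
# you unzipped the release. GITNSTATS_HOME is a hypothetical variable name.
GITNSTATS_HOME="$HOME/bin/gitnstats"

# Prepend the directory to PATH only if it is not already present,
# so sourcing this snippet repeatedly does not grow PATH.
case ":$PATH:" in
  *":$GITNSTATS_HOME:"*) ;;
  *) PATH="$GITNSTATS_HOME:$PATH" ;;
esac
export PATH
```

Placing these lines in your shell startup file (for example ~/.bashrc or ~/.zshrc, depending on your shell) makes the change persistent.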

Usage

The simplest way to use gitnstats is to run it inside a repository without arguments. The output shows the repository path, the branch, and a list of files paired with their commit counts.

$ gitnstats

Repository: /Users/pjohnston/src/ea/templates
Branch: master

Commits    Path
3    oss_docs/CONTRIBUTING.md
3    oss_docs/PULL_REQUEST_TEMPLATE_CCC.md
3    oss_docs/PULL_REQUEST_TEMPLATE.md
3    oss_docs/ISSUE_TEMPLATE.md
2    oss_docs/CODE_OF_CONDUCT.md
1    README_template.md
1    PULL_REQUEST_TEMPLATE_example.md
1    PULL_REQUEST_TEMPLATE_CCC.md
1    Jenkinsfile
1    ISSUE_TEMPLATE_example.md
1    CONTRIBUTING_template.md
1    CODE_OF_CONDUCT_template.md
1    CI.jenkinsfile
1    .github/PULL_REQUEST_TEMPLATE.md
1    .github/ISSUE_TEMPLATE.md
1    oss_docs/README.md
1    jenkins/Jenkinsfile
1    jenkins/CI.jenkinsfile
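To build intuition for what a per-file commit count means, a similar tally can be approximated with plain git and standard Unix tools. This is only a sketch: the churn_counts function name is hypothetical, and this is not necessarily how GitNStats computes its numbers (renames and merge commits, for instance, may be handled differently).

```shell
# Tally how often each path appears in `git log --name-only` output (stdin).
churn_counts() {
  grep -v '^$' |   # drop the blank separator lines between commits
    sort |         # group identical paths together
    uniq -c |      # prefix each path with its number of occurrences
    sort -rn       # highest count first
}
```

Inside a repository, it could be used as: git log --name-only --pretty=format: | churn_counts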

You can also supply the repository path as a command-line argument, allowing you to invoke gitnstats from outside of a repository:

~$ gitnstats /Users/pjohnston/src/ea/templates
Repository: /Users/pjohnston/src/ea/templates
Branch: master

…

You can specify a branch name to analyze using the -b or --branch arguments:

$ gitnstats -b avoid-failing-when-delete-a-branch
Repository: /Users/pjohnston/src/ea/scm-sync-configuration-plugin
Branch: avoid-failing-when-delete-a-branch

…

You can also limit the analysis to commits made after a certain date using the -d or --date arguments:

$ gitnstats -d 1/1/18
Repository: /Users/pjohnston/src/ea/embedded-framework
Branch: master

Commits    Path
8    docs/development/libraries.md
5    docs/development/tools.md
4    docs/architecture/architecture.md
3    docs/development/testing.md
2    docs/development/quality.md
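The -d examples in this article use month/day/two-digit-year dates. If you want to cross-check a date-limited result with plain git, its --since option accepts ISO dates, so a small conversion helper can be handy. The sketch below is my own illustration: mdy_to_iso is a hypothetical name, and it assumes every year falls in the 2000s.

```shell
# Convert a M/D/YY date (assumed to be in the 2000s) to ISO YYYY-MM-DD.
mdy_to_iso() {
  IFS=/ read -r month day year <<EOF
$1
EOF
  # Strip one leading zero so printf does not parse "08" or "09" as octal.
  printf '20%02d-%02d-%02d\n' "${year#0}" "${month#0}" "${day#0}"
}
```

For example: git log --since="$(mdy_to_iso 1/1/18)" --name-only --pretty=format:
Note that git may treat the boundary date differently than GitNStats does, so counts near the cutoff can differ.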

Those are the basic operations supported by gitnstats, and they can be combined:

$ gitnstats ~/src/ea/libc -b pj/stdlib-test -d 10/30/17
Repository: /Users/pjohnston/src/ea/libc
Branch: pj/stdlib-test

Commits    Path
1    src/stdlib/strtof.c
1    src/stdlib/strtod.c
1    src/gdtoa
1    premake5.lua
1    .gitmodules
1    src/stdlib/strtoll.c
1    src/stdlib/strtol.c

For further instructions, run gitnstats --help.

Client Project Analysis

I recently worked on a short-term project for a client, so let's take a look at that project and see how the file churn maps to problems I encountered along the way.

10:38:13 (master) power-system-fw$ gitnstats
Repository: /Users/pjohnston/src/projects/power-system-fw
Branch: master

Commits    Path
34    src/lib/powerctrl/powerctrl.c
34    src/main.c
33    Makefile
29    README.md
26    src/lib/commctrl/commctrl.c
19    src/_config.h
18    src/drivers/i2c/i2c_slave.c
17    src/drivers/can/can.c
13    src/lib/powerctrl/powerctrl.h
13    src/drivers/bmr456/bmr456.c
11    src/drivers/gpio/gpio_interrupt_handler.c
11    src/lib/commctrl/commctrl.h
10    src/drivers/i2c/i2c.c
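On a longer listing, picking out the heavily changed files by eye gets tedious. A threshold filter over output like the listing above can be sketched in shell as follows. The hot_files name is hypothetical, the sketch assumes the two-column "Commits Path" layout shown in this article, and paths containing spaces would be truncated.

```shell
# Print the rows of a "Commits  Path" listing (stdin) whose commit count
# is greater than or equal to the threshold given as $1.
hot_files() {
  # Header lines such as "Repository: ..." are skipped because their
  # first field is not a plain number.
  awk -v min="$1" '$1 ~ /^[0-9]+$/ && $1 + 0 >= min + 0 { print $1, $2 }'
}
```

For example, gitnstats | hot_files 17 would keep the first eight rows of the listing above. The cutoff of 17 is chosen here only because it separates those rows from the rest; a sensible threshold depends on the age and size of the project.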

Eight files have been changed a significant number of times, and each of the top three files was changed about three times as often as the files below the top 10.

That's a pretty huge gap, so let's look at the history to see what's going on with our top three files:

  • main.c was updated every time a new library or driver was added and required initialization.
    • The abort and error handling functions are included in main.c and received multiple functionality updates (stopping threads, sending a UART message, LED error code)
      • These handlers should be split into a different file
    • Static functions received doxygen updates in separate commits - I can clearly be better about documenting WHILE writing a function
  • powerctrl.c is the library which provides power control abstractions and power-state management
    • Timing parameters have been updated multiple times after validation efforts
      • These values should be configurable and moved into _config.h - churn should happen there
    • Due to timing problems, the library was overhauled to add in a thread which managed power state changes
      • Significantly less churn happens after this change
    • As new parts and drivers were brought up, they were added into the power control library individually
  • Makefile was updated every time a new source file was created.
    • Significant churn happened when bringing up the project on Linux, as differences between gcc versions and case-sensitive file systems identified a series of changes that needed to be made
      • These changes weren't made on a branch, but instead committed and tested with a new build on the build server.
      • This is terrible development practice on my end. I should have been testing locally in a VM or by using a branch.

By looking at the statistics, I can uncover some design work and refactoring efforts that will improve the project. I also see the results of some expedient choices I made, resulting in terrible development practices and unnecessary file churn. Now these facts are logged in git history forever.

What About Recent Changes?

The project was officially delivered on 6/1/18, so let's see what modifications have been made after client feedback:

$ gitnstats -d 6/2/18
Repository: /Users/pjohnston/src/projects/power-system-fw
Branch: master

Commits    Path
1    src/drivers/gpio/gpio_interrupt_handler.c
1    src/lib/powerctrl/powerctrl.c

Not too bad after all, though both gpio_interrupt_handler.c and powerctrl.c are in the high-commit list in the overall history analysis. If these files continue to show edits, I know I need to spend more time thinking about their structure and interfaces.
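Cross-referencing a recent listing against the full-history listing can also be done mechanically. The sketch below assumes both listings were saved to files (for example by redirecting the gitnstats output, which I am assuming goes to standard output) and that they use the two-column layout shown above. The common_paths name is hypothetical.

```shell
# Print "overall_count recent_count path" for every path present in both
# listings. $1: file with the full-history listing,
#           $2: file with the date-limited listing.
common_paths() {
  awk '
    # First file: remember the overall count for each path.
    FNR == NR { if ($1 ~ /^[0-9]+$/) overall[$2] = $1; next }
    # Second file: report paths that were also seen in the first file.
    $1 ~ /^[0-9]+$/ && ($2 in overall) { print overall[$2], $1, $2 }
  ' "$1" "$2"
}
```

A possible session: gitnstats > all.txt, then gitnstats -d 6/2/18 > recent.txt, then common_paths all.txt recent.txt. Files that keep appearing in the recent list while also ranking high overall are the ones worth a closer look.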

Jenkins Pipeline Library Analysis

The Jenkins Pipeline Library is an open-source library for use by Jenkins multi-branch pipeline projects. I use this library internally to support complex Jenkins behaviors, as well as with some client Jenkins implementations.

Let's see what the highest-churn files for this project are:

10:41:59 (master) jenkins-pipeline-lib$ gitnstats
Repository: /Users/pjohnston/src/ea/jenkins-pipeline-lib
Branch: master

Commits    Path
15    vars/sendNotifications.groovy
11    vars/gitTagPreBuild.groovy
10    vars/slackNotify.groovy
5    vars/gitTagCleanup.groovy
4    vars/gitTagSuccess.groovy
4    vars/setGithubStatus.groovy
4    vars/emailNotify.groovy
4    vars/gitBranchName.groovy

…

Wow, the top three files have each been edited 10 or more times.

Clearly there is a problem, which is made even worse by the fact that sendNotifications.groovy was split into two separate functions: slackNotify.groovy and emailNotify.groovy. The fact that sendNotifications.groovy was managing two separate notification paths was the cause of the initial churn on that file, and certainly led to overly complex logic. Splitting the file into two separate functions was A Good Thing.

Diving into the slackNotify.groovy changes, I can see that I was very thoughtless in my initial implementation and committing strategy.

Two commits were actual feature extensions:

  1. Add an option to use blueOcean URLs for slack notifications
  2. Improve output for builds with no changes or first-builds: The commit that was built will be indicated in the message

The rest of the changes were formatting errors, typos, and other fixes for easily-identified errors.
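A rough way to estimate how much of a file's history consists of fix-ups is to scan its commit subjects for tell-tale keywords. The sketch below is only a heuristic: the count_fixups name and the keyword list are my own illustrative guesses, and the result depends entirely on how commit messages were written.

```shell
# Count commit subjects (one per line on stdin) that look like small fix-ups.
# The keyword list is an illustrative guess and should be tuned per project.
count_fixups() {
  grep -ciE 'fix|typo|format|whitespace'
}
```

For a single file, the subjects can be fed in with: git log --format=%s -- vars/slackNotify.groovy | count_fixups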

There are some clear lessons here:

  1. I can identify and address problematic files long before 25 total changes (sendNotifications.groovy + slackNotify.groovy)
  2. To avoid high churn on a file, follow good development processes. Expediency creates terrible history and higher-than-necessary churn. I would be embarrassed to do this on a professional project, so why did I take the expedient route on a personal (and public!) project?
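When a file has been split, its churn is spread across several rows, so it helps to add the related counts together. A minimal sketch, assuming the two-column listing format shown above (combined_churn is a hypothetical name):

```shell
# Sum the commit counts of every row (stdin) whose path matches the
# extended regular expression given as $1.
combined_churn() {
  awk -v pattern="$1" '
    $1 ~ /^[0-9]+$/ && $2 ~ pattern { total += $1 }
    END { print total + 0 }   # print 0 rather than an empty line if nothing matched
  '
}
```

For example, gitnstats | combined_churn 'sendNotifications|slackNotify' adds up the rows for those two files, which for the listing above is 15 + 10 = 25.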

Further Reading