GitNStats is a cross-platform git history analyzer that identifies files within a git repository which are frequently updated. High churn can serve as a proxy for identifying files which may have poor implementation quality, lack tests, or are missing a layer of abstraction.
Below I will provide basic instructions for getting and using GitNStats. We’ll also look at two of my projects to review high-churn files and their git history. By reviewing the history of these files, we can identify potential problem areas, refactoring projects, and development process improvements.
Getting GitNStats
The best place to download the software is the repository's Releases page. Pre-packaged 64-bit releases are provided for OSX 10.12, Ubuntu 14.04, Ubuntu 16.04, and Windows.
To install GitNStats:
- Download one of the pre-packaged releases
- Create a home for GitNStats, such as within /usr/local/share or your home directory
- Unzip the release package to the target directory
- Link the gitnstats binary to a location in your path, such as /usr/local/bin or /bin
  - Alternatively, you can add the target directory to your PATH variable
Example workflow included in the README:
# Download release (replace version and runtime accordingly)
cd ~/Downloads
wget <archive-for-your-platform.zip>
# Create directory to keep package
mkdir -p ~/bin/gitnstats
# unzip
unzip osx.10.12-x64.zip -d ~/bin/gitnstats
# Create symlink
ln -s ~/bin/gitnstats/gitnstats /usr/local/bin/gitnstats
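If you would rather not create a symlink, the PATH alternative mentioned in the install steps might look like the sketch below. It assumes the release was unzipped to ~/bin/gitnstats, as in the workflow above; adjust the directory if you chose a different home for GitNStats.

```shell
# Assumption: the release was unzipped to ~/bin/gitnstats.
# Prepend that directory to PATH for the current shell session.
export PATH="$HOME/bin/gitnstats:$PATH"

# To make the change permanent, add the export line above to your shell
# startup file (for example ~/.bashrc or ~/.zshrc, depending on your shell).
```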
Usage
The primary way to use gitnstats is simply to run it in a repository without arguments. You will see the repository path, the branch, and a list of files with the number of commits that touched each one.
gitnstats
Repository: /Users/pjohnston/src/ea/templates
Branch: master
Commits Path
3 oss_docs/CONTRIBUTING.md
3 oss_docs/PULL_REQUEST_TEMPLATE_CCC.md
3 oss_docs/PULL_REQUEST_TEMPLATE.md
3 oss_docs/ISSUE_TEMPLATE.md
2 oss_docs/CODE_OF_CONDUCT.md
1 README_template.md
1 PULL_REQUEST_TEMPLATE_example.md
1 PULL_REQUEST_TEMPLATE_CCC.md
1 Jenkinsfile
1 ISSUE_TEMPLATE_example.md
1 CONTRIBUTING_template.md
1 CODE_OF_CONDUCT_template.md
1 CI.jenkinsfile
1 .github/PULL_REQUEST_TEMPLATE.md
1 .github/ISSUE_TEMPLATE.md
1 oss_docs/README.md
1 jenkins/Jenkinsfile
1 jenkins/CI.jenkinsfile
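To sanity-check numbers like these, a similar count can be approximated with plain git. This is only a rough cross-check, not a description of how GitNStats works internally; count_paths and git_churn are hypothetical helper names, and the totals may not match GitNStats exactly (for example around merges and renames).

```shell
# Count how often each path appears on stdin (one path per line, blank
# lines ignored) and print the most frequent paths first.
count_paths() {
  grep -v '^$' | sort | uniq -c | sort -rn
}

# List the files touched by every commit reachable from a branch, then
# count them.
#   $1: repository path (default: current directory)
#   $2: branch or revision (default: HEAD)
git_churn() {
  git -C "${1:-.}" log --name-only --pretty=format: "${2:-HEAD}" | count_paths
}

# Usage (inside a repository):
#   git_churn
#   git_churn /path/to/repo master
```

Adding git log's --since option to the command would narrow the window to recent commits.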
You can also supply the repository path as a command-line argument, allowing you to invoke gitnstats from outside of a repository:
gitnstats /Users/pjohnston/src/ea/templates
Repository: /Users/pjohnston/src/ea/templates
Branch: master
…
You can specify a branch to analyze using the -b or --branch option:
gitnstats -b avoid-failing-when-delete-a-branch
Repository: /Users/pjohnston/src/ea/scm-sync-configuration-plugin
Branch: avoid-failing-when-delete-a-branch
…
You can also limit the search to commits made after a certain date using the -d or --date option:
gitnstats -d 1/1/18
Repository: /Users/pjohnston/src/ea/embedded-framework
Branch: master
Commits Path
8 docs/development/libraries.md
5 docs/development/tools.md
4 docs/architecture/architecture.md
3 docs/development/testing.md
2 docs/development/quality.md
Those are the basic operations supported by gitnstats, and they can be combined:
gitnstats ~/src/ea/libc -b pj/stdlib-test -d 10/30/17
Repository: /Users/pjohnston/src/ea/libc
Branch: pj/stdlib-test
Commits Path
1 src/stdlib/strtof.c
1 src/stdlib/strtod.c
1 src/gdtoa
1 premake5.lua
1 .gitmodules
1 src/stdlib/strtoll.c
1 src/stdlib/strtol.c
For further instructions, run gitnstats --help.
Client Project Analysis
I recently worked on a short-term project for a client, so let’s take a look at that project and see how the file churn maps to problems I encountered along the way.
gitnstats
Repository: /Users/pjohnston/src/projects/power-system-fw
Branch: master
Commits Path
34 src/lib/powerctrl/powerctrl.c
34 src/main.c
33 Makefile
29 README.md
26 src/lib/commctrl/commctrl.c
19 src/_config.h
18 src/drivers/i2c/i2c_slave.c
17 src/drivers/can/can.c
13 src/lib/powerctrl/powerctrl.h
13 src/drivers/bmr456/bmr456.c
11 src/drivers/gpio/gpio_interrupt_handler.c
11 src/lib/commctrl/commctrl.h
10 src/drivers/i2c/i2c.c
Eight files have been changed a significant number of times (17 or more commits each), and the top three files were changed roughly three times as often as the files below the top 10.
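The size of that gap can be checked with a little arithmetic on the table itself. The sketch below assumes gitnstats writes its table to standard output in the "count path" layout shown above, sorted from highest to lowest; churn_gap is a hypothetical helper name.

```shell
# Reads gitnstats-style rows ("COUNT PATH", highest count first) on stdin.
# Prints the mean commit count of the top three files, the mean of the
# files ranked below the top ten, and the ratio between the two.
churn_gap() {
  awk '
    $1 ~ /^[0-9]+$/ {              # skip the Repository, Branch, and header lines
      n++
      if (n <= 3) top += $1
      if (n > 10) { tail += $1; tail_n++ }
    }
    END {
      if (n < 3 || tail_n == 0) { print "not enough rows"; exit 1 }
      printf "top3 mean: %.1f, below-top-10 mean: %.1f, ratio: %.1f\n",
             top / 3, tail / tail_n, (top / 3) / (tail / tail_n)
    }'
}

# Usage (inside the repository):
#   gitnstats | churn_gap
```

For the 13 rows listed above, this works out to means of about 33.7 and 10.7 commits, a ratio of roughly 3.2.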
That’s a pretty huge gap, so let’s look at the history to see what’s going on with our top three files:
- main.c was updated every time a new library or driver was added and required initialization.
  - The abort and error handling functions are included in main.c and received multiple functionality updates (stopping threads, sending a UART message, LED error code)
    - These handlers should be split into a different file
  - Static functions received doxygen updates in separate commits; I can clearly be better about documenting WHILE writing a function
- powerctrl.c is the library which provides power control abstractions and power-state management
  - Timing parameters have been updated multiple times after validation efforts
    - These values should be configurable and moved into _config.h; churn should happen there
  - Due to timing problems, the library was overhauled to add in a thread which managed power state changes
    - Significantly less churn happens after this change
  - As new parts and drivers were brought up, they were added into the power control library individually
- Makefile was updated every time a new source file was created.
  - Significant churn happened when bringing up the project on Linux, as differences between gcc versions and case-sensitive file systems identified a series of changes that needed to be made
    - These changes weren't made on a branch, but instead committed and tested with a new build on the build server.
    - This is terrible development practice on my end. I should have been testing locally in a VM or by using a branch.
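One possible way to keep that kind of trial-and-error out of the main history is to iterate on a topic branch and fold the result into a single commit once the build passes. The sketch below is one option among several, not the workflow used on this project; squash_into is a hypothetical helper, and the branch names in the usage comment are examples only.

```shell
# Fold every commit from a topic branch into the base branch as one commit.
#   $1: base branch, $2: topic branch, $3: commit message
squash_into() {
  git checkout "$1" &&
  git merge --squash "$2" &&
  git commit -m "$3"
}

# Usage (hypothetical branch names):
#   git checkout -b linux-bringup      # iterate and push here until the build passes
#   squash_into master linux-bringup "Fix Linux build issues"
```

With this approach, a file such as the Makefile picks up one commit on the base branch instead of one per attempted fix.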
By looking at the statistics, I can uncover some design work and refactoring efforts that will improve the project. I also see the results of some expedient choices I made, resulting in terrible development practices and unnecessary file churn. Now these facts are logged in git history forever.
What About Recent Changes?
The project was officially delivered on 6/1/18, so let’s see what modifications have been made after client feedback:
gitnstats -d 6/2/18
Repository: /Users/pjohnston/src/projects/power-system-fw
Branch: master
Commits Path
1 src/drivers/gpio/gpio_interrupt_handler.c
1 src/lib/powerctrl/powerctrl.c
Not too bad after all, though both gpio_interrupt_handler.c and powerctrl.c are in the high-commit list in the overall history analysis. If these libraries continue to show edits, I know I need to spend more time thinking about the structure and interfaces of these files.
Jenkins Pipeline Library Analysis
The Jenkins Pipeline Library is an open-source library for use by Jenkins multi-branch pipeline projects. I use this library internally to support complex Jenkins behaviors, as well as with some client Jenkins implementations.
Let’s see what the highest-churn files for this project are:
gitnstats
Repository: /Users/pjohnston/src/ea/jenkins-pipeline-lib
Branch: master
Commits Path
15 vars/sendNotifications.groovy
11 vars/gitTagPreBuild.groovy
10 vars/slackNotify.groovy
5 vars/gitTagCleanup.groovy
4 vars/gitTagSuccess.groovy
4 vars/setGithubStatus.groovy
4 vars/emailNotify.groovy
4 vars/gitBranchName.groovy
…
Wow, the top three files have each been edited 10 or more times.
Clearly there is a problem, which is made even worse by the fact that sendNotifications.groovy was split into two separate functions: slackNotify.groovy and emailNotify.groovy. The fact that sendNotifications.groovy was managing two separate notification paths was the cause of the initial churn on that file, and certainly caused overly complex logic. Splitting the file into two separate functions was A Good Thing.
Diving into the slackNotify.groovy changes, I can see that I was very thoughtless in my initial implementation and committing strategy.
Two commits were actual feature extensions:
- Add an option to use blueOcean URLs for slack notifications
- Improve output for builds with no changes or first-builds: The commit that was built will be indicated in the message
The rest of the changes were formatting errors, typos, and other fixes for easily-identified errors.
There are some clear lessons here:
- I can identify and address problematic files long before 25 total changes (sendNotifications.groovy + slackNotify.groovy)
- To avoid high churn on a file, follow good development processes. Expediency creates terrible history and higher-than-necessary churn. I would be embarrassed to do this on a professional project, so why did I take the expedient route on a personal (and public!) project?
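To catch problem files earlier, the report can be filtered against a commit-count threshold. The sketch below assumes gitnstats writes its table to standard output in the "count path" layout shown earlier; flag_high_churn is a hypothetical helper, and the default threshold of 10 is an arbitrary choice.

```shell
# Print only the rows whose commit count is at or above a threshold.
#   $1: threshold (default: 10, an arbitrary starting point)
#   stdin: gitnstats output
flag_high_churn() {
  awk -v limit="${1:-10}" '$1 ~ /^[0-9]+$/ && $1 + 0 >= limit + 0 { print }'
}

# Usage (inside a repository):
#   gitnstats | flag_high_churn 10
```

Running a check like this periodically would surface a file once it crosses the threshold, well before it reaches 25 changes.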