Q&A: How Do You Gain an Understanding of Source Code for a New Project?

We received a question from one of our community members that we think is worth exploring with the greater community:

What is a good way to go about reading source code to understand it? Should I start at main.c and go line by line until a function is called in a different file, and then jump to that function? Should I start at main.c, read it all, make notes, then proceed to a different source file and header, repeat, and make connections? Is there a better approach?

As with any good problem, our first answer is: the approach to take depends on what your goal is!

You might be trying to gain general understanding of a whole project and develop a mental model of it. You might be trying to find something specific within the code base that you need to understand, which is relevant to a problem you’re working on.

Even for general understanding, you might be interested in creating an operational model (how the program behaves when it’s running) or an organizational model (how the code is organized, both on disk and as modules). Each of these models also requires a different approach.

In this case, the member clarified their interest:

This project was created by someone who is intentionally making it as difficult to work with and "black boxed" as possible. I want to gain an overall understanding of how the project works so I would be able to recreate it (if necessary), expand functionality, and refactor the code to support unit testing.

The steps we outline below are aimed at gaining a general understanding of the project by creating an organizational model and an operational model.

Use Tools to Aid Your Investigation

Our first recommendation is to use tools to aid your investigation.

Start with something like cscope or ctags. These tools generate a database for your project, allowing you to quickly search for functions, variables, macros, etc. You don’t need to hunt around to find where a function is defined: you can go directly to it. You can find every place a function is invoked. For navigating a source tree, especially a difficult one, these tools work wonders.
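To make the idea of a symbol database concrete, here is a toy sketch in Python. It is not how cscope or ctags work internally; it is a line-based approximation that assumes function definitions start in column 0, and the names build_index, DEF_RE, and CALL_RE are made up for this illustration.

```python
import re
from collections import defaultdict
from pathlib import Path

# Keywords that look like calls ("if (x)") but are not functions.
KEYWORDS = {"if", "for", "while", "switch", "return", "sizeof"}

# Rough heuristic: a definition starts in column 0, has a type followed by a
# name and a parameter list, and does not end in ';' (that would be a prototype).
DEF_RE = re.compile(r"^[A-Za-z_][\w\s\*]*?\b([A-Za-z_]\w*)\s*\([^;]*\)\s*\{?\s*$")

# Anything that looks like "name(" is treated as a reference to that name.
CALL_RE = re.compile(r"\b([A-Za-z_]\w*)\s*\(")


def build_index(root):
    """Return (definitions, references) for the .c and .h files under root.

    definitions maps a function name to (file, line).
    references maps a name to a list of (file, line) where it appears to be used.
    Prototypes do not match DEF_RE, so they are recorded as references.
    """
    definitions = {}
    references = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in {".c", ".h"}:
            continue
        text = path.read_text(errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            match = DEF_RE.match(line)
            if match and match.group(1) not in KEYWORDS:
                definitions[match.group(1)] = (str(path), lineno)
                continue
            for name in CALL_RE.findall(line):
                if name not in KEYWORDS:
                    references[name].append((str(path), lineno))
    return definitions, references
```

With an index like this, "where is this function defined?" and "who calls it?" become dictionary lookups. The real tools handle macros, multi-line declarations, and many other cases that this sketch ignores.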

Tools that generate dependency graphs are also helpful. These dependency graphs provide a visual representation of how different modules are interrelated. Generating them from the code is much faster than drawing them by hand. Some options for dependency graph generation are:
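Independent of any particular tool, the underlying idea can be sketched in a few lines of Python. The sketch below is only an approximation: it assumes that local dependencies show up as quoted #include directives, ignores system headers in angle brackets, and does not resolve include paths. The function names include_graph and to_dot are hypothetical. The output is text in Graphviz’s DOT format, which Graphviz can render as a picture.

```python
import re
from pathlib import Path

# Matches project-local includes such as:  #include "led.h"
INCLUDE_RE = re.compile(r'^\s*#\s*include\s*"([^"]+)"', re.MULTILINE)


def include_graph(root):
    """Map each .c/.h file under root to the sorted list of headers it includes."""
    root = Path(root)
    graph = {}
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix in {".c", ".h"}:
            text = path.read_text(errors="replace")
            name = path.relative_to(root).as_posix()
            graph[name] = sorted(set(INCLUDE_RE.findall(text)))
    return graph


def to_dot(graph):
    """Render the include graph as Graphviz DOT text, one edge per include."""
    lines = ["digraph includes {"]
    for source, targets in graph.items():
        for target in targets:
            lines.append(f'    "{source}" -> "{target}";')
    lines.append("}")
    return "\n".join(lines)
```

Even this crude picture tends to reveal which headers everything depends on and which modules are unexpectedly tangled together.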

Build a Mental Model of the System

Once our tools are set up, we get to the difficult job of developing a workable mental model of the system.

Organizational Model

I typically approach a new system by creating a file-based organization map. I also document the organizational model in the README or other relevant developer documentation so that others can make use of my work.

Questions I seek to answer while building this model are:

  • Is there a hierarchy or a flat structure?
  • What is the folder structure if it’s a hierarchy?
  • Are libraries, drivers, and external code separated in their own locations?
  • What is the underlying organizational logic, if any?

If there are key files, like main.c, I specifically call them out. But I don’t focus on listing every file and its purpose; I simply try to understand how the project is organized at a high level.
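A quick way to get a first draft of such a map is to summarize the folder structure automatically and then annotate it by hand. The following Python sketch is one possible approach, not a prescribed one: summarize_tree is a hypothetical helper that prints each folder, down to a chosen depth, with a count of files per extension.

```python
from collections import Counter
from pathlib import Path


def summarize_tree(root, max_depth=2):
    """Return an indented outline of folders with per-extension file counts."""
    root = Path(root)
    lines = []

    def visit(directory, depth):
        entries = sorted(directory.iterdir())
        files = [entry for entry in entries if entry.is_file()]
        counts = Counter(f.suffix or "(none)" for f in files)
        summary = ", ".join(f"{n} {ext}" for ext, n in sorted(counts.items()))
        label = "." if directory == root else directory.name + "/"
        lines.append("  " * depth + f"{label}  [{summary or 'no files'}]")
        # Stop descending once the requested depth is reached.
        if depth < max_depth:
            for sub in entries:
                if sub.is_dir() and not sub.name.startswith("."):
                    visit(sub, depth + 1)

    visit(root, 0)
    return "\n".join(lines)
```

The outline answers the "hierarchy or flat?" question at a glance, and it is a reasonable skeleton to paste into the README before adding notes about what each folder is for.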

Operational Model

Once I know how the project is organized, I start to build an operational model by making notes about how the program behaves while it’s running. I will always do this before looking through the source code, because it provides me with the context I need to understand what the code is doing.

Note: You could create the operational model before the organizational model.

Questions I seek to answer while building up an operational model of the system:

  • What is the boot process? How long does it take to boot the program?
  • What kind of debug messages are printed out?
    • Where are they printed to?
    • Are some of those messages repeated?
  • What does the program do in "steady state", if I take no actions?
    • Are there automated actions that happen? What is the frequency?
  • What are the different operational modes?
    • How do I change modes?
    • When I change modes, what happens?
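Some of these questions, such as which messages repeat and how often, can be answered from a captured log. The Python sketch below assumes a log format that I have made up for illustration: each line starts with a timestamp in seconds inside square brackets, followed by the message text. Your system’s format will differ, so treat LINE_RE and repeated_messages as placeholders to adapt.

```python
import re
from collections import defaultdict

# Assumed (hypothetical) line format:  [12.504] heartbeat
LINE_RE = re.compile(r"^\[(\d+(?:\.\d+)?)\]\s+(.*)$")


def repeated_messages(log_lines):
    """Return {message: (count, mean_interval_seconds)} for repeated messages."""
    seen = defaultdict(list)
    for line in log_lines:
        match = LINE_RE.match(line.strip())
        if match:
            seen[match.group(2)].append(float(match.group(1)))

    report = {}
    for message, stamps in seen.items():
        # A message needs at least two occurrences to have an interval.
        if len(stamps) >= 2:
            gaps = [later - earlier for earlier, later in zip(stamps, stamps[1:])]
            report[message] = (len(stamps), sum(gaps) / len(gaps))
    return report
```

A message that shows up with a steady interval is usually a timer or a polling loop, which is a useful hint about what the program is doing in steady state.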

Sometimes I will draw out rough state charts, sequence diagrams, and communication diagrams while I’m working through these questions. These help me further refine my model of the system’s behavior.

I don’t expect my models to perfectly match the system’s design, but they serve as a helpful starting point as I’m navigating through the source code. I can use debug messages to jump to specific files. I can begin to correlate system states with different modules in the code. I can quickly associate the source code with the behaviors I observed in the system.

Code Investigations

Once I’ve developed models of the project organization and the program’s behavior, I begin to look into the source code. This often involves a two-pronged approach.

One route is to identify which source files go with specific behaviors shown during runtime. I do this by searching for debug strings in the source files. A matching string gives you a “latch”: a place to begin your search for one specific aspect of the system, from which you can expand outward.
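Any text search tool will do for this. As a minimal sketch in Python, the hypothetical helper find_string below scans the C sources under a folder for a fixed piece of text. It assumes you search for the constant part of a message, since values printed through format specifiers will not appear literally in the source.

```python
from pathlib import Path


def find_string(root, needle, suffixes=(".c", ".h")):
    """Return (relative_path, line_number, line_text) for each line containing needle."""
    root = Path(root)
    hits = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(errors="replace")
            for lineno, line in enumerate(text.splitlines(), start=1):
                if needle in line:
                    hits.append((path.relative_to(root).as_posix(), lineno, line.strip()))
    return hits
```

For example, if the console shows a message that begins with a fixed prefix followed by a number, searching for the prefix alone is more likely to land on the line that prints it.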

Another route is to start at main() and work through the various functions and modules one-by-one. For my first pass through the code, I try to identify the major functions and modules, ignoring the rest. You can get deep into the weeds if you look too closely initially.
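One way to keep that first pass shallow is to limit how many calls deep you follow from main(). The Python sketch below illustrates the idea with a breadth-first walk. It assumes you already have a map of which function calls which, written by hand or taken from a tool; the first_pass helper and the function names in the usage comment are invented for this example.

```python
from collections import deque


def first_pass(calls, entry="main", max_depth=2):
    """Return (function, depth) pairs reachable from entry within max_depth calls.

    calls maps a function name to the list of functions it calls.
    Functions are listed in breadth-first order, each one only once.
    """
    order = [(entry, 0)]
    seen = {entry}
    queue = deque(order)
    while queue:
        name, depth = queue.popleft()
        if depth == max_depth:
            continue  # Do not look inside functions at the depth limit.
        for callee in calls.get(name, []):
            if callee not in seen:
                seen.add(callee)
                order.append((callee, depth + 1))
                queue.append((callee, depth + 1))
    return order


# Usage with a hypothetical call map:
# calls = {"main": ["system_init", "run_loop"], "system_init": ["led_init"]}
# first_pass(calls, max_depth=1)
```

Everything beyond the depth limit is deliberately left for a later pass, which keeps the first reading focused on the major functions and modules.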

During these explorations, I will also create another visual model of the system. This is not a logical picture of the organization, nor is it a dependency graph. Instead, I simply try to identify the major source code components and document how they interact, usually via communication diagrams and sequence diagrams.

If you’re working through source code that’s unfamiliar, inevitably you’ll find a section of code that you wrestle with. Always add comments to the source once you’ve clarified what’s happening. Future you, and other developers, will be grateful for the documentation improvements and clarifications.

Allow the Process to Take Time

As a matter of practical advice, you should allow (and expect) this process to take time. Spread out the work so you don’t go mad. Learning a new system is a marathon rather than a sprint.

If you do need to sprint for some reason, just focus on understanding a particular subset of the system. That’s usually how I handle the complexity with client projects: I try to ignore as much about the system as I possibly can, focusing only on the area I’m interested in. I then work to understand that subsection of the code deeply, and only expand outward as necessary.

Allowing the process to take time also helps when you are interested in adding unit tests to a code base. Often, the source code will be tightly coupled. This is especially true of obtuse code bases, which likely have many unnecessary dependencies, headers that include other headers that they don’t need, information which isn’t properly encapsulated, etc. That’s where you must begin: breaking these unnecessary dependencies. Recognize that this will be a continual process, and not something that you can complete in one go.

Another good way to begin cleaning up the project is to refactor the organization of the code within the repository. Group together libraries, drivers, and other related modules in separate locations. Split up large files into smaller files with better-defined responsibilities. Again, recognize that this effort will be a continual process and will take time.

Be Aware of the Political Dangers

If you’re working on an intentionally obtuse code base, you must acknowledge that there could be a reason the code is designed that way. A person may have used this approach to gain power: only they can understand the system, and by making it difficult for others to understand the system, they maintain their supremacy.

That power is not easily taken away. Be careful, and ensure that you aren’t setting yourself up for an attack down the road. People don’t tend to play nice when they are exposed and their form of power is being revoked. They may have allies that you do not have.

Look out for signs of resistance to improvements, clarifications, and comments. You may notice direct resistance, passive-aggressive comments, or even that your bosses turn against you. These are signs that you are in a toxic environment. It is better to move on to greener pastures and let the code be.

Recommended Reading

There have been excellent books published on the subject of working with legacy code. Two I can recommend are:

On the subject of improving existing code, I recommend:
