CodeMRI® Understanding Cores Training Video

 

Training Video Transcript

 

Hi, my name is Dan Sturtevant, the CEO of Silverthread. Today we're gonna be talking about architecture and design quality. We're gonna be talking about cores, an important aspect of architecture. We're gonna be exploring how they impact the economics of your software development organization from a productivity perspective and from a defect perspective. And we're gonna look at Silverthread tools and how to explore which files are in cores, along with other aspects of cores.

What is a core? You can think of a code base as a large collection of source code files. They might be written in Java, C, Python, or some other programming language. Any software project may be maintained by many developers, hundreds of developers in some of the largest systems that you see out there.

One problem that happens is that all of these developers develop these different files, containing different functionality, in parallel over time. And as a project evolves, you have a situation where file A might come to depend on file B, B might depend on C, and C might depend on A. Now that's a very small core.

What it means to be a core is that you have a cyclic group, you have a cluster of files that all depend on each other in a circular or cyclic fashion. However, these files, these file cores might span many different parts of the system. You could have a core that encapsulates tens, hundreds, or even thousands of software source code files.
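The idea of a cyclic group can be sketched in a few lines of Python. This is a minimal illustration, not the algorithm CodeMRI® uses: it assumes that a cyclic group corresponds to what graph theory calls a strongly connected component (a set of files that can all reach each other through dependencies), and it finds them with Kosaraju's two-pass algorithm. The function name and the file names are made up for the example.

```python
from collections import defaultdict


def find_cyclic_groups(dependencies):
    """Return groups of two or more files that all depend on each other.

    `dependencies` is a list of (source_file, target_file) pairs meaning
    "source depends on target". Groups are returned largest first.
    """
    graph = defaultdict(list)
    reverse = defaultdict(list)
    nodes = set()
    for src, dst in dependencies:
        graph[src].append(dst)
        reverse[dst].append(src)
        nodes.update((src, dst))

    # Pass 1: record files in the order their depth-first search finishes.
    visited, order = set(), []
    for start in sorted(nodes):
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, neighbors = stack[-1]
            for nxt in neighbors:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                # Every neighbor has been explored, so this file is finished.
                order.append(node)
                stack.pop()

    # Pass 2: walk the reversed graph, starting from the last file to finish.
    assigned, groups = set(), []
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        group, todo = [], [start]
        while todo:
            node = todo.pop()
            group.append(node)
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    todo.append(prev)
        if len(group) > 1:  # a lone file is not a cyclic group
            groups.append(sorted(group))
    return sorted(groups, key=len, reverse=True)


# A depends on B, B on C, C back on A; D sits outside the cycle.
example = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")]
print(find_cyclic_groups(example))
```

In this small example A, B, and C form one cyclic group, while D is left out because nothing it depends on leads back to it.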

They might be contained within an individual module or managed by an individual team, but sometimes you could have a core that spans files that are maintained or owned by many different teams in your organization, and maybe even at different sites where it's difficult to communicate about them.

The trouble with this is that cores propagate defects. If you make a change to one of the files in a core, it could have a negative impact on files that the developer knows nothing about and has nothing to do with, causing a bug in a different part of the organization. Likewise, someone in a different part of the organization could be creating a bug for you by modifying a file without understanding its long-term or broader impact.

Another issue with having cores is that it slows down people's productivity, because all of these team members have to be reviewing each other's code in the same meetings, worrying about bugs that they could be passing on to others. They are operating much more slowly than they could if they could confidently make changes to their code while believing that it's not gonna have a broader impact on the organization.

A third impact that you can have because of cores is troubles with refactoring. Refactoring is an attempt to fundamentally restructure the nature of a codebase in ways that don't change its functionality, but make it more maintainable in the future. However, if you're gonna refactor a core, you might find yourself in a situation where you're pulling on one of these lines of thread and that causes you to pull another thread and causes you to pull another thread until you finally give up, throw your hands in the air and realize that you can't change this codebase. Instead, you just have to throw it out and start over.

Cores are kind of like a tangle in a fishing line. I have children, sometimes they pop their spool. I then stand there for 30 minutes trying to untangle their line while they use my pole to fish. Now, this wouldn't be a hard time if I knew what I was doing. If I knew exactly which thread to pull at exactly which time I would be able to untangle this mess in about 30 seconds. But I don't because it's a very complex problem. And as that tangle gets bigger and bigger, I'm more likely to think that I should just cut the line and start over again.

That's similar to what you have to do in software. As cores become bigger and bigger, it becomes harder for your organization to understand how to deal with them. You get more and more bugs, you get more and more lost productivity, and eventually, unfortunately, people decide that they have to throw the codebase out and start over again.

So how do we think about quality in a codebase? What is code quality and what is design quality? Code quality is a bottom up concept. If you open up each individual source code file and look at its contents, you can see that some functions are great and some are overly complex. Some lines of code might have problems on them. If you look at the contents of individual files, you're thinking about them in isolation.

One way to think about the difference between code quality and design quality is to think about bricks in a house. If you measure the strength of every brick, you're thinking about code quality. If you look at the strength of the mortar, or look at whether the bricks come together to form a wall or just a big pile of bricks, you're thinking about architecture or design quality. One is a bottom up concept. The other is a top down concept. You can do the same thing with code. If you look at the contents of each individual source code file and think about whether the functions or methods or classes inside it are good, then you're thinking about code quality.

Code quality measures might be finding bad smells, finding poor comments, finding overly complex functions, finding places where there are security vulnerabilities on individual lines of code. Architecture quality is different. Instead of it being a bottom up concept, it's a top down concept. To think about architecture or design quality, we're ignoring all of the information about the contents of those files, and instead looking at the relationships between them. If file A calls file B because there's a function in A that calls a function in B, then we can think that there's a dependency between A and B. If we look at all of these dependencies between all of the files, then we can use graph theory or network theory to think about architecture.
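As a rough sketch of what "looking at the relationships" can mean in practice, the Python below turns a list of function-call records into a file-level dependency graph and then asks which files a given file depends on directly and indirectly. The record format, the function names, and the file names are assumptions made for illustration; they are not taken from CodeMRI®.

```python
from collections import defaultdict, deque


def build_file_graph(calls):
    """Map each file to the set of other files it calls into.

    `calls` is an iterable of (caller_file, callee_file) pairs. Calls that
    stay inside one file are ignored, because they say nothing about the
    relationships between files.
    """
    graph = defaultdict(set)
    for caller, callee in calls:
        graph[callee]  # make sure every file appears as a node
        if caller != callee:
            graph[caller].add(callee)
    return dict(graph)


def reachable_from(graph, start):
    """Return every file `start` depends on, directly or indirectly.

    If `start` sits in a cycle, it appears in its own result.
    """
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Illustrative call records: the GUI calls business logic, which calls a utility.
calls = [
    ("gui.py", "logic.py"),
    ("logic.py", "util.py"),
    ("logic.py", "logic.py"),
]
graph = build_file_graph(calls)
print(sorted(graph["gui.py"]))                 # direct dependencies
print(sorted(reachable_from(graph, "gui.py"))) # direct and indirect
```

Here gui.py depends directly only on logic.py, but indirectly it also depends on util.py, which is the kind of relationship that is invisible when each file is examined in isolation.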

We can use that graph to reason about whether a codebase is modular, whether it has tight APIs, whether it's structured hierarchically, whether there are layers, whether there's commonality. All of these things are architectural properties because they're about the relationships instead of the contents of those files.

Cores are an important architectural property that you can measure in a codebase, and that's what we're talking about today. Cores represent the breakdown of modularity and hierarchy in a codebase. We at Silverthread have done a lot of work to measure statistically the correlation between code quality and design quality.

There are a variety of code quality measures and design quality measures that you can automatically capture from a codebase. Many of these code quality measures can come from a variety of tools that have been around since the 1970s. Silverthread, via our research at MIT and Harvard, has invented multiple ways of measuring architectural health using graph theory. All of these measures of quality can go into statistical tests where we can correlate them with business outcome information that we can capture from an organization.

In these studies, we've measured the impact of code and design quality on capability development, the productivity of new development, the ability to ship features on time, cost or waste associated with the development process, and optionality. What I mean by optionality is that if you have a modular codebase, you can change one part of it without having to change a different part of it. This also impacts risks. We've done studies around defects. Code and design quality have a big impact on defect density, time to fix bugs, the amount of developer time spent fixing bugs, the criticality of those bugs, and also the extent to which those bugs are released and have a customer impact. Finally, code and design quality impacts security.

So what is the impact that cores have? What impact do they have on your software economics, your business performance, productivity, defects, and other things that you might care about?

Let's look into the details of what it means to be a core. When you think about a codebase, architecturally it can be healthy or unhealthy. In a healthy codebase, it's structured as a hierarchy of modules. That means it has a clear top, middle, and bottom. At the top, you might have a graphical user interface or GUI.

In the middle, you might have middleware or business logic or other code that is the heart of your application. At the bottom, you might have common shared utilities, things like loggers or drivers or databases or operating systems, things that are generic and can be applied on other projects. If your codebase is structured as a hierarchy of modules, it benefits for a variety of reasons.

It can evolve over time as you change modules in and out. It also is easier for humans to understand. A codebase may be enormously complex in its totality, but if you break it down into four parts as we've done here, then what you've done is you've taken a big system with 400 IQ points of complexity and broken it into four parts, each with 100 IQ points of complexity.

That way, teams of real humans can be assigned to different parts of this codebase. Each can understand their part and they can coordinate effectively at boundaries between them. Because a module hides internal complexity and presents simple APIs, you only have to understand 100 IQ points of complexity in your code and have a very simple representation of the code around you.

On the other hand, let's look at our unhealthy architecture. In this codebase, some modules have become too big. They've become useful, and things have been added to them, but they haven't been split into parts as they grew. Instead of having 100 IQ points worth of complexity, this one now has 200 IQ points. As a result, the developers inside that code, perhaps containing a core inside the module, become less productive and experience more bugs.

You might have the circumvention of APIs. Instead of going through an API to get to the contents of a module, people might be cheating and going at its internals. A really bad problem that might happen is that someone might tie something from the bottom of that architecture to the top of it. Maybe someone in the GUI layer invented a new algorithm that's generally useful, and someone in the utility team recognized that and called it directly.
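The bottom-to-top tie described above is something a simple script can flag once you know which layer each file belongs to. The Python below is a hedged sketch of that check; the layer names, their ranking, and the file names are all hypothetical, and a real codebase would need its own mapping from files to layers.

```python
def find_upward_dependencies(dependencies, layer_of, layer_rank):
    """Return dependencies where a lower layer reaches up into a higher one.

    dependencies: list of (source_file, target_file) pairs
    layer_of:     dict mapping each file to the name of its layer
    layer_rank:   dict mapping each layer name to a number, where a
                  larger number means closer to the top of the hierarchy
    """
    violations = []
    for source, target in dependencies:
        source_rank = layer_rank[layer_of[source]]
        target_rank = layer_rank[layer_of[target]]
        if source_rank < target_rank:
            violations.append((source, target))
    return violations


# Hypothetical three-layer system; every name here is illustrative.
layer_rank = {"utilities": 0, "business_logic": 1, "gui": 2}
layer_of = {
    "window.py": "gui",
    "orders.py": "business_logic",
    "logger.py": "utilities",
}
dependencies = [
    ("window.py", "orders.py"),  # top calls middle: fine
    ("orders.py", "logger.py"),  # middle calls bottom: fine
    ("logger.py", "window.py"),  # bottom calls top: breaks the hierarchy
]
print(find_upward_dependencies(dependencies, layer_of, layer_rank))
```

Only the last dependency is reported, because it is the one that ties the utility layer back up to the GUI. Together with the two downward dependencies it also closes a cycle across all three files.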

We have found that 80% of code bases have significant architecture problems of one kind or another. Now, when we're thinking about the health of an architecture, we're thinking in graph theoretic terms. When we're looking at the graph of a code base, we can use algorithms to find places where it's a platform with plug-ins, where there's modularity, where there are good APIs, where there's a lot of reuse or commonality, where it's hierarchical.

These are all good things. We also find cyclicality, places where file A calls B, calls C, and then that calls back to A. That's a core. A core represents a breakdown of hierarchy and modularity.

When we're thinking about the health of a codebase from both a code quality perspective and an architecture perspective, we want to look at it in this way. When you think about code quality, you're thinking about the contents of individual files, as we talked about. In this picture, a red dot represents a bad file, and a blue dot represents a good one.

If all you have is code quality information, all you understand is how many dots are red and how many are blue. But let's look at the architecture. In the healthy codebase, it's modular. You have four individual pieces within this code base that each contain a lot of complexity. Let's say each contains 100 IQ points worth of complexity.

If it does, then humans working in teams, working on those individual sections of code, can function. Those teams work with each other through APIs that present a simplified representation of other modules in the system. In this way, humans can be assembled into teams and work on different parts of enormously complex systems and still all function while making changes confidently, understanding their code and not breaking each other.

Because the system is arranged hierarchically, it means at the top you can have graphical user interfaces. In the middle, you can have middleware or business logic. At the bottom, you might have shared common utilities. These common utilities could be reused in other codebases. Let's look at the unhealthy architecture on the other hand. In an unhealthy architecture, maybe one of these modules grew too big. It was originally useful, and so people kept adding to it without dividing it at some point. A healthy codebase, as it grows, will go through cell division. It will start as one module and then become two, and then four, and then eight.

And in this way, you can continue to add complexity and functionality to a codebase as it grows while continuing to maintain it in a state that all of the humans inside it can function. Unfortunately, that didn't happen here in the unhealthy architecture. Now if you are interested in learning more about the technical health of a codebase and architecture quality specifically, you can go to our customer knowledge base.

On these pages, we describe architecture and design quality and architecture quality principles. Let me click on this link and take you to the page on that topic. What you'll see here is that we've discussed modularity, hierarchy, reuse, layering. The impact of design quality, how to think about a codebase as a network. We have examples of code dependencies. We have an example of a dependency structure from a Linux kernel map. We have a discussion about the difference between code and design quality and how you think about indirect and direct dependencies. How you think about the hierarchy of a codebase. How you can measure and understand modularity.

How you can think about the hierarchical properties of codebase, reuse and utilities, layers, platforms and plugin architectures, and the breakdown of architectural health represented by cores and cyclic groups.

Now, when we look at a codebase as a hierarchy of modules, we can think about what happens when that four module codebase turns into a nine module codebase as it grows over time. Now you've got nine teams of people working in nine different modules. Some of those teams are happy. They're in the hierarchical modular part of your codebase. They can operate independent of each other. They can make changes confident that those changes won't have negative impacts on other teams. However, some of these teams are now wrapped up in a core. Unfortunately it's worse, it's a core that's not just within a module. It's a core that spans modules and spans teams.

You might have seven people in each team, but collectively they form a team of almost 30 people. 30 people who don't understand that they are on the same team because they're not in the same org chart. However, they are. They're one large, unhappy team. They're constantly tripping over each other, creating bugs for each other.

They have to be in each other's design reviews and code reviews. Collectively, they fix a lot of bugs. They're highly unproductive. They miss releases and they're mad at each other all the time. This core, which developers think is four modules, is really one module. One big module that requires developers to have an IQ of 400 to anticipate the side effects of the changes they're gonna make.

It requires an IQ of 400 to successfully refactor this codebase without the assistance of tools to help. We have found that cores exist in 80% of the codebases that we find in the wild. It's not anyone's fault. It's an entropy process. Architecture complexity represented by cores is a natural growth process that happens in any codebase if you don't have processes and tools to check their emergence and growth.

What do we mean when we say that something is a critical core or an emerging core? If you look in our reports and find the cores, you'll see that we might classify them as one of those two things. A critical core is a core with more than 150 files. An emerging core is a core that contains between 30 and 150 files.

So how did we decide on those thresholds? We worked with the Software Engineering Institute on a project looking at hundreds of DoD codebases. In that survey, we benchmarked, profiled, spoke with development teams, and decided that 150 files would be a good threshold for a critical core. One of the reasons was that every time we spoke with a team that had a core at least that size, they were having problems. Developers working in those cores were unproductive, they had a lot of bugs, there were organizational issues, and they had delivery issues. In those codebases we found that cores were directly related to the organizational issues being experienced. Now, to pick the threshold for an emerging core, we looked at the statistics. When we do studies statistically correlating architecture issues with business outcomes, we start to detect a statistically significant increase in defect density or decrease in productivity when cores start to grow beyond 30 files.
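The two thresholds can be written down as a small classification function. This is only a sketch of the rule as stated in this video: more than 150 files is critical, and between 30 and 150 files is emerging. Whether the boundaries at exactly 30 and exactly 150 are inclusive is an assumption made here, and the function name and the "below threshold" label are invented for the example.

```python
CRITICAL_ABOVE = 150  # "more than 150 files"
EMERGING_FROM = 30    # "between 30 and 150 files", assumed inclusive


def classify_core(file_count):
    """Label a cyclic group by the number of files it contains."""
    if file_count > CRITICAL_ABOVE:
        return "critical"
    if file_count >= EMERGING_FROM:
        return "emerging"
    return "below threshold"


for size in (200, 150, 30, 12):
    print(size, classify_core(size))
```

A core that falls below the emerging threshold is still a cyclic group, so it may be worth watching even though it does not yet carry either label.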

So let's explore the cores in your codebase using CodeMRI®. Here we see the CodeMRI® Portfolio report. In this report, you can see that there are several codebases. In this case, we have Apache, we have Axis 2, and we have the Linux kernel. We can also see that for each codebase, we have several versions, each representing a different point in time.

Using those trends, we can see if things are moving in a good direction or maybe not in a good direction. Now, if we look at the software quality columns, we can look at cyclicality and modularity. Both are scores related to cores and other architecture problems that you might care about. When we look at the scores, each might be between zero and a hundred.

We're scoring your codebase relative to thousands of others that we have measured in our benchmark data set, and then thinking about it as a percentile in terms of health. You can use the CodeMRI® Portfolio product to look at your portfolio, compare it against thousands of other benchmarks, compare it against itself via trends over time, and see if there are challenges that you will want to explore more deeply using CodeMRI® Diagnostics.

One of the codebases in the portfolio we just saw is Axis 2. In this example, we're gonna explore Axis 2 version 1.7.9. Axis 2 is a codebase that contains 3,000 Java files. It contains over 500,000 lines of code. Now, within that codebase, what we can see is that it has two critical cores and two emerging cores.

The critical cores encapsulate 107,000 lines of code. The emerging cores contain more than 40,000 lines of code. That represents a large portion of the overall size of the system, and in those cores, we predict that this organization might experience more problems than outside of them. Now, let's look into these cores more closely.

If we open up this file list, we can see all the files that they contain. In this view, if you look at the Cyclic Group Size column, you can see that there's a core that contains 177 files. If you look at the Core ID column, it'll tell you an identifier for all the files in that core. Core ID one is assigned to the core that's the largest one in the system. Now, in core one, which is a critical core, we see that it contains MessageContext.java, AxisService.java, and others. As we scroll down in this file, we can look at the other cores as well. For example, we see that there is another critical core in this system.

It contains 161 files. As we scroll down, we find two emerging cores. There's an emerging core with 46 files and another emerging core with 45 files. Below that, we can see that there's a core that doesn't meet the threshold of an emerging core, but nevertheless represents a problem that you might want to address before it becomes too big.
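If the file list is exported as comma separated text, the same grouping can be done in a script. The sketch below assumes a CSV export whose header includes the Core ID column shown in the report plus a column holding the file name; the name "File" for that column is a guess, the sample rows are made up, and the real export layout may differ.

```python
import csv
import io
from collections import defaultdict


def group_files_by_core(csv_text, file_col="File", core_col="Core ID"):
    """Group file names by core identifier.

    Rows with an empty core identifier are treated as files that are not
    in any core and are skipped.
    """
    cores = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        core_id = (row.get(core_col) or "").strip()
        if core_id:
            cores[core_id].append(row[file_col])
    return dict(cores)


# Made-up sample rows; a real report would list far more files per core.
sample = """File,Cyclic Group Size,Core ID
Alpha.java,3,1
Beta.java,3,1
Gamma.java,3,1
Delta.java,2,2
Epsilon.java,2,2
Standalone.java,,
"""

for core_id, files in group_files_by_core(sample).items():
    print(core_id, len(files), files)
```

Counting the files in each group gives a quick cross-check against the Cyclic Group Size column.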

So in order to find the cores in your system, look at the Cyclic Group Size column, look at the Core ID column, and look at the files that they contain. Now to explore even more deeply, we might wanna look at this view. This shows us all of the dependencies between files in your system, including the dependencies that connect one core file to another core file.

In this view, you can see that MessageContext.java depends on another file called AbstractContext.java. We can get more detail than that. We can see that it is a Java call. This might be one Java method calling another Java method. Those two Java methods exist in those two files. We can see the exact line of code that that call happens on so that you can open up the file in your IDE and explore where the call happens.

Now here's another way to get at this information. We can use the CodeMRI® Query Interface. CodeMRI® presents a comprehensive command line interface useful from both Linux and Windows to ask questions. In this command called System Query Files, we can pass the name flag and ask about MessageContext.java. The report that comes back tells us that it is in a core of size 177.

It tells us that it is in Cyclic Group ID one. We can use the command line interface to ask a variety of questions. In this video, we're not gonna go through many of them. However, you can ask questions about all of the files in the system, which files they're connected to, how they're connected to those files, the entities within each, and how they're interconnected.

We can ask questions about the cores that they're in. The utility of this command line interface is that it allows us to ask questions and then integrate CodeMRI® into other tooling. You might want to use this to get information that would cause you to throw warnings or errors in a build. You might wanna put this information into reporting in your DevOps pipeline.

You might want to integrate this information into business intelligence tools. Anything you can script, you can use to inform your organization using whatever reporting tooling you like. Now, the command line interface is useful for a variety of purposes. You can ask questions in an interactive shell, but you can also do it programmatically.

You can call CodeMRI® from inside your scripts, your Python scripts, your Perl scripts, Bash scripts on Linux, PowerShell scripts on Windows, any way that you might wanna call our tool and use its output to integrate into other tooling that you might have, such as DevOps systems or business intelligence systems.

This allows you to write any code you want, take our data and format it how you wanna see it. JSON is only one format that you might want to use to integrate into other tooling that you might have. CodeMRI® allows you to get information out in tabular form, in comma separated files, in JSON, and in other formats that might be helpful for integration. In another video, we're gonna explore how to use the command line query interface and how to use the command line system administration tools in much more depth.
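As a rough sketch of what such an integration might look like, the Python below runs an arbitrary command, reads JSON from its output, and returns a non-zero exit code when any file sits in a core above a chosen size, which is the kind of signal a build could use to raise a warning or an error. The exact CodeMRI® command syntax is not shown in this transcript, so the command is left for the caller to supply, and the JSON field names ("file", "cyclic_group_size") are hypothetical placeholders rather than the tool's actual output schema.

```python
import json
import subprocess


def run_query(command):
    """Run a command given as a list of arguments and return its stdout.

    The caller supplies the full command; nothing about the CodeMRI®
    command line is assumed here.
    """
    completed = subprocess.run(command, capture_output=True, text=True, check=True)
    return completed.stdout


def files_in_large_cores(report_json, max_core_size):
    """Return the names of files whose core is larger than `max_core_size`.

    Assumes a JSON array of objects with hypothetical keys "file" and
    "cyclic_group_size".
    """
    records = json.loads(report_json)
    return sorted(
        record["file"]
        for record in records
        if record.get("cyclic_group_size", 0) > max_core_size
    )


def build_gate(report_json, max_core_size=150):
    """Print a warning per offending file and return a process exit code."""
    offenders = files_in_large_cores(report_json, max_core_size)
    for name in offenders:
        print(f"warning: {name} is in a core larger than {max_core_size} files")
    return 1 if offenders else 0


# Made-up report text standing in for real tool output.
sample_report = json.dumps([
    {"file": "Big.java", "cyclic_group_size": 200},
    {"file": "Small.java", "cyclic_group_size": 12},
    {"file": "Free.java"},
])
print(build_gate(sample_report))
```

In a pipeline, the text passed to build_gate would come from run_query with whatever command your installation uses, and the return value would be handed to the build system as the step's exit status.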

If you would like to explore this issue more deeply, please look at the books and papers written by our team. For example, Design Rules: The Power of Modularity is written by Carliss Y. Baldwin, one of our founders.

The Hidden Structure paper, published in Management Science, goes through the science of cores and an exploration of the metrics associated with them. Finally, Modular Architectures Make You Agile in the Long Run is a paper that I wrote that explores cores in the wild, in industry and in DoD, and the impact that they have had on organizations.

For more information, go to Silverthread's website or our knowledge base to find the information you need. So thank you for watching this video about cores. If you'd like more information, please go to our website or look at our customer knowledge base or watch our other videos. There you can get more information about Technical Health, Software Economics and Application Modernization.

Thank you for watching.