CodeMRI® Code Duplication Training Video

 

Silverthread CodeMRI® Training Video: Code Duplication

 

 

Training Video Transcript

 

Code Duplication - Concept

Hi, my name is Dan Sturtevant. Today we're going to be talking about code duplication. So, what is code duplication? Developers work in a directory structure containing tens, hundreds, or thousands of source code files. Many people may add to or modify that code over decades. As they look through the code trying to enhance it, they may come across a code snippet, an algorithm, or a function that is useful, or that might be useful with only minor modifications. They might also remember code from a different code base that would be useful in this one.

When creating something new, a developer has three choices. First, they can write something completely new. Second, they can modify the existing code so that it does what it always did and also does what is needed now. Third, they can duplicate it: make a copy of the code, put it where it is more convenient for their current needs, and possibly modify it a little bit. Each of these three options has its place.

Here's an example of duplicated code, and here's another example, and here's another. Sometimes code blocks are copied repeatedly in the same file. Sometimes code blocks are copied between files. Sometimes the same block of code will be copied many times: three times, dozens of times, or, in one recent case I saw, 608 times.

Sometimes entire files are copied. Sometimes entire directories or libraries are duplicated. After code is duplicated, it often drifts apart slowly as different developers change it for different reasons. Eventually, as each evolves, the two copies may become unrecognizable.

These three choices also exist at large scale. When two separate organizations or business units both want to enhance a system and they have contrary goals, they face the same options. First, one could redevelop the system from scratch. Second, they could work together to modify the same system for both of their needs. Finally, they could each take a duplicate copy of the entire code base, known as a fork, and proceed independently.

Code duplication has benefits and is tempting sometimes. Often, duplicating code may seem like the safest and fastest way to get the immediate job done. You don't have to reinvent the wheel. You also have complete control of the copied code and can modify it to your liking. You don't have to worry about breaking someone else's stuff that already exists by changing it for your purpose.

Unfortunately, code duplication sometimes presents a trap. There can be a big difference between which choice is better in the short run and which is better in the long run. There's also a difference between the individual pressures each developer or team faces versus what is good for the whole organization.

If code duplication happens repeatedly, its buildup sometimes creates long-lasting problems that can far outweigh the benefit of doing it in the first place. Here are the problems.

The first problem is that duplicated code is costly to maintain. I've seen situations where a code base has 10 million lines, but if all the duplication were removed, it would have only 2 million. If you can do the same thing in 2 million lines that you can do in 10 million, you really want to do your best to maintain only the 2 million lines of code. If not, you have a lot of unnecessary overhead.

The second problem is that bugs don't get fixed. If you have 10 copies of the same function and a bug report comes in, developers will only fix the bug in the spot the customer tripped over, and they won't know to look for it in the nine other places where it is. 
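As a small illustration of this second problem, here is a minimal sketch in Python. The function names are hypothetical and do not come from any real code base. One pasted copy has been fixed after a bug report, and the other copy still carries the same bug.

```python
# Hypothetical example: one helper was pasted into two places in a code base.

def average_order_value(values):
    # Copy 1: a customer hit a crash on an empty list here, so it was fixed.
    if not values:
        return 0.0
    return sum(values) / len(values)


def average_invoice_value(values):
    # Copy 2: the same pasted logic, but nobody knew to fix it here.
    return sum(values) / len(values)


def crashes_on_empty(func):
    """Return True if func still raises ZeroDivisionError on an empty list."""
    try:
        func([])
    except ZeroDivisionError:
        return True
    return False


print(crashes_on_empty(average_order_value))    # the fixed copy
print(crashes_on_empty(average_invoice_value))  # the forgotten copy
```

With a single shared function, the fix would have reached every caller at once.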

A third problem is that you're missing out on the benefit of good forms of reuse. Code that is relied upon by many other files, or that is packaged into a library and then used by many systems, becomes a utility. Over time, it is thoughtfully made to be generic so that it can be easily repurposed.

An example of a generic utility is a water faucet or an electrical outlet. You always know what you're going to get. You can use it to fill a cup of any shape or plug in any kind of appliance. It does one job, and it does it well. The electrical outlet doesn't care that some new kind of device was invented. It just has to keep pumping electricity at the proper voltage. A utility is a gift that keeps on giving.

Similarly, once utility code is made sufficiently generic, it usually doesn't have to change much. General-purpose code, often found in libraries, can be reused in a healthy way and be a development and economic force multiplier. Also, these utilities are often extremely reliable, with very few bugs. When Silverthread runs our statistics, we find that utilities have the lowest defect density.

Finally, Linus's Law, named after Linux creator Linus Torvalds, may say it best: given enough eyeballs, all bugs are shallow. If the same piece of code is used by lots of people for lots of purposes, then the community of people looking at it will be more effective.

We want to encourage the good kind of reuse in the form of utility and library creation and not the bad kind, which is code duplication. So, what should we do about it? CodeMRI® can be used to help you remediate the situation. You can use CodeMRI® to find where all the code duplication is. CodeMRI® reports can show you how many duplicative lines exist in a code base.

 

They can also tell you how small your code base would be if all the duplicated lines were consolidated. CodeMRI® can show you which files are the most affected; some may have only a little, while others have a lot. You can look at each instance of duplication to see how big the copied block of code is. You can see how many copies of that code chunk exist, find each of those copies in your code base, and inspect each in your editor. You can look at how different each of those copies has become because of drift. So, code duplication is important to pay attention to in your code.

Don't duplicate code. Put it in a library or a utility.

Thank you.

 

Code Duplication - Application

Hi, I'm James Hamlett. I'm a senior software architect with Silverthread, and in this video we're going to look at using CodeMRI® to identify code duplication in a code base. So, what do we mean by code duplication? Code duplication is a set of multiple instances of the same code recurring within a code base or perhaps across code bases.

An example of that might be something as simple as a function that takes two integers and returns an integer. If this exists in multiple places, then we have code duplication. There are numerous advantages to reducing code duplication.
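To make that concrete, here is a minimal sketch in Python of what such a duplicate might look like. The function names and the arithmetic are hypothetical, chosen only for illustration. The same two-integer function appears twice under different names, as if it had been pasted into two modules.

```python
# Hypothetical duplicate: the same two-integer function pasted into two modules.

def pad_width_in_report_module(width: int, margin: int) -> int:
    return width + 2 * margin


def pad_width_in_export_module(width: int, margin: int) -> int:
    return width + 2 * margin


def behave_identically(f, g, samples):
    """Check that two functions return the same result for every sample input."""
    return all(f(a, b) == g(a, b) for a, b in samples)


samples = [(0, 0), (10, 2), (-5, 3)]
print(behave_identically(pad_width_in_report_module,
                         pad_width_in_export_module, samples))
```

The two copies agree today, but nothing forces them to keep agreeing once different developers start changing them.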

First and foremost, it reduces the source lines of code, or SLOC, footprint. It shrinks the code base. A smaller SLOC footprint means there are fewer places for errors to occur. Reducing SLOC also likely reduces McCabe complexity, and it possibly reduces references to outside sources or dependencies. Reduced instances of duplication mean reduced instances of bugs. For instance, if you have duplicated code in multiple places and there's a bug in one of those places, that means there's a bug in all of those places.

In previous videos, we've covered how to use CodeMRI® to scan a code base and to generate reports that contain metrics on the code. For this video, I've downloaded an Apache public code base, scanned it, and generated reports with those metrics.

Now, if we open the main diagnostic report, we see the summary page. There are about half a million lines of code.

So this is a pretty good example. If we look in the file list, we will see all the data, all the metrics that CodeMRI® presents. How can we find code duplication? Scanning a code base with CodeMRI® will reveal duplicate code across it. I've downloaded an Apache public code base, scanned it, and generated reports. Here we have about half a million lines, and that's a pretty good size for looking for code duplication.

If we click on the file list tab, we will see a list of all the files in the code base that were scanned, along with all the metrics that CodeMRI® generated for each file. As we can see, CodeMRI® generates a wealth of information for every file across numerous metrics, including file cores, visibility, cyclomatic complexity, lines of code, and code duplication.

So, let's remove some of this so we're just looking at code duplication. Okay, I've hidden most of the columns except for File Name, Source Lines of Code, and Percent Duplicate Lines of Code. I've sorted this in descending order by Percent Duplicate Lines of Code. This report gives us an overall view of the amount of duplicate code in each file. We're going to move on to the next report, which shows us sets of duplicate code and exactly where they're located.
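The same kind of ranking can be sketched in a few lines of Python. This is only an illustration: the rows below are invented, and the layout is an assumption rather than the actual CodeMRI® report format.

```python
# Hypothetical rows: (file name, source lines of code, duplicated lines).
# These values are invented for illustration and are not from a real report.
rows = [
    ("Parser.java", 400, 40),
    ("Util.java", 200, 150),
    ("Main.java", 1000, 0),
]


def percent_duplicate(sloc, duplicated):
    """Percentage of a file's source lines that are duplicated."""
    return 100.0 * duplicated / sloc if sloc else 0.0


def rank_by_duplication(rows):
    """Sort files in descending order by percent duplicate lines of code."""
    ranked = [(name, sloc, percent_duplicate(sloc, dup)) for name, sloc, dup in rows]
    return sorted(ranked, key=lambda r: r[2], reverse=True)


for name, sloc, pct in rank_by_duplication(rows):
    print(f"{name:12} {sloc:6} {pct:6.1f}%")
```

Sorting by percentage rather than by raw duplicated line count brings small, heavily copied files to the top of the list.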

So, let's open that up. I've opened the code duplication report that gets generated along with all the other reports, and I've clicked on the duplication instances tab. This shows us sets of duplicate code in different files and groups them together. It shows us how many lines there are, what line they start at, and what line they end at.

Now I'm going to sort this by Set ID so we can see what the sets look like. Duplication instance ID is just what it sounds like: it's an ID for duplicate instances of code. Every instance of the same duplicated code is represented here with the same ID, and they get grouped together. I've outlined some of the boxes to make this easier to see, so we can now see various sets of duplication.

I've highlighted a set of files that share a duplication instance ID: GMonth.java. This file lives in two different directories. It has 247 lines of code that are duplicated. Both instances start at line 150, and both end at line 396. Now let's take a closer look. We'll open up the code side by side and see what is duplicated.
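Grouping by instance ID can be sketched as follows in Python. The instance IDs, directory names, and the second set of rows are hypothetical; only the line numbers 150 and 396 come from the GMonth example. Because the line ranges are inclusive, a block from line 150 to line 396 is 396 - 150 + 1 = 247 lines long.

```python
from collections import defaultdict

# Hypothetical rows: (instance ID, file path, start line, end line).
# The IDs and directory names are made up; 150 and 396 match the GMonth example.
instances = [
    (7, "dir_a/GMonth.java", 150, 396),
    (7, "dir_b/GMonth.java", 150, 396),
    (9, "dir_a/Other.java", 10, 19),
    (9, "dir_b/Another.java", 42, 51),
]


def group_by_instance(rows):
    """Group duplicated blocks by their shared instance ID."""
    groups = defaultdict(list)
    for instance_id, path, start, end in rows:
        # Line ranges are inclusive, so the block length is end - start + 1.
        groups[instance_id].append((path, start, end, end - start + 1))
    return dict(groups)


for instance_id, copies in sorted(group_by_instance(instances).items()):
    print(instance_id, copies)
```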

This is the GMonth file from two different directories within this code base from Apache. It's open source, so let's take a look through it.

So far these seem to be just about the same file. As we scroll through, let's see if we notice any differences. Now, this is just casual observation. In production, I would use a compare function to verify that everything here is indeed the same. As we scroll through, we can see that this does look like a file that has been copied and pasted between a couple of directories. So, in this case we have two files that are the same.
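One way such a compare step might look, as a minimal sketch using Python's standard `filecmp` and `difflib` modules. The helper names `files_identical` and `diff_files` are hypothetical, and the demo uses temporary stand-in files rather than the real GMonth sources.

```python
import difflib
import filecmp
import tempfile
from pathlib import Path


def files_identical(path_a, path_b):
    """Compare file contents byte for byte (shallow=False skips the stat shortcut)."""
    return filecmp.cmp(path_a, path_b, shallow=False)


def diff_files(path_a, path_b):
    """Return unified-diff lines; an empty list means the text is the same."""
    a = Path(path_a).read_text().splitlines()
    b = Path(path_b).read_text().splitlines()
    return list(difflib.unified_diff(a, b, str(path_a), str(path_b), lineterm=""))


# Demo with two temporary stand-in files (not the real GMonth sources).
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp, "copy_one.txt")
    second = Path(tmp, "copy_two.txt")
    first.write_text("line 1\nline 2\n")
    second.write_text("line 1\nline 2\n")
    print(files_identical(first, second))   # identical copies
    second.write_text("line 1\nline 2 changed\n")
    print(files_identical(first, second))   # one copy has drifted
    print("\n".join(diff_files(first, second)))
```

The byte-for-byte check answers whether the copies are still the same, and the diff shows how far they have drifted when they are not.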

That's duplicate code. Here is one way to refactor this. We have the GMonth file living in types/XD and types/Soapencoding. We could move that file to another directory and then change all the references in XD and Soapencoding to point to the other directory. That way, we remove one of these files, thereby reducing code duplication, reducing where errors can live, and reducing our SLOC footprint. We have a lot of cases like that. In this code base, we have a lot of files that have been copied between directories, so this is very low-hanging fruit.

This is easy to fix. This is easy to refactor. Of course, that's only one way to refactor it. We may want to do it other ways, but the easiest way is to pick one of these files, move it somewhere else, and then have everybody point to that.

Now let's take a look at another example. These are all pretty small instances of duplicate code. If we scroll through here, we see that it grows a bit, and here is one: duplication instance ID 51. Starting there and going down to here, that's about 80 different files that all share duplicate code. They all have 163 lines of duplicated code.

They start at different lines and end at different lines. So, this likely isn't a file that was copied and pasted; it was probably functions that were cut and pasted. How would we refactor this one? If we look a little closer, we can see that about half of these files live in one directory and the other half live in a neighboring directory.

One possibility for refactoring would be to reduce these 80-plus instances down to one or two instances. Since they're contained within one or two directories, we could make another file that's close to these two and move the code over to it as a function. Then, everywhere that code appeared, we'll call that new function instead.

By doing that, we reduce the code base by about 13,000 lines: 163 lines across 80 files comes to roughly 13,000 lines of code. So that's quite a savings with a very easy refactor. So you can see here that while code duplication is not a huge deal between two files, it can grow and it can get pretty big.
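The arithmetic behind that estimate can be checked with a short Python sketch. It assumes exactly 80 copies, although the walkthrough says "about 80", and it ignores the few lines each file would need in order to call the shared function.

```python
# Numbers taken from the walkthrough: a 163-line block appearing in about 80 files.
block_lines = 163
copies = 80

total_duplicated = block_lines * copies          # lines occupied by all copies
saved_if_one_copy = block_lines * (copies - 1)   # lines removed if one shared copy remains

print(total_duplicated)    # 13040
print(saved_if_one_copy)   # 12877
```

Either way, the savings round to about 13,000 lines.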

One possibility for refactoring would be to extract one instance of this code and move it to a location that both of these directories can see, or possibly to one or two locations that these two directories can see. Then, everywhere this code exists, remove it and instead call the new location. By extracting the code into a function, moving it somewhere these files can see it, and then calling it, we reduce our code base by about 13,000 lines. Now, 163 lines of duplicated code doesn't sound like a big deal until it's duplicated across 80 or so different files.
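A minimal sketch of this kind of extraction, in Python. All names here are hypothetical and the duplicated block is deliberately tiny; the real block in the walkthrough is 163 lines long.

```python
# Hypothetical "before": the same validation block pasted into two functions.
def load_user_before(record):
    name = record.get("name", "").strip()
    if not name:
        raise ValueError("name is required")
    return {"kind": "user", "name": name}


def load_group_before(record):
    name = record.get("name", "").strip()
    if not name:
        raise ValueError("name is required")
    return {"kind": "group", "name": name}


# "After": the shared block lives in one function that both callers can see.
def required_name(record):
    """Return the trimmed name, or raise ValueError if it is missing."""
    name = record.get("name", "").strip()
    if not name:
        raise ValueError("name is required")
    return name


def load_user(record):
    return {"kind": "user", "name": required_name(record)}


def load_group(record):
    return {"kind": "group", "name": required_name(record)}


print(load_user({"name": " Ada "}))
```

After the extraction, a bug in the validation logic has to be found and fixed in only one place.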

It really adds up, and reducing this is a tremendous savings. In terms of refactoring, this is what I like to call low-hanging fruit. There's a very high return on investment. Extracting code into another function is a fairly simple exercise, and what you gain back from it is tremendous. It reduces the footprint where bugs might exist. In this case especially, if a bug exists in one of these functions, it exists in 80 places.

So, reducing this down to one or two instances also makes software maintenance a lot easier, because now the bug lives in just one place. So that is one duplication instance ID. That's one set of files that share duplicate code. If we scroll through this file a little bit, we'll see there are a lot of other sets of files that are fairly large.

 

So, if one of these sets will reduce this code base by 13,000 lines, and looking through here I'm counting five, six, seven, eight, then just by doing this we could probably reduce this code base by about a hundred thousand lines. And if I remember correctly from when we first looked at it, the entire code base is 500,000 lines.

So, we could reduce this by, what's that, 20% with a fairly easy exercise. And this makes this code base much more maintainable.

We've seen that we can use Silverthread's CodeMRI® tool to scan a code base, generate reports, and report in detail on code duplication in a code base. We've seen how to examine the detailed information and then look into the code base to examine it, and we've discussed a few methods for refactoring and reduction.

That concludes this tutorial, and we'll see you in the next video.

Thanks for watching!