- Open Access
Understanding automated and human-based technical debt identification approaches-a two-phase study
© The Author(s). 2019
- Received: 24 January 2019
- Accepted: 22 May 2019
- Published: 8 June 2019
The technical debt (TD) concept inspires the development of useful methods and tools that support TD identification and management. However, there is a lack of evidence on how different TD identification tools could be complementary and, also, how human-based identification compares with them.
To understand how to effectively elicit TD from humans, to investigate several types of tools for TD identification, and to understand the developers’ point of view about TD indicators and items reported by tools.
We asked developers to identify TD items from a real software project. We also collected the output of three tools to automatically identify TD and compared the results in terms of their locations in the source code. Then, we collected developers’ opinions on the identification process through a focus group.
Aggregation seems to be an appropriate way to combine TD reported by developers. The tools used cannot help in identifying many important TD types, so involving humans is necessary. Developers reported that the tools would help them to identify TD faster or more accurately and that project priorities and current development activities are important to be considered together, along with the values of principal and interest, when deciding to pay off a debt.
This work contributes to the TD landscape, which depicts an understanding between different TD types and how they are best discovered.
- Technical debt
- Automated technical debt identification
- Human-based technical debt identification
The technical debt (TD) concept brings a new perspective on how software development tasks are discussed and managed. It describes the tradeoff between the short-term payoffs (such as a timely software release) of delaying some technical development activities and the long-term consequences of those delays . According to Avgeriou et al. , TD is a collection of design or implementation constructs that are expedient in the short term, but set up a technical context that can make future changes more costly or impossible. TD presents an actual or contingent liability whose impact is limited to internal system qualities, primarily maintainability and evolvability.
It is common for a software project to incur TD during the development process since small amounts of debt can increase productivity . However, its presence brings risks to the project. Effects of TD can be noticed in different stages of software development due to different types of debt. Low quality, delivery delay, low maintainability, rework, and financial loss are among the top 10 most commonly impactful effects of TD [36, 37].
Automated tools use a metric or analyze an artifact in a way that defines indicators of TD in software projects. Most of these tools have been developed to detect potential debt in the source code;
Manual approaches identify TD items by asking developers or other stakeholders about the presence of debt or by making a manual search for debt in artifacts from the project.
Automated tools are less time consuming and their application is scalable. However, manual approaches are hypothesized to have two advantages over automated approaches. One is that they might be more accurate, i.e., more likely to identify TD that is most significant, while automated analyses may reveal many anomalies that turn out to be unimportant. The other advantage is that human stakeholders might be able to provide additional important contextual information related to each instance of TD (e.g., effort estimates, impact, decision rationale) that is difficult or even impossible to glean from analysis tools.
Both approaches are relevant and have been investigated in different studies ([14, 23, 27, 38, 51, 52]; Morgenthaler [30, 34]). Some of these studies have indicated that it is possible to identify certain classes of potential TD (in particular design debt) with computer-assisted methods (Schumcher et al. [14, 38]). Other studies have shown that code comment analysis can also be used to identify several types of debt such as design, defect, and requirement debt [11, 25]. More specifically, Schumacher et al.  showed that metric-based detection approaches perform well compared to human classification and that their use decreases the effort spent on manual code inspections.
Despite the fact that a number of studies have been conducted regarding TD identification, there is a lack of evidence on how different tools could be complementary and, also, how human-based identification compares with them. The work presented in this paper intends to investigate this area by performing a two-phase study, involving several complementary research methods, considering both automated and human-based TD identification approaches. The goal of the first phase of the study was to compare human elicitation of TD to automated TD identification. We studied three automated approaches (code smells, automated static analysis issues, and collection of code size and complexity metrics), and how their output (in terms of the location in the codebase they pointed to as having potential TD items) compares to TD that is elicited from humans. The goal of the second phase was to understand the differences in the outcomes of the two approaches to TD identification and gain an understanding of when and how they can best be used and/or combined. We also hoped, in Phase II, to glean from developers some understanding on findings that were observed from the first phase but that were hard to understand without contextual information.
Better understanding of how to elicit TD from humans: We assessed a TD template  that can be used to capture, store, and communicate essential information about TD to support decision-making about debt payment. Our results give some insight into the dynamics of eliciting TD from a team of developers, helping us to answer such questions as: How did the developer find the TD items? Were identified TD items related to code ownership?
Comparison among several types of tool support for TD identification: Despite the fact that automated approaches point to system code fragments that need improvement, it is not clear yet if they point to the most important TD from software project stakeholders’ point of view. Thus, we are interested in exploring the extent to which they can support TD identification, how big of a gap they leave if used without human elicitation of TD, and whether new tools might be warranted, possibly derived from knowledge of manual TD inspections. This understanding can help address questions such as how tools can best be used, instead of or in addition to manual approaches, in the identification of TD.
Developers’ point of view about TD indicators and TD items reported by tools: We asked questions regarding the connection between a TD indicator (e.g., as reported by a tool) and a TD item (i.e., something a developer would find worthwhile to pay off), the value of performing preventive maintenance to remove some of the issues found by the tools, and level of surprise about the results reported by the tools.
We identified that the main resources used by the development team members to identify TD items is their own knowledge on the project, augmented by an artifact closely related to their own specific development tasks. Different people, who work with different artifacts and have different roles on the project, appear to report completely different TD items. Thus, we have further shown that different stakeholders know about different debt in their project, indicating that TD identification should include a range of project team members. We also found that automated TD identification tools, to some extent, point to different potential debt than people do. The three different automated approaches did well in pointing to source code files with defect debt and also could point to some instances of design debt. On the other side, many developer-identified TD items could not have been found by the tools or metrics since the artifacts in which they were located are not included in the static code analysis. The type of debt that represented the greatest agreement between the human-elicited TD items and the automatically detected TD was defect debt.
The remainder of the paper is organized into six more sections. Section 2 describes some background concepts related to TD and also related work. Next, section 3 presents the design of the present study. Section 4 reports the results of the quantitative and qualitative analyses of the study data. Next, section 5 discusses the results of the study. After, the threats to validity are presented in section 6. Finally, section 7 presents some final remarks on this work.
In this section, we describe the methods and techniques we used to investigate our research questions. Both phases took place in the same industrial context, which we describe in the next subsection, and used the same goals, described in the “Goal and research” section. Then, we describe the data collection and analysis activities in Phase I and Phase II, respectively.
Figure 1 illustrates our research strategy. The two phases of the study are complementary and contribute in different ways to the study findings. While in the first phase, the goal was to understand the human elicitation of TD and compare it to automated TD identification (see “Phase I” section), in the second phase the goal was to understand those differences, and gain an understanding of when each set of approaches is appropriate and how they can best be combined (see “Phase II” Section). For instance, at the end of the first phase, we had several open questions regarding how developers found TD items, if there is a connection between identified TD items and code ownership, if the results provided by tools were expected or unexpected, and others. Phase II was designed to answer as many of those open questions as possible. At the end, the results from both phases were analyzed together.
Characterization of the participants
Level of experience in years
Goal and research questions
RQ1—Do the TD identification tools find the TD items that were reported by the members of the development team?
RQ2—How much overlap is there between the TD items reported by different members of the development team?
RQ3—Is the TD item template a feasible and effective tool for eliciting TD items from members of development teams?
Data collection and analysis
Below, we outline the data collection and analysis techniques used in Phase I and Phase II. More details about the Phase I research design can be found in .
As can be seen in Fig. 1, one of authors trained the development team in TD basic concepts via Skype. This training included an opportunity for Q&A. As the development team’s native language was Portuguese, all the material was prepared and then translated to Portuguese before presentation. Initially, we explained some basic concepts related to TD. The definition of TD we used was : Technical Debt refers to delayed technical work that is incurred when technical short cuts are taken, but that creates a technical context in which the same work will cost more to do later than it would cost to do now. To explain the attributes of a TD item (principal, interest, interest probability) and the TD template used to report a TD item, we used the concepts and definitions presented in the “Background” section of this work.
Design debt: any kind of anomaly or imperfection that can be identified by examining source code that leads to decreased maintainability if not remedied;
Testing debt: refers to issues found in testing activities that can affect the quality of those activities. Examples of this type of debt are tests that were planned but not executed;
Documentation debt: refers to the problems found in software project documentation and can be identified by looking for missing, inadequate, or incomplete documentation of any type;
Defect debt: known defects that are not yet fixed.
We indicated that those types are just examples of debt that could affect software projects. Thus, the participants were free to consider any other type of debt during the study. After the training, two parallel activities took place: manual and automatic TD identification, i.e., collecting TD items from the development team and collecting the output of tool-based analysis on the source code.
For automatic TD identification, we applied the CodeVizard and FindBugs tools to the latest version of the subject project source code, in order to identify code smells and ASA issues. These tools were chosen primarily because of our previous experience in using them in studies of technical debt [14, 18, 38, 45, 48–50, 52], as well as their ability to find common smells and issues. They were, at the time, the best tools available for our purposes. The resulting data described, for each file (i.e., class) in the code base, how many of each type of code smells were identified and how many of each type of ASA issues were present. Each FindBugs issue has a category (e.g., Performance, Correctness), and a priority from 1 (highest) to 3 (lowest). We also selected and computed for each file the following structural metrics: Lines of Code, McCabe’s Cyclomatic Complexity, Density of Comments, and Sum of Maximum Nesting of all Methods in a Class. Lines of Code and McCabe’s Cyclomatic Complexity are widely used in the literature on defect and maintainability prediction (e.g., [31, 35]). Higher accidental complexity is hypothesized to point to TD since complexity increases maintenance cost (TD interest). Density of comments was selected to study whether highly commented code might have a relationship with TD, while Max Nesting measures complexity in depth similar to McCabe’s complexity measure. The metrics were computed with ad-hoc scripts/tools. Application of all these tools, metrics, and indicators resulted in a very large amount of data and an enormous number of potential TD items. Therefore, in order to present relevant results, we devised and applied some filtering strategies to the tool outputs. Thus, the results presented in the “Results” section do not include all of the metrics and indicators mentioned here. The filtering mechanisms are also explained in the “Results” section.
If you were given a week to work on this application, and were told not to add any new features or fix any bugs, but only to address TD (i.e., make it more maintainable for the future), what would you spend your time on?
How difficult was it to identify TD items?
How difficult was it to report TD items (i.e., fill in the template)?
How much effort did you need to identify and document all the TD items?
Which are the most difficult fields to fill in/which are the least difficult ones?
All answers were given as free text, although the respondents were asked to use the TD template in Table 1 to answer question 1.
The type of debt can be helpful to tailor debt payment to critical quality characteristics of interest. For example, known defect debt may be differently perceived in life critical software applications. Other known TD types are design debt, documentation debt, and testing debt, all of which are defined above as they were defined for the respondents. Other types of debt have been identified and proposed in the literature , but at the time this study was performed, these were the types primarily referenced, and so were the ones used in the survey. However, respondents were also free to invent other types of debt when these four did not cover a particular situation, and some of the respondents did that.
Was the original decision to go into debt made intentionally or unintentionally? This information can help to understand how explicit debt and TD decisions are managed in a project.
Who is responsible for fixing the TD? This information is important for administrative reasons and may help to provide a basis for assessing principal and interest.
Where is the TD located? This field indicates, if the debt is code-related, the file, class, or component that needs to be modified to eliminate the debt. For other types of debt, the relevant artifact or document is specified. This information is important to understand impact on the product, relationships between items, and ripple effects in source code when repaying the debt. For our study, it also allowed us to match up the elicited TD items with the output of the automated tools.
After manual and automatic identification of TD, we performed data analysis with a comparison of the two sets of results. More details and preliminary results from Phase I were presented at the 17th International Conference on Evaluation and Assessment in Software Engineering (EASE 2013) .
CQ1—How did you (developer) find the TD items?
CQ2—Were identified TD items related to code ownership?
CQ3—Do you (developer) think there is a connection between this indicator and the TD item?
CQ4—Does this file have any TD?
CQ5—Would it be worthwhile to do some preventive maintenance to remove some of the issues found by the tools?
CQ6—Are you (developer) surprised that the automated tools did not find any problems with this file?
CQ7—Would you have expected to see more problems found by the tools? Why or why not?
TD Item Description
Question and answer that helps explain a specific TD item.
A question about a specific TD item.
A question that is not specific for any TD item.
A question that was not planned before the meeting.
A question asking for confirmation from all participants.
Answers related to the correlation between TD items and metrics.
Code Smells Correlation
Answers related to the correlation between TD items and code smells.
Find bug Correlation
Answers related to the correlation between TD items and FindBugs issues.
Moderator notes about participant actions.
Answers to new or generic questions.
Technical Debt Overlap
Answers addressing the lack of overlap between the identified TD Items.
Technical Debt Identification
Answers about the manual TD identification process.
Indicators and Technical Debt Identification
Answers about the correlation between technical debt and metrics, code smells, and FindBugs issues.
Benefits of TD Management
Comments about the benefits of TD management.
Benefits of TD Documentation
Comments about the benefits of TD identification.
TD Identification without code reading
Comments about TD identification without reading the source code.
TD Identification with code reading
Comments about TD identification by reading the source code.
Indicators and Problems in the Project
Comments about the correlation between problems in the source code and metrics, code smells, and FindBugs issues.
Code Inspection Feasibility
Comments about the feasibility of inspecting the source code to look for TD items.
We coded the focus group data in vivo, using codes that emerged from the data and thus were, in some sense, related to the formulated questions. For this, we used a combination of pen, paper, and Microsoft Excel. We formulated findings by synthesizing the coded segments pertaining to categories that had substantial data weight. That is, the findings were grounded in the data rather than formulated ahead of time and validated through the data. At the end, the results of both Phase I and Phase II were put together and we drew conclusions based on the entire set of results.
The tools we used generated a large number of issues pointing to potential TD, with indicators occurring in nearly every file in the code base. Therefore, we had to pre-filter the results in some way to allow a meaningful comparison with the manually elicited TD items. For each automatically generated indicator, we sorted the source code files by the number or severity of issues found. For example, we sorted by the number of FindBugs Priority 1 issues, and by the value of each of the source code metrics, and the number of each kind of code smell found. For each indicator, we examine the top 10% of the sorted list and determined how many source files in that 10% corresponded to TD items reported by developers. The indicators having the most developer-reported source files in the top 10% were FindBugs P1 issues, the MAX nesting metric, and Intensive Coupling code smell.
We realize that this filtering approach somewhat biases our results by only showing the automated tool results that performed best in terms of matching up with manually reported TD items. Our motivation was to determine simply if any of the automated approaches were related to the TD elicited from developers and to simplify the presentation of results. Clearly, we cannot claim that these three top performing indicators in this study would also be the best ones in any given case. Thus, we can reject the notion that none of the automated indicators (metrics, ASA issues, and code smells) are good at finding TD, but we are still not at the point where we understand which indicators generally predict TD best, or under which circumstances they predict best, especially given the high number of false positives typically included in the results from such tools.
We synthesize these results below, along with insights gained from the Phase II focus group, to address our research questions.
Do the TD identification tools find the TD items that were reported by the members of the development team? (RQ1)
We observe in Fig. 3 that the three top automated approaches do about equally as well as manual elicitation in identifying defect debt. In fact, these three automated indicators captured all source code components identified by the development team as having defect debt.
For design debt, automated approaches capture about half of the human-reported TD items, although ASA issues (in particular P1 issues) and code smells (i.e., Intensive Coupling) identified more TD items than traditional code metrics (i.e., MAX Nesting). One design debt item was located in a source file identified by all automatic approaches, and another was identified by ASA issues and code smells, but not traditional metrics. Three developer-reported design debt items (two of which were actually the same item) were not identified by any automated approach. The first item (covered by two reports) was described as TD caused by information stored redundantly in two databases. Subjects reporting this TD item said that this kind of debt was distributed over many files. Thus, it is likely that the tools investigating single lines, methods, and classes could not point out this type of TD easily. The second missed item was reported as insufficient use of system resources (i.e., memory usage). This type of TD was missed even though one approach (FindBugs) reports on bug patterns related to performance problems.
Seven human-reported TD items were of type documentation debt, testing debt, or usability debt. Only one of these TD items was related to source code files, all others were related to other development artifacts (e.g., requirements documents and test plans).
In order to have a better understanding of how developers proceeded to find TD and what they think about the indicators considered by the tools, our first focus group question to the development team was How did you (developer) find the TD items?
The main source for identifying TD items was clearly the participants’ own knowledge on the project, without external aids (i.e., written sources). The following excerpts were typical: “Developer 1: I already had some knowledge of these debts in my mind,” “Developer 2: I reported those debts that I remembered,” “Maintainer: I have knowledge of what was done differently from what the customer wanted,” and “Tester: I used my own experience with testing activities.” These types of responses were reported regardless of the role assumed by the participant in the development process, i.e., by developers, maintainers, and testers.
In addition, the data showed that each participant used a “resource” associated with their responsibility in the development process to support the identification of TD items. Both developers reported analyzing or searching the source code for TD items. The maintainer reported analyzing tickets, and the tester relied on information from “testing activities when several problems are identified.”
Thus, we identified that the main resources used by the development team members to identify TD items is their own knowledge on the project, augmented by an artifact closely related to their own specific development tasks. Specifically, developers relied on source code, the maintainer on tickets, and the tester on results of test activities.
This qualitative data from the focus group helps explain in part why there was so little overlap between the developer-elicited TD items and the items revealed by automatic tools and thus contributes to addressing RQ1. Since part of the development team did not work directly with the source code, and since they used resources (knowledge, experience, artifacts) closely related to their job tasks to identify debt in the project, it follows that much of the debt reported by the development team would not be associated with the code and thus could not be identified by the tools used in this study.
To further understand the difference between developer-reported and tool-elicited TD, we asked developers about the value of the automated TD indicators we used (metrics, code smells and FindBug issues). The answer was overall positive. When asked, “Do you think these tools and metrics would help you to identify TD faster or more accurately?,” all participants responded in the affirmative, using phrases such as “useful” and “valid.” However, interestingly, they also used terms that indicated that the use of tools is not a complete solution, e.g., “a starting point to identify TD,” “they can indicate a path to follow,” and “for code, they are good indicators.”
The focus group data also indicates that metrics, code smells, and FindBugs issues, beyond being valid indicators of debt, are also considered good indicators of general quality problems in the source code of the project. For example, when asked “If you didn’t have time to report the TD items, do you think that using these metrics (that point to some larger classes) could help you to identify possible problems in the project?,” the maintainer answered that “Looking at a project as a whole, yes.” Participants also answered positively to a similar question about code smells, indicating that the smells, besides being useful as pointers to debt, are also useful as indicators of quality problems in general and could be used to guide quality improvement efforts.
As a very specific example, there were two distinct classes that we asked the participants to comment on, because the developers had reported TD items in them, but the automated indicators did not reveal any issues. We asked the participants if they were surprised by this result, and the two developers answered no, because the major problems with these two classes had to do with performance, and so would not be expected to be highlighted by the tools.
As another example, when asked about a class in which the automated tools indicated the presence of debt, but for which the participants did not report any TD items (with the prompt “Does this file have any TD?”), the participants answered that they did not know but that the problems reported by the indicators actually exist in the file: “I don’t know. But this class is really quite complex since it involves several use cases and has file handling …. Besides, there is a lot of IF conditions that must be analyzed and, yes, there is a high probability of errors in this class.” Again, participants felt that metrics, code smells and FindBugs issues are good indicators of problems in the system classes, whether or not those problems could be classified as “debt.”
Q: “Would it be worthwhile to do some preventive maintenance to remove some of the issues found by the tools?”
A: “Yes. For example, I would have some trouble understanding this even though I developed it myself. It would be good to refactor this class, for example, to decrease the size of the methods … in order to facilitate future maintenance activities.”
Q: “If you had only these data to indicate potential problems in the project, would it be useful?”
A: “Yes. For example, I had an idea because I developed this class, but the other developers would not have any idea about this class because they didn’t work on it. Thus, they would not have an idea if there is or is not a problem there.”
Q: “In addition, having knowledge about code smells, could it be used to improve the quality of the project?”
A: “Yes. For example, large classes that have been growing and we do not have time to refactor the project although we would like to. More specifically, there is a very large class that we need to adjust but we do not have time right now.”
“Developer 1: but these indicators are more associated with refactoring to facilitate future maintenance of the project”;
“Developer 1: Therefore, the tools are not enough (for example, we reported items that were not identified by indicators)”;
“Developer 2: After that, you need something more to find more specific TD items like the ones related to requirements”;
“Maintainer: However, the indicators are not enough for some technical debt related to functional aspects, standardization or even system requirements.”
“Maintainer: So, the indicators are valid but not enough.”
Code smells, FindBug issues, and code metrics can support the identification of defect and design debt in this project, but not other types of debt that were reported by developers.
These automated indicators are considered good indicators of technical debt in the source code of the project, at least as a starting point.
They are also considered useful as indicators of general quality problems and thus could be used to guide quality improvement efforts during maintenance.
However, the indicators we used are not sufficient because they are not able to identify other types of technical debt items that are not directly associated to the code, for example, problems with requirements, business rules that have not been fully implemented, and usability issues. It should also be noted that the focus group participants were looking at the results from our pre-filtering of the automated indicators, so they were only looking at the top-rated indicator results. The tools returned flagged many more potential TD instances than were examined in the focus group. Thus, we can reject the notion that none of the automated indicators is useful for TD identification, but we cannot make claims about the usefulness of the overall approach of automatic TD identification, nor can we comment on whether the same three indicators we found to be the best performers would also perform well in other contexts.
How much overlap is there between the TD items reported by different members of the development team? (RQ2)
Only one TD item was reported by two different stakeholders (the manager and one developer). None of the remaining 19 items were reported by more than one team member. By analyzing the focus group transcript, it is possible to identify what can lead to non-overlapping TD items when they were being identified.
As mentioned before, when we asked the developer team how they found TD items, they reported that the main resource used was their own knowledge on the project. In addition to their own knowledge, what they used to support the description of the TD items was an artifact related to the specific tasks each team member was engaged in. Because each participant worked on different parts of the project and each TD item was reported based on their memories and experiences on the project, we can assume that the reported TD items would be associated with their daily activities and would be different from each other. In this project, two of the participants worked as developers, one as maintainer and one as tester. Thus, in addition to working on specific parts of the product, each participant had a different perspective when analyzing and reporting TD items. All of this contributed to reduce the overlap when identifying TD items.
In addition, when asked directly about the reasons that led them to report different technical debt items, participants reported that “Everyone was focused on a different part of the project;” thus, “Each developer reported based on what they worked on the project.” Lastly, participants indicated that there was no contact between them while performing the TD identification activity (“There was no contact between us in answer to the questionnaire.”).
If the participants responsible for identifying TD items work in different parts of the project, the items reported by each participant tend to be associated with the parts of the project in which that participant works;
Identified TD items are related to code/artifact ownership.
Thus, to increase coverage when manually identifying TD items, we can infer that it is important that the people identifying the TD are ones who work on different parts of the project and/or have different roles in the development process.
Is the TD item template a feasible and effective tool for eliciting TD items from members of development teams? (RQ3)
As reported in , the five study subjects reported that it took between 50 min and 2 h to identify and document the TD items (average of 19 min per item). Answers about difficulty of the task ranged from “easy” to “difficult/high” (all answers were given as free text). Subjects agreed that the fields principal, interest amount, and interest probability were the most difficult to fill in. Location, type, and responsible were commonly noted as the least difficult fields. These results indicate that, in this project, the initial elicitation of TD items could be done in reasonable time, but that the key financial parameters of TD were difficult to estimate and might require better process or tool support in future. This research question was only addressed in Phase I, i.e., we did not ask questions about the use of the template during the focus group.
The findings presented in the previous section have several implications. We found that automated TD identification tools (to some extent) point to different potential debt than people do. Further, many developer-identified TD items could not have been found by the tools or metrics since the artifacts in which they were located are not included in the static code analysis. We did not limit our participants to reporting only TD items that could be found by the tools we used, which in some sense might have been a “fairer” comparison. But this type of “fairness” was not our goal. We were interested in finding which types of TD developers felt were most important and what role tools could plan in finding those types of TD. The fact that developers reported some code-related debt, as well as TD items that were not code-related, suggests that a focus on source code as the single source of TD is too narrow. Future studies might consider including or proposing tools for other kinds of development artifacts affected by TD. At the time of this study, few such tools were known to us, but several have since been proposed (see, e.g., http://www.hejf.de/).
Our findings also indicate that different people, who work with different artifacts and have different roles on the project, appear to report completely different TD items. This result would have been stronger if we had had multiple participants in the same role, to see if people in the same roles would have reported the same TD items or would have agreed on the importance of different TD items (a limitation of this study that could be addressed in future work). However, this finding is still interesting and implies that, if a project wants a comprehensive view of all the TD present, identification must involve people representing all different development roles, as well as automated tools. This also has implications for prioritization and decision-making about TD. It is likely that different development roles have very different ideas about which TD items are most important to address in the near term, making consensus difficult to reach. This makes it important to have objective criteria to help in prioritizing task and in making decisions about what debt to pay off and when. Estimation of principal and interest could be part of such an approach, but we found that developers found it especially difficult to make these estimations, even at a very coarse-grained (high/medium/low) level. Thus, further work in quantifying TD is crucial if TD decision-making is to be facilitated.
On the other hand, it may be that most projects do not need to consider a comprehensive view of all types of TD. In this study, we asked the development team to consider all types of debt, but in many cases, a project has specific types of debt they want to focus on. Projects often have identifiable “pain points”  that clearly imply specific types of TD that are most relevant to that project. In such cases, our findings imply that involving only specific tools, or specific people, would be sufficient to identify the relevant debt.
These observations about the relationship between code (or artifact) ownership and identifying TD also imply another role for automated detection approaches. In cases where a wide cross-section of developers is not available for TD identification tasks, automated tools might serve as a guide for those developers who are engaged in TD detection, by helping them identify places in the code (with which they might not be familiar) that should be examined for potential TD.
The type of debt that represented the greatest agreement between the human-elicited TD items and the automatically detected TD was defect debt. Remember that defect debt represents known defects that have been allowed to persist beyond when they could have been fixed. In this study, much of the defect debt reported was actually not-fully-implemented requirements, i.e., implementations that do not fully meet the requirements and thus are logged as defects. The tools we used to analyze the source code were not intended to find defects, per se, but are meant, in theory, to find anomalies in the code that could, in the future, affect maintainability and lead to changes that introduce defects. Thus, at some level, it is surprising that automated source code analysis pointed to code that had latent defects and unfinished implementations of requirements. One possible explanation is that such pieces of code represent “quick and dirty” implementations, where time constraints led to (at the same time) code that is poorly structured (thus triggering ASA issues, code smells, and high complexity scores) and incomplete (thus leading developers to tag it as defect debt). In this way, the issues found by the automated tools could be said to co-exist with defects, but not to cause them. Instead, the issues and defects are both caused by a third factor, time constraints. Another possible explanation is that poorly structured code does, in fact, cause defects in a more direct way. Thus, the poor structure (as highlighted by the automated tools) would have preceded the defects, which were introduced during changes that were made more difficult and error-prone by the low quality of the code.
Although not a subject, we explored in the focus group, some interesting observations can be made about the participants’ estimations of principal and interest, represented by the colors of the TD items in Fig. 3. This color coding indicates that principal, interest amount, and interest probability are rather equally distributed among the different types of debt. This suggests that debt characteristics are not tied to the type of debt, i.e., no type of debt has noticeably higher overall interest or principal.
The color coding in Fig. 3 also hints at how this information can be further used to manage debt and make clearer decisions on which debt to pay. For example, items that have generally a low principal (e.g., green front face), but yellow or red interest characteristics are good candidates for paying off first, since their return on investment is more favorable than for other items. This idea of a cost/benefit decision approach has been previously proposed and discussed in .
As with any case study, especially of a small project such as this one, threats to external validity are significant. We accept these threats and attempt to trade off breadth for depth, by doing a thorough analysis of a small case, yielding deeper insights that would not have been possible in a much larger sample. Our goal in this work (as with all case studies) was not generalizable results, but insights derived from a deep dive of a single case on how two classes of identification approaches relate.
An important construct validity threat derives from the fact that we do not know if either the human-elicited TD items or the TD indicated by the tools we applied, constituted “real” technical debt, i.e., that would lead to future maintenance costs. Thus, we had no “ground truth” against which different types of TD identification can usefully be compared. The ultimate and authoritative “ground truth” for studies of TD would be a measure based on future maintenance effort associated with TD items. That is, a “real” TD item is one that leads to higher maintenance effort than would have been incurred if the debt did not exist. However, measuring “real” TD in this way was not possible in this study, nor is it in many studies. For the study in this paper, our results are limited to comparisons between the two modes of TD identification, but not evaluation of either as being more “correct.”
Another assumption that we have made that may also lead to a construct validity threat is that, when an automated approach identifies a problem in a source code module that is also indicated in a TD item reported by a developer, the two indicators are actually pointing to the same TD instance, not to two separate TD instances that happen to reside in the same source module. While this assumption may not be strictly true, from a practical perspective it is reasonable, as fixing one instance of TD in a class (e.g., by refactoring) will very often, as a side effect, fix other instances of TD in the same class.
An internal validity threat (i.e., one that threatens the validity of the relationships we believe we have discovered) arises from the dual role of our company contact (first author) as researcher and member of the organization. Dr. Spínola was a member of the research team, but was also at the time a co-owner of Kali Software, principally engaged in customer service. Although he did not have direct involvement in operations (i.e., in software development), there may have been some bias in the opinions the participants chose to share with him.
In addition, we reported in the “Results” section that we had to pre-filter the issues from automated tools by applying a ranking strategy: for each indicator, we examined the top 10% of the sorted list and determined how many source files in that 10% corresponded to TD items reported by developers, selecting then the top three performing indicators (i.e., FindBugs P1 issues, the MAX nesting metric, and Intensive Coupling code smell). This filtering mechanism gives rise to an external validity threat, since we cannot claim that the top performing indicators used in this study would also be the best ones in any given case. Thus, it made sense to look first at the top performing indicators.
Finally, the focus group was performed in Portuguese, transcribed in Portuguese and, after that, translated to English. So, language constraints and cultural idioms may have had an influence in the understanding of the statements, thus also posing an internal validity threat. To deal with this, the first author is Brazilian and worked carefully to avoid or minimize any bias or misunderstanding in the translation process from Portuguese to English and during the execution of the focus group. A related reliability threat is the fact that coding was performed by only one author, with oversight but not second coding, from another of the authors.
The TD concept has the potential to go beyond a mechanism for communication, to be translated into a whole set of tools and methods for measuring and managing debt. We have presented the results of a two-phase study, in which we compared the results of applying manual and automated techniques for identifying TD and gained further insight into the similarities and differences through a focus group with the development team members. We evaluated how the TD list can be populated by developers through a common TD template and how existing tool approaches can help to identify certain types of debt. We have further shown that different stakeholders know about different debt in their project, indicating that TD elicitation should include a range of project team members. Aggregation, not consensus, would appear to be the most effective approach to combining the input from different team members, if the goal is to gain a comprehensive understanding of all types of debt in a project. In addition, three different automated approaches—code smells, ASA issues, and traditional code metrics—did well in pointing to source code files with defect debt and also could point to some instances of design debt.
A limitation of the performed study is related to the development team involved in the study. Given the fact that multiple developers were not involved in the manual identification of TD on the same part of the project, it is not clear how reliable the reported results of a single developer can be. Aggregation sounds like a reasonable approach but some intersection of results from different developers working on the same part of the project may be necessary to improve the accuracy of the results. Besides, the study pointed out that developers tend to report different results than tools but it was not possible to investigate to which extent this behavior is due to the fact that the study did not use tools that find TD items in non-code artifacts, for example in requirements.
How much of the potential TD reported by tools, but not reported by developers, is “real” TD? That is, is there a value in using tools to identify TD that developers are not aware of?
How can manual TD identification be better integrated into the development process, in order to make it more efficient and feasible?
Is developer-identified or tool-identified TD more likely to lead to future maintenance issues (i.e., which is more likely to be “real” TD)?
How can TD principal and interest be better and more objectively quantified in order to facilitate decisions and prioritization of debt items, in the face of potentially conflicting priorities among stakeholders?
A more comprehensive study (using a larger set of tools, such the ones presented in [9, 19, 28, 46]) would be interesting in order to better grasp the overlap and differences between manual and automatic identification approaches. Thus, we also encourage practitioners to use the proposed template in their projects and to share results and experiences. It will require evidence from a variety of environments to build a full picture of how different TD identification approaches interact, overlap, and are (or are not) synergistic. This evidence is necessary to further refine and to bring into focus the TD landscape.
Dr. Spínola’s contribution to this was supported by CNPq Universal 2014 grant #458261/2014-9 and #201440/2011-3. The participation of Seaman and Zazworka in this work is supported by the US National Science Foundation, award #0916699.
ROS and NZ planned and executed the study, and analyzed and interpreted its results. AV analyzed and interpreted the study results. FS and CS planned the study and validated the obtained results. ROS, NZ, and CS were the major contributors in writing the manuscript. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
- Alves NSR, Mendes TS, Mendonça MG, Spínola RO, Shull F, Seaman C (2016) Identification and management of technical debt: a systematic mapping study. Inf Softw Technol 70:100–121. https://doi.org/10.1016/j.infsof.2015.10.008 ISSN 0950-5849View ArticleGoogle Scholar
- Alves NSR, Ribeiro LF, Caires V, Mendes TS, Spínola RO (2014) Towards an ontology of terms on technical debt. In: The Sixth International Workshop on Managing Technical Debt, Victoria, British Columbia. IEEE Computer Society, Washington, pp 1–7. https://doi.org/10.1109/MTD.2014.9 View ArticleGoogle Scholar
- Alzaghoul E, Bahsoon R (2013) “CloudMTD: using real options to manage technical debt in cloud-based service selection”, 4th International Workshop on Managing Technical Debt (MTD ‘13), IEEE Computer Society, pp. In: 55–62, 18–26 May. USA, San FranciscoGoogle Scholar
- Ampatzoglou A, Chatzigeorgiou A, Avgeriou P (2015) The financial aspect of managing technical debt: a systematic literature review, information and software technology, volume 64, April, pages, pp 52–73. https://doi.org/10.1016/j.infsof.2015.04.001 ISSN 0950-5849View ArticleGoogle Scholar
- Avgeriou P, Kruchten P, Ozkaya I, Seaman C (2016) Managing technical debt in software engineering (Dagstuhl Seminar 16162). In: Dagstuhl Reports (Vol. 6, No. 4). Schloss Dagstuhl-Leibniz-Zentrum fuer InformatikGoogle Scholar
- Binkley D (2007) Source code analysis: a road map. In: Future of Software Engineering, pp 104–119Google Scholar
- Brown N, Cai Y, Guo Y, Kazman R, Kim M, Kruchten P, Lim E, MacCormack A, Nord R, Ozkaya I, Sangwan R, Seaman C, Sullivan K, Zazworka N (2010) Managing technical debt in software-reliant systems. In: Proceedings of the FSE/SDP workshop on Future of Software Engineering Research (FoSER ‘10). ACM, New York, pp 47–52View ArticleGoogle Scholar
- CAST (2010) Cast worldwide application software quality study: summary of key findings. In: Tech. Rep 2010Google Scholar
- Curtis B, Sappidi J, and Szynkarski A (2012). Estimating the principal of an application's technical debt in IEEE Software,29(6):34–42.Google Scholar
- D'Ambros M, Bacchelli A, Lanza M (2010) On the impact of design flaws on software defects. In: Proceedings of the 10th international conference on quality software (QSIC ‘10). IEEE Computer Society, Washington, pp 23–31. https://doi.org/10.1109/QSIC.2010.58 View ArticleGoogle Scholar
- Farias MAF, Silva AB, Mendonça MG, Spínola RO (2015) A contextualized vocabulary model for identifying technical debt on code comments. In: 7th international workshop on managing technical debt 2015, pp 25–32Google Scholar
- Fowler M, Beck K (1999) Refactoring: improving the design of existing code. Addison WesleyGoogle Scholar
- Gat I, Heintz JD (2011) From assessment to reduction: how cutter consortium helps rein in millions of dollars in technical debt. In: Proceedings of the 2nd workshop on managing technical debt. ACM, New York, pp 24–26. https://doi.org/10.1145/-1985362.1985368. View ArticleGoogle Scholar
- Guo Y, Seaman C, Zazworka N, Shull F (2011) Domain-specific tailoring of code smells: an empirical study. In: Proceedings of the 32nd ACM/IEEE international conference on software engineering, pp 167–170Google Scholar
- Guo Y, Spínola RO, Seaman C (2014) Exploring the costs of technical debt management - a case study. Empir Softw Eng 1:1–24Google Scholar
- Holvitie J, Leppanen V (2013) DebtFlag: technical debt management with a development environment integrated tool. In: 4th International Workshop on Managing Technical Debt (MTD), pp 20–27Google Scholar
- Hovemeyer D, Pugh W (2004) Finding bugs is easy. SIGPLAN Not 39(12):92–106View ArticleGoogle Scholar
- Izurieta C, Vetro’ A, Zazworka N, Cai Y, Seaman C, Shull F (2012) Organizing the technical debt landscape. IEEE ACM MTD 3rd international workshop on managing technical debt. In: Association with the 34th international conference on software engineering ICSE, Zurich, IEEE Press, Piscataway, pp 2–9Google Scholar
- Kazman R, Cai Y, Mo R, Feng Q, Xiao L, Haziyev S, Fedak V, Shapochka A (2015) A case study in locating the architectural roots of technical debt. In: Proceedings of the 37th international conference on software engineering - volume 2 (ICSE ‘15), Vol. 2. IEEE Press, Piscataway, pp 179–188Google Scholar
- Khomh F, Penta MD, Gueheneuc YG (2009) An exploratory study of the impact of code smells on software change-proneness. In: Proceedings of the 16th working conference on reverse engineering (WCRE ‘09). IEEE Computer Society, Washington, pp 75–84View ArticleGoogle Scholar
- Li W, Shatnawi R (2007) An empirical study of the bad smells and class error probability in the post-release object-oriented system evolution. J Syst Softw 80(7):1120–1128View ArticleGoogle Scholar
- Li Z, Avgeriou P, Liang P (2015) A systematic mapping study on technical debt and its management. J Syst Softw 101:193–220. https://doi.org/10.1016/j.jss.2014.12.027 View ArticleGoogle Scholar
- Li Z, Liang P, Avgeriou P, Guelfi N, Ampatzoglou A (2014) An empirical investigation of modularity metrics for indicating architectural technical debt. In: Proceedings of the 10th international ACM Sigsoft conference on quality of software architectures. ACM, New York, pp 119–128Google Scholar
- Lim E, Taksande N, Seaman C (2012) A balancing act: what software practitioners have to say about technical debt. IEEE Softw 29(6):22–27View ArticleGoogle Scholar
- Maldonado ES, Shihab E (2015) Detecting and quantifying different types of self-admitted technical debt. In: 7th international workshop on managing technical debt, pp 9–15Google Scholar
- Marinescu R (2004) Detection strategies: metrics-based rules for detecting design flaws. In: Proceedings of the 20th IEEE international conference on software maintenance (September 11–14, 2004). ICSM. IEEE Computer Society, Washington, pp 350–359View ArticleGoogle Scholar
- Marinescu R (2012) Assessing technical debt by identifying design flaws in software systems. IBM J Res Dev 56(5):9:1–9:13View ArticleGoogle Scholar
- Martini A, Bosch J (2016) An empirically developed method to aid decisions on architectural technical debt refactoring: AnaConDebt. In: IEEE/ACM 38th international conference on software engineering companion (ICSE-C), Austin, ACM, New York, pp 31–40Google Scholar
- McConnell S (2007). “Technical debt,” 10x Software Development Blog, (Nov 2007). Construx Conversations. URL= https://www.construx.com/resources/managing-technical-debt/
- Morgenthaler J, Gridnev M, Sauciuc R, Bhansali S (2012) Searching for build debt: experiences managing technical debt at Google. In: Third International Workshop on Managing Technical Debt (MTD), pp 1–6Google Scholar
- Nagappan N, Ball T, Zeller A (2006) Mining metrics to predict component failures. In: Proceedings of the 28th international conference on software engineering. ACM, New York, pp 452–461Google Scholar
- Nugroho A, Visser J, Kuipers T (2011) An empirical model of technical debt and interest. In: Proceedings of the 2nd workshop on managing technical debt (pp. 1–8), ser. MTD’11. ACM, New York. https://doi.org/10.1145/-1985362.1985364. View ArticleGoogle Scholar
- Olbrich SM, Cruzes DS, Sjoberg DIK (2010) Are all code smells harmful? A study of God Classes and Brain Classes in the evolution of three open source systems. In: The Proceedings of the International Conference on Software Maintenance. ICSM, Timisoara, pp 1–10Google Scholar
- Potdar A, Shihab E (2014) An exploratory study on self-admitted technical debt. In: IEEE International Conference on Software Maintenance and Evolution (ICSME) 2014, pp 91–100View ArticleGoogle Scholar
- Riaz M, Mendes E, Tempero E (2009) A systematic review of software maintainability prediction and metrics. In: 3rd international symposium on empirical software engineering and measurement, pp 367–377Google Scholar
- Rios N, Mendonça MG, Spínola RO (2018a) A tertiary study on technical debt: types, management strategies, research trends, and base information for practitioners. Inf Softw Technol 102:117–145View ArticleGoogle Scholar
- Rios N, Spínola RO, Mendonça M, and Seaman C. 2018. The most common causes and effects of technical debt: first results from a global family of industrial surveys. In Proceedings of the 12th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM '18). ACM, New York, Article 39, 10 p. https://doi.org/10.1145/3239235.3268917
- Schumacher J, Zazworka N, Shull F, Seaman C, Shaw M (2010) Building empirical support for automated code smell detection. In: Proceedings of the ACM-IEEE international symposium on empirical software engineering and measurement (ESEM ‘10). ACM, New York Article 8, 10 pagesGoogle Scholar
- Seaman C (1999) Qualitative methods in empirical studies of software engineering. IEEE Trans Softw Eng 25(4):557–572View ArticleGoogle Scholar
- Seaman C, Guo Y (2011) Measuring and monitoring technical debt. In: Advances in Computers, vol 82, pp 25–46Google Scholar
- Seaman C, Guo Y, Zazworka N, Shull F, Izurieta C, Cai Y, Vetro A (2012) Using technical debt data in decision making: potential decision approaches. In: The Third International Workshop on Managing Technical Debt (MTD), pp 45–48View ArticleGoogle Scholar
- Soltanifar B, Akbarinasaji S, Caglayan B, Bener AB, Filiz A, Kramer BM (2016) Software analytics in practice: a defect prediction model using code smells. In: Desai E (ed) Proceedings of the 20th International Database Engineering & Applications Symposium (IDEAS ‘16). ACM, New York, pp 148–155. https://doi.org/10.1145/2938503.2938553 View ArticleGoogle Scholar
- Spínola RO, Zazworka N, Vetro’ A, Seaman C, Shull F (2013) Investigating technical debt folklore: shedding some light on technical debt opinion. In: Proceedings of the 4th workshop on managing technical debt (MTD ‘13). https://doi.org/10.1109/MTD.2013.6608671 View ArticleGoogle Scholar
- Stochel MG, Wawrowski MR, Rabiej M (2012) Value-based technical debt model and its application. In: 7th International Conference on Software Engineering Advances (ICSEA’ 12). XPert Publishing Service, Lisbon, pp 205–212 18–23 NovemberGoogle Scholar
- Vetro' A, Morisio M, Torchiano M (2011) An empirical validation of FindBugs issues related to defects. In: 15th annual conference on Evaluation & Assessment in software engineering (EASE) 2011, pp 144–153. https://doi.org/10.1049/ic.2011.0018 11-12 AprilView ArticleGoogle Scholar
- Xiao L, Cai Y, Kazman R, Mo R, Feng Q (2016) Identifying and quantifying architectural debt. In: Proceedings of the 38th international conference on software engineering (ICSE ‘16). ACM, New York, pp 488–498Google Scholar
- Yamashita A, Moonen L (2013) Exploring the impact of inter-smell relations on software maintainability: an empirical study. In: Proceedings of the International Conference on Software Engineering (ICSE). IEEE Press, Piscataway, pp 682–691Google Scholar
- Zazworka N, Ackermann C (2010) CodeVizard: a tool to aid the analysis of software evolution. In: Proceedings of the ACM-IEEE international symposium on empirical software engineering and measurement (ESEM). ACM, New York Article 63, 1 pages (PosterGoogle Scholar
- Zazworka N, Seaman C, Shull F (2011b) Prioritizing design debt investment opportunities. In: Proceedings of the 2nd Workshop on Managing Technical Debt (MTD). ACM, New York, pp 39–42Google Scholar
- Zazworka N, Shaw MA, Shull F, Seaman C (2011a) Investigating the impact of design debt on software quality. In: Proceedings of the 2nd Workshop on Managing Technical Debt (MTD). ACM, New York, pp 17–23Google Scholar
- Zazworka N, Spínola RO, Vetro’ A, Shull F, Seaman C (2013) A case study on effectively identifying technical debt. In: Proceedings of the 17th International Conference on Evaluation and Assessment in Software Engineering (EASE). ACM, New York, pp 42–47. https://doi.org/10.1145/2460999.2461005 View ArticleGoogle Scholar
- Zazworka N, Vetro’ A, Izurieta C, Wong S, Cai Y, Seaman C, Shull F (2014) Comparing four approaches for technical debt identification. Softw Qual J 22(3):403–426 SpringerView ArticleGoogle Scholar