Redocumenting APIs with crowd knowledge: a coverage analysis based on question types

Delfim, Fernanda Madeiral; Paixão, Klérisson V. R.; Cassou, Damien; de Almeida Maia, Marcelo

doi:10.1186/s13173-016-0049-0

Research
Open access
Published: 09 December 2016

Fernanda Madeiral Delfim ORCID: orcid.org/0000-0003-2048-7648¹,
Klérisson V. R. Paixão¹,
Damien Cassou² &
…
Marcelo de Almeida Maia¹

Journal of the Brazilian Computer Society volume 22, Article number: 9 (2016)


Abstract

Background

Software libraries and frameworks play an important role in software system development. The appropriate usage of their functionalities/components through their APIs, however, is a challenge for developers. Usually, API documentation, when it exists, is insufficient to assist them in their programming tasks. There are few API documentation writers for the many potential readers, resulting in the lack of explanations and examples concerning different scenarios and perspectives. The interaction of developers on the Web, on the other hand, generates content concerning APIs from different perspectives, which can be used to document APIs, also known as crowd documentation.

Methods

In this paper, we present a study of the knowledge generated by the crowd on the Stack Overflow question-and-answer website. Our main goal is to understand how the crowd can contribute to API documentation on two programming tasks: how to implement a scenario using an API (how-to-do-it), and how to fix domain-independent bugs in existing code where there was a misunderstanding regarding the usage of an API (debug-corrective). We classified questions available on Stack Overflow by the main concerns of askers, and we used those classified as how-to-do-it and debug-corrective to analyze the coverage of API elements in the discussions related to such questions. Our cases included the well-known and popular Swing and Android APIs.

Results

Our main findings showed that the crowd provides more content for debug-corrective tasks than for how-to-do-it tasks, regardless of the API. Android API elements are more discussed by the crowd compared to Swing. Moreover, we observed that some API elements are frequently mentioned together in discussions, and that there is a strong association between API coverage on Stack Overflow and its usage in real software systems.

Conclusions

Crowd documentation may not be a complete substitute for official documentation because of its partial coverage, especially for how-to-do-it tasks. However, it can still significantly enhance the existing documentation, especially for the most commonly used API elements, providing code samples and explanations on a large variety of usage nuances. Finally, taking advantage of the high coverage for debug-corrective tasks, a new kind of debugging assistant may be conceived.

Introduction

New development platforms are being deployed at an unprecedented pace. Software developers are required to learn the respective APIs (application programming interfaces) in depth to take maximum advantage of the underlying innovations, while avoiding misuses that could decrease the final product quality. Meanwhile, developers have reported that inadequate or absent resources for learning APIs, e.g., documentation, are a major obstacle to adequate learning [1, 2].

The social interaction of developers in blogs, forums, and question-and-answer (Q&A) websites generates partially structured content. This can be considered one of the thriving forms of software documentation available nowadays [3]. Unlike traditional software documentation, which is produced mostly by a central authority, this phenomenon allows anyone to produce and share relevant content for documentation. The content available in such repositories is also known as crowd knowledge [4, 5].

An example of the channels that support developer social interaction is Stack Overflow [6], a question-and-answer website where developers collaborate with each other in order to solve issues related to software development. On Stack Overflow, developers post questions related to a programming topic, e.g., an API, while other developers can provide answers to help solve their issues [7]. In the context of Stack Overflow, a question is an entry consisting of a title, tags, and a text body, and an answer is an entry consisting of a text body (which may include code samples) where a solution, a clarification, and/or an in-depth discussion is provided concerning the question.

Stack Overflow has been studied to help researchers understand the knowledge and mechanisms available on it and how they can be used to assist software development. To investigate the feasibility of using crowd knowledge on Stack Overflow for API documentation, Parnin et al. [8] carried out a study on how content related to a set of APIs is being produced on Stack Overflow. They analyzed the coverage of API elements over threads, where a thread is a question together with zero or more answers. However, they did not analyze the nature of threads, which can significantly affect how suitable threads are for documentation purposes.

The nature of threads can be distinguished by the main concerns of the askers. These concerns are used in the literature to define question types [7, 9, 10]. Examples of question types are how-to-do-it (providing a scenario and asking how to implement it), debug-corrective (dealing with problems in code already written), seeking-something (looking for something objective, e.g., a tutorial, tool, or library, or subjective, e.g., an opinion, suggestion, or recommendation), and conceptual (asking conceptual questions on a particular topic, e.g., definitions of concepts or best practices for a given technology). By its very definition, the how-to-do-it question type is more closely aligned with the purpose of documenting how to use API elements. Nonetheless, debug-corrective questions are still useful as complementary documentation on how to fix frequent problems related to the usage of API elements, while the other types seem to be only marginally useful.

In our previous work [11], we conducted a study on the coverage of API elements on Stack Overflow for API documentation, introducing the idea that the coverage analysis should take into account the API documentation purpose. We reported the coverage of Swing API elements by threads with the how-to-do-it question type in order to measure how many elements are covered in discussions that provide code samples on how to implement a specific task by using the API elements.

In this paper, we extend the previous study, providing a coverage analysis of the Swing and Android APIs on threads containing how-to-do-it and debug-corrective question types. Our main goal is to measure the coverage of API elements to understand how the crowd can contribute to (1) API documentation on how to use API elements through code samples to accomplish a specific task given a scenario (coverage on threads containing how-to-do-it questions) and (2) API documentation on how to fix domain-independent bugs in existing code where there was a misunderstanding regarding the usage of an API (coverage on threads containing debug-corrective questions).

Our overall contribution consists of the intersection of Parnin et al.’s coverage analysis [8] with two question types defined by Nasehi et al. [9]. Our specific contributions are threefold. First, we developed a classifier by using supervised machine learning algorithms to automatically classify Stack Overflow questions, in order to select threads containing how-to-do-it and debug-corrective question types. Second, we proposed a methodology for analyzing API coverage on Stack Overflow based on question types, thereby improving the knowledge needed to use Stack Overflow content for API documentation. Third, we analyzed the co-occurrence of API elements on threads, i.e., how threads discuss multiple API elements. As complementary analyses, we investigated (i) the growth of the coverage of API elements as compared to the growth in the number of threads on Stack Overflow related to the same API and (ii) the association between API coverage and its respective usage in a large code base repository.

The remainder of this paper is organized as follows. In the “API coverage by Stack Overflow” section, we position our work relative to that of Parnin et al. In the “Automatic classification of questions” section, we explain how we classified Stack Overflow questions. In the “Linking Stack Overflow threads with API elements” section, we explain how we identified API elements on threads. Our methods and experimental setup regarding coverage analysis are presented in the “Methods” section. We present and discuss the obtained results in the “Results and discussion” section, as well as the threats to validity, limitations, and practical implications of the work. In the “Related work” section, we present the related work on (re)documenting APIs, Stack Overflow, and linking documents with code elements. Finally, our conclusions and future work are presented in the “Conclusions” section.

API coverage by Stack Overflow

The analysis of API-related crowd knowledge available on Stack Overflow is essential for providing indicators on how reliable the crowd is at generating as much content as possible for a complete API documentation. The completeness of the crowd knowledge for an API can be measured by analyzing how many elements of the API are discussed by the crowd, i.e., by coverage analysis.

A study on API coverage analysis by the crowd on Stack Overflow was conducted by Parnin et al. [8], which is the work most related to ours. They claimed that a high coverage would suggest that it is feasible to use the crowd knowledge for API documentation as a comprehensive source of knowledge concerning an API.

To analyze API coverage, they built a traceability model between API classes and Stack Overflow threads where these classes are mentioned. Then, they calculated the percentage of API classes linked with at least one thread, resulting in the coverage of API classes. They found that 87.2, 77.3, and 54.3% of the Android, Java, and GWT classes, respectively, are covered by the crowd. They also concluded that despite the potential of a documentation by the crowd to provide many examples and explanations on API elements, the crowd is not reliable for providing content over an entire API.
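The coverage measure described above can be illustrated with a minimal sketch. It is not the implementation used by Parnin et al. or in this paper: the dictionary-based representation of the traceability model, the function name `class_coverage`, and the toy data are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Set


def class_coverage(api_classes: Iterable[str],
                   links: Dict[str, Set[str]]) -> float:
    """Return the percentage of API classes linked to at least one thread.

    `links` maps a thread id to the set of class names mentioned in that
    thread (a hypothetical stand-in for the traceability model).
    """
    classes = set(api_classes)
    if not classes:
        raise ValueError("api_classes must not be empty")
    covered: Set[str] = set()
    for mentioned in links.values():
        # Names that are not part of the API do not count toward coverage.
        covered |= mentioned & classes
    return 100.0 * len(covered) / len(classes)


# Toy data, invented purely for illustration.
toy_api = ["JButton", "JTable", "JTree", "JSlider"]
toy_links = {
    "thread-1": {"JButton", "JTable"},
    "thread-2": {"JTable"},
    "thread-3": {"String"},  # mentioned, but outside the toy API
}
coverage = class_coverage(toy_api, toy_links)
print(f"{coverage:.1f}% of the toy API classes are covered")
```

A class counts as covered as soon as one thread mentions it, so the measure says nothing about how many threads discuss a class or how well they do so.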

Their analysis, however, is still too general. They analyzed coverage of API classes with no criterion or filter on Stack Overflow threads regarding the type of API documentation that they target. The type of API documentation can be characterized by the type of content that the documentation should include, which is defined by the intentions from the point of view of the API users, i.e., what they wish to accomplish through its use or knowledge concerning a given API. For example, an intention type can be how to implement specific tasks using an API [5]. Hence, without considering types of API documentation, an understanding of how the API elements are covered by the crowd and how the crowd knowledge can be used for generating API documentation was not possible.

In our work, on the other hand, we have introduced the notion that coverage analysis must be conducted according to the intended type of API documentation. With that in mind, we performed a coverage analysis considering types of Stack Overflow questions related to the main concerns of askers, thus supporting the choice of threads for different types of API documentation. We took into account the how-to-do-it and debug-corrective question types:

How-to-do-it: the asker describes a scenario and asks how to implement it (sometimes with a given technology or API) [7, 9]. Figure 1 shows an example of a how-to-do-it question.
Debug-corrective: the asker describes or presents problems in the code under development, such as run-time errors, notifications, and unexpected behavior [7, 9]. Figure 2 shows an example of a debug-corrective question.

Fig. 1 Example of a how-to-do-it question

Fig. 2 Example of a debug-corrective question

Therefore, the main difference between our work and that of Parnin et al. is that we analyzed the coverage of API elements by the crowd on Stack Overflow considering types of questions for different types of API documentation. We can also point out the following differences:

We presented an analysis of the co-occurrence of API elements on threads, i.e., how threads discuss multiple API elements.
Instead of analyzing the speed of the crowd at covering API elements over time, as Parnin et al. did, we analyzed the growth of coverage of API elements compared to the growth in the number of threads on Stack Overflow related to the same API.
We also analyzed the coverage of API elements against their actual usage, as Parnin et al. did; however, we collected API usage from a large code base repository instead of the Google Code Search API (which is no longer available), and we searched for usage in any type of statement instead of searching only in import statements as they did.
We evaluated our implementation of the linking approach (the identification of API elements in the content of threads), which is based on Parnin et al.’s linking approach and other works [12, 13]. However, as they did not carry out the same evaluation, we cannot judge the reliability of their linking implementation or perform comparisons.
Parnin et al.’s coverage analysis relies on API classes, while our coverage analysis relies on three types of API intermediate elements: classes, interfaces, and enumerations.
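As a rough illustration of what a co-occurrence analysis of API elements on threads might count, the sketch below tallies how many threads mention each unordered pair of elements. The thread-to-elements mapping, the function name `co_occurrence_counts`, and the toy data are assumptions; the analysis actually performed is described later in the paper.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Set


def co_occurrence_counts(links: Dict[str, Set[str]]) -> Counter:
    """Count, for each unordered pair of API elements, the number of
    threads in which both elements are mentioned."""
    counts: Counter = Counter()
    for elements in links.values():
        # Sorting gives every pair a canonical (alphabetical) order.
        for pair in combinations(sorted(elements), 2):
            counts[pair] += 1
    return counts


# Toy data, invented purely for illustration.
toy_threads = {
    "thread-1": {"JFrame", "JPanel"},
    "thread-2": {"JFrame", "JPanel", "JButton"},
    "thread-3": {"JTable"},  # a single element yields no pair
}
pairs = co_occurrence_counts(toy_threads)
for pair, count in pairs.most_common():
    print(pair, count)
```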

Additionally, we replicated Parnin et al.’s coverage analysis to quantify how coverage computed over threads with specific question types (how-to-do-it and debug-corrective), i.e., our coverage, differs from coverage computed over all threads, i.e., their coverage.

Automatic classification of questions

The selection of Stack Overflow threads by specific question types can be characterized as a text classification problem. A classification problem consists of mapping a data sample (in our case, text) to an appropriate class (or label) from a set known in advance.

Therefore, we rely on supervised machine learning algorithms to classify questions for selecting threads with how-to-do-it and debug-corrective question types. We could not reuse the classifier built by Souza et al. [7] and Campos et al. [14], as this classifier does not address debug-corrective questions. Likewise, the classifier built by Campos and Maia [15] does not focus only on how-to-do-it and debug-corrective questions.

Because how-to-do-it and debug-corrective questions are more directly related to API documentation than seeking-something and conceptual questions, we treat the latter two types as belonging to a single others class, and so we have a ternary classification problem.
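The reduction to three classes can be sketched as a simple label mapping. This is only an illustration: the string labels and the function name `to_ternary_label` are assumptions, not identifiers from the authors' tooling.

```python
# The two question types kept as classes of their own; every other
# type is folded into the "others" class.
TARGET_TYPES = {"how-to-do-it", "debug-corrective"}


def to_ternary_label(question_type: str) -> str:
    """Collapse a fine-grained question type into one of three classes."""
    return question_type if question_type in TARGET_TYPES else "others"


fine_grained = [
    "how-to-do-it",
    "conceptual",
    "debug-corrective",
    "seeking-something",
]
ternary = [to_ternary_label(label) for label in fine_grained]
print(ternary)
```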

In the remainder of this section, we present the classification algorithms and tools we used, the training set construction, the generation of input data for classification, and finally, the evaluation and selection of classification algorithms.

Classification algorithms and tools

No classification algorithm performs better than all others in every application domain. For this reason, to select an appropriate algorithm for the classification of Stack Overflow questions, we evaluated and compared the performance of different algorithms, which are listed as follows:

IBk° (nearest neighbor method) [16]
J48° (decision tree) [17]
C45 ^∙ (decision tree) [18]
NaiveBayes° ^∙ (Bayesian approach) [19]
BayesNet° (Bayesian approach) [20]
DecisionTable° (rule-based method) [21]
MultilayerPerceptron° (neural network) [22]
SMO° (support vector machine) [23]
RandomForest° (random forest) [24]
SimpleLogistic° (linear logistic regression) [25]
Logistic° (multinomial logistic regression) [26]
MaxEnt ^∙ (maximum entropy) [27]
MaxEntL1 ^∙ (multinomial logistic regression with L1 regularization) [27]

These algorithms belong to different types of classifiers. For instance, J48 and C45 are decision tree-based algorithms, while NaiveBayes and BayesNet are Bayesian approaches and SimpleLogistic, Logistic, MaxEnt and MaxEntL1 are logistic regression models. We chose these algorithms because (1) we want to evaluate algorithms from different types of classifiers, (2) they are well-known algorithms, and (3) their implementations are available.

Our evaluation was conducted using Weka [28] and Mallet [29], two open source tools containing a collection of machine learning algorithms, including classification algorithms and support for their evaluation. The difference between these tools is that Mallet is specialized in machine learning applications to text, such as information extraction and document classification, while Weka is for general machine learning and data mining tasks. In the above list of the algorithms selected for evaluation, ° and ^∙ flag algorithms from Weka and Mallet, respectively.

Training set construction

Classification algorithms fall within the category of supervised machine learning algorithms. This means that they must be trained with a set of labeled samples (training data) to be able to assign unlabeled samples (test data) to one of a set of classes.

Since we are interested in the selection of Android and Swing threads, we randomly selected 400 questions related to each API from the Stack Overflow database to construct the training sets. We consider a question to be related to an API if at least one of its tags contains the name of the API (“android” or “swing”). We decided to use non-exact word matching on tags, i.e., not to accept only tags that exactly match the “android” and “swing” keywords, as we observed that there are Android- and Swing-related questions tagged with packages of the API. For example, the question 951121 has been tagged with “android-widget”.
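The non-exact tag matching described above can be sketched as a substring test on each tag. The list-of-strings tag representation, the case-insensitive comparison, and the function name `related_apis` are assumptions for illustration; the authors' actual query against the Stack Overflow database is not given in the text.

```python
from typing import Iterable, Set

API_KEYWORDS = ("android", "swing")


def related_apis(tags: Iterable[str]) -> Set[str]:
    """Return the API keywords that occur as a substring of any tag."""
    lowered = [tag.lower() for tag in tags]
    return {kw for kw in API_KEYWORDS
            if any(kw in tag for tag in lowered)}


# "android-widget" matches although it is not the exact tag "android".
print(related_apis(["java", "android-widget"]))
# No tag contains either keyword, so the result is empty.
print(related_apis(["java", "jtable"]))
```

Substring matching also accepts any other tag that merely contains one of the keywords, so a real selection step may need to check for false positives.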

The 800 questions were manually and independently classified by two of the four authors to obtain reliable training sets. At the end of the classification process, we calculated the Kappa statistic [30] to assess the agreement between the two manual classifications. The observed agreement and the Kappa value for each training set and for both together are presented in Table 1. The observed agreement for Swing (84.75%) was higher than that for Android (78.75%), as was the Kappa value. For both Android and Swing, individually or together, the strength of the agreement based on the Kappa value is considered substantial [30].
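For readers unfamiliar with the statistic, the sketch below computes the observed agreement and a Kappa value for two raters. It assumes the two-rater Cohen's kappa formulation, which the text does not state explicitly, and the label sequences are toy data rather than the authors' classifications.

```python
from collections import Counter
from typing import Sequence, Tuple


def agreement_and_kappa(rater_a: Sequence[str],
                        rater_b: Sequence[str]) -> Tuple[float, float]:
    """Return (observed agreement, Cohen's kappa) for two raters who
    labeled the same items in the same order."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty item list")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label for every item.
        return observed, 1.0
    return observed, (observed - expected) / (1.0 - expected)


# Toy labels, invented purely for illustration.
first = ["how-to-do-it", "how-to-do-it", "debug-corrective", "others"]
second = ["how-to-do-it", "debug-corrective", "debug-corrective", "others"]
observed, kappa = agreement_and_kappa(first, second)
print(f"observed agreement = {observed:.2f}, kappa = {kappa:.3f}")
```

Kappa is lower than the observed agreement here because part of that agreement would be expected even if both raters labeled at random with the same label frequencies.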

Table 1 Kappa statistic on the training sets built by manual classification
