Towards a Holistic Model for Software Traceability
This web page is a companion to the FSE 2019 paper submission entitled "Towards a Holistic Model for Software Traceability".
The COMET Model
To aid comprehension of COMET’S underlying model, we provide a graphical representation using plate notation in Fig. 1, which we use to guide our introduction and discussion. The model in Fig. 1 is computed on a per-link basis, that is, for every potential link between a set of source artifacts (S) and target artifacts (T).
We use S and T to refer to a single source and target artifact of interest, respectively. COMET’S probabilistic model is formally structured as a Hierarchical Bayesian Network (HBN), centered upon a trace link prior θ, which represents the model’s prior belief about the probability that S and T are linked. Our model is hierarchical, as the trace link prior is influenced by a number of hyperpriors, which are in turn constructed in accordance with a set of hyper-parameters that are either derived empirically or fixed. In Fig. 1, hyperpriors are represented as empty nodes, and hyper-parameters are represented as shaded blue nodes. In general, empty nodes represent latent, or hidden, variables, whereas shaded nodes represent variables that are known or empirically observed quantities. The rectangles, or “plates”, in the diagram are used to group together variables that repeat.
To make our model easier to comprehend, we have broken it down into four major configurations, which we call stages, indicated by different colors in Fig. 1. The first stage of our model (shown in blue at top) unifies the collective knowledge of textual similarity metrics computed by IR techniques. The second stage (shown in orange in middle) reconciles expert feedback to improve the accuracy of predicted trace links. The third stage (shown in green at bottom) accounts for transitive relationships among development artifacts, and the fourth stage combines each of the underlying complexities. It should be noted that the first stage of our model can be taken as the “base case” upon which the other complexities build and is always required to predict the existence of a trace link.
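As a rough illustration of the Stage 1 "base case", the sketch below treats each IR technique's thresholded similarity as a binary observation and updates a prior belief about θ. The Beta prior, the Bernoulli likelihood, the default prior parameters, and the function name are all assumptions made for illustration; this page does not specify the distributions that COMET's HBN actually uses.

```python
def stage1_posterior_mean(similarities, thresholds, alpha=1.0, beta=1.0):
    """Posterior mean of theta under an assumed Beta(alpha, beta) prior.

    similarities: one textual similarity score per IR technique.
    thresholds:   one threshold per IR technique, in the same order.
    """
    if len(similarities) != len(thresholds):
        raise ValueError("need exactly one threshold per similarity score")

    # Each IR technique "votes" 1 if its score reaches its threshold, else 0.
    observations = [1 if s >= k else 0 for s, k in zip(similarities, thresholds)]
    positives = sum(observations)

    # Conjugate Beta-Bernoulli update: the posterior is
    # Beta(alpha + positives, beta + negatives), whose mean is returned.
    return (alpha + positives) / (alpha + beta + len(observations))


# Three hypothetical IR techniques, two of which exceed their threshold.
print(stage1_posterior_mean([0.8, 0.2, 0.6], [0.5, 0.5, 0.5]))
```

With a uniform prior, two positive votes out of three give a posterior mean of 3/5, which shows how several IR techniques can be pooled into a single link probability.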
Trace Link Reasoning
The posterior distributions illustrated above show the cause-effect relationships among hidden variables within Comet’s Hierarchical Bayesian Network and are defined by Bayes’ rule, P(cause | effect).
In our case, the causes are represented by different sources of information. Such sources are the w variable (hyperparameter for the mixture model), Bmix (random variable for the transitive link probability), μn (random variable for information retrieval), and linkage θ (hidden variable for traceability). This model is clearly a causal-diagnostic rule model, as described by Russell & Norvig, in which learning the DAG is not an important task. However, Russell & Norvig pose a causal structure based on prior probabilities and plausible behaviour. We use this causal-diagnostic theory to uncover artifacts’ relationships using our intuition of how the information naturally flows.
More specifically, the graph above shows the hyper-posterior distributions when accounting for execution information in the 3rd stage of Comet’s HBN for the Industry-Net project. In this case, incorporating the execution information caused an increase in the probability that this particular link exists. In this manner, developers can be informed about the contribution of execution information to the overall probability of traceability.
The Comet Jenkins Plugin
The screenshots below illustrate the User Interface of the Jenkins plugin that was used to evaluate Comet during the user study.
The above screenshots show the general front-end user interface for the automated traceability plugin we developed in collaboration with our industrial partner. The Comet HBN serves as the backend and provides information regarding the probability that a trace link exists between a given pair of software artifacts. For the project indexed in the user interface above, Comet has analyzed source code, test cases, and requirements. The first dropdown menu provides the user with options to view potential links between differing artifact pairs.
The list of candidate trace links is shown just below the configuration dropdown menus. Each row displays a “base artifact”; the linked artifacts, as determined by Comet’s HBN, are shown in the middle of the list, with the textual similarity from Stage 1 of Comet’s HBN shown directly next to each artifact. The rightmost column shows the artifact similarity from the final stage of Comet’s HBN, after taking into account developer feedback and transitive links.
This list can be generated and updated upon each trigger of a Jenkins pipeline, which in turn can be triggered upon each commit to a project’s software repository. Comet maintains a cache of its HBN for each pair of artifacts and only has to recalculate changed artifact pairs as the project evolves, which makes the tool well suited to agile projects that utilize CI/CD pipelines.
The above screenshot illustrates the traceability interface for examining potential links between requirements and test cases.
The above screenshot illustrates the sorting functionality of our traceability plugin. Developers or analysts using our tool are able to select from one of three categories: “Probably Linked”, indicating a high artifact similarity score based on Comet’s HBN; “Unsure”, for cases where the artifact similarity score fell within a standard deviation of the median similarity value for a given set of artifact pairs; and “Probably Not Linked”, where the artifact similarity score was in the bottom quartile of similarity values for a given set of artifact pairs. This allows the developer to quickly shift between views and traceability tasks.
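The three categories above can be sketched as a small classification function. The order in which the rules are checked, the use of the population standard deviation, and the function name are assumptions for illustration; the description above does not say how the plugin resolves a score that satisfies more than one rule.

```python
import statistics


def categorize_link(score, all_scores):
    """Place one similarity score into one of the three plugin categories."""
    median = statistics.median(all_scores)
    spread = statistics.pstdev(all_scores)
    first_quartile = statistics.quantiles(all_scores, n=4)[0]

    # Assumption: the bottom-quartile rule is checked first.
    if score <= first_quartile:
        return "Probably Not Linked"
    # Assumption: "high" means more than one standard deviation above the median.
    if score > median + spread:
        return "Probably Linked"
    # Everything else, including scores near the median, is left undecided.
    return "Unsure"


scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
for s in (0.1, 0.5, 0.9):
    print(s, categorize_link(s, scores))
```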
This screen shows the interface that allows developers or analysts to provide feedback for a given pair of artifacts. If no feedback has previously been provided for an artifact pair, then the developer or analyst simply has to click on the “None” link to bring up the feedback modal dialog. This dialog presents the developer with five options for feedback. Each of these options maps to a value in [0,1] that indicates the level of confidence that a developer has in the link existing. These values are as follows: “Strongly Agree”=0.95, “Agree”=0.75, “Undecided”=0.5, “Disagree”=0.25, and “Strongly Disagree”=0.05.
This interface was selected to simplify the feedback process and make it quick and easy for developers to provide expert feedback to Comet’s HBN. In some instances, a developer or analyst may want to view the artifacts in question to make a further determination regarding their similarity; thus the Comet plugin allows either artifact in the pair to be opened by clicking on the hyperlinked artifact names at the top of the modal dialog window.
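The mapping from feedback options to confidence values can be written down directly. The numeric values below are the ones listed above; the dictionary name, the function name, and the case-insensitive lookup are hypothetical choices for this sketch, not the plugin's actual code.

```python
# Confidence values as listed in the text.
FEEDBACK_CONFIDENCE = {
    "strongly agree": 0.95,
    "agree": 0.75,
    "undecided": 0.5,
    "disagree": 0.25,
    "strongly disagree": 0.05,
}


def feedback_to_confidence(label):
    """Return the confidence for a feedback label, or None if no feedback exists."""
    if label is None or label.strip().lower() == "none":
        return None  # no feedback has been provided for this artifact pair yet
    key = label.strip().lower()
    if key not in FEEDBACK_CONFIDENCE:
        raise ValueError(f"unknown feedback option: {label!r}")
    return FEEDBACK_CONFIDENCE[key]


print(feedback_to_confidence("Strongly Agree"))
print(feedback_to_confidence("None"))
```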
The above screenshot illustrates the updated traceability after a developer or analyst has provided feedback via the modal dialog. We can observe that the “Feedback” entry for the top pair of artifacts now reads “Strongly agree” and the Artifact similarity score has been updated to account for this feedback, increasing the score by the amount stipulated by Comet’s HBN.
The above screen illustrates the links between source code files and test cases as determined by collecting runtime information via lightweight instrumentation during the automated testing process via the Jenkins pipeline.
During our limited deployment of the traceability tool with our industrial partner, we observed one highly desired use case from security professionals. This use case involved having the plugin display artifacts that were not linked to any other artifact according to Comet’s HBN. The intuition behind this feature is that such artifacts represent “suspicious” requirements or source code snippets that must be inspected further. The above screen shows the support that our Comet plugin provides for this use case: it simply lists artifacts that are unlikely to have a link to any other artifact according to a predefined threshold (0.15 in our plugin implementation).
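A minimal sketch of this "unlinked artifact" view follows. It assumes link probabilities are available as a dictionary keyed by artifact pair and that an artifact counts as unlinked when its highest link probability is below the threshold; the data layout, the function name, and the example artifact names are hypothetical, and only the 0.15 threshold comes from the text.

```python
def find_unlinked_artifacts(link_probabilities, threshold=0.15):
    """Return artifacts whose best link probability is below the threshold.

    link_probabilities: dict mapping (artifact_a, artifact_b) -> probability.
    """
    best = {}
    for (a, b), probability in link_probabilities.items():
        # Track the strongest link seen so far for both ends of the pair.
        best[a] = max(best.get(a, 0.0), probability)
        best[b] = max(best.get(b, 0.0), probability)
    return sorted(name for name, p in best.items() if p < threshold)


probabilities = {
    ("R1", "A.java"): 0.80,
    ("R1", "B.java"): 0.12,
    ("R2", "A.java"): 0.10,
    ("R2", "B.java"): 0.05,
}
print(find_unlinked_artifacts(probabilities))
```

In this example R2 and B.java never reach 0.15 with any partner, so both would be flagged for inspection.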
RQ1: How effective is Comet in terms of predicting candidate trace links using combined information from IR techniques?
RQ2: To what extent does expert feedback impact the accuracy of the candidate links of Comet?
RQ3: To what extent does information from transitive links improve Comet’s trace link prediction accuracy?
RQ4: How effective is Comet’s holistic model in terms of predicting candidate trace links?
RQ5: Do professional developers and security analysts find our implementation of Comet useful?
The context of this empirical study includes the eight datasets shown in the table below. Six of these datasets are taken from the open source CoEST community datasets. Note that we do not use all available subjects in the CoEST repository, as we limited our studied systems to those that: (i) included trace links from requirements or use cases written in natural language to some form of code artifact, (ii) were written in English and/or included English translations, and (iii) had at least 1k LoC. We utilize two datasets to investigate and tune the hyper-parameters of Comet’s HBN: Albergate and the Rq→Tests dataset of the EBT project. We utilize the other six datasets for our empirical evaluation. The subject system called “Industry-Net” is an anonymization of an open source, networking-related software project that was created and is actively maintained by our industrial partner. The ground truth set of trace links between Rq→Src and Rq→Tests was created by a group of authors with feedback from engineers working on the project.
|Project||Description||Language||Size (LoC)||Artifacts|
|EBT||Traceability Benchmark||Java||1,747||Rq→Tests|
|eTour||Tour guide management||Java||23,065||UC→Src|
|EBT||Traceability Benchmark||Java||1,747||Rq→Src|
|SMOS||School Management||Java||9,019||UC→Src|
|iTrust||Medical System||Java, JSP, JS||38,087||Rq→Src|
The “base” first stage of Comet’s HBN is able to utilize and unify information regarding the textual similarity of development artifacts as computed by a set of IR techniques. While there is technically no limit to the number of IR techniques that can be utilized, we parameterized our experiments using the 10 IR techniques enumerated in Table II. The first five techniques are standalone IR techniques, whereas the second five are combined techniques utilizing the methodology introduced by Gethers et al. This combined approach normalizes the similarity measures of two IR techniques and combines them using a weighted sum. We set the weighting factor λ for each technique equal to 0.5, as this was the best performing configuration reported in the prior work. The other parameters for each of the techniques were derived by performing a series of experiments on the two tuning datasets and using the optimal values from these experiments. For all IR techniques, we preprocessed the text by removing non-alphabetic characters and stop words, stemming, and splitting camelCase. Note that non-deterministic techniques such as LSI, LDA, and NMF were run over multiple trials.
|IR Technique||Tag||Model Parameters||Threshold Technique|
|Vector Space Model||VSM||N/A||Link-Est|
|Latent Semantic Indexing||LSI||k=30||Link-Est|
|Latent Dirichlet Allocation||LDA||# Topics=40
|NonNegative Matrix Factorization||NMF||# Topics = 30||Median|
|Combined VSM + LDA||VSM+LDA||k=5
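The combined techniques described above (normalize two sets of similarity scores, then take a weighted sum with λ = 0.5) can be sketched as follows. Min-max scaling is used here as the normalization step purely as an assumption; the text does not state which normalization Gethers et al. apply, and the function names are hypothetical.

```python
def min_max_normalize(values):
    """Scale values linearly into [0, 1] (assumed normalization scheme)."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0 for _ in values]  # all scores identical: nothing to rank
    return [(v - lowest) / (highest - lowest) for v in values]


def combine_similarities(sims_a, sims_b, lam=0.5):
    """Weighted sum of two normalized similarity lists over the same pairs."""
    if len(sims_a) != len(sims_b):
        raise ValueError("both techniques must score the same artifact pairs")
    norm_a = min_max_normalize(sims_a)
    norm_b = min_max_normalize(sims_b)
    return [lam * a + (1 - lam) * b for a, b in zip(norm_a, norm_b)]


# Two hypothetical techniques scoring the same three artifact pairs
# on different scales.
print(combine_similarities([0.0, 0.5, 1.0], [10.0, 20.0, 30.0]))
```

Normalizing first keeps a technique with a wider score range from dominating the sum.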
IR Threshold Determination
In order to accurately estimate the likelihood function Y for Comet’s HBN, we need to choose a threshold ki for each IR technique that maximizes the precision and recall of the trace links according to the computed textual similarity values. To derive the best method for determining the threshold for each IR technique, we performed a meta-evaluation on our two tuning datasets. We examined six different threshold estimation techniques: (i) using 1% of the ground truth, (ii) using the mean of all similarity measures for a given dataset, (iii) using the median of all similarity measures across a given dataset, (iv) using a Min-Max estimation, (v) using a sigmoid estimation, and (vi) Link-Est, where an estimation of the number of confirmed links for a dataset is made based on the number of artifacts, and a threshold is derived to ensure that the estimated number of links lies above that threshold. We performed each of these threshold estimation techniques for all studied IR techniques across our two tuning datasets, and compared each estimation to the known optimal threshold. We used the optimal technique across our two tuning datasets, as reported in Table II.
1% of the Ground Truth: This technique simply utilizes 1% of the ground truth to find the optimal threshold for the sampled one percent. While this can be accurate, it requires some known existing links.
Mean of Similarity Measures: This technique simply takes the mean of all the similarity measures as the threshold.
Median of Similarity Measures: This technique simply takes the median of all the similarity measures as the threshold.
Min-Max Estimation: This technique simply takes the Max value, subtracts the min value, and divides by two to determine a threshold.
Sigmoid Estimation: This technique fits a sigmoid curve to the generated IR similarity values to determine a threshold.
Link-Est (Simple Inference): For this technique, we assume a fixed number of artifacts are linked to each source artifact (e.g., requirements), then we multiply this number by the total number of source artifacts, and then divide by the number of target artifacts. This gives us a number N, and an optimal threshold is derived by ordering all similarity values in ascending order and taking the Nth similarity value.
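Four of the techniques above are simple enough to sketch directly; the ground-truth sampling and sigmoid fitting are omitted. The code follows the descriptions literally, including the ascending sort for Link-Est. The function names, the rounding of N to an integer, and the clamping of N to the valid range are assumptions for this sketch.

```python
import statistics


def mean_threshold(similarities):
    return statistics.mean(similarities)


def median_threshold(similarities):
    return statistics.median(similarities)


def min_max_threshold(similarities):
    # Max value minus min value, divided by two, as described above.
    return (max(similarities) - min(similarities)) / 2


def link_est_threshold(similarities, links_per_source, num_sources, num_targets):
    # N = assumed links per source * number of sources / number of targets.
    n = round(links_per_source * num_sources / num_targets)
    n = max(1, min(n, len(similarities)))  # assumption: keep N in range
    ordered = sorted(similarities)  # ascending order, as in the description
    return ordered[n - 1]  # the Nth similarity value (1-indexed)


sims = [0.1, 0.4, 0.2, 0.9, 0.5]
print(mean_threshold(sims), median_threshold(sims), min_max_threshold(sims))
print(link_est_threshold(sims, links_per_source=2, num_sources=3, num_targets=2))
```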
The Precision/Recall and ROC curves for the first stage of Comet's HBN are illustrated below (click on a figure for more detail). The blue, red, and green curves represent the P/R of link probabilities inferred according to VI, NUTS, and MAP, respectively. The solid grey line represents the median of all IR techniques, and the dotted grey lines indicate the best and worst IR techniques, respectively. As these results illustrate, Comet’s Stage 1 HBN outperforms the median of the IR techniques on average and, in certain cases, such as for the iTrust project, is able to perform nearly as well as the best IR technique. In fact, across all studied subjects, the most effective inference technique for Comet's Stage 1 model matches or outperforms the median of the IR techniques in terms of AP, with Comet achieving an overall AP = 0.33 and the median of the IR techniques achieving an overall AP = 0.29. According to the Wilcoxon signed rank test, Comet outperformed the median to a statistically significant degree for each subject (where p < 0.001 in each case).
These results signal strong performance for Comet's Stage 1 model. Recall that the Stage 1 model only utilizes observations taken from the set of ten IR techniques introduced above; thus we do not expect this stage of the model to outperform the best performing IR technique. However, what the first stage of our model does provide is consistent performance across datasets. IR techniques are notoriously difficult to configure for peak performance. Our model, in contrast, outperforms the median of the IR techniques while using IR parameters tuned on completely separate datasets. Thus, Stage 1 of Comet's model is attractive because its performance transfers across datasets.
The P/R curves for all subject systems are shown below for each error rate (0%, 25%, 50%). In these figures, the blue curve represents Stage 2 of Comet’s HBN and the red curve represents the results for Stage 1 for the randomly sampled set of 10% of the subject programs’ potential links. These figures illustrate that Stage 2 of Comet’s HBN is able to effectively incorporate expert feedback into its predicted trace link probabilities, as the Stage 2 model dramatically outperforms both the median and best IR techniques as well as the first stage of the model, even with an error rate of 25%. Furthermore, we see this trend continue across subjects for the 25% error rate, where the AP across all subjects for Stage 2 sampled links was 0.57 compared to 0.35 for Stage 1, and the results for all subjects are statistically significant (p < 0.01). However, for larger error rates such as 50%, we see Stage 2 start to underperform Stage 1, where the AP across all subjects for Stage 2 sampled links was 0.33 compared to 0.35 for Stage 1. Finally, when considering all subjects with 0% error, Comet’s Stage 2 model unsurprisingly achieves perfect precision and recall for the sampled links. These results illustrate that Stage 2 of Comet’s HBN is able to effectively utilize expert feedback to improve its predictions.
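For readers unfamiliar with the AP figures quoted above, the sketch below computes average precision for a ranked list of candidate links using the standard ranked-retrieval definition. That this matches the exact computation used in the evaluation is an assumption, and the function and variable names are hypothetical.

```python
def average_precision(ranked_links, true_links):
    """Mean of the precision values at each rank where a true link appears.

    ranked_links: candidate links ordered from most to least probable.
    true_links:   set of links present in the ground truth.
    """
    if not true_links:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, link in enumerate(ranked_links, start=1):
        if link in true_links:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    return precision_sum / len(true_links)


# True links found at ranks 1 and 3: AP = (1/1 + 2/3) / 2
print(average_precision(["a", "b", "c", "d"], {"a", "c"}))
```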
Results for 0% Expert Error
Results for 25% Expert Error
Results for 50% Expert Error
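One way to picture the Stage 2 setup is as simulated expert feedback: sample 10% of the candidate links, then flip the ground-truth answer for a fraction of them equal to the error rate. The sketch below is only an assumption about how such a simulation could be written; the function name, the boolean feedback format, and the seeding are hypothetical and are not taken from the evaluation itself.

```python
import random


def simulate_feedback(ground_truth, sample_fraction=0.10, error_rate=0.25, seed=0):
    """Return simulated expert answers for a random sample of artifact pairs.

    ground_truth: dict mapping (source, target) -> True if the link exists.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    pairs = sorted(ground_truth)
    sample_size = max(1, round(len(pairs) * sample_fraction))
    feedback = {}
    for pair in rng.sample(pairs, sample_size):
        answer = ground_truth[pair]
        # With probability error_rate the simulated expert answers wrongly.
        if rng.random() < error_rate:
            answer = not answer
        feedback[pair] = answer
    return feedback


truth = {("R%d" % i, "C%d" % j): (i == j) for i in range(5) for j in range(4)}
print(simulate_feedback(truth, error_rate=0.25))
```

With an error rate of 0 the sampled answers equal the ground truth, which is consistent with the perfect precision and recall reported for the 0% case.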
The figures below present the P/R curves for Stage 3 of Comet’s HBN for our six subject systems’ links that were found to have transitive requirement relationships. These figures illustrate that incorporating the transitive links into Comet’s HBN leads to a small increase in AP. However, the results for Stage 3 are quite subtle across systems. Across all systems, taking into account the best performing value of τ, the Stage 3 HBN narrowly outperforms the Stage 1 model in terms of overall AP, with Stage 3 AP = 0.342 and Stage 1 AP = 0.336. Stage 3 outperformed Stage 1 in terms of precision to a statistically significant degree on three subjects. It should also be noted that the incorporation of transitive links did generally lead to higher precision at higher recall values, which can be important when higher recall is required. In summary, we find that incorporating transitive requirement links in Stage 3 of the HBN leads to a slight improvement in predictions compared to Stage 1.
Results for Req - Req Transitive Links T = 0.50
Results for Req - Req Transitive Links T = 0.65
The figures below illustrate the P/R curves of Comet’s Stage 4 “holistic” model for three subjects, where the blue curve represents Stage 4 and the red curve represents Stage 3. These figures illustrate that Comet’s holistic HBN is able to outperform both the Stage 1 model and the best IR technique by incorporating information from both expert feedback with a 25% error rate and transitive requirement links. We found that Comet’s holistic model generally outperformed Stage 1, but to varying magnitudes. For instance, the AP for Industry-Net Req→Tests increased by 0.04 over Stage 1, whereas the AP for Industry-Net Req→Src increased by only 0.01 over Stage 1. Nevertheless, our results clearly indicate that the holistic HBN performs best, outperforming the median and, in certain cases, the best IR technique.
Results for Holistic Model
The figure below provides the responses to the Likert-based UX questions from the six developers who work on the Industry-Net project after interacting with the Comet plugin. Overall, the responses from these developers were quite positive. They generally agreed that the Comet plugin was easy to use and understand and, more importantly, generally found the predicted links and non-links to be accurate. These results exhibit the potential applicability of the Comet plugin for developers when it is applied to a mature industrial software project. More encouraging still were the responses recorded during the semi-structured interviews with the software auditing groups. In these interviews, the teams believed that Comet would be very useful for code auditing, as it would “allow compliance analysts to [inspect] links, look at the code and validate [the links]”. Furthermore, a team responsible for security audits of systems found an interesting use case for Comet that is often overlooked in traceability analysis. That is, they were interested in code and requirements that are not linked to any other artifact, as such artifacts are likely to be suspicious and should be inspected further. In this case, Comet’s predictions of non-links would be just as important as the prediction of links. Overall, teams working with our industrial partner saw great promise in Comet.
Results for Industry-Net Case Study UX Questions
Code and Data
Below we provide links to the Comet data set and replication package, as well as the code repository for our implementation of Comet as both an extensible Python library and a Jenkins plugin. (These will be made available upon paper acceptance.)