ConCodeSe utilises state-of-the-art data extraction, persistence and search APIs. The Java code is parsed using a source code mining tool. We also developed a Java module that uses Lucene's StandardAnalyzer to tokenise the text of the bug reports into terms and to remove stop words.
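To illustrate the kind of processing involved, the following is a minimal plain-Java sketch of tokenisation with stop-word removal. It is not ConCodeSe's module and does not call Lucene: the class name `BugReportTokeniserSketch`, the splitting rule and the short stop-word list are all assumptions made for illustration, and Lucene's analyser applies more refined tokenisation rules.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class BugReportTokeniserSketch {

    // Hypothetical, deliberately short stop-word list (illustration only).
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "and", "the", "is", "in", "of", "to", "when", "with"));

    // Lower-cases the text, splits it on runs of non-alphanumeric characters
    // and drops empty tokens and stop words.
    static List<String> tokenise(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }
}
```

For example, under these assumptions the bug report summary "NullPointerException in the parser when file is empty" yields the terms `nullpointerexception`, `parser`, `file` and `empty`.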
Given a bug report (BR) and a source file, our approach computes two kinds of scores for the file: a probabilistic score, given by VSM as implemented by Lucene, and a lexical similarity score. Each kind of score is computed with four search types, each using a different set of terms indexed from the BR and the file.
For each of the 8 combinations of score kind and search type (2 × 4), all files are ranked in descending order of score. Then, for each file, we take the best of its 8 ranks.
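The ranking step can be sketched in Java as follows. This is a minimal illustration and not ConCodeSe's implementation: the class and method names are hypothetical, each scoring combination is assumed to be available as a map from file name to score, and ties are assumed to be broken alphabetically by file name, which the text does not specify.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BestRankSketch {

    // Ranks the files of one scoring combination by descending score (rank 1 is best).
    // Assumption: equal scores are ordered alphabetically by file name.
    static Map<String, Integer> rank(Map<String, Double> scores) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(scores.entrySet());
        entries.sort((a, b) -> {
            int byScore = Double.compare(b.getValue(), a.getValue());
            return byScore != 0 ? byScore : a.getKey().compareTo(b.getKey());
        });
        Map<String, Integer> ranks = new HashMap<>();
        for (int i = 0; i < entries.size(); i++) {
            ranks.put(entries.get(i).getKey(), i + 1);
        }
        return ranks;
    }

    // For each file, keeps the best (numerically smallest) rank across all
    // scoring combinations, e.g. the 8 combinations described above.
    static Map<String, Integer> bestRanks(List<Map<String, Double>> scoreTables) {
        Map<String, Integer> best = new HashMap<>();
        for (Map<String, Double> scores : scoreTables) {
            for (Map.Entry<String, Integer> e : rank(scores).entrySet()) {
                best.merge(e.getKey(), e.getValue(), Integer::min);
            }
        }
        return best;
    }
}
```

With two score tables instead of eight, a file ranked first in one table and second in the other ends up with a best rank of 1. Taking the minimum rank means a file needs to score well under only one combination to appear near the top of the final list.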
Whilst other localisation algorithms take a “one size fits all” approach, we treat each BR and file individually, using the summary, stack trace, stemming, comments and file names only when available and relevant, i.e. when they improve the ranking.