On the Validity of Benchmarking for Evaluating Code Quality

H. Gruber, R. Plösch, M. Saft: On the Validity of Benchmarking for Evaluating Code Quality, Proceedings of the joined International Conferences on Software Measurement IWSM/MetriKon/Mensura 2010, Stuttgart, Germany, November 10-12, Shaker Verlag, Aachen, 2010, doi:10.13140/RG.2.1.1893.0961

Evaluating source code quality is a laborious task when performed by experts. A number of approaches try to provide an automatic assessment instead. Absolute quality assessment methods (e.g. based on thresholds) have not yet proven the significance of their results. Software benchmarking is a relative assessment approach based on the general idea of benchmarking in other industries. We developed a benchmarking-oriented code quality assessment method that at least overcomes known technical problems of other benchmark-based methods. Nevertheless, the major concern is to validate the significance of benchmarking-based results by comparing them with the quality assessments of experts. For this purpose we conducted two studies (one for Java projects, the other for C# projects) involving a total of 10 open source projects and 22 experts. While the first experiment, with Java projects, produced a result that motivated us to use the benchmarking-oriented assessment more intensively, the experiment with C# projects showed that the results of the automatic benchmark assessment cannot be trusted blindly.