Escort

Automatically deriving software clustering constraints

Escort Introduction

The effectiveness of software maintenance tasks depends heavily on the accuracy and reliability of software documentation, especially if the tasks are outsourced to third-party vendors. If the documentation is out of date, a considerable amount of time needs to be spent on software comprehension activities. Software clustering is often used as a remodularization and architecture recovery technique to help developers simplify software maintenance tasks and ease the burden of software comprehension. Unsupervised clustering techniques, however, tend to ignore prior knowledge from domain experts, leading to results that can be nonsensical to developers. Semi-supervised clustering (constrained clustering) can incorporate supervision from domain experts or side information to improve the results of classic unsupervised clustering techniques. However, these techniques rely heavily on manual analysis to identify clustering constraints and hence do not scale well.

We propose an evolution-aware software clustering constraint derivation approach, Escort, which automatically derives clustering constraints based on evolutionary data of the analyzed software. Specifically, Escort can serve as an alternative approach to derive implicit and explicit constraints in situations where domain experts are absent. In the subsequent constrained clustering process, Escort can be considered a framework that supplements and enhances various unsupervised clustering techniques to improve their accuracy and reliability. We evaluate Escort through both quantitative and qualitative analysis. For the quantitative validation, the experimental results show that our approach outperformed five other unsupervised clustering techniques. For the qualitative validation, we invited experienced developers working in five IT companies and students majoring in software engineering to participate in a survey evaluating the rationality of the generated clustering constraints. The survey shows that the participants agreed with the clustering constraints generated by Escort. Moreover, we evaluate the usefulness of refactoring suggestions based on the generated constraints. The validation indicates that Escort is capable of providing meaningful refactoring suggestions that are consistent with the real refactoring operations (obtained by Refactoring Miner from commit messages) performed by developers. In particular, we reported the 15 refactoring suggestions generated by Escort that had not yet been carried out to the respective developers on GitHub for further validation. Encouragingly, 60% of our reported refactoring suggestions have been acknowledged by the developers, who have either incorporated them directly or plan to incorporate them in future releases.
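
As a rough illustration of what deriving constraints from evolutionary data can look like, the sketch below compares package assignments across several versions of a system. It is not Escort's actual derivation procedure, which this page does not describe. The function name `derive_constraints`, the must-link/cannot-link terminology, the rule itself, and the toy class and package names are all assumptions made for the example.

```python
from itertools import combinations


def derive_constraints(versions):
    """Derive pairwise constraints from package assignments over time.

    versions: list of dicts mapping class name -> package name, oldest first.
    Returns (must_link, cannot_link) as sets of frozenset pairs.

    Hypothetical rule (an assumption, not Escort's): among classes present
    in every version, a pair is must-link if the two classes always share
    a package, and cannot-link if they never do.
    """
    if not versions:
        return set(), set()

    # Only consider classes that exist in every analyzed version.
    stable = set(versions[0])
    for snapshot in versions[1:]:
        stable &= set(snapshot)

    must_link, cannot_link = set(), set()
    for a, b in combinations(sorted(stable), 2):
        same = [snapshot[a] == snapshot[b] for snapshot in versions]
        if all(same):
            must_link.add(frozenset((a, b)))
        elif not any(same):
            cannot_link.add(frozenset((a, b)))
        # Pairs that were only sometimes together yield no constraint.
    return must_link, cannot_link


# Toy history: three versions of a made-up system.
versions = [
    {"Broker": "core", "Queue": "core", "HttpTransport": "net", "Codec": "net"},
    {"Broker": "core", "Queue": "core", "HttpTransport": "net", "Codec": "core"},
    {"Broker": "core", "Queue": "core", "HttpTransport": "net", "Codec": "core",
     "Topic": "core"},
]
must_link, cannot_link = derive_constraints(versions)
```

In this toy history, `Broker` and `Queue` end up must-linked, `HttpTransport` is cannot-linked with both of them, and `Codec` (which moved between packages) and `Topic` (which is absent from the older versions) produce no constraints. A constrained clustering algorithm could then take such pairs as side information in place of input from a domain expert.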

Studied Subjects

| ID | Project | # Versions | # Major Versions | # Stars | KLOC (Avg) | # Classes (Avg) | Commits |
|---|---|---|---|---|---|---|---|
| 1 | Activemq | 64 | 2 | 1,764 | 324.9 | 3,057 | 10,601 |
| 2 | Activemq-artemis | 32 | 2 | 602 | 518.3 | 3,324 | 7,502 |
| 3 | Aeron | 86 | 2 | 5,065 | 51.1 | 330 | 12,654 |
| 4 | Alluxio | 62 | 3 | 4,613 | 248.0 | 916 | 0,937 |
| 5 | Apktool | 34 | 2 | 10,220 | 16.6 | 179 | 1,648 |
| 6 | Assertj-core | 50 | 3 | 1,756 | 109.9 | 2,600 | 2,870 |
| 7 | Atmosphere | 204 | 3 | 3,430 | 40.6 | 259 | 5,931 |
| 8 | Atomix | 95 | 3 | 1,901 | 55.6 | 619 | 4,265 |
| 9 | AxonFramework | 99 | 4 | 2,020 | 93.0 | 724 | 5,951 |
| 10 | Beam | 83 | 2 | 3,998 | 389.6 | 1,063 | 27,132 |
| 11 | Bisq | 86 | 2 | 3,102 | 111.1 | 892 | 11,168 |
| 12 | Byte-buddy | 202 | 2 | 3,485 | 117.0 | 581 | 5,200 |
| 13 | Calcite | 52 | 2 | 1,894 | 211.5 | 869 | 4,175 |
| 14 | Camel | 154 | 3 | 3,242 | 680.0 | 7,981 | 45,096 |
| 15 | Cas | 218 | 4 | 7,620 | 91.1 | 1,219 | 16,869 |
| 16 | Cassandra | 241 | 4 | 5,950 | 189.2 | 775 | 25,297 |
| 17 | Conversations | 215 | 3 | 3,541 | 54.6 | 150 | 6,274 |
| 18 | Cxf | 153 | 2 | 642 | 527.7 | 4,618 | 15,722 |
| 19 | Dbeaver | 108 | 4 | 13,652 | 286.0 | 2,233 | 16,052 |
| 20 | Debezium | 73 | 2 | 3,265 | 75.5 | 363 | 3,125 |
| 21 | Discovery | 76 | 3 | 2,954 | 17.4 | 289 | 2,403 |
| 22 | Dropwizard | 147 | 3 | 7,657 | 44.0 | 509 | 5,430 |
| 23 | Eclim | 76 | 2 | 1,026 | 33.2 | 326 | 4,849 |
| 24 | Flink | 101 | 2 | 13,149 | 698.3 | 4,037 | 22,170 |
| 25 | Fresco | 40 | 2 | 16,207 | 89.2 | 547 | 2,531 |
| 26 | Grakn | 45 | 2 | 2,107 | 76.6 | 570 | 4,291 |
| 27 | Guacamole-client | 33 | 2 | 1,004 | 19.5 | 281 | 5,378 |
| 28 | Hadoop | 293 | 4 | 10,489 | 972.6 | 1,784 | 23,874 |
| 29 | Hawtio | 137 | 2 | 1,138 | 63.3 | 199 | 8,803 |
| 30 | Hive | 40 | 2 | 3,174 | 850.3 | 2,345 | 14,501 |
| 31 | Java-tron | 51 | 3 | 2,380 | 80.2 | 849 | 14,129 |
| 32 | Karaf | 82 | 3 | 480 | 80.0 | 655 | 8,197 |
| 33 | Maxwell | 170 | 2 | 2,141 | 68.8 | 123 | 3,110 |
| 34 | Nifi | 88 | 2 | 2,066 | 60.1 | 693 | 5,286 |
| 35 | Okhttp | 95 | 4 | 37,252 | 50.3 | 167 | 4,645 |
| 36 | Openapi-generator | 53 | 3 | 5,446 | 374.2 | 542 | 14,218 |
| 37 | Orientdb | 157 | 3 | 4,154 | 368.1 | 2,329 | 19,352 |
| 38 | Pdfbox | 52 | 2 | 1,162 | 134.7 | 939 | 8,962 |
| 39 | Pmd | 70 | 2 | 2,887 | 184.3 | 1,415 | 16,532 |
| 40 | Powermock | 42 | 2 | 3,121 | 36.8 | 590 | 1,607 |
| 41 | Redisson | 163 | 3 | 13,242 | 74.7 | 486 | 5,675 |
| 42 | Rest-assured | 56 | 3 | 4,748 | 20.0 | 180 | 1,959 |
| 43 | Speedment | 67 | 2 | 1,832 | 95.3 | 1,537 | 4,674 |
| 44 | Spotbugs | 41 | 2 | 1,894 | 227.6 | 1,891 | 16,206 |
| 45 | Spring-framework | 175 | 3 | 37,411 | 502.5 | 3,773 | 20,896 |
| 46 | Spring-security | 143 | 4 | 4,843 | 145.0 | 1,231 | 8,732 |
| 47 | Storm | 33 | 2 | 6,078 | 160.0 | 920 | 10,316 |
| 48 | Testcontainers-java | 73 | 2 | 3,805 | 8.3 | 175 | 2,008 |
| 49 | Tika | 56 | 2 | 1,002 | 82.0 | 526 | 4,747 |
| 50 | Traccar | 31 | 2 | 2,392 | 25.9 | 415 | 6,227 |

Quantitative evaluation (RQ1)

Number of clustering constraints derived from subjects.xlsx

The results of the application of ESCORT in different algorithms.xlsx

Qualitative evaluation (RQ2)

Questionnaire

Questionnaire.zip

The issues reported by Escort

| ID | Project | Filed Issue ID | # Suggested refactorings | Status |
|---|---|---|---|---|
| 1 | Activemq | #8583 | 1 | Fixed |
| 2 | Alluxio | #16439 | 2 | Pending |
| 3 | Atmosphere | #2475 | 1 | Pending |
| 4 | Beam | #23896 | 1 | Confirmed |
| 5 | Bisq | #6395 | 3 | Confirmed |
| 6 | Cxf | #8690 | 2 | Confirmed |
| 7 | Redisson | #4642 | 1 | Pending |
| 8 | Openapi-generator | #12200 | 1 | Pending |
| 9 | Orientdb | #9787 | 1 | Pending |