Classifying generated white-box tests: an exploratory study
White-box test generation analyzes the code of the system under test, selects relevant test inputs, and captures the observed behavior of the system as expected values in the tests. However, if there is a fault in the implementation, this fault could get encoded in the assertions (expectations) of the tests. The fault is recognized only if the developer using test generation is also aware of the real expected behavior. Otherwise, the fault remains silent both in the test and in the implementation. A common assumption is that developers using white-box test generation techniques need to inspect the generated tests and their assertions, and validate whether the tests encode a fault or represent the real expected behavior. Our goal is to provide insights into how well developers perform in this classification task.
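To make the problem concrete, the following is a minimal sketch in Python. It is not taken from the study: the function `absolute_value`, its seeded fault, and the helpers `capture_expectations` and `classify` are hypothetical names introduced only for illustration. The sketch mimics a generator that records observed outputs as expected values, then labels each recorded expectation by comparing it with the real expected behavior (here assumed to be Python's built-in `abs`).

```python
from typing import Callable, Dict, Iterable


def absolute_value(x: int) -> int:
    """Hypothetical unit under test; the intended behavior is |x|."""
    if x < -1:  # seeded fault (assumption): the condition should be x < 0
        return -x
    return x


def capture_expectations(
    func: Callable[[int], int], inputs: Iterable[int]
) -> Dict[int, int]:
    """Mimic a white-box generator: record observed outputs as expected values."""
    return {x: func(x) for x in inputs}


def classify(
    expectations: Dict[int, int], spec: Callable[[int], int]
) -> Dict[int, str]:
    """Label each recorded expectation against the real expected behavior."""
    return {
        x: "ok" if observed == spec(x) else "wrong"
        for x, observed in expectations.items()
    }


# The recorded expectation for -1 passes against the implementation,
# yet it encodes the fault rather than the intended behavior.
generated = capture_expectations(absolute_value, [-5, -1, 0, 3])
labels = classify(generated, abs)
```

In this sketch the `classify` step has access to a reference specification. In the setting described above, that role falls to the developer, who must know the real expected behavior to spot the encoded fault.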
We designed an exploratory study to investigate the performance of developers. We also conducted an internal replication to increase the validity of the results. The two studies were carried out in a laboratory setting with 106 graduate students altogether. The tests were generated for four open-source projects. The results were analyzed quantitatively (binary classification metrics and timing measurements) and qualitatively (by observing and coding the activities of participants from screen captures and detailed logs). The results showed that participants tend to misclassify tests of both kinds, those encoding expected behavior and those encoding faulty behavior (with a median misclassification rate of 20%). The time required to classify one test varied broadly, with an average of 2 minutes. This classification task is an essential step in white-box test generation that notably affects the real fault detection capability of such tools.
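As a rough illustration of the quantitative side, the sketch below computes a per-participant misclassification rate, the median across participants, and confusion-matrix counts. It is an assumption-laden example, not the study's analysis script: the data are made up, the names `misclassification_rate` and `confusion_counts` are hypothetical, and `True` is assumed to mean "the test encodes a fault".

```python
from statistics import median
from typing import Dict, List


def misclassification_rate(truth: List[bool], answers: List[bool]) -> float:
    """Fraction of tests whose label differs from the ground truth."""
    wrong = sum(1 for t, a in zip(truth, answers) if t != a)
    return wrong / len(truth)


def confusion_counts(truth: List[bool], answers: List[bool]) -> Dict[str, int]:
    """Binary confusion matrix; the positive class is 'test encodes a fault'."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for t, a in zip(truth, answers):
        if t and a:
            counts["tp"] += 1
        elif not t and a:
            counts["fp"] += 1
        elif not t and not a:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


# Made-up ground truth for four tests and answers from three participants.
truth = [True, False, True, False]
participants = {
    "p1": [True, False, True, False],
    "p2": [True, True, True, False],
    "p3": [False, False, True, True],
}

rates = {
    name: misclassification_rate(truth, answers)
    for name, answers in participants.items()
}
median_rate = median(rates.values())
p3_counts = confusion_counts(truth, participants["p3"])
```

Keeping false positives and false negatives separate matters here, because a correct test flagged as faulty and a fault-encoding test accepted as correct are different kinds of error.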
We recommended a conceptual framework to describe the classification task and suggested taking this problem into account when using or evaluating white-box test generators.
Figure: Example of participants' classification performance. Rows represent the tests to classify, and columns denote each participant's results. (results.png)
Tue 13 Apr (displayed time zone: Brasilia, Distrito Federal, Brazil)
15:00 - 16:30
|Adaptive metamorphic testing with contextual bandits|
Journal-First Papers
|Classifying generated white-box tests: an exploratory study|
Journal-First Papers
|Hansie: Hybrid and Consensus Regression Test Prioritization|
Journal-First Papers