Classifying generated white-box tests: an exploratory study (ICST 2021 - Journal-First Papers)

Mon 12 - Fri 16 April 2021

Who

Dávid Honfi, Zoltán Micskei

Track

ICST 2021 Journal-First Papers

Time Zone

Times are shown in (GMT-03:00) Brasilia, Distrito Federal, Brazil.

When

Tue 13 Apr 2021 15:30 - 16:00 at Boa Viagem - Journal First I Chair(s): Heike Wehrheim

Abstract

White-box test generation analyzes the code of the system under test, selects relevant test inputs, and captures the observed behavior of the system as expected values in the tests. However, if there is a fault in the implementation, this fault could get encoded in the assertions (expectations) of the tests. The fault is only recognized if the developer, who is using test generation, is also aware of the real expected behavior. Otherwise, the fault remains silent both in the test and in the implementation. A common assumption is that developers using white-box test generation techniques need to inspect the generated tests and their assertions, and to validate whether the tests encode any fault or represent the real expected behavior. Our goal is to provide insights about how well developers perform in this classification task.
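To make the problem concrete, here is a minimal hypothetical sketch in Python. It is not taken from the study: the function `abs_diff`, its seeded fault, and both generated-style tests are invented for illustration. It assumes a generator that records whatever the code returns and writes that value into the assertion.

```python
def abs_diff(a: int, b: int) -> int:
    """Intended behavior: return |a - b|. (Hypothetical function.)"""
    if a >= b:
        return a - b
    # Seeded fault: this branch should return b - a.
    return a - b


def generated_test_expected_behavior() -> None:
    # The generator observed 2 and asserted 2.
    # This matches the intended behavior.
    assert abs_diff(5, 3) == 2


def generated_test_encodes_fault() -> None:
    # The generator observed -4 and asserted -4.
    # The test passes, although the intended result is 4.
    assert abs_diff(3, 7) == -4


# Both tests pass, so a passing run alone does not reveal the fault.
generated_test_expected_behavior()
generated_test_encodes_fault()
```

Both tests pass against the faulty implementation. Only a developer who knows that `abs_diff(3, 7)` should be 4 can classify the second test as encoding a fault, which is the classification task the abstract describes.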

We designed an exploratory study to investigate the performance of developers. We also conducted an internal replication to increase the validity of the results. The two studies were carried out in a laboratory setting with 106 graduate students altogether. The tests were generated for four open-source projects. The results were analyzed quantitatively (binary classification metrics and timing measurements) and qualitatively (by observing and coding the activities of participants from screen captures and detailed logs). The results showed that participants tend to incorrectly classify tests encoding both expected and faulty behavior (with a median misclassification rate of 20%). The time required to classify one test varied broadly, with an average of 2 min. This classification task is an essential step in white-box test generation that notably affects the real fault detection capability of such tools.
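As a rough illustration of the quantitative side, the sketch below computes binary classification counts and a median misclassification rate in Python. Everything in it is an assumption made for illustration: the data is made up and is not the study's data, the helper names are hypothetical, and the choice of "test encodes a fault" as the positive class is ours and may differ from the paper's definition.

```python
from statistics import median


def confusion_counts(truth, answers):
    """Return (tp, tn, fp, fn); True means 'test encodes a fault' (assumed)."""
    pairs = list(zip(truth, answers))
    tp = sum(t and a for t, a in pairs)
    tn = sum((not t) and (not a) for t, a in pairs)
    fp = sum((not t) and a for t, a in pairs)
    fn = sum(t and (not a) for t, a in pairs)
    return tp, tn, fp, fn


def misclassification_rate(truth, answers):
    """Share of tests that a participant classified incorrectly."""
    tp, tn, fp, fn = confusion_counts(truth, answers)
    return (fp + fn) / (tp + tn + fp + fn)


# Made-up ground truth for four tests and answers from three participants.
truth = [True, False, True, False]
answers_by_participant = {
    "P1": [True, False, True, False],
    "P2": [True, True, False, False],
    "P3": [False, False, True, False],
}

rates = [
    misclassification_rate(truth, answers)
    for answers in answers_by_participant.values()
]
median_rate = median(rates)
```

With this invented data, the per-participant rates are 0.0, 0.5, and 0.25, so `median_rate` is 0.25. The median is used here because the abstract reports a median rate; other summary statistics would be computed the same way from `rates`.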

We recommended a conceptual framework to describe the classification task and suggested taking this problem into account when using or evaluating white-box test generators.

Link to Publication

https://link.springer.com/article/10.1007%2Fs11219-019-09446-5

DOI

https://doi.org/10.1007/s11219-019-09446-5

File attachments

Example of participants' classification performance. Rows represent the tests to classify, and columns denote each participant's results. (results.png, 50 KiB)

Dávid Honfi

Zoltán Micskei

Budapest University of Technology and Economics

Hungary