Software testing is a critical aspect of ensuring software quality. Ideally, tests should produce consistent results when executed repeatedly on the same version of the software. However, certain tests exhibit non-deterministic behavior and are commonly known as "flaky tests". These tests send ambiguous signals to developers and make test results unreliable. Although the phenomenon has been recognized for decades, academic attention to test flakiness has only recently increased. This dissertation aims to advance the research in two directions. First, we focus on predicting the lifetime of a flaky test, an issue that has so far remained unaddressed in flaky-test research. Second, we question the effectiveness of previous approaches in distinguishing flaky failures from legitimate failures, using Chromium build results as our dataset. In our investigation of the historical patterns of flaky tests in Chromium, we found that 40% of flaky tests remain unresolved, while 38% are fixed within the first 15 days after their introduction. We then developed a predictive model to identify tests that are resolved quickly. Our model achieved a precision of 73% and a Matthews Correlation Coefficient (MCC) of approximately 0.39 in forecasting the lifespan class of flaky tests. Furthermore, we found that current vocabulary-based flaky test detection approaches misclassify 78% of legitimate failures as flaky failures when applied to the Chromium dataset. The results also revealed that the source code of a test alone is not a sufficient indicator for predicting flaky failures, and that execution-related features must be incorporated to achieve better performance.
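For reference, the Matthews Correlation Coefficient is the standard binary-classification metric computed from the confusion-matrix counts (true/false positives and negatives); it ranges from -1 (total disagreement) to +1 (perfect prediction), with 0 corresponding to random guessing:

\[
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)\,(TP + FN)\,(TN + FP)\,(TN + FN)}}
\]

By this measure, an MCC near 0.39 indicates a moderate positive correlation between the model's predictions and the true lifespan classes, well above chance.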