Empirical results on the study of software vulnerabilities

Compared with benchmarks consisting of real-world programs, artificial benchmarks have two advantages. On the one hand, artificial benchmarks can provide ground truth of vulnerabilities, which is essential for the evaluation of vulnerability coverage.

On the other hand, artificial benchmarks can synthesize diversity around the vulnerabilities, and such diversity helps evaluate the generality of vulnerability detection techniques.

Despite the above advantages, tools that perform well on artificial benchmarks may not perform well on real-world programs, as reported, for example, by Wang et al. As a result, researchers hesitate to adopt artificial vulnerability benchmarks, concerned that these benchmarks insufficiently represent reality and that evaluations using them are therefore unreliable or even misleading. Unfortunately, the literature lacks research addressing such concerns.

In this work, we aim to understand the validity of artificial benchmarks. Our methodology is to compare the mainstream benchmarks of artificial vulnerabilities with real-world vulnerabilities, focusing on the following questions: Q1. How do artificial vulnerabilities differ from real-world vulnerabilities?

Q2. Can current artificial vulnerabilities sufficiently mirror reality, and what can we find by modifying the artificial benchmarks according to what we observe in Q1? Q3. What improvements can we make towards more realistic artificial vulnerability benchmarks? To answer these questions, we develop a general model that captures the essential properties of vulnerabilities. This model describes the properties of memory corruption vulnerabilities, together with how those properties influence the evaluation of vulnerability detection tools, as we detail in Section III.

Following this model, we carry out our experiments and summarize answers to the above three questions. In short, our study reveals that while artificial benchmarks attempt to approach the real world, they still differ from reality. For Q1, we found that the artificial benchmarks created by injecting bugs do not capture the key properties of real-world vulnerabilities in terms of bug requirements and vulnerability types.

For Q2, we found that the artificial bugs cannot mirror reality well because they may not fairly reflect the program-state coverage of a vulnerability detection tool. For Q3, we propose several strategies, based on our analyses and experiments, for making artificial benchmarks more realistic. In summary, this paper makes the following contributions. First, to the best of our knowledge, we perform the first in-depth empirical study of benchmarks of artificial vulnerabilities, with manual verification and confirmation.

Our large-scale analysis covers the artificial bugs in the studied benchmarks and 80 real-world vulnerabilities, which, to our knowledge, is the most comprehensive comparison between artificial and real-world vulnerabilities.

Second, we develop a general model to describe the essential properties of software vulnerabilities, together with how the bug requirements influence the evaluation of vulnerability detection tools.

Third, our results provide quantitative evidence on the differences between artificial and real-world vulnerabilities with respect to several properties. We also identify how these properties influence the evaluation of vulnerability detection techniques.

Fourth, we modify the benchmarks to make them more realistic according to what we have observed on control flow and data flow, and we propose further improvements toward making artificial benchmarks more realistic.

In this section, we start by introducing the background of the security vulnerabilities and benchmarks we study; we then describe our research goals. A memory corruption vulnerability is a security vulnerability that allows attackers to manipulate in-memory content to crash a program or obtain unauthorized access to a system.
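To make the notion concrete, the following short C fragment (hypothetical, written for illustration only) contains a classic stack-based buffer overflow of the kind this definition covers; the function name and buffer size are ours.

#include <string.h>

/* Hypothetical example of memory corruption: attacker-controlled input is
 * copied into a fixed-size stack buffer without a bounds check. Any input
 * of 16 bytes or more overwrites adjacent stack memory, which can crash
 * the program or let an attacker hijack its control flow. */
void parse_name(const char *input)
{
    char name[16];
    strcpy(name, input);   /* no length check: overflow for long inputs */
    /* ... use name ... */
}

A bounds-checked copy (for example, an explicit length check before copying) would remove the corruption.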

In spite of decades of research in bug detection tools, there is a surprising dearth of ground-truth corpora that can be used to evaluate the efficacy of such tools. The lack of ground-truth datasets means that it is difficult to perform large-scale studies of bug discovery.

To comprehensively measure the utility and understand the limitations of vulnerability detection techniques, the community has been developing artificial benchmarks of vulnerabilities. In recent years, three widely used artificial vulnerability databases have sought to address the need for ground-truth benchmarks, and they fall into two categories. LAVA-M and Rode0day inject large numbers of bugs into program source code automatically, and the injected bugs are structurally similar.

In this part, we briefly provide background information on the artificial vulnerabilities before introducing our study. LAVA currently concentrates on injecting buffer-overflow bugs and produces corresponding proof-of-concept inputs as their triggers [2]. Figure 1 presents an overview of LAVA. Given the source code and a series of input files for a program, LAVA finds unused portions of the input by running the program with dynamic taint analysis [7] on each input.

These parts of the input bytes are dead, i.e., they do not affect the program's behavior; LAVA refers to them as DUAs (dead, uncomplicated, and available data). Since DUAs are often a direct copy of input bytes and can be set to any chosen value without sending the program along a different path, LAVA regards them as candidate triggers for memory corruption. LAVA then finds potential attack points (ATPs): program instructions involving a memory read or write whose pointer can be modified and that occur temporally after a DUA along the program execution. Finally, LAVA injects code that, at the ATP, checks whether the DUA bytes match a chosen magic value and, if so, uses them to corrupt the pointer used at that point.
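The C sketch below illustrates, in simplified and hypothetical form, the shape of such an injected bug: dead input bytes (the DUA) are compared against a magic constant, and only when they match is the pointer used at the attack point corrupted. The names, the constant, and the layout are ours for illustration and are not LAVA's actual generated code.

#include <stdint.h>
#include <string.h>

/* Simplified, illustrative sketch of a LAVA-style injected bug. */
void process(const uint8_t *input, size_t len, char *out)
{
    if (len == 0)
        return;

    uint32_t dua = 0;
    if (len >= 4)
        memcpy(&dua, input, 4);        /* DUA: dead input bytes copied verbatim */

    size_t offset = 0;
    if (dua == 0x6c617661u)            /* injected guard: magic-value comparison */
        offset += dua;                 /* attacker-chosen value corrupts the offset */

    out[offset] = (char)input[0];      /* ATP: out-of-bounds write when the guard fires */
}

Because the trigger is a single equality check on raw input bytes, such a bug is comparatively shallow to reach once the right bytes are found, which is one way injected bugs can differ from many real-world vulnerabilities.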

When that condition is met, a buffer overflow occurs. LAVA-M is a benchmark of artificial software vulnerabilities built by injecting more than one bug at a time into the source code, and it is widely used to evaluate vulnerability detection tools [8, 9, 10, 11]. Rode0day [3] is a recurring bug-finding contest that also serves as a benchmark for evaluating vulnerability discovery techniques.

In the Rode0day contest, the successful detection of a memory-error vulnerability is demonstrated in the form of a simple crash. Evolved from LAVA, Rode0day uses automated bug insertion to generate new buggy programs, distributed as standard Linux ELF binaries, to help evaluate the performance of vulnerability detection tools.

CGC [4], the Cyber Grand Challenge, a competition among autonomous vulnerability analysis systems, is a widely adopted benchmark in recent studies [11, 9, 10, 12]. The competition resulted in a collection of small programs with known vulnerabilities and triggering inputs. Each CGC challenge contains one or more bugs, deliberately devised by security experts to evaluate vulnerability detection tools.

While existing works focus on proposing new benchmarks that can be used to evaluate the efficacy of vulnerability detection tools [13, 2, 3, 4, 14], there is a lack of systematic understanding of whether artificial vulnerabilities can represent real-world vulnerabilities. Although other works suggest that some widely used benchmarks, such as LAVA-M, Rode0day, and CGC, differ from real-world bugs in several ways [13, 15, 14], they did not provide an in-depth analysis of the benchmarks.

Moreover, there is an absence of analysis of how the different properties of benchmark vulnerabilities reflect the different features of vulnerability detection tools. Analyzing the differences between artificial and real-world vulnerabilities can also help create more realistic benchmarks for tool evaluation. Thus, we propose a study to provide a first in-depth understanding of the similarities and differences between artificial and real-world vulnerabilities while exploring solutions to make artificial bugs more realistic.

In this paper, we focus on memory corruption vulnerabilities, which constitute the majority of software security defects [16]. Instead of discussing the vulnerabilities case by case, we propose a general model that describes memory corruption vulnerabilities by summarizing the requirements for triggering them.
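As a hypothetical illustration of such trigger requirements, the C fragment below corrupts memory only when a control-flow requirement (the parser takes the extended-record branch) and a data-flow requirement (an attacker-controlled length field exceeds the allocation) hold at the same time; all names and sizes are ours.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record parser used to illustrate trigger requirements. */
struct record {
    uint8_t        type;     /* selects the parsing path (control flow)      */
    uint16_t       len;      /* attacker-controlled length field (data flow) */
    const uint8_t *payload;
};

void handle_record(const struct record *r)
{
    if (r->type != 0x02)          /* requirement 1: reach the extended branch */
        return;

    uint8_t *copy = malloc(64);
    if (copy == NULL)
        return;

    /* requirement 2: r->len is trusted; any value above 64 overflows 'copy' */
    memcpy(copy, r->payload, r->len);

    free(copy);
}

A detection tool must both cover the right path and synthesize a suitable length value, which is the kind of relation between bug components and tool capabilities that the model is intended to capture.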

Because the goal of the benchmarks is to evaluate vulnerability detection tools, we also propose a model that represents the different features of such tools and how they are affected by the components of the vulnerability model; that is, we extend the vulnerability model with its relations to the properties of the vulnerability detection tool.

Figure 2 depicts our model, in which the cyan boxes represent the main components of a bug and the gray boxes are the properties of the vulnerability detection tools. This model provides a systematic basis for our experiments and analyses. Here we discuss the model in detail.

Based on the results of the main Cox regression model, this study suggests that vulnerability-related attributes can be utilized to estimate the likelihood of timely patch releases for zero-day vulnerabilities.

More specifically, the proposed survival analysis framework can measure the likelihood of timely patch release through the following factors: attack complexity, privileges required, scope, confidentiality impact, and the number of affected vendors, products, and versions. The results also report how each covariate positively or negatively impacts patch release time. Surprisingly, vulnerabilities with high attack complexity seem to have an increased likelihood of timely patch releases.

This implies that zero-day vulnerabilities that require more effort from the attackers are being patched sooner than less complex ones.

This counterintuitive result most likely stems from the limited data currently available; as more data become available, we believe the estimate will become less biased. Zero-day vulnerabilities that result in scope change and that affect a larger number of vendors, products, and versions expectedly pose a greater risk, and vendors therefore prioritize these patch releases.

More specifically, the results imply that vendors are more alert to vulnerabilities that may impact resources beyond the scope of the initial target and do take into consideration the count of entities affected by the zero-day vulnerabilities. In contrast, zero-day vulnerabilities that require privileges decrease the likelihood of timely patch releases.

A possible explanation is that zero-day vulnerabilities that require greater effort from attackers are considered less risky by vendors; therefore, such vulnerabilities are not assigned high priority. Similarly, the results reveal that zero-day vulnerabilities that impact confidentiality are not patched on time.

While our model does not examine the reasons behind the delays, future studies may benefit from reaching out to specific vendors and examining the relationship between patch development and patch release time. Overall, these results alert IT vendors to the importance of prioritizing fixes for zero-day vulnerabilities with respect to these attributes. Furthermore, they enable vendors to foresee delays in patch release times when dealing with certain zero-day vulnerabilities.

Statistical tests show that patch release times vary across products and vulnerability types. For instance, findings showed that enterprise software has the highest median patch release time.

This may be attributed to the growing complexity [49] and challenges of fixing enterprise software. It may also be attributed to the amount of effort and time needed to test enterprise systems. In any case, this is a finding that deserves more in-depth examination.

Personal software, on the other hand, had the lowest median patch release time. Future research can perform detailed analysis to learn about the patching practices of IT vendors. Another interesting result is the outcome of the sub-analysis of vulnerability types. For example, one CWE type had the lowest median patch release time (71 days), while another CWE type reported the highest median time.

To gain additional insights, further research about each vulnerability type is needed. Also, this type of vulnerability affects the integrity, availability, and access control of the vulnerable system.

This information might be useful to IT vendors who may want to revisit their patch development and SDLC processes to try to mitigate future vulnerabilities or speed up patch development time.

This study advances our knowledge and understanding of zero-day vulnerabilities and the timeline of patch releases. It explores new vulnerability-related attributes and measures their impact on patch release timing. In recent years, researchers have called for more empirical studies in the information security field [50] that can further our understanding of security patch release [2].

While previous studies have added to our knowledge of vulnerability patching, further investigation was needed to overcome their limitations and analyze the patching timelines of zero-day vulnerabilities. This study attempted to bridge this gap by examining the impact of other, as-yet unexplored factors on the patch release time of zero-day vulnerabilities using survival analysis.

This is the main contribution of this study, as it captures relevant information about vulnerability-related factors and the applicability of the survival analysis method. In addition, to our knowledge, this is the first study to use zero-day vulnerabilities in assessing patch release timing.

As zero-day attacks continue to affect firms, our results suggest several important messages. First, we encourage organizations to strengthen their prioritization when dealing with patch development. For instance, vendors of enterprise software are urged to be more vigilant about delays in patch release compared to other types of software. In addition, as firms prioritize which zero-day vulnerability needs to be fixed first, our results can aid the decision process using vulnerability type.

For example, given that a particular CWE vulnerability type takes the longest time to patch, firms can take this into consideration in the prioritization process. Second, our results can assist decision-makers in allocating organizational resources when patching specific types of zero-day vulnerabilities.

More precisely, since vulnerabilities that require privileges and impact confidentiality take longer to patch, security managers may alter their patching practices accordingly and assign their resources to focus on the most impactful issues.

Third, for firms that rely on existing heuristics for prioritizing the severity of zero-day vulnerabilities, our model can be used as an adjunct to such methods. Moreover, if technology vendors are not using existing vulnerability risk triage methods, our model can help them make informed decisions about the time needed to fix zero-day vulnerabilities. Fourth, since the proposed model and the analysis rely on publicly available data, they can be implemented by any firm.

This should be welcome news to security managers, who can replicate the results and build a vendor-specific survival model using openly accessible zero-day vulnerability data along with the vulnerability attributes available through NVD. Such findings can benefit managers who want to play an integral role in information security within organizations [51].

In this section, we examine possible threats to the validity of our results and categorize our concerns into three groups, namely internal, external, and conclusion validity. In this study, the proposed model considered the effects of vulnerability-related attributes and control variables on patch release time.

However, this does not imply a cause-effect relationship. As mentioned in the introduction, this is an exploratory study that examines the effects of selected covariates. Nevertheless, we provide explanations supporting the possibility that the modeled factors affect patch release time.

Due to limitations of the dataset, it was not feasible to capture such information and include it in the model. Future research may investigate the effects of the aforementioned variables. Another threat is that the zero-day vulnerabilities reported by ZDI may already be known to the affected IT vendor.

This may affect the accuracy of measuring the outcome variable, patch release time. However, given the monetary value awarded to security researchers, we assume that ZDI researches and vets each submission to ensure the originality of the zero-day vulnerabilities. Finally, a threat to internal validity is that this study did not consider the effect of prior zero-day vulnerabilities on patch release time. In other words, repeated disclosure of zero-day vulnerabilities may impact the speed of patch release by the affected firms.

Future studies may benefit from running Cox regression models with a repeated events model to check whether results are sensitive to the exclusion of the subsequent zero-day vulnerability disclosures.

One threat is the selection of the dataset. In this study, the zero-day vulnerability dataset was captured from a single source, namely ZDI. This limits the sample size and poses a threat to external validity; thus, we cannot generalize the findings of this study to all zero-day vulnerabilities.

To improve the generalizability and significance of the observations, it would be ideal to capture data from different sources such as Google Project Zero, iDefense, the Vulnerability Contributor Program (VCP), and the dark web. However, we argue that ZDI was the only source that had a high level of transparency and provided the timeline information needed for this study; thus, it was deemed appropriate for the project. Another threat to external validity is the classification of the independent and control variables.

Specifically, in the sub-analyses, we grouped the categorical variables into multiple levels to mitigate issues related to low sample size per level. Limiting the constructs to specific levels implies that not all levels were represented in the analysis and results are limited to the classifications. However, as the ZDI database is constantly being updated with new zero-day vulnerabilities, this threat can be mitigated by capturing more data.

One threat to conclusion validity is related to the misuse of statistical assumptions or the lack of statistical calculations, which may result in incorrect conclusions. To avoid such threats, we first chose Cox regression, which does not require specification of the probability distribution of the survival times.
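For reference, a standard Cox proportional hazards formulation of the kind applied in such an analysis can be written as follows (the notation is ours; the covariates x_1, ..., x_p stand for the attributes listed earlier, such as attack complexity, privileges required, scope, confidentiality impact, and the number of affected vendors, products, and versions):

h(t \mid \mathbf{x}) = h_0(t)\, \exp(\beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p)

Here h_0(t) is the baseline hazard of a patch being released at time t and is left completely unspecified, which is why no survival-time distribution has to be assumed; a positive coefficient \beta_i means the corresponding attribute increases the hazard and thus shortens the expected patch release time.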

Moreover, the Cox regression models were fitted with stringent limits on significance levels.

In this work, a survival model was put forward to measure the patch release time of zero-day vulnerabilities. Using a vulnerability dataset captured from ZDI, we employed the Cox regression method, conducted sub-analyses using the Kaplan-Meier (K-M) technique, and assessed the robustness of the results. Based on the fit statistics, the results showed that survival analysis is significant and useful for examining patch release timing.
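For completeness, the Kaplan-Meier estimator underlying the median-time sub-analyses is (again in our notation):

\hat{S}(t) = \prod_{t_i \le t} \left( 1 - \frac{d_i}{n_i} \right)

where the t_i are the observed patch release times, d_i is the number of vulnerabilities patched at t_i, and n_i is the number still unpatched (at risk) just before t_i; the reported median patch release time is the smallest t at which \hat{S}(t) drops to 0.5 or below.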

We found that IT vendors are fast in releasing timely patches for zero-day vulnerabilities that result in scope change and affect more vendors, products, and versions, while vulnerabilities that require privileges and impact confidentiality are less likely to be patched on time. Results also revealed that personal software has the lowest median patch release time, whereas enterprise software has the highest median.

Finally, the findings showed that one CWE vulnerability type has the lowest median patch release time, while another has the highest median. Although this study makes a number of contributions to the vulnerability patching field, it has some limitations. The first limitation is the sample data.

Our study analyzed reported zero-day vulnerabilities from a single source, namely ZDI. While there are other reputable bug bounty programs, we limited our sample to ZDI given the transparency level, open access, and the quality of their data.

Future research may benefit from exploring data from other bug bounty programs. A second limitation is that our analysis relied on data that belonged to the "white" market of bug bounty programs.

Therefore, our findings did not consider other disclosure sources that belong to "gray" and "black" markets. Given the difficulties in gaining access to such markets and the sensitivity of the subject matter, we believe that capturing this type of data may introduce issues related to integrity and quality of the data.

Future research directions include improving the accuracy of the proposed model. This is possible by considering additional factors such as the estimated development cost of security patches, the original source of disclosure, and firm-level factors.

A third limitation is that most product types in our sample are closed-source software. Thus, our results may not be applicable to open-source projects. As more data become available, we recommend conducting survival analysis to compare whether patch release time differs between open- and closed-source projects. Fourth, this study only considered eight vulnerability characteristics: attack vector, attack complexity, privileges required, user interaction, scope, and the confidentiality, integrity, and availability (CIA) impacts.

The choice of these characteristics was limited by the data available in the NVD. To our knowledge, NVD is the only source that provides vulnerability-related data for a large sample of vulnerabilities; therefore, it was the appropriate choice for this study.

Future studies may benefit from extracting other vulnerability-related characteristics directly from vendors [ 53 , 54 ]. Finally, future research can also benefit from fitting other decision-based models, including decision trees that were not considered in this study.

National Vulnerability Database.
An empirical analysis of software vendors' patch release behavior: impact of vulnerability disclosure. Inf Syst Res.
Does information security attack frequency increase with vulnerability disclosure? An empirical analysis. Inf Syst Front.
Information security trade-offs and optimal patching policies. Eur J Oper Res.
Constantin L. Oracle knew about currently exploited Java vulnerabilities for months, researcher says.
Menn J. Microsoft knew about a Word bug that put millions of computers at risk but waited 6 months to fix it.
Security patch management: share the burden or share the damage? Manag Sci.
A study on web security incidents in China by analyzing vulnerability disclosure platforms. Comput Secur.
Bi-criterion problem to determine optimal vulnerability discovery and patching time.
Discrete-time framework for determining optimal software release and patching time. Singapore: Springer.
Software release and patching time with warranty using change point.
Effort-based release and patching time of software with warranty using change point. Int J Perform Eng.

We experimented with various features constructed using the information available in NVD and applied various machine learning algorithms to examine the predictive power of the data. Our results show that the data in NVD generally have poor prediction capability, with the exception of a few vendors and software applications. Based on a large number of experiments and observation of the data, we suggest several reasons why the NVD data have not produced a reasonable prediction model for time to next vulnerability with our current approach.

Keywords: data mining, cyber-security, vulnerability prediction.


