Functional test data generation for Simulink-like models

Araujo, Rodrigo Fraxino; Delamaro, Marcio Eduardo; Maldonado, Jose Carlos

doi:10.1007/s13173-013-0104-z

Original Paper
Open access
Published: 20 March 2013


Journal of the Brazilian Computer Society volume 19, pages 325–339 (2013)


Abstract

Embedded systems are increasingly present in many electronic devices and are often related to critical applications. Therefore, the need for a well planned and executed testing procedure is even higher. We intend to contribute in this area by presenting an experimental evaluation of the pairwise combinatorial approach as a technique for test data generation applied specifically to Simulink-like models. In particular, we have applied our strategy to the generated source code of several models. Furthermore, a testing tool was developed to assist in the test data generation process. We show that the proposed approach has no statistically significant advantage over random generation of test data, but when used together they yield better results. The feasibility shown by the experimental results indicates that efforts can be employed in order to obtain a testing strategy integrated within a testing environment.

1 Introduction

As a result of technological advances, more and more mechanical systems are being replaced by electromechanical ones. This can be noted in the growing number of embedded systems present in cars, aircraft, trains and electronic devices. Many of these systems are critical and cannot tolerate failures. Therefore, the testing of embedded systems is a very important task [23].

The National Institute of Standards and Technology estimates that, in the United States, the cost for insufficiency in the software testing process in 2000 was around 59 billion dollars. An example is what frequently happens with automotive models. Recall is a common practice to correct faults that have been introduced during the manufacturing of embedded systems [6].

Due to the complexity of systems and the ever-increasing pressure to shorten time-to-market, the testing task has become even more challenging. A common problem is that the testing stage is performed at the end of the project development life cycle. Thus, when faults are found, the cost to fix them is much higher [34].

Furthermore, although the implementation of automated testing activity is a common practice, the creation of test sets is still performed manually in many cases. Embedded systems, with increasingly sophisticated software, may have a high number of combinations of inputs and events, which can result in many different outputs, leading to a possible failure to cover all outcomes in a manual testing activity.

One way to lessen the aforementioned problem is to use precise models that support the system development life cycle. Models are concise and understandable abstractions that capture the decisions about the functions of a system, and their semantics are derived from the concepts and theories of a specific domain [32].

In this scenario, platforms such as ScicosLab/Scicos [24] and Matlab/Simulink [44] are widely used to design and simulate embedded system models. One of their advantages is the application analysis at different levels of abstraction. Another benefit is the automatic code generation, which reduces development costs and programming errors.

To ensure the reliability of this kind of system, the industry has been investing in an approach known as model based testing [8]. In this approach, it is easier to automate the testing activity, which includes an automatic generation of test sets. The testing activity takes place at a more abstract level, even before the software is coded. This leads to a more efficient process with significant cost reduction and a final product with higher quality.

To address this issue, we conducted experimental studies to assess the adequacy of test data generated by two different approaches, pairwise generation [19] and random generation. The adequacy of the test data was assessed by applying mutation testing to a small set of programs and comparing the mutation score achieved by each test data generation approach.

To conduct these experiments, we introduce a tool called TeTooDS (Testing Tool for Embedded Systems) [3], which assists in test data generation by applying the pairwise approach to embedded system models. This approach ensures that every pair of values belonging to two different parameters is present in at least one test data item [19].

Kuhn et al. [26] show that the combinatorial design approach for automatic test data generation is quite effective in many situations and that the pairwise testing is fit for most general applications. Thereby, we considered that efforts could be employed by applying this approach in the context of embedded systems models. We are also not aware of other studies that focus on a well planned and documented experimental evaluation of test data generation methods for Simulink-like models.

This paper is structured as follows. Section 2 presents some characteristics of the environments for embedded systems development and simulation. In Sect. 3, testing techniques for Simulink-like models are described. The pairwise approach and the testing tool we developed to assist in the testing of Simulink-like models are detailed in Sect. 4. Section 5 presents the experimental results of the test data generation methods applied to a small set of models. The concluding remarks, possible extensions for the testing tool and suggestions for further work are presented in Sect. 6.

2 Development and simulation environments

Embedded devices account for an increasingly large share of several technological areas. In embedded systems development, it is common to use simulation prior to hardware and software integration. This occurs mainly because the hardware may not be available for testing, or may not even exist. Simulation can also avoid dangerous situations, such as equipment damage or risks to human safety [21, 23].

In this context, [2] emphasizes that a dynamic system consists of a set of possible states, together with a rule that determines the present state from a past state. According to [25], dynamic systems relate model-system states to earlier states. Classical physics, for example, predicts continuous changes of quantities such as position, velocity, or voltage with continuous time.

Simulink [44], a software that works with Matlab, and Scicos [24], a software that works with ScicosLab, are alternatives for the development and simulation of dynamic systems. One of their advantages is the application analysis possibility in several levels of abstraction. Another benefit is the automatic generation of code, thereby reducing development costs and programming errors.

Such systems are mathematically represented by systems of equations: differential equations in the case of continuous time systems, difference equations in the case of discrete time systems, and a mix of both in the case of hybrid systems. The simulation of this type of system is based on numerical algorithms, whereby the solution of a system of equations, i.e., the semantics of a Simulink model, is given by the sequence of values that represents the temporal functions [11].

2.1 Simulink

Simulink is software for modeling and simulating embedded systems or, more precisely, dynamic systems. It provides a common environment for sharing data, designs and specifications, making it possible to develop more reliable critical systems and to generate code with security [43].

Large organizations worldwide make use of Simulink, notably in the aerospace industry. The Airbus A380, the Mars Exploration Rover and the F-35 joint strike fighter were modeled using Simulink [44]. Embraer, a leading Brazilian aviation company, has used Simulink extensively, as shown, for instance, in Cavalcanti and Papini [10].

These models are based on block diagrams. The blocks include a library of sinks, sources, connectors and linear and non-linear components. Moreover, it is possible to create blocks for specific purposes. Models can be hierarchical, and it is possible to analyze the whole system at a high level or to detail each one of the blocks, varying the level of abstraction. This facilitates the understanding of how the model is organized and how the components interact with each other [36, 43].

2.2 Scicos

Scicos is also software for modeling and simulating embedded systems [24]. Unlike Simulink, Scicos is free software. For modeling, it offers a modular way to build complete embedded systems through a block diagram editor. It is possible to build a library of reusable modules (blocks) that can be used in different systems and projects [9]. A large number of blocks that perform basic operations are already present on the platform, so it is almost never necessary to build a block from scratch.

Moreover, as with Simulink, Scicos provides features that help in the optimization, validation and generation of C code for a particular model. For example, an application can have its development cost decreased by traditional optimization techniques, validated by simulation and its C code generated for specific hardware.

2.3 Example

Simulink-like models are composed of blocks connected by lines (signals). These blocks can be elementary, containing simple operations (arithmetic, for instance), or subsystems, which contain a composition of elementary blocks. It is worth emphasizing the Integrator and the UnitDelay blocks, which introduce the notion of time. When an Integrator is used, the model is said to be of continuous time, and the operation associated with the block is a mathematical integration over time. A model that uses a UnitDelay is said to be of discrete time. A mix of both produces a hybrid model, defined as a data flow where the signals are continuous or discrete time functions.

Figure 1 contains an example of a Simulink-like model that is divided into three subsystems [11]. A continuous time subsystem is present in Fig. 1a and represents a braking pedal as a mass-spring-damper mechanical system. A discrete time subsystem is present in Fig. 1b and is responsible for detecting when the pressing force is greater than a given threshold to activate the brake. Fig. 1c presents the main system, a composition of both subsystems, containing an input, the force, and an output, the detection result.
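As an illustration of how such a hybrid model behaves, the following Python sketch simulates a mass-spring-damper pedal with explicit Euler integration (playing the role of the Integrator blocks) and a threshold detector that remembers its previous comparison (playing the role of a UnitDelay). The function name `simulate_brake`, the physical constants, the step size, and the rule that detection requires two consecutive samples above the threshold are all assumptions made for this sketch; they are not taken from the model in Fig. 1.

```python
def simulate_brake(force, dt=0.001, steps=2000, m=1.0, c=8.0, k=100.0, threshold=0.05):
    """Simulate pedal displacement under a constant force and detect braking.

    All parameter values are hypothetical and chosen only for illustration.
    """
    x, v = 0.0, 0.0        # continuous states: pedal position and velocity
    prev_above = False     # UnitDelay-like state: comparison result of the previous step
    detections = []
    for _ in range(steps):
        # Continuous part: m*x'' + c*x' + k*x = force, integrated with explicit Euler.
        a = (force - c * v - k * x) / m
        x, v = x + dt * v, v + dt * a
        # Discrete part: report a detection only if the displacement exceeded the
        # threshold in both the current and the previous sample.
        above = x > threshold
        detections.append(above and prev_above)
        prev_above = above
    return x, detections


if __name__ == "__main__":
    final_x, detected = simulate_brake(10.0)
    print(final_x, detected[-1])
```

With these assumed constants the displacement settles near force / k, so a force of 10 ends above the threshold while a force of 1 never reaches it.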

3 Embedded system testing

Embedded system testing is considered a vital task, especially because many systems are critical. However, some factors may hamper the testing of such systems. If a specific hardware is required, for instance, this test can be prevented by the equipment cost. Another concern is related to the possibility of the hardware being damaged, since the system may not present the expected behavior.

Developers typically use simulation prior to the integration of hardware and software. However, the simulation may also present some problems, e.g., a generation of potentially very large data sets, a high consumption of time, error prone outputs and temporal restrictions [22].

We conducted a systematic review that aimed to identify methods for the automatic generation of test data from embedded system models. Among the 29 selected papers, 15 handle specifically Simulink-like models [4]. We could notice that most of the papers present or make use of commercial testing tools and that there are no studies regarding the test of Scicos models, probably because Simulink is more consolidated within industry.

Concerning specifically the testing of Simulink-like models, [22] focuses on the use of functional characteristics to test this type of model. A great deal of the studies [1, 5, 12, 17, 18, 37, 39, 41, 44, 46, 48] involve the application of structural techniques, a choice that can be motivated by guidelines such as DO-178B [38], which requires minimum structural coverage assurances. Mutation testing can also be used, as described by [7, 47, 48]. A more detailed description of each study is presented as follows.

Functional testing is a technique used to design test cases in which a program or system is considered as a black box: inputs are provided and the generated outputs are evaluated to verify their compliance with the specified goals. In principle, functional testing can detect all faults by submitting the program or system to all possible inputs. However, the input domain may be infinite or very large, making the time required for testing unfeasible, and thus rendering this alternative impractical [33].

Due to the possible large number of inputs and outputs, a tool that supports the automatic generation of these values can help the tester’s task. The test values must be created for each input variable on each time step throughout the simulation. In addition, to help with the test data generation, a testing tool can capture, organize and display the generated output for the tester analysis [22].

Henry et al. [21, 22] propose the use of MATT (Matlab Automated Testing Tool), a tool that focuses on real time systems testing. It helps with the generation of test data that exercise a specified system in different conditions. According to the authors, it is necessary to use system-dependent critical values, values between the minimum and maximum bounds, and input values that are outside of the specified domain. The approach they present helps to generate test data that take the system to such conditions.

Blackburn and Busser [5] developed T-VEC [42], a commercial tool that integrates development and testing based on functional requirements. It uses a requirements definition model, and traces all of the requirements to an embedded system model or code, ensuring that each specified requirement is tested. It is used by large organizations such as Lockheed Martin [42]. One of the important features of the tool is the automatic generation of test vectors. It has the ability to determine the inputs, the expected output, and a mapping of each test case to its functional requirement.

Another possibility is to use the structural technique for testing a given system, which requires the implementation of pieces or components of a program. The logical paths are tested by the selection of test data that exercise a specific condition set and pairs of definition and use of the variables [33].

In general, most of the structural criteria make use of a representation known as Control Flow Graph (CFG). A program can be decomposed into a set of disjointed blocks of commands. The execution of the first command of a block results in the execution of all other block commands. All commands in a block, except possibly the first, have one predecessor and exactly one successor, except possibly the last command [31, 35].

Zhan and Clark [46, 48] describe an alternative for test data generation using the structural technique. The solution consists of first performing a random test data generation, aiming to reduce the operation cost. Subsequently, an analysis of the paths that have already been traversed is used for the generation of the remaining test data. This proposal is based on the costs to cover each possible path. Probes are inserted in the second input of each switch block, a block that can be compared to an if-then-else statement. Thus, it is possible to analyze which values are being supplied and, for each condition identified, to associate a cost indicating how far the data is from satisfying the switch block condition.
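The idea of a cost that measures how far a probed value is from satisfying a switch condition can be sketched as a simple distance function. The sketch below is an assumption-laden illustration, not the cost function defined by Zhan and Clark: it assumes a switch that takes its "true" branch when the control signal is greater than or equal to a threshold, and the names `switch_cost` and `EPS` are hypothetical.

```python
EPS = 1e-6  # assumed small penalty used when equality is not enough to flip the branch


def switch_cost(control, threshold, want_true):
    """Return 0.0 if the desired branch is taken, otherwise a positive distance.

    Assumes the switch takes its 'true' branch when control >= threshold.
    """
    if want_true:
        # Target branch needs control >= threshold.
        return max(0.0, threshold - control)
    # Target branch needs control < threshold.
    if control < threshold:
        return 0.0
    return control - threshold + EPS


if __name__ == "__main__":
    # A search procedure would prefer inputs whose probed value yields a lower cost.
    for probed in (1.0, 3.0, 4.9, 5.0):
        print(probed, switch_cost(probed, 5.0, want_true=True))
```

A lower cost indicates test data closer to exercising the uncovered branch, which is what allows a guided search to improve on purely random generation.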

Gadkari et al. [17] present AutoMOTGen, a tool for the automatic generation of test data from Simulink models in order to test automotive controllers. The goal is to use the Simulink model, high-level requirements and test specifications to apply model checking techniques for the generation of test data. Model checking techniques aim to automatically test whether a model meets a given specification. The tool uses the SAL language to create an intermediate model by converting the Simulink model into a formal finite-state model and translating the specifications and the high-level testing requirements into formal properties expressed in linear temporal logic (LTL). This model is structured to contain hierarchical information and a mapping of the structural coverage of the Simulink model in relation to the testing requirements.

Satpathy et al. [39] propose a technique and testing tool for Simulink models called Randomized DIRECTED Testing (REDIRECT). It is based on four principles: testing with random inputs, direct traversal of a block previously reached, backtracking, and random testing based on feedback. One of the highlights of the tool is that REDIRECT uses a heuristic guided by defined patterns and can reach non-linear blocks, whose output is not proportional to their input.

Mathworks released Simulink Design Verifier [44], a tool that generates test values by the use of formal analysis techniques to achieve exhaustive evaluation of a Simulink model. The technique is based on mathematically rigorous procedures to simplify and search through the possible execution paths of a model. Its advantages include the detection of incomplete requirements and the exploration of design faults. Nevertheless, not all Simulink features are supported by the tool.

Reactis [37, 41] is a commercial testing tool that can generate test sets to exercise a Simulink model. For the generation of test data, a random generation is performed and the inputs are selected by Monte Carlo methods. The guided simulation technique is also used, in which values are analyzed and chosen in order to cover the remaining test requirements of the model. This test set covers the MC/DC (Modified Condition/Decision Coverage) criterion. The tool contains three main components: Tester, which automatically generates test sets; Simulator, which allows the visualization of the execution; and Validator, which performs a search for violations of the requirements specified by the user [12].

Beacon Tester [1] is also a commercial tool that automatically creates test vectors in order to ensure the quality and reusability of a Simulink model. The coverage obtained with the test vectors includes functional criteria, such as boundary value analysis, and structural criteria, such as all-nodes, all-edges and MC/DC.

One more commercial tool dedicated to the generation of test data for Simulink models is Safety Test Builder [18]. At first, the generation of test data is random, and then a heuristic algorithm is used. The test set generated aims to cover structural criteria such as MC/DC. Very little documentation can be found regarding Beacon Tester and Safety Test Builder.

Mutation testing can be an alternative for the testing of Simulink-like models. In the mutation criterion, typical implementation faults are used to generate testing requirements. The program being tested is altered several times, creating a set of alternative programs (mutants). The tester is responsible for choosing test data that show a difference in behavior between the original program and the mutant programs [31]. The test set quality is measured according to its ability to detect faults in the mutants [15].
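The mechanics of the mutation criterion can be illustrated with a toy program and three hand-written mutants. Everything in this sketch is hypothetical: the program `original`, the mutants, and the helper `mutation_score` are invented for illustration, and the score is computed simply as killed mutants over all mutants, without the equivalent-mutant adjustment a tool would normally apply.

```python
def original(a, b):
    # Toy program under test (hypothetical).
    return a + b if a > 0 else b


# Hand-written mutants, each with one small syntactic change.
mutants = [
    lambda a, b: a - b if a > 0 else b,  # arithmetic operator replaced
    lambda a, b: a + b if a > 1 else b,  # constant in the condition replaced
    lambda a, b: a + b if a > 0 else a,  # returned variable replaced
]


def mutation_score(program, mutant_programs, tests):
    """Fraction of mutants whose output differs from the program on some test."""
    killed = sum(
        any(mutant(*t) != program(*t) for t in tests) for mutant in mutant_programs
    )
    return killed / len(mutant_programs)


if __name__ == "__main__":
    print(mutation_score(original, mutants, [(1, 2)]))           # weaker test set
    print(mutation_score(original, mutants, [(1, 2), (-1, 4)]))  # stronger test set
```

Adding the second test input exercises the else branch and distinguishes the third mutant, which the first test set leaves alive.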

In Zhan and Clark [47, 48], a solution for the generation of test data using mutation testing is described. First, a random generation of test data is performed for a model. For the mutants that have not been distinguished from the original program, the proposal is based on the cost of propagating the faults inserted by mutation operators to the model outputs. If a test data is too weak to propagate a fault, that is, if the measurement shows that it is far from exercising the mutated part of the code and influencing the output, it receives a high cost. If a test data is good at propagating a particular fault to the output, it receives a low cost. One possible weakness of this proposal is the low number of mutation operators defined, i.e., add, multiply and assign.

Brillout et al. [7] developed a methodology to assess the correctness of Simulink models by automating the test data generation activity. Their objective is to cover the requirements imposed by mutation testing. In order to generate and optimize the test data, the approach focuses on model checking techniques. However, the authors do not present a solution for how to apply mutation testing, i.e., which mutation operators should be used to generate the testing requirements.

Generally, different techniques are complementary and should be combined in practice. In the next section, our approach for applying the pairwise testing in embedded system models is described.

4 Combinatorial testing for Simulink-like models

In this section, TeTooDS, a tool that supports the testing of Simulink-like models, is presented together with pairwise testing. The testing tool assists in the extraction of relevant information for the appropriate generation of test data using the pairwise approach. Thereby, we can achieve a reduction in the number of generated test data, lowering the computational cost and the effort required to analyze the output data.

The pairwise technique is detailed in Sect. 4.1 and TeTooDS is described in Sect. 4.2.

4.1 Pairwise testing

Functional testing considers the system as a black box from which only inputs and outputs are known. Testing criteria assist in determining inputs for testing the underlying system as well as combining them in such a way that they effectively exercise it.

Critical systems must have a very small probability of failure. Nevertheless, it is difficult to reduce the risk of unpredictable behavior of an embedded system to zero [30]. Therefore, the test data generation activity is of great importance, since the effectiveness of a test criterion depends on the selection of significant values that satisfy existing test requirements. However, the cost for the test data generation can be high and, thus, guidelines must be followed to reduce the required effort.

One possible approach is to apply pairwise testing, which can generate efficient test sets that contribute to a reduction in the cost of finding adequate test data. An input set is introduced and then an evaluation is performed to check whether the result conforms with the requirements [19].

Combinatorial testing was originally proposed in order to reduce the number of test data required to verify the interoperability among the functions of a system, based on a combinatorial method used in mathematical constructions for statistical experiments [13, 20]. Moreover, it presents good code coverage and ability to detect failures [26]. The number of tests needed to cover all \(n\)-way combinations of input parameter values grows logarithmically with the number of parameters. It is important to note that the key to minimizing the number of test data is the fact that each one covers different combinations (pairs, triples or \(n\)-tuples).

The parameters of an example system extracted from Lott et al. [29] are presented in Table 1a. From them, 24 combinations of values can be created. If the pairwise combinatorial approach is used, only 12 cases are needed to cover all parameter pairs, as shown in Table 1b. The resulting test set has the property that, given any pair of fields (columns), all possible combinations of their values are present.
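The defining property of a pairwise test set, namely that every combination of values for any two fields appears in some test, can be checked mechanically. The sketch below uses a hypothetical system with three binary parameters rather than the system of Table 1; the function name `uncovered_pairs` is likewise an assumption of this illustration.

```python
from itertools import combinations, product


def uncovered_pairs(domains, tests):
    """Return the (param_i, value_i, param_j, value_j) pairs that no test covers."""
    required = {
        (i, a, j, b)
        for i, j in combinations(range(len(domains)), 2)
        for a, b in product(domains[i], domains[j])
    }
    covered = {
        (i, t[i], j, t[j])
        for t in tests
        for i, j in combinations(range(len(t)), 2)
    }
    return required - covered


if __name__ == "__main__":
    # Hypothetical system: three parameters with two values each (8 full combinations).
    domains = [[0, 1], [0, 1], [0, 1]]
    tests = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
    print(uncovered_pairs(domains, tests))      # empty set: 4 tests cover all pairs
    print(uncovered_pairs(domains, tests[:3]))  # dropping a test leaves pairs uncovered
```

In this small case four tests suffice where exhaustive combination would need eight, which is the same kind of saving the pairwise approach aims for on larger systems.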

Table 1 Pairwise test


According to Schroeder et al. [40], several studies have been conducted on the coverage achieved by pairwise testing. Coverage in the majority of the studies refers to code coverage, or the measurement of the number of lines, branches, decisions or paths of code executed by a particular test suite. High code coverage has been correlated with high fault detection. In general, the conclusion of these studies is that testing efficiency, i.e., the amount of time and resources required to conduct testing, is optimized because the pairwise test set achieves the same level of coverage as larger combinatorial testing sets.

However, the results of these case studies are difficult to generalize. These practices are not described in enough detail to understand the full significance of these results. Additionally, little is known about the characteristics of software systems used in the studies [40].

Taking into account that the pairwise strategy mainly targets functional specifications, we have considered that an investigation of its suitability for Simulink-like models would be appropriate. This kind of model is a design implementation of a functional specification, which is ultimately going to be converted to low level code.

4.2 Tool for pairwise testing

TeTooDS is a testing tool specifically designed to interpret Simulink-like models, interact with simulation environments, and assist in the test data generation task. It has been developed to provide support for the application of functional criteria, specifically the pairwise approach, to Simulink-like models. With the tool, it is possible to:

parse the models for features relevant to the application of the functional technique, such as the input variables and their types. Based on these features, the tool provides the tester with a set of standard values that can be used as input values for the test, such as the lower and upper limits;
change these values, since the tester is aware of the system specification, by setting, for instance, typical ranges or values for certain variables. It is also possible to use the proposal from Henry et al. [22], allowing the tester to select values that lead to transitions that supposedly exercise the model, or at least all input variables, completely;
generate the test data using the pairwise approach, producing input test data that can be used in the model simulation in a computer environment or in the generated C code; and
exhibit the outputs acquired from the external simulation environment.

TeTooDS works by creating testing projects. When one is created, the tool parses relevant information from a Simulink-like model. Such information must be stored in an XML file, which can be generated by a Simulink or Scicos parser, or even manually. The information retrieved includes input ports, input data types, blocks, connections and output ports. Then, the tester is responsible for selecting how the test data are going to be generated.

To help define appropriate values, the testing tool provides a data type editor. It is possible to create new data types, based on the limits of already existing ones. This is useful for selecting more than one range of values for a data type or for selecting different ranges of values for inputs that originally have the same data type. It is also possible to select how many values will be chosen in an interval and how this generation will be performed (ascending order, descending order or randomly).

For instance, an input X is defined in the model as a double data type and represents a temperature in the range of \(-\)300 to 300. An input Y is also defined as a double data type, but represents a velocity in the range of 0–120. The testing tool permits the creation of a temperatureDouble data type with a range of \(-\)300 to 300 and a velocityDouble data type in the range of 0–120. It is possible to assign these new data types to X and Y. When the test data are generated, these new ranges are used.

The tester also needs to specify how many values are going to be selected for each range and how this selection is going to be performed. For example, the tester can specify that 6 values are going to be selected for the temperatureDouble range, in an ascending way. That means that the values \(-\)300, \(-\)180, \(-\)60, 60, 180 and 300 are going to be used in the test data generation. If the random option is chosen, then it is not possible to predict which values of that range are going to be selected.
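The selection just described can be sketched in a few lines. The following is an illustration consistent with the temperatureDouble example, assuming that the ordered modes pick equally spaced values including both limits; the function name `select_values` and its parameters are hypothetical and do not reflect the tool's actual interface.

```python
import random


def select_values(low, high, count, mode="ascending", rng=None):
    """Pick `count` values from [low, high] in ascending, descending or random mode."""
    if count < 2:
        raise ValueError("count must be at least 2 to include both limits")
    if mode == "random":
        rng = rng or random.Random()
        # Random mode: the selected values cannot be predicted in advance.
        return [rng.uniform(low, high) for _ in range(count)]
    if mode not in ("ascending", "descending"):
        raise ValueError("unknown mode: " + mode)
    # Ordered modes: equally spaced values that include both limits (assumption).
    step = (high - low) / (count - 1)
    values = [low + i * step for i in range(count)]
    return values if mode == "ascending" else values[::-1]


if __name__ == "__main__":
    print(select_values(-300, 300, 6))               # temperatureDouble, ascending
    print(select_values(0, 120, 4, "descending"))    # velocityDouble, descending
```

For the range of -300 to 300 with 6 values, the spacing is 120, which reproduces the values listed in the example.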

The test data can be generated by the pairwise approach or randomly. If the first option is chosen, every pair of values of two different input parameters is present in at least one test data. Otherwise, there is no guarantee about which combinations exist, and it is necessary to specify how many test data should be generated. Additional test data can be added manually, and existing test data can be removed.

Several strategies can be used to implement the pairwise generation technique [19]. In particular, the IPO strategy (In Parameter Order) is used by TeTooDS for the test data generation [28]. This strategy is used because its implementation is simple, the algorithm can be extended to support more parameter combinations and its number of generated combinations is similar to other strategies [19, 27].

The IPO algorithm first covers all value pairs of the first two model parameters. Then it extends the test data in order to cover the remaining parameters, one at a time. Two basic algorithms are used: one for horizontal growth, which extends the existing test data to cover a new parameter, and another for vertical growth, which creates new instances of test data to ensure that the coverage of the pairs involving this new parameter is complete.
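The two growth steps can be sketched as follows. This is a simplified variant written for illustration, not the published IPO algorithm nor the TeTooDS implementation: horizontal growth greedily picks the value that covers the most missing pairs, vertical growth packs the remaining pairs into new tests, and leftover "don't care" positions are filled with the first value of each domain. The function name `ipo_pairwise` is hypothetical, at least two parameters are assumed, and `None` is assumed not to be a legitimate parameter value.

```python
from itertools import product


def ipo_pairwise(domains):
    """Build a pairwise-covering test set with a simplified in-parameter-order strategy."""
    # Start with every value pair of the first two parameters.
    tests = [list(p) for p in product(domains[0], domains[1])]
    for k in range(2, len(domains)):
        # Pairs (earlier parameter index, its value, value of parameter k) still missing.
        missing = {(i, a, b) for i in range(k) for a in domains[i] for b in domains[k]}
        # Horizontal growth: extend each test with the value covering most missing pairs.
        for t in tests:
            best = max(
                domains[k],
                key=lambda b: sum((i, t[i], b) in missing for i in range(k)),
            )
            t.append(best)
            missing -= {(i, t[i], best) for i in range(k)}
        # Vertical growth: add new tests for the pairs that are still uncovered.
        extra = []
        for i, a, b in sorted(missing, key=repr):  # sorted only for reproducibility
            for t in extra:
                if t[k] == b and t[i] is None:
                    t[i] = a  # reuse a don't-care slot of an existing new test
                    break
            else:
                new_test = [None] * k + [b]
                new_test[i] = a
                extra.append(new_test)
        # Fill the remaining don't-care slots with an arbitrary valid value.
        for t in extra:
            for i in range(k):
                if t[i] is None:
                    t[i] = domains[i][0]
        tests.extend(extra)
    return [tuple(t) for t in tests]


if __name__ == "__main__":
    for test in ipo_pairwise([[0, 1], [0, 1], [0, 1]]):
        print(test)
```

Processing one parameter at a time is also what makes the strategy extensible: supporting a new parameter only requires another round of horizontal and vertical growth over the existing test set.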

After the generation of test data, they can be applied in a simulation or in an executable application via the testing tool. Corresponding relationships between the inputs (test data) and generated outputs will be presented to the tester for further analysis. It is worth mentioning that the testing tool merely interacts with the execution environment to gather the output data.

5 Experimental studies

There are two paradigms that are usually used as the basis for the development of an empirical study. The qualitative paradigm aims at the study of objects in their natural environment. The researcher seeks phenomena interpretation based on explanations. The quantitative one focuses on quantifying a relation or the comparison of two or more groups. The objective is to identify cause and effect relations and is normally related to data collection by case studies [45]. The two paradigms should be considered complementary and not competitive.

For the scope of this paper, the qualitative assessment was carried out through the analysis of the test data generated by TeTooDS in relation to the mutation criterion. The quantitative evaluation was also an object of study, as we used several models in our experiment.

In this experiment we aimed to compare the test data generated using the pairwise approach against a randomly generated set of test data. This comparison was achieved by applying mutation testing as a means to obtain the respective scores when applying each test data set to a model.

To help with the task of measuring the coverages obtained, testing tools are of major importance. However, as there are no tools available to work directly on embedded system models, our alternative was to use the C code generated from the models.

Thereby, the test sets generated by TeTooDS were evaluated for each one of the programs by using Proteum [14] to support mutation testing. All 73 of Proteum's mutation operators were used. However, the code generated by Scicos has too many unused variables and other pieces of unexecuted code, resulting in a large number of equivalent mutants to be analyzed (up to 500,000 mutants). It was necessary to use a reduction strategy for the generation of mutants for these models. In that case, 10% of the mutants were selected per mutation operator.
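A per-operator sampling of this kind can be sketched as below. The sketch is only an illustration of the idea: the operator names are placeholders rather than Proteum operators, the function name `sample_mutants` is hypothetical, and rounding the sample size up is an assumption, since the rounding rule is not stated here.

```python
import random


def sample_mutants(mutants_by_operator, percent=10, seed=0):
    """Randomly keep `percent` of the mutants of each operator (rounded up)."""
    rng = random.Random(seed)
    selected = {}
    for operator, mutant_ids in mutants_by_operator.items():
        # Integer ceiling of len * percent / 100; the rounding rule is an assumption.
        n = (len(mutant_ids) * percent + 99) // 100
        selected[operator] = rng.sample(mutant_ids, n)
    return selected


if __name__ == "__main__":
    # Placeholder operator names and mutant identifiers, for illustration only.
    population = {
        "operator_A": list(range(20)),
        "operator_B": list(range(100, 105)),
        "operator_C": [],
    }
    for operator, kept in sample_mutants(population).items():
        print(operator, kept)
```

Sampling within each operator, rather than over the whole mutant population, keeps every kind of fault represented in the reduced set.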

5.1 Planning

Five Simulink-like models were gathered for this experiment. The Tiny, Quadratic and Duplex models were extracted from Zhan and Clark [48]. The Flow Control model was selected from Blackburn and Busser [5] and the Generic Voter model was provided by Embraer [16].

Although some of the models may seem to perform a small number of operations, we argue that the main nature of Simulink-like models is the implementation of data flow systems in different levels of abstraction.

For each of the models, we generated random test sets whose values were combined randomly and compared them against random test sets whose values were combined by the pairwise strategy. The purpose of our comparison is to evaluate the effectiveness of each generation strategy, measured by its capacity to distinguish mutants of the programs generated from the models. Figure 2 describes this process.

Ranges of values within and out of the input domain of each model were used, and different quantities of values for each interval were specified. Our approach is dependent on the tester for the choice of appropriate intervals of values.

We have generated, according to the design of Fig. 2, 30 test sets for each generation strategy. This is considered the minimum number of repetitions required for the achievement of statistical significance. Each test set is composed of several groups of test data, where a group comprises one test datum for each input of a model. When a test set is applied to a model, its groups of test data are provided consecutively in continuous time steps to the execution environment. Afterwards, the means of the mutation scores are obtained and compared. We have used Student's t test to evaluate the significance of the results, which are described for each model in the following sections.

Furthermore, we have generated test sets of two different sizes, which are characterized as follows:

\(RP_1\): Random generation with pairwise combination. The number of generated elements depends on the model inputs;
\(RR_1\): Purely random generation. The number of generated elements is the same as in the \(RP_1\) set;
\(RP_2\): Random generation with pairwise combination. The number of generated elements depends on the model inputs, but more elements were selected in each interval when compared to \(RP_1\);
\(RR_2\): Purely random generation. The number of generated elements is the same as in the \(RP_2\) set.

For each input, we experimented with selecting values incrementally, starting with the lower and upper bounds of the valid domain and with two values outside of it. The number of elements selected in \(RP_2\) was the point at which no further improvement in the mutation score could be achieved just by selecting more elements from the input datatype interval.
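To make the difference between the pairwise and the purely random combinations concrete, the following sketch builds both kinds of test sets from the same candidate values. It is a simple greedy procedure that enumerates all candidate rows, so it only suits small examples, and it is not the algorithm implemented in TeTooDS. The function names and the two out-of-domain values (\(-\)150 and 150) are hypothetical.

```python
import itertools
import random


def pairwise_combine(values_per_input):
    """Greedily pick rows until every pair of values of every two inputs
    appears together in at least one row (needs at least two inputs)."""
    index_pairs = list(itertools.combinations(range(len(values_per_input)), 2))

    # Every (input i, value a, input j, value b) combination still to cover.
    uncovered = {
        (i, a, j, b)
        for i, j in index_pairs
        for a in values_per_input[i]
        for b in values_per_input[j]
    }

    # Exhaustive candidate rows: acceptable only for small examples.
    candidates = list(itertools.product(*values_per_input))

    rows = []
    while uncovered:
        # Choose the row that covers the most still-uncovered pairs.
        best = max(
            candidates,
            key=lambda row: sum(
                (i, row[i], j, row[j]) in uncovered for i, j in index_pairs
            ),
        )
        rows.append(best)
        uncovered -= {(i, best[i], j, best[j]) for i, j in index_pairs}
    return rows


def random_combine(values_per_input, size, rng):
    """Build `size` rows by picking one value per input at random."""
    return [tuple(rng.choice(vals) for vals in values_per_input)
            for _ in range(size)]


# Bounds of a [-100, 100] domain plus two hypothetical outside values,
# used for each of three inputs.
values = [[-150.0, -100.0, 100.0, 150.0]] * 3
rp = pairwise_combine(values)                          # pairwise combination
rr = random_combine(values, len(rp), random.Random(0))  # same size, random
```

The random set is given the same number of rows as the pairwise one, mirroring the way the sizes of the \(RR\) sets follow those of the \(RP\) sets.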

To avoid possible interference of the code generation process with the results, we used three different program versions generated from the models, on which the mutation testing was applied to assess the generated test sets. This code was generated in three different ways: (1) manually; (2) from a Simulink model, using MathWorks Real-Time Workshop (RTW); and (3) from a Scicos model, using its own code generator. Section 5.3 details the generation procedure for each model.

By conducting this experiment, we plan to compare the means of the mutation scores obtained by the \(RR_1\) and \(RP_1\) sets and by the \(RR_2\) and \(RP_2\) sets. Considering \(\mu _r\) as the mean obtained by the random sets and \(\mu _p\) the mean obtained by the random sets with pairwise combination, we have the following hypotheses to verify:

Null hypothesis \(H_0\): \(\mu _r = \mu _p\);
Alternate hypothesis \(H_1\): \(\mu _r \ne \mu _p\).
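As an illustration of how the hypotheses above can be checked, the sketch below computes a two-sample Student t statistic from two lists of mutation scores. We assume here independent samples with a pooled variance; whether this exact variant was used in the experiment is not stated, and the function name and the scores in the example are made up.

```python
import math
import statistics


def student_t(sample_r, sample_p):
    """Return (t, degrees_of_freedom) for two independent samples,
    using the pooled-variance form of Student's t test.

    Assumes each sample has at least two values and that the pooled
    variance is not zero.
    """
    n_r, n_p = len(sample_r), len(sample_p)
    mean_r = statistics.mean(sample_r)
    mean_p = statistics.mean(sample_p)
    var_r = statistics.variance(sample_r)  # sample variance (n - 1)
    var_p = statistics.variance(sample_p)

    df = n_r + n_p - 2
    pooled = ((n_r - 1) * var_r + (n_p - 1) * var_p) / df
    t = (mean_r - mean_p) / math.sqrt(pooled * (1 / n_r + 1 / n_p))
    return t, df


# Made-up mutation scores, only to show the call.
scores_random = [0.80, 0.82, 0.84]
scores_pairwise = [0.90, 0.92, 0.94]
t_value, dof = student_t(scores_random, scores_pairwise)
```

With 30 test sets per strategy there are 58 degrees of freedom, and \(H_0\) would be rejected at the 5 % level (two-sided) when \(|t|\) exceeds the tabulated critical value of approximately 2.002.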

Since the results obtained were not statistically significant, we decided to investigate further by comparing the means of the pairwise and the random combinations, individually, against the mean of the union of the pairwise and the random combinations. For that purpose, two additional test sets are used in the experiments:

\(RP_1\) + \(RR_1\): Union between \(RP_1\) and \(RR_1\) test sets.
\(RP_2\) + \(RR_2\): Union between \(RP_2\) and \(RR_2\) test sets.
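The effect of the union on the mutation score can be sketched as follows. We take the usual definition of the score as the number of killed mutants divided by the number of non-equivalent mutants; the function name and the mutant identifiers are hypothetical and do not come from the experiment.

```python
def mutation_score(killed, total_mutants, equivalent=0):
    """Killed mutants divided by the number of non-equivalent mutants."""
    return len(killed) / (total_mutants - equivalent)


# Hypothetical identifiers of the mutants killed by each test set.
killed_rp = {1, 2, 3, 5}   # pairwise combination
killed_rr = {2, 3, 4}      # random combination

# A mutant is killed by the union if either test set kills it.
killed_union = killed_rp | killed_rr

score_rp = mutation_score(killed_rp, total_mutants=12, equivalent=2)
score_rr = mutation_score(killed_rr, total_mutants=12, equivalent=2)
score_union = mutation_score(killed_union, total_mutants=12, equivalent=2)
```

By construction, the score of the union can never be lower than the score of either set alone; the question addressed by the experiment is whether the gain is statistically significant.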

Although the pairwise approach presented better results in most of our experiments, we could not achieve statistical significance to support this claim. Therefore, we investigated the union of both approaches, which yielded better results. Details of the path we followed are discussed in Sect. 5.4.6.

We stress that our experiment is actually based on Simulink models because the technique being evaluated is applied on such models. On the other hand, our evaluation of the quality of the test sets is performed at code level, since (i) there is no good set of mutant operators for Simulink-like models; (ii) no tool to seed faults or perform mutation testing at model level is available; and (iii) it is a reasonable assumption to consider the mutants at code level as representative of possible faults at model level, since the models are ultimately converted into code.

It is important to mention that possible time constraints are not relevant to our experiment. Since we are only using the input information from the models, we do not address the features of a Simulink-like model when generating test data using our approach.

5.2 Threats to validity

An important concern regarding the results obtained is their level of validity. The validity threats of an experiment can be classified into four basic types: conclusion, internal, construct and external [45].

The internal validity is of high priority, mainly because an important object of study is the relationship between the causes and the outcomes. A possible threat is the generation procedure of the test data: a single tester (one of the authors), aware of the models' behavior, defined the parameters for the generation of the test sets. Although this may influence the results, it also reflects how functional testing is carried out in practice.

We tried to eliminate the threat to the construct validity by including a comparison of the pairwise generation with randomly generated test data, in addition to the mutation scores obtained.

In our experiment, we make use of a set of mutant operators for low-level code instead of a specific set for Simulink-like models, as explained in Sect. 5.1. Although it is a reasonable assumption that this approach can be followed, since this type of system is converted into low-level code, we consider it appropriate to mention it as a threat to validity.

Statistical soundness was pursued by generating each test set 30 times, including: (i) a pairwise generation with random values, (ii) a random generation with random values and (iii) both generations together (values are selected by the pairwise approach and by the random approach). Furthermore, different numbers of values were selected in each of the specified intervals, both randomly and with a constant variation among them.

Regarding the external validity, the results are difficult to generalize, as many of them depend on the tester and on the models used. A selection of inappropriate values, or models with specific features (such as the need for a specific value as one of the inputs), may provide less satisfactory results.

5.3 Experimental setup

This subsection presents a general description of each model used in our experimental evaluation, together with its input domains and test generation parameters.

5.3.1 Tiny

This system was introduced by Zhan and Clark [46] and models a problem constructed by the authors. It has three inputs (X, Y and Z) of double datatype and each input has a domain from \(-\)100 to 100. When the expression \( ((Y-Z)*(Z-X)>=1{,}000) \& \& (Z*Z>=8{,}950) \) is true, the output is \( (X + Y) \). Otherwise, the output is \( (Z * Z) \). This model is presented in Fig. 3.
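The behavior just described can be written as a short function, which may help when reading the results for this model. This is a sketch of the described input-output relation only, not the code generated from the model; the function name is ours.

```python
def tiny(x, y, z):
    """Output of the Tiny model for the inputs X, Y and Z."""
    if (y - z) * (z - x) >= 1000 and z * z >= 8950:
        return x + y
    return z * z


# Both conditions hold here, so the output is X + Y.
out_sum = tiny(-100.0, 100.0, 94.7)

# The first condition fails here, so the output is Z * Z.
out_square = tiny(-100.0, 100.0, 95.0)
```

When all three inputs stay within the domain from \(-\)100 to 100, the product \((Y-Z)*(Z-X)\) is at most \(10{,}000 - Z*Z\), so the first output can only be reached when \(Z*Z\) lies between 8,950 and 9,000. This narrow band suggests why the choice of input values matters for this model.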