A Numerical Experiment on Mathematical Model of Forecasting the Results of Knowledge Testing

Abstract In this paper a new approach to forecasting the results of knowledge testing, proposed earlier by the authors, is extended with four classes of parametric functions, from which the best fitting one is selected to approximate the item characteristic function. The mathematical model is illustrated by two numerical experiments. The first experiment shows the procedure for selecting the most appropriate item characteristic function and adjusting the parameters of the model. A goodness-of-fit statistic for detecting misfit of the selected model is calculated. In the second experiment a test of 10 items is constructed for a population whose latent ability has a normal distribution. The probability distribution of the total test result and the test information function are calculated when the item characteristic functions are selected from four classes of parametric functions. In the next step it is shown how the test information function value could be increased by adjusting parameters of item cha...


Introduction
2011 volume 17(1): 42-61 doi: 10.3846/13928619.2011.553994
Measuring knowledge and other mental features is a problem with its own particularity, because the object of investigation is difficult to determine and measuring instruments are deficient. One will agree that it is much more difficult to measure a person's attainment in some field of knowledge than his physical properties. A very important stage is to create the appropriate instruments: questionnaires that allow getting maximum information about the measured feature. To construct a "good" questionnaire we must be able to choose the most informative subset of items from the whole item bank. This subset of items must be suited to the population under investigation so that the information supplied by the test reaches its maximum value. In this article we deal with dichotomous test items, for which there are only two possible outcomes: a test item may be answered correctly or incorrectly. In practice there can be more than two answer categories in questionnaires. For such cases polytomous latent variable models have been developed. Another approach is to join all incorrect answer categories into one category, which again yields a dichotomous model. The knowledge testing problem is the object of investigation of Item Response Theory (IRT) (Rasch 1960).
Recent articles on IRT are concerned with computerized adaptive tests, i.e. individualized tests that are optimal for each individual (Eggen, Verschoor 2006); latent class analysis (LCA), a statistical method used to identify a set of discrete, mutually exclusive latent classes of individuals based on their responses to a set of observed categorical variables (Lanza et al. 2007); new technologies such as heuristic search and machine learning approaches, including neural networks, to automatically identify the most informative subset of test items when the item bank is very large (El-Alfy, Abdel-Aal 2008); tests of model misfit to validate the use of a particular model in IRT (Wells, Bolt 2008); evaluation of the standard error of the estimated latent variable score (Hoshino, Shigemasu 2008); and new IRT software development (Rizopoulos 2006).
El-Alfy, Abdel-Aal (2008) proposed a new approach of abductive network modelling to automatically select the most informative subset of test items without serious loss of accuracy. This method was compared to the three-parameter logistic IRT model (3PL). The accuracy of the IRT-based model was slightly better; nevertheless, the new abductive network approach made it possible to reduce the number of test items from 45 to 12, which classified an evaluation population with 91% accuracy.
Van Barneveld (2007) analyzed the effect of aberrant response patterns on test construction. Data were generated using two item response models: the three-parameter logistic IRT model (3PL) alone and combined with Wise's examinee persistence model. Item parameters were estimated by marginal maximum likelihood with Bayesian priors on the item parameters using the program BILOG-MG (Zimowski, Muraki, Mislevy & Bock 1996). Tests were constructed using an optimal item selection method. Items with the largest item information estimates at each of the targeted cut-off ability points were selected for the optimal test. Responses from poorly motivated examinees led to biased item parameter estimates and biased item and test information estimates.
Wells, Bolt (2008) investigated a nonparametric method for detecting misfit when using the two-parameter logistic model (2PL). Two nonparametric statistics for detecting misfit based on the approach of Douglas, Cohen (2001) were examined. The results were compared to other well-known goodness-of-fit statistics, $S\text{-}X^2$ (Orlando, Thissen 2000) and BILOG's $G^2$ (Mislevy, Bock 1982). For all studied conditions the methods based on the nonparametric approach exhibited more power to detect misfit while also controlling the Type I error rate. It is hypothesized that nonparametric statistics provide a more informative description of the nature of misfit, which can help in diagnosing its cause (e.g., guessing).
Savalei (2006) proposed approximating the standard normal distribution with a logistic distribution with scaling constant 1.749, obtained by minimizing the Kullback-Leibler (KL) information function. This approximation is compared with the Item Response Theory logistic function, in which another constant, 1.702, is used. The new approximation gives a better fit in the tails of the distribution.
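The two scaling constants can be compared numerically. The sketch below is an illustration only, not Savalei's procedure: it evaluates the absolute difference between the standard normal distribution function and the logistic function on a grid, and the function names, the grid and the tail point 3.0 are assumptions chosen for the example.

```python
import math

def normal_cdf(x: float) -> float:
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def logistic_cdf(x: float, scale: float) -> float:
    """Logistic distribution function with the given scaling constant."""
    return 1.0 / (1.0 + math.exp(-scale * x))

def gap_at(x: float, scale: float) -> float:
    """Absolute difference between the two functions at one point."""
    return abs(normal_cdf(x) - logistic_cdf(x, scale))

def max_gap(scale: float, lo: float = -6.0, hi: float = 6.0, steps: int = 12001) -> float:
    """Largest absolute difference over an evenly spaced grid (illustrative grid)."""
    return max(gap_at(lo + (hi - lo) * i / (steps - 1), scale) for i in range(steps))

if __name__ == "__main__":
    for scale in (1.702, 1.749):
        print(scale, round(max_gap(scale), 5), round(gap_at(3.0, scale), 5))
```

The constant 1.702 keeps the largest difference over the whole axis small, while 1.749 gives the smaller difference at the illustrative tail point, which is in line with the statement about the tails.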
Hoshino, Shigemasu (2008) proposed a formula to evaluate the variance of the estimated latent variable score when the true values of the structural parameters are not known and must be estimated. It is shown that appropriate accuracy is reached when the numbers of subjects and items are both large. For all conditions considered, the standard errors of the ability parameters obtained with the proposed method were smaller than the familiar standard errors computed as the inverse of the test information.
Eggen, Verschoor (2006) investigated computerized adaptive tests (CAT), which select an optimal test for each individual. Such a test is realized by selecting the most informative item from the item bank on the basis of the results of previously selected items. Optimal selection often means choosing an item that the individual student has a 50% probability of answering correctly. Such tests are often too difficult for students, which has negative side effects. To eliminate these effects, two item selection procedures giving easier or more difficult tests were analyzed for both the one-parameter (1PL) and two-parameter (2PL) logistic models. The first procedure, based on the success probability points of selected items, shows good measurement precision of the ability estimates only for the 1PL model. The other item selection procedure, based on maximum information at a shifted ability level, gives good results for both the 1PL and 2PL models.
Rizopoulos (2006) developed the package ltm for the well-known open source statistical software R for the analysis of multivariate dichotomous and polytomous data using Item Response Theory logistic models. Parameter estimates are obtained under marginal maximum likelihood using the Gauss-Hermite quadrature rule. This package is suitable not only for unidimensional latent variable models but also for cases where a small set of latent variables explains the observed data.
St-Onge et al. (2009) compared parametric and nonparametric Item Characteristic Curve estimation methods with respect to the effectiveness of Person-Fit Statistics (PFS). For both large and small sample sizes, the accuracy of the PFS was greater when used with the parametric models.
The aim of this paper is to propose a model for forecasting the results of knowledge testing in which the best fitting item characteristic function is selected from 4 classes of parametric functions. The prior distribution of the knowledge level of the population can be chosen from 4 classes of probability density functions. The model also allows parametric functions of other forms; its main restrictions are that the item characteristic function has to be nondecreasing and that the items have to be mutually independent.
Earlier it was shown (Krylovas, Kosareva 2008a, b) how segments of linear functions could be used as an item characteristic function and also as a probability density function of the population knowledge level. It was shown how this approach could be used to construct norm-referenced latent trait estimates and to select test items that are optimally fitted to the examined population. In (Krylovas, Kosareva 2009a) a generalization of this model with a wider set of item characteristic functions and probability density functions was presented. This model could be used not only for knowledge testing, but also for solving diagnostic tasks in various fields of human activity. The problems of decision making under conditions of information deficiency were analyzed in (Zavadskas et al. 2009).

Rasch and normal ogive models
The well-known models in latent trait testing theory are the One Parameter Logistic model (1PL) and the normal ogive model. These models describe the conditional probability of a correct response to item $i$ given ability level $p$: $P(X_i = 1 \mid p)$. The graph of this functional relation is called the item characteristic curve (ICC). In 1PL the logistic function is applied to describe this relation (Rasch 1960):

$$P(X_i = 1 \mid p) = \frac{e^{p - b_i}}{1 + e^{p - b_i}},$$

here $b_i$ is the difficulty parameter of item $i$. The function above, which is called the item characteristic function, belongs to the class of logistic functions. The Rasch model holds if some constraints on the model are satisfied (Molenaar 2007):
i. Unidimensionality of the latent trait $p$. This means that $p$ is a one-dimensional quantity which reflects the person's ability to answer test items correctly. One mental property is measured at a time; the influence of other latent traits is treated as negligible.
ii. Conditional independence of items given the person and conditional independence of persons given the item. Given the person's ability $p$, the elements of the response vector are independent. On the other hand, a person's response to an item is independent of other respondents' responses to this item. Respondents do not influence each other's responses.
iii. Monotonicity of item response functions. The item response function $P(X_i = 1 \mid p)$ is a nondecreasing function of $p$.
iv. Sufficiency of the total test score. The total score $\sum_i x_i$ is a sufficient statistic for $p$.
Formula (1) represents a generalization of the Rasch model (Birnbaum 1968), where the supplementary parameters are the discrimination parameter $a_i$ (2PL model) and, additionally, the probability of random guessing $c_i$ for item $i$ (3PL model):

$$P(X_i = 1 \mid p) = c_i + (1 - c_i)\,\frac{e^{a_i (p - b_i)}}{1 + e^{a_i (p - b_i)}}. \qquad (1)$$

In the 1PL, 2PL and 3PL models the probabilities of a correct response to the items are "S"-shaped functions. The 2PL and 3PL logistic models satisfy only conditions i. to iii., while condition iv. is generally not required.
In the normal ogive model the link function between the given ability level $p$ and the probability is the standard normal probability distribution function (Uebersax 1999):

$$P(X_i = 1 \mid p) = \Phi\bigl(a_i (p - b_i)\bigr) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{a_i (p - b_i)} e^{-t^2/2}\, dt,$$

with the same interpretation of the parameters $b_i$ and $a_i$. We suggest using a wider class of link functions, which enables a more precise approximation of the ICC and, as a consequence, more efficient tests.
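For orientation, the logistic and normal ogive link functions can be written as short Python functions. This is a minimal sketch that assumes the standard textbook forms of these models without an additional scaling constant; the function and argument names are illustrative.

```python
import math

def icc_logistic(p: float, a: float = 1.0, b: float = 0.0, c: float = 0.0) -> float:
    """Logistic ICC: 1PL for a=1, c=0; 2PL for c=0; 3PL in general.

    a: discrimination, b: difficulty, c: probability of random guessing.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (p - b)))

def icc_normal_ogive(p: float, a: float = 1.0, b: float = 0.0) -> float:
    """Normal ogive ICC: standard normal distribution function at a*(p - b)."""
    return 0.5 * (1.0 + math.erf(a * (p - b) / math.sqrt(2.0)))

if __name__ == "__main__":
    # At p = b the 2PL and the normal ogive both give 0.5; the 3PL gives (1 + c) / 2.
    print(icc_logistic(0.3, a=2.0, b=0.3))
    print(icc_logistic(0.3, a=2.0, b=0.3, c=0.2))
    print(icc_normal_ogive(0.3, a=2.0, b=0.3))
```

Both families are nondecreasing in $p$ for positive $a$, which is the monotonicity condition iii.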

Problems and proposals for their solution
Before the examinee testing process begins, each item must be calibrated according to the selected model. Because the calibration is restricted to one class of parametric functions (either the logistic functions or the standard normal probability distribution function), it results in biased item parameter estimates. These biases cause biases in the estimates of the test information function $I(p)$ and consequently in the precision of the examinee ability estimates $\hat{p}$. Another consequence of biased item parameter estimates is that the standard error of measurement of the estimated examinee ability, when overestimated at a given ability level, results in an excess number of items being presented to examinees in order to reach the nominal precision (Van Barneveld 2007). This increases the resources required by the testing process.
According to formula (2) (Lord 1980), the standard error of measurement is inversely proportional to the square root of the test information function $I(p)$:

$$SE(\hat{p}) = \frac{1}{\sqrt{I(p)}}. \qquad (2)$$

Our proposal is that the bias in item parameter estimates can be reduced not only by increasing the number of examinees in the calibration group and/or the number of test items presented to the examinees, but also by expanding the set of item characteristic functions $k(p)$ from which the best fitting one is selected. In the proposed model the best fitting ICC is selected from one of 4 classes of parametric functions depending on one or two parameters. These functions are:
- the two-parameter logistic function restricted to the interval [0; 1], described by (3);
- the arccotangent function (4);
- segments of linear functions (5);
- segments of 2 parabolas (6).
It is notable that all these functions are defined on the interval [0; 1], so the functions $k_1(p)$ to $k_4(p)$ are defined for ability levels from this interval. The examinee's ability level is interpreted as the proportion of the maximum value of the ability score ($p_{\max} = 1$) that the examinee possesses. In IRT the ability value (usually denoted by $\theta$) in theory belongs to the interval $(-\infty; +\infty)$ and in practice lies in the interval $(-3; +3)$. The ability level of the examinee in IRT cannot be estimated when the number of correct responses to the $N$-item test is equal to its minimum or maximum value (0 or $N$ respectively). Our model enables estimation of the ability level in such marginal cases with a number in the interval [0; 1]. The particular estimated ability value depends on the selected ICC of the model.
Function (3) is the two-parameter logistic function (2PL). When restricted to the interval [0; 1] it has the attractive property of being similar to the three-parameter logistic function (1) for low values of the item discrimination parameter $a$ and the difficulty parameter $b$: it takes values greater than zero for low $p$ values, so we get the effect of guessing without the guessing parameter of 3PL. Likewise, for low parameter $a$ and high parameter $b$ values, the value of function (3) is less than 1 for high ability levels $p$. This can also improve the estimate of the ICC in some situations.
The selection of the most appropriate model from these function classes, and the estimation of the item parameters that best fit the observed proportions of correct responses, can be done as described in (Baker 2001). Examinees are grouped into ability intervals based on their ability scores. The interval [0; 1] is divided into $J$ intervals of equal length with $m_j$ examinees in the $j$-th interval. The total number of examinees is $\sum_{j=1}^{J} m_j$. The examinees within the same interval are assigned the same ability score $p_j$ (we have taken this point at the middle of the $j$-th interval). Let $r_j$ examinees of ability score $p_j$ answer the item correctly. Then the observed proportion of correct responses to the item at ability score $p_j$ is $r_j / m_j$.
Our purpose is to select, from the 4 function classes, the function that provides the most accurate approximation to the observed proportions of correct responses to the item. At the first step the best approximation of the ICC from the class of two-parameter logistic functions $k_1(p; a; b)$ described by (3) is calculated. In each subsequent iteration, adjustments to the estimated parameters $a$ and $b$ that improve the agreement between $k_1(p; a; b)$ and the observed proportions of correct responses are found. This process is continued until the improvement of the agreement becomes very small. Then the current values of the parameters $a_1^n$ and $b_1^n$ are fixed and they are considered the item parameter estimates for $k_1(p; a; b)$. This procedure is repeated for the functions $k_2(p)$ to $k_4(p)$ determined by (4)-(6) at the next steps, and the minimum value is chosen from the four resulting distances. The corresponding ICC is the best fitting model for the observed data chosen from the 4 classes of functions (3)-(6).
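The grouping and selection procedure described above can be sketched in Python. This is only a sketch under stated assumptions: `logistic_icc` and `ramp_icc` are hypothetical stand-ins for two of the function classes (they are not formulas (3) and (5)), the distance is assumed to be a sum of squared differences weighted by group size, and a plain grid search replaces the iterative parameter adjustment.

```python
import math
from itertools import product

def group_proportions(abilities, responses, n_intervals):
    """Group examinees into equal intervals of [0, 1].

    Returns (midpoint p_j, count m_j, observed proportion r_j / m_j)
    for every non-empty interval.
    """
    counts = [0] * n_intervals
    correct = [0] * n_intervals
    for p, x in zip(abilities, responses):
        j = min(int(p * n_intervals), n_intervals - 1)
        counts[j] += 1
        correct[j] += x
    return [((j + 0.5) / n_intervals, counts[j], correct[j] / counts[j])
            for j in range(n_intervals) if counts[j] > 0]

def logistic_icc(p, a, b):
    """Hypothetical stand-in for a two-parameter logistic class."""
    return 1.0 / (1.0 + math.exp(-a * (p - b)))

def ramp_icc(p, a, b):
    """Hypothetical stand-in for a piecewise linear class: 0 below a, 1 above b."""
    if p <= a:
        return 0.0
    if p >= b:
        return 1.0
    return (p - a) / (b - a)

# Candidate classes with illustrative parameter grids (assumptions).
CANDIDATES = {
    "logistic": (logistic_icc,
                 [(a, b / 10) for a, b in product(range(1, 16), range(1, 10))]),
    "ramp": (ramp_icc,
             [(a / 10, b / 10) for a in range(0, 11) for b in range(a + 1, 11)]),
}

def distance(icc, params, groups):
    """Assumed distance: weighted sum of squared differences."""
    return sum(m * (obs - icc(p, *params)) ** 2 for p, m, obs in groups)

def fit_class(icc, grid, groups):
    """Best (distance, params) within one class by grid search."""
    return min(((distance(icc, prm, groups), prm) for prm in grid),
               key=lambda t: t[0])

def select_best(groups, candidates=CANDIDATES):
    """Fit every class and return the name of the closest one plus all fits."""
    fits = {name: fit_class(icc, grid, groups)
            for name, (icc, grid) in candidates.items()}
    return min(fits, key=lambda name: fits[name][0]), fits
```

For example, `select_best(group_proportions(abilities, responses, 10))` returns the name of the closest class together with the fitted parameters and distance of every class.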
The value of the goodness-of-fit statistic $X^2$ for detecting misfit of the selected model is defined by formula (8), where $k(p)$ is the best fitting model ICC found in the previous step.
The $X^2$ statistic has a $\chi^2$ distribution with $J - s - 1$ degrees of freedom when $k(p)$ is suitable for the observed data. Here $J$ is the number of grouping intervals and $s$ is the number of parameters of the model under investigation. For example, $s = 1$ for the arccotangent function (4) and $s = 2$ for the functions (3), (5), (6). The observed value of statistic (8) is compared with the criterion value, which is the critical value $\chi^2_{0.05}(J - s - 1)$ of the $\chi^2$ distribution with $J - s - 1$ degrees of freedom. If the calculated value of the $X^2$ statistic is greater than the criterion value $\chi^2_{0.05}(J - s - 1)$, then the corresponding ICC does not fit the data; otherwise it does. The $X^2$ statistic requires more than 5 observations in each interval (Bagdonavicius, Kruopis 2007). When the number of observations in some interval is less than 5, adjacent intervals may be joined together, and $J$ is then the number of intervals after joining.
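The fit check can be illustrated with a short Python sketch. It assumes that the statistic has the form used in Baker (2001), i.e. a sum over intervals of $m_j (r_j/m_j - k(p_j))^2 / (k(p_j)(1 - k(p_j)))$; whether formula (8) coincides with this form is an assumption. The critical value is passed in by the caller (for instance, the tabulated 5% critical value for 3 degrees of freedom is 7.815).

```python
def chi_square_statistic(groups, icc):
    """Assumed Baker-style statistic.

    groups: iterable of (p_j, m_j, observed proportion r_j / m_j).
    icc: fitted ICC as a function of p; its values must lie strictly in (0, 1).
    """
    total = 0.0
    for p, m, observed in groups:
        expected = icc(p)
        total += m * (observed - expected) ** 2 / (expected * (1.0 - expected))
    return total

def check_fit(groups, icc, n_params, critical_value):
    """Return (statistic, degrees of freedom J - s - 1, fits?)."""
    groups = list(groups)
    df = len(groups) - n_params - 1
    stat = chi_square_statistic(groups, icc)
    return stat, df, stat <= critical_value

if __name__ == "__main__":
    # Illustrative data: 5 intervals, one-parameter model, so 3 degrees of freedom.
    data = [(0.1, 40, 0.10), (0.3, 60, 0.35), (0.5, 80, 0.50),
            (0.7, 60, 0.72), (0.9, 40, 0.85)]
    print(check_fit(data, lambda p: p, n_params=1, critical_value=7.815))
```

Intervals with too few observations would have to be joined before calling `check_fit`; that step is not shown here.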

Experiment 1
The primary aim of this experiment is to demonstrate how the best fitting ICC can be selected from the 4 classes of parametric functions (3)-(6). Suppose that the ability level $p$ has the Beta distribution $B(3; 3)$. The Beta distribution is convenient for approximating the ability level distribution because it is defined on the interval [0; 1] and can be either symmetric or asymmetric, depending on the parameter values. 3000 observations were randomly generated from the Beta distribution $B(3; 3)$. Item responses drawn from the Bernoulli distribution with probabilities $k_2(p_j; 1)$ (arccotangent function (4) with parameter $a = 1$) were generated for $j = 1, 2, \dots, 3000$. The data were grouped into 31 intervals of equal length according to the ability scores. The observed proportions of correct responses were calculated for each group. Then the value of the unknown parameter $a$ was adjusted iteratively by minimizing the distance to the observed proportions.
Table 1.
The best fit to the observed proportions of correct responses was achieved with the function $k_2(p; a)$ and the parameter estimate $\hat{a} = 0.96$. Very similar results were obtained with the modified logistic function $k_1(p; a; b)$ and with the function $k_4(p; a; b)$ (segments of 2 parabolas) at their estimated parameter values. The fit of the best function $k_3(p; a; b)$ to the observed data was worse: the value of the $X^2$ statistic exceeded the criterion value, so the model $k_3(p; a; b)$ is concluded to misfit. It is notable that, due to the randomness of the experiment, the smallest distance within the class $k_2(p; a)$ was not reached at the true parameter value $a = 1$, even though the number of observations is large (3000).
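The data generation step of such an experiment can be sketched in Python with the standard library only. Since the arccotangent function (4) is not reproduced here, the sketch uses a hypothetical one-parameter stand-in `stand_in_icc(p, a) = p**a`; the least-squares distance, the parameter grid and the seed are likewise assumptions made for illustration.

```python
import random

def stand_in_icc(p: float, a: float) -> float:
    """Hypothetical one-parameter nondecreasing ICC on [0, 1] (not formula (4))."""
    return p ** a

def simulate(n: int = 3000, a: float = 1.0, n_intervals: int = 31, seed: int = 0):
    """Draw abilities from Beta(3, 3), Bernoulli responses from the ICC, and group them.

    Returns per-interval counts m_j and numbers of correct responses r_j.
    """
    rng = random.Random(seed)
    counts = [0] * n_intervals
    correct = [0] * n_intervals
    for _ in range(n):
        p = rng.betavariate(3, 3)
        x = 1 if rng.random() < stand_in_icc(p, a) else 0
        j = min(int(p * n_intervals), n_intervals - 1)
        counts[j] += 1
        correct[j] += x
    return counts, correct

def estimate_a(counts, correct, grid=None):
    """Grid search for the parameter minimizing a weighted squared distance (assumed form)."""
    if grid is None:
        grid = [0.5 + 0.01 * i for i in range(151)]  # 0.50 ... 2.00
    n_intervals = len(counts)

    def dist(a):
        return sum(
            counts[j] * (correct[j] / counts[j] - stand_in_icc((j + 0.5) / n_intervals, a)) ** 2
            for j in range(n_intervals) if counts[j] > 0
        )

    return min(grid, key=dist)

if __name__ == "__main__":
    m, r = simulate()
    print("estimated a:", estimate_a(m, r))
```

As in the experiment, the estimate obtained from a finite sample will generally differ somewhat from the true value used to generate the data.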

Mathematical model of experiment 2
Let us suppose that the probability distribution of the knowledge level $p$ is known:

$$P(p < x) = \int_0^x f(p)\, dp.$$

The model was applied to 4 classes of probability density functions $f(p)$: segments of linear functions $f_1(p)$, the Beta distribution $f_2(p)$, the normal distribution normalized on the interval [0; 1] $f_3(p)$, and the histogram function $f_4(p)$. These functions are defined for $p \in [0; 1]$ and their parameters are chosen in such a way that the functions satisfy two features of probability density functions:

$$f(p) \ge 0, \qquad (9)$$

$$\int_0^1 f(p)\, dp = 1. \qquad (10)$$

Segments of linear functions $f_1(p)$ are represented by a trapezium or a triangle depending on the parameter values (Krylovas, Kosareva 2008a). The Beta distribution probability density function satisfies features (9) and (10) from the outset. The probability density function of the normal distribution is restricted to $p \in [0; 1]$ and multiplied by a suitable normalizing constant so that feature (10) holds. The histogram is also defined for $p \in [0; 1]$.
For fixed latent ability $p$, the responses to the test items are, due to condition ii. stated above, independent random variables $X_i$ having Bernoulli distributions with probabilities $k_i(p)$. So the total test result $S = X_1 + \dots + X_N$ has a generalized Binomial distribution (Bagdonavicius, Kruopis 2007) with probabilities that are equal to the coefficients of the corresponding powers of $x$ in the generating function polynomial of the random variable $S$:

$$\prod_{i=1}^{N} \bigl(1 - k_i(p) + k_i(p)\, x\bigr).$$
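The distribution of the total test result can be computed by expanding the generating polynomial one item at a time and then averaging over the ability density. The sketch below is illustrative: it assumes that the unconditional probabilities are obtained by integrating the conditional ones against $f(p)$ over [0; 1] (law of total probability), uses a simple midpoint rule, and all function names are hypothetical.

```python
def score_distribution_given_p(item_probs):
    """Coefficients of prod_i (1 - k_i + k_i * x), i.e. P(S = s | p) for s = 0..N."""
    coeffs = [1.0]
    for k in item_probs:
        nxt = [0.0] * (len(coeffs) + 1)
        for s, c in enumerate(coeffs):
            nxt[s] += c * (1.0 - k)   # item answered incorrectly
            nxt[s + 1] += c * k       # item answered correctly
        coeffs = nxt
    return coeffs

def score_distribution(iccs, density, steps: int = 2000):
    """P(S = s) as the midpoint-rule integral of P(S = s | p) f(p) over [0, 1].

    iccs: list of functions k_i(p); density: function f(p) on [0, 1].
    """
    result = [0.0] * (len(iccs) + 1)
    h = 1.0 / steps
    for t in range(steps):
        p = (t + 0.5) * h
        weight = density(p) * h
        conditional = score_distribution_given_p([k(p) for k in iccs])
        for s, c in enumerate(conditional):
            result[s] += weight * c
    return result

if __name__ == "__main__":
    # Two illustrative items with k(p) = p and a uniform ability density.
    print(score_distribution([lambda p: p, lambda p: p], lambda p: 1.0))
```

With two items of the form $k(p) = p$ and a uniform density, each of the three possible scores has probability 1/3, which gives a quick sanity check of the routine.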
The test information function $I$ is described as follows:

$$I(k_1, k_2, \dots, k_N; f) = -\sum_{i=0}^{N} P(S = i) \ln P(S = i). \qquad (11)$$

The normalized value of the function $I$ is the percentage of the test information function (11) relative to its maximum value, which is reached when all probabilities are equal: $P(S = i) = \frac{1}{N + 1}$, $i = 0, 1, 2, \dots, N$. Our purpose is to choose test items that maximize the value of the test information function.

(3)-(6), so the conclusion about the stability of these results could be drawn.
Maximum values, averages and standard deviations of ( ) are decreasing as the sample size increases. It was shown in this data simulation example that the mathematical model describes real processes correctly. When the ICC functions are chosen from the 4 parametric classes of functions, the model provides a more precise approximation of the observed ICC and, as a consequence, better accuracy of the probability distribution of the total test result $S$.
The probability distribution of the total test result $S$ and the value of the normalized test information function $I$ (11) calculated by the model for the described data are presented in Table 3. The histogram of probabilities is shown in Fig. 4. The test is sufficiently good for this population: the value of the test information function reaches 92% of its maximum value. Nevertheless, we can see from the histogram that the test is too difficult for this group of examinees, as the probabilities of lower grades exceed the probabilities of higher grades. We can increase the value of the test information function by replacing difficult items with easier ones (for example, by reducing the value of parameter $a$ in $k_2(p; a)$) or by replacing items that have low discrimination parameter values with items that have higher values of this parameter. In this experiment the parameters of 4 items were changed. The results of the new test are presented in Table 4 and Fig. 5. The value of the improved normalized test information function is 0.96.

Case study: some examples of diagnostic operators for evaluation of microclimate in office rooms
We will now show how the proposed model can be applied to a practical decision-making problem. It should be emphasized that the example serves demonstrative purposes only and we do not try to reach very precise results, because the number of observations is insufficient (14) and the test is short (only 3 items). In practice, if one wants to obtain good precision in the parameter calibration procedure, the recommended number of observations is 500. Nevertheless, our method gives suitably accurate results in this example.
The problem of evaluating the microclimate in office rooms was solved earlier by applying the ARAS method for multicriteria decision-making. 6 microclimate evaluation parameters were analyzed in that paper: 1) air turnover inside the premises; 2) air humidity; 3) air temperature; 4) illumination intensity during work hours; 5) air flow rate; 6) dew point. According to these parameters and the estimates of 38 experts, the comparison criterion of 14 office rooms (denoted by $p$) was calculated. In the example below 2 problems will be solved: the office rooms will be grouped into clusters according to the test results, and at the next step the comparison criterion $p$ for the 14 office rooms will be evaluated.
Let us denote: NR, number of the room; RH, relative air humidity; T, temperature; I, illumination intensity during work hours (the parameters RH, T and I will be used to construct diagnostic operators); R, rank of the office room. All parameter values are presented in Table 5. Notice that only three of the six parameters given in Table 5 will be used, whereas the comparison criterion $p$ was calculated using all 6 parameters. As we see from Table 5, according to the 2 parameters RH and T, the estimates for rooms 2 and 3, 5 and 6, 8 and 9, 11 and 12 will coincide. Let us construct a test from 2 dichotomous items $K_{RH}$ (RH ≥ 44) and $K_T$ (T ≥ 20). The results of the two-item test are presented in Table 6.
Thus the 2-item test lets us group the office rooms into three clusters reasonably well. According to the criterion $p$, NR4 in the group TS2 = 0 must be replaced with NR10, and NR12 in the group TS2 = 2 must be replaced with NR1. However, there are not enough data in the 2-item test to distinguish NR11 from NR12.
To improve the test, a third item $K_I$ (I ≥ 320) is added to it. The new 3-item test, denoted by TS3, gives 4 clusters, represented in Table 7. Group TS3 = 0 coincides with TS2 = 0. TS3 = 3 correctly includes the 3 rooms NR8, NR9, NR11 possessing the highest ranks. TS2 = 1 is split into 2 groups, TS3 = 1 and TS3 = 2. A further improvement is that NR12 moves to TS3 = 2. The only remaining correction required is to move NR10 from TS3 = 2 to TS3 = 0.
So the third item lets us distribute the office rooms into 4 groups more precisely. However, there is no reason to expect better ranking results than with the ARAS method, because only 3 of the 6 parameters were used. Better results could be achieved by including additional items in the test.
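The construction of dichotomous diagnostic items and the grouping of rooms by total score can be sketched as follows. The thresholds 44, 20 and 320 are those stated in the text; the three rooms and their measurements are hypothetical and do not reproduce Table 5.

```python
# Thresholds of the diagnostic operators K_RH, K_T, K_I from the text.
THRESHOLDS = {"RH": 44, "T": 20, "I": 320}

def item_responses(room, items):
    """Dichotomous responses: 1 if the parameter reaches its threshold, else 0."""
    return tuple(int(room[name] >= THRESHOLDS[name]) for name in items)

def cluster_by_score(rooms, items):
    """Group room labels by total test score (number of positive responses)."""
    clusters = {}
    for label, room in rooms.items():
        score = sum(item_responses(room, items))
        clusters.setdefault(score, []).append(label)
    return clusters

if __name__ == "__main__":
    # Hypothetical measurements, for illustration only.
    rooms = {
        "A": {"RH": 40, "T": 19, "I": 300},
        "B": {"RH": 45, "T": 19, "I": 330},
        "C": {"RH": 46, "T": 21, "I": 350},
    }
    print(cluster_by_score(rooms, ("RH", "T")))       # two-item test
    print(cluster_by_score(rooms, ("RH", "T", "I")))  # three-item test
```

Adding an item can only refine or shift the grouping by total score, which is the effect described for the transition from TS2 to TS3.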
The comparison of classification results obtained by 3 items test (4 clusters) and by ARAS method is represented in Fig. 6.
It is notable that the theoretical characteristics of diagnostic operators depend on the chosen mathematical model (Krylovas, Kosareva 2009b). It is important that this methodology imposes only natural restrictions on the shape of the diagnostic operator, which ensure the principle of the diagnostic operator's validity: a subject with a higher comparison criterion value has a higher probability of responding to the test item positively. The thresholds of the diagnostic operators (44 for RH, 20 for T and 320 for I) are selected so that approximately one half of the testees respond positively to the item. In this case the test information function (11) achieves its maximum value.
The best approximation to the probability of a positive response to the first item $K_{RH}$ (RH ≥ 44) was selected from the class of two-parabola segment functions (6), with $a = 0.53$, $b = 0.75$. Fig. 7 shows the best approximations to the empirical distribution function $P(RH \ge 44 \mid p)$ of the first item selected from the 4 function classes (3)-(6).
were calculated for the data grouped in 5 intervals by the formula (7).
The estimate $\hat{p}$ of the criterion value for each office room is the value that maximizes the log-likelihood function (12), i.e. the natural logarithm of the likelihood function:

$$\ln L(p) = \sum_{i=1}^{3} \bigl[x_i \ln k_i(p) + (1 - x_i) \ln\bigl(1 - k_i(p)\bigr)\bigr], \qquad (12)$$

where $x_i$ is the response to item $i$ and $k_i(p)$ is the fitted characteristic function of item $i$.
The estimated criterion values $\hat{p}$ and rank values R are presented in Table 8. Since the responses to all 3 items are identical for rooms NR4, NR5, NR6, NR7, for NR8, NR9, NR11, for NR13, NR14, etc., the estimated criterion values $\hat{p}$ coincide for the rooms within each of these groups. The value of Spearman's rank correlation coefficient between $p$ and $\hat{p}$ equals 0.841. We can see that even a 3-item test gives sufficiently accurate estimated criterion values. Better results could be expected for tests with more items.
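Maximizing the log-likelihood over [0; 1] can be sketched with a simple grid search. The sketch assumes conditionally independent dichotomous items, so that the log-likelihood is a sum of Bernoulli terms; the item functions in the example are hypothetical, and probabilities are clipped away from 0 and 1 only to keep the logarithm finite.

```python
import math

def log_likelihood(p, responses, iccs, eps: float = 1e-12):
    """Sum of Bernoulli log-probabilities of the observed responses at ability p."""
    total = 0.0
    for x, k in zip(responses, iccs):
        q = min(max(k(p), eps), 1.0 - eps)  # clip to avoid log(0)
        total += x * math.log(q) + (1 - x) * math.log(1.0 - q)
    return total

def estimate_p(responses, iccs, steps: int = 1000):
    """Grid point in [0, 1] with the largest log-likelihood."""
    grid = [t / steps for t in range(steps + 1)]
    return max(grid, key=lambda p: log_likelihood(p, responses, iccs))

if __name__ == "__main__":
    # Three hypothetical items with k(p) = p, for illustration only.
    items = [lambda p: p] * 3
    print(estimate_p((1, 1, 0), items))  # interior estimate
    print(estimate_p((1, 1, 1), items))  # all responses positive
```

Because the search is restricted to [0; 1], an estimate is obtained even when all responses are positive or all are negative, in which case it lies wherever the chosen item functions place the maximum, possibly at the boundary.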
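Spearman's rank correlation coefficient can be computed as the Pearson correlation of the ranks, with tied values (such as coinciding estimates within a group of rooms) receiving average ranks. This is a generic sketch; whether the reported value was obtained with exactly this treatment of ties is an assumption, and the example data are illustrative.

```python
import math

def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for t in range(i, j + 1):
            ranks[order[t]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Pearson correlation of the average ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

if __name__ == "__main__":
    criterion = [0.31, 0.42, 0.47, 0.58, 0.66]   # illustrative values
    estimate = [0.35, 0.35, 0.50, 0.50, 0.70]    # illustrative values with ties
    print(spearman(criterion, estimate))
```

The coefficient is undefined when all values of one variable coincide, since the rank variance is then zero.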

Conclusions
In this paper the investigation of the mathematical model for forecasting the results of knowledge testing proposed by the authors in (Krylovas, Kosareva 2008a, b; 2009a) is continued. This model does not require a priori information about the probability distribution of the ability level in the population of examinees and allows item characteristic functions to be selected from a variety of forms. This makes it possible to apply the model to the different probability distribution functions that occur in practice.
Monte Carlo experiments were performed to show the technique of evaluating the best fitting parameters of the model. The unknown parameters of the item characteristic function were selected by the method proposed by Baker (2001). Then the best fitting function was chosen from four function classes. Values of the $X^2$ goodness-of-fit statistic for detecting misfit of each model were calculated. The experiment demonstrated that the parameters of the model can be reconstructed stably using standard statistical procedures when the number of numerical experiments is sufficiently large. These results suggest that in cases when the number of real experiments is not very large, the proposed model would still enable one to construct efficient tests for measuring attainment. This is the object of the authors' further investigations.
In the second numerical experiment the responses to a 10-item test with ICCs from four function classes and an increasing number of examinees were generated for a population with normally distributed ability. The probability distribution of the total test result and the value of the test information function were calculated. It was shown that the results of the test can be improved efficiently by selecting relevant parameters of the ICCs. Obviously, the set of ICC functions from which we choose the best fitting one could be expanded with other classes of parametric functions. It must be mentioned that the precision of the results depends not only on how well the item characteristic curves are approximated but also on the precision with which the probability density function of $p$ is approximated.
The proposed mathematical model could be used not only for knowledge testing but also for solving diagnostic tasks in various fields of human activity: medicine, sports, geology, technical diagnostics and others. As an example, the evaluation of the microclimate in office rooms was performed by applying this methodology.