The Precision of Estimates from Selected CTT and IRT Methods as a Basis for Detecting Differential Item Functioning in Test Forms with Different Item Orders
Abdelnaser Sanad Abdulmutallib Alakayleh
Specialization: Measurement and Evaluation
Institution: Ministry of Education
Formerly: Faculty of Education, Al-Jouf University, Saudi Arabia
Currently: Ministry of Education, Jordan
Article published in the Jil Journal of Humanities and Social Sciences, Issue 38, page 139.
Abstract:
The present study aimed mainly to examine the precision of estimates from selected CTT and IRT methods as a basis for detecting differential item functioning (DIF) when test forms with different item orders are used. Data were collected from a total of 901 randomly selected students from 16 out of 25 primary schools using three forms of the test with different item orders (EH, HE, and R). Comparing the CTT-based and the IRT-based methods showed a marked increase in the number of items flagged for DIF under the IRT-based methods, and the IRT-based methods were more accurate, precise, and consistent in identifying which group an item advantaged.
The results of the analysis show that the easy-to-hard order (EH) was the most advantageous form, followed by the random order (R), while the hard-to-easy order (HE) showed no advantage. Because test results are used to make decisions about examinees, it is of great importance that a test and the ordering of its items be examined, and that its basic psychometric characteristics (validity, reliability, and freedom from bias) be confirmed, before it is given to examinees.
Key Words: DIF, Ordered Test Form, EH, HE, R, IRT, CTT
Introduction:
Tests of all kinds are important in educational and psychological practice and research because they are the most widely used means of investigating and measuring academic, vocational, and professional performance. They are also used in diagnosis and evaluation and in identifying the personal and psychological needs of the individual and the community. Many of them inform crucial decisions about students' futures, as in academic admission, classification, and employment tests.
Perhaps the most important requirement for those who use these tests is that the tests meet criteria and indicators that ensure the soundness of their results and of the data they provide about examinees. These include the validity, reliability, and consistency of the items, which any research that uses tests requires, along with other psychometric characteristics that have become basic criteria for judging a test, such as indices of discrimination and guessing. These indicators strongly support the decisions taken about examinees and their academic or professional futures. Many types and forms of tests are given to examinees, and they differ according to their objective and purpose, for example equivalent forms and open-ended or closed-ended questions. Their use has become very easy because modern tools allow tests to be prepared, scored, archived, and analyzed accurately, which reassures test providers about the results and the decisions based on them. Another reason for this diversity is to control and prevent cheating and breaches of test rules by examinees.
It has become common for examiners to order test items in ways intended to obtain the desired results, such as the easy-to-hard (EH) arrangement. This arrangement is meant to remove sources of test anxiety, which reduces students' self-confidence, focus, and motivation to continue the test and clearly affects their final test performance.
But does changing the form of a test, or arranging its items from easy to hard or from hard to easy, really affect examinee performance? The studies that have examined this question are divided. Some studies (Çokluk & Gül, 2015; Magis & De Boeck, 2014; Doğan-Gül; Impara & Foster, 2006; Alakayleh, 2017) suggest that the order of the test items, or their rearrangement across test forms, affects examinee performance, and that students respond differently depending on whether the items are arranged from easy to hard or from hard to easy.
In contrast, other studies (Barcikowski & Olsen, 1975; Carlson & Ostrosky, 1992; Gerow, 1980; Klosner & Gellman, 1973; Tippets & Benson, 1989) suggest that the arrangement of test items does not affect examinee performance. These studies have drawn on both CTT and IRT, but most have been based on CTT. Some concluded that the order of the test items affects the final score and its various parameters (Carnegie, 2017).
Perhaps the most important task in testing in particular, and in measurement and evaluation in general, is to compare the results and scores obtained from the different test forms used by institutions and examiners, in order to detect how test items function in different groups of examinees. This leads to the importance of research on differential item functioning (DIF), a statistical property of an item that shows the extent to which the item may be measuring different abilities for members of separate subgroups. The average item scores of subgroups that have the same overall test score are compared to determine whether the item measures in the same way for all subgroups. The presence of DIF requires review and judgment and does not necessarily indicate bias; DIF is an indication of unexpected item behavior on the test. An item does not display DIF merely because people from different groups have different probabilities of giving a specific response. It displays DIF if and only if people from different groups with the same true latent ability have different probabilities of giving a specific response. In other words, for an item without DIF, examinees at the same ability level have an equal probability of answering correctly, regardless of their characteristics or group membership (Millsap & Everson, 1993).
A difference in response probability that arises from differences in individual ability levels is regarded by some as a legitimate effect of the item, and it may be important for expressing differences between ability levels. DIF, in contrast, is observed when the probability of a correct response differs because of a factor unrelated to the construct being measured, such as race or color, rather than ability. Such results favor certain groups and disadvantage others on a given item, because they mean that characteristics other than ability have entered the measurement process.
The methods for examining and presenting DIF are based on CTT and IRT. In CTT, an item shows DIF when its difficulty index differs between groups of examinees. First, the p-value (proportion correct) of the item is calculated for the reference and focal groups. Each p-value is converted to the normal deviate z that corresponds to 1 - p, and the result is transformed to the delta scale, which has a mean of 13 and a standard deviation of 4. The delta scale is treated as more reliable than the plain z-scale, which has a mean of 0 and a standard deviation of 1, and many researchers in this area prefer it to the z-scale (Hambleton & Swaminathan, 1989; Camilli & Shepard, 1994; Osterlind, 1983).
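The delta conversion described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the usual form delta = 13 + 4z with z taken at 1 - p; the function names are hypothetical and the plain difference between group deltas is used only as a rough indicator, not as the study's full procedure.

```python
from statistics import NormalDist


def delta_index(p):
    """Convert a proportion-correct p (0 < p < 1) to the delta scale.

    Assumes delta = 13 + 4 * z, where z is the standard normal deviate
    at the proportion of incorrect answers (1 - p).
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    z = NormalDist().inv_cdf(1.0 - p)
    return 13.0 + 4.0 * z


def delta_difference(p_reference, p_focal):
    """Focal-group delta minus reference-group delta for one item.

    A positive value means the item was harder for the focal group.
    """
    return delta_index(p_focal) - delta_index(p_reference)


if __name__ == "__main__":
    # Hypothetical proportions correct for one item in two groups.
    print(round(delta_index(0.50), 2))
    print(round(delta_difference(0.70, 0.50), 2))
```

Harder items receive larger delta values, so an item with p = 0.50 sits at the scale mean of 13.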
CTT-Based Methods:
Approaches based on CTT focus on item difficulty as a fundamental indicator of item performance. The subpopulations are matched on overall test score or on test score ranges. The number of examinees in each identifiable subgroup who answer each item correctly is then compared. Three variations of this approach are Scheuneman’s chi-square, log-linear analysis, and the Mantel-Haenszel method.
Mantel-Haenszel Method:
The Mantel-Haenszel (MH) method shares features with both the chi-square and the log-linear approaches. Originally developed for use in medical applications, it was introduced by Holland and Thayer (1986) as a technique for investigating differential item functioning. The MH method is based on the odds ratio at each score point of the test. A two-by-two contingency table (group by right or wrong response) is formed for each possible score value. Chi-square statistics are calculated at each of these score points, converted to odds ratios (similar to a proportion) so that they are on the same scale, and weighted by the product of the frequencies of right and wrong responses divided by the total frequency of responses. A significance test then reveals the items for which a member of one group is more likely to answer correctly than a member of the other group.
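The pooling of the two-by-two tables can be sketched as follows. This is a minimal illustration, assuming the standard MH common odds ratio and the conventional rescaling MH D-DIF = -2.35 ln(alpha); the function name and the example counts are hypothetical and are not data from the study.

```python
import math


def mantel_haenszel(tables):
    """Pool 2x2 tables across score levels into the MH common odds ratio.

    Each table is (a, b, c, d) for one total-score level:
      a = reference group correct,  b = reference group incorrect,
      c = focal group correct,      d = focal group incorrect.
    Returns (alpha_mh, mh_d_dif), where mh_d_dif = -2.35 * ln(alpha_mh).
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        if n == 0:
            continue  # an empty score level carries no information
        numerator += a * d / n
        denominator += b * c / n
    if numerator == 0 or denominator == 0:
        raise ValueError("odds ratio is undefined for these tables")
    alpha_mh = numerator / denominator
    return alpha_mh, -2.35 * math.log(alpha_mh)


if __name__ == "__main__":
    # Hypothetical counts for one item at two score levels.
    alpha, d_dif = mantel_haenszel([(30, 10, 20, 20), (40, 5, 30, 10)])
    print(round(alpha, 3), round(d_dif, 3))
```

An odds ratio above 1 (a negative MH D-DIF) means the item favors the reference group among examinees matched on total score.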
Scheuneman’s Chi-Square:
This method, suggested by Scheuneman (1975), begins by dividing the examinees into categories based on the total test score (usually three to five categories are formed). For each item, Scheuneman’s index, C2, is computed as a function of the number of correct answers for members of each group, summed across the test score categories. As a test statistic, C2 asymptotically follows a chi-square distribution with degrees of freedom equal to the number of test score categories. Several variations of this method have been proposed, including those by Camilli (1979) and Marascuilo and Slaughter (1981). The “full chi-square” method (Camilli, 1979) includes the number of incorrect as well as correct answers in the computation. These methods tend to produce very similar results; however, the sample size requirements for the full chi-square method are somewhat higher than those for Scheuneman’s chi-square method.
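The difference between the two statistics can be sketched as below. This is an illustration under assumptions: expected counts are taken from the pooled proportion correct within each score category, the function name is hypothetical, and the counts in the usage example are invented.

```python
def chi_square_dif(categories):
    """Compute a Scheuneman-style and a full chi-square statistic for one item.

    `categories` is a list of score categories; each category is a list of
    (n_correct, n_total) pairs, one pair per group.
    Expected counts assume the pooled proportion correct in that category.
    Returns (correct_only_statistic, full_statistic).
    """
    correct_only = 0.0
    full = 0.0
    for groups in categories:
        total_correct = sum(correct for correct, _ in groups)
        total_n = sum(n for _, n in groups)
        if total_n == 0:
            continue
        p = total_correct / total_n
        for correct, n in groups:
            expected_correct = n * p
            expected_incorrect = n * (1.0 - p)
            if expected_correct > 0:
                term = (correct - expected_correct) ** 2 / expected_correct
                correct_only += term
                full += term
            if expected_incorrect > 0:
                incorrect = n - correct
                full += (incorrect - expected_incorrect) ** 2 / expected_incorrect
    return correct_only, full


if __name__ == "__main__":
    # Hypothetical data: three score categories, two groups each.
    data = [[(5, 20), (3, 20)], [(12, 20), (8, 20)], [(18, 20), (15, 20)]]
    print(chi_square_dif(data))
```

The full statistic is never smaller than the correct-only statistic, because it adds the incorrect-response cells to the same sum.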
Logistic Regression (LR):
A logistic regression model for detecting DIF items between the focal and the reference groups was introduced by Swaminathan and Rogers (1991). The logistic regression model is sensitive to both uniform and non-uniform DIF. Generally, the statistical significance of a coefficient is determined by using either the likelihood ratio test or the Wald statistic (Swaminathan & Rogers, 1991). The logistic regression procedure can be used with multiple examinee groups and with polytomous item scores (Agresti, 1990). Another advantage of logistic regression is that estimates of the regression coefficients can be plotted, and the plot can then be used to detect where along the scale DIF becomes problematic (Miller et al., 1992). The LR procedure might also give a clearer perspective on the possible causes of DIF through the inclusion of a curvilinear term and other relevant examinee characteristics, such as test anxiety. LR procedures use the total score as a proxy for the latent trait, and this feature might cause problems when the items follow a multiparameter IRT model. The MH method and SIBTEST share the same problem.
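The model behind this procedure can be written out as follows. The notation is assumed here for illustration: u is the scored response, θ is the matching variable (the total score), g is a group indicator (0 for the reference group, 1 for the focal group), and the β terms are regression coefficients.

```latex
\[
P(u = 1 \mid \theta, g) = \frac{e^{z}}{1 + e^{z}},
\qquad
z = \beta_0 + \beta_1 \theta + \beta_2 g + \beta_3 (\theta g)
\]
```

Under this form, uniform DIF corresponds to a nonzero group coefficient β2 with β3 = 0, while non-uniform DIF corresponds to a nonzero interaction coefficient β3, which is why the procedure is sensitive to both kinds.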
IRT-Based Methods:
The IRT-based methods considered are Lord’s chi-square and Raju’s area. In Lord’s chi-square method, the item parameters are first estimated separately for each subgroup. The chi-square statistic is then calculated from the IRT item parameters, once they have been placed on a common scale, and from their variance-covariance values (Camilli & Shepard, 1994). DIF values are determined by comparing the observed values with the expected values (Osterlind, 1983).
In Raju’s area method, the item characteristic curves (ICCs) are examined first. Items with equal parameter values in the two subgroups must have identical ICCs, so DIF is assessed by calculating the area between the ICCs estimated for the two subgroups.
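The area idea can be sketched numerically. This is a minimal illustration rather than Raju's closed-form solution: it assumes logistic ICCs with hypothetical parameters (a, b, c) and integrates the gap between the two curves with a midpoint rule over a wide ability range.

```python
import math


def icc(theta, a, b, c=0.0):
    """Logistic item characteristic curve with discrimination a,
    difficulty b, and lower asymptote c."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def area_between_iccs(reference, focal, lo=-30.0, hi=30.0, steps=60000):
    """Approximate the signed and unsigned area between two ICCs.

    `reference` and `focal` are parameter tuples (a, b) or (a, b, c).
    The integration range and step count are illustrative choices.
    """
    width = (hi - lo) / steps
    signed = 0.0
    unsigned = 0.0
    for i in range(steps):
        theta = lo + (i + 0.5) * width
        gap = icc(theta, *reference) - icc(theta, *focal)
        signed += gap * width
        unsigned += abs(gap) * width
    return signed, unsigned


if __name__ == "__main__":
    # Hypothetical item: same discrimination, harder for the focal group.
    print(area_between_iccs((1.0, 0.0), (1.0, 0.5)))
    # Hypothetical item: same difficulty, different discrimination.
    print(area_between_iccs((1.5, 0.0), (0.8, 0.0)))
```

When the curves cross, as in the second example, the signed area can be close to zero even though the unsigned area is not, which is why both are usually inspected.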
Tests on which critical decisions are based must be free of bias and must not favor or disfavor any subgroup, including when test forms differ in item order. Test bias means that there is systematic error, so the validity of the test is noticeably affected. The number of studies examining the effect of the order of test items should therefore be increased in order to establish empirical evidence about DIF.
Most recent studies have begun to compare different methods of measuring the effect of the order of test items, based on the two measurement theories (CTT and IRT), using items with different characteristics and tests that differ in form, subject, and purpose.
Purpose of study:
The purpose of the current study was to examine the precision of the estimates from selected CTT and IRT methods as a basis for detecting differential item functioning in test forms with different item orders.
Study Population and Sample:
The study population consisted of all tenth-grade students in the Directorate of Education in the Aljamea District in the academic year 2016/2017, a total of 32,462 male and female students. The study sample consisted of 901 students randomly selected from 12 of the 25 primary schools. Three familiar forms (EH, HE, and R) of the national test to control the quality of education for the English course (NTQCE) were administered to the examinees. Table 1 shows the distribution of students in each group by school and by test form.
Table 1: Distribution of Students in Study Group According to Schools and Different Test Forms
Study Tool:
The study used the NTQCE for the English language course for the academic year 2016/2017, which is prepared by the Administration of Examinations and Tests at the Ministry of Education in Jordan. The purpose of the test is to collect data on, and reveal, students' academic achievement in the English language curriculum at the national level. The test consists of 40 items with four options each, and it was re-administered to the selected sample more than six weeks later, after the items had been rearranged.
The reliability of the test scores was 0.86 using KR-20 and 0.84 using the Spearman-Brown split-half method. These reliability values are acceptable for the purposes of the current research. The test is a general one with specific, fixed standards for all students, and it takes students' levels into account. It contains items at three difficulty levels (easy, medium, and difficult), whose values are shown in Table 2.
Table 2: Item Difficulty Indices in NTQCE Test
Item | Difficulty Level | p | Item | Difficulty Level | p
Table 2 shows that the item difficulty indices (IDI) ranged from 0.11 to 0.85. The IDI is the proportion of correct responses in the given group and ranges from 0 to 1. The closer the IDI is to zero, the harder the item is for the examinees; a value close to 1 means that the item is easy for them. Items were classified as easy, moderate, or hard according to their difficulty values: less than 0.40 = hard, 0.40 to 0.70 = moderate, and more than 0.70 = easy (Mitra NK, Nagaraja HS, Ponndurai G et al, 2009).
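The classification rule can be expressed directly in code. This is a small sketch using the cut-offs quoted above; the function names and the example response vector are hypothetical.

```python
def difficulty_index(responses):
    """Proportion of correct answers for one item.

    `responses` is a sequence of 0/1 scores, one per examinee.
    """
    if not responses:
        raise ValueError("at least one response is required")
    return sum(responses) / len(responses)


def classify_difficulty(p):
    """Label an item by its difficulty index using the cut-offs in the text:
    below 0.40 hard, 0.40 to 0.70 moderate, above 0.70 easy."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie between 0 and 1")
    if p < 0.40:
        return "hard"
    if p <= 0.70:
        return "moderate"
    return "easy"


if __name__ == "__main__":
    # Hypothetical scores of ten examinees on one item.
    p = difficulty_index([1, 1, 0, 1, 0, 1, 1, 1, 0, 1])
    print(p, classify_difficulty(p))
    # The extremes reported for Table 2.
    print(classify_difficulty(0.11), classify_difficulty(0.85))
```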
The study used three test forms built from the item difficulty indices shown in Table 2. The first form ordered the items from easy to hard (EH), the second ordered them from hard to easy (HE), and the third ordered them randomly (R). The 901 examinees in the study group were randomly assigned to the forms as follows: EH: 301, HE: 305, and R: 395.
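How the three orders follow from the difficulty indices can be sketched as below. This is an illustration only: the function name, the seed, and the five example p-values are hypothetical, and the study does not describe how its random order was generated.

```python
import random


def build_forms(p_values, seed=0):
    """Return item orders for the EH, HE, and R forms.

    `p_values[i]` is the difficulty index of item i + 1; a higher value
    means an easier item. The seed only makes the example reproducible.
    """
    items = list(range(1, len(p_values) + 1))
    # Easy to hard: highest proportion correct first.
    easy_to_hard = sorted(items, key=lambda item: -p_values[item - 1])
    # Hard to easy: the same order reversed.
    hard_to_easy = list(reversed(easy_to_hard))
    # Random: a shuffled copy of the original item list.
    random_order = items[:]
    random.Random(seed).shuffle(random_order)
    return {"EH": easy_to_hard, "HE": hard_to_easy, "R": random_order}


if __name__ == "__main__":
    forms = build_forms([0.20, 0.90, 0.50, 0.75, 0.35])
    for name, order in forms.items():
        print(name, order)
```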
Procedures:
Before presenting the results of the analysis, the study examined the assumptions of IRT that underlie the IRT-based methods used to detect DIF.
The data were first examined to decide which model to use in estimating the parameters of the 40 test items. The -2 log-likelihood value was 7243.2514 under the 3-PLM, rose to 7311.6231 under the 2-PLM, and rose to 7399.1172 under the 1-PLM. The -2 log-likelihood was therefore highest for the 1-PLM and lowest for the 3-PLM, and the decline in values from the 1-PLM to the 3-PLM was significant. Taking these values into consideration, the 2-PLM was applied when estimating the item parameters.
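A comparison of nested models by their -2 log-likelihood values is usually done with a likelihood ratio test, which can be sketched as follows. Several things here are assumptions rather than details from the study: the degrees of freedom (one extra parameter per item, so 40 per step), the 0.05 significance level, and the use of the Wilson-Hilferty approximation for the chi-square critical value. The function names are hypothetical.

```python
from statistics import NormalDist


def chi2_critical(df, alpha=0.05):
    """Approximate upper-tail chi-square critical value
    (Wilson-Hilferty approximation; adequate for moderately large df)."""
    z = NormalDist().inv_cdf(1.0 - alpha)
    k = 2.0 / (9.0 * df)
    return df * (1.0 - k + z * k ** 0.5) ** 3


def deviance_test(deviance_simple, deviance_complex, df, alpha=0.05):
    """Compare two nested models by the drop in -2 log-likelihood.

    Returns (drop, critical_value, drop_exceeds_critical_value).
    """
    drop = deviance_simple - deviance_complex
    critical = chi2_critical(df, alpha)
    return drop, critical, drop > critical


if __name__ == "__main__":
    # -2 log-likelihood values as reported in the text.
    deviance = {"1PL": 7399.1172, "2PL": 7311.6231, "3PL": 7243.2514}
    assumed_df = 40  # assumption: one added parameter per item
    for simple, richer in (("1PL", "2PL"), ("2PL", "3PL")):
        print(simple, "vs", richer,
              deviance_test(deviance[simple], deviance[richer], assumed_df))
```

A drop that exceeds the critical value indicates that the richer model fits significantly better; the choice of model may also weigh parsimony and estimation stability.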
In addition, the study examined one of the assumptions of IRT when applying the NTQCE for the English language course, namely whether exploratory analysis shows that the test is unidimensional.
The scree plot shows the dominance of a single factor among the 40 factors that represent the test items. This indicates that the test is unidimensional and meets the assumptions of IRT. Most studies accept both assumptions together, because achieving unidimensionality implies that local independence is also achieved in the same context (Zenisky & Hambleton, 2006).
Figure 1: scree plot of component number of the NTQCE Test
Accordingly, the following methods were used in the study to investigate and detect DIF: traditional item difficulty, Mantel-Haenszel (MH), and logistic regression (LR), based on CTT, and Lord’s chi-square and Raju’s area, based on IRT. Detecting DIF is not enough on its own; its level must also be determined. Zwick (2012) describes the classification adopted by most researchers, as follows.
A: Acceptable DIF
B: Moderate DIF
C: High DIF
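The text gives only the three labels, so the sketch below fills in decision rules as they are commonly described for the ETS scheme based on the MH D-DIF statistic. The thresholds (1.0 and 1.5) and the z cut-offs used for the two significance checks are assumptions stated for illustration, and the function name is hypothetical.

```python
def ets_category(mh_d_dif, standard_error, z_zero=1.96, z_one=1.645):
    """Assign an A, B, or C label to one item.

    Assumed rules:
      A: |MH D-DIF| below 1.0, or not significantly different from 0.
      C: |MH D-DIF| at least 1.5 and significantly greater than 1.0.
      B: everything else.
    The z cut-offs are illustrative assumptions.
    """
    if standard_error <= 0:
        raise ValueError("standard_error must be positive")
    size = abs(mh_d_dif)
    differs_from_zero = size / standard_error > z_zero
    exceeds_one = (size - 1.0) / standard_error > z_one
    if size < 1.0 or not differs_from_zero:
        return "A"
    if size >= 1.5 and exceeds_one:
        return "C"
    return "B"


if __name__ == "__main__":
    # Hypothetical (MH D-DIF, standard error) pairs.
    for value, se in [(0.5, 0.2), (1.2, 0.3), (2.0, 0.3), (-1.6, 0.5)]:
        print(value, se, ets_category(value, se))
```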
Findings:
The current study examined the precision of estimates from selected IRT and CTT methods in detecting differential item functioning across test forms with different item orders. The methods were traditional item difficulty, Mantel-Haenszel (MH), and logistic regression (LR), based on CTT, and Lord’s chi-square and Raju’s area, based on IRT.
Table 3: DIF Results of Items in the NTQCE Test Forms Easy-to-Hard and Hard-to-Easy Versions, based on CTT and IRT Methods
Table 3 shows that DIF was detected by all of the methods used in the study. Five items (17, 18, 25, 29, and 33) showed significant DIF at level B under the CTT-based methods. Under the IRT-based methods, that number increases to nine (items 1, 2, 3, 4, 5, 8, 10, 11, and 14). Items 1, 2, 3, 4, 5, 6, 7, 8, 9, 15, 17, 18, 19, 25, 29, 33, 36, and 40 are also reported as showing DIF. The group that took the EH form was advantaged relative to the group that took the HE form, most visibly in items 1-9, which indicates that the HE form offers no advantage.
Table 4: DIF Results of Items in the NTQCE Test Forms of the Easy-to-Hard and Random Versions, based on CTT and IRT Methods
In Table 4, items 5, 9, 19, 25, and 39 showed DIF at levels B and C in at least two CTT-based methods. Items 5, 9, 11, 12, 16, 19, 25, 31, and 39 showed significant DIF under the IRT-based methods, so the number of items flagged for DIF increased when IRT methods were used. Items 19, 25, and 39 showed DIF under both CTT and IRT. The group that took the EH form was more advantaged than the group that took the random-order form (R), most evidently in items 1-9.
Table 5: DIF Results of Items in the NTQCE Test Forms of the Hard-to-Easy and Random Versions, based on CTT and IRT Methods
Table 5 shows that two items (7 and 39) displayed high DIF (levels B and C) in at least two of the CTT-based methods, and that the number of items increases to eight (items 7, 9, 18, 28, 31, 33, 36, and 40) under the IRT-based methods. Items 7 and 39 are reported as displaying DIF in the analyses based on both theories. The findings also show that examinees who took the random form (R) were more advantaged than those who took the HE form.
Discussion of findings and recommendations:
The current study examined the precision of estimates from selected IRT and CTT methods in detecting differential item functioning across test forms with different item orders. The methods were traditional item difficulty, Mantel-Haenszel (MH), and logistic regression (LR), based on CTT, and Lord’s chi-square and Raju’s area, based on IRT. The results showed that the DIF estimates obtained with the CTT and IRT methods were consistent and precise, and that the IRT-based methods flagged more items than the more common CTT-based methods.
The results showed that arranging the test items differently led examinees at the same ability level to perform differently on the same items. The EH, HE, and R orders differed in their final indicators in the test results. The EH and R forms were the most advantageous in the test used here, especially in the first items, which showed significant DIF, whereas the HE form was the least advantageous of the three forms in the study. The random form (R) also had an advantage over the HE form, and the HE form was not advantageous in any comparison between examinees who took different forms.
The rearrangement of the NTQCE items according to the three difficulty levels changed how examinees understood certain items and therefore changed the probability of a correct response. As noted earlier, examinees at the same ability level performed differently when they took different test forms (Neely et al., 1994; Perlini et al., 1998; Laffitte, 1984).
Based on the above, the study finds that different arrangements of test items affect the probability of a correct response to a specific item for examinees at the same ability level, just as the item parameters do, and that examinees are most advantaged when they take the EH form, followed by the R form. For correct decisions to be made about the individuals concerned, it is important that the test given to examinees be arranged according to its psychometric characteristics and established rules. Psychological and educational tests must not be affected by any characteristics or parameters other than the abilities of individuals, and they must remain free of bias for or against any group. This finding shows the need to consider the basic principles of measurement carefully in any type of testing practice.
The current study recommends that researchers repeat the study with small samples to determine the extent of the differences among DIF detection methods across sample sizes. It also recommends that the study be repeated with comparisons across age levels and between male and female examinees, to add to what is known about how this diversity affects DIF results.
—————————————-
References:
- Agresti, A., Mehta, C. R., & Patel, N. R. (1990). Exact inference for contingency tables with ordered categories. Journal of the American Statistical Association, 85, 453-458.
- Barcikowski, R. S., & Olsen, H. (1975). Test item arrangement and adaptation level. The Journal of Psychology, 90(1), 87-93.
- Camilli, G., & Shepard, L. A. (1994). Methods for identifying biased test items. Thousand Oaks, CA: Sage Publications.
- De Boeck, P. (2008). Random item IRT models. Psychometrika, 73, 533–559.
- Doğan-Gül, Ç. (2014). Madde güçlüklerine göre farklı sıralanan testlerde düşük ve yüksek kaygılı öğrencilerin akademik başarıları puanlarının karşılaştırılması [A comparison of academic achievement scores of students with high and low anxiety levels in different sequence tests according to item difficulty] (Master’s thesis, Ankara University, Ankara, Turkey). Retrieved from http://tez2.yok.gov.tr/
- Gerow, J. R. (1980). Performance on achievement test as a function of the order of item difficulty. Teaching of Psychology, 7(2), 93-94.
- Hambleton, RK, “Good practices for identifying differential item functioning – Commentary” (2006). MEDICAL CARE. 117
- Hambleton, R. K., & Swaminathan, H. (1989). Item response theory: Principles and applications. Boston, MA: Kluwer-Nijhoff Publishing.
- Impara, J., & Foster, D. (2006). Strategies to minimize test fraud. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of Test Development (pp. 91-114). Mahwah, NJ: Lawrence Erlbaum Associates.
- Carlson, L., & Ostrosky, A. L. (1992). Item sequence and student performance on multiple-choice exams: Further evidence. The Journal of Economic Education, 23(3).
- Carnegie, J. (2017). Does correct answer distribution influence student choices when writing multiple choice examinations? Canadian Journal for the Scholarship of Teaching and Learning, 8, Article 11.
- Klosner, N. C., & Gellman, E. K. (1973). The effect of item arrangement on classroom test performance: Implications for content validity. Educational and Psychological Measurement, 33, 413-418.
- Laffitte, R. G. (1984). Effects of item order on achievement test scores and students’ perceptions of test difficulty. Teaching of Psychology, 11(4), 212-214.
- Magis, D., Beland, S., & Raiche, G. (2015). A collection of methods to detect dichotomous Differential Item Functioning (DIF). Package ‘difR’. Retrieved from https://cran.r-project.org/web/packages/difR/difR.pdf
- Marascuilo, L. A., & Slaughter, R. E. (1981). Statistical procedures for analyzing item bias based on chi square statistics. Journal of Educational Measurement, 18, 105-118.
- Miller, T., Spray, J., & Wilson, A. (1992, July). A comparison of three methods for identifying nonuniform DIF in polytomously scored test items. Paper presented at the annual meeting of the Psychometric Society, Ohio.
- Millsap, R. E., & Everson, H. T. (1993). Methodology review: Statistical approaches for assessing measurement bias. Applied Psychological Measurement, 17, 297–334.
- Mitra NK, Nagaraja HS, Ponndurai G et al. The levels of difficulty and discrimination indices in type A multiple choice question of preclinical semester 1 multidisciplinary summative tests. IeJSME 2009; 3 (1): 2-7.
- Neely, D. L., Springston, F. J., & McCann, S. J. H. (1994). Does item order affect performance on multiple-choice exams? Teaching of Psychology, 21 (1), 44-45.
- Çokluk, Ö., Gül, E., & Doğan-Gül, Ç. (2016). Examining differential item functions of different item ordered test forms according to item difficulty levels. Educational Sciences: Theory & Practice, 16(1), 319-330. DOI 10.12738/estp.2016.1.0329
- Osterlind, S. J. (1983). Test item bias (Quantitative Applications in the Social Sciences). Beverly Hills, CA: Sage Publications.
- Perlini, A. H., Lind, D. L., & Zumbo, B. D. (1998). Context effects on examinations: The effects of time, item order and item difficulty. Canadian Psychology, 39(4), 299-307.
- Pettijohn, T. F., & Sacco, M. F. (2007). Multiple-choice exam, question order influences on student performance, completion time, and perceptions. Journal of Instructional Psychology, 34(3), 142–149.
- Scheuneman, J. S. (1975, April). A new method of assessing bias in test items. Paper presented at the meeting of the American Educational Research Association, Washington, DC.
- Tippets, E., & Benson, J. (1989). The effect of item arrangement on test anxiety. Applied Measurement in Education, 2(4), 289-296.
- Zenisky, A. L., & Hambleton, R. K. (2007). Differential item functioning analyses with STDIF: User’s guide (Unpublished report). Amherst, MA: University of Massachusetts, Center for Educational Assessment.
- Zwick, R. J. (2012). A review of ETS differential item functioning assessment procedures: Flagging principles, minimum sample size requirements, and criterion refinement (Research Report). Educational Testing Service.