Learning Objectives
By the end of this workshop, the learner should be able to:
1. Characterize a sample set by its level of measurement.
2. Identify the dependent and independent variable in a research question.
3. Define the term reliability.
4. Provide two examples of how you could check reliability in a research project.
5. Compare and contrast internal and external validity.
6. Explain how sample size can affect both reliability and validity.
Introduction
The purpose of this workshop is to provide an overview of the research process, particularly in the scholarship of teaching and learning. As background, levels of measurement, reliability, and validity will be discussed. An example educational technology will be presented, and the importance of this technology in medical admissions will be explained. To demonstrate the process, two studies, one qualitative and one quantitative, will be presented. These studies have not actually been conducted and no IRB approval has been sought.
There are several different categories of research that can be conducted. Historical research examines events that occurred in the past, while descriptive research examines events that are presently occurring. Correlational research examines the relationship between variables, but not the effect of one variable on another. If participants already belong to preassigned groups, the design is called quasi-experimental research. In true experimental research, the experimental and control groups are not preassigned; participants are randomly assigned to them (Salkind, 2017).
Description of Educational Technology
The educational technology discussed throughout this workshop is WebAdmit, the software developed and utilized by the Association of American Medical Colleges to facilitate the review and processing of medical school applications. The software will roll out in the 2019 admissions season. Questions about its usability and about the effect of its features on the admissions process are pertinent. WebAdmit provides a module-based application layout as well as the ability to develop scoring models, which may allow a more holistic view of a medical school applicant's application. A pertinent question would be: to what extent could a scoring algorithm, developed and weighted in the same way that an application is manually reviewed, predict the pre-admissions score given to the applicant?
Why technology is important in Medical Admissions
Medical school admissions staff need to process and evaluate approximately 5000 applicants to medical school in an equitable and holistic manner. Software capable of handling this volume of applications, multiplied by the 141 medical schools across the nation, is vital. The software has been beta tested, but it is well established that as the volume of users increases, more software glitches tend to surface and be identified. Research on this particular software is timely because of its recent rollout and pertinent because of the features it offers.
Two particular features of interest are the ability to blind reviewers to particular parts of the medical school application and the use of an algorithm to replace the content of that part of the application with a score. This may allow a reviewer to focus more on the holistic parts of the application and less on the academic metrics, which tend to be the focus. No research thus far has shown a correlation between academic metrics and success as a medical student. Therefore the Association of American Medical Colleges encourages all admissions offices to move toward a more holistic review that focuses on characteristics of the applicant that meet the stated mission of the medical school.
A second area of interest would be whether the ability to tag activities and experiences in the medical school application with competencies provides a more holistic review of the applicant. The Association of American Medical Colleges has identified key competencies for future physicians. WebAdmit provides the opportunity for applicants to tag their activities and experiences to demonstrate how they best meet these competencies.
Finally, all medical school admissions offices are flooded with applications and want to review each of them holistically and fairly. Many reviewers participate in the review process. A technology that could monitor inter-rater reliability would be valued, as would the ability to see whether the scoring of candidates varies with admissions committee members' experience level.
Measurement in Research
Levels of Measurement
Nominal- a nominal level of measurement is appropriate when the variables are categorical in nature and can be labeled. The differences between categories are qualitative rather than quantitative, and the categories must be mutually exclusive.
Ordinal- this level of measurement allows you to order values. You can clearly tell which value is above or below, greater than or less than, another. There is no "true zero", however, and there is no standard difference between one level and the next.
Interval- this level of measurement has the characteristics of both nominal and ordinal and adds regular, equal intervals between values. There is still no "true zero".
Ratio- this level of measurement has all the characteristics of the other three, but also has a true or absolute zero.
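One way to keep the four levels straight is to ask which summary statistics are meaningful at each one. The sketch below is illustrative only: the mapping is a common textbook simplification rather than something taken from Salkind (2017), and the example admissions variables and the helper name `is_meaningful` are hypothetical.

```python
# Illustrative mapping (an assumption, not from the source): summary
# statistics that are usually considered meaningful at each level.
PERMISSIBLE = {
    "nominal": {"mode"},
    "ordinal": {"mode", "median"},
    "interval": {"mode", "median", "mean"},
    "ratio": {"mode", "median", "mean", "ratio comparison"},
}

# Hypothetical admissions variables, each labeled with a level.
EXAMPLES = {
    "state of residence": "nominal",
    "screening score on a 1-5 rubric": "ordinal",
    "calendar year of application": "interval",
    "minutes spent reviewing an application": "ratio",
}


def is_meaningful(level, statistic):
    """Return True if the statistic is usually meaningful at this level."""
    return statistic in PERMISSIBLE[level]


for variable, level in EXAMPLES.items():
    allowed = ", ".join(sorted(PERMISSIBLE[level]))
    print(f"{variable}: {level} ({allowed})")
```

Each level inherits everything the previous level allows, which mirrors the way the definitions above build on one another.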
Types of Variables
Dependent variable- the variable that is measured to see whether changing or manipulating the independent variable has any effect. It is the outcome of your research study.
Independent variable- the variable that is changed or manipulated by the researcher. It is the treatment or condition that the researcher changes to see what the outcome or result may be.
Control variable- a variable that could potentially affect the dependent variable; the researcher must take care to minimize this effect.
Confounding variable- a factor or variable that makes the results unclear. Confounding variables are not the variables you are interested in testing (Salkind, 2017).
Reliability
Reliability is most concisely described as the consistency of a measurement: each time you measure the same variable in the same way, you should get the same results. Salkind (2017) describes the measurement of a variable as an observed score made up of a true score and an error score. He goes on to say that the error score has two components: method error and trait error. Method error results from the testing situation or environment, whereas trait error results from the person or subject taking the test. There are ways that the investigator can increase reliability. For example, they can increase the sample size, make sure the directions are succinct, standardized, and clear, and give the test under standardized conditions (Salkind, 2017). Reliability between tests can be measured with a correlation coefficient, which ranges from +1 (perfectly reliable) through 0 (no reliability) to -1 (a perfect inverse relationship).
Test-Retest Reliability- the same test is given to the same group of people at two different times to see whether scoring is consistent between the two administrations. An important consideration is how long to wait between tests, which depends on what you want to achieve from the test and retest (Salkind, 2017). If a rubric is used to assess candidates in the admissions committee, it may be worthwhile to rescore the same candidate later in the cycle to make sure that scores are not being inflated or deflated as the admissions season progresses.
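A minimal sketch of how test-retest consistency might be checked, assuming the same candidates are scored on a 1-5 rubric early and late in the cycle. The scores are made-up illustrative data, and `pearson_r` is a hypothetical helper that computes the Pearson correlation coefficient by hand.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when scores do not vary")
    return sxy / sqrt(sxx * syy)


# Made-up rubric scores for the same six candidates at two points in time.
early_cycle = [5, 4, 3, 2, 4, 1]
late_cycle = [5, 4, 3, 3, 4, 2]

print(f"test-retest r = {pearson_r(early_cycle, late_cycle):.2f}")
```

A coefficient close to +1 would suggest that scoring stayed consistent across the season. Rubric scores are ordinal, so a rank-based coefficient could be argued for instead; Pearson is used here only because it is the simplest to show.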
Parallel-Forms Reliability- in this situation, different forms of the same test are given to the same group of participants to measure consistency between the two forms. If participants obtain similar scores on both forms, the forms can be considered equivalent. If you are trying to measure a variable repeatedly in a short period of time, different forms of the test will be needed to prevent memorization (Salkind, 2017). When developing interview questions for applicants, different forms of the questions must be developed. They have to measure characteristics equivalently, but cannot be reused too often or candidates will find them on the internet and gain an unfair advantage.
Inter-Rater Reliability- refers to consistency in measurement between one rater and the next. If you develop a rating scale or a rubric, you want to make sure that everyone who uses it does so in the same way, and that multiple people evaluating a subject in the same circumstances would give the same score (Salkind, 2017). In admissions, inter-rater reliability is very important. It is key to develop a way to rate or screen applicants such that all raters would give approximately the same score. This is difficult, however, because each rater values different aspects of a candidate's application, and diversity is valued more than consistency.
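As a rough sketch, agreement between two raters who score the same applicants can be summarized as the proportion of identical scores. Cohen's kappa, which corrects that proportion for agreement expected by chance, is shown as well; kappa is not mentioned in the source and is included here only as a commonly used alternative. The ratings are made-up and the function names are hypothetical.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Proportion of applicants given the same score by both raters."""
    if not rater_a or len(rater_a) != len(rater_b):
        raise ValueError("need two non-empty lists of the same length")
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters."""
    observed = percent_agreement(rater_a, rater_b)
    n = len(rater_a)
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    # Agreement expected if both raters scored independently at their own rates.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Made-up rubric scores from two screeners for the same eight applicants.
screener_1 = [5, 4, 3, 3, 2, 5, 1, 4]
screener_2 = [5, 4, 3, 2, 2, 5, 1, 3]

print(f"agreement = {percent_agreement(screener_1, screener_2):.2f}")
print(f"kappa     = {cohens_kappa(screener_1, screener_2):.2f}")
```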
Internal Consistency- refers to how unified in focus all of the questions are in a particular subsection of an assessment. Internal consistency is measured by correlating each subset score with the total score, using either Cronbach's alpha or the Kuder-Richardson correlation coefficient (Salkind, 2017). This measurement is very important in the screening of candidates for admissions, because the approach to the application is supposed to be holistic and scoring an applicant strictly on academics is not consistent with our mission statement.
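The sketch below computes Cronbach's alpha using the standard formula alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores), where k is the number of items. The rubric items and scores are made-up illustrative data, and `cronbach_alpha` is a hypothetical helper name.

```python
from statistics import pvariance


def cronbach_alpha(items):
    """Cronbach's alpha for a list of items.

    `items` holds one list of scores per item, with respondents in the
    same order in every list.
    """
    k = len(items)
    if k < 2:
        raise ValueError("need at least two items")
    n = len(items[0])
    if n < 2 or any(len(item) != n for item in items):
        raise ValueError("every item needs the same number of scores (at least 2)")
    # Total score for each respondent across all items.
    totals = [sum(item[i] for item in items) for i in range(n)]
    total_variance = pvariance(totals)
    if total_variance == 0:
        raise ValueError("alpha is undefined when total scores do not vary")
    item_variance_sum = sum(pvariance(item) for item in items)
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


# Made-up scores on three rubric items for the same five applicants.
rubric_items = [
    [4, 3, 5, 2, 4],  # item 1
    [4, 2, 5, 3, 4],  # item 2
    [5, 3, 4, 2, 3],  # item 3
]

print(f"Cronbach's alpha = {cronbach_alpha(rubric_items):.2f}")
```

Values closer to 1 suggest that the items in a subsection are measuring the same thing.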
Validity
Validity is the measure of whether you are truly measuring what you think you are measuring.
Content Validity- measures how well the test sample represents the universe or population of all items from which it is drawn (Salkind, 2017). When developing a rubric to screen applicants for admission to medical school, it would be important to choose questions that fairly assess all aspects of the application and minimize any bias that could be perceived in the questions. If such a rubric were developed, it could be sent to admissions officers in other medical schools for feedback and evaluation.
Criterion Validity- measures how well the test measures either concurrent (present) performance or how well it predicts future performance (Salkind, 2017). The test questions are designed to measure a certain criterion, which is defined by the researcher. The purpose of a criterion is to measure a level of achievement for that criterion, not to compare one test taker to another. If a medical school rubric is developed, it could be used to screen applicants who are felt to be outstanding candidates and then to screen applicants who are felt to be weak. This would allow the researcher to make sure the rubric measures the criterion accurately before general use. Of course, confounding variables would include how excellent and weak are defined and whether the test accurately measures a consensus definition of these terms.
Construct Validity- according to Salkind (2017), measures whether test results are truly related to a set of underlying related variables, or whether the test results can be tied to a theory or a model of behavior. If the results can be attributed to other factors, then the experiment lacks internal validity. In the measurement of medical school candidates using a rubric, the question is whether the rubric actually measures whether a medical student will be successful in medical school and beyond. This research has been done and so far no variables have been statistically significant in predicting medical student success. However, there is a newer version of the MCAT with subsets which has not been fully explored. Many argue that there are too many data points over too long a time frame to see any correlation.
External Validity- refers to whether the results can be generalized from the sample that was tested to the general population. Threats to external validity include a change in behavior by subjects who know they are being observed (the Hawthorne effect), a change due to the researchers themselves, a change due to external people "helping" the experimenters, and a sensitivity that results from being given a pretest (Salkind, 2017). An important factor to consider would be: do the reviewers for admission change their scoring techniques when they are being observed, to try to decrease bias and/or be more socially acceptable?
Internal Validity- refers to whether the results obtained in the experiment are a direct result of the independent variable. According to Salkind (2017), there are a number of threats to internal validity. Experiments take place over a span of time and outside factors can affect the results, such as the loss of subjects due to attrition. The subjects of the experiment may change or mature over time independently of whatever is being tested in the experiment. The selection of the participants for an experiment may not be as random as the researcher thought or intended. The act of giving a pretest may threaten internal validity, and over time the rubric or method of grading may change.
Sample Qualitative Study
WebAdmit software:
The purpose of this study is to compare the time spent on individual modules in the two admissions software packages: WebAdmit and ARM. WebAdmit provides the opportunity for applicants to list their prerequisite courses and to tag activities with the 15 competencies endorsed by the Association of American Medical Colleges. These two new features in WebAdmit were designed to encourage a more holistic review of an applicant's application. ARM is the current software used by the admissions committee and does not include these two features.
Research question
To what extent does the ability of WebAdmit software to provide prerequisite course and competency tagging in an admissions application affect the order of modules visited and time spent on each module by a reviewer?
Participants
Past and present members of the admissions committee will be selected by convenience and stratified by experience and medical education. Ten participants will be randomly selected from each of three groups: less than 3 years of experience, 3-5 years of experience, and 5 or more years of experience.
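The selection described above could be sketched as follows. This is an illustration under stated assumptions, not the study's actual procedure: the member list is hypothetical, the function names are invented, and because the source leaves the boundary at exactly 5 years ambiguous, the sketch assumes that 5 years falls in the 3-5 band.

```python
import random


def experience_band(years):
    """Assign a committee member to one of the three experience groups."""
    if years < 3:
        return "less than 3 years"
    if years <= 5:  # assumption: exactly 5 years counts as 3-5
        return "3-5 years"
    return "more than 5 years"


def stratified_sample(members, per_stratum, seed=None):
    """Randomly pick `per_stratum` members from each experience group.

    `members` is a list of (name, years_of_experience) pairs.
    """
    rng = random.Random(seed)
    strata = {}
    for name, years in members:
        strata.setdefault(experience_band(years), []).append(name)
    # random.Random.sample raises ValueError if a group is too small.
    return {band: rng.sample(names, per_stratum) for band, names in strata.items()}


# Hypothetical convenience pool of 36 committee members.
pool = [(f"member_{i:02d}", years)
        for i, years in enumerate([1] * 12 + [4] * 12 + [8] * 12)]

for band, names in stratified_sample(pool, per_stratum=10, seed=2019).items():
    print(band, len(names))
```

Passing a fixed `seed` makes the draw reproducible, which helps when the selection needs to be documented.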
Measure/Method
The order in which the modules are visited, the amount of time spent on each screen, the PAC score, and any comments made will all be recorded. The first step in determining the value of listing prerequisites and competencies, and whether they contribute to holistic review, is to see whether reviewers actually look at them, how much time they spend on them, and whether they make any comments about the features. This data will be collected by observing a screener reviewing and scoring an applicant.
Reliability/Validity
1. A randomized, stratified sample of convenience will be selected from the admissions committee members.
2. Two candidate applications will be used, matched for academics and experiences and both prescored as a level 5.
3. Ability to comfortably navigate both software packages will be verified before the test.
4. One half of the screeners will start with applicant A in WebAdmit and the other half will start with applicant A in ARM; applicant B will be used in the opposite software.
5. Screeners will be told that the time taken in screening is of interest.
6. All screeners will be evaluated by the same researcher so that inter-rater variability is not a source of error.
7. Mean times for each module will be correlated between the two software packages, as will mean times between modules in the same software.
8. Triangulation of results will be provided by looking at mean times spent in each software package as well as time spent on the same content in the opposite software.
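The counterbalancing in step 4 could be sketched as below. This is an assumed implementation for illustration: the screener identifiers are hypothetical, `counterbalance` is an invented helper, and shuffling before splitting is one reasonable way to decide which half starts in which software.

```python
import random


def counterbalance(screeners, seed=None):
    """Assign each screener an ordered plan of (applicant, software) pairs.

    Half of the screeners review applicant A in WebAdmit and then
    applicant B in ARM; the other half review A in ARM and then B in
    WebAdmit.
    """
    rng = random.Random(seed)
    shuffled = list(screeners)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    plan = {}
    for index, screener in enumerate(shuffled):
        if index < half:
            plan[screener] = [("A", "WebAdmit"), ("B", "ARM")]
        else:
            plan[screener] = [("A", "ARM"), ("B", "WebAdmit")]
    return plan


# Hypothetical identifiers for the 30 selected screeners.
screeners = [f"screener_{i:02d}" for i in range(1, 31)]
plan = counterbalance(screeners, seed=7)

for screener in sorted(plan)[:3]:
    print(screener, plan[screener])
```

Because every screener sees both applicants and both packages, differences in time spent are less likely to be explained by the applicant alone.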
Data Analysis
1. Numerical order of modules
2. Mean time spent on each module in both ARM and WebAdmit, with regression analysis to look for correlation.
3. Time spent on prerequisite screen and competency screen will be compared to the academic metrics screen in ARM and regression analysis will be performed looking for correlation.
4. The final recorded PAC scores will be evaluated as well as verbal comments.
Sample Quantitative Study
WebAdmit software:
The purpose of this study is to evaluate whether the experience level of admissions committee members affects the scores given to the applicants they screen using a rubric in the WebAdmit software. WebAdmit has been utilized by other professions for admissions screening, but has only recently been adopted by the AAMC for screening of medical school applicants. It has a modular design and is completely online. Two features, the ability to tag activities with competencies and the ability to match coursework with the prerequisite courses required by the medical school, are expected to allow a more holistic review of applicants.
Research question
To what extent would level of experience affect the scoring of prescreened applicants for admissions to the medical school using a standardized rubric?
Participants
Past and present members of the admissions committee will be selected by convenience and preassigned by experience and medical education level. Thirty participants will be randomly selected from each of three groups: less than 3 years of experience, 3-5 years of experience, and 5 or more years of experience.
Measure/Method
1. Members of the admissions committee have been routinely using the WebAdmit software for a full cycle and should be comfortable with its use.
2. Twenty applicants for medical school were independently screened by three senior members of the admissions committee. Six applicants were chosen: 2 applicants who were scored as a "1", the lowest score possible, 2 applicants who were scored as a "3", and 2 applicants who were scored as a "5", or excellent candidates.
3. Each member of the admissions committee participating in this study will screen these six applicants in random order using the rubric and submit their score in the traditional way.
4. After the screenings are complete, a five-question Likert survey will be given to assess confounding variables such as comfort with WebAdmit, feelings about the rubric, and feelings on the value of diversity versus equality.
Reliability/Validity
1. A preselected, stratified sample of convenience will be drawn from the admissions committee members.
2. Six candidates will be used who have all been independently scored by three senior members to help verify inter-rater reliability.
3. Ability to comfortably navigate the software will be assumed because only members who have used it for a year will participate.
4. The six candidates will be presented in random order to all screeners.
5. Screeners will take a survey after the experiment to decrease the testing effect on validity.
6. 30 members will be in each of the three groups.
7. WebAdmit software will be used by all to standardize conditions.
8. The content of the rubric was developed in a workshop for all committee members, voted on for usage, and evaluated by the faculty curriculum committee for utility.
9. The use of the 15 competencies to develop the rubric and the use of WebAdmit software should make the results of this experiment generalizable to other schools.
Data Analysis
1. An ANOVA test will be conducted to see whether the mean scores given to the six applicants differ among the three experience levels of the admissions committee screeners.
2. A survey will be conducted to look at the confounding variables: comfort with WebAdmit, feelings about use of a rubric, and feelings about the value of diversity versus equality when screening applicants.
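A one-way ANOVA compares the variation between group means with the variation within groups. The sketch below computes only the F statistic and its degrees of freedom by hand; the scores are made-up, `one_way_anova_f` is a hypothetical helper, and obtaining a p-value would additionally require the F distribution from a statistics package.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of score lists."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two groups and more scores than groups")
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum(sum((x - mean(g)) ** 2 for x in g) for g in groups)
    if ss_within == 0:
        raise ValueError("F is undefined when scores do not vary within groups")
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up scores given to one applicant by screeners in each experience group.
scores_by_experience = [
    [3, 4, 3, 5, 4],  # less than 3 years
    [3, 3, 4, 4, 3],  # 3-5 years
    [2, 3, 3, 4, 3],  # more than 5 years
]

f_stat, df_b, df_w = one_way_anova_f(scores_by_experience)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A large F relative to its critical value would suggest that mean scores differ by experience level. Rubric scores are ordinal, so treating them as interval data for ANOVA is itself an assumption.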
Assessment
References
Salkind, N. J. (2017). Exploring research (9th edition). Upper Saddle River, NJ: Pearson Education Inc.