Matthew Fieldman, Psychological Testing Project, Due: December 10, 1997
To start, let me say that I have a passion for football. Having played since third grade, I’ve come to appreciate its intricacies and complexity. That’s why I have always hated the way people associate this amazing sport with violence and aggression; they call football "barbaric" and "boorish." This test originated out of my pet peeve—I decided to test whether the critics of football were right. Thus, this test was constructed to measure the correlation between knowledge of football ("fanaticism") and a person’s level of aggression. My operational definition of aggression is: the frequency and intensity of feeling angry, combined with the violence one exhibits when angry. My aim was to scientifically disprove people’s misconceptions, and in the end, I succeeded.
In the preliminary version, the test consisted of twenty-five multiple-choice items designed to comprehensively measure football knowledge in college students. The questions dealt with five important areas: rules, history, current events, terminology, and miscellaneous. It also contained two validity questions on aggression: how aggressive a person's friends would describe them, and how often they found themselves angry. These two scores were summed into an aggression index (AI). The preliminary test took about 10 minutes to administer, and the subjects of the first trial were fourteen Psychological Testing students.
With these fourteen scores, my first task was to split them into high- and low-scoring groups for analysis. My first finding was that the distribution was anything but normal: six people missed two or fewer, six people missed twelve or more, and only two scores fell in between. Thus, I set aside those two middle scores and dealt with the other twelve. The item analysis went very well: seventeen items showed good or excellent discrimination. To evaluate validity, I used Microsoft Excel to correlate the aggression index with each subject's score; the correlation between the two was .6004.

The first test did extremely well statistically, but it needed a few minor changes. At the suggestion of Professor Cunningham, a question asking for the subject's gender was added. On the seventeen items that survived the first cut, distractors were altered to increase their effectiveness, and the question concerning safeties was completely reworded. Three new questions were interspersed among these seventeen items, for a total of twenty. The instructions were shortened so as not to prompt the subjects for any of the following questions.

The main problem with the first test was that the validity questions were downright horrible; they were poorly designed to measure my construct, aggression. First, subjects had difficulty interpreting the meanings of "often," "sometimes," and "never," which lowered the validity of the test substantially. Second, the AI distribution was very disappointing, ranging only from two to seven. Finally, with only two questions of four choices each, the AI allowed for only eight different degrees of aggression, obviously too few for real construct validity. So, in the final version, a third question was added, and all three items were changed from multiple-choice questions to five-point scales. These alterations nearly doubled the AI's range from eight to fifteen points. The intention was to let subjects express their aggressive attitudes more comprehensively, and thus raise validity significantly.
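The validity check described above was a simple product-moment correlation between the aggression index and the test score, computed in Excel. The same computation can be sketched in a few lines of Python; note that the scores below are invented for illustration only, since the actual class data are not reproduced in this report.

```python
# Pearson product-moment correlation, as used to compare the aggression
# index (AI) with football-knowledge scores. The data here are made up
# for illustration -- they are not the real class data.

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    sd_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (sd_x * sd_y)

# Hypothetical example: AI values and test scores for six subjects.
ai_scores = [3, 5, 4, 7, 2, 6]
test_scores = [10, 18, 12, 21, 8, 15]
r = pearson(ai_scores, test_scores)
```

A value near +1 would indicate that higher aggression indexes go with higher football-knowledge scores; the report's preliminary figure of .6004 would count as a moderately strong positive relationship.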
The second test was administered to a class of 44 Introduction to Psychology students. The scoring distribution was much better: there were twelve high scorers, twelve low scorers, and the other twenty were spread across the spectrum. The item analysis, however, found much weaker results: twelve items showed good discrimination, five were horrible, and three were questionable. This is odd, because seventeen items had shown good or excellent discrimination in the preliminary testing. The psychometric properties were equally fascinating. To start, I evaluated the reliability using the Kuder-Richardson 20 formula:

    r_tt = [n / (n − 1)] × [(SD² − Σpq) / SD²]
Where n = 20, the standard deviation (SD) = 5.574, and Σpq = 4.559, the reliability coefficient of the test was .898. This is an amazing coefficient, and I didn't accept it at first; however, checking it over and over again proved that it is, in fact, correct. Out of curiosity, I decided to see what would happen if the test were expanded to 100 items. The Spearman-Brown formula states:
    r_nn = (n × r_tt) / [1 + (n − 1) × r_tt]
Where n = 5 and r_tt = .898, the predicted reliability would be .978, another incredible answer. After repeating the whole process, I must conclude that my calculations are correct, and that this test is astonishingly reliable. Unfortunately, the validity of the test was not nearly as good. A simple correlation of the aggression index with test scores yielded an unimpressive .1988. Of course, drawing a conclusion from a single test is foolish; but based on this correlation, aggression and "football fanaticism" apparently are not related. The test did, however, find a striking correlation between gender and knowledge of football. The mean score for males was 15 correct; the mean for females was 7.26. Thus, males averaged over twice as many correct answers. Further, of the top 13 scorers (subjects with 17 or more correct), only one was female. The twelve worst scores (6 or fewer correct) were all female; since there were 23 females in the sample group, over half of them scored in the bottom group, while no male had fewer than 7 correct. Although these are just correlations and descriptive statistics, they imply a strong underlying factor behind the large gender gap.
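The two reliability calculations above can be checked numerically. The function names below are my own; the input figures (n = 20 items, SD = 5.574, Σpq = 4.559, and a lengthening factor of 5 to go from 20 to 100 items) come directly from the report.

```python
def kr20(n_items, sd, sum_pq):
    """Kuder-Richardson 20: r_tt = [n/(n-1)] * [(SD^2 - sum(pq)) / SD^2]."""
    return (n_items / (n_items - 1)) * (1 - sum_pq / sd ** 2)

def spearman_brown(r_tt, factor):
    """Predicted reliability if the test is lengthened by the given factor."""
    return (factor * r_tt) / (1 + (factor - 1) * r_tt)

r_tt = kr20(20, 5.574, 4.559)        # ~0.898, matching the report
r_100 = spearman_brown(0.898, 5)     # 20 items -> 100 items: ~0.978
```

Both figures reproduce the report's values to three decimal places, which supports the author's claim that the hand calculations were correct.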
Unfortunately, we will never know exactly what caused the sad decline in validity from the initial to the final testing. One possible cause is restriction of range: this test was administered solely to college students, and many aggressive people never make it to college. Also, because of the high face validity of the AI questions, the test is susceptible to subject variables such as "faking good," "faking bad," and social desirability. Further development of the test would have to address these problems. A more comprehensive and systematic aggression index would probably increase the overall validity. A longer test, with broader coverage of football knowledge, would discriminate better among subjects; with only twenty items, it was hard to achieve good content validity. Finally, there is obviously an underlying mechanism causing the low female and high male scores. I believe it lies in the socialization of children: once they learn which sports are socially acceptable for each gender, they stick to those stereotypes. An analysis of our society, in conjunction with the administration of this test, might yield some insight into these intriguing results. Hopefully, this test might one day help settle the intense debate over whether a violent sport like football is correlated with aggression in the people who enjoy watching and playing it.