Nov 19, 2014

Here We Go Again: Tests for the Common Core May Be Unfair to Some and Boring To All

by Jim Loewen


On October 16, 2014, the Center for American Progress (CAP), a "liberal" think tank in Washington, D.C., presented a panel, "The Need for Better, Fairer, Fewer Tests." It starred Jeffrey Nellhaus, Chief of Assessment at PARCC, the Partnership for Assessment of Readiness for College and Careers, Inc.

In case you're wondering what PARCC does, these folks seem to have an inside track toward developing the testing that will undergird America's new Common Core Standards for K-12 education.

Unfortunately, multiple-choice tests are often biased against African Americans, Native Americans, and Mexican Americans, and sometimes against females. My (extensive) experience studying the SAT and other "standardized" tests[1] has shown that items that favor African Americans systematically get dropped. Increasingly, so do items that favor Mexican Americans and also items that favor girls on math tests. This is not because racists or sexists at ETS, Pearson, etc., want to deny competence or college to black folks or women or whomever. Rather, it results from the standard statistical tests psychometricians apply to items, notably the point-biserial correlation coefficient (known in sociology as the item-to-scale correlation coefficient).

Let me explain "point-biserial correlation coefficient." Testmakers seek to measure or predict student performance. For example, ETS wants the SAT to correlate positively and strongly with first-semester college GPA. Researchers at ETS are always developing new items. What they should do is see if getting those items right correlates with higher first-semester college grades. ETS has relationships with over a hundred institutions of higher learning and could obtain from them students' GPAs. Then they could match those GPAs with performance on each new item. But doing so would be expensive and would take most of a year. Instead, ETS researchers argue that an item "behaves" statistically if it correlates with students' overall ability. That claim would make sense, except for their unfortunate choice of measure: "overall ability" is students' scores on the test as a whole. (This is why sociologists call this the "item-to-scale correlation coefficient.")
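For readers who want to see the arithmetic, here is a minimal Python sketch of the coefficient just described: the correlation between getting one item right (1) or wrong (0) and the score on the test as a whole. The function name and the six students' scores are invented for illustration; nothing here is ETS's actual code or data.

```python
from statistics import mean, pstdev

def point_biserial(item_right, total_scores):
    """Correlation between a 0/1 item result and the overall test score."""
    right = [s for r, s in zip(item_right, total_scores) if r == 1]
    wrong = [s for r, s in zip(item_right, total_scores) if r == 0]
    p = len(right) / len(item_right)   # share who got the item right
    q = 1 - p                          # share who got it wrong
    # Difference between the two groups' mean scores, scaled by the spread
    # of all scores and by how evenly the item splits the test-takers.
    return (mean(right) - mean(wrong)) / pstdev(total_scores) * (p * q) ** 0.5

# Hypothetical data: six students, one trial item, their overall scores.
item_right = [1, 1, 1, 0, 0, 0]
total_scores = [60, 55, 50, 45, 40, 35]
r = point_biserial(item_right, total_scores)
print(round(r, 2))
```

In this made-up case the students who got the item right are also the high scorers, so the coefficient comes out strongly positive and the item would "behave."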

Why is that measure of overall ability problematic? Well, consider an item that uses the word "environment." ETS staffers showed decades ago that when white students hear the word "environment," most think first of the natural environment: ecology, pollution, the balance of nature, and the like. Nothing wrong with that; everyone knows that meaning of "environment." But when African American students hear the word, most think first of the social environment, as in "what kind of environment does that child come from?" Again, nothing wrong with that; everyone knows that meaning of "environment." In the pressure-cooker conditions under which students take "standardized" tests, however, the first meaning that flashes into their minds when they encounter a word often influences whether they pick a "distracter" (wrong answer) or get the item right.

It follows as night follows day that a potential item based upon the second meaning of "environment" could never make it to the SAT. Perhaps 75% of African Americans would get it right, but only 65% of European Americans. In addition, working-class and lower-class whites and Latinos living in inner-city neighborhoods would also be more likely to get it right, since they live in a majority-black subculture. Unfortunately, all of these students typically score well below most suburban whites. Therefore the item would fail the point-biserial correlation coefficient test. The people getting it right scored lower on the overall test than the people getting it wrong.[2] The "wrong people" got it right. In statistical terms, the item had a negative point-biserial correlation coefficient. No one at ETS wanted nonwhites to score lower. No intentionality was involved. The process is purely statistical. Nevertheless, the result is the same: even though the item may separate those with more ability from those with less within each group, across all groups it "misbehaved" statistically. No such item, on which people who score badly overall do better than rich white males, can ever appear on the final exam.
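The mechanism can be sketched with invented numbers. In the Python sketch below, Group A gets the item right 75% of the time but scores low overall, and Group B gets it right 65% of the time but scores high overall. Within each group, students who get the item right score two points higher than those who get it wrong. All group sizes and scores are assumptions chosen to make the arithmetic visible (every student with the same item result in a group has the same score, which is unrealistic); they are not real test data.

```python
from statistics import mean, pstdev

def point_biserial(item_right, total_scores):
    """Correlation between a 0/1 item result and the overall test score."""
    right = [s for r, s in zip(item_right, total_scores) if r == 1]
    wrong = [s for r, s in zip(item_right, total_scores) if r == 0]
    p = len(right) / len(item_right)
    q = 1 - p
    return (mean(right) - mean(wrong)) / pstdev(total_scores) * (p * q) ** 0.5

def make_group(n_right, score_right, n_wrong, score_wrong):
    """Return (item results, overall scores) for one invented group."""
    item = [1] * n_right + [0] * n_wrong
    scores = [score_right] * n_right + [score_wrong] * n_wrong
    return item, scores

# Group A: 15 of 20 (75%) get the item right; overall scores are low.
item_a, scores_a = make_group(15, 41, 5, 39)
# Group B: 13 of 20 (65%) get the item right; overall scores are high.
item_b, scores_b = make_group(13, 71, 7, 69)

r_a = point_biserial(item_a, scores_a)
r_b = point_biserial(item_b, scores_b)
r_pooled = point_biserial(item_a + item_b, scores_a + scores_b)

print("within A:", round(r_a, 2))
print("within B:", round(r_b, 2))
print("pooled:  ", round(r_pooled, 2))
```

With these numbers the item separates stronger from weaker students inside each group, yet the pooled coefficient dips below zero, because the group more likely to get the item right is the group with lower overall scores. How negative the pooled figure gets depends on the assumed gap between groups relative to the gap within them.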

Like most other "standardized" tests given widely in the U.S., the SAT was originally validated on affluent white students. Affluent white students have always done better on it than have African Americans, Hispanics, Native Americans, Filipino Americans, or working-class whites. It follows that using point-biserial correlations increases test bias. Because new test items must correlate with old ones, items that favor blacks, Latinos, or poor people cannot pass this hurdle. Neither can items that favor girls on the math test.

Knowing this, I asked the presenters at CAP how they were dealing with the possibility of biased items. In response, Nellhaus said that PARCC was doing two things. First, they subjected each item to scrutiny, to avoid language that might upset any racial or ethnic group, gender, etc. Second, they subjected each item to DIF, differential item functioning, a statistical test developed by ETS, the company that puts out the SAT.

Unfortunately, neither of these techniques has much to do with reducing bias.

To be sure, we don't want to use tests with language or content that would upset any group. But most biased items have no such language or content. Upsetting language is not the major source of bias.

DIF is even less relevant to the issue of bias. DIF is a statistical technique that flags an equal number of outliers in each direction. Twenty years ago, I got staffers at ETS to admit that DIF is not a bias-reduction technique. It cannot be. Typically it brings to testmakers' attention as many items that favor blacks as items that favor whites. Indeed, since the number of items that favor African Americans is likely to be smaller than the number that favor whites, DIF may flag more pro-black items than pro-white items, because the pro-black items will stand out as more distant from the mean.

Moreover, whites may outscore blacks on items, but the items will still get dropped because whites don't outscore blacks by a great enough margin! Thus if the share of African Americans who get a typical item correct is 7 percentage points lower than the share of whites, and if a departure of 6 points from that figure is the cutoff that triggers DIF, then items on which blacks do "only" 1 point worse than whites will get flagged for scrutiny. To be sure, so will items on which whites do 13 points better than blacks. If both outliers get dropped, the mean difference, here 7 points, remains unchanged. If bias explains part of that difference, the testers will neither notice nor remove it. DIF drops the mean difference before it looks at items. Therefore DIF is irrelevant to the mean difference.
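The arithmetic in the paragraph above can be checked with a toy example. This Python sketch is only an illustration of that arithmetic, not the actual DIF statistic, which compares test-takers matched on overall score. The list of item gaps is invented; the mean gap of 7 and the cutoff of 6 come from the example in the text.

```python
from statistics import mean

# Hypothetical white-minus-black gaps in percent correct, one per item.
gaps = [1, 5, 6, 7, 7, 8, 9, 13]
cutoff = 6  # distance from the mean gap that triggers a flag

mean_gap = mean(gaps)

# An item is flagged when its gap sits far from the mean gap in either
# direction; the mean gap itself is never examined.
flagged = [g for g in gaps if abs(g - mean_gap) >= cutoff]
kept = [g for g in gaps if abs(g - mean_gap) < cutoff]

print("mean gap before:", mean_gap)
print("flagged items:  ", flagged)
print("mean gap after: ", mean(kept))
```

Under these assumptions the flagged items are the one with a 1-point gap and the one with a 13-point gap, and the mean gap among the items that remain is the same as before.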

PARCC does not propose to look at items on which blacks do much worse than whites (or Hispanics much worse than Anglos, or women much worse than men on math). PARCC only proposes to use DIF. Therefore, PARCC has no bias reduction procedure in place.

Even worse than that criticism, however, is the simple fact that their tests apparently will be mostly multiple-choice. There is no excuse for this. Rhode Island and Vermont showed years ago that states can test and qualify students for promotion or graduation using a portfolio approach that demands various competencies of them. Students might be asked to give a persuasive talk, for example. They might have to write a ten-page paper, making a coherent argument and backing it up. They might have to do a chemistry lab project or use math to interpret a table.

I asked Nellhaus if they had considered using portfolios. He replied that portfolios are not feasible, partly because of problems comparing across different states. However, portfolios have crucial advantages over "standardized" tests. First, they are fun for students to assemble. No one has ever accused "standardized" tests of being fun. Second, portfolios give a real picture of students' strengths and weaknesses. Third, they give students something useful and meaningful to work on in order to improve. When a student gives a disorganized talk, the problem is apparent to all, and giving an organized talk is a skill worth acquiring. When a student percentages a table wrong, the error is obvious upon reflection; again, doing it right is a skill worth acquiring.

When a student scores low on a "standardized" test, on the other hand, usually no meaningful diagnostic comes back. Does the student read slowly? By itself, that can decrease test scores dramatically on such tests, but the test does not even tell the student that s/he reads slowly, let alone offer meaningful remediation. Moreover, the ability to choose "B" from among five alternatives quickly is not a skill of any use in later life. Perhaps the most effective way to improve one's score on an SAT-type exam is by taking a coaching class, particularly the Princeton Review, but the "skills" it teaches are not only not useful after the test, they are anti-intellectual. And of course they are also expensive.

The key drawback to portfolios and the other alternative forms of examining student performance? It's harder for companies to monetize them.

Comparing portfolios across different states isn't crucial, but if it became an issue, it would be solvable. ETS already hires capable people, including very good high school history teachers, to read and grade the essays that students write responding to the DBQ (Document-Based Question) on the APUSH (Advanced Placement U.S. History) exam. Yes, it's labor intensive, but it can be done, and on a large scale. After all, ETS is a massive test-producer.

If the Common Core is really going to have better, fairer, fewer tests, it needs to move away from overreliance on multiple-choice items, a.k.a. "standardized" tests. Portfolios would be one solution. No Child Left Behind did not mandate multiple-choice tests, as Vermont and Rhode Island showed. Even less does the Common Core mandate such tests.

For the record, students whose parents did not graduate from college, immigrant children, young people of color, etc., will find it harder to assemble a powerful portfolio, compared to affluent white students. The upper class will always find ways to advantage its children. They're even supposed to: it's a rule that parents should do what they can for their kids, and rich folks can do more. But at least portfolios and some other forms of assessment are less biased than multiple-choice tests. Moreover, since they test abilities that are useful in the workplace and in college, they possess an intrinsic fairness that multiple-choice tests lack.

On March 9, 2014, the conservative columnist Kathleen Parker wrote "Simplifying the SAT." She bemoaned the loss of the analogy items back in 2005. Analogy items were particularly subject to bias. Hence I was glad to see them go, even though, like Parker, I personally enjoy them. Parker wrote, in passing, "Critics of the SAT maintain that the test is biased in favor of students from wealthier families. We all want a level playing field and equal opportunity for children. This is fundamental to who we are."

This is an example of wishful thinking. "We all" do not want this, or we would have it! The affluent and influential want tests that advantage their children, and they have them.

Here is just one of the many ways that affluent and influential families gain unfair advantage on "standardized" tests. They "game" the testing rules. Although rich and powerful families do not want their children in special education (except "gifted and talented" programs), they do want their children to get dispensations for special testing regimens from ETS. Across, say, the Los Angeles metropolitan area, these requests (each bolstered by a statement from a doctor or psychologist paid by the family) map inversely with the prevalence of students assigned to special education. When mapped, these requests also correlate strongly with median family income. And they correlate with the availability of coaching classes for the SAT.

Based on the presentation at the Center for American Progress, the Common Core as tested by the Partnership for Assessment of Readiness for College and Careers may be no fairer to "have-not" students than were the old multiple-choice tests that "No Child Left Behind" spiraled down into using in most states. No fairer than the SAT. Not fair at all. Not fun, either. No fun at all.

[1]Since "standardized" implies much more to the lay reader than legitimately intended by the psychometrician, this word needs to be encased in quotation marks. Otherwise, many readers will infer that tests are somehow made "standard" or fair across population groups; just the opposite is usually the case.

[2]See Loewen, "Statement," in Eileen Rudert, ed., The Validity of Testing in Education and Employment (DC: U.S. Commission on Civil Rights, 1993), 42-43; cf. Loewen, "A Sociological View of Aptitude Tests," ibid., 73-89.