Inside the multimillion-dollar essay-scoring business

Behind the scenes of standardized testing

By Jessica Lussenhop

Published: February 23, 2011

Dan DiMaggio was blown away the first time he heard his boss say it.

The pensive, bespectacled 25-year-old had been coming to his new job in the Comcast building in downtown St. Paul for only about a week. Naturally, he had lots of questions.

At one point, DiMaggio approached his increasingly red-faced supervisor at his desk with another question. Instead of answering, the man just hissed at him.

"You know this stuff better than I do!" he said. "Stop asking me questions!"

DiMaggio was struck dumb.

"I definitely didn't feel like I knew what was going on at all," he remembers. "Your supervisor has to at least pretend to know what's going on or everything falls apart."

DiMaggio's question concerned an essay titled, "What's your goal in life?" The answer for a surprising number of seventh-graders was to lift 200 pounds.

Although DiMaggio had been through a training process, he found himself tripped up as he began scoring the essays. What made the organization "good" as opposed to "excellent"? What happens when the kid doesn't answer the question at all, but writes with excellent organization about whatever the hell he wants? Did it matter that it was insane for seventh-graders to think they'd be benching 200 pounds?

DiMaggio had good reason to worry. His score could determine whether the school was deemed adequate or failing—whether it received government funding or got shut down.

DiMaggio soon learned that his boss was a temp like him. In fact, the boss was only the team leader because he'd once managed a Target store.

DiMaggio found out that the human resources woman who'd hired them both was a temp. He realized that their office space—filled with long tables lined with several hundred computer monitors and generic office chairs—was rented.

Eventually, DiMaggio got used to not asking questions. He got used to skimming the essays as fast as possible, glancing over the responses for about two minutes apiece before clicking a score.

Every so often, though, his thoughts would drift to the school in Arkansas or Ohio or Pennsylvania. If they only knew what was going on behind the scenes.

"The legitimacy of testing is being taken for granted," he says. "It's a farce."

  

THOUGH THE EFFICACY of standardized testing has been hotly debated for decades, one thing has become crystal clear: It's big business.

In 2002, President George W. Bush signed the infamous No Child Left Behind Act. While testing around the country had been on the rise for decades, NCLB tripled it.

"The amount of testing that was being done mushroomed," says Kathy Mickey, a senior education analyst at Simba Information. "Every state had new contracts. There was a lot of spending."

The companies that create and score tests saw profits skyrocket. In 2009, K-12 testing was estimated to be a $2.7 billion industry.

The Twin Cities were early beneficiaries of the gold rush. Minnesota's history as an early computer hardware hotbed led to the creation of some of the earliest data-scanning and numbers-crunching businesses here, including Scantron and National Computer Systems. By the '90s, NCS was grading 85 percent of the standardized tests for the nation's largest school districts.

In 2000, NCS was bought by Pearson, a multinational corporation based in London, making it a part of the largest education company in the world. In 2009, Pearson posted $652 million in profits.

Today, tens of thousands of temporary scorers are employed to score essay responses. This year, Maple Grove-based Data Recognition Corporation will take on 4,000 temporary scorers, Questar Assessment will hire 1,000, and Pearson will take on thousands more. From March through May, hundreds of thousands of standardized test essays will pour into the Twin Cities to be scored by summer.

The boom in testing has come with several notable catastrophes. The most famous happened in 2000, when NCS Pearson incorrectly failed 8,000 Minnesota students on a math test. Pearson shelled out a $7 million settlement to the students, and Gov. Jesse Ventura participated in a makeup graduation for students who were wrongly denied their diplomas. In 2010, Pearson again mis-scored two questions on Minnesota's fifth- and eighth-grade tests. Delays in its Florida scoring resulted in a $3 million fine, and glitches in Wyoming led the company to offer a $5.8 million settlement.

But while a mistake on a bubble form is a black-and-white problem, few scandals have broken on the essay side of the test-scoring business.

"It requires human judgment," says Michael Rodriguez, of the University of Minnesota's educational psychology department. "There is no way to standardize that."

Now scorers from local companies are drawing back the curtain on the clandestine business of grading student essays, a process they say goes too fast; relies on cheap, inexperienced labor; and does not accurately assess student learning.

"The entire testing system in the U.S. needs to be restructured," says Robert Schaeffer, public education director for FairTest. "That would likely result in the disappearance of these essay-scoring sweatshops."

  

DANI INDOVINO DIDN'T want to score tests. She wanted to work in nonprofit administration.

But she was fresh out of school in September 2008, just as the economy was entering its freefall. Desperate to get out of her parents' house, she perked up when some friends told her about becoming a "reader" for one of the local test companies. It was easy work to get and there was lots of it. All you needed was a college diploma.

"I was like, 'Yeah, I have a degree, I can do that,'" she recalls.

On Indovino's first day, she drove out to Questar Assessment in Apple Valley, a beige warehouse, and followed the signs that said "Scoring Center" in bright red letters. During her brief interview, she'd been asked repeatedly if she was able to follow a "rubric"—a set of guidelines to assess the essays in as uniform a way as possible.

"I guess they've had bad experiences with English teachers," she says.

Inside Questar, Indovino took a seat in a room that looked like a classroom, crammed with as many computers and desks as could fit. It was here that the team leaders unveiled the scoring rubric, which was like a secret decoder ring for the job.

The rubrics are most often developed in conjunction with the state's department of education and its testing contractor. Currently, Minnesota contracts both its test writing and scoring to Pearson. Local teachers are included in the rubric-writing process, as well as test-writing academics called "psychometricians."

At first blush, the rubric seemed simple enough to Indovino. It was a chart with one- or two-sentence explanations of each number grade. Scorers are forbidden from taking the rubrics out of the Questar building or talking about them, but they generally look something like this:

6. An excellent response, the essay includes

• excellent focus and development

• excellent organization

• excellent language skills and word choice

• excellent grammar, usage, and mechanics

5. A good response, the essay includes

• good focus and development

• good organization

• good language skills and word choice

• good grammar, usage, and mechanics

4. An adequate response ...

On down to 1s, which were reserved for barely decipherable language.

As part of their training, Indovino and her co-workers read through pre-graded examples out loud, then discussed why each had been scored the way it was. The process quickly divided the room into two camps—the young, unemployed kids who were just there for a paycheck, and the retired teachers.

"The retired teachers would argue everything," says Indovino.

After two days of going through example papers, each scorer had to pass a qualifying exam. Indovino scored three sets of ten pre-scored papers. In order to be approved to work on the project, she had to pass two of the sets with at least an 80 percent "agreement rate" with the rubric. She did so with relative ease; most of the rest of the room passed on their second try.

Her first project was from Arkansas, an essay written by eighth-graders on the topic, "A fun thing to do in my town."

And that's where the troubles began.

Suddenly, she was being asked to crank through 200 real essays in a day. The scanned papers popped up on the screen and her eyes flitted as fast as they could down the lines. The difference between "excellent" and "good" and "adequate" was decided in a matter of seconds, to say nothing of the responses that were simply off the reservation. How do you score a kid who rails that his town sucks? What about an exceptionally well-written essay on why the student was refusing to answer the question?

All over the room, the teachers were raising their hands and disputing the rubric. Indovino preferred to keep her head down and just score the way she was told to.

"I was good at the bad system," she says.

Over the next several months, Indovino got to know her co-workers better. The young people were mostly laid off or in foreclosure. They came straight from paper routes and went off to waitressing jobs afterward.

They also made for a very dedicated workforce. Indovino says she saw her co-workers hung-over, extremely ill, and even fresh from surgery.

"I scored a full day without glasses on," Indovino says with a shrug. "I sat with my nose up to the glass all day. I couldn't read it."

When she eventually got a full-time job, Indovino quit scoring. Although she'd done well by the company's standards, following the rubric provided little sense of accomplishment.

"Nobody is saying, 'I'm doing good work, I'm helping society,'" she says. "Everyone is saying, 'This isn't right.'"

  

DAVID PUTHOFF WAS an experienced reader with Questar when he started getting the warnings that his job performance wasn't up to snuff.

"Your numbers are down a little bit," his supervisor said at the end of one day. "Make sure you bring those back up."

Most essays, depending on the criteria established in the state, are scored by two readers. As Puthoff and his fellow scorers whipped through their essays, their supervisor kept his own eyes glued to a screen, keeping them apprised of whether Reader #1 agreed with Reader #2. If so, both got a 100 percent agreement score for that essay. If one differed by a single point, the essay would be counted as "adjacent" agreement.
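For readers curious about the arithmetic behind these numbers, the tally might be sketched roughly as follows. The companies' actual formulas aren't disclosed in this reporting, so the function below is an illustration only, under the assumption that "exact" means identical scores and "adjacent" means scores one point apart:

```python
# A rough sketch of how exact and adjacent agreement rates between two
# readers might be tallied. This is an assumption for illustration, not
# the formula any testing company is known to use.

def agreement_rates(reader1, reader2):
    """Return (exact, adjacent) agreement as fractions of paired scores."""
    pairs = list(zip(reader1, reader2))
    exact = sum(1 for a, b in pairs if a == b)        # identical scores
    adjacent = sum(1 for a, b in pairs if abs(a - b) == 1)  # one point apart
    n = len(pairs)
    return exact / n, adjacent / n

# Two hypothetical readers scoring the same ten essays on a 1-6 rubric:
r1 = [4, 3, 5, 2, 6, 3, 4, 1, 5, 3]
r2 = [4, 3, 4, 2, 6, 5, 4, 1, 5, 2]
exact, adjacent = agreement_rates(r1, r2)
```

On this made-up sample the readers match exactly on 7 of 10 essays, which is the kind of "high 80s" figure a strong scorer like Puthoff would be chasing once adjacent matches are counted in his favor.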

Puthoff had thus far been an agreement-rate superstar. He was consistently in the high 80s.

Then came the question from hell out of Louisiana: "What are the qualities of a good leader?"

One student wrote, "Martin Luther King Jr. was a good leader." With artfulness far beyond the student's age, the essay delved into King's history with the civil rights movement, pointing out the key moments that had shown his leadership.

There was just one problem: It didn't fit the rubric. The rubric liked a longer essay, with multiple sentences lauding key qualities of leadership such as "honesty" and "inspires people." This essay was incredibly concise, but got its point across. Nevertheless, the rubric said it was a 2. Puthoff knew it was a 2.

He hesitated the way he had been specifically trained not to. Then he hit "3."

It didn't take long before a supervisor was in his face. He leaned down with a printout of the King essay.

"This really isn't a 3-style paper," the supervisor said.

Puthoff pointed out the smart use of examples and the exceptional prose. The supervisor just shook his head and pointed out how short the paragraphs were.

"You know, it's more of a 2," the supervisor repeated. "Not enough elaboration."

Puthoff quickly learned these were not arguments he could win. But as time went on, he found himself having more and more of them.

There were the students who wrote extremely well but whose responses were too short—in his mind he saw them, bored with the essay topic, hurrying to finish. Or the essays where the handwriting got rushed and jumbled at the end, then cut off abruptly—he imagined the proctor telling the frantic student to lay down his pencil on a well-written but incomplete response.

And there were the kids who just did what they wanted. Like the boy from Arkansas who, instead of writing about the most fun thing to do in his town, wrote a hilarious essay on why his town was terrible and how he wanted to burn it down and pee on the ashes.

"I wanted the kid to get the score they deserved," Puthoff says of his time in the business. "But they want to put them in boxes."

In defiance, Puthoff started giving creatively written essays an illicit score bump. His agreement numbers noticeably suffered.

The industry calls this "scorer drift," a well-documented tendency to begin deviating from the rubric over time. One case of scorer drift actually resulted in some 4,100 teachers failing the essay portion of their certification exams. The teachers successfully sued for $11.1 million.

What was different about Puthoff's scorer drift was that he was doing it on purpose.

"I'll bring them up, don't worry," he'd say of his agreement rate, then go back the next day and do the exact same thing.

"I know this kid is good," he'd tell himself. "I know this kid's a good writer."

   

TODD FARLEY TREATED his supervising position at a scoring company like a joke.

"At the time, testing wasn't that big," he says. "I never had to feel like I'm actually deciding someone's future. It was just silly."

Farley had started at the bottom rung of the testing industry in Iowa City. A part-time graduate student with bills to pay, he was more interested in partying and trying to become a writer than he was in getting a real job. So he took one scoring job after another for NCS.

"It was always a temporary gig," he remembers. "It was a lovely, slacker-y life."

Farley had no official training in teaching, education, or test writing, but the longer he remained at NCS, the more responsibilities he was handed. He took the offer to become a team leader because it paid a little extra money and got him out of scoring.

Teaching his first group of scorers, Farley walked them through the rubric the same way he'd been shown. He fielded the inevitable bombardment of confused questions as best he could, in particular from one man: Harry the laid-off refrigerator plant worker.

Even though Harry eventually passed his qualifying exams, he was a disaster. As Farley monitored Harry's scoring, he found himself walking back over to Harry repeatedly.

"Look," Farley would say. "You're giving this essay a 2 even though it's perfectly formatted."

Harry would nod. But a short time later, another ridiculous lowball from Harry would land on Farley's desk. Before long, Harry began to drag down the all-important agreement level.

Farley now understood the reasons why, when he'd been a scorer, his team leaders would tell the room they wanted to start seeing more 3s or 4s or whatever. Supervisors were expected to turn the test scores into a nice bell curve. If his room did not agree at least 80 percent of the time, the tests would be taken back and re-graded, wasting time and money. The supervisor would be put on probation or demoted.

When Farley complained to a fellow supervisor about his problem, she smiled wryly and held up a pencil.

"I've got this eraser, see," she told him. "I help them out."

So Farley simply began changing Harry's scores to agree with his peers'. The practice soon spread well beyond Harry.

"I'd just change a bunch of answers to make it look like my group was doing a great job," Farley says. "I wanted the stupid item to be done, and so did my bosses."

There were a few other tricks to keep the numbers up. One was to send a wayward scorer off into a corner to study example papers long enough for the group's numbers to rebound. Another was to pair up a couple of bad scorers and make them decide together what to give a paper.

Or he could make the same announcement he'd heard from his supervisor back when he was a scorer.

"It's time we see more sixes," Farley would tell the group, which was code that his bell curve was off. "We're in trouble here, we need higher scores, give higher scores."

Though Farley and his fellow team leaders were fudging the numbers, even he was shocked when a representative from a southeastern state's Department of Education visited to check on how her state's essays were doing. As it turned out, the answer was: not well. About 67 percent of the students were getting 2s.

That's when the representative informed Farley that the rubric for her state's scoring had suddenly changed.

"We can't give this many 1s and 2s," she told him firmly.

The scorers would not be going back to re-grade the hundreds of tests they'd already finished—there just wasn't time. Instead, they were just going to give out more 3s.

No one objected—the customer was always right.

Eventually, Farley was hired away by a rival testing company and moved to the East Coast. As he saw standardized tests becoming more and more important to the fate of schools and kids, he got fed up, quit the industry, and decided to write a whistle-blowing book.

Making the Grades: My Misadventures in the Standardized Testing Industry came out in 2009. Though the tell-all chronicles Farley's many misdeeds while scoring tests and supervising, he has nonetheless been invited back to work for the testing companies several times. The boom has just made his experience too valuable.

"They get paid money to put scores on paper, not to put the right scores on papers," he says. "They have a bottom line. Why anyone would expect anything else is beyond me."

  

PEARSON SPOKESMAN ADAM Gaber warns against taking the opinions of former scorers too seriously.

In an email, he characterized their concerns as "one-sided stories based upon people who have a very limited exposure and narrow point of view on what is truly a science."

Questar declined a request to visit its facilities, but reached by phone, Susan Trent, vice president of assessment services, said that the essays are scored as objectively as is possible.

"We're really insistent that readers understand they're dealing with kids," she says. "Decisions are being made about these kids based on these scores, and we're absolutely committed to getting them right."

She denies that graders are pressured to work too quickly and says that any evidence of scorer drift results in test re-scoring. She is also adamant that well-trained temps are the best way to score essays objectively.

"You do not have to be a teacher in order to score student response," adds Terry Appleman, vice president of performance assessment. "You have to have a good rubric and good training."

Asked what to make of the former Questar employees who felt they couldn't do a good job given their training and time constraints, Appleman quickly answers: "If they don't think they're qualified, it's not the job for them."

Most of the scorers interviewed for this story agree, but nearly all plan to return to the scoring center. They say they need the money.