PRE-EMPLOYMENT TESTING...When is a test "valid" enough?

You are the HR manager of a medium-sized manufacturing company. You need to hire ten machine operators, and you are working with a consultant who has a test covering mechanical aptitude, basic math, and reading comprehension. He assures you that the test has been properly "validated", but you are wondering ... will it really help you to hire the best workers?

Before we deal directly with this question, let’s review some facts about pre-employment testing and validation. Since 1964, testing has come under the jurisdiction of Title VII of the Civil Rights Act, which outlaws tests that discriminate against minorities unless such tests can be shown to be based on "business necessity". Since then, the courts have identified two types of discriminatory hiring practices ... "disparate treatment", covering blatant acts of discrimination, and "adverse impact", which is the governing rule in most cases of pre-employment qualifications.

For example, a factory which required unskilled workers to have a high school diploma would tend to exclude African-Americans ... perhaps not so much today, but certainly in 1964 when the laws were written. In fact, such cases were among the earliest challenges launched against companies under the new legislation. The defendants attempted to justify such standards with the "business necessity" argument ... that they were entitled to prefer more intelligent workers over less intelligent ones. The courts did not accept this argument. They insisted that any employment practice having adverse impact on protected groups must be "validated" according to a more rigorous standard.

It’s a common misconception that a test must be validated to prove that it doesn’t discriminate against women or minorities. This is wrong. If a company is challenged under Title VII, it has only one way of proving that its hiring practices are non-discriminatory ... by applying the 80% rule. If the hiring rate among protected-class job applicants is at least 80% of the hiring rate among white male applicants, then the complainant has failed to make a "prima facie" case of adverse impact, and the claim will normally be thrown out. A test that doesn’t discriminate cannot be challenged, and therefore does not need to be validated.
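The arithmetic of the 80% rule is simple enough to sketch in a few lines. This is only an illustration of the threshold test described above; the function name and the applicant counts are invented for the example:

```python
def adverse_impact(hired_protected, applied_protected,
                   hired_majority, applied_majority):
    """Return (impact_ratio, prima_facie_case) under the 80% rule."""
    rate_protected = hired_protected / applied_protected
    rate_majority = hired_majority / applied_majority
    ratio = rate_protected / rate_majority
    # A ratio below 0.80 suggests a prima facie case of adverse impact.
    return ratio, ratio < 0.80

# Hypothetical example: 10 of 40 protected-class applicants hired (25%),
# versus 30 of 60 majority applicants (50%): ratio = 0.25 / 0.50 = 0.5
ratio, case = adverse_impact(10, 40, 30, 60)
print(round(ratio, 2), case)  # 0.5 True
```

A ratio of 0.5 falls well below the four-fifths threshold, so this hypothetical hiring pattern would invite a challenge.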

The tests that need to be validated are precisely the ones that do, in fact, discriminate. Why would an employer want to use such a test? Because no matter how hard a company tries to be fair, women on average will still do worse than men on most tests of physical ability ... and, for whatever reasons, whites continue to outscore African-Americans on most types of IQ tests. "Validation" allows the employer to legally use the results of such tests.

It’s a curious fact that there is no precise legal definition of how to validate a test, and no government agency empowered to certify a test as officially "validated". The EEOC’s "Uniform Guidelines on Employee Selection Procedures" are just what their name implies ... "guidelines". However, in the words of an early Supreme Court ruling, they are to be accorded "great deference", and since that time they have acquired virtually the force of law.

The Uniform Guidelines recognize three kinds of validation: criterion-, content-, and construct-based. All three methods start with a job analysis, where the job is broken down to its component tasks. After that, the choice of validation method tends to follow the type of test.

A content-validated test is one that measures proficiency in actual job tasks ... a typing test for a secretary, or questions about tax rules for an accountant. It sounds simple, but there are all kinds of ways for lawyers to challenge content validation. For example, a municipality asked would-be firefighters to drag a water-laden firehose a certain distance across the ground within a given time limit. To a layman, this sounds like a reasonable simulation of an actual job task. But the test was challenged by female applicants, and thrown out by the courts on the reasoning that there was no proof that firefighters were routinely expected to perform this task alone. And even when the basic validity of such a test is conceded, the results can still be thrown out because the cut-off score (in this case, the maximum time) was not validated to the court’s satisfaction ... or because the weight accorded to a particular task was, in the opinion of the court, out of line with that task’s actual importance on the job.

A construct-validated test measures attributes or characteristics that are judged to be necessary for the job. An example might be a test of mechanical reasoning and reading comprehension for an equipment operator. In the case of firefighters, the municipality might choose to do medical tests for cardio-vascular capacity rather than actual job simulations. It’s something of a strategic choice ... the employer might feel that judges are more likely to be impressed by expert medical testimony and clean, clinical procedures than by "crude" tests of brute strength.

All these examples rely on validation by expert analysis. The third type of validation is quite different, and can theoretically be applied to any kind of test. A criterion-based validation starts, like the other two, with a job analysis. But in this case, the results of the analysis are used not to design the test questions, but to design the evaluation criteria. These are performance measures such as productivity, attendance, quality of work, and so on. A sample group of workers is chosen to write the test, and each worker who writes it is also evaluated by his supervisors according to the specified criteria. The test results are statistically compared to actual job performance (as measured by the criterion evaluations) to generate a measure of validity called the Pearson r-value, or the validity coefficient. Theoretically, a company would need to test a sample group of job applicants, then hire all of them and track their performance ... but in practice, a "concurrent" validity study (done with the existing workforce) is generally accepted.
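The statistical comparison at the heart of such a study is just the Pearson correlation. Here is a minimal from-scratch sketch; the test scores and supervisor ratings below are invented purely for illustration:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between test scores and criterion ratings."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical concurrent study: each worker's test score paired with
# his supervisor's overall performance rating.
scores  = [52, 61, 58, 70, 66, 49, 75, 63]
ratings = [3.1, 3.4, 2.9, 4.0, 3.6, 2.8, 3.9, 3.3]
print(round(pearson_r(scores, ratings), 2))
```

In a real study the sample would, of course, need to be far larger than eight workers for the result to mean anything.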

The interesting thing to note about a criterion-validated test is that the questions could theoretically be about almost anything: football, movie trivia, you name it ... as long as the results were shown to correlate with actual job performance. In practice, there is little interest in experimenting with such methods. Employers generally feel more comfortable with a test that has good "face validity", even if that’s not a concept recognized by the courts.

All this brings us back full circle, to the question we started out with: how do you know if a test will really help you to hire the best workers? It comes down to the all-important validity coefficient, a dimensionless number between negative one and positive one, with zero indicating a purely random relationship. What is the minimum value of this parameter which signifies that a test is "valid"?

And the answer is ... there is no answer! A fair guess would be that most of the criterion-validated tests in use today have r-values in the range of .20 to .40, or thereabouts. But the Uniform Guidelines say only that the correlation must be "statistically significant." In theory, this means that any non-zero value of r would be acceptable, as long as the sample group was large enough to establish, with reasonable confidence, that the result was not purely due to chance. The courts have not, so far, taken it upon themselves to impose any lower bound, concerning themselves only with making sure that the procedure has been carried through correctly.
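To see just how weak a "statistically significant" correlation can be, here is a rough sketch of the usual t-test for a Pearson r. The 1.96 critical value is the familiar two-tailed 5% level under a normal approximation; the function names are my own:

```python
import math

def t_stat(r, n):
    """t-statistic for testing whether a Pearson r differs from zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

def min_n(r, t_crit=1.96):
    """Smallest sample size at which r clears the ~5% two-tailed bar."""
    n = 3
    while t_stat(r, n) < t_crit:
        n += 1
    return n

print(min_n(0.30))  # a modest r needs only a few dozen subjects
print(min_n(0.10))  # even a tiny r can "pass" with a large enough sample
```

In other words, with several hundred workers in the sample, an r of .10 would technically satisfy the Guidelines, even though such a test barely outperforms a coin toss.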

So how good are these tests? Industrial psychologists, trained in psychometric analysis, are able to answer this question in two different ways. How they choose to answer depends, among other things, on whether they are testifying as expert witnesses for the plaintiff or for the respondent. If this last remark sounds like a cynical cheap shot at the profession, it really isn’t ... (well, maybe it is, just the tiniest little bit). In truth, it is mostly a reflection of just how confusing these statistics can be, even to an expert. The simple fact is that there are exactly two commonly-recognized "verbal interpretations" of the Pearson r-value, and while they are mathematically equivalent, they are so different in the impression they give that they seem to lead to diametrically opposite conclusions.

Let’s start by observing that it’s probably hard to write a multiple-choice test with a validity coefficient of much less than .20, no matter what the job. This is because performance on most tests is at least partially correlated with overall intelligence, and the same goes for performance in almost any job. Now consider a test with an r-value of .30, which almost any industrial psychologist would consider perfectly acceptable. Take the *square* of the r-value: .09, or 9%. Actual job performance will account for just 9% of the variance in scores on such a test. Yes, that’s nine percent. This is how one interprets the r-value of a test in terms of "percent variance explained" ... ninety-one percent of the variance in scores is due to factors other than those being measured by the test.

This thought is so chilling that one hesitates to point out that "variance" in the previous passage is a technical, mathematical term: it refers to the square of the standard deviation. Taking this into account makes a bad situation worse ... it means, in this example, that actual job performance accounts for less than 5% of the spread in test scores. Let’s consider what this means. Suppose you administer this test to a specially-selected sample of "average" performers ... no good ones, no bad ones, just workers chosen from right at top-dead-center of the bell curve. Suppose further that the results are spread out in a bell curve with a range (plus-or-minus one S.D.) of 50 to 70. (Remember, just because these are "identical" performers, it doesn’t mean they’ll get the same test scores: 95% of the spread is due to unrelated factors.)

Now administer the same test again, but this time to a heterogeneous sample of workers, representing all skill levels. Instead of a standard deviation of ten points in the test scores, you will now see a standard deviation of ... ten and one-half points! This is exactly what you get with a test whose validity coefficient is .30. It’s not very encouraging.
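The arithmetic behind these two numbers ... nine percent of the variance, and a ten-point spread growing to about ten and a half ... can be checked in a few lines:

```python
import math

r = 0.30                              # validity coefficient from the example
var_explained = r ** 2                # fraction of score variance tied to performance
sd_ratio = 1 / math.sqrt(1 - r ** 2)  # S.D. of scores: all skill levels vs. identical performers

print(round(var_explained, 2))        # 0.09 -> "9% of variance explained"
print(round(10 * sd_ratio, 2))        # a 10-point S.D. grows to about 10.48
```

The second line is the punchline: adding the full range of real skill differences to a population of identical performers widens the score distribution by less than five percent.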

But there’s another way of looking at the same test, one that casts quite a different light on things. It’s called utility analysis, and it’s not a "trick" ... just a different perspective. Here is how it goes: assume that you have a pool of, say, fifty job applicants, with ten to be hired. First, suppose you hire ten workers at random. They will have an average productivity of, let’s say, $20,000 per year. Now think back to our bell curve, and ask: what is the difference in productivity between an "average" worker and a worker somewhere else on the curve, say one standard deviation above or below the average? Suppose the difference is $2000, for this unskilled position. (A common rule of thumb for skilled or professional workers puts the actual deviation closer to 40% of the worker’s salary, or four times the amount shown in this example.)

Starting from these parameters, we can apply the Brogden formula to calculate the cost benefit per employee hired. The actual formula is given in the footnote below ... as a simplified approximation, we can observe that the effect of the test is to shift the average productivity of each worker upwards by roughly 30% of one standard deviation. There is actually a correction term in this formula, λ/p, which makes the payoff even better the higher you set your cut-off ... in this example, it comes to something like $840 per year, per worker. Not too shabby.

Utility = r * SDy * (λ/p) - c/p

where:
    Utility = cost savings per employee selected
    r = validity coefficient
    SDy = standard deviation of performance, in dollars
    λ = y-ordinate on a standardized (mean=0, sd=1) normal curve at the cut-off
    p = selection ratio
    c = cost to administer the selection system
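A minimal sketch of the Brogden calculation, plugging in the numbers from the example above. The inverse-CDF helper is a crude bisection, included only to keep the snippet self-contained:

```python
import math

def inverse_normal_cdf(q):
    """Invert the standard normal CDF by bisection (crude but dependency-free)."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if 0.5 * (1 + math.erf(mid / math.sqrt(2))) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def brogden_utility(r, sd_y, selection_ratio, cost_per_applicant=0.0):
    """Brogden utility per employee hired: U = r * SDy * (lambda/p) - c/p."""
    p = selection_ratio
    z = inverse_normal_cdf(1 - p)  # cut-off score for the top fraction p
    lam = math.exp(-z * z / 2) / math.sqrt(2 * math.pi)  # normal ordinate at cut-off
    return r * sd_y * (lam / p) - cost_per_applicant / p

# The example in the text: r = .30, SDy = $2000/yr, hire 10 of 50 applicants (p = 0.2)
print(round(brogden_utility(0.30, 2000, 0.2)))  # roughly $840 per year, per worker
```

Note how the λ/p term rewards selectivity: shrink the selection ratio (hire 5 of 50 instead of 10) and the payoff per hire rises further.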

**So what gives here? How can we have it both ways ... is the test a shocking waste of time, or is it a cost-effective means of improving productivity? Are we back at square one again, for the third time?**

To cast more light on this question, I have invented a new way to look at it. Here’s how it works (and remember, I’m not promising to *answer* the question, only to "cast more light on it"!):

You are going to El Salvador to recruit baseball players. You want sluggers who can hit over .300, but they don't keep statistics down there. So you'll have to watch them bat. You will only have a chance to see each hitter go to the plate ten times, and your plan is to offer a try-out to anyone who gets three or more hits.

Using the concept of "percent variance explained", as discussed above, we can actually calculate the "validity coefficient" of this test! Let's assume that the typical batting average in El Salvador is .250, with a standard deviation of .050 (so about one in six players is hitting over .300). On ten trips to the plate, the random (binomial) variance in the number of hits is 10 * .25 * .75 = 1.875. Furthermore, the performance-related variance ... the spread in expected hits due to real differences in ability ... is (10 * .05)^2 = 0.5 * 0.5 = 0.25. The total variance is the sum of these, and the fraction of this total which is performance-related (not simply random) is 0.25 / 2.125, or approximately 12%. To get the validity coefficient, we just take the square root of this, and get 0.34. This is quite a respectable r-value, as these things go.

So using a test with a validity coefficient of .34 is equivalent to hiring a batter based on ten "at-bats". The beauty of this method is that you can make the test more "valid" by simply counting more "at-bats" ... for example, if you observe 25 trips to the plate, the validity of this "test" rises to .50. Going to the other extreme, you might choose simply to sign a contract with the very first player who gets a base hit ... the validity of this "test" is .11, as can be easily verified.
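The at-bats arithmetic generalizes neatly. Here is a short sketch of the calculation for 1, 10, and 25 trips to the plate, using the same assumptions as above (a .250 league average with a .050 spread in true ability); the function name is my own:

```python
import math

def at_bat_validity(n, mean_avg=0.250, sd_avg=0.050):
    """Validity coefficient of judging a hitter on n at-bats.

    The random (binomial) variance of the hit count competes with the
    variance due to real differences in ability.
    """
    random_var = n * mean_avg * (1 - mean_avg)  # luck
    ability_var = (n * sd_avg) ** 2             # real skill differences
    return math.sqrt(ability_var / (random_var + ability_var))

for n in (1, 10, 25):
    print(n, round(at_bat_validity(n), 2))  # 0.11, 0.34, 0.5
```

Because the luck term grows linearly with n while the skill term grows with n squared, validity always improves with more at-bats ... but only as roughly the square root, which is why going from 10 to 25 observations buys so little.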

The benefit of this example is that we now have a common-sense way of understanding how good a test really is. It is easy for the non-mathematician to appreciate the idea of hiring a baseball player based on a limited number of at-bats. To be sure, in ten trips to the plate, a star player is more likely to shine than a journeyman ... and yet, on any given day, a mediocre player can have a bit of a lucky streak. The year Ted Williams hit .400, he often went 2 for 10 in a given string of appearances. And in a nutshell, that is really the strength and the weakness of any test with a validity in the range of .30 to .40 ... on average, you will certainly get a better outcome than random chance. But there is little assurance that you will hire Ted Williams, even when he shows up at the plant gate looking for a job.
