Introduction to Stats for Assessment

Psychological Assessment is a complex process. It differs from the work of a Psychometrician, who simply administers the tests and interprets the scores in a standardized manner. For the Psychometrician, a score of 85 on a test means the same for everyone, and the interpretation of this score has nothing to do with the reason for referral. Example: An MSW gives a BDI…

Psychological Assessment can be conceptualized as a process of reflection, decision-making, evaluation, and integration to answer a referral question, followed by reporting the results of this process. Reflection involves considering the reason for referral, or what the referral source and client want to know. Decision-making entails determining which assessment methods will allow the psychologist to answer the question. Evaluation is composed of four basic elements: standardized testing, interviews, observations, and informal assessment. Reporting the results entails numerous concerns too, but we'll discuss them later.

Standardized testing, according to Cronbach, is when the tester's words and acts, the apparatus, and the scoring have been fixed so that the scores collected at different times and places are fully comparable. Standardized tests generally have a set of norms, or a collection of scores describing average performance, as well as performance above and below the mean.

Interviewing allows the psychologist to gather extensive information beyond that gathered by a standardized test. Current level of functioning, highest past level of functioning, and future expectations can be assessed using an interview.

Observation refers to observing a client in a natural, or somewhat natural, environment. How they react to others, to their environment, and to their own abilities can be assessed using observation.

Informal assessment can at times supplement standardized testing, interviewing, and observation. However, since such assessment measures are of questionable reliability and validity they should be used with caution.

All four are interrelated, however.

You should keep in mind that:

tests are samples of behavior, and do not directly reveal traits or capacities. Rather, they allow inferences to be made about the testee.

test scores and levels of performance are affected by fatigue, anxiety, distress, motivation, and cooperation, all of which may vary from one testing session to another. Thus, alternate interpretations or inferences have to be considered.

Suppose you were asked to measure your height. You stand against a wall, place a pencil on your head, and make a mark on the wall level with the top of your head. Then, you measure from the floor to the mark. This isn't measuring your height; it's measuring something that represents your height. Likewise, standing on a bathroom scale doesn't measure your weight. It measures the effect your weight has on something. The difference might seem small, but for psychological testing, especially IQ testing, it is an important difference, as we are not always sure how close to perfect (or how far from it) our measurement is.

test results must be interpreted in light of an individual's cultural background, primary language, and any handicapping conditions.

Statistics
There are four kinds of scales on which you can "measure" things.

Nominal scales consist of non-ordered categories, like apples, rocks, and furniture.
Ordinal scales allow for categories that can be ordered or ranked, such as first, second, and third.
Interval scales include an arbitrary 0 point and use equal units. Intelligence tests fall into this category, since an increase from 110 to 120 is the same as an increase from 70 to 80. However, someone with an IQ of 150 is not twice as smart as someone with an IQ of 75.
Ratio scales have a true 0 point and allow for ratios to exist. Someone who weighs 150 lbs. is twice as heavy as someone who weighs 75 lbs., and someone who weighs 0 lbs. has no weight. Very few psychological scales are ratio scales.

The mean, median, and mode refer to the arithmetic average, the middle point, and the most frequent score.

The range represents the distance between the highest and lowest score.

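Variance refers to a statistical measure of the amount of variation in a set of scores. Its formula is:

S² = Σ(X − M)² / (N − 1)

where X is each score, M is the mean, and N is the number of scores. (Dividing by N instead of N − 1 gives the variance of a complete population rather than a sample.)

The standard deviation is the square root of the variance. Its formula is S = √S².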
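To make these descriptive statistics concrete, here is a minimal sketch using Python's standard statistics module. The scores are invented for illustration, and the variance is assumed to be the sample form (dividing by N − 1).

```python
import statistics

# Hypothetical raw scores, invented purely for illustration.
scores = [85, 90, 90, 95, 100, 105, 110]

mean = statistics.mean(scores)            # arithmetic average
median = statistics.median(scores)        # middle score
mode = statistics.mode(scores)            # most frequent score
score_range = max(scores) - min(scores)   # highest score minus lowest score

# Sample variance: sum of squared deviations from the mean, divided by N - 1.
variance = statistics.variance(scores)
# Standard deviation: the square root of the variance.
std_dev = statistics.stdev(scores)

print("mean:", round(mean, 2))
print("median:", median)
print("mode:", mode)
print("range:", score_range)
print("variance:", round(variance, 2))
print("standard deviation:", round(std_dev, 2))
```

If the scores made up an entire population rather than a sample, statistics.pvariance and statistics.pstdev (which divide by N) would be the ones to use.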

The normal curve, or bell-shaped curve, refers to a symmetrical distribution of scores where the mean, median, and mode are all equal. That is to say, the arithmetic average, the middle of the distribution, and the most frequently earned score are all the same. Further, the curve is shaped such that about 68% of the scores fall within one standard deviation of the mean, another 27% fall between 1 and 2 standard deviations from the mean, and nearly all of the remaining 5% fall between 2 and 3 standard deviations from the mean.
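The percentages quoted for the normal curve are rounded. As a quick check, this sketch uses NormalDist from Python's standard library to compute the share of scores within 1, 2, and 3 standard deviations of the mean. The helper name share_within is ours, chosen for illustration.

```python
from statistics import NormalDist

# Standard normal curve: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of scores within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

within_1 = share_within(1)
within_2 = share_within(2)
within_3 = share_within(3)

print(f"within 1 SD:       {within_1:.1%}")
print(f"between 1 and 2 SD: {within_2 - within_1:.1%}")
print(f"between 2 and 3 SD: {within_3 - within_2:.1%}")
print(f"beyond 3 SD:        {1 - within_3:.1%}")
```

The small share beyond 3 standard deviations is why the 68/27/5 breakdown is best read as a rule of thumb.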

A correlation tells us the strength and direction of association between two variables. The Pearson r ranges from -1 to +1, and a correlation of 0 tells us there is no linear relationship between the two variables. Correlations must be squared to tell you the percentage of the variance accounted for. Thus, a correlation between A and B of .8 means that 64% of the variation in B is accounted for by its relationship with A. Note, you will never be able to account for all the variation.
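Here is a small sketch of how a Pearson r and its square could be computed by hand. The two sets of scores are made up so that r comes out to .8, matching the example; the function name pearson_r is our own.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations from each mean.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / math.sqrt(ss_x * ss_y)

# Hypothetical scores for five clients on two measures, A and B.
measure_a = [1, 2, 3, 4, 5]
measure_b = [2, 1, 4, 3, 5]

r = pearson_r(measure_a, measure_b)
r_squared = r ** 2  # proportion of variance accounted for

print(f"r = {r:.2f}, r squared = {r_squared:.2f}")
```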

The raw score is the actual number of correct answers given by the client. It is generally changed into a derived score based upon a norm group. There are three variables to consider when evaluating a norm group:

1) representativeness
2) size
3) relevance

Age-equivalent and grade-equivalent scores are obtained by finding the average raw score earned by people of a given age or grade. When your score is compared to these averages, an estimate of your mental age or grade level can be obtained. Age- and grade-equivalent scores should be used with caution, since small differences in raw scores can make large differences in age-equivalent or grade-equivalent scores. Further, many age- and grade-equivalent scores are estimated by interpolating between, or extrapolating beyond, the age and grade groups actually tested.

Originally, intelligence quotients were actual quotients, obtained by dividing the mental age by the chronological age and multiplying by 100. However, as noted earlier, intelligence scores are not ratio data. A 20 point difference at age 20 is not the same as a 20 point difference at age 10. As a result, this method of determining an IQ was abandoned.
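The original ratio IQ is simple arithmetic, and a short sketch shows one reason it ran into trouble. The ages below are invented: the same two-year lead in mental age produces a different quotient at age 10 than at age 20.

```python
def ratio_iq(mental_age, chronological_age):
    """Original ratio IQ: mental age divided by chronological age, times 100."""
    return mental_age * 100 / chronological_age

# Hypothetical clients, each two years ahead of their chronological age.
iq_at_10 = ratio_iq(mental_age=12, chronological_age=10)
iq_at_20 = ratio_iq(mental_age=22, chronological_age=20)

print(iq_at_10)  # 120.0
print(iq_at_20)  # 110.0
```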

Percentile ranks allow us to determine where an individual's score will fall in a sample of scores using percentages.
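As a rough sketch, a percentile rank can be computed by counting how many scores in the norm sample fall below the individual's score. The convention used here (counting half of any tied scores) is one common choice and is an assumption on our part, as is the made-up norm sample.

```python
def percentile_rank(score, norm_scores):
    """Percentage of norm scores below the score, counting half of any ties."""
    below = sum(1 for s in norm_scores if s < score)
    tied = sum(1 for s in norm_scores if s == score)
    return 100 * (below + 0.5 * tied) / len(norm_scores)

# Hypothetical norm sample of ten scores.
norm_sample = [70, 80, 85, 90, 95, 100, 100, 105, 110, 120]

rank = percentile_rank(100, norm_sample)
print(rank)  # 60.0
```

Here five scores fall below 100 and two are tied with it, so the score sits at the 60th percentile of this small sample.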

Standard scores are scores that have been converted into a new score distribution where the mean and the standard deviation are predetermined. Typically, when we convert to standard scores the mean is 100 and the standard deviation is 15.

A z-score is a type of standard score in which the score is expressed in standard deviation units from the mean. A z-score of +1.2 means that the score falls 1.2 standard deviations above the mean.

A T-score is another standard score where the mean is 50 and the standard deviation is 10.

These transformations of the raw score can be compared. An IQ of 130 is 2 standard deviations above the mean, since the mean is 100 and the standard deviation is 15. Likewise, it equals a T-score of 70 (50, the mean, plus two standard deviations of 10). It is also equal to a z-score of +2. This score is equal to about the 98th percentile: 68% of the scores fall within one standard deviation of the mean, another 27% fall between 1 and 2 standard deviations, and of the remaining 5%, half (2.5%) fall more than 2 standard deviations below the mean. Thus, 68% plus 27% plus 2.5% is 97.5%, or roughly the 98th percentile.
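The conversions for an IQ of 130 can be sketched in a few lines. The helper names to_z and from_z are ours, and the percentile assumes the scores follow a normal curve.

```python
from statistics import NormalDist

def to_z(score, mean, sd):
    """Express a score as standard deviations above or below the mean."""
    return (score - mean) / sd

def from_z(z, mean, sd):
    """Convert a z-score into a scale with the given mean and SD."""
    return mean + z * sd

iq = 130
z = to_z(iq, mean=100, sd=15)         # IQ scale: mean 100, SD 15
t_score = from_z(z, mean=50, sd=10)   # T-score scale: mean 50, SD 10
percentile = NormalDist().cdf(z) * 100  # percent of scores below z

print(z)                     # 2.0
print(t_score)               # 70.0
print(round(percentile, 1))  # 97.7
```

The exact figure of 97.7 is slightly higher than the 97.5 reached by adding the rounded percentages, and both round to the 98th percentile.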

Reliability and Validity
Reliability refers to the extent to which a test score or finding can be obtained again under similar conditions. A reliability coefficient of .80 is generally considered strong. Because your score on a test, if taken 10 times, will never be exactly the same each time, there is some variation within a subject's scores. If a large portion of the variation is due to fluctuations in the construct itself, then the reliability of a test measuring that construct will never be good. On the other hand, some variation due to error will always occur. This variation results from stress, fatigue, variations in functioning even when the trait is stable… The rest of the variation is thought to be true variation associated with the construct. So, an obtained score is equal to some true score plus some error score, just as the total variance is ideally the sum of true variance and error variance.

There are several types of reliability:

test-retest - test now and again later and compare
alternate forms - test with two equal versions and compare
internal consistency - how well do the items "stick together"

There are several factors that affect reliability:

test length - longer is better
test-retest interval - shorter time is better
variability of scores in the sample - wider range is better
effect of guessing - less guessing is better
variation of the test setting - greater standardization is better

The standard error of measurement (SEM) is the general error we make in determining a person's "true" score. It is defined as:

SEM = SD × √(1 − r)

That's the standard deviation times the square root of 1 minus the reliability coefficient (r). The poorer the reliability, then, the greater the SEM. The greater the reliability, the lower the SEM.
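A minimal sketch of the SEM calculation follows. The standard deviation of 15 and the three reliability coefficients are illustrative values, not figures from any particular test.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SEM: standard deviation times the square root of (1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

# Same standard deviation, three hypothetical reliability coefficients.
sem_high = standard_error_of_measurement(sd=15, reliability=0.90)
sem_mid = standard_error_of_measurement(sd=15, reliability=0.80)
sem_low = standard_error_of_measurement(sd=15, reliability=0.50)

print(round(sem_high, 2))
print(round(sem_mid, 2))
print(round(sem_low, 2))
```

As the reliability drops from .90 to .50, the SEM more than doubles, which is the relationship described above.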

Confidence intervals tell us the probability that a subject's true score lies within a range of numbers. A confidence interval is defined as:

CI = obtained score ± z(SEM)

When the level of confidence is	z is
68% (or 1 standard deviation)	1.00
85%	1.44
90%	1.65
95%	1.96
99%	2.58

So, you take a test and get a score of 92. We don't expect you to get a 92 the next time, since the test is not perfect. Say the SEM is 7 points. We are 68% sure that, if tested again, your score would fall between 85 and 99 (92 minus 7 and 92 plus 7), and 95% sure it would fall between about 78 and 106 (92 minus 1.96 × 7 and 92 plus 1.96 × 7). See what we mean?
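The same arithmetic can be sketched in code, using the obtained score of 92, the SEM of 7, and the z values from the table. The function and dictionary names are ours.

```python
def confidence_interval(obtained_score, sem, z):
    """Return the (lower, upper) bounds: obtained score plus or minus z times SEM."""
    margin = z * sem
    return obtained_score - margin, obtained_score + margin

# z values for each level of confidence, taken from the table above.
Z_FOR_CONFIDENCE = {68: 1.00, 85: 1.44, 90: 1.65, 95: 1.96, 99: 2.58}

low_68, high_68 = confidence_interval(92, 7, Z_FOR_CONFIDENCE[68])
low_95, high_95 = confidence_interval(92, 7, Z_FOR_CONFIDENCE[95])

print(round(low_68, 2), round(high_68, 2))  # 85.0 99.0
print(round(low_95, 2), round(high_95, 2))  # 78.28 105.72
```

Rounding the 95% bounds to whole numbers gives the 78 and 106 used in the example.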

Validity is the "truth" of the score. High validity means that we are truly assessing what we think we are. There are several types:

content or do the questions really measure the construct well enough
criterion-related or do the test scores predict performance on some other criterion--there are two kinds
- concurrent or a criterion we collect at the same time or just before the test
- predictive or a criterion we collect at a future time
construct or does it measure a real trait
incremental or does this test add to our understanding sufficiently to warrant using it
conceptual or does it allow us to make accurate statements about specific clients (as opposed to the "average" person)

Several factors affect validity:

test taking skills, anxiety, motivation, speed, etc…
reliability of the criterion used
intervening events like history, maturation, environmental changes…
reliability of the test itself

Factor Analysis is a method of analyzing variance or correlations between tests and measures. You try to account for it in as few pieces as possible. A cake could be separated into outside icing layer, inside cake layer 1 and 2, and inside icing layer. That's four separate factors that make up a cake. Alternately, air pockets, cake crumbs, moisture, and icing could be used. That's four different and separate factors that make up a cake. Alternately, you might say eggs, flour, baking soda, sugar, butter, and cream.

Why this is important to intelligence testing will become clearer after we discuss the factors of the IQ test.