Inter-Rater or Inter-Observer Reliability
Whenever you use humans as a part of your measurement procedure, you
have to worry about whether the results you get are reliable or
consistent. People are notorious for their inconsistency. We are easily
distractible. We get tired of doing repetitive tasks. We daydream. We
misinterpret.
So how do we determine whether two observers are being consistent in
their observations? You probably should establish inter-rater
reliability outside of the context of the measurement in your study.
After all, if you use data from your study to establish reliability,
and you find that reliability is low, you're kind of stuck. Probably
it's best to do this as a side study or pilot study. And, if your study
goes on for a long time, you may want to reestablish inter-rater
reliability from time to time to assure that your raters aren't
changing.

There are two major ways to actually estimate inter-rater reliability.
If your measurement consists of categories -- the raters are checking
off which category each observation falls in -- you can calculate the
percent of agreement between the raters. For instance, let's say you
had 100 observations that were being rated by two raters. For each
observation, the rater could check one of three categories. Imagine
that on 86 of the 100 observations the raters checked the same
category. In this case, the percent of agreement would be 86%. OK, it's
a crude measure, but it does give an idea of how much agreement exists,
and it works no matter how many categories are used for each
observation.
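As a quick illustration, here is a minimal Python sketch of that percent-of-agreement calculation. The category labels and ratings below are made up for the example; in practice you would substitute the two raters' actual codings.

# Minimal sketch of percent agreement between two raters.
# The categories and ratings are invented for illustration only.
rater_a = ["on-task", "off-task", "on-task", "disruptive", "on-task"]
rater_b = ["on-task", "off-task", "disruptive", "disruptive", "on-task"]

matches = sum(a == b for a, b in zip(rater_a, rater_b))
percent_agreement = 100.0 * matches / len(rater_a)

print(f"Agreement on {matches} of {len(rater_a)} observations "
      f"({percent_agreement:.0f}%)")

With the text's example of 86 matching judgments out of 100 observations, the same calculation would print 86%.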
The other major way to estimate inter-rater reliability is appropriate
when the measure is a continuous one. There, all you need to do is
calculate the correlation between the ratings of the two observers. For
instance, they might be rating the overall level of activity in a
classroom on a 1-to-7 scale. You could have them give their rating at
regular time intervals (e.g., every 30 seconds). The correlation
between these ratings would give you an estimate of the reliability or
consistency between the raters.
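Here is a similar sketch for the continuous case, computing the Pearson correlation between two raters' 1-to-7 activity ratings. The rating values are invented for illustration, and it uses the standard-library function statistics.correlation, which assumes Python 3.10 or later.

# Minimal sketch of inter-rater reliability for a continuous measure:
# the correlation between two raters' 1-to-7 activity ratings taken at
# regular intervals. The ratings below are hypothetical.
from statistics import correlation  # requires Python 3.10+

rater_1 = [4, 5, 3, 6, 4, 2, 5, 7, 3, 4]
rater_2 = [4, 6, 3, 5, 4, 3, 5, 6, 2, 4]

r = correlation(rater_1, rater_2)
print(f"Inter-rater correlation: r = {r:.2f}")

A correlation near 1.0 suggests the two observers are rating consistently; a low correlation signals that they are interpreting the scale differently.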
You might think of this type of reliability as "calibrating" the
observers. There are other things you could do to encourage reliability
between observers, even if you don't estimate it. For instance, I used
to work in a psychiatric unit where every morning a nurse had to do a
ten-item rating of each patient on the unit. Of course, we couldn't
count on the same nurse being present every day, so we had to find a
way to assure that any of the nurses would give comparable ratings. The
way we did it was to hold weekly "calibration" meetings where we would
have all of the nurses' ratings for several patients and discuss why
they chose the specific values they did. If there were disagreements,
the nurses would discuss them and attempt to come up with rules for
deciding when they would give a "3" or a "4" for a rating on a specific
item. Although this was not an estimate of reliability, it probably
went a long way toward improving the reliability between raters.