Social Research: Guides to a Decision-Making Process
     by Susan Gustavus Philliber, Mary (Schwab) Bast, and G. Sam Sloss, 1980,
     Peacock Publishers, Inc.

Chapter 3: Measurement

As the process of research shifts from theoretical consideration to actually employing empirical methods to "know" about the social world, measurement of variables becomes crucial. Most basically, measurement means the assignment of symbols to observable phenomena. Of course, before any variable can be measured, it must be defined.

Suppose, for example, we're studying "prejudice" and define this variable as "tenacious feelings of hostility toward members of another group based on little or incorrect information about that group." This theoretical definition of the concept "prejudice" conveys the meaning of the word. Specifying such a theoretical definition is important because it allows us to relate our empirical findings about prejudice to a body of theory. But measurement considerations arise when we ask how we know when feelings of hostility are present. How can these feelings (or X) be differentiated and categorized separately from feelings that are not prejudice (not-X)?

The concept of "prejudice" must also have an operational definition, a definition that specifies how the concept is going to be measured. There are obviously many options for converting any theoretically defined concept into precise, unambiguous observable phenomena. In the study of prejudice, for example, you might directly observe people interacting with members of other groups and differentiate those who are prejudiced from those who aren't on the basis of specific behavior. On the other hand, you could ask people a series of questions about their feelings toward other groups and combine their answers in some way to create a score. Which of these or many other options is chosen, that is, how theoretical variables are operationally defined, is perhaps the most crucial step in the research process. In the remainder of this chapter we'll deal with guidelines for, and problems with, social research measurement.

LEVELS OF MEASUREMENT

To measure the variables that are part of any research problem, we must ask what kinds of values or states each variable assumes. We've already learned that variables may be qualitative or quantitative. A qualitative variable is also called a nominal variable. A nominal variable comes in categories or states, but those categories or states are not ordered in any way. For example, the variable "religion" may be Christian, Jewish, Muslim, Hindu, Buddhist or several other states, but Hindu is not more religion than Buddhist or vice versa. A nominal variable comes only in kinds, not in quantities.

There are three kinds of quantitative variables: ordinal, interval, and ratio. Ordinal variables come in categories that may be ranked relative to one another. For example, social class status categories can be ordered as high, middle, or low because they indicate more or less social class status.

Interval variables come in categories that can be ranked relative to one another, and the categories are also all the same size and the same distance apart. An example of an interval variable is degrees of temperature on a Fahrenheit scale.

Finally, ratio variables have all the properties of interval variables and have an absolute zero point as well. An absolute zero point indicates that the zero value has meaning. It represents the absence of the property or variable altogether, such as zero dollars of income. In social science research a true interval scale that is not also a ratio scale is rare. Therefore, some think of interval and ratio variables together as one kind of scale simply referred to as interval-ratio.

Decisions about the appropriate level of measurement for each variable are crucial, particularly at the data analysis stage. If you can measure a variable in an interval fashion, for example, you can add and subtract values of the variable. If it can be measured with a ratio scale, you can multiply and divide values of it. But if the variable is nominal or ordinal, such procedures would make no sense. To give a ludicrous example, how much are one Catholic, two Baptists, and three Methodists?

To be able to identify the appropriate level of measurement for a variable, you should first develop the habit of asking yourself whether or not the variable comes in amounts of any kind. Is it reasonable to think of more and more, higher and higher, longer and longer, and so on? If not, the variable is nominal. The symbols assigned to nominal variables may be words, labels, or other devices for categorizing differentiated states of the variable.

On the other hand, if it's possible to think of the variable values as being more and more, higher and higher, or the like, the next question is whether it would be reasonable to think of the variables in equal-sized units, and whether a zero value for this variable would mean anything. Returning to our prejudice example, it does seem likely that some people are more prejudiced than others. It's also at least theoretically possible some individuals have no (or zero) prejudice. In this case, then, we know we should try to construct an interval-ratio measure of prejudice if at all possible.

In assigning symbols to values of variables at nominal, ordinal, interval, or ratio levels, two other general considerations need to be remembered. The scale devised for categorizing values of variables must allow for all values to be assigned a mutually exclusive and exhaustive set of symbols. Measures are mutually exclusive if it's not possible to assign any given value to more than one category or number. For example, a measure designed to differentiate and categorize values of "religious affiliation" must not contain both "Protestant" and "Baptist." Because Baptists are a subset of Protestants, the categories or symbols are not mutually exclusive.

Scale values are exhaustive if they permit assignment of all values of the variable. If a measure only contained the symbols "Protestant," "Catholic," and "Jewish," then those who are "Buddhists" could not be assigned a scale value. This problem can be remedied by the addition of an "Other" label or by listing more religions in the scale.

We've been using nominal variables to illustrate the importance of mutually exclusive and exhaustive measures, but these considerations apply to other levels of measurement as well. A measure that classifies income into categories beginning with $30,000 doesn't provide for the exhaustive classification of individuals who have incomes of less than $30,000. The questions to ask yourself are: "Can I classify all my observations with this measure?" and "Can I unequivocally put observations into groups that don't overlap?"
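
To make the two requirements concrete, here is a minimal Python sketch with hypothetical income brackets; every possible value falls into exactly one category (mutually exclusive) and no value is left out (exhaustive):

    def income_category(income_dollars: float) -> str:
        """Assign an income to exactly one category while covering every value."""
        if income_dollars < 0:
            raise ValueError("income cannot be negative")
        if income_dollars < 30_000:          # without this bracket the scheme
            return "under $30,000"           # would not be exhaustive
        if income_dollars < 60_000:
            return "$30,000 to $59,999"
        if income_dollars < 90_000:
            return "$60,000 to $89,999"
        return "$90,000 and over"

    print(income_category(12_500))   # under $30,000
    print(income_category(60_000))   # $60,000 to $89,999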

SCALING TECHNIQUES

Many variables can be easily measured by asking one question or making one observation. Such single-item measures are most useful when a single opportunity to assess a variable's value is likely to be sufficient, including age, gender, marital status, number of children, or income. Single-item measures can be used for factual data or even for preferences such as "Which candidate do you favor for election?"

Some variables are more complex, requiring a variety of opportunities to measure them adequately. Examples of variables that require multiple-item measures are authoritarianism, political liberalism, and anxiety. If individuals are to be placed on quantitative scales and compared to one another, it often becomes necessary to ask a variety of questions that allow such ranking. For example, examinations in college courses attempt to measure "learning" in a course of study, and teachers rarely assume such measurement is possible with a single question.

There are many multiple-item scales already available. They're so plentiful, in fact, that several may already have been developed for commonly measured variables. A good literature review should unearth scales already in use for your area of interest. Many summary volumes list and describe available social science measuring devices.

On occasion, however, you may have to invent a multiple-item scale. Constructing measures of quantitative variables may proceed in a variety of ways. The purpose of all the scales discussed here is the relative ranking of cases on values of variables by using multiple items. Each of these techniques has limitations as a measurement strategy, and they vary from rather complex to exceedingly simple. It isn't possible to determine in an absolute sense whether these scales provide for interval or ratio rather than ordinal measurement, but in practice they're most often dealt with as yielding interval-ratio measures.

Forced Ranking Scales

One of the simplest measures is the forced ranking scale, where someone is given a list of items and asked to rank them on a dimension. For example, Rokeach developed a list of seventeen values including equality, freedom, salvation, happiness, and security. Respondents are asked to rank these values in their order of importance. The advantage of this format is that each of the items can be ranked against the others. You can determine if a respondent thinks freedom is more important than salvation or vice versa.

Forced ranking scales have been used to rank a variety of dimensions such as identifying social problems considered critical by the population, providing rankings of the prestige of occupations, and ranking individuals in terms of power. In such instances, the researcher's primary goal may be simply to obtain rankings on a particular dimension. In other instances this ranking procedure may be only a first step in the development of other scales.

The disadvantage of forced ranking is that it does not measure how important each of the items is to the respondent. Someone who ranks salvation as more important than freedom may think neither is very important. Or maybe the respondent considers both very important and the chosen ranking resulted from only minor differences.

Summated Ratings

One of the most popular techniques for constructing a scale is the summated rating. Individual scores are obtained by summing responses to a number of items intended to measure the same concept. For example, suppose you want to measure attitudes toward people who've been in prison. You might give respondents a list of statements about ex-convicts and ask them to choose "yes" or "no" or "agree" or "disagree" for each item. For example:

Ex-convicts generally try hard to be contributing members of society.
Ex-convicts are good employees.

Favorable responses could then be assigned a score of 1, and unfavorable responses a score of 0. If there were ten items, the highest score of 10 would represent the most favorable attitude toward ex-convicts. Individuals could then be ranked and compared according to their scores. While both items in the example are worded so a "yes" response indicates a favorable attitude, it's actually not necessary or desirable to present all items this way. Rather, some items should be worded so a "no" response indicates favorability. This discourages respondents from filling out the scale by simply checking off all "yes" responses after having read a few items.
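
As a rough illustration, here is a short Python sketch of the scoring just described, using a hypothetical ten-item scale and made-up answers, including items keyed so that "no" is the favorable response:

    # Hypothetical ten-item yes/no scale; True in REVERSED means a "no"
    # response to that item indicates a favorable attitude.
    REVERSED = [False, False, True, False, True, False, False, True, False, False]

    def summated_score(responses):
        """responses: list of "yes"/"no" answers, one per item."""
        score = 0
        for answer, reverse in zip(responses, REVERSED):
            favorable = (answer == "no") if reverse else (answer == "yes")
            score += 1 if favorable else 0
        return score  # 0 = least favorable, 10 = most favorable

    print(summated_score(["yes", "yes", "no", "yes", "no",
                          "yes", "no", "no", "yes", "yes"]))  # 9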

Items for summated rating scales should be administered to a pretest sample similar to those who will eventually respond to the scale. To evaluate the utility of each item for final inclusion, the response to each item must be compared to overall scale responses. As a general guideline, each item should discriminate between those who score high and those who score low on the scale. In the scale assessing attitudes toward ex-convicts, for example, we might find that respondents with both high total scores (say 9 or 10) and low total scores (say 1 or 2) agree with the item, "Ex-convicts generally try hard to be contributing members of society." In such a case we'd say the item doesn't discriminate between individuals with favorable attitudes and individuals with unfavorable attitudes, and we'd eliminate the item from the scale.
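
A minimal Python sketch of this discrimination check, using hypothetical pretest data and arbitrary high and low cut-offs, might compare agreement rates among high and low scorers:

    def discriminates(item_index, pretest, high_cut=9, low_cut=2):
        """pretest: list of (item_scores, total_score) pairs from a trial run.
        Returns the proportion agreeing with one item among high and low scorers."""
        def agreement_rate(group):
            return sum(group) / len(group) if group else None
        high = [scores[item_index] for scores, total in pretest if total >= high_cut]
        low = [scores[item_index] for scores, total in pretest if total <= low_cut]
        return agreement_rate(high), agreement_rate(low)

    # Hypothetical pretest: each tuple is (per-item 0/1 scores, total score)
    pretest = [
        ([1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 10),
        ([1, 1, 1, 1, 1, 1, 1, 1, 0, 1], 9),
        ([1, 0, 0, 0, 0, 0, 0, 0, 0, 0], 1),
        ([1, 1, 0, 0, 0, 0, 0, 0, 0, 0], 2),
    ]
    print(discriminates(0, pretest))  # (1.0, 1.0): item 0 fails to discriminate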

Comparing individual scale items with total scores thus provides a check on those items and helps repair wording. In the example above, nearly universal agreement with the statement could have been induced by the nonspecific phrase "generally try hard," which may allow more agreement than would the statement "Ex-convicts are likely to be contributing members of society."

Although compelling in its simplicity, the summated ratings technique carries some problems. First, even after checking response distributions, there's no guarantee that items included in the scale actually measure the property in question. In fact, it's quite easy to slip items into scales that measure causes or consequences of variables. To belabor the ex-convict example a bit more, we might have included an item that reads: "The prison experience is detrimental to individuals." While people who are favorable toward ex-convicts may respond to this item differently from those who aren't favorable, the item may measure attitudes toward imprisonment and not attitudes toward ex-convicts. Even though beliefs about the prison experience may influence attitudes toward ex-convicts, these beliefs and attitudes are not the same.

It's also problematic that identical scores may be obtained with different response patterns in a summated ratings scale. For example, a score of 5 could be produced by favorable responses to statements 1, 3, 7, 9, and 10 or to statements 2, 3, 5, 7, and 8. If we've eliminated non-discriminating items, this won't happen in very many instances. But it can happen.

The summated ratings technique permits each scale item to make an identical contribution to the total scale score. Theoretically, however, our attitudes toward any object are probably not equally influenced by all dimensions included in a scale. In our convict example, the assessment of ex-convicts as good employees gets the same weighting as assessment of their efforts to be contributing members of society. If items were included that asked about their potential to repeat their crimes or about their honesty, these would also get equal weight in the scale.

Likert Scales

A slightly different version of the summated rating technique is the Likert scale, where included items are subjected, as in summated ratings, to a check for their ability to discriminate. However, a wider range of response choices is permitted for each item in a Likert scale. Figure 3-1 shows examples of Likert items with several potential response patterns:

_____________________________________________________________________

Overall, I rate the quality of instruction I've received in this course as:

   First response set:
   ____ Excellent
   ____ Good
   ____ Fair
   ____ Poor

   Second response set:
   ____ Excellent
   ____ Good
   ____ Average
   ____ Fair
   ____ Poor

   Third response set:
   ____ Above average
   ____ Average
   ____ Below average

I recommend that others take this course:

   First response set:
   ____ Strongly agree
   ____ Agree somewhat
   ____ Disagree somewhat
   ____ Strongly disagree

   Second response set:
   ____ Strongly agree
   ____ Agree
   ____ Neither agree nor disagree
   ____ Disagree
   ____ Strongly disagree

   Third response set:
   ____ Agree
   ____ No opinion or don't know
   ____ Disagree

Figure 3-1. Examples of Likert Items
_____________________________________________________________________

Likert scale items may have any number of response choices, but the practical range is from three to seven. Fewer than three responses limits the researcher to a dichotomy, and the scale becomes a simple summated ratings scale. The upper limit of responses is dictated by how many responses can be differentiated by people filling out the scale. With more than seven choices, and perhaps with fewer, it becomes difficult to make adequate distinctions along the response continuum.

Some of the response sets for the two items in Figure 3-1 contain a central response that's a neutral or "don't know" choice. For measurement tasks where all respondents should know the answer to the question, including such a category makes little sense; the first item in Figure 3-1 is of that kind. At other times the decision whether to include a neutral category is more complicated. Some researchers argue that people do sometimes have "neutral" responses, and prefer to provide an uneven number of responses, with the middle category representing a response that's somewhere between agreement and disagreement. Others argue it's important to force a choice between the positive and negative ends of a dimension by providing an even number of responses with no neutral category. In this way we see the direction of respondents' values, even if the difference between "agree somewhat" and "disagree somewhat" is not very great.

The sets in Figure 3-1 that have a neutral point demonstrate a switch in direction of the verb on either side of the neutral point. Thus "neutral or don't know" is flanked by "agree" on the one side and "disagree" on the other. The response choices lacking a neutral point, such as those beginning with "excellent," present degrees of response along a single dimension. Whether the scale contains a neutral point or not, switching from modified agreement to modified disagreement may create something of a logical problem as illustrated in the first set of choices to the second question. Logically speaking, "agree somewhat" and "disagree somewhat" are equivalent. While this logical problem seems somewhat sticky, respondents are able to distinguish between such responses because the context shows the items are part of a continuum and occupy different positions on that continuum.

So you have great latitude in constructing response choices to Likert items. The only clear requirement is that the response choices reflect a single dimension and thus allow mutually exclusive and exhaustive responses. The following Likert item and responses provide an example of violating these criteria:

Overall, how do you rate this teacher's ideas?

very creative
pretty good
average
fairly disorganized
very poor

The response choices mix a general assessment, represented by such responses as "pretty good" and "very poor," with creativity and disorganization. The problem can be avoided by stating in the item what dimension is intended and providing response choices consistent with that dimension. For example, the item might have read:

Overall, how do you rate the organization of this teacher's ideas?

very organized
fairly organized
fairly disorganized
very disorganized

Other questions might follow that allow assessment of creativity of ideas or other dimensions of interest.

Once items have been constructed with logical response choices that discriminate between respondents, Likert scale items are scored by assigning numbers to response choices and summing these across items, as in the summated ratings scale. For example, in the question above, "very organized" might receive a score of "4," "fairly organized" a score of "3," and so on. If there are twenty items in the scale, each with a potential score ranging from 1 to 4, then total scores can range between 20 and 80. As in summated ratings scales, items or responses should be alternately worded in positive and negative directions so the first response choice is not always the most favorable (or unfavorable).
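
A brief Python sketch of this scoring arithmetic (the response wording, weights, and reverse-keying below are hypothetical):

    # Hypothetical four-point response key; reverse-worded items use 5 - value.
    VALUES = {"very organized": 4, "fairly organized": 3,
              "fairly disorganized": 2, "very disorganized": 1}

    def likert_total(responses, reversed_items=frozenset()):
        total = 0
        for i, answer in enumerate(responses):
            value = VALUES[answer]
            if i in reversed_items:
                value = 5 - value      # flip direction for negatively worded items
            total += value
        return total                    # with 20 items, totals fall between 20 and 80

    answers = ["very organized"] * 20
    print(likert_total(answers))        # 80, the maximum possible score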

While respondents seem to prefer Likert scales to summated ratings scales because of the range of responses permitted, neither procedure allows absolute confidence that all items included measure the variable intended.

Guttman Scales

In the preceding examples it's been necessary to assume that so-called "ordinal" scales reflect an accurate ranking of individuals according to their numerical score on a measure. In other words, if a Likert-type scale measures "prejudice," we assume an individual with a score of 40 is more prejudiced than one with a score of 36. The Guttman scalogram actually tests the accuracy of ranking scores using the following criterion: if individual A is ranked higher than individual B, then A should possess all the properties B possesses, plus at least one more. For example, if you're testing problem-solving ability in mathematics, you might have ten problems that increase in difficulty. You assume an individual who can solve the most difficult problem will also be able to solve the nine easier ones. On the other hand, you wouldn't expect someone who's unable to answer an easier problem to be able to answer any of the more difficult problems.

This type of ranking lends itself to social research as well. Suppose you were measuring prejudice and presented respondents with an array of statements from "I'd be willing to walk on the same street as a Muslim" to "I'd be willing to marry a Muslim." Obviously someone who agreed to the latter statement would also agree to the former. If the intervening statements also have order, this same individual would also agree to all the statements preceding the most unprejudiced one. Figure 3-2 shows the expected response patterns of five individuals (designated by letters) to four Guttman scale items:

_____________________________________________________________________

                                                         Individuals
                                                    A    B    C    D    E
I would be willing to marry a Muslim                1
I would be willing to have a Muslim to dinner       1    1         1*
I would be willing to work with a Muslim            1    1    1
I would be willing to sit on a bus next to a Muslim 1    1    1    1    1

Figure 3-2. Guttman Scalogram Responses of Five Individuals to Four
Attitude Items (* marks an error response, one that departs from the
expected pattern)
_____________________________________________________________________

A particular respondent may, for some reason, be willing to have a Muslim to dinner but not be willing to work with a Muslim. Individual D illustrates this "error." A set of items is generally accepted as having met the criterion for a Guttman scale or is said to be "unidimensional" if the responses include less than 10 percent error (or variation from the expected pattern) for the whole sample.

A statistic called the "coefficient of reproducibility" is calculated after the administration of any Guttman scale. This number represents the proportion of "correct" or non-error responses to the items and should be as high as possible to assure the items are ordered in the way intended. There's a general consensus among social scientists that this statistic should reach at least .90.
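
A minimal Python sketch of one common way of counting errors (comparing each respondent's answers with the pattern implied by his or her total score) and computing the coefficient of reproducibility, using the response patterns shown in Figure 3-2:

    def reproducibility(response_matrix):
        """response_matrix: one list of 0/1 answers per respondent, with items
        ordered from hardest (most extreme) to easiest. Errors are deviations
        from the pattern implied by each respondent's total score."""
        errors = 0
        total_responses = 0
        for answers in response_matrix:
            n, score = len(answers), sum(answers)
            expected = [0] * (n - score) + [1] * score   # endorse the easiest items
            errors += sum(a != e for a, e in zip(answers, expected))
            total_responses += n
        return 1 - errors / total_responses

    # Figure 3-2 data: individuals A-E, items ordered marry, dinner, work, bus
    data = [[1, 1, 1, 1],   # A
            [0, 1, 1, 1],   # B
            [0, 0, 1, 1],   # C
            [0, 1, 0, 1],   # D: the "error" pattern
            [0, 0, 0, 1]]   # E
    print(round(reproducibility(data), 2))   # 0.9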

Respondents are individually scored on Guttman scales by simply adding up the number of items they agree with. The response choices given are simple dichotomies, and items are not presented to respondents in their presumed order. Unlike the summated ratings and Likert scales, however, it's possible to predict an individual's response pattern from a total Guttman scale score. Barring errors, there's only one way to obtain each Guttman scale score. The problem remains, however, of including items that do not actually measure the variable in question. Even though the criterion of item order seems to lessen this problem, it's still possible for items to demonstrate unidimensional order in the Guttman sense and not really measure the same variable with all items.

Equal-Appearing Intervals

The summated ratings, Likert, and Guttman techniques produce ordinal-level rankings of individuals for the variables being measured. But there's no guarantee with any of these scales that people with scores of, say, 20 have twice as much of the variable as those with scores of 10. Still, scores from these scales are usually analyzed as interval-ratio data.

L.L. Thurstone attempted to create a true interval scale through the method of equal-appearing intervals. Constructing a scale using this technique begins by assembling a number of items intended to measure attitudes toward some object. These items are then given to a set of judges selected to resemble the target population. The judges are asked to sort these items into boxes or piles according to the degree to which each item is favorable or unfavorable toward the attitude object in question. The judges ignore their own agreement or disagreement with each statement, concentrating instead on how generally favorable or unfavorable the item is to the attitude object. The number of piles or boxes used may vary, but Thurstone suggests eleven. The technique derives its name from the fact that the boxes or piles are equal in size and are equidistant from one another.

Items that receive scattered placement by the judges are then eliminated, because this indicates the item's ambiguity or lack of clarity. The remaining statements are then assigned a scale value computed as the median value of the boxes or piles into which they've been placed. The final items are selected from these so as to include a wide range of values on the scale.

These items are given agree-disagree response choices and administered to those whose attitudes are to be measured. Individual scores are computed as either the mean or the median of the item values that are agreed with. Items are presented to respondents randomly so statements with high scale values are interspersed among those with low scale values. Thurstone argues that the resultant values provide interval level measurement.
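
As a rough Python sketch (the judge placements, statements, and respondent's answers are all hypothetical), item scale values are the medians of the judges' pile numbers, and a respondent's score is the mean of the values of the items agreed with:

    from statistics import median, mean

    # Hypothetical pile numbers (1 = most unfavorable, 11 = most favorable)
    # assigned by five judges to each candidate statement.
    judge_placements = {
        "Ex-convicts deserve a second chance":   [9, 10, 9, 8, 10],
        "Ex-convicts should be watched closely": [3, 4, 3, 2, 3],
        "I have no feelings about ex-convicts":  [6, 6, 5, 7, 6],
    }

    scale_values = {item: median(piles) for item, piles in judge_placements.items()}

    def thurstone_score(agreed_items):
        """Score a respondent as the mean scale value of the items agreed with."""
        return mean(scale_values[item] for item in agreed_items)

    print(scale_values)
    print(thurstone_score(["Ex-convicts deserve a second chance",
                           "I have no feelings about ex-convicts"]))  # 7.5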

Obviously the initial group of judges plays a critical role in constructing Thurstone scales, and should be as similar as possible to the final population to be measured. This helps avoid selective sorting and resulting scale values that don't apply. Most discussion of the Thurstone technique's utility has centered on whether it really creates an interval measure. Some argue that assurance on this point is not worth the extra effort of assembling judges to sort items. Others maintain that the Thurstone procedure makes sense as an interval technique. However, the problems we've described with other scales remain. It's possible to include items that don't really measure the intended variable, and individuals with quite different response patterns may arrive at similar scores.

SPECIAL APPLICATIONS OF SCALING

We've discussed the most commonly used scaling techniques. There are, in addition, special measurement strategies that adapt basic scaling methodology for unique purposes. Two such special applications of measurement are sociometry and the semantic differential.

Sociometry

If the concern of social science and sociology in particular is social interaction, then sociometry seems to be the supreme method of the discipline. Sociometry is a "who-to-whom" method of measuring reported or actual interaction preferences.

Sociometric data may be gathered in many ways. When gathered by questionnaire or interview, a question like this is usually asked: "Which three people in the classroom would you most like to sit next to?" On another dimension a question might read: "Name the three people you like best in your work group." Such questions can inquire about who has power, who gives advice, who's beautiful, who's most intelligent, whom the respondent thinks others evaluate favorably, or any number of other qualities. Thus, in addition to demonstrating who is chosen, sociometry can also elicit characteristics about people in the interaction, including characteristics of those doing the choosing.

When gathered by observation, sociometric data may consist of noting choices for lunch in a work setting or noting to whom people talk in the break room. Children may be observed and their choice of playmates recorded in a nursery school setting. Or the flow of e-mails in an organization may be plotted, noting both sender and receiver over a period of time. Available data on cell phone calls, past choices for trips or visits, or similar information can also be used. In each case the emphasis is upon interaction choices or preferences, either reported or actual.

The development of sociometry is most often credited to J.L. Moreno, who hypothesized the existence of "tele" -- an interpersonal attraction force that people supposedly emit. Sociometric methods were designed to measure this "tele," because it had obvious implications for predicting social interaction patterns. From that rather mystical but intuitively appealing beginning, sociometry users suggested that perhaps such data could measure underlying social structure in groups or communities. By identifying opinion leaders, communication patterns, friendships, and the like, these researchers claimed that mapping or documentation of these important social structures was possible. In addition, sociometric techniques have been used to study temporary group processes. For example, sociometric measures of friendship choices have illustrated that physical location in a group may influence popularity. In quite a different setting, the problem-solving group, sociometric choices have illustrated that most groups contain two types of leaders -- one who solves group problems related to the task at hand, and another who fulfills emotional needs of group members.

Two basic techniques can be used to organize sociometric data. The first is a sociogram, or diagram, of the interaction choices obtained by the data-collection procedure. For example, in a restaurant we might be interested in the personnel's interaction choices as an indication of their efficiency or cooperation. Suppose there are six waitresses, and we ask each to report "Who do you help while working?" We might then summarize their choices as in Figure 3-4:

_____________________________________________________________________

Figure 3-4. A Sociogram

_____________________________________________________________________

The sociogram shows there's not a full team effort here. Judy often receives help from others, while Mary and Susan weren't named by anyone. Jane and Kathy are isolated from the rest of the women. On the other hand, the sociogram in Figure 3-5 is quite different:

_____________________________________________________________________

Figure 3-5. A Sociogram

_____________________________________________________________________

In this case, all the women reported helping Mary. Perhaps Mary isn't doing her job or simply has too many duties to perform without help from others. In either case, the sociogram is a diagrammatic way to display interaction choices. Had we asked about an additional dimension of interaction, these results could be plotted using a different colored arrow or by using overlays. Comparison of diagrams may then enable a more in-depth look at interaction.

Another way to present sociometric data is to use a sociometric matrix, which is simply a table indicating choices and receipt of choices for each person in an interaction. Returning to the restaurant example, we would ask each woman to choose the two others they'd most like to work with and summarize the data in a matrix as in Table 3-1:

Table 3-1
A Sociometric Matrix

_______________________________________________________________________

                         Objects
Subjects   Judy  Mary  Susan  Doris  Jane  Kathy  Total
Judy        --    1     1      0      0     0       2
Mary         1   --     0      0      1     0       2
Susan        1    1    --      0      0     0       2
Doris        1    1     0     --      0     0       2
Jane         1    0     0      0     --     1       2
Kathy        0    1     0      0      1    --       2
           ____  ____  _____  _____  ____  _____   ____
Total        4    4     1      0      2     1       12

_______________________________________________________________________

The total column on the right side of the table indicates that each woman did indeed make two choices as instructed. The object totals along the bottom of the table indicate how many times each woman was chosen. Data totaled on a sociometric matrix in this way can also be thought of as scale scores relative to subject and object interaction choices. It becomes apparent that Judy and Mary are the two most popular co-workers, while Doris wasn't chosen by anyone. This method of presentation allows for multiple choices without becoming as cluttered as a sociogram might. In addition, it allows the computation of totals in an interval-ratio fashion, rather than as a graphic image.
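
A short Python sketch of these totals, using the choices recorded in Table 3-1:

    # Choices as recorded in Table 3-1: each woman named the two co-workers
    # she would most like to work with.
    choices = {
        "Judy":  ["Mary", "Susan"],
        "Mary":  ["Judy", "Jane"],
        "Susan": ["Judy", "Mary"],
        "Doris": ["Judy", "Mary"],
        "Jane":  ["Judy", "Kathy"],
        "Kathy": ["Mary", "Jane"],
    }

    choices_made = {name: len(named) for name, named in choices.items()}
    times_chosen = {name: 0 for name in choices}
    for named in choices.values():
        for name in named:
            times_chosen[name] += 1

    print(choices_made)   # every subject made 2 choices
    print(times_chosen)   # Judy: 4, Mary: 4, Susan: 1, Doris: 0, Jane: 2, Kathy: 1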

Finally, sociometric data for individuals may be summarized as total measures derived from either of these procedures. Totals such as those described in a matrix may be produced for each individual. In addition, summary scores across several choice dimensions could be computed, as could average scores that take into account the number of people choosing each time.

Two major limitations of sociometry should be mentioned. First, the size of the group may limit the use of sociometric techniques. It's plain from these sociograms and matrices that if the study were to include a hundred subjects instead of six, presentation of the data in these formats would become extremely cumbersome. Even if there are ten subjects in the study and they're each asked to make three choices on each of five interaction preference dimensions, a sociogram may become too complex. Still, summary measures for individuals could be computed without being displayed in either sociogram or matrix form. The only limitation on such computation would be the size of group the respondents are capable of perceiving and evaluating along the dimensions being investigated. At some point, individuals no longer know everyone you'd like them to consider as potential interaction choices. Whether or not this matters differs from study to study. For any given study, who is chosen may be what's important, not who's known and rejected. If people aren't known, perhaps that's informative, too. The group-size limitation on sociometry is not a rigid one, then, but group size must be considered in order to present the data in a coherent fashion.

The second limitation on sociometry is more theoretical. Some have charged that while yielding interesting data on interaction, it's too much to expect simple who-to-whom responses to reflect the complexity of social structure. Part of this issue is how one defines social structure or how one maps it. There's agreement that sociometry can chart part of that complex pattern. That such procedures can fully describe even small group social structure is in doubt.

Finally, a word needs to be said about the sensitive nature of sociometric data. Particularly when used in small group settings, the potential of such data to create negative consequences must be recognized. Surely the restaurant researcher wouldn't want to indicate that poor Doris, identified by name, was chosen by no one. Thus, when gathering these data you should assure respondents their choices are confidential.

Semantic Differential

Another special application of scaling methodology is the semantic differential. The purpose of this technique is to discover underlying meanings of concepts or objects for individuals and groups. This is accomplished by presenting the concept or object to be rated, followed by sets of polar adjectives. For example, if you were investigating "self-concept," you might instruct respondents "On each scale below, check the space that best describes you. If these words have nothing to do with you, mark space number '4' in the middle. Work quickly. Your first impression is what we're after."

You

              1      2      3      4      5      6      7
1. Good     ____   ____   ____   ____   ____   ____   ____    Bad
2. Strong   ____   ____   ____   ____   ____   ____   ____    Weak
3. Fast     ____   ____   ____   ____   ____   ____   ____    Slow

Other common adjective pairs are fair-unfair, clean-dirty, heavy-light, hot-cold, valuable-worthless, and so on.

After respondents fill out such scales, factor analysis techniques are used to determine which of the adjectives seem to belong together. Without presenting a discussion of that statistical technique here, suffice it to say that factor analysis is an attempt to find out which adjectives show similar response patterns and thus belong together to describe some dimension.

The developers of the semantic differential argue that the three dimensions most likely to appear when this analysis is performed are (1) an evaluative dimension, defined by such adjectives as good-bad; (2) a potency dimension, defined by such adjectives as strong-weak; and (3) an activity dimension, defined by such adjectives as fast-slow. Pursuing our self-concept example, then, individuals might end up with rankings for themselves on these three dimensions. Once the factor analysis determines which adjectives belong together, the ratings can be summed for respondents. In this way, individuals may be compared with each other. The technique is said to deal with meanings of concepts, since the factor analysis procedure defines which adjectives go together instead of a priori assumptions being made by the researcher. Thought of in this way, scale items are defined by respondents.
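
As a rough Python sketch, once a grouping of adjective pairs into dimensions is available (the grouping below is assumed for illustration; in practice it would come from the factor analysis), each respondent's ratings can be summed within dimensions:

    # Assumed grouping of adjective pairs into the three commonly reported
    # dimensions; in a real study the grouping would come from factor analysis.
    DIMENSIONS = {
        "evaluation": ["good-bad", "fair-unfair", "clean-dirty"],
        "potency":    ["strong-weak", "heavy-light"],
        "activity":   ["fast-slow", "hot-cold"],
    }

    def dimension_scores(ratings):
        """ratings: adjective pair -> 1..7 rating for one respondent."""
        return {dim: sum(ratings[pair] for pair in pairs)
                for dim, pairs in DIMENSIONS.items()}

    respondent = {"good-bad": 2, "fair-unfair": 1, "clean-dirty": 3,
                  "strong-weak": 5, "heavy-light": 4,
                  "fast-slow": 6, "hot-cold": 5}
    print(dimension_scores(respondent))
    # {'evaluation': 6, 'potency': 9, 'activity': 11}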

There are several problems with the use of the semantic differential. The first is its lack of intuitive appeal. At first glance, it doesn't seem likely that respondents would rate objects such as themselves, doctors, colleges, or anything else in this blanket fashion. Continued use of the semantic differential testifies differently, however. While we each make judgments about one doctor or one college that may be different from our judgments about others in these categories, we have general impressions of doctors and colleges we can describe using such adjectives.

The second difficulty with the technique is its reliance on statistical correlation to provide definitions. In other words, when producing dimensions by combinations of adjectives, the semantic differential user relies on the patterns in the sample rather than any definition deduced from theory. This is not necessarily a limitation and could indeed be interpreted as an advantage. Still, the user should be aware that empirical definition rather than theoretical definition of concepts is paramount here.

RELIABILITY AND VALIDITY

Regardless of whether measures are single- or multiple-item or whether we use any of the scaling techniques described, we must be accountable for the accuracy of our measures. Throughout the measurement process, two overriding considerations are present. The first is whether or not our measures are valid. Validity means we're measuring what we intend to measure. This consideration was repeatedly cited in our earlier discussion of scaling. The second concern is whether or not our measurement is reliable. Reliability of measurement means stability or consistency. It means if we measure the same thing twice with our measurement device, we get the same results.

The concepts of validity and reliability are closely but not symmetrically related. If a measure is valid, it should also be reliable. The reverse is not true, because a measure that's reliable is not necessarily valid. In repeating any study, you can get the same results and still not be measuring what you intend. While this summarizes the theoretical relationship of validity and reliability, you'll see that the actual demonstration of each is somewhat difficult.

It's the responsibility of any scientist to assure others of their measuring devices' reliability and validity. Sometimes this is taken care of by merely describing how measurements were taken or constructed. If the procedures used are common, such description may suffice to convince the scientific community the results are valid and reliable and thus worthy of inclusion into a body of theory. In constructing Likert scales, for example, the researcher should report having eliminated non-discriminating items, because this procedure is designed to increase validity. If measures are less conventional, however, it may be necessary to report the results of separate tests of validity and reliability. It is therefore important to know how these attributes of measurement are demonstrated.

The Measurement of Validity

The simplest validation procedure, called face validity, is agreement among observers that the procedure "appears" to measure the concept. Perhaps a number of people are asked to check a list of items that could represent a concept such as dogmatism. The most agreed-upon items could then be used in a questionnaire asking respondents to indicate whether they have such traits. This procedure is, of course, quite subjective. There's no guarantee that agreement, or collective subjectivity, in fact represents reality. Some researchers argue that demonstration of face validity is not a validity check at all.

Second, construct validity indicates whether a measured variable shows the same relationship to other variables as might be predicted by theory. For example, suppose the theory used to generate the research problem indicates that social class should be inversely related to authoritarianism. Having used a measure of authoritarianism in a study, credibility will be added to the validity of the procedures if it can be demonstrated that social class is inversely related to this variable in the data as well. This process derives its name from the fact that empirical measures are constructed to represent an abstract or theoretical concept. The difficulty with this procedure is its reliance on a proposition derived from theory, which itself may be invalid. Even if the empirical results match the theoretical proposition, it's not entirely clear that measurement procedures have been validated. Both theory and measurement could contain error.

A third procedure for demonstrating measurement validity is criterion validity, which is similar to construct validity in that an outside criterion is compared with the new measure. The other measure of the same concept might enjoy more widespread acceptance than the measure being validated. Using this procedure, however, it's only possible to show the new measure achieves the same result as the old one. They might both be invalid.

Another criterion for validation is what some call "known groups." Suppose you're trying to validate a scale you believe measures "political conservatism." You might administer the scale to local meetings of the John Birch Society and the American Civil Liberties Union. If these groups don't show differences on your scale of "political conservatism," you have reason to doubt you're measuring what you think you are. The use of known groups as a criterion depends on the accuracy of identifying such groups relative to the variable you're trying to measure.

The Measurement of Reliability

Illustrating the stability of a measurement procedure is not a simple matter because it's difficult to measure the same thing twice without change occurring that has nothing to do with the measurement. Even if an observation is made at a different time of day, for example, there could be a change in results. Imagine the difference in motivation between someone answering a questionnaire right before lunch while fighting hunger pangs and someone else answering the same questionnaire at mid-morning, feeling fresh and alert. Further, measuring a phenomenon at one time alerts human subjects to your purposes and devices. They may try to over-cooperate the second time and make their answers match the earlier measurement, or they may take a non-cooperative attitude and change answers randomly.

Because of these problems, test-retest reliability is based on re-measuring a phenomenon after having allowed sufficient time for recall to be poor. Or test-retest procedures may compare results from different observers. Another method, alternate or equivalent forms of measurement, is often used when multiple-item scales are used to measure a variable. The researcher constructs two forms of the scale intended to be equivalent and uses one form in the first measure and the second in the retest.

A third way to demonstrate reliability with multiple-item scales is split-half reliability, where you simply compare the results obtained using half the items with the results obtained using the other half. Do people appear to be just as authoritarian if they're scored on items 1, 3, 5, 7, and 9 as when items 2, 4, 6, 8, and 10 are used? This technique checks the internal reliability of the scale.
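
A minimal Python sketch of a split-half check, with hypothetical item scores; the correlation between odd-item totals and even-item totals is computed directly:

    def pearson(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sum((x - mx) ** 2 for x in xs) ** 0.5
        sy = sum((y - my) ** 2 for y in ys) ** 0.5
        return cov / (sx * sy)

    def split_half(score_matrix):
        """score_matrix: one list of item scores per respondent. Correlates each
        respondent's odd-item total with their even-item total."""
        odd = [sum(items[0::2]) for items in score_matrix]
        even = [sum(items[1::2]) for items in score_matrix]
        return pearson(odd, even)

    # Hypothetical 10-item authoritarianism scores for five respondents
    scores = [[1, 1, 1, 0, 1, 1, 1, 0, 1, 1],
              [1, 0, 1, 1, 0, 1, 0, 1, 1, 0],
              [0, 0, 1, 0, 0, 1, 0, 0, 1, 0],
              [1, 1, 1, 1, 1, 1, 1, 1, 0, 1],
              [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]]
    print(round(split_half(scores), 2))   # roughly 0.81 for these made-up scores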

The difficulty with each of these procedures is the lack of protection from the intervention of real change. The more time that elapses in a test-retest reliability procedure, the more likely real changes will affect respondents. If the concepts being measured are relatively stable characteristics such as social class status, the problem is less serious. Even in this case, however, drastic changes in the common social class measures of occupation or income are possible in short periods of time. In equivalent forms or split-half reliability procedures, we depend on the correctness of assuming the forms really are equivalent or that the items in one half measure the same thing as those in the other half. In short, demonstrating reliability gives credibility to the measurement process, but failure to demonstrate reliability could indicate either real instability of your measuring procedures or only failure to measure reliability well.

Tests for both reliability and validity are important to the advancement of social science. Yet many studies provide neither, relying instead on assuming most research efforts reflect an honest attempt to eliminate major sources of bias. Indeed, the cost of repeating each measurement procedure may preclude demonstrating reliability.

ISSUES IN MEASUREMENT

Measurement requires choices that are crucial to the successful outcome of research. Still, each choice allows considerable latitude, which permits and encourages debate. Indeed, in earlier times there was much debate over whether social variables could be measured at all. Not only did it seem difficult to measure and validate such concepts as "alienation," some resented attempts to reduce important individual and group characteristics to a set of numbers. Today this debate more often appears as an inquiry into whether it's reasonable to quantitatively measure social variables. Few would deny that "prejudice," for example, comes in amounts, but it may be less acceptable to assume prejudice can be measured in intervals that are equal in size and equidistant from one another. The problem comes down to whether it can be meaningful to say one individual is twice as prejudiced as another.

Because the whole science of mathematics can be brought to bear on social science data by measuring variables at interval or ratio levels, there are compelling reasons to make such assumptions when feasible. Clearly social scientists don't often deal with length, width, volume, pressure, temperature, or other variables that lend themselves more easily to this kind of measurement. A point that's often overlooked, however, is that all measurement procedures are arbitrary -- even in the sciences where there's much agreement. The concept "length" came to be measured in the U.S. in inches, feet, and yards. We all agree these units are real, even though they're entirely arbitrary, as illustrated by the measurement of length in other countries in centimeters, meters, and kilometers.

Social science is too young a discipline to have had all its concepts clearly defined and to have established consensus on the way to measure each of them. Still, to avoid the dilemma that may arise from speaking of social science concepts in terms of an interval or ratio, most researchers talk about "prejudice scores," rather than "prejudice," being twice as great for one individual or group as another. This distinction is important, not just semantic. It shows the researcher is aware of the limitations of measuring devices and procedures.

Another social science debate has revolved around the asymmetrical relationship of reliability and validity discussed above. Hopefully the discussion of how to measure these important characteristics has alerted you to the unfortunate news that we can never be absolutely sure our measures are valid or reliable. Each procedure in current use requires so many assumptions that the procedures themselves can never be fully validated. It's much easier to get at least some agreement on a test-retest procedure than to demonstrate validity with any available method. But because finding reliability provides only a hopeful sign, not a guarantee that researchers are measuring what they think they're measuring, social scientists could be overemphasizing reliability.

It should be clear by now that many factors must be weighed when making decisions about operational definitions or measurement. There are no easy rules to decide whether a variable requires multiple-item measurement or single-item measurement or whether one scaling technique would be better than another. Some of these measures are complex and difficult to use. They may carry different dollar costs as well. It's always necessary to weigh these practical considerations against concerns for reliability and validity.

Some have argued that measurement decisions become easier when theoretical development is further advanced. Indeed, an unambiguous and tightly constructed theoretical definition of a concept allows many fewer operational choices than a nonspecific theoretical definition. If you find that many measurement strategies are possible and procedures or questions might be focused in many directions, it's time to question your theoretical and definitional work that preceded measurement decisions.

EXERCISES

  1. Choose a concept and write its theoretical definition. Now write two operational definitions for the concept. In other words, write two different ways it might be measured. Could you have written still more operational definitions? What are the implications of choosing one over the other?

  2. What are the levels of measurement (nominal, ordinal, interval, or ratio) for each of the following variables:

    (a) numbers of football players;

    (b) income in dollars;

    (c) leadership effectiveness;

    (d) dedication to church?

  3. How is class standing or grade average computed at your or your children's schools? What's the level of measurement employed? What measurement assumptions about student performance are made in computing these averages? Is one's standing or grade average treated as an ordinal, interval, or ratio variable? Is there a theoretical concept being measured with this variable? What is it?

  4. Make up a question that allows open-ended response. For example, "What is your reason for choosing your career?" or "Why do you drink alcohol, if you do?" Ask this question of ten people. Now classify these answers into mutually exclusive and exhaustive categories.

  5. Practice creating Likert responses by writing a range of response choices for each of the following questions:

    (a) How satisfied are you with your career?

    (b) How often do you fail to complete projects?

    (c) How important is career success to you?

  6. Think of a concept that might be measured with a Guttman scale. For example, willingness to participate in politics might range from "I'm willing to seal envelopes" to "I'm willing to make speeches for candidates." Using a concept you choose, create five Guttman-type items, and administer your scale to five other people. Tabulate your results as in Figure 3-2. Did your items scale as you expected them to?

  7. At your workplace or some other setting where a group of people is available, administer a questionnaire designed to produce sociometric choices. Use any dimension that interests you for these questions. For example, ask a few people to name three individuals they would most like to work with or they perceive to be leaders. Disguise the identities of your respondents and plot your results as a sociogram or sociometric matrix. What did you learn?

  8. Choose an article from a leadership journal that reports data. What assurances does the author offer that measurement was valid and reliable? Are you convinced? Why or why not?
