
Running head: MULTIVARIATE STATISTICS

Multivariate Statistics Homework 2: Reliability and Validity

Michael J. Walk

University of Baltimore

Multivariate Statistics Homework 2: Reliability and Validity

Part 2-A

Self-Esteem and Relationship Intention

The test-retest reliabilities for the self-esteem and intention to remain in relationship scales are .83 and .65, respectively.

However, assuming that the internal consistency of the self-esteem scale at Time 1 and Time 2 is .90 and .85, respectively, it is possible to correct for attenuation (i.e., to calculate the test-retest reliability while controlling for item content error). The corrected reliability is .95. By eliminating item content errors in our calculations, we have a better picture of the “true” test-retest reliability. Whether the corrected reliability or the simple reliability is the more appropriate measure depends on what the researcher decides to treat as error. If the researcher wants to know how much of the error between first and second tests is only due to time and not due to content, then the disattenuated measure, .95, is the most appropriate. However, if a researcher is interested in the reliability of a test over time considering all possible sources of error (time and content included), then the simple reliability, .83, is more appropriate.
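
As a quick check, the disattenuation step can be reproduced directly from the coefficients reported above. The short Python sketch below only re-derives the .95 figure; the variable names are illustrative and not part of the original analysis.

```python
from math import sqrt

# Values reported above for the self-esteem scale
r_test_retest = 0.83  # observed Time 1 - Time 2 correlation
alpha_time1 = 0.90    # internal consistency at Time 1
alpha_time2 = 0.85    # internal consistency at Time 2

# Correction for attenuation: r_xy / sqrt(r_xx * r_yy)
r_corrected = r_test_retest / sqrt(alpha_time1 * alpha_time2)
print(round(r_corrected, 2))  # 0.95, the corrected reliability reported above
```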

Agreeableness

The agreeableness scale (12 items) has an alpha coefficient of .75. Possible sources of error are related to content (i.e., the actual items of the scale introduce error variance into the construct measurement by their wording, how that wording is interpreted by participants, or even the item’s placement in the whole of the test).

It is well-known that increasing the number of questions in a scale increases the reliability of that scale. However, with more questions, it is more likely that the test will be too demanding or too long to be feasible. Therefore, it is worthwhile to see if internal consistency can be adequately maintained while decreasing the number of questions. To do this, Cronbach’s alpha was calculated for the first four items as well as the first eight items. The alpha coefficient for the first four items is .55; the alpha coefficient for the first eight items is .70. Decreasing the scale from the full 12 items to four items decreases the consistency to below acceptable levels. Using the first eight items increases the consistency to .70; however, this value is only marginally acceptable. Therefore, keeping all 12 items is the best scenario—producing the strongest reliability, .75.
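
For reference, these subset comparisons could be reproduced with a short routine along the following lines. This is a sketch only; `items` is a hypothetical placeholder for the 12-column matrix of agreeableness responses, which is not reproduced here.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents x k_items) matrix of item scores."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)       # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)   # variance of the summed scale
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# items = ...  # hypothetical (n x 12) agreeableness response matrix, not shown
# print(cronbach_alpha(items[:, :4]))   # first four items  (reported alpha = .55)
# print(cronbach_alpha(items[:, :8]))   # first eight items (reported alpha = .70)
# print(cronbach_alpha(items))          # all twelve items  (reported alpha = .75)
```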

The agreeableness scale has a test-retest reliability of .63; corrected using the Spearman-Brown formula, this value becomes .77.
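
The Spearman-Brown step can be verified the same way. The sketch below assumes the standard form of the prophecy formula with a lengthening factor of k = 2.

```python
def spearman_brown(r: float, k: float = 2.0) -> float:
    """Spearman-Brown prophecy formula: reliability of a test lengthened k times."""
    return (k * r) / (1 + (k - 1) * r)

print(round(spearman_brown(0.63), 2))  # 0.77, matching the corrected value above
```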

Results and Discussion

Means, standard deviations, inter-variable correlations, and internal consistency coefficients are presented in Table 1. The test-retest reliabilities for self-esteem and relationship intentions were .83 and .65, respectively. The self-esteem measure was adequately consistent over time, while the measure for relationship intentions was not. This is understandable given the constructs upon which the two scales were built. Self-esteem, a personality trait relatively consistent across situations, is more stable than relationship intentions, which may be directly affected by incidents such as arguments or changes in future plans. In other words, self-esteem is more trait-based, while relationship intentions are more state-based.

Sources of error for the self-esteem measure and the intentions measure include time-sampling errors as well as content errors. Examples of possible time-sampling errors are state changes, practice, fatigue, or any event occurring in the testing interval that would alter a person’s score (e.g., having a fight with a significant other would probably change a person’s relationship intent score and self-esteem score). A participant’s “true” score on the second test may also have been affected by remembering previous responses and answering similarly. Examples of possible content errors are inconsistency in the scale items themselves, ambiguous items, misinterpretation of item wording, and under- or over-representation of the construct. That is, certain items may possess characteristics that introduce construct-irrelevant variance into the observed scores. While the internal consistency reliabilities for both measures are adequate (i.e., greater than .70), the intentions measure appears to be less reliable than the self-esteem measure (both over time and internally). The self-esteem measure is reliable enough to be used in research; the intentions measure, however, should be applied only to cross-sectional research.

The Cronbach’s alpha for the agreeableness scale was .75. A possible source of error for this scale was content error. Because the 12 items are intended to measure the construct of agreeableness, a subject’s answers on all questions should be commensurate with his or her actual agreeableness. However, a participant could misinterpret or fail to understand a question, or a question itself could be an inaccurate measure of the agreeableness construct. We have no data on the reliability of agreeableness over time; this limits what can confidently be said about its use in psychological research. However, the measure is sufficiently internally reliable.

Table 1

Means, Standard Deviations, and Correlations among Variables

Variables / Mean / SD / 1 / 2 / 3 / 4 / 5
Time 1 Measures
1. Self-esteem 1 / 4.28 / .49 / (.90)
2. Intentions 1 / 3.24 / .83 / .03 / (.75)
3. Agreeableness / 36.09 / 5.73 / .24** / .10 / (.75)
Time 2 Measures
4. Self-esteem 2 / 4.31 / .54 / .83** / .05 / .19** / (.85)
5. Intentions 2 / 3.09 / .93 / .05 / .65** / .14* / .16** / (.83)

Note. Values enclosed in parentheses represent the internal consistency reliability (α). For correlations with agreeableness, n = 332; for all other correlations, n = 334.

*p < .05. **p < .01.

Part 2-B

Self-Esteem and Relationship Self-Efficacy

For construct definitions, test items, item critique, and validation plan, see Appendix A.

Multitrait-Multimethod Matrix

Correlations (corrected for attenuation) and reliabilities are presented as a Multitrait-Multimethod Matrix in Table 2. By examining the reliabilities (in parentheses) of the various methods, we can see that, generally, methods are sufficiently reliable; however, the diary content coding method is only marginally reliable for the relationship satisfaction measure (α = .70). Evidence for convergent validity is mixed. All values are greater than zero; however, five of the nine values are less than .40. The strongest pattern of convergent validity is found between Method 1 and Method 3—values for self-esteem (SE), relationship self-efficacy (RSe), and relationship satisfaction (RSat) are .47, .69, and .83, respectively. Method 1 – Method 2 validity coefficients are moderate (.38, .38, and .48, respectively), and Method 2 – Method 3 coefficients are low (.27, .17, and .36, respectively). Evidence of discriminant validity is mixed as well. Because of the low values of monotrait-heteromethod correlations, many of the heterotrait-heteromethod correlations are comparatively larger. Only the Method 1 – Method 3 values show substantial evidence of discriminant validity. Even though the Method 2 – Method 3 validity diagonal values are all larger than the heterotrait-heteromethod values, the differences are small—the discriminant validity is low.

The degree of validity found by this research is low. Convergent validity is weakened by the validity diagonals being generally lower than or equivalent to the heterotrait-monomethod values and the heterotrait-heteromethod values. Even though all measures obtained sufficient internal reliability, different measuring methods produced different results, and the constructs themselves appear to overlap substantially. Method 2 (partners’ ratings) exhibited evidence of method bias: all heterotrait correlations within that method were .76 or greater. In general, there appears to be too much variance shared among same-method constructs and not enough variance shared across similar constructs assessed by different methods. Although the constructs are theoretically related, seven of the nine heterotrait-monomethod correlations are greater than .50, calling into question whether the current measures of these constructs, or the constructs themselves, can be adequately differentiated.
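
The comparisons described above can be made concrete by transcribing the lower triangle of Table 2 and examining each heteromethod block. The sketch below is only a convenience for inspecting the matrix, not part of the original analysis.

```python
import numpy as np

# Lower triangle of Table 2 (reliabilities on the diagonal); trait order within
# each method is SE, RSe, RSat, and method order is M1, M2, M3.
mtmm = np.array([
    [.90, 0,   0,   0,   0,   0,   0,   0,   0  ],
    [.37, .87, 0,   0,   0,   0,   0,   0,   0  ],
    [.40, .79, .75, 0,   0,   0,   0,   0,   0  ],
    [.38, .44, .51, .90, 0,   0,   0,   0,   0  ],
    [.32, .38, .43, .76, .95, 0,   0,   0,   0  ],
    [.34, .31, .48, .79, .86, .92, 0,   0,   0  ],
    [.47, .16, .31, .27, .13, .06, .80, 0,   0  ],
    [.27, .69, .23, .06, .17, .13, .51, .81, 0  ],
    [.19, .13, .83, .11, .15, .36, .61, .58, .70],
])

# For each pair of methods, compare the validity diagonal (monotrait-heteromethod)
# with the off-diagonal heterotrait-heteromethod values in the same block.
for row_m, col_m in [(1, 0), (2, 0), (2, 1)]:   # (M2, M1), (M3, M1), (M3, M2)
    block = mtmm[3 * row_m:3 * row_m + 3, 3 * col_m:3 * col_m + 3]
    validity_diagonal = np.diag(block)
    heterotrait = block[~np.eye(3, dtype=bool)]
    print(f"M{row_m + 1}-M{col_m + 1}: validity diagonal {validity_diagonal}, "
          f"max heterotrait-heteromethod {heterotrait.max():.2f}")
```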

The measure for SE is internally reliable, but when assessed by different methods, it lacks convergent validity (all heteromethod correlations for SE are less than .50), and it repeatedly fails to discriminate itself from the other measured constructs.

The measure for RSe is also internally reliable across the different methods. However, it fails to demonstrate convergent or discriminant validity due to low convergent correlations and high discriminant correlations.

Method 2 (i.e., partners’ ratings) seems especially problematic: its heterotrait correlations are the highest and its monotrait-heteromethod correlations are the lowest. This makes sense theoretically, because partners’ ratings can be expected to diverge the most from the subjects’ self-reports. Methods 1 and 3 (i.e., the pencil-and-paper instrument and the diary coding, respectively) provide some evidence of validity; however, high correlations between constructs remain across both of these methods, making it difficult to know which constructs are truly being measured.

The correlations between the criterion, relationship intent (RI), and the predictors (SE, RSe, and RSat) differ in strength depending on the method of measurement. The diary method produced the largest correlations (ranging from .35 to .67). The self-report measure and partners’ ratings both resulted in relatively similar criterion correlations (ranging from .06 to .29 for self-report and from .19 to .26 for partners’ ratings). It is theoretically expected that there should be a moderate to strong positive correlation between RSat and RI: those participants most satisfied with their current relationship would also intend to stay in that relationship. The diary coding method provided the strongest positive correlation (r = .67), while the self-report method and the partners’ ratings method provided correlations of .29 and .22, respectively. Although these correlations are in the predicted direction, only the diary coding method provided a strong value. Since the same constructs were measured by different methods and the magnitude of the resulting correlations with RI depends on the method of measurement, some form of method bias appears to be present in the research.

Further research needs to be conducted in order to validate the newly developed construct measures for self-esteem and relationship self-efficacy. The present research found that self-esteem, relationship self-efficacy, and relationship satisfaction are related to one another and are difficult to differentiate. It may be necessary to combine constructs, or to further distinguish the constructs from one another, in order to conduct meaningful research. It also may be beneficial for future research to replace or refine the partners’ ratings method, since it correlated weakly with the other two methods and exhibited the highest heterotrait-monomethod correlations.

Table 2

Multitrait-Multimethod (MTMM) Matrix

 / Method 1 / / / Method 2 / / / Method 3 / / / Criterion
Traits / SE / RSe / RSat / SE / RSe / RSat / SE / RSe / RSat / Intent
Method 1 (M1)
SE / (.90) / / / / / / / / / .06
RSe / .37 / (.87) / / / / / / / / .24
RSat / .40 / .79 / (.75) / / / / / / / .29
Method 2 (M2)
SE / .38 / .44 / .51 / (.90) / / / / / / .19
RSe / .32 / .38 / .43 / .76 / (.95) / / / / / .26
RSat / .34 / .31 / .48 / .79 / .86 / (.92) / / / / .22
Method 3 (M3)
SE / .47 / .16 / .31 / .27 / .13 / .06 / (.80) / / / .42
RSe / .27 / .69 / .23 / .06 / .17 / .13 / .51 / (.81) / / .35
RSat / .19 / .13 / .83 / .11 / .15 / .36 / .61 / .58 / (.70) / .67

Note. SE = Self-Esteem; RSe = Relationship Self-Efficacy; RSat = Relationship Satisfaction; Intent = Intent to Remain in a Relationship; Method 1 (M1) = multiple-item self-report measure; Method 2 (M2) = partners’ ratings; Method 3 (M3) = content coding of diary. All correlations are corrected for attenuation. Values in parentheses are reliabilities (α). For all correlations, n = 334.

Appendix A

Self-Esteem and Relationship Self-Efficacy:

Construct Definitions, Test Items, Critique, and Validation Plan

1. Relationship Self-Efficacy: one’s perception of his or her competence in building and maintaining a healthy relationship, specifically in terms of one’s perseverance and proactivity.

1 = Strongly Disagree   2 = Disagree   3 = Neutral   4 = Agree   5 = Strongly Agree

____ 1. When I feel upset about something my partner says or does, I will approach him/her.

____ 2. I feel like giving up when my partner and I do not see eye to eye. (Reversed)

____ 3. If a disagreement arises, I find ways to compromise with my partner.

____ 4. When problems arise in my relationship, I actively seek solutions to solve them.

Self-Esteem: one’s perception of self-value, usually defined in terms of positive and negative attitudes toward oneself, specifically in terms of general self-worth, interaction, and self-determination.

1 = Strongly Disagree   2 = Disagree   3 = Neutral   4 = Agree   5 = Strongly Agree

____ 1. Overall, I feel good about myself.

____ 2. I feel that I am of equal value with the people with whom I interact.

____ 3. I feel that my opinions are a valuable asset to conversations with others.

____ 4. I feel that no other person controls my confidence level.

2. A. Threats to Validity:

Relationship Self-Efficacy:

  1. Our definition is too specific; therefore, in terms of construct validity, we may be limiting the types of questions that can be asked.
  2. Are these questions situation/relationship specific or can they be generalized to every relationship? (External Validity)
  3. History effect: If the couple just had a fight before taking this survey, this may affect the results.
  4. All of the items address proactivity as manifested in a negative or aversive relationship situation. None of them measures proactivity and perseverance during the positive or enjoyable times in a relationship, thus underrepresenting the construct. For example, one or two of the current items could be replaced with, “I remember to celebrate special occasions with my partner,” or “I am able to understand and meet my partner’s needs.” Adding these positive-situation items may increase the validity of our measure.

Self-Esteem:

  1. Our definition is too specific; therefore, in terms of construct validity, we may be limiting the types of questions that can be asked.
  2. In our definition we identified three specific areas of self-esteem. An individual may score high in one of these areas and low on the others, giving a false overall measure of one’s self-esteem.
  3. Question 4 was supposed to measure self-determination, but we are questioning whether it actually measures locus of control, which is not in our definition of self-esteem.

B. Threats to Reliability

  1. Relationship self-efficacy could be state-based, and thus future scores on this scale could differ substantially from the original score (time influences).
  2. Inter-rater reliability: The girlfriend’s perceptions of the questions could differ from those of the boyfriend, so the scores would differ even though the individuals are members of the same relationship.
  3. An individual’s current mood state may influence his or her answers to the questions on the self-esteem measure (e.g., a person who receives a rejection letter from a college just before taking our survey may be feeling depressed).

3. Even though the scenario indicated that the literature review turned up an existing scale for self-esteem, we would go back through the literature to be sure. In reality, a reliable measure of self-esteem does exist. The first step of our validation plan is to test our scale for convergent validity against the Rosenberg Self-Esteem Scale. Participants’ scores on the two scales should be moderately to highly positively correlated with each other. If the scores are not positively correlated, then it is likely that our scale does not actually measure self-esteem.

Step 2 of the validation plan involves finding a measure of a construct that can reasonably be considered the opposite of self-esteem. We decided that depression may be such a construct, which would allow us to test for divergent validity. High scores on our scale should not be positively related to scores on the Beck Depression Inventory-II. If such a positive correlation did exist, it would again indicate that our scale is not measuring self-esteem but rather depression.
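
A minimal sketch of how Steps 1 and 2 might be carried out once data are collected is shown below. The arrays are synthetic stand-ins generated only so the example runs; in the actual study they would be replaced with participants’ scores on our scale, the Rosenberg Self-Esteem Scale, and the BDI-II.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n = 300  # illustrative sample size

# Synthetic stand-in scores (to be replaced with real data).
new_scale = rng.normal(size=n)                               # our self-esteem scale
rosenberg = 0.7 * new_scale + rng.normal(scale=0.7, size=n)  # related measure
bdi_ii = -0.5 * new_scale + rng.normal(scale=0.9, size=n)    # inversely related measure

# Step 1: convergent validity -- expect a moderate-to-strong positive correlation.
r_conv, p_conv = pearsonr(new_scale, rosenberg)

# Step 2: divergent (discriminant) validity -- expect no positive correlation.
r_div, p_div = pearsonr(new_scale, bdi_ii)

print(f"Convergent r with Rosenberg: {r_conv:.2f} (p = {p_conv:.3f})")
print(f"Divergent r with BDI-II:     {r_div:.2f} (p = {p_div:.3f})")
```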

The final step of our validation plan is to examine ecological validity, which focuses on the structural component of external validity and the generalizability of our findings. One such influence may be the setting in which participants are measured: if participants take the scale in a therapist’s office versus a laboratory setting, will that affect their scores? Second, are the participant samples that we use to validate our measure representative of one specific group, or do they possess characteristics that can be generalized to other populations? For example, if we test only Caucasian males in graduate school, will this be a valid measure for use with African Americans, females, or individuals with less education? Most likely not. Finally, we would examine the research procedures for administering the self-esteem measure. Does this measure have standardized procedures, or can it be given casually by untrained individuals, such as an undergraduate conducting a survey? This could affect both the reliability and validity of the measure.