SOCY7709: Quantitative Data Management

Instructor: Natasha Sarkisian

Processing Observations across Subgroups

1. Subgroup operations across multiple variables

Oftentimes, datasets include nested subcomponents – e.g., data on each child of each respondent, or on each person who helped with activities of daily living, or each recipient of financial assistance, or on organizations for which respondent volunteered, etc. Even though the entire dataset might not have a nested structure, these subcomponents are essentially nested – children within the respondent, etc. Such data are usually provided in separate variables for each lower level unit, e.g.:

R’s ID / Child 1 age / Child 1 gender / Child 2 age / Child 2 gender / Child 3 age / Child 3 gender
1
2
3

Different individuals might have different number of children, so, for example, if someone only has one child, then only child 1 columns will contain data; the following ones will be missing. When dealing with such data, we could either (1) conduct analyses on the level of individual à need to aggregate across variables or (2) conduct analyses on the level of lower level unit (here, child) à need to reshape the data into:

R’s ID / Child number / Child gender / Child age
1 / 1
1 / 2
1 / 3
2 / 1
2 / 2
2 / 3
3 / 1
3 / 2
3 / 3

In this long format, there will be rows with all child data missing since some people have less than 3 children; such blank rows need to be identified and deleted (we learned how to do that when discussing reshaping).

We will focus, however, on aggregating across variables when our goal is to have individuals as units of analysis. As we discussed earlier, when we are interested in creating new variables based on information that is stored in multiple columns in the dataset, we can use the many egen options that allow us to combine such information across columns in various ways.

This is a recap:

·  anycount(varlist), values(integer numlist) -- looks for a match across multiple variables and generates a count of variables among those in the varlist for which values are equal to any integer value in a supplied numlist. Values for any observations excluded by either if or in are set to 0 (not missing).

·  anymatch(varlist), values(integer numlist) – same but generates a yes/no indicator; it is 1 if any variable in varlist is equal to any integer value in a supplied numlist and 0 otherwise. Values for any observations excluded by either if or in are set to 0 (not missing).

·  rowfirst(varlist) -- gives the first nonmissing value in varlist for each observation (row).

·  rowlast(varlist) -- gives the last nonmissing value in varlist for each observation (row).

·  rowmin(varlist) -- gives the minimum value in varlist for each observation (row).

·  rowmax(varlist) -- gives the maximum value (ignoring missing values) in varlist for each observation (row).

·  rowmean(varlist) – gives the mean for each observation across variables; great for creating individual scores for multi-item scales. It ignores missing values: for example, if three variables are specified and, in some observations, one of the variables is missing, in those observations the new variable will contain the mean of the two variables that do exist. Other observations will contain the mean of all three variables.

·  rowsd(varlist) – gives the (row) standard deviations of the variables in varlist, ignoring missing values.

·  rowmedian(varlist) – similar to mean, but gives a median.

·  rowpctile(varlist) [, p(#)] – similar to median but gives the #th percentile of values across the variables specified. If p() is not specified, p(50) is assumed, meaning medians, which is the same as rowmedian option.

·  rowtotal(varlist) [, missing] -- creates the (row) sum of the variables in varlist, treating missing as 0. If missing is specified and all values in varlist are missing for an observation, new variable is set to missing.

·  rowmiss(varlist) -- gives the number of missing values in varlist for each observation (row).

·  rownonmiss(varlist) [, strok] -- gives the number of nonmissing values in varlist for each observation (row). String variables may not be specified unless the strok option is also specified. If strok is specified, string variables will be counted as containing missing values when they contain "".

For all of these, if all variables in the varlist are missing, the result will be missing as well.

Example:

. use "C:\Users\sarkisin\Documents\Data Management\gss2002.dta", clear

. sum relate*

Variable | Obs Mean Std. Dev. Min Max

------+------

relate1 | 2765 1 0 1 1

relate2 | 1876 3.168977 2.126585 2 8

relate3 | 944 3.643008 1.496795 3 8

relate4 | 494 3.504049 1.306476 3 8

relate5 | 207 3.541063 1.313487 3 8

------+------

relate6 | 64 3.890625 1.533925 3 8

relate7 | 24 4.166667 1.632993 3 8

relate8 | 8 4.375 2.065879 3 8

relate9 | 4 5.25 2.629956 3 8

relate10 | 2 3 0 3 3

------+------

relate11 | 6 5.5 1.643168 3 7

relate12 | 1 7 . 7 7

. codebook relate*

------

relate1 relationship of 1st person to household head

------

type: numeric (byte)

label: relate1

range: [1,1] units: 1

unique values: 1 missing .: 0/2765

tabulation: Freq. Numeric Label

2765 1 head of household

------

relate2 relationship of 2nd person to household head

------

type: numeric (byte)

label: relate2

range: [2,8] units: 1

unique values: 7 missing .: 889/2765

tabulation: Freq. Numeric Label

1214 2 spouse

328 3 child

13 4 son or daughter-in-law

7 5 grand or great-grandchild

11 6 parent or parent-in-law

44 7 other relative

259 8 non-relative

889 .

------

relate3 relationship of 3rd person to household head

------

type: numeric (byte)

label: relate3

range: [3,8] units: 1

unique values: 6 missing .: 1821/2765

tabulation: Freq. Numeric Label

774 3 child

15 4 son or daughter-in-law

39 5 grand or great-grandchild

13 6 parent or parent-in-law

40 7 other relative

63 8 non-relative

1821 .

------

relate4 relationship of 4th person to household head

------

type: numeric (byte)

label: relate4

range: [3,8] units: 1

unique values: 6 missing .: 2271/2765

tabulation: Freq. Numeric Label

417 3 child

7 4 son or daughter-in-law

30 5 grand or great-grandchild

1 6 parent or parent-in-law

16 7 other relative

23 8 non-relative

2271 .

------

relate5 relationship of 5th person to household head

------

type: numeric (byte)

label: relate5

range: [3,8] units: 1

unique values: 4 missing .: 2558/2765

tabulation: Freq. Numeric Label

172 3 child

18 5 grand or great-grandchild

9 7 other relative

8 8 non-relative

2558 .

------

relate6 relationship of 6th person to household head

------

type: numeric (byte)

label: relate6

range: [3,8] units: 1

unique values: 4 missing .: 2701/2765

tabulation: Freq. Numeric Label

45 3 child

11 5 grand or great-grandchild

5 7 other relative

3 8 non-relative

2701 .

------

relate7 relationship of 7th person to household head

------

type: numeric (byte)

label: relate7

range: [3,8] units: 1

unique values: 4 missing .: 2741/2765

tabulation: Freq. Numeric Label

14 3 child

7 5 grand or great-grandchild

1 7 other relative

2 8 non-relative

2741 .

------

relate8 relationship of 8th person to household head

------

type: numeric (byte)

label: relate8

range: [3,8] units: 1

unique values: 4 missing .: 2757/2765

tabulation: Freq. Numeric Label

5 3 child

1 5 grand or great-grandchild

1 7 other relative

1 8 non-relative

2757 .

------

relate9 relationship of 9th person to household head

------

type: numeric (byte)

label: relate9

range: [3,8] units: 1

unique values: 3 missing .: 2761/2765

tabulation: Freq. Numeric Label

2 3 child

1 7 other relative

1 8 non-relative

2761 .

------

relate10 relationship of 10th person to household head

------

type: numeric (byte)

label: relate10

range: [3,3] units: 1

unique values: 1 missing .: 2763/2765

tabulation: Freq. Numeric Label

2 3 child

2763 .

------

relate11 relation of 11th person (visitor) to head

------

type: numeric (byte)

label: relate11

range: [3,7] units: 1

unique values: 4 missing .: 2759/2765

tabulation: Freq. Numeric Label

1 3 child

1 4 son or daughter-in-law

2 6 parent or parent-in-law

2 7 other relative

2759 .

------

relate12 relation of 12th person (visitor) to head

------

type: numeric (byte)

label: relate12

range: [7,7] units: 1

unique values: 1 missing .: 2764/2765

tabulation: Freq. Numeric Label

1 7 other relative

2764 .

. for num 1/8: egen relshipX=anycount(relate*), values(X)

-> egen relship1=anycount(relate*), values(1)

-> egen relship2=anycount(relate*), values(2)

-> egen relship3=anycount(relate*), values(3)

-> egen relship4=anycount(relate*), values(4)

-> egen relship5=anycount(relate*), values(5)

-> egen relship6=anycount(relate*), values(6)

-> egen relship7=anycount(relate*), values(7)

-> egen relship8=anycount(relate*), values(8)

. sum relship*

Variable | Obs Mean Std. Dev. Min Max

------+------

relship1 | 2765 1 0 1 1

relship2 | 2765 .4390597 .4963621 0 1

relship3 | 2765 .636528 1.025315 0 8

relship4 | 2765 .0130199 .11338 0 1

relship5 | 2765 .040868 .2810861 0 5

------+------

relship6 | 2765 .0097649 .1088294 0 2

relship7 | 2765 .0433996 .2839101 0 5

relship8 | 2765 .1301989 .4350459 0 5

2. Subgroup operations across cases

When talking about subgroups in previous examples, we were working across variables, focusing on aggregating based on groups of variables. But oftentimes, subgroups of interest are represented within a given variable – that is the case when the dataset is nested/multilevel and structured as such datasets are typically structured – e.g., students nested within schools means each student is one observation, and therefore each school corresponds to multiple observations. In such situations, we could conduct operations by group using by and bysort prefixes – for example, to examine means by groups:

. bysort degree: sum income

------

-> degree = lt high

Variable | Obs Mean Std. Dev. Min Max

------+------

income | 367 9.637602 3.140082 1 13

------

-> degree = high sch

Variable | Obs Mean Std. Dev. Min Max

------+------

income | 1427 10.77645 2.510244 1 13

------

-> degree = junior c

Variable | Obs Mean Std. Dev. Min Max

------+------

income | 193 11.65285 1.509934 1 13

------

-> degree = bachelor

Variable | Obs Mean Std. Dev. Min Max

------+------

income | 430 11.68372 1.55324 1 13

------

-> degree = graduate

Variable | Obs Mean Std. Dev. Min Max

------+------

income | 227 11.8326 1.071841 1 13

------

-> degree = .

Variable | Obs Mean Std. Dev. Min Max

------+------

income | 1 9 . 9 9

We could, however, also use mean command with “over” option:

. mean income, over(degree)

Mean estimation Number of obs = 2644

_subpop_1: degree = lt high school

_subpop_2: degree = high school

_subpop_3: degree = junior college

bachelor: degree = bachelor

graduate: degree = graduate

------

Over | Mean Std. Err. [95% Conf. Interval]

------+------

income |

_subpop_1 | 9.637602 .1639109 9.316195 9.959009

_subpop_2 | 10.77645 .0664514 10.64615 10.90676

_subpop_3 | 11.65285 .1086874 11.43973 11.86597

bachelor | 11.68372 .0749039 11.53684 11.8306

graduate | 11.8326 .0711406 11.6931 11.9721

------

If we want to examine values by group, we can use list with sepby:

. sort indus80

. list age sex hrs1 indus80, sepby(indus80)

+------+

| age sex hrs1 indus80 |

|------|

1. | 72 female . 10 |

2. | . female . 10 |

3. | 62 male 70 10 |

4. | 83 male . 10 |

5. | 74 male 20 10 |

6. | 69 female . 10 |

7. | 31 male . 10 |

8. | 47 male . 10 |

9. | 68 male . 10 |

10. | 75 female . 10 |

11. | 76 female . 10 |

|------|

12. | 39 female . 11 |

13. | 36 male 6 11 |

14. | 45 male 60 11 |

15. | 19 male . 11 |

16. | 61 male . 11 |

17. | 37 male 30 11 |

18. | 50 female 6 11 |

19. | 63 male 50 11 |

20. | 46 female 36 11 |

21. | 43 male 40 11 |

22. | 39 female 40 11 |

23. | 23 male 45 11 |

24. | 36 male 50 11 |

25. | 23 male 70 11 |

26. | 78 male . 11 |

27. | 65 male . 11 |

28. | 20 male . 11 |

29. | 76 male . 11 |

|------|

30. | 51 female 35 20 |

31. | 50 male 20 20 |

|------|

32. | 59 male 45 21 |

33. | 63 female 40 21 |

34. | 20 male 15 21 |

35. | 53 male . 21 |

36. | 24 male 48 21 |

37. | 30 male 16 21 |

38. | 28 male . 21 |

39. | 25 male 65 21 |

|------|

40. | 82 female . 30 |

|------|

41. | 58 female 36 31 |

42. | 31 female 45 31 |

43. | 33 male . 31 |

|------|

44. | 75 male . 40 |

|------|

45. | 33 male 40 41 |

|------|

46. | 56 male 50 42 |

47. | 41 male . 42 |

48. | 39 male . 42 |

49. | 45 male 84 42 |

50. | 52 male 50 42 |

|------|

51. | 48 male 52 50 |

--Break--

r(1);

If we want to actually aggregate from the lower level to the higher level (that is, create a variable that will have the same value for all members of the same subgroup – for example, mean of that subgroup, or its standard deviation, etc.), we would now do that across observations (for a given variable or a combination of variables), also using bysort prefix. There are some useful egen options for that as well – we didn’t discuss these yet:

·  mean(exp) -- the mean of exp

·  sd(exp) -- the standard deviation of exp

·  total(exp) [, missing] -- creates a constant (within varlist) containing the sum of exp treating missing as 0. If missing is specified and all values in exp are missing, new variable is set to missing.

·  median(exp) -- the median of exp

·  pctile(exp) [, p(#)] -- the #th percentile of exp. If p(#) is not specified, 50 is assumed, meaning medians.

·  mode(varname) [, minmode maxmode nummode(integer) missing] -- the mode for varname, which may be numeric or string. The mode is the value occurring most frequently. If two or more modes exist or if varname contains all missing values, the mode produced will be a missing value. To avoid this, the minmode, maxmode, or nummode() option may be used to specify choices for selecting among the multiple modes, and the missing option will treat missing values as categories. minmode returns the lowest value, and maxmode returns the highest value. nummode(#) will return the #th mode, counting from the lowest up. Missing values are excluded from determination of the mode unless missing is specified.