Theme: Design of sampling methods
0 General information
0.1 Module code
Theme: Design of sampling methods
0.2 Version history
Version / Date / Description of changes / Author / Institute
1.0p3 / 21-6-2012 / Second version / Ioannis Nikolaidis / ELSTAT
0.3 Template version and print date
Template version used / 1.0 p 3 d.d. 28-6-2011
Print date / 21-6-2012 14:25
Contents
General section – Theme: Design of the sampling methods
1. Summary
2. General description
2.1 Sampling
2.2 Sample design
2.3 Application of sample design methods
2.4 Examples on sample design methods
3. Design issues
4. Available software tools
5. Decision tree of methods
6. Glossary
7. Literature
Specific section – Theme: Design of the sampling methods
A.1 Interconnections with other modules
2. Determining the boundaries of size classes used as strata
General section – Theme: Design of the sampling methods
1. Summary
Sampling provides a means of gathering information about a population without examining it entirely. A sample design comprises the selection method, the sample structure and the plans for drawing inferences about the entire population. Sample designs range from simple to complex and depend on the type of information required and the selection method applied. The design affects the sample size and the way the sample results are analysed. The greater the required precision of the estimates and the greater the complexity of the design applied, the larger the required sample size.
Many sample designs rely on random selection, because this allows inferences to be drawn from the sample to the population at quantified levels of precision. Random selection guards against the selection bias that can arise under non-random selection (purposive or quota sampling). However, random selection may not always be required, e.g. where the samples are small and the sample data will not be extrapolated to draw inferences about the entire population.
2. General description
2.1 Sampling
Sampling is a tool for selecting a subset of units from a target population in order to draw inferences about the entire population on the basis of data collected for these units. The subset of selected units is called a sample. The units that make up the target population must be described in terms of characteristics that clearly identify them (Cochran, 1977). However, some units of the target population may be excluded due to operational constraints, such as the high cost of data collection in some remote areas or the difficulty of identifying and contacting certain units. The population that remains after these exclusions is called the survey population. The list of units from which the sample will be drawn is the sampling frame. The frame population can be obtained and identified from existing information (Rensen, 1998).
2.2 Sample design
A sample design is a set of rules that specifies how a sample of a given size is to be selected. It provides information on the target and final sample sizes, strata definitions and the sample selection methodology. In the theory of finite population sampling, a sampling design specifies for every possible sample its probability of being drawn. Mathematically, a sampling design is denoted by the function $p(s)$, which gives the probability of drawing a sample $s$ (Cochran, 1977).
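The idea of a design as a function that assigns a probability to every possible sample can be illustrated with a minimal sketch. It assumes simple random sampling without replacement from a hypothetical four-unit population; the function names and the unit labels are illustrative only and do not come from the module.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def srs_design(population, n):
    """Return {sample: probability} for SRS without replacement of size n."""
    p = Fraction(1, comb(len(population), n))  # every sample is equally likely
    return {s: p for s in combinations(population, n)}


def inclusion_probability(design, unit):
    """Sum the probabilities of all samples that contain the given unit."""
    return sum(p for s, p in design.items() if unit in s)


units = ["A", "B", "C", "D"]       # hypothetical population, N = 4
design = srs_design(units, 2)      # all samples of size n = 2
total_probability = sum(design.values())
pi_A = inclusion_probability(design, "A")
```

Under these assumptions the probabilities over all samples add up to one, and the inclusion probability of each unit equals n/N.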
Sample design methods generally refer to the technique used to select the sample units for measurement (Rensen, 1998). There are two types of sampling:
· Probability or random sampling: Known positive selection probabilities for all population units. It provides reliable inferences about the entire population.
· Non-probability sampling: A subjective method is applied to select the sample and some units have zero selection probabilities.
There are different types of probability sample designs. The most basic one is simple random sampling. More complex designs include systematic sampling, probability-proportional-to-size sampling, cluster sampling, stratified sampling, multistage sampling and multi-phase sampling. Each of these sampling techniques is useful in different situations. If the objective of the survey is simply to provide overall population estimates and stratification would be inappropriate or impossible, simple random sampling may be the best choice. If the survey is carried out by interviewers (which is not typical in business statistics), so that data collection is costly, and resources are limited, cluster sampling is used. If an effective stratification is possible and/or subpopulation estimates are also desired (such as estimates by region or size of business), stratified sampling is usually performed. The main advantage of probability sampling is that, since each unit is randomly selected and the inclusion probabilities of the sampling units can be determined, reliable estimates of the survey parameters and of their sampling errors can be calculated.
Non-probability sampling is often used as an inexpensive, quick and convenient alternative to probability sampling. However, it is not a valid substitute for probability sampling, since the inclusion probabilities of elements cannot be calculated because of selection bias and, sometimes, the absence of a frame. As a result, there is no way of producing reliable estimates or estimates of their sampling error. In order to make inferences about the population, the characteristics of the population must follow an adequate model or be uniformly or randomly distributed over the population. The commonly used non-probability sampling techniques are purposive sampling, based on the deliberate choice of a sample; quota sampling, based on population proportions; and balanced sampling, which resembles quota sampling in the sense that control variables are used to guide the sample selection. Cut-off sampling applies probabilistic selection to one part of the population, whereas for the remainder of the population the selection probability is equal to zero (‘take-none’ part). This type of sampling is intermediate between probability and non-probability sampling, because part of the target population is deliberately excluded from the sample selection.
The different ways of probability and non-probability sample selection as well as the different estimation methods are described among others in the Statistics Canada publication “Survey Methods and Practices” (2003), the Eurostat publication “Survey sampling reference guidelines” (2008), Rensen (1998), Dalen (2005) and Montaquila and Kalton (2011).
Which sample design best fits a particular survey depends on the availability of suitable sampling frames, the auxiliary information included in the frame, the required level of precision, the detailed information to be obtained for subpopulations (called domains of analysis), the estimators to be applied for the parameter estimates and other operational constraints such as budget and time. The choice of a suitable probability sample ensures that the sample can support the required inferences without considerable biases. If an inappropriate probability sample design is used, the survey estimates will have larger variances than under a more suitable design. Biases may appear in the estimates due to non-response and non-coverage errors. Suitable weighting of sampling units reduces the bias due to non-response, but it may increase the variance of the estimates because of the additional variability in the weights. Corrections and weightings for non-coverage are more difficult than those for non-response, because coverage rates cannot be obtained from the sample, but only from outside sources (Kish, 1992).
2.3 Application of sample design methods
2.3.1 Simple Random Sampling
Under simple random sampling (SRS), each population unit has an equal selection probability and each combination of n units has an equal selection probability. The formulas for estimating population parameters and conducting hypothesis tests are not complicated. However, SRS requires a complete population frame and, by chance, some domains may be over-represented in the sample while others may not be sampled at all.
Simple random sampling allows researchers to use simple weights and formulas for estimating the population parameters, conducting hypothesis tests and applying distribution theory for complex statistics. However, this type of sampling is not widely applied on its own, because it produces less efficient estimates than more complex designs. Additionally, with SRS some domains may be over-represented or not be sampled at all. Thus, in most cases complex sample designs built on simple random sampling are applied (e.g. stratified element sampling, cluster and multistage sampling, usually also stratified).
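As an illustration of the simple weights and formulas mentioned above, the following sketch draws an SRS without replacement and estimates a population total together with its standard error. The estimator shown is the usual expansion estimator with a finite population correction; the function name, the seed parameter and the toy turnover values are assumptions made for the example.

```python
import random
from statistics import variance


def srs_estimate_total(values, n, seed=0):
    """Draw an SRS without replacement and estimate the population total."""
    N = len(values)
    rng = random.Random(seed)
    sample = rng.sample(values, n)          # every subset of size n is equally likely
    weight = N / n                          # design weight, inverse of n / N
    total_hat = weight * sum(sample)
    s2 = variance(sample)                   # sample variance with divisor n - 1
    var_hat = N**2 * (1 - n / N) * s2 / n   # includes the finite population correction
    return total_hat, var_hat ** 0.5


turnover = list(range(1, 101))              # hypothetical population of 100 units
total_hat, se = srs_estimate_total(turnover, n=20)
```

When n equals N the weight is one and the correction factor is zero, so the estimate coincides with the true total and the standard error vanishes.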
2.3.2 Stratification
Stratified sampling is used both to decrease the variances of the estimates and to obtain efficient domain estimators. Commonly, the major domains (e.g. economic activity branches, regions) for which separate and accurate estimates are sought can be defined as strata. For these domains (called design domains) the sampling methods and the sampling fractions can deliberately vary. In order to improve the efficiency of a sampling strategy compared with SRS, additional strata can be defined within the domains, taking into account the distribution of the main population variables. In doing so, the aim is to ensure strong homogeneity within the strata (the units in a stratum should be similar with respect to the variables of interest) and as large a difference as possible between them. This is achieved if the stratification variables are strongly correlated with the survey variables of interest. The benefits of stratification diminish if the variables of interest are not highly correlated with the stratification variables. Stratification variables unrelated to each other, but related to the survey variables, should be preferred (Kish, 1965).
When the variables of interest have skewed distributions, as commonly happens in business surveys, a small number of large units contribute a large share of the parameters being estimated. In this case, within the design domains, a separate stratum is created including all large units (take-all stratum), which are surveyed exhaustively (Glasser, 1962; Hidiroglou, 1986; Lavallee and Hidiroglou, 1988; Hidiroglou, Choudhry and Lavallee, 1991). Generally, the take-all stratum is indispensable if the sampling rates would not yield enough large elements in the sample and if the ‘upper tail’ of the distribution of the variable, although small, accounts for a large portion of the aggregate and substantially influences the estimates (Kish, 1965).
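A minimal sketch of a design with a take-all stratum is given below. It assumes a single, user-supplied cutoff that separates the large units from the rest and uses SRS in the remaining (take-some) part; the cutoff value, the function name and the toy population are hypothetical and are not taken from the cited papers.

```python
import random


def take_all_take_some_total(values, cutoff, n_some, seed=0):
    """Estimate a total: full enumeration at or above the cutoff, SRS below it."""
    take_all = [v for v in values if v >= cutoff]    # surveyed exhaustively
    take_some = [v for v in values if v < cutoff]    # sampled part
    rng = random.Random(seed)
    sample = rng.sample(take_some, n_some)
    weight = len(take_some) / n_some                 # weight in the sampled part only
    return sum(take_all) + weight * sum(sample), len(take_all)


# Hypothetical skewed population: 95 small units and 5 large ones.
values = [10] * 95 + [1000, 2000, 3000, 4000, 5000]
estimate, n_take_all = take_all_take_some_total(values, cutoff=1000, n_some=19)
```

In this toy population the five large units hold most of the total, so enumerating them completely removes their contribution to the sampling variance.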
Commonly, in stratified element sampling, the sample within the strata that are actually sampled (take-some strata) is selected by systematic sampling. Before the systematic sample is drawn, the population is often ordered by variables considered strongly related to the main survey variables, in order to gain precision in the estimates obtained from a population with a monotonic trend (Kalton, 1983).
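Systematic selection from an ordered list can be sketched as follows. The sketch assumes a fractional sampling interval N/n with a random start, which is one common variant; the function name and parameters are illustrative.

```python
import random


def systematic_sample(units, n, key=None, seed=0):
    """Order the units, then select one unit per interval of length N / n."""
    ordered = sorted(units, key=key)         # ordering variable related to the survey variables
    N = len(ordered)
    step = N / n                             # sampling interval, possibly fractional
    start = random.Random(seed).random() * step   # random start in [0, step)
    # min() guards against a floating-point index equal to N
    return [ordered[min(int(start + i * step), N - 1)] for i in range(n)]


units = list(range(100))                     # hypothetical size measures
sample = systematic_sample(units, 10)
```

Because exactly one unit is taken from each interval of the ordered list, the sample is spread over the whole range of the ordering variable, which is the source of the precision gain for populations with a monotonic trend.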
Two-phase sampling
When the sampling frame lacks auxiliary information that could be used to stratify the population, a two-phase sampling scheme may be applied (Kish, 1965). In the first phase a large sample is selected to obtain the required stratification information. This first sample is then stratified and, in the second phase, a subsample is drawn from each stratum in order to collect more detailed information.
For example, suppose that detailed information is needed about the local units of enterprises. The sampling frame only lists the local units, with no auxiliary information about their size (number of employees), which is needed for stratification. In the first phase, a large sample of local units is selected so as to gather the required information on the stratification variable (e.g. number of employees). This first sample is then stratified and, in the second phase, a smaller sample of these units is drawn in each stratum in order to collect more detailed information.
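The two phases can be sketched in code under simplifying assumptions: SRS in both phases, size classes defined by fixed boundaries, and an equal subsample size per stratum. The function `two_phase_sample`, the callback `get_size` (standing in for the first-phase data collection) and all numeric values are hypothetical.

```python
import random


def two_phase_sample(frame, get_size, n_first, n_per_stratum, boundaries, seed=0):
    """Phase 1: large SRS to observe size. Phase 2: SRS within each size stratum."""
    rng = random.Random(seed)
    first_phase = rng.sample(frame, n_first)
    strata = {}
    for unit in first_phase:
        size = get_size(unit)                       # observed in phase 1, not on the frame
        h = sum(size >= b for b in boundaries)      # index of the size class
        strata.setdefault(h, []).append(unit)
    second_phase = {
        h: rng.sample(members, min(n_per_stratum, len(members)))
        for h, members in strata.items()
    }
    return first_phase, second_phase


frame = list(range(1000))                           # hypothetical local unit identifiers
sizes = {u: 1 + (u % 60) for u in frame}            # employment, unknown before phase 1
first_phase, second_phase = two_phase_sample(
    frame, sizes.get, n_first=200, n_per_stratum=15, boundaries=[10, 50]
)
```

Estimation from such a sample would need weights that reflect both phases; that step is left out of the sketch.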
Subclasses
A subclass is the sample of a domain. If a domain was not treated as a design domain, then selecting subclass members from the sample has the effect that zero values are assigned to the variables of non-members. The proportion of zero values (blanks) increases as the subclass proportion decreases. When crossclasses (subclasses that cut across the strata) become smaller, the variability of the survey variables increases greatly and, as a result, the gains in accuracy from stratification tend to be lost, especially for small crossclasses (Kish and Frankel, 1974). This occurs because in a stratum $h$ only $M_h$ elements out of the $N_h$ elements of the population and only $m_h$ elements out of the $n_h$ elements of the sample belong to the subclass. Although $n_h$ is fixed, the number $m_h$ of subclass members in the sample is a random variable. As an example of crossclasses, one can consider the size classes of enterprises defined by employment when the stratification variable is not employment but annual turnover (Kish, 1965).
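The randomness of the subclass sample size can be made visible with a small simulation. The notation is an assumption made for this sketch: N_h is the stratum population size, M_h the number of subclass members in the stratum, n_h the fixed stratum sample size and m_h the number of subclass members that end up in the sample. The numbers are hypothetical.

```python
import random


def subclass_counts(N_h, M_h, n_h, replications, seed=0):
    """Return m_h, the subclass count, for repeated SRS draws of fixed size n_h."""
    rng = random.Random(seed)
    stratum = [True] * M_h + [False] * (N_h - M_h)   # True marks a subclass member
    return [sum(rng.sample(stratum, n_h)) for _ in range(replications)]


counts = subclass_counts(N_h=200, M_h=40, n_h=20, replications=1000)
average = sum(counts) / len(counts)                  # expected value is n_h * M_h / N_h
```

The counts vary from draw to draw around n_h * M_h / N_h, and this extra variability is what erodes the gains from stratification for small crossclasses.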
Boundaries of size classes used as strata
The choice of boundaries for size classes used as strata depends both on the need to create design domains and on the nature of the distribution of the stratification variables. For continuous variables, a practical procedure for obtaining these boundaries is the cumulative $\sqrt{f}$ rule (Dalenius and Hodges, 1959), which creates optimal boundaries for both small and large numbers of size classes. The rule is roughly equivalent to making $W_h S_h$ constant, as conjectured by Dalenius and Gurney (Cochran, 1977), where $W_h$ is the relative size of size class $h$ and $S_h$ is the standard deviation of the variable $y$. Ekman’s similar rule makes $W_h (y_h - y_{h-1})$ constant, where $y_h - y_{h-1}$ is the width of size class $h$ (Ekman, 1959; Cochran, 1977). Additionally, for creating boundaries of continuous variables the rule of equal aggregate stratum sizes may be applied, making the values $W_h \bar{Y}_h$ equal across the size classes (Hansen, Hurwitz and Madow, 1953; Sethi, 1963), where $\bar{Y}_h$ is the mean of the stratification variable in size class $h$. This works well when the coefficients of variation $S_h / \bar{Y}_h$ are about the same for each size class. Then the equality of $W_h \bar{Y}_h$ implies that $W_h S_h$ is equal between strata, which yields a solution close to the best one (Hess, Sethi and Balakrishnan, 1966).
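A rough sketch of the cumulative square root of frequency rule of Dalenius and Hodges is given below. It assumes that the stratification variable is first grouped into equal-width bins and that each boundary is placed at the upper edge of the first bin whose cumulated square-root frequency reaches the target; the number of bins, the function name and the test data are assumptions, and the values must not all be equal.

```python
from math import sqrt


def cum_sqrt_f_boundaries(values, n_strata, n_bins=50):
    """Approximate stratum boundaries from a histogram of the stratification variable."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins                  # equal-width bins; requires hi > lo
    freq = [0] * n_bins
    for v in values:
        freq[min(int((v - lo) / width), n_bins - 1)] += 1
    cum, running = [], 0.0
    for f in freq:
        running += sqrt(f)                      # cumulate the square roots of the frequencies
        cum.append(running)
    step = cum[-1] / n_strata                   # equal shares of the cumulated total
    boundaries = []
    for h in range(1, n_strata):
        k = next(i for i, c in enumerate(cum) if c >= h * step)
        boundaries.append(lo + (k + 1) * width) # upper edge of bin k
    return boundaries


values = [float(i * i) for i in range(1, 201)]  # hypothetical skewed size variable
boundaries = cum_sqrt_f_boundaries(values, n_strata=4)
```

The result depends on the initial binning, so in practice a fairly fine grid of bins is used before the square roots are cumulated.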