Sampling 101
An Introduction to Sampling

Sampling is the process of selecting a subset of units from the population. We use sampling formulas to determine how many units to select, because it is from the characteristics of this sample that we make inferences about the population.

In preparing to take a sample, there are many questions we might have, including:

  • Can we make procedural errors that influence the results, but are not related to the sample itself?
  • Can we make errors in our sampling?
  • How can we determine the size and the accuracy of the sample result?

Systematic Errors (Non-Sampling Errors)

Systematic errors result from decisions that bias the sample selection or response to your survey. Four common mistakes are made:

1. Population Specification Error: This error is one of not understanding who you should be surveying. As a simple example, imagine you are preparing a survey about the consumption of breakfast cereals. Who do you survey? It might be the entire family, the mother, or the children. The family consumes cereal, the mother purchases, and the children influence her choice.

2. Sample Frame Error: A frame error occurs when the wrong sub-population is specified from which the sample is drawn. A classic frame error occurred in predicting the 1936 presidential election between Roosevelt and Landon. The sample frame used was from car registrations and telephone directories. In 1936, car and telephone owners were largely Republicans. While the results may have reflected the sample, the predictions were not accurate for the US as a whole and the results wrongly predicted a Republican victory.

3. Selection Error: Selection error results when the respondents self-select their participation: those who are interested respond. Selection error can be controlled by going to extra lengths to get participation. Typical steps include initiating pre-survey contact requesting cooperation, actual surveying, post-survey follow-up if a response is not received, a second survey request, and finally interviews using alternate modes such as telephone or person to person.

4. Non-Response: Non-response errors occur when non-respondents are different from those who respond. This may occur because either the potential respondent was not contacted (they did not check their e-mail) or they refused to respond (they were all grumpy old men or beautiful young women afraid of strangers). Again, the extent of this non-response error can be checked through follow-up surveys using alternate modes.

Non-Systematic Errors (Sampling Errors)
Sampling errors occur because of variation in the number or representativeness of the sample that responds. Sampling errors can be controlled by (1) careful sample designs, (2) large samples, and (3) multiple contacts to assure representative response. Two types of samples may be drawn: a probability sample, where every person in the population has a known (and typically equal) probability of being selected, and a non-probability sample, where the probability of a person being selected is unknown.

Non-probability samples include quota samples, referral samples, and convenience samples. A quota sample assures that various subgroups of the population are represented on relevant sample characteristics. The quota sample does not use population proportions. For example, a quota sample may be used to make sure you have at least 35 people who have an income of more than $250,000. Respondents from this group would otherwise rarely be interviewed because of their small incidence in the population as a whole.

A referral sample is used to locate a population of rare individuals by referral. Locating 100 adult croquet players would be difficult without referrals.

Convenience samples are based on convenience and may include members of affiliation groups, interest groups, or random intercepts on your website. The objective is to collect as much data as possible, regardless of where they come from.

Probability samples include simple random samples, systematic samples, and stratified samples.

A simple random sample occurs where every element has a known and equal probability of being selected. A true random sample is rarely used because we rarely have a sample frame that lists every person we could sample.
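As a minimal sketch of the idea, assuming we do have a complete sample frame available as a Python list: the frame of 3000 customer IDs and the function name `simple_random_sample` are illustrative assumptions, not part of the original text.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n units without replacement.

    Every unit in the frame has the same selection probability, n / len(frame).
    """
    if n > len(frame):
        raise ValueError("sample size cannot exceed the size of the frame")
    rng = random.Random(seed)  # a seed is used only to make the draw repeatable
    return rng.sample(frame, n)

# Hypothetical frame: customer IDs 1..3000.
frame = list(range(1, 3001))
sample = simple_random_sample(frame, 400, seed=1)
print(len(sample), "units drawn, each with probability", 400 / len(frame))
```

The sketch only works because the frame lists every unit; without such a list, the equal-probability guarantee cannot be enforced.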

A systematic sample occurs when all potential respondents have a known and equal chance of being selected. Typically a systematic sample would select every nth person from the list of potential respondents. For example, a systematic sample of 400 customers (out of the 3000 total customers) would be conducted by computing a respondent selection interval (N/n) = 3000/400 = 7.5. This number is then rounded to 8 and a random number is selected from 1 to 8 (suppose the result is 3). We would select a systematic sample by first selecting the third customer on the list and then every 8th customer thereafter. This will approximate a simple random sample of respondents if the customer list is not systematically ordered in some way.
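The customer example can be sketched in a few lines of Python. This is an illustration under stated assumptions: the customer list is a hypothetical list of IDs 1..3000, the function name `systematic_sample` is invented here, and the interval is rounded half up to match the 7.5 to 8 step in the text.

```python
import math
import random

def systematic_sample(frame, n, start=None, seed=None):
    """Select every k-th unit after a random start, with k = N / n rounded half up."""
    interval = max(1, math.floor(len(frame) / n + 0.5))  # 3000 / 400 = 7.5 -> 8
    if start is None:
        # Random start between 1 and the interval (1-based position on the list).
        start = random.Random(seed).randint(1, interval)
    if not 1 <= start <= interval:
        raise ValueError("start must lie between 1 and the selection interval")
    return frame[start - 1::interval]

# Hypothetical customer list with IDs 1..3000; start=3 mirrors the example.
customers = list(range(1, 3001))
sample = systematic_sample(customers, 400, start=3)
print(sample[:3], len(sample))  # [3, 11, 19] 375
```

Note that because the interval is rounded up from 7.5 to 8, this selection yields 375 customers rather than the 400 originally targeted.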

A stratified sample is sometimes desirable if the population is to be broken up into different groups based on one or more characteristics of the population. In this case, the strata are identified first. Strata may be defined as any groups: credit card users vs. non-credit card users, by gender, age, industry, purchasers vs. non-purchasers, current customers vs. past customers, etc. Once the strata are identified, a simple random sample is drawn within each stratum. Once the survey is completed, the strata are then weighted back to the population proportions.
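The two steps, sampling within each stratum and then weighting back to population proportions, might look like the following sketch. The strata, their sizes, and the toy yes/no answers (1 or 0) are hypothetical, as are the function names; a real survey would substitute its own frame and responses.

```python
import random

def stratified_sample(strata, n_per_stratum, seed=None):
    """Draw a simple random sample within each stratum.

    strata maps a stratum name to the list of units in that stratum.
    """
    rng = random.Random(seed)
    return {name: rng.sample(units, min(n_per_stratum, len(units)))
            for name, units in strata.items()}

def weighted_mean(strata, stratum_means):
    """Weight each stratum's sample mean by that stratum's share of the population."""
    total = sum(len(units) for units in strata.values())
    return sum(len(strata[name]) / total * mean
               for name, mean in stratum_means.items())

# Toy population: every credit card user answers 1, every non-user answers 0.
strata = {"card_users": [1] * 2400, "non_users": [0] * 600}
samples = stratified_sample(strata, 50, seed=7)
means = {name: sum(s) / len(s) for name, s in samples.items()}
estimate = weighted_mean(strata, means)
print(means, estimate)
```

With 50 respondents per stratum, an unweighted average of the pooled answers would be 0.5; weighting by the population shares (80% and 20%) recovers the population value of 0.8.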

Estimating Sample Size
Given the sampling methods discussed above, the final question is how many people should be sampled. The most common method of sample size determination is based on proportions. For example, suppose we are preparing for the Winter Olympics and are interested in estimating "the proportion of out of state skiers that took at least one overnight trip." We might use this number to estimate how many people would consider traveling to the Olympics.

In this case, the sample size is estimated using the standard error of a proportion,

$$ s_p = \sqrt{\frac{p(1-p)}{n}} $$

where p is the proportion of "out of state skiers that took at least one overnight trip". The most conservative value for this proportion is .50. If the desired accuracy is .05, the sample size formula, including a correction for the population size, is:

$$ n = \frac{n_0}{1 + \dfrac{n_0 - 1}{N}}, \qquad n_0 = \frac{z^2 \, p(1-p)}{e^2} $$

where z is the number of standard errors, p is the proportion, e is the accuracy, and N is the population size.
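A hedged sketch of this calculation in Python follows. It assumes the accuracy term is squared (as in the usual form of the formula), uses 1.96 standard errors for the 95% confidence level by default, and rounds up to a whole respondent; the function and parameter names are illustrative, not from the original.

```python
import math

def sample_size(proportion=0.5, accuracy=0.05, z=1.96, population=None):
    """Sample size needed to estimate a proportion to within +/- accuracy.

    z is the number of standard errors (1.96 is assumed here for 95% confidence).
    If population is given, the finite population correction is applied.
    """
    n0 = z ** 2 * proportion * (1 - proportion) / accuracy ** 2
    if population is None:
        return math.ceil(n0)
    return math.ceil(n0 / (1 + (n0 - 1) / population))

print(sample_size())                 # 385 for a very large population
print(sample_size(population=3000))  # 341 for a population of 3000
```

The correction matters only when the sample is a noticeable fraction of the population; for a population of 3000 it reduces the requirement from 385 to 341.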

This formula is easily entered into a spreadsheet and produces the following sample size determination table.

Precision (+/- percentage points) at the 95% Confidence Level
Rows show the sample size; columns show the expected response (variable proportions)

Size    50/50%  40/60%  30/70%  20/80%  90/10%  95/5%
25      20      19.6    18.3    16      12      8.7
50      14.2    13.9    13      11.4    8.5     6.2
75      11.5    11.3    10.5    9.2     6.9     5
100     10      9.8     9.2     8       6       4.4
150     8.2     8       7.5     6.6     4.9     3.6
200     7.1     7       6.5     5.7     4.3     3.1
250     6.3     6.2     5.8     5       3.8     2.7
300     5.8     5.7     5.3     4.6     3.5     2.5
400     5       4.9     4.6     4       3       2.2
500     4.5     4.4     4.1     3.6     2.7     2
600     4.1     4       3.8     3.3     2.5     1.8
800     3.5     3.4     3.2     2.8     2.1     1.5
        3.2     3.1     2.9     2.5     1.9     1.4
        2.9     2.8     2.7     2.3     1.7     1.3
        2.6     2.5     2.4     2.1     1.6     1.1
        2.2     2.2     2       1.8     1.3     0.96
        2       2       1.8     1.6     1.2     0.87
        1.8     1.8     1.7     1.4     1.1     0.79
        1.6     1.5     1.4     1.3     0.95    0.69
        1.4     1.4     1.3     1.1     0.85    0.62
        1.2     1.1     1.1     0.92    0.69    0.5
        1       0.98    0.92    0.8     0.6     0.44
        0.82    0.8     0.75    0.66    0.49    0.36
        0.63    0.62    0.58    0.5     0.38    0.27
        0.4     0.39    0.31    0.32    0.24    0.17

In a product usage study where the expected product usage incidence rate is 30%, a sample of 500 will yield a precision of +/- 4 percentage points at the 95% confidence level.      
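The precision figures can be reproduced with a short sketch. One assumption is inferred from the numbers rather than stated in the text: the tabulated values appear to use two standard errors (z = 2) rather than 1.96. The function name `precision` is illustrative.

```python
import math

def precision(n, proportion, z=2.0):
    """Half-width of the confidence interval, in percentage points.

    z = 2 standard errors is an assumption chosen to match the tabulated values.
    """
    return 100 * z * math.sqrt(proportion * (1 - proportion) / n)

# The product usage example: 30% incidence, sample of 500.
print(round(precision(500, 0.30), 1))  # 4.1, which the text rounds to 4

# A few rows across the first four columns of the table.
for n in (25, 100, 500):
    print(n, [round(precision(n, p), 1) for p in (0.5, 0.4, 0.3, 0.2)])
```

Because precision improves with the square root of the sample size, quadrupling the sample only halves the margin of error.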