

Sampling Distributions

Sampling Distribution is Fundamental to Inferential Statistics

The concept of "Sampling Distribution" is fundamental to inferential statistics. While the definition is straightforward ("A sampling distribution is the distribution of a statistic."), I always found it difficult to help students learn its utility in inferential statistics.

I've implemented Monte Carlo techniques that generate many samples, building up the distribution of "a statistic" and illustrating the way sampling distributions are used to make inferential decisions. The first block on this page describes one of the examples I have created.

The pages linked to the choices to the left describe the teaching tools available to help students understand some of the typical ways sampling distributions are used to make statistical decisions. Many of the links use screen-captured videos and animations to recreate the experience the student has using the software--or the instructor has using the software as a lecture-enhancing presentation tool.


Click here to open a page of full-sized card images

Examples of the Use of Sampling Distributions:

Hypothesis Testing Model of Inferential Statistics:
Illustrated with the One-Sample t-Test

Cavanagh (1972) reports that the Immediate Memory Span for digits is 7.7 items. I measure digit span in my introductory experimental psychology class every semester. Typically the digit span for 20 students is about 7.4 items. Is this evidence that the digit span of my students is different from the value reported by Cavanagh? (From data I have collected I estimate the standard deviation of digit span is about 1 item--an important value for the simulation of the sampling distribution.)

Cavanagh, J. P. (1972). Relation between the immediate memory span and the memory search rate. Psychological Review, 79, 525-530.

Assume the null hypothesis is true (Ho: µ = 7.7) [Green Distribution]

  • Set up and run the simulation for about 1,000 replications.
  • Each replication simulates a data set of 20 observations drawn from the N(7.7, 1.0) population.
  • Calculate the value of t for each replication and plot that value in a histogram.
  • The empirical Type-I error rate is 5.45%.

Assume Ho is false in a particular way (µ = 7.4) [Blue Distribution]

  • Assume the mean digit span of my students is really 7.4 items.
  • Draw 1,000 or so samples of n = 20 from the N(7.4, 1.0) population.
  • Calculate the value of t for each replication and plot that value in a histogram.
  • Only 24.2% of the replications lead to rejecting Ho.

The simulation illustrates that I have only about 1 chance in 4 of showing my students have a shorter digit span than Cavanagh reports, even if their mean digit span really is less than 7.7 (and in fact 7.4).

How large a class would I need to have a 50% chance of rejecting Ho if the true mean really is 7.4 items? Samples of size 50 yielded a power of 46% (806 replications). If the true standard deviation were 1.5 items, samples of size 50 would provide a power of only about 0.25.
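None of the code below is the actual RippleStat/HyperCard implementation; it is a minimal Python sketch of the same Monte Carlo exercise, assuming numpy and scipy are available and using the values from the example above (n = 20, sd = 1.0, Ho: µ = 7.7, alternative µ = 7.4). Raising n to 50 reproduces the power question just asked.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng()
    n, reps, sd = 20, 1000, 1.0
    mu_null, mu_true = 7.7, 7.4              # Cavanagh's value; assumed class mean
    t_crit = stats.t.ppf(0.975, df=n - 1)    # two-tailed critical value, alpha = .05

    def t_values(mu):
        """Draw `reps` samples of size n from N(mu, sd); return the one-sample t for each."""
        samples = rng.normal(mu, sd, size=(reps, n))
        means = samples.mean(axis=1)
        sems = samples.std(axis=1, ddof=1) / np.sqrt(n)
        return (means - mu_null) / sems      # every t tests Ho: mu = 7.7

    t_null = t_values(mu_null)               # green distribution: Ho true
    t_alt = t_values(mu_true)                # blue distribution: Ho false
    print("empirical Type-I error:", np.mean(np.abs(t_null) > t_crit))
    print("empirical power:       ", np.mean(np.abs(t_alt) > t_crit))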


Hypothesis Testing Model of Inferential Statistics:
Illustrated with the Analysis of Variance

This card displays the results of an interactive simulation of a one-way ANOVA. What values of F would occur if Ho were true? What values of F would occur if Ho were false (in a particular way)?

This simulation is not built around a particular research question. Instead, the "Ho False" sampling distribution is specified in terms of "Small", "Medium", "Large", and "Very Large" effects.

This simulation shows the results of 1001 replications of data sampled from the situation where Ho is true and 3001 replications where Ho is false (again, in a particular way).

From this display, the meaning of the Type I error rate becomes clearer (= 0.047 rather than the nominal 0.050), as does the power (0.866).

The user can set the number of groups, the effect size (when Ho is false), and the number of subjects per group.
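A comparable sketch for the ANOVA card, with the caveat that this page does not say how the card defines "Small" through "Very Large" effects; here I assume Cohen's f and arbitrarily pick f = 0.40 with 3 groups of 20, so the printed Type-I error and power will only roughly resemble the 0.047 and 0.866 reported above.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng()
    k, n_per, sd = 3, 20, 1.0                # groups, subjects per group, within-group SD
    reps_null, reps_alt = 1001, 3001         # replication counts used on the card
    f_eff = 0.40                             # assumed effect size (Cohen's f)

    # Spread the k group means so their SD equals f_eff * sd when Ho is false
    offsets = np.linspace(-1, 1, k)
    mu_alt = offsets / offsets.std() * f_eff * sd

    def f_values(mus, reps):
        """Return `reps` one-way ANOVA F statistics for k groups with means `mus`."""
        return np.array([stats.f_oneway(*[rng.normal(m, sd, n_per) for m in mus]).statistic
                         for _ in range(reps)])

    f_crit = stats.f.ppf(0.95, k - 1, k * (n_per - 1))
    print("Type-I error:", np.mean(f_values(np.zeros(k), reps_null) > f_crit))
    print("power:       ", np.mean(f_values(mu_alt, reps_alt) > f_crit))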

Sampling Distribution Laboratory Module

Sampling Distribution Laboratory Session

RippleSoft Software includes an instructional module for introducing students to the concept of a sampling distribution.

The two images to the left illustrate two of the components of the laboratory module.

Generate An Example "Sampling Distribution"

Repeatedly Sample to Create Many Values of a Statistic

The programming on this card allows the user to use Monte Carlo techniques to generate sampling distributions for 7 common statistics. When the requested number of samples has been drawn, the target statistic calculated for each sample, and each value plotted, the user can click the "Histogrammer" button to graphically display the empirical sampling distribution.

The image to the left shows the mean values for 30,000 samples of size 10 drawn from a N(28,5) distribution. This is the normal distribution used in the "Sampling Distribution Laboratory Module."

It takes about 10 seconds to generate this data using a MacBook Pro Core 2 Duo (2.4 GHz) computer.
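As a rough stand-in for what the card does (my sketch, not RippleStat's code), the following draws 30,000 samples of size 10 from N(28, 5), keeps the mean of each, and hands the result to matplotlib in place of the "Histogrammer":

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng()
    reps, n = 30_000, 10
    means = rng.normal(28, 5, size=(reps, n)).mean(axis=1)   # N(28, 5), as in the lab module

    # The theoretical standard error is 5 / sqrt(10), about 1.58
    print(means.mean(), means.std(ddof=1))
    plt.hist(means, bins=60)                                 # the "Histogrammer" step
    plt.xlabel("sample mean (n = 10)")
    plt.show()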

Plot the Means in a Histogram

These are the 30,000 means plotted in the "Histogrammer" tool. It took about 30 seconds to complete the plot on the MacBook Pro.

Click here for a video of what the user sees

Plot the Variances in a Histogram

This figure shows the results of sampling 30,000 variances from a normal population and plotting the results.
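The variance version is the same idea. For samples from a normal population the sample variance follows a scaled chi-square distribution, which is why the plotted histogram is skewed to the right rather than symmetric. A short sketch, again assuming the N(28, 5) parent population:

    import numpy as np

    rng = np.random.default_rng()
    reps, n, sigma = 30_000, 10, 5.0
    variances = rng.normal(28, sigma, size=(reps, n)).var(axis=1, ddof=1)

    # (n - 1) * s^2 / sigma^2 follows a chi-square distribution with n - 1 df,
    # so the average sample variance should be close to sigma^2 = 25.
    print(variances.mean())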

How about 100,000 Means?

It took less than a minute to generate this data. It took about a minute to plot.

How Inferential Statistical Reasoning Works

Illustration of Inferential Statistical Reasoning

The mode of reasoning adopted by the "Hypothesis Testing Model of Inferential Statistics" is backwards from the way of thinking students bring to statistics--"Assume no difference and see if the data are usefully modeled by that assumption."

This card illustrates the hypothesis testing mode of reasoning. I've written two slightly different tutorials using the interactive features of this card. One or the other may make more sense to you:

Click here to open a page with more detailed information.
Click here to read a tutorial based on this example

The Meaning of "Margin of Error" in Polling Results

Margin of Error When the True Proportion is Not Known

What pollsters mean

Suppose a pollster says, "I asked 250 'randomly sampled' persons whether they approved of [insert question here]. 49.6% indicated they did. The margin of error is +/- 6.2%."

The pollster means, "If the sampling process were to be repeated many times, then I would expect (in the long run) 95 out of 100 of these future samples to have a proportion of agreement between 43.4% and 55.8%, which is 6.2 percentage points below and above the observed proportion of 49.6%."

In the simulation results pictured to the left, 94% of the samples met this criterion.
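The +/- 6.2% figure is consistent with the usual normal-approximation formula, 1.96 * sqrt(0.496 * 0.504 / 250) ≈ 0.062. A Python sketch of the simulation as I read it (the true proportion is never seen by the pollster, so the 0.50 used below is purely my assumption):

    import numpy as np

    rng = np.random.default_rng()
    n, true_p = 250, 0.50                            # true_p is assumed; the pollster never sees it
    first = rng.binomial(n, true_p) / n              # the observed proportion, e.g. 0.496
    moe = 1.96 * np.sqrt(first * (1 - first) / n)    # about 0.062 when first is near 0.496

    future = rng.binomial(n, true_p, size=100) / n   # 100 repetitions of the same poll
    inside = np.mean(np.abs(future - first) <= moe)
    print(f"observed {first:.3f}, MOE {moe:.3f}, {inside:.0%} of future samples inside")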

Margin of Error When the True Proportion is Known

If we sample from a known population

The other way to frame the margin-of-error statement is to draw the sample from a known population. In this instance the margin of error is calculated from the population proportion and not from the observed proportion in the first sample of respondents.

For the results of the simulation pictured, 95 of the 100 samples fall within the 6.2-percentage-point margin of error of the true proportion of 0.50 (for this simulation).

This interpretation isn't what the pollster means, however, because the true proportion of agreement with the question is not known. To know it, every individual would have to be queried, not just a sample.

In the simulation programmed on this card, the settable parameters include the population proportion (aka "True Proportion"), sample size, number of replications (up to 100), and the confidence level.
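The known-population version changes only where the margin of error comes from: it is computed from the true proportion, and each sample proportion is checked against the true value. A sketch using the values quoted above (p = 0.50, n = 250, 100 replications):

    import numpy as np

    rng = np.random.default_rng()
    n, true_p, reps = 250, 0.50, 100
    moe = 1.96 * np.sqrt(true_p * (1 - true_p) / n)   # about 0.062 for p = 0.50, n = 250

    props = rng.binomial(n, true_p, size=reps) / n
    inside = np.mean(np.abs(props - true_p) <= moe)
    print(f"{inside:.0%} of samples fall within {moe:.3f} of the true proportion")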

Confidence Interval on µ

Confidence Interval When the True Standard Deviation is Not Known

One random sample is drawn from the population and a confidence interval is calculated using the standard deviation calculated from the sample.

Based on that information, the researcher can say, "A sample of 50 observations set the 0.95 confidence interval on the population mean to be (94.901, 100.766). The sample mean was 97.833. The sample estimate of the population standard deviation was 10.26. The confidence interval was calculated using t(49) = 2.021."

A 95% confidence interval means that 95% of samples (in the long run) will yield an interval that includes the true population central tendency. There is no way to know whether or not a particular sample's interval included the true central tendency value.

Because this card is a simulation--not data obtained by measuring something in the real world--and the simulation parameters are the true mean and true standard deviation of the parent population, the software can keep track of which samples did correctly capture the true mean in the confidence interval. In the simulation pictured at the left, 95 of the 100 samples did include the value of the true mean. Five did not and are identified in red.*

If you were to repeat this simulation many times, you'd notice that the first sample (in blue) doesn't always include the true value of the population mean. As you might expect, it will fail to do so about 5% of the time.

* This does not mean that 95% of future samples will have a mean (aka "average") within the confidence interval based on the first sample.
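A Python sketch of this simulation; the page doesn't give the card's population parameters, so the µ = 100 and σ = 10 below are assumptions chosen to resemble the example numbers quoted above:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng()
    mu, sigma, n, reps = 100.0, 10.0, 50, 100     # assumed population; 100 samples of 50
    t_crit = stats.t.ppf(0.975, df=n - 1)

    samples = rng.normal(mu, sigma, size=(reps, n))
    means = samples.mean(axis=1)
    half = t_crit * samples.std(axis=1, ddof=1) / np.sqrt(n)   # each interval uses its own s

    captured = np.abs(means - mu) <= half
    print(captured.sum(), "of", reps, "confidence intervals include the true mean")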

Confidence Interval When the True Standard Deviation is Known

One random sample is drawn from the population and a confidence interval is calculated using the true standard deviation (the population standard deviation).

Based on that information, the researcher can say, "A sample of 50 observations set the 95% confidence interval on the population mean to be (97.569, 103.113). The sample mean was 100.341. The sample estimate of the population standard deviation was 8.786. The true standard deviation was 10.0. The confidence interval was calculated using z = 1.959964."

The remaining explanation is the same as in the description given above. In this instance only 94% of the samples "captured the true mean."
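The known-sigma version differs from the previous sketch only in the half-width: every interval uses z = 1.959964 and the true standard deviation rather than each sample's own estimate (same assumed µ = 100 and σ = 10):

    import numpy as np

    rng = np.random.default_rng()
    mu, sigma, n, reps, z = 100.0, 10.0, 50, 100, 1.959964
    means = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
    half = z * sigma / np.sqrt(n)                 # half-width from the known sigma

    captured = np.abs(means - mu) <= half
    print(captured.sum(), "of", reps, "intervals capture the true mean")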

Illustrating the Central Limit Theorem

Click here for videos showing repeated simulations

Sampling from a Normal Distribution

The Central Limit Theorem is a fundamental truth of the universe ["How often does a psychology teacher get to make a statement like this?" I ask you].

This image illustrates the Central Limit Theorem with samples drawn from a Normal Distribution.

Sampling from a Uniform Distribution

Ditto, except sampling from the uniform distribution. You can see in the image to the left that the means of repeated samples form a normal distribution.
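If you want to reproduce the uniform-parent picture without the software, a few lines of numpy will do it; the uniform(0, 1) parent and the sample size of 10 are my choices, not values taken from the card:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng()
    reps, n = 30_000, 10
    means = rng.uniform(0, 1, size=(reps, n)).mean(axis=1)

    # The parent distribution is flat, but the distribution of the means is
    # approximately normal with mean 0.5 and SD sqrt(1 / (12 * n)), about 0.091.
    plt.hist(means, bins=60)
    plt.xlabel("mean of n = 10 uniform(0, 1) observations")
    plt.show()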


© 2005-2008 by Burrton Woodruff. All Rights Reserved. Modified Fri, Dec 28, 2007