Friday, September 28, 2012

Central Limit Theorem and Confidence Intervals

Okay, first of all, I hope that the people who monitor the school computers have nothing against Blogspot, because I am kind of typing this up in the middle of maths class.

And I also hope that the people either side of me won't shun me for being nerdy enough to write about maths two days after the 3CD mock exam.

Anyway...

I'm going to write about sampling. Just because I can, or rather because I'm not sure I can and want your feedback on this. Actually, I was going to write about Spec, but considering that I'm in 3CD class right now I might as well write something about 3CD.

First thing I'm going to tell you about is the Central Limit Theorem. Some smart person, or perhaps a bunch of smart people, worked out that if you take a bunch of samples from a larger population and work out their means, those means will make a graph similar to that of a normal distribution. It doesn't matter what kind of distribution you had before: as long as the samples are reasonably large, the sample means will make a graph similar to a normal distribution.

Example number 1: Let's just say that you recorded the heights of all of the adults in China. The distribution of the heights, like the distribution of pretty much everything else in nature, will almost certainly resemble a normal distribution graph.

Then, if you take samples of, say, 300 people, work out the mean of each sample (which would take a pretty damn long time), and plot all of these means on a graph, you will find that the distribution of these means also resembles a normal distribution. Yay.

And this phenomenon isn't just limited to the normal distribution as you shall see in the next example.

Example 2: Having completed the gargantuan task of measuring all of those people, you have decided to live a quieter life dedicated to coin flipping. You are going to flip a coin 10 times and record the number of heads that you get, and repeat this experiment 1000 times. As each flip is independent and there are only two possible outcomes (success and failure, i.e. getting a head and getting a tail), the number of heads in an experiment can be represented by the binomial distribution.

Recording the number of heads from each of the 1000 experiments gives you 1000 observations from a binomial distribution with n = 10 and p = 0.5. Now, if you take samples of 50 experiments, record the mean number of heads for each sample of 50 experiments, and plot these means on a graph, you will also get...

A normal distribution.

It works pretty much the same way for any other type of distribution you can think of. If you want the graph of the means to more closely resemble a normal distribution, try to make the size of each sample larger and the probability of success (for the binomial distribution at least) closer to 0.5. It's recommended that you stick to samples of size 30 or greater and a probability of at least 0.1.
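If you want to see this for yourself without actually flipping a coin 10,000 times, here's a rough simulation sketch in Python (using NumPy, which is just my choice of tool here; the number of samples and the histogram bins are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# One "experiment" = number of heads in 10 fair coin flips, i.e. Binomial(10, 0.5).
# Take 10,000 samples, each made up of 50 experiments, and record each sample's mean.
samples = rng.binomial(n=10, p=0.5, size=(10_000, 50))
sample_means = samples.mean(axis=1)

# A crude text histogram of the sample means: it should look roughly bell-shaped,
# centred near the population mean of n*p = 5 heads.
counts, edges = np.histogram(sample_means, bins=15)
for count, left_edge in zip(counts, edges):
    print(f"{left_edge:5.2f} | " + "#" * (count // 50))
```

The more samples you take and the bigger each sample is, the smoother and more bell-like that histogram gets.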

Now, here's the nice part. The mean of the sample means is the same as the population mean, which makes things easy when it comes to calculating the mean.

The standard deviation isn't quite so easy, but it's not very hard either. Just divide the population standard deviation by the square root of the sample size, i.e. (sigma)/sqrt(n).
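Not something you'd be asked to do in the exam, but here's a quick numerical check of those two facts, sticking with the coin-flip setup (the sample size of 50 and the number of samples are again just made up):

```python
import numpy as np

rng = np.random.default_rng(1)

n_flips, p, sample_size = 10, 0.5, 50
pop_mean = n_flips * p                        # 5 heads
pop_sd = np.sqrt(n_flips * p * (1 - p))       # about 1.58 heads

samples = rng.binomial(n=n_flips, p=p, size=(100_000, sample_size))
sample_means = samples.mean(axis=1)

print("Mean of the sample means:", sample_means.mean())            # close to 5
print("SD of the sample means:  ", sample_means.std())             # close to...
print("sigma / sqrt(n):         ", pop_sd / np.sqrt(sample_size))  # ...this (about 0.224)
```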

So, what does it all mean?

Why bother working out the sample mean and standard deviation?

Well, using these figures, you can work out the probability that the mean of a sample will be smaller or greater than some given figure. It's simple enough. The sample means are (approximately) normally distributed, with the mean of the sample means equal to the population mean and the standard deviation equal to (population standard deviation)/sqrt(sample size). You can then work out probabilities the way you would work them out in any other normal distribution.
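For example (with completely made-up numbers, since I never did measure everyone in China): suppose the heights have a population mean of 170 cm and a population standard deviation of 8 cm, and you take samples of 300 people. A sketch of the calculation in Python, using SciPy's normal distribution instead of the calculator:

```python
from math import sqrt
from scipy.stats import norm

# Made-up numbers for illustration.
pop_mean, pop_sd, n = 170, 8, 300
se = pop_sd / sqrt(n)   # standard deviation of the sample means

# P(sample mean > 171 cm)
print(1 - norm.cdf(171, loc=pop_mean, scale=se))   # roughly 0.015
```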

You can also use these figures to work out mathematically whether a sample mean deviates too far from the population mean for your liking. You do this by working out the confidence intervals for the sample mean, for example 90% or 95% confidence intervals, and testing to see if the population mean lies within these intervals.

To go about working out a confidence interval, you need to find out where the boundaries of the confidence interval lie. You can work this out in terms of how many standard deviations away from the mean the boundaries are. For example, if you wanted to work out a 95% confidence interval, you are essentially trying to find the value of k in P(-k < X < k) = 0.95 in a standard normal distribution (i.e. mean of 0, standard deviation of 1).

Using a calculator, k = 1.96, i.e. 1.96 standard deviations away from the mean.
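If you'd rather not reach for the calculator, the same k comes out of the inverse normal function in Python (again, just my choice of tool):

```python
from scipy.stats import norm

# k such that P(-k < Z < k) = 0.95 for a standard normal Z.
# 95% in the middle leaves 2.5% in each tail, so we want the 97.5th percentile.
print(norm.ppf(0.975))   # 1.96 (to two decimal places)

# Same idea for a 90% interval:
print(norm.ppf(0.95))    # about 1.645
```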

To work out the 95% confidence interval, therefore, you take the sample mean and add 1.96 times the standard deviation of the sample means (that's the (sigma)/sqrt(n) from before) to get the upper boundary:

(sample mean) + (1.96)(sigma/sqrt(n))

Then you subtract the same amount from the sample mean to get the lower boundary:

(sample mean) - (1.96)(sigma/sqrt(n))

Now that you have your confidence interval, you can do other things with it. For example, you can check to see if the population mean lies within this confidence interval.
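Putting that together with the made-up height numbers from before (population standard deviation of 8 cm, a sample of 300 people, and a sample mean of, say, 170.6 cm), a Python sketch looks like this:

```python
from math import sqrt
from scipy.stats import norm

# Made-up numbers for illustration.
pop_sd, n, sample_mean = 8, 300, 170.6
se = pop_sd / sqrt(n)        # standard deviation of the sample means
k = norm.ppf(0.975)          # 1.96 for a 95% confidence interval

lower = sample_mean - k * se
upper = sample_mean + k * se
print(f"95% confidence interval: ({lower:.2f}, {upper:.2f})")

# Check whether a claimed population mean of 170 cm lies inside the interval:
print(lower <= 170 <= upper)
```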

By the way, if you're not working out the 95% confidence interval, you would substitute whatever percentage you need (as a decimal) into the P(-k < X < k) equation. You would then plug the value of k you get into the two boundary formulas above.

And then there's one other kind of question. Questions like, "How large does a sample need to be if you want to be x% confident that the sample mean lies within y units of the population mean?" In this case, you would first work out how many standard deviations away from the mean you need (i.e. use the P(-k < X < k) equation).

Now you just need to remember that the standard deviation of the sample means is (population standard deviation)/sqrt(n).

(No. of standard deviations away from the mean)(population standard deviation/sqrt(n)) = y

You can then rearrange this to find n.
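With the same made-up height numbers, here's a sketch of how that rearrangement could look in Python (the 0.5 cm margin is invented too):

```python
from math import ceil
from scipy.stats import norm

# "How big must the sample be to be 95% confident that the sample mean
#  lies within 0.5 cm of the population mean?"  (made-up numbers)
pop_sd, y, confidence = 8, 0.5, 0.95

k = norm.ppf(0.5 + confidence / 2)   # 1.96 for 95%
# Solve k * pop_sd / sqrt(n) = y for n:
n = (k * pop_sd / y) ** 2
print(ceil(n))   # round up, since you can't sample a fraction of a person
```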

Someday I might need to clarify this by giving examples using actual numbers because it is getting rather messy using variables. Also I might need to rewrite the equations so that they are easy to read. Watch this space...
