Statistics

...Let's start with something fun...

The Capture/Recapture Method

  1. This method is used to estimate the size of a population.
  2. It's only valid when members can't join or leave the population, and enough time has passed for the marked members to mix back in with the rest.

A classic example

There's a lake in northern Scotland filled with Bitesize fish. Revision pupils want to estimate how many there are.

They take a sample of 50 fish (the capture part).

They mark the fish and then let them swim about for a week in their pool of knowledge.

They then capture 100 fish (the recapture part) and out of the 100 fish, 17 are marked.

Estimate how many fish there are.

The answer is all in the question. From the knowledge we can say:

50 out of the total is equal to 17 out of 100. Then it's a matter of rearranging to find the total (T):

50/T = 17/100, so T = (50 × 100) ÷ 17 = 294.1...

So there must be an estimate of roughly 300 fish swimming in the lake. Easy.
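If you'd like to check the rearranging, here is a minimal Python sketch of the same sum. The function name and its parameter names are made up for illustration; the numbers are the ones from the question.

```python
def estimate_population(marked_first, second_sample, marked_in_second):
    """Capture/recapture: marked_first / T = marked_in_second / second_sample."""
    if marked_in_second == 0:
        raise ValueError("need at least one marked fish in the second sample")
    return marked_first * second_sample / marked_in_second


total = estimate_population(50, 100, 17)
print(round(total))  # 294, i.e. roughly 300 fish
```

Notice the estimate only works if at least one marked fish turns up in the second sample.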



Scatter diagrams and Spearman's Rank

One of the few remaining "easy" topics...

What is it?

The Spearman's Rank method is used with scatter diagrams, so I thought I might as well group them together. First, we all know what a scatter diagram is, right? If not, allow me to enlighten you...

The classic example:

Plot a scatter diagram for the time of a car journey compared to its length from the following data:

Distance (miles) 3 8 3 14 9 7 6 10 1 18
Time taken (mins) 5 11 6 14 10 13 8 18 4 25

Draw your x and y axes like a normal graph and put Distance on x and Time on y. Then it's a case of plotting the pairs of points, simple enough.

Handy hint: If it really confuses you, why not reorder the points so the x ones increase as you go along?

Next find the mean distance (7.9) and mean time (11.4) and plot that on too. Draw a line of best fit that goes through your mean point.
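As a quick check of the mean point, here is a small Python sketch using the data from the table. The variable names are my own; it also reorders the pairs by distance, as the handy hint suggests.

```python
from statistics import mean

distance = [3, 8, 3, 14, 9, 7, 6, 10, 1, 18]       # miles, x-axis
time_taken = [5, 11, 6, 14, 10, 13, 8, 18, 4, 25]  # minutes, y-axis

# Pair the points up and reorder them so the x values increase.
points = sorted(zip(distance, time_taken))

# The line of best fit should pass through this point.
mean_point = (mean(distance), mean(time_taken))
print(mean_point)  # (7.9, 11.4)
```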

That's your scatter graph. Here we can see strong positive correlation. Other correlations could be:

Strong Negative, Weak Negative, No correlation, Weak Positive

What's Spearman's Rank? As with most maths, it requires a formula and some prior knowledge. They're likely to ask you about it in the exam, so revise, and learn! In a nutshell, it shows how strong correlation is, from 1 being strong positive, to 0 being no correlation, and -1 being strong negative.

The formula:

p = 1 - (6 × ΣD²) ÷ (n × (n² - 1))

p = what we're working out (the Spearman's Rank correlation)

n = the total number of pairs of data

D = the difference in ranks for each pair.

First step: Add two more rows to your table of data called Rank (D) and Rank (T) -- Distance and Time, see?

Distance (miles) 3 8 3 14 9 7 6 10 1 18
Time taken (mins) 5 11 6 14 10 13 8 18 4 25
Rank (D)                    
Rank (T)                    

Now it's just a case of ranking your data, giving the lowest Distance 1 in the Rank (D) and the lowest Time 1 in the Rank (T). Just rank them in size order.

What happens if two numbers are the same? Simple really: they tie for the position. If they were taking rank 4 and 5, they both are labeled 4.5 and there would obviously be no rank 4 or 5.
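The tie rule is easy to get wrong by hand, so here is a rough Python sketch of it. The function name is hypothetical; it simply gives tied values the average of the positions they would have taken.

```python
def average_ranks(values):
    """Rank from smallest (1) upwards; tied values share the mean of their positions."""
    ordered = sorted(values)
    ranks = {}
    for value in set(values):
        first = ordered.index(value) + 1         # first position this value takes
        last = first + ordered.count(value) - 1  # last position this value takes
        ranks[value] = (first + last) / 2
    return [ranks[v] for v in values]


distance = [3, 8, 3, 14, 9, 7, 6, 10, 1, 18]
print(average_ranks(distance))
# [2.5, 6.0, 2.5, 9.0, 7.0, 5.0, 4.0, 8.0, 1.0, 10.0]
```

The two journeys of 3 miles would have taken ranks 2 and 3, so they both get 2.5.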

Example, before you get any more confused:

Distance (miles) 3 8 3 14 9 7 6 10 1 18
Time taken (mins) 5 11 6 14 10 13 8 18 4 25
Rank (D) 2.5 6 2.5 9 7 5 4 8 1 10
Rank (T) 2 6 3 8 5 7 4 9 1 10

Next to find the "difference of ranks" you merely take one rank from the other. Then square this to remove any negatives (thus creating D²):

Distance (miles) 3 8 3 14 9 7 6 10 1 18
Time taken (mins) 5 11 6 14 10 13 8 18 4 25
Rank (D) 2.5 6 2.5 9 7 5 4 8 1 10
Rank (T) 2 6 3 8 5 7 4 9 1 10
Difference 0.5 0 -0.5 1 2 -2 0 -1 0 0
Difference squared (D²) 0.25 0 0.25 1 4 4 0 1 0 0

Now we have D² and n we can place them in the formula and work out p. Adding up the D² row gives ΣD² = 10.5, and n = 10:

p = 1 - (6 × 10.5) ÷ (10 × (10² - 1)) = 1 - 63 ÷ 990 = 0.936

From the equation we can see the results for correlation between distance and time had a Spearman's Rank of 0.936, meaning they have strong positive correlation.
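To double-check the arithmetic, here is a minimal Python sketch that works straight from the raw data. It assumes the usual formula, p = 1 - 6ΣD² ÷ (n(n² - 1)), and the function names are my own. Note that strict size order gives the 10-minute journey rank 5 and the 11-minute journey rank 6.

```python
def average_ranks(values):
    """Rank from smallest (1) upwards; tied values share the mean of their positions."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]


def spearman(xs, ys):
    """Spearman's Rank using p = 1 - 6 * sum(D^2) / (n * (n^2 - 1))."""
    n = len(xs)
    d_squared = [(rx - ry) ** 2
                 for rx, ry in zip(average_ranks(xs), average_ranks(ys))]
    return 1 - 6 * sum(d_squared) / (n * (n ** 2 - 1))


distance = [3, 8, 3, 14, 9, 7, 6, 10, 1, 18]
time_taken = [5, 11, 6, 14, 10, 13, 8, 18, 4, 25]

p = spearman(distance, time_taken)
print(round(p, 3))  # 0.936
```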

Random Fact: If you have many tied results the Spearman's Rank will be less accurate.

And with that, you have pretty much everything you need to know about Scatter Diagrams and Spearman's Rank.



Questionnaires ... There's always a double 'n' in 'questionnaire'.

Exams, being exams, like to throw random stuff at you, so here are a few random things about questionnaires to combat that...

Example:

Mark on the scale above how strongly you agree that smoking should be banned in public places:

The exam might ask: Why might the question not provide accurate results?

The answer: Because people will interpret where 'agree' or 'weakly agree', for example, are on the scale differently.

The exam might ask: What can we do to check the answer's accuracy?

The answer: Ask the question in a different way.

Sensitive Questions

What are they? These are questions where people might lie. They're delicate subjects found in queries.

Example:

Have you ever smoked cannabis?

Yes No

An exam might ask, how can we make this question more reliable? There are a number of ways. One such is to give them something to do first.

Throw a coin. If it comes down heads, tick yes. If it comes down tails, answer the question truthfully.

With that, people will think they're safe, because if they answer yes, others will just assume they got a head. This is useful because:

We know the ratio of a fair coin is 1:1, and therefore, out of a sample of 100 people, we'd expect 50/100 to get a head and say yes. If, however, 58/100 say yes, it's probably because 8 of the 50 who got tails answered yes for real. This means that about 8/50 (16%) of people have smoked cannabis.

However, 100 isn't really a big enough number. A sample of 1000 people would be much better, for the probabilities to work.
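The coin trick boils down to one small calculation. Here is a hedged Python sketch of it, assuming a fair coin so that half the sample ticks yes automatically; the function name is invented for illustration.

```python
def estimate_true_yes_rate(yes_answers, sample_size):
    """Estimate the honest 'yes' proportion when heads forces a 'yes'."""
    forced_yes = sample_size / 2      # expected heads: these tick yes regardless
    honest_answers = sample_size / 2  # expected tails: these answer for real
    return (yes_answers - forced_yes) / honest_answers


print(estimate_true_yes_rate(58, 100))  # 0.16, i.e. 8 out of 50
```

With a small sample the number of heads can easily drift away from exactly half, which is why a bigger sample gives a more trustworthy estimate.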

Simple, right?



Cumulative Frequency

Another long topic, so take a deep breath...

What are they? Cumulative frequency graphs are where you add up the frequencies of the data as you go along. Having two columns, you add a third:

Example:

A survey was taken measuring the length of 100 Bitesize fish in the old lake. The results were recorded:

Length (cm)     Frequency   Cumulative Frequency
0 ≤ x < 5       4           4
5 ≤ x < 10      16          20 (4+16)
10 ≤ x < 15     13          33 (20+13)
15 ≤ x < 20     15          48
20 ≤ x < 25     14          62
25 ≤ x < 30     17          79
30 ≤ x < 35     7           86
35 ≤ x < 40     8           94
40 ≤ x < 45     6           100

Notice the total cumulative frequency is 100 and the question told you 100 fish had been measured... That's a good tip for checking your maths is up to scratch. Which of course it will be.
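The running total, and the check against the 100 fish, can be sketched in Python like this. The variable names are mine; the numbers are the ones from the table.

```python
from itertools import accumulate

upper_bounds = [5, 10, 15, 20, 25, 30, 35, 40, 45]  # end of each class (cm)
frequencies = [4, 16, 13, 15, 14, 17, 7, 8, 6]

cumulative = list(accumulate(frequencies))
print(cumulative)  # [4, 20, 33, 48, 62, 79, 86, 94, 100]

# The last running total should match the number of fish measured.
assert cumulative[-1] == 100

# Each cumulative frequency pairs up with the end point of its class.
graph_points = list(zip(upper_bounds, cumulative))
```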

Next, we need to draw the graph. First we need to understand two key words though:

Discrete Data: Data that consists of separate, indivisible categories. For example: Shoe size can only be 1, 2, 3, 4 and not something like 1.34454333... Eye colour can only be one of a set range of colours.
Continuous Data: This is the opposite. It cannot be split into firm categories. Some examples are time, weight, length, height... A length can take any value in between the marks on your ruler, so it will never quite fit anywhere.

While we're at it, there are two other definitions:

Qualitative Data: Used to describe data that doesn't use numbers, e.g. eye colour.
Quantitative Data: Used to describe data that uses numbers, e.g. shoe size.

So check, we're using length, so it must be continuous. To draw the graph, you plot each cumulative frequency point on the end point of the class intervals. Cumulative frequency goes on the y-axis (up) and length in this case on x-axis (across).

Now it's just a case of joining the dots, however make sure you do it dot-to-dot with straight lines:

N.B. (which comes from the Latin 'nota bene', which means 'note well', Jordan tells me): with discrete data, you do a frequency step polygon. This is just like a bar chart and uses the upper values of data.

Once you've drawn your graph, the question remains, what can you do with it? Well, by reading across from 50% of the total cumulative frequency, we can find the median. In the same way, the values at 25% and 75% give the lower and upper quartiles, and the gap between them is the inter quartile range.

From this we can see the inter quartile range is 16 (29 - 13).

Outliers: These are anomalies either at the beginning or end of the cumulative frequency graph. They are easy to spot when you draw a box plot (the green bit in the image above).

The only rule for outliers is that the result that causes them must be at least 1.5 times the inter quartile range below the lower quartile or above the upper quartile.

To draw the box plot with the outlier, you work out where the boundary should have been by multiplying the IQR by 1.5 and draw that as a dotted line. The outlier appears as a dot:
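Here is a rough Python sketch of the 1.5 × IQR rule, using the quartiles of 13 and 29 quoted earlier. The function names and the 60 cm fish are made up purely for illustration.

```python
def outlier_boundaries(lower_quartile, upper_quartile):
    """Return the two dotted-line boundaries, 1.5 x IQR beyond each quartile."""
    iqr = upper_quartile - lower_quartile
    return lower_quartile - 1.5 * iqr, upper_quartile + 1.5 * iqr


def is_outlier(value, lower_quartile, upper_quartile):
    low, high = outlier_boundaries(lower_quartile, upper_quartile)
    return value <= low or value >= high


print(outlier_boundaries(13, 29))  # (-11.0, 53.0)
print(is_outlier(60, 13, 29))      # True: a hypothetical 60 cm fish
print(is_outlier(40, 13, 29))      # False
```

The lower boundary comes out negative here, so for lengths only the upper one matters.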



Mean and Standard Deviation

When you have a bar chart or histogram, you can work out the standard deviation. This little thing sounds complicated, but actually all it means is how much the data in the bar chart/histogram varies from the mean.

First off, as if you could forget, let's have a reminder of what a histogram is...

A histogram: Like a bar chart, but while the height of a bar chart is important, the area of a histogram is the important thing.

Example

Say a survey of Bitesize fish collected the following data:

Length (cm)     Frequency
0 ≤ x < 10      20
10 ≤ x < 15     13
15 ≤ x < 20     15
20 ≤ x < 25     14
25 ≤ x < 30     17
30 ≤ x < 45     21

This tells us immediately that we're dealing with histograms because the class widths are different. A class interval is the range of values that a frequency covers. For example, there are 20 fish between 0 and 10 cm. The class width would be 10 - 0 = 10.

Whenever you're drawing histograms, you need to add an extra two columns, class width and frequency density. The frequency density is just frequency divided by class width.

So in the table above we have:

Length (cm)     Frequency   Class Width   Frequency Density
0 ≤ x < 10      20          10            2
10 ≤ x < 15     13          5             2.6
15 ≤ x < 20     15          5             3
20 ≤ x < 25     14          5             2.8
25 ≤ x < 30     17          5             3.4
30 ≤ x < 45     21          15            1.4
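
The two extra columns can be filled in with a few lines of Python. This is only a sketch; the (lower, upper, frequency) layout is just my way of writing the table down.

```python
# Each class is (lower bound, upper bound, frequency), taken from the table.
classes = [(0, 10, 20), (10, 15, 13), (15, 20, 15),
           (20, 25, 14), (25, 30, 17), (30, 45, 21)]

densities = []
for lower, upper, frequency in classes:
    width = upper - lower
    density = frequency / width  # frequency density = frequency / class width
    densities.append(density)
    print(f"{lower} to {upper}: width {width}, frequency density {density:.1f}")
```

Because the height of each bar is the frequency density, the area of the bar (height × width) gets you back to the frequency.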

Now the tricky part's done, it's just a case of plotting the graph with Frequency Density and Length:

And that's all there is to plotting a histogram. Remember though, discrete data has gaps between the bars.

Mean

Any table that gives a frequency for a range of values makes finding the mean slightly trickier. However, there's a simple trick to it. You take your table of results and add two more columns, called mid-point (x) and fx, which is frequency multiplied by mid-point.

Length (cm)     Frequency   Mid-point (x)   fx
0 ≤ x < 10      20          5               100
10 ≤ x < 15     13          12.5            162.5
15 ≤ x < 20     15          17.5            262.5
20 ≤ x < 25     14          22.5            315
25 ≤ x < 30     17          27.5            467.5
30 ≤ x < 45     21          37.5            787.5

Next it's a case of totaling up frequency and fx. Σf = 100, Σfx = 2095

To work out the mean you just divide Σfx by Σf, so here it equals 20.95. Check to see if that looks reasonable.
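Here is the same mid-point trick as a short Python sketch, using the table above. The variable names are my own.

```python
# Each class is (lower bound, upper bound, frequency), taken from the table.
classes = [(0, 10, 20), (10, 15, 13), (15, 20, 15),
           (20, 25, 14), (25, 30, 17), (30, 45, 21)]

sum_f = sum(f for _, _, f in classes)
sum_fx = sum(f * (lower + upper) / 2 for lower, upper, f in classes)

estimated_mean = sum_fx / sum_f
print(sum_f, sum_fx, estimated_mean)  # 100 2095.0 20.95
```

It's only an estimate of the mean, because every fish in a class is treated as if it were exactly the mid-point length.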

Standard Deviation

Once you have the mean, it's useful to find out how much deviation there is from it. This is where standard deviation fits in.

The formula for standard deviation varies, but the one I've been taught is:

standard deviation = √( Σfx² ÷ Σf - (Σfx ÷ Σf)² )

It might look scary, but remember that and follow the instructions to come and you won't have a problem. This formula isn't on the paper either, so you need to learn it.

From the mean table above, we add an extra column, called fx². It's just a case of multiplying fx by x again. From that, you can find Σfx²:

Length (cm)     Frequency   Mid-point (x)   fx        fx²
0 ≤ x < 10      20          5               100       500
10 ≤ x < 15     13          12.5            162.5     2031.25
15 ≤ x < 20     15          17.5            262.5     4593.75
20 ≤ x < 25     14          22.5            315       7087.5
25 ≤ x < 30     17          27.5            467.5     12856.25
30 ≤ x < 45     21          37.5            787.5     29531.25
Σ               100                         2095      56600

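Now it's just a case of completing the formula. Easy:

standard deviation = √(56600 ÷ 100 - 20.95²) = √(566 - 438.9025) = √127.0975 = 11.27

We can now say the mean length of fishies in the Bitesize Lake is 20.95 cm, and a typical fish is within one standard deviation of that, somewhere between 9.68 cm and 32.22 cm, because the standard deviation is 11.27 cm.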
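As a check on the totals in the table, here is a minimal Python sketch. It assumes the usual grouped-data formula, standard deviation = √(Σfx² ÷ Σf - mean²), and the variable names are mine.

```python
from math import sqrt

# Each class is (lower bound, upper bound, frequency), taken from the table.
classes = [(0, 10, 20), (10, 15, 13), (15, 20, 15),
           (20, 25, 14), (25, 30, 17), (30, 45, 21)]

midpoints = [(lower + upper) / 2 for lower, upper, _ in classes]
frequencies = [f for _, _, f in classes]

sum_f = sum(frequencies)
sum_fx = sum(f * x for f, x in zip(frequencies, midpoints))
sum_fx2 = sum(f * x * x for f, x in zip(frequencies, midpoints))

mean = sum_fx / sum_f
standard_deviation = sqrt(sum_fx2 / sum_f - mean ** 2)

print(sum_f, sum_fx, sum_fx2)                        # 100 2095.0 56600.0
print(round(mean, 2), round(standard_deviation, 2))  # 20.95 11.27
```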

Random Question: If the length of everything increases by 10 what happens to the statistics?

Answer: The Mean, Median and Mode increase by 10, but the Standard Deviation and Range stay the same.

Distribution

When you draw out a histogram you might notice any of the following shapes has been formed:

Here the distribution is positively skewed. There is more small data than there is large.

This is called normal distribution. It means the middle has the most data.

Here the distribution is negatively skewed. There is more large data than there is small.

If your graph is of a normal distribution, magical things can happen. About 68% of the data lies within one standard deviation of the mean. For example, if you have a mean of 6 and a standard deviation of 1, about 68% of the data will be between 5 and 7.

Two standard deviations either side of the mean cover around 95% of the data. In the example above, 95% of the data would fall between 4 and 8.
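If you want to see where the 68% and 95% come from, Python's standard library has a NormalDist class (Python 3.8 and later) that can work them out. This is just a sketch for the mean of 6 and standard deviation of 1 used above.

```python
from statistics import NormalDist

example = NormalDist(mu=6, sigma=1)

# Proportion of the data between 5 and 7 (one standard deviation either side).
within_one = example.cdf(7) - example.cdf(5)

# Proportion of the data between 4 and 8 (two standard deviations either side).
within_two = example.cdf(8) - example.cdf(4)

print(round(within_one, 2), round(within_two, 2))  # 0.68 0.95
```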

Tom scored 60 on his English test, but 55 on his Maths. Explain which he did relatively better in.

This question is deceptively easy. However, you need to look at the mean and standard deviation of the test.

Subject Mean Mark Standard Deviation
English   50  10
Maths  45  5

 

To find out which he did better on, we need to standardise his scores. There's a very easy formula for this:

(Mark - mean mark) ÷ Standard Deviation

For English this would be: 60 - 50 = 10. 10 ÷ 10 = 1. Therefore, his standardised score was 1.

In Maths it would be: 55 - 45 = 10. 10 ÷ 5 = 2. His standardised score would be 2.

Which is bigger? Maths. That means, relatively speaking, he did better in Maths.
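Tom's two scores can be checked with a tiny Python sketch. The function name is made up; the marks, means and standard deviations come from the table.

```python
def standardised_score(mark, mean_mark, standard_deviation):
    """How many standard deviations a mark is above (or below) the mean."""
    return (mark - mean_mark) / standard_deviation


english = standardised_score(60, 50, 10)
maths = standardised_score(55, 45, 5)
print(english, maths)  # 1.0 2.0
print("Maths" if maths > english else "English")  # Maths
```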



Stem and Leaf Diagram

This is a really easy way of grouping some data. It may look confusing at first, but once you get your head around it, you'll laugh at how simple it is.

Say you have a few numbers that you have to sort out:

1, 4, 69, 56, 37, 52, 75, 67, 54, 21, 11, 45, 34, 76, 23

Now, when you draw a stem and leaf diagram, you split them into tens and units. For all those less than 10, you put a 0 in the tens column, and then just write the digits. For numbers between 10 and 19, you write 1 in the tens column and the digits in the other column.

Here's how it's set out:

0 | 1 4      This represents the numbers: 1, 4
1 | 1        This represents the number: 11
2 | 1 3      This represents the numbers: 21, 23
3 | 4 7      This represents the numbers: 34, 37
4 | 5        This represents the number: 45
5 | 2 4 6    This represents the numbers: 52, 54, 56
6 | 7 9      This represents the numbers: 67, 69
7 | 5 6      This represents the numbers: 75, 76

Simple, right?

Remember, the leaves (the digits after the vertical line) need to be arranged in numerical order. The stem is the tens digit in front of the line.

From this you can work out the median, (middle number), which would be 45. You can also work out the Interquartile range by finding the 25% number and the 75% number. 21 and 67 here, so the IQR would be 46.
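Here is a rough Python sketch that builds the same diagram and picks out the median and quartiles. It assumes the (n + 1) ÷ 4 position rule for quartiles, which lands on whole positions here because there are 15 numbers; the variable names are mine.

```python
from collections import defaultdict

numbers = [1, 4, 69, 56, 37, 52, 75, 67, 54, 21, 11, 45, 34, 76, 23]
ordered = sorted(numbers)

# Split each number into its tens digit (stem) and units digit (leaf).
stems = defaultdict(list)
for number in ordered:
    stems[number // 10].append(number % 10)

for stem in range(min(stems), max(stems) + 1):
    leaves = " ".join(str(leaf) for leaf in stems.get(stem, []))
    print(f"{stem} | {leaves}")

count = len(ordered)
median = ordered[(count + 1) // 2 - 1]           # 8th number
lower_quartile = ordered[(count + 1) // 4 - 1]   # 4th number
upper_quartile = ordered[3 * (count + 1) // 4 - 1]  # 12th number
print(median, lower_quartile, upper_quartile, upper_quartile - lower_quartile)
# 45 21 67 46
```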



Sampling Techniques

There are five sampling techniques. You'll need to know what they are and what their advantages and disadvantages are:

Name: Convenience
Description: Picking the sample from the easiest data to get.
Advantage: Quick.
Disadvantage: Not representative.

Name: Cluster
Description: Taking a sample of all the data in one group (a cluster).
Advantage: Again quick.
Disadvantage: Still not representative.

Name: Random
Description: Give each piece of data a number. Generate random numbers to use for the sample (excluding repeated numbers).
Advantage: More likely to be representative.
Disadvantage: It's not certain to be representative.

Name: Systematic
Description: Choose every nth item until you have your sample, e.g. if you need a sample of 100 from a population of 500, divide 500 by 100 to get 5. Then take every 5th person.
Advantage: A representative sample.
Disadvantage: Time consuming.

Name: Stratified
Description: Divide the population into categories (or strata) and take a proportional sample from each stratum.
Advantage: A representative sample.
Disadvantage: Time consuming.

To actually pick your sample in the stratified technique, you'll need another method. The most representative is either: stratified random or stratified systematic.
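Here is a hedged Python sketch of systematic and stratified random sampling. The function names, the random starting point and the two year groups are all my own assumptions for illustration, not part of any exam method.

```python
import random


def systematic_sample(population, sample_size):
    """Take every nth item, where n = population size // sample size."""
    step = len(population) // sample_size
    start = random.randrange(step)  # assumption: start at a random point in the first n
    return population[start::step][:sample_size]


def stratified_random_sample(strata, sample_size):
    """Take a random sample from each stratum, in proportion to its size."""
    total = sum(len(members) for members in strata.values())
    sample = []
    for members in strata.values():
        share = round(sample_size * len(members) / total)  # rounding may shift the total slightly
        sample.extend(random.sample(members, share))
    return sample


population = list(range(1, 501))  # 500 people, numbered 1 to 500
every_fifth = systematic_sample(population, 100)
print(len(every_fifth))  # 100

# Hypothetical strata: 300 people in one year group, 200 in another.
strata = {"Year 10": list(range(300)), "Year 11": list(range(300, 500))}
print(len(stratified_random_sample(strata, 50)))  # 50 (30 from Year 10, 20 from Year 11)
```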



Geometric Mean

This is a really simple, slight variation of finding the mean. If you have a list of numbers:

1, 4, 7, 11, 45

Instead of adding them all together like you would normally, you multiply them all together (in this example: 1 × 4 × 7 × 11 × 45 = 13860).

Next, instead of dividing by the number of items, you root it by the number. Here for example, there are 5 numbers, so you take the 5th root of 13860, which equals 6.735. (You'll need a calculator to do that.)
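The geometric mean is the nth root of the product of the n numbers, which a short Python sketch makes easy to check. The variable names are mine; math.prod needs Python 3.8 or later.

```python
from math import prod

numbers = [1, 4, 7, 11, 45]

product = prod(numbers)                         # multiply them all together
geometric_mean = product ** (1 / len(numbers))  # then take the 5th root

print(product, round(geometric_mean, 3))  # 13860 6.735
```

Python's statistics module also has a geometric_mean function that does the same job in one step.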




All text copyright © 2006 to EJ Taylor. Page Template created by James Taylor. Site created: 10th April, 2006. Last revised: 2 August, 2015