Statistics

What's statistics all about? Graphs, graphs, graphs... Well, actually there are about 7 stages:

  1. A real world problem is observed.
  2. A mathematical model is thought up.
  3. The model is used to make predictions, "What happens if...?"
  4. Real world data is collected.
  5. Predicted results are obtained.
  6. These are compared with statistical tests.
  7. Models are refined as required and then it's back to stage 3...

We use models because:

  • They simplify a real world problem.
  • They improve our understanding of a real world problem.
  • They are quicker and cheaper.
  • They can be used to predict future outcomes.

Stem and Leaf

One of the simplest ways of ordering data is to place it in a stem and leaf diagram. You might have done these at GCSE and they're really easy. For example, with the following data:

Person         1    2    3    4    5    6    7    8    9    10   11
Weight (lb)    166  164  143  189  191  178  165  159  189  191  167
Height (cm)    161  160  160  199  167  178  169  174  172  178  176

Unordered Stem and Leaf

Height in cm
3 | 4 represents 34 cm.

15 |
16 | 10079
17 | 84286
18 |
19 | 9

Ordered Stem and Leaf

Height in cm
3 | 4 represents 34 cm.

15 |
16 | 00179
17 | 24688
18 |
19 | 9

As you can see from the key, the | divides tens from units. It can vary between questions... The first step is to go along the table and record each piece of data in the order they appear. Then it's simply a case of rearranging it.
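The two steps just described (record each value against its stem, then rearrange the leaves) can be sketched in Python. This is only an illustration: the function name `stem_and_leaf` is made up for this example, and the heights are the values read back off the diagram above.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return an ordered stem-and-leaf diagram as {stem: "leaves"}."""
    rows = defaultdict(list)
    for value in sorted(values):          # sorting first gives ordered leaves
        stem, leaf = divmod(value, 10)    # the | divides tens from units
        rows[stem].append(str(leaf))
    return {stem: "".join(leaves) for stem, leaves in rows.items()}

# Heights in cm, as they appear in the diagram above.
heights = [161, 160, 160, 199, 167, 178, 169, 174, 172, 178, 176]
diagram = stem_and_leaf(heights)

# Print every stem between the smallest and largest, including empty ones.
for stem in range(min(diagram), max(diagram) + 1):
    print(f"{stem} | {diagram.get(stem, '')}")
```

Stems with no data (18 here) still get a row, which is what keeps the shape of the distribution visible.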

Stem and leaf diagrams can also be drawn back to back, if you have two sets of data to display.

Using the data above:

Weight in pounds (leaves on the left)
4 | 3 represents 34 lb.

Height in cm (leaves on the right)
3 | 4 represents 34 cm.

   3 | 14 |
   9 | 15 |
7654 | 16 | 00179
   8 | 17 | 24688
  99 | 18 |
  11 | 19 | 9

Stem and leaf diagrams can give us an indication of distribution. There is a much wider spread for weight, in this example, than for height. If we were comparing something like scores on two exams, we could also compare the medians.

Frequency Tables

Lots of things can be done to frequency tables... here's all that you'll need to know for the exam...

Cumulative Frequency

Data is usually grouped, with a frequency recorded for each group. For instance:

Amount (£x)      No. of people (frequency, f)
0 ≤ x < 20       5
20 ≤ x < 40      9
40 ≤ x < 60      20
60 ≤ x < 80      25
80 ≤ x < 100     9

One way we can interpret the data is by working out the cumulative frequency. This simply means add the frequency as you go along. Cumulative frequency is plotted against the upper class boundary. From the above example, we get:

Amount (£x)      Frequency (f)    Upper class boundary    Cumulative frequency
0 ≤ x < 20       5                20                      5
20 ≤ x < 40      9                40                      14 (5+9)
40 ≤ x < 60      20               60                      34 (5+9+20)
60 ≤ x < 80      25               80                      59
80 ≤ x < 100     9                100                     68
Total            68

To check you're right for the cumulative frequency, you can add the frequency column. Or the question will probably say something like, "a survey of 68 people..." and that's an even easier check.
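As a quick sketch of the running total and the check described above, here is the same table in Python. The variable names are just for illustration.

```python
from itertools import accumulate

upper_boundaries = [20, 40, 60, 80, 100]
frequencies = [5, 9, 20, 25, 9]

# Running total of the frequency column.
cumulative = list(accumulate(frequencies))

# Points to plot: (upper class boundary, cumulative frequency).
points = list(zip(upper_boundaries, cumulative))

# The check from the notes: the last cumulative value is the total frequency.
assert cumulative[-1] == sum(frequencies)
print(points)
```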

When we have our cumulative frequency column, we can draw a cumulative frequency curve.

Using this, we can also create a box plot. This is deduced by looking at the quartiles up the y-axis and finding the corresponding x-values:

Box plots are useful because they show a lot of information at once: the median, the spread of the interquartile range, any outliers, and whether the data is symmetrical, positively skewed or negatively skewed.

If you're asked to draw one, make sure you draw it to scale and label the axis!

Outliers are values that lie a long way outside the rest of the data. They are usually represented as a cross:

They can be either too low or too high and are usually worked out by the equations:

Q1 - 1.5(Q3 - Q1) (Anything less than this figure will be an outlier).
Q3 + 1.5(Q3 - Q1) (Anything greater than this figure will be an outlier).

The exam question will always state how to work out the outliers though, so this is one thing you don't have to worry about remembering (just as long as you know how to use the formula).
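Here is a minimal sketch of using the two boundaries above. The quartiles and the data list are made-up values purely for illustration, and the function names are hypothetical. The multiplier 1.5 is the one quoted above; an exam question may state a different rule.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return (lower, upper) boundaries: Q1 - k(Q3 - Q1) and Q3 + k(Q3 - Q1)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def find_outliers(data, q1, q3):
    """Values below the lower boundary or above the upper boundary."""
    low, high = outlier_fences(q1, q3)
    return [x for x in data if x < low or x > high]

# Hypothetical example: Q1 = 20 and Q3 = 30 give boundaries of 5 and 45.
sample = [2, 18, 22, 25, 31, 47]
print(outlier_fences(20, 30))
print(find_outliers(sample, 20, 30))
```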

When you've identified the outliers, where does the end of the box plot occur? You can either use the next highest/lowest data value after the outlier, or use the value worked out from the formula.

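To work out the median, find the $\frac{n+1}{2}$th value.

For Q1 work out the $\frac{n+1}{4}$th value and for Q3 find the $\frac{3(n+1)}{4}$th value.

Percentiles (such as P12) mean a percentage of the way through the cumulative frequency. To work out P12, for example, work out the $\frac{12(n+1)}{100}$th value.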

Linear Interpolation

For grouped frequency, it can be difficult to calculate the median and quartiles. There is a way of estimating an answer, however, and this is called linear interpolation. First step, look at the following data:

Time (secs)       Frequency    Cumulative frequency    Class width
0 ≤ x < 10        0            0                       10
10 ≤ x < 15       8            8                       5
15 ≤ x < 17.5     3            11                      2.5
17.5 ≤ x < 20     7            18                      2.5
20 ≤ x < 24       12           30                      4

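Linear interpolation uses a horrible formula. Sadly, you just gotta know it. For Q1:

$$Q_1 = \text{lower class boundary} + \frac{\frac{n+1}{4} - \text{cumulative frequency before the class}}{\text{frequency of the class}} \times \text{class width}$$

The first step is to find the $\frac{n+1}{4}$th value. In this example, it is 7.75.
We take away 0 (the cumulative frequency before the row 7.75 is found in) and then divide it by 8 (the frequency of the row 7.75 is found in).
Next we times by 5 (the class width of the row 7.75 is found in).
Finally add on 10 (the lower class boundary of the row 7.75 is found in) and the answer appears:

$$Q_1 = 10 + \frac{7.75 - 0}{8} \times 5 = 14.8 \text{ (3 s.f.)}$$

The only difference for the percentiles and other quartiles is replacing $\frac{n+1}{4}$ by the position of whatever you're wanting to find.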
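The steps above can be sketched as a small Python function. The name `interpolate` and the `(lower boundary, class width, frequency)` layout are choices made for this example, not anything from the formula booklet.

```python
def interpolate(position, classes):
    """Estimate the value at a given position in grouped data.

    classes: list of (lower_boundary, class_width, frequency), in order.
    """
    cumulative = 0  # cumulative frequency before the current class
    for lower, width, freq in classes:
        if freq and position <= cumulative + freq:
            return lower + (position - cumulative) / freq * width
        cumulative += freq
    raise ValueError("position is beyond the total frequency")

# The table above as (lower class boundary, class width, frequency).
classes = [(0, 10, 0), (10, 5, 8), (15, 2.5, 3), (17.5, 2.5, 7), (20, 4, 12)]
n = sum(freq for _, _, freq in classes)

q1 = interpolate((n + 1) / 4, classes)   # the 7.75th value
print(round(q1, 1))
```

Passing a different position, such as `(n + 1) / 2`, gives the estimate for the median in the same way.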

Mean from frequency table

It's easy enough to work out the mean from normal data, just the simple formula:

$$\bar{x} = \frac{\sum x}{n}$$

(in other words, add them all up and divide by how many there are.)

For a frequency table, however, it's slightly more complicated.

Time (secs)    Frequency (f)
0 - 9          0
10 - 14        8
15 - 17        3
18 - 20        7
21 - 24        12

For a grouped frequency table, you'll need to work out the mid-point of each class to use as the x variable. Take the lower class boundary and the upper class boundary of the class, add them together and divide by 2.

For example: (0 + 9.5) ÷ 2 = 4.75

The formula is:

$$\bar{x} = \frac{\sum fx}{\sum f}$$

Therefore, once you have the midpoint, you need to multiply f and x:

Time (secs)    Frequency (f)    Midpoint (x)    fx
0 - 9          0                4.75            0
10 - 14        8                12              96
15 - 17        3                16              48
18 - 20        7                19              133
21 - 24        12               23              276

Add up the fx column and then divide by the total of the f column to find the mean:

553 ÷ 30 = 18.4 (3 s.f.)

Standard Deviation

For an ordinary set of data, the standard deviation is found by the following:

$$\sigma = \sqrt{\frac{\sum x^2}{n} - \bar{x}^2}$$

(Variance is the same formula, but without the square root).

For a frequency table, or grouped frequency table, though, again we have a slightly different formula:

$$\sigma = \sqrt{\frac{\sum fx^2}{\sum f} - \bar{x}^2}$$

Taking the above as an example, we need to add an $fx^2$ column. Be careful with this. Notice only the x is squared: it is $f \times x^2$, not $(fx)^2$.

Time (secs)    Frequency (f)    Midpoint (x)    fx     fx²
0 - 9          0                4.75            0      0
10 - 14        8                12              96     1152
15 - 17        3                16              48     768
18 - 20        7                19              133    2527
21 - 24        12               23              276    6348

Now add up the $fx^2$ and f columns, and write in the mean squared:

$$\sigma = \sqrt{\frac{10795}{30} - \left(\frac{553}{30}\right)^2}$$

Stick all that in your calculator and you'll get the answer: 4.48 (3 s.f.)
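To check the column totals, the mean and the standard deviation, here is a short Python sketch. It takes the midpoints exactly as they are listed in the x column of the table above; the variable names are only illustrative.

```python
from math import sqrt

midpoints = [4.75, 12, 16, 19, 23]    # x column, as tabulated above
frequencies = [0, 8, 3, 7, 12]        # f column

total_f = sum(frequencies)
sum_fx = sum(f * x for f, x in zip(frequencies, midpoints))
sum_fx2 = sum(f * x ** 2 for f, x in zip(frequencies, midpoints))  # f times x squared

mean = sum_fx / total_f
sd = sqrt(sum_fx2 / total_f - mean ** 2)

print(total_f, sum_fx, sum_fx2)
print(round(mean, 1), round(sd, 2))
```

Note that the unrounded mean is used inside the square root. Using the rounded 18.4 would shift the standard deviation noticeably.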

Coding

When the numbers are too large to be reasonably worked with, there is another option for finding the mean and standard deviation: coding. This replaces x (the midpoint) with y (connected to x by a formula, which makes it a smaller number).

Use the code $y = \frac{x - 45.5}{10}$ to calculate the mean and standard deviation of the following frequency table:

x       f
15.5    8
25.5    12
35.5    15
45.5    16
55.5    11
65.5    6
75.5    2

We need to add the code column and work out y = (x - 45.5) ÷ 10, and then add columns for fy and fy² rather than fx and fx²:

x        f     y     fy     fy²
15.5     8     -3    -24    72
25.5     12    -2    -24    48
35.5     15    -1    -15    15
45.5     16    0     0      0
55.5     11    1     11     11
65.5     6     2     12     24
75.5     2     3     6      18
Total    70          -34    188

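Next, work out the mean of y, using the formula:

$$\bar{y} = \frac{\sum fy}{\sum f} = \frac{-34}{70} = -0.486 \text{ (3 s.f.)}$$

We think back to the original code:

$$y = \frac{x - 45.5}{10}$$

If we replace y with $\bar{y}$ here, we can replace x with $\bar{x}$:

$$\bar{y} = \frac{\bar{x} - 45.5}{10}$$

Add the numbers, and rearrange to make $\bar{x}$ the subject of the formula.

$\bar{x} = 10\bar{y} + 45.5$ = 40.6 (3 s.f.) and that's your answer!

Standard deviation is exactly the same:

$$\sigma_y = \sqrt{\frac{\sum fy^2}{\sum f} - \bar{y}^2} = \sqrt{\frac{188}{70} - \left(\frac{-34}{70}\right)^2}$$

Subbing in the values we get an answer of $\sigma_y$ = 1.57 (3 s.f.)

Now, if we think of the dispersion, adding and subtracting won't affect it. Dividing by 10 will, however, so we need to times by 10 to get the standard deviation for x: 15.7.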
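A quick way to convince yourself that uncoding works is to compute the answers both ways. The sketch below assumes the code y = (x - 45.5) ÷ 10, which is what the y column of the table corresponds to; the helper name `mean_and_sd` is made up for this example.

```python
from math import sqrt

x_values = [15.5, 25.5, 35.5, 45.5, 55.5, 65.5, 75.5]
frequencies = [8, 12, 15, 16, 11, 6, 2]

def mean_and_sd(values, freqs):
    """Mean and standard deviation from a frequency table."""
    n = sum(freqs)
    mean = sum(f * v for f, v in zip(freqs, values)) / n
    variance = sum(f * v * v for f, v in zip(freqs, values)) / n - mean ** 2
    return mean, sqrt(variance)

# Work with the coded values y = (x - 45.5) / 10 ...
y_values = [(x - 45.5) / 10 for x in x_values]
mean_y, sd_y = mean_and_sd(y_values, frequencies)

# ... then uncode: the mean undoes both steps, the s.d. only the division.
mean_x = 10 * mean_y + 45.5
sd_x = 10 * sd_y

# Working directly with x gives the same answers.
direct_mean, direct_sd = mean_and_sd(x_values, frequencies)
print(round(mean_x, 1), round(sd_x, 1))
```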

Histograms

Histograms are used for representing data that is continuous. A characteristic of these is that the area of the bars represents the frequency, not the height. They're relatively simple to draw...

Example:

The heights of twenty children (to the nearest cm) were recorded in the following frequency table. Draw a histogram to represent the data.

Height     Frequency
120-124    1
125-129    5
130-134    7
135-139    4
140-149    3

There are two columns that we need to add: the class width and the frequency density.

Class width is the width of each group... Be careful to work it out from the lower class boundary and the upper class boundary, not from the numbers printed in the table. For example, 120-124 is actually 119.5-124.5, and so the class width is 5.

Frequency density = frequency ÷ class width

Height     Frequency    Class Width    Frequency Density
120-124    1            5              0.2
125-129    5            5              1
130-134    7            5              1.4
135-139    4            5              0.8
140-149    3            10             0.3

When we have these values, we plot the lower class and upper class boundaries on the x axis and the frequency density on the y axis.

Easy enough, hm?
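The two extra columns can be sketched in Python like this. It assumes heights are recorded to the nearest cm (so each boundary sits 0.5 either side of the printed limits) and takes the second class as 125-129; the variable names are only for illustration.

```python
# (lowest recorded height, highest recorded height, frequency) for each class
classes = [(120, 124, 1), (125, 129, 5), (130, 134, 7), (135, 139, 4), (140, 149, 3)]

rows = []
for low, high, freq in classes:
    lower_boundary = low - 0.5            # e.g. 120 becomes 119.5
    upper_boundary = high + 0.5           # e.g. 124 becomes 124.5
    width = upper_boundary - lower_boundary
    density = freq / width                # frequency density
    rows.append((lower_boundary, upper_boundary, width, density))

for lower_boundary, upper_boundary, width, density in rows:
    print(lower_boundary, upper_boundary, width, round(density, 1))

# Area of each bar (width x density) gives back the frequency.
areas = [width * density for _, _, width, density in rows]
```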

Skew

From the histogram above, we see a slight positive skew: most of the values sit towards the lower end, with a longer tail stretching towards the higher values. A distribution can be positively skewed, symmetrical or negatively skewed, and the following tests differentiate between them:

Positive Skew            Symmetrical              Negative Skew
Mean > Median > Mode     Mean = Median = Mode     Mean < Median < Mode
Q2 - Q1 < Q3 - Q2        Q2 - Q1 = Q3 - Q2        Q2 - Q1 > Q3 - Q2

Scatter Diagrams

When we have two sets of data, we can draw a scatter diagram to see if there is any correlation between them - to see how closely they are connected.

Data: The marks of 10 candidates in Maths and Physics are shown below:

Candidate 1 2 3 4 5 6 7 8 9 10
Physics (x) 18 20 30 40 46 54 60 80 88 92
Maths (y) 42 54 60 54 62 68 80 66 80 100

From the data, we plot each x value against its corresponding y value. The only difference from a line graph is that we don't join the crosses with a line:

We can already see that it's positively correlated. A way to test this is to divide the graph into four quadrants, and then look at where the majority of the points lie:

If most points lie in the 1st and 3rd quadrants, we have a positive correlation. If most points lie in the 2nd and 4th quadrants, we have a negative correlation. If points lie in all four quadrants randomly, we have no correlation.

However, just looking at the scatter diagram is a bit inaccurate. It's much better to calculate the strength of the correlation. There's a nice formula for this called the PMCC (product moment correlation coefficient) and even nicer, it's in the formula booklet. (One thing you don't have to learn... besides knowing how to use it!)

$$r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}}$$

That might not help a lot, but the formula booklet also tells us how to calculate Sxy, Sxx and Syy:

$$S_{xy} = \sum xy - \frac{\sum x \sum y}{n} \qquad S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} \qquad S_{yy} = \sum y^2 - \frac{(\sum y)^2}{n}$$

From the above information, we complete the following table:

x      y      x²      y²       xy
18     42     324     1764     756
20     54     400     2916     1080
30     60     900     3600     1800
40     54     1600    2916     2160
46     62     2116    3844     2852
54     68     2916    4624     3672
60     80     3600    6400     4800
80     66     6400    4356     5280
88     80     7744    6400     7040
92     100    8464    10000    9200

Σx = 528    Σy = 666    Σx² = 34464    Σy² = 46820    Σxy = 38640

If you're lucky the question will already give you these figures, and all you'll be asked to do is use them.

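So:

$$S_{xy} = 38640 - \frac{528 \times 666}{10} = 3475.2$$

$$S_{xx} = 34464 - \frac{528^2}{10} = 6585.6$$

$$S_{yy} = 46820 - \frac{666^2}{10} = 2464.4$$

Now we have these, we merely stick them into the PMCC formula:

$$r = \frac{3475.2}{\sqrt{6585.6 \times 2464.4}} = 0.863 \text{ (3 s.f.)}$$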

PMCC value ranges from -1 to +1 with -1 being perfect negative correlation, 0 being no correlation and +1 being perfect positive correlation. 0.863 is strong positive.

Even if we code the data, the PMCC remains the same.
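The whole PMCC calculation can be sketched in a few lines of Python. The function name `s_values` is made up here, and the code (x - 50) ÷ 2 used at the end is an arbitrary choice, included only to illustrate that coding leaves the PMCC unchanged.

```python
from math import sqrt

physics = [18, 20, 30, 40, 46, 54, 60, 80, 88, 92]    # x
maths = [42, 54, 60, 54, 62, 68, 80, 66, 80, 100]     # y

def s_values(xs, ys):
    """Return (Sxy, Sxx, Syy) from the summary totals."""
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    sxx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    syy = sum(y * y for y in ys) - sum(ys) ** 2 / n
    return sxy, sxx, syy

sxy, sxx, syy = s_values(physics, maths)
r = sxy / sqrt(sxx * syy)
print(round(sxy, 1), round(sxx, 1), round(syy, 1), round(r, 3))

# Coding x (an arbitrary example code) does not change the PMCC.
coded_physics = [(x - 50) / 2 for x in physics]
c_sxy, c_sxx, c_syy = s_values(coded_physics, maths)
r_coded = c_sxy / sqrt(c_sxx * c_syy)
```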

Line of Best Fit

Also referred to as the "least squares" or "regression" line, this is also given in the formula booklet:

y = a + bx
$b = \frac{S_{xy}}{S_{xx}}$
$a = \bar{y} - b\bar{x}$

We can work out b easily enough from the data above:

$$b = \frac{3475.2}{6585.6} = 0.527697$$

To work out a we need $\bar{x}$ and $\bar{y}$. These are just worked out from $\frac{\sum x}{n}$ and $\frac{\sum y}{n}$.

$\bar{y}$ = 666 ÷ 10 = 66.6

$\bar{x}$ = 528 ÷ 10 = 52.8

Then we stick them all into the formula $a = \bar{y} - b\bar{x}$ to find a:

a = 66.6 - (0.527697 x 52.8)

a = 38.738

Now we have a and b, we can put them in y = a + bx.

y = 38.738 + 0.528x

Easy! If the question asked you to draw on the regression line, an easy way is to plot the $(\bar{x}, \bar{y})$ point on the scatter diagram, and then draw the line from the y-axis intercept (a) through this point. The mean point always lies on the line.

If the data is coded, we need to uncode when finding the mean.

A question might then ask you to work out y from an x value, but that's very simple. If an x value is much higher than the given range, the result for the corresponding y value may not be accurate.
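Here is a minimal Python sketch of finding a and b and then predicting y from x. The function name `regression_line` and the x value of 50 are chosen for illustration only (50 sits inside the range of the Physics marks, so the prediction is a reasonable one).

```python
physics = [18, 20, 30, 40, 46, 54, 60, 80, 88, 92]    # x
maths = [42, 54, 60, 54, 62, 68, 80, 66, 80, 100]     # y

def regression_line(xs, ys):
    """Return (a, b) for the least squares line y = a + bx."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    sxx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

a, b = regression_line(physics, maths)
print(round(a, 3), round(b, 3))

# Predicting y for an x value inside the data range (interpolation).
predicted = a + b * 50

# The mean point always lies on the line.
on_line = a + b * (sum(physics) / len(physics))
```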

Probability

If A is an event, the probability of it occurring is the number of ways A can occur, divided by the size of the sample space (the total number of outcomes):

...so p(A) might be or ...

Probability is always 0 ≤ p ≤ 1.

If you have a probability, p(A), the probability of not getting A is written as: p(A'). We can say that to find p(A'), we merely take p(A) away from 1.

A ∩ B -- this means A "intersection" B -- all elements that are in A and in B. We can see this on a Venn diagram:

A ∪ B means A "union" B -- all elements that are in A or in B. On a Venn diagram this is:

Weird combinations can be asked... check out Mathsnet to make sure you understand!

Notice on this Venn diagram that we only include the middle bit, where they both intersect, once. Therefore, we cannot say, as we did at GCSE, that "or means add", because it isn't always true.

The new formula for union is:

p(A ∪ B) = p(A) + p(B) - p(A ∩ B)

We can rearrange this to get:

p(A ∩ B) = p(A) + p(B) - p(A ∪ B)

Example:

There are 15 books on a bookshelf. 10 of these are fiction, 4 of which are hard-back. 6, in total, are hard-back and the remaining 9 are paper back.

Find the probability that a hard-back fiction book is chosen at random.

First stage is to draw a Venn diagram and write in all the numbers:

We're looking for p(H ∩ F)... so where is it both H and F? Where the two circles overlap, so 4/15.

Find the probability that a hardback is chosen but is not fiction.

We're wanting p(H ∩ F')... which is 2/15.
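The bookshelf example can be checked with exact fractions in Python. This is only a sketch of the counting; the variable names are made up, and the union is worked out with the formula for union given earlier.

```python
from fractions import Fraction

total = 15
fiction, hardback, hardback_fiction = 10, 6, 4

p_f = Fraction(fiction, total)
p_h = Fraction(hardback, total)
p_h_and_f = Fraction(hardback_fiction, total)     # the overlap of the circles

# Union: add the two circles, then take off the overlap counted twice.
p_h_or_f = p_h + p_f - p_h_and_f

# Hardback but not fiction: the part of H outside the overlap.
p_h_not_f = p_h - p_h_and_f

print(p_h_and_f, p_h_not_f, p_h_or_f)
```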

Conditional Probability

This occurs when the probability of A is conditional upon B having already occurred. Given B, find the probability of A. It's written out as p(A|B).

There is a nice formula:

$$p(A|B) = \frac{p(A \cap B)}{p(B)}$$

We use tree diagrams to solve conditional probability.

Example:

A bag contains 6 red and 4 blue balls. 2 balls are picked at random and retained.

Find the probability that both balls are red.

First, draw out a tree diagram.

Notice the effect not replacing the balls has.

 

We want p(RR), so we just follow the tree diagram along:

6/10 x 5/9 = 30/90 = 1/3.

Find the probability that the balls are different colours.

We want p(RB) and p(BR)... multiply across both branches and then add these together:

p(RB) = 6/10 x 4/9 = 24/90

p(BR) = 4/10 x 6/9 = 24/90

p(RB) + p(BR) = 24/90 + 24/90 = 48/90 = 8/15.

Find the probability that the second ball is red, given the first is blue.

We want p(R|B), so we use the formula:

p(R|B) = p(B ∩ R) ÷ p(B) = 24/90 ÷ 4/10

= 2/3.
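The three answers from the tree diagram can be reproduced with exact fractions. The sketch below just multiplies along the branches, as described above; the variable names are illustrative.

```python
from fractions import Fraction

red, blue = 6, 4
total = red + blue

# Multiply along the branches (the second pick is from one fewer ball).
p_rr = Fraction(red, total) * Fraction(red - 1, total - 1)
p_rb = Fraction(red, total) * Fraction(blue, total - 1)
p_br = Fraction(blue, total) * Fraction(red, total - 1)

p_different = p_rb + p_br

# Conditional probability: p(R|B) = p(B and R) / p(B).
p_first_blue = Fraction(blue, total)
p_red_given_blue = p_br / p_first_blue

print(p_rr, p_different, p_red_given_blue)
```

The last answer matches the branch on the tree directly: once a blue has gone, 6 of the remaining 9 balls are red.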

Independent Events

Independent events are the opposite of conditional: the outcome of one event doesn't affect the next. For example, if balls are taken from a bag and replaced, the probability of a red ball is the same no matter how many times you pick from the bag.

Therefore, if they are independent:

p(A|B) = p(A|B') = p(A)

This means:

p(A ∩ B) = p(A)p(B)

This is the multiplication rule you learned at GCSE.

If they are mutually exclusive, they cannot occur at the same time and p(A ∩ B) is 0.

This means that:

p(A ∪ B) = p(A) + p(B)

This is the addition rule you were taught at GCSE.

Discrete Random Variables

A discrete random variable takes separate, countable values, each with its own probability, such as the "number on a fair die".

The probability for discrete random variables is written as P(X=x).

Example: A tetrahedral die has the numbers 1, 2, 3, 4 on its faces. The die is biased in such a way that:

P(X=x) = k          x = 1,2,3
P(X=x) = 3k         x = 4

If we draw out this in a probability distribution table we get:

x         1    2    3    4
P(X=x)    k    k    k    3k

All the probabilities added together = 1, naturally, so we can say:

k(1 + 1 + 1 + 3) = 1
6k = 1
k = 1/6

Therefore, we can write out the real probability distribution:

x         1      2      3      4
P(X=x)    1/6    1/6    1/6    1/2

We can also find the cumulative distribution, the F(x):

x         1      2      3      4
P(X=x)    1/6    1/6    1/6    1/2
F(x)      1/6    1/3    1/2    1

The cumulative probability always reaches 1 at the largest value of x.

P(X≤2) means: what is the probability of getting an X value less than or equal to 2? We add up the probabilities we have, and so, in the above example, P(X≤2) = 1/6 + 1/6 = 1/3.

F(x) means P(X≤x) so F(2) = P(X≤2).

If a question asks you something like F(3.5), in our example 3.5 doesn't exist. Therefore, we do F(3) instead, which would be 1/2.
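The biased die example can be sketched in Python with exact fractions. The dictionary layout and the function name `F` are choices made for this illustration.

```python
from fractions import Fraction
from itertools import accumulate

# Each probability as a multiple of k: P(X=1,2,3) = k and P(X=4) = 3k.
multiples = {1: 1, 2: 1, 3: 1, 4: 3}

# The probabilities must add up to 1, so k = 1 / (sum of the multiples).
k = Fraction(1, sum(multiples.values()))
pmf = {x: m * k for x, m in multiples.items()}

# Cumulative distribution: running total of the probabilities.
cdf = dict(zip(pmf, accumulate(pmf.values())))

def F(value):
    """P(X <= value), using the largest x that does not exceed value."""
    eligible = [x for x in cdf if x <= value]
    return cdf[max(eligible)] if eligible else Fraction(0)

print(k, F(2), F(3.5), F(4))
```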

Mean and Variance

Finding the mean and variance is almost identical to finding the mean of a frequency table.

The formula for mean:

E(X) = Σxp

(For a frequency table, the mean is: Σfx ÷ Σf ... the only difference is that we don't divide for E(X).)

For Variance, we have the formula:

Var(X) = Σx²p - μ²

μ is another way of writing E(X) or $\bar{x}$.

Example:

Let X be a discrete random variable with the following distribution:

x        P(X=x)    xp     x²p
0        0.4       0      0
1        0.5       0.5    0.5
2        0.1       0.2    0.4
Total              0.7    0.9

Therefore, E(X) = 0.7

Var(X) = 0.9 - (0.7)² = 0.41

Suppose Y is the random variable given by 3X - 2 (like with coding) for the above table. The table would now look like this:

y        P(Y=y)    yp      y²p
-2       0.4       -0.8    1.6
1        0.5       0.5     0.5
4        0.1       0.4     1.6
Total              0.1     3.7

E(Y) = 0.1

Var(Y) = 3.7 - (0.1)² = 3.69

Remember the code: 3X - 2

There is a connection between E(X) = 0.7 and E(Y) = 0.1.
0.7 x 3 - 2 = 0.1

There is also a connection between Var(X) = 0.41 and Var(Y) = 3.69, but this is harder to spot.
3² x 0.41 = 3.69

In general: (learn these!)

E(aX + b) = aE(X) + b

Var(aX + b) = a²Var(X)
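These two rules can be checked numerically with the tables above. The sketch below stores each distribution as a dictionary of value to probability; the function names are made up for this example.

```python
def expectation(dist):
    """E(X): sum of x times p."""
    return sum(x * p for x, p in dist.items())

def variance(dist):
    """Var(X): sum of x squared times p, minus the mean squared."""
    mu = expectation(dist)
    return sum(x * x * p for x, p in dist.items()) - mu ** 2

X = {0: 0.4, 1: 0.5, 2: 0.1}

# Y = 3X - 2: each value changes, the probabilities stay the same.
a, b = 3, -2
Y = {a * x + b: p for x, p in X.items()}

print(round(expectation(X), 2), round(variance(X), 2))
print(round(expectation(Y), 2), round(variance(Y), 2))
```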

A discrete uniform distribution is one where each value of the random variable has the same probability. For example, when X is the score on a fair 6-sided die, each probability would be 1/6.

The Normal Distribution

Most natural phenomena produce distributions that have the following shape:

It's called the normal distribution and some examples are male height, female height, I.Q...

Generally, it represents continuous data.

An example, to show how it's written:

If we let X be the random variable for IQ:

X has (approx) a Normal Distribution, a mean of 100 and a standard deviation of 15.

We write this as:

X ~ N(100, 15²)

What the Normal Distribution means is that about 68% of data lies within one standard deviation of the mean, and about 95% lies within 2 standard deviations of the mean.

Drawing the IQ graph, we see:

68% of the population have an IQ between 85 and 115. (within 1 s.d. of the mean.)

95% of the population have an IQ between 70 and 130. (within 2 s.d. of the mean.)

 

Standardising

This means converting a normal distribution into a "standard normal" distribution. The standard normal has a mean of 0 and a standard deviation of 1.

Z ~ N(0, 1²)

When we have a standard normal, we can use it to work out the probability of getting greater than, or less than, a particular value.

Example: p(z<1) = 0.8413

There are horrible tables that can be used, or we can use our calculator. Every graphical calculator is slightly different, so make sure you know how to use yours.

Entering the calculator's "less than" function followed by the standardised value will find the probability that z is less than that value.

From the example above, you would enter 1 with that function and press enter, and out pops the probability 0.8413.

Entering the "greater than" function followed by the standardised value will find the probability that z is more than that value.

In order to standardise X ~ N(μ, σ²), we use the following:

$$Z = \frac{X - \mu}{\sigma}$$

Example: If X ~ N(100, 15²), find p(X>150).

The first step is to standardise:

p(Z > (150 - 100) ÷ 15)

p(Z > 3.33)

We're wanting greater than, so we use the "greater than" function and out pops the answer: 0.000429 (3 s.f.).

Find p(70<X<130)

As usual, standardise first:

p(-2 < Z < 2)

We want to draw this as a diagram, so we know exactly what we want:

From this, we can see that we want less than 2, take away less than -2.

We hit into the calculator:

p(Z < 2) - p(Z < -2)

and this should produce the answer: 0.9545
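If you want to check these answers without a graphical calculator, Python's standard library has a `NormalDist` class (in the `statistics` module, Python 3.8 and later). This is just a sketch for checking your working, not something you can use in the exam.

```python
from statistics import NormalDist

Z = NormalDist()                      # standard normal: mean 0, s.d. 1
iq = NormalDist(mu=100, sigma=15)     # X ~ N(100, 15^2)

# cdf(value) gives the probability of being less than that value.
p_z_less_1 = Z.cdf(1)

# "Greater than" is 1 minus "less than".
p_above_150 = 1 - iq.cdf(150)

# Between two values: less than the top, take away less than the bottom.
p_between = iq.cdf(130) - iq.cdf(70)

# Standardising first gives the same answer.
p_between_z = Z.cdf((130 - 100) / 15) - Z.cdf((70 - 100) / 15)

print(round(p_z_less_1, 4), round(p_between, 4), p_above_150)
```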

A question might give you a percentage and ask you to work backwards. Fortunately, the calculator has another magic button called InvN, which allows you to enter a percentage and get a value for the axis. (or you can consult the tables).

Example: The probability of a runner completing a marathon in under 140 mins is 0.0139. If the times are normally distributed, find the mean if the standard deviation is 10 mins.

So we have: X = time to complete a marathon

X ~ N(μ, 10²)

p(X<140) = 0.0139

First step, as usual, is to standardise:

p(Z < (140 - μ) ÷ 10) = 0.0139

Now we use the InvN function and find the value on the horizontal axis is -2.2.
From this, we can say:

(140 - μ) ÷ 10 = -2.2

140 - μ = -22

μ = 162

And there's the answer to the question!
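The working backwards step can also be sketched in Python, where `inv_cdf` from the standard library's `NormalDist` plays the role of InvN. The variable names are only illustrative.

```python
from statistics import NormalDist

sd = 10
p_under_140 = 0.0139

# Inverse normal: the z value with probability 0.0139 below it.
z = NormalDist().inv_cdf(p_under_140)

# Rearranging (140 - mean) / sd = z gives mean = 140 - sd * z.
mean = 140 - sd * z

# Check: with this mean, p(X < 140) comes back as 0.0139.
check = NormalDist(mu=mean, sigma=sd).cdf(140)

print(round(z, 1), round(mean))
```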

That's pretty much all there is for Statistics. Email me if you think something is missing!