What's statistics all about? Graphs, graphs, graphs... Well, actually there are about 7 stages:
We use models because:
One of the simplest ways of ordering data is to place it in a stem and leaf diagram. You might have done these at GCSE and they're really easy. For example, with the following data:
As you can see from the key, the | divides tens from units. It can vary between questions... The first step is to go along the table and record each piece of data in the order they appear. Then it's simply a case of rearranging it.
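The two steps (record each value in the order it appears, then rearrange) can be sketched in Python. The data here is made up for illustration, since the original table is not reproduced in these notes, and the function name is hypothetical:

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group two-digit values by tens digit (stem) with sorted units (leaves)."""
    stems = defaultdict(list)
    for v in values:                 # step 1: record in the order they appear
        stems[v // 10].append(v % 10)
    # step 2: rearrange each row so the leaves are in order
    return {stem: sorted(leaves) for stem, leaves in sorted(stems.items())}

# Hypothetical data, not taken from the notes
data = [23, 31, 27, 45, 38, 21, 34, 42, 29, 36]
diagram = stem_and_leaf(data)

for stem, leaves in diagram.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
print("Key: 2 | 3 means 23")
```

This assumes the | divides tens from units; a different key would need a different split than `// 10` and `% 10`.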
Stem and leaf diagrams can also be back to back, if you have two sets of data to display.
Using the data above:
Stem and leaf diagrams can give us an indication of distribution. There is a much wider distribution for weight, in this example, than height. If it were comparing something like scores on two exams, we could compare the median.
Lots of things can be done to frequency tables... here's all that you'll need to know for the exam...
Data is usually grouped, taking the frequency. For instance:
One way we can interpret the data is by working out the cumulative frequency. This simply means add the frequency as you go along. Cumulative frequency is plotted against the upper class boundary. From the above example, we get:
To check you're right for the cumulative frequency, you can add the frequency column. Or the question will probably say something like, "a survey of 68 people..." and that's an even easier check.
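As a rough sketch, the running total and the check can be done in Python. The table below is hypothetical (it is not the table from these notes), with frequencies chosen to total 68:

```python
from itertools import accumulate

# Hypothetical grouped data: (upper class boundary, frequency)
table = [(9.5, 4), (19.5, 11), (29.5, 23), (39.5, 18), (49.5, 12)]

frequencies = [freq for _, freq in table]
cumulative = list(accumulate(frequencies))   # add the frequency as you go along

# Points for the curve: cumulative frequency against upper class boundary
points = [(ucb, cf) for (ucb, _), cf in zip(table, cumulative)]

# The check: the last cumulative frequency equals the total frequency
print(cumulative)                            # [4, 15, 38, 56, 68]
print(cumulative[-1] == sum(frequencies))    # True
```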
When we have our cumulative frequency column, we can draw a cumulative frequency curve.
Using this, we can also create a box plot. This is deduced by looking at the quartiles up the y-axis and finding the corresponding x-values:
Box plots are useful because they tell you lots of information: the median, the spread of the interquartile range, whether there are any outliers, and whether the data is symmetrical, positively skewed or negatively skewed.
If you're asked to draw one, make sure you draw it to scale and label the axis!
Outliers are values a large way outside the range of the data. They are usually represented as a cross:
They can be either too low or too high and are usually worked out by the equations:
The exam question will always state how to work out the outliers though, so this is one thing you don't have to worry about remembering (just as long as you know how to use the formula).
When you've identified the outliers, where does the end of the box plot occur? You can either use the next highest/lowest data value after the outlier, or use the value worked from the formula.
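Here is a small sketch of an outlier check. It assumes the commonly used rule of 1.5 × IQR below Q1 or above Q3; that multiplier is an assumption, so swap in whatever rule the question states. The data and quartiles are made up:

```python
def outlier_limits(q1, q3, k=1.5):
    """Return (lower limit, upper limit); k = 1.5 is an assumed multiplier."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def find_outliers(values, q1, q3, k=1.5):
    low, high = outlier_limits(q1, q3, k)
    return [v for v in values if v < low or v > high]

# Hypothetical data with hypothetical quartiles Q1 = 17 and Q3 = 24
data = [2, 15, 17, 18, 20, 21, 23, 24, 26, 45]
limits = outlier_limits(17, 24)
outliers = find_outliers(data, 17, 24)

print(limits)     # (6.5, 34.5)
print(outliers)   # [2, 45]
```

With these numbers the whiskers could end at 15 and 26 (the nearest values that are not outliers) or at the limits 6.5 and 34.5.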
To work out the median, find the value.
For Q1 work out the value and for Q3 find the value.
Percentiles (such as P12) mean a percentage of the CF. To work out P12, for example, work out the (12/100 × n)th value.
For grouped frequency, it can be difficult to calculate the median and quartiles. There is a way of estimating an answer, however, and this is called linear interpolation. First step, look at the following data:
Linear interpolation uses a horrible formula. Sadly, you've just got to know it:
The first step is to find the value.
In this example, it is 7.75.
The only difference for the percentiles and other quartiles is replacing the median's position by the position of whatever you're wanting to find.
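A minimal sketch of linear interpolation, using the usual form: lower class boundary + (position − cumulative frequency before the class) ÷ class frequency × class width. The table is hypothetical, and taking the median position as n/2 is an assumption (check which convention your exam board uses):

```python
def interpolate(classes, position):
    """Estimate the value at a given position in grouped data.

    classes: list of (lower class boundary, upper class boundary, frequency).
    """
    cf_before = 0
    for lower, upper, freq in classes:
        if freq > 0 and cf_before + freq >= position:
            # how far into this class the position lies, scaled by class width
            return lower + (position - cf_before) / freq * (upper - lower)
        cf_before += freq
    raise ValueError("position is beyond the total frequency")

# Hypothetical grouped table, n = 68
classes = [(0, 9.5, 4), (9.5, 19.5, 11), (19.5, 29.5, 23),
           (29.5, 39.5, 18), (39.5, 49.5, 12)]
n = sum(freq for _, _, freq in classes)

median = interpolate(classes, n / 2)   # position 34
q1 = interpolate(classes, n / 4)       # position 17
print(round(median, 2), round(q1, 2))  # 27.76 20.37
```

For a percentile, only the position changes, for example `interpolate(classes, 12 * n / 100)` for P12.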
It's easy enough to work out the mean from normal data, just the simple formula:
For a frequency table, however, it's slightly more complicated.
For a grouped frequency table, you'll need to work out the mid-point of the x variable. Take the lower class boundary of the bottom value and the upper class boundary of the top value. Add them together and divide by 2.
For example: (0 + 9.5) ÷ 2 = 9.5 ÷ 2 = 4.75
The formula is:
Therefore, once you have the midpoint, you need to multiply f and x:
Add up the fx column and then divide by the total of the f column to find the mean:
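The midpoint, fx column and mean can be sketched like this. The grouped table is made up for illustration; only the first midpoint, (0 + 9.5) ÷ 2 = 4.75, matches the example in these notes:

```python
# Hypothetical grouped table: (lower class boundary, upper class boundary, frequency)
classes = [(0, 9.5, 4), (9.5, 19.5, 11), (19.5, 29.5, 23),
           (29.5, 39.5, 18), (39.5, 49.5, 12)]

sum_f = 0
sum_fx = 0.0
for lower, upper, f in classes:
    x = (lower + upper) / 2    # midpoint of the class
    sum_f += f                 # running total of the f column
    sum_fx += f * x            # running total of the fx column

mean = sum_fx / sum_f
print(sum_fx, sum_f)           # 1897.0 68
print(round(mean, 2))          # 27.9
```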
For an ordinary set of data, the standard deviation is found by the following:
(Variance is the same formula, but without the square root).
For a frequency table, or grouped frequency table, though, again we have a slightly different formula:
Taking the above as an example, we need to add an fx² column. Be careful with this. Notice only the x is squared, not (fx)².
Now add up the fx² and f columns, and write in the mean squared:
Stick all that in your calculator and you'll get the answer: 4.48 (3 s.f.)
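For checking calculator work, here is a sketch of the same calculation: the square root of (Σfx² ÷ Σf − mean²). The table is hypothetical, so it gives a different answer from the 4.48 above:

```python
from math import sqrt

# Hypothetical table of (midpoint x, frequency f)
table = [(4.75, 4), (14.5, 11), (24.5, 23), (34.5, 18), (44.5, 12)]

sum_f = sum(f for _, f in table)
sum_fx = sum(f * x for x, f in table)
sum_fx2 = sum(f * x ** 2 for x, f in table)   # f times x squared, NOT (f * x) squared

mean = sum_fx / sum_f
variance = sum_fx2 / sum_f - mean ** 2        # same formula without the square root
sd = sqrt(variance)
print(round(sd, 2))   # 11.16
```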
When the numbers are too large to be reasonably worked with, there is an option for finding the mean. We can use coding. This replaces x (the midpoint) with y (connected by a formula, which makes it a smaller number).
We need to add the code column, and work out y and then add a column for fy and fy² rather than fx and fx²:
Next, work out the mean of y, using the formula:
We think back to the original code:
Standard deviation is exactly the same:
Now, if we think of the dispersion, adding and subtracting won't affect it. Dividing by 10 will, however, so we need to times by 10 to get the standard deviation for x: 15.7.
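To see why only the dividing step matters for the standard deviation, here is a sketch with a hypothetical code y = (x − 1000) ÷ 10 and made-up values (a plain list rather than a frequency table, to keep it short):

```python
from statistics import fmean, pstdev

# Hypothetical values and a hypothetical code y = (x - 1000) / 10
x_values = [1010, 1030, 1050, 1070, 1090]
y_values = [(x - 1000) / 10 for x in x_values]

mean_y = fmean(y_values)
sd_y = pstdev(y_values)      # population s.d. (divides by n)

# Undo the code: reverse both steps for the mean, only the scaling for the s.d.
mean_x = mean_y * 10 + 1000
sd_x = sd_y * 10

# Compare with working directly on the uncoded values
direct_mean = fmean(x_values)
direct_sd = pstdev(x_values)
print(mean_x, direct_mean)
print(round(sd_x, 3), round(direct_sd, 3))
```

Subtracting 1000 shifts every value by the same amount, so the spread is unchanged; dividing by 10 shrinks the spread by a factor of 10.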
Histograms are used for representing data that is continuous. A characteristic of these is that the area of the bars represents the frequency, not the height. They're relatively simple to draw...
The heights of twenty children (to the nearest cm) were recorded in the following frequency table. Draw a histogram to represent the data.
There are two columns that we need to add: the class width and the frequency density.
Class width is the width of each group... Be careful when calculating to work out from the lower class boundary and the upper class boundary. For example, 120-124 is actually: 119.5-124.5 and so the class width is 5.
Frequency density = frequency ÷ class width
When we have these values, we plot the lower class and upper class boundaries on the x axis and the frequency density on the y axis.
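The two extra columns can be sketched as follows. The classes and frequencies are hypothetical (the original table is not reproduced here); only the 120-124 class with boundaries 119.5-124.5 comes from the example above:

```python
# Hypothetical classes (heights to the nearest cm): (lowest value, highest value, frequency)
classes = [(120, 124, 3), (125, 129, 5), (130, 139, 8), (140, 159, 4)]

rows = []
for low, high, freq in classes:
    lcb = low - 0.5              # lower class boundary
    ucb = high + 0.5             # upper class boundary
    width = ucb - lcb            # class width
    density = freq / width       # frequency density
    rows.append((lcb, ucb, width, density))

for lcb, ucb, width, density in rows:
    print(f"{lcb}-{ucb}: width {width}, frequency density {density}")

# Area of each bar (width x density) gives back the frequency
total_area = sum(width * density for _, _, width, density in rows)
```

The total area comes back to 20, the number of children, which is a handy check that the bars are right.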
Easy enough, hm?
From the histogram above, we see a slight positive skew: more of the values are bunched towards the lower end, with a tail stretching towards the higher values. A distribution can be positively skewed, negatively skewed or symmetrical, and there are three tests to differentiate between them:
When we have two sets of data, we can draw a scatter diagram to see if there is any correlation between them - to see how closely they are connected.
Data: The marks of 10 candidates in Maths and Physics are shown below:
From the data, we can plot the x values corresponding to the y values. The only difference is that we don't join the crosses with a line:
We can already see that it's positively correlated. A way to test this is to divide the graph into four quadrants, and then look at where the majority of the points lie:
However, just looking at the scatter diagram is a bit inaccurate. It's much better to calculate the strength of the correlation. There's a nice formula for this called PMCC (product moment correlation coefficient) and, even nicer, it's in the formula booklet. (One thing you don't have to learn... besides knowing how to use it!)
That might not help a lot, but the formula booklet also tells us how to calculate Sxy, Sxx and Syy:
From the above information, we complete the following table:
If you're lucky the question will already give you these figures, and all you'll be asked to do is use them.
Now we have these, we merely stick them into the PMCC formula:
PMCC value ranges from -1 to +1 with -1 being perfect negative correlation, 0 being no correlation and +1 being perfect positive correlation. 0.863 is strong positive.
Even if we code the data, the PMCC remains the same.
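A sketch of the PMCC calculation, using the usual formula-booklet forms Sxy = Σxy − ΣxΣy/n, Sxx = Σx² − (Σx)²/n and Syy = Σy² − (Σy)²/n. The marks and the codes below are made up, so the result will not be the 0.863 above:

```python
from math import sqrt

def pmcc(xs, ys):
    """Product moment correlation coefficient r = Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    s_yy = sum(y * y for y in ys) - sum(ys) ** 2 / n
    return s_xy / sqrt(s_xx * s_yy)

# Hypothetical marks for 10 candidates
maths = [30, 35, 42, 48, 55, 60, 64, 70, 78, 85]
physics = [28, 40, 38, 50, 52, 65, 60, 72, 75, 88]
r = pmcc(maths, physics)

# Hypothetical codes: subtract a constant, then divide by a positive constant
coded_maths = [(x - 50) / 5 for x in maths]
coded_physics = [(y - 40) / 2 for y in physics]
r_coded = pmcc(coded_maths, coded_physics)

print(round(r, 3), round(r_coded, 3))   # the two values agree
```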
Line of Best Fit
Also referred to as "least squares" or "regression line", this is also given in the formula booklet:
We can work out b easily enough from the data above:
To work out a we need x̄ and ȳ. These are just worked out from Σx ÷ n and Σy ÷ n.
Then we stick them all into the formula a = ȳ - b x̄ to find a:
Now we have a and b, we can put them in y = a + bx.
Easy! If the question asked you to draw on the regression line, an easy way is to plot the (x̄, ȳ) point on the scatter diagram, and then draw the line from the y-axis point, crossing this point. The mean point always lies on the line.
If the data is coded, we need to uncode when finding the mean.
A question might then ask you to work out y from an x value, but that's very simple. If an x value is much higher than the given range, the result for the corresponding y value may not be accurate.
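The regression line y = a + bx can be sketched with b = Sxy ÷ Sxx and a = ȳ − b x̄. The data is hypothetical and the function name is just for illustration:

```python
def regression_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b

# Hypothetical data
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 6, 9, 10]
a, b = regression_line(xs, ys)
print(round(a, 2), round(b, 2))        # 1.2 1.8

# The mean point (3, 6.6) lies on the line
x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
print(round(a + b * x_bar, 2), y_bar)  # 6.6 6.6
```

Predicting y for an x inside the range 1 to 5 is reasonable here; plugging in something like x = 50 would be extrapolating well beyond the data.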
If A is an event, the probability of it occurring is the number of ways A can occur, divided by the sample space (total number of outcomes):
Probability is always 0 ≤ p ≤ 1.
If you have a probability, p(A), the probability of not getting A is written as: p(A'). We can say that to find p(A'), we merely take p(A) away from 1.
A ∩ B means A "intersection" B: all elements that are in A and in B. We can see this on a Venn diagram:
A ∪ B means A "union" B: all elements that are in A or in B. On a Venn diagram this is:
Weird combinations can be asked... check out Mathsnet to make sure you understand!
Notice on this Venn diagram that we only include the middle bit, where they both intersect, once. Therefore, we cannot say, as we did at GCSE, that "or means add", because it isn't always true.
The new formula for union is:
We can rearrange this to get:
There are 15 books on a bookshelf. 10 of these are fiction, 4 of which are hard-back. 6, in total, are hard-back and the remaining 9 are paper back.
Find the probability that a hard-back fiction book is chosen at random.
First stage is to draw a Venn diagram and write in all the numbers:
We're looking for p(H ∩ F)... so where is it both H and F? Where the two circles overlap, so 4/15.
Find the probability that a hardback is chosen but is not fiction.
We're wanting p(H ∩ F')... which is 2/15.
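The bookshelf numbers can be checked with exact fractions. The union line uses the standard rule p(H ∪ F) = p(H) + p(F) − p(H ∩ F); that extra calculation is an illustration and not part of the question:

```python
from fractions import Fraction

total = 15
fiction = 10
hardback = 6
hardback_fiction = 4          # the overlap of the two circles

p_f = Fraction(fiction, total)
p_h = Fraction(hardback, total)
p_h_and_f = Fraction(hardback_fiction, total)

p_h_not_f = p_h - p_h_and_f          # hardback but not fiction
p_h_or_f = p_h + p_f - p_h_and_f     # count the overlap only once

print(p_h_and_f)   # 4/15
print(p_h_not_f)   # 2/15
print(p_h_or_f)    # 4/5
```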
This occurs when the probability of A is conditional upon B having already occurred. Given B, find the probability of A. It's written out as p(A|B).
There is a nice formula:
We use tree diagrams to solve conditional probability.
A bag contains 6 red and 4 blue balls. 2 balls are picked at random and retained.
Find the probability that both balls are red.
First, draw out a tree diagram.
We want p(RR), so we just follow the tree diagram along:
6/10 x 5/9 = 30/90 = 1/3.
Find the probability that the balls are different colours.
We want p(RB) and p(BR)... multiply across both branches and then add these together:
p(RB) = 6/10 x 4/9 = 24/90
p(BR) = 4/10 x 6/9 = 24/90
= 48/90 = 8/15.
Find the probability that the second ball is red, given the first is blue.
We want p(R|B), so we use the formula:
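The tree diagram answers can be checked with exact fractions. The last step uses the standard conditional probability formula p(A|B) = p(A ∩ B) ÷ p(B); the variable names are only for illustration:

```python
from fractions import Fraction

# First pick: 6 red and 4 blue out of 10
p_r1 = Fraction(6, 10)
p_b1 = Fraction(4, 10)

# Second pick, out of the 9 balls left
p_r2_after_r1 = Fraction(5, 9)
p_b2_after_r1 = Fraction(4, 9)
p_r2_after_b1 = Fraction(6, 9)

p_both_red = p_r1 * p_r2_after_r1                               # multiply along the branch
p_different = p_r1 * p_b2_after_r1 + p_b1 * p_r2_after_b1       # add the two branches

# p(second red | first blue) = p(first blue and second red) / p(first blue)
p_b1_and_r2 = p_b1 * p_r2_after_b1
p_r2_given_b1 = p_b1_and_r2 / p_b1

print(p_both_red)      # 1/3
print(p_different)     # 8/15
print(p_r2_given_b1)   # 2/3
```

The 2/3 is just the 6/9 branch on the tree diagram: once a blue has gone, 6 of the remaining 9 balls are red.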
Independent events are the opposite of conditional: one event doesn't affect the next. For example, if balls are taken from a bag and replaced, the probability of a red ball is the same no matter how many times you pick from the bag.
Therefore, if they are independent:
If they are mutually exclusive, they cannot occur at the same time and p(A ∩ B) is 0.
This means that:
Discrete Random Variables
Discrete random variables are variables that take separate, countable values, such as the "number on a fair die".
The probability for discrete random variables is written as P(X=x).
Example: A tetrahedral die has the numbers 1, 2, 3, 4 on its faces. The die is biased in such way that:
P(X=x) = k for x = 1, 2, 3
P(X=x) = 3k for x = 4
If we draw out this in a probability distribution table we get:
All the probabilities added together = 1, naturally, so we can say:
Therefore, we can write out the real probability distribution:
We can also find the cumulative distribution, the F(x):
The cumulative probability always adds up to 1.
If a question asks you something like F(3.5), in our example 3.5 doesn't exist. Therefore, we do F(3) instead, which would be 3k = 1/2.
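The tetrahedral die example can be worked through with exact fractions. This is only a sketch; the dictionary and function names are made up:

```python
from fractions import Fraction

# Probabilities as multiples of k: k for x = 1, 2, 3 and 3k for x = 4
weights = {1: 1, 2: 1, 3: 1, 4: 3}

# All the probabilities add to 1, so 6k = 1 and k = 1/6
k = Fraction(1, sum(weights.values()))
pmf = {x: w * k for x, w in weights.items()}

def F(x):
    """Cumulative distribution: P(X <= x)."""
    return sum(p for value, p in pmf.items() if value <= x)

print(k)         # 1/6
print(pmf[4])    # 1/2
print(F(3.5))    # 1/2, the same as F(3)
print(F(4))      # 1
```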
Mean and Variance
Finding the mean and variance is almost identical to finding the mean of a frequency table.
The formula for mean:
(For a frequency table, the mean is: ... the only difference is that we don't divide for E(X).)
For Variance, we have the formula:
μ is another way of writing E(X), the mean.
If X is a discrete random variable:
Therefore, E(X) = 0.7
Var(X) = 0.9 - (0.7)² = 0.41
Suppose Y is the random variable given by 3X - 2 (like with coding) for the above table. The table would now look like this:
E(Y) = 0.1
Var(Y) = 3.7 - (0.1)² = 3.69
Remember the code: 3X - 2
There is a connection between E(X)= 0.7 and E(Y)=0.1.
There is also a connection between Var(X)=0.41 and Var(Y)=3.69, but this is harder to spot.
In general: (learn these!)
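The connections can be checked numerically against the standard results E(aX + b) = aE(X) + b and Var(aX + b) = a²Var(X). The distribution below is hypothetical: it was chosen to reproduce E(X) = 0.7 and Var(X) = 0.41, and is not necessarily the table used above:

```python
from fractions import Fraction

# Hypothetical distribution chosen to give E(X) = 0.7 and Var(X) = 0.41
pmf_x = {0: Fraction(2, 5), 1: Fraction(1, 2), 2: Fraction(1, 10)}

def expectation(dist):
    return sum(x * p for x, p in dist.items())

def variance(dist):
    # E(X squared) minus the mean squared
    return sum(x * x * p for x, p in dist.items()) - expectation(dist) ** 2

# The code Y = 3X - 2 changes the values but not the probabilities
a, b = 3, -2
pmf_y = {a * x + b: p for x, p in pmf_x.items()}

print(expectation(pmf_x), variance(pmf_x))   # 7/10 41/100
print(expectation(pmf_y), variance(pmf_y))   # 1/10 369/100
```

So 3 × 0.7 − 2 = 0.1 for the mean, and 3² × 0.41 = 3.69 for the variance: the "− 2" has no effect on the spread.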
Discrete Uniform distribution is where each value of the random variable has the same probability. For example, when X is the score on a fair 6-sided die, each probability would be 1/6.
The Normal Distribution
Most natural phenomena produce distributions that have the following shape:
It's called the normal distribution and some examples are male height, female height, I.Q...
Generally, it represents continuous data.
An example, to show how it's written:
If we let X be the random variable for IQ:
X has (approx) a Normal Distribution, a mean of 100 and a standard deviation of 15.
We write this as:
For a Normal Distribution, about 68% of the data lies within one standard deviation of the mean and about 95% lies within 2 standard deviations of the mean.
Drawing the IQ graph, we see:
This means converting a normal distribution into a "standard normal" distribution. The standard normal has a mean of 0 and a standard deviation of 1.
When we have a standard normal, we can use it to work out the probability of getting greater than, or less than, a particular value.
Example: p(z<1) = 0.8413
There are horrible tables that can be used, or we can use our calculator. Every graphical calculator is slightly different, so make sure you know how to use it.
Entering the button followed by the standardised value, will find a probability that z is less than ...
From the example above, you would enter: 1 and press enter, and out pops the probability 0.8413.
Entering the button followed by the standardised value, will find a probability that z is more than ...
In order to standardise X ~ N(μ, σ²), we use the following:
Example: If X ~ N(100, 15²), find p(X>150).
The first step is to standardise:
We're wanting greater than, so we use the calculator's "greater than" function and out pops the answer: 0.000429 (3 s.f.).
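If you don't have the calculator to hand, the same values can be checked in Python with the standard library's `statistics.NormalDist`. This sketch assumes the usual standardising formula z = (x − μ) ÷ σ:

```python
from statistics import NormalDist

Z = NormalDist()                 # standard normal: mean 0, standard deviation 1

# p(z < 1) from the earlier example
print(round(Z.cdf(1), 4))        # 0.8413

# Standardise X ~ N(100, 15 squared) at x = 150
mu, sigma = 100, 15
z = (150 - mu) / sigma
p_greater = 1 - Z.cdf(z)         # "greater than" is 1 minus "less than"

print(round(z, 3))               # 3.333
print(round(p_greater, 6))       # 0.000429
```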
As usual, standardise first:
We want to draw this as a diagram, so we know exactly what we want:
From this, we can see that we want less than 2, take away less than -2.
We hit into the calculator:
and this should produce the answer: 0.9545
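The "less than 2, take away less than -2" step looks like this in Python, again using `statistics.NormalDist` as a stand-in for the calculator:

```python
from statistics import NormalDist

Z = NormalDist()                         # standard normal

# Area between z = -2 and z = 2
p_between = Z.cdf(2) - Z.cdf(-2)
print(round(p_between, 4))               # 0.9545
```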
A question might give you a percentage and ask you to work backwards. Fortunately, the calculator has another magic button called InvN, which allows you to enter a percentage and get a value for the axis (or you can consult the tables).
Example: The probability of a runner completing a marathon in under 140 mins is 0.0139. If the times are normally distributed, find the mean if the standard deviation is 10 mins.
So we have: X = time to complete a marathon
First step, as usual, is to standardise:
Now we use the InvN function and find the number on the x-axis is
And there's the answer to the question!
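The marathon example can be checked in Python, where `NormalDist().inv_cdf` plays the role of the InvN button. This is a sketch that assumes the standardising formula z = (x − μ) ÷ σ, rearranged for μ:

```python
from statistics import NormalDist

p_under_140 = 0.0139     # p(X < 140)
sigma = 10               # standard deviation in minutes

# Work backwards from the probability to the z value
z = NormalDist().inv_cdf(p_under_140)

# z = (140 - mu) / sigma, so mu = 140 - z * sigma
mu = 140 - z * sigma

print(round(z, 2))       # -2.2
print(round(mu))         # 162

# Check: with this mean, p(X < 140) comes back as 0.0139
check = NormalDist(mu, sigma).cdf(140)
```

The z value is negative because 140 minutes is below the mean, which is why the mean comes out above 140.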
That's pretty much all there is for Statistics. Email me if you think something is missing!