Statistics
Statistics: The Logic of Correlation (~20 marks, Paper 2)
Statistics is worth ~20 marks in Paper 2. In Grade 12, the focus shifts from univariate data to bivariate data — the relationship between two variables.
The Big Idea: Relationships Between Variables
If we spend more on advertising, do we sell more products? Statistics gives you three tools to answer this:
- Scatter Plot: A map of dots that shows if there is a pattern.
- Correlation ($r$): A number between $-1$ and $1$ that tells us how strong the linear relationship is and in which direction it runs.
- Regression: The line of “best fit” that lets us predict future outcomes.
The Correlation Coefficient ($r$)
| $r$ value | Strength | Direction | What it looks like |
|---|---|---|---|
| $r = 1$ | Perfect | Positive | All points on an upward line |
| $0.8 \leq r < 1$ | Strong | Positive | Points tightly clustered, rising |
| $0.5 \leq r < 0.8$ | Moderate | Positive | Visible upward trend, some scatter |
| $0 < r < 0.5$ | Weak | Positive | Slight upward trend, lots of scatter |
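| $r = 0$ | None | None | No linear pattern |
| $-0.5 < r < 0$ | Weak | Negative | Slight downward trend, lots of scatter |
| $-0.8 < r \leq -0.5$ | Moderate | Negative | Visible downward trend, some scatter |
| $-1 < r \leq -0.8$ | Strong | Negative | Points tightly clustered, falling |
| $r = -1$ | Perfect | Negative | All points on a downward line |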
💡 Correlation ≠ Causation: Just because two variables are correlated doesn’t mean one CAUSES the other. Ice cream sales and drowning rates are correlated — but ice cream doesn’t cause drowning. Both are caused by hot weather.
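To make the table of $r$ values concrete, here is a minimal Python sketch of how $r$ could be computed and classified. The notes do not state a formula for $r$, so this assumes the standard Pearson formula. The function names `pearson_r` and `describe` and the data are hypothetical, and applying the same strength bands to negative values through $|r|$ is an assumption.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for paired data (standard formula)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    if denominator == 0:
        raise ValueError("r is undefined when all x or all y values are equal")
    return numerator / denominator

def describe(r):
    """Put r into the strength bands of the table (assumed to apply to |r|)."""
    if r == 0:
        return "no linear correlation"
    direction = "positive" if r > 0 else "negative"
    size = abs(r)
    if math.isclose(size, 1.0):
        strength = "perfect"
    elif size >= 0.8:
        strength = "strong"
    elif size >= 0.5:
        strength = "moderate"
    else:
        strength = "weak"
    return f"{strength} {direction} correlation"

# Made-up data: x = advertising spend, y = units sold
spend = [1, 2, 3, 4, 5]
sales = [2, 4, 5, 4, 5]
r = pearson_r(spend, sales)
print(round(r, 2), describe(r))
```

For this made-up data set, $r$ is about $0.77$, which falls in the moderate positive band. In the exam the calculator produces $r$; the sketch only shows what the number summarises.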
The Least Squares Regression Line
$$\hat{y} = a + bx$$
Where:
- $b = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2}$ (gradient)
- $a = \bar{y} - b\bar{x}$ (y-intercept)
⚠️ You don’t calculate $a$ and $b$ by hand in the exam — your calculator does it. But you MUST know the formula exists and understand what $a$ and $b$ mean.
The regression line ALWAYS passes through the point $(\bar{x};\, \bar{y})$.
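The two formulas above can be checked with a short Python sketch. It is only an illustration of what the calculator does internally. The function name `least_squares` and the data are made up for this example.

```python
def least_squares(xs, ys):
    """Return (a, b) for the line y_hat = a + b*x using the formulas above."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # gradient
    a = sum_y / n - b * (sum_x / n)                               # y-intercept
    return a, b

# Made-up data, not taken from any exam paper
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = least_squares(xs, ys)

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

print("a =", round(a, 4), " b =", round(b, 4))
# Substituting x_bar into the line should give back y_bar
print("y_hat at x_bar:", round(a + b * x_bar, 4), " y_bar:", y_bar)
```

Here $b = 0.6$ and $a \approx 2.2$, and substituting $\bar{x} = 3$ gives $\hat{y} = 4 = \bar{y}$, which confirms that the line passes through $(\bar{x};\, \bar{y})$ for this data set.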
Interpolation vs Extrapolation
| Type | Definition | Reliability |
|---|---|---|
| Interpolation | Predicting WITHIN the range of the data | ✅ Generally reliable, provided the correlation is strong |
| Extrapolation | Predicting OUTSIDE the range of the data | ⚠️ Unreliable — the trend may not continue |
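
The distinction in the table can be sketched in Python as a prediction that also reports which kind it is. The function name `predict`, the values of $a$ and $b$, and the data range are hypothetical and chosen only for illustration.

```python
def predict(a, b, x, x_values):
    """Return (y_hat, kind) for the line y_hat = a + b*x.

    kind is "interpolation" when x lies inside the observed x range,
    otherwise "extrapolation".
    """
    y_hat = a + b * x
    inside = min(x_values) <= x <= max(x_values)
    kind = "interpolation" if inside else "extrapolation"
    return y_hat, kind

# Assumed regression line y_hat = 2.2 + 0.6x fitted on x values from 1 to 5
observed_x = [1, 2, 3, 4, 5]
for x in (3.5, 10):
    y_hat, kind = predict(2.2, 0.6, x, observed_x)
    print(x, round(y_hat, 2), kind)
```

A prediction at $x = 3.5$ lies inside the data range, so it counts as interpolation. A prediction at $x = 10$ lies well outside it and should be treated with caution.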
Deep Dives
- Scatter Plots & Bivariate Data — plotting, interpreting, outliers, and revision of univariate data concepts
- Regression & Correlation — the Least Squares Regression Line, the $r$-value, interpolation vs extrapolation
🚨 Common Mistakes
- Confusing correlation and causation: Strong $r$ does NOT prove one variable causes the other. State that there is “a strong positive/negative correlation” — never say “causes”.
- Extrapolating too far: Using the regression line to predict values far beyond the data range is unreliable.
- Ignoring outliers: One outlier can drastically change $r$ and the regression line. Identify and comment on outliers.
- Wrong calculator mode: Make sure your calculator is in STAT (regression) mode with the correct data entered. Double-check by confirming that the calculator's $\bar{x}$ and $\bar{y}$ match the means of your data.
- Not interpreting $r$ in context: Don’t just state $r = 0.85$. Say “There is a strong positive correlation between $x$ and $y$”, naming the actual variables (for example, advertising spend and sales).
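The warning about outliers can be illustrated with a small Python sketch. The data is made up, and the helper `correlation` is a hypothetical name. It uses the deviations-from-the-mean form of the standard Pearson formula, which the notes themselves do not state.

```python
import math
from statistics import mean

def correlation(xs, ys):
    """Pearson r written in terms of deviations from the means."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Six made-up points lying exactly on the line y = x + 1
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 3, 4, 5, 6, 7]
r_clean = correlation(xs, ys)

# Add a single outlier at (7; 1) and recompute
r_outlier = correlation(xs + [7], ys + [1])

print("without outlier:", round(r_clean, 2))
print("with outlier:   ", round(r_outlier, 2))
```

With these numbers, one extra point pulls $r$ from $1$ down to $0.25$, which is why outliers should be identified and commented on rather than ignored.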
🔗 Related topics:
- Probability — statistics and probability are complementary branches of data science
📌 Grade 11 foundation: Statistics: Standard Deviation — measures of spread for univariate data
