NC State University
ST 511 - Fall 2025
2025-10-01
– HW-3 due October 5th
– Are you keeping up with the prepare material?
– Quiz (released Thursday, due Sunday)
– AE keys are posted
– Homework-3 key will be posted before the exam
We will finish the single_mean repo today. Please clone this if you haven’t!
in-class (October 8th)
– Statistical concepts
– Maybe one coding question
You may bring a hand-written 8x11 note sheet to the exam
Use this for concepts and equations (e.g., SE(\(\hat{p}\)))
take-home (due October 17th)
– Statistical concepts
– Can you apply what we learned in class?
Extension Question: Part of being a researcher/statistician is being able to take concepts and apply them in a new setting. There will be an extension question on the exam that tests your critical thinking skills. I will give you the material you need with the question so you can be successful.
We want to look at penguin body mass for the Gentoo species.
\(H_o: \mu = 6500\)
\(H_a: \mu \neq 6500\)
```
# A tibble: 1 × 2
  mean_bodymass count
          <dbl> <int>
1         5076.   123
```
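A minimal dplyr sketch that could produce this summary, assuming the palmerpenguins data (the exact code isn't shown on the slide, so the pipeline below is my reconstruction):

```r
library(dplyr)
library(palmerpenguins)

# Summarize Gentoo body mass (in grams), dropping missing values
penguins |>
  filter(species == "Gentoo", !is.na(body_mass_g)) |>
  summarize(mean_bodymass = mean(body_mass_g),
            count = n())
```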
Assumptions?
```
# A tibble: 1 × 1
  count
  <int>
1   123
```
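A quick sketch of how these conditions might be checked, again assuming palmerpenguins; the n ≥ 30 rule of thumb and the histogram are my additions, not from the slide:

```r
library(palmerpenguins)

gentoo <- subset(penguins, species == "Gentoo" & !is.na(body_mass_g))

nrow(gentoo) >= 30   # large-sample condition: n = 123
hist(gentoo$body_mass_g,
     main = "Gentoo body mass", xlab = "body mass (g)")  # shape check
```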
What is:
– a t-distribution?
– degrees of freedom?
A t-distribution is similar to a normal distribution, with a little more uncertainty!
– there are infinitely many t-distributions vs. one normal distribution
– the t-distribution approaches a normal distribution as df goes up!
We use a t-distribution when the population standard deviation is unknown.
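A base-R sketch (mine, not from the slides) that draws the standard normal density against t densities; as df grows, the heavier tails of the t shrink toward the normal:

```r
# Standard normal vs. t densities with increasing degrees of freedom
curve(dnorm(x), from = -4, to = 4, lwd = 2, ylab = "density")
curve(dt(x, df = 2),  add = TRUE, lty = 2)
curve(dt(x, df = 10), add = TRUE, lty = 3)
curve(dt(x, df = 30), add = TRUE, lty = 4)
legend("topright",
       legend = c("normal", "t, df = 2", "t, df = 10", "t, df = 30"),
       lty = c(1, 2, 3, 4))
```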
Let’s explain why we used a standard normal for categorical data, and why we will use a t-distribution for quantitative data!
For categorical data, our statistic (or assumption in hypothesis testing) is completely linked to the standard error.
We cannot change one without changing the other.
\(\hat{p} = \frac{7}{12}\)
\(SE(\hat{p}) = \sqrt{\frac{\frac{7}{12} \left(1 - \frac{7}{12}\right)}{12}} = 0.142\)
\(\hat{p} = \frac{2}{12}\)
\(SE(\hat{p}) = \sqrt{\frac{\frac{2}{12} \left(1 - \frac{2}{12}\right)}{12}} = 0.108\)
When we change the proportion, the standard error changes!
They are “linked together”. When we know one, we have a lot of information about the other.
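A tiny sketch (my own, not the slide's) that verifies the two standard errors above and makes the "linked" point concrete: change \(\hat{p}\) and the SE changes with it:

```r
# SE of a sample proportion is a function of p-hat itself
se_phat <- function(phat, n) sqrt(phat * (1 - phat) / n)

se_phat(7/12, 12)  # 0.142
se_phat(2/12, 12)  # 0.108
```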
We use a Z-distribution for hypothesis tests / confidence intervals for proportions
n = 5
0, 5, 10, 15, 20
\(\bar{x} = \frac{0 + 5 + 10 + 15 + 20}{5} = \frac{50}{5} = 10\)
\(SE_{\bar{x}} = \frac{s}{\sqrt{n}}\)
\(SE_{\bar{x}} = \frac{\sqrt{\frac{\sum_{i=1}^{5} (x_i - 10)^2}{5-1}}}{\sqrt{5}}\)
\(SE_{\bar{x}} = \sqrt{\frac{\frac{250}{4}}{5}} \approx 3.5355\)
n = 5
8, 9, 10, 11, 12
\(\bar{x} = \frac{8 + 9 + 10 + 11 + 12}{5} = \frac{50}{5} = 10\)
\(SE_{\bar{x}} = \frac{\sqrt{\frac{\sum_{i=1}^{5} (x_i - 10)^2}{5-1}}}{\sqrt{5}}\)
\(SE_{\bar{x}} = \frac{\sqrt{\frac{10}{5-1}}}{\sqrt{5}} = 0.707\)
Two identical means have entirely different standard error calculations. These are not “linked”, and we need to account for the additional uncertainty by using a t-distribution!
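A short sketch (my own) that reproduces both standard errors and shows that equal means say nothing about the SE of a mean:

```r
# SE of a sample mean depends on the spread, not the mean
se_xbar <- function(x) sd(x) / sqrt(length(x))

se_xbar(c(0, 5, 10, 15, 20))  # 3.536
se_xbar(c(8, 9, 10, 11, 12))  # 0.707
```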
Why t vs z
– Short answer
t-distribution for quantitative response
z-distribution for categorical response
– Long answer
the sample proportion is linked to the standard error
the sample mean is not linked to the standard error, and the t-distribution accounts for the additional uncertainty in our estimation
there are infinitely many t-distributions… and their shape depends on the degrees of freedom…
so what are degrees of freedom?
Degrees of freedom for a t-distribution refer to the number of independent pieces of information that are available to estimate a parameter.
If our sample size is 124, our degrees of freedom would be 123.
For example, if you have n = 124, once you know the values of 123 observations, the 124th observation must be whatever value makes the sample mean equal \(\bar{x}\).
We have 123 independent pieces of information, and one dependent piece of information.
Degrees of freedom are calculated as n-1.
Note: s is calculated using \(\bar{x}\), and that constraint on the mean is why the degrees of freedom are n-1.
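A concrete sketch (mine) of the constraint: with n = 5 and a known sample mean, only four observations are free to vary; the last one is forced:

```r
x_free <- c(8, 9, 10, 11)         # 4 freely chosen observations
xbar   <- 10                      # the known sample mean, n = 5
x_last <- 5 * xbar - sum(x_free)  # forced to be 12: not free to vary
mean(c(x_free, x_last))           # 10, as required
```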
Degrees of freedom go with t-distributions and control their shape
For the single mean case, they are calculated to be n-1
This is not necessary for categorical data.
If the hypothesis test leads you to reject the null hypothesis, it means that the distance between \(\bar{x}\) and \(\mu_o\) is greater than the margin of error defined by the 95% CI.
\[ \left| \frac{\bar{x} - \mu_o}{SE} \right| > t^* \]
\[ |\bar{x} - \mu_o| > t^* \times SE \]
This inequality states that the distance from the center of the interval to the null value \(\mu_o\) is greater than the margin of error. If the distance to \(\mu_o\) is longer than the margin of error around the sample mean, the confidence interval must exclude \(\mu_o\).
These two equations are mathematically equivalent, proving that rejecting \(H_o\) is the same as the confidence interval excluding the null value \(\mu_o\).
– For a two-sided test where \(\alpha\) “matches up” with our confidence level
ex. \(\alpha\) = 0.05 and a 95% confidence interval
Small numerical differences (e.g., from rounding \(t^*\) or the SE) are usually too small to break the equivalence.
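A sketch (my own, assuming palmerpenguins) showing the equivalence on the Gentoo example: t.test() at \(\alpha\) = 0.05 rejects exactly when the 95% CI excludes 6500:

```r
library(palmerpenguins)

gentoo_mass <- penguins$body_mass_g[penguins$species == "Gentoo"]
gentoo_mass <- gentoo_mass[!is.na(gentoo_mass)]

tt <- t.test(gentoo_mass, mu = 6500, conf.level = 0.95)
tt$p.value < 0.05   # TRUE: reject H_o
tt$conf.int         # the 95% CI excludes 6500
```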