NC State University
ST 511 - Fall 2025
2025-10-22
– Exam-1 (in-class) is posted
> Solutions on Moodle
> Regrade requests open Wednesday after class (until Sunday)
– I’m grading your take-homes now
– Quiz released Wednesday (due Sunday)
– Homework released Friday (due next Friday)
– Statistics experience released (we will talk about it)
Results from an experiment to compare yields (as measured by dried weight of plants) obtained under a control and two different treatment conditions.
Why can we not answer this question using a simple difference in means?
# A tibble: 3 × 3
group count mean
<fct> <int> <dbl>
1 ctrl 10 5.03
2 trt1 10 4.66
3 trt2 10 5.53
Where did all of these values come from? What do they mean?
Df Sum Sq Mean Sq F value Pr(>F)
group 2 3.766 1.8832 4.846 0.0159 *
Residuals 27 10.492 0.3886
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The Sum of Squares (SS) is a measure of total variability (between or within)
The Mean Square (MS) is an estimate (\(s^2\)) of the population variance (\(\sigma^2\))
It’s a generalized form of \(s^2 = \frac{\sum(x_i - \bar{x})^2}{n-1}\)
Squared deviations divided by degrees of freedom!
\(F = \frac{\text{MS}_{\text{Between}}}{\text{MS}_{\text{Within}}}\)
\(\mathbf{F} = \frac{\frac{\text{SS}_{\text{Between}}}{k - 1}}{\frac{\text{SS}_{\text{Within}}}{N - k}}\)
\(\text{SS}_{\text{Between}} = \sum_{i=1}^{k} n_i (\bar{X}_i - \bar{X}_{\text{grand}})^2\)
\(\text{SS}_{\text{Within}} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_i)^2\)
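The sums of squares and the F statistic can be reproduced by hand. A minimal sketch using the built-in PlantGrowth data (the variable names here are illustrative):

```r
# Manual one-way ANOVA on the built-in PlantGrowth data (k = 3 groups, N = 30)
grand_mean  <- mean(PlantGrowth$weight)
group_means <- tapply(PlantGrowth$weight, PlantGrowth$group, mean)
n_i <- tapply(PlantGrowth$weight, PlantGrowth$group, length)
k <- length(group_means)
N <- nrow(PlantGrowth)

# Between: group sizes times squared deviations of group means from the grand mean
ss_between <- sum(n_i * (group_means - grand_mean)^2)                       # ~3.766
# Within: squared deviations of each observation from its own group mean
ss_within  <- sum((PlantGrowth$weight - group_means[PlantGrowth$group])^2)  # ~10.492

f_stat <- (ss_between / (k - 1)) / (ss_within / (N - k))                    # ~4.846
pf(f_stat, k - 1, N - k, lower.tail = FALSE)                                # ~0.0159
```

These values match the Df, Sum Sq, Mean Sq, F value, and Pr(>F) columns in the aov() table above.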
The main idea of Tukey’s HSD is to compute the honestly significant difference (HSD) between all pairwise comparisons
Why do we need this thing called Tukey’s HSD? Why can’t we just conduct a bunch of individual hypothesis tests?
\(\alpha\) is our significance level. It’s also our Type 1 error rate.
A Type 1 error is:
– rejecting the null hypothesis
– when the null hypothesis is actually true
FWER = \(1 - (1-\alpha)^m\)
where m is the number of comparisons
“k choose 2” = \(\frac{k(k-1)}{2}\) (write this down)
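A quick check of the inflation: with \(k = 3\) groups there are 3 pairwise comparisons, and the family-wise error rate at \(\alpha = 0.05\) is already about 14%:

```r
# FWER for m independent tests, each conducted at alpha = 0.05
alpha <- 0.05
m <- 3 * (3 - 1) / 2          # "k choose 2" pairwise comparisons for k = 3
fwer <- 1 - (1 - alpha)^m
fwer                          # 0.142625 -- nearly triple the nominal 5%
```

(The independence assumption here is a simplification; pairwise tests on shared data are correlated, but the error rate still inflates as m grows.)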
– Why we use Tukey’s HSD
– What distribution we use to account for the inflated Type 1 error rate
– How to read output
We commonly report confidence intervals to estimate which means are actually different (the output also reports p-values).
Tukey’s HSD (technically Tukey–Kramer with unequal sample sizes)
\(\bar{x}_i - \bar{x}_j \pm \frac{q^*}{\sqrt{2}} \sqrt{\text{MSE}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}\)
Where
– \(q^*\) is a value from the studentized range distribution
– MSE is the sum of the squared differences between individual data points and their respective group means, divided by the residual degrees of freedom. It is the variance estimated by pooling across all the groups.
\[\text{MSE} = \frac{\sum_{i=1}^{k} (n_i - 1)s_i^2}{N - k}\]
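As a sketch, this pooled form of MSE can be checked against the Residuals Mean Sq (0.3886) from the ANOVA table:

```r
# MSE as a pooled variance: sum of (n_i - 1) * s_i^2 over groups, divided by N - k
s2  <- tapply(PlantGrowth$weight, PlantGrowth$group, var)   # per-group variances
n_i <- tapply(PlantGrowth$weight, PlantGrowth$group, length)
mse <- sum((n_i - 1) * s2) / (sum(n_i) - length(s2))
mse   # ~0.3886, matching the Residuals Mean Sq
```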
\(q^*\) can be found using the qtukey function in R
Typically, we use pre-packaged functions/applets to do the above calculations for us.
\[Q = \frac{\bar{x}_i - \bar{x}_j}{\sqrt{\frac{MSE}{2} \left(\frac{1}{n_i} + \frac{1}{n_j}\right)}}\]
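As a check on the interval formula, here is the trt2 − trt1 comparison for PlantGrowth computed by hand; it should reproduce the TukeyHSD() interval up to rounding:

```r
# Hand-computed Tukey interval for trt2 - trt1 in PlantGrowth (n = 10 per group)
fit <- aov(weight ~ group, data = PlantGrowth)
mse <- deviance(fit) / df.residual(fit)       # residual SS / (N - k) = MSE
means <- tapply(PlantGrowth$weight, PlantGrowth$group, mean)
q_star <- qtukey(0.95, nmeans = 3, df = df.residual(fit))
half_width <- q_star / sqrt(2) * sqrt(mse * (1/10 + 1/10))
diff <- unname(means["trt2"] - means["trt1"])
c(lower = diff - half_width, upper = diff + half_width)   # ~(0.174, 1.556)
```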
anova_model <- aov(weight ~ group, data = PlantGrowth)
plant_tukey <- TukeyHSD(anova_model)
plant_tukey
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = weight ~ group, data = PlantGrowth)
$group
diff lwr upr p adj
trt1-ctrl -0.371 -1.0622161 0.3202161 0.3908711
trt2-ctrl 0.494 -0.1972161 1.1852161 0.1979960
trt2-trt1 0.865 0.1737839 1.5562161 0.0120064
Tukey’s Honest Significant Difference (HSD) test is a post hoc test commonly used to assess the significance of differences between pairs of group means. Tukey HSD is often a follow up to one-way ANOVA, when the F-test has revealed the existence of a significant difference between some of the tested groups.
Tukey’s HSD is best for making all possible pairwise comparisons, while the Bonferroni correction is more general and suitable for a small number of pre-planned comparisons.
You conduct individual t-tests, but divide the original significance level \(\alpha\) by the number of tests being performed (\(m\)).
What does this do in terms of evidence to reject the null hypothesis?
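For comparison, base R’s pairwise.t.test applies the Bonferroni correction directly (a sketch on the same PlantGrowth data):

```r
# Pairwise t-tests with Bonferroni-adjusted p-values (pooled SD by default)
res <- pairwise.t.test(PlantGrowth$weight, PlantGrowth$group,
                       p.adjust.method = "bonferroni")
res$p.value
# Each unadjusted p-value is multiplied by m = 3 (capped at 1), so stronger
# evidence is required to reject; here only trt2 vs trt1 stays below 0.05,
# agreeing with the TukeyHSD output above.
```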
There is (some) debate about whether an ANOVA is necessary before using Tukey’s HSD to find which means are different.
Logical guardrail
Applying a global test first is pretty solid protection against the risk of (even inadvertently) uncovering spurious “significant” results from post-hoc data snooping.
It is known that a global ANOVA F test can detect a difference of means even in cases where no individual test of any of the pairs of means will yield a significant result. In other words, in some cases the data can reveal that the true means likely differ but it cannot identify with sufficient confidence which pairs of means differ.
Df Sum Sq Mean Sq F value Pr(>F)
factor(group) 2 11.132 5.566 11.3 0.0401 *
Residuals 3 1.478 0.493
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = y ~ factor(group))
$`factor(group)`
diff lwr upr p adj
2-1 2.90306081 -0.03000355 5.836125 0.0513476
3-1 2.87565627 -0.05740809 5.808721 0.0526193
3-2 -0.02740454 -2.96046891 2.905660 0.9991602