Intro to Regression

Dr. Elijah Meyer

NC State University
ST 511 - Fall 2025

Checklist

– Quiz is due Tuesday (see Moodle)

– Homework later this week

– Statistics experience released

– Expect take-home grades by this Wednesday

– Final Exam is Dec 8th at 3:30 (expect an email this afternoon)

Warm up

A company HR department wants to know if the type of coffee an employee prefers (Black or With Milk/Cream) is independent of their primary method of commuting (Car or Public Transit).

What are \(H_0\) and \(H_a\)?

Assuming the assumptions are met, what’s my statistic and distribution?

Warm up

A veterinary nutritionist wants to know if a new vitamin supplement increases the average distance a dog can run on a treadmill before tiring. They decide to test two different breeds: Border Collies and Beagles.

The goal is to test if the mean running distance of Border Collies is significantly different from that of Beagles.

What are \(H_0\) and \(H_a\)?

Assuming the assumptions are met, what’s my statistic and distribution?

Questions

Learning objectives

What can we do when working with two quantitative variables?

– Summary statistics?

– What is simple linear regression?

– How a line of best fit is made

– How to talk about the line of best fit

We are reviewing exploratory data analysis

We will move to statistical inference in the following lesson

Airquality

We are going to look at daily air quality measurements in New York from May through September in 1973. Specifically, we are going to look at the relationship between wind (mph) and temperature (F).

– Can I analyze these data using difference in means?

– Difference in proportions?

– What kind of plot can we make?

Plot the data
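
A scatterplot is the natural display for two quantitative variables. The following is a minimal sketch, assuming the built-in `airquality` data and the ggplot2 package. Putting wind on the x-axis and temperature on the y-axis is an assumption, and the object name `wind_temp_plot` and the labels are illustrative choices.

```r
library(ggplot2)

# Scatterplot of temperature against wind speed (built-in airquality data)
wind_temp_plot <- ggplot(airquality, aes(x = Wind, y = Temp)) +
  geom_point() +
  labs(
    x = "Wind (mph)",
    y = "Temperature (F)",
    title = "Daily wind and temperature, New York, May to September 1973"
  )

wind_temp_plot
```

Adding `geom_smooth(method = "lm", se = FALSE)` to the plot would overlay a straight line of best fit.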

How can we summarize these data?

Summary statistics

– correlation (r)

– slope + intercept (fit a line)

Correlation

– Is bounded between -1 and 1, inclusive

– Measures the strength + direction of a linear relationship

What do I mean by linear relationship?

What do I mean by strength?

What do I mean by direction?

Guessing Game

Applet

Let’s find the correlation coefficient between our two variables

syntax: cor(x, y)

library(dplyr)

# Correlation between wind speed and temperature
airquality |>
  summarise(corr = cor(Wind, Temp, use = "complete.obs"))
        corr
1 -0.4579879
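
To see where this number comes from, the correlation can also be computed by hand as the sum of the products of the z-scores divided by n - 1. This is a sketch in base R using the built-in `airquality` data; the variable names (`z_wind`, `z_temp`, `r_by_hand`, `r_builtin`) are illustrative.

```r
# Pull out the two quantitative variables
wind <- airquality$Wind
temp <- airquality$Temp
n <- length(wind)

# Standardize each variable: subtract the mean, divide by the standard deviation
z_wind <- (wind - mean(wind)) / sd(wind)
z_temp <- (temp - mean(temp)) / sd(temp)

# Correlation: sum of the products of z-scores, divided by n - 1
r_by_hand <- sum(z_wind * z_temp) / (n - 1)

# Compare with the built-in function
r_builtin <- cor(wind, temp)
c(by_hand = r_by_hand, builtin = r_builtin)
```

The negative sign gives the direction (windier days tend to be cooler), and the distance from 0 gives the strength of the linear relationship.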

Summary statistics

– correlation (r) ✔

– slope + intercept (fit a line)

[Plot: scatterplot of the data with a fitted line]

How do we suppose that this line was fit?

Residual

\(e_i = y_i - \hat{y}_i\)

where \(y_i\) is an observed value and \(\hat{y}_i\) is the value predicted by the line.

Minimize the residual sum of squares: \(\sum (y_i - \hat{y}_i)^2\)
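
The idea of minimizing the residual sum of squares can be checked numerically. This is a minimal sketch, assuming temperature is the response and wind is the explanatory variable; `rss_line` is a hypothetical helper written for this illustration, not a built-in function.

```r
# Fit the least-squares line for temperature as a function of wind
fit <- lm(Temp ~ Wind, data = airquality)

# Residuals: observed minus predicted, e_i = y_i - yhat_i
e <- airquality$Temp - fitted(fit)

# Residual sum of squares for the least-squares line
rss_fit <- sum(e^2)

# Hypothetical helper: RSS for any candidate intercept b0 and slope b1
rss_line <- function(b0, b1) {
  sum((airquality$Temp - (b0 + b1 * airquality$Wind))^2)
}

# Nudging the slope or the intercept away from the fitted values raises the RSS
b <- coef(fit)
rss_fit
rss_line(b[1], b[2] + 0.5)
rss_line(b[1] + 2, b[2])
```

Any other line through the data gives a larger residual sum of squares than the one `lm()` returns.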

Residual

[Plot: scatterplot of the data with a fitted line]

What can we do with this line?

Why do you suppose we fit a line?

– Prediction

– Interpretation

– Hypothesis testing to test for a relationship (inference)
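
As an illustration of prediction, here is a sketch that uses a fitted line to predict temperature for a given wind speed. It assumes temperature is the response and wind is the explanatory variable; the 10 mph value and the names `new_day`, `predicted_temp`, and `by_hand` are made up for the example.

```r
# Fit the line: temperature as a function of wind
fit <- lm(Temp ~ Wind, data = airquality)

# Hypothetical new day with a wind speed of 10 mph
new_day <- data.frame(Wind = 10)

# Predicted temperature from the fitted line
predicted_temp <- predict(fit, newdata = new_day)
predicted_temp

# The same prediction by hand: intercept + slope * x
by_hand <- coef(fit)[["(Intercept)"]] + coef(fit)[["Wind"]] * 10
by_hand
```

Predictions are most trustworthy for wind speeds inside the range observed in the data.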

The equation

Have you heard of \(y = mx + b\) ?

The equation

Let me introduce you to:

Population level: \(y = \beta_0 + \beta_1 x + \epsilon\)

Sample: \(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x\)

The equation

\(\hat{y}\) (yhat) = predicted value of y

\(\hat{\beta}_0\) (b0) = estimated intercept

\(\hat{\beta}_1\) (b1) = estimated slope

\(x\) = explanatory variable

Terms

– What is an intercept?

– What is a slope coefficient?

In-R
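
One way to get the estimated intercept and slope in R is with `lm()`. This is a minimal sketch, assuming temperature is the response and wind is the explanatory variable; the object names (`fit`, `b0`, `b1`, `slope_check`, `intercept_check`) are illustrative.

```r
# Assumption: temperature is the response and wind is the explanatory variable
fit <- lm(Temp ~ Wind, data = airquality)

# Estimated intercept and slope
b0 <- coef(fit)[["(Intercept)"]]
b1 <- coef(fit)[["Wind"]]
c(intercept = b0, slope = b1)

# The least-squares estimates also follow from the summary statistics:
# slope = r * sd(y) / sd(x), intercept = mean(y) - slope * mean(x)
r <- cor(airquality$Wind, airquality$Temp)
slope_check <- r * sd(airquality$Temp) / sd(airquality$Wind)
intercept_check <- mean(airquality$Temp) - slope_check * mean(airquality$Wind)
c(intercept = intercept_check, slope = slope_check)
```

With these data the fitted line is roughly \(\hat{y} = 90.13 - 1.23x\), so the sign of the slope matches the sign of the correlation.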

How do we interpret the intercept? How do we interpret the slope coefficient?

Why mean response?

The phrase "expected value" is a synonym for the mean value in the long run (that is, over many repeats or a large sample).

Questions?

Looking forward

What happens when we add other variables to the model?

Looking forward

What would a hypothesis test look like?