HW 1 - Summary Stats + Data visualization

Homework

How to clone your repo

Clone your homework-1 repo exactly as we have been cloning AEs! Please see Moodle for more information.

How to turn your HW in

Homework is turned in via Gradescope. You can find the Gradescope HW-1 button on our Moodle page under Week-2. Please remember to select your pages correctly when turning in your assignment. For more information, please see “Submit Homework on the Gradescope Website” on Moodle.

How to format your Homework

For each question (e.g., Question 1), put a level two (two pound signs) section header with the name of the question.

For questions with multiple parts (e.g., a, b, c), please put these labels in bold as normal text.

For example…

Question 1

Important

This homework is due Sunday, Sep 7 at 11:59pm.

You cannot earn more than 100% on this assignment.

Important

You will need to have at least 3 (meaningful) commits by the end of your homework assignment. Please practice proper version control techniques by committing and pushing after each answered question.

Packages

Start your document by making a Packages header and copying the code below into a code chunk in your .qmd file.

Use message: false and warning: false as code chunk arguments for this code chunk so that package startup messages and warnings do not clutter your rendered document.

library(tidyverse)
library(openintro)
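In a Quarto document, chunk arguments can be written as comment lines that start with #| at the top of the chunk. The sketch below shows what the body of the Packages chunk might look like, assuming you use this syntax; it is one way to set the options, not the only one.

```r
#| message: false
#| warning: false

# The two lines above are Quarto chunk options; R itself reads them as comments.
library(tidyverse)
library(openintro)
```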

Tips

Remember that continuing to develop a sound workflow for reproducible data analysis is important as you complete this homework and other assignments in this course. There will be reminders in this assignment for you to Render your document. The last thing you want to do is work through an entire assignment before realizing that an error somewhere prevents your document from rendering. Render after each completed question.

Note

Please make sure to explain your output if the question asks you to do so.

Exercises

Data 1: Duke Forest houses

Note

Use the duke_forest dataset in the openintro package for Exercises 1 and 2.

For the following two exercises you will work with data on houses that were sold in the Duke Forest neighborhood of Durham, NC in November 2020. The duke_forest dataset comes from the openintro package. You can see a list of the variables on the package website or by running ?duke_forest in your console.

Exercise 1

We are going to first explore the duke_forest dataset by calculating a variety of summary statistics. Calculate the summary statistic associated with each scenario below. For this question, your code and its output are enough to earn full credit.

Create a data frame that displays the mean house price for each number of bedrooms.
Create a data frame that displays the minimum and maximum lot area, in acres. Name your columns min_lot and max_lot.
Create a data frame that gives the number of homes for each combination of cooling system AND number of bathrooms. You will receive a bonus point if you use the functions arrange() and desc() to put your data frame in descending order of count. Name your count column n_count.
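The three tasks above follow common dplyr patterns. The sketch below illustrates those patterns on mtcars, a dataset that ships with R, rather than on duke_forest. The object and column names (mean_by_cyl, wt_range, combo_counts, n_cars) are made up for illustration, so you will need to adapt the variables and names to the exercise.

```r
library(tidyverse)

# Mean of a numeric variable within each level of a grouping variable.
mean_by_cyl <- mtcars |>
  group_by(cyl) |>
  summarize(mean_mpg = mean(mpg))

# Minimum and maximum of one variable, with custom column names.
wt_range <- mtcars |>
  summarize(
    min_wt = min(wt),
    max_wt = max(wt)
  )

# Number of rows for each combination of two variables,
# sorted from the most common combination to the least common.
combo_counts <- mtcars |>
  count(cyl, gear, name = "n_cars") |>
  arrange(desc(n_cars))

mean_by_cyl
wt_range
combo_counts
```

Each result is itself a data frame, so printing it shows both your code and its output when you render.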

Commit and Push

Now would be a great time to save your work, commit, and push to GitHub.

Exercise 2

Usually, we expect that within any market, larger houses will have higher prices. We can also expect a relationship between the age of a house and its price. However, in some markets newer houses might be more expensive, while in others older houses have ‘more character’ than newer ones and sell for higher prices. In this question, we will explore the relationships among the age, size, and price of houses.

A family friend asks: “In Duke Forest, do houses that are bigger and more expensive tend to be newer than smaller and cheaper ones?”

Once again, data visualization skills to the rescue!

Create a scatter plot exploring the relationship between price and area that also displays information about year_built (that is, conditioning on year_built as your z variable).
Use size = 3 within the geom function that draws the scatter plot to make your points bigger.
Layer on geom_smooth() with the argument se = FALSE to add a smooth curve fit to the data, and color the points by year_built.
Include informative title, axis, and legend labels.
Discuss each of the following claims (1-2 sentences per claim). Use elements you observe in your plot as evidence for or against each claim.
- Claim 1: Larger houses are priced higher.
- Claim 2: Bigger and more expensive houses tend to be newer than smaller and cheaper ones.
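As a rough illustration of how these layers fit together, here is a sketch that uses mtcars (a dataset that ships with R) instead of duke_forest. The choice of variables, the object name scatter_sketch, and the label text are assumptions made for the example. Mapping color inside geom_point() colors only the points, so geom_smooth() still draws a single curve.

```r
library(tidyverse)

scatter_sketch <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  # Color is mapped to a third (z) variable for the points only.
  geom_point(aes(color = hp), size = 3) +
  # se = FALSE drops the shaded uncertainty band around the curve.
  geom_smooth(se = FALSE) +
  labs(
    title = "Fuel efficiency versus weight",
    x = "Weight (1000 lbs)",
    y = "Miles per gallon",
    color = "Horsepower"
  )

scatter_sketch
```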

Commit and Push

Now would be a great time to save your work, commit, and push to GitHub.

Data 2: BRFSS

Note

Use this dataset for Exercises 3 through 5.

The Behavioral Risk Factor Surveillance System (BRFSS) is the nation’s premier system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services. Established in 1984 with 15 states, BRFSS now collects data in all 50 states as well as the District of Columbia and three U.S. territories. BRFSS completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world.

Source: cdc.gov/brfss

In the following exercises we will work with data from the 2020 BRFSS survey. The data originally come from here, though we will work with a random sample of responses and a small number of variables from the data provided. These have already been sampled for you and the dataset you’ll use is called brfss.csv.

Copy this code into an R code chunk in your document to read in the data. After you run it, you will see an R object called brfss in your environment!

brfss <- read_csv("https://st511-01.github.io/data/brfss.csv")

Exercise 3

How many rows are in the brfss dataset? What does each row represent?
How many columns are in the brfss dataset?
Include the code and resulting output used to support your answer.

Write a sentence along with your code to answer each question.
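If you are unsure which functions report the size of a data frame, the sketch below shows a few base R options on mtcars, a dataset that ships with R. It is only an illustration; glimpse() from dplyr is another option that reports both counts along with a preview of each column.

```r
# mtcars is used here purely for illustration; swap in your own data frame.
nrow(mtcars)  # number of rows
ncol(mtcars)  # number of columns
dim(mtcars)   # rows and columns together, in that order
```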

Commit and Push

Now would be a great time to save your work, commit, and push to GitHub.

Exercise 4

Do people who smoke more tend to have worse health conditions?

Use a segmented bar chart to visualize the relationship between smoking (smoke_freq) and general health (general_health). Put smoke_freq on the x-axis.
- Below is sample code for releveling general_health. Here we first convert general_health to a factor (how R stores categorical data) and then order the levels from Excellent to Poor. The same is done to smoke_freq, with the ordering being from Not at all to Every day.
You will add to the existing pipeline (code) to make the segmented bar chart.

brfss |>
  mutate(
    general_health = as.factor(general_health),
    general_health = fct_relevel(general_health, "Excellent", "Very good", "Good", "Fair", "Poor"),
    smoke_freq = as.factor(smoke_freq),
    smoke_freq = fct_relevel(smoke_freq, "Not at all", "Some days", "Every day")
    ) # add a pipe here to start creating your bar chart

Include informative title, axis, and legend labels.
Comment on the motivating question based on evidence from the visualization: Do people who smoke more tend to have worse health conditions?
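To see what releveling does and how a segmented bar chart is built, here is a sketch on a small made-up dataset. The tibble toy, its columns shift and mood, and the label text are all hypothetical. By default R orders factor levels alphabetically; fct_relevel() moves the levels you list to the front in the order given. Using geom_bar() with position = "fill" is one common way to draw segments as proportions within each bar.

```r
library(tidyverse)

# Hypothetical toy data: two categorical variables stored as text.
toy <- tibble(
  shift = c("Night", "Day", "Day", "Night", "Evening", "Day", "Evening", "Night"),
  mood = c("Low", "High", "Medium", "Low", "Medium", "High", "High", "Medium")
)

toy_ordered <- toy |>
  mutate(
    shift = fct_relevel(as.factor(shift), "Day", "Evening", "Night"),
    mood = fct_relevel(as.factor(mood), "Low", "Medium", "High")
  )

levels(as.factor(toy$mood))  # default alphabetical order
levels(toy_ordered$mood)     # order set by fct_relevel()

bar_sketch <- toy_ordered |>
  ggplot(aes(x = shift, fill = mood)) +
  # position = "fill" stretches every bar to the same height,
  # so each segment shows a proportion rather than a count.
  geom_bar(position = "fill") +
  labs(
    title = "Mood by work shift",
    x = "Shift",
    y = "Proportion",
    fill = "Mood"
  )

bar_sketch
```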

Commit and Push

Now would be a great time to save your work, commit, and push to GitHub.

Exercise 5

How are sleep and general health associated?

Create a visualization displaying the relationship between sleep and general_health.
Include informative title and axis labels.
Modify your plot to use a different theme than the default.
Comment on the motivating question based on evidence from the visualization: How are sleep and general health associated?

brfss |>
  mutate(
    general_health = as.factor(general_health),
    general_health = fct_relevel(general_health, "Poor", "Fair", "Good", 
                                 "Very good",  "Excellent")
  ) # add a pipe here to start creating your chart

Commit and Push

Now would be a great time to save your work, commit, and push to GitHub.

Submission

Go to http://www.gradescope.com and click Log in in the top right corner (or click the Gradescope icon in Moodle).
Log in with your school credentials.
Click on your ST 511 course.
Click on the assignment, and you’ll be prompted to submit it.
Mark the pages associated with each exercise. Every page of your homework should be associated with at least one question (i.e., should be “checked”). If you do not do this, you may lose points on the assignment.
Do not select any pages of your PDF submission to be associated with the “Workflow & formatting” question.

Grading

Exercise 1: 6 points
Exercise 2: 10 points
Exercise 3: 4 points
Exercise 4: 10 points
Exercise 5: 9 points
Workflow & formatting: 5 points
Total: 44 points

Note

The “Workflow & formatting” grade assesses your reproducible workflow. This includes:

linking all pages appropriately on Gradescope
putting your name in the YAML at the top of the document
following each pipe (%>% or |>) and each ggplot layer (+) with a new line
being consistent with stylistic choices, e.g., %>% vs. |>
having appropriate section headers for each question