# STAT346: Statistical Data Science I

Problem Set #1 (10 Points)

Using the famous Galton data set from the mosaicData package:

library(mosaic)
data(Galton)

Use the ggplot2 package to answer the following:

(a) [5 points] Create a scatterplot of each person’s height against their father’s height.
(b) [5 points] Separate your plot into facets by sex. Add regression lines to all of your facets.
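
One possible approach to (a) and (b) is sketched below. It assumes the Galton columns are named father, height, and sex, as they are in mosaicData, and it uses facet_wrap with a linear smoother as one reasonable choice among several.

```r
library(ggplot2)
library(mosaicData)  # provides the Galton data set
data(Galton)

# (a) each person's height (y) against their father's height (x)
# (b) one panel per sex, with a least-squares line in every panel
p <- ggplot(Galton, aes(x = father, y = height)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  facet_wrap(~ sex) +
  labs(x = "Father's height", y = "Height")

print(p)
```

Dropping the facet_wrap() line gives the single scatterplot asked for in (a).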

Problem Set #2 (15 Points)

The file ranking.csv contains two columns:

• The ID of an item being rated.
• A rating, which is one of negative, positive, indifferent, or wtf (meaning the respondent
didn’t understand the question).

There are multiple ratings for each item. The plot below shows this data:

• Each dot represents one item i.
• The size of the circles shows the total number of ratings for item i.
• The X coordinate for item i is the percentage of ratings for that item that are negative.
• The Y coordinate for item i is the percentage of ratings for that item that are positive.
• The regression line is created using the lm method.

Re-create this plot using the tidyverse and ggplot2, fixing any mistakes you notice along the way.
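
The plot description above can be sketched as follows. Since ranking.csv is not reproduced here, the code builds a small made-up tibble in its place; the column names item and rating are assumptions and should be replaced with the real header names. The sketch does not attempt to identify the mistakes in the original plot.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for read_csv("ranking.csv"); values are invented.
ranking <- tibble(
  item   = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3),
  rating = c("negative", "positive", "positive", "wtf",
             "negative", "negative",
             "positive", "indifferent", "positive", "negative")
)

# One row per item: number of ratings and percentage negative / positive.
per_item <- ranking %>%
  group_by(item) %>%
  summarize(
    n_ratings    = n(),
    pct_negative = 100 * mean(rating == "negative"),
    pct_positive = 100 * mean(rating == "positive")
  )

p <- ggplot(per_item, aes(x = pct_negative, y = pct_positive)) +
  geom_point(aes(size = n_ratings), alpha = 0.6) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "% negative ratings", y = "% positive ratings",
       size = "Number of ratings")

print(p)
```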

Problem Set #3 (20 Points)

Read the file measurements.csv to create a tibble called measurements. (The strings rad, sal,
and temp in the quantity column stand for radiation, salinity, and temperature, respectively.)

(a) [5 points] Create a tibble containing only rows where none of the values are NA and save in a
tibble called cleaned.

(b) [5 points] Count the number of measurements of each type of quantity in cleaned. Your
result should have one row for each quantity rad, sal, and temp.

(c) [5 points] Display the minimum and maximum value of reading separately for each quantity
in cleaned. Your result should have one row for each quantity rad, sal, and temp.

(d) [5 points] Create a tibble in which all salinity (sal) readings greater than 1 are divided by
100. (This is needed because some people wrote percentages as numbers from 0.0 to 1.0, but
others wrote them as 0.0 to 100.0.)
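
Steps (a) through (d) might look like the sketch below. The real file is not available here, so a tiny invented tibble stands in for read_csv("measurements.csv"); only the quantity and reading columns named in the problem are included, and the real file may have more.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the contents of measurements.csv.
measurements <- tibble(
  quantity = c("rad", "sal", "sal", "temp", "temp", "rad", NA),
  reading  = c(9.8, 0.13, 41.6, -21.5, NA, 1.46, 3.2)
)

# (a) keep only rows with no NA in any column
cleaned <- measurements %>% drop_na()

# (b) number of measurements per quantity
counts <- cleaned %>% count(quantity)

# (c) minimum and maximum reading per quantity
ranges <- cleaned %>%
  group_by(quantity) %>%
  summarize(low = min(reading), high = max(reading))

# (d) salinity readings above 1 are treated as percentages and rescaled
rescaled <- cleaned %>%
  mutate(reading = if_else(quantity == "sal" & reading > 1,
                           reading / 100, reading))
```

if_else() is used rather than base ifelse() because it checks that both branches have the same type.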

Problem Set #4 (35 Points)

For this problem, we will be using the data from the survey collected by the United States National
Center for Health Statistics (NCHS). This center has conducted a series of health and nutrition
surveys since the 1960s. Starting in 1999, about 5,000 individuals of all ages have been interviewed
every year and they complete the health examination component of the survey.

Part of the data is made available via the NHANES package. Once you install the NHANES package,
you can load the data like this:

library(NHANES)
data(NHANES)

Let’s now explore the NHANES data.

(a) [5 points] We will provide some basic facts about blood pressure. First let’s select a group to
set the standard. We will use 20-to-29-year-old females. AgeDecade is a categorical variable
with these ages. Note that the category is coded like " 20-29", with a space in front! What
are the average and standard deviation of systolic blood pressure as saved in the BPSysAve
variable? Save them to a variable called ref.

(b) [5 points] Using a pipe, assign the average to a numeric variable ref_avg.

(c) [5 points] Now report the min and max values for the same group.
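
A minimal sketch for (a) through (c) follows. It assumes the NHANES package is installed and that Gender is coded as "female" and "male". The summary column names and the helper ref_group are illustrative choices, and na.rm = TRUE is included because BPSysAve contains missing values.

```r
library(dplyr)
library(NHANES)
data(NHANES)

# Reference group: females aged 20-29 (note the leading space in the level).
ref_group <- NHANES %>%
  filter(AgeDecade == " 20-29", Gender == "female")

# (a) average and standard deviation of systolic blood pressure
ref <- ref_group %>%
  summarize(average = mean(BPSysAve, na.rm = TRUE),
            standard_deviation = sd(BPSysAve, na.rm = TRUE))

# (b) pull the average out of the one-row table as a plain number
ref_avg <- ref %>% pull(average)

# (c) minimum and maximum for the same group
ref_range <- ref_group %>%
  summarize(minbp = min(BPSysAve, na.rm = TRUE),
            maxbp = max(BPSysAve, na.rm = TRUE))
```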

(d) [5 points] Compute the average and standard deviation for females, but for each age group
separately rather than a selected decade as in (a). Note that the age groups are defined by

(e) [5 points] Repeat (d) for males.

(f) [5 points] We can actually combine both summaries for (d) and (e) into one line of code. This
is because group_by permits us to group by more than one variable. Obtain one big summary
table using group_by(AgeDecade, Gender).
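
Grouping by two variables, as described in (f), could be sketched like this. The summary column names are illustrative, and the .groups argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)
library(NHANES)
data(NHANES)

# One row per combination of age decade and gender.
by_age_gender <- NHANES %>%
  group_by(AgeDecade, Gender) %>%
  summarize(average = mean(BPSysAve, na.rm = TRUE),
            standard_deviation = sd(BPSysAve, na.rm = TRUE),
            .groups = "drop")

print(by_age_gender)
```

Adding filter(Gender == "female") and grouping by AgeDecade alone gives the table for (d); swapping in "male" gives (e).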

(g) [5 points] For males between the ages of 40-49, compare systolic blood pressure across race
as reported in the Race1 variable. Order the resulting table from lowest to highest average
systolic blood pressure.
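
For (g), one way to build and order the table is sketched below. It assumes the age level is coded " 40-49" with a leading space, following the pattern noted in (a); the summary column names are illustrative.

```r
library(dplyr)
library(NHANES)
data(NHANES)

by_race <- NHANES %>%
  filter(Gender == "male", AgeDecade == " 40-49") %>%
  group_by(Race1) %>%
  summarize(average = mean(BPSysAve, na.rm = TRUE),
            standard_deviation = sd(BPSysAve, na.rm = TRUE)) %>%
  arrange(average)  # lowest average first

print(by_race)
```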