STAT 231 Assignment 4: Je ne regress rien

This is an assignment from Canada about understanding how user behaviour changes over time.

 

Analysis 1: Hashtags

Your dataset contains the variate hashtags.binary, which indicates whether a tweet does, or does not, contain at least one hashtag. In our first analysis we will look at the use of hashtags by accounts in your sample. A 2010 study found that 14% of English-language tweets contained at least one hashtag, and our goal is to investigate whether this has changed.

Let the random variable H be the number of tweets which contain at least one hashtag in a random sample of n tweets from our study population, and assume that H has a Binomial(n, θ) distribution.

In this analysis you should use your entire sample (that is, all five accounts) to explore the research question. (You may find it interesting to re-run your code on each account separately, to see if different accounts exhibit different behaviours!)

1a. [0.25 marks] To facilitate grading, please provide your 8-digit student ID number.

1b. [1 mark] Give the sample size, the number of tweets in your sample that contain at least one hashtag, and the maximum likelihood estimate of θ based on your sample.
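The three quantities in 1b can be computed in a few lines of R. The sketch below is illustrative only: the data frame name `tweets` is hypothetical, the data are simulated rather than taken from the course dataset, and hashtags.binary is assumed to be coded as 0/1.

```r
# Hypothetical stand-in for the course dataset: simulated 0/1 indicators.
set.seed(231)
tweets <- data.frame(hashtags.binary = rbinom(500, size = 1, prob = 0.2))

n <- nrow(tweets)                   # sample size
h <- sum(tweets$hashtags.binary)    # number of tweets with at least one hashtag
thetahat <- h / n                   # Binomial maximum likelihood estimate of theta

c(n = n, h = h, thetahat = thetahat)
```

If your copy of hashtags.binary is stored as TRUE/FALSE instead, sum() still counts the TRUE values, so the same code should apply.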

We’ll now explore two different methods for testing the null hypothesis H0 : θ = 0.14. Note that in real-world data analysis we would only use one test, but this is an opportunity for us to practice using different methods, and compare the results!

In the following, we refer to two tests, denoted Test A and Test B:

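Test A: Central Limit Theorem approximation with test statistic $D = \dfrac{|Y - E[Y \mid H_0 \text{ true}]|}{\sqrt{n\theta_0(1-\theta_0)}}$

Test B: Likelihood Ratio Test with test statistic $\Lambda(\theta_0) = -2 \log\left[\dfrac{L(\theta_0)}{L(\tilde{\theta})}\right]$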

Hint: The following R command will calculate the likelihood ratio test statistic for testing H0 : θ = theta0 for a sample with maximum likelihood estimate of θ equal to thetahat and sample size n.

> lambda <- (-2*log((theta0/thetahat)^(n*thetahat)*((1 - theta0)/(1 - thetahat))^(n - n*thetahat)))

Note: This command is also included in the R Tutorial code file. You may find it easier to copy from that file than from this PDF!
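To see where the command in the hint comes from, here is a short derivation from the Binomial likelihood. The notation is an assumption made for this sketch: $h$ denotes the observed number of tweets with at least one hashtag, and $\hat{\theta} = h/n$ is the maximum likelihood estimate. The Binomial coefficient is omitted because it cancels in the ratio.

```latex
\begin{align*}
L(\theta) &= \theta^{h}(1-\theta)^{n-h}, \qquad h = n\hat{\theta}, \\
\frac{L(\theta_0)}{L(\hat{\theta})}
  &= \left(\frac{\theta_0}{\hat{\theta}}\right)^{n\hat{\theta}}
     \left(\frac{1-\theta_0}{1-\hat{\theta}}\right)^{n-n\hat{\theta}}, \\
\lambda(\theta_0)
  &= -2\log\left[
     \left(\frac{\theta_0}{\hat{\theta}}\right)^{n\hat{\theta}}
     \left(\frac{1-\theta_0}{1-\hat{\theta}}\right)^{n-n\hat{\theta}}
     \right].
\end{align*}
```

The last line matches the R expression term by term, with theta0 and thetahat in place of $\theta_0$ and $\hat{\theta}$.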

1c. [4 marks] Calculate the observed value of each of the test statistics using Test A and Test B, and the resulting approximate p-values, of a test of H0 : θ = 0.14.
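One way the two calculations in 1c might look in R is sketched below. The values of n and h are hypothetical, chosen only for illustration. The sketch also assumes the usual large-sample approximations: a G(0, 1) distribution for Test A and a chi-squared distribution with 1 degree of freedom for Test B. The likelihood ratio statistic is written on the log scale, which is algebraically the same as the command in the hint.

```r
# Illustrative numbers only (hypothetical, not from the course dataset).
n <- 500          # sample size
h <- 90           # observed number of tweets with at least one hashtag
theta0 <- 0.14    # value of theta under H0
thetahat <- h / n # maximum likelihood estimate

# Test A: under H0, E[Y] = n * theta0 and Var[Y] = n * theta0 * (1 - theta0).
d <- abs(h - n * theta0) / sqrt(n * theta0 * (1 - theta0))
p.A <- 2 * (1 - pnorm(d))           # two-sided p-value from the G(0, 1) approximation

# Test B: likelihood ratio statistic, written on the log scale.
lambda <- 2 * (h * log(thetahat / theta0) +
               (n - h) * log((1 - thetahat) / (1 - theta0)))
p.B <- 1 - pchisq(lambda, df = 1)   # p-value from the chi-squared(1) approximation

c(d = d, p.A = p.A, lambda = lambda, p.B = p.B)
```

The log-scale form avoids raising numbers to large powers, which can underflow for big samples. It requires 0 < thetahat < 1.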

1d. [2 marks] In 1-2 sentences, summarize the conclusions of your hypothesis tests. Note that if both of your tests result in the same conclusion, you can combine this into a single answer (see the Layout Lowdown for an example).

1e. [2 marks] Were you surprised by how similar, or different, the results of your two tests were? Briefly explain why or why not in 1-2 sentences.

Analysis 2: Time Between Tweets Revisited

In Assignment 3 we looked at the distribution of tweet.gap.hour, a variate we created using tweet.gap to give the time in hours since a user’s previous tweet was published. In Analyses 3 and 4 of that assignment, we focused on the time between the first tweet a user sends on any given day, and their last tweet the preceding day. (This was the variate we called tgh.first.) We will now extend that analysis to test a hypothesis relating to the time a user spends ‘offline’ (that is, time between publishing tweets) from one day to the next.

The typical workday runs for an eight-hour period, and we wish to test the null hypothesis that, on average, users are offline for 16 hours. (Optional: Think about whether you agree that this is a suitable hypothesis. Would you test different null hypotheses for your organizational and personal accounts?)

In Assignment 3 you explored transformations of tgh.first in order to use a Gaussian model. As a reminder, the transformed variates are defined as:

• Square Root: $s_i = \sqrt{\cdots}$

Analysis 3: Length and Likes Revisited

In Assignment 2 we explored the relationship between length (the length of a tweet, measured in number of characters) and likes.log (the log-transformed number of likes a tweet received). We will now extend that analysis using linear regression. As a reminder, likes.log = log(likes + 1).

We shall treat length as the explanatory variate, and likes.log as the response variate. (Think about why this makes sense.) Assume the simple linear regression model

$$Y_i \sim G(\alpha + \beta l_i, \sigma), \quad i = 1, 2, \ldots, n$$

independently, where $l_i$ is the length of the $i$th tweet in a sample of n tweets.

In this analysis we recommend using the tweets from the account you chose in Assignment 2. However, if you would prefer to analyze a different account, or you did not complete Assignment 2, then you are welcome to choose any account from your dataset.

3a. [0.25 marks] To facilitate grading, please provide your 8-digit student ID number and the username of the account you will analyze.

3b. [2 marks] Give the least squares estimates of α and β, including a 95% confidence interval for each. Hint: Use the confint() command.

3c. [1 mark] Give an estimate of σ that results from fitting your linear model.
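The estimates asked for in 3b and 3c can all be read off a fitted lm object. The following is a minimal sketch under stated assumptions: the data frame `acct` is a hypothetical name, and its contents are simulated stand-ins rather than tweets from a real account.

```r
# Simulated stand-in data (hypothetical; replace with your chosen account).
set.seed(231)
acct <- data.frame(length = sample(20:280, 200, replace = TRUE))
acct$likes.log <- log(rpois(200, lambda = exp(1 + 0.004 * acct$length)) + 1)

# Fit the simple linear regression of likes.log on length.
fit <- lm(likes.log ~ length, data = acct)

est <- coef(fit)                   # least squares estimates: (Intercept) is alpha, length is beta
ci <- confint(fit, level = 0.95)   # 95% confidence intervals for alpha and beta
se <- summary(fit)$sigma           # estimate of sigma: sqrt(sum of squared residuals / (n - 2))

est
ci
se
```

The value stored in `se` is what summary(fit) prints as the "Residual standard error".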

3d. [2 marks] In 1-2 sentences, briefly discuss what the parameter α represents in the context of the study.

3e. [3 marks] Generate a scatterplot of likes.log vs. length, and add the fitted regression line corresponding to the model you fit in Analysis 3b. Hint: You should be able to reuse some of the code you wrote in Assignment 2!

3f. [4 marks] Based on your model, generate a scatterplot of the standardized residuals vs. the explanatory variate, and a Q-Q plot of the standardized residuals.
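The plots in 3e and 3f might be produced along the following lines. This sketch uses simulated data in a hypothetical data frame `acct`. It also assumes the standardized residual is defined as the residual divided by the estimate of σ. R's built-in rstandard() additionally adjusts for leverage, so it gives slightly different values.

```r
# Simulated stand-in data (hypothetical; replace with your chosen account).
set.seed(231)
acct <- data.frame(length = sample(20:280, 200, replace = TRUE))
acct$likes.log <- log(rpois(200, lambda = exp(1 + 0.004 * acct$length)) + 1)
fit <- lm(likes.log ~ length, data = acct)

# Scatterplot with the fitted regression line.
plot(acct$length, acct$likes.log, xlab = "length", ylab = "likes.log",
     main = "likes.log vs. length")
abline(fit, col = "red")

# Standardized residuals: residual divided by the estimate of sigma (assumed definition).
std.res <- resid(fit) / summary(fit)$sigma

# Standardized residuals vs. the explanatory variate.
plot(acct$length, std.res, xlab = "length", ylab = "standardized residual")
abline(h = 0, lty = 2)

# Q-Q plot of the standardized residuals against the Gaussian quantiles.
qqnorm(std.res)
qqline(std.res)
```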

3g. [5 marks] Based on the analyses carried out so far, write 3-4 sentences discussing whether the linear regression model assumptions are satisfied for your sample. Your answer should describe what those assumptions are, what you would expect to see in your results if the assumptions hold, and what you observe. Your answer should include an overall conclusion about the suitability of the linear regression model for your sample.

3h. [2 marks] Regardless of your conclusion in Analysis 3g, use your model to estimate the value of likes.log for a future tweet that is 200 characters long, including a 95% prediction interval.

Hint: Use the predict() command.
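A possible use of predict() for a single new tweet is sketched below, again with simulated data in a hypothetical data frame `acct`. The column name in `newdata` must match the explanatory variate used in the model formula.

```r
# Simulated stand-in data (hypothetical; replace with your chosen account).
set.seed(231)
acct <- data.frame(length = sample(20:280, 200, replace = TRUE))
acct$likes.log <- log(rpois(200, lambda = exp(1 + 0.004 * acct$length)) + 1)
fit <- lm(likes.log ~ length, data = acct)

# Point estimate and 95% prediction interval for a 200-character tweet.
new.tweet <- data.frame(length = 200)
pred <- predict(fit, newdata = new.tweet, interval = "prediction", level = 0.95)
pred   # columns: fit (estimate), lwr and upr (interval endpoints)
```

Setting interval = "confidence" instead would give an interval for the mean response at length 200, which is narrower than a prediction interval for one future tweet.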

3i. [2 marks] What p-value results from a test of the null hypothesis H0 : β = 0? What probability distribution was used to calculate this p-value?
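The p-value in 3i appears in the coefficient table of summary(). The sketch below, on simulated data in a hypothetical data frame `acct`, extracts it and then recomputes it by hand from a t distribution with n - 2 degrees of freedom, the reference distribution lm() uses for this test.

```r
# Simulated stand-in data (hypothetical; replace with your chosen account).
set.seed(231)
acct <- data.frame(length = sample(20:280, 200, replace = TRUE))
acct$likes.log <- log(rpois(200, lambda = exp(1 + 0.004 * acct$length)) + 1)
fit <- lm(likes.log ~ length, data = acct)

coefs <- summary(fit)$coefficients
p.value <- coefs["length", "Pr(>|t|)"]   # p-value for H0: beta = 0

# The same p-value by hand: two-sided tail area of a t distribution.
t.stat <- coefs["length", "t value"]
df <- df.residual(fit)                   # n - 2 for simple linear regression
p.manual <- 2 * pt(-abs(t.stat), df = df)

c(p.value = p.value, p.manual = p.manual, df = df)
```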

3j. [2 marks] Based on your results in Analysis 3i, write 1-2 sentences discussing what you conclude about the relationship between tweet length and log number of likes received. Note that this answer should be written in the context of the study, and not simply in terms of β.