What’s this assignment about?
This assignment covers the material up to and including Chapter 4, with a focus on interval estimation techniques. We will seek to model the tweet.gap variate, which measures the time (or ‘gap’) between the publication of tweets. More precisely, for a particular tweet, tweet.gap gives the number of seconds since the user’s previous tweet was published.
Data on how often users interact with a website, service, or product are valuable for several reasons. The regularity and reliability with which users return (sometimes referred to as ‘stickiness’) is a key metric for assessing product performance and for testing the effectiveness of new features and initiatives.
In addition to providing insights into how often users post tweets, the variate tweet.gap also provides an opportunity to explore some challenges commonly encountered in real-world data analysis. Many of you will find that tweet.gap contains some particularly large values, as a result of users not tweeting for several days, or even weeks. When working with real-world data it is common to encounter unusual behaviour such as this, which can make finding a suitable statistical model difficult.
In this assignment we will explore two approaches for modelling data with unusual distributions. One is to consider a subset of the data, narrowing the focus of our research question so as to facilitate meaningful analysis. The other is data transformation, which we have used previously (such as taking logs of the likes variate) and will now extend to other, more complex transformation procedures.
Before we begin
For the purposes of this assignment, the study population is defined as the set of tweets in the primary dataset from which you downloaded your sample at the start of term.
In this analysis we will include all of the data in your Twitter dataset (that is, all five accounts).
You may find it interesting to re-run your analyses on your personal and organizational accounts separately, while thinking about why we might expect these accounts to have different distributions for this variate.
Because tweet.gap is measured in seconds, we will convert it to hours to make our results easier to interpret. You should create the variate tweet.gap.hour, just as we created time.of.day.hour in Assignment 1:
> mydata$tweet.gap.hour <- mydata$tweet.gap / 3600
Analysis 1: Time Between Tweets and an Exponential Model
In Analyses 1 and 2 we will be exploring the distribution of tweet.gap.hour for tweets that are not the first tweet of the day. In the following, we refer to two sets of tweets, denoted Tweet Set A and Tweet Set B, defined as follows:
Tweet Set A: All tweets in your dataset.
Tweet Set B: Just tweets that are not the first tweet of the day. Note that these are the tweets for which first.tweet equals 0.
You may find it helpful to create an R object for each of these, such as by the following:
> tweet.set.A <- mydata$tweet.gap.hour
> tweet.set.B <- mydata$tweet.gap.hour[mydata$first.tweet == 0]
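As a minimal sketch of how these objects behave, the following uses a small made-up data frame to show the conversion to hours and the logical subsetting on first.tweet. The name toy and all of its values are hypothetical and simply stand in for mydata.

```r
# Hypothetical toy data standing in for mydata; the values are invented for illustration.
toy <- data.frame(
  tweet.gap   = c(5400, 900, 90000, 1800, 43200),  # seconds since the previous tweet
  first.tweet = c(1, 0, 1, 0, 1)                   # 1 = first tweet of the day
)

# Convert seconds to hours.
toy$tweet.gap.hour <- toy$tweet.gap / 3600

# Tweet Set A: every tweet. Tweet Set B: tweets that are not the first of the day.
toy.set.A <- toy$tweet.gap.hour
toy.set.B <- toy$tweet.gap.hour[toy$first.tweet == 0]

length(toy.set.A)  # number of tweets in Set A
length(toy.set.B)  # number of tweets in Set B
toy.set.B          # gaps (in hours) for the non-first tweets
```

Comparing the two lengths is a quick way to confirm that the subset picked up only the rows where first.tweet equals 0.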
1a. [0.25 marks] To facilitate grading, please provide your 8-digit student ID number.
1b. [2 marks] Do you have any concerns about measurement error in the first.tweet variate? Briefly explain why or why not.
1c. [2 marks] State the sample size, and calculate the sample mean, sample median, sample minimum, sample maximum, and sample standard deviation of tweet.gap.hour for Tweet Set A and Tweet Set B. Display these values in a table in your Report.
1d. [1 mark] Briefly explain why the maximum value of tweet.gap.hour for Tweet Set B should not be greater than 24. Note: This question is not asking you to simply verify that the maximum calculated in Analysis 1c is not larger than 24; your answer should explain why, based on how Tweet Set B is constructed, it should not contain a value larger than 24 for any possible sample.
1e. [4 marks] Generate a relative frequency histogram and an empirical cumulative distribution function plot of the variate tweet.gap.hour for each of Tweet Set A and Tweet Set B (that is, you should include a total of four plots, two for each Tweet Set). All plots should feature a suitable superimposed Exponential probability density or cumulative distribution function curve. Hint: use pexp() and dexp() for these curves, as shown in the R Tutorial. You may wish to use par(mfrow = c(2, 2)) so that your plots are displayed in a single image.
1f. [7 marks] For each of Tweet Set A and Tweet Set B, discuss how well an Exponential model fits the data. Your answer should explain what you would expect to observe if the data were generated from an Exponential distribution, and compare this with what you observe in your sample. You should make at least three comparisons (of what you would expect, and what you observe) for each of Tweet Set A and Tweet Set B, and include an overall conclusion on which of Tweet Set A and Tweet Set B the Exponential model appears to fit better. Note that it is possible that the Exponential model will not fit either set of tweets well, but you should try to identify which it fits better.
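The kind of summaries, plots, and comparisons described in Analysis 1 can be sketched on simulated data as follows. This is an illustration under stated assumptions rather than a model answer: the vector y is simulated from an Exponential distribution with an assumed mean of 2 hours, the seed is arbitrary, and the object names (summ, checks) are hypothetical. With real data, y would be replaced by tweet.set.A or tweet.set.B.

```r
set.seed(231)  # arbitrary seed so the sketch is reproducible

# Simulated gaps (hours) standing in for a Tweet Set; the true mean of 2 is an assumption.
y <- rexp(500, rate = 1 / 2)

theta.hat <- mean(y)  # sample mean, used as the estimate of the Exponential mean

# Numerical summaries of the kind requested in Analysis 1c.
summ <- c(n = length(y), mean = mean(y), median = median(y),
          min = min(y), max = max(y), sd = sd(y))
round(summ, 3)

par(mfrow = c(1, 2))

# Relative frequency histogram (freq = FALSE puts it on the density scale)
# with the fitted Exponential probability density function superimposed.
hist(y, freq = FALSE, breaks = 30, main = "Simulated gaps",
     xlab = "gap (hours)")
curve(dexp(x, rate = 1 / theta.hat), add = TRUE, col = "red", lwd = 2)

# Empirical cumulative distribution function with the fitted Exponential cdf.
plot(ecdf(y), main = "Empirical cdf", xlab = "gap (hours)")
curve(pexp(x, rate = 1 / theta.hat), add = TRUE, col = "red", lwd = 2)

# Quantities implied by an Exponential(theta) model, next to their sample versions:
# sd equals the mean, median equals theta * log(2), and P(Y > theta) = exp(-1).
checks <- rbind(
  expected = c(sd = theta.hat, median = theta.hat * log(2),
               prop.above.mean = exp(-1)),
  observed = c(sd = sd(y), median = median(y),
               prop.above.mean = mean(y > theta.hat))
)
round(checks, 3)
```

Note that dexp() and pexp() are parameterized by the rate, so a model with mean θ uses rate = 1/θ. Because y here really is Exponential, the expected and observed rows should be close; large discrepancies in real data would suggest a poor fit.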
Analysis 2: Interval Estimation Using an Exponential Model
In this analysis we will use an Exponential model to describe the time between tweets that were not the first tweet of the day. Note that, regardless of your conclusion in Analysis 1f, you should complete Analysis 2 using Tweet Set B.
Let Y ∼ Exponential(θ) denote the value of tweet.gap.hour for a randomly chosen tweet from the study population that was not the first tweet of the day. You are reminded that in our notation E[Y] = θ.
2a. [0.25 marks] To facilitate grading, please provide your 8-digit student ID number.
2b. [1 mark] What is the maximum likelihood estimate of θ based on your sample?
2c. [3 marks] Generate a plot of R(θ), the relative likelihood function for θ based on your sample and the assumed Exponential(θ) model. Your plot should include a horizontal line that could be used to identify the 15% likelihood interval for θ.
2d. [2 marks] Using uniroot() or uniroot.all(), calculate the 15% likelihood interval for θ.
2e. [3 marks] Calculate approximate 15%, 95%, and 99% confidence intervals for θ based on a Central Limit Theorem approximation. Hint: See Table 4.3 in the Course Notes. Your Report should include an explanation of how this was calculated, which may be expressed algebraically or, if you wish, by including the relevant R command(s).
2f. [2 marks] Which of the confidence intervals you calculated in Analysis 2e is most similar to the 15% likelihood interval found in Analysis 2d? Is this what you would expect? Briefly explain why or why not.
2g. [3 marks] Write 1-2 sentences that explain what the 95% confidence interval calculated in Analysis 2e means in the context of the study. Note: your answer should relate your interval to the real-world question under consideration, and not simply be written in terms of θ.
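The interval calculations in Analysis 2 can be sketched on simulated data as follows. Several things here are assumptions: the data are simulated with an assumed mean of 2 hours, the helper names rel.lik and clt.ci are hypothetical, and the approximate confidence interval is assumed to take the form θ̂ ± a·θ̂/√n, which should be checked against Table 4.3 in the Course Notes. The relative likelihood uses the Exponential likelihood L(θ) = θ^(-n) exp(-n·ȳ/θ), which is maximized at the sample mean.

```r
set.seed(231)                  # arbitrary seed for reproducibility
y <- rexp(200, rate = 1 / 2)   # simulated stand-in for Tweet Set B (assumed mean 2 hours)
n <- length(y)
theta.hat <- mean(y)           # maximizer of the Exponential(theta) likelihood

# Relative likelihood R(theta) = L(theta) / L(theta.hat), built on the log scale
# to avoid numerical underflow for large n.
rel.lik <- function(theta) {
  exp(n * log(theta.hat / theta) + n * (1 - theta.hat / theta))
}

# Plot R(theta) with a horizontal reference line at 0.15.
curve(rel.lik(x), from = 0.7 * theta.hat, to = 1.4 * theta.hat,
      xlab = expression(theta), ylab = expression(R(theta)))
abline(h = 0.15, lty = 2, col = "red")

# Endpoints of the 15% likelihood interval: the roots of R(theta) - 0.15,
# one on each side of theta.hat.
lower <- uniroot(function(t) rel.lik(t) - 0.15,
                 interval = c(0.5 * theta.hat, theta.hat))$root
upper <- uniroot(function(t) rel.lik(t) - 0.15,
                 interval = c(theta.hat, 2 * theta.hat))$root
c(lower, upper)

# Approximate 100p% confidence interval of the assumed form
# theta.hat -/+ a * theta.hat / sqrt(n), where a is the (1 + p)/2 Gaussian quantile.
clt.ci <- function(p) {
  a <- qnorm((1 + p) / 2)
  theta.hat + c(-1, 1) * a * theta.hat / sqrt(n)
}
rbind("15%" = clt.ci(0.15), "95%" = clt.ci(0.95), "99%" = clt.ci(0.99))
```

uniroot() needs an interval whose endpoints give function values of opposite sign, which is why each search is bracketed between theta.hat (where R equals 1) and a point far enough away that R falls below 0.15. The brackets used here suit this simulated sample and may need widening for a smaller sample.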
Analysis 3: Time Between Tweets and a Gaussian Model
In Analyses 3 and 4 we will be exploring the distribution of tweet.gap.hour for tweets that are the first tweet of the day. We will exclude tweets that were published more than 24 hours after the preceding tweet (think about why we might wish to do this). You can create this subset of tweets as follows:
> tgh.first <- mydata$tweet.gap.hour[mydata$first.tweet == 1 & mydata$tweet.gap.hour <= 24]
Note: I have called the variate tgh.first as shorthand for ‘tweet gap hour first tweets’; you are welcome to use your own choice of naming convention!
The data in tgh.first are therefore the times between the first tweet sent on a particular day and the last tweet sent on the preceding day. Hint: Run summary(tgh.first) and check that the results make sense based on how we have defined this variate.
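As a small illustration of the compound condition used to build this subset, the sketch below applies the same logic to a made-up data frame. The name toy and its values are hypothetical.

```r
# Hypothetical toy data; the values are invented for illustration only.
toy <- data.frame(
  tweet.gap.hour = c(0.5, 9.0, 30.0, 14.5, 2.0, 23.0),
  first.tweet    = c(0,   1,   1,    1,    0,   1)
)

# Keep first tweets of the day whose gap is at most 24 hours.
toy.first <- toy$tweet.gap.hour[toy$first.tweet == 1 & toy$tweet.gap.hour <= 24]

summary(toy.first)  # quick check of the range and centre of the subset
range(toy.first)    # the maximum should not exceed 24 by construction
```

If tweet.gap.hour contains missing values, a logical index of this kind carries NA entries through to the result; wrapping the condition in which() is one way to drop them.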
We will explore various transformations of the variate in an attempt to facilitate the use of a Gaussian model. In particular, we will consider the following three transformations, which we first define in general terms for data y1, y2, ..., yn, recalling that y(n) denotes the maximum value in our sample.
• Square Root: si = √ …