STAT 380 is a 3-credit undergraduate course. This is a case study-based course in the use of computing and statistical reasoning to answer data-intensive questions. This course addresses the fact that real data are often messy by taking a holistic view of statistical analysis to answer questions of interest. Various case studies will lead students from the computationally intensive process of obtaining and cleaning data, through exploratory techniques, and finally to rudimentary inferential statistics. This process will exploit students' exposure to introductory statistics as well as the R programming language, hence the required prerequisites, yet novel computing and analytical techniques will also be introduced throughout the course. For the collection of data, students will learn scripting and database querying skills; for their exploration, they will employ R capabilities for graphical and summary statistics; and for their analysis, they will build upon the basic concepts obtained in their introductory statistics course. The varied case studies will elucidate additional statistical topics such as identifying sources of bias and searching for high-dimensional outliers.
At the end of this course, successful students will be able to…
- Collect and tidy complex data from varied sources such as logs, email messages, and relational databases
- Perform visualization and exploratory data analysis using the R programming language
- Employ statistical learning to understand relationships in data, including both supervised and unsupervised learning approaches
- Assess model validation and prediction through simulation and cross-validation
Prerequisites: STAT 200 and STAT 184
Required: An Introduction to Statistical Learning (https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-7138-7) by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani (2017). Available at no cost to students through the Penn State Libraries E-Book program. Students can use the Library Resources link in the left-hand course navigation to access the book.
Modern Data Science with R (https://catalog.libraries.psu.edu/catalog/31178880) by B. Baumer, D. Kaplan and N. Horton (2017). Available online for free.
A data.table and dplyr tour. Free through https://atrebas.github.io/post/2019-03-03-
All course material and assignments will be made available via Canvas. Students are responsible for visiting the Canvas course site regularly and keeping track of announcements and emails.
Lectures: The lectures will not be recorded. Students are required to attend lectures.
Assignments: Weekly/bi-weekly assignments will be administered as Kaggle competitions. A project workflow template will be shared which you will be expected to follow for all assignments.
Code submissions will be made via Canvas and will usually be due by 23:59 (ET) on Sundays.
Assignment due dates are set to Sundays to allow students maximum time to work on them.
Note that the deadline shown on the Kaggle website is usually incorrect, so please ignore it. Students should not wait until the last moment to start working on their assignments, as support will be extremely limited on weekends. NO LATE homework will be accepted!
Kaggle: All assignments and projects in this class will be set up as Kaggle competitions. You will be required to create an account at Kaggle.com. You are encouraged to use your Penn State email address to sign up.
R/RStudio: We will use R/RStudio extensively in this class. All class assignments will require you to use R for statistical analysis. To be successful in this class, it is recommended that students have access to a computer with R/RStudio installed locally. Alternatively, students can use RStudio Server hosted by Penn State’s Teaching and Learning with Technology (TLT); however, support for this is highly limited, so this option is inadvisable. To access TLT’s RStudio Server from off-campus locations, you will be required to log in through the VPN (you will find more information on downloading and installing VPN at
Extra Credit: There are no make-up assignments and no extra credit opportunities.
Kaggle: We will be using Kaggle for class-related questions and discussions once the competitions start. This will enable everyone to benefit from each other's questions and to help each other. Content-related questions can be posted to the entire class. I encourage you to ask questions if you are struggling to understand a concept, and to answer your classmates’ questions when you can. We will monitor the discussion boards at least once per day.
Do NOT use Kaggle for issues related to your grade or other private matters; please use the Canvas INBOX option (on the left class navigation banner) for those questions or comments.
Email: For any questions, comments or inquiries of a personal nature (i.e., anything not related to class material), please use the CANVAS INBOX option (on the left class navigation banner). If you have your CANVAS email forwarded to a personal account, please DO NOT reply to me from your personal email account, as such replies will not automatically come through CANVAS.
I will check my CANVAS email regularly throughout the workweek (Monday-Friday), and possibly on weekends as I get the chance. I will try my best to get back to you within 24 hours; if I do not, feel free to send the message again. Please keep in mind that questions asked less than 24 hours before the deadline may go unanswered, and it is your responsibility to turn in the assignments on time.