Python assignment writing service | CITS1401 Computational Project 2 Semester 2 2020


CITS1401 Computational Thinking with Python
Project 2 Semester 2 2020

To be completed individually.
You should construct a Python 3 program containing your solution to the following
problem and submit your program electronically on Moodle. No other method of
submission is allowed. Your program will be automatically tested on Moodle. Remember
that your first two checks against the tester on Moodle will not incur any penalty. However,
any further check will carry a 10% penalty per check.
You are expected to have read and understood the University’s guidelines on academic
conduct. In accordance with this policy, you may discuss with other students the general
principles required to understand this project, but the work you submit must be the
result of your own effort. Plagiarism detection, and other systems for detecting potential
malpractice, will therefore be used. Besides, if what you submit is not your own work
then you will have learnt little and will therefore, likely, fail the final exam.
You must submit your project before the submission deadline listed above. Following
UWA policy, a late penalty of 5% will be deducted for each day (or part day) after the
deadline that the assignment is submitted. No submissions will be accepted more than
7 days after the deadline, except in approved special consideration cases.
Context:
For this project, imagine for a moment that you have successfully completed your UWA
course and recently taken up a position at the Department of Prime Minister and
Cabinet in Canberra with the Australian Federal Government. At first you were quite
reluctant to leave Perth to move ‘over east’ and, more generally, wondered what use a
new graduate with a heavy focus on computing, programming and data could be to this
department. Regardless, the opportunity to gain experience in the ‘real world’ was too
good, and although it is not quite your own multi-million dollar technology start-up,
there was no way you weren’t taking up the offer.
Your first few weeks of orientation were mostly a blur. However, one thing you noticed
was that any time you mentioned your skills in programming, and with Python in
particular, to any senior bureaucrat, or even some of the savvier politicians, their eyes
seemed to ‘light up’ and they suddenly became much more interested in whatever you
were saying to them. Reflecting on these experiences, you wondered whether there might
be some even more interesting opportunities for you in the near future.
For now, however, you decide to put these thoughts aside: the work that you have
been doing has already been interesting, and it is what you need to focus on
today. At an early morning meeting with your immediate supervisor, you were told that
the Government is very interested in reducing its spending on understanding what
(and how) the Australian population currently thinks about it. Instead of spending
millions of dollars calling randomised groups of Australian residents every quarter to
ask about their opinions on various Government services, many senior bureaucrats have
wondered for a while now whether there is any way to use the masses of freely
available data on the internet to provide similar insights at a fraction of the cost.
It is within this context that your supervisor has asked you to develop a program, as a
proof-of-concept, to demonstrate that it is possible to provide some of these insights at
a much lower cost. At your meeting your supervisor noted that, for the proof-of-concept
stage, the use of any ‘live’ internet data will not be possible without approval from the
legal team (as well as possibly many others). This seemed like quite an obstacle until
you thought back to one of your early Python units (maybe this one?) and remembered
that there is an open-source, freely available corpus of billions of recently
crawled web pages called the Common Crawl (http://commoncrawl.org/). More
specifically, the Common Crawl corpus consists of tens of thousands of files saved in a
certain format (the WARC format, see below), each of which contains the raw HTML of
tens of thousands of web pages from a web ‘crawl’ performed in the recent past. Because
it is open source, this data is free for you to use, so you can begin building your
proof-of-concept immediately.
The Project:
As your program is to be a proof-of-concept, both you and your supervisor decided that
its scope should be kept as narrow as possible (but, of course, it must be broad enough
so that it can successfully demonstrate some really good insights). For this reason, it
was decided that your program is to focus on providing only four insights:
1. How ‘positive’ is Australia generally?
2. How ‘positive’ does Australia feel towards its Government specifically?
3. How ‘patriotic’ is Australia compared with two other major English-speaking
countries, the UK and Canada?
4. Which websites (domains) are most referred to by Australian websites
(your team may want to use this information in the future to better understand
how ‘influential’ each Australian web result is to your insights, i.e. highly referred-to
web domains should be counted as more influential, and rarely referred-to web
domains should be counted as less influential).
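To make the fourth insight concrete, the following is a minimal sketch of one way to count referred-to domains, not the required solution. It assumes that a website counts as ‘Australian’ when its hostname ends in `.au`, and that a reference is an `href` attribute inside an `<a>` tag; both are assumptions for illustration only. The names `SAMPLE_PAGES`, `is_australian` and `count_referred_domains` are hypothetical, and the sample pages are invented rather than real crawl data.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Hypothetical sample data: (page URL, raw HTML). Not real crawl data.
SAMPLE_PAGES = [
    ("http://example.com.au/home",
     '<a href="http://news.example.org/a">A</a> '
     '<a href="https://news.example.org/b">B</a>'),
    ("http://shop.example.net.au/",
     '<a href="http://news.example.org/c">C</a> '
     '<a href="http://other.example.com/">D</a> '
     '<a href="/local">relative link</a>'),
    ("http://example.co.uk/",
     '<a href="http://ignored.example.org/">not an Australian page</a>'),
]

# Matches the href value of an <a> tag (simplified; real HTML can be messier).
HREF_PATTERN = re.compile(
    r'<a\s[^>]*?href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def is_australian(url):
    """Assumption: a page is Australian if its hostname ends in '.au'."""
    host = urlparse(url).hostname or ""
    return host.endswith(".au")


def count_referred_domains(pages):
    """Count how often each domain is linked to from Australian pages."""
    counts = Counter()
    for page_url, html in pages:
        if not is_australian(page_url):
            continue
        for link in HREF_PATTERN.findall(html):
            host = urlparse(link).hostname  # None for relative links
            if host:
                counts[host.lower()] += 1
    return counts


if __name__ == "__main__":
    for domain, total in count_referred_domains(SAMPLE_PAGES).most_common(5):
        print(domain, total)
```

Relative links such as `/local` have no hostname and are skipped here; whether a page's links to its own domain should count is a design decision this sketch leaves open.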
As outlined in the ‘context’ section, in order to generate these insights (which will be
discussed in greater detail later in this document), your program will need to examine
the raw HTML from large quantities of Australian web pages, and such information is
available in WARC format from the Common Crawl.
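As a rough illustration of what ‘examining the raw HTML’ from a WARC file might involve, here is a minimal sketch under stated assumptions. It relies on the general WARC layout, in which each record starts with a `WARC/1.0` line followed by headers such as `WARC-Type` and `WARC-Target-URI`, then the HTTP headers, then the page content. The sample text is invented and heavily shortened, the names `SAMPLE_WARC`, `split_records` and `australian_pages` are hypothetical, and treating a `.au` hostname as ‘Australian’ is an assumption. Splitting on the version line is a simplification: a robust reader would use each record's `Content-Length` header instead.

```python
from urllib.parse import urlparse

# Hypothetical, heavily shortened WARC-style text with two response records.
SAMPLE_WARC = (
    "WARC/1.0\r\n"
    "WARC-Type: response\r\n"
    "WARC-Target-URI: http://example.com.au/\r\n"
    "\r\n"
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<html><body>G'day</body></html>\r\n"
    "\r\n"
    "WARC/1.0\r\n"
    "WARC-Type: response\r\n"
    "WARC-Target-URI: http://example.co.uk/\r\n"
    "\r\n"
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<html><body>Hello</body></html>\r\n"
    "\r\n"
)


def split_records(warc_text):
    """Yield (target_uri, html) for each 'response' record in the text."""
    for chunk in warc_text.split("WARC/1.0\r\n"):
        if not chunk.strip():
            continue
        # Three parts separated by blank lines:
        # WARC headers, HTTP headers, page content.
        parts = chunk.split("\r\n\r\n", 2)
        if len(parts) < 3:
            continue
        warc_headers, _http_headers, payload = parts
        headers = {}
        for line in warc_headers.split("\r\n"):
            name, sep, value = line.partition(":")
            if sep:
                headers[name.strip().lower()] = value.strip()
        if headers.get("warc-type") != "response":
            continue
        yield headers.get("warc-target-uri", ""), payload.strip()


def australian_pages(warc_text):
    """Assumption: keep only pages whose hostname ends in '.au'."""
    return [(uri, html) for uri, html in split_records(warc_text)
            if (urlparse(uri).hostname or "").endswith(".au")]


if __name__ == "__main__":
    for uri, html in australian_pages(SAMPLE_WARC):
        print(uri, len(html))
```

The HTML returned for each Australian page could then be passed to whatever analysis each insight requires.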