Python Web Crawler Assignment Help | Implement the Core of a Web Crawler
In this project, you are going to implement the core of a Web crawler, and then you are going to crawl the following URLs (to be considered as domains for the purposes of this assignment) and paths:
As a concrete deliverable of this project, besides the code itself, you must submit a report containing answers to the following questions:
How many unique pages did you find? Uniqueness for the purposes of this assignment is ONLY established by the URL, but discarding the fragment part. So, for example, http://www.ics.uci.edu#aaa and http://www.ics.uci.edu#bbb are the same URL. Even if you implement additional methods for textual similarity detection, please keep considering the above definition of unique pages for the purposes of counting the unique pages in this assignment.
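The uniqueness rule above can be sketched with the standard library. This is only an illustration, assuming the crawled URLs are available as an iterable of strings; the function name unique_page_count is hypothetical and not part of the starter code.

```python
from urllib.parse import urldefrag


def unique_page_count(urls):
    """Count URLs that differ in something other than the fragment."""
    seen = set()
    for url in urls:
        # urldefrag splits "http://host/path#frag" into (url, fragment).
        defragged, _fragment = urldefrag(url)
        seen.add(defragged)
    return len(seen)


# The two example URLs from the assignment collapse into one page.
print(unique_page_count([
    "http://www.ics.uci.edu#aaa",
    "http://www.ics.uci.edu#bbb",
]))
```

Note that no other normalization is applied here, so two URLs that differ only by, say, a trailing slash would still be counted separately under this sketch.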
What is the longest page in terms of the number of words? (HTML markup doesn’t count as words)
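One possible way to count words without counting markup is sketched below. It uses only the standard library's html.parser to stay dependency-free; BeautifulSoup or lxml could do the same job. The names count_words and longest_page are hypothetical, and treating a "word" as a run of ASCII letters or digits is an assumption the assignment does not spell out.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def count_words(html):
    """Return the number of words in the text of an HTML string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    # Assumption: a word is a run of ASCII letters or digits.
    return len(re.findall(r"[A-Za-z0-9]+", text))


def longest_page(pages):
    """Return (url, word_count) for the longest page in a {url: html} mapping."""
    url = max(pages, key=lambda u: count_words(pages[u]))
    return url, count_words(pages[url])
```

In a real crawl it would be cheaper to keep only the running maximum (URL and count) as each page is scraped, rather than storing every page.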
What are the 50 most common words in the entire set of pages crawled under these domains? (Ignore English stop words, which can be found, for example, here) Submit the list of common words ordered by frequency.
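A running word-frequency count across pages might look like the following sketch. The STOP_WORDS set here is a tiny illustrative subset, not the list the assignment refers to, and the function names are hypothetical; the tokenization rule (lowercased runs of letters or digits) is likewise an assumption.

```python
import re
from collections import Counter

# Tiny illustrative subset; a real run would load a full English stop-word list.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "for"}


def update_counts(counter, text):
    """Add the non-stop-word tokens of one page's text to the running counter."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counter.update(t for t in tokens if t not in STOP_WORDS)


def top_words(counter, n=50):
    """Return the n most frequent (word, count) pairs, most frequent first."""
    return counter.most_common(n)


counts = Counter()
update_counts(counts, "The crawler and the frontier")
update_counts(counts, "A crawler is polite")
print(top_words(counts, 3))
```

Calling update_counts once per scraped page keeps memory use proportional to the vocabulary rather than to the number of pages.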
How many subdomains did you find in the ics.uci.edu domain? Submit the list of subdomains ordered alphabetically and the number of unique pages detected in each subdomain. The content of this list should be lines containing URL, number, for example:
http://vision.ics.uci.edu, 10 (not the actual number here)
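The subdomain listing could be produced along these lines. This is a sketch under several assumptions: the function name subdomain_report is hypothetical, the bare host ics.uci.edu is not itself counted as a subdomain, and every line is printed with an http:// prefix to match the example format above, regardless of the scheme actually crawled.

```python
from collections import defaultdict
from urllib.parse import urldefrag, urlparse


def subdomain_report(urls, domain="ics.uci.edu"):
    """Return 'URL, number' lines, one per subdomain, sorted alphabetically."""
    pages = defaultdict(set)
    for url in urls:
        defragged, _fragment = urldefrag(url)
        host = (urlparse(defragged).hostname or "").lower()
        # Only hosts strictly below the domain, e.g. vision.ics.uci.edu.
        if host.endswith("." + domain):
            pages[host].add(defragged)
    return [f"http://{host}, {len(pages[host])}" for host in sorted(pages)]


for line in subdomain_report([
    "http://vision.ics.uci.edu/a",
    "http://vision.ics.uci.edu/a#section",
    "http://archive.ics.uci.edu/",
]):
    print(line)
```

Storing defragmented URLs in a set per host keeps the per-subdomain count consistent with the definition of unique pages used for the first question.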
What to submit: a zip file containing your modified crawler code and the report.
Grader meetings: this project requires a meeting of all members of your group with one of the TAs/Readers, where all of you will be asked questions about your crawler — your code and the operation of the crawler. These meetings will occur a few days after the submission deadline. Instructions will be sent at the time.
To get started, fork or get the crawler code from https://github.com/Mondego/spacetime-crawler4py
Read the instructions in the README.md file up to, and including the section “Execution”. This is enough to implement the simple crawler for this project. In short, this is the minimum amount of work that you need to do:
Install the dependencies
Set the USERAGENT variable in Config.ini so that it contains the student ID numbers of all group members, separated by commas (the numbers! e.g. IR UF21 123123213,12312312,123123), and please also update the quarter information (e.g. UF21 for Undergraduate Fall 2021). If you fail to do this properly, your crawler will not appear in the server’s log, which will put your grade for this project at risk.
(This is the meat of the crawler) Implement the scraper function in scraper.py. The scraper function receives a URL and the corresponding Web response (for example, the first one will be “http://www.ics.uci.edu” and the Web response will contain the page itself). Your task is to parse the Web response, extract enough information from the page (if it’s a valid page) to be able to answer the questions for the report, and finally return the list of URLs “scraped” from that page. Some important notes:
Make sure to return only URLs that are within the domains and paths mentioned above! (see is_valid function in scraper.py — you need to change it)
Make sure to defragment the URLs, i.e. remove the fragment part.
You can use whatever libraries make your life easier to parse things. Optional dependencies you might want to look at: BeautifulSoup, lxml (nudge, nudge, wink, wink!)
Optionally, in the scraper function, you can also save the URL and the web page on your local disk.
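The notes above (extract links, defragment them, keep only in-scope URLs) can be sketched as follows. This is not the starter code's API: extract_links and in_scope are hypothetical helpers that work on a plain HTML string, and how the HTML is obtained from the response object depends on the starter code. The allowed domain passed in the example is a placeholder, and the sketch checks hosts only, so any path restrictions from the assignment would still have to be added. The standard library's html.parser is used to keep the example dependency-free; BeautifulSoup or lxml would shorten it.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class _LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(base_url, html):
    """Return absolute, defragmented URLs for all links found in html."""
    collector = _LinkCollector()
    collector.feed(html)
    collector.close()
    links = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        defragged, _fragment = urldefrag(absolute)
        links.append(defragged)
    return links


def in_scope(url, allowed_domains):
    """True if url is http(s) and its host is one of, or below, allowed_domains."""
    try:
        parsed = urlparse(url)
        host = (parsed.hostname or "").lower()
    except ValueError:
        return False  # malformed URL
    if parsed.scheme not in ("http", "https"):
        return False
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


page = '<a href="/about#team">About</a> <a href="http://example.com/x">X</a>'
links = extract_links("http://www.ics.uci.edu/index.html", page)
print([u for u in links if in_scope(u, ("ics.uci.edu",))])
```

Matching on the host with a leading dot avoids accepting look-alike hosts whose names merely end in the same characters as an allowed domain.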
See Crawler Details
Run the crawler from your laptop/desktop or from an ICS openlab machine. You can use either classic ssh and scp to openlab.ics.uci.edu or the web interface hub.ics.uci.edu from your browser; I would recommend ssh, so that you learn a skill that will probably be important for the rest of your professional life. Note that to install software on machines that you do not own, or on which you are not authorized to use sudo, you need to install it to your user folder; with pip/pip3, use the --user option to do so. Note that the crawl will take several hours, possibly a day! It may never end if you are not careful with your implementation! You also need to be inside the campus network, or you won’t be able to crawl. If your computer is outside UCI, use the VPN.
Monitor what your crawler is doing. If you see it trapped in a Web trap, or malfunctioning in any way, stop it, fix the problem in the code, and restart it. Sometimes, you may need to restart from scratch. In that case, delete the frontier file (frontier.shelve), or move it to a backup location, before restarting the crawler.
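Moving the frontier file aside before a fresh restart could be scripted as in the sketch below. The helper name backup_frontier and the backup directory name are hypothetical. The trailing wildcard is there because, depending on the platform's dbm backend, a shelve file may be stored as several files sharing the same prefix; that behavior is an assumption about your environment, so check what is actually on disk.

```python
import glob
import os
import shutil


def backup_frontier(frontier_path="frontier.shelve", backup_dir="frontier_backup"):
    """Move the frontier file (and any sibling files with the same prefix)
    into backup_dir, returning the list of new locations."""
    os.makedirs(backup_dir, exist_ok=True)
    moved = []
    # glob.escape protects any special characters in the path itself.
    for path in glob.glob(glob.escape(frontier_path) + "*"):
        dest = os.path.join(backup_dir, os.path.basename(path))
        shutil.move(path, dest)
        moved.append(dest)
    return sorted(moved)
```

Calling backup_frontier() from the crawler's working directory, with the crawler stopped, leaves no frontier file behind, so the next run starts from scratch while the old state remains available for inspection.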