CIT 594 Group Project



1 Background

The OpenDataPhilly portal offers, for free, more than 300 data sets, applications, and APIs related to the city of Philadelphia. This resource enables government officials, researchers, and the general public to gain a deeper understanding of what is happening in our fair city. The available data sets cover topics such as the environment, real estate, health and human services, transportation, and public safety. The United States Census Bureau publishes similar information (and much more) for the nation as a whole.

For this assignment, you will use course-provided files containing data from these sources. Specifically, you will be given:

  • “COVID” data, from the Philadelphia Department of Public Health
  • “Properties” data (information about land parcels in the city), from the Philadelphia Office of Property Assessment
  • 2020 populations of Philadelphia ZIP Codes, from the US Census Bureau

1.1 COVID Data

This data set, which is updated daily, tracks reported COVID cases, hospitalizations, vaccinations, and deaths in the city of Philadelphia for each day. OpenDataPhilly has pointers to explore the data on the department’s own site as well as a GitHub repository that stores historic snapshots of the data.

All three sites have more details about the collection methodology and other information about the data sets.

The files provided with the assignment include these COVID data in a combined form (all 4 sets as a single file) indexed by recording time and ZIP Code. Note that the ZIP Codes in these data sets are for the reporting locations, which may not match the patients’ home ZIP Codes. For simplicity, we will ignore this issue and assume reporting ZIP Codes and home ZIP Codes are the same.

1.2 Properties Data

Your program will also use a data set of property values of houses and other properties in Philadelphia. This data set includes details about each property, including its ZIP Code and current market value (the estimated dollar value of the property, which is used by the city to calculate property taxes). It also includes the total livable area for the property, a measure of the floor space of the structure(s) on the property in square feet.

2 Input Data Format

As the OpenDataPhilly data sets are very large and contain a lot of extra data you do not need, we will provide somewhat simplified versions for you to use for this assignment. You do not need to download anything from the OpenDataPhilly site.

Your program will need to support reading all three types of data from CSV (Comma-Separated Values) files, as well as an additional JSON file for the COVID data. All valid CSV files will start with a header row that will include all of the designated fields for each data set. Your program should use the header row to determine the order of the columns at runtime.

See Appendix A for more details about parsing CSV files for this assignment.
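
One way to honor the "use the header row to determine the order of the columns" requirement is to build a name-to-position lookup once and then read every later row through it. The sketch below is only an illustration: the class name HeaderIndex and its methods are hypothetical, it assumes the rows have already been split into cells according to the Appendix A rules, and it assumes header names are matched exactly.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: remembers which column position each header name occupies. */
final class HeaderIndex {
    private final Map<String, Integer> positions = new HashMap<>();

    /** Builds the lookup from the cells of the header row (already split). */
    HeaderIndex(String[] headerCells) {
        for (int i = 0; i < headerCells.length; i++) {
            positions.put(headerCells[i], i);
        }
    }

    /** True if the header row contained a column with this exact name. */
    boolean hasColumn(String columnName) {
        return positions.containsKey(columnName);
    }

    /**
     * Returns the cell of the given row that belongs to the named column,
     * or null if the column is unknown or the row is too short.
     */
    String get(String[] rowCells, String columnName) {
        Integer position = positions.get(columnName);
        if (position == null || position >= rowCells.length) {
            return null;
        }
        return rowCells[position];
    }
}
```

With this shape, the rest of the program asks for fields by name (for example, get(row, "market_value")) and never depends on the physical column order of a particular file.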

2.1 COVID-19 Data

Your program needs to be able to read the set of vaccination data from both CSV and JSON files; the type should be inferred from the file name extension (the portion of the name following the last “.”, case-insensitive). The format only determines the organization of the data and is independent of the actual contents (the provided CSV and JSON files contain the same information). Each invocation of the program will be given at most one COVID data file, which will be in one of these two formats.
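
The extension rule above (the text after the last ".", compared without regard to case) can be sketched as follows. The class and method names are hypothetical, and what the program should do with an unrecognized extension is not covered here, so the sketch simply reports the extension and leaves that decision to the caller.

```java
import java.util.Locale;

/** Hypothetical helper for choosing a COVID reader based on the file name. */
final class CovidFileType {

    /**
     * Returns the lower-cased portion of the name after the last '.',
     * or an empty string if the name contains no '.'.
     */
    static String extensionOf(String fileName) {
        int lastDot = fileName.lastIndexOf('.');
        if (lastDot < 0) {
            return "";
        }
        return fileName.substring(lastDot + 1).toLowerCase(Locale.ROOT);
    }

    static boolean isCsv(String fileName) {
        return "csv".equals(extensionOf(fileName));
    }

    static boolean isJson(String fileName) {
        return "json".equals(extensionOf(fileName));
    }
}
```

Lower-casing with Locale.ROOT keeps the comparison independent of the machine's default locale.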

Each record contains statistics relating to COVID-19 for a single ZIP Code on a single day. The fields include:

  • The ZIP Code where the vaccinations were provided.
  • The timestamp at which the vaccination data for that ZIP Code were reported, in “YYYY-MM-DD hh:mm:ss” format.
  • The total number of persons who have received their first dose in the ZIP Code but not their second dose (“partially vaccinated”), as of the reporting date.
  • The total number of persons who have received their second dose (“fully vaccinated”) in the ZIP Code, as of the reporting date.

The record for each ZIP Code also contains statistics for the total number of COVID infection tests conducted as of each date (both positive and negative results), the total number of booster doses administered as of each date, the total number of COVID patients hospitalized as of that date (including previously hospitalized persons who have recovered or died), and the total number of deaths attributed to the disease to date. You may, if you wish, use these additional fields for the free-form analysis in subsection 3.7.

Note that all of the above-described data fields are cumulative, with two caveats. First, when a person who is “partially vaccinated” receives their second dose, they are removed from that count and added to the “fully vaccinated” count, which may result in overall decreases in the “partially vaccinated” count. Second, the reporting agencies may have made occasional data corrections or errors which result in one of the other cumulative fields temporarily decreasing in value.

You should ignore any records where the ZIP Code is not 5 digits or the timestamp is not in the specified format. For any other fields, an empty value should be interpreted as being 0. For example, if the record for ZIP Code 19000 on 2021-06-01 has an empty field for fully_vaccinated, then this should be interpreted as meaning there were no fully vaccinated people as of this date in that ZIP Code.
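
A minimal sketch of these record checks is shown below. The names are hypothetical, and two assumptions are made that the text does not settle: the timestamp check only verifies the "YYYY-MM-DD hh:mm:ss" shape (it does not reject impossible dates such as month 13), and a non-empty value that is not a whole number is left to throw NumberFormatException rather than being silently converted.

```java
import java.util.regex.Pattern;

/** Hypothetical validation helpers for one COVID record. */
final class CovidRecordChecks {
    // Exactly five ASCII digits.
    private static final Pattern ZIP = Pattern.compile("[0-9]{5}");
    // Shape check only: "YYYY-MM-DD hh:mm:ss".
    private static final Pattern TIMESTAMP =
            Pattern.compile("[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}");

    static boolean isValidZip(String zip) {
        return zip != null && ZIP.matcher(zip).matches();
    }

    static boolean isValidTimestamp(String timestamp) {
        return timestamp != null && TIMESTAMP.matcher(timestamp).matches();
    }

    /** Interprets a missing or empty count field as 0; otherwise parses it as an int. */
    static int countOrZero(String raw) {
        if (raw == null || raw.isEmpty()) {
            return 0;
        }
        return Integer.parseInt(raw);
    }
}
```

A reader would skip any record for which isValidZip or isValidTimestamp returns false, and pass each remaining count field through countOrZero.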

The JSON format is an array of objects much like the flu tweets in the Solo Project. You will need to use the same JSON.simple library (included with the starter files) for parsing the JSON file.

Review your solution to that assignment if you do not recall how to set up and use that library.

2.2 Property Values

The property values data set will only be provided as a CSV file; there is no JSON file for these data. Each row of the CSV file represents data about one property (residential, commercial, vacant land, etc.). For the prescribed activities you will need three fields:

  • market_value
  • total_livable_area
  • zip_code

You may also use any of the other fields and records in the included properties.csv for your free-form activity. We would not recommend having your program store fields that you will not use in your analysis since this file is quite large and doing so would take up a lot of memory.

The zip_code field of the property values data may make use of extended forms of ZIP Codes. In your analysis you should use only the first 5 characters. For example, if the value read is “19104-3333” or “191043333”, it should be interpreted as “19104”. If the ZIP Code has fewer than 5 characters or the first 5 characters are not all numeric then you should ignore that record entirely.
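
The ZIP Code rule for the properties file can be sketched as a small normalization step. The class and method names are hypothetical, and returning null to signal "ignore this record" is just one possible convention.

```java
/** Hypothetical helper that reduces a raw zip_code value to its 5-digit form. */
final class PropertyZip {

    /**
     * Returns the first 5 characters if there are at least 5 and all are
     * ASCII digits; returns null if the record should be ignored.
     */
    static String normalize(String raw) {
        if (raw == null || raw.length() < 5) {
            return null;
        }
        String prefix = raw.substring(0, 5);
        for (int i = 0; i < prefix.length(); i++) {
            char c = prefix.charAt(i);
            if (c < '0' || c > '9') {
                return null;
            }
        }
        return prefix;
    }
}
```

For instance, normalize("19104-3333") and normalize("191043333") both yield "19104", while normalize("1910") and normalize("1910A") yield null.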

Because this is real-world data, sometimes there will be errors in the data sets, such as missing ZIP Codes, market values that are non-numeric, etc. For the property file, if your program encounters data that are malformed but needed for a particular calculation, then your program should ignore those data for the purposes of that calculation and produce the result based only on the well-formed data.
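
As an illustration of "ignore it for the purposes of that calculation", the sketch below averages a list of raw field values while skipping any that cannot be read as a finite number. Everything here is hypothetical: the names, the choice of an average as the example calculation, the use of Double.parseDouble as the test for "numeric" (it also accepts forms such as "1e3"), and the use of an empty OptionalDouble when no well-formed values remain.

```java
import java.util.List;
import java.util.OptionalDouble;

/** Hypothetical example of computing a result from well-formed values only. */
final class WellFormedOnly {

    /** Returns the parsed value, or null if the text is missing or not a finite number. */
    static Double parseOrNull(String raw) {
        if (raw == null) {
            return null;
        }
        try {
            double value = Double.parseDouble(raw.trim());
            return (Double.isNaN(value) || Double.isInfinite(value)) ? null : value;
        } catch (NumberFormatException e) {
            return null;
        }
    }

    /** Averages only the values that parse; malformed entries affect neither sum nor count. */
    static OptionalDouble average(List<String> rawValues) {
        double sum = 0.0;
        int count = 0;
        for (String raw : rawValues) {
            Double value = parseOrNull(raw);
            if (value == null) {
                continue; // malformed for this calculation, so skip it
            }
            sum += value;
            count++;
        }
        return count == 0 ? OptionalDouble.empty() : OptionalDouble.of(sum / count);
    }
}
```

The key point is that a skipped value is left out of both the numerator and the denominator, so a record with a bad market value can still contribute to a different calculation that only needs its other fields.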