One individual from your group will submit two deliverables for this project:
- An R file of your code.
- Code must be commented so that what you are doing is clear. Use the commented question lines as a starting framework. The code should be written so that it runs out of the box once I change the file paths used to import the datasets.
- Answers to specific questions should be entered via Canvas Quizzes.
- All uploaded figures should be professionally labeled and titled.
Import the “nal_cloud_compressed.dta” and “srpp_cloud_compressed.dta” datasets. Then merge them together using the key and call the resulting data frame “merged_house”. You may now remove NAL and SRPP from your R environment if your computer has limited RAM.
- Restrict your sample to properties that meet the following criteria and call the resulting data frame “sfh” (for single family home). How many observations are in sfh?
- Residential class code (classcodes == “R”)
- Land used for single family home (see “landuse”)
- Sold in 1980 or later (see “owner_dateacquired”)
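The merge-then-filter steps above can be sketched on toy data as follows. This is only an illustration under stated assumptions: the key name `parcel_id`, the land-use label `"SINGLE FAMILY"`, and the idea that `owner_dateacquired` arrives as a `Date` are all hypothetical, so check the real columns before borrowing any of it. The real files would typically be read with `haven::read_dta()` rather than built by hand.

```r
# Toy stand-ins for the two datasets (the real ones come from the .dta files).
# "parcel_id" is a hypothetical key name used only for this sketch.
nal <- data.frame(
  parcel_id  = c(1, 2, 3, 4),
  classcodes = c("R", "R", "C", "R"),
  landuse    = c("SINGLE FAMILY", "DUPLEX", "SINGLE FAMILY", "SINGLE FAMILY"),
  stringsAsFactors = FALSE
)
srpp <- data.frame(
  parcel_id          = c(1, 2, 3, 4),
  owner_dateacquired = as.Date(c("1995-06-01", "2001-03-15",
                                 "2010-09-30", "1975-01-20")),
  ownersaleprice     = c(150000, 210000, 500000, 40000)
)

# Inner join on the shared key
merged_house <- merge(nal, srpp, by = "parcel_id")

# Free memory once the merge is done
rm(nal, srpp)

# Apply the three sample restrictions
sfh <- subset(
  merged_house,
  classcodes == "R" &
    landuse == "SINGLE FAMILY" &   # assumed label; inspect unique(merged_house$landuse)
    owner_dateacquired >= as.Date("1980-01-01")
)

# Number of observations that survive the restrictions
nrow(sfh)
```

On the real data it is worth checking `nrow(merged_house)` against the two inputs, since an inner join silently drops unmatched keys.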
- Using sfh, calculate the mean sale price and mean acres of residential properties by “addresscity”, keeping only each “addresscity” that had at least 100 property sales during the sample period.
- Visualize via table/figure the mean sale price by “addresscity”.
- Which “addresscity” has the highest mean sale price and what is the mean value? Which “addresscity” has the highest mean property “acres” and what is the mean value?
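One way to compute group means with a minimum-count filter is sketched below in base R. The data are made up, and the threshold is set to 2 so the toy example has something to drop; the assignment's threshold is 100.

```r
# Toy data; values are invented for illustration only
sfh <- data.frame(
  addresscity    = c("A", "A", "A", "B", "B", "C"),
  ownersaleprice = c(100, 200, 300, 400, 600, 900),
  acres          = c(0.2, 0.4, 0.6, 1.0, 2.0, 5.0),
  stringsAsFactors = FALSE
)
min_sales <- 2  # the assignment uses 100

# Count sales per city and keep the cities that meet the threshold
n_sales <- table(sfh$addresscity)
keep    <- names(n_sales)[n_sales >= min_sales]

# Mean price and mean acres by city.
# Note: the formula interface drops rows with NA in either variable.
city_means <- aggregate(
  cbind(ownersaleprice, acres) ~ addresscity,
  data = sfh[sfh$addresscity %in% keep, ],
  FUN  = mean
)

# Sort from highest to lowest mean sale price
city_means[order(-city_means$ownersaleprice), ]
```

The sorted table can be passed straight to a plotting function to produce the requested figure.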
- Suppose we are interested in predicting property sale price. Estimate a regression with “ownersaleprice” as the dependent variable and include “addresscity” as a categorical variable (factor variable in R).
- Interpret the intercept. Which city is acting as the reference group?
- Interpret the point estimate on Brentwood.
- When comparing the reference group to other cities, do you see anything problematic that might point toward a data entry error?
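The mechanics of a factor regressor in `lm` can be seen in a small sketch. The prices are invented, and the city names other than Brentwood are placeholders. With R's default treatment contrasts, the first factor level is the omitted reference group, the intercept is that group's mean, and every other coefficient is a difference from it.

```r
# Toy data; prices are invented and do not reflect the real dataset
sfh <- data.frame(
  addresscity    = c("ANTIOCH", "ANTIOCH", "BRENTWOOD",
                     "BRENTWOOD", "NASHVILLE", "NASHVILLE"),
  ownersaleprice = c(100, 120, 300, 340, 200, 220)
)
sfh$addresscity <- factor(sfh$addresscity)

# The first level (alphabetical by default) is the reference group
levels(sfh$addresscity)[1]

fit <- lm(ownersaleprice ~ addresscity, data = sfh)
# Intercept = mean price in the reference city;
# each city coefficient = that city's mean minus the reference mean
coef(fit)

# The reference group can be changed explicitly if another baseline is preferred
sfh$addresscity <- relevel(sfh$addresscity, ref = "NASHVILLE")
fit2 <- lm(ownersaleprice ~ addresscity, data = sfh)
coef(fit2)
```

If the default reference level in the real data turns out to be an odd or sparsely populated category, `relevel` makes the comparisons easier to read.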
- Add “finishedsqft” to the specification above and interpret the point estimate on “finishedsqft.”
- Suppose that you think the relationship between finished square footage and sale price is nonlinear. Alter the specification above to capture this nonlinearity by adding a squared term. Do the results imply a nonlinear relationship, and can you back this up with a figure?
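A squared term can be added inside the formula with `I()`. The sketch below uses synthetic data generated from a known quadratic, so the numbers are purely illustrative and the variable names `sqft` and `price` are placeholders. With a quadratic, the effect of one more square foot is no longer a single coefficient: it is the derivative b1 + 2 * b2 * sqft, which depends on the size of the house.

```r
# Synthetic data from a known quadratic (illustration only)
sqft  <- seq(800, 3200, by = 400)
price <- 50000 + 150 * sqft - 0.02 * sqft^2

# I() protects the arithmetic so ^2 means "squared" inside the formula
fit <- lm(price ~ sqft + I(sqft^2))
b   <- coef(fit)

# Marginal effect of one more square foot, evaluated at 1500 sqft
marginal_1500 <- unname(b["sqft"] + 2 * b["I(sqft^2)"] * 1500)

# Square footage at which the fitted curve stops rising (if b2 < 0)
turning_point <- unname(-b["sqft"] / (2 * b["I(sqft^2)"]))

marginal_1500
turning_point
```

Plotting fitted values against square footage over the observed range is one way to show the curvature in a figure.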
- Estimate the relationship between sale price and address city, finishedsqft, acres, and year. Control for year as a factor/categorical variable.
- Interpret the point estimate on acres. Does it make sense? Why or why not?
- Interpret the point estimate on 2019.
- Model the relationship between sale price and finishedsqft, acres, year (of sale), and addressfullstreet. Treat year as a factor/categorical variable. There are over 10,000 street names in Nashville, which means that the traditional “lm” command would create over 10,000 binary variables to add as controls. If you run a model like this as you’ve done above with “addresscity”, you might as well take a vacation because it could take hours/days/weeks to run.
“Big data” problems like this are common, but fortunately econometricians have created R packages that run such models significantly faster than traditional routines. Use “feols” from the “fixest” package instead of “lm” to estimate this model (see the fixest documentation).
- What happened to the adjusted r-squared?
- What does controlling for addressfullstreet do (statistically speaking) and do you see any issues controlling for the name of the street?
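To build intuition for what a high-dimensional categorical control does, the base-R sketch below compares two routes on invented data: including a dummy for every street, and demeaning the variables within street first. The slope on square footage is identical in both. This is not how `feols` is implemented internally, only an illustration of the equivalence that makes absorbing fixed effects cheap. The commented `feols` call at the end is a sketch that assumes the sale-year column is named `year`.

```r
# Toy data: three streets, three sales each (values invented)
street <- rep(c("A", "B", "C"), each = 3)
sqft   <- c(1000, 1200, 1500,  900, 1100, 1600, 1300, 1400, 1800)
price  <- c(200, 230, 290,  150, 185, 260,  310, 330, 400) * 1000

# Route 1: one dummy per street (what lm does with a factor)
fit_dummy <- lm(price ~ sqft + factor(street))

# Route 2: subtract street-level means, then regress without an intercept
price_dm   <- price - ave(price, street)
sqft_dm    <- sqft  - ave(sqft,  street)
fit_within <- lm(price_dm ~ sqft_dm - 1)

# Same slope either way: only within-street variation identifies it
slope_dummy  <- coef(fit_dummy)[["sqft"]]
slope_within <- coef(fit_within)[["sqft_dm"]]
c(slope_dummy, slope_within)

# With fixest, the fixed effects go after the "|" (not run here):
# fixest::feols(ownersaleprice ~ finishedsqft + acres | year + addressfullstreet,
#               data = sfh)
```

Because only within-street variation is used, streets with a single sale contribute nothing to the slope estimates.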
- Using the above model, predict the sale price of a residential single-family home sold in 2019, with 1218 square feet, .27 acres, built in 1946, and located on McClellan Avenue. Note: you will need to play around with addressfullstreet and “McClellan Avenue” because of the quirky way in which Nashville analysts designed the variable.
- Suppose that prior to selling, the seller had the option of enclosing a mudroom that would have added 200 square feet to the finished square footage of the house. If finishing the mudroom would have cost $6,000, would our model predict this to be an economically intelligent decision? Why or why not?
- For the reasons discussed in class, you decide to use a log-linear model in which you take the natural log of sale price. Use a log-linear model with the same explanatory variables as in part (c) to predict the sale price of a residential single-family home sold in 2019, with 1218 square feet, .27 acres, built in 1946, and located on McClellan Avenue. What is your prediction?
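A prediction from a log-linear model comes out in log dollars and has to be converted back. The sketch below, on invented data with placeholder variable names, shows the plain `exp()` back-transform alongside Duan's smearing adjustment, which is one common (not the only) correction for the fact that exponentiating the predicted log price tends to understate the mean price.

```r
# Toy data (values invented for illustration)
sqft  <- c(800, 1000, 1200, 1500, 1800, 2200)
price <- c(90000, 120000, 135000, 180000, 200000, 280000)

fit <- lm(log(price) ~ sqft)

# Predicted log price for a 1218 sqft home
new_home <- data.frame(sqft = 1218)
log_pred <- predict(fit, newdata = new_home)

# Naive back-transform
pred_naive <- unname(exp(log_pred))

# Duan's smearing factor: average of the exponentiated residuals
smear      <- mean(exp(residuals(fit)))
pred_smear <- pred_naive * smear

c(naive = pred_naive, smeared = pred_smear)
```

Whichever back-transform is used, it is worth stating which one so that predictions across models are compared on the same footing.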
- Build the “best” model you can to predict the sale price of a residential single-family home (withheld from this dataset) sold in 2019, with 1218 finished square feet, .268 acres, built in 1946, located on McClellan Avenue, 2 bedrooms, 1 bath, 0 half baths, 0 basement area, crawl space foundation, wood frame exterior wall, 1 story building. Think about nonlinearity, interactions, and trade-offs between linear/log-linear/log-log models. The group with the prediction closest to the true sale price will automatically receive a HW 3 group grade no lower than an A- (it is not advantageous to share your group’s prediction with other groups).