Pittsburgh Neighborhood Zillow Tracker Methodology

The results can be found here.

The code can be found here.


Data Collection

Data was collected by scraping Zillow's website. First, all sold houses were scraped for zip codes 15206, 15208, 15217, 15218, 15221, and 15232. This was done by cycling through the first 25 pages of results that appear when searching each of those zip codes on Zillow. At this step, only basic information on each sold home was collected: the address, beds, baths, area, latitude, longitude, sold date, Zillow's zpid, and the listing URL.
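As a rough illustration of this first pass, the sketch below pages through search results and keeps only the basic fields. The search URL pattern, the script tag, and the JSON paths are assumptions about how Zillow serves results, not a guaranteed description of its page structure or of the code actually used for this project.

```python
# First-pass scrape sketch: page through Zillow search results for each zip
# code and collect the basic listing fields.  URL pattern and JSON layout are
# assumptions about Zillow's pages.
import json
import requests
from bs4 import BeautifulSoup

ZIPS = ["15206", "15208", "15217", "15218", "15221", "15232"]
HEADERS = {"User-Agent": "Mozilla/5.0"}  # Zillow tends to block default clients

def scrape_sold_listings(zip_code: str, max_pages: int = 25) -> list[dict]:
    rows = []
    for page in range(1, max_pages + 1):
        # Hypothetical search URL filtered to recently sold homes
        url = f"https://www.zillow.com/homes/recently_sold/{zip_code}_rb/{page}_p/"
        soup = BeautifulSoup(requests.get(url, headers=HEADERS).text, "html.parser")
        script = soup.find("script", id="__NEXT_DATA__")  # embedded results JSON (assumed)
        if script is None:
            break
        results = (json.loads(script.string)
                   .get("props", {}).get("pageProps", {})
                   .get("searchPageState", {}).get("cat1", {})
                   .get("searchResults", {}).get("listResults", []))
        for home in results:
            rows.append({
                "zpid": home.get("zpid"),
                "address": home.get("address"),
                "beds": home.get("beds"),
                "baths": home.get("baths"),
                "area": home.get("area"),
                "latitude": home.get("latLong", {}).get("latitude"),
                "longitude": home.get("latLong", {}).get("longitude"),
                "sold_date": home.get("soldDate"),
                "url": home.get("detailUrl"),
            })
    return rows
```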


Using the URLs from the first pass, I then scraped additional details for each home, such as the description, suburb, cooling, heating, parking, flooring, stories, and other information listed on the page. Using the address information, I also scraped walk, transit, and bike scores from https://www.walkscore.com/ to join onto the data.
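For the Walk Score portion, a minimal sketch of the lookup is below. The URL slug format and the score-badge patterns are assumptions about how walkscore.com renders a score page, labeled as such, rather than a documented API.

```python
# Walk Score lookup sketch: build a score-page URL from the address and pull
# the walk/transit/bike scores out of the HTML.  Slug format and regexes are
# assumptions about the page layout.
import re
import requests

HEADERS = {"User-Agent": "Mozilla/5.0"}

def scrape_walk_scores(address: str) -> dict:
    slug = re.sub(r"[^A-Za-z0-9]+", "-", address).strip("-")
    html = requests.get(f"https://www.walkscore.com/score/{slug}", headers=HEADERS).text
    scores = {}
    for kind in ("walk", "transit", "bike"):
        # Assumes score badges appear as images like ".../walk/score/72.svg"
        m = re.search(rf"/{kind}/score/(\d+)\.svg", html)
        scores[f"{kind}_score"] = int(m.group(1)) if m else None
    return scores
```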


Both pieces of the scraper code are designed to append only new records to the data that has already been scraped. This makes the dataset easy to update, since each run adds to what is already there instead of having to re-pull all the data every time. The sold date combined with Zillow's zpid forms the unique identifier for each sale, so a house sold more than once in the time period can have multiple entries in the dataset.
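A small pandas sketch of that append-only logic is below; the file name and column names are assumptions, but the idea is simply to keep only rows whose (zpid, sold_date) key is not already stored.

```python
# Append-only update sketch: rows whose (zpid, sold_date) key already exists
# are skipped, so re-runs never duplicate or re-pull history.
import pandas as pd

def append_new_rows(new_rows: pd.DataFrame, path: str = "sold_homes.csv") -> pd.DataFrame:
    existing = pd.read_csv(path)
    key = ["zpid", "sold_date"]
    merged = new_rows.merge(existing[key], on=key, how="left", indicator=True)
    fresh = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
    updated = pd.concat([existing, fresh], ignore_index=True)
    updated.to_csv(path, index=False)
    return updated
```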


Cleaning the Data

Once the data was pulled, it needed to be cleaned. Some observations were missing a suburb; these were all in Wilkinsburg, which I was not considering, so they could be dropped. I kept only the suburbs in my area that would be somewhere I would live - Edgewood, Highland Park, Point Breeze, Regent Square, Squirrel Hill North, Squirrel Hill South, Swisshelm Park, and (part of) Swissvale. Longitude could not be missing, and to carve off the part of Swissvale I wanted to exclude, I kept only observations with longitude <= -79.885.
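In pandas terms, that filter looks roughly like the sketch below; the column names are assumptions based on the fields described above.

```python
# Cleaning sketch: keep only the target suburbs, drop missing longitude, and
# cut Swissvale off at longitude <= -79.885.
import pandas as pd

KEEP_SUBURBS = [
    "Edgewood", "Highland Park", "Point Breeze", "Regent Square",
    "Squirrel Hill North", "Squirrel Hill South", "Swisshelm Park", "Swissvale",
]

def filter_neighborhoods(df: pd.DataFrame) -> pd.DataFrame:
    df = df[df["suburb"].isin(KEEP_SUBURBS)]
    return df[df["longitude"].notna() & (df["longitude"] <= -79.885)]
```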


Units sometimes differed between properties, so values were converted to standard units; for example, lot sizes reported in acres were converted to square feet. Categorical fields also needed to be turned into indicator variables; for example, indicators were created for the different flooring types listed in the flooring field pulled from Zillow.
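The sketch below shows one way this cleanup could look. The column names, the unit field, and the list of flooring categories are assumptions for illustration.

```python
# Unit and indicator cleanup sketch: convert lot sizes quoted in acres to
# square feet, then expand the free-text flooring field into 0/1 indicators.
import pandas as pd

ACRES_TO_SQFT = 43_560  # 1 acre = 43,560 square feet

def standardize_units(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    in_acres = df["lot_unit"].str.lower().eq("acres")
    df.loc[in_acres, "lot_size"] = df.loc[in_acres, "lot_size"] * ACRES_TO_SQFT
    df.loc[in_acres, "lot_unit"] = "sqft"
    for floor in ("Hardwood", "Carpet", "Tile", "Laminate"):  # illustrative categories
        df[f"flooring_{floor.lower()}"] = (
            df["flooring"].str.contains(floor, case=False, na=False).astype(int)
        )
    return df
```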


One last filter was placed on the data: the list date needed to be within 1.5 years of the sold date. A few observations had list dates more than a decade old and did not look reflective of the current housing market. If a home sat listed for more than 1.5 years before selling, it was likely a property either nobody wanted or one the seller was asking too much for.
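A short sketch of that filter, with assumed column names:

```python
# Final filter sketch: drop sales where the sold date is more than 1.5 years
# after the list date.
import pandas as pd

def filter_stale_listings(df: pd.DataFrame) -> pd.DataFrame:
    days_on_market = (
        pd.to_datetime(df["sold_date"]) - pd.to_datetime(df["list_date"])
    ).dt.days
    return df[days_on_market <= 1.5 * 365]
```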


Finally, average sale price data from Redfin was appended to the data to estimate what the sale prices of those homes would be today. The 30-year fixed mortgage rate was also pulled from FRED and added to the dataset. The rate was not used for modeling, but it could be an important variable for predicting the number of houses sold if I choose to build that view in the future. With this data all joined together, the final dataset was complete.
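Since both external series are monthly, one natural way to attach them is to merge on the month each home sold, as in the sketch below. The frame and column names here are assumptions.

```python
# External join sketch: merge Redfin's monthly average sale price and FRED's
# monthly 30-year fixed rate onto each sale by the month it closed.
import pandas as pd

def join_market_series(df: pd.DataFrame,
                       redfin: pd.DataFrame,  # columns: month, avg_sale_price
                       fred: pd.DataFrame     # columns: month, rate_30yr_fixed
                       ) -> pd.DataFrame:
    df = df.copy()
    df["sold_month"] = pd.to_datetime(df["sold_date"]).dt.to_period("M")
    redfin = redfin.assign(month=pd.to_datetime(redfin["month"]).dt.to_period("M"))
    fred = fred.assign(month=pd.to_datetime(fred["month"]).dt.to_period("M"))
    df = df.merge(redfin, left_on="sold_month", right_on="month", how="left")
    df = df.merge(fred, left_on="sold_month", right_on="month", how="left",
                  suffixes=("", "_fred"))
    return df.drop(columns=["month", "month_fred"])
```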


Building the Model

For the model, the target variable was SoldPriceAdj, the sold price adjusted for the average home price's appreciation from the time of sale to now. The predictors were the housing variables collected from Zillow.
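Based on that description, the adjustment can be expressed as scaling each sale by the ratio of today's average price to the average price in the month the home sold. The sketch below is one way to write it; the column names and the use of the Redfin series as the appreciation index are assumptions.

```python
# SoldPriceAdj sketch: scale each sale to today's market using the ratio of
# the current average price to the average price in the month of sale.
import pandas as pd

def adjust_sold_price(df: pd.DataFrame, current_avg_price: float) -> pd.Series:
    return df["sold_price"] * current_avg_price / df["avg_sale_price"]
```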


I had originally intended to build an ensemble model from what I had anticipated would be weaker models. However, after running a random forest (R^2 = 42%) and a gradient boosting machine (R^2 = 90%), I chose to use just the GBM. For the GBM, I used 10-fold cross-validation and built 250 trees. I also tested the GBM on out-of-sample data (houses sold after the period covered by the training data), and there did not appear to be any major reduction in model performance. The average residual on the out-of-sample data was close to 0, suggesting little bias.
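A minimal scikit-learn sketch of that fit is below, assuming the data lives in a pandas DataFrame. The feature list is an illustrative placeholder and the remaining hyperparameters are left at their defaults; only the 250 trees and the 10-fold cross-validation come from the description above.

```python
# GBM sketch: 250 trees, scored with 10-fold cross-validation, then checked
# against an out-of-sample holdout of later sales.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

FEATURES = ["beds", "baths", "area", "walk_score", "transit_score",
            "bike_score", "flooring_hardwood"]  # illustrative subset

def fit_gbm(train: pd.DataFrame, holdout: pd.DataFrame) -> GradientBoostingRegressor:
    gbm = GradientBoostingRegressor(n_estimators=250, random_state=0)
    cv_r2 = cross_val_score(gbm, train[FEATURES], train["SoldPriceAdj"],
                            cv=10, scoring="r2")
    print(f"10-fold CV R^2: {cv_r2.mean():.2f}")
    gbm.fit(train[FEATURES], train["SoldPriceAdj"])
    resid = holdout["SoldPriceAdj"] - gbm.predict(holdout[FEATURES])
    print(f"Out-of-sample mean residual: {resid.mean():,.0f}")
    return gbm
```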


Because a GBM is more of a black box than an explanatory model, I built test cases so that I could view how changes in the model's most important variables change the predicted home price; for example, varying living area from 1,000 sqft to 2,500 sqft, or comparing home value across 1, 2, and 3 bathrooms. I then predicted the sale price of these test cases using the GBM, constructed averages for the scenarios, and used the results to create the Tableau views in the report.
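One simple way to build such scenarios is to hold a set of baseline homes fixed, sweep a single feature across a range, and average the model's predictions at each value, as in the sketch below. The function name and the example range are illustrative, not the project's actual code.

```python
# Scenario sketch: sweep one feature over a range of values while holding the
# other features at observed baseline rows, then average the GBM predictions.
import pandas as pd

def sweep_feature(gbm, baseline: pd.DataFrame, feature: str, values) -> pd.DataFrame:
    # baseline: rows already restricted to the model's feature columns
    out = []
    for v in values:
        cases = baseline.copy()
        cases[feature] = v
        out.append({feature: v, "pred_price": gbm.predict(cases).mean()})
    return pd.DataFrame(out)

# e.g. living area from 1,000 to 2,500 sqft in 250 sqft steps:
# area_curve = sweep_feature(gbm, X_train, "area", range(1000, 2501, 250))
```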