Zillow: Machine learning and data disrupt real estate

Learn how big data and the Zillow Zestimate changed and disrupted real estate. It's an important case study on the power of machine learning models and digital innovation.
Written by Michael Krigsman, Contributor

Anyone buying or selling a house knows about Zillow. In 2006, the company introduced the Zillow Estimate, or Zestimate for short, which uses a variety of data sources and models to create an approximate value for residential properties.

The impact of Zillow's Zestimate on the real estate industry has been considerable, to say the least.

From the home buyer perspective, Zillow's Zestimate enables significant transparency around prices and information that historically was available only to brokers. The company has genuinely democratized real estate information and adds tremendous value to consumers.

For real estate brokers, on the other hand, Zillow is fraught with more difficulty. I asked a top real estate broker working in Seattle, Zillow's home turf, for his view of the company. Edward Krigsman sells multimillion-dollar homes in the city and explains some of the challenges:

Automated valuation methods have been around for decades, but Zillow packaged those techniques for retail on a large scale. That was their core innovation. However, Zillow's data often is not accurate and getting them to fix problems is difficult.

Zillow creates pricing expectations among consumers and has become a third party involved in the pre-sales aspects of residential real estate. Accurate or not, Zillow affects the public perception of home value.

Zillow's market impact on the real estate industry is large, and the company's data is an important influence on many home transactions.

Zillow offers a textbook example of how data can change established industries, relationships, and economics. The parent company, Zillow Group, runs several real estate marketplaces that together generate about $1 billion in revenue and reportedly hold a 75 percent share of the online real estate audience.

As part of the CXOTALK series of conversations with disruptive innovators, I invited Zillow's Chief Analytics Officer (who is also their Chief Economist), Stan Humphries, to take part in episode 234.

The conversation offers a fascinating look at how Zillow thinks about data, models, and its role in the real estate ecosystem.

You can watch the full video and read a complete transcript on the CXOTALK site. In the meantime, here is an edited and abridged segment from our lengthy conversation.

Why did you start Zillow?

There's always been a lot of data floating around real estate. But a lot of that data was largely [hidden], so it had unrealized potential. As a data person, you love to find that space.

Travel, which a lot of us were in before, was a similar space, dripping with data, but people had not done much with it. It meant that a day wouldn't go by where you wouldn't come up with "Holy crap! Let's do this with the data!"

In real estate, multiple listing services had arisen, shared among different agents and brokers on the real estate side, covering the homes that were for sale.

However, the public record system was completely independent of that, and there were two public records systems: one for deeds and liens on real property, and then another for the tax rolls.

All of that was disparate information. We tried to solve for the fact that all of this was offline.

We had the sense that it was, from a consumer's perspective, like the Wizard of Oz, where it was all behind this curtain. You weren't allowed behind the curtain and really [thought], "Well, I'd really like to see all the sales myself and figure out what's going on." You'd like the website to show you both the for-sale listings and the for-rent listings.

But of course, the people selling you the homes didn't want you to see the rentals alongside them because maybe you might rent a home rather than buy. And we're like, "We should put everything together, everything in line."

We had faith that type of transparency was going to benefit the consumer.

What about real estate agents?

You still find that agency representation is very important because it's a very expensive transaction. For most Americans, it is the most expensive transaction they will ever make, and the most expensive financial asset they will ever own. So, there continues to be a reasonable reliance on an agent to help hold the consumer's hand as they either buy or sell real estate.

But what has changed is that now consumers have access to the same information that the representation has, either on the buy or sell side. That has enriched the dialogue and facilitated the agents and brokers who are helping the people. Now a consumer comes to the agent with a lot more awareness and knowledge, as a smarter consumer. They work with the agent as a partner where they've got a lot of data and the agent has a lot of insight and experience. Together, we think they make better decisions than they did before.

How has the Zestimate changed since you started?

When we first rolled out in 2006, the Zestimate was a valuation that we placed on every single home that we had in our database at that time, which was 43 million homes. To create that valuation on 43 million homes, it ran about once a month, and we pushed a couple of terabytes of data through about 34 thousand statistical models, which was, compared to what had been done previously, an enormously more computationally sophisticated process.

I should just give you some context on what our accuracy was back then. Back in 2006 when we launched, we were at about 14 percent median absolute percent error on 43 million homes.
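The metric named here, median absolute percent error, can be sketched in a few lines. This is a minimal illustration only: the function name is hypothetical, and the sale prices and estimates are made-up numbers, not Zillow data or Zillow code.

```python
from statistics import median

def median_absolute_percent_error(estimates, sale_prices):
    """Median of |estimate - sale price| / sale price, expressed in percent."""
    errors = [
        abs(est - price) / price * 100.0
        for est, price in zip(estimates, sale_prices)
    ]
    return median(errors)

# Made-up example data: five homes, estimate vs. actual sale price.
sale_prices = [200_000, 350_000, 500_000, 275_000, 410_000]
estimates = [210_000, 332_500, 500_000, 302_500, 389_500]

mdape = median_absolute_percent_error(estimates, sale_prices)
print(f"Median absolute percent error: {mdape:.1f} percent")
```

Because the metric is a median, one badly missed home (the 10 percent miss in this toy data) does not move the result the way it would move an average.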

Since then, we've gone from 43 million homes to 110 million homes, and we put valuations on all 110 million. And we've driven our error down to about 5 percent today which, from a machine learning perspective, is quite impressive.

Those 43 million homes that we started with in 2006 tended to be in the largest metropolitan areas where there was much transactional velocity. There were a lot of sales and price signals with which to train the models. As we went from 43 million to 110, you're now getting out into places like Idaho and Arkansas where there are just fewer sales to look at.

It would have been impressive if we had kept our error rate at 14 percent while getting out to places that are harder to estimate. But not only did we more than double our coverage from 43 to 110 million homes, we also cut our error rate by almost two-thirds, from 14 percent down to 5 percent.

The hidden story behind achieving that is collecting enormously more data and getting a lot more sophisticated algorithmically, which requires us to use more computers.

Just to give some context, when we launched, we built 34 thousand statistical models every month. Today, we update the Zestimate every single night and generate somewhere between 7 and 11 million statistical models every single night. Then, when we're done with that process, we throw them away and repeat the next night again. So, it's a big data problem.

Tell us about your models.

We never go above a county level for the modeling system. For large counties with many transactions, we break that down into smaller regions within the county, where the algorithms try to find homogeneous sets of homes at the sub-county level to train a modeling framework. That modeling framework itself contains an enormous number of models.

The framework incorporates a bunch of different ways to think about values of homes combined with statistical classifiers. So maybe it's a decision tree, thinking about it from what you may call a "hedonic" or housing characteristics approach, or maybe it's a support vector machine looking at prior sale prices.

The combination of the valuation approach and the classifier together creates a model, and there are a bunch of these models generated at that sub-county geography. There are also a bunch of models that become meta-models, whose job is to put together these sub-models into a final consensus opinion, which is the Zestimate.
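The idea of a meta-model that blends sub-models into one consensus value can be sketched as follows. Everything in this example is an assumption made for illustration: the two toy sub-models, the price-per-square-foot figure, the growth index, and the inverse-error weighting are hypothetical and are not described in the interview.

```python
def hedonic_model(home):
    """Hypothetical characteristics-based estimate: a flat price per square foot."""
    return home["sqft"] * 300.0

def prior_sale_model(home):
    """Hypothetical estimate: last sale price grown by an assumed market index."""
    return home["last_sale_price"] * home["index_growth"]

def meta_model(home, sub_models, past_errors):
    """Blend sub-model estimates, weighting each by the inverse of its past error."""
    weights = [1.0 / err for err in past_errors]
    estimates = [model(home) for model in sub_models]
    weighted_sum = sum(w * e for w, e in zip(weights, estimates))
    return weighted_sum / sum(weights)

# Made-up home record and made-up historical error rates for each sub-model.
home = {"sqft": 2000, "last_sale_price": 500_000, "index_growth": 1.3}
consensus = meta_model(
    home,
    [hedonic_model, prior_sale_model],
    past_errors=[0.10, 0.05],
)
print(f"Consensus estimate: {consensus:,.0f}")
```

The sub-model with the lower past error gets more weight, so the consensus lands between the two estimates but closer to the historically more reliable one.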

How do you ensure your results are unbiased to the extent possible?

We believe advertising dollars follow consumers. We want to help consumers the best we can.

We have constructed, in economic language, a two-sided marketplace where we've got consumers coming in who want to access inventory and get in touch with professionals. On the other side of that marketplace, we've got professionals -- be it real estate brokers or agents, mortgage lenders, or home improvers -- who want to help those consumers do things. We're trying to provide a marketplace where consumers can find inventory and professionals to help them get things done.

So, from the perspective of a market-maker versus a market-participant, you want to be completely neutral and unbiased. All you're trying to do is get a consumer the right professional and vice-versa, and that's very important to us.

That means, when it comes to machine learning applications, for example, the valuations that we do, our intent is to come up with the best estimate of what a home is going to sell for. Again, from an economic perspective, that's different from the asking price or the offer price. In a commodities context, you call that a bid-ask spread: the gap between what someone is going to ask and what someone is going to bid.

In the real-estate context, we call that the offer price and the asking price. And so, what someone's going to offer to sell you his or her house for is different from a buyer saying, "Hey, would you take this for it?" There's always a gap between that.

What we're trying to do with the Zestimate is to inform pricing decisions so the bid-ask spread is smaller, [to prevent] buyers from getting taken advantage of when the home was worth a lot less, and [to prevent] sellers from selling a house for a lot less than they could have got because they just don't know.

We think that having great, competent representation on both sides is one way to mitigate that, which we think is fantastic. Having more information about pricing decisions, to help you understand what the offer-ask spread looks like, is very important as well.

How accurate is the Zestimate?

Our models are trained such that half of the error will be positive and half will be negative, meaning that on any given day, half of [all] homes are going to transact above the Zestimate value and half are going to transact below. Since launching the Zestimate, we have wanted this to be a starting point for a conversation about home values. It's not an ending point.

It's meant to be a starting point for a conversation about value. That conversation, ultimately, needs to involve other means of value, including real estate professionals like an agent, broker, or appraiser: people who have expert insight into local areas, have seen the inside of a home, and can compare it to other comparable homes.

I think that's an influential data point and hopefully, it's useful to people. Another way to think about that stat I just gave you is that on any given day, half of the sellers sell their homes for less than the Zestimate, and half of the buyers buy a home for more than the Zestimate. So, clearly, they're looking at something other than the Zestimate, although hopefully, it's been helpful to them at some point in that process.
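The "half above, half below" property described in this answer can be checked on a set of sales with a short sketch. The function names are hypothetical and the prices are made-up numbers chosen so the split is even; real data would only approximate it.

```python
from statistics import median

def signed_percent_errors(estimates, sale_prices):
    """Signed error in percent: positive means the estimate was above the sale price."""
    return [
        (est - price) / price * 100.0
        for est, price in zip(estimates, sale_prices)
    ]

def share_sold_above_estimate(estimates, sale_prices):
    """Fraction of homes whose sale price came in above the estimate."""
    above = sum(1 for est, price in zip(estimates, sale_prices) if price > est)
    return above / len(estimates)

# Made-up example data: six homes, three sold above the estimate, three below.
estimates = [300_000, 450_000, 520_000, 610_000, 275_000, 390_000]
sale_prices = [310_000, 440_000, 530_000, 600_000, 280_000, 385_000]

share_above = share_sold_above_estimate(estimates, sale_prices)
median_signed = median(signed_percent_errors(estimates, sale_prices))
print(f"Share sold above estimate: {share_above:.0%}")
print(f"Median signed error: {median_signed:.2f} percent")
```

A median signed error near zero says the estimate sits in the middle of the outcomes. It says nothing about how far individual sales land from the estimate, which is what the absolute error figure measures.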

How have your techniques become more sophisticated over time?

I've been involved in machine learning for a while. I started in academia as a researcher in a university setting. Then at Expedia, I was very heavily involved in machine learning, and then here.

I was going to say the biggest change has really been in the tech stack over that period, but I shouldn't minimize the change in the actual algorithms themselves over those years. Algorithmically, you see an evolution. At Expedia, for personalization, we worked on relatively sophisticated, but more statistical and parametric, models for making recommendations: things like unconditional probability and item-to-item correlations. Now, most of your recommender systems use things like collaborative filtering, with algorithms that are optimized for high-volume data and streaming data.
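An item-to-item approach of the kind mentioned here can be sketched with a small similarity calculation. This is an assumption-laden illustration: it uses cosine similarity as a stand-in for correlation, and the hotel names and view data are invented, not taken from Expedia.

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def most_similar(item, views):
    """Return the other item whose view pattern is closest to the given item's."""
    scores = {
        other: cosine_similarity(views[item], vector)
        for other, vector in views.items()
        if other != item
    }
    return max(scores, key=scores.get)

# Made-up data: each list marks which of five users viewed the item (1) or not (0).
views = {
    "hotel_a": [1, 1, 0, 1, 0],
    "hotel_b": [1, 1, 0, 0, 0],
    "hotel_c": [0, 0, 1, 0, 1],
}

recommendation = most_similar("hotel_a", views)
print(f"People who viewed hotel_a also viewed: {recommendation}")
```

Items viewed by the same users score high, so the recommendation comes from co-occurrence alone, with no knowledge of what the items are.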

In a predictive context, we've moved from things like decision trees and support vector machines to now a forest of trees; all those simpler trees with much larger numbers of them... And then, more exotic decision trees that have in their leaf nodes more direction components which are very helpful in some contexts.
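The move from a single tree to a forest of many simpler trees can be sketched as follows. This is a deliberately reduced illustration: each "tree" is a one-split stump on a single feature, trained on a bootstrap sample, and the forest averages them. The function names, the data, and the choice of stumps are all assumptions for the example, not a description of Zillow's models.

```python
import random
from statistics import mean

def fit_stump(xs, ys):
    """Fit a one-split regression tree; return (threshold, left_mean, right_mean)."""
    best = None
    for t in sorted(set(xs))[1:]:  # every split leaves both sides non-empty
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        lm, rm = mean(left), mean(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:  # every x in the sample was identical: predict the mean
        m = mean(ys)
        return (xs[0], m, m)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x < threshold else right_mean

def fit_forest(xs, ys, n_trees=50, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest

def predict_forest(forest, x):
    """Average the predictions of all stumps in the forest."""
    return mean(predict_stump(stump, x) for stump in forest)

# Made-up data: square footage and sale price for eight homes.
sqft = [800, 1000, 1200, 1500, 1800, 2200, 2600, 3000]
price = [160_000, 195_000, 250_000, 310_000, 350_000, 450_000, 510_000, 600_000]

forest = fit_forest(sqft, price)
small = predict_forest(forest, 900)
large = predict_forest(forest, 3000)
print(f"Predicted price at 900 sqft: {small:,.0f}")
print(f"Predicted price at 3000 sqft: {large:,.0f}")
```

Each stump alone can only output two values, but averaging many stumps trained on different samples gives a smoother prediction than any single one.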

As a data scientist now, you can start working on a problem on AWS, in the cloud, and have an assortment of models that you can deploy much more quickly and easily than you could twenty years ago, when you had to code a bunch of stuff yourself: start out in MATLAB, port it to C, and do it all by hand.

CXOTALK brings you the world's most innovative business leaders, authors, and analysts for in-depth discussion unavailable anywhere else.
