Moneyball for movies: Data science and AI in Hollywood

The Chief Analytics Officer of movie and television studio Legendary Entertainment shares the predictive analytics, AI, and tools that bring box office success to films such as Jurassic World and Straight Outta Compton. This in-depth, exclusive discussion is Episode 276 of the CXOTalk series of conversations with innovators.
Written by Michael Krigsman, Contributor

Data science and related disciplines, such as artificial intelligence and machine learning, have become table stakes in almost all areas of software. From marketing to customer service and beyond, data and analytics are changing existing enterprise processes and enabling new ones.

There are many challenges to using data science effectively, including which problems to consider and collecting the right data at scale. For business people, these challenges are magnified because they intersect multiple domains such as organizational decision-making, innovation and business model reinvention, technology capability, and even privacy and ethics.

This complexity drives most organizations to buy off-the-shelf software products that automate specific processes and include AI baked-in as part of the offering. For instance, most marketing software today includes AI to analyze customer data. It's much easier to buy data science or AI inside commercial products than to roll your own.

Although standard products are best for most companies, there are exceptions. For example, if your core business model relies on unique sets of data that need specialized analytic techniques, then building custom tools makes sense. But, these situations are exceptions to the general rule that buying technology is usually better than building it yourself.

In 2013, Legendary Entertainment -- a top studio that creates movies like Superman Returns, Jurassic World, Straight Outta Compton, and Warcraft -- faced this make-vs-buy decision.

The company's then-CEO, Thomas Tull, wanted to apply data and analytics to the domain of Hollywood entertainment, but there were no off-the-shelf tools that would work. Tull's inspiration was professional sports and "Moneyball," as presented in the popular film of the same name. Tull convinced Matthew Marolda, who had built and sold a sports analytics company, to join Legendary as Chief Analytics Officer and create a team to solve the "Moneyball for Hollywood" problem.

To learn more about this intersection of data science and the movie industry, I invited Matt Marolda to take part in Episode 276 of the CXOTalk series of conversations with the world's leading innovators.

Matthew is a self-described data nerd and a super interesting person. On this episode, he shares details of the approach, goals, and components his team created. It's fascinating to learn, for example, that they built their own storage solution because no commercial systems at the time would work at the scale and performance required.

Personally, I enjoyed hearing how Matt and team use social media to target micro-segments, almost to the individual level. It's a glimpse inside the near-term future of digital marketing.

Watch the video embedded above to see our entire CXOTalk conversation and read the complete transcript. Below are edited excerpts from the 45-minute discussion.

Matt, tell us about Legendary Entertainment?

Legendary is a producer of both movies and television shows. The types of movies we produce are large-scale, things like Godzilla, Kong: Skull Island, The Dark Knight series, movies of that scale, which are intended to be large, what people often refer to as tentpole movies that are big, global events all around the world.

What was the original plan when Legendary's CEO asked you to join?

In hindsight, it's funny because we didn't know what we were going to do. [Laughter] We knew, at a high level, what it was. We knew we needed to use data and analytics to inform the process. The first thing I said to our chief creative officer when I joined Legendary, and again these are two roles that could effectively be oil and water, creative and analytics. Those could be things that are opposing forces. What I said to him is the attitude that we've had from the beginning from the creative side, which was that analytics, especially in sports, but the same with content, never produced a player, but all it tries to do is put the player in the best position to succeed. That was the attitude from the beginning.

The marketing side was a little bit different. The marketing side was, "How could we use data and analytics to gain a competitive advantage?" On that front, what we realized very quickly was that there was a real opportunity in how we addressed our audiences, meaning the traditional approach, and this is still often the dominant approach for these kinds of movies, is what we always call the spray and pray. Meaning, quite literally, spray the population with TV ads and pray they go to the box office. That works in a certain world, but maybe not even the world of today, but at some point, it worked.

What we realized back four, five years ago was that we needed to be much more precise. It's a game of impressions, meaning how do we deliver the trailer, the TV spot, or the poster to the right people? Which of those things do we deliver, and in what format? Doing that in a very precise and individual way. What we've built are tools that enable us to, at individual levels, predict people's propensity, as we call it, their likelihood to take the action we want, which may be a trailer view. It may be buying a ticket.

That meant we had to use some very sophisticated tools and techniques that we had to build ourselves. We built up a suite of assets and capabilities that are all rooted back in AI. This was all when AI was not cool. [Laughter] This is a time at which AI was Skynet or something. It wasn't embedded as broadly as it may be becoming now. We had to go down that path and to use machine learning, neural networks, computer vision, because of the scale in which we needed to operate. It was so massive that, without those kinds of tools, you're almost back to that spray and pray mode where you are quite literally taking broad guesses at large groups of people.

How do you use all that data?

The first step in that process for us is to try to understand people. The best way for us to understand people is with data.

The first day I walked into Legendary, my first question was, "Where's all the data?" Again, coming from a world that wasn't connected to Hollywood, I didn't understand how the dynamics worked, which was, we produce a movie, we deliver it to a distributor, who then hands it to exhibitors. Then the exhibitors or, ultimately, maybe an Apple, Amazon, or whomever, all the transactions, all the customer interactions happen at that level, which is too removed from us.

When the answer came back to me, "Oh, what data do you mean?" I said, "Anything on people," they came back with an Excel spreadsheet of about 50,000 email addresses. I realized at that point that there was a different challenge we had to face, which was, how do we get data on people?

I'll put that to the side for a second, but the principle, though, that we were taking wasn't data, necessarily. It was analytics.

Our bet was not necessarily on getting the best and most precise data on people. It was, how do we build the analytic tools to take whatever data is available to us and use that to do our targeting. That is a recognition of a lot of factors that I think were true then, but even more true now: privacy issues, social platforms and how they share data and what levels of granularity they'll provide you, regulatory issues, all sorts of things.

The data will shift. What's available to you Monday might not be available to you the next, or new things will pop up that weren't there before. We knew we had to have data. That was table stakes, and so we invested a lot of money, millions of dollars, into data to acquire data on people, on content, unstructured data from social networks, everywhere we could find it.

The real bet for us came at the next level, which was, what can we build on top of that data? With that, what we drove towards was these AI solutions. Meaning, could we take a billion or more email addresses and attach hundreds, if not a thousand or more, attributes to those email addresses, ranging from attributes we sourced from some partnership to attributes we constructed from unstructured data, meaning text, images, and things of that nature? We produced a very robust picture of people.

Then once we had that robust picture, we needed to do something with it. It's inert if we don't act upon it. The next option to take is to use, effectively, that big table of data on people, which is not what it literally is, but that's a good visual of it, and create audiences from it and to make individual predictions. The first step in our process is to use our models. There are many different inputs into them, but to use them to home in on who we think the most likely audience is.

It's not binary. In fact, we have three major categories that we drill in on specifics. The three major categories for us are people we consider to be given, meaning they're going to watch the movie no matter what. They're wearing the Godzilla T-shirt. They've watched the Kong movie, from 30 years ago or even ten years ago, dozens of times, that kind of person. There's a small number of them, but they're there.

There's a much larger number of people who will never watch, who are never going to consume this content. That's fine. We don't want to spend impressions on them.

Who we care about are the people in the middle of those two groups. We call them the persuadables. Persuading them by giving them the right piece of content or the right creative at the right moment through the right channel is key. Those are trite things now. People talk about that a lot, but we try to be very precise about it.

The first step we'll do is take that persuadable audience and define them exclusive of the givens and the nevers. Then, within the persuadable audience, we will effectively score every single person. In the U.S., for a movie of a scale we typically would work on, it could be 40 million or 50 million people. They'll get a score from zero to 100. 100 being very likely, zero being very unlikely.
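The segmentation and scoring Marolda describes can be illustrated with a minimal sketch. It assumes each person already has a model-predicted propensity between 0 and 1; the cutoffs `NEVER_BELOW` and `GIVEN_ABOVE`, the `Person` type, and the linear rescaling to a 0-to-100 score are hypothetical choices made for illustration, not details of Legendary's system, which the interview does not disclose.

```python
from dataclasses import dataclass

# Hypothetical cutoffs: the interview does not say how the three groups are separated.
NEVER_BELOW = 0.05
GIVEN_ABOVE = 0.95


@dataclass
class Person:
    person_id: str
    propensity: float  # assumed model output in [0, 1]


def segment(person: Person) -> str:
    """Assign a person to one of the three categories named in the interview."""
    if person.propensity >= GIVEN_ABOVE:
        return "given"
    if person.propensity <= NEVER_BELOW:
        return "never"
    return "persuadable"


def score_persuadables(people: list[Person]) -> dict[str, int]:
    """Score only the persuadables from 0 (very unlikely) to 100 (very likely)."""
    span = GIVEN_ABOVE - NEVER_BELOW
    return {
        p.person_id: round(100 * (p.propensity - NEVER_BELOW) / span)
        for p in people
        if segment(p) == "persuadable"
    }


audience = [Person("a", 0.99), Person("b", 0.01), Person("c", 0.5)]
scores = score_persuadables(audience)
```

In this sketch the givens and the nevers are dropped before scoring, which mirrors the idea of not spending impressions on either group.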

Once we have that, where we can, we deploy media to them specifically and individually. A lot of people use the term onboarding. We might onboard them into, say, programmatic buy on websites, publishers that you would see on the sidebar or across the top. That includes social media. That includes search. That includes video like YouTube. Wherever we can find these people, we'll reach them, and so we'll launch these at the lowest granularity that that platform will accept. Sometimes it's small audiences. Sometimes it's individuals, but wherever we can.

Then, once that's launched, the next thing we'll do is take very small pieces of those audiences, so not only cutting into small micro-segments but now I'm taking even small subsegments of them to test. We call it calibration. We'll launch many, many combinations, hundreds or thousands of combinations of subsegments and creative. That will give us an indication as to which of those segments will respond better to which pieces of creative.
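The calibration step can be sketched in the same spirit. The example below assumes test results arrive as tuples of subsegment, creative, impressions, and responses, and it picks the creative with the highest observed response rate for each subsegment. The function name, the tuple layout, and the use of a raw response rate are all assumptions made for illustration; the interview does not describe how Legendary actually compares combinations.

```python
def best_creative_per_segment(results):
    """Pick the best-responding creative for each subsegment.

    results: iterable of (segment, creative, impressions, responses) tuples,
    a hypothetical layout for the output of a small calibration test.
    """
    best = {}  # segment -> (creative, response_rate)
    for segment, creative, impressions, responses in results:
        if impressions == 0:
            continue  # no evidence for this combination, skip it
        rate = responses / impressions
        if segment not in best or rate > best[segment][1]:
            best[segment] = (creative, rate)
    return {seg: creative for seg, (creative, _) in best.items()}


calibration_results = [
    ("s1", "trailer", 1000, 30),
    ("s1", "poster", 1000, 10),
    ("s2", "trailer", 500, 5),
    ("s2", "poster", 500, 20),
]
winners = best_creative_per_segment(calibration_results)
```

A real test with hundreds or thousands of combinations would also need to account for small sample sizes before scaling spend; this sketch ignores that.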

Once we've done that, then we start scaling. Then we start applying more spend, and that will lead us to a more global kind of scale. At that level, once we've done that, where we can, and China happens to be the territory we can do this the best, we will try to measure conversion. Meaning, we will try to see who is buying tickets. Those ticket purchases will then feed into our models and enable us to be more honed.

What's interesting about our approach is that a lot of what we do is the opposite of what others do. A lot of folks will start narrow and then maybe even get panicked and go broad. We do the opposite. The closer we get to release, the more honed we're trying to get and the more precise we're trying to get.

Which platforms do you use?

In no particular order: social media, so Facebook, Twitter, Instagram, Snap, all those platforms. It would include the Google platforms, which would be search or YouTube. We'll also do things programmatically, so we'll be able to target people across many different websites. Those are the major categories. We do try to do analytics to help us guide what we would consider being nonaddressable media, like television buys and outdoor ads. But, it's using the same concept of audience. We're just now deploying it more coarsely.

In certain cases, we can provide individuals and track them. That's rare, but we can do that. In other cases, it's these sort of subsegments that may be hundreds of people or a thousand or two thousand, something like that. In other cases, for example, a television buy, you're buying against the people who watch that show. We're predicting who we think is going to watch the show, but we can't precisely say, "Oh, these are the 700,000 people we want to get through this show." We are taking a bet that they'll be watching, but we don't know precisely who they are.

Are you looking at real-time or historical data?

It's a great question. The pace of these kinds of campaigns is very fast. What will happen is, for any given movie, the vast majority -- when I say vast, 80 percent, 90 percent -- will be spent over the course of about four or five weeks. This is where I think people who are just awake and alive will see these massive sorts of media dumps out into the world. We knew that was the phenomenon, and we knew that we had to be able to react very quickly within those timescales.

If this were an always-on campaign that ran over the course of years, it'd be much different. We do try to operate very precisely within that very short window of four to six weeks. I would say our cadence for changes and adjustments is typically within a day or so. It's not real time in the sense of every minute or every hour, but once a day we're recalibrating and adjusting.

Tell us about your team?

Going back to the beginning, I was a guy in a room. [Laughter] I had a checkbook, effectively, and we could have done a number of things. We could have built a mosaic of solutions.

What we found was that that didn't exist, and so we built out a team. Our team is about 70 people. Of the 70, about half are some form of engineer, whether they're data scientists or computer scientists, and we have people who have all kinds of disciplines.

We accumulated these people, and we built these tools because they didn't exist, and we couldn't find that solution. It's that singular solution that goes from front to back. There were a lot of good point solutions along the way, but they didn't have the full integration.

The loop you described is a very logical loop, and that's exactly what we were trying to build toward, but we had a hard time finding the solution that would meet both the speed and the pace at which we were spending, along with the sophistication with which we wanted to spend. To go back to your loop, this data platform that we've created will suck in data from whatever sources we've started with, the initializing sort of data. Then it will launch the media out into the different platforms.

To your point, as the campaigns run, new data is being constantly created. That comes back into the system, enables us to calibrate and change dynamically, and then re-spend. It's a virtuous cycle that continues.

We need the right people, for sure. I've said humility, that's an initial starting point for us. We look for people like that.

Of course, we have other things we're interested in, and so the specific skill sets we have accumulated. On a data science side, it's multidisciplinary. The person who runs our data science team has a Ph.D. in astrophysics. That's a discipline you wouldn't expect at a Hollywood studio.

Just like that discipline, some people have different backgrounds in social sciences like human decision sciences, or they are statisticians or econometricians. That's a whole category of people we have as data scientist folks.

On the software development side, we knew -- and we talked about it briefly earlier -- that we were going to have these very large data sets. We needed people who had the skills to be able to build these repositories to query and analyze data at remarkable speeds, to be able even to build the infrastructure and the thousands of servers we have running at any given time to support all that, to build the user interfaces that make it all work. Those were skillsets we were very specific and targeted on.

We also needed the other half of our team of people who are experts at applying these kinds of outputs into a campaign. That last group I just mentioned was by no means the last. In fact, we considered all three simultaneously because we knew that if the data science team and the development team built all the amazing tools they build, but they were just shiny toys on a shelf, it was all for naught. We needed to make sure we had a group of people who knew how to translate those tools into action. That creates the whole iterative loop we use to develop further.

What's coming next?

For us, I think the thing that we think a lot about is two things. One is the increasing addressability of media channels, so the ability to get more precise. That feels around the corner. Whether that means addressable TV in that you can send an ad over whatever form of viewing you were doing, that's one thing for sure.

The other thing for us, which is always that holy grail -- in a lot of industries this is not the case, but for things like ours -- is that conversion measurement is hard. Meaning, can we tell if someone took the action we wanted? As data becomes stronger and better there, that just makes everything better. Those are the two things that feel, in these sort of 18 to 24 months, or maybe even a little longer than that, but the 1 to 3- or 1 to 5-year range.

CXOTalk brings together the world's top business and government leaders for in-depth conversations on digital disruption, AI, innovation, and related topics. Be sure to watch our many episodes! Thanks to Laura Hoang from CredPR for the introduction to Matt Marolda.
