Data science secrets in finance and media

Two top data scientists share their goals and challenges in analyzing huge datasets to make sense of complex business problems. Business people should read this carefully to gain a better understanding of data science and how it works.
Written by Michael Krigsman, Contributor

Video: Data science and machine learning in marketing, media, and finance

Data is the foundation of AI and thus crucial to many of the most important techniques and trends in computing today. From marketing to accounting to photography and many other domains, data is the lifeblood of advanced software. At the same time, data science remains a black box in many companies.

For this reason, I invited two world-class data experts to take part in episode 294 of the CXOTalk series of conversations with the world's top innovators.

Matt Marolda is the chief analytics officer of Legendary Entertainment, a major Hollywood movie and game studio. Matt comes from the world of professional sports and Moneyball and now applies those data techniques to media and entertainment, with large-scale movies and television shows.

Anthony Scriffignano is the chief data scientist at the major financial industry data company, Dun & Bradstreet. He handles innovation around advanced data science topics at D&B and works with regulators around the world on these issues. He started his career doing physics for cranes, construction cranes, offshore oil rigs, and nuclear power plants.

This CXOTalk episode offers an unusual glimpse deep inside data science from two highly articulate, expert practitioners.

Watch the entire conversation in the video embedded above, or read the complete transcript. You can also review the edited summary comments below.

Matt, what kinds of problems do you work on in media and entertainment?

Matthew Marolda: We live in this unusual place where we have these very large, binary outcomes, meaning we have a movie that we're going to release, say Godzilla or Kong, movies of that kind of scale. There's only really one world we can live in, which is the world where that movie is released, which means we can't run tests. We can't do a lot of things that a lot of people in data science would like to be able to do where you have controls.

We can do that within the campaign and within very small windows, but it's very hard, over long periods of time, to iterate and adjust. We're in this situation where we have to thread the needle and learn as much as we can as quickly as we can in ambiguous environments, where the data we have doesn't correlate perfectly with the outcome. We don't have these direct correlations. That forces us to look at all different kinds of data and pull it from lots of different places.

We're very audience driven. Meaning, we need to understand audiences and people at a very specific level.

That starts all the way at the beginning. Is there an audience for this movie or TV show? Does that audience have enough scale to support the budget we might have for it? Those [are] the kinds of questions.

We then want to understand what the audience likes and how they might respond to different elements or aspects of the movie.

Then, ultimately, when you get close to marketing, this is where it kind of escalates. We want to understand: how do we reach that audience? How do we persuade them? The creative materials we could show them, meaning the trailers or the ads or the TV spots, how are they going to affect their desire to watch the movie?

We're just trying to dial it up. We're just trying to shift the odds to make it more likely, although we can't guarantee an outcome, we're working on that. It's all very much at the individual level.

Anthony, what kinds of business problems do you look at with data?

Anthony Scriffignano: The types of problems that I'm working on are very similar, believe it or not, to the types of problems that Matt just described, but in a very different way. If you think about our customers, they're trying to solve a problem that's somewhere in the category of either total risk or total opportunity. What's the white space? What could I possibly do if I penetrated this market? If I went into this country, can you help me find more companies that look like my best customers or don't look like my best customers?

Then, on the risk side, are they going to pay me? Are they fraudulent? Are they going to go out of business? Those are the problem spaces.

But, I have the same edge of the possible that Matt just described. The unstructured data, the data we've never seen before. Everyone is really good at what's called supervised learning right now, looking at structured, longitudinal data that's been around for a long time and building, basically, regressive relationships and then saying, "Here's what I think is going to happen," assuming the future looks something like this past set of data that you've trained on.

The problem is, the future doesn't look like that set of data. The future is ambiguous. The data in the future has never been seen before. Now, recently, some of it you can't use because of those different regulations, so you have to unlearn things.

The problems of understanding things we've never looked at before in ways that are changing while we're looking at them are the same. This tale of two cities that we're telling, it's the same set of problems. It's just a different use case at the end.

[Laughter] There is something that we work on that I call a Black Cat Problem where you're looking for something that may not be there in a place that's inherently hard to look. In our case, think about fraud, or think about maybe some other type of bad behavior, malfeasance. If you try to model your way out of finding things like that by looking at all the previous bad stuff, the best bad guys, when they know they're being watched, they change their behavior, so you'll model how the best ones are no longer behaving.

How do you take the rifle shot of correct focus when looking at large datasets?

Matthew Marolda: When we think about rifle shots, we're trying to use these collections of data. Again, we're not generating first-party data; we're absorbing it from many other places, whether it's from activities we run in a market where we're actually spending media and buying advertising, or whether it's taking data from publicly available sources like a Twitter, even, or Reddit.

What we're trying to do is sift through that and use tools that'll help us to highlight the insight. That's almost the language we're using to look for these insights, these things that'll tell us something. For example, men of a certain set of interests, shared interests, respond to a certain piece of creative, as we call it, so maybe a trailer, a 20-second TV spot, or whatever it is, in a certain way. That tells us something.

The rifle shot would then become how do we then make more creative like that? How do we find more people like that and target it at them?

The fail-fast part is that we want to learn as quickly as possible because, in our campaigns, we're spending an enormous amount of money over very short periods of time, five to six weeks. We need to understand very quickly: did the insight we have lead to the outcome we expected? That's what we're trying to do. Once we've taken that shot, so to speak, we'll quickly understand whether it worked or not.

We also try to contain it in a very small area. For example, we might find people who fit the phenotype we're looking for, but a sample of them. A large enough one to understand that our approach is working, but not so large that we affect the campaign. Once we see that, then we accelerate. That's at least an example of how we do that.
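
The "test on a contained sample, then accelerate" idea can be sketched as a simple comparison of response rates. This is an illustrative assumption, not Legendary's actual method: the function names, the sample sizes, and the choice of a one-sided two-proportion z-test are all hypothetical.

```python
from math import erf, sqrt


def two_proportion_z(success_a, n_a, success_b, n_b):
    """z statistic for the difference between two response rates (a minus b)."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0  # no variation at all, so no evidence either way
    return (p_a - p_b) / se


def one_sided_p_value(z):
    """P(Z >= z) for a standard normal variable."""
    return 0.5 * (1 - erf(z / sqrt(2)))


def should_scale_up(success_test, n_test, success_base, n_base, alpha=0.05):
    """True if the targeted sample responds better than baseline at level alpha."""
    z = two_proportion_z(success_test, n_test, success_base, n_base)
    return one_sided_p_value(z) < alpha


# Hypothetical pilot: 2,000 targeted people versus 2,000 baseline people.
clear_win = should_scale_up(130, 2000, 90, 2000)   # 6.5% vs 4.5% response
inconclusive = should_scale_up(95, 2000, 90, 2000)  # 4.75% vs 4.5% response

print("scale up after clear lift:", clear_win)
print("scale up after tiny lift: ", inconclusive)
```

The sample only needs to be large enough to separate a real lift from noise, which is the trade-off described above: big enough to learn from, small enough not to disturb the wider campaign.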

Anthony Scriffignano: Yeah, so that's a really interesting way to describe looking forward, while looking backward, very quickly, at the yellow line right behind the car. You're doing a combination of unsupervised learning. I don't want to start to get into all the methods and the names. There is a name for what you just described.

It's a powerful way of thinking and appropriate to the environment that you're in. If someone led with that method to understand fraud, for example, I would say, "Well, you can't do that." We first have to figure out how much of fraud.

First of all, it's not even fraud when it gets presented to us. It's the proto-fraud. It's the thing that precedes the fraudulent activity where somebody else loses money. Then they lie to us.

We do have years and years and years, decades of experience, more than decades, of that kind of behavior. But, that behavior is inherently changing with cryptocurrency and with new ways of cheating on the Internet, all kinds of cybercrime, and so forth.

Now, what we have to do is we have to say, "What percentage of this problem do we think is the new behavior versus the old behavior? How would we know when the environment is changing in such a way that the preexisting methods are not performing as well as we thought they were? And, what would be the triggers against something that we won't recognize when it's happening?"
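
One simple way to approach the second of those questions is to track a model's recent error against the error it showed when it was validated. The sketch below is a hypothetical illustration, not D&B's approach: the `DriftMonitor` class, the window size, the tolerance, and the error values are all assumptions. It also cannot catch behavior that leaves the measured error unchanged, which is the harder third question.

```python
from collections import deque
from statistics import mean


class DriftMonitor:
    """Flags when recent average error rises well above a baseline error."""

    def __init__(self, baseline_error, window=5, tolerance=1.5):
        self.baseline_error = baseline_error  # error seen at validation time
        self.tolerance = tolerance            # allowed multiple of the baseline
        self.recent = deque(maxlen=window)    # rolling window of recent errors

    def observe(self, error):
        """Record one new error value; return True if the trigger fires."""
        self.recent.append(error)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough observations to judge yet
        return mean(self.recent) > self.tolerance * self.baseline_error


# Hypothetical stream of per-period errors that starts drifting upward.
monitor = DriftMonitor(baseline_error=0.10, window=5, tolerance=1.5)
errors = [0.09, 0.11, 0.10, 0.12, 0.10, 0.18, 0.22, 0.25, 0.30]
alerts = [monitor.observe(e) for e in errors]
first_alert = alerts.index(True) if True in alerts else None

print("first period that triggers an alert:", first_alert)
```

A longer window reacts more slowly but is less likely to fire on a single bad period; a shorter one does the opposite.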

If we use this analogy of driving and looking at the yellow line down the road, some methods look way behind the car, and they look at the yellow line. They assume that the shape of the road in front is going to be just like the shape of the road behind us. We all know that's not true.

Other methods try to look only at the line in front of the car. Then, depending on how far ahead they're looking, they either miss the thing that comes right out in front of the car, or they miss the thing that's very close to the horizon that would indicate a change in direction.

You have to have a mixed-methods approach that does a little bit of all of these. That rifle shot that you're talking about, what I'm imagining is almost more of a shotgun kind of rifle that's shooting in multiple directions, but sort of in the same general direction. It's a very good analogy if you think of each of those pellets being a different method and a different analytical approach or a different type of curation, looking for different types of signals that may never have existed before. I could see that being super powerful.

How large are your datasets?

Anthony Scriffignano: It's really hard to answer the question, "How large are datasets in this day and age?" Do you answer that in terabytes or petabytes? Do you answer that in numbers of entities? Do you answer that regarding the rate of change?

I'll give it a shot in my world. There are about 300 million businesses in our databanks. Just to give you a rough idea, there are about 27 million or so businesses in the United States. About half of those change in a year in some way regarding identity. We update this data from every country on earth except for North Korea and Cuba. We do it more than ten million times a day.

All of those different countries have different writing systems, different regulations. There are laws about what data can cross the border, what data must stay in the country, where you may fabricate products, where you may not. We have to comply with all of that everywhere while those laws are changing.

If you think of hundreds of millions of entities, you've got several thousand times that, tens of thousands of times that, in pieces of data producing that end product. We have to start to go into powers of ten to get to this. The number of things you need to look at when you start looking at relationships on top of pieces of data, on top of entities, is on the order of 10 to the 24th in my world.
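
A rough back-of-the-envelope check shows how numbers like these can reach that order of magnitude. The counting rule here is an assumption made for illustration, not something stated in the interview: it takes 10,000 pieces of data per entity (the low end of "tens of thousands") and counts every possible pair of data pieces as one potential relationship.

```python
from math import comb, log10

entities = 300_000_000       # "about 300 million businesses"
pieces_per_entity = 10_000   # assumption: low end of "tens of thousands"

# Total individual pieces of data across all entities.
pieces = entities * pieces_per_entity

# Assumption: a "relationship" is any unordered pair of data pieces.
pairs = comb(pieces, 2)

# Order of magnitude (power of ten) of the pair count.
order = int(log10(pairs))

print(f"pieces of data:     {pieces:.1e}")
print(f"possible pairs:     {pairs:.1e}")
print(f"order of magnitude: 10^{order}")
```

Under these assumptions the pair count lands in the 10 to the 24th range, which is consistent with the figure quoted above, although other counting rules could produce it as well.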

Matthew Marolda: It's interesting. There are two ways to look at it. One is, data is everywhere. Data is our reaction in a movie theater. There is data there, but we don't capture it very well or at all. There's data that's being discussed online. There's data in ticket sales. These things are enormous.

We have a relatively small slice of that. Still, I don't think we're at 10 to the 24th, but we have an enormous scale. I think one of the things that I was going to highlight from our point of view because so much of what we do is from unstructured data, it's almost this odd concept. I don't like using the term "create out of data," but what we're doing is taking all this unstructured data and turning it into more structured insights.

For example, take a pool and make it simple. Take all of Twitter, which we have access to and use all the time; we have it on our servers. Just Twitter alone can generate enormous amounts of structured data for us. It's almost infinite because, depending on what angle we take into that data to pull it out, we're going to have a whole new set of things we could be examining.

We have many, many examples like that. For us, it's as much about the pools of data and then drawing out from them these new structured pieces. Because data, by its nature, is unstructured, typically, that enables us almost infinitely to create data on top of it.

CXOTalk offers in-depth conversations with the world's top innovators. Be sure to watch our many episodes!
