With the NBA playoffs in full swing, we are used to having statistics nuggets thrown into game coverage. While it has been argued that not every aspect of the game should be purely data driven, sports analytics can be fun for fans as well as a useful tool for organizations.
The NBA itself has taken to organizing analytics hackathons, asking participants to propose novel ideas about both the game itself and its business side. Projecting the impact of hypothetical rule changes or predicting the entertainment value of games are some of the ideas investigated in this context.
Read also: NBA analytics: Going data pro
You don't have to be the NBA, or professional media, or a sports organization with a dedicated analytics team to do some analysis of your own. But some problems that are hard to tackle at any level may be approachable via flexible graph data modeling.
Does data-driven optimization -- and three-pointers -- help the game?
Basketball has traditionally been somewhat different in the east versus the west. This year's Western Conference finals featured two of the teams that best represent how the game is evolving: the Golden State Warriors and the Houston Rockets.
It was the Warriors that made it to the finals, as in each of the last three years. Both teams' offenses, however, seem focused on pursuing three-point shots and layups. The reason, it has been argued, is that analytics show these to be the most effective types of shots.
The Warriors proceeded to the NBA finals to face the Cleveland Cavaliers for the fourth time in a row, which should say something about the effectiveness of their game. Effective or not, however, some people like it and others don't.
While there's a lot of subjectivity in this discussion, it may serve as a case study to check whether perceptions correspond to reality as can be seen through the lens of data. It can also help highlight some of the fine print when working with evolving and cross-cutting datasets.
So, what does the data say about the influence of three-point shots in the game? Is there a difference in three-pointers that come through moving the ball, as the Warriors allegedly do, and relying on isolation game, as people say the Rockets do?
The Warriors are considered a passing team. But an analysis using offense duration as a proxy for passing shows that the average time of their plays ending with a made two-point shot is almost a standard deviation (σ) shorter than average. The average time for made three-pointers is more than 1σ below average.
Their opponents in the finals, the Cavaliers, on the other hand, are almost exactly average for both. Incidentally, the Warriors have almost identical numbers to the Philadelphia 76ers in this analysis; the two teams have the lowest average seconds per three-point shot. This could point to an advantage for a style of play that favors smart passes to quickly get the ball to a three-point shooter.
Interestingly, not only do the Warriors shoot earlier in the shot clock than any other team in the NBA, they also force their opponents to shoot earlier on average. Most of the teams they don't force into early shots are teams that already tend to take shots late in the shot clock.
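To make the comparison above concrete, here is a minimal sketch of how a team's average possession length can be expressed in standard deviations from the league mean. All numbers are hypothetical, invented for illustration, not taken from actual NBA play-by-play data:

```python
from statistics import mean, stdev

# Hypothetical average seconds per made three-pointer, per team;
# real values would come from parsed play-by-play data.
league_avg_seconds = {
    "GSW": 9.8, "PHI": 9.9, "HOU": 13.1, "CLE": 12.0,
    "MIN": 12.4, "BOS": 11.7, "TOR": 11.9, "SAS": 13.5,
}

values = list(league_avg_seconds.values())
mu, sigma = mean(values), stdev(values)

def z_score(team):
    """How many standard deviations a team sits from the league mean."""
    return (league_avg_seconds[team] - mu) / sigma

print(f"GSW z-score: {z_score('GSW'):+.2f}")  # negative: faster than average
```

A z-score below -1, as the article describes for the Warriors, means a team's average is more than one σ quicker than the league's.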
You're probably wondering where all of this comes from, and what it means. So let's get to that.
Keeping track of evolving and cross-cutting data and metadata with RDF
Andrew Stellman and I have a few things in common, including a background in analytics, experience in graph data modeling, and a love of basketball. So, when I saw his message on a Hacker News thread referring to some NBA analysis he was working on, using RDF graph data modeling, I knew I had to reach out.
Stellman is also a consultant and a writer, and he's just finishing his latest book, the fourth edition of Head First PMP (O'Reilly Media). He did, however, make some time to discuss, research, test analytics hypotheses, and share his analytics tools and methodology in time for the NBA finals.
Stellman has been working on NBA analytics as a side project. As it turns out, however, his approach can help deal with issues common in professional sports leagues and beyond. Stellman wanted to use NBA data that's free, readily available from multiple sources, and as raw and complete as possible. His goal was to turn it into something usable for real analytics.
Stellman says that play-by-play data is an effective source of complete data to fuel analytics, and he uses play-by-play web pages from ESPN. He has developed tools that download and parse pages, convert them to RDF, and load them in an RDF graph database. He published his code as an open-source project.
He says he considered using SQL or object repositories like Hadoop, but adds that having done a lot of work with RDF over the last five years, it quickly became obvious that RDF was the right choice. The reason may not be obvious for everyone, though.
On the face of it, working with that data and running the queries that he did could just as well have been done with a relational database, for example.
"That would be easy, but only if your RDBMS already has the data. One big advantage that RDF has over relational databases is that it's much easier to update the structure of the data, which is really valuable for doing hypothesis-driven analytics," said Stellman.
Stellman shared discussions he has had with members of the Minnesota Timberwolves analytics team over the last couple of summers:
"Many NBA teams have been trying to crack the play-by-play nut for a while. The problem with play-by-plays is that they contain all of the raw data, but the structure is difficult to work with.
Suppose I was using an RDBMS, and wanted to do an analysis for which I ran into a problem: If I'm not keeping track of the number of seconds for each play-by-play line, I'd have to go and modify the table that stores the plays.
Sure, it makes sense to add this for each individual piece of new data. But as you add more and more data, you keep having to update tables. You either end up with a huge number of tiny, denormalized tables, or a really wide, sparse table. You need to store who assisted, different types of shots, a ton of metadata. RDF is built for metadata."
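Stellman's point about schema flexibility can be sketched with a toy triplestore, here just a Python set of (subject, predicate, object) tuples. The predicate and IRI names are made up for illustration; a production setup would use a real RDF store and SPARQL:

```python
# A toy triplestore: each fact is a (subject, predicate, object) tuple.
# All predicate and IRI names here are invented for illustration.
store = set()

def add(s, p, o):
    store.add((s, p, o))

# A single play-by-play event, described entirely by triples.
add("play:401", "rdf:type", "nba:ThreePointMade")
add("play:401", "nba:shooter", "player:StephCurry")
add("play:401", "nba:assistedBy", "player:DraymondGreen")

# Later, a new analysis needs seconds elapsed in the possession.
# No ALTER TABLE, no NULL-filled sparse column: just one more triple per play.
add("play:401", "nba:secondsIntoPossession", 7)

# Query: all facts known about this play.
facts = {(p, o) for (s, p, o) in store if s == "play:401"}
print(sorted(str(f) for f in facts))
```

The contrast with an RDBMS is that each new piece of metadata is one more row-like triple, not a schema migration across every table that stores plays.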
From a side project to the NBA
Stellman refers to the time he spent on a consulting job working with Dean Allemang, RDF expert and author of The Semantic Web for the Working Ontologist.
Allemang compared adding RDF data to using slide transparencies that you might see in a university class: "With RDF, you can overlay new data over the old data, and the existing data is not affected at all."
So, to add his data, all Stellman had to do was add extra triples for each play. He updated his ontology for working with RDF data to add some metadata about the new prefixes, but didn't have to do any modeling at all: "For a small change like this, it doesn't make too much of a difference, but for a large change it saves a HUGE amount of modeling headache," he said.
Another reason RDF lends itself so well to analytics, according to Stellman, is that statistics is rooted in discrete math:
"At its root, it's basically counting. Almost all common stats are just count of one thing as a percentage of count of another thing. Even really complex formulas boil down to counts; specifically, finding the right subset of things to count. RDF is really useful for this. Everything in a play-by-play is an event, so the key is to attach the right metadata to each event."
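The "counts as a percentage of counts" idea maps directly onto event data. A minimal sketch, with events and figures invented for illustration:

```python
# Each event is a dict here; in RDF terms, each would be a subject
# carrying metadata as triples. Numbers are invented for illustration.
events = [
    {"type": "3PT", "player": "Curry", "made": True},
    {"type": "3PT", "player": "Curry", "made": False},
    {"type": "3PT", "player": "Curry", "made": True},
    {"type": "2PT", "player": "Curry", "made": True},
    {"type": "3PT", "player": "Durant", "made": True},
]

def three_point_pct(player):
    """3P% is literally one count divided by another: makes over attempts."""
    attempts = [e for e in events if e["type"] == "3PT" and e["player"] == player]
    makes = [e for e in attempts if e["made"]]
    return len(makes) / len(attempts)

print(three_point_pct("Curry"))  # 2 makes out of 3 attempts
```

Finding "the right subset of things to count" is then just a matter of filtering on the right metadata before counting.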
Stellman refers, for example, to NBA teams trying to figure out how to use the data from the overhead cameras that track player movements. He mentions different ways to do this: Tracking individual movement coordinates, tracking passes to players, creating metadata tags based on machine learning, etc.
"If you wanted to attach that data to a play, you'd have to do a bunch of RDBMS modeling, and your database diagrams would start getting huge and unmanageable. But with RDF, you could create a different context for each of those kinds of data. They wouldn't have to know about each other.
And you can start by generating RDF triples, which is a lot more satisfying and productive than starting by trying to create a table model. So, you could do one analysis of individual plays and create triples with the play IRI as the subject.
Then, you add triples to your triplestore in their own context, so you can query them but also isolate them. All without having to touch any of the existing data, or do any modeling at all. It's really convenient," Stellman said.
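The "context" idea corresponds to RDF named graphs: each triple lives in a graph, so a new analysis can write into its own graph without touching existing data. A minimal quad-store sketch, with graph and predicate names invented for illustration:

```python
# A toy quad store: (graph, subject, predicate, object).
# Graph, predicate, and IRI names are invented for illustration.
quads = set()

def add(graph, s, p, o):
    quads.add((graph, s, p, o))

# Original play-by-play data lives in one graph...
add("graph:pbp", "play:401", "nba:shooter", "player:Curry")

# ...and a new camera-tracking analysis writes into its own graph,
# reusing the same play IRI as subject, without touching the old graph.
add("graph:tracking", "play:401", "track:passesBeforeShot", 4)

def triples_in(graph):
    """Isolate one context: only the triples in a given graph."""
    return {(s, p, o) for (g, s, p, o) in quads if g == graph}

def all_about(subject):
    """Query across contexts: everything known about a subject."""
    return {(g, p, o) for (g, s, p, o) in quads if s == subject}

print(sorted(str(q) for q in all_about("play:401")))
```

In SPARQL, the same isolation is done with the GRAPH keyword; the point is that the two analyses never need to know about each other's schema.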
Coming up with stories, and Occam's razor
Stellman and I worked on a number of hypotheses, trying to come up with interesting findings and plausible explanations.
One of the things we looked into, for example, was the average percentage at which players shoot three-pointers after the other team makes or misses a shot. What we found is that this is pretty much in line with players' overall three-point shooting averages.
But, as Stellman noted, the standard deviation of three-point percentage after a player on the other team makes a shot is more than twice the standard deviation of overall three-point percentage. He pointed out that this makes the metric extremely player-specific:
"Some players have a huge 'anything you can do, I can do better' motivation, and it shows up in their stats. So, if a player on the other team just made a three, you definitely want the ball in the hands of Karl-Anthony Towns or Kevin Durant."
And then, there's Steph Curry. What does it mean that Steph Curry actually has a lower three-point percentage after a player on the other team makes a three?
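The split under discussion can be computed directly from tagged events. A sketch with invented percentages, not the players' actual splits:

```python
from statistics import stdev

# Hypothetical (player, overall 3P%, 3P% after an opponent's made three).
# These numbers are invented for illustration, not real splits.
splits = [
    ("Towns",    0.42, 0.47),
    ("Durant",   0.41, 0.45),
    ("Curry",    0.42, 0.39),
    ("Player D", 0.36, 0.30),
    ("Player E", 0.38, 0.44),
]

overall = [o for _, o, _ in splits]
after_make = [a for _, _, a in splits]

# The spread of the "after a made three" split is wider than overall 3P%,
# which is what makes the metric so player-specific.
print(stdev(after_make) > stdev(overall))  # True with these numbers
```

With real play-by-play data, the "after a made three" subset would be selected by attaching the previous event's outcome as metadata to each shot attempt.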
Stellman came up with some explanations for why this is not necessarily a bad thing. He suggested Curry is a team player, so maybe he knows that his team gets especially fired up when a three is answered with a three.
Or maybe there's a chance the momentum can shift and a three goes unanswered, so it's worth taking the shot. This reminded me of another analysis I've seen on Steph Curry, about his comeback from a bad streak.
It's an interesting read from another analytics expert, Eric Colson from Stitch Fix. Colson writes about the many imaginative stories people came up with to explain Curry's fall out of and rise back to grace from the three-point line.
Colson's point is that, despite being imaginative, those stories do not necessarily hold true. Colson's explanation was that this was a statistical fluctuation that soon corrected itself. Curry's explanation was that he just kept shooting -- that's all.
And that's something to keep in mind when coming up with stories or doing analytics. Choosing tools and coming up with plausible explanations is important, but not more than keeping things simple and sticking to the facts.
There is, after all, a theory for that too. It's called Occam's razor, and it says that the simplest explanation is usually the right one. Sometimes, the answer is to just keep shooting.
Previous and related coverage
At the NBA-sponsored "Leaders Meet: Innovation" Event, Steve Hellmuth explains how analytics is helping improve basketball for fans.
Just in time for March Madness, Toyota engineers unveil a 6'3" humanoid with a sweet J.