LinkedIn: Machine learning is like oxygen, but the human element is not going away anytime soon
How do data and machine-learning powered algorithms work to control newsfeeds and spread stories? How much of that is automated, how much should you be able to understand and control, and where is it all headed? LinkedIn has answers.
Recently LinkedIn revamped its newsfeed and rolled out a new feature called Trending Storylines. Coupled with the acquisition by Microsoft, this is a move with far-reaching implications.
Social media and their newsfeeds have come to play a key role in our lives. To a large extent, they shape our perceptions, the way we get our information and connect with each other and the world at large. LinkedIn is a professional network, but its size and ambition mean that its newsfeed can be hugely important in its own way.
Leveraging data to offer its users a relevant experience is at the core of LinkedIn's operation. As Igor Perisic, LinkedIn's CDO, VP of Engineering and head of machine learning (ML), puts it, "machine learning is like oxygen for LinkedIn's organism."
We connected with Perisic to discuss his insights on the use of data and ML in LinkedIn. As we had never met before, our conversation started in a rather typical ice-breaking way -- by sharing some information about the course of our day and our places of residence. You may think that's irrelevant to our topic, but is it really?
LinkedIn is a professional social network, and we're having a professional conversation about it. And yet here we are talking about Berlin in the springtime. Is that professional? And who gets to judge that? Would LinkedIn's algorithms classify that as spam, were this shared online instead of over the phone?
LinkedIn has been working to define what is a professional conversation and how that reflects in our newsfeeds. Many of us have seen the "what's the next number in this sequence" type of posts there for example. For some they are intriguing, for others they are clickbait.
Perisic says that most of the feedback they got from users indicated they did not consider this a professional conversation, so LinkedIn decided they did not want it to overwhelm their newsfeeds.
ML works as a trigger that evaluates content in two stages. Initially, LinkedIn's online and nearline classifiers label every image, text, or long-form post shared as "spam," "low-quality," or "clear" in near real time. As the content gathers an audience, another set of classifiers runs to identify shares that are likely to go viral and likely to be of lower quality, incorporating user flagging as well.
When these classifiers can deduce with a high precision what category shares fall under they act on their own, letting shares be, demoting them, or filtering them out. When the classifiers cannot safely decide, human editors come to the rescue. LinkedIn marks its human labeling team, run by the Trust and Safety organization, as the centerpiece of spam-fighting efforts.
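The routing described above can be sketched in a few lines of Python. This is only an illustration of the idea, not LinkedIn's implementation: the confidence threshold, the `triage` function, and the `Decision` type are hypothetical, and the article does not disclose any actual cut-off values.

```python
from dataclasses import dataclass

# Hypothetical confidence cut-off; the article gives no actual value.
AUTO_ACTION_THRESHOLD = 0.9

# Action taken automatically for each label when the classifier is confident.
ACTIONS = {"spam": "filter", "low-quality": "demote", "clear": "keep"}


@dataclass
class Decision:
    label: str   # most likely label according to the classifier
    action: str  # "filter", "demote", "keep", or "human_review"


def triage(scores: dict[str, float],
           threshold: float = AUTO_ACTION_THRESHOLD) -> Decision:
    """Route a share given classifier scores (label -> probability)."""
    label = max(scores, key=scores.get)
    if scores[label] >= threshold:
        # Confident enough: the classifier acts on its own.
        return Decision(label, ACTIONS[label])
    # Not confident enough: hand the share to a human editor.
    return Decision(label, "human_review")


confident = triage({"spam": 0.97, "low-quality": 0.02, "clear": 0.01})
uncertain = triage({"spam": 0.40, "low-quality": 0.35, "clear": 0.25})
```

Here `confident` is filtered automatically, while `uncertain` is queued for human review because no label clears the assumed threshold.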
The decisions made by editors that work with LinkedIn are fed back to the ML algorithms to improve them. Rushi Bhatt, senior engineering manager at LinkedIn, says that the human feedback loop has been in place at this scale for approximately a year. It started in 2016 with little use of ML classifiers for most of the content feed, so this loop has bootstrapped LinkedIn's whole program.
How well does it work though? "Different classifiers do different things, so it's difficult to arrive at a single figure that measures the effectiveness of our program," says Bhatt. "Online A/B testing of one set of classifiers has shown a reduction of 48 percent in spam and low-quality content impressions due to these predictors. Another set of predictors has increased the precision of tagging low quality content six-fold."
Bhatt does not expect the human element to go away anytime soon.
"One of the reasons LinkedIn would like to keep the human feedback loop operational is to monitor the site for any new and novel types of spam attacks, and to continuously measure the performance of the system. There will also be cases that will require deeper scrutiny.
If anything, LinkedIn is seeing that classifiers are taking away the 'grunt work'-type labeling from humans and its labelers are freed to look at more nuanced content that requires human intelligence to adjudicate. LinkedIn also uses a variety of techniques to avoid biasing training towards only what goes into the feedback loop."
So, is talking about Berlin in the springtime spam or not? It depends. When sharing something which is restricted, for example to a group of people with whom you are connected in some way, the algorithms will go easier on you compared to when sharing with the world.
Perisic goes even further, suggesting that ML can proactively help users adjust their sharing to prevent side effects and maximize impact. So it wouldn't be a surprise to see something along these lines unveiled soon.
Trending Storylines: What if LinkedIn was a media organization?
But if LinkedIn can predict whether posts will go viral, could that not also be used to generate such posts? Apparently it can, and that's what LinkedIn seems to be out to achieve with its new Trending Storylines feature.
LinkedIn introduced Trending Storylines as part of the new feed experience. It is promoted as a feature that helps members discover and discuss news, ideas, and diverse perspectives. It works by combining LinkedIn's systems with the expertise of its editorial team to create relevant news recommendations. The idea is that editors pick and create stories and ML does the rest, including updating them with new content as it emerges.
Although it's too early to tell how well it will work, since it was released just a few days ago and only in the US for the time being, this sounds like every news organization's wet dream. Perisic pointed out that ML works significantly better when applied in a reduced problem space, as is the case here. But this also hints at a couple of interesting points.
Apparently classifying content in just three buckets -- spam, low quality, or clear -- does not cut it. Once content makes it into the "clear" zone, another set of classifiers takes over to rank items according to a combination of criteria. Perisic mentioned, for example, a classifier that evaluates content in terms of its conversation-starting potential.
All classifiers have to be tuned and combined appropriately though, and this is perhaps more art than science. Job openings for example are usually not great conversation starters. Relying heavily on that conversation evaluation classifier for the newsfeed meant that job openings took a hit in terms of visibility, which was an obviously undesirable side effect, so tuning the newsfeed was required.
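To make the tuning problem concrete, here is a minimal sketch that assumes the ranking is a simple weighted sum of classifier scores. The article does not say how LinkedIn actually combines its classifiers; the signal names, scores, and weights below are all made up for illustration.

```python
def rank_score(signals: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of per-classifier scores; missing signals count as 0."""
    return sum(w * signals.get(name, 0.0) for name, w in weights.items())


# Hypothetical classifier outputs for two items in the "clear" bucket.
job_opening = {"conversation_potential": 0.1, "professional_relevance": 0.9}
debate_post = {"conversation_potential": 0.9, "professional_relevance": 0.3}

# Leaning heavily on the conversation classifier buries the job opening...
conversation_heavy = {"conversation_potential": 0.9, "professional_relevance": 0.1}
# ...while a rebalanced (still hypothetical) weighting restores its visibility.
rebalanced = {"conversation_potential": 0.4, "professional_relevance": 0.6}

for name, weights in [("conversation-heavy", conversation_heavy),
                      ("rebalanced", rebalanced)]:
    print(name,
          round(rank_score(job_opening, weights), 2),
          round(rank_score(debate_post, weights), 2))
```

With the conversation-heavy weights the job opening scores far below the debate post; with the rebalanced weights it edges ahead, which is the kind of side effect and correction described above.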
LinkedIn currently uses three sources for content: shares and status updates, its blogging platform, and content marked as important by its editorial team. Does employing editors that curate newsfeeds and generate stories make LinkedIn a media organization, as has similarly been suggested for Facebook? Are there responsibilities that come with this, and would removing the human in the loop altogether make a difference?
When asked to comment, Dan Roth, executive editor at LinkedIn, replied as follows:
"We are platform agnostic and focused on fostering conversations among professionals -- stories are of course a key component of this. We also have strong relationships with publishers, and have seen our referral traffic increase by 2-3X.
We've always believed that the real magic comes when we combine editors and algorithms. The editors -- mostly all journalists -- can spot, plan for or encourage high-quality, urgent conversations. The algorithms allow us to reach the long tail beyond just these top topics. The actions of the editors help train the algorithms and the algorithms help surface potentially high-quality conversations."
How do you open up the black box, and how much open is open enough?
The introduction of Trending Storylines brings back an old question: are LinkedIn's policy and rules of engagement as clear as they could be for content creators and newsfeed consumers? There has been some criticism on the topic, and obviously there is a fine line between protecting IP and being an attractive channel for content creators and consumers. When asked for comment, Steve Lynch, senior communications manager at LinkedIn, replied as follows:
"As far as walking the line between protecting IP and being transparent to content creators, we provide a pretty straightforward explanation of how articles generally are shared and weighted in members' feeds. You can take a look at that explanation here.
Ultimately, our goal is to continue to add and improve on tools so that members control their feed experience and are empowered to tell us what they want to see."
Although determining whether LinkedIn is a media organization and how much of its inner workings it should reveal is mostly up to others, Perisic is also concerned about transparency. As far as transparency in the newsfeed goes, he sees this as an arms race of sorts: "describing in detail how our algorithms work would be a double-edged sword," he claims. "If we explained how it works, we would enable spammers to game the system."
Opening up the ML black box, however, does fall within Perisic's responsibilities. Perisic has been involved in LinkedIn's talks with EU regulators concerning GDPR, which will come into effect in May 2018. Part of that has to do with explanations that LinkedIn, among others, will have to provide to individuals regarding why certain things are happening on the platform. What's his take on this?
"If I give you a ML algorithm with 100K features and try to explain it to you, it's not going to be trivial. This needs to be understood from the perspective of the person asking a question, not the expert. If I look at the model, I'll understand, but the point is to be able to explain to the person asking 'what are you going to do with my content.'
How regulations are interpreted is an ongoing discussion. High complexity cannot lead ML and statistics practitioners to be washed away in complexity. It's not like when something is complex, you cannot explain it. What are the factors that affect the algorithm?
We should work backwards from the algorithm trying to understand in the words of individuals, not in our own words. I can describe a number of regression techniques that work in a number of ways, but that's too technical. But we need to make sure you understand -- we can't hide behind the formula.
We can identify the major factor our models pick. Not all factors, but we can identify what it was mainly influenced by. Is this exactly complete? No, but it may be enough to cover the person's requirement to understand."
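One simple way to "identify the major factor" can be sketched for the special case of a linear model, where each feature's contribution to the score is its weight times its value. This is an assumption made for illustration; the article does not describe the technique LinkedIn uses, and the feature names and weights below are hypothetical.

```python
def top_factors(weights: dict[str, float],
                features: dict[str, float],
                k: int = 1) -> list[tuple[str, float]]:
    """Return the k features with the largest absolute contribution
    (weight * value) to a linear model's score."""
    contributions = {
        name: weights[name] * value
        for name, value in features.items()
        if name in weights
    }
    ranked = sorted(contributions.items(),
                    key=lambda item: abs(item[1]),
                    reverse=True)
    return ranked[:k]


# Hypothetical weights of a "low-quality" scoring model and one share's features.
weights = {"member_reports": 2.0, "link_count": 0.5, "author_connections": -0.3}
share = {"member_reports": 3.0, "link_count": 4.0, "author_connections": 1.0}

main_factor = top_factors(weights, share)
```

In this made-up case the dominant factor is `member_reports`, so a plain-language explanation could be "this post was demoted mainly because several members reported it". That is not a complete account of the model, but it may be enough for the person asking.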
LinkedIn + Microsoft = All your data are belong to us? It's complicated
Regulation also influences LinkedIn in terms of what it can do as part of Microsoft. Perisic is both CDO and VP of Engineering for LinkedIn, and although he jokes about how LinkedIn's logo is still all over their buildings, things are pretty serious when it comes to merging datasets.
So, what's keeping Microsoft from doing what Google did with YouTube and other services at some point, and creating a unified user profile across the spectrum with everything it knows about you? Part of LinkedIn's discussions with regulators is precisely about that.
According to Perisic:
"We have to be careful there. We have a very recent precedent with the case of Facebook and WhatsApp. There are certain terms for the approval of the acquisition, but perhaps more important, in order to do our work, we need to have the trust of our members. Members first is our philosophy.
Imagine, what would happen for example if your Xbox chats were available to the world? You would go -- wait a minute, I only shared that with the group of people I was in that game today, I don't want you to know who I am or merge that with my Outlook data or my Bing searches.
You have to decide what your platform is known for. In LinkedIn we like to think we're different, and not just because of the way we focus on using ML to shape how our product should behave.
By using ML you do things like classify intent, so you have to be careful there. If you ask members who they want to share data with, you have to respect their decision. For us members first means that data belongs to members -- we are just the custodians."
That, however, does not mean that there are no synergies that LinkedIn and Microsoft are pursuing. To begin with:
"Microsoft has many clever individuals, and having access to them --and vice versa -- is obviously beneficial. Instead of meeting people at conferences and getting hints at how they approach problems, we can interact directly and have full access.
For example, working with GPU clusters, which is something both we and Microsoft do. Configuring those clusters is not an easy task, and we were able to benefit from Microsoft's experience there. It also works the other way round and in other areas, like tools and algorithms.
Many of our tools are open source and have already been in use by Microsoft, like Kafka. We also have algorithms that Microsoft is looking at, for example to do large scale logistic regression. But by and large, we retain our autonomy."