CatBoost Machine Learning framework from Yandex boosts the range of AI

This is the year artificial intelligence (AI) was made great again. AI is all about machine learning, and machine learning is all about deep learning (DL), according to the hype. For connoisseurs like Yandex, there's more to AI than deep learning. CatBoost, the open source framework Yandex just released, aims to expand the range of what is possible in AI and what Yandex can do.
Written by George Anadiotis, Contributor
The AI landscape is changing day by day. (Image: Shivon Zilis and James Cham, designed by Heidi Skinner. A larger version can be found on Shivon Zilis' website.)

It's hard to avoid the AI buzz out there. Beyond the hype, there's no denying that progress is being made in leaps and bounds. We are in mid-2017, and already the image of machine intelligence as painted for 2016 has seen notable new entries.

Staying within the technology stack alone, we have seen the introduction of Caffe2 from Facebook and Core ML from Apple, which has just entered the game, and let's not forget the wildly ambitious NeoPulse.

One thing all of these have in common: Deep learning. Caffe2 and NeoPulse are exclusively DL frameworks, and DL is also central to Core ML. While DL is certainly valuable, there is more to ML. And there are also more players in the game than the usual suspects.

Meet CatBoost, a new ML library based on gradient boosting (GB) and aiming to find its own sweet spot in the AI landscape.

CatBoost, your friendly neighbourhood feline

The release of CatBoost as open source was officially announced today, but CatBoost did not come out of nowhere. It has been developed by Russia-based and NASDAQ-traded Yandex. Yandex, known to many as the "Russian Google," touts itself as a technology company that builds intelligent products and services powered by ML.

"ML powers more than 70 percent of Yandex products and services," says Misha Bilenko, head of Machine Intelligence and Research (MIR) at Yandex. Although its MatrixNet and DaNet libraries are not as well known as others in this domain, they have been around for a while and are used heavily by the likes of CERN and Gazprom.

"CatBoost is the next generation of MatrixNet and Yandex will be implementing CatBoost almost everywhere MatrixNet is already in place," says Bilenko.

Great. But what is CatBoost and why should you care?

Yandex describes CatBoost as "a state-of-the-art open-source gradient boosting library," and elaborates that while DL is indeed useful and something it has had great experiences with, there's more to life and AI than DL, such as GB.

Yandex applies GB to the kind of problems businesses encounter every day -- like detecting fraud, predicting customer engagement, and ranking recommended items. Yandex claims the key advantage of GB over DL is the ability to deliver highly accurate results even when there is relatively little data.
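
For readers who have not met gradient boosting before, the idea can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not CatBoost's implementation: it fits a single numeric feature with squared loss, uses one-split "stumps" as weak learners, and every name in it (`fit_stump`, `fit_boosting`, `predict`, the sample data) is made up for the example.

```python
def fit_stump(xs, residuals):
    """Find the split 'x <= t' whose two leaf means best fit the residuals."""
    best = None
    for t in sorted(set(xs))[:-1]:  # the largest value would leave the right leaf empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    if best is None:  # all x values identical: fall back to a constant leaf
        mean = sum(residuals) / len(residuals)
        return xs[0], mean, mean
    return best[1], best[2], best[3]


def fit_boosting(xs, ys, n_rounds=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared loss, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lv, rv = fit_stump(xs, residuals)
        stumps.append((t, lv, rv))
        preds = [p + learning_rate * (lv if x <= t else rv)
                 for x, p in zip(xs, preds)]
    return base, stumps, learning_rate


def predict(model, x):
    """Sum the base value and the scaled contribution of every stump."""
    base, stumps, learning_rate = model
    return base + learning_rate * sum(lv if x <= t else rv for t, lv, rv in stumps)


def mse(predictions, ys):
    return sum((p - y) ** 2 for p, y in zip(predictions, ys)) / len(ys)


# Hypothetical data: eight points with a jump between x = 4 and x = 5.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

model = fit_boosting(xs, ys)
baseline_error = mse([sum(ys) / len(ys)] * len(ys), ys)
boosted_error = mse([predict(model, x) for x in xs], ys)
print(baseline_error, boosted_error)
```

Even with only eight data points, the ensemble of weak stumps ends up with a lower training error than the plain average, which is the intuition behind the "works with relatively little data" claim. Production libraries add regularization, deeper trees, and much more.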

This, says Yandex, makes it ideal for predictive models that analyze many different forms of data, and especially descriptive data formats with categorical features (features with discrete rather than continuous values). Yandex advocates CatBoost as the one model to rule them all, integrating inputs from many different ML techniques.
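
To make "categorical features" concrete: a tree-based model ultimately needs numbers, so a value such as a color or a city name has to be turned into one. The sketch below shows one widely used generic approach, smoothed target encoding, in plain Python. It is an illustration only and makes no claim about how CatBoost handles categories internally; the function name `target_encode`, the `prior_weight` parameter, and the sample data are all assumptions made for this example.

```python
from collections import defaultdict


def target_encode(categories, targets, prior_weight=1.0):
    """Replace each category with a smoothed average of the target.

    The overall target mean acts as a prior, so categories seen only
    a few times are pulled toward it instead of getting extreme values.
    """
    prior = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    return {
        category: (sums[category] + prior_weight * prior)
        / (counts[category] + prior_weight)
        for category in counts
    }


# Hypothetical data: did a customer click (1) or not (0), by banner color.
colors = ["red", "red", "blue", "blue", "blue", "green"]
clicked = [1, 1, 0, 0, 1, 0]

encoding = target_encode(colors, clicked)
encoded_column = [encoding[c] for c in colors]
print(encoding)
```

A caveat worth knowing: computing the encoding from the same rows the model is then trained on leaks target information, so real systems compute it on held-out or previously seen rows only.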

Yandex made sure that the structure of CatBoost can support their story, as it can be fed with models from DL frameworks such as TensorFlow or Keras. What's more, it can in turn feed to Core ML, thus bringing CatBoost-powered apps to a wide array of devices around the world.

CatBoost boasts best-in-class accuracy among GB algorithms, and Yandex says it improves the ability to create predictive models using a variety of data sources such as sensory, historical, and transactional data.

Yandex calls CatBoost the most powerful "ultimate" model. While such claims have to be proven in practice, one cannot help but notice that Yandex seems to be putting its money where its mouth is. To begin with, Yandex focuses its own future development around CatBoost.

Yandex stands strong behind CatBoost

CatBoost may be playfully named and sleekly marketed, but make no mistake as to the seriousness with which Yandex approaches this. (Image: Yandex)

Yandex will be implementing CatBoost almost everywhere MatrixNet is already in place, says Bilenko. That counts for something, as MatrixNet has been key to Yandex. As far as others are concerned, Yandex is trying to make CatBoost appealing by providing options for it.

Besides TensorFlow and Core ML integration, CatBoost can be used in Python and R or via a command-line tool, has visualization hooks and automated feature importance calculation, and it offers options for parameter tuning and boasts superiority in benchmarks.

Admittedly, Yandex makes some compelling arguments. There are just a couple of things you are probably wondering about.

One, who is Yandex again and what makes them such experts in ML? And two, if CatBoost is so great, why not keep it to themselves? Well, the two may actually be related.

We already mentioned how Yandex is colloquially known as the Russian Google. While there certainly is some basis to this, Yandex people, and most notably its CEO, beg to differ. First of all, they say, Yandex was founded in 1997, "a year before Google, so we didn't follow them."

Yandex started as a search engine, much like Google, but then diversified to other domains. Yes, much like Google, but also like Amazon and Uber. Yandex, in addition to owning a 54-percent share of the online search market in Russia, has expanded to offer services like shopping (Yandex.Market is used by 19 million people a month) and taxi rides (Yandex.Taxi owns 60 percent of this market in Moscow).

Some of that may have to do with Russian protectionism, but probably not all of it. Yandex has built on a number of advantages in the local market and is expanding to other markets, too. Hiring ex-Microsoft Bilenko, in addition to other high-profile hires and internal reorganisation, seems to be part of the plan to take on the world.

When asked what barriers need to be addressed in this effort, Bilenko responded that "Yandex is committed to maintaining high quality products and services for users in our core markets, but as a global technology company, we find it invaluable to contribute more broadly to the larger tech community.

"Given the fundamental importance and widespread use of GB, we wanted to contribute to a core need and create something that's easy for data scientists to integrate with other machine learning frameworks. Offering the community a great out-of-the-box tool is something we anticipate will be widely used and highly beneficial."

Machine Learning heavyweight

Bilenko mentioned Yandex Clickhouse as an example of the tools Yandex made available to the open source community. Bilenko says he hopes to see CatBoost impact the tech community in a positive way, whether that is for retail or insurance or any other commercial use, and he emphasizes the wealth of developer talent in Russia.

Yandex utilizes ML in a number of consumer-facing applications, such as translation, image recognition, web search, advertising, weather forecasting, speech recognition, and anti-fraud. What's more, Bilenko says Yandex will be implementing ML with the Yandex.Cloud team. So expect to see more ML in the cloud from Yandex soon, keeping with the times.

Another interesting and little-known fact, however, is that Yandex also has an enterprise side -- and data is the driving force behind it. CatBoost is also meant to succeed MatrixNet in domains such as industrial process optimization or improving the efficiency of particle physics research.

CatBoost has enterprise-ready features, and that's no surprise considering its origins and applications. (Image: Yandex)

Yandex Data Factory (YDF) is a division of Yandex that provides AI-based solutions to increase productivity, reduce costs, and improve energy efficiency. It works with the likes of Gazprom, CERN and Intel, and it was there that MatrixNet, originally developed by Yandex in 2009, was hardened.

Although Bilenko says his MIR division is normally not related to YDF, CatBoost was used to create a prediction model for a YDF customer, a large steelmaking company.

This quality prediction model was trained on past data about the production of steel slabs in order to predict the likely amount of defect mass in each individual slab based on available measurements. The result was decreased overall production costs and defect rates.

The process industry in Yandex's home court markets is heavyweight, and the combination of access to this industry, know-how, and talent may give Yandex the potential to leverage its stronghold to take on other markets as well.

So, should you consider CatBoost? Probably yes. Where does it fit in Yandex's strategy? Looks like a key move to getting exposure, establishing expertise, and attracting talent and clients while accelerating its evolution. Also looks like an interesting twist in the plot of the ongoing AI saga; let's see how the dice will roll.
