The best programming language for data science and machine learning

Hint: There is no easy answer, and no consensus either.
Written by George Anadiotis, Contributor

Arguing about which programming language is the best one is a favorite pastime among software developers. The tricky part, of course, is defining a set of criteria for "best."

With software development being redefined to work in a data science and machine learning context, this timeless question is gaining new relevance. Let's look at some options and their pros and cons, with commentary from domain experts.

Even though, in the end, the choice is at least to some extent a subjective one, some criteria come to mind. Ease of use and syntax may be subjective, but things such as community support, available libraries, speed, and type safety are not. There are a few nuances here, though.

Execution speed and type safety

In machine learning applications, the training and operational (or inference) phases for algorithms are distinct. So, one approach taken by some people is to use one language for the training phase and then another one for the operational phase.

When choosing a programming language for data science and machine learning, there are some special considerations. (Image: Getty Images/iStockphoto)

The reasoning here is to work during development with the language that is more familiar or easy to use, or has the best environment and library support. Then the trained algorithm is ported to run on the environment preferred by the organization for its operations.

While this is an option, especially using standards such as PMML, it may increase operational complexity. In addition, in many cases things are not clear-cut, as programming done in one language may call libraries in another one, thus diluting the argument on execution speed.
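
As a rough illustration of the train-in-one-language, run-in-another approach, the sketch below fits a tiny linear model and writes its parameters to a language-neutral string that any runtime could read back. It uses plain JSON as a simplified stand-in for a standard such as PMML; it is not PMML itself, and the function names (fit_line, export_model, predict) are hypothetical.

```python
import json


def fit_line(xs, ys):
    """Training side: ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return {"model": "linear", "slope": slope, "intercept": mean_y - slope * mean_x}


def export_model(params):
    """Serialize the trained parameters to a language-neutral string."""
    return json.dumps(params)


def predict(serialized, x):
    """Serving side: needs only the serialized parameters, not the training code."""
    params = json.loads(serialized)
    return params["slope"] * x + params["intercept"]


serialized = export_model(fit_line([0, 1, 2, 3], [1, 3, 5, 7]))
print(predict(serialized, 10))  # prints 21.0
```

Here predict is written in Python only for brevity. The point is that it depends on nothing but the exported string, so the same few lines could be reimplemented in whatever language the operations team prefers, at the cost of keeping two implementations in sync.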

Another thing to note is type safety. Type safety in programming languages is a little like schema in databases: While not having it increases flexibility, it also increases the chances of errors.

In this thread initiated by Andriy Burkov, machine learning team leader at Gartner, Burkov argues against using dynamically typed languages such as Python for machine learning.

"You can run an experiment for several hours, or even days, just to find out that the code crashed because of an incorrect type conversion or a wrong number of attributes in a method call," says Burkov.
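
The failure mode Burkov describes can be reproduced in miniature. In the hypothetical sketch below, train stands in for hours of computation, and the call to summarize is missing an argument. Python raises the error only when that line executes, after all the training work is done. The error is caught here purely so the example runs to completion.

```python
def train(epochs):
    """Stand-in for a long-running experiment; returns one loss per epoch."""
    return [1.0 / (epoch + 1) for epoch in range(epochs)]


def summarize(losses, label):
    """Format the final loss; expects a list of floats and a string label."""
    return label + ": " + str(round(losses[-1], 3))


def run_experiment(epochs):
    losses = train(epochs)  # imagine hours or days of compute here
    try:
        # Bug: the label argument is missing. Nothing complains until
        # this line actually runs, once training has already finished.
        return summarize(losses)
    except TypeError as err:
        return "failed after " + str(len(losses)) + " epochs: " + type(err).__name__


print(run_experiment(5))
```

In a statically typed language, a call with the wrong number of arguments would be rejected at compile time, before any training started.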

Java

Despite having what is arguably the largest footprint in enterprise deployment, Java is not getting much love these days. Some of this may have to do with the "coolness factor," as Java has been challenged by new programming languages, but there are also some very real concerns here.

What has greatly helped Java establish its footprint, namely the JVM, is also a reason why people are skeptical about using it for machine learning. Similarly, garbage collection, a well-known Java feature that spares developers the manual memory management of C++, may pose problems in production environments, for example through unpredictable pauses.

Java may not be getting much love these days, but it remains the one programming language with the widest deployment base in the enterprise.

When discussing trends in software development with Paco Nathan, managing partner at Derwen and data science practitioner and thought leader, the topic did come up.

Nathan notes that the trend he sees is toward real-time applications, and this is not something he believes the JVM is well-suited for, as it is an abstraction over the hardware. Adding a layer between the code and the hardware provides cross-platform portability, but also slows down execution.

Nathan also cites Ion Stoica, a Databricks co-founder who helped lead the UC Berkeley lab where Apache Spark originated. Spark is heavily used for real-time applications. Nathan mentioned that one of the rules Stoica has recently set for his research team at Berkeley is abolishing Java.

Nathan commented that he expects that to spill over from research to industry over a five-year timeframe, as is typical for directions initiated in research environments. But maybe we should not be too fast in writing off Java.

The ups and downs that have followed Java during its stewardship by Oracle may have contributed to its fall from grace. They may also have something to do with the perceived stalemate in the evolution of the JVM.

With enterprise Java being handed off to the Eclipse foundation, however, there is a chance Java and the JVM may be revitalized. There are also initiatives, such as Gandiva, which aim to optimize Java code for specialized hardware, potentially making it a competitive option for machine learning.

In addition, that large footprint has given rise to initiatives, such as DeepLearning4J, which aim to bring to Java users access to the same libraries typically used through other languages.

Python

According to a recent survey by KDNuggets, Python is the undisputed leader in use for data science and machine learning. Some often-cited reasons for this preference are the wide choice of libraries and the fact that it's considered an easy language to work with.

Python is the language of choice for most when it comes to data science and machine learning.

Ashok Reddy, GM DevOps at CA Technologies, notes that Python was the language of choice in his recently completed master's in AI and Machine Learning at Georgia Tech.

Reddy goes on to add that Python is gaining popularity in universities due to its simplicity, so graduates are more likely to know Python than Java. Beyond simplicity, he also cites the abundance of libraries as a key reason for this.

Reddy notes that, from a performance perspective, C is also a popular choice for use in AI and embedded-IoT applications, but Java is not going away. Reddy also sees a pattern in using Python for development and then other languages for deployment of machine learning algorithms.

This also applies internally at CA, as Reddy notes that, in addition to having legacy code in C and Java, the cross-platform portability that Java offers is a key priority for CA.

"Many startups use Ruby or Python initially, and when they grow up they switch to Java," says Reddy.

R

In the KDNuggets survey, R's share seems to be dropping compared to last year. R, however, has been gaining enterprise adoption over the last few years.

In some ways R is not a typical programming language, as it's not a general-purpose one. R's roots lie in statistics, as it was developed specifically to deal with such needs.

R is a language purpose-built for statistics and data science, but its applicability in the enterprise is largely dependent on its supporting ecosystem.

That, and the fact that it's open source, make for a wealth of off-the-shelf libraries for common and not-so-common related tasks. The flip side of this is that R has been plagued by issues such as memory management and security, and its syntax is not very straightforward or disciplined.

In the past few years, R has seen development environments being built around it in order to fill the gaps required to take it out of the data science lab and into enterprise deployments.

One of those, created by Revolution Analytics, has been integrated into Microsoft's offerings (Visual Studio, SQL Server, Power BI and Azure) following the company's acquisition by Microsoft. Another one, RStudio, has been integrated initially with Apache Spark and now with Databricks.

The way this was done is indicative of another strength of R: its package system. It is through this, and its ties with the academic community, that R keeps up to date with all the latest developments in data science and machine learning.

While R may be a good choice for development, its value in production is highly dependent on its supporting ecosystem.

Julia, Golang, Rust, Swift, and JVM languages

And what about those who do not want the dynamic typing of Python, or the legacy baggage of Java or C/C++? Apart from the fact that Python 3.6 and later supports optional type annotations that external tools can check statically, there are a number of alternatives.
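
The typing support in recent Python versions takes the form of optional annotations. A minimal sketch follows, with mean_squared_error as a hypothetical example function. The interpreter stores the annotations but does not enforce them; catching a mismatch before the code runs is left to a separate checker such as mypy.

```python
from typing import List


def mean_squared_error(predictions: List[float], targets: List[float]) -> float:
    """Average squared difference between two equally long lists of floats."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


# The annotations are recorded on the function object, not checked at call time.
print(mean_squared_error.__annotations__)
print(mean_squared_error([1.0, 2.0], [1.0, 4.0]))  # prints 2.0

# A checker such as mypy, run separately, would be expected to flag a call like
# mean_squared_error("0.5", [1.0]) because a str is not a List[float].
```

Because the annotations are optional, they can be added gradually to the parts of a codebase where a late type error would be most expensive.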

Burkov notes that Scala and Kotlin, two newer languages based on the JVM, have optional typing, but a steep learning curve and low user adoption, respectively. And, in the end, we might add, they also come with the same restrictions imposed by the JVM.

Swift, notes Burkov, has static typing but low availability of machine learning and data analysis libraries. Other options suggested by contributors in the same thread are Golang, Julia, and Rust.

All programming languages have their proponents, but not all are equally equipped with libraries for data science and machine learning. (Image: Igor Stevanovic, Getty Images/iStockphoto)

Golang has been pointed out as being fast, thread ready, easy, clean, compiled, and simple. And it has increasing support for libraries for NLP, general machine learning, and data analysis, extraction, processing and visualization.

Julia has been pointed out as being flexible with type usage and JIT compiled, similar to Java, but with execution speed comparable to C. It's a relatively new language, so its community is not the biggest around. However, Julia does have some support for machine learning libraries.

Rust has been pointed out as compiling natively and efficiently like plain C/C++, lacking garbage collection, and being type safe and rich. Even its proponents admit, though, that it is not really ready for ML due to a lack of ML-specific libraries.

The choice of programming language is not a simple one, and in the end it may not even be the most important one either. As pointed out by Luiz Eduardo Le Masson, data science leader at Stone Co.:

"For 'ordinary machine learning,' it does not matter what language you use. But when you need to have real online learning algorithms and inferences in realtime for millions of simultaneous clusters and respond in less than 500 ms, the topic does not only involve languages, but architecture, design, flow control, fault tolerance, resilience."
