Python vs R for data science: Professor rates programming language rivals

R expert hopes to settle the debate with an analysis of the programming languages that's "fair and helpful".
Written by Liam Tung, Contributing Writer

Programming languages Python and R are often pitted against each other over which is best for data science and analysis. Both are popular, although Python appears to be much more widely used, at least by people learning how to program. 

But data science is a specific field, so while Python is emerging as the most popular language in the world, R still has its place and has advantages for those doing data analysis. 

Hoping to settle the perennial R versus Python debate, University of California, Davis, professor of computer science Norm Matloff has published a concise wrap-up of their relative strengths across key measures, including elegance, the fields they're used in, library ecosystems, and difficulty to learn. 

Matloff has written four books about R and is the editor in chief of the R Journal, so he could be seen to favor it over Python. But he says he hopes his analysis is seen as "fair and helpful". 

He says it's a "clear win for Python" when it comes to elegance, in part due to Python's limited use of parentheses and braces. "Python is sleek," he adds. 

But for newcomers to programming, it's a "huge win for R". His argument against Python is that anyone using it for data science must first learn extra packages such as NumPy, which brings Matlab-like data-analysis powers to Python. R, which is built for statistical computing, has data-analysis features built in.

"By contrast, matrix types and basic graphics are built in to base R. The novice can be doing simple data analyses within minutes," contends Matloff. 
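To make that point concrete, here is a minimal sketch of what a matrix-style calculation looks like using only Python's standard library. The values and variable names are made up for illustration and do not come from Matloff's analysis. Because core Python has no matrix type, the matrix-vector product and column means have to be spelled out by hand on nested lists, which is the gap NumPy is usually brought in to fill.

```python
from statistics import mean

# A 2x2 "matrix" as a list of rows, plus a vector (illustrative values only).
rows = [[1.0, 2.0], [3.0, 4.0]]
v = [1.0, 1.0]

# Matrix-vector product written out by hand: one dot product per row.
product = [sum(x * y for x, y in zip(row, v)) for row in rows]

# Column means: zip(*rows) transposes the rows into columns.
col_means = [mean(col) for col in zip(*rows)]

# With NumPy installed, the same two steps would be `a @ v` and
# `a.mean(axis=0)` on an array `a = numpy.array(rows)`.
print(product, col_means)
```

In base R, by Matloff's account, the equivalent matrix type and operations are available without installing or importing anything.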

"Python libraries can be tricky to configure, even for the systems-savvy, while most R packages run right out of the box."

The Python Package Index (PyPI) currently has over 183,000 projects, greatly outnumbering R packages available on the Comprehensive R Archive Network (CRAN). According to CRAN, there are 14,385 packages available. Despite this difference, Matloff considers it a tie. 

PyPI, he notes, "seems thin on data science." Searches on PyPI "turned up nothing" for log-linear model, Poisson regression, instrumental variables, spatial data, and familywise error rate.   

However, Python does have a "slight edge" over R in machine learning, and Matloff seems to be calling for the development of machine-learning libraries for R, which he says could be done with little difficulty. 

"The Python libraries' power comes from setting certain image-smoothing ops, which easily could be implemented in R's Keras wrapper, and for that matter, a pure-R version of TensorFlow could be developed," argues Matloff. 

He goes on to take a swipe at typically pro-Python machine learning (ML) people who "often have a poor understanding of, and in some cases even a disdain for, the statistical issues in ML". Therefore, on the question of which language has the greatest statistical correctness, it's a "big win for R".

One "horrible loss for R" is its language unity. R, he says, is "devolving into two mutually unintelligible dialects, ordinary R and the Tidyverse". And he blames that situation squarely on the company RStudio.

Tidyverse is a collection of very popular R packages. Basically, Matloff believes a commercial outfit like RStudio shouldn't have the "undue influence" it has over the R project. 

"It might be more acceptable if the Tidyverse were superior to ordinary R, but in my opinion it is not. It makes things more difficult for beginners. Eg, the Tidyverse has so many functions, some complex, that must be learned to do what are very simple operations in base R," argues Matloff.
