Why do data scientists prefer R?


Author: Enoch Kan
Compiled by: Mika
This article is an original work by a CDA data analyst. Permission is required for reprinting.

For decades, researchers and developers have debated whether Python or R is the better choice for data science and data analysis. In recent years, data science has grown rapidly in industries such as biotechnology, finance, and social media. Its importance is being recognized not only by industry insiders but also by academic institutions, and more and more schools are starting to offer data science degrees.

As open source technologies quickly replaced traditional closed source technologies, Python and R became increasingly popular in data science.

Short introduction

Python was invented by Guido van Rossum and first released in 1991. Python 2.0 was released in 2000 and Python 3.0 eight years later. Python 3.0 contains some important syntax changes and is not backward compatible with Python 2. However, tools such as 2to3 can automatically convert Python 2 code to Python 3. Python 2 is scheduled to reach end of life in 2020.
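To make the incompatibility concrete, here is a minimal sketch (run under Python 3) of two well-known differences: the print statement became a function, and "/" on two integers now performs true division. The helper names are hypothetical and chosen only for this illustration.

```python
def is_valid_python3(source):
    """Return True if the source string compiles under Python 3."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


def true_division(a, b):
    # In Python 3, "/" always performs true division.
    # In Python 2, dividing two ints with "/" floored the result instead.
    return a / b


def floor_division(a, b):
    # "//" floors the result in both Python 2 and Python 3.
    return a // b


# Python 2 style print statement: a syntax error in Python 3.
print(is_valid_python3('print "hello"'))
# Python 3 style print function: compiles fine.
print(is_valid_python3('print("hello")'))
print(true_division(7, 2), floor_division(7, 2))
```

A converter such as 2to3 rewrites the first form into the second automatically, which is why most of the migration work can be mechanical.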

The R language was invented in 1995 by Ross Ihaka and Robert Gentleman. R started out as an implementation of the S language, which John Chambers invented in 1976. The first stable version, R 1.0.0, was released in 2000. R is currently maintained by the R Core Team, and the latest stable release at the time of writing is 3.5.1. In contrast to Python, R has never gone through a major change that required syntax conversion.


Both Python and R enjoy tremendous support from their user communities. According to a 2017 survey by Stack Overflow, nearly 45% of data scientists use Python as their primary programming language, while 11.2% use R.

It is worth noting that within the Python ecosystem, Jupyter Notebook in particular has been in high demand in recent years. Although Jupyter Notebook supports languages other than Python, it is mainly used to document and present Python programs in the browser, for example in data science competitions such as those on Kaggle. According to a survey by Ben Frederickson, the percentage of monthly active users (MAU) of Jupyter Notebook on GitHub increased significantly after 2015.

As Python has grown in popularity over the past few years, the percentage of monthly active users who use R on GitHub has decreased.

Even so, these two languages are still very popular with data scientists, engineers, and analysts.

Availability

Originally used in research and academia, R is now more than just a statistical language. R can easily be downloaded from the Comprehensive R Archive Network (CRAN). CRAN also serves as a package repository, hosting more than 10,000 packages. Popular open source integrated development environments (IDEs) such as RStudio can be used to run R.

As someone with a statistics background, I can confirm that R has a very strong user base on Stack Overflow. Many of the R-related questions I came across during my undergraduate studies can be found under the R tag on Stack Overflow. If you're just starting out, introductory R and Python courses are available on online platforms such as Coursera.

Setting up a Python development environment on your local computer is also easy. In fact, recent Macs ship with Python 2.7 and several useful libraries preinstalled. If you're a Mac user like me, I recommend checking out Brian Torres-Gil's guide:

Definitive Guide to Python on Mac OSX
https://medium.com/@briantorresgil/definitive-guide-to-python-on-mac-osx-65acd8d969d0

Open source Python package management systems such as PyPI and Anaconda are available from their official websites. Anaconda also supports the R language, although most R users prefer to manage packages directly through CRAN. PyPI, and Python in general, has far more packages than R. However, not all of them are relevant to statistics and data analysis.

Visualization

Both Python and R have excellent visualization libraries. ggplot2, created by RStudio chief scientist Hadley Wickham, is one of the most popular data visualization packages in R's history. I really like the variety of features and customizations ggplot2 offers. Compared to base R graphics, ggplot2 lets users customize plot components at a higher level of abstraction. ggplot2 offers more than 50 types of plots suitable for different industries. My favorites include calendar heat maps, hierarchical tree maps, and cluster plots. Selva Prabhakaran has a great tutorial on how to use ggplot2.

Python also has excellent data visualization libraries. Matplotlib and its extension seaborn are useful for visualizing data and creating statistical charts. I recommend reading George Seif's article on visualizations to better understand matplotlib.

Like R's ggplot2, matplotlib can create a variety of plots, such as histograms, vector field streamline plots, and radar plots. Probably one of the best features of matplotlib is its topographic hillshading. In my opinion, it is more powerful than the hillshading in R's raster package.
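As a rough illustration of the hillshading feature, the following sketch shades a synthetic elevation grid with matplotlib's LightSource class. The terrain, the function names, the light direction, and the output file name are all assumptions made for this example, not anything taken from a real dataset.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import LightSource


def make_terrain(n=200):
    """Build a synthetic elevation grid (two Gaussian hills, in arbitrary units)."""
    x, y = np.meshgrid(np.linspace(-3, 3, n), np.linspace(-3, 3, n))
    hills = np.exp(-(x**2 + y**2)) + 0.5 * np.exp(-((x - 1.5) ** 2 + (y + 1) ** 2))
    return 100.0 * hills


def hillshade(elevation, azimuth=315, altitude=45):
    """Compute illumination values for the grid, lit from the given direction."""
    light = LightSource(azdeg=azimuth, altdeg=altitude)
    return light.hillshade(elevation, vert_exag=1.0)


elevation = make_terrain()
shaded = hillshade(elevation)

fig, ax = plt.subplots()
ax.imshow(shaded, cmap="gray")
ax.set_title("Hillshade of synthetic terrain")
output_path = os.path.join(tempfile.gettempdir(), "hillshade_example.png")
fig.savefig(output_path)
plt.close(fig)
```

With real data, the elevation array would typically come from a digital elevation model, and the grid spacing should be passed to hillshade through its dx and dy arguments.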

Both R and Python have wrappers for Leaflet.js, an interactive mapping library written in JavaScript. Leaflet.js is one of the best open source GIS technologies I've used, as it integrates seamlessly with OpenStreetMap and Google Maps. Leaflet.js also makes it easy to create bubble, heat, and contour maps. I strongly recommend the Leaflet.js wrappers for both Python and R. They are easier to install than Basemap and other GIS libraries.

Plotly is a great graphics library for both Python and R. Plotly (or Plot.ly) is built with Python and the Django framework. Its frontend is written in JavaScript, and it integrates with Python, R, MATLAB, Perl, Julia, Arduino, and REST. If you want to build a web application to display visualizations, I recommend Plotly because it provides interactive charts with sliders and buttons.

Predictive analysis

Both Python and R have powerful predictive analysis libraries. It is difficult to compare the performance of the two in high-level predictive modeling. R was written specifically as a statistical language, so it is easier to find information about statistical modeling in R than in Python.

A Google search on statistical modeling in R returns about 60 million results, roughly 37 times as many as the equivalent search for Python. However, data scientists with a software engineering background may find Python easier to use, because R was, after all, written by statisticians. That said, I find both R and Python easier to understand than most other programming languages.

Kaggle user NanoMathias conducted a survey to determine whether Python or R is better suited for predictive analysis. He concluded that the numbers of Python and R users were roughly the same among data scientists and analysts. His research also found that people with more than 12 years of programming experience were more likely to choose R over Python. This suggests that the choice between R and Python for predictive analysis comes down to personal preference.

Hence, it is widely believed that the two languages have similar predictive capabilities. But is that really the case?

Let's use R and Python to fit a logistic regression model to the Iris dataset and calculate the accuracy of its predictions. I chose the Iris dataset because it is small and has no missing data. I did no exploratory data analysis or feature engineering here. I simply made an 80-20 train-test split and fitted a logistic regression model with a single predictor.

R's glm model reaches 95% accuracy, which is not bad.

Python's sklearn logistic regression model reaches 90% accuracy.

Using R's stats functions and Python's scikit-learn, I fitted two logistic regression models to a random subset of the Iris dataset. Each model uses only one predictor to predict the flower species. Both models reach at least 90% accuracy, with R performing slightly better. However, this is not enough to prove that R has better predictive models than Python. Logistic regression is just one of many predictive models that can be built in Python and R.
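For readers who want to try the Python side of this experiment, here is a minimal sketch with scikit-learn. It is not the script used for the numbers above: the choice of petal length as the single predictor, the random seed, the stratified split, and the function name are all assumptions, so the accuracy it prints will not necessarily match the figures quoted here.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def iris_single_predictor_accuracy(feature_index=2, seed=42):
    """Fit logistic regression on one Iris feature and return test accuracy.

    feature_index=2 selects petal length (an assumption made for this sketch).
    """
    X, y = load_iris(return_X_y=True)
    X_one = X[:, [feature_index]]  # keep a 2-D array with a single column

    # 80-20 train-test split, stratified so each species appears in both sets.
    X_train, X_test, y_train, y_test = train_test_split(
        X_one, y, test_size=0.2, random_state=seed, stratify=y
    )

    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))


accuracy = iris_single_predictor_accuracy()
print(f"Test accuracy: {accuracy:.2f}")
```

Because the test set contains only 30 flowers, a single misclassification moves the accuracy by more than three percentage points, so small differences between two runs or two languages should be read with caution.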

One area where Python has an edge over R is its excellent deep learning libraries. Popular Python deep learning libraries include TensorFlow, Theano, and Keras. These libraries come with many written tutorials, and Siraj Raval has also published several tutorials on YouTube.

To be honest, I'd rather spend an hour building a deep convolutional neural network in Keras than spend half a day figuring out how to implement it in R. Igor Bobriakov has written a lot of articles in this area, and I recommend checking them out.

Performance

Measuring the speed of a programming language is rarely straightforward. Each language has built-in optimizations for specific tasks (e.g. R is optimized for statistical analysis). There are many different ways to run performance tests on Python and R. I wrote two simple scripts, one in Python and one in R, to compare the time each takes to load the user file from the Yelp academic dataset, which is a little over 2 GB.

[Code listings: R and Python scripts for timing the JSON load]
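The following is a minimal sketch of how such a load-time measurement can be set up in Python. It is not the script behind the result reported here. It assumes the file stores one JSON object per line, and it generates a small sample file with hypothetical field names instead of reading the 2 GB Yelp file.

```python
import json
import os
import tempfile
import time


def write_sample_json_lines(path, n_records=10_000):
    """Write a small JSON-lines file with hypothetical user fields."""
    with open(path, "w", encoding="utf-8") as f:
        for i in range(n_records):
            record = {"user_id": f"user_{i}", "review_count": i % 50, "average_stars": 3.5}
            f.write(json.dumps(record) + "\n")


def timed_load_json_lines(path):
    """Parse every line of the file as JSON and return (records, seconds)."""
    start = time.perf_counter()
    with open(path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    elapsed = time.perf_counter() - start
    return records, elapsed


with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample_users.json")
    write_sample_json_lines(sample_path)
    records, seconds = timed_load_json_lines(sample_path)

print(f"Loaded {len(records)} records in {seconds:.3f} s")
```

To time a real file, point timed_load_json_lines at its path. Running the measurement several times and comparing the fastest runs helps reduce noise from disk caching.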

Surprisingly, R loaded the JSON file almost five times faster than Python, even though Python is generally known to have faster load times than R, as Brian Ray's tests show. Let's see how the two languages handle a large CSV file, since CSV is a commonly used data format. We modify the code above slightly to load the Seattle Library Collection Inventory dataset, which is about 4.5 GB in size.

Seattle Library Collection Inventory
https://www.kaggle.com/city-of-seattle/seattle-library-collection-inventory/version/15

[Code listings: R and Python scripts for timing the CSV load]
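A comparable sketch for the CSV case, using pandas, might look like the following. Again, this is an illustration rather than the script behind the result reported here: the sample file is generated on the fly, and its column names and size are assumptions that have nothing to do with the real 4.5 GB dataset.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd


def write_sample_csv(path, n_rows=50_000, seed=0):
    """Write a small CSV file with hypothetical inventory columns."""
    rng = np.random.default_rng(seed)
    frame = pd.DataFrame(
        {
            "item_id": rng.integers(1, 1_000_000, size=n_rows),
            "item_count": rng.integers(1, 5, size=n_rows),
            "item_type": rng.choice(["book", "dvd", "cd"], size=n_rows),
        }
    )
    frame.to_csv(path, index=False)


def timed_read_csv(path):
    """Load the CSV file with pandas and return (DataFrame, seconds)."""
    start = time.perf_counter()
    frame = pd.read_csv(path)
    elapsed = time.perf_counter() - start
    return frame, elapsed


with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "sample_inventory.csv")
    write_sample_csv(csv_path)
    inventory, csv_seconds = timed_read_csv(csv_path)

print(f"Loaded {inventory.shape[0]} rows in {csv_seconds:.3f} s")
```

For files that approach the size of available memory, pandas can also read in pieces through the chunksize argument of read_csv, at the cost of a more involved timing loop.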

R took almost twice as long as Python's pandas to load the 4.5 GB CSV file. Although pandas is written mainly in Python, the more critical parts of the library are written in Cython and C. Depending on the data format, this can affect the loading time.

Let's do something interesting.

Bootstrapping is a statistical method that randomly resamples, with replacement, from a population. It is a time-consuming process, because the data has to be resampled over many iterations. The following code measures the runtime needed to draw 100,000 bootstrap samples in R and in Python:

[Code listings: R and Python scripts for timing 100,000 bootstrap samples]
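As a minimal sketch of what such a benchmark can look like in Python, the code below bootstraps the mean of a sample with NumPy and times 100,000 replications. The statistic (the mean), the synthetic normal sample of 100 values, and the function name are assumptions for this example, so the timing will not reproduce the result reported here.

```python
import time

import numpy as np


def bootstrap_means(sample, n_boot, seed=0):
    """Return the mean of n_boot resamples drawn with replacement from sample."""
    rng = np.random.default_rng(seed)
    n = sample.size
    means = np.empty(n_boot)
    for i in range(n_boot):
        # Each bootstrap replicate has the same size as the original sample.
        resample = rng.choice(sample, size=n, replace=True)
        means[i] = resample.mean()
    return means


# Synthetic data standing in for a real sample (assumption for this sketch).
sample = np.random.default_rng(1).normal(loc=10.0, scale=2.0, size=100)

start = time.perf_counter()
boot_means = bootstrap_means(sample, n_boot=100_000)
boot_seconds = time.perf_counter() - start

# A 95% percentile interval for the mean, taken from the bootstrap distribution.
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"{boot_means.size} bootstrap samples in {boot_seconds:.2f} s")
print(f"95% interval for the mean: [{lower:.2f}, {upper:.2f}]")
```

The explicit loop keeps memory use low. Drawing all resample indices at once as a single large array would avoid the loop, but at this number of replications it needs far more memory.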

R took almost twice as long to run the bootstrap. Given that Python is often viewed as a "slow" programming language, this is quite surprising. I'm starting to regret using R instead of Python for my undergraduate statistics assignments.

Conclusion

This article only covers the basic differences between Python and R. Personally, I choose Python or R depending on the task at hand. Recently, data scientists have been grappling with how to use Python together with R. A third language is likely to emerge in the near future and eventually become more popular than both. As data scientists and engineers, we have a responsibility to keep up with the latest technology and stay innovative.

So do you prefer Python or R? Please leave us a message!