I recently joined an amazing community of data scientists in Nigeria. We are a mix of experts and beginners. Today I created a tutorial to show the beginners how to do a common task, plotting a frequency distribution, in both Python and R. I also decided to include my dearest Microsoft Excel as a control.
The sample data is fictionalized: one day of sales data for the Lekki branch of Dominos Pizza Nigeria. You can download the raw data file to practice along here: https://dl.dropboxusercontent.com/u/28140414/Dominos%20Pizza.csv
So the business question we want to tackle is: Is there a pattern in the quantities each customer buys? To be more specific, we want to examine the frequency distribution of the quantities purchased per sales transaction.
In Excel, it is extremely straightforward. Just plot a histogram on the quantity field.
Now let’s do the same in R.
I use R 3.3.2 and RStudio.
First, I import the csv file into RStudio.
Next, I run the summary command on the dataframe/table. It is not necessary for what we want to do, but I like doing it for any data I bring into R.
> summary(Dominos_Pizza)
Then I check out the standard plot of the Quantity field. Again, this is not a required step.
> plot(Dominos_Pizza$Quantity)
Finally, I create the histogram of the Quantity field.
> hist(Dominos_Pizza$Quantity)
For now I don’t bother customizing the graph elements (labels, color, title, etc.)
It is Python time.
I use Rodeo IDE and Anaconda.
I import Pandas and use it to read in the csv file.
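Here is a minimal sketch of that step. Only the Quantity column name and the file name come from the tutorial data; the inline sample rows, the Transaction column and the load_sales helper are hypothetical stand-ins so the snippet runs even without the download.

```python
import io

import pandas as pd

# Hypothetical stand-in rows so the sketch runs without the download.
# Only the "Quantity" column name is taken from the tutorial data.
SAMPLE_CSV = """Transaction,Quantity
1,2
2,1
3,4
4,2
5,3
6,2
"""


def load_sales(source):
    """Read the sales data from a file path or a file-like object."""
    return pd.read_csv(source)


# With the downloaded file this would be: load_sales("Dominos Pizza.csv")
dominos_pizza = load_sales(io.StringIO(SAMPLE_CSV))

# Rough counterpart of R's summary(): count, mean, quartiles, min and max.
print(dominos_pizza.describe())
```

The describe() call is optional here, just like summary() was in R.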
And here is the plot graph, like we did in R.
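A sketch of how such a plot could be produced with pandas and matplotlib. The quantities below are made-up sample values, not the real sales data, and the output file name is my own choice. R's plot() on a single column draws each value against its row position, so I use point markers to get a similar picture.

```python
import matplotlib

matplotlib.use("Agg")  # draw to a file; drop this line to use an interactive window

import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample quantities standing in for the Quantity column of the csv file.
dominos_pizza = pd.DataFrame({"Quantity": [2, 1, 4, 2, 3, 2]})

# Plot each quantity against its row index, as point markers.
ax = dominos_pizza["Quantity"].plot(style="o")
ax.set_xlabel("Index")
ax.set_ylabel("Quantity")

plt.savefig("quantity_plot.png")  # hypothetical output file name
```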
Finally, I create the histogram.
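A sketch of the histogram step, again on made-up sample quantities rather than the real file. I also print the frequency table behind the chart with value_counts(). The bin edges centred on whole numbers are my own choice for integer quantities, not something the tutorial data requires.

```python
import matplotlib

matplotlib.use("Agg")  # draw to a file; drop this line to use an interactive window

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up sample quantities standing in for the Quantity column of the csv file.
dominos_pizza = pd.DataFrame({"Quantity": [2, 1, 4, 2, 3, 2]})
quantity = dominos_pizza["Quantity"]

# Frequency table: how many transactions had each quantity.
freq = quantity.value_counts().sort_index()
print(freq)

# One bin per whole number: edges at 0.5, 1.5, ... up to max + 0.5.
bins = np.arange(0.5, quantity.max() + 1.5)
ax = quantity.plot.hist(bins=bins)
ax.set_xlabel("Quantity")

plt.savefig("quantity_hist.png")  # hypothetical output file name
```

Calling quantity.plot.hist() without the bins argument lets pandas pick ten equal-width bins, which is closer in spirit to the default hist() call in R.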
I will try to follow up with more tutorials on more complex tasks, some that are best suited to R and others that are best suited to Python. As for Excel, it is in a completely different class. It is a spreadsheet application.
Got any particular task you would like me to create a tutorial around? Ask away!