Excel is a fantastic tool and a workhorse of an application. Anywhere you work, it’s there! However, once you try to get some serious analysis done, you quickly run into the limits of what the built-in formulas can return and get lost in the many steps needed to clean and structure your data. Below are my top three reasons to stop using Excel and start using RStudio for analysis and reporting work.
Being able to reproduce your results quickly is a huge time saver compared to the alternative. If you are working with multiple files or data sources that require several steps to clean and shape, repeating all of those steps manually in Excel will easily make you go insane.
Let’s go on a little journey. You have two CSV files with thousands of records. Both of these files need to be joined together based on an ID number, filtered down to only show records from this fiscal year, and then aggregated by vendor and then year.
In Excel, you could most definitely accomplish this task. It would require a handful of steps to get to the final numbers. Nothing too crazy, right?
In RStudio, you could also accomplish this task. It would require several lines of code to get to the final result. Nothing too crazy.
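To make that concrete, here is a minimal sketch of what those lines might look like using the dplyr and readr packages. The file names, column names, and fiscal year are assumptions for illustration; your actual data will differ.

```r
library(readr)
library(dplyr)

# Assumed file names and columns for illustration
orders  <- read_csv("orders.csv")   # contains id, vendor, fiscal_year, amount
vendors <- read_csv("vendors.csv")  # contains id plus vendor details

result <- orders %>%
  inner_join(vendors, by = "id") %>%                    # join the two files on the ID number
  filter(fiscal_year == 2024) %>%                       # keep only this fiscal year
  group_by(vendor, fiscal_year) %>%                     # aggregate by vendor, then year
  summarise(total = sum(amount), .groups = "drop")
```

The entire analysis lives in one pipeline, so every step is visible at a glance.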
Now assume that the data analyst who originally gave you the data informs you that a thousand records were missed, and all the analysis needs to be redone. I’ve been there before, and it’s enough to make you want to pull out your hair.
In Excel, you’ll need to start from square one with joining, filtering, and aggregating to get to the results. If you were in RStudio and performed the analysis with your code from the start, you could simply load the new file and have the resulting dataset in a few seconds.
The speed at which you can rework your script to get your result is incredible. Additionally, if your work is ever questioned, you can share your script and data files for easy reproducibility.
When you use RStudio to perform your analysis, you essentially have built-in process documentation. As you write code, line by line, your steps are logged in order. If you find an error in what you have written, you can go to the previous line, fix the error, and continue working. You are able to write code while evaluating the results as you work.
Using RStudio allows you to see exactly how your results were generated, without having to switch from coding and analysis to documenting your steps.
When using Excel, there aren’t many options for quickly documenting your process. You’ll need to keep step-by-step notes on every action you perform, all while constantly switching from documenting to performing actual analysis. I can’t tell you how much time I wasted bouncing back and forth between Excel and a notebook just to document my process.
Usually it’s after everything is complete that I realize I forgot some steps or what I wrote down didn’t exactly match what I performed in Excel. At that point, forget being able to have someone else reproduce the work you created, and heck, it might be impossible for you to exactly replicate what you did without all the steps.
Occasionally, the work that I do in RStudio is a one-off request that will never need to be reproduced, but most often I am given things to produce for teams on a consistent basis. When these requests come to me, it’s generally because the system they are using does not produce the results needed, or a team needs a custom report that is refreshed daily.
I try to write the R script in a way that allows me to manually process the request and then transform it into an automated script using a CRON job on my primary machine.
This way, I can write the R script once and set it to run at a specific time of day on a recurring schedule. So, as long as my machine is on, the CRON job will run the script using the inputs in the script and produce information that is usable by the organization.
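A scheduled job like this is just an entry in the crontab. The paths and script name below are assumptions for illustration; the schedule matches the example that follows, firing at 6 AM and 1 PM each day:

```shell
# Edit the schedule with `crontab -e`.
# Fields: minute hour day-of-month month day-of-week command
# Assumed paths for illustration only.
0 6,13 * * * /usr/local/bin/Rscript /home/me/reports/vendor_report.R >> /home/me/reports/cron.log 2>&1
```

Redirecting output to a log file makes it easy to check later whether a scheduled run succeeded.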
Example: CSV reports from an information system are delivered to a folder two times a day (5 AM and 12 PM). At 6 AM and 1 PM every day, a CRON job kicks off a script that filters the information by vendor, adds the data to a spreadsheet, and then splits that set of data into four sheets containing different user statuses and date calculations. Each of the 170 spreadsheets is stored where employees can access the information to process other requests.
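The per-vendor splitting step can be sketched in a few lines of R. This is not the production script; the input file, column names, and output location are assumptions, and writexl is one of several packages that can write multi-sheet workbooks:

```r
library(readr)
library(dplyr)
library(writexl)  # write_xlsx() turns a named list of data frames into sheets

# Assumed extract file and columns (vendor, status, last_updated) for illustration
users <- read_csv("daily_extract.csv") %>%
  mutate(days_since_update = as.integer(Sys.Date() - as.Date(last_updated)))

# One workbook per vendor, with one sheet per user status
for (v in unique(users$vendor)) {
  vendor_data <- filter(users, vendor == v)
  sheets <- split(vendor_data, vendor_data$status)
  write_xlsx(sheets, file.path("shared_drive", paste0(v, ".xlsx")))
}
```

Because the loop handles every vendor the same way, adding a new vendor requires no changes at all.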
Prior to this script, each internal employee would need to access the data for all vendors, filter the list down to the vendor they were looking into, filter based on status, and then perform the separate date calculations necessary. It was a time waster and caused unnecessary bottlenecks when trying to answer a simple question.
In my opinion, all data analysis should be performed in RStudio so that you can quickly rework your script, keep consistent documentation of the steps that produced your result, and easily automate the script if the need arises.
Doing data analysis and report generation outside RStudio (in Excel) can seem like a good approach at the start, but things quickly change when you are given additional data for your analysis, requiring you to completely start over. Additionally, when you use Excel, you must document each step you took to generate your results. Constantly switching from coding to documenting is a challenge, especially if you are getting into a flow with your work. Missing a step in your documentation is a real problem when someone else follows it and gets different results.
Finally, if you have to repeat your analysis or report over and over again, it’s generally a good idea to think about automation. Doing the same work over and over in Excel is great if you need job security, but it does nothing to remove bottlenecks in the organization. Automating a script that you’ve written in R is a game changer, and it ensures people have the information they need on a consistent schedule.
Once you make the switch from Excel to RStudio, you won’t go back!