The top 3 tools in the data science tool box

Tool box

Many tools are available for working with data. Among them are a number of programming languages that I use regularly and have experience with: Python and SQL in particular, but R has also proven to be quite useful.

Other popular languages used in this field are Scala, Java and Julia. The latter is a language that strives for the ease of use of R and Python with the performance of C++.

Key takeaways:

  • Python is my default choice for any task related to data except configuring a database. Its large software repository allows fast results for almost anything you may want to do.
  • R is tailored for data analysis and statistics. R is an excellent choice to experiment with data, to visualise results and to publish the outcome of experiments with data.
  • SQL is essential for interacting efficiently with databases and it is used everywhere. I like to use SQL to query huge sets of data and to search for and select specific entries. SQL is simply the best tool for that job.

Python

Python is a general-purpose programming language that is simple and powerful. Scripts written in Python are relatively easy for others to read. Python has a comprehensive standard library that provides functionality for many tasks, and a huge amount of additional functionality is available through its third-party software repository, the Python Package Index (PyPI). As of early 2022, this repository contained more than 350,000 packages covering a wide range of functionality, including data analysis, databases, image processing, machine learning and web scraping, all of which are used in data science applications.

All of this makes Python a very popular language in many domains and the most popular language for data science.

Just to mention a few packages used for data science:

  • NumPy: used for scientific computing; it provides a high-performance multi-dimensional array object.
  • Pandas: used for data analysis and manipulation; it offers structures for manipulating tables and time series.
  • scikit-learn: used for machine learning.
  • matplotlib: used for graphics, provides lots of functionality, and it works with NumPy.
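To give a feel for two of these packages, here is a minimal sketch that assumes NumPy and pandas are installed. The city names and temperatures are made-up sample data, not taken from any real data set.

```python
import numpy as np
import pandas as pd

# NumPy: element-wise arithmetic on a whole array, without an explicit loop.
temperatures_c = np.array([18.5, 21.0, 19.5, 23.0])
temperatures_f = temperatures_c * 9 / 5 + 32

# pandas: a small table with labelled columns (made-up sample data).
df = pd.DataFrame(
    {
        "city": ["Utrecht", "Utrecht", "Leiden", "Leiden"],
        "temp_c": temperatures_c,
    }
)

# Group the rows by city and compute the mean temperature per group.
mean_by_city = df.groupby("city")["temp_c"].mean()
print(mean_by_city)
```

The NumPy array feeds straight into the pandas table, which is typical of how these packages are combined in practice.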

If anything, a weakness of Python is its performance for very specific compute-intensive tasks.
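A common way around this weakness is to hand the heavy loop to a compiled library. The sketch below is only an illustration and assumes NumPy is installed. The function names are hypothetical, and the timings it prints will differ from machine to machine.

```python
import timeit

import numpy as np

N = 100_000
values = list(range(N))
array = np.arange(N, dtype=np.int64)


def sum_of_squares_loop(xs):
    """Pure Python: the interpreter executes every iteration."""
    total = 0
    for x in xs:
        total += x * x
    return total


def sum_of_squares_numpy(arr):
    """NumPy: the loop runs inside compiled code."""
    return int(np.dot(arr, arr))


# Both approaches must give the same answer.
assert sum_of_squares_loop(values) == sum_of_squares_numpy(array)

# Rough timing comparison; no particular speed-up is guaranteed.
loop_time = timeit.timeit(lambda: sum_of_squares_loop(values), number=3)
numpy_time = timeit.timeit(lambda: sum_of_squares_numpy(array), number=3)
print(f"pure Python: {loop_time:.4f} s, NumPy: {numpy_time:.4f} s")
```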

R

More so than Python, R is a language designed for data science, specifically for statistics and the visualisation of results. It is based on the S statistical programming language. Like Python, it is open source and runs on all major platforms. One of R's key strengths is graphics: it can create high-quality graphs that are suitable for publication.

Like Python, R's functionality can be extended through packages. Currently, there are around 19,000 packages available in the repository at CRAN. The wealth of functionality offered by all these packages contributes to the popularity of R. R can also link to and call Fortran code. Fortran was long the de facto language for numerical computing, so many efficient numerical routines are available in it.

A weakness of R is its memory management: R essentially keeps the data it works on in available RAM. This benefits performance, but it limits the size of the data sets that you can use. When you hit those limits, you need to understand how memory management in R works.

The software repository of R contains a number of packages that are worth mentioning here for data science:

  • ggplot2: excellent package for visualisation based on the Grammar of Graphics
  • dplyr: package used for data manipulation
  • tidyr: helps to create tidy data (well-organised high quality data)
  • Shiny: builds web applications without requiring JavaScript
  • R Markdown: create fully reproducible documents

SQL

If you work with databases, you can't get around SQL. SQL is a programming language that is used to manage data stored in relational databases.
A relational database is a type of database that has been used for many years in a huge number of applications.

SQL has been around since the mid-70s and was standardized in the mid-80s.
The standard has evolved to meet new requirements and the latest revision was released in 2019. Vendors of relational databases certify their products against the standard themselves. In practice, none of them is fully compliant. This is not seen as a major hurdle: database users consider performance more important than standard compliance.

In SQL, you only specify what needs to be done, not how. This differs from a programming language like C or Java, where the execution steps are part of the program. SQL is a very efficient language for accessing, filtering and manipulating data stored in a relational database, often with a single command. Besides its use as a query language, it is also a data manipulation, data control and data definition language.
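The difference between "what" and "how" can be sketched with Python's built-in sqlite3 module. The orders table, its columns and its rows are hypothetical sample data, used only for illustration.

```python
import sqlite3

# In-memory database with a hypothetical "orders" table (made-up sample data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 120.0), ("bob", 35.5), ("alice", 80.0), ("carol", 210.0)],
)

# Declarative: state WHAT is wanted (customers whose total exceeds 100),
# and leave it to the database engine to decide how to compute it.
query = """
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    HAVING SUM(amount) > 100
    ORDER BY customer
"""
declarative = conn.execute(query).fetchall()

# Imperative equivalent: every execution step is spelled out in the program.
totals = {}
for customer, amount in conn.execute("SELECT customer, amount FROM orders"):
    totals[customer] = totals.get(customer, 0.0) + amount
imperative = sorted((c, t) for c, t in totals.items() if t > 100)

conn.close()
print(declarative)
```

The same snippet also touches the other roles of SQL: CREATE TABLE is data definition and INSERT is data manipulation. Data control statements such as GRANT are not shown, because SQLite does not support them.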

Other types of databases, such as NoSQL databases and graph databases, often offer SQL-like languages for querying and other purposes.
