Overview
Data science can be done in many different ways, with many different tools. I've compiled some common data science resources here, grouped them, and included links and short explanations. When I was starting out in data science, I wish I could have seen the most common tools grouped together like this, so I put a guide together myself.
Data Science Languages
There are many programming languages used in Data Science. Below are some of the most commonly used.
- Python – docs.python.org
"Python is an interpreted, object-oriented, high-level programming language with dynamic semantics. Its high-level built in data structures, combined with dynamic typing and dynamic binding, make it very attractive[...] Python's simple, easy to learn syntax emphasizes readability." - python.org/doc/essays/blurb/
- Does not need to be compiled
- Syntax is fairly straightforward
- There are many libraries and resources available
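That readable syntax is easy to show; here is a tiny, self-contained sketch of everyday Python (the variable names are just for illustration):

```python
# Filter and summarize a list of measurements in a few readable lines.
temps = [21.5, 19.0, 23.4, 25.1, 18.2]

warm = [t for t in temps if t > 20]   # list comprehension
average = sum(warm) / len(warm)       # no type declarations needed

print(f"{len(warm)} warm readings, average {average:.1f}")
```

No compile step, no boilerplate: save the file and run it.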
- Javascript
"The analogy for this question is: can I use a hammer to dig a hole? Sure you can, but it's going to cost you a lot more effort. Just use the shovel, just use Python / R." - Anon
- Less extensive resources and libraries for data science
- There can be issues with typing data structures
- Javascript has asynchronous processing which may help in some situations
- See also: Can Javascript Be Used For Data Science?
- SQL
SQL (Structured Query Language) is a programming language for relational databases. It can be used with database software to quickly filter through and work with large amounts of information.
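A quick way to try SQL without installing a database server is Python's built-in sqlite3 module. The table and column names below are made up for illustration:

```python
import sqlite3

# An in-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 45.5)],
)

# SQL filters and aggregates the rows for us.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('north', 165.5), ('south', 80.0)]
```

The same SELECT / GROUP BY pattern carries over to larger databases like PostgreSQL or MySQL.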
- R – https://www.r-project.org/
"R is a free software environment for statistical computing and graphics." - https://www.r-project.org/
- R can be used for complex statistical modeling and visualization
- R can do in a couple of lines of code what might take other languages many more.
- Julia – https://julialang.org/
Created by Jeff Bezanson, Stefan Karpinski, Viral B. Shah, and Alan Edelman in 2012, Julia infers the type of the data it's using at run time. It has a flexible typing system, so code works for multiple types of data without breaking; however, you can enforce constraints if you wish.
Julia uses a programming paradigm called 'Multiple Dispatch' where, based on all of the types of a function's inputs, the runtime environment will pick which 'version' of a function would best handle those data types.
- Julia can 'beat' python in some instances, but Python still remains more popular amongst data scientists.
- Can run on GPUs efficiently.
- Julia libraries end in .jl
- Not object oriented in the traditional sense
- IJulia is a Julia-language backend combined with the Jupyter interactive environment
See also:
Python vs. Julia: Key Differences
Why Was Julia Created?
Julia in 100 Seconds
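Julia's multiple dispatch picks an implementation based on the types of all of a function's arguments. Python's standard library has a single-argument analogue, functools.singledispatch, which at least conveys the flavor of the idea (this is an analogy, not Julia code):

```python
from functools import singledispatch

@singledispatch
def describe(x):
    # Fallback when no registered type matches.
    return "something else"

@describe.register
def _(x: int):
    return "an integer"

@describe.register
def _(x: list):
    return "a list"

# The runtime picks the 'version' that matches the argument's type.
print(describe(3), describe([1, 2]), describe("hi"))
```

Julia generalizes this to every argument position, and its compiler specializes each version for speed.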
Python Packages and Frameworks
Dataframes and Basic Data Wrangling
- NumPy – numpy.org
NumPy (Numerical Python) handles numerical computations beyond base Python. Pandas and Polars are more commonly used for everyday data work, but some complicated numerical computations, such as matrix operations and linear algebra, are still better done in NumPy. Pandas is built partially on NumPy. (2005)
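For example, matrix work that would need explicit loops in base Python is a single expression in NumPy:

```python
import numpy as np

a = np.array([[1.0, 2.0],
              [3.0, 4.0]])
b = np.array([[5.0, 6.0],
              [7.0, 8.0]])

product = a @ b              # matrix multiplication in one operator
col_means = a.mean(axis=0)   # vectorized column means, no loop

print(product)
print(col_means)
```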
- Pandas – pandas.pydata.org
The standard Python library for loading and working with data frames. Built on NumPy. Types are sometimes inferred rather than declared. Strengths: it has been the de facto library for data science and analysis for a while now.
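A minimal sketch of the typical pandas workflow (the column names here are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Oslo", "Lima", "Oslo", "Lima"],
    "temp": [4.0, 19.0, 6.0, 21.0],
})

# Filter rows with a boolean mask, then aggregate by group.
warm = df[df["temp"] > 5]
means = df.groupby("city")["temp"].mean()

print(warm)
print(means)
```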
- Polars – pola.rs
A newer Python library that works similarly to pandas. Many head-to-head tests show that Polars is faster. Types are never assumed. It runs in parallel and is written at a lower level (in Rust) than pandas, which is built on NumPy.
- Strengths: faster than pandas, and simpler in some instances.
- Because it's newer, fewer libraries work with it compared to pandas.
- See also: JetBrains blog – Polars vs Pandas
- GeoPandas – geopandas.org
GeoPandas combines pandas and shapely to provide geospatial operations in pandas. It enables operations in Python that would otherwise require a spatial database such as PostGIS. (2013)
- Xarray – xarray.dev
Xarray introduces labels in the form of dimensions, coordinates, and attributes on top of raw NumPy-like multidimensional arrays, which allows for a more intuitive, more concise, and less error-prone developer experience.
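A small sketch, assuming xarray is installed, of how those labels replace positional indexing (the dimension and coordinate names are made up):

```python
import numpy as np
import xarray as xr

data = xr.DataArray(
    np.arange(6).reshape(2, 3),
    dims=("time", "city"),
    coords={"time": [2023, 2024], "city": ["Oslo", "Lima", "Rome"]},
)

# Select by label instead of remembering which axis is which.
oslo = data.sel(city="Oslo")
print(oslo.values)  # [0 3]
```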
Machine Learning in Python
PyTorch, Scikit-learn, and TensorFlow are three separate, independent libraries for machine learning in Python.
- PyTorch – nvidia.com/glossary/pytorch
"PyTorch is a fully featured framework for building deep learning models, which is a type of machine learning that’s commonly used in applications like image recognition and language processing. Written in Python, it’s relatively easy for most machine learning developers to learn and use. PyTorch is distinctive for its excellent support for GPUs and its use of reverse-mode auto-differentiation, which enables computation graphs to be modified on the fly. This makes it a popular choice for fast experimentation and prototyping." (Nvidia)
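The reverse-mode auto-differentiation mentioned above can be seen in just a few lines, assuming PyTorch is installed:

```python
import torch

# Track operations on x so gradients can flow backward through them.
x = torch.tensor(2.0, requires_grad=True)
y = x ** 3 + 2 * x          # y = x^3 + 2x

y.backward()                # reverse-mode autodiff
print(x.grad)               # dy/dx = 3x^2 + 2 = 14 at x = 2
```

The computation graph is built on the fly as `y` is computed, which is what makes PyTorch handy for experimentation.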
- TensorFlow – tensorflow.org
"TensorFlow makes it easy to create ML models that can run in any environment."
A tensor is a multidimensional array, used to represent data with multiple dimensions, such as image data.
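The tensor concept itself can be illustrated without TensorFlow; an RGB image, for instance, is naturally a rank-3 array (height x width x channels), shown here with NumPy:

```python
import numpy as np

# A tiny 4x4 RGB "image": three stacked dimensions in one array.
image = np.zeros((4, 4, 3), dtype=np.uint8)
image[0, 0] = [255, 0, 0]   # top-left pixel is pure red

print(image.shape)  # (4, 4, 3)
print(image.ndim)   # 3 dimensions, i.e. a rank-3 tensor
```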
- Scikit-learn / sklearn – scikit-learn.org
Scikit-learn is a suite of machine learning tools for Python. Its main modules include:
- Classification
- Regression
- Clustering
- Dimensionality Reduction
- Model Selection
- Preprocessing
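A minimal end-to-end sketch tying a few of those modules together (classification plus model selection), using the bundled iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit a classifier on the training split, then score on held-out data.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"accuracy: {model.score(X_test, y_test):.2f}")
```

Swapping in a different estimator (a tree, an SVM) keeps the same fit / score interface, which is much of scikit-learn's appeal.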
- MLflow
MLflow is an open-source platform built to assist in the machine learning process. It helps with tracking, testing, and evaluation so the process runs more smoothly. It works with a variety of libraries, including the three listed previously.
Parallel Processing and Cloud Computing
- Apache Spark
The most widely used engine for scalable computing.
- PySpark – the Python API for Apache Spark: https://spark.apache.org/docs/latest/api/python/index.html
- Spark is also available for Python, SQL, Scala, Java, and R
- Spark can be used through Docker
- See also: Spark RDD Example
Spark allows computation loads to be spread across different machines (nodes) to make computation faster and more scalable. You can create a Spark session, manipulate data, and run computations, and Spark will share the task across the nodes. Spark is helpful for handling larger computation tasks where a single computer might struggle with the data.
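The split-the-work idea is not unique to Spark. As a loose, stdlib-only analogy (this is not actual Spark code), Python's concurrent.futures spreads a computation across local workers the way Spark spreads it across nodes:

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    # Each worker handles one partition of the data.
    return sum(x * x for x in chunk)

data = list(range(1_000))
chunks = [data[i::4] for i in range(4)]   # split into 4 partitions

# Map the work across workers, then reduce the partial results.
with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(partial_sum, chunks))

print(total)  # same answer as sum(x * x for x in data)
```

Spark does the same map-and-reduce pattern, but across machines, with fault tolerance and data shuffling handled for you.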
- Dask
- Similar to Apache Spark
- Built on pandas datasets
- According to their website, it runs about 50% faster than PySpark: https://docs.coiled.io/blog/spark-vs-dask.html
- Open source and free to run on your own computer.
APIs
- FastAPI – fastapi.tiangolo.com/
"FastAPI is a modern, fast (high-performance), web framework for building APIs with Python based on standard Python type hints." - fastapi.tiangolo.com/
Visualization Libraries
There are a lot of options for visualization, and some libraries will work just as well as some of the others. Find what works for you.
- Bokeh – bokeh.org/
- Altair – altair-viz.github.io/
- Matplotlib – matplotlib.org/stable/gallery/index
- Plotly / Express – plotly.com/graphing-libraries/
Express is a simplified wrapper for Plotly designed to make it faster and simpler to visualize information with less code.
- PlotAPI – plotapi.com/
- Seaborn – seaborn.pydata.org/
- Ggplot2 - ggplot2.tidyverse.org/
- D3.js - d3js.org
Table of Visualization Libraries
| Name | Year Released | Main Language | Other Languages / Interfaces | Strengths |
|---|---|---|---|---|
| Altair | 2016 | Python | None – generates Vega‑Lite JSON specs | Can stack simple syntax and graphs to make more complicated ones |
| Bokeh | 2013 | Python | JavaScript (BokehJS for rendering) | Interactive web plots and dashboards |
| Bqplot | 2014 | Python | None | Primarily designed for interactive Jupyter notebooks |
| D3.js | 2011 | JavaScript | Wrappers exist in other languages | Great for web visualizations |
| ggplot (Python port of ggplot2) | 2013 | Python | None – inspired by R's ggplot2 | Great for plotting in both R and Python; easily extendable |
| HoloViews | 2015 | Python | None (uses Bokeh or Matplotlib) | Makes visualization easy; great for large datasets; integrates with Bokeh/Matplotlib |
| hvPlot (HoloViz) | - | Python | None (built on HoloViews) | |
| Matplotlib | 2003 | Python | None officially | |
| PlotAPI | ? | Language‑agnostic | Any language that can make HTTP requests | Paid framework that excels at interactive, colorful, and dynamic visualization |
| Plotly / Plotly Express | 2013 | Python (primary) | R, MATLAB, Julia, JavaScript | Exceptional for graphs and interactive graphs |
| Seaborn | 2012 | Python | None | |
| Vega‑Lite | 2016 | JSON spec (JavaScript) | Python (via Altair), R (via wrappers) | |
R and R Studio
- R Studio – posit.co/download
- Tidyverse – tidyverse.org
- Dplyr – dplyr.tidyverse.org
- Car – cran.r-project.org/package=car
- Readr – readr.tidyverse.org
- Purrr – purrr.tidyverse.org
- Broom – broom.tidyverse.org
- Pander – cran.r-project.org/package=pander
A tool for rendering R objects into Pandoc's markdown format.
Other Tools
- Apache Superset
- Jupyter Notebook
Backend Integration Tools
- Node.js – https://nodejs.org/en
Node can be used for server-side ("backend") scripting, or just for scripts on your own computer. It can be used alone or alongside Python, and its asynchronous functionality can make it a better fit than Python for some of these use cases.
- Docker – docker.com
"Docker provides the ability to package and run an application in a loosely isolated environment called a container. The isolation and security lets you run many containers simultaneously on a given host. Containers are lightweight and contain everything needed to run the application, so you don't need to rely on what's installed on the host." - docker.com/get-started/docker-overview
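As a rough sketch of what that packaging looks like (the file names and pinned versions here are placeholders, not from any particular project), a small data-science container might be described like this:

```dockerfile
# Start from an official Python base image.
FROM python:3.11-slim

WORKDIR /app

# Install pinned dependencies first so this layer caches well.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the analysis code and run it by default.
COPY analyze.py .
CMD ["python", "analyze.py"]
```

Building and running it (`docker build -t myanalysis .` then `docker run myanalysis`) gives the same environment on any machine with Docker.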
Databases
SQL / Relational Databases
- MySQL – Open-source, widely used for web applications.
- PostgreSQL – Free, open-source SQL relational database with strong features.
- postgresql.org/about
- SQLite – Lightweight, embedded SQL database.
- Microsoft SQL Server – Enterprise-level database from Microsoft.
- Oracle Database – High-performance, enterprise-grade database.
- DuckDB – duckdb.org
- MotherDuck – a cloud service built on DuckDB
NoSQL Databases
- MongoDB – Document-oriented, JSON-based database.
- Uses BSON (binary JSON) and MQL (similar to SQL)
- Free and paid options (MongoDB Atlas, Enterprise)
- Sharding – Splitting up data across servers
- mongodb.com
- Cassandra – Distributed NoSQL database
- Great for high availability and heavy loads
- Free and open source
- Paid options: DataStax Enterprise, Amazon Keyspaces, Azure Cosmos DB
- “Distributed” means Cassandra can run on multiple machines while appearing to users as a unified whole
- cassandra.apache.org
- Redis – In-memory key-value store, great for caching
- Flat storage – Amazon S3
Dashboarding and Visualization Platforms
- Power BI
- Looker – Google's online dashboarding platform
- Redash
- Integrates SQL and dashboards in a frontend editor
- Connect to a backend database and create visualizations quickly
- Offers both open-source (free) and paid options
- Tableau – tableau.com
- BigQuery
- Snowflake – snowflake.com
Data Storage Formats
- JSON – JavaScript Object Notation
- Widely used in many programming languages
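Python's standard library reads and writes JSON directly (the record fields below are made up):

```python
import json

record = {"name": "sensor-7", "readings": [1.2, 3.4], "active": True}

text = json.dumps(record)      # serialize to a JSON string
restored = json.loads(text)    # parse it back into Python objects

print(text)
print(restored == record)      # the round trip preserves the data
```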
- SQLite
- Implements part of SQL to store data in a single file
- Readable by other programs, such as Python or R
- Parquet – parquet.apache.org
- Column-oriented file format for efficient storage
- CSV – Comma-separated values
- A simple, human-readable plaintext format
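Reading CSV needs nothing beyond the standard library; the file contents below are made up, with io.StringIO standing in for a real file on disk:

```python
import csv
import io

raw = "name,score\nada,90\ngrace,95\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Note that every value comes back as a string.
print(rows)  # [{'name': 'ada', 'score': '90'}, {'name': 'grace', 'score': '95'}]
```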
Monitoring and Metrics
These tools and services are better geared towards companies than individuals, but are still worth knowing about.
- Datadog – datadoghq.com
"Datadog is the observability and security platform for cloud applications."
From datadoghq.com/about/leadership
Datadog is a performance and metrics tracking service geared towards large networks and interconnected applications.
- Datathink
- Snowflake
- Amazon Redshift
"Tens of thousands of customers use Amazon Redshift for modern data analytics at scale..."
Conclusion
There are so many tools and options available for data science. I hope that aggregating and succinctly explaining what these tools are for can help you better explore and analyze your next data set. Happy wrangling!