Overview
Data science can be done in many different ways, with many different tools. I've compiled some common data science resources here, grouped them, and included links and short explanations. When I was starting out in data science, I wish I could have seen the most common tools grouped together like this, so I put a guide together myself.
Data Science Languages
There are many programming languages used in Data Science. Below are some of the most commonly used.
- Python – docs.python.org
"Python is an interpreted, object-oriented, high-level programming language with dynamic semantics. Its high-level built in data structures, combined with dynamic typing and dynamic binding, make it very attractive[...] Python's simple, easy to learn syntax emphasizes readability." - python.org/doc/essays/blurb/
- Does not need to be compiled
- Syntax is fairly straightforward
- There are many libraries and resources available
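That readable syntax is easy to show; here is a tiny, self-contained sketch of everyday Python (the variable names are just for illustration):

```python
# Filter and summarize a list of measurements in a few readable lines.
temps = [21.5, 19.0, 23.4, 25.1, 18.2]

warm = [t for t in temps if t > 20]   # list comprehension
average = sum(warm) / len(warm)       # no type declarations needed

print(f"{len(warm)} warm readings, average {average:.1f}")
```

No compile step, no boilerplate: save the file and run it.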
- Javascript
"The analogy for this question is: can I use a hammer to dig a hole? Sure you can, but it's going to cost you a lot more effort. Just use the shovel, just use Python / R." - Anon
- Less extensive resources and libraries for data science
- There can be issues with typing data structures
- Javascript has asynchronous processing which may help in some situations
- See also: Can Javascript Be Used For Data Science?
- SQL
SQL (Structured Query Language) is a programming language for relational databases. It can be used with database software to quickly filter through and work with large amounts of information.
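A quick way to try SQL without installing a database server is Python's built-in sqlite3 module. The table and column names below are made up for illustration:

```python
import sqlite3

# An in-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 45.5)],
)

# SQL filters and aggregates the rows for us.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('north', 165.5), ('south', 80.0)]
```

The same SELECT / GROUP BY pattern carries over to larger databases like PostgreSQL or MySQL.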
- R – https://www.r-project.org/
"R is a free software environment for statistical computing and graphics." - https://www.r-project.org/
- R can be used for complex statistical modeling and visualization
- R can do in a couple of lines of code what might take other languages many more.
- Julia – https://julialang.org/
Created by Jeff Bezanson, Stefan Karpinski, Viral B. Shah, and Alan Edelman in 2012, Julia infers the type of the data it's using at run time. It has a flexible typing system, so code works for multiple types of data without breaking; however, you can enforce constraints if you wish.
Julia uses a programming paradigm called 'Multiple Dispatch' where, based on all of the types of a function's inputs, the runtime environment will pick which 'version' of a function would best handle those data types.
- Julia can 'beat' python in some instances, but Python still remains more popular amongst data scientists.
- Can run on GPUs efficiently.
- Julia libraries end in .jl
- Not object oriented in the traditional sense
- IJulia is a Julia-language backend combined with the Jupyter interactive environment
See also:
Python vs. Julia: Key Differences
Why Was Julia Created?
Julia in 100 Seconds
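Julia's multiple dispatch picks an implementation based on the types of all of a function's arguments. Python's standard library has a single-argument analogue, functools.singledispatch, which at least conveys the flavor of the idea (this is an analogy, not Julia code):

```python
from functools import singledispatch

@singledispatch
def describe(x):
    # Fallback when no registered type matches.
    return "something else"

@describe.register
def _(x: int):
    return "an integer"

@describe.register
def _(x: list):
    return "a list"

# The runtime picks the 'version' that matches the argument's type.
print(describe(3), describe([1, 2]), describe("hi"))
```

Julia generalizes this to every argument position, and its compiler specializes each version for speed.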
Python Packages and Frameworks
Dataframes and Basic Data Wrangling
- NumPy – numpy.org
NumPy (Numerical Python) handles numerical computations beyond base Python. Pandas and Polars are more commonly used for everyday data work, but some complicated numerical computations, such as matrix operations and linear algebra, are still better done in NumPy. Pandas is built partially on NumPy. (2005)
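For example, matrix work that would need explicit loops in base Python is a single expression in NumPy:

```python
import numpy as np

a = np.array([[1.0, 2.0],
              [3.0, 4.0]])
b = np.array([[5.0, 6.0],
              [7.0, 8.0]])

product = a @ b              # matrix multiplication in one operator
col_means = a.mean(axis=0)   # vectorized column means, no loop

print(product)
print(col_means)
```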
- Pandas – pandas.pydata.org
The standard Python library for loading and working with data frames. Built on NumPy. Types are sometimes inferred rather than declared. Strengths: it has been the de facto library for data science and analysis for a while now.
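A minimal sketch of the typical pandas workflow (the column names here are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Oslo", "Lima", "Oslo", "Lima"],
    "temp": [4.0, 19.0, 6.0, 21.0],
})

# Filter rows with a boolean mask, then aggregate by group.
warm = df[df["temp"] > 5]
means = df.groupby("city")["temp"].mean()

print(warm)
print(means)
```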
- Polars – pola.rs
A newer Python library that works similarly to pandas. Many head-to-head tests show that Polars is faster. Types are never assumed. It runs in parallel and is written at a lower level (in Rust) than pandas, which is built on NumPy.
- Strengths: faster than pandas, and simpler in some instances.
- Because it's newer, fewer libraries work with it compared to pandas.
- See also: JetBrains blog – Polars vs Pandas
- GeoPandas – geopandas.org
GeoPandas combines pandas and shapely to provide geospatial operations in pandas. It enables operations in Python that would otherwise require a spatial database such as PostGIS. (2013)
- Xarray – xarray.dev
Xarray introduces labels in the form of dimensions, coordinates, and attributes on top of raw NumPy-like multidimensional arrays, which allows for a more intuitive, more concise, and less error-prone developer experience.
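A small sketch, assuming xarray is installed, of how those labels replace positional indexing (the dimension and coordinate names are made up):

```python
import numpy as np
import xarray as xr

data = xr.DataArray(
    np.arange(6).reshape(2, 3),
    dims=("time", "city"),
    coords={"time": [2023, 2024], "city": ["Oslo", "Lima", "Rome"]},
)

# Select by label instead of remembering which axis is which.
oslo = data.sel(city="Oslo")
print(oslo.values)  # [0 3]
```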
Machine Learning in Python
PyTorch, Scikit-learn, and TensorFlow are three separate, independent libraries for machine learning in Python.
- PyTorch – nvidia.com/glossary/pytorch
"PyTorch is a fully featured framework for building deep learning models, which is a type of machine learning that’s commonly used in applications like image recognition and language processing. Written in Python, it’s relatively easy for most machine learning developers to learn and use. PyTorch is distinctive for its excellent support for GPUs and its use of reverse-mode auto-differentiation, which enables computation graphs to be modified on the fly. This makes it a popular choice for fast experimentation and prototyping." (Nvidia)
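The reverse-mode auto-differentiation mentioned above can be seen in just a few lines, assuming PyTorch is installed:

```python
import torch

# Track operations on x so gradients can flow backward through them.
x = torch.tensor(2.0, requires_grad=True)
y = x ** 3 + 2 * x          # y = x^3 + 2x

y.backward()                # reverse-mode autodiff
print(x.grad)               # dy/dx = 3x^2 + 2 = 14 at x = 2
```

The computation graph is built on the fly as `y` is computed, which is what makes PyTorch handy for experimentation.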
- TensorFlow – tensorflow.org
"TensorFlow makes it easy to create ML models that can run in any environment."
A tensor is a multidimensional array, used to represent data with multiple dimensions, such as image data.
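The tensor concept itself can be illustrated without TensorFlow; an RGB image, for instance, is naturally a rank-3 array (height x width x channels), shown here with NumPy:

```python
import numpy as np

# A tiny 4x4 RGB "image": three stacked dimensions in one array.
image = np.zeros((4, 4, 3), dtype=np.uint8)
image[0, 0] = [255, 0, 0]   # top-left pixel is pure red

print(image.shape)  # (4, 4, 3)
print(image.ndim)   # 3 dimensions, i.e. a rank-3 tensor
```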
- Scikit-learn / sklearn – scikit-learn.org
Scikit-learn is a suite of machine learning tools for Python. Its main modules include:
- Classification
- Regression
- Clustering
- Dimensionality Reduction
- Model Selection
- Preprocessing
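A minimal end-to-end sketch tying a few of those modules together (classification plus model selection), using the bundled iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit a classifier on the training split, then score on held-out data.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"accuracy: {model.score(X_test, y_test):.2f}")
```

Swapping in a different estimator (a tree, an SVM) keeps the same fit / score interface, which is much of scikit-learn's appeal.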
- MLflow
MLflow is an open-source platform built to assist in the machine learning process. It helps with tracking, testing, and evaluation so the process runs more smoothly. It works with a variety of libraries, including the three listed previously.
Parallel Processing and Cloud Computing
- Apache Spark
The most widely used engine for scalable computing.
- PySpark – the Python API for Apache Spark: https://spark.apache.org/docs/latest/api/python/index.html
- Spark is also available for Python, SQL, Scala, Java, and R
- Spark can be used through Docker
- See also: Spark RDD Example
Spark allows computation loads to be spread across different machines (nodes) to make computation faster and more scalable. You can create a Spark session, manipulate data, and run computations, and Spark will share the task across the nodes. Spark is helpful for handling larger computation tasks where a single computer might struggle with the data.
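The split-the-work idea is not unique to Spark. As a loose, stdlib-only analogy (this is not actual Spark code), Python's concurrent.futures spreads a computation across local workers the way Spark spreads it across nodes:

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    # Each worker handles one partition of the data.
    return sum(x * x for x in chunk)

data = list(range(1_000))
chunks = [data[i::4] for i in range(4)]   # split into 4 partitions

# Map the work across workers, then reduce the partial results.
with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(partial_sum, chunks))

print(total)  # same answer as sum(x * x for x in data)
```

Spark does the same map-and-reduce pattern, but across machines, with fault tolerance and data shuffling handled for you.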
- Dask
- Similar to Apache Spark
- Built on pandas datasets
- According to their website, it runs about 50% faster than PySpark: https://docs.coiled.io/blog/spark-vs-dask.html
- Open source and free to run on your own computer.
APIs
- FastAPI – fastapi.tiangolo.com/
"FastAPI is a modern, fast (high-performance), web framework for building APIs with Python based on standard Python type hints." - fastapi.tiangolo.com/
Visualization Libraries
There are a lot of options for visualization, and some libraries will work just as well as some of the others. Find what works for you.
- Bokeh – bokeh.org/
- Altair – altair-viz.github.io/
- Matplotlib – matplotlib.org/stable/gallery/index
- Plotly / Express – plotly.com/graphing-libraries/
Express is a simplified wrapper for Plotly designed to make it faster and simpler to visualize information with less code.
- PlotAPI – plotapi.com/
- Seaborn – seaborn.pydata.org/
- Ggplot2 - ggplot2.tidyverse.org/
- D3.js - d3js.org
Table of Visualization Libraries
| Name | Year Released | Main Language | Other Languages / Interfaces | Strengths |
|---|---|---|---|---|
| Altair | 2016 | Python | None – generates Vega‑Lite JSON specs | Can stack simple syntax and graphs to make more complicated ones |
| Bokeh | 2013 | Python | JavaScript (BokehJS for rendering) | Interactive web plots and dashboards |
| Bqplot | 2014 | Python | None | Primarily designed for interactive Jupyter notebooks |
| D3.js | 2011 | JavaScript | Wrappers exist in other languages | Great for web visualizations |
| ggplot (Python port of ggplot2) | 2013 | Python | None – inspired by R's ggplot2 | Great for plotting in both R and Python; easily extendable |
| HoloViews | 2015 | Python | None (uses Bokeh or Matplotlib) | Makes visualization easy; great for large datasets; integrates with Bokeh/Matplotlib |
| hvPlot (HoloViz) | - | Python | None (built on HoloViews) | |
| Matplotlib | 2003 | Python | None officially | |
| PlotAPI | ? | Language‑agnostic | Any language that can make HTTP requests | Paid framework that excels at interactive, colorful, and dynamic visualization |
| Plotly / Plotly Express | 2013 | Python (primary) | R, MATLAB, Julia, JavaScript | Exceptional for graphs and interactive graphs |
| Seaborn | 2012 | Python | None | |
| Vega‑Lite | 2016 | JSON spec (JavaScript) | Python (via Altair), R (via wrappers) | |
R and R Studio
- R Studio – posit.co/download
- Tidyverse – tidyverse.org
- Dplyr – dplyr.tidyverse.org
- Car – cran.r-project.org/package=car
- Readr – readr.tidyverse.org
- Purrr – purrr.tidyverse.org
- Broom – broom.tidyverse.org
- Pander – cran.r-project.org/package=pander
A tool for rendering R objects into Pandoc's markdown format.
Other Tools
- Apache Superset
- Jupyter Notebook
Backend Integration Tools
- Node.js – https://nodejs.org/en
Node can be used for server-side ("backend") scripting, or just for scripts on your own computer. It can be used alone or alongside Python, and its asynchronous functionality can make it a better fit than Python for some of these use cases.
- Docker – docker.com
"Docker provides the ability to package and run an application in a loosely isolated environment called a container. The isolation and security lets you run many containers simultaneously on a given host. Containers are lightweight and contain everything needed to run the application, so you don't need to rely on what's installed on the host." - docker.com/get-started/docker-overview
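As a rough sketch of what that packaging looks like (the file names and pinned versions here are placeholders, not from any particular project), a small data-science container might be described like this:

```dockerfile
# Start from an official Python base image.
FROM python:3.11-slim

WORKDIR /app

# Install pinned dependencies first so this layer caches well.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the analysis code and run it by default.
COPY analyze.py .
CMD ["python", "analyze.py"]
```

Building and running it (`docker build -t myanalysis .` then `docker run myanalysis`) gives the same environment on any machine with Docker.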
Databases
SQL / Relational Databases
- MySQL – Open-source, widely used for web applications.
- PostgreSQL – Free, open-source SQL relational database with strong features.
- postgresql.org/about
- SQLite – Lightweight, embedded SQL database.
- Microsoft SQL Server – Enterprise-level database from Microsoft.
- Oracle Database – High-performance, enterprise-grade database.
- DuckDB – duckdb.org
- MotherDuck – a cloud service built on DuckDB
NoSQL Databases
- MongoDB – Document-oriented, JSON-based database.
- Uses BSON (binary JSON) and MQL (similar to SQL)
- Free and paid options (MongoDB Atlas, Enterprise)
- Sharding – Splitting up data across servers
- mongodb.com
- Cassandra – Distributed NoSQL database
- Great for high availability and heavy loads
- Free and open source
- Paid options: DataStax Enterprise, Amazon Keyspaces, Azure Cosmos DB
- “Distributed” means Cassandra can run on multiple machines while appearing to users as a unified whole
- cassandra.apache.org
- Redis – In-memory key-value store, great for caching
- Flat storage – Amazon S3
Dashboarding and Visualization Platforms
- Power BI
- Looker – Google's online dashboarding platform
- Redash
- Integrates SQL and dashboards in a frontend editor
- Connect to a backend database and create visualizations quickly
- Offers both open-source (free) and paid options
- Tableau – tableau.com
- BigQuery
- Snowflake – snowflake.com
Data Storage Formats
- JSON – JavaScript Object Notation
- Widely used in many programming languages
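Python's standard library reads and writes JSON directly (the record fields below are made up):

```python
import json

record = {"name": "sensor-7", "readings": [1.2, 3.4], "active": True}

text = json.dumps(record)      # serialize to a JSON string
restored = json.loads(text)    # parse it back into Python objects

print(text)
print(restored == record)      # the round trip preserves the data
```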
- SQLite
- Implements part of SQL to store data in a single file
- Readable by other programs, such as Python or R
- Parquet – parquet.apache.org
- Column-oriented file format for efficient storage
- CSV – Comma-separated values
- A simple, human-readable plaintext format
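Reading CSV needs nothing beyond the standard library; the file contents below are made up, with io.StringIO standing in for a real file on disk:

```python
import csv
import io

raw = "name,score\nada,90\ngrace,95\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Note that every value comes back as a string.
print(rows)  # [{'name': 'ada', 'score': '90'}, {'name': 'grace', 'score': '95'}]
```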
Monitoring and Metrics
These tools and services are better geared towards companies than individuals, but are still worth knowing about.
- Datadog – datadoghq.com
"Datadog is the observability and security platform for cloud applications."
From datadoghq.com/about/leadership
Datadog is a performance and metrics tracking service geared towards large networks and interconnected applications.
- Datathink
- Snowflake
- Amazon Redshift
"Tens of thousands of customers use Amazon Redshift for modern data analytics at scale..."
Conclusion
There are so many tools and options available for data science. I hope that aggregating and succinctly explaining what these tools are for can help you better explore and analyze your next data set. Happy wrangling!