Data-Science-Tools

Big data processing

When datasets don’t fit in memory:

Apache Spark
Still the standard for distributed batch processing and ETL pipelines.
Apache Flink
Increasingly used for real-time analytics and event-driven systems.
SQL query engines:
- Trino (formerly PrestoSQL, a fork of Presto; runs distributed queries across a cluster)
- DuckDB (in-process rather than distributed; fast-growing for local analytics and embedded SQL)
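The engines above differ widely, but the basic idea behind processing data that does not fit in memory can be sketched with the standard library alone: read bounded chunks, aggregate each chunk independently, then merge the partial results. This is a minimal illustration, not how Spark, Flink, Trino, or DuckDB are implemented. The function names, the `region` and `amount` columns, and the sample data are all hypothetical.

```python
import csv
import io
from collections import Counter
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size rows, so only one chunk is held at a time."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk


def partial_totals(chunk):
    """Aggregate a single chunk on its own (the 'map' side)."""
    totals = Counter()
    for row in chunk:
        totals[row["region"]] += int(row["amount"])
    return totals


def total_by_region(rows, chunk_size=2):
    """Merge the per-chunk aggregates into one result (the 'reduce' side)."""
    result = Counter()
    for chunk in iter_chunks(rows, chunk_size):
        result.update(partial_totals(chunk))  # Counter.update adds counts
    return dict(result)


# Hypothetical sample data; in practice this would be a file too large to load at once.
SAMPLE_CSV = "region,amount\neu,10\nus,5\neu,7\napac,3\nus,1\n"
totals = total_by_region(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(totals)
```

Because each chunk is aggregated independently, the same structure could in principle be spread across several workers, which is roughly the step that distributed engines automate.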

Cloud data warehouses & lakehouse systems

Modern data science is heavily cloud-native:

Snowflake
Leading cloud data warehouse with strong separation of storage/compute.
Google Cloud BigQuery
Serverless analytics, widely used for large-scale SQL workloads.
Databricks
Popularized the “lakehouse” model (combining data lakes and warehouses) and deeply integrates Spark with ML tooling.
Amazon Web Services (Redshift, Athena, SageMaker)
Broadest service ecosystem for data + ML pipelines.

MLOps (productionizing models)

This is one of the fastest-evolving areas:

Model tracking: MLflow, Weights & Biases
Pipeline orchestration: Airflow, Prefect, Dagster
Deployment: Docker (containers), Kubernetes (container orchestration)
Feature stores: Feast, Tecton
Monitoring: Evidently AI, Arize

The goal is reproducibility, versioning, and continuous model updates.
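To make the reproducibility and versioning goal concrete, here is a minimal sketch of run tracking using only the standard library. It does not use the MLflow or Weights & Biases APIs; it only illustrates the underlying idea of deriving a stable run ID from the parameters and data, then storing the results under that ID. The `fingerprint` and `log_run` helpers, the parameter names, and the metric values are hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def fingerprint(params, data_bytes):
    """Derive a stable run ID from the hyperparameters and the training data."""
    digest = hashlib.sha256()
    # sort_keys makes the ID independent of dictionary ordering.
    digest.update(json.dumps(params, sort_keys=True).encode("utf-8"))
    digest.update(b"\0")  # separator between parameters and data
    digest.update(data_bytes)
    return digest.hexdigest()[:12]


def log_run(store, params, data_bytes, metrics):
    """Write one run record as JSON, named after its fingerprint."""
    run_id = fingerprint(params, data_bytes)
    record = {"run_id": run_id, "params": params, "metrics": metrics}
    path = Path(store) / f"{run_id}.json"
    path.write_text(json.dumps(record, indent=2, sort_keys=True))
    return run_id


# Hypothetical parameters, data, and metric values, for illustration only.
store = tempfile.mkdtemp()
data = b"feature,label\n1.0,0\n2.0,1\n"
run_a = log_run(store, {"lr": 0.1, "depth": 3}, data, {"accuracy": 0.9})
run_b = log_run(store, {"depth": 3, "lr": 0.1}, data, {"accuracy": 0.9})
run_c = log_run(store, {"lr": 0.2, "depth": 3}, data, {"accuracy": 0.8})
print(run_a == run_b, run_a == run_c)
```

The same parameters and data map to the same ID regardless of key order, while any change to either produces a new record. Dedicated tracking tools add much more on top of this, such as artifact storage, UIs, and model registries.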

Visualization & BI

matplotlib / seaborn (classical Python visualization)
Plotly, Dash (interactive dashboards)
Tableau / Power BI (enterprise BI layer)
Superset / Metabase (open-source alternatives)
