Skip to main content

Data-Science-Tools

Big data processing

When datasets don’t fit in memory:

  • Apache Spark
    Still the standard for distributed batch processing and ETL pipelines.
  • Apache Flink
    Increasingly used for real-time analytics and event-driven systems.
  • SQL engines on distributed systems:
    • Trino (formerly Presto)
    • DuckDB (fast-growing for local analytics and embedded SQL)

Cloud data warehouses & lakehouse systems

Modern data science is heavily cloud-native:

  • Snowflake
    Leading cloud data warehouse with strong separation of storage/compute.
  • Google Cloud BigQuery
    Serverless analytics, widely used for large-scale SQL workloads.
  • Databricks
    Popularizes the “lakehouse” model (combining data lakes + warehouses) and deeply integrates Spark + ML tooling.
  • Amazon Web Services (Redshift, Athena, SageMaker)
    Broadest service ecosystem for data + ML pipelines.

MLOps (productionizing models)

This is one of the fastest-evolving areas:

  • Model tracking: MLflow, Weights & Biases
  • Pipeline orchestration: Airflow, Prefect, Dagster
  • Deployment: Kubernetes, Docker
  • Feature stores: Feast, Tecton
  • Monitoring: Evidently AI, Arize

The goal is reproducibility, versioning, and continuous model updates.

Visualization & BI

  • matplotlib / seaborn (classical Python visualization)
  • Plotly, Dash (interactive dashboards)
  • Tableau / Power BI (enterprise BI layer)
  • Superset / Metabase (open-source alternatives)