Verified 2026·05·30

§ For data engineers

Best data tools for data engineers.

The tools that watch the pipelines, not just the models. 21 indexed, 7 open source.

Why these

What fits.

Breakage usually happens before dbt runs: in ingestion, in orchestration, or in a late upstream load. These tools watch those layers with warehouse-native freshness and volume checks, plus cross-system lineage that reaches upstream of the transformation step.
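To make "freshness and volume checks" concrete, here is a minimal sketch of the idea, not how any listed tool implements it. It assumes a hypothetical `raw_orders` table with an ISO-8601 `loaded_at` column and uses an in-memory SQLite database as a stand-in for a warehouse. The thresholds and the names `is_fresh` and `volume_in_range` are illustrative only.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def is_fresh(conn, table, ts_column, max_lag, now):
    """Freshness: the newest row must be no older than max_lag."""
    # Identifiers are interpolated directly, so they must come from
    # trusted configuration, never from user input.
    latest_raw = conn.execute(
        f"SELECT MAX({ts_column}) FROM {table}"
    ).fetchone()[0]
    if latest_raw is None:  # an empty table is never fresh
        return False
    return now - datetime.fromisoformat(latest_raw) <= max_lag


def volume_in_range(conn, table, expected_rows, tolerance=0.5):
    """Volume: row count must sit within +/- tolerance of the expectation."""
    count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    return abs(count - expected_rows) <= tolerance * expected_rows


def build_demo(now):
    """Hypothetical stand-in for a warehouse table: 90 rows, newest 30 min old."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_orders (id INTEGER, loaded_at TEXT)")
    rows = [
        (i, (now - timedelta(minutes=30 + i)).isoformat()) for i in range(90)
    ]
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?)", rows)
    return conn


NOW = datetime(2026, 5, 30, 12, 0, tzinfo=timezone.utc)
conn = build_demo(NOW)
print(is_fresh(conn, "raw_orders", "loaded_at", timedelta(hours=1), NOW))  # True
print(volume_in_range(conn, "raw_orders", expected_rows=100))              # True
```

Both checks are plain SQL aggregates over the landed table, which is why they can catch a late or short upstream load before any transformation runs. Production tools typically learn the thresholds from history instead of hard-coding them.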

21 tools

The shortlist.

DataHub

OSS SaaS / Self-host

Apache-2.0 metadata platform with a serious managed counterpart — strongest event-driven architecture and column-level SQL lineage in OSS.

Catalog & discovery OSS · free

dbt-expectations

OSS Self-host

Open-source dbt package adding 50+ Great Expectations-style assertions as native dbt tests that run in your own warehouse.

Quality & testing OSS · free

Elementary

OSS SaaS / Self-host

The dbt-native observability layer — tests, anomaly detection, and lineage that live inside your dbt project.

Quality & testing OSS · free

Great Expectations

OSS SaaS / Self-host acquired

Python-native data validation framework — the OSS standard, now in stewardship transition after the May 2026 acquisition.

Quality & testing OSS · free

Marquez

OSS Self-host

The OpenLineage reference backend — vendor-neutral lineage events from Spark, Airflow, dbt, and Flink, stored and visualised.

Lineage & metadata OSS · free

OpenMetadata

OSS SaaS / Self-host

Apache-2.0 unified metadata platform with a deliberately simple stack — discovery, lineage, quality, and contracts in one project.

Catalog & discovery OSS · free

Unity Catalog

OSS Self-host

Open-source universal catalog for data and AI under Apache-2.0 — Iceberg-REST and Hive-MS compatible, Databricks-led, LF AI hosted.

Catalog & discovery OSS · free

Acceldata

Hybrid

Enterprise data observability with ML data quality, reconciliation, and a built-in catalog — strong on hybrid and on-prem estates.

Quality & testing Contact sales

Alation

SaaS / Self-host

The incumbent that defined the data catalog — behavioral search, deep governance, and strong column-level lineage.

Catalog & discovery Contact sales

Anomalo

SaaS / Self-host

GUI-first ML anomaly detection at petabyte scale — pivoting in 2026 around agentic AI and unstructured-data monitoring.

Quality & testing Contact sales

Atlan

Hybrid

Enterprise catalog and governance plane positioned as the AI context layer — connectors, lineage, contracts, and an MCP server for agents.

Catalog & discovery Contact sales

Bigeye

SaaS / Self-host

Enterprise data observability with Autometrics ML thresholds — repositioning in 2026 as an AI Trust Platform with runtime governance.

Quality & testing Contact sales

Cloudera Data Lineage (Octopai)

SaaS acquired

SaaS lineage with 60+ connectors and a 24-hour deploy story — built for hybrid enterprise estates without IBM-stack baggage.

Lineage & metadata Contact sales

Collibra

SaaS

Enterprise data-and-AI governance incumbent: catalog, glossary, workflow stewardship, lineage, and a separate ML data-quality module.

Catalog & discovery Contact sales

Datafold

SaaS / Self-host

Pre-merge data diffing and column-level lineage — the tool that shifts data quality left into the pull request.

Quality & testing From $799/custom

IBM Manta Data Lineage

SaaS / Self-host acquired

The deepest scanner-driven lineage product on the market — built for legacy estates (SAP, Cognos, Informatica) modern catalogs miss.

Lineage & metadata Contact sales

Metaplane

SaaS acquired

ML-powered, no-code data observability for the dbt and warehouse stack with automatic column-level lineage — now Metaplane by Datadog.

Quality & testing Published

Monte Carlo

SaaS

Warehouse-side data observability for teams whose problems are upstream of dbt — ingestion, streaming, and across the full pipeline.

Quality & testing Contact sales

Secoda

SaaS / Self-host acquired

AI-native data catalog, lineage, and observability from Toronto — acquired by Atlassian in December 2025 to power Rovo AI.

Catalog & discovery Contact sales

Sifflet

SaaS / Self-host

EU-built full-stack data observability pairing ML-driven monitoring with an embedded catalog and field-level lineage.

Quality & testing Contact sales

Soda

SaaS / Self-host

YAML-first data contracts and observability — SodaCL plus Soda Cloud, with anomaly detection and a self-hosted Kubernetes runner.

Quality & testing From $750/custom

How this list sorts.

Open-source tools sort first, then everything else, alphabetically within each group. There is no editorial ranking and no paid placement. Every entry matches a structured field on the tool profile; see the methodology, or compare any two on the comparisons page.
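The ordering rule can be sketched as a two-part sort key. This is an illustration of the stated rule, not the site's actual code: the `Tool` record and its `open_source` field are hypothetical, and the case-insensitive comparison is an assumption inferred from dbt-expectations appearing between DataHub and Elementary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    # Hypothetical record; the real tool profile has more fields.
    name: str
    open_source: bool


def sort_listing(tools):
    """Open-source tools first, then alphabetical by name within each group."""
    # False sorts before True, so `not open_source` puts OSS entries first.
    # casefold() makes the name comparison case-insensitive (assumed).
    return sorted(tools, key=lambda t: (not t.open_source, t.name.casefold()))


sample = [
    Tool("Soda", False),
    Tool("Marquez", True),
    Tool("Acceldata", False),
    Tool("DataHub", True),
    Tool("dbt-expectations", True),
]
print([t.name for t in sort_listing(sample)])
# ['DataHub', 'dbt-expectations', 'Marquez', 'Acceldata', 'Soda']
```

Because the key depends only on a boolean field and the name, the order is fully determined by the tool profile, which leaves no room for ranking or placement.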
