Best data tools for self-hosted deployments.
Keep data and metadata inside your perimeter. 17 indexed, 7 open source.
What fits.
Some teams cannot or will not send metadata to a vendor cloud. These tools can run inside your own perimeter — fully self-hosted, bring-your-own-cloud, or hybrid with a customer-managed data plane — so the data and metadata stay where your security review needs them.
The shortlist.
Apache-2.0 metadata platform with a serious managed counterpart — strongest event-driven architecture and column-level SQL lineage in OSS.
Open-source dbt package adding 50+ Great Expectations-style assertions as native dbt tests that run in your own warehouse.
The dbt-native observability layer — tests, anomaly detection, and lineage that live inside your dbt project.
Python-native data validation framework — the OSS standard, now in stewardship transition after the May 2026 acquisition.
The OpenLineage reference backend — vendor-neutral lineage events from Spark, Airflow, dbt, and Flink, stored and visualised.
Apache-2.0 unified metadata platform with a deliberately simple stack — discovery, lineage, quality, and contracts in one project.
Open-source universal catalog for data and AI under Apache-2.0: Iceberg REST and Hive Metastore compatible, Databricks-led, LF AI hosted.
Enterprise data observability with ML data quality, reconciliation, and a built-in catalog — strong on hybrid and on-prem estates.
The incumbent that defined the data catalog — behavioral search, deep governance, and strong column-level lineage.
GUI-first ML anomaly detection at petabyte scale — pivoting in 2026 around agentic AI and unstructured-data monitoring.
Enterprise catalog and governance plane positioned as the AI context layer — connectors, lineage, contracts, and an MCP server for agents.
Enterprise data observability with Autometrics ML thresholds — repositioning in 2026 as an AI Trust Platform with runtime governance.
Pre-merge data diffing and column-level lineage — the tool that shifts data quality left into the pull request.
The deepest scanner-driven lineage product on the market — built for legacy estates (SAP, Cognos, Informatica) modern catalogs miss.
AI-native data catalog, lineage, and observability from Toronto — acquired by Atlassian in December 2025 to power Rovo AI.
EU-built full-stack data observability pairing ML-driven monitoring with an embedded catalog and field-level lineage.
YAML-first data contracts and observability — SodaCL plus Soda Cloud, with anomaly detection and a self-hosted Kubernetes runner.
How this list sorts.
Open-source options sort first, then alphabetical, with no editorial ranking and no paid placement. Each tool appears here because a structured field on its tool profile matches this page's criteria; see the methodology, or compare any two on the comparisons page.