Catalog & discovery.
Tools for finding the right asset — and the right context — without asking on Slack.
Catalog tooling splits along two fault lines that matter more than the marketing language. The first is how metadata gets in: pull connectors that the catalog runs against your stack on a schedule, push APIs that your jobs emit events to, or some hybrid. The second is what the catalog is for: a discovery surface where humans search (“where’s the customer table?”), a governance surface where decisions are made and recorded (“who owns this and can it leave the EU?”), or both.
The 2024–2026 wave added a third axis: how AI-native the search and authoring experience is. Some catalogs lean on classical search with a glossary on top; some are fully LLM-driven, generating documentation, answering natural-language questions, and reasoning over lineage. Depth varies dramatically — judge it on search_approach and natural_language_search, not the label.
Open-source matters more in this cluster than in quality testing. DataHub and OpenMetadata are both Apache-2.0 and run by real organisations at real scale. Unity Catalog became Apache-2.0 in mid-2024 and is now a third serious open-source option.
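The pull/push distinction described above can be sketched in a few lines. This is a minimal illustration, not any vendor's API: the `Catalog` class, `pull_sync`, `push_event`, and the event fields are all hypothetical names made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Catalog:
    """Toy in-memory metadata store keyed by asset name (hypothetical)."""
    assets: dict[str, dict] = field(default_factory=dict)

    def upsert(self, name: str, metadata: dict) -> None:
        # Merge new metadata into whatever is already recorded for the asset.
        self.assets.setdefault(name, {}).update(metadata)


def pull_sync(catalog: Catalog, warehouse_schema: dict[str, list[str]]) -> None:
    """Pull model: the catalog reads the source on a schedule and records what it finds."""
    for table, columns in warehouse_schema.items():
        catalog.upsert(table, {"columns": columns})


def push_event(catalog: Catalog, event: dict) -> None:
    """Push model: a job reports to the catalog at the moment it runs."""
    catalog.upsert(event["asset"], {"last_run": event["run_id"]})


catalog = Catalog()
pull_sync(catalog, {"customers": ["id", "email"]})                 # scheduled crawl
push_event(catalog, {"asset": "customers", "run_id": "run-001"})   # emitted by a job
print(catalog.assets["customers"])
```

In this sketch a hybrid catalog is simply one that accepts both paths: the crawl supplies structure (columns), while pushed events supply freshness (the latest run).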
Questions a buyer actually asks.
- 01 Managed catalog, or open-source catalog I run myself?
- OSS catalogs (DataHub, OpenMetadata, Unity Catalog — all Apache-2.0) run at real scale, but you operate stateful services (search index, metadata store, ingestion) much like a database. Managed options (Atlan, Collate, DataHub Cloud) trade that operational load for cost plus polished governance and AI. The deciding factor is platform-engineering appetite, not licence price.
- 02 How important is column-level lineage for my catalog?
- It depends on who uses the catalog. For impact analysis on warehouse changes — what breaks if I rename this column — column-level lineage is what makes the catalog actionable. For pure discovery and documentation, table-level is often enough. Check lineage_granularity; the strongest catalogs derive column-level lineage from SQL parsing rather than manual entry.
- 03 Do I need a real business glossary, or is technical metadata enough?
- A real business glossary links each term to the assets that embody it, propagates ownership and tags along that linkage, and is versioned — not just a wiki of definitions. If governance and stewardship are the goal, glossary depth is the best signal a catalog is in active use. If you only need search and docs, technical metadata may be enough.
- 04 Which catalog actually integrates with my BI tools (Tableau, Looker, Power BI)?
- Most major catalogs ingest Tableau, Looker, and Power BI metadata, but depth differs — some pull dashboards and fields with lineage back to warehouse columns, others only list the dashboard. The distinction to check per catalog: does BI lineage reach warehouse columns, or stop at the asset?
- 05 How does PII / sensitive-data auto-classification compare across catalogs?
- Auto-classification scans values or schemas to flag PII, PHI, and PCI without manual tagging. Coverage and accuracy vary widely; false positives on synthetic-looking strings and false negatives on internal IDs are common. Treat it as a first pass a steward reviews — check each catalog's pii_auto_classification and whether sensitivity tags propagate along lineage.
- 06 What does "AI-powered catalog" actually mean tool by tool, in 2026?
- It ranges from classical search with an LLM summary bolted on, to fully LLM-native experiences that answer natural-language questions, generate documentation, and reason over lineage. AI-powered is one of the noisiest claims in this space — look at search_approach and natural_language_search, and test it on your own metadata before trusting the demo.
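Two of the answers above (impact analysis on a column rename, and sensitivity tags propagating along lineage) come down to the same operation: walking column-level lineage downstream. The sketch below assumes lineage is already available as a simple edge map. The column names, the `LINEAGE` dictionary, and the `downstream` helper are hypothetical, and a real catalog would derive the edges from SQL parsing rather than hard-code them.

```python
from collections import deque

# Hypothetical column-level lineage edges: upstream column -> columns derived from it.
LINEAGE = {
    "raw.customers.email": ["staging.customers.email_normalized"],
    "staging.customers.email_normalized": [
        "marts.customer_360.email",
        "marts.marketing_list.contact",
    ],
    "raw.customers.id": ["marts.customer_360.customer_id"],
}


def downstream(column: str, lineage: dict[str, list[str]]) -> list[str]:
    """Return every column reachable from `column`, in breadth-first order."""
    seen: set[str] = set()
    order: list[str] = []
    queue = deque([column])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:  # guards against revisiting and against cycles
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# Impact analysis: which columns are affected if raw.customers.email is renamed?
impacted = downstream("raw.customers.email", LINEAGE)

# Tag propagation: a PII tag on the source column flows to the same set.
tags = {"raw.customers.email": {"PII"}}
for col in impacted:
    tags.setdefault(col, set()).add("PII")

print(impacted)
```

With table-level lineage only, the same walk would flag every column of every downstream table, which is why column-level granularity matters for this use case.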
7 tools, different shapes.
Alation
Alation · est. 2012 · Redwood City, CA
The incumbent that defined the data catalog — behavioral search, deep governance, and strong column-level lineage.
Atlan
Atlan · est. 2019 · Singapore
Enterprise catalog and governance plane positioned as the AI context layer — connectors, lineage, contracts, and an MCP server for agents.
Collibra
Collibra · est. 2008 · Brussels, Belgium
Enterprise data-and-AI governance incumbent: catalog, glossary, workflow stewardship, lineage, and a separate ML data-quality module.
DataHub
Acryl Data · est. 2021 · Palo Alto, CA
Apache-2.0 metadata platform with a serious managed counterpart — strongest event-driven architecture and column-level SQL lineage in OSS.
OpenMetadata
Collate · est. 2021 · Saratoga, CA
Apache-2.0 unified metadata platform with a deliberately simple stack — discovery, lineage, quality, and contracts in one project.
Secoda
Secoda (Atlassian) · est. 2021 · Toronto, Ontario, Canada
AI-native data catalog, lineage, and observability from Toronto — acquired by Atlassian in December 2025 to power Rovo AI.
Unity Catalog
Databricks · est. 2024 · San Francisco, CA
Open-source universal catalog for data and AI under Apache-2.0: Iceberg REST and Hive Metastore compatible, Databricks-led, hosted by LF AI & Data.
What each catalog actually ships.
| Tool | 01 Glossary | 02 NL search | 03 Contracts | 04 Govern flows | 05 Access req | 06 PII auto | 07 OpenLineage | 08 Col lineage | 09 Free self-host |
|---|---|---|---|---|---|---|---|---|---|
| Alation | | | | | | | | | |
| Atlan | | | | | | | | | |
| Collibra | | | | | | | | | |
| DataHub | | | | | | | | | |
| OpenMetadata | | | | | | | | | |
| Secoda | | | | | | | | | |
| Unity Catalog | | | | | | | | | |
Connector counts, ingestion model, and asset types vary substantially — open any tool name above for the full capability spec.
Three trade-offs that matter.
Open-source, or managed?
The OSS catalogs (DataHub, OpenMetadata, Unity Catalog) mean running a search index, metadata store, and ingestion layer yourself: stateful infrastructure, not a single binary. Managed options (Atlan, DataHub Cloud, Collate) take that operational load off your hands and add a proprietary AI and governance layer, at a price. Decide on platform-engineering bandwidth, not licence price.
Engineering-led, or steward-led?
DataHub is more developer-shaped: event-driven architecture, a strong SQL parser, and Metadata Change Log (MCL) events over Kafka. Atlan is more steward-shaped: polished governance UX, certifications, and the glossary as a first-class artifact. OpenMetadata sits between the two, with a simpler stack and a fast cadence of governance features. Pick by who actually uses the catalog day-to-day.
Discovery catalog, or governed registry?
Discovery catalogs (DataHub, OpenMetadata, Atlan) crawl your stack and present a search-and-lineage UI for humans. Governed registries (Unity Catalog OSS) are read by engines — Spark, Trino, DuckDB — to access tables. Different problems, both legitimate; some mature stacks run a registry below a discovery catalog.
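The "registry below a discovery catalog" arrangement can be sketched as two small classes with different readers. This is a toy model under stated assumptions: `Registry`, `DiscoveryCatalog`, the table name, and the storage path are all hypothetical and do not reflect the API of any tool named here.

```python
class Registry:
    """Toy governed registry: engines ask it where a table lives (hypothetical)."""

    def __init__(self) -> None:
        self._tables: dict[str, dict] = {}

    def register(self, name: str, location: str, owner: str) -> None:
        self._tables[name] = {"location": location, "owner": owner}

    def resolve(self, name: str) -> str:
        # What a query engine needs: a lookup from table name to storage location.
        return self._tables[name]["location"]

    def list_tables(self) -> dict[str, dict]:
        return dict(self._tables)


class DiscoveryCatalog:
    """Toy discovery catalog: crawls the registry and serves human search."""

    def __init__(self) -> None:
        self._index: dict[str, dict] = {}

    def crawl(self, registry: Registry) -> None:
        self._index.update(registry.list_tables())

    def search(self, term: str) -> list[str]:
        term = term.lower()
        return sorted(name for name in self._index if term in name.lower())


registry = Registry()
registry.register("sales.customers", "s3://example-bucket/sales/customers", "data-platform")

location = registry.resolve("sales.customers")   # the engine's question

discovery = DiscoveryCatalog()
discovery.crawl(registry)
hits = discovery.search("customer")              # the human's question
print(location, hits)
```

The registry answers an exact-name lookup on the query path, while the discovery layer answers fuzzy questions off that path, which is one reason the two can coexist in the same stack.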
Also strong at catalog & discovery — primarily categorised elsewhere.
These tools earn their primary classification in another cluster but score 2 or 3 of 3 on catalog capability — the cluster overlap is real, not aspirational. Worth a look when consolidating two budgets into one.
- Acceldata → Primary: quality testing · Catalog strength 2/3
- Cloudera Data Lineage (Octopai) → Primary: lineage metadata · Catalog strength 2/3
- IBM Manta Data Lineage → Primary: lineage metadata · Catalog strength 2/3
- Monte Carlo → Primary: quality testing · Catalog strength 2/3
- Sifflet → Primary: quality testing · Catalog strength 2/3
The OSS-vs-managed catch.
The catch on this cluster is the OSS-vs-managed gap: Unity Catalog and DataHub do materially less in their free/self-hosted tiers than in their managed offerings, and several matrix capabilities are paid-only — each tool page spells out which.