FAIR Data,
Grounded AI

A knowledge graph linking 9.3 million data citations across 9 life sciences repositories — demonstrating how FAIR principles power trustworthy, traceable AI answers grounded in structured evidence.

101M
RDF Triples
9.3M
Data Citations
1.3M
Publications
3.2M
Unique Datasets
9
Repositories
3
Linked Data Sources
Why It Matters

Grounded AI vs Ungrounded AI

The same question, two approaches. One traces every claim to structured evidence in a knowledge graph. The other relies on statistical patterns from training data — and invents plausible-sounding facts.

Knowledge-grounded (GraphRAG)
“Which repository has the most data citations?”
ENA (European Nucleotide Archive) leads with 3.76 million data citations across 1.28 million unique datasets, followed by PDB with 1.75M citations. Of the 9 repositories tracked, only Figshare and Dryad mint DOIs — the remaining 7 use repository-specific accession numbers.

✓ Every number verified against the knowledge graph.
Repositories graph (pre-computed stats) → Citations graph (CZI text-mining)
SPARQL query executed · 9 results returned · Full provenance available
Ungrounded LLM (no knowledge graph)
“Which repository has the most data citations?”
Based on general knowledge, GenBank is likely the most cited data repository, with ✗ millions of citations across the life sciences. Other highly cited repositories include ✗ ArrayExpress and the ✗ NCBI Sequence Read Archive. Data citation practices vary widely across disciplines.

✗ Fabricated specifics — GenBank is not in this corpus. ArrayExpress and SRA are not tracked. No actual numbers provided.
⚠ No data source queried · No provenance chain · No verification possible · Trained on statistical patterns only
Knowledge-grounded (GraphRAG)
“What percentage of publications include ORCID identifiers?”
ORCID adoption in data-citing publications has grown from 8% in 2012 to 74% in 2023 — a near-tenfold increase. This trajectory suggests the research community is steadily embracing persistent author identifiers, though 26% of recent publications still lack any ORCID.

✓ Year-by-year data from 1.3M publications in the knowledge graph.
Analytics graph (pre-computed) → Publications graph (Crossref)
SPARQL query executed · 14 years of data · Full provenance available
Ungrounded LLM (no knowledge graph)
“What percentage of publications include ORCID identifiers?”
ORCID adoption has been growing steadily. As of recent estimates, ✗ approximately 40-50% of scholarly publications include at least one ORCID. ✗ A 2022 study found that ORCID usage reached about 45% in major publisher outputs. Adoption varies significantly by discipline and publisher.

✗ Vague and inaccurate — no specific dataset, no year-by-year trend, invented study reference.
⚠ No data source queried · Invented citation · Approximate numbers from training data · Cannot be verified

FAIR data doesn’t just improve data management — it makes AI trustworthy. Every answer from the knowledge graph comes with a provenance trail: which data sources were queried, which graphs were traversed, and how many results were returned. Try it yourself →

Measuring FAIR

FAIR Scorecard

FAIR isn’t abstract — it’s measurable. None of these scores are 100%, and that’s the point. Even partial FAIR compliance unlocks powerful cross-source queries, traceable AI, and insights that would be impossible with siloed data. Imagine what becomes possible as these numbers climb.

F · Findable · 33%
33% of datasets have persistent identifiers (DOIs). The remaining 67% use repository-specific accession numbers.

A · Accessible · 70%
70% of publications include at least one author ORCID, enabling traceable attribution and access to creator profiles.

I · Interoperable · 85%
85% of datasets have standardised subject classifications, enabling cross-repository discovery and linking.

R · Reusable · 60%
60% of datasets have machine-readable licenses. 52% of publications include funder metadata for provenance.

Scores derived from 1.3M publications (Crossref) and 1.0M DOI-minted datasets (DataCite) in the knowledge graph. February 2026 snapshot.

This knowledge graph was built with 9 repositories and 3 data sources; there are thousands more. Every additional FAIR-compliant repository, every DOI minted instead of an accession number, every ORCID added to a publication expands what’s queryable. The 33% Findability score isn’t a failure; it’s 1.0 million datasets already discoverable through persistent identifiers, with 2.2 million more waiting to become machine-readable. Perfect shouldn’t be the enemy of good, and good is already remarkably powerful.
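As a sketch of how a score like the 33% Findability figure could be derived, here is the computation over illustrative dataset records. The field names are assumptions for this example, not the graph's actual schema:

```python
# Findability here means: what share of datasets carry a persistent
# identifier (a DOI) rather than a repository-specific accession number.

def findability_score(datasets):
    """Percentage of datasets whose primary identifier is a DOI."""
    with_doi = sum(1 for d in datasets if d["identifier_type"] == "DOI")
    return round(100 * with_doi / len(datasets))

# Toy records mimicking the identifier styles in the corpus
# (a Dryad-style DOI, an ENA-style accession, a GEO-style accession).
sample = [
    {"id": "10.5061/dryad.abc123", "identifier_type": "DOI"},
    {"id": "ERP000001",            "identifier_type": "Accession"},
    {"id": "GSE12345",             "identifier_type": "Accession"},
]
print(findability_score(sample))  # prints: 33
```

The other three letters follow the same pattern: count the publications or datasets satisfying a machine-checkable predicate (has an ORCID, has a subject classification, has a machine-readable license) and divide by the total.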

The Data

Nine Repositories, One Graph

Data from domain-specific repositories (ENA, PDB, GEO, UniProt) and general-purpose repositories (Figshare, Dryad) — each with different identifier practices — unified through a knowledge graph.

Repository   Identifier   Citations    Datasets     Publications
ENA          Accession    3,756,882    1,283,574    544,112
PDB          Accession    1,752,406       57,693    380,754
Figshare     DOI          1,287,134      887,294    298,654
CCDC         Accession    1,103,472      466,331    255,887
dbSNP        Accession      682,218      147,920    199,553
GEO          Accession      399,311       97,416    156,220
Dryad        DOI            215,642      105,390     85,310
UniProt      Accession       68,112       34,221     28,440
BioProject   Accession       42,887       21,445     18,922
How It Works

From Data to Grounded AI

Three open data sources are linked through a knowledge graph, enabling AI that can trace every answer back to its evidence. This is GraphRAG in practice — retrieval-augmented generation grounded in structured, FAIR data.

CZI Text Mining

9.3M data citations extracted from full-text publications by the Chan Zuckerberg Initiative. The raw link between papers and datasets.

Crossref Enrichment

1.3M publications enriched with titles, authors, ORCIDs, journals, funders, and citation counts. The scholarly metadata layer.

DataCite Metadata

1.0M datasets with creators, subjects, licenses, and download counts. Rich descriptive metadata for DOI-minted research data.

Knowledge Graph

101M triples in GraphDB, modelled with a custom ontology. Seven named graphs with pre-computed analytics for instant queries.

AI Chat (GraphRAG)

Natural language questions are translated to SPARQL, executed against the graph, and interpreted with full provenance. Every answer is traceable.

Open & Reproducible

Built on open data (DataCite Data Citation Corpus), open standards (RDF, SPARQL, FAIR), and open-source tools. Fully reproducible.
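The steps above can be condensed into one sketch: a question is matched to a SPARQL query, the query is executed against an endpoint, and the answer is returned together with its provenance record. All function and field names below are illustrative assumptions, not the project's actual API:

```python
# Hypothetical GraphRAG answer path. A real deployment would translate the
# question with an LLM and POST the SPARQL to a GraphDB endpoint; here a
# stub executor stands in so the provenance plumbing is visible.

SPARQL_TEMPLATES = {
    "most_cited_repository": (
        "SELECT ?repo ?citations WHERE { "
        "?repo a :Repository ; :citationCount ?citations } "
        "ORDER BY DESC(?citations) LIMIT 1"
    ),
}

def answer_with_provenance(template_name, execute):
    """execute: any callable that runs a SPARQL string and returns result rows."""
    sparql = SPARQL_TEMPLATES[template_name]
    rows = execute(sparql)
    return {
        "results": rows,
        "provenance": {
            "query": sparql,
            "graphs_traversed": ["repositories", "citations"],  # assumed names
            "result_count": len(rows),
        },
    }

# Stub executor standing in for the live SPARQL endpoint.
stub = lambda q: [{"repo": "ENA", "citations": 3_756_882}]
answer = answer_with_provenance("most_cited_repository", stub)
print(answer["results"][0]["repo"], answer["provenance"]["result_count"])
# prints: ENA 1
```

The key design point is that the provenance record is assembled at query time, not reconstructed afterwards, so every answer ships with the exact query, graphs, and result count that produced it.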

DataCite Data Citation Corpus (citation links via CZI) + Crossref (publication metadata) + DataCite (dataset metadata) → Knowledge Graph (101M triples in GraphDB) → Grounded AI (traceable answers via GraphRAG)