A knowledge graph linking 9.3 million data citations across 9 life sciences repositories — demonstrating how FAIR principles power trustworthy, traceable AI answers grounded in structured evidence.
The same question, two approaches. One traces every claim to structured evidence in a knowledge graph. The other relies on statistical patterns from training data — and invents plausible-sounding facts.
FAIR data doesn’t just improve data management — it makes AI trustworthy. Every answer from the knowledge graph comes with a provenance trail: which data sources were queried, which graphs were traversed, and how many results were returned.
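As an illustrative sketch only — the field names below are assumptions, not the live system's schema — a provenance trail attached to an answer might look like this:

```python
# Hypothetical provenance record; field names are illustrative,
# not the knowledge graph's actual schema.
provenance = {
    "data_sources": ["Data Citation Corpus", "Crossref", "DataCite"],
    "graphs_traversed": ["citations", "publications", "datasets"],
    "sparql_query": "SELECT ...",  # the generated query, kept verbatim
    "result_count": 25,
}

def is_traceable(record: dict) -> bool:
    """An answer counts as traceable only if every provenance field
    is present and non-empty."""
    required = ("data_sources", "graphs_traversed",
                "sparql_query", "result_count")
    return all(record.get(key) for key in required)

print(is_traceable(provenance))  # True
```

The point of keeping the record structured is that a reader (or an auditor) can re-run the exact query and reproduce the answer.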
FAIR isn’t abstract — it’s measurable. None of these scores are 100%, and that’s the point. Even partial FAIR compliance unlocks powerful cross-source queries, traceable AI, and insights that would be impossible with siloed data. Imagine what becomes possible as these numbers climb.
Scores derived from 1.3M publications (Crossref) and 1.0M DOI-minted datasets (DataCite) in the knowledge graph. February 2026 snapshot.
This knowledge graph was built with 9 repositories and 3 data sources. There are thousands more. Every additional FAIR-compliant repository, every DOI minted instead of an accession number, every ORCID added to a publication — expands what’s queryable. The 33% Findability score isn’t a failure; it’s 1.0 million datasets already discoverable through persistent identifiers, with 2.2 million more waiting to become machine-readable. Perfect shouldn’t be the enemy of good — and good is already remarkably powerful.
Data from domain-specific repositories (ENA, PDB, GEO, UniProt) and general-purpose repositories (Figshare, Dryad) — each with different identifier practices — unified through a knowledge graph.
| Repository | Identifier | Citations | Datasets | Publications |
|---|---|---|---|---|
| ENA | Accession | 3,756,882 | 1,283,574 | 544,112 |
| PDB | Accession | 1,752,406 | 57,693 | 380,754 |
| Figshare | DOI | 1,287,134 | 887,294 | 298,654 |
| CCDC | Accession | 1,103,472 | 466,331 | 255,887 |
| dbSNP | Accession | 682,218 | 147,920 | 199,553 |
| GEO | Accession | 399,311 | 97,416 | 156,220 |
| Dryad | DOI | 215,642 | 105,390 | 85,310 |
| UniProt | Accession | 68,112 | 34,221 | 28,440 |
| BioProject | Accession | 42,887 | 21,445 | 18,922 |
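The 9.3M headline figure can be sanity-checked directly against the table: summing the Citations column gives just over 9.3 million. A quick check, with the numbers copied from the table above:

```python
# Citation counts per repository, copied from the table above.
citations = {
    "ENA": 3_756_882, "PDB": 1_752_406, "Figshare": 1_287_134,
    "CCDC": 1_103_472, "dbSNP": 682_218, "GEO": 399_311,
    "Dryad": 215_642, "UniProt": 68_112, "BioProject": 42_887,
}

total = sum(citations.values())
print(f"{total:,} citations ≈ {total / 1e6:.1f}M")  # 9,308,064 ≈ 9.3M
```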
Three open data sources are linked through a knowledge graph, enabling AI that can trace every answer back to its evidence. This is GraphRAG in practice — retrieval-augmented generation grounded in structured, FAIR data.
9.3M data citations extracted from full-text publications by the Chan Zuckerberg Initiative. The raw link between papers and datasets.
1.3M publications enriched with titles, authors, ORCIDs, journals, funders, and citation counts. The scholarly metadata layer.
1.0M datasets with creators, subjects, licenses, and download counts. Rich descriptive metadata for DOI-minted research data.
101M triples in GraphDB, modelled with a custom ontology. Seven named graphs with pre-computed analytics for instant queries.
Natural language questions are translated to SPARQL, executed against the graph, and interpreted with full provenance. Every answer is traceable.
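A minimal sketch of that loop, using an in-memory stand-in for the triple store — the question-to-query mapping, graph names, predicates, and data below are all invented for illustration; a real deployment would send LLM-generated SPARQL to a GraphDB endpoint:

```python
# Hypothetical sketch of the question -> SPARQL -> provenance pipeline.
# The tiny "store" stands in for GraphDB; all names and data are invented.

STORE = {  # named graph -> list of (subject, predicate, object) triples
    "citations": [
        ("paper:1", "cites", "dataset:GSE100"),
        ("paper:2", "cites", "dataset:GSE100"),
    ],
}

def to_sparql(question: str) -> str:
    # In the real system an LLM translates the question into SPARQL;
    # here a fixed template plays that role.
    return ("SELECT ?paper FROM <citations> "
            "WHERE { ?paper <cites> <dataset:GSE100> }")

def execute(query: str):
    # Toy evaluator: handles only the one query shape the template emits,
    # and records provenance alongside the results.
    graph = "citations"
    rows = [s for (s, p, o) in STORE[graph]
            if p == "cites" and o == "dataset:GSE100"]
    provenance = {
        "graphs_traversed": [graph],
        "sparql": query,
        "result_count": len(rows),
    }
    return rows, provenance

def answer(question: str) -> dict:
    query = to_sparql(question)
    rows, provenance = execute(query)
    return {"answer": rows, "provenance": provenance}

out = answer("Which papers cite dataset GSE100?")
print(out["answer"])                      # ['paper:1', 'paper:2']
print(out["provenance"]["result_count"])  # 2
```

The design point is that the provenance record is produced by the same function that fetches the results, so an answer can never be emitted without its evidence trail.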
Built on open data (DataCite Data Citation Corpus), open standards (RDF, SPARQL, FAIR), and open-source tools. Fully reproducible.