Wikidata:Vector Database
| Wikidata Vector Database | |
|---|---|
| Organization | Wikimedia Deutschland |
| Project | Wikidata Embedding Project |
| Webapp / API | wd-vectordb.wmcloud.org |
| Documentation | wd-vectordb.wmcloud.org/docs |
| Feedback Survey | Share your feedback |
| Contact | embedding@wikimedia.de |
Overview
The Wikidata Vector Database, part of the Wikidata Embedding Project, is a system that enables semantic and context-aware access to Wikidata. It represents Wikidata entities as high-dimensional vectors that capture meaning rather than surface text, allowing users and applications to search for conceptually related data.
How to use
The Wikidata Vector Database is publicly available and can be accessed through either a web interface or a REST API.
- Web interface: Available at wd-vectordb.wmcloud.org
- API: You can integrate it into your own projects or applications. Full API documentation is available at wd-vectordb.wmcloud.org/docs
API endpoints
| Endpoint | Method | Description | Output |
|---|---|---|---|
| `/item/query/` | GET | Searches Wikidata items (QIDs) using a combined vector and keyword retrieval method. Results from both searches are merged using Reciprocal Rank Fusion (RRF). | Returns QIDs with vector similarity and RRF scores. Results are ranked by the RRF score. |
| `/property/query/` | GET | Searches Wikidata properties (PIDs) using the same hybrid vector–keyword retrieval approach as item search. | Returns PIDs with vector similarity and RRF scores. Results are ranked by the RRF score. |
| `/similarity-score/` | GET | Computes vector similarity between a text query and a specified list of Wikidata entities (QIDs or PIDs). | Returns the provided QIDs or PIDs with their similarity scores, ranked by similarity. |
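As a quick illustration, the endpoints above are plain GET requests. The sketch below only builds the request URLs; the parameter name `query` is an assumption, so check the API documentation at wd-vectordb.wmcloud.org/docs for the actual query-string schema.

```python
from urllib.parse import urlencode

# Base URL of the public instance.
BASE = "https://wd-vectordb.wmcloud.org"

def build_query_url(endpoint: str, **params: str) -> str:
    """Build a GET request URL for one of the search endpoints."""
    return f"{BASE}{endpoint}?{urlencode(params)}"

# The parameter name "query" is hypothetical; see the API docs
# for the real schema.
item_url = build_query_url("/item/query/", query="English science fiction writer")
prop_url = build_query_url("/property/query/", query="received an award")

print(item_url)
# Actually sending the request requires network access, e.g.:
#   import json, urllib.request
#   results = json.load(urllib.request.urlopen(item_url))
```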
Use cases
The Wikidata Vector Database supports a wide range of AI-based applications by providing context-aware vector embeddings of Wikidata items. These embeddings integrate the textual information associated with each item, resulting in meaningful and interpretable vector representations. While a single vector is simply a list of decimal numbers, a collection of well-constructed vectors forms a vector space in which distances and directions carry meaning, letting us compare entities and infer additional insights. Here are some potential applications that can benefit from the Wikidata Vector Database:
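The idea that "distance carries meaning" can be made concrete with cosine similarity. The 3-dimensional vectors below are toy stand-ins for real, much higher-dimensional embeddings; only the comparison logic is the point.

```python
import math

def cosine_similarity(a, b):
    """Angle-based similarity: 1.0 for identical directions, 0.0 for orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real item embeddings.
douglas_adams   = [0.9, 0.1, 0.3]
terry_pratchett = [0.8, 0.2, 0.4]  # another comic SF/fantasy author
mount_everest   = [0.1, 0.9, 0.2]  # semantically unrelated

# A well-constructed space places related authors closer together.
assert cosine_similarity(douglas_adams, terry_pratchett) > \
       cosine_similarity(douglas_adams, mount_everest)
```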
Question-answering
[edit]The embedding model is designed to capture the semantic meaning of both questions and their answers. By querying the vector database with a user question, the system can return Wikidata items that are semantically aligned with the query and potentially contain statements that answer the question. A question-answering application can retrieve QIDs of relevant Wikidata items from the vector database, fetch the most recent data from Wikidata, and analyze the results to identify supporting statements or sets of triplets that answer the question.
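The three-step pipeline just described can be sketched as stub functions. Everything here is hypothetical scaffolding (the function names and stubbed data are not part of the actual API); it only shows how the stages chain together.

```python
# Sketch of a QA pipeline: vector search -> fresh Wikidata fetch -> analysis.
# All functions are illustrative stubs, not real API calls.

def search_vector_db(question: str) -> list[str]:
    """Step 1: query the vector database and return candidate QIDs."""
    return ["Q42"]  # stubbed result

def fetch_current_claims(qid: str) -> dict:
    """Step 2: fetch up-to-date statements from Wikidata (stubbed here)."""
    return {"Q42": {"occupation": ["playwright", "screenwriter", "novelist"]}}[qid]

def answer(question: str) -> dict:
    """Step 3: collect fresh claims for each candidate so a downstream
    component can identify the statements that answer the question."""
    return {qid: fetch_current_claims(qid) for qid in search_vector_db(question)}

print(answer("What did Douglas Adams do for a living?"))
```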
Retrieval augmented generation (RAG)
Aside from question-answering applications, RAG systems can leverage the vector database in several ways:
- AI agents / MCP integration: The Wikidata Vector Database can be integrated into a Model Context Protocol (MCP) application, enabling generative AI models to query the vector database autonomously. A Wikidata MCP implementation is available, allowing AI agents to retrieve contextually relevant Wikidata items and use them to inform content generation.
- GraphRAG: Wikidata’s structured knowledge graph can be used in combination with the vector database to enable advanced retrieval techniques. One approach is a hybrid method in which the vector database serves as an initial filter to identify a relevant set of Wikidata items. Based on the subgraph formed by these items, the system can then apply graph-based reasoning to further refine the results and construct multi-hop answers.
- Reference linking: Because Wikidata includes references for many of its claims, retrieved items can be linked to generated content to provide verifiable, human-curated sources. The vector database helps identify Wikidata items with statements that contextually match the generated text.
By integrating Wikidata's data into generative AI applications, we can mitigate several limitations of pure language models:
- Reduce misinformation: Referencing external, human-verified sources like Wikidata reduces reliance on the model’s internal knowledge stored in its weights and helps minimize errors.
- Combat disinformation: Providing sources alongside generated responses allows users to verify information, improving transparency and traceability.
- Ensure freshness: Unlike static LLM training data, Wikidata is continuously updated by a global community. Using it as an external source helps maintain the relevance and accuracy of AI-generated content.
- Amplify underrepresented knowledge: LLMs tend to favor information that is repeated frequently across many sources in their training data. In contrast, Wikidata represents each statement only once, offering a more balanced representation regardless of popularity. By integrating Wikidata into RAG systems, contributors have a stronger impact on generative AI, helping preserve knowledge diversity and counterbalance dominant narratives.
Named entity disambiguation
The Wikidata Vector Database can improve entity linking tasks by supporting the disambiguation phase of a Named Entity Recognition pipeline. Once entities are extracted from a text, the database can be queried using the surrounding context of each mention. The vector search then returns a ranked list of Wikidata items based on semantic similarity, helping identify the most likely corresponding QID.
In many cases, the sentence containing the entity mention focuses on topics other than the entity itself. As a result, querying the entire vector database may not reliably return the correct item. To address this, we first retrieve a list of candidate QIDs using keyword search. We then compare the vector representation of the query context to the vectors of these candidate items. The Wikidata item whose vector is most similar to the context vector is likely to be the correct match for the entity.
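The two-step approach (keyword search to get candidates, then vector comparison against the mention's context) can be sketched as follows. The candidate list, the QIDs, and the vectors are all illustrative stand-ins.

```python
# Disambiguating the mention "Mercury": keyword search narrows the
# candidates, then similarity to the context vector picks the QID.
# QIDs and vectors are illustrative, not retrieved from the real database.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Step 1 (stubbed): candidate QIDs from keyword search, with their vectors.
candidates = {
    "Q308":   [0.9, 0.1, 0.1],  # Mercury, the planet
    "Q925":   [0.1, 0.9, 0.1],  # mercury, the chemical element
    "Q15869": [0.1, 0.1, 0.9],  # Freddie Mercury, the singer
}

# Step 2: embed the mention's context, e.g.
# "... Mercury, the closest planet to the Sun ..." (stubbed vector).
context_vector = [0.8, 0.2, 0.1]

best_qid = max(candidates, key=lambda qid: dot(context_vector, candidates[qid]))
print(best_qid)  # → Q308
```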
Zero-shot text classification
The Wikidata Vector Database can be used to perform zero-shot text classification by comparing the text to Wikidata items or categories. Instead of training a model for each classification task, classes can be mapped to specific Wikidata items. By computing the similarity between the text vector and the vectors of these target items, we can infer which topics are most relevant or contextually aligned with the input.
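A minimal sketch of this class-to-item mapping, with toy vectors standing in for the embeddings that the database would return:

```python
# Zero-shot classification sketch: each class label is backed by a
# Wikidata item vector; the text gets the class with the highest
# similarity. All vectors here are illustrative stand-ins.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical class -> item-vector mapping (each class would map to a QID).
classes = {
    "sports":   [0.9, 0.1, 0.0],
    "politics": [0.1, 0.9, 0.1],
    "science":  [0.0, 0.1, 0.9],
}

text_vector = [0.1, 0.1, 0.9]  # stubbed embedding of the input text

label = max(classes, key=lambda c: dot(text_vector, classes[c]))
print(label)  # → science
```

In practice the `/similarity-score/` endpoint can compute these query-to-entity similarities directly, so no local embedding model is needed.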
Semantic visualization & exploration
The vector representations of Wikidata items enable interactive and interpretable visualizations of the knowledge graph in ways that go beyond traditional node-link diagrams. With dimensionality reduction techniques, we can project the vectors into 2D or 3D to visualize semantic clusters of items.
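As a sketch of the projection step, here is a PCA-style reduction to 2D via SVD, assuming NumPy is available; the random 8-dimensional inputs stand in for real embeddings.

```python
# Project high-dimensional vectors to 2D so semantic clusters can be
# plotted. The inputs are random stand-ins for real item embeddings.
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(100, 8))  # 100 fake "item embeddings"

# PCA via SVD: centre the data, keep the top-2 right singular vectors.
centred = vectors - vectors.mean(axis=0)
_, _, vt = np.linalg.svd(centred, full_matrices=False)
points_2d = centred @ vt[:2].T

print(points_2d.shape)  # → (100, 2)
```

The resulting `points_2d` array can be fed to any plotting library to display the clusters.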
Setup
The process involves transforming structured data from Wikidata into vector representations that enable semantic search and context-aware retrieval. This process is performed for each language on Wikidata, producing multilingual embeddings that support cross-language search and retrieval. The setup consists of the following stages:
- Transforming Wikidata items into text: Each Wikidata item consists of structured information, including labels, descriptions, aliases, and claims. Labels serve as the primary name for each item, while descriptions provide brief contextual information. Aliases offer alternative names, and claims define additional properties and relationships to other entities. To create a text representation of each item, these components are combined into a coherent string that preserves the item’s essential information. As an example, consider the Wikidata item for Douglas Adams (Q42). A transformation would be:
Douglas Adams, English science fiction writer and humorist (1952–2001), also known as Douglas N. Adams, Douglas Noël Adams, Douglas Noel Adams, DNA. Attributes include:
- instance of: human
- sex or gender: male
- occupation: playwright, screenwriter, novelist (start time: 1979)
- notable work: The Hitchhiker's Guide to the Galaxy pentalogy, Dirk Gently series, The Private Life of Genghis Khan
- date of birth: 1952 Mar 11
- place of birth: Cambridge
…
◼ Label ◼ Description ◼ Aliases ◼ Property Label ◼ Statement Value ◼ Qualifiers
The transformation process required navigating from one item to another within the Wikidata dump to aggregate all relevant labels. To simplify this process, we curated a dataset on Hugging Face that expands the original Wikidata dump by adding labels for properties and connected entities in the claims. You can find the Hugging Face dataset here.
- Transforming Wikidata properties into text: Each Wikidata property is processed in two complementary ways, resulting in two separate vector representations per language:
- Property as entity text: A version generated exactly like an item, combining the property's label, description, aliases, and statements about the property itself.
- Property with examples text: A second version that combines the property's label, description, aliases, and a selection of example statements showing how the property is used in Wikidata:
award received, award or recognition received by a person, organization or creative work, also known as winner of, prize awarded, honorary title, honors. Examples:
- Douglas Adams: award received: Inkpot Award (point in time: 1983), Ditmar Award (for work: The Hitchhiker's Guide to the Galaxy)
- Luxembourg: award received: Charlemagne Prize
- Tim Berners-Lee: award received: Prix Ars Electronica
…
◼ Label ◼ Description ◼ Aliases ◼ Property Label ◼ Statement Value ◼ Qualifiers
- Chunking: Chunking is applied during text transformation for both items and properties at the statement level. Each chunk includes a subset of statements but keeps the same label, description, aliases, and "instance of" information. This ensures that each vector stays within the model’s input limits while preserving the entity’s semantic context. Before chunking, statements are sorted according to the property order defined in MediaWiki:Wikibase-SortedProperties to maintain consistency across entities.
- Converting text into vectors: Once entities are transformed into text, they are passed through an embedding model. This model generates vector representations, or embeddings, of the text, capturing the semantic meaning and context of each item. We use the Jina Embedding V3 model with the retrieval.passage task to compute the vectors.
- Storing vectors in a vector database: The generated vectors are stored in DataStax Astra DB. This database is optimized for storing and retrieving high-dimensional vectors, enabling efficient and scalable vector search capabilities. Each vector is linked to its corresponding Wikidata item, allowing for fast retrieval of relevant entities during search.
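The text-transformation and chunking stages above can be sketched together. This is a simplification: the input layout, the chunk size, and the header format are assumptions, and unlike the real pipeline the sketch does not sort statements by the MediaWiki property order or repeat the "instance of" statement in every chunk.

```python
# Simplified sketch of item-to-text transformation plus statement-level
# chunking. Field layout and chunk size are illustrative assumptions.

def item_to_text(item: dict) -> str:
    """Combine label, description, and aliases into a header string."""
    header = f"{item['label']}, {item['description']}"
    if item["aliases"]:
        header += ", also known as " + ", ".join(item["aliases"])
    return header

def chunk_statements(item: dict, chunk_size: int = 2) -> list[str]:
    """Split statements into chunks that each repeat the full header,
    so every chunk's vector keeps the entity's core context."""
    header = item_to_text(item) + ". Attributes include:\n"
    statements = [f"- {prop}: {value}" for prop, value in item["claims"]]
    return [
        header + "\n".join(statements[i:i + chunk_size])
        for i in range(0, len(statements), chunk_size)
    ]

q42 = {
    "label": "Douglas Adams",
    "description": "English science fiction writer and humorist (1952–2001)",
    "aliases": ["Douglas N. Adams", "DNA"],
    "claims": [
        ("instance of", "human"),
        ("occupation", "playwright, screenwriter, novelist"),
        ("place of birth", "Cambridge"),
    ],
}

chunks = chunk_statements(q42)
print(len(chunks))  # → 2
```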
Search
The Wikidata Vector Database API uses a hybrid search approach that combines vector-based semantic search with traditional keyword search to improve both accuracy and coverage. This approach retrieves conceptually related entities while still enabling precise keyword-based results when contextual information is limited.
- Vector search: The input query is first converted into an embedding using the retrieval.query task of the Jina Embedding V3 model.
- When a specific language is selected, the search is performed only on vectors stored for that language.
- If the chosen language is not yet included in the database, the query is automatically translated into English using MinT (based on NLLB-200), and the search runs across all available vectors.
- Keyword search: A parallel keyword-based search is performed through the Wikidata API.
- Stop words in the selected language are removed to improve precision.
- Remaining terms are joined with logical `OR` operators to broaden recall and allow partial matches. This step ensures that entities not well represented in the vector space can still be found through exact textual matches.
- The entities returned by the keyword search are then matched against the vector database to compute their semantic similarity scores, aligning them within the same vector space as the results from the vector search.
- Instance of filtering (optional): Users can optionally filter results by specifying an `instanceof` parameter, a list of Wikidata QIDs that restricts search results to entities that are instances of specific classes.
- Reciprocal rank fusion: Results from the vector and keyword searches are combined using Reciprocal Rank Fusion (RRF). This method merges both result lists by ranking entities that appear in either list, favoring entities that score highly in both.
- The RRF score for an entity e is computed as:
- RRF(e) = Σ_{l ∈ L} 1 / (k + r_l(e)) [1]
- where:
- L is the set of ranked result lists being fused
- r_l(e) is the rank position of entity e in list l
- k is a constant (typically 60) that reduces the contribution of lower-ranked results.
- Reranking (optional): An optional reranker model can be applied after RRF to refine the ordering of the results based on deeper contextual relevance. The model used is the Jina reranker v2 base multilingual, a multilingual cross-encoder that jointly encodes the query and each retrieved entity before scoring their relationship. This makes it more accurate than vector search, which computes embeddings for the query and the entities independently. However, this process is slower and generally unnecessary for retrieval-augmented generation (RAG) applications, where the order of the results doesn't matter.
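The fusion step maps directly to a few lines of code. The sketch below uses k = 60 and toy result lists; the QIDs are placeholders.

```python
# Reciprocal Rank Fusion: each list contributes 1 / (k + rank) to an
# entity's score, so entities ranked highly in both lists win.

def rrf(ranked_lists: list[list[str]], k: int = 60) -> dict[str, float]:
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, entity in enumerate(ranking, start=1):
            scores[entity] = scores.get(entity, 0.0) + 1.0 / (k + rank)
    return scores

# Toy result lists standing in for the two retrieval paths.
vector_results  = ["Q42", "Q5", "Q11573"]
keyword_results = ["Q42", "Q937", "Q5"]

scores = rrf([vector_results, keyword_results])
merged = sorted(scores, key=scores.get, reverse=True)
print(merged)  # Q42 first: it tops both lists
```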
Languages
The embedding model used for the Wikidata Vector Database is the Jina Embedding V3, a multilingual model supporting over 100 languages.
Each language is embedded separately to provide broad coverage of Wikidata items beyond English and to preserve language-specific nuances in meaning and phrasing. This multilingual setup allows:
- Inclusion of items that exist primarily in non-English Wikipedias or contain localized labels and descriptions.
- Cross-lingual retrieval, where a query in one language can return conceptually related items written in another.
- Improved accuracy through multilingual diversity. Evaluations show that combining data from multiple languages within the same vector space increases the semantic robustness of search results.
Limitations
The Wikidata Vector Database is currently in an alpha release stage and continues to evolve. The following points outline its current limitations:
- Coverage: The database includes vector embeddings for approximately 30 million Wikidata items connected to at least one Wikipedia article.
- Version: The current embeddings are based on the Wikidata dump from September 2024. Future versions will introduce scheduled updates to keep the database aligned with recent edits and additions.
- Purpose: The vector search is primarily designed for exploration and discovery. It helps users and researchers identify related entities that might not appear through traditional keyword search. The results are not intended to serve as direct answers but as starting points for further investigation and analysis.
Links
- Web application & API
- API documentation
- Embedding Project - includes the project page, mission, and goals.
- GitHub repository for the vector database preparation
- GitHub repository for the API & web app
- Hugging Face repository - includes an extended Wikidata dump with resolved property and entity labels.
- Feedback survey – help us improve by sharing your feedback and the projects you’re building with the Wikidata Vector Database.
References
- ↑ Cormack, Gordon V.; Clarke, Charles L. A.; Buettcher, Stefan (2009). "Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods". Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval. doi:10.1145/1571941.1572114.