Building a trustworthy AI knowledge base is not about hoarding data; it is about curating a living system where information is findable, verifiable, and contextually aware. When we rely on Large Language Models (LLMs) to answer questions, the quality of the output is strictly bound to the quality of the retrieval process. If the model retrieves the wrong context, or a context that lacks provenance, the result is often a confident hallucination.
To engineer a system that resists this, we must move beyond simple vector storage. We need to architect a pipeline that understands the relationships between documents, the structure of the data within them, and the integrity of the citations themselves.
The Philosophy of Chunking: Beyond the Sliding Window
The most common mistake in building a Retrieval-Augmented Generation (RAG) system is treating text chunking as a trivial preprocessing step. Developers often default to a fixed-size sliding window—say, 512 tokens with a 256-token overlap—assuming that smaller chunks always yield better precision. While this helps the model focus, it frequently destroys the semantic continuity required for complex reasoning.
Consider a technical specification document. If we slice it strictly by token count, we might separate a parameter definition from its constraints or an error code from its description. The retrieval system might find the error code, but without the surrounding context of the constraints, the LLM is left to guess the solution.
We need a structure-aware chunking strategy. This involves parsing the document’s semantic layout before splitting it. For Markdown or HTML sources, we should respect header hierarchies. A chunk should ideally represent a complete thought or a specific section.
A robust approach involves a hybrid chunking method:
1. **Structural Chunking:** First, segment the document by logical boundaries (H2, H3 tags in HTML; `##` headers in Markdown). This ensures that a definition or a procedure remains intact.
2. **Semantic Refinement:** Within those structural blocks, if the content is dense, we apply a semantic splitter. This isn’t just counting tokens; it looks for natural pauses in the text—periods, logical breaks in lists, or code block closures.
By prioritizing structural integrity over arbitrary token limits, we preserve the author’s intent. The context retrieved is not just a snippet of text; it is a complete argument or instruction set.
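As a concrete sketch of the structural pass, the following splits Markdown sources at `##`/`###` boundaries before any token-based refinement. The function is illustrative rather than a specific library API.

```python
import re

def split_markdown_by_headers(markdown_text, levels=("## ", "### ")):
    """Split a Markdown document into sections at H2/H3 boundaries.

    Each returned section keeps its header line, so a definition or
    procedure stays attached to the heading that introduces it.
    """
    sections = []
    current = []
    for line in markdown_text.splitlines():
        if line.startswith(levels) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return [s for s in sections if s]
```

Each resulting section is then handed to the semantic splitter only if it exceeds the practical input limit of the embedding model.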
Metadata and Ontologies: Giving Data a Brain
Raw text is dumb. To make a knowledge base intelligent, every chunk needs metadata. This is where many implementations fall short, relying solely on vector similarity. Vector search is excellent for semantic “fuzziness,” but it struggles with strict filtering.
Imagine a user asking, “What is the syntax for the `merge` function in the 2023 version of the API?” A pure vector search might retrieve a chunk discussing the `merge` function from 2021 because the syntax is similar. However, the user explicitly requested the 2023 version. Without metadata, the system fails.
We attach a JSON object to every chunk. This metadata should include:
* **Document ID:** A unique pointer to the source file.
* **Version:** The software version or document revision.
* **Content Type:** Is this a tutorial, a reference guide, or a changelog?
* **Ontology Tags:** Keywords that aren’t present in the text but describe the concept (e.g., tagging a chunk about “SQL injection” with `security`, `database`, `vulnerability`).
Ontology tags bridge the gap between the user’s query and the document’s content. They allow for a hybrid search strategy: a vector search captures the semantic meaning, while a metadata filter constrains the search space to the relevant domain. The result is less noise in the retrieved context and a stronger signal for the LLM to work with.
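A chunk payload following this scheme might look like the example below; the field names and values are illustrative rather than a fixed schema.

```python
chunk_metadata = {
    "doc_id": "api_ref_v2",          # unique pointer to the source file
    "version": "2023.1",             # software version or document revision
    "content_type": "reference",     # tutorial, reference guide, changelog, ...
    "ontology_tags": ["security", "database", "vulnerability"],
    "section": "network_config",     # structural position within the document
    "source_url": "/docs/api_ref_v2#network_config",
}
```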
Graphing the Knowledge: Cross-Document Links
A knowledge base is rarely a flat list of documents. It is a web of interconnected concepts. If Document A references Document B, that relationship is valuable data. In a standard RAG setup, this link is often lost; the system retrieves a chunk from A, but the user has no easy way to explore the connection to B.
To build a truly useful system, we should construct a graph layer on top of our vector store. When ingesting documents, we scan for internal links or explicit references (e.g., “See Section 4.2”). We store these as edges in a graph database (like Neo4j) or as structured metadata fields.
When a user queries the system, the retrieval isn’t limited to the top-k vector matches. The system can traverse the graph to find “one hop away” neighbors. If the LLM generates an answer based on a retrieved chunk, we can also suggest related documents based on these graph edges. This mimics the behavior of an expert who knows not just the answer, but the adjacent literature that supports it.
Citation Integrity: The Trust Layer
The biggest liability of an AI assistant is “hallucinated” citations—plausible-sounding references that do not exist. To maintain trust, the system must never invent a source.
We implement a strict citation protocol. When the LLM generates a response, it must cite the specific chunk IDs used. The frontend or the middleware then validates these IDs against the vector store. If a chunk ID is missing or corrupted, the citation is flagged.
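A minimal sketch of that validation step, assuming the LLM is prompted to emit citations as `[chunk:<id>]` markers and that the vector store client exposes a fetch-by-ID lookup (both are assumptions, not a fixed API):

```python
import re

CITATION_PATTERN = re.compile(r"\[chunk:([A-Za-z0-9_-]+)\]")

def validate_citations(response_text, store):
    """Check every cited chunk ID against the vector store.

    `store.fetch(ids)` is assumed to return a mapping of ID -> record
    for the IDs that actually exist; adapt this to your client's API.
    """
    cited_ids = set(CITATION_PATTERN.findall(response_text))
    found = store.fetch(list(cited_ids)) if cited_ids else {}
    missing = cited_ids - set(found.keys())
    return {
        "valid": sorted(cited_ids - missing),
        "flagged": sorted(missing),  # likely hallucinated or corrupted citations
    }
```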
However, we can go deeper. We should verify that the citation is not just present, but relevant. This requires a “Claim Verification” step.
Testing Approach for Citation Correctness
Automated testing of LLM outputs is notoriously difficult because of non-determinism. However, we can rigorously test the citation integrity using a deterministic pipeline. We treat the retrieval and generation steps as separate units to be verified.
**1. The Golden Set Validation**
We curate a “Golden Set” of questions and expected source documents. This isn’t about the exact text, but the specific document IDs or sections that should be referenced.
* *Input:* “What is the default timeout for the connection?”
* *Expected Source:* `doc_id: api_ref_v2`, `section: network_config`.
* *Test:* We run the query through the retriever. If the top retrieved chunk is not from `doc_id: api_ref_v2`, the test fails. This validates the retrieval logic, independent of the LLM’s generation.
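A golden-set check of this kind drops naturally into a standard test runner such as `pytest`. The `golden_set.json` layout and the `retriever.search` interface below are assumptions about your own pipeline, not fixed APIs.

```python
import json

import pytest

# Each entry: {"query": ..., "expected_doc_id": ..., "expected_section": ...}
with open("golden_set.json") as f:
    GOLDEN_SET = json.load(f)

@pytest.mark.parametrize("case", GOLDEN_SET)
def test_top_chunk_comes_from_expected_source(case, retriever):
    # `retriever` is assumed to be a fixture defined in conftest.py
    results = retriever.search(case["query"], top_k=3)
    top = results[0].metadata
    assert top["doc_id"] == case["expected_doc_id"]
    assert top["section"] == case["expected_section"]
```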
**2. The NLI (Natural Language Inference) Check**
Once the LLM generates an answer with citations, we run an NLI check. We treat the generated sentence as a “hypothesis” and the retrieved source chunk as the “premise.”
* We use a smaller, specialized model (like a distilled BERT variant fine-tuned on NLI tasks) to classify the relationship.
* If the model classifies the relationship as *Contradiction* or *Neutral*, the citation is likely incorrect or hallucinated.
* If the relationship is *Entailment*, the citation is valid.
**3. The “Lost in the Middle” Stress Test**
LLMs often pay less attention to context chunks placed in the middle of the prompt. We need to test if the system is correctly weighting citations regardless of their position in the context window.
* We construct queries where the correct answer is buried in the third chunk provided to the model.
* We verify if the model still cites it correctly. If the model consistently fails to cite the middle chunk, we adjust the prompt engineering or the chunk ordering strategy to ensure balanced attention.
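One way to script that stress test, assuming a hypothetical `generate_answer(query, context_chunks)` helper that returns answer text containing `[chunk:<id>]` citation markers:

```python
def lost_in_the_middle_case(generate_answer, query, correct_chunk, distractors):
    """Bury the correct chunk in the middle of the context and check the citation.

    `correct_chunk` and `distractors` are dicts with "id" and "text" keys;
    the expected behaviour is that the answer still cites the buried chunk.
    """
    # Place the correct chunk third in a five-chunk context window
    context = distractors[:2] + [correct_chunk] + distractors[2:4]
    answer = generate_answer(query, context)
    return f"[chunk:{correct_chunk['id']}]" in answer
```

Running this over a batch of golden-set questions yields a middle-position citation rate that can be tracked across prompt revisions.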
Implementation: A Modular Pipeline
Let’s look at how this comes together in a practical implementation. We won’t rely on a monolithic framework; we will build a modular pipeline using Python.
Step 1: The Ingestion Engine
We need a parser that respects structure. For HTML, `BeautifulSoup` is standard, but for complex technical docs, a library like `unstructured` is often a better fit because it handles code blocks and section titles intelligently.
```python
import hashlib

from unstructured.chunking.title import chunk_by_title
from unstructured.partition.html import partition_html

def ingest_html_document(html_content, metadata):
    # Partition the HTML into semantic elements (Title, NarrativeText, etc.)
    elements = partition_html(text=html_content)

    # Chunk by title/section to maintain context
    chunks = chunk_by_title(
        elements,
        max_characters=1500,
        new_after_n_chars=1000,
        combine_text_under_n_chars=200,
    )

    structured_chunks = []
    for chunk in chunks:
        # Enrich metadata; use a stable content hash so re-ingestion can detect
        # unchanged chunks (Python's built-in hash() is salted per process)
        chunk_metadata = {
            **metadata,
            "element_type": chunk.category,
            "hash": hashlib.sha256(chunk.text.encode("utf-8")).hexdigest(),
        }
        structured_chunks.append({
            "text": chunk.text,
            "metadata": chunk_metadata,
        })
    return structured_chunks
```
Notice the `chunk_by_title` function. It ensures that a section header is always paired with its content. This is superior to a sliding window because it respects the document’s logical flow.
Step 2: The Hybrid Index
We need a database that supports both vector search and metadata filtering. Pinecone, Weaviate, or a local solution like ChromaDB are common choices. The key is how we structure the index.
When upserting, we don’t just store the vector and the text. We store the metadata as a separate payload.
```python
import uuid

# Pseudocode for upserting to a vector store (Pinecone-style API)
def upsert_chunk(client, chunk, vector_embedding):
    client.index.upsert(
        vectors=[
            {
                "id": f"chunk_{uuid.uuid4()}",
                "values": vector_embedding,
                "metadata": {
                    "text": chunk["text"],
                    "doc_version": chunk["metadata"]["version"],
                    "tags": chunk["metadata"]["tags"],
                    "source_url": chunk["metadata"]["url"],
                },
            }
        ]
    )
```
The power here is in the query. When a user asks a question, we generate an embedding for the query, but we also apply a pre-filter on the metadata.
```python
# Querying with metadata filtering
results = client.query(
    vector=query_embedding,
    top_k=3,
    filter={
        "doc_version": {"$eq": "2023"},
        "tags": {"$in": ["security", "api"]},
    },
)
```
This ensures that even if the semantic similarity is slightly lower for the 2023 version, the strict filter prevents the model from retrieving outdated information.
Step 3: The Graph Layer (Optional but Recommended)
For the graph layer, we can use a simple dictionary structure or a dedicated library like `NetworkX` for smaller datasets, or Neo4j for enterprise scale. During ingestion, we scan for explicit cross-references.
If we are processing Markdown, we can use Regex to find links that point to other internal documents.
```python
import re

import networkx as nx

G = nx.DiGraph()

def extract_and_link(chunk_text, doc_id):
    # Look for markdown links or specific reference patterns
    # Example: [See Authentication Guide](/docs/auth.md)
    pattern = r"\[.*?\]\((.*?)\)"
    matches = re.findall(pattern, chunk_text)
    for link in matches:
        if link.startswith("/docs/"):
            # Add edge: Current Doc -> Linked Doc
            G.add_edge(doc_id, link)
    return G
```
When querying, we can retrieve the vector search results, then look up their neighbors in the graph. If the user’s query implies a need for related concepts (e.g., “How do I authenticate?”), we can fetch the immediate neighbors of the retrieved chunk to provide a broader context.
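A sketch of that one-hop expansion over the `NetworkX` graph built above, assuming each retrieved chunk carries a `doc_id` in its metadata:

```python
def expand_with_neighbors(G, retrieved_chunks, max_neighbors=3):
    """Collect documents one hop away from the retrieved chunks' source docs."""
    related_docs = []
    for chunk in retrieved_chunks:
        doc_id = chunk["metadata"]["doc_id"]
        if doc_id in G:
            # successors() follows outgoing edges: Current Doc -> Linked Doc
            related_docs.extend(list(G.successors(doc_id))[:max_neighbors])
    # De-duplicate while preserving order
    return list(dict.fromkeys(related_docs))
```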
Verification and Testing in Practice
Let’s expand on the NLI verification method mentioned earlier. This is a critical safeguard. We can use the `transformers` library from Hugging Face to implement a verifier that runs asynchronously.
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load an NLI model; bart-large-mnli is a solid default, and smaller distilled
# NLI checkpoints can be swapped in if latency is a concern
MODEL_NAME = "facebook/bart-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def verify_citation(premise, hypothesis):
    """
    premise: the text retrieved from the knowledge base.
    hypothesis: the claim made by the LLM in the generated response.
    Returns True only if the premise entails the claim.
    """
    # NLI models score the (premise, hypothesis) pair jointly
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = logits.softmax(dim=-1)[0]
    predicted_label = model.config.id2label[int(probs.argmax())].lower()
    return predicted_label == "entailment"
```
In a production environment, we wouldn’t run this on every single response due to latency. Instead, we would run it on a percentage of traffic (e.g., 5%) for monitoring, or on a background queue for batch verification of logs.
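A minimal sketch of that sampling gate; the rate and the queue object are placeholders for whatever monitoring infrastructure you already run:

```python
import random

SAMPLE_RATE = 0.05  # verify roughly 5% of responses

def maybe_enqueue_for_verification(queue, response_text, retrieved_chunks):
    """Push a sampled response onto a background queue for NLI verification."""
    if random.random() < SAMPLE_RATE:
        queue.put({
            "response": response_text,
            "sources": [c["metadata"] for c in retrieved_chunks],
        })
```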
Another testing angle is **Adversarial Querying**. We deliberately construct queries that are ambiguous or that reference concepts not present in the knowledge base.
* *Test Case:* Query for a feature that doesn’t exist.
* *Expected Behavior:* The system should state it doesn’t know, rather than retrieving a loosely related chunk and hallucinating a feature.
* *Metric:* We measure the “False Positive Retrieval Rate.” If the retriever returns chunks with low semantic similarity scores (e.g., cosine similarity < 0.7), we should adjust the similarity threshold or the minimum relevance score required before passing context to the LLM.
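A simple guard for that threshold, assuming the store returns results with a cosine-similarity `score` field:

```python
MIN_SIMILARITY = 0.7  # tune against the golden set and adversarial queries

def filter_relevant(results, threshold=MIN_SIMILARITY):
    """Drop weak matches so the LLM can say "I don't know" instead of guessing."""
    relevant = [r for r in results if r["score"] >= threshold]
    return relevant  # an empty list should trigger a refusal, not a forced answer
```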
Refining the User Experience
The technical architecture is only half the battle. The user interface must communicate the system’s reliability.
When displaying results, we should present the citations not as an afterthought, but as a primary feature. Each claim in the generated text should be linkable to the source chunk. Ideally, hovering over a citation should reveal the exact snippet from the source document. This transparency allows the user to verify the AI’s work instantly.
Furthermore, we should implement a feedback loop. If a user flags a citation as incorrect, that data point is gold. It helps us tune our retrieval thresholds and identify documents that might be poorly written or ambiguous. We can add this feedback to our “Golden Set” for future regression testing.
Handling Edge Cases and Data Drift
Knowledge bases are dynamic. Documents are updated, deprecated, or deleted. A common failure mode is “stale retrieval,” where the system retrieves an outdated version of a document that has since been corrected.
To mitigate this, we implement a “TTL” (Time To Live) or a “Last Verified” date in our metadata.
1. **Scheduled Re-indexing:** Set up a cron job that re-ingests documents periodically. If the content hash changes, the document is updated in the vector store.
2. **Deprecation Flags:** If a document is deprecated, we don’t delete it immediately (as it might be needed for historical context), but we add a `deprecated: true` flag to the metadata. We can then filter these out by default or weight them very low in retrieval.
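A sketch of both conventions; `stored_hash_for` is a hypothetical lookup into your ingestion records, the hash mirrors the one computed at ingestion time, and the filter uses the same Pinecone-style operators shown earlier.

```python
import hashlib

def needs_reindex(doc_id, current_content, stored_hash_for):
    """Return True if the document's content hash has changed since last ingestion."""
    new_hash = hashlib.sha256(current_content.encode("utf-8")).hexdigest()
    return new_hash != stored_hash_for(doc_id)

# Default retrieval filter: hide deprecated documents unless explicitly requested
DEFAULT_FILTER = {"deprecated": {"$ne": True}}
```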
Another edge case is the “needle in a haystack” problem. If a user asks a very specific question that requires a single sentence from a 100-page PDF, standard chunking might miss it if the chunk size is too large.
* **Solution:** We can implement a two-stage retrieval. First, a broad search retrieves the relevant document. Second, we run a more granular search (or a full-text search within that document) to find the specific sentence. This is computationally more expensive but ensures high precision for detail-oriented queries.
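A sketch of that two-stage flow, with `coarse_search` and `fine_search_within` passed in as hypothetical helpers over the same index:

```python
def two_stage_retrieve(query_embedding, query_text, coarse_search, fine_search_within):
    """First locate candidate documents, then search for the exact passage inside them."""
    # Stage 1: broad vector search to identify which documents are relevant
    candidate_docs = {
        r["metadata"]["doc_id"] for r in coarse_search(query_embedding, top_k=5)
    }

    # Stage 2: granular (sentence-level or full-text) search restricted to those documents
    hits = []
    for doc_id in candidate_docs:
        hits.extend(fine_search_within(doc_id, query_text, top_k=3))
    return sorted(hits, key=lambda h: h["score"], reverse=True)[:3]
```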
Conclusion
The construction of a trustworthy AI knowledge base is an exercise in constraint design. We constrain the retrieval space with metadata and ontologies. We constrain the context windows with structural chunking. We constrain the model’s output with citation verification.
By treating the knowledge base not as a static bucket of text but as a structured, graph-linked, and rigorously tested system, we move from simple question-answering to a reliable engineering tool. The goal is not to replace human expertise, but to provide a tireless, verifiable assistant that knows exactly where it learned what it knows. The implementation requires patience and a willingness to iterate on the ingestion pipeline as much as the model itself, but the result is a system that earns user trust, one verified citation at a time.

