RAG in .NET 10: Architecture for Knowledge-Grounded AI Apps

Pure LLMs hallucinate. Ask a chatbot about your company's return policy and it will confidently invent one — wrong dates, wrong terms, wrong tone. The fix is not a bigger model. The fix is Retrieval-Augmented Generation, where the model only answers based on documents you retrieve and pass into the prompt. Every claim is grounded in something your team controls. Hallucinations drop, citations become possible, and updates ship by re-indexing — not by retraining.

RAG is the dominant production AI pattern of 2026, and it's now buildable end-to-end in .NET 10. This guide walks through the five stages of a RAG pipeline, the vector database choices, and the architectural patterns that hold up under real load — all on the .NET stack you already run.

4 vector DBs compared

.NET 10 LTS runtime

Grounded, cited responses

The RAG ecosystem in .NET 10

Microsoft.Extensions.AI. Unified IEmbeddingGenerator<string, Embedding<float>> across providers. The same code generates embeddings against OpenAI, Azure, or self-hosted models.

Semantic Kernel (✅ Native .NET). Memory abstraction, RAG orchestration, retrieval connectors. Bridges raw embeddings with prompt construction.

ONNX embeddings (✅ Cross-framework). Run sentence-transformer models locally with no API costs. Lower quality than frontier models, but private and cheap.

Vector database (🟡 Choose one). pgvector (Postgres), Qdrant, Milvus, Pinecone, or SQL Server 2025's native vector type. All have official .NET clients.

Chunking (🟡 Choose strategy). Fixed-size, sentence-boundary, recursive, or semantic. The single biggest determinant of retrieval quality.

Reranker (🟡 Optional). A cross-encoder model that re-orders the top-K results from vector search by relevance. Improves precision noticeably.

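Quick reference: the five-stage pipeline

Ingestion. Parse, chunk, embed, and index your documents.

Query embedding. Embed the user's query with the same model used for ingestion.

Retrieval. Find candidate chunks with vector and keyword search, then rerank.

Augmentation. Build a prompt that grounds the model in the retrieved chunks.

Generation. Stream the cited answer back to the user.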

Ingestion: from raw documents to searchable chunks

Bad chunking is the single most common reason RAG returns garbage. A 50-page PDF chunked at every 1,000 characters tears sentences in half. The chunks the retrieval step finds are technically "about" the topic but contain incomplete thoughts. Spend time on chunking before tuning anything else.

The ingestion pipeline has four steps:

Parse. PDF, DOCX, HTML, Markdown, Confluence pages, JIRA tickets — extract clean text. Libraries: UglyToad.PdfPig for PDFs, DocumentFormat.OpenXml for Office, AngleSharp for HTML.

Chunk. Split the text into ~500-token windows with 50-token overlap. Respect sentence and paragraph boundaries.

Embed. Convert each chunk to a vector using IEmbeddingGenerator.

Index. Insert (chunk_text, vector, source_doc_id, page_number, metadata) into your vector database.

using Microsoft.Extensions.AI;

// IVectorStore, VectorRecord, ExtractText, and ChunkText are application-defined (not shown).
public class IngestionService(
    IEmbeddingGenerator<string, Embedding<float>> embedder,
    IVectorStore store)
{
    public async Task IngestAsync(string filePath, string sourceId)
    {
        // Step 1: extract text
        var rawText = ExtractText(filePath);

        // Step 2: chunk with overlap, respecting sentence boundaries
        var chunks = ChunkText(rawText, chunkSize: 500, overlap: 50);

        // Step 3: batch-embed (most providers support 100+ in a single call)
        foreach (var batch in chunks.Chunk(96))
        {
            var embeddings = await embedder.GenerateAsync(batch.Select(c => c.Text));

            // Step 4: index each chunk alongside its vector and source metadata
            var records = batch.Zip(embeddings, (chunk, vec) => new VectorRecord
            {
                Id = Guid.NewGuid(),
                Text = chunk.Text,
                Vector = vec.Vector.ToArray(),
                SourceId = sourceId,
                PageNumber = chunk.PageNumber,
                IngestedAt = DateTime.UtcNow
            });

            await store.UpsertBatchAsync(records);
        }
    }
}

Chunking strategy matters more than chunk size

Rules that consistently improve quality:

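Don't break mid-sentence. Use a regex or library that respects punctuation. A simple sentence-boundary detector works. Note that System.Globalization.StringInfo only finds text-element (grapheme) boundaries, so it stops you from splitting a character, not a sentence.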

Keep tables intact. A 30-row pricing table chunked across boundaries becomes useless. Detect tables during parsing and emit them as single chunks (or convert to clean markdown).

Overlap by 10–15%. Without overlap, a sentence that crosses a chunk boundary is partially in two places and fully understood in neither.

Store hierarchy metadata. Section title, parent heading, page number. Use these in the prompt to help the model orient.
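
The rules above are easier to judge with something concrete. Here is a minimal sketch of a sentence-aware chunker with overlap. SentenceChunker is a hypothetical helper, not part of any library. It approximates tokens by counting whitespace-separated words and uses a crude regex for sentence boundaries, so swap in your embedding model's real tokenizer and a proper sentence splitter before relying on it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not a library API.
public static class SentenceChunker
{
    // Split after '.', '!' or '?' followed by whitespace.
    // Crude: abbreviations such as "e.g." will also split.
    private static readonly Regex SentenceBoundary =
        new(@"(?<=[.!?])\s+", RegexOptions.Compiled);

    // Assumption: word count stands in for the model's token count.
    private static int CountTokens(string text) =>
        Regex.Matches(text, @"\S+").Count;

    public static List<string> Chunk(string text, int maxTokens = 500, int overlapTokens = 50)
    {
        var sentences = SentenceBoundary.Split(text.Trim())
            .Where(s => s.Length > 0)
            .ToList();

        var chunks = new List<string>();
        var current = new List<string>();
        var currentTokens = 0;

        foreach (var sentence in sentences)
        {
            var tokens = CountTokens(sentence);

            // Close the chunk when the next sentence would overflow it.
            // A single sentence longer than maxTokens becomes its own oversized chunk.
            if (current.Count > 0 && currentTokens + tokens > maxTokens)
            {
                chunks.Add(string.Join(" ", current));

                // Carry whole trailing sentences forward as the overlap.
                var carried = new List<string>();
                var carriedTokens = 0;
                for (var i = current.Count - 1; i >= 0; i--)
                {
                    var t = CountTokens(current[i]);
                    if (carriedTokens + t > overlapTokens) break;
                    carried.Insert(0, current[i]);
                    carriedTokens += t;
                }

                current = carried;
                currentTokens = carriedTokens;
            }

            current.Add(sentence);
            currentTokens += tokens;
        }

        if (current.Count > 0) chunks.Add(string.Join(" ", current));
        return chunks;
    }
}
```

Because the overlap is made of whole sentences, no chunk starts or ends mid-thought. The trade-off is that the actual overlap can be smaller than overlapTokens when the trailing sentence is long.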

Query embedding

This stage is trivial in code but critical in practice: embed the user's query with the same model you used for ingestion. Mixing embedding models (ingest with one, query with another) silently destroys recall — the vectors live in different semantic spaces and similarity scores become meaningless.

var queryEmbedding = await embedder.GenerateAsync(new[] { userQuery });
var queryVector = queryEmbedding[0].Vector.ToArray();

When you upgrade your embedding model, you must re-ingest every document. Store the embedding model name as metadata on each vector record so you can detect drift and trigger a full re-index. This is the only "model-driven" data migration in RAG.

Retrieval: more than nearest-neighbor

The naive approach is "top 5 vectors by cosine similarity, send them to the model." This works for small corpora and quickly hits its limits as the corpus grows. Production retrieval typically combines three techniques:

Vector search (semantic)

The core. Returns chunks whose vectors are closest to the query vector. Excellent at finding paraphrased and conceptually related content. Weak when the query contains rare proper nouns, IDs, or technical jargon the model wasn't trained on.
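
To make "closest" concrete, here is a brute-force sketch of cosine-similarity ranking over an in-memory list. VectorMath and its method names are hypothetical. A real vector database replaces the linear scan with an approximate index such as HNSW, so treat this as an illustration of the scoring, not of how the store searches.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; not a library API.
public static class VectorMath
{
    // Cosine similarity: dot product divided by the product of the vector lengths.
    public static double CosineSimilarity(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same dimension.");

        double dot = 0, normA = 0, normB = 0;
        for (var i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }

        // A zero vector has no direction, so report no similarity.
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }

    // Linear scan: score every stored vector, keep the k best.
    public static List<(string Id, double Score)> TopK(
        float[] query, IEnumerable<(string Id, float[] Vector)> corpus, int k)
    {
        return corpus
            .Select(item => (item.Id, Score: CosineSimilarity(query, item.Vector)))
            .OrderByDescending(hit => hit.Score)
            .Take(k)
            .ToList();
    }
}
```

The dimension check is also where a mixed-model mistake surfaces early: vectors from two embedding models often differ in length, and when they don't, the scores are still meaningless.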

Keyword search (lexical)

BM25 or full-text search via SQL Server's CONTAINS / FREETEXT. Excellent at exact matches — product SKUs, error codes, customer names. Run alongside vector search and merge results.

Hybrid + reranking

Get top-20 candidates from each method, deduplicate, then send all ~30 candidates through a cross-encoder reranker that scores each one against the query directly. Pick the top 5 for the prompt. This adds latency (a few hundred ms) but raises precision noticeably.

public async Task<List<Chunk>> RetrieveAsync(string query, int topK = 5)
{
    // Vector search
    var queryVec = await GetQueryVectorAsync(query);
    var vectorHits = await _store.SearchByVectorAsync(queryVec, limit: 20);

    // Keyword search (SQL Server full-text). FREETEXT accepts raw natural-language
    // input; CONTAINS expects its own query syntax and rejects unquoted multi-word text.
    // Neither predicate ranks rows, so Take(20) returns an arbitrary 20 matches.
    var keywordHits = await _db.Chunks
        .Where(c => EF.Functions.FreeText(c.Text, query))
        .Take(20)
        .ToListAsync();

    // Merge + dedupe by chunk ID
    var candidates = vectorHits
        .UnionBy(keywordHits, c => c.Id)
        .ToList();

    // Rerank (cross-encoder via ONNX or hosted API)
    var reranked = await _reranker.RerankAsync(query, candidates);

    return reranked.Take(topK).ToList();
}

Augmentation: building a grounded prompt

The point of RAG is that the model answers from your documents, not from its training data. The system prompt has to enforce this. Two phrases consistently work: "Only answer using the context below" and "If the context does not contain the answer, say so. Do not invent."

A solid RAG prompt template:

var systemPrompt = $@"You are a support assistant for {companyName}.
Only answer using the information in the <context> blocks below.
If the context does not contain the answer, say 'I don't have that information.'
Do not use information from outside the context.
Cite the source ID after each claim in square brackets, like [doc:42].

<context>
{string.Join("\n\n", chunks.Select(c => $"[doc:{c.SourceId}]\n{c.Text}"))}
</context>";

var messages = new List<ChatMessage>
{
    new(ChatRole.System, systemPrompt),
    new(ChatRole.User, userQuery)
};

// Runs inside an async iterator method that returns IAsyncEnumerable<string>.
await foreach (var update in chatClient.GetStreamingResponseAsync(messages))
{
    yield return update.Text ?? "";
}

Citations

By prompting the model to emit [doc:42] after each claim, you can parse those references and turn them into hyperlinks in your UI. Users see what evidence backs each statement — and you have an audit trail when the model gets something wrong.
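
Parsing those markers takes only a regex. The sketch below pulls the cited IDs out of an answer and, as an extra step not described above, checks them against the IDs you actually retrieved, since a model can cite a document it was never shown. CitationParser is a hypothetical helper, and the pattern assumes source IDs contain no whitespace or closing brackets.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not a library API.
public static class CitationParser
{
    // Matches markers such as [doc:42] and captures the ID after the colon.
    private static readonly Regex CitationPattern =
        new(@"\[doc:([^\]\s]+)\]", RegexOptions.Compiled);

    // Distinct source IDs cited anywhere in the answer.
    public static List<string> ExtractSourceIds(string answer) =>
        CitationPattern.Matches(answer)
            .Select(m => m.Groups[1].Value)
            .Distinct()
            .ToList();

    // Splits cited IDs into ones that were in the retrieved context
    // and ones that were not (likely fabricated citations).
    public static (List<string> Valid, List<string> Unknown) Validate(
        string answer, IEnumerable<string> retrievedIds)
    {
        var known = new HashSet<string>(retrievedIds);
        var cited = ExtractSourceIds(answer);
        return (
            cited.Where(id => known.Contains(id)).ToList(),
            cited.Where(id => !known.Contains(id)).ToList());
    }
}
```

Render the valid IDs as links, and log the unknown ones. A rising count of unknown citations is a cheap signal that the prompt or the retrieval step has regressed.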

Generation with streaming

Same pattern as a regular chatbot — stream the response token-by-token. The difference is your context is now grounded in your documents, so hallucinations are dramatically reduced and citations are possible.

One additional production pattern: if the retrieval step returned zero relevant chunks, short-circuit and return "I don't have information about that" before calling the model. Otherwise the model may try to be helpful and fabricate.
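
A minimal sketch of that guard follows. The names are hypothetical, and the minimum-score threshold is an assumption about how you define "relevant"; tune it against your own reranker or similarity scores. The generateAsync delegate stands in for whatever calls your chat model.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical types for illustration; not a library API.
public record ScoredChunk(string SourceId, string Text, double Score);

public static class GroundedAnswer
{
    public const string NoAnswer = "I don't have information about that.";

    public static async Task<string> AnswerAsync(
        IReadOnlyList<ScoredChunk> retrieved,
        double minScore,
        Func<IReadOnlyList<ScoredChunk>, Task<string>> generateAsync)
    {
        // Keep only chunks that clear the relevance bar (assumed threshold).
        var relevant = retrieved.Where(c => c.Score >= minScore).ToList();

        // Nothing to ground on: answer without calling the model at all.
        if (relevant.Count == 0)
            return NoAnswer;

        return await generateAsync(relevant);
    }
}
```

Short-circuiting also saves the cost of a model call on questions your corpus cannot answer.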

Vector database comparison

Your .NET 10 RAG application runs on Adaptive Web Hosting with SQL Server 2022 for chunk metadata and conversation history. The vector database itself typically runs externally — pgvector on a managed Postgres, Qdrant Cloud, or Azure AI Search. Adaptive hosts the orchestration layer; the heavy retrieval workload lives in a purpose-built vector store.

Production patterns

Cold-start avoidance

Pre-load the embedding generator and vector client at app startup, not on first request. A first-request embedding generation followed by a vector search adds 2–3 seconds of latency the first time after an app pool recycle. Warm them in a BackgroundService as soon as the app boots.

Eval, not vibes

"It seems better" is not a measurement. Maintain a golden set of 50–100 question/answer pairs and run them through your pipeline whenever you change the chunking, the embedding model, the retrieval logic, or the prompt. Track recall@5 and answer-correctness over time. Without this you'll spend weeks tweaking and never know what actually helped.

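Here is one way the retrieval half of that measurement could look. GoldenCase and RetrievalEval are hypothetical names, and the sketch assumes each golden question has exactly one expected source document, which makes recall@5 the fraction of questions whose expected source appears in the top 5. The retriever is passed in as a delegate so the same harness can score any pipeline variant.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical types for illustration; not a library API.
public record GoldenCase(string Question, string ExpectedSourceId);

public static class RetrievalEval
{
    // Fraction of golden questions whose expected source shows up
    // in the first k source IDs returned by the retriever.
    public static double RecallAtK(
        IReadOnlyList<GoldenCase> goldenSet,
        Func<string, IReadOnlyList<string>> retrieveSourceIds,
        int k = 5)
    {
        if (goldenSet.Count == 0) return 0;

        var hits = goldenSet.Count(c =>
            retrieveSourceIds(c.Question).Take(k).Contains(c.ExpectedSourceId));

        return (double)hits / goldenSet.Count;
    }
}
```

Run it before and after every change and store the number with the commit. Answer-correctness needs a separate grader (human or model) and is not covered by this sketch.
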
Permission-aware retrieval

If different users see different documents, filter at the vector-search level — not after. Tag each chunk with the document's permission set and pass the user's identity into the search query. Filtering after retrieval throws away top-K results and leaves you with bad context, or worse, leaks data.
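
The difference is easy to see on a toy example. This sketch uses an in-memory list with precomputed similarity scores; SecuredChunk and PermissionAwareSearch are hypothetical names. In a real vector database you would express the same condition as a metadata filter inside the search query rather than in LINQ.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical types for illustration; not a library API.
public record SecuredChunk(string Id, double Score, string[] AllowedGroups);

public static class PermissionAwareSearch
{
    // Filter first, then rank: every one of the k slots goes to a chunk the user may read.
    public static List<SecuredChunk> PreFilter(
        IEnumerable<SecuredChunk> scored, ISet<string> userGroups, int k) =>
        scored
            .Where(c => c.AllowedGroups.Any(g => userGroups.Contains(g)))
            .OrderByDescending(c => c.Score)
            .Take(k)
            .ToList();

    // Rank first, then filter: restricted chunks use up slots and are dropped afterwards.
    public static List<SecuredChunk> PostFilter(
        IEnumerable<SecuredChunk> scored, ISet<string> userGroups, int k) =>
        scored
            .OrderByDescending(c => c.Score)
            .Take(k)
            .Where(c => c.AllowedGroups.Any(g => userGroups.Contains(g)))
            .ToList();
}
```

When the highest-scoring chunks are all restricted, PostFilter can return nothing even though readable, relevant chunks exist further down the ranking. PreFilter never has that problem.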

Caching

Cache the (query, answer) pair for common queries, matching on the query's embedding so that rephrasings of the same question hit the same entry. For internal knowledge bases where the same questions repeat, this cuts cost dramatically and improves latency. Invalidate the cache when source documents change.
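
As a starting point, here is a simpler exact-match variant: it keys on the normalized query text instead of an embedding, and it records which source documents each answer drew on so that a document change can evict only the affected entries. AnswerCache is a hypothetical class, and the in-memory dictionary is an assumption; a shared cache would be needed across multiple app instances.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical class for illustration; not a library API.
public sealed class AnswerCache
{
    private sealed record Entry(string Answer, HashSet<string> SourceIds);

    private readonly ConcurrentDictionary<string, Entry> _entries = new();

    // Lowercase and collapse whitespace so trivially different phrasings share a key.
    private static string Normalize(string query) =>
        Regex.Replace(query.Trim().ToLowerInvariant(), @"\s+", " ");

    public bool TryGet(string query, out string answer)
    {
        if (_entries.TryGetValue(Normalize(query), out var entry))
        {
            answer = entry.Answer;
            return true;
        }

        answer = "";
        return false;
    }

    // Remember the answer together with the documents it was grounded in.
    public void Set(string query, string answer, IEnumerable<string> sourceIds) =>
        _entries[Normalize(query)] = new Entry(answer, new HashSet<string>(sourceIds));

    // Drop every cached answer that was built from the changed document.
    // Returns the number of entries removed.
    public int InvalidateSource(string sourceId)
    {
        var removed = 0;
        foreach (var pair in _entries)
        {
            if (pair.Value.SourceIds.Contains(sourceId) && _entries.TryRemove(pair.Key, out _))
                removed++;
        }
        return removed;
    }
}
```

Call InvalidateSource from the same code path that re-ingests a document, so a stale answer never outlives the chunks it cited.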

Hosting recommendations

ASP.NET Business — $17.49/mo

Customer-facing RAG over a public knowledge base (≤10,000 documents). 2 GB RAM per app pool. Most-common production tier.

ASP.NET Professional — $27.49/mo

Multi-tenant RAG-as-a-service, enterprise document portals, white-label deployments. 4 GB per pool, highest priority scheduling.

FAQs

How big a corpus can I RAG over?

The vector database does the heavy lifting and scales horizontally. The bottleneck is usually quality, not size. A 10,000-document corpus retrieves about as fast as a 100-document one with proper HNSW indexing. Beyond ~10M vectors, look at Milvus or a managed service.

Do I need an embedding model from the same vendor as my chat model?

No. Mix and match freely. OpenAI embeddings + Anthropic for generation is a common stack. The chunks are just text passed in the prompt — the chat model doesn't need to know what generated the vectors.

What about RAG over images or audio?

Both work. CLIP-style models embed images directly, and Whisper transcribes audio into text that you embed like any other document. The retrieval mechanism is identical: embed the query, find the nearest vectors. The chunk content becomes a description or transcript instead of raw text.

How do I keep the index in sync with my docs?

For static docs, re-index nightly. For dynamic content (wiki edits, CMS updates), webhook into your ingestion service. Soft-delete old vectors keyed by source_id + version, then insert new chunks. Hard-delete after a grace period in case rollback is needed.

Can I run RAG fully offline?

Yes — ONNX embeddings + a self-hosted vector DB + a self-hosted GPT-4-class model via Ollama. The full stack runs on a Windows + GPU box. Adaptive Web Hosting will host the orchestration application; the inference would need a separate GPU server.

Is RAG enough, or do I need to fine-tune?

RAG is enough for ~95% of "give the model my knowledge" use cases. Fine-tuning is for changing the model's style or format, not for teaching it facts. If your needs are factual — return correct answers from a corpus — RAG is the right tool.

Build it

RAG is no longer experimental. The .NET 10 stack has every piece you need: Microsoft.Extensions.AI for the abstractions, Semantic Kernel for orchestration, ONNX or hosted APIs for embeddings, any vector database for retrieval, and Blazor for the UI. Adaptive Web Hosting's ASP.NET hosting plans run all of this on real Windows + IIS with SQL Server 2022 included, free SSL, and dedicated app pools tuned for production .NET workloads.