Self-Hosted LLM Inference on Windows: ONNX, llama.cpp, Ollama

For every team running .NET 10 against a hosted model API, there's another team asking three questions: What happens to our data once it leaves our network? What does our monthly bill look like at scale? Can we run this offline? The answer to all three is the same — self-hosted LLM inference. Models small enough to run on commodity hardware are now genuinely useful, and the .NET ecosystem has first-class support: ONNX Runtime for embedding and small classification models, LLamaSharp for llama.cpp-based inference, and Ollama as a managed local-model runtime that plugs straight into Microsoft.Extensions.AI's IChatClient.

This guide covers when self-hosting actually makes sense, the four .NET-native paths to running models on your own hardware, and the realistic architecture for combining self-hosted inference with the rest of your stack on Windows + IIS.

GGUF: quantized model format

.NET 10: LTS runtime

$0: per-call cost

When self-hosting makes sense (and when it doesn't)

Self-hosted models in 2026 are catching up to frontier APIs, but they aren't there yet. A locally-run 8B-parameter quantized model is excellent at classification, embeddings, structured extraction, and routine writing. It is not yet competitive with GPT-4-class models on complex reasoning. Pick self-hosting for the right tasks, not all tasks.

✅ Good candidates for self-hosting

Embedding generation (millions per month)

Text classification and routing

Sentiment analysis

Structured extraction from typed schemas

Document summarization (short docs)

Internal tooling where privacy matters

Air-gapped deployments (regulated, government)

🟡 Keep on hosted APIs

Complex multi-step reasoning

Long-context summarization (100K+ tokens)

Code generation at developer-tool quality

Customer-facing chat where quality is critical

Function-calling with many concurrent tools

Workloads with intermittent traffic (idle GPU = wasted money)

The four .NET-native self-hosting paths

ONNX Runtime

Sentence-transformer models in ONNX format run inside your .NET process. No GPU required, fast, completely offline. Best for embeddings + small classification.

✅ Generation: LLamaSharp

.NET bindings for llama.cpp. Runs GGUF-quantized models (Llama, Mistral, Phi, Qwen) on CPU or CUDA GPU. Runs in-process, no separate server.

✅ Managed: Ollama

Runs as a background service on Windows. Single command pulls + serves a model. OllamaChatClient in Microsoft.Extensions.AI plugs in directly.

🟡 Heavyweight: vLLM / Triton

Python-based servers with OpenAI-compatible APIs. Best raw throughput. Run on a separate Linux GPU box; consume from .NET as a regular HTTP client.

🟡 Required reading: Quantization (GGUF)

Q4_K_M, Q5_K_M, and Q8_0 trade quality for size. Q4_K_M is the common production default: roughly 30% of the memory of FP16 weights, with minimal quality loss.

🟡 Architectural choice: Hybrid (hosted + self)

Hot path on hosted APIs for quality; high-volume / batch / sensitive paths on self-hosted. The right answer for most production teams.

Quick reference: pick a path by workload

Embedding models with ONNX Runtime

Embeddings are the highest-volume, lowest-complexity AI calls in most apps. A semantic search over 10,000 products embeds all 10,000 items once, then every incoming query. At hosted API prices this adds up as catalogs and traffic grow. Running the embedding model on your own CPU removes the per-call charge, and the quality gap to frontier embeddings is small for most search and clustering tasks.

Models like all-MiniLM-L6-v2 (384-dim) and bge-small-en-v1.5 (384-dim) are small, fast on CPU, and have ONNX builds available on Hugging Face. The .NET integration:

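    // NuGet: Microsoft.ML.OnnxRuntime, Microsoft.ML.Tokenizers, Microsoft.Extensions.AI.Abstractions
    using Microsoft.Extensions.AI;
    using Microsoft.ML.OnnxRuntime;
    using Microsoft.ML.OnnxRuntime.Tensors;
    using Microsoft.ML.Tokenizers;

    public sealed class LocalEmbedder : IEmbeddingGenerator<string, Embedding<float>>
    {
        private readonly InferenceSession _session;
        private readonly BertTokenizer _tokenizer;

        public LocalEmbedder(string modelPath, string vocabPath)
        {
            _session = new InferenceSession(modelPath);   // load once, reuse for every request
            _tokenizer = BertTokenizer.Create(vocabPath); // expects the model's WordPiece vocab.txt
        }

        public Task<GeneratedEmbeddings<Embedding<float>>> GenerateAsync(
            IEnumerable<string> values,
            EmbeddingGenerationOptions? options = null,
            CancellationToken cancellationToken = default)
        {
            var results = new List<Embedding<float>>();
            foreach (var text in values)
            {
                cancellationToken.ThrowIfCancellationRequested();
                var inputIds = _tokenizer.EncodeToIds(text).Select(id => (long)id).ToArray();
                var attentionMask = Enumerable.Repeat(1L, inputIds.Length).ToArray();
                // Some BERT exports also require a "token_type_ids" input (all zeros);
                // check _session.InputMetadata for the names your model expects.
                var inputs = new[]
                {
                    NamedOnnxValue.CreateFromTensor("input_ids",
                        new DenseTensor<long>(inputIds, new[] { 1, inputIds.Length })),
                    NamedOnnxValue.CreateFromTensor("attention_mask",
                        new DenseTensor<long>(attentionMask, new[] { 1, attentionMask.Length }))
                };
                using var output = _session.Run(inputs);
                var embedding = ExtractMeanPooledEmbedding(output.First().AsTensor<float>(), attentionMask);
                results.Add(new Embedding<float>(embedding));
            }
            // Inference here is synchronous CPU work, so return an already-completed task.
            return Task.FromResult(new GeneratedEmbeddings<Embedding<float>>(results));
        }

        // Averages the token vectors of a [1, tokens, hidden] output, skipping masked positions.
        // Mean pooling matches all-MiniLM; bge models are trained for CLS pooling (the first token).
        private static float[] ExtractMeanPooledEmbedding(Tensor<float> hidden, long[] attentionMask)
        {
            int tokens = hidden.Dimensions[1], dims = hidden.Dimensions[2];
            var pooled = new float[dims];
            var counted = 0;
            for (var t = 0; t < tokens; t++)
            {
                if (attentionMask[t] == 0) continue;
                counted++;
                for (var d = 0; d < dims; d++) pooled[d] += hidden[0, t, d];
            }
            for (var d = 0; d < dims; d++) pooled[d] /= Math.Max(counted, 1);
            return pooled;
        }

        public object? GetService(Type serviceType, object? serviceKey = null) =>
            serviceKey is null && serviceType.IsInstanceOfType(this) ? this : null;

        public void Dispose() => _session.Dispose();
    }

    // Wire it up: same IEmbeddingGenerator interface as hosted providers
    builder.Services.AddSingleton<IEmbeddingGenerator<string, Embedding<float>>>(_ =>
        new LocalEmbedder("./models/bge-small-en-v1.5.onnx", "./models/vocab.txt"));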

Throughput on a modern CPU: 50-200 embeddings per second per core depending on model size. For most apps this exceeds demand by an order of magnitude.
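Once embeddings are generated locally, the semantic-search step mentioned above is a similarity ranking. The following is a minimal sketch, assuming the vectors fit in memory and a brute-force scan is acceptable (reasonable for a catalog of about 10,000 items). The `VectorSearch` class and its method names are hypothetical, not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class VectorSearch
{
    // Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 for a zero-length vector.
    public static float CosineSimilarity(ReadOnlySpan<float> a, ReadOnlySpan<float> b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same length.");

        float dot = 0, normA = 0, normB = 0;
        for (var i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return normA == 0 || normB == 0 ? 0f : dot / (MathF.Sqrt(normA) * MathF.Sqrt(normB));
    }

    // Brute-force top-k: score every indexed vector against the query, keep the best k.
    public static IReadOnlyList<(string Id, float Score)> TopK(
        float[] query, IReadOnlyDictionary<string, float[]> index, int k)
        => index
            .Select(kv => (Id: kv.Key, Score: CosineSimilarity(query, kv.Value)))
            .OrderByDescending(x => x.Score)
            .Take(k)
            .ToList();
}
```

For larger catalogs a vector index (in a database or a dedicated store) would replace the linear scan; the scoring function stays the same.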

Where this runs on Adaptive Web Hosting

ONNX Runtime works fine on Windows + IIS within a regular ASP.NET Core app. The ASP.NET Developer plan has enough RAM (1 GB per dedicated app pool) for small embedding models; the Business plan at 2 GB gives comfortable headroom for keeping the model loaded plus your application logic.

LLamaSharp for local LLM inference

LLamaSharp wraps llama.cpp — the C++ reference implementation for running quantized Llama-family models. It loads GGUF files directly and runs them in-process. Pure .NET API; no Python, no separate server.

    // NuGet: LLamaSharp, LLamaSharp.Backend.Cpu (or LLamaSharp.Backend.Cuda12 for GPU)
    using LLama;
    using LLama.Common;

    var parameters = new ModelParams("./models/Phi-3-mini-4k-instruct-q4.gguf")
    {
        ContextSize = 4096,
        GpuLayerCount = 0 // 0 = CPU only; >0 = offload that many layers to GPU
    };

    // Load the weights once; this is the expensive step
    using var model = LLamaWeights.LoadFromFile(parameters);
    using var context = model.CreateContext(parameters);
    var executor = new InteractiveExecutor(context);
    var session = new ChatSession(executor);
    session.AddSystemMessage("You are a helpful assistant.");

    // Stream tokens to the console as they are generated
    await foreach (var token in session.ChatAsync(
        new ChatHistory.Message(AuthorRole.User, "What is .NET 10?"),
        new InferenceParams { MaxTokens = 256 }))
    {
        Console.Write(token);
    }

For models in the 3-7B parameter range with Q4 quantization, RAM requirements land around 4-6 GB. CPU inference works but is slow (~5-15 tokens/second). GPU inference via the CUDA backend lifts that to 50-200+ tokens/second on a modest consumer card.
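The RAM figures above follow from simple arithmetic: parameter count times bits per weight. Here is a rough sketch of that estimate. The `GgufMemoryEstimator` name is hypothetical, and the bits-per-weight constants for the K-quants are approximate averages (they vary a little by model architecture). The result covers weights only; the KV cache and runtime buffers come on top.

```csharp
using System;

public static class GgufMemoryEstimator
{
    // Approximate average bits per weight for common GGUF formats.
    public const double Q4_K_M = 4.85; // approximate
    public const double Q5_K_M = 5.7;  // approximate
    public const double Q8_0 = 8.5;    // 34 bytes per block of 32 weights
    public const double F16 = 16.0;

    // Estimated size of the model weights alone, in GiB.
    public static double WeightsGiB(double parameterCount, double bitsPerWeight)
        => parameterCount * bitsPerWeight / 8.0 / (1024.0 * 1024.0 * 1024.0);
}
```

Under these assumptions, `GgufMemoryEstimator.WeightsGiB(7e9, GgufMemoryEstimator.Q4_K_M)` comes out at about 4 GiB and a 3.8B model at a little over 2 GiB, which is consistent with the 4-6 GB total once context memory is added.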

Adaptive Web Hosting plans are shared Windows + IIS hosting. We don't provision GPUs, and CPU-only LLM inference on shared resources isn't practical at production scale. The pattern that works: your .NET 10 application (orchestration, business logic, Blazor UI) runs on Adaptive; the inference workload runs on a separate dedicated GPU server you provision elsewhere; the two communicate via HTTP.

Ollama: the easy mode

Ollama is to LLM inference what Docker was to application packaging: a single command pulls a model, starts a server, exposes an API. On Windows, install once, run as a background service, and your .NET app talks to it via Microsoft.Extensions.AI's OllamaChatClient.

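    # On the inference machine (Windows or Linux)
    ollama pull phi3:mini
    ollama serve   # exposes the API at http://localhost:11434
    # Ollama listens on 127.0.0.1 by default; set OLLAMA_HOST=0.0.0.0 before
    # starting it if your .NET app runs on a different machine.

    // In your .NET 10 app: same IChatClient interface as hosted providers
    // NuGet: Microsoft.Extensions.AI.Ollama (preview; Microsoft now points to OllamaSharp,
    // whose OllamaApiClient also implements IChatClient)
    builder.Services.AddSingleton<IChatClient>(_ =>
        new OllamaChatClient(
            endpoint: new Uri("http://your-inference-server:11434"),
            modelId: "phi3:mini"));

    // Use it identically to a hosted client
    var response = await chatClient.GetResponseAsync("What's new in .NET 10?");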

Ollama's killer feature is its catalog: hundreds of pre-built model + quantization combinations available with a single ollama pull. Phi-3, Llama 3, Mistral, Qwen, Gemma — all instantly available. Switching models is a configuration change, not a deployment.

The hybrid architecture

Pure self-hosted limits quality. Pure hosted-API limits privacy and cost control. The right answer is usually hybrid: route each request to the right backend based on what it needs. Embeddings → local. Classification → local. Customer-facing chat → hosted API. Document extraction → local. Complex reasoning → hosted API.

Microsoft.Extensions.AI's IChatClient abstraction makes this trivial. Inject a router:

    using Microsoft.Extensions.AI;
    using Microsoft.Extensions.DependencyInjection;

    public class ChatClientRouter : IChatClient
    {
        private readonly IChatClient _local;  // Ollama
        private readonly IChatClient _hosted; // OpenAI / Azure / Anthropic

        public ChatClientRouter(
            [FromKeyedServices("local")] IChatClient local,
            [FromKeyedServices("hosted")] IChatClient hosted)
        {
            _local = local;
            _hosted = hosted;
        }

        public async Task<ChatResponse> GetResponseAsync(
            IEnumerable<ChatMessage> messages,
            ChatOptions? options = null,
            CancellationToken cancellationToken = default)
        {
            // Route by a flag in ChatOptions.AdditionalProperties or by message length / complexity.
            // TryGetValue avoids the KeyNotFoundException the indexer throws when the flag is absent.
            var useLocal = options?.AdditionalProperties is { } props
                && props.TryGetValue("preferLocal", out var flag)
                && flag is true;
            var client = useLocal ? _local : _hosted;
            return await client.GetResponseAsync(messages, options, cancellationToken);
        }

        // ... GetStreamingResponseAsync, Dispose, GetService follow the same pattern
    }
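On the calling side, the router's "preferLocal" flag has to be set on the ChatOptions passed with each request. A small helper keeps the key name in one place. This is a sketch only: `RoutingOptions` and `PreferLocal` are hypothetical names, and the key string simply mirrors the one the router reads.

```csharp
using Microsoft.Extensions.AI;

public static class RoutingOptions
{
    // Must match the key the router looks up in AdditionalProperties.
    public const string PreferLocalKey = "preferLocal";

    // Returns the given options (or a new instance) with the local-routing flag set.
    public static ChatOptions PreferLocal(ChatOptions? options = null)
    {
        options ??= new ChatOptions();
        options.AdditionalProperties ??= new AdditionalPropertiesDictionary();
        options.AdditionalProperties[PreferLocalKey] = true;
        return options;
    }
}
```

A classification call would then look like `await router.GetResponseAsync(messages, RoutingOptions.PreferLocal())`, while calls that pass no flag go to the hosted client.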

Tag specific request types as "prefer local" in your application code. Classification, structured extraction, internal summarization → local. Customer-facing chat, complex agents → hosted. Embeddings go through IEmbeddingGenerator rather than IChatClient, so they are routed by which generator you register. Fallback logic (try local, fall back to hosted on timeout) is a few extra lines.
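The "try local, fall back to hosted on timeout" idea can be sketched as a small generic helper built on a linked cancellation token. The `Fallback` class and its method name are hypothetical; only the timeout case triggers the fallback here, and any other exception from the primary call propagates.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class Fallback
{
    // Runs `primary` with a time limit; if the limit (not the caller) cancels it, runs `fallback`.
    public static async Task<T> WithTimeoutFallbackAsync<T>(
        Func<CancellationToken, Task<T>> primary,
        Func<CancellationToken, Task<T>> fallback,
        TimeSpan primaryTimeout,
        CancellationToken cancellationToken = default)
    {
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
        cts.CancelAfter(primaryTimeout);
        try
        {
            return await primary(cts.Token);
        }
        catch (OperationCanceledException) when (!cancellationToken.IsCancellationRequested)
        {
            // The timeout fired rather than the caller cancelling, so try the second backend.
            return await fallback(cancellationToken);
        }
    }
}
```

Inside the router this might be called as `Fallback.WithTimeoutFallbackAsync(ct => _local.GetResponseAsync(messages, options, ct), ct => _hosted.GetResponseAsync(messages, options, ct), TimeSpan.FromSeconds(5), cancellationToken)`. Requests that are local for privacy reasons should not use a hosted fallback, since that would send the content to a third party.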

The cost math

Self-hosting wins economically above certain volume thresholds. A rough rule:

Embeddings. Self-hosting saves money above ~50K embedded items per month. Below that, hosted API is cheaper than the time investment.

Small-model classification. Above ~100K classifications per month, a dedicated CPU-bound .NET process running an ONNX classifier wins.

LLM generation. The break-even is workload-dependent. A GPU server running 24/7 at $300-800/mo only beats hosted-API costs above ~5-10M tokens/month of typical usage — and only if you can keep utilization high.

Privacy is a separate question that does not depend on volume. If "no third party can ever see this content" is a hard requirement, self-hosting is the only option.
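The break-even for generation is a one-line calculation: monthly server cost divided by what the hosted API charges per token. The sketch below makes the sensitivity visible. The `BreakEven` class is hypothetical, and the hosted prices in the usage note are made-up inputs; substitute your provider's actual blended rate.

```csharp
using System;

public static class BreakEven
{
    // Tokens per month at which a fixed-cost server matches a pay-per-token API.
    public static double TokensPerMonth(double serverCostPerMonth, double hostedPricePerMillionTokens)
    {
        if (hostedPricePerMillionTokens <= 0)
            throw new ArgumentOutOfRangeException(nameof(hostedPricePerMillionTokens));
        return serverCostPerMonth / hostedPricePerMillionTokens * 1_000_000;
    }
}
```

With a $500/mo server and a hypothetical hosted rate of $60 per million tokens, the break-even is about 8.3 million tokens per month; at a hypothetical $10 per million it moves to 50 million. The threshold therefore depends heavily on which hosted model you would otherwise call.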

Production patterns

Model warm-up

Loading a quantized LLM into memory takes 5-30 seconds depending on size. Do this once at startup, not per request. For ONNX models, the same applies — initialize the InferenceSession at app boot and reuse across requests.
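One way to warm up at boot is a hosted service that sends a tiny request during startup, so the first real user does not pay the load time. This is a sketch that assumes an IChatClient is registered in DI; `ModelWarmupService` is a hypothetical name.

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.Hosting;

public sealed class ModelWarmupService : IHostedService
{
    private readonly IChatClient _chatClient;

    public ModelWarmupService(IChatClient chatClient) => _chatClient = chatClient;

    // The host awaits StartAsync during startup, so the model is loaded before this returns.
    public async Task StartAsync(CancellationToken cancellationToken)
    {
        var options = new ChatOptions { MaxOutputTokens = 1 }; // keep the warm-up call cheap
        await _chatClient.GetResponseAsync("ping", options, cancellationToken);
    }

    public Task StopAsync(CancellationToken cancellationToken) => Task.CompletedTask;
}

// Registration (Program.cs):
// builder.Services.AddHostedService<ModelWarmupService>();
```

For an in-process ONNX model, registering the embedder as a singleton and resolving it once in a similar service has the same effect.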

Concurrent request handling

A llama.cpp context (and the LLamaSharp wrapper around it) is not safe to use from several threads at once. Pool contexts or use a semaphore to serialize requests. Ollama handles this internally, so your client code just calls and waits.
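The semaphore approach can be sketched with SemaphoreSlim, which supports async waiting. `InferenceGate` is a hypothetical helper; with `maxConcurrent` set to 1 it serializes access to a single context, and a larger value would suit a pool of contexts.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class InferenceGate : IDisposable
{
    private readonly SemaphoreSlim _slots;

    public InferenceGate(int maxConcurrent = 1)
        => _slots = new SemaphoreSlim(maxConcurrent, maxConcurrent);

    // Waits for a free slot, runs the work, and always releases the slot afterwards.
    public async Task<T> RunAsync<T>(
        Func<CancellationToken, Task<T>> work,
        CancellationToken cancellationToken = default)
    {
        await _slots.WaitAsync(cancellationToken);
        try
        {
            return await work(cancellationToken);
        }
        finally
        {
            _slots.Release();
        }
    }

    public void Dispose() => _slots.Dispose();
}
```

Register one gate as a singleton next to the model and wrap each inference call in `gate.RunAsync(...)`. Waiting callers queue up, so the queue depth is worth tracking alongside tokens/sec.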

Monitoring

Track tokens/sec, queue depth, and error rates. A self-hosted setup needs the same observability discipline as any other production service. Expose health checks via builder.Services.AddHealthChecks() plus a custom probe that runs a dummy inference.
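A custom probe of that kind might look like the sketch below. It assumes an IChatClient is registered in DI; the `InferenceHealthCheck` name, the 10-second limit, and the prompt are illustrative choices, not recommendations from the source.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.Diagnostics.HealthChecks;

public sealed class InferenceHealthCheck : IHealthCheck
{
    private readonly IChatClient _chatClient;

    public InferenceHealthCheck(IChatClient chatClient) => _chatClient = chatClient;

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        // Bound the probe so a hung model reports unhealthy instead of blocking the endpoint.
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
        cts.CancelAfter(TimeSpan.FromSeconds(10));
        try
        {
            var response = await _chatClient.GetResponseAsync(
                "Reply with OK.", new ChatOptions { MaxOutputTokens = 4 }, cts.Token);
            return string.IsNullOrWhiteSpace(response.Text)
                ? HealthCheckResult.Degraded("Inference returned an empty response.")
                : HealthCheckResult.Healthy("Inference responded.");
        }
        catch (Exception ex)
        {
            return HealthCheckResult.Unhealthy("Inference probe failed.", ex);
        }
    }
}

// Registration (Program.cs):
// builder.Services.AddHealthChecks().AddCheck<InferenceHealthCheck>("inference");
// app.MapHealthChecks("/healthz");
```

Because each probe runs a real inference, poll it at a modest interval so monitoring does not compete with user traffic.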

Updates and quantization sweeps

New models drop monthly. Test new candidates against your golden set before swapping in production. Quantization levels matter — Q4_K_M is the standard, but Q5 or Q8 may be worth the memory cost on quality-sensitive tasks.

Hosting recommendations

ASP.NET Business — $17.49/mo

Higher-volume embedding workloads, larger ONNX models in-process, production .NET apps that route to a GPU inference server. 2 GB RAM per pool.


ASP.NET Professional — $27.49/mo

Multi-tenant AI platforms with mixed hosted + self-hosted routing. 4 GB RAM, highest priority scheduling. Best for orchestration-heavy stacks.


FAQs

Can I run an LLM directly on an Adaptive Web Hosting plan?

For embedding models and small classifiers via ONNX Runtime — yes, comfortably. For full LLM inference (Llama 3 8B, Phi-3 medium, Mistral 7B), CPU-only on shared hosting isn't a fit. Run inference on a separate GPU-equipped machine and have your AWH-hosted .NET app consume it via HTTP.

What's the smallest model that's actually useful?

Phi-3 mini (3.8B) and Llama 3.2 (3B / 1B) are surprisingly capable for classification, summarization, and structured extraction. Below that, quality drops sharply. For pure embeddings, models in the 30-80M parameter range are excellent.

Should I use LLamaSharp or Ollama?

Ollama for everyday use — easier setup, automatic model management, OpenAI-compatible API. LLamaSharp when you need fine control: custom sampling, deeper integration with your app's lifecycle, or no-network deployment scenarios.

What about quantization quality loss?

Q4_K_M (the standard) loses around 1-2 percentage points on most benchmarks vs full-precision. Q8 is essentially lossless but uses roughly twice the memory of Q4_K_M. For most production use cases, Q4_K_M is the right tradeoff. Test on your own evaluation set before committing.

Can I fine-tune self-hosted models?

Yes, but the workflow lives in Python — Unsloth, LoRA tuning, etc. Export the fine-tuned model to GGUF and run inference from .NET as usual. Training stays Python; inference becomes .NET.

How do I keep models updated?

Watch new releases on Hugging Face. Pull the latest GGUF (or pull via Ollama) into a staging environment. Run your golden-set evaluations. If quality matches or improves, promote to production. Most teams update quarterly.

Ship it

Self-hosted LLM inference is the right answer for embeddings, classification, structured extraction, and any workload where privacy or per-call cost matters. The .NET 10 ecosystem has every piece: ONNX Runtime for small models in-process, LLamaSharp for GGUF inference, Ollama for managed local serving, and Microsoft.Extensions.AI to unify them with hosted providers under IChatClient.

Adaptive Web Hosting's ASP.NET hosting plans are the orchestration layer in this architecture — your .NET 10 app, Blazor UI, SQL Server 2022, and ONNX-based embedding inference all run there on real Windows + IIS. The LLM inference workload sits on a separate GPU server you provision; the abstraction makes the boundary invisible to the rest of your code.