Vision-Language AI in Blazor: Image Understanding with .NET 10

For most of the past decade, "computer vision" meant training a model to recognize a fixed set of labels: cat, dog, defective part. The 2026 shift is dramatic: vision-language models (VLMs) can answer open-ended questions about an image. Show a chart screenshot and ask "which quarter underperformed and why?" and the model reads the axes, finds the dip, and explains it. Show a receipt photo and ask for the line items, and they come back as structured data. Show a UI screenshot and ask "what's wrong with this design?" and you get usability feedback. This is general visual reasoning, not classification, and it's now available to .NET 10 developers through Microsoft.Extensions.AI's multi-modal support.

This guide covers five vision-language patterns we see in production .NET applications: chart understanding, document layout extraction, visual moderation, accessibility alt-text generation, and visual product search. The Blazor UI patterns for image upload, preview, and result display are all included.

VLM: vision-language models

.NET 10: LTS runtime

JSON: schema-constrained output

The vision-language stack in .NET 10

Microsoft.Extensions.AI

A multi-modal ChatMessage carries a list of AIContent items, with TextContent and image DataContent side by side. Same IChatClient interface, just richer message payloads.

✅ Image processing

ImageSharp / SkiaSharp

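Resize, recompress, and strip metadata server-side before sending to the model. Both libraries are mature and fast. ImageSharp is fully managed .NET (check the Six Labors Split License terms for commercial use); SkiaSharp is MIT-licensed and wraps the native Skia library.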

✅ UI layer

Blazor file upload + preview

InputFile + image preview via ObjectURL. Drag-and-drop with paste-from-clipboard for screenshots — the killer UX pattern.

🟡 Cost-aware

Resolution control

Most providers charge per image tile (512×512 or 768×768). Resize to the smallest dimension that still preserves the detail you need — typically 1024px on the long side.

🟡 Privacy

EXIF + metadata strip

Phone photos carry GPS coordinates, camera serial, sometimes the user's name. Strip metadata server-side before sending to a hosted model.

🟡 Optional

Self-hosted vision

Small VLMs (LLaVA, Moondream, Phi-3-vision) run via Ollama for privacy-sensitive use. Quality is lower than frontier APIs but adequate for many tasks.

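Quick reference: five vision patterns

Chart and diagram Q&A: chart screenshot in, typed ChartAnalysis record out.

Layout-aware extraction: form image in, typed record (for example W9Form) out.

Visual content moderation: image plus policy text in, ModerationResult out.

Accessibility alt-text: image in, short description string out.

Visual product search: image in, nearest catalog products out.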

The basic multi-modal call

A multi-modal ChatMessage is just a text message with one or more DataContent items attached. Microsoft.Extensions.AI handles serialization to whatever each provider expects (base64-encoded images for OpenAI, URLs for Azure Vision, etc.). Your code doesn't change between providers.

using Microsoft.Extensions.AI;

public async Task<string> AnalyzeImageAsync(byte[] imageBytes, string question)
{
    // The media type must match the actual bytes (PNG here, "image/jpeg" for JPEG).
    var imageContent = new DataContent(imageBytes, mediaType: "image/png");
    var textContent = new TextContent(question);
    var message = new ChatMessage(ChatRole.User, [textContent, imageContent]);

    var response = await _chatClient.GetResponseAsync(
        new[] { message },
        new ChatOptions { Temperature = 0.2f }); // Temperature is float?, so use a float literal

    // ChatResponse.Text concatenates the text of the returned messages.
    return response.Text;
}

// Usage:
var imageBytes = await File.ReadAllBytesAsync("chart-screenshot.png");
var answer = await AnalyzeImageAsync(imageBytes,
    "Which quarter shows the largest revenue drop? Cite specific numbers from the chart.");

The image-prep step you cannot skip

Before sending an image to a vision model, run three transformations server-side: resize, recompress, and strip EXIF. ImageSharp does all three:

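using SixLabors.ImageSharp;
using SixLabors.ImageSharp.Formats.Jpeg;
using SixLabors.ImageSharp.Processing;

public byte[] PrepareForModel(Stream input, int maxDimension = 1024, int quality = 85)
{
    using var image = Image.Load(input);

    // 0. Apply the EXIF orientation to the pixels first; once the metadata
    //    is stripped, a rotated phone photo would otherwise stay sideways.
    image.Mutate(x => x.AutoOrient());

    // 1. Resize to keep cost predictable (ResizeMode.Max preserves aspect ratio)
    if (image.Width > maxDimension || image.Height > maxDimension)
    {
        image.Mutate(x => x.Resize(new ResizeOptions
        {
            Size = new Size(maxDimension, maxDimension),
            Mode = ResizeMode.Max
        }));
    }

    // 2. Strip metadata (GPS, camera serial, names). EXIF is not the only
    //    carrier: XMP and IPTC blocks can hold the same details.
    image.Metadata.ExifProfile = null;
    image.Metadata.XmpProfile = null;
    image.Metadata.IptcProfile = null;
    image.Metadata.IccProfile = null; // also drops the color profile; non-sRGB colors may shift

    // 3. Recompress to a sensible quality. The output is JPEG, so send it as "image/jpeg".
    using var ms = new MemoryStream();
    image.SaveAsJpeg(ms, new JpegEncoder { Quality = quality });
    return ms.ToArray();
}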
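To make the resize step concrete, here is a minimal sketch of the arithmetic behind an aspect-preserving fit. `ResizeMath` and `FitWithin` are hypothetical names, not ImageSharp APIs, and ImageSharp's own rounding may differ by a pixel.

```csharp
using System;

public static class ResizeMath
{
    // Returns the dimensions that fit inside a maxDimension x maxDimension box
    // while keeping the aspect ratio. Images already inside the box are unchanged.
    public static (int Width, int Height) FitWithin(int width, int height, int maxDimension)
    {
        if (width <= 0 || height <= 0 || maxDimension <= 0)
            throw new ArgumentException("All dimensions must be positive.");

        int longSide = Math.Max(width, height);
        if (longSide <= maxDimension)
            return (width, height);

        // Scale both sides by the same factor so the long side lands on maxDimension.
        double scale = (double)maxDimension / longSide;
        int newWidth = Math.Max(1, (int)Math.Round(width * scale));
        int newHeight = Math.Max(1, (int)Math.Round(height * scale));
        return (newWidth, newHeight);
    }
}
```

Under these assumptions a 4032×3024 phone photo comes out at 1024×768, and an 800×600 image is left alone.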

Chart and diagram Q&A

Modern VLMs read embedded text in charts, recognize axes and units, and infer trends from visual patterns. The pattern: paste a screenshot, ask a specific question, get a grounded answer.

public record ChartAnalysis(
    string Summary,
    Dictionary<string, decimal> KeyDataPoints,
    string Trend, // "increasing" | "decreasing" | "stable" | "volatile"
    List<string> Anomalies);

public async Task<ChartAnalysis> AnalyzeChartAsync(byte[] image, string context)
{
    var prompt = $@"Analyze this chart. {context}
Return JSON: {{
  'summary': '1-2 sentence overview',
  'keyDataPoints': {{ 'label': value, ... }},
  'trend': 'increasing|decreasing|stable|volatile',
  'anomalies': ['notable outlier 1', 'outlier 2']
}}";

    var response = await _chatClient.GetResponseAsync<ChartAnalysis>(
        new[] { new ChatMessage(ChatRole.User,
            [new TextContent(prompt), new DataContent(image, "image/png")]) });

    return response.Result;
}
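The comment on Trend lists four allowed values, but a string field will accept anything the model writes. One hedged option is to model the field as an enum so an unexpected value fails at parse time. The sketch below uses only System.Text.Json; `ChartTrend`, `ChartAnalysisStrict`, and `ChartAnalysisParser` are illustrative names, not part of Microsoft.Extensions.AI.

```csharp
using System.Collections.Generic;
using System.Text.Json;
using System.Text.Json.Serialization;

public enum ChartTrend { Increasing, Decreasing, Stable, Volatile }

// Same shape as ChartAnalysis, with the trend constrained to the enum.
public record ChartAnalysisStrict(
    string Summary,
    Dictionary<string, decimal> KeyDataPoints,
    ChartTrend Trend,
    List<string> Anomalies);

public static class ChartAnalysisParser
{
    // Web defaults give camelCase, case-insensitive property matching.
    // JsonStringEnumConverter reads enum names from JSON strings.
    private static readonly JsonSerializerOptions Options = new(JsonSerializerDefaults.Web)
    {
        Converters = { new JsonStringEnumConverter() }
    };

    // Throws JsonException when the JSON is malformed or the trend is not a known value.
    public static ChartAnalysisStrict Parse(string json) =>
        JsonSerializer.Deserialize<ChartAnalysisStrict>(json, Options)
        ?? throw new JsonException("Model returned JSON null.");
}
```

A trend of "volatile" parses to ChartTrend.Volatile, while something like "sideways" raises a JsonException that you can catch and retry or route to review.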

Where this beats hardcoded parsers

Customer-uploaded charts come from every possible source — Excel, Tableau, Looker, Power BI, hand-drawn whiteboards, third-party PDF reports. A hard-coded parser would need to recognize each format. The VLM reads them all uniformly because it processes the rendered pixels, not the underlying data model.

Layout-aware extraction

Pure OCR (see our document intelligence guide) reads text linearly. Forms have spatial structure — labels next to values, checkboxes related to columns, signatures in specific regions. VLMs preserve that spatial understanding because they see the full image, not a flat token stream.

Show the VLM a tax form, ask for the structured data, get a typed record back:

// String fields are nullable because the prompt asks for null on blank fields.
public record W9Form(
    string? LegalName,
    string? BusinessName,
    string? Address,
    string? City,
    string? State,
    string? ZipCode,
    string? TaxId,
    bool SignaturePresent,
    DateOnly? SignatureDate);

public async Task<W9Form> ExtractW9Async(byte[] image)
{
    var prompt = @"Extract fields from this W-9 form. Use null for blank fields.
Identify whether a signature is present (true/false) without reproducing it.";

    return (await _chatClient.GetResponseAsync<W9Form>(
        new[] { new ChatMessage(ChatRole.User,
            [new TextContent(prompt), new DataContent(image, "image/png")]) })).Result;
}

The model reads the visual layout: which line is the legal name, which checkbox is the tax classification, which region holds the signature. This level of structural understanding is hard to reach with bounding-box OCR plus downstream parsing, which typically needs hand-written rules per form template.
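Extracted values still deserve a deterministic sanity check before they reach a database. As a minimal sketch, the taxpayer ID on a W-9 is either an SSN (written 123-45-6789) or an EIN (written 12-3456789), so a format check catches many misreads. `TaxIdValidator` is a hypothetical helper, and it checks shape only, not whether the number is real.

```csharp
using System.Text.RegularExpressions;

public static class TaxIdValidator
{
    // SSN: three digits, two digits, four digits. EIN: two digits, seven digits.
    private static readonly Regex Ssn = new(@"^\d{3}-\d{2}-\d{4}$", RegexOptions.Compiled);
    private static readonly Regex Ein = new(@"^\d{2}-\d{7}$", RegexOptions.Compiled);

    // Returns true when the value has the shape of an SSN or an EIN.
    public static bool LooksLikeTaxId(string? value)
    {
        if (string.IsNullOrWhiteSpace(value)) return false;
        string trimmed = value.Trim();
        return Ssn.IsMatch(trimmed) || Ein.IsMatch(trimmed);
    }
}
```

A record whose TaxId fails this check is a good candidate for a retry or for the human spot-check queue.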

Visual content moderation

Public-facing apps need image moderation: user-submitted profile pictures, product photos, comment thread attachments. The VLM approach is more flexible than per-category classifiers because you can describe policy rules in prose and have them enforced:

public record ModerationResult(
    bool Approved,
    List<string> ViolatedPolicies,
    string Reasoning,
    double Confidence);

public async Task<ModerationResult> ModerateAsync(byte[] image, string platformPolicies)
{
    var prompt = $@"Review this image against our content policy.
POLICY:
{platformPolicies}
Determine if the image violates any rule. Be specific about which rule(s).
Return JSON: {{
  'approved': true|false,
  'violatedPolicies': ['rule name 1', 'rule name 2'],
  'reasoning': '1-2 sentence explanation',
  'confidence': 0.0-1.0
}}";

    return (await _chatClient.GetResponseAsync<ModerationResult>(
        new[] { new ChatMessage(ChatRole.User,
            [new TextContent(prompt), new DataContent(image, "image/jpeg")]) })).Result;
}

For high-volume scenarios (many uploads per minute), pre-filter with a cheap classifier (Azure Content Safety, AWS Rekognition, or a local ONNX nudity classifier), then send only borderline cases to the more expensive VLM call.
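The pre-filter idea can be sketched as a simple three-way router. Everything here is an assumption for illustration: the cheap classifier is taken to return a single score between 0.0 (clearly fine) and 1.0 (clearly violating), and the 0.10 and 0.90 thresholds are placeholders to tune against your own data.

```csharp
using System;

public enum ModerationRoute { AutoApprove, AutoReject, EscalateToVlm }

public static class ModerationRouter
{
    // Confident scores are decided immediately; only the uncertain middle
    // band pays for the more expensive VLM call.
    public static ModerationRoute Route(
        double unsafeScore, double approveBelow = 0.10, double rejectAbove = 0.90)
    {
        if (double.IsNaN(unsafeScore) || unsafeScore < 0.0 || unsafeScore > 1.0)
            throw new ArgumentOutOfRangeException(nameof(unsafeScore), "Score must be in [0, 1].");

        if (unsafeScore < approveBelow) return ModerationRoute.AutoApprove;
        if (unsafeScore > rejectAbove) return ModerationRoute.AutoReject;
        return ModerationRoute.EscalateToVlm;
    }
}
```

Widening the middle band sends more images to the VLM and raises cost; narrowing it trusts the cheap classifier more.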

Accessibility: auto-generated alt-text

Many product images on the web have no useful alt text: the attribute is either empty or filled with SKU codes that screen readers announce as letter strings. VLM-generated alt-text is descriptive and costs far less than having a content writer do it manually, though it still deserves a spot-check for accuracy. Alt-text coverage, one part of WCAG conformance, becomes an automated background job.

public async Task<string> GenerateAltTextAsync(byte[] image, string contextHint)
{
    var prompt = $@"Write concise alt text for this image (under 125 characters).
Describe what's depicted; don't editorialize.
Context (if useful): {contextHint}";

    var response = await _chatClient.GetResponseAsync(
        new[] { new ChatMessage(ChatRole.User,
            [new TextContent(prompt), new DataContent(image, "image/jpeg")]) },
        new ChatOptions { MaxOutputTokens = 60 });

    // ChatResponse.Text concatenates the text of the returned messages.
    return response.Text.Trim();
}

// Pattern: on every product image upload, queue this job
BackgroundJob.Enqueue<ImageAltJob>(j => j.GenerateAsync(productId, imageId));

Run as a background pass over your existing image library too — most catalogs have years of un-alt-texted images. One-time backfill + automatic generation for new uploads = zero manual effort going forward.
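The prompt asks for fewer than 125 characters, but a model can overshoot, so a background job may want to enforce the limit itself. The following is a minimal sketch; `AltText.Clamp` is a hypothetical helper that collapses whitespace and cuts at the last word boundary that fits.

```csharp
using System;

public static class AltText
{
    // Collapses runs of whitespace, then trims to maxLength at a word boundary.
    public static string Clamp(string text, int maxLength = 125)
    {
        if (maxLength <= 0)
            throw new ArgumentOutOfRangeException(nameof(maxLength));

        // Splitting on null separators means "split on any whitespace".
        string cleaned = string.Join(' ',
            text.Split((char[]?)null, StringSplitOptions.RemoveEmptyEntries));
        if (cleaned.Length <= maxLength) return cleaned;

        // Find the last space at or before the limit; hard-cut if there is none.
        int cut = cleaned.LastIndexOf(' ', maxLength);
        if (cut <= 0) cut = maxLength;
        return cleaned[..cut].TrimEnd('.', ',', ';', ':', ' ');
    }
}
```

Calling `AltText.Clamp(altText)` on the model's answer before saving keeps stored descriptions within the limit without cutting a word in half.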

Visual product search: "find products like this"

This combines vision with the semantic-search pattern from our semantic search guide. Multi-modal embedding models (CLIP variants) project images and text into a shared vector space, so you can:

1. Embed every product image at ingestion.

2. At query time, embed the user's uploaded image (or text query).

3. Find nearest products by vector similarity.

public async Task<List<Product>> FindSimilarProductsAsync(byte[] queryImage, int topK = 12)
{
    var queryVec = await _multimodalEmbedder.EmbedImageAsync(queryImage);
    return await _vectorStore.SearchAsync(queryVec, limit: topK);
}
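For intuition about what "nearest by vector similarity" means, here is a brute-force sketch using cosine similarity over an in-memory catalog. It is an illustration only: `VectorSearch` is a hypothetical class, and a real vector store would use an index rather than scanning every product.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class VectorSearch
{
    // Cosine similarity: 1.0 for the same direction, 0.0 for orthogonal vectors.
    public static double CosineSimilarity(ReadOnlySpan<float> a, ReadOnlySpan<float> b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same length.");

        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += (double)a[i] * b[i];
            normA += (double)a[i] * a[i];
            normB += (double)b[i] * b[i];
        }
        if (normA == 0 || normB == 0) return 0; // a zero vector has no direction
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }

    // Scores every catalog entry against the query and keeps the best topK.
    public static List<(string Id, double Score)> TopK(
        float[] query, IReadOnlyDictionary<string, float[]> catalog, int topK)
    {
        return catalog
            .Select(kv => (Id: kv.Key, Score: CosineSimilarity(query, kv.Value)))
            .OrderByDescending(x => x.Score)
            .Take(topK)
            .ToList();
    }
}
```

Because image and text embeddings share one space in a CLIP-style model, the same scoring works whether the query vector came from a photo or from a text query.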

The Blazor UI is a drag-and-drop or camera-capture component. Mobile users take a photo of something they want; the app surfaces similar items in your catalog. Fashion, home goods, books, antiques — all good fits.

The Blazor upload-and-preview component

Users take screenshots constantly. Adding a paste handler that accepts Ctrl+V images turns your app into the screenshot-Q&A tool everyone wishes existed. A short JS interop handler, a server upload, a vision call, and you're done.

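@page "/vision"
@inject IJSRuntime JS
@* VisionService is an assumed name for the class that exposes AnalyzeImageAsync. *@
@inject VisionService _vision

@* tabindex makes the div focusable so it can receive paste events. *@
<div @ref="_pasteTarget" tabindex="0">
    @if (_previewUrl is null)
    {
        <p>Paste a screenshot (Ctrl+V), drag an image, or click to upload.</p>
    }
    else
    {
        <img src="@_previewUrl" alt="Preview of the pasted image" />
        <input @bind="_question" placeholder="Ask a question about this image" />
        <button @onclick="AnalyzeAsync">Analyze</button>
        @if (!string.IsNullOrEmpty(_answer))
        {
            <div class="response">@_answer</div>
        }
    }
</div>

@code {
    ElementReference _pasteTarget;
    byte[]? _imageBytes;
    string? _previewUrl;
    string _question = "";
    string _answer = "";

    protected override async Task OnAfterRenderAsync(bool firstRender)
    {
        if (firstRender)
        {
            await JS.InvokeVoidAsync("registerPasteHandler",
                _pasteTarget, DotNetObjectReference.Create(this));
        }
    }

    // In Blazor Server, JS-to-.NET calls travel over SignalR, whose default
    // MaximumReceiveMessageSize is 32 KB. Raise the limit or stream the bytes
    // before relying on this for real screenshots.
    [JSInvokable]
    public Task OnImagePasted(string base64Image)
    {
        _imageBytes = Convert.FromBase64String(base64Image);
        _previewUrl = $"data:image/png;base64,{base64Image}";
        return InvokeAsync(StateHasChanged);
    }

    async Task AnalyzeAsync()
    {
        if (_imageBytes is null) return;
        _answer = await _vision.AnalyzeImageAsync(_imageBytes, _question);
    }
}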

The corresponding JavaScript listens for paste events and pushes the image bytes through the registered DotNetObjectReference:

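window.registerPasteHandler = (el, dotnet) => {
  el.focus();
  el.addEventListener('paste', async (e) => {
    // Collect the files synchronously: clipboardData is only reliably
    // readable while the paste event is being dispatched.
    const files = [...e.clipboardData.items]
      .filter((item) => item.kind === 'file' && item.type.startsWith('image/'))
      .map((item) => item.getAsFile());

    for (const blob of files) {
      const bytes = new Uint8Array(await blob.arrayBuffer());
      // Convert in chunks; spreading a whole screenshot into
      // String.fromCharCode(...) can overflow the call stack.
      let binary = '';
      for (let i = 0; i < bytes.length; i += 0x8000) {
        binary += String.fromCharCode(...bytes.subarray(i, i + 0x8000));
      }
      await dotnet.invokeMethodAsync('OnImagePasted', btoa(binary));
    }
  });
};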

Production patterns

Cost control

Image calls typically cost several times more than text-only calls at hosted providers (5-10x is a common range, depending on provider and resolution). Two levers: resize aggressively (a 1024px max dimension is plenty for most tasks), and batch where possible (some APIs accept multiple images per request, sharing the per-call overhead).
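To see why resizing matters under per-tile billing, the sketch below counts how many tiles an image covers. It assumes a plain grid of 512×512 tiles; real providers use their own formulas (some rescale first or add a base charge), so treat `ImageCost.TileCount` as a hypothetical estimator, not a billing calculator.

```csharp
using System;

public static class ImageCost
{
    // Number of tileSize x tileSize tiles needed to cover the image.
    public static int TileCount(int width, int height, int tileSize = 512)
    {
        if (width <= 0 || height <= 0 || tileSize <= 0)
            throw new ArgumentException("All dimensions must be positive.");

        // Integer ceiling division: a partially covered tile still counts.
        int cols = (width + tileSize - 1) / tileSize;
        int rows = (height + tileSize - 1) / tileSize;
        return cols * rows;
    }
}
```

Under this assumption a 4032×3024 phone photo covers 48 tiles, while the same photo resized to 1024×768 covers 4.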

Caching

Vision calls are not strictly deterministic, but for the same image and prompt at low temperature the answers are close enough to reuse. Hash (image_bytes + prompt) and cache the response. For products that re-process the same images repeatedly (catalog re-indexing, dashboard refresh), the cache hit rate can climb above 80% within a week.
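One way to build that cache key is a SHA-256 digest over the image bytes and the prompt. The sketch below is an assumption-laden example: `VisionCacheKey` is a hypothetical helper, and including a model identifier in the key is our addition so that switching models does not serve stale answers.

```csharp
using System;
using System.Buffers.Binary;
using System.Security.Cryptography;
using System.Text;

public static class VisionCacheKey
{
    // Returns a 64-character hex key for (image, prompt, model).
    public static string Compute(ReadOnlySpan<byte> imageBytes, string prompt, string modelId)
    {
        using var hash = IncrementalHash.CreateHash(HashAlgorithmName.SHA256);

        // Length prefixes keep the boundaries between fields unambiguous.
        AppendWithLength(hash, imageBytes);
        AppendWithLength(hash, Encoding.UTF8.GetBytes(prompt));
        AppendWithLength(hash, Encoding.UTF8.GetBytes(modelId));

        return Convert.ToHexString(hash.GetHashAndReset());
    }

    private static void AppendWithLength(IncrementalHash hash, ReadOnlySpan<byte> data)
    {
        Span<byte> length = stackalloc byte[4];
        BinaryPrimitives.WriteInt32LittleEndian(length, data.Length);
        hash.AppendData(length);
        hash.AppendData(data);
    }
}
```

Compute the key from the bytes you actually send (after resize and recompress), so two uploads that normalize to the same image share a cache entry.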

Privacy

For sensitive imagery (ID documents, medical, financial), use providers with no-training data agreements or self-host via Ollama. Strip EXIF before any transmission. Decide upfront whether to retain the image after extraction or purge immediately — many compliance regimes prefer the latter.

Failure modes

VLMs can hallucinate visual details that aren't actually in the image. Mitigations: ask for citations ("which region of the image shows this?"), use schema-constrained output to force the model into a structured response, and sample 5-10% of outputs for human spot-checking on high-stakes use cases.
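The spot-check sampling can be made deterministic so that re-processing the same item does not change whether it lands in the review queue. This is a minimal sketch under that assumption; `SpotCheck.ShouldReview` is a hypothetical helper that hashes an item ID into the range [0, 1) and compares it with the sample rate.

```csharp
using System;
using System.Buffers.Binary;
using System.Security.Cryptography;
using System.Text;

public static class SpotCheck
{
    // Returns true for roughly sampleRate of all item IDs, always the same ones.
    public static bool ShouldReview(string itemId, double sampleRate = 0.05)
    {
        if (sampleRate < 0.0 || sampleRate > 1.0)
            throw new ArgumentOutOfRangeException(nameof(sampleRate));

        // The first four bytes of the digest act as a uniform 32-bit bucket.
        byte[] digest = SHA256.HashData(Encoding.UTF8.GetBytes(itemId));
        uint bucket = BinaryPrimitives.ReadUInt32BigEndian(digest);

        // Dividing by 2^32 maps the bucket into [0, 1).
        return bucket / 4294967296.0 < sampleRate;
    }
}
```

A rate of 0.05 to 0.10 matches the 5-10% suggested above; raising it for a newly launched feature and lowering it later is a reasonable pattern.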

Hosting recommendations

ASP.NET Business — $17.49/mo

Customer-facing visual features, moderate-volume processing, larger image-handling pipelines. 2 GB RAM gives headroom for image pre-processing.


ASP.NET Professional — $27.49/mo

High-volume visual product search, multi-tenant moderation platforms, agency catalog tools. 4 GB per pool, highest priority scheduling.


FAQs

Can I run a vision model locally on Windows?

Small VLMs (Moondream, Phi-3-vision, LLaVA-1.5) run via Ollama on Windows with a GPU. CPU-only inference is usually too slow for interactive use. See our self-hosted inference guide for the hardware tradeoffs. On Adaptive Web Hosting (AWH) plans, the orchestration runs on Adaptive Web Hosting and the inference goes to a separate GPU machine.

How accurate is text extraction from images via VLMs?

For printed text on clean backgrounds, comparable to or better than traditional OCR. For complex layouts (forms, tables, receipts), VLMs typically beat OCR because they understand spatial relationships. For handwriting, VLMs are dramatically better.

Can I use VLMs for video?

Treat video as a sequence of frame samples (one frame per second is a common starting point), analyze each frame, summarize across frames with a second LLM call. True video-native models exist but the frame-sampling approach is more practical with .NET tooling today.
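The frame-sampling step reduces to picking timestamps. Here is a minimal sketch of that part only; `FrameSampler` is a hypothetical helper, the `maxFrames` cap is our assumption to keep cost bounded on long videos, and extracting the actual frames is left to whatever video tool you use.

```csharp
using System;
using System.Collections.Generic;

public static class FrameSampler
{
    // Returns evenly spaced timestamps starting at zero, capped at maxFrames.
    public static List<TimeSpan> SampleTimestamps(
        TimeSpan duration, TimeSpan interval, int maxFrames = 60)
    {
        if (interval <= TimeSpan.Zero)
            throw new ArgumentOutOfRangeException(nameof(interval));

        var timestamps = new List<TimeSpan>();
        for (var t = TimeSpan.Zero; t < duration && timestamps.Count < maxFrames; t += interval)
        {
            timestamps.Add(t);
        }
        return timestamps;
    }
}
```

With a one-second interval, a five-second clip yields frames at 0s through 4s; each frame then goes through the same image-prep and vision call as a still image.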

What about image generation (DALL-E, Stable Diffusion)?

Microsoft.Extensions.AI does not yet have a unified image-generation interface. Call provider SDKs directly: OpenAI's ImageClient, Azure OpenAI's image API, or a hosted Stable Diffusion endpoint. The Blazor UI pattern is the same — text prompt in, image bytes out, display in a preview component.

Are vision results consistent enough for production?

At temperature 0.0-0.2 with schema-constrained output, yes for most production use cases — moderation, alt-text, structured extraction. For tasks requiring exact numerical precision (counting items, measuring distances), validate against a deterministic backup (e.g., classical OCR + arithmetic) for high-stakes applications.

What about screenshots with private data?

Treat screenshots like any other user content. For applications where users paste screenshots routinely (developer tools, support agents), warn them in the UI that the content gets sent to the model. For sensitive environments, route to a self-hosted vision model.

Ship it

Vision-language models cross the same threshold in 2026 that text LLMs crossed in 2023 — going from "interesting research demos" to "primary input modality for business applications." The .NET 10 stack ships everything: multi-modal ChatMessage through IChatClient, ImageSharp for pre-processing, Blazor file upload with paste-from-clipboard, and the same provider-agnostic abstraction as text-only AI.

Adaptive Web Hosting's ASP.NET hosting plans run all of this on real Windows + IIS with SQL Server 2022 included for storing image metadata + extracted records, dedicated app pools for the orchestration, and free SSL on every plan.