Document Intelligence in .NET 10: OCR, Extraction, and Blazor
Half the business documents in the world are PDFs nobody can search. Invoices, contracts, ID documents, expense receipts, lab reports, shipping manifests — locked in scans and image-based PDFs that traditional databases can't query and traditional code can't reliably parse. Document intelligence is the AI category that fixes this: turn unstructured documents into structured data your applications can use.
In 2026 the .NET 10 stack has every piece needed to build production document intelligence: OCR via Tesseract or cloud services, structured extraction via LLMs with JSON-schema-constrained outputs, and Blazor for the upload + review UI. This guide walks through the four-stage pipeline and the production patterns we see in field deployments.
PDF + Image: input formats
.NET 10: LTS runtime
JSON: schema-constrained output
The document intelligence stack in .NET 10
PDF parsing
UglyToad.PdfPig for text-layer extraction. PDFsharp for manipulation. Both are mature, permissively licensed (PdfPig under Apache 2.0, PDFsharp under MIT), and fully native .NET.
✅ Native .NET
OCR
Tesseract via the .NET wrapper. Free, runs on Windows, good for English + 100+ languages. No per-page fees.
✅ AI extraction
Structured output
Microsoft.Extensions.AI + JSON schema. GPT-4-class models return data conforming to your C# record type — no string parsing, no regex fragility.
🟡 Premium
Azure Document Intelligence
Better than Tesseract for tables, forms, handwriting. Per-page billing. Use for complex documents where Tesseract accuracy is insufficient.
🟡 Premium
Vision-capable LLMs
Skip OCR entirely — feed images directly to a vision model and ask for structured output. Slower, more expensive, but handles complex layouts well.
🟡 Optional
Human review
For high-stakes documents (invoices, contracts), surface low-confidence fields for human verification before posting downstream.
Quick reference: the four-stage pipeline
1. Ingestion and classification
An invoice has different fields than a contract. A receipt has different fields than a lab report. The first thing the pipeline should do is figure out what kind of document this is, then route to the appropriate extraction schema. A small LLM call on the first page handles this reliably.
Blazor upload component, classification call, and routing to a type-specific handler:
@* Blazor upload component. _docs and Navigation are injected services (registration not shown). *@
<InputFile OnChange="OnFileSelected" accept=".pdf,.jpg,.png" />
@code {
    async Task OnFileSelected(InputFileChangeEventArgs e)
    {
        var file = e.File;
        // OpenReadStream throws if the file is larger than maxAllowedSize (25 MB here)
        await using var stream = file.OpenReadStream(maxAllowedSize: 25 * 1024 * 1024);
        // Save to disk + run pipeline
        var docId = await _docs.IngestAsync(stream, file.Name, file.ContentType);
        Navigation.NavigateTo($"/documents/{docId}");
    }
}
// Classification: extract first-page text or thumbnail, ask the model.
// ClassifyResult is assumed to be a record exposing Type and Confidence.
public async Task<DocumentType> ClassifyAsync(string firstPageText)
{
    var resp = await _chatClient.GetResponseAsync<ClassifyResult>(
        $@"Classify this document into one of: invoice, receipt, contract, id_document, lab_report, other.
Respond as JSON: {{ ""type"": ""..."", ""confidence"": 0.0-1.0 }}.
Document content:
{firstPageText}");
    return resp.Result.Type;
}
Once the type is known, the pipeline branches: invoices go through invoice-extraction; contracts go through contract-extraction. Each branch has its own JSON schema for the fields it needs.
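The branching described above can be sketched as a lookup from document type to handler. This is a minimal illustration, not the pipeline's actual code: the `DocumentType` values mirror the categories in the classification prompt, while `DocumentRouter`, the handler delegate shape, and the fallback behavior are hypothetical names chosen for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Mirrors the categories named in the classification prompt.
public enum DocumentType { Invoice, Receipt, Contract, IdDocument, LabReport, Other }

// Hypothetical router: maps each document type to an extraction handler.
public sealed class DocumentRouter
{
    private readonly Dictionary<DocumentType, Func<string, Task<object>>> _handlers = new();
    private readonly Func<string, Task<object>> _fallback;

    public DocumentRouter(Func<string, Task<object>> fallback) => _fallback = fallback;

    public DocumentRouter Register(DocumentType type, Func<string, Task<object>> handler)
    {
        _handlers[type] = handler;
        return this; // allow chained registration
    }

    // Types without a registered handler (including Other) go to the fallback,
    // which might queue the document for manual handling.
    public Task<object> RouteAsync(DocumentType type, string text) =>
        _handlers.TryGetValue(type, out var handler) ? handler(text) : _fallback(text);
}
```

Registering one handler per type keeps each extraction schema isolated, and the explicit fallback means an unexpected classification never silently drops a document.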
2. Text extraction: PDF, then OCR fallback
PDFs come in two flavors: text-layer PDFs (the text is embedded as text) and image PDFs (the text is a picture of text). The pipeline tries text extraction first — it's free and instant — and falls back to OCR only when needed.
using System.Text; // StringBuilder
using Tesseract;   // TesseractEngine, Pix

public async Task<string> ExtractTextAsync(Stream pdfStream)
{
    using var doc = UglyToad.PdfPig.PdfDocument.Open(pdfStream);
    var pages = doc.GetPages().ToList();
    var totalChars = pages.Sum(p => p.Text.Length);
    var avgCharsPerPage = totalChars / Math.Max(1, pages.Count);
    if (avgCharsPerPage > 200)
    {
        // Plenty of text: use the PDF's own text layer
        return string.Join("\n\n", pages.Select(p => p.Text));
    }
    // Scanned PDF: fall back to OCR.
    // PdfPig has already read the stream, so rewind before handing it on.
    pdfStream.Position = 0;
    return await OcrAsync(pdfStream);
}
private async Task<string> OcrAsync(Stream pdfStream)
{
    // PdfToImageAsync is a helper (not shown) that rasterizes each page to an image file.
    // PDFsharp does not render pages; use a PDF rendering library such as PDFtoImage.
    var pages = await PdfToImageAsync(pdfStream);
    var sb = new StringBuilder();
    using var engine = new TesseractEngine(@"./tessdata", "eng", EngineMode.Default);
    foreach (var imagePath in pages)
    {
        using var pix = Pix.LoadFromFile(imagePath);
        using var result = engine.Process(pix);
        sb.AppendLine(result.GetText());
    }
    return sb.ToString();
}
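The average-characters check treats the document as a whole, so a PDF that mixes text pages with scanned attachments can pass the threshold while some pages have no usable text layer. One possible refinement is to make the decision per page. The sketch below assumes the page texts have already been pulled out (for example from PdfPig's `Page.Text`); `OcrPlanner`, `PagesNeedingOcr`, and the reuse of the 200-character threshold are illustrative choices, not part of the original pipeline.

```csharp
using System;
using System.Collections.Generic;

public static class OcrPlanner
{
    // Returns the 1-based page numbers whose text layer is too thin to trust.
    public static List<int> PagesNeedingOcr(IReadOnlyList<string> pageTexts, int minChars = 200)
    {
        var result = new List<int>();
        for (var i = 0; i < pageTexts.Count; i++)
        {
            // Count only non-whitespace so pages of blank lines do not pass.
            var visible = 0;
            foreach (var c in pageTexts[i])
                if (!char.IsWhiteSpace(c)) visible++;

            if (visible < minChars) result.Add(i + 1);
        }
        return result;
    }
}
```

Only the listed pages would then be rasterized and sent to OCR. A genuinely sparse page, such as a signature page, also lands on the list, which costs one extra OCR call but loses nothing.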
When to use Azure Document Intelligence instead
Tesseract is excellent for clean printed text. It struggles with:
Complex tables (cells get merged or split incorrectly)
Handwriting (basically doesn't work)
Skewed or low-resolution scans
Documents with multiple columns or non-linear reading order
For those, a cloud OCR service such as Azure Document Intelligence or Amazon Textract is worth its per-page fee. You can branch by classification: simple printed text → Tesseract; complex tables or handwriting → cloud OCR.
3. Structured extraction with JSON schema
The pre-LLM way to extract invoice data was 200 lines of regex praying the vendor's date format didn't change. Modern structured output means: define a C# record, ask the model to fill it from the text, get back a strongly-typed object. The model handles the parsing variability you used to handle with code.
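Define the target shape once as a C# record. Microsoft.Extensions.AI generates a JSON schema from the record type and sends it with the request, so providers that support structured output constrain the model's response to that shape.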
using Microsoft.Extensions.AI; // GetResponseAsync<T> extension

// Fields are nullable because the prompt tells the model to use null for unknowns.
public record Invoice(
    string? VendorName,
    string? InvoiceNumber,
    DateOnly? InvoiceDate,
    DateOnly? DueDate,
    decimal? TotalAmount,
    string? Currency,
    List<LineItem> LineItems);
public record LineItem(
    string Description,
    int Quantity,
    decimal UnitPrice,
    decimal LineTotal);
public async Task<Invoice> ExtractInvoiceAsync(string text)
{
    var response = await _chatClient.GetResponseAsync<Invoice>(
        $@"Extract invoice fields from the document below. Use null for unknown fields.
Document:
{text}");
    // Result throws if the response cannot be deserialized into Invoice
    return response.Result;
}
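The model returns a JSON object shaped like the Invoice record, and the runtime deserializes it into a typed object. How strictly the schema is enforced depends on the provider and model, and a conforming object can still contain wrong values, so keep validation in place. What does go away is the string-parsing layer that used to dominate document automation.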
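A typed object can still carry wrong numbers, so a cheap arithmetic check before posting downstream is worth considering. The sketch below redeclares simplified versions of the records above so it stands alone; the `InvoiceChecks` class, the tolerance, and the assumption that the total equals the sum of the line totals (no separate tax or shipping lines) are illustrative and would need adjusting for real invoices.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Simplified copies of the records above, so the example is self-contained.
public record LineItem(string Description, int Quantity, decimal UnitPrice, decimal LineTotal);
public record Invoice(string VendorName, decimal TotalAmount, List<LineItem> LineItems);

public static class InvoiceChecks
{
    // Returns human-readable problems; an empty list means the checks passed.
    public static List<string> Validate(Invoice invoice, decimal tolerance = 0.01m)
    {
        var problems = new List<string>();

        foreach (var item in invoice.LineItems)
        {
            var expected = item.Quantity * item.UnitPrice;
            if (Math.Abs(expected - item.LineTotal) > tolerance)
                problems.Add($"Line '{item.Description}': {item.Quantity} x {item.UnitPrice} != {item.LineTotal}");
        }

        // Assumption: no tax or shipping outside the line items.
        var sum = invoice.LineItems.Sum(i => i.LineTotal);
        if (Math.Abs(sum - invoice.TotalAmount) > tolerance)
            problems.Add($"Line totals sum to {sum}, but TotalAmount is {invoice.TotalAmount}");

        return problems;
    }
}
```

Any invoice that returns a non-empty list is a natural candidate for the human review step rather than automatic posting.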
Confidence + human review
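For financial documents, surface uncertainty rather than hiding it. Have the model return a confidence score per field and route low-confidence fields to a review queue. Treat the score as a triage signal rather than a calibrated probability, and tune the threshold against fields your reviewers have already corrected: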
public record ExtractedField<T>(T Value, double Confidence, string Source);
public record InvoiceWithConfidence(
    ExtractedField<string> VendorName,
    ExtractedField<decimal> TotalAmount
    /* ...remaining fields follow the same pattern */);
// In your Blazor review UI, highlight any field where Confidence < 0.85
An admin reviews flagged fields in a side-by-side view (original document + extracted form), corrects, approves. The pattern scales — humans only touch the 5-10% of fields the model wasn't confident about.
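Selecting the fields a reviewer needs to see can be sketched as a simple filter over the extracted fields. This is one possible shape, not the article's implementation: `ReviewItem`, `ReviewQueue`, and the lowest-confidence-first ordering are hypothetical, and the 0.85 threshold is the one mentioned above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public record ExtractedField<T>(T Value, double Confidence, string Source);

// Non-generic view of a field so fields of different types can share one queue.
public record ReviewItem(string FieldName, string DisplayValue, double Confidence, string Source);

public static class ReviewQueue
{
    public const double Threshold = 0.85;

    public static ReviewItem ToItem<T>(string name, ExtractedField<T> field) =>
        new(name, field.Value?.ToString() ?? "", field.Confidence, field.Source);

    // Lowest confidence first, so reviewers see the riskiest fields at the top.
    public static List<ReviewItem> Flagged(IEnumerable<ReviewItem> items, double threshold = Threshold) =>
        items.Where(i => i.Confidence < threshold)
             .OrderBy(i => i.Confidence)
             .ToList();
}
```

Keeping `Source` on each item is what lets the review UI jump to the passage the value came from.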
4. Q&A and full-document indexing
Once the document text is clean, you have two further options:
Per-document Q&A
Pass the full document text as context with a user's question. Works well for documents that fit in a single context window (most PDFs under ~30 pages). Pattern:
public async Task<string> AnswerQuestionAsync(int docId, string question)
{
    var docText = await _docs.GetTextAsync(docId);
    var prompt = $@"Answer the question using ONLY the document below. If the answer
isn't in the document, say 'The document does not contain that information.'
DOCUMENT:
{docText}
QUESTION: {question}";
    var response = await _chatClient.GetResponseAsync(prompt);
    // ChatResponse.Text concatenates the text of the returned messages
    return response.Text;
}
Indexed Q&A across many documents
For larger corpuses, chunk + embed each document and build a RAG index. Querying then retrieves the relevant chunks across all docs before generating an answer. This is the same pattern as our RAG architecture guide, applied at the document-collection level.
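The chunk-and-retrieve half of that pattern can be sketched without committing to a particular embedding model or vector store. The example below assumes embeddings arrive as `float[]` vectors from whatever provider you use; `RagSketch`, the chunk sizes, and the brute-force ranking are illustrative and stand in for a real vector index.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public record Chunk(int DocId, int Index, string Text);

public static class RagSketch
{
    // Fixed-size character windows with overlap, so a sentence cut at one
    // boundary still appears whole in a neighboring chunk.
    public static List<Chunk> ChunkText(int docId, string text, int size = 1000, int overlap = 200)
    {
        if (size <= 0 || overlap < 0 || overlap >= size)
            throw new ArgumentException("Require size > 0 and 0 <= overlap < size.");

        var chunks = new List<Chunk>();
        for (int start = 0, index = 0; start < text.Length; start += size - overlap, index++)
        {
            var length = Math.Min(size, text.Length - start);
            chunks.Add(new Chunk(docId, index, text.Substring(start, length)));
            if (start + length >= text.Length) break; // last window reached the end
        }
        return chunks;
    }

    public static double Cosine(float[] a, float[] b)
    {
        if (a.Length != b.Length) throw new ArgumentException("Vectors must have equal length.");
        double dot = 0, na = 0, nb = 0;
        for (var i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return na == 0 || nb == 0 ? 0 : dot / (Math.Sqrt(na) * Math.Sqrt(nb));
    }

    // Brute-force top-k over all chunks; fine for small collections only.
    public static List<Chunk> TopK(float[] query, IEnumerable<(Chunk Chunk, float[] Vector)> index, int k) =>
        index.OrderByDescending(e => Cosine(query, e.Vector))
             .Take(k)
             .Select(e => e.Chunk)
             .ToList();
}
```

The retrieved chunks then take the place of the full document text in a prompt like the per-document one above. Character windows are the simplest splitting rule; splitting on headings or pages usually gives cleaner chunks when the structure is available.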
Blazor review UI
The killer feature in a document intelligence product is the review UI: original document on the left, extracted fields on the right, with each field highlighting its source text. Blazor makes this straightforward: render the PDF in an embedded iframe or via a JS component, render the form alongside, and link fields to source coordinates.
Store the extracted records, raw text, confidence scores, and processing history in SQL Server. Every Adaptive Web Hosting plan includes SQL Server 2022 databases. Tables map cleanly: Documents, DocumentPages, ExtractedFields, ReviewQueue. EF Core 10 handles the schema and migrations.
Where this gets used in production
Accounts payable. Drag vendor invoices into a Blazor portal; extracted line items post directly to the accounting system. Save 5-10 minutes per invoice.
Expense reports. Employees upload receipt photos. The pipeline reads vendor, amount, date, category. No more manual data entry.
Contract intake. Legal contracts get summarized, key dates and parties extracted, terms compared against your standard playbook.
Customer onboarding. ID documents and proof-of-address scans get parsed, fields auto-populate the customer record.
Lab and medical reports. Structured lab values extracted for trend analysis. (Handle PHI per your compliance posture; see the HIPAA / PHI question in the FAQs.)
Production patterns
Background processing
OCR + LLM extraction takes 5-30 seconds per document. Run via Hangfire so the upload returns immediately and the user is notified when processing completes. Don't block the request thread on OCR.
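The shape of that handoff, where the upload only enqueues an id and a worker does the slow part, can be shown with an in-process queue. Note the swap: this sketch uses `System.Threading.Channels` rather than Hangfire so it runs without extra packages, and it has no persistence or retries, both of which a job system like Hangfire provides. `DocumentQueue` and its members are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

public sealed class DocumentQueue
{
    private readonly Channel<int> _channel = Channel.CreateUnbounded<int>();

    // Called from the upload handler: returns as soon as the id is queued.
    public bool Enqueue(int docId) => _channel.Writer.TryWrite(docId);

    // Signals that no more documents will arrive, letting RunAsync finish.
    public void Complete() => _channel.Writer.Complete();

    // Worker loop: processes one document at a time until the queue completes.
    public async Task RunAsync(Func<int, Task> process, CancellationToken ct = default)
    {
        await foreach (var docId in _channel.Reader.ReadAllAsync(ct))
        {
            try
            {
                await process(docId);
            }
            catch (Exception ex)
            {
                // A failed document should not stop the worker.
                Console.Error.WriteLine($"Document {docId} failed: {ex.Message}");
            }
        }
    }
}
```

In an ASP.NET Core app the worker loop would typically live in a hosted background service, with the `process` delegate running OCR and extraction and then notifying the user.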
Cost control
Token costs add up when extracting from thousands of documents. Cache OCR results (for a fixed engine version and settings, the OCR step is deterministic) and only re-run LLM extraction when the schema or prompt changes. A 10,000-document re-index goes from "expensive" to "cheap" with proper caching.
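One way to express "re-run only when the schema or prompt changes" is to put those versions into the cache key. The sketch below is an assumption about how such keys might be built, not a prescribed scheme: `ExtractionCache`, the key prefixes, and the version strings are hypothetical.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class ExtractionCache
{
    // OCR output depends only on the file bytes (and the OCR settings,
    // which are assumed fixed here).
    public static string OcrKey(byte[] fileBytes) =>
        "ocr:" + Convert.ToHexString(SHA256.HashData(fileBytes));

    // Extraction output also depends on the schema and prompt, so their
    // versions are part of the key; bumping either one misses the cache.
    public static string ExtractionKey(byte[] fileBytes, string schemaVersion, string promptVersion)
    {
        var fileHash = Convert.ToHexString(SHA256.HashData(fileBytes));
        var versionHash = Convert.ToHexString(
            SHA256.HashData(Encoding.UTF8.GetBytes($"{schemaVersion}|{promptVersion}")));
        return $"extract:{fileHash}:{versionHash}";
    }
}
```

With keys like these, re-indexing after a prompt change reuses every cached OCR result and pays only for the LLM calls. Hashing the file content rather than the file name also deduplicates identical uploads.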
Retention and PII
Document intelligence often touches PII. Decide upfront: do you store the original PDF/image, or only the extracted fields? If you store originals, encrypt at rest and define a retention policy. Many teams keep the original for 30-90 days, then purge once the extracted data is verified.
Multi-tenant isolation
If you run document intelligence as a service for multiple customers, isolate per-tenant: separate SQL Server databases or schemas, separate file storage prefixes, separate processing queues. SQL Server's role-based security is your friend here.
Hosting recommendations
ASP.NET Business — $17.49/mo
Customer-facing intake portals, mid-volume processing (10K+ docs/month). 2 GB RAM gives headroom for Tesseract.
ASP.NET Professional — $27.49/mo
Multi-tenant document SaaS, agency platforms, high-throughput automation. 4 GB per pool, highest priority scheduling.
FAQs
Do I need a vision-capable model, or is OCR + text-only LLM enough?
For most documents (invoices, receipts, contracts), OCR + text-only is faster and cheaper. Switch to a vision model only when layout matters — fillable forms, signature placement, multi-column layouts that don't OCR cleanly.
How accurate is the extraction?
For clean printed invoices, expect 95%+ field-level accuracy out of the box. For receipts (often crumpled, low-resolution), expect 80-90%. The remaining 5-20% is what your review queue catches. Don't promise 100% — promise reliable extraction with human verification on low-confidence fields.
Can I extract from Word documents?
Yes — DocumentFormat.OpenXml reads .docx natively. The extraction prompt is the same; only the text-extraction stage differs.
How do I handle 500-page documents?
Chunk by section or by page-range that fits the context window. Run extraction per chunk, then merge results. For Q&A over large documents, use the RAG pattern — embed chunks, retrieve relevant ones, answer from those.
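The page-range half of that answer can be sketched as a greedy packer that groups consecutive pages under a size budget. This is an illustration under stated assumptions: `PageBatcher` is a hypothetical name, and the budget is measured in characters as a rough proxy for tokens, so leave headroom for the prompt and the response.

```csharp
using System;
using System.Collections.Generic;

public static class PageBatcher
{
    // Greedily packs consecutive pages into batches that stay under maxChars.
    public static List<List<string>> Batch(IReadOnlyList<string> pages, int maxChars)
    {
        var batches = new List<List<string>>();
        var current = new List<string>();
        var used = 0;

        foreach (var page in pages)
        {
            // Start a new batch when adding this page would exceed the budget.
            if (current.Count > 0 && used + page.Length > maxChars)
            {
                batches.Add(current);
                current = new List<string>();
                used = 0;
            }
            // A single page larger than the budget still gets a batch of its own.
            current.Add(page);
            used += page.Length;
        }

        if (current.Count > 0) batches.Add(current);
        return batches;
    }
}
```

Each batch would then go through the same extraction call, with the per-batch results merged afterwards, for example by concatenating line items and taking header fields from the first batch that reports them.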
What about HIPAA / PHI documents?
Adaptive Web Hosting does not hold HIPAA certifications. If your use case requires regulated handling of protected health information, the hosting environment is one piece of the compliance puzzle; you would need additional controls (BAA-covered model providers, encrypted storage, audit logging) that go beyond what shared hosting provides. Discuss your specific requirements before deploying PHI workloads.
How do I prevent prompt injection in user-uploaded documents?
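Treat extracted text as untrusted input. When passing it to a downstream agent or chatbot, sanitize obvious injection attempts ("ignore previous instructions") and use structured extraction (JSON schema) rather than free-form output where possible. Schema-constrained output limits the shape of the response, so the model can't be steered into emitting free-form prose, but injected text can still influence the values inside the schema (a different total or payee, for example). Validate extracted values before acting on them.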
Ship it
Document intelligence is one of the highest-ROI AI applications a business can deploy — every white-collar workflow has unstructured document drag in it somewhere. The .NET 10 stack handles the full pipeline natively: PDF parsing, OCR, structured extraction, review UI, and indexed Q&A.
Adaptive Web Hosting's ASP.NET hosting plans run all of this on real Windows + IIS with SQL Server 2022 included on every plan, dedicated app pools tuned for production .NET workloads, and free SSL out of the box.