How I solved the “out-of-context” issue when analyzing large documents with LLMs

When I first started building the Data Analyzer module for DeepDoc AI, I took the straightforward approach: pass the entire document to the LLM and extract metadata.
It worked perfectly for small documents. But the moment I tried analyzing larger PDFs, things broke down:
Models exceeded token limits
Metadata extraction became inconsistent
Context was lost across pages
I needed a smarter strategy.
The Solution: Intelligent Chunking & Context-Aware Analysis
Instead of treating the document as one big blob, DeepDoc now:
Splits text into manageable chunks (1,200 tokens each, with a 300-token overlap for continuity)
Applies context-aware prompts:
Early chunks → extract title, author, creation date
Middle chunks → focus on summaries, sentiment, main content
Final chunks → look for publisher info, last modified dates, page counts
Processes chunks in parallel for efficiency
Consolidates results into a clean, validated JSON metadata structure
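Here is a minimal sketch of the chunking steps above. It is not DeepDoc's actual code. The function names (chunk_tokens, focus_for, analyze_all) are hypothetical, the tokens are assumed to be an already-tokenized list, and the rule "first chunk is early, last chunk is final, everything else is middle" is my simplification for illustration. The analyze_chunk callable stands in for whatever LLM call you use.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_tokens(tokens, size=1200, overlap=300):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # each window starts 900 tokens after the previous one
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


def focus_for(index, total):
    """Pick a prompt focus from the chunk position (assumed rule, see above)."""
    if index == 0:
        return "early"   # title, author, creation date
    if index == total - 1:
        return "final"   # publisher info, last modified dates, page counts
    return "middle"      # summaries, sentiment, main content


def analyze_all(chunks, analyze_chunk, max_workers=4):
    """Run the hypothetical analyze_chunk(chunk, focus) on every chunk in parallel."""
    total = len(chunks)
    focuses = [focus_for(i, total) for i in range(total)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in input order, so chunk order is preserved
        return list(pool.map(analyze_chunk, chunks, focuses))
```

Threads fit here because the work is dominated by waiting on network calls to the model, and keeping results in chunk order makes the later consolidation step simpler.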
Smart Routing
Not every document needs chunking. If the document is small enough (≤ 6000 tokens), DeepDoc runs a full-document analysis. Larger docs are routed through the chunking pipeline automatically.
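The routing decision can be sketched in a few lines. This is an illustration only: count_tokens and choose_pipeline are hypothetical names, and counting whitespace-separated words is a rough stand-in for a real tokenizer, which would give different numbers.

```python
FULL_DOC_LIMIT = 6000  # threshold from the text above


def count_tokens(text):
    """Rough token count (assumption: whitespace words approximate tokens)."""
    return len(text.split())


def choose_pipeline(text, limit=FULL_DOC_LIMIT):
    """Return "full" for small documents and "chunked" for larger ones."""
    return "full" if count_tokens(text) <= limit else "chunked"
```

In practice you would swap count_tokens for the tokenizer that matches your model, since the limit only means something in that model's token units.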
Resilience Built-In
Invalid JSON? Fixed using LangChain’s OutputFixingParser.
Failed chunks? Fallback merge guarantees schema consistency.
Final pass? A validation prompt checks metadata quality.
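To show what a fallback merge can look like, here is a standard-library sketch. It does not use OutputFixingParser and is not DeepDoc's code: the field list, the names safe_parse and merge_results, and the "first non-empty value wins" policy are all assumptions. The point is that the merged result always has every schema key, even when some chunks return garbage.

```python
import json

# Assumed schema: every key is always present in the final metadata
DEFAULTS = {
    "title": None,
    "author": None,
    "creation_date": None,
    "summary": None,
    "sentiment": None,
    "language": None,
}


def safe_parse(raw):
    """Return the parsed dict, or None if the chunk output is not a JSON object."""
    try:
        parsed = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None
    return parsed if isinstance(parsed, dict) else None


def merge_results(raw_outputs):
    """Merge per-chunk outputs; failed chunks are skipped, missing fields stay None."""
    merged = dict(DEFAULTS)
    for raw in raw_outputs:
        parsed = safe_parse(raw)
        if parsed is None:
            continue  # failed chunk: ignore it, the schema is still complete
        for key in DEFAULTS:
            value = parsed.get(key)
            if merged[key] is None and value not in (None, ""):
                merged[key] = value  # first non-empty value wins (assumed policy)
    return merged
```

For example, merging '{"title": "Report"}', a broken string, and '{"author": "A", "title": "Other"}' keeps "Report" as the title, picks up "A" as the author, and leaves the remaining fields as None.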
The Result
DeepDoc AI can now handle even 100+ page documents without losing context, producing accurate, structured metadata (title, author, dates, summaries, sentiment, language, etc.).
What’s Next
With the Data Analyzer module done, the next milestone is integrating this into the document comparison engine to enable page-by-page diffs between versions.