Building DeepDoc AI: Cracking the Large Document Analysis Problem

How I solved the “out-of-context” issue when analyzing large documents with LLMs

When I first started building the Data Analyzer module for DeepDoc AI, I took the straightforward approach , pass the entire document to the LLM and extract metadata.

It worked perfectly for small documents. But the moment I tried analyzing larger PDFs, things broke down:

Models exceeded token limits
Metadata extraction became inconsistent
Context was lost across pages

I needed a smarter strategy.

The Solution: Intelligent Chunking & Context-Aware Analysis
Instead of treating the document as one big blob, DeepDoc now:

Splits text into manageable chunks (1200 tokens with 300 overlap for continuity)
Applies context-aware prompts:
- Early chunks → extract title, author, creation date
- Middle chunks → focus on summaries, sentiment, main content
- Final chunks → look for publisher info, last modified dates, page counts
Processes chunks in parallel for efficiency
Consolidates results into a clean, validated JSON metadata structure

Smart Routing
Not every document needs chunking. If the document is small enough (≤ 6000 tokens), DeepDoc runs a full-document analysis. Larger docs are routed through the chunking pipeline automatically.

Resilience Built-In

Invalid JSON? Fixed using LangChain’s OutputFixingParser.
Failed chunks? Fallback merge guarantees schema consistency.
Final pass? A validation prompt checks metadata quality.

The Result
DeepDoc AI can now handle even 100+ page documents without losing context, producing accurate, structured metadata (title, author, dates, summaries, sentiment, language, etc.).

What’s Next
With the Data Analyzer module done, the next milestone is integrating this into the document comparison engine to enable page-by-page diffs between versions.

Join Siddharth on Peerlist!

Join amazing folks like Siddharth and thousands of other builders on Peerlist.