Progressed from a junior role to Data Scientist over a 3+ year tenure, taking ownership of core modules, mentoring new team members, and remaining a core contributor across all project phases.
Built and scaled a robust document parsing pipeline for healthcare PDFs with highly inconsistent layouts, using Python, pandas, and table extraction libraries such as Tabula and Camelot; the pipeline powers a live database of 800,000+ US medical and pharmaceutical formulary documents.
Enabled clients to launch 3 commercial, subscription-based B2B products built on this pipeline's outputs, helping stakeholders track coverage changes and analyze formulary status across health plans.
Designed and implemented a deep learning–based document routing system using Vision Transformers (ViT) and CNNs that replaced rule-based extractors, achieving 92% routing accuracy and reducing downstream processing failures by 50%.
Led fine-tuning of and experimentation with ViT models (using PyTorch and Transformers) for layout classification and table presence detection across diverse document templates.
Replaced unreliable OCR/table extractors by integrating Detectron2 (Mask R-CNN) to predict column lines, drawing them programmatically using PyMuPDF, and extracting data using both Camelot and Tabula, with quality-based selection logic to pick the better output.
Fine-tuned custom spaCy NER models to extract clinical entities like dosage, drug names, and restrictions from noisy policy text, enhancing downstream entity resolution performance.
Used Git to manage a rapidly evolving codebase, submitting over 500 PRs across 3+ years; remained the sole constant on the project through changing team sizes (2–6 members), consistently delivering on client-specific requirements.
Mentored all new team members, onboarding them into a large and evolving codebase, running knowledge-transfer (KT) sessions, and driving best practices in model development, data cleaning, and debugging.
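
For illustration, the quality-based selection between Camelot and Tabula outputs mentioned above might look like the minimal sketch below. The scoring heuristic (share of non-empty cells plus row-width consistency), the equal weighting, and the names `table_quality` and `pick_better` are assumptions made for this example, not the production logic. Tables are represented as plain lists of rows so the sketch runs without either library installed.

```python
from typing import List, Optional, Tuple

# A table is a list of rows; each row is a list of cell strings (or None).
Table = List[List[Optional[str]]]


def table_quality(table: Table) -> float:
    """Hypothetical score in [0, 1]: rewards filled cells and consistent row widths."""
    if not table:
        return 0.0
    widths = [len(row) for row in table]
    # Treat the most common row width as the "expected" column count.
    expected = max(set(widths), key=widths.count)
    if expected == 0:
        return 0.0
    consistent = sum(1 for w in widths if w == expected) / len(widths)
    cells = [cell for row in table for cell in row]
    filled = sum(1 for c in cells if c is not None and str(c).strip()) / len(cells)
    # Equal weighting is an arbitrary choice for this sketch.
    return 0.5 * consistent + 0.5 * filled


def pick_better(camelot_table: Table, tabula_table: Table) -> Tuple[str, Table]:
    """Return the name and rows of the higher-scoring extraction (ties go to Camelot)."""
    if table_quality(camelot_table) >= table_quality(tabula_table):
        return "camelot", camelot_table
    return "tabula", tabula_table


# Example: one extractor split the columns cleanly, the other merged some cells.
camelot_rows = [["Drug", "Tier"], ["Aspirin", "1"], ["Ibuprofen", "2"]]
tabula_rows = [["Drug Tier", ""], ["Aspirin 1"], ["Ibuprofen", "2"]]
winner, rows = pick_better(camelot_rows, tabula_rows)
print(winner, rows)
```

In practice, the rows would come from each library's extracted DataFrame (for example via `df.values.tolist()`), and the score could include domain checks such as expected header names.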