Production-grade LLM deployment system optimizing inference and training at scale. Key Achievements: • Reduced inference latency from 2.5s to 450ms (82% improvement) using quantization, batching, and vLLM optimization • Implemented PEFT/QLoRA fine-tuning enabling 70B+ parameter models on consumer-grade GPUs • Reduced deployment costs by 40%+ through resource optimization and auto-scaling strategies • Deployed 3+ production models with 98%+ uptime on Azure ML infrastructure • Built automated retraining pipelines reducing manual intervention by 80% • Multi-GPU training optimization achieving 2-3x speedup improvements Tech Stack: PyTorch, CUDA, vLLM, Azure ML, Docker, Kubernetes, Redis