This project explored the implementation of a neural processing unit (NPU) on a field-programmable gate array (FPGA) to accelerate convolutional neural network (CNN) inference for handwritten digit recognition on the MNIST dataset. The NPU used a modular, scalable architecture based on LeNet-5 and was developed in SystemVerilog; a custom software implementation written in C++ was used for training, and a pre-trained Python TensorFlow model served as a reference for benchmarking. The results were promising and underscore the potential of FPGA-based NPUs for accelerating CNN inference in real-world applications, contributing to the growing body of research on FPGA-based deep learning accelerators.