-Built an AI-driven evaluation system in Python that compares outputs from multiple large language models to identify bias and measure response quality.
-Used sentence embeddings and NLTK in Google Colab to produce structured, data-driven assessments of fairness and accuracy in model responses.