Project Overview
This project provides an end-to-end data pipeline and predictive analytics system that estimates Scope-3 emissions, identifies high-risk suppliers, and delivers actionable insights through an interactive Power BI dashboard. It addresses the challenge of quantifying Scope-3 emissions, particularly Category 1 (Purchased Goods & Services), by mapping procurement spend to emission factors and computing a supplier-level risk score.
Key Features:
- ETL Pipeline: Processes raw procurement data, maps categories to NAICS codes, joins with EPA emission factors, handles imputation for missing data, and calculates line-level emissions.
- Supplier Risk Scoring: Generates a composite score based on normalized emissions and spend to prioritize suppliers.
- Machine Learning Model: Uses a Random Forest model to predict supplier emissions, with a reported accuracy of 97.6% and R² = 0.9968. Linear Regression, Ridge Regression, and Gradient Boosting were also evaluated.
- Data Visualization: Presents findings through a comprehensive 4-page Power BI dashboard, including an executive overview, supplier and category analysis, and supplier drilldown views.
- Data Sources: Leverages procurement data (e.g., from Kaggle), US EPA USEEIO emission factors, and NAICS codes for accurate estimation.
- Methodology: Employs spend-based emissions calculation aligned with the GHG Protocol Scope 3 Standard and US EPA USEEIO methodology.
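The spend-based calculation and the composite risk score described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the column names, the sample rows, the emission factor values, the median fallback used for imputation, and the 0.7/0.3 weights are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical procurement lines; column names and values are illustrative only.
procurement = pd.DataFrame({
    "supplier": ["A", "A", "B", "C"],
    "naics_code": ["325", "334", "325", "999"],
    "spend_usd": [1000.0, 500.0, 2000.0, 400.0],
})

# Hypothetical emission factors in kg CO2e per USD (not real USEEIO values).
factors = pd.DataFrame({
    "naics_code": ["325", "334"],
    "kg_co2e_per_usd": [0.5, 0.2],
})


def estimate_emissions(procurement, factors):
    """Join spend lines to emission factors and compute line-level emissions."""
    merged = procurement.merge(factors, on="naics_code", how="left")
    # Flag rows whose NAICS code had no factor so imputed values stay traceable.
    merged["factor_imputed"] = merged["kg_co2e_per_usd"].isna()
    # Assumed imputation rule: fall back to the median of the known factors.
    fallback = factors["kg_co2e_per_usd"].median()
    merged["kg_co2e_per_usd"] = merged["kg_co2e_per_usd"].fillna(fallback)
    # Spend-based estimate: emissions = spend x emission factor.
    merged["emissions_kg_co2e"] = merged["spend_usd"] * merged["kg_co2e_per_usd"]
    return merged


def min_max(series):
    """Scale a numeric series to [0, 1]; a constant series maps to 0."""
    span = series.max() - series.min()
    if span == 0:
        return pd.Series(0.0, index=series.index)
    return (series - series.min()) / span


def score_suppliers(lines, w_emissions=0.7, w_spend=0.3):
    """Aggregate to supplier level and combine normalized emissions and spend."""
    by_supplier = lines.groupby("supplier", as_index=False)[
        ["spend_usd", "emissions_kg_co2e"]
    ].sum()
    # The weights are assumptions; the project may weight the terms differently.
    by_supplier["risk_score"] = (
        w_emissions * min_max(by_supplier["emissions_kg_co2e"])
        + w_spend * min_max(by_supplier["spend_usd"])
    )
    return by_supplier.sort_values("risk_score", ascending=False).reset_index(drop=True)


lines = estimate_emissions(procurement, factors)
ranking = score_suppliers(lines)
print(ranking)
```

Keeping a `factor_imputed` flag on each line makes it possible to report how much of the estimated total rests on imputed factors rather than matched ones.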
Tech Stack:
- Data Engineering: Python 3.11, Pandas, NumPy
- Machine Learning: scikit-learn (Random Forest, Linear Regression, Ridge, Gradient Boosting)
- Visualization: Matplotlib, Seaborn, Plotly
- Business Intelligence: Power BI Desktop, DAX
- SQL Analytics: PostgreSQL-compatible queries
The project includes detailed notebooks for the ETL pipeline and ML model training, SQL query examples, and clear instructions on how to run the system.
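As a rough idea of what the model comparison might look like, the sketch below fits the four regressors named under Key Features on synthetic data and scores each with R² on a held-out split. The features, the data-generating process, the split ratio, and the hyperparameters are assumptions chosen for illustration; the notebooks may use different inputs and settings, and the scores this produces say nothing about the project's reported results.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for supplier-level data (assumed features: spend and
# an average emission factor). Real features come from the ETL output.
rng = np.random.default_rng(0)
n_suppliers = 300
spend = rng.uniform(1_000, 100_000, size=n_suppliers)
factor = rng.uniform(0.1, 1.0, size=n_suppliers)
X = np.column_stack([spend, factor])
# Target: emissions as spend x factor plus noise (an assumed relationship).
y = spend * factor + rng.normal(0, 500, size=n_suppliers)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Candidate models; hyperparameters are library defaults or arbitrary choices.
models = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=1.0),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
}

# Fit each model and record R² on the held-out test set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))

for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: R² = {score:.4f}")
```

Scoring on a held-out split rather than on the training data gives a more honest comparison, since tree ensembles can fit training data almost perfectly.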