Distributed Trajectory Clustering with Apache

Implemented and evaluated scalable trajectory clustering pipelines in Apache Spark for large-scale GPS data from the Microsoft GeoLife dataset, comparing DBSCAN and TRACLUS across spatial and temporal partitions. Developed a PySpark pipeline to ingest, parse, segment, and cluster over 10,000 GPS trajectories under region-, time-, and user-based partitioning strategies. Ran DBSCAN and a custom TRACLUS implementation with Euclidean and DTW (dynamic time warping) distances, assessing each for clustering quality (Silhouette Score), runtime, and memory usage. DTW-based TRACLUS produced fewer noise points (6%) but ran approximately 130 times longer (569 seconds) than DBSCAN (4.39 seconds), a significant trade-off between accuracy and efficiency. Region-based partitioning improved performance by approximately 20% by reducing shuffle overhead and improving workload distribution across the Spark cluster. Used Pandas UDFs and task-level tuning for DTW alignment, keeping performance consistent while managing executor memory and avoiding recomputation through Spark's caching mechanisms. Evaluated six clustering configurations, each repeated three times, and monitored runs in the Spark Web UI to analyze runtime patterns and task skew.

Technologies Used: Apache Spark, PySpark, Databricks, Pandas UDF, MLlib, Jupyter Notebooks
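
For readers unfamiliar with the DTW distance mentioned above, the following is a minimal sketch of the textbook dynamic-programming formulation in plain Python. It is not the project's actual code: the function names `euclidean` and `dtw_distance`, the planar point representation, and the use of straight-line distance as the per-point cost are all assumptions made for illustration.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]  # assumed (x, y) pair in a planar projection


def euclidean(p: Point, q: Point) -> float:
    """Straight-line distance between two points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def dtw_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Classic DTW: minimum cumulative point-to-point cost over all
    monotonic alignments of trajectory a with trajectory b."""
    if not a or not b:
        raise ValueError("both trajectories must be non-empty")

    inf = math.inf
    # prev[j] holds the best cost of aligning the points of a seen so far
    # with the first j points of b; only two rows are kept in memory.
    prev = [inf] * (len(b) + 1)
    prev[0] = 0.0
    for p in a:
        curr = [inf] * (len(b) + 1)
        for j, q in enumerate(b, start=1):
            cost = euclidean(p, q)
            # Extend the cheapest of the three allowed predecessor cells.
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[len(b)]


if __name__ == "__main__":
    t1 = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    t2 = [(0.0, 0.0), (2.0, 0.0)]
    print(dtw_distance(t1, t2))
```

Each pairwise comparison takes time proportional to the product of the two trajectory lengths, which is one plausible contributor to the runtime gap between the DTW-based and Euclidean-based configurations.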