Do you want to know the latest trends in the restaurant industry, but don't have the time or expertise to sort through Yelp data? This project takes care of the heavy lifting for you.
🤖 It 𝘤𝘰𝘭𝘭𝘦𝘤𝘵𝘴, 𝘱𝘳𝘰𝘤𝘦𝘴𝘴𝘦𝘴, 𝘵𝘳𝘢𝘯𝘴𝘧𝘰𝘳𝘮𝘴 𝘢𝘯𝘥 𝘷𝘪𝘴𝘶𝘢𝘭𝘪𝘻𝘦𝘴 the key metrics you need to make decisions!
🎯 𝐆𝐨𝐚𝐥
I built this project to learn and get hands-on experience with different data engineering tools and technologies. So yes, there is room for improvement (A LOT!), but it's a very good start.
🎯 𝐓𝐨𝐨𝐥𝐬 𝐚𝐧𝐝 𝐓𝐞𝐜𝐡𝐧𝐨𝐥𝐨𝐠𝐢𝐞𝐬 𝐔𝐬𝐞𝐝:
𝐒𝐭𝐨𝐫𝐚𝐠𝐞: Google Cloud Storage
𝐃𝐚𝐭𝐚 𝐓𝐫𝐚𝐧𝐬𝐟𝐨𝐫𝐦𝐚𝐭𝐢𝐨𝐧: DBT, Dataproc
𝐖𝐨𝐫𝐤𝐟𝐥𝐨𝐰 𝐎𝐫𝐜𝐡𝐞𝐬𝐭𝐫𝐚𝐭𝐢𝐨𝐧: Cloud Composer
𝐈𝐧𝐟𝐫𝐚𝐬𝐭𝐫𝐮𝐜𝐭𝐮𝐫𝐞 𝐚𝐬 𝐚 𝐂𝐨𝐝𝐞: Terraform
𝐂𝐈/𝐂𝐃: Cloud Build
𝐒𝐞𝐫𝐯𝐞𝐫𝐥𝐞𝐬𝐬 𝐂𝐨𝐦𝐩𝐮𝐭𝐢𝐧𝐠: Cloud Run
𝐃𝐚𝐭𝐚 𝐕𝐢𝐬𝐮𝐚𝐥𝐢𝐳𝐚𝐭𝐢𝐨𝐧: Google Data Studio
🎯 𝐏𝐢𝐩𝐞𝐥𝐢𝐧𝐞 𝐎𝐯𝐞𝐫𝐯𝐢𝐞𝐰:
Cloud Composer starts the Airflow DAG that executes these tasks:
👉 Get the Yelp data
👉 Submit Spark jobs to Dataproc, which processes and ingests data into BigQuery
👉 Perform DBT transformations on top of it to answer business questions
And finally, these DBT models are used for visualization on Google Data Studio.
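The task order above can be sketched as a small dependency graph. This is only an illustration using Python's standard library, not the project's actual Airflow DAG: the task names are hypothetical, and in the real pipeline each one would be an Airflow task running in Cloud Composer.

```python
from graphlib import TopologicalSorter

# Hypothetical task names (not taken from the project's code).
# Each task maps to the set of tasks that must finish before it starts.
PIPELINE = {
    "get_yelp_data": set(),
    "submit_spark_jobs_to_dataproc": {"get_yelp_data"},
    "run_dbt_transformations": {"submit_spark_jobs_to_dataproc"},
}


def execution_order(pipeline):
    """Return task names in an order that respects every dependency."""
    return list(TopologicalSorter(pipeline).static_order())


if __name__ == "__main__":
    for step, task in enumerate(execution_order(PIPELINE), start=1):
        print(f"{step}. {task}")
```

In Airflow itself, the same chain is usually written by linking the tasks with the `>>` operator inside a DAG definition.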
🎯 𝐂𝐡𝐚𝐥𝐥𝐞𝐧𝐠𝐞𝐬:
👉 Cloud Composer proved to be the biggest roadblock for this project. It did not seem to work as it should in the asia-south1 region, compared to us-central1.
👉 Orchestrating DBT Core jobs through Cloud Composer was another challenge, but I found a workaround using Cloud Run.
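One way a Cloud Run workaround like this could look is an Airflow task that sends an authenticated HTTP request to a Cloud Run service, which then runs DBT. The sketch below is an assumption, not the project's actual code: the service URL, the JSON payload shape (`{"args": [...]}`) and the function names are all hypothetical, and how the identity token is obtained is left out.

```python
import json
import urllib.request


def build_dbt_request(service_url, id_token, dbt_args):
    """Build a POST request asking a (hypothetical) Cloud Run service to run DBT.

    service_url: URL of the Cloud Run service (assumed, not from the project).
    id_token:    identity token used as a Bearer token for authentication.
    dbt_args:    DBT CLI arguments, e.g. ["run"]; the payload shape is an assumption.
    """
    body = json.dumps({"args": dbt_args}).encode("utf-8")
    return urllib.request.Request(
        url=service_url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {id_token}",
            "Content-Type": "application/json",
        },
    )


def trigger_dbt(service_url, id_token, dbt_args, timeout=600):
    """Send the request and return the HTTP status code of the response."""
    request = build_dbt_request(service_url, id_token, dbt_args)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A task in the DAG could call `trigger_dbt(...)` after the Spark jobs finish, so the DBT step still runs in the right order even though it executes outside Cloud Composer.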
🎯 𝐋𝐞𝐚𝐫𝐧 𝐌𝐨𝐫𝐞:
If you want to learn more about this, you can visit the project's repo.
If you want to try it on your own, I have written a blog post that explains in detail how to replicate this pipeline. You can read it here.