Stepping into the Data Engineering world
I have completed the “Data Engineering Nanodegree” by Udacity and I thought I would share my experience.
Like many other online providers, Udacity offers video lessons (which you can follow whenever it suits you), complemented with quizzes, online workspaces for exercises, and additional material (documentation and references).
There is an online community for seeking help or exchanging tips, a knowledge base for technical (and non-technical) questions, and a dedicated mentor (more on that later…).
The web-based portal is pretty good: you can resume a lesson at the point where you left off, and there is some extra material (introductions to Python and other libraries).
Every module ends with an assignment (a few days' to a week's effort) to solve a data engineering problem using the concepts covered in the lessons.
Knowledge of SQL and Python is required, ideally at an intermediate level at least: every lesson involves data modeling and Python coding, and if you struggle with those it is going to be a long, painful journey.
Udacity bonus: some introductory material (Python, NumPy and Pandas) is provided.
1. Welcome to the Nanodegree Program
You can safely skip this one: it is an intro to the platform and to Udacity Career Services (please!). Seriously? We come here to learn, not to look for a job.
2. Data Modeling
A good start, with a dive into data modeling with relational and NoSQL databases, covering important concepts such as denormalization and star and snowflake schemas, while working with Postgres and Cassandra.
This block requires you to deliver two projects: Data Modeling with Postgres and Data Modeling with Cassandra. Good stuff.
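To give a flavour of the star-schema idea from this module: one central fact table referencing small dimension tables, which analytical queries then join. This is a minimal sketch using Python's built-in SQLite in place of Postgres; all table and column names are made up for illustration, not taken from the actual projects.

```python
import sqlite3

# In-memory database standing in for Postgres; one fact table
# (plays) points at two dimension tables (users, songs).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
CREATE TABLE dim_user  (user_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_song  (song_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE fact_play (play_id INTEGER PRIMARY KEY,
                        user_id INTEGER REFERENCES dim_user(user_id),
                        song_id INTEGER REFERENCES dim_song(song_id),
                        played_at TEXT);
""")

cur.execute("INSERT INTO dim_user VALUES (1, 'ada')")
cur.execute("INSERT INTO dim_song VALUES (10, 'Blue Train')")
cur.execute("INSERT INTO fact_play VALUES (100, 1, 10, '2020-01-01')")

# Analytical queries join the fact table to its dimensions.
cur.execute("""
SELECT u.name, s.title
FROM fact_play f
JOIN dim_user u ON u.user_id = f.user_id
JOIN dim_song s ON s.song_id = f.song_id
""")
row = cur.fetchone()
print(row)  # ('ada', 'Blue Train')
```

The denormalized snowflake variant would further split the dimensions (e.g. songs into songs and artists); the course covers the trade-offs between the two.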
3. Cloud Data Warehouses
Another good one: designing a data warehouse on AWS, specifically with Amazon Redshift and S3 storage.
This is a juicy section: copy raw data from S3 into staging tables, then create a cluster to process the data following a star schema design. Infrastructure-as-code is also introduced, with some clear examples.
P.S. You are entitled to free AWS credits.
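The S3-to-staging step boils down to having Redshift run a COPY statement. A small sketch of building one in Python; the table name, bucket path and IAM role ARN below are placeholders, not the course's actual resources.

```python
def build_copy_sql(table, s3_path, iam_role, fmt="JSON 'auto'"):
    """Build a Redshift COPY statement that loads a staging table
    straight from S3. All names passed in are placeholders."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        f"FORMAT AS {fmt};"
    )

sql = build_copy_sql(
    "staging_events",
    "s3://my-bucket/log_data",
    "arn:aws:iam::123456789012:role/redshift-s3-read",
)
print(sql)
```

You would then execute the resulting string against the cluster with any Postgres-compatible driver, since Redshift speaks the Postgres wire protocol.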
4. Data Lakes with Spark
Apache Spark (finally) turns up. The theory behind the tool, its features and data wrangling are fine, but the practical part is disappointing.
There are neither clear instructions nor good examples on how to set up and run Apache Spark, and some lessons have clear gaps that make the project quite hard. A big disappointment, especially considering that Spark was one of the attractive elements of this program.
5. Data Pipelines with Airflow
Design, execute and debug pipelines with Airflow. That is good, mainly because the tool itself is amazing. Again the concepts are relatively well presented, but there is not much information on how to run Airflow yourself rather than relying on the provided workspace.
6. Capstone Project
The course closes with a final project where you are supposed to put into practice what you have learned from day one. Datasets are made available and there are references to a few good sources. The project instructions are deliberately broad, so the student has to make assumptions and justify design decisions.
A partially frustrating experience, because a lot of the effort went into configuring and running the actual tools rather than developing the ETL.
The mentorship was another big letdown. I was looking forward to soaking up views and tips from an experienced data engineer, but it didn’t happen.
The first mentor cancelled a meeting at the very last minute and did not show up at another agreed time (always evening meetings, by the way); the second one did the same, so I decided to get along on my own.
Overall I would recommend it if you are keen on getting into the #dataengineering arena. Despite some pain and unnecessary effort, the nanodegree covers all the fundamentals and gives a good overview of the available toolset. It could be better, but I guess that is true for everything (and everyone).
The good:

- Online Jupyter Notebook workspaces
- Portal usability
- AWS credits

The bad:

- Spark examples
- Personal mentor
- Career services