Udacity - Data Engineering - Cloud Data Warehousing

 So I've roughly navigated this course and it was quite a challenge going through the theory because once the architectures were discussed it was just a whole bunch of clicking through in AWS and that can get boring.

Architecture :

For the architecture itself, the idea is a back office and a front office. The back office is all the sources and individual processes that bring in data. The Data warehouse simplifies these schemas into (possibly) a star model and then makes it easier and faster to use for the analytics / BI division (front office). 

There are a few variants, where BI can directly access the main source, where DW can be department specific, or unique for each department still maintaining integrity among common columns.

Cloud / AWS :

Doing this in the cloud provides quicker start time, elasticity, scalability.

The general idea is to read from sources and move to a staging S3 bucket and then push to a DW. For smaller tables it might be possible to directly use an EC2 instance instance instead of an S3 bucket.

the dw solution used here is redshift - it is aws managed. we could do self managed by installing something on a EC2 for DW. but going aws managed will take care of updates as well as it will allow us to do IaC - infrastructure as code or devops. also redshift can be implemented with columnar storage which is faster. also redshift if massively parallel, that is a query can run on multiple partitions across multiple nodes. in relational DB, multiple queries can run on multiple nodes, but a single query can't run on multiple nodes.

Project - Working my way backwards

Since I did not pay attention to the course very well, I'm going to jump right into the project and go revisit the course and do deep dives as needed.

Also, since I only have limited free credits on AWS, I'm going to think through the project right here and only go to implementation when I have a good idea of how to solve it.

Objective

ETL pipeline that will extract from S3 , stage in a redshift and then transform into dimensional tables for analytics team.

Problem

The song app currently has json logs and metadata in an S3 bucket with user activity and song metadata. I have to stage it to a redshift and then convert it to a DW.

  • create_table.py is where you'll create your fact and dimension tables for the star schema in Redshift.
  • etl.py is where you'll load data from S3 into staging tables on Redshift and then process that data into your analytics tables on Redshift.
  • sql_queries.py is where you'll define you SQL statements, which will be imported into the two other files above.
  • README.md is where you'll provide discussion on your process and decisions for this ETL pipeline.
Project Steps

Create Table Schemas

  1. Design schemas for your fact and dimension tables
  2. Write a SQL CREATE statement for each of these tables in sql_queries.py
  3. Complete the logic in create_tables.py to connect to the database and create these tables
  4. Write SQL DROP statements to drop tables in the beginning of create_tables.py if the tables already exist. This way, you can run create_tables.py whenever you want to reset your database and test your ETL pipeline.
  5. Launch a redshift cluster and create an IAM role that has read access to S3.
  6. Add redshift database and IAM role info to dwh.cfg.
  7. Test by running create_tables.py and checking the table schemas in your redshift database. You can use Query Editor in the AWS Redshift console for this.

Comments

Popular posts from this blog

Udacity - Data Engineering - Intro

Designing Data Intensive Applications - Ch 1 - Reliability, Scalability, Maintainability