Udacity - Data Engineering - Cloud Data Warehousing

So I've roughly navigated this course. The theory was a bit of a slog: once the architectures were covered, the rest was mostly clicking through AWS consoles, and that gets boring.

Architecture :

For the architecture itself, the idea is a back office and a front office. The back office is all the source systems and the individual processes that bring data in. The data warehouse consolidates those source schemas into a simpler model (often a star schema), which makes the data easier and faster to query for the analytics / BI division (the front office).
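To make that concrete, here's a rough sketch of what a star model can look like, with one central fact table keyed to a few dimension tables. The table and column names are placeholders, not this project's schema:

```python
# Placeholder star-schema sketch: a fact table whose keys point at
# dimension tables. Names are illustrative only.
create_fact_sales = """
    CREATE TABLE IF NOT EXISTS fact_sales (
        sale_id      BIGINT IDENTITY(0,1),
        date_key     INT NOT NULL,        -- references dim_date
        customer_key INT NOT NULL,        -- references dim_customer
        product_key  INT NOT NULL,        -- references dim_product
        amount       DECIMAL(10,2)
    );
"""

create_dim_customer = """
    CREATE TABLE IF NOT EXISTS dim_customer (
        customer_key INT,
        name         VARCHAR(255),
        city         VARCHAR(255)
    );
"""
```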

There are a few variants: BI tools can query the main source directly, the DW can be split into department-specific marts, or each department can get its own mart while still maintaining integrity among common columns.

Cloud / AWS :

Doing this in the cloud provides a quicker start, elasticity, and scalability.

The general idea is to read from the sources, land the data in a staging S3 bucket, and then push it into the DW. For smaller tables it might be possible to stage directly on an EC2 instance instead of an S3 bucket.

The DW solution used here is Redshift, which is AWS managed. We could go self managed by installing something on an EC2 instance, but going AWS managed takes care of updates and also lets us do IaC (infrastructure as code) / DevOps. Redshift also uses columnar storage, which is faster for analytics, and it is massively parallel: a single query can run on multiple partitions across multiple nodes. In a relational DB, multiple queries can run on multiple nodes, but a single query can't run on multiple nodes.
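As a small IaC example, the cluster itself can be created from code with boto3 instead of the console. A minimal sketch, assuming boto3 is installed and AWS credentials are configured; the identifiers, password and role ARN below are placeholders:

```python
import boto3

# Sketch: create a Redshift cluster from code (infrastructure as code).
# All identifiers, the password and the IAM role ARN are placeholders.
redshift = boto3.client("redshift", region_name="us-west-2")

response = redshift.create_cluster(
    ClusterIdentifier="dwh-cluster",
    ClusterType="multi-node",
    NodeType="dc2.large",
    NumberOfNodes=4,
    DBName="dwh",
    MasterUsername="dwh_user",
    MasterUserPassword="Passw0rd-Placeholder",
    IamRoles=["arn:aws:iam::123456789012:role/dwhRole"],  # role with S3 read access
)
print(response["Cluster"]["ClusterStatus"])  # usually "creating" at first
```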

Project - Working my way backwards

Since I didn't pay close attention to the course, I'm going to jump right into the project and revisit the course for deep dives as needed.

Also, since I only have limited free credits on AWS, I'm going to think through the project right here and only go to implementation when I have a good idea of how to solve it.

Objective

Build an ETL pipeline that extracts data from S3, stages it in Redshift, and then transforms it into dimensional tables for the analytics team.

Problem

The song app currently keeps JSON logs of user activity and JSON song metadata in an S3 bucket. I have to stage that data in Redshift and then model it into a DW.
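Staging from S3 into Redshift is normally done with Redshift's COPY command. A minimal sketch in the sql_queries.py style; the bucket path, IAM role ARN and table name are placeholders, not the project's actual values:

```python
# Sketch: bulk-load JSON logs from S3 into a Redshift staging table.
# Bucket path, IAM role ARN and table name are placeholders.
staging_events_copy = """
    COPY staging_events
    FROM 's3://my-song-app-bucket/log_data'
    IAM_ROLE 'arn:aws:iam::123456789012:role/dwhRole'
    FORMAT AS JSON 'auto'
    REGION 'us-west-2';
"""
```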

  • create_tables.py is where you'll create your fact and dimension tables for the star schema in Redshift.
  • etl.py is where you'll load data from S3 into staging tables on Redshift and then process that data into your analytics tables on Redshift (a rough sketch of this step follows the list).
  • sql_queries.py is where you'll define your SQL statements, which will be imported into the two other files above.
  • README.md is where you'll provide discussion on your process and decisions for this ETL pipeline.
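For the staging-to-analytics part of etl.py, the usual pattern is INSERT INTO ... SELECT from the staging tables, driven by a small loop. A rough sketch under assumed table and column names (not the project's actual schema):

```python
# Sketch: transform staged rows into an analytics (dimension) table.
# Table and column names are assumptions for illustration only.
user_table_insert = """
    INSERT INTO dim_users (user_id, first_name, last_name, level)
    SELECT DISTINCT user_id, first_name, last_name, level
    FROM staging_events
    WHERE user_id IS NOT NULL;
"""

insert_table_queries = [user_table_insert]  # in practice, one query per analytics table

def insert_tables(cur, conn):
    """Run each INSERT ... SELECT against the warehouse connection."""
    for query in insert_table_queries:
        cur.execute(query)
        conn.commit()
```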
Project Steps

Create Table Schemas

  1. Design schemas for your fact and dimension tables
  2. Write a SQL CREATE statement for each of these tables in sql_queries.py
  3. Complete the logic in create_tables.py to connect to the database and create these tables (see the sketch after these steps)
  4. Write SQL DROP statements to drop tables in the beginning of create_tables.py if the tables already exist. This way, you can run create_tables.py whenever you want to reset your database and test your ETL pipeline.
  5. Launch a Redshift cluster and create an IAM role that has read access to S3.
  6. Add the Redshift database and IAM role info to dwh.cfg.
  7. Test by running create_tables.py and checking the table schemas in your Redshift database. You can use the Query Editor in the AWS Redshift console for this.
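Here's a rough sketch of how create_tables.py could be wired up. The [CLUSTER] section and its key names in dwh.cfg are my assumption, and the query lists would really come from sql_queries.py:

```python
import configparser
import psycopg2

# Sketch: connect to Redshift using values from dwh.cfg, drop existing
# tables, then recreate them. In the real project these query lists are
# imported from sql_queries.py; the table below is a placeholder.
drop_table_queries = ["DROP TABLE IF EXISTS staging_events;"]
create_table_queries = ["""
    CREATE TABLE IF NOT EXISTS staging_events (
        event_id BIGINT IDENTITY(0,1),
        user_id  INT,
        song     VARCHAR(255),
        ts       BIGINT
    );
"""]

def main():
    config = configparser.ConfigParser()
    config.read("dwh.cfg")

    # Assumed dwh.cfg layout: a [CLUSTER] section with these keys.
    cluster = config["CLUSTER"]
    conn = psycopg2.connect(
        host=cluster["HOST"],
        dbname=cluster["DB_NAME"],
        user=cluster["DB_USER"],
        password=cluster["DB_PASSWORD"],
        port=cluster["DB_PORT"],
    )
    cur = conn.cursor()

    # Drop first so the script can be re-run to reset the database.
    for query in drop_table_queries + create_table_queries:
        cur.execute(query)
        conn.commit()

    conn.close()

if __name__ == "__main__":
    main()
```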
