Covid-19 AWS Data Analytics Project
Covid-19 is a once-in-a-lifetime event: almost every human being on earth learned something new from this unprecedented situation and changed their perspective on life in one way or another. It has also become the most obvious data source for data enthusiasts everywhere.
I joined in by analyzing the COVID-19 datasets provided by the NYTimes and Johns Hopkins University.
Here is a dashboard snapshot from my project.
Let me begin by saying how thrilled I am to have finished this project. I am writing this blog in a way that will help anyone understand the requirements of the project as well as implement it themselves. I will explain each step I followed to build the project and launch it on AWS, and along the way I will also cover the new, trending AWS CDK paradigm.
Have you heard of AWS CDK? If not, you are missing a big item on your list. CDK has completely changed the DevOps world: now you can use the programming language of your choice to build your application infrastructure. Oh, I almost forgot, we are here to talk about my project and not CDK. I could write a lot about CDK, but not right now; there is more on CDK over here.
Project Overview and Design approach
Are you interested in learning a simple but powerful use case of Data Engineering and DevOps? Then this is going to be a great resource for you. The outline of the project can be understood from here.
After understanding the outline of the challenge, the first thing I did was write down the basic design approach that came to mind. First and foremost: what platform are you planning to implement this on? I chose the AWS cloud. Below is the final design of the project, thanks to the source. This architecture had already been implemented using AWS CloudFormation, but the source motivated me to implement it using AWS CDK, and that is what I did. Let me help you do the same.
Implementation approach
Just to be clear, I am assuming the reader is a novice in AWS CDK, because that is how I started. If you follow the steps below, you will at least be able to walk and talk with AWS CDK applications more comfortably. You are not only going to implement this project, but also see how anything can be learned by following a specific procedure. I am going to share my journey, which hopefully will help in yours.
Project implementation steps and how I learned them
Step 1: Set up your environment using the instructions at https://cdkworkshop.com. I followed this cdkworkshop project first to familiarize myself with the application framework and how the overall AWS CDK application structure works.
You can skip Step 1 if you already know the basics of the CDK environment.
I hope you have gained some confidence in AWS CDK after implementing the workshop project. Let's begin our project and understand the different components that are implemented using AWS CDK.
Step 2: Clone my GitHub repository
Once you have downloaded the project and are inside its base folder, you can initiate the virtual environment. This part is very similar to what you saw in the workshop.
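On macOS or Linux, the virtual-environment setup is just a couple of commands (a sketch, assuming the repo follows the standard Python CDK project layout with a `.venv` directory):

```shell
# Create a Python virtual environment in the project root
python3 -m venv .venv

# Activate it (macOS/Linux)
source .venv/bin/activate

# On Windows, activate with: .venv\Scripts\activate.bat
```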
Step 3: Install needed libraries for CDK app.
For this specific purpose, the CDK application uses the “requirements.txt” and “setup.py” files. You list the names of your required libraries in these files, and the application downloads them into the environment. How easy is that? For this project we used the libraries below:
Note that the requirements.txt and setup.py files are interconnected in the project. You can install these libraries into your environment using the command below:
pip install -r requirements.txt
Step 4: Understand the different components of the data pipeline
Refer to the respective files in the GitHub repo for this section:
The backend IaC is implemented in the “backend_stack.py” file.
- Job Scheduling: implemented through EventBridge rules. CDK EventBridge API Doc
- Extraction: implemented using a Lambda function (Python) to extract data directly from the source GitHub repository. (“extract.py”) CDK Lambda API Doc
- Transformation: the pandas module is used to perform data-processing tasks; a Lambda function implements this step. (“transform.py”)
- Load: data is loaded into a DynamoDB table as batch output. For the daily incremental load, an S3 bucket is used to compare against the previous day's run file. (“s3_operations.py” and “dynamodb_operations.py”) CDK DynamoDB and S3 API Doc
- Notification: the SNS notification service notifies the user after completion of the ETL job. (“sns_operations.py”) CDK SNS API Doc
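To make the transformation step concrete, here is a rough sketch of what a “transform.py” could do: merge the NYTimes case/death counts with the Johns Hopkins recovered counts on the date column. The column names, CSV shapes, and function name here are my assumptions for illustration, not necessarily the repo's actual code:

```python
import io

import pandas as pd


def transform(nyt_csv_text, jhu_csv_text):
    """Merge NYTimes US case/death data with Johns Hopkins recovered counts.

    Hypothetical sketch: real column names in the source CSVs may differ.
    """
    nyt = pd.read_csv(io.StringIO(nyt_csv_text), parse_dates=["date"])
    jhu = pd.read_csv(io.StringIO(jhu_csv_text), parse_dates=["Date"])

    # Johns Hopkins data is worldwide; keep only US rows and the columns we need
    jhu_us = jhu[jhu["Country/Region"] == "US"][["Date", "Recovered"]]
    jhu_us = jhu_us.rename(columns={"Date": "date", "Recovered": "recovered"})

    # Inner join drops dates that are present in only one of the datasets
    return nyt.merge(jhu_us, on="date", how="inner")
```

The merged frame is what the load step would then write to DynamoDB, one item per date.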
Step 5: Launch Backend stack of application
Run the CDK commands below in sequence to deploy the application:
pip install -r requirements.txt
cdk synth
cdk bootstrap
cdk deploy backendStack
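For orientation, here is a minimal sketch of how a backend stack like the one you just deployed could wire the scheduled Lambda, DynamoDB table, and SNS topic together. This is written in aws-cdk-lib (CDK v2) style with construct IDs, the table key, and asset paths that are my assumptions; the repo's “backend_stack.py” may differ:

```python
from aws_cdk import (
    Duration,
    Stack,
    aws_dynamodb as dynamodb,
    aws_events as events,
    aws_events_targets as targets,
    aws_lambda as _lambda,
    aws_sns as sns,
)
from constructs import Construct


class BackendStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Table keyed by date (assumed key schema)
        table = dynamodb.Table(
            self, "CovidTable",
            partition_key=dynamodb.Attribute(
                name="date", type=dynamodb.AttributeType.STRING
            ),
        )

        # ETL Lambda: extract, transform, load (assumed handler/asset names)
        etl_fn = _lambda.Function(
            self, "EtlFunction",
            runtime=_lambda.Runtime.PYTHON_3_9,
            handler="transform.handler",
            code=_lambda.Code.from_asset("lambda"),
            timeout=Duration.minutes(5),
        )
        etl_fn.add_environment("TABLE_NAME", table.table_name)
        table.grant_read_write_data(etl_fn)

        # SNS topic for job-completion notifications
        topic = sns.Topic(self, "EtlTopic")
        topic.grant_publish(etl_fn)
        etl_fn.add_environment("TOPIC_ARN", topic.topic_arn)

        # EventBridge rule triggers the ETL Lambda once a day
        rule = events.Rule(
            self, "DailyEtlRule",
            schedule=events.Schedule.rate(Duration.days(1)),
        )
        rule.add_target(targets.LambdaFunction(etl_fn))
```

Because this is plain Python, `cdk synth` simply executes the class and emits the equivalent CloudFormation template, which is the whole appeal of CDK over hand-written templates.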
Step 6: Build and launch the dashboard (frontend stack)
- First, go to the AWS console and copy the API Gateway link created by the backend stack into the “sep20cloudguruchallenge/S3_website/js/covid.js” file:
request.open('GET', 'https://wwtu45hna4.execute-api.us-west-2.amazonaws.com/prod', true)
- After saving the “covid.js” file, run the commands below in sequence to deploy the frontend application:
pip install -r requirements.txt
cdk synth
cdk bootstrap
cdk deploy frontendStack
Step 7: Visualize Interactive COVID-19 Dashboard
Go to the AWS console and fetch the URL of the S3 website-hosting bucket to view the dashboard. This loads the “index.html” page of our website, which hits the API Gateway URL to fetch details from the DynamoDB table.
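Behind that API Gateway URL is typically a small Lambda that scans the table and returns the items as JSON. Here is a hedged sketch of what such a handler could look like; the table name, CORS header, and helper name are my assumptions, not necessarily what the repo does:

```python
import json


def build_response(items, status=200):
    """Shape DynamoDB items into an API Gateway proxy response."""
    return {
        "statusCode": status,
        "headers": {
            # Allow the S3-hosted page to call the API from another origin
            "Access-Control-Allow-Origin": "*",
            "Content-Type": "application/json",
        },
        # default=str handles the Decimal values DynamoDB returns for numbers
        "body": json.dumps(items, default=str),
    }


def handler(event, context):
    import boto3  # imported here so the module loads without AWS dependencies

    table = boto3.resource("dynamodb").Table("covid_data")  # assumed table name
    return build_response(table.scan()["Items"])
```

The dashboard's JavaScript then parses the `body` field and renders the time series.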
Step 8: Clean up the environment to avoid further charges
Run the command below to destroy both the backend and frontend stacks. Remember: if there are any files left in your S3 buckets, they need to be deleted manually before running this command.
cdk destroy --all
Conclusion
I learned a lot while implementing this project, and I hope you can learn from and enjoy implementing it too using the provided resources. Please let me know about your experience.