Covid-19 AWS Data Analytics Project

Neeraj Somani
4 min read · Mar 24, 2021

Covid-19 is a once-in-a-lifetime situation: almost every human being on earth learned something new from this unprecedented event and changed their perspective on life in one way or another. It has also become the most obvious data source for the many data enthusiasts out there.

I, too, took part in analyzing the COVID-19 datasets provided by the NYTimes and Johns Hopkins University.

Here is a dashboard snapshot from my project.

Let me begin by saying how thrilled I am to have finished this project. I am writing this blog so that anyone can understand the project's requirements as well as implement it themselves. I will explain each step that I followed to build the project and launch it on AWS. Along the way, I will also cover the new, trending AWS-CDK paradigm.

Have you heard of AWS CDK? If not, you are missing a big item on your list. CDK has completely changed the DevOps world: now you can use the programming language of your choice to build your application infrastructure. But I almost forgot, we are here to talk about my project and not CDK. I could write a lot about CDK, but not right now; more on CDK over here.

Project Overview and Design approach

Are you interested in learning a simple but powerful use case spanning Data Engineering and DevOps? Then this is going to be a great resource for you. The outline of the project can be found here.

After understanding the outline of the challenge, the first thing I did was write down the basic design approach that came to mind. First and foremost: which platform are you planning to implement this on? I chose the AWS cloud. Below is the final design of the project; thanks to the source for the inspiration. This architecture was originally implemented using AWS CloudFormation, but the source motivated me to implement it using AWS-CDK, and that is what I did. Let me help you do the same.

Implementation approach

Just to be clear, I am treating the reader as a novice in AWS CDK, because that is how I started. If you follow the steps below, you will at least be able to walk and talk with AWS CDK applications more comfortably. You are not only going to implement this project but also see how anything can be learned by following a specific procedure. I am going to share my journey, which I hope will eventually help with yours.

Project implementation steps and how I learned them

Step 1: Set up your environment using the https://cdkworkshop.com instructions. I followed the cdkworkshop project first to familiarize myself with the application framework and how the overall AWS-CDK application structure works.

You can skip Step 1 if you already know the basics of the CDK environment.

I hope you have gained some confidence in AWS-CDK after implementing the workshop project. Let's begin our project and understand the different components that were implemented using AWS CDK.

Step 2: Clone my GitHub repository

Once you have downloaded the project and are inside its base folder, you can initialize the virtual environment. This part is very similar to what you saw in the workshop.
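For a standard CDK Python project, creating and activating the virtual environment usually looks like this (assuming a Unix-like shell; the exact steps may differ slightly on your system):

```shell
# Create a virtual environment in the project's base folder
python3 -m venv .venv

# Activate it (on Windows, use .venv\Scripts\activate instead)
. .venv/bin/activate
```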

Step 3: Install the required libraries for the CDK app.

For this specific purpose, the CDK application uses the requirements.txt and "setup.py" files. You list the names of your required libraries there, and pip downloads them into the environment. It is that easy. The libraries used for this project are listed in those two files in the repository.

You can install these libraries into your environment using the command below.

Note that the requirements.txt and setup.py files are interconnected in this project: requirements.txt typically installs the package defined by setup.py in editable mode, which pulls in the dependencies declared there.

pip install -r requirements.txt
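As a sketch of how the two files are usually wired together in a CDK Python template of that era (the package names below are illustrative; check the repository for the actual contents):

```python
# setup.py -- declares the app's dependencies (illustrative names)
import setuptools

setuptools.setup(
    name="covid19-data-analytics",   # hypothetical project name
    version="0.0.1",
    install_requires=[
        "aws-cdk.core",              # CDK v1-style packages, as used in 2021
        "aws-cdk.aws-lambda",
        "aws-cdk.aws-dynamodb",
        "aws-cdk.aws-s3",
        "aws-cdk.aws-events",
        "aws-cdk.aws-sns",
    ],
    python_requires=">=3.6",
)
```

requirements.txt then typically contains a single editable install line, `-e .`, so that `pip install -r requirements.txt` resolves everything declared in setup.py.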

Step 4: Understand the different components of the data pipeline

Refer to the respective files in the GitHub repo for this section:

The backend IaC is implemented in the "backend_stack.py" file.

  1. Job scheduling: implemented through EventBridge rules. CDK EventBridge API Doc
  2. Extraction: implemented as a Lambda function (Python) that extracts data directly from the source GitHub repository. ("extract.py") CDK Lambda API Doc
  3. Transformation: the pandas module performs the data-processing tasks, again inside a Lambda function. ("transform.py")
  4. Load: data is loaded into a DynamoDB table as batch output. For the daily incremental load, an S3 bucket stores the previous day's run file for comparison. ("s3_operations.py" and "dynamodb_operations.py") CDK DynamoDB and S3 API Docs
  5. Notification: an SNS notification is sent to the user after the ETL job completes. ("sns_operations.py") CDK SNS API Doc
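The incremental-load idea above can be sketched in plain Python. This is my illustration, not the repository's actual code: compare today's records against the previous day's run file and keep only the rows that are new or changed before writing them to DynamoDB.

```python
import json

def incremental_delta(previous_rows, current_rows, key="date"):
    """Return only the rows that are new or changed since the last run.

    previous_rows / current_rows: lists of dicts keyed by `key`
    (e.g. one row of case counts per date).
    """
    # Index yesterday's rows by their key for O(1) lookups
    previous_by_key = {row[key]: row for row in previous_rows}

    delta = []
    for row in current_rows:
        old = previous_by_key.get(row[key])
        if old != row:  # new key, or same key with updated counts
            delta.append(row)
    return delta

# Example: one revised row and one brand-new row survive the diff
yesterday = [
    {"date": "2021-03-22", "cases": 100, "deaths": 2},
    {"date": "2021-03-23", "cases": 150, "deaths": 3},
]
today = [
    {"date": "2021-03-22", "cases": 100, "deaths": 2},  # unchanged
    {"date": "2021-03-23", "cases": 155, "deaths": 3},  # revised upward
    {"date": "2021-03-24", "cases": 170, "deaths": 4},  # new day
]
print(json.dumps(incremental_delta(yesterday, today)))
```

Diffing against the previous run keeps the daily DynamoDB write small, since unchanged historical rows are skipped.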

Step 5: Launch Backend stack of application

Run the CDK commands below in sequence to deploy the application:

pip install -r requirements.txt
cdk synth
cdk bootstrap
cdk deploy backendStack

Step 6: Build and Launch Dashboard for frontend stack

  • In "covid.js", point the request at your own deployed API Gateway URL (the one below is from my deployment):

request.open('GET', 'https://wwtu45hna4.execute-api.us-west-2.amazonaws.com/prod', true)

  • After saving the "covid.js" file, run the commands below in sequence to deploy the frontend application:
pip install -r requirements.txt
cdk synth
cdk bootstrap
cdk deploy frontendStack

Step 7: Visualize Interactive COVID-19 Dashboard

Go to the AWS Console and fetch the URL of the S3 website-hosting bucket to view the dashboard. It serves the "index.html" page of our website, which hits the API Gateway URL, which in turn fetches the details from the DynamoDB table.
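DynamoDB's low-level API returns items in a typed attribute format, so an API backend usually flattens them into plain JSON before handing them to the dashboard. Here is a minimal stdlib-only sketch of that conversion (my illustration; the repository may instead use boto3's resource layer, which does this automatically):

```python
import json

def unmarshal_item(item):
    """Convert a low-level DynamoDB item ({"attr": {"S": "..."}}) to a plain dict."""
    plain = {}
    for name, typed_value in item.items():
        # Each attribute is a single-entry dict: {type_code: value}
        (dynamo_type, value), = typed_value.items()
        if dynamo_type == "N":  # numbers arrive as strings
            plain[name] = float(value) if "." in value else int(value)
        else:  # "S" strings (other type codes omitted for brevity)
            plain[name] = value
    return plain

# Example item in the shape DynamoDB's GetItem/Scan low-level API returns
raw = {"date": {"S": "2021-03-24"}, "cases": {"N": "170"}, "deaths": {"N": "4"}}
print(json.dumps(unmarshal_item(raw)))
```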

Step 8: Clean up the environment to avoid further charges

Run the command below to destroy both the backend and frontend stacks. Remember: if any of your S3 buckets contain files, those must be deleted manually before running the command.

cdk destroy --all

Conclusion

I learned a lot while implementing this project, and I hope you can also learn from and enjoy implementing it using the resources provided. Please let me know about your experience.


Neeraj Somani

Data Analytics Engineer, crossing paths in Data Science, Data Engineering and DevOps. Bringing you lots of exciting projects in the form of stories. Enjoy-Love.