As part of honing my Terraform skills, I developed a script to provision a highly available web server, designed to simulate a real-world infrastructure solution.
Architecture Overview
My architecture follows the recommended AWS best practices, including:
- Multi-region deployment - Primary in US West, Secondary in US East
- Auto Scaling Groups in each region to handle traffic fluctuations
- Application Load Balancers to distribute traffic
- Route 53 DNS failover for automatic cross-region recovery
- Health checks to detect and respond to failure
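To illustrate the multi-region part, here is a minimal sketch of how two regions can be targeted from one Terraform configuration using provider aliases. The concrete regions (us-west-2 and us-east-1), the alias name and the ./modules/web module path are assumptions for illustration only, not taken from my repository.

```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 5.0"
    }
  }
}

# Default provider: primary region (assumed here to be us-west-2).
provider "aws" {
  region = "us-west-2"
}

# Aliased provider: secondary region (assumed here to be us-east-1).
provider "aws" {
  alias  = "secondary"
  region = "us-east-1"
}

# Hypothetical module holding the ALB, ASG and launch template.
# The primary copy uses the default provider.
module "web_primary" {
  source = "./modules/web"
}

# The secondary copy is the same module, pointed at the aliased provider.
module "web_secondary" {
  source = "./modules/web"

  providers = {
    aws = aws.secondary
  }
}
```

Wrapping the web tier in a module is one way to keep both regions identical; the same effect can be had by setting provider = aws.secondary on individual resources.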
To reduce expenses, I omitted the database from this project. The diagram shows my complete setup minus the database.
How It Works
Here’s how the system handles different scenarios:
Normal Operation
- Users access the website, in this case web.khoah.net
- Route 53 routes traffic to the ALB in the primary region (US West)
- The ALB distributes requests to healthy instances in the ASG
- ASG maintains the desired number of instances based on load
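The normal-operation path above could be wired up roughly as follows. This is a simplified sketch: the resource names, sizes and health check settings are placeholders, and aws_vpc.main, aws_subnet.public_a, aws_subnet.public_b, aws_security_group.alb_sg and aws_launch_template.web are assumed to be defined elsewhere.

```hcl
resource "aws_lb" "web" {
  name               = "web-alb"
  load_balancer_type = "application"
  security_groups    = [aws_security_group.alb_sg.id]
  subnets            = [aws_subnet.public_a.id, aws_subnet.public_b.id]
}

resource "aws_lb_target_group" "web" {
  name     = "web-tg"
  port     = 80
  protocol = "HTTP"
  vpc_id   = aws_vpc.main.id

  # The ALB only forwards requests to targets that pass this check.
  health_check {
    path                = "/"
    interval            = 30
    healthy_threshold   = 2
    unhealthy_threshold = 2
  }
}

resource "aws_lb_listener" "http" {
  load_balancer_arn = aws_lb.web.arn
  port              = 80
  protocol          = "HTTP"

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.web.arn
  }
}

resource "aws_autoscaling_group" "web" {
  name                      = "web-asg"
  min_size                  = 1
  max_size                  = 4
  desired_capacity          = 2
  vpc_zone_identifier       = [aws_subnet.public_a.id, aws_subnet.public_b.id]
  target_group_arns         = [aws_lb_target_group.web.arn]

  # Use the ALB health check so unhealthy instances get replaced.
  health_check_type         = "ELB"
  health_check_grace_period = 120

  launch_template {
    id      = aws_launch_template.web.id
    version = "$Latest"
  }
}
```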
Primary Region Failure
- Health checks detect the primary region is unavailable
- A CloudWatch Alarm triggers a Lambda function to spin up the secondary region (US East). The alarm also triggers an SNS topic that sends me a notification
- Route 53 automatically routes traffic to the secondary region
- Users continue to access the application through the same URL
- When the primary region recovers, traffic fails back automatically, and the CloudWatch Alarm then shuts down the secondary region to save costs
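The DNS side of this failover can be sketched with a Route 53 health check and a pair of failover records. The names aws_route53_zone.main, aws_lb.primary and aws_lb.secondary are assumed to exist elsewhere, and the health check path and thresholds are illustrative values.

```hcl
# Probes the primary ALB over HTTP from the Route 53 health checkers.
resource "aws_route53_health_check" "primary" {
  fqdn              = aws_lb.primary.dns_name
  port              = 80
  type              = "HTTP"
  resource_path     = "/"
  failure_threshold = 3
  request_interval  = 30
}

# Served while the primary health check passes.
resource "aws_route53_record" "primary" {
  zone_id         = aws_route53_zone.main.zone_id
  name            = "web.khoah.net"
  type            = "A"
  set_identifier  = "primary"
  health_check_id = aws_route53_health_check.primary.id

  failover_routing_policy {
    type = "PRIMARY"
  }

  alias {
    name                   = aws_lb.primary.dns_name
    zone_id                = aws_lb.primary.zone_id
    evaluate_target_health = true
  }
}

# Served only when the primary record is considered unhealthy.
resource "aws_route53_record" "secondary" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "web.khoah.net"
  type           = "A"
  set_identifier = "secondary"

  failover_routing_policy {
    type = "SECONDARY"
  }

  alias {
    name                   = aws_lb.secondary.dns_name
    zone_id                = aws_lb.secondary.zone_id
    evaluate_target_health = true
  }
}
```

Because both records share the same name, users keep using the same URL; Route 53 decides which ALB answers based on the health check.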
Traffic Spike
- Increased load causes higher CPU utilization
- ASG scales out additional instances when CPU exceeds 50%
- ALB distributes traffic across all healthy instances
- As load decreases, ASG terminates excess instances
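One way to express the 50% CPU rule is a target tracking policy, sketched below. Treat the policy type as an assumption: the same behavior can also be built with step or simple scaling driven by CloudWatch alarms. aws_autoscaling_group.web is assumed to be defined elsewhere.

```hcl
resource "aws_autoscaling_policy" "cpu_target" {
  name                   = "cpu-target-50"
  autoscaling_group_name = aws_autoscaling_group.web.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      # Average CPU across all instances in the group.
      predefined_metric_type = "ASGAverageCPUUtilization"
    }

    # Add instances above ~50% average CPU, remove them when load drops.
    target_value = 50.0
  }
}
```

With target tracking, scale-out and scale-in are both handled by the one policy, so no separate scale-in rule is needed.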
Cost Optimization
The architecture is designed to be cost-effective:
- Auto-scaling ensures you only pay for what you need
- Multi-AZ deployment provides high availability with minimal resources
- CloudWatch Alarms and Lambda functions activate the secondary region only when the primary region is unhealthy
For the full setup, you can check out my code here.
Encountered Issues
Initially, I designed a system with two Auto Scaling groups running simultaneously in two regions. However, I realized this was not cost-effective, so I redesigned it and had to rewrite the Terraform code.
Also, while setting up auto scaling with Terraform, I spent hours troubleshooting why the Auto Scaling group wasn’t created, only to realize I had referenced the security group by name instead of by ID. The fix was security_groups = [aws_security_group.my_sg.id].
Moreover, “thanks to” a bug I hit when I enabled the IP address for the instance in the launch template in my Terraform code, I now remember that the network interface requires a security group assignment. If you create the launch template in the console, it automatically selects the default security group for you.
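A minimal sketch of a launch template that covers both security group pitfalls looks like this. It assumes that "enabling the IP address" means associate_public_ip_address, and the ami_id variable and instance type are hypothetical; aws_security_group.my_sg is assumed to be defined elsewhere.

```hcl
# Hypothetical input: the AMI to launch.
variable "ami_id" {
  type = string
}

resource "aws_launch_template" "web" {
  name_prefix   = "web-"
  image_id      = var.ami_id
  instance_type = "t3.micro"

  network_interfaces {
    # Assumption: this is the "IP address" setting mentioned above.
    associate_public_ip_address = true

    # Reference the security group by ID, not by name. Once a
    # network_interfaces block is present, the security group is
    # attached here rather than at the top level of the template.
    security_groups = [aws_security_group.my_sg.id]
  }
}
```

When a network_interfaces block is used, the top-level vpc_security_group_ids argument should be left out, since AWS rejects a request that specifies security groups in both places.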
I also hit an issue configuring a CloudWatch Alarm for a Lambda trigger. The Route 53 health check metric is only available in us-east-1, and my aws_cloudwatch_metric_alarm failed until I explicitly set the correct provider in the CloudWatch configuration.
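The provider fix can be sketched as follows. The alias use1, the alarm name, the thresholds, and the aws_sns_topic.failover and aws_route53_health_check.primary references are assumptions for illustration; routing the alarm through an SNS topic is also just one possible wiring.

```hcl
# Route 53 health check metrics are published in us-east-1,
# so the alarm has to be created through a provider for that region.
provider "aws" {
  alias  = "use1"
  region = "us-east-1"
}

resource "aws_cloudwatch_metric_alarm" "primary_unhealthy" {
  provider = aws.use1

  alarm_name  = "primary-region-unhealthy"
  namespace   = "AWS/Route53"
  metric_name = "HealthCheckStatus"

  # HealthCheckStatus is 1 when healthy and 0 when unhealthy.
  statistic           = "Minimum"
  comparison_operator = "LessThanThreshold"
  threshold           = 1
  period              = 60
  evaluation_periods  = 2

  dimensions = {
    HealthCheckId = aws_route53_health_check.primary.id
  }

  # Assumed SNS topic (also in us-east-1) that notifies me and
  # invokes the Lambda function on failure and on recovery.
  alarm_actions = [aws_sns_topic.failover.arn]
  ok_actions    = [aws_sns_topic.failover.arn]
}
```

The SNS topic used in the alarm actions has to live in the same region as the alarm, so it would need the same aliased provider.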
Improvements
In the real world, a web application needs many more pieces to perform well, but here are the improvements I would consider first:
- Multi-region active-active architecture: the same concept, but with the secondary region always up
- WAF integration for additional security
- CloudFront for global edge caching
- Private subnets for the application tier
- Enhanced monitoring with CloudWatch dashboards