Top Terraform Best Practices for AWS
AWS infrastructure drifts from your intended configuration the moment someone makes a manual change in the console. A developer adjusts a security group during debugging, an ops engineer scales an instance type during peak traffic, or someone adds a tag to help with cost tracking. None of these changes exist in your infrastructure-as-code, so the next Terraform apply either reverts them (causing confusion) or fails entirely because state doesn't match reality.
This article covers the Terraform patterns that prevent drift, reduce costs, and make AWS infrastructure actually manageable across teams. You'll learn proper state management strategies that prevent state file corruption, module organization that enables reuse without creating tight coupling, and the specific AWS provider configurations that avoid hitting API rate limits during large deployments. These aren't theoretical recommendations—they're patterns that emerge after managing dozens of AWS accounts with hundreds of resources each.
We'll cover remote state configuration, workspace strategies, module versioning, AWS tagging conventions, security group patterns, and the IAM configurations that let Terraform manage permissions safely. Every recommendation includes the specific failure mode it prevents.
State Management: Remote Backend Configuration
Terraform state files contain the complete mapping between your configuration files and actual AWS resources. Losing state means losing Terraform's knowledge of what it created, effectively orphaning resources. Storing state locally works for solo experimentation but breaks immediately when multiple people need to run Terraform or when CI/CD pipelines deploy infrastructure.
Remote state stored in S3 with DynamoDB locking solves both problems: all team members and CI jobs share one authoritative state, and DynamoDB prevents concurrent modifications that would corrupt state. This is the foundational pattern every production Terraform setup needs.
S3 Backend Configuration
# terraform/backend.tf
terraform {
backend "s3" {
bucket = "myorg-terraform-state"
key = "production/vpc/terraform.tfstate"
region = "us-east-1"
encrypt = true
dynamodb_table = "terraform-state-lock"
# Versioning is not a backend argument; enable it on the bucket itself
# (see aws_s3_bucket_versioning in the bootstrap configuration below)
}
}
# Create S3 bucket and DynamoDB table (run once)
# terraform/bootstrap/main.tf
resource "aws_s3_bucket" "terraform_state" {
bucket = "myorg-terraform-state"
lifecycle {
prevent_destroy = true
}
}
resource "aws_s3_bucket_versioning" "terraform_state" {
bucket = aws_s3_bucket.terraform_state.id
versioning_configuration {
status = "Enabled"
}
}
resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
bucket = aws_s3_bucket.terraform_state.id
rule {
apply_server_side_encryption_by_default {
sse_algorithm = "AES256"
}
}
}
resource "aws_s3_bucket_public_access_block" "terraform_state" {
bucket = aws_s3_bucket.terraform_state.id
block_public_acls = true
block_public_policy = true
ignore_public_acls = true
restrict_public_buckets = true
}
resource "aws_dynamodb_table" "terraform_lock" {
name = "terraform-state-lock"
billing_mode = "PAY_PER_REQUEST"
hash_key = "LockID"
attribute {
name = "LockID"
type = "S"
}
lifecycle {
prevent_destroy = true
}
}
The S3 bucket stores state files with versioning enabled, allowing you to recover from corrupted state by reverting to a previous version. Encryption ensures state contents (which include sensitive values like database passwords) stay encrypted at rest. DynamoDB's PAY_PER_REQUEST billing means you only pay for lock operations during Terraform runs, typically under $1/month even for busy teams.
State File Organization
The S3 key path (production/vpc/terraform.tfstate) determines state file location. Organize by environment and resource grouping to prevent state file bloat:
terraform-state/
├── production/
│ ├── vpc/terraform.tfstate
│ ├── compute/terraform.tfstate
│ ├── database/terraform.tfstate
│ └── monitoring/terraform.tfstate
├── staging/
│ ├── vpc/terraform.tfstate
│ └── compute/terraform.tfstate
└── development/
└── all/terraform.tfstate
Smaller state files mean faster Terraform operations and a smaller blast radius when applying changes. Separating VPC state from compute state means deploying application changes doesn't require refreshing network infrastructure, which on large infrastructures can cut plan time substantially (for example, from about a minute to around ten seconds).
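Splitting state this way means the compute configuration can no longer reference VPC resources directly. One common way to bridge the gap is the terraform_remote_state data source. The sketch below assumes the VPC configuration exposes a root-level output named public_subnet_ids and reuses the bucket and key from the backend example; the instance itself is only illustrative.

```hcl
# compute configuration: read outputs published by the VPC state
data "terraform_remote_state" "vpc" {
  backend = "s3"

  config = {
    bucket = "myorg-terraform-state"
    key    = "production/vpc/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_instance" "app" {
  ami           = "ami-12345678" # placeholder AMI ID
  instance_type = "t3.small"

  # Assumes the VPC configuration declares a root output "public_subnet_ids"
  subnet_id = data.terraform_remote_state.vpc.outputs.public_subnet_ids[0]
}
```

Only root-level outputs are visible this way, so the VPC configuration has to re-export any module outputs that other stacks depend on.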
Module Organization and Versioning
Terraform modules enable reuse, but poorly designed modules create more problems than they solve. A module that tries to handle too many use cases becomes a maze of conditional logic. A module without versioning means every change potentially breaks all consumers. The sweet spot is focused modules with clear interfaces and semantic versioning.
Module Structure Pattern
# modules/vpc/main.tf
variable "vpc_cidr" {
type = string
description = "CIDR block for VPC"
}
variable "availability_zones" {
type = list(string)
description = "AZs to create subnets in"
}
variable "enable_nat_gateway" {
type = bool
default = true
description = "Enable NAT gateway for private subnets"
}
variable "tags" {
type = map(string)
default = {}
description = "Additional tags for all resources"
}
locals {
common_tags = merge(
{
ManagedBy = "terraform"
Module = "vpc"
},
var.tags
)
}
resource "aws_vpc" "main" {
cidr_block = var.vpc_cidr
enable_dns_hostnames = true
enable_dns_support = true
tags = merge(
local.common_tags,
{
# lookup() avoids a plan-time error when the caller omits the Environment tag
Name = "${lookup(var.tags, "Environment", "default")}-vpc"
}
)
}
# Create public subnets
resource "aws_subnet" "public" {
count = length(var.availability_zones)
vpc_id = aws_vpc.main.id
cidr_block = cidrsubnet(var.vpc_cidr, 8, count.index)
availability_zone = var.availability_zones[count.index]
map_public_ip_on_launch = true
tags = merge(
local.common_tags,
{
Name = "${lookup(var.tags, "Environment", "default")}-public-${var.availability_zones[count.index]}"
Type = "public"
}
)
}
# modules/vpc/outputs.tf
output "vpc_id" {
value = aws_vpc.main.id
description = "ID of the created VPC"
}
output "public_subnet_ids" {
value = aws_subnet.public[*].id
description = "List of public subnet IDs"
}
output "vpc_cidr" {
value = aws_vpc.main.cidr_block
description = "CIDR block of the VPC"
}
This module has a single responsibility: creating a VPC with subnets. It exposes only the configuration options that users actually need to vary (CIDR blocks, AZs, NAT gateway toggle) while handling complex details internally (CIDR subnet calculation, tagging patterns, DNS settings). Clear input variables and outputs make the module's contract explicit.
Module Versioning with Git Tags
# modules repository: github.com/yourorg/terraform-aws-modules
# Tag releases with semantic versioning
git tag -a v1.0.0 -m "Initial VPC module release"
git push origin v1.0.0
# Consuming the module with version pinning
module "vpc" {
source = "git::https://github.com/yourorg/terraform-aws-modules.git//vpc?ref=v1.0.0"
vpc_cidr = "10.0.0.0/16"
availability_zones = ["us-east-1a", "us-east-1b"]
tags = {
Environment = "production"
Team = "platform"
}
}
Version pinning via git tags prevents breaking changes from automatically flowing into production. When you improve the VPC module, tag the change as v1.1.0, but production infrastructure continues using v1.0.0 until you explicitly update the ref. This gives you control over when changes propagate and time to test them in staging first.
| Module Source | Use Case | Version Control |
|---|---|---|
| ./modules/vpc | Local development, rapid iteration | No versioning, changes immediate |
| git::...?ref=main | Testing latest changes | No version lock, always latest |
| git::...?ref=v1.0.0 | Production infrastructure | Explicit version, stable |
| registry.terraform.io | Public modules, official providers | Semantic versioning, constraints |
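For the registry row in the table, pinning works through the version argument rather than a git ref. The following is a minimal sketch using the public terraform-aws-modules/vpc/aws module; the constraint, name, and CIDR values are illustrative assumptions, not recommendations from this article.

```hcl
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0" # accept 5.x releases, refuse the next major version

  name = "production-vpc" # illustrative values
  cidr = "10.0.0.0/16"

  azs            = ["us-east-1a", "us-east-1b"]
  public_subnets = ["10.0.0.0/24", "10.0.1.0/24"]

  tags = {
    Environment = "production"
    Team        = "platform"
  }
}
```

A pessimistic constraint like this lets minor and patch releases flow in on terraform init -upgrade while still blocking a breaking major release.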
Resource Tagging Strategy
AWS tags enable cost allocation, resource organization, and automated policy enforcement. But tagging only works if applied consistently across all resources. Terraform's ability to apply tags at the provider level ensures every resource gets baseline tags automatically, eliminating the most common cause of incomplete tagging: forgetting to add them.
Provider-Level Default Tags
# providers.tf
provider "aws" {
region = "us-east-1"
default_tags {
tags = {
ManagedBy = "terraform"
Environment = var.environment
Team = var.team_name
CostCenter = var.cost_center
Project = var.project_name
}
}
}
# All resources automatically receive these tags
resource "aws_instance" "web" {
ami = "ami-12345678"
instance_type = "t3.micro"
# Resource-specific tags merged with default tags
tags = {
Name = "web-server-01"
Role = "webserver"
}
}
# Resulting tags on the instance:
# ManagedBy: terraform
# Environment: production
# Team: platform
# CostCenter: engineering
# Project: main-app
# Name: web-server-01
# Role: webserver
Default tags apply automatically to every taggable resource created by this provider instance. EC2 instances, RDS databases, and S3 buckets get tagged with environment, team, and cost center information without requiring developers to remember tag requirements. Resource-specific tags merge with default tags (the resource-level value wins when keys collide), allowing you to add context (Name, Role) while maintaining baseline compliance. One known gap: instances launched by an Auto Scaling group do not inherit default_tags, so tag those through the group's own tag configuration.
Tag-Based Cost Allocation
Activate these tag keys as cost allocation tags in the AWS Billing and Cost Management console, then use Cost Explorer to break down billing by environment, team, or project. Newly activated tags can take up to 24 hours to appear in reports. With consistent tagging, you can answer questions like "how much does the staging environment cost?" or "what's our database spend by team?" These insights are impossible without tags or require manual cost categorization that inevitably becomes outdated.
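Tag activation can also be kept in Terraform. The sketch below uses the AWS provider's aws_ce_cost_allocation_tag resource and assumes it runs with credentials for the account that owns billing (the management account in an AWS Organization) and that the tag keys have already been seen on at least one resource. The list of keys mirrors the default tags shown earlier.

```hcl
locals {
  # Tag keys to expose in cost reports; mirrors the provider default_tags
  cost_allocation_tag_keys = ["Environment", "Team", "CostCenter", "Project"]
}

resource "aws_ce_cost_allocation_tag" "baseline" {
  for_each = toset(local.cost_allocation_tag_keys)

  tag_key = each.value
  status  = "Active"
}
```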
Security Group Management
Security groups accumulate rules over time, eventually becoming complex enough that nobody understands what traffic they actually permit. The pattern that prevents this is treating security groups as narrow-purpose resources rather than catch-all rule collections. Create focused security groups for specific roles (web server, database, internal service) and compose them rather than adding every rule to one massive group.
Role-Based Security Group Pattern
# security-groups.tf
# Base security group for all instances
resource "aws_security_group" "base" {
name_prefix = "${var.environment}-base-"
vpc_id = module.vpc.vpc_id
description = "Base rules for all instances"
# Allow all egress (instances can reach internet)
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
lifecycle {
create_before_destroy = true
}
}
# Web server security group
resource "aws_security_group" "web" {
name_prefix = "${var.environment}-web-"
vpc_id = module.vpc.vpc_id
description = "HTTP/HTTPS access for web servers"
ingress {
description = "HTTPS from ALB"
from_port = 443
to_port = 443
protocol = "tcp"
security_groups = [aws_security_group.alb.id]
}
ingress {
description = "HTTP from ALB"
from_port = 80
to_port = 80
protocol = "tcp"
security_groups = [aws_security_group.alb.id]
}
}
# Database security group
resource "aws_security_group" "database" {
name_prefix = "${var.environment}-database-"
vpc_id = module.vpc.vpc_id
description = "PostgreSQL access from app servers"
ingress {
description = "PostgreSQL from app servers"
from_port = 5432
to_port = 5432
protocol = "tcp"
security_groups = [aws_security_group.app.id]
}
}
# Application instance with multiple security groups
resource "aws_instance" "app" {
ami = "ami-12345678"
instance_type = "t3.small"
vpc_security_group_ids = [
aws_security_group.base.id,
aws_security_group.web.id,
]
}
This pattern creates small, single-purpose security groups that compose together. The base group handles egress (same for all instances), the web group handles HTTP/HTTPS ingress, and instances receive both. When you need to add SSH access for debugging, create an ssh security group and attach it to specific instances rather than adding SSH rules to the web group (where they'd apply to all web servers).
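The SSH case described above could be sketched as a separate group like the following. The aws_security_group.bastion reference is hypothetical (it is not defined in this article), and the name prefix simply follows the pattern of the other groups.

```hcl
# Narrow-purpose group: attach only to instances that need debugging access
resource "aws_security_group" "ssh" {
  name_prefix = "${var.environment}-ssh-"
  vpc_id      = module.vpc.vpc_id
  description = "SSH access for debugging"

  ingress {
    description     = "SSH from bastion hosts"
    from_port       = 22
    to_port         = 22
    protocol        = "tcp"
    security_groups = [aws_security_group.bastion.id] # hypothetical bastion group
  }

  lifecycle {
    create_before_destroy = true
  }
}
```

Removing debugging access later is then a matter of detaching one group from one instance, with no edits to the web group's rules.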
Avoiding CIDR Hardcoding
Hardcoded CIDR blocks in security group rules create brittle infrastructure. When network topology changes, you're searching through Terraform files for every reference to the old CIDR. Reference security groups and VPC CIDR outputs instead:
# Bad: Hardcoded CIDR
ingress {
from_port = 5432
to_port = 5432
protocol = "tcp"
cidr_blocks = ["10.0.1.0/24"] # What if subnet CIDR changes?
}
# Good: Reference security group
ingress {
from_port = 5432
to_port = 5432
protocol = "tcp"
security_groups = [aws_security_group.app.id] # Follows instance
}
# Good: Reference the VPC CIDR output for intra-VPC access (for example, SSH from a bastion)
ingress {
from_port = 22
to_port = 22
protocol = "tcp"
cidr_blocks = [module.vpc.vpc_cidr] # All VPC traffic
}
IAM Role and Policy Management
IAM policies control what AWS resources can do. The common mistake is granting overly broad permissions ("it wasn't working so I just added * to everything") or creating deeply nested policy documents that become impossible to audit. The solution is separating policy documents from role assignments and using managed policies for common patterns.
IAM Role with Attached Policies
# iam.tf
# Trust policy (who can assume this role)
data "aws_iam_policy_document" "ec2_assume_role" {
statement {
actions = ["sts:AssumeRole"]
principals {
type = "Service"
identifiers = ["ec2.amazonaws.com"]
}
}
}
# Permission policy (what the role can do)
data "aws_iam_policy_document" "app_permissions" {
statement {
sid = "S3BucketAccess"
effect = "Allow"
actions = [
"s3:GetObject",
"s3:PutObject",
]
resources = [
"${aws_s3_bucket.app_data.arn}/*",
]
}
statement {
sid = "SecretsManagerAccess"
effect = "Allow"
actions = [
"secretsmanager:GetSecretValue",
]
resources = [
aws_secretsmanager_secret.database_password.arn,
]
}
}
# Create the role
resource "aws_iam_role" "app" {
name = "${var.environment}-app-role"
assume_role_policy = data.aws_iam_policy_document.ec2_assume_role.json
}
# Create custom policy
resource "aws_iam_policy" "app" {
name = "${var.environment}-app-policy"
policy = data.aws_iam_policy_document.app_permissions.json
}
# Attach custom policy to role
resource "aws_iam_role_policy_attachment" "app_custom" {
role = aws_iam_role.app.name
policy_arn = aws_iam_policy.app.arn
}
# Attach AWS managed policy
resource "aws_iam_role_policy_attachment" "app_ssm" {
role = aws_iam_role.app.name
policy_arn = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
}
# Create instance profile
resource "aws_iam_instance_profile" "app" {
name = "${var.environment}-app-profile"
role = aws_iam_role.app.name
}
This structure separates trust policy (who can assume) from permission policy (what they can do), making both easier to understand and modify. Using aws_iam_policy_document data sources generates valid JSON policy documents from HCL, eliminating JSON syntax errors and making policies more readable. Attaching AWS managed policies (like AmazonSSMManagedInstanceCore) leverages AWS's pre-built, maintained policies for common use cases.
Least Privilege Principle
Grant only the specific permissions resources need. If your application reads from one S3 bucket, grant GetObject on that bucket—not all S3 buckets, and not s3:* actions. Start with minimal permissions and expand when you encounter permission errors, rather than starting with broad permissions and trying to narrow them later.
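As a concrete illustration of the single-bucket case, here is a minimal read-only policy document. The aws_s3_bucket.reports resource is a hypothetical name introduced for this sketch. Note that object-level and bucket-level actions need different resource ARNs.

```hcl
data "aws_iam_policy_document" "reports_read_only" {
  # Object reads apply to keys inside the bucket
  statement {
    sid       = "ReadReportObjects"
    effect    = "Allow"
    actions   = ["s3:GetObject"]
    resources = ["${aws_s3_bucket.reports.arn}/*"]
  }

  # Listing applies to the bucket itself; include only if the app lists keys
  statement {
    sid       = "ListReportBucket"
    effect    = "Allow"
    actions   = ["s3:ListBucket"]
    resources = [aws_s3_bucket.reports.arn]
  }
}
```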
Variable and Secret Management
Terraform variables parameterize configurations, but where those values come from determines security and usability. Hardcoding database passwords in .tfvars files and committing them to git is the most common security mistake. The solution is separating sensitive and non-sensitive variables with appropriate storage for each.
Variable Organization
# variables.tf (defines available variables)
variable "environment" {
type = string
description = "Environment name (production, staging, development)"
validation {
condition = contains(["production", "staging", "development"], var.environment)
error_message = "Environment must be production, staging, or development."
}
}
variable "instance_type" {
type = string
description = "EC2 instance type for application servers"
default = "t3.small"
}
variable "database_password" {
type = string
description = "PostgreSQL master password"
sensitive = true
}
# terraform.tfvars (committed to git)
environment = "production"
instance_type = "t3.medium"
# secrets.auto.tfvars (NOT committed to git, in .gitignore)
database_password = "actual-secure-password"
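The sensitive = true flag prevents Terraform from printing the variable value in plan and apply output. It only affects CLI output: the value is still written to the state file in plaintext, which is another reason to encrypt the state bucket and restrict access to it. Storing the actual password in secrets.auto.tfvars and adding it to .gitignore keeps it out of version control. The .auto.tfvars suffix means Terraform automatically loads it without explicit -var-file flags.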
AWS Secrets Manager Integration
For production environments, store secrets in AWS Secrets Manager and reference them in Terraform rather than storing them in .tfvars files at all:
# Create secret in Secrets Manager (one-time setup)
resource "aws_secretsmanager_secret" "db_password" {
name = "${var.environment}/database/master-password"
recovery_window_in_days = 7
}
resource "aws_secretsmanager_secret_version" "db_password" {
secret_id = aws_secretsmanager_secret.db_password.id
secret_string = var.database_password
}
# Reference secret in RDS instance
data "aws_secretsmanager_secret_version" "db_password" {
# Reference the version resource so the read happens only after a value exists
secret_id = aws_secretsmanager_secret_version.db_password.secret_id
}
resource "aws_db_instance" "main" {
identifier = "${var.environment}-postgres"
engine = "postgres"
instance_class = "db.t3.small"
allocated_storage = 20
username = "dbadmin"
password = data.aws_secretsmanager_secret_version.db_password.secret_string
skip_final_snapshot = true
}
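This pattern centralizes secret storage in Secrets Manager, where you get audit logging, fine-grained access control, and optional automatic rotation. Applications retrieve secrets at runtime via the AWS SDK, never requiring plaintext secrets in environment variables or configuration files. Two caveats: in this example the initial value still enters through var.database_password, and any secret Terraform reads or writes is also recorded in the state file, so access to state must be controlled as tightly as access to the secret itself.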
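If the goal is to keep the password out of Terraform variables entirely, one option is to let RDS generate and store the master password itself. The sketch below assumes a recent AWS provider version that supports the manage_master_user_password argument on aws_db_instance; the resource and output names are illustrative.

```hcl
resource "aws_db_instance" "managed_secret" {
  identifier        = "${var.environment}-postgres"
  engine            = "postgres"
  instance_class    = "db.t3.small"
  allocated_storage = 20
  username          = "dbadmin"

  # RDS creates the password and stores it in Secrets Manager;
  # this argument cannot be combined with "password"
  manage_master_user_password = true

  skip_final_snapshot = true # acceptable for a demo, not for production data
}

# ARN of the generated secret, for granting applications read access
output "db_master_secret_arn" {
  value = aws_db_instance.managed_secret.master_user_secret[0].secret_arn
}
```

With this approach the password never appears in a variable, a .tfvars file, or the Terraform state; only the secret's ARN does.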
Environment Parity with Workspaces
Managing multiple environments (dev, staging, production) requires balancing code reuse with environment-specific configuration. Terraform workspaces provide one solution: same code, different state files, with variables determining environment-specific values.
Workspace-Based Environment Pattern
# main.tf uses workspace name for environment selection
locals {
environment = terraform.workspace
# Environment-specific configuration
config = {
production = {
instance_type = "t3.large"
instance_count = 3
db_instance_class = "db.t3.large"
enable_deletion_protection = true
}
staging = {
instance_type = "t3.small"
instance_count = 1
db_instance_class = "db.t3.small"
enable_deletion_protection = false
}
development = {
instance_type = "t3.micro"
instance_count = 1
db_instance_class = "db.t3.micro"
enable_deletion_protection = false
}
}
current_config = local.config[local.environment]
}
resource "aws_instance" "app" {
count = local.current_config.instance_count
ami = data.aws_ami.amazon_linux_2.id
instance_type = local.current_config.instance_type
tags = {
Name = "${local.environment}-app-${count.index + 1}"
}
}
# Using workspaces
# terraform workspace new production
# terraform workspace select production
# terraform apply # Uses production config
# terraform workspace select staging
# terraform apply # Uses staging config
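Workspaces share the same Terraform code but maintain separate state files. The workspace name determines which configuration block to use, automatically adjusting instance sizes, counts, and protection settings per environment. This eliminates code duplication, but the isolation is limited to state: all workspaces use the same backend and, by default, the same credentials. Note also that the lookup above fails in the default workspace because local.config has no matching key, so select a named workspace before running plan or apply.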
When Not to Use Workspaces
Workspaces work well when environments differ only in scale and configuration, not in topology. If production has a multi-region setup with disaster recovery but staging is single-region, workspaces become complex with too many conditionals. In that case, separate directory structures for each environment work better:
terraform/
├── modules/
│ ├── vpc/
│ ├── compute/
│ └── database/
├── environments/
│ ├── production/
│ │ ├── main.tf
│ │ ├── terraform.tfvars
│ │ └── backend.tf
│ └── staging/
│ ├── main.tf
│ ├── terraform.tfvars
│ └── backend.tf
This structure duplicates some code but gives complete independence—production and staging can evolve separately without conditional logic cluttering modules.
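In this layout each environment's main.tf is mostly module calls with environment-specific arguments. A minimal sketch of the production entry point follows, assuming the modules/vpc interface shown earlier; the CIDR and AZ values are illustrative.

```hcl
# environments/production/main.tf
module "vpc" {
  source = "../../modules/vpc" # relative path to the shared module

  vpc_cidr           = "10.0.0.0/16"
  availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
  enable_nat_gateway = true

  tags = {
    Environment = "production"
    Team        = "platform"
  }
}
```

The staging directory would contain its own copy with different arguments and its own backend key, so a change to one environment never requires touching the other's files.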
Handling AWS API Rate Limits
Large Terraform applies that create or modify hundreds of resources can hit AWS API rate limits, causing Terraform operations to fail with throttling errors. The AWS provider includes retry logic, but you can optimize apply speed by parallelizing carefully and ordering resource creation to avoid unnecessary API calls.
Provider Concurrency Configuration
provider "aws" {
region = "us-east-1"
# Retry throttled or failed API calls up to 10 times before giving up
max_retries = 10
# Adaptive mode adds client-side rate limiting on top of standard retries
retry_mode = "adaptive"
}
# Reduce parallelism for large applies
# terraform apply -parallelism=5
# Default is 10, which can overwhelm APIs
# 5 reduces concurrent requests while maintaining reasonable speed
The adaptive retry mode adds client-side rate limiting on top of the standard mode's exponential backoff with jitter, so the provider slows its own request rate when AWS starts throttling. Lowering parallelism from the default 10 to 5 halves the number of resource operations Terraform runs concurrently, which helps when managing large environments (100+ resources) where hitting rate limits is common.
Resource Dependencies
Terraform automatically infers dependencies from resource references, but sometimes you need explicit depends_on to ensure proper ordering:
# Terraform infers the dependency on the instance profile from the reference below,
# but it cannot see that the instance also needs the policy attachment to exist
resource "aws_instance" "app" {
ami = "ami-12345678"
instance_type = "t3.small"
iam_instance_profile = aws_iam_instance_profile.app.name
depends_on = [
aws_iam_role_policy_attachment.app_custom,
]
}
# Ensure IAM policy is attached before launching instance
# Without depends_on, instance might launch before policy attachment completes
Cost Optimization Patterns
Terraform manages infrastructure, but the choices you make in Terraform directly impact AWS costs. Small decisions—instance types, EBS volume configurations, NAT gateway strategies—compound into significant monthly charges. These patterns reduce costs without sacrificing reliability.
Right-Sizing Resources
# Use lifecycle configuration for stateful resources
resource "aws_db_instance" "main" {
identifier = "production-db"
engine = "postgres"
instance_class = "db.t3.small"
# Prevent accidental deletion
deletion_protection = true
lifecycle {
prevent_destroy = true
ignore_changes = [
# Ignore manual parameter changes
parameter_group_name,
]
}
}
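# Use spot instances for non-critical workloads
# (the provider docs discourage this legacy resource for new code and suggest
# aws_instance with instance_market_options instead)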
resource "aws_spot_instance_request" "worker" {
ami = data.aws_ami.amazon_linux_2.id
instance_type = "t3.medium"
spot_price = "0.02" # Max willing to pay
spot_type = "persistent"
wait_for_fulfillment = true
tags = {
Name = "spot-worker"
}
}
# Cost-optimized NAT gateway strategy
# Single NAT gateway for dev/staging (saves roughly $33/month per gateway removed)
resource "aws_nat_gateway" "main" {
count = var.environment == "production" ? length(var.availability_zones) : 1
allocation_id = aws_eip.nat[count.index].id
subnet_id = aws_subnet.public[count.index].id
}
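The NAT gateway pattern uses one NAT gateway per AZ in production (for high availability) but only one NAT gateway total in staging and development. Each gateway removed saves roughly $33/month in hourly charges (at $0.045/hour in us-east-1, before data processing fees), so collapsing three AZs to one saves about $66/month per environment. Spot instances typically cost 60-90% less than on-demand for workloads that tolerate interruption (batch processing, CI runners, development instances).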
| Resource | Cost Optimization | Monthly Savings |
|---|---|---|
| NAT Gateway | Single NAT for non-prod environments | ~$33 per gateway removed (us-east-1 hourly charge) |
| EC2 Instances | Use spot instances for batch workloads | 60-90% of on-demand cost |
| EBS Volumes | Use gp3 instead of gp2 | ~20% for same performance |
| RDS Instances | Stop non-prod databases off-hours | ~67% of instance charges for 16h/day shutdown (storage still billed) |
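The gp3 row in the table is usually a one-line change. A minimal sketch for an instance root volume follows; the resource name and volume size are illustrative assumptions.

```hcl
resource "aws_instance" "gp3_example" {
  ami           = "ami-12345678" # placeholder AMI ID
  instance_type = "t3.small"

  root_block_device {
    volume_type = "gp3" # baseline IOPS and throughput are included in the price
    volume_size = 30    # GiB
  }
}
```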
Testing Terraform Changes
Applying Terraform changes directly to production without testing is how infrastructure incidents happen. The safe pattern is terraform plan followed by review, but plan output can be overwhelming for large changes. Tools like terraform validate and terraform fmt catch basic errors before plan even runs.
Pre-Apply Validation
# Validate configuration syntax
terraform validate
# Format code consistently
terraform fmt -recursive
# Generate and review plan
terraform plan -out=tfplan
# Show plan in human-readable format
terraform show tfplan
# If plan looks correct, apply it
terraform apply tfplan
# For sensitive changes, use -target to limit scope
terraform plan -target=aws_security_group.database
terraform apply -target=aws_security_group.database
Saving the plan with -out=tfplan and applying that exact plan ensures what you reviewed is what executes. Without this, someone might commit changes between plan and apply, causing the apply to diverge from what you reviewed. The -target flag limits operations to specific resources, which is useful when debugging or recovering from a failed apply. Treat it as an exception rather than a routine workflow, because a targeted run ignores the rest of the configuration and can hide drift elsewhere.
FAQ
Should I use Terraform workspaces or separate directories for environments?
Use workspaces when environments differ only in scale and configuration (same topology, different instance sizes). Use separate directories when environments have different architectures (production is multi-region with DR, staging is single-region). Workspaces reduce code duplication but add conditional complexity; directories duplicate code but maintain independence.
How do I handle secrets in Terraform?
Never commit secrets to version control. For local development, use secrets.auto.tfvars files that are gitignored. For production, store secrets in AWS Secrets Manager or Parameter Store and reference them via data sources. Mark variables as sensitive = true to prevent Terraform from printing them in output.
What's the best way to organize Terraform files?
Start with logical separation: main.tf (primary resources), variables.tf (input variables), outputs.tf (output values), providers.tf (provider configuration), and backend.tf (state configuration). As files grow beyond 200-300 lines, split by resource type (compute.tf, networking.tf, database.tf) or domain (frontend.tf, api.tf).
How do I prevent accidental resource deletion?
Use lifecycle { prevent_destroy = true } on critical resources like databases and state buckets. Enable deletion protection on RDS instances and enable termination protection on critical EC2 instances. For production environments, require manual approval before terraform apply runs in CI/CD.
Should I use count or for_each for creating multiple resources?
Use for_each when resources are identified by a meaningful key (names, IDs). Use count when resources are identical and order doesn't matter. for_each is safer because removing an item from the middle doesn't cause Terraform to recreate all subsequent resources, while count would reindex everything.
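A small sketch of the for_each form, keyed by availability zone. The variable name and CIDR values are assumptions for illustration, and aws_vpc.main refers to the VPC resource from the module example.

```hcl
variable "private_subnet_cidrs" {
  type        = map(string)
  description = "Private subnet CIDR per availability zone"
  default = {
    "us-east-1a" = "10.0.10.0/24"
    "us-east-1b" = "10.0.11.0/24"
    "us-east-1c" = "10.0.12.0/24"
  }
}

resource "aws_subnet" "private" {
  for_each = var.private_subnet_cidrs

  vpc_id            = aws_vpc.main.id
  availability_zone = each.key
  cidr_block        = each.value

  tags = {
    Name = "private-${each.key}"
  }
}
```

Each subnet is tracked in state under its key, for example aws_subnet.private["us-east-1b"], so deleting that map entry removes only that subnet.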
How do I import existing AWS resources into Terraform?
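Use terraform import to add existing resources to state. First, write the resource configuration in Terraform matching the existing resource, then run terraform import ADDRESS ID, where ADDRESS is the resource address in your configuration and ID is the identifier the provider expects for that resource type. Run terraform plan afterwards and adjust the configuration until the plan shows no changes.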
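On Terraform 1.5 and later, the same operation can be declared in configuration with an import block, which makes the import reviewable in a plan. The bucket name and resource address below are hypothetical.

```hcl
# Declarative import: resolved during the next plan/apply
import {
  to = aws_s3_bucket.legacy
  id = "my-existing-bucket" # for S3 buckets the import ID is the bucket name
}

resource "aws_s3_bucket" "legacy" {
  bucket = "my-existing-bucket"
}
```

Once the apply succeeds, the import block can be removed; the resource stays in state.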
What's the difference between terraform taint and terraform apply -replace?
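Both force resource recreation. terraform taint marks a resource for recreation on the next apply (deprecated in newer versions). terraform apply -replace=ADDRESS plans and applies the replacement in a single step, so you see the effect in the plan before anything is modified. Prefer -replace on current Terraform versions.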
How do I handle Terraform version drift across team members?
Pin the required Terraform version in your configuration using required_version in the terraform block. Use a version manager like tfenv or asdf to automatically switch to the correct version per project. Include terraform version in CI checks to catch version mismatches before they cause issues.
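A minimal sketch of the pinning described above. The specific version constraints are illustrative assumptions; choose the ones your team has actually tested.

```hcl
terraform {
  # Allow patch releases of one minor version; refuse anything else
  required_version = "~> 1.9.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # pin the provider too, since it changes faster than core
    }
  }
}
```

Commit the generated .terraform.lock.hcl file as well so every team member and CI job resolves the same provider build.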
Should I commit terraform.tfstate to git?
Never commit state files to git. State contains sensitive data in plaintext and isn't designed for concurrent access. Use remote state with S3 backend and DynamoDB locking instead. Add terraform.tfstate* to .gitignore immediately in new projects.
How do I roll back a bad Terraform apply?
If state is corrupted, restore from S3 versioning. If configuration is wrong, revert the git commit and run terraform apply with the previous configuration. For partial failures, fix the error and re-run apply—Terraform resumes from where it failed. Keep state backups for disaster recovery.
Conclusion
Terraform's power comes from treating infrastructure as code, but that power requires discipline around state management, module organization, and security practices. Remote state with locking prevents the most common state corruption issues, semantic versioning for modules prevents breaking changes from propagating uncontrolled, and consistent tagging enables cost visibility that would be impossible to achieve manually.
Start with these foundational patterns—remote state, provider-level tags, role-based security groups—before optimizing for advanced use cases. A simple, correct Terraform setup beats a sophisticated but fragile one. Every pattern in this article solves a specific failure mode that teams encounter as their infrastructure scales. Implement them proactively rather than reactively after an incident.
The goal isn't perfect infrastructure-as-code on day one. It's infrastructure that's versionable, reviewable, and reproducible—properties that compound in value as your team and infrastructure grow.