Top Terraform Best Practices for AWS

AWS infrastructure drifts from your intended configuration the moment someone makes a manual change in the console. A developer adjusts a security group during debugging, an ops engineer scales an instance type during peak traffic, or someone adds a tag to help with cost tracking. None of these changes exist in your infrastructure-as-code, so the next Terraform apply either reverts them (causing confusion) or fails entirely because state doesn't match reality.

This article covers the Terraform patterns that prevent drift, reduce costs, and make AWS infrastructure actually manageable across teams. You'll learn proper state management strategies that prevent state file corruption, module organization that enables reuse without creating tight coupling, and the specific AWS provider configurations that avoid hitting API rate limits during large deployments. These aren't theoretical recommendations—they're patterns that emerge after managing dozens of AWS accounts with hundreds of resources each.

We'll cover remote state configuration, workspace strategies, module versioning, AWS tagging conventions, security group patterns, and the IAM configurations that let Terraform manage permissions safely. Every recommendation includes the specific failure mode it prevents.

State Management: Remote Backend Configuration

Terraform state files contain the complete mapping between your configuration files and actual AWS resources. Losing state means losing Terraform's knowledge of what it created, effectively orphaning resources. Storing state locally works for solo experimentation but breaks immediately when multiple people need to run Terraform or when CI/CD pipelines deploy infrastructure.

Remote state stored in S3 with DynamoDB locking solves both problems: all team members and CI jobs share one authoritative state, and DynamoDB prevents concurrent modifications that would corrupt state. This is the foundational pattern every production Terraform setup needs.

S3 Backend Configuration

# terraform/backend.tf
terraform {
  backend "s3" {
    bucket         = "myorg-terraform-state"
    key            = "production/vpc/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"

    # Note: the s3 backend has no "versioning" argument; state history comes
    # from versioning on the bucket itself (aws_s3_bucket_versioning below)
  }
}

# Create S3 bucket and DynamoDB table (run once)
# terraform/bootstrap/main.tf
resource "aws_s3_bucket" "terraform_state" {
  bucket = "myorg-terraform-state"

  lifecycle {
    prevent_destroy = true
  }
}

resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}

resource "aws_s3_bucket_public_access_block" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_dynamodb_table" "terraform_lock" {
  name           = "terraform-state-lock"
  billing_mode   = "PAY_PER_REQUEST"
  hash_key       = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }

  lifecycle {
    prevent_destroy = true
  }
}

The S3 bucket stores state files with versioning enabled, allowing you to recover from corrupted state by reverting to a previous version. Encryption ensures state contents (which include sensitive values like database passwords) stay encrypted at rest. DynamoDB's PAY_PER_REQUEST billing means you only pay for lock operations during Terraform runs, typically under $1/month even for busy teams.

State File Organization

The S3 key path (production/vpc/terraform.tfstate) determines state file location. Organize by environment and resource grouping to prevent state file bloat:

terraform-state/
├── production/
│   ├── vpc/terraform.tfstate
│   ├── compute/terraform.tfstate
│   ├── database/terraform.tfstate
│   └── monitoring/terraform.tfstate
├── staging/
│   ├── vpc/terraform.tfstate
│   └── compute/terraform.tfstate
└── development/
    └── all/terraform.tfstate

Smaller state files mean faster Terraform operations and reduced blast radius when applying changes. Separating VPC state from compute state means deploying application changes doesn't require refreshing network resources, which on large infrastructures can cut plan time substantially (for example, from around 60 seconds to around 10).
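
Splitting state this way means the compute configuration can no longer reference VPC resources directly. One common way to bridge the gap is the terraform_remote_state data source. The sketch below is illustrative: it assumes the VPC configuration exports a root-level output named public_subnet_ids, and it reuses the bucket and key from the backend example above.

```hcl
# compute/main.tf (hypothetical file in the compute configuration)

# Read the outputs of the separately managed VPC state
data "terraform_remote_state" "vpc" {
  backend = "s3"

  config = {
    bucket = "myorg-terraform-state"
    key    = "production/vpc/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_instance" "app" {
  ami           = "ami-12345678"
  instance_type = "t3.small"

  # Assumes the VPC root module declares an output named public_subnet_ids
  subnet_id = data.terraform_remote_state.vpc.outputs.public_subnet_ids[0]
}
```

Anyone who can read remote state this way can read everything in that state file, so keep secrets out of the state you share across configurations.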

Critical: Never commit state files to git, even encrypted. State files contain plaintext secrets (database passwords, API keys). Add terraform.tfstate* to .gitignore immediately. If you accidentally commit state, assume all secrets in that state are compromised and rotate them.

Module Organization and Versioning

Terraform modules enable reuse, but poorly designed modules create more problems than they solve. A module that tries to handle too many use cases becomes a maze of conditional logic. A module without versioning means every change potentially breaks all consumers. The sweet spot is focused modules with clear interfaces and semantic versioning.

Module Structure Pattern

# modules/vpc/main.tf
variable "vpc_cidr" {
  type        = string
  description = "CIDR block for VPC"
}

variable "availability_zones" {
  type        = list(string)
  description = "AZs to create subnets in"
}

variable "enable_nat_gateway" {
  type        = bool
  default     = true
  description = "Enable NAT gateway for private subnets"
}

variable "tags" {
  type        = map(string)
  default     = {}
  description = "Additional tags for all resources"
}

locals {
  common_tags = merge(
    {
      ManagedBy = "terraform"
      Module    = "vpc"
    },
    var.tags
  )
}

resource "aws_vpc" "main" {
  cidr_block           = var.vpc_cidr
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = merge(
    local.common_tags,
    {
      # lookup() avoids a hard failure when the caller omits the Environment tag
      Name = "${lookup(var.tags, "Environment", "unknown")}-vpc"
    }
  )
}

# Create public subnets
resource "aws_subnet" "public" {
  count             = length(var.availability_zones)
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(var.vpc_cidr, 8, count.index)
  availability_zone = var.availability_zones[count.index]

  map_public_ip_on_launch = true

  tags = merge(
    local.common_tags,
    {
      # lookup() avoids a hard failure when the caller omits the Environment tag
      Name = "${lookup(var.tags, "Environment", "unknown")}-public-${var.availability_zones[count.index]}"
      Type = "public"
    }
  )
}

# modules/vpc/outputs.tf
output "vpc_id" {
  value       = aws_vpc.main.id
  description = "ID of the created VPC"
}

output "public_subnet_ids" {
  value       = aws_subnet.public[*].id
  description = "List of public subnet IDs"
}

output "vpc_cidr" {
  value       = aws_vpc.main.cidr_block
  description = "CIDR block of the VPC"
}

This module has a single responsibility: creating a VPC with subnets. It exposes only the configuration options that users actually need to vary (CIDR blocks, AZs, NAT gateway toggle) while handling complex details internally (CIDR subnet calculation, tagging patterns, DNS settings). The excerpt omits the private subnets and the NAT gateway resources that enable_nat_gateway would control. Clear input variables and outputs make the module's contract explicit.

Module Versioning with Git Tags

# modules repository: github.com/yourorg/terraform-aws-modules
# Tag releases with semantic versioning
git tag -a v1.0.0 -m "Initial VPC module release"
git push origin v1.0.0

# Consuming the module with version pinning
module "vpc" {
  source = "git::https://github.com/yourorg/terraform-aws-modules.git//vpc?ref=v1.0.0"

  vpc_cidr           = "10.0.0.0/16"
  availability_zones = ["us-east-1a", "us-east-1b"]

  tags = {
    Environment = "production"
    Team        = "platform"
  }
}

Version pinning via git tags prevents breaking changes from automatically flowing into production. When you improve the VPC module, tag the change as v1.1.0, but production infrastructure continues using v1.0.0 until you explicitly update the ref. This gives you control over when changes propagate and time to test them in staging first.

Module Source	Use Case	Version Control
./modules/vpc	Local development, rapid iteration	No versioning, changes immediate
git::...?ref=main	Testing latest changes	No version lock, always latest
git::...?ref=v1.0.0	Production infrastructure	Explicit version, stable
registry.terraform.io	Public community and partner modules	Semantic versioning with version constraints
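
For registry sources, pinning uses the version argument instead of a git ref. The sketch below uses the public terraform-aws-modules/vpc/aws module purely as an illustration; the input names (name, cidr, azs, public_subnets) are that module's, not the custom module shown earlier, and the version number is an assumption.

```hcl
module "vpc" {
  source = "terraform-aws-modules/vpc/aws"

  # "~> 5.0" accepts any 5.x release but never 6.0,
  # so breaking major-version changes require an explicit edit
  version = "~> 5.0"

  name = "production-vpc"
  cidr = "10.0.0.0/16"

  azs            = ["us-east-1a", "us-east-1b"]
  public_subnets = ["10.0.0.0/24", "10.0.1.0/24"]
}
```

The version argument only works with registry sources; git sources keep using ?ref= as shown above.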

Resource Tagging Strategy

AWS tags enable cost allocation, resource organization, and automated policy enforcement. But tagging only works if applied consistently across all resources. Terraform's ability to apply tags at the provider level ensures every resource gets baseline tags automatically, eliminating the most common cause of incomplete tagging: forgetting to add them.

Provider-Level Default Tags

# providers.tf
provider "aws" {
  region = "us-east-1"

  default_tags {
    tags = {
      ManagedBy   = "terraform"
      Environment = var.environment
      Team        = var.team_name
      CostCenter  = var.cost_center
      Project     = var.project_name
    }
  }
}

# All resources automatically receive these tags
resource "aws_instance" "web" {
  ami           = "ami-12345678"
  instance_type = "t3.micro"

  # Resource-specific tags merged with default tags
  tags = {
    Name = "web-server-01"
    Role = "webserver"
  }
}

# Resulting tags on the instance:
# ManagedBy: terraform
# Environment: production
# Team: platform
# CostCenter: engineering
# Project: main-app
# Name: web-server-01
# Role: webserver

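Default tags apply automatically to every taggable resource created through this provider instance. EC2 instances, RDS databases, and S3 buckets all receive environment, team, and cost center tags without developers having to remember tag requirements. Resource-specific tags merge with default tags (a resource-level tag wins when keys collide), allowing you to add context (Name, Role) while maintaining baseline compliance. A few resources need extra handling; for example, instances launched by an Auto Scaling group receive tags through the group's own tag propagation settings rather than from default_tags.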

Tag-Based Cost Allocation

Activate your tag keys as cost allocation tags in the AWS Billing and Cost Management console so that Cost Explorer can break down billing by environment, team, or project. With consistent tagging, you can answer questions like "how much does the staging environment cost?" or "what's our database spend by team?" These insights are impossible without tags or require manual cost categorization that inevitably becomes outdated.

Pro Tip: Create a Service Control Policy (SCP) that denies resource creation when specific tags are missing. This enforces tagging at the AWS API level for actions that support tag condition keys, catching resources created outside Terraform. Combined with Terraform's default tags, this gets you close to complete tagging coverage.
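
A minimal sketch of such an SCP, managed from the AWS Organizations management account, might look like the following. It only covers EC2 instance launches and a single CostCenter tag; the policy name and the workloads_ou_id variable are hypothetical, and a real policy would need one statement per action you want to cover.

```hcl
# Hypothetical input: the organizational unit the SCP should apply to
variable "workloads_ou_id" {
  type        = string
  description = "ID of the OU to attach the tagging SCP to"
}

# Deny launching EC2 instances when the request carries no CostCenter tag
data "aws_iam_policy_document" "require_cost_center" {
  statement {
    sid       = "DenyRunInstancesWithoutCostCenter"
    effect    = "Deny"
    actions   = ["ec2:RunInstances"]
    resources = ["arn:aws:ec2:*:*:instance/*"]

    condition {
      test     = "Null"
      variable = "aws:RequestTag/CostCenter"
      values   = ["true"]
    }
  }
}

resource "aws_organizations_policy" "require_cost_center" {
  name    = "require-cost-center-tag"
  type    = "SERVICE_CONTROL_POLICY"
  content = data.aws_iam_policy_document.require_cost_center.json
}

resource "aws_organizations_policy_attachment" "workloads" {
  policy_id = aws_organizations_policy.require_cost_center.id
  target_id = var.workloads_ou_id
}
```

Test a policy like this against a sandbox OU first, since a Deny statement also blocks Terraform runs that forget the tag.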

Security Group Management

Security groups accumulate rules over time, eventually becoming complex enough that nobody understands what traffic they actually permit. The pattern that prevents this is treating security groups as narrow-purpose resources rather than catch-all rule collections. Create focused security groups for specific roles (web server, database, internal service) and compose them rather than adding every rule to one massive group.

Role-Based Security Group Pattern

# security-groups.tf
# Base security group for all instances
resource "aws_security_group" "base" {
  name_prefix = "${var.environment}-base-"
  vpc_id      = module.vpc.vpc_id
  description = "Base rules for all instances"

  # Allow all egress (instances can reach internet)
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  lifecycle {
    create_before_destroy = true
  }
}

# Web server security group
# (aws_security_group.alb is assumed to be defined elsewhere)
resource "aws_security_group" "web" {
  name_prefix = "${var.environment}-web-"
  vpc_id      = module.vpc.vpc_id
  description = "HTTP/HTTPS access for web servers"

  ingress {
    description     = "HTTPS from ALB"
    from_port       = 443
    to_port         = 443
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
  }

  ingress {
    description     = "HTTP from ALB"
    from_port       = 80
    to_port         = 80
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
  }
}

# Database security group
# (aws_security_group.app is assumed to be defined elsewhere)
resource "aws_security_group" "database" {
  name_prefix = "${var.environment}-database-"
  vpc_id      = module.vpc.vpc_id
  description = "PostgreSQL access from app servers"

  ingress {
    description     = "PostgreSQL from app servers"
    from_port       = 5432
    to_port         = 5432
    protocol        = "tcp"
    security_groups = [aws_security_group.app.id]
  }
}

# Application instance with multiple security groups
resource "aws_instance" "app" {
  ami           = "ami-12345678"
  instance_type = "t3.small"

  vpc_security_group_ids = [
    aws_security_group.base.id,
    aws_security_group.web.id,
  ]
}

This pattern creates small, single-purpose security groups that compose together. The base group handles egress (same for all instances), the web group handles HTTP/HTTPS ingress, and instances receive both. When you need to add SSH access for debugging, create an ssh security group and attach it to specific instances rather than adding SSH rules to the web group (where they'd apply to all web servers).
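
The SSH group described above could be sketched as follows. This version uses the standalone aws_vpc_security_group_ingress_rule resource rather than an inline ingress block, which is an alternative style to the examples above; the resource names ssh, ssh_from_vpc, and debug_target are hypothetical.

```hcl
# Narrow-purpose group that carries only the SSH rule
resource "aws_security_group" "ssh" {
  name_prefix = "${var.environment}-ssh-"
  vpc_id      = module.vpc.vpc_id
  description = "SSH access for debugging"

  lifecycle {
    create_before_destroy = true
  }
}

# Standalone rule resource: one rule per resource, managed independently
resource "aws_vpc_security_group_ingress_rule" "ssh_from_vpc" {
  security_group_id = aws_security_group.ssh.id
  description       = "SSH from inside the VPC"
  cidr_ipv4         = module.vpc.vpc_cidr
  from_port         = 22
  to_port           = 22
  ip_protocol       = "tcp"
}

# Attach the SSH group only to the instance being debugged
resource "aws_instance" "debug_target" {
  ami           = "ami-12345678"
  instance_type = "t3.small"

  vpc_security_group_ids = [
    aws_security_group.base.id,
    aws_security_group.web.id,
    aws_security_group.ssh.id,
  ]
}
```

Avoid mixing inline ingress blocks and standalone rule resources on the same security group, because the two styles overwrite each other's rules.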

Avoiding CIDR Hardcoding

Hardcoded CIDR blocks in security group rules create brittle infrastructure. When network topology changes, you're searching through Terraform files for every reference to the old CIDR. Reference security groups and VPC CIDR outputs instead:

# Bad: Hardcoded CIDR
ingress {
  from_port   = 5432
  to_port     = 5432
  protocol    = "tcp"
  cidr_blocks = ["10.0.1.0/24"]  # What if subnet CIDR changes?
}

# Good: Reference security group
ingress {
  from_port       = 5432
  to_port         = 5432
  protocol        = "tcp"
  security_groups = [aws_security_group.app.id]  # Follows instance
}

# Good: Reference the VPC CIDR output for access from inside the VPC
ingress {
  from_port   = 22
  to_port     = 22
  protocol    = "tcp"
  cidr_blocks = [module.vpc.vpc_cidr]  # All VPC traffic
}

IAM Role and Policy Management

IAM policies control what AWS resources can do. The common mistake is granting overly broad permissions ("it wasn't working so I just added * to everything") or creating deeply nested policy documents that become impossible to audit. The solution is separating policy documents from role assignments and using managed policies for common patterns.

IAM Role with Attached Policies

# iam.tf
# Trust policy (who can assume this role)
data "aws_iam_policy_document" "ec2_assume_role" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["ec2.amazonaws.com"]
    }
  }
}

# Permission policy (what the role can do)
data "aws_iam_policy_document" "app_permissions" {
  statement {
    sid    = "S3BucketAccess"
    effect = "Allow"

    actions = [
      "s3:GetObject",
      "s3:PutObject",
    ]

    resources = [
      "${aws_s3_bucket.app_data.arn}/*",
    ]
  }

  statement {
    sid    = "SecretsManagerAccess"
    effect = "Allow"

    actions = [
      "secretsmanager:GetSecretValue",
    ]

    resources = [
      aws_secretsmanager_secret.database_password.arn,
    ]
  }
}

# Create the role
resource "aws_iam_role" "app" {
  name               = "${var.environment}-app-role"
  assume_role_policy = data.aws_iam_policy_document.ec2_assume_role.json
}

# Create custom policy
resource "aws_iam_policy" "app" {
  name   = "${var.environment}-app-policy"
  policy = data.aws_iam_policy_document.app_permissions.json
}

# Attach custom policy to role
resource "aws_iam_role_policy_attachment" "app_custom" {
  role       = aws_iam_role.app.name
  policy_arn = aws_iam_policy.app.arn
}

# Attach AWS managed policy
resource "aws_iam_role_policy_attachment" "app_ssm" {
  role       = aws_iam_role.app.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
}

# Create instance profile
resource "aws_iam_instance_profile" "app" {
  name = "${var.environment}-app-profile"
  role = aws_iam_role.app.name
}

This structure separates trust policy (who can assume) from permission policy (what they can do), making both easier to understand and modify. Using aws_iam_policy_document data sources generates valid JSON policy documents from HCL, eliminating JSON syntax errors and making policies more readable. Attaching AWS managed policies (like AmazonSSMManagedInstanceCore) leverages AWS's pre-built, maintained policies for common use cases.

Least Privilege Principle

Grant only the specific permissions resources need. If your application reads from one S3 bucket, grant GetObject on that bucket—not all S3 buckets, and not s3:* actions. Start with minimal permissions and expand when you encounter permission errors, rather than starting with broad permissions and trying to narrow them later.
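
As a sketch of what that looks like for the read-only case, the policy document below grants object reads on a single bucket. It assumes the aws_s3_bucket.app_data resource from the earlier IAM example; the app_read_only name is hypothetical.

```hcl
data "aws_iam_policy_document" "app_read_only" {
  # Object-level read access, scoped to one bucket's objects
  statement {
    sid       = "ReadObjects"
    effect    = "Allow"
    actions   = ["s3:GetObject"]
    resources = ["${aws_s3_bucket.app_data.arn}/*"]
  }

  # Only needed if the application lists keys; note that ListBucket
  # applies to the bucket ARN itself, not to the objects inside it
  statement {
    sid       = "ListBucket"
    effect    = "Allow"
    actions   = ["s3:ListBucket"]
    resources = [aws_s3_bucket.app_data.arn]
  }
}
```

If the application never lists keys, drop the second statement entirely.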

Warning: Never grant full wildcard permissions (Action "*" on Resource "*") in production IAM policies. This grants unrestricted access to all AWS services and actions, effectively making the role an administrator. "I'll tighten it later" rarely happens; broad permissions tend to become permanent.

Variable and Secret Management

Terraform variables parameterize configurations, but where those values come from determines security and usability. Hardcoding database passwords in .tfvars files and committing them to git is the most common security mistake. The solution is separating sensitive and non-sensitive variables with appropriate storage for each.

Variable Organization

# variables.tf (defines available variables)
variable "environment" {
  type        = string
  description = "Environment name (production, staging, development)"

  validation {
    condition     = contains(["production", "staging", "development"], var.environment)
    error_message = "Environment must be production, staging, or development."
  }
}

variable "instance_type" {
  type        = string
  description = "EC2 instance type for application servers"
  default     = "t3.small"
}

variable "database_password" {
  type        = string
  description = "PostgreSQL master password"
  sensitive   = true
}

# terraform.tfvars (committed to git)
environment   = "production"
instance_type = "t3.medium"

# secrets.auto.tfvars (NOT committed to git, in .gitignore)
database_password = "actual-secure-password"

The sensitive = true flag prevents Terraform from printing the variable value in plan and apply output. It does not keep the value out of the state file, where it is still stored in plaintext, so it protects CLI output and logs only. Storing the actual password in secrets.auto.tfvars and adding it to .gitignore keeps it out of version control. The .auto.tfvars suffix means Terraform automatically loads it without explicit -var-file flags.

AWS Secrets Manager Integration

For production environments, store secrets in AWS Secrets Manager and reference them in Terraform rather than storing them in .tfvars files at all:

# Create secret in Secrets Manager (one-time setup)
resource "aws_secretsmanager_secret" "db_password" {
  name                    = "${var.environment}/database/master-password"
  recovery_window_in_days = 7
}

resource "aws_secretsmanager_secret_version" "db_password" {
  secret_id     = aws_secretsmanager_secret.db_password.id
  secret_string = var.database_password
}

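# Reference secret in RDS instance
# Reading secret_id from the version resource makes this lookup wait
# until a secret version actually exists
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = aws_secretsmanager_secret_version.db_password.secret_id
}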

resource "aws_db_instance" "main" {
  identifier        = "${var.environment}-postgres"
  engine            = "postgres"
  instance_class    = "db.t3.small"
  allocated_storage = 20

  username = "dbadmin"
  password = data.aws_secretsmanager_secret_version.db_password.secret_string

  skip_final_snapshot = true
}

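This pattern centralizes secret storage in Secrets Manager, which provides audit logging through CloudTrail, fine-grained access control, and optional rotation once you configure it. Applications can retrieve secrets at runtime via the AWS SDK instead of reading plaintext values from environment variables or configuration files. Note that the password still passes through var.database_password and is recorded in Terraform state, so state access must stay tightly restricted.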
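
If the goal is to keep the password out of Terraform variables altogether, one alternative (not part of the pattern above) is to let RDS create and manage the master password in Secrets Manager. This is a sketch assuming a recent AWS provider that supports manage_master_user_password; the managed_secret resource name and the output name are hypothetical.

```hcl
resource "aws_db_instance" "managed_secret" {
  identifier        = "${var.environment}-postgres"
  engine            = "postgres"
  instance_class    = "db.t3.small"
  allocated_storage = 20

  username = "dbadmin"

  # RDS generates the password and stores it in Secrets Manager;
  # this argument cannot be combined with "password"
  manage_master_user_password = true

  # Keep a final snapshot when the instance is destroyed
  skip_final_snapshot       = false
  final_snapshot_identifier = "${var.environment}-postgres-final"
}

# ARN of the secret RDS created, for granting applications read access
output "db_master_secret_arn" {
  value = aws_db_instance.managed_secret.master_user_secret[0].secret_arn
}
```

With this approach Terraform never sees the password value, so it does not end up in state.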

Environment Parity with Workspaces

Managing multiple environments (dev, staging, production) requires balancing code reuse with environment-specific configuration. Terraform workspaces provide one solution: same code, different state files, with variables determining environment-specific values.

Workspace-Based Environment Pattern

# main.tf uses workspace name for environment selection
locals {
  environment = terraform.workspace

  # Environment-specific configuration
  config = {
    production = {
      instance_type              = "t3.large"
      instance_count             = 3
      db_instance_class          = "db.t3.large"
      enable_deletion_protection = true
    }
    staging = {
      instance_type              = "t3.small"
      instance_count             = 1
      db_instance_class          = "db.t3.small"
      enable_deletion_protection = false
    }
    development = {
      instance_type              = "t3.micro"
      instance_count             = 1
      db_instance_class          = "db.t3.micro"
      enable_deletion_protection = false
    }
  }

  # This lookup fails in any workspace without an entry above,
  # including the built-in "default" workspace
  current_config = local.config[local.environment]
}

resource "aws_instance" "app" {
  count         = local.current_config.instance_count
  ami           = data.aws_ami.amazon_linux_2.id
  instance_type = local.current_config.instance_type

  tags = {
    Name = "${local.environment}-app-${count.index + 1}"
  }
}

# Using workspaces
# terraform workspace new production
# terraform workspace select production
# terraform apply  # Uses production config

# terraform workspace select staging
# terraform apply  # Uses staging config

Workspaces share the same Terraform code but maintain separate state files. The workspace name determines which configuration block to use, automatically adjusting instance sizes, counts, and protection settings per environment. This eliminates code duplication while maintaining environment isolation.
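
With the S3 backend, each non-default workspace gets its own state object under a workspace prefix. The sketch below shows where those objects land; the key and the workspace_key_prefix value are illustrative choices, not requirements.

```hcl
terraform {
  backend "s3" {
    bucket         = "myorg-terraform-state"
    key            = "app/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"

    # Defaults to "env:" when omitted
    workspace_key_prefix = "workspaces"
  }
}

# Resulting state objects in the bucket:
#   default workspace    -> app/terraform.tfstate
#   production workspace -> workspaces/production/app/terraform.tfstate
#   staging workspace    -> workspaces/staging/app/terraform.tfstate
```

All workspaces share one bucket and one set of backend credentials, which is worth weighing if production state needs stricter access than staging.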

When Not to Use Workspaces

Workspaces work well when environments differ only in scale and configuration, not in topology. If production has a multi-region setup with disaster recovery but staging is single-region, workspaces become complex with too many conditionals. In that case, separate directory structures for each environment work better:

terraform/
├── modules/
│   ├── vpc/
│   ├── compute/
│   └── database/
├── environments/
│   ├── production/
│   │   ├── main.tf
│   │   ├── terraform.tfvars
│   │   └── backend.tf
│   └── staging/
│       ├── main.tf
│       ├── terraform.tfvars
│       └── backend.tf

This structure duplicates some code but gives complete independence—production and staging can evolve separately without conditional logic cluttering modules.

Handling AWS API Rate Limits

Large Terraform applies that create or modify hundreds of resources can hit AWS API rate limits, causing operations to fail with throttling errors. The AWS provider retries throttled calls automatically, and you can reduce how often throttling happens by lowering Terraform's parallelism and tuning the provider's retry behavior.

Provider Concurrency Configuration

provider "aws" {
  region = "us-east-1"

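  # Maximum retry attempts for a throttled or failed API call
  # (this caps retries; it does not limit concurrency)
  max_retries = 10

  # Adaptive mode adds client-side rate limiting on top of standard backoff
  retry_mode = "adaptive"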
}

# Reduce parallelism for large applies
# terraform apply -parallelism=5

# Default is 10, which can overwhelm APIs
# 5 reduces concurrent requests while maintaining reasonable speed

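Adaptive retry mode builds on the standard mode's exponential backoff with jitter by adding client-side rate limiting, so the provider slows its own request rate when AWS starts throttling. Note that max_retries caps retry attempts rather than concurrency, and the provider's default is 25, so a value of 10 makes Terraform give up sooner. Lowering parallelism from the default 10 to 5 halves the number of resource operations Terraform runs at once, which helps in large environments (100+ resources) where rate limits are commonly hit.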

Resource Dependencies

Terraform automatically infers dependencies from resource references, but sometimes you need explicit depends_on to ensure proper ordering:

# Terraform infers the instance profile dependency from the reference below,
# but it cannot see that the instance also needs the policy attachment
resource "aws_instance" "app" {
  ami                  = "ami-12345678"
  instance_type        = "t3.small"
  iam_instance_profile = aws_iam_instance_profile.app.name

  depends_on = [
    aws_iam_role_policy_attachment.app_custom,
  ]
}

# Ensure IAM policy is attached before launching instance
# Without depends_on, instance might launch before policy attachment completes

Cost Optimization Patterns

Terraform manages infrastructure, but the choices you make in Terraform directly impact AWS costs. Small decisions—instance types, EBS volume configurations, NAT gateway strategies—compound into significant monthly charges. These patterns reduce costs without sacrificing reliability.

Right-Sizing Resources

# Use lifecycle configuration for stateful resources
resource "aws_db_instance" "main" {
  identifier     = "production-db"
  engine         = "postgres"
  instance_class = "db.t3.small"

  # Prevent accidental deletion
  deletion_protection = true

  lifecycle {
    prevent_destroy = true
    ignore_changes  = [
      # Ignore manual parameter changes
      parameter_group_name,
    ]
  }
}

# Use spot instances for non-critical workloads
resource "aws_spot_instance_request" "worker" {
  ami           = data.aws_ami.amazon_linux_2.id
  instance_type = "t3.medium"
  spot_price    = "0.02"  # Max willing to pay

  spot_type            = "persistent"
  wait_for_fulfillment = true

  tags = {
    Name = "spot-worker"
  }
}

# Cost-optimized NAT gateway strategy
# Single NAT gateway for dev/staging (each gateway removed saves roughly $32/month in hourly charges)
resource "aws_nat_gateway" "main" {
  count         = var.environment == "production" ? length(var.availability_zones) : 1
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public[count.index].id
}

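The NAT gateway pattern uses one NAT gateway per AZ in production (for high availability) but only one NAT gateway total in staging and development. At roughly $0.045 per hour in us-east-1 at the time of writing, each NAT gateway removed saves about $32/month before data processing charges, so collapsing three AZs to one gateway saves about $65/month per environment. The tradeoff is that non-production traffic from other AZs crosses AZ boundaries and depends on a single gateway. Spot instances can cost up to 90% less than on-demand for workloads that tolerate interruption (batch processing, CI runners, development instances).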
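
The NAT gateway resource above references Elastic IPs and needs matching routes that the excerpt does not show. A sketch of those pieces, assuming the aws_vpc.main and aws_nat_gateway.main resources from this article and AWS provider v5 or later (for the domain argument), might look like this. The nat_gateway_count local and the private route table names are hypothetical.

```hcl
locals {
  # One gateway per AZ in production, a single shared gateway elsewhere
  nat_gateway_count = var.environment == "production" ? length(var.availability_zones) : 1
}

# One Elastic IP per NAT gateway
resource "aws_eip" "nat" {
  count  = local.nat_gateway_count
  domain = "vpc"
}

# One private route table per AZ
resource "aws_route_table" "private" {
  count  = length(var.availability_zones)
  vpc_id = aws_vpc.main.id
}

# Default route for each private route table. The modulo maps every AZ to
# its own gateway in production and to gateway 0 when only one exists.
resource "aws_route" "private_default" {
  count                  = length(var.availability_zones)
  route_table_id         = aws_route_table.private[count.index].id
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = aws_nat_gateway.main[count.index % local.nat_gateway_count].id
}
```

Using the same local for the NAT gateway's own count keeps the two numbers from drifting apart.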

Resource	Cost Optimization	Approximate Savings
NAT Gateway	Single NAT for non-prod environments	~$32/month per gateway removed (us-east-1)
EC2 Instances	Use spot instances for batch workloads	60-90% off on-demand pricing
EBS Volumes	Use gp3 instead of gp2	~20% per GB at baseline performance
RDS Instances	Stop non-prod databases off-hours	Up to ~67% of instance charges for 16h/day shutdown (storage still billed)
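
As an illustration of the EBS row, switching a volume to gp3 is a matter of setting its type. The values below are assumptions for the sketch; 3000 IOPS and 125 MiB/s are the gp3 baseline, so setting them explicitly only documents the intent.

```hcl
resource "aws_ebs_volume" "data" {
  availability_zone = "us-east-1a"
  size              = 100 # GiB
  type              = "gp3"

  # gp3 decouples performance from size; these are the baseline values
  iops       = 3000
  throughput = 125

  tags = {
    Name = "app-data"
  }
}
```

Unlike gp2, raising iops or throughput later does not require a larger volume.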

Testing Terraform Changes

Applying Terraform changes directly to production without testing is how infrastructure incidents happen. The safe pattern is terraform plan followed by review, but plan output can be overwhelming for large changes. Tools like terraform validate and terraform fmt catch basic errors before plan even runs.

Pre-Apply Validation

# Validate configuration syntax
terraform validate

# Format code consistently
terraform fmt -recursive

# Generate and review plan
terraform plan -out=tfplan

# Show plan in human-readable format
terraform show tfplan

# If plan looks correct, apply it
terraform apply tfplan

# For exceptional cases (debugging, recovery), use -target to limit scope
terraform plan -target=aws_security_group.database
terraform apply -target=aws_security_group.database

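Saving the plan with -out=tfplan and applying that exact plan ensures what you reviewed is what executes. Without this, someone might commit changes between plan and apply, causing the apply to diverge from what you reviewed. The -target flag limits operations to specific resources, which is useful when debugging or making surgical changes to production. Treat it as an exception rather than a routine workflow, because repeated targeted applies can leave the rest of your configuration out of sync with what is deployed.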

Pro Tip: Run terraform plan in CI on every pull request and post the plan output as a comment. This gives reviewers visibility into infrastructure changes before they merge, catching issues during code review rather than during deployment.

FAQ

Should I use Terraform workspaces or separate directories for environments?

Use workspaces when environments differ only in scale and configuration (same topology, different instance sizes). Use separate directories when environments have different architectures (production is multi-region with DR, staging is single-region). Workspaces reduce code duplication but add conditional complexity; directories duplicate code but maintain independence.

How do I handle secrets in Terraform?

Never commit secrets to version control. For local development, use secrets.auto.tfvars files that are gitignored. For production, store secrets in AWS Secrets Manager or Parameter Store and reference them via data sources. Mark variables as sensitive = true to prevent Terraform from printing them in output.

What's the best way to organize Terraform files?

Start with logical separation: main.tf (primary resources), variables.tf (input variables), outputs.tf (output values), providers.tf (provider configuration), and backend.tf (state configuration). As files grow beyond 200-300 lines, split by resource type (compute.tf, networking.tf, database.tf) or by domain (frontend.tf, api.tf), taking care not to reuse the backend.tf name for application resources.

How do I prevent accidental resource deletion?

Use lifecycle { prevent_destroy = true } on critical resources like databases and state buckets. Enable deletion protection on RDS instances and enable termination protection on critical EC2 instances. For production environments, require manual approval before terraform apply runs in CI/CD.
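
A short sketch combining the Terraform-level and AWS-level guards on an EC2 instance; the resource name critical is hypothetical.

```hcl
resource "aws_instance" "critical" {
  ami           = "ami-12345678"
  instance_type = "t3.small"

  # AWS-level guard: termination requests are rejected until this is disabled
  disable_api_termination = true

  # Terraform-level guard: any plan that would destroy this resource fails
  lifecycle {
    prevent_destroy = true
  }
}
```

Keep in mind that prevent_destroy lives in the resource block, so deleting the block from configuration removes the guard along with it.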

Should I use count or for_each for creating multiple resources?

Use for_each when resources are identified by a meaningful key (names, IDs). Use count when resources are identical and order doesn't matter. for_each is safer because removing an item from the middle doesn't cause Terraform to recreate all subsequent resources, while count would reindex everything.
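
A minimal for_each sketch, using hypothetical bucket names, shows why keyed resources are more stable.

```hcl
variable "bucket_suffixes" {
  type    = set(string)
  default = ["logs", "assets", "backups"]
}

# Each bucket is tracked by its key, for example:
#   aws_s3_bucket.per_purpose["logs"]
#   aws_s3_bucket.per_purpose["assets"]
resource "aws_s3_bucket" "per_purpose" {
  for_each = var.bucket_suffixes

  bucket = "myorg-${each.key}"
}
```

Removing "assets" from the set destroys only that bucket. With count and a list, the same removal would shift "backups" from index 2 to index 1 and force Terraform to replace it.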

How do I import existing AWS resources into Terraform?

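Use terraform import to add existing resources to state. First, write the resource configuration in Terraform matching the existing resource, then run terraform import ADDRESS ID (for example, terraform import aws_s3_bucket.assets my-bucket-name). Run terraform plan to verify state matches configuration. The import command doesn't generate configuration, so you must write it yourself; Terraform 1.5 and later also support import blocks, and terraform plan -generate-config-out can draft configuration for them.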
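
On Terraform 1.5 or later, the same import can be declared in configuration so that it goes through plan and code review. The bucket and resource names below are hypothetical.

```hcl
# Declarative import: picked up by the next terraform plan/apply
import {
  to = aws_s3_bucket.legacy_assets
  id = "myorg-legacy-assets" # for S3 buckets, the import ID is the bucket name
}

# Configuration describing the existing bucket
resource "aws_s3_bucket" "legacy_assets" {
  bucket = "myorg-legacy-assets"
}
```

Once the apply succeeds, the import block can be removed; the resource stays in state.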

What's the difference between terraform taint and terraform apply -replace?

Both force resource recreation. terraform taint marks a resource for recreation on the next apply and has been deprecated since Terraform v0.15.2. terraform apply -replace=ADDRESS plans and performs the replacement in a single operation, so you see the replacement in the plan before approving it. Prefer -replace over tainting and applying separately.

How do I handle Terraform version drift across team members?

Pin the required Terraform version in your configuration using required_version in the terraform block. Use a version manager like tfenv or asdf to automatically switch to the correct version per project. Include terraform version in CI checks to catch version mismatches before they cause issues.
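
A sketch of the version constraints; the specific version numbers are illustrative assumptions, so substitute the ones your team has tested.

```hcl
terraform {
  # Reject CLI versions outside the tested range
  required_version = ">= 1.6.0, < 2.0.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # any 5.x provider release
    }
  }
}
```

Committing the generated .terraform.lock.hcl file alongside this pins exact provider versions and checksums for everyone on the team.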

Should I commit terraform.tfstate to git?

Never commit state files to git. State contains sensitive data in plaintext and isn't designed for concurrent access. Use remote state with S3 backend and DynamoDB locking instead. Add terraform.tfstate* to .gitignore immediately in new projects.

How do I roll back a bad Terraform apply?

If state is corrupted, restore from S3 versioning. If configuration is wrong, revert the git commit and run terraform apply with the previous configuration. For partial failures, fix the error and re-run apply—Terraform resumes from where it failed. Keep state backups for disaster recovery.

Conclusion

Terraform's power comes from treating infrastructure as code, but that power requires discipline around state management, module organization, and security practices. Remote state with locking prevents the most common state corruption issues, semantic versioning for modules prevents breaking changes from propagating uncontrolled, and consistent tagging enables cost visibility that would be impossible to achieve manually.

Start with these foundational patterns—remote state, provider-level tags, role-based security groups—before optimizing for advanced use cases. A simple, correct Terraform setup beats a sophisticated but fragile one. Every pattern in this article solves a specific failure mode that teams encounter as their infrastructure scales. Implement them proactively rather than reactively after an incident.

The goal isn't perfect infrastructure-as-code on day one. It's infrastructure that's versionable, reviewable, and reproducible—properties that compound in value as your team and infrastructure grow.
