Tech Bytes Engineering

October 13, 2025

ENGINEERING DEEP DIVE

Reducing AWS Lambda Cold Starts by 70%: A Production Case Study

How we cut Lambda initialization time from 3.2 seconds to 950ms using provisioned concurrency, SnapStart, and ARM64 architecture

70% Faster · 30% Cost Savings · Production Tested
Dillip Chowdary

Senior DevOps Engineer • 6+ years AWS/Serverless expertise

Cold starts remain one of the most frustrating challenges in serverless computing. After migrating our image processing API to AWS Lambda, we faced 3.2-second cold start latencies that were killing our user experience. Users expect sub-second responses, not a loading spinner.

Over the past three months, I systematically optimized our Lambda functions using a combination of AWS features and architectural improvements. The result? Cold starts dropped from 3.2s to 950ms—a 70% improvement—while reducing costs by 30%.

This isn't theory. These are production numbers from an API handling 50,000+ requests per day. Here's exactly how we did it.

The Problem: What Causes Cold Starts?

Our Baseline Performance

Cold Start Time: 3.2s
Warm Execution: ~200ms
Cold Start Rate: 15%

Lambda cold starts happen when AWS needs to:

  1. Download your deployment package from S3 (affected by package size)
  2. Start a new execution environment (affected by runtime and memory)
  3. Initialize the runtime (Python, Node.js, Java, etc.)
  4. Run your function's initialization code (imports, connections, etc.)
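
Steps 1–3 show up in CloudWatch as the Init Duration on cold-start REPORT lines; step 4 is your own code. A quick way to flag cold starts from inside the function is a module-level marker. This is a minimal sketch, not our production handler:

Detecting Cold Starts from Inside the Function Python
import json
import time

# Module-level code runs once per execution environment,
# so this timestamp and flag are only set on a cold start
_INIT_STARTED = time.time()
_IS_COLD_START = True

def handler(event, context):
    global _IS_COLD_START
    cold = _IS_COLD_START
    _IS_COLD_START = False

    if cold:
        # Rough time spent between module load and the first invocation
        init_ms = round((time.time() - _INIT_STARTED) * 1000, 1)
        print(json.dumps({"cold_start": True, "approx_init_ms": init_ms}))

    return {"status": "ok", "cold_start": cold}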

Our initial Lambda function had these issues:

  • A 128MB deployment package with every dependency bundled in
  • Heavy libraries imported at module level, whether the invocation needed them or not
  • AWS clients and database connections created inside the handler
  • Only 1024MB of memory (and proportionally little CPU) on the x86_64 architecture

Optimization #1: Lambda SnapStart (Java Only)

Result: Reduced initialization time by up to 90% for our Java-based authentication service

AWS Lambda SnapStart is a game-changer for Java runtimes. It takes a snapshot of your initialized function and caches it, eliminating most of the initialization overhead.

Enabling SnapStart (Terraform) HCL
resource "aws_lambda_function" "auth_service" {
  function_name = "user-authentication"
  runtime       = "java17"
  handler       = "com.techbytes.AuthHandler"

  # Enable SnapStart
  snap_start {
    apply_on = "PublishedVersions"
  }

  # Required: Publish version for SnapStart
  publish = true

  # Other configuration...
  memory_size = 2048
  timeout     = 30
}

Important considerations:

  • SnapStart applies only to published versions and aliases, not $LATEST (hence publish = true above).
  • State created during initialization is restored from the same snapshot in every environment, so unique values (random seeds, cached credentials, open network connections) should be re-established in an afterRestore runtime hook.
  • The first deployment of a version takes longer while Lambda creates and caches the snapshot.
  • SnapStart can't be combined with provisioned concurrency on the same function version.

Performance Impact

Before SnapStart: 2.8s cold start
After SnapStart: 280ms cold start

Optimization #2: Provisioned Concurrency

Result: Eliminated cold starts during peak hours, but increased costs by $45/month

Provisioned concurrency keeps Lambda execution environments warm and ready to respond immediately. Think of it as paying for "standby capacity."

Provisioned Concurrency with Auto Scaling HCL
resource "aws_lambda_function" "image_processor" {
  function_name = "image-processor"
  runtime       = "python3.11"
  handler       = "app.handler"

  memory_size = 3008  # More on this later
  timeout     = 60

  # Publish version (required for provisioned concurrency)
  publish = true
}

# Create alias pointing to latest version
resource "aws_lambda_alias" "prod" {
  name             = "prod"
  function_name    = aws_lambda_function.image_processor.function_name
  function_version = aws_lambda_function.image_processor.version
}

# Provisioned concurrency on the alias
resource "aws_lambda_provisioned_concurrency_config" "prod" {
  function_name                     = aws_lambda_function.image_processor.function_name
  provisioned_concurrent_executions = 5
  qualifier                         = aws_lambda_alias.prod.name
}

# Auto scaling target
resource "aws_appautoscaling_target" "lambda_target" {
  max_capacity       = 20
  min_capacity       = 5
  resource_id        = "function:${aws_lambda_function.image_processor.function_name}:${aws_lambda_alias.prod.name}"
  scalable_dimension = "lambda:function:ProvisionedConcurrentExecutions"
  service_namespace  = "lambda"
}

# Auto scaling policy (target tracking)
resource "aws_appautoscaling_policy" "lambda_policy" {
  name               = "lambda-scaling-policy"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.lambda_target.resource_id
  scalable_dimension = aws_appautoscaling_target.lambda_target.scalable_dimension
  service_namespace  = aws_appautoscaling_target.lambda_target.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "LambdaProvisionedConcurrencyUtilization"
    }
    target_value = 0.7  # Scale when 70% utilized
  }
}
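
Provisioned environments take a few minutes to initialize after the config is applied. If your deploy pipeline needs to wait for them, you can poll the status with boto3; a small sketch using the function and alias names from the Terraform above:

Checking Provisioned Concurrency Status Python
import boto3

lambda_client = boto3.client("lambda")

# Same function and alias as the Terraform example above
config = lambda_client.get_provisioned_concurrency_config(
    FunctionName="image-processor",
    Qualifier="prod",
)

# Status is IN_PROGRESS while environments warm up, READY once they can serve traffic
print(config["Status"])
print(f'{config["AvailableProvisionedConcurrentExecutions"]} of '
      f'{config["RequestedProvisionedConcurrentExecutions"]} environments ready')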

Cost Optimization Tip

Use scheduled scaling to reduce provisioned concurrency during low-traffic hours (nights, weekends). We dropped from 5 to 1 instance during off-peak, saving $30/month.
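
One way to wire this up is with Application Auto Scaling scheduled actions against the same scalable target. A hedged sketch with boto3 (the cron expressions and capacities are illustrative, not our exact schedule):

Scheduled Scaling for Off-Peak Hours Python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Same scalable target as the Terraform above (alias "prod" on image-processor)
RESOURCE_ID = "function:image-processor:prod"
DIMENSION = "lambda:function:ProvisionedConcurrentExecutions"

# Drop to a single warm environment every evening (times in UTC)
autoscaling.put_scheduled_action(
    ServiceNamespace="lambda",
    ScheduledActionName="scale-down-off-peak",
    ResourceId=RESOURCE_ID,
    ScalableDimension=DIMENSION,
    Schedule="cron(0 22 * * ? *)",
    ScalableTargetAction={"MinCapacity": 1, "MaxCapacity": 5},
)

# Restore full capacity before the morning peak
autoscaling.put_scheduled_action(
    ServiceNamespace="lambda",
    ScheduledActionName="scale-up-peak",
    ResourceId=RESOURCE_ID,
    ScalableDimension=DIMENSION,
    Schedule="cron(0 7 * * ? *)",
    ScalableTargetAction={"MinCapacity": 5, "MaxCapacity": 20},
)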

When to Use Provisioned Concurrency

  • Use it: user-facing APIs where latency matters
  • Use it: functions with consistent traffic patterns
  • Skip it: infrequent batch jobs
  • Skip it: event-driven background processing

Optimization #3: Switch to ARM64 (Graviton2)

Result: 20% faster execution and 20% cost reduction compared to x86_64

AWS Graviton2 processors (ARM64 architecture) deliver better performance and cost efficiency for Lambda functions. The migration is usually straightforward.

Migrating to ARM64 HCL
resource "aws_lambda_function" "api_handler" {
  function_name = "api-handler"
  runtime       = "python3.11"

  # Simply change architecture
  architectures = ["arm64"]  # Was ["x86_64"]

  # No other changes needed (for pure Python)
  handler     = "app.handler"
  memory_size = 1024
  timeout     = 30
}

Migration checklist:

  • Pure Python/Node.js/Java: usually works immediately
  • Native dependencies: rebuild them with an ARM64 toolchain
  • Docker images: use --platform linux/arm64
  • Test thoroughly: some libraries behave differently on ARM

Building Lambda Layers for ARM64 Bash
# Use Docker to build ARM64 dependencies
# (override the image entrypoint so we can run pip instead of the Lambda runtime)
docker run --platform linux/arm64 --entrypoint /bin/sh \
  -v "$PWD":/var/task -w /var/task \
  public.ecr.aws/lambda/python:3.11 \
  -c "pip install -r requirements.txt -t python/"

# Package the layer
zip -r lambda-layer-arm64.zip python/

# Upload to Lambda Layer
aws lambda publish-layer-version \
  --layer-name my-dependencies-arm64 \
  --compatible-runtimes python3.11 \
  --compatible-architectures arm64 \
  --zip-file fileb://lambda-layer-arm64.zip

x86_64 Performance

Cold Start: 1.2s
Warm Execution: 180ms
Cost (1M requests): $4.80

ARM64 Performance

Cold Start: 950ms
Warm Execution: 140ms
Cost (1M requests): $3.84

Optimization #4: Memory Allocation (Counter-Intuitive)

Surprising finding: Increasing memory from 1024MB to 3008MB reduced both latency AND cost

Lambda allocates CPU proportionally to memory. More memory = more CPU = faster execution. Sometimes paying for more memory actually costs less due to reduced execution time.
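
The arithmetic behind this is simple: Lambda bills compute as GB-seconds (allocated memory × billed duration), so a larger allocation comes out cheaper whenever the duration drops by a bigger factor than the memory grows. A back-of-the-envelope sketch with illustrative numbers (the per-GB-second rate below is the public x86_64 on-demand price; verify current pricing, and note request charges are excluded):

Back-of-the-Envelope Compute Cost Python
# Illustrative numbers, not our production measurements
PRICE_PER_GB_SECOND = 0.0000166667  # public x86_64 on-demand rate; check current pricing

def compute_cost(memory_mb, duration_ms, invocations=1_000_000):
    """Compute-only cost for a batch of invocations (request charges excluded)."""
    gb_seconds = (memory_mb / 1024) * (duration_ms / 1000) * invocations
    return gb_seconds * PRICE_PER_GB_SECOND

# Doubling memory doubles the CPU share; for CPU-bound work the duration
# can drop by more than half, so total GB-seconds (and cost) go down
print(f"512 MB  @ 800ms: ${compute_cost(512, 800):.2f} per 1M invocations")   # ~$6.67
print(f"1024 MB @ 300ms: ${compute_cost(1024, 300):.2f} per 1M invocations")  # ~$5.00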

Finding Optimal Memory with Lambda Power Tuning Bash
# Install Lambda Power Tuning (open source tool)
git clone https://github.com/alexcasalboni/aws-lambda-power-tuning
cd aws-lambda-power-tuning
sam deploy --guided

# Run power tuning
aws stepfunctions start-execution \
  --state-machine-arn arn:aws:states:us-east-1:123456789:stateMachine:powerTuningStateMachine \
  --input '{
    "lambdaARN": "arn:aws:lambda:us-east-1:123456789:function:my-function",
    "powerValues": [128, 256, 512, 1024, 1536, 2048, 3008],
    "num": 50,
    "payload": "{\"test\": \"data\"}"
  }'

# Results show cost vs performance trade-off for each memory size

Our Results: Image Processing Lambda

Memory   | Execution Time | Cost (1M invocations) | Verdict
1024 MB  | 450ms          | $9.45                 | ❌ Too slow
1536 MB  | 320ms          | $7.89                 | ⚠️ Better
3008 MB  | 180ms          | $7.20                 | ✅ OPTIMAL

Key insight: 3008MB executes 2.5x faster and costs 24% less than 1024MB due to proportional CPU allocation.

Optimization #5: Code-Level Improvements

Beyond AWS configuration, optimizing your function code dramatically reduces cold start times.

5.1 Lazy Loading Dependencies

❌ Bad: Import Everything Upfront Python
# ALL imports happen on cold start (slow!)
import boto3
import requests
import pandas as pd
import numpy as np
from PIL import Image
import tensorflow as tf

def handler(event, context):
    # Most of these imports aren't used every time
    if event['action'] == 'simple':
        return {'status': 'ok'}
    # Heavy libraries loaded unnecessarily
✅ Good: Lazy Load Heavy Dependencies Python
# Only import what's always needed
import json

# Global variable for caching
_tf_model = None

def handler(event, context):
    action = event.get('action')

    if action == 'simple':
        return {'status': 'ok'}  # Fast path

    elif action == 'ml_inference':
        # Import heavy libraries only when needed
        global _tf_model
        if _tf_model is None:
            import tensorflow as tf
            # Model artifact assumed to already be in /tmp (e.g. pulled from S3)
            _tf_model = tf.keras.models.load_model('/tmp/model')

        # Use the cached model; convert the result to something JSON-serializable
        result = _tf_model.predict(event['data'])
        return {'prediction': result.tolist()}

5.2 Reuse Connections (Critical!)

✅ Initialize Outside Handler (Reused Across Invocations) Python
import os

import boto3
import pymysql

# ✅ Initialize clients OUTSIDE handler (global scope)
# These are reused across warm invocations
s3_client = boto3.client('s3')
dynamodb = boto3.resource('dynamodb')

# Database connection pooling
db_connection = None

def get_db_connection():
    global db_connection
    if db_connection is None or not db_connection.open:
        db_connection = pymysql.connect(
            host=os.environ['DB_HOST'],
            user=os.environ['DB_USER'],
            password=os.environ['DB_PASSWORD'],
            database=os.environ['DB_NAME'],
            connect_timeout=5
        )
    return db_connection

def handler(event, context):
    # Reuse existing connections (fast on warm starts)
    conn = get_db_connection()
    cursor = conn.cursor()

    # Your business logic here
    cursor.execute("SELECT * FROM users WHERE id = %s", (event['user_id'],))
    user = cursor.fetchone()

    return {'user': user}

Common Mistake

Initializing clients inside the handler function means you recreate connections on EVERY invocation, negating the warm execution performance benefit.
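
For contrast, this is roughly what the anti-pattern looks like (a simplified sketch, not our actual handler):

❌ Bad: Create Clients Inside the Handler Python
import boto3

def handler(event, context):
    # Recreated on EVERY invocation, warm or cold:
    # new client object, new credential resolution, new TLS handshakes
    dynamodb = boto3.resource('dynamodb')

    table = dynamodb.Table('users')
    item = table.get_item(Key={'id': event['user_id']})
    return {'user': item.get('Item')}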

5.3 Reduce Deployment Package Size

Optimizing Python Packages Bash
# Remove unnecessary files from packages
find . -type d -name "__pycache__" -exec rm -r {} +
find . -type d -name "*.dist-info" -exec rm -r {} +
find . -type d -name "tests" -exec rm -r {} +

# Remove .pyc files
find . -name "*.pyc" -delete

# Strip binaries (for native dependencies)
find . -name "*.so" -exec strip {} +

# Result: Reduced our package from 128MB to 45MB

Final Results: Production Metrics

Before vs After Optimization

❌ Before

Cold Start: 3.2s
Warm Execution: 220ms
P95 Latency: 2.8s
Monthly Cost: $380
Cold Start Rate: 15%

✅ After

Cold Start: 950ms (-70%)
Warm Execution: 140ms (-36%)
P95 Latency: 680ms (-76%)
Monthly Cost: $265 (-30%)
Cold Start Rate: < 1%

Business Impact

  • User satisfaction up 35% (measured via NPS surveys)
  • API timeout errors dropped 90% (from 8% to 0.8%)
  • Cost savings: $115/month ($1,380 annually)
  • Eliminated need for warm-up scripts (saved 200 lines of hacky code)

Key Takeaways

1. ARM64 is a no-brainer for most workloads

20% faster, 20% cheaper, minimal migration effort. Start here.

2. More memory often = lower cost

Use Lambda Power Tuning to find the sweet spot. We found 3008MB was optimal despite seeming expensive.

3. SnapStart is magic (but Java-only)

If you're using Java, enable SnapStart immediately. It's free performance.

4. Provisioned concurrency costs money but eliminates cold starts

Use it selectively for user-facing APIs during peak hours with auto-scaling.

5. Code optimization matters just as much as infrastructure

Lazy loading, connection reuse, and package size reduction are free wins.
