How boto3 impacts the cold start times of your Lambda functions

AWS Lambda is a really cool service that I use extensively in the serverless architectures I design and build. My implementation language of choice in these cases is usually Python - not only because I’m most familiar with it, but also because it integrates well with the Lambda ecosystem. The startup times and resource requirements are really low compared to Lambda functions written in something like Java, and using the Lambda code editor to quickly troubleshoot issues in your code makes for quick iterations (don’t do that in production though).

Most of my Lambda functions make calls to at least one other AWS service using the AWS SDK for Python: boto3. I knew that boto3 is not the most lightweight library, but I noticed some time ago that it takes a bit of time to initialize the first boto3 object. After seeing a question on Stack Overflow, I decided to take a closer look.

In this post I’ll first show you the architecture I used to measure how boto3 initializations impact the time your Lambda function takes before it can perform any meaningful work, and how that behaves in relation to the resources available to the function. Then we’ll discuss the measurements and my attempts to explain why we see this behavior. At the end I’ll share some of the conclusions I draw from this.

The setup

How do you measure how long it takes to instantiate a boto3 client or resource? The general idea is pretty simple; here’s a code sample:

from datetime import datetime
import boto3

start_time = datetime.now() # Start the clock
boto3.client("dynamodb")
end_time = datetime.now() # Stop the clock

time_it_took = (end_time - start_time).total_seconds()

print(f"Initialization of a boto3.client took {time_it_took} seconds.")

If you instantiate other boto3.client or boto3.resource objects after that first one, you’ll notice that they’re created a lot quicker. That’s because the library does some caching internally, which is a good thing. This also means we need to make sure we measure the cold start times, because those are our worst-case scenario.
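
To see that caching effect in isolation, you can time a second instantiation right after the first. Here’s a minimal sketch along the lines of the snippet above (the exact timings will obviously vary with your environment):

from datetime import datetime
import boto3

def timed_instantiation(service_name):
    # Measure a single boto3.client() call
    start_time = datetime.now()
    boto3.client(service_name)
    return (datetime.now() - start_time).total_seconds()

first = timed_instantiation("dynamodb")
second = timed_instantiation("dynamodb")

print(f"First instantiation: {first:.3f}s, second: {second:.3f}s")

The second call benefits from the data boto3 has already loaded and should come back much faster.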

Earlier experiments had led me to believe that CPU and general I/O performance have a significant impact on this first instantiation. I decided to test this systematically by measuring the execution times with different Lambda functions. Since properties like CPU and network throughput in Lambda scale with the memory available to the function, I built the following setup.

I created a CDK app that builds multiple Lambda functions with the same code but different memory configurations for me. For my tests I set 128 MB as the lower boundary and 2560 MB as the upper one and created Lambda functions in increments of 128 MB between these two boundaries. That resulted in 20 experiments. I doubled that to 40, because I wanted to test both boto3.client as the lower-level implementation and boto3.resource as the higher-level abstraction.
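
In CDK, this boils down to a loop over the memory configurations. A minimal sketch of the idea, assuming CDK v2 and illustrative names (the stack name, handler and asset path are mine - the actual app is linked below):

from aws_cdk import Stack, aws_lambda as _lambda
from constructs import Construct

class MeasurementStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # 128 MB to 2560 MB in 128 MB increments -> 20 functions
        for memory_size in range(128, 2560 + 1, 128):
            _lambda.Function(
                self,
                f"Experiment{memory_size}",
                runtime=_lambda.Runtime.PYTHON_3_9,
                handler="index.handler",  # illustrative handler name
                code=_lambda.Code.from_asset("lambda_src"),  # illustrative path
                memory_size=memory_size,
            )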

To invoke these functions I used an SNS topic with another Lambda in front of it, which sends - in my case - 100 messages to the topic. Each experiment Lambda records its result in a DynamoDB table. There’s also a 42nd Lambda (for obvious reasons), which aggregates the results in the table into a CSV format.
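
The fan-out itself is only a few lines of boto3. Here’s a sketch of what the invoker could look like - the TOPIC_ARN environment variable and the message format are assumptions on my part:

import json
import os

import boto3

sns = boto3.client("sns")

def handler(event, context):
    topic_arn = os.environ["TOPIC_ARN"]  # assumed to be set on the function
    # Each published message triggers an invocation of every
    # subscribed experiment function.
    for i in range(100):
        sns.publish(TopicArn=topic_arn, Message=json.dumps({"run": i}))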

[Figure: Architecture diagram of the measurement setup]

The CDK app is available on GitHub, so you can try this for yourself and take a closer look at the code. Setup instructions are part of the repo. To ensure I’d only measure actual cold starts, I had to use some tricks. First I had to detect whether the current invocation is a warm start and, if that’s the case, raise an exception. To achieve that I used an implementation like this:

# Module-level variables are initialized once per execution
# context, so this flag "survives" across warm invocations.
IS_COLD_START = True

def handler(event, context):
    global IS_COLD_START

    # If the flag has already been flipped, this execution context
    # has handled an invocation before: a warm start.
    if not IS_COLD_START:
        raise RuntimeError("Warm Start!")

    # Other code

    # Mark this execution context as used.
    IS_COLD_START = False

This stops me from recording the warm-start times, because global variables “survive” across multiple invocations. Now I’ve got the problem that a lot of my Lambda calls are going to fail, because Lambda tries to re-use existing execution contexts. To avoid this, I implemented an easy way to get rid of these warm execution contexts: I update the Lambda function configuration at the end of a successful run. In my case I just add an environment variable called ELEMENT_OF_SURPRISE with a random value, which causes Lambda to discard the existing execution contexts. If you’re curious about the implementation, check out the code on GitHub.
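
Sketched out, that trick could look like this - force_cold_start and its details are my illustration, the real implementation is in the repo:

import random
import string

import boto3

lambda_client = boto3.client("lambda")

def force_cold_start(function_name):
    # Fetch the current environment variables so we don't lose them.
    config = lambda_client.get_function_configuration(FunctionName=function_name)
    variables = config.get("Environment", {}).get("Variables", {})

    # Changing the configuration makes Lambda discard the existing
    # (warm) execution contexts for this function.
    variables["ELEMENT_OF_SURPRISE"] = "".join(
        random.choices(string.ascii_letters, k=16)
    )
    lambda_client.update_function_configuration(
        FunctionName=function_name,
        Environment={"Variables": variables},
    )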

The result aggregation is not very interesting: I collect the measurements from the table and calculate an average over all calls, per method and per memory configuration, which is then formatted as CSV for ease of use in Excel. Let’s talk about these results next.
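
If you’re curious anyway, here’s roughly what that could look like, assuming each item stores the memory size, the method (client or resource) and the measured duration - the table and attribute names are mine, not necessarily those used in the repo:

import csv
import io
from collections import defaultdict

import boto3

table = boto3.resource("dynamodb").Table("measurements")  # table name assumed

def aggregate_to_csv():
    # Collect all measurements (a scan is fine at this scale)
    items = []
    response = table.scan()
    items.extend(response["Items"])
    while "LastEvaluatedKey" in response:
        response = table.scan(ExclusiveStartKey=response["LastEvaluatedKey"])
        items.extend(response["Items"])

    # Group the durations by (memory size, method) and average them
    groups = defaultdict(list)
    for item in items:
        key = (int(item["memory_size"]), item["method"])
        groups[key].append(float(item["duration_ms"]))

    output = io.StringIO()
    writer = csv.writer(output)
    writer.writerow(["memory_size_mb", "method", "avg_duration_ms"])
    for (memory_size, method), durations in sorted(groups.items()):
        writer.writerow([memory_size, method, sum(durations) / len(durations)])
    return output.getvalue()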

Results

Based on the setup I described above, I’ve compiled this chart to show the results. The y-axis indicates how long the initial instantiation of either the client or the resource took; the x-axis shows the amount of memory allocated to the Lambda function. Each data point is the average of at least 100 individual measurements.

[Figure: Chart of boto3 client and resource initialization times per memory configuration]

As you can see, the time it takes decreases dramatically as we increase the performance characteristics of the function. In fact, the table below shows that the initial instantiation of both the client and the resource takes more than 10 times as long for a Lambda with 128 MB RAM (640 ms and 716 ms) as for one with at least 1408 MB RAM (60 ms and 71 ms) - which, given the elevenfold increase in memory, seems roughly proportional at first glance.

A closer look at the data reveals that it is in fact not proportional: the initial gains are vastly greater than the later ones. We observe diminishing returns, and beyond the threshold of about 1408 MB it doesn’t really matter anymore if you increase the memory. That’s probably around the threshold where a second vCPU core is added to the function. (I have left out the later results because they’re essentially the same; you can view the whole result set on GitHub.)

Memory Size (MB)   client init (ms)   resource init (ms)
128                640                716
256                299                356
384                193                232
512                141                174
640                112                138
768                92                 108
896                79                 97
1024               74                 83
1152               69                 81
1280               62                 76
1408               60                 71
1536               59                 68
1664               61                 71

Let’s now consider what we can learn from that.

Conclusions

  • For “user-facing” functions, such as Lambdas behind an API Gateway or ALB that need to talk to another AWS service through boto3, the minimum amount of memory I’d recommend is between 512 and 768 MB. Consider that after initializing boto3 no actual work in terms of service communication has happened yet - that’s time that will be added on top (see the sketch after this list for a way to at least pay this cost only once per execution context).
  • The - in my opinion - very cool implementation of boto3, which essentially builds the whole API from a couple of JSON files, incurs a severe performance penalty on first initialization.
  • Doubling the memory in the lower areas of the MB spectrum results in roughly double the performance, but there are diminishing returns (note: this is not necessarily true for all workloads). This is most likely due to our particular workload not scaling across multiple CPU cores.
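
One practical consequence of these numbers: make sure you only pay the initialization cost once per execution context by creating your clients at module level rather than inside the handler. A minimal sketch of that common pattern:

import boto3

# Created once per execution context, during the cold start.
# Warm invocations reuse this client and skip the expensive
# initialization entirely.
dynamodb = boto3.client("dynamodb")

def handler(event, context):
    # Uses the already-initialized client
    return dynamodb.list_tables(Limit=10)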

I hope you liked this investigation into how boto3 initializations impact the time your Lambda function takes before it can perform meaningful work. If you want to get in touch, I’m available on the social media channels listed in my bio below.

– Maurice
