Optimizing lambdas - reducing your bills

I’m still on a quest to reduce my lambda bills and I’ve made a new discovery.

My lambda functions are idle more than 80% of the time. That’s right: of the roughly 300ms billed per invocation, about 250ms is spent doing nothing but waiting. I figured this out by using AWS X-Ray to analyze my function run times.
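As a rough illustration of the kind of instrumentation that exposes this (a minimal sketch, not one of my real functions; it assumes active tracing is enabled on the function, and the table and key names are placeholders): wrap the SDK client with the X-Ray SDK and use the WithContext call variants, and every AWS call shows up as a timed subsegment in the trace.

	// Sketch only: instrument the AWS SDK client so each call appears as an
	// X-Ray subsegment with its own duration. Requires active tracing on the
	// function. Table and key names are placeholders.
	package main

	import (
		"context"

		"github.com/aws/aws-lambda-go/lambda"
		"github.com/aws/aws-sdk-go/aws"
		"github.com/aws/aws-sdk-go/aws/session"
		"github.com/aws/aws-sdk-go/service/dynamodb"
		"github.com/aws/aws-xray-sdk-go/xray"
	)

	var db = dynamodb.New(session.Must(session.NewSession()))

	func handler(ctx context.Context) error {
		// This call becomes a subsegment in the trace; its duration is almost
		// entirely network wait.
		_, err := db.GetItemWithContext(ctx, &dynamodb.GetItemInput{
			TableName: aws.String("example-table"),
			Key: map[string]*dynamodb.AttributeValue{
				"id": {S: aws.String("123")},
			},
		})
		return err
	}

	func main() {
		xray.AWS(db.Client) // wrap the client so its calls are traced
		lambda.Start(handler)
	}

Once every SDK call is its own subsegment, the process-then-wait pattern described below becomes obvious in the trace view.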

To understand why the functions spend most of their time waiting, you need to understand how AWS works. When you make API calls through the AWS SDK, those calls are almost never executed locally; the vast majority turn into RPC calls to another AWS server. Making an RPC call to another server involves a lot of waiting for that server to respond, and that’s how I end up spending most of my billed time waiting.

The next observation is that a 512MB lambda costs 4x as much per millisecond as a 128MB lambda. How much CPU and memory do you need in order to wait? Zero.

I have already converted all of my lambdas to Go. In some cases Go is 10x faster than JavaScript. I believe this is mainly about JIT warm-up rather than the JavaScript language itself: Go is compiled ahead of time, so there is no JIT to warm up, and both Go and JavaScript are garbage collected, so that part is a wash. Overall, Go is an absolute win for lambda performance.

Next I set all of my lambdas to use 128MB instead of 512MB. Sure, they take longer to run, but only about 50% longer or less. Since Lambda bills in GB-seconds, dropping from 512MB to 128MB cuts the price per millisecond by 75%.

But how can I do this without losing response time? That is where X-Ray comes in. Looking at my X-Ray profiles, I can see that my functions process for 5ms, wait for 25ms, process for 5ms, wait for 25ms, and so on. So what can I do during those 25ms waits? Use goroutines! I analyzed all of my lambda functions and broke them into a bunch of smaller subroutines. Then I used goroutines to run independent groups in parallel where possible, followed by a synchronization point using channels. Doing this extracts all of the available parallelism from my lambda and lets two or three of those 25ms waits overlap.
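Here is a minimal sketch of that pattern (not one of my actual handlers; it assumes the usual aws-sdk-go v1 imports and clients created during init, and the table, bucket, and object names are placeholders): two independent AWS calls run in their own goroutines, and a buffered channel serves as the synchronization point.

	// Sketch of fan-out/fan-in with goroutines and a channel. The two calls
	// are independent, so their network waits overlap instead of adding up.
	// Table, bucket, and object names are placeholders.
	func fetchBoth(ctx context.Context, db *dynamodb.DynamoDB, s3c *s3.S3) error {
		errc := make(chan error, 2)

		go func() {
			_, err := db.GetItemWithContext(ctx, &dynamodb.GetItemInput{
				TableName: aws.String("example-table"),
				Key: map[string]*dynamodb.AttributeValue{
					"id": {S: aws.String("123")},
				},
			})
			errc <- err
		}()

		go func() {
			_, err := s3c.GetObjectWithContext(ctx, &s3.GetObjectInput{
				Bucket: aws.String("example-bucket"),
				Key:    aws.String("example-object"),
			})
			errc <- err
		}()

		// Synchronization point: wait for both goroutines before continuing.
		for i := 0; i < 2; i++ {
			if err := <-errc; err != nil {
				return err
			}
		}
		return nil
	}

Two 25ms waits done this way cost roughly 25ms of wall-clock time instead of 50ms, which is how the run time comes back even on the slower 128MB CPUs.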

And the biggest win of all: I converted my function initialization to run in parallel. Initializing AWS service clients is especially slow when it involves network round trips. My cold start init used to take almost 2 seconds. I converted all of those New() calls to run in parallel, and now my cold start init finishes in 150-200ms on 128MB nodes. In some functions I initialize clients for eight different services in parallel.

Bottom line: after doing all of this, my functions run in approximately the same time or less than they did previously. But now they are all running on 128MB nodes, which cuts my lambda bills by 75%. The CPU allocation is smaller on 128MB nodes, but the increase in parallelism more than offsets it.

Another observation: if I set these optimized functions to run on 512MB nodes, their performance hardly changes. That’s because their run times are dominated by a chain of wait times that can’t be reduced due to dependencies. Running on 512MB increases the cost 4x and reduces function run time by only 3-4%.

Note: of course some lambdas are CPU bound, and bigger instances with more memory make sense for those. Example: image conversion.


Nice and detailed. Thanks a lot.

I recommend publishing this as a formal blog post with sample code, for example on https://medium.com/.

Here is an example of how to run init in parallel.

	package main

	import (
		"os"

		"github.com/aws/aws-lambda-go/lambda"
		"github.com/aws/aws-sdk-go/aws"
		"github.com/aws/aws-sdk-go/aws/session"
		"github.com/aws/aws-sdk-go/service/dynamodb"
		"github.com/aws/aws-sdk-go/service/iot"
		"github.com/aws/aws-sdk-go/service/iotdataplane"
		"github.com/aws/aws-sdk-go/service/s3"
		"github.com/aws/aws-sdk-go/service/ssm"
		"github.com/aws/aws-xray-sdk-go/xray"
	)

	// Service clients shared by the handler; created once per container during init.
	var dbs *dynamodb.DynamoDB
	var iots *iot.IoT
	var iotdatas *iotdataplane.IoTDataPlane
	var s3s *s3.S3
	var ssms *ssm.SSM
	var privateKey *string

	func initDB(sess *session.Session, c chan bool) {
		// Create DynamoDB client
		dbs = dynamodb.New(sess)
		xray.AWS(dbs.Client)
		c <- true
	}

	func initIOT(sess *session.Session, c chan bool) {
		// Create IoT control-plane client
		iots = iot.New(sess)
		xray.AWS(iots.Client)
		c <- true
	}

	func initIOTDataplane(sess *session.Session, c chan bool) {
		// Create IoT data-plane client, pointing at the endpoint from the environment
		iotdatas = iotdataplane.New(sess, &aws.Config{
			Endpoint: aws.String(os.Getenv("IOT_ENDPOINT")),
		})
		xray.AWS(iotdatas.Client)
		c <- true
	}

	func initS3(sess *session.Session, c chan bool) {
		// Create S3 client
		s3s = s3.New(sess)
		c <- true
	}

	func initSSM(sess *session.Session, c chan bool) {
		// Create SSM client and fetch the VAPID private key parameter
		ssms = ssm.New(sess)
		parameter, err := ssms.GetParameter(&ssm.GetParameterInput{
			Name:           aws.String("/VAPID/dev/privateKey"),
			WithDecryption: aws.Bool(true),
		})
		if err == nil {
			privateKey = parameter.Parameter.Value
		}
		c <- true
	}

	func init() {
		xray.Configure(xray.Config{
			LogLevel:       "info", // default
			ServiceVersion: "1.2.3",
		})

		sess := session.Must(session.NewSession())

		c := make(chan bool)
		go initDB(sess, c)           // 1
		go initS3(sess, c)           // 2
		go initIOT(sess, c)          // 3
		go initIOTDataplane(sess, c) // 4
		go initSSM(sess, c)          // 5

		_, _, _, _, _ = <-c, <-c, <-c, <-c, <-c // wait for all five init goroutines to finish
	}

	func main() {
		// Handler is the Lambda handler function, defined elsewhere in the package.
		lambda.Start(Handler)
	}

Thanks for the code.

I read a blog post a while back with a memory performance benchmark for Lambda. Pasting it here for reference.

How to make Lambda faster: memory performance benchmark

Note that the Fibonacci test in that article is totally compute bound. When you are compute bound, it pays to increase your memory and get a faster processor (up to a point). The last chart in the linked article puts the cutover at 2GB.

My functions are IO bound and are simply waiting for responses from other AWS services. They only need about 30ms of actual compute time. When you are IO bound, you should reduce your memory as much as is practically possible. After all, why pay for lambda time when all you are doing is waiting for remote IO to complete?

Until I added all of the X-Ray code, I was not aware of how IO bound my functions were. Now I know their run time is totally dominated by IO wait time.

So all the observations are consistent: up to a 2GB lambda if you are compute bound, a 128MB lambda if you are IO bound, and the biggest lambda offered if you are memory bound.


Totally makes sense. Thanks.