5 Metrics to track when working with Cloud Functions


Let's get started with Stackdriver

This post is part of a series to get you started with Site Reliability Engineering (SRE) practices when working with serverless architectures, fully serverless or not.

The road is long before we can even start to talk about SRE, and longer still when the community does not yet agree on what it means in a serverless paradigm. We can change this, and we can make it work, but at the end of the day it's all about metrics, so we need to start measuring. If you need some help getting started with custom logs, take a look at this post.

This time we are going to focus on some key metrics that GCP already collects for you, and take advantage of them. For the purposes of this sample, we are going to use Cloud Functions and Firestore.

This is what we are building

[Image: the finished production dashboard]

Getting started with Stackdriver

To make a long story short, Stackdriver is a tool that lets us visualize metrics, create dashboards and keep track of them with ease.

The dashboard we are going to build will help you to:

  • Keep track of how much memory your functions are consuming
  • Keep track of the errors per period of time
  • Keep track of your daily Cloud Functions Execution Count
  • Keep track of your Cloud Functions Execution Time
  • Keep track of your daily reads (Firestore)

This will let you make data-driven decisions about allocating more memory to your functions, and track possible bugs, timeouts and response times. Remember, you cannot control what you don't measure.

Make sure you enable the Monitoring API and set yourself up in Stackdriver (this assumes you already have some data).

Keep track of Functions Memory

Allocating more memory to functions costs money, but it also affects the performance of your operations. You really want to know when to upgrade the 256 MB you get by default, while making sure you pay only for what you need. I will explain the first metric in detail; you can check the video for more details about the others.

We have to:

  • Create a Dashboard on Stackdriver
  • Add Metrics
  • Select cloud_functions as the source
  • Select Memory as the metric
  • Select None as the aggregator
  • Select the 99th percentile Aligner
  • Set a threshold that makes sense for you; in this case, I will set 220 MB
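
For readers who prefer configuration over clicking, the steps above can be sketched roughly as a chart definition. This is only an illustration: the field names follow the Cloud Monitoring dashboards JSON format as I understand it, the metric type `cloudfunctions.googleapis.com/function/user_memory_bytes` and the 60-second alignment period are assumptions, and the 220 MB threshold is read as mebibytes. Check all of them against the current documentation before relying on this.

```python
import json

MIB = 1024 * 1024  # assumption: "220 MB" is interpreted as 220 MiB


def memory_widget(threshold_mb: int = 220) -> dict:
    """Build a chart definition mirroring the console steps (hypothetical helper)."""
    return {
        "title": "Cloud Functions memory (p99)",
        "xyChart": {
            "dataSets": [{
                "plotType": "LINE",
                "timeSeriesQuery": {
                    "timeSeriesFilter": {
                        # Source: Cloud Functions, metric: memory usage.
                        "filter": (
                            'metric.type="cloudfunctions.googleapis.com/'
                            'function/user_memory_bytes" '
                            'AND resource.type="cloud_function"'
                        ),
                        "aggregation": {
                            "alignmentPeriod": "60s",  # assumed period
                            "perSeriesAligner": "ALIGN_PERCENTILE_99",
                            "crossSeriesReducer": "REDUCE_NONE",
                        },
                    }
                },
            }],
            # The metric is reported in bytes, so convert the threshold.
            "thresholds": [{"value": threshold_mb * MIB}],
        },
    }


widget = memory_widget()
print(json.dumps(widget, indent=2))
```

Keeping the threshold as a parameter makes it easy to revisit once you change the memory allocated to your functions.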

[Image: memory chart]

What we did here is create a tool to quickly spot how our functions behave and, more importantly, when to make decisions about them. In a future post, I will show you how to create Incidents when those thresholds are passed and how you can be notified quickly.

Keep track of Errors

In this case, I'm more interested in keeping track of the count of errors over a period of time. We are going to use a Stacked Bar chart and separate each of our functions by color.

[Image: errors chart]

You could also filter by your custom errors or by the severity of the log. I will talk about how to work with custom logs in the next post of this series.

In this case, it's probably a good idea to set a threshold of 5 per minute. This is very useful after new deployments; you know… production is another world 🙄.
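
To make the "5 per minute" idea concrete, here is a small sketch of what such a threshold checks. The filter string assumes the `execution_count` metric with a `status` label whose successful value is `"ok"`; the helper `minutes_over_threshold` and the sample timestamps are hypothetical and only illustrate the bucketing.

```python
from collections import Counter
from datetime import datetime

# Assumed Monitoring filter: every execution whose status is not "ok".
ERROR_FILTER = (
    'metric.type="cloudfunctions.googleapis.com/function/execution_count" '
    'AND resource.type="cloud_function" '
    'AND metric.labels.status!="ok"'
)


def minutes_over_threshold(error_times, threshold=5):
    """Return the minutes that contain more than `threshold` errors."""
    # Truncate each timestamp to its minute, then count per minute.
    per_minute = Counter(t.replace(second=0, microsecond=0) for t in error_times)
    return sorted(minute for minute, n in per_minute.items() if n > threshold)


# Hypothetical data: six errors at 10:00 and two at 10:01.
sample = [datetime(2024, 1, 1, 10, 0, s) for s in range(0, 60, 10)]
sample += [datetime(2024, 1, 1, 10, 1, 5), datetime(2024, 1, 1, 10, 1, 40)]

noisy_minutes = minutes_over_threshold(sample)
print(noisy_minutes)
```

Only the 10:00 minute is flagged here, since it is the only one with more than five errors.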

Cloud Functions Execution Count

I use this as a reference, but at some point you might want to build a forecast based on previous data.

[Image: execution count chart]

In this case, I'm interested in my daily consumption. Notice that I set the Alignment Period to 1440 minutes, which is equivalent to 24 hours. This also means you have to set the time period to 1w in order to visualize it better.
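
As a quick sanity check on that number, 1440 minutes is 24 × 60. If you ever express the same setting through the Monitoring API, the alignment period is written in seconds. The sketch below does the conversion; pairing it with `ALIGN_SUM` to get a daily total is my assumption, not something shown in the console steps.

```python
from datetime import timedelta


def alignment_period(minutes: int) -> str:
    """Convert minutes into the seconds-based duration string, e.g. '60s'."""
    return f"{int(timedelta(minutes=minutes).total_seconds())}s"


# One bucket per day; summing each bucket gives a daily execution count.
daily_aggregation = {
    "alignmentPeriod": alignment_period(1440),
    "perSeriesAligner": "ALIGN_SUM",  # assumed aligner for a daily total
}

print(daily_aggregation)
```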

Cloud Functions Execution Time

This one is my favorite, especially when you work with callable or HTTP functions: you can quickly spot which function is delaying your front end and take action.

[Image: execution time chart]

I personally like to separate my background functions from the HTTPS functions in this case. You can achieve this by applying filters on function_name.
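
A filter on function_name can be sketched as a Monitoring filter string. The function names below are hypothetical, and the metric type `cloudfunctions.googleapis.com/function/execution_times` (reported in nanoseconds, as far as I know) is an assumption to verify against the metrics list.

```python
def execution_time_filter(function_names):
    """Build a filter limited to the given function names (hypothetical helper)."""
    if not function_names:
        raise ValueError("at least one function name is required")
    names = " OR ".join(
        f'resource.labels.function_name="{name}"' for name in function_names
    )
    return (
        'metric.type="cloudfunctions.googleapis.com/function/execution_times" '
        f'AND resource.type="cloud_function" AND ({names})'
    )


def ns_to_ms(nanoseconds: float) -> float:
    """Convert nanoseconds to milliseconds for easier reading."""
    return nanoseconds / 1_000_000


# Hypothetical names: one chart for HTTPS functions, another for background ones.
https_filter = execution_time_filter(["api-getUser", "api-createOrder"])
background_filter = execution_time_filter(["onUserCreated"])

print(https_filter)
print(background_filter)
```

Building one filter per group gives you two charts, so a slow background job does not hide a slow HTTPS endpoint.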

Firestore Daily reads

The Firebase console has a nice dashboard that gives you this information. However, I would rather consolidate all my metrics on a single dashboard instead of looking in different places, so let's take a look at this.

Notice I'm using a different source in this case.

[Image: Firestore daily reads and writes chart]

Notice I also added the writes as a bonus ;-).

If you really have to stay on the free tier, you should probably set a daily threshold accordingly.
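
As an illustration of such a daily threshold, the sketch below compares a day's counts with the free-tier limits. The limits (50,000 reads and 20,000 writes per day) are the figures I know for the free quota and may change, so treat them as assumptions; the helper `quota_usage`, the 80% warning level and the sample counts are hypothetical.

```python
# Assumed free-tier daily limits; verify against the current pricing page.
FREE_TIER_DAILY = {"reads": 50_000, "writes": 20_000}


def quota_usage(counts, limits=FREE_TIER_DAILY, warn_at=0.8):
    """Label each operation as 'ok', 'warn' or 'over' for the day."""
    report = {}
    for operation, limit in limits.items():
        ratio = counts.get(operation, 0) / limit
        if ratio >= 1:
            report[operation] = "over"
        elif ratio >= warn_at:
            report[operation] = "warn"
        else:
            report[operation] = "ok"
    return report


# Hypothetical daily totals read from the dashboard.
today = {"reads": 42_000, "writes": 3_000}
print(quota_usage(today))
```

Warning before the limit is reached gives you time to react, rather than finding out once the quota is already gone.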

My goal with this series of posts about Reliability is to spark your curiosity. As engineers, we need to be in control of our tools. Even when bad things happen, you should be able to write a post-mortem summary and take action to prevent it from happening again.
