Let's get started with Stackdriver
This post is part of a series to get you started with Site Reliability Engineering practices when working with serverless architectures (fully serverless or not).
There is a long road ahead before we can even start talking about SRE, even more so when the community does not yet agree on what it means in a serverless paradigm. We can change this and make it work, but at the end of the day it's all about metrics, so we need to start measuring. If you need some help getting started with custom logs, take a look at this post.
This time we are going to focus on some key metrics that GCP already collects for you and take advantage of them. For the purpose of this sample, we are going to use Cloud Functions and Firestore.
This is what we are building
To make a long story short, Stackdriver is a tool that allows us to visualize metrics, create dashboards and keep track of them with ease.
The dashboard we are going to build will help you to:
This will allow you to make data-driven decisions about allocating more memory to your functions, and to track possible bugs, timeouts and response times. Remember, you cannot control what you don't measure.
Make sure you enable the Monitoring API and set yourself up in Stackdriver (this assumes you already have some data).
Allocating more memory to functions costs money, but it also affects the performance of your operations. You really want to know when to upgrade those 256 MB you get by default, while ensuring you pay only for what you need. I will explain the first one in detail; you can check the video for more details about the others.
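As a rough illustration of the "when to upgrade" question, the sketch below compares peak memory samples against the allocated limit. The function name `memory_headroom`, the sample values and the 80% warning ratio are all assumptions made for this example; in practice the samples would come from the memory usage chart on your dashboard.

```python
MB = 1024 * 1024


def memory_headroom(samples_bytes, allocated_mb=256, warn_ratio=0.8):
    """Return (peak utilization, should_upgrade) for one function.

    samples_bytes: memory usage samples in bytes (hypothetical values below).
    allocated_mb:  memory allocated to the function (256 MB is the default
                   mentioned in this post).
    warn_ratio:    assumed cut-off, tune it to your own workload.
    """
    if not samples_bytes:
        raise ValueError("need at least one memory sample")
    peak = max(samples_bytes)
    utilization = peak / (allocated_mb * MB)
    return utilization, utilization >= warn_ratio


# Hypothetical samples for a single function.
samples = [120 * MB, 180 * MB, 230 * MB]
utilization, should_upgrade = memory_headroom(samples)
print(f"peak utilization: {utilization:.0%}, upgrade: {should_upgrade}")
```

A function that regularly peaks close to its limit is a candidate for more memory, while one that stays far below it may be over-provisioned.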
We have to:
What we did here is create a tool that lets us quickly spot how our functions behave and, more importantly, when to make decisions about them. In a future post, I will show you how to create incidents when those thresholds are exceeded and how you can be notified quickly.
In this case, I'm more interested in keeping track of the error count over a period of time, so we are going to use a Stacked Bar chart and distinguish each of our functions by color.
You could also filter by your custom errors or by log severity. I will talk about how to work with custom logs in the next post of this series.
In this case, it's probably a good idea to set a threshold of 5 per minute. This is very useful after new deployments; you know… production is another world 🙄.
I use this as a reference, but at some point you might be interested in forecasting based on previous data.
In this case, I'm interested in my daily consumption. Notice that I set the Alignment Period to 1440 minutes, which is equivalent to 24 hours; this also means you have to set the time range to 1w in order to visualize it better.
This one is my favorite, especially when you work with callable or HTTP functions: you can quickly spot which function is delaying your front end and take action.
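To make the "5 per minute" idea concrete, here is a minimal sketch that buckets error timestamps by minute and reports the minutes that went over the threshold. The function name, the sample timestamps and the choice to flag only counts strictly above the threshold are assumptions for illustration; Stackdriver evaluates this for you once the threshold is set on the chart.

```python
from collections import Counter
from datetime import datetime


def minutes_over_threshold(error_timestamps, threshold=5):
    """Return the minutes whose error count is strictly above the threshold."""
    # Truncate every timestamp to the start of its minute, then count per bucket.
    per_minute = Counter(
        ts.replace(second=0, microsecond=0) for ts in error_timestamps
    )
    return sorted(minute for minute, count in per_minute.items() if count > threshold)


# Hypothetical data: six errors at 10:00 and two errors at 10:01.
errors = [datetime(2020, 1, 1, 10, 0, s) for s in (1, 5, 9, 20, 41, 59)]
errors += [datetime(2020, 1, 1, 10, 1, s) for s in (3, 30)]

flagged = minutes_over_threshold(errors)
print(flagged)
```

Only the 10:00 bucket is reported, since it holds six errors while 10:01 holds two.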
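The arithmetic behind those two settings can be checked quickly. This is only a sketch of the unit conversion: 1440 minutes is one day, and a one-week window aligned to one day yields seven points on the chart. The helper name `points_in_window` is hypothetical.

```python
from datetime import timedelta


def points_in_window(window, alignment):
    """Number of whole alignment periods that fit in the charted window."""
    return window // alignment


alignment = timedelta(minutes=1440)  # the Alignment Period used above
window = timedelta(weeks=1)          # the "1w" time range

alignment_seconds = int(alignment.total_seconds())
points = points_in_window(window, alignment)

print(alignment == timedelta(hours=24))  # 1440 minutes is exactly one day
print(alignment_seconds)                 # the same period expressed in seconds
print(points)                            # one data point per day over a week
```

With a shorter window, such as one day, the chart would show a single point, which is why the wider range reads better.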
I personally like to separate my background functions from the HTTPS functions in this case; you can achieve this by applying filters by
The Firebase console has a nice dashboard that shows this information. However, I would rather consolidate all my metrics on a single dashboard instead of looking in different places. Let's take a look at this.
Notice I'm using a different source in this case.
Notice I also added writes as a bonus ;-).
If you really have to stay on the free tier, you should probably set a daily threshold accordingly.
My goal with this series of posts about reliability is to spark your curiosity. As engineers, we need to be in control of our tools; even when bad things happen, you should be able to write a post-mortem summary and take action to prevent it from happening again.
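One way to pick that daily threshold is to alert at a fraction of the quota rather than at the quota itself. The sketch below assumes free-tier limits of 50,000 reads and 20,000 writes per day (please verify them against the current Firestore pricing page) and an arbitrary 80% safety margin; the names and usage numbers are hypothetical.

```python
# Assumed free-tier daily quotas; check the current Firestore pricing page.
FREE_TIER_DAILY = {"reads": 50_000, "writes": 20_000}


def daily_thresholds(quota, safety_percent=80):
    """Alert thresholds set at a percentage of each daily quota."""
    return {op: limit * safety_percent // 100 for op, limit in quota.items()}


def over_threshold(usage, thresholds):
    """Operations whose daily count has reached its alert threshold."""
    return [op for op, count in usage.items() if op in thresholds and count >= thresholds[op]]


thresholds = daily_thresholds(FREE_TIER_DAILY)

# Hypothetical usage for one day.
usage_today = {"reads": 41_000, "writes": 1_200}
print(thresholds)
print(over_threshold(usage_today, thresholds))
```

Alerting below the hard limit leaves some time to react before the quota is actually exhausted.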