Adding Deployment Health Checks
Applications that you care about should be tested.
While we won't get into testing in this tutorial, Fly does feature health checks that can be thought of as a last line of defense when deploying an application.
TCP Health Check
In the fly.toml file we can see that Fly already has a TCP health check.
Lecturer: 0:00 Applications that you care about should be tested, and we're not going to get into testing specifically in this tutorial. However, we do have a little bit of a last line of defense that is offered to us by Fly in the form of health checks.
0:14 In fact, we already have one health check here. It's called a TCP check, which will make sure that our application is ready to accept network connections, but it doesn't actually connect to our application, make a request, or anything like that.
0:28 What would be really nice is if, right before deploying, we could just make a quick HTTP call to our application and say, "Can you accept traffic yet?" If it responds with a 200 OK, then let's go ahead and deploy the app. If it doesn't, then let's not deploy the app. Let's keep the old one around and then go figure out why it didn't respond.
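As a rough illustration of the question such a check asks, here is a minimal JavaScript sketch. It is not how Fly implements its checks. The `probe` function, its `fetchImpl` parameter, and the 500 ms default are all assumptions made for this example.

```javascript
// Hypothetical helper: resolves to true only when the URL answers with a
// 200 status within timeoutMs. fetchImpl is injectable so the sketch can be
// tried without a network; it defaults to the global fetch (Node 18+).
async function probe(url, { timeoutMs = 500, fetchImpl = fetch } = {}) {
  try {
    const response = await fetchImpl(url, {
      method: 'GET',
      signal: AbortSignal.timeout(timeoutMs), // abort the request when time runs out
    });
    return response.status === 200;
  } catch (error) {
    // Connection refused, timeout, and similar failures all count as unhealthy.
    return false;
  }
}
```

A deploy script built on this idea would call `probe` against the new instance and only switch traffic over when it resolves to `true`.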
0:47 This is a last line of defense. It's not intended to replace regular testing, but it can be really helpful in deploying applications where there are a lot of moving parts and you want to have just one last little check. These are called HTTP checks. If you look up fly.toml and go to the configuration documentation page, then you can find the services.http_checks config.
1:12 We're going to copy this and bring it over here, right below our TCP check. This interval is how frequently this HTTP check is going to be made, because it's not just when the application starts. We want to make sure that our application is healthy over time, and so every 10 seconds, it will say, "Hey, are you still healthy?"
1:33 Then, the grace_period is how long after our application has started up does this HTTP check get called. I'm actually fine with it being one second for this simple application, but some applications take a little while to boot up. Depending on how long it takes for your application to boot up, you'll adjust that grace_period.
1:53 The method, we can do a GET. You could also do a HEAD request here. We'll just do a GET. We'll do the path to our home page. That seems like a pretty reasonable thing, that people should be able to get to our home page. That should cover a fair bit of ground as a last line of defense.
2:07 The protocol we're going to leave as HTTP because this is a behind-the-scenes network request that's going to be happening from a Fly server to our server, so there's no certificate going on there. We'll leave that as HTTP.
2:18 The restart_limit. This is how many times, after a failed HTTP check, Fly should automatically restart the server for you. The default is zero. I'm going to leave it there. What zero means is it disables that behavior. It will never restart your application once the HTTP checks start happening.
2:36 Depending on your use case or use situation, you might bump that up a couple of times. After a couple failures, go ahead and try and restart, and magically maybe it'll work. Then, the timeout. It's making requests to your application. How long are you OK with it waiting to get a response before it decides, "Oh, must be failing"? I'll just leave that as 200.
2:57 Something more reasonable, I suppose, might be 500 milliseconds. It depends on how fast your application responds. The tls_skip_verify. This doesn't actually matter all that much because our protocol is HTTP. Then, any headers. If you want to make authenticated requests or something like that, you can add headers down here.
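Put together, the settings walked through above might look roughly like this in fly.toml. This is a sketch based on the narration, not a copy of the file on screen: the duration syntax and the `tls_skip_verify` value are assumptions, so check them against the Fly configuration documentation.

```toml
# Sketch of the HTTP check described in the narration, placed inside the
# existing [[services]] block, right below the TCP check.
  [[services.http_checks]]
    interval = "10s"         # how often the check runs
    grace_period = "1s"      # wait after boot before the first check
    method = "get"           # a HEAD request would also work
    path = "/"               # the home page
    protocol = "http"        # internal request, no certificate involved
    restart_limit = 0        # 0 disables automatic restarts
    timeout = "500ms"        # narration starts at 200, then suggests 500 ms
    tls_skip_verify = false  # assumed value; matters little over plain http
    [services.http_checks.headers]
    # headers for authenticated checks would go here
```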
3:17 With all of that, let's do a fly deploy to get a basic health check out there. Then, after this is successfully deployed and it's running, let's go ahead and see what the experience would be like if we actually did deploy something that totally blew up our home page, for example. We'll speed up here a little bit to get it deployed, and then we'll give that a test.
3:46 Here, you'll see that it has two health checks. Before, it only had the one, TCP, but now we've added our HTTP check, and so now there are two of them. After a while, they're both passing, so if we go and take a look at our application, we can see that it's still running. Everything is working just fine and things are happy.
4:04 Of course, you probably ran your tests as well, hopefully, and so you are pretty confident that things are good. This just double-checks that everything is going to work out. Now let's make a little mistake here. What if I misspell GET right here? That would not be good, and I don't have any TypeScript that's going to double-check that that's right.
4:23 Maybe I should, but let's go ahead and do a fly deploy right now and see what happens as that gets deployed. Let's expand this so we can see all of the output. We'll speed that up for you too. Now we see that we do still have those two health checks.
4:42 One is passing, one is critical. It's not passing, and if we look at our logs over here, we're going to see things are not looking super great for our application. The reason is, if we look at our code, when we had that typo on GET, what happens is it passes this case.
5:01 It continues down in here, and it's going to give us a 404, not found, which is not going to pass our health check. Eventually, Fly is going to decide, "OK, that's not going to fly," literally, and it will roll back to the previous version, so we'll wait for that. Great. Our deploy failed due to unhealthy allocations, so it rolled back to job version 18 and deployed as version 20.
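The application code is not reproduced in this transcript, so the following is only a hypothetical reconstruction of the failure: a switch on the method and URL where a typo in the case label lets a real `GET /` request fall through to the 404 branch. The names `createRouter`, `working`, and `broken` are made up for the illustration.

```javascript
// Hypothetical stand-in for the tutorial's request handler. The route key
// is passed in so the correct and the misspelled version can be compared.
function createRouter(homeRouteKey) {
  return function route(method, url) {
    switch (`${method} ${url}`) {
      case homeRouteKey:
        return { status: 200, body: 'Home page' };
      default:
        // Anything that matches no case ends up here.
        return { status: 404, body: 'Not found' };
    }
  };
}

const working = createRouter('GET /');
const broken = createRouter('GTE /'); // the typo: no real request ever matches

console.log(working('GET', '/').status); // 200
console.log(broken('GET', '/').status);  // 404
```

Because the health check requests `GET /`, the misspelled version answers 404 on every check and never turns healthy.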
5:31 We tried to deploy the version that had the breakage in our index here, where we misspelled GET, and that deployed, but Fly didn't actually start sending traffic to it until after the health checks passed. They never passed, and so it rolled back and restarted our application with the previous working version, the version that was up before.
5:56 It shut down the version that was not working and wasn't able to accept traffic, and it started up a previous version that was working before, and now our application is working. There are various strategies for deploying with Fly that you can use. There's the rolling strategy, which is the default for a Node application that has a persisted volume, and that's the one that we have.
6:19 There are other strategies you can employ to reduce the amount of time that your deployment is waiting on health checks. In our case, the right thing happened. We did not deploy an app that didn't work. We just made traffic hold off while the deploy was happening and while those health checks were being checked.
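The strategy does not have to be left to the default. As far as I know, fly.toml accepts a `[deploy]` section for this; treat the fragment below as an assumption to verify against the Fly documentation rather than as something shown in the video.

```toml
# Assumed fly.toml fragment: selects the deployment strategy explicitly.
[deploy]
  strategy = "rolling"
```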
6:38 Because they didn't successfully happen, we ended up shutting down the new deploy, starting up the previous version, so that we could accept traffic. Then, hopefully, we get some alerts and say, "Hey, that deploy did not work, and so you're going to want to take a look at it and figure out what you did wrong."
6:55 Another thing that I want to do with this is add another health check to make it really certain that things are working, because you can add multiple of these, and that can be quite useful. For example, on my own personal website, I have a health check endpoint, so we'll do the same thing here, /healthcheck.
7:13 What this does is it not only will check that I can accept traffic on my home page, but also will make some queries to the database to make sure that the database is in a healthy state as well. We'll go ahead and add a health check endpoint here at the top. Case, it'll be a GET request to /healthcheck.
7:35 With this, we'll just do a try-catch. Of course, we need to add a break. It's a switch statement, after all. In the catch, we'll add a console.error. We'll log the error, and then we'll say, res.writeHead(500) and res.end. We can do "Internal Server Error" or just "ERROR," whatever you want to do there.
7:55 The actual check that we want to do, let's just say we want to do this query that we're making on the back end. We'll say, await getCurrentCount, and res.writeHead(200) and res.end("OK"). We're all set there, and so now we have three health checks.
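The endpoint described here could be sketched as below. In the tutorial this logic lives inside a case of the switch statement; it is pulled out into a function so it can be read and tried on its own. `handleHealthcheck` and its `query` parameter are names invented for this sketch, and the `getCurrentCount` body is a placeholder for the tutorial's real database query.

```javascript
// Placeholder for the tutorial's database helper (the real one runs a query).
async function getCurrentCount() {
  return 0;
}

// Responds 200 "OK" when the query succeeds and 500 "ERROR" when it throws.
// res only needs writeHead() and end(), like Node's http.ServerResponse.
async function handleHealthcheck(res, query = getCurrentCount) {
  try {
    await query();
    res.writeHead(200);
    res.end('OK');
  } catch (error) {
    console.error(error); // log the failure so it shows up in the app logs
    res.writeHead(500);
    res.end('ERROR');
  }
}
```

Any other dependency the app relies on can be awaited inside the same `try` block, so a single failing service turns the whole check into a 500.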
8:13 We've got the TCP health check that came with the initial fly.toml that we started with, the HTTP health check for the home page, and the additional health check endpoint that we can do whatever types of health checks we want. Make sure we have connections with the right services, all of that stuff happening inside of our health check endpoint.
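For Fly to call the new endpoint, it needs its own check entry. Since `[[services.http_checks]]` is a TOML array of tables, repeating the header adds another check next to the home page one. The values below are assumptions that mirror the home page check; the narration does not spell them out.

```toml
# Sketch of an additional HTTP check for the /healthcheck endpoint,
# placed in the same [[services]] block as the existing checks.
  [[services.http_checks]]
    interval = "10s"
    grace_period = "1s"
    method = "get"
    path = "/healthcheck"
    protocol = "http"
    restart_limit = 0
    timeout = "500ms"
    tls_skip_verify = false
```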
8:34 That's how you make sure that you don't deploy a broken app to production, as a last line of defense. Again, this is not a replacement for tests, but it's a really good last line of defense, right before you switch from almost deployed to actually deployed and accepting traffic. Really helpful feature from Fly.