# Liveness and Readiness Probes
To observe an application or service, there are two different states that should be monitored. A monitoring tool needs to be able to poll the current state of each service by calling well-defined endpoints. This polling is usually done every 5 to 10 seconds.
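As an illustration, such a poller could look like the following minimal sketch, assuming Node.js 18+ (which ships a global `fetch`); the URL and interval are examples, not requirements:

```js
// Minimal polling sketch (assumption: Node.js 18+ with a global fetch).
// The URL and the 10 second interval are examples, not requirements.
const HEALTH_URL = 'http://payment-gateway:8080/health'

setInterval(async () => {
  try {
    const res = await fetch(HEALTH_URL)
    // 200 OK means healthy, everything else is treated as unhealthy
    console.log(`health check: ${res.status === 200 ? 'healthy' : 'unhealthy'} (${res.status})`)
  } catch (error) {
    // the service did not answer at all
    console.error('health check failed', error)
  }
}, 10_000)
```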
# Liveness-Probe / Health-check
The Liveness-Probe or health-check indicates whether the service is up and running.
# Readiness-Probe
The Readiness-Probe indicates whether the service is ready to process requests.
# Difference Example:
A service starts in a Docker container. When start-up is finished, it is healthy, but not necessarily ready: it may first need to establish a database connection or run a migration before it can fulfill its tasks. So there is a possible delay between "up and running" (healthy) and "ready for work" (ready).
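A sketch of this start-up gap, assuming hypothetical `connectToDatabase` and `runMigrations` helpers: the process is alive from the first line, but only flips its readiness flag once initialization has finished.

```js
// Sketch of the start-up gap; connectToDatabase() and runMigrations() are hypothetical helpers.
let ready = false // the process is alive ("healthy"), but not yet ready for work

async function initialize() {
  await connectToDatabase() // e.g., open a connection pool
  await runMigrations()     // e.g., apply pending schema migrations
  ready = true              // only now should /ready report 200 OK
}

initialize().catch((error) => {
  // initialization failed: stay not-ready and make the failure visible
  console.error('Initialization failed', error)
})
```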
# Handling for permanently running services
# /info HTTP endpoint
- A service MUST expose a `/info` HTTP endpoint to make basic information about the service easily accessible.
- Details:
  - The response MUST be in a human-readable format and it MUST include the service's version and name.
  - The response SHOULD be of content-type `application/json`.
  - The content MUST NOT include any security-sensitive information like versions of dependencies or infrastructural information.
# Example payload
```json
{
  "name": "payment-gateway",
  "version": "1.2.14"
}
```
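A minimal sketch of such an endpoint, assuming a Node.js service using Express (the framework and port are illustrative, not prescribed):

```js
// Sketch of an /info endpoint with Express (framework and port are assumptions).
const express = require('express')
const app = express()

app.get('/info', (req, res) => {
  // only non-sensitive, human-readable basics: name and version
  res.status(200).json({
    name: 'payment-gateway',
    version: '1.2.14',
  })
})

app.listen(8080)
```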
# /health HTTP endpoint
- A service MUST expose a `/health` HTTP endpoint to check its health/liveness.
- A service MUST send a `200 OK` response if it is considered healthy.
- Details:
  - The response MUST be in a human-readable format.
  - The response MUST use the HTTP status codes provided in the table below.
  - The response SHOULD be of content-type `application/json`.
# Example payload
```
// healthy
Status code 200
{
  "state": "OK"
}

// unhealthy
Status code 503
{
  "state": "not available"
}
```
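A sketch of the corresponding handler, again assuming Express and a hypothetical `isHealthy()` check:

```js
// Sketch of a /health endpoint (Express and isHealthy() are assumptions for illustration).
const express = require('express')
const app = express()

app.get('/health', (req, res) => {
  if (isHealthy()) {
    // up and running
    res.status(200).json({ state: 'OK' })
  } else {
    // still able to answer, but reports itself as unhealthy
    res.status(503).json({ state: 'not available' })
  }
})

app.listen(8080)
```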
# /ready HTTP endpoint
- A service MAY expose a `/ready` HTTP endpoint to check its readiness.
- A service MUST send a `200 OK` response if it is considered ready.
- Details:
  - The response MUST be in a human-readable format.
  - The response MUST use the HTTP status codes provided in the table below.
  - The response SHOULD be of content-type `application/json`.
# Example payload
```
// ready
Status code 200
{
  "state": "OK"
}

// not ready
Status code 503
{
  "state": "not available"
}
```
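A sketch of the corresponding handler, assuming Express and a `ready` flag that the service flips once its initialization (e.g., database connection, migrations) has finished:

```js
// Sketch of a /ready endpoint (Express and the `ready` flag are assumptions for illustration).
const express = require('express')
const app = express()

let ready = false // set to true once the database connection / migration has finished

app.get('/ready', (req, res) => {
  if (ready) {
    // initialization finished, the service can take workload
    res.status(200).json({ state: 'OK' })
  } else {
    // healthy but still starting (or stopping): reject workload for now
    res.status(503).json({ state: 'not available' })
  }
})

app.listen(8080)
```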
# Status Codes
- A service MAY provide additional information about its state by implementing a set of common responses:
| Condition | Readiness (not OK → no workload) | Liveness (not OK → restart) | State (not OK → restart) |
|---|---|---|---|
| starting | 503 - not available | 200 - OK | delay to avoid test |
| active | 200 - OK | 200 - OK | 200 - OK |
| stopping | 503 - not available | 200 - OK | 503 - not available |
| inactive | 503 - not available | 503 - not available | 503 - not available |
| faulty | 500 - server error | 500 - server error | 500 - server error |
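As an illustration, the table could be encoded as a simple lookup shared by the probe handlers (a sketch; the structure and names are assumptions, not part of this guideline):

```js
// Sketch: status codes per condition, taken from the table above (names are illustrative).
const PROBE_STATUS = {
  starting: { readiness: 503, liveness: 200 },
  active:   { readiness: 200, liveness: 200 },
  stopping: { readiness: 503, liveness: 200 },
  inactive: { readiness: 503, liveness: 503 },
  faulty:   { readiness: 500, liveness: 500 },
}

// e.g., a probe handler can look up the code for the current condition:
// res.status(PROBE_STATUS[condition].readiness)
```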
# Handling for services on demand / "scale zero"
While permanently running services can be called at any time, it makes no sense to invoke a scale-zero[^scale-zero] service, like a Lambda function, every few seconds to ask for its liveness or readiness. In these cases you MUST NOT implement these endpoints, but use other methods to make sure the service runs as expected.
# Example
A Lambda function is triggered by an HTTP endpoint. The first thing it SHOULD do is log the incoming request; after processing, it SHOULD log the response.
```js
// Inside the Lambda handler; logger and processRequest() are placeholders for illustration.
// logging process start on info level
logger.info('Start processing request', { request })
try {
  // do the processing
  const response = await processRequest(request)
  // logging process finished on info level
  logger.info('Finished processing request', { response })
} catch (error) {
  // an error occurred
  logger.error('Processing of request failed', { request, error })
}
```
Now you can set up alerting based on these logs, e.g., alert if the finished message has not been logged after a certain time[^certain-time] or if an error message was logged.
[^scale-zero]: A service instance is started on demand, e.g., Lambda or Fargate (with this setting).
[^certain-time]: Measure the maximal processing time and add a few seconds to your threshold for alerting.