# Liveness and Readiness Probes

To observe an application or service, two different states should be monitored: liveness and readiness. A monitoring tool needs to be able to poll the current state of each service by calling well-defined endpoints. This polling usually happens every 5 to 10 seconds.

# Liveness-Probe / Health-check

The Liveness-probe or health-check tells if the service is up and running.

# Readiness-Probe

The Readiness-Probe tells if the service is ready for processing.

# Difference Example:

A service starts in a Docker container. Once startup has finished, it is healthy but not necessarily ready: it may first need to establish a database connection or run a migration before it can do its work. So there can be a delay between being healthy ("up and running") and being ready ("ready for work").
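
The gap between healthy and ready can be sketched in TypeScript as follows. This is a minimal illustration, not a prescribed implementation: the `ServiceState` class and its method names are hypothetical, and the two startup steps are the ones mentioned above (database connection and migration).

```typescript
// Hypothetical helper that tracks startup progress; all names are illustrative.
class ServiceState {
  private dbConnected = false;
  private migrationsDone = false;

  // Liveness: the process is up and able to answer at all.
  isLive(): boolean {
    return true;
  }

  // Readiness: every startup step has completed.
  isReady(): boolean {
    return this.dbConnected && this.migrationsDone;
  }

  markDbConnected(): void {
    this.dbConnected = true;
  }

  markMigrationsDone(): void {
    this.migrationsDone = true;
  }
}

// Right after startup the service is live but not yet ready.
const state = new ServiceState();
console.log(state.isLive(), state.isReady()); // true false

state.markDbConnected();
state.markMigrationsDone();
console.log(state.isLive(), state.isReady()); // true true
```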

# Handling for permanently running services

# /info HTTP endpoint

  • A service MUST expose a /info HTTP endpoint to make basic information about the service easily accessible.
    • Details:
      • The response MUST be in a human-readable format and MUST include the service's name and version.
      • The response SHOULD be of content-type application/json.
      • The content MUST NOT include any security-sensitive information like versions of dependencies or infrastructural information.

# Example payload

{
  "name": "payment-gateway",
  "version": "1.2.14"
}
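
One way to keep security-sensitive data out of the response is to build the payload from an explicit allow-list of fields. The sketch below assumes a hypothetical `ServiceMetadata` object that may carry more than the /info endpoint is allowed to expose; the type names, the function name, and the sample values for `dependencies` and `host` are made up for illustration.

```typescript
// Hypothetical internal metadata; it may contain fields that must not be exposed.
interface ServiceMetadata {
  name: string;
  version: string;
  dependencies?: Record<string, string>;
  host?: string;
}

// The only fields the /info endpoint returns.
interface InfoPayload {
  name: string;
  version: string;
}

// Copy the allowed fields explicitly instead of spreading the whole object,
// so newly added metadata fields never leak by accident.
function buildInfoPayload(meta: ServiceMetadata): InfoPayload {
  return { name: meta.name, version: meta.version };
}

const body = JSON.stringify(
  buildInfoPayload({
    name: 'payment-gateway',
    version: '1.2.14',
    dependencies: { 'some-lib': '0.0.0' }, // sample data, dropped from the payload
    host: 'internal-host', // sample data, dropped from the payload
  }),
);
console.log(body); // {"name":"payment-gateway","version":"1.2.14"}
```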

# /health HTTP endpoint

  • A service MUST expose a /health HTTP endpoint to check its health/liveness.
  • A service MUST send a 200 OK response if it is considered healthy.
    • Details:
      • The response MUST be in a human-readable format.
      • The response MUST use the HTTP status codes provided in the Status Codes table below.
      • The response SHOULD be of content-type application/json.

# Example payload

// healthy
Status code 200
{
  "state": "OK"
}

// unhealthy
Status code 503
{
  "state": "not available"
}

# /ready HTTP endpoint

  • A service MAY expose a /ready HTTP endpoint to check its readiness.
  • A service MUST send a 200 OK response if it is considered ready.
    • Details:
      • The response MUST be in a human-readable format.
      • The response MUST use the HTTP status codes provided in the Status Codes table below.
      • The response SHOULD be of content-type application/json.

# Example payload

// ready
Status code 200
{
  "state": "OK"
}

// not ready
Status code 503
{
  "state": "not available"
}

# Status Codes

  • A service MAY provide additional information about its state by implementing a set of common responses:

| Condition | Readiness | Liveness | State |
| --- | --- | --- | --- |
| | Not OK - no workload | Not OK - restart | Not OK - restart |
| starting | 503 - not available | 200 - OK | delay to avoid test |
| active | 200 - OK | 200 - OK | 200 - OK |
| stopping | 503 - not available | 200 - OK | 503 - not available |
| inactive | 503 - not available | 503 - not available | 503 - not available |
| faulty | 500 - server error | 500 - server error | 500 - server error |
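
The Readiness and Liveness columns of the table above can be expressed as a simple lookup, which a /ready or /health handler could then use to pick its status code and body. This is only a sketch: the type names, the `probeResponse` function, and the `server error` state string (taken from the table's wording) are assumptions, and the State column is left out.

```typescript
type Condition = 'starting' | 'active' | 'stopping' | 'inactive' | 'faulty';
type Probe = 'readiness' | 'liveness';

interface ProbeResponse {
  status: 200 | 500 | 503;
  body: { state: string };
}

const OK: ProbeResponse = { status: 200, body: { state: 'OK' } };
const NOT_AVAILABLE: ProbeResponse = { status: 503, body: { state: 'not available' } };
const SERVER_ERROR: ProbeResponse = { status: 500, body: { state: 'server error' } };

// One entry per row of the table; one property per probe column.
const RESPONSES: Record<Condition, Record<Probe, ProbeResponse>> = {
  starting: { readiness: NOT_AVAILABLE, liveness: OK },
  active: { readiness: OK, liveness: OK },
  stopping: { readiness: NOT_AVAILABLE, liveness: OK },
  inactive: { readiness: NOT_AVAILABLE, liveness: NOT_AVAILABLE },
  faulty: { readiness: SERVER_ERROR, liveness: SERVER_ERROR },
};

// Look up the response a probe endpoint should send for the current condition.
function probeResponse(condition: Condition, probe: Probe): ProbeResponse {
  return RESPONSES[condition][probe];
}

// While starting, the service is live (200) but not yet ready (503).
console.log(probeResponse('starting', 'liveness').status); // 200
console.log(probeResponse('starting', 'readiness').status); // 503
```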

# Handling for services on demand / "scale zero"

While permanently running services can be called at any time, it makes no sense to invoke a scale-zero[^scale zero] service, such as a Lambda function, every few seconds to ask about its liveness or readiness. In these cases you MUST NOT implement these endpoints; use other methods to make sure the service runs as expected.

# Example

A Lambda function is triggered by an HTTP endpoint. The first thing it SHOULD do is log the incoming request, and after processing it SHOULD log the response.

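// handler and processRequest are placeholder names for the function's entry point and its actual work
async function handler(request) {
  // log the start of processing at info level
  logger.info('Start processing request', { request })

  try {
    // do the processing
    const response = await processRequest(request)

    // log the end of processing at info level
    logger.info('Finished processing request', { response })
    return response
  } catch (error) {
    // an error occurred: log it at error level, then rethrow
    logger.error('Processing of request failed', { request, error })
    throw error
  }
}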

Now you can set up alerting based on these log messages, e.g., alert if the finished message is not logged within a certain time[^certain time] or if an error message is logged.
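
The alerting rule could look roughly like the sketch below. In practice this logic usually lives in a log or monitoring platform rather than in application code; the example only illustrates the two conditions. It assumes that each log record carries a `requestId` to correlate start, finish, and error messages, which is not part of the logging example above. The `LogRecord` and `Alert` shapes and the `findAlerts` function are hypothetical.

```typescript
// Hypothetical shape of a parsed log record; the field names are assumptions.
interface LogRecord {
  requestId: string;
  level: 'info' | 'error';
  message: string;
  timestampMs: number;
}

interface Alert {
  requestId: string;
  reason: 'error logged' | 'finished message missing';
}

// Returns one alert per request that logged an error, or that started more
// than thresholdMs ago without logging the finished message.
function findAlerts(records: LogRecord[], nowMs: number, thresholdMs: number): Alert[] {
  const startedAt = new Map<string, number>();
  const finished = new Set<string>();
  const failed = new Set<string>();

  for (const record of records) {
    if (record.level === 'error') {
      failed.add(record.requestId);
    } else if (record.message === 'Start processing request') {
      startedAt.set(record.requestId, record.timestampMs);
    } else if (record.message === 'Finished processing request') {
      finished.add(record.requestId);
    }
  }

  const alerts: Alert[] = [];
  failed.forEach((requestId) => {
    alerts.push({ requestId, reason: 'error logged' });
  });
  startedAt.forEach((startMs, requestId) => {
    const overdue = nowMs - startMs > thresholdMs;
    if (overdue && !finished.has(requestId) && !failed.has(requestId)) {
      alerts.push({ requestId, reason: 'finished message missing' });
    }
  });
  return alerts;
}
```

The threshold would be chosen as described in the footnote: the maximum measured processing time plus a few seconds.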

[^scale zero] A service instance that is started on demand, e.g., Lambda or Fargate (with this setting).
[^certain time] Measure the maximum processing time and add a few seconds to get the alerting threshold.

Page Info: Created by GitHub on Jun 9, 2023