# Liveness and Readiness Probes

To observe an application or service, two different states should be monitored: its liveness and its readiness. A monitoring tool needs to be able to poll the current state of each service by calling well-defined endpoints. Usually this polling is done every 5 to 10 seconds.

# Liveness-Probe / Health-check

The Liveness-Probe or health-check tells whether the service is up and running.

# Readiness-Probe

The Readiness-Probe tells if the service is ready for processing.

# Difference Example:

A service starts in a Docker container. Once startup has finished, it is healthy but not necessarily ready. It may first need to establish a database connection or run a migration before it is ready to fulfill its tasks. So there can be a delay between being healthy ("up and running") and being ready ("ready for work").

# Handling for permanently running services

# `/info` HTTP endpoint

A service MUST expose a /info HTTP endpoint to make basic information about the service easily accessible.
- Details:
  - The response MUST be in a human-readable format and it MUST include the service's version and name.
  - The response SHOULD be of content-type application/json.
  - The content MUST NOT include any security-sensitive information like versions of dependencies or infrastructural information.

# Example payload

{
  name: 'payment-gateway',
  version: '1.2.14'
}
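
As a rough illustration of the rules above, the following Node.js sketch answers `GET /info` with only the name and version. The names `SERVICE_INFO`, `buildInfoResponse`, and `handleRequest` are hypothetical, and the 404 fallback is an assumption that this guideline does not prescribe.

```javascript
const http = require('http');

// Hypothetical service metadata; a real service might read this from its package.json.
const SERVICE_INFO = { name: 'payment-gateway', version: '1.2.14' };

// Builds the /info response. Only name and version are copied over, so nothing
// security-sensitive (dependency versions, infrastructure details) can leak.
function buildInfoResponse(info) {
  return {
    statusCode: 200,
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ name: info.name, version: info.version }),
  };
}

// Answers GET /info; every other request gets a 404 (an assumption of this sketch).
function handleRequest(req, res) {
  if (req.method === 'GET' && req.url === '/info') {
    const { statusCode, headers, body } = buildInfoResponse(SERVICE_INFO);
    res.writeHead(statusCode, headers);
    res.end(body);
    return;
  }
  res.writeHead(404, { 'Content-Type': 'application/json' });
  res.end(JSON.stringify({ error: 'not found' }));
}

// The server is created but not started; call server.listen(port) to serve requests.
const server = http.createServer(handleRequest);
```

Copying the two fields explicitly, instead of serializing a whole configuration object, is one way to keep unintended fields out of the response.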

# `/health` HTTP endpoint

A service MUST expose a /health HTTP endpoint to check its health/liveness.
A service MUST send a 200 OK response if it is considered healthy.
- Details:
  - The response MUST be in a human-readable format.
  - The response MUST use the HTTP status codes provided in the Status Codes table below.
  - The response SHOULD be of content-type application/json.

# Example payload

// healthy
Status code 200
{
  state: 'OK'
}

// unhealthy
Status code 503
{
  state: 'not available'
}
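
The two payloads above could be produced by a small helper like the one sketched here. `buildHealthResponse` and the `isHealthy` callback are hypothetical names; what counts as "healthy" is service-specific and not defined by this guideline. Mapping a throwing check to 500 follows the "faulty" row of the Status Codes table, and the `'server error'` state text is an assumption.

```javascript
// Builds the /health response from a caller-supplied liveness check.
// isHealthy is a hypothetical function returning true while the service is healthy.
function buildHealthResponse(isHealthy) {
  const headers = { 'Content-Type': 'application/json' };
  try {
    const healthy = Boolean(isHealthy());
    return {
      statusCode: healthy ? 200 : 503,
      headers,
      body: JSON.stringify({ state: healthy ? 'OK' : 'not available' }),
    };
  } catch (error) {
    // The check itself failed: report the service as faulty (500).
    return {
      statusCode: 500,
      headers,
      body: JSON.stringify({ state: 'server error' }),
    };
  }
}
```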

# `/ready` HTTP endpoint

A service MAY expose a /ready HTTP endpoint to check its readiness.
A service MUST send a 200 OK response if it is considered ready.
- Details:
  - The response MUST be in a human-readable format.
  - The response MUST use the HTTP status codes provided in the Status Codes table below.
  - The response SHOULD be of content-type application/json.

# Example payload

// ready
Status code 200
{
  state: 'OK'
}

// not ready
Status code 503
{
  state: 'not available'
}
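
To illustrate how readiness can lag behind liveness, the sketch below keeps a readiness flag that only flips once all startup tasks (for example a database connection or a migration) have completed. `createReadiness` and `buildReadyResponse` are hypothetical names, and the startup tasks are placeholders.

```javascript
// Tracks readiness separately from liveness: the process is alive as soon as it
// starts, but it becomes ready only after its startup tasks have finished.
function createReadiness() {
  let ready = false;
  return {
    isReady: () => ready,
    // Runs the given async startup tasks in order, then marks the service as ready.
    async start(tasks) {
      for (const task of tasks) {
        await task();
      }
      ready = true;
    },
  };
}

// Builds the /ready response from the current readiness state.
function buildReadyResponse(readiness) {
  const ready = readiness.isReady();
  return {
    statusCode: ready ? 200 : 503,
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ state: ready ? 'OK' : 'not available' }),
  };
}
```

A service would call something like `readiness.start([connectToDatabase, runMigrations])` during boot, where both task names are placeholders; until that promise resolves, `/ready` answers 503 while `/health` can already answer 200.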

# Status Codes

A service MAY provide additional information about its state by implementing a set of common responses:

| Condition | Readiness | Liveness | State |
| --- | --- | --- | --- |
| | Not OK - no workload | Not OK - restart | Not OK - restart |
| starting | 503 - not available | 200 - OK | delay to avoid test |
| active | 200 - OK | 200 - OK | 200 - OK |
| stopping | 503 - not available | 200 - OK | 503 - not available |
| inactive | 503 - not available | 503 - not available | 503 - not available |
| faulty | 500 - server error | 500 - server error | 500 - server error |
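
One possible way to keep handlers consistent with the table is a lookup keyed by condition. This is only a sketch: `PROBE_STATUS` and `statusFor` are hypothetical names, and the "State" column is left out because its "starting" entry is not a status code.

```javascript
// Status codes per service condition, transcribed from the Readiness and
// Liveness columns of the table above.
const PROBE_STATUS = {
  starting: { readiness: 503, liveness: 200 },
  active: { readiness: 200, liveness: 200 },
  stopping: { readiness: 503, liveness: 200 },
  inactive: { readiness: 503, liveness: 503 },
  faulty: { readiness: 500, liveness: 500 },
};

const hasOwn = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);

// Returns the HTTP status code for a condition ('starting', 'active', ...)
// and a probe ('readiness' or 'liveness'); throws on unknown input.
function statusFor(condition, probe) {
  if (!hasOwn(PROBE_STATUS, condition) || !hasOwn(PROBE_STATUS[condition], probe)) {
    throw new Error(`Unknown condition or probe: ${condition}/${probe}`);
  }
  return PROBE_STATUS[condition][probe];
}
```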

# Handling for services on demand / "scale zero"

While permanently running services can be called at any time, it makes no sense to invoke a scale-zero[^scale zero] service, like a Lambda function, every x seconds to ask for its liveness or readiness. In these cases you MUST NOT implement these endpoints, but use different methods to make sure the service runs as expected.

# Example

A Lambda function is triggered by an HTTP endpoint. The first thing it SHOULD do is log the incoming request, and after processing it SHOULD log the response.

// logging process start on info level
logger.info('Start processing request', { request })

// do the processing

// an error occurred
logger.error('Processing of request failed', { request, error })

// logging process finished on info level
logger.info('Finished processing request', { response })

Now you can set up alerting based on the logging, e.g., if the finished message is not logged after a certain time [^certain time] or if an error message was logged.
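
The logging steps above can be bundled into a reusable wrapper, sketched here under a few assumptions: `withRequestLogging` is a hypothetical helper, `logger` is any object with `info()` and `error()` methods, and the wrapped function receives the request as its only argument. On failure the "finished" message is never logged, which is exactly the signal the alerting relies on.

```javascript
// Wraps a request-processing function so that every invocation logs its start,
// its result, and any error.
function withRequestLogging(logger, processRequest) {
  return async function handler(request) {
    // logging process start on info level
    logger.info('Start processing request', { request });
    try {
      const response = await processRequest(request);
      // logging process finished on info level
      logger.info('Finished processing request', { response });
      return response;
    } catch (error) {
      // an error occurred
      logger.error('Processing of request failed', { request, error });
      throw error; // rethrow so the caller also sees the invocation as failed
    }
  };
}

// Usage with a hypothetical processing function; console provides info() and error().
const handler = withRequestLogging(console, async (request) => ({
  statusCode: 200,
  body: JSON.stringify({ received: request.id }),
}));
```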

[^scale zero]: Means a service instance is started on demand, e.g., Lambda or Fargate (with this setting).

[^certain time]: Measure the maximal processing time and add a few seconds to get your alerting threshold.
