commit 4da6f9020a343feccf432db5c3dabfe23c105f93
parent 3d8f0de592a4649ef2b107a5e2c9030637864038
Author: Antoine A <>
Date: Mon, 16 Feb 2026 13:15:36 +0100
dd84: improve proposal
Diffstat:
1 file changed, 34 insertions(+), 19 deletions(-)
diff --git a/design-documents/084-simple-observability.rst b/design-documents/084-simple-observability.rst
@@ -4,16 +4,24 @@ DD 84: Simple observability
Summary
=======
-We want a simple way to check whether our various services are working properly.
+We want a simple way to check whether our various services are working properly, and to be notified when they are not.
Motivation
==========
-We want to have observability, and the obvious maximalist solution is Prometheus and Grafana.
-The problem is that this system is so complex to configure and so cumbersome that we never have the time to configure properly.
-By trying to have a perfect solution, we end up with none at all.
+We recently had difficulties with degraded operation of libeufin-nexus: the systemd service appeared to be running correctly, but was actually failing in a loop. There is currently no way to detect this and receive alerts.
-I propose a simple solution based on health endpoints that should give us most of what we need more quickly.
+We have the same problem with other services. For example, when the exchange is misconfigured, wirewatch may fail to retrieve the transaction history from libeufin-nexus while the API keeps working correctly. For now we have to check for this manually.
+
+Right now we only have uptime checks with Uptime Kuma. We need a better observability system that can detect these degraded states and also provide some context on the failure for faster remediation.
+
+This is where observability comes in: it answers ``is it running`` and ``why is it not``. The standard, maximalist way to get it is to set up Prometheus and Grafana, but that has problems:
+
+- It is heavy and can take up more resources than all the services it is supposed to monitor.
+- It is complicated to set up, configure, and maintain.
+
+Because of these problems we keep postponing this setup, and as a result we have no solution at all today.
+
+I think we can build a simpler and lighter solution that answers ``is it running`` well, and ``why is it not`` in a minimalistic way.
Requirements
============
@@ -26,15 +34,15 @@ Requirements
Proposed Solution
=================
-Each service should have an health endpoint that give its current health status:
+All services have at least one REST API. This API should expose a health endpoint reporting the service's global status, covering not just the httpd process but also every other component it uses: either ``ok`` if everything is fine, or ``degraded`` if the service is running but not functioning properly. The endpoint also carries additional, less structured context to help with remediation.
.. ts:def:: HealthStatus
interface HealthStatus {
// Whether the service is running fine or in a degraded way
status: "ok" | "degraded";
- // Additional information about the service components
- components: [key: string]: string;
+ // Additional context about the service components and processes
+ context: { [key: string]: string };
}
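As a sketch of how a service could assemble this response (the helper and component names below are illustrative, not part of any libeufin API), assuming each component check reports a string state:

```typescript
// Hypothetical sketch: build a HealthStatus from individual component checks.
// Function and component names are illustrative, not part of libeufin.
type HealthStatus = {
  status: "ok" | "degraded";
  context: { [key: string]: string };
};

function buildHealthStatus(checks: { [key: string]: string }): HealthStatus {
  // The service is degraded as soon as any component reports a failure.
  const degraded = Object.values(checks).some((v) => v === "failure");
  return { status: degraded ? "degraded" : "ok", context: checks };
}

// Example: a single failing TAN channel degrades the whole service.
const health = buildHealthStatus({
  "database": "ok",
  "tan-sms": "ok",
  "tan-email": "failure",
});
```

The point of keeping ``context`` a flat string map is that the aggregation stays trivial while still leaving room for free-form values such as timestamps.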
For libeufin-bank:
@@ -43,7 +51,7 @@ For libeufin-bank:
{
"status": "degraded",
- "component": {
+ "context": {
"database": "ok",
"tan-sms": "ok",
"tan-email": "failure"
@@ -56,10 +64,14 @@ For libeufin-nexus:
{
"status": "degraded",
- "component": {
+ "context": {
"database": "ok",
"ebics-submit": "ok",
- "ebics-fetch": "failure"
+ "ebics-fetch-latest": "2012-04-23T18:25:43.511Z",
+ "ebics-fetch-latest-success": "2012-04-23T18:25:43.511Z"
+ "ebics-fetch": "failure",
+ "ebics-fetch-latest": "2012-04-23T18:25:43.511Z",
+ "ebics-fetch-latest-success": "2012-03-23T18:25:43.511Z"
}
}
@@ -69,19 +81,20 @@ For taler-exchange:
{
"status": "degraded",
- "component": {
+ "context": {
"database": "ok",
- "wirewatch": "failure"
+ "wirewatch": "failure",
+ "wirewatch-latest": "2012-04-23T18:25:43.511Z"
}
}
-Next, Uptime Kuma can be configured to retrieve this endpoint and trigger an alert when the status is degraded event if the API is up.
-The JSON body can be shared within the alert, which makes debugging easier because we have a clue as to what is failling.
+Next, Uptime Kuma can be configured to poll this endpoint and trigger an alert when the status is degraded, even if the API itself is up.
+The JSON body can be included in the alert, which makes remediation easier because it gives hints about what is failing inside the service.
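The alerting decision could be sketched as follows (this is a hypothetical illustration of the check logic, not Uptime Kuma's actual implementation):

```typescript
// Sketch: evaluate a polled HealthStatus body and decide whether to alert.
type HealthStatus = {
  status: "ok" | "degraded";
  context: { [key: string]: string };
};

// Returns a human-readable alert message, or null when everything is fine.
function evaluateHealth(body: HealthStatus): string | null {
  if (body.status === "ok") return null;
  // Name the failing components so the alert carries remediation hints.
  const failing = Object.entries(body.context)
    .filter(([, state]) => state === "failure")
    .map(([name]) => name);
  return `service degraded: ${failing.join(", ")} failing`;
}
```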
Test Plan
=========
-- Add health endpoint to libeufin-bank and libeufin-nexus
+- Add the health endpoint to libeufin-bank and libeufin-nexus first
- Deploy on demo
- Test status update and alerts
@@ -91,7 +104,7 @@ Alternatives
Prometheus format with Uptime Kuma
----------------------------------
-Instead of using JSON we could use Prometheus metrics textual format. This would make upgrading to a better observability system easier.
+Instead of using JSON we could use Prometheus metrics textual format. This would make upgrading to a better observability system easier in the future.
.. code-block:: text
@@ -111,7 +124,7 @@ Instead of using JSON we could use Prometheus metrics textual format. This would
# TYPE component_tan_email_status gauge
component_tan_email_status 0.0
-We could also use the existing Taler Observability API.
+We could then also use the existing Taler Observability API, but only provide simple metrics for now.
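Rendering the JSON health status in the Prometheus text format could look like this sketch (metric naming mirrors the example above and is illustrative; non-boolean context values such as timestamps would need their own metrics):

```typescript
// Sketch: render a HealthStatus in Prometheus text exposition format.
type HealthStatus = {
  status: "ok" | "degraded";
  context: { [key: string]: string };
};

function toPrometheus(h: HealthStatus): string {
  const lines: string[] = [];
  lines.push("# TYPE service_status gauge");
  lines.push(`service_status ${h.status === "ok" ? 1 : 0}`);
  for (const [name, state] of Object.entries(h.context)) {
    // Prometheus metric names cannot contain dashes.
    const metric = `component_${name.replace(/-/g, "_")}_status`;
    lines.push(`# TYPE ${metric} gauge`);
    // Only ok/failure map cleanly onto a gauge; anything else reads as 0.
    lines.push(`${metric} ${state === "ok" ? 1 : 0}`);
  }
  return lines.join("\n") + "\n";
}
```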
Prometheus & Grafana alternative
--------------------------------
@@ -121,8 +134,10 @@ We can try to use other more performant while not simpler alternative like Victo
Drawbacks
=========
-This does not resolve the issue of system resources and services.
+This does not resolve the issue of system resources and systemd services.
It's therefore not sufficient for a complete observability system. However, it is easier to implement for now.
+We could hack together our own health endpoint that exposes systemd services, the Postgres database, and system status in a similar manner.
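A minimal sketch of such a systemd-aware check, assuming we shell out to ``systemctl is-active`` per unit (the mapping below is kept pure so it can be shown without a running systemd; unit names are illustrative):

```typescript
// Sketch: map `systemctl is-active <unit>` output to the health context
// convention used above. In a real endpoint, `states` would be filled by
// running `systemctl is-active libeufin-nexus.service` and friends.
function systemdContext(states: { [unit: string]: string }): { [key: string]: string } {
  const context: { [key: string]: string } = {};
  for (const [unit, state] of Object.entries(states)) {
    // systemctl reports states such as "active", "failed", "inactive".
    context[unit] = state === "active" ? "ok" : "failure";
  }
  return context;
}
```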
+
Discussion / Q&A
================