taler-docs

Documentation for GNU Taler components, APIs and protocols

commit 4da6f9020a343feccf432db5c3dabfe23c105f93
parent 3d8f0de592a4649ef2b107a5e2c9030637864038
Author: Antoine A <>
Date:   Mon, 16 Feb 2026 13:15:36 +0100

dd84: improve proposal

Diffstat:
M design-documents/084-simple-observability.rst | 53 ++++++++++++++++++++++++++++++++++-------------------
1 file changed, 34 insertions(+), 19 deletions(-)

diff --git a/design-documents/084-simple-observability.rst b/design-documents/084-simple-observability.rst
@@ -4,16 +4,24 @@ DD 84: Simple observability
 Summary
 =======
 
-We want a simple way to check whether our various services are working properly.
+We want a simple way to check if our various services are working properly and to be notified when they are not.
 
 Motivation
 ==========
 
-We want to have observability, and the obvious maximalist solution is Prometheus and Grafana.
-The problem is that this system is so complex to configure and so cumbersome that we never have the time to configure properly.
-By trying to have a perfect solution, we end up with none at all.
+We had some difficulties managing the degraded operation of libeufin-nexus. The systemd service appeared to be running correctly, but the process was failing in a loop. There is currently no way to detect this and receive alerts.
 
-I propose a simple solution based on health endpoints that should give us most of what we need more quickly.
+We also have the same problem with other services. For example, when the exchange is not configured correctly, wirewatch may fail to retrieve the transaction history from libeufin-nexus while the API still works correctly. We currently have to check for this manually.
+
+Right now we only have uptime checks with Uptime Kuma, and we need a better observability system that can detect these degraded states and also provide some context on the failure for faster remediation.
+
+This is where observability comes in. It provides the answers to ``is it running`` and ``why is it not``. The usual approach is to set up Grafana and Prometheus, the standard maximalist solution, but it has problems:
+- It is heavy and can actually take up more resources than all the services it is supposed to monitor
+- It is complicated to set up, maintain, and configure
+
+Because of these problems, we don't have a solution at the moment, as we keep delaying this configuration.
+
+I think we can have a simpler and lighter solution that answers ``is it running`` well and ``why is it not`` in a minimalistic way.
 
 Requirements
 ============
@@ -26,15 +34,15 @@ Requirements
 Proposed Solution
 =================
 
-Each service should have an health endpoint that give its current health status:
+All services have at least one REST API. This API should expose a health endpoint reporting the service's global status, covering not just the httpd process but also all the other components it uses (either ``ok`` if everything is fine, or ``degraded`` if it is running but not functioning properly). We also add a way to attach additional, less structured context to help with remediation.
 
 .. ts:def:: HealthStatus
 
   interface HealthStatus {
     // Whether the service is running fine or in a degraded way
     status: "ok" | "degraded";
-    // Additional information about the service components
-    components: [key: string]: string;
+    // Additional context about the service components and processes
+    context: { [key: string]: string };
   }
 
 For libeufin-bank:
@@ -43,7 +51,7 @@ For libeufin-bank:
 
   {
     "status": "degraded",
-    "component": {
+    "context": {
       "database": "ok",
       "tan-sms": "ok",
       "tan-email": "failure"
@@ -56,10 +64,14 @@ For libeufin-nexus:
 
   {
     "status": "degraded",
-    "component": {
+    "context": {
       "database": "ok",
       "ebics-submit": "ok",
-      "ebics-fetch": "failure"
+      "ebics-submit-latest": "2012-04-23T18:25:43.511Z",
+      "ebics-submit-latest-success": "2012-04-23T18:25:43.511Z",
+      "ebics-fetch": "failure",
+      "ebics-fetch-latest": "2012-04-23T18:25:43.511Z",
+      "ebics-fetch-latest-success": "2012-03-23T18:25:43.511Z"
     }
   }
 
@@ -69,19 +81,20 @@ For taler-exchange:
 
   {
     "status": "degraded",
-    "component": {
+    "context": {
       "database": "ok",
-      "wirewatch": "failure"
+      "wirewatch": "failure",
+      "wirewatch-latest": "2012-04-23T18:25:43.511Z"
     }
   }
 
-Next, Uptime Kuma can be configured to retrieve this endpoint and trigger an alert when the status is degraded event if the API is up.
-The JSON body can be shared within the alert, which makes debugging easier because we have a clue as to what is failling.
+Next, Uptime Kuma can be configured to poll this endpoint and trigger an alert when the status is degraded even if the API itself is up.
+The JSON body can be included in the alert, which makes remediation easier because we have hints as to what is not working inside the service.
 
 Test Plan
 =========
 
-- Add health endpoint to libeufin-bank and libeufin-nexus
+- Add a health endpoint to libeufin-bank and libeufin-nexus first
 - Deploy on demo
 - Test status update and alerts
@@ -91,7 +104,7 @@ Alternatives
 Prometheus format with Uptime Kuma
 ----------------------------------
 
-Instead of using JSON we could use Prometheus metrics textual format. This would make upgrading to a better observability system easier.
+Instead of using JSON, we could use the Prometheus textual metrics format. This would make upgrading to a better observability system easier in the future.
 
 .. code-block:: text
 
@@ -111,7 +124,7 @@ Instead of using JSON we could use Prometheus metrics textual format. This would
    # TYPE component_tan_email_status gauge
   tan_email_status 0.0
 
-We could also use the existing Taler Observability API.
+We could then also use the existing Taler Observability API, but only provide simple metrics for now.
 
 Prometheus & Grafana alternative
 --------------------------------
@@ -121,8 +134,10 @@ We can try to use other more performant while not simpler alternative like Victo
 Drawbacks
 =========
 
-This does not resolve the issue of system resources and services.
+This does not resolve the issue of system resources and systemd services.
 It's therefore not sufficient for a complete observability system. However, it is easier to implement for now.
 
+We could hack our own health endpoint that exposes systemd services, the postgres database, and system status in a similar manner.
+
 Discussion / Q&A
 ================
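
As a sketch of the endpoint logic the proposal describes: each service runs a probe per component and reports ``degraded`` as soon as any probe is not ``ok``. The ``HealthStatus`` shape is taken from the design document; ``buildHealthStatus``, ``Probe``, and the component names below are hypothetical illustration, not part of the proposal.

```typescript
// Response shape from DD 84.
interface HealthStatus {
  status: "ok" | "degraded";
  context: { [key: string]: string };
}

// A probe returns "ok" or a short description of the failure
// (hypothetical mechanism; the DD only specifies the response).
type Probe = () => string;

function buildHealthStatus(probes: { [name: string]: Probe }): HealthStatus {
  const context: { [key: string]: string } = {};
  let degraded = false;
  for (const [name, probe] of Object.entries(probes)) {
    let result: string;
    try {
      result = probe();
    } catch (e) {
      result = `failure: ${e}`;
    }
    context[name] = result;
    if (result !== "ok") degraded = true;
  }
  return { status: degraded ? "degraded" : "ok", context };
}

// Example: a single failing component degrades the whole service.
const health = buildHealthStatus({
  "database": () => "ok",
  "tan-email": () => "failure",
});
console.log(JSON.stringify(health));
```

Keeping the probes as plain callbacks means each service only has to register checks for the components it actually uses; the endpoint handler serializes the result and Uptime Kuma alerts on ``"status": "degraded"``.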