taler-docs

Documentation for GNU Taler components, APIs and protocols

commit 4da6f9020a343feccf432db5c3dabfe23c105f93
parent 3d8f0de592a4649ef2b107a5e2c9030637864038
Author: Antoine A <>
Date:   Mon, 16 Feb 2026 13:15:36 +0100

dd84: improve proposal

Diffstat:
M design-documents/084-simple-observability.rst | 53 ++++++++++++++++++++++++++++++++++-------------------
1 file changed, 34 insertions(+), 19 deletions(-)

diff --git a/design-documents/084-simple-observability.rst b/design-documents/084-simple-observability.rst
@@ -4,16 +4,24 @@ DD 84: Simple observability
 Summary
 =======
 
-We want a simple way to check whether our various services are working properly.
+We want a simple way to check if our various services are working properly and to be notified when they are not.
 
 Motivation
 ==========
 
-We want to have observability, and the obvious maximalist solution is Prometheus and Grafana.
-The problem is that this system is so complex to configure and so cumbersome that we never have the time to configure properly.
-By trying to have a perfect solution, we end up with none at all.
+We had some difficulties managing the degraded operation of libeufin-nexus. The systemd service appeared to be running correctly, but the process was failing in a loop. There is currently no way to detect this and receive alerts.
 
-I propose a simple solution based on health endpoints that should give us most of what we need more quickly.
+We also have the same problem with other services. For example, when the exchange is not configured correctly, wirewatch may fail to retrieve the transaction history from libeufin-nexus while the API still works correctly. We currently have to check for this manually.
+
+Right now we only have uptime checks with Uptime Kuma, and we need a better observability system that can detect these degraded states and also provide some context on the failure for faster remediation.
+
+This is where observability comes in. It provides the answers to ``is it running`` and ``why is it not``. The usual approach is to set up Grafana and Prometheus, the standard maximalist solution, but it has problems:
+- It is heavy and can actually take up more resources than all the services it is supposed to monitor
+- It is complicated to set up, maintain, and configure
+
+Because of these problems, we don't have a solution at the moment, as we keep delaying this configuration.
+
+I think we can have a simpler and lighter solution that answers ``is it running`` well and ``why is it not`` in a minimalistic way.
 
 Requirements
 ============
@@ -26,15 +34,15 @@ Requirements
 Proposed Solution
 =================
 
-Each service should have an health endpoint that give its current health status:
+All services have at least one REST API. This API should expose a health endpoint reporting the service's global status, covering not just the httpd process but also all the other components it uses (either ``ok`` if everything is fine, or ``degraded`` if it is running but not functioning properly). We also add a way to attach additional, less structured context to help with remediation.
 
 .. ts:def:: HealthStatus
 
   interface HealthStatus {
     // Whether the service is running fine or in a degraded way
     status: "ok" | "degraded";
-    // Additional information about the service components
-    components: [key: string]: string;
+    // Additional context about the service components and processes
+    context: { [key: string]: string };
   }
 
 For libeufin-bank:
@@ -43,7 +51,7 @@ For libeufin-bank:
 
   {
     "status": "degraded",
-    "component": {
+    "context": {
       "database": "ok",
       "tan-sms": "ok",
       "tan-email": "failure"
@@ -56,10 +64,14 @@ For libeufin-nexus:
 
   {
     "status": "degraded",
-    "component": {
+    "context": {
       "database": "ok",
       "ebics-submit": "ok",
-      "ebics-fetch": "failure"
+      "ebics-submit-latest": "2012-04-23T18:25:43.511Z",
+      "ebics-submit-latest-success": "2012-04-23T18:25:43.511Z",
+      "ebics-fetch": "failure",
+      "ebics-fetch-latest": "2012-04-23T18:25:43.511Z",
+      "ebics-fetch-latest-success": "2012-03-23T18:25:43.511Z"
     }
   }
 
@@ -69,19 +81,20 @@ For taler-exchange:
 
   {
     "status": "degraded",
-    "component": {
+    "context": {
       "database": "ok",
-      "wirewatch": "failure"
+      "wirewatch": "failure",
+      "wirewatch-latest": "2012-04-23T18:25:43.511Z"
     }
   }
 
-Next, Uptime Kuma can be configured to retrieve this endpoint and trigger an alert when the status is degraded event if the API is up.
-The JSON body can be shared within the alert, which makes debugging easier because we have a clue as to what is failling.
+Next, Uptime Kuma can be configured to poll this endpoint and trigger an alert when the status is degraded even if the API itself is up.
+The JSON body can be included in the alert, which makes remediation easier because we have hints as to what is not working inside the service.
 
 Test Plan
 =========
 
-- Add health endpoint to libeufin-bank and libeufin-nexus
+- Add a health endpoint to libeufin-bank and libeufin-nexus first
 - Deploy on demo
 - Test status update and alerts
@@ -91,7 +104,7 @@ Alternatives
 Prometheus format with Uptime Kuma
 ----------------------------------
 
-Instead of using JSON we could use Prometheus metrics textual format. This would make upgrading to a better observability system easier.
+Instead of using JSON, we could use the Prometheus textual metrics format. This would make upgrading to a better observability system easier in the future.
 
 .. code-block:: text
 
@@ -111,7 +124,7 @@ Instead of using JSON we could use Prometheus metrics textual format. This would
    # TYPE component_tan_email_status gauge
   tan_email_status 0.0
 
-We could also use the existing Taler Observability API.
+We could then also use the existing Taler Observability API, but only provide simple metrics for now.
 
 Prometheus & Grafana alternative
 --------------------------------
@@ -121,8 +134,10 @@ We can try to use other more performant while not simpler alternative like Victo
 Drawbacks
 =========
 
-This does not resolve the issue of system resources and services.
+This does not resolve the issue of system resources and systemd services.
 It's therefore not sufficient for a complete observability system. However, it is easier to implement for now.
 
+We could hack our own health endpoint that exposes systemd services, the postgres database, and system status in a similar manner.
+
 Discussion / Q&A
 ================
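
As a sketch of the endpoint logic the proposal describes: each service runs a probe per component and reports ``degraded`` as soon as any probe is not ``ok``. The ``HealthStatus`` shape is taken from the design document; ``buildHealthStatus``, ``Probe``, and the component names below are hypothetical illustration, not part of the proposal.

```typescript
// Response shape from DD 84.
interface HealthStatus {
  status: "ok" | "degraded";
  context: { [key: string]: string };
}

// A probe returns "ok" or a short description of the failure
// (hypothetical mechanism; the DD only specifies the response).
type Probe = () => string;

function buildHealthStatus(probes: { [name: string]: Probe }): HealthStatus {
  const context: { [key: string]: string } = {};
  let degraded = false;
  for (const [name, probe] of Object.entries(probes)) {
    let result: string;
    try {
      result = probe();
    } catch (e) {
      result = `failure: ${e}`;
    }
    context[name] = result;
    if (result !== "ok") degraded = true;
  }
  return { status: degraded ? "degraded" : "ok", context };
}

// Example: a single failing component degrades the whole service.
const health = buildHealthStatus({
  "database": () => "ok",
  "tan-email": () => "failure",
});
console.log(JSON.stringify(health));
```

Keeping the probes as plain callbacks means each service only has to register checks for the components it actually uses; the endpoint handler serializes the result and Uptime Kuma alerts on ``"status": "degraded"``.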