DD 84: Simple observability
###########################

Summary
=======

We want a simple way to check if our various services are working properly and to be notified when they are not.

Motivation
==========

We had some difficulties managing the degraded operation of libeufin-nexus: the systemd service appeared to be running correctly, but it was failing in a loop. There is currently no way to detect this and receive alerts.

We also have the same problem with other services. For example, when the exchange is not configured correctly, wirewatch may fail to retrieve the transaction history from libeufin-nexus while the API keeps responding correctly. For now, we have to check this manually.

Right now we only have uptime checks with Uptime Kuma, and we need a better observability system that can detect these degraded states and also provide some context on the failure for faster remediation.

This is where observability comes in. It provides the answers to ``is it running`` and ``why is it not``. To do this, you usually set up Grafana and Prometheus, which is the standard maximalist solution, but it has problems:

- It is heavy and can actually take up more resources than all the services it is supposed to monitor
- It is complicated to set up, maintain, and configure

Because of these problems, we don't have a solution at the moment, as we keep delaying this setup.

I think we can have a simpler and lighter solution that answers ``is it running`` well and ``why is it not`` in a minimalistic way.

Requirements
============

* Easy to implement by service maintainers
* Easy to configure
* Easy to maintain
* Effective at detecting downtime or degraded states

Proposed Solution
=================

Health endpoint
---------------

All services have at least one REST API. This API should expose a health endpoint that reports the service's global status, covering not only the httpd process but also all the other components the service uses. The status is either ``ok`` if everything is fine, or ``degraded`` if the service is running but not functioning properly. We also add a way to attach additional context in a less structured way to help with remediation.

.. ts:def:: HealthStatus

  interface HealthStatus {
    // Whether the service is running fine or in a degraded way
    status: "ok" | "degraded";
    // Additional context about the service components and processes
    context: { [key: string]: string };
  }

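As a sketch of how a service could compute this value, the hypothetical helper below (``buildHealthStatus`` and its parameters are assumptions, not part of any existing API) aggregates per-component results into a ``HealthStatus``: any failing component turns the global status into ``degraded``, and free-form context such as timestamps is passed through untouched.

.. code-block:: typescript

  type ComponentStatus = "ok" | "failure";

  interface HealthStatus {
    status: "ok" | "degraded";
    context: { [key: string]: string };
  }

  // Any component reporting "failure" degrades the global status;
  // extra context (timestamps, etc.) is merged in unchanged.
  function buildHealthStatus(
    components: { [key: string]: ComponentStatus },
    extraContext: { [key: string]: string } = {},
  ): HealthStatus {
    const degraded = Object.values(components).includes("failure");
    return {
      status: degraded ? "degraded" : "ok",
      context: { ...components, ...extraContext },
    };
  }
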
For libeufin-bank:

.. code-block:: json

  {
    "status": "degraded",
    "context": {
      "database": "ok",
      "tan-sms": "ok",
      "tan-email": "failure"
    }
  }

For libeufin-nexus:

.. code-block:: json

  {
    "status": "degraded",
    "context": {
      "database": "ok",
      "ebics-submit": "ok",
      "ebics-submit-latest": "2012-04-23T18:25:43.511Z",
      "ebics-submit-latest-success": "2012-04-23T18:25:43.511Z",
      "ebics-fetch": "failure",
      "ebics-fetch-latest": "2012-04-23T18:25:43.511Z",
      "ebics-fetch-latest-success": "2012-03-23T18:25:43.511Z"
    }
  }

For taler-exchange:

.. code-block:: json

  {
    "status": "degraded",
    "context": {
      "database": "ok",
      "wirewatch": "failure",
      "wirewatch-latest": "2012-04-23T18:25:43.511Z"
    }
  }

Once this endpoint is in place, Uptime Kuma can be configured to poll it and trigger an alert when the status is ``degraded``.
The JSON body can be included in the alert, which makes remediation easier because it gives hints as to what is not working inside the service.

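For environments without Uptime Kuma, the same check could be scripted directly. The sketch below is hypothetical (the endpoint path and URL are assumptions, as this design does not fix them yet): it polls the endpoint and signals failure when the service is degraded, printing the context for remediation.

.. code-block:: typescript

  interface HealthStatus {
    status: "ok" | "degraded";
    context: { [key: string]: string };
  }

  // Alert on anything that is not a plain "ok".
  function shouldAlert(health: HealthStatus): boolean {
    return health.status !== "ok";
  }

  // Returns a process exit code: 0 when healthy, 1 when an alert
  // should fire.
  async function checkEndpoint(url: string): Promise<number> {
    const resp = await fetch(url);
    // A non-2xx answer means the httpd process itself is in trouble.
    const health: HealthStatus = resp.ok
      ? await resp.json()
      : { status: "degraded", context: { "httpd": `HTTP ${resp.status}` } };
    if (shouldAlert(health)) {
      console.error(JSON.stringify(health.context, null, 2));
      return 1;
    }
    return 0;
  }

A cron job or systemd timer could then run ``checkEndpoint("https://bank.demo.taler.net/health")`` (a hypothetical URL) and mail the printed context on failure.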
Logs
----

Currently, we also rely on logs to detect failures. To keep doing this, we would need to ingest the logs into a monitoring system in order to analyze them and generate alerts. We need to decide whether we want to continue down this path or whether we can use the logs only for remediation and therefore leave them where they are.

Whenever we want to analyze logs to detect a failure condition, we should first check whether the service can expose that condition in its health endpoint instead.

Test Plan
=========

- Add health endpoint to libeufin-bank and libeufin-nexus first
- Deploy on demo
- Test status update and alerts

Alternatives
============

Prometheus format with Uptime Kuma
----------------------------------

Instead of JSON, we could use the Prometheus text exposition format. This would make upgrading to a better observability system easier in the future.

.. code-block:: text

    # HELP app_status 1=ok, 0.5=degraded, 0=down
    # TYPE app_status gauge
    app_status 0.5

    # HELP component_database_status 1=ok, 0=failure
    # TYPE component_database_status gauge
    component_database_status 1.0

    # HELP component_tan_sms_status 1=ok, 0=failure
    # TYPE component_tan_sms_status gauge
    component_tan_sms_status 1.0

    # HELP component_tan_email_status 1=ok, 0=failure
    # TYPE component_tan_email_status gauge
    component_tan_email_status 0.0

We could then also use the existing Taler Observability API but only provide simple metrics for now.

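To connect the two representations, a converter from the ``HealthStatus`` JSON to the Prometheus text format could be sketched as follows (hypothetical helper; it assumes only ``ok``/``failure`` component values map to gauges, while timestamps and other free-form context are skipped):

.. code-block:: typescript

  interface HealthStatus {
    status: "ok" | "degraded";
    context: { [key: string]: string };
  }

  function toPrometheus(health: HealthStatus): string {
    const lines = [
      "# HELP app_status 1=ok, 0.5=degraded, 0=down",
      "# TYPE app_status gauge",
      `app_status ${health.status === "ok" ? "1.0" : "0.5"}`,
    ];
    for (const [key, value] of Object.entries(health.context)) {
      // Timestamps and other free-form context values have no obvious
      // gauge representation, so they are skipped.
      if (value !== "ok" && value !== "failure") continue;
      // Prometheus metric names cannot contain dashes.
      const name = `component_${key.replace(/-/g, "_")}_status`;
      lines.push(`# HELP ${name} 1=ok, 0=failure`);
      lines.push(`# TYPE ${name} gauge`);
      lines.push(`${name} ${value === "ok" ? "1.0" : "0.0"}`);
    }
    return lines.join("\n") + "\n";
  }
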
Prometheus & Grafana alternative
--------------------------------

We could also try alternatives that are more performant, though not simpler, such as VictoriaMetrics.

Drawbacks
=========

This does not solve the monitoring of system resources and of the systemd services themselves.
It is therefore not sufficient for a complete observability system. However, it is easier to implement for now.

We could hack our own health endpoint that exposes systemd services, the PostgreSQL database and the system status in a similar manner.

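As a rough sketch of that idea (hypothetical code; the helper names are assumptions), such an endpoint could shell out to ``systemctl`` and map unit states onto the same ``ok``/``failure`` component values:

.. code-block:: typescript

  import { execFile } from "node:child_process";

  // Map `systemctl is-active` output onto our component status values.
  function parseIsActive(output: string): "ok" | "failure" {
    return output.trim() === "active" ? "ok" : "failure";
  }

  // `systemctl is-active <unit>` prints "active" and exits 0 for a
  // running unit; any other state ("failed", "inactive", ...) counts
  // as a failure.
  function unitStatus(unit: string): Promise<"ok" | "failure"> {
    return new Promise((resolve) => {
      execFile("systemctl", ["is-active", unit], (_err, stdout) => {
        resolve(parseIsActive(stdout ?? ""));
      });
    });
  }
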
Discussion / Q&A
================