DD 84: Simple observability
###########################

Summary
=======

We want a simple way to check whether our various services are working properly and to be notified when they are not.

Motivation
==========

We had some difficulties managing the degraded operation of libeufin-nexus. The systemd service was running correctly but was failing in a loop. There is currently no way to detect this and receive alerts.

We have the same problem with other services. For example, when the exchange is misconfigured, wirewatch may fail to retrieve the transaction history from libeufin-nexus while the API keeps working correctly. For now, we have to check this manually.

Right now we only have uptime checks with Uptime Kuma, and we need a better observability system that can detect these degraded states and also provide some context on the failure for faster remediation.

This is where observability comes in. It provides the answers to ``is it running`` and ``why is it not``. The standard maximalist solution is to set up Grafana and Prometheus, but it has problems:

- It is heavy and can actually consume more resources than all the services it is supposed to monitor.
- It is complicated to set up, maintain, and configure.

Because of these problems, we don't have a solution at the moment, as we keep postponing this setup.

I think we can have a simpler and lighter solution that answers ``is it running`` well and ``why is it not`` in a minimalistic way.

Requirements
============

* Easy to implement by service maintainers
* Easy to configure
* Easy to maintain
* Effective at detecting downtime or degraded states

Proposed Solution
=================

Health endpoint
---------------

All services have at least one REST API.
This API should expose a health endpoint reporting the service's global status: not just the httpd process, but also all the other components the service uses. The status is either ``ok`` if everything is fine, or ``degraded`` if the service is running but not functioning properly. We also add a way to attach more context, in a less structured form, to help with remediation.

.. ts:def:: HealthStatus

  interface HealthStatus {
    // Whether the service is running fine or in a degraded way
    status: "ok" | "degraded";
    // Additional context about the service components and processes
    context: { [key: string]: string };
  }

For libeufin-bank:

.. code-block:: json

  {
    "status": "degraded",
    "context": {
      "database": "ok",
      "tan-sms": "ok",
      "tan-email": "failure"
    }
  }

For libeufin-nexus:

.. code-block:: json

  {
    "status": "degraded",
    "context": {
      "database": "ok",
      "ebics-submit": "ok",
      "ebics-submit-latest": "2012-04-23T18:25:43.511Z",
      "ebics-submit-latest-success": "2012-04-23T18:25:43.511Z",
      "ebics-fetch": "failure",
      "ebics-fetch-latest": "2012-04-23T18:25:43.511Z",
      "ebics-fetch-latest-success": "2012-03-23T18:25:43.511Z"
    }
  }

For taler-exchange:

.. code-block:: json

  {
    "status": "degraded",
    "context": {
      "database": "ok",
      "wirewatch": "failure",
      "wirewatch-latest": "2012-04-23T18:25:43.511Z"
    }
  }

Once this API is in place, Uptime Kuma can be configured to poll this endpoint and trigger an alert when the status is degraded.
The JSON body can be included in the alert, which makes remediation easier because it hints at what is not working well inside the service.

Logs
----

Currently, we also rely on logs to detect failures. To do this, we need to ingest the logs into a monitoring system in order to analyze them and generate alerts.
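To illustrate how an external monitor would consume the health endpoint described above, here is a minimal sketch in Python. The endpoint URL and the ``/health`` path are assumptions of this sketch, not something this design document fixes.

.. code-block:: python

  import json
  import urllib.request

  # Hypothetical endpoint: neither the host nor the /health path is
  # fixed by this design document.
  HEALTH_URL = "https://bank.demo.taler.net/health"

  def classify(body: dict) -> tuple[int, str]:
      """Map a HealthStatus body to (exit_code, alert_message).

      0 = ok, 1 = degraded.  Failing components are kept in the
      message so the alert carries remediation hints."""
      if body.get("status") == "ok":
          return 0, "OK"
      failing = {k: v for k, v in body.get("context", {}).items()
                 if v != "ok"}
      return 1, "DEGRADED: " + json.dumps(failing, sort_keys=True)

  def check(url: str) -> int:
      """Fetch the health endpoint and print one alert line.

      Returns 2 when the endpoint is unreachable (hard failure)."""
      try:
          with urllib.request.urlopen(url, timeout=10) as resp:
              code, msg = classify(json.load(resp))
      except OSError as exc:  # URLError is a subclass of OSError
          code, msg = 2, f"DOWN: {exc}"
      print(msg)
      return code

The exit code maps onto the usual monitoring convention (0 = ok, 1 = degraded, 2 = down), and the failing-components summary carries the remediation hints described above into the alert itself.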
We need to decide whether we want to continue doing this, or whether we can use the logs only for remediation and therefore leave them where they are.

Whenever we want to analyze logs to detect a failure condition, we should check whether the service can instead expose that condition in its health endpoint.

Test Plan
=========

- Add the health endpoint to libeufin-bank and libeufin-nexus first
- Deploy on demo
- Test status updates and alerts

Alternatives
============

Prometheus format with Uptime Kuma
----------------------------------

Instead of JSON, we could use the Prometheus textual metrics format. This would make upgrading to a better observability system easier in the future.

.. code-block:: text

  # HELP app_status 1=ok, 0.5=degraded, 0=down
  # TYPE app_status gauge
  app_status 0.5

  # HELP component_database_status 1=ok, 0=failure
  # TYPE component_database_status gauge
  component_database_status 1.0

  # HELP component_tan_sms_status 1=ok, 0=failure
  # TYPE component_tan_sms_status gauge
  component_tan_sms_status 1.0

  # HELP component_tan_email_status 1=ok, 0=failure
  # TYPE component_tan_email_status gauge
  component_tan_email_status 0.0

We could then also use the existing Taler Observability API, but only provide simple metrics for now.

Prometheus & Grafana alternative
--------------------------------

We could also try more performant, though not simpler, alternatives such as VictoriaMetrics.

Drawbacks
=========

This does not resolve the issue of monitoring system resources and systemd services.
It is therefore not sufficient for a complete observability system. However, it is easier to implement for now.

We could hack our own health endpoint that exposes systemd services, the PostgreSQL database, and system status in a similar manner.

Discussion / Q&A
================