DD 59: Statistics
#################

Summary
=======

This design document elaborates how we track various statistics
in the exchange and merchant, typically for tax reporting or to
detect anomalies to be investigated by anti-money laundering
officers. The key idea is to use SQL triggers to keep the
statistics always up-to-date and a bit of garbage collection to
expire ancient statistics. Finally, deployment-specific
statistics can easily be added this way by simply injecting the
correct SQL code into the backend, without having to modify the
core exchange or merchant logic.


Motivation
==========

Exchange operators are required to monitor for suspicious
transactions as part of their AML efforts. Merchants need to
collect certain data for their business, especially for tax
purposes but also conceivably to analyze sales. The specific
data to be tracked varies by operator (and legislation), so we
need to be quite flexible in terms of which statistics should
be kept, especially to minimize the performance impact.


Requirements
============

- statistics should always be up-to-date (in real time) and
  not only be updated in batches;
- some statistics are amounts, others are simple numerical
  (integer) values;
- some statistics need to be kept over a sliding interval that
  moves over time, while others need to be mapped to fixed
  buckets such as a day, month, quarter or year;
- which statistics are being tracked may depend on the
  operational context, especially for the exchange; it must
  thus be easy to add (or remove) statistics at any time;
- while tracking statistics inherently costs performance, the
  runtime (CPU and storage) overhead should be minimized; in
  particular for "sliding intervals", the events that "slide
  out" of the interval may be coarsened and do not necessarily
  require accuracy down to the second;
- when adding new statistics, it may be desirable to compute
  them retroactively over historic data (if available);
- the SPAs displaying statistics should not have to make an
  excessive number of REST API calls, so generally multiple
  values should be returned from a single endpoint;
- the merchant is multi-currency capable, and thus
  amount-valued statistics in the merchant backend should be
  kept per currency.


Proposed Solution
=================

At a high level, we use SQL TRIGGERs to update statistics on-the-fly whenever
an INSERT or UPDATE changes values in the database that would affect the
statistics we are tracking. The statistics themselves are also stored in the
database. Additionally, we also store meta data in the database that
determines the granularity of the statistics to keep. This way, the SQL
TRIGGERs can be generic.


Tables
------

The schema consists of several tables:

- **statistic_bucket_meta**:
  meta data for statistics we track in buckets, including
  a slug used as the main key to identify the statistic,
  the data type (amount or number), and a pair consisting
  of the range of the bucket (year, quarter, month, etc.)
  and how many ages (generations) of that bucket to keep.
  Basically, now minus age times range tells us how far
  into the past statistics are kept.
- **statistic_interval_meta**:
  meta data used for statistics we keep by time interval,
  with again a slug and data type, but then pairs
  consisting of a number of seconds for the range
  and a precision. A range of 60s implies that we
  count the events of the last 60s in the interval.
  Usually multiple ranges are given, like 60s, 120s
  and 180s to keep events for the last one, two and
  three minutes. Note that in the database the data
  kept for the 120s range excludes the data from the
  first 60s, and the correct amount for any interval
  is computed on-the-fly by adding up the amounts for
  the smaller intervals. A precision of 5s means that
  event timestamps are normalized (and rounded down)
  to multiples of 5s and thus events at 64s may still
  be counted for the 60s interval. Note that both
  intervals and precisions must be monotonically
  increasing in their respective arrays.
- **statistic_bucket_counter**: contains the current
  numeric (non-amount) value for a particular bucket
  (given by starting point and range) and object;
  objects in the exchange are (normalized) payto
  hashes identifying bank accounts, while objects for
  the merchant backend are the instances of the merchant.
- **statistic_bucket_amount**: same as the counter
  bucket, except for an amount; in the case of the
  merchant the *currency* is an additional dimension
  as amounts are tracked per currency.
- **statistic_counter_event**: represents the sum of
  all (counter) events at a particular time-*slot*,
  again given per object; the *slot* is the
  time of the event rounded to the *precision* of the
  interval the event falls into.
- **statistic_amount_event**: same as the counter
  event, except for amounts.
- **statistic_interval_counter**: cumulative value
  of the counter for the given range; additionally
  includes the unique ID of the oldest event that
  is included in the counter (as the range is given
  as a relative time, we need to track which events
  are included).
- **statistic_interval_amount**: same as the
  interval counter, except for amounts.

Stored procedures
-----------------

We additionally provide a few stored procedures to
update the statistics. These are:

* **bump_number_bucket_stat** (slug, object, timestamp, delta):
  increases the *slug* bucket counter of *object* at *timestamp*
  by *delta*. Does nothing if *slug* is not in the
  bucket meta table or if *timestamp* is past the range
  of buckets we are currently tracking;
* **bump_amount_bucket_stat** (slug, object, timestamp, amount):
  similar to the above, just for statistics of type amount;
* **bump_number_interval_stat** (slug, object, timestamp, delta):
  increases the *slug* interval counter of *object* at
  *timestamp* by *delta*;
* **bump_amount_interval_stat** (slug, object, timestamp, amount):
  similar to the above, just for statistics of type amount;
* **bump_number_stat** (slug, object, timestamp, delta):
  increases both bucket and interval counters for *slug*
  of *object* at *timestamp* by *delta*;
* **bump_amount_stat** (slug, object, timestamp, amount):
  similar to the above, just for statistics of type amount.

We furthermore provide a few stored procedures to
access the interval statistics (while making sure
they are current).
These are:

* **statistic_interval_number_get** (slug, object):
  returns all non-zero counters and time intervals that
  we are tracking for the given *slug* and *object*;
  the intervals are updated and events are possibly
  discarded or coarsened (if the numeric value of the
  precision for larger intervals is larger and thus
  allows for more rounding);
* **statistic_interval_amount_get** (slug, object):
  returns all non-zero amounts and time intervals that
  we are tracking for the given *slug* and *object*;
  the intervals are updated and events are possibly
  discarded or coarsened (if the numeric value of the
  precision for larger intervals is larger and thus
  allows for more rounding).

Finally, there are some helpers for cleaning up:

* **statistic_bucket_gc** ():
  Removes buckets and events past the range for which
  we track statistics; also coarsens events to the
  precision of the range into which they fall.
* **exchange_drop_customization** (schema):
  special function for the exchange which allows dropping
  customizations by *schema*.

Schema
------

For the exchange, all triggers (and associated stored
procedures) should be stored in one or more payment
service provider specific schemas. Furthermore, the name
of the schema MUST be provided in the *schema* column of
the meta data table entries. This way, all triggers,
stored procedures and statistics can be removed simply
by DROPing the SCHEMA and removing the associated
entries from the meta data tables (the foreign key
constraints then CASCADE and clean up the statistics
themselves).

The SCHEMA name must also be used as the prefix
for the SQL files that inject the triggers. The
usual database versioning should also be used,
except this time using the SCHEMA name instead of
"exchange-". "exchange" is not allowed as a name
for a customization SCHEMA.
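To illustrate the deployment model, here is a minimal sketch of what a
customization schema could look like. The schema name ``my_psp``, the
source table ``exchange.deposits`` and its columns, and the non-key column
names of the meta data table are all assumptions for illustration, not the
actual exchange schema:

.. code-block:: sql

   -- Sketch only: source table, its columns, and the meta-table
   -- column names are hypothetical.
   CREATE SCHEMA my_psp;

   -- Track monthly deposit totals per (normalized) payto hash,
   -- keeping 12 generations of monthly buckets.
   INSERT INTO statistic_bucket_meta
     (slug, stat_type, stat_range, stat_ages, schema)
   VALUES
     ('deposit-volume', 'amount', 'month', 12, 'my_psp');

   CREATE FUNCTION my_psp.deposit_volume_trigger()
   RETURNS trigger
   LANGUAGE plpgsql
   AS $$
   BEGIN
     -- Bump both the bucket and the interval statistics
     -- for this deposit event.
     PERFORM bump_amount_stat ('deposit-volume',
                               NEW.wire_target_h_payto,
                               NEW.execution_date,
                               NEW.amount);
     RETURN NEW;
   END $$;

   CREATE TRIGGER deposit_volume_stat
     AFTER INSERT ON exchange.deposits
     FOR EACH ROW
     EXECUTE FUNCTION my_psp.deposit_volume_trigger();

Removing the ``my_psp`` customization then only requires dropping the schema
and deleting its meta data entries; the foreign key constraints cascade and
clean up the statistics themselves.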
**taler-exchange-dbinit** is extended with command line arguments to load the
latest version of a customization schema or to call
**exchange_drop_customization** to remove one.


Definition of Done
==================

- key statistics for merchant and TOPS-deployment implemented
- REST API for merchant specified and implemented
- REST API for AML officer specified and implemented
- SPAs visualize key statistics

Alternatives
============

- batch processing to compute statistics, where the REST API only
  returns the value computed by the last batch (computational
  cost is only paid if the statistic is desired, but may then be
  high; there is also the potential for outdated data being shown);
- computing statistics on the C side; this may have more data
  easily available, but has the major disadvantage of making
  it harder to add/remove statistics and makes the transaction
  logic more complex; it is also easier to miss triggering events.


Drawbacks
=========

- ongoing baseline cost for statistics even if nobody looks
  at them (but only if the respective statistic is enabled)


Discussion / Q&A
================

(This should be filled in with results from discussions on mailing lists /
personal communication.)