DD 59: Statistics
#################

Summary
=======

This design document describes how we track various statistics
in the exchange and merchant, typically for tax reporting or
to detect anomalies to be investigated by anti-money laundering
officers. The key idea is to use SQL triggers to keep the
statistics always up-to-date, plus a bit of garbage collection
to expire ancient statistics. Finally, deployment-specific
statistics can easily be added this way by simply injecting
the correct SQL code into the backend, without having to modify
the core exchange or merchant logic.


Motivation
==========

Exchange operators are required to monitor for suspicious
transactions as part of their AML efforts. Merchants need to
collect certain data for their business, especially for tax
purposes but also conceivably to analyze sales. The specific
data to be tracked varies by operator (and legislation), so
we need to be quite flexible in terms of which statistics
are kept, while minimizing the performance impact.


Requirements
============

- statistics should always be up-to-date (in real-time) and
  not only be updated in batches;
- some statistics are amounts, others are simple numerical
  (integer) values;
- some statistics need to be kept over a sliding interval that
  moves over time, while others need to be mapped to fixed
  buckets such as a day, month, quarter or year;
- which statistics are being tracked may depend on the
  operational context, especially for the exchange; it must
  thus be easy to add (or remove) statistics at any time;
- while tracking statistics inherently costs performance, the
  runtime (CPU and storage) overhead should be minimized;
  in particular for "sliding intervals", the events that
  "slide out" of the interval may be coarsened and do not
  necessarily require accuracy down to the second;
- when adding new statistics, it may be desirable to compute
  them retroactively over historic data (if available);
- the SPAs displaying statistics should not have to make an
  excessive number of REST API calls, so generally multiple
  values should be returned from a single endpoint;
- the merchant is multi-currency capable, and thus amount-valued
  statistics in the merchant backend should be kept per currency.


Proposed Solution
=================

At a high level, we use SQL TRIGGERs to update statistics on-the-fly whenever
an INSERT or UPDATE changes values in the database that would affect the
statistics we are tracking.  The statistics themselves are also stored in the
database. Additionally, we store meta-data in the database that
determines the granularity of the statistics to keep. This way, the SQL
TRIGGERs can be generic.

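To illustrate the approach, consider the following minimal sketch of
such a trigger. The *deposits* table and its column names are
illustrative assumptions (not the actual exchange schema), and
*bump_amount_stat* is one of the helper procedures defined below:

.. code-block:: sql

   -- Sketch only: hypothetical table and column names.  Forwards
   -- every new deposit to the statistics logic.
   CREATE OR REPLACE FUNCTION deposits_stats_trigger()
     RETURNS TRIGGER
     LANGUAGE plpgsql
   AS $$
   BEGIN
     PERFORM bump_amount_stat(
       'deposits',               -- slug identifying the statistic
       NEW.wire_target_h_payto,  -- object: normalized payto hash
       NEW.exchange_timestamp,   -- time of the event
       NEW.amount);              -- amount to add
     RETURN NEW;
   END $$;

   CREATE TRIGGER deposits_on_insert
     AFTER INSERT ON deposits
     FOR EACH ROW
     EXECUTE FUNCTION deposits_stats_trigger();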

Tables
------

The schema consists of several tables (a sketch of the two
meta-data tables follows this list):

- **statistic_bucket_meta**:
  meta-data for statistics we track in buckets, including
  a slug used as the main key to identify the statistic,
  the data type (amount or number), and a pair consisting
  of the range of the bucket (year, quarter, month, etc.)
  and how many ages (generations) of that bucket to keep.
  Basically, now minus ages times range tells us how far
  into the past statistics are kept.
- **statistic_interval_meta**:
  meta-data used for statistics we keep by time interval,
  with again a slug and data type, but then pairs
  consisting of a number of seconds for the range
  and a precision.  A range of 60s implies that we
  count the events of the last 60s in the interval.
  Usually multiple ranges are given, like 60s, 120s
  and 180s to keep events for the last one, two and
  three minutes. Note that in the database the data
  kept for the 120s range excludes the data of the
  60s range, and the correct total for any interval
  is computed on-the-fly by adding up the amounts for
  the smaller intervals. A precision of 5s means that
  event timestamps are normalized (and rounded down)
  to multiples of 5s, and thus events at 64s may still
  be counted for the 60s interval.  Note that both
  ranges and precisions must be monotonically
  increasing in their respective arrays.
- **statistic_bucket_counter**: contains the current
  numeric (non-amount) value for a particular bucket
  (given by starting point and range) and object;
  objects in the exchange are (normalized) payto
  hashes identifying bank accounts, while objects for
  the merchant backend are the instances of the merchant.
- **statistic_bucket_amount**: same as the counter
  bucket, except for an amount; in the case of the
  merchant the *currency* is an additional dimension
  as amounts are tracked per currency.
- **statistic_counter_event**: represents the sum of
  all (counter) events at a particular time *slot*,
  again given per object; the *slot* is the
  time of the event rounded down to the *precision* of
  the interval the event falls into.
- **statistic_amount_event**: same as the counter
  event, except for amounts.
- **statistic_interval_counter**: cumulative value
  of the counter for the given range; additionally
  includes the unique ID of the oldest event that
  is included in the counter (as the range is given
  as a relative time, we need to track which events
  are included).
- **statistic_interval_amount**: same as the
  interval counter, except for amounts.

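As an illustration, the two meta-data tables could look roughly as
follows. This is a sketch; the actual column names and types are to
be determined, and for the exchange an additional *schema* column
names the customization SCHEMA (see the Schema section below):

.. code-block:: sql

   -- Sketch only: illustrative column names and types.
   CREATE TABLE statistic_bucket_meta
     (slug TEXT PRIMARY KEY  -- identifies the statistic
     ,stype TEXT NOT NULL    -- 'amount' or 'number'
     ,brange TEXT NOT NULL   -- bucket range: 'day', 'month', 'quarter', 'year'
     ,ages INT4 NOT NULL     -- generations of the bucket to keep
     );

   CREATE TABLE statistic_interval_meta
     (slug TEXT PRIMARY KEY       -- identifies the statistic
     ,stype TEXT NOT NULL         -- 'amount' or 'number'
     ,ranges INT8[] NOT NULL      -- interval lengths in seconds, ascending
     ,precisions INT8[] NOT NULL  -- rounding granularity per range, ascending
     );
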
Stored procedures
-----------------

We additionally provide a few stored procedures to
update the statistics (a usage sketch follows this list):

* **bump_number_bucket_stat** (slug, object, timestamp, delta):
  increases the *slug* bucket counter of *object* at *timestamp*
  by *delta*.  Does nothing if *slug* is not in the
  bucket meta-data table or if *timestamp* is further in the
  past than the buckets we are currently tracking;
* **bump_amount_bucket_stat** (slug, object, timestamp, amount):
  similar to the above, just for statistics of type amount;
* **bump_number_interval_stat** (slug, object, timestamp, delta):
  increases the *slug* interval counter of *object* at
  *timestamp* by *delta*;
* **bump_amount_interval_stat** (slug, object, timestamp, amount):
  similar to the above, just for statistics of type amount;
* **bump_number_stat** (slug, object, timestamp, delta):
  increases both bucket and interval counters for *slug*
  of *object* at *timestamp* by *delta*;
* **bump_amount_stat** (slug, object, timestamp, amount):
  similar to the above, just for statistics of type amount.

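For example, a trigger could record a deposit as follows. The values
are purely illustrative, and the exact representation of timestamps
and amounts depends on the final schema:

.. code-block:: sql

   -- Hypothetical invocation: add 50.0 currency units to the
   -- 'deposits' statistic of one bank account at the current time.
   SELECT bump_amount_stat(
     'deposits',                       -- slug from the meta-data tables
     '\xdeadbeef'::BYTEA,              -- object: payto hash (shortened here)
     extract(epoch FROM now())::INT8,  -- event timestamp
     (50,0)::taler_amount);            -- amount as a (val,frac) pair
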
We furthermore provide a few stored procedures to
access the interval statistics (while making sure
they are current).  These are:

* **statistic_interval_number_get** (slug, object):
  returns all non-zero counters and time intervals that
  we are tracking for the given *slug* and *object*;
  the intervals are updated and events are possibly
  discarded or coarsened (if the precision of a
  larger interval is numerically larger and thus
  allows for more rounding);
* **statistic_interval_amount_get** (slug, object):
  returns all non-zero amounts and time intervals that
  we are tracking for the given *slug* and *object*;
  the intervals are updated and events are possibly
  discarded or coarsened (if the precision of a
  larger interval is numerically larger and thus
  allows for more rounding).

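A REST handler could thus fetch all current values for one statistic
and account with a single call (again a sketch; the exact result
columns are to be determined):

.. code-block:: sql

   -- Hypothetical query: returns one row per non-zero interval,
   -- e.g. (range=60, amount=...), (range=120, amount=...), etc.
   SELECT *
     FROM statistic_interval_amount_get(
       'deposits',             -- slug
       '\xdeadbeef'::BYTEA);   -- object: payto hash (shortened here)
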
Finally, there are some helpers for cleaning up:

* **statistic_bucket_gc** ():
  removes buckets and events past the range for which
  we track statistics; also coarsens events to the
  precision of the range into which they fall.
* **exchange_drop_customization** (schema):
  a special function for the exchange which allows dropping
  customizations by *schema*.

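The garbage collection helper would presumably be invoked
periodically, for example (hypothetical usage) from a cron job:

.. code-block:: sql

   -- Periodic maintenance: expire old buckets and coarsen events
   -- that slid into a lower-precision range.
   SELECT statistic_bucket_gc();
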
Schema
------

For the exchange, all triggers (and associated stored
procedures) should be stored in one or more payment
service provider-specific schemas.  Furthermore, the name
of the schema MUST be provided in the *schema* column of
the meta-data table entries.  This way, all triggers,
stored procedures and statistics can be removed simply
by DROPing the SCHEMA and removing the associated
entries from the meta-data tables (the foreign key
constraints then CASCADE and clean up the statistics
themselves).

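Removing a customization named, say, ``myprov`` (a hypothetical
name) thus boils down to something like:

.. code-block:: sql

   -- Sketch of what exchange_drop_customization('myprov') would do:
   DROP SCHEMA myprov CASCADE;
   DELETE FROM statistic_bucket_meta
     WHERE schema='myprov';   -- CASCADEs to the statistics themselves
   DELETE FROM statistic_interval_meta
     WHERE schema='myprov';
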
The SCHEMA name must also be used as the prefix
for the SQL files that inject the triggers. The
usual database versioning should also be used,
except this time using the SCHEMA name instead of
"exchange-". "exchange" is not allowed as a name
for a customization SCHEMA.

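For example, for a customization SCHEMA named ``myprov`` (again a
hypothetical name), the versioned patch files would be named
``myprov-0001.sql``, ``myprov-0002.sql`` and so on, mirroring the
existing ``exchange-0001.sql`` convention.
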
**taler-exchange-dbinit** is extended with command-line arguments to load the
latest version of a customization schema or to call
**exchange_drop_customization** to remove one.

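The concrete option names are not yet fixed; the invocations might
look along these lines (hypothetical flags):

.. code-block:: console

   $ taler-exchange-dbinit --load-customization=myprov  # hypothetical option
   $ taler-exchange-dbinit --drop-customization=myprov  # hypothetical option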


Definition of Done
==================

- key statistics for merchant and TOPS deployment implemented
- REST API for merchant specified and implemented
- REST API for AML officer specified and implemented
- SPAs visualize key statistics

Alternatives
============

- batch processing to compute statistics, where the REST API only
  returns the value computed by the last batch (the computational
  cost is only paid if the statistic is desired, but may then be
  high; there is also the potential for outdated data being shown);
- computing statistics on the C side; this may have more data
  easily available, but has the major disadvantage of making
  it harder to add or remove statistics and makes the transaction
  logic more complex; it is also easier to miss triggering events.


Drawbacks
=========

- ongoing baseline cost for statistics even if nobody looks
  at them (but only if the respective statistic is enabled)


Discussion / Q&A
================

(This should be filled in with results from discussions on mailing lists / personal communication.)