quickjs-tart

quickjs-based runtime for wallet-core logic
Log | Files | Refs | README | LICENSE

README.md (24023B)


      1 *This is the original README for the p256-m repository. Please note that as
      2 only a subset of p256-m's files are present in Mbed TLS, this README may refer
      3 to files that are not present/relevant here.*
      4 
      5 p256-m is a minimalistic implementation of ECDH and ECDSA on NIST P-256,
      6 especially suited to constrained 32-bit environments. It's written in standard
      7 C, with optional bits of assembly for Arm Cortex-M and Cortex-A CPUs.
      8 
      9 Its design is guided by the following goals in this order:
     10 
     11 1. correctness & security;
     12 2. low code size & RAM usage;
     13 3. runtime performance.
     14 
     15 Most cryptographic implementations care more about speed than footprint, and
     16 some might even risk weakening security for more speed. p256-m was written
     17 because I wanted to see what happened when reversing the usual emphasis.
     18 
     19 The result is a full implementation of ECDH and ECDSA in **less than 3KiB of
     20 code**, using **less than 768 bytes of RAM**, with comparable performance
     21 to existing implementations (see below) - in less than 700 LOC.
     22 
     23 _Contents of this Readme:_
     24 
     25 - [Correctness](#correctness)
     26 - [Security](#security)
     27 - [Code size](#code-size)
     28 - [RAM usage](#ram-usage)
     29 - [Runtime performance](#runtime-performance)
     30 - [Comparison with other implementations](#comparison-with-other-implementations)
     31 - [Design overview](#design-overview)
     32 - [Notes about other curves](#notes-about-other-curves)
     33 - [Notes about other platforms](#notes-about-other-platforms)
     34 
     35 ## Correctness
     36 
     37 **API design:**
     38 
     39 - The API is minimal: only 4 public functions.
     40 - Each public function fully validates its inputs and returns specific errors.
     41 - The API uses arrays of octets for all input and output.
     42 
     43 **Testing:**
     44 
     45 - p256-m is validated against multiple test vectors from various RFCs and
     46   NIST.
     47 - In addition, crafted inputs are used for negative testing and to reach
     48   corner cases.
     49 - Two test suites are provided: one for closed-box testing (using only the
     50   public API), one for open-box testing (for unit-testing internal functions,
     51 and reaching more error cases by exploiting knowledge of how the RNG is used).
     52 - The resulting branch coverage is maximal: closed-box testing reaches all
     53   branches except four; three of them are reached by open-box testing using a
     54 rigged RNG; the last branch could only be reached by computing a discrete log
     55 on P-256... See `coverage.sh`.
     56 - Testing also uses dynamic analysis: valgrind, ASan, MemSan, UBSan.
     57 
     58 **Code quality:**
     59 
     60 - The code is standard C99; it builds without warnings with `clang
     61   -Weverything` and `gcc -Wall -Wextra -pedantic`.
     62 - The code is small and well documented, including internal APIs: with the
     63   header file, it's less than 700 lines of code, and more lines of comments
     64 than of code.
     65 - However it _has not been reviewed_ independently so far, as this is a
     66   personal project.
     67 
     68 **Short Weierstrass pitfalls:**
     69 
     70 Its has been [pointed out](https://safecurves.cr.yp.to/) that the NIST curves,
     71 and indeed all Short Weierstrass curves, have a number of pitfalls including
     72 risk for the implementation to:
     73 
     74 - "produce incorrect results for some rare curve points" - this is avoided by
     75   carefully checking the validity domain of formulas used throughout the code;
     76 - "leak secret data when the input isn't a curve point" - this is avoided by
     77   validating that points lie on the curve every time a point is deserialized.
     78 
     79 ## Security
     80 
     81 In addition to the above correctness claims, p256-m has the following
     82 properties:
     83 
     84 - it has no branch depending (even indirectly) on secret data;
     85 - it has no memory access depending (even indirectly) on secret data.
     86 
     87 These properties are checked using valgrind and MemSan with the ideas
     88 behind [ctgrind](https://github.com/agl/ctgrind), see `consttime.sh`.
     89 
     90 In addition to avoiding branches and memory accesses depending on secret data,
     91 p256-m also avoid instructions (or library functions) whose execution time
     92 depends on the value of operands on cores of interest. Namely, it never uses
     93 integer division, and for multiplication by default it only uses 16x16->32 bit
     94 unsigned multiplication. On cores which have a constant-time 32x32->64 bit
     95 unsigned multiplication instruction, the symbol `MUL64_IS_CONSTANT_TIME` can
     96 be defined by the user at compile-time to take advantage of it in order to
     97 improve performance and code size. (On Cortex-M and Cortex-A cores wtih GCC or
     98 Clang this is not necessary, since inline assembly is used instead.)
     99 
    100 As a result, p256-m should be secure against the following classes of attackers:
    101 
    102 1. attackers who can only manipulate the input and observe the output;
    103 2. attackers who can also measure the total computation time of the operation;
    104 3. attackers who can also observe and manipulate micro-architectural features
    105    such as the cache or branch predictor with arbitrary precision.
    106 
    107 However, p256-m makes no attempt to protect against:
    108 
    109 4. passive physical attackers who can record traces of physical emissions
    110    (power, EM, sound) of the CPU while it manipulates secrets;
    111 5. active physical attackers who can also inject faults in the computation.
    112 
    113 (Note: p256-m should actually be secure against SPA, by virtue of being fully
    114 constant-flow, but is not expected to resist any other physical attack.)
    115 
    116 **Warning:** p256-m requires an externally-provided RNG function. If that
    117 function is not cryptographically secure, then neither is p256-m's key
    118 generation or ECDSA signature generation.
    119 
    120 _Note:_ p256-m also follows best practices such as securely erasing secret
    121 data on the stack before returning.
    122 
    123 ## Code size
    124 
    125 Compiled with
    126 [ARM-GCC 9](https://developer.arm.com/tools-and-software/open-source-software/developer-tools/gnu-toolchain/gnu-rm/downloads),
    127 with `-mthumb -Os`, here are samples of code sizes reached on selected cores:
    128 
    129 - Cortex-M0: 2988 bytes
    130 - Cortex-M4: 2900 bytes
    131 - Cortex-A7: 2924 bytes
    132 
    133 Clang was also tried but tends to generate larger code (by about 10%). For
    134 details, see `sizes.sh`.
    135 
    136 **What's included:**
    137 
    138 - Full input validation and (de)serialisation of input/outputs to/from bytes.
    139 - Cleaning up secret values from the stack before returning from a function.
    140 - The code has no dependency on libc functions or the toolchain's runtime
    141   library (such as helpers for long multiply); this can be checked for the
    142 Arm-GCC toolchain with the `deps.sh` script.
    143 
    144 **What's excluded:**
    145 
    146 - A secure RNG function needs to be provided externally, see
    147   `p256_generate_random()` in `p256-m.h`.
    148 
    149 ## RAM usage
    150 
    151 p256-m doesn't use any dynamic memory (on the heap), only the stack. Here's
    152 how much stack is used by each of its 4 public functions on selected cores:
    153 
    154 | Function                  | Cortex-M0 | Cortex-M4 | Cortex-A7 |
    155 | ------------------------- | --------: | --------: | --------: |
    156 | `p256_gen_keypair`        |       608 |       564 |       564 |
    157 | `p256_ecdh_shared_secret` |       640 |       596 |       596 |
    158 | `p256_ecdsa_sign`         |       664 |       604 |       604 |
    159 | `p256_ecdsa_verify`       |       752 |       700 |       700 |
    160 
    161 For details, see `stack.sh`, `wcs.py` and `libc.msu` (the above figures assume
    162 that the externally-provided RNG function uses at most 384 bytes of stack).
    163 
    164 ## Runtime performance
    165 
    166 Here are the timings of each public function in milliseconds measured on
    167 platforms based on a selection of cores:
    168 
    169 - Cortex-M0 at  48 MHz: STM32F091 board running Mbed OS 6
    170 - Cortex-M4 at 100 MHz: STM32F411 board running Mbed OS 6
    171 - Cortex-A7 at 900 MHz: Raspberry Pi 2B running Raspbian Buster
    172 
    173 | Function                  | Cortex-M0 | Cortex-M4 | Cortex-A7 |
    174 | ------------------------- | --------: | --------: | --------: |
    175 | `p256_gen_keypair`        |       921 |       145 |        11 |
    176 | `p256_ecdh_shared_secret` |       922 |       144 |        11 |
    177 | `p256_ecdsa_sign`         |       990 |       155 |        12 |
    178 | `p256_ecdsa_verify`       |      1976 |       309 |        24 |
    179 | Sum of the above          |      4809 |       753 |        59 |
    180 
    181 The sum of these operations corresponds to a TLS handshake using ECDHE-ECDSA
    182 with mutual authentication based on raw public keys or directly-trusted
    183 certificates (otherwise, add one 'verify' for each link in the peer's
    184 certificate chain).
    185 
    186 _Note_: the above figures where obtained by compiling with GCC, which is able
    187 to use inline assembly. Without that inline assembly (22 lines for Cortex-M0,
    188 1 line for Cortex-M4), the code would be roughly 2 times slower on those
    189 platforms. (The effect is much less important on the Cortex-A7 core.)
    190 
    191 For details, see `bench.sh`, `benchmark.c` and `on-target-benchmark/`.
    192 
    193 ## Comparison with other implementations
    194 
    195 The most relevant/convenient implementation for comparisons is
    196 [TinyCrypt](https://github.com/intel/tinycrypt), as it's also a standalone
    197 implementation of ECDH and ECDSA on P-256 only, that also targets constrained
    198 devices. Other implementations tend to implement many curves and build on a
    199 shared bignum/MPI module (possibly also supporting RSA), which makes fair
    200 comparisons less convenient.
    201 
    202 The scripts used for TinyCrypt measurements are available in [this
    203 branch](https://github.com/mpg/tinycrypt/tree/measurements), based on version
    204 0.2.8.
    205 
    206 **Code size**
    207 
    208 | Core      | p256-m | TinyCrypt |
    209 | --------- | -----: | --------: |
    210 | Cortex-M0 |   2988 |      6134 |
    211 | Cortex-M4 |   2900 |      5934 |
    212 | Cortex-A7 |   2924 |      5934 |
    213 
    214 **RAM usage**
    215 
    216 TinyCrypto also uses no heap, only the stack. Here's the RAM used by each
    217 operation on a Cortex-M0 core:
    218 
    219 | operation          | p256-m | TinyCrypt |
    220 | ------------------ | -----: | --------: |
    221 | key generation     |    608 |       824 |
    222 | ECDH shared secret |    640 |       728 |
    223 | ECDSA sign         |    664 |       880 |
    224 | ECDSA verify       |    752 |       824 |
    225 
    226 On a Cortex-M4 or Cortex-A7 core (identical numbers):
    227 
    228 | operation          | p256-m | TinyCrypt |
    229 | ------------------ | -----: | --------: |
    230 | key generation     |    564 |       796 |
    231 | ECDH shared secret |    596 |       700 |
    232 | ECDSA sign         |    604 |       844 |
    233 | ECDSA verify       |    700 |       808 |
    234 
    235 **Runtime performance**
    236 
    237 Here are the timings of each operation in milliseconds measured on
    238 platforms based on a selection of cores:
    239 
    240 _Cortex-M0_ at  48 MHz: STM32F091 board running Mbed OS 6
    241 
    242 | Operation          | p256-m | TinyCrypt |
    243 | ------------------ | -----: | --------: |
    244 | Key generation     |    921 |       979 |
    245 | ECDH shared secret |    922 |       975 |
    246 | ECDSA sign         |    990 |      1009 |
    247 | ECDSA verify       |   1976 |      1130 |
    248 | Sum of those 4     |   4809 |      4093 |
    249 
    250 _Cortex-M4_ at 100 MHz: STM32F411 board running Mbed OS 6
    251 
    252 | Operation          | p256-m | TinyCrypt |
    253 | ------------------ | -----: | --------: |
    254 | Key generation     |    145 |       178 |
    255 | ECDH shared secret |    144 |       177 |
    256 | ECDSA sign         |    155 |       188 |
    257 | ECDSA verify       |    309 |       210 |
    258 | Sum of those 4     |    753 |       753 |
    259 
    260 _Cortex-A7_ at 900 MHz: Raspberry Pi 2B running Raspbian Buster
    261 
    262 | Operation          | p256-m | TinyCrypt |
    263 | ------------------ | -----: | --------: |
    264 | Key generation     |     11 |        13 |
    265 | ECDH shared secret |     11 |        13 |
    266 | ECDSA sign         |     12 |        14 |
    267 | ECDSA verify       |     24 |        15 |
    268 | Sum of those 4     |     59 |        55 |
    269 
    270 _64-bit Intel_ (i7-6500U at 2.50GHz) laptop running Ubuntu 20.04
    271 
    272 Note: results in microseconds (previous benchmarks in milliseconds)
    273 
    274 | Operation          | p256-m | TinyCrypt |
    275 | ------------------ | -----: | --------: |
    276 | Key generation     |   1060 |      1627 |
    277 | ECDH shared secret |   1060 |      1611 |
    278 | ECDSA sign         |   1136 |      1712 |
    279 | ECDSA verify       |   2279 |      1888 |
    280 | Sum of those 4     |   5535 |      6838 |
    281 
    282 **Other differences**
    283 
    284 - While p256-m fully validates all inputs, Tinycrypt's ECDH shared secret
    285   function doesn't include validation of the peer's public key, which should be
    286 done separately by the user for static ECDH (there are attacks [when users
    287 forget](https://link.springer.com/chapter/10.1007/978-3-319-24174-6_21)).
    288 - The two implementations have slightly different security characteristics:
    289   p256-m is fully constant-time from the ground up so should be more robust
    290 than TinyCrypt against powerful local attackers (such as an untrusted OS
    291 attacking a secure enclave); on the other hand TinyCrypt includes coordinate
    292 randomisation which protects against some passive physical attacks (such as
    293 DPA, see Table 3, column C9 of [this
    294 paper](https://www.esat.kuleuven.be/cosic/publications/article-2293.pdf#page=12)),
    295 which p256-m completely ignores.
    296 - TinyCrypt's code looks like it could easily be expanded to support other
    297   curves, while p256-m has much more hard-coded to minimize code size (see
    298 "Notes about other curves" below).
    299 - TinyCrypt uses a specialised routine for reduction modulo the curve prime,
    300   exploiting its structure as a Solinas prime, which should be faster than the
    301 generic Montgomery reduction used by p256-m, but other factors appear to
    302 compensate for that.
    303 - TinyCrypt uses Co-Z Jacobian formulas for point operation, which should be
    304   faster (though a bit larger) than the mixed affine-Jacobian formulas
    305 used by p256-m, but again other factors appear to compensate for that.
    306 - p256-m uses bits of inline assembly for 64-bit multiplication on the
    307   platforms used for benchmarking, while TinyCrypt uses only C (and the
    308 compiler's runtime library).
    309 - TinyCrypt uses a specialised routine based on Shamir's trick for
    310   ECDSA verification, which gives much better performance than the generic
    311 code that p256-m uses in order to minimize code size.
    312 
    313 ## Design overview
    314 
    315 The implementation is contained in a single file to keep most functions static
    316 and allow for more optimisations. It is organized in multiple layers:
    317 
    318 - Fixed-width multi-precision arithmetic
    319 - Fixed-width modular arithmetic
    320 - Operations on curve points
    321 - Operations with scalars
    322 - The public API
    323 
    324 **Multi-precision arithmetic.**
    325 
    326 Large integers are represented as arrays of `uint32_t` limbs. When carries may
    327 occur, casts to `uint64_t` are used to nudge the compiler towards using the
    328 CPU's carry flag. When overflow may occur, functions return a carry flag.
    329 
    330 This layer contains optional assembly for Cortex-M and Cortex-A cores, for the
    331 internal `u32_muladd64()` function, as well as two pure C versions of this
    332 function, depending on whether `MUL64_IS_CONSTANT_TIME`.
    333 
    334 This layer's API consists of:
    335 
    336 - addition, subtraction;
    337 - multiply-and-add, shift by one limb (for Montgomery multiplication);
    338 - conditional assignment, assignment of a small value;
    339 - comparison of two values for equality, comparison to 0 for equality;
    340 - (de)serialization as big-endian arrays of bytes.
    341 
    342 **Modular arithmetic.**
    343 
    344 All modular operations are done in the Montgomery domain, that is x is
    345 represented by `x * 2^256 mod m`; integers need to be converted to that domain
    346 before computations, and back from it afterwards. Montgomery constants
    347 associated to the curve's p and n are pre-computed and stored in static
    348 structures.
    349 
    350 Modular inversion is computed using Fermat's little theorem to get
    351 constant-time behaviour with respect to the value being inverted.
    352 
    353 This layer's API consists of:
    354 
    355 - the curve's constants p and n (and associated Montgomery constants);
    356 - modular addition, subtraction, multiplication, and inversion;
    357 - assignment of a small value;
    358 - conversion to/from Montgomery domain;
    359 - (de)serialization to/from bytes with integrated range checking and
    360   Montgomery domain conversion.
    361 
    362 **Operations on curve points.**
    363 
    364 Curve points are represented using either affine or Jacobian coordinates;
    365 affine coordinates are extended to represent 0 as (0,0). Individual
    366 coordinates are always in the Montgomery domain.
    367 
    368 Not all formulas associated with affine or Jacobian coordinates are complete;
    369 great care is taken to document and satisfy each function's pre-conditions.
    370 
    371 This layer's API consists of:
    372 
    373 - curve constants: b from the equation, the base point's coordinates;
    374 - point validity check (on the curve and not 0);
    375 - Jacobian to affine coordinate conversion;
    376 - point doubling in Jacobian coordinates (complete formulas);
    377 - point addition in mixed affine-Jacobian coordinates (P not in {0, Q, -Q});
    378 - point addition-or-doubling in affine coordinates (leaky version, only used
    379   for ECDSA verify where all data is public);
    380 - (de)serialization to/from bytes with integrated validity checking
    381 
    382 **Scalar operations.**
    383 
    384 The crucial function here is scalar multiplication. It uses a signed binary
    385 ladder, which is a variant of the good old double-and-add algorithm where an
    386 addition/subtraction is performed at each step. Again, care is taken to make
    387 sure the pre-conditions for the addition formulas are always satisfied. The
    388 signed binary ladder only works if the scalar is odd; this is ensured by
    389 negating both the scalar (mod n) and the input point if necessary.
    390 
    391 This layer's API consists of:
    392 
    393 - scalar multiplication
    394 - de-serialization from bytes with integrated range checking
    395 - generation of a scalar and its associated public key
    396 
    397 **Public API.**
    398 
    399 This layer builds on the others, but unlike them, all inputs and outputs are
    400 byte arrays. Key generation and ECDH shared secret computation are thin
    401 wrappers around internal functions, just taking care of format conversions and
    402 errors. The ECDSA functions have more non-trivial logic.
    403 
    404 This layer's API consists of:
    405 
    406 - key-pair generation
    407 - ECDH shared secret computation
    408 - ECDSA signature creation
    409 - ECDSA signature verification
    410 
    411 **Testing.**
    412 
    413 A self-contained, straightforward, pure-Python implementation was first
    414 produced as a warm-up and to help check intermediate values. Test vectors from
    415 various sources are embedded and used to validate the implementation.
    416 
    417 This implementation, `p256.py`, is used by a second Python script,
    418 `gen-test-data.py`, to generate additional data for both positive and negative
    419 testing, available from a C header file, that is then used by the closed-box
    420 and open-box test programs.
    421 
    422 p256-m can be compiled with extra instrumentation to mark secret data and
    423 allow either valgrind or MemSan to check that no branch or memory access
    424 depends on it (even indirectly). Macros are defined for this purpose near the
    425 top of the file.
    426 
    427 **Tested platforms.**
    428 
    429 There are 4 versions of the internal function `u32_muladd64`: two assembly
    430 versions, for Cortex-M/A cores with or without the DSP extension, and two
    431 pure-C versions, depending on whether `MUL64_IS_CONSTANT_TIME`.
    432 
    433 Tests are run on the following platforms:
    434 
    435 - `make` on x64 tests the pure-C version without `MUL64_IS_CONSTANT_TIME`
    436   (with Clang).
    437 - `./consttime.sh` on x64 tests both pure-C versions (with Clang).
    438 - `make` on Arm v7-A (Raspberry Pi 2) tests the Arm-DSP assembly version (with
    439   Clang).
    440 - `on-target-*box` on boards based on Cortex-M0 and M4 cores test both
    441   assembly versions (with GCC).
    442 
    443 In addition:
    444 
    445 - `sizes.sh` builds the code for three Arm cores with GCC and Clang.
    446 - `deps.sh` checks for external dependencies with GCC.
    447 
    448 ## Notes about other curves
    449 
    450 It should be clear that minimal code size can only be reached by specializing
    451 the implementation to the curve at hand. Here's a list of things in the
    452 implementation that are specific to the NIST P-256 curve, and how the
    453 implementation could be changed to expand to other curves, layer by layer (see
    454 "Design Overview" above).
    455 
    456 **Fixed-width multi-precision arithmetic:**
    457 
    458 - The number of limbs is hard-coded to 8. For other 256-bit curves, nothing to
    459   change. For a curve of another size, hard-code to another value. For multiple
    460 curves of various sizes, add a parameter to each function specifying the
    461 number of limbs; when declaring arrays, always use the maximum number of
    462 limbs.
    463 
    464 **Fixed-width modular arithmetic:**
    465 
    466 - The values of the curve's constant p and n, and their associated Montgomery
    467   constants, are hard-coded. For another curve, just hard-code the new constants.
    468 For multiple other curves, define all the constants, and from this layer's API
    469 only keep the functions that already accept a `mod` parameter (that is, remove
    470 convenience functions `m256_xxx_p()`).
    471 - The number of limbs is again hard-coded to 8. See above, but it order to
    472   support multiple sizes there is no need to add a new parameter to functions
    473 in this layer: the existing `mod` parameter can include the number of limbs as
    474 well.
    475 
    476 **Operations on curve points:**
    477 
    478 - The values of the curve's constants b (constant term from the equation) and
    479   gx, gy (coordinates of the base point) are hard-coded. For another curve,
    480   hard-code the other values. For multiple curves, define each curve's value and
    481 add a "curve id" parameter to all functions in this layer.
    482 - The value of the curve's constant a is implicitly hard-coded to `-3` by using
    483   a standard optimisation to save one multiplication in the first step of
    484 `point_double()`. For curves that don't have a == -3, replace that with the
    485 normal computation.
    486 - The fact that b != 0 in the curve equation is used indirectly, to ensure
    487   that (0, 0) is not a point on the curve and re-use that value to represent
    488 the point 0. As far as I know, all Short Weierstrass curves standardized so
    489 far have b != 0.
    490 - The shape of the curve is assumed to be Short Weierstrass. For other curve
    491   shapes (Montgomery, (twisted) Edwards), this layer would probably look very
    492 different (both implementation and API).
    493 
    494 **Scalar operations:**
    495 
    496 - If multiple curves are to be supported, all function in this layer need to
    497   gain a new "curve id" parameter.
    498 - This layer assumes that the bit size of the curve's order n is the same as
    499   that of the modulus p. This is true of most curves standardized so far, the
    500 only exception being secp224k1. If that curve were to be supported, the
    501 representation of `n` and scalars would need adapting to allow for an extra
    502 limb.
    503 - The bit size of the curve's order is hard-coded in `scalar_mult()`. For
    504   multiple curves, this should be deduced from the "curve id" parameter.
    505 - The `scalar_mult()` function exploits the fact that the second least
    506   significant bit of the curve's order n is set in order to avoid a special
    507 case. For curve orders that don't meet this criterion, we can just handle that
    508 special case (multiplication by +-2) separately (always compute that and
    509 conditionally assign it to the result).
    510 - The shape of the curve is again assumed to be Short Weierstrass. For other curve
    511   shapes (Montgomery, (twisted) Edwards), this layer would probably have a
    512 very different implementation.
    513 
    514 **Public API:**
    515 
    516 - For multiple curves, all functions in this layer would need to gain a "curve
    517   id" parameter and handle variable-sized input/output.
    518 - The shape of the curve is again assumed to be Short Weierstrass. For other curve
    519   shapes (Montgomery, (twisted) Edwards), the ECDH API would probably look
    520 quite similar (with differences in the size of public keys), but the ECDSA API
    521 wouldn't apply and an EdDSA API would look pretty different.
    522 
    523 ## Notes about other platforms
    524 
    525 While p256-m is standard C99, it is written with constrained 32-bit platforms
    526 in mind and makes a few assumptions about the platform:
    527 
    528 - The types `uint8_t`, `uint16_t`, `uint32_t` and `uint64_t` exist.
    529 - 32-bit unsigned addition and subtraction with carry are constant time.
    530 - 16x16->32-bit unsigned multiplication is available and constant time.
    531 
    532 Also, on platforms on which 64-bit addition and subtraction with carry, or
    533 even 64x64->128-bit multiplication, are available, p256-m makes no use of
    534 them, though they could significantly improve performance.
    535 
    536 This could be improved by replacing uses of arrays of `uint32_t` with a
    537 defined type throughout the internal APIs, and then on 64-bit platforms define
    538 that type to be an array of `uint64_t` instead, and making the obvious
    539 adaptations in the multi-precision arithmetic layer.
    540 
    541 Finally, the optional assembly code (which boosts performance by a factor 2 on
    542 tested Cortex-M CPUs, while slightly reducing code size and stack usage) is
    543 currently only available with compilers that support GCC's extended asm
    544 syntax (which includes GCC and Clang).