NTRU Prime

Current software

ntruprime-20200930.sage: Reference implementation in Sage (Python plus math libraries) for sntrup653, sntrup761 (and sntrup4591761), sntrup857, sntrup953, sntrup1013, sntrup1277, ntrulpr653, ntrulpr761 (and ntrulpr4591761), ntrulpr857, ntrulpr953, ntrulpr1013, and ntrulpr1277.

Reference C software:

Optimized software:

Hardware implementation:

Archives

See also the crypto_kem/sntrup4591761 and crypto_kem/ntrulpr4591761 directories in the SUPERCOP benchmarking toolkit.

ntruprime.sage: 2019 reference implementation and some tests for sntrup653, sntrup761 (and sntrup4591761), sntrup857, ntrulpr653, ntrulpr761 (and ntrulpr4591761), and ntrulpr857.

sntrup4591761.sage: Original reference implementation of sntrup4591761.

ntrulpr4591761.sage: Original reference implementation of ntrulpr4591761.

ntruprime-haswell-20190712.tar.gz: NTRU Prime Haswell assembly-language code package.

ntruprime-20171206.tar.gz: Portable C reference implementations sntrup4591761/ref and ntrulpr4591761/ref; and Haswell-optimized implementations sntrup4591761/avx and ntrulpr4591761/avx. The avx implementations include a very fast general-purpose integer-sorting library, avx/int32_sort.c, of independent interest.

ntruprime-20170815.tar.gz. Changes after this version: Added NTRU LPRime. NIST KAT compatibility (with -DKAT). KAT intermediate-value output (with -DKAT). Namespacing improvements. Removed unused variables.

An older constant-time hardware implementation in VHDL for sntrup653, sntrup761 and sntrup857 is available at https://github.com/AdrianMarotzke/SNTRUP.

C software external interface

The C implementations support the eBACS KEM interface. This means that they provide three functions to callers, as shown below. All occurrences of crypto_kem below are actually crypto_kem_sntrup761 for sntrup761, crypto_kem_ntrulpr761 for ntrulpr761, etc., or a literal crypto_kem in NaCl-style cryptographic libraries.


    #include "crypto_kem.h"

This is not a source file; it is created automatically by SUPERCOP.


    unsigned char pk[crypto_kem_PUBLICKEYBYTES];
    unsigned char sk[crypto_kem_SECRETKEYBYTES];

    crypto_kem_keypair(pk,sk);

Generates a random public key pk[0],...,pk[PUBLICKEYBYTES-1] and corresponding secret key sk[0],...,sk[SECRETKEYBYTES-1]. Always returns 0.


    unsigned char c[crypto_kem_CIPHERTEXTBYTES];
    unsigned char k[crypto_kem_BYTES];
    const unsigned char pk[crypto_kem_PUBLICKEYBYTES];

    crypto_kem_enc(c,k,pk);

Generates a random session key k[0],...,k[BYTES-1] and corresponding ciphertext c[0],...,c[CIPHERTEXTBYTES-1] given a public key pk[0],...,pk[PUBLICKEYBYTES-1]. Always returns 0.


    unsigned char k[crypto_kem_BYTES];
    const unsigned char c[crypto_kem_CIPHERTEXTBYTES];
    const unsigned char sk[crypto_kem_SECRETKEYBYTES];

    crypto_kem_dec(k,c,sk);

Recovers a session key k[0],...,k[BYTES-1] from a ciphertext c[0],...,c[CIPHERTEXTBYTES-1] using a secret key sk[0],...,sk[SECRETKEYBYTES-1]. Returns 0. However, if the ciphertext is invalid, returns -1, possibly after modifying k[0],...,k[BYTES-1].

C software internals for sntrup4591761 and ntrulpr4591761

All of the following details are subject to change. These function names are not exposed to callers. Some functions are specific to ref, and some are specific to avx; some are specific to sntrup4591761, and some are specific to ntrulpr4591761.

Parameters: All of the software takes q to be 4591 and takes p to be 761. For sntrup4591761, the weight parameter w is 286; for ntrulpr4591761, w is 250.

In modular computations there is a distinction between "freezing" integers (reducing them to unique representatives) and "squeezing" integers (reducing them to save time and avoid overflow, but not necessarily freezing). In this software, frozen integers modulo 3 are defined as -1, 0, or 1; frozen integers modulo q are defined as integers between -2295 and 2295; frozen polynomials are defined as linear combinations of 1,x,...,x760 with frozen coefficients. These representations are not expected to change and are not strictly modularized, but sometimes they are abstracted for clarity.

Polynomials are stored as p_padded coefficients, where p_padded is currently 761 for ref and 768 for avx. When p_padded is larger than 761, the remaining coefficients are stored as 0. The current software does not use or provide the p_padded name, but this documentation uses it.

Software clarity, software performance, and specification clarity are obviously different metrics, often producing different function-call structures.


    #include "hide.h"

    unsigned char c[crypto_kem_CIPHERTEXTBYTES];
    unsigned char k[crypto_kem_BYTES];
    const unsigned char pk[crypto_kem_PUBLICKEYBYTES];
    const unsigned char r[32];

    hide(c,k,pk,r);

Same as crypto_kem_enc, except that hide takes 32 explicit bytes of randomness r[0],...,r[31] as input, while crypto_kem_enc takes 32 bytes from randombytes.

The current software provides this function only for ntrulpr4591761. An analogous function for sntrup4591761 would have a small polynomial r as input rather than a 32-byte string.


    #include "int32_sort.h"

    crypto_int32 x[...]
    int xlen;

    int32_sort(x,xlen);

Sorts x[0],...,x[xlen-1] into order.


    crypto_int32 x, y;

    minmax(&x,&y);

Sorts x and y into order.


    #include "mod3.h"

    small x;
    crypto_int32 i;

    x = mod3_freeze(i);

Reads i between -100000 and 100000. Freezes modulo 3. Returns the result.


    #include "mod3.h"

    small x;
    small a, b, c;

    x = mod3_minusproduct(a,b,c);

Reads frozen a, b, c modulo 3. Computes a-bc. Freezes modulo 3. Returns the result.


    #include "mod3.h"

    int c;
    small x;

    c = mod3_nonzero_mask(x);

Reads frozen x modulo 3. Returns -1 if x is nonzero modulo 3, 0 otherwise.


    #include "mod3.h"

    small x;
    small a, b, c;

    x = mod3_plusproduct(a,b,c);

Reads frozen a, b, c modulo 3. Computes a+bc. Freezes modulo 3. Returns the result.


    #include "mod3.h"

    small x;
    small a, b;

    x = mod3_product(a,b);

Reads frozen a, b modulo 3. Computes ab. Freezes modulo 3 (which is a no-op). Returns the result.


    #include "mod3.h"

    small x;
    small a, b;

    x = mod3_quotient(a,b);

Same as x = mod3_product(a,mod3_reciprocal(b)). See mod3_product and mod3_reciprocal.


    #include "mod3.h"

    small x;
    small a;

    x = mod3_reciprocal(a);

Reads frozen a modulo 3. Computes the reciprocal modulo 3, or 0 if the input is 0 modulo 3. Freezes modulo 3. (All of this is a no-op.) Returns the result.


    #include "mod3.h"

    small x;
    small a, b;

    x = mod3_sum(a,b);

Reads frozen a, b modulo 3. Computes a+b. Freezes modulo 3. Returns the result.


    #include "modq.h"

    modq x;
    crypto_int32 i;

    x = modq_freeze(i);

Reads i between -9000000 and 9000000. Freezes modulo q. Returns the result.


    #include "modq.h"

    modq x;
    crypto_uint32 u;

    x = modq_fromuint32(u);

Similar to modq_freeze but has a wider input range. Specifically, reads an integer u between 0 and 4294967295; subtracts 2295; freezes modulo q; returns the result.


    #include "modq.h"

    modq x;
    modq a, b, c;

    x = modq_minusproduct(a,b,c);

Reads frozen a, b, c modulo q. Computes a-bc. Freezes modulo q. Returns the result.


    #include "modq.h"

    int c;
    modq x;

    c = modq_nonzero_mask(x);

Reads frozen x modulo q. Returns -1 if x is nonzero modulo q, 0 otherwise.


    #include "modq.h"

    modq x;
    modq a, b, c;

    x = modq_plusproduct(a,b,c);

Reads frozen a, b, c modulo q. Computes a+bc. Freezes modulo q. Returns the result.


    #include "modq.h"

    modq x;
    modq a, b;

    x = modq_product(a,b);

Reads frozen a, b modulo q. Computes ab. Freezes modulo q. Returns the result.


    #include "modq.h"

    modq x;
    modq a, b;

    x = modq_quotient(a,b);

Same as x = modq_product(a,modq_reciprocal(b)). See modq_product and modq_reciprocal.


    #include "modq.h"

    modq x;
    modq a;

    x = modq_reciprocal(a);

Reads frozen a modulo q. Computes the reciprocal modulo q, or 0 if the input is 0 modulo q. Freezes modulo q. Returns the result.


    #include "modq.h"

    modq x;
    modq a;

    x = modq_square(a);

Same as x = modq_product(a,a). See modq_product.


    #include "modq.h"

    modq x;
    modq a, b;

    x = modq_sum(a,b);

Reads frozen a, b modulo q. Computes a+b. Freezes modulo q. Returns the result.


    #include "r3.h"

    small h[p_padded];
    const small f[p_padded];
    const small g[p_padded];

    r3_mult(h,f,g);

Computes a product in the ring of polynomials modulo 3 and modulo x761−x−1. The inputs are polynomials f[0]+f[1]x+... and g[0]+g[1]x+..., assumed to be frozen modulo 3. The output is a polynomial h[0]+h[1]x+..., frozen modulo 3.


    #include "r3.h"

    small h[p_padded];
    const small g[p_padded];

    r3_recip(h,g);

Computes a reciprocal in the ring of polynomials modulo 3 and modulo x761−x−1; and returns 0. The input is a polynomial g[0]+g[1]x+..., assumed to be frozen modulo 3. The output is a polynomial h[0]+h[1]x+..., frozen modulo 3.

If the input is not invertible, instead returns -1. The contents of h[0] etc. are then undefined.

Speed notes: A separate function to test invertibility would be faster. Namely, check divisibility by the factors of x761−x−1 modulo 3. These factors have degrees 19, 60, and 682, producing divisibility with probabilities below 1/230, 1/295, and 1/21080. Evidently degree 682 can safely be skipped, and skipping degree 60 might also be acceptable.

Testing invertibility before calling the reciprocal function would slightly increase the average time to generate one key, but would significantly reduce the maximum observed time. The following procedure would save much more time in any application bottlenecked by key generation: compute a batch of many invertible g inputs, using an invertibility test for each input; then use Montgomery's trick for batch reciprocals.


    #include "r3.h"

    int c;
    const small g[p_padded];

    c = r3_weightw_mask(g);

Returns 0 if the input has weight exactly w, otherwise -1. The input is a polynomial g[0]+g[1]x+..., assumed to be frozen modulo 3.


    #include "rq.h"

    modq f[p_padded];
    const unsigned char c[rq_encode_len];

    rq_decode(f,c);

See rq_encode.


    #include "rq.h"

    modq f[p_padded];
    const unsigned char c[rq_encoderounded_len];

    rq_decoderounded(f,c);

See rq_encoderounded.


    #include "rq.h"

    unsigned char c[rq_encode_len];
    const modq f[p_padded];

    rq_encode(c,f);

See "Encoding of public keys as strings" in the sntrup4591761 specification. rq_encode_len is 1218.


    #include "rq.h"

    unsigned char c[rq_encoderounded_len];
    const modq f[p_padded];

    rq_encoderounded(c,f);

See "Encoding of rounded ring elements as strings" in the sntrup4591761 specification. rq_encoderounded_len is 1015; the name rq_encoderounded_len is not always provided by the current software.


    #include "rq.h"

    modq h[p_padded];
    const unsigned char K[32];

    rq_fromseed(h,K);

See the definition of "Generator" for ntrulpr4591761.


    #include "rq.h"

    small g[p_padded];
    const modq h[p_padded];

    rq_mod3(g,h);

Handles each coefficient as follows: multiply by 3; freeze modulo q; freeze modulo 3. The input is a polynomial h[0]+h[1]x+..., assumed to be frozen modulo q. The output is a polynomial g[0]+g[1]x+..., frozen modulo 3.

Speed note: The specification of decapsulation multiplies the incoming c by 3f, freezes modulo q, and then freezes modulo 3. The software actually multiplies the incoming c by f, and then uses rq_mod3. The cost of multiplying by 3 in rq_mod3 is outweighed by the benefit of a smaller input to multiplication.


    #include "rq.h"

    modq h[p_padded];
    const modq f[p_padded];
    const small g[p_padded];

    rq_mult(h,f,g);

Computes a product in the ring of polynomials modulo q and modulo x761−x−1. The inputs are polynomials f[0]+f[1]x+... and g[0]+g[1]x+..., assumed to be frozen modulo q. Furthermore, g is assumed to have each coefficient -1, 0, or 1. The output is a polynomial h[0]+h[1]x+..., frozen modulo q.


    #include "rq.h"

    modq h[p_padded];
    const modq g[p_padded];

    rq_recip3(h,g);

Computes a reciprocal, scaled by a factor 3, in the ring of polynomials modulo q and modulo x761−x−1; and returns 0. The input is a polynomial g[0]+g[1]x+..., assumed to be frozen modulo q. The output is a polynomial h[0]+h[1]x+..., frozen modulo q. The reciprocal of the output is 3 times the input.

If the input is not invertible, instead returns -1. The contents of h[0] etc. are then undefined.

See r3_recip for speed notes. An analogous invertibility test for rq_recip3 is even easier: all of the ring elements that appear are nonzero by construction, and each nonzero ring element is invertible.


    #include "rq.h"

    unsigned char r[32];
    const unsigned char c[128];
    const modq ab[256];

    rq_rightsubbit(r,c,ab);

See the definition of "Right" for ntrulpr4591761, and the sign-bit step in NTRU LPRime decapsulation.

rq_rightsubbit first applies Right to c[0],...,c[127], obtaining 256 intermediate coefficients. It then subtracts the corresponding coefficients ab[0],...,ab[255], adds 4w+1 to each coefficient, freezes each coefficient modulo q, extracts the sign bit of each coefficient, and packs these bits in little-endian form into r[0],...,r[31].


    #include "rq.h"

    modq h[p_padded];
    const modq f[p_padded];

    rq_round3(h,f);

Rounds each polynomial coefficient to the nearest multiple of 3. The input is a polynomial f[0]+f[1]x+..., assumed to be frozen modulo q. The output is a polynomial h[0]+h[1]x+..., where each h[i] is f[i] rounded to the nearest multiple of 3. This polynomial is frozen modulo q since q is 1 modulo 6.


    #include "rq.h"

    unsigned char c[rq_encoderounded_len];
    const modq f[p_padded];

    rq_roundencode(c,f);

Composition of rq_round3 and rq_encoderounded. Speed note: These are merged to save time in avx.


    #include "rq.h"

    unsigned char c[128];
    const modq f[256];
    const unsigned char r[32];

    rq_top(c,f,r);

See the definition of "Top" for ntrulpr4591761; and the definition of Cj in NTRU LPRime encapsulation.

This function computes c[0],...,c[127] as Top applied to 256 input coefficients. The input coefficients are the 256 bits of r in little-endian form, times (q-1)/2, plus f[0],...,f[255].


    #include "small.h"

    small f[p_padded];
    const unsigned char c[small_encode_len];

    small_decode(f,c);

See small_encode.


    #include "small.h"

    unsigned char c[small_encode_len];
    const small f[p_padded];

    small_encode(c,f);

Encodes a frozen polynomial modulo 3 as follows. View the polynomial in little-endian form as 764 coefficients, where the last 3 coefficients are always 0. Add 1 to each coefficient, obtaining an element of {0,1,2}. Write each batch of 4 elements in radix 4, obtaining a byte. This produces 191 bytes c[0],...,c[190]; small_encode_len is 191.


    #include "small.h"

    small g[p_padded];

    small_random(g);

Generates a uniform random element of the ring of polynomials modulo 3 and modulo x761−x−1. The output is a polynomial g[0]+g[1]x+..., frozen modulo 3.

Actually, the distribution is not precisely uniform. Each coefficient (starting with the constant coefficient) is generated as follows: call small_random32(), extract the bottom 30 bits, multiply by 3, shift right by 30, and subtract 1.

A separate analysis shows that the divergence from uniformity is below 1.000002.


    #include "small.h"

    small f[p_padded];

    small_random_weightw(f);

Generates a uniform random weight-w element of the ring of polynomials modulo 3 and modulo x761−x−1. Weight w means that there are exactly w nonzero coefficients. The output is a polynomial f[0]+f[1]x+..., frozen modulo 3.

Actually, the distribution is not precisely uniform. This function works as follows. Generate 761 int32 coefficients (starting with the constant coefficient) using small_random32(), For each of the first w coefficients, clear the bottom bit, obtaining 0 or 2 modulo 4. For the remaining 761−w coefficients, set the bottom bit and clear the next bit, obtaining 1 modulo 4. Then sort all 761 coefficients. Finally, for each coefficient, extract the bottom 2 bits and subtract 1, obtaining -1 or 0 or 1.

A separate analysis shows that the divergence from uniformity is below 1.00027.


    #include "small.h"

    crypto_int32 i;

    i = small_random32();

Generates and returns a uniform random int32. More detailed specification for test vectors: obtains 4 bytes from randombytes, interpreted in little-endian form as a uint32, interpreted in twos-complement as an int32. Even more detailed specification for NIST KATs, which do not provide chunk-independence: randombytes is called in chunks of 3044 bytes.


    #include "small.h"

    small f[p_padded];
    const unsigned char k[32];

    small_seeded_weightw(f,k);

Same as small_random_weightw, except for two differences. First, the initial random coefficients are obtained as AES-256-CTR output using key k[0],...,k[31] rather than as randombytes output. Second, these coefficients are sorted as uint32 rather than as int32. Equivalently, the coefficients are sorted as int32 but first the top bit of each coefficient is flipped.

For ntrulpr4591761, small_random_weightw is actually implemented as small_seeded_weightw using k[0],...,k[31] obtained from randombytes.


    int c;
    int x, y;

    c = smaller_mask(x,y);

Returns -1 if x is smaller than y, 0 otherwise. Assumes that x-y does not overflow.


    #include "swap.h"

    unsigned char x[...];
    unsigned char y[...];
    int bytes;
    int c;

    swap(x,y,bytes,c);

Conditionally swaps x[0],...,x[bytes-1] with y[0],...,y[bytes-1]: swaps if c is -1, leaves in place if c is 0. Assumes that c is -1 or 0. The swap prototype uses void pointers so it can also be applied to other data types.


    small z[...];
    int len;
    const small x[...];
    const small y[...];
    const small c;

    vectormod3_minusproduct(z,len,x,y,c);

Sets z[0] to mod3_minusproduct(x[0],y[0],c); sets z[1] to mod3_minusproduct(x[1],y[1],c); ... sets z[len-1] to mod3_minusproduct(x[len-1],y[len-1],c). See mod3_minusproduct.


    small z[...];
    int len;
    const small x[...];
    const small c;

    vectormod3_product(z,len,x,c);

Sets z[0] to mod3_product(x[0],c); sets z[1] to mod3_product(x[1],c); ... sets z[len-1] to mod3_product(x[len-1],c). See mod3_product.


    small z[...];
    int len;

    vectormod3_shift(z,len);

Reads 0,z[0],z[1],...,z[len-2] and stores those in z[0],z[1],...,z[len-1]. Assumes that len is at least 1.


    modq z[...];
    int len;
    const modq x[...];
    const modq y[...];
    const modq c;

    vectormodq_minusproduct(z,len,x,y,c);

Sets z[0] to modq_minusproduct(x[0],y[0],c); sets z[1] to modq_minusproduct(x[1],y[1],c); ... sets z[len-1] to modq_minusproduct(x[len-1],y[len-1],c). See modq_minusproduct.


    modq z[...];
    int len;
    const modq x[...];
    const modq c;

    vectormodq_product(z,len,x,c);

Sets z[0] to modq_product(x[0],c); sets z[1] to modq_product(x[1],c); ... sets z[len-1] to modq_product(x[len-1],c). See modq_product.


    modq z[...];
    int len;

    vectormodq_shift(z,len);

Reads 0,z[0],z[1],...,z[len-2] and stores those in z[0],z[1],...,z[len-1]. Assumes that len is at least 1.


    int c;
    const unsigned char x[crypto_kem_CIPHERTEXTBYTES];
    const unsigned char y[crypto_kem_CIPHERTEXTBYTES];

    c = verify(x,y);

Returns 0 if x[0],...,x[CIPHERTEXTBYTES-1] are the same as y[0],...,y[CIPHERTEXTBYTES-1]. Otherwise returns -1. Security note: The standard memcmp function does not take constant time.


Version: This is version 2023.02.06 of the "Software" web page.