comparison tomsfastmath/tfm.tex @ 643:a362b62d38b2 dropbear-tfm

Add tomsfastmath from git rev bfa4582842bc3bab42e4be4aed5703437049502a with Makefile.in renamed
author Matt Johnston <matt@ucc.asn.au>
date Wed, 23 Nov 2011 18:10:20 +0700
parents
children
comparison
equal deleted inserted replaced
642:33fd2f3499d2 643:a362b62d38b2
1 \documentclass[b5paper]{book}
2 \usepackage{hyperref}
3 \usepackage{makeidx}
4 \usepackage{amssymb}
5 \usepackage{color}
6 \usepackage{alltt}
7 \usepackage{graphicx}
8 \usepackage{layout}
9 \def\union{\cup}
10 \def\intersect{\cap}
11 \def\getsrandom{\stackrel{\rm R}{\gets}}
12 \def\cross{\times}
13 \def\cat{\hspace{0.5em} \| \hspace{0.5em}}
14 \def\catn{$\|$}
15 \def\divides{\hspace{0.3em} | \hspace{0.3em}}
16 \def\nequiv{\not\equiv}
17 \def\approx{\raisebox{0.2ex}{\mbox{\small $\sim$}}}
18 \def\lcm{{\rm lcm}}
19 \def\gcd{{\rm gcd}}
20 \def\log{{\rm log}}
21 \def\ord{{\rm ord}}
22 \def\abs{{\mathit abs}}
23 \def\rep{{\mathit rep}}
24 \def\mod{{\mathit\ mod\ }}
25 \renewcommand{\pmod}[1]{\ ({\rm mod\ }{#1})}
26 \newcommand{\floor}[1]{\left\lfloor{#1}\right\rfloor}
27 \newcommand{\ceil}[1]{\left\lceil{#1}\right\rceil}
28 \def\Or{{\rm\ or\ }}
29 \def\And{{\rm\ and\ }}
30 \def\iff{\hspace{1em}\Longleftrightarrow\hspace{1em}}
31 \def\implies{\Rightarrow}
32 \def\undefined{{\rm ``undefined"}}
33 \def\Proof{\vspace{1ex}\noindent {\bf Proof:}\hspace{1em}}
34 \let\oldphi\phi
35 \def\phi{\varphi}
36 \def\Pr{{\rm Pr}}
37 \newcommand{\str}[1]{{\mathbf{#1}}}
38 \def\F{{\mathbb F}}
39 \def\N{{\mathbb N}}
40 \def\Z{{\mathbb Z}}
41 \def\R{{\mathbb R}}
42 \def\C{{\mathbb C}}
43 \def\Q{{\mathbb Q}}
44 \definecolor{DGray}{gray}{0.5}
45 \newcommand{\emailaddr}[1]{\mbox{$<${#1}$>$}}
46 \def\twiddle{\raisebox{0.3ex}{\mbox{\tiny $\sim$}}}
47 \def\gap{\vspace{0.5ex}}
48 \makeindex
49 \begin{document}
50 \frontmatter
51 \pagestyle{empty}
52 \title{TomsFastMath User Manual \\ v0.12}
53 \author{Tom St Denis \\ [email protected]}
54 \maketitle
55 This text and library are all hereby placed in the public domain. This book has been formatted for B5
56 [176x250] paper using the \LaTeX{} {\em book} macro package.
57
58 \vspace{13cm}
59
60 \begin{flushleft}This project was sponsored in part by
61
62 Secure Science Corporation \url{http://www.securescience.net}.
63 \end{flushleft}
64
65 \tableofcontents
66 \listoffigures
67 \mainmatter
68 \pagestyle{headings}
69 \chapter{Introduction}
70 \section{What is TomsFastMath?}
71
72 TomsFastMath is meant to be a very fast yet still fairly portable and easy to port large
73 integer arithmetic library written in ISO C. The goal specifically is to be able to perform
74 very fast modular exponentiations and other related functions required for ECC, DH and RSA
75 cryptosystems.
76
77 Most of the library is pure ISO C portable source code while a small portion (three files) contain
78 a mixture of ISO C and assembler inline fragments. Compared to LibTomMath this new library is
79 meant to be much faster while sacrificing flexibiltiy. This is accomplished through several means.
80
81 \begin{enumerate}
82 \item The new code is slightly messier and contains asm blocks.
83 \item This uses fixed not multiple precision integers.
84 \item It is designed only for fast modular exponentiations [e.g. less flexibility].
85 \end{enumerate}
86
87 To mitigate some of the problems that arise from using assembler it has been carefully and
88 appropriately used where it would make the most gain in performance. Also we use macro's
89 for assembler code which allows new ports to be inserted easily.
90
91 The new code uses fixed precision arithmetic which means at compile time you choose a maximum
92 precision and all numbers are limited to that. This has the benefit of not requiring any
93 memory heap operations (which are slow) in any of the functions. It has the downside that
94 integers that are too large are truncated.
95
96 The goal of this library is to be able to perform modular exponentiations (with an odd modulus) very
97 fast. This is what takes the most time in systems such as RSA and DH. This also requires
98 fast multiplication and squaring and has the side effect of speeding up ECC operations as well.
99
100 \section{License}
101 TomsFastMath is public domain.
102
103 \section{Building}
104 To build the library simply type ``make''. Or to install in typical *unix like directories use
105 ``make install''. Similarly a shared library can be built with ``make -f makefile.shared install''.
106
107 You can build the test program with ``make test''. To perform simple static testing (useful to
108 test out new assembly ports) use the stest program. Type ``make stest'' and run it on your
109 target. The program will perform three multiplications, squarings and montgomery reductions.
110 Likely if your assembly code is invalid this code will exhibit the bug.
111
112 \subsection{Intel CC}
113 In theory you should be able to build the library with
114
115 \begin{verbatim}
116 CFLAGS="-O3 -ip" CC=icc make IGNORE_SPEED=1
117 \end{verbatim}
118
119 However, Intels inline assembler is way less advanced than GCCs. As a result it doesn't compile.
120 Fortunately it doesn't really matter.
121
122 \subsection{MSVC}
123 The library doesn't build with MSVC. Imagine that.
124
125 \subsection{Build Limitations}
126 TomsFastMath has the following build requirements which are non--portable but under most
127 circumstances not problematic.
128
129 \begin{enumerate}
130 \item ``CHAR\_BIT'' must be eight.
131 \item The ``fp\_digit'' type must be a multiple of eight bits long.
132 \item The ``fp\_word'' must be at least twice the length of fp\_digit.
133 \end{enumerate}
134
135 \subsection{Optimization Configuration}
136 By default TFM is configured for 32--bit digits using ISO C source code. This mode while portable
137 is not very efficient. While building the library (from scratch) you can define one of
138 several ``CFLAGS'' defines.
139
140 For example, to build with with SSE2 optimizations type
141
142 \begin{verbatim}
143 CFLAGS=-DTFM_SSE2 make clean libtfm.a
144 \end{verbatim}
145
146 \subsubsection{x86--32} The ``x86--32'' mode is defined by ``TFM\_X86'' and covers all
147 i386 and beyond processors. It requires GCC to build and only works with 32--bit digits. In this
148 mode fp\_digit is 32--bits and fp\_word is 64--bits. This mode will be autodetected when building
149 with GCC to an ``i386'' target. You can override this behaviour by defining TFM\_NO\_ASM or
150 another optimization mode (such as SSE2).
151
152 \subsubsection{SSE2} The ``SSE2'' mode is defined by ``TFM\_SSE2'' and requires a Pentium 4, Pentium
153 M or Athlon64 processor. It requires GCC to build. Note that you shouldn't define both
154 TFM\_X86 and TFM\_SSE2 at the same time. This mode only works with 32--bit digits. In this
155 mode fp\_digit is 32--bits and fp\_word is 64--bits. While this mode will work on the AMD Athlon64
156 series of processors it is less efficient than the native ``x86--64'' mode and not recommended.
157
158 There is an additional ``TFM\_PRESCOTT'' flag that you can define for P4 Prescott processors. This causes
159 the mul/sqr functions to use x86\_32 and the montgomery reduction to use SSE2 which is (so far) the fastest
160 combination. If you are using an older (e.g. Northwood) generation P4 don't define this.
161
162 \subsubsection{x86--64} The ``x86--64'' mode is defined by ``TFM\_X86\_64'' and requires a
163 ``x86--64'' capable processor (Athlon64 and future Pentium processors). It requires GCC to
164 build and only works with 64--bit digits. Note that by enabling this mode it will automatically
165 enable 64--bit digits. In this mode fp\_digit is 64--bits and fp\_word is 128--bits. This mode will
166 be autodetected when building with GCC to an ``x86--64'' target. You can override this behaviour by defining
167 TFM\_NO\_ASM.
168
169 \subsubsection{ARM} The ``ARM'' mode is defined by ``TFM\_ARM'' and requires a ARMv4 with the M instructions (enhanced
170 multipliers) or higher processor. It requires GCC and works with 32--bit digits. In this mode fp\_digit is 32--bits and
171 fp\_word is 64--bits.
172
173 \subsubsection{PPC32} The ``PPC32'' mode is defined by ``TFM\_PPC32'' and requires a standard PPC processor. It doesn't
174 use altivec or other extensions so it should work on all compliant implementations of PPC. It requires GCC and works
175 with 32--bit digits. In this mode fp\_digit is 32--bits and fp\_word is 64--bits.
176
177 \subsubsection{PPC64} The ``PPC64'' mode is defined by ``TFM\_PPC64'' and requires a 64--bit PPC processor.
178
179 \subsubsection{AVR32} The ``AVR32'' mode is defined by ``TFM\_AVR32'' and requires an Atmel AVR32 processor.
180
181 \subsubsection{Future Releases} Future releases will support additional platform optimizations.
182 Developers of MIPS and SPARC platforms are encouraged to submit GCC asm inline patches
183 (see chapter \ref{chap:asmops} for more information).
184
185 \begin{figure}[here]
186 \begin{small}
187 \begin{center}
188 \begin{tabular}{|l|l|}
189 \hline \textbf{Processor} & \textbf{Recommended Mode} \\
190 \hline All 32--bit x86 platforms & TFM\_X86 \\
191 \hline Pentium 4 & TFM\_SSE2 \\
192 \hline Pentium 4 Prescott & TFM\_SSE2 + TFM\_PRESCOTT \\
193 \hline Athlon64 & TFM\_X86\_64 \\
194 \hline ARMv4 or higher with M & TFM\_ARM \\
195 \hline G3/G4 (32-bit PPC) & TFM\_PPC32 \\
196 \hline G5 (64-bit PPC) & TFM\_PPC64 \\
197 \hline Atmel AVR32 & TFM\_AVR32 \\
198 \hline &\\
199 \hline x86--32 or x86--64 (with GCC) & Leave blank and let autodetect work \\
200 \hline
201 \end{tabular}
202 \caption{Recommended Build Modes}
203 \end{center}
204 \end{small}
205 \end{figure}
206
207 \subsection{Build Configurations}
208 TomsFastMath is configurable in terms of which unrolled code (if any) is included. By default, the majority of the code is included which
209 results in large binaries. The first flag to try out is TFM\_ALREADY\_SET which tells TFM to turn off \textbf{all} unrolled code. This will
210 result in a smaller library but also a much slower library.
211
212 From this clean state, you can start enabling unrolled code for given cryptographic tasks at hand. A series of TFM\_MULXYZ and TFM\_SQRXYZ macros
213 exist to enable specific unrolled code. For instance, TFM\_MUL32 will enable a 32 digit unrolled multiplier. For a complete list see the tfm.h header
214 file. Keep in mind this is for digits not bits. For example, you should enable TFM\_MUL16 if you are doing 1024-bit exptmods on a 64--bit platform, enable
215 TFM\_MUL32 on 32--bit platforms.
216
217 To help developers use ECC there are a set of defines for the five NIST curve sizes. They are named TFM\_ECCXYZ where XYZ is one of 192, 224, 256, 384, or 521. These
218 enable the multipliers and squaring code for a given curve, autodetecting 64--bit platforms as well.
219
220 \subsection{Precision Configuration}
221 The precision of all integers in this library are fixed to a limited precision. Essentially
222 the rule of setting the precision is if you plan on doing modular exponentiation with $k$--bit
223 numbers than the precision must be fixed to $2k$--bits plus four digits.
224
225 This is changed by altering the value of ``FP\_MAX\_SIZE'' in tfm.h to your desired size. By default,
226 the library is configured to handle upto 2048--bit inputs to the modular exponentiator.
227
228 \chapter{Getting Started}
229 \section{Data Types}
230 TomsFastMath is a large fixed precision integer library. It provides the functionality to
231 manipulate large signed integers through a relatively trivial api and a single data type.
232
233 The ``fp\_int'' or fixed precision integer is the data type that the functions operate with.
234
235 \begin{verbatim}
236 typedef struct {
237 fp_digit dp[FP_SIZE];
238 int used,
239 sign;
240 } fp_int;
241 \end{verbatim}
242
243 The \textbf{dp} member is the array of digits that forms the number. It must always be zero
244 padded. The \textbf{used} member is the count of digits used in the array. Although the
245 precision is fixed the algorithms are still tuned to not process the entire array if it
246 does not have to. The \textbf{sign} indicates the sign of the integer. It is \textbf{FP\_ZPOS} (0)
247 if the integer is zero or positive and \textbf{FP\_NEG} (1) otherwise.
248
249 \section{Initialization}
250 \subsection{Simple Initialization}
251 To initialize an integer to the default state of zero use the fp\_init() function.
252
253 \index{fp\_init}
254 \begin{verbatim}
255 void fp_init(fp_int *a);
256 \end{verbatim}
257
258 This will initialize the fp\_int $a$ to zero. Note that the function fp\_zero() is an alias
259 for fp\_init().
260
261 \subsection{Initialize Small Constants}
262 To initialize an integer with a small single digit value use the fp\_set() function.
263
264 \index{fp\_set}
265 \begin{verbatim}
266 void fp_set(fp_int *a, fp_digit b);
267 \end{verbatim}
268
269 This will initialize $a$ and set it equal to the digit $b$.
270
271 \subsection{Initialize Copy}
272 To initialize an integer with a copy of another integer use the fp\_init\_copy() function.
273
274 \index{fp\_init\_copy}
275 \begin{verbatim}
276 void fp_init_copy(fp_int *a, fp_int *b)
277 \end{verbatim}
278
279 This will initialize $a$ as a copy of $b$. Note that for compatibility with LibTomMath the function
280 fp\_copy() is also provided.
281
282 \chapter{Arithmetic Operations}
283 \section{Odds and Evens}
284 To quickly and easily tell if an integer is zero, odd or even use the following functions.
285
286 \index{fp\_iszero} \index{fp\_iseven} \index{fp\_isodd}
287 \begin{verbatim}
288 int fp_iszero(fp_int *a);
289 int fp_iseven(fp_int *a);
290 int fp_isodd(fp_int *a);
291 \end{verbatim}
292
293 These will return \textbf{FP\_YES} if the answer to their respective questions is yes. Otherwise they
294 return \textbf{FP\_NO}. Note that these are implemented as macros and as such you should avoid using
295 ++ or --~-- operators on the input operand.
296
297 \section{Sign Manipulation}
298 To negate or compute the absolute of an integer use the following functions.
299
300 \index{fp\_neg} \index{fp\_abs}
301 \begin{verbatim}
302 void fp_neg(fp_int *a, fp_int *b);
303 void fp_abs(fp_int *a, fp_int *b);
304 \end{verbatim}
305 This will compute the negation (or absolute) of $a$ and store the result in $b$. Note that these
306 are implemented as macros and as such you should avoid using ++ or --~-- operators on the input
307 operand.
308
309 \section{Comparisons}
310 To perform signed or unsigned comparisons use following functions.
311
312 \index{fp\_cmp} \index{fp\_cmp\_mag}
313 \begin{verbatim}
314 int fp_cmp(fp_int *a, fp_int *b);
315 int fp_cmp_mag(fp_int *a, fp_int *b);
316 \end{verbatim}
317 These will compare $a$ to $b$. They will return \textbf{FP\_GT} if $a$ is larger than $b$,
318 \textbf{FP\_EQ} if they are equal and \textbf{FP\_LT} if $a$ is less than $b$.
319
320 The function fp\_cmp performs signed comparisons while the other performs unsigned comparisons.
321
322 \section{Shifting}
323 To shift the digits of an fp\_int left or right use the following functions.
324
325 \index{fp\_lshd} \index{fp\_rshd}
326 \begin{verbatim}
327 void fp_lshd(fp_int *a, int x);
328 void fp_rshd(fp_int *a, int x);
329 \end{verbatim}
330
331 These will shift the digits of $a$ left (or right respectively) $x$ digits.
332
333 To shift individual bits of an fp\_int use the following functions.
334
335 \index{fp\_div\_2d} \index{fp\_mod\_2d} \index{fp\_mul\_2d} \index{fp\_div\_2} \index{fp\_mul\_2}
336 \begin{verbatim}
337 void fp_div_2d(fp_int *a, int b, fp_int *c, fp_int *d);
338 void fp_mod_2d(fp_int *a, int b, fp_int *c);
339 void fp_mul_2d(fp_int *a, int b, fp_int *c);
340 void fp_mul_2(fp_int *a, fp_int *c);
341 void fp_div_2(fp_int *a, fp_int *c);
342 void fp_2expt(fp_int *a, int b);
343 \end{verbatim}
344 fp\_div\_2d() will divide $a$ by $2^b$ and store the quotient in $c$ and remainder in $d$. Either of
345 $c$ or $d$ can be \textbf{NULL} if their value is not required. fp\_mod\_2d() is a shortcut to
346 compute the remainder directly. fp\_mul\_2d() will multiply $a$ by $2^b$ and store the result in $c$.
347
348 The fp\_mul\_2() and fp\_div\_2() functions are optimized multiplication and divisions by two. The
349 function fp\_2expt() will compute $a = 2^b$ quickly.
350
351 To quickly count the number of least significant bits that are zero use the following function.
352
353 \index{fp\_cnt\_lsb}
354 \begin{verbatim}
355 int fp_cnt_lsb(fp_int *a);
356 \end{verbatim}
357 This will return the number of adjacent least significant bits that are zero. This is equivalent
358 to the number of times two evenly divides $a$.
359
360 \section{Basic Algebra}
361
362 The following functions round out the basic algebraic functionality of the library.
363
364 \index{fp\_add} \index{fp\_sub} \index{fp\_mul} \index{fp\_sqr} \index{fp\_div} \index{fp\_mod}
365 \begin{verbatim}
366 void fp_add(fp_int *a, fp_int *b, fp_int *c);
367 void fp_sub(fp_int *a, fp_int *b, fp_int *c);
368 void fp_mul(fp_int *a, fp_int *b, fp_int *c);
369 void fp_sqr(fp_int *a, fp_int *b);
370 int fp_div(fp_int *a, fp_int *b, fp_int *c, fp_int *d);
371 int fp_mod(fp_int *a, fp_int *b, fp_int *c);
372 \end{verbatim}
373
374 The functions fp\_add(), fp\_sub() and fp\_mul() perform their respective operations on $a$ and
375 $b$ and store the result in $c$. The function fp\_sqr() computes $b = a^2$ and is faster than
376 using fp\_mul() to perform the same operation.
377
378 The function fp\_div() divides $a$ by $b$ and stores the quotient in $c$ and remainder in $d$. Either
379 of $c$ and $d$ can be \textbf{NULL} if the result is not required. The function fp\_mod() is a simple
380 shortcut to find the remainder.
381
382 \section{Modular Exponentiation}
383 To compute a modular exponentiation use the following function.
384
385 \index{fp\_exptmod}
386 \begin{verbatim}
387 int fp_exptmod(fp_int *a, fp_int *b, fp_int *c, fp_int *d);
388 \end{verbatim}
389 This computes $d \equiv a^b \mbox{ (mod }c\mbox{)}$ for any odd $c$ and $b$. $b$ may be negative so long as
390 $a^{-1} \mbox{ (mod }c\mbox{)}$ exists. The initial value of $a$ may be larger than $c$. The size of $c$ must be
391 half of the maximum precision used during the build of the library. For example, by default $c$ must be less
392 than $2^{2048}$.
393
394 \section{Number Theoretic}
395
396 To perform modular inverses, greatest common divisor or least common multiples use the following
397 functions.
398
399 \index{fp\_invmod} \index{fp\_gcd} \index{fp\_lcm}
400 \begin{verbatim}
401 int fp_invmod(fp_int *a, fp_int *b, fp_int *c);
402 void fp_gcd(fp_int *a, fp_int *b, fp_int *c);
403 void fp_lcm(fp_int *a, fp_int *b, fp_int *c);
404 \end{verbatim}
405
406 The fp\_invmod() function will find the modular inverse of $a$ modulo an odd modulus $b$ and store
407 it in $c$ (provided it exists). The function fp\_gcd() will compute the greatest common
408 divisor of $a$ and $b$ and store it in $c$. Similarly the fp\_lcm() function will compute
409 the least common multiple of $a$ and $b$ and store it in $c$.
410
411 \section{Prime Numbers}
412 To quickly test a number for primality call this function.
413
414 \index{fp\_isprime}
415 \begin{verbatim}
416 int fp_isprime(fp_int *a);
417 \end{verbatim}
418 This will return \textbf{FP\_YES} if $a$ is probably prime. It uses 256 trial divisions and
419 eight rounds of Rabin-Miller testing. Note that this routine performs modular exponentiations
420 which means that $a$ must be in a valid range of precision.
421
422 \chapter{Porting TomsFastMath}
423 \label{chap:asmops}
424 \section{Getting Started}
425 Porting TomsFastMath to a given processor target is usually a simple procedure. For the most part
426 assembly is used to get around the lack of a ``add with carry'' operation in the C language. To
427 make matters simpler the use of assembler is through macro blocks.
428
429 Each ``port'' is defined by a block of code that re-defines the portable ISO C macros with assembler
430 inline blocks. To add a new port you must designate a TFM\_XXX define that will enable your
431 port when built.
432
433 \section{Multiply with Comba}
434 The file ``fp\_mul\_comba.c'' is responsible for providing the fast multiplication within the
435 library. This comba multiplication is fairly simple. It uses a sliding three digit carry
436 system with the variables $c0$, $c1$, $c2$. For every digit of output $c0$ is the what will
437 be that digit, $c1$ will carry into the next digit and $c2$ will be the ``c1'' carry for
438 the next digit. For every ``next'' digit effectively $c0$ is stored as output, $c1$ moves into
439 $c0$, $c2$ into $c1$ and zero into $c2$.
440
441 The following macros define the assmebler interface to the code.
442
443 \begin{verbatim}
444 #define COMBA_START
445 \end{verbatim}
446
447 This is issued at the beginning of the multiplication function. This is in place to allow you to
448 initialize any registers or machine words required. You can leave it blank if you do not need
449 it.
450
451 \begin{verbatim}
452 #define COMBA_CLEAR \
453 c0 = c1 = c2 = 0;
454 \end{verbatim}
455
456 This clears the three comba carries. If you are going to place carries in registers then
457 zero the appropriate registers. Note that the functions do not use $c0$, $c1$ or $c2$ directly
458 so you are free to ignore these varibles and use registers directly.
459
460 \begin{verbatim}
461 #define COMBA_FORWARD \
462 c0 = c1; c1 = c2; c2 = 0;
463 \end{verbatim}
464
465 This propagates the carries after a digit has been produced.
466
467 \begin{verbatim}
468 #define COMBA_STORE(x) \
469 x = c0;
470 \end{verbatim}
471
472 This stores the $c0$ digit in the memory location specified by $x$. Note that if you manually
473 aliased $c0$ with a register than just store that register in $x$.
474
475 \begin{verbatim}
476 #define COMBA_STORE2(x) \
477 x = c1;
478 \end{verbatim}
479
480 This stores the $c1$ digit in the memory location specified by $x$. Note that if you manually
481 aliased $c1$ with a register than just store that register in $x$.
482
483 \begin{verbatim}
484 #define COMBA_FINI
485 \end{verbatim}
486
487 If at the end of the function you need to perform some action fill this macro in.
488
489 \begin{verbatim}
490 #define MULADD(i, j) \
491 t = ((fp_word)i) * ((fp_word)j); \
492 c0 = (c0 + t); if (c0 < ((fp_digit)t)) ++c1; \
493 c1 = (c1 + (t>>DIGIT_BIT)); if (c1 < (t>>DIGIT_BIT)) ++c2;
494 \end{verbatim}
495
496 This macro performs the ``multiply and add'' step that is central to the comba
497 multiplier. It multiplies the fp\_digits $i$ and $j$ to produce a fp\_word result. Effectively
498 the double--digit value is added to the three-digit carry formed by $c0$, $c1$, $c2$ where $c0$
499 is the least significant digit.
500
501 \section{Squaring with Comba}
502 Squaring is similar to multiplication except that it uses a special ``multiply and add twice'' macro
503 that replaces multiplications that are not required.
504
505 \begin{verbatim}
506 #define COMBA_START
507 \end{verbatim}
508
509 This allows for any initialization code you might have.
510
511 \begin{verbatim}
512 #define CLEAR_CARRY \
513 c0 = c1 = c2 = 0;
514 \end{verbatim}
515
516 This will clear the carries. Like multiplication you can safely alias the three carry variables
517 to registers if you can/want to.
518
519 \begin{verbatim}
520 #define COMBA_STORE(x) \
521 x = c0;
522 \end{verbatim}
523
524 Store the $c0$ carry to a given memory location.
525
526 \begin{verbatim}
527 #define COMBA_STORE2(x) \
528 x = c1;
529 \end{verbatim}
530
531 Store the $c1$ carry to a given memory location.
532
533 \begin{verbatim}
534 #define CARRY_FORWARD \
535 c0 = c1; c1 = c2; c2 = 0;
536 \end{verbatim}
537
538 Forward propagate all three carry variables.
539
540 \begin{verbatim}
541 #define COMBA_FINI
542 \end{verbatim}
543
544 If you need to clean up at the end of the function.
545
546 \begin{verbatim}
547 /* multiplies point i and j, updates carry "c1" and digit c2 */
548 #define SQRADD(i, j) \
549 t = ((fp_word)i) * ((fp_word)j); \
550 c0 = (c0 + t); if (c0 < ((fp_digit)t)) ++c1; \
551 c1 = (c1 + (t>>DIGIT_BIT)); if (c1 < (t>>DIGIT_BIT)) ++c2;
552 \end{verbatim}
553
554 This is essentially the MULADD macro from the multiplication code.
555
556 \begin{verbatim}
557 /* for squaring some of the terms are doubled... */
558 #define SQRADD2(i, j) \
559 t = ((fp_word)i) * ((fp_word)j); \
560 c0 = (c0 + t); if (c0 < ((fp_digit)t)) ++c1; \
561 c1 = (c1 + (t>>DIGIT_BIT)); if (c1 < (t>>DIGIT_BIT)) ++c2; \
562 c0 = (c0 + t); if (c0 < ((fp_digit)t)) ++c1; \
563 c1 = (c1 + (t>>DIGIT_BIT)); if (c1 < (t>>DIGIT_BIT)) ++c2;
564 \end{verbatim}
565
566 This is like SQRADD except it adds the produce twice. It's similar to
567 computing SQRADD(i, j*2).
568
569 To further make things interesting the squaring code also has ``doubles'' (see my LTM book chapter five...) which are
570 handled with these macros.
571
572 \begin{verbatim}
573 #define SQRADDSC(i, j) \
574 do { fp_word t; \
575 t = ((fp_word)i) * ((fp_word)j); \
576 sc0 = (fp_digit)t; sc1 = (t >> DIGIT_BIT); sc2 = 0; \
577 } while (0);
578 \end{verbatim}
579 This computes a product and stores it in the ``secondary'' carry registers $\left < sc0, sc1, sc2 \right >$.
580
581 \begin{verbatim}
582 #define SQRADDAC(i, j) \
583 do { fp_word t; \
584 t = sc0 + ((fp_word)i) * ((fp_word)j); sc0 = t; \
585 t = sc1 + (t >> DIGIT_BIT); sc1 = t; sc2 += t >> DIGIT_BIT; \
586 } while (0);
587 \end{verbatim}
588 This computes a product and adds it to the ``secondary'' carry registers.
589
590 \begin{verbatim}
591 #define SQRADDDB \
592 do { fp_word t; \
593 t = ((fp_word)sc0) + ((fp_word)sc0) + c0; c0 = t; \
594 t = ((fp_word)sc1) + ((fp_word)sc1) + c1 + (t >> DIGIT_BIT); c1 = t; \
595 c2 = c2 + ((fp_word)sc2) + ((fp_word)sc2) + (t >> DIGIT_BIT); \
596 } while (0);
597 \end{verbatim}
598 This doubles the ``secondary'' carry registers and adds the sum to the main carry registers. Really complicated.
599
600 \section{Montgomery with Comba}
601 Montgomery reduction is used in modular exponentiation and is most called function during
602 that operation. It's important to make sure this routine is very fast or all is lost.
603
604 Unlike the two other comba routines this one does not use a single three--digit carry
605 system. It does have three--digit carries except that the routine steps through them
606 in the inner loop. This means you cannot alias them to registers (at all).
607
608 To make matters simple though the three arrays of carries are stored in one array. The
609 ``c0'' array resides in $c[0 \ldots OFF1-1]$, ``c1'' in $c[OFF1 \ldots OFF2-1]$ and ``c2'' in
610 $c[OFF2 \ldots OFF2+FP\_SIZE-1]$.
611
612 \begin{verbatim}
613 #define MONT_START
614 \end{verbatim}
615
616 This allows you to insert anything at the start that you need.
617
618 \begin{verbatim}
619 #define MONT_FINI
620 \end{verbatim}
621
622 This allows you to insert anything at the end that you need.
623
624 \begin{verbatim}
625 #define LOOP_START \
626 mu = c[x] * mp;
627 \end{verbatim}
628
629 This computes the $\mu$ value for the inner loop. You can safely alias $mu$ and $mp$ to
630 a register if you want.
631
632 \begin{verbatim}
633 #define INNERMUL \
634 do { fp_word t; \
635 _c[0] = t = ((fp_word)_c[0] + (fp_word)cy) + \
636 (((fp_word)mu) * ((fp_word)*tmpm++)); \
637 cy = (t >> DIGIT_BIT); \
638 } while (0)
639 \end{verbatim}
640
641 This computes the inner product and adds it to the destination and carry variable $cy$.
642 This uses the $mu$ value computed above (can be in a register already) and the
643 $cy$ which is a chaining carry. Inside the INNERMUL loop the $cy$ value can be kept
644 inside a register (hint: it always starts as $cy = 0$ in the first iteration).
645
646 Upon completion of the inner loop the macro LOOP\_END is called which is used to fetch
647 $cy$ into the variable the C program can see. This is where, if you cached $cy$ in a
648 register you would copy it to the locally accessible C variable.
649
650 \begin{verbatim}
651 #define PROPCARRY \
652 do { fp_digit t = _c[0] += cy; cy = (t < cy); } while (0)
653 \end{verbatim}
654
655 This propagates the carry upwards by one digit.
656
657 \input{tfm.ind}
658
659 \end{document}