Mercurial > dropbear
comparison tomsfastmath/tfm.tex @ 643:a362b62d38b2 dropbear-tfm
Add tomsfastmath from git rev bfa4582842bc3bab42e4be4aed5703437049502a
with Makefile.in renamed
author | Matt Johnston <matt@ucc.asn.au> |
---|---|
date | Wed, 23 Nov 2011 18:10:20 +0700 |
parents | |
children |
comparison
equal
deleted
inserted
replaced
642:33fd2f3499d2 | 643:a362b62d38b2 |
---|---|
1 \documentclass[b5paper]{book} | |
2 \usepackage{hyperref} | |
3 \usepackage{makeidx} | |
4 \usepackage{amssymb} | |
5 \usepackage{color} | |
6 \usepackage{alltt} | |
7 \usepackage{graphicx} | |
8 \usepackage{layout} | |
9 \def\union{\cup} | |
10 \def\intersect{\cap} | |
11 \def\getsrandom{\stackrel{\rm R}{\gets}} | |
12 \def\cross{\times} | |
13 \def\cat{\hspace{0.5em} \| \hspace{0.5em}} | |
14 \def\catn{$\|$} | |
15 \def\divides{\hspace{0.3em} | \hspace{0.3em}} | |
16 \def\nequiv{\not\equiv} | |
17 \def\approx{\raisebox{0.2ex}{\mbox{\small $\sim$}}} | |
18 \def\lcm{{\rm lcm}} | |
19 \def\gcd{{\rm gcd}} | |
20 \def\log{{\rm log}} | |
21 \def\ord{{\rm ord}} | |
22 \def\abs{{\mathit abs}} | |
23 \def\rep{{\mathit rep}} | |
24 \def\mod{{\mathit\ mod\ }} | |
25 \renewcommand{\pmod}[1]{\ ({\rm mod\ }{#1})} | |
26 \newcommand{\floor}[1]{\left\lfloor{#1}\right\rfloor} | |
27 \newcommand{\ceil}[1]{\left\lceil{#1}\right\rceil} | |
28 \def\Or{{\rm\ or\ }} | |
29 \def\And{{\rm\ and\ }} | |
30 \def\iff{\hspace{1em}\Longleftrightarrow\hspace{1em}} | |
31 \def\implies{\Rightarrow} | |
32 \def\undefined{{\rm ``undefined"}} | |
33 \def\Proof{\vspace{1ex}\noindent {\bf Proof:}\hspace{1em}} | |
34 \let\oldphi\phi | |
35 \def\phi{\varphi} | |
36 \def\Pr{{\rm Pr}} | |
37 \newcommand{\str}[1]{{\mathbf{#1}}} | |
38 \def\F{{\mathbb F}} | |
39 \def\N{{\mathbb N}} | |
40 \def\Z{{\mathbb Z}} | |
41 \def\R{{\mathbb R}} | |
42 \def\C{{\mathbb C}} | |
43 \def\Q{{\mathbb Q}} | |
44 \definecolor{DGray}{gray}{0.5} | |
45 \newcommand{\emailaddr}[1]{\mbox{$<${#1}$>$}} | |
46 \def\twiddle{\raisebox{0.3ex}{\mbox{\tiny $\sim$}}} | |
47 \def\gap{\vspace{0.5ex}} | |
48 \makeindex | |
49 \begin{document} | |
50 \frontmatter | |
51 \pagestyle{empty} | |
52 \title{TomsFastMath User Manual \\ v0.12} | |
53 \author{Tom St Denis \\ [email protected]} | |
54 \maketitle | |
55 This text and library are all hereby placed in the public domain. This book has been formatted for B5 | |
56 [176x250] paper using the \LaTeX{} {\em book} macro package. | |
57 | |
58 \vspace{13cm} | |
59 | |
60 \begin{flushleft}This project was sponsored in part by | |
61 | |
62 Secure Science Corporation \url{http://www.securescience.net}. | |
63 \end{flushleft} | |
64 | |
65 \tableofcontents | |
66 \listoffigures | |
67 \mainmatter | |
68 \pagestyle{headings} | |
69 \chapter{Introduction} | |
70 \section{What is TomsFastMath?} | |
71 | |
72 TomsFastMath is meant to be a very fast yet still fairly portable and easy to port large | |
73 integer arithmetic library written in ISO C. The goal specifically is to be able to perform | |
74 very fast modular exponentiations and other related functions required for ECC, DH and RSA | |
75 cryptosystems. | |
76 | |
77 Most of the library is pure ISO C portable source code while a small portion (three files) contain | |
78 a mixture of ISO C and assembler inline fragments. Compared to LibTomMath this new library is | |
79 meant to be much faster while sacrificing flexibiltiy. This is accomplished through several means. | |
80 | |
81 \begin{enumerate} | |
82 \item The new code is slightly messier and contains asm blocks. | |
83 \item This uses fixed not multiple precision integers. | |
84 \item It is designed only for fast modular exponentiations [e.g. less flexibility]. | |
85 \end{enumerate} | |
86 | |
87 To mitigate some of the problems that arise from using assembler it has been carefully and | |
88 appropriately used where it would make the most gain in performance. Also we use macro's | |
89 for assembler code which allows new ports to be inserted easily. | |
90 | |
91 The new code uses fixed precision arithmetic which means at compile time you choose a maximum | |
92 precision and all numbers are limited to that. This has the benefit of not requiring any | |
93 memory heap operations (which are slow) in any of the functions. It has the downside that | |
94 integers that are too large are truncated. | |
95 | |
96 The goal of this library is to be able to perform modular exponentiations (with an odd modulus) very | |
97 fast. This is what takes the most time in systems such as RSA and DH. This also requires | |
98 fast multiplication and squaring and has the side effect of speeding up ECC operations as well. | |
99 | |
100 \section{License} | |
101 TomsFastMath is public domain. | |
102 | |
103 \section{Building} | |
104 To build the library simply type ``make''. Or to install in typical *unix like directories use | |
105 ``make install''. Similarly a shared library can be built with ``make -f makefile.shared install''. | |
106 | |
107 You can build the test program with ``make test''. To perform simple static testing (useful to | |
108 test out new assembly ports) use the stest program. Type ``make stest'' and run it on your | |
109 target. The program will perform three multiplications, squarings and montgomery reductions. | |
110 Likely if your assembly code is invalid this code will exhibit the bug. | |
111 | |
112 \subsection{Intel CC} | |
113 In theory you should be able to build the library with | |
114 | |
115 \begin{verbatim} | |
116 CFLAGS="-O3 -ip" CC=icc make IGNORE_SPEED=1 | |
117 \end{verbatim} | |
118 | |
119 However, Intels inline assembler is way less advanced than GCCs. As a result it doesn't compile. | |
120 Fortunately it doesn't really matter. | |
121 | |
122 \subsection{MSVC} | |
123 The library doesn't build with MSVC. Imagine that. | |
124 | |
125 \subsection{Build Limitations} | |
126 TomsFastMath has the following build requirements which are non--portable but under most | |
127 circumstances not problematic. | |
128 | |
129 \begin{enumerate} | |
130 \item ``CHAR\_BIT'' must be eight. | |
131 \item The ``fp\_digit'' type must be a multiple of eight bits long. | |
132 \item The ``fp\_word'' must be at least twice the length of fp\_digit. | |
133 \end{enumerate} | |
134 | |
135 \subsection{Optimization Configuration} | |
136 By default TFM is configured for 32--bit digits using ISO C source code. This mode while portable | |
137 is not very efficient. While building the library (from scratch) you can define one of | |
138 several ``CFLAGS'' defines. | |
139 | |
140 For example, to build with with SSE2 optimizations type | |
141 | |
142 \begin{verbatim} | |
143 CFLAGS=-DTFM_SSE2 make clean libtfm.a | |
144 \end{verbatim} | |
145 | |
146 \subsubsection{x86--32} The ``x86--32'' mode is defined by ``TFM\_X86'' and covers all | |
147 i386 and beyond processors. It requires GCC to build and only works with 32--bit digits. In this | |
148 mode fp\_digit is 32--bits and fp\_word is 64--bits. This mode will be autodetected when building | |
149 with GCC to an ``i386'' target. You can override this behaviour by defining TFM\_NO\_ASM or | |
150 another optimization mode (such as SSE2). | |
151 | |
152 \subsubsection{SSE2} The ``SSE2'' mode is defined by ``TFM\_SSE2'' and requires a Pentium 4, Pentium | |
153 M or Athlon64 processor. It requires GCC to build. Note that you shouldn't define both | |
154 TFM\_X86 and TFM\_SSE2 at the same time. This mode only works with 32--bit digits. In this | |
155 mode fp\_digit is 32--bits and fp\_word is 64--bits. While this mode will work on the AMD Athlon64 | |
156 series of processors it is less efficient than the native ``x86--64'' mode and not recommended. | |
157 | |
158 There is an additional ``TFM\_PRESCOTT'' flag that you can define for P4 Prescott processors. This causes | |
159 the mul/sqr functions to use x86\_32 and the montgomery reduction to use SSE2 which is (so far) the fastest | |
160 combination. If you are using an older (e.g. Northwood) generation P4 don't define this. | |
161 | |
162 \subsubsection{x86--64} The ``x86--64'' mode is defined by ``TFM\_X86\_64'' and requires a | |
163 ``x86--64'' capable processor (Athlon64 and future Pentium processors). It requires GCC to | |
164 build and only works with 64--bit digits. Note that by enabling this mode it will automatically | |
165 enable 64--bit digits. In this mode fp\_digit is 64--bits and fp\_word is 128--bits. This mode will | |
166 be autodetected when building with GCC to an ``x86--64'' target. You can override this behaviour by defining | |
167 TFM\_NO\_ASM. | |
168 | |
169 \subsubsection{ARM} The ``ARM'' mode is defined by ``TFM\_ARM'' and requires a ARMv4 with the M instructions (enhanced | |
170 multipliers) or higher processor. It requires GCC and works with 32--bit digits. In this mode fp\_digit is 32--bits and | |
171 fp\_word is 64--bits. | |
172 | |
173 \subsubsection{PPC32} The ``PPC32'' mode is defined by ``TFM\_PPC32'' and requires a standard PPC processor. It doesn't | |
174 use altivec or other extensions so it should work on all compliant implementations of PPC. It requires GCC and works | |
175 with 32--bit digits. In this mode fp\_digit is 32--bits and fp\_word is 64--bits. | |
176 | |
177 \subsubsection{PPC64} The ``PPC64'' mode is defined by ``TFM\_PPC64'' and requires a 64--bit PPC processor. | |
178 | |
179 \subsubsection{AVR32} The ``AVR32'' mode is defined by ``TFM\_AVR32'' and requires an Atmel AVR32 processor. | |
180 | |
181 \subsubsection{Future Releases} Future releases will support additional platform optimizations. | |
182 Developers of MIPS and SPARC platforms are encouraged to submit GCC asm inline patches | |
183 (see chapter \ref{chap:asmops} for more information). | |
184 | |
185 \begin{figure}[here] | |
186 \begin{small} | |
187 \begin{center} | |
188 \begin{tabular}{|l|l|} | |
189 \hline \textbf{Processor} & \textbf{Recommended Mode} \\ | |
190 \hline All 32--bit x86 platforms & TFM\_X86 \\ | |
191 \hline Pentium 4 & TFM\_SSE2 \\ | |
192 \hline Pentium 4 Prescott & TFM\_SSE2 + TFM\_PRESCOTT \\ | |
193 \hline Athlon64 & TFM\_X86\_64 \\ | |
194 \hline ARMv4 or higher with M & TFM\_ARM \\ | |
195 \hline G3/G4 (32-bit PPC) & TFM\_PPC32 \\ | |
196 \hline G5 (64-bit PPC) & TFM\_PPC64 \\ | |
197 \hline Atmel AVR32 & TFM\_AVR32 \\ | |
198 \hline &\\ | |
199 \hline x86--32 or x86--64 (with GCC) & Leave blank and let autodetect work \\ | |
200 \hline | |
201 \end{tabular} | |
202 \caption{Recommended Build Modes} | |
203 \end{center} | |
204 \end{small} | |
205 \end{figure} | |
206 | |
207 \subsection{Build Configurations} | |
208 TomsFastMath is configurable in terms of which unrolled code (if any) is included. By default, the majority of the code is included which | |
209 results in large binaries. The first flag to try out is TFM\_ALREADY\_SET which tells TFM to turn off \textbf{all} unrolled code. This will | |
210 result in a smaller library but also a much slower library. | |
211 | |
212 From this clean state, you can start enabling unrolled code for given cryptographic tasks at hand. A series of TFM\_MULXYZ and TFM\_SQRXYZ macros | |
213 exist to enable specific unrolled code. For instance, TFM\_MUL32 will enable a 32 digit unrolled multiplier. For a complete list see the tfm.h header | |
214 file. Keep in mind this is for digits not bits. For example, you should enable TFM\_MUL16 if you are doing 1024-bit exptmods on a 64--bit platform, enable | |
215 TFM\_MUL32 on 32--bit platforms. | |
216 | |
217 To help developers use ECC there are a set of defines for the five NIST curve sizes. They are named TFM\_ECCXYZ where XYZ is one of 192, 224, 256, 384, or 521. These | |
218 enable the multipliers and squaring code for a given curve, autodetecting 64--bit platforms as well. | |
219 | |
220 \subsection{Precision Configuration} | |
221 The precision of all integers in this library are fixed to a limited precision. Essentially | |
222 the rule of setting the precision is if you plan on doing modular exponentiation with $k$--bit | |
223 numbers than the precision must be fixed to $2k$--bits plus four digits. | |
224 | |
225 This is changed by altering the value of ``FP\_MAX\_SIZE'' in tfm.h to your desired size. By default, | |
226 the library is configured to handle upto 2048--bit inputs to the modular exponentiator. | |
227 | |
228 \chapter{Getting Started} | |
229 \section{Data Types} | |
230 TomsFastMath is a large fixed precision integer library. It provides the functionality to | |
231 manipulate large signed integers through a relatively trivial api and a single data type. | |
232 | |
233 The ``fp\_int'' or fixed precision integer is the data type that the functions operate with. | |
234 | |
235 \begin{verbatim} | |
236 typedef struct { | |
237 fp_digit dp[FP_SIZE]; | |
238 int used, | |
239 sign; | |
240 } fp_int; | |
241 \end{verbatim} | |
242 | |
243 The \textbf{dp} member is the array of digits that forms the number. It must always be zero | |
244 padded. The \textbf{used} member is the count of digits used in the array. Although the | |
245 precision is fixed the algorithms are still tuned to not process the entire array if it | |
246 does not have to. The \textbf{sign} indicates the sign of the integer. It is \textbf{FP\_ZPOS} (0) | |
247 if the integer is zero or positive and \textbf{FP\_NEG} (1) otherwise. | |
248 | |
249 \section{Initialization} | |
250 \subsection{Simple Initialization} | |
251 To initialize an integer to the default state of zero use the fp\_init() function. | |
252 | |
253 \index{fp\_init} | |
254 \begin{verbatim} | |
255 void fp_init(fp_int *a); | |
256 \end{verbatim} | |
257 | |
258 This will initialize the fp\_int $a$ to zero. Note that the function fp\_zero() is an alias | |
259 for fp\_init(). | |
260 | |
261 \subsection{Initialize Small Constants} | |
262 To initialize an integer with a small single digit value use the fp\_set() function. | |
263 | |
264 \index{fp\_set} | |
265 \begin{verbatim} | |
266 void fp_set(fp_int *a, fp_digit b); | |
267 \end{verbatim} | |
268 | |
269 This will initialize $a$ and set it equal to the digit $b$. | |
270 | |
271 \subsection{Initialize Copy} | |
272 To initialize an integer with a copy of another integer use the fp\_init\_copy() function. | |
273 | |
274 \index{fp\_init\_copy} | |
275 \begin{verbatim} | |
276 void fp_init_copy(fp_int *a, fp_int *b) | |
277 \end{verbatim} | |
278 | |
279 This will initialize $a$ as a copy of $b$. Note that for compatibility with LibTomMath the function | |
280 fp\_copy() is also provided. | |
281 | |
282 \chapter{Arithmetic Operations} | |
283 \section{Odds and Evens} | |
284 To quickly and easily tell if an integer is zero, odd or even use the following functions. | |
285 | |
286 \index{fp\_iszero} \index{fp\_iseven} \index{fp\_isodd} | |
287 \begin{verbatim} | |
288 int fp_iszero(fp_int *a); | |
289 int fp_iseven(fp_int *a); | |
290 int fp_isodd(fp_int *a); | |
291 \end{verbatim} | |
292 | |
293 These will return \textbf{FP\_YES} if the answer to their respective questions is yes. Otherwise they | |
294 return \textbf{FP\_NO}. Note that these are implemented as macros and as such you should avoid using | |
295 ++ or --~-- operators on the input operand. | |
296 | |
297 \section{Sign Manipulation} | |
298 To negate or compute the absolute of an integer use the following functions. | |
299 | |
300 \index{fp\_neg} \index{fp\_abs} | |
301 \begin{verbatim} | |
302 void fp_neg(fp_int *a, fp_int *b); | |
303 void fp_abs(fp_int *a, fp_int *b); | |
304 \end{verbatim} | |
305 This will compute the negation (or absolute) of $a$ and store the result in $b$. Note that these | |
306 are implemented as macros and as such you should avoid using ++ or --~-- operators on the input | |
307 operand. | |
308 | |
309 \section{Comparisons} | |
310 To perform signed or unsigned comparisons use following functions. | |
311 | |
312 \index{fp\_cmp} \index{fp\_cmp\_mag} | |
313 \begin{verbatim} | |
314 int fp_cmp(fp_int *a, fp_int *b); | |
315 int fp_cmp_mag(fp_int *a, fp_int *b); | |
316 \end{verbatim} | |
317 These will compare $a$ to $b$. They will return \textbf{FP\_GT} if $a$ is larger than $b$, | |
318 \textbf{FP\_EQ} if they are equal and \textbf{FP\_LT} if $a$ is less than $b$. | |
319 | |
320 The function fp\_cmp performs signed comparisons while the other performs unsigned comparisons. | |
321 | |
322 \section{Shifting} | |
323 To shift the digits of an fp\_int left or right use the following functions. | |
324 | |
325 \index{fp\_lshd} \index{fp\_rshd} | |
326 \begin{verbatim} | |
327 void fp_lshd(fp_int *a, int x); | |
328 void fp_rshd(fp_int *a, int x); | |
329 \end{verbatim} | |
330 | |
331 These will shift the digits of $a$ left (or right respectively) $x$ digits. | |
332 | |
333 To shift individual bits of an fp\_int use the following functions. | |
334 | |
335 \index{fp\_div\_2d} \index{fp\_mod\_2d} \index{fp\_mul\_2d} \index{fp\_div\_2} \index{fp\_mul\_2} | |
336 \begin{verbatim} | |
337 void fp_div_2d(fp_int *a, int b, fp_int *c, fp_int *d); | |
338 void fp_mod_2d(fp_int *a, int b, fp_int *c); | |
339 void fp_mul_2d(fp_int *a, int b, fp_int *c); | |
340 void fp_mul_2(fp_int *a, fp_int *c); | |
341 void fp_div_2(fp_int *a, fp_int *c); | |
342 void fp_2expt(fp_int *a, int b); | |
343 \end{verbatim} | |
344 fp\_div\_2d() will divide $a$ by $2^b$ and store the quotient in $c$ and remainder in $d$. Either of | |
345 $c$ or $d$ can be \textbf{NULL} if their value is not required. fp\_mod\_2d() is a shortcut to | |
346 compute the remainder directly. fp\_mul\_2d() will multiply $a$ by $2^b$ and store the result in $c$. | |
347 | |
348 The fp\_mul\_2() and fp\_div\_2() functions are optimized multiplication and divisions by two. The | |
349 function fp\_2expt() will compute $a = 2^b$ quickly. | |
350 | |
351 To quickly count the number of least significant bits that are zero use the following function. | |
352 | |
353 \index{fp\_cnt\_lsb} | |
354 \begin{verbatim} | |
355 int fp_cnt_lsb(fp_int *a); | |
356 \end{verbatim} | |
357 This will return the number of adjacent least significant bits that are zero. This is equivalent | |
358 to the number of times two evenly divides $a$. | |
359 | |
360 \section{Basic Algebra} | |
361 | |
362 The following functions round out the basic algebraic functionality of the library. | |
363 | |
364 \index{fp\_add} \index{fp\_sub} \index{fp\_mul} \index{fp\_sqr} \index{fp\_div} \index{fp\_mod} | |
365 \begin{verbatim} | |
366 void fp_add(fp_int *a, fp_int *b, fp_int *c); | |
367 void fp_sub(fp_int *a, fp_int *b, fp_int *c); | |
368 void fp_mul(fp_int *a, fp_int *b, fp_int *c); | |
369 void fp_sqr(fp_int *a, fp_int *b); | |
370 int fp_div(fp_int *a, fp_int *b, fp_int *c, fp_int *d); | |
371 int fp_mod(fp_int *a, fp_int *b, fp_int *c); | |
372 \end{verbatim} | |
373 | |
374 The functions fp\_add(), fp\_sub() and fp\_mul() perform their respective operations on $a$ and | |
375 $b$ and store the result in $c$. The function fp\_sqr() computes $b = a^2$ and is faster than | |
376 using fp\_mul() to perform the same operation. | |
377 | |
378 The function fp\_div() divides $a$ by $b$ and stores the quotient in $c$ and remainder in $d$. Either | |
379 of $c$ and $d$ can be \textbf{NULL} if the result is not required. The function fp\_mod() is a simple | |
380 shortcut to find the remainder. | |
381 | |
382 \section{Modular Exponentiation} | |
383 To compute a modular exponentiation use the following function. | |
384 | |
385 \index{fp\_exptmod} | |
386 \begin{verbatim} | |
387 int fp_exptmod(fp_int *a, fp_int *b, fp_int *c, fp_int *d); | |
388 \end{verbatim} | |
389 This computes $d \equiv a^b \mbox{ (mod }c\mbox{)}$ for any odd $c$ and $b$. $b$ may be negative so long as | |
390 $a^{-1} \mbox{ (mod }c\mbox{)}$ exists. The initial value of $a$ may be larger than $c$. The size of $c$ must be | |
391 half of the maximum precision used during the build of the library. For example, by default $c$ must be less | |
392 than $2^{2048}$. | |
393 | |
394 \section{Number Theoretic} | |
395 | |
396 To perform modular inverses, greatest common divisor or least common multiples use the following | |
397 functions. | |
398 | |
399 \index{fp\_invmod} \index{fp\_gcd} \index{fp\_lcm} | |
400 \begin{verbatim} | |
401 int fp_invmod(fp_int *a, fp_int *b, fp_int *c); | |
402 void fp_gcd(fp_int *a, fp_int *b, fp_int *c); | |
403 void fp_lcm(fp_int *a, fp_int *b, fp_int *c); | |
404 \end{verbatim} | |
405 | |
406 The fp\_invmod() function will find the modular inverse of $a$ modulo an odd modulus $b$ and store | |
407 it in $c$ (provided it exists). The function fp\_gcd() will compute the greatest common | |
408 divisor of $a$ and $b$ and store it in $c$. Similarly the fp\_lcm() function will compute | |
409 the least common multiple of $a$ and $b$ and store it in $c$. | |
410 | |
411 \section{Prime Numbers} | |
412 To quickly test a number for primality call this function. | |
413 | |
414 \index{fp\_isprime} | |
415 \begin{verbatim} | |
416 int fp_isprime(fp_int *a); | |
417 \end{verbatim} | |
418 This will return \textbf{FP\_YES} if $a$ is probably prime. It uses 256 trial divisions and | |
419 eight rounds of Rabin-Miller testing. Note that this routine performs modular exponentiations | |
420 which means that $a$ must be in a valid range of precision. | |
421 | |
422 \chapter{Porting TomsFastMath} | |
423 \label{chap:asmops} | |
424 \section{Getting Started} | |
425 Porting TomsFastMath to a given processor target is usually a simple procedure. For the most part | |
426 assembly is used to get around the lack of a ``add with carry'' operation in the C language. To | |
427 make matters simpler the use of assembler is through macro blocks. | |
428 | |
429 Each ``port'' is defined by a block of code that re-defines the portable ISO C macros with assembler | |
430 inline blocks. To add a new port you must designate a TFM\_XXX define that will enable your | |
431 port when built. | |
432 | |
433 \section{Multiply with Comba} | |
434 The file ``fp\_mul\_comba.c'' is responsible for providing the fast multiplication within the | |
435 library. This comba multiplication is fairly simple. It uses a sliding three digit carry | |
436 system with the variables $c0$, $c1$, $c2$. For every digit of output $c0$ is the what will | |
437 be that digit, $c1$ will carry into the next digit and $c2$ will be the ``c1'' carry for | |
438 the next digit. For every ``next'' digit effectively $c0$ is stored as output, $c1$ moves into | |
439 $c0$, $c2$ into $c1$ and zero into $c2$. | |
440 | |
441 The following macros define the assmebler interface to the code. | |
442 | |
443 \begin{verbatim} | |
444 #define COMBA_START | |
445 \end{verbatim} | |
446 | |
447 This is issued at the beginning of the multiplication function. This is in place to allow you to | |
448 initialize any registers or machine words required. You can leave it blank if you do not need | |
449 it. | |
450 | |
451 \begin{verbatim} | |
452 #define COMBA_CLEAR \ | |
453 c0 = c1 = c2 = 0; | |
454 \end{verbatim} | |
455 | |
456 This clears the three comba carries. If you are going to place carries in registers then | |
457 zero the appropriate registers. Note that the functions do not use $c0$, $c1$ or $c2$ directly | |
458 so you are free to ignore these varibles and use registers directly. | |
459 | |
460 \begin{verbatim} | |
461 #define COMBA_FORWARD \ | |
462 c0 = c1; c1 = c2; c2 = 0; | |
463 \end{verbatim} | |
464 | |
465 This propagates the carries after a digit has been produced. | |
466 | |
467 \begin{verbatim} | |
468 #define COMBA_STORE(x) \ | |
469 x = c0; | |
470 \end{verbatim} | |
471 | |
472 This stores the $c0$ digit in the memory location specified by $x$. Note that if you manually | |
473 aliased $c0$ with a register than just store that register in $x$. | |
474 | |
475 \begin{verbatim} | |
476 #define COMBA_STORE2(x) \ | |
477 x = c1; | |
478 \end{verbatim} | |
479 | |
480 This stores the $c1$ digit in the memory location specified by $x$. Note that if you manually | |
481 aliased $c1$ with a register than just store that register in $x$. | |
482 | |
483 \begin{verbatim} | |
484 #define COMBA_FINI | |
485 \end{verbatim} | |
486 | |
487 If at the end of the function you need to perform some action fill this macro in. | |
488 | |
489 \begin{verbatim} | |
490 #define MULADD(i, j) \ | |
491 t = ((fp_word)i) * ((fp_word)j); \ | |
492 c0 = (c0 + t); if (c0 < ((fp_digit)t)) ++c1; \ | |
493 c1 = (c1 + (t>>DIGIT_BIT)); if (c1 < (t>>DIGIT_BIT)) ++c2; | |
494 \end{verbatim} | |
495 | |
496 This macro performs the ``multiply and add'' step that is central to the comba | |
497 multiplier. It multiplies the fp\_digits $i$ and $j$ to produce a fp\_word result. Effectively | |
498 the double--digit value is added to the three-digit carry formed by $c0$, $c1$, $c2$ where $c0$ | |
499 is the least significant digit. | |
500 | |
501 \section{Squaring with Comba} | |
502 Squaring is similar to multiplication except that it uses a special ``multiply and add twice'' macro | |
503 that replaces multiplications that are not required. | |
504 | |
505 \begin{verbatim} | |
506 #define COMBA_START | |
507 \end{verbatim} | |
508 | |
509 This allows for any initialization code you might have. | |
510 | |
511 \begin{verbatim} | |
512 #define CLEAR_CARRY \ | |
513 c0 = c1 = c2 = 0; | |
514 \end{verbatim} | |
515 | |
516 This will clear the carries. Like multiplication you can safely alias the three carry variables | |
517 to registers if you can/want to. | |
518 | |
519 \begin{verbatim} | |
520 #define COMBA_STORE(x) \ | |
521 x = c0; | |
522 \end{verbatim} | |
523 | |
524 Store the $c0$ carry to a given memory location. | |
525 | |
526 \begin{verbatim} | |
527 #define COMBA_STORE2(x) \ | |
528 x = c1; | |
529 \end{verbatim} | |
530 | |
531 Store the $c1$ carry to a given memory location. | |
532 | |
533 \begin{verbatim} | |
534 #define CARRY_FORWARD \ | |
535 c0 = c1; c1 = c2; c2 = 0; | |
536 \end{verbatim} | |
537 | |
538 Forward propagate all three carry variables. | |
539 | |
540 \begin{verbatim} | |
541 #define COMBA_FINI | |
542 \end{verbatim} | |
543 | |
544 If you need to clean up at the end of the function. | |
545 | |
546 \begin{verbatim} | |
547 /* multiplies point i and j, updates carry "c1" and digit c2 */ | |
548 #define SQRADD(i, j) \ | |
549 t = ((fp_word)i) * ((fp_word)j); \ | |
550 c0 = (c0 + t); if (c0 < ((fp_digit)t)) ++c1; \ | |
551 c1 = (c1 + (t>>DIGIT_BIT)); if (c1 < (t>>DIGIT_BIT)) ++c2; | |
552 \end{verbatim} | |
553 | |
554 This is essentially the MULADD macro from the multiplication code. | |
555 | |
556 \begin{verbatim} | |
557 /* for squaring some of the terms are doubled... */ | |
558 #define SQRADD2(i, j) \ | |
559 t = ((fp_word)i) * ((fp_word)j); \ | |
560 c0 = (c0 + t); if (c0 < ((fp_digit)t)) ++c1; \ | |
561 c1 = (c1 + (t>>DIGIT_BIT)); if (c1 < (t>>DIGIT_BIT)) ++c2; \ | |
562 c0 = (c0 + t); if (c0 < ((fp_digit)t)) ++c1; \ | |
563 c1 = (c1 + (t>>DIGIT_BIT)); if (c1 < (t>>DIGIT_BIT)) ++c2; | |
564 \end{verbatim} | |
565 | |
566 This is like SQRADD except it adds the produce twice. It's similar to | |
567 computing SQRADD(i, j*2). | |
568 | |
569 To further make things interesting the squaring code also has ``doubles'' (see my LTM book chapter five...) which are | |
570 handled with these macros. | |
571 | |
572 \begin{verbatim} | |
573 #define SQRADDSC(i, j) \ | |
574 do { fp_word t; \ | |
575 t = ((fp_word)i) * ((fp_word)j); \ | |
576 sc0 = (fp_digit)t; sc1 = (t >> DIGIT_BIT); sc2 = 0; \ | |
577 } while (0); | |
578 \end{verbatim} | |
579 This computes a product and stores it in the ``secondary'' carry registers $\left < sc0, sc1, sc2 \right >$. | |
580 | |
581 \begin{verbatim} | |
582 #define SQRADDAC(i, j) \ | |
583 do { fp_word t; \ | |
584 t = sc0 + ((fp_word)i) * ((fp_word)j); sc0 = t; \ | |
585 t = sc1 + (t >> DIGIT_BIT); sc1 = t; sc2 += t >> DIGIT_BIT; \ | |
586 } while (0); | |
587 \end{verbatim} | |
588 This computes a product and adds it to the ``secondary'' carry registers. | |
589 | |
590 \begin{verbatim} | |
591 #define SQRADDDB \ | |
592 do { fp_word t; \ | |
593 t = ((fp_word)sc0) + ((fp_word)sc0) + c0; c0 = t; \ | |
594 t = ((fp_word)sc1) + ((fp_word)sc1) + c1 + (t >> DIGIT_BIT); c1 = t; \ | |
595 c2 = c2 + ((fp_word)sc2) + ((fp_word)sc2) + (t >> DIGIT_BIT); \ | |
596 } while (0); | |
597 \end{verbatim} | |
598 This doubles the ``secondary'' carry registers and adds the sum to the main carry registers. Really complicated. | |
599 | |
600 \section{Montgomery with Comba} | |
601 Montgomery reduction is used in modular exponentiation and is most called function during | |
602 that operation. It's important to make sure this routine is very fast or all is lost. | |
603 | |
604 Unlike the two other comba routines this one does not use a single three--digit carry | |
605 system. It does have three--digit carries except that the routine steps through them | |
606 in the inner loop. This means you cannot alias them to registers (at all). | |
607 | |
608 To make matters simple though the three arrays of carries are stored in one array. The | |
609 ``c0'' array resides in $c[0 \ldots OFF1-1]$, ``c1'' in $c[OFF1 \ldots OFF2-1]$ and ``c2'' in | |
610 $c[OFF2 \ldots OFF2+FP\_SIZE-1]$. | |
611 | |
612 \begin{verbatim} | |
613 #define MONT_START | |
614 \end{verbatim} | |
615 | |
616 This allows you to insert anything at the start that you need. | |
617 | |
618 \begin{verbatim} | |
619 #define MONT_FINI | |
620 \end{verbatim} | |
621 | |
622 This allows you to insert anything at the end that you need. | |
623 | |
624 \begin{verbatim} | |
625 #define LOOP_START \ | |
626 mu = c[x] * mp; | |
627 \end{verbatim} | |
628 | |
629 This computes the $\mu$ value for the inner loop. You can safely alias $mu$ and $mp$ to | |
630 a register if you want. | |
631 | |
632 \begin{verbatim} | |
633 #define INNERMUL \ | |
634 do { fp_word t; \ | |
635 _c[0] = t = ((fp_word)_c[0] + (fp_word)cy) + \ | |
636 (((fp_word)mu) * ((fp_word)*tmpm++)); \ | |
637 cy = (t >> DIGIT_BIT); \ | |
638 } while (0) | |
639 \end{verbatim} | |
640 | |
641 This computes the inner product and adds it to the destination and carry variable $cy$. | |
642 This uses the $mu$ value computed above (can be in a register already) and the | |
643 $cy$ which is a chaining carry. Inside the INNERMUL loop the $cy$ value can be kept | |
644 inside a register (hint: it always starts as $cy = 0$ in the first iteration). | |
645 | |
646 Upon completion of the inner loop the macro LOOP\_END is called which is used to fetch | |
647 $cy$ into the variable the C program can see. This is where, if you cached $cy$ in a | |
648 register you would copy it to the locally accessible C variable. | |
649 | |
650 \begin{verbatim} | |
651 #define PROPCARRY \ | |
652 do { fp_digit t = _c[0] += cy; cy = (t < cy); } while (0) | |
653 \end{verbatim} | |
654 | |
655 This propagates the carry upwards by one digit. | |
656 | |
657 \input{tfm.ind} | |
658 | |
659 \end{document} |