1 \documentclass[b5paper]{book}
2 \usepackage{hyperref}
3 \usepackage{makeidx}
4 \usepackage{amssymb}
5 \usepackage{color}
6 \usepackage{alltt}
7 \usepackage{graphicx}
8 \usepackage{layout}
9 \def\union{\cup}
10 \def\intersect{\cap}
11 \def\getsrandom{\stackrel{\rm R}{\gets}}
12 \def\cross{\times}
13 \def\cat{\hspace{0.5em} \| \hspace{0.5em}}
14 \def\catn{$\|$}
15 \def\divides{\hspace{0.3em} | \hspace{0.3em}}
16 \def\nequiv{\not\equiv}
17 \def\approx{\raisebox{0.2ex}{\mbox{\small $\sim$}}}
18 \def\lcm{{\rm lcm}}
19 \def\gcd{{\rm gcd}}
20 \def\log{{\rm log}}
21 \def\ord{{\rm ord}}
22 \def\abs{{\mathit abs}}
23 \def\rep{{\mathit rep}}
24 \def\mod{{\mathit\ mod\ }}
25 \renewcommand{\pmod}[1]{\ ({\rm mod\ }{#1})}
26 \newcommand{\floor}[1]{\left\lfloor{#1}\right\rfloor}
27 \newcommand{\ceil}[1]{\left\lceil{#1}\right\rceil}
28 \def\Or{{\rm\ or\ }}
29 \def\And{{\rm\ and\ }}
30 \def\iff{\hspace{1em}\Longleftrightarrow\hspace{1em}}
31 \def\implies{\Rightarrow}
32 \def\undefined{{\rm ``undefined"}}
33 \def\Proof{\vspace{1ex}\noindent {\bf Proof:}\hspace{1em}}
34 \let\oldphi\phi
35 \def\phi{\varphi}
36 \def\Pr{{\rm Pr}}
37 \newcommand{\str}[1]{{\mathbf{#1}}}
38 \def\F{{\mathbb F}}
39 \def\N{{\mathbb N}}
40 \def\Z{{\mathbb Z}}
41 \def\R{{\mathbb R}}
42 \def\C{{\mathbb C}}
43 \def\Q{{\mathbb Q}}
44 \definecolor{DGray}{gray}{0.5}
45 \newcommand{\emailaddr}[1]{\mbox{$<${#1}$>$}}
46 \def\twiddle{\raisebox{0.3ex}{\mbox{\tiny $\sim$}}}
47 \def\gap{\vspace{0.5ex}}
48 \makeindex
49 \begin{document}
50 \frontmatter
51 \pagestyle{empty}
52 \title{Implementing Multiple Precision Arithmetic \\ ~ \\ Draft Edition }
53 \author{\mbox{
54 %\begin{small}
55 \begin{tabular}{c}
56 Tom St Denis \\
57 Algonquin College \\
58 \\
59 Mads Rasmussen \\
60 Open Communications Security \\
61 \\
62 Greg Rose \\
63 QUALCOMM Australia \\
64 \end{tabular}
65 %\end{small}
66 }
67 }
68 \maketitle
69 This text has been placed in the public domain. This text corresponds to the v0.30 release of the
70 LibTomMath project.
71
72 \begin{alltt}
73 Tom St Denis
74 111 Banning Rd
75 Ottawa, Ontario
76 K2L 1C3
77 Canada
78
79 Phone: 1-613-836-3160
80 Email: [email protected]
81 \end{alltt}
82
83 This text is formatted to the international B5 paper size of 176mm wide by 250mm tall using the \LaTeX{}
84 {\em book} macro package and the Perl {\em booker} package.
85
86 \tableofcontents
87 \listoffigures
88 \chapter*{Prefaces to the Draft Edition}
89 I started this text in April 2003 to complement my LibTomMath library. That is, to explain how to implement the functions
90 contained in LibTomMath. The goal is to have a textbook that any Computer Science student can use when implementing their
91 own multiple precision arithmetic. The plan I wanted to follow was to flesh out all the
92 ideas and concepts I had floating around in my head and then refine the text a little bit at a time. As chance
93 would have it, I ended up with my summer off from Algonquin College and was given four solid months to work on the
94 text.
95
96 Choosing to not waste any time I dove right into the project even before my spring semester was finished. I wrote a bit
97 off and on at first. The moment my exams were finished I jumped into long 12 to 16 hour days. The result after only
98 a couple of months was a ten chapter, three hundred page draft that I quickly had distributed to anyone who wanted
99 to read it. I had Jean-Luc Cooke print copies for me and I brought them to Crypto'03 in Santa Barbara. So far the text has
100 managed to attract a certain level of attention, and having people from around the world ask me for copies was certainly
101 rewarding.
102
103 Now we are past December 2003. By this time I had pictured that I would have at least finished my second draft of the text.
104 Currently I am far off from this goal. I've done partial re-writes of chapters one, two and three but they are not even
105 finished yet. I haven't given up on the project, I have only had some setbacks. First O'Reilly declined to publish the text, then
106 Addison-Wesley; Greg tried another publisher whose name I do not know. However, at this point I want to focus my energy
107 on finishing the book, not securing a contract.
108
109 So why am I writing this text? It seems like a lot of work right? Most certainly it is a lot of work writing a textbook.
110 Even the simplest introductory material has to be lined with references and figures. A lot of the text has to be re-written
111 from point form to prose form to ensure an easier read. Why am I doing all this work for free then? Simple. My philosophy
112 is quite simply ``Open Source. Open Academia. Open Minds'', which means that to achieve a goal of open minds, that is,
113 people willing to accept new ideas and explore the unknown, you have to make available material they can access freely
114 without hindrance.
115
116 I've been writing free software since I was about sixteen but only recently have I hit upon software that people have come
117 to depend upon. I started LibTomCrypt in December 2001 and now several major companies use it as integral portions of their
118 software. Several educational institutions use it as a matter of course and many freelance developers use it as
119 part of their projects. To further my contributions I started the LibTomMath project in December 2002, aimed at providing
120 multiple precision arithmetic routines that students could learn from. That is, routines that are not only easy
121 to understand and follow but also provide quite impressive performance considering they are written entirely in standard portable ISO C.
122
123 The second leg of my philosophy is ``Open Academia'' which is where this textbook comes in. In the end, when all is
124 said and done, the text will be usable by educational institutions as a reference on multiple precision arithmetic.
125
126 At this time I feel I should share a little information about myself. The most common question I was asked at
127 Crypto'03, perhaps just out of professional courtesy, was which school I either taught at or attended. The unfortunate
128 truth is that I neither teach at nor attend a school of academic reputation. I'm currently at Algonquin College which
129 is what I'd like to call a ``somewhat academic but mostly vocational'' college. In other words, job training.
130
131 I'm a 21-year-old computer science student, mostly self-taught in the areas I am aware of (which include a half-dozen
132 computer science fields, a few fields of mathematics and some English). I look forward to teaching someday but I am
133 still far off from that goal.
134
135 Now it would be improper for me not to introduce the rest of the text's co-authors. While they are only contributing
136 corrections and editorial feedback, their support has been tremendously helpful in presenting the concepts laid out
137 in the text so far. Greg has always been there for me. He has tracked my LibTom projects since their inception and even
138 sent cheques to help pay tuition from time to time. His background has provided a wonderful source to bounce ideas off
139 of and improve the quality of my writing. Mads is another fellow who has just ``been there''. I don't even recall what
140 his interest in the LibTom projects is but I'm definitely glad he has been around. His ability to catch logical errors
141 in my written English has saved me on several occasions, to say the least.
142
143 What to expect next? Well, this is still a rough draft. I've only had the chance to update a few chapters. However, I've
144 been getting the feeling that people are starting to use my text and I owe them some updated material. My current tentative
145 plan is to edit one chapter every two weeks starting January 4th. It seems insane but my lower course load at college
146 should provide ample time. By Crypto'04 I plan to have a 2nd draft of the text polished and ready to hand out to as many
147 people as will take it.
148
149 \begin{flushright} Tom St Denis \end{flushright}
150
151 \newpage
152 I found the opportunity to work with Tom appealing for several reasons: not only could I broaden my own horizons, but I could also
153 contribute to educating others facing the problem of having to handle big number mathematical calculations.
154
155 This book is Tom's child and he has been caring for and fostering the project ever since the beginning with a clear idea of
156 how he wanted the project to turn out. I have helped by proofreading the text and we have had several discussions about
157 the layout and language used.
158
159 I hold a masters degree in cryptography from the University of Southern Denmark and have always been interested in the
160 practical aspects of cryptography.
161
162 Having worked in the security consultancy business for several years in S\~{a}o Paulo, Brazil, I have been in touch with a
163 great deal of work in which multiple precision mathematics was needed. Understanding the possibilities for speeding up
164 multiple precision calculations is often very important since we deal with outdated machine architectures where modular
165 reductions, for example, become painfully slow.
166
167 This text is for people who stop and wonder when examining algorithms such as RSA for the first time and ask
168 themselves, ``You tell me this is only secure for large numbers, fine; but how do you implement these numbers?''
169
170 \begin{flushright}
171 Mads Rasmussen
172
173 S\~{a}o Paulo - SP
174
175 Brazil
176 \end{flushright}
177
178 \newpage
179 It's all because I broke my leg. That just happened to be at about the same time that Tom asked for someone to review the section of the book about
180 Karatsuba multiplication. I was laid up, alone and immobile, and thought ``Why not?'' I vaguely knew what Karatsuba multiplication was, but not
181 really, so I thought I could help, learn, and stop myself from watching daytime cable TV, all at once.
182
183 At the time of writing this, I've still not met Tom or Mads in meatspace. I've been following Tom's progress since his first splash on the
184 sci.crypt Usenet news group. I watched him go from a clueless newbie, to the cryptographic equivalent of a reformed smoker, to a real
185 contributor to the field, over a period of about two years. I've been impressed with his obvious intelligence, and astounded by his productivity.
186 Of course, he's young enough to be my own child, so he doesn't have my problems with staying awake.
187
188 When I reviewed that single section of the book, in its very earliest form, I was very pleasantly surprised. So I decided to collaborate more fully,
189 and at least review all of it, and perhaps write some bits too. There's still a long way to go with it, and I have watched a number of close
190 friends go through the mill of publication, so I think that the way to go is longer than Tom thinks it is. Nevertheless, it's a good effort,
191 and I'm pleased to be involved with it.
192
193 \begin{flushright}
194 Greg Rose, Sydney, Australia, June 2003.
195 \end{flushright}
196
197 \mainmatter
198 \pagestyle{headings}
199 \chapter{Introduction}
200 \section{Multiple Precision Arithmetic}
201
202 \subsection{What is Multiple Precision Arithmetic?}
203 When we think of long-hand arithmetic such as addition or multiplication we rarely consider the fact that we instinctively
204 raise or lower the precision of the numbers we are dealing with. For example, in decimal we can almost immediately
205 reason that $7$ times $6$ is $42$. However, $42$ has two digits of precision as opposed to the one digit we started with.
206 A further multiplication by, say, $3$ results in the larger precision result $126$. In these few examples we have multiple
207 precisions for the numbers we are working with. Despite the various levels of precision a single subset\footnote{With the occasional optimization.}
208 of algorithms can be designed to accommodate them.
209
210 By way of comparison a fixed or single precision operation would lose precision on various operations. For example, in
211 the decimal system with a fixed precision of one digit $6 \cdot 7 = 2$, since only the least significant digit of the product is retained.
212
213 Essentially at the heart of computer based multiple precision arithmetic are the same long-hand algorithms taught in
214 schools to manually add, subtract, multiply and divide.
215
216 \subsection{The Need for Multiple Precision Arithmetic}
217 The most prevalent need for multiple precision arithmetic, often referred to as ``bignum'' math, is within the implementation
218 of public-key cryptography algorithms. Algorithms such as RSA \cite{RSAREF} and Diffie-Hellman \cite{DHREF} require
219 integers of significant magnitude to resist known cryptanalytic attacks. For example, at the time of this writing a
220 typical RSA modulus would be at least $10^{309}$. However, modern programming languages such as ISO C \cite{ISOC} and
221 Java \cite{JAVA} only provide intrinsic support for integers which are relatively small and single precision.
222
223 \begin{figure}[!here]
224 \begin{center}
225 \begin{tabular}{|r|c|}
226 \hline \textbf{Data Type} & \textbf{Range} \\
227 \hline char & $-128 \ldots 127$ \\
228 \hline short & $-32768 \ldots 32767$ \\
229 \hline long & $-2147483648 \ldots 2147483647$ \\
230 \hline long long & $-9223372036854775808 \ldots 9223372036854775807$ \\
231 \hline
232 \end{tabular}
233 \end{center}
234 \caption{Typical Data Types for the C Programming Language}
235 \label{fig:ISOC}
236 \end{figure}
237
238 The largest data type guaranteed to be provided by the ISO C programming
239 language\footnote{As per the ISO C standard. However, each compiler vendor is allowed to augment the precision as they
240 see fit.} can only represent values up to approximately $10^{19}$ as shown in figure \ref{fig:ISOC}. On its own the C language is
241 insufficient to accommodate the magnitude required for the problem at hand. An RSA modulus of magnitude $10^{19}$ could be
242 trivially factored\footnote{A Pollard-Rho factorization would take only about $2^{16}$ steps.} on the average desktop computer,
243 rendering any protocol based on the algorithm insecure. Multiple precision algorithms solve this very problem by
244 extending the range of representable integers while using single precision data types.
245
246 Most advancements in fast multiple precision arithmetic stem from the need for faster and more efficient cryptographic
247 primitives. Faster modular reduction and exponentiation algorithms such as Barrett's algorithm, which have appeared in
248 various cryptographic journals, can render algorithms such as RSA and Diffie-Hellman more efficient. In fact, several
249 major companies such as RSA Security, Certicom and Entrust have built entire product lines on the implementation and
250 deployment of efficient algorithms.
251
252 However, cryptography is not the only field of study that can benefit from fast multiple precision integer routines.
253 Another auxiliary use of multiple precision integers is high precision floating point data types.
254 The basic IEEE \cite{IEEE} standard floating point type is made up of an integer mantissa $q$, an exponent $e$ and a sign bit $s$.
255 Numbers are given in the form $n = (-1)^s \cdot q \cdot b^e$ where $b = 2$ is the most common base for IEEE. Since IEEE
256 floating point is meant to be implemented in hardware the precision of the mantissa is often fairly small
257 (\textit{typically 24, 53 or 64 bits}). The mantissa is merely an integer and a multiple precision integer could be used to create
258 a mantissa of much larger precision than hardware alone can efficiently support. This approach could be useful where
259 scientific applications must minimize the total output error over long calculations.
260
261 Another use for large integers is within arithmetic on polynomials of large characteristic (i.e. $GF(p)[x]$ for large $p$).
262 In fact the library discussed within this text has already been used to form a polynomial basis library\footnote{See \url{http://poly.libtomcrypt.org} for more details.}.
263
264 \subsection{Benefits of Multiple Precision Arithmetic}
265 \index{precision}
266 The benefit of multiple precision representations over single or fixed precision representations is that
267 no precision is lost while representing the result of an operation which requires excess precision. For example,
268 the product of two $n$-bit integers requires at least $2n$ bits of precision to be represented faithfully. A multiple
269 precision algorithm would augment the precision of the destination to accommodate the result while a single precision system
270 would truncate excess bits to maintain a fixed level of precision.
271
272 It is possible to implement algorithms which require large integers with fixed precision algorithms. For example, elliptic
273 curve cryptography (\textit{ECC}) is often implemented on smartcards by fixing the precision of the integers to the maximum
274 size the system will ever need. Such an approach can lead to vastly simpler algorithms which can accommodate the
275 integers required even if the host platform cannot natively accommodate them\footnote{For example, the average smartcard
276 processor has an 8 bit accumulator.}. However, as efficient as such an approach may be, the resulting source code is not
277 normally very flexible. It cannot, at runtime, accommodate inputs of higher magnitude than the designer anticipated.
278
279 Multiple precision algorithms have the most overhead of any style of arithmetic. For the most part the
280 overhead can be kept to a minimum with careful planning, but overall, this style of arithmetic is not well suited for most memory starved
281 platforms. However, multiple precision algorithms do offer the most flexibility in terms of the magnitude of the
282 inputs. That is, the same algorithms based on multiple precision integers can accommodate any reasonable size input
283 without the designer's explicit forethought. This leads to lower cost of ownership for the code as it only has to
284 be written and tested once.
285
286 \section{Purpose of This Text}
287 The purpose of this text is to instruct the reader regarding how to implement efficient multiple precision algorithms.
288 That is, to not only explain a limited subset of the core theory behind the algorithms but also the various ``housekeeping''
289 elements that are neglected by authors of other texts on the subject. Several well renowned texts \cite{TAOCPV2,HAC}
290 give considerably detailed explanations of the theoretical aspects of algorithms and often very little information
291 regarding the practical implementation aspects.
292
293 In most cases how an algorithm is explained and how it is actually implemented are two very different concepts. For
294 example, the Handbook of Applied Cryptography (\textit{HAC}), algorithm 14.7 on page 594, gives a relatively simple
295 algorithm for performing multiple precision integer addition. However, the description lacks any discussion concerning
296 the fact that the two integer inputs may be of differing magnitudes. As a result the implementation is not as simple
297 as the text would lead people to believe. Similarly the division routine (\textit{algorithm 14.20, pp. 598}) does not
298 discuss how to handle sign or handle the dividend's decreasing magnitude in the main loop (\textit{step \#3}).
299
300 Neither text discusses several key optimal algorithms such as the ``Comba'' and Karatsuba multipliers
301 and fast modular inversion, which we consider a practical oversight. These optimal algorithms are vital to achieving
302 any form of useful performance in non-trivial applications.
303
304 To solve this problem the focus of this text is on the practical aspects of implementing a multiple precision integer
305 package. As a case study the ``LibTomMath''\footnote{Available at \url{http://math.libtomcrypt.org}} package is used
306 to demonstrate algorithms with real implementations\footnote{In the ISO C programming language.} that have been field
307 tested and work very well. The LibTomMath library is freely available on the Internet for all uses and this text
308 discusses a very large portion of the inner workings of the library.
309
310 The algorithms that are presented will always include at least one ``pseudo-code'' description followed
311 by the actual C source code that implements the algorithm. The pseudo-code can be used to implement the same
312 algorithm in other programming languages as the reader sees fit.
313
314 This text shall also serve as a walkthrough of the creation of multiple precision algorithms from scratch, showing
315 the reader how the algorithms fit together as well as where to start on various tasks.
316
317 \section{Discussion and Notation}
318 \subsection{Notation}
319 A multiple precision integer of $n$ digits shall be denoted as $x = (x_{n-1} \ldots x_1 x_0)_{ \beta }$ and shall represent
320 the integer $x \equiv \sum_{i=0}^{n-1} x_i\beta^i$. The elements of the array $x$ are said to be the radix $\beta$ digits
321 of the integer. For example, $x = (1,2,3)_{10}$ would represent the integer
322 $1\cdot 10^2 + 2\cdot10^1 + 3\cdot10^0 = 123$.
323
324 \index{mp\_int}
325 The term ``mp\_int'' shall refer to a composite structure which contains the digits of the integer it represents, as well
326 as auxiliary data required to manipulate the data. These additional members are discussed further in section
327 \ref{sec:MPINT}. For the purposes of this text a ``multiple precision integer'' and an ``mp\_int'' are assumed to be
328 synonymous. When an algorithm is specified to accept an mp\_int variable it is assumed the various auxiliary data members
329 are present as well. An expression of the type \textit{variablename.item} implies that it should evaluate to the
330 member named ``item'' of the variable. For example, a string of characters may have a member ``length'' which would
331 evaluate to the number of characters in the string. If the string $a$ equals ``hello'' then it follows that
332 $a.length = 5$.
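
To make the notation concrete, the following short ISO C sketch (written for this discussion only;
it is not a LibTomMath routine) evaluates the integer represented by an array of radix $\beta$
digits stored least significant digit first, exactly as in the sum $\sum_{i=0}^{n-1} x_i\beta^i$ above.

\begin{verbatim}
/* Illustrative only: evaluate an n-digit radix-beta array where
 * digits[0] is the least significant digit.  With beta = 10 the
 * array {3, 2, 1} represents 1*10^2 + 2*10^1 + 3*10^0 = 123.    */
unsigned long digits_to_value(const int *digits, int n, unsigned long beta)
{
   unsigned long value = 0;
   int i;

   /* Horner's rule: start with the most significant digit. */
   for (i = n - 1; i >= 0; i--) {
      value = value * beta + (unsigned long)digits[i];
   }
   return value;
}
\end{verbatim}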
333
334 For certain discussions more generic algorithms are presented to help the reader understand the final algorithm used
335 to solve a given problem. When an algorithm is described as accepting an integer input it is assumed the input is
336 a plain integer with no additional multiple-precision members. That is, algorithms that use integers as opposed to
337 mp\_ints as inputs do not concern themselves with the housekeeping operations required such as memory management. These
338 algorithms will be used to establish the relevant theory which will subsequently be used to describe a multiple
339 precision algorithm to solve the same problem.
340
341 \subsection{Precision Notation}
342 For the purposes of this text a single precision variable must be able to represent integers in the range
343 $0 \le x < q \beta$ while a double precision variable must be able to represent integers in the range
344 $0 \le x < q \beta^2$. The variable $\beta$ represents the radix of a single digit of a multiple precision integer and
345 must be of the form $q^p$ for $q, p \in \Z^+$. The extra radix-$q$ factor allows additions and subtractions to proceed
346 without truncation of the carry. Since all modern computers are binary, it is assumed that $q$ is two, for all intents
347 and purposes.
348
349 \index{mp\_digit} \index{mp\_word}
350 Within the source code that will be presented for each algorithm, the data type \textbf{mp\_digit} will represent
351 a single precision integer type, while, the data type \textbf{mp\_word} will represent a double precision integer type. In
352 several algorithms (notably the Comba routines) temporary results will be stored in arrays of double precision mp\_words.
353 For the purposes of this text $x_j$ will refer to the $j$'th digit of a single precision array and $\hat x_j$ will refer to
354 the $j$'th digit of a double precision array. Whenever an expression is to be assigned to a double precision
355 variable it is assumed that all single precision variables are promoted to double precision during the evaluation.
356 Expressions that are assigned to a single precision variable are truncated to fit within the precision of a single
357 precision data type.
358
359 For example, if $\beta = 10^2$ a single precision data type may represent a value in the
360 range $0 \le x < 10^3$, while a double precision data type may represent a value in the range $0 \le x < 10^5$. Let
361 $a = 23$ and $b = 49$ represent two single precision variables. The single precision product shall be written
362 as $c \leftarrow a \cdot b$ while the double precision product shall be written as $\hat c \leftarrow a \cdot b$.
363 In this particular case, $\hat c = 1127$ and $c = 127$. The most significant digit of the product would not fit
364 in a single precision data type and as a result $c \ne \hat c$.
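
As a small illustration of this distinction in ISO C, suppose (purely for this example) that the
single precision digit were a 32-bit unsigned type and the double precision word a 64-bit unsigned
type; the widths actually used by LibTomMath depend on its configuration. A double precision
product is then formed by promoting the operands before multiplying.

\begin{verbatim}
#include <stdint.h>

/* Example type choices for this sketch only. */
typedef uint32_t my_digit;   /* single precision */
typedef uint64_t my_word;    /* double precision */

/* c_hat <- a * b in double precision, c <- the truncated result */
void products(my_digit a, my_digit b, my_digit *c, my_word *c_hat)
{
   *c_hat = (my_word)a * (my_word)b;  /* promote, then multiply */
   *c     = (my_digit)*c_hat;         /* truncated to single precision */
}
\end{verbatim}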
365
366 \subsection{Algorithm Inputs and Outputs}
367 Within the algorithm descriptions all variables are assumed to be scalars of either single or double precision
368 as indicated. The only exception to this rule is when variables have been indicated to be of type mp\_int. This
369 distinction is important as scalars are often used as array indices and various other counters.
370
371 \subsection{Mathematical Expressions}
372 The $\lfloor \mbox{ } \rfloor$ brackets imply an expression truncated to an integer not greater than the expression
373 itself. For example, $\lfloor 5.7 \rfloor = 5$. Similarly the $\lceil \mbox{ } \rceil$ brackets imply an expression
374 rounded to an integer not less than the expression itself. For example, $\lceil 5.1 \rceil = 6$. Typically when
375 the $/$ division symbol is used the intention is to perform an integer division with truncation. For example,
376 $5/2 = 2$ which will often be written as $\lfloor 5/2 \rfloor = 2$ for clarity. When an expression is written as a
377 fraction a real value division is implied, for example ${5 \over 2} = 2.5$.
378
379 The norm of a multiple precision integer, written $\vert \vert x \vert \vert$, will be used to represent the number of digits in the representation
380 of the integer. For example, $\vert \vert 123 \vert \vert = 3$ and $\vert \vert 79452 \vert \vert = 5$.
381
382 \subsection{Work Effort}
383 \index{big-Oh}
384 To measure the efficiency of the specified algorithms, a modified big-Oh notation is used. In this system all
385 single precision operations are considered to have the same cost\footnote{Except where explicitly noted.}.
386 That is a single precision addition, multiplication and division are assumed to take the same time to
387 complete. While this is generally not true in practice, it will simplify the discussions considerably.
388
389 Some algorithms have slight advantages over others which is why some constants will not be removed in
390 the notation. For example, a normal baseline multiplication (section \ref{sec:basemult}) requires $O(n^2)$ work while a
391 baseline squaring (section \ref{sec:basesquare}) requires $O({{n^2 + n}\over 2})$ work. In standard big-Oh notation these
392 would both be said to be equivalent to $O(n^2)$. However,
393 in the context of this text this is not the case as the magnitude of the inputs will typically be rather small. As a
394 result small constant factors in the work effort will make an observable difference in algorithm efficiency.
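
To make the distinction concrete, for $n = 100$ digit inputs the baseline multiplication performs
$100^2 = 10\,000$ single precision multiplications whereas the baseline squaring performs only
$(100^2 + 100)/2 = 5\,050$, a saving of nearly a factor of two that standard big-Oh notation would conceal.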
395
396 All of the algorithms presented in this text have a polynomial time work level. That is, of the form
397 $O(n^k)$ for $n, k \in \Z^{+}$. This will help make useful comparisons in terms of the speed of the algorithms and how
398 various optimizations will help pay off in the long run.
399
400 \section{Exercises}
401 Within the more advanced chapters a section will be set aside to give the reader some challenging exercises related to
402 the discussion at hand. These exercises are not designed to be prize winning problems, but instead to be thought
403 provoking. Wherever possible the problems are forward looking, posing questions that will be answered in subsequent
404 chapters. The reader is encouraged to finish the exercises as they appear to get a better understanding of the
405 subject material.
406
407 That being said, the problems are designed to affirm knowledge of a particular subject matter. Students in particular
408 are encouraged to verify they can answer the problems correctly before moving on.
409
410 Similar to the exercises of \cite[pp. ix]{TAOCPV2} these exercises are given a scoring system based on the difficulty of
411 the problem. However, unlike \cite{TAOCPV2} the problems do not get nearly as hard. The scoring of these
412 exercises ranges from one (the easiest) to five (the hardest). The following table summarizes the
413 scoring system used.
414
415 \begin{figure}[here]
416 \begin{center}
417 \begin{small}
418 \begin{tabular}{|c|l|}
419 \hline $\left [ 1 \right ]$ & An easy problem that should only take the reader a matter of \\
420 & minutes to solve. Usually does not involve much computer time \\
421 & to solve. \\
422 \hline $\left [ 2 \right ]$ & An easy problem that involves a marginal amount of computer \\
423 & time usage. Usually requires a program to be written to \\
424 & solve the problem. \\
425 \hline $\left [ 3 \right ]$ & A moderately hard problem that requires a non-trivial amount \\
426 & of work. Usually involves trivial research and development of \\
427 & new theory from the perspective of a student. \\
428 \hline $\left [ 4 \right ]$ & A moderately hard problem that involves a non-trivial amount \\
429 & of work and research, the solution to which will demonstrate \\
430 & a higher mastery of the subject matter. \\
431 \hline $\left [ 5 \right ]$ & A hard problem that involves concepts that are difficult for a \\
432 & novice to solve. Solutions to these problems will demonstrate a \\
433 & complete mastery of the given subject. \\
434 \hline
435 \end{tabular}
436 \end{small}
437 \end{center}
438 \caption{Exercise Scoring System}
439 \end{figure}
440
441 Problems at the first level are meant to be simple questions that the reader can answer quickly without programming a solution or
442 devising new theory. These problems are quick tests to see if the material is understood. Problems at the second level
443 are also designed to be easy but will require a program or algorithm to be implemented to arrive at the answer. These
444 two levels are essentially entry level questions.
445
446 Problems at the third level are meant to be a bit more difficult than the first two levels. The answer is often
447 fairly obvious but arriving at an exact solution requires some thought and skill. These problems will almost always
448 involve devising a new algorithm or implementing a variation of another algorithm previously presented. Readers who can
449 answer these questions will feel comfortable with the concepts behind the topic at hand.
450
451 Problems at the fourth level are meant to be similar to those of the level three questions except they will require
452 additional research to be completed. The reader will most likely not know the answer right away, nor will the text provide
453 the exact details of the answer until a subsequent chapter.
454
455 Problems at the fifth level are meant to be the hardest
456 problems relative to all the other problems in the chapter. People who can correctly answer fifth level problems have a
457 mastery of the subject matter at hand.
458
459 Often problems will be tied together. The purpose of this is to start a chain of thought that will be discussed in future chapters. The reader
460 is encouraged to answer the follow-up problems and try to draw out the relevance of the problems.
461
462 \section{Introduction to LibTomMath}
463
464 \subsection{What is LibTomMath?}
465 LibTomMath is a free and open source multiple precision integer library written entirely in portable ISO C. By portable it
466 is meant that the library does not contain any code that is computer platform dependent or otherwise problematic to use on
467 any given platform.
468
469 The library has been successfully tested under numerous operating systems including Unix\footnote{All of these
470 trademarks belong to their respective rightful owners.}, MacOS, Windows, Linux, PalmOS and on standalone hardware such
471 as the Gameboy Advance. The library is designed to contain enough functionality to be able to develop applications such
472 as public key cryptosystems and still maintain a relatively small footprint.
473
474 \subsection{Goals of LibTomMath}
475
476 Libraries which obtain the most efficiency are rarely written in a high level programming language such as C. However,
477 even though this library is written entirely in ISO C, considerable care has been taken to optimize the algorithm implementations within the
478 library. Specifically the code has been written to work well with the GNU C Compiler (\textit{GCC}) on both x86 and ARM
479 processors. Wherever possible, highly efficient algorithms, such as Karatsuba multiplication, sliding window
480 exponentiation and Montgomery reduction have been provided to make the library more efficient.
481
482 Even with the nearly optimal and specialized algorithms that have been included the Application Programming Interface
483 (\textit{API}) has been kept as simple as possible. Often generic placeholder routines will make use of specialized
484 algorithms automatically without the developer's specific attention. One such example is the generic multiplication
485 algorithm \textbf{mp\_mul()} which will automatically use Toom--Cook, Karatsuba, Comba or baseline multiplication
486 based on the magnitude of the inputs and the configuration of the library.
487
488 Making LibTomMath as efficient as possible is not the only goal of the LibTomMath project. Ideally the library should
489 be source compatible with another popular library which makes it more attractive for developers to use. In this case the
490 MPI library was used as an API template for all the basic functions. MPI was chosen because it is another library that fits
491 in the same niche as LibTomMath. Even though LibTomMath uses MPI as the template for the function names and argument
492 passing conventions, it has been written from scratch by Tom St Denis.
493
494 The project is also meant to act as a learning tool for students, the logic being that no easy-to-follow ``bignum''
495 library exists which can be used to teach computer science students how to perform fast and reliable multiple precision
496 integer arithmetic. To this end the source code has been given quite a few comments and algorithm discussion points.
497
498 \section{Choice of LibTomMath}
499 LibTomMath was chosen as the case study of this text not only because the author of both projects is one and the same but
500 for more worthy reasons. Other libraries such as GMP \cite{GMP}, MPI \cite{MPI}, LIP \cite{LIP} and OpenSSL
501 \cite{OPENSSL} have multiple precision integer arithmetic routines but would not be ideal for this text for
502 reasons that will be explained in the following sub-sections.
503
504 \subsection{Code Base}
505 The LibTomMath code base is all portable ISO C source code. This means that there are no platform dependent conditional
506 segments of code littered throughout the source. This clean and uncluttered approach to the library means that a
507 developer can more readily discern the true intent of a given section of source code without trying to keep track of
508 what conditional code will be used.
509
510 The code base of LibTomMath is well organized. Each function is in its own separate source code file
511 which allows the reader to find a given function very quickly. On average there are $76$ lines of code per source
512 file which makes the source very easy to follow. By comparison MPI and LIP are single file projects making code tracing
513 very hard. GMP has many conditional code segments which also hinder tracing.
514
515 When compiled with GCC for the x86 processor and optimized for speed the entire library is approximately $100$KiB\footnote{The notation ``KiB'' means $2^{10}$ octets, similarly ``MiB'' means $2^{20}$ octets.}
516 which is fairly small compared to GMP (over $250$KiB). LibTomMath is slightly larger than MPI (which compiles to about
517 $50$KiB) but LibTomMath is also much faster and more complete than MPI.
518
519 \subsection{API Simplicity}
520 LibTomMath is designed after the MPI library and shares the API design. Quite often programs that use MPI will build
521 with LibTomMath without change. The function names correlate directly to the action they perform. Almost all of the
522 functions share the same parameter passing convention. The learning curve is fairly shallow with the API provided
523 which is an extremely valuable benefit for the student and developer alike.
524
525 The LIP library is an example of a library with an API that is awkward to work with. LIP uses function names that are often ``compressed'' to
526 illegible shorthand. LibTomMath does not share this characteristic.
527
528 The GMP library also does not return error codes. Instead it uses a POSIX.1 \cite{POSIX1} signal system where errors
529 are signaled to the host application. This happens to be the fastest approach but definitely not the most versatile. In
530 effect a math error (i.e. invalid input, heap error, etc.) can cause a program to stop functioning which is definitely
531 undesirable in many situations.
532
533 \subsection{Optimizations}
534 While LibTomMath is certainly not the fastest library (GMP often beats LibTomMath by a factor of two) it does
535 feature a set of optimal algorithms for tasks such as modular reduction, exponentiation, multiplication and squaring. GMP
536 and LIP also feature such optimizations while MPI only uses baseline algorithms with no optimizations. GMP lacks a few
537 of the additional modular reduction optimizations that LibTomMath features\footnote{At the time of this writing GMP
538 only had Barrett and Montgomery modular reduction algorithms.}.
539
540 LibTomMath is almost always an order of magnitude faster than the MPI library at computationally expensive tasks such as modular
541 exponentiation. In the grand scheme of ``bignum'' libraries LibTomMath is faster than the average library and usually
542 slower than the best libraries such as GMP and OpenSSL by only a small factor.
543
544 \subsection{Portability and Stability}
545 LibTomMath will build ``out of the box'' on any platform equipped with a modern version of the GNU C Compiler
546 (\textit{GCC}). This means that the library will build without changes, without a configuration step and without setting up any
547 variables. LIP and MPI will build ``out of the box'' as well but have numerous known bugs. Most notably the author of
548 MPI has recently stopped working on his library and LIP has long since been discontinued.
549
550 GMP requires a configuration script to run and will not build out of the box. GMP and LibTomMath are still in active
551 development and are very stable across a variety of platforms.
552
553 \subsection{Choice}
554 LibTomMath is a relatively compact, well documented, highly optimized and portable library, which makes it a natural choice for
555 the case study of this text. Various source files from the LibTomMath project will be included within the text. However,
556 the reader is encouraged to download their own copy of the library in order to work with it directly.
557
558 \chapter{Getting Started}
559 \section{Library Basics}
560 The trick to writing any useful library of source code is to build a solid foundation and work outwards from it. First,
561 a problem along with allowable solution parameters should be identified and analyzed. In this particular case the
562 inability to accommodate multiple precision integers is the problem. Furthermore, the solution must be written
563 as portable source code that is reasonably efficient across several different computer platforms.
564
565 After a foundation is formed the remainder of the library can be designed and implemented in a hierarchical fashion.
566 That is, to implement the lowest level dependencies first and work towards the most abstract functions last. For example,
567 before implementing a modular exponentiation algorithm one would implement a modular reduction algorithm.
568 By building outwards from a base foundation instead of using a parallel design methodology the resulting project is
569 highly modular. Being highly modular is a desirable property of any project as it often means the resulting product
570 has a small footprint and updates are easy to perform.
571
572 Usually when I start a project I will begin with the header file. I define the data types I think I will need and
573 prototype the initial functions that are not dependent on other functions (within the library). After I
574 implement these base functions I prototype more dependent functions and implement them. The process repeats until
575 I implement all of the functions I require. For example, in the case of LibTomMath I implemented functions such as
576 mp\_init() well before I implemented mp\_mul() and even further before I implemented mp\_exptmod(). As an example of
577 why this design works, note that the Karatsuba and Toom-Cook multipliers were written \textit{after} the
578 dependent function mp\_exptmod() was written. Adding the new multiplication algorithms did not require changes to the
579 mp\_exptmod() function itself and lowered the total cost of ownership (\textit{so to speak}) and of development
580 for new algorithms. This methodology allows new algorithms to be tested in a complete framework with relative ease.
581
582 \begin{center}
583 \begin{figure}[here]
584 \includegraphics{pics/design_process.ps}
585 \caption{Design Flow of the First Few Original LibTomMath Functions.}
586 \label{pic:design_process}
587 \end{figure}
588 \end{center}
589
590 Only after the majority of the functions were in place did I pursue a less hierarchical approach to auditing and optimizing
591 the source code. For example, one day I may audit the multipliers and the next day the polynomial basis functions.
592
593 It only makes sense to begin the text with the preliminary data types and support algorithms required.
594 This chapter discusses the core algorithms of the library which are the dependencies of every other algorithm.
595
596 \section{What is a Multiple Precision Integer?}
597 Recall that most programming languages, in particular ISO C \cite{ISOC}, only have fixed precision data types that on their own cannot
598 be used to represent values larger than their precision will allow. The purpose of multiple precision algorithms is
599 to use fixed precision data types to create and manipulate multiple precision integers which may represent values
600 that are very large.
601
602 As a well known analogy, school children are taught how to form numbers larger than nine by prepending more radix ten digits. In the decimal system
603 the largest single digit value is $9$. However, by concatenating digits together larger numbers may be represented. Newly prepended digits
604 (\textit{to the left}) are said to be in a different power of ten column. That is, the number $123$ can be described as having a $1$ in the hundreds
605 column, $2$ in the tens column and $3$ in the ones column. Or more formally $123 = 1 \cdot 10^2 + 2 \cdot 10^1 + 3 \cdot 10^0$. Computer based
606 multiple precision arithmetic is essentially the same concept. Larger integers are represented by adjoining fixed
607 precision computer words with the exception that a different radix is used.
608
609 What most people probably do not think about explicitly are the various other attributes that describe a multiple precision
610 integer. For example, the integer $154_{10}$ has two immediately obvious properties. First, the integer is positive,
611 that is the sign of this particular integer is positive as opposed to negative. Second, the integer has three digits in
612 its representation. There is an additional property that the integer possesses that does not concern pencil-and-paper
613 arithmetic. The third property is how many digit placeholders are available to hold the integer.
614
615 The human analogy of this third property is ensuring there is enough space on the paper to write the integer. For example,
616 if one starts writing a large number too far to the right on a piece of paper they will have to erase it and move left.
617 Similarly, computer algorithms must maintain strict control over memory usage to ensure that the digits of an integer
618 will not exceed the allowed boundaries. These three properties make up what is known as a multiple precision
619 integer or mp\_int for short.
620
621 \subsection{The mp\_int Structure}
622 \label{sec:MPINT}
623 The mp\_int structure is the ISO C manifestation of a multiple precision integer. The ISO C standard does not provide for
624 any such data type but it does provide for making composite data types known as structures. The following is the structure definition
625 used within LibTomMath.
626
627 \index{mp\_int}
628 \begin{verbatim}
629 typedef struct {
630     int used, alloc, sign;
631     mp_digit *dp;
632 } mp_int;
633 \end{verbatim}
634
635 The mp\_int structure can be broken down as follows.
636
637 \begin{enumerate}
638 \item The \textbf{used} parameter denotes how many digits of the array \textbf{dp} contain the digits used to represent
639 a given integer. The \textbf{used} count must be positive (or zero) and may not exceed the \textbf{alloc} count.
640
641 \item The \textbf{alloc} parameter denotes how
642 many digits are available in the array to use by functions before it has to increase in size. When the \textbf{used} count
643 of a result would exceed the \textbf{alloc} count all of the algorithms will automatically increase the size of the
644 array to accommodate the precision of the result.
645
646 \item The pointer \textbf{dp} points to a dynamically allocated array of digits that represent the given multiple
647 precision integer. It is padded with $(\textbf{alloc} - \textbf{used})$ zero digits. The array is maintained in a least
648 significant digit order. As a pencil and paper analogy the array is organized such that the right most digits are stored
649 first starting at the location indexed by zero\footnote{In C all arrays begin at zero.} in the array. For example,
650 if \textbf{dp} contains $\lbrace a, b, c, \ldots \rbrace$ where \textbf{dp}$_0 = a$, \textbf{dp}$_1 = b$, \textbf{dp}$_2 = c$, $\ldots$ then
651 it would represent the integer $a + b\beta + c\beta^2 + \ldots$
652
653 \index{MP\_ZPOS} \index{MP\_NEG}
654 \item The \textbf{sign} parameter denotes the sign as either zero/positive (\textbf{MP\_ZPOS}) or negative (\textbf{MP\_NEG}).
655 \end{enumerate}
656
657 \subsubsection{Valid mp\_int Structures}
658 Several rules are placed on the state of an mp\_int structure and are assumed to be followed for reasons of efficiency.
659 The only exceptions are when the structure is passed to initialization functions such as mp\_init() and mp\_init\_copy().
660
661 \begin{enumerate}
662 \item The value of \textbf{alloc} may not be less than one. That is \textbf{dp} always points to a previously allocated
663 array of digits.
664 \item The value of \textbf{used} may not exceed \textbf{alloc} and must be greater than or equal to zero.
665 \item The value of \textbf{used} implies the digit at index $(used - 1)$ of the \textbf{dp} array is non-zero. That is,
666 leading zero digits in the most significant positions must be trimmed.
667 \begin{enumerate}
668 \item Digits in the \textbf{dp} array at and above the \textbf{used} location must be zero.
669 \end{enumerate}
670 \item The value of \textbf{sign} must be \textbf{MP\_ZPOS} if \textbf{used} is zero;
671 this represents the mp\_int value of zero.
672 \end{enumerate}
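
These rules can be restated as a short validity check. The following helper is written for this
text only (it is not part of the LibTomMath API) and simply re-expresses the rules above in ISO C
using the mp\_int definition given above.

\begin{verbatim}
#include <stddef.h>

/* Hypothetical helper: returns 1 if "a" obeys the structural rules
 * listed above, 0 otherwise.                                        */
static int mp_int_is_valid(const mp_int *a)
{
   int i;

   /* rule 1: dp always points to at least one allocated digit */
   if (a->dp == NULL || a->alloc < 1) {
      return 0;
   }

   /* rule 2: 0 <= used <= alloc */
   if (a->used < 0 || a->used > a->alloc) {
      return 0;
   }

   /* rule 3: no leading zero digit, and digits at index "used"
    * and above must be zero                                     */
   if (a->used > 0 && a->dp[a->used - 1] == 0) {
      return 0;
   }
   for (i = a->used; i < a->alloc; i++) {
      if (a->dp[i] != 0) {
         return 0;
      }
   }

   /* rule 4: the value zero must carry a positive sign */
   if (a->used == 0 && a->sign != MP_ZPOS) {
      return 0;
   }

   return 1;
}
\end{verbatim}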
673
674 \section{Argument Passing}
675 A convention of argument passing must be adopted early on in the development of any library. Making the function
676 prototypes consistent will help eliminate many headaches in the future as the library grows to significant complexity.
677 In LibTomMath the multiple precision integer functions accept parameters from left to right as pointers to mp\_int
678 structures. That means that the source (input) operands are placed on the left and the destination (output) on the right.
679 Consider the following examples.
680
681 \begin{verbatim}
682 mp_mul(&a, &b, &c); /* c = a * b */
683 mp_add(&a, &b, &a); /* a = a + b */
684 mp_sqr(&a, &b); /* b = a * a */
685 \end{verbatim}
686
687 The left to right order is a fairly natural way to implement the functions since it lets the developer read aloud the
688 functions and make sense of them. For example, the first function would read ``multiply a and b and store in c''.
689
690 Certain libraries (\textit{LIP by Lenstra for instance}) accept parameters the other way around, to mimic the order
691 of assignment expressions. That is, the destination (output) is on the left and arguments (inputs) are on the right. In
692 truth, it is entirely a matter of preference. In the case of LibTomMath the convention from the MPI library has been
693 adopted.
694
695 Another very useful design consideration, provided for in LibTomMath, is whether to allow argument sources to also be a
696 destination. For example, the second example (\textit{mp\_add}) adds $a$ to $b$ and stores in $a$. This is an important
697 feature to implement since it allows the calling function to cut down on the number of variables it must maintain.
698 However, to implement this feature specific care has to be given to ensure the destination is not modified before the
699 source is fully read.
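
One way to provide this guarantee, sketched below purely as an illustration and not as the approach
LibTomMath itself takes in every routine, is to accumulate the result into a temporary mp\_int and
only replace the destination once the sources have been fully consumed. Here add\_into() is a
hypothetical placeholder for the actual digit-by-digit addition, while mp\_init(), mp\_clear() and
the error codes are described later in this chapter.

\begin{verbatim}
/* Illustrative only: make "c = a + b" safe even when c aliases a or b. */
int safe_add(mp_int *a, mp_int *b, mp_int *c)
{
   mp_int t;
   int    err;

   if ((err = mp_init(&t)) != MP_OKAY) {
      return err;
   }

   /* read only from a and b, write only to the temporary t */
   if ((err = add_into(a, b, &t)) != MP_OKAY) {
      mp_clear(&t);
      return err;
   }

   /* only now is the destination modified */
   mp_clear(c);
   *c = t;    /* c takes ownership of t's digits */

   return MP_OKAY;
}
\end{verbatim}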
700
701 \section{Return Values}
702 A well implemented application, no matter what its purpose, should trap as many runtime errors as possible and return them
703 to the caller. By catching runtime errors a library can be guaranteed to prevent undefined behaviour. However, the end
704 developer can still manage to cause a library to crash. For example, by passing an invalid pointer an application may
705 fault by dereferencing memory not owned by the application.
706
707 In the case of LibTomMath the only errors that are checked for are related to inappropriate inputs (division by zero for
708 instance) and memory allocation errors. It will not check that the mp\_int passed to any function is valid nor
709 will it check pointers for validity. Any function that can cause a runtime error will return an error code as an
710 \textbf{int} data type with one of the following values.
711
712 \index{MP\_OKAY} \index{MP\_VAL} \index{MP\_MEM}
713 \begin{center}
714 \begin{tabular}{|l|l|}
715 \hline \textbf{Value} & \textbf{Meaning} \\
716 \hline \textbf{MP\_OKAY} & The function was successful \\
717 \hline \textbf{MP\_VAL} & One of the input value(s) was invalid \\
718 \hline \textbf{MP\_MEM} & The function ran out of heap memory \\
719 \hline
720 \end{tabular}
721 \end{center}
722
723 When an error is detected within a function it should free any memory it allocated, often during the initialization of
724 temporary mp\_ints, and return as soon as possible. The goal is to leave the system in the same state it was when the
725 function was called. Error checking with this style of API is fairly simple.
726
727 \begin{verbatim}
728 int err;
729 if ((err = mp_add(&a, &b, &c)) != MP_OKAY) {
730     printf("Error: %s\n", mp_error_to_string(err));
731     exit(EXIT_FAILURE);
732 }
733 \end{verbatim}
734
735 The GMP \cite{GMP} library uses C style \textit{signals} to flag errors which is of questionable use. Not all errors are fatal
736 and it was not deemed ideal by the author of LibTomMath to force developers to have signal handlers for such cases.
737
738 \section{Initialization and Clearing}
739 The logical starting point when actually writing multiple precision integer functions is the initialization and
740 clearing of the mp\_int structures. These two algorithms will be used by the majority of the higher level algorithms.
741
742 Given the basic mp\_int structure an initialization routine must first allocate memory to hold the digits of
743 the integer. Often it is optimal to allocate a sufficiently large pre-set number of digits even though
744 the initial integer will represent zero. If only a single digit were allocated quite a few subsequent re-allocations
745 would occur when operations are performed on the integers. There is a tradeoff between how many default digits to allocate
746 and how many re-allocations are tolerable. Obviously allocating an excessive amount of digits initially will waste
747 memory and become unmanageable.
748
749 If the memory for the digits has been successfully allocated then the rest of the members of the structure must
750 be initialized. Since the initial state of an mp\_int is to represent the zero integer, the allocated digits must be set
751 to zero. The \textbf{used} count is set to zero and the \textbf{sign} is set to \textbf{MP\_ZPOS}.
752
753 \subsection{Initializing an mp\_int}
754 An mp\_int is said to be initialized if it is set to a valid, preferably default, state such that all of the members of the
755 structure are set to valid values. The mp\_init algorithm will perform such an action.
756
757 \begin{figure}[here]
758 \begin{center}
759 \begin{tabular}{l}
760 \hline Algorithm \textbf{mp\_init}. \\
761 \textbf{Input}. An mp\_int $a$ \\
762 \textbf{Output}. Allocate memory and initialize $a$ to a known valid mp\_int state. \\
763 \hline \\
764 1. Allocate memory for \textbf{MP\_PREC} digits. \\
765 2. If the allocation failed return(\textit{MP\_MEM}) \\
766 3. for $n$ from $0$ to $MP\_PREC - 1$ do \\
767 \hspace{3mm}3.1 $a_n \leftarrow 0$\\
768 4. $a.sign \leftarrow MP\_ZPOS$\\
769 5. $a.used \leftarrow 0$\\
770 6. $a.alloc \leftarrow MP\_PREC$\\
771 7. Return(\textit{MP\_OKAY})\\
772 \hline
773 \end{tabular}
774 \end{center}
775 \caption{Algorithm mp\_init}
776 \end{figure}
777
778 \textbf{Algorithm mp\_init.}
779 The \textbf{MP\_PREC} name represents a constant\footnote{Defined in the ``tommath.h'' header file within LibTomMath.}
780 used to dictate the minimum precision of allocated mp\_int integers. Ideally, it is at least equal to $32$ since for most
781 purposes that will be more than enough.
782
783 Memory for the default number of digits is allocated first. If the allocation fails the algorithm returns immediately
784 with the \textbf{MP\_MEM} error code. If the allocation succeeds the remaining members of the mp\_int structure
785 must be initialized to reflect the default initial state.
786
787 The allocated digits are all set to zero (step three) to ensure they are in a known state. The \textbf{sign}, \textbf{used}
788 and \textbf{alloc} are subsequently initialized to represent the zero integer. By step seven the algorithm returns a success
789 code and the mp\_int $a$ has been successfully initialized to a valid state representing the integer zero.
790
791 \textbf{Remark.}
792 This function introduces the idiosyncrasy that all iterative loops, commonly initiated with the ``for'' keyword, iterate incrementally
793 when the ``to'' keyword is placed between two expressions. For example, ``for $a$ from $b$ to $c$ do'' means that
794 a subsequent expression (or body of expressions) is to be evaluated $c - b + 1$ times so long as $b \le c$. In each
795 iteration the variable $a$ is substituted for a new integer that lies inclusively between $b$ and $c$. If $b > c$ occurred
796 the loop would not iterate. By contrast if the ``downto'' keyword were used in place of ``to'' the loop would iterate
797 decrementally.
798
799 \vspace{+3mm}\begin{small}
800 \hspace{-5.1mm}{\bf File}: bn\_mp\_init.c
801 \vspace{-3mm}
802 \begin{alltt}
803 016
804 017 /* init a new bigint */
805 018 int mp_init (mp_int * a)
806 019 \{
807 020 /* allocate memory required and clear it */
808 021 a->dp = OPT_CAST(mp_digit) XCALLOC (sizeof (mp_digit), MP_PREC);
809 022 if (a->dp == NULL) \{
810 023 return MP_MEM;
811 024 \}
812 025
813 026 /* set the used to zero, allocated digits to the default precision
814 027 * and sign to positive */
815 028 a->used = 0;
816 029 a->alloc = MP_PREC;
817 030 a->sign = MP_ZPOS;
818 031
819 032 return MP_OKAY;
820 033 \}
821 \end{alltt}
822 \end{small}
823
824 One immediate observation of this initialization function is that it does not return a pointer to an mp\_int structure. It
825 is assumed that the caller has already allocated memory for the mp\_int structure, typically on the application stack. The
826 call to mp\_init() is used only to initialize the members of the structure to a known default state.
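
A minimal usage sketch of this pattern might look as follows (the surrounding function is hypothetical and only
serves to illustrate the calling convention; mp\_clear, which releases the digits again, is presented in the next
section).

\begin{alltt}
#include <tommath.h>

int demo_init(void)
\{
   mp_int x;      /* the structure itself lives on the stack */
   int    res;

   /* only the digits are allocated from the heap */
   if ((res = mp_init(&x)) != MP_OKAY) \{
      return res;            /* MP_MEM if the allocation failed */
   \}

   /* ... x now represents zero and may be used ... */

   mp_clear(&x);             /* release the digits again */
   return MP_OKAY;
\}
\end{alltt}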
827
828 Before any of the other members of the structure are initialized memory from the application heap is allocated with
829 the calloc() function (line 21). The size of the allocated memory is large enough to hold \textbf{MP\_PREC}
830 mp\_digit variables. The calloc() function is used instead\footnote{calloc() will allocate memory in the same
831 manner as malloc() except that it also sets the contents to zero upon successfully allocating the memory.} of malloc()
832 since digits have to be set to zero for the function to finish correctly. The \textbf{OPT\_CAST} token is a macro
833 definition which will turn into a cast from void * to mp\_digit * for C++ compilers. It is not required for C compilers.
834
835 After the memory has been successfully allocated the remainder of the members are initialized
836 (lines 28 through 30) to their respective default states. At this point the algorithm has succeeded and
837 a success code is returned to the calling function.
838
839 If this function returns \textbf{MP\_OKAY} it is safe to assume the mp\_int structure has been properly initialized and
840 is safe to use with other functions within the library.
841
842 \subsection{Clearing an mp\_int}
843 When an mp\_int is no longer required by the application, the memory that has been allocated for its digits must be
844 returned to the application's memory pool with the mp\_clear algorithm.
845
846 \begin{figure}[here]
847 \begin{center}
848 \begin{tabular}{l}
849 \hline Algorithm \textbf{mp\_clear}. \\
850 \textbf{Input}. An mp\_int $a$ \\
851 \textbf{Output}. The memory for $a$ is freed for reuse. \\
852 \hline \\
853 1. If $a$ has been previously freed then return(\textit{MP\_OKAY}). \\
854 2. for $n$ from 0 to $a.used - 1$ do \\
855 \hspace{3mm}2.1 $a_n \leftarrow 0$ \\
856 3. Free the memory allocated for the digits of $a$. \\
857 4. $a.used \leftarrow 0$ \\
858 5. $a.alloc \leftarrow 0$ \\
859 6. $a.sign \leftarrow MP\_ZPOS$ \\
860 7. Return(\textit{MP\_OKAY}). \\
861 \hline
862 \end{tabular}
863 \end{center}
864 \caption{Algorithm mp\_clear}
865 \end{figure}
866
867 \textbf{Algorithm mp\_clear.}
868 This algorithm releases the memory allocated for an mp\_int back into the memory pool for reuse. It is designed
869 such that a given mp\_int structure can be cleared multiple times between initializations without attempting to
870 free the memory twice\footnote{In ISO C for example, calling free() twice on the same memory block causes undefined
871 behaviour.}.
872
873 The first step determines if the mp\_int structure has been marked as free already. If it has, the algorithm returns
874 success immediately as no further actions are required. Otherwise, the algorithm will proceed to put the structure
875 in a known empty and otherwise invalid state. First the digits of the mp\_int are set to zero. The memory that has been allocated for the
876 digits is then freed. The \textbf{used} and \textbf{alloc} counts are both set to zero and the \textbf{sign} set to
877 \textbf{MP\_ZPOS}. This known fixed state for cleared mp\_int structures will make debugging easier for the end
878 developer. That is, if they spot (via their debugger) an mp\_int they are using that is in this state it will be
879 obvious that they erroneously and prematurely cleared the mp\_int structure.
880
881 Note that once an mp\_int has been cleared the mp\_int structure is no longer in a valid state for any other algorithm
882 with the exception of algorithms mp\_init, mp\_init\_copy, mp\_init\_size and mp\_clear.
883
884 \vspace{+3mm}\begin{small}
885 \hspace{-5.1mm}{\bf File}: bn\_mp\_clear.c
886 \vspace{-3mm}
887 \begin{alltt}
888 016
889 017 /* clear one (frees) */
890 018 void
891 019 mp_clear (mp_int * a)
892 020 \{
893 021 /* only do anything if a hasn't been freed previously */
894 022 if (a->dp != NULL) \{
895 023 /* first zero the digits */
896 024 memset (a->dp, 0, sizeof (mp_digit) * a->used);
897 025
898 026 /* free ram */
899 027 XFREE(a->dp);
900 028
901 029 /* reset members to make debugging easier */
902 030 a->dp = NULL;
903 031 a->alloc = a->used = 0;
904 032 a->sign = MP_ZPOS;
905 033 \}
906 034 \}
907 \end{alltt}
908 \end{small}
909
910 The ``if'' statement (line 22) prevents the heap from being corrupted if a user double-frees an
911 mp\_int. This is because once the memory is freed the pointer is set to \textbf{NULL} (line 30).
912
913 Without the check, code that accidentally calls mp\_clear twice for a given mp\_int structure would try to free the memory
914 allocated for the digits twice. This may cause some C libraries to signal a fault. By setting the pointer to
915 \textbf{NULL} it helps debug code that inadvertently frees the mp\_int while it is still needed, because subsequent attempts
916 to reference digits should fail immediately. The allocated digits are set to zero before being freed (line 24).
917 This is ideal for cryptographic situations where the integer that the mp\_int represents might need to be kept a secret.
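
As a short sketch (assuming an mp\_int $a$ that has already been initialized), the \textbf{NULL} check makes a
second call harmless.

\begin{alltt}
mp_clear(&a);    /* digits zeroed and freed, a.dp set to NULL */
mp_clear(&a);    /* harmless, the NULL check skips the free   */
\end{alltt}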
918
919 \section{Maintenance Algorithms}
920
921 The previous sections describe how to initialize and clear an mp\_int structure. To further support operations
922 that are to be performed on mp\_int structures (such as addition and multiplication) the dependent algorithms must be
923 able to augment the precision of an mp\_int and
924 initialize mp\_ints with differing initial conditions.
925
926 These algorithms complete the set of low level algorithms required to work with mp\_int structures in the higher level
927 algorithms such as addition, multiplication and modular exponentiation.
928
929 \subsection{Augmenting an mp\_int's Precision}
930 When storing a value in an mp\_int structure, a sufficient number of digits must be available to accommodate the entire
931 result of an operation without loss of precision. Quite often the size of the array given by the \textbf{alloc} member
932 is large enough to simply increase the \textbf{used} digit count. However, when the size of the array is too small it
933 must be re-sized appropriately to accommodate the result. The mp\_grow algorithm will provide this functionality.
934
935 \newpage\begin{figure}[here]
936 \begin{center}
937 \begin{tabular}{l}
938 \hline Algorithm \textbf{mp\_grow}. \\
939 \textbf{Input}. An mp\_int $a$ and an integer $b$. \\
940 \textbf{Output}. $a$ is expanded to accommodate $b$ digits. \\
941 \hline \\
942 1. if $a.alloc \ge b$ then return(\textit{MP\_OKAY}) \\
943 2. $u \leftarrow b\mbox{ (mod }MP\_PREC\mbox{)}$ \\
944 3. $v \leftarrow b + 2 \cdot MP\_PREC - u$ \\
945 4. Re-Allocate the array of digits $a$ to size $v$ \\
946 5. If the allocation failed then return(\textit{MP\_MEM}). \\
947 6. for $n$ from $a.alloc$ to $v - 1$ do \\
948 \hspace{+3mm}6.1 $a_n \leftarrow 0$ \\
949 7. $a.alloc \leftarrow v$ \\
950 8. Return(\textit{MP\_OKAY}) \\
951 \hline
952 \end{tabular}
953 \end{center}
954 \caption{Algorithm mp\_grow}
955 \end{figure}
956
957 \textbf{Algorithm mp\_grow.}
958 It is ideal to prevent re-allocations from being performed if they are not required (step one). This is useful to
959 prevent mp\_ints from growing excessively in code that erroneously calls mp\_grow.
960
961 The requested digit count is padded up to the next multiple of \textbf{MP\_PREC} plus an additional \textbf{MP\_PREC} (steps two and three).
962 This helps prevent many small reallocations that would each grow the mp\_int by only a few digits.
963
964 It is assumed that the reallocation (step four) leaves the lower $a.alloc$ digits of the mp\_int intact. This is much
965 akin to how the \textit{realloc} function from the standard C library works. Since the newly allocated digits are
966 assumed to contain undefined values they are initially set to zero.
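
As a worked example of the padding in steps two and three, suppose \textbf{MP\_PREC} were $32$ and $b = 40$ digits
were requested. Then $u = 40 \mbox{ (mod }32\mbox{)} = 8$ and $v = 40 + 2 \cdot 32 - 8 = 96$, that is, the next
multiple of $32$ above $40$ plus an additional $32$ digits.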
967
968 \vspace{+3mm}\begin{small}
969 \hspace{-5.1mm}{\bf File}: bn\_mp\_grow.c
970 \vspace{-3mm}
971 \begin{alltt}
972 016
973 017 /* grow as required */
974 018 int mp_grow (mp_int * a, int size)
975 019 \{
976 020 int i;
977 021 mp_digit *tmp;
978 022
979 023 /* if the alloc size is smaller alloc more ram */
980 024 if (a->alloc < size) \{
981 025 /* ensure there are always at least MP_PREC digits extra on top */
982 026 size += (MP_PREC * 2) - (size % MP_PREC);
983 027
984 028 /* reallocate the array a->dp
985 029 *
986 030 * We store the return in a temporary variable
987 031 * in case the operation failed we don't want
988 032 * to overwrite the dp member of a.
989 033 */
990 034 tmp = OPT_CAST(mp_digit) XREALLOC (a->dp, sizeof (mp_digit) * size);
991 035 if (tmp == NULL) \{
992 036 /* reallocation failed but "a" is still valid [can be freed] */
993 037 return MP_MEM;
994 038 \}
995 039
996 040 /* reallocation succeeded so set a->dp */
997 041 a->dp = tmp;
998 042
999 043 /* zero excess digits */
1000 044 i = a->alloc;
1001 045 a->alloc = size;
1002 046 for (; i < a->alloc; i++) \{
1003 047 a->dp[i] = 0;
1004 048 \}
1005 049 \}
1006 050 return MP_OKAY;
1007 051 \}
1008 \end{alltt}
1009 \end{small}
1010
1011 The first step is to see if we actually need to perform a re-allocation at all (line 24). If a reallocation
1012 must occur the digit count is padded upwards to help prevent many trivial reallocations (line 26). Next the reallocation is performed
1013 and the return of realloc() is stored in a temporary pointer named $tmp$ (line 34). The return is stored in a temporary
1014 instead of $a.dp$ to prevent the code from losing the original pointer in case the reallocation fails. Had the return been stored
1015 in $a.dp$ instead there would be no way to reclaim the heap originally used.
1016
1017 If the reallocation fails the function will return \textbf{MP\_MEM} (line 37), otherwise, the value of $tmp$ is assigned
1018 to the pointer $a.dp$ and the function continues. A simple for loop from line 46 to line 48 will zero all digits
1019 that were above the old \textbf{alloc} limit to make sure the integer is in a known state.
1020
1021 \subsection{Initializing Variable Precision mp\_ints}
1022 Occasionally the number of digits required will be known in advance of an initialization, based on, for example, the size
1023 of input mp\_ints to a given algorithm. The purpose of algorithm mp\_init\_size is similar to mp\_init except that it
1024 will allocate \textit{at least} a specified number of digits.
1025
1026 \begin{figure}[here]
1027 \begin{small}
1028 \begin{center}
1029 \begin{tabular}{l}
1030 \hline Algorithm \textbf{mp\_init\_size}. \\
1031 \textbf{Input}. An mp\_int $a$ and the requested number of digits $b$. \\
1032 \textbf{Output}. $a$ is initialized to hold at least $b$ digits. \\
1033 \hline \\
1034 1. $u \leftarrow b \mbox{ (mod }MP\_PREC\mbox{)}$ \\
1035 2. $v \leftarrow b + 2 \cdot MP\_PREC - u$ \\
1036 3. Allocate $v$ digits. \\
1037 4. for $n$ from $0$ to $v - 1$ do \\
1038 \hspace{3mm}4.1 $a_n \leftarrow 0$ \\
1039 5. $a.sign \leftarrow MP\_ZPOS$\\
1040 6. $a.used \leftarrow 0$\\
1041 7. $a.alloc \leftarrow v$\\
1042 8. Return(\textit{MP\_OKAY})\\
1043 \hline
1044 \end{tabular}
1045 \end{center}
1046 \end{small}
1047 \caption{Algorithm mp\_init\_size}
1048 \end{figure}
1049
1050 \textbf{Algorithm mp\_init\_size.}
1051 This algorithm will initialize an mp\_int structure $a$ like algorithm mp\_init with the exception that the number of
1052 digits allocated can be controlled by the second input argument $b$. The input size is padded upwards so it is a
1053 multiple of \textbf{MP\_PREC} plus an additional \textbf{MP\_PREC} digits. This padding is used to prevent trivial
1054 allocations from becoming a bottleneck in the rest of the algorithms.
1055
1056 Like algorithm mp\_init, the mp\_int structure is initialized to a default state representing the integer zero. This
1057 particular algorithm is useful if the approximate size of the input is known ahead of time. If the approximation is
1058 correct no further memory re-allocations are required to work with the mp\_int.
1059
1060 \vspace{+3mm}\begin{small}
1061 \hspace{-5.1mm}{\bf File}: bn\_mp\_init\_size.c
1062 \vspace{-3mm}
1063 \begin{alltt}
1064 016
1065 017 /* init an mp_init for a given size */
1066 018 int mp_init_size (mp_int * a, int size)
1067 019 \{
1068 020 /* pad size so there are always extra digits */
1069 021 size += (MP_PREC * 2) - (size % MP_PREC);
1070 022
1071 023 /* alloc mem */
1072 024 a->dp = OPT_CAST(mp_digit) XCALLOC (sizeof (mp_digit), size);
1073 025 if (a->dp == NULL) \{
1074 026 return MP_MEM;
1075 027 \}
1076 028 a->used = 0;
1077 029 a->alloc = size;
1078 030 a->sign = MP_ZPOS;
1079 031
1080 032 return MP_OKAY;
1081 033 \}
1082 \end{alltt}
1083 \end{small}
1084
1085 The number of digits $b$ requested is padded (line 21) by first augmenting it to the next multiple of
1086 \textbf{MP\_PREC} and then adding \textbf{MP\_PREC} to the result. If the memory can be successfully allocated the
1087 mp\_int is placed in a default state representing the integer zero. Otherwise, the error code \textbf{MP\_MEM} will be
1088 returned (line 26).
1089
1090 The digits are allocated and set to zero at the same time with the calloc() function (line 24). The
1091 \textbf{used} count is set to zero, the \textbf{alloc} count set to the padded digit count and the \textbf{sign} flag set
1092 to \textbf{MP\_ZPOS} to achieve a default valid mp\_int state (lines 28, 29 and 30). If the function
1093 returns successfully then it is correct to assume that the mp\_int structure is in a valid state for the remainder of the
1094 functions to work with.
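
For instance, a caller that knows ahead of time that roughly $100$ digits will be required (the figure and the
function name below are arbitrary assumptions for illustration) could avoid later calls to mp\_grow as follows.

\begin{alltt}
#include <tommath.h>

int demo_init_size(void)
\{
   mp_int a;
   int    res;

   /* reserve room for at least 100 digits up front */
   if ((res = mp_init_size(&a, 100)) != MP_OKAY) \{
      return res;
   \}

   /* ... work with a without triggering mp_grow ... */

   mp_clear(&a);
   return MP_OKAY;
\}
\end{alltt}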
1095
1096 \subsection{Multiple Integer Initializations and Clearings}
1097 Occasionally a function will require a series of mp\_int data types to be made available simultaneously.
1098 The purpose of algorithm mp\_init\_multi is to initialize a variable length array of mp\_int structures in a single
1099 statement. It is essentially a shortcut to multiple initializations.
1100
1101 \newpage\begin{figure}[here]
1102 \begin{center}
1103 \begin{tabular}{l}
1104 \hline Algorithm \textbf{mp\_init\_multi}. \\
1105 \textbf{Input}. Variable length array $V_k$ of mp\_int variables of length $k$. \\
1106 \textbf{Output}. The array is initialized such that each mp\_int of $V_k$ is ready to use. \\
1107 \hline \\
1108 1. for $n$ from 0 to $k - 1$ do \\
1109 \hspace{+3mm}1.1. Initialize the mp\_int $V_n$ (\textit{mp\_init}) \\
1110 \hspace{+3mm}1.2. If initialization failed then do \\
1111 \hspace{+6mm}1.2.1. for $j$ from $0$ to $n - 1$ do \\
1112 \hspace{+9mm}1.2.1.1. Free the mp\_int $V_j$ (\textit{mp\_clear}) \\
1113 \hspace{+6mm}1.2.2. Return(\textit{MP\_MEM}) \\
1114 2. Return(\textit{MP\_OKAY}) \\
1115 \hline
1116 \end{tabular}
1117 \end{center}
1118 \caption{Algorithm mp\_init\_multi}
1119 \end{figure}
1120
1121 \textbf{Algorithm mp\_init\_multi.}
1122 The algorithm will initialize the array of mp\_int variables one at a time. If a runtime error has been detected
1123 (\textit{step 1.2}) all of the previously initialized variables are cleared. The goal is an ``all or nothing''
1124 initialization which allows for quick recovery from runtime errors.
1125
1126 \vspace{+3mm}\begin{small}
1127 \hspace{-5.1mm}{\bf File}: bn\_mp\_init\_multi.c
1128 \vspace{-3mm}
1129 \begin{alltt}
1130 016 #include <stdarg.h>
1131 017
1132 018 int mp_init_multi(mp_int *mp, ...)
1133 019 \{
1134 020 mp_err res = MP_OKAY; /* Assume ok until proven otherwise */
1135 021 int n = 0; /* Number of ok inits */
1136 022 mp_int* cur_arg = mp;
1137 023 va_list args;
1138 024
1139 025 va_start(args, mp); /* init args to next argument from caller */
1140 026 while (cur_arg != NULL) \{
1141 027 if (mp_init(cur_arg) != MP_OKAY) \{
1142 028 /* Oops - error! Back-track and mp_clear what we already
1143 029 succeeded in init-ing, then return error.
1144 030 */
1145 031 va_list clean_args;
1146 032
1147 033 /* end the current list */
1148 034 va_end(args);
1149 035
1150 036 /* now start cleaning up */
1151 037 cur_arg = mp;
1152 038 va_start(clean_args, mp);
1153 039 while (n--) \{
1154 040 mp_clear(cur_arg);
1155 041 cur_arg = va_arg(clean_args, mp_int*);
1156 042 \}
1157 043 va_end(clean_args);
1158 044 res = MP_MEM;
1159 045 break;
1160 046 \}
1161 047 n++;
1162 048 cur_arg = va_arg(args, mp_int*);
1163 049 \}
1164 050 va_end(args);
1165 051 return res; /* Assumed ok, if error flagged above. */
1166 052 \}
1167 053
1168 \end{alltt}
1169 \end{small}
1170
1171 This function initializes a variable length list of mp\_int structure pointers. However, instead of having the mp\_int
1172 structures in an actual C array they are simply passed as arguments to the function. This function makes use of the
1173 ``...'' argument syntax of the C programming language. The list is terminated with a final \textbf{NULL} argument
1174 appended on the right.
1175
1176 The function uses the ``stdarg.h'' \textit{va} functions to step portably through the arguments to the function. A count
1177 $n$ of successfully initialized mp\_int structures is maintained (line 47) such that if a failure does occur,
1178 the algorithm can backtrack and free the previously initialized structures (lines 27 to 46).
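
The following sketch (the helper function name is hypothetical) shows the calling convention, in particular the
final \textbf{NULL} argument that terminates the list.

\begin{alltt}
#include <tommath.h>

int demo_multi(void)
\{
   mp_int a, b, c;
   int    res;

   /* the argument list must end with NULL */
   if ((res = mp_init_multi(&a, &b, &c, NULL)) != MP_OKAY) \{
      return res;     /* nothing is left initialized on failure */
   \}

   /* ... use a, b and c ... */

   mp_clear(&a);
   mp_clear(&b);
   mp_clear(&c);
   return MP_OKAY;
\}
\end{alltt}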
1179
1180
1181 \subsection{Clamping Excess Digits}
1182 When a function anticipates a result will be $n$ digits it is simpler to assume this is true within the body of
1183 the function instead of checking during the computation. For example, a multiplication of an $i$ digit number by a
1184 $j$ digit number produces a result of at most $i + j$ digits. It is entirely possible that the result is only $i + j - 1$
1185 digits, with no final carry into the last position. However, suppose the destination had to be first expanded
1186 (\textit{via mp\_grow}) to accommodate $i + j - 1$ digits and then further expanded to accommodate the final carry.
1187 That would be a considerable waste of time since heap operations are relatively slow.
1188
1189 The ideal solution is to always assume the result is $i + j$ and fix up the \textbf{used} count after the function
1190 terminates. This way a single heap operation (\textit{at most}) is required. However, if the result was not checked
1191 there would be an excess high order zero digit.
1192
1193 For example, suppose the product of two integers was $x = (0x_{n-1}x_{n-2}...x_0)_{\beta}$. The leading zero digit
1194 will not contribute to the precision of the result. In fact, through subsequent operations more leading zero digits would
1195 accumulate to the point that the size of the integer would be prohibitive. As a result even though the precision is very
1196 low the representation is excessively large.
1197
1198 The mp\_clamp algorithm is designed to solve this very problem. It will trim high-order zeros by decrementing the
1199 \textbf{used} count until a non-zero most significant digit is found. Also in this system, zero is considered to be a
1200 positive number which means that if the \textbf{used} count is decremented to zero, the sign must be set to
1201 \textbf{MP\_ZPOS}.
1202
1203 \begin{figure}[here]
1204 \begin{center}
1205 \begin{tabular}{l}
1206 \hline Algorithm \textbf{mp\_clamp}. \\
1207 \textbf{Input}. An mp\_int $a$ \\
1208 \textbf{Output}. Any excess leading zero digits of $a$ are removed \\
1209 \hline \\
1210 1. while $a.used > 0$ and $a_{a.used - 1} = 0$ do \\
1211 \hspace{+3mm}1.1 $a.used \leftarrow a.used - 1$ \\
1212 2. if $a.used = 0$ then do \\
1213 \hspace{+3mm}2.1 $a.sign \leftarrow MP\_ZPOS$ \\
1214 \hline \\
1215 \end{tabular}
1216 \end{center}
1217 \caption{Algorithm mp\_clamp}
1218 \end{figure}
1219
1220 \textbf{Algorithm mp\_clamp.}
1221 As can be expected this algorithm is very simple. The loop on step one is expected to iterate only once or twice at
1222 the most. For example, this will happen in cases where there is not a carry to fill the last position. Step two fixes the sign for
1223 when all of the digits are zero to ensure that the mp\_int is valid at all times.
1224
1225 \vspace{+3mm}\begin{small}
1226 \hspace{-5.1mm}{\bf File}: bn\_mp\_clamp.c
1227 \vspace{-3mm}
1228 \begin{alltt}
1229 016
1230 017 /* trim unused digits
1231 018 *
1232 019 * This is used to ensure that leading zero digits are
1233 020 * trimed and the leading "used" digit will be non-zero
1234 021 * Typically very fast. Also fixes the sign if there
1235 022 * are no more leading digits
1236 023 */
1237 024 void
1238 025 mp_clamp (mp_int * a)
1239 026 \{
1240 027 /* decrease used while the most significant digit is
1241 028 * zero.
1242 029 */
1243 030 while (a->used > 0 && a->dp[a->used - 1] == 0) \{
1244 031 --(a->used);
1245 032 \}
1246 033
1247 034 /* reset the sign flag if used == 0 */
1248 035 if (a->used == 0) \{
1249 036 a->sign = MP_ZPOS;
1250 037 \}
1251 038 \}
1252 \end{alltt}
1253 \end{small}
1254
1255 Note on line 30 how the test on the \textbf{used} count is made on the left of the \&\& operator. In the C programming
1256 language the terms of \&\& are evaluated left to right with a boolean short-circuit if any condition fails. This is
1257 important since if the \textbf{used} count is zero the test on the right would read before the start of the array. That is obviously
1258 undesirable. The parenthesis on line 31 is used to make sure the \textbf{used} count is decremented and not
1259 the pointer ``a''.
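
The following sketch (a contrived example in which the digits are filled in by hand) shows the fix-up pattern the
algorithm is meant to support: a pessimistic \textbf{used} count is assumed and mp\_clamp trims it afterwards.

\begin{alltt}
#include <tommath.h>

int demo_clamp(void)
\{
   mp_int a;
   int    res;

   if ((res = mp_init(&a)) != MP_OKAY) \{
      return res;
   \}

   /* build the value 5 with one redundant leading zero digit */
   a.dp[0] = 5;
   a.dp[1] = 0;
   a.used  = 2;        /* pessimistic digit count */

   mp_clamp(&a);       /* a.used is now 1         */

   mp_clear(&a);
   return MP_OKAY;
\}
\end{alltt}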
1260
1261 \section*{Exercises}
1262 \begin{tabular}{cl}
1263 $\left [ 1 \right ]$ & Discuss the relevance of the \textbf{used} member of the mp\_int structure. \\
1264 & \\
1265 $\left [ 1 \right ]$ & Discuss the consequences of not using padding when performing allocations. \\
1266 & \\
1267 $\left [ 2 \right ]$ & Estimate an ideal value for \textbf{MP\_PREC} when performing 1024-bit RSA \\
1268 & encryption when $\beta = 2^{28}$. \\
1269 & \\
1270 $\left [ 1 \right ]$ & Discuss the relevance of the algorithm mp\_clamp. What does it prevent? \\
1271 & \\
1272 $\left [ 1 \right ]$ & Give an example of when the algorithm mp\_init\_copy might be useful. \\
1273 & \\
1274 \end{tabular}
1275
1276
1277 %%%
1278 % CHAPTER FOUR
1279 %%%
1280
1281 \chapter{Basic Operations}
1282
1283 \section{Introduction}
1284 In the previous chapter a series of low level algorithms were established that dealt with initializing and maintaining
1285 mp\_int structures. This chapter will discuss another set of seemingly non-algebraic algorithms which will form the low
1286 level basis of the entire library. While these algorithms are relatively trivial it is important to understand how they
1287 work before proceeding since these algorithms will be used almost intrinsically in the following chapters.
1288
1289 The algorithms in this chapter deal primarily with more ``programmer'' related tasks such as creating copies of
1290 mp\_int structures, assigning small values to mp\_int structures and comparisons of the values mp\_int structures
1291 represent.
1292
1293 \section{Assigning Values to mp\_int Structures}
1294 \subsection{Copying an mp\_int}
1295 Assigning the value that a given mp\_int structure represents to another mp\_int structure shall be known as making
1296 a copy for the purposes of this text. The copy of the mp\_int will be a separate entity that represents the same
1297 value as the mp\_int it was copied from. The mp\_copy algorithm provides this functionality.
1298
1299 \newpage\begin{figure}[here]
1300 \begin{center}
1301 \begin{tabular}{l}
1302 \hline Algorithm \textbf{mp\_copy}. \\
1303 \textbf{Input}. An mp\_int $a$ and $b$. \\
1304 \textbf{Output}. Store a copy of $a$ in $b$. \\
1305 \hline \\
1306 1. If $b.alloc < a.used$ then grow $b$ to $a.used$ digits. (\textit{mp\_grow}) \\
1307 2. for $n$ from 0 to $a.used - 1$ do \\
1308 \hspace{3mm}2.1 $b_{n} \leftarrow a_{n}$ \\
1309 3. for $n$ from $a.used$ to $b.used - 1$ do \\
1310 \hspace{3mm}3.1 $b_{n} \leftarrow 0$ \\
1311 4. $b.used \leftarrow a.used$ \\
1312 5. $b.sign \leftarrow a.sign$ \\
1313 6. return(\textit{MP\_OKAY}) \\
1314 \hline
1315 \end{tabular}
1316 \end{center}
1317 \caption{Algorithm mp\_copy}
1318 \end{figure}
1319
1320 \textbf{Algorithm mp\_copy.}
1321 This algorithm copies the mp\_int $a$ such that upon successful termination of the algorithm the mp\_int $b$ will
1322 represent the same integer as the mp\_int $a$. The mp\_int $b$ shall be a complete and distinct copy of the
1323 mp\_int $a$ meaning that the mp\_int $a$ can be modified and it shall not affect the value of the mp\_int $b$.
1324
1325 If $b$ does not have enough room for the digits of $a$ it must first have its precision augmented via the mp\_grow
1326 algorithm. The digits of $a$ are copied over the digits of $b$ and any excess digits of $b$ are set to zero (step two
1327 and three). The \textbf{used} and \textbf{sign} members of $a$ are finally copied over the respective members of
1328 $b$.
1329
1330 \textbf{Remark.} This algorithm also introduces a new idiosyncrasy that will be used throughout the rest of the
1331 text. The error return codes of other algorithms are not explicitly checked in the pseudo-code presented. For example, in
1332 step one of the mp\_copy algorithm the return of mp\_grow is not explicitly checked to ensure it succeeded. Text space is
1333 limited so it is assumed that if an algorithm fails it will clear all temporarily allocated mp\_ints and return
1334 the error code itself. However, the C code presented will demonstrate all of the error handling logic required to
1335 implement the pseudo-code.
1336
1337 \vspace{+3mm}\begin{small}
1338 \hspace{-5.1mm}{\bf File}: bn\_mp\_copy.c
1339 \vspace{-3mm}
1340 \begin{alltt}
1341 016
1342 017 /* copy, b = a */
1343 018 int
1344 019 mp_copy (mp_int * a, mp_int * b)
1345 020 \{
1346 021 int res, n;
1347 022
1348 023 /* if dst == src do nothing */
1349 024 if (a == b) \{
1350 025 return MP_OKAY;
1351 026 \}
1352 027
1353 028 /* grow dest */
1354 029 if (b->alloc < a->used) \{
1355 030 if ((res = mp_grow (b, a->used)) != MP_OKAY) \{
1356 031 return res;
1357 032 \}
1358 033 \}
1359 034
1360 035 /* zero b and copy the parameters over */
1361 036 \{
1362 037 register mp_digit *tmpa, *tmpb;
1363 038
1364 039 /* pointer aliases */
1365 040
1366 041 /* source */
1367 042 tmpa = a->dp;
1368 043
1369 044 /* destination */
1370 045 tmpb = b->dp;
1371 046
1372 047 /* copy all the digits */
1373 048 for (n = 0; n < a->used; n++) \{
1374 049 *tmpb++ = *tmpa++;
1375 050 \}
1376 051
1377 052 /* clear high digits */
1378 053 for (; n < b->used; n++) \{
1379 054 *tmpb++ = 0;
1380 055 \}
1381 056 \}
1382 057
1383 058 /* copy used count and sign */
1384 059 b->used = a->used;
1385 060 b->sign = a->sign;
1386 061 return MP_OKAY;
1387 062 \}
1388 \end{alltt}
1389 \end{small}
1390
1391 Occasionally a dependent algorithm may copy an mp\_int effectively into itself such as when the input and output
1392 mp\_int structures passed to a function are one and the same. For this case it is optimal to return immediately without
1393 copying digits (line 24).
1394
1395 The mp\_int $b$ must have enough digits to accommodate the used digits of the mp\_int $a$. If $b.alloc$ is less than
1396 $a.used$ the algorithm mp\_grow is used to augment the precision of $b$ (lines 29 to 33). In order to
1397 simplify the inner loop that copies the digits from $a$ to $b$, two aliases $tmpa$ and $tmpb$ point directly at the digits
1398 of the mp\_ints $a$ and $b$ respectively. These aliases (lines 42 and 45) allow the compiler to access the digits without first dereferencing the
1399 mp\_int pointers and then subsequently the pointer to the digits.
1400
1401 After the aliases are established the digits from $a$ are copied into $b$ (lines 48 to 50) and then the excess
1402 digits of $b$ are set to zero (lines 53 to 55). Both ``for'' loops make use of the pointer aliases and in
1403 fact the alias for $b$ is carried through into the second ``for'' loop to clear the excess digits. This optimization
1404 allows the alias to stay in a machine register fairly easily between the two loops.
1405
1406 \textbf{Remarks.} The use of pointer aliases is an implementation methodology first introduced in this function that will
1407 be used considerably in other functions. Technically, a pointer alias is simply a shorthand used to lower the
1408 number of pointer dereferencing operations required to access data. For example, a for loop may resemble
1409
1410 \begin{alltt}
1411 for (x = 0; x < 100; x++) \{
1412 a->num[4]->dp[x] = 0;
1413 \}
1414 \end{alltt}
1415
1416 This could be re-written using aliases as
1417
1418 \begin{alltt}
1419 mp_digit *tmpa;
1420 tmpa = a->num[4]->dp;
1421 for (x = 0; x < 100; x++) \{
1422    *tmpa++ = 0;
1423 \}
1424 \end{alltt}
1425
1426 In this case an alias is used to access the
1427 array of digits within an mp\_int structure directly. It may seem that a pointer alias is not strictly required
1428 as a compiler may optimize out the redundant pointer operations. However, there are two dominant reasons to use aliases.
1429
1430 The first reason is that most compilers will not effectively optimize pointer arithmetic. For example, some optimizations
1431 may work for the Microsoft Visual C++ compiler (MSVC) and not for the GNU C Compiler (GCC). Also some optimizations may
1432 work for GCC and not MSVC. As such it is ideal to find a common ground for as many compilers as possible. Pointer
1433 aliases optimize the code considerably before the compiler even reads the source code which means the end compiled code
1434 stands a better chance of being faster.
1435
1436 The second reason is that pointer aliases often can make an algorithm simpler to read. Consider the first ``for''
1437 loop of the function mp\_copy() re-written to not use pointer aliases.
1438
1439 \begin{alltt}
1440 /* copy all the digits */
1441 for (n = 0; n < a->used; n++) \{
1442 b->dp[n] = a->dp[n];
1443 \}
1444 \end{alltt}
1445
1446 Whether this code is harder to read depends strongly on the individual. However, it is quantifiably slightly more
1447 complicated as there are four variables within the statement instead of just two.
1448
1449 \subsubsection{Nested Statements}
1450 Another commonly used technique in the source routines is that certain sections of code are nested. This is used in
1451 particular with the pointer aliases to highlight code phases. For example, a Comba multiplier (discussed in chapter six)
1452 will typically have three different phases. First the temporaries are initialized, then the columns calculated and
1453 finally the carries are propagated. In this example the middle column production phase will typically be nested as it
1454 uses temporary variables and aliases the most.
1455
1456 The nesting also simplifies the source code as variables that are nested are only valid for their scope. As a result
1457 the various temporary variables required do not propagate into other sections of code.
1458
1459
1460 \subsection{Creating a Clone}
1461 Another common operation is to make a local temporary copy of an mp\_int argument. To initialize an mp\_int
1462 and then copy another existing mp\_int into the newly initialized mp\_int will be known as creating a clone. This is
1463 useful within functions that need to modify an argument but do not wish to actually modify the original copy. The
1464 mp\_init\_copy algorithm has been designed to help perform this task.
1465
1466 \begin{figure}[here]
1467 \begin{center}
1468 \begin{tabular}{l}
1469 \hline Algorithm \textbf{mp\_init\_copy}. \\
1470 \textbf{Input}. An mp\_int $a$ and $b$\\
1471 \textbf{Output}. $a$ is initialized to be a copy of $b$. \\
1472 \hline \\
1473 1. Init $a$. (\textit{mp\_init}) \\
1474 2. Copy $b$ to $a$. (\textit{mp\_copy}) \\
1475 3. Return the status of the copy operation. \\
1476 \hline
1477 \end{tabular}
1478 \end{center}
1479 \caption{Algorithm mp\_init\_copy}
1480 \end{figure}
1481
1482 \textbf{Algorithm mp\_init\_copy.}
1483 This algorithm will initialize an mp\_int variable and copy another previously initialized mp\_int variable into it. As
1484 such this algorithm will perform two operations in one step.
1485
1486 \vspace{+3mm}\begin{small}
1487 \hspace{-5.1mm}{\bf File}: bn\_mp\_init\_copy.c
1488 \vspace{-3mm}
1489 \begin{alltt}
1490 016
1491 017 /* creates "a" then copies b into it */
1492 018 int mp_init_copy (mp_int * a, mp_int * b)
1493 019 \{
1494 020 int res;
1495 021
1496 022 if ((res = mp_init (a)) != MP_OKAY) \{
1497 023 return res;
1498 024 \}
1499 025 return mp_copy (b, a);
1500 026 \}
1501 \end{alltt}
1502 \end{small}
1503
1504 This will initialize \textbf{a} and make it a verbatim copy of the contents of \textbf{b}. Note that
1505 \textbf{a} will have its own memory allocated which means that \textbf{b} may be cleared after the call
1506 and \textbf{a} will be left intact.
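
A typical use is a hypothetical helper that needs a scratch copy of its argument which it may modify freely.

\begin{alltt}
#include <tommath.h>

int demo_clone(mp_int *a)
\{
   mp_int t;
   int    res;

   /* t becomes an independent copy of a */
   if ((res = mp_init_copy(&t, a)) != MP_OKAY) \{
      return res;
   \}

   /* ... modify t without affecting a ... */

   mp_clear(&t);
   return MP_OKAY;
\}
\end{alltt}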
1507
1508 \section{Zeroing an Integer}
1509 Resetting an mp\_int to the default state is a common step in many algorithms. The mp\_zero algorithm will be used to
1510 perform this task.
1511
1512 \begin{figure}[here]
1513 \begin{center}
1514 \begin{tabular}{l}
1515 \hline Algorithm \textbf{mp\_zero}. \\
1516 \textbf{Input}. An mp\_int $a$ \\
1517 \textbf{Output}. Zero the contents of $a$ \\
1518 \hline \\
1519 1. $a.used \leftarrow 0$ \\
1520 2. $a.sign \leftarrow$ MP\_ZPOS \\
1521 3. for $n$ from 0 to $a.alloc - 1$ do \\
1522 \hspace{3mm}3.1 $a_n \leftarrow 0$ \\
1523 \hline
1524 \end{tabular}
1525 \end{center}
1526 \caption{Algorithm mp\_zero}
1527 \end{figure}
1528
1529 \textbf{Algorithm mp\_zero.}
1530 This algorithm simply resets an mp\_int to the default state.
1531
1532 \vspace{+3mm}\begin{small}
1533 \hspace{-5.1mm}{\bf File}: bn\_mp\_zero.c
1534 \vspace{-3mm}
1535 \begin{alltt}
1536 016
1537 017 /* set to zero */
1538 018 void
1539 019 mp_zero (mp_int * a)
1540 020 \{
1541 021 a->sign = MP_ZPOS;
1542 022 a->used = 0;
1543 023 memset (a->dp, 0, sizeof (mp_digit) * a->alloc);
1544 024 \}
1545 \end{alltt}
1546 \end{small}
1547
1548 After the function is completed, all of the digits are zeroed, the \textbf{used} count is zeroed and the
1549 \textbf{sign} variable is set to \textbf{MP\_ZPOS}.
1550
1551 \section{Sign Manipulation}
1552 \subsection{Absolute Value}
1553 With the mp\_int representation of an integer, calculating the absolute value is trivial. The mp\_abs algorithm will compute
1554 the absolute value of an mp\_int.
1555
1556 \newpage\begin{figure}[here]
1557 \begin{center}
1558 \begin{tabular}{l}
1559 \hline Algorithm \textbf{mp\_abs}. \\
1560 \textbf{Input}. An mp\_int $a$ \\
1561 \textbf{Output}. Computes $b = \vert a \vert$ \\
1562 \hline \\
1563 1. Copy $a$ to $b$. (\textit{mp\_copy}) \\
1564 2. If the copy failed return(\textit{MP\_MEM}). \\
1565 3. $b.sign \leftarrow MP\_ZPOS$ \\
1566 4. Return(\textit{MP\_OKAY}) \\
1567 \hline
1568 \end{tabular}
1569 \end{center}
1570 \caption{Algorithm mp\_abs}
1571 \end{figure}
1572
1573 \textbf{Algorithm mp\_abs.}
1574 This algorithm computes the absolute value of an mp\_int input. First it copies $a$ over $b$. This is an example of an
1575 algorithm where the check in mp\_copy that determines if the source and destination are equal proves useful. This allows,
1576 for instance, the developer to pass the same mp\_int as the source and destination to this function without additional
1577 logic to handle it.
1578
1579 \vspace{+3mm}\begin{small}
1580 \hspace{-5.1mm}{\bf File}: bn\_mp\_abs.c
1581 \vspace{-3mm}
1582 \begin{alltt}
1583 016
1584 017 /* b = |a|
1585 018 *
1586 019 * Simple function copies the input and fixes the sign to positive
1587 020 */
1588 021 int
1589 022 mp_abs (mp_int * a, mp_int * b)
1590 023 \{
1591 024 int res;
1592 025
1593 026 /* copy a to b */
1594 027 if (a != b) \{
1595 028 if ((res = mp_copy (a, b)) != MP_OKAY) \{
1596 029 return res;
1597 030 \}
1598 031 \}
1599 032
1600 033 /* force the sign of b to positive */
1601 034 b->sign = MP_ZPOS;
1602 035
1603 036 return MP_OKAY;
1604 037 \}
1605 \end{alltt}
1606 \end{small}
1607
1608 \subsection{Integer Negation}
1609 With the mp\_int representation of an integer, calculating the negation is also trivial. The mp\_neg algorithm will compute
1610 the negative of an mp\_int input.
1611
1612 \begin{figure}[here]
1613 \begin{center}
1614 \begin{tabular}{l}
1615 \hline Algorithm \textbf{mp\_neg}. \\
1616 \textbf{Input}. An mp\_int $a$ \\
1617 \textbf{Output}. Computes $b = -a$ \\
1618 \hline \\
1619 1. Copy $a$ to $b$. (\textit{mp\_copy}) \\
1620 2. If the copy failed return(\textit{MP\_MEM}). \\
1621 3. If $a.used = 0$ then return(\textit{MP\_OKAY}). \\
1622 4. If $a.sign = MP\_ZPOS$ then do \\
1623 \hspace{3mm}4.1 $b.sign = MP\_NEG$. \\
1624 5. else do \\
1625 \hspace{3mm}5.1 $b.sign = MP\_ZPOS$. \\
1626 6. Return(\textit{MP\_OKAY}) \\
1627 \hline
1628 \end{tabular}
1629 \end{center}
1630 \caption{Algorithm mp\_neg}
1631 \end{figure}
1632
1633 \textbf{Algorithm mp\_neg.}
1634 This algorithm computes the negation of an input. First it copies $a$ over $b$. If $a$ has no used digits then
1635 the algorithm returns immediately. Otherwise it flips the sign flag and stores the result in $b$. Note that if
1636 $a$ had no digits then it must be positive by definition. Had step three been omitted then the algorithm would return
1637 zero as negative.
1638
1639 \vspace{+3mm}\begin{small}
1640 \hspace{-5.1mm}{\bf File}: bn\_mp\_neg.c
1641 \vspace{-3mm}
1642 \begin{alltt}
1643 016
1644 017 /* b = -a */
1645 018 int mp_neg (mp_int * a, mp_int * b)
1646 019 \{
1647 020 int res;
1648 021 if ((res = mp_copy (a, b)) != MP_OKAY) \{
1649 022 return res;
1650 023 \}
1651 024 if (mp_iszero(b) != MP_YES) \{
1652 025 b->sign = (a->sign == MP_ZPOS) ? MP_NEG : MP_ZPOS;
1653 026 \}
1654 027 return MP_OKAY;
1655 028 \}
1656 \end{alltt}
1657 \end{small}
1658
1659 \section{Small Constants}
1660 \subsection{Setting Small Constants}
1661 Often an mp\_int must be set to a relatively small value such as $1$ or $2$. For these cases the mp\_set algorithm is useful.
1662
1663 \begin{figure}[here]
1664 \begin{center}
1665 \begin{tabular}{l}
1666 \hline Algorithm \textbf{mp\_set}. \\
1667 \textbf{Input}. An mp\_int $a$ and a digit $b$ \\
1668 \textbf{Output}. Make $a$ equivalent to $b$ \\
1669 \hline \\
1670 1. Zero $a$ (\textit{mp\_zero}). \\
1671 2. $a_0 \leftarrow b \mbox{ (mod }\beta\mbox{)}$ \\
1672 3. $a.used \leftarrow \left \lbrace \begin{array}{ll}
1673 1 & \mbox{if }a_0 > 0 \\
1674 0 & \mbox{if }a_0 = 0
1675 \end{array} \right .$ \\
1676 \hline
1677 \end{tabular}
1678 \end{center}
1679 \caption{Algorithm mp\_set}
1680 \end{figure}
1681
1682 \textbf{Algorithm mp\_set.}
1683 This algorithm sets an mp\_int to a small single digit value. Step one ensures that the integer is reset to the default state. The
1684 single digit is set (\textit{modulo $\beta$}) and the \textbf{used} count is adjusted accordingly.
1685
1686 \vspace{+3mm}\begin{small}
1687 \hspace{-5.1mm}{\bf File}: bn\_mp\_set.c
1688 \vspace{-3mm}
1689 \begin{alltt}
1690 016
1691 017 /* set to a digit */
1692 018 void mp_set (mp_int * a, mp_digit b)
1693 019 \{
1694 020 mp_zero (a);
1695 021 a->dp[0] = b & MP_MASK;
1696 022 a->used = (a->dp[0] != 0) ? 1 : 0;
1697 023 \}
1698 \end{alltt}
1699 \end{small}
1700
1701 Line 20 calls mp\_zero() to clear the mp\_int and reset the sign. Line 21 copies the digit
1702 into the least significant location. Note the usage of a new constant \textbf{MP\_MASK}. This constant is used to quickly
1703 reduce an integer modulo $\beta$. Since $\beta$ is of the form $2^k$ for any suitable $k$ it suffices to perform a binary AND with
1704 $MP\_MASK = 2^k - 1$ to perform the reduction. Finally line 22 will set the \textbf{used} member with respect to the
1705 digit actually set. This function will always make the integer positive.
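
For example, if $\beta = 2^{28}$ (one possible configuration, used here only for illustration) then
$MP\_MASK = 2^{28} - 1$ and the bitwise AND of any value with $MP\_MASK$ keeps exactly the lower $28$ bits,
which is precisely a reduction modulo $2^{28}$.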
1706
1707 One important limitation of this function is that it will only set one digit. The size of a digit is not fixed, meaning source code that uses
1708 this function should take that into account. Only trivially small constants can be set using this function.
1709
1710 \subsection{Setting Large Constants}
1711 To overcome the limitations of the mp\_set algorithm the mp\_set\_int algorithm is ideal. It accepts a ``long''
1712 data type as input and will always treat it as a 32-bit integer.
1713
1714 \begin{figure}[here]
1715 \begin{center}
1716 \begin{tabular}{l}
1717 \hline Algorithm \textbf{mp\_set\_int}. \\
1718 \textbf{Input}. An mp\_int $a$ and a ``long'' integer $b$ \\
1719 \textbf{Output}. Make $a$ equivalent to $b$ \\
1720 \hline \\
1721 1. Zero $a$ (\textit{mp\_zero}) \\
1722 2. for $n$ from 0 to 7 do \\
1723 \hspace{3mm}2.1 $a \leftarrow a \cdot 16$ (\textit{mp\_mul\_2d}) \\
1724 \hspace{3mm}2.2 $u \leftarrow \lfloor b / 2^{4(7 - n)} \rfloor \mbox{ (mod }16\mbox{)}$\\
1725 \hspace{3mm}2.3 $a_0 \leftarrow a_0 + u$ \\
1726 \hspace{3mm}2.4 $a.used \leftarrow a.used + 1$ \\
1727 3. Clamp excess used digits (\textit{mp\_clamp}) \\
1728 \hline
1729 \end{tabular}
1730 \end{center}
1731 \caption{Algorithm mp\_set\_int}
1732 \end{figure}
1733
1734 \textbf{Algorithm mp\_set\_int.}
1735 The algorithm performs eight iterations of a simple loop where in each iteration four bits from the source are added to the
1736 mp\_int. Step 2.1 will multiply the current result by sixteen making room for four more bits in the less significant positions. In step 2.2 the
1737 next four bits from the source are extracted and are added to the mp\_int. The \textbf{used} digit count is
1738 incremented to reflect the addition, since if any of the leading digits were zero the mp\_int would have
1739 zero digits used and the newly added four bits would be ignored.
1740
1741 Excess zero digits are trimmed in steps 2.1 and 3 by using higher level algorithms mp\_mul\_2d and mp\_clamp.
1742
1743 \vspace{+3mm}\begin{small}
1744 \hspace{-5.1mm}{\bf File}: bn\_mp\_set\_int.c
1745 \vspace{-3mm}
1746 \begin{alltt}
1747 016
1748 017 /* set a 32-bit const */
1749 018 int mp_set_int (mp_int * a, unsigned long b)
1750 019 \{
1751 020 int x, res;
1752 021
1753 022 mp_zero (a);
1754 023
1755 024 /* set four bits at a time */
1756 025 for (x = 0; x < 8; x++) \{
1757 026 /* shift the number up four bits */
1758 027 if ((res = mp_mul_2d (a, 4, a)) != MP_OKAY) \{
1759 028 return res;
1760 029 \}
1761 030
1762 031 /* OR in the top four bits of the source */
1763 032 a->dp[0] |= (b >> 28) & 15;
1764 033
1765 034 /* shift the source up to the next four bits */
1766 035 b <<= 4;
1767 036
1768 037 /* ensure that digits are not clamped off */
1769 038 a->used += 1;
1770 039 \}
1771 040 mp_clamp (a);
1772 041 return MP_OKAY;
1773 042 \}
1774 \end{alltt}
1775 \end{small}
1776
1777 This function sets four bits of the number at a time to handle all practical \textbf{DIGIT\_BIT} sizes. The weird
1778 addition on line 38 ensures that the newly added bits are reflected in the \textbf{used} digit count. While it may not
1779 seem obvious why the digit counter does not grow exceedingly large, it is because of the shift on line 27
1780 as well as the call to mp\_clamp() on line 40. Both functions will clamp excess leading digits which keeps
1781 the number of used digits low.
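
A short sketch of both setters side by side follows (the function name and the constants are arbitrary and only
serve as illustration).

\begin{alltt}
#include <tommath.h>

int demo_set(void)
\{
   mp_int a;
   int    res;

   if ((res = mp_init(&a)) != MP_OKAY) \{
      return res;
   \}

   mp_set(&a, 7);                         /* a = 7 */

   /* a = 0x12345678, built four bits at a time */
   if ((res = mp_set_int(&a, 0x12345678UL)) != MP_OKAY) \{
      mp_clear(&a);
      return res;
   \}

   mp_clear(&a);
   return MP_OKAY;
\}
\end{alltt}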
1782
1783 \section{Comparisons}
1784 \subsection{Unsigned Comparisons}
1785 Comparing a multiple precision integer is performed with the exact same algorithm used to compare two decimal numbers. For example,
1786 to compare $1,234$ to $1,264$ the digits are extracted by their positions. That is we compare $1 \cdot 10^3 + 2 \cdot 10^2 + 3 \cdot 10^1 + 4 \cdot 10^0$
1787 to $1 \cdot 10^3 + 2 \cdot 10^2 + 6 \cdot 10^1 + 4 \cdot 10^0$ by comparing single digits at a time starting with the highest magnitude
1788 positions. If any leading digit of one integer is greater than a digit in the same position of another integer then obviously it must be greater.
1789
1790 The first comparison routine that will be developed is the unsigned magnitude compare which will perform a comparison based on the digits of two
1791 mp\_int variables alone. It will ignore the sign of the two inputs. Such a function is useful when an absolute comparison is required or if the
1792 signs are known to agree in advance.
1793
1794 To facilitate working with the results of the comparison functions three constants are required.
1795
1796 \begin{figure}[here]
1797 \begin{center}
1798 \begin{tabular}{|r|l|}
1799 \hline \textbf{Constant} & \textbf{Meaning} \\
1800 \hline \textbf{MP\_GT} & Greater Than \\
1801 \hline \textbf{MP\_EQ} & Equal To \\
1802 \hline \textbf{MP\_LT} & Less Than \\
1803 \hline
1804 \end{tabular}
1805 \end{center}
1806 \caption{Comparison Return Codes}
1807 \end{figure}
1808
1809 \begin{figure}[here]
1810 \begin{center}
1811 \begin{tabular}{l}
1812 \hline Algorithm \textbf{mp\_cmp\_mag}. \\
1813 \textbf{Input}. Two mp\_ints $a$ and $b$. \\
1814 \textbf{Output}. Unsigned comparison results ($a$ to the left of $b$). \\
1815 \hline \\
1816 1. If $a.used > b.used$ then return(\textit{MP\_GT}) \\
1817 2. If $a.used < b.used$ then return(\textit{MP\_LT}) \\
1818 3. for n from $a.used - 1$ to 0 do \\
1819 \hspace{+3mm}3.1 if $a_n > b_n$ then return(\textit{MP\_GT}) \\
1820 \hspace{+3mm}3.2 if $a_n < b_n$ then return(\textit{MP\_LT}) \\
1821 4. Return(\textit{MP\_EQ}) \\
1822 \hline
1823 \end{tabular}
1824 \end{center}
1825 \caption{Algorithm mp\_cmp\_mag}
1826 \end{figure}
1827
1828 \textbf{Algorithm mp\_cmp\_mag.}
1829 By saying ``$a$ to the left of $b$'' it is meant that the comparison is with respect to $a$, that is if $a$ is greater than $b$ it will return
1830 \textbf{MP\_GT} and similar with respect to when $a = b$ and $a < b$. The first two steps compare the number of digits used in both $a$ and $b$.
1831 Obviously if the digit counts differ there would be an imaginary zero digit in the smaller number where the leading digit of the larger number is.
1832 If both have the same number of digits then the actual digits themselves must be compared starting at the leading digit.
1833
1834 By step three both inputs must have the same number of digits so it's safe to start from either $a.used - 1$ or $b.used - 1$ and count down to
1835 the zero'th digit. If after all of the digits have been compared, no difference is found, the algorithm returns \textbf{MP\_EQ}.
1836
1837 \vspace{+3mm}\begin{small}
1838 \hspace{-5.1mm}{\bf File}: bn\_mp\_cmp\_mag.c
1839 \vspace{-3mm}
1840 \begin{alltt}
1841 016
1842 017 /* compare maginitude of two ints (unsigned) */
1843 018 int mp_cmp_mag (mp_int * a, mp_int * b)
1844 019 \{
1845 020 int n;
1846 021 mp_digit *tmpa, *tmpb;
1847 022
1848 023 /* compare based on # of non-zero digits */
1849 024 if (a->used > b->used) \{
1850 025 return MP_GT;
1851 026 \}
1852 027
1853 028 if (a->used < b->used) \{
1854 029 return MP_LT;
1855 030 \}
1856 031
1857 032 /* alias for a */
1858 033 tmpa = a->dp + (a->used - 1);
1859 034
1860 035 /* alias for b */
1861 036 tmpb = b->dp + (a->used - 1);
1862 037
1863 038 /* compare based on digits */
1864 039 for (n = 0; n < a->used; ++n, --tmpa, --tmpb) \{
1865 040 if (*tmpa > *tmpb) \{
1866 041 return MP_GT;
1867 042 \}
1868 043
1869 044 if (*tmpa < *tmpb) \{
1870 045 return MP_LT;
1871 046 \}
1872 047 \}
1873 048 return MP_EQ;
1874 049 \}
1875 \end{alltt}
1876 \end{small}
1877
1878 The two if statements on lines 24 and 28 compare the number of digits in the two inputs. These two are performed before all of the digits
1879 are compared since it is a very cheap test to perform and can potentially save considerable time. The implementation given is also not valid
1880 without those two statements. $b.alloc$ may be smaller than $a.used$, meaning that undefined values will be read from $b$ past the end of the
1881 array of digits.
1882
1883 \subsection{Signed Comparisons}
1884 Comparing with sign considerations is also fairly critical in several routines (\textit{division for example}). Based on an unsigned magnitude
1885 comparison a trivial signed comparison algorithm can be written.
1886
1887 \begin{figure}[here]
1888 \begin{center}
1889 \begin{tabular}{l}
1890 \hline Algorithm \textbf{mp\_cmp}. \\
1891 \textbf{Input}. Two mp\_ints $a$ and $b$ \\
1892 \textbf{Output}. Signed Comparison Results ($a$ to the left of $b$) \\
1893 \hline \\
1894 1. if $a.sign = MP\_NEG$ and $b.sign = MP\_ZPOS$ then return(\textit{MP\_LT}) \\
1895 2. if $a.sign = MP\_ZPOS$ and $b.sign = MP\_NEG$ then return(\textit{MP\_GT}) \\
1896 3. if $a.sign = MP\_NEG$ then \\
1897 \hspace{+3mm}3.1 Return the unsigned comparison of $b$ and $a$ (\textit{mp\_cmp\_mag}) \\
1898 4 Otherwise \\
1899 \hspace{+3mm}4.1 Return the unsigned comparison of $a$ and $b$ \\
1900 \hline
1901 \end{tabular}
1902 \end{center}
1903 \caption{Algorithm mp\_cmp}
1904 \end{figure}
1905
1906 \textbf{Algorithm mp\_cmp.}
1907 The first two steps compare the signs of the two inputs. If the signs do not agree then it can return right away with the appropriate
1908 comparison code. When the signs are equal the digits of the inputs must be compared to determine the correct result. In step
1909 three the unsigned comparison flips the order of the arguments since they are both negative. For instance, if $-a > -b$ then
1910 $\vert a \vert < \vert b \vert$. Step number four will compare the two when they are both positive.
1911
1912 \vspace{+3mm}\begin{small}
1913 \hspace{-5.1mm}{\bf File}: bn\_mp\_cmp.c
1914 \vspace{-3mm}
1915 \begin{alltt}
1916 016
1917 017 /* compare two ints (signed)*/
1918 018 int
1919 019 mp_cmp (mp_int * a, mp_int * b)
1920 020 \{
1921 021 /* compare based on sign */
1922 022 if (a->sign != b->sign) \{
1923 023 if (a->sign == MP_NEG) \{
1924 024 return MP_LT;
1925 025 \} else \{
1926 026 return MP_GT;
1927 027 \}
1928 028 \}
1929 029
1930 030 /* compare digits */
1931 031 if (a->sign == MP_NEG) \{
1932 032 /* if negative compare opposite direction */
1933 033 return mp_cmp_mag(b, a);
1934 034 \} else \{
1935 035 return mp_cmp_mag(a, b);
1936 036 \}
1937 037 \}
1938 \end{alltt}
1939 \end{small}
1940
1941 The two if statements on lines 22 and 23 perform the initial sign comparison. If the signs are not equal then whichever
1942 has the positive sign is larger. At line 31, the inputs are compared based on magnitudes. If the signs were both negative then
1943 the unsigned comparison is performed in the opposite direction (\textit{line 33}). Otherwise, the signs are assumed to
1944 be both positive and a forward direction unsigned comparison is performed.
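
The three constants map naturally onto the usual C comparison idiom, as in the following hypothetical helper.

\begin{alltt}
#include <tommath.h>

/* returns -1, 0 or 1 much like memcmp() style comparisons */
int demo_compare(mp_int *a, mp_int *b)
\{
   switch (mp_cmp(a, b)) \{
   case MP_LT:  return -1;     /* a less than b    */
   case MP_GT:  return  1;     /* a greater than b */
   default:     return  0;     /* MP_EQ            */
   \}
\}
\end{alltt}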
1945
1946 \section*{Exercises}
1947 \begin{tabular}{cl}
1948 $\left [ 2 \right ]$ & Modify algorithm mp\_set\_int to accept as input a variable length array of bits. \\
1949 & \\
1950 $\left [ 3 \right ]$ & Give the probability that algorithm mp\_cmp\_mag will have to compare $k$ digits \\
1951 & of two random numbers (of equal magnitude) before a difference is found. \\
1952 & \\
1953 $\left [ 1 \right ]$ & Suggest a simple method to speed up the implementation of mp\_cmp\_mag based \\
1954 & on the observations made in the previous problem. \\
1955 &
1956 \end{tabular}
1957
1958 \chapter{Basic Arithmetic}
1959 \section{Introduction}
1960 At this point algorithms for initialization, clearing, zeroing, copying, comparing and setting small constants have been
1961 established. The next logical set of algorithms to develop are addition, subtraction and digit shifting algorithms. These
1962 algorithms make use of the lower level algorithms and are the crucial building blocks for the multiplication algorithms. It is very important
1963 that these algorithms are highly optimized. On their own they are simple $O(n)$ algorithms but they can be called from higher level algorithms
1964 which easily places them at $O(n^2)$ or even $O(n^3)$ work levels.
1965
1966 All of the algorithms within this chapter make use of the logical bit shift operations denoted by $<<$ and $>>$ for left and right
1967 logical shifts respectively. A logical shift is analogous to sliding the decimal point of radix-10 representations. For example, the real
1968 number $0.9345$ is equivalent to $93.45\%$ which is found by sliding the decimal two places to the right (\textit{multiplying by $\beta^2 = 10^2$}).
1969 Algebraically a binary logical shift is equivalent to a division or multiplication by a power of two.
1970 For example, $a << k = a \cdot 2^k$ while $a >> k = \lfloor a/2^k \rfloor$.
1971
1972 One significant difference between a logical shift and the way decimals are shifted is that digits below the zero'th position are removed
1973 from the number. For example, consider $1101_2 >> 1$; using decimal notation this would produce $110.1_2$. However, with a logical shift the
1974 result is $110_2$.
1975
1976 \section{Addition and Subtraction}
1977 In common two's complement fixed precision arithmetic negative numbers are easily represented by subtraction from the modulus. For example, with 32-bit integers
1978 $a - b\mbox{ (mod }2^{32}\mbox{)}$ is the same as $a + (2^{32} - b) \mbox{ (mod }2^{32}\mbox{)}$ since $2^{32} \equiv 0 \mbox{ (mod }2^{32}\mbox{)}$.
1979 As a result subtraction can be performed with a trivial series of logical operations and an addition.
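
For instance, with an $8$-bit analogue of the same identity, $7 - 5 \equiv 7 + (2^{8} - 5) \equiv 258 \equiv 2
\mbox{ (mod }2^{8}\mbox{)}$, which is the expected difference.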
1980
1981 However, in multiple precision arithmetic negative numbers are not represented in the same way. Instead a sign flag is used to keep track of the
1982 sign of the integer. As a result signed addition and subtraction are actually implemented as conditional usage of lower level addition or
1983 subtraction algorithms with the sign fixed up appropriately.
1984
1985 The lower level algorithms will add or subtract integers without regard to the sign flag. That is they will add or subtract the magnitude of
1986 the integers respectively.
1987
1988 \subsection{Low Level Addition}
1989 An unsigned addition of multiple precision integers is performed with the same long-hand algorithm used to add decimal numbers. That is to add the
1990 trailing digits first and propagate the resulting carry upwards. Since this is a lower level algorithm the name will have a ``s\_'' prefix.
1991 Historically that convention stems from the MPI library where ``s\_'' stood for static functions that were hidden from the developer entirely.
1992
1993 \newpage
1994 \begin{figure}[!here]
1995 \begin{center}
1996 \begin{small}
1997 \begin{tabular}{l}
1998 \hline Algorithm \textbf{s\_mp\_add}. \\
1999 \textbf{Input}. Two mp\_ints $a$ and $b$ \\
2000 \textbf{Output}. The unsigned addition $c = \vert a \vert + \vert b \vert$. \\
2001 \hline \\
2002 1. if $a.used > b.used$ then \\
2003 \hspace{+3mm}1.1 $min \leftarrow b.used$ \\
2004 \hspace{+3mm}1.2 $max \leftarrow a.used$ \\
2005 \hspace{+3mm}1.3 $x \leftarrow a$ \\
2006 2. else \\
2007 \hspace{+3mm}2.1 $min \leftarrow a.used$ \\
2008 \hspace{+3mm}2.2 $max \leftarrow b.used$ \\
2009 \hspace{+3mm}2.3 $x \leftarrow b$ \\
2010 3. If $c.alloc < max + 1$ then grow $c$ to hold at least $max + 1$ digits (\textit{mp\_grow}) \\
2011 4. $oldused \leftarrow c.used$ \\
2012 5. $c.used \leftarrow max + 1$ \\
2013 6. $u \leftarrow 0$ \\
2014 7. for $n$ from $0$ to $min - 1$ do \\
2015 \hspace{+3mm}7.1 $c_n \leftarrow a_n + b_n + u$ \\
2016 \hspace{+3mm}7.2 $u \leftarrow c_n >> lg(\beta)$ \\
2017 \hspace{+3mm}7.3 $c_n \leftarrow c_n \mbox{ (mod }\beta\mbox{)}$ \\
2018 8. if $min \ne max$ then do \\
2019 \hspace{+3mm}8.1 for $n$ from $min$ to $max - 1$ do \\
2020 \hspace{+6mm}8.1.1 $c_n \leftarrow x_n + u$ \\
2021 \hspace{+6mm}8.1.2 $u \leftarrow c_n >> lg(\beta)$ \\
2022 \hspace{+6mm}8.1.3 $c_n \leftarrow c_n \mbox{ (mod }\beta\mbox{)}$ \\
2023 9. $c_{max} \leftarrow u$ \\
2024 10. if $oldused > max$ then \\
2025 \hspace{+3mm}10.1 for $n$ from $max + 1$ to $oldused - 1$ do \\
2026 \hspace{+6mm}10.1.1 $c_n \leftarrow 0$ \\
2027 11. Clamp excess digits in $c$. (\textit{mp\_clamp}) \\
2028 12. Return(\textit{MP\_OKAY}) \\
2029 \hline
2030 \end{tabular}
2031 \end{small}
2032 \end{center}
2033 \caption{Algorithm s\_mp\_add}
2034 \end{figure}
2035
2036 \textbf{Algorithm s\_mp\_add.}
2037 This algorithm is loosely based on algorithm 14.7 of HAC \cite[pp. 594]{HAC} but has been extended to allow the inputs to have a different number of digits.
2038 Coincidentally the description of algorithm A in Knuth \cite[pp. 266]{TAOCPV2} shares the same deficiency as the algorithm from \cite{HAC}. Even the
2039 MIX pseudo machine code presented by Knuth \cite[pp. 266-267]{TAOCPV2} is incapable of handling inputs which have a different number of digits.
2040
2041 The first thing that has to be accomplished is to sort out which of the two inputs is the larger. The addition logic
2042 will simply add all of the digits of the smaller input to those of the larger input and store that first part of the result in the
2043 destination. Then it will apply a simpler addition loop to the excess digits of the larger input.
2044
The first two steps will handle sorting the inputs such that $min$ and $max$ hold the digit counts of the two
inputs. The variable $x$ will be an mp\_int alias for the larger input, or the second input $b$ if they have the
same number of digits. After the inputs are sorted the destination $c$ is grown as required to accommodate the sum
of the two inputs. The original \textbf{used} count of $c$ is copied and set to the new used count.
2049
At this point the first addition loop will go through as many digit positions as both inputs have in common. The carry
variable $u$ is set to zero outside the loop. Inside the loop an ``addition'' step requires three statements to produce
one digit of the sum. First
the two digits from $a$ and $b$ are added together along with the carry $u$. The carry of this step is extracted and stored
in $u$ and finally the digit of the result $c_n$ is truncated within the range $0 \le c_n < \beta$.
2055
2056 Now all of the digit positions that both inputs have in common have been exhausted. If $min \ne max$ then $x$ is an alias
2057 for one of the inputs that has more digits. A simplified addition loop is then used to essentially copy the remaining digits
2058 and the carry to the destination.
2059
The final carry is stored in $c_{max}$ and the digits above $max$ up to $oldused$ are zeroed, which completes the addition.
2061
2062
2063 \vspace{+3mm}\begin{small}
2064 \hspace{-5.1mm}{\bf File}: bn\_s\_mp\_add.c
2065 \vspace{-3mm}
2066 \begin{alltt}
2067 016
2068 017 /* low level addition, based on HAC pp.594, Algorithm 14.7 */
2069 018 int
2070 019 s_mp_add (mp_int * a, mp_int * b, mp_int * c)
2071 020 \{
2072 021 mp_int *x;
2073 022 int olduse, res, min, max;
2074 023
2075 024 /* find sizes, we let |a| <= |b| which means we have to sort
2076 025 * them. "x" will point to the input with the most digits
2077 026 */
2078 027 if (a->used > b->used) \{
2079 028 min = b->used;
2080 029 max = a->used;
2081 030 x = a;
2082 031 \} else \{
2083 032 min = a->used;
2084 033 max = b->used;
2085 034 x = b;
2086 035 \}
2087 036
2088 037 /* init result */
2089 038 if (c->alloc < max + 1) \{
2090 039 if ((res = mp_grow (c, max + 1)) != MP_OKAY) \{
2091 040 return res;
2092 041 \}
2093 042 \}
2094 043
2095 044 /* get old used digit count and set new one */
2096 045 olduse = c->used;
2097 046 c->used = max + 1;
2098 047
2099 048 \{
2100 049 register mp_digit u, *tmpa, *tmpb, *tmpc;
2101 050 register int i;
2102 051
2103 052 /* alias for digit pointers */
2104 053
2105 054 /* first input */
2106 055 tmpa = a->dp;
2107 056
2108 057 /* second input */
2109 058 tmpb = b->dp;
2110 059
2111 060 /* destination */
2112 061 tmpc = c->dp;
2113 062
2114 063 /* zero the carry */
2115 064 u = 0;
2116 065 for (i = 0; i < min; i++) \{
2117 066 /* Compute the sum at one digit, T[i] = A[i] + B[i] + U */
2118 067 *tmpc = *tmpa++ + *tmpb++ + u;
2119 068
2120 069 /* U = carry bit of T[i] */
2121 070 u = *tmpc >> ((mp_digit)DIGIT_BIT);
2122 071
2123 072 /* take away carry bit from T[i] */
2124 073 *tmpc++ &= MP_MASK;
2125 074 \}
2126 075
2127 076 /* now copy higher words if any, that is in A+B
2128 077 * if A or B has more digits add those in
2129 078 */
2130 079 if (min != max) \{
2131 080 for (; i < max; i++) \{
2132 081 /* T[i] = X[i] + U */
2133 082 *tmpc = x->dp[i] + u;
2134 083
2135 084 /* U = carry bit of T[i] */
2136 085 u = *tmpc >> ((mp_digit)DIGIT_BIT);
2137 086
2138 087 /* take away carry bit from T[i] */
2139 088 *tmpc++ &= MP_MASK;
2140 089 \}
2141 090 \}
2142 091
2143 092 /* add carry */
2144 093 *tmpc++ = u;
2145 094
2146 095 /* clear digits above oldused */
2147 096 for (i = c->used; i < olduse; i++) \{
2148 097 *tmpc++ = 0;
2149 098 \}
2150 099 \}
2151 100
2152 101 mp_clamp (c);
2153 102 return MP_OKAY;
2154 103 \}
2155 \end{alltt}
2156 \end{small}
2157
Lines 27 to 35 perform the initial sorting of the inputs and determine the $min$ and $max$ variables. Note that $x$ is a pointer to an
mp\_int assigned to the larger input; in effect it is a local alias. Lines 37 to 42 ensure that the destination is grown to
accommodate the result of the addition.
2161
2162 Similar to the implementation of mp\_copy this function uses the braced code and local aliases coding style. The three aliases that are on
2163 lines 55, 58 and 61 represent the two inputs and destination variables respectively. These aliases are used to ensure the
2164 compiler does not have to dereference $a$, $b$ or $c$ (respectively) to access the digits of the respective mp\_int.
2165
The initial carry $u$ is cleared on line 64; note that $u$ is of type mp\_digit, which ensures type compatibility within the
implementation. The initial addition loop begins on line 65 and ends on line 74. Similarly the conditional addition loop
begins on line 80 and ends on line 90. The addition is finished with the final carry being stored in $tmpc$ on line 93.
2169 Note the ``++'' operator on the same line. After line 93 $tmpc$ will point to the $c.used$'th digit of the mp\_int $c$. This is useful
2170 for the next loop on lines 96 to 99 which set any old upper digits to zero.
2171
2172 \subsection{Low Level Subtraction}
The low level unsigned subtraction algorithm is very similar to the low level unsigned addition algorithm. The principal difference is that the
unsigned subtraction algorithm requires the result to be positive. That is, when computing $a - b$ the condition $\vert a \vert \ge \vert b\vert$ must
be met for this algorithm to function properly. Keep in mind this low level algorithm is not meant to be used in higher level algorithms directly.
This algorithm, as will be shown, can be used to create functional signed addition and subtraction algorithms.
2177
2178
For this algorithm a new variable is required to make the description simpler. Recall from section 1.3.1 that an mp\_digit must be able to represent
the range $0 \le x < 2\beta$ for the algorithms to work correctly. However, it is allowable that an mp\_digit represent a larger range of values. For
this algorithm we will assume that the variable $\gamma$ represents the number of bits available in an
mp\_digit (\textit{this implies $2^{\gamma} > \beta$}).
2183
For example, the default for LibTomMath is to use an ``unsigned long'' for the mp\_digit ``type'' while $\beta = 2^{28}$. In ISO C an ``unsigned long''
data type must be able to represent $0 \le x < 2^{32}$, meaning that in this case $\gamma = 32$.
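
To make the relationship between $\beta$, $lg(\beta)$ and $\gamma$ concrete, the short fragment below prints all three quantities using the same
constants that appear in the listings which follow (DIGIT\_BIT, MP\_MASK and the mp\_digit type). It assumes the default build described above and
is merely an illustration, not part of the library.

\begin{small}
\begin{alltt}
#include <stdio.h>
#include <limits.h>
#include <tommath.h>

int main(void)
\{
   /* lg(beta), the number of bits actually stored per digit */
   int lgbeta = (int)DIGIT_BIT;

   /* gamma, the full width of the mp_digit type in bits */
   int gamma  = (int)(CHAR_BIT * sizeof(mp_digit));

   /* MP_MASK is beta - 1, the largest value a single digit may hold */
   printf("lg(beta) = %d, gamma = %d, MP_MASK = 0x%lX\symbol{92}n",
          lgbeta, gamma, (unsigned long)MP_MASK);
   return 0;
\}
\end{alltt}
\end{small}

With the default 28-bit digit on a machine where ``unsigned long'' is 32 bits wide this prints $28$, $32$ and 0xFFFFFFF respectively.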
2186
2187 \newpage\begin{figure}[!here]
2188 \begin{center}
2189 \begin{small}
2190 \begin{tabular}{l}
2191 \hline Algorithm \textbf{s\_mp\_sub}. \\
2192 \textbf{Input}. Two mp\_ints $a$ and $b$ ($\vert a \vert \ge \vert b \vert$) \\
2193 \textbf{Output}. The unsigned subtraction $c = \vert a \vert - \vert b \vert$. \\
2194 \hline \\
2195 1. $min \leftarrow b.used$ \\
2196 2. $max \leftarrow a.used$ \\
2197 3. If $c.alloc < max$ then grow $c$ to hold at least $max$ digits. (\textit{mp\_grow}) \\
2198 4. $oldused \leftarrow c.used$ \\
2199 5. $c.used \leftarrow max$ \\
2200 6. $u \leftarrow 0$ \\
2201 7. for $n$ from $0$ to $min - 1$ do \\
2202 \hspace{3mm}7.1 $c_n \leftarrow a_n - b_n - u$ \\
2203 \hspace{3mm}7.2 $u \leftarrow c_n >> (\gamma - 1)$ \\
2204 \hspace{3mm}7.3 $c_n \leftarrow c_n \mbox{ (mod }\beta\mbox{)}$ \\
2205 8. if $min < max$ then do \\
2206 \hspace{3mm}8.1 for $n$ from $min$ to $max - 1$ do \\
2207 \hspace{6mm}8.1.1 $c_n \leftarrow a_n - u$ \\
2208 \hspace{6mm}8.1.2 $u \leftarrow c_n >> (\gamma - 1)$ \\
2209 \hspace{6mm}8.1.3 $c_n \leftarrow c_n \mbox{ (mod }\beta\mbox{)}$ \\
2210 9. if $oldused > max$ then do \\
2211 \hspace{3mm}9.1 for $n$ from $max$ to $oldused - 1$ do \\
2212 \hspace{6mm}9.1.1 $c_n \leftarrow 0$ \\
2213 10. Clamp excess digits of $c$. (\textit{mp\_clamp}). \\
2214 11. Return(\textit{MP\_OKAY}). \\
2215 \hline
2216 \end{tabular}
2217 \end{small}
2218 \end{center}
2219 \caption{Algorithm s\_mp\_sub}
2220 \end{figure}
2221
2222 \textbf{Algorithm s\_mp\_sub.}
2223 This algorithm performs the unsigned subtraction of two mp\_int variables under the restriction that the result must be positive. That is when
2224 passing variables $a$ and $b$ the condition that $\vert a \vert \ge \vert b \vert$ must be met for the algorithm to function correctly. This
2225 algorithm is loosely based on algorithm 14.9 \cite[pp. 595]{HAC} and is similar to algorithm S in \cite[pp. 267]{TAOCPV2} as well. As was the case
2226 of the algorithm s\_mp\_add both other references lack discussion concerning various practical details such as when the inputs differ in magnitude.
2227
The initial sorting of the inputs is trivial in this algorithm since $a$ is guaranteed to have at least the same magnitude as $b$. Steps 1 and 2
2229 set the $min$ and $max$ variables. Unlike the addition routine there is guaranteed to be no carry which means that the final result can be at
2230 most $max$ digits in length as opposed to $max + 1$. Similar to the addition algorithm the \textbf{used} count of $c$ is copied locally and
2231 set to the maximal count for the operation.
2232
2233 The subtraction loop that begins on step seven is essentially the same as the addition loop of algorithm s\_mp\_add except single precision
2234 subtraction is used instead. Note the use of the $\gamma$ variable to extract the carry (\textit{also known as the borrow}) within the subtraction
2235 loops. Under the assumption that two's complement single precision arithmetic is used this will successfully extract the desired carry.
2236
For example, consider subtracting $0101_2$ from $0100_2$ where $\gamma = 4$ and $\beta = 2$. The least significant bit will force a carry upwards to
the third bit which will be set to zero after the borrow. After the very first bit has been subtracted $4 - 1 \equiv 0011_2$ will remain. When the
third bit of $0101_2$ is subtracted from the result it will cause another carry. In this case, though, the carry will be forced to propagate all the
way to the most significant bit.
2241
2242 Recall that $\beta < 2^{\gamma}$. This means that if a carry does occur just before the $lg(\beta)$'th bit it will propagate all the way to the most
2243 significant bit. Thus, the high order bits of the mp\_digit that are not part of the actual digit will either be all zero, or all one. All that
2244 is needed is a single zero or one bit for the carry. Therefore a single logical shift right by $\gamma - 1$ positions is sufficient to extract the
2245 carry. This method of carry extraction may seem awkward but the reason for it becomes apparent when the implementation is discussed.
2246
If $b$ has a smaller magnitude than $a$ then step 8 will force the carry and copy operation to propagate through the larger input $a$ into $c$. Step
9 will ensure that any leading digits of $c$ above the $max$'th position are zeroed.
2249
2250 \vspace{+3mm}\begin{small}
2251 \hspace{-5.1mm}{\bf File}: bn\_s\_mp\_sub.c
2252 \vspace{-3mm}
2253 \begin{alltt}
2254 016
2255 017 /* low level subtraction (assumes |a| > |b|), HAC pp.595 Algorithm 14.9 */
2256 018 int
2257 019 s_mp_sub (mp_int * a, mp_int * b, mp_int * c)
2258 020 \{
2259 021 int olduse, res, min, max;
2260 022
2261 023 /* find sizes */
2262 024 min = b->used;
2263 025 max = a->used;
2264 026
2265 027 /* init result */
2266 028 if (c->alloc < max) \{
2267 029 if ((res = mp_grow (c, max)) != MP_OKAY) \{
2268 030 return res;
2269 031 \}
2270 032 \}
2271 033 olduse = c->used;
2272 034 c->used = max;
2273 035
2274 036 \{
2275 037 register mp_digit u, *tmpa, *tmpb, *tmpc;
2276 038 register int i;
2277 039
2278 040 /* alias for digit pointers */
2279 041 tmpa = a->dp;
2280 042 tmpb = b->dp;
2281 043 tmpc = c->dp;
2282 044
2283 045 /* set carry to zero */
2284 046 u = 0;
2285 047 for (i = 0; i < min; i++) \{
2286 048 /* T[i] = A[i] - B[i] - U */
2287 049 *tmpc = *tmpa++ - *tmpb++ - u;
2288 050
2289 051 /* U = carry bit of T[i]
2290 052 * Note this saves performing an AND operation since
2291 053 * if a carry does occur it will propagate all the way to the
2292 054 * MSB. As a result a single shift is enough to get the carry
2293 055 */
2294 056 u = *tmpc >> ((mp_digit)(CHAR_BIT * sizeof (mp_digit) - 1));
2295 057
2296 058 /* Clear carry from T[i] */
2297 059 *tmpc++ &= MP_MASK;
2298 060 \}
2299 061
2300 062 /* now copy higher words if any, e.g. if A has more digits than B */
2301 063 for (; i < max; i++) \{
2302 064 /* T[i] = A[i] - U */
2303 065 *tmpc = *tmpa++ - u;
2304 066
2305 067 /* U = carry bit of T[i] */
2306 068 u = *tmpc >> ((mp_digit)(CHAR_BIT * sizeof (mp_digit) - 1));
2307 069
2308 070 /* Clear carry from T[i] */
2309 071 *tmpc++ &= MP_MASK;
2310 072 \}
2311 073
2312 074 /* clear digits above used (since we may not have grown result above) */
2313
2314 075 for (i = c->used; i < olduse; i++) \{
2315 076 *tmpc++ = 0;
2316 077 \}
2317 078 \}
2318 079
2319 080 mp_clamp (c);
2320 081 return MP_OKAY;
2321 082 \}
2322 083
2323 \end{alltt}
2324 \end{small}
2325
Lines 24 and 25 perform the initial hardcoded sorting of the inputs. In reality the $min$ and $max$ variables are only aliases and are only
used to make the source code easier to read. Again the pointer alias optimization is used within this algorithm. Lines 41, 42 and 43 initialize the aliases for
$a$, $b$ and $c$ respectively.
2329
The first subtraction loop occurs on lines 46 through 60. The theory behind the subtraction loop is exactly the same as that for
the addition loop. As remarked earlier there is an implementation reason for using the ``awkward'' method of extracting the carry
(\textit{see line 56}). The traditional method for extracting the carry would be to shift by $lg(\beta)$ positions and logically AND
the least significant bit. The AND operation is required because all of the bits above the $lg(\beta)$'th bit will be set to one after a carry
occurs from subtraction. This method requires two relatively cheap operations to extract the carry. The other method is to simply
shift the most significant bit to the least significant bit, thus extracting the carry with a single cheap operation. This optimization only works on
two's complement machines, which is a safe assumption to make.
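
The following stand-alone fragment demonstrates the single-shift borrow extraction outside of the library. It mirrors the expression on line 56 but
operates on two fixed single precision values; it is only an illustration of the technique, not library code, and it assumes the default 28-bit digit
described earlier.

\begin{small}
\begin{alltt}
#include <stdio.h>
#include <limits.h>
#include <tommath.h>

int main(void)
\{
   mp_digit a = 5, b = 7, t, borrow;

   /* single precision subtraction, 5 - 7 underflows */
   t = a - b;

   /* the underflow sets every bit above lg(beta), so the most
    * significant bit of the mp_digit holds the borrow
    */
   borrow = t >> (mp_digit)(CHAR_BIT * sizeof(mp_digit) - 1);

   /* keep only the lg(beta) bits that form the result digit */
   t &= MP_MASK;

   /* prints borrow = 1 and the digit beta - 2 */
   printf("borrow = %lu, digit = 0x%lX\symbol{92}n",
          (unsigned long)borrow, (unsigned long)t);
   return 0;
\}
\end{alltt}
\end{small}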
2337
2338 If $a$ has a larger magnitude than $b$ an additional loop (\textit{see lines 63 through 72}) is required to propagate the carry through
2339 $a$ and copy the result to $c$.
2340
2341 \subsection{High Level Addition}
2342 Now that both lower level addition and subtraction algorithms have been established an effective high level signed addition algorithm can be
2343 established. This high level addition algorithm will be what other algorithms and developers will use to perform addition of mp\_int data
2344 types.
2345
2346 Recall from section 5.2 that an mp\_int represents an integer with an unsigned mantissa (\textit{the array of digits}) and a \textbf{sign}
2347 flag. A high level addition is actually performed as a series of eight separate cases which can be optimized down to three unique cases.
2348
2349 \begin{figure}[!here]
2350 \begin{center}
2351 \begin{tabular}{l}
2352 \hline Algorithm \textbf{mp\_add}. \\
2353 \textbf{Input}. Two mp\_ints $a$ and $b$ \\
2354 \textbf{Output}. The signed addition $c = a + b$. \\
2355 \hline \\
2356 1. if $a.sign = b.sign$ then do \\
2357 \hspace{3mm}1.1 $c.sign \leftarrow a.sign$ \\
2358 \hspace{3mm}1.2 $c \leftarrow \vert a \vert + \vert b \vert$ (\textit{s\_mp\_add})\\
2359 2. else do \\
2360 \hspace{3mm}2.1 if $\vert a \vert < \vert b \vert$ then do (\textit{mp\_cmp\_mag}) \\
2361 \hspace{6mm}2.1.1 $c.sign \leftarrow b.sign$ \\
2362 \hspace{6mm}2.1.2 $c \leftarrow \vert b \vert - \vert a \vert$ (\textit{s\_mp\_sub}) \\
2363 \hspace{3mm}2.2 else do \\
2364 \hspace{6mm}2.2.1 $c.sign \leftarrow a.sign$ \\
2365 \hspace{6mm}2.2.2 $c \leftarrow \vert a \vert - \vert b \vert$ \\
2366 3. Return(\textit{MP\_OKAY}). \\
2367 \hline
2368 \end{tabular}
2369 \end{center}
2370 \caption{Algorithm mp\_add}
2371 \end{figure}
2372
2373 \textbf{Algorithm mp\_add.}
2374 This algorithm performs the signed addition of two mp\_int variables. There is no reference algorithm to draw upon from
2375 either \cite{TAOCPV2} or \cite{HAC} since they both only provide unsigned operations. The algorithm is fairly
2376 straightforward but restricted since subtraction can only produce positive results.
2377
2378 \begin{figure}[here]
2379 \begin{small}
2380 \begin{center}
2381 \begin{tabular}{|c|c|c|c|c|}
2382 \hline \textbf{Sign of $a$} & \textbf{Sign of $b$} & \textbf{$\vert a \vert > \vert b \vert $} & \textbf{Unsigned Operation} & \textbf{Result Sign Flag} \\
2383 \hline $+$ & $+$ & Yes & $c = a + b$ & $a.sign$ \\
2384 \hline $+$ & $+$ & No & $c = a + b$ & $a.sign$ \\
2385 \hline $-$ & $-$ & Yes & $c = a + b$ & $a.sign$ \\
2386 \hline $-$ & $-$ & No & $c = a + b$ & $a.sign$ \\
2387 \hline &&&&\\
2388
2389 \hline $+$ & $-$ & No & $c = b - a$ & $b.sign$ \\
2390 \hline $-$ & $+$ & No & $c = b - a$ & $b.sign$ \\
2391
2392 \hline &&&&\\
2393
2394 \hline $+$ & $-$ & Yes & $c = a - b$ & $a.sign$ \\
2395 \hline $-$ & $+$ & Yes & $c = a - b$ & $a.sign$ \\
2396
2397 \hline
2398 \end{tabular}
2399 \end{center}
2400 \end{small}
2401 \caption{Addition Guide Chart}
2402 \label{fig:AddChart}
2403 \end{figure}
2404
Figure~\ref{fig:AddChart} lists all of the eight possible input combinations and is sorted to show that only three
specific cases need to be handled. The return codes of the unsigned operations at steps 1.2, 2.1.2 and 2.2.2 are
forwarded to step three to check for errors. This simplifies the description of the algorithm considerably and closely
follows how the implementation was actually written.
2409
2410 Also note how the \textbf{sign} is set before the unsigned addition or subtraction is performed. Recall from the descriptions of algorithms
2411 s\_mp\_add and s\_mp\_sub that the mp\_clamp function is used at the end to trim excess digits. The mp\_clamp algorithm will set the \textbf{sign}
2412 to \textbf{MP\_ZPOS} when the \textbf{used} digit count reaches zero.
2413
For example, consider performing $-a + a$ with algorithm mp\_add. By the description of the algorithm the sign would be set to \textbf{MP\_NEG}, which would
produce a result of $-0$. However, since the sign is set before the unsigned operation is performed, the subsequent use of algorithm mp\_clamp
within algorithm s\_mp\_sub will force $-0$ to become $0$.
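
The corner case can be exercised directly through the public interface. The following short program assumes the small constant assignment routine
mp\_set and the negation routine mp\_neg from the library, along with the mp\_iszero macro; error checking is omitted for the sake of brevity.

\begin{small}
\begin{alltt}
#include <stdio.h>
#include <tommath.h>

int main(void)
\{
   mp_int a, b, c;

   mp_init (&a); mp_init (&b); mp_init (&c);

   /* a = 5, b = -a */
   mp_set (&a, 5);
   mp_neg (&a, &b);

   /* -a + a: the sign of c is first set to MP_NEG but the
    * mp_clamp call inside the unsigned routine restores the
    * canonical zero with a positive sign
    */
   mp_add (&b, &a, &c);

   printf("zero: %s, sign positive: %s\symbol{92}n",
          mp_iszero (&c) ? "yes" : "no",
          (c.sign == MP_ZPOS) ? "yes" : "no");

   mp_clear (&a); mp_clear (&b); mp_clear (&c);
   return 0;
\}
\end{alltt}
\end{small}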
2417
2418 \vspace{+3mm}\begin{small}
2419 \hspace{-5.1mm}{\bf File}: bn\_mp\_add.c
2420 \vspace{-3mm}
2421 \begin{alltt}
2422 016
2423 017 /* high level addition (handles signs) */
2424 018 int mp_add (mp_int * a, mp_int * b, mp_int * c)
2425 019 \{
2426 020 int sa, sb, res;
2427 021
2428 022 /* get sign of both inputs */
2429 023 sa = a->sign;
2430 024 sb = b->sign;
2431 025
2432 026 /* handle two cases, not four */
2433 027 if (sa == sb) \{
2434 028 /* both positive or both negative */
2435 029 /* add their magnitudes, copy the sign */
2436 030 c->sign = sa;
2437 031 res = s_mp_add (a, b, c);
2438 032 \} else \{
2439 033 /* one positive, the other negative */
2440 034 /* subtract the one with the greater magnitude from */
2441 035 /* the one of the lesser magnitude. The result gets */
2442 036 /* the sign of the one with the greater magnitude. */
2443 037 if (mp_cmp_mag (a, b) == MP_LT) \{
2444 038 c->sign = sb;
2445 039 res = s_mp_sub (b, a, c);
2446 040 \} else \{
2447 041 c->sign = sa;
2448 042 res = s_mp_sub (a, b, c);
2449 043 \}
2450 044 \}
2451 045 return res;
2452 046 \}
2453 047
2454 \end{alltt}
2455 \end{small}
2456
The source code follows the algorithm fairly closely. The most notable new source code addition is the use of the $res$ integer variable which
is used to pass the result of the unsigned operations forward. Unlike in the algorithm, the variable $res$ is merely returned as is without
explicitly checking it and returning the constant \textbf{MP\_OKAY}. The observation is that this algorithm will succeed or fail only if the lower
level functions do, so returning their return code is sufficient.
2461
2462 \subsection{High Level Subtraction}
2463 The high level signed subtraction algorithm is essentially the same as the high level signed addition algorithm.
2464
2465 \newpage\begin{figure}[!here]
2466 \begin{center}
2467 \begin{tabular}{l}
2468 \hline Algorithm \textbf{mp\_sub}. \\
2469 \textbf{Input}. Two mp\_ints $a$ and $b$ \\
2470 \textbf{Output}. The signed subtraction $c = a - b$. \\
2471 \hline \\
2472 1. if $a.sign \ne b.sign$ then do \\
2473 \hspace{3mm}1.1 $c.sign \leftarrow a.sign$ \\
2474 \hspace{3mm}1.2 $c \leftarrow \vert a \vert + \vert b \vert$ (\textit{s\_mp\_add}) \\
2475 2. else do \\
2476 \hspace{3mm}2.1 if $\vert a \vert \ge \vert b \vert$ then do (\textit{mp\_cmp\_mag}) \\
2477 \hspace{6mm}2.1.1 $c.sign \leftarrow a.sign$ \\
2478 \hspace{6mm}2.1.2 $c \leftarrow \vert a \vert - \vert b \vert$ (\textit{s\_mp\_sub}) \\
2479 \hspace{3mm}2.2 else do \\
2480 \hspace{6mm}2.2.1 $c.sign \leftarrow \left \lbrace \begin{array}{ll}
2481 MP\_ZPOS & \mbox{if }a.sign = MP\_NEG \\
2482 MP\_NEG & \mbox{otherwise} \\
2483 \end{array} \right .$ \\
2484 \hspace{6mm}2.2.2 $c \leftarrow \vert b \vert - \vert a \vert$ \\
2485 3. Return(\textit{MP\_OKAY}). \\
2486 \hline
2487 \end{tabular}
2488 \end{center}
2489 \caption{Algorithm mp\_sub}
2490 \end{figure}
2491
2492 \textbf{Algorithm mp\_sub.}
2493 This algorithm performs the signed subtraction of two inputs. Similar to algorithm mp\_add there is no reference in either \cite{TAOCPV2} or
2494 \cite{HAC}. Also this algorithm is restricted by algorithm s\_mp\_sub. Chart \ref{fig:SubChart} lists the eight possible inputs and
2495 the operations required.
2496
2497 \begin{figure}[!here]
2498 \begin{small}
2499 \begin{center}
2500 \begin{tabular}{|c|c|c|c|c|}
2501 \hline \textbf{Sign of $a$} & \textbf{Sign of $b$} & \textbf{$\vert a \vert \ge \vert b \vert $} & \textbf{Unsigned Operation} & \textbf{Result Sign Flag} \\
2502 \hline $+$ & $-$ & Yes & $c = a + b$ & $a.sign$ \\
2503 \hline $+$ & $-$ & No & $c = a + b$ & $a.sign$ \\
2504 \hline $-$ & $+$ & Yes & $c = a + b$ & $a.sign$ \\
2505 \hline $-$ & $+$ & No & $c = a + b$ & $a.sign$ \\
2506 \hline &&&& \\
2507 \hline $+$ & $+$ & Yes & $c = a - b$ & $a.sign$ \\
2508 \hline $-$ & $-$ & Yes & $c = a - b$ & $a.sign$ \\
2509 \hline &&&& \\
2510 \hline $+$ & $+$ & No & $c = b - a$ & $\mbox{opposite of }a.sign$ \\
2511 \hline $-$ & $-$ & No & $c = b - a$ & $\mbox{opposite of }a.sign$ \\
2512 \hline
2513 \end{tabular}
2514 \end{center}
2515 \end{small}
2516 \caption{Subtraction Guide Chart}
2517 \label{fig:SubChart}
2518 \end{figure}
2519
Similar to the case of algorithm mp\_add the \textbf{sign} is set first before the unsigned addition or subtraction. That is to prevent the
algorithm from producing $-a - (-a) = -0$ as a result.
2522
2523 \vspace{+3mm}\begin{small}
2524 \hspace{-5.1mm}{\bf File}: bn\_mp\_sub.c
2525 \vspace{-3mm}
2526 \begin{alltt}
2527 016
2528 017 /* high level subtraction (handles signs) */
2529 018 int
2530 019 mp_sub (mp_int * a, mp_int * b, mp_int * c)
2531 020 \{
2532 021 int sa, sb, res;
2533 022
2534 023 sa = a->sign;
2535 024 sb = b->sign;
2536 025
2537 026 if (sa != sb) \{
2538 027 /* subtract a negative from a positive, OR */
2539 028 /* subtract a positive from a negative. */
2540 029 /* In either case, ADD their magnitudes, */
2541 030 /* and use the sign of the first number. */
2542 031 c->sign = sa;
2543 032 res = s_mp_add (a, b, c);
2544 033 \} else \{
2545 034 /* subtract a positive from a positive, OR */
2546 035 /* subtract a negative from a negative. */
2547 036 /* First, take the difference between their */
2548 037 /* magnitudes, then... */
2549 038 if (mp_cmp_mag (a, b) != MP_LT) \{
2550 039 /* Copy the sign from the first */
2551 040 c->sign = sa;
2552 041 /* The first has a larger or equal magnitude */
2553 042 res = s_mp_sub (a, b, c);
2554 043 \} else \{
2555 044 /* The result has the *opposite* sign from */
2556 045 /* the first number. */
2557 046 c->sign = (sa == MP_ZPOS) ? MP_NEG : MP_ZPOS;
2558 047 /* The second has a larger magnitude */
2559 048 res = s_mp_sub (b, a, c);
2560 049 \}
2561 050 \}
2562 051 return res;
2563 052 \}
2564 053
2565 \end{alltt}
2566 \end{small}
2567
2568 Much like the implementation of algorithm mp\_add the variable $res$ is used to catch the return code of the unsigned addition or subtraction operations
2569 and forward it to the end of the function. On line 38 the ``not equal to'' \textbf{MP\_LT} expression is used to emulate a
2570 ``greater than or equal to'' comparison.
2571
2572 \section{Bit and Digit Shifting}
2573 It is quite common to think of a multiple precision integer as a polynomial in $x$, that is $y = f(\beta)$ where $f(x) = \sum_{i=0}^{n-1} a_i x^i$.
2574 This notation arises within discussion of Montgomery and Diminished Radix Reduction as well as Karatsuba multiplication and squaring.
2575
In order to facilitate operations on polynomials in $x$ as above a series of simple ``digit'' algorithms have to be established. That is, to shift
the digits left or right as well as to shift the individual bits of the digits left and right. It is important to note that not all ``shift'' operations
are on radix-$\beta$ digits.
2579
2580 \subsection{Multiplication by Two}
2581
In a binary system where the radix is a power of two, multiplication by two not only arises often in other algorithms, it is also a fairly efficient
operation to perform. A single precision logical shift left is sufficient to multiply a single digit by two.
2584
2585 \newpage\begin{figure}[!here]
2586 \begin{small}
2587 \begin{center}
2588 \begin{tabular}{l}
2589 \hline Algorithm \textbf{mp\_mul\_2}. \\
2590 \textbf{Input}. One mp\_int $a$ \\
2591 \textbf{Output}. $b = 2a$. \\
2592 \hline \\
2593 1. If $b.alloc < a.used + 1$ then grow $b$ to hold $a.used + 1$ digits. (\textit{mp\_grow}) \\
2594 2. $oldused \leftarrow b.used$ \\
2595 3. $b.used \leftarrow a.used$ \\
2596 4. $r \leftarrow 0$ \\
2597 5. for $n$ from 0 to $a.used - 1$ do \\
2598 \hspace{3mm}5.1 $rr \leftarrow a_n >> (lg(\beta) - 1)$ \\
2599 \hspace{3mm}5.2 $b_n \leftarrow (a_n << 1) + r \mbox{ (mod }\beta\mbox{)}$ \\
2600 \hspace{3mm}5.3 $r \leftarrow rr$ \\
2601 6. If $r \ne 0$ then do \\
2602 \hspace{3mm}6.1 $b_{n + 1} \leftarrow r$ \\
2603 \hspace{3mm}6.2 $b.used \leftarrow b.used + 1$ \\
7. If $b.used < oldused$ then do \\
2605 \hspace{3mm}7.1 for $n$ from $b.used$ to $oldused - 1$ do \\
2606 \hspace{6mm}7.1.1 $b_n \leftarrow 0$ \\
2607 8. $b.sign \leftarrow a.sign$ \\
2608 9. Return(\textit{MP\_OKAY}).\\
2609 \hline
2610 \end{tabular}
2611 \end{center}
2612 \end{small}
2613 \caption{Algorithm mp\_mul\_2}
2614 \end{figure}
2615
2616 \textbf{Algorithm mp\_mul\_2.}
This algorithm will quickly multiply an mp\_int by two provided $\beta$ is a power of two. Neither \cite{TAOCPV2} nor \cite{HAC} describes such
an algorithm despite the fact that it arises often in other algorithms. The algorithm is set up much like the lower level algorithm s\_mp\_add since
it is for all intents and purposes equivalent to the operation $b = \vert a \vert + \vert a \vert$.
2620
Steps 1 and 2 grow the destination as required to accommodate the maximum number of \textbf{used} digits in the result and save the original
\textbf{used} count. The initial \textbf{used} count is set to $a.used$ at step 3. Only if there is a final carry will the \textbf{used} count require adjustment.

Step 5 is an optimized version of the addition loop for this specific case. That is, since the two values being added together
are the same there is no need to perform two reads from the digits of $a$. Step 5.1 performs a single precision shift on the current digit $a_n$ to
obtain what will be the carry for the next iteration. Step 5.2 calculates the $n$'th digit of the result as a single precision shift of $a_n$ plus
the previous carry. Recall from section 4.1 that $a_n << 1$ is equivalent to $a_n \cdot 2$. An iteration of the addition loop is finished by
forwarding the carry to the next iteration.

Step 6 takes care of any final carry by setting the $a.used$'th digit of the result to the carry and augmenting the \textbf{used} count of $b$.
Step 7 clears any leading digits of $b$ in case it originally had a larger magnitude than $a$.
2632
2633 \vspace{+3mm}\begin{small}
2634 \hspace{-5.1mm}{\bf File}: bn\_mp\_mul\_2.c
2635 \vspace{-3mm}
2636 \begin{alltt}
2637 016
2638 017 /* b = a*2 */
2639 018 int mp_mul_2(mp_int * a, mp_int * b)
2640 019 \{
2641 020 int x, res, oldused;
2642 021
2643 022 /* grow to accomodate result */
2644 023 if (b->alloc < a->used + 1) \{
2645 024 if ((res = mp_grow (b, a->used + 1)) != MP_OKAY) \{
2646 025 return res;
2647 026 \}
2648 027 \}
2649 028
2650 029 oldused = b->used;
2651 030 b->used = a->used;
2652 031
2653 032 \{
2654 033 register mp_digit r, rr, *tmpa, *tmpb;
2655 034
2656 035 /* alias for source */
2657 036 tmpa = a->dp;
2658 037
2659 038 /* alias for dest */
2660 039 tmpb = b->dp;
2661 040
2662 041 /* carry */
2663 042 r = 0;
2664 043 for (x = 0; x < a->used; x++) \{
2665 044
2666 045 /* get what will be the *next* carry bit from the
2667 046 * MSB of the current digit
2668 047 */
2669 048 rr = *tmpa >> ((mp_digit)(DIGIT_BIT - 1));
2670 049
2671 050 /* now shift up this digit, add in the carry [from the previous] */
2672 051 *tmpb++ = ((*tmpa++ << ((mp_digit)1)) | r) & MP_MASK;
2673 052
2674 053 /* copy the carry that would be from the source
2675 054 * digit into the next iteration
2676 055 */
2677 056 r = rr;
2678 057 \}
2679 058
2680 059 /* new leading digit? */
2681 060 if (r != 0) \{
2682 061 /* add a MSB which is always 1 at this point */
2683 062 *tmpb = 1;
2684 063 ++(b->used);
2685 064 \}
2686 065
2687 066 /* now zero any excess digits on the destination
2688 067 * that we didn't write to
2689 068 */
2690 069 tmpb = b->dp + b->used;
2691 070 for (x = b->used; x < oldused; x++) \{
2692 071 *tmpb++ = 0;
2693 072 \}
2694 073 \}
2695 074 b->sign = a->sign;
2696 075 return MP_OKAY;
2697 076 \}
2698 \end{alltt}
2699 \end{small}
2700
2701 This implementation is essentially an optimized implementation of s\_mp\_add for the case of doubling an input. The only noteworthy difference
2702 is the use of the logical shift operator on line 51 to perform a single precision doubling.
2703
2704 \subsection{Division by Two}
2705 A division by two can just as easily be accomplished with a logical shift right as multiplication by two can be with a logical shift left.
2706
2707 \newpage\begin{figure}[!here]
2708 \begin{small}
2709 \begin{center}
2710 \begin{tabular}{l}
2711 \hline Algorithm \textbf{mp\_div\_2}. \\
2712 \textbf{Input}. One mp\_int $a$ \\
2713 \textbf{Output}. $b = a/2$. \\
2714 \hline \\
2715 1. If $b.alloc < a.used$ then grow $b$ to hold $a.used$ digits. (\textit{mp\_grow}) \\
2716 2. If the reallocation failed return(\textit{MP\_MEM}). \\
2717 3. $oldused \leftarrow b.used$ \\
2718 4. $b.used \leftarrow a.used$ \\
2719 5. $r \leftarrow 0$ \\
2720 6. for $n$ from $b.used - 1$ to $0$ do \\
2721 \hspace{3mm}6.1 $rr \leftarrow a_n \mbox{ (mod }2\mbox{)}$\\
2722 \hspace{3mm}6.2 $b_n \leftarrow (a_n >> 1) + (r << (lg(\beta) - 1)) \mbox{ (mod }\beta\mbox{)}$ \\
2723 \hspace{3mm}6.3 $r \leftarrow rr$ \\
7. If $b.used < oldused$ then do \\
2725 \hspace{3mm}7.1 for $n$ from $b.used$ to $oldused - 1$ do \\
2726 \hspace{6mm}7.1.1 $b_n \leftarrow 0$ \\
2727 8. $b.sign \leftarrow a.sign$ \\
2728 9. Clamp excess digits of $b$. (\textit{mp\_clamp}) \\
2729 10. Return(\textit{MP\_OKAY}).\\
2730 \hline
2731 \end{tabular}
2732 \end{center}
2733 \end{small}
2734 \caption{Algorithm mp\_div\_2}
2735 \end{figure}
2736
2737 \textbf{Algorithm mp\_div\_2.}
This algorithm will divide an mp\_int by two using logical shifts to the right. Like mp\_mul\_2 it uses a modified low level addition
core as the basis of the algorithm. Unlike mp\_mul\_2 the shift operations work from the leading digit to the trailing digit. The algorithm
could be written to work from the trailing digit to the leading digit; however, it would have to stop one digit short of the $a.used - 1$'th digit to prevent
reading past the end of the array of digits.
2742
2743 Essentially the loop at step 6 is similar to that of mp\_mul\_2 except the logical shifts go in the opposite direction and the carry is at the
2744 least significant bit not the most significant bit.
2745
2746 \vspace{+3mm}\begin{small}
2747 \hspace{-5.1mm}{\bf File}: bn\_mp\_div\_2.c
2748 \vspace{-3mm}
2749 \begin{alltt}
2750 016
2751 017 /* b = a/2 */
2752 018 int mp_div_2(mp_int * a, mp_int * b)
2753 019 \{
2754 020 int x, res, oldused;
2755 021
2756 022 /* copy */
2757 023 if (b->alloc < a->used) \{
2758 024 if ((res = mp_grow (b, a->used)) != MP_OKAY) \{
2759 025 return res;
2760 026 \}
2761 027 \}
2762 028
2763 029 oldused = b->used;
2764 030 b->used = a->used;
2765 031 \{
2766 032 register mp_digit r, rr, *tmpa, *tmpb;
2767 033
2768 034 /* source alias */
2769 035 tmpa = a->dp + b->used - 1;
2770 036
2771 037 /* dest alias */
2772 038 tmpb = b->dp + b->used - 1;
2773 039
2774 040 /* carry */
2775 041 r = 0;
2776 042 for (x = b->used - 1; x >= 0; x--) \{
2777 043 /* get the carry for the next iteration */
2778 044 rr = *tmpa & 1;
2779 045
2780 046 /* shift the current digit, add in carry and store */
2781 047 *tmpb-- = (*tmpa-- >> 1) | (r << (DIGIT_BIT - 1));
2782 048
2783 049 /* forward carry to next iteration */
2784 050 r = rr;
2785 051 \}
2786 052
2787 053 /* zero excess digits */
2788 054 tmpb = b->dp + b->used;
2789 055 for (x = b->used; x < oldused; x++) \{
2790 056 *tmpb++ = 0;
2791 057 \}
2792 058 \}
2793 059 b->sign = a->sign;
2794 060 mp_clamp (b);
2795 061 return MP_OKAY;
2796 062 \}
2797 \end{alltt}
2798 \end{small}
2799
2800 \section{Polynomial Basis Operations}
2801 Recall from section 4.3 that any integer can be represented as a polynomial in $x$ as $y = f(\beta)$. Such a representation is also known as
2802 the polynomial basis \cite[pp. 48]{ROSE}. Given such a notation a multiplication or division by $x$ amounts to shifting whole digits a single
2803 place. The need for such operations arises in several other higher level algorithms such as Barrett and Montgomery reduction, integer
2804 division and Karatsuba multiplication.
2805
2806 Converting from an array of digits to polynomial basis is very simple. Consider the integer $y \equiv (a_2, a_1, a_0)_{\beta}$ and recall that
2807 $y = \sum_{i=0}^{2} a_i \beta^i$. Simply replace $\beta$ with $x$ and the expression is in polynomial basis. For example, $f(x) = 8x + 9$ is the
2808 polynomial basis representation for $89$ using radix ten. That is, $f(10) = 8(10) + 9 = 89$.
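
As a quick sanity check of the notation, the fragment below evaluates the digit array of $89$ at $x = 10$ using Horner's rule and recovers the
original integer. It works on a plain array of radix ten ``digits'' so the numbers stay readable and is purely illustrative, not library code.

\begin{small}
\begin{alltt}
#include <stdio.h>

int main(void)
\{
   /* digits of 89 in radix 10, least significant first: a_0 = 9, a_1 = 8 */
   int digits[2] = \{ 9, 8 \};
   int radix = 10, y = 0, i;

   /* Horner's rule: y = (a_1 x + a_0) evaluated at x = radix */
   for (i = 2 - 1; i >= 0; i--) \{
      y = (y * radix) + digits[i];
   \}

   /* prints f(10) = 89 */
   printf("f(10) = %d\symbol{92}n", y);
   return 0;
\}
\end{alltt}
\end{small}

Replacing the radix ten with $\beta$ gives exactly the evaluation $y = f(\beta)$ used throughout the text.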
2809
2810 \subsection{Multiplication by $x$}
2811
2812 Given a polynomial in $x$ such as $f(x) = a_n x^n + a_{n-1} x^{n-1} + ... + a_0$ multiplying by $x$ amounts to shifting the coefficients up one
2813 degree. In this case $f(x) \cdot x = a_n x^{n+1} + a_{n-1} x^n + ... + a_0 x$. From a scalar basis point of view multiplying by $x$ is equivalent to
2814 multiplying by the integer $\beta$.
2815
2816 \newpage\begin{figure}[!here]
2817 \begin{small}
2818 \begin{center}
2819 \begin{tabular}{l}
2820 \hline Algorithm \textbf{mp\_lshd}. \\
2821 \textbf{Input}. One mp\_int $a$ and an integer $b$ \\
2822 \textbf{Output}. $a \leftarrow a \cdot \beta^b$ (equivalent to multiplication by $x^b$). \\
2823 \hline \\
2824 1. If $b \le 0$ then return(\textit{MP\_OKAY}). \\
2825 2. If $a.alloc < a.used + b$ then grow $a$ to at least $a.used + b$ digits. (\textit{mp\_grow}). \\
2826 3. If the reallocation failed return(\textit{MP\_MEM}). \\
2827 4. $a.used \leftarrow a.used + b$ \\
2828 5. $i \leftarrow a.used - 1$ \\
2829 6. $j \leftarrow a.used - 1 - b$ \\
2830 7. for $n$ from $a.used - 1$ to $b$ do \\
2831 \hspace{3mm}7.1 $a_{i} \leftarrow a_{j}$ \\
2832 \hspace{3mm}7.2 $i \leftarrow i - 1$ \\
2833 \hspace{3mm}7.3 $j \leftarrow j - 1$ \\
2834 8. for $n$ from 0 to $b - 1$ do \\
2835 \hspace{3mm}8.1 $a_n \leftarrow 0$ \\
2836 9. Return(\textit{MP\_OKAY}). \\
2837 \hline
2838 \end{tabular}
2839 \end{center}
2840 \end{small}
2841 \caption{Algorithm mp\_lshd}
2842 \end{figure}
2843
2844 \textbf{Algorithm mp\_lshd.}
This algorithm multiplies an mp\_int by the $b$'th power of $x$. This is equivalent to multiplying by $\beta^b$. The algorithm differs
from the other algorithms presented so far as it performs the operation in place instead of storing the result in a separate location. The
motivation behind this change is due to the way this function is typically used. Algorithms such as mp\_add store the result in an optionally
different third mp\_int because the original inputs are often still required. Algorithm mp\_lshd (\textit{and similarly algorithm mp\_rshd}) is
typically used on values where the original value is no longer required. The algorithm will return success immediately if
$b \le 0$ since the rest of the algorithm is only valid when $b > 0$.
2851
First the destination $a$ is grown as required to accommodate the result. The counters $i$ and $j$ are used to form a \textit{sliding window} over
2853 the digits of $a$ of length $b$. The head of the sliding window is at $i$ (\textit{the leading digit}) and the tail at $j$ (\textit{the trailing digit}).
2854 The loop on step 7 copies the digit from the tail to the head. In each iteration the window is moved down one digit. The last loop on
2855 step 8 sets the lower $b$ digits to zero.
2856
2857 \newpage
2858 \begin{center}
2859 \begin{figure}[here]
2860 \includegraphics{pics/sliding_window.ps}
2861 \caption{Sliding Window Movement}
2862 \label{pic:sliding_window}
2863 \end{figure}
2864 \end{center}
2865
2866 \vspace{+3mm}\begin{small}
2867 \hspace{-5.1mm}{\bf File}: bn\_mp\_lshd.c
2868 \vspace{-3mm}
2869 \begin{alltt}
2870 016
2871 017 /* shift left a certain amount of digits */
2872 018 int mp_lshd (mp_int * a, int b)
2873 019 \{
2874 020 int x, res;
2875 021
2876 022 /* if its less than zero return */
2877 023 if (b <= 0) \{
2878 024 return MP_OKAY;
2879 025 \}
2880 026
2881 027 /* grow to fit the new digits */
2882 028 if (a->alloc < a->used + b) \{
2883 029 if ((res = mp_grow (a, a->used + b)) != MP_OKAY) \{
2884 030 return res;
2885 031 \}
2886 032 \}
2887 033
2888 034 \{
2889 035 register mp_digit *top, *bottom;
2890 036
2891 037 /* increment the used by the shift amount then copy upwards */
2892 038 a->used += b;
2893 039
2894 040 /* top */
2895 041 top = a->dp + a->used - 1;
2896 042
2897 043 /* base */
2898 044 bottom = a->dp + a->used - 1 - b;
2899 045
2900 046 /* much like mp_rshd this is implemented using a sliding window
2901 047 * except the window goes the otherway around. Copying from
2902 048 * the bottom to the top. see bn_mp_rshd.c for more info.
2903 049 */
2904 050 for (x = a->used - 1; x >= b; x--) \{
2905 051 *top-- = *bottom--;
2906 052 \}
2907 053
2908 054 /* zero the lower digits */
2909 055 top = a->dp;
2910 056 for (x = 0; x < b; x++) \{
2911 057 *top++ = 0;
2912 058 \}
2913 059 \}
2914 060 return MP_OKAY;
2915 061 \}
2916 \end{alltt}
2917 \end{small}
2918
2919 The if statement on line 23 ensures that the $b$ variable is greater than zero. The \textbf{used} count is incremented by $b$ before
the copy loop begins. This eliminates the need for an additional variable in the for loop. The variable $top$ on line 41 is an alias
2921 for the leading digit while $bottom$ on line 44 is an alias for the trailing edge. The aliases form a window of exactly $b$ digits
2922 over the input.
2923
2924 \subsection{Division by $x$}
2925
2926 Division by powers of $x$ is easily achieved by shifting the digits right and removing any that will end up to the right of the zero'th digit.
2927
2928 \newpage\begin{figure}[!here]
2929 \begin{small}
2930 \begin{center}
2931 \begin{tabular}{l}
2932 \hline Algorithm \textbf{mp\_rshd}. \\
2933 \textbf{Input}. One mp\_int $a$ and an integer $b$ \\
2934 \textbf{Output}. $a \leftarrow a / \beta^b$ (Divide by $x^b$). \\
2935 \hline \\
2936 1. If $b \le 0$ then return. \\
2937 2. If $a.used \le b$ then do \\
2938 \hspace{3mm}2.1 Zero $a$. (\textit{mp\_zero}). \\
2939 \hspace{3mm}2.2 Return. \\
2940 3. $i \leftarrow 0$ \\
2941 4. $j \leftarrow b$ \\
2942 5. for $n$ from 0 to $a.used - b - 1$ do \\
2943 \hspace{3mm}5.1 $a_i \leftarrow a_j$ \\
2944 \hspace{3mm}5.2 $i \leftarrow i + 1$ \\
2945 \hspace{3mm}5.3 $j \leftarrow j + 1$ \\
2946 6. for $n$ from $a.used - b$ to $a.used - 1$ do \\
2947 \hspace{3mm}6.1 $a_n \leftarrow 0$ \\
2948 7. $a.used \leftarrow a.used - b$ \\
2949 8. Return. \\
2950 \hline
2951 \end{tabular}
2952 \end{center}
2953 \end{small}
2954 \caption{Algorithm mp\_rshd}
2955 \end{figure}
2956
2957 \textbf{Algorithm mp\_rshd.}
This algorithm divides the input in place by the $b$'th power of $x$. It is analogous to dividing by $\beta^b$ but much quicker since
it does not require single precision division. This algorithm does not actually return an error code as it cannot fail.
2960
2961 If the input $b$ is less than one the algorithm quickly returns without performing any work. If the \textbf{used} count is less than or equal
2962 to the shift count $b$ then it will simply zero the input and return.
2963
2964 After the trivial cases of inputs have been handled the sliding window is setup. Much like the case of algorithm mp\_lshd a sliding window that
2965 is $b$ digits wide is used to copy the digits. Unlike mp\_lshd the window slides in the opposite direction from the trailing to the leading digit.
2966 Also the digits are copied from the leading to the trailing edge.
2967
2968 Once the window copy is complete the upper digits must be zeroed and the \textbf{used} count decremented.
2969
2970 \vspace{+3mm}\begin{small}
2971 \hspace{-5.1mm}{\bf File}: bn\_mp\_rshd.c
2972 \vspace{-3mm}
2973 \begin{alltt}
2974 016
2975 017 /* shift right a certain amount of digits */
2976 018 void mp_rshd (mp_int * a, int b)
2977 019 \{
2978 020 int x;
2979 021
2980 022 /* if b <= 0 then ignore it */
2981 023 if (b <= 0) \{
2982 024 return;
2983 025 \}
2984 026
2985 027 /* if b > used then simply zero it and return */
2986 028 if (a->used <= b) \{
2987 029 mp_zero (a);
2988 030 return;
2989 031 \}
2990 032
2991 033 \{
2992 034 register mp_digit *bottom, *top;
2993 035
2994 036 /* shift the digits down */
2995 037
2996 038 /* bottom */
2997 039 bottom = a->dp;
2998 040
2999 041 /* top [offset into digits] */
3000 042 top = a->dp + b;
3001 043
3002 044 /* this is implemented as a sliding window where
3003 045 * the window is b-digits long and digits from
3004 046 * the top of the window are copied to the bottom
3005 047 *
3006 048 * e.g.
3007 049
3008 050 b-2 | b-1 | b0 | b1 | b2 | ... | bb | ---->
3009 051 /\symbol{92} | ---->
3010 052 \symbol{92}-------------------/ ---->
3011 053 */
3012 054 for (x = 0; x < (a->used - b); x++) \{
3013 055 *bottom++ = *top++;
3014 056 \}
3015 057
3016 058 /* zero the top digits */
3017 059 for (; x < a->used; x++) \{
3018 060 *bottom++ = 0;
3019 061 \}
3020 062 \}
3021 063
3022 064 /* remove excess digits */
3023 065 a->used -= b;
3024 066 \}
3025 \end{alltt}
3026 \end{small}
3027
3028 The only noteworthy element of this routine is the lack of a return type.
3029
3030 -- Will update later to give it a return type...Tom
3031
3032 \section{Powers of Two}
3033
Now that algorithms for moving single bits as well as whole digits exist, algorithms for moving the ``in between'' distances are required. For
example, to quickly multiply by $2^k$ for any $k$ without using a full multiplier algorithm would prove useful. Instead of performing single
shifts $k$ times to achieve a multiplication by $2^{\pm k}$, a mixture of whole digit shifting and partial digit shifting is employed.
3037
3038 \subsection{Multiplication by Power of Two}
3039
3040 \newpage\begin{figure}[!here]
3041 \begin{small}
3042 \begin{center}
3043 \begin{tabular}{l}
3044 \hline Algorithm \textbf{mp\_mul\_2d}. \\
3045 \textbf{Input}. One mp\_int $a$ and an integer $b$ \\
3046 \textbf{Output}. $c \leftarrow a \cdot 2^b$. \\
3047 \hline \\
3048 1. $c \leftarrow a$. (\textit{mp\_copy}) \\
3049 2. If $c.alloc < c.used + \lfloor b / lg(\beta) \rfloor + 2$ then grow $c$ accordingly. \\
3050 3. If the reallocation failed return(\textit{MP\_MEM}). \\
3051 4. If $b \ge lg(\beta)$ then \\
3052 \hspace{3mm}4.1 $c \leftarrow c \cdot \beta^{\lfloor b / lg(\beta) \rfloor}$ (\textit{mp\_lshd}). \\
3053 \hspace{3mm}4.2 If step 4.1 failed return(\textit{MP\_MEM}). \\
3054 5. $d \leftarrow b \mbox{ (mod }lg(\beta)\mbox{)}$ \\
3055 6. If $d \ne 0$ then do \\
3056 \hspace{3mm}6.1 $mask \leftarrow 2^d$ \\
3057 \hspace{3mm}6.2 $r \leftarrow 0$ \\
3058 \hspace{3mm}6.3 for $n$ from $0$ to $c.used - 1$ do \\
3059 \hspace{6mm}6.3.1 $rr \leftarrow c_n >> (lg(\beta) - d) \mbox{ (mod }mask\mbox{)}$ \\
3060 \hspace{6mm}6.3.2 $c_n \leftarrow (c_n << d) + r \mbox{ (mod }\beta\mbox{)}$ \\
3061 \hspace{6mm}6.3.3 $r \leftarrow rr$ \\
3062 \hspace{3mm}6.4 If $r > 0$ then do \\
3063 \hspace{6mm}6.4.1 $c_{c.used} \leftarrow r$ \\
3064 \hspace{6mm}6.4.2 $c.used \leftarrow c.used + 1$ \\
3065 7. Return(\textit{MP\_OKAY}). \\
3066 \hline
3067 \end{tabular}
3068 \end{center}
3069 \end{small}
3070 \caption{Algorithm mp\_mul\_2d}
3071 \end{figure}
3072
3073 \textbf{Algorithm mp\_mul\_2d.}
3074 This algorithm multiplies $a$ by $2^b$ and stores the result in $c$. The algorithm uses algorithm mp\_lshd and a derivative of algorithm mp\_mul\_2 to
3075 quickly compute the product.
3076
First the algorithm will multiply $a$ by $x^{\lfloor b / lg(\beta) \rfloor}$ which ensures that the remaining power of two is less than
$\beta$. For example, if $b = 37$ and $\beta = 2^{28}$ then this step will multiply by $x$, leaving a multiplication by $2^{37 - 28} = 2^{9}$
still to be performed.
3080
After the digits have been shifted appropriately at most $lg(\beta) - 1$ shifts are left to perform. Step 5 calculates the number of remaining shifts
required. If it is non-zero a modified shift loop is used to calculate the remaining product.
Essentially the loop is a generic version of algorithm mp\_mul\_2 designed to handle any shift count in the range $1 \le x < lg(\beta)$. The $mask$
variable is used to extract the upper $d$ bits to form the carry for the next iteration.
3085
This algorithm is loosely measured as an $O(2n)$ algorithm, which means that for an $n$-digit input it takes roughly $2n$ ``time'' to
complete. It is possible to optimize this algorithm down to an $O(n)$ algorithm at a cost of making the algorithm slightly harder to follow.
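
For illustration, the following sketch shows one way such a single pass might look. It operates on raw digit arrays rather than mp\_ints to stay
short; the names digit\_t, DBITS and DMASK are hypothetical stand-ins for mp\_digit, DIGIT\_BIT and MP\_MASK, and the caller is assumed to have
allocated $n + \lfloor b / lg(\beta) \rfloor + 1$ output digits. It is not the library routine that follows.

\begin{small}
\begin{alltt}
typedef unsigned long digit_t;           /* stand-in for mp_digit  */
#define DBITS 28                         /* stand-in for DIGIT_BIT */
#define DMASK ((digit_t)((1UL << DBITS) - 1))

/* out = in * 2^b in a single pass over the n input digits,
 * returns the number of output digits written
 */
int lshift_2k(const digit_t *in, int n, int b, digit_t *out)
\{
   int     digs  = b / DBITS;            /* whole digit portion   */
   int     bits  = b % DBITS;            /* partial digit portion */
   digit_t carry = 0;
   int     i;

   /* the lowest digs digits of the product are zero */
   for (i = 0; i < digs; i++) \{
      out[i] = 0;
   \}

   /* shift, mask and forward the carry in one sweep */
   for (i = 0; i < n; i++) \{
      out[i + digs] = ((in[i] << bits) | carry) & DMASK;
      carry         = (bits > 0) ? (in[i] >> (DBITS - bits)) : 0;
   \}
   out[n + digs] = carry;

   return n + digs + 1;
\}
\end{alltt}
\end{small}

The final digit written may be zero; clamping in the spirit of mp\_clamp would remove it.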
3088
3089 \vspace{+3mm}\begin{small}
3090 \hspace{-5.1mm}{\bf File}: bn\_mp\_mul\_2d.c
3091 \vspace{-3mm}
3092 \begin{alltt}
3093 016
3094 017 /* shift left by a certain bit count */
3095 018 int mp_mul_2d (mp_int * a, int b, mp_int * c)
3096 019 \{
3097 020 mp_digit d;
3098 021 int res;
3099 022
3100 023 /* copy */
3101 024 if (a != c) \{
3102 025 if ((res = mp_copy (a, c)) != MP_OKAY) \{
3103 026 return res;
3104 027 \}
3105 028 \}
3106 029
3107 030 if (c->alloc < (int)(c->used + b/DIGIT_BIT + 1)) \{
3108 031 if ((res = mp_grow (c, c->used + b / DIGIT_BIT + 1)) != MP_OKAY) \{
3109 032 return res;
3110 033 \}
3111 034 \}
3112 035
3113 036 /* shift by as many digits in the bit count */
3114 037 if (b >= (int)DIGIT_BIT) \{
3115 038 if ((res = mp_lshd (c, b / DIGIT_BIT)) != MP_OKAY) \{
3116 039 return res;
3117 040 \}
3118 041 \}
3119 042
3120 043 /* shift any bit count < DIGIT_BIT */
3121 044 d = (mp_digit) (b % DIGIT_BIT);
3122 045 if (d != 0) \{
3123 046 register mp_digit *tmpc, shift, mask, r, rr;
3124 047 register int x;
3125 048
3126 049 /* bitmask for carries */
3127 050 mask = (((mp_digit)1) << d) - 1;
3128 051
3129 052 /* shift for msbs */
3130 053 shift = DIGIT_BIT - d;
3131 054
3132 055 /* alias */
3133 056 tmpc = c->dp;
3134 057
3135 058 /* carry */
3136 059 r = 0;
3137 060 for (x = 0; x < c->used; x++) \{
3138 061 /* get the higher bits of the current word */
3139 062 rr = (*tmpc >> shift) & mask;
3140 063
3141 064 /* shift the current word and OR in the carry */
3142 065 *tmpc = ((*tmpc << d) | r) & MP_MASK;
3143 066 ++tmpc;
3144 067
3145 068 /* set the carry to the carry bits of the current word */
3146 069 r = rr;
3147 070 \}
3148 071
3149 072 /* set final carry */
3150 073 if (r != 0) \{
3151 074 c->dp[(c->used)++] = r;
3152 075 \}
3153 076 \}
3154 077 mp_clamp (c);
3155 078 return MP_OKAY;
3156 079 \}
3157 \end{alltt}
3158 \end{small}
3159
3160 Notes to be revised when code is updated. -- Tom
3161
3162 \subsection{Division by Power of Two}
3163
3164 \newpage\begin{figure}[!here]
3165 \begin{small}
3166 \begin{center}
3167 \begin{tabular}{l}
3168 \hline Algorithm \textbf{mp\_div\_2d}. \\
3169 \textbf{Input}. One mp\_int $a$ and an integer $b$ \\
3170 \textbf{Output}. $c \leftarrow \lfloor a / 2^b \rfloor, d \leftarrow a \mbox{ (mod }2^b\mbox{)}$. \\
3171 \hline \\
3172 1. If $b \le 0$ then do \\
3173 \hspace{3mm}1.1 $c \leftarrow a$ (\textit{mp\_copy}) \\
3174 \hspace{3mm}1.2 $d \leftarrow 0$ (\textit{mp\_zero}) \\
3175 \hspace{3mm}1.3 Return(\textit{MP\_OKAY}). \\
3176 2. $c \leftarrow a$ \\
3177 3. $d \leftarrow a \mbox{ (mod }2^b\mbox{)}$ (\textit{mp\_mod\_2d}) \\
3178 4. If $b \ge lg(\beta)$ then do \\
3179 \hspace{3mm}4.1 $c \leftarrow \lfloor c/\beta^{\lfloor b/lg(\beta) \rfloor} \rfloor$ (\textit{mp\_rshd}). \\
3180 5. $k \leftarrow b \mbox{ (mod }lg(\beta)\mbox{)}$ \\
3181 6. If $k \ne 0$ then do \\
3182 \hspace{3mm}6.1 $mask \leftarrow 2^k$ \\
3183 \hspace{3mm}6.2 $r \leftarrow 0$ \\
3184 \hspace{3mm}6.3 for $n$ from $c.used - 1$ to $0$ do \\
3185 \hspace{6mm}6.3.1 $rr \leftarrow c_n \mbox{ (mod }mask\mbox{)}$ \\
3186 \hspace{6mm}6.3.2 $c_n \leftarrow (c_n >> k) + (r << (lg(\beta) - k))$ \\
3187 \hspace{6mm}6.3.3 $r \leftarrow rr$ \\
3188 7. Clamp excess digits of $c$. (\textit{mp\_clamp}) \\
3189 8. Return(\textit{MP\_OKAY}). \\
3190 \hline
3191 \end{tabular}
3192 \end{center}
3193 \end{small}
3194 \caption{Algorithm mp\_div\_2d}
3195 \end{figure}
3196
3197 \textbf{Algorithm mp\_div\_2d.}
3198 This algorithm will divide an input $a$ by $2^b$ and produce the quotient and remainder. The algorithm is designed much like algorithm
3199 mp\_mul\_2d by first using whole digit shifts then single precision shifts. This algorithm will also produce the remainder of the division
3200 by using algorithm mp\_mod\_2d.
3201
3202 \vspace{+3mm}\begin{small}
3203 \hspace{-5.1mm}{\bf File}: bn\_mp\_div\_2d.c
3204 \vspace{-3mm}
3205 \begin{alltt}
3206 016
3207 017 /* shift right by a certain bit count (store quotient in c, optional remaind
3208 er in d) */
3209 018 int mp_div_2d (mp_int * a, int b, mp_int * c, mp_int * d)
3210 019 \{
3211 020 mp_digit D, r, rr;
3212 021 int x, res;
3213 022 mp_int t;
3214 023
3215 024
3216 025 /* if the shift count is <= 0 then we do no work */
3217 026 if (b <= 0) \{
3218 027 res = mp_copy (a, c);
3219 028 if (d != NULL) \{
3220 029 mp_zero (d);
3221 030 \}
3222 031 return res;
3223 032 \}
3224 033
3225 034 if ((res = mp_init (&t)) != MP_OKAY) \{
3226 035 return res;
3227 036 \}
3228 037
3229 038 /* get the remainder */
3230 039 if (d != NULL) \{
3231 040 if ((res = mp_mod_2d (a, b, &t)) != MP_OKAY) \{
3232 041 mp_clear (&t);
3233 042 return res;
3234 043 \}
3235 044 \}
3236 045
3237 046 /* copy */
3238 047 if ((res = mp_copy (a, c)) != MP_OKAY) \{
3239 048 mp_clear (&t);
3240 049 return res;
3241 050 \}
3242 051
3243 052 /* shift by as many digits in the bit count */
3244 053 if (b >= (int)DIGIT_BIT) \{
3245 054 mp_rshd (c, b / DIGIT_BIT);
3246 055 \}
3247 056
3248 057 /* shift any bit count < DIGIT_BIT */
3249 058 D = (mp_digit) (b % DIGIT_BIT);
3250 059 if (D != 0) \{
3251 060 register mp_digit *tmpc, mask, shift;
3252 061
3253 062 /* mask */
3254 063 mask = (((mp_digit)1) << D) - 1;
3255 064
3256 065 /* shift for lsb */
3257 066 shift = DIGIT_BIT - D;
3258 067
3259 068 /* alias */
3260 069 tmpc = c->dp + (c->used - 1);
3261 070
3262 071 /* carry */
3263 072 r = 0;
3264 073 for (x = c->used - 1; x >= 0; x--) \{
3265 074 /* get the lower bits of this word in a temp */
3266 075 rr = *tmpc & mask;
3267 076
3268 077 /* shift the current word and mix in the carry bits from the previous
3269 word */
3270 078 *tmpc = (*tmpc >> D) | (r << shift);
3271 079 --tmpc;
3272 080
3273 081 /* set the carry to the carry bits of the current word found above */
3274 082 r = rr;
3275 083 \}
3276 084 \}
3277 085 mp_clamp (c);
3278 086 if (d != NULL) \{
3279 087 mp_exch (&t, d);
3280 088 \}
3281 089 mp_clear (&t);
3282 090 return MP_OKAY;
3283 091 \}
3284 \end{alltt}
3285 \end{small}
3286
3287 The implementation of algorithm mp\_div\_2d is slightly different than the algorithm specifies. The remainder $d$ may be optionally
3288 ignored by passing \textbf{NULL} as the pointer to the mp\_int variable. The temporary mp\_int variable $t$ is used to hold the
3289 result of the remainder operation until the end. This allows $d$ and $a$ to represent the same mp\_int without modifying $a$ before
3290 the quotient is obtained.
3291
3292 The remainder of the source code is essentially the same as the source code for mp\_mul\_2d. (-- Fix this paragraph up later, Tom).
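
A brief usage sketch ties the two shift routines together: multiplying by $2^b$ and then dividing by $2^b$ must return the original value with a
zero remainder. The fragment below assumes the standard entry points shown so far (mp\_init, mp\_set, mp\_mul\_2d, mp\_div\_2d, mp\_cmp and
mp\_clear) together with the mp\_iszero macro, and omits error checking to keep it short.

\begin{small}
\begin{alltt}
#include <stdio.h>
#include <tommath.h>

int main(void)
\{
   mp_int a, t, q, r;

   mp_init (&a); mp_init (&t); mp_init (&q); mp_init (&r);

   /* a = 1234567, a single digit value for the default beta */
   mp_set (&a, 1234567);

   /* t = a * 2^100, then q = t / 2^100 and r = t mod 2^100 */
   mp_mul_2d (&a, 100, &t);
   mp_div_2d (&t, 100, &q, &r);

   printf("round trip ok: %s, remainder zero: %s\symbol{92}n",
          (mp_cmp (&q, &a) == MP_EQ) ? "yes" : "no",
          mp_iszero (&r) ? "yes" : "no");

   mp_clear (&a); mp_clear (&t); mp_clear (&q); mp_clear (&r);
   return 0;
\}
\end{alltt}
\end{small}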
3293
3294 \subsection{Remainder of Division by Power of Two}
3295
The last algorithm in the series of polynomial basis power of two algorithms calculates the remainder of division by $2^b$. This
algorithm benefits from the fact that in two's complement arithmetic $a \mbox{ (mod }2^b\mbox{)}$ is the same as $a$ AND $2^b - 1$.
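
The identity is easy to check on a single machine word before looking at the multiple precision version. The fragment below computes the remainder
both ways for one arbitrary value; it is only meant to illustrate the masking idea, not the mp\_int routine.

\begin{small}
\begin{alltt}
#include <stdio.h>

int main(void)
\{
   unsigned long a = 12345678UL;
   int b = 10;

   /* a mod 2^b computed by division and again by masking */
   unsigned long by_mod  = a % (1UL << b);
   unsigned long by_mask = a & ((1UL << b) - 1UL);

   /* both print 334 */
   printf("%lu %lu\symbol{92}n", by_mod, by_mask);
   return 0;
\}
\end{alltt}
\end{small}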
3298
3299 \begin{figure}[!here]
3300 \begin{small}
3301 \begin{center}
3302 \begin{tabular}{l}
3303 \hline Algorithm \textbf{mp\_mod\_2d}. \\
3304 \textbf{Input}. One mp\_int $a$ and an integer $b$ \\
3305 \textbf{Output}. $c \leftarrow a \mbox{ (mod }2^b\mbox{)}$. \\
3306 \hline \\
3307 1. If $b \le 0$ then do \\
3308 \hspace{3mm}1.1 $c \leftarrow 0$ (\textit{mp\_zero}) \\
3309 \hspace{3mm}1.2 Return(\textit{MP\_OKAY}). \\
3310 2. If $b > a.used \cdot lg(\beta)$ then do \\
3311 \hspace{3mm}2.1 $c \leftarrow a$ (\textit{mp\_copy}) \\
3312 \hspace{3mm}2.2 Return the result of step 2.1. \\
3313 3. $c \leftarrow a$ \\
3314 4. If step 3 failed return(\textit{MP\_MEM}). \\
5. for $n$ from $\lceil b / lg(\beta) \rceil$ to $c.used - 1$ do \\
3316 \hspace{3mm}5.1 $c_n \leftarrow 0$ \\
3317 6. $k \leftarrow b \mbox{ (mod }lg(\beta)\mbox{)}$ \\
3318 7. $c_{\lfloor b / lg(\beta) \rfloor} \leftarrow c_{\lfloor b / lg(\beta) \rfloor} \mbox{ (mod }2^{k}\mbox{)}$. \\
3319 8. Clamp excess digits of $c$. (\textit{mp\_clamp}) \\
3320 9. Return(\textit{MP\_OKAY}). \\
3321 \hline
3322 \end{tabular}
3323 \end{center}
3324 \end{small}
3325 \caption{Algorithm mp\_mod\_2d}
3326 \end{figure}
3327
3328 \textbf{Algorithm mp\_mod\_2d.}
This algorithm will quickly calculate the value of $a \mbox{ (mod }2^b\mbox{)}$. First if $b$ is less than or equal to zero the
result is set to zero. If $b$ is greater than the number of bits in $a$ then it simply copies $a$ to $c$ and returns. Otherwise, $a$
is copied to $c$, the leading digits are removed and the remaining leading digit is trimmed to the exact bit count.
3332
3333 \vspace{+3mm}\begin{small}
3334 \hspace{-5.1mm}{\bf File}: bn\_mp\_mod\_2d.c
3335 \vspace{-3mm}
3336 \begin{alltt}
3337 016
3338 017 /* calc a value mod 2**b */
3339 018 int
3340 019 mp_mod_2d (mp_int * a, int b, mp_int * c)
3341 020 \{
3342 021 int x, res;
3343 022
3344 023 /* if b is <= 0 then zero the int */
3345 024 if (b <= 0) \{
3346 025 mp_zero (c);
3347 026 return MP_OKAY;
3348 027 \}
3349 028
3350 029 /* if the modulus is larger than the value than return */
3351 030 if (b > (int) (a->used * DIGIT_BIT)) \{
3352 031 res = mp_copy (a, c);
3353 032 return res;
3354 033 \}
3355 034
3356 035 /* copy */
3357 036 if ((res = mp_copy (a, c)) != MP_OKAY) \{
3358 037 return res;
3359 038 \}
3360 039
3361 040 /* zero digits above the last digit of the modulus */
3362 041 for (x = (b / DIGIT_BIT) + ((b % DIGIT_BIT) == 0 ? 0 : 1); x < c->used; x+
3363 +) \{
3364 042 c->dp[x] = 0;
3365 043 \}
3366 044 /* clear the digit that is not completely outside/inside the modulus */
3367 045 c->dp[b / DIGIT_BIT] &=
3368 046 (mp_digit) ((((mp_digit) 1) << (((mp_digit) b) % DIGIT_BIT)) - ((mp_digi
3369 t) 1));
3370 047 mp_clamp (c);
3371 048 return MP_OKAY;
3372 049 \}
3373 \end{alltt}
3374 \end{small}
3375
3376 -- Add comments later, Tom.
3377
3378 \section*{Exercises}
3379 \begin{tabular}{cl}
3380 $\left [ 3 \right ] $ & Devise an algorithm that performs $a \cdot 2^b$ for generic values of $b$ \\
3381 & in $O(n)$ time. \\
3382 &\\
$\left [ 3 \right ] $ & Devise an efficient algorithm to multiply by small low Hamming \\
& weight values such as $3$, $5$ and $9$. Extend it to handle all values \\
& up to $64$ with a Hamming weight of less than three. \\
3386 &\\
3387 $\left [ 2 \right ] $ & Modify the preceding algorithm to handle values of the form \\
3388 & $2^k - 1$ as well. \\
3389 &\\
3390 $\left [ 3 \right ] $ & Using only algorithms mp\_mul\_2, mp\_div\_2 and mp\_add create an \\
3391 & algorithm to multiply two integers in roughly $O(2n^2)$ time for \\
3392 & any $n$-bit input. Note that the time of addition is ignored in the \\
3393 & calculation. \\
3394 & \\
3395 $\left [ 5 \right ] $ & Improve the previous algorithm to have a working time of at most \\
3396 & $O \left (2^{(k-1)}n + \left ({2n^2 \over k} \right ) \right )$ for an appropriate choice of $k$. Again ignore \\
3397 & the cost of addition. \\
3398 & \\
3399 $\left [ 2 \right ] $ & Devise a chart to find optimal values of $k$ for the previous problem \\
3400 & for $n = 64 \ldots 1024$ in steps of $64$. \\
3401 & \\
3402 $\left [ 2 \right ] $ & Using only algorithms mp\_abs and mp\_sub devise another method for \\
3403 & calculating the result of a signed comparison. \\
3404 &
3405 \end{tabular}
3406
3407 \chapter{Multiplication and Squaring}
3408 \section{The Multipliers}
3409 For most number theoretic problems including certain public key cryptographic algorithms, the ``multipliers'' form the most important subset of
3410 algorithms of any multiple precision integer package. The set of multiplier algorithms include integer multiplication, squaring and modular reduction
3411 where in each of the algorithms single precision multiplication is the dominant operation performed. This chapter will discuss integer multiplication
3412 and squaring, leaving modular reductions for the subsequent chapter.
3413
3414 The importance of the multiplier algorithms is for the most part driven by the fact that certain popular public key algorithms are based on modular
3415 exponentiation, that is computing $d \equiv a^b \mbox{ (mod }c\mbox{)}$ for some arbitrary choice of $a$, $b$ and $c$. During a modular
3416 exponentiation the majority\footnote{Roughly speaking a modular exponentiation will spend about 40\% of the time performing modular reductions,
3417 35\% of the time performing squaring and 25\% of the time performing multiplications.} of the processor time is spent performing single precision
3418 multiplications.
3419
3420 For centuries general purpose multiplication has required a lengthy $O(n^2)$ process, whereby each digit of one multiplicand has to be multiplied
3421 against every digit of the other multiplicand. Traditional long-hand multiplication is based on this process; while the techniques can differ the
3422 overall algorithm used is essentially the same. Only ``recently'' have faster algorithms been studied. Karatsuba multiplication, discovered in
3423 1962, was among the first; it can multiply two numbers with considerably fewer single precision multiplications than the long-hand approach.
3424 This technique led to the discovery of polynomial basis algorithms (\textit{good reference?}) and subsequently Fourier Transform based solutions.
3425
3426 \section{Multiplication}
3427 \subsection{The Baseline Multiplication}
3428 \label{sec:basemult}
3429 \index{baseline multiplication}
3430 Computing the product of two integers in software can be achieved using a trivial adaptation of the standard $O(n^2)$ long-hand multiplication
3431 algorithm that school children are taught. The algorithm is considered an $O(n^2)$ algorithm since for two $n$-digit inputs $n^2$ single precision
3432 multiplications are required. More specifically, for an $m$ and $n$ digit input $m \cdot n$ single precision multiplications are required. To
3433 simplify most discussions, it will be assumed that the inputs have a comparable number of digits.
3434
3435 The ``baseline multiplication'' algorithm is designed to act as the ``catch-all'' algorithm, only to be used when the faster algorithms cannot be
3436 used. This algorithm does not use any particularly interesting optimizations and should ideally be avoided if possible. One important
3437 facet of this algorithm is that it has been modified to only produce a certain number of output digits as resolution. The importance of this
3438 modification will become evident during the discussion of Barrett modular reduction. Recall that for an $n$ and $m$ digit input the product
3439 will be at most $n + m$ digits. Therefore, this algorithm can be reduced to a full multiplier by having it produce $n + m$ digits of the product.
3440
3441 Recall from sub-section 4.2.2 the definition of $\gamma$ as the number of bits in the type \textbf{mp\_digit}. We shall now extend the variable set to
3442 include $\alpha$ which shall represent the number of bits in the type \textbf{mp\_word}. This implies that $2^{\alpha} > 2 \cdot \beta^2$. The
3443 constant $\delta = 2^{\alpha - 2lg(\beta)}$ will represent the maximal weight of any column in a product (\textit{see sub-section 5.2.2 for more information}).
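
In terms of the source code these quantities map onto constants that are already defined in tommath.h. The short fragment below is not part of
the library; it merely evaluates $\delta$ with the same expression that appears in the listing of bn\_s\_mp\_mul\_digs.c later in this section
(with the default configuration it evaluates to $256$).

\begin{small}
\begin{alltt}
#include <limits.h>
#include <tommath.h>

/* delta = 2**(alpha - 2*lg(beta)): the number of column terms that
 * can be summed in an mp_word before the carry bits would be lost
 */
static int comba_delta(void)
\{
   return (int) (1 << ((CHAR_BIT * sizeof (mp_word)) - (2 * DIGIT_BIT)));
\}
\end{alltt}
\end{small}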
3444
3445 \newpage\begin{figure}[!here]
3446 \begin{small}
3447 \begin{center}
3448 \begin{tabular}{l}
3449 \hline Algorithm \textbf{s\_mp\_mul\_digs}. \\
3450 \textbf{Input}. mp\_int $a$, mp\_int $b$ and an integer $digs$ \\
3451 \textbf{Output}. $c \leftarrow \vert a \vert \cdot \vert b \vert \mbox{ (mod }\beta^{digs}\mbox{)}$. \\
3452 \hline \\
3453 1. If min$(a.used, b.used) < \delta$ then do \\
3454 \hspace{3mm}1.1 Calculate $c = \vert a \vert \cdot \vert b \vert$ by the Comba method (\textit{see algorithm~\ref{fig:COMBAMULT}}). \\
3455 \hspace{3mm}1.2 Return the result of step 1.1 \\
3456 \\
3457 Allocate and initialize a temporary mp\_int. \\
3458 2. Init $t$ to be of size $digs$ \\
3459 3. If step 2 failed return(\textit{MP\_MEM}). \\
3460 4. $t.used \leftarrow digs$ \\
3461 \\
3462 Compute the product. \\
3463 5. for $ix$ from $0$ to $a.used - 1$ do \\
3464 \hspace{3mm}5.1 $u \leftarrow 0$ \\
3465 \hspace{3mm}5.2 $pb \leftarrow \mbox{min}(b.used, digs - ix)$ \\
3466 \hspace{3mm}5.3 If $pb < 1$ then goto step 6. \\
3467 \hspace{3mm}5.4 for $iy$ from $0$ to $pb - 1$ do \\
3468 \hspace{6mm}5.4.1 $\hat r \leftarrow t_{iy + ix} + a_{ix} \cdot b_{iy} + u$ \\
3469 \hspace{6mm}5.4.2 $t_{iy + ix} \leftarrow \hat r \mbox{ (mod }\beta\mbox{)}$ \\
3470 \hspace{6mm}5.4.3 $u \leftarrow \lfloor \hat r / \beta \rfloor$ \\
3471 \hspace{3mm}5.5 if $ix + pb < digs$ then do \\
3472 \hspace{6mm}5.5.1 $t_{ix + pb} \leftarrow u$ \\
3473 6. Clamp excess digits of $t$. \\
3474 7. Swap $c$ with $t$ \\
3475 8. Clear $t$ \\
3476 9. Return(\textit{MP\_OKAY}). \\
3477 \hline
3478 \end{tabular}
3479 \end{center}
3480 \end{small}
3481 \caption{Algorithm s\_mp\_mul\_digs}
3482 \end{figure}
3483
3484 \textbf{Algorithm s\_mp\_mul\_digs.}
3485 This algorithm computes the unsigned product of two inputs $a$ and $b$, limited to an output precision of $digs$ digits. While it may seem
3486 a bit awkward to modify the function from its simple $O(n^2)$ description, the usefulness of partial multipliers will arise in a subsequent
3487 algorithm. The algorithm is loosely based on algorithm 14.12 from \cite[pp. 595]{HAC} and is similar to Algorithm M of Knuth \cite[pp. 268]{TAOCPV2}.
3488 Algorithm s\_mp\_mul\_digs differs from these cited references since it can produce a variable output precision regardless of the precision of the
3489 inputs.
3490
3491 The first thing this algorithm checks for is whether the Comba multiplier can be used instead. If the minimum digit count of the two
3492 inputs is less than $\delta$, then the Comba method is used. After the Comba method is ruled out, the baseline algorithm begins. A
3493 temporary mp\_int variable $t$ is used to hold the intermediate result of the product. This allows the algorithm to be used to
3494 compute products when either $a = c$ or $b = c$ without overwriting the inputs.
3495
3496 All of step 5 is the infamous $O(n^2)$ multiplication loop slightly modified to only produce up to $digs$ digits of output. The $pb$ variable
3497 is given the count of digits to read from $b$ inside the nested loop. If $pb < 1$ then no more output digits can be produced and the algorithm
3498 will exit the loop. The best way to think of the loops is as a series of $pb \times 1$ multiplications. That is, in each pass of the
3499 outer loop $a_{ix}$ is multiplied against $b$ and the result is added (\textit{with an appropriate shift}) to $t$.
3500
3501 For example, consider multiplying $576$ by $241$. That is equivalent to computing $10^0(1)(576) + 10^1(4)(576) + 10^2(2)(576)$ which is best
3502 visualized in the following table.
3503
3504 \begin{figure}[here]
3505 \begin{center}
3506 \begin{tabular}{|c|c|c|c|c|c|l|}
3507 \hline && & 5 & 7 & 6 & \\
3508 \hline $\times$&& & 2 & 4 & 1 & \\
3509 \hline &&&&&&\\
3510 && & 5 & 7 & 6 & $10^0(1)(576)$ \\
3511 &2 & 3 & 6 & 1 & 6 & $10^1(4)(576) + 10^0(1)(576)$ \\
3512 1 & 3 & 8 & 8 & 1 & 6 & $10^2(2)(576) + 10^1(4)(576) + 10^0(1)(576)$ \\
3513 \hline
3514 \end{tabular}
3515 \end{center}
3516 \caption{Long-Hand Multiplication Diagram}
3517 \end{figure}
3518
3519 Each row of the product is added to the result after being shifted to the left (\textit{multiplied by a power of the radix}) by the appropriate
3520 count. That is, in pass $ix$ of the outer loop the product is added starting at the $ix$'th digit of the result.
3521
3522 Step 5.4.1 introduces the hat symbol (\textit{e.g. $\hat r$}) which represents a double precision variable. The multiplication on that step
3523 is assumed to be a double wide output single precision multiplication. That is, two single precision variables are multiplied to produce a
3524 double precision result. The step is somewhat optimized from a long-hand multiplication algorithm because the carry from the addition in step
3525 5.4.1 is propagated through the nested loop. If the carry was not propagated immediately it would overflow the single precision digit
3526 $t_{ix+iy}$ and the result would be lost.
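
The following fragment is a minimal sketch of steps 5.4.1 through 5.4.3 written with the C99 fixed width types purely for illustration; the
library itself uses the mp\_digit and mp\_word types as shown in the listing that follows.

\begin{small}
\begin{alltt}
#include <stdint.h>

#define RADIX_BITS 28                                /* lg(beta)  */
#define RADIX_MASK ((UINT32_C(1) << RADIX_BITS) - 1) /* beta - 1  */

/* one column update: t[ix + iy] <- (t[ix + iy] + a_ix * b_iy + carry) mod beta
 * with the carry becoming the high half of the double precision result
 */
static void mac_column(uint32_t *t, uint32_t a_ix, uint32_t b_iy, uint32_t *carry)
\{
   uint64_t r = (uint64_t) *t + (uint64_t) a_ix * (uint64_t) b_iy + (uint64_t) *carry;
   *t     = (uint32_t) (r & RADIX_MASK);   /* r mod beta      */
   *carry = (uint32_t) (r >> RADIX_BITS);  /* floor(r / beta) */
\}
\end{alltt}
\end{small}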
3527
3528 At step 5.5 the nested loop is finished and any carry that was left over should be forwarded. The carry does not have to be added to the $ix+pb$'th
3529 digit since that digit is assumed to be zero at this point. However, if $ix + pb \ge digs$ the carry is not set as it would make the result
3530 exceed the precision requested.
3531
3532 \vspace{+3mm}\begin{small}
3533 \hspace{-5.1mm}{\bf File}: bn\_s\_mp\_mul\_digs.c
3534 \vspace{-3mm}
3535 \begin{alltt}
3536 016
3537 017 /* multiplies |a| * |b| and only computes upto digs digits of result
3538 018 * HAC pp. 595, Algorithm 14.12 Modified so you can control how
3539 019 * many digits of output are created.
3540 020 */
3541 021 int
3542 022 s_mp_mul_digs (mp_int * a, mp_int * b, mp_int * c, int digs)
3543 023 \{
3544 024 mp_int t;
3545 025 int res, pa, pb, ix, iy;
3546 026 mp_digit u;
3547 027 mp_word r;
3548 028 mp_digit tmpx, *tmpt, *tmpy;
3549 029
3550 030 /* can we use the fast multiplier? */
3551 031 if (((digs) < MP_WARRAY) &&
3552 032 MIN (a->used, b->used) <
3553 033 (1 << ((CHAR_BIT * sizeof (mp_word)) - (2 * DIGIT_BIT)))) \{
3554 034 return fast_s_mp_mul_digs (a, b, c, digs);
3555 035 \}
3556 036
3557 037 if ((res = mp_init_size (&t, digs)) != MP_OKAY) \{
3558 038 return res;
3559 039 \}
3560 040 t.used = digs;
3561 041
3562 042 /* compute the digits of the product directly */
3563 043 pa = a->used;
3564 044 for (ix = 0; ix < pa; ix++) \{
3565 045 /* set the carry to zero */
3566 046 u = 0;
3567 047
3568 048 /* limit ourselves to making digs digits of output */
3569 049 pb = MIN (b->used, digs - ix);
3570 050
3571 051 /* setup some aliases */
3572 052 /* copy of the digit from a used within the nested loop */
3573 053 tmpx = a->dp[ix];
3574 054
3575 055 /* an alias for the destination shifted ix places */
3576 056 tmpt = t.dp + ix;
3577 057
3578 058 /* an alias for the digits of b */
3579 059 tmpy = b->dp;
3580 060
3581 061 /* compute the columns of the output and propagate the carry */
3582 062 for (iy = 0; iy < pb; iy++) \{
3583 063 /* compute the column as a mp_word */
3584 064 r = ((mp_word)*tmpt) +
3585 065 ((mp_word)tmpx) * ((mp_word)*tmpy++) +
3586 066 ((mp_word) u);
3587 067
3588 068 /* the new column is the lower part of the result */
3589 069 *tmpt++ = (mp_digit) (r & ((mp_word) MP_MASK));
3590 070
3591 071 /* get the carry word from the result */
3592 072 u = (mp_digit) (r >> ((mp_word) DIGIT_BIT));
3593 073 \}
3594 074 /* set carry if it is placed below digs */
3595 075 if (ix + iy < digs) \{
3596 076 *tmpt = u;
3597 077 \}
3598 078 \}
3599 079
3600 080 mp_clamp (&t);
3601 081 mp_exch (&t, c);
3602 082
3603 083 mp_clear (&t);
3604 084 return MP_OKAY;
3605 085 \}
3606 \end{alltt}
3607 \end{small}
3608
3609 Lines 31 to 35 determine if the Comba method can be used first. The conditions for using the Comba routine are that min$(a.used, b.used) < \delta$ and
3610 the number of digits of output is less than \textbf{MP\_WARRAY}. This new constant is used to control
3611 the stack usage in the Comba routines. By default it is set to $\delta$ but can be reduced when memory is at a premium.
3612
3613 Of particular importance is the calculation of the $ix+iy$'th column on lines 64, 65 and 66. Note how all of the
3614 variables are cast to the type \textbf{mp\_word}, which is also the type of variable $\hat r$. That is to ensure that double precision operations
3615 are used instead of single precision. The multiplication on line 65 makes use of a specific GCC optimizer behaviour. At the outset it looks like
3616 the compiler will have to use a full double precision multiplication to produce the result required. Such an operation would be horribly slow on most
3617 processors and drag this algorithm to a crawl. However, GCC is smart enough to realize that double wide output single precision multipliers can be used. For
3618 example, the instruction ``MUL'' on the x86 processor can multiply two 32-bit values and produce a 64-bit result.
3619
3620 \subsection{Faster Multiplication by the ``Comba'' Method}
3621
3622 One of the huge drawbacks of the ``baseline'' algorithms is that at the $O(n^2)$ level the carry must be computed and propagated upwards. This
3623 makes the nested loop very sequential and hard to unroll and implement in parallel. The ``Comba'' \cite{COMBA} method is named after the little known
3624 (\textit{in cryptographic venues}) Paul G. Comba, who described a method of implementing fast multipliers that do not require nested
3625 carry fixup operations. As an interesting aside it seems that Paul Barrett describes a similar technique in
3626 his 1986 paper \cite{BARRETT} written five years before.
3627
3628 At the heart of the Comba technique is once again the long-hand algorithm. Except in this case a slight twist is placed on how
3629 the columns of the result are produced. In the standard long-hand algorithm rows of products are produced then added together to form the
3630 final result. In the baseline algorithm the columns are added together after each iteration to get the result instantaneously.
3631
3632 In the Comba algorithm the columns of the result are produced entirely independently of each other. That is at the $O(n^2)$ level a
3633 simple multiplication and addition step is performed. The carries of the columns are propagated after the nested loop to reduce the amount
3634 of work required. Succinctly, the first step of the algorithm is to compute the product vector $\vec x$ as follows.
3635
3636 \begin{equation}
3637 \vec x_n = \sum_{i+j = n} a_ib_j, \forall n \in \lbrace 0, 1, 2, \ldots, a.used + b.used - 2 \rbrace
3638 \end{equation}
3639
3640 Where $\vec x_n$ is the $n'th$ column of the output vector. Consider the following example which computes the vector $\vec x$ for the multiplication
3641 of $576$ and $241$.
3642
3643 \newpage\begin{figure}[here]
3644 \begin{small}
3645 \begin{center}
3646 \begin{tabular}{|c|c|c|c|c|c|}
3647 \hline & & 5 & 7 & 6 & First Input\\
3648 \hline $\times$ & & 2 & 4 & 1 & Second Input\\
3649 \hline & & $1 \cdot 5 = 5$ & $1 \cdot 7 = 7$ & $1 \cdot 6 = 6$ & First pass \\
3650 & $4 \cdot 5 = 20$ & $4 \cdot 7+5=33$ & $4 \cdot 6+7=31$ & 6 & Second pass \\
3651 $2 \cdot 5 = 10$ & $2 \cdot 7 + 20 = 34$ & $2 \cdot 6+33=45$ & 31 & 6 & Third pass \\
3652 \hline 10 & 34 & 45 & 31 & 6 & Final Result \\
3653 \hline
3654 \end{tabular}
3655 \end{center}
3656 \end{small}
3657 \caption{Comba Multiplication Diagram}
3658 \end{figure}
3659
3660 At this point the vector $\vec x = \left < 10, 34, 45, 31, 6 \right >$ is the result of the first step of the Comba multiplier.
3661 Now the columns must be fixed by propagating the carry upwards. The resultant vector will have one extra dimension over the input vector which is
3662 equivalent to adding a leading zero digit.
3663
3664 \begin{figure}[!here]
3665 \begin{small}
3666 \begin{center}
3667 \begin{tabular}{l}
3668 \hline Algorithm \textbf{Comba Fixup}. \\
3669 \textbf{Input}. Vector $\vec x$ of dimension $k$ \\
3670 \textbf{Output}. Vector $\vec x$ such that the carries have been propagated. \\
3671 \hline \\
3672 1. for $n$ from $0$ to $k - 1$ do \\
3673 \hspace{3mm}1.1 $\vec x_{n+1} \leftarrow \vec x_{n+1} + \lfloor \vec x_{n}/\beta \rfloor$ \\
3674 \hspace{3mm}1.2 $\vec x_{n} \leftarrow \vec x_{n} \mbox{ (mod }\beta\mbox{)}$ \\
3675 2. Return($\vec x$). \\
3676 \hline
3677 \end{tabular}
3678 \end{center}
3679 \end{small}
3680 \caption{Algorithm Comba Fixup}
3681 \end{figure}
3682
3683 With that algorithm, $k = 5$ and $\beta = 10$, the following vector is produced: $\vec x= \left < 1, 3, 8, 8, 1, 6 \right >$. In this case
3684 $241 \cdot 576$ is in fact $138816$ and the procedure succeeded. If the algorithm is correct and, as will be demonstrated shortly, more
3685 efficient than the baseline algorithm, why not simply always use it?
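
Before answering that question, the fixup itself is small enough to trace directly in a few lines of plain C. The fragment below is illustrative
only; it works in base $10$ with ordinary integers rather than mp\_digits and reproduces the example above.

\begin{small}
\begin{alltt}
#include <stdio.h>

int main(void)
\{
   /* columns of 576 * 241 before fixup, least significant first,
    * with one extra slot reserved for the final carry
    */
   long x[6] = \{ 6, 31, 45, 34, 10, 0 \};
   int  n;

   /* algorithm Comba Fixup with k = 5 and beta = 10 */
   for (n = 0; n < 5; n++) \{
      x[n + 1] += x[n] / 10;   /* forward the carry         */
      x[n]      = x[n] % 10;   /* keep the digit below beta */
   \}

   /* prints "6 1 8 8 3 1", that is 138816 least significant first */
   for (n = 0; n < 6; n++) \{
      printf("%ld ", x[n]);
   \}
   puts("");
   return 0;
\}
\end{alltt}
\end{small}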
3686
3687 \subsubsection{Column Weight.}
3688 At the nested $O(n^2)$ level the Comba method adds the product of two single precision variables to each column of the output
3689 independently. A serious obstacle arises if a carry is lost due to a lack of precision before the algorithm has a chance to fix
3690 the carries. For example, in the multiplication of two three-digit numbers the third column of output will be the sum of
3691 three single precision multiplications. If the precision of the accumulator for the output digits is less than $3 \cdot (\beta - 1)^2$ then
3692 an overflow can occur and the carry information will be lost. For any $m$ and $n$ digit inputs the maximum weight of any column is
3693 min$(m, n)$, which is fairly obvious.
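
Concretely, with two three-digit inputs the heaviest column is the third one,

\begin{equation}
w_2 = a_0b_2 + a_1b_1 + a_2b_0 \le 3 \cdot \left ( \beta - 1 \right )^2
\end{equation}

which is precisely the case described above.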
3694
3695 The maximum number of terms in any column of a product is known as the ``column weight'' and strictly governs when the algorithm can be used. Recall
3696 from earlier that a double precision type has $\alpha$ bits of resolution and a single precision digit has $lg(\beta)$ bits of precision. Given these
3697 two quantities we must not violate the following
3698
3699 \begin{equation}
3700 k \cdot \left (\beta - 1 \right )^2 < 2^{\alpha}
3701 \end{equation}
3702
3703 Which reduces to
3704
3705 \begin{equation}
3706 k \cdot \left ( \beta^2 - 2\beta + 1 \right ) < 2^{\alpha}
3707 \end{equation}
3708
3709 Let $\rho = lg(\beta)$ represent the number of bits in a single precision digit. By further re-arrangement of the equation the final solution is
3710 found.
3711
3712 \begin{equation}
3713 k < {{2^{\alpha}} \over {\left (2^{2\rho} - 2^{\rho + 1} + 1 \right )}}
3714 \end{equation}
3715
3716 The defaults for LibTomMath are $\beta = 2^{28}$ and $\alpha = 64$ which means that $k$ is bounded by $k < 257$. In this configuration
3717 the smaller input may not have more than $256$ digits if the Comba method is to be used. This is quite satisfactory for most applications since
3718 $256$ digits would allow for numbers in the range of $0 \le x < 2^{7168}$ which is much larger than most public key cryptographic algorithms require.
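
Substituting the default values $\rho = 28$ and $\alpha = 64$ into the previous inequality shows where this bound comes from.

\begin{equation}
k < {{2^{64}} \over {2^{56} - 2^{29} + 1}} \approx 256.0000019
\end{equation}

Since $k$ is an integer the largest permissible value is $k = 256$, hence $k < 257$.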
3719
3720 \newpage\begin{figure}[!here]
3721 \begin{small}
3722 \begin{center}
3723 \begin{tabular}{l}
3724 \hline Algorithm \textbf{fast\_s\_mp\_mul\_digs}. \\
3725 \textbf{Input}. mp\_int $a$, mp\_int $b$ and an integer $digs$ \\
3726 \textbf{Output}. $c \leftarrow \vert a \vert \cdot \vert b \vert \mbox{ (mod }\beta^{digs}\mbox{)}$. \\
3727 \hline \\
3728 Place an array of \textbf{MP\_WARRAY} double precision digits named $\hat W$ on the stack. \\
3729 1. If $c.alloc < digs$ then grow $c$ to $digs$ digits. (\textit{mp\_grow}) \\
3730 2. If step 1 failed return(\textit{MP\_MEM}).\\
3731 \\
3732 Zero the temporary array $\hat W$. \\
3733 3. for $n$ from $0$ to $digs - 1$ do \\
3734 \hspace{3mm}3.1 $\hat W_n \leftarrow 0$ \\
3735 \\
3736 Compute the columns. \\
3737 4. for $ix$ from $0$ to $a.used - 1$ do \\
3738 \hspace{3mm}4.1 $pb \leftarrow \mbox{min}(b.used, digs - ix)$ \\
3739 \hspace{3mm}4.2 If $pb < 1$ then goto step 5. \\
3740 \hspace{3mm}4.3 for $iy$ from $0$ to $pb - 1$ do \\
3741 \hspace{6mm}4.3.1 $\hat W_{ix+iy} \leftarrow \hat W_{ix+iy} + a_{ix}b_{iy}$ \\
3742 \\
3743 Propagate the carries upwards. \\
3744 5. $oldused \leftarrow c.used$ \\
3745 6. $c.used \leftarrow digs$ \\
3746 7. If $digs > 1$ then do \\
3747 \hspace{3mm}7.1. for $ix$ from $1$ to $digs - 1$ do \\
3748 \hspace{6mm}7.1.1 $\hat W_{ix} \leftarrow \hat W_{ix} + \lfloor \hat W_{ix-1} / \beta \rfloor$ \\
3749 \hspace{6mm}7.1.2 $c_{ix - 1} \leftarrow \hat W_{ix - 1} \mbox{ (mod }\beta\mbox{)}$ \\
3750 8. else do \\
3751 \hspace{3mm}8.1 $ix \leftarrow 0$ \\
3752 9. $c_{ix} \leftarrow \hat W_{ix} \mbox{ (mod }\beta\mbox{)}$ \\
3753 \\
3754 Zero excess digits. \\
3755 10. If $digs < oldused$ then do \\
3756 \hspace{3mm}10.1 for $n$ from $digs$ to $oldused - 1$ do \\
3757 \hspace{6mm}10.1.1 $c_n \leftarrow 0$ \\
3758 11. Clamp excessive digits of $c$. (\textit{mp\_clamp}) \\
3759 12. Return(\textit{MP\_OKAY}). \\
3760 \hline
3761 \end{tabular}
3762 \end{center}
3763 \end{small}
3764 \caption{Algorithm fast\_s\_mp\_mul\_digs}
3765 \label{fig:COMBAMULT}
3766 \end{figure}
3767
3768 \textbf{Algorithm fast\_s\_mp\_mul\_digs.}
3769 This algorithm performs the unsigned multiplication of $a$ and $b$ using the Comba method limited to $digs$ digits of precision. The algorithm
3770 essentially performs the same calculation as algorithm s\_mp\_mul\_digs, just much faster.
3771
3772 The array $\hat W$ is meant to be on the stack when the algorithm is used. The size of the array does not change, which is ideal. Note also that
3773 unlike algorithm s\_mp\_mul\_digs no temporary mp\_int is required since the result is calculated directly in $\hat W$.
3774
3775 The $O(n^2)$ loop on step four is where the Comba method's advantages begin to show through in comparison to the baseline algorithm. The lack of
3776 a carry variable or propagation in this loop allows the loop to be performed with only single precision multiplications and additions. Now that each
3777 iteration of the inner loop can be performed independently of the others, the inner loop can be executed with a high level of parallelism.
3778
3779 To measure the benefits of the Comba method over the baseline method consider the number of operations that are required. If the
3780 cost in terms of time of a multiply and addition is $p$ and the cost of a carry propagation is $q$ then a baseline multiplication would require
3781 $O \left ((p + q)n^2 \right )$ time to multiply two $n$-digit numbers. The Comba method requires only $O(pn^2 + qn)$ time; however, in practice
3782 the speed increase is actually much greater. With $O(n)$ space the algorithm can be reduced to $O(pn + qn)$ time by implementing the $n$ multiply
3783 and addition operations in the nested loop in parallel.
3784
3785 \vspace{+3mm}\begin{small}
3786 \hspace{-5.1mm}{\bf File}: bn\_fast\_s\_mp\_mul\_digs.c
3787 \vspace{-3mm}
3788 \begin{alltt}
3789 016
3790 017 /* Fast (comba) multiplier
3791 018 *
3792 019 * This is the fast column-array [comba] multiplier. It is
3793 020 * designed to compute the columns of the product first
3794 021 * then handle the carries afterwards. This has the effect
3795 022 * of making the nested loops that compute the columns very
3796 023 * simple and schedulable on super-scalar processors.
3797 024 *
3798 025 * This has been modified to produce a variable number of
3799 026 * digits of output so if say only a half-product is required
3800 027 * you don't have to compute the upper half (a feature
3801 028 * required for fast Barrett reduction).
3802 029 *
3803 030 * Based on Algorithm 14.12 on pp.595 of HAC.
3804 031 *
3805 032 */
3806 033 int
3807 034 fast_s_mp_mul_digs (mp_int * a, mp_int * b, mp_int * c, int digs)
3808 035 \{
3809 036 int olduse, res, pa, ix;
3810 037 mp_word W[MP_WARRAY];
3811 038
3812 039 /* grow the destination as required */
3813 040 if (c->alloc < digs) \{
3814 041 if ((res = mp_grow (c, digs)) != MP_OKAY) \{
3815 042 return res;
3816 043 \}
3817 044 \}
3818 045
3819 046 /* clear temp buf (the columns) */
3820 047 memset (W, 0, sizeof (mp_word) * digs);
3821 048
3822 049 /* calculate the columns */
3823 050 pa = a->used;
3824 051 for (ix = 0; ix < pa; ix++) \{
3825 052 /* this multiplier has been modified to allow you to
3826 053 * control how many digits of output are produced.
3827 054 * So at most we want to make upto "digs" digits of output.
3828 055 *
3829 056 * this adds products to distinct columns (at ix+iy) of W
3830 057 * note that each step through the loop is not dependent on
3831 058 * the previous which means the compiler can easily unroll
3832 059 * the loop without scheduling problems
3833 060 */
3834 061 \{
3835 062 register mp_digit tmpx, *tmpy;
3836 063 register mp_word *_W;
3837 064 register int iy, pb;
3838 065
3839 066 /* alias for the the word on the left e.g. A[ix] * A[iy] */
3840 067 tmpx = a->dp[ix];
3841 068
3842 069 /* alias for the right side */
3843 070 tmpy = b->dp;
3844 071
3845 072 /* alias for the columns, each step through the loop adds a new
3846 073 term to each column
3847 074 */
3848 075 _W = W + ix;
3849 076
3850 077 /* the number of digits is limited by their placement. E.g.
3851 078 we avoid multiplying digits that will end up above the # of
3852 079 digits of precision requested
3853 080 */
3854 081 pb = MIN (b->used, digs - ix);
3855 082
3856 083 for (iy = 0; iy < pb; iy++) \{
3857 084 *_W++ += ((mp_word)tmpx) * ((mp_word)*tmpy++);
3858 085 \}
3859 086 \}
3860 087
3861 088 \}
3862 089
3863 090 /* setup dest */
3864 091 olduse = c->used;
3865 092 c->used = digs;
3866 093
3867 094 \{
3868 095 register mp_digit *tmpc;
3869 096
3870 097 /* At this point W[] contains the sums of each column. To get the
3871 098 * correct result we must take the extra bits from each column and
3872 099 * carry them down
3873 100 *
3874 101 * Note that while this adds extra code to the multiplier it
3875 102 * saves time since the carry propagation is removed from the
3876 103 * above nested loop.This has the effect of reducing the work
3877 104 * from N*(N+N*c)==N**2 + c*N**2 to N**2 + N*c where c is the
3878 105 * cost of the shifting. On very small numbers this is slower
3879 106 * but on most cryptographic size numbers it is faster.
3880 107 *
3881 108 * In this particular implementation we feed the carries from
3882 109 * behind which means when the loop terminates we still have one
3883 110 * last digit to copy
3884 111 */
3885 112 tmpc = c->dp;
3886 113 for (ix = 1; ix < digs; ix++) \{
3887 114 /* forward the carry from the previous temp */
3888 115 W[ix] += (W[ix - 1] >> ((mp_word) DIGIT_BIT));
3889 116
3890 117 /* now extract the previous digit [below the carry] */
3891 118 *tmpc++ = (mp_digit) (W[ix - 1] & ((mp_word) MP_MASK));
3892 119 \}
3893 120 /* fetch the last digit */
3894 121 *tmpc++ = (mp_digit) (W[digs - 1] & ((mp_word) MP_MASK));
3895 122
3896 123 /* clear unused digits [that existed in the old copy of c] */
3897 124 for (; ix < olduse; ix++) \{
3898 125 *tmpc++ = 0;
3899 126 \}
3900 127 \}
3901 128 mp_clamp (c);
3902 129 return MP_OKAY;
3903 130 \}
3904 \end{alltt}
3905 \end{small}
3906
3907 The memset on line 47 clears the initial $\hat W$ array to zero in a single step. Like the slower baseline multiplication
3908 implementation a series of aliases (\textit{lines 67, 70 and 75}) are used to simplify the inner $O(n^2)$ loop.
3909 In this case a new alias $\_\hat W$ has been added which refers to the double precision columns offset by $ix$ in each pass.
3910
3911 The inner loop on lines 83, 84 and 85 is where the algorithm will spend the majority of the time, which is why it has been
3912 stripped to the bones of any extra baggage\footnote{Hence the pointer aliases.}. On x86 processors the multiplication and additions amount to at the
3913 very least five instructions (\textit{two loads, two additions, one multiply}) while on the ARMv4 processors they amount to only three
3914 (\textit{one load, one store, one multiply-add}). For both the x86 and ARMv4 processors the GCC compiler does a good job of unrolling the loop
3915 and scheduling the instructions so that there are very few dependency stalls.
3916
3917 In theory the difference between the baseline and Comba algorithms is a mere $O(qn)$ time difference. However, in the $O(n^2)$ nested loop of the
3918 baseline method there are dependency stalls as the algorithm must wait for the multiplier to finish before propagating the carry to the next
3919 digit. As a result fewer of the multiple execution units\footnote{The AMD Athlon has three execution units and the Intel P4 has four.} found on modern processors can
3920 be used simultaneously.
3921
3922 \subsection{Polynomial Basis Multiplication}
3923 To break the $O(n^2)$ barrier in multiplication requires a completely different look at integer multiplication. In the following algorithms
3924 the use of polynomial basis representation for two integers $a$ and $b$ as $f(x) = \sum_{i=0}^{n} a_i x^i$ and
3925 $g(x) = \sum_{i=0}^{n} b_i x^i$ respectively, is required. In this system both $f(x)$ and $g(x)$ have $n + 1$ terms and are of the $n$'th degree.
3926
3927 The product $a \cdot b \equiv f(x)g(x)$ is the polynomial $W(x) = \sum_{i=0}^{2n} w_i x^i$. The coefficients $w_i$ will
3928 directly yield the desired product when $\beta$ is substituted for $x$. The direct solution to solve for the $2n + 1$ coefficients
3929 requires $O(n^2)$ time and would in practice be slower than the Comba technique.
3930
3931 However, numerical analysis theory indicates that only $2n + 1$ distinct points in $W(x)$ are required to determine the values of the $2n + 1$ unknown
3932 coefficients. This means by finding $\zeta_y = W(y)$ for $2n + 1$ small values of $y$ the coefficients of $W(x)$ can be found with
3933 Gaussian elimination. This technique is also occasionally referred to as the \textit{interpolation technique} (\textit{references please...}) since in
3934 effect an interpolation based on $2n + 1$ points will yield a polynomial equivalent to $W(x)$.
3935
3936 The coefficients of the polynomial $W(x)$ are unknown which makes finding $W(y)$ for any value of $y$ impossible. However, since
3937 $W(x) = f(x)g(x)$ the equivalent $\zeta_y = f(y) g(y)$ can be used in its place. The benefit of this technique stems from the
3938 fact that $f(y)$ and $g(y)$ are much smaller than either $a$ or $b$ respectively. As a result finding the $2n + 1$ relations required
3939 by multiplying $f(y)g(y)$ involves multiplying integers that are much smaller than either of the inputs.
3940
3941 When picking points to gather relations there are always three obvious points to choose, $y = 0, 1$ and $ \infty$. The $\zeta_0$ term
3942 is simply the product $W(0) = w_0 = a_0 \cdot b_0$. The $\zeta_1$ term is the product
3943 $W(1) = \left (\sum_{i = 0}^{n} a_i \right ) \left (\sum_{i = 0}^{n} b_i \right )$. The third point $\zeta_{\infty}$ is less obvious but rather
3944 simple to explain. The $2n + 1$'th coefficient of $W(x)$ is numerically equivalent to the most significant column in an integer multiplication.
3945 The point at $\infty$ is used symbolically to represent the most significant column, that is $W(\infty) = w_{2n} = a_nb_n$. Note that the
3946 points at $y = 0$ and $\infty$ yield the coefficients $w_0$ and $w_{2n}$ directly.
3947
3948 If more points are required they should be of small values and powers of two such as $2^q$ and the related \textit{mirror points}
3949 $\left (2^q \right )^{2n} \cdot \zeta_{2^{-q}}$ for small values of $q$. The term ``mirror point'' stems from the fact that
3950 $\left (2^q \right )^{2n} \cdot \zeta_{2^{-q}}$ can be calculated in the exact opposite fashion as $\zeta_{2^q}$. For
3951 example, when $n = 2$ and $q = 1$ the following two equations are equivalent to the point $\zeta_{2}$ and its mirror.
3952
3953 \begin{eqnarray}
3954 \zeta_{2} = f(2)g(2) = (4a_2 + 2a_1 + a_0)(4b_2 + 2b_1 + b_0) \nonumber \\
3955 16 \cdot \zeta_{1 \over 2} = 4f({1\over 2}) \cdot 4g({1 \over 2}) = (a_2 + 2a_1 + 4a_0)(b_2 + 2b_1 + 4b_0)
3956 \end{eqnarray}
3957
3958 Using such points will allow the values of $f(y)$ and $g(y)$ to be independently calculated using only left shifts. For example, when $n = 2$ the
3959 polynomial $f(2^q)$ is equal to $2^q((2^qa_2) + a_1) + a_0$. This technique of polynomial evaluation is known as Horner's method.
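
As an illustration (this fragment is not part of the library), the evaluation for $n = 2$ can be written in a couple of lines of C using fixed
width types; the coefficients are assumed to be small enough that the result fits in $64$ bits.

\begin{small}
\begin{alltt}
#include <stdint.h>

/* evaluate f(2**q) = 2**q * ((2**q * a2) + a1) + a0 using only
 * shifts and additions (Horner's method)
 */
static uint64_t eval_f(uint64_t a2, uint64_t a1, uint64_t a0, int q)
\{
   return (((a2 << q) + a1) << q) + a0;
\}
\end{alltt}
\end{small}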
3960
3961 As a general rule, when the inputs are split into $n$ parts each, the algorithm requires $2n - 1$ multiplications. Each multiplication is of
3962 multiplicands that have $n$ times fewer digits than the inputs. The asymptotic running time of this algorithm is
3963 $O \left ( k^{lg_n(2n - 1)} \right )$ for $k$ digit inputs (\textit{assuming they have the same number of digits}). Figure~\ref{fig:exponent}
3964 summarizes the exponents for various values of $n$.
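
For instance, the first two rows of Figure~\ref{fig:exponent} follow directly from the formula.

\begin{equation}
lg_2(3) = {{\log 3} \over {\log 2}} \approx 1.584962501, \qquad lg_3(5) = {{\log 5} \over {\log 3}} \approx 1.464973520
\end{equation}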
3965
3966 \begin{figure}
3967 \begin{center}
3968 \begin{tabular}{|c|c|c|}
3969 \hline \textbf{Split into $n$ Parts} & \textbf{Exponent} & \textbf{Notes}\\
3970 \hline $2$ & $1.584962501$ & This is Karatsuba Multiplication. \\
3971 \hline $3$ & $1.464973520$ & This is Toom-Cook Multiplication. \\
3972 \hline $4$ & $1.403677461$ &\\
3973 \hline $5$ & $1.365212389$ &\\
3974 \hline $10$ & $1.278753601$ &\\
3975 \hline $100$ & $1.149426538$ &\\
3976 \hline $1000$ & $1.100270931$ &\\
3977 \hline $10000$ & $1.075252070$ &\\
3978 \hline
3979 \end{tabular}
3980 \end{center}
3981 \caption{Asymptotic Running Time of Polynomial Basis Multiplication}
3982 \label{fig:exponent}
3983 \end{figure}
3984
3985 At first it may seem like a good idea to choose $n = 1000$ since the exponent is approximately $1.1$. However, the overhead
3986 of solving for the 2001 terms of $W(x)$ will certainly consume any savings the algorithm could offer for all but exceedingly large
3987 numbers.
3988
3989 \subsubsection{Cutoff Point}
3990 The polynomial basis multiplication algorithms all require fewer single precision multiplications than a straight Comba approach. However,
3991 the algorithms incur an overhead (\textit{at the $O(n)$ work level}) since they require a system of equations to be solved. This makes the
3992 polynomial basis approach more costly to use with small inputs.
3993
3994 Let $m$ represent the number of digits in the multiplicands (\textit{assume both multiplicands have the same number of digits}). There exists a
3995 point $y$ such that when $m < y$ the polynomial basis algorithms are more costly than Comba, when $m = y$ they are roughly the same cost and
3996 when $m > y$ the Comba methods are slower than the polynomial basis algorithms.
3997
3998 The exact location of $y$ depends on several key architectural elements of the computer platform in question.
3999
4000 \begin{enumerate}
4001 \item The ratio of clock cycles for single precision multiplication versus other simpler operations such as addition, shifting, etc. For example
4002 on the AMD Athlon the ratio is roughly $17 : 1$ while on the Intel P4 it is $29 : 1$. The higher the ratio in favour of multiplication the lower
4003 the cutoff point $y$ will be.
4004
4005 \item The complexity of the linear system of equations (\textit{for the coefficients of $W(x)$}). Generally speaking, as the number of splits
4006 grows the complexity grows substantially. Ideally solving the system will only involve addition, subtraction and shifting of integers. This
4007 directly reflects on the ratio previously mentioned.
4008
4009 \item To a lesser extent memory bandwidth and function call overheads. Provided the values are in the processor cache this is less of an
4010 influence over the cutoff point.
4011
4012 \end{enumerate}
4013
4014 A clean cutoff point separation occurs when a point $y$ is found such that all of the cutoff point conditions are met. For example, if the point
4015 is too low then there will be values of $m$ such that $m > y$ and the Comba method is still faster. Finding the cutoff points is fairly simple when
4016 a high resolution timer is available.
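
The following rough sketch (not part of the library, with all error checking omitted) illustrates one way such a measurement can be made; it
assumes the mp\_rand helper and the global \textbf{KARATSUBA\_MUL\_CUTOFF} tuning variable exported by tommath.h, and simply times mp\_mul with
the Karatsuba code forced off and then forced on.

\begin{small}
\begin{alltt}
#include <time.h>
#include <tommath.h>

/* time 1000 products of two random "digits"-digit numbers with the
 * given Karatsuba cutoff; forcing the cutoff very high disables the
 * Karatsuba code so the two timings can be compared at each size
 */
static double time_mul(int digits, int cutoff)
\{
   mp_int  a, b, c;
   clock_t t1;
   int     ix;

   KARATSUBA_MUL_CUTOFF = cutoff;
   mp_init_multi(&a, &b, &c, NULL);
   mp_rand(&a, digits);
   mp_rand(&b, digits);

   t1 = clock();
   for (ix = 0; ix < 1000; ix++) \{
      mp_mul(&a, &b, &c);
   \}
   t1 = clock() - t1;

   mp_clear_multi(&a, &b, &c, NULL);
   return (double) t1 / CLOCKS_PER_SEC;
\}
\end{alltt}
\end{small}

Tabulating the two timings for increasing digit counts and noting where they cross gives a usable estimate of $y$.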
4017
4018 \subsection{Karatsuba Multiplication}
4019 Karatsuba \cite{KARA} multiplication when originally proposed in 1962 was among the first set of algorithms to break the $O(n^2)$ barrier for
4020 general purpose multiplication. Given two polynomial basis representations $f(x) = ax + b$ and $g(x) = cx + d$, Karatsuba proved with
4021 light algebra \cite{KARAP} that the following polynomial is equivalent to multiplication of the two integers the polynomials represent.
4022
4023 \begin{equation}
4024 f(x) \cdot g(x) = acx^2 + ((ac + bd) - (a - b)(c - d))x + bd
4025 \end{equation}
4026
4027 Using the observation that $ac$ and $bd$ can be re-used, only three half sized multiplications are required to produce the product. Applying
4028 this algorithm recursively, the work factor becomes $O(n^{lg(3)})$ which is substantially better than the work factor $O(n^2)$ of the Comba technique. It turns
4029 out that what Karatsuba did not know, or at least did not publish, was that this is simply polynomial basis multiplication with the points
4030 $\zeta_0$, $\zeta_{\infty}$ and $-\zeta_{-1}$. Consider the resultant system of equations.
4031
4032 \begin{center}
4033 \begin{tabular}{rcrcrcrc}
4034 $\zeta_{0}$ & $=$ & & & & & $w_0$ \\
4035 $-\zeta_{-1}$ & $=$ & $-w_2$ & $+$ & $w_1$ & $-$ & $w_0$ \\
4036 $\zeta_{\infty}$ & $=$ & $w_2$ & & & & \\
4037 \end{tabular}
4038 \end{center}
4039
4040 By adding the first and last equation to the equation in the middle the term $w_1$ can be isolated and all three coefficients solved for. The simplicity
4041 of this system of equations has made Karatsuba fairly popular. In fact the cutoff point is often fairly low\footnote{With LibTomMath 0.18 it is 70 and 109 digits for the Intel P4 and AMD Athlon respectively.}
4042 making it an ideal algorithm to speed up certain public key cryptosystems such as RSA and Diffie-Hellman. It is worth noting that the point
4043 $\zeta_1$ could be substituted for $-\zeta_{-1}$. In this case the first and third row are subtracted instead of added to the second row.
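
To see the three multiplication structure in isolation, the following fragment (illustrative only, not part of the library) applies the same
identity to a pair of $32$-bit quantities split into $16$-bit halves.

\begin{small}
\begin{alltt}
#include <stdint.h>

/* Karatsuba on 32-bit inputs split into 16-bit halves:
 *    x = a*2**16 + b,  y = c*2**16 + d
 *    x*y = ac*2**32 + (ac + bd - (a - b)*(c - d))*2**16 + bd
 */
static uint64_t kara32(uint32_t x, uint32_t y)
\{
   uint32_t a = x >> 16, b = x & 0xFFFF;
   uint32_t c = y >> 16, d = y & 0xFFFF;

   uint64_t ac = (uint64_t) a * c;
   uint64_t bd = (uint64_t) b * d;

   /* (a - b) and (c - d) may be negative so form the middle term
    * with signed arithmetic; the result itself is non-negative
    */
   int64_t  mid = (int64_t) ac + (int64_t) bd -
                  (int64_t) ((int32_t) a - (int32_t) b) *
                            ((int32_t) c - (int32_t) d);

   return (ac << 32) + ((uint64_t) mid << 16) + bd;
\}
\end{alltt}
\end{small}

The same three products $ac$, $bd$ and $(a - b)(c - d)$ appear in algorithm mp\_karatsuba\_mul below, only with mp\_int halves in place of the
$16$-bit quantities.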
4044
4045 \newpage\begin{figure}[!here]
4046 \begin{small}
4047 \begin{center}
4048 \begin{tabular}{l}
4049 \hline Algorithm \textbf{mp\_karatsuba\_mul}. \\
4050 \textbf{Input}. mp\_int $a$ and mp\_int $b$ \\
4051 \textbf{Output}. $c \leftarrow \vert a \vert \cdot \vert b \vert$ \\
4052 \hline \\
4053 1. Init the following mp\_int variables: $x0$, $x1$, $y0$, $y1$, $t1$, $x0y0$, $x1y1$.\\
4054 2. If step 1 failed then return(\textit{MP\_MEM}). \\
4055 \\
4056 Split the input. e.g. $a = x1 \cdot \beta^B + x0$ \\
4057 3. $B \leftarrow \mbox{min}(a.used, b.used)/2$ \\
4058 4. $x0 \leftarrow a \mbox{ (mod }\beta^B\mbox{)}$ (\textit{mp\_mod\_2d}) \\
4059 5. $y0 \leftarrow b \mbox{ (mod }\beta^B\mbox{)}$ \\
4060 6. $x1 \leftarrow \lfloor a / \beta^B \rfloor$ (\textit{mp\_rshd}) \\
4061 7. $y1 \leftarrow \lfloor b / \beta^B \rfloor$ \\
4062 \\
4063 Calculate the three products. \\
4064 8. $x0y0 \leftarrow x0 \cdot y0$ (\textit{mp\_mul}) \\
4065 9. $x1y1 \leftarrow x1 \cdot y1$ \\
4066 10. $t1 \leftarrow x1 - x0$ (\textit{mp\_sub}) \\
4067 11. $x0 \leftarrow y1 - y0$ \\
4068 12. $t1 \leftarrow t1 \cdot x0$ \\
4069 \\
4070 Calculate the middle term. \\
4071 13. $x0 \leftarrow x0y0 + x1y1$ \\
4072 14. $t1 \leftarrow x0 - t1$ \\
4073 \\
4074 Calculate the final product. \\
4075 15. $t1 \leftarrow t1 \cdot \beta^B$ (\textit{mp\_lshd}) \\
4076 16. $x1y1 \leftarrow x1y1 \cdot \beta^{2B}$ \\
4077 17. $t1 \leftarrow x0y0 + t1$ \\
4078 18. $c \leftarrow t1 + x1y1$ \\
4079 19. Clear all of the temporary variables. \\
4080 20. Return(\textit{MP\_OKAY}).\\
4081 \hline
4082 \end{tabular}
4083 \end{center}
4084 \end{small}
4085 \caption{Algorithm mp\_karatsuba\_mul}
4086 \end{figure}
4087
4088 \textbf{Algorithm mp\_karatsuba\_mul.}
4089 This algorithm computes the unsigned product of two inputs using the Karatsuba multiplication algorithm. It is loosely based on the description
4090 from Knuth \cite[pp. 294-295]{TAOCPV2}.
4091
4092 \index{radix point}
4093 In order to split the two inputs into their respective halves, a suitable \textit{radix point} must be chosen. The radix point chosen must
4094 be used for both of the inputs meaning that it must be smaller than the smallest input. Step 3 chooses the radix point $B$ as half of the
4095 smallest input \textbf{used} count. After the radix point is chosen the inputs are split into lower and upper halves. Step 4 and 5
4096 compute the lower halves. Step 6 and 7 computer the upper halves.
4097
4098 After the halves have been computed the three intermediate half-size products must be computed. Steps 8 and 9 compute the trivial products
4099 $x0 \cdot y0$ and $x1 \cdot y1$. The mp\_int $x0$ is used as a temporary variable after $x1 - x0$ has been computed. By using $x0$ instead
4100 of an additional temporary variable, the algorithm avoids one extra memory allocation operation.
4101
4102 The remaining steps 13 through 18 compute the Karatsuba polynomial through a variety of digit shifting and addition operations.
4103
4104 \vspace{+3mm}\begin{small}
4105 \hspace{-5.1mm}{\bf File}: bn\_mp\_karatsuba\_mul.c
4106 \vspace{-3mm}
4107 \begin{alltt}
4108 016
4109 017 /* c = |a| * |b| using Karatsuba Multiplication using
4110 018 * three half size multiplications
4111 019 *
4112 020 * Let B represent the radix [e.g. 2**DIGIT_BIT] and
4113 021 * let n represent half of the number of digits in
4114 022 * the min(a,b)
4115 023 *
4116 024 * a = a1 * B**n + a0
4117 025 * b = b1 * B**n + b0
4118 026 *
4119 027 * Then, a * b =>
4120 028 a1b1 * B**2n + ((a1 - a0)(b1 - b0) + a0b0 + a1b1) * B + a0b0
4121 029 *
4122 030 * Note that a1b1 and a0b0 are used twice and only need to be
4123 031 * computed once. So in total three half size (half # of
4124 032 * digit) multiplications are performed, a0b0, a1b1 and
4125 033 * (a1-b1)(a0-b0)
4126 034 *
4127 035 * Note that a multiplication of half the digits requires
4128 036 * 1/4th the number of single precision multiplications so in
4129 037 * total after one call 25% of the single precision multiplications
4130 038 * are saved. Note also that the call to mp_mul can end up back
4131 039 * in this function if the a0, a1, b0, or b1 are above the threshold.
4132 040 * This is known as divide-and-conquer and leads to the famous
4133 041 * O(N**lg(3)) or O(N**1.584) work which is asymptopically lower than
4134 042 * the standard O(N**2) that the baseline/comba methods use.
4135 043 * Generally though the overhead of this method doesn't pay off
4136 044 * until a certain size (N ~ 80) is reached.
4137 045 */
4138 046 int mp_karatsuba_mul (mp_int * a, mp_int * b, mp_int * c)
4139 047 \{
4140 048 mp_int x0, x1, y0, y1, t1, x0y0, x1y1;
4141 049 int B, err;
4142 050
4143 051 /* default the return code to an error */
4144 052 err = MP_MEM;
4145 053
4146 054 /* min # of digits */
4147 055 B = MIN (a->used, b->used);
4148 056
4149 057 /* now divide in two */
4150 058 B = B >> 1;
4151 059
4152 060 /* init copy all the temps */
4153 061 if (mp_init_size (&x0, B) != MP_OKAY)
4154 062 goto ERR;
4155 063 if (mp_init_size (&x1, a->used - B) != MP_OKAY)
4156 064 goto X0;
4157 065 if (mp_init_size (&y0, B) != MP_OKAY)
4158 066 goto X1;
4159 067 if (mp_init_size (&y1, b->used - B) != MP_OKAY)
4160 068 goto Y0;
4161 069
4162 070 /* init temps */
4163 071 if (mp_init_size (&t1, B * 2) != MP_OKAY)
4164 072 goto Y1;
4165 073 if (mp_init_size (&x0y0, B * 2) != MP_OKAY)
4166 074 goto T1;
4167 075 if (mp_init_size (&x1y1, B * 2) != MP_OKAY)
4168 076 goto X0Y0;
4169 077
4170 078 /* now shift the digits */
4171 079 x0.sign = x1.sign = a->sign;
4172 080 y0.sign = y1.sign = b->sign;
4173 081
4174 082 x0.used = y0.used = B;
4175 083 x1.used = a->used - B;
4176 084 y1.used = b->used - B;
4177 085
4178 086 \{
4179 087 register int x;
4180 088 register mp_digit *tmpa, *tmpb, *tmpx, *tmpy;
4181 089
4182 090 /* we copy the digits directly instead of using higher level functions
4183 091 * since we also need to shift the digits
4184 092 */
4185 093 tmpa = a->dp;
4186 094 tmpb = b->dp;
4187 095
4188 096 tmpx = x0.dp;
4189 097 tmpy = y0.dp;
4190 098 for (x = 0; x < B; x++) \{
4191 099 *tmpx++ = *tmpa++;
4192 100 *tmpy++ = *tmpb++;
4193 101 \}
4194 102
4195 103 tmpx = x1.dp;
4196 104 for (x = B; x < a->used; x++) \{
4197 105 *tmpx++ = *tmpa++;
4198 106 \}
4199 107
4200 108 tmpy = y1.dp;
4201 109 for (x = B; x < b->used; x++) \{
4202 110 *tmpy++ = *tmpb++;
4203 111 \}
4204 112 \}
4205 113
4206 114 /* only need to clamp the lower words since by definition the
4207 115 * upper words x1/y1 must have a known number of digits
4208 116 */
4209 117 mp_clamp (&x0);
4210 118 mp_clamp (&y0);
4211 119
4212 120 /* now calc the products x0y0 and x1y1 */
4213 121 /* after this x0 is no longer required, free temp [x0==t2]! */
4214 122 if (mp_mul (&x0, &y0, &x0y0) != MP_OKAY)
4215 123 goto X1Y1; /* x0y0 = x0*y0 */
4216 124 if (mp_mul (&x1, &y1, &x1y1) != MP_OKAY)
4217 125 goto X1Y1; /* x1y1 = x1*y1 */
4218 126
4219 127 /* now calc x1-x0 and y1-y0 */
4220 128 if (mp_sub (&x1, &x0, &t1) != MP_OKAY)
4221 129 goto X1Y1; /* t1 = x1 - x0 */
4222 130 if (mp_sub (&y1, &y0, &x0) != MP_OKAY)
4223 131 goto X1Y1; /* t2 = y1 - y0 */
4224 132 if (mp_mul (&t1, &x0, &t1) != MP_OKAY)
4225 133 goto X1Y1; /* t1 = (x1 - x0) * (y1 - y0) */
4226 134
4227 135 /* add x0y0 */
4228 136 if (mp_add (&x0y0, &x1y1, &x0) != MP_OKAY)
4229 137 goto X1Y1; /* t2 = x0y0 + x1y1 */
4230 138 if (mp_sub (&x0, &t1, &t1) != MP_OKAY)
4231 139 goto X1Y1; /* t1 = x0y0 + x1y1 - (x1-x0)*(y1-y0) */
4232 140
4233 141 /* shift by B */
4234 142 if (mp_lshd (&t1, B) != MP_OKAY)
4235 143 goto X1Y1; /* t1 = (x0y0 + x1y1 - (x1-x0)*(y1-y0))<<B */
4236 144 if (mp_lshd (&x1y1, B * 2) != MP_OKAY)
4237 145 goto X1Y1; /* x1y1 = x1y1 << 2*B */
4238 146
4239 147 if (mp_add (&x0y0, &t1, &t1) != MP_OKAY)
4240 148 goto X1Y1; /* t1 = x0y0 + t1 */
4241 149 if (mp_add (&t1, &x1y1, c) != MP_OKAY)
4242 150 goto X1Y1; /* t1 = x0y0 + t1 + x1y1 */
4243 151
4244 152 /* Algorithm succeeded set the return code to MP_OKAY */
4245 153 err = MP_OKAY;
4246 154
4247 155 X1Y1:mp_clear (&x1y1);
4248 156 X0Y0:mp_clear (&x0y0);
4249 157 T1:mp_clear (&t1);
4250 158 Y1:mp_clear (&y1);
4251 159 Y0:mp_clear (&y0);
4252 160 X1:mp_clear (&x1);
4253 161 X0:mp_clear (&x0);
4254 162 ERR:
4255 163 return err;
4256 164 \}
4257 \end{alltt}
4258 \end{small}
4259
4260 The new coding element in this routine, not seen in previous routines, is the usage of goto statements. The conventional
4261 wisdom is that goto statements should be avoided. This is generally true; however, when every single function call can fail, it makes sense
4262 to handle error recovery with a single piece of code. Lines 61 to 75 handle initializing all of the temporary variables
4263 required. Note how each of the if statements goes to a different label in case of failure. This allows the routine to correctly free only
4264 the temporaries that have been successfully allocated so far.
4265
4266 The temporary variables are all initialized using the mp\_init\_size routine since they are expected to be large. This saves the
4267 additional reallocation that would have been necessary. Also $x0$, $x1$, $y0$ and $y1$ have to be able to hold at least their respective
4268 number of digits for the next section of code.
4269
4270 The first algebraic portion of the algorithm is to split the two inputs into their halves. However, instead of using mp\_mod\_2d and mp\_rshd
4271 to extract the halves, the respective code has been placed inline within the body of the function. To initialize the halves, the \textbf{used} and
4272 \textbf{sign} members are copied first. The first for loop on line 98 copies the lower halves. Since they are both the same magnitude it
4273 is simpler to calculate both lower halves in a single loop. The for loops on lines 104 and 109 calculate the upper halves $x1$ and
4274 $y1$ respectively.
4275
4276 By inlining the calculation of the halves, the Karatsuba multiplier has a slightly lower overhead and can be used for smaller magnitude inputs.
4277
4278 When line 153 is reached, the algorithm has completed successfully. The ``error status'' variable $err$ is set to \textbf{MP\_OKAY} so that
4279 the same code that handles errors can be used to clear the temporary variables and return.
4280
4281 \subsection{Toom-Cook $3$-Way Multiplication}
4282 Toom-Cook $3$-Way \cite{TOOM} multiplication is essentially the polynomial basis algorithm for $n = 2$ except that the points are
4283 chosen such that $\zeta$ is easy to compute and the resulting system of equations easy to reduce. Here, the points $\zeta_{0}$,
4284 $16 \cdot \zeta_{1 \over 2}$, $\zeta_1$, $\zeta_2$ and $\zeta_{\infty}$ make up the five required points to solve for the coefficients
4285 of $W(x)$.
4286
4287 With the five relations that Toom-Cook specifies, the following system of equations is formed.
4288
4289 \begin{center}
4290 \begin{tabular}{rcrcrcrcrcr}
4291 $\zeta_0$ & $=$ & $0w_4$ & $+$ & $0w_3$ & $+$ & $0w_2$ & $+$ & $0w_1$ & $+$ & $1w_0$ \\
4292 $16 \cdot \zeta_{1 \over 2}$ & $=$ & $1w_4$ & $+$ & $2w_3$ & $+$ & $4w_2$ & $+$ & $8w_1$ & $+$ & $16w_0$ \\
4293 $\zeta_1$ & $=$ & $1w_4$ & $+$ & $1w_3$ & $+$ & $1w_2$ & $+$ & $1w_1$ & $+$ & $1w_0$ \\
4294 $\zeta_2$ & $=$ & $16w_4$ & $+$ & $8w_3$ & $+$ & $4w_2$ & $+$ & $2w_1$ & $+$ & $1w_0$ \\
4295 $\zeta_{\infty}$ & $=$ & $1w_4$ & $+$ & $0w_3$ & $+$ & $0w_2$ & $+$ & $0w_1$ & $+$ & $0w_0$ \\
4296 \end{tabular}
4297 \end{center}
4298
4299 A trivial solution to this matrix requires $12$ subtractions, two multiplications by a small power of two, two divisions by a small power
4300 of two, two divisions by three and one multiplication by three. All of these $19$ sub-operations require less than quadratic time, meaning that
4301 the algorithm can be faster than a baseline multiplication. However, the greater complexity of this algorithm places the cutoff point
4302 (\textbf{TOOM\_MUL\_CUTOFF}) where Toom-Cook becomes more efficient much higher than the Karatsuba cutoff point.
4303
4304 \begin{figure}[!here]
4305 \begin{small}
4306 \begin{center}
4307 \begin{tabular}{l}
4308 \hline Algorithm \textbf{mp\_toom\_mul}. \\
4309 \textbf{Input}. mp\_int $a$ and mp\_int $b$ \\
4310 \textbf{Output}. $c \leftarrow a \cdot b $ \\
4311 \hline \\
4312 Split $a$ and $b$ into three pieces. E.g. $a = a_2 \beta^{2k} + a_1 \beta^{k} + a_0$ \\
4313 1. $k \leftarrow \lfloor \mbox{min}(a.used, b.used) / 3 \rfloor$ \\
4314 2. $a_0 \leftarrow a \mbox{ (mod }\beta^{k}\mbox{)}$ \\
4315 3. $a_1 \leftarrow \lfloor a / \beta^k \rfloor$, $a_1 \leftarrow a_1 \mbox{ (mod }\beta^{k}\mbox{)}$ \\
4316 4. $a_2 \leftarrow \lfloor a / \beta^{2k} \rfloor$, $a_2 \leftarrow a_2 \mbox{ (mod }\beta^{k}\mbox{)}$ \\
4317 5. $b_0 \leftarrow b \mbox{ (mod }\beta^{k}\mbox{)}$ \\
4318 6. $b_1 \leftarrow \lfloor b / \beta^k \rfloor$, $b_1 \leftarrow b_1 \mbox{ (mod }\beta^{k}\mbox{)}$ \\
4319 7. $b_2 \leftarrow \lfloor b / \beta^{2k} \rfloor$, $b_2 \leftarrow b_2 \mbox{ (mod }\beta^{k}\mbox{)}$ \\
4320 \\
4321 Find the five equations for $w_0, w_1, ..., w_4$. \\
4322 8. $w_0 \leftarrow a_0 \cdot b_0$ \\
4323 9. $w_4 \leftarrow a_2 \cdot b_2$ \\
4324 10. $tmp_1 \leftarrow 2 \cdot a_0$, $tmp_1 \leftarrow a_1 + tmp_1$, $tmp_1 \leftarrow 2 \cdot tmp_1$, $tmp_1 \leftarrow tmp_1 + a_2$ \\
4325 11. $tmp_2 \leftarrow 2 \cdot b_0$, $tmp_2 \leftarrow b_1 + tmp_2$, $tmp_2 \leftarrow 2 \cdot tmp_2$, $tmp_2 \leftarrow tmp_2 + b_2$ \\
4326 12. $w_1 \leftarrow tmp_1 \cdot tmp_2$ \\
4327 13. $tmp_1 \leftarrow 2 \cdot a_2$, $tmp_1 \leftarrow a_1 + tmp_1$, $tmp_1 \leftarrow 2 \cdot tmp_1$, $tmp_1 \leftarrow tmp_1 + a_0$ \\
4328 14. $tmp_2 \leftarrow 2 \cdot b_2$, $tmp_2 \leftarrow b_1 + tmp_2$, $tmp_2 \leftarrow 2 \cdot tmp_2$, $tmp_2 \leftarrow tmp_2 + b_0$ \\
4329 15. $w_3 \leftarrow tmp_1 \cdot tmp_2$ \\
4330 16. $tmp_1 \leftarrow a_0 + a_1$, $tmp_1 \leftarrow tmp_1 + a_2$, $tmp_2 \leftarrow b_0 + b_1$, $tmp_2 \leftarrow tmp_2 + b_2$ \\
4331 17. $w_2 \leftarrow tmp_1 \cdot tmp_2$ \\
4332 \\
4333 Continued on the next page.\\
4334 \hline
4335 \end{tabular}
4336 \end{center}
4337 \end{small}
4338 \caption{Algorithm mp\_toom\_mul}
4339 \end{figure}
4340
4341 \newpage\begin{figure}[!here]
4342 \begin{small}
4343 \begin{center}
4344 \begin{tabular}{l}
4345 \hline Algorithm \textbf{mp\_toom\_mul} (continued). \\
4346 \textbf{Input}. mp\_int $a$ and mp\_int $b$ \\
4347 \textbf{Output}. $c \leftarrow a \cdot b $ \\
4348 \hline \\
4349 Now solve the system of equations. \\
4350 18. $w_1 \leftarrow w_4 - w_1$, $w_3 \leftarrow w_3 - w_0$ \\
4351 19. $w_1 \leftarrow \lfloor w_1 / 2 \rfloor$, $w_3 \leftarrow \lfloor w_3 / 2 \rfloor$ \\
4352 20. $w_2 \leftarrow w_2 - w_0$, $w_2 \leftarrow w_2 - w_4$ \\
4353 21. $w_1 \leftarrow w_1 - w_2$, $w_3 \leftarrow w_3 - w_2$ \\
4354 22. $tmp_1 \leftarrow 8 \cdot w_0$, $w_1 \leftarrow w_1 - tmp_1$, $tmp_1 \leftarrow 8 \cdot w_4$, $w_3 \leftarrow w_3 - tmp_1$ \\
4355 23. $w_2 \leftarrow 3 \cdot w_2$, $w_2 \leftarrow w_2 - w_1$, $w_2 \leftarrow w_2 - w_3$ \\
4356 24. $w_1 \leftarrow w_1 - w_2$, $w_3 \leftarrow w_3 - w_2$ \\
4357 25. $w_1 \leftarrow \lfloor w_1 / 3 \rfloor, w_3 \leftarrow \lfloor w_3 / 3 \rfloor$ \\
4358 \\
4359 Now substitute $\beta^k$ for $x$ by shifting $w_0, w_1, ..., w_4$. \\
4360 26. for $n$ from $1$ to $4$ do \\
4361 \hspace{3mm}26.1 $w_n \leftarrow w_n \cdot \beta^{nk}$ \\
4362 27. $c \leftarrow w_0 + w_1$, $c \leftarrow c + w_2$, $c \leftarrow c + w_3$, $c \leftarrow c + w_4$ \\
4363 28. Return(\textit{MP\_OKAY}) \\
4364 \hline
4365 \end{tabular}
4366 \end{center}
4367 \end{small}
4368 \caption{Algorithm mp\_toom\_mul (continued)}
4369 \end{figure}
4370
4371 \textbf{Algorithm mp\_toom\_mul.}
4372 This algorithm computes the product of two mp\_int variables $a$ and $b$ using the Toom-Cook approach. Compared to the Karatsuba multiplication, this
4373 algorithm has a lower asymptotic running time of approximately $O(n^{1.464})$ but at an obvious cost in overhead. In this
4374 description, several statements have been compounded to save space. The intention is that the statements are executed from left to right across
4375 any given step.
4376
4377 The two inputs $a$ and $b$ are first split into three $k$-digit integers $a_0, a_1, a_2$ and $b_0, b_1, b_2$ respectively. From these smaller
4378 integers the coefficients of the polynomial basis representations $f(x)$ and $g(x)$ are known and can be used to find the relations required.
4379
4380 The first two relations $w_0$ and $w_4$ are the points $\zeta_{0}$ and $\zeta_{\infty}$ respectively. The relations $w_1, w_2$ and $w_3$ correspond
4381 to the points $16 \cdot \zeta_{1 \over 2}, \zeta_{1}$ and $\zeta_{2}$ respectively. These are found using logical shifts to independently find
4382 $f(y)$ and $g(y)$ which significantly speeds up the algorithm.
4383
4384 After the five relations $w_0, w_1, \ldots, w_4$ have been computed, the system they represent must be solved in order for the unknown coefficients
4385 $w_1, w_2$ and $w_3$ to be isolated. The steps 18 through 25 perform the system reduction required as previously described. Each step of
4386 the reduction represents the comparable matrix operation that would be performed had this been done by pencil and paper. For example, step 18 indicates
4387 that row $1$ must be subtracted from row $4$ and simultaneously row $0$ subtracted from row $3$.
4388
4389 Once the coefficients have been isolated, the polynomial $W(x) = \sum_{i=0}^{2n} w_i x^i$ is known. By substituting $\beta^{k}$ for $x$, the integer
4390 result $a \cdot b$ is produced.
4391
4392 \vspace{+3mm}\begin{small}
4393 \hspace{-5.1mm}{\bf File}: bn\_mp\_toom\_mul.c
4394 \vspace{-3mm}
4395 \begin{alltt}
4396 016
4397 017 /* multiplication using the Toom-Cook 3-way algorithm */
4398 018 int mp_toom_mul(mp_int *a, mp_int *b, mp_int *c)
4399 019 \{
4400 020 mp_int w0, w1, w2, w3, w4, tmp1, tmp2, a0, a1, a2, b0, b1, b2;
4401 021 int res, B;
4402 022
4403 023 /* init temps */
4404 024 if ((res = mp_init_multi(&w0, &w1, &w2, &w3, &w4,
4405 025 &a0, &a1, &a2, &b0, &b1,
4406 026 &b2, &tmp1, &tmp2, NULL)) != MP_OKAY) \{
4407 027 return res;
4408 028 \}
4409 029
4410 030 /* B */
4411 031 B = MIN(a->used, b->used) / 3;
4412 032
4413 033 /* a = a2 * B**2 + a1 * B + a0 */
4414 034 if ((res = mp_mod_2d(a, DIGIT_BIT * B, &a0)) != MP_OKAY) \{
4415 035 goto ERR;
4416 036 \}
4417 037
4418 038 if ((res = mp_copy(a, &a1)) != MP_OKAY) \{
4419 039 goto ERR;
4420 040 \}
4421 041 mp_rshd(&a1, B);
4422 042 mp_mod_2d(&a1, DIGIT_BIT * B, &a1);
4423 043
4424 044 if ((res = mp_copy(a, &a2)) != MP_OKAY) \{
4425 045 goto ERR;
4426 046 \}
4427 047 mp_rshd(&a2, B*2);
4428 048
4429 049 /* b = b2 * B**2 + b1 * B + b0 */
4430 050 if ((res = mp_mod_2d(b, DIGIT_BIT * B, &b0)) != MP_OKAY) \{
4431 051 goto ERR;
4432 052 \}
4433 053
4434 054 if ((res = mp_copy(b, &b1)) != MP_OKAY) \{
4435 055 goto ERR;
4436 056 \}
4437 057 mp_rshd(&b1, B);
4438 058 mp_mod_2d(&b1, DIGIT_BIT * B, &b1);
4439 059
4440 060 if ((res = mp_copy(b, &b2)) != MP_OKAY) \{
4441 061 goto ERR;
4442 062 \}
4443 063 mp_rshd(&b2, B*2);
4444 064
4445 065 /* w0 = a0*b0 */
4446 066 if ((res = mp_mul(&a0, &b0, &w0)) != MP_OKAY) \{
4447 067 goto ERR;
4448 068 \}
4449 069
4450 070 /* w4 = a2 * b2 */
4451 071 if ((res = mp_mul(&a2, &b2, &w4)) != MP_OKAY) \{
4452 072 goto ERR;
4453 073 \}
4454 074
4455 075 /* w1 = (a2 + 2(a1 + 2a0))(b2 + 2(b1 + 2b0)) */
4456 076 if ((res = mp_mul_2(&a0, &tmp1)) != MP_OKAY) \{
4457 077 goto ERR;
4458 078 \}
4459 079 if ((res = mp_add(&tmp1, &a1, &tmp1)) != MP_OKAY) \{
4460 080 goto ERR;
4461 081 \}
4462 082 if ((res = mp_mul_2(&tmp1, &tmp1)) != MP_OKAY) \{
4463 083 goto ERR;
4464 084 \}
4465 085 if ((res = mp_add(&tmp1, &a2, &tmp1)) != MP_OKAY) \{
4466 086 goto ERR;
4467 087 \}
4468 088
4469 089 if ((res = mp_mul_2(&b0, &tmp2)) != MP_OKAY) \{
4470 090 goto ERR;
4471 091 \}
4472 092 if ((res = mp_add(&tmp2, &b1, &tmp2)) != MP_OKAY) \{
4473 093 goto ERR;
4474 094 \}
4475 095 if ((res = mp_mul_2(&tmp2, &tmp2)) != MP_OKAY) \{
4476 096 goto ERR;
4477 097 \}
4478 098 if ((res = mp_add(&tmp2, &b2, &tmp2)) != MP_OKAY) \{
4479 099 goto ERR;
4480 100 \}
4481 101
4482 102 if ((res = mp_mul(&tmp1, &tmp2, &w1)) != MP_OKAY) \{
4483 103 goto ERR;
4484 104 \}
4485 105
4486 106 /* w3 = (a0 + 2(a1 + 2a2))(b0 + 2(b1 + 2b2)) */
4487 107 if ((res = mp_mul_2(&a2, &tmp1)) != MP_OKAY) \{
4488 108 goto ERR;
4489 109 \}
4490 110 if ((res = mp_add(&tmp1, &a1, &tmp1)) != MP_OKAY) \{
4491 111 goto ERR;
4492 112 \}
4493 113 if ((res = mp_mul_2(&tmp1, &tmp1)) != MP_OKAY) \{
4494 114 goto ERR;
4495 115 \}
4496 116 if ((res = mp_add(&tmp1, &a0, &tmp1)) != MP_OKAY) \{
4497 117 goto ERR;
4498 118 \}
4499 119
4500 120 if ((res = mp_mul_2(&b2, &tmp2)) != MP_OKAY) \{
4501 121 goto ERR;
4502 122 \}
4503 123 if ((res = mp_add(&tmp2, &b1, &tmp2)) != MP_OKAY) \{
4504 124 goto ERR;
4505 125 \}
4506 126 if ((res = mp_mul_2(&tmp2, &tmp2)) != MP_OKAY) \{
4507 127 goto ERR;
4508 128 \}
4509 129 if ((res = mp_add(&tmp2, &b0, &tmp2)) != MP_OKAY) \{
4510 130 goto ERR;
4511 131 \}
4512 132
4513 133 if ((res = mp_mul(&tmp1, &tmp2, &w3)) != MP_OKAY) \{
4514 134 goto ERR;
4515 135 \}
4516 136
4517 137
4518 138 /* w2 = (a2 + a1 + a0)(b2 + b1 + b0) */
4519 139 if ((res = mp_add(&a2, &a1, &tmp1)) != MP_OKAY) \{
4520 140 goto ERR;
4521 141 \}
4522 142 if ((res = mp_add(&tmp1, &a0, &tmp1)) != MP_OKAY) \{
4523 143 goto ERR;
4524 144 \}
4525 145 if ((res = mp_add(&b2, &b1, &tmp2)) != MP_OKAY) \{
4526 146 goto ERR;
4527 147 \}
4528 148 if ((res = mp_add(&tmp2, &b0, &tmp2)) != MP_OKAY) \{
4529 149 goto ERR;
4530 150 \}
4531 151 if ((res = mp_mul(&tmp1, &tmp2, &w2)) != MP_OKAY) \{
4532 152 goto ERR;
4533 153 \}
4534 154
4535 155 /* now solve the matrix
4536 156
4537 157 0 0 0 0 1
4538 158 1 2 4 8 16
4539 159 1 1 1 1 1
4540 160 16 8 4 2 1
4541 161 1 0 0 0 0
4542 162
4543 163 using 12 subtractions, 4 shifts,
4544 164 2 small divisions and 1 small multiplication
4545 165 */
4546 166
4547 167 /* r1 - r4 */
4548 168 if ((res = mp_sub(&w1, &w4, &w1)) != MP_OKAY) \{
4549 169 goto ERR;
4550 170 \}
4551 171 /* r3 - r0 */
4552 172 if ((res = mp_sub(&w3, &w0, &w3)) != MP_OKAY) \{
4553 173 goto ERR;
4554 174 \}
4555 175 /* r1/2 */
4556 176 if ((res = mp_div_2(&w1, &w1)) != MP_OKAY) \{
4557 177 goto ERR;
4558 178 \}
4559 179 /* r3/2 */
4560 180 if ((res = mp_div_2(&w3, &w3)) != MP_OKAY) \{
4561 181 goto ERR;
4562 182 \}
4563 183 /* r2 - r0 - r4 */
4564 184 if ((res = mp_sub(&w2, &w0, &w2)) != MP_OKAY) \{
4565 185 goto ERR;
4566 186 \}
4567 187 if ((res = mp_sub(&w2, &w4, &w2)) != MP_OKAY) \{
4568 188 goto ERR;
4569 189 \}
4570 190 /* r1 - r2 */
4571 191 if ((res = mp_sub(&w1, &w2, &w1)) != MP_OKAY) \{
4572 192 goto ERR;
4573 193 \}
4574 194 /* r3 - r2 */
4575 195 if ((res = mp_sub(&w3, &w2, &w3)) != MP_OKAY) \{
4576 196 goto ERR;
4577 197 \}
4578 198 /* r1 - 8r0 */
4579 199 if ((res = mp_mul_2d(&w0, 3, &tmp1)) != MP_OKAY) \{
4580 200 goto ERR;
4581 201 \}
4582 202 if ((res = mp_sub(&w1, &tmp1, &w1)) != MP_OKAY) \{
4583 203 goto ERR;
4584 204 \}
4585 205 /* r3 - 8r4 */
4586 206 if ((res = mp_mul_2d(&w4, 3, &tmp1)) != MP_OKAY) \{
4587 207 goto ERR;
4588 208 \}
4589 209 if ((res = mp_sub(&w3, &tmp1, &w3)) != MP_OKAY) \{
4590 210 goto ERR;
4591 211 \}
4592 212 /* 3r2 - r1 - r3 */
4593 213 if ((res = mp_mul_d(&w2, 3, &w2)) != MP_OKAY) \{
4594 214 goto ERR;
4595 215 \}
4596 216 if ((res = mp_sub(&w2, &w1, &w2)) != MP_OKAY) \{
4597 217 goto ERR;
4598 218 \}
4599 219 if ((res = mp_sub(&w2, &w3, &w2)) != MP_OKAY) \{
4600 220 goto ERR;
4601 221 \}
4602 222 /* r1 - r2 */
4603 223 if ((res = mp_sub(&w1, &w2, &w1)) != MP_OKAY) \{
4604 224 goto ERR;
4605 225 \}
4606 226 /* r3 - r2 */
4607 227 if ((res = mp_sub(&w3, &w2, &w3)) != MP_OKAY) \{
4608 228 goto ERR;
4609 229 \}
4610 230 /* r1/3 */
4611 231 if ((res = mp_div_3(&w1, &w1, NULL)) != MP_OKAY) \{
4612 232 goto ERR;
4613 233 \}
4614 234 /* r3/3 */
4615 235 if ((res = mp_div_3(&w3, &w3, NULL)) != MP_OKAY) \{
4616 236 goto ERR;
4617 237 \}
4618 238
4619 239 /* at this point shift W[n] by B*n */
4620 240 if ((res = mp_lshd(&w1, 1*B)) != MP_OKAY) \{
4621 241 goto ERR;
4622 242 \}
4623 243 if ((res = mp_lshd(&w2, 2*B)) != MP_OKAY) \{
4624 244 goto ERR;
4625 245 \}
4626 246 if ((res = mp_lshd(&w3, 3*B)) != MP_OKAY) \{
4627 247 goto ERR;
4628 248 \}
4629 249 if ((res = mp_lshd(&w4, 4*B)) != MP_OKAY) \{
4630 250 goto ERR;
4631 251 \}
4632 252
4633 253 if ((res = mp_add(&w0, &w1, c)) != MP_OKAY) \{
4634 254 goto ERR;
4635 255 \}
4636 256 if ((res = mp_add(&w2, &w3, &tmp1)) != MP_OKAY) \{
4637 257 goto ERR;
4638 258 \}
4639 259 if ((res = mp_add(&w4, &tmp1, &tmp1)) != MP_OKAY) \{
4640 260 goto ERR;
4641 261 \}
4642 262 if ((res = mp_add(&tmp1, c, c)) != MP_OKAY) \{
4643 263 goto ERR;
4644 264 \}
4645 265
4646 266 ERR:
4647 267 mp_clear_multi(&w0, &w1, &w2, &w3, &w4,
4648 268 &a0, &a1, &a2, &b0, &b1,
4649 269 &b2, &tmp1, &tmp2, NULL);
4650 270 return res;
4651 271 \}
4652 272
4653 \end{alltt}
4654 \end{small}
4655
4656 The implementation mirrors the pseudo code directly. The three digit groups of each input are extracted with mp\_mod\_2d and mp\_rshd, the five
relations are built up from doublings and additions, and the system is solved with the twelve subtractions, four shifts, two small divisions and one small multiplication noted in the source comments.
4657
4658 \subsection{Signed Multiplication}
4659 Now that algorithms to handle multiplications of every useful size have been developed, a rather simple finishing touch is required. So far all
4660 of the multiplication algorithms have been unsigned, which leaves only a signed multiplication algorithm to be established.
4661
4662 \newpage\begin{figure}[!here]
4663 \begin{small}
4664 \begin{center}
4665 \begin{tabular}{l}
4666 \hline Algorithm \textbf{mp\_mul}. \\
4667 \textbf{Input}. mp\_int $a$ and mp\_int $b$ \\
4668 \textbf{Output}. $c \leftarrow a \cdot b$ \\
4669 \hline \\
4670 1. If $a.sign = b.sign$ then \\
4671 \hspace{3mm}1.1 $sign = MP\_ZPOS$ \\
4672 2. else \\
4673 \hspace{3mm}2.1 $sign = MP\_NEG$ \\
4674 3. If min$(a.used, b.used) \ge TOOM\_MUL\_CUTOFF$ then \\
4675 \hspace{3mm}3.1 $c \leftarrow a \cdot b$ using algorithm mp\_toom\_mul \\
4676 4. else if min$(a.used, b.used) \ge KARATSUBA\_MUL\_CUTOFF$ then \\
4677 \hspace{3mm}4.1 $c \leftarrow a \cdot b$ using algorithm mp\_karatsuba\_mul \\
4678 5. else \\
4679 \hspace{3mm}5.1 $digs \leftarrow a.used + b.used + 1$ \\
4680 \hspace{3mm}5.2 If $digs < MP\_WARRAY$ and min$(a.used, b.used) \le \delta$ then \\
4681 \hspace{6mm}5.2.1 $c \leftarrow a \cdot b \mbox{ (mod }\beta^{digs}\mbox{)}$ using algorithm fast\_s\_mp\_mul\_digs. \\
4682 \hspace{3mm}5.3 else \\
4683 \hspace{6mm}5.3.1 $c \leftarrow a \cdot b \mbox{ (mod }\beta^{digs}\mbox{)}$ using algorithm s\_mp\_mul\_digs. \\
4684 6. $c.sign \leftarrow sign$ \\
4685 7. Return the result of the unsigned multiplication performed. \\
4686 \hline
4687 \end{tabular}
4688 \end{center}
4689 \end{small}
4690 \caption{Algorithm mp\_mul}
4691 \end{figure}
4692
4693 \textbf{Algorithm mp\_mul.}
4694 This algorithm performs the signed multiplication of two inputs. It will make use of one of the four unsigned multiplication algorithms
4695 available when the input is of appropriate size. The \textbf{sign} of the result is not set until the end of the algorithm since algorithm
4696 s\_mp\_mul\_digs will clear it.
4697
4698 \vspace{+3mm}\begin{small}
4699 \hspace{-5.1mm}{\bf File}: bn\_mp\_mul.c
4700 \vspace{-3mm}
4701 \begin{alltt}
4702 016
4703 017 /* high level multiplication (handles sign) */
4704 018 int mp_mul (mp_int * a, mp_int * b, mp_int * c)
4705 019 \{
4706 020 int res, neg;
4707 021 neg = (a->sign == b->sign) ? MP_ZPOS : MP_NEG;
4708 022
4709 023 /* use Toom-Cook? */
4710 024 if (MIN (a->used, b->used) >= TOOM_MUL_CUTOFF) \{
4711 025 res = mp_toom_mul(a, b, c);
4712 026 /* use Karatsuba? */
4713 027 \} else if (MIN (a->used, b->used) >= KARATSUBA_MUL_CUTOFF) \{
4714 028 res = mp_karatsuba_mul (a, b, c);
4715 029 \} else \{
4716 030 /* can we use the fast multiplier?
4717 031 *
4718 032 * The fast multiplier can be used if the output will
4719 033 * have less than MP_WARRAY digits and the number of
4720 034 * digits won't affect carry propagation
4721 035 */
4722 036 int digs = a->used + b->used + 1;
4723 037
4724 038 if ((digs < MP_WARRAY) &&
4725 039 MIN(a->used, b->used) <=
4726 040 (1 << ((CHAR_BIT * sizeof (mp_word)) - (2 * DIGIT_BIT)))) \{
4727 041 res = fast_s_mp_mul_digs (a, b, c, digs);
4728 042 \} else \{
4729 043 res = s_mp_mul (a, b, c);
4730 044 \}
4731 045 \}
4732 046 c->sign = neg;
4733 047 return res;
4734 048 \}
4735 \end{alltt}
4736 \end{small}
4737
4738 The implementation is rather simplistic and is not particularly noteworthy. Line 21 computes the sign of the result using the ``?''
4739 operator from the C programming language. Line 40 computes $\delta$ using the fact that $1 << k$ is equal to $2^k$. For instance, with $\beta = 2^{28}$ and a 64-bit mp\_word, $\delta = 2^{64 - 2 \cdot 28} = 256$.
4740
4741 \section{Squaring}
4742 \label{sec:basesquare}
4743
4744 Squaring is a special case of multiplication where both multiplicands are equal. At first it may seem like there is no significant optimization
4745 available but in fact there is. Consider the multiplication of $576$ against $241$. In total there will be nine single precision multiplications
4746 performed which are $1\cdot 6$, $1 \cdot 7$, $1 \cdot 5$, $4 \cdot 6$, $4 \cdot 7$, $4 \cdot 5$, $2 \cdot 6$, $2 \cdot 7$ and $2 \cdot 5$. Now consider
4747 the multiplication of $123$ against $123$. The nine products are $3 \cdot 3$, $3 \cdot 2$, $3 \cdot 1$, $2 \cdot 3$, $2 \cdot 2$, $2 \cdot 1$,
4748 $1 \cdot 3$, $1 \cdot 2$ and $1 \cdot 1$. On closer inspection some of the products are equivalent. For example, $3 \cdot 2 = 2 \cdot 3$
4749 and $3 \cdot 1 = 1 \cdot 3$.
4750
4751 For any $n$-digit input, there are ${{\left (n^2 + n \right)}\over 2}$ possible unique single precision multiplications required compared to the $n^2$
4752 required for multiplication. The following diagram gives an example of the operations required.
4753
4754 \begin{figure}[here]
4755 \begin{center}
4756 \begin{tabular}{ccccc|c}
4757 &&1&2&3&\\
4758 $\times$ &&1&2&3&\\
4759 \hline && $3 \cdot 1$ & $3 \cdot 2$ & $3 \cdot 3$ & Row 0\\
4760 & $2 \cdot 1$ & $2 \cdot 2$ & $2 \cdot 3$ && Row 1 \\
4761 $1 \cdot 1$ & $1 \cdot 2$ & $1 \cdot 3$ &&& Row 2 \\
4762 \end{tabular}
4763 \end{center}
4764 \caption{Squaring Optimization Diagram}
4765 \end{figure}
4766
4767 Starting from zero and numbering the columns from right to left a very simple pattern becomes obvious. For the purposes of this discussion let $x$
4768 represent the number being squared. The first observation is that in row $k$ the $2k$'th column of the product has a $\left (x_k \right)^2$ term in it.
4769
4770 The second observation is that every column $j$ in row $k$ where $j \ne 2k$ is part of a double product. Every non-square term of a column will
4771 appear twice hence the name ``double product''. Every odd column is made up entirely of double products. In fact every column is made up of double
4772 products and at most one square (\textit{see the exercise section}).
4773
4774 The third and final observation is that for row $k$ the first unique non-square term, that is, one that hasn't already appeared in an earlier row,
4775 occurs at column $2k + 1$. For example, on row $1$ of the previous squaring, column one is part of the double product with column one from row zero.
4776 Column two of row one is a square and column three is the first unique column.
4777
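These observations can be verified with a short sketch. The following toy program (an illustration only: base ten digits and plain machine integers stand in for mp\_digits) accumulates the square terms in the even numbered columns and the doubled cross products everywhere else, then recombines the columns to produce $123^2 = 15129$.

\begin{small}
\begin{alltt}
#include <stdio.h>

int main(void)
\{
   int a[3] = \{ 3, 2, 1 \};      /* the number 123, least significant digit first */
   int w[7] = \{ 0 \};            /* columns of the result */
   int ix, iy;
   long sum;

   for (ix = 0; ix < 3; ix++) \{
      /* first observation: the square term lands in column 2*ix */
      w[2*ix] += a[ix] * a[ix];

      /* second and third observations: each cross product appears
       * twice and the first new one is found at column 2*ix + 1
       */
      for (iy = ix + 1; iy < 3; iy++) \{
         w[ix + iy] += 2 * a[ix] * a[iy];
      \}
   \}

   /* recombine the columns; prints 15129 */
   for (sum = 0, ix = 6; ix >= 0; ix--) \{
      sum = sum * 10 + w[ix];
   \}
   printf("%ld", sum);
   return 0;
\}
\end{alltt}
\end{small}
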
4778 \subsection{The Baseline Squaring Algorithm}
4779 The baseline squaring algorithm is meant to be a catch-all squaring algorithm. It will handle any of the input sizes that the faster routines
4780 will not handle.
4781
4782 \newpage\begin{figure}[!here]
4783 \begin{small}
4784 \begin{center}
4785 \begin{tabular}{l}
4786 \hline Algorithm \textbf{s\_mp\_sqr}. \\
4787 \textbf{Input}. mp\_int $a$ \\
4788 \textbf{Output}. $b \leftarrow a^2$ \\
4789 \hline \\
4790 1. Init a temporary mp\_int of at least $2 \cdot a.used +1$ digits. (\textit{mp\_init\_size}) \\
4791 2. If step 1 failed return(\textit{MP\_MEM}) \\
4792 3. $t.used \leftarrow 2 \cdot a.used + 1$ \\
4793 4. For $ix$ from 0 to $a.used - 1$ do \\
4794 \hspace{3mm}Calculate the square. \\
4795 \hspace{3mm}4.1 $\hat r \leftarrow t_{2ix} + \left (a_{ix} \right )^2$ \\
4796 \hspace{3mm}4.2 $t_{2ix} \leftarrow \hat r \mbox{ (mod }\beta\mbox{)}$ \\
4797 \hspace{3mm}Calculate the double products after the square. \\
4798 \hspace{3mm}4.3 $u \leftarrow \lfloor \hat r / \beta \rfloor$ \\
4799 \hspace{3mm}4.4 For $iy$ from $ix + 1$ to $a.used - 1$ do \\
4800 \hspace{6mm}4.4.1 $\hat r \leftarrow 2 \cdot a_{ix}a_{iy} + t_{ix + iy} + u$ \\
4801 \hspace{6mm}4.4.2 $t_{ix + iy} \leftarrow \hat r \mbox{ (mod }\beta\mbox{)}$ \\
4802 \hspace{6mm}4.4.3 $u \leftarrow \lfloor \hat r / \beta \rfloor$ \\
4803 \hspace{3mm}Set the last carry. \\
4804 \hspace{3mm}4.5 While $u > 0$ do \\
4805 \hspace{6mm}4.5.1 $iy \leftarrow iy + 1$ \\
4806 \hspace{6mm}4.5.2 $\hat r \leftarrow t_{ix + iy} + u$ \\
4807 \hspace{6mm}4.5.3 $t_{ix + iy} \leftarrow \hat r \mbox{ (mod }\beta\mbox{)}$ \\
4808 \hspace{6mm}4.5.4 $u \leftarrow \lfloor \hat r / \beta \rfloor$ \\
4809 5. Clamp excess digits of $t$. (\textit{mp\_clamp}) \\
4810 6. Exchange $b$ and $t$. \\
4811 7. Clear $t$ (\textit{mp\_clear}) \\
4812 8. Return(\textit{MP\_OKAY}) \\
4813 \hline
4814 \end{tabular}
4815 \end{center}
4816 \end{small}
4817 \caption{Algorithm s\_mp\_sqr}
4818 \end{figure}
4819
4820 \textbf{Algorithm s\_mp\_sqr.}
4821 This algorithm computes the square of an input using the three observations on squaring. It is based fairly faithfully on algorithm 14.16 of HAC
4822 \cite[pp.596-597]{HAC}. Similar to algorithm s\_mp\_mul\_digs, a temporary mp\_int is allocated to hold the result of the squaring. This allows the
4823 destination mp\_int to be the same as the source mp\_int.
4824
4825 The outer loop of this algorithm begins on step 4. It is best to think of the outer loop as walking down the rows of the partial results, while
4826 the inner loop computes the columns of the partial result. Steps 4.1 and 4.2 compute the square term for each row, and steps 4.3 and 4.4 propagate
4827 the carry and compute the double products.
4828
4829 The requirement that a mp\_word be able to represent the range $0 \le x < 2 \beta^2$ arises from this
4830 very algorithm. The product $a_{ix}a_{iy}$ will lie in the range $0 \le x \le \beta^2 - 2\beta + 1$ which is obviously less than $\beta^2$ meaning that
4831 when it is multiplied by two, it can be properly represented by a mp\_word.
4832
4833 Similar to algorithm s\_mp\_mul\_digs, after every pass of the inner loop, the destination is correctly set to the sum of all of the partial
4834 results calculated so far. This involves expensive carry propagation which will be eliminated in the next algorithm.
4835
4836 \vspace{+3mm}\begin{small}
4837 \hspace{-5.1mm}{\bf File}: bn\_s\_mp\_sqr.c
4838 \vspace{-3mm}
4839 \begin{alltt}
4840 016
4841 017 /* low level squaring, b = a*a, HAC pp.596-597, Algorithm 14.16 */
4842 018 int
4843 019 s_mp_sqr (mp_int * a, mp_int * b)
4844 020 \{
4845 021 mp_int t;
4846 022 int res, ix, iy, pa;
4847 023 mp_word r;
4848 024 mp_digit u, tmpx, *tmpt;
4849 025
4850 026 pa = a->used;
4851 027 if ((res = mp_init_size (&t, 2*pa + 1)) != MP_OKAY) \{
4852 028 return res;
4853 029 \}
4854 030
4855 031 /* default used is maximum possible size */
4856 032 t.used = 2*pa + 1;
4857 033
4858 034 for (ix = 0; ix < pa; ix++) \{
4859 035 /* first calculate the digit at 2*ix */
4860 036 /* calculate double precision result */
4861 037 r = ((mp_word) t.dp[2*ix]) +
4862 038 ((mp_word)a->dp[ix])*((mp_word)a->dp[ix]);
4863 039
4864 040 /* store lower part in result */
4865 041 t.dp[ix+ix] = (mp_digit) (r & ((mp_word) MP_MASK));
4866 042
4867 043 /* get the carry */
4868 044 u = (mp_digit)(r >> ((mp_word) DIGIT_BIT));
4869 045
4870 046 /* left hand side of A[ix] * A[iy] */
4871 047 tmpx = a->dp[ix];
4872 048
4873 049 /* alias for where to store the results */
4874 050 tmpt = t.dp + (2*ix + 1);
4875 051
4876 052 for (iy = ix + 1; iy < pa; iy++) \{
4877 053 /* first calculate the product */
4878 054 r = ((mp_word)tmpx) * ((mp_word)a->dp[iy]);
4879 055
4880 056 /* now calculate the double precision result, note we use
4881 057 * addition instead of *2 since it's easier to optimize
4882 058 */
4883 059 r = ((mp_word) *tmpt) + r + r + ((mp_word) u);
4884 060
4885 061 /* store lower part */
4886 062 *tmpt++ = (mp_digit) (r & ((mp_word) MP_MASK));
4887 063
4888 064 /* get carry */
4889 065 u = (mp_digit)(r >> ((mp_word) DIGIT_BIT));
4890 066 \}
4891 067 /* propagate upwards */
4892 068 while (u != ((mp_digit) 0)) \{
4893 069 r = ((mp_word) *tmpt) + ((mp_word) u);
4894 070 *tmpt++ = (mp_digit) (r & ((mp_word) MP_MASK));
4895 071 u = (mp_digit)(r >> ((mp_word) DIGIT_BIT));
4896 072 \}
4897 073 \}
4898 074
4899 075 mp_clamp (&t);
4900 076 mp_exch (&t, b);
4901 077 mp_clear (&t);
4902 078 return MP_OKAY;
4903 079 \}
4904 \end{alltt}
4905 \end{small}
4906
4907 Inside the outer loop (\textit{see line 34}) the square term is calculated on line 37. Line 44 extracts the carry from the square
4908 term. Aliases for $a_{ix}$ and $t_{ix+iy}$ are initialized on lines 47 and 50 respectively. The doubling is performed using two
4909 additions (\textit{see line 59}) since addition is typically at least as fast as shifting, and often faster.
4910
4911 \subsection{Faster Squaring by the ``Comba'' Method}
4912 A major drawback to the baseline method is the requirement for single precision shifting inside the $O(n^2)$ nested loop. Squaring has an additional
4913 drawback that it must double the product inside the inner loop as well. As for multiplication, the Comba technique can be used to eliminate these
4914 performance hazards.
4915
4916 The first obvious solution is to make an array of mp\_words which will hold all of the columns. This will indeed eliminate all of the carry
4917 propagation operations from the inner loop. However, the inner product must still be doubled $O(n^2)$ times. The solution stems from the simple fact
4918 that $2a + 2b + 2c = 2(a + b + c)$. That is the sum of all of the double products is equal to double the sum of all the products. For example,
4919 $ab + ba + ac + ca = 2ab + 2ac = 2(ab + ac)$.
4920
4921 However, we cannot simply double all of the columns, since the squares appear only once per row. The most practical solution is to have two mp\_word
4922 arrays. One array will hold the squares and the other array will hold the double products. With both arrays the doubling and carry propagation can be
4923 moved to a $O(n)$ work level outside the $O(n^2)$ level.
4924
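As a minimal sketch of this idea (again base ten digits and machine integers, with the array names chosen to match the listing further below), the toy example from the previous section can be restructured so that the squares and the undoubled cross products live in two separate column arrays, and the doubling takes place in a single pass along with the carry propagation.

\begin{small}
\begin{alltt}
#include <stdio.h>

int main(void)
\{
   int a[3]  = \{ 3, 2, 1 \};     /* the number 123                */
   int W[7]  = \{ 0 \};           /* undoubled cross products      */
   int W2[7] = \{ 0 \};           /* squares                       */
   int b[7], ix, iy, col, carry;

   for (ix = 0; ix < 3; ix++) \{
      W2[2*ix] = a[ix] * a[ix];
      for (iy = ix + 1; iy < 3; iy++) \{
         W[ix + iy] += a[ix] * a[iy];   /* not doubled here */
      \}
   \}

   /* double, add the squares and propagate carries in one O(n) pass */
   for (carry = 0, ix = 0; ix < 7; ix++) \{
      col   = 2 * W[ix] + W2[ix] + carry;
      b[ix] = col % 10;
      carry = col / 10;
   \}

   /* only columns 0..4 are used by a three digit input; prints 15129 */
   for (ix = 4; ix >= 0; ix--) \{
      printf("%d", b[ix]);
   \}
   return 0;
\}
\end{alltt}
\end{small}
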
4925 \newpage\begin{figure}[!here]
4926 \begin{small}
4927 \begin{center}
4928 \begin{tabular}{l}
4929 \hline Algorithm \textbf{fast\_s\_mp\_sqr}. \\
4930 \textbf{Input}. mp\_int $a$ \\
4931 \textbf{Output}. $b \leftarrow a^2$ \\
4932 \hline \\
4933 Place two arrays of \textbf{MP\_WARRAY} mp\_words named $\hat W$ and $\hat {X}$ on the stack. \\
4934 1. If $b.alloc < 2a.used + 1$ then grow $b$ to $2a.used + 1$ digits. (\textit{mp\_grow}). \\
4935 2. If step 1 failed return(\textit{MP\_MEM}). \\
4936 3. for $ix$ from $0$ to $2a.used + 1$ do \\
4937 \hspace{3mm}3.1 $\hat W_{ix} \leftarrow 0$ \\
4938 \hspace{3mm}3.2 $\hat {X}_{ix} \leftarrow 0$ \\
4939 4. for $ix$ from $0$ to $a.used - 1$ do \\
4940 \hspace{3mm}Compute the square.\\
4941 \hspace{3mm}4.1 $\hat {X}_{ix+ix} \leftarrow \left ( a_{ix} \right )^2$ \\
4942 \\
4943 \hspace{3mm}Compute the double products.\\
4944 \hspace{3mm}4.2 for $iy$ from $ix + 1$ to $a.used - 1$ do \\
4945 \hspace{6mm}4.2.1 $\hat W_{ix+iy} \leftarrow \hat W_{ix+iy} + a_{ix}a_{iy}$ \\
4946 5. $oldused \leftarrow b.used$ \\
4947 6. $b.used \leftarrow 2a.used + 1$ \\
4948 \\
4949 Double the products and propagate the carries simultaneously. \\
4950 7. $\hat W_0 \leftarrow 2 \hat W_0 + \hat {X}_0$ \\
4951 8. for $ix$ from $1$ to $2a.used$ do \\
4952 \hspace{3mm}8.1 $\hat W_{ix} \leftarrow 2 \hat W_{ix} + \hat {X}_{ix}$ \\
4953 \hspace{3mm}8.2 $\hat W_{ix} \leftarrow \hat W_{ix} + \lfloor \hat W_{ix - 1} / \beta \rfloor$ \\
4954 \hspace{3mm}8.3 $b_{ix-1} \leftarrow W_{ix-1} \mbox{ (mod }\beta\mbox{)}$ \\
4955 9. $b_{2a.used} \leftarrow \hat W_{2a.used} \mbox{ (mod }\beta\mbox{)}$ \\
4956 10. if $2a.used + 1 < oldused$ then do \\
4957 \hspace{3mm}10.1 for $ix$ from $2a.used + 1$ to $oldused$ do \\
4958 \hspace{6mm}10.1.1 $b_{ix} \leftarrow 0$ \\
4959 11. Clamp excess digits from $b$. (\textit{mp\_clamp}) \\
4960 12. Return(\textit{MP\_OKAY}). \\
4961 \hline
4962 \end{tabular}
4963 \end{center}
4964 \end{small}
4965 \caption{Algorithm fast\_s\_mp\_sqr}
4966 \end{figure}
4967
4968 \textbf{Algorithm fast\_s\_mp\_sqr.}
4969 This algorithm computes the square of an input using the Comba technique. It is designed to be a replacement for algorithm s\_mp\_sqr when
4970 the number of input digits is less than \textbf{MP\_WARRAY} and less than $\delta \over 2$.
4971
4972 This routine requires two arrays of mp\_words to be placed on the stack. The first array $\hat W$ will hold the double products and the second
4973 array $\hat X$ will hold the squares. Though only at most $MP\_WARRAY \over 2$ words of $\hat X$ are used, it has proven faster on most
4974 processors to simply make it a full size array.
4975
4976 The loop on step 3 will zero the two arrays to prepare them for the squaring step. Step 4.1 computes the square term for each row. Note how
4977 it simply assigns the value into the $\hat X$ array. The nested loop on step 4.2 accumulates the cross products, computing the sum of the
4978 products for each column. They are not doubled until later.
4979
4980 After the squaring loop, the products stored in $\hat W$ must be doubled and the carries propagated forwards. It makes sense to do both
4981 operations at the same time. The expression $\hat W_{ix} \leftarrow 2 \hat W_{ix} + \hat {X}_{ix}$ computes the sum of the double product and the
4982 squares in place.
4983
4984 \vspace{+3mm}\begin{small}
4985 \hspace{-5.1mm}{\bf File}: bn\_fast\_s\_mp\_sqr.c
4986 \vspace{-3mm}
4987 \begin{alltt}
4988 016
4989 017 /* fast squaring
4990 018 *
4991 019 * This is the comba method where the columns of the product
4992 020 * are computed first then the carries are computed. This
4993 021 * has the effect of making a very simple inner loop that
4994 022 * is executed the most
4995 023 *
4996 024 * W2 represents the outer products and W the inner.
4997 025 *
4998 026 * A further optimizations is made because the inner
4999 027 * products are of the form "A * B * 2". The *2 part does
5000 028 * not need to be computed until the end which is good
5001 029 * because 64-bit shifts are slow!
5002 030 *
5003 031 * Based on Algorithm 14.16 on pp.597 of HAC.
5004 032 *
5005 033 */
5006 034 int fast_s_mp_sqr (mp_int * a, mp_int * b)
5007 035 \{
5008 036 int olduse, newused, res, ix, pa;
5009 037 mp_word W2[MP_WARRAY], W[MP_WARRAY];
5010 038
5011 039 /* calculate size of product and allocate as required */
5012 040 pa = a->used;
5013 041 newused = pa + pa + 1;
5014 042 if (b->alloc < newused) \{
5015 043 if ((res = mp_grow (b, newused)) != MP_OKAY) \{
5016 044 return res;
5017 045 \}
5018 046 \}
5019 047
5020 048 /* zero temp buffer (columns)
5021 049 * Note that there are two buffers. Since squaring requires
5022 050 * a outer and inner product and the inner product requires
5023 051 * computing a product and doubling it (a relatively expensive
5024 052 * op to perform n**2 times if you don't have to) the inner and
5025 053 * outer products are computed in different buffers. This way
5026 054 * the inner product can be doubled using n doublings instead of
5027 055 * n**2
5028 056 */
5029 057 memset (W, 0, newused * sizeof (mp_word));
5030 058 memset (W2, 0, newused * sizeof (mp_word));
5031 059
5032 060 /* This computes the inner product. To simplify the inner N**2 loop
5033 061 * the multiplication by two is done afterwards in the N loop.
5034 062 */
5035 063 for (ix = 0; ix < pa; ix++) \{
5036 064 /* compute the outer product
5037 065 *
5038 066 * Note that every outer product is computed
5039 067 * for a particular column only once which means that
5040 068 * there is no need todo a double precision addition
5041 069 * into the W2[] array.
5042 070 */
5043 071 W2[ix + ix] = ((mp_word)a->dp[ix]) * ((mp_word)a->dp[ix]);
5044 072
5045 073 \{
5046 074 register mp_digit tmpx, *tmpy;
5047 075 register mp_word *_W;
5048 076 register int iy;
5049 077
5050 078 /* copy of left side */
5051 079 tmpx = a->dp[ix];
5052 080
5053 081 /* alias for right side */
5054 082 tmpy = a->dp + (ix + 1);
5055 083
5056 084 /* the column to store the result in */
5057 085 _W = W + (ix + ix + 1);
5058 086
5059 087 /* inner products */
5060 088 for (iy = ix + 1; iy < pa; iy++) \{
5061 089 *_W++ += ((mp_word)tmpx) * ((mp_word)*tmpy++);
5062 090 \}
5063 091 \}
5064 092 \}
5065 093
5066 094 /* setup dest */
5067 095 olduse = b->used;
5068 096 b->used = newused;
5069 097
5070 098 /* now compute digits
5071 099 *
5072 100 * We have to double the inner product sums, add in the
5073 101 * outer product sums, propagate carries and convert
5074 102 * to single precision.
5075 103 */
5076 104 \{
5077 105 register mp_digit *tmpb;
5078 106
5079 107 /* double first value, since the inner products are
5080 108 * half of what they should be
5081 109 */
5082 110 W[0] += W[0] + W2[0];
5083 111
5084 112 tmpb = b->dp;
5085 113 for (ix = 1; ix < newused; ix++) \{
5086 114 /* double/add next digit */
5087 115 W[ix] += W[ix] + W2[ix];
5088 116
5089 117 /* propagate carry forwards [from the previous digit] */
5090 118 W[ix] = W[ix] + (W[ix - 1] >> ((mp_word) DIGIT_BIT));
5091 119
5092 120 /* store the current digit now that the carry isn't
5093 121 * needed
5094 122 */
5095 123 *tmpb++ = (mp_digit) (W[ix - 1] & ((mp_word) MP_MASK));
5096 124 \}
5097 125 /* set the last value. Note even if the carry is zero
5098 126 * this is required since the next step will not zero
5099 127 * it if b originally had a value at b->dp[2*a.used]
5100 128 */
5101 129 *tmpb++ = (mp_digit) (W[(newused) - 1] & ((mp_word) MP_MASK));
5102 130
5103 131 /* clear high digits of b if there were any originally */
5104 132 for (; ix < olduse; ix++) \{
5105 133 *tmpb++ = 0;
5106 134 \}
5107 135 \}
5108 136
5109 137 mp_clamp (b);
5110 138 return MP_OKAY;
5111 139 \}
5112 \end{alltt}
5113 \end{small}
5114
5115 The implementation closely follows the pseudo code. The squares are stored in W2 on line 71, the undoubled inner products are accumulated in W by the loop starting on line 88,
and the doubling, the addition of the squares and the carry propagation are all handled by the single pass that begins on line 113.
5116
5117 \subsection{Polynomial Basis Squaring}
5118 The same algorithm that performs optimal polynomial basis multiplication can be used to perform polynomial basis squaring. The minor exception
5119 is that $\zeta_y = f(y)g(y)$ is actually equivalent to $\zeta_y = f(y)^2$ since $f(y) = g(y)$. Instead of performing $2n + 1$
5120 multiplications to find the $\zeta$ relations, $2n + 1$ squaring operations are performed.
5121
5122 \subsection{Karatsuba Squaring}
5123 Let $f(x) = ax + b$ represent the polynomial basis representation of a number to square.
5124 Let $h(x) = \left ( f(x) \right )^2$ represent the square of the polynomial. The Karatsuba equation can be modified to square a
5125 number with the following equation.
5126
5127 \begin{equation}
5128 h(x) = a^2x^2 + \left (a^2 + b^2 - (a - b)^2 \right )x + b^2
5129 \end{equation}
5130
5131 Upon closer inspection this equation only requires the calculation of three half-sized squares: $a^2$, $b^2$ and $(a - b)^2$. As in
5132 Karatsuba multiplication, this algorithm can be applied recursively on the input and will achieve an asymptotic running time of
5133 $O \left ( n^{lg(3)} \right )$.
5134
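As a quick sanity check of the equation, the following sketch squares $5749$ with three half sized squares using ordinary machine integers, where the split into halves is done with a division and remainder rather than mp\_rshd and mp\_mod\_2d.

\begin{small}
\begin{alltt}
#include <stdio.h>

int main(void)
\{
   long n = 5749;                /* the number to square            */
   long a = n / 100;             /* upper half, 57                  */
   long b = n % 100;             /* lower half, 49                  */

   long a2 = a * a;              /* first half sized square         */
   long b2 = b * b;              /* second half sized square        */
   long m  = (a - b) * (a - b);  /* third half sized square         */

   /* h(x) = a^2 x^2 + (a^2 + b^2 - (a - b)^2)x + b^2 at x = 100 */
   long sq = a2 * 10000 + (a2 + b2 - m) * 100 + b2;

   printf("%ld", sq);            /* prints 33051001, which is 5749^2 */
   return 0;
\}
\end{alltt}
\end{small}
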
5135 If the asymptotic times of Karatsuba squaring and multiplication are the same, why not simply use the multiplication algorithm
5136 instead? The answer to this arises from the cutoff point for squaring. As in multiplication there exists a cutoff point, at which the
5137 time required for a Comba based squaring and a Karatsuba based squaring meet. Due to the overhead inherent in the Karatsuba method, the cutoff
5138 point is fairly high. For example, on an AMD Athlon XP processor with $\beta = 2^{28}$, the cutoff point is around 127 digits.
5139
5140 Consider squaring a 200 digit number with this technique. It will be split into two 100 digit halves which are subsequently squared.
5141 The 100 digit halves will not be squared using Karatsuba, but instead using the faster Comba based squaring algorithm. If Karatsuba multiplication
5142 were used instead, the 100 digit numbers would be squared with a slower Comba based multiplication.
5143
5144 \newpage\begin{figure}[!here]
5145 \begin{small}
5146 \begin{center}
5147 \begin{tabular}{l}
5148 \hline Algorithm \textbf{mp\_karatsuba\_sqr}. \\
5149 \textbf{Input}. mp\_int $a$ \\
5150 \textbf{Output}. $b \leftarrow a^2$ \\
5151 \hline \\
5152 1. Initialize the following temporary mp\_ints: $x0$, $x1$, $t1$, $t2$, $x0x0$ and $x1x1$. \\
5153 2. If any of the initializations on step 1 failed return(\textit{MP\_MEM}). \\
5154 \\
5155 Split the input. e.g. $a = x1\beta^B + x0$ \\
5156 3. $B \leftarrow \lfloor a.used / 2 \rfloor$ \\
5157 4. $x0 \leftarrow a \mbox{ (mod }\beta^B\mbox{)}$ (\textit{mp\_mod\_2d}) \\
5158 5. $x1 \leftarrow \lfloor a / \beta^B \rfloor$ (\textit{mp\_rshd}) \\
5159 \\
5160 Calculate the three squares. \\
5161 6. $x0x0 \leftarrow x0^2$ (\textit{mp\_sqr}) \\
5162 7. $x1x1 \leftarrow x1^2$ \\
5163 8. $t1 \leftarrow x1 - x0$ (\textit{mp\_sub}) \\
5164 9. $t1 \leftarrow t1^2$ \\
5165 \\
5166 Compute the middle term. \\
5167 10. $t2 \leftarrow x0x0 + x1x1$ (\textit{s\_mp\_add}) \\
5168 11. $t1 \leftarrow t2 - t1$ \\
5169 \\
5170 Compute final product. \\
5171 12. $t1 \leftarrow t1\beta^B$ (\textit{mp\_lshd}) \\
5172 13. $x1x1 \leftarrow x1x1\beta^{2B}$ \\
5173 14. $t1 \leftarrow t1 + x0x0$ \\
5174 15. $b \leftarrow t1 + x1x1$ \\
5175 16. Return(\textit{MP\_OKAY}). \\
5176 \hline
5177 \end{tabular}
5178 \end{center}
5179 \end{small}
5180 \caption{Algorithm mp\_karatsuba\_sqr}
5181 \end{figure}
5182
5183 \textbf{Algorithm mp\_karatsuba\_sqr.}
5184 This algorithm computes the square of an input $a$ using the Karatsuba technique. This algorithm is very similar to the Karatsuba based
5185 multiplication algorithm with the exception that the three half-size multiplications have been replaced with three half-size squarings.
5186
5187 The radix point for squaring is simply placed exactly in the middle of the digits when the input has an even number of digits, otherwise it is
5188 placed just below the middle. Steps 3, 4 and 5 compute the two halves required using $B$
5189 as the radix point. The first two squares in steps 6 and 7 are rather straightforward while the last square is of a more compact form.
5190
5191 By expanding $\left (x1 - x0 \right )^2$, the $x1^2$ and $x0^2$ terms in the middle disappear, that is $x1^2 + x0^2 - (x1 - x0)^2 = 2 \cdot x0 \cdot x1$.
5192 Now, if performing $5n$ single precision additions plus an $n$-digit squaring is faster than multiplying two $n$-digit numbers and doubling the result, then
5193 this method is faster. Assuming no further recursions occur, the difference can be estimated with the following inequality.
5194
5195 Let $p$ represent the cost of a single precision addition and $q$ the cost of a single precision multiplication both in terms of time\footnote{Or
5196 machine clock cycles.}.
5197
5198 \begin{equation}
5199 5pn +{{q(n^2 + n)} \over 2} \le pn + qn^2
5200 \end{equation}
5201
5202 For example, on an AMD Athlon XP processor $p = {1 \over 3}$ and $q = 6$. This implies that the following inequality should hold.
5203 \begin{center}
5204 \begin{tabular}{rcl}
5205 ${5n \over 3} + 3n^2 + 3n$ & $<$ & ${n \over 3} + 6n^2$ \\
5206 ${5 \over 3} + 3n + 3$ & $<$ & ${1 \over 3} + 6n$ \\
5207 ${13 \over 9}$ & $<$ & $n$ \\
5208 \end{tabular}
5209 \end{center}
5210
5211 This results in a cutoff point around $n = 2$. As a consequence it is actually faster to compute the middle term the ``long way'' on processors
5212 where multiplication is substantially slower\footnote{On the Athlon there is a 1:17 ratio between clock cycles for addition and multiplication. On
5213 the Intel P4 processor this ratio is 1:29 making this method even more beneficial. The only common exception is the ARMv4 processor which has a
5214 ratio of 1:7. } than simpler operations such as addition.
5215
5216 \vspace{+3mm}\begin{small}
5217 \hspace{-5.1mm}{\bf File}: bn\_mp\_karatsuba\_sqr.c
5218 \vspace{-3mm}
5219 \begin{alltt}
5220 016
5221 017 /* Karatsuba squaring, computes b = a*a using three
5222 018 * half size squarings
5223 019 *
5224 020 * See comments of mp_karatsuba_mul for details. It
5225 021 * is essentially the same algorithm but merely
5226 022 * tuned to perform recursive squarings.
5227 023 */
5228 024 int mp_karatsuba_sqr (mp_int * a, mp_int * b)
5229 025 \{
5230 026 mp_int x0, x1, t1, t2, x0x0, x1x1;
5231 027 int B, err;
5232 028
5233 029 err = MP_MEM;
5234 030
5235 031 /* min # of digits */
5236 032 B = a->used;
5237 033
5238 034 /* now divide in two */
5239 035 B = B >> 1;
5240 036
5241 037 /* init copy all the temps */
5242 038 if (mp_init_size (&x0, B) != MP_OKAY)
5243 039 goto ERR;
5244 040 if (mp_init_size (&x1, a->used - B) != MP_OKAY)
5245 041 goto X0;
5246 042
5247 043 /* init temps */
5248 044 if (mp_init_size (&t1, a->used * 2) != MP_OKAY)
5249 045 goto X1;
5250 046 if (mp_init_size (&t2, a->used * 2) != MP_OKAY)
5251 047 goto T1;
5252 048 if (mp_init_size (&x0x0, B * 2) != MP_OKAY)
5253 049 goto T2;
5254 050 if (mp_init_size (&x1x1, (a->used - B) * 2) != MP_OKAY)
5255 051 goto X0X0;
5256 052
5257 053 \{
5258 054 register int x;
5259 055 register mp_digit *dst, *src;
5260 056
5261 057 src = a->dp;
5262 058
5263 059 /* now shift the digits */
5264 060 dst = x0.dp;
5265 061 for (x = 0; x < B; x++) \{
5266 062 *dst++ = *src++;
5267 063 \}
5268 064
5269 065 dst = x1.dp;
5270 066 for (x = B; x < a->used; x++) \{
5271 067 *dst++ = *src++;
5272 068 \}
5273 069 \}
5274 070
5275 071 x0.used = B;
5276 072 x1.used = a->used - B;
5277 073
5278 074 mp_clamp (&x0);
5279 075
5280 076 /* now calc the products x0*x0 and x1*x1 */
5281 077 if (mp_sqr (&x0, &x0x0) != MP_OKAY)
5282 078 goto X1X1; /* x0x0 = x0*x0 */
5283 079 if (mp_sqr (&x1, &x1x1) != MP_OKAY)
5284 080 goto X1X1; /* x1x1 = x1*x1 */
5285 081
5286 082 /* now calc (x1-x0)**2 */
5287 083 if (mp_sub (&x1, &x0, &t1) != MP_OKAY)
5288 084 goto X1X1; /* t1 = x1 - x0 */
5289 085 if (mp_sqr (&t1, &t1) != MP_OKAY)
5290 086 goto X1X1; /* t1 = (x1 - x0) * (x1 - x0) */
5291 087
5292 088 /* add x0y0 */
5293 089 if (s_mp_add (&x0x0, &x1x1, &t2) != MP_OKAY)
5294 090 goto X1X1; /* t2 = x0x0 + x1x1 */
5295 091 if (mp_sub (&t2, &t1, &t1) != MP_OKAY)
5296 092 goto X1X1; /* t1 = x0x0 + x1x1 - (x1-x0)*(x1-x0) */
5297 093
5298 094 /* shift by B */
5299 095 if (mp_lshd (&t1, B) != MP_OKAY)
5300 096 goto X1X1; /* t1 = (x0x0 + x1x1 - (x1-x0)*(x1-x0))<<B */
5301 097 if (mp_lshd (&x1x1, B * 2) != MP_OKAY)
5302 098 goto X1X1; /* x1x1 = x1x1 << 2*B */
5303 099
5304 100 if (mp_add (&x0x0, &t1, &t1) != MP_OKAY)
5305 101 goto X1X1; /* t1 = x0x0 + t1 */
5306 102 if (mp_add (&t1, &x1x1, b) != MP_OKAY)
5307 103 goto X1X1; /* t1 = x0x0 + t1 + x1x1 */
5308 104
5309 105 err = MP_OKAY;
5310 106
5311 107 X1X1:mp_clear (&x1x1);
5312 108 X0X0:mp_clear (&x0x0);
5313 109 T2:mp_clear (&t2);
5314 110 T1:mp_clear (&t1);
5315 111 X1:mp_clear (&x1);
5316 112 X0:mp_clear (&x0);
5317 113 ERR:
5318 114 return err;
5319 115 \}
5320 \end{alltt}
5321 \end{small}
5322
5323 This implementation is largely based on the implementation of algorithm mp\_karatsuba\_mul. It uses the same inline style to copy and
5324 shift the input into the two halves. The loop from line 53 to line 69 has been modified since only one input exists. The \textbf{used}
5325 count of both $x0$ and $x1$ is fixed up and $x0$ is clamped before the calculations begin. At this point $x1$ and $x0$ are valid equivalents
5326 to the respective halves as if mp\_rshd and mp\_mod\_2d had been used.
5327
5328 By inlining the copy and shift operations the cutoff point for Karatsuba squaring can be lowered. On the Athlon the cutoff point
5329 is exactly at the point where Comba squaring can no longer be used (\textit{128 digits}). On slower processors such as the Intel P4
5330 it is actually below the Comba limit (\textit{at 110 digits}).
5331
5332 This routine uses the same error trap coding style as mp\_karatsuba\_mul. As each temporary variable is initialized, any error is redirected to
5333 the error trap one level higher up. If the algorithm completes without error the error code is set to \textbf{MP\_OKAY} and the mp\_clear calls are executed normally.
5336
5337 \subsection{Toom-Cook Squaring}
5338 The Toom-Cook squaring algorithm mp\_toom\_sqr is heavily based on the algorithm mp\_toom\_mul with the exception that squarings are used
5339 instead of multiplications to find the five relations. The reader is encouraged to read the description of the latter algorithm and try to
5340 derive their own Toom-Cook squaring algorithm.
5341
5342 \subsection{High Level Squaring}
5343 \newpage\begin{figure}[!here]
5344 \begin{small}
5345 \begin{center}
5346 \begin{tabular}{l}
5347 \hline Algorithm \textbf{mp\_sqr}. \\
5348 \textbf{Input}. mp\_int $a$ \\
5349 \textbf{Output}. $b \leftarrow a^2$ \\
5350 \hline \\
5351 1. If $a.used \ge TOOM\_SQR\_CUTOFF$ then \\
5352 \hspace{3mm}1.1 $b \leftarrow a^2$ using algorithm mp\_toom\_sqr \\
5353 2. else if $a.used \ge KARATSUBA\_SQR\_CUTOFF$ then \\
5354 \hspace{3mm}2.1 $b \leftarrow a^2$ using algorithm mp\_karatsuba\_sqr \\
5355 3. else \\
5356 \hspace{3mm}3.1 $digs \leftarrow 2 \cdot a.used + 1$ \\
5357 \hspace{3mm}3.2 If $digs < MP\_WARRAY$ and $a.used \le \delta$ then \\
5358 \hspace{6mm}3.2.1 $b \leftarrow a^2$ using algorithm fast\_s\_mp\_sqr. \\
5359 \hspace{3mm}3.3 else \\
5360 \hspace{6mm}3.3.1 $b \leftarrow a^2$ using algorithm s\_mp\_sqr. \\
5361 4. $b.sign \leftarrow MP\_ZPOS$ \\
5362 5. Return the result of the unsigned squaring performed. \\
5363 \hline
5364 \end{tabular}
5365 \end{center}
5366 \end{small}
5367 \caption{Algorithm mp\_sqr}
5368 \end{figure}
5369
5370 \textbf{Algorithm mp\_sqr.}
5371 This algorithm computes the square of the input using one of four different algorithms. If the input is very large and has at least
5372 \textbf{TOOM\_SQR\_CUTOFF} or \textbf{KARATSUBA\_SQR\_CUTOFF} digits then either the Toom-Cook or the Karatsuba Squaring algorithm is used. If
5373 neither of the polynomial basis algorithms should be used then either the Comba or baseline algorithm is used.
5374
5375 \vspace{+3mm}\begin{small}
5376 \hspace{-5.1mm}{\bf File}: bn\_mp\_sqr.c
5377 \vspace{-3mm}
5378 \begin{alltt}
5379 016
5380 017 /* computes b = a*a */
5381 018 int
5382 019 mp_sqr (mp_int * a, mp_int * b)
5383 020 \{
5384 021 int res;
5385 022
5386 023 /* use Toom-Cook? */
5387 024 if (a->used >= TOOM_SQR_CUTOFF) \{
5388 025 res = mp_toom_sqr(a, b);
5389 026 /* Karatsuba? */
5390 027 \} else if (a->used >= KARATSUBA_SQR_CUTOFF) \{
5391 028 res = mp_karatsuba_sqr (a, b);
5392 029 \} else \{
5393 030 /* can we use the fast comba multiplier? */
5394 031 if ((a->used * 2 + 1) < MP_WARRAY &&
5395 032 a->used <
5396 033 (1 << (sizeof(mp_word) * CHAR_BIT - 2*DIGIT_BIT - 1))) \{
5397 034 res = fast_s_mp_sqr (a, b);
5398 035 \} else \{
5399 036 res = s_mp_sqr (a, b);
5400 037 \}
5401 038 \}
5402 039 b->sign = MP_ZPOS;
5403 040 return res;
5404 041 \}
5405 \end{alltt}
5406 \end{small}
5407
5408 \section*{Exercises}
5409 \begin{tabular}{cl}
5410 $\left [ 3 \right ] $ & Devise an efficient algorithm for selection of the radix point to handle inputs \\
5411 & that have a different number of digits in Karatsuba multiplication. \\
5412 & \\
5413 $\left [ 3 \right ] $ & In section 5.3 the fact that every column of a squaring is made up \\
5414 & of double products and at most one square is stated. Prove this statement. \\
5415 & \\
5416 $\left [ 2 \right ] $ & In the Comba squaring algorithm half of the $\hat X$ variables are not used. \\
5417 & Revise algorithm fast\_s\_mp\_sqr to shrink the $\hat X$ array. \\
5418 & \\
5419 $\left [ 3 \right ] $ & Prove the equation for Karatsuba squaring. \\
5420 & \\
5421 $\left [ 1 \right ] $ & Prove that Karatsuba squaring requires $O \left (n^{lg(3)} \right )$ time. \\
5422 & \\
5423 $\left [ 2 \right ] $ & Determine the minimal ratio between addition and multiplication clock cycles \\
5424 & required for equation $6.7$ to be true. \\
5425 & \\
5426 \end{tabular}
5427
5428 \chapter{Modular Reduction}
5429 \section{Basics of Modular Reduction}
5430 \index{modular residue}
5431 Modular reduction is an operation that arises quite often within public key cryptography algorithms and various number theoretic algorithms,
5432 such as factoring. Modular reduction algorithms are the third class of algorithms of the ``multipliers'' set. A number $a$ is said to be \textit{reduced}
5433 modulo another number $b$ by finding the remainder of the division $a/b$. Full integer division with remainder is a topic to be covered
5434 in~\ref{sec:division}.
5435
5436 Modular reduction is equivalent to solving for $r$ in the following equation. $a = bq + r$ where $q = \lfloor a/b \rfloor$. The result
5437 $r$ is said to be ``congruent to $a$ modulo $b$'' which is also written as $r \equiv a \mbox{ (mod }b\mbox{)}$. In other vernacular $r$ is known as the
5438 ``modular residue'' which leads to ``quadratic residue''\footnote{That's fancy talk for $b \equiv a^2 \mbox{ (mod }p\mbox{)}$.} and
5439 other forms of residues.
5440
5441 Modular reductions are normally used to create either finite groups, rings or fields. The most common usage for performance driven modular reductions
5442 is in modular exponentiation algorithms. That is to compute $d = a^b \mbox{ (mod }c\mbox{)}$ as fast as possible. This operation is used in the
5443 RSA and Diffie-Hellman public key algorithms, for example. Modular multiplication and squaring also appear as fundamental operations in
5444 Elliptic Curve cryptographic algorithms. As will be discussed in the subsequent chapter there exist fast algorithms for computing modular
5445 exponentiations without having to perform (\textit{in this example}) $b - 1$ multiplications. These algorithms will produce partial results in the
5446 range $0 \le x < c^2$ which can be taken advantage of to create several efficient algorithms. They have also been used to create redundancy check
5447 algorithms known as CRCs, error correction codes such as Reed-Solomon, and to solve a variety of number theoretic problems.
5448
5449 \section{The Barrett Reduction}
5450 The Barrett reduction algorithm \cite{BARRETT} was inspired by fast division algorithms which multiply by the reciprocal to emulate
5451 division. Barrett's observation was that the residue $c$ of $a$ modulo $b$ is equal to
5452
5453 \begin{equation}
5454 c = a - b \cdot \lfloor a/b \rfloor
5455 \end{equation}
5456
5457 Since algorithms such as modular exponentiation would be using the same modulus extensively, typical DSP\footnote{It is worth noting that Barrett's paper
5458 targeted the DSP56K processor.} intuition would indicate the next step would be to replace $a/b$ by a multiplication by the reciprocal. However,
5459 DSP intuition on its own will not work as these numbers are considerably larger than the precision of common DSP floating point data types.
5460 Another common optimization is required to make the approach practical.
5461
5462 \subsection{Fixed Point Arithmetic}
5463 The trick used to optimize the above equation is based on a technique of emulating floating point data types with fixed precision integers. Fixed
5464 point arithmetic became very popular as it greatly sped up the ``3d-shooter'' genre of games in the mid 1990s, when floating point units were
5465 fairly slow if not unavailable. The idea behind fixed point arithmetic is to take a normal $k$-bit integer data type and break it into a $p$-bit
5466 integer part and a $q$-bit fraction part (\textit{where $p+q = k$}).
5467
5468 In this system a $k$-bit integer $n$ would actually represent $n/2^q$. For example, with $q = 4$ the integer $n = 37$ would actually represent the
5469 value $2.3125$. To multiply two fixed point numbers the integers are multiplied using traditional arithmetic and subsequently normalized by
5470 moving the implied decimal point back to where it should be. For example, with $q = 4$ to multiply the integers $9$ and $5$ they must be converted
5471 to fixed point first by multiplying by $2^q$. Let $a = 9(2^q)$ represent the fixed point representation of $9$ and $b = 5(2^q)$ represent the
5472 fixed point representation of $5$. The product $ab$ is equal to $45(2^{2q})$ which when normalized by dividing by $2^q$ produces $45(2^q)$.
5473
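As a small illustration of the representation (plain machine integers with $q = 4$), the multiplication of $9$ and $5$ described above looks as follows.

\begin{small}
\begin{alltt}
#include <stdio.h>

int main(void)
\{
   int q = 4;
   int a = 9 << q;             /* fixed point form of 9, that is 144     */
   int b = 5 << q;             /* fixed point form of 5, that is 80      */

   int c = (a * b) >> q;       /* normalize 45 * 2^(2q) back to 45 * 2^q */

   printf("%d", c >> q);       /* drop the fraction bits, prints 45      */
   return 0;
\}
\end{alltt}
\end{small}
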
5474 This technique became popular since a normal integer multiplication and logical shift right are the only required operations to perform a multiplication
5475 of two fixed point numbers. Using fixed point arithmetic, division can be easily approximated by multiplying by the reciprocal. If $2^q$ is
5476 equivalent to one then $2^q/b$ is equivalent to the fixed point approximation of $1/b$ using real arithmetic. Using this fact, dividing an integer
5477 $a$ by another integer $b$ can be achieved with the following expression.
5478
5479 \begin{equation}
5480 \lfloor a / b \rfloor \mbox{ }\approx\mbox{ } \lfloor (a \cdot \lfloor 2^q / b \rfloor)/2^q \rfloor
5481 \end{equation}
5482
5483 The precision of the division is proportional to the value of $q$. If the divisor $b$ is used frequently, as is the case with
5484 modular exponentiation, pre-computing $2^q/b$ will allow a division to be performed with a multiplication and a right shift. Both operations
5485 are considerably faster than division on most processors.
5486
5487 Consider dividing $19$ by $5$. The correct result is $\lfloor 19/5 \rfloor = 3$. With $q = 3$ the reciprocal is $\lfloor 2^q/5 \rfloor = 1$ which
5488 leads to a product of $19$ which when divided by $2^q$ produces $2$. However, with $q = 4$ the reciprocal is $\lfloor 2^q/5 \rfloor = 3$ and
5489 the result of the emulated division is $\lfloor 3 \cdot 19 / 2^q \rfloor = 3$ which is correct. The value of $2^q$ must be close to or ideally
5490 larger than the dividend. In effect if $a$ is the dividend then $q$ should allow $0 \le \lfloor a/2^q \rfloor \le 1$ in order for this approach
5491 to work correctly. Plugging this form of division into the original equation, the following modular residue equation arises.
5492
5493 \begin{equation}
5494 c = a - b \cdot \lfloor (a \cdot \lfloor 2^q / b \rfloor)/2^q \rfloor
5495 \end{equation}
5496
5497 Using the notation from \cite{BARRETT} the value of $\lfloor 2^q / b \rfloor$ will be represented by the $\mu$ symbol. Using the $\mu$
5498 variable also helps reinforce the idea that it is meant to be computed once and re-used.
5499
5500 \begin{equation}
5501 c = a - b \cdot \lfloor (a \cdot \mu)/2^q \rfloor
5502 \end{equation}
5503
5504 Provided that $2^q \ge a$ this algorithm will produce a quotient that is either exactly correct or off by a value of one. In the context of Barrett
5505 reduction the value of $a$ is bound by $0 \le a \le (b - 1)^2$ meaning that $2^q \ge b^2$ is sufficient to ensure the reciprocal will have enough
5506 precision.
5507
5508 Let $n$ represent the number of digits in $b$. This algorithm requires approximately $2n^2$ single precision multiplications to produce the quotient and
5509 another $n^2$ single precision multiplications to find the residue. In total $3n^2$ single precision multiplications are required to
5510 reduce the number.
5511
5512 For example, if $b = 1179677$ and $q = 41$ ($2^q > b^2$), then the reciprocal $\mu$ is equal to $\lfloor 2^q / b \rfloor = 1864089$. Consider reducing
5513 $a = 180388626447$ modulo $b$ using the above reduction equation. The quotient using the new formula is $\lfloor (a \cdot \mu) / 2^q \rfloor = 152913$.
5514 By subtracting $152913b$ from $a$ the correct residue $a \equiv 677346 \mbox{ (mod }b\mbox{)}$ is found.
5515
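This worked example can be replayed with ordinary machine arithmetic. The following sketch (64-bit integers, so it is only valid while $a \cdot \mu$ fits in a single machine word) evaluates $c = a - b \cdot \lfloor (a \cdot \mu)/2^q \rfloor$ for the numbers above and then applies the fix up of at most one subtraction.

\begin{small}
\begin{alltt}
#include <stdio.h>
#include <stdint.h>

int main(void)
\{
   uint64_t b  = 1179677;                /* modulus                     */
   int      q  = 41;                     /* 2^41 > b^2                  */
   uint64_t mu = (UINT64_C(1) << q) / b; /* reciprocal, 1864089         */
   uint64_t a  = UINT64_C(180388626447); /* value to reduce             */

   uint64_t quot = (a * mu) >> q;        /* estimated quotient, 152913  */
   uint64_t c    = a - quot * b;         /* candidate residue           */

   if (c >= b) \{                         /* quotient may be off by one  */
      c -= b;
   \}

   printf("%llu", (unsigned long long) c);  /* prints 677346 */
   return 0;
\}
\end{alltt}
\end{small}
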
5516 \subsection{Choosing a Radix Point}
5517 Using the fixed point representation a modular reduction can be performed with $3n^2$ single precision multiplications. If that were the best
5518 that could be achieved a full division\footnote{A division requires approximately $O(2cn^2)$ single precision multiplications for a small value of $c$.
5519 See~\ref{sec:division} for further details.} might as well be used in its place. The key to optimizing the reduction is to reduce the precision of
5520 the initial multiplication that finds the quotient.
5521
5522 Let $a$ represent the number of which the residue is sought. Let $b$ represent the modulus used to find the residue. Let $m$ represent
5523 the number of digits in $b$. For the purposes of this discussion we will assume that the number of digits in $a$ is $2m$, which is generally true if
5524 two $m$-digit numbers have been multiplied. Dividing $a$ by $b$ is the same as dividing a $2m$ digit integer by a $m$ digit integer. Digits below the
5525 $m - 1$'th digit of $a$ will contribute at most a value of $1$ to the quotient because $\beta^k < b$ for any $0 \le k \le m - 1$. Another way to
5526 express this is by re-writing $a$ as two parts. If $a' \equiv a \mbox{ (mod }\beta^{m-1}\mbox{)}$ and $a'' = a - a'$ then
5527 ${a \over b} \equiv {{a' + a''} \over b}$ which is equivalent to ${a' \over b} + {a'' \over b}$. Since $a'$ is bound to be less than $b$ the quotient
5528 is bound by $0 \le {a' \over b} < 1$.
5529
5530 Since the digits of $a'$ do not contribute much to the quotient the observation is that they might as well be zero. However, if the digits
5531 ``might as well be zero'' they might as well not be there in the first place. Let $q_0 = \lfloor a/\beta^{m-1} \rfloor$ represent the input
5532 with the irrelevant digits trimmed. Now the modular reduction is trimmed to the almost equivalent equation
5533
5534 \begin{equation}
5535 c = a - b \cdot \lfloor (q_0 \cdot \mu) / \beta^{m+1} \rfloor
5536 \end{equation}
5537
5538 Note that the original divisor $2^q$ has been replaced with $\beta^{m+1}$ where in this case $q$ is a multiple of $lg(\beta)$. Also note that the
5539 exponent on the divisor, when added to the amount $q_0$ was shifted by, equals $2m$. If the optimization had not been performed the divisor
5540 would have the exponent $2m$ so in the end the exponents do ``add up''. Using the above equation the quotient
5541 $\lfloor (q_0 \cdot \mu) / \beta^{m+1} \rfloor$ can be off from the true quotient by at most two. The original fixed point quotient can be off
5542 by as much as one (\textit{provided the radix point is chosen suitably}) and now that the lower irrelevant digits have been trimmed the quotient
5543 can be off by an additional value of one for a total of at most two. This implies that
5544 $0 \le a - b \cdot \lfloor (q_0 \cdot \mu) / \beta^{m+1} \rfloor < 3b$. By first subtracting $b$ times the quotient and then conditionally subtracting
5545 $b$ once or twice the residue is found.
5546
5547 The quotient is now found using $(m + 1)(m) = m^2 + m$ single precision multiplications and the residue with an additional $m^2$ single
5548 precision multiplications, ignoring the subtractions required. In total $2m^2 + m$ single precision multiplications are required to find the residue.
5549 This is considerably faster than the original attempt.
5550
5551 For example, let $\beta = 10$ represent the radix of the digits. Let $b = 9999$ represent the modulus which implies $m = 4$. Let $a = 99929878$
5552 represent the value of which the residue is desired. In this case $q = 8$ since $10^7 < 9999^2$ meaning that $\mu = \lfloor \beta^{q}/b \rfloor = 10001$.
5553 With the new observation the multiplicand for the quotient is equal to $q_0 = \lfloor a / \beta^{m - 1} \rfloor = 99929$. The quotient is then
5554 $\lfloor (q_0 \cdot \mu) / \beta^{m+1} \rfloor = 9993$. Subtracting $9993b$ from $a$ and the correct residue $a \equiv 9871 \mbox{ (mod }b\mbox{)}$
5555 is found.
5556
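The same numbers can be pushed through a sketch of the trimmed computation. Machine integers again stand in for multiple precision values; since $\beta = 10$ here, the digit shifts become divisions by powers of ten and the quotient may now be off by as much as two, hence the loop at the end.

\begin{small}
\begin{alltt}
#include <stdio.h>

int main(void)
\{
   long b    = 9999;                /* modulus, m = 4 digits, beta = 10 */
   long mu   = 100000000L / b;      /* beta^(2m) / b = 10001            */
   long a    = 99929878;            /* value to reduce                  */

   long q0   = a / 1000;            /* trim the lowest m - 1 digits     */
   long quot = (q0 * mu) / 100000;  /* divide by beta^(m+1), gives 9993 */
   long c    = a - quot * b;        /* candidate residue                */

   while (c >= b) \{                 /* at most two corrections needed   */
      c -= b;
   \}

   printf("%ld", c);                /* prints 9871 */
   return 0;
\}
\end{alltt}
\end{small}
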
5557 \subsection{Trimming the Quotient}
5558 So far the reduction algorithm has been optimized from $3m^2$ single precision multiplications down to $2m^2 + m$ single precision multiplications. As
5559 it stands now the algorithm is already fairly fast compared to a full integer division algorithm. However, there is still room for
5560 optimization.
5561
5562 After the first multiplication inside the quotient ($q_0 \cdot \mu$) the value is shifted right by $m + 1$ places effectively nullifying the lower
5563 half of the product. It would be nice to be able to remove those digits from the product to effectively cut down the number of single precision
5564 multiplications. If the number of digits in the modulus $m$ is far less than $\beta$ a full product is not required for the algorithm to work properly.
5565 In fact the lower $m - 2$ digits will not affect the upper half of the product at all and do not need to be computed.
5566
5567 The value of $\mu$ is a $m$-digit number and $q_0$ is a $m + 1$ digit number. Using a full multiplier $(m + 1)(m) = m^2 + m$ single precision
5568 multiplications would be required. Using a multiplier that will only produce digits at and above the $m - 1$'th digit reduces the number
5569 of single precision multiplications to ${m^2 + m} \over 2$ single precision multiplications.
5570
5571 \subsection{Trimming the Residue}
5572 After the quotient has been calculated it is used to reduce the input. As previously noted the algorithm is not exact and it can be off by a small
5573 multiple of the modulus, that is $0 \le a - b \cdot \lfloor (q_0 \cdot \mu) / \beta^{m+1} \rfloor < 3b$. If $b$ is $m$ digits then the
5574 result of the reduction equation is a value of at most $m + 1$ digits (\textit{provided $3 < \beta$}) implying that the upper $m - 1$ digits are
5575 implicitly zero.
5576
5577 The next optimization arises from this very fact. Instead of computing $b \cdot \lfloor (q_0 \cdot \mu) / \beta^{m+1} \rfloor$ using a full
5578 $O(m^2)$ multiplication algorithm only the lower $m+1$ digits of the product have to be computed. Similarly the value of $a$ can
5579 be reduced modulo $\beta^{m+1}$ before the multiple of $b$ is subtracted, which simplifies the subtraction as well. A multiplication that produces
5580 only the lower $m+1$ digits requires ${m^2 + 3m - 2} \over 2$ single precision multiplications.
5581
5582 With both optimizations in place the result is the algorithm Barrett proposed. It requires $m^2 + 2m - 1$ single precision multiplications which
5583 is considerably faster than the straightforward $3m^2$ method.
5584
5585 \subsection{The Barrett Algorithm}
5586 \newpage\begin{figure}[!here]
5587 \begin{small}
5588 \begin{center}
5589 \begin{tabular}{l}
5590 \hline Algorithm \textbf{mp\_reduce}. \\
5591 \textbf{Input}. mp\_int $a$, mp\_int $b$ and $\mu = \lfloor \beta^{2m}/b \rfloor, m = \lceil lg_{\beta}(b) \rceil, (0 \le a < b^2, b > 1)$ \\
5592 \textbf{Output}. $a \mbox{ (mod }b\mbox{)}$ \\
5593 \hline \\
5594 Let $m$ represent the number of digits in $b$. \\
5595 1. Make a copy of $a$ and store it in $q$. (\textit{mp\_init\_copy}) \\
5596 2. $q \leftarrow \lfloor q / \beta^{m - 1} \rfloor$ (\textit{mp\_rshd}) \\
5597 \\
5598 Produce the quotient. \\
5599 3. $q \leftarrow q \cdot \mu$ (\textit{note: only produce digits at or above $m-1$}) \\
5600 4. $q \leftarrow \lfloor q / \beta^{m + 1} \rfloor$ \\
5601 \\
5602 Subtract the multiple of modulus from the input. \\
5603 5. $a \leftarrow a \mbox{ (mod }\beta^{m+1}\mbox{)}$ (\textit{mp\_mod\_2d}) \\
5604 6. $q \leftarrow q \cdot b \mbox{ (mod }\beta^{m+1}\mbox{)}$ (\textit{s\_mp\_mul\_digs}) \\
5605 7. $a \leftarrow a - q$ (\textit{mp\_sub}) \\
5606 \\
Add $\beta^{m+1}$ if a carry occurred. \\
5608 8. If $a < 0$ then (\textit{mp\_cmp\_d}) \\
5609 \hspace{3mm}8.1 $q \leftarrow 1$ (\textit{mp\_set}) \\
5610 \hspace{3mm}8.2 $q \leftarrow q \cdot \beta^{m+1}$ (\textit{mp\_lshd}) \\
5611 \hspace{3mm}8.3 $a \leftarrow a + q$ \\
5612 \\
Now subtract the modulus if the residue is too large (i.e. the quotient was too small). \\
9. While $a \ge b$ do (\textit{mp\_cmp}) \\
\hspace{3mm}9.1 $a \leftarrow a - b$ \\
5616 10. Clear $q$. \\
5617 11. Return(\textit{MP\_OKAY}) \\
5618 \hline
5619 \end{tabular}
5620 \end{center}
5621 \end{small}
5622 \caption{Algorithm mp\_reduce}
5623 \end{figure}
5624
5625 \textbf{Algorithm mp\_reduce.}
5626 This algorithm will reduce the input $a$ modulo $b$ in place using the Barrett algorithm. It is loosely based on algorithm 14.42 of HAC
5627 \cite[pp. 602]{HAC} which is based on the paper from Paul Barrett \cite{BARRETT}. The algorithm has several restrictions and assumptions which must
5628 be adhered to for the algorithm to work.
5629
First the modulus $b$ is assumed to be positive and greater than one. If the modulus were less than or equal to one then subtracting
5631 a multiple of it would either accomplish nothing or actually enlarge the input. The input $a$ must be in the range $0 \le a < b^2$ in order
5632 for the quotient to have enough precision. If $a$ is the product of two numbers that were already reduced modulo $b$, this will not be a problem.
5633 Technically the algorithm will still work if $a \ge b^2$ but it will take much longer to finish. The value of $\mu$ is passed as an argument to this
5634 algorithm and is assumed to be calculated and stored before the algorithm is used.
5635
5636 Recall that the multiplication for the quotient on step 3 must only produce digits at or above the $m-1$'th position. An algorithm called
5637 $s\_mp\_mul\_high\_digs$ which has not been presented is used to accomplish this task. The algorithm is based on $s\_mp\_mul\_digs$ except that
instead of stopping at a given level of precision it starts at a given level of precision. This optimization can only be used if the number of digits in $b$ is very much smaller than $\beta$.
5640
5641 While it is known that
5642 $a \ge b \cdot \lfloor (q_0 \cdot \mu) / \beta^{m+1} \rfloor$ only the lower $m+1$ digits are being used to compute the residue, so an implied
5643 ``borrow'' from the higher digits might leave a negative result. After the multiple of the modulus has been subtracted from $a$ the residue must be
fixed up in case it is negative. The value $\beta^{m+1}$ must be added to the residue to make it positive again.
5645
5646 The while loop at step 9 will subtract $b$ until the residue is less than $b$. If the algorithm is performed correctly this step is
performed at most twice, and on average once. However, if $a \ge b^2$ then it will iterate substantially more times than it should.
5648
5649 \vspace{+3mm}\begin{small}
5650 \hspace{-5.1mm}{\bf File}: bn\_mp\_reduce.c
5651 \vspace{-3mm}
5652 \begin{alltt}
5653 016
5654 017 /* reduces x mod m, assumes 0 < x < m**2, mu is
5655 018 * precomputed via mp_reduce_setup.
5656 019 * From HAC pp.604 Algorithm 14.42
5657 020 */
5658 021 int
5659 022 mp_reduce (mp_int * x, mp_int * m, mp_int * mu)
5660 023 \{
5661 024 mp_int q;
5662 025 int res, um = m->used;
5663 026
5664 027 /* q = x */
5665 028 if ((res = mp_init_copy (&q, x)) != MP_OKAY) \{
5666 029 return res;
5667 030 \}
5668 031
5669 032 /* q1 = x / b**(k-1) */
5670 033 mp_rshd (&q, um - 1);
5671 034
5672 035 /* according to HAC this optimization is ok */
5673 036 if (((unsigned long) um) > (((mp_digit)1) << (DIGIT_BIT - 1))) \{
5674 037 if ((res = mp_mul (&q, mu, &q)) != MP_OKAY) \{
5675 038 goto CLEANUP;
5676 039 \}
5677 040 \} else \{
5678 041 if ((res = s_mp_mul_high_digs (&q, mu, &q, um - 1)) != MP_OKAY) \{
5679 042 goto CLEANUP;
5680 043 \}
5681 044 \}
5682 045
5683 046 /* q3 = q2 / b**(k+1) */
5684 047 mp_rshd (&q, um + 1);
5685 048
5686 049 /* x = x mod b**(k+1), quick (no division) */
5687 050 if ((res = mp_mod_2d (x, DIGIT_BIT * (um + 1), x)) != MP_OKAY) \{
5688 051 goto CLEANUP;
5689 052 \}
5690 053
5691 054 /* q = q * m mod b**(k+1), quick (no division) */
5692 055 if ((res = s_mp_mul_digs (&q, m, &q, um + 1)) != MP_OKAY) \{
5693 056 goto CLEANUP;
5694 057 \}
5695 058
5696 059 /* x = x - q */
5697 060 if ((res = mp_sub (x, &q, x)) != MP_OKAY) \{
5698 061 goto CLEANUP;
5699 062 \}
5700 063
5701 064 /* If x < 0, add b**(k+1) to it */
5702 065 if (mp_cmp_d (x, 0) == MP_LT) \{
5703 066 mp_set (&q, 1);
5704 067 if ((res = mp_lshd (&q, um + 1)) != MP_OKAY)
5705 068 goto CLEANUP;
5706 069 if ((res = mp_add (x, &q, x)) != MP_OKAY)
5707 070 goto CLEANUP;
5708 071 \}
5709 072
5710 073 /* Back off if it's too big */
5711 074 while (mp_cmp (x, m) != MP_LT) \{
5712 075 if ((res = s_mp_sub (x, m, x)) != MP_OKAY) \{
5713 076 goto CLEANUP;
5714 077 \}
5715 078 \}
5716 079
5717 080 CLEANUP:
5718 081 mp_clear (&q);
5719 082
5720 083 return res;
5721 084 \}
5722 \end{alltt}
5723 \end{small}
5724
5725 The first multiplication that determines the quotient can be performed by only producing the digits from $m - 1$ and up. This essentially halves
5726 the number of single precision multiplications required. However, the optimization is only safe if $\beta$ is much larger than the number of digits
5727 in the modulus. In the source code this is evaluated on lines 36 to 44 where algorithm s\_mp\_mul\_high\_digs is used when it is
5728 safe to do so.
5729
5730 \subsection{The Barrett Setup Algorithm}
5731 In order to use algorithm mp\_reduce the value of $\mu$ must be calculated in advance. Ideally this value should be computed once and stored for
5732 future use so that the Barrett algorithm can be used without delay.
5733
5734 \begin{figure}[!here]
5735 \begin{small}
5736 \begin{center}
5737 \begin{tabular}{l}
5738 \hline Algorithm \textbf{mp\_reduce\_setup}. \\
5739 \textbf{Input}. mp\_int $a$ ($a > 1$) \\
5740 \textbf{Output}. $\mu \leftarrow \lfloor \beta^{2m}/a \rfloor$ \\
5741 \hline \\
5742 1. $\mu \leftarrow 2^{2 \cdot lg(\beta) \cdot m}$ (\textit{mp\_2expt}) \\
2. $\mu \leftarrow \lfloor \mu / a \rfloor$ (\textit{mp\_div}) \\
5744 3. Return(\textit{MP\_OKAY}) \\
5745 \hline
5746 \end{tabular}
5747 \end{center}
5748 \end{small}
5749 \caption{Algorithm mp\_reduce\_setup}
5750 \end{figure}
5751
5752 \textbf{Algorithm mp\_reduce\_setup.}
5753 This algorithm computes the reciprocal $\mu$ required for Barrett reduction. First $\beta^{2m}$ is calculated as $2^{2 \cdot lg(\beta) \cdot m}$ which
is equivalent and much faster. The final value is computed by taking the integer quotient $\lfloor \mu / a \rfloor$.
5755
5756 \vspace{+3mm}\begin{small}
5757 \hspace{-5.1mm}{\bf File}: bn\_mp\_reduce\_setup.c
5758 \vspace{-3mm}
5759 \begin{alltt}
5760 016
5761 017 /* pre-calculate the value required for Barrett reduction
5762 018 * For a given modulus "b" it calulates the value required in "a"
5763 019 */
5764 020 int
5765 021 mp_reduce_setup (mp_int * a, mp_int * b)
5766 022 \{
5767 023 int res;
5768 024
5769 025 if ((res = mp_2expt (a, b->used * 2 * DIGIT_BIT)) != MP_OKAY) \{
5770 026 return res;
5771 027 \}
5772 028 return mp_div (a, b, a, NULL);
5773 029 \}
5774 \end{alltt}
5775 \end{small}
5776
This simple routine calculates the reciprocal $\mu$ required by Barrett reduction. Note the extended usage of algorithm mp\_div where the variable which would receive the remainder is passed as NULL. As will be discussed in~\ref{sec:division} the division routine allows both the quotient and the remainder to be passed as NULL meaning to ignore the value.
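
As a brief usage sketch, the following fragment (assuming the library header is available as tommath.h; the operands are arbitrary examples and all error checking has been omitted for brevity) computes $\mu$ once with mp\_reduce\_setup and then reduces a squared value with mp\_reduce.

\begin{small}
\begin{alltt}
#include <tommath.h>
#include <stdio.h>

int main(void)
\{
   mp_int a, b, mu;
   char buf[1024];

   mp_init(&a);
   mp_init(&b);
   mp_init(&mu);

   /* arbitrary modulus and operand with a < b, error checks omitted */
   mp_read_radix(&b, "2386092870923609170936823609823609283471", 10);
   mp_read_radix(&a, "1329812089374209384290842390482304983249", 10);

   /* mu = floor(beta^(2m) / b), computed once per modulus */
   mp_reduce_setup(&mu, &b);

   /* a <- a*a which lies in the range 0 <= a < b^2 ... */
   mp_sqr(&a, &a);

   /* ... so mp_reduce may be used to find a*a mod b */
   mp_reduce(&a, &b, &mu);

   mp_toradix(&a, buf, 10);
   printf("a*a mod b == %s", buf);

   mp_clear(&a);
   mp_clear(&b);
   mp_clear(&mu);
   return 0;
\}
\end{alltt}
\end{small}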
5780
5781 \section{The Montgomery Reduction}
5782 Montgomery reduction\footnote{Thanks to Niels Ferguson for his insightful explanation of the algorithm.} \cite{MONT} is by far the most interesting
form of reduction in common use. It computes a modular residue which is not actually equal to the residue of the input but instead equal to the residue times a constant. However, as perplexing as this may sound the algorithm is relatively simple and very efficient.
5785
5786 Throughout this entire section the variable $n$ will represent the modulus used to form the residue. As will be discussed shortly the value of
5787 $n$ must be odd. The variable $x$ will represent the quantity of which the residue is sought. Similar to the Barrett algorithm the input
5788 is restricted to $0 \le x < n^2$. To begin the description some simple number theory facts must be established.
5789
5790 \textbf{Fact 1.} Adding $n$ to $x$ does not change the residue since in effect it adds one to the quotient $\lfloor x / n \rfloor$. Another way
5791 to explain this is that $n$ is (\textit{or multiples of $n$ are}) congruent to zero modulo $n$. Adding zero will not change the value of the residue.
5792
\textbf{Fact 2.} If $x$ is even then dividing $x$ by two in $\Z$ produces a result congruent to $x \cdot 2^{-1} \mbox{ (mod }n\mbox{)}$. Actually this is an application of the fact that if $x$ is evenly divisible by any $k \in \Z$ then division by $k$ in $\Z$ will be congruent to multiplication by $k^{-1}$ modulo $n$.
5796
5797 From these two simple facts the following simple algorithm can be derived.
5798
5799 \newpage\begin{figure}[!here]
5800 \begin{small}
5801 \begin{center}
5802 \begin{tabular}{l}
5803 \hline Algorithm \textbf{Montgomery Reduction}. \\
5804 \textbf{Input}. Integer $x$, $n$ and $k$ \\
5805 \textbf{Output}. $2^{-k}x \mbox{ (mod }n\mbox{)}$ \\
5806 \hline \\
5807 1. for $t$ from $1$ to $k$ do \\
5808 \hspace{3mm}1.1 If $x$ is odd then \\
5809 \hspace{6mm}1.1.1 $x \leftarrow x + n$ \\
5810 \hspace{3mm}1.2 $x \leftarrow x/2$ \\
5811 2. Return $x$. \\
5812 \hline
5813 \end{tabular}
5814 \end{center}
5815 \end{small}
5816 \caption{Algorithm Montgomery Reduction}
5817 \end{figure}
5818
The algorithm reduces the input one bit at a time using the two congruences stated previously. Inside the loop $n$, which is odd, is
5820 added to $x$ if $x$ is odd. This forces $x$ to be even which allows the division by two in $\Z$ to be congruent to a modular division by two. Since
5821 $x$ is assumed to be initially much larger than $n$ the addition of $n$ will contribute an insignificant magnitude to $x$. Let $r$ represent the
5822 final result of the Montgomery algorithm. If $k > lg(n)$ and $0 \le x < n^2$ then the final result is limited to
5823 $0 \le r < \lfloor x/2^k \rfloor + n$. As a result at most a single subtraction is required to get the residue desired.
5824
5825 \begin{figure}[here]
5826 \begin{small}
5827 \begin{center}
5828 \begin{tabular}{|c|l|}
5829 \hline \textbf{Step number ($t$)} & \textbf{Result ($x$)} \\
5830 \hline $1$ & $x + n = 5812$, $x/2 = 2906$ \\
5831 \hline $2$ & $x/2 = 1453$ \\
5832 \hline $3$ & $x + n = 1710$, $x/2 = 855$ \\
5833 \hline $4$ & $x + n = 1112$, $x/2 = 556$ \\
5834 \hline $5$ & $x/2 = 278$ \\
5835 \hline $6$ & $x/2 = 139$ \\
5836 \hline $7$ & $x + n = 396$, $x/2 = 198$ \\
5837 \hline $8$ & $x/2 = 99$ \\
5838 \hline
5839 \end{tabular}
5840 \end{center}
5841 \end{small}
5842 \caption{Example of Montgomery Reduction (I)}
5843 \label{fig:MONT1}
5844 \end{figure}
5845
5846 Consider the example in figure~\ref{fig:MONT1} which reduces $x = 5555$ modulo $n = 257$ when $k = 8$. The result of the algorithm $r = 99$ is
congruent to the value of $2^{-8} \cdot 5555 \mbox{ (mod }257\mbox{)}$. When $r$ is multiplied by $2^8$ modulo $257$ the correct residue $5555 \equiv 158 \mbox{ (mod }257\mbox{)}$ is produced.
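
The steps of figure~\ref{fig:MONT1} can be replayed with machine integers. The following minimal sketch in plain C (not part of the library) reduces $x = 5555$ modulo $n = 257$ with $k = 8$ and afterwards multiplies by $2^k$ to recover the ordinary residue.

\begin{small}
\begin{alltt}
#include <assert.h>
#include <stdio.h>

int main(void)
\{
   unsigned long x = 5555, n = 257, r;
   int t, k = 8;

   r = x;
   for (t = 0; t < k; t++) \{
      if (r & 1) \{
         r += n;          /* make r even without changing the residue     */
      \}
      r >>= 1;            /* divide by two, congruent to r * 2^-1 (mod n) */
   \}
   /* r is now congruent to x * 2^-k (mod n) */
   assert(r == 99);

   /* multiply by 2^k (mod n) to recover the ordinary residue */
   r = (r << k) % n;
   assert(r == x % n);     /* 158 */
   printf("x mod n = %lu", r);
   return 0;
\}
\end{alltt}
\end{small}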
5849
5850 Let $k = \lfloor lg(n) \rfloor + 1$ represent the number of bits in $n$. The current algorithm requires $2k^2$ single precision shifts
5851 and $k^2$ single precision additions. At this rate the algorithm is most certainly slower than Barrett reduction and not terribly useful.
5852 Fortunately there exists an alternative representation of the algorithm.
5853
5854 \begin{figure}[!here]
5855 \begin{small}
5856 \begin{center}
5857 \begin{tabular}{l}
5858 \hline Algorithm \textbf{Montgomery Reduction} (modified I). \\
5859 \textbf{Input}. Integer $x$, $n$ and $k$ \\
5860 \textbf{Output}. $2^{-k}x \mbox{ (mod }n\mbox{)}$ \\
5861 \hline \\
5862 1. for $t$ from $0$ to $k - 1$ do \\
5863 \hspace{3mm}1.1 If the $t$'th bit of $x$ is one then \\
5864 \hspace{6mm}1.1.1 $x \leftarrow x + 2^tn$ \\
5865 2. Return $x/2^k$. \\
5866 \hline
5867 \end{tabular}
5868 \end{center}
5869 \end{small}
5870 \caption{Algorithm Montgomery Reduction (modified I)}
5871 \end{figure}
5872
5873 This algorithm is equivalent since $2^tn$ is a multiple of $n$ and the lower $k$ bits of $x$ are zero by step 2. The number of single
5874 precision shifts has now been reduced from $2k^2$ to $k^2 + k$ which is only a small improvement.
5875
5876 \begin{figure}[here]
5877 \begin{small}
5878 \begin{center}
5879 \begin{tabular}{|c|l|r|}
5880 \hline \textbf{Step number ($t$)} & \textbf{Result ($x$)} & \textbf{Result ($x$) in Binary} \\
5881 \hline -- & $5555$ & $1010110110011$ \\
5882 \hline $1$ & $x + 2^{0}n = 5812$ & $1011010110100$ \\
5883 \hline $2$ & $5812$ & $1011010110100$ \\
5884 \hline $3$ & $x + 2^{2}n = 6840$ & $1101010111000$ \\
5885 \hline $4$ & $x + 2^{3}n = 8896$ & $10001011000000$ \\
5886 \hline $5$ & $8896$ & $10001011000000$ \\
5887 \hline $6$ & $8896$ & $10001011000000$ \\
5888 \hline $7$ & $x + 2^{6}n = 25344$ & $110001100000000$ \\
5889 \hline $8$ & $25344$ & $110001100000000$ \\
5890 \hline -- & $x/2^k = 99$ & \\
5891 \hline
5892 \end{tabular}
5893 \end{center}
5894 \end{small}
5895 \caption{Example of Montgomery Reduction (II)}
5896 \label{fig:MONT2}
5897 \end{figure}
5898
5899 Figure~\ref{fig:MONT2} demonstrates the modified algorithm reducing $x = 5555$ modulo $n = 257$ with $k = 8$.
With this algorithm a single shift right at the end is the only right shift required to reduce the input instead of $k$ right shifts inside the loop. Note that for the iterations $t = 2, 5, 6$ and $8$ the result $x$ is not changed. In those iterations the $t$'th bit of $x$ is zero and the appropriate multiple of $n$ does not need to be added to force the $t$'th bit of the result to zero.
5903
5904 \subsection{Digit Based Montgomery Reduction}
Instead of computing the reduction on a bit-by-bit basis it is actually much faster to compute it on a digit-by-digit basis. Consider the
5906 previous algorithm re-written to compute the Montgomery reduction in this new fashion.
5907
5908 \begin{figure}[!here]
5909 \begin{small}
5910 \begin{center}
5911 \begin{tabular}{l}
5912 \hline Algorithm \textbf{Montgomery Reduction} (modified II). \\
5913 \textbf{Input}. Integer $x$, $n$ and $k$ \\
5914 \textbf{Output}. $\beta^{-k}x \mbox{ (mod }n\mbox{)}$ \\
5915 \hline \\
5916 1. for $t$ from $0$ to $k - 1$ do \\
5917 \hspace{3mm}1.1 $x \leftarrow x + \mu n \beta^t$ \\
5918 2. Return $x/\beta^k$. \\
5919 \hline
5920 \end{tabular}
5921 \end{center}
5922 \end{small}
5923 \caption{Algorithm Montgomery Reduction (modified II)}
5924 \end{figure}
5925
5926 The value $\mu n \beta^t$ is a multiple of the modulus $n$ meaning that it will not change the residue. If the first digit of
5927 the value $\mu n \beta^t$ equals the negative (modulo $\beta$) of the $t$'th digit of $x$ then the addition will result in a zero digit. This
problem breaks down to solving the following congruence.
5929
5930 \begin{center}
5931 \begin{tabular}{rcl}
5932 $x_t + \mu n_0$ & $\equiv$ & $0 \mbox{ (mod }\beta\mbox{)}$ \\
5933 $\mu n_0$ & $\equiv$ & $-x_t \mbox{ (mod }\beta\mbox{)}$ \\
5934 $\mu$ & $\equiv$ & $-x_t/n_0 \mbox{ (mod }\beta\mbox{)}$ \\
5935 \end{tabular}
5936 \end{center}
5937
5938 In each iteration of the loop on step 1 a new value of $\mu$ must be calculated. The value of $-1/n_0 \mbox{ (mod }\beta\mbox{)}$ is used
5939 extensively in this algorithm and should be precomputed. Let $\rho$ represent the negative of the modular inverse of $n_0$ modulo $\beta$.
5940
5941 For example, let $\beta = 10$ represent the radix. Let $n = 17$ represent the modulus which implies $k = 2$ and $\rho \equiv 7$. Let $x = 33$
5942 represent the value to reduce.
5943
5944 \newpage\begin{figure}
5945 \begin{center}
5946 \begin{tabular}{|c|c|c|}
5947 \hline \textbf{Step ($t$)} & \textbf{Value of $x$} & \textbf{Value of $\mu$} \\
5948 \hline -- & $33$ & --\\
5949 \hline $0$ & $33 + \mu n = 50$ & $1$ \\
5950 \hline $1$ & $50 + \mu n \beta = 900$ & $5$ \\
5951 \hline
5952 \end{tabular}
5953 \end{center}
5954 \caption{Example of Montgomery Reduction}
5955 \end{figure}
5956
The value $900$ is then divided by $\beta^k$ to produce the final result $9$. The first observation is that $9 \nequiv x \mbox{ (mod }n\mbox{)}$
5958 which implies the result is not the modular residue of $x$ modulo $n$. However, recall that the residue is actually multiplied by $\beta^{-k}$ in
5959 the algorithm. To get the true residue the value must be multiplied by $\beta^k$. In this case $\beta^k \equiv 15 \mbox{ (mod }n\mbox{)}$ and
5960 the correct residue is $9 \cdot 15 \equiv 16 \mbox{ (mod }n\mbox{)}$.
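
The same digit based example can be carried out mechanically. The following minimal sketch in plain C (again not library code) uses $\beta = 10$, $n = 17$, $\rho = 7$ and $x = 33$, then removes the $\beta^{-k}$ factor at the end.

\begin{small}
\begin{alltt}
#include <assert.h>
#include <stdio.h>

int main(void)
\{
   unsigned long x = 33, n = 17, beta = 10;
   unsigned long rho = 7;      /* -1/n0 mod beta, since 7*7 = 49 = -1 (mod 10) */
   unsigned long betak = 100;  /* beta^k for the k = 2 digits of n             */
   unsigned long mu, shift = 1;
   int t, k = 2;

   for (t = 0; t < k; t++) \{
      /* mu = (t'th digit of x) * rho mod beta */
      mu = ((x / shift) % beta) * rho % beta;

      /* add mu * n * beta^t which zeroes the t'th digit of x */
      x += mu * n * shift;
      shift *= beta;
   \}
   assert(x == 900);

   x /= betak;                     /* divide by beta^k */
   assert(x == 9);

   x = (x * (betak % n)) % n;      /* undo the beta^-k factor */
   assert(x == 33 % n);            /* 16 */
   printf("x mod n = %lu", x);
   return 0;
\}
\end{alltt}
\end{small}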
5961
5962 \subsection{Baseline Montgomery Reduction}
The baseline Montgomery reduction algorithm will produce the residue for any size input. It is designed to be a catch-all algorithm for
5964 Montgomery reductions.
5965
5966 \newpage\begin{figure}[!here]
5967 \begin{small}
5968 \begin{center}
5969 \begin{tabular}{l}
5970 \hline Algorithm \textbf{mp\_montgomery\_reduce}. \\
\textbf{Input}. mp\_int $x$, mp\_int $n$ and a digit $\rho \equiv -1/n_0 \mbox{ (mod }\beta\mbox{)}$. \\
5972 \hspace{11.5mm}($0 \le x < n^2, n > 1, (n, \beta) = 1, \beta^k > n$) \\
5973 \textbf{Output}. $\beta^{-k}x \mbox{ (mod }n\mbox{)}$ \\
5974 \hline \\
5975 1. $digs \leftarrow 2n.used + 1$ \\
2. If $digs < MP\_WARRAY$ and $n.used < \delta$ then \\
5977 \hspace{3mm}2.1 Use algorithm fast\_mp\_montgomery\_reduce instead. \\
5978 \\
5979 Setup $x$ for the reduction. \\
5980 3. If $x.alloc < digs$ then grow $x$ to $digs$ digits. \\
5981 4. $x.used \leftarrow digs$ \\
5982 \\
5983 Eliminate the lower $k$ digits. \\
5984 5. For $ix$ from $0$ to $k - 1$ do \\
5985 \hspace{3mm}5.1 $\mu \leftarrow x_{ix} \cdot \rho \mbox{ (mod }\beta\mbox{)}$ \\
5986 \hspace{3mm}5.2 $u \leftarrow 0$ \\
5987 \hspace{3mm}5.3 For $iy$ from $0$ to $k - 1$ do \\
5988 \hspace{6mm}5.3.1 $\hat r \leftarrow \mu n_{iy} + x_{ix + iy} + u$ \\
5989 \hspace{6mm}5.3.2 $x_{ix + iy} \leftarrow \hat r \mbox{ (mod }\beta\mbox{)}$ \\
5990 \hspace{6mm}5.3.3 $u \leftarrow \lfloor \hat r / \beta \rfloor$ \\
5991 \hspace{3mm}5.4 While $u > 0$ do \\
5992 \hspace{6mm}5.4.1 $iy \leftarrow iy + 1$ \\
5993 \hspace{6mm}5.4.2 $x_{ix + iy} \leftarrow x_{ix + iy} + u$ \\
5994 \hspace{6mm}5.4.3 $u \leftarrow \lfloor x_{ix+iy} / \beta \rfloor$ \\
5995 \hspace{6mm}5.4.4 $x_{ix + iy} \leftarrow x_{ix+iy} \mbox{ (mod }\beta\mbox{)}$ \\
5996 \\
5997 Divide by $\beta^k$ and fix up as required. \\
5998 6. $x \leftarrow \lfloor x / \beta^k \rfloor$ \\
5999 7. If $x \ge n$ then \\
6000 \hspace{3mm}7.1 $x \leftarrow x - n$ \\
6001 8. Return(\textit{MP\_OKAY}). \\
6002 \hline
6003 \end{tabular}
6004 \end{center}
6005 \end{small}
6006 \caption{Algorithm mp\_montgomery\_reduce}
6007 \end{figure}
6008
6009 \textbf{Algorithm mp\_montgomery\_reduce.}
6010 This algorithm reduces the input $x$ modulo $n$ in place using the Montgomery reduction algorithm. The algorithm is loosely based
6011 on algorithm 14.32 of \cite[pp.601]{HAC} except it merges the multiplication of $\mu n \beta^t$ with the addition in the inner loop. The
6012 restrictions on this algorithm are fairly easy to adapt to. First $0 \le x < n^2$ bounds the input to numbers in the same range as
6013 for the Barrett algorithm. Additionally if $n > 1$ and $n$ is odd there will exist a modular inverse $\rho$. $\rho$ must be calculated in
6014 advance of this algorithm. Finally the variable $k$ is fixed and a pseudonym for $n.used$.
6015
6016 Step 2 decides whether a faster Montgomery algorithm can be used. It is based on the Comba technique meaning that there are limits on
6017 the size of the input. This algorithm is discussed in sub-section 6.3.3.
6018
6019 Step 5 is the main reduction loop of the algorithm. The value of $\mu$ is calculated once per iteration in the outer loop. The inner loop
6020 calculates $x + \mu n \beta^{ix}$ by multiplying $\mu n$ and adding the result to $x$ shifted by $ix$ digits. Both the addition and
6021 multiplication are performed in the same loop to save time and memory. Step 5.4 will handle any additional carries that escape the inner loop.
6022
Using a quick inspection this algorithm requires $k$ single precision multiplications for the outer loop and $k^2$ single precision multiplications in the inner loop. In total $k^2 + k$ single precision multiplications which compares favourably to Barrett at $k^2 + 2k - 1$ single precision multiplications.
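
As a usage sketch (assuming tommath.h is available; the operands are arbitrary examples and error checking is omitted), the following fragment obtains $\rho$ with mp\_montgomery\_setup, reduces a value with mp\_montgomery\_reduce and then multiplies by $\beta^k$ modulo $n$ to turn the Montgomery residue back into an ordinary residue.

\begin{small}
\begin{alltt}
#include <tommath.h>
#include <stdio.h>

int main(void)
\{
   mp_int x, n, R;
   mp_digit rho;
   char buf[1024];

   mp_init(&x);
   mp_init(&n);
   mp_init(&R);

   /* arbitrary odd modulus and an input 0 <= x < n^2, checks omitted */
   mp_read_radix(&n, "2893462489374209384290842390482304983249", 10);
   mp_read_radix(&x, "1329812089374209384290842390482304983241", 10);
   mp_sqr(&x, &x);

   /* rho = -1/n0 mod beta */
   mp_montgomery_setup(&n, &rho);

   /* x <- x * beta^(-k) mod n where k = n.used */
   mp_montgomery_reduce(&x, &n, rho);

   /* multiply by R = beta^k modulo n to recover the ordinary residue */
   mp_2expt(&R, n.used * DIGIT_BIT);
   mp_mulmod(&x, &R, &n, &x);

   mp_toradix(&x, buf, 10);
   printf("x mod n == %s", buf);

   mp_clear(&x);
   mp_clear(&n);
   mp_clear(&R);
   return 0;
\}
\end{alltt}
\end{small}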
6026
6027 \vspace{+3mm}\begin{small}
6028 \hspace{-5.1mm}{\bf File}: bn\_mp\_montgomery\_reduce.c
6029 \vspace{-3mm}
6030 \begin{alltt}
6031 016
6032 017 /* computes xR**-1 == x (mod N) via Montgomery Reduction */
6033 018 int
6034 019 mp_montgomery_reduce (mp_int * x, mp_int * n, mp_digit rho)
6035 020 \{
6036 021 int ix, res, digs;
6037 022 mp_digit mu;
6038 023
6039 024 /* can the fast reduction [comba] method be used?
6040 025 *
6041 026 * Note that unlike in mp_mul you're safely allowed *less*
6042 027 * than the available columns [255 per default] since carries
6043 028 * are fixed up in the inner loop.
6044 029 */
6045 030 digs = n->used * 2 + 1;
6046 031 if ((digs < MP_WARRAY) &&
6047 032 n->used <
6048 033 (1 << ((CHAR_BIT * sizeof (mp_word)) - (2 * DIGIT_BIT)))) \{
6049 034 return fast_mp_montgomery_reduce (x, n, rho);
6050 035 \}
6051 036
6052 037 /* grow the input as required */
6053 038 if (x->alloc < digs) \{
6054 039 if ((res = mp_grow (x, digs)) != MP_OKAY) \{
6055 040 return res;
6056 041 \}
6057 042 \}
6058 043 x->used = digs;
6059 044
6060 045 for (ix = 0; ix < n->used; ix++) \{
6061 046 /* mu = ai * rho mod b
6062 047 *
6063 048 * The value of rho must be precalculated via
6064 049 * bn_mp_montgomery_setup() such that
6065 050 * it equals -1/n0 mod b this allows the
6066 051 * following inner loop to reduce the
6067 052 * input one digit at a time
6068 053 */
6069 054 mu = (mp_digit) (((mp_word)x->dp[ix]) * ((mp_word)rho) & MP_MASK);
6070 055
6071 056 /* a = a + mu * m * b**i */
6072 057 \{
6073 058 register int iy;
6074 059 register mp_digit *tmpn, *tmpx, u;
6075 060 register mp_word r;
6076 061
6077 062 /* alias for digits of the modulus */
6078 063 tmpn = n->dp;
6079 064
6080 065 /* alias for the digits of x [the input] */
6081 066 tmpx = x->dp + ix;
6082 067
6083 068 /* set the carry to zero */
6084 069 u = 0;
6085 070
6086 071 /* Multiply and add in place */
6087 072 for (iy = 0; iy < n->used; iy++) \{
6088 073 /* compute product and sum */
6089 074 r = ((mp_word)mu) * ((mp_word)*tmpn++) +
6090 075 ((mp_word) u) + ((mp_word) * tmpx);
6091 076
6092 077 /* get carry */
6093 078 u = (mp_digit)(r >> ((mp_word) DIGIT_BIT));
6094 079
6095 080 /* fix digit */
6096 081 *tmpx++ = (mp_digit)(r & ((mp_word) MP_MASK));
6097 082 \}
6098 083 /* At this point the ix'th digit of x should be zero */
6099 084
6100 085
6101 086 /* propagate carries upwards as required*/
6102 087 while (u) \{
6103 088 *tmpx += u;
6104 089 u = *tmpx >> DIGIT_BIT;
6105 090 *tmpx++ &= MP_MASK;
6106 091 \}
6107 092 \}
6108 093 \}
6109 094
6110 095 /* at this point the n.used'th least
6111 096 * significant digits of x are all zero
6112 097 * which means we can shift x to the
6113 098 * right by n.used digits and the
6114 099 * residue is unchanged.
6115 100 */
6116 101
6117 102 /* x = x/b**n.used */
6118 103 mp_clamp(x);
6119 104 mp_rshd (x, n->used);
6120 105
6121 106 /* if x >= n then x = x - n */
6122 107 if (mp_cmp_mag (x, n) != MP_LT) \{
6123 108 return s_mp_sub (x, n, x);
6124 109 \}
6125 110
6126 111 return MP_OKAY;
6127 112 \}
6128 \end{alltt}
6129 \end{small}
6130
This is the baseline implementation of the Montgomery reduction algorithm. Lines 30 to 35 determine if the Comba based routine can be used instead. Line 54 computes the value of $\mu$ for that particular iteration of the outer loop.
6133
6134 The multiplication $\mu n \beta^{ix}$ is performed in one step in the inner loop. The alias $tmpx$ refers to the $ix$'th digit of $x$ and
6135 the alias $tmpn$ refers to the modulus $n$.
6136
6137 \subsection{Faster ``Comba'' Montgomery Reduction}
6138
6139 The Montgomery reduction requires fewer single precision multiplications than a Barrett reduction, however it is much slower due to the serial
6140 nature of the inner loop. The Barrett reduction algorithm requires two slightly modified multipliers which can be implemented with the Comba
6141 technique. The Montgomery reduction algorithm cannot directly use the Comba technique to any significant advantage since the inner loop calculates
6142 a $k \times 1$ product $k$ times.
6143
6144 The biggest obstacle is that at the $ix$'th iteration of the outer loop the value of $x_{ix}$ is required to calculate $\mu$. This means the
6145 carries from $0$ to $ix - 1$ must have been propagated upwards to form a valid $ix$'th digit. The solution as it turns out is very simple.
6146 Perform a Comba like multiplier and inside the outer loop just after the inner loop fix up the $ix + 1$'th digit by forwarding the carry.
6147
6148 With this change in place the Montgomery reduction algorithm can be performed with a Comba style multiplication loop which substantially increases
6149 the speed of the algorithm.
6150
6151 \newpage\begin{figure}[!here]
6152 \begin{small}
6153 \begin{center}
6154 \begin{tabular}{l}
6155 \hline Algorithm \textbf{fast\_mp\_montgomery\_reduce}. \\
\textbf{Input}. mp\_int $x$, mp\_int $n$ and a digit $\rho \equiv -1/n_0 \mbox{ (mod }\beta\mbox{)}$. \\
6157 \hspace{11.5mm}($0 \le x < n^2, n > 1, (n, \beta) = 1, \beta^k > n$) \\
6158 \textbf{Output}. $\beta^{-k}x \mbox{ (mod }n\mbox{)}$ \\
6159 \hline \\
6160 Place an array of \textbf{MP\_WARRAY} mp\_word variables called $\hat W$ on the stack. \\
6161 1. if $x.alloc < n.used + 1$ then grow $x$ to $n.used + 1$ digits. \\
6162 Copy the digits of $x$ into the array $\hat W$ \\
6163 2. For $ix$ from $0$ to $x.used - 1$ do \\
6164 \hspace{3mm}2.1 $\hat W_{ix} \leftarrow x_{ix}$ \\
6165 3. For $ix$ from $x.used$ to $2n.used - 1$ do \\
6166 \hspace{3mm}3.1 $\hat W_{ix} \leftarrow 0$ \\
Eliminate the lower $k$ digits. \\
6168 4. for $ix$ from $0$ to $n.used - 1$ do \\
6169 \hspace{3mm}4.1 $\mu \leftarrow \hat W_{ix} \cdot \rho \mbox{ (mod }\beta\mbox{)}$ \\
6170 \hspace{3mm}4.2 For $iy$ from $0$ to $n.used - 1$ do \\
6171 \hspace{6mm}4.2.1 $\hat W_{iy + ix} \leftarrow \hat W_{iy + ix} + \mu \cdot n_{iy}$ \\
6172 \hspace{3mm}4.3 $\hat W_{ix + 1} \leftarrow \hat W_{ix + 1} + \lfloor \hat W_{ix} / \beta \rfloor$ \\
6173 Propagate carries upwards. \\
6174 5. for $ix$ from $n.used$ to $2n.used + 1$ do \\
6175 \hspace{3mm}5.1 $\hat W_{ix + 1} \leftarrow \hat W_{ix + 1} + \lfloor \hat W_{ix} / \beta \rfloor$ \\
6176 Shift right and reduce modulo $\beta$ simultaneously. \\
6177 6. for $ix$ from $0$ to $n.used + 1$ do \\
6178 \hspace{3mm}6.1 $x_{ix} \leftarrow \hat W_{ix + n.used} \mbox{ (mod }\beta\mbox{)}$ \\
6179 Zero excess digits and fixup $x$. \\
6180 7. if $x.used > n.used + 1$ then do \\
6181 \hspace{3mm}7.1 for $ix$ from $n.used + 1$ to $x.used - 1$ do \\
6182 \hspace{6mm}7.1.1 $x_{ix} \leftarrow 0$ \\
6183 8. $x.used \leftarrow n.used + 1$ \\
6184 9. Clamp excessive digits of $x$. \\
6185 10. If $x \ge n$ then \\
6186 \hspace{3mm}10.1 $x \leftarrow x - n$ \\
6187 11. Return(\textit{MP\_OKAY}). \\
6188 \hline
6189 \end{tabular}
6190 \end{center}
6191 \end{small}
6192 \caption{Algorithm fast\_mp\_montgomery\_reduce}
6193 \end{figure}
6194
6195 \textbf{Algorithm fast\_mp\_montgomery\_reduce.}
6196 This algorithm will compute the Montgomery reduction of $x$ modulo $n$ using the Comba technique. It is on most computer platforms significantly
6197 faster than algorithm mp\_montgomery\_reduce and algorithm mp\_reduce (\textit{Barrett reduction}). The algorithm has the same restrictions
on the input as the baseline reduction algorithm. An additional two restrictions are imposed on this algorithm. The number of digits $k$ in the modulus $n$ must not violate $MP\_WARRAY > 2k + 1$ and $k < \delta$. When $\beta = 2^{28}$ this algorithm can be used to reduce modulo a modulus of at most $3,556$ bits in length.
6201
6202 As in the other Comba reduction algorithms there is a $\hat W$ array which stores the columns of the product. It is initially filled with the
contents of $x$ with the excess digits zeroed. The reduction loop is very similar to the baseline loop at heart. The multiplication on step 4.1 can be single precision only since $ab \mbox{ (mod }\beta\mbox{)} \equiv (a \mbox{ mod }\beta)(b \mbox{ mod }\beta)$. Some multipliers, such as those on ARM processors, take a variable length of time to complete depending on the number of bytes in the result they must produce. By performing a single precision multiplication instead roughly half the time is spent.
6207
Also note that digit $\hat W_{ix}$ must have the carry from the $ix - 1$'th digit propagated upwards in order for this to work. That is what step 4.3 will do. In effect over the $n.used$ iterations of the outer loop the lower $n.used$ columns all have their carries propagated forwards. Note how the upper bits of those same words are not reduced modulo $\beta$. This is because those values will be discarded shortly and there is no point in doing so.
6212
6213 Step 5 will propagate the remainder of the carries upwards. On step 6 the columns are reduced modulo $\beta$ and shifted simultaneously as they are
6214 stored in the destination $x$.
6215
6216 \vspace{+3mm}\begin{small}
6217 \hspace{-5.1mm}{\bf File}: bn\_fast\_mp\_montgomery\_reduce.c
6218 \vspace{-3mm}
6219 \begin{alltt}
6220 016
6221 017 /* computes xR**-1 == x (mod N) via Montgomery Reduction
6222 018 *
6223 019 * This is an optimized implementation of mp_montgomery_reduce
6224 020 * which uses the comba method to quickly calculate the columns of the
6225 021 * reduction.
6226 022 *
6227 023 * Based on Algorithm 14.32 on pp.601 of HAC.
6228 024 */
6229 025 int
6230 026 fast_mp_montgomery_reduce (mp_int * x, mp_int * n, mp_digit rho)
6231 027 \{
6232 028 int ix, res, olduse;
6233 029 mp_word W[MP_WARRAY];
6234 030
6235 031 /* get old used count */
6236 032 olduse = x->used;
6237 033
6238 034 /* grow a as required */
6239 035 if (x->alloc < n->used + 1) \{
6240 036 if ((res = mp_grow (x, n->used + 1)) != MP_OKAY) \{
6241 037 return res;
6242 038 \}
6243 039 \}
6244 040
6245 041 /* first we have to get the digits of the input into
6246 042 * an array of double precision words W[...]
6247 043 */
6248 044 \{
6249 045 register mp_word *_W;
6250 046 register mp_digit *tmpx;
6251 047
6252 048 /* alias for the W[] array */
6253 049 _W = W;
6254 050
6255 051 /* alias for the digits of x*/
6256 052 tmpx = x->dp;
6257 053
6258 054 /* copy the digits of a into W[0..a->used-1] */
6259 055 for (ix = 0; ix < x->used; ix++) \{
6260 056 *_W++ = *tmpx++;
6261 057 \}
6262 058
6263 059 /* zero the high words of W[a->used..m->used*2] */
6264 060 for (; ix < n->used * 2 + 1; ix++) \{
6265 061 *_W++ = 0;
6266 062 \}
6267 063 \}
6268 064
6269 065 /* now we proceed to zero successive digits
6270 066 * from the least significant upwards
6271 067 */
6272 068 for (ix = 0; ix < n->used; ix++) \{
6273 069 /* mu = ai * m' mod b
6274 070 *
6275 071 * We avoid a double precision multiplication (which isn't required)
6276 072 * by casting the value down to a mp_digit. Note this requires
6277 073 * that W[ix-1] have the carry cleared (see after the inner loop)
6278 074 */
6279 075 register mp_digit mu;
6280 076 mu = (mp_digit) (((W[ix] & MP_MASK) * rho) & MP_MASK);
6281 077
6282 078 /* a = a + mu * m * b**i
6283 079 *
6284 080 * This is computed in place and on the fly. The multiplication
6285 081 * by b**i is handled by offseting which columns the results
6286 082 * are added to.
6287 083 *
6288 084 * Note the comba method normally doesn't handle carries in the
6289 085 * inner loop In this case we fix the carry from the previous
6290 086 * column since the Montgomery reduction requires digits of the
6291 087 * result (so far) [see above] to work. This is
6292 088 * handled by fixing up one carry after the inner loop. The
6293 089 * carry fixups are done in order so after these loops the
6294 090 * first m->used words of W[] have the carries fixed
6295 091 */
6296 092 \{
6297 093 register int iy;
6298 094 register mp_digit *tmpn;
6299 095 register mp_word *_W;
6300 096
6301 097 /* alias for the digits of the modulus */
6302 098 tmpn = n->dp;
6303 099
6304 100 /* Alias for the columns set by an offset of ix */
6305 101 _W = W + ix;
6306 102
6307 103 /* inner loop */
6308 104 for (iy = 0; iy < n->used; iy++) \{
6309 105 *_W++ += ((mp_word)mu) * ((mp_word)*tmpn++);
6310 106 \}
6311 107 \}
6312 108
6313 109 /* now fix carry for next digit, W[ix+1] */
6314 110 W[ix + 1] += W[ix] >> ((mp_word) DIGIT_BIT);
6315 111 \}
6316 112
6317 113 /* now we have to propagate the carries and
6318 114 * shift the words downward [all those least
6319 115 * significant digits we zeroed].
6320 116 */
6321 117 \{
6322 118 register mp_digit *tmpx;
6323 119 register mp_word *_W, *_W1;
6324 120
6325 121 /* nox fix rest of carries */
6326 122
6327 123 /* alias for current word */
6328 124 _W1 = W + ix;
6329 125
6330 126 /* alias for next word, where the carry goes */
6331 127 _W = W + ++ix;
6332 128
6333 129 for (; ix <= n->used * 2 + 1; ix++) \{
6334 130 *_W++ += *_W1++ >> ((mp_word) DIGIT_BIT);
6335 131 \}
6336 132
6337 133 /* copy out, A = A/b**n
6338 134 *
6339 135 * The result is A/b**n but instead of converting from an
6340 136 * array of mp_word to mp_digit than calling mp_rshd
6341 137 * we just copy them in the right order
6342 138 */
6343 139
6344 140 /* alias for destination word */
6345 141 tmpx = x->dp;
6346 142
6347 143 /* alias for shifted double precision result */
6348 144 _W = W + n->used;
6349 145
6350 146 for (ix = 0; ix < n->used + 1; ix++) \{
6351 147 *tmpx++ = (mp_digit)(*_W++ & ((mp_word) MP_MASK));
6352 148 \}
6353 149
6354 150 /* zero oldused digits, if the input a was larger than
6355 151 * m->used+1 we'll have to clear the digits
6356 152 */
6357 153 for (; ix < olduse; ix++) \{
6358 154 *tmpx++ = 0;
6359 155 \}
6360 156 \}
6361 157
6362 158 /* set the max used and clamp */
6363 159 x->used = n->used + 1;
6364 160 mp_clamp (x);
6365 161
6366 162 /* if A >= m then A = A - m */
6367 163 if (mp_cmp_mag (x, n) != MP_LT) \{
6368 164 return s_mp_sub (x, n, x);
6369 165 \}
6370 166 return MP_OKAY;
6371 167 \}
6372 \end{alltt}
6373 \end{small}
6374
The $\hat W$ array is first filled with the digits of $x$ by the loop on line 55 then the rest of the digits are zeroed by the loop on line 60. Both loops share the same alias variables to make the code easier to read.
6377
6378 The value of $\mu$ is calculated in an interesting fashion. First the value $\hat W_{ix}$ is reduced modulo $\beta$ and cast to a mp\_digit. This
6379 forces the compiler to use a single precision multiplication and prevents any concerns about loss of precision. Line 110 fixes the carry
6380 for the next iteration of the loop by propagating the carry from $\hat W_{ix}$ to $\hat W_{ix+1}$.
6381
The for loop on line 129 propagates the rest of the carries upwards through the columns. The for loop on line 146 reduces the columns modulo $\beta$ and shifts them $k$ places at the same time. The alias $\_ \hat W$ actually refers to the array $\hat W$ starting at the $n.used$'th digit, that is $\_ \hat W_{t} = \hat W_{n.used + t}$.
6385
6386 \subsection{Montgomery Setup}
6387 To calculate the variable $\rho$ a relatively simple algorithm will be required.
6388
6389 \begin{figure}[!here]
6390 \begin{small}
6391 \begin{center}
6392 \begin{tabular}{l}
6393 \hline Algorithm \textbf{mp\_montgomery\_setup}. \\
6394 \textbf{Input}. mp\_int $n$ ($n > 1$ and $(n, 2) = 1$) \\
6395 \textbf{Output}. $\rho \equiv -1/n_0 \mbox{ (mod }\beta\mbox{)}$ \\
6396 \hline \\
6397 1. $b \leftarrow n_0$ \\
6398 2. If $b$ is even return(\textit{MP\_VAL}) \\
3. $x \leftarrow (((b + 2) \mbox{ AND } 4) << 1) + b$ \\
6400 4. for $k$ from 0 to $\lceil lg(lg(\beta)) \rceil - 2$ do \\
6401 \hspace{3mm}4.1 $x \leftarrow x \cdot (2 - bx)$ \\
6402 5. $\rho \leftarrow \beta - x \mbox{ (mod }\beta\mbox{)}$ \\
6403 6. Return(\textit{MP\_OKAY}). \\
6404 \hline
6405 \end{tabular}
6406 \end{center}
6407 \end{small}
6408 \caption{Algorithm mp\_montgomery\_setup}
6409 \end{figure}
6410
6411 \textbf{Algorithm mp\_montgomery\_setup.}
6412 This algorithm will calculate the value of $\rho$ required within the Montgomery reduction algorithms. It uses a very interesting trick
6413 to calculate $1/n_0$ when $\beta$ is a power of two.
6414
6415 \vspace{+3mm}\begin{small}
6416 \hspace{-5.1mm}{\bf File}: bn\_mp\_montgomery\_setup.c
6417 \vspace{-3mm}
6418 \begin{alltt}
6419 016
6420 017 /* setups the montgomery reduction stuff */
6421 018 int
6422 019 mp_montgomery_setup (mp_int * n, mp_digit * rho)
6423 020 \{
6424 021 mp_digit x, b;
6425 022
6426 023 /* fast inversion mod 2**k
6427 024 *
6428 025 * Based on the fact that
6429 026 *
6430 027 * XA = 1 (mod 2**n) => (X(2-XA)) A = 1 (mod 2**2n)
6431 028 * => 2*X*A - X*X*A*A = 1
6432 029 * => 2*(1) - (1) = 1
6433 030 */
6434 031 b = n->dp[0];
6435 032
6436 033 if ((b & 1) == 0) \{
6437 034 return MP_VAL;
6438 035 \}
6439 036
6440 037 x = (((b + 2) & 4) << 1) + b; /* here x*a==1 mod 2**4 */
6441 038 x *= 2 - b * x; /* here x*a==1 mod 2**8 */
6442 039 #if !defined(MP_8BIT)
6443 040 x *= 2 - b * x; /* here x*a==1 mod 2**16 */
6444 041 #endif
6445 042 #if defined(MP_64BIT) || !(defined(MP_8BIT) || defined(MP_16BIT))
6446 043 x *= 2 - b * x; /* here x*a==1 mod 2**32 */
6447 044 #endif
6448 045 #ifdef MP_64BIT
6449 046 x *= 2 - b * x; /* here x*a==1 mod 2**64 */
6450 047 #endif
6451 048
6452 049 /* rho = -1/m mod b */
6453 050 *rho = (((mp_digit) 1 << ((mp_digit) DIGIT_BIT)) - x) & MP_MASK;
6454 051
6455 052 return MP_OKAY;
6456 053 \}
6457 \end{alltt}
6458 \end{small}
6459
6460 This source code computes the value of $\rho$ required to perform Montgomery reduction. It has been modified to avoid performing excess
6461 multiplications when $\beta$ is not the default 28-bits.
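
The inversion trick is easy to try in isolation. The following minimal sketch in plain C works with a 32-bit word instead of an mp\_digit and verifies that after the seed and three Newton style steps the product $bx$ really is congruent to one modulo $2^{32}$; the starting value of $b$ is an arbitrary odd constant chosen only for the example.

\begin{small}
\begin{alltt}
#include <assert.h>
#include <stdio.h>
#include <stdint.h>

int main(void)
\{
   uint32_t b = 2882400001UL;    /* any odd value will do */
   uint32_t x;

   x  = (((b + 2) & 4) << 1) + b;  /* x*b == 1 (mod 2^4)  */
   x *= 2 - b * x;                 /* x*b == 1 (mod 2^8)  */
   x *= 2 - b * x;                 /* x*b == 1 (mod 2^16) */
   x *= 2 - b * x;                 /* x*b == 1 (mod 2^32) */

   /* unsigned arithmetic is already modulo 2^32 here */
   assert((uint32_t)(b * x) == 1);

   /* rho = -1/b mod 2^32, the value Montgomery reduction wants */
   printf("rho = %lu", (unsigned long)(0 - x));
   return 0;
\}
\end{alltt}
\end{small}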
6462
6463 \section{The Diminished Radix Algorithm}
6464 The Diminished Radix method of modular reduction \cite{DRMET} is a fairly clever technique which can be more efficient than either the Barrett
6465 or Montgomery methods for certain forms of moduli. The technique is based on the following simple congruence.
6466
6467 \begin{equation}
6468 (x \mbox{ mod } n) + k \lfloor x / n \rfloor \equiv x \mbox{ (mod }(n - k)\mbox{)}
6469 \end{equation}
6470
This observation was used in the MMB \cite{MMB} block cipher to create a diffusion primitive. It used the fact that if $n = 2^{31}$ and $k = 1$ then an x86 multiplier could produce the 62-bit product and use the ``shrd'' instruction to perform a double-precision right shift. The proof of the above equation is very simple. First write $x$ in the product form.
6474
6475 \begin{equation}
6476 x = qn + r
6477 \end{equation}
6478
6479 Now reduce both sides modulo $(n - k)$.
6480
6481 \begin{equation}
6482 x \equiv qk + r \mbox{ (mod }(n-k)\mbox{)}
6483 \end{equation}
6484
6485 The variable $n$ reduces modulo $n - k$ to $k$. By putting $q = \lfloor x/n \rfloor$ and $r = x \mbox{ mod } n$
6486 into the equation the original congruence is reproduced, thus concluding the proof. The following algorithm is based on this observation.
6487
6488 \begin{figure}[!here]
6489 \begin{small}
6490 \begin{center}
6491 \begin{tabular}{l}
6492 \hline Algorithm \textbf{Diminished Radix Reduction}. \\
6493 \textbf{Input}. Integer $x$, $n$, $k$ \\
6494 \textbf{Output}. $x \mbox{ mod } (n - k)$ \\
6495 \hline \\
6496 1. $q \leftarrow \lfloor x / n \rfloor$ \\
6497 2. $q \leftarrow k \cdot q$ \\
6498 3. $x \leftarrow x \mbox{ (mod }n\mbox{)}$ \\
6499 4. $x \leftarrow x + q$ \\
6500 5. If $x \ge (n - k)$ then \\
6501 \hspace{3mm}5.1 $x \leftarrow x - (n - k)$ \\
6502 \hspace{3mm}5.2 Goto step 1. \\
6503 6. Return $x$ \\
6504 \hline
6505 \end{tabular}
6506 \end{center}
6507 \end{small}
6508 \caption{Algorithm Diminished Radix Reduction}
6509 \label{fig:DR}
6510 \end{figure}
6511
This algorithm will reduce $x$ modulo $n - k$ and return the residue. If $0 \le x < (n - k)^2$ then the algorithm will almost always loop once or twice and occasionally three times. For simplicity's sake the value of $x$ is bounded by the following simple polynomial.
6514
6515 \begin{equation}
6516 0 \le x < n^2 + k^2 - 2nk
6517 \end{equation}
6518
6519 The true bound is $0 \le x < (n - k - 1)^2$ but this has quite a few more terms. The value of $q$ after step 1 is bounded by the following.
6520
6521 \begin{equation}
q < n - 2k + k^2/n
6523 \end{equation}
6524
6525 Since $k^2$ is going to be considerably smaller than $n$ that term will always be zero. The value of $x$ after step 3 is bounded trivially as
6526 $0 \le x < n$. By step four the sum $x + q$ is bounded by
6527
6528 \begin{equation}
6529 0 \le q + x < (k + 1)n - 2k^2 - 1
6530 \end{equation}
6531
With a second pass $q$ will be loosely bounded by $0 \le q < k^2$ after step 2 while $x$ will still be loosely bounded by $0 \le x < n$ after step 3. After the second pass it is highly unlikely that the sum in step 4 will exceed $n - k$. In practice fewer than three passes of the algorithm are required to reduce virtually every input in the range $0 \le x < (n - k - 1)^2$.
6535
6536 \begin{figure}
6537 \begin{small}
6538 \begin{center}
6539 \begin{tabular}{|l|}
6540 \hline
6541 $x = 123456789, n = 256, k = 3$ \\
6542 \hline $q \leftarrow \lfloor x/n \rfloor = 482253$ \\
6543 $q \leftarrow q*k = 1446759$ \\
6544 $x \leftarrow x \mbox{ mod } n = 21$ \\
6545 $x \leftarrow x + q = 1446780$ \\
6546 $x \leftarrow x - (n - k) = 1446527$ \\
6547 \hline
6548 $q \leftarrow \lfloor x/n \rfloor = 5650$ \\
6549 $q \leftarrow q*k = 16950$ \\
6550 $x \leftarrow x \mbox{ mod } n = 127$ \\
6551 $x \leftarrow x + q = 17077$ \\
6552 $x \leftarrow x - (n - k) = 16824$ \\
6553 \hline
6554 $q \leftarrow \lfloor x/n \rfloor = 65$ \\
6555 $q \leftarrow q*k = 195$ \\
6556 $x \leftarrow x \mbox{ mod } n = 184$ \\
6557 $x \leftarrow x + q = 379$ \\
6558 $x \leftarrow x - (n - k) = 126$ \\
6559 \hline
6560 \end{tabular}
6561 \end{center}
6562 \end{small}
6563 \caption{Example Diminished Radix Reduction}
6564 \label{fig:EXDR}
6565 \end{figure}
6566
6567 Figure~\ref{fig:EXDR} demonstrates the reduction of $x = 123456789$ modulo $n - k = 253$ when $n = 256$ and $k = 3$. Note that even while $x$
6568 is considerably larger than $(n - k - 1)^2 = 63504$ the algorithm still converges on the modular residue exceedingly fast. In this case only
6569 three passes were required to find the residue $x \equiv 126$.
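
The passes of figure~\ref{fig:EXDR} can be replayed directly from the pseudo-code of figure~\ref{fig:DR}. The following minimal sketch in plain C (not the library routine) reduces $x = 123456789$ modulo $n - k = 253$ with $n = 256$ and $k = 3$.

\begin{small}
\begin{alltt}
#include <assert.h>
#include <stdio.h>

int main(void)
\{
   unsigned long x = 123456789, n = 256, k = 3, q;

   for (;;) \{
      q = x / n;            /* quotient, a shift when n is a power of two  */
      x = x % n + q * k;    /* (x mod n) + k * floor(x/n)                  */
      if (x < n - k) \{
         break;
      \}
      x -= n - k;           /* subtract the modulus and go for another pass */
   \}

   assert(x == 123456789UL % (n - k));   /* 126 */
   printf("x mod (n - k) = %lu", x);
   return 0;
\}
\end{alltt}
\end{small}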
6570
6571
6572 \subsection{Choice of Moduli}
6573 On the surface this algorithm looks like a very expensive algorithm. It requires a couple of subtractions followed by multiplication and other
6574 modular reductions. The usefulness of this algorithm becomes exceedingly clear when an appropriate modulus is chosen.
6575
6576 Division in general is a very expensive operation to perform. The one exception is when the division is by a power of the radix of representation used.
6577 Division by ten for example is simple for pencil and paper mathematics since it amounts to shifting the decimal place to the right. Similarly division
6578 by two (\textit{or powers of two}) is very simple for binary computers to perform. It would therefore seem logical to choose $n$ of the form $2^p$
6579 which would imply that $\lfloor x / n \rfloor$ is a simple shift of $x$ right $p$ bits.
6580
However, there is one operation related to division by powers of two that is even faster than this. If $n = \beta^p$ then the division may be
6582 performed by moving whole digits to the right $p$ places. In practice division by $\beta^p$ is much faster than division by $2^p$ for any $p$.
6583 Also with the choice of $n = \beta^p$ reducing $x$ modulo $n$ merely requires zeroing the digits above the $p-1$'th digit of $x$.
6584
6585 Throughout the next section the term ``restricted modulus'' will refer to a modulus of the form $\beta^p - k$ whereas the term ``unrestricted
6586 modulus'' will refer to a modulus of the form $2^p - k$. The word ``restricted'' in this case refers to the fact that it is based on the
6587 $2^p$ logic except $p$ must be a multiple of $lg(\beta)$.
6588
6589 \subsection{Choice of $k$}
Now that division and reduction (\textit{steps 1 and 3 of figure~\ref{fig:DR}}) have been optimized to simple digit operations the multiplication by $k$
6591 in step 2 is the most expensive operation. Fortunately the choice of $k$ is not terribly limited. For all intents and purposes it might
6592 as well be a single digit. The smaller the value of $k$ is the faster the algorithm will be.
6593
6594 \subsection{Restricted Diminished Radix Reduction}
6595 The restricted Diminished Radix algorithm can quickly reduce an input modulo a modulus of the form $n = \beta^p - k$. This algorithm can reduce
6596 an input $x$ within the range $0 \le x < n^2$ using only a couple passes of the algorithm demonstrated in figure~\ref{fig:DR}. The implementation
6597 of this algorithm has been optimized to avoid additional overhead associated with a division by $\beta^p$, the multiplication by $k$ or the addition
6598 of $x$ and $q$. The resulting algorithm is very efficient and can lead to substantial improvements over Barrett and Montgomery reduction when modular
6599 exponentiations are performed.
6600
6601 \newpage\begin{figure}[!here]
6602 \begin{small}
6603 \begin{center}
6604 \begin{tabular}{l}
6605 \hline Algorithm \textbf{mp\_dr\_reduce}. \\
6606 \textbf{Input}. mp\_int $x$, $n$ and a mp\_digit $k = \beta - n_0$ \\
6607 \hspace{11.5mm}($0 \le x < n^2$, $n > 1$, $0 < k < \beta$) \\
6608 \textbf{Output}. $x \mbox{ mod } n$ \\
6609 \hline \\
6610 1. $m \leftarrow n.used$ \\
6611 2. If $x.alloc < 2m$ then grow $x$ to $2m$ digits. \\
6612 3. $\mu \leftarrow 0$ \\
6613 4. for $i$ from $0$ to $m - 1$ do \\
6614 \hspace{3mm}4.1 $\hat r \leftarrow k \cdot x_{m+i} + x_{i} + \mu$ \\
6615 \hspace{3mm}4.2 $x_{i} \leftarrow \hat r \mbox{ (mod }\beta\mbox{)}$ \\
6616 \hspace{3mm}4.3 $\mu \leftarrow \lfloor \hat r / \beta \rfloor$ \\
6617 5. $x_{m} \leftarrow \mu$ \\
6618 6. for $i$ from $m + 1$ to $x.used - 1$ do \\
6619 \hspace{3mm}6.1 $x_{i} \leftarrow 0$ \\
6620 7. Clamp excess digits of $x$. \\
6621 8. If $x \ge n$ then \\
6622 \hspace{3mm}8.1 $x \leftarrow x - n$ \\
6623 \hspace{3mm}8.2 Goto step 3. \\
6624 9. Return(\textit{MP\_OKAY}). \\
6625 \hline
6626 \end{tabular}
6627 \end{center}
6628 \end{small}
6629 \caption{Algorithm mp\_dr\_reduce}
6630 \end{figure}
6631
6632 \textbf{Algorithm mp\_dr\_reduce.}
This algorithm will perform the Diminished Radix reduction of $x$ modulo $n$. It has similar restrictions to those of the Barrett reduction with the addition that $n$ must be of the form $n = \beta^m - k$ where $0 < k < \beta$.
6635
6636 This algorithm essentially implements the pseudo-code in figure~\ref{fig:DR} except with a slight optimization. The division by $\beta^m$, multiplication by $k$
6637 and addition of $x \mbox{ mod }\beta^m$ are all performed simultaneously inside the loop on step 4. The division by $\beta^m$ is emulated by accessing
6638 the term at the $m+i$'th position which is subsequently multiplied by $k$ and added to the term at the $i$'th position. After the loop the $m$'th
digit is set to the carry and the upper digits are zeroed. Steps 5 and 6 emulate the reduction modulo $\beta^m$ that should have happened to
6640 $x$ before the addition of the multiple of the upper half.
6641
6642 At step 8 if $x$ is still larger than $n$ another pass of the algorithm is required. First $n$ is subtracted from $x$ and then the algorithm resumes
6643 at step 3.
6644
6645 \vspace{+3mm}\begin{small}
6646 \hspace{-5.1mm}{\bf File}: bn\_mp\_dr\_reduce.c
6647 \vspace{-3mm}
6648 \begin{alltt}
6649 016
6650 017 /* reduce "x" in place modulo "n" using the Diminished Radix algorithm.
6651 018 *
6652 019 * Based on algorithm from the paper
6653 020 *
6654 021 * "Generating Efficient Primes for Discrete Log Cryptosystems"
6655 022 * Chae Hoon Lim, Pil Loong Lee,
6656 023 * POSTECH Information Research Laboratories
6657 024 *
6658 025 * The modulus must be of a special format [see manual]
6659 026 *
6660 027 * Has been modified to use algorithm 7.10 from the LTM book instead
6661 028 *
6662 029 * Input x must be in the range 0 <= x <= (n-1)**2
6663 030 */
6664 031 int
6665 032 mp_dr_reduce (mp_int * x, mp_int * n, mp_digit k)
6666 033 \{
6667 034 int err, i, m;
6668 035 mp_word r;
6669 036 mp_digit mu, *tmpx1, *tmpx2;
6670 037
6671 038 /* m = digits in modulus */
6672 039 m = n->used;
6673 040
6674 041 /* ensure that "x" has at least 2m digits */
6675 042 if (x->alloc < m + m) \{
6676 043 if ((err = mp_grow (x, m + m)) != MP_OKAY) \{
6677 044 return err;
6678 045 \}
6679 046 \}
6680 047
6681 048 /* top of loop, this is where the code resumes if
6682 049 * another reduction pass is required.
6683 050 */
6684 051 top:
6685 052 /* aliases for digits */
6686 053 /* alias for lower half of x */
6687 054 tmpx1 = x->dp;
6688 055
6689 056 /* alias for upper half of x, or x/B**m */
6690 057 tmpx2 = x->dp + m;
6691 058
6692 059 /* set carry to zero */
6693 060 mu = 0;
6694 061
6695 062 /* compute (x mod B**m) + k * [x/B**m] inline and inplace */
6696 063 for (i = 0; i < m; i++) \{
6697 064 r = ((mp_word)*tmpx2++) * ((mp_word)k) + *tmpx1 + mu;
6698 065 *tmpx1++ = (mp_digit)(r & MP_MASK);
6699 066 mu = (mp_digit)(r >> ((mp_word)DIGIT_BIT));
6700 067 \}
6701 068
6702 069 /* set final carry */
6703 070 *tmpx1++ = mu;
6704 071
6705 072 /* zero words above m */
6706 073 for (i = m + 1; i < x->used; i++) \{
6707 074 *tmpx1++ = 0;
6708 075 \}
6709 076
6710 077 /* clamp, sub and return */
6711 078 mp_clamp (x);
6712 079
6713 080 /* if x >= n then subtract and reduce again
6714 081 * Each successive "recursion" makes the input smaller and smaller.
6715 082 */
6716 083 if (mp_cmp_mag (x, n) != MP_LT) \{
6717 084 s_mp_sub(x, n, x);
6718 085 goto top;
6719 086 \}
6720 087 return MP_OKAY;
6721 088 \}
6722 \end{alltt}
6723 \end{small}
6724
The first step is to grow $x$ as required to $2m$ digits since the reduction is performed in place on $x$. The label on line 51 is where the algorithm will resume if further reduction passes are required. In theory the label could be placed at the top of the function; however, the size of the modulus and the question of whether $x$ is large enough are invariant after the first pass, so repeating those steps would be a waste of time.
6728
6729 The aliases $tmpx1$ and $tmpx2$ refer to the digits of $x$ where the latter is offset by $m$ digits. By reading digits from $x$ offset by $m$ digits
6730 a division by $\beta^m$ can be simulated virtually for free. The loop on line 63 performs the bulk of the work (\textit{corresponds to step 4 of algorithm 7.11})
6731 in this algorithm.
6732
6733 By line 70 the pointer $tmpx1$ points to the $m$'th digit of $x$ which is where the final carry will be placed. Similarly by line 73 the
6734 same pointer will point to the $m+1$'th digit where the zeroes will be placed.
6735
6736 Since the algorithm is only valid if both $x$ and $n$ are greater than zero an unsigned comparison suffices to determine if another pass is required.
6737 With the same logic at line 84 the value of $x$ is known to be greater than or equal to $n$ meaning that an unsigned subtraction can be used
6738 as well. Since the destination of the subtraction is the larger of the inputs the call to algorithm s\_mp\_sub cannot fail and the return code
6739 does not need to be checked.
6740
6741 \subsubsection{Setup}
To setup the restricted Diminished Radix algorithm the value $k = \beta - n_0$ is required. This algorithm is not complicated but is provided for completeness.
6744
6745 \begin{figure}[!here]
6746 \begin{small}
6747 \begin{center}
6748 \begin{tabular}{l}
6749 \hline Algorithm \textbf{mp\_dr\_setup}. \\
6750 \textbf{Input}. mp\_int $n$ \\
6751 \textbf{Output}. $k = \beta - n_0$ \\
6752 \hline \\
6753 1. $k \leftarrow \beta - n_0$ \\
6754 \hline
6755 \end{tabular}
6756 \end{center}
6757 \end{small}
6758 \caption{Algorithm mp\_dr\_setup}
6759 \end{figure}
6760
6761 \vspace{+3mm}\begin{small}
6762 \hspace{-5.1mm}{\bf File}: bn\_mp\_dr\_setup.c
6763 \vspace{-3mm}
6764 \begin{alltt}
6765 016
6766 017 /* determines the setup value */
6767 018 void mp_dr_setup(mp_int *a, mp_digit *d)
6768 019 \{
6769 020 /* the casts are required if DIGIT_BIT is one less than
6770 021 * the number of bits in a mp_digit [e.g. DIGIT_BIT==31]
6771 022 */
6772 023 *d = (mp_digit)((((mp_word)1) << ((mp_word)DIGIT_BIT)) -
6773 024 ((mp_word)a->dp[0]));
6774 025 \}
6775 026
6776 \end{alltt}
6777 \end{small}
6778
6779 \subsubsection{Modulus Detection}
6780 Another algorithm which will be useful is the ability to detect a restricted Diminished Radix modulus. An integer is said to be
6781 of restricted Diminished Radix form if all of the digits are equal to $\beta - 1$ except the trailing digit which may be any value.
6782
6783 \begin{figure}[!here]
6784 \begin{small}
6785 \begin{center}
6786 \begin{tabular}{l}
6787 \hline Algorithm \textbf{mp\_dr\_is\_modulus}. \\
6788 \textbf{Input}. mp\_int $n$ \\
6789 \textbf{Output}. $1$ if $n$ is in D.R form, $0$ otherwise \\
6790 \hline
6791 1. If $n.used < 2$ then return($0$). \\
6792 2. for $ix$ from $1$ to $n.used - 1$ do \\
6793 \hspace{3mm}2.1 If $n_{ix} \ne \beta - 1$ return($0$). \\
6794 3. Return($1$). \\
6795 \hline
6796 \end{tabular}
6797 \end{center}
6798 \end{small}
6799 \caption{Algorithm mp\_dr\_is\_modulus}
6800 \end{figure}
6801
6802 \textbf{Algorithm mp\_dr\_is\_modulus.}
6803 This algorithm determines if a value is in Diminished Radix form. Step 1 rejects obvious cases where fewer than two digits are
6804 in the mp\_int. Step 2 tests all but the first digit to see if they are equal to $\beta - 1$. If the algorithm manages to get to
6805 step 3 then $n$ must be of Diminished Radix form.
6806
6807 \vspace{+3mm}\begin{small}
6808 \hspace{-5.1mm}{\bf File}: bn\_mp\_dr\_is\_modulus.c
6809 \vspace{-3mm}
6810 \begin{alltt}
6811 016
6812 017 /* determines if a number is a valid DR modulus */
6813 018 int mp_dr_is_modulus(mp_int *a)
6814 019 \{
6815 020 int ix;
6816 021
6817 022 /* must be at least two digits */
6818 023 if (a->used < 2) \{
6819 024 return 0;
6820 025 \}
6821 026
6822 027 /* must be of the form b**k - a [a <= b] so all
6823 028 * but the first digit must be equal to -1 (mod b).
6824 029 */
6825 030 for (ix = 1; ix < a->used; ix++) \{
6826 031 if (a->dp[ix] != MP_MASK) \{
6827 032 return 0;
6828 033 \}
6829 034 \}
6830 035 return 1;
6831 036 \}
6832 037
6833 \end{alltt}
6834 \end{small}
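
Tying the restricted Diminished Radix routines together, the following usage sketch (assuming tommath.h; the modulus is built as $\beta^8 - 5$ purely as an example and error checking is omitted) tests the modulus with mp\_dr\_is\_modulus, derives $k$ with mp\_dr\_setup and reduces a squared value with mp\_dr\_reduce.

\begin{small}
\begin{alltt}
#include <tommath.h>
#include <stdio.h>

int main(void)
\{
   mp_int x, n;
   mp_digit k;
   char buf[512];

   mp_init(&x);
   mp_init(&n);

   /* build a restricted DR modulus n = beta^8 - 5, error checks omitted */
   mp_2expt(&n, 8 * DIGIT_BIT);
   mp_sub_d(&n, 5, &n);

   if (mp_dr_is_modulus(&n) == 1) \{
      /* k = beta - n_0, here k == 5 */
      mp_dr_setup(&n, &k);

      /* some input in the range 0 <= x < n^2 */
      mp_read_radix(&x, "123456789012", 10);
      mp_sqr(&x, &x);

      /* x <- x mod n */
      mp_dr_reduce(&x, &n, k);

      mp_toradix(&x, buf, 10);
      printf("x mod n == %s", buf);
   \}

   mp_clear(&x);
   mp_clear(&n);
   return 0;
\}
\end{alltt}
\end{small}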
6835
6836 \subsection{Unrestricted Diminished Radix Reduction}
6837 The unrestricted Diminished Radix algorithm allows modular reductions to be performed when the modulus is of the form $2^p - k$. This algorithm
6838 is a straightforward adaptation of algorithm~\ref{fig:DR}.
6839
6840 In general the restricted Diminished Radix reduction algorithm is much faster since it has considerably lower overhead. However, this new
6841 algorithm is much faster than either Montgomery or Barrett reduction when the moduli are of the appropriate form.
6842
6843 \begin{figure}[!here]
6844 \begin{small}
6845 \begin{center}
6846 \begin{tabular}{l}
6847 \hline Algorithm \textbf{mp\_reduce\_2k}. \\
6848 \textbf{Input}. mp\_int $a$ and $n$. mp\_digit $k$ \\
6849 \hspace{11.5mm}($a \ge 0$, $n > 1$, $0 < k < \beta$, $n + k$ is a power of two) \\
6850 \textbf{Output}. $a \mbox{ (mod }n\mbox{)}$ \\
6851 \hline
6852 1. $p \leftarrow \lceil lg(n) \rceil$ (\textit{mp\_count\_bits}) \\
6853 2. While $a \ge n$ do \\
6854 \hspace{3mm}2.1 $q \leftarrow \lfloor a / 2^p \rfloor$ (\textit{mp\_div\_2d}) \\
6855 \hspace{3mm}2.2 $a \leftarrow a \mbox{ (mod }2^p\mbox{)}$ (\textit{mp\_mod\_2d}) \\
6856 \hspace{3mm}2.3 $q \leftarrow q \cdot k$ (\textit{mp\_mul\_d}) \\
6857 \hspace{3mm}2.4 $a \leftarrow a + q$ (\textit{s\_mp\_add}) \\
6858 \hspace{3mm}2.5 If $a \ge n$ then do \\
6859 \hspace{6mm}2.5.1 $a \leftarrow a - n$ \\
6860 3. Return(\textit{MP\_OKAY}). \\
6861 \hline
6862 \end{tabular}
6863 \end{center}
6864 \end{small}
6865 \caption{Algorithm mp\_reduce\_2k}
6866 \end{figure}
6867
6868 \textbf{Algorithm mp\_reduce\_2k.}
6869 This algorithm quickly reduces an input $a$ modulo an unrestricted Diminished Radix modulus $n$. Division by $2^p$ is emulated with a right
6870 shift which makes the algorithm fairly inexpensive to use.
6871
6872 \vspace{+3mm}\begin{small}
6873 \hspace{-5.1mm}{\bf File}: bn\_mp\_reduce\_2k.c
6874 \vspace{-3mm}
6875 \begin{alltt}
6876 016
6877 017 /* reduces a modulo n where n is of the form 2**p - d */
6878 018 int
6879 019 mp_reduce_2k(mp_int *a, mp_int *n, mp_digit d)
6880 020 \{
6881 021 mp_int q;
6882 022 int p, res;
6883 023
6884 024 if ((res = mp_init(&q)) != MP_OKAY) \{
6885 025 return res;
6886 026 \}
6887 027
6888 028 p = mp_count_bits(n);
6889 029 top:
6890 030 /* q = a/2**p, a = a mod 2**p */
6891 031 if ((res = mp_div_2d(a, p, &q, a)) != MP_OKAY) \{
6892 032 goto ERR;
6893 033 \}
6894 034
6895 035 if (d != 1) \{
6896 036 /* q = q * d */
6897 037 if ((res = mp_mul_d(&q, d, &q)) != MP_OKAY) \{
6898 038 goto ERR;
6899 039 \}
6900 040 \}
6901 041
6902 042 /* a = a + q */
6903 043 if ((res = s_mp_add(a, &q, a)) != MP_OKAY) \{
6904 044 goto ERR;
6905 045 \}
6906 046
6907 047 if (mp_cmp_mag(a, n) != MP_LT) \{
6908 048 s_mp_sub(a, n, a);
6909 049 goto top;
6910 050 \}
6911 051
6912 052 ERR:
6913 053 mp_clear(&q);
6914 054 return res;
6915 055 \}
6916 056
6917 \end{alltt}
6918 \end{small}
6919
6920 The algorithm mp\_count\_bits calculates the number of bits in an mp\_int which is used to find the initial value of $p$. The call to mp\_div\_2d
6921 on line 31 calculates both the quotient $q$ and the remainder $a$ required. By doing both in a single function call the code size
6922 is kept fairly small. The multiplication by $k$ is only performed if $k > 1$. This allows reductions modulo $2^p - 1$ to be performed without
6923 any multiplications.
6924
6925 The unsigned s\_mp\_add, mp\_cmp\_mag and s\_mp\_sub are used in place of their full sign counterparts since the inputs are only valid if they are
6926 positive. By using the unsigned versions the overhead is kept to a minimum.
6927
6928 \subsubsection{Unrestricted Setup}
6929 To set up this reduction algorithm the value of $k = 2^p - n$ is required.
6930
6931 \begin{figure}[!here]
6932 \begin{small}
6933 \begin{center}
6934 \begin{tabular}{l}
6935 \hline Algorithm \textbf{mp\_reduce\_2k\_setup}. \\
6936 \textbf{Input}. mp\_int $n$ \\
6937 \textbf{Output}. $k = 2^p - n$ \\
6938 \hline
6939 1. $p \leftarrow \lceil lg(n) \rceil$ (\textit{mp\_count\_bits}) \\
6940 2. $x \leftarrow 2^p$ (\textit{mp\_2expt}) \\
6941 3. $x \leftarrow x - n$ (\textit{mp\_sub}) \\
6942 4. $k \leftarrow x_0$ \\
6943 5. Return(\textit{MP\_OKAY}). \\
6944 \hline
6945 \end{tabular}
6946 \end{center}
6947 \end{small}
6948 \caption{Algorithm mp\_reduce\_2k\_setup}
6949 \end{figure}
6950
6951 \textbf{Algorithm mp\_reduce\_2k\_setup.}
6952 This algorithm computes the value of $k$ required for the algorithm mp\_reduce\_2k. By making a temporary variable $x$ equal to $2^p$ a subtraction
6953 is sufficient to solve for $k$. Alternatively if $n$ has more than one digit the value of $k$ is simply $\beta - n_0$.
6954
6955 \vspace{+3mm}\begin{small}
6956 \hspace{-5.1mm}{\bf File}: bn\_mp\_reduce\_2k\_setup.c
6957 \vspace{-3mm}
6958 \begin{alltt}
6959 016
6960 017 /* determines the setup value */
6961 018 int
6962 019 mp_reduce_2k_setup(mp_int *a, mp_digit *d)
6963 020 \{
6964 021 int res, p;
6965 022 mp_int tmp;
6966 023
6967 024 if ((res = mp_init(&tmp)) != MP_OKAY) \{
6968 025 return res;
6969 026 \}
6970 027
6971 028 p = mp_count_bits(a);
6972 029 if ((res = mp_2expt(&tmp, p)) != MP_OKAY) \{
6973 030 mp_clear(&tmp);
6974 031 return res;
6975 032 \}
6976 033
6977 034 if ((res = s_mp_sub(&tmp, a, &tmp)) != MP_OKAY) \{
6978 035 mp_clear(&tmp);
6979 036 return res;
6980 037 \}
6981 038
6982 039 *d = tmp.dp[0];
6983 040 mp_clear(&tmp);
6984 041 return MP_OKAY;
6985 042 \}
6986 \end{alltt}
6987 \end{small}
6988
6989 \subsubsection{Unrestricted Detection}
6990 An integer $n$ is a valid unrestricted Diminished Radix modulus if either of the following is true.
6991
6992 \begin{enumerate}
6993 \item The number has only one digit.
6994 \item The number has more than one digit and every bit from the $lg(\beta)$'th to the most significant is one.
6995 \end{enumerate}
6996
6997 If either condition is true then there is a power of two $2^p$ such that $0 < 2^p - n < \beta$. If the input is only
6998 one digit then it will always be of the correct form. Otherwise all of the bits above the first digit must be one. This arises from the fact
6999 that there will be a value of $k$ that when added to the modulus causes a carry in the first digit which propagates all the way to the most
7000 significant bit. The resulting sum will be a power of two.
7001
7002 \begin{figure}[!here]
7003 \begin{small}
7004 \begin{center}
7005 \begin{tabular}{l}
7006 \hline Algorithm \textbf{mp\_reduce\_is\_2k}. \\
7007 \textbf{Input}. mp\_int $n$ \\
7008 \textbf{Output}. $1$ if of proper form, $0$ otherwise \\
7009 \hline
7010 1. If $n.used = 0$ then return($0$). \\
7011 2. If $n.used = 1$ then return($1$). \\
7012 3. $p \leftarrow \lceil lg(n) \rceil$ (\textit{mp\_count\_bits}) \\
7013 4. for $x$ from $lg(\beta)$ to $p$ do \\
7014 \hspace{3mm}4.1 If the ($x \mbox{ mod }lg(\beta)$)'th bit of the $\lfloor x / lg(\beta) \rfloor$'th digit of $n$ is zero then return($0$). \\
7015 5. Return($1$). \\
7016 \hline
7017 \end{tabular}
7018 \end{center}
7019 \end{small}
7020 \caption{Algorithm mp\_reduce\_is\_2k}
7021 \end{figure}
7022
7023 \textbf{Algorithm mp\_reduce\_is\_2k.}
7024 This algorithm quickly determines if a modulus is of the form required for algorithm mp\_reduce\_2k to function properly.
7025
7026 \vspace{+3mm}\begin{small}
7027 \hspace{-5.1mm}{\bf File}: bn\_mp\_reduce\_is\_2k.c
7028 \vspace{-3mm}
7029 \begin{alltt}
7030 016
7031 017 /* determines if mp_reduce_2k can be used */
7032 018 int mp_reduce_is_2k(mp_int *a)
7033 019 \{
7034 020 int ix, iy, iz, iw;
7035 021
7036 022 if (a->used == 0) \{
7037 023 return 0;
7038 024 \} else if (a->used == 1) \{
7039 025 return 1;
7040 026 \} else if (a->used > 1) \{
7041 027 iy = mp_count_bits(a);
7042 028 iz = 1;
7043 029 iw = 1;
7044 030
7045 031 /* Test every bit from the second digit up, must be 1 */
7046 032 for (ix = DIGIT_BIT; ix < iy; ix++) \{
7047 033 if ((a->dp[iw] & iz) == 0) \{
7048 034 return 0;
7049 035 \}
7050 036 iz <<= 1;
7051 037 if (iz > (int)MP_MASK) \{
7052 038 ++iw;
7053 039 iz = 1;
7054 040 \}
7055 041 \}
7056 042 \}
7057 043 return 1;
7058 044 \}
7059 045
7060 \end{alltt}
7061 \end{small}
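
Putting the three unrestricted routines together, a caller first tests the modulus, then computes $d$ and finally reduces. The helper below is an illustrative sketch only with most error handling trimmed.

\begin{small}
\begin{alltt}
#include <tommath.h>

/* reduce a modulo n in place when n has the form 2**p - d
 * (illustrative sketch, assumes 0 <= a < n**2)            */
static int reduce_if_2k(mp_int *a, mp_int *n)
\{
   mp_digit d;
   int      res;

   if (mp_reduce_is_2k(n) == 0) \{
      return MP_VAL;                    /* not a valid 2**p - d modulus */
   \}
   if ((res = mp_reduce_2k_setup(n, &d)) != MP_OKAY) \{
      return res;
   \}
   return mp_reduce_2k(a, n, d);
\}
\end{alltt}
\end{small}
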
7062
7063
7064
7065 \section{Algorithm Comparison}
7066 So far three very different algorithms for modular reduction have been discussed. Each of the algorithms has its own strengths and weaknesses,
7067 which makes having such a selection very useful. The following table summarizes the three algorithms along with comparisons of work factors. Since
7068 all three algorithms have the restriction that $0 \le x < n^2$ and $n > 1$ those limitations are not included in the table.
7069
7070 \begin{center}
7071 \begin{small}
7072 \begin{tabular}{|c|c|c|c|c|c|}
7073 \hline \textbf{Method} & \textbf{Work Required} & \textbf{Limitations} & \textbf{$m = 8$} & \textbf{$m = 32$} & \textbf{$m = 64$} \\
7074 \hline Barrett & $m^2 + 2m - 1$ & None & $79$ & $1087$ & $4223$ \\
7075 \hline Montgomery & $m^2 + m$ & $n$ must be odd & $72$ & $1056$ & $4160$ \\
7076 \hline D.R. & $2m$ & $n = \beta^m - k$ & $16$ & $64$ & $128$ \\
7077 \hline
7078 \end{tabular}
7079 \end{small}
7080 \end{center}
7081
7082 In theory Montgomery and Barrett reductions would require roughly the same amount of time to complete. However, in practice since Montgomery
7083 reduction can be written as a single function with the Comba technique it is much faster. Barrett reduction suffers from the overhead of
7084 calling the half precision multipliers, addition and division by $\beta$ algorithms.
7085
7086 For almost every cryptographic algorithm Montgomery reduction is the algorithm of choice. The one set of algorithms where Diminished Radix reduction truly
7087 shines are those based on the discrete logarithm problem such as Diffie-Hellman \cite{DH} and ElGamal \cite{ELGAMAL}. In these algorithms
7088 primes of the form $\beta^m - k$ can be found and shared amongst users. These primes will allow the Diminished Radix algorithm to be used in
7089 modular exponentiation to greatly speed up the operation.
7090
7091
7092
7093 \section*{Exercises}
7094 \begin{tabular}{cl}
7095 $\left [ 3 \right ]$ & Prove that the ``trick'' in algorithm mp\_montgomery\_setup actually \\
7096 & calculates the correct value of $\rho$. \\
7097 & \\
7098 $\left [ 2 \right ]$ & Devise an algorithm to reduce modulo $n + k$ for small $k$ quickly. \\
7099 & \\
7100 $\left [ 4 \right ]$ & Prove that the pseudo-code algorithm ``Diminished Radix Reduction'' \\
7101 & (\textit{figure~\ref{fig:DR}}) terminates. Also prove the probability that it will \\
7102 & terminate within $1 \le k \le 10$ iterations. \\
7103 & \\
7104 \end{tabular}
7105
7106
7107 \chapter{Exponentiation}
7108 Exponentiation is the operation of raising one variable to the power of another, for example, $a^b$. A variant of exponentiation, computed
7109 in a finite field or ring, is called modular exponentiation. This latter style of operation is typically used in public key
7110 cryptosystems such as RSA and Diffie-Hellman. The ability to quickly compute modular exponentiations is of great benefit to any
7111 such cryptosystem and many methods have been sought to speed it up.
7112
7113 \section{Exponentiation Basics}
7114 A trivial algorithm would simply multiply $a$ against itself $b - 1$ times to compute the exponentiation desired. However, as $b$ grows in size
7115 the number of multiplications becomes prohibitive. Imagine what would happen if $b$ $\approx$ $2^{1024}$ as is the case when computing an RSA signature
7116 with a $1024$-bit key. Such a calculation could never be completed as it would simply take far too long.
7117
7118 Fortunately there is a very simple algorithm based on the laws of exponents. Recall that $lg_a(a^b) = b$ and that $lg_a(a^ba^c) = b + c$ which
7119 are two trivial relationships between the base and the exponent. Let $b_i$ represent the $i$'th bit of $b$ starting from the least
7120 significant bit. If $b$ is a $k$-bit integer then the following equation is true.
7121
7122 \begin{equation}
7123 a^b = \prod_{i=0}^{k-1} a^{2^i \cdot b_i}
7124 \end{equation}
7125
7126 By taking the base $a$ logarithm of both sides of the equation, the following equation results.
7127
7128 \begin{equation}
7129 b = \sum_{i=0}^{k-1}2^i \cdot b_i
7130 \end{equation}
7131
7132 The term $a^{2^i}$ can be found from the $i - 1$'th term by squaring the term since $\left ( a^{2^i} \right )^2$ is equal to
7133 $a^{2^{i+1}}$. This observation forms the basis of essentially all fast exponentiation algorithms. It requires $k$ squarings and on average
7134 $k \over 2$ multiplications to compute the result. This is indeed quite an improvement over simply multiplying by $a$ a total of $b-1$ times.
7135
7136 While this current method is a considerable speed up there are further improvements to be made. For example, the $a^{2^i}$ term does not need to
7137 be computed in an auxiliary variable. Consider the following equivalent algorithm.
7138
7139 \begin{figure}[!here]
7140 \begin{small}
7141 \begin{center}
7142 \begin{tabular}{l}
7143 \hline Algorithm \textbf{Left to Right Exponentiation}. \\
7144 \textbf{Input}. Integer $a$, $b$ and $k$ \\
7145 \textbf{Output}. $c = a^b$ \\
7146 \hline \\
7147 1. $c \leftarrow 1$ \\
7148 2. for $i$ from $k - 1$ to $0$ do \\
7149 \hspace{3mm}2.1 $c \leftarrow c^2$ \\
7150 \hspace{3mm}2.2 $c \leftarrow c \cdot a^{b_i}$ \\
7151 3. Return $c$. \\
7152 \hline
7153 \end{tabular}
7154 \end{center}
7155 \end{small}
7156 \caption{Left to Right Exponentiation}
7157 \label{fig:LTOR}
7158 \end{figure}
7159
7160 This algorithm starts from the most significant bit and works towards the least significant bit. When the $i$'th bit of $b$ is set $a$ is
7161 multiplied against the current product. In each iteration the product is squared which doubles the exponent of the individual terms of the
7162 product.
7163
7164 For example, let $b = 101100_2 \equiv 44_{10}$. The following chart demonstrates the actions of the algorithm.
7165
7166 \newpage\begin{figure}
7167 \begin{center}
7168 \begin{tabular}{|c|c|}
7169 \hline \textbf{Value of $i$} & \textbf{Value of $c$} \\
7170 \hline - & $1$ \\
7171 \hline $5$ & $a$ \\
7172 \hline $4$ & $a^2$ \\
7173 \hline $3$ & $a^4 \cdot a$ \\
7174 \hline $2$ & $a^8 \cdot a^2 \cdot a$ \\
7175 \hline $1$ & $a^{16} \cdot a^4 \cdot a^2$ \\
7176 \hline $0$ & $a^{32} \cdot a^8 \cdot a^4$ \\
7177 \hline
7178 \end{tabular}
7179 \end{center}
7180 \caption{Example of Left to Right Exponentiation}
7181 \end{figure}
7182
7183 When the product $a^{32} \cdot a^8 \cdot a^4$ is simplified it is equal to $a^{44}$ which is the desired exponentiation. This particular algorithm is
7184 called ``Left to Right'' because it reads the exponent in that order. All of the exponentiation algorithms that will be presented are of this nature.
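
The left to right method translates directly into a few lines of C for ordinary machine words. The helper below is purely illustrative; it is not part of LibTomMath, assumes a $32$-bit exponent and simply ignores overflow.

\begin{small}
\begin{alltt}
/* left to right binary exponentiation for ordinary words
 * (illustrative only, overflow is ignored)               */
static unsigned long expt_ltor(unsigned long a, unsigned long b)
\{
   unsigned long c = 1;
   int           i;

   for (i = 31; i >= 0; i--) \{
      c = c * c;                 /* invariant squaring       */
      if ((b >> i) & 1) \{
         c = c * a;              /* multiply when bit is set */
      \}
   \}
   return c;
\}
\end{alltt}
\end{small}
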
7185
7186 \subsection{Single Digit Exponentiation}
7187 The first algorithm in the series of exponentiation algorithms will be an unbounded algorithm where the exponent is a single digit. It is intended
7188 to be used when a small power of an input is required (\textit{e.g. $a^5$}). It is faster than simply multiplying $b - 1$ times for all values of
7189 $b$ that are greater than three.
7190
7191 \newpage\begin{figure}[!here]
7192 \begin{small}
7193 \begin{center}
7194 \begin{tabular}{l}
7195 \hline Algorithm \textbf{mp\_expt\_d}. \\
7196 \textbf{Input}. mp\_int $a$ and mp\_digit $b$ \\
7197 \textbf{Output}. $c = a^b$ \\
7198 \hline \\
7199 1. $g \leftarrow a$ (\textit{mp\_init\_copy}) \\
7200 2. $c \leftarrow 1$ (\textit{mp\_set}) \\
7201 3. for $x$ from 1 to $lg(\beta)$ do \\
7202 \hspace{3mm}3.1 $c \leftarrow c^2$ (\textit{mp\_sqr}) \\
7203 \hspace{3mm}3.2 If $b$ AND $2^{lg(\beta) - 1} \ne 0$ then \\
7204 \hspace{6mm}3.2.1 $c \leftarrow c \cdot g$ (\textit{mp\_mul}) \\
7205 \hspace{3mm}3.3 $b \leftarrow b << 1$ \\
7206 4. Clear $g$. \\
7207 5. Return(\textit{MP\_OKAY}). \\
7208 \hline
7209 \end{tabular}
7210 \end{center}
7211 \end{small}
7212 \caption{Algorithm mp\_expt\_d}
7213 \end{figure}
7214
7215 \textbf{Algorithm mp\_expt\_d.}
7216 This algorithm computes the value of $a$ raised to the power of a single digit $b$. It uses the left to right exponentiation algorithm to
7217 quickly compute the exponentiation. It is loosely based on algorithm 14.79 of HAC \cite[pp. 615]{HAC} with the difference that the
7218 exponent is a fixed width.
7219
7220 A copy of $a$ is made first to allow the destination variable $c$ to be the same as the source variable $a$. The result is set to the initial value of
7221 $1$ in the subsequent step.
7222
7223 Inside the loop the exponent is read from the most significant bit first down to the least significant bit. First $c$ is invariably squared
7224 on step 3.1. In the following step if the most significant bit of $b$ is one the copy of $a$ is multiplied against $c$. The value
7225 of $b$ is shifted left one bit to make the next bit down from the most significant bit the new most significant bit. In effect each
7226 iteration of the loop moves the bits of the exponent $b$ upwards to the most significant location.
7227
7228 \vspace{+3mm}\begin{small}
7229 \hspace{-5.1mm}{\bf File}: bn\_mp\_expt\_d.c
7230 \vspace{-3mm}
7231 \begin{alltt}
7232 016
7233 017 /* calculate c = a**b using a square-multiply algorithm */
7234 018 int mp_expt_d (mp_int * a, mp_digit b, mp_int * c)
7235 019 \{
7236 020 int res, x;
7237 021 mp_int g;
7238 022
7239 023 if ((res = mp_init_copy (&g, a)) != MP_OKAY) \{
7240 024 return res;
7241 025 \}
7242 026
7243 027 /* set initial result */
7244 028 mp_set (c, 1);
7245 029
7246 030 for (x = 0; x < (int) DIGIT_BIT; x++) \{
7247 031 /* square */
7248 032 if ((res = mp_sqr (c, c)) != MP_OKAY) \{
7249 033 mp_clear (&g);
7250 034 return res;
7251 035 \}
7252 036
7253 037 /* if the bit is set multiply */
7254 038 if ((b & (mp_digit) (((mp_digit)1) << (DIGIT_BIT - 1))) != 0) \{
7255 039 if ((res = mp_mul (c, &g, c)) != MP_OKAY) \{
7256 040 mp_clear (&g);
7257 041 return res;
7258 042 \}
7259 043 \}
7260 044
7261 045 /* shift to next bit */
7262 046 b <<= 1;
7263 047 \}
7264 048
7265 049 mp_clear (&g);
7266 050 return MP_OKAY;
7267 051 \}
7268 \end{alltt}
7269 \end{small}
7270
7271 Line 28 sets the initial value of the result to $1$. Next the loop on line 30 steps through each bit of the exponent starting from
7272 the most significant down towards the least significant. The invariant squaring operation placed on line 32 is performed first. After
7273 the squaring the result $c$ is multiplied by the base $g$ if and only if the most significant bit of the exponent is set. The shift on line
7274 46 moves all of the bits of the exponent upwards towards the most significant location.
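
A brief usage sketch of mp\_expt\_d follows (error handling trimmed); it computes $7^5 = 16807$.

\begin{small}
\begin{alltt}
#include <tommath.h>

/* usage sketch: compute c = 7**5 = 16807 (error handling trimmed) */
static int demo_expt(mp_int *c)
\{
   mp_int a;
   int    res;

   if ((res = mp_init(&a)) != MP_OKAY) \{
      return res;
   \}
   mp_set(&a, 7);                 /* a = 7    */
   res = mp_expt_d(&a, 5, c);     /* c = a**5 */
   mp_clear(&a);
   return res;
\}
\end{alltt}
\end{small}
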
7275
7276 \section{$k$-ary Exponentiation}
7277 When calculating an exponentiation the most time consuming bottleneck is the multiplications which are in general a small factor
7278 slower than squaring. Recall from the previous algorithm that $b_{i}$ refers to the $i$'th bit of the exponent $b$. Suppose instead it referred to
7279 the $i$'th $k$-bit digit of the exponent $b$. For $k = 1$ the definitions are synonymous and for $k > 1$ algorithm~\ref{fig:KARY}
7280 computes the same exponentiation. A group of $k$ bits from the exponent is called a \textit{window}. That is, it is a small window onto only a
7281 portion of the entire exponent. Consider the following modification to the basic left to right exponentiation algorithm.
7282
7283 \begin{figure}[!here]
7284 \begin{small}
7285 \begin{center}
7286 \begin{tabular}{l}
7287 \hline Algorithm \textbf{$k$-ary Exponentiation}. \\
7288 \textbf{Input}. Integer $a$, $b$, $k$ and $t$ \\
7289 \textbf{Output}. $c = a^b$ \\
7290 \hline \\
7291 1. $c \leftarrow 1$ \\
7292 2. for $i$ from $t - 1$ to $0$ do \\
7293 \hspace{3mm}2.1 $c \leftarrow c^{2^k} $ \\
7294 \hspace{3mm}2.2 Extract the $i$'th $k$-bit word from $b$ and store it in $g$. \\
7295 \hspace{3mm}2.3 $c \leftarrow c \cdot a^g$ \\
7296 3. Return $c$. \\
7297 \hline
7298 \end{tabular}
7299 \end{center}
7300 \end{small}
7301 \caption{$k$-ary Exponentiation}
7302 \label{fig:KARY}
7303 \end{figure}
7304
7305 The squaring on step 2.1 can be calculated by squaring the value $c$ successively $k$ times. If the values of $a^g$ for $0 < g < 2^k$ have been
7306 precomputed this algorithm requires only $t$ multiplications and $tk$ squarings. The table can be generated with $2^{k - 1} - 1$ squarings and
7307 $2^{k - 1} + 1$ multiplications. This algorithm assumes that the number of bits in the exponent is evenly divisible by $k$.
7308 However, when it is not the remaining $0 < x \le k - 1$ bits can be handled with algorithm~\ref{fig:LTOR}.
7309
7310 Suppose $k = 4$ and $t = 100$. This modified algorithm will require $109$ multiplications and $408$ squarings to compute the exponentiation. The
7311 original algorithm would on average have required $200$ multiplications and $400$ squarings to compute the same value. The total number of squarings
7312 has increased slightly but the number of multiplications has nearly halved.
7313
7314 \subsection{Optimal Values of $k$}
7315 An optimal value of $k$ will minimize $2^{k} + \lfloor n / k \rfloor + n - 1$ for a fixed number of bits in the exponent $n$. The simplest
7316 approach is to brute force search amongst the values $k = 2, 3, \ldots, 8$ for the lowest result. Table~\ref{fig:OPTK} lists optimal values of $k$
7317 for various exponent sizes and compares the number of multiplications and squarings required against algorithm~\ref{fig:LTOR}.
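
The search itself amounts to a few lines of C. The helper below is illustrative only; it evaluates the work estimate for each candidate $k$ and keeps the smallest.

\begin{small}
\begin{alltt}
/* brute force the k which minimizes 2**k + floor(n/k) + n - 1
 * for an n-bit exponent (illustrative helper only)            */
static int optimal_k(int n)
\{
   int k, work, best = -1, best_k = 2;

   for (k = 2; k <= 8; k++) \{
      work = (1 << k) + (n / k) + n - 1;
      if (best == -1 || work < best) \{
         best   = work;
         best_k = k;
      \}
   \}
   return best_k;
\}
\end{alltt}
\end{small}

For example, optimal\_k(64) selects $k = 3$ with a work estimate of $92$, matching table~\ref{fig:OPTK}.
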
7318
7319 \begin{figure}[here]
7320 \begin{center}
7321 \begin{small}
7322 \begin{tabular}{|c|c|c|c|c|c|}
7323 \hline \textbf{Exponent (bits)} & \textbf{Optimal $k$} & \textbf{Work at $k$} & \textbf{Work with ~\ref{fig:LTOR}} \\
7324 \hline $16$ & $2$ & $27$ & $24$ \\
7325 \hline $32$ & $3$ & $49$ & $48$ \\
7326 \hline $64$ & $3$ & $92$ & $96$ \\
7327 \hline $128$ & $4$ & $175$ & $192$ \\
7328 \hline $256$ & $4$ & $335$ & $384$ \\
7329 \hline $512$ & $5$ & $645$ & $768$ \\
7330 \hline $1024$ & $6$ & $1257$ & $1536$ \\
7331 \hline $2048$ & $6$ & $2452$ & $3072$ \\
7332 \hline $4096$ & $7$ & $4808$ & $6144$ \\
7333 \hline
7334 \end{tabular}
7335 \end{small}
7336 \end{center}
7337 \caption{Optimal Values of $k$ for $k$-ary Exponentiation}
7338 \label{fig:OPTK}
7339 \end{figure}
7340
7341 \subsection{Sliding-Window Exponentiation}
7342 A simple modification to the previous algorithm is to generate only the upper half of the table in the range $2^{k-1} \le g < 2^k$. Essentially
7343 this is a table for all values of $g$ where the most significant bit of $g$ is a one. However, in order for this to be allowed in the
7344 algorithm values of $g$ in the range $0 \le g < 2^{k-1}$ must be avoided.
7345
7346 Table~\ref{fig:OPTK2} lists optimal values of $k$ for various exponent sizes and compares the work required against algorithm~\ref{fig:KARY}.
7347
7348 \begin{figure}[here]
7349 \begin{center}
7350 \begin{small}
7351 \begin{tabular}{|c|c|c|c|c|c|}
7352 \hline \textbf{Exponent (bits)} & \textbf{Optimal $k$} & \textbf{Work at $k$} & \textbf{Work with ~\ref{fig:KARY}} \\
7353 \hline $16$ & $3$ & $24$ & $27$ \\
7354 \hline $32$ & $3$ & $45$ & $49$ \\
7355 \hline $64$ & $4$ & $87$ & $92$ \\
7356 \hline $128$ & $4$ & $167$ & $175$ \\
7357 \hline $256$ & $5$ & $322$ & $335$ \\
7358 \hline $512$ & $6$ & $628$ & $645$ \\
7359 \hline $1024$ & $6$ & $1225$ & $1257$ \\
7360 \hline $2048$ & $7$ & $2403$ & $2452$ \\
7361 \hline $4096$ & $8$ & $4735$ & $4808$ \\
7362 \hline
7363 \end{tabular}
7364 \end{small}
7365 \end{center}
7366 \caption{Optimal Values of $k$ for Sliding Window Exponentiation}
7367 \label{fig:OPTK2}
7368 \end{figure}
7369
7370 \newpage\begin{figure}[!here]
7371 \begin{small}
7372 \begin{center}
7373 \begin{tabular}{l}
7374 \hline Algorithm \textbf{Sliding Window $k$-ary Exponentiation}. \\
7375 \textbf{Input}. Integer $a$, $b$, $k$ and $t$ \\
7376 \textbf{Output}. $c = a^b$ \\
7377 \hline \\
7378 1. $c \leftarrow 1$ \\
7379 2. for $i$ from $t - 1$ to $0$ do \\
7380 \hspace{3mm}2.1 If the $i$'th bit of $b$ is a zero then \\
7381 \hspace{6mm}2.1.1 $c \leftarrow c^2$ \\
7382 \hspace{3mm}2.2 else do \\
7383 \hspace{6mm}2.2.1 $c \leftarrow c^{2^k}$ \\
7384 \hspace{6mm}2.2.2 Extract the $k$ bits from $(b_{i}b_{i-1}\ldots b_{i-(k-1)})$ and store it in $g$. \\
7385 \hspace{6mm}2.2.3 $c \leftarrow c \cdot a^g$ \\
7386 \hspace{6mm}2.2.4 $i \leftarrow i - k$ \\
7387 3. Return $c$. \\
7388 \hline
7389 \end{tabular}
7390 \end{center}
7391 \end{small}
7392 \caption{Sliding Window $k$-ary Exponentiation}
7393 \end{figure}
7394
7395 Similar to the previous algorithm this algorithm must have a special handler when fewer than $k$ bits are left in the exponent. While this
7396 algorithm requires the same number of squarings it can potentially have fewer multiplications. The pre-computed table $a^g$ is also half
7397 the size of the previous table.
7398
7399 Consider the exponent $b = 111101011001000_2 \equiv 31432_{10}$ with $k = 3$ using both algorithms. The first algorithm will divide the exponent up as
7400 the following five $3$-bit words $b \equiv \left ( 111, 101, 011, 001, 000 \right )_{2}$. The second algorithm will break the
7401 exponent as $b \equiv \left ( 111, 101, 0, 110, 0, 100, 0 \right )_{2}$. The single-digit $0$ entries in the second representation are where
7402 a single squaring took place instead of a squaring and multiplication. In total the first method requires $10$ multiplications and $18$
7403 squarings. The second method requires $8$ multiplications and $18$ squarings.
7404
7405 In general the sliding window method is never slower than the generic $k$-ary method and often it is slightly faster.
7406
7407 \section{Modular Exponentiation}
7408
7409 Modular exponentiation is essentially computing the power of a base within a finite field or ring. For example, computing
7410 $d \equiv a^b \mbox{ (mod }c\mbox{)}$ is a modular exponentiation. Instead of first computing $a^b$ and then reducing it
7411 modulo $c$ the intermediate result is reduced modulo $c$ after every squaring or multiplication operation.
7412
7413 This guarantees that any intermediate result is bounded by $0 \le d \le c^2 - 2c + 1$ and can be reduced modulo $c$ quickly using
7414 one of the algorithms presented in chapter six.
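
The idea of reducing after every squaring and multiplication can be sketched with the single digit types of the library. The helper below is not a LibTomMath routine; it is an illustrative sketch in which every intermediate product fits in an mp\_word because both operands are first reduced below the modulus.

\begin{small}
\begin{alltt}
#include <tommath.h>

/* d = a**b (mod c) over single digits, reducing after every
 * squaring and multiplication (illustrative only)           */
static mp_digit exptmod_digit(mp_digit a, mp_digit b, mp_digit c)
\{
   mp_word d = 1;
   int     i;

   a = a % c;
   for (i = DIGIT_BIT - 1; i >= 0; i--) \{
      d = (d * d) % c;                  /* square then reduce   */
      if ((b >> i) & 1) \{
         d = (d * (mp_word)a) % c;      /* multiply then reduce */
      \}
   \}
   return (mp_digit)d;
\}
\end{alltt}
\end{small}
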
7415
7416 Before the actual modular exponentiation algorithm can be written a wrapper algorithm must be written first. This algorithm
7417 will allow the exponent $b$ to be negative which is computed as $d \equiv \left (1 / a \right )^{\vert b \vert} \mbox{ (mod }c\mbox{)}$. The
7418 value of $(1/a) \mbox{ mod }c$ is computed using the modular inverse (\textit{see \ref{sec;modinv}}). If no inverse exists the algorithm
7419 terminates with an error.
7420
7421 \begin{figure}[!here]
7422 \begin{small}
7423 \begin{center}
7424 \begin{tabular}{l}
7425 \hline Algorithm \textbf{mp\_exptmod}. \\
7426 \textbf{Input}. mp\_int $g$, $x$ and $p$ \\
7427 \textbf{Output}. $y \equiv g^x \mbox{ (mod }p\mbox{)}$ \\
7428 \hline \\
7429 1. If $p.sign = MP\_NEG$ return(\textit{MP\_VAL}). \\
7430 2. If $x.sign = MP\_NEG$ then \\
7431 \hspace{3mm}2.1 $g' \leftarrow g^{-1} \mbox{ (mod }p\mbox{)}$ \\
7432 \hspace{3mm}2.2 $x' \leftarrow \vert x \vert$ \\
7433 \hspace{3mm}2.3 Compute $y \equiv g'^{x'} \mbox{ (mod }p\mbox{)}$ via recursion. \\
7434 3. if $p$ is odd \textbf{OR} $p$ is a D.R. modulus then \\
7435 \hspace{3mm}3.1 Compute $y \equiv g^{x} \mbox{ (mod }p\mbox{)}$ via algorithm mp\_exptmod\_fast. \\
7436 4. else \\
7437 \hspace{3mm}4.1 Compute $y \equiv g^{x} \mbox{ (mod }p\mbox{)}$ via algorithm s\_mp\_exptmod. \\
7438 \hline
7439 \end{tabular}
7440 \end{center}
7441 \end{small}
7442 \caption{Algorithm mp\_exptmod}
7443 \end{figure}
7444
7445 \textbf{Algorithm mp\_exptmod.}
7446 The first algorithm which actually performs modular exponentiation is algorithm s\_mp\_exptmod. It is a sliding window $k$-ary algorithm
7447 which uses Barrett reduction to reduce the product modulo $p$. The second algorithm mp\_exptmod\_fast performs the same operation
7448 except it uses either Montgomery or Diminished Radix reduction. The two latter reduction algorithms are clumped in the same exponentiation
7449 algorithm since their arguments are essentially the same (\textit{two mp\_ints and one mp\_digit}).
7450
7451 \vspace{+3mm}\begin{small}
7452 \hspace{-5.1mm}{\bf File}: bn\_mp\_exptmod.c
7453 \vspace{-3mm}
7454 \begin{alltt}
7455 016
7456 017
7457 018 /* this is a shell function that calls either the normal or Montgomery
7458 019 * exptmod functions. Originally the call to the montgomery code was
7459 020 * embedded in the normal function but that wasted alot of stack space
7460 021 * for nothing (since 99% of the time the Montgomery code would be called)
7461 022 */
7462 023 int mp_exptmod (mp_int * G, mp_int * X, mp_int * P, mp_int * Y)
7463 024 \{
7464 025 int dr;
7465 026
7466 027 /* modulus P must be positive */
7467 028 if (P->sign == MP_NEG) \{
7468 029 return MP_VAL;
7469 030 \}
7470 031
7471 032 /* if exponent X is negative we have to recurse */
7472 033 if (X->sign == MP_NEG) \{
7473 034 mp_int tmpG, tmpX;
7474 035 int err;
7475 036
7476 037 /* first compute 1/G mod P */
7477 038 if ((err = mp_init(&tmpG)) != MP_OKAY) \{
7478 039 return err;
7479 040 \}
7480 041 if ((err = mp_invmod(G, P, &tmpG)) != MP_OKAY) \{
7481 042 mp_clear(&tmpG);
7482 043 return err;
7483 044 \}
7484 045
7485 046 /* now get |X| */
7486 047 if ((err = mp_init(&tmpX)) != MP_OKAY) \{
7487 048 mp_clear(&tmpG);
7488 049 return err;
7489 050 \}
7490 051 if ((err = mp_abs(X, &tmpX)) != MP_OKAY) \{
7491 052 mp_clear_multi(&tmpG, &tmpX, NULL);
7492 053 return err;
7493 054 \}
7494 055
7495 056 /* and now compute (1/G)**|X| instead of G**X [X < 0] */
7496 057 err = mp_exptmod(&tmpG, &tmpX, P, Y);
7497 058 mp_clear_multi(&tmpG, &tmpX, NULL);
7498 059 return err;
7499 060 \}
7500 061
7501 062 /* is it a DR modulus? */
7502 063 dr = mp_dr_is_modulus(P);
7503 064
7504 065 /* if not, is it a uDR modulus? */
7505 066 if (dr == 0) \{
7506 067 dr = mp_reduce_is_2k(P) << 1;
7507 068 \}
7508 069
7509 070 /* if the modulus is odd or dr != 0 use the fast method */
7510 071 if (mp_isodd (P) == 1 || dr != 0) \{
7511 072 return mp_exptmod_fast (G, X, P, Y, dr);
7512 073 \} else \{
7513 074 /* otherwise use the generic Barrett reduction technique */
7514 075 return s_mp_exptmod (G, X, P, Y);
7515 076 \}
7516 077 \}
7517 078
7518 \end{alltt}
7519 \end{small}
7520
7521 In order to keep the algorithms in a known state the first step on line 28 is to reject any negative modulus as input. If the exponent is
7522 negative the algorithm tries to perform a modular exponentiation with the modular inverse of the base $G$. The temporary variable $tmpG$ is assigned
7523 the modular inverse of $G$ and $tmpX$ is assigned the absolute value of $X$. The algorithm will recurse with these new values with a positive
7524 exponent.
7525
7526 If the exponent is positive the algorithm resumes the exponentiation. Line 63 determines if the modulus is of the restricted Diminished Radix
7527 form. If it is not, line 67 attempts to determine if it is of an unrestricted Diminished Radix form. The integer $dr$ will take on one
7528 of three values.
7529
7530 \begin{enumerate}
7531 \item $dr = 0$ means that the modulus is not of either restricted or unrestricted Diminished Radix form.
7532 \item $dr = 1$ means that the modulus is of restricted Diminished Radix form.
7533 \item $dr = 2$ means that the modulus is of unrestricted Diminished Radix form.
7534 \end{enumerate}
7535
7536 Line 70 determines if the fast modular exponentiation algorithm can be used. It is allowed if $dr \ne 0$ or if the modulus is odd. Otherwise,
7537 the slower s\_mp\_exptmod algorithm is used which uses Barrett reduction.
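
From the caller's perspective none of this dispatching is visible. The following usage sketch (error handling trimmed) assumes the helper routines mp\_init\_multi and mp\_read\_radix available elsewhere in the library.

\begin{small}
\begin{alltt}
#include <tommath.h>

/* usage sketch: y = g**x (mod p), the reduction method is
 * chosen automatically by mp_exptmod                      */
static int demo_exptmod(mp_int *y)
\{
   mp_int g, x, p;
   int    res;

   if ((res = mp_init_multi(&g, &x, &p, NULL)) != MP_OKAY) \{
      return res;
   \}
   mp_read_radix(&g, "2", 10);
   mp_read_radix(&x, "123456789", 10);
   mp_read_radix(&p, "1000000007", 10);  /* odd, so the fast path is taken */
   res = mp_exptmod(&g, &x, &p, y);
   mp_clear_multi(&g, &x, &p, NULL);
   return res;
\}
\end{alltt}
\end{small}
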
7538
7539 \subsection{Barrett Modular Exponentiation}
7540
7541 \newpage\begin{figure}[!here]
7542 \begin{small}
7543 \begin{center}
7544 \begin{tabular}{l}
7545 \hline Algorithm \textbf{s\_mp\_exptmod}. \\
7546 \textbf{Input}. mp\_int $g$, $x$ and $p$ \\
7547 \textbf{Output}. $y \equiv g^x \mbox{ (mod }p\mbox{)}$ \\
7548 \hline \\
7549 1. $k \leftarrow lg(x)$ \\
7550 2. $winsize \leftarrow \left \lbrace \begin{array}{ll}
7551 2 & \mbox{if }k \le 7 \\
7552 3 & \mbox{if }7 < k \le 36 \\
7553 4 & \mbox{if }36 < k \le 140 \\
7554 5 & \mbox{if }140 < k \le 450 \\
7555 6 & \mbox{if }450 < k \le 1303 \\
7556 7 & \mbox{if }1303 < k \le 3529 \\
7557 8 & \mbox{if }3529 < k \\
7558 \end{array} \right .$ \\
7559 3. Initialize $2^{winsize}$ mp\_ints in an array named $M$ and one mp\_int named $\mu$ \\
7560 4. Calculate the $\mu$ required for Barrett Reduction (\textit{mp\_reduce\_setup}). \\
7561 5. $M_1 \leftarrow g \mbox{ (mod }p\mbox{)}$ \\
7562 \\
7563 Setup the table of small powers of $g$. First find $g^{2^{winsize}}$ and then all multiples of it. \\
7564 6. $k \leftarrow 2^{winsize - 1}$ \\
7565 7. $M_{k} \leftarrow M_1$ \\
7566 8. for $ix$ from 0 to $winsize - 2$ do \\
7567 \hspace{3mm}8.1 $M_k \leftarrow \left ( M_k \right )^2$ (\textit{mp\_sqr}) \\
7568 \hspace{3mm}8.2 $M_k \leftarrow M_k \mbox{ (mod }p\mbox{)}$ (\textit{mp\_reduce}) \\
7569 9. for $ix$ from $2^{winsize - 1} + 1$ to $2^{winsize} - 1$ do \\
7570 \hspace{3mm}9.1 $M_{ix} \leftarrow M_{ix - 1} \cdot M_{1}$ (\textit{mp\_mul}) \\
7571 \hspace{3mm}9.2 $M_{ix} \leftarrow M_{ix} \mbox{ (mod }p\mbox{)}$ (\textit{mp\_reduce}) \\
7572 10. $res \leftarrow 1$ \\
7573 \\
7574 Start Sliding Window. \\
7575 11. $mode \leftarrow 0, bitcnt \leftarrow 1, buf \leftarrow 0, digidx \leftarrow x.used - 1, bitcpy \leftarrow 0, bitbuf \leftarrow 0$ \\
7576 12. Loop \\
7577 \hspace{3mm}12.1 $bitcnt \leftarrow bitcnt - 1$ \\
7578 \hspace{3mm}12.2 If $bitcnt = 0$ then do \\
7579 \hspace{6mm}12.2.1 If $digidx = -1$ goto step 13. \\
7580 \hspace{6mm}12.2.2 $buf \leftarrow x_{digidx}$ \\
7581 \hspace{6mm}12.2.3 $digidx \leftarrow digidx - 1$ \\
7582 \hspace{6mm}12.2.4 $bitcnt \leftarrow lg(\beta)$ \\
7583 Continued on next page. \\
7584 \hline
7585 \end{tabular}
7586 \end{center}
7587 \end{small}
7588 \caption{Algorithm s\_mp\_exptmod}
7589 \end{figure}
7590
7591 \newpage\begin{figure}[!here]
7592 \begin{small}
7593 \begin{center}
7594 \begin{tabular}{l}
7595 \hline Algorithm \textbf{s\_mp\_exptmod} (\textit{continued}). \\
7596 \textbf{Input}. mp\_int $g$, $x$ and $p$ \\
7597 \textbf{Output}. $y \equiv g^x \mbox{ (mod }p\mbox{)}$ \\
7598 \hline \\
7599 \hspace{3mm}12.3 $y \leftarrow (buf >> (lg(\beta) - 1))$ AND $1$ \\
7600 \hspace{3mm}12.4 $buf \leftarrow buf << 1$ \\
7601 \hspace{3mm}12.5 if $mode = 0$ and $y = 0$ then goto step 12. \\
7602 \hspace{3mm}12.6 if $mode = 1$ and $y = 0$ then do \\
7603 \hspace{6mm}12.6.1 $res \leftarrow res^2$ \\
7604 \hspace{6mm}12.6.2 $res \leftarrow res \mbox{ (mod }p\mbox{)}$ \\
7605 \hspace{6mm}12.6.3 Goto step 12. \\
7606 \hspace{3mm}12.7 $bitcpy \leftarrow bitcpy + 1$ \\
7607 \hspace{3mm}12.8 $bitbuf \leftarrow bitbuf + (y << (winsize - bitcpy))$ \\
7608 \hspace{3mm}12.9 $mode \leftarrow 2$ \\
7609 \hspace{3mm}12.10 If $bitcpy = winsize$ then do \\
7610 \hspace{6mm}Window is full so perform the squarings and single multiplication. \\
7611 \hspace{6mm}12.10.1 for $ix$ from $0$ to $winsize -1$ do \\
7612 \hspace{9mm}12.10.1.1 $res \leftarrow res^2$ \\
7613 \hspace{9mm}12.10.1.2 $res \leftarrow res \mbox{ (mod }p\mbox{)}$ \\
7614 \hspace{6mm}12.10.2 $res \leftarrow res \cdot M_{bitbuf}$ \\
7615 \hspace{6mm}12.10.3 $res \leftarrow res \mbox{ (mod }p\mbox{)}$ \\
7616 \hspace{6mm}Reset the window. \\
7617 \hspace{6mm}12.10.4 $bitcpy \leftarrow 0, bitbuf \leftarrow 0, mode \leftarrow 1$ \\
7618 \\
7619 No more windows left. Check for residual bits of exponent. \\
7620 13. If $mode = 2$ and $bitcpy > 0$ then do \\
7621 \hspace{3mm}13.1 for $ix$ from $0$ to $bitcpy - 1$ do \\
7622 \hspace{6mm}13.1.1 $res \leftarrow res^2$ \\
7623 \hspace{6mm}13.1.2 $res \leftarrow res \mbox{ (mod }p\mbox{)}$ \\
7624 \hspace{6mm}13.1.3 $bitbuf \leftarrow bitbuf << 1$ \\
7625 \hspace{6mm}13.1.4 If $bitbuf$ AND $2^{winsize} \ne 0$ then do \\
7626 \hspace{9mm}13.1.4.1 $res \leftarrow res \cdot M_{1}$ \\
7627 \hspace{9mm}13.1.4.2 $res \leftarrow res \mbox{ (mod }p\mbox{)}$ \\
7628 14. $y \leftarrow res$ \\
7629 15. Clear $res$, $mu$ and the $M$ array. \\
7630 16. Return(\textit{MP\_OKAY}). \\
7631 \hline
7632 \end{tabular}
7633 \end{center}
7634 \end{small}
7635 \caption{Algorithm s\_mp\_exptmod (continued)}
7636 \end{figure}
7637
7638 \textbf{Algorithm s\_mp\_exptmod.}
7639 This algorithm computes the $x$'th power of $g$ modulo $p$ and stores the result in $y$. It takes advantage of the Barrett reduction
7640 algorithm to keep the product small throughout the algorithm.
7641
7642 The first two steps determine the optimal window size based on the number of bits in the exponent. The larger the exponent the
7643 larger the window size becomes. After a window size $winsize$ has been chosen an array of $2^{winsize}$ mp\_int variables is allocated. This
7644 table will hold the values of $g^x \mbox{ (mod }p\mbox{)}$ for $2^{winsize - 1} \le x < 2^{winsize}$.
7645
7646 After the table is allocated the first power of $g$ is found. Since $g \ge p$ is allowed it must first be reduced modulo $p$ to make
7647 the rest of the algorithm more efficient. The first element of the table at $2^{winsize - 1}$ is found by squaring $M_1$ successively $winsize - 1$
7648 times. The rest of the table elements are found by multiplying the previous element by $M_1$ modulo $p$.
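
The table construction can be pictured with single digit values as follows. This is only an illustrative analogue of steps 5 through 9, not the library code; the caller is assumed to provide an array $M$ with $2^{winsize}$ entries.

\begin{small}
\begin{alltt}
/* word sized analogue of the table setup: the entry at
 * 1 << (winsize - 1) is made by repeated squaring, the
 * remaining upper entries by multiplying by M[1]
 * (illustrative only)                                   */
static void build_table(mp_digit M[], mp_digit g, mp_digit p, int winsize)
\{
   int     x;
   mp_word t;

   M[1] = g % p;

   /* M[2**(winsize-1)] = M[1]**(2**(winsize-1)) mod p */
   M[1 << (winsize - 1)] = M[1];
   for (x = 0; x < (winsize - 1); x++) \{
      t = (mp_word)M[1 << (winsize - 1)] * (mp_word)M[1 << (winsize - 1)];
      M[1 << (winsize - 1)] = (mp_digit)(t % p);
   \}

   /* M[x] = M[x - 1] * M[1] mod p for the rest of the upper half */
   for (x = (1 << (winsize - 1)) + 1; x < (1 << winsize); x++) \{
      t    = (mp_word)M[x - 1] * (mp_word)M[1];
      M[x] = (mp_digit)(t % p);
   \}
\}
\end{alltt}
\end{small}
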
7649
7650 Now that the table is available the sliding window may begin. The following list describes the functions of all the variables in the window.
7651 \begin{enumerate}
7652 \item The variable $mode$ dictates how the bits of the exponent are interpreted.
7653 \begin{enumerate}
7654 \item When $mode = 0$ the bits are ignored since no non-zero bit of the exponent has been seen yet. For example, if the exponent were simply
7655 $1$ then there would be $lg(\beta) - 1$ zero bits before the first non-zero bit. In this case bits are ignored until a non-zero bit is found.
7656 \item When $mode = 1$ a non-zero bit has been seen before and a new $winsize$-bit window has not been formed yet. In this mode leading $0$ bits
7657 are read and a single squaring is performed. If a non-zero bit is read a new window is created.
7658 \item When $mode = 2$ the algorithm is in the middle of forming a window and new bits are appended to the window from the most significant bit
7659 downwards.
7660 \end{enumerate}
7661 \item The variable $bitcnt$ indicates how many bits of the current digit of the exponent remain to be read. When it reaches zero a new digit
7662 is fetched from the exponent.
7663 \item The variable $buf$ holds the currently read digit of the exponent.
7664 \item The variable $digidx$ is an index into the exponent's digits. It starts at the leading digit $x.used - 1$ and moves towards the trailing digit.
7665 \item The variable $bitcpy$ indicates how many bits are in the currently formed window. When it reaches $winsize$ the window is flushed and
7666 the appropriate operations performed.
7667 \item The variable $bitbuf$ holds the current bits of the window being formed.
7668 \end{enumerate}
7669
7670 All of step 12 is the window processing loop. It will iterate while there are digits available from the exponent to read. The first step
7671 inside this loop is to extract a new digit if no more bits are available in the current digit. If there are no bits left a new digit is
7672 read and if there are no digits left then the loop terminates.
7673
7674 After a digit is made available step 12.3 will extract the most significant bit of the current digit and move all other bits in the digit
7675 upwards. In effect the digit is read from most significant bit to least significant bit and since the digits are read from leading to
7676 trailing edges the entire exponent is read from most significant bit to least significant bit.
7677
7678 At step 12.5 if the $mode$ and currently extracted bit $y$ are both zero the bit is ignored and the next bit is read. This prevents the
7679 algorithm from having to perform trivial squaring and reduction operations before the first non-zero bit is read. Steps 12.6 and 12.7 through 12.10
7680 handle the two cases of $mode = 1$ and $mode = 2$ respectively.
7681
7682 \begin{center}
7683 \begin{figure}[here]
7684 \includegraphics{pics/expt_state.ps}
7685 \caption{Sliding Window State Diagram}
7686 \label{pic:expt_state}
7687 \end{figure}
7688 \end{center}
7689
7690 By step 13 there are no more digits left in the exponent. However, there may be partial bits in the window left. If $mode = 2$ then
7691 a Left-to-Right algorithm is used to process the remaining few bits.
7692
7693 \vspace{+3mm}\begin{small}
7694 \hspace{-5.1mm}{\bf File}: bn\_s\_mp\_exptmod.c
7695 \vspace{-3mm}
7696 \begin{alltt}
7697 016
7698 017 #ifdef MP_LOW_MEM
7699 018 #define TAB_SIZE 32
7700 019 #else
7701 020 #define TAB_SIZE 256
7702 021 #endif
7703 022
7704 023 int s_mp_exptmod (mp_int * G, mp_int * X, mp_int * P, mp_int * Y)
7705 024 \{
7706 025 mp_int M[TAB_SIZE], res, mu;
7707 026 mp_digit buf;
7708 027 int err, bitbuf, bitcpy, bitcnt, mode, digidx, x, y, winsize;
7709 028
7710 029 /* find window size */
7711 030 x = mp_count_bits (X);
7712 031 if (x <= 7) \{
7713 032 winsize = 2;
7714 033 \} else if (x <= 36) \{
7715 034 winsize = 3;
7716 035 \} else if (x <= 140) \{
7717 036 winsize = 4;
7718 037 \} else if (x <= 450) \{
7719 038 winsize = 5;
7720 039 \} else if (x <= 1303) \{
7721 040 winsize = 6;
7722 041 \} else if (x <= 3529) \{
7723 042 winsize = 7;
7724 043 \} else \{
7725 044 winsize = 8;
7726 045 \}
7727 046
7728 047 #ifdef MP_LOW_MEM
7729 048 if (winsize > 5) \{
7730 049 winsize = 5;
7731 050 \}
7732 051 #endif
7733 052
7734 053 /* init M array */
7735 054 /* init first cell */
7736 055 if ((err = mp_init(&M[1])) != MP_OKAY) \{
7737 056 return err;
7738 057 \}
7739 058
7740 059 /* now init the second half of the array */
7741 060 for (x = 1<<(winsize-1); x < (1 << winsize); x++) \{
7742 061 if ((err = mp_init(&M[x])) != MP_OKAY) \{
7743 062 for (y = 1<<(winsize-1); y < x; y++) \{
7744 063 mp_clear (&M[y]);
7745 064 \}
7746 065 mp_clear(&M[1]);
7747 066 return err;
7748 067 \}
7749 068 \}
7750 069
7751 070 /* create mu, used for Barrett reduction */
7752 071 if ((err = mp_init (&mu)) != MP_OKAY) \{
7753 072 goto __M;
7754 073 \}
7755 074 if ((err = mp_reduce_setup (&mu, P)) != MP_OKAY) \{
7756 075 goto __MU;
7757 076 \}
7758 077
7759 078 /* create M table
7760 079 *
7761 080 * The M table contains powers of the base,
7762 081 * e.g. M[x] = G**x mod P
7763 082 *
7764 083 * The first half of the table is not
7765 084 * computed though accept for M[0] and M[1]
7766 085 */
7767 086 if ((err = mp_mod (G, P, &M[1])) != MP_OKAY) \{
7768 087 goto __MU;
7769 088 \}
7770 089
7771 090 /* compute the value at M[1<<(winsize-1)] by squaring
7772 091 * M[1] (winsize-1) times
7773 092 */
7774 093 if ((err = mp_copy (&M[1], &M[1 << (winsize - 1)])) != MP_OKAY) \{
7775 094 goto __MU;
7776 095 \}
7777 096
7778 097 for (x = 0; x < (winsize - 1); x++) \{
7779 098 if ((err = mp_sqr (&M[1 << (winsize - 1)],
7780 099 &M[1 << (winsize - 1)])) != MP_OKAY) \{
7781 100 goto __MU;
7782 101 \}
7783 102 if ((err = mp_reduce (&M[1 << (winsize - 1)], P, &mu)) != MP_OKAY) \{
7784 103 goto __MU;
7785 104 \}
7786 105 \}
7787 106
7788 107 /* create upper table, that is M[x] = M[x-1] * M[1] (mod P)
7789 108 * for x = (2**(winsize - 1) + 1) to (2**winsize - 1)
7790 109 */
7791 110 for (x = (1 << (winsize - 1)) + 1; x < (1 << winsize); x++) \{
7792 111 if ((err = mp_mul (&M[x - 1], &M[1], &M[x])) != MP_OKAY) \{
7793 112 goto __MU;
7794 113 \}
7795 114 if ((err = mp_reduce (&M[x], P, &mu)) != MP_OKAY) \{
7796 115 goto __MU;
7797 116 \}
7798 117 \}
7799 118
7800 119 /* setup result */
7801 120 if ((err = mp_init (&res)) != MP_OKAY) \{
7802 121 goto __MU;
7803 122 \}
7804 123 mp_set (&res, 1);
7805 124
7806 125 /* set initial mode and bit cnt */
7807 126 mode = 0;
7808 127 bitcnt = 1;
7809 128 buf = 0;
7810 129 digidx = X->used - 1;
7811 130 bitcpy = 0;
7812 131 bitbuf = 0;
7813 132
7814 133 for (;;) \{
7815 134 /* grab next digit as required */
7816 135 if (--bitcnt == 0) \{
7817 136 /* if digidx == -1 we are out of digits */
7818 137 if (digidx == -1) \{
7819 138 break;
7820 139 \}
7821 140 /* read next digit and reset the bitcnt */
7822 141 buf = X->dp[digidx--];
7823 142 bitcnt = (int) DIGIT_BIT;
7824 143 \}
7825 144
7826 145 /* grab the next msb from the exponent */
7827 146 y = (buf >> (mp_digit)(DIGIT_BIT - 1)) & 1;
7828 147 buf <<= (mp_digit)1;
7829 148
7830 149 /* if the bit is zero and mode == 0 then we ignore it
7831 150 * These represent the leading zero bits before the first 1 bit
7832 151 * in the exponent. Technically this opt is not required but it
7833 152 * does lower the # of trivial squaring/reductions used
7834 153 */
7835 154 if (mode == 0 && y == 0) \{
7836 155 continue;
7837 156 \}
7838 157
7839 158 /* if the bit is zero and mode == 1 then we square */
7840 159 if (mode == 1 && y == 0) \{
7841 160 if ((err = mp_sqr (&res, &res)) != MP_OKAY) \{
7842 161 goto __RES;
7843 162 \}
7844 163 if ((err = mp_reduce (&res, P, &mu)) != MP_OKAY) \{
7845 164 goto __RES;
7846 165 \}
7847 166 continue;
7848 167 \}
7849 168
7850 169 /* else we add it to the window */
7851 170 bitbuf |= (y << (winsize - ++bitcpy));
7852 171 mode = 2;
7853 172
7854 173 if (bitcpy == winsize) \{
7855 174 /* ok window is filled so square as required and multiply */
7856 175 /* square first */
7857 176 for (x = 0; x < winsize; x++) \{
7858 177 if ((err = mp_sqr (&res, &res)) != MP_OKAY) \{
7859 178 goto __RES;
7860 179 \}
7861 180 if ((err = mp_reduce (&res, P, &mu)) != MP_OKAY) \{
7862 181 goto __RES;
7863 182 \}
7864 183 \}
7865 184
7866 185 /* then multiply */
7867 186 if ((err = mp_mul (&res, &M[bitbuf], &res)) != MP_OKAY) \{
7868 187 goto __RES;
7869 188 \}
7870 189 if ((err = mp_reduce (&res, P, &mu)) != MP_OKAY) \{
7871 190 goto __RES;
7872 191 \}
7873 192
7874 193 /* empty window and reset */
7875 194 bitcpy = 0;
7876 195 bitbuf = 0;
7877 196 mode = 1;
7878 197 \}
7879 198 \}
7880 199
7881 200 /* if bits remain then square/multiply */
7882 201 if (mode == 2 && bitcpy > 0) \{
7883 202 /* square then multiply if the bit is set */
7884 203 for (x = 0; x < bitcpy; x++) \{
7885 204 if ((err = mp_sqr (&res, &res)) != MP_OKAY) \{
7886 205 goto __RES;
7887 206 \}
7888 207 if ((err = mp_reduce (&res, P, &mu)) != MP_OKAY) \{
7889 208 goto __RES;
7890 209 \}
7891 210
7892 211 bitbuf <<= 1;
7893 212 if ((bitbuf & (1 << winsize)) != 0) \{
7894 213 /* then multiply */
7895 214 if ((err = mp_mul (&res, &M[1], &res)) != MP_OKAY) \{
7896 215 goto __RES;
7897 216 \}
7898 217 if ((err = mp_reduce (&res, P, &mu)) != MP_OKAY) \{
7899 218 goto __RES;
7900 219 \}
7901 220 \}
7902 221 \}
7903 222 \}
7904 223
7905 224 mp_exch (&res, Y);
7906 225 err = MP_OKAY;
7907 226 __RES:mp_clear (&res);
7908 227 __MU:mp_clear (&mu);
7909 228 __M:
7910 229 mp_clear(&M[1]);
7911 230 for (x = 1<<(winsize-1); x < (1 << winsize); x++) \{
7912 231 mp_clear (&M[x]);
7913 232 \}
7914 233 return err;
7915 234 \}
7916 \end{alltt}
7917 \end{small}
7918
7919 Lines 31 through 45 determine the optimal window size based on the length of the exponent in bits. The window divisions are sorted
7920 from smallest to greatest so that in each \textbf{if} statement only one condition must be tested. For example, by the \textbf{if} statement
7921 on line 37 the value of $x$ is already known to be greater than $140$.
7922
7923 The conditional piece of code beginning on line 47 allows the window size to be restricted to five bits. This logic is used to ensure
7924 the table of precomputed powers of $G$ remains relatively small.
7925
7926 The for loop on line 60 initializes the $M$ array while lines 71 and 74 compute the value of $\mu$ required for
7927 Barrett reduction.
7928
7929 -- More later.
7930
7931 \section{Quick Power of Two}
7932 Calculating $a = 2^b$ can be performed much more quickly than with any of the previous algorithms. Recall that a logical shift left $m << k$ is
7933 equivalent to $m \cdot 2^k$. By this logic when $m = 1$ a quick power of two can be achieved.
7934
7935 \begin{figure}[!here]
7936 \begin{small}
7937 \begin{center}
7938 \begin{tabular}{l}
7939 \hline Algorithm \textbf{mp\_2expt}. \\
7940 \textbf{Input}. integer $b$ \\
7941 \textbf{Output}. $a \leftarrow 2^b$ \\
7942 \hline \\
7943 1. $a \leftarrow 0$ \\
7944 2. If $a.alloc < \lfloor b / lg(\beta) \rfloor + 1$ then grow $a$ appropriately. \\
7945 3. $a.used \leftarrow \lfloor b / lg(\beta) \rfloor + 1$ \\
7946 4. $a_{\lfloor b / lg(\beta) \rfloor} \leftarrow 1 << (b \mbox{ mod } lg(\beta))$ \\
7947 5. Return(\textit{MP\_OKAY}). \\
7948 \hline
7949 \end{tabular}
7950 \end{center}
7951 \end{small}
7952 \caption{Algorithm mp\_2expt}
7953 \end{figure}
7954
7955 \textbf{Algorithm mp\_2expt.}
This algorithm computes $a = 2^b$ by zeroing $a$, growing it so that digit $\lfloor b / lg(\beta) \rfloor$ exists, setting the $used$ count accordingly and finally placing the single required bit within that digit.
7956
7957 \vspace{+3mm}\begin{small}
7958 \hspace{-5.1mm}{\bf File}: bn\_mp\_2expt.c
7959 \vspace{-3mm}
7960 \begin{alltt}
7961 016
7962 017 /* computes a = 2**b
7963 018 *
7964 019 * Simple algorithm which zeroes the int, grows it then just sets one bit
7965 020 * as required.
7966 021 */
7967 022 int
7968 023 mp_2expt (mp_int * a, int b)
7969 024 \{
7970 025 int res;
7971 026
7972 027 /* zero a as per default */
7973 028 mp_zero (a);
7974 029
7975 030 /* grow a to accomodate the single bit */
7976 031 if ((res = mp_grow (a, b / DIGIT_BIT + 1)) != MP_OKAY) \{
7977 032 return res;
7978 033 \}
7979 034
7980 035 /* set the used count of where the bit will go */
7981 036 a->used = b / DIGIT_BIT + 1;
7982 037
7983 038 /* put the single bit in its place */
7984 039 a->dp[b / DIGIT_BIT] = 1 << (b % DIGIT_BIT);
7985 040
7986 041 return MP_OKAY;
7987 042 \}
7988 \end{alltt}
7989 \end{small}
7990
7991 \chapter{Higher Level Algorithms}
7992
7993 This chapter discusses the various higher level algorithms that are required to complete a well rounded multiple precision integer package. These
7994 routines are less performance oriented than the algorithms of chapters five, six and seven but are no less important.
7995
7996 The first section describes a method of integer division with remainder that is universally well known. It provides the signed division logic
7997 for the package. The subsequent section discusses a set of algorithms which allow a single digit to be the 2nd operand for a variety of operations.
7998 These algorithms serve mostly to simplify other algorithms where small constants are required. The last two sections discuss how to manipulate
7999 various representations of integers. For example, converting from an mp\_int to a string of characters.
8000
8001 \section{Integer Division with Remainder}
8002 \label{sec:division}
8003
8004 Aside from modular exponentiation, integer division is the most intensive algorithm to compute. Like addition, subtraction and multiplication
8005 the basis of this algorithm is the long-hand division algorithm taught to school children. Throughout this discussion several common variables
8006 will be used. Let $x$ represent the divisor and $y$ represent the dividend. Let $q$ represent the integer quotient $\lfloor y / x \rfloor$ and
8007 let $r$ represent the remainder $r = y - x \lfloor y / x \rfloor$. The following simple algorithm will be used to start the discussion.
8008
8009 \newpage\begin{figure}[!here]
8010 \begin{small}
8011 \begin{center}
8012 \begin{tabular}{l}
8013 \hline Algorithm \textbf{Radix-$\beta$ Integer Division}. \\
8014 \textbf{Input}. integer $x$ and $y$ \\
8015 \textbf{Output}. $q = \lfloor y/x\rfloor, r = y - xq$ \\
8016 \hline \\
8017 1. $q \leftarrow 0$ \\
8018 2. $n \leftarrow \vert \vert y \vert \vert - \vert \vert x \vert \vert$ \\
8019 3. for $t$ from $n$ down to $0$ do \\
8020 \hspace{3mm}3.1 Maximize $k$ such that $kx\beta^t$ is less than or equal to $y$ and $(k + 1)x\beta^t$ is greater. \\
8021 \hspace{3mm}3.2 $q \leftarrow q + k\beta^t$ \\
8022 \hspace{3mm}3.3 $y \leftarrow y - kx\beta^t$ \\
8023 4. $r \leftarrow y$ \\
8024 5. Return($q, r$) \\
8025 \hline
8026 \end{tabular}
8027 \end{center}
8028 \end{small}
8029 \caption{Algorithm Radix-$\beta$ Integer Division}
8030 \label{fig:raddiv}
8031 \end{figure}
8032
8033 As children we are taught this very simple algorithm for the case of $\beta = 10$. Almost instinctively several optimizations are taught, the reasons
8034 for which are never explained. For this example let $y = 5471$ represent the dividend and $x = 23$ represent the divisor.
8035
8036 To find the first digit of the quotient the value of $k$ must be maximized such that $kx\beta^t$ is less than or equal to $y$ and
8037 simultaneously $(k + 1)x\beta^t$ is greater than $y$. Implicitly $k$ is the maximum value the $t$'th digit of the quotient may have. The habitual method
8038 used to find the maximum is to ``eyeball'' the two numbers, typically only the leading digits and quickly estimate a quotient. By only using leading
8039 digits a much simpler division may be used to form an educated guess at what the value must be. In this case $k = \lfloor 54/23\rfloor = 2$ quickly
8040 arises as a possible solution. Indeed $2x\beta^2 = 4600$ is less than $y = 5471$ and simultaneously $(k + 1)x\beta^2 = 6900$ is larger than $y$.
8041 As a result $k\beta^2$ is added to the quotient which now equals $q = 200$ and $4600$ is subtracted from $y$ to give a remainder of $y = 871$.
8042
8043 Again this process is repeated to produce the quotient digit $k = 3$ which makes the quotient $q = 200 + 3\beta = 230$ and the remainder
8044 $y = 871 - 3x\beta = 181$. Finally the last iteration of the loop produces $k = 7$ which leads to the quotient $q = 230 + 7 = 237$ and the
8045 remainder $y = 181 - 7x = 20$. The final quotient and remainder found are $q = 237$ and $r = y = 20$ which are indeed correct since
8046 $237 \cdot 23 + 20 = 5471$ is true.
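
The entire schoolbook procedure fits in a few lines of C for machine words. The helper below is illustrative only and follows algorithm~\ref{fig:raddiv} directly with $\beta = 10$ and small unsigned inputs.

\begin{small}
\begin{alltt}
/* schoolbook division with beta = 10 following the
 * Radix-beta Integer Division algorithm (illustrative only) */
static void radix_div(unsigned y, unsigned x, unsigned *q, unsigned *r)
\{
   unsigned pow, k;

   *q  = 0;
   pow = 1;
   /* largest power of ten such that x * pow <= y */
   while (x * pow * 10 <= y) \{
      pow *= 10;
   \}
   /* produce one quotient digit per power of ten */
   while (pow > 0) \{
      k = 0;
      while ((k + 1) * x * pow <= y) \{
         k++;
      \}
      *q  += k * pow;
      y   -= k * x * pow;
      pow /= 10;
   \}
   *r = y;
\}
\end{alltt}
\end{small}

Calling radix\_div(5471, 23, \&q, \&r) produces $q = 237$ and $r = 20$ as in the example above.
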
8047
8048 \subsection{Quotient Estimation}
8049 \label{sec:divest}
8050 As alluded to earlier the quotient digit $k$ can be estimated from only the leading digits of both the divisor and dividend. When $p$ leading
8051 digits are used from both the divisor and dividend to form an estimation the accuracy of the estimation rises as $p$ grows. Technically
8052 speaking the estimation is based on assuming the lower $\vert \vert y \vert \vert - p$ and $\vert \vert x \vert \vert - p$ digits of the
8053 dividend and divisor are zero.
8054
8055 The value of the estimation may be off by a few values in either direction and in general is fairly accurate. A simplification \cite[pp. 271]{TAOCPV2}
8056 of the estimation technique is to use $t + 1$ digits of the dividend and $t$ digits of the divisor, in particular when $t = 1$. The estimate
8057 using this technique is never too small. For the following proof let $t = \vert \vert y \vert \vert - 1$ and $s = \vert \vert x \vert \vert - 1$
8058 represent the most significant digits of the dividend and divisor respectively.
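
In terms of single digits the estimate is one double precision division. The helper below merely sketches the idea (it is not the routine used by mp\_div, which is presented later in this section); it divides the two leading digits of the dividend by the leading digit of the divisor and clamps the result to $\beta - 1$.

\begin{small}
\begin{alltt}
/* estimate a quotient digit from the two leading digits of the
 * dividend (yt, yt1) and the leading digit of the divisor (xs),
 * clamped to beta - 1 (illustrative sketch only)                */
static mp_digit estimate_q(mp_digit yt, mp_digit yt1, mp_digit xs)
\{
   mp_word r;

   r = (((mp_word)yt) << ((mp_word)DIGIT_BIT)) | ((mp_word)yt1);
   r = r / ((mp_word)xs);
   if (r > ((mp_word)MP_MASK)) \{
      r = (mp_word)MP_MASK;          /* clamp to beta - 1 */
   \}
   return (mp_digit)r;
\}
\end{alltt}
\end{small}
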
8059
8060 \textbf{Proof.}\textit{ The quotient $\hat k = \lfloor (y_t\beta + y_{t-1}) / x_s \rfloor$ is greater than or equal to
8061 $k = \lfloor y / (x \cdot \beta^{\vert \vert y \vert \vert - \vert \vert x \vert \vert - 1}) \rfloor$. }
8062 The first obvious case is when $\hat k = \beta - 1$ in which case the proof is concluded since the real quotient cannot be larger. For all other
8063 cases $\hat k = \lfloor (y_t\beta + y_{t-1}) / x_s \rfloor$ and $\hat k x_s \ge y_t\beta + y_{t-1} - x_s + 1$. The latter portion of the inequality
8064 $-x_s + 1$ arises from the fact that a truncated integer division will give the same quotient for at most $x_s - 1$ values. Next a series of
8065 inequalities will prove the hypothesis.
8066
8067 \begin{equation}
8068 y - \hat k x \le y - \hat k x_s\beta^s
8069 \end{equation}
8070
8071 This is trivially true since $x \ge x_s\beta^s$. Next we replace $\hat kx_s\beta^s$ by the previous inequality for $\hat kx_s$.
8072
8073 \begin{equation}
8074 y - \hat k x \le y_t\beta^t + \ldots + y_0 - (y_t\beta^t + y_{t-1}\beta^{t-1} - x_s\beta^t + \beta^s)
8075 \end{equation}
8076
8077 By simplifying the previous inequality the following inequality is formed.
8078
8079 \begin{equation}
8080 y - \hat k x \le y_{t-2}\beta^{t-2} + \ldots + y_0 + x_s\beta^s - \beta^s
8081 \end{equation}
8082
8083 Subsequently,
8084
8085 \begin{equation}
8086 y_{t-2}\beta^{t-2} + \ldots + y_0 + x_s\beta^s - \beta^s < x_s\beta^s \le x
8087 \end{equation}
8088
8089 Which proves that $y - \hat kx \le x$ and by consequence $\hat k \ge k$ which concludes the proof. \textbf{QED}
8090
8091
8092 \subsection{Normalized Integers}
8093 For the purposes of division a normalized input is when the divisor's leading digit $x_n$ is greater than or equal to $\beta / 2$. By multiplying both
8094 $x$ and $y$ by $j = \lfloor (\beta / 2) / x_n \rfloor$ the quotient remains unchanged and the remainder is simply $j$ times the original
8095 remainder. The purpose of normalization is to ensure the leading digit of the divisor is sufficiently large such that the estimated quotient will
8096 lie in the domain of a single digit. Consider the maximum dividend $(\beta - 1) \cdot \beta + (\beta - 1)$ and the minimum divisor $\beta / 2$.
8097
8098 \begin{equation}
8099 {{\beta^2 - 1} \over { \beta / 2}} \le 2\beta - {2 \over \beta}
8100 \end{equation}
8101
8102 At most the quotient approaches $2\beta$, however, in practice this will not occur since that would imply the previous quotient digit was too small.
8103
8104 \subsection{Radix-$\beta$ Division with Remainder}
8105 \newpage\begin{figure}[!here]
8106 \begin{small}
8107 \begin{center}
8108 \begin{tabular}{l}
8109 \hline Algorithm \textbf{mp\_div}. \\
8110 \textbf{Input}. mp\_int $a, b$ \\
8111 \textbf{Output}. $c = \lfloor a/b \rfloor$, $d = a - bc$ \\
8112 \hline \\
8113 1. If $b = 0$ return(\textit{MP\_VAL}). \\
8114 2. If $\vert a \vert < \vert b \vert$ then do \\
8115 \hspace{3mm}2.1 $d \leftarrow a$ \\
8116 \hspace{3mm}2.2 $c \leftarrow 0$ \\
8117 \hspace{3mm}2.3 Return(\textit{MP\_OKAY}). \\
8118 \\
8119 Setup the quotient to receive the digits. \\
8120 3. Grow $q$ to $a.used + 2$ digits. \\
8121 4. $q \leftarrow 0$ \\
8122 5. $x \leftarrow \vert a \vert , y \leftarrow \vert b \vert$ \\
8123 6. $sign \leftarrow \left \lbrace \begin{array}{ll}
8124 MP\_ZPOS & \mbox{if }a.sign = b.sign \\
8125 MP\_NEG & \mbox{otherwise} \\
8126 \end{array} \right .$ \\
8127 \\
8128 Normalize the inputs such that the leading digit of $y$ is greater than or equal to $\beta / 2$. \\
8129 7. $norm \leftarrow (lg(\beta) - 1) - (\lceil lg(y) \rceil \mbox{ (mod }lg(\beta)\mbox{)})$ \\
8130 8. $x \leftarrow x \cdot 2^{norm}, y \leftarrow y \cdot 2^{norm}$ \\
8131 \\
8132 Find the leading digit of the quotient. \\
8133 9. $n \leftarrow x.used - 1, t \leftarrow y.used - 1$ \\
8134 10. $y \leftarrow y \cdot \beta^{n - t}$ \\
8135 11. While ($x \ge y$) do \\
8136 \hspace{3mm}11.1 $q_{n - t} \leftarrow q_{n - t} + 1$ \\
8137 \hspace{3mm}11.2 $x \leftarrow x - y$ \\
8138 12. $y \leftarrow \lfloor y / \beta^{n-t} \rfloor$ \\
8139 \\
8140 Continued on the next page. \\
8141 \hline
8142 \end{tabular}
8143 \end{center}
8144 \end{small}
8145 \caption{Algorithm mp\_div}
8146 \end{figure}
8147
8148 \newpage\begin{figure}[!here]
8149 \begin{small}
8150 \begin{center}
8151 \begin{tabular}{l}
8152 \hline Algorithm \textbf{mp\_div} (continued). \\
8153 \textbf{Input}. mp\_int $a, b$ \\
8154 \textbf{Output}. $c = \lfloor a/b \rfloor$, $d = a - bc$ \\
8155 \hline \\
Now find the remaining digits. \\
8157 13. for $i$ from $n$ down to $(t + 1)$ do \\
8158 \hspace{3mm}13.1 If $i > x.used$ then jump to the next iteration of this loop. \\
8159 \hspace{3mm}13.2 If $x_{i} = y_{t}$ then \\
8160 \hspace{6mm}13.2.1 $q_{i - t - 1} \leftarrow \beta - 1$ \\
8161 \hspace{3mm}13.3 else \\
8162 \hspace{6mm}13.3.1 $\hat r \leftarrow x_{i} \cdot \beta + x_{i - 1}$ \\
8163 \hspace{6mm}13.3.2 $\hat r \leftarrow \lfloor \hat r / y_{t} \rfloor$ \\
8164 \hspace{6mm}13.3.3 $q_{i - t - 1} \leftarrow \hat r$ \\
8165 \hspace{3mm}13.4 $q_{i - t - 1} \leftarrow q_{i - t - 1} + 1$ \\
8166 \\
8167 Fixup quotient estimation. \\
8168 \hspace{3mm}13.5 Loop \\
8169 \hspace{6mm}13.5.1 $q_{i - t - 1} \leftarrow q_{i - t - 1} - 1$ \\
8170 \hspace{6mm}13.5.2 t$1 \leftarrow 0$ \\
8171 \hspace{6mm}13.5.3 t$1_0 \leftarrow y_{t - 1}, $ t$1_1 \leftarrow y_t,$ t$1.used \leftarrow 2$ \\
8172 \hspace{6mm}13.5.4 $t1 \leftarrow t1 \cdot q_{i - t - 1}$ \\
8173 \hspace{6mm}13.5.5 t$2_0 \leftarrow x_{i - 2}, $ t$2_1 \leftarrow x_{i - 1}, $ t$2_2 \leftarrow x_i, $ t$2.used \leftarrow 3$ \\
8174 \hspace{6mm}13.5.6 If $\vert t1 \vert > \vert t2 \vert$ then goto step 13.5. \\
8175 \hspace{3mm}13.6 t$1 \leftarrow y \cdot q_{i - t - 1}$ \\
8176 \hspace{3mm}13.7 t$1 \leftarrow $ t$1 \cdot \beta^{i - t - 1}$ \\
8177 \hspace{3mm}13.8 $x \leftarrow x - $ t$1$ \\
8178 \hspace{3mm}13.9 If $x.sign = MP\_NEG$ then \\
8179 \hspace{6mm}13.10 t$1 \leftarrow y$ \\
8180 \hspace{6mm}13.11 t$1 \leftarrow $ t$1 \cdot \beta^{i - t - 1}$ \\
8181 \hspace{6mm}13.12 $x \leftarrow x + $ t$1$ \\
8182 \hspace{6mm}13.13 $q_{i - t - 1} \leftarrow q_{i - t - 1} - 1$ \\
8183 \\
8184 Finalize the result. \\
8185 14. Clamp excess digits of $q$ \\
8186 15. $c \leftarrow q, c.sign \leftarrow sign$ \\
8187 16. $x.sign \leftarrow a.sign$ \\
8188 17. $d \leftarrow \lfloor x / 2^{norm} \rfloor$ \\
8189 18. Return(\textit{MP\_OKAY}). \\
8190 \hline
8191 \end{tabular}
8192 \end{center}
8193 \end{small}
8194 \caption{Algorithm mp\_div (continued)}
8195 \end{figure}
8196 \textbf{Algorithm mp\_div.}
8197 This algorithm will calculate quotient and remainder from an integer division given a dividend and divisor. The algorithm is a signed
8198 division and will produce a fully qualified quotient and remainder.
8199
First the divisor $b$ must be non-zero which is enforced in step one. If the divisor is larger than the dividend then the quotient is implicitly
8201 zero and the remainder is the dividend.
8202
8203 After the first two trivial cases of inputs are handled the variable $q$ is setup to receive the digits of the quotient. Two unsigned copies of the
8204 divisor $y$ and dividend $x$ are made as well. The core of the division algorithm is an unsigned division and will only work if the values are
8205 positive. Now the two values $x$ and $y$ must be normalized such that the leading digit of $y$ is greater than or equal to $\beta / 2$.
8206 This is performed by shifting both to the left by enough bits to get the desired normalization.
8207
At this point the division algorithm can begin producing digits of the quotient. Recall that the maximum value of the estimation used is
8209 $2\beta - {2 \over \beta}$ which means that a digit of the quotient must be first produced by another means. In this case $y$ is shifted
8210 to the left (\textit{step ten}) so that it has the same number of digits as $x$. The loop on step eleven will subtract multiples of the
8211 shifted copy of $y$ until $x$ is smaller. Since the leading digit of $y$ is greater than or equal to $\beta/2$ this loop will iterate at most two
8212 times to produce the desired leading digit of the quotient.
8213
8214 Now the remainder of the digits can be produced. The equation $\hat q = \lfloor {{x_i \beta + x_{i-1}}\over y_t} \rfloor$ is used to fairly
8215 accurately approximate the true quotient digit. The estimation can in theory produce an estimation as high as $2\beta - {2 \over \beta}$ but by
8216 induction the upper quotient digit is correct (\textit{as established on step eleven}) and the estimate must be less than $\beta$.
8217
8218 Recall from section~\ref{sec:divest} that the estimation is never too low but may be too high. The next step of the estimation process is
8219 to refine the estimation. The loop on step 13.5 uses $x_i\beta^2 + x_{i-1}\beta + x_{i-2}$ and $q_{i - t - 1}(y_t\beta + y_{t-1})$ as a higher
8220 order approximation to adjust the quotient digit.
8221
8222 After both phases of estimation the quotient digit may still be off by a value of one\footnote{This is similar to the error introduced
by optimizing Barrett reduction.}. Steps 13.6 and 13.7 subtract the multiple of the divisor from the dividend (\textit{similar to step 3.3 of
algorithm~\ref{fig:raddiv}}) and then subsequently a multiple of the divisor is added back if the quotient was too large.
8225
Now that the quotient has been determined, finalizing the result is a matter of clamping the quotient, fixing the sizes and de-normalizing the
remainder. An important aspect of this algorithm seemingly overlooked in other descriptions such as that of Algorithm 14.20 of HAC \cite[pp. 598]{HAC}
is that when the estimations are being made (\textit{inside the loop on step 13.5}) the digits $y_{t-1}$, $x_{i-2}$ and $x_{i-1}$ may lie
outside their respective boundaries. For example, if $t = 0$ or $i \le 1$ then the digits would be undefined. In those cases the digits should
be replaced with zero.
8231
8232 \vspace{+3mm}\begin{small}
8233 \hspace{-5.1mm}{\bf File}: bn\_mp\_div.c
8234 \vspace{-3mm}
8235 \begin{alltt}
8236 016
8237 017 /* integer signed division.
8238 018 * c*b + d == a [e.g. a/b, c=quotient, d=remainder]
8239 019 * HAC pp.598 Algorithm 14.20
8240 020 *
8241 021 * Note that the description in HAC is horribly
8242 022 * incomplete. For example, it doesn't consider
8243 023 * the case where digits are removed from 'x' in
8244 024 * the inner loop. It also doesn't consider the
8245 025 * case that y has fewer than three digits, etc..
8246 026 *
8247 027 * The overall algorithm is as described as
8248 028 * 14.20 from HAC but fixed to treat these cases.
8249 029 */
8250 030 int mp_div (mp_int * a, mp_int * b, mp_int * c, mp_int * d)
8251 031 \{
8252 032 mp_int q, x, y, t1, t2;
8253 033 int res, n, t, i, norm, neg;
8254 034
8255 035 /* is divisor zero ? */
8256 036 if (mp_iszero (b) == 1) \{
8257 037 return MP_VAL;
8258 038 \}
8259 039
8260 040 /* if a < b then q=0, r = a */
8261 041 if (mp_cmp_mag (a, b) == MP_LT) \{
8262 042 if (d != NULL) \{
8263 043 res = mp_copy (a, d);
8264 044 \} else \{
8265 045 res = MP_OKAY;
8266 046 \}
8267 047 if (c != NULL) \{
8268 048 mp_zero (c);
8269 049 \}
8270 050 return res;
8271 051 \}
8272 052
8273 053 if ((res = mp_init_size (&q, a->used + 2)) != MP_OKAY) \{
8274 054 return res;
8275 055 \}
8276 056 q.used = a->used + 2;
8277 057
8278 058 if ((res = mp_init (&t1)) != MP_OKAY) \{
8279 059 goto __Q;
8280 060 \}
8281 061
8282 062 if ((res = mp_init (&t2)) != MP_OKAY) \{
8283 063 goto __T1;
8284 064 \}
8285 065
8286 066 if ((res = mp_init_copy (&x, a)) != MP_OKAY) \{
8287 067 goto __T2;
8288 068 \}
8289 069
8290 070 if ((res = mp_init_copy (&y, b)) != MP_OKAY) \{
8291 071 goto __X;
8292 072 \}
8293 073
8294 074 /* fix the sign */
8295 075 neg = (a->sign == b->sign) ? MP_ZPOS : MP_NEG;
8296 076 x.sign = y.sign = MP_ZPOS;
8297 077
8298 078 /* normalize both x and y, ensure that y >= b/2, [b == 2**DIGIT_BIT] */
8299 079 norm = mp_count_bits(&y) % DIGIT_BIT;
8300 080 if (norm < (int)(DIGIT_BIT-1)) \{
8301 081 norm = (DIGIT_BIT-1) - norm;
8302 082 if ((res = mp_mul_2d (&x, norm, &x)) != MP_OKAY) \{
8303 083 goto __Y;
8304 084 \}
8305 085 if ((res = mp_mul_2d (&y, norm, &y)) != MP_OKAY) \{
8306 086 goto __Y;
8307 087 \}
8308 088 \} else \{
8309 089 norm = 0;
8310 090 \}
8311 091
8312 092 /* note hac does 0 based, so if used==5 then its 0,1,2,3,4, e.g. use 4 */
8313 093 n = x.used - 1;
8314 094 t = y.used - 1;
8315 095
8316 096 /* while (x >= y*b**n-t) do \{ q[n-t] += 1; x -= y*b**\{n-t\} \} */
8317 097 if ((res = mp_lshd (&y, n - t)) != MP_OKAY) \{ /* y = y*b**\{n-t\} */
8318 098 goto __Y;
8319 099 \}
8320 100
8321 101 while (mp_cmp (&x, &y) != MP_LT) \{
8322 102 ++(q.dp[n - t]);
8323 103 if ((res = mp_sub (&x, &y, &x)) != MP_OKAY) \{
8324 104 goto __Y;
8325 105 \}
8326 106 \}
8327 107
8328 108 /* reset y by shifting it back down */
8329 109 mp_rshd (&y, n - t);
8330 110
8331 111 /* step 3. for i from n down to (t + 1) */
8332 112 for (i = n; i >= (t + 1); i--) \{
8333 113 if (i > x.used) \{
8334 114 continue;
8335 115 \}
8336 116
8337 117 /* step 3.1 if xi == yt then set q\{i-t-1\} to b-1,
8338 118 * otherwise set q\{i-t-1\} to (xi*b + x\{i-1\})/yt */
8339 119 if (x.dp[i] == y.dp[t]) \{
8340 120 q.dp[i - t - 1] = ((((mp_digit)1) << DIGIT_BIT) - 1);
8341 121 \} else \{
8342 122 mp_word tmp;
8343 123 tmp = ((mp_word) x.dp[i]) << ((mp_word) DIGIT_BIT);
8344 124 tmp |= ((mp_word) x.dp[i - 1]);
8345 125 tmp /= ((mp_word) y.dp[t]);
8346 126 if (tmp > (mp_word) MP_MASK)
8347 127 tmp = MP_MASK;
8348 128 q.dp[i - t - 1] = (mp_digit) (tmp & (mp_word) (MP_MASK));
8349 129 \}
8350 130
8351 131 /* while (q\{i-t-1\} * (yt * b + y\{t-1\})) >
8352 132 xi * b**2 + xi-1 * b + xi-2
8353 133
8354 134 do q\{i-t-1\} -= 1;
8355 135 */
8356 136 q.dp[i - t - 1] = (q.dp[i - t - 1] + 1) & MP_MASK;
8357 137 do \{
8358 138 q.dp[i - t - 1] = (q.dp[i - t - 1] - 1) & MP_MASK;
8359 139
8360 140 /* find left hand */
8361 141 mp_zero (&t1);
8362 142 t1.dp[0] = (t - 1 < 0) ? 0 : y.dp[t - 1];
8363 143 t1.dp[1] = y.dp[t];
8364 144 t1.used = 2;
8365 145 if ((res = mp_mul_d (&t1, q.dp[i - t - 1], &t1)) != MP_OKAY) \{
8366 146 goto __Y;
8367 147 \}
8368 148
8369 149 /* find right hand */
8370 150 t2.dp[0] = (i - 2 < 0) ? 0 : x.dp[i - 2];
8371 151 t2.dp[1] = (i - 1 < 0) ? 0 : x.dp[i - 1];
8372 152 t2.dp[2] = x.dp[i];
8373 153 t2.used = 3;
8374 154 \} while (mp_cmp_mag(&t1, &t2) == MP_GT);
8375 155
8376 156 /* step 3.3 x = x - q\{i-t-1\} * y * b**\{i-t-1\} */
8377 157 if ((res = mp_mul_d (&y, q.dp[i - t - 1], &t1)) != MP_OKAY) \{
8378 158 goto __Y;
8379 159 \}
8380 160
8381 161 if ((res = mp_lshd (&t1, i - t - 1)) != MP_OKAY) \{
8382 162 goto __Y;
8383 163 \}
8384 164
8385 165 if ((res = mp_sub (&x, &t1, &x)) != MP_OKAY) \{
8386 166 goto __Y;
8387 167 \}
8388 168
8389 169 /* if x < 0 then \{ x = x + y*b**\{i-t-1\}; q\{i-t-1\} -= 1; \} */
8390 170 if (x.sign == MP_NEG) \{
8391 171 if ((res = mp_copy (&y, &t1)) != MP_OKAY) \{
8392 172 goto __Y;
8393 173 \}
8394 174 if ((res = mp_lshd (&t1, i - t - 1)) != MP_OKAY) \{
8395 175 goto __Y;
8396 176 \}
8397 177 if ((res = mp_add (&x, &t1, &x)) != MP_OKAY) \{
8398 178 goto __Y;
8399 179 \}
8400 180
8401 181 q.dp[i - t - 1] = (q.dp[i - t - 1] - 1UL) & MP_MASK;
8402 182 \}
8403 183 \}
8404 184
8405 185 /* now q is the quotient and x is the remainder
8406 186 * [which we have to normalize]
8407 187 */
8408 188
8409 189 /* get sign before writing to c */
8410 190 x.sign = a->sign;
8411 191
8412 192 if (c != NULL) \{
8413 193 mp_clamp (&q);
8414 194 mp_exch (&q, c);
8415 195 c->sign = neg;
8416 196 \}
8417 197
8418 198 if (d != NULL) \{
8419 199 mp_div_2d (&x, norm, &x, NULL);
8420 200 mp_exch (&x, d);
8421 201 \}
8422 202
8423 203 res = MP_OKAY;
8424 204
8425 205 __Y:mp_clear (&y);
8426 206 __X:mp_clear (&x);
8427 207 __T2:mp_clear (&t2);
8428 208 __T1:mp_clear (&t1);
8429 209 __Q:mp_clear (&q);
8430 210 return res;
8431 211 \}
8432 \end{alltt}
8433 \end{small}
8434
8435 The implementation of this algorithm differs slightly from the pseudo code presented previously. In this algorithm either of the quotient $c$ or
8436 remainder $d$ may be passed as a \textbf{NULL} pointer which indicates their value is not desired. For example, the C code to call the division
8437 algorithm with only the quotient is
8438
8439 \begin{verbatim}
8440 mp_div(&a, &b, &c, NULL); /* c = [a/b] */
8441 \end{verbatim}
8442
8443 Lines 36 and 42 handle the two trivial cases of inputs which are division by zero and dividend smaller than the divisor
8444 respectively. After the two trivial cases all of the temporary variables are initialized. Line 75 determines the sign of
8445 the quotient and line 76 ensures that both $x$ and $y$ are positive.
8446
The number of bits in the leading digit is calculated on line 80. Implicitly an mp\_int with $r$ digits will require $lg(\beta)(r-1) + k$ bits
8448 of precision which when reduced modulo $lg(\beta)$ produces the value of $k$. In this case $k$ is the number of bits in the leading digit which is
8449 exactly what is required. For the algorithm to operate $k$ must equal $lg(\beta) - 1$ and when it does not the inputs must be normalized by shifting
8450 them to the left by $lg(\beta) - 1 - k$ bits.
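
For instance, assuming the typical \textbf{DIGIT\_BIT} of $28$, if mp\_count\_bits reports that $y$ has $116$ bits then
$k = 116 \mbox{ mod } 28 = 4$ and both inputs are shifted to the left by $27 - 4 = 23$ bits, after which the leading digit of $y$ contains
$27$ bits and is therefore at least $\beta / 2$.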
8451
Throughout, the variables $n$ and $t$ will represent the index of the highest digit of $x$ and $y$ respectively. These are first used to produce the
8453 leading digit of the quotient. The loop beginning on line 112 will produce the remainder of the quotient digits.
8454
8455 The conditional ``continue'' on line 113 is used to prevent the algorithm from reading past the leading edge of $x$ which can occur when the
algorithm eliminates multiple non-zero digits in a single iteration. This ensures that $x_i$ is always non-zero since by definition the digits of $x$
above the $i$'th position must be zero in order for the quotient to be precise\footnote{Precise as far as integer division is concerned.}.
8458
8459 Lines 142, 143 and 150 through 152 manually construct the high accuracy estimations by setting the digits of the two mp\_int
8460 variables directly.
8461
8462 \section{Single Digit Helpers}
8463
8464 This section briefly describes a series of single digit helper algorithms which come in handy when working with small constants. All of
the helper functions assume the single digit input is positive and will treat it as such.
8466
8467 \subsection{Single Digit Addition and Subtraction}
8468
8469 Both addition and subtraction are performed by ``cheating'' and using mp\_set followed by the higher level addition or subtraction
algorithms. As a result these algorithms are substantially simpler with a slight cost in performance.
8471
8472 \newpage\begin{figure}[!here]
8473 \begin{small}
8474 \begin{center}
8475 \begin{tabular}{l}
8476 \hline Algorithm \textbf{mp\_add\_d}. \\
8477 \textbf{Input}. mp\_int $a$ and a mp\_digit $b$ \\
8478 \textbf{Output}. $c = a + b$ \\
8479 \hline \\
8480 1. $t \leftarrow b$ (\textit{mp\_set}) \\
8481 2. $c \leftarrow a + t$ \\
8482 3. Return(\textit{MP\_OKAY}) \\
8483 \hline
8484 \end{tabular}
8485 \end{center}
8486 \end{small}
8487 \caption{Algorithm mp\_add\_d}
8488 \end{figure}
8489
8490 \textbf{Algorithm mp\_add\_d.}
8491 This algorithm initiates a temporary mp\_int with the value of the single digit and uses algorithm mp\_add to add the two values together.
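
A minimal sketch of this approach, built only from the existing mp\_init, mp\_set and mp\_add routines, is shown below. The library
implementation that follows is an optimized variant which avoids the temporary mp\_int entirely.

\begin{verbatim}
/* c = a + b, where b is a single digit (sketch only) */
int mp_add_d_simple(mp_int *a, mp_digit b, mp_int *c)
{
   mp_int t;
   int    res;

   /* t = b */
   if ((res = mp_init(&t)) != MP_OKAY) {
      return res;
   }
   mp_set(&t, b);

   /* c = a + t */
   res = mp_add(a, &t, c);

   mp_clear(&t);
   return res;
}
\end{verbatim}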
8492
8493 \vspace{+3mm}\begin{small}
8494 \hspace{-5.1mm}{\bf File}: bn\_mp\_add\_d.c
8495 \vspace{-3mm}
8496 \begin{alltt}
8497 016
8498 017 /* single digit addition */
8499 018 int
8500 019 mp_add_d (mp_int * a, mp_digit b, mp_int * c)
8501 020 \{
8502 021 int res, ix, oldused;
8503 022 mp_digit *tmpa, *tmpc, mu;
8504 023
8505 024 /* grow c as required */
8506 025 if (c->alloc < a->used + 1) \{
8507 026 if ((res = mp_grow(c, a->used + 1)) != MP_OKAY) \{
8508 027 return res;
8509 028 \}
8510 029 \}
8511 030
8512 031 /* if a is negative and |a| >= b, call c = |a| - b */
8513 032 if (a->sign == MP_NEG && (a->used > 1 || a->dp[0] >= b)) \{
8514 033 /* temporarily fix sign of a */
8515 034 a->sign = MP_ZPOS;
8516 035
8517 036 /* c = |a| - b */
8518 037 res = mp_sub_d(a, b, c);
8519 038
8520 039 /* fix sign */
8521 040 a->sign = c->sign = MP_NEG;
8522 041
8523 042 return res;
8524 043 \}
8525 044
8526 045 /* old number of used digits in c */
8527 046 oldused = c->used;
8528 047
8529 048 /* sign always positive */
8530 049 c->sign = MP_ZPOS;
8531 050
8532 051 /* source alias */
8533 052 tmpa = a->dp;
8534 053
8535 054 /* destination alias */
8536 055 tmpc = c->dp;
8537 056
8538 057 /* if a is positive */
8539 058 if (a->sign == MP_ZPOS) \{
8540 059 /* add digit, after this we're propagating
8541 060 * the carry.
8542 061 */
8543 062 *tmpc = *tmpa++ + b;
8544 063 mu = *tmpc >> DIGIT_BIT;
8545 064 *tmpc++ &= MP_MASK;
8546 065
8547 066 /* now handle rest of the digits */
8548 067 for (ix = 1; ix < a->used; ix++) \{
8549 068 *tmpc = *tmpa++ + mu;
8550 069 mu = *tmpc >> DIGIT_BIT;
8551 070 *tmpc++ &= MP_MASK;
8552 071 \}
8553 072 /* set final carry */
8554 073 ix++;
8555 074 *tmpc++ = mu;
8556 075
8557 076 /* setup size */
8558 077 c->used = a->used + 1;
8559 078 \} else \{
8560 079 /* a was negative and |a| < b */
8561 080 c->used = 1;
8562 081
8563 082 /* the result is a single digit */
8564 083 if (a->used == 1) \{
8565 084 *tmpc++ = b - a->dp[0];
8566 085 \} else \{
8567 086 *tmpc++ = b;
8568 087 \}
8569 088
8570 089 /* setup count so the clearing of oldused
8571 090 * can fall through correctly
8572 091 */
8573 092 ix = 1;
8574 093 \}
8575 094
8576 095 /* now zero to oldused */
8577 096 while (ix++ < oldused) \{
8578 097 *tmpc++ = 0;
8579 098 \}
8580 099 mp_clamp(c);
8581 100
8582 101 return MP_OKAY;
8583 102 \}
8584 103
8585 \end{alltt}
8586 \end{small}
8587
8588 Clever use of the letter 't'.
8589
8590 \subsubsection{Subtraction}
8591 The single digit subtraction algorithm mp\_sub\_d is essentially the same except it uses mp\_sub to subtract the digit from the mp\_int.
8592
8593 \subsection{Single Digit Multiplication}
Single digit multiplication arises enough in division and radix conversion that it ought to be implemented as a special case of the baseline
8595 multiplication algorithm. Essentially this algorithm is a modified version of algorithm s\_mp\_mul\_digs where one of the multiplicands
8596 only has one digit.
8597
8598 \begin{figure}[!here]
8599 \begin{small}
8600 \begin{center}
8601 \begin{tabular}{l}
8602 \hline Algorithm \textbf{mp\_mul\_d}. \\
8603 \textbf{Input}. mp\_int $a$ and a mp\_digit $b$ \\
8604 \textbf{Output}. $c = ab$ \\
8605 \hline \\
8606 1. $pa \leftarrow a.used$ \\
8607 2. Grow $c$ to at least $pa + 1$ digits. \\
8608 3. $oldused \leftarrow c.used$ \\
8609 4. $c.used \leftarrow pa + 1$ \\
8610 5. $c.sign \leftarrow a.sign$ \\
8611 6. $\mu \leftarrow 0$ \\
8612 7. for $ix$ from $0$ to $pa - 1$ do \\
8613 \hspace{3mm}7.1 $\hat r \leftarrow \mu + a_{ix}b$ \\
8614 \hspace{3mm}7.2 $c_{ix} \leftarrow \hat r \mbox{ (mod }\beta\mbox{)}$ \\
8615 \hspace{3mm}7.3 $\mu \leftarrow \lfloor \hat r / \beta \rfloor$ \\
8616 8. $c_{pa} \leftarrow \mu$ \\
8617 9. for $ix$ from $pa + 1$ to $oldused$ do \\
8618 \hspace{3mm}9.1 $c_{ix} \leftarrow 0$ \\
8619 10. Clamp excess digits of $c$. \\
8620 11. Return(\textit{MP\_OKAY}). \\
8621 \hline
8622 \end{tabular}
8623 \end{center}
8624 \end{small}
8625 \caption{Algorithm mp\_mul\_d}
8626 \end{figure}
8627 \textbf{Algorithm mp\_mul\_d.}
This algorithm quickly multiplies an mp\_int by a small single digit value. It is specially tailored to the job and has a minimum of overhead.
Unlike the full multiplication algorithms this algorithm does not require any significant temporary storage or memory allocations.
8630
8631 \vspace{+3mm}\begin{small}
8632 \hspace{-5.1mm}{\bf File}: bn\_mp\_mul\_d.c
8633 \vspace{-3mm}
8634 \begin{alltt}
8635 016
8636 017 /* multiply by a digit */
8637 018 int
8638 019 mp_mul_d (mp_int * a, mp_digit b, mp_int * c)
8639 020 \{
8640 021 mp_digit u, *tmpa, *tmpc;
8641 022 mp_word r;
8642 023 int ix, res, olduse;
8643 024
8644 025 /* make sure c is big enough to hold a*b */
8645 026 if (c->alloc < a->used + 1) \{
8646 027 if ((res = mp_grow (c, a->used + 1)) != MP_OKAY) \{
8647 028 return res;
8648 029 \}
8649 030 \}
8650 031
8651 032 /* get the original destinations used count */
8652 033 olduse = c->used;
8653 034
8654 035 /* set the sign */
8655 036 c->sign = a->sign;
8656 037
8657 038 /* alias for a->dp [source] */
8658 039 tmpa = a->dp;
8659 040
8660 041 /* alias for c->dp [dest] */
8661 042 tmpc = c->dp;
8662 043
8663 044 /* zero carry */
8664 045 u = 0;
8665 046
8666 047 /* compute columns */
8667 048 for (ix = 0; ix < a->used; ix++) \{
8668 049 /* compute product and carry sum for this term */
8669 050 r = ((mp_word) u) + ((mp_word)*tmpa++) * ((mp_word)b);
8670 051
8671 052 /* mask off higher bits to get a single digit */
8672 053 *tmpc++ = (mp_digit) (r & ((mp_word) MP_MASK));
8673 054
8674 055 /* send carry into next iteration */
8675 056 u = (mp_digit) (r >> ((mp_word) DIGIT_BIT));
8676 057 \}
8677 058
8678 059 /* store final carry [if any] */
8679 060 *tmpc++ = u;
8680 061
8681 062 /* now zero digits above the top */
8682 063 while (ix++ < olduse) \{
8683 064 *tmpc++ = 0;
8684 065 \}
8685 066
8686 067 /* set used count */
8687 068 c->used = a->used + 1;
8688 069 mp_clamp(c);
8689 070
8690 071 return MP_OKAY;
8691 072 \}
8692 \end{alltt}
8693 \end{small}
8694
8695 In this implementation the destination $c$ may point to the same mp\_int as the source $a$ since the result is written after the digit is
8696 read from the source. This function uses pointer aliases $tmpa$ and $tmpc$ for the digits of $a$ and $c$ respectively.
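
For example, the radix input routine later in this chapter scales its accumulator in place with a call of the form

\begin{verbatim}
mp_mul_d(&a, (mp_digit) radix, &a);   /* a = a * radix, in place */
\end{verbatim}

which is safe precisely because of this ordering of the reads and writes.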
8697
8698 \subsection{Single Digit Division}
8699 Like the single digit multiplication algorithm, single digit division is also a fairly common algorithm used in radix conversion. Since the
8700 divisor is only a single digit a specialized variant of the division algorithm can be used to compute the quotient.
8701
8702 \newpage\begin{figure}[!here]
8703 \begin{small}
8704 \begin{center}
8705 \begin{tabular}{l}
8706 \hline Algorithm \textbf{mp\_div\_d}. \\
8707 \textbf{Input}. mp\_int $a$ and a mp\_digit $b$ \\
8708 \textbf{Output}. $c = \lfloor a / b \rfloor, d = a - cb$ \\
8709 \hline \\
8710 1. If $b = 0$ then return(\textit{MP\_VAL}).\\
8711 2. If $b = 3$ then use algorithm mp\_div\_3 instead. \\
8712 3. Init $q$ to $a.used$ digits. \\
8713 4. $q.used \leftarrow a.used$ \\
8714 5. $q.sign \leftarrow a.sign$ \\
8715 6. $\hat w \leftarrow 0$ \\
8716 7. for $ix$ from $a.used - 1$ down to $0$ do \\
8717 \hspace{3mm}7.1 $\hat w \leftarrow \hat w \beta + a_{ix}$ \\
8718 \hspace{3mm}7.2 If $\hat w \ge b$ then \\
8719 \hspace{6mm}7.2.1 $t \leftarrow \lfloor \hat w / b \rfloor$ \\
8720 \hspace{6mm}7.2.2 $\hat w \leftarrow \hat w \mbox{ (mod }b\mbox{)}$ \\
8721 \hspace{3mm}7.3 else\\
8722 \hspace{6mm}7.3.1 $t \leftarrow 0$ \\
8723 \hspace{3mm}7.4 $q_{ix} \leftarrow t$ \\
8724 8. $d \leftarrow \hat w$ \\
8725 9. Clamp excess digits of $q$. \\
8726 10. $c \leftarrow q$ \\
8727 11. Return(\textit{MP\_OKAY}). \\
8728 \hline
8729 \end{tabular}
8730 \end{center}
8731 \end{small}
8732 \caption{Algorithm mp\_div\_d}
8733 \end{figure}
8734 \textbf{Algorithm mp\_div\_d.}
8735 This algorithm divides the mp\_int $a$ by the single mp\_digit $b$ using an optimized approach. Essentially in every iteration of the
algorithm another digit of the dividend is reduced and another digit of the quotient produced. Provided $b < \beta$ the value of $\hat w$
8737 after step 7.1 will be limited such that $0 \le \lfloor \hat w / b \rfloor < \beta$.
8738
8739 If the divisor $b$ is equal to three a variant of this algorithm is used which is called mp\_div\_3. It replaces the division by three with
8740 a multiplication by $\lfloor \beta / 3 \rfloor$ and the appropriate shift and residual fixup. In essence it is much like the Barrett reduction
8741 from chapter seven.
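
As a rough sketch of why only a small fixup is required, let $b' = \lfloor \beta / 3 \rfloor$ and let $\hat w$ denote the two-digit word formed
in each iteration which, as in algorithm mp\_div\_d, satisfies $0 \le \hat w < 3\beta$ since the previous residue is less than three. Then

\begin{equation}
\lfloor \hat w / 3 \rfloor - 2 \le \lfloor \hat w b' / \beta \rfloor \le \lfloor \hat w / 3 \rfloor
\end{equation}

which means the multiplication and shift never overestimate the quotient digit and underestimate it by at most two, an error the residual
fixup easily corrects.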
8742
8743 \vspace{+3mm}\begin{small}
8744 \hspace{-5.1mm}{\bf File}: bn\_mp\_div\_d.c
8745 \vspace{-3mm}
8746 \begin{alltt}
8747 016
8748 017 static int s_is_power_of_two(mp_digit b, int *p)
8749 018 \{
8750 019 int x;
8751 020
8752 021 for (x = 1; x < DIGIT_BIT; x++) \{
8753 022 if (b == (((mp_digit)1)<<x)) \{
8754 023 *p = x;
8755 024 return 1;
8756 025 \}
8757 026 \}
8758 027 return 0;
8759 028 \}
8760 029
8761 030 /* single digit division (based on routine from MPI) */
8762 031 int mp_div_d (mp_int * a, mp_digit b, mp_int * c, mp_digit * d)
8763 032 \{
8764 033 mp_int q;
8765 034 mp_word w;
8766 035 mp_digit t;
8767 036 int res, ix;
8768 037
8769 038 /* cannot divide by zero */
8770 039 if (b == 0) \{
8771 040 return MP_VAL;
8772 041 \}
8773 042
8774 043 /* quick outs */
8775 044 if (b == 1 || mp_iszero(a) == 1) \{
8776 045 if (d != NULL) \{
8777 046 *d = 0;
8778 047 \}
8779 048 if (c != NULL) \{
8780 049 return mp_copy(a, c);
8781 050 \}
8782 051 return MP_OKAY;
8783 052 \}
8784 053
8785 054 /* power of two ? */
8786 055 if (s_is_power_of_two(b, &ix) == 1) \{
8787 056 if (d != NULL) \{
8788 057 *d = a->dp[0] & ((1<<ix) - 1);
8789 058 \}
8790 059 if (c != NULL) \{
8791 060 return mp_div_2d(a, ix, c, NULL);
8792 061 \}
8793 062 return MP_OKAY;
8794 063 \}
8795 064
8796 065 /* three? */
8797 066 if (b == 3) \{
8798 067 return mp_div_3(a, c, d);
8799 068 \}
8800 069
8801 070 /* no easy answer [c'est la vie]. Just division */
8802 071 if ((res = mp_init_size(&q, a->used)) != MP_OKAY) \{
8803 072 return res;
8804 073 \}
8805 074
8806 075 q.used = a->used;
8807 076 q.sign = a->sign;
8808 077 w = 0;
8809 078 for (ix = a->used - 1; ix >= 0; ix--) \{
8810 079 w = (w << ((mp_word)DIGIT_BIT)) | ((mp_word)a->dp[ix]);
8811 080
8812 081 if (w >= b) \{
8813 082 t = (mp_digit)(w / b);
8814 083 w -= ((mp_word)t) * ((mp_word)b);
8815 084 \} else \{
8816 085 t = 0;
8817 086 \}
8818 087 q.dp[ix] = (mp_digit)t;
8819 088 \}
8820 089
8821 090 if (d != NULL) \{
8822 091 *d = (mp_digit)w;
8823 092 \}
8824 093
8825 094 if (c != NULL) \{
8826 095 mp_clamp(&q);
8827 096 mp_exch(&q, c);
8828 097 \}
8829 098 mp_clear(&q);
8830 099
8831 100 return res;
8832 101 \}
8833 102
8834 \end{alltt}
8835 \end{small}
8836
8837 Like the implementation of algorithm mp\_div this algorithm allows either of the quotient or remainder to be passed as a \textbf{NULL} pointer to
8838 indicate the respective value is not required. This allows a trivial single digit modular reduction algorithm, mp\_mod\_d to be created.
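
Such a reduction amounts to little more than the following wrapper which simply discards the quotient (a sketch of the idea behind the
library's mp\_mod\_d).

\begin{verbatim}
/* d = a mod b for a single digit b (sketch) */
int mp_mod_d_sketch(mp_int *a, mp_digit b, mp_digit *d)
{
   return mp_div_d(a, b, NULL, d);
}
\end{verbatim}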
8839
The division and remainder on lines 82 and 83 can often be replaced by a single division on most processors. For example, the 32-bit x86 based
processors can divide a 64-bit quantity by a 32-bit quantity and produce the quotient and remainder simultaneously. Unfortunately the GCC
compiler does not recognize that optimization and will actually produce two function calls to find the quotient and remainder respectively.
8843
8844 \subsection{Single Digit Root Extraction}
8845
8846 Finding the $n$'th root of an integer is fairly easy as far as numerical analysis is concerned. Algorithms such as the Newton-Raphson approximation
8847 (\ref{eqn:newton}) series will converge very quickly to a root for any continuous function $f(x)$.
8848
8849 \begin{equation}
8850 x_{i+1} = x_i - {f(x_i) \over f'(x_i)}
8851 \label{eqn:newton}
8852 \end{equation}
8853
8854 In this case the $n$'th root is desired and $f(x) = x^n - a$ where $a$ is the integer of which the root is desired. The derivative of $f(x)$ is
simply $f'(x) = nx^{n - 1}$. Of particular importance is that this algorithm will be used over the integers and not over a more continuous domain
such as the real numbers. As a result the root found can be above the true root by a few and must be manually adjusted. Ideally at the end of the
8857 algorithm the $n$'th root $b$ of an integer $a$ is desired such that $b^n \le a$.
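
Substituting $f(x) = x^n - a$ and $f'(x) = nx^{n - 1}$ into (\ref{eqn:newton}) gives the iteration that is computed inside the loop of the
following algorithm.

\begin{equation}
x_{i+1} = x_i - {{x_i^n - a} \over {n x_i^{n - 1}}}
\end{equation}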
8858
8859 \newpage\begin{figure}[!here]
8860 \begin{small}
8861 \begin{center}
8862 \begin{tabular}{l}
8863 \hline Algorithm \textbf{mp\_n\_root}. \\
8864 \textbf{Input}. mp\_int $a$ and a mp\_digit $b$ \\
8865 \textbf{Output}. $c^b \le a$ \\
8866 \hline \\
8867 1. If $b$ is even and $a.sign = MP\_NEG$ return(\textit{MP\_VAL}). \\
8868 2. $sign \leftarrow a.sign$ \\
8869 3. $a.sign \leftarrow MP\_ZPOS$ \\
8870 4. t$2 \leftarrow 2$ \\
8871 5. Loop \\
8872 \hspace{3mm}5.1 t$1 \leftarrow $ t$2$ \\
8873 \hspace{3mm}5.2 t$3 \leftarrow $ t$1^{b - 1}$ \\
8874 \hspace{3mm}5.3 t$2 \leftarrow $ t$3 $ $\cdot$ t$1$ \\
8875 \hspace{3mm}5.4 t$2 \leftarrow $ t$2 - a$ \\
8876 \hspace{3mm}5.5 t$3 \leftarrow $ t$3 \cdot b$ \\
8877 \hspace{3mm}5.6 t$3 \leftarrow \lfloor $t$2 / $t$3 \rfloor$ \\
8878 \hspace{3mm}5.7 t$2 \leftarrow $ t$1 - $ t$3$ \\
8879 \hspace{3mm}5.8 If t$1 \ne $ t$2$ then goto step 5. \\
8880 6. Loop \\
8881 \hspace{3mm}6.1 t$2 \leftarrow $ t$1^b$ \\
8882 \hspace{3mm}6.2 If t$2 > a$ then \\
8883 \hspace{6mm}6.2.1 t$1 \leftarrow $ t$1 - 1$ \\
8884 \hspace{6mm}6.2.2 Goto step 6. \\
8885 7. $a.sign \leftarrow sign$ \\
8886 8. $c \leftarrow $ t$1$ \\
8887 9. $c.sign \leftarrow sign$ \\
8888 10. Return(\textit{MP\_OKAY}). \\
8889 \hline
8890 \end{tabular}
8891 \end{center}
8892 \end{small}
8893 \caption{Algorithm mp\_n\_root}
8894 \end{figure}
8895 \textbf{Algorithm mp\_n\_root.}
8896 This algorithm finds the integer $n$'th root of an input using the Newton-Raphson approach. It is partially optimized based on the observation
8897 that the numerator of ${f(x) \over f'(x)}$ can be derived from a partial denominator. That is at first the denominator is calculated by finding
8898 $x^{b - 1}$. This value can then be multiplied by $x$ and have $a$ subtracted from it to find the numerator. This saves a total of $b - 1$
8899 multiplications by t$1$ inside the loop.
8900
8901 The initial value of the approximation is t$2 = 2$ which allows the algorithm to start with very small values and quickly converge on the
8902 root. Ideally this algorithm is meant to find the $n$'th root of an input where $n$ is bounded by $2 \le n \le 5$.
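
For example, assuming $a$ and $c$ are initialized mp\_ints, the integer cube root of $a$ can be computed with a call of the form

\begin{verbatim}
mp_n_root(&a, 3, &c);   /* largest c such that c^3 <= a */
\end{verbatim}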
8903
8904 \vspace{+3mm}\begin{small}
8905 \hspace{-5.1mm}{\bf File}: bn\_mp\_n\_root.c
8906 \vspace{-3mm}
8907 \begin{alltt}
8908 016
8909 017 /* find the n'th root of an integer
8910 018 *
8911 019 * Result found such that (c)**b <= a and (c+1)**b > a
8912 020 *
8913 021 * This algorithm uses Newton's approximation
8914 022 * x[i+1] = x[i] - f(x[i])/f'(x[i])
8915 023 * which will find the root in log(N) time where
8916 024 * each step involves a fair bit. This is not meant to
8917 025 * find huge roots [square and cube, etc].
8918 026 */
8919 027 int mp_n_root (mp_int * a, mp_digit b, mp_int * c)
8920 028 \{
8921 029 mp_int t1, t2, t3;
8922 030 int res, neg;
8923 031
8924 032 /* input must be positive if b is even */
8925 033 if ((b & 1) == 0 && a->sign == MP_NEG) \{
8926 034 return MP_VAL;
8927 035 \}
8928 036
8929 037 if ((res = mp_init (&t1)) != MP_OKAY) \{
8930 038 return res;
8931 039 \}
8932 040
8933 041 if ((res = mp_init (&t2)) != MP_OKAY) \{
8934 042 goto __T1;
8935 043 \}
8936 044
8937 045 if ((res = mp_init (&t3)) != MP_OKAY) \{
8938 046 goto __T2;
8939 047 \}
8940 048
8941 049 /* if a is negative fudge the sign but keep track */
8942 050 neg = a->sign;
8943 051 a->sign = MP_ZPOS;
8944 052
8945 053 /* t2 = 2 */
8946 054 mp_set (&t2, 2);
8947 055
8948 056 do \{
8949 057 /* t1 = t2 */
8950 058 if ((res = mp_copy (&t2, &t1)) != MP_OKAY) \{
8951 059 goto __T3;
8952 060 \}
8953 061
8954 062 /* t2 = t1 - ((t1**b - a) / (b * t1**(b-1))) */
8955 063
8956 064 /* t3 = t1**(b-1) */
8957 065 if ((res = mp_expt_d (&t1, b - 1, &t3)) != MP_OKAY) \{
8958 066 goto __T3;
8959 067 \}
8960 068
8961 069 /* numerator */
8962 070 /* t2 = t1**b */
8963 071 if ((res = mp_mul (&t3, &t1, &t2)) != MP_OKAY) \{
8964 072 goto __T3;
8965 073 \}
8966 074
8967 075 /* t2 = t1**b - a */
8968 076 if ((res = mp_sub (&t2, a, &t2)) != MP_OKAY) \{
8969 077 goto __T3;
8970 078 \}
8971 079
8972 080 /* denominator */
8973 081 /* t3 = t1**(b-1) * b */
8974 082 if ((res = mp_mul_d (&t3, b, &t3)) != MP_OKAY) \{
8975 083 goto __T3;
8976 084 \}
8977 085
8978 086 /* t3 = (t1**b - a)/(b * t1**(b-1)) */
8979 087 if ((res = mp_div (&t2, &t3, &t3, NULL)) != MP_OKAY) \{
8980 088 goto __T3;
8981 089 \}
8982 090
8983 091 if ((res = mp_sub (&t1, &t3, &t2)) != MP_OKAY) \{
8984 092 goto __T3;
8985 093 \}
8986 094 \} while (mp_cmp (&t1, &t2) != MP_EQ);
8987 095
8988 096 /* result can be off by a few so check */
8989 097 for (;;) \{
8990 098 if ((res = mp_expt_d (&t1, b, &t2)) != MP_OKAY) \{
8991 099 goto __T3;
8992 100 \}
8993 101
8994 102 if (mp_cmp (&t2, a) == MP_GT) \{
8995 103 if ((res = mp_sub_d (&t1, 1, &t1)) != MP_OKAY) \{
8996 104 goto __T3;
8997 105 \}
8998 106 \} else \{
8999 107 break;
9000 108 \}
9001 109 \}
9002 110
9003 111 /* reset the sign of a first */
9004 112 a->sign = neg;
9005 113
9006 114 /* set the result */
9007 115 mp_exch (&t1, c);
9008 116
9009 117 /* set the sign of the result */
9010 118 c->sign = neg;
9011 119
9012 120 res = MP_OKAY;
9013 121
9014 122 __T3:mp_clear (&t3);
9015 123 __T2:mp_clear (&t2);
9016 124 __T1:mp_clear (&t1);
9017 125 return res;
9018 126 \}
9019 \end{alltt}
9020 \end{small}
9021
9022 \section{Random Number Generation}
9023
9024 Random numbers come up in a variety of activities from public key cryptography to simple simulations and various randomized algorithms. Pollard-Rho
9025 factoring for example, can make use of random values as starting points to find factors of a composite integer. In this case the algorithm presented
9026 is solely for simulations and not intended for cryptographic use.
9027
9028 \newpage\begin{figure}[!here]
9029 \begin{small}
9030 \begin{center}
9031 \begin{tabular}{l}
9032 \hline Algorithm \textbf{mp\_rand}. \\
9033 \textbf{Input}. An integer $b$ \\
9034 \textbf{Output}. A pseudo-random number of $b$ digits \\
9035 \hline \\
9036 1. $a \leftarrow 0$ \\
9037 2. If $b \le 0$ return(\textit{MP\_OKAY}) \\
9038 3. Pick a non-zero random digit $d$. \\
9039 4. $a \leftarrow a + d$ \\
5. for $ix$ from 1 to $b - 1$ do \\
9041 \hspace{3mm}5.1 $a \leftarrow a \cdot \beta$ \\
9042 \hspace{3mm}5.2 Pick a random digit $d$. \\
9043 \hspace{3mm}5.3 $a \leftarrow a + d$ \\
9044 6. Return(\textit{MP\_OKAY}). \\
9045 \hline
9046 \end{tabular}
9047 \end{center}
9048 \end{small}
9049 \caption{Algorithm mp\_rand}
9050 \end{figure}
9051 \textbf{Algorithm mp\_rand.}
9052 This algorithm produces a pseudo-random integer of $b$ digits. By ensuring that the first digit is non-zero the algorithm also guarantees that the
final result has at least $b$ digits. It relies heavily on a third-party random number generator which should ideally generate uniformly all of
9054 the integers from $0$ to $\beta - 1$.
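
For example, assuming $a$ is an initialized mp\_int, a five digit pseudo-random value can be produced as follows. Seeding the underlying C
library generator with srand is one possible choice and, as noted above, the output is only suitable for simulations.

\begin{verbatim}
srand((unsigned int) time(NULL));   /* seed the C library generator */
mp_rand(&a, 5);                     /* a = pseudo-random, five digits */
\end{verbatim}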
9055
9056 \vspace{+3mm}\begin{small}
9057 \hspace{-5.1mm}{\bf File}: bn\_mp\_rand.c
9058 \vspace{-3mm}
9059 \begin{alltt}
9060 016
9061 017 /* makes a pseudo-random int of a given size */
9062 018 int
9063 019 mp_rand (mp_int * a, int digits)
9064 020 \{
9065 021 int res;
9066 022 mp_digit d;
9067 023
9068 024 mp_zero (a);
9069 025 if (digits <= 0) \{
9070 026 return MP_OKAY;
9071 027 \}
9072 028
9073 029 /* first place a random non-zero digit */
9074 030 do \{
9075 031 d = ((mp_digit) abs (rand ()));
9076 032 \} while (d == 0);
9077 033
9078 034 if ((res = mp_add_d (a, d, a)) != MP_OKAY) \{
9079 035 return res;
9080 036 \}
9081 037
9082 038 while (digits-- > 0) \{
9083 039 if ((res = mp_lshd (a, 1)) != MP_OKAY) \{
9084 040 return res;
9085 041 \}
9086 042
9087 043 if ((res = mp_add_d (a, ((mp_digit) abs (rand ())), a)) != MP_OKAY) \{
9088 044 return res;
9089 045 \}
9090 046 \}
9091 047
9092 048 return MP_OKAY;
9093 049 \}
9094 \end{alltt}
9095 \end{small}
9096
9097 \section{Formatted Representations}
9098 The ability to emit a radix-$n$ textual representation of an integer is useful for interacting with human parties. For example, the ability to
9099 be given a string of characters such as ``114585'' and turn it into the radix-$\beta$ equivalent would make it easier to enter numbers
9100 into a program.
9101
9102 \subsection{Reading Radix-n Input}
For the purposes of this text we will assume that a simple lower ASCII map (\ref{fig:ASC}) is used to map the values from $0$ to $63$ to
printable characters. For example, when the character ``N'' is read it represents the integer $23$. The first $16$ characters of the
map are for the common representations up to hexadecimal. After that they match the ``base64'' encoding scheme and are suitably chosen
such that they are printable. While outputting as base64 may not be too helpful for human operators it does allow communication via non-binary
mediums.
9108
9109 \newpage\begin{figure}[here]
9110 \begin{center}
9111 \begin{tabular}{cc|cc|cc|cc}
9112 \hline \textbf{Value} & \textbf{Char} & \textbf{Value} & \textbf{Char} & \textbf{Value} & \textbf{Char} & \textbf{Value} & \textbf{Char} \\
9113 \hline
9114 0 & 0 & 1 & 1 & 2 & 2 & 3 & 3 \\
9115 4 & 4 & 5 & 5 & 6 & 6 & 7 & 7 \\
9116 8 & 8 & 9 & 9 & 10 & A & 11 & B \\
9117 12 & C & 13 & D & 14 & E & 15 & F \\
9118 16 & G & 17 & H & 18 & I & 19 & J \\
9119 20 & K & 21 & L & 22 & M & 23 & N \\
9120 24 & O & 25 & P & 26 & Q & 27 & R \\
9121 28 & S & 29 & T & 30 & U & 31 & V \\
9122 32 & W & 33 & X & 34 & Y & 35 & Z \\
9123 36 & a & 37 & b & 38 & c & 39 & d \\
9124 40 & e & 41 & f & 42 & g & 43 & h \\
9125 44 & i & 45 & j & 46 & k & 47 & l \\
9126 48 & m & 49 & n & 50 & o & 51 & p \\
9127 52 & q & 53 & r & 54 & s & 55 & t \\
9128 56 & u & 57 & v & 58 & w & 59 & x \\
9129 60 & y & 61 & z & 62 & $+$ & 63 & $/$ \\
9130 \hline
9131 \end{tabular}
9132 \end{center}
9133 \caption{Lower ASCII Map}
9134 \label{fig:ASC}
9135 \end{figure}
9136
9137 \newpage\begin{figure}[!here]
9138 \begin{small}
9139 \begin{center}
9140 \begin{tabular}{l}
9141 \hline Algorithm \textbf{mp\_read\_radix}. \\
9142 \textbf{Input}. A string $str$ of length $sn$ and radix $r$. \\
9143 \textbf{Output}. The radix-$\beta$ equivalent mp\_int. \\
9144 \hline \\
9145 1. If $r < 2$ or $r > 64$ return(\textit{MP\_VAL}). \\
9146 2. $ix \leftarrow 0$ \\
9147 3. If $str_0 =$ ``-'' then do \\
9148 \hspace{3mm}3.1 $ix \leftarrow ix + 1$ \\
9149 \hspace{3mm}3.2 $sign \leftarrow MP\_NEG$ \\
9150 4. else \\
9151 \hspace{3mm}4.1 $sign \leftarrow MP\_ZPOS$ \\
9152 5. $a \leftarrow 0$ \\
9153 6. for $iy$ from $ix$ to $sn - 1$ do \\
9154 \hspace{3mm}6.1 Let $y$ denote the position in the map of $str_{iy}$. \\
9155 \hspace{3mm}6.2 If $str_{iy}$ is not in the map or $y \ge r$ then goto step 7. \\
9156 \hspace{3mm}6.3 $a \leftarrow a \cdot r$ \\
9157 \hspace{3mm}6.4 $a \leftarrow a + y$ \\
9158 7. If $a \ne 0$ then $a.sign \leftarrow sign$ \\
9159 8. Return(\textit{MP\_OKAY}). \\
9160 \hline
9161 \end{tabular}
9162 \end{center}
9163 \end{small}
9164 \caption{Algorithm mp\_read\_radix}
9165 \end{figure}
9166 \textbf{Algorithm mp\_read\_radix.}
9167 This algorithm will read an ASCII string and produce the radix-$\beta$ mp\_int representation of the same integer. A minus symbol ``-'' may precede the
string to indicate the value is negative, otherwise it is assumed to be positive. The algorithm will read up to $sn$ characters from the input
and will stop when it encounters a character it cannot map. This allows numbers to be embedded
as part of larger input without any significant problem.
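
For example, assuming $a$ is an initialized mp\_int, a negative hexadecimal constant can be read with

\begin{verbatim}
mp_read_radix(&a, "-1A7F", 16);   /* a = -0x1A7F */
\end{verbatim}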
9171
9172 \vspace{+3mm}\begin{small}
9173 \hspace{-5.1mm}{\bf File}: bn\_mp\_read\_radix.c
9174 \vspace{-3mm}
9175 \begin{alltt}
9176 016
9177 017 /* read a string [ASCII] in a given radix */
9178 018 int mp_read_radix (mp_int * a, char *str, int radix)
9179 019 \{
9180 020 int y, res, neg;
9181 021 char ch;
9182 022
9183 023 /* make sure the radix is ok */
9184 024 if (radix < 2 || radix > 64) \{
9185 025 return MP_VAL;
9186 026 \}
9187 027
9188 028 /* if the leading digit is a
9189 029 * minus set the sign to negative.
9190 030 */
9191 031 if (*str == '-') \{
9192 032 ++str;
9193 033 neg = MP_NEG;
9194 034 \} else \{
9195 035 neg = MP_ZPOS;
9196 036 \}
9197 037
9198 038 /* set the integer to the default of zero */
9199 039 mp_zero (a);
9200 040
9201 041 /* process each digit of the string */
9202 042 while (*str) \{
9203 043 /* if the radix < 36 the conversion is case insensitive
9204 044 * this allows numbers like 1AB and 1ab to represent the same value
9205 045 * [e.g. in hex]
9206 046 */
9207 047 ch = (char) ((radix < 36) ? toupper (*str) : *str);
9208 048 for (y = 0; y < 64; y++) \{
9209 049 if (ch == mp_s_rmap[y]) \{
9210 050 break;
9211 051 \}
9212 052 \}
9213 053
9214 054 /* if the char was found in the map
9215 055 * and is less than the given radix add it
9216 056 * to the number, otherwise exit the loop.
9217 057 */
9218 058 if (y < radix) \{
9219 059 if ((res = mp_mul_d (a, (mp_digit) radix, a)) != MP_OKAY) \{
9220 060 return res;
9221 061 \}
9222 062 if ((res = mp_add_d (a, (mp_digit) y, a)) != MP_OKAY) \{
9223 063 return res;
9224 064 \}
9225 065 \} else \{
9226 066 break;
9227 067 \}
9228 068 ++str;
9229 069 \}
9230 070
9231 071 /* set the sign only if a != 0 */
9232 072 if (mp_iszero(a) != 1) \{
9233 073 a->sign = neg;
9234 074 \}
9235 075 return MP_OKAY;
9236 076 \}
9237 \end{alltt}
9238 \end{small}
9239
9240 \subsection{Generating Radix-$n$ Output}
9241 Generating radix-$n$ output is fairly trivial with a division and remainder algorithm.
9242
9243 \newpage\begin{figure}[!here]
9244 \begin{small}
9245 \begin{center}
9246 \begin{tabular}{l}
9247 \hline Algorithm \textbf{mp\_toradix}. \\
9248 \textbf{Input}. A mp\_int $a$ and an integer $r$\\
9249 \textbf{Output}. The radix-$r$ representation of $a$ \\
9250 \hline \\
9251 1. If $r < 2$ or $r > 64$ return(\textit{MP\_VAL}). \\
9252 2. If $a = 0$ then $str = $ ``$0$'' and return(\textit{MP\_OKAY}). \\
9253 3. $t \leftarrow a$ \\
9254 4. $str \leftarrow$ ``'' \\
9255 5. if $t.sign = MP\_NEG$ then \\
9256 \hspace{3mm}5.1 $str \leftarrow str + $ ``-'' \\
9257 \hspace{3mm}5.2 $t.sign = MP\_ZPOS$ \\
9258 6. While ($t \ne 0$) do \\
9259 \hspace{3mm}6.1 $d \leftarrow t \mbox{ (mod }r\mbox{)}$ \\
9260 \hspace{3mm}6.2 $t \leftarrow \lfloor t / r \rfloor$ \\
9261 \hspace{3mm}6.3 Look up $d$ in the map and store the equivalent character in $y$. \\
9262 \hspace{3mm}6.4 $str \leftarrow str + y$ \\
9263 7. If $str_0 = $``$-$'' then \\
9264 \hspace{3mm}7.1 Reverse the digits $str_1, str_2, \ldots str_n$. \\
9265 8. Otherwise \\
9266 \hspace{3mm}8.1 Reverse the digits $str_0, str_1, \ldots str_n$. \\
9267 9. Return(\textit{MP\_OKAY}).\\
9268 \hline
9269 \end{tabular}
9270 \end{center}
9271 \end{small}
9272 \caption{Algorithm mp\_toradix}
9273 \end{figure}
9274 \textbf{Algorithm mp\_toradix.}
This algorithm computes the radix-$r$ representation of an mp\_int $a$. The ``digits'' of the representation are extracted by reducing the
successive quotients $\lfloor a / r^k \rfloor$ modulo $r$ until $r^k > a$. Note that instead of actually dividing by $r^k$ in
each iteration the quotient $\lfloor a / r \rfloor$ is saved for the next iteration. As a result a series of trivial $n \times 1$ divisions
9278 are required instead of a series of $n \times k$ divisions. One design flaw of this approach is that the digits are produced in the reverse order
9279 (see~\ref{fig:mpradix}). To remedy this flaw the digits must be swapped or simply ``reversed''.
9280
9281 \begin{figure}
9282 \begin{center}
9283 \begin{tabular}{|c|c|c|}
9284 \hline \textbf{Value of $a$} & \textbf{Value of $d$} & \textbf{Value of $str$} \\
9285 \hline $1234$ & -- & -- \\
9286 \hline $123$ & $4$ & ``4'' \\
9287 \hline $12$ & $3$ & ``43'' \\
9288 \hline $1$ & $2$ & ``432'' \\
9289 \hline $0$ & $1$ & ``4321'' \\
9290 \hline
9291 \end{tabular}
9292 \end{center}
9293 \caption{Example of Algorithm mp\_toradix.}
9294 \label{fig:mpradix}
9295 \end{figure}
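
For example, assuming $a$ is an initialized mp\_int and that the buffer is large enough to hold the digits, the sign and the terminating NUL
character, a decimal representation can be produced with

\begin{verbatim}
char buf[1024];            /* assumed large enough for the output */
mp_toradix(&a, buf, 10);   /* decimal representation of a */
printf("a == %s\n", buf);
\end{verbatim}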
9296
9297 \vspace{+3mm}\begin{small}
9298 \hspace{-5.1mm}{\bf File}: bn\_mp\_toradix.c
9299 \vspace{-3mm}
9300 \begin{alltt}
9301 016
9302 017 /* stores a bignum as a ASCII string in a given radix (2..64) */
9303 018 int mp_toradix (mp_int * a, char *str, int radix)
9304 019 \{
9305 020 int res, digs;
9306 021 mp_int t;
9307 022 mp_digit d;
9308 023 char *_s = str;
9309 024
9310 025 /* check range of the radix */
9311 026 if (radix < 2 || radix > 64) \{
9312 027 return MP_VAL;
9313 028 \}
9314 029
9315 030 /* quick out if its zero */
9316 031 if (mp_iszero(a) == 1) \{
9317 032 *str++ = '0';
9318 033 *str = '\symbol{92}0';
9319 034 return MP_OKAY;
9320 035 \}
9321 036
9322 037 if ((res = mp_init_copy (&t, a)) != MP_OKAY) \{
9323 038 return res;
9324 039 \}
9325 040
9326 041 /* if it is negative output a - */
9327 042 if (t.sign == MP_NEG) \{
9328 043 ++_s;
9329 044 *str++ = '-';
9330 045 t.sign = MP_ZPOS;
9331 046 \}
9332 047
9333 048 digs = 0;
9334 049 while (mp_iszero (&t) == 0) \{
9335 050 if ((res = mp_div_d (&t, (mp_digit) radix, &t, &d)) != MP_OKAY) \{
9336 051 mp_clear (&t);
9337 052 return res;
9338 053 \}
9339 054 *str++ = mp_s_rmap[d];
9340 055 ++digs;
9341 056 \}
9342 057
9343 058 /* reverse the digits of the string. In this case _s points
9344 059 * to the first digit [exluding the sign] of the number]
9345 060 */
9346 061 bn_reverse ((unsigned char *)_s, digs);
9347 062
9348 063 /* append a NULL so the string is properly terminated */
9349 064 *str = '\symbol{92}0';
9350 065
9351 066 mp_clear (&t);
9352 067 return MP_OKAY;
9353 068 \}
9354 069
9355 \end{alltt}
9356 \end{small}
9357
9358 \chapter{Number Theoretic Algorithms}
9359 This chapter discusses several fundamental number theoretic algorithms such as the greatest common divisor, least common multiple and Jacobi
9360 symbol computation. These algorithms arise as essential components in several key cryptographic algorithms such as the RSA public key algorithm and
9361 various Sieve based factoring algorithms.
9362
9363 \section{Greatest Common Divisor}
The greatest common divisor of two integers $a$ and $b$, often denoted as $(a, b)$, is the largest integer $k$ that divides
both $a$ and $b$. That is, $k$ is the largest integer such that $0 \equiv a \mbox{ (mod }k\mbox{)}$ and $0 \equiv b \mbox{ (mod }k\mbox{)}$ occur
9366 simultaneously.
9367
9368 The most common approach (cite) is to reduce one input modulo another. That is if $a$ and $b$ are divisible by some integer $k$ and if $qa + r = b$ then
9369 $r$ is also divisible by $k$. The reduction pattern follows $\left < a , b \right > \rightarrow \left < b, a \mbox{ mod } b \right >$.
9370
9371 \newpage\begin{figure}[!here]
9372 \begin{small}
9373 \begin{center}
9374 \begin{tabular}{l}
9375 \hline Algorithm \textbf{Greatest Common Divisor (I)}. \\
9376 \textbf{Input}. Two positive integers $a$ and $b$ greater than zero. \\
9377 \textbf{Output}. The greatest common divisor $(a, b)$. \\
9378 \hline \\
9379 1. While ($b > 0$) do \\
9380 \hspace{3mm}1.1 $r \leftarrow a \mbox{ (mod }b\mbox{)}$ \\
9381 \hspace{3mm}1.2 $a \leftarrow b$ \\
9382 \hspace{3mm}1.3 $b \leftarrow r$ \\
9383 2. Return($a$). \\
9384 \hline
9385 \end{tabular}
9386 \end{center}
9387 \end{small}
9388 \caption{Algorithm Greatest Common Divisor (I)}
9389 \label{fig:gcd1}
9390 \end{figure}
9391
This algorithm will quickly converge on the greatest common divisor since the residue $r$ tends to diminish rapidly. However, divisions are
9393 relatively expensive operations to perform and should ideally be avoided. There is another approach based on a similar relationship of
9394 greatest common divisors. The faster approach is based on the observation that if $k$ divides both $a$ and $b$ it will also divide $a - b$.
9395 In particular, we would like $a - b$ to decrease in magnitude which implies that $b \ge a$.
9396
9397 \begin{figure}[!here]
9398 \begin{small}
9399 \begin{center}
9400 \begin{tabular}{l}
9401 \hline Algorithm \textbf{Greatest Common Divisor (II)}. \\
9402 \textbf{Input}. Two positive integers $a$ and $b$ greater than zero. \\
9403 \textbf{Output}. The greatest common divisor $(a, b)$. \\
9404 \hline \\
9405 1. While ($b > 0$) do \\
9406 \hspace{3mm}1.1 Swap $a$ and $b$ such that $a$ is the smallest of the two. \\
9407 \hspace{3mm}1.2 $b \leftarrow b - a$ \\
9408 2. Return($a$). \\
9409 \hline
9410 \end{tabular}
9411 \end{center}
9412 \end{small}
9413 \caption{Algorithm Greatest Common Divisor (II)}
9414 \label{fig:gcd2}
9415 \end{figure}
9416
9417 \textbf{Proof} \textit{Algorithm~\ref{fig:gcd2} will return the greatest common divisor of $a$ and $b$.}
The algorithm in figure~\ref{fig:gcd2} will eventually terminate since $b \ge a$ ensures the subtraction in step 1.2 produces a value less than $b$. In other
words in every iteration the tuple $\left < a, b \right >$ decreases in magnitude until eventually $a = b$. Since both $a$ and $b$ are always
9420 divisible by the greatest common divisor (\textit{until the last iteration}) and in the last iteration of the algorithm $b = 0$, therefore, in the
9421 second to last iteration of the algorithm $b = a$ and clearly $(a, a) = a$ which concludes the proof. \textbf{QED}.
9422
As a matter of practicality algorithm \ref{fig:gcd2} decreases far too slowly to be useful. Especially if $b$ is much larger than $a$ such that
$b - a$ is still very much larger than $a$. A simple addition to the algorithm is to divide $b - a$ by a power of some integer $p$ which does
9425 not divide the greatest common divisor but will divide $b - a$. In this case ${b - a} \over p$ is also an integer and still divisible by
9426 the greatest common divisor.
9427
9428 However, instead of factoring $b - a$ to find a suitable value of $p$ the powers of $p$ can be removed from $a$ and $b$ that are in common first.
9429 Then inside the loop whenever $b - a$ is divisible by some power of $p$ it can be safely removed.
9430
9431 \begin{figure}[!here]
9432 \begin{small}
9433 \begin{center}
9434 \begin{tabular}{l}
9435 \hline Algorithm \textbf{Greatest Common Divisor (III)}. \\
9436 \textbf{Input}. Two positive integers $a$ and $b$ greater than zero. \\
9437 \textbf{Output}. The greatest common divisor $(a, b)$. \\
9438 \hline \\
9439 1. $k \leftarrow 0$ \\
9440 2. While $a$ and $b$ are both divisible by $p$ do \\
9441 \hspace{3mm}2.1 $a \leftarrow \lfloor a / p \rfloor$ \\
9442 \hspace{3mm}2.2 $b \leftarrow \lfloor b / p \rfloor$ \\
9443 \hspace{3mm}2.3 $k \leftarrow k + 1$ \\
9444 3. While $a$ is divisible by $p$ do \\
9445 \hspace{3mm}3.1 $a \leftarrow \lfloor a / p \rfloor$ \\
9446 4. While $b$ is divisible by $p$ do \\
9447 \hspace{3mm}4.1 $b \leftarrow \lfloor b / p \rfloor$ \\
9448 5. While ($b > 0$) do \\
9449 \hspace{3mm}5.1 Swap $a$ and $b$ such that $a$ is the smallest of the two. \\
9450 \hspace{3mm}5.2 $b \leftarrow b - a$ \\
9451 \hspace{3mm}5.3 While $b$ is divisible by $p$ do \\
9452 \hspace{6mm}5.3.1 $b \leftarrow \lfloor b / p \rfloor$ \\
9453 6. Return($a \cdot p^k$). \\
9454 \hline
9455 \end{tabular}
9456 \end{center}
9457 \end{small}
9458 \caption{Algorithm Greatest Common Divisor (III)}
9459 \label{fig:gcd3}
9460 \end{figure}
9461
This algorithm is based on the previous algorithm except it removes powers of $p$ first and inside the main loop to ensure the tuple $\left < a, b \right >$
decreases more rapidly. The first loop on step two removes powers of $p$ that are in common. A count, $k$, is kept which will represent the common
divisor $p^k$. After step two the remaining common divisor of $a$ and $b$ cannot be divisible by $p$. This means that $p$ can be safely
9465 divided out of the difference $b - a$ so long as the division leaves no remainder.
9466
In particular the value of $p$ should be chosen such that the division on step 5.3.1 occurs often. It also helps if division by $p$ is easy
to compute. The ideal choice of $p$ is two since division by two amounts to a right logical shift. Another important observation is that by
step five both $a$ and $b$ are odd. Therefore, the difference $b - a$ must be even which means that each iteration removes one bit from the
largest of the pair.
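
As a compact illustration of the $p = 2$ case, the following sketch applies algorithm~\ref{fig:gcd3} to single machine words. The multiple
precision version, mp\_gcd, is developed in the subsection that follows.

\begin{verbatim}
/* binary greatest common divisor on machine words (sketch) */
unsigned long gcd_binary(unsigned long a, unsigned long b)
{
   unsigned long t;
   int k = 0;

   if (a == 0) return b;
   if (b == 0) return a;

   /* remove the common factors of two and count them in k */
   while (((a | b) & 1) == 0) {
      a >>= 1; b >>= 1; ++k;
   }

   /* remove any remaining factors of two from each input */
   while ((a & 1) == 0) a >>= 1;
   while ((b & 1) == 0) b >>= 1;

   while (b != 0) {
      /* keep a <= b */
      if (a > b) { t = a; a = b; b = t; }

      /* the difference of two odd values is even (or zero) */
      b -= a;
      while (b != 0 && (b & 1) == 0) b >>= 1;
   }

   /* restore the common factors of two */
   return a << k;
}
\end{verbatim}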
9471
9472 \subsection{Complete Greatest Common Divisor}
9473 The algorithms presented so far cannot handle inputs which are zero or negative. The following algorithm can handle all input cases properly
9474 and will produce the greatest common divisor.
9475
9476 \newpage\begin{figure}[!here]
9477 \begin{small}
9478 \begin{center}
9479 \begin{tabular}{l}
9480 \hline Algorithm \textbf{mp\_gcd}. \\
9481 \textbf{Input}. mp\_int $a$ and $b$ \\
9482 \textbf{Output}. The greatest common divisor $c = (a, b)$. \\
9483 \hline \\
9484 1. If $a = 0$ and $b \ne 0$ then \\
9485 \hspace{3mm}1.1 $c \leftarrow b$ \\
9486 \hspace{3mm}1.2 Return(\textit{MP\_OKAY}). \\
9487 2. If $a \ne 0$ and $b = 0$ then \\
9488 \hspace{3mm}2.1 $c \leftarrow a$ \\
9489 \hspace{3mm}2.2 Return(\textit{MP\_OKAY}). \\
9490 3. If $a = b = 0$ then \\
9491 \hspace{3mm}3.1 $c \leftarrow 1$ \\
9492 \hspace{3mm}3.2 Return(\textit{MP\_OKAY}). \\
9493 4. $u \leftarrow \vert a \vert, v \leftarrow \vert b \vert$ \\
9494 5. $k \leftarrow 0$ \\
9495 6. While $u.used > 0$ and $v.used > 0$ and $u_0 \equiv v_0 \equiv 0 \mbox{ (mod }2\mbox{)}$ \\
9496 \hspace{3mm}6.1 $k \leftarrow k + 1$ \\
9497 \hspace{3mm}6.2 $u \leftarrow \lfloor u / 2 \rfloor$ \\
9498 \hspace{3mm}6.3 $v \leftarrow \lfloor v / 2 \rfloor$ \\
9499 7. While $u.used > 0$ and $u_0 \equiv 0 \mbox{ (mod }2\mbox{)}$ \\
9500 \hspace{3mm}7.1 $u \leftarrow \lfloor u / 2 \rfloor$ \\
9501 8. While $v.used > 0$ and $v_0 \equiv 0 \mbox{ (mod }2\mbox{)}$ \\
9502 \hspace{3mm}8.1 $v \leftarrow \lfloor v / 2 \rfloor$ \\
9503 9. While $v.used > 0$ \\
9504 \hspace{3mm}9.1 If $\vert u \vert > \vert v \vert$ then \\
9505 \hspace{6mm}9.1.1 Swap $u$ and $v$. \\
9506 \hspace{3mm}9.2 $v \leftarrow \vert v \vert - \vert u \vert$ \\
9507 \hspace{3mm}9.3 While $v.used > 0$ and $v_0 \equiv 0 \mbox{ (mod }2\mbox{)}$ \\
9508 \hspace{6mm}9.3.1 $v \leftarrow \lfloor v / 2 \rfloor$ \\
9509 10. $c \leftarrow u \cdot 2^k$ \\
9510 11. Return(\textit{MP\_OKAY}). \\
9511 \hline
9512 \end{tabular}
9513 \end{center}
9514 \end{small}
9515 \caption{Algorithm mp\_gcd}
9516 \end{figure}
9517 \textbf{Algorithm mp\_gcd.}
9518 This algorithm will produce the greatest common divisor of two mp\_ints $a$ and $b$. The algorithm was originally based on Algorithm B of
9519 Knuth \cite[pp. 338]{TAOCPV2} but has been modified to be simpler to explain. In theory it achieves the same asymptotic working time as
9520 Algorithm B and in practice this appears to be true.
9521
9522 The first three steps handle the cases where either one of or both inputs are zero. If either input is zero the greatest common divisor is the
largest input or zero if they are both zero. If the inputs are not trivial then $u$ and $v$ are assigned the absolute values of
9524 $a$ and $b$ respectively and the algorithm will proceed to reduce the pair.
9525
Step six will divide out any common factors of two and keep track of the count in the variable $k$. After this step two is no longer a
factor of the remaining greatest common divisor between $u$ and $v$ and can be safely divided out of either whenever it is even. Steps
seven and eight ensure that $u$ and $v$ respectively have no more factors of two. At most only one of the while loops will iterate since
9529 they cannot both be even.
9530
9531 By step nine both $u$ and $v$ are odd which is required for the inner logic. First the pair are swapped such that $v$ is greater than
9532 or equal to $u$. This ensures that the subtraction in step 9.2 will always produce a positive and even result. Step 9.3 removes any
9533 factors of two from the difference $v$ to ensure that in the next iteration of the loop both values are once again odd.
9534
9535 Once $v = 0$ the variable $u$ holds the greatest common divisor of the pair $\left < u, v \right >$ as it stood just after step six. The result
9536 must be adjusted by multiplying back the common factors of two ($2^k$) removed earlier.
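
As a small worked example (not drawn from the sources), consider $a = 120$ and $b = 84$. Step six removes the common factor $2^2$ leaving
$k = 2$, $u = 30$ and $v = 21$, and step seven reduces $u$ to $15$. The main loop then proceeds as follows.

\begin{center}
\begin{tabular}{ccl}
$u$ & $v$ & Step nine \\
$15$ & $21$ & $v \leftarrow (21 - 15)/2 = 3$ \\
$15$ & $3$  & swap, $v \leftarrow (15 - 3)/4 = 3$ \\
$3$  & $3$  & $v \leftarrow 3 - 3 = 0$ \\
\end{tabular}
\end{center}

The loop leaves $u = 3$ and the result is $c = 3 \cdot 2^{2} = 12$ which is indeed $(120, 84)$.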
9537
9538 \vspace{+3mm}\begin{small}
9539 \hspace{-5.1mm}{\bf File}: bn\_mp\_gcd.c
9540 \vspace{-3mm}
9541 \begin{alltt}
9542 016
9543 017 /* Greatest Common Divisor using the binary method */
9544 018 int mp_gcd (mp_int * a, mp_int * b, mp_int * c)
9545 019 \{
9546 020 mp_int u, v;
9547 021 int k, u_lsb, v_lsb, res;
9548 022
9549 023   /* either zero then gcd is the largest */
9550 024 if (mp_iszero (a) == 1 && mp_iszero (b) == 0) \{
9551 025 return mp_abs (b, c);
9552 026 \}
9553 027 if (mp_iszero (a) == 0 && mp_iszero (b) == 1) \{
9554 028 return mp_abs (a, c);
9555 029 \}
9556 030
9557 031 /* optimized. At this point if a == 0 then
9558 032 * b must equal zero too
9559 033 */
9560 034 if (mp_iszero (a) == 1) \{
9561 035 mp_zero(c);
9562 036 return MP_OKAY;
9563 037 \}
9564 038
9565 039 /* get copies of a and b we can modify */
9566 040 if ((res = mp_init_copy (&u, a)) != MP_OKAY) \{
9567 041 return res;
9568 042 \}
9569 043
9570 044 if ((res = mp_init_copy (&v, b)) != MP_OKAY) \{
9571 045 goto __U;
9572 046 \}
9573 047
9574 048 /* must be positive for the remainder of the algorithm */
9575 049 u.sign = v.sign = MP_ZPOS;
9576 050
9577 051 /* B1. Find the common power of two for u and v */
9578 052 u_lsb = mp_cnt_lsb(&u);
9579 053 v_lsb = mp_cnt_lsb(&v);
9580 054 k = MIN(u_lsb, v_lsb);
9581 055
9582 056 if (k > 0) \{
9583 057 /* divide the power of two out */
9584 058 if ((res = mp_div_2d(&u, k, &u, NULL)) != MP_OKAY) \{
9585 059 goto __V;
9586 060 \}
9587 061
9588 062 if ((res = mp_div_2d(&v, k, &v, NULL)) != MP_OKAY) \{
9589 063 goto __V;
9590 064 \}
9591 065 \}
9592 066
9593 067 /* divide any remaining factors of two out */
9594 068 if (u_lsb != k) \{
9595 069 if ((res = mp_div_2d(&u, u_lsb - k, &u, NULL)) != MP_OKAY) \{
9596 070 goto __V;
9597 071 \}
9598 072 \}
9599 073
9600 074 if (v_lsb != k) \{
9601 075 if ((res = mp_div_2d(&v, v_lsb - k, &v, NULL)) != MP_OKAY) \{
9602 076 goto __V;
9603 077 \}
9604 078 \}
9605 079
9606 080 while (mp_iszero(&v) == 0) \{
9607 081 /* make sure v is the largest */
9608 082 if (mp_cmp_mag(&u, &v) == MP_GT) \{
9609 083 /* swap u and v to make sure v is >= u */
9610 084 mp_exch(&u, &v);
9611 085 \}
9612 086
9613 087 /* subtract smallest from largest */
9614 088 if ((res = s_mp_sub(&v, &u, &v)) != MP_OKAY) \{
9615 089 goto __V;
9616 090 \}
9617 091
9618 092 /* Divide out all factors of two */
9619 093 if ((res = mp_div_2d(&v, mp_cnt_lsb(&v), &v, NULL)) != MP_OKAY) \{
9620 094 goto __V;
9621 095 \}
9622 096 \}
9623 097
9624 098 /* multiply by 2**k which we divided out at the beginning */
9625 099 if ((res = mp_mul_2d (&u, k, c)) != MP_OKAY) \{
9626 100 goto __V;
9627 101 \}
9628 102 c->sign = MP_ZPOS;
9629 103 res = MP_OKAY;
9630 104 __V:mp_clear (&u);
9631 105 __U:mp_clear (&v);
9632 106 return res;
9633 107 \}
9634 \end{alltt}
9635 \end{small}
9636
9637 This function, and several that follow, make use of the macros mp\_iszero and mp\_iseven. The former evaluates to $1$ if the input mp\_int is equivalent to the
9638 integer zero, otherwise it evaluates to $0$. The latter evaluates to $1$ if the input mp\_int represents a non-zero even integer, otherwise
9639 it evaluates to $0$. Note that mp\_iseven evaluating to $0$ does not mean the input is odd; it could also be zero. The three
9640 trivial cases of inputs are handled on lines 24 through 37. After those lines the inputs are assumed to be non-zero.
9641
9642 Lines 40 and 44 make local copies $u$ and $v$ of the inputs $a$ and $b$ respectively. At this point the common factors of two
9643 must be divided out of the two values. Lines 52 through 54 count the trailing factors of two in each value with mp\_cnt\_lsb and store the
9644 smaller count in the local integer $k$; lines 56 through 65 then divide $2^k$ out of both values. It is assumed that the count will not exceed the maximum
9645 value of a C ``int'' data type\footnote{Strictly speaking no array in C may have more entries than are accessible by an ``int'' so this is not
9646 a limitation.}.
9647
9648 At this point there are no more common factors of two in the two values. The divisions on lines 68 through 78 remove any remaining independent
9649 factors of two such that both $u$ and $v$ are guaranteed to be odd before hitting the main body of the algorithm. The while loop
9650 on line 80 performs the reduction of the pair until $v$ is equal to zero. The unsigned comparison and subtraction algorithms are used in
9651 place of the full signed routines since both values are guaranteed to be positive and the result of the subtraction is guaranteed to be non-negative.
9652
9653 \section{Least Common Multiple}
9654 The least common multiple of a pair of integers is their product divided by their greatest common divisor. For two integers $a$ and $b$ the
9655 least common multiple is normally denoted as $[ a, b ]$ and is numerically equivalent to ${ab} \over {(a, b)}$. For example, if $a = 2 \cdot 2 \cdot 3 = 12$
9656 and $b = 2 \cdot 3 \cdot 3 \cdot 7 = 126$ the least common multiple is ${{12 \cdot 126} \over {(12, 126)}} = {1512 \over 6} = 252$.
9657
9658 The least common multiple arises often in coding theory as well as number theory. If two functions have periods of $a$ and $b$ respectively they will
9659 collide, that is be in synchronous states, after only $[ a, b ]$ iterations. This is why, for example, random number generators based on
9660 Linear Feedback Shift Registers (LFSR) tend to use registers with periods which are co-prime (\textit{i.e. the greatest common divisor is one}). 
9661 Similarly in number theory if a composite $n$ has two prime factors $p$ and $q$ then the maximal order of any unit of $\Z/n\Z$ will be $[ p - 1, q - 1 ]$.
9662
9663 \begin{figure}[!here]
9664 \begin{small}
9665 \begin{center}
9666 \begin{tabular}{l}
9667 \hline Algorithm \textbf{mp\_lcm}. \\
9668 \textbf{Input}. mp\_int $a$ and $b$ \\
9669 \textbf{Output}. The least common multiple $c = [a, b]$. \\
9670 \hline \\
9671 1. $c \leftarrow (a, b)$ \\
9672 2. $t \leftarrow a \cdot b$ \\
9673 3. $c \leftarrow \lfloor t / c \rfloor$ \\
9674 4. Return(\textit{MP\_OKAY}). \\
9675 \hline
9676 \end{tabular}
9677 \end{center}
9678 \end{small}
9679 \caption{Algorithm mp\_lcm}
9680 \end{figure}
9681 \textbf{Algorithm mp\_lcm.}
9682 This algorithm computes the least common multiple of two mp\_int inputs $a$ and $b$. It computes the least common multiple directly by
9683 dividing the product of the two inputs by their greatest common divisor. The source code avoids forming the full product by first dividing the smaller of the two inputs by the greatest common divisor and then multiplying the quotient by the other input.
9684
9685 \vspace{+3mm}\begin{small}
9686 \hspace{-5.1mm}{\bf File}: bn\_mp\_lcm.c
9687 \vspace{-3mm}
9688 \begin{alltt}
9689 016
9690 017 /* computes least common multiple as |a*b|/(a, b) */
9691 018 int mp_lcm (mp_int * a, mp_int * b, mp_int * c)
9692 019 \{
9693 020 int res;
9694 021 mp_int t1, t2;
9695 022
9696 023
9697 024 if ((res = mp_init_multi (&t1, &t2, NULL)) != MP_OKAY) \{
9698 025 return res;
9699 026 \}
9700 027
9701 028 /* t1 = get the GCD of the two inputs */
9702 029 if ((res = mp_gcd (a, b, &t1)) != MP_OKAY) \{
9703 030 goto __T;
9704 031 \}
9705 032
9706 033 /* divide the smallest by the GCD */
9707 034 if (mp_cmp_mag(a, b) == MP_LT) \{
9708 035 /* store quotient in t2 such that t2 * b is the LCM */
9709 036 if ((res = mp_div(a, &t1, &t2, NULL)) != MP_OKAY) \{
9710 037 goto __T;
9711 038 \}
9712 039 res = mp_mul(b, &t2, c);
9713 040 \} else \{
9714 041 /* store quotient in t2 such that t2 * a is the LCM */
9715 042 if ((res = mp_div(b, &t1, &t2, NULL)) != MP_OKAY) \{
9716 043 goto __T;
9717 044 \}
9718 045 res = mp_mul(a, &t2, c);
9719 046 \}
9720 047
9721 048 /* fix the sign to positive */
9722 049 c->sign = MP_ZPOS;
9723 050
9724 051 __T:
9725 052 mp_clear_multi (&t1, &t2, NULL);
9726 053 return res;
9727 054 \}
9728 \end{alltt}
9729 \end{small}
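
The following fragment is a usage sketch only; it is not part of the LibTomMath sources and the function name gcd\_lcm\_demo is invented for
illustration. It computes both quantities for the small example used earlier in this section.

\begin{small}
\begin{alltt}
#include <tommath.h>

/* usage sketch (not from the library): gcd and lcm of 12 and 126 */
int gcd_lcm_demo(void)
\{
   mp_int a, b, g, l;
   int res;

   if ((res = mp_init_multi(&a, &b, &g, &l, NULL)) != MP_OKAY) \{
      return res;
   \}

   mp_set(&a, 12);
   mp_set(&b, 126);

   if ((res = mp_gcd(&a, &b, &g)) != MP_OKAY) \{ goto done; \}   /* g = 6   */
   if ((res = mp_lcm(&a, &b, &l)) != MP_OKAY) \{ goto done; \}   /* l = 252 */

done:
   mp_clear_multi(&a, &b, &g, &l, NULL);
   return res;
\}
\end{alltt}
\end{small}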
9730
9731 \section{Jacobi Symbol Computation}
9732 To explain the Jacobi Symbol we shall first discuss the Legendre function\footnote{It is most commonly known as the Legendre symbol.} from which the Jacobi symbol is
9733 defined. The Legendre function computes whether or not an integer $a$ is a quadratic residue modulo an odd prime $p$. Numerically it is
9734 equivalent to equation \ref{eqn:legendre}.
9735
9736 \begin{equation}
9737 a^{(p-1)/2} \equiv \begin{array}{rl}
9738 -1 & \mbox{if }a\mbox{ is a quadratic non-residue.} \\
9739 0 & \mbox{if }p\mbox{ divides }a\mbox{.} \\
9740 1 & \mbox{if }a\mbox{ is a quadratic residue}.
9741 \end{array} \mbox{ (mod }p\mbox{)}
9742 \label{eqn:legendre}
9743 \end{equation}
9744
9745 \textbf{Proof.} \textit{Equation \ref{eqn:legendre} correctly identifies the residue status of an integer $a$ modulo a prime $p$.}
9746 An integer $a$ is a quadratic residue if the following equation has a solution.
9747
9748 \begin{equation}
9749 x^2 \equiv a \mbox{ (mod }p\mbox{)}
9750 \label{eqn:root}
9751 \end{equation}
9752
9753 Consider the following equation.
9754
9755 \begin{equation}
9756 0 \equiv x^{p-1} - 1 \equiv \left \lbrace \left (x^2 \right )^{(p-1)/2} - a^{(p-1)/2} \right \rbrace + \left ( a^{(p-1)/2} - 1 \right ) \mbox{ (mod }p\mbox{)}
9757 \label{eqn:rooti}
9758 \end{equation}
9759
9760 Whether or not equation \ref{eqn:root} has a solution, equation \ref{eqn:rooti} is always true. If $a^{(p-1)/2} - 1 \equiv 0 \mbox{ (mod }p\mbox{)}$
9761 then the quantity in the braces must be zero. By reduction,
9762
9763 \begin{eqnarray}
9764 \left (x^2 \right )^{(p-1)/2} - a^{(p-1)/2} \equiv 0 \nonumber \\
9765 \left (x^2 \right )^{(p-1)/2} \equiv a^{(p-1)/2} \nonumber \\
9766 x^2 \equiv a \mbox{ (mod }p\mbox{)}
9767 \end{eqnarray}
9768
9769 As a result there must be a solution to the quadratic equation and in turn $a$ must be a quadratic residue. If $p$ does not divide $a$ and $a$
9770 is not a quadratic residue then the only other value $a^{(p-1)/2}$ may be congruent to is $-1$ since
9771 \begin{equation}
9772 0 \equiv a^{p - 1} - 1 \equiv (a^{(p-1)/2} + 1)(a^{(p-1)/2} - 1) \mbox{ (mod }p\mbox{)}
9773 \end{equation}
9774 One of the terms on the right hand side must be zero. \textbf{QED}
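
For example, let $p = 7$ and $a = 2$. Then $2^{(7-1)/2} = 2^3 = 8 \equiv 1 \mbox{ (mod }7\mbox{)}$ and indeed $3^2 \equiv 2 \mbox{ (mod }7\mbox{)}$,
so $2$ is a quadratic residue modulo $7$. For $a = 3$ the value $3^3 = 27 \equiv -1 \mbox{ (mod }7\mbox{)}$ and $3$ is a quadratic non-residue;
the only quadratic residues modulo $7$ are $1$, $2$ and $4$.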
9775
9776 \subsection{Jacobi Symbol}
9777 The Jacobi symbol is a generalization of the Legendre function to any odd modulus $p$ greater than two, prime or not. If $p = \prod_{i=0}^n p_i$ is a factorization of $p$ into primes then
9778 the Jacobi symbol $\left ( { a \over p } \right )$ is equal to the following equation.
9779
9780 \begin{equation}
9781 \left ( { a \over p } \right ) = \left ( { a \over p_0} \right ) \left ( { a \over p_1} \right ) \ldots \left ( { a \over p_n} \right )
9782 \end{equation}
9783
9784 By inspection if $p$ is prime the Jacobi symbol is equivalent to the Legendre function. The following facts\footnote{See HAC \cite[pp. 72-74]{HAC} for
9785 further details.} will be used to derive an efficient Jacobi symbol algorithm. Where $p$ is an odd integer greater than two and $a, b \in \Z$ the
9786 following are true.
9787
9788 \begin{enumerate}
9789 \item $\left ( { a \over p} \right )$ equals $-1$, $0$ or $1$.
9790 \item $\left ( { ab \over p} \right ) = \left ( { a \over p} \right )\left ( { b \over p} \right )$.
9791 \item If $a \equiv b \mbox{ (mod }p\mbox{)}$ then $\left ( { a \over p} \right ) = \left ( { b \over p} \right )$.
9792 \item $\left ( { 2 \over p} \right )$ equals $1$ if $p \equiv 1$ or $7 \mbox{ (mod }8\mbox{)}$. Otherwise, it equals $-1$.
9793 \item For odd $a$, $\left ( { a \over p} \right ) = \left ( { p \over a} \right ) \cdot (-1)^{(p-1)(a-1)/4}$. More specifically
9794 $\left ( { a \over p} \right ) = \left ( { p \over a} \right )$ if $p \equiv a \equiv 1 \mbox{ (mod }4\mbox{)}$.
9795 \end{enumerate}
9796
9797 Using these facts, if $a = 2^k \cdot a'$ with $a'$ odd, then
9798
9799 \begin{eqnarray}
9800 \left ( { a \over p } \right ) = \left ( {{2^k} \over p } \right ) \left ( {a' \over p} \right ) \nonumber \\
9801 = \left ( {2 \over p } \right )^k \left ( {a' \over p} \right )
9802 \label{eqn:jacobi}
9803 \end{eqnarray}
9804
9805 By fact five,
9806
9807 \begin{equation}
9808 \left ( { a \over p } \right ) = \left ( { p \over a } \right ) \cdot (-1)^{(p-1)(a-1)/4}
9809 \end{equation}
9810
9811 Subsequently by fact three since $p \equiv (p \mbox{ mod }a) \mbox{ (mod }a\mbox{)}$ then
9812
9813 \begin{equation}
9814 \left ( { a \over p } \right ) = \left ( { {p \mbox{ mod } a} \over a } \right ) \cdot (-1)^{(p-1)(a-1)/4}
9815 \end{equation}
9816
9817 By putting both observations into equation \ref{eqn:jacobi} the following simplified equation is formed.
9818
9819 \begin{equation}
9820 \left ( { a \over p } \right ) = \left ( {2 \over p } \right )^k \left ( {{p\mbox{ mod }a'} \over a'} \right ) \cdot (-1)^{(p-1)(a'-1)/4}
9821 \end{equation}
9822
9823 The value of $\left ( {{p \mbox{ mod }a'} \over a'} \right )$ can be found by using the same equation recursively. The value of
9824 $\left ( {2 \over p } \right )^k$ equals $1$ if $k$ is even otherwise it equals $\left ( {2 \over p } \right )$. Using this approach the
9825 factors of $p$ do not have to be known. Furthermore, if $(a, p) = 1$ then the algorithm will terminate when the recursion requests the
9826 Jacobi symbol computation of $\left ( {1 \over a'} \right )$ which is simply $1$.
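
As a short worked example of the recursion, consider $\left ( { 6 \over 11 } \right )$. Writing $6 = 2 \cdot 3$ gives $k = 1$ and $a' = 3$.
Since $11 \equiv 3 \mbox{ (mod }8\mbox{)}$ the term $\left ( { 2 \over 11 } \right )$ equals $-1$. Next, $11 \equiv 3 \mbox{ (mod }4\mbox{)}$ and
$3 \equiv 3 \mbox{ (mod }4\mbox{)}$ so $\left ( { 3 \over 11 } \right ) = -\left ( { 11 \mbox{ mod } 3 \over 3 } \right ) = -\left ( { 2 \over 3 } \right ) = 1$
since $3 \equiv 3 \mbox{ (mod }8\mbox{)}$. The product is $\left ( { 6 \over 11 } \right ) = (-1) \cdot 1 = -1$, and indeed $6$ is a quadratic
non-residue modulo $11$.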
9827
9828 \newpage\begin{figure}[!here]
9829 \begin{small}
9830 \begin{center}
9831 \begin{tabular}{l}
9832 \hline Algorithm \textbf{mp\_jacobi}. \\
9833 \textbf{Input}. mp\_int $a$ and $p$, $a \ge 0$, $p \ge 3$, $p \equiv 1 \mbox{ (mod }2\mbox{)}$ \\
9834 \textbf{Output}. The Jacobi symbol $c = \left ( {a \over p } \right )$. \\
9835 \hline \\
9836 1. If $a = 0$ then \\
9837 \hspace{3mm}1.1 $c \leftarrow 0$ \\
9838 \hspace{3mm}1.2 Return(\textit{MP\_OKAY}). \\
9839 2. If $a = 1$ then \\
9840 \hspace{3mm}2.1 $c \leftarrow 1$ \\
9841 \hspace{3mm}2.2 Return(\textit{MP\_OKAY}). \\
9842 3. $a' \leftarrow a$ \\
9843 4. $k \leftarrow 0$ \\
9844 5. While $a'.used > 0$ and $a'_0 \equiv 0 \mbox{ (mod }2\mbox{)}$ \\
9845 \hspace{3mm}5.1 $k \leftarrow k + 1$ \\
9846 \hspace{3mm}5.2 $a' \leftarrow \lfloor a' / 2 \rfloor$ \\
9847 6. If $k \equiv 0 \mbox{ (mod }2\mbox{)}$ then \\
9848 \hspace{3mm}6.1 $s \leftarrow 1$ \\
9849 7. else \\
9850 \hspace{3mm}7.1 $r \leftarrow p_0 \mbox{ (mod }8\mbox{)}$ \\
9851 \hspace{3mm}7.2 If $r = 1$ or $r = 7$ then \\
9852 \hspace{6mm}7.2.1 $s \leftarrow 1$ \\
9853 \hspace{3mm}7.3 else \\
9854 \hspace{6mm}7.3.1 $s \leftarrow -1$ \\
9855 8. If $p_0 \equiv a'_0 \equiv 3 \mbox{ (mod }4\mbox{)}$ then \\
9856 \hspace{3mm}8.1 $s \leftarrow -s$ \\
9857 9. If $a' \ne 1$ then \\
9858 \hspace{3mm}9.1 $p' \leftarrow p \mbox{ (mod }a'\mbox{)}$ \\
9859 \hspace{3mm}9.2 $s \leftarrow s \cdot \mbox{mp\_jacobi}(p', a')$ \\
9860 10. $c \leftarrow s$ \\
9861 11. Return(\textit{MP\_OKAY}). \\
9862 \hline
9863 \end{tabular}
9864 \end{center}
9865 \end{small}
9866 \caption{Algorithm mp\_jacobi}
9867 \end{figure}
9868 \textbf{Algorithm mp\_jacobi.}
9869 This algorithm computes the Jacobi symbol for an arbitrary positive integer $a$ with respect to an odd integer $p$ of at least three. The algorithm
9870 is based on algorithm 2.149 of HAC \cite[pp. 73]{HAC}.
9871
9872 Step numbers one and two handle the trivial cases of $a = 0$ and $a = 1$ respectively. Step five determines the number of factors of two in the
9873 input $a$. If $k$ is even then the term $\left ( { 2 \over p } \right )^k$ must always evaluate to one. If $k$ is odd then the term evaluates to one
9874 if $p_0$ is congruent to one or seven modulo eight, otherwise it evaluates to $-1$. After the $\left ( { 2 \over p } \right )^k$ term is handled
9875 the $(-1)^{(p-1)(a'-1)/4}$ term is computed and multiplied against the current product $s$. The latter term evaluates to negative one only if both $p$ and $a'$
9876 are congruent to three modulo four, otherwise it evaluates to one.
9877
9878 By step nine if $a'$ does not equal one a recursion is required. Step 9.1 computes $p' \equiv p \mbox{ (mod }a'\mbox{)}$ and will recurse to compute
9879 $\left ( {p' \over a'} \right )$ which is multiplied against the current Jacobi product.
9880
9881 \vspace{+3mm}\begin{small}
9882 \hspace{-5.1mm}{\bf File}: bn\_mp\_jacobi.c
9883 \vspace{-3mm}
9884 \begin{alltt}
9885 016
9886 017 /* computes the jacobi c = (a | n) (or Legendre if n is prime)
9887 018 * HAC pp. 73 Algorithm 2.149
9888 019 */
9889 020 int mp_jacobi (mp_int * a, mp_int * p, int *c)
9890 021 \{
9891 022 mp_int a1, p1;
9892 023 int k, s, r, res;
9893 024 mp_digit residue;
9894 025
9895 026 /* if p <= 0 return MP_VAL */
9896 027 if (mp_cmp_d(p, 0) != MP_GT) \{
9897 028 return MP_VAL;
9898 029 \}
9899 030
9900 031 /* step 1. if a == 0, return 0 */
9901 032 if (mp_iszero (a) == 1) \{
9902 033 *c = 0;
9903 034 return MP_OKAY;
9904 035 \}
9905 036
9906 037 /* step 2. if a == 1, return 1 */
9907 038 if (mp_cmp_d (a, 1) == MP_EQ) \{
9908 039 *c = 1;
9909 040 return MP_OKAY;
9910 041 \}
9911 042
9912 043 /* default */
9913 044 s = 0;
9914 045
9915 046 /* step 3. write a = a1 * 2**k */
9916 047 if ((res = mp_init_copy (&a1, a)) != MP_OKAY) \{
9917 048 return res;
9918 049 \}
9919 050
9920 051 if ((res = mp_init (&p1)) != MP_OKAY) \{
9921 052 goto __A1;
9922 053 \}
9923 054
9924 055 /* divide out larger power of two */
9925 056 k = mp_cnt_lsb(&a1);
9926 057 if ((res = mp_div_2d(&a1, k, &a1, NULL)) != MP_OKAY) \{
9927 058 goto __P1;
9928 059 \}
9929 060
9930 061 /* step 4. if e is even set s=1 */
9931 062 if ((k & 1) == 0) \{
9932 063 s = 1;
9933 064 \} else \{
9934 065 /* else set s=1 if p = 1/7 (mod 8) or s=-1 if p = 3/5 (mod 8) */
9935 066 residue = p->dp[0] & 7;
9936 067
9937 068 if (residue == 1 || residue == 7) \{
9938 069 s = 1;
9939 070 \} else if (residue == 3 || residue == 5) \{
9940 071 s = -1;
9941 072 \}
9942 073 \}
9943 074
9944 075 /* step 5. if p == 3 (mod 4) *and* a1 == 3 (mod 4) then s = -s */
9945 076 if ( ((p->dp[0] & 3) == 3) && ((a1.dp[0] & 3) == 3)) \{
9946 077 s = -s;
9947 078 \}
9948 079
9949 080 /* if a1 == 1 we're done */
9950 081 if (mp_cmp_d (&a1, 1) == MP_EQ) \{
9951 082 *c = s;
9952 083 \} else \{
9953 084 /* n1 = n mod a1 */
9954 085 if ((res = mp_mod (p, &a1, &p1)) != MP_OKAY) \{
9955 086 goto __P1;
9956 087 \}
9957 088 if ((res = mp_jacobi (&p1, &a1, &r)) != MP_OKAY) \{
9958 089 goto __P1;
9959 090 \}
9960 091 *c = s * r;
9961 092 \}
9962 093
9963 094 /* done */
9964 095 res = MP_OKAY;
9965 096 __P1:mp_clear (&p1);
9966 097 __A1:mp_clear (&a1);
9967 098 return res;
9968 099 \}
9969 \end{alltt}
9970 \end{small}
9971
9972 As a matter of practicality the variable $a'$ of the pseudo-code is represented by the variable $a1$ since the $'$ symbol is not a valid character in a C
9973 variable name.
9974
9975 The two simple cases of $a = 0$ and $a = 1$ are handled at the very beginning to simplify the algorithm. If the input is non-trivial the algorithm
9976 has to proceed to compute the Jacobi symbol. The variable $s$ is used to hold the current Jacobi product. Note that $s$ is simply a C ``int'' data type since
9977 the only values it may take on are $-1$, $0$ and $1$.
9978
9979 After a local copy of $a$ is made all of the factors of two are divided out and the count stored in $k$. Technically only the least significant
9980 bit of $k$ is required, however, the algorithm is simpler to follow when the full count is kept. In practice an exclusive-or (tracking only the parity) and an addition have the same
9981 processor requirements and neither is faster than the other.
9982
9983 Lines 61 through 73 determine the value of $\left ( { 2 \over p } \right )^k$. If the least significant bit of $k$ is zero then
9984 $k$ is even and the value is one. Otherwise, the value of $s$ depends on which residue class $p$ belongs to modulo eight. The value of
9985 $(-1)^{(p-1)(a'-1)/4}$ is computed and multiplied against $s$ on lines 75 through 77.
9986
9987 Finally, if $a1$ does not equal one the algorithm must recurse and compute $\left ( {p' \over a'} \right )$.
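
As a rough usage sketch (not part of the library sources; the function name jacobi\_demo is invented), the routine is typically called with a
pointer to a C ``int'' that receives the result.

\begin{small}
\begin{alltt}
#include <tommath.h>

/* usage sketch: computes (6 | 11) which is -1 */
int jacobi_demo(void)
\{
   mp_int a, p;
   int    j, res;

   if ((res = mp_init_multi(&a, &p, NULL)) != MP_OKAY) \{
      return res;
   \}

   mp_set(&a, 6);
   mp_set(&p, 11);

   if ((res = mp_jacobi(&a, &p, &j)) == MP_OKAY) \{
      /* j now holds -1 since 6 is a non-residue modulo 11 */
   \}

   mp_clear_multi(&a, &p, NULL);
   return res;
\}
\end{alltt}
\end{small}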
9988
9990
9991 \section{Modular Inverse}
9992 \label{sec:modinv}
9993 The modular inverse of a number actually refers to the modular multiplicative inverse. Essentially for any integer $a$ such that $(a, p) = 1$ there
9994 exists another integer $b$ such that $ab \equiv 1 \mbox{ (mod }p\mbox{)}$. The integer $b$ is called the multiplicative inverse of $a$ and is
9995 denoted as $b = a^{-1}$. Technically speaking modular inversion is a well defined operation for any finite ring or field, not just for rings and
9996 fields of integers. However, only the latter will be the matter of discussion here.
9997
9998 The simplest approach is to compute the algebraic inverse of the input. That is to compute $b \equiv a^{\Phi(p) - 1}$. If $\Phi(p)$ is the
9999 order of the multiplicative subgroup modulo $p$ then $b$ must be the multiplicative inverse of $a$. The proof of which is trivial.
10000
10001 \begin{equation}
10002 ab \equiv a \left (a^{\Phi(p) - 1} \right ) \equiv a^{\Phi(p)} \equiv a^0 \equiv 1 \mbox{ (mod }p\mbox{)}
10003 \end{equation}
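
For instance, with $p = 7$ (for which $\Phi(p) = 6$) and $a = 3$ the inverse is $b \equiv 3^{\Phi(7) - 1} = 3^5 = 243 \equiv 5 \mbox{ (mod }7\mbox{)}$,
and indeed $3 \cdot 5 = 15 \equiv 1 \mbox{ (mod }7\mbox{)}$.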
10004
10005 However, as simple as this approach may be it has two serious flaws. It requires that the value of $\Phi(p)$ be known, which if $p$ is composite
10006 requires knowledge of all of the prime factors of $p$. This approach is also very slow as the size of $p$ grows.
10007
10008 A simpler approach is based on the observation that solving for the multiplicative inverse is equivalent to solving the linear
10009 Diophantine\footnote{See LeVeque \cite[pp. 40-43]{LeVeque} for more information.} equation.
10010
10011 \begin{equation}
10012 ab + pq = 1
10013 \end{equation}
10014
10015 Where $a$, $b$, $p$ and $q$ are all integers. If such a pair of integers $\left < b, q \right >$ exists then $b$ is the multiplicative inverse of
10016 $a$ modulo $p$. The extended Euclidean algorithm (Knuth \cite[pp. 342]{TAOCPV2}) can be used to solve such equations provided $(a, p) = 1$.
10017 However, instead of using that algorithm directly a variant known as the binary Extended Euclidean algorithm will be used in its place. The
10018 binary approach is very similar to the binary greatest common divisor algorithm except it will produce a full solution to the Diophantine
10019 equation.
10020
10021 \subsection{General Case}
10022 \newpage\begin{figure}[!here]
10023 \begin{small}
10024 \begin{center}
10025 \begin{tabular}{l}
10026 \hline Algorithm \textbf{mp\_invmod}. \\
10027 \textbf{Input}. mp\_int $a$ and $b$, $(a, b) = 1$, $b \ge 2$, $0 < a < b$. \\
10028 \textbf{Output}. The modular inverse $c \equiv a^{-1} \mbox{ (mod }b\mbox{)}$. \\
10029 \hline \\
10030 1. If $b \le 0$ then return(\textit{MP\_VAL}). \\
10031 2. If $b_0 \equiv 1 \mbox{ (mod }2\mbox{)}$ then use algorithm fast\_mp\_invmod. \\
10032 3. $x \leftarrow \vert a \vert, y \leftarrow b$ \\
10033 4. If $x_0 \equiv y_0 \equiv 0 \mbox{ (mod }2\mbox{)}$ then return(\textit{MP\_VAL}). \\
10034 5. $u \leftarrow x, v \leftarrow y, B \leftarrow 0, C \leftarrow 0, A \leftarrow 1, D \leftarrow 1$ \\
10035 6. While $u.used > 0$ and $u_0 \equiv 0 \mbox{ (mod }2\mbox{)}$ \\
10036 \hspace{3mm}6.1 $u \leftarrow \lfloor u / 2 \rfloor$ \\
10037 \hspace{3mm}6.2 If ($A.used > 0$ and $A_0 \equiv 1 \mbox{ (mod }2\mbox{)}$) or ($B.used > 0$ and $B_0 \equiv 1 \mbox{ (mod }2\mbox{)}$) then \\
10038 \hspace{6mm}6.2.1 $A \leftarrow A + y$ \\
10039 \hspace{6mm}6.2.2 $B \leftarrow B - x$ \\
10040 \hspace{3mm}6.3 $A \leftarrow \lfloor A / 2 \rfloor$ \\
10041 \hspace{3mm}6.4 $B \leftarrow \lfloor B / 2 \rfloor$ \\
10042 7. While $v.used > 0$ and $v_0 \equiv 0 \mbox{ (mod }2\mbox{)}$ \\
10043 \hspace{3mm}7.1 $v \leftarrow \lfloor v / 2 \rfloor$ \\
10044 \hspace{3mm}7.2 If ($C.used > 0$ and $C_0 \equiv 1 \mbox{ (mod }2\mbox{)}$) or ($D.used > 0$ and $D_0 \equiv 1 \mbox{ (mod }2\mbox{)}$) then \\
10045 \hspace{6mm}7.2.1 $C \leftarrow C + y$ \\
10046 \hspace{6mm}7.2.2 $D \leftarrow D - x$ \\
10047 \hspace{3mm}7.3 $C \leftarrow \lfloor C / 2 \rfloor$ \\
10048 \hspace{3mm}7.4 $D \leftarrow \lfloor D / 2 \rfloor$ \\
10049 8. If $u \ge v$ then \\
10050 \hspace{3mm}8.1 $u \leftarrow u - v$ \\
10051 \hspace{3mm}8.2 $A \leftarrow A - C$ \\
10052 \hspace{3mm}8.3 $B \leftarrow B - D$ \\
10053 9. else \\
10054 \hspace{3mm}9.1 $v \leftarrow v - u$ \\
10055 \hspace{3mm}9.2 $C \leftarrow C - A$ \\
10056 \hspace{3mm}9.3 $D \leftarrow D - B$ \\
10057 10. If $u \ne 0$ goto step 6. \\
10058 11. If $v \ne 1$ return(\textit{MP\_VAL}). \\
10059 12. While $C \le 0$ do \\
10060 \hspace{3mm}12.1 $C \leftarrow C + b$ \\
10061 13. While $C \ge b$ do \\
10062 \hspace{3mm}13.1 $C \leftarrow C - b$ \\
10063 14. $c \leftarrow C$ \\
10064 15. Return(\textit{MP\_OKAY}). \\
10065 \hline
10066 \end{tabular}
10067 \end{center}
10068 \end{small}
10069 \end{figure}
10070 \textbf{Algorithm mp\_invmod.}
10071 This algorithm computes the modular multiplicative inverse of an integer $a$ modulo an integer $b$. This algorithm is a variation of the
10072 extended binary Euclidean algorithm from HAC \cite[pp. 608]{HAC}. It has been modified to only compute the modular inverse and not a complete
10073 Diophantine solution.
10074
10075 If $b \le 0$ then the modulus is invalid and MP\_VAL is returned. Similarly if both $a$ and $b$ are even then there cannot be a multiplicative
10076 inverse for $a$ and the error is reported.
10077
10078 The astute reader will observe that steps six through nine are very similar to the binary greatest common divisor algorithm mp\_gcd. In this case
10079 the variables of the Diophantine equation are solved for as well. The algorithm terminates when $u = 0$ in which case the solution is
10080
10081 \begin{equation}
10082 Ca + Db = v
10083 \end{equation}
10084
10085 If $v$, the greatest common divisor of $a$ and $b$, is not equal to one then the algorithm will report an error as no inverse exists. Otherwise, $C$
10086 is the modular inverse of $a$. The actual value of $C$ is congruent to, but not necessarily equal to, the ideal modular inverse which should lie
10087 within $1 \le a^{-1} < b$. Step numbers twelve and thirteen adjust the inverse until it is in range. If the original input $a$ is within $0 < a < b$
10088 then only a couple of additions or subtractions will be required to adjust the inverse.
10089
10090 \vspace{+3mm}\begin{small}
10091 \hspace{-5.1mm}{\bf File}: bn\_mp\_invmod.c
10092 \vspace{-3mm}
10093 \begin{alltt}
10094 016
10095 017 /* hac 14.61, pp608 */
10096 018 int mp_invmod (mp_int * a, mp_int * b, mp_int * c)
10097 019 \{
10098 020 mp_int x, y, u, v, A, B, C, D;
10099 021 int res;
10100 022
10101 023 /* b cannot be negative */
10102 024 if (b->sign == MP_NEG || mp_iszero(b) == 1) \{
10103 025 return MP_VAL;
10104 026 \}
10105 027
10106 028 /* if the modulus is odd we can use a faster routine instead */
10107 029 if (mp_isodd (b) == 1) \{
10108 030 return fast_mp_invmod (a, b, c);
10109 031 \}
10110 032
10111 033 /* init temps */
10112 034 if ((res = mp_init_multi(&x, &y, &u, &v,
10113 035 &A, &B, &C, &D, NULL)) != MP_OKAY) \{
10114 036 return res;
10115 037 \}
10116 038
10117 039 /* x = a, y = b */
10118 040 if ((res = mp_copy (a, &x)) != MP_OKAY) \{
10119 041 goto __ERR;
10120 042 \}
10121 043 if ((res = mp_copy (b, &y)) != MP_OKAY) \{
10122 044 goto __ERR;
10123 045 \}
10124 046
10125 047 /* 2. [modified] if x,y are both even then return an error! */
10126 048 if (mp_iseven (&x) == 1 && mp_iseven (&y) == 1) \{
10127 049 res = MP_VAL;
10128 050 goto __ERR;
10129 051 \}
10130 052
10131 053 /* 3. u=x, v=y, A=1, B=0, C=0,D=1 */
10132 054 if ((res = mp_copy (&x, &u)) != MP_OKAY) \{
10133 055 goto __ERR;
10134 056 \}
10135 057 if ((res = mp_copy (&y, &v)) != MP_OKAY) \{
10136 058 goto __ERR;
10137 059 \}
10138 060 mp_set (&A, 1);
10139 061 mp_set (&D, 1);
10140 062
10141 063 top:
10142 064 /* 4. while u is even do */
10143 065 while (mp_iseven (&u) == 1) \{
10144 066 /* 4.1 u = u/2 */
10145 067 if ((res = mp_div_2 (&u, &u)) != MP_OKAY) \{
10146 068 goto __ERR;
10147 069 \}
10148 070 /* 4.2 if A or B is odd then */
10149 071 if (mp_isodd (&A) == 1 || mp_isodd (&B) == 1) \{
10150 072 /* A = (A+y)/2, B = (B-x)/2 */
10151 073 if ((res = mp_add (&A, &y, &A)) != MP_OKAY) \{
10152 074 goto __ERR;
10153 075 \}
10154 076 if ((res = mp_sub (&B, &x, &B)) != MP_OKAY) \{
10155 077 goto __ERR;
10156 078 \}
10157 079 \}
10158 080 /* A = A/2, B = B/2 */
10159 081 if ((res = mp_div_2 (&A, &A)) != MP_OKAY) \{
10160 082 goto __ERR;
10161 083 \}
10162 084 if ((res = mp_div_2 (&B, &B)) != MP_OKAY) \{
10163 085 goto __ERR;
10164 086 \}
10165 087 \}
10166 088
10167 089 /* 5. while v is even do */
10168 090 while (mp_iseven (&v) == 1) \{
10169 091 /* 5.1 v = v/2 */
10170 092 if ((res = mp_div_2 (&v, &v)) != MP_OKAY) \{
10171 093 goto __ERR;
10172 094 \}
10173 095 /* 5.2 if C or D is odd then */
10174 096 if (mp_isodd (&C) == 1 || mp_isodd (&D) == 1) \{
10175 097 /* C = (C+y)/2, D = (D-x)/2 */
10176 098 if ((res = mp_add (&C, &y, &C)) != MP_OKAY) \{
10177 099 goto __ERR;
10178 100 \}
10179 101 if ((res = mp_sub (&D, &x, &D)) != MP_OKAY) \{
10180 102 goto __ERR;
10181 103 \}
10182 104 \}
10183 105 /* C = C/2, D = D/2 */
10184 106 if ((res = mp_div_2 (&C, &C)) != MP_OKAY) \{
10185 107 goto __ERR;
10186 108 \}
10187 109 if ((res = mp_div_2 (&D, &D)) != MP_OKAY) \{
10188 110 goto __ERR;
10189 111 \}
10190 112 \}
10191 113
10192 114 /* 6. if u >= v then */
10193 115 if (mp_cmp (&u, &v) != MP_LT) \{
10194 116 /* u = u - v, A = A - C, B = B - D */
10195 117 if ((res = mp_sub (&u, &v, &u)) != MP_OKAY) \{
10196 118 goto __ERR;
10197 119 \}
10198 120
10199 121 if ((res = mp_sub (&A, &C, &A)) != MP_OKAY) \{
10200 122 goto __ERR;
10201 123 \}
10202 124
10203 125 if ((res = mp_sub (&B, &D, &B)) != MP_OKAY) \{
10204 126 goto __ERR;
10205 127 \}
10206 128 \} else \{
10207 129       /* v = v - u, C = C - A, D = D - B */
10208 130 if ((res = mp_sub (&v, &u, &v)) != MP_OKAY) \{
10209 131 goto __ERR;
10210 132 \}
10211 133
10212 134 if ((res = mp_sub (&C, &A, &C)) != MP_OKAY) \{
10213 135 goto __ERR;
10214 136 \}
10215 137
10216 138 if ((res = mp_sub (&D, &B, &D)) != MP_OKAY) \{
10217 139 goto __ERR;
10218 140 \}
10219 141 \}
10220 142
10221 143 /* if not zero goto step 4 */
10222 144 if (mp_iszero (&u) == 0)
10223 145 goto top;
10224 146
10225 147 /* now a = C, b = D, gcd == g*v */
10226 148
10227 149 /* if v != 1 then there is no inverse */
10228 150 if (mp_cmp_d (&v, 1) != MP_EQ) \{
10229 151 res = MP_VAL;
10230 152 goto __ERR;
10231 153 \}
10232 154
10233 155 /* if its too low */
10234 156 while (mp_cmp_d(&C, 0) == MP_LT) \{
10235 157 if ((res = mp_add(&C, b, &C)) != MP_OKAY) \{
10236 158 goto __ERR;
10237 159 \}
10238 160 \}
10239 161
10240 162 /* too big */
10241 163 while (mp_cmp_mag(&C, b) != MP_LT) \{
10242 164 if ((res = mp_sub(&C, b, &C)) != MP_OKAY) \{
10243 165 goto __ERR;
10244 166 \}
10245 167 \}
10246 168
10247 169 /* C is now the inverse */
10248 170 mp_exch (&C, c);
10249 171 res = MP_OKAY;
10250 172 __ERR:mp_clear_multi (&x, &y, &u, &v, &A, &B, &C, &D, NULL);
10251 173 return res;
10252 174 \}
10253 \end{alltt}
10254 \end{small}
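
The following fragment is a usage sketch only; it is not part of the library sources and the function name invmod\_demo is invented for
illustration. Since the modulus is even the general routine above is exercised rather than fast\_mp\_invmod.

\begin{small}
\begin{alltt}
#include <tommath.h>

/* usage sketch: computes 3^-1 modulo 10 */
int invmod_demo(void)
\{
   mp_int a, b, c;
   int res;

   if ((res = mp_init_multi(&a, &b, &c, NULL)) != MP_OKAY) \{
      return res;
   \}

   mp_set(&a, 3);
   mp_set(&b, 10);   /* even modulus, so the general routine is used */

   if ((res = mp_invmod(&a, &b, &c)) == MP_OKAY) \{
      /* c now holds 7 since 3 * 7 = 21 = 2(10) + 1 */
   \}

   mp_clear_multi(&a, &b, &c, NULL);
   return res;
\}
\end{alltt}
\end{small}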
10255
10256 \subsubsection{Odd Moduli}
10257
10258 When the modulus $b$ is odd the variables $A$ and $C$ are not required to compute the inverse. In particular by attempting to solve
10259 the Diophantine equation $Cb + Da = 1$ only $B$ and $D$ are required to find the inverse of $a$.
10260
10261 The algorithm fast\_mp\_invmod is a direct adaptation of algorithm mp\_invmod with all steps involving either $A$ or $C$ removed. This
10262 optimization will roughly halve the time required to compute the modular inverse.
10263
10264 \section{Primality Tests}
10265
10266 An integer $a$ greater than one is said to be prime if it is not evenly divisible by any integer other than one and itself. For example, $a = 7$ is prime
10267 since the integers $2 \ldots 6$ do not evenly divide $a$. By contrast, $a = 6$ is not prime since $a = 6 = 2 \cdot 3$.
10268
10269 Prime numbers arise frequently in cryptography as they allow finite fields to be formed. The ability to determine quickly whether an integer is prime or
10270 not has been an important subject in cryptography and number theory for a considerable time. The algorithms that will be presented are all
10271 probabilistic algorithms in that when they report an integer is composite it must be composite. However, when the algorithms report an integer is
10272 prime the report may be incorrect.
10273
10274 As will be discussed it is possible to limit the probability of error so well that for practical purposes the probability of error might as
10275 well be zero. For the purposes of these discussions let $n$ represent the candidate integer whose primality is in question.
10276
10277 \subsection{Trial Division}
10278
10279 Trial division means to attempt to evenly divide a candidate integer by small prime integers. If the candidate can be evenly divided it obviously
10280 cannot be prime. By dividing by all primes $1 < p \le \sqrt{n}$ this test can actually prove whether an integer is prime. However, such a test
10281 would require a prohibitive amount of time as $n$ grows.
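
To make the $\sqrt{n}$ bound concrete, the following sketch (plain single precision C, not part of the library, with an invented function name)
proves or disproves the primality of a machine sized candidate by complete trial division. The multiple precision routine presented below
instead divides only by a fixed set of small primes.

\begin{small}
\begin{alltt}
/* complete trial division for a single precision candidate n.
 * returns 1 if n is prime, 0 otherwise.  Only odd divisors up
 * to sqrt(n) need to be tested. */
static int is_prime_small(unsigned long n)
\{
   unsigned long d;

   if (n < 2) \{
      return 0;
   \}
   if ((n & 1) == 0) \{
      return n == 2;
   \}
   for (d = 3; d * d <= n; d += 2) \{
      if (n % d == 0) \{
         return 0;
      \}
   \}
   return 1;
\}
\end{alltt}
\end{small}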
10282
10283 Instead of dividing by every prime, a smaller, more manageable set of primes may be used instead. By performing trial division with only a subset
10284 of the primes less than $\sqrt{n} + 1$ the algorithm cannot prove that a candidate is prime. However, often it can prove a candidate is not prime.
10285
10286 The benefit of this test is that trial division by small values is fairly efficient, especially compared to the other algorithms that will be
10287 discussed shortly. The probability that this approach correctly identifies a composite candidate when tested with all primes up to $q$ is given by
10288 $1 - {1.12 \over \ln(q)}$. The graph (\ref{pic:primality}, will be added later) demonstrates the probability of success for the range
10289 $3 \le q \le 100$.
10290
10291 At approximately $q = 30$ the gain of performing further tests diminishes fairly quickly. At $q = 90$ further testing is generally not going to
10292 be of any practical use. In the case of LibTomMath the default limit $q = 256$ was chosen since it is not too high and will eliminate
10293 approximately $80\%$ of all candidate integers. The constant \textbf{PRIME\_SIZE} is equal to the number of primes in the test base. The
10294 array \_\_prime\_tab is an array of the first \textbf{PRIME\_SIZE} prime numbers.
10295
10296 \begin{figure}[!here]
10297 \begin{small}
10298 \begin{center}
10299 \begin{tabular}{l}
10300 \hline Algorithm \textbf{mp\_prime\_is\_divisible}. \\
10301 \textbf{Input}. mp\_int $n$ \\
10302 \textbf{Output}. $c = 1$ if $n$ is divisible by a small prime, otherwise $c = 0$. \\
10303 \hline \\
10304 1. for $ix$ from $0$ to $PRIME\_SIZE - 1$ do \\
10305 \hspace{3mm}1.1 $d \leftarrow n \mbox{ (mod }\_\_prime\_tab_{ix}\mbox{)}$ \\
10306 \hspace{3mm}1.2 If $d = 0$ then \\
10307 \hspace{6mm}1.2.1 $c \leftarrow 1$ \\
10308 \hspace{6mm}1.2.2 Return(\textit{MP\_OKAY}). \\
10309 2. $c \leftarrow 0$ \\
10310 3. Return(\textit{MP\_OKAY}). \\
10311 \hline
10312 \end{tabular}
10313 \end{center}
10314 \end{small}
10315 \caption{Algorithm mp\_prime\_is\_divisible}
10316 \end{figure}
10317 \textbf{Algorithm mp\_prime\_is\_divisible.}
10318 This algorithm attempts to determine if a candidate integer $n$ is composite by performing trial divisions.
10319
10320 \vspace{+3mm}\begin{small}
10321 \hspace{-5.1mm}{\bf File}: bn\_mp\_prime\_is\_divisible.c
10322 \vspace{-3mm}
10323 \begin{alltt}
10324 016
10325 017 /* determines if an integer is divisible by one
10326 018 * of the first PRIME_SIZE primes or not
10327 019 *
10328 020 * sets result to 0 if not, 1 if yes
10329 021 */
10330 022 int mp_prime_is_divisible (mp_int * a, int *result)
10331 023 \{
10332 024 int err, ix;
10333 025 mp_digit res;
10334 026
10335 027 /* default to not */
10336 028 *result = MP_NO;
10337 029
10338 030 for (ix = 0; ix < PRIME_SIZE; ix++) \{
10339 031 /* what is a mod __prime_tab[ix] */
10340 032 if ((err = mp_mod_d (a, __prime_tab[ix], &res)) != MP_OKAY) \{
10341 033 return err;
10342 034 \}
10343 035
10344 036 /* is the residue zero? */
10345 037 if (res == 0) \{
10346 038 *result = MP_YES;
10347 039 return MP_OKAY;
10348 040 \}
10349 041 \}
10350 042
10351 043 return MP_OKAY;
10352 044 \}
10353 \end{alltt}
10354 \end{small}
10355
10356 The result defaults to $0$ so that if an error occurs the candidate is not reported as divisible. The values in the prime table are all specified to be in the range of a
10357 mp\_digit. The table \_\_prime\_tab is defined in the following file.
10358
10359 \vspace{+3mm}\begin{small}
10360 \hspace{-5.1mm}{\bf File}: bn\_prime\_tab.c
10361 \vspace{-3mm}
10362 \begin{alltt}
10363 016 const mp_digit __prime_tab[] = \{
10364 017 0x0002, 0x0003, 0x0005, 0x0007, 0x000B, 0x000D, 0x0011, 0x0013,
10365 018 0x0017, 0x001D, 0x001F, 0x0025, 0x0029, 0x002B, 0x002F, 0x0035,
10366 019 0x003B, 0x003D, 0x0043, 0x0047, 0x0049, 0x004F, 0x0053, 0x0059,
10367 020 0x0061, 0x0065, 0x0067, 0x006B, 0x006D, 0x0071, 0x007F,
10368 021 #ifndef MP_8BIT
10369 022 0x0083,
10370 023 0x0089, 0x008B, 0x0095, 0x0097, 0x009D, 0x00A3, 0x00A7, 0x00AD,
10371 024 0x00B3, 0x00B5, 0x00BF, 0x00C1, 0x00C5, 0x00C7, 0x00D3, 0x00DF,
10372 025 0x00E3, 0x00E5, 0x00E9, 0x00EF, 0x00F1, 0x00FB, 0x0101, 0x0107,
10373 026 0x010D, 0x010F, 0x0115, 0x0119, 0x011B, 0x0125, 0x0133, 0x0137,
10374 027
10375 028 0x0139, 0x013D, 0x014B, 0x0151, 0x015B, 0x015D, 0x0161, 0x0167,
10376 029 0x016F, 0x0175, 0x017B, 0x017F, 0x0185, 0x018D, 0x0191, 0x0199,
10377 030 0x01A3, 0x01A5, 0x01AF, 0x01B1, 0x01B7, 0x01BB, 0x01C1, 0x01C9,
10378 031 0x01CD, 0x01CF, 0x01D3, 0x01DF, 0x01E7, 0x01EB, 0x01F3, 0x01F7,
10379 032 0x01FD, 0x0209, 0x020B, 0x021D, 0x0223, 0x022D, 0x0233, 0x0239,
10380 033 0x023B, 0x0241, 0x024B, 0x0251, 0x0257, 0x0259, 0x025F, 0x0265,
10381 034 0x0269, 0x026B, 0x0277, 0x0281, 0x0283, 0x0287, 0x028D, 0x0293,
10382 035 0x0295, 0x02A1, 0x02A5, 0x02AB, 0x02B3, 0x02BD, 0x02C5, 0x02CF,
10383 036
10384 037 0x02D7, 0x02DD, 0x02E3, 0x02E7, 0x02EF, 0x02F5, 0x02F9, 0x0301,
10385 038 0x0305, 0x0313, 0x031D, 0x0329, 0x032B, 0x0335, 0x0337, 0x033B,
10386 039 0x033D, 0x0347, 0x0355, 0x0359, 0x035B, 0x035F, 0x036D, 0x0371,
10387 040 0x0373, 0x0377, 0x038B, 0x038F, 0x0397, 0x03A1, 0x03A9, 0x03AD,
10388 041 0x03B3, 0x03B9, 0x03C7, 0x03CB, 0x03D1, 0x03D7, 0x03DF, 0x03E5,
10389 042 0x03F1, 0x03F5, 0x03FB, 0x03FD, 0x0407, 0x0409, 0x040F, 0x0419,
10390 043 0x041B, 0x0425, 0x0427, 0x042D, 0x043F, 0x0443, 0x0445, 0x0449,
10391 044 0x044F, 0x0455, 0x045D, 0x0463, 0x0469, 0x047F, 0x0481, 0x048B,
10392 045
10393 046 0x0493, 0x049D, 0x04A3, 0x04A9, 0x04B1, 0x04BD, 0x04C1, 0x04C7,
10394 047 0x04CD, 0x04CF, 0x04D5, 0x04E1, 0x04EB, 0x04FD, 0x04FF, 0x0503,
10395 048 0x0509, 0x050B, 0x0511, 0x0515, 0x0517, 0x051B, 0x0527, 0x0529,
10396 049 0x052F, 0x0551, 0x0557, 0x055D, 0x0565, 0x0577, 0x0581, 0x058F,
10397 050 0x0593, 0x0595, 0x0599, 0x059F, 0x05A7, 0x05AB, 0x05AD, 0x05B3,
10398 051 0x05BF, 0x05C9, 0x05CB, 0x05CF, 0x05D1, 0x05D5, 0x05DB, 0x05E7,
10399 052 0x05F3, 0x05FB, 0x0607, 0x060D, 0x0611, 0x0617, 0x061F, 0x0623,
10400 053 0x062B, 0x062F, 0x063D, 0x0641, 0x0647, 0x0649, 0x064D, 0x0653
10401 054 #endif
10402 055 \};
10403 \end{alltt}
10404 \end{small}
10405
10406 Note that there are two possible tables. When an mp\_digit is seven bits long only the primes up to $127$ may be included, otherwise the primes
10407 up to $1619$ are used. Note that the value of \textbf{PRIME\_SIZE} is a constant dependent on the size of a mp\_digit.
10408
10409 \subsection{The Fermat Test}
10410 The Fermat test is probably one of the oldest tests to have a non-trivial probability of success. It is based on the fact that if $n$ is in
10411 fact prime then $a^{n} \equiv a \mbox{ (mod }n\mbox{)}$ for all $0 < a < n$. The reason is that if $n$ is prime then the order of
10412 the multiplicative subgroup is $n - 1$. Any base $a$ must have an order which divides $n - 1$ and as such $a^n$ is equivalent to
10413 $a^1 = a$.
10414
10415 If $n$ is composite then any given base $a$ does not have to have an order which divides $n - 1$, in which case
10416 it is possible that $a^n \nequiv a \mbox{ (mod }n\mbox{)}$. However, this test is not absolute as it is possible that the order
10417 of a base will divide $n - 1$, in which case the composite would be reported as prime. Such a base makes $n$ what is known as a Fermat pseudo-prime. Certain
10418 composite integers known as Carmichael numbers are pseudo-primes to all valid bases. Fortunately such numbers become extremely rare as $n$ grows
10419 in size.
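
For example, $341 = 11 \cdot 31$ is composite yet passes the Fermat test to the base two. Since $2^{10} = 1024 = 3 \cdot 341 + 1$ it follows that
$2^{10} \equiv 1 \mbox{ (mod }341\mbox{)}$, hence $2^{340} \equiv 1$ and $2^{341} \equiv 2 \mbox{ (mod }341\mbox{)}$. The integer $341$ is the
smallest Fermat pseudo-prime to the base two.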
10420
10421 \begin{figure}[!here]
10422 \begin{small}
10423 \begin{center}
10424 \begin{tabular}{l}
10425 \hline Algorithm \textbf{mp\_prime\_fermat}. \\
10426 \textbf{Input}. mp\_int $a$ and $b$, $a \ge 2$, $0 < b < a$. \\
10427 \textbf{Output}. $c = 1$ if $b^a \equiv b \mbox{ (mod }a\mbox{)}$, otherwise $c = 0$. \\
10428 \hline \\
10429 1. $t \leftarrow b^a \mbox{ (mod }a\mbox{)}$ \\
10430 2. If $t = b$ then \\
10431 \hspace{3mm}2.1 $c = 1$ \\
10432 3. else \\
10433 \hspace{3mm}3.1 $c = 0$ \\
10434 4. Return(\textit{MP\_OKAY}). \\
10435 \hline
10436 \end{tabular}
10437 \end{center}
10438 \end{small}
10439 \caption{Algorithm mp\_prime\_fermat}
10440 \end{figure}
10441 \textbf{Algorithm mp\_prime\_fermat.}
10442 This algorithm determines whether an mp\_int $a$ is a Fermat prime to the base $b$ or not. It uses a single modular exponentiation to
10443 determine the result.
10444
10445 \vspace{+3mm}\begin{small}
10446 \hspace{-5.1mm}{\bf File}: bn\_mp\_prime\_fermat.c
10447 \vspace{-3mm}
10448 \begin{alltt}
10449 016
10450 017 /* performs one Fermat test.
10451 018 *
10452 019 * If "a" were prime then b**a == b (mod a) since the order of
10453 020 * the multiplicative sub-group would be phi(a) = a-1. That means
10454 021 * it would be the same as b**(a mod (a-1)) == b**1 == b (mod a).
10455 022 *
10456 023 * Sets result to 1 if the congruence holds, or zero otherwise.
10457 024 */
10458 025 int mp_prime_fermat (mp_int * a, mp_int * b, int *result)
10459 026 \{
10460 027 mp_int t;
10461 028 int err;
10462 029
10463 030 /* default to composite */
10464 031 *result = MP_NO;
10465 032
10466 033 /* ensure b > 1 */
10467 034 if (mp_cmp_d(b, 1) != MP_GT) \{
10468 035 return MP_VAL;
10469 036 \}
10470 037
10471 038 /* init t */
10472 039 if ((err = mp_init (&t)) != MP_OKAY) \{
10473 040 return err;
10474 041 \}
10475 042
10476 043 /* compute t = b**a mod a */
10477 044 if ((err = mp_exptmod (b, a, a, &t)) != MP_OKAY) \{
10478 045 goto __T;
10479 046 \}
10480 047
10481 048 /* is it equal to b? */
10482 049 if (mp_cmp (&t, b) == MP_EQ) \{
10483 050 *result = MP_YES;
10484 051 \}
10485 052
10486 053 err = MP_OKAY;
10487 054 __T:mp_clear (&t);
10488 055 return err;
10489 056 \}
10490 \end{alltt}
10491 \end{small}
10492
10493 \subsection{The Miller-Rabin Test}
10494 The Miller-Rabin (citation) test is another primality test which has tighter error bounds than the Fermat test, specifically with sequentially chosen
10495 candidate integers. The algorithm is based on the observation that if $n - 1 = 2^kr$ with $r$ odd, if $n$ is prime and if $b^r \nequiv \pm 1$, then after up to $k - 1$ squarings the
10496 value must become equal to $-1$. The squarings are stopped as soon as $-1$ is observed. If the value of $1$ is observed first it means that
10497 some value not congruent to $\pm 1$ when squared equals one, which cannot occur if $n$ is prime.
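
Returning to the example of the previous section, the Miller-Rabin test does catch $n = 341$. Here $n - 1 = 340 = 2^2 \cdot 85$ so $k = 2$ and
$r = 85$. Since $2^{10} \equiv 1 \mbox{ (mod }341\mbox{)}$ the value $2^{85} = \left ( 2^{10} \right )^{8} \cdot 2^{5} \equiv 32 \mbox{ (mod }341\mbox{)}$,
which is not congruent to $\pm 1$. A single squaring gives $32^2 = 1024 \equiv 1 \mbox{ (mod }341\mbox{)}$, so a value other than $\pm 1$ has squared
to one and $341$ is correctly declared composite, even though it passes the Fermat test to the base two.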
10498
10499 \begin{figure}[!here]
10500 \begin{small}
10501 \begin{center}
10502 \begin{tabular}{l}
10503 \hline Algorithm \textbf{mp\_prime\_miller\_rabin}. \\
10504 \textbf{Input}. mp\_int $a$ and $b$, $a \ge 2$, $0 < b < a$. \\
10505 \textbf{Output}. $c = 1$ if $a$ is a Miller-Rabin probable prime to the base $b$, otherwise $c = 0$. \\
10506 \hline
10507 1. $a' \leftarrow a - 1$ \\
10508 2. $r \leftarrow a'$ \\
10509 3. $c \leftarrow 0, s \leftarrow 0$ \\
10510 4. While $r.used > 0$ and $r_0 \equiv 0 \mbox{ (mod }2\mbox{)}$ \\
10511 \hspace{3mm}4.1 $s \leftarrow s + 1$ \\
10512 \hspace{3mm}4.2 $r \leftarrow \lfloor r / 2 \rfloor$ \\
10513 5. $y \leftarrow b^r \mbox{ (mod }a\mbox{)}$ \\
10514 6. If $y \nequiv \pm 1$ then \\
10515 \hspace{3mm}6.1 $j \leftarrow 1$ \\
10516 \hspace{3mm}6.2 While $j \le (s - 1)$ and $y \nequiv a'$ \\
10517 \hspace{6mm}6.2.1 $y \leftarrow y^2 \mbox{ (mod }a\mbox{)}$ \\
10518 \hspace{6mm}6.2.2 If $y = 1$ then goto step 8. \\
10519 \hspace{6mm}6.2.3 $j \leftarrow j + 1$ \\
10520 \hspace{3mm}6.3 If $y \nequiv a'$ goto step 8. \\
10521 7. $c \leftarrow 1$\\
10522 8. Return(\textit{MP\_OKAY}). \\
10523 \hline
10524 \end{tabular}
10525 \end{center}
10526 \end{small}
10527 \caption{Algorithm mp\_prime\_miller\_rabin}
10528 \end{figure}
10529 \textbf{Algorithm mp\_prime\_miller\_rabin.}
10530 This algorithm performs one trial round of the Miller-Rabin algorithm to the base $b$. It will set $c = 1$ if the algorithm cannot determine
10531 if $a$ is composite or $c = 0$ if $a$ is provably composite. The values of $s$ and $r$ are computed such that $a' = a - 1 = 2^sr$.
10532
10533 If the value $y \equiv b^r$ is congruent to $\pm 1$ then the algorithm cannot prove if $a$ is composite or not. Otherwise, the algorithm will
10534 square $y$ up to $s - 1$ times stopping only when $y \equiv -1$. If $y^2 \equiv 1$ and $y \nequiv \pm 1$ then the algorithm can report that $a$
10535 is provably composite. If the algorithm performs $s - 1$ squarings and $y \nequiv -1$ then $a$ is provably composite. If $a$ is not provably
10536 composite then it is \textit{probably} prime.
10537
10538 \vspace{+3mm}\begin{small}
10539 \hspace{-5.1mm}{\bf File}: bn\_mp\_prime\_miller\_rabin.c
10540 \vspace{-3mm}
10541 \begin{alltt}
10542 016
10543 017 /* Miller-Rabin test of "a" to the base of "b" as described in
10544 018 * HAC pp. 139 Algorithm 4.24
10545 019 *
10546 020 * Sets result to 0 if definitely composite or 1 if probably prime.
10547 021 * Randomly the chance of error is no more than 1/4 and often
10548 022 * very much lower.
10549 023 */
10550 024 int mp_prime_miller_rabin (mp_int * a, mp_int * b, int *result)
10551 025 \{
10552 026 mp_int n1, y, r;
10553 027 int s, j, err;
10554 028
10555 029 /* default */
10556 030 *result = MP_NO;
10557 031
10558 032 /* ensure b > 1 */
10559 033 if (mp_cmp_d(b, 1) != MP_GT) \{
10560 034 return MP_VAL;
10561 035 \}
10562 036
10563 037 /* get n1 = a - 1 */
10564 038 if ((err = mp_init_copy (&n1, a)) != MP_OKAY) \{
10565 039 return err;
10566 040 \}
10567 041 if ((err = mp_sub_d (&n1, 1, &n1)) != MP_OKAY) \{
10568 042 goto __N1;
10569 043 \}
10570 044
10571 045 /* set 2**s * r = n1 */
10572 046 if ((err = mp_init_copy (&r, &n1)) != MP_OKAY) \{
10573 047 goto __N1;
10574 048 \}
10575 049
10576 050 /* count the number of least significant bits
10577 051 * which are zero
10578 052 */
10579 053 s = mp_cnt_lsb(&r);
10580 054
10581 055 /* now divide n - 1 by 2**s */
10582 056 if ((err = mp_div_2d (&r, s, &r, NULL)) != MP_OKAY) \{
10583 057 goto __R;
10584 058 \}
10585 059
10586 060 /* compute y = b**r mod a */
10587 061 if ((err = mp_init (&y)) != MP_OKAY) \{
10588 062 goto __R;
10589 063 \}
10590 064 if ((err = mp_exptmod (b, &r, a, &y)) != MP_OKAY) \{
10591 065 goto __Y;
10592 066 \}
10593 067
10594 068 /* if y != 1 and y != n1 do */
10595 069 if (mp_cmp_d (&y, 1) != MP_EQ && mp_cmp (&y, &n1) != MP_EQ) \{
10596 070 j = 1;
10597 071 /* while j <= s-1 and y != n1 */
10598 072 while ((j <= (s - 1)) && mp_cmp (&y, &n1) != MP_EQ) \{
10599 073 if ((err = mp_sqrmod (&y, a, &y)) != MP_OKAY) \{
10600 074 goto __Y;
10601 075 \}
10602 076
10603 077 /* if y == 1 then composite */
10604 078 if (mp_cmp_d (&y, 1) == MP_EQ) \{
10605 079 goto __Y;
10606 080 \}
10607 081
10608 082 ++j;
10609 083 \}
10610 084
10611 085 /* if y != n1 then composite */
10612 086 if (mp_cmp (&y, &n1) != MP_EQ) \{
10613 087 goto __Y;
10614 088 \}
10615 089 \}
10616 090
10617 091 /* probably prime now */
10618 092 *result = MP_YES;
10619 093 __Y:mp_clear (&y);
10620 094 __R:mp_clear (&r);
10621 095 __N1:mp_clear (&n1);
10622 096 return err;
10623 097 \}
10624 \end{alltt}
10625 \end{small}
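
In practice the routines of this section are combined: trial division cheaply rejects most candidates and one or more Miller-Rabin rounds are
applied to the survivors. The fragment below is a rough sketch of one such arrangement; it is not part of the library sources, the function
name is invented, and a complete implementation would use several bases and handle candidates that are themselves small primes.

\begin{small}
\begin{alltt}
#include <tommath.h>

/* rough sketch: sets *result to MP_YES if n survives trial division
 * and one Miller-Rabin round to the base two, MP_NO otherwise.
 * Note: a candidate equal to one of the small test primes would be
 * rejected here; a complete test must treat that case separately. */
int is_probably_prime(mp_int *n, int *result)
\{
   mp_int b;
   int    res, div;

   *result = MP_NO;

   /* cheap test first */
   if ((res = mp_prime_is_divisible(n, &div)) != MP_OKAY) \{
      return res;
   \}
   if (div == MP_YES) \{
      return MP_OKAY;
   \}

   /* one Miller-Rabin round to the base two */
   if ((res = mp_init(&b)) != MP_OKAY) \{
      return res;
   \}
   mp_set(&b, 2);

   res = mp_prime_miller_rabin(n, &b, result);
   mp_clear(&b);
   return res;
\}
\end{alltt}
\end{small}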
10626
10627
10628
10629
10630 \backmatter
10631 \appendix
10632 \begin{thebibliography}{ABCDEF}
10633 \bibitem[1]{TAOCPV2}
10634 Donald Knuth, \textit{The Art of Computer Programming}, Third Edition, Volume Two, Seminumerical Algorithms, Addison-Wesley, 1998
10635
10636 \bibitem[2]{HAC}
10637 A. Menezes, P. van Oorschot, S. Vanstone, \textit{Handbook of Applied Cryptography}, CRC Press, 1996
10638
10639 \bibitem[3]{ROSE}
10640 Michael Rosing, \textit{Implementing Elliptic Curve Cryptography}, Manning Publications, 1999
10641
10642 \bibitem[4]{COMBA}
10643 Paul G. Comba, \textit{Exponentiation Cryptosystems on the IBM PC}. IBM Systems Journal 29(4): 526-538 (1990)
10644
10645 \bibitem[5]{KARA}
9646 A. Karatsuba, Doklady Akad. Nauk SSSR 145 (1962), pp. 293-294
10647
10648 \bibitem[6]{KARAP}
10649 Andre Weimerskirch and Christof Paar, \textit{Generalizations of the Karatsuba Algorithm for Polynomial Multiplication}, Submitted to Design, Codes and Cryptography, March 2002
10650
10651 \bibitem[7]{BARRETT}
10652 Paul Barrett, \textit{Implementing the Rivest Shamir and Adleman Public Key Encryption Algorithm on a Standard Digital Signal Processor}, Advances in Cryptology, Crypto '86, Springer-Verlag.
10653
10654 \bibitem[8]{MONT}
10655 P.L.Montgomery. \textit{Modular multiplication without trial division}. Mathematics of Computation, 44(170):519-521, April 1985.
10656
10657 \bibitem[9]{DRMET}
10658 Chae Hoon Lim and Pil Joong Lee, \textit{Generating Efficient Primes for Discrete Log Cryptosystems}, POSTECH Information Research Laboratories
10659
10660 \bibitem[10]{MMB}
10661 J. Daemen and R. Govaerts and J. Vandewalle, \textit{Block ciphers based on Modular Arithmetic}, State and {P}rogress in the {R}esearch of {C}ryptography, 1993, pp. 80-89
10662
10663 \bibitem[11]{RSAREF}
10664 R.L. Rivest, A. Shamir, L. Adleman, \textit{A Method for Obtaining Digital Signatures and Public-Key Cryptosystems}
10665
10666 \bibitem[12]{DHREF}
10667 Whitfield Diffie, Martin E. Hellman, \textit{New Directions in Cryptography}, IEEE Transactions on Information Theory, 1976
10668
10669 \bibitem[13]{IEEE}
10670 IEEE Standard for Binary Floating-Point Arithmetic (ANSI/IEEE Std 754-1985)
10671
10672 \bibitem[14]{GMP}
10673 GNU Multiple Precision (GMP), \url{http://www.swox.com/gmp/}
10674
10675 \bibitem[15]{MPI}
10676 Multiple Precision Integer Library (MPI), Michael Fromberger, \url{http://thayer.dartmouth.edu/~sting/mpi/}
10677
10678 \bibitem[16]{OPENSSL}
10679 OpenSSL Cryptographic Toolkit, \url{http://openssl.org}
10680
10681 \bibitem[17]{LIP}
10682 Large Integer Package, \url{http://home.hetnet.nl/~ecstr/LIP.zip}
10683
10684 \bibitem[18]{ISOC}
10685 JTC1/SC22/WG14, ISO/IEC 9899:1999, ``A draft rationale for the C99 standard.''
10686
10687 \bibitem[19]{JAVA}
10688 The Sun Java Website, \url{http://java.sun.com/}
10689
10690 \end{thebibliography}
10691
10692 \input{tommath.ind}
10693
10694 \end{document}