comparison tommath.tex @ 19:e1037a1e12e7 libtommath-orig
0.30 release of LibTomMath
author | Matt Johnston <matt@ucc.asn.au> |
---|---|
date | Tue, 15 Jun 2004 14:42:57 +0000 |
parents | |
children | d29b64170cf0 |
2:86e0b50a9b58 | 19:e1037a1e12e7 |
---|---|
1 \documentclass[b5paper]{book} | |
2 \usepackage{hyperref} | |
3 \usepackage{makeidx} | |
4 \usepackage{amssymb} | |
5 \usepackage{color} | |
6 \usepackage{alltt} | |
7 \usepackage{graphicx} | |
8 \usepackage{layout} | |
9 \def\union{\cup} | |
10 \def\intersect{\cap} | |
11 \def\getsrandom{\stackrel{\rm R}{\gets}} | |
12 \def\cross{\times} | |
13 \def\cat{\hspace{0.5em} \| \hspace{0.5em}} | |
14 \def\catn{$\|$} | |
15 \def\divides{\hspace{0.3em} | \hspace{0.3em}} | |
16 \def\nequiv{\not\equiv} | |
17 \def\approx{\raisebox{0.2ex}{\mbox{\small $\sim$}}} | |
18 \def\lcm{{\rm lcm}} | |
19 \def\gcd{{\rm gcd}} | |
20 \def\log{{\rm log}} | |
21 \def\ord{{\rm ord}} | |
22 \def\abs{{\mathit abs}} | |
23 \def\rep{{\mathit rep}} | |
24 \def\mod{{\mathit\ mod\ }} | |
25 \renewcommand{\pmod}[1]{\ ({\rm mod\ }{#1})} | |
26 \newcommand{\floor}[1]{\left\lfloor{#1}\right\rfloor} | |
27 \newcommand{\ceil}[1]{\left\lceil{#1}\right\rceil} | |
28 \def\Or{{\rm\ or\ }} | |
29 \def\And{{\rm\ and\ }} | |
30 \def\iff{\hspace{1em}\Longleftrightarrow\hspace{1em}} | |
31 \def\implies{\Rightarrow} | |
32 \def\undefined{{\rm ``undefined"}} | |
33 \def\Proof{\vspace{1ex}\noindent {\bf Proof:}\hspace{1em}} | |
34 \let\oldphi\phi | |
35 \def\phi{\varphi} | |
36 \def\Pr{{\rm Pr}} | |
37 \newcommand{\str}[1]{{\mathbf{#1}}} | |
38 \def\F{{\mathbb F}} | |
39 \def\N{{\mathbb N}} | |
40 \def\Z{{\mathbb Z}} | |
41 \def\R{{\mathbb R}} | |
42 \def\C{{\mathbb C}} | |
43 \def\Q{{\mathbb Q}} | |
44 \definecolor{DGray}{gray}{0.5} | |
45 \newcommand{\emailaddr}[1]{\mbox{$<${#1}$>$}} | |
46 \def\twiddle{\raisebox{0.3ex}{\mbox{\tiny $\sim$}}} | |
47 \def\gap{\vspace{0.5ex}} | |
48 \makeindex | |
49 \begin{document} | |
50 \frontmatter | |
51 \pagestyle{empty} | |
52 \title{Implementing Multiple Precision Arithmetic \\ ~ \\ Draft Edition } | |
53 \author{\mbox{ | |
54 %\begin{small} | |
55 \begin{tabular}{c} | |
56 Tom St Denis \\ | |
57 Algonquin College \\ | |
58 \\ | |
59 Mads Rasmussen \\ | |
60 Open Communications Security \\ | |
61 \\ | |
62 Greg Rose \\ | |
63 QUALCOMM Australia \\ | |
64 \end{tabular} | |
65 %\end{small} | |
66 } | |
67 } | |
68 \maketitle | |
69 This text has been placed in the public domain. This text corresponds to the v0.30 release of the | |
70 LibTomMath project. | |
71 | |
72 \begin{alltt} | |
73 Tom St Denis | |
74 111 Banning Rd | |
75 Ottawa, Ontario | |
76 K2L 1C3 | |
77 Canada | |
78 | |
79 Phone: 1-613-836-3160 | |
80 Email: [email protected] | |
81 \end{alltt} | |
82 | |
83 This text is formatted to the international B5 paper size of 176mm wide by 250mm tall using the \LaTeX{} | |
84 {\em book} macro package and the Perl {\em booker} package. | |
85 | |
86 \tableofcontents | |
87 \listoffigures | |
88 \chapter*{Prefaces to the Draft Edition} | |
89 I started this text in April 2003 to complement my LibTomMath library. That is, to explain how to implement the functions | |
90 contained in LibTomMath. The goal is to have a textbook that any Computer Science student can use when implementing their | |
91 own multiple precision arithmetic. The plan I wanted to follow was to flesh out all the | |
92 ideas and concepts I had floating around in my head and then work on it afterwards refining a little bit at a time. Chance | |
93 would have it that I ended up with my summer off from Algonquin College and I was given four months solid to work on the | |
94 text. | |
95 | |
96 Choosing to not waste any time I dove right into the project even before my spring semester was finished. I wrote a bit | |
97 off and on at first. The moment my exams were finished I jumped into long 12 to 16 hour days. The result after only | |
98 a couple of months was a ten chapter, three hundred page draft that I quickly had distributed to anyone who wanted | |
99 to read it. I had Jean-Luc Cooke print copies for me and I brought them to Crypto'03 in Santa Barbara. So far I have | |
100 managed to grab a certain level of attention. Having people from around the world ask me for copies of the text was certainly | |
101 rewarding. | |
102 | |
103 Now we are past December 2003. By this time I had pictured that I would have at least finished my second draft of the text. | |
104 Currently I am far off from this goal. I've done partial re-writes of chapters one, two and three but they are not even | |
105 finished yet. I haven't given up on the project, only had some setbacks. First O'Reilly declined to publish the text, then | |
106 Addison-Wesley, and Greg has tried another publisher whose name I don't know. However, at this point I want to focus my energy | |
107 onto finishing the book not securing a contract. | |
108 | |
109 So why am I writing this text? It seems like a lot of work right? Most certainly it is a lot of work writing a textbook. | |
110 Even the simplest introductory material has to be lined with references and figures. A lot of the text has to be re-written | |
111 from point form to prose form to ensure an easier read. Why am I doing all this work for free then? Simple. My philosophy | |
112 is quite simply ``Open Source. Open Academia. Open Minds'' which means that to achieve a goal of open minds, that is, | |
113 people willing to accept new ideas and explore the unknown you have to make available material they can access freely | |
114 without hindrance. | |
115 | |
116 I've been writing free software since I was about sixteen but only recently have I hit upon software that people have come | |
117 to depend upon. I started LibTomCrypt in December 2001 and now several major companies use it as integral portions of their | |
118 software. Several educational institutions use it as a matter of course and many freelance developers use it as | |
119 part of their projects. To further my contributions I started the LibTomMath project in December 2002 aimed at providing | |
120 multiple precision arithmetic routines that students could learn from. That is, to write routines that are not only easy | |
121 to understand and follow but provide quite impressive performance considering they are all in standard portable ISO C. | |
122 | |
123 The second leg of my philosophy is ``Open Academia'' which is where this textbook comes in. In the end, when all is | |
124 said and done the text will be useable by educational institutions as a reference on multiple precision arithmetic. | |
125 | |
126 At this time I feel I should share a little information about myself. The most common question I was asked at | |
127 Crypto'03, perhaps just out of professional courtesy, was which school I either taught at or attended. The unfortunate | |
128 truth is that I neither teach at nor attend a school of academic reputation. I'm currently at Algonquin College which | |
129 is what I'd like to call a ``somewhat academic but mostly vocational'' college. In other words, job training. | |
130 | |
131 I'm a 21 year old computer science student mostly self-taught in the areas I am aware of (which includes a half-dozen | |
132 computer science fields, a few fields of mathematics and some English). I look forward to teaching someday but I am | |
133 still far off from that goal. | |
134 | |
135 Now it would be improper for me to not introduce the rest of the text's co-authors. While they are only contributing | |
136 corrections and editorial feedback their support has been tremendously helpful in presenting the concepts laid out | |
137 in the text so far. Greg has always been there for me. He has tracked my LibTom projects since their inception and even | |
138 sent cheques to help pay tuition from time to time. His background has provided a wonderful source to bounce ideas off | |
139 of and improve the quality of my writing. Mads is another fellow who has just ``been there''. I don't even recall what | |
140 his interest in the LibTom projects is but I'm definitely glad he has been around. His ability to catch logical errors | |
141 in my written English has saved me on several occasions to say the least. | |
142 | |
143 What to expect next? Well this is still a rough draft. I've only had the chance to update a few chapters. However, I've | |
144 been getting the feeling that people are starting to use my text and I owe them some updated material. My current tentative | |
145 plan is to edit one chapter every two weeks starting January 4th. It seems insane but my lower course load at college | |
146 should provide ample time. By Crypto'04 I plan to have a 2nd draft of the text polished and ready to hand out to as many | |
147 people as will take it. | |
148 | |
149 \begin{flushright} Tom St Denis \end{flushright} | |
150 | |
151 \newpage | |
152 I found the opportunity to work with Tom appealing for several reasons: not only could I broaden my own horizons, but I could also | |
153 contribute to educating others facing the problem of having to handle big number mathematical calculations. | |
154 | |
155 This book is Tom's child and he has been caring and fostering the project ever since the beginning with a clear mind of | |
156 how he wanted the project to turn out. I have helped by proofreading the text and we have had several discussions about | |
157 the layout and language used. | |
158 | |
159 I hold a masters degree in cryptography from the University of Southern Denmark and have always been interested in the | |
160 practical aspects of cryptography. | |
161 | |
162 Having worked in the security consultancy business for several years in S\~{a}o Paulo, Brazil, I have been in touch with a | |
163 great deal of work in which multiple precision mathematics was needed. Understanding the possibilities for speeding up | |
164 multiple precision calculations is often very important since we deal with outdated machine architecture where modular | |
165 reductions, for example, become painfully slow. | |
166 | |
167 This text is for people who stop and wonder when examining algorithms such as RSA for the first time and ask | |
168 themselves, ``You tell me this is only secure for large numbers, fine; but how do you implement these numbers?'' | |
169 | |
170 \begin{flushright} | |
171 Mads Rasmussen | |
172 | |
173 S\~{a}o Paulo - SP | |
174 | |
175 Brazil | |
176 \end{flushright} | |
177 | |
178 \newpage | |
179 It's all because I broke my leg. That just happened to be at about the same time that Tom asked for someone to review the section of the book about | |
180 Karatsuba multiplication. I was laid up, alone and immobile, and thought ``Why not?'' I vaguely knew what Karatsuba multiplication was, but not | |
181 really, so I thought I could help, learn, and stop myself from watching daytime cable TV, all at once. | |
182 | |
183 At the time of writing this, I've still not met Tom or Mads in meatspace. I've been following Tom's progress since his first splash on the | |
184 sci.crypt Usenet news group. I watched him go from a clueless newbie, to the cryptographic equivalent of a reformed smoker, to a real | |
185 contributor to the field, over a period of about two years. I've been impressed with his obvious intelligence, and astounded by his productivity. | |
186 Of course, he's young enough to be my own child, so he doesn't have my problems with staying awake. | |
187 | |
188 When I reviewed that single section of the book, in its very earliest form, I was very pleasantly surprised. So I decided to collaborate more fully, | |
189 and at least review all of it, and perhaps write some bits too. There's still a long way to go with it, and I have watched a number of close | |
190 friends go through the mill of publication, so I think that the way to go is longer than Tom thinks it is. Nevertheless, it's a good effort, | |
191 and I'm pleased to be involved with it. | |
192 | |
193 \begin{flushright} | |
194 Greg Rose, Sydney, Australia, June 2003. | |
195 \end{flushright} | |
196 | |
197 \mainmatter | |
198 \pagestyle{headings} | |
199 \chapter{Introduction} | |
200 \section{Multiple Precision Arithmetic} | |
201 | |
202 \subsection{What is Multiple Precision Arithmetic?} | |
203 When we think of long-hand arithmetic such as addition or multiplication we rarely consider the fact that we instinctively | |
204 raise or lower the precision of the numbers we are dealing with. For example, in decimal we almost immediately can | |
205 reason that $7$ times $6$ is $42$. However, $42$ has two digits of precision as opposed to the one digit we started with. | |
206 A further multiplication by, say, $3$ results in a larger precision result, $126$. In these few examples we have multiple | |
207 precisions for the numbers we are working with. Despite the various levels of precision a single subset\footnote{With the occasional optimization.} | |
208 of algorithms can be designed to accommodate them. | |
209 | |
210 By way of comparison a fixed or single precision operation would lose precision on various operations. For example, in | |
211 the decimal system with a fixed precision of one digit, $6 \cdot 7 = 2$. | |
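
The loss of precision can be sketched in C. This is only an illustration, assuming a fixed precision of one decimal digit; the function name \texttt{fixed\_mul\_1digit} is hypothetical and is not part of LibTomMath.

```c
/* Hypothetical helper (not a LibTomMath function): multiplies two
 * values and keeps only the least significant decimal digit, which
 * models a fixed precision of one decimal digit. */
unsigned fixed_mul_1digit(unsigned a, unsigned b)
{
    return (a * b) % 10u;   /* the tens digit and above are discarded */
}
```

Under this assumption \texttt{fixed\_mul\_1digit(6, 7)} keeps only the low digit of $42$.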
212 | |
213 Essentially at the heart of computer based multiple precision arithmetic are the same long-hand algorithms taught in | |
214 schools to manually add, subtract, multiply and divide. | |
215 | |
216 \subsection{The Need for Multiple Precision Arithmetic} | |
217 The most prevalent need for multiple precision arithmetic, often referred to as ``bignum'' math, is within the implementation | |
218 of public-key cryptography algorithms. Algorithms such as RSA \cite{RSAREF} and Diffie-Hellman \cite{DHREF} require | |
219 integers of significant magnitude to resist known cryptanalytic attacks. For example, at the time of this writing a | |
220 typical RSA modulus would be greater than $10^{309}$. However, modern programming languages such as ISO C \cite{ISOC} and | |
221 Java \cite{JAVA} only provide intrinsic support for integers which are relatively small and single precision. | |
222 | |
223 \begin{figure}[!here] | |
224 \begin{center} | |
225 \begin{tabular}{|r|c|} | |
226 \hline \textbf{Data Type} & \textbf{Range} \\ | |
227 \hline char & $-128 \ldots 127$ \\ | |
228 \hline short & $-32768 \ldots 32767$ \\ | |
229 \hline long & $-2147483648 \ldots 2147483647$ \\ | |
230 \hline long long & $-9223372036854775808 \ldots 9223372036854775807$ \\ | |
231 \hline | |
232 \end{tabular} | |
233 \end{center} | |
234 \caption{Typical Data Types for the C Programming Language} | |
235 \label{fig:ISOC} | |
236 \end{figure} | |
237 | |
238 The largest data type guaranteed to be provided by the ISO C programming | |
239 language\footnote{As per the ISO C standard. However, each compiler vendor is allowed to augment the precision as they | |
240 see fit.} can only represent values up to $10^{19}$ as shown in figure \ref{fig:ISOC}. On its own the C language is | |
241 insufficient to accommodate the magnitude required for the problem at hand. An RSA modulus of magnitude $10^{19}$ could be | |
242 trivially factored\footnote{A Pollard-Rho factoring would take only $2^{16}$ time.} on the average desktop computer, | |
243 rendering any protocol based on the algorithm insecure. Multiple precision algorithms solve this very problem by | |
244 extending the range of representable integers while using single precision data types. | |
245 | |
246 Most advancements in fast multiple precision arithmetic stem from the need for faster and more efficient cryptographic | |
247 primitives. Faster modular reduction and exponentiation algorithms such as Barrett's algorithm, which have appeared in | |
248 various cryptographic journals, can render algorithms such as RSA and Diffie-Hellman more efficient. In fact, several | |
249 major companies such as RSA Security, Certicom and Entrust have built entire product lines on the implementation and | |
250 deployment of efficient algorithms. | |
251 | |
252 However, cryptography is not the only field of study that can benefit from fast multiple precision integer routines. | |
253 Another auxiliary use of multiple precision integers is high precision floating point data types. | |
254 The basic IEEE \cite{IEEE} standard floating point type is made up of an integer mantissa $q$, an exponent $e$ and a sign bit $s$. | |
255 Numbers are given in the form $n = q \cdot b^e \cdot (-1)^s$ where $b = 2$ is the most common base for IEEE. Since IEEE | |
256 floating point is meant to be implemented in hardware the precision of the mantissa is often fairly small | |
257 (\textit{23, 52 and 64 bits}). The mantissa is merely an integer and a multiple precision integer could be used to create | |
258 a mantissa of much larger precision than hardware alone can efficiently support. This approach could be useful where | |
259 scientific applications must minimize the total output error over long calculations. | |
260 | |
261 Another use for large integers is within arithmetic on polynomials of large characteristic (i.e. $GF(p)[x]$ for large $p$). | |
262 In fact the library discussed within this text has already been used to form a polynomial basis library\footnote{See \url{http://poly.libtomcrypt.org} for more details.}. | |
263 | |
264 \subsection{Benefits of Multiple Precision Arithmetic} | |
265 \index{precision} | |
266 The benefit of multiple precision representations over single or fixed precision representations is that | |
267 no precision is lost while representing the result of an operation which requires excess precision. For example, | |
268 the product of two $n$-bit integers requires at least $2n$ bits of precision to be represented faithfully. A multiple | |
269 precision algorithm would augment the precision of the destination to accommodate the result while a single precision system | |
270 would truncate excess bits to maintain a fixed level of precision. | |
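
As a small sketch of this difference, assume $n = 32$ and the fixed width types of \texttt{stdint.h}. The function names are hypothetical and serve only to contrast a product that keeps all $2n$ bits with one that is truncated to $n$ bits.

```c
#include <stdint.h>

/* Full product: the operands are widened to 64 bits before the
 * multiplication, so all 2n = 64 bits of the result are kept. */
uint64_t mul_full(uint32_t a, uint32_t b)
{
    return (uint64_t)a * (uint64_t)b;
}

/* Fixed precision product: the same multiplication, but the result
 * is truncated to the low n = 32 bits. */
uint32_t mul_truncated(uint32_t a, uint32_t b)
{
    return (uint32_t)((uint64_t)a * (uint64_t)b);
}
```

For the largest 32-bit input, $2^{32} - 1$, the full square is $2^{64} - 2^{33} + 1$ while the truncated result keeps only the low 32 bits.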
271 | |
272 It is possible to implement algorithms which require large integers with fixed precision algorithms. For example, elliptic | |
273 curve cryptography (\textit{ECC}) is often implemented on smartcards by fixing the precision of the integers to the maximum | |
274 size the system will ever need. Such an approach can lead to vastly simpler algorithms which can accommodate the | |
275 integers required even if the host platform cannot natively accommodate them\footnote{For example, the average smartcard | |
276 processor has an 8 bit accumulator.}. However, as efficient as such an approach may be, the resulting source code is not | |
277 normally very flexible. It cannot, at runtime, accommodate inputs of higher magnitude than the designer anticipated. | |
278 | |
279 Multiple precision algorithms have the most overhead of any style of arithmetic. For the most part the | |
280 overhead can be kept to a minimum with careful planning, but overall, they are not well suited for most memory starved | |
281 platforms. However, multiple precision algorithms do offer the most flexibility in terms of the magnitude of the | |
282 inputs. That is, the same algorithms based on multiple precision integers can accommodate any reasonable size input | |
283 without the designer's explicit forethought. This leads to lower cost of ownership for the code as it only has to | |
284 be written and tested once. | |
285 | |
286 \section{Purpose of This Text} | |
287 The purpose of this text is to instruct the reader regarding how to implement efficient multiple precision algorithms. | |
288 That is, to explain not only a limited subset of the core theory behind the algorithms but also the various ``housekeeping'' | |
289 elements that are neglected by authors of other texts on the subject. Several well renowned texts \cite{TAOCPV2,HAC} | |
290 give considerably detailed explanations of the theoretical aspects of algorithms and often very little information | |
291 regarding the practical implementation aspects. | |
292 | |
293 In most cases how an algorithm is explained and how it is actually implemented are two very different concepts. For | |
294 example, the Handbook of Applied Cryptography (\textit{HAC}), algorithm 14.7 on page 594, gives a relatively simple | |
295 algorithm for performing multiple precision integer addition. However, the description lacks any discussion concerning | |
296 the fact that the two integer inputs may be of differing magnitudes. As a result the implementation is not as simple | |
297 as the text would lead people to believe. Similarly the division routine (\textit{algorithm 14.20, pp. 598}) does not | |
298 discuss how to handle sign or handle the dividend's decreasing magnitude in the main loop (\textit{step \#3}). | |
299 | |
300 Both texts also do not discuss several key optimal algorithms required such as ``Comba'' and Karatsuba multipliers | |
301 and fast modular inversion, which we consider practical oversights. These optimal algorithms are vital to achieve | |
302 any form of useful performance in non-trivial applications. | |
303 | |
304 To solve this problem the focus of this text is on the practical aspects of implementing a multiple precision integer | |
305 package. As a case study the ``LibTomMath''\footnote{Available at \url{http://math.libtomcrypt.org}} package is used | |
306 to demonstrate algorithms with real implementations\footnote{In the ISO C programming language.} that have been field | |
307 tested and work very well. The LibTomMath library is freely available on the Internet for all uses and this text | |
308 discusses a very large portion of the inner workings of the library. | |
309 | |
310 The algorithms that are presented will always include at least one ``pseudo-code'' description followed | |
311 by the actual C source code that implements the algorithm. The pseudo-code can be used to implement the same | |
312 algorithm in other programming languages as the reader sees fit. | |
313 | |
314 This text shall also serve as a walkthrough of the creation of multiple precision algorithms from scratch, showing | |
315 the reader how the algorithms fit together as well as where to start on various tasks. | |
316 | |
317 \section{Discussion and Notation} | |
318 \subsection{Notation} | |
319 A multiple precision integer of $n$ digits shall be denoted as $x = (x_{n-1} \ldots x_1 x_0)_{ \beta }$ and represents | |
320 the integer $x \equiv \sum_{i=0}^{n-1} x_i\beta^i$. The elements of the array $x$ are said to be the radix $\beta$ digits | |
321 of the integer. For example, $x = (1,2,3)_{10}$ would represent the integer | |
322 $1\cdot 10^2 + 2\cdot10^1 + 3\cdot10^0 = 123$. | |
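
The sum above can be evaluated with Horner's rule. The following is a minimal sketch, assuming the digits are stored least significant first (so \texttt{digits[0]} holds $x_0$) and that the value fits in an \texttt{unsigned long}; the function name is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: evaluates x = sum of digits[i] * beta^i for
 * i = 0 .. n-1.  digits[0] is the least significant digit x_0.
 * Assumes the result fits in an unsigned long. */
unsigned long digits_to_value(const unsigned *digits, size_t n,
                              unsigned beta)
{
    unsigned long value = 0;
    size_t i = n;

    /* Horner's rule: start at the most significant digit x_{n-1}. */
    while (i-- > 0) {
        value = value * beta + digits[i];
    }
    return value;
}
```

With the digits of $(1,2,3)_{10}$ stored as \texttt{\{3, 2, 1\}} and $\beta = 10$ this reproduces the example above.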
323 | |
324 \index{mp\_int} | |
325 The term ``mp\_int'' shall refer to a composite structure which contains the digits of the integer it represents, as well | |
326 as auxiliary data required to manipulate the data. These additional members are discussed further in section | |
327 \ref{sec:MPINT}. For the purposes of this text a ``multiple precision integer'' and an ``mp\_int'' are assumed to be | |
328 synonymous. When an algorithm is specified to accept an mp\_int variable it is assumed the various auxiliary data members | |
329 are present as well. An expression of the type \textit{variablename.item} implies that it should evaluate to the | |
330 member named ``item'' of the variable. For example, a string of characters may have a member ``length'' which would | |
331 evaluate to the number of characters in the string. If the string $a$ equals ``hello'' then it follows that | |
332 $a.length = 5$. | |
333 | |
334 For certain discussions more generic algorithms are presented to help the reader understand the final algorithm used | |
335 to solve a given problem. When an algorithm is described as accepting an integer input it is assumed the input is | |
336 a plain integer with no additional multiple-precision members. That is, algorithms that use integers as opposed to | |
337 mp\_ints as inputs do not concern themselves with the housekeeping operations required such as memory management. These | |
338 algorithms will be used to establish the relevant theory which will subsequently be used to describe a multiple | |
339 precision algorithm to solve the same problem. | |
340 | |
341 \subsection{Precision Notation} | |
342 For the purposes of this text a single precision variable must be able to represent integers in the range | |
343 $0 \le x < q \beta$ while a double precision variable must be able to represent integers in the range | |
344 $0 \le x < q \beta^2$. The variable $\beta$ represents the radix of a single digit of a multiple precision integer and | |
345 must be of the form $q^p$ for $q, p \in \Z^+$. The extra radix-$q$ factor allows additions and subtractions to proceed | |
346 without truncation of the carry. Since all modern computers are binary, it is assumed that $q$ is two, for all intents | |
347 and purposes. | |
348 | |
349 \index{mp\_digit} \index{mp\_word} | |
350 Within the source code that will be presented for each algorithm, the data type \textbf{mp\_digit} will represent | |
351 a single precision integer type, while the data type \textbf{mp\_word} will represent a double precision integer type. In | |
352 several algorithms (notably the Comba routines) temporary results will be stored in arrays of double precision mp\_words. | |
353 For the purposes of this text $x_j$ will refer to the $j$'th digit of a single precision array and $\hat x_j$ will refer to | |
354 the $j$'th digit of a double precision array. Whenever an expression is to be assigned to a double precision | |
355 variable it is assumed that all single precision variables are promoted to double precision during the evaluation. | |
356 Expressions that are assigned to a single precision variable are truncated to fit within the precision of a single | |
357 precision data type. | |
358 | |
359 For example, if $\beta = 10^2$ a single precision data type may represent a value in the | |
360 range $0 \le x < 10^3$, while a double precision data type may represent a value in the range $0 \le x < 10^5$. Let | |
361 $a = 23$ and $b = 49$ represent two single precision variables. The single precision product shall be written | |
362 as $c \leftarrow a \cdot b$ while the double precision product shall be written as $\hat c \leftarrow a \cdot b$. | |
363 In this particular case, $\hat c = 1127$ and $c = 127$. The most significant digit of the product would not fit | |
364 in a single precision data type and as a result $c \ne \hat c$. | |
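
The example can be modelled in C. This sketch assumes $q = 10$ and $\beta = 10^2$ as above and simulates the two precisions with a modulus, since no C type has these exact ranges; the names are hypothetical and are not the mp\_digit and mp\_word types of the library.

```c
/* Model of the decimal example: q = 10 and beta = 10^2, so a single
 * precision variable holds 0 <= x < q*beta = 10^3 and a double
 * precision variable holds 0 <= x < q*beta^2 = 10^5. */
enum { SINGLE_LIMIT = 1000, DOUBLE_LIMIT = 100000 };

/* Double precision product: operands are promoted before multiplying. */
unsigned long double_product(unsigned a, unsigned b)
{
    return ((unsigned long)a * b) % DOUBLE_LIMIT;
}

/* Single precision product: the result is truncated to fit. */
unsigned single_product(unsigned a, unsigned b)
{
    return (unsigned)(((unsigned long)a * b) % SINGLE_LIMIT);
}
```

Calling both functions with $a = 23$ and $b = 49$ gives the values of $\hat c$ and $c$ from the text.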
365 | |
366 \subsection{Algorithm Inputs and Outputs} | |
367 Within the algorithm descriptions all variables are assumed to be scalars of either single or double precision | |
368 as indicated. The only exception to this rule is when variables have been indicated to be of type mp\_int. This | |
369 distinction is important as scalars are often used as array indices and various other counters. | |
370 | |
371 \subsection{Mathematical Expressions} | |
372 The $\lfloor \mbox{ } \rfloor$ brackets imply an expression truncated to an integer not greater than the expression | |
373 itself. For example, $\lfloor 5.7 \rfloor = 5$. Similarly the $\lceil \mbox{ } \rceil$ brackets imply an expression | |
374 rounded to an integer not less than the expression itself. For example, $\lceil 5.1 \rceil = 6$. Typically when | |
375 the $/$ division symbol is used the intention is to perform an integer division with truncation. For example, | |
376 $5/2 = 2$ which will often be written as $\lfloor 5/2 \rfloor = 2$ for clarity. When an expression is written as a | |
377 fraction a real value division is implied, for example ${5 \over 2} = 2.5$. | |
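
These conventions map onto C as follows. The sketch is restricted to non-negative integers, where the C division operator truncates exactly like the floor; the helper names are hypothetical.

```c
/* Integer division with truncation: floor(a / b) for a >= 0, b > 0. */
unsigned floor_div(unsigned a, unsigned b)
{
    return a / b;
}

/* Ceiling of a / b for a >= 0, b > 0, using integers only.
 * Assumes a + b - 1 does not overflow. */
unsigned ceil_div(unsigned a, unsigned b)
{
    return (a + b - 1) / b;
}

/* Real valued division, as implied by fraction notation. */
double real_div(unsigned a, unsigned b)
{
    return (double)a / (double)b;
}
```

With $a = 5$ and $b = 2$ the three functions give $2$, $3$ and $2.5$ respectively.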
378 | |
379 The norm of a multiple precision integer, written $\vert \vert x \vert \vert$, will be used to represent the number of digits in the representation | |
380 of the integer. For example, $\vert \vert 123 \vert \vert = 3$ and $\vert \vert 79452 \vert \vert = 5$. | |
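
For a plain integer the norm can be computed by repeated division by the radix. This is a sketch under two assumptions that the text does not state: the radix is passed explicitly (the examples above use radix $10$), and zero is counted as having one digit. The function name is hypothetical.

```c
/* Hypothetical helper: number of radix-beta digits needed to
 * represent x.  Assumes beta >= 2 and counts zero as one digit. */
unsigned digit_count(unsigned long x, unsigned beta)
{
    unsigned n = 0;

    do {
        n++;          /* one more digit consumed */
        x /= beta;    /* drop the least significant digit */
    } while (x != 0);

    return n;
}
```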
381 | |
382 \subsection{Work Effort} | |
383 \index{big-Oh} | |
384 To measure the efficiency of the specified algorithms, a modified big-Oh notation is used. In this system all | |
385 single precision operations are considered to have the same cost\footnote{Except where explicitly noted.}. | |
386 That is a single precision addition, multiplication and division are assumed to take the same time to | |
387 complete. While this is generally not true in practice, it will simplify the discussions considerably. | |
388 | |
389 Some algorithms have slight advantages over others which is why some constants will not be removed in | |
390 the notation. For example, a normal baseline multiplication (section \ref{sec:basemult}) requires $O(n^2)$ work while a | |
391 baseline squaring (section \ref{sec:basesquare}) requires $O({{n^2 + n}\over 2})$ work. In standard big-Oh notation these | |
392 would both be said to be equivalent to $O(n^2)$. However, | |
393 in the context of this text this is not the case as the magnitude of the inputs will typically be rather small. As a | |
394 result small constant factors in the work effort will make an observable difference in algorithm efficiency. | |
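
To see how the two counts differ, the single precision multiplications can be tallied directly. This is only a counting sketch, not the baseline routines themselves: it assumes the multiplier forms every product $x_i y_j$, while the squaring forms each product $x_i x_j$ with $i \le j$ once. The function names are hypothetical.

```c
/* Counts the digit products x_i * y_j formed when multiplying two
 * n-digit inputs: every pair (i, j) is visited, giving n^2. */
unsigned long count_mul_products(unsigned long n)
{
    unsigned long i, j, count = 0;

    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            count++;
        }
    }
    return count;
}

/* Counts the digit products x_i * x_j formed when squaring, assuming
 * each pair with i <= j is computed once: (n^2 + n) / 2 in total. */
unsigned long count_sqr_products(unsigned long n)
{
    unsigned long i, j, count = 0;

    for (i = 0; i < n; i++) {
        for (j = i; j < n; j++) {
            count++;
        }
    }
    return count;
}
```

For $n = 8$ digits the counts are $64$ and $36$, a difference that standard big-Oh notation would hide.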
395 | |
396 All of the algorithms presented in this text have a polynomial time work level. That is, of the form | |
397 $O(n^k)$ for $n, k \in \Z^{+}$. This will help make useful comparisons in terms of the speed of the algorithms and how | |
398 various optimizations will help pay off in the long run. | |
399 | |
400 \section{Exercises} | |
401 Within the more advanced chapters a section will be set aside to give the reader some challenging exercises related to | |
402 the discussion at hand. These exercises are not designed to be prize winning problems, but instead to be thought | |
403 provoking. Wherever possible the problems are forward minded, stating problems that will be answered in subsequent | |
404 chapters. The reader is encouraged to finish the exercises as they appear to get a better understanding of the | |
405 subject material. | |
406 | |
407 That being said, the problems are designed to affirm knowledge of a particular subject matter. Students in particular | |
408 are encouraged to verify they can answer the problems correctly before moving on. | |
409 | |
410 Similar to the exercises of \cite[pp. ix]{TAOCPV2} these exercises are given a scoring system based on the difficulty of | |
411 the problem. However, unlike \cite{TAOCPV2} the problems do not get nearly as hard. The scoring of these | |
412 exercises ranges from one (the easiest) to five (the hardest). The following table summarizes the | |
413 scoring system used. | |
414 | |
415 \begin{figure}[here] | |
416 \begin{center} | |
417 \begin{small} | |
418 \begin{tabular}{|c|l|} | |
419 \hline $\left [ 1 \right ]$ & An easy problem that should only take the reader a matter of \\ | |
420 & minutes to solve. Usually does not involve much computer time \\ | |
421 & to solve. \\ | |
422 \hline $\left [ 2 \right ]$ & An easy problem that involves a marginal amount of computer \\ | |
423 & time usage. Usually requires a program to be written to \\ | |
424 & solve the problem. \\ | |
425 \hline $\left [ 3 \right ]$ & A moderately hard problem that requires a non-trivial amount \\ | |
426 & of work. Usually involves trivial research and development of \\ | |
427 & new theory from the perspective of a student. \\ | |
428 \hline $\left [ 4 \right ]$ & A moderately hard problem that involves a non-trivial amount \\ | |
429 & of work and research, the solution to which will demonstrate \\ | |
430 & a higher mastery of the subject matter. \\ | |
431 \hline $\left [ 5 \right ]$ & A hard problem that involves concepts that are difficult for a \\ | |
432 & novice to solve. Solutions to these problems will demonstrate a \\ | |
433 & complete mastery of the given subject. \\ | |
434 \hline | |
435 \end{tabular} | |
436 \end{small} | |
437 \end{center} | |
438 \caption{Exercise Scoring System} | |
439 \end{figure} | |
440 | |
441 Problems at the first level are meant to be simple questions that the reader can answer quickly without programming a solution or | |
442 devising new theory. These problems are quick tests to see if the material is understood. Problems at the second level | |
443 are also designed to be easy but will require a program or algorithm to be implemented to arrive at the answer. These | |
444 two levels are essentially entry level questions. | |
445 | |
446 Problems at the third level are meant to be a bit more difficult than the first two levels. The answer is often | |
447 fairly obvious but arriving at an exacting solution requires some thought and skill. These problems will almost always | |
448 involve devising a new algorithm or implementing a variation of another algorithm previously presented. Readers who can | |
449 answer these questions will feel comfortable with the concepts behind the topic at hand. | |
450 | |
451 Problems at the fourth level are meant to be similar to those of the level three questions except they will require | |
452 additional research to be completed. The reader will most likely not know the answer right away, nor will the text provide | |
453 the exact details of the answer until a subsequent chapter. | |
454 | |
455 Problems at the fifth level are meant to be the hardest | |
456 problems relative to all the other problems in the chapter. People who can correctly answer fifth level problems have a | |
457 mastery of the subject matter at hand. | |
458 | |
Problems will often be tied together. The purpose of this is to start a chain of thought that will be discussed in future chapters. The reader
is encouraged to answer the follow-up problems and to consider how the problems relate to one another.
461 | |
462 \section{Introduction to LibTomMath} | |
463 | |
464 \subsection{What is LibTomMath?} | |
465 LibTomMath is a free and open source multiple precision integer library written entirely in portable ISO C. By portable it | |
466 is meant that the library does not contain any code that is computer platform dependent or otherwise problematic to use on | |
467 any given platform. | |
468 | |
469 The library has been successfully tested under numerous operating systems including Unix\footnote{All of these | |
470 trademarks belong to their respective rightful owners.}, MacOS, Windows, Linux, PalmOS and on standalone hardware such | |
471 as the Gameboy Advance. The library is designed to contain enough functionality to be able to develop applications such | |
472 as public key cryptosystems and still maintain a relatively small footprint. | |
473 | |
474 \subsection{Goals of LibTomMath} | |
475 | |
476 Libraries which obtain the most efficiency are rarely written in a high level programming language such as C. However, | |
477 even though this library is written entirely in ISO C, considerable care has been taken to optimize the algorithm implementations within the | |
478 library. Specifically the code has been written to work well with the GNU C Compiler (\textit{GCC}) on both x86 and ARM | |
479 processors. Wherever possible, highly efficient algorithms, such as Karatsuba multiplication, sliding window | |
480 exponentiation and Montgomery reduction have been provided to make the library more efficient. | |
481 | |
Even with the nearly optimal and specialized algorithms that have been included, the Application Programming Interface
483 (\textit{API}) has been kept as simple as possible. Often generic place holder routines will make use of specialized | |
484 algorithms automatically without the developer's specific attention. One such example is the generic multiplication | |
485 algorithm \textbf{mp\_mul()} which will automatically use Toom--Cook, Karatsuba, Comba or baseline multiplication | |
486 based on the magnitude of the inputs and the configuration of the library. | |
487 | |
488 Making LibTomMath as efficient as possible is not the only goal of the LibTomMath project. Ideally the library should | |
489 be source compatible with another popular library which makes it more attractive for developers to use. In this case the | |
MPI library was used as an API template for all the basic functions. MPI was chosen because it is another library that fits
491 in the same niche as LibTomMath. Even though LibTomMath uses MPI as the template for the function names and argument | |
492 passing conventions, it has been written from scratch by Tom St Denis. | |
493 | |
494 The project is also meant to act as a learning tool for students, the logic being that no easy-to-follow ``bignum'' | |
495 library exists which can be used to teach computer science students how to perform fast and reliable multiple precision | |
496 integer arithmetic. To this end the source code has been given quite a few comments and algorithm discussion points. | |
497 | |
498 \section{Choice of LibTomMath} | |
499 LibTomMath was chosen as the case study of this text not only because the author of both projects is one and the same but | |
500 for more worthy reasons. Other libraries such as GMP \cite{GMP}, MPI \cite{MPI}, LIP \cite{LIP} and OpenSSL | |
501 \cite{OPENSSL} have multiple precision integer arithmetic routines but would not be ideal for this text for | |
502 reasons that will be explained in the following sub-sections. | |
503 | |
504 \subsection{Code Base} | |
505 The LibTomMath code base is all portable ISO C source code. This means that there are no platform dependent conditional | |
506 segments of code littered throughout the source. This clean and uncluttered approach to the library means that a | |
507 developer can more readily discern the true intent of a given section of source code without trying to keep track of | |
508 what conditional code will be used. | |
509 | |
510 The code base of LibTomMath is well organized. Each function is in its own separate source code file | |
511 which allows the reader to find a given function very quickly. On average there are $76$ lines of code per source | |
file which makes the source very easy to follow. By comparison MPI and LIP are single file projects making code tracing
513 very hard. GMP has many conditional code segments which also hinder tracing. | |
514 | |
515 When compiled with GCC for the x86 processor and optimized for speed the entire library is approximately $100$KiB\footnote{The notation ``KiB'' means $2^{10}$ octets, similarly ``MiB'' means $2^{20}$ octets.} | |
516 which is fairly small compared to GMP (over $250$KiB). LibTomMath is slightly larger than MPI (which compiles to about | |
517 $50$KiB) but LibTomMath is also much faster and more complete than MPI. | |
518 | |
519 \subsection{API Simplicity} | |
520 LibTomMath is designed after the MPI library and shares the API design. Quite often programs that use MPI will build | |
521 with LibTomMath without change. The function names correlate directly to the action they perform. Almost all of the | |
522 functions share the same parameter passing convention. The learning curve is fairly shallow with the API provided | |
523 which is an extremely valuable benefit for the student and developer alike. | |
524 | |
525 The LIP library is an example of a library with an API that is awkward to work with. LIP uses function names that are often ``compressed'' to | |
526 illegible short hand. LibTomMath does not share this characteristic. | |
527 | |
The GMP library also does not return error codes. Instead it uses a POSIX.1 \cite{POSIX1} signal system where errors
are signaled to the host application. This happens to be the fastest approach but definitely not the most versatile. In
effect a math error (e.g. invalid input, heap error, etc.) can cause a program to stop functioning, which is definitely
undesirable in many situations.
532 | |
533 \subsection{Optimizations} | |
534 While LibTomMath is certainly not the fastest library (GMP often beats LibTomMath by a factor of two) it does | |
535 feature a set of optimal algorithms for tasks such as modular reduction, exponentiation, multiplication and squaring. GMP | |
536 and LIP also feature such optimizations while MPI only uses baseline algorithms with no optimizations. GMP lacks a few | |
537 of the additional modular reduction optimizations that LibTomMath features\footnote{At the time of this writing GMP | |
538 only had Barrett and Montgomery modular reduction algorithms.}. | |
539 | |
540 LibTomMath is almost always an order of magnitude faster than the MPI library at computationally expensive tasks such as modular | |
541 exponentiation. In the grand scheme of ``bignum'' libraries LibTomMath is faster than the average library and usually | |
542 slower than the best libraries such as GMP and OpenSSL by only a small factor. | |
543 | |
544 \subsection{Portability and Stability} | |
545 LibTomMath will build ``out of the box'' on any platform equipped with a modern version of the GNU C Compiler | |
(\textit{GCC}). This means that the library will build unmodified, without a configuration step or setting up any
547 variables. LIP and MPI will build ``out of the box'' as well but have numerous known bugs. Most notably the author of | |
548 MPI has recently stopped working on his library and LIP has long since been discontinued. | |
549 | |
550 GMP requires a configuration script to run and will not build out of the box. GMP and LibTomMath are still in active | |
551 development and are very stable across a variety of platforms. | |
552 | |
553 \subsection{Choice} | |
554 LibTomMath is a relatively compact, well documented, highly optimized and portable library which seems only natural for | |
555 the case study of this text. Various source files from the LibTomMath project will be included within the text. However, | |
556 the reader is encouraged to download their own copy of the library to actually be able to work with the library. | |
557 | |
558 \chapter{Getting Started} | |
559 \section{Library Basics} | |
560 The trick to writing any useful library of source code is to build a solid foundation and work outwards from it. First, | |
561 a problem along with allowable solution parameters should be identified and analyzed. In this particular case the | |
inability to accommodate multiple precision integers is the problem. Furthermore, the solution must be written
563 as portable source code that is reasonably efficient across several different computer platforms. | |
564 | |
565 After a foundation is formed the remainder of the library can be designed and implemented in a hierarchical fashion. | |
That is, the lowest level dependencies are implemented first and the most abstract functions last. For example,
567 before implementing a modular exponentiation algorithm one would implement a modular reduction algorithm. | |
568 By building outwards from a base foundation instead of using a parallel design methodology the resulting project is | |
569 highly modular. Being highly modular is a desirable property of any project as it often means the resulting product | |
570 has a small footprint and updates are easy to perform. | |
571 | |
572 Usually when I start a project I will begin with the header file. I define the data types I think I will need and | |
573 prototype the initial functions that are not dependent on other functions (within the library). After I | |
574 implement these base functions I prototype more dependent functions and implement them. The process repeats until | |
575 I implement all of the functions I require. For example, in the case of LibTomMath I implemented functions such as | |
mp\_init() well before I implemented mp\_mul() and even further before I implemented mp\_exptmod(). As an example of
why this design works, note that the Karatsuba and Toom-Cook multipliers were written \textit{after} the
578 dependent function mp\_exptmod() was written. Adding the new multiplication algorithms did not require changes to the | |
579 mp\_exptmod() function itself and lowered the total cost of ownership (\textit{so to speak}) and of development | |
580 for new algorithms. This methodology allows new algorithms to be tested in a complete framework with relative ease. | |
581 | |
582 \begin{center} | |
583 \begin{figure}[here] | |
584 \includegraphics{pics/design_process.ps} | |
585 \caption{Design Flow of the First Few Original LibTomMath Functions.} | |
586 \label{pic:design_process} | |
587 \end{figure} | |
588 \end{center} | |
589 | |
590 Only after the majority of the functions were in place did I pursue a less hierarchical approach to auditing and optimizing | |
591 the source code. For example, one day I may audit the multipliers and the next day the polynomial basis functions. | |
592 | |
593 It only makes sense to begin the text with the preliminary data types and support algorithms required as well. | |
This chapter discusses the core algorithms of the library, which are the dependencies of every other algorithm.
595 | |
596 \section{What is a Multiple Precision Integer?} | |
597 Recall that most programming languages, in particular ISO C \cite{ISOC}, only have fixed precision data types that on their own cannot | |
598 be used to represent values larger than their precision will allow. The purpose of multiple precision algorithms is | |
599 to use fixed precision data types to create and manipulate multiple precision integers which may represent values | |
600 that are very large. | |
601 | |
602 As a well known analogy, school children are taught how to form numbers larger than nine by prepending more radix ten digits. In the decimal system | |
603 the largest single digit value is $9$. However, by concatenating digits together larger numbers may be represented. Newly prepended digits | |
604 (\textit{to the left}) are said to be in a different power of ten column. That is, the number $123$ can be described as having a $1$ in the hundreds | |
605 column, $2$ in the tens column and $3$ in the ones column. Or more formally $123 = 1 \cdot 10^2 + 2 \cdot 10^1 + 3 \cdot 10^0$. Computer based | |
606 multiple precision arithmetic is essentially the same concept. Larger integers are represented by adjoining fixed | |
607 precision computer words with the exception that a different radix is used. | |
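
The column decomposition described above can be sketched in C. The helper below is a hypothetical illustration and not part of LibTomMath; the name to\_digits and its parameters are assumptions made for this example. It splits a fixed precision value into digits of a chosen radix, least significant digit first.

```c
#include <assert.h>
#include <stddef.h>

/* Split `value` into digits of the given radix, least significant first.
 * Returns the number of digits written, or 0 if `max` is too small.
 * Hypothetical helper for illustration only. */
size_t to_digits(unsigned long value, unsigned long radix,
                 unsigned long *digits, size_t max)
{
    size_t n = 0;

    assert(radix >= 2);
    do {
        if (n == max) {
            return 0;               /* not enough room for every digit */
        }
        digits[n++] = value % radix; /* digit for the current column */
        value /= radix;              /* move to the next column */
    } while (value != 0);
    return n;
}
```

With a radix of ten the value $123$ yields the digits $3$, $2$ and $1$ in that order, matching the ones, tens and hundreds columns. A multiple precision library applies the same idea with a much larger radix.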
608 | |
609 What most people probably do not think about explicitly are the various other attributes that describe a multiple precision | |
610 integer. For example, the integer $154_{10}$ has two immediately obvious properties. First, the integer is positive, | |
611 that is the sign of this particular integer is positive as opposed to negative. Second, the integer has three digits in | |
its representation. There is an additional property that the integer possesses that does not concern pencil-and-paper
arithmetic. The third property is how many digit placeholders are available to hold the integer.
614 | |
615 The human analogy of this third property is ensuring there is enough space on the paper to write the integer. For example, | |
616 if one starts writing a large number too far to the right on a piece of paper they will have to erase it and move left. | |
617 Similarly, computer algorithms must maintain strict control over memory usage to ensure that the digits of an integer | |
618 will not exceed the allowed boundaries. These three properties make up what is known as a multiple precision | |
619 integer or mp\_int for short. | |
620 | |
621 \subsection{The mp\_int Structure} | |
622 \label{sec:MPINT} | |
623 The mp\_int structure is the ISO C based manifestation of what represents a multiple precision integer. The ISO C standard does not provide for | |
624 any such data type but it does provide for making composite data types known as structures. The following is the structure definition | |
625 used within LibTomMath. | |
626 | |
627 \index{mp\_int} | |
628 \begin{verbatim} | |
629 typedef struct { | |
630 int used, alloc, sign; | |
631 mp_digit *dp; | |
632 } mp_int; | |
633 \end{verbatim} | |
634 | |
635 The mp\_int structure can be broken down as follows. | |
636 | |
637 \begin{enumerate} | |
638 \item The \textbf{used} parameter denotes how many digits of the array \textbf{dp} contain the digits used to represent | |
639 a given integer. The \textbf{used} count must be positive (or zero) and may not exceed the \textbf{alloc} count. | |
640 | |
641 \item The \textbf{alloc} parameter denotes how | |
642 many digits are available in the array to use by functions before it has to increase in size. When the \textbf{used} count | |
643 of a result would exceed the \textbf{alloc} count all of the algorithms will automatically increase the size of the | |
644 array to accommodate the precision of the result. | |
645 | |
646 \item The pointer \textbf{dp} points to a dynamically allocated array of digits that represent the given multiple | |
647 precision integer. It is padded with $(\textbf{alloc} - \textbf{used})$ zero digits. The array is maintained in a least | |
648 significant digit order. As a pencil and paper analogy the array is organized such that the right most digits are stored | |
649 first starting at the location indexed by zero\footnote{In C all arrays begin at zero.} in the array. For example, | |
650 if \textbf{dp} contains $\lbrace a, b, c, \ldots \rbrace$ where \textbf{dp}$_0 = a$, \textbf{dp}$_1 = b$, \textbf{dp}$_2 = c$, $\ldots$ then | |
651 it would represent the integer $a + b\beta + c\beta^2 + \ldots$ | |
652 | |
653 \index{MP\_ZPOS} \index{MP\_NEG} | |
654 \item The \textbf{sign} parameter denotes the sign as either zero/positive (\textbf{MP\_ZPOS}) or negative (\textbf{MP\_NEG}). | |
655 \end{enumerate} | |
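
As a rough illustration of how the four members fit together, the following sketch evaluates a small integer from its digit array. The toy\_ names, the 8-bit digit type and the radix $\beta = 2^8$ are assumptions chosen for readability; they are not the library's definitions.

```c
#include <assert.h>

typedef unsigned char toy_digit;      /* hypothetical 8-bit digit */
#define TOY_RADIX 256L                /* assumed radix, beta = 2^8 */
enum { TOY_ZPOS = 0, TOY_NEG = 1 };   /* hypothetical sign constants */

typedef struct {
    int used, alloc, sign;
    toy_digit *dp;
} toy_int;

/* Evaluate dp[0] + dp[1]*beta + dp[2]*beta^2 + ... with Horner's rule,
 * starting from the most significant used digit. Only integers of at
 * most three digits are accepted so the result always fits in a long. */
long toy_to_long(const toy_int *a)
{
    long value = 0;
    int i;

    assert(a->used >= 0 && a->used <= 3 && a->used <= a->alloc);
    for (i = a->used - 1; i >= 0; i--) {
        value = value * TOY_RADIX + a->dp[i];
    }
    return (a->sign == TOY_NEG) ? -value : value;
}
```

Only the first \textbf{used} digits contribute to the value. The remaining digits up to \textbf{alloc} are spare capacity.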
656 | |
657 \subsubsection{Valid mp\_int Structures} | |
658 Several rules are placed on the state of an mp\_int structure and are assumed to be followed for reasons of efficiency. | |
659 The only exceptions are when the structure is passed to initialization functions such as mp\_init() and mp\_init\_copy(). | |
660 | |
661 \begin{enumerate} | |
662 \item The value of \textbf{alloc} may not be less than one. That is \textbf{dp} always points to a previously allocated | |
663 array of digits. | |
664 \item The value of \textbf{used} may not exceed \textbf{alloc} and must be greater than or equal to zero. | |
\item The value of \textbf{used} implies the digit at index $(\textbf{used} - 1)$ of the \textbf{dp} array is non-zero. That is,
666 leading zero digits in the most significant positions must be trimmed. | |
667 \begin{enumerate} | |
668 \item Digits in the \textbf{dp} array at and above the \textbf{used} location must be zero. | |
669 \end{enumerate} | |
670 \item The value of \textbf{sign} must be \textbf{MP\_ZPOS} if \textbf{used} is zero; | |
671 this represents the mp\_int value of zero. | |
672 \end{enumerate} | |
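
The rules above can be collected into a single predicate, which may be handy in debugging builds. This is only a sketch: the library itself assumes its inputs are valid and performs no such check, and the names toy\_int, toy\_digit, TOY\_ZPOS, TOY\_NEG and toy\_is\_valid are hypothetical stand-ins.

```c
#include <stddef.h>

typedef unsigned long toy_digit;      /* hypothetical digit type */
enum { TOY_ZPOS = 0, TOY_NEG = 1 };   /* hypothetical sign constants */

typedef struct {
    int used, alloc, sign;
    toy_digit *dp;
} toy_int;

/* Return 1 if `a` satisfies every rule listed above, 0 otherwise. */
int toy_is_valid(const toy_int *a)
{
    int i;

    /* Rule 1: at least one digit is allocated. */
    if (a->dp == NULL || a->alloc < 1) {
        return 0;
    }
    /* Rule 2: 0 <= used <= alloc. */
    if (a->used < 0 || a->used > a->alloc) {
        return 0;
    }
    /* Rule 3: the most significant used digit is non-zero. */
    if (a->used > 0 && a->dp[a->used - 1] == 0) {
        return 0;
    }
    /* Rule 3(a): digits at and above `used` are zero. */
    for (i = a->used; i < a->alloc; i++) {
        if (a->dp[i] != 0) {
            return 0;
        }
    }
    /* Rule 4: the value zero always carries the non-negative sign. */
    if (a->used == 0 && a->sign != TOY_ZPOS) {
        return 0;
    }
    return a->sign == TOY_ZPOS || a->sign == TOY_NEG;
}
```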
673 | |
674 \section{Argument Passing} | |
675 A convention of argument passing must be adopted early on in the development of any library. Making the function | |
676 prototypes consistent will help eliminate many headaches in the future as the library grows to significant complexity. | |
677 In LibTomMath the multiple precision integer functions accept parameters from left to right as pointers to mp\_int | |
678 structures. That means that the source (input) operands are placed on the left and the destination (output) on the right. | |
679 Consider the following examples. | |
680 | |
681 \begin{verbatim} | |
682 mp_mul(&a, &b, &c); /* c = a * b */ | |
683 mp_add(&a, &b, &a); /* a = a + b */ | |
684 mp_sqr(&a, &b); /* b = a * a */ | |
685 \end{verbatim} | |
686 | |
687 The left to right order is a fairly natural way to implement the functions since it lets the developer read aloud the | |
688 functions and make sense of them. For example, the first function would read ``multiply a and b and store in c''. | |
689 | |
690 Certain libraries (\textit{LIP by Lenstra for instance}) accept parameters the other way around, to mimic the order | |
691 of assignment expressions. That is, the destination (output) is on the left and arguments (inputs) are on the right. In | |
692 truth, it is entirely a matter of preference. In the case of LibTomMath the convention from the MPI library has been | |
693 adopted. | |
694 | |
695 Another very useful design consideration, provided for in LibTomMath, is whether to allow argument sources to also be a | |
696 destination. For example, the second example (\textit{mp\_add}) adds $a$ to $b$ and stores in $a$. This is an important | |
697 feature to implement since it allows the calling functions to cut down on the number of variables it must maintain. | |
698 However, to implement this feature specific care has to be given to ensure the destination is not modified before the | |
699 source is fully read. | |
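
A minimal sketch of what this care looks like at the digit level is given below. It is not the library's addition routine; the function name and the radix of ten are assumptions made for readability. Each source digit is read before the destination digit at the same index is written, so the destination may be the same array as either source.

```c
#include <stddef.h>

#define TOY_RADIX 10u   /* hypothetical radix chosen for readability */

/* c = a + b over n digits, least significant first. Returns the final
 * carry. The destination c may point at the same array as a or b. */
unsigned toy_add_digits(const unsigned *a, const unsigned *b,
                        unsigned *c, size_t n)
{
    unsigned carry = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        /* both source digits are read before c[i] is overwritten */
        unsigned sum = a[i] + b[i] + carry;
        c[i]  = sum % TOY_RADIX;
        carry = sum / TOY_RADIX;
    }
    return carry;
}
```

Addition is forgiving because digit $i$ of the result depends only on digits at or below index $i$. Algorithms that read a source digit after writing to the same index, such as a multiplier, generally need a temporary for the result instead.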
700 | |
701 \section{Return Values} | |
702 A well implemented application, no matter what its purpose, should trap as many runtime errors as possible and return them | |
703 to the caller. By catching runtime errors a library can be guaranteed to prevent undefined behaviour. However, the end | |
704 developer can still manage to cause a library to crash. For example, by passing an invalid pointer an application may | |
705 fault by dereferencing memory not owned by the application. | |
706 | |
707 In the case of LibTomMath the only errors that are checked for are related to inappropriate inputs (division by zero for | |
708 instance) and memory allocation errors. It will not check that the mp\_int passed to any function is valid nor | |
709 will it check pointers for validity. Any function that can cause a runtime error will return an error code as an | |
710 \textbf{int} data type with one of the following values. | |
711 | |
712 \index{MP\_OKAY} \index{MP\_VAL} \index{MP\_MEM} | |
713 \begin{center} | |
714 \begin{tabular}{|l|l|} | |
715 \hline \textbf{Value} & \textbf{Meaning} \\ | |
716 \hline \textbf{MP\_OKAY} & The function was successful \\ | |
717 \hline \textbf{MP\_VAL} & One of the input value(s) was invalid \\ | |
718 \hline \textbf{MP\_MEM} & The function ran out of heap memory \\ | |
719 \hline | |
720 \end{tabular} | |
721 \end{center} | |
722 | |
723 When an error is detected within a function it should free any memory it allocated, often during the initialization of | |
724 temporary mp\_ints, and return as soon as possible. The goal is to leave the system in the same state it was when the | |
725 function was called. Error checking with this style of API is fairly simple. | |
726 | |
727 \begin{verbatim} | |
728 int err; | |
729 if ((err = mp_add(&a, &b, &c)) != MP_OKAY) { | |
730 printf("Error: %s\n", mp_error_to_string(err)); | |
731 exit(EXIT_FAILURE); | |
732 } | |
733 \end{verbatim} | |
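
The clean-up rule described above, where a function frees its temporaries and leaves the caller's data untouched when it fails, can be sketched without the library. Everything in this example is hypothetical: the TOY\_ error codes and their values merely mirror the table above, and toy\_div\_each is an invented function.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical error codes; the numeric values are arbitrary. */
enum { TOY_OKAY = 0, TOY_MEM = -2, TOY_VAL = -3 };

/* out[i] = num[i] / den[i] for every i < n.
 * On any error `out` is left exactly as it was on entry. */
int toy_div_each(const unsigned *num, const unsigned *den,
                 size_t n, unsigned *out)
{
    unsigned *tmp;
    size_t i;

    if (n == 0) {
        return TOY_OKAY;
    }
    tmp = malloc(n * sizeof *tmp);      /* temporary for the results */
    if (tmp == NULL) {
        return TOY_MEM;                 /* nothing allocated, nothing to free */
    }
    for (i = 0; i < n; i++) {
        if (den[i] == 0) {
            free(tmp);                  /* release the temporary first */
            return TOY_VAL;             /* then report the invalid input */
        }
        tmp[i] = num[i] / den[i];
    }
    memcpy(out, tmp, n * sizeof *tmp);  /* commit only after full success */
    free(tmp);
    return TOY_OKAY;
}
```

Computing into a temporary and copying at the end is one simple way to guarantee the destination is unchanged on failure, at the cost of an extra allocation.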
734 | |
735 The GMP \cite{GMP} library uses C style \textit{signals} to flag errors which is of questionable use. Not all errors are fatal | |
736 and it was not deemed ideal by the author of LibTomMath to force developers to have signal handlers for such cases. | |
737 | |
738 \section{Initialization and Clearing} | |
739 The logical starting point when actually writing multiple precision integer functions is the initialization and | |
740 clearing of the mp\_int structures. These two algorithms will be used by the majority of the higher level algorithms. | |
741 | |
742 Given the basic mp\_int structure an initialization routine must first allocate memory to hold the digits of | |
743 the integer. Often it is optimal to allocate a sufficiently large pre-set number of digits even though | |
744 the initial integer will represent zero. If only a single digit were allocated quite a few subsequent re-allocations | |
745 would occur when operations are performed on the integers. There is a tradeoff between how many default digits to allocate | |
and how many re-allocations are tolerable. Obviously allocating an excessive number of digits initially will waste
747 memory and become unmanageable. | |
748 | |
749 If the memory for the digits has been successfully allocated then the rest of the members of the structure must | |
750 be initialized. Since the initial state of an mp\_int is to represent the zero integer, the allocated digits must be set | |
to zero. The \textbf{used} count is set to zero and \textbf{sign} is set to \textbf{MP\_ZPOS}.
752 | |
753 \subsection{Initializing an mp\_int} | |
754 An mp\_int is said to be initialized if it is set to a valid, preferably default, state such that all of the members of the | |
755 structure are set to valid values. The mp\_init algorithm will perform such an action. | |
756 | |
757 \begin{figure}[here] | |
758 \begin{center} | |
759 \begin{tabular}{l} | |
760 \hline Algorithm \textbf{mp\_init}. \\ | |
761 \textbf{Input}. An mp\_int $a$ \\ | |
762 \textbf{Output}. Allocate memory and initialize $a$ to a known valid mp\_int state. \\ | |
763 \hline \\ | |
764 1. Allocate memory for \textbf{MP\_PREC} digits. \\ | |
765 2. If the allocation failed return(\textit{MP\_MEM}) \\ | |
766 3. for $n$ from $0$ to $MP\_PREC - 1$ do \\ | |
767 \hspace{3mm}3.1 $a_n \leftarrow 0$\\ | |
768 4. $a.sign \leftarrow MP\_ZPOS$\\ | |
769 5. $a.used \leftarrow 0$\\ | |
770 6. $a.alloc \leftarrow MP\_PREC$\\ | |
771 7. Return(\textit{MP\_OKAY})\\ | |
772 \hline | |
773 \end{tabular} | |
774 \end{center} | |
775 \caption{Algorithm mp\_init} | |
776 \end{figure} | |
777 | |
778 \textbf{Algorithm mp\_init.} | |
779 The \textbf{MP\_PREC} name represents a constant\footnote{Defined in the ``tommath.h'' header file within LibTomMath.} | |
780 used to dictate the minimum precision of allocated mp\_int integers. Ideally, it is at least equal to $32$ since for most | |
781 purposes that will be more than enough. | |
782 | |
783 Memory for the default number of digits is allocated first. If the allocation fails the algorithm returns immediately | |
784 with the \textbf{MP\_MEM} error code. If the allocation succeeds the remaining members of the mp\_int structure | |
785 must be initialized to reflect the default initial state. | |
786 | |
787 The allocated digits are all set to zero (step three) to ensure they are in a known state. The \textbf{sign}, \textbf{used} | |
788 and \textbf{alloc} are subsequently initialized to represent the zero integer. By step seven the algorithm returns a success | |
789 code and the mp\_int $a$ has been successfully initialized to a valid state representing the integer zero. | |
790 | |
791 \textbf{Remark.} | |
This function introduces the idiosyncrasy that all iterative loops, commonly initiated with the ``for'' keyword, iterate incrementally
when the ``to'' keyword is placed between two expressions. For example, ``for $a$ from $b$ to $c$ do'' means that
a subsequent expression (or body of expressions) is to be evaluated up to $c - b + 1$ times so long as $b \le c$. In each
iteration the variable $a$ is replaced by a new integer that lies inclusively between $b$ and $c$. If $b > c$ occurred
the loop would not iterate. By contrast if the ``downto'' keyword were used in place of ``to'' the loop would iterate
decrementally.
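
For readers who prefer C, the two loop forms can be written as follows. This is only a sketch of the pseudo-code convention and the function names are invented for the example. Since both bounds are inclusive, the body of the ``to'' form runs $c - b + 1$ times when $b \le c$ and not at all otherwise.

```c
/* "for a from b to c do": a takes the values b, b+1, ..., c. */
long count_to(long b, long c)
{
    long a, count = 0;

    for (a = b; a <= c; a++) {
        count++;            /* stands in for the loop body */
    }
    return count;
}

/* "for a from b downto c do": a takes the values b, b-1, ..., c. */
long count_downto(long b, long c)
{
    long a, count = 0;

    for (a = b; a >= c; a--) {
        count++;            /* stands in for the loop body */
    }
    return count;
}
```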
798 | |
799 \vspace{+3mm}\begin{small} | |
800 \hspace{-5.1mm}{\bf File}: bn\_mp\_init.c | |
801 \vspace{-3mm} | |
802 \begin{alltt} | |
803 016 | |
804 017 /* init a new bigint */ | |
805 018 int mp_init (mp_int * a) | |
806 019 \{ | |
807 020 /* allocate memory required and clear it */ | |
808 021 a->dp = OPT_CAST(mp_digit) XCALLOC (sizeof (mp_digit), MP_PREC); | |
809 022 if (a->dp == NULL) \{ | |
810 023 return MP_MEM; | |
811 024 \} | |
812 025 | |
813 026 /* set the used to zero, allocated digits to the default precision | |
814 027 * and sign to positive */ | |
815 028 a->used = 0; | |
816 029 a->alloc = MP_PREC; | |
817 030 a->sign = MP_ZPOS; | |
818 031 | |
819 032 return MP_OKAY; | |
820 033 \} | |
821 \end{alltt} | |
822 \end{small} | |
823 | |
One immediate observation of this initialization function is that it does not return a pointer to an mp\_int structure. It
825 is assumed that the caller has already allocated memory for the mp\_int structure, typically on the application stack. The | |
826 call to mp\_init() is used only to initialize the members of the structure to a known default state. | |
827 | |
828 Before any of the other members of the structure are initialized memory from the application heap is allocated with | |
the calloc() function (line 21). The size of the allocated memory is large enough to hold \textbf{MP\_PREC}
830 mp\_digit variables. The calloc() function is used instead\footnote{calloc() will allocate memory in the same | |
831 manner as malloc() except that it also sets the contents to zero upon successfully allocating the memory.} of malloc() | |
832 since digits have to be set to zero for the function to finish correctly. The \textbf{OPT\_CAST} token is a macro | |
833 definition which will turn into a cast from void * to mp\_digit * for C++ compilers. It is not required for C compilers. | |
834 | |
835 After the memory has been successfully allocated the remainder of the members are initialized | |
836 (lines 28 through 30) to their respective default states. At this point the algorithm has succeeded and | |
837 a success code is returned to the calling function. | |
838 | |
839 If this function returns \textbf{MP\_OKAY} it is safe to assume the mp\_int structure has been properly initialized and | |
840 is safe to use with other functions within the library. | |
841 | |
842 \subsection{Clearing an mp\_int} | |
843 When an mp\_int is no longer required by the application, the memory that has been allocated for its digits must be | |
844 returned to the application's memory pool with the mp\_clear algorithm. | |
845 | |
846 \begin{figure}[here] | |
847 \begin{center} | |
848 \begin{tabular}{l} | |
849 \hline Algorithm \textbf{mp\_clear}. \\ | |
850 \textbf{Input}. An mp\_int $a$ \\ | |
851 \textbf{Output}. The memory for $a$ is freed for reuse. \\ | |
852 \hline \\ | |
853 1. If $a$ has been previously freed then return(\textit{MP\_OKAY}). \\ | |
854 2. for $n$ from 0 to $a.used - 1$ do \\ | |
855 \hspace{3mm}2.1 $a_n \leftarrow 0$ \\ | |
856 3. Free the memory allocated for the digits of $a$. \\ | |
857 4. $a.used \leftarrow 0$ \\ | |
858 5. $a.alloc \leftarrow 0$ \\ | |
859 6. $a.sign \leftarrow MP\_ZPOS$ \\ | |
860 7. Return(\textit{MP\_OKAY}). \\ | |
861 \hline | |
862 \end{tabular} | |
863 \end{center} | |
864 \caption{Algorithm mp\_clear} | |
865 \end{figure} | |
866 | |
867 \textbf{Algorithm mp\_clear.} | |
868 This algorithm releases the memory allocated for an mp\_int back into the memory pool for reuse. It is designed | |
869 such that a given mp\_int structure can be cleared multiple times between initializations without attempting to | |
free the memory twice\footnote{In ISO C for example, calling free() twice on the same memory block causes undefined
871 behaviour.}. | |
872 | |
873 The first step determines if the mp\_int structure has been marked as free already. If it has, the algorithm returns | |
874 success immediately as no further actions are required. Otherwise, the algorithm will proceed to put the structure | |
875 in a known empty and otherwise invalid state. First the digits of the mp\_int are set to zero. The memory that has been allocated for the | |
876 digits is then freed. The \textbf{used} and \textbf{alloc} counts are both set to zero and the \textbf{sign} set to | |
\textbf{MP\_ZPOS}. This known fixed state for cleared mp\_int structures will make debugging easier for the end
878 developer. That is, if they spot (via their debugger) an mp\_int they are using that is in this state it will be | |
879 obvious that they erroneously and prematurely cleared the mp\_int structure. | |
880 | |
881 Note that once an mp\_int has been cleared the mp\_int structure is no longer in a valid state for any other algorithm | |
882 with the exception of algorithms mp\_init, mp\_init\_copy, mp\_init\_size and mp\_clear. | |
883 | |
884 \vspace{+3mm}\begin{small} | |
885 \hspace{-5.1mm}{\bf File}: bn\_mp\_clear.c | |
886 \vspace{-3mm} | |
887 \begin{alltt} | |
888 016 | |
889 017 /* clear one (frees) */ | |
890 018 void | |
891 019 mp_clear (mp_int * a) | |
892 020 \{ | |
893 021 /* only do anything if a hasn't been freed previously */ | |
894 022 if (a->dp != NULL) \{ | |
895 023 /* first zero the digits */ | |
896 024 memset (a->dp, 0, sizeof (mp_digit) * a->used); | |
897 025 | |
898 026 /* free ram */ | |
899 027 XFREE(a->dp); | |
900 028 | |
901 029 /* reset members to make debugging easier */ | |
902 030 a->dp = NULL; | |
903 031 a->alloc = a->used = 0; | |
904 032 a->sign = MP_ZPOS; | |
905 033 \} | |
906 034 \} | |
907 \end{alltt} | |
908 \end{small} | |
909 | |
910 The ``if'' statement (line 22) prevents the heap from being corrupted if a user double-frees an | |
911 mp\_int. This is because once the memory is freed the pointer is set to \textbf{NULL} (line 30). | |
912 | |
913 Without the check, code that accidentally calls mp\_clear twice for a given mp\_int structure would try to free the memory | |
914 allocated for the digits twice. This may cause some C libraries to signal a fault. Setting the pointer to | |
915 \textbf{NULL} also helps debug code that inadvertently frees an mp\_int while it is still needed, because attempts | |
916 to reference digits should fail immediately. The allocated digits are set to zero before being freed (line 24). | |
917 This is ideal for cryptographic situations where the integer that the mp\_int represents might need to be kept a secret. | |
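
The double-clear behaviour described above can be sketched in isolation. This is a minimal illustration and not the library's code: the type toy\_int and the functions toy\_init and toy\_clear are hypothetical stand-ins for mp\_int, mp\_init and mp\_clear, and plain free() is assumed in place of XFREE.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for mp_int; only the members discussed above. */
typedef struct {
    int used, alloc, sign;   /* sign: 0 plays the role of MP_ZPOS */
    unsigned long *dp;       /* digit array, NULL once cleared */
} toy_int;

/* Allocate `alloc` zeroed digits; returns 0 on success, -1 on failure. */
static int toy_init(toy_int *a, int alloc)
{
    assert(a != NULL && alloc > 0);
    a->dp = calloc((size_t)alloc, sizeof *a->dp);
    if (a->dp == NULL) {
        return -1;
    }
    a->used = 0;
    a->alloc = alloc;
    a->sign = 0;
    return 0;
}

/* Clear in the style of mp_clear: safe to call more than once. */
static void toy_clear(toy_int *a)
{
    if (a->dp != NULL) {
        /* zero the used digits before releasing them */
        memset(a->dp, 0, sizeof *a->dp * (size_t)a->used);
        free(a->dp);

        /* the NULL pointer is what marks the structure as cleared */
        a->dp = NULL;
        a->alloc = a->used = 0;
        a->sign = 0;
    }
}
```

Calling toy\_clear a second time on the same structure finds the NULL pointer and does nothing, so free() is reached at most once per initialization.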
918 | |
919 \section{Maintenance Algorithms} | |
920 | |
921 The previous sections describe how to initialize and clear an mp\_int structure. To further support operations | |
922 that are to be performed on mp\_int structures (such as addition and multiplication) the dependent algorithms must be | |
923 able to augment the precision of an mp\_int and | |
924 initialize mp\_ints with differing initial conditions. | |
925 | |
926 These algorithms complete the set of low level algorithms required to work with mp\_int structures in the higher level | |
927 algorithms such as addition, multiplication and modular exponentiation. | |
928 | |
929 \subsection{Augmenting an mp\_int's Precision} | |
930 When storing a value in an mp\_int structure, a sufficient number of digits must be available to accommodate the entire | |
931 result of an operation without loss of precision. Quite often the size of the array given by the \textbf{alloc} member | |
932 is large enough to simply increase the \textbf{used} digit count. However, when the size of the array is too small it | |
933 must be re-sized appropriately to accommodate the result. The mp\_grow algorithm will provide this functionality. | |
934 | |
935 \newpage\begin{figure}[here] | |
936 \begin{center} | |
937 \begin{tabular}{l} | |
938 \hline Algorithm \textbf{mp\_grow}. \\ | |
939 \textbf{Input}. An mp\_int $a$ and an integer $b$. \\ | |
940 \textbf{Output}. $a$ is expanded to accommodate $b$ digits. \\ | |
941 \hline \\ | |
942 1. if $a.alloc \ge b$ then return(\textit{MP\_OKAY}) \\ | |
943 2. $u \leftarrow b\mbox{ (mod }MP\_PREC\mbox{)}$ \\ | |
944 3. $v \leftarrow b + 2 \cdot MP\_PREC - u$ \\ | |
945 4. Re-Allocate the array of digits $a$ to size $v$ \\ | |
946 5. If the allocation failed then return(\textit{MP\_MEM}). \\ | |
947 6. for $n$ from $a.alloc$ to $v - 1$ do \\ | |
948 \hspace{+3mm}6.1 $a_n \leftarrow 0$ \\ | |
949 7. $a.alloc \leftarrow v$ \\ | |
950 8. Return(\textit{MP\_OKAY}) \\ | |
951 \hline | |
952 \end{tabular} | |
953 \end{center} | |
954 \caption{Algorithm mp\_grow} | |
955 \end{figure} | |
956 | |
957 \textbf{Algorithm mp\_grow.} | |
958 It is ideal to prevent re-allocations from being performed if they are not required (step one). This is useful to | |
959 prevent mp\_ints from growing excessively in code that erroneously calls mp\_grow. | |
960 | |
961 The requested digit count is padded up to the next multiple of \textbf{MP\_PREC} plus an additional \textbf{MP\_PREC} (steps two and three). | |
962 This helps prevent many trivial reallocations that would grow an mp\_int by trivially small values. | |
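
The padding rule of steps two and three can be checked with a few numbers. The helper below is a minimal sketch: the name padded\_size and its second parameter (standing in for \textbf{MP\_PREC}) are assumptions made for illustration and are not part of the library.

```c
#include <assert.h>

/* Pad a requested digit count as in steps two and three:
 *   u = size mod prec,  v = size + 2*prec - u
 * `prec` plays the role of MP_PREC and must be positive. */
static int padded_size(int size, int prec)
{
    assert(size >= 0 && prec > 0);
    return size + 2 * prec - (size % prec);
}
```

With an assumed precision of 8 digits, a request for 10 digits is padded to 24 and a request for 16 digits is padded to 32. The padded count is always a multiple of the precision, and it exceeds the request by more than one and at most two multiples of the precision.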
963 | |
964 It is assumed that the reallocation (step four) leaves the lower $a.alloc$ digits of the mp\_int intact. This is much | |
965 akin to how the \textit{realloc} function from the standard C library works. Since the newly allocated digits are | |
966 assumed to contain undefined values they are initially set to zero. | |
967 | |
968 \vspace{+3mm}\begin{small} | |
969 \hspace{-5.1mm}{\bf File}: bn\_mp\_grow.c | |
970 \vspace{-3mm} | |
971 \begin{alltt} | |
972 016 | |
973 017 /* grow as required */ | |
974 018 int mp_grow (mp_int * a, int size) | |
975 019 \{ | |
976 020 int i; | |
977 021 mp_digit *tmp; | |
978 022 | |
979 023 /* if the alloc size is smaller alloc more ram */ | |
980 024 if (a->alloc < size) \{ | |
981 025 /* ensure there are always at least MP_PREC digits extra on top */ | |
982 026 size += (MP_PREC * 2) - (size % MP_PREC); | |
983 027 | |
984 028 /* reallocate the array a->dp | |
985 029 * | |
986 030 * We store the return in a temporary variable | |
987 031 * in case the operation failed we don't want | |
988 032 * to overwrite the dp member of a. | |
989 033 */ | |
990 034 tmp = OPT_CAST(mp_digit) XREALLOC (a->dp, sizeof (mp_digit) * size); | |
991 035 if (tmp == NULL) \{ | |
992 036 /* reallocation failed but "a" is still valid [can be freed] */ | |
993 037 return MP_MEM; | |
994 038 \} | |
995 039 | |
996 040 /* reallocation succeeded so set a->dp */ | |
997 041 a->dp = tmp; | |
998 042 | |
999 043 /* zero excess digits */ | |
1000 044 i = a->alloc; | |
1001 045 a->alloc = size; | |
1002 046 for (; i < a->alloc; i++) \{ | |
1003 047 a->dp[i] = 0; | |
1004 048 \} | |
1005 049 \} | |
1006 050 return MP_OKAY; | |
1007 051 \} | |
1008 \end{alltt} | |
1009 \end{small} | |
1010 | |
1011 The first step is to see if we actually need to perform a re-allocation at all (line 24). If a reallocation | |
1012 must occur the digit count is padded upwards to help prevent many trivial reallocations (line 26). Next the reallocation is performed | |
1013 and the return of realloc() is stored in a temporary pointer named $tmp$ (line 34). The return is stored in a temporary | |
1014 instead of $a.dp$ to prevent the code from losing the original pointer in case the reallocation fails. Had the return been stored | |
1015 in $a.dp$ instead there would be no way to reclaim the heap originally used. | |
1016 | |
1017 If the reallocation fails the function will return \textbf{MP\_MEM} (line 37), otherwise, the value of $tmp$ is assigned | |
1018 to the pointer $a.dp$ and the function continues. A simple for loop from line 46 to line 48 will zero all digits | |
1019 that were above the old \textbf{alloc} limit to make sure the integer is in a known state. | |
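
The temporary pointer idiom is independent of mp\_int and can be shown on a bare digit array. The following is a minimal sketch under stated assumptions: grow\_digits is a hypothetical helper, unsigned long stands in for mp\_digit, and the standard realloc() is used directly instead of XREALLOC.

```c
#include <assert.h>
#include <stdlib.h>

/* Grow *buf from old_n to new_n elements and zero the new tail.
 * Returns 0 on success. On failure returns -1 and leaves *buf
 * untouched, so the caller can still use or free the old block. */
static int grow_digits(unsigned long **buf, size_t old_n, size_t new_n)
{
    unsigned long *tmp;
    size_t i;

    assert(buf != NULL && new_n > old_n);

    /* keep the result in a temporary: on failure realloc returns NULL
     * but does not free the original block */
    tmp = realloc(*buf, new_n * sizeof *tmp);
    if (tmp == NULL) {
        return -1;
    }
    *buf = tmp;

    /* the newly added elements hold indeterminate values, so zero them */
    for (i = old_n; i < new_n; i++) {
        tmp[i] = 0;
    }
    return 0;
}
```

Had the result been assigned straight to *buf, a failed call would overwrite the only copy of the original pointer and the old block could never be freed.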
1020 | |
1021 \subsection{Initializing Variable Precision mp\_ints} | |
1022 Occasionally the number of digits required will be known in advance of an initialization, based on, for example, the size | |
1023 of input mp\_ints to a given algorithm. The purpose of algorithm mp\_init\_size is similar to mp\_init except that it | |
1024 will allocate \textit{at least} a specified number of digits. | |
1025 | |
1026 \begin{figure}[here] | |
1027 \begin{small} | |
1028 \begin{center} | |
1029 \begin{tabular}{l} | |
1030 \hline Algorithm \textbf{mp\_init\_size}. \\ | |
1031 \textbf{Input}. An mp\_int $a$ and the requested number of digits $b$. \\ | |
1032 \textbf{Output}. $a$ is initialized to hold at least $b$ digits. \\ | |
1033 \hline \\ | |
1034 1. $u \leftarrow b \mbox{ (mod }MP\_PREC\mbox{)}$ \\ | |
1035 2. $v \leftarrow b + 2 \cdot MP\_PREC - u$ \\ | |
1036 3. Allocate $v$ digits. \\ | |
1037 4. for $n$ from $0$ to $v - 1$ do \\ | |
1038 \hspace{3mm}4.1 $a_n \leftarrow 0$ \\ | |
1039 5. $a.sign \leftarrow MP\_ZPOS$\\ | |
1040 6. $a.used \leftarrow 0$\\ | |
1041 7. $a.alloc \leftarrow v$\\ | |
1042 8. Return(\textit{MP\_OKAY})\\ | |
1043 \hline | |
1044 \end{tabular} | |
1045 \end{center} | |
1046 \end{small} | |
1047 \caption{Algorithm mp\_init\_size} | |
1048 \end{figure} | |
1049 | |
1050 \textbf{Algorithm mp\_init\_size.} | |
1051 This algorithm will initialize an mp\_int structure $a$ like algorithm mp\_init with the exception that the number of | |
1052 digits allocated can be controlled by the second input argument $b$. The input size is padded upwards so it is a | |
1053 multiple of \textbf{MP\_PREC} plus an additional \textbf{MP\_PREC} digits. This padding is used to prevent trivial | |
1054 allocations from becoming a bottleneck in the rest of the algorithms. | |
1055 | |
1056 Like algorithm mp\_init, the mp\_int structure is initialized to a default state representing the integer zero. This | |
1057 particular algorithm is useful if the approximate size of the input is known ahead of time. If the approximation is | |
1058 correct no further memory re-allocations are required to work with the mp\_int. | |
1059 | |
1060 \vspace{+3mm}\begin{small} | |
1061 \hspace{-5.1mm}{\bf File}: bn\_mp\_init\_size.c | |
1062 \vspace{-3mm} | |
1063 \begin{alltt} | |
1064 016 | |
1065 017 /* init an mp_int for a given size */ | |
1066 018 int mp_init_size (mp_int * a, int size) | |
1067 019 \{ | |
1068 020 /* pad size so there are always extra digits */ | |
1069 021 size += (MP_PREC * 2) - (size % MP_PREC); | |
1070 022 | |
1071 023 /* alloc mem */ | |
1072 024 a->dp = OPT_CAST(mp_digit) XCALLOC (sizeof (mp_digit), size); | |
1073 025 if (a->dp == NULL) \{ | |
1074 026 return MP_MEM; | |
1075 027 \} | |
1076 028 a->used = 0; | |
1077 029 a->alloc = size; | |
1078 030 a->sign = MP_ZPOS; | |
1079 031 | |
1080 032 return MP_OKAY; | |
1081 033 \} | |
1082 \end{alltt} | |
1083 \end{small} | |
1084 | |
1085 The number of digits $b$ requested is padded (line 21) by first augmenting it to the next multiple of | |
1086 \textbf{MP\_PREC} and then adding \textbf{MP\_PREC} to the result. If the memory can be successfully allocated the | |
1087 mp\_int is placed in a default state representing the integer zero. Otherwise, the error code \textbf{MP\_MEM} will be | |
1088 returned (line 26). | |
1089 | |
1090 The digits are allocated and set to zero at the same time with the calloc() function (line 24). The | |
1091 \textbf{used} count is set to zero, the \textbf{alloc} count set to the padded digit count and the \textbf{sign} flag set | |
1092 to \textbf{MP\_ZPOS} to achieve a default valid mp\_int state (lines 28, 29 and 30). If the function | |
1093 returns successfully then it is correct to assume that the mp\_int structure is in a valid state for the remainder of the | |
1094 functions to work with. | |
1095 | |
1096 \subsection{Multiple Integer Initializations and Clearings} | |
1097 Occasionally a function will require a series of mp\_int data types to be made available simultaneously. | |
1098 The purpose of algorithm mp\_init\_multi is to initialize a variable length array of mp\_int structures in a single | |
1099 statement. It is essentially a shortcut to multiple initializations. | |
1100 | |
1101 \newpage\begin{figure}[here] | |
1102 \begin{center} | |
1103 \begin{tabular}{l} | |
1104 \hline Algorithm \textbf{mp\_init\_multi}. \\ | |
1105 \textbf{Input}. Variable length array $V_k$ of mp\_int variables of length $k$. \\ | |
1106 \textbf{Output}. The array is initialized such that each mp\_int of $V_k$ is ready to use. \\ | |
1107 \hline \\ | |
1108 1. for $n$ from 0 to $k - 1$ do \\ | |
1109 \hspace{+3mm}1.1. Initialize the mp\_int $V_n$ (\textit{mp\_init}) \\ | |
1110 \hspace{+3mm}1.2. If initialization failed then do \\ | |
1111 \hspace{+6mm}1.2.1. for $j$ from $0$ to $n - 1$ do \\ | |
1112 \hspace{+9mm}1.2.1.1. Free the mp\_int $V_j$ (\textit{mp\_clear}) \\ | |
1113 \hspace{+6mm}1.2.2. Return(\textit{MP\_MEM}) \\ | |
1114 2. Return(\textit{MP\_OKAY}) \\ | |
1115 \hline | |
1116 \end{tabular} | |
1117 \end{center} | |
1118 \caption{Algorithm mp\_init\_multi} | |
1119 \end{figure} | |
1120 | |
1121 \textbf{Algorithm mp\_init\_multi.} | |
1122 The algorithm will initialize the array of mp\_int variables one at a time. If a runtime error has been detected | |
1123 (\textit{step 1.2}) all of the previously initialized variables are cleared. The goal is an ``all or nothing'' | |
1124 initialization which allows for quick recovery from runtime errors. | |
1125 | |
1126 \vspace{+3mm}\begin{small} | |
1127 \hspace{-5.1mm}{\bf File}: bn\_mp\_init\_multi.c | |
1128 \vspace{-3mm} | |
1129 \begin{alltt} | |
1130 016 #include <stdarg.h> | |
1131 017 | |
1132 018 int mp_init_multi(mp_int *mp, ...) | |
1133 019 \{ | |
1134 020 mp_err res = MP_OKAY; /* Assume ok until proven otherwise */ | |
1135 021 int n = 0; /* Number of ok inits */ | |
1136 022 mp_int* cur_arg = mp; | |
1137 023 va_list args; | |
1138 024 | |
1139 025 va_start(args, mp); /* init args to next argument from caller */ | |
1140 026 while (cur_arg != NULL) \{ | |
1141 027 if (mp_init(cur_arg) != MP_OKAY) \{ | |
1142 028 /* Oops - error! Back-track and mp_clear what we already | |
1143 029 succeeded in init-ing, then return error. | |
1144 030 */ | |
1145 031 va_list clean_args; | |
1146 032 | |
1147 033 /* end the current list */ | |
1148 034 va_end(args); | |
1149 035 | |
1150 036 /* now start cleaning up */ | |
1151 037 cur_arg = mp; | |
1152 038 va_start(clean_args, mp); | |
1153 039 while (n--) \{ | |
1154 040 mp_clear(cur_arg); | |
1155 041 cur_arg = va_arg(clean_args, mp_int*); | |
1156 042 \} | |
1157 043 va_end(clean_args); | |
1158 044 res = MP_MEM; | |
1159 045 break; | |
1160 046 \} | |
1161 047 n++; | |
1162 048 cur_arg = va_arg(args, mp_int*); | |
1163 049 \} | |
1164 050 va_end(args); | |
1165 051 return res; /* MP_OKAY unless an error was flagged above */ | |
1166 052 \} | |
1167 053 | |
1168 \end{alltt} | |
1169 \end{small} | |
1170 | |
1171 This function initializes a variable length list of mp\_int structure pointers. However, instead of having the mp\_int | |
1172 structures in an actual C array they are simply passed as arguments to the function. This function makes use of the | |
1173 ``...'' argument syntax of the C programming language. The list is terminated with a final \textbf{NULL} argument | |
1174 appended on the right. | |
1175 | |
1176 The function uses the ``stdarg.h'' \textit{va} functions to step portably through the arguments to the function. A count | |
1177 $n$ of successfully initialized mp\_int structures is maintained (line 47) such that if a failure does occur, | |
1178 the algorithm can backtrack and free the previously initialized structures (lines 27 to 46). | |
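
The ``all or nothing'' behaviour with a \textbf{NULL} terminated argument list can be exercised with a small stand-alone sketch. Everything named toy\_ below is hypothetical and exists only for illustration, including the failure hook toy\_fail\_at, which forces a chosen call of toy\_init to fail. Unlike the listing above, the sketch returns directly from the error path so that each va\_start is matched by exactly one va\_end.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdlib.h>

/* Hypothetical stand-in for mp_int. */
typedef struct { int alloc; unsigned long *dp; } toy_int;

/* Illustration-only hook: when set to k > 0, the k-th toy_init call fails. */
static int toy_fail_at = 0;
static int toy_calls = 0;

static int toy_init(toy_int *a)
{
    assert(a != NULL);
    a->alloc = 0;
    a->dp = NULL;
    if (++toy_calls == toy_fail_at) {
        return -1;                      /* simulated allocation failure */
    }
    a->dp = calloc(4, sizeof *a->dp);
    if (a->dp == NULL) {
        return -1;
    }
    a->alloc = 4;
    return 0;
}

static void toy_clear(toy_int *a)
{
    free(a->dp);                        /* free(NULL) is a no-op */
    a->dp = NULL;
    a->alloc = 0;
}

/* Initialize every pointer in a NULL-terminated list, or none of them. */
static int toy_init_multi(toy_int *first, ...)
{
    toy_int *cur = first;
    int n = 0;                          /* number of successful inits */
    va_list args;

    va_start(args, first);
    while (cur != NULL) {
        if (toy_init(cur) != 0) {
            va_list clean;

            va_end(args);
            /* walk the list again and clear the first n entries */
            cur = first;
            va_start(clean, first);
            while (n--) {
                toy_clear(cur);
                cur = va_arg(clean, toy_int *);
            }
            va_end(clean);
            return -1;
        }
        n++;
        cur = va_arg(args, toy_int *);
    }
    va_end(args);
    return 0;
}
```

A call has the form toy\_init\_multi(\&a, \&b, \&c, (toy\_int *)NULL). The cast on the terminator is a precaution taken in this sketch, since an uncast NULL passed through ``...'' is not guaranteed to have pointer type on every platform.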
1179 | |
1180 | |
1181 \subsection{Clamping Excess Digits} | |
1182 When a function anticipates a result will be $n$ digits it is simpler to assume this is true within the body of | |
1183 the function instead of checking during the computation. For example, a multiplication of an $i$ digit number by a | |
1184 $j$ digit number produces a result of at most $i + j$ digits. It is entirely possible that the result is $i + j - 1$ | |
1185 digits though, with no final carry into the last position. However, suppose the destination had to be first expanded | |
1186 (\textit{via mp\_grow}) to accommodate $i + j - 1$ digits and then further expanded to accommodate the final carry. | |
1187 That would be a considerable waste of time since heap operations are relatively slow. | |
1188 | |
1189 The ideal solution is to always assume the result is $i + j$ and fix up the \textbf{used} count after the function | |
1190 terminates. This way a single heap operation (\textit{at most}) is required. However, if the result was not checked | |
1191 there would be an excess high order zero digit. | |
1192 | |
1193 For example, suppose the product of two integers was $x_n = (0x_{n-1}x_{n-2}...x_0)_{\beta}$. The leading zero digit | |
1194 will not contribute to the precision of the result. In fact, through subsequent operations more leading zero digits would | |
1195 accumulate to the point the size of the integer would be prohibitive. As a result even though the precision is very | |
1196 low the representation is excessively large. | |
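
The effect is easy to see in base ten, which is chosen here only for readability. With $i = j = 2$ both outcomes occur:

\begin{center}
\begin{tabular}{ll}
$12 \cdot 13 = 156$ & ($i + j - 1 = 3$ digits) \\
$99 \cdot 99 = 9801$ & ($i + j = 4$ digits) \\
\end{tabular}
\end{center}

If four digits are reserved for the first product it is stored as $(0156)_{10}$, and the \textbf{used} count must be lowered from four to three to discard the leading zero.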
1197 | |
1198 The mp\_clamp algorithm is designed to solve this very problem. It will trim high-order zeros by decrementing the | |
1199 \textbf{used} count until a non-zero most significant digit is found. Also in this system, zero is considered to be a | |
1200 positive number which means that if the \textbf{used} count is decremented to zero, the sign must be set to | |
1201 \textbf{MP\_ZPOS}. | |
1202 | |
1203 \begin{figure}[here] | |
1204 \begin{center} | |
1205 \begin{tabular}{l} | |
1206 \hline Algorithm \textbf{mp\_clamp}. \\ | |
1207 \textbf{Input}. An mp\_int $a$ \\ | |
1208 \textbf{Output}. Any excess leading zero digits of $a$ are removed \\ | |
1209 \hline \\ | |
1210 1. while $a.used > 0$ and $a_{a.used - 1} = 0$ do \\ | |
1211 \hspace{+3mm}1.1 $a.used \leftarrow a.used - 1$ \\ | |
1212 2. if $a.used = 0$ then do \\ | |
1213 \hspace{+3mm}2.1 $a.sign \leftarrow MP\_ZPOS$ \\ | |
1214 \hline | |
1215 \end{tabular} | |
1216 \end{center} | |
1217 \caption{Algorithm mp\_clamp} | |
1218 \end{figure} | |
1219 | |
1220 \textbf{Algorithm mp\_clamp.} | |
1221 As can be expected this algorithm is very simple. The loop on step one is expected to iterate only once or twice at | |
1222 the most. For example, this will happen in cases where there is not a carry to fill the last position. Step two fixes the sign for | |
1223 when all of the digits are zero to ensure that the mp\_int is valid at all times. | |
1224 | |
1225 \vspace{+3mm}\begin{small} | |
1226 \hspace{-5.1mm}{\bf File}: bn\_mp\_clamp.c | |
1227 \vspace{-3mm} | |
1228 \begin{alltt} | |
1229 016 | |
1230 017 /* trim unused digits | |
1231 018 * | |
1232 019 * This is used to ensure that leading zero digits are | |
1233 020 * trimmed and the leading "used" digit will be non-zero | |
1234 021 * Typically very fast. Also fixes the sign if there | |
1235 022 * are no more leading digits | |
1236 023 */ | |
1237 024 void | |
1238 025 mp_clamp (mp_int * a) | |
1239 026 \{ | |
1240 027 /* decrease used while the most significant digit is | |
1241 028 * zero. | |
1242 029 */ | |
1243 030 while (a->used > 0 && a->dp[a->used - 1] == 0) \{ | |
1244 031 --(a->used); | |
1245 032 \} | |
1246 033 | |
1247 034 /* reset the sign flag if used == 0 */ | |
1248 035 if (a->used == 0) \{ | |
1249 036 a->sign = MP_ZPOS; | |
1250 037 \} | |
1251 038 \} | |
1252 \end{alltt} | |
1253 \end{small} | |
1254 | |
1255 Note on line 30 how the test of the \textbf{used} count is made on the left of the \&\& operator. In the C programming | |
1256 language the terms to \&\& are evaluated left to right with a boolean short-circuit if any condition fails. This is | |
1257 important since if the \textbf{used} count is zero the test on the right would fetch below the array. That is obviously | |
1258 undesirable. The parentheses on line 31 are used to make sure the \textbf{used} count is decremented and not | |
1259 the pointer ``a''. | |
1260 | |
1261 \section*{Exercises} | |
1262 \begin{tabular}{cl} | |
1263 $\left [ 1 \right ]$ & Discuss the relevance of the \textbf{used} member of the mp\_int structure. \\ | |
1264 & \\ | |
1265 $\left [ 1 \right ]$ & Discuss the consequences of not using padding when performing allocations. \\ | |
1266 & \\ | |
1267 $\left [ 2 \right ]$ & Estimate an ideal value for \textbf{MP\_PREC} when performing 1024-bit RSA \\ | |
1268 & encryption when $\beta = 2^{28}$. \\ | |
1269 & \\ | |
1270 $\left [ 1 \right ]$ & Discuss the relevance of the algorithm mp\_clamp. What does it prevent? \\ | |
1271 & \\ | |
1272 $\left [ 1 \right ]$ & Give an example of when the algorithm mp\_init\_copy might be useful. \\ | |
1273 & \\ | |
1274 \end{tabular} | |
1275 | |
1276 | |
1277 %%% | |
1278 % CHAPTER FOUR | |
1279 %%% | |
1280 | |
1281 \chapter{Basic Operations} | |
1282 | |
1283 \section{Introduction} | |
1284 In the previous chapter a series of low level algorithms were established that dealt with initializing and maintaining | |
1285 mp\_int structures. This chapter will discuss another set of seemingly non-algebraic algorithms which will form the low | |
1286 level basis of the entire library. While these algorithms are relatively trivial it is important to understand how they | |
1287 work before proceeding since these algorithms will be used almost intrinsically in the following chapters. | |
1288 | |
1289 The algorithms in this chapter deal primarily with more ``programmer'' related tasks such as creating copies of | |
1290 mp\_int structures, assigning small values to mp\_int structures and comparisons of the values mp\_int structures | |
1291 represent. | |
1292 | |
1293 \section{Assigning Values to mp\_int Structures} | |
1294 \subsection{Copying an mp\_int} | |
1295 Assigning the value that a given mp\_int structure represents to another mp\_int structure shall be known as making | |
1296 a copy for the purposes of this text. The copy of the mp\_int will be a separate entity that represents the same | |
1297 value as the mp\_int it was copied from. The mp\_copy algorithm provides this functionality. | |
1298 | |
1299 \newpage\begin{figure}[here] | |
1300 \begin{center} | |
1301 \begin{tabular}{l} | |
1302 \hline Algorithm \textbf{mp\_copy}. \\ | |
1303 \textbf{Input}. An mp\_int $a$ and $b$. \\ | |
1304 \textbf{Output}. Store a copy of $a$ in $b$. \\ | |
1305 \hline \\ | |
1306 1. If $b.alloc < a.used$ then grow $b$ to $a.used$ digits. (\textit{mp\_grow}) \\ | |
1307 2. for $n$ from 0 to $a.used - 1$ do \\ | |
1308 \hspace{3mm}2.1 $b_{n} \leftarrow a_{n}$ \\ | |
1309 3. for $n$ from $a.used$ to $b.used - 1$ do \\ | |
1310 \hspace{3mm}3.1 $b_{n} \leftarrow 0$ \\ | |
1311 4. $b.used \leftarrow a.used$ \\ | |
1312 5. $b.sign \leftarrow a.sign$ \\ | |
1313 6. return(\textit{MP\_OKAY}) \\ | |
1314 \hline | |
1315 \end{tabular} | |
1316 \end{center} | |
1317 \caption{Algorithm mp\_copy} | |
1318 \end{figure} | |
1319 | |
1320 \textbf{Algorithm mp\_copy.} | |
1321 This algorithm copies the mp\_int $a$ such that upon successful termination of the algorithm the mp\_int $b$ will | |
1322 represent the same integer as the mp\_int $a$. The mp\_int $b$ shall be a complete and distinct copy of the | |
1323 mp\_int $a$ meaning that the mp\_int $a$ can be modified and it shall not affect the value of the mp\_int $b$. | |
1324 | |
1325 If $b$ does not have enough room for the digits of $a$ it must first have its precision augmented via the mp\_grow | |
1326 algorithm. The digits of $a$ are copied over the digits of $b$ and any excess digits of $b$ are set to zero (steps two | |
1327 and three). The \textbf{used} and \textbf{sign} members of $a$ are finally copied over the respective members of | |
1328 $b$. | |
1329 | |
1330 \textbf{Remark.} This algorithm also introduces a new idiosyncrasy that will be used throughout the rest of the | |
1331 text. The error return codes of other algorithms are not explicitly checked in the pseudo-code presented. For example, in | |
1332 step one of the mp\_copy algorithm the return of mp\_grow is not explicitly checked to ensure it succeeded. Text space is | |
1333 limited so it is assumed that if an algorithm fails it will clear all temporarily allocated mp\_ints and return | |
1334 the error code itself. However, the C code presented will demonstrate all of the error handling logic required to | |
1335 implement the pseudo-code. | |
1336 | |
1337 \vspace{+3mm}\begin{small} | |
1338 \hspace{-5.1mm}{\bf File}: bn\_mp\_copy.c | |
1339 \vspace{-3mm} | |
1340 \begin{alltt} | |
1341 016 | |
1342 017 /* copy, b = a */ | |
1343 018 int | |
1344 019 mp_copy (mp_int * a, mp_int * b) | |
1345 020 \{ | |
1346 021 int res, n; | |
1347 022 | |
1348 023 /* if dst == src do nothing */ | |
1349 024 if (a == b) \{ | |
1350 025 return MP_OKAY; | |
1351 026 \} | |
1352 027 | |
1353 028 /* grow dest */ | |
1354 029 if (b->alloc < a->used) \{ | |
1355 030 if ((res = mp_grow (b, a->used)) != MP_OKAY) \{ | |
1356 031 return res; | |
1357 032 \} | |
1358 033 \} | |
1359 034 | |
1360 035 /* zero b and copy the parameters over */ | |
1361 036 \{ | |
1362 037 register mp_digit *tmpa, *tmpb; | |
1363 038 | |
1364 039 /* pointer aliases */ | |
1365 040 | |
1366 041 /* source */ | |
1367 042 tmpa = a->dp; | |
1368 043 | |
1369 044 /* destination */ | |
1370 045 tmpb = b->dp; | |
1371 046 | |
1372 047 /* copy all the digits */ | |
1373 048 for (n = 0; n < a->used; n++) \{ | |
1374 049 *tmpb++ = *tmpa++; | |
1375 050 \} | |
1376 051 | |
1377 052 /* clear high digits */ | |
1378 053 for (; n < b->used; n++) \{ | |
1379 054 *tmpb++ = 0; | |
1380 055 \} | |
1381 056 \} | |
1382 057 | |
1383 058 /* copy used count and sign */ | |
1384 059 b->used = a->used; | |
1385 060 b->sign = a->sign; | |
1386 061 return MP_OKAY; | |
1387 062 \} | |
1388 \end{alltt} | |
1389 \end{small} | |
1390 | |
1391 Occasionally a dependent algorithm may copy an mp\_int effectively into itself such as when the input and output | |
1392 mp\_int structures passed to a function are one and the same. For this case it is optimal to return immediately without | |
1393 copying digits (line 24). | |
1394 | |
1395 The mp\_int $b$ must have enough digits to accommodate the used digits of the mp\_int $a$. If $b.alloc$ is less than | |
1396 $a.used$ the algorithm mp\_grow is used to augment the precision of $b$ (lines 29 to 33). In order to | |
1397 simplify the inner loop that copies the digits from $a$ to $b$, two aliases $tmpa$ and $tmpb$ point directly at the digits | |
1398 of the mp\_ints $a$ and $b$ respectively. These aliases (lines 42 and 45) allow the compiler to access the digits without first dereferencing the | |
1399 mp\_int pointers and then subsequently the pointer to the digits. | |
1400 | |
1401 After the aliases are established the digits from $a$ are copied into $b$ (lines 48 to 50) and then the excess | |
1402 digits of $b$ are set to zero (lines 53 to 55). Both ``for'' loops make use of the pointer aliases and in | |
1403 fact the alias for $b$ is carried through into the second ``for'' loop to clear the excess digits. This optimization | |
1404 allows the alias to stay in a machine register fairly easily between the two loops. | |
1405 | |
1406 \textbf{Remarks.} The use of pointer aliases is an implementation methodology first introduced in this function that will | |
1407 be used considerably in other functions. Technically, a pointer alias is simply a short hand alias used to lower the | |
1408 number of pointer dereferencing operations required to access data. For example, a for loop may resemble | |
1409 | |
1410 \begin{alltt} | |
1411 for (x = 0; x < 100; x++) \{ | |
1412 a->num[4]->dp[x] = 0; | |
1413 \} | |
1414 \end{alltt} | |
1415 | |
1416 This could be re-written using aliases as | |
1417 | |
1418 \begin{alltt} | |
1419 mp_digit *tmpa; | |
1420 tmpa = a->num[4]->dp; | |
1421 for (x = 0; x < 100; x++) \{ | |
1422 *tmpa++ = 0; | |
1423 \} | |
1424 \end{alltt} | |
1425 | |
1426 In this case an alias is used to access the | |
1427 array of digits within an mp\_int structure directly. It may seem that a pointer alias is strictly not required | |
1428 as a compiler may optimize out the redundant pointer operations. However, there are two dominant reasons to use aliases. | |
1429 | |
1430 The first reason is that compilers differ in how effectively they optimize pointer arithmetic. For example, some optimizations | |
1431 may work for the Microsoft Visual C++ compiler (MSVC) and not for the GNU C Compiler (GCC). Also some optimizations may | |
1432 work for GCC and not MSVC. As such it is ideal to find a common ground for as many compilers as possible. Pointer | |
1433 aliases optimize the code considerably before the compiler even reads the source code which means the end compiled code | |
1434 stands a better chance of being faster. | |
1435 | |
1436 The second reason is that pointer aliases often can make an algorithm simpler to read. Consider the first ``for'' | |
1437 loop of the function mp\_copy() re-written to not use pointer aliases. | |
1438 | |
1439 \begin{alltt} | |
1440 /* copy all the digits */ | |
1441 for (n = 0; n < a->used; n++) \{ | |
1442 b->dp[n] = a->dp[n]; | |
1443 \} | |
1444 \end{alltt} | |
1445 | |
1446 Whether this code is harder to read depends strongly on the individual. However, it is quantifiably slightly more | |
1447 complicated as there are four variables within the statement instead of just two. | |
1448 | |
1449 \subsubsection{Nested Statements} | |
1450 Another commonly used technique in the source routines is that certain sections of code are nested. This is used in | |
1451 particular with the pointer aliases to highlight code phases. For example, a Comba multiplier (discussed in chapter six) | |
1452 will typically have three different phases. First the temporaries are initialized, then the columns calculated and | |
1453 finally the carries are propagated. In this example the middle column production phase will typically be nested as it | |
1454 uses temporary variables and aliases the most. | |
1455 | |
1456 The nesting also simplifies the source code as variables that are nested are only valid for their scope. As a result | |
1457 the various temporary variables required do not propagate into other sections of code. | |
1458 | |
1459 | |
1460 \subsection{Creating a Clone} | |
1461 Another common operation is to make a local temporary copy of an mp\_int argument. To initialize an mp\_int | |
1462 and then copy another existing mp\_int into the newly initialized mp\_int will be known as creating a clone. This is | |
1463 useful within functions that need to modify an argument but do not wish to actually modify the original copy. The | |
1464 mp\_init\_copy algorithm has been designed to help perform this task. | |
1465 | |
1466 \begin{figure}[here] | |
1467 \begin{center} | |
1468 \begin{tabular}{l} | |
1469 \hline Algorithm \textbf{mp\_init\_copy}. \\ | |
1470 \textbf{Input}. An mp\_int $a$ and $b$\\ | |
1471 \textbf{Output}. $a$ is initialized to be a copy of $b$. \\ | |
1472 \hline \\ | |
1473 1. Init $a$. (\textit{mp\_init}) \\ | |
1474 2. Copy $b$ to $a$. (\textit{mp\_copy}) \\ | |
1475 3. Return the status of the copy operation. \\ | |
1476 \hline | |
1477 \end{tabular} | |
1478 \end{center} | |
1479 \caption{Algorithm mp\_init\_copy} | |
1480 \end{figure} | |
1481 | |
1482 \textbf{Algorithm mp\_init\_copy.} | |
1483 This algorithm will initialize an mp\_int variable and copy another previously initialized mp\_int variable into it. As | |
1484 such this algorithm will perform two operations in one step. | |
1485 | |
1486 \vspace{+3mm}\begin{small} | |
1487 \hspace{-5.1mm}{\bf File}: bn\_mp\_init\_copy.c | |
1488 \vspace{-3mm} | |
1489 \begin{alltt} | |
1490 016 | |
1491 017 /* creates "a" then copies b into it */ | |
1492 018 int mp_init_copy (mp_int * a, mp_int * b) | |
1493 019 \{ | |
1494 020 int res; | |
1495 021 | |
1496 022 if ((res = mp_init (a)) != MP_OKAY) \{ | |
1497 023 return res; | |
1498 024 \} | |
1499 025 return mp_copy (b, a); | |
1500 026 \} | |
1501 \end{alltt} | |
1502 \end{small} | |
1503 | |
1504 This will initialize \textbf{a} and make it a verbatim copy of the contents of \textbf{b}. Note that | |
1505 \textbf{a} will have its own memory allocated which means that \textbf{b} may be cleared after the call | |
1506 and \textbf{a} will be left intact. | |
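
The independence of the clone from its source can be demonstrated with a reduced structure. This is a sketch only: toy\_int and toy\_init\_copy are hypothetical names, and the allocation is done directly with calloc() instead of going through mp\_init and mp\_copy.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for mp_int. */
typedef struct { int used; unsigned long *dp; } toy_int;

/* Make dst an independent copy of src.
 * Returns 0 on success and -1 if the allocation fails. */
static int toy_init_copy(toy_int *dst, const toy_int *src)
{
    size_t n;

    assert(dst != NULL && src != NULL && src->used >= 0);
    n = (size_t)src->used;

    /* allocate at least one digit so that a zero value still owns storage */
    dst->dp = calloc(n > 0 ? n : 1, sizeof *dst->dp);
    if (dst->dp == NULL) {
        return -1;
    }
    if (n > 0) {
        memcpy(dst->dp, src->dp, n * sizeof *dst->dp);
    }
    dst->used = src->used;
    return 0;
}
```

Because the clone owns its own digit array, later changes to the source digits (or releasing the source altogether) leave the clone unchanged.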
1507 | |
1508 \section{Zeroing an Integer} | |
1509 Resetting an mp\_int to the default state is a common step in many algorithms. The mp\_zero algorithm will be the algorithm used to | |
1510 perform this task. | |
1511 | |
1512 \begin{figure}[here] | |
1513 \begin{center} | |
1514 \begin{tabular}{l} | |
1515 \hline Algorithm \textbf{mp\_zero}. \\ | |
1516 \textbf{Input}. An mp\_int $a$ \\ | |
1517 \textbf{Output}. Zero the contents of $a$ \\ | |
1518 \hline \\ | |
1519 1. $a.used \leftarrow 0$ \\ | |
1520 2. $a.sign \leftarrow$ MP\_ZPOS \\ | |
1521 3. for $n$ from 0 to $a.alloc - 1$ do \\ | |
1522 \hspace{3mm}3.1 $a_n \leftarrow 0$ \\ | |
1523 \hline | |
1524 \end{tabular} | |
1525 \end{center} | |
1526 \caption{Algorithm mp\_zero} | |
1527 \end{figure} | |
1528 | |
1529 \textbf{Algorithm mp\_zero.} | |
1530 This algorithm simply resets a mp\_int to the default state. | |
1531 | |
1532 \vspace{+3mm}\begin{small} | |
1533 \hspace{-5.1mm}{\bf File}: bn\_mp\_zero.c | |
1534 \vspace{-3mm} | |
1535 \begin{alltt} | |
1536 016 | |
1537 017 /* set to zero */ | |
1538 018 void | |
1539 019 mp_zero (mp_int * a) | |
1540 020 \{ | |
1541 021 a->sign = MP_ZPOS; | |
1542 022 a->used = 0; | |
1543 023 memset (a->dp, 0, sizeof (mp_digit) * a->alloc); | |
1544 024 \} | |
1545 \end{alltt} | |
1546 \end{small} | |
1547 | |
1548 After the function is completed, all of the digits are zeroed, the \textbf{used} count is zeroed and the | |
1549 \textbf{sign} variable is set to \textbf{MP\_ZPOS}. | |
1550 | |
1551 \section{Sign Manipulation} | |
1552 \subsection{Absolute Value} | |
1553 With the mp\_int representation of an integer, calculating the absolute value is trivial. The mp\_abs algorithm will compute | |
1554 the absolute value of an mp\_int. | |
1555 | |
1556 \newpage\begin{figure}[here] | |
1557 \begin{center} | |
1558 \begin{tabular}{l} | |
1559 \hline Algorithm \textbf{mp\_abs}. \\ | |
1560 \textbf{Input}. An mp\_int $a$ \\ | |
1561 \textbf{Output}. Computes $b = \vert a \vert$ \\ | |
1562 \hline \\ | |
1563 1. Copy $a$ to $b$. (\textit{mp\_copy}) \\ | |
1564 2. If the copy failed return(\textit{MP\_MEM}). \\ | |
1565 3. $b.sign \leftarrow MP\_ZPOS$ \\ | |
1566 4. Return(\textit{MP\_OKAY}) \\ | |
1567 \hline | |
1568 \end{tabular} | |
1569 \end{center} | |
1570 \caption{Algorithm mp\_abs} | |
1571 \end{figure} | |
1572 | |
1573 \textbf{Algorithm mp\_abs.} | |
1574 This algorithm computes the absolute value of an mp\_int input. First it copies $a$ over $b$. This is an example of an | |
1575 algorithm where the check in mp\_copy that determines if the source and destination are equal proves useful. This allows, | |
1576 for instance, the developer to pass the same mp\_int as the source and destination to this function without additional | |
1577 logic to handle it. | |
1578 | |
1579 \vspace{+3mm}\begin{small} | |
1580 \hspace{-5.1mm}{\bf File}: bn\_mp\_abs.c | |
1581 \vspace{-3mm} | |
1582 \begin{alltt} | |
1583 016 | |
1584 017 /* b = |a| | |
1585 018 * | |
1586 019 * Simple function copies the input and fixes the sign to positive | |
1587 020 */ | |
1588 021 int | |
1589 022 mp_abs (mp_int * a, mp_int * b) | |
1590 023 \{ | |
1591 024 int res; | |
1592 025 | |
1593 026 /* copy a to b */ | |
1594 027 if (a != b) \{ | |
1595 028 if ((res = mp_copy (a, b)) != MP_OKAY) \{ | |
1596 029 return res; | |
1597 030 \} | |
1598 031 \} | |
1599 032 | |
1600 033 /* force the sign of b to positive */ | |
1601 034 b->sign = MP_ZPOS; | |
1602 035 | |
1603 036 return MP_OKAY; | |
1604 037 \} | |
1605 \end{alltt} | |
1606 \end{small} | |
1607 | |
1608 \subsection{Integer Negation} | |
1609 With the mp\_int representation of an integer, calculating the negation is also trivial. The mp\_neg algorithm will compute | |
1610 the negative of an mp\_int input. | |
1611 | |
1612 \begin{figure}[here] | |
1613 \begin{center} | |
1614 \begin{tabular}{l} | |
1615 \hline Algorithm \textbf{mp\_neg}. \\ | |
1616 \textbf{Input}. An mp\_int $a$ \\ | |
1617 \textbf{Output}. Computes $b = -a$ \\ | |
1618 \hline \\ | |
1619 1. Copy $a$ to $b$. (\textit{mp\_copy}) \\ | |
1620 2. If the copy failed return(\textit{MP\_MEM}). \\ | |
1621 3. If $a.used = 0$ then return(\textit{MP\_OKAY}). \\ | |
1622 4. If $a.sign = MP\_ZPOS$ then do \\ | |
1623 \hspace{3mm}4.1 $b.sign = MP\_NEG$. \\ | |
1624 5. else do \\ | |
1625 \hspace{3mm}5.1 $b.sign = MP\_ZPOS$. \\ | |
1626 6. Return(\textit{MP\_OKAY}) \\ | |
1627 \hline | |
1628 \end{tabular} | |
1629 \end{center} | |
1630 \caption{Algorithm mp\_neg} | |
1631 \end{figure} | |
1632 | |
1633 \textbf{Algorithm mp\_neg.} | |
1634 This algorithm computes the negation of an input. First it copies $a$ over $b$. If $a$ has no used digits then | |
1635 the algorithm returns immediately. Otherwise it flips the sign flag and stores the result in $b$. Note that if | |
1636 $a$ had no digits then it must be positive by definition. Had step three been omitted then the algorithm would return | |
1637 zero as negative. | |
1638 | |
1639 \vspace{+3mm}\begin{small} | |
1640 \hspace{-5.1mm}{\bf File}: bn\_mp\_neg.c | |
1641 \vspace{-3mm} | |
1642 \begin{alltt} | |
1643 016 | |
1644 017 /* b = -a */ | |
1645 018 int mp_neg (mp_int * a, mp_int * b) | |
1646 019 \{ | |
1647 020 int res; | |
1648 021 if ((res = mp_copy (a, b)) != MP_OKAY) \{ | |
1649 022 return res; | |
1650 023 \} | |
1651 024 if (mp_iszero(b) != MP_YES) \{ | |
1652 025 b->sign = (a->sign == MP_ZPOS) ? MP_NEG : MP_ZPOS; | |
1653 026 \} | |
1654 027 return MP_OKAY; | |
1655 028 \} | |
1656 \end{alltt} | |
1657 \end{small} | |
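
The sign handling of both routines can be sketched on a reduced structure. The type toy\_hdr below is a hypothetical stand-in that keeps only the \textbf{used} and \textbf{sign} members (the digit array is omitted), and the helper names are illustrative assumptions rather than LibTomMath functions. The sketch shows that zero keeps a positive sign under negation and that the source and destination may be the same object.

```c
#include <assert.h>

typedef enum { TOY_ZPOS = 0, TOY_NEG = 1 } toy_sign;

/* Hypothetical header-only stand-in for mp_int; digits are omitted. */
typedef struct {
    int used;       /* number of digits in use, 0 means the value zero */
    toy_sign sign;
} toy_hdr;

/* b = |a|: copy, then force the sign to positive. */
void toy_abs(const toy_hdr *a, toy_hdr *b)
{
    *b = *a;
    b->sign = TOY_ZPOS;
}

/* b = -a: copy, then flip the sign unless the value is zero. */
void toy_neg(const toy_hdr *a, toy_hdr *b)
{
    *b = *a;
    if (b->used != 0) {
        b->sign = (b->sign == TOY_ZPOS) ? TOY_NEG : TOY_ZPOS;
    }
}

/* Demo helper: sign after negating a value in place. */
toy_sign sign_after_neg(int used, toy_sign sign)
{
    toy_hdr x;
    x.used = used;
    x.sign = sign;
    toy_neg(&x, &x);    /* same object as source and destination */
    return x.sign;
}

/* Demo helper: sign after taking the absolute value in place. */
toy_sign sign_after_abs(int used, toy_sign sign)
{
    toy_hdr x;
    x.used = used;
    x.sign = sign;
    toy_abs(&x, &x);
    return x.sign;
}
```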
1658 | |
1659 \section{Small Constants} | |
1660 \subsection{Setting Small Constants} | |
1661 Often a mp\_int must be set to a relatively small value such as $1$ or $2$. For these cases the mp\_set algorithm is useful. | |
1662 | |
1663 \begin{figure}[here] | |
1664 \begin{center} | |
1665 \begin{tabular}{l} | |
1666 \hline Algorithm \textbf{mp\_set}. \\ | |
1667 \textbf{Input}. An mp\_int $a$ and a digit $b$ \\ | |
1668 \textbf{Output}. Make $a$ equivalent to $b$ \\ | |
1669 \hline \\ | |
1670 1. Zero $a$ (\textit{mp\_zero}). \\ | |
1671 2. $a_0 \leftarrow b \mbox{ (mod }\beta\mbox{)}$ \\ | |
1672 3. $a.used \leftarrow \left \lbrace \begin{array}{ll} | |
1673 1 & \mbox{if }a_0 > 0 \\ | |
1674 0 & \mbox{if }a_0 = 0 | |
1675 \end{array} \right .$ \\ | |
1676 \hline | |
1677 \end{tabular} | |
1678 \end{center} | |
1679 \caption{Algorithm mp\_set} | |
1680 \end{figure} | |
1681 | |
1682 \textbf{Algorithm mp\_set.} | |
1683 This algorithm sets a mp\_int to a small single digit value. Step number 1 ensures that the integer is reset to the default state. The | |
1684 single digit is set (\textit{modulo $\beta$}) and the \textbf{used} count is adjusted accordingly. | |
1685 | |
1686 \vspace{+3mm}\begin{small} | |
1687 \hspace{-5.1mm}{\bf File}: bn\_mp\_set.c | |
1688 \vspace{-3mm} | |
1689 \begin{alltt} | |
1690 016 | |
1691 017 /* set to a digit */ | |
1692 018 void mp_set (mp_int * a, mp_digit b) | |
1693 019 \{ | |
1694 020 mp_zero (a); | |
1695 021 a->dp[0] = b & MP_MASK; | |
1696 022 a->used = (a->dp[0] != 0) ? 1 : 0; | |
1697 023 \} | |
1698 \end{alltt} | |
1699 \end{small} | |
1700 | |
1701 Line 20 calls mp\_zero() to clear the mp\_int and reset the sign. Line 21 copies the digit | |
1702 into the least significant location. Note the usage of a new constant \textbf{MP\_MASK}. This constant is used to quickly | |
1703 reduce an integer modulo $\beta$. Since $\beta$ is of the form $2^k$ for some suitable $k$ it suffices to perform a binary AND with | |
1704 $MP\_MASK = 2^k - 1$ to perform the reduction. Finally line 22 will set the \textbf{used} member with respect to the | |
1705 digit actually set. This function will always make the integer positive. | |
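
The equivalence between the mask and the modular reduction can be checked directly. The following is a minimal sketch that assumes $k = 28$; the names TOY\_DIGIT\_BIT and TOY\_MASK are hypothetical stand-ins for the library constants.

```c
#include <assert.h>

#define TOY_DIGIT_BIT 28                              /* assumed k */
#define TOY_MASK      ((1UL << TOY_DIGIT_BIT) - 1UL)  /* 2^k - 1   */

/* Reduce b modulo beta = 2^k with a single AND. */
unsigned long reduce_with_mask(unsigned long b)
{
    return b & TOY_MASK;
}

/* The same reduction written as an explicit remainder. */
unsigned long reduce_with_mod(unsigned long b)
{
    return b % (1UL << TOY_DIGIT_BIT);
}
```

Both functions agree for every input because the low $k$ bits of $b$ are exactly the remainder of $b$ divided by $2^k$.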
1706 | |
1707 One important limitation of this function is that it will only set one digit. The size of a digit is not fixed, meaning source that uses | |
1708 this function should take that into account. Only trivially small constants can be set using this function. | |
1709 | |
1710 \subsection{Setting Large Constants} | |
1711 To overcome the limitations of the mp\_set algorithm the mp\_set\_int algorithm is ideal. It accepts a ``long'' | |
1712 data type as input and will always treat it as a 32-bit integer. | |
1713 | |
1714 \begin{figure}[here] | |
1715 \begin{center} | |
1716 \begin{tabular}{l} | |
1717 \hline Algorithm \textbf{mp\_set\_int}. \\ | |
1718 \textbf{Input}. An mp\_int $a$ and a ``long'' integer $b$ \\ | |
1719 \textbf{Output}. Make $a$ equivalent to $b$ \\ | |
1720 \hline \\ | |
1721 1. Zero $a$ (\textit{mp\_zero}) \\ | |
1722 2. for $n$ from 0 to 7 do \\ | |
1723 \hspace{3mm}2.1 $a \leftarrow a \cdot 16$ (\textit{mp\_mul\_2d}) \\ | |
1724 \hspace{3mm}2.2 $u \leftarrow \lfloor b / 2^{4(7 - n)} \rfloor \mbox{ (mod }16\mbox{)}$\\ | |
1725 \hspace{3mm}2.3 $a_0 \leftarrow a_0 + u$ \\ | |
1726 \hspace{3mm}2.4 $a.used \leftarrow a.used + 1$ \\ | |
1727 3. Clamp excess used digits (\textit{mp\_clamp}) \\ | |
1728 \hline | |
1729 \end{tabular} | |
1730 \end{center} | |
1731 \caption{Algorithm mp\_set\_int} | |
1732 \end{figure} | |
1733 | |
1734 \textbf{Algorithm mp\_set\_int.} | |
1735 The algorithm performs eight iterations of a simple loop where in each iteration four bits from the source are added to the | |
1736 mp\_int. Step 2.1 will multiply the current result by sixteen making room for four more bits in the less significant positions. In step 2.2 the | |
1737 next four bits from the source are extracted and in step 2.3 they are added to the mp\_int. In step 2.4 the \textbf{used} digit count is | |
1738 incremented to reflect the addition. The increment is required because if the leading four bit groups were all zero the mp\_int would still have | |
1739 zero digits used and the newly added four bits would be ignored. | |
1740 | |
1741 Excess zero digits are trimmed in steps 2.1 and 3 by using higher level algorithms mp\_mul\_2d and mp\_clamp. | |
1742 | |
1743 \vspace{+3mm}\begin{small} | |
1744 \hspace{-5.1mm}{\bf File}: bn\_mp\_set\_int.c | |
1745 \vspace{-3mm} | |
1746 \begin{alltt} | |
1747 016 | |
1748 017 /* set a 32-bit const */ | |
1749 018 int mp_set_int (mp_int * a, unsigned long b) | |
1750 019 \{ | |
1751 020 int x, res; | |
1752 021 | |
1753 022 mp_zero (a); | |
1754 023 | |
1755 024 /* set four bits at a time */ | |
1756 025 for (x = 0; x < 8; x++) \{ | |
1757 026 /* shift the number up four bits */ | |
1758 027 if ((res = mp_mul_2d (a, 4, a)) != MP_OKAY) \{ | |
1759 028 return res; | |
1760 029 \} | |
1761 030 | |
1762 031 /* OR in the top four bits of the source */ | |
1763 032 a->dp[0] |= (b >> 28) & 15; | |
1764 033 | |
1765 034 /* shift the source up to the next four bits */ | |
1766 035 b <<= 4; | |
1767 036 | |
1768 037 /* ensure that digits are not clamped off */ | |
1769 038 a->used += 1; | |
1770 039 \} | |
1771 040 mp_clamp (a); | |
1772 041 return MP_OKAY; | |
1773 042 \} | |
1774 \end{alltt} | |
1775 \end{small} | |
1776 | |
1777 This function sets four bits of the number at a time to handle all practical \textbf{DIGIT\_BIT} sizes. The weird | |
1778 addition on line 38 ensures that the newly added in bits are added to the number of digits. While it may not | |
1779 seem obvious as to why the digit counter does not grow exceedingly large it is because of the shift on line 27 | |
1780 as well as the call to mp\_clamp() on line 40. Both functions will clamp excess leading digits which keeps | |
1781 the number of used digits low. | |
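
The four-bits-at-a-time loop can be sketched without any multiple precision machinery. In the sketch below a 64-bit accumulator plays the role of the mp\_int, so no \textbf{used} count or clamping is needed; the function name set\_int\_by\_nibbles is a hypothetical illustration and not part of the library.

```c
#include <assert.h>
#include <stdint.h>

/* Rebuild a 32-bit value from its eight nibbles, most significant first. */
uint64_t set_int_by_nibbles(uint32_t b)
{
    uint64_t a = 0;
    int x;

    for (x = 0; x < 8; x++) {
        a <<= 4;                 /* make room: a = a * 16        */
        a |= (b >> 28) & 15u;    /* OR in the top four bits of b */
        b <<= 4;                 /* move the next nibble to the top */
    }
    return a;
}
```

After eight iterations every nibble of the source has passed through the top position exactly once, so the accumulator holds the original value.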
1782 | |
1783 \section{Comparisons} | |
1784 \subsection{Unsigned Comparisons} | |
1785 Comparing two multiple precision integers is performed with the exact same algorithm used to compare two decimal numbers. For example, | |
1786 to compare $1,234$ to $1,264$ the digits are extracted by their positions. That is we compare $1 \cdot 10^3 + 2 \cdot 10^2 + 3 \cdot 10^1 + 4 \cdot 10^0$ | |
1787 to $1 \cdot 10^3 + 2 \cdot 10^2 + 6 \cdot 10^1 + 4 \cdot 10^0$ by comparing single digits at a time starting with the highest magnitude | |
1788 positions. At the first position where the digits differ, the integer with the greater digit must be the greater of the two. | |
1789 | |
1790 The first comparison routine that will be developed is the unsigned magnitude compare which will perform a comparison based on the digits of two | |
1791 mp\_int variables alone. It will ignore the sign of the two inputs. Such a function is useful when an absolute comparison is required or if the | |
1792 signs are known to agree in advance. | |
1793 | |
1794 To facilitate working with the results of the comparison functions three constants are required. | |
1795 | |
1796 \begin{figure}[here] | |
1797 \begin{center} | |
1798 \begin{tabular}{|r|l|} | |
1799 \hline \textbf{Constant} & \textbf{Meaning} \\ | |
1800 \hline \textbf{MP\_GT} & Greater Than \\ | |
1801 \hline \textbf{MP\_EQ} & Equal To \\ | |
1802 \hline \textbf{MP\_LT} & Less Than \\ | |
1803 \hline | |
1804 \end{tabular} | |
1805 \end{center} | |
1806 \caption{Comparison Return Codes} | |
1807 \end{figure} | |
1808 | |
1809 \begin{figure}[here] | |
1810 \begin{center} | |
1811 \begin{tabular}{l} | |
1812 \hline Algorithm \textbf{mp\_cmp\_mag}. \\ | |
1813 \textbf{Input}. Two mp\_ints $a$ and $b$. \\ | |
1814 \textbf{Output}. Unsigned comparison results ($a$ to the left of $b$). \\ | |
1815 \hline \\ | |
1816 1. If $a.used > b.used$ then return(\textit{MP\_GT}) \\ | |
1817 2. If $a.used < b.used$ then return(\textit{MP\_LT}) \\ | |
1818 3. for n from $a.used - 1$ to 0 do \\ | |
1819 \hspace{+3mm}3.1 if $a_n > b_n$ then return(\textit{MP\_GT}) \\ | |
1820 \hspace{+3mm}3.2 if $a_n < b_n$ then return(\textit{MP\_LT}) \\ | |
1821 4. Return(\textit{MP\_EQ}) \\ | |
1822 \hline | |
1823 \end{tabular} | |
1824 \end{center} | |
1825 \caption{Algorithm mp\_cmp\_mag} | |
1826 \end{figure} | |
1827 | |
1828 \textbf{Algorithm mp\_cmp\_mag.} | |
1829 By saying ``$a$ to the left of $b$'' it is meant that the comparison is with respect to $a$, that is if $a$ is greater than $b$ it will return | |
1830 \textbf{MP\_GT} and similar with respect to when $a = b$ and $a < b$. The first two steps compare the number of digits used in both $a$ and $b$. | |
1831 Obviously if the digit counts differ there would be an imaginary zero digit in the smaller number where the leading digit of the larger number is. | |
1832 If both have the same number of digits then the actual digits themselves must be compared starting at the leading digit. | |
1833 | |
1834 By step three both inputs must have the same number of digits so it is safe to start from either $a.used - 1$ or $b.used - 1$ and count down to | |
1835 the zero'th digit. If after all of the digits have been compared, no difference is found, the algorithm returns \textbf{MP\_EQ}. | |
1836 | |
1837 \vspace{+3mm}\begin{small} | |
1838 \hspace{-5.1mm}{\bf File}: bn\_mp\_cmp\_mag.c | |
1839 \vspace{-3mm} | |
1840 \begin{alltt} | |
1841 016 | |
1842 017 /* compare magnitude of two ints (unsigned) */ | |
1843 018 int mp_cmp_mag (mp_int * a, mp_int * b) | |
1844 019 \{ | |
1845 020 int n; | |
1846 021 mp_digit *tmpa, *tmpb; | |
1847 022 | |
1848 023 /* compare based on # of non-zero digits */ | |
1849 024 if (a->used > b->used) \{ | |
1850 025 return MP_GT; | |
1851 026 \} | |
1852 027 | |
1853 028 if (a->used < b->used) \{ | |
1854 029 return MP_LT; | |
1855 030 \} | |
1856 031 | |
1857 032 /* alias for a */ | |
1858 033 tmpa = a->dp + (a->used - 1); | |
1859 034 | |
1860 035 /* alias for b */ | |
1861 036 tmpb = b->dp + (a->used - 1); | |
1862 037 | |
1863 038 /* compare based on digits */ | |
1864 039 for (n = 0; n < a->used; ++n, --tmpa, --tmpb) \{ | |
1865 040 if (*tmpa > *tmpb) \{ | |
1866 041 return MP_GT; | |
1867 042 \} | |
1868 043 | |
1869 044 if (*tmpa < *tmpb) \{ | |
1870 045 return MP_LT; | |
1871 046 \} | |
1872 047 \} | |
1873 048 return MP_EQ; | |
1874 049 \} | |
1875 \end{alltt} | |
1876 \end{small} | |
1877 | |
1878 The two if statements on lines 24 and 28 compare the number of digits in the two inputs. These two are performed before all of the digits | |
1879 are compared since it is a very cheap test to perform and can potentially save considerable time. The implementation given is also not valid | |
1880 without those two statements. $b.alloc$ may be smaller than $a.used$, meaning that undefined values will be read from $b$ past the end of the | |
1881 array of digits. | |
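
The same comparison can be sketched on plain digit arrays. The function toy\_cmp\_mag and the constants TOY\_LT, TOY\_EQ and TOY\_GT below are hypothetical stand-ins; digits are stored least significant first, and the demonstration uses radix 10 so that it reproduces the comparison of $1,234$ with $1,264$ from the start of this section.

```c
#include <assert.h>
#include <stddef.h>

enum { TOY_LT = -1, TOY_EQ = 0, TOY_GT = 1 };

/* Compare two magnitudes given as digit arrays, least significant first.
 * The used counts are compared first, so the digit loop never reads
 * past the end of the shorter array.
 */
int toy_cmp_mag(const unsigned long *a, size_t a_used,
                const unsigned long *b, size_t b_used)
{
    size_t n;

    if (a_used > b_used) return TOY_GT;
    if (a_used < b_used) return TOY_LT;

    /* same length: walk down from the leading digit */
    for (n = a_used; n-- > 0; ) {
        if (a[n] > b[n]) return TOY_GT;
        if (a[n] < b[n]) return TOY_LT;
    }
    return TOY_EQ;
}

/* Demo helper: 1234 against 1264 in radix 10. */
int compare_1234_to_1264(void)
{
    const unsigned long a[] = { 4, 3, 2, 1 };
    const unsigned long b[] = { 4, 6, 2, 1 };
    return toy_cmp_mag(a, 4, b, 4);
}
```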
1882 | |
1883 \subsection{Signed Comparisons} | |
1884 Comparing with sign considerations is also fairly critical in several routines (\textit{division for example}). Based on an unsigned magnitude | |
1885 comparison a trivial signed comparison algorithm can be written. | |
1886 | |
1887 \begin{figure}[here] | |
1888 \begin{center} | |
1889 \begin{tabular}{l} | |
1890 \hline Algorithm \textbf{mp\_cmp}. \\ | |
1891 \textbf{Input}. Two mp\_ints $a$ and $b$ \\ | |
1892 \textbf{Output}. Signed Comparison Results ($a$ to the left of $b$) \\ | |
1893 \hline \\ | |
1894 1. if $a.sign = MP\_NEG$ and $b.sign = MP\_ZPOS$ then return(\textit{MP\_LT}) \\ | |
1895 2. if $a.sign = MP\_ZPOS$ and $b.sign = MP\_NEG$ then return(\textit{MP\_GT}) \\ | |
1896 3. if $a.sign = MP\_NEG$ then \\ | |
1897 \hspace{+3mm}3.1 Return the unsigned comparison of $b$ and $a$ (\textit{mp\_cmp\_mag}) \\ | |
1898 4. Otherwise \\ | |
1899 \hspace{+3mm}4.1 Return the unsigned comparison of $a$ and $b$ \\ | |
1900 \hline | |
1901 \end{tabular} | |
1902 \end{center} | |
1903 \caption{Algorithm mp\_cmp} | |
1904 \end{figure} | |
1905 | |
1906 \textbf{Algorithm mp\_cmp.} | |
1907 The first two steps compare the signs of the two inputs. If the signs do not agree then it can return right away with the appropriate | |
1908 comparison code. When the signs are equal the digits of the inputs must be compared to determine the correct result. In step | |
1909 three the unsigned comparison flips the order of the arguments since they are both negative. For instance, if $-a > -b$ then | |
1910 $\vert a \vert < \vert b \vert$. Step number four will compare the two when they are both positive. | |
1911 | |
1912 \vspace{+3mm}\begin{small} | |
1913 \hspace{-5.1mm}{\bf File}: bn\_mp\_cmp.c | |
1914 \vspace{-3mm} | |
1915 \begin{alltt} | |
1916 016 | |
1917 017 /* compare two ints (signed)*/ | |
1918 018 int | |
1919 019 mp_cmp (mp_int * a, mp_int * b) | |
1920 020 \{ | |
1921 021 /* compare based on sign */ | |
1922 022 if (a->sign != b->sign) \{ | |
1923 023 if (a->sign == MP_NEG) \{ | |
1924 024 return MP_LT; | |
1925 025 \} else \{ | |
1926 026 return MP_GT; | |
1927 027 \} | |
1928 028 \} | |
1929 029 | |
1930 030 /* compare digits */ | |
1931 031 if (a->sign == MP_NEG) \{ | |
1932 032 /* if negative compare opposite direction */ | |
1933 033 return mp_cmp_mag(b, a); | |
1934 034 \} else \{ | |
1935 035 return mp_cmp_mag(a, b); | |
1936 036 \} | |
1937 037 \} | |
1938 \end{alltt} | |
1939 \end{small} | |
1940 | |
1941 The two if statements on lines 22 and 23 perform the initial sign comparison. If the signs are not equal then whichever | |
1942 has the positive sign is larger. At line 31, the inputs are compared based on magnitudes. If the signs were both negative then | |
1943 the unsigned comparison is performed in the opposite direction (\textit{line 33}). Otherwise, the signs are assumed to | |
1944 be both positive and a forward direction unsigned comparison is performed. | |
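
The way a signed comparison is layered on a magnitude comparison can be sketched with a single machine word as the magnitude. The type toy\_sm and the functions below are hypothetical illustrations that only mirror the control flow of mp\_cmp.

```c
#include <assert.h>

enum { TOY_LT = -1, TOY_EQ = 0, TOY_GT = 1 };
enum { TOY_ZPOS = 0, TOY_NEG = 1 };

/* Hypothetical sign-magnitude value; one word stands in for the digits. */
typedef struct {
    int sign;
    unsigned long mag;
} toy_sm;

static int toy_cmp_mag(unsigned long a, unsigned long b)
{
    if (a > b) return TOY_GT;
    if (a < b) return TOY_LT;
    return TOY_EQ;
}

/* Signed compare: decide on sign first, then on magnitude. */
int toy_cmp(const toy_sm *a, const toy_sm *b)
{
    if (a->sign != b->sign) {
        return (a->sign == TOY_NEG) ? TOY_LT : TOY_GT;
    }
    if (a->sign == TOY_NEG) {
        /* both negative: the larger magnitude is the smaller value */
        return toy_cmp_mag(b->mag, a->mag);
    }
    return toy_cmp_mag(a->mag, b->mag);
}

/* Convert a long to sign-magnitude; zero is always positive. */
static toy_sm toy_from_long(long v)
{
    toy_sm r;
    r.sign = (v < 0) ? TOY_NEG : TOY_ZPOS;
    r.mag = (v < 0) ? 0UL - (unsigned long)v : (unsigned long)v;
    return r;
}

/* Demo helper: compare two longs through the sign-magnitude path. */
int toy_cmp_longs(long x, long y)
{
    toy_sm a = toy_from_long(x);
    toy_sm b = toy_from_long(y);
    return toy_cmp(&a, &b);
}
```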
1945 | |
1946 \section*{Exercises} | |
1947 \begin{tabular}{cl} | |
1948 $\left [ 2 \right ]$ & Modify algorithm mp\_set\_int to accept as input a variable length array of bits. \\ | |
1949 & \\ | |
1950 $\left [ 3 \right ]$ & Give the probability that algorithm mp\_cmp\_mag will have to compare $k$ digits \\ | |
1951 & of two random integers (with the same number of digits) before a difference is found. \\ | |
1952 & \\ | |
1953 $\left [ 1 \right ]$ & Suggest a simple method to speed up the implementation of mp\_cmp\_mag based \\ | |
1954 & on the observations made in the previous problem. \\ | |
1955 & | |
1956 \end{tabular} | |
1957 | |
1958 \chapter{Basic Arithmetic} | |
1959 \section{Introduction} | |
1960 At this point algorithms for initialization, clearing, zeroing, copying, comparing and setting small constants have been | |
1961 established. The next logical set of algorithms to develop are addition, subtraction and digit shifting algorithms. These | |
1962 algorithms make use of the lower level algorithms and are crucial building blocks for the multiplication algorithms. It is very important | |
1963 that these algorithms are highly optimized. On their own they are simple $O(n)$ algorithms but they can be called from higher level algorithms | |
1964 which easily places them at $O(n^2)$ or even $O(n^3)$ work levels. | |
1965 | |
1966 All of the algorithms within this chapter make use of the logical bit shift operations denoted by $<<$ and $>>$ for left and right | |
1967 logical shifts respectively. A logical shift is analogous to sliding the decimal point of radix-10 representations. For example, the real | |
1968 number $0.9345$ is equivalent to $93.45\%$ which is found by sliding the decimal point two places to the right (\textit{multiplying by $\beta^2 = 10^2$}). | |
1969 Algebraically a binary logical shift is equivalent to a division or multiplication by a power of two. | |
1970 For example, $a << k = a \cdot 2^k$ while $a >> k = \lfloor a/2^k \rfloor$. | |
1971 | |
1972 One significant difference between a logical shift and the way decimals are shifted is that digits below the zero'th position are removed | |
1973 from the number. For example, consider $1101_2 >> 1$. Sliding the radix point as in decimal notation would produce $110.1_2$. However, with a logical shift the | |
1974 result is $110_2$. | |
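
Both identities, and the truncation of the bits shifted below the zero'th position, can be checked with ordinary unsigned arithmetic. This is a minimal sketch with hypothetical function names; it assumes the shift count $k$ is smaller than the width of an unsigned long.

```c
#include <assert.h>

/* Left shift and multiplication by 2^k (both wrap modulo 2^width). */
unsigned long shift_left(unsigned long a, unsigned k)
{
    return a << k;
}

unsigned long mul_pow2(unsigned long a, unsigned k)
{
    return a * (1UL << k);
}

/* Right shift and floor division by 2^k. */
unsigned long shift_right(unsigned long a, unsigned k)
{
    return a >> k;
}

unsigned long div_pow2(unsigned long a, unsigned k)
{
    return a / (1UL << k);
}
```

With $a = 1101_2 = 13$ the right shift by one yields $110_2 = 6$; the bit that would follow the radix point is simply discarded.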
1975 | |
1976 \section{Addition and Subtraction} | |
1977 In common two's complement fixed precision arithmetic negative numbers are easily represented by subtraction from the modulus. For example, with 32-bit integers | |
1978 $a - b\mbox{ (mod }2^{32}\mbox{)}$ is the same as $a + (2^{32} - b) \mbox{ (mod }2^{32}\mbox{)}$ since $2^{32} \equiv 0 \mbox{ (mod }2^{32}\mbox{)}$. | |
1979 As a result subtraction can be performed with a trivial series of logical operations and an addition. | |
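
The fixed precision case can be sketched with 32-bit unsigned integers, where $2^{32} - b$ is obtained by inverting the bits of $b$ and adding one. The function names are hypothetical and serve only as an illustration.

```c
#include <assert.h>
#include <stdint.h>

/* a - b modulo 2^32, computed directly. */
uint32_t sub_direct(uint32_t a, uint32_t b)
{
    return (uint32_t)(a - b);
}

/* a - b modulo 2^32, computed as a + (2^32 - b) using NOT and +1. */
uint32_t sub_via_add(uint32_t a, uint32_t b)
{
    uint32_t neg_b = (uint32_t)(~b + 1u);   /* 2^32 - b modulo 2^32 */
    return (uint32_t)(a + neg_b);
}
```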
1980 | |
1981 However, in multiple precision arithmetic negative numbers are not represented in the same way. Instead a sign flag is used to keep track of the | |
1982 sign of the integer. As a result signed addition and subtraction are actually implemented as conditional usage of lower level addition or | |
1983 subtraction algorithms with the sign fixed up appropriately. | |
1984 | |
1985 The lower level algorithms will add or subtract integers without regard to the sign flag. That is they will add or subtract the magnitude of | |
1986 the integers respectively. | |
1987 | |
1988 \subsection{Low Level Addition} | |
1989 An unsigned addition of multiple precision integers is performed with the same long-hand algorithm used to add decimal numbers. That is to add the | |
1990 trailing digits first and propagate the resulting carry upwards. Since this is a lower level algorithm the name will have a ``s\_'' prefix. | |
1991 Historically that convention stems from the MPI library where ``s\_'' stood for static functions that were hidden from the developer entirely. | |
1992 | |
1993 \newpage | |
1994 \begin{figure}[!here] | |
1995 \begin{center} | |
1996 \begin{small} | |
1997 \begin{tabular}{l} | |
1998 \hline Algorithm \textbf{s\_mp\_add}. \\ | |
1999 \textbf{Input}. Two mp\_ints $a$ and $b$ \\ | |
2000 \textbf{Output}. The unsigned addition $c = \vert a \vert + \vert b \vert$. \\ | |
2001 \hline \\ | |
2002 1. if $a.used > b.used$ then \\ | |
2003 \hspace{+3mm}1.1 $min \leftarrow b.used$ \\ | |
2004 \hspace{+3mm}1.2 $max \leftarrow a.used$ \\ | |
2005 \hspace{+3mm}1.3 $x \leftarrow a$ \\ | |
2006 2. else \\ | |
2007 \hspace{+3mm}2.1 $min \leftarrow a.used$ \\ | |
2008 \hspace{+3mm}2.2 $max \leftarrow b.used$ \\ | |
2009 \hspace{+3mm}2.3 $x \leftarrow b$ \\ | |
2010 3. If $c.alloc < max + 1$ then grow $c$ to hold at least $max + 1$ digits (\textit{mp\_grow}) \\ | |
2011 4. $oldused \leftarrow c.used$ \\ | |
2012 5. $c.used \leftarrow max + 1$ \\ | |
2013 6. $u \leftarrow 0$ \\ | |
2014 7. for $n$ from $0$ to $min - 1$ do \\ | |
2015 \hspace{+3mm}7.1 $c_n \leftarrow a_n + b_n + u$ \\ | |
2016 \hspace{+3mm}7.2 $u \leftarrow c_n >> lg(\beta)$ \\ | |
2017 \hspace{+3mm}7.3 $c_n \leftarrow c_n \mbox{ (mod }\beta\mbox{)}$ \\ | |
2018 8. if $min \ne max$ then do \\ | |
2019 \hspace{+3mm}8.1 for $n$ from $min$ to $max - 1$ do \\ | |
2020 \hspace{+6mm}8.1.1 $c_n \leftarrow x_n + u$ \\ | |
2021 \hspace{+6mm}8.1.2 $u \leftarrow c_n >> lg(\beta)$ \\ | |
2022 \hspace{+6mm}8.1.3 $c_n \leftarrow c_n \mbox{ (mod }\beta\mbox{)}$ \\ | |
2023 9. $c_{max} \leftarrow u$ \\ | |
2024 10. if $oldused > max$ then \\ | |
2025 \hspace{+3mm}10.1 for $n$ from $max + 1$ to $oldused - 1$ do \\ | |
2026 \hspace{+6mm}10.1.1 $c_n \leftarrow 0$ \\ | |
2027 11. Clamp excess digits in $c$. (\textit{mp\_clamp}) \\ | |
2028 12. Return(\textit{MP\_OKAY}) \\ | |
2029 \hline | |
2030 \end{tabular} | |
2031 \end{small} | |
2032 \end{center} | |
2033 \caption{Algorithm s\_mp\_add} | |
2034 \end{figure} | |
2035 | |
2036 \textbf{Algorithm s\_mp\_add.} | |
2037 This algorithm is loosely based on algorithm 14.7 of HAC \cite[pp. 594]{HAC} but has been extended to allow the inputs to have different magnitudes. | |
2038 Coincidentally the description of algorithm A in Knuth \cite[pp. 266]{TAOCPV2} shares the same deficiency as the algorithm from \cite{HAC}. Even the | |
2039 MIX pseudo machine code presented by Knuth \cite[pp. 266-267]{TAOCPV2} is incapable of handling inputs which are of different magnitudes. | |
2040 | |
2041 The first thing that has to be accomplished is to sort out which of the two inputs is the larger. The addition logic | |
2042 will simply add all of the smaller input to the larger input and store that first part of the result in the | |
2043 destination. Then it will apply a simpler addition loop to excess digits of the larger input. | |
2044 | |
2045 The first two steps will handle sorting the inputs such that $min$ and $max$ hold the digit counts of the two | |
2046 inputs. The variable $x$ will be an mp\_int alias for the larger input or the second input $b$ if they have the | |
2047 same number of digits. After the inputs are sorted the destination $c$ is grown as required to accommodate the sum | |
2048 of the two inputs. The original \textbf{used} count of $c$ is saved in $oldused$ and \textbf{used} is then set to the new count $max + 1$. | |
2049 | |
2050 At this point the first addition loop will go through as many digit positions as both inputs have. The carry | |
2051 variable $u$ is set to zero outside the loop. Inside the loop an ``addition'' step requires three statements to produce | |
2052 one digit of the sum. First | |
2053 two digits from $a$ and $b$ are added together along with the carry $u$. The carry of this step is extracted and stored | |
2054 in $u$ and finally the digit of the result $c_n$ is truncated within the range $0 \le c_n < \beta$. | |
2055 | |
2056 Now all of the digit positions that both inputs have in common have been exhausted. If $min \ne max$ then $x$ is an alias | |
2057 for one of the inputs that has more digits. A simplified addition loop is then used to essentially copy the remaining digits | |
2058 and the carry to the destination. | |
2059 | |
2060 The final carry is stored in $c_{max}$ and digits above $max$ up to $oldused$ are zeroed which completes the addition. | |
2061 | |
2062 | |
2063 \vspace{+3mm}\begin{small} | |
2064 \hspace{-5.1mm}{\bf File}: bn\_s\_mp\_add.c | |
2065 \vspace{-3mm} | |
2066 \begin{alltt} | |
2067 016 | |
2068 017 /* low level addition, based on HAC pp.594, Algorithm 14.7 */ | |
2069 018 int | |
2070 019 s_mp_add (mp_int * a, mp_int * b, mp_int * c) | |
2071 020 \{ | |
2072 021 mp_int *x; | |
2073 022 int olduse, res, min, max; | |
2074 023 | |
2075 024 /* find sizes, we let |a| <= |b| which means we have to sort | |
2076 025 * them. "x" will point to the input with the most digits | |
2077 026 */ | |
2078 027 if (a->used > b->used) \{ | |
2079 028 min = b->used; | |
2080 029 max = a->used; | |
2081 030 x = a; | |
2082 031 \} else \{ | |
2083 032 min = a->used; | |
2084 033 max = b->used; | |
2085 034 x = b; | |
2086 035 \} | |
2087 036 | |
2088 037 /* init result */ | |
2089 038 if (c->alloc < max + 1) \{ | |
2090 039 if ((res = mp_grow (c, max + 1)) != MP_OKAY) \{ | |
2091 040 return res; | |
2092 041 \} | |
2093 042 \} | |
2094 043 | |
2095 044 /* get old used digit count and set new one */ | |
2096 045 olduse = c->used; | |
2097 046 c->used = max + 1; | |
2098 047 | |
2099 048 \{ | |
2100 049 register mp_digit u, *tmpa, *tmpb, *tmpc; | |
2101 050 register int i; | |
2102 051 | |
2103 052 /* alias for digit pointers */ | |
2104 053 | |
2105 054 /* first input */ | |
2106 055 tmpa = a->dp; | |
2107 056 | |
2108 057 /* second input */ | |
2109 058 tmpb = b->dp; | |
2110 059 | |
2111 060 /* destination */ | |
2112 061 tmpc = c->dp; | |
2113 062 | |
2114 063 /* zero the carry */ | |
2115 064 u = 0; | |
2116 065 for (i = 0; i < min; i++) \{ | |
2117 066 /* Compute the sum at one digit, T[i] = A[i] + B[i] + U */ | |
2118 067 *tmpc = *tmpa++ + *tmpb++ + u; | |
2119 068 | |
2120 069 /* U = carry bit of T[i] */ | |
2121 070 u = *tmpc >> ((mp_digit)DIGIT_BIT); | |
2122 071 | |
2123 072 /* take away carry bit from T[i] */ | |
2124 073 *tmpc++ &= MP_MASK; | |
2125 074 \} | |
2126 075 | |
2127 076 /* now copy higher words if any, that is in A+B | |
2128 077 * if A or B has more digits add those in | |
2129 078 */ | |
2130 079 if (min != max) \{ | |
2131 080 for (; i < max; i++) \{ | |
2132 081 /* T[i] = X[i] + U */ | |
2133 082 *tmpc = x->dp[i] + u; | |
2134 083 | |
2135 084 /* U = carry bit of T[i] */ | |
2136 085 u = *tmpc >> ((mp_digit)DIGIT_BIT); | |
2137 086 | |
2138 087 /* take away carry bit from T[i] */ | |
2139 088 *tmpc++ &= MP_MASK; | |
2140 089 \} | |
2141 090 \} | |
2142 091 | |
2143 092 /* add carry */ | |
2144 093 *tmpc++ = u; | |
2145 094 | |
2146 095 /* clear digits above oldused */ | |
2147 096 for (i = c->used; i < olduse; i++) \{ | |
2148 097 *tmpc++ = 0; | |
2149 098 \} | |
2150 099 \} | |
2151 100 | |
2152 101 mp_clamp (c); | |
2153 102 return MP_OKAY; | |
2154 103 \} | |
2155 \end{alltt} | |
2156 \end{small} | |
2157 | |
2158 Lines 27 to 35 perform the initial sorting of the inputs and determine the $min$ and $max$ variables. Note that $x$ is a pointer to a | |
2159 mp\_int assigned to the largest input, in effect it is a local alias. Lines 37 to 42 ensure that the destination is grown to | |
2160 accommodate the result of the addition. | |
2161 | |
2162 Similar to the implementation of mp\_copy this function uses the braced code and local aliases coding style. The three aliases that are on | |
2163 lines 55, 58 and 61 represent the two inputs and destination variables respectively. These aliases are used to ensure the | |
2164 compiler does not have to dereference $a$, $b$ or $c$ (respectively) to access the digits of the respective mp\_int. | |
2165 | |
2166 The initial carry $u$ is cleared on line 64. Note that $u$ is of type mp\_digit which ensures type compatibility within the | |
2167 implementation. The initial addition loop begins on line 65 and ends on line 74. Similarly the conditional addition loop | |
2168 begins on line 80 and ends on line 89. The addition is finished with the final carry being stored in $tmpc$ on line 93. | |
2169 Note the ``++'' operator on the same line. After line 93 $tmpc$ will point to the $c.used$'th digit of the mp\_int $c$. This is useful | |
2170 for the next loop on lines 96 to 98 which sets any old upper digits to zero. | |
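
The two loops and the final carry can be sketched on plain arrays of 28-bit digits held in 32-bit words. The names toy\_add\_mag and toy\_add\_u64 are hypothetical, the digit size of 28 bits is an assumption, and the caller must provide room for $max + 1$ digits in the destination since this sketch does not grow it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_DIGIT_BIT 28
#define TOY_MASK      ((UINT32_C(1) << TOY_DIGIT_BIT) - 1)

/* c = |a| + |b| on digit arrays (least significant first).
 * c needs room for max(a_used, b_used) + 1 digits.
 * Returns the used count of c after clamping leading zeros.
 */
size_t toy_add_mag(const uint32_t *a, size_t a_used,
                   const uint32_t *b, size_t b_used, uint32_t *c)
{
    const uint32_t *x;      /* alias for the input with more digits */
    size_t min, max, n;
    uint32_t u = 0;         /* carry */

    if (a_used > b_used) { min = b_used; max = a_used; x = a; }
    else                 { min = a_used; max = b_used; x = b; }

    /* positions that both inputs have */
    for (n = 0; n < min; n++) {
        c[n] = a[n] + b[n] + u;
        u = c[n] >> TOY_DIGIT_BIT;
        c[n] &= TOY_MASK;
    }
    /* remaining digits of the longer input plus the carry */
    for (; n < max; n++) {
        c[n] = x[n] + u;
        u = c[n] >> TOY_DIGIT_BIT;
        c[n] &= TOY_MASK;
    }
    c[max] = u;             /* final carry */

    n = max + 1;
    while (n > 0 && c[n - 1] == 0) n--;   /* clamp */
    return n;
}

/* Demo helper: add two values below 2^56 through the digit routine. */
uint64_t toy_add_u64(uint64_t p, uint64_t q)
{
    uint32_t a[2], b[2], c[3];
    size_t a_used, b_used, used, n;
    uint64_t r = 0;

    a[0] = (uint32_t)(p & TOY_MASK);
    a[1] = (uint32_t)((p >> TOY_DIGIT_BIT) & TOY_MASK);
    b[0] = (uint32_t)(q & TOY_MASK);
    b[1] = (uint32_t)((q >> TOY_DIGIT_BIT) & TOY_MASK);
    a_used = a[1] ? 2 : (a[0] ? 1 : 0);
    b_used = b[1] ? 2 : (b[0] ? 1 : 0);

    used = toy_add_mag(a, a_used, b, b_used, c);
    for (n = used; n-- > 0; ) {
        r = (r << TOY_DIGIT_BIT) | c[n];
    }
    return r;
}
```

Each digit sum is at most $2(\beta - 1) + 1 < 2^{32}$, so the carry can be read from the bits above position 27 before the digit is masked.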
2171 | |
2172 \subsection{Low Level Subtraction} | |
2173 The low level unsigned subtraction algorithm is very similar to the low level unsigned addition algorithm. The principal difference is that the | |
2174 unsigned subtraction algorithm requires the result to be positive. That is when computing $a - b$ the condition $\vert a \vert \ge \vert b\vert$ must | |
2175 be met for this algorithm to function properly. Keep in mind this low level algorithm is not meant to be used in higher level algorithms directly. | |
2176 This algorithm as will be shown can be used to create functional signed addition and subtraction algorithms. | |
2177 | |
2178 | |
2179 For this algorithm a new variable is required to make the description simpler. Recall from section 1.3.1 that a mp\_digit must be able to represent | |
2180 the range $0 \le x < 2\beta$ for the algorithms to work correctly. However, it is allowable that a mp\_digit represent a larger range of values. For | |
2181 this algorithm we will assume that the variable $\gamma$ represents the number of bits available in a | |
2182 mp\_digit (\textit{this implies $2^{\gamma} > \beta$}). | |
2183 | |
2184 For example, the default for LibTomMath is to use an ``unsigned long'' for the mp\_digit ``type'' while $\beta = 2^{28}$. In ISO C an ``unsigned long'' | |
2185 data type must be able to represent at least $0 \le x < 2^{32}$, meaning that in this case $\gamma \ge 32$. | |
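
The relationship between $\gamma$ and $\beta$ can be checked mechanically. The following is a minimal sketch and not LibTomMath code: the names digit\_t, RADIX\_BITS, gamma\_bits and radix\_fits are hypothetical, and the 28-bit radix is assumed from the example above.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in for mp_digit (assumption: unsigned long, as in the text). */
typedef unsigned long digit_t;

/* Assumed radix exponent: beta = 2^28. */
#define RADIX_BITS 28

/* gamma: the number of bits available in one digit_t. */
static size_t gamma_bits(void)
{
    return CHAR_BIT * sizeof(digit_t);
}

/* Returns 1 when 2^gamma > beta, i.e. the digit type has spare high bits. */
static int radix_fits(void)
{
    return gamma_bits() > RADIX_BITS;
}
```

On a platform where ``unsigned long'' is wider than 32 bits gamma\_bits simply reports the larger width; the condition $2^{\gamma} > \beta$ still holds.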
2186 | |
2187 \newpage\begin{figure}[!here] | |
2188 \begin{center} | |
2189 \begin{small} | |
2190 \begin{tabular}{l} | |
2191 \hline Algorithm \textbf{s\_mp\_sub}. \\ | |
2192 \textbf{Input}. Two mp\_ints $a$ and $b$ ($\vert a \vert \ge \vert b \vert$) \\ | |
2193 \textbf{Output}. The unsigned subtraction $c = \vert a \vert - \vert b \vert$. \\ | |
2194 \hline \\ | |
2195 1. $min \leftarrow b.used$ \\ | |
2196 2. $max \leftarrow a.used$ \\ | |
2197 3. If $c.alloc < max$ then grow $c$ to hold at least $max$ digits. (\textit{mp\_grow}) \\ | |
2198 4. $oldused \leftarrow c.used$ \\ | |
2199 5. $c.used \leftarrow max$ \\ | |
2200 6. $u \leftarrow 0$ \\ | |
2201 7. for $n$ from $0$ to $min - 1$ do \\ | |
2202 \hspace{3mm}7.1 $c_n \leftarrow a_n - b_n - u$ \\ | |
2203 \hspace{3mm}7.2 $u \leftarrow c_n >> (\gamma - 1)$ \\ | |
2204 \hspace{3mm}7.3 $c_n \leftarrow c_n \mbox{ (mod }\beta\mbox{)}$ \\ | |
2205 8. if $min < max$ then do \\ | |
2206 \hspace{3mm}8.1 for $n$ from $min$ to $max - 1$ do \\ | |
2207 \hspace{6mm}8.1.1 $c_n \leftarrow a_n - u$ \\ | |
2208 \hspace{6mm}8.1.2 $u \leftarrow c_n >> (\gamma - 1)$ \\ | |
2209 \hspace{6mm}8.1.3 $c_n \leftarrow c_n \mbox{ (mod }\beta\mbox{)}$ \\ | |
2210 9. if $oldused > max$ then do \\ | |
2211 \hspace{3mm}9.1 for $n$ from $max$ to $oldused - 1$ do \\ | |
2212 \hspace{6mm}9.1.1 $c_n \leftarrow 0$ \\ | |
2213 10. Clamp excess digits of $c$. (\textit{mp\_clamp}). \\ | |
2214 11. Return(\textit{MP\_OKAY}). \\ | |
2215 \hline | |
2216 \end{tabular} | |
2217 \end{small} | |
2218 \end{center} | |
2219 \caption{Algorithm s\_mp\_sub} | |
2220 \end{figure} | |
2221 | |
2222 \textbf{Algorithm s\_mp\_sub.} | |
2223 This algorithm performs the unsigned subtraction of two mp\_int variables under the restriction that the result must be positive. That is when | |
2224 passing variables $a$ and $b$ the condition that $\vert a \vert \ge \vert b \vert$ must be met for the algorithm to function correctly. This | |
2225 algorithm is loosely based on algorithm 14.9 \cite[pp. 595]{HAC} and is similar to algorithm S in \cite[pp. 267]{TAOCPV2} as well. As was the case | |
2226 of the algorithm s\_mp\_add both other references lack discussion concerning various practical details such as when the inputs differ in magnitude. | |
2227 | |
2228 The initial sorting of the inputs is trivial in this algorithm since $a$ is guaranteed to have a magnitude at least as large as that of $b$. Steps 1 and 2 | |
2229 set the $min$ and $max$ variables. Unlike the addition routine there is guaranteed to be no carry which means that the final result can be at | |
2230 most $max$ digits in length as opposed to $max + 1$. Similar to the addition algorithm the \textbf{used} count of $c$ is copied locally and | |
2231 set to the maximal count for the operation. | |
2232 | |
2233 The subtraction loop that begins on step seven is essentially the same as the addition loop of algorithm s\_mp\_add except single precision | |
2234 subtraction is used instead. Note the use of the $\gamma$ variable to extract the carry (\textit{also known as the borrow}) within the subtraction | |
2235 loops. Under the assumption that two's complement single precision arithmetic is used this will successfully extract the desired carry. | |
2236 | |
2237 For example, consider subtracting $0101_2$ from $0100_2$ where $\gamma = 4$ and $\beta = 2$. The least significant bit will force a carry upwards to | |
2238 the third bit which will be set to zero after the borrow. After the very first bit has been subtracted $4 - 1 \equiv 0011_2$ will remain. When the | |
2239 third bit of $0101_2$ is subtracted from the result it will cause another carry. In this case though the carry will be forced to propagate all the | |
2240 way to the most significant bit. | |
2241 | |
2242 Recall that $\beta < 2^{\gamma}$. This means that if a carry does occur just before the $lg(\beta)$'th bit it will propagate all the way to the most | |
2243 significant bit. Thus, the high order bits of the mp\_digit that are not part of the actual digit will either be all zero, or all one. All that | |
2244 is needed is a single zero or one bit for the carry. Therefore a single logical shift right by $\gamma - 1$ positions is sufficient to extract the | |
2245 carry. This method of carry extraction may seem awkward but the reason for it becomes apparent when the implementation is discussed. | |
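
As a worked instance, assume the default parameters $\beta = 2^{28}$ and $\gamma = 32$ and consider a single step of the loop with $a_n = 5$, $b_n = 7$ and an incoming carry of $u = 0$. The intermediate value $t$ below is a name introduced only for this illustration.

\[
\begin{array}{rcl}
t   & = & 5 - 7 - 0 \mbox{ (mod }2^{32}\mbox{)} = \mbox{0xFFFFFFFE} \\
u   & = & t >> 31 = 1 \\
c_n & = & t \mbox{ (mod }2^{28}\mbox{)} = \mbox{0x0FFFFFFE} = \beta - 2
\end{array}
\]

The upper four bits of $t$ are all one, so the shift by $\gamma - 1 = 31$ positions yields a carry of one, and the stored digit is $\beta - 2$, which is $5 - 7$ modulo $\beta$.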
2246 | |
2247 If $b$ has a smaller magnitude than $a$ then step 8 will force the carry and copy operation to propagate through the larger input $a$ into $c$. Step | |
2248 9 will ensure that any leading digits of $c$ above the $max$'th position are zeroed. | |
2249 | |
2250 \vspace{+3mm}\begin{small} | |
2251 \hspace{-5.1mm}{\bf File}: bn\_s\_mp\_sub.c | |
2252 \vspace{-3mm} | |
2253 \begin{alltt} | |
2254 016 | |
2255 017 /* low level subtraction (assumes |a| > |b|), HAC pp.595 Algorithm 14.9 */ | |
2256 018 int | |
2257 019 s_mp_sub (mp_int * a, mp_int * b, mp_int * c) | |
2258 020 \{ | |
2259 021 int olduse, res, min, max; | |
2260 022 | |
2261 023 /* find sizes */ | |
2262 024 min = b->used; | |
2263 025 max = a->used; | |
2264 026 | |
2265 027 /* init result */ | |
2266 028 if (c->alloc < max) \{ | |
2267 029 if ((res = mp_grow (c, max)) != MP_OKAY) \{ | |
2268 030 return res; | |
2269 031 \} | |
2270 032 \} | |
2271 033 olduse = c->used; | |
2272 034 c->used = max; | |
2273 035 | |
2274 036 \{ | |
2275 037 register mp_digit u, *tmpa, *tmpb, *tmpc; | |
2276 038 register int i; | |
2277 039 | |
2278 040 /* alias for digit pointers */ | |
2279 041 tmpa = a->dp; | |
2280 042 tmpb = b->dp; | |
2281 043 tmpc = c->dp; | |
2282 044 | |
2283 045 /* set carry to zero */ | |
2284 046 u = 0; | |
2285 047 for (i = 0; i < min; i++) \{ | |
2286 048 /* T[i] = A[i] - B[i] - U */ | |
2287 049 *tmpc = *tmpa++ - *tmpb++ - u; | |
2288 050 | |
2289 051 /* U = carry bit of T[i] | |
2290 052 * Note this saves performing an AND operation since | |
2291 053 * if a carry does occur it will propagate all the way to the | |
2292 054 * MSB. As a result a single shift is enough to get the carry | |
2293 055 */ | |
2294 056 u = *tmpc >> ((mp_digit)(CHAR_BIT * sizeof (mp_digit) - 1)); | |
2295 057 | |
2296 058 /* Clear carry from T[i] */ | |
2297 059 *tmpc++ &= MP_MASK; | |
2298 060 \} | |
2299 061 | |
2300 062 /* now copy higher words if any, e.g. if A has more digits than B */ | |
2301 063 for (; i < max; i++) \{ | |
2302 064 /* T[i] = A[i] - U */ | |
2303 065 *tmpc = *tmpa++ - u; | |
2304 066 | |
2305 067 /* U = carry bit of T[i] */ | |
2306 068 u = *tmpc >> ((mp_digit)(CHAR_BIT * sizeof (mp_digit) - 1)); | |
2307 069 | |
2308 070 /* Clear carry from T[i] */ | |
2309 071 *tmpc++ &= MP_MASK; | |
2310 072 \} | |
2311 073 | |
2312 074 /* clear digits above used (since we may not have grown result above) */ | |
2313 | |
2314 075 for (i = c->used; i < olduse; i++) \{ | |
2315 076 *tmpc++ = 0; | |
2316 077 \} | |
2317 078 \} | |
2318 079 | |
2319 080 mp_clamp (c); | |
2320 081 return MP_OKAY; | |
2321 082 \} | |
2322 083 | |
2323 \end{alltt} | |
2324 \end{small} | |
2325 | |
2326 Lines 24 and 25 perform the initial hardcoded sorting of the inputs. In reality the $min$ and $max$ variables are only aliases and are only | |
2327 used to make the source code easier to read. Again the pointer alias optimization is used within this algorithm. Lines 41, 42 and 43 initialize the aliases for | |
2328 $a$, $b$ and $c$ respectively. | |
2329 | |
2330 The first subtraction loop occurs on lines 46 through 60. The theory behind the subtraction loop is exactly the same as that for | |
2331 the addition loop. As remarked earlier there is an implementation reason for using the ``awkward'' method of extracting the carry | |
2332 (\textit{see line 56}). The traditional method for extracting the carry would be to shift by $lg(\beta)$ positions and logically AND | |
2333 the least significant bit. The AND operation is required because all of the bits above the $lg(\beta)$'th bit will be set to one after a carry | |
2334 occurs from subtraction. This method requires two relatively cheap operations to extract the carry. The other method is to simply | |
2335 shift the most significant bit to the least significant bit thus extracting the carry with a single cheap operation. This optimization only works on | |
2336 two's complement machines which is a safe assumption to make. | |
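
Both extraction methods can be compared directly. The sketch below assumes 28-bit digits held in a 32-bit unsigned word; the names DIGIT\_BITS, WORD\_BITS, DIGIT\_MASK, borrow\_shift\_and, borrow\_single\_shift and sub\_digit are hypothetical and chosen only for this illustration. It uses the fixed width type uint32\_t rather than the library's mp\_digit so that $\gamma$ is known to be 32.

```c
#include <stdint.h>

/* Assumed parameters: beta = 2^28 stored in a 32-bit word (gamma = 32). */
#define DIGIT_BITS 28
#define WORD_BITS  32
#define DIGIT_MASK ((uint32_t)((1UL << DIGIT_BITS) - 1))

/* Traditional extraction: shift by lg(beta), then AND with 1. */
static uint32_t borrow_shift_and(uint32_t t)
{
    return (t >> DIGIT_BITS) & 1u;
}

/* Single shift extraction: move the most significant bit down to bit 0. */
static uint32_t borrow_single_shift(uint32_t t)
{
    return t >> (WORD_BITS - 1);
}

/* One step of the subtraction loop: returns a - b - *u modulo beta and
 * stores the outgoing borrow in *u. Requires a, b < beta and *u in {0, 1}. */
static uint32_t sub_digit(uint32_t a, uint32_t b, uint32_t *u)
{
    uint32_t t = a - b - *u;        /* wraps modulo 2^32 when a < b + *u */
    *u = borrow_single_shift(t);    /* borrow for the next digit */
    return t & DIGIT_MASK;          /* clear the bits above the digit */
}
```

For any difference of two digits below $\beta$ the bits above the digit are either all zero or all one, so both helpers return the same value; the single shift version merely avoids the AND.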
2337 | |
2338 If $a$ has a larger magnitude than $b$ an additional loop (\textit{see lines 63 through 72}) is required to propagate the carry through | |
2339 $a$ and copy the result to $c$. | |
2340 | |
2341 \subsection{High Level Addition} | |
2342 Now that both lower level addition and subtraction algorithms have been established an effective high level signed addition algorithm can be | |
2343 established. This high level addition algorithm will be what other algorithms and developers will use to perform addition of mp\_int data | |
2344 types. | |
2345 | |
2346 Recall from section 5.2 that an mp\_int represents an integer with an unsigned mantissa (\textit{the array of digits}) and a \textbf{sign} | |
2347 flag. A high level addition is actually performed as a series of eight separate cases which can be optimized down to three unique cases. | |
2348 | |
2349 \begin{figure}[!here] | |
2350 \begin{center} | |
2351 \begin{tabular}{l} | |
2352 \hline Algorithm \textbf{mp\_add}. \\ | |
2353 \textbf{Input}. Two mp\_ints $a$ and $b$ \\ | |
2354 \textbf{Output}. The signed addition $c = a + b$. \\ | |
2355 \hline \\ | |
2356 1. if $a.sign = b.sign$ then do \\ | |
2357 \hspace{3mm}1.1 $c.sign \leftarrow a.sign$ \\ | |
2358 \hspace{3mm}1.2 $c \leftarrow \vert a \vert + \vert b \vert$ (\textit{s\_mp\_add})\\ | |
2359 2. else do \\ | |
2360 \hspace{3mm}2.1 if $\vert a \vert < \vert b \vert$ then do (\textit{mp\_cmp\_mag}) \\ | |
2361 \hspace{6mm}2.1.1 $c.sign \leftarrow b.sign$ \\ | |
2362 \hspace{6mm}2.1.2 $c \leftarrow \vert b \vert - \vert a \vert$ (\textit{s\_mp\_sub}) \\ | |
2363 \hspace{3mm}2.2 else do \\ | |
2364 \hspace{6mm}2.2.1 $c.sign \leftarrow a.sign$ \\ | |
2365 \hspace{6mm}2.2.2 $c \leftarrow \vert a \vert - \vert b \vert$ (\textit{s\_mp\_sub}) \\ | |
2366 3. Return(\textit{MP\_OKAY}). \\ | |
2367 \hline | |
2368 \end{tabular} | |
2369 \end{center} | |
2370 \caption{Algorithm mp\_add} | |
2371 \end{figure} | |
2372 | |
2373 \textbf{Algorithm mp\_add.} | |
2374 This algorithm performs the signed addition of two mp\_int variables. There is no reference algorithm to draw upon from | |
2375 either \cite{TAOCPV2} or \cite{HAC} since they both only provide unsigned operations. The algorithm is fairly | |
2376 straightforward but restricted since subtraction can only produce positive results. | |
2377 | |
2378 \begin{figure}[!here] | |
2379 \begin{small} | |
2380 \begin{center} | |
2381 \begin{tabular}{|c|c|c|c|c|} | |
2382 \hline \textbf{Sign of $a$} & \textbf{Sign of $b$} & \textbf{$\vert a \vert \ge \vert b \vert $} & \textbf{Unsigned Operation} & \textbf{Result Sign Flag} \\ | |
2383 \hline $+$ & $+$ & Yes & $c = a + b$ & $a.sign$ \\ | |
2384 \hline $+$ & $+$ & No & $c = a + b$ & $a.sign$ \\ | |
2385 \hline $-$ & $-$ & Yes & $c = a + b$ & $a.sign$ \\ | |
2386 \hline $-$ & $-$ & No & $c = a + b$ & $a.sign$ \\ | |
2387 \hline &&&&\\ | |
2388 | |
2389 \hline $+$ & $-$ & No & $c = b - a$ & $b.sign$ \\ | |
2390 \hline $-$ & $+$ & No & $c = b - a$ & $b.sign$ \\ | |
2391 | |
2392 \hline &&&&\\ | |
2393 | |
2394 \hline $+$ & $-$ & Yes & $c = a - b$ & $a.sign$ \\ | |
2395 \hline $-$ & $+$ & Yes & $c = a - b$ & $a.sign$ \\ | |
2396 | |
2397 \hline | |
2398 \end{tabular} | |
2399 \end{center} | |
2400 \end{small} | |
2401 \caption{Addition Guide Chart} | |
2402 \label{fig:AddChart} | |
2403 \end{figure} | |
2404 | |
2405 Figure~\ref{fig:AddChart} lists all of the eight possible input combinations and is sorted to show that only three | |
2406 specific cases need to be handled. The return codes of the unsigned operations at steps 1.2, 2.1.2 and 2.2.2 are | |
2407 forwarded to step three to check for errors. This simplifies the description of the algorithm considerably and best | |
2408 follows how the implementation was actually written. | |
2409 | |
2410 Also note how the \textbf{sign} is set before the unsigned addition or subtraction is performed. Recall from the descriptions of algorithms | |
2411 s\_mp\_add and s\_mp\_sub that the mp\_clamp function is used at the end to trim excess digits. The mp\_clamp algorithm will set the \textbf{sign} | |
2412 to \textbf{MP\_ZPOS} when the \textbf{used} digit count reaches zero. | |
2413 | |
2414 For example, consider performing $-a + a$ with algorithm mp\_add. By the description of the algorithm the sign is set to \textbf{MP\_NEG} which would | |
2415 produce a result of $-0$. However, since the sign is set first and the unsigned subtraction is performed afterwards, the subsequent usage of algorithm mp\_clamp | |
2416 within algorithm s\_mp\_sub will force $-0$ to become $0$. | |
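
The three cases and the role of clamping can be illustrated on a toy sign and magnitude type. This is only a sketch under simplifying assumptions: sm\_int, SM\_ZPOS, SM\_NEG, sm\_clamp and sm\_add are hypothetical names, the magnitude is a single 32-bit word rather than an array of digits, overflow is ignored, and the clamp is applied once at the end instead of inside the unsigned routines.

```c
#include <stdint.h>

/* Hypothetical sign-magnitude integer; a toy stand-in for mp_int. */
enum { SM_ZPOS = 0, SM_NEG = 1 };

typedef struct {
    int      sign;   /* SM_ZPOS or SM_NEG */
    uint32_t mag;    /* magnitude, a single word in this sketch */
} sm_int;

/* Plays the role of mp_clamp: a zero magnitude is always non-negative. */
static void sm_clamp(sm_int *c)
{
    if (c->mag == 0) {
        c->sign = SM_ZPOS;
    }
}

/* c = a + b following the three cases of the addition guide chart.
 * Overflow of the magnitude is not handled in this sketch. */
static sm_int sm_add(sm_int a, sm_int b)
{
    sm_int c;

    if (a.sign == b.sign) {
        c.sign = a.sign;             /* the sign is set first ...        */
        c.mag  = a.mag + b.mag;      /* ... then the unsigned addition   */
    } else if (a.mag < b.mag) {
        c.sign = b.sign;
        c.mag  = b.mag - a.mag;      /* |b| - |a| is never negative here */
    } else {
        c.sign = a.sign;
        c.mag  = a.mag - b.mag;      /* |a| - |b| is never negative here */
    }
    sm_clamp(&c);                    /* turns -0 into 0 */
    return c;
}
```

With these definitions adding $-5$ and $5$ first records a negative sign and then has it reset by the clamp, which mirrors the $-a + a$ case described above.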
2417 | |
2418 \vspace{+3mm}\begin{small} | |
2419 \hspace{-5.1mm}{\bf File}: bn\_mp\_add.c | |
2420 \vspace{-3mm} | |
2421 \begin{alltt} | |
2422 016 | |
2423 017 /* high level addition (handles signs) */ | |
2424 018 int mp_add (mp_int * a, mp_int * b, mp_int * c) | |
2425 019 \{ | |
2426 020 int sa, sb, res; | |
2427 021 | |
2428 022 /* get sign of both inputs */ | |
2429 023 sa = a->sign; | |
2430 024 sb = b->sign; | |
2431 025 | |
2432 026 /* handle two cases, not four */ | |
2433 027 if (sa == sb) \{ | |
2434 028 /* both positive or both negative */ | |
2435 029 /* add their magnitudes, copy the sign */ | |
2436 030 c->sign = sa; | |
2437 031 res = s_mp_add (a, b, c); | |
2438 032 \} else \{ | |
2439 033 /* one positive, the other negative */ | |
2440 034 /* subtract the one with the greater magnitude from */ | |
2441 035 /* the one of the lesser magnitude. The result gets */ | |
2442 036 /* the sign of the one with the greater magnitude. */ | |
2443 037 if (mp_cmp_mag (a, b) == MP_LT) \{ | |
2444 038 c->sign = sb; | |
2445 039 res = s_mp_sub (b, a, c); | |
2446 040 \} else \{ | |
2447 041 c->sign = sa; | |
2448 042 res = s_mp_sub (a, b, c); | |
2449 043 \} | |
2450 044 \} | |
2451 045 return res; | |
2452 046 \} | |
2453 047 | |
2454 \end{alltt} | |
2455 \end{small} | |
2456 | |
2457 The source code follows the algorithm fairly closely. The most notable new source code addition is the usage of the $res$ integer variable which | |
2458 is used to pass the result of the unsigned operations forward. Unlike in the algorithm, the variable $res$ is merely returned as is without | |
2459 explicitly checking it and returning the constant \textbf{MP\_OKAY}. The observation is this algorithm will succeed or fail only if the lower | |
2460 level functions do so. Returning their return code is sufficient. | |
2461 | |
2462 \subsection{High Level Subtraction} | |
2463 The high level signed subtraction algorithm is essentially the same as the high level signed addition algorithm. | |
2464 | |
2465 \newpage\begin{figure}[!here] | |
2466 \begin{center} | |
2467 \begin{tabular}{l} | |
2468 \hline Algorithm \textbf{mp\_sub}. \\ | |
2469 \textbf{Input}. Two mp\_ints $a$ and $b$ \\ | |
2470 \textbf{Output}. The signed subtraction $c = a - b$. \\ | |
2471 \hline \\ | |
2472 1. if $a.sign \ne b.sign$ then do \\ | |
2473 \hspace{3mm}1.1 $c.sign \leftarrow a.sign$ \\ | |
2474 \hspace{3mm}1.2 $c \leftarrow \vert a \vert + \vert b \vert$ (\textit{s\_mp\_add}) \\ | |
2475 2. else do \\ | |
2476 \hspace{3mm}2.1 if $\vert a \vert \ge \vert b \vert$ then do (\textit{mp\_cmp\_mag}) \\ | |
2477 \hspace{6mm}2.1.1 $c.sign \leftarrow a.sign$ \\ | |
2478 \hspace{6mm}2.1.2 $c \leftarrow \vert a \vert - \vert b \vert$ (\textit{s\_mp\_sub}) \\ | |
2479 \hspace{3mm}2.2 else do \\ | |
2480 \hspace{6mm}2.2.1 $c.sign \leftarrow \left \lbrace \begin{array}{ll} | |
2481 MP\_ZPOS & \mbox{if }a.sign = MP\_NEG \\ | |
2482 MP\_NEG & \mbox{otherwise} \\ | |
2483 \end{array} \right .$ \\ | |
2484 \hspace{6mm}2.2.2 $c \leftarrow \vert b \vert - \vert a \vert$ (\textit{s\_mp\_sub}) \\ | |
2485 3. Return(\textit{MP\_OKAY}). \\ | |
2486 \hline | |
2487 \end{tabular} | |
2488 \end{center} | |
2489 \caption{Algorithm mp\_sub} | |
2490 \end{figure} | |
2491 | |
2492 \textbf{Algorithm mp\_sub.} | |
2493 This algorithm performs the signed subtraction of two inputs. Similar to algorithm mp\_add there is no reference in either \cite{TAOCPV2} or | |
2494 \cite{HAC}. Also this algorithm is restricted by algorithm s\_mp\_sub. Figure~\ref{fig:SubChart} lists the eight possible inputs and | |
2495 the operations required. | |
2496 | |
2497 \begin{figure}[!here] | |
2498 \begin{small} | |
2499 \begin{center} | |
2500 \begin{tabular}{|c|c|c|c|c|} | |
2501 \hline \textbf{Sign of $a$} & \textbf{Sign of $b$} & \textbf{$\vert a \vert \ge \vert b \vert $} & \textbf{Unsigned Operation} & \textbf{Result Sign Flag} \\ | |
2502 \hline $+$ & $-$ & Yes & $c = a + b$ & $a.sign$ \\ | |
2503 \hline $+$ & $-$ & No & $c = a + b$ & $a.sign$ \\ | |
2504 \hline $-$ & $+$ & Yes & $c = a + b$ & $a.sign$ \\ | |
2505 \hline $-$ & $+$ & No & $c = a + b$ & $a.sign$ \\ | |
2506 \hline &&&& \\ | |
2507 \hline $+$ & $+$ & Yes & $c = a - b$ & $a.sign$ \\ | |
2508 \hline $-$ & $-$ & Yes & $c = a - b$ & $a.sign$ \\ | |
2509 \hline &&&& \\ | |
2510 \hline $+$ & $+$ & No & $c = b - a$ & $\mbox{opposite of }a.sign$ \\ | |
2511 \hline $-$ & $-$ & No & $c = b - a$ & $\mbox{opposite of }a.sign$ \\ | |
2512 \hline | |
2513 \end{tabular} | |
2514 \end{center} | |
2515 \end{small} | |
2516 \caption{Subtraction Guide Chart} | |
2517 \label{fig:SubChart} | |
2518 \end{figure} | |
2519 | |
2520 Similar to the case of algorithm mp\_add the \textbf{sign} is set first before the unsigned addition or subtraction. That is to prevent the | |
2521 algorithm from producing $-a - -a = -0$ as a result. | |
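
The subtraction cases can be sketched in the same spirit on a toy sign and magnitude type. All names here (sm\_int, SM\_ZPOS, SM\_NEG, sm\_sub) are hypothetical, the magnitude is a single 32-bit word, overflow is ignored, and the final test on the magnitude stands in for the call to mp\_clamp made by the unsigned routines.

```c
#include <stdint.h>

/* Hypothetical sign-magnitude integer; a toy stand-in for mp_int. */
enum { SM_ZPOS = 0, SM_NEG = 1 };

typedef struct {
    int      sign;   /* SM_ZPOS or SM_NEG */
    uint32_t mag;    /* magnitude, a single word in this sketch */
} sm_int;

/* c = a - b following the three cases of the subtraction guide chart. */
static sm_int sm_sub(sm_int a, sm_int b)
{
    sm_int c;

    if (a.sign != b.sign) {
        c.sign = a.sign;                 /* differing signs: add magnitudes */
        c.mag  = a.mag + b.mag;
    } else if (a.mag >= b.mag) {
        c.sign = a.sign;                 /* |a| >= |b|: keep the sign of a  */
        c.mag  = a.mag - b.mag;
    } else {
        c.sign = (a.sign == SM_ZPOS) ? SM_NEG : SM_ZPOS;  /* opposite of a */
        c.mag  = b.mag - a.mag;
    }
    if (c.mag == 0) {
        c.sign = SM_ZPOS;                /* stands in for mp_clamp: no -0   */
    }
    return c;
}
```

Here $-5 - (-5)$ takes the second branch with a negative sign recorded first, and the zero magnitude then forces the sign back to positive.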
2522 | |
2523 \vspace{+3mm}\begin{small} | |
2524 \hspace{-5.1mm}{\bf File}: bn\_mp\_sub.c | |
2525 \vspace{-3mm} | |
2526 \begin{alltt} | |
2527 016 | |
2528 017 /* high level subtraction (handles signs) */ | |
2529 018 int | |
2530 019 mp_sub (mp_int * a, mp_int * b, mp_int * c) | |
2531 020 \{ | |
2532 021 int sa, sb, res; | |
2533 022 | |
2534 023 sa = a->sign; | |
2535 024 sb = b->sign; | |
2536 025 | |
2537 026 if (sa != sb) \{ | |
2538 027 /* subtract a negative from a positive, OR */ | |
2539 028 /* subtract a positive from a negative. */ | |
2540 029 /* In either case, ADD their magnitudes, */ | |
2541 030 /* and use the sign of the first number. */ | |
2542 031 c->sign = sa; | |
2543 032 res = s_mp_add (a, b, c); | |
2544 033 \} else \{ | |
2545 034 /* subtract a positive from a positive, OR */ | |
2546 035 /* subtract a negative from a negative. */ | |
2547 036 /* First, take the difference between their */ | |
2548 037 /* magnitudes, then... */ | |
2549 038 if (mp_cmp_mag (a, b) != MP_LT) \{ | |
2550 039 /* Copy the sign from the first */ | |
2551 040 c->sign = sa; | |
2552 041 /* The first has a larger or equal magnitude */ | |
2553 042 res = s_mp_sub (a, b, c); | |
2554 043 \} else \{ | |
2555 044 /* The result has the *opposite* sign from */ | |
2556 045 /* the first number. */ | |
2557 046 c->sign = (sa == MP_ZPOS) ? MP_NEG : MP_ZPOS; | |
2558 047 /* The second has a larger magnitude */ | |
2559 048 res = s_mp_sub (b, a, c); | |
2560 049 \} | |
2561 050 \} | |
2562 051 return res; | |
2563 052 \} | |
2564 053 | |
2565 \end{alltt} | |
2566 \end{small} | |
2567 | |
2568 Much like the implementation of algorithm mp\_add the variable $res$ is used to catch the return code of the unsigned addition or subtraction operations | |
2569 and forward it to the end of the function. On line 38 the ``not equal to'' \textbf{MP\_LT} expression is used to emulate a | |
2570 ``greater than or equal to'' comparison. | |
2571 | |
2572 \section{Bit and Digit Shifting} | |
2573 It is quite common to think of a multiple precision integer as a polynomial in $x$, that is $y = f(\beta)$ where $f(x) = \sum_{i=0}^{n-1} a_i x^i$. | |
2574 This notation arises within discussion of Montgomery and Diminished Radix Reduction as well as Karatsuba multiplication and squaring. | |
2575 | |
2576 In order to facilitate operations on polynomials in $x$ as above a series of simple ``digit'' algorithms have to be established. That is to shift | |
2577 the digits left or right as well as to shift individual bits of the digits left and right. It is important to note that not all ``shift'' operations | |
2578 are on radix-$\beta$ digits. | |
2579 | |
2580 \subsection{Multiplication by Two} | |
2581 | |
2582 In a binary system where the radix is a power of two, multiplication by two not only arises often in other algorithms, it is also a fairly efficient | |
2583 operation to perform. A single precision logical shift left is sufficient to multiply a single digit by two. | |
2584 | |
2585 \newpage\begin{figure}[!here] | |
2586 \begin{small} | |
2587 \begin{center} | |
2588 \begin{tabular}{l} | |
2589 \hline Algorithm \textbf{mp\_mul\_2}. \\ | |
2590 \textbf{Input}. One mp\_int $a$ \\ | |
2591 \textbf{Output}. $b = 2a$. \\ | |
2592 \hline \\ | |
2593 1. If $b.alloc < a.used + 1$ then grow $b$ to hold $a.used + 1$ digits. (\textit{mp\_grow}) \\ | |
2594 2. $oldused \leftarrow b.used$ \\ | |
2595 3. $b.used \leftarrow a.used$ \\ | |
2596 4. $r \leftarrow 0$ \\ | |
2597 5. for $n$ from 0 to $a.used - 1$ do \\ | |
2598 \hspace{3mm}5.1 $rr \leftarrow a_n >> (lg(\beta) - 1)$ \\ | |
2599 \hspace{3mm}5.2 $b_n \leftarrow (a_n << 1) + r \mbox{ (mod }\beta\mbox{)}$ \\ | |
2600 \hspace{3mm}5.3 $r \leftarrow rr$ \\ | |
2601 6. If $r \ne 0$ then do \\ | |
2602 \hspace{3mm}6.1 $b_{n + 1} \leftarrow r$ \\ | |
2603 \hspace{3mm}6.2 $b.used \leftarrow b.used + 1$ \\ | |
2604 7. If $b.used < oldused$ then do \\ | |
2605 \hspace{3mm}7.1 for $n$ from $b.used$ to $oldused - 1$ do \\ | |
2606 \hspace{6mm}7.1.1 $b_n \leftarrow 0$ \\ | |
2607 8. $b.sign \leftarrow a.sign$ \\ | |
2608 9. Return(\textit{MP\_OKAY}).\\ | |
2609 \hline | |
2610 \end{tabular} | |
2611 \end{center} | |
2612 \end{small} | |
2613 \caption{Algorithm mp\_mul\_2} | |
2614 \end{figure} | |
2615 | |
2616 \textbf{Algorithm mp\_mul\_2.} | |
2617 This algorithm will quickly multiply a mp\_int by two provided $\beta$ is a power of two. Neither \cite{TAOCPV2} nor \cite{HAC} describe such | |
2618 an algorithm despite the fact it arises often in other algorithms. The algorithm is set up much like the lower level algorithm s\_mp\_add since | |
2619 it is for all intents and purposes equivalent to the operation $b = \vert a \vert + \vert a \vert$. | |
2620 | |
2621 Step 1 grows the destination $b$ as required to accommodate the maximum number of \textbf{used} digits in the result and step 2 saves its old \textbf{used} count. The initial \textbf{used} count | |
2622 is set to $a.used$ at step 3. Only if there is a final carry will the \textbf{used} count require adjustment. | |
2623 | |
2624 Step 5 is an optimized implementation of the addition loop for this specific case. That is since the two values being added together | |
2625 are the same there is no need to perform two reads from the digits of $a$. Step 5.1 performs a single precision shift on the current digit $a_n$ to | |
2626 obtain what will be the carry for the next iteration. Step 5.2 calculates the $n$'th digit of the result as a single precision shift of $a_n$ plus | |
2627 the previous carry. Recall from section 4.1 that $a_n << 1$ is equivalent to $a_n \cdot 2$. An iteration of the addition loop is finished by | |
2628 forwarding the carry to the next iteration at step 5.3. | |
2629 | |
2630 Step 6 takes care of any final carry by setting the $a.used$'th digit of the result to the carry and augmenting the \textbf{used} count of $b$. | |
2631 Step 7 clears any leading digits of $b$ in case it originally had a larger magnitude than $a$. | |
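
The doubling loop and the final carry can be tried in isolation on a plain array of digits. The following is a sketch only: mul2\_digits, DIGIT\_BITS and DIGIT\_MASK are hypothetical names, the digits are assumed to be 28 bits wide and stored least significant first in 32-bit words, and the caller is assumed to have reserved room for one extra digit in place of the call to mp\_grow.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed radix: beta = 2^28, digits stored least significant first. */
#define DIGIT_BITS 28
#define DIGIT_MASK ((uint32_t)((1UL << DIGIT_BITS) - 1))

/* Doubles the n-digit number a[] into b[]. b must have room for n + 1
 * digits. Returns the number of digits written to b. */
static size_t mul2_digits(const uint32_t *a, size_t n, uint32_t *b)
{
    uint32_t r = 0;                                /* carry into this digit */
    size_t i;

    for (i = 0; i < n; i++) {
        uint32_t rr = a[i] >> (DIGIT_BITS - 1);    /* next carry: top bit   */
        b[i] = ((a[i] << 1) | r) & DIGIT_MASK;     /* shift up, add carry   */
        r = rr;                                    /* forward the carry     */
    }
    if (r != 0) {
        b[n] = r;                                  /* new leading digit (1) */
        return n + 1;
    }
    return n;
}
```

Because each digit of $a$ is read before the matching digit of $b$ is written, the sketch also works when both arguments refer to the same array.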
2632 | |
2633 \vspace{+3mm}\begin{small} | |
2634 \hspace{-5.1mm}{\bf File}: bn\_mp\_mul\_2.c | |
2635 \vspace{-3mm} | |
2636 \begin{alltt} | |
2637 016 | |
2638 017 /* b = a*2 */ | |
2639 018 int mp_mul_2(mp_int * a, mp_int * b) | |
2640 019 \{ | |
2641 020 int x, res, oldused; | |
2642 021 | |
2643 022 /* grow to accomodate result */ | |
2644 023 if (b->alloc < a->used + 1) \{ | |
2645 024 if ((res = mp_grow (b, a->used + 1)) != MP_OKAY) \{ | |
2646 025 return res; | |
2647 026 \} | |
2648 027 \} | |
2649 028 | |
2650 029 oldused = b->used; | |
2651 030 b->used = a->used; | |
2652 031 | |
2653 032 \{ | |
2654 033 register mp_digit r, rr, *tmpa, *tmpb; | |
2655 034 | |
2656 035 /* alias for source */ | |
2657 036 tmpa = a->dp; | |
2658 037 | |
2659 038 /* alias for dest */ | |
2660 039 tmpb = b->dp; | |
2661 040 | |
2662 041 /* carry */ | |
2663 042 r = 0; | |
2664 043 for (x = 0; x < a->used; x++) \{ | |
2665 044 | |
2666 045 /* get what will be the *next* carry bit from the | |
2667 046 * MSB of the current digit | |
2668 047 */ | |
2669 048 rr = *tmpa >> ((mp_digit)(DIGIT_BIT - 1)); | |
2670 049 | |
2671 050 /* now shift up this digit, add in the carry [from the previous] */ | |
2672 051 *tmpb++ = ((*tmpa++ << ((mp_digit)1)) | r) & MP_MASK; | |
2673 052 | |
2674 053 /* copy the carry that would be from the source | |
2675 054 * digit into the next iteration | |
2676 055 */ | |
2677 056 r = rr; | |
2678 057 \} | |
2679 058 | |
2680 059 /* new leading digit? */ | |
2681 060 if (r != 0) \{ | |
2682 061 /* add a MSB which is always 1 at this point */ | |
2683 062 *tmpb = 1; | |
2684 063 ++(b->used); | |
2685 064 \} | |
2686 065 | |
2687 066 /* now zero any excess digits on the destination | |
2688 067 * that we didn't write to | |
2689 068 */ | |
2690 069 tmpb = b->dp + b->used; | |
2691 070 for (x = b->used; x < oldused; x++) \{ | |
2692 071 *tmpb++ = 0; | |
2693 072 \} | |
2694 073 \} | |
2695 074 b->sign = a->sign; | |
2696 075 return MP_OKAY; | |
2697 076 \} | |
2698 \end{alltt} | |
2699 \end{small} | |
2700 | |
2701 This implementation is essentially an optimized implementation of s\_mp\_add for the case of doubling an input. The only noteworthy difference | |
2702 is the use of the logical shift operator on line 51 to perform a single precision doubling. | |
2703 | |
2704 \subsection{Division by Two} | |
2705 A division by two can just as easily be accomplished with a logical shift right as multiplication by two can be with a logical shift left. | |
2706 | |
2707 \newpage\begin{figure}[!here] | |
2708 \begin{small} | |
2709 \begin{center} | |
2710 \begin{tabular}{l} | |
2711 \hline Algorithm \textbf{mp\_div\_2}. \\ | |
2712 \textbf{Input}. One mp\_int $a$ \\ | |
2713 \textbf{Output}. $b = a/2$. \\ | |
2714 \hline \\ | |
2715 1. If $b.alloc < a.used$ then grow $b$ to hold $a.used$ digits. (\textit{mp\_grow}) \\ | |
2716 2. If the reallocation failed return(\textit{MP\_MEM}). \\ | |
2717 3. $oldused \leftarrow b.used$ \\ | |
2718 4. $b.used \leftarrow a.used$ \\ | |
2719 5. $r \leftarrow 0$ \\ | |
2720 6. for $n$ from $b.used - 1$ to $0$ do \\ | |
2721 \hspace{3mm}6.1 $rr \leftarrow a_n \mbox{ (mod }2\mbox{)}$\\ | |
2722 \hspace{3mm}6.2 $b_n \leftarrow (a_n >> 1) + (r << (lg(\beta) - 1)) \mbox{ (mod }\beta\mbox{)}$ \\ | |
2723 \hspace{3mm}6.3 $r \leftarrow rr$ \\ | |
2724 7. If $b.used < oldused$ then do \\ | |
2725 \hspace{3mm}7.1 for $n$ from $b.used$ to $oldused - 1$ do \\ | |
2726 \hspace{6mm}7.1.1 $b_n \leftarrow 0$ \\ | |
2727 8. $b.sign \leftarrow a.sign$ \\ | |
2728 9. Clamp excess digits of $b$. (\textit{mp\_clamp}) \\ | |
2729 10. Return(\textit{MP\_OKAY}).\\ | |
2730 \hline | |
2731 \end{tabular} | |
2732 \end{center} | |
2733 \end{small} | |
2734 \caption{Algorithm mp\_div\_2} | |
2735 \end{figure} | |
2736 | |
2737 \textbf{Algorithm mp\_div\_2.} | |
2738 This algorithm will divide an mp\_int by two using logical shifts to the right. Like mp\_mul\_2 it uses a modified low level addition | |
2739 core as the basis of the algorithm. Unlike mp\_mul\_2 the shift operations work from the leading digit to the trailing digit. The algorithm | |
2740 could be written to work from the trailing digit to the leading digit; however, it would have to stop one short of $a.used - 1$ digits to prevent | |
2741 reading past the end of the array of digits. | |
2742 | |
2743 Essentially the loop at step 6 is similar to that of mp\_mul\_2 except the logical shifts go in the opposite direction and the carry is at the | |
2744 least significant bit not the most significant bit. | |
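
The halving loop can likewise be sketched on a plain array of digits. The names div2\_digits and DIGIT\_BITS are hypothetical, the digits are assumed to be 28 bits wide and stored least significant first in 32-bit words, and the trailing loop is a simplified stand-in for mp\_clamp.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed radix: beta = 2^28, digits stored least significant first. */
#define DIGIT_BITS 28

/* Halves the n-digit number a[] into b[] (b needs room for n digits).
 * Returns the digit count after trimming leading zero digits. */
static size_t div2_digits(const uint32_t *a, size_t n, uint32_t *b)
{
    uint32_t r = 0;                        /* carry from the digit above */
    size_t i;

    /* work from the leading digit down to the trailing digit */
    for (i = n; i-- > 0; ) {
        uint32_t rr = a[i] & 1u;           /* carry for the next iteration */
        b[i] = (a[i] >> 1) | (r << (DIGIT_BITS - 1));
        r = rr;
    }

    /* simplified stand-in for mp_clamp: drop leading zero digits */
    while (n > 0 && b[n - 1] == 0) {
        n--;
    }
    return n;
}
```

The bit shifted out of the trailing digit is discarded, so an odd input is rounded down in magnitude.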
2745 | |
2746 \vspace{+3mm}\begin{small} | |
2747 \hspace{-5.1mm}{\bf File}: bn\_mp\_div\_2.c | |
2748 \vspace{-3mm} | |
2749 \begin{alltt} | |
2750 016 | |
2751 017 /* b = a/2 */ | |
2752 018 int mp_div_2(mp_int * a, mp_int * b) | |
2753 019 \{ | |
2754 020 int x, res, oldused; | |
2755 021 | |
2756 022 /* copy */ | |
2757 023 if (b->alloc < a->used) \{ | |
2758 024 if ((res = mp_grow (b, a->used)) != MP_OKAY) \{ | |
2759 025 return res; | |
2760 026 \} | |
2761 027 \} | |
2762 028 | |
2763 029 oldused = b->used; | |
2764 030 b->used = a->used; | |
2765 031 \{ | |
2766 032 register mp_digit r, rr, *tmpa, *tmpb; | |
2767 033 | |
2768 034 /* source alias */ | |
2769 035 tmpa = a->dp + b->used - 1; | |
2770 036 | |
2771 037 /* dest alias */ | |
2772 038 tmpb = b->dp + b->used - 1; | |
2773 039 | |
2774 040 /* carry */ | |
2775 041 r = 0; | |
2776 042 for (x = b->used - 1; x >= 0; x--) \{ | |
2777 043 /* get the carry for the next iteration */ | |
2778 044 rr = *tmpa & 1; | |
2779 045 | |
2780 046 /* shift the current digit, add in carry and store */ | |
2781 047 *tmpb-- = (*tmpa-- >> 1) | (r << (DIGIT_BIT - 1)); | |
2782 048 | |
2783 049 /* forward carry to next iteration */ | |
2784 050 r = rr; | |
2785 051 \} | |
2786 052 | |
2787 053 /* zero excess digits */ | |
2788 054 tmpb = b->dp + b->used; | |
2789 055 for (x = b->used; x < oldused; x++) \{ | |
2790 056 *tmpb++ = 0; | |
2791 057 \} | |
2792 058 \} | |
2793 059 b->sign = a->sign; | |
2794 060 mp_clamp (b); | |
2795 061 return MP_OKAY; | |
2796 062 \} | |
2797 \end{alltt} | |
2798 \end{small} | |
2799 | |
2800 \section{Polynomial Basis Operations} | |
2801 Recall from section 4.3 that any integer can be represented as a polynomial in $x$ as $y = f(\beta)$. Such a representation is also known as | |
2802 the polynomial basis \cite[pp. 48]{ROSE}. Given such a notation a multiplication or division by $x$ amounts to shifting whole digits a single | |
2803 place. The need for such operations arises in several other higher level algorithms such as Barrett and Montgomery reduction, integer | |
2804 division and Karatsuba multiplication. | |
2805 | |
2806 Converting from an array of digits to polynomial basis is very simple. Consider the integer $y \equiv (a_2, a_1, a_0)_{\beta}$ and recall that | |
2807 $y = \sum_{i=0}^{2} a_i \beta^i$. Simply replace $\beta$ with $x$ and the expression is in polynomial basis. For example, $f(x) = 8x + 9$ is the | |
2808 polynomial basis representation for $89$ using radix ten. That is, $f(10) = 8(10) + 9 = 89$. | |
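
Going the other way, evaluating the polynomial at $x = \beta$ recovers the integer. The sketch below does this with Horner's rule, which is a choice made for this illustration and not something the text prescribes; poly\_eval is a hypothetical name and the result is assumed to fit in 64 bits.

```c
#include <stddef.h>
#include <stdint.h>

/* Evaluates f(x) = a[n-1] x^(n-1) + ... + a[1] x + a[0] at x = radix
 * using Horner's rule. No overflow checking is done in this sketch. */
static uint64_t poly_eval(const uint32_t *a, size_t n, uint64_t radix)
{
    uint64_t y = 0;
    size_t i;

    /* start at the leading coefficient and work down to a[0] */
    for (i = n; i-- > 0; ) {
        y = y * radix + a[i];
    }
    return y;
}
```

With the coefficients $(8, 9)$ stored least significant first as the array \{9, 8\} and a radix of ten, the function returns $89$ as in the example above.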
2809 | |
2810 \subsection{Multiplication by $x$} | |
2811 | |
2812 Given a polynomial in $x$ such as $f(x) = a_n x^n + a_{n-1} x^{n-1} + ... + a_0$ multiplying by $x$ amounts to shifting the coefficients up one | |
2813 degree. In this case $f(x) \cdot x = a_n x^{n+1} + a_{n-1} x^n + ... + a_0 x$. From a scalar basis point of view multiplying by $x$ is equivalent to | |
2814 multiplying by the integer $\beta$. | |
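
As a brief worked example, reuse $f(x) = 8x + 9$ from above with radix ten (\textit{chosen here purely for illustration}). Multiplying by $x$ raises the degree of every term by one:

\begin{displaymath}
f(x) \cdot x = 8x^2 + 9x, \qquad f(10) \cdot 10 = 8(10)^2 + 9(10) = 890 = 89 \cdot 10
\end{displaymath}

In digit form $(8, 9)_{10}$ becomes $(8, 9, 0)_{10}$: every digit moves up one place and a zero fills the vacated position.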
2815 | |
2816 \newpage\begin{figure}[!here] | |
2817 \begin{small} | |
2818 \begin{center} | |
2819 \begin{tabular}{l} | |
2820 \hline Algorithm \textbf{mp\_lshd}. \\ | |
2821 \textbf{Input}. One mp\_int $a$ and an integer $b$ \\ | |
2822 \textbf{Output}. $a \leftarrow a \cdot \beta^b$ (equivalent to multiplication by $x^b$). \\ | |
2823 \hline \\ | |
2824 1. If $b \le 0$ then return(\textit{MP\_OKAY}). \\ | |
2825 2. If $a.alloc < a.used + b$ then grow $a$ to at least $a.used + b$ digits. (\textit{mp\_grow}). \\ | |
2826 3. If the reallocation failed return(\textit{MP\_MEM}). \\ | |
2827 4. $a.used \leftarrow a.used + b$ \\ | |
2828 5. $i \leftarrow a.used - 1$ \\ | |
2829 6. $j \leftarrow a.used - 1 - b$ \\ | |
2830 7. for $n$ from $a.used - 1$ to $b$ do \\ | |
2831 \hspace{3mm}7.1 $a_{i} \leftarrow a_{j}$ \\ | |
2832 \hspace{3mm}7.2 $i \leftarrow i - 1$ \\ | |
2833 \hspace{3mm}7.3 $j \leftarrow j - 1$ \\ | |
2834 8. for $n$ from 0 to $b - 1$ do \\ | |
2835 \hspace{3mm}8.1 $a_n \leftarrow 0$ \\ | |
2836 9. Return(\textit{MP\_OKAY}). \\ | |
2837 \hline | |
2838 \end{tabular} | |
2839 \end{center} | |
2840 \end{small} | |
2841 \caption{Algorithm mp\_lshd} | |
2842 \end{figure} | |
2843 | |
2844 \textbf{Algorithm mp\_lshd.} | |
2845 This algorithm multiplies an mp\_int by the $b$'th power of $x$. This is equivalent to multiplying by $\beta^b$. The algorithm differs | |
2846 from the other algorithms presented so far as it performs the operation in place instead of storing the result in a separate location. The | |
2847 motivation behind this change is the way this function is typically used. Algorithms such as mp\_add store the result in an optionally | |
2848 different third mp\_int because the original inputs are often still required. Algorithm mp\_lshd (\textit{and similarly algorithm mp\_rshd}) is | |
2849 typically used on values where the original value is no longer required. The algorithm will return success immediately if | |
2850 $b \le 0$ since the rest of the algorithm is only valid when $b > 0$. | |
2851 | |
2852 First the destination $a$ is grown as required to accommodate the result. The counters $i$ and $j$ are used to form a \textit{sliding window} over | |
2853 the digits of $a$ of length $b$. The head of the sliding window is at $i$ (\textit{the leading digit}) and the tail at $j$ (\textit{the trailing digit}). | |
2854 The loop on step 7 copies the digit from the tail to the head. In each iteration the window is moved down one digit. The last loop on | |
2855 step 8 sets the lower $b$ digits to zero. | |
2856 | |
2857 \newpage | |
2858 \begin{center} | |
2859 \begin{figure}[here] | |
2860 \includegraphics{pics/sliding_window.ps} | |
2861 \caption{Sliding Window Movement} | |
2862 \label{pic:sliding_window} | |
2863 \end{figure} | |
2864 \end{center} | |
2865 | |
2866 \vspace{+3mm}\begin{small} | |
2867 \hspace{-5.1mm}{\bf File}: bn\_mp\_lshd.c | |
2868 \vspace{-3mm} | |
2869 \begin{alltt} | |
2870 016 | |
2871 017 /* shift left a certain amount of digits */ | |
2872 018 int mp_lshd (mp_int * a, int b) | |
2873 019 \{ | |
2874 020 int x, res; | |
2875 021 | |
2876 022 /* if b is less than or equal to zero return */ | |
2877 023 if (b <= 0) \{ | |
2878 024 return MP_OKAY; | |
2879 025 \} | |
2880 026 | |
2881 027 /* grow to fit the new digits */ | |
2882 028 if (a->alloc < a->used + b) \{ | |
2883 029 if ((res = mp_grow (a, a->used + b)) != MP_OKAY) \{ | |
2884 030 return res; | |
2885 031 \} | |
2886 032 \} | |
2887 033 | |
2888 034 \{ | |
2889 035 register mp_digit *top, *bottom; | |
2890 036 | |
2891 037 /* increment the used by the shift amount then copy upwards */ | |
2892 038 a->used += b; | |
2893 039 | |
2894 040 /* top */ | |
2895 041 top = a->dp + a->used - 1; | |
2896 042 | |
2897 043 /* base */ | |
2898 044 bottom = a->dp + a->used - 1 - b; | |
2899 045 | |
2900 046 /* much like mp_rshd this is implemented using a sliding window | |
2901 047 * except the window goes the otherway around. Copying from | |
2902 048 * the bottom to the top. see bn_mp_rshd.c for more info. | |
2903 049 */ | |
2904 050 for (x = a->used - 1; x >= b; x--) \{ | |
2905 051 *top-- = *bottom--; | |
2906 052 \} | |
2907 053 | |
2908 054 /* zero the lower digits */ | |
2909 055 top = a->dp; | |
2910 056 for (x = 0; x < b; x++) \{ | |
2911 057 *top++ = 0; | |
2912 058 \} | |
2913 059 \} | |
2914 060 return MP_OKAY; | |
2915 061 \} | |
2916 \end{alltt} | |
2917 \end{small} | |
2918 | |
2919 The if statement on line 23 ensures that the $b$ variable is greater than zero. The \textbf{used} count is incremented by $b$ before | |
2920 the copy loop begins. This eliminates the need for an additional variable in the for loop. The variable $top$ on line 41 is an alias | |
2921 for the leading digit while $bottom$ on line 44 is an alias for the trailing edge. The aliases form a window of exactly $b$ digits | |
2922 over the input. | |
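
The sliding window copy can also be sketched on a plain array, independent of the mp\_int structure. The following is a minimal illustration and not the library routine: the names digit\_t and shift\_digits\_left are hypothetical, and the caller is assumed to have reserved room for $used + b$ digits beforehand (\textit{the role played by mp\_grow above}).

\begin{small}
\begin{verbatim}
#include <stddef.h>
#include <stdint.h>

typedef uint32_t digit_t;   /* hypothetical stand-in for mp_digit */

/* Shift the `used` digits of d[] up by b places (multiply by beta^b).
 * Assumes d[] has room for used + b digits.
 * Returns the new digit count. */
size_t shift_digits_left(digit_t *d, size_t used, size_t b)
{
    size_t i;

    /* nothing to move for a zero shift or a zero value */
    if (b == 0 || used == 0) {
        return used;
    }

    /* copy from the top down so that no digit is
     * overwritten before it has been read */
    for (i = used + b; i-- > b; ) {
        d[i] = d[i - b];
    }

    /* clear the b vacated low digits */
    for (i = 0; i < b; i++) {
        d[i] = 0;
    }
    return used + b;
}
\end{verbatim}
\end{small}

Copying from the top downwards is what makes the operation safe in place: the source digit at index $i - b$ always lies below the destination at index $i$.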
2923 | |
2924 \subsection{Division by $x$} | |
2925 | |
2926 Division by powers of $x$ is easily achieved by shifting the digits right and removing any that will end up to the right of the zero'th digit. | |
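
For instance, with radix ten (\textit{an illustrative choice}) the integer $576$ has the polynomial basis form $f(x) = 5x^2 + 7x + 6$. Shifting the digits right by one place corresponds to the decomposition

\begin{displaymath}
f(x) = (5x + 7) \cdot x + 6, \qquad 576 = 57 \cdot 10 + 6
\end{displaymath}

The shifted digits $(5, 7)_{10}$ form the quotient and the discarded digit $6$ is the remainder.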
2927 | |
2928 \newpage\begin{figure}[!here] | |
2929 \begin{small} | |
2930 \begin{center} | |
2931 \begin{tabular}{l} | |
2932 \hline Algorithm \textbf{mp\_rshd}. \\ | |
2933 \textbf{Input}. One mp\_int $a$ and an integer $b$ \\ | |
2934 \textbf{Output}. $a \leftarrow a / \beta^b$ (Divide by $x^b$). \\ | |
2935 \hline \\ | |
2936 1. If $b \le 0$ then return. \\ | |
2937 2. If $a.used \le b$ then do \\ | |
2938 \hspace{3mm}2.1 Zero $a$. (\textit{mp\_zero}). \\ | |
2939 \hspace{3mm}2.2 Return. \\ | |
2940 3. $i \leftarrow 0$ \\ | |
2941 4. $j \leftarrow b$ \\ | |
2942 5. for $n$ from 0 to $a.used - b - 1$ do \\ | |
2943 \hspace{3mm}5.1 $a_i \leftarrow a_j$ \\ | |
2944 \hspace{3mm}5.2 $i \leftarrow i + 1$ \\ | |
2945 \hspace{3mm}5.3 $j \leftarrow j + 1$ \\ | |
2946 6. for $n$ from $a.used - b$ to $a.used - 1$ do \\ | |
2947 \hspace{3mm}6.1 $a_n \leftarrow 0$ \\ | |
2948 7. $a.used \leftarrow a.used - b$ \\ | |
2949 8. Return. \\ | |
2950 \hline | |
2951 \end{tabular} | |
2952 \end{center} | |
2953 \end{small} | |
2954 \caption{Algorithm mp\_rshd} | |
2955 \end{figure} | |
2956 | |
2957 \textbf{Algorithm mp\_rshd.} | |
2958 This algorithm divides the input in place by the $b$'th power of $x$. It is analogous to dividing by $\beta^b$ but much quicker since | |
2959 it does not require single precision division. This algorithm does not actually return an error code as it cannot fail. | |
2960 | |
2961 If the input $b$ is less than one the algorithm quickly returns without performing any work. If the \textbf{used} count is less than or equal | |
2962 to the shift count $b$ then it will simply zero the input and return. | |
2963 | |
2964 After the trivial cases have been handled the sliding window is set up. Much like the case of algorithm mp\_lshd, a sliding window that | |
2965 is $b$ digits wide is used to copy the digits. Unlike mp\_lshd, the window slides in the opposite direction, from the trailing to the leading digit, | |
2966 and the digits are copied from the leading to the trailing edge. | |
2967 | |
2968 Once the window copy is complete the upper digits must be zeroed and the \textbf{used} count decremented. | |
2969 | |
2970 \vspace{+3mm}\begin{small} | |
2971 \hspace{-5.1mm}{\bf File}: bn\_mp\_rshd.c | |
2972 \vspace{-3mm} | |
2973 \begin{alltt} | |
2974 016 | |
2975 017 /* shift right a certain amount of digits */ | |
2976 018 void mp_rshd (mp_int * a, int b) | |
2977 019 \{ | |
2978 020 int x; | |
2979 021 | |
2980 022 /* if b <= 0 then ignore it */ | |
2981 023 if (b <= 0) \{ | |
2982 024 return; | |
2983 025 \} | |
2984 026 | |
2985 027 /* if b >= used then simply zero it and return */ | |
2986 028 if (a->used <= b) \{ | |
2987 029 mp_zero (a); | |
2988 030 return; | |
2989 031 \} | |
2990 032 | |
2991 033 \{ | |
2992 034 register mp_digit *bottom, *top; | |
2993 035 | |
2994 036 /* shift the digits down */ | |
2995 037 | |
2996 038 /* bottom */ | |
2997 039 bottom = a->dp; | |
2998 040 | |
2999 041 /* top [offset into digits] */ | |
3000 042 top = a->dp + b; | |
3001 043 | |
3002 044 /* this is implemented as a sliding window where | |
3003 045 * the window is b-digits long and digits from | |
3004 046 * the top of the window are copied to the bottom | |
3005 047 * | |
3006 048 * e.g. | |
3007 049 | |
3008 050 b-2 | b-1 | b0 | b1 | b2 | ... | bb | ----> | |
3009 051 /\symbol{92} | ----> | |
3010 052 \symbol{92}-------------------/ ----> | |
3011 053 */ | |
3012 054 for (x = 0; x < (a->used - b); x++) \{ | |
3013 055 *bottom++ = *top++; | |
3014 056 \} | |
3015 057 | |
3016 058 /* zero the top digits */ | |
3017 059 for (; x < a->used; x++) \{ | |
3018 060 *bottom++ = 0; | |
3019 061 \} | |
3020 062 \} | |
3021 063 | |
3022 064 /* remove excess digits */ | |
3023 065 a->used -= b; | |
3024 066 \} | |
3025 \end{alltt} | |
3026 \end{small} | |
3027 | |
3028 The only noteworthy element of this routine is the lack of a return value. The function is declared \textbf{void} since, as noted above, it cannot fail. | |
3029 | |
3031 | |
3032 \section{Powers of Two} | |
3033 | |
3034 Now that algorithms for moving single bits as well as whole digits exist, algorithms for moving the ``in between'' distances are required. For | |
3035 example, the ability to quickly multiply by $2^k$ for any $k$ without using a full multiplier algorithm would prove useful. Instead of performing single | |
3036 shifts $k$ times to achieve a multiplication by $2^{\pm k}$, a mixture of whole digit shifting and partial digit shifting is employed. | |
3037 | |
3038 \subsection{Multiplication by Power of Two} | |
3039 | |
3040 \newpage\begin{figure}[!here] | |
3041 \begin{small} | |
3042 \begin{center} | |
3043 \begin{tabular}{l} | |
3044 \hline Algorithm \textbf{mp\_mul\_2d}. \\ | |
3045 \textbf{Input}. One mp\_int $a$ and an integer $b$ \\ | |
3046 \textbf{Output}. $c \leftarrow a \cdot 2^b$. \\ | |
3047 \hline \\ | |
3048 1. $c \leftarrow a$. (\textit{mp\_copy}) \\ | |
3049 2. If $c.alloc < c.used + \lfloor b / lg(\beta) \rfloor + 2$ then grow $c$ accordingly. \\ | |
3050 3. If the reallocation failed return(\textit{MP\_MEM}). \\ | |
3051 4. If $b \ge lg(\beta)$ then \\ | |
3052 \hspace{3mm}4.1 $c \leftarrow c \cdot \beta^{\lfloor b / lg(\beta) \rfloor}$ (\textit{mp\_lshd}). \\ | |
3053 \hspace{3mm}4.2 If step 4.1 failed return(\textit{MP\_MEM}). \\ | |
3054 5. $d \leftarrow b \mbox{ (mod }lg(\beta)\mbox{)}$ \\ | |
3055 6. If $d \ne 0$ then do \\ | |
3056 \hspace{3mm}6.1 $mask \leftarrow 2^d$ \\ | |
3057 \hspace{3mm}6.2 $r \leftarrow 0$ \\ | |
3058 \hspace{3mm}6.3 for $n$ from $0$ to $c.used - 1$ do \\ | |
3059 \hspace{6mm}6.3.1 $rr \leftarrow c_n >> (lg(\beta) - d) \mbox{ (mod }mask\mbox{)}$ \\ | |
3060 \hspace{6mm}6.3.2 $c_n \leftarrow (c_n << d) + r \mbox{ (mod }\beta\mbox{)}$ \\ | |
3061 \hspace{6mm}6.3.3 $r \leftarrow rr$ \\ | |
3062 \hspace{3mm}6.4 If $r > 0$ then do \\ | |
3063 \hspace{6mm}6.4.1 $c_{c.used} \leftarrow r$ \\ | |
3064 \hspace{6mm}6.4.2 $c.used \leftarrow c.used + 1$ \\ | |
3065 7. Return(\textit{MP\_OKAY}). \\ | |
3066 \hline | |
3067 \end{tabular} | |
3068 \end{center} | |
3069 \end{small} | |
3070 \caption{Algorithm mp\_mul\_2d} | |
3071 \end{figure} | |
3072 | |
3073 \textbf{Algorithm mp\_mul\_2d.} | |
3074 This algorithm multiplies $a$ by $2^b$ and stores the result in $c$. The algorithm uses algorithm mp\_lshd and a derivative of algorithm mp\_mul\_2 to | |
3075 quickly compute the product. | |
3076 | |
3077 First the algorithm will multiply $a$ by $x^{\lfloor b / lg(\beta) \rfloor}$ which will ensure that the remaining multiplier is less than | |
3078 $\beta$. For example, if $b = 37$ and $\beta = 2^{28}$ then this step will multiply by $x$, leaving a multiplication by $2^{37 - 28} = 2^{9}$ | |
3079 still to perform. | |
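
The split used in this example can be written out in general. Let $q$ denote the whole digit shift count and $d$ the partial digit shift count (\textit{$q$ is a name introduced here only for illustration}):

\begin{displaymath}
b = q \cdot lg(\beta) + d, \qquad q = \lfloor b / lg(\beta) \rfloor, \qquad 0 \le d < lg(\beta)
\end{displaymath}
\begin{displaymath}
a \cdot 2^b = \left ( a \cdot \beta^{q} \right ) \cdot 2^{d}
\end{displaymath}

With $b = 37$ and $lg(\beta) = 28$ this gives $q = 1$ and $d = 9$.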
3080 | |
3081 After the digits have been shifted appropriately at most $lg(\beta) - 1$ shifts are left to perform. Step 5 calculates the number of remaining shifts | |
3082 required. If it is non-zero a modified shift loop is used to calculate the remaining product. | |
3083 Essentially the loop is a generic version of algorithm mp\_mul\_2 designed to handle any shift count in the range $1 \le d < lg(\beta)$. The $mask$ | |
3084 variable is used to extract the upper $d$ bits to form the carry for the next iteration. | |
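
The partial digit shift of step 6 can be sketched on a plain array of digits. This is an illustration under stated assumptions and not the library code: the digit width of 28 bits, the macro names and the function name shift\_bits\_left are all hypothetical, and the caller is assumed to have reserved one spare digit for the final carry.

\begin{small}
\begin{verbatim}
#include <stddef.h>
#include <stdint.h>

#define SKETCH_DIGIT_BIT 28u   /* assumed digit width */
#define SKETCH_MASK ((((uint32_t)1) << SKETCH_DIGIT_BIT) - 1u)

/* Multiply the `used` digits of c[] by 2^d, 1 <= d < SKETCH_DIGIT_BIT.
 * Assumes c[] has room for used + 1 digits.
 * Returns the new digit count. */
size_t shift_bits_left(uint32_t *c, size_t used, unsigned d)
{
    const uint32_t mask  = (((uint32_t)1) << d) - 1u; /* low d bits */
    const unsigned shift = SKETCH_DIGIT_BIT - d;
    uint32_t r = 0, rr;
    size_t n;

    for (n = 0; n < used; n++) {
        /* the top d bits of this digit become the next carry */
        rr = (c[n] >> shift) & mask;

        /* shift the digit, merge the incoming carry, trim to width */
        c[n] = ((c[n] << d) | r) & SKETCH_MASK;

        r = rr;
    }

    /* a non-zero final carry becomes a new leading digit */
    if (r != 0) {
        c[used++] = r;
    }
    return used;
}
\end{verbatim}
\end{small}

For example, shifting the single digit $2^{28} - 1$ left by nine bits yields the two digits $(2^9 - 1, 2^{28} - 2^9)_{\beta}$, since $(2^{28} - 1) \cdot 2^9 = (2^9 - 1) \cdot 2^{28} + (2^{28} - 2^9)$.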
3085 | |
3086 This algorithm is loosely measured as an $O(2n)$ algorithm, which means that if the input is $n$ digits long it takes $2n$ ``time'' to | |
3087 complete. It is possible to optimize this algorithm down to an $O(n)$ algorithm at a cost of making the algorithm slightly harder to follow. | |
3088 | |
3089 \vspace{+3mm}\begin{small} | |
3090 \hspace{-5.1mm}{\bf File}: bn\_mp\_mul\_2d.c | |
3091 \vspace{-3mm} | |
3092 \begin{alltt} | |
3093 016 | |
3094 017 /* shift left by a certain bit count */ | |
3095 018 int mp_mul_2d (mp_int * a, int b, mp_int * c) | |
3096 019 \{ | |
3097 020 mp_digit d; | |
3098 021 int res; | |
3099 022 | |
3100 023 /* copy */ | |
3101 024 if (a != c) \{ | |
3102 025 if ((res = mp_copy (a, c)) != MP_OKAY) \{ | |
3103 026 return res; | |
3104 027 \} | |
3105 028 \} | |
3106 029 | |
3107 030 if (c->alloc < (int)(c->used + b/DIGIT_BIT + 1)) \{ | |
3108 031 if ((res = mp_grow (c, c->used + b / DIGIT_BIT + 1)) != MP_OKAY) \{ | |
3109 032 return res; | |
3110 033 \} | |
3111 034 \} | |
3112 035 | |
3113 036 /* shift by as many digits in the bit count */ | |
3114 037 if (b >= (int)DIGIT_BIT) \{ | |
3115 038 if ((res = mp_lshd (c, b / DIGIT_BIT)) != MP_OKAY) \{ | |
3116 039 return res; | |
3117 040 \} | |
3118 041 \} | |
3119 042 | |
3120 043 /* shift any bit count < DIGIT_BIT */ | |
3121 044 d = (mp_digit) (b % DIGIT_BIT); | |
3122 045 if (d != 0) \{ | |
3123 046 register mp_digit *tmpc, shift, mask, r, rr; | |
3124 047 register int x; | |
3125 048 | |
3126 049 /* bitmask for carries */ | |
3127 050 mask = (((mp_digit)1) << d) - 1; | |
3128 051 | |
3129 052 /* shift for msbs */ | |
3130 053 shift = DIGIT_BIT - d; | |
3131 054 | |
3132 055 /* alias */ | |
3133 056 tmpc = c->dp; | |
3134 057 | |
3135 058 /* carry */ | |
3136 059 r = 0; | |
3137 060 for (x = 0; x < c->used; x++) \{ | |
3138 061 /* get the higher bits of the current word */ | |
3139 062 rr = (*tmpc >> shift) & mask; | |
3140 063 | |
3141 064 /* shift the current word and OR in the carry */ | |
3142 065 *tmpc = ((*tmpc << d) | r) & MP_MASK; | |
3143 066 ++tmpc; | |
3144 067 | |
3145 068 /* set the carry to the carry bits of the current word */ | |
3146 069 r = rr; | |
3147 070 \} | |
3148 071 | |
3149 072 /* set final carry */ | |
3150 073 if (r != 0) \{ | |
3151 074 c->dp[(c->used)++] = r; | |
3152 075 \} | |
3153 076 \} | |
3154 077 mp_clamp (c); | |
3155 078 return MP_OKAY; | |
3156 079 \} | |
3157 \end{alltt} | |
3158 \end{small} | |
3159 | |
3161 | |
3162 \subsection{Division by Power of Two} | |
3163 | |
3164 \newpage\begin{figure}[!here] | |
3165 \begin{small} | |
3166 \begin{center} | |
3167 \begin{tabular}{l} | |
3168 \hline Algorithm \textbf{mp\_div\_2d}. \\ | |
3169 \textbf{Input}. One mp\_int $a$ and an integer $b$ \\ | |
3170 \textbf{Output}. $c \leftarrow \lfloor a / 2^b \rfloor, d \leftarrow a \mbox{ (mod }2^b\mbox{)}$. \\ | |
3171 \hline \\ | |
3172 1. If $b \le 0$ then do \\ | |
3173 \hspace{3mm}1.1 $c \leftarrow a$ (\textit{mp\_copy}) \\ | |
3174 \hspace{3mm}1.2 $d \leftarrow 0$ (\textit{mp\_zero}) \\ | |
3175 \hspace{3mm}1.3 Return(\textit{MP\_OKAY}). \\ | |
3176 2. $c \leftarrow a$ \\ | |
3177 3. $d \leftarrow a \mbox{ (mod }2^b\mbox{)}$ (\textit{mp\_mod\_2d}) \\ | |
3178 4. If $b \ge lg(\beta)$ then do \\ | |
3179 \hspace{3mm}4.1 $c \leftarrow \lfloor c/\beta^{\lfloor b/lg(\beta) \rfloor} \rfloor$ (\textit{mp\_rshd}). \\ | |
3180 5. $k \leftarrow b \mbox{ (mod }lg(\beta)\mbox{)}$ \\ | |
3181 6. If $k \ne 0$ then do \\ | |
3182 \hspace{3mm}6.1 $mask \leftarrow 2^k$ \\ | |
3183 \hspace{3mm}6.2 $r \leftarrow 0$ \\ | |
3184 \hspace{3mm}6.3 for $n$ from $c.used - 1$ to $0$ do \\ | |
3185 \hspace{6mm}6.3.1 $rr \leftarrow c_n \mbox{ (mod }mask\mbox{)}$ \\ | |
3186 \hspace{6mm}6.3.2 $c_n \leftarrow (c_n >> k) + (r << (lg(\beta) - k))$ \\ | |
3187 \hspace{6mm}6.3.3 $r \leftarrow rr$ \\ | |
3188 7. Clamp excess digits of $c$. (\textit{mp\_clamp}) \\ | |
3189 8. Return(\textit{MP\_OKAY}). \\ | |
3190 \hline | |
3191 \end{tabular} | |
3192 \end{center} | |
3193 \end{small} | |
3194 \caption{Algorithm mp\_div\_2d} | |
3195 \end{figure} | |
3196 | |
3197 \textbf{Algorithm mp\_div\_2d.} | |
3198 This algorithm will divide an input $a$ by $2^b$ and produce the quotient and remainder. The algorithm is designed much like algorithm | |
3199 mp\_mul\_2d by first using whole digit shifts then single precision shifts. This algorithm will also produce the remainder of the division | |
3200 by using algorithm mp\_mod\_2d. | |
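
A small trace may help. Assume, purely for illustration, a radix of $\beta = 2^4$ and the input $a = 182 = (11, 6)_{\beta}$ with $b = 3$. Since $b < lg(\beta)$ no whole digit shift occurs and $k = 3$. The loop of step 6.3 then proceeds from the leading digit down:

\begin{center}
\begin{tabular}{|c|c|c|c|}
\hline $n$ & $rr$ & new $c_n$ & $r$ after step 6.3.3 \\
\hline 1 & $11 \mbox{ mod } 8 = 3$ & $(11 >> 3) + (0 << 1) = 1$ & 3 \\
\hline 0 & $6 \mbox{ mod } 8 = 6$ & $(6 >> 3) + (3 << 1) = 6$ & 6 \\
\hline
\end{tabular}
\end{center}

The quotient is $(1, 6)_{\beta} = 22 = \lfloor 182 / 8 \rfloor$, and the remainder computed separately by mp\_mod\_2d is $182 \mbox{ (mod }8\mbox{)} = 6$.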
3201 | |
3202 \vspace{+3mm}\begin{small} | |
3203 \hspace{-5.1mm}{\bf File}: bn\_mp\_div\_2d.c | |
3204 \vspace{-3mm} | |
3205 \begin{alltt} | |
3206 016 | |
3207 017 /* shift right by a certain bit count (store quotient in c, optional | |
3208 remainder in d) */ | |
3209 018 int mp_div_2d (mp_int * a, int b, mp_int * c, mp_int * d) | |
3210 019 \{ | |
3211 020 mp_digit D, r, rr; | |
3212 021 int x, res; | |
3213 022 mp_int t; | |
3214 023 | |
3215 024 | |
3216 025 /* if the shift count is <= 0 then we do no work */ | |
3217 026 if (b <= 0) \{ | |
3218 027 res = mp_copy (a, c); | |
3219 028 if (d != NULL) \{ | |
3220 029 mp_zero (d); | |
3221 030 \} | |
3222 031 return res; | |
3223 032 \} | |
3224 033 | |
3225 034 if ((res = mp_init (&t)) != MP_OKAY) \{ | |
3226 035 return res; | |
3227 036 \} | |
3228 037 | |
3229 038 /* get the remainder */ | |
3230 039 if (d != NULL) \{ | |
3231 040 if ((res = mp_mod_2d (a, b, &t)) != MP_OKAY) \{ | |
3232 041 mp_clear (&t); | |
3233 042 return res; | |
3234 043 \} | |
3235 044 \} | |
3236 045 | |
3237 046 /* copy */ | |
3238 047 if ((res = mp_copy (a, c)) != MP_OKAY) \{ | |
3239 048 mp_clear (&t); | |
3240 049 return res; | |
3241 050 \} | |
3242 051 | |
3243 052 /* shift by as many digits in the bit count */ | |
3244 053 if (b >= (int)DIGIT_BIT) \{ | |
3245 054 mp_rshd (c, b / DIGIT_BIT); | |
3246 055 \} | |
3247 056 | |
3248 057 /* shift any bit count < DIGIT_BIT */ | |
3249 058 D = (mp_digit) (b % DIGIT_BIT); | |
3250 059 if (D != 0) \{ | |
3251 060 register mp_digit *tmpc, mask, shift; | |
3252 061 | |
3253 062 /* mask */ | |
3254 063 mask = (((mp_digit)1) << D) - 1; | |
3255 064 | |
3256 065 /* shift for lsb */ | |
3257 066 shift = DIGIT_BIT - D; | |
3258 067 | |
3259 068 /* alias */ | |
3260 069 tmpc = c->dp + (c->used - 1); | |
3261 070 | |
3262 071 /* carry */ | |
3263 072 r = 0; | |
3264 073 for (x = c->used - 1; x >= 0; x--) \{ | |
3265 074 /* get the lower bits of this word in a temp */ | |
3266 075 rr = *tmpc & mask; | |
3267 076 | |
3268 077 /* shift the current word and mix in the carry bits from the previous | |
3269 word */ | |
3270 078 *tmpc = (*tmpc >> D) | (r << shift); | |
3271 079 --tmpc; | |
3272 080 | |
3273 081 /* set the carry to the carry bits of the current word found above */ | |
3274 082 r = rr; | |
3275 083 \} | |
3276 084 \} | |
3277 085 mp_clamp (c); | |
3278 086 if (d != NULL) \{ | |
3279 087 mp_exch (&t, d); | |
3280 088 \} | |
3281 089 mp_clear (&t); | |
3282 090 return MP_OKAY; | |
3283 091 \} | |
3284 \end{alltt} | |
3285 \end{small} | |
3286 | |
3287 The implementation of algorithm mp\_div\_2d differs slightly from what the algorithm specifies. The remainder $d$ may be optionally | |
3288 ignored by passing \textbf{NULL} as the pointer to the mp\_int variable. The temporary mp\_int variable $t$ is used to hold the | |
3289 result of the remainder operation until the end. This allows $d$ and $a$ to represent the same mp\_int without modifying $a$ before | |
3290 the quotient is obtained. | |
3291 | |
3292 The remainder of the source code is essentially the same as the source code for mp\_mul\_2d, except that the digits are processed from the leading digit downwards and the carry moves towards the lower digits. | |
3293 | |
3294 \subsection{Remainder of Division by Power of Two} | |
3295 | |
3296 The last algorithm in the series of polynomial basis power of two algorithms is calculating the remainder of division by $2^b$. This | |
3297 algorithm benefits from the fact that in two's complement arithmetic $a \mbox{ (mod }2^b\mbox{)}$ is the same as $a$ AND $2^b - 1$. | |
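
The identity is easy to check on a single machine word. The sketch below is only an illustration for non-negative values; the function name mod\_pow2 is hypothetical, and the shift count is assumed to satisfy $0 \le b < 32$ because shifting a 32-bit value by its full width is undefined in C.

\begin{small}
\begin{verbatim}
#include <stdint.h>

/* a mod 2^b for an unsigned 32-bit value, valid for 0 <= b < 32 */
uint32_t mod_pow2(uint32_t a, unsigned b)
{
    /* (1 << b) - 1 has exactly the low b bits set */
    return a & ((((uint32_t)1) << b) - 1u);
}
\end{verbatim}
\end{small}

Algorithm mp\_mod\_2d applies the same mask, but only to the single digit that straddles bit $b$; digits above it are cleared outright and digits below it are left untouched.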
3298 | |
3299 \begin{figure}[!here] | |
3300 \begin{small} | |
3301 \begin{center} | |
3302 \begin{tabular}{l} | |
3303 \hline Algorithm \textbf{mp\_mod\_2d}. \\ | |
3304 \textbf{Input}. One mp\_int $a$ and an integer $b$ \\ | |
3305 \textbf{Output}. $c \leftarrow a \mbox{ (mod }2^b\mbox{)}$. \\ | |
3306 \hline \\ | |
3307 1. If $b \le 0$ then do \\ | |
3308 \hspace{3mm}1.1 $c \leftarrow 0$ (\textit{mp\_zero}) \\ | |
3309 \hspace{3mm}1.2 Return(\textit{MP\_OKAY}). \\ | |
3310 2. If $b > a.used \cdot lg(\beta)$ then do \\ | |
3311 \hspace{3mm}2.1 $c \leftarrow a$ (\textit{mp\_copy}) \\ | |
3312 \hspace{3mm}2.2 Return the result of step 2.1. \\ | |
3313 3. $c \leftarrow a$ \\ | |
3314 4. If step 3 failed return(\textit{MP\_MEM}). \\ | |
3315 5. for $n$ from $\lceil b / lg(\beta) \rceil$ to $c.used - 1$ do \\ | |
3316 \hspace{3mm}5.1 $c_n \leftarrow 0$ \\ | |
3317 6. $k \leftarrow b \mbox{ (mod }lg(\beta)\mbox{)}$ \\ | |
3318 7. $c_{\lfloor b / lg(\beta) \rfloor} \leftarrow c_{\lfloor b / lg(\beta) \rfloor} \mbox{ (mod }2^{k}\mbox{)}$. \\ | |
3319 8. Clamp excess digits of $c$. (\textit{mp\_clamp}) \\ | |
3320 9. Return(\textit{MP\_OKAY}). \\ | |
3321 \hline | |
3322 \end{tabular} | |
3323 \end{center} | |
3324 \end{small} | |
3325 \caption{Algorithm mp\_mod\_2d} | |
3326 \end{figure} | |
3327 | |
3328 \textbf{Algorithm mp\_mod\_2d.} | |
3329 This algorithm will quickly calculate the value of $a \mbox{ (mod }2^b\mbox{)}$. First if $b$ is less than or equal to zero the | |
3330 result is set to zero. If $b$ is greater than the number of bits in $a$ then it simply copies $a$ to $c$ and returns. Otherwise, $a$ | |
3331 is copied to $c$, leading digits are removed and the remaining leading digit is trimmed to the exact bit count. | |
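
As an illustration, take $lg(\beta) = 28$ and $b = 37$ with a three digit input $a = (a_2, a_1, a_0)_{\beta}$ (\textit{values assumed only for this example}). Then $\lceil 37 / 28 \rceil = 2$, so digit $a_2$ is cleared, and $k = 37 \mbox{ mod } 28 = 9$, so digit $\lfloor 37 / 28 \rfloor = 1$ is reduced modulo $2^9$:

\begin{displaymath}
c = a_0 + \left ( a_1 \mbox{ mod } 2^{9} \right ) \cdot \beta
\end{displaymath}

This keeps exactly the $28 + 9 = 37$ low order bits of $a$.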
3332 | |
3333 \vspace{+3mm}\begin{small} | |
3334 \hspace{-5.1mm}{\bf File}: bn\_mp\_mod\_2d.c | |
3335 \vspace{-3mm} | |
3336 \begin{alltt} | |
3337 016 | |
3338 017 /* calc a value mod 2**b */ | |
3339 018 int | |
3340 019 mp_mod_2d (mp_int * a, int b, mp_int * c) | |
3341 020 \{ | |
3342 021 int x, res; | |
3343 022 | |
3344 023 /* if b is <= 0 then zero the int */ | |
3345 024 if (b <= 0) \{ | |
3346 025 mp_zero (c); | |
3347 026 return MP_OKAY; | |
3348 027 \} | |
3349 028 | |
3350 029 /* if the modulus is larger than the value then return */ | |
3351 030 if (b > (int) (a->used * DIGIT_BIT)) \{ | |
3352 031 res = mp_copy (a, c); | |
3353 032 return res; | |
3354 033 \} | |
3355 034 | |
3356 035 /* copy */ | |
3357 036 if ((res = mp_copy (a, c)) != MP_OKAY) \{ | |
3358 037 return res; | |
3359 038 \} | |
3360 039 | |
3361 040 /* zero digits above the last digit of the modulus */ | |
3362 041 for (x = (b / DIGIT_BIT) + ((b % DIGIT_BIT) == 0 ? 0 : 1); x < c->used; | |
3363 x++) \{ | |
3364 042 c->dp[x] = 0; | |
3365 043 \} | |
3366 044 /* clear the digit that is not completely outside/inside the modulus */ | |
3367 045 c->dp[b / DIGIT_BIT] &= | |
3368 046 (mp_digit) ((((mp_digit) 1) << (((mp_digit) b) % DIGIT_BIT)) - | |
3369 ((mp_digit) 1)); | |
3370 047 mp_clamp (c); | |
3371 048 return MP_OKAY; | |
3372 049 \} | |
3373 \end{alltt} | |
3374 \end{small} | |
3375 | |
3377 | |
3378 \section*{Exercises} | |
3379 \begin{tabular}{cl} | |
3380 $\left [ 3 \right ] $ & Devise an algorithm that performs $a \cdot 2^b$ for generic values of $b$ \\ | |
3381 & in $O(n)$ time. \\ | |
3382 &\\ | |
3383 $\left [ 3 \right ] $ & Devise an efficient algorithm to multiply by small low Hamming \\ | |
3384 & weight values such as $3$, $5$ and $9$. Extend it to handle all values \\ | |
3385 & up to $64$ with a Hamming weight less than three. \\ | |
3386 &\\ | |
3387 $\left [ 2 \right ] $ & Modify the preceding algorithm to handle values of the form \\ | |
3388 & $2^k - 1$ as well. \\ | |
3389 &\\ | |
3390 $\left [ 3 \right ] $ & Using only algorithms mp\_mul\_2, mp\_div\_2 and mp\_add create an \\ | |
3391 & algorithm to multiply two integers in roughly $O(2n^2)$ time for \\ | |
3392 & any $n$-bit input. Note that the time of addition is ignored in the \\ | |
3393 & calculation. \\ | |
3394 & \\ | |
3395 $\left [ 5 \right ] $ & Improve the previous algorithm to have a working time of at most \\ | |
3396 & $O \left (2^{(k-1)}n + \left ({2n^2 \over k} \right ) \right )$ for an appropriate choice of $k$. Again ignore \\ | |
3397 & the cost of addition. \\ | |
3398 & \\ | |
3399 $\left [ 2 \right ] $ & Devise a chart to find optimal values of $k$ for the previous problem \\ | |
3400 & for $n = 64 \ldots 1024$ in steps of $64$. \\ | |
3401 & \\ | |
3402 $\left [ 2 \right ] $ & Using only algorithms mp\_abs and mp\_sub devise another method for \\ | |
3403 & calculating the result of a signed comparison. \\ | |
3404 & | |
3405 \end{tabular} | |
3406 | |
3407 \chapter{Multiplication and Squaring} | |
3408 \section{The Multipliers} | |
3409 For most number theoretic problems including certain public key cryptographic algorithms, the ``multipliers'' form the most important subset of | |
3410 algorithms of any multiple precision integer package. The set of multiplier algorithms includes integer multiplication, squaring and modular reduction | |
3411 where in each of the algorithms single precision multiplication is the dominant operation performed. This chapter will discuss integer multiplication | |
3412 and squaring, leaving modular reductions for the subsequent chapter. | |
3413 | |
3414 The importance of the multiplier algorithms is for the most part driven by the fact that certain popular public key algorithms are based on modular | |
3415 exponentiation, that is computing $d \equiv a^b \mbox{ (mod }c\mbox{)}$ for some arbitrary choice of $a$, $b$ and $c$. During a modular | |
3416 exponentiation the majority\footnote{Roughly speaking a modular exponentiation will spend about 40\% of the time performing modular reductions, | |
3417 35\% of the time performing squaring and 25\% of the time performing multiplications.} of the processor time is spent performing single precision | |
3418 multiplications. | |
3419 | |
3420 For centuries general purpose multiplication has required a lengthy $O(n^2)$ process, whereby each digit of one multiplicand has to be multiplied | |
3421 against every digit of the other multiplicand. Traditional long-hand multiplication is based on this process; while the techniques can differ the | |
3422 overall algorithm used is essentially the same. Only ``recently'' have faster algorithms been studied. First Karatsuba multiplication was discovered in | |
3423 1962. This algorithm can multiply two numbers with considerably fewer single precision multiplications when compared to the long-hand approach. | |
3424 This technique led to the discovery of polynomial basis algorithms and subsequently Fourier Transform based solutions. | |
3425 | |
3426 \section{Multiplication} | |
3427 \subsection{The Baseline Multiplication} | |
3428 \label{sec:basemult} | |
3429 \index{baseline multiplication} | |
3430 Computing the product of two integers in software can be achieved using a trivial adaptation of the standard $O(n^2)$ long-hand multiplication | |
3431 algorithm that school children are taught. The algorithm is considered an $O(n^2)$ algorithm since for two $n$-digit inputs $n^2$ single precision | |
3432 multiplications are required. More specifically for an $m$ and $n$ digit input $m \cdot n$ single precision multiplications are required. To | |
3433 simplify most discussions, it will be assumed that the inputs have a comparable number of digits. | |
3434 | |
3435 The ``baseline multiplication'' algorithm is designed to act as the ``catch-all'' algorithm, only to be used when the faster algorithms cannot be | |
3436 used. This algorithm does not use any particularly interesting optimizations and should ideally be avoided if possible. One important | |
3437 facet of this algorithm is that it has been modified to only produce a certain number of output digits. The importance of this | |
3438 modification will become evident during the discussion of Barrett modular reduction. Recall that for an $n$ and $m$ digit input the product | |
3439 will be at most $n + m$ digits. Therefore, this algorithm can be reduced to a full multiplier by having it produce $n + m$ digits of the product. | |
3440 | |
3441 Recall from sub-section 4.2.2 the definition of $\gamma$ as the number of bits in the type \textbf{mp\_digit}. We shall now extend the variable set to | |
3442 include $\alpha$ which shall represent the number of bits in the type \textbf{mp\_word}. This implies that $2^{\alpha} > 2 \cdot \beta^2$. The | |
3443 constant $\delta = 2^{\alpha - 2lg(\beta)}$ will represent the maximal weight of any column in a product (\textit{see sub-section 5.2.2 for more information}). | |
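
For a concrete sense of scale, assume a configuration with $lg(\beta) = 28$ and $\alpha = 64$ (\textit{one possible choice; other digit sizes lead to other values}). Then

\begin{displaymath}
\delta = 2^{\alpha - 2lg(\beta)} = 2^{64 - 56} = 2^{8} = 256
\end{displaymath}

Each single precision product is less than $\beta^2 = 2^{56}$, so up to $256$ of them can be summed in an mp\_word without reaching $2^{64}$.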
3444 | |
3445 \newpage\begin{figure}[!here] | |
3446 \begin{small} | |
3447 \begin{center} | |
3448 \begin{tabular}{l} | |
3449 \hline Algorithm \textbf{s\_mp\_mul\_digs}. \\ | |
3450 \textbf{Input}. mp\_int $a$, mp\_int $b$ and an integer $digs$ \\ | |
3451 \textbf{Output}. $c \leftarrow \vert a \vert \cdot \vert b \vert \mbox{ (mod }\beta^{digs}\mbox{)}$. \\ | |
3452 \hline \\ | |
3453 1. If min$(a.used, b.used) < \delta$ then do \\ | |
3454 \hspace{3mm}1.1 Calculate $c = \vert a \vert \cdot \vert b \vert$ by the Comba method (\textit{see algorithm~\ref{fig:COMBAMULT}}). \\ | |
3455 \hspace{3mm}1.2 Return the result of step 1.1 \\ | |
3456 \\ | |
3457 Allocate and initialize a temporary mp\_int. \\ | |
3458 2. Init $t$ to be of size $digs$ \\ | |
3459 3. If step 2 failed return(\textit{MP\_MEM}). \\ | |
3460 4. $t.used \leftarrow digs$ \\ | |
3461 \\ | |
3462 Compute the product. \\ | |
3463 5. for $ix$ from $0$ to $a.used - 1$ do \\ | |
3464 \hspace{3mm}5.1 $u \leftarrow 0$ \\ | |
3465 \hspace{3mm}5.2 $pb \leftarrow \mbox{min}(b.used, digs - ix)$ \\ | |
3466 \hspace{3mm}5.3 If $pb < 1$ then goto step 6. \\ | |
3467 \hspace{3mm}5.4 for $iy$ from $0$ to $pb - 1$ do \\ | |
3468 \hspace{6mm}5.4.1 $\hat r \leftarrow t_{iy + ix} + a_{ix} \cdot b_{iy} + u$ \\ | |
3469 \hspace{6mm}5.4.2 $t_{iy + ix} \leftarrow \hat r \mbox{ (mod }\beta\mbox{)}$ \\ | |
3470 \hspace{6mm}5.4.3 $u \leftarrow \lfloor \hat r / \beta \rfloor$ \\ | |
3471 \hspace{3mm}5.5 if $ix + pb < digs$ then do \\ | |
3472 \hspace{6mm}5.5.1 $t_{ix + pb} \leftarrow u$ \\ | |
3473 6. Clamp excess digits of $t$. \\ | |
3474 7. Swap $c$ with $t$ \\ | |
3475 8. Clear $t$ \\ | |
3476 9. Return(\textit{MP\_OKAY}). \\ | |
3477 \hline | |
3478 \end{tabular} | |
3479 \end{center} | |
3480 \end{small} | |
3481 \caption{Algorithm s\_mp\_mul\_digs} | |
3482 \end{figure} | |
3483 | |
3484 \textbf{Algorithm s\_mp\_mul\_digs.} | |
3485 This algorithm computes the unsigned product of two inputs $a$ and $b$, limited to an output precision of $digs$ digits. While it may seem | |
3486 a bit awkward to modify the function from its simple $O(n^2)$ description, the usefulness of partial multipliers will arise in a subsequent | |
3487 algorithm. The algorithm is loosely based on algorithm 14.12 from \cite[pp. 595]{HAC} and is similar to Algorithm M of Knuth \cite[pp. 268]{TAOCPV2}. | |
3488 Algorithm s\_mp\_mul\_digs differs from these cited references since it can produce a variable output precision regardless of the precision of the | |
3489 inputs. | |
3490 | |
3491 The first thing this algorithm checks for is whether a Comba multiplier can be used instead. If the smaller of the two input digit counts | |
3492 is less than $\delta$, then the Comba method is used. After the Comba method is ruled out, the baseline algorithm begins. A | |
3493 temporary mp\_int variable $t$ is used to hold the intermediate result of the product. This allows the algorithm to be used to | |
3494 compute products when either $a = c$ or $b = c$ without overwriting the inputs. | |
3495 | |
3496 All of step 5 is the infamous $O(n^2)$ multiplication loop slightly modified to only produce up to $digs$ digits of output. The $pb$ variable | |
3497 is given the count of digits to read from $b$ inside the nested loop. If $pb < 1$ then no more output digits can be produced and the algorithm | |
3498 will exit the loop. The best way to think of the loops is as a series of $pb \times 1$ multiplications. That is, in each pass of the | |
3499 outer loop $a_{ix}$ is multiplied against $b$ and the result is added (\textit{with an appropriate shift}) to $t$. | |
3500 | |
3501 For example, consider multiplying $576$ by $241$. That is equivalent to computing $10^0(1)(576) + 10^1(4)(576) + 10^2(2)(576)$ which is best | |
3502 visualized in the following table. | |
3503 | |
3504 \begin{figure}[here] | |
3505 \begin{center} | |
3506 \begin{tabular}{|c|c|c|c|c|c|l|} | |
3507 \hline && & 5 & 7 & 6 & \\ | |
3508 \hline $\times$&& & 2 & 4 & 1 & \\ | |
3509 \hline &&&&&&\\ | |
3510 && & 5 & 7 & 6 & $10^0(1)(576)$ \\ | |
3511 &2 & 3 & 6 & 1 & 6 & $10^1(4)(576) + 10^0(1)(576)$ \\ | |
3512 1 & 3 & 8 & 8 & 1 & 6 & $10^2(2)(576) + 10^1(4)(576) + 10^0(1)(576)$ \\ | |
3513 \hline | |
3514 \end{tabular} | |
3515 \end{center} | |
3516 \caption{Long-Hand Multiplication Diagram} | |
3517 \end{figure} | |
3518 | |
3519 Each row of the product is added to the result after being shifted to the left (\textit{multiplied by a power of the radix}) by the appropriate | |
3520 count. That is, in pass $ix$ of the outer loop the product is added starting at the $ix$'th digit of the result. | |
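
The row-by-row accumulation can be sketched in a few lines of C. This is only an illustration under stated assumptions: it works in radix $10$ so that it can be checked against the $576 \cdot 241$ example, digits are stored least significant first, and the helper name \texttt{mul\_digs\_base10} is hypothetical and not part of LibTomMath.

```c
#include <stddef.h>

/* Radix used for this illustration only; LibTomMath uses a power of two. */
#define RADIX 10u

/*
 * Hypothetical helper: multiply the na-digit number a by the nb-digit
 * number b (least significant digit first) and store the low digs digits
 * of the product in t, which must have room for digs digits.
 */
static void mul_digs_base10(const unsigned *a, size_t na,
                            const unsigned *b, size_t nb,
                            unsigned *t, size_t digs)
{
    size_t ix, iy, pb;

    for (ix = 0; ix < digs; ix++) {
        t[ix] = 0;
    }

    for (ix = 0; ix < na && ix < digs; ix++) {
        unsigned u = 0;                      /* carry for this row */

        /* digits of b whose products still land below position digs */
        pb = (nb < digs - ix) ? nb : digs - ix;

        for (iy = 0; iy < pb; iy++) {
            unsigned r = t[ix + iy] + a[ix] * b[iy] + u;
            t[ix + iy] = r % RADIX;          /* new column digit */
            u = r / RADIX;                   /* carry into the next column */
        }

        /* store the final carry only if it lies below digs */
        if (ix + pb < digs) {
            t[ix + pb] = u;
        }
    }
}
```

Calling the helper with the digits of $576$ and $241$ and $digs = 6$ yields the digits of $138816$, while $digs = 3$ yields only the low three digits $816$.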
3521 | |
3522 Step 5.4.1 introduces the hat symbol (\textit{e.g. $\hat r$}) which represents a double precision variable. The multiplication on that step | |
3523 is assumed to be a double wide output single precision multiplication. That is, two single precision variables are multiplied to produce a | |
3524 double precision result. The step is somewhat optimized from a long-hand multiplication algorithm because the carry from the addition in step | |
3525 5.4.1 is propagated through the nested loop. If the carry was not propagated immediately it would overflow the single precision digit | |
3526 $t_{ix+iy}$ and the result would be lost. | |
3527 | |
3528 At step 5.5 the nested loop is finished and any carry that was left over should be forwarded. The carry does not have to be added to the $ix+pb$'th | |
3529 digit since that digit is assumed to be zero at this point. However, if $ix + pb \ge digs$ the carry is not set as it would make the result | |
3530 exceed the precision requested. | |
3531 | |
3532 \vspace{+3mm}\begin{small} | |
3533 \hspace{-5.1mm}{\bf File}: bn\_s\_mp\_mul\_digs.c | |
3534 \vspace{-3mm} | |
3535 \begin{alltt} | |
3536 016 | |
3537 017 /* multiplies |a| * |b| and only computes upto digs digits of result | |
3538 018 * HAC pp. 595, Algorithm 14.12 Modified so you can control how | |
3539 019 * many digits of output are created. | |
3540 020 */ | |
3541 021 int | |
3542 022 s_mp_mul_digs (mp_int * a, mp_int * b, mp_int * c, int digs) | |
3543 023 \{ | |
3544 024 mp_int t; | |
3545 025 int res, pa, pb, ix, iy; | |
3546 026 mp_digit u; | |
3547 027 mp_word r; | |
3548 028 mp_digit tmpx, *tmpt, *tmpy; | |
3549 029 | |
3550 030 /* can we use the fast multiplier? */ | |
3551 031 if (((digs) < MP_WARRAY) && | |
3552 032 MIN (a->used, b->used) < | |
3553 033 (1 << ((CHAR_BIT * sizeof (mp_word)) - (2 * DIGIT_BIT)))) \{ | |
3554 034 return fast_s_mp_mul_digs (a, b, c, digs); | |
3555 035 \} | |
3556 036 | |
3557 037 if ((res = mp_init_size (&t, digs)) != MP_OKAY) \{ | |
3558 038 return res; | |
3559 039 \} | |
3560 040 t.used = digs; | |
3561 041 | |
3562 042 /* compute the digits of the product directly */ | |
3563 043 pa = a->used; | |
3564 044 for (ix = 0; ix < pa; ix++) \{ | |
3565 045 /* set the carry to zero */ | |
3566 046 u = 0; | |
3567 047 | |
3568 048 /* limit ourselves to making digs digits of output */ | |
3569 049 pb = MIN (b->used, digs - ix); | |
3570 050 | |
3571 051 /* setup some aliases */ | |
3572 052 /* copy of the digit from a used within the nested loop */ | |
3573 053 tmpx = a->dp[ix]; | |
3574 054 | |
3575 055 /* an alias for the destination shifted ix places */ | |
3576 056 tmpt = t.dp + ix; | |
3577 057 | |
3578 058 /* an alias for the digits of b */ | |
3579 059 tmpy = b->dp; | |
3580 060 | |
3581 061 /* compute the columns of the output and propagate the carry */ | |
3582 062 for (iy = 0; iy < pb; iy++) \{ | |
3583 063 /* compute the column as a mp_word */ | |
3584 064 r = ((mp_word)*tmpt) + | |
3585 065 ((mp_word)tmpx) * ((mp_word)*tmpy++) + | |
3586 066 ((mp_word) u); | |
3587 067 | |
3588 068 /* the new column is the lower part of the result */ | |
3589 069 *tmpt++ = (mp_digit) (r & ((mp_word) MP_MASK)); | |
3590 070 | |
3591 071 /* get the carry word from the result */ | |
3592 072 u = (mp_digit) (r >> ((mp_word) DIGIT_BIT)); | |
3593 073 \} | |
3594 074 /* set carry if it is placed below digs */ | |
3595 075 if (ix + iy < digs) \{ | |
3596 076 *tmpt = u; | |
3597 077 \} | |
3598 078 \} | |
3599 079 | |
3600 080 mp_clamp (&t); | |
3601 081 mp_exch (&t, c); | |
3602 082 | |
3603 083 mp_clear (&t); | |
3604 084 return MP_OKAY; | |
3605 085 \} | |
3606 \end{alltt} | |
3607 \end{small} | |
3608 | |
3609 Lines 31 to 35 determine if the Comba method can be used first. The conditions for using the Comba routine are that min$(a.used, b.used) < \delta$ and | |
3610 the number of digits of output is less than \textbf{MP\_WARRAY}. This new constant is used to control | |
3611 the stack usage in the Comba routines. By default it is set to $\delta$ but can be reduced when memory is at a premium. | |
3612 | |
3613 Of particular importance is the calculation of the $ix+iy$'th column on lines 64, 65 and 66. Note how all of the | |
3614 variables are cast to the type \textbf{mp\_word}, which is also the type of variable $\hat r$. That is to ensure that double precision operations | |
3615 are used instead of single precision. The multiplication on line 65 makes use of a specific GCC optimizer behaviour. At the outset it looks like | |
3616 the compiler will have to use a double precision multiplication to produce the result required. Such an operation would be horribly slow on most | |
3617 processors and drag this routine to a crawl. However, GCC is smart enough to realize that double wide output single precision multipliers can be used. For | |
3618 example, the instruction ``MUL'' on the x86 processor can multiply two 32-bit values and produce a 64-bit result. | |
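
The effect of the casts can be seen in isolation with the following sketch. The types \texttt{digit\_t} and \texttt{word\_t} are illustrative 32-bit and 64-bit stand-ins assumed for this example (they are not the library's own definitions), and both function names are hypothetical.

```c
#include <stdint.h>

/*
 * Illustrative stand-ins for a single precision digit and a double
 * precision word; assumed here to be 32 and 64 bits wide.
 */
typedef uint32_t digit_t;
typedef uint64_t word_t;

/* Compute t + x * y + u without losing the upper half of the product. */
static word_t mul_add_column(digit_t t, digit_t x, digit_t y, digit_t u)
{
    /* casting before multiplying makes the product 64 bits wide */
    return (word_t)t + (word_t)x * (word_t)y + (word_t)u;
}

/* The same product without the casts is truncated to 32 bits. */
static digit_t mul_truncated(digit_t x, digit_t y)
{
    return x * y;
}
```

Even with all four operands at their maximum value the sum in \texttt{mul\_add\_column} just fits in 64 bits, which is why a single double precision variable is enough to hold the column and its carry.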
3619 | |
3620 \subsection{Faster Multiplication by the ``Comba'' Method} | |
3621 | |
3622 One of the huge drawbacks of the ``baseline'' algorithms is that at the $O(n^2)$ level the carry must be computed and propagated upwards. This | |
3623 makes the nested loop very sequential and hard to unroll and implement in parallel. The ``Comba'' \cite{COMBA} method is named after the little known | |
3624 (\textit{in cryptographic venues}) Paul G. Comba who described a method of implementing fast multipliers that do not require nested | |
3625 carry fixup operations. As an interesting aside it seems that Paul Barrett describes a similar technique in | |
3626 his 1986 paper \cite{BARRETT} written five years before. | |
3627 | |
3628 At the heart of the Comba technique is once again the long-hand algorithm. Except in this case a slight twist is placed on how | |
3629 the columns of the result are produced. In the standard long-hand algorithm rows of products are produced then added together to form the | |
3630 final result. In the baseline algorithm the columns are added together after each iteration to get the result instantaneously. | |
3631 | |
3632 In the Comba algorithm the columns of the result are produced entirely independently of each other. That is, at the $O(n^2)$ level a | |
3633 simple multiplication and addition step is performed. The carries of the columns are propagated after the nested loop to reduce the amount | |
3634 of work required. Succinctly, the first step of the algorithm is to compute the product vector $\vec x$ as follows. | |
3635 | |
3636 \begin{equation} | |
3637 \vec x_n = \sum_{i+j = n} a_ib_j, \forall n \in \lbrace 0, 1, 2, \ldots, i + j \rbrace | |
3638 \end{equation} | |
3639 | |
3640 Where $\vec x_n$ is the $n$'th column of the output vector. Consider the following example which computes the vector $\vec x$ for the multiplication | |
3641 of $576$ and $241$. | |
3642 | |
3643 \newpage\begin{figure}[here] | |
3644 \begin{small} | |
3645 \begin{center} | |
3646 \begin{tabular}{|c|c|c|c|c|c|} | |
3647 \hline & & 5 & 7 & 6 & First Input\\ | |
3648 \hline $\times$ & & 2 & 4 & 1 & Second Input\\ | |
3649 \hline & & $1 \cdot 5 = 5$ & $1 \cdot 7 = 7$ & $1 \cdot 6 = 6$ & First pass \\ | |
3650 & $4 \cdot 5 = 20$ & $4 \cdot 7+5=33$ & $4 \cdot 6+7=31$ & 6 & Second pass \\ | |
3651 $2 \cdot 5 = 10$ & $2 \cdot 7 + 20 = 34$ & $2 \cdot 6+33=45$ & 31 & 6 & Third pass \\ | |
3652 \hline 10 & 34 & 45 & 31 & 6 & Final Result \\ | |
3653 \hline | |
3654 \end{tabular} | |
3655 \end{center} | |
3656 \end{small} | |
3657 \caption{Comba Multiplication Diagram} | |
3658 \end{figure} | |
3659 | |
3660 At this point the vector $\vec x = \left < 10, 34, 45, 31, 6 \right >$ is the result of the first step of the Comba multiplier. | |
3661 Now the columns must be fixed by propagating the carry upwards. The resultant vector will have one extra dimension over the input vector which is | |
3662 equivalent to adding a leading zero digit. | |
3663 | |
3664 \begin{figure}[!here] | |
3665 \begin{small} | |
3666 \begin{center} | |
3667 \begin{tabular}{l} | |
3668 \hline Algorithm \textbf{Comba Fixup}. \\ | |
3669 \textbf{Input}. Vector $\vec x$ of dimension $k$ \\ | |
3670 \textbf{Output}. Vector $\vec x$ such that the carries have been propagated. \\ | |
3671 \hline \\ | |
3672 1. for $n$ from $0$ to $k - 1$ do \\ | |
3673 \hspace{3mm}1.1 $\vec x_{n+1} \leftarrow \vec x_{n+1} + \lfloor \vec x_{n}/\beta \rfloor$ \\ | |
3674 \hspace{3mm}1.2 $\vec x_{n} \leftarrow \vec x_{n} \mbox{ (mod }\beta\mbox{)}$ \\ | |
3675 2. Return($\vec x$). \\ | |
3676 \hline | |
3677 \end{tabular} | |
3678 \end{center} | |
3679 \end{small} | |
3680 \caption{Algorithm Comba Fixup} | |
3681 \end{figure} | |
3682 | |
3683 With that algorithm and $k = 5$ and $\beta = 10$ the following vector is produced $\vec x= \left < 1, 3, 8, 8, 1, 6 \right >$. In this case | |
3684 $241 \cdot 576$ is in fact $138816$ and the procedure succeeded. If the algorithm is correct and, as will be demonstrated shortly, more | |
3685 efficient than the baseline algorithm, why not simply always use this algorithm? | |
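
The two stages of the example can be reproduced with a short C sketch. It assumes radix $10$ and digit arrays stored least significant digit first (the reverse of the order in which the vectors are printed above); the names \texttt{comba\_columns} and \texttt{comba\_fixup} are hypothetical.

```c
#include <stddef.h>

/*
 * Hypothetical helper: form the column sums x[n] = sum of a[i] * b[j]
 * over all i + j = n. The array x must have room for na + nb entries;
 * the top entry stays zero here and receives the last carry later.
 */
static void comba_columns(const unsigned *a, size_t na,
                          const unsigned *b, size_t nb, unsigned *x)
{
    size_t i, j;

    for (i = 0; i < na + nb; i++) {
        x[i] = 0;
    }
    for (i = 0; i < na; i++) {
        for (j = 0; j < nb; j++) {
            x[i + j] += a[i] * b[j];    /* no carry handling in this loop */
        }
    }
}

/*
 * Comba fixup for k columns: push the excess of each column into the
 * next one. The array x must hold k + 1 entries.
 */
static void comba_fixup(unsigned *x, size_t k, unsigned radix)
{
    size_t n;

    for (n = 0; n < k; n++) {
        x[n + 1] += x[n] / radix;
        x[n] %= radix;
    }
}
```

For the inputs $576$ and $241$ the first function produces the columns $6, 31, 45, 34, 10$ (least significant first) and the fixup with $k = 5$ turns them into the digits of $138816$.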
3686 | |
3687 \subsubsection{Column Weight.} | |
3688 At the nested $O(n^2)$ level the Comba method adds the product of two single precision variables to each column of the output | |
3689 independently. A serious obstacle arises if the carry is lost due to lack of precision before the algorithm has a chance to fix | |
3690 the carries. For example, in the multiplication of two three-digit numbers the third column of output will be the sum of | |
3691 three single precision multiplications. If the precision of the accumulator for the output digits is less than $3 \cdot (\beta - 1)^2$ then | |
3692 an overflow can occur and the carry information will be lost. For any $m$ and $n$ digit inputs the maximum weight of any column is | |
3693 min$(m, n)$ which is fairly obvious. | |
3694 | |
3695 The maximum number of terms in any column of a product is known as the ``column weight'' and strictly governs when the algorithm can be used. Recall | |
3696 from earlier that a double precision type has $\alpha$ bits of resolution and a single precision digit has $lg(\beta)$ bits of precision. Given these | |
3697 two quantities we must not violate the following inequality. | |
3698 | |
3699 \begin{equation} | |
3700 k \cdot \left (\beta - 1 \right )^2 < 2^{\alpha} | |
3701 \end{equation} | |
3702 | |
3703 Which reduces to | |
3704 | |
3705 \begin{equation} | |
3706 k \cdot \left ( \beta^2 - 2\beta + 1 \right ) < 2^{\alpha} | |
3707 \end{equation} | |
3708 | |
3709 Let $\rho = lg(\beta)$ represent the number of bits in a single precision digit. By further re-arrangement of the equation the final solution is | |
3710 found. | |
3711 | |
3712 \begin{equation} | |
3713 k < {{2^{\alpha}} \over {\left (2^{2\rho} - 2^{\rho + 1} + 1 \right )}} | |
3714 \end{equation} | |
3715 | |
3716 The defaults for LibTomMath are $\beta = 2^{28}$ and $\alpha = 64$ which means that $k$ is bounded by $k < 257$. In this configuration | |
3717 the smaller input may not have more than $256$ digits if the Comba method is to be used. This is quite satisfactory for most applications since | |
3718 $256$ digits would allow for numbers in the range of $0 \le x < 2^{7168}$ which is much larger than most public key cryptographic algorithms require. | |
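
The bound can be checked numerically. The following sketch assumes a 64-bit double precision type and computes the largest $k$ for a given digit size $\rho$; the function name \texttt{max\_column\_weight} is hypothetical.

```c
#include <stdint.h>

/*
 * Hypothetical helper: largest column weight k that satisfies
 * k * (2^rho - 1)^2 < 2^64, that is, the bound for a 64-bit accumulator.
 * Valid for 1 <= rho <= 32.
 */
static uint64_t max_column_weight(unsigned rho)
{
    uint64_t digit_max = (UINT64_C(1) << rho) - 1;   /* beta - 1 */
    uint64_t square = digit_max * digit_max;         /* (beta - 1)^2 */

    /* k * square <= 2^64 - 1 is the same condition as k * square < 2^64 */
    return UINT64_MAX / square;
}
```

For $\rho = 28$ the function returns $256$, in agreement with the bound $k < 257$.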
3719 | |
3720 \newpage\begin{figure}[!here] | |
3721 \begin{small} | |
3722 \begin{center} | |
3723 \begin{tabular}{l} | |
3724 \hline Algorithm \textbf{fast\_s\_mp\_mul\_digs}. \\ | |
3725 \textbf{Input}. mp\_int $a$, mp\_int $b$ and an integer $digs$ \\ | |
3726 \textbf{Output}. $c \leftarrow \vert a \vert \cdot \vert b \vert \mbox{ (mod }\beta^{digs}\mbox{)}$. \\ | |
3727 \hline \\ | |
3728 Place an array of \textbf{MP\_WARRAY} double precision digits named $\hat W$ on the stack. \\ | |
3729 1. If $c.alloc < digs$ then grow $c$ to $digs$ digits. (\textit{mp\_grow}) \\ | |
3730 2. If step 1 failed return(\textit{MP\_MEM}).\\ | |
3731 \\ | |
3732 Zero the temporary array $\hat W$. \\ | |
3733 3. for $n$ from $0$ to $digs - 1$ do \\ | |
3734 \hspace{3mm}3.1 $\hat W_n \leftarrow 0$ \\ | |
3735 \\ | |
3736 Compute the columns. \\ | |
3737 4. for $ix$ from $0$ to $a.used - 1$ do \\ | |
3738 \hspace{3mm}4.1 $pb \leftarrow \mbox{min}(b.used, digs - ix)$ \\ | |
3739 \hspace{3mm}4.2 If $pb < 1$ then goto step 5. \\ | |
3740 \hspace{3mm}4.3 for $iy$ from $0$ to $pb - 1$ do \\ | |
3741 \hspace{6mm}4.3.1 $\hat W_{ix+iy} \leftarrow \hat W_{ix+iy} + a_{ix}b_{iy}$ \\ | |
3742 \\ | |
3743 Propagate the carries upwards. \\ | |
3744 5. $oldused \leftarrow c.used$ \\ | |
3745 6. $c.used \leftarrow digs$ \\ | |
3746 7. If $digs > 1$ then do \\ | |
3747 \hspace{3mm}7.1. for $ix$ from $1$ to $digs - 1$ do \\ | |
3748 \hspace{6mm}7.1.1 $\hat W_{ix} \leftarrow \hat W_{ix} + \lfloor \hat W_{ix-1} / \beta \rfloor$ \\ | |
3749 \hspace{6mm}7.1.2 $c_{ix - 1} \leftarrow \hat W_{ix - 1} \mbox{ (mod }\beta\mbox{)}$ \\ | |
3750 8. else do \\ | |
3751 \hspace{3mm}8.1 $ix \leftarrow 0$ \\ | |
3752 9. $c_{ix} \leftarrow \hat W_{ix} \mbox{ (mod }\beta\mbox{)}$ \\ | |
3753 \\ | |
3754 Zero excess digits. \\ | |
3755 10. If $digs < oldused$ then do \\ | |
3756 \hspace{3mm}10.1 for $n$ from $digs$ to $oldused - 1$ do \\ | |
3757 \hspace{6mm}10.1.1 $c_n \leftarrow 0$ \\ | |
3758 11. Clamp excessive digits of $c$. (\textit{mp\_clamp}) \\ | |
3759 12. Return(\textit{MP\_OKAY}). \\ | |
3760 \hline | |
3761 \end{tabular} | |
3762 \end{center} | |
3763 \end{small} | |
3764 \caption{Algorithm fast\_s\_mp\_mul\_digs} | |
3765 \label{fig:COMBAMULT} | |
3766 \end{figure} | |
3767 | |
3768 \textbf{Algorithm fast\_s\_mp\_mul\_digs.} | |
3769 This algorithm performs the unsigned multiplication of $a$ and $b$ using the Comba method limited to $digs$ digits of precision. The algorithm | |
3770 essentially performs the same calculation as algorithm s\_mp\_mul\_digs, just much faster. | |
3771 | |
3772 The array $\hat W$ is meant to be on the stack when the algorithm is used. The size of the array does not change which is ideal. Note also that | |
3773 unlike algorithm s\_mp\_mul\_digs no temporary mp\_int is required since the result is calculated directly in $\hat W$. | |
3774 | |
3775 The $O(n^2)$ loop on step four is where the Comba method's advantages begin to show through in comparison to the baseline algorithm. The lack of | |
3776 a carry variable or propagation in this loop allows the loop to be performed with only single precision multiplication and additions. Now that each | |
3777 iteration of the inner loop can be performed independent of the others the inner loop can be performed with a high level of parallelism. | |
3778 | |
3779 To measure the benefits of the Comba method over the baseline method consider the number of operations that are required. If the | |
3780 cost in terms of time of a multiply and addition is $p$ and the cost of a carry propagation is $q$ then a baseline multiplication would require | |
3781 $O \left ((p + q)n^2 \right )$ time to multiply two $n$-digit numbers. The Comba method requires only $O(pn^2 + qn)$ time. In practice, however, | |
3782 the speed increase is actually much larger. With $O(n)$ space the algorithm can be reduced to $O(pn + qn)$ time by implementing the $n$ multiply | |
3783 and addition operations in the nested loop in parallel. | |
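
As a rough illustration of the two estimates, take $n = 64$ (an arbitrary choice) and ignore constant factors. The baseline method pays for a carry step in every inner iteration, while the Comba method pays for a number of carry steps proportional to $n$.

\begin{eqnarray}
(p + q)n^2 = 4096p + 4096q \nonumber \\
pn^2 + qn = 4096p + 64q
\end{eqnarray}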
3784 | |
3785 \vspace{+3mm}\begin{small} | |
3786 \hspace{-5.1mm}{\bf File}: bn\_fast\_s\_mp\_mul\_digs.c | |
3787 \vspace{-3mm} | |
3788 \begin{alltt} | |
3789 016 | |
3790 017 /* Fast (comba) multiplier | |
3791 018 * | |
3792 019 * This is the fast column-array [comba] multiplier. It is | |
3793 020 * designed to compute the columns of the product first | |
3794 021 * then handle the carries afterwards. This has the effect | |
3795 022 * of making the nested loops that compute the columns very | |
3796 023 * simple and schedulable on super-scalar processors. | |
3797 024 * | |
3798 025 * This has been modified to produce a variable number of | |
3799 026 * digits of output so if say only a half-product is required | |
3800 027 * you don't have to compute the upper half (a feature | |
3801 028 * required for fast Barrett reduction). | |
3802 029 * | |
3803 030 * Based on Algorithm 14.12 on pp.595 of HAC. | |
3804 031 * | |
3805 032 */ | |
3806 033 int | |
3807 034 fast_s_mp_mul_digs (mp_int * a, mp_int * b, mp_int * c, int digs) | |
3808 035 \{ | |
3809 036 int olduse, res, pa, ix; | |
3810 037 mp_word W[MP_WARRAY]; | |
3811 038 | |
3812 039 /* grow the destination as required */ | |
3813 040 if (c->alloc < digs) \{ | |
3814 041 if ((res = mp_grow (c, digs)) != MP_OKAY) \{ | |
3815 042 return res; | |
3816 043 \} | |
3817 044 \} | |
3818 045 | |
3819 046 /* clear temp buf (the columns) */ | |
3820 047 memset (W, 0, sizeof (mp_word) * digs); | |
3821 048 | |
3822 049 /* calculate the columns */ | |
3823 050 pa = a->used; | |
3824 051 for (ix = 0; ix < pa; ix++) \{ | |
3825 052 /* this multiplier has been modified to allow you to | |
3826 053 * control how many digits of output are produced. | |
3827 054 * So at most we want to make upto "digs" digits of output. | |
3828 055 * | |
3829 056 * this adds products to distinct columns (at ix+iy) of W | |
3830 057 * note that each step through the loop is not dependent on | |
3831 058 * the previous which means the compiler can easily unroll | |
3832 059 * the loop without scheduling problems | |
3833 060 */ | |
3834 061 \{ | |
3835 062 register mp_digit tmpx, *tmpy; | |
3836 063 register mp_word *_W; | |
3837 064 register int iy, pb; | |
3838 065 | |
3839 066 /* alias for the the word on the left e.g. A[ix] * A[iy] */ | |
3840 067 tmpx = a->dp[ix]; | |
3841 068 | |
3842 069 /* alias for the right side */ | |
3843 070 tmpy = b->dp; | |
3844 071 | |
3845 072 /* alias for the columns, each step through the loop adds a new | |
3846 073 term to each column | |
3847 074 */ | |
3848 075 _W = W + ix; | |
3849 076 | |
3850 077 /* the number of digits is limited by their placement. E.g. | |
3851 078 we avoid multiplying digits that will end up above the # of | |
3852 079 digits of precision requested | |
3853 080 */ | |
3854 081 pb = MIN (b->used, digs - ix); | |
3855 082 | |
3856 083 for (iy = 0; iy < pb; iy++) \{ | |
3857 084 *_W++ += ((mp_word)tmpx) * ((mp_word)*tmpy++); | |
3858 085 \} | |
3859 086 \} | |
3860 087 | |
3861 088 \} | |
3862 089 | |
3863 090 /* setup dest */ | |
3864 091 olduse = c->used; | |
3865 092 c->used = digs; | |
3866 093 | |
3867 094 \{ | |
3868 095 register mp_digit *tmpc; | |
3869 096 | |
3870 097 /* At this point W[] contains the sums of each column. To get the | |
3871 098 * correct result we must take the extra bits from each column and | |
3872 099 * carry them down | |
3873 100 * | |
3874 101 * Note that while this adds extra code to the multiplier it | |
3875 102 * saves time since the carry propagation is removed from the | |
3876 103 * above nested loop.This has the effect of reducing the work | |
3877 104 * from N*(N+N*c)==N**2 + c*N**2 to N**2 + N*c where c is the | |
3878 105 * cost of the shifting. On very small numbers this is slower | |
3879 106 * but on most cryptographic size numbers it is faster. | |
3880 107 * | |
3881 108 * In this particular implementation we feed the carries from | |
3882 109 * behind which means when the loop terminates we still have one | |
3883 110 * last digit to copy | |
3884 111 */ | |
3885 112 tmpc = c->dp; | |
3886 113 for (ix = 1; ix < digs; ix++) \{ | |
3887 114 /* forward the carry from the previous temp */ | |
3888 115 W[ix] += (W[ix - 1] >> ((mp_word) DIGIT_BIT)); | |
3889 116 | |
3890 117 /* now extract the previous digit [below the carry] */ | |
3891 118 *tmpc++ = (mp_digit) (W[ix - 1] & ((mp_word) MP_MASK)); | |
3892 119 \} | |
3893 120 /* fetch the last digit */ | |
3894 121 *tmpc++ = (mp_digit) (W[digs - 1] & ((mp_word) MP_MASK)); | |
3895 122 | |
3896 123 /* clear unused digits [that existed in the old copy of c] */ | |
3897 124 for (; ix < olduse; ix++) \{ | |
3898 125 *tmpc++ = 0; | |
3899 126 \} | |
3900 127 \} | |
3901 128 mp_clamp (c); | |
3902 129 return MP_OKAY; | |
3903 130 \} | |
3904 \end{alltt} | |
3905 \end{small} | |
3906 | |
3907 The memset on line 47 clears the initial $\hat W$ array to zero in a single step. Like the slower baseline multiplication | |
3908 implementation a series of aliases (\textit{lines 67, 70 and 75}) are used to simplify the inner $O(n^2)$ loop. | |
3909 In this case a new alias $\_\hat W$ has been added which refers to the double precision columns offset by $ix$ in each pass. | |
3910 | |
3911 The inner loop on lines 83, 84 and 85 is where the algorithm will spend the majority of the time, which is why it has been | |
3912 stripped to the bones of any extra baggage\footnote{Hence the pointer aliases.}. On x86 processors the multiplication and additions amount to at the | |
3913 very least five instructions (\textit{two loads, two additions, one multiply}) while on the ARMv4 processors they amount to only three | |
3914 (\textit{one load, one store, one multiply-add}). For both the x86 and ARMv4 processors the GCC compiler does a good job of unrolling the loop | |
3915 and scheduling the instructions so there are very few dependency stalls. | |
3916 | |
3917 In theory the difference between the baseline and Comba algorithms is a mere $O(qn)$ time difference. However, in the $O(n^2)$ nested loop of the | |
3918 baseline method there are dependency stalls as the algorithm must wait for the multiplier to finish before propagating the carry to the next | |
3919 digit. As a result fewer of the often multiple execution units\footnote{The AMD Athlon has three execution units and the Intel P4 has four.} can | |
3920 be simultaneously used. | |
3921 | |
3922 \subsection{Polynomial Basis Multiplication} | |
3923 To break the $O(n^2)$ barrier in multiplication requires a completely different look at integer multiplication. In the following algorithms | |
3924 the use of polynomial basis representation for two integers $a$ and $b$ as $f(x) = \sum_{i=0}^{n} a_i x^i$ and | |
3925 $g(x) = \sum_{i=0}^{n} b_i x^i$ respectively, is required. In this system both $f(x)$ and $g(x)$ have $n + 1$ terms and are of the $n$'th degree. | |
3926 | |
3927 The product $a \cdot b \equiv f(x)g(x)$ is the polynomial $W(x) = \sum_{i=0}^{2n} w_i x^i$. The coefficients $w_i$ will | |
3928 directly yield the desired product when $\beta$ is substituted for $x$. The direct solution to solve for the $2n + 1$ coefficients | |
3929 requires $O(n^2)$ time and would in practice be slower than the Comba technique. | |
3930 | |
3931 However, numerical analysis theory indicates that only $2n + 1$ distinct points in $W(x)$ are required to determine the values of the $2n + 1$ unknown | |
3932 coefficients. This means by finding $\zeta_y = W(y)$ for $2n + 1$ small values of $y$ the coefficients of $W(x)$ can be found with | |
3933 Gaussian elimination. This technique is also occasionally referred to as the \textit{interpolation technique} since in | |
3934 effect an interpolation based on $2n + 1$ points will yield a polynomial equivalent to $W(x)$. | |
3935 | |
3936 The coefficients of the polynomial $W(x)$ are unknown which makes finding $W(y)$ for any value of $y$ impossible. However, since | |
3937 $W(x) = f(x)g(x)$ the equivalent $\zeta_y = f(y) g(y)$ can be used in its place. The benefit of this technique stems from the | |
3938 fact that $f(y)$ and $g(y)$ are much smaller than either $a$ or $b$ respectively. As a result finding the $2n + 1$ relations required | |
3939 by multiplying $f(y)g(y)$ involves multiplying integers that are much smaller than either of the inputs. | |
3940 | |
3941 When picking points to gather relations there are always three obvious points to choose, $y = 0, 1$ and $ \infty$. The $\zeta_0$ term | |
3942 is simply the product $W(0) = w_0 = a_0 \cdot b_0$. The $\zeta_1$ term is the product | |
3943 $W(1) = \left (\sum_{i = 0}^{n} a_i \right ) \left (\sum_{i = 0}^{n} b_i \right )$. The third point $\zeta_{\infty}$ is less obvious but rather | |
3944 simple to explain. The $2n + 1$'th coefficient of $W(x)$ is numerically equivalent to the most significant column in an integer multiplication. | |
3945 The point at $\infty$ is used symbolically to represent the most significant column, that is $W(\infty) = w_{2n} = a_nb_n$. Note that the | |
3946 points at $y = 0$ and $\infty$ yield the coefficients $w_0$ and $w_{2n}$ directly. | |
3947 | |
3948 If more points are required they should be of small values and powers of two such as $2^q$ and the related \textit{mirror points} | |
3949 $\left (2^q \right )^{2n} \cdot \zeta_{2^{-q}}$ for small values of $q$. The term ``mirror point'' stems from the fact that | |
3950 $\left (2^q \right )^{2n} \cdot \zeta_{2^{-q}}$ can be calculated in the exact opposite fashion as $\zeta_{2^q}$. For | |
3951 example, when $n = 2$ and $q = 1$ the following two equations are equivalent to the point $\zeta_{2}$ and its mirror. | |
3952 | |
3953 \begin{eqnarray} | |
3954 \zeta_{2} = f(2)g(2) = (4a_2 + 2a_1 + a_0)(4b_2 + 2b_1 + b_0) \nonumber \\ | |
3955 16 \cdot \zeta_{1 \over 2} = 4f({1\over 2}) \cdot 4g({1 \over 2}) = (a_2 + 2a_1 + 4a_0)(b_2 + 2b_1 + 4b_0) | |
3956 \end{eqnarray} | |
3957 | |
3958 Using such points will allow the values of $f(y)$ and $g(y)$ to be independently calculated using only left shifts. For example, when $n = 2$ the | |
3959 polynomial $f(2^q)$ is equal to $2^q((2^qa_2) + a_1) + a_0$. This technique of polynomial representation is known as Horner's method. | |
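
A minimal sketch of both evaluations follows. It assumes small coefficients held in machine words so that nothing overflows (a real implementation would operate on multiple precision integers), and the names \texttt{eval\_pow2} and \texttt{eval\_mirror} are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Hypothetical helper: evaluate f(2^q) for the n + 1 coefficients
 * a[0..n] using Horner's method, so only left shifts and additions
 * are needed. Assumes the result fits in 64 bits.
 */
static uint64_t eval_pow2(const uint32_t *a, size_t n, unsigned q)
{
    uint64_t acc = a[n];
    size_t i;

    for (i = n; i-- > 0; ) {
        acc = (acc << q) + a[i];
    }
    return acc;
}

/*
 * Mirror point: (2^q)^n * f(2^-q), which is the same evaluation with
 * the coefficients visited in the opposite order.
 */
static uint64_t eval_mirror(const uint32_t *a, size_t n, unsigned q)
{
    uint64_t acc = a[0];
    size_t i;

    for (i = 1; i <= n; i++) {
        acc = (acc << q) + a[i];
    }
    return acc;
}
```

With $n = 2$ and $q = 1$ the first function computes $4a_2 + 2a_1 + a_0$ and the second computes $a_2 + 2a_1 + 4a_0$, matching the two factors shown for $\zeta_2$ and its mirror.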
3960 | |
3961 As a general rule of the algorithm when the inputs are split into $n$ parts each there are $2n - 1$ multiplications. Each multiplication is of | |
3962 multiplicands that have $n$ times fewer digits than the inputs. The asymptotic running time of this algorithm is | |
3963 $O \left ( k^{lg_n(2n - 1)} \right )$ for $k$ digit inputs (\textit{assuming they have the same number of digits}). Figure~\ref{fig:exponent} | |
3964 summarizes the exponents for various values of $n$. | |
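
The exponent is an ordinary logarithm to the base $n$ and can be checked with a change of base. For splits into two and three parts the values, rounded to six places, are as follows.

\begin{eqnarray}
lg_2(3) = {{\ln 3} \over {\ln 2}} \approx 1.584963 \nonumber \\
lg_3(5) = {{\ln 5} \over {\ln 3}} \approx 1.464974
\end{eqnarray}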
3965 | |
3966 \begin{figure} | |
3967 \begin{center} | |
3968 \begin{tabular}{|c|c|c|} | |
3969 \hline \textbf{Split into $n$ Parts} & \textbf{Exponent} & \textbf{Notes}\\ | |
3970 \hline $2$ & $1.584962501$ & This is Karatsuba Multiplication. \\ | |
3971 \hline $3$ & $1.464973520$ & This is Toom-Cook Multiplication. \\ | |
3972 \hline $4$ & $1.403677461$ &\\ | |
3973 \hline $5$ & $1.365212389$ &\\ | |
3974 \hline $10$ & $1.278753601$ &\\ | |
3975 \hline $100$ & $1.149426538$ &\\ | |
3976 \hline $1000$ & $1.100270931$ &\\ | |
3977 \hline $10000$ & $1.075252070$ &\\ | |
3978 \hline | |
3979 \end{tabular} | |
3980 \end{center} | |
3981 \caption{Asymptotic Running Time of Polynomial Basis Multiplication} | |
3982 \label{fig:exponent} | |
3983 \end{figure} | |
3984 | |
3985 At first it may seem like a good idea to choose $n = 1000$ since the exponent is approximately $1.1$. However, the overhead | |
3986 of solving for the 1999 terms of $W(x)$ will certainly consume any savings the algorithm could offer for all but exceedingly large | |
3987 numbers. | |
3988 | |
3989 \subsubsection{Cutoff Point} | |
3990 The polynomial basis multiplication algorithms all require fewer single precision multiplications than a straight Comba approach. However, | |
3991 the algorithms incur an overhead (\textit{at the $O(n)$ work level}) since they require a system of equations to be solved. This makes the | |
3992 polynomial basis approach more costly to use with small inputs. | |
3993 | |
3994 Let $m$ represent the number of digits in the multiplicands (\textit{assume both multiplicands have the same number of digits}). There exists a | |
3995 point $y$ such that when $m < y$ the polynomial basis algorithms are more costly than Comba, when $m = y$ they are roughly the same cost and | |
3996 when $m > y$ the Comba methods are slower than the polynomial basis algorithms. | |
3997 | |
3998 The exact location of $y$ depends on several key architectural elements of the computer platform in question. | |
3999 | |
4000 \begin{enumerate} | |
4001 \item The ratio of clock cycles for single precision multiplication versus other simpler operations such as addition, shifting, etc. For example | |
4002 on the AMD Athlon the ratio is roughly $17 : 1$ while on the Intel P4 it is $29 : 1$. The higher the ratio in favour of multiplication the lower | |
4003 the cutoff point $y$ will be. | |
4004 | |
4005 \item The complexity of the linear system of equations (\textit{for the coefficients of $W(x)$}). Generally speaking, as the number of splits | |
4006 grows the complexity grows substantially. Ideally solving the system will only involve addition, subtraction and shifting of integers. This | |
4007 directly reflects on the ratio previously mentioned. | |
4008 | |
4009 \item To a lesser extent memory bandwidth and function call overheads. Provided the values are in the processor cache this is less of an | |
4010 influence over the cutoff point. | |
4011 | |
4012 \end{enumerate} | |
4013 | |
4014 A clean cutoff point separation occurs when a point $y$ is found such that all of the cutoff point conditions are met. For example, if the point | |
4015 is too low then there will be values of $m$ such that $m > y$ and the Comba method is still faster. Finding the cutoff points is fairly simple when | |
4016 a high resolution timer is available. | |
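
Once a cutoff has been measured, selecting a multiplier is a simple comparison. The sketch below is hypothetical throughout: the enumeration, the function name and the decision to compare the smaller of the two digit counts against the cutoff are assumptions made for illustration (the discussion above assumes both inputs have the same size).

```c
/* Hypothetical labels for the two families of multipliers. */
enum mul_method { MUL_COMBA, MUL_POLY_BASIS };

/*
 * Hypothetical dispatcher: pick a multiplier from the digit counts of
 * the two inputs and a cutoff y measured on the target platform.
 */
static enum mul_method choose_multiplier(int a_used, int b_used, int cutoff)
{
    /* assumption: the smaller input decides which method pays off */
    int m = (a_used < b_used) ? a_used : b_used;

    return (m >= cutoff) ? MUL_POLY_BASIS : MUL_COMBA;
}
```

At $m = y$ the two methods cost roughly the same, so which side of the comparison receives the equality is a matter of taste.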
4017 | |
4018 \subsection{Karatsuba Multiplication} | |
4019 Karatsuba \cite{KARA} multiplication when originally proposed in 1962 was among the first set of algorithms to break the $O(n^2)$ barrier for | |
4020 general purpose multiplication. Given two polynomial basis representations $f(x) = ax + b$ and $g(x) = cx + d$, Karatsuba proved with | |
4021 light algebra \cite{KARAP} that the following polynomial is equivalent to multiplication of the two integers the polynomials represent. | |
4022 | |
4023 \begin{equation} | |
4024 f(x) \cdot g(x) = acx^2 + ((ac + bd) - (a - b)(c - d))x + bd | |
4025 \end{equation} | |
4026 | |
4027 Using the observation that $ac$ and $bd$ could be re-used only three half sized multiplications would be required to produce the product. Applying | |
4028 this algorithm recursively, the work factor becomes $O(n^{lg(3)})$ which is substantially better than the work factor $O(n^2)$ of the Comba technique. It turns | |
4029 out what Karatsuba did not know or at least did not publish was that this is simply polynomial basis multiplication with the points | |
4030 $\zeta_0$, $\zeta_{\infty}$ and $-\zeta_{-1}$. Consider the resultant system of equations. | |
4031 | |
4032 \begin{center} | |
4033 \begin{tabular}{rcrcrcrc} | |
4034 $\zeta_{0}$ & $=$ & & & & & $w_0$ \\ | |
4035 $-\zeta_{-1}$ & $=$ & $-w_2$ & $+$ & $w_1$ & $-$ & $w_0$ \\ | |
4036 $\zeta_{\infty}$ & $=$ & $w_2$ & & & & \\ | |
4037 \end{tabular} | |
4038 \end{center} | |
4039 | |
4040 By adding the first and last equation to the equation in the middle the term $w_1$ can be isolated and all three coefficients solved for. The simplicity | |
4041 of this system of equations has made Karatsuba fairly popular. In fact the cutoff point is often fairly low\footnote{With LibTomMath 0.18 it is 70 and 109 digits for the Intel P4 and AMD Athlon respectively.} | |
4042 making it an ideal algorithm to speed up certain public key cryptosystems such as RSA and Diffie-Hellman. It is worth noting that the point | |
4043 $\zeta_1$ could be substituted for $-\zeta_{-1}$. In this case the first and third row are subtracted instead of added to the second row. | |
4044 | |
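As a check on the system of equations above, the three points can be written out explicitly. With $f(x) = ax + b$ and $g(x) = cx + d$ the points are $\zeta_0 = bd$, $\zeta_{\infty} = ac$ and $\zeta_{-1} = f(-1) \cdot g(-1) = (b - a)(d - c) = (a - b)(c - d)$. Adding the three rows then gives

\[
w_1 = \zeta_0 + \zeta_{\infty} + (-\zeta_{-1}) = bd + ac - (a - b)(c - d) = ad + bc
\]

which is exactly the middle coefficient of the ordinary product. For a small numerical instance (values chosen only for illustration) take $f(x) = 2x + 3$ and $g(x) = 4x + 5$. Then $ac = 8$, $bd = 15$ and $(a - b)(c - d) = (-1)(-1) = 1$, so $w_1 = 8 + 15 - 1 = 22 = 2 \cdot 5 + 3 \cdot 4$.
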
4045 \newpage\begin{figure}[!here] | |
4046 \begin{small} | |
4047 \begin{center} | |
4048 \begin{tabular}{l} | |
4049 \hline Algorithm \textbf{mp\_karatsuba\_mul}. \\ | |
4050 \textbf{Input}. mp\_int $a$ and mp\_int $b$ \\ | |
4051 \textbf{Output}. $c \leftarrow \vert a \vert \cdot \vert b \vert$ \\ | |
4052 \hline \\ | |
4053 1. Init the following mp\_int variables: $x0$, $x1$, $y0$, $y1$, $t1$, $x0y0$, $x1y1$.\\ | |
4054 2. If step 1 failed then return(\textit{MP\_MEM}). \\ | |
4055 \\ | |
4056 Split the input. e.g. $a = x1 \cdot \beta^B + x0$ \\ | |
4057 3. $B \leftarrow \mbox{min}(a.used, b.used)/2$ \\ | |
4058 4. $x0 \leftarrow a \mbox{ (mod }\beta^B\mbox{)}$ (\textit{mp\_mod\_2d}) \\ | |
4059 5. $y0 \leftarrow b \mbox{ (mod }\beta^B\mbox{)}$ \\ | |
4060 6. $x1 \leftarrow \lfloor a / \beta^B \rfloor$ (\textit{mp\_rshd}) \\ | |
4061 7. $y1 \leftarrow \lfloor b / \beta^B \rfloor$ \\ | |
4062 \\ | |
4063 Calculate the three products. \\ | |
4064 8. $x0y0 \leftarrow x0 \cdot y0$ (\textit{mp\_mul}) \\ | |
4065 9. $x1y1 \leftarrow x1 \cdot y1$ \\ | |
4066 10. $t1 \leftarrow x1 - x0$ (\textit{mp\_sub}) \\ | |
4067 11. $x0 \leftarrow y1 - y0$ \\ | |
4068 12. $t1 \leftarrow t1 \cdot x0$ \\ | |
4069 \\ | |
4070 Calculate the middle term. \\ | |
4071 13. $x0 \leftarrow x0y0 + x1y1$ \\ | |
4072 14. $t1 \leftarrow x0 - t1$ \\ | |
4073 \\ | |
4074 Calculate the final product. \\ | |
4075 15. $t1 \leftarrow t1 \cdot \beta^B$ (\textit{mp\_lshd}) \\ | |
4076 16. $x1y1 \leftarrow x1y1 \cdot \beta^{2B}$ \\ | |
4077 17. $t1 \leftarrow x0y0 + t1$ \\ | |
4078 18. $c \leftarrow t1 + x1y1$ \\ | |
4079 19. Clear all of the temporary variables. \\ | |
4080 20. Return(\textit{MP\_OKAY}).\\ | |
4081 \hline | |
4082 \end{tabular} | |
4083 \end{center} | |
4084 \end{small} | |
4085 \caption{Algorithm mp\_karatsuba\_mul} | |
4086 \end{figure} | |
4087 | |
4088 \textbf{Algorithm mp\_karatsuba\_mul.} | |
4089 This algorithm computes the unsigned product of two inputs using the Karatsuba multiplication algorithm. It is loosely based on the description | |
4090 from Knuth \cite[pp. 294-295]{TAOCPV2}. | |
4091 | |
4092 \index{radix point} | |
4093 In order to split the two inputs into their respective halves, a suitable \textit{radix point} must be chosen. The radix point chosen must | |
4094 be used for both of the inputs meaning that it must be smaller than the smallest input. Step 3 chooses the radix point $B$ as half of the | |
4095 smallest input \textbf{used} count. After the radix point is chosen the inputs are split into lower and upper halves. Steps 4 and 5 | |
4096 compute the lower halves. Steps 6 and 7 compute the upper halves. | |
4097 | |
4098 After the halves have been computed the three intermediate half-size products must be computed. Steps 8 and 9 compute the trivial products | |
4099 $x0 \cdot y0$ and $x1 \cdot y1$. The mp\_int $x0$ is used as a temporary variable after $x1 - x0$ has been computed. By using $x0$ instead | |
4100 of an additional temporary variable, the algorithm can avoid an additional memory allocation operation. | |
4101 | |
4102 The remaining steps 13 through 18 compute the Karatsuba polynomial through a variety of digit shifting and addition operations. | |
4103 | |
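The same sequence of steps can be tried on machine words. The following is a minimal sketch, assuming a radix point of $16$ bits (so $\beta^B = 2^{16}$) and 32-bit inputs; the function name \texttt{karatsuba\_u32} is hypothetical and the code is not taken from LibTomMath. It mirrors steps 4 through 18 using three half-size multiplications.

```c
#include <stdint.h>

/* Multiply two 32-bit values using three half-size multiplications,
 * following the structure of algorithm mp_karatsuba_mul with a 16-bit
 * radix point. */
uint64_t karatsuba_u32(uint32_t a, uint32_t b)
{
    uint32_t x0 = a & 0xFFFFu, x1 = a >> 16;   /* a = x1 * 2^16 + x0 */
    uint32_t y0 = b & 0xFFFFu, y1 = b >> 16;   /* b = y1 * 2^16 + y0 */
    uint64_t x0y0, x1y1, mid;
    int64_t  t1;

    x0y0 = (uint64_t)x0 * y0;                  /* first product  */
    x1y1 = (uint64_t)x1 * y1;                  /* second product */

    /* third product; the differences may be negative so it is signed */
    t1 = ((int64_t)x1 - (int64_t)x0) * ((int64_t)y1 - (int64_t)y0);

    /* middle term: x0y0 + x1y1 - (x1 - x0)(y1 - y0) = x1*y0 + x0*y1 >= 0 */
    mid = (uint64_t)((int64_t)(x0y0 + x1y1) - t1);

    /* shift the terms into place and add them up */
    return (x1y1 << 32) + (mid << 16) + x0y0;
}
```

Note that the product of the two differences can be negative even though the inputs are not, which is why the multiple precision version relies on signed mp\_sub and mp\_mul for that term.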
4104 \vspace{+3mm}\begin{small} | |
4105 \hspace{-5.1mm}{\bf File}: bn\_mp\_karatsuba\_mul.c | |
4106 \vspace{-3mm} | |
4107 \begin{alltt} | |
4108 016 | |
4109 017 /* c = |a| * |b| using Karatsuba Multiplication using | |
4110 018 * three half size multiplications | |
4111 019 * | |
4112 020 * Let B represent the radix [e.g. 2**DIGIT_BIT] and | |
4113 021 * let n represent half of the number of digits in | |
4114 022 * the min(a,b) | |
4115 023 * | |
4116 024 * a = a1 * B**n + a0 | |
4117 025 * b = b1 * B**n + b0 | |
4118 026 * | |
4119 027 * Then, a * b => | |
4120 028 a1b1 * B**2n + (a0b0 + a1b1 - (a1 - a0)(b1 - b0)) * B**n + a0b0 | |
4121 029 * | |
4122 030 * Note that a1b1 and a0b0 are used twice and only need to be | |
4123 031 * computed once. So in total three half size (half # of | |
4124 032 * digit) multiplications are performed, a0b0, a1b1 and | |
4125 033 * (a1-a0)(b1-b0) | |
4126 034 * | |
4127 035 * Note that a multiplication of half the digits requires | |
4128 036 * 1/4th the number of single precision multiplications so in | |
4129 037 * total after one call 25% of the single precision multiplications | |
4130 038 * are saved. Note also that the call to mp_mul can end up back | |
4131 039 * in this function if the a0, a1, b0, or b1 are above the threshold. | |
4132 040 * This is known as divide-and-conquer and leads to the famous | |
4133 041 * O(N**lg(3)) or O(N**1.584) work which is asymptotically lower than | |
4134 042 * the standard O(N**2) that the baseline/comba methods use. | |
4135 043 * Generally though the overhead of this method doesn't pay off | |
4136 044 * until a certain size (N ~ 80) is reached. | |
4137 045 */ | |
4138 046 int mp_karatsuba_mul (mp_int * a, mp_int * b, mp_int * c) | |
4139 047 \{ | |
4140 048 mp_int x0, x1, y0, y1, t1, x0y0, x1y1; | |
4141 049 int B, err; | |
4142 050 | |
4143 051 /* default the return code to an error */ | |
4144 052 err = MP_MEM; | |
4145 053 | |
4146 054 /* min # of digits */ | |
4147 055 B = MIN (a->used, b->used); | |
4148 056 | |
4149 057 /* now divide in two */ | |
4150 058 B = B >> 1; | |
4151 059 | |
4152 060 /* init copy all the temps */ | |
4153 061 if (mp_init_size (&x0, B) != MP_OKAY) | |
4154 062 goto ERR; | |
4155 063 if (mp_init_size (&x1, a->used - B) != MP_OKAY) | |
4156 064 goto X0; | |
4157 065 if (mp_init_size (&y0, B) != MP_OKAY) | |
4158 066 goto X1; | |
4159 067 if (mp_init_size (&y1, b->used - B) != MP_OKAY) | |
4160 068 goto Y0; | |
4161 069 | |
4162 070 /* init temps */ | |
4163 071 if (mp_init_size (&t1, B * 2) != MP_OKAY) | |
4164 072 goto Y1; | |
4165 073 if (mp_init_size (&x0y0, B * 2) != MP_OKAY) | |
4166 074 goto T1; | |
4167 075 if (mp_init_size (&x1y1, B * 2) != MP_OKAY) | |
4168 076 goto X0Y0; | |
4169 077 | |
4170 078 /* now shift the digits */ | |
4171 079 x0.sign = x1.sign = a->sign; | |
4172 080 y0.sign = y1.sign = b->sign; | |
4173 081 | |
4174 082 x0.used = y0.used = B; | |
4175 083 x1.used = a->used - B; | |
4176 084 y1.used = b->used - B; | |
4177 085 | |
4178 086 \{ | |
4179 087 register int x; | |
4180 088 register mp_digit *tmpa, *tmpb, *tmpx, *tmpy; | |
4181 089 | |
4182 090 /* we copy the digits directly instead of using higher level functions | |
4183 091 * since we also need to shift the digits | |
4184 092 */ | |
4185 093 tmpa = a->dp; | |
4186 094 tmpb = b->dp; | |
4187 095 | |
4188 096 tmpx = x0.dp; | |
4189 097 tmpy = y0.dp; | |
4190 098 for (x = 0; x < B; x++) \{ | |
4191 099 *tmpx++ = *tmpa++; | |
4192 100 *tmpy++ = *tmpb++; | |
4193 101 \} | |
4194 102 | |
4195 103 tmpx = x1.dp; | |
4196 104 for (x = B; x < a->used; x++) \{ | |
4197 105 *tmpx++ = *tmpa++; | |
4198 106 \} | |
4199 107 | |
4200 108 tmpy = y1.dp; | |
4201 109 for (x = B; x < b->used; x++) \{ | |
4202 110 *tmpy++ = *tmpb++; | |
4203 111 \} | |
4204 112 \} | |
4205 113 | |
4206 114 /* only need to clamp the lower words since by definition the | |
4207 115 * upper words x1/y1 must have a known number of digits | |
4208 116 */ | |
4209 117 mp_clamp (&x0); | |
4210 118 mp_clamp (&y0); | |
4211 119 | |
4212 120 /* now calc the products x0y0 and x1y1 */ | |
4213 121 /* after this x0 is no longer required, free temp [x0==t2]! */ | |
4214 122 if (mp_mul (&x0, &y0, &x0y0) != MP_OKAY) | |
4215 123 goto X1Y1; /* x0y0 = x0*y0 */ | |
4216 124 if (mp_mul (&x1, &y1, &x1y1) != MP_OKAY) | |
4217 125 goto X1Y1; /* x1y1 = x1*y1 */ | |
4218 126 | |
4219 127 /* now calc x1-x0 and y1-y0 */ | |
4220 128 if (mp_sub (&x1, &x0, &t1) != MP_OKAY) | |
4221 129 goto X1Y1; /* t1 = x1 - x0 */ | |
4222 130 if (mp_sub (&y1, &y0, &x0) != MP_OKAY) | |
4223 131 goto X1Y1; /* t2 = y1 - y0 */ | |
4224 132 if (mp_mul (&t1, &x0, &t1) != MP_OKAY) | |
4225 133 goto X1Y1; /* t1 = (x1 - x0) * (y1 - y0) */ | |
4226 134 | |
4227 135 /* add x0y0 */ | |
4228 136 if (mp_add (&x0y0, &x1y1, &x0) != MP_OKAY) | |
4229 137 goto X1Y1; /* t2 = x0y0 + x1y1 */ | |
4230 138 if (mp_sub (&x0, &t1, &t1) != MP_OKAY) | |
4231 139 goto X1Y1; /* t1 = x0y0 + x1y1 - (x1-x0)*(y1-y0) */ | |
4232 140 | |
4233 141 /* shift by B */ | |
4234 142 if (mp_lshd (&t1, B) != MP_OKAY) | |
4235 143 goto X1Y1; /* t1 = (x0y0 + x1y1 - (x1-x0)*(y1-y0))<<B */ | |
4236 144 if (mp_lshd (&x1y1, B * 2) != MP_OKAY) | |
4237 145 goto X1Y1; /* x1y1 = x1y1 << 2*B */ | |
4238 146 | |
4239 147 if (mp_add (&x0y0, &t1, &t1) != MP_OKAY) | |
4240 148 goto X1Y1; /* t1 = x0y0 + t1 */ | |
4241 149 if (mp_add (&t1, &x1y1, c) != MP_OKAY) | |
4242 150 goto X1Y1; /* c = t1 + x1y1 */ | |
4243 151 | |
4244 152 /* Algorithm succeeded set the return code to MP_OKAY */ | |
4245 153 err = MP_OKAY; | |
4246 154 | |
4247 155 X1Y1:mp_clear (&x1y1); | |
4248 156 X0Y0:mp_clear (&x0y0); | |
4249 157 T1:mp_clear (&t1); | |
4250 158 Y1:mp_clear (&y1); | |
4251 159 Y0:mp_clear (&y0); | |
4252 160 X1:mp_clear (&x1); | |
4253 161 X0:mp_clear (&x0); | |
4254 162 ERR: | |
4255 163 return err; | |
4256 164 \} | |
4257 \end{alltt} | |
4258 \end{small} | |
4259 | |
4260 The new coding element in this routine, not seen in previous routines, is the usage of goto statements. The conventional | |
4261 wisdom is that goto statements should be avoided. This is generally true, however when every single function call can fail, it makes sense | |
4262 to handle error recovery with a single piece of code. Lines 61 to 75 handle initializing all of the temporary variables | |
4263 required. Note how each of the if statements goes to a different label in case of failure. This allows the routine to correctly free only | |
4264 the temporaries that have been successfully allocated so far. | |
4265 | |
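The staged cleanup idiom can be isolated in a small, self-contained sketch. Everything in it is hypothetical and only meant to illustrate the pattern: \texttt{try\_alloc} and \texttt{release} are stand-ins for mp\_init\_size and mp\_clear, and the counters \texttt{allocs\_left} and \texttt{live} exist solely so that the failure paths can be exercised.

```c
#include <stdlib.h>

/* Hypothetical allocation hooks used to simulate failures. */
static int allocs_left = -1;  /* < 0: never fail, otherwise fail once it reaches 0 */
static int live = 0;          /* number of buffers currently allocated */

static void *try_alloc(size_t n)
{
    void *p;
    if (allocs_left == 0)
        return NULL;
    if (allocs_left > 0)
        allocs_left--;
    p = malloc(n);
    if (p != NULL)
        live++;
    return p;
}

static void release(void *p)
{
    free(p);
    live--;
}

/* Needs three temporaries; each failure jumps to the label that frees
 * only what has been allocated so far. */
int with_three_buffers(size_t n)
{
    int err = -1;             /* default the return code to an error */
    void *p, *q, *r;

    if ((p = try_alloc(n)) == NULL)
        goto ERR;             /* nothing to free yet */
    if ((q = try_alloc(n)) == NULL)
        goto P;               /* only p exists */
    if ((r = try_alloc(n)) == NULL)
        goto Q;               /* p and q exist */

    /* ... the real work would go here ... */
    err = 0;                  /* success shares the cleanup code below */

    release(r);
Q:  release(q);
P:  release(p);
ERR:
    return err;
}
```

The labels appear in the reverse order of the allocations, so a jump to any label falls through every cleanup call that is still owed and nothing else.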
4266 The temporary variables are all initialized using the mp\_init\_size routine since they are expected to be large. This saves the | |
4267 additional reallocation that would have been necessary. Also $x0$, $x1$, $y0$ and $y1$ have to be able to hold at least their respective | |
4268 number of digits for the next section of code. | |
4269 | |
4270 The first algebraic portion of the algorithm is to split the two inputs into their halves. However, instead of using mp\_mod\_2d and mp\_rshd | |
4271 to extract the halves, the respective code has been placed inline within the body of the function. To initialize the halves, the \textbf{used} and | |
4272 \textbf{sign} members are copied first. The first for loop on line 98 copies the lower halves. Since they are both the same magnitude it | |
4273 is simpler to calculate both lower halves in a single loop. The for loops on lines 104 and 109 calculate the upper halves $x1$ and | |
4274 $y1$ respectively. | |
4275 | |
4276 By inlining the calculation of the halves, the Karatsuba multiplier has a slightly lower overhead and can be used for smaller magnitude inputs. | |
4277 | |
4278 When line 153 is reached, the algorithm has completed successfully. The ``error status'' variable $err$ is set to \textbf{MP\_OKAY} so that | |
4279 the same code that handles errors can be used to clear the temporary variables and return. | |
4280 | |
4281 \subsection{Toom-Cook $3$-Way Multiplication} | |
4282 Toom-Cook $3$-Way \cite{TOOM} multiplication is essentially the polynomial basis algorithm for $n = 2$ except that the points are | |
4283 chosen such that $\zeta$ is easy to compute and the resulting system of equations easy to reduce. Here, the points $\zeta_{0}$, | |
4284 $16 \cdot \zeta_{1 \over 2}$, $\zeta_1$, $\zeta_2$ and $\zeta_{\infty}$ make up the five required points to solve for the coefficients | |
4285 of $W(x)$. | |
4286 | |
4287 With the five relations that Toom-Cook specifies, the following system of equations is formed. | |
4288 | |
4289 \begin{center} | |
4290 \begin{tabular}{rcrcrcrcrcr} | |
4291 $\zeta_0$ & $=$ & $0w_4$ & $+$ & $0w_3$ & $+$ & $0w_2$ & $+$ & $0w_1$ & $+$ & $1w_0$ \\ | |
4292 $16 \cdot \zeta_{1 \over 2}$ & $=$ & $1w_4$ & $+$ & $2w_3$ & $+$ & $4w_2$ & $+$ & $8w_1$ & $+$ & $16w_0$ \\ | |
4293 $\zeta_1$ & $=$ & $1w_4$ & $+$ & $1w_3$ & $+$ & $1w_2$ & $+$ & $1w_1$ & $+$ & $1w_0$ \\ | |
4294 $\zeta_2$ & $=$ & $16w_4$ & $+$ & $8w_3$ & $+$ & $4w_2$ & $+$ & $2w_1$ & $+$ & $1w_0$ \\ | |
4295 $\zeta_{\infty}$ & $=$ & $1w_4$ & $+$ & $0w_3$ & $+$ & $0w_2$ & $+$ & $0w_1$ & $+$ & $0w_0$ \\ | |
4296 \end{tabular} | |
4297 \end{center} | |
4298 | |
4299 A trivial solution to this matrix requires $12$ subtractions, two multiplications by a small power of two, two divisions by a small power | |
4300 of two, two divisions by three and one multiplication by three. All of these $19$ sub-operations require less than quadratic time, meaning that | |
4301 the algorithm can be faster than a baseline multiplication. However, the greater complexity of this algorithm places the cutoff point | |
4302 (\textbf{TOOM\_MUL\_CUTOFF}) where Toom-Cook becomes more efficient much higher than the Karatsuba cutoff point. | |
4303 | |
4304 \begin{figure}[!here] | |
4305 \begin{small} | |
4306 \begin{center} | |
4307 \begin{tabular}{l} | |
4308 \hline Algorithm \textbf{mp\_toom\_mul}. \\ | |
4309 \textbf{Input}. mp\_int $a$ and mp\_int $b$ \\ | |
4310 \textbf{Output}. $c \leftarrow a \cdot b $ \\ | |
4311 \hline \\ | |
4312 Split $a$ and $b$ into three pieces. E.g. $a = a_2 \beta^{2k} + a_1 \beta^{k} + a_0$ \\ | |
4313 1. $k \leftarrow \lfloor \mbox{min}(a.used, b.used) / 3 \rfloor$ \\ | |
4314 2. $a_0 \leftarrow a \mbox{ (mod }\beta^{k}\mbox{)}$ \\ | |
4315 3. $a_1 \leftarrow \lfloor a / \beta^k \rfloor$, $a_1 \leftarrow a_1 \mbox{ (mod }\beta^{k}\mbox{)}$ \\ | |
4316 4. $a_2 \leftarrow \lfloor a / \beta^{2k} \rfloor$, $a_2 \leftarrow a_2 \mbox{ (mod }\beta^{k}\mbox{)}$ \\ | |
4317 5. $b_0 \leftarrow b \mbox{ (mod }\beta^{k}\mbox{)}$ \\ | |
4318 6. $b_1 \leftarrow \lfloor b / \beta^k \rfloor$, $b_1 \leftarrow b_1 \mbox{ (mod }\beta^{k}\mbox{)}$ \\ | |
4319 7. $b_2 \leftarrow \lfloor b / \beta^{2k} \rfloor$, $b_2 \leftarrow b_2 \mbox{ (mod }\beta^{k}\mbox{)}$ \\ | |
4320 \\ | |
4321 Find the five equations for $w_0, w_1, ..., w_4$. \\ | |
4322 8. $w_0 \leftarrow a_0 \cdot b_0$ \\ | |
4323 9. $w_4 \leftarrow a_2 \cdot b_2$ \\ | |
4324 10. $tmp_1 \leftarrow 2 \cdot a_0$, $tmp_1 \leftarrow a_1 + tmp_1$, $tmp_1 \leftarrow 2 \cdot tmp_1$, $tmp_1 \leftarrow tmp_1 + a_2$ \\ | |
4325 11. $tmp_2 \leftarrow 2 \cdot b_0$, $tmp_2 \leftarrow b_1 + tmp_2$, $tmp_2 \leftarrow 2 \cdot tmp_2$, $tmp_2 \leftarrow tmp_2 + b_2$ \\ | |
4326 12. $w_1 \leftarrow tmp_1 \cdot tmp_2$ \\ | |
4327 13. $tmp_1 \leftarrow 2 \cdot a_2$, $tmp_1 \leftarrow a_1 + tmp_1$, $tmp_1 \leftarrow 2 \cdot tmp_1$, $tmp_1 \leftarrow tmp_1 + a_0$ \\ | |
4328 14. $tmp_2 \leftarrow 2 \cdot b_2$, $tmp_2 \leftarrow b_1 + tmp_2$, $tmp_2 \leftarrow 2 \cdot tmp_2$, $tmp_2 \leftarrow tmp_2 + b_0$ \\ | |
4329 15. $w_3 \leftarrow tmp_1 \cdot tmp_2$ \\ | |
4330 16. $tmp_1 \leftarrow a_0 + a_1$, $tmp_1 \leftarrow tmp_1 + a_2$, $tmp_2 \leftarrow b_0 + b_1$, $tmp_2 \leftarrow tmp_2 + b_2$ \\ | |
4331 17. $w_2 \leftarrow tmp_1 \cdot tmp_2$ \\ | |
4332 \\ | |
4333 Continued on the next page.\\ | |
4334 \hline | |
4335 \end{tabular} | |
4336 \end{center} | |
4337 \end{small} | |
4338 \caption{Algorithm mp\_toom\_mul} | |
4339 \end{figure} | |
4340 | |
4341 \newpage\begin{figure}[!here] | |
4342 \begin{small} | |
4343 \begin{center} | |
4344 \begin{tabular}{l} | |
4345 \hline Algorithm \textbf{mp\_toom\_mul} (continued). \\ | |
4346 \textbf{Input}. mp\_int $a$ and mp\_int $b$ \\ | |
4347 \textbf{Output}. $c \leftarrow a \cdot b $ \\ | |
4348 \hline \\ | |
4349 Now solve the system of equations. \\ | |
4350 18. $w_1 \leftarrow w_1 - w_4$, $w_3 \leftarrow w_3 - w_0$ \\ | |
4351 19. $w_1 \leftarrow \lfloor w_1 / 2 \rfloor$, $w_3 \leftarrow \lfloor w_3 / 2 \rfloor$ \\ | |
4352 20. $w_2 \leftarrow w_2 - w_0$, $w_2 \leftarrow w_2 - w_4$ \\ | |
4353 21. $w_1 \leftarrow w_1 - w_2$, $w_3 \leftarrow w_3 - w_2$ \\ | |
4354 22. $tmp_1 \leftarrow 8 \cdot w_0$, $w_1 \leftarrow w_1 - tmp_1$, $tmp_1 \leftarrow 8 \cdot w_4$, $w_3 \leftarrow w_3 - tmp_1$ \\ | |
4355 23. $w_2 \leftarrow 3 \cdot w_2$, $w_2 \leftarrow w_2 - w_1$, $w_2 \leftarrow w_2 - w_3$ \\ | |
4356 24. $w_1 \leftarrow w_1 - w_2$, $w_3 \leftarrow w_3 - w_2$ \\ | |
4357 25. $w_1 \leftarrow \lfloor w_1 / 3 \rfloor, w_3 \leftarrow \lfloor w_3 / 3 \rfloor$ \\ | |
4358 \\ | |
4359 Now substitute $\beta^k$ for $x$ by shifting $w_0, w_1, ..., w_4$. \\ | |
4360 26. for $n$ from $1$ to $4$ do \\ | |
4361 \hspace{3mm}26.1 $w_n \leftarrow w_n \cdot \beta^{nk}$ \\ | |
4362 27. $c \leftarrow w_0 + w_1$, $c \leftarrow c + w_2$, $c \leftarrow c + w_3$, $c \leftarrow c + w_4$ \\ | |
4363 28. Return(\textit{MP\_OKAY}) \\ | |
4364 \hline | |
4365 \end{tabular} | |
4366 \end{center} | |
4367 \end{small} | |
4368 \caption{Algorithm mp\_toom\_mul (continued)} | |
4369 \end{figure} | |
4370 | |
4371 \textbf{Algorithm mp\_toom\_mul.} | |
4372 This algorithm computes the product of two mp\_int variables $a$ and $b$ using the Toom-Cook approach. Compared to the Karatsuba multiplication, this | |
4373 algorithm has a lower asymptotic running time of approximately $O(n^{1.465})$ but at an obvious cost in overhead. In this | |
4374 description, several statements have been compounded to save space. The intention is that the statements are executed from left to right across | |
4375 any given step. | |
4376 | |
4377 The two inputs $a$ and $b$ are first split into three $k$-digit integers $a_0, a_1, a_2$ and $b_0, b_1, b_2$ respectively. From these smaller | |
4378 integers the coefficients of the polynomial basis representations $f(x)$ and $g(x)$ are known and can be used to find the relations required. | |
4379 | |
4380 The first two relations $w_0$ and $w_4$ are the points $\zeta_{0}$ and $\zeta_{\infty}$ respectively. The relations $w_1, w_2$ and $w_3$ correspond | |
4381 to the points $16 \cdot \zeta_{1 \over 2}, \zeta_{1}$ and $\zeta_{2}$ respectively. These are found using logical shifts to independently find | |
4382 $f(y)$ and $g(y)$ which significantly speeds up the algorithm. | |
4383 | |
4384 After the five relations $w_0, w_1, \ldots, w_4$ have been computed, the system they represent must be solved in order for the unknown coefficients | |
4385 $w_1, w_2$ and $w_3$ to be isolated. The steps 18 through 25 perform the system reduction required as previously described. Each step of | |
4386 the reduction represents the comparable matrix operation that would be performed had this been performed by hand. For example, step 18 indicates | |
4387 that row $4$ must be subtracted from row $1$ and simultaneously row $0$ subtracted from row $3$. | |
4388 | |
4389 Once the coefficients have been isolated, the polynomial $W(x) = \sum_{i=0}^{2n} w_i x^i$ is known. By substituting $\beta^{k}$ for $x$, the integer | |
4390 result $a \cdot b$ is produced. | |
4391 | |
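The evaluation and reduction steps can be followed on small numbers. The sketch below is an illustration only, assuming the three pieces of each input fit comfortably in 64-bit integers; the name \texttt{toom3\_coeffs} is hypothetical. It forms the five relations with five multiplications and then applies the row operations of steps 18 through 25 in the order used by the C source (row $4$ is subtracted from row $1$ in step 18).

```c
#include <stdint.h>

/* Recover the coefficients w[0..4] of W(x) = f(x) * g(x) where
 * f(x) = a[2]x^2 + a[1]x + a[0] and g(x) = b[2]x^2 + b[1]x + b[0]. */
void toom3_coeffs(const int64_t a[3], const int64_t b[3], int64_t w[5])
{
    int64_t w0, w1, w2, w3, w4;

    w0 = a[0] * b[0];                                  /* zeta_0        */
    w4 = a[2] * b[2];                                  /* zeta_infinity */
    w1 = (a[2] + 2 * (a[1] + 2 * a[0])) *
         (b[2] + 2 * (b[1] + 2 * b[0]));               /* 16 * zeta_1/2 */
    w3 = (a[0] + 2 * (a[1] + 2 * a[2])) *
         (b[0] + 2 * (b[1] + 2 * b[2]));               /* zeta_2        */
    w2 = (a[0] + a[1] + a[2]) * (b[0] + b[1] + b[2]);  /* zeta_1        */

    w1 -= w4;       w3 -= w0;       /* step 18 */
    w1 /= 2;        w3 /= 2;        /* step 19, both divisions are exact */
    w2 -= w0;       w2 -= w4;       /* step 20 */
    w1 -= w2;       w3 -= w2;       /* step 21 */
    w1 -= 8 * w0;   w3 -= 8 * w4;   /* step 22 */
    w2 = 3 * w2 - w1 - w3;          /* step 23 */
    w1 -= w2;       w3 -= w2;       /* step 24 */
    w1 /= 3;        w3 /= 3;        /* step 25, both divisions are exact */

    w[0] = w0; w[1] = w1; w[2] = w2; w[3] = w3; w[4] = w4;
}
```

For the pieces $(a_0, a_1, a_2) = (1, 2, 3)$ and $(b_0, b_1, b_2) = (4, 5, 6)$ the function yields the coefficients $4, 13, 28, 27, 18$. Substituting $x = 10$ gives $209934 = 321 \cdot 654$, the same product a direct multiplication of the decimal values produces.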
4392 \vspace{+3mm}\begin{small} | |
4393 \hspace{-5.1mm}{\bf File}: bn\_mp\_toom\_mul.c | |
4394 \vspace{-3mm} | |
4395 \begin{alltt} | |
4396 016 | |
4397 017 /* multiplication using the Toom-Cook 3-way algorithm */ | |
4398 018 int mp_toom_mul(mp_int *a, mp_int *b, mp_int *c) | |
4399 019 \{ | |
4400 020 mp_int w0, w1, w2, w3, w4, tmp1, tmp2, a0, a1, a2, b0, b1, b2; | |
4401 021 int res, B; | |
4402 022 | |
4403 023 /* init temps */ | |
4404 024 if ((res = mp_init_multi(&w0, &w1, &w2, &w3, &w4, | |
4405 025 &a0, &a1, &a2, &b0, &b1, | |
4406 026 &b2, &tmp1, &tmp2, NULL)) != MP_OKAY) \{ | |
4407 027 return res; | |
4408 028 \} | |
4409 029 | |
4410 030 /* B */ | |
4411 031 B = MIN(a->used, b->used) / 3; | |
4412 032 | |
4413 033 /* a = a2 * B**2 + a1 * B + a0 */ | |
4414 034 if ((res = mp_mod_2d(a, DIGIT_BIT * B, &a0)) != MP_OKAY) \{ | |
4415 035 goto ERR; | |
4416 036 \} | |
4417 037 | |
4418 038 if ((res = mp_copy(a, &a1)) != MP_OKAY) \{ | |
4419 039 goto ERR; | |
4420 040 \} | |
4421 041 mp_rshd(&a1, B); | |
4422 042 mp_mod_2d(&a1, DIGIT_BIT * B, &a1); | |
4423 043 | |
4424 044 if ((res = mp_copy(a, &a2)) != MP_OKAY) \{ | |
4425 045 goto ERR; | |
4426 046 \} | |
4427 047 mp_rshd(&a2, B*2); | |
4428 048 | |
4429 049 /* b = b2 * B**2 + b1 * B + b0 */ | |
4430 050 if ((res = mp_mod_2d(b, DIGIT_BIT * B, &b0)) != MP_OKAY) \{ | |
4431 051 goto ERR; | |
4432 052 \} | |
4433 053 | |
4434 054 if ((res = mp_copy(b, &b1)) != MP_OKAY) \{ | |
4435 055 goto ERR; | |
4436 056 \} | |
4437 057 mp_rshd(&b1, B); | |
4438 058 mp_mod_2d(&b1, DIGIT_BIT * B, &b1); | |
4439 059 | |
4440 060 if ((res = mp_copy(b, &b2)) != MP_OKAY) \{ | |
4441 061 goto ERR; | |
4442 062 \} | |
4443 063 mp_rshd(&b2, B*2); | |
4444 064 | |
4445 065 /* w0 = a0*b0 */ | |
4446 066 if ((res = mp_mul(&a0, &b0, &w0)) != MP_OKAY) \{ | |
4447 067 goto ERR; | |
4448 068 \} | |
4449 069 | |
4450 070 /* w4 = a2 * b2 */ | |
4451 071 if ((res = mp_mul(&a2, &b2, &w4)) != MP_OKAY) \{ | |
4452 072 goto ERR; | |
4453 073 \} | |
4454 074 | |
4455 075 /* w1 = (a2 + 2(a1 + 2a0))(b2 + 2(b1 + 2b0)) */ | |
4456 076 if ((res = mp_mul_2(&a0, &tmp1)) != MP_OKAY) \{ | |
4457 077 goto ERR; | |
4458 078 \} | |
4459 079 if ((res = mp_add(&tmp1, &a1, &tmp1)) != MP_OKAY) \{ | |
4460 080 goto ERR; | |
4461 081 \} | |
4462 082 if ((res = mp_mul_2(&tmp1, &tmp1)) != MP_OKAY) \{ | |
4463 083 goto ERR; | |
4464 084 \} | |
4465 085 if ((res = mp_add(&tmp1, &a2, &tmp1)) != MP_OKAY) \{ | |
4466 086 goto ERR; | |
4467 087 \} | |
4468 088 | |
4469 089 if ((res = mp_mul_2(&b0, &tmp2)) != MP_OKAY) \{ | |
4470 090 goto ERR; | |
4471 091 \} | |
4472 092 if ((res = mp_add(&tmp2, &b1, &tmp2)) != MP_OKAY) \{ | |
4473 093 goto ERR; | |
4474 094 \} | |
4475 095 if ((res = mp_mul_2(&tmp2, &tmp2)) != MP_OKAY) \{ | |
4476 096 goto ERR; | |
4477 097 \} | |
4478 098 if ((res = mp_add(&tmp2, &b2, &tmp2)) != MP_OKAY) \{ | |
4479 099 goto ERR; | |
4480 100 \} | |
4481 101 | |
4482 102 if ((res = mp_mul(&tmp1, &tmp2, &w1)) != MP_OKAY) \{ | |
4483 103 goto ERR; | |
4484 104 \} | |
4485 105 | |
4486 106 /* w3 = (a0 + 2(a1 + 2a2))(b0 + 2(b1 + 2b2)) */ | |
4487 107 if ((res = mp_mul_2(&a2, &tmp1)) != MP_OKAY) \{ | |
4488 108 goto ERR; | |
4489 109 \} | |
4490 110 if ((res = mp_add(&tmp1, &a1, &tmp1)) != MP_OKAY) \{ | |
4491 111 goto ERR; | |
4492 112 \} | |
4493 113 if ((res = mp_mul_2(&tmp1, &tmp1)) != MP_OKAY) \{ | |
4494 114 goto ERR; | |
4495 115 \} | |
4496 116 if ((res = mp_add(&tmp1, &a0, &tmp1)) != MP_OKAY) \{ | |
4497 117 goto ERR; | |
4498 118 \} | |
4499 119 | |
4500 120 if ((res = mp_mul_2(&b2, &tmp2)) != MP_OKAY) \{ | |
4501 121 goto ERR; | |
4502 122 \} | |
4503 123 if ((res = mp_add(&tmp2, &b1, &tmp2)) != MP_OKAY) \{ | |
4504 124 goto ERR; | |
4505 125 \} | |
4506 126 if ((res = mp_mul_2(&tmp2, &tmp2)) != MP_OKAY) \{ | |
4507 127 goto ERR; | |
4508 128 \} | |
4509 129 if ((res = mp_add(&tmp2, &b0, &tmp2)) != MP_OKAY) \{ | |
4510 130 goto ERR; | |
4511 131 \} | |
4512 132 | |
4513 133 if ((res = mp_mul(&tmp1, &tmp2, &w3)) != MP_OKAY) \{ | |
4514 134 goto ERR; | |
4515 135 \} | |
4516 136 | |
4517 137 | |
4518 138 /* w2 = (a2 + a1 + a0)(b2 + b1 + b0) */ | |
4519 139 if ((res = mp_add(&a2, &a1, &tmp1)) != MP_OKAY) \{ | |
4520 140 goto ERR; | |
4521 141 \} | |
4522 142 if ((res = mp_add(&tmp1, &a0, &tmp1)) != MP_OKAY) \{ | |
4523 143 goto ERR; | |
4524 144 \} | |
4525 145 if ((res = mp_add(&b2, &b1, &tmp2)) != MP_OKAY) \{ | |
4526 146 goto ERR; | |
4527 147 \} | |
4528 148 if ((res = mp_add(&tmp2, &b0, &tmp2)) != MP_OKAY) \{ | |
4529 149 goto ERR; | |
4530 150 \} | |
4531 151 if ((res = mp_mul(&tmp1, &tmp2, &w2)) != MP_OKAY) \{ | |
4532 152 goto ERR; | |
4533 153 \} | |
4534 154 | |
4535 155 /* now solve the matrix | |
4536 156 | |
4537 157 0 0 0 0 1 | |
4538 158 1 2 4 8 16 | |
4539 159 1 1 1 1 1 | |
4540 160 16 8 4 2 1 | |
4541 161 1 0 0 0 0 | |
4542 162 | |
4543 163 using 12 subtractions, 4 shifts, | |
4544 164 2 small divisions and 1 small multiplication | |
4545 165 */ | |
4546 166 | |
4547 167 /* r1 - r4 */ | |
4548 168 if ((res = mp_sub(&w1, &w4, &w1)) != MP_OKAY) \{ | |
4549 169 goto ERR; | |
4550 170 \} | |
4551 171 /* r3 - r0 */ | |
4552 172 if ((res = mp_sub(&w3, &w0, &w3)) != MP_OKAY) \{ | |
4553 173 goto ERR; | |
4554 174 \} | |
4555 175 /* r1/2 */ | |
4556 176 if ((res = mp_div_2(&w1, &w1)) != MP_OKAY) \{ | |
4557 177 goto ERR; | |
4558 178 \} | |
4559 179 /* r3/2 */ | |
4560 180 if ((res = mp_div_2(&w3, &w3)) != MP_OKAY) \{ | |
4561 181 goto ERR; | |
4562 182 \} | |
4563 183 /* r2 - r0 - r4 */ | |
4564 184 if ((res = mp_sub(&w2, &w0, &w2)) != MP_OKAY) \{ | |
4565 185 goto ERR; | |
4566 186 \} | |
4567 187 if ((res = mp_sub(&w2, &w4, &w2)) != MP_OKAY) \{ | |
4568 188 goto ERR; | |
4569 189 \} | |
4570 190 /* r1 - r2 */ | |
4571 191 if ((res = mp_sub(&w1, &w2, &w1)) != MP_OKAY) \{ | |
4572 192 goto ERR; | |
4573 193 \} | |
4574 194 /* r3 - r2 */ | |
4575 195 if ((res = mp_sub(&w3, &w2, &w3)) != MP_OKAY) \{ | |
4576 196 goto ERR; | |
4577 197 \} | |
4578 198 /* r1 - 8r0 */ | |
4579 199 if ((res = mp_mul_2d(&w0, 3, &tmp1)) != MP_OKAY) \{ | |
4580 200 goto ERR; | |
4581 201 \} | |
4582 202 if ((res = mp_sub(&w1, &tmp1, &w1)) != MP_OKAY) \{ | |
4583 203 goto ERR; | |
4584 204 \} | |
4585 205 /* r3 - 8r4 */ | |
4586 206 if ((res = mp_mul_2d(&w4, 3, &tmp1)) != MP_OKAY) \{ | |
4587 207 goto ERR; | |
4588 208 \} | |
4589 209 if ((res = mp_sub(&w3, &tmp1, &w3)) != MP_OKAY) \{ | |
4590 210 goto ERR; | |
4591 211 \} | |
4592 212 /* 3r2 - r1 - r3 */ | |
4593 213 if ((res = mp_mul_d(&w2, 3, &w2)) != MP_OKAY) \{ | |
4594 214 goto ERR; | |
4595 215 \} | |
4596 216 if ((res = mp_sub(&w2, &w1, &w2)) != MP_OKAY) \{ | |
4597 217 goto ERR; | |
4598 218 \} | |
4599 219 if ((res = mp_sub(&w2, &w3, &w2)) != MP_OKAY) \{ | |
4600 220 goto ERR; | |
4601 221 \} | |
4602 222 /* r1 - r2 */ | |
4603 223 if ((res = mp_sub(&w1, &w2, &w1)) != MP_OKAY) \{ | |
4604 224 goto ERR; | |
4605 225 \} | |
4606 226 /* r3 - r2 */ | |
4607 227 if ((res = mp_sub(&w3, &w2, &w3)) != MP_OKAY) \{ | |
4608 228 goto ERR; | |
4609 229 \} | |
4610 230 /* r1/3 */ | |
4611 231 if ((res = mp_div_3(&w1, &w1, NULL)) != MP_OKAY) \{ | |
4612 232 goto ERR; | |
4613 233 \} | |
4614 234 /* r3/3 */ | |
4615 235 if ((res = mp_div_3(&w3, &w3, NULL)) != MP_OKAY) \{ | |
4616 236 goto ERR; | |
4617 237 \} | |
4618 238 | |
4619 239 /* at this point shift W[n] by B*n */ | |
4620 240 if ((res = mp_lshd(&w1, 1*B)) != MP_OKAY) \{ | |
4621 241 goto ERR; | |
4622 242 \} | |
4623 243 if ((res = mp_lshd(&w2, 2*B)) != MP_OKAY) \{ | |
4624 244 goto ERR; | |
4625 245 \} | |
4626 246 if ((res = mp_lshd(&w3, 3*B)) != MP_OKAY) \{ | |
4627 247 goto ERR; | |
4628 248 \} | |
4629 249 if ((res = mp_lshd(&w4, 4*B)) != MP_OKAY) \{ | |
4630 250 goto ERR; | |
4631 251 \} | |
4632 252 | |
4633 253 if ((res = mp_add(&w0, &w1, c)) != MP_OKAY) \{ | |
4634 254 goto ERR; | |
4635 255 \} | |
4636 256 if ((res = mp_add(&w2, &w3, &tmp1)) != MP_OKAY) \{ | |
4637 257 goto ERR; | |
4638 258 \} | |
4639 259 if ((res = mp_add(&w4, &tmp1, &tmp1)) != MP_OKAY) \{ | |
4640 260 goto ERR; | |
4641 261 \} | |
4642 262 if ((res = mp_add(&tmp1, c, c)) != MP_OKAY) \{ | |
4643 263 goto ERR; | |
4644 264 \} | |
4645 265 | |
4646 266 ERR: | |
4647 267 mp_clear_multi(&w0, &w1, &w2, &w3, &w4, | |
4648 268 &a0, &a1, &a2, &b0, &b1, | |
4649 269 &b2, &tmp1, &tmp2, NULL); | |
4650 270 return res; | |
4651 271 \} | |
4652 272 | |
4653 \end{alltt} | |
4654 \end{small} | |
4655 | |
4656 -- Comments to be added during editing phase. | |
4657 | |
4658 \subsection{Signed Multiplication} | |
4659 Now that algorithms to handle multiplications of every useful dimension have been developed, a rather simple finishing touch is required. So far all | |
4660 of the multiplication algorithms have been unsigned multiplications which leaves only a signed multiplication algorithm to be established. | |
4661 | |
4662 \newpage\begin{figure}[!here] | |
4663 \begin{small} | |
4664 \begin{center} | |
4665 \begin{tabular}{l} | |
4666 \hline Algorithm \textbf{mp\_mul}. \\ | |
4667 \textbf{Input}. mp\_int $a$ and mp\_int $b$ \\ | |
4668 \textbf{Output}. $c \leftarrow a \cdot b$ \\ | |
4669 \hline \\ | |
4670 1. If $a.sign = b.sign$ then \\ | |
4671 \hspace{3mm}1.1 $sign = MP\_ZPOS$ \\ | |
4672 2. else \\ | |
4673 \hspace{3mm}2.1 $sign = MP\_NEG$ \\ | |
4674 3. If min$(a.used, b.used) \ge TOOM\_MUL\_CUTOFF$ then \\ | |
4675 \hspace{3mm}3.1 $c \leftarrow a \cdot b$ using algorithm mp\_toom\_mul \\ | |
4676 4. else if min$(a.used, b.used) \ge KARATSUBA\_MUL\_CUTOFF$ then \\ | |
4677 \hspace{3mm}4.1 $c \leftarrow a \cdot b$ using algorithm mp\_karatsuba\_mul \\ | |
4678 5. else \\ | |
4679 \hspace{3mm}5.1 $digs \leftarrow a.used + b.used + 1$ \\ | |
4680 \hspace{3mm}5.2 If $digs < MP\_WARRAY$ and min$(a.used, b.used) \le \delta$ then \\ | |
4681 \hspace{6mm}5.2.1 $c \leftarrow a \cdot b \mbox{ (mod }\beta^{digs}\mbox{)}$ using algorithm fast\_s\_mp\_mul\_digs. \\ | |
4682 \hspace{3mm}5.3 else \\ | |
4683 \hspace{6mm}5.3.1 $c \leftarrow a \cdot b \mbox{ (mod }\beta^{digs}\mbox{)}$ using algorithm s\_mp\_mul\_digs. \\ | |
4684 6. $c.sign \leftarrow sign$ \\ | |
4685 7. Return the result of the unsigned multiplication performed. \\ | |
4686 \hline | |
4687 \end{tabular} | |
4688 \end{center} | |
4689 \end{small} | |
4690 \caption{Algorithm mp\_mul} | |
4691 \end{figure} | |
4692 | |
4693 \textbf{Algorithm mp\_mul.} | |
4694 This algorithm performs the signed multiplication of two inputs. It will make use of any of the four unsigned multiplication algorithms | |
4695 available when the input is of appropriate size. The \textbf{sign} of the result is not set until the end of the algorithm since algorithm | |
4696 s\_mp\_mul\_digs will clear it. | |
4697 | |
4698 \vspace{+3mm}\begin{small} | |
4699 \hspace{-5.1mm}{\bf File}: bn\_mp\_mul.c | |
4700 \vspace{-3mm} | |
4701 \begin{alltt} | |
4702 016 | |
4703 017 /* high level multiplication (handles sign) */ | |
4704 018 int mp_mul (mp_int * a, mp_int * b, mp_int * c) | |
4705 019 \{ | |
4706 020 int res, neg; | |
4707 021 neg = (a->sign == b->sign) ? MP_ZPOS : MP_NEG; | |
4708 022 | |
4709 023 /* use Toom-Cook? */ | |
4710 024 if (MIN (a->used, b->used) >= TOOM_MUL_CUTOFF) \{ | |
4711 025 res = mp_toom_mul(a, b, c); | |
4712 026 /* use Karatsuba? */ | |
4713 027 \} else if (MIN (a->used, b->used) >= KARATSUBA_MUL_CUTOFF) \{ | |
4714 028 res = mp_karatsuba_mul (a, b, c); | |
4715 029 \} else \{ | |
4716 030 /* can we use the fast multiplier? | |
4717 031 * | |
4718 032 * The fast multiplier can be used if the output will | |
4719 033 * have less than MP_WARRAY digits and the number of | |
4720 034 * digits won't affect carry propagation | |
4721 035 */ | |
4722 036 int digs = a->used + b->used + 1; | |
4723 037 | |
4724 038 if ((digs < MP_WARRAY) && | |
4725 039 MIN(a->used, b->used) <= | |
4726 040 (1 << ((CHAR_BIT * sizeof (mp_word)) - (2 * DIGIT_BIT)))) \{ | |
4727 041 res = fast_s_mp_mul_digs (a, b, c, digs); | |
4728 042 \} else \{ | |
4729 043 res = s_mp_mul (a, b, c); | |
4730 044 \} | |
4731 045 \} | |
4732 046 c->sign = neg; | |
4733 047 return res; | |
4734 048 \} | |
4735 \end{alltt} | |
4736 \end{small} | |
4737 | |
4738 The implementation is a direct translation of the algorithm. Line 21 computes the sign of the result using the ``?'' | |
4739 operator from the C programming language. Line 40 computes $\delta$ using the fact that $1 << k$ is equal to $2^k$. | |
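
As a rough illustration of the bound computed on line 40, the following sketch evaluates $\delta = 2^{w - 2 \cdot DIGIT\_BIT}$ for a given word width $w$. The function name toy\_comba\_limit and its parameters are hypothetical and are not part of LibTomMath; the library evaluates the same expression from sizeof(mp\_word), CHAR\_BIT and DIGIT\_BIT.

```c
#include <assert.h>

/* Hypothetical helper (not part of LibTomMath): evaluates the expression
 * from line 40 of bn_mp_mul.c for assumed type widths.
 *
 * word_bits  : assumed width of an mp_word in bits
 * digit_bits : assumed value of DIGIT_BIT
 */
static unsigned long toy_comba_limit(unsigned word_bits, unsigned digit_bits)
{
    /* a product of two digits needs 2*digit_bits bits; the rest is headroom */
    assert(word_bits > 2 * digit_bits);
    /* keep the shift well defined even when unsigned long is only 32 bits */
    assert(word_bits - 2 * digit_bits < 32);
    return 1UL << (word_bits - 2 * digit_bits);
}
```

Assuming a 64-bit mp\_word and 28-bit digits, toy\_comba\_limit(64, 28) evaluates to $2^8 = 256$.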
4740 | |
4741 \section{Squaring} | |
4742 \label{sec:basesquare} | |
4743 | |
4744 Squaring is a special case of multiplication where both multiplicands are equal. At first it may seem like there is no significant optimization | |
4745 available but in fact there is. Consider the multiplication of $576$ against $241$. In total there will be nine single precision multiplications | |
4746 performed which are $1\cdot 6$, $1 \cdot 7$, $1 \cdot 5$, $4 \cdot 6$, $4 \cdot 7$, $4 \cdot 5$, $2 \cdot 6$, $2 \cdot 7$ and $2 \cdot 5$. Now consider | |
4747 the multiplication of $123$ against $123$. The nine products are $3 \cdot 3$, $3 \cdot 2$, $3 \cdot 1$, $2 \cdot 3$, $2 \cdot 2$, $2 \cdot 1$, | |
4748 $1 \cdot 3$, $1 \cdot 2$ and $1 \cdot 1$. On closer inspection some of the products are equivalent. For example, $3 \cdot 2 = 2 \cdot 3$ | |
4749 and $3 \cdot 1 = 1 \cdot 3$. | |
4750 | |
4751 For any $n$-digit input, there are ${{\left (n^2 + n \right)}\over 2}$ possible unique single precision multiplications required compared to the $n^2$ | |
4752 required for multiplication. The following diagram gives an example of the operations required. | |
4753 | |
4754 \begin{figure}[here] | |
4755 \begin{center} | |
4756 \begin{tabular}{ccccc|c} | |
4757 &&1&2&3&\\ | |
4758 $\times$ &&1&2&3&\\ | |
4759 \hline && $3 \cdot 1$ & $3 \cdot 2$ & $3 \cdot 3$ & Row 0\\ | |
4760 & $2 \cdot 1$ & $2 \cdot 2$ & $2 \cdot 3$ && Row 1 \\ | |
4761 $1 \cdot 1$ & $1 \cdot 2$ & $1 \cdot 3$ &&& Row 2 \\ | |
4762 \end{tabular} | |
4763 \end{center} | |
4764 \caption{Squaring Optimization Diagram} | |
4765 \end{figure} | |
4766 | |
4767 Starting from zero and numbering the columns from right to left a very simple pattern becomes obvious. For the purposes of this discussion let $x$ | |
4768 represent the number being squared. The first observation is that in row $k$ the $2k$'th column of the product has a $\left (x_k \right)^2$ term in it. | |
4769 | |
4770 The second observation is that every column $j$ in row $k$ where $j \ne 2k$ is part of a double product. Every non-square term of a column will | |
4771 appear twice hence the name ``double product''. Every odd column is made up entirely of double products. In fact every column is made up of double | |
4772 products and at most one square (\textit{see the exercise section}). | |
4773 | |
4774 The third and final observation is that for row $k$ the first unique non-square term, that is, one that hasn't already appeared in an earlier row, | |
4775 occurs at column $2k + 1$. For example, on row $1$ of the previous squaring, column one is part of the double product with column one from row zero. | |
4776 Column two of row one is a square and column three is the first unique column. | |
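
The three observations can be checked against the general expansion of a three digit square. Writing $x = x_2\beta^2 + x_1\beta + x_0$ and collecting the terms by column gives the following.

\begin{center}
$x^2 = x_0^2 + \left (2x_0x_1 \right )\beta + \left (2x_0x_2 + x_1^2 \right )\beta^2 + \left (2x_1x_2 \right )\beta^3 + x_2^2\beta^4$
\end{center}

For $x = 123$ with $\beta = 10$ (so $x_0 = 3$, $x_1 = 2$ and $x_2 = 1$) the five columns are $9$, $12$, $10$, $4$ and $1$, and $9 + 120 + 1000 + 4000 + 10000 = 15129 = 123^2$. Squares appear only in the even columns, the odd columns hold only double products, and just six distinct single precision products are required instead of nine.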
4777 | |
4778 \subsection{The Baseline Squaring Algorithm} | |
4779 The baseline squaring algorithm is meant to be a catch-all squaring algorithm. It will handle any of the input sizes that the faster routines | |
4780 will not handle. | |
4781 | |
4782 \newpage\begin{figure}[!here] | |
4783 \begin{small} | |
4784 \begin{center} | |
4785 \begin{tabular}{l} | |
4786 \hline Algorithm \textbf{s\_mp\_sqr}. \\ | |
4787 \textbf{Input}. mp\_int $a$ \\ | |
4788 \textbf{Output}. $b \leftarrow a^2$ \\ | |
4789 \hline \\ | |
4790 1. Init a temporary mp\_int $t$ of at least $2 \cdot a.used +1$ digits. (\textit{mp\_init\_size}) \\ | |
4791 2. If step 1 failed return(\textit{MP\_MEM}) \\ | |
4792 3. $t.used \leftarrow 2 \cdot a.used + 1$ \\ | |
4793 4. For $ix$ from 0 to $a.used - 1$ do \\ | |
4794 \hspace{3mm}Calculate the square. \\ | |
4795 \hspace{3mm}4.1 $\hat r \leftarrow t_{2ix} + \left (a_{ix} \right )^2$ \\ | |
4796 \hspace{3mm}4.2 $t_{2ix} \leftarrow \hat r \mbox{ (mod }\beta\mbox{)}$ \\ | |
4797 \hspace{3mm}Calculate the double products after the square. \\ | |
4798 \hspace{3mm}4.3 $u \leftarrow \lfloor \hat r / \beta \rfloor$ \\ | |
4799 \hspace{3mm}4.4 For $iy$ from $ix + 1$ to $a.used - 1$ do \\ | |
4800 \hspace{6mm}4.4.1 $\hat r \leftarrow 2 \cdot a_{ix}a_{iy} + t_{ix + iy} + u$ \\ | |
4801 \hspace{6mm}4.4.2 $t_{ix + iy} \leftarrow \hat r \mbox{ (mod }\beta\mbox{)}$ \\ | |
4802 \hspace{6mm}4.4.3 $u \leftarrow \lfloor \hat r / \beta \rfloor$ \\ | |
4803 \hspace{3mm}Set the last carry. \\ | |
4804 \hspace{3mm}4.5 While $u > 0$ do \\ | |
4805 \hspace{6mm}4.5.1 $iy \leftarrow iy + 1$ \\ | |
4806 \hspace{6mm}4.5.2 $\hat r \leftarrow t_{ix + iy} + u$ \\ | |
4807 \hspace{6mm}4.5.3 $t_{ix + iy} \leftarrow \hat r \mbox{ (mod }\beta\mbox{)}$ \\ | |
4808 \hspace{6mm}4.5.4 $u \leftarrow \lfloor \hat r / \beta \rfloor$ \\ | |
4809 5. Clamp excess digits of $t$. (\textit{mp\_clamp}) \\ | |
4810 6. Exchange $b$ and $t$. \\ | |
4811 7. Clear $t$ (\textit{mp\_clear}) \\ | |
4812 8. Return(\textit{MP\_OKAY}) \\ | |
4813 \hline | |
4814 \end{tabular} | |
4815 \end{center} | |
4816 \end{small} | |
4817 \caption{Algorithm s\_mp\_sqr} | |
4818 \end{figure} | |
4819 | |
4820 \textbf{Algorithm s\_mp\_sqr.} | |
4821 This algorithm computes the square of an input using the three observations on squaring. It is based fairly faithfully on algorithm 14.16 of HAC | |
4822 \cite[pp.596-597]{HAC}. Similar to algorithm s\_mp\_mul\_digs, a temporary mp\_int is allocated to hold the result of the squaring. This allows the | |
4823 destination mp\_int to be the same as the source mp\_int. | |
4824 | |
4825 The outer loop of this algorithm begins on step 4. It is best to think of the outer loop as walking down the rows of the partial results, while | |
4826 the inner loop computes the columns of the partial result. Step 4.1 and 4.2 compute the square term for each row, and step 4.3 and 4.4 propagate | |
4827 the carry and compute the double products. | |
4828 | |
4829 The requirement that a mp\_word be able to represent the range $0 \le x < 2 \beta^2$ arises from this | |
4830 very algorithm. The product $a_{ix}a_{iy}$ will lie in the range $0 \le x \le \beta^2 - 2\beta + 1$ which is obviously less than $\beta^2$ meaning that | |
4831 when it is multiplied by two, it can be properly represented by a mp\_word. | |
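
The bound can be stated slightly more carefully, since step 4.4.1 also adds a digit of $t$ and the carry $u$ to the doubled product. The following is a sketch of the argument and is not taken from HAC. Every stored digit satisfies $t_j \le \beta - 1$, so after step 4.1 $\hat r \le (\beta - 1) + (\beta - 1)^2 = \beta^2 - \beta$ and therefore $u \le \beta - 1$. Now suppose $u \le 2\beta - 1$ on entry to step 4.4.1. Then

\begin{center}
$\hat r \le 2(\beta - 1)^2 + (\beta - 1) + (2\beta - 1) = 2\beta^2 - \beta < 2\beta^2$
\end{center}

and the new carry is $\lfloor \hat r / \beta \rfloor \le 2\beta - 1$, so the assumption holds again on the next iteration. The carry may therefore need one bit more than a single digit, but $\hat r$ never leaves the range $0 \le x < 2\beta^2$.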
4832 | |
4833 Similar to algorithm s\_mp\_mul\_digs, after every pass of the inner loop, the destination is correctly set to the sum of all of the partial | |
4834 results calculated so far. This involves expensive carry propagation which will be eliminated in the next algorithm. | |
4835 | |
4836 \vspace{+3mm}\begin{small} | |
4837 \hspace{-5.1mm}{\bf File}: bn\_s\_mp\_sqr.c | |
4838 \vspace{-3mm} | |
4839 \begin{alltt} | |
4840 016 | |
4841 017 /* low level squaring, b = a*a, HAC pp.596-597, Algorithm 14.16 */ | |
4842 018 int | |
4843 019 s_mp_sqr (mp_int * a, mp_int * b) | |
4844 020 \{ | |
4845 021 mp_int t; | |
4846 022 int res, ix, iy, pa; | |
4847 023 mp_word r; | |
4848 024 mp_digit u, tmpx, *tmpt; | |
4849 025 | |
4850 026 pa = a->used; | |
4851 027 if ((res = mp_init_size (&t, 2*pa + 1)) != MP_OKAY) \{ | |
4852 028 return res; | |
4853 029 \} | |
4854 030 | |
4855 031 /* default used is maximum possible size */ | |
4856 032 t.used = 2*pa + 1; | |
4857 033 | |
4858 034 for (ix = 0; ix < pa; ix++) \{ | |
4859 035 /* first calculate the digit at 2*ix */ | |
4860 036 /* calculate double precision result */ | |
4861 037 r = ((mp_word) t.dp[2*ix]) + | |
4862 038 ((mp_word)a->dp[ix])*((mp_word)a->dp[ix]); | |
4863 039 | |
4864 040 /* store lower part in result */ | |
4865 041 t.dp[ix+ix] = (mp_digit) (r & ((mp_word) MP_MASK)); | |
4866 042 | |
4867 043 /* get the carry */ | |
4868 044 u = (mp_digit)(r >> ((mp_word) DIGIT_BIT)); | |
4869 045 | |
4870 046 /* left hand side of A[ix] * A[iy] */ | |
4871 047 tmpx = a->dp[ix]; | |
4872 048 | |
4873 049 /* alias for where to store the results */ | |
4874 050 tmpt = t.dp + (2*ix + 1); | |
4875 051 | |
4876 052 for (iy = ix + 1; iy < pa; iy++) \{ | |
4877 053 /* first calculate the product */ | |
4878 054 r = ((mp_word)tmpx) * ((mp_word)a->dp[iy]); | |
4879 055 | |
4880 056 /* now calculate the double precision result, note we use | |
4881 057 * addition instead of *2 since it's easier to optimize | |
4882 058 */ | |
4883 059 r = ((mp_word) *tmpt) + r + r + ((mp_word) u); | |
4884 060 | |
4885 061 /* store lower part */ | |
4886 062 *tmpt++ = (mp_digit) (r & ((mp_word) MP_MASK)); | |
4887 063 | |
4888 064 /* get carry */ | |
4889 065 u = (mp_digit)(r >> ((mp_word) DIGIT_BIT)); | |
4890 066 \} | |
4891 067 /* propagate upwards */ | |
4892 068 while (u != ((mp_digit) 0)) \{ | |
4893 069 r = ((mp_word) *tmpt) + ((mp_word) u); | |
4894 070 *tmpt++ = (mp_digit) (r & ((mp_word) MP_MASK)); | |
4895 071 u = (mp_digit)(r >> ((mp_word) DIGIT_BIT)); | |
4896 072 \} | |
4897 073 \} | |
4898 074 | |
4899 075 mp_clamp (&t); | |
4900 076 mp_exch (&t, b); | |
4901 077 mp_clear (&t); | |
4902 078 return MP_OKAY; | |
4903 079 \} | |
4904 \end{alltt} | |
4905 \end{small} | |
4906 | |
4907 Inside the outer loop (\textit{see line 34}) the square term is calculated on line 37. Line 44 extracts the carry from the square | |
4908 term. Aliases for $a_{ix}$ and $t_{ix+iy}$ are initialized on lines 47 and 50 respectively. The doubling is performed using two | |
4909 additions (\textit{see line 59}) since this is usually faster than shifting, or at least as fast. | |
4910 | |
4911 \subsection{Faster Squaring by the ``Comba'' Method} | |
4912 A major drawback to the baseline method is the requirement for single precision shifting inside the $O(n^2)$ nested loop. Squaring has an additional | |
4913 drawback that it must double the product inside the inner loop as well. As for multiplication, the Comba technique can be used to eliminate these | |
4914 performance hazards. | |
4915 | |
4916 The first obvious solution is to make an array of mp\_words which will hold all of the columns. This will indeed eliminate all of the carry | |
4917 propagation operations from the inner loop. However, the inner product must still be doubled $O(n^2)$ times. The solution stems from the simple fact | |
4918 that $2a + 2b + 2c = 2(a + b + c)$. That is, the sum of all of the double products is equal to double the sum of all the products. For example, | |
4919 $ab + ba + ac + ca = 2ab + 2ac = 2(ab + ac)$. | |
4920 | |
4921 However, we cannot simply double all of the columns, since the squares appear only once per row. The most practical solution is to have two mp\_word | |
4922 arrays. One array will hold the squares and the other array will hold the double products. With both arrays the doubling and carry propagation can be | |
4923 moved to a $O(n)$ work level outside the $O(n^2)$ level. | |
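
The two array idea can be sketched in a few lines of C. This is a simplified illustration only, assuming 28-bit digits held in uint32\_t and columns held in uint64\_t; the names toy\_comba\_sqr, TOY\_DIGIT\_BIT, TOY\_MASK and TOY\_MAX are hypothetical and do not appear in LibTomMath.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_DIGIT_BIT 28                                  /* assumed digit size   */
#define TOY_MASK ((((uint64_t)1) << TOY_DIGIT_BIT) - 1)   /* beta - 1             */
#define TOY_MAX 16                                        /* max digits in sketch */

/* Square the n-digit little endian number a[] into b[], which must have
 * room for 2*n + 1 digits.  W holds the undoubled products, X the squares. */
static void toy_comba_sqr(const uint32_t *a, size_t n, uint32_t *b)
{
    uint64_t W[2 * TOY_MAX + 1] = {0};
    uint64_t X[2 * TOY_MAX + 1] = {0};
    uint64_t carry = 0;
    size_t ix, iy;

    assert(n <= TOY_MAX);

    /* O(n^2) part: no doubling and no carry propagation */
    for (ix = 0; ix < n; ix++) {
        X[ix + ix] = (uint64_t)a[ix] * a[ix];
        for (iy = ix + 1; iy < n; iy++) {
            W[ix + iy] += (uint64_t)a[ix] * a[iy];
        }
    }

    /* O(n) part: double, add the square, propagate the carry */
    for (ix = 0; ix < 2 * n + 1; ix++) {
        uint64_t col = 2 * W[ix] + X[ix] + carry;
        b[ix] = (uint32_t)(col & TOY_MASK);
        carry = col >> TOY_DIGIT_BIT;
    }
}
```

With at most sixteen 28-bit digits every column stays far below $2^{64}$, so the sketch does not need the general overflow condition discussed for the Comba multiplier.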
4924 | |
4925 \newpage\begin{figure}[!here] | |
4926 \begin{small} | |
4927 \begin{center} | |
4928 \begin{tabular}{l} | |
4929 \hline Algorithm \textbf{fast\_s\_mp\_sqr}. \\ | |
4930 \textbf{Input}. mp\_int $a$ \\ | |
4931 \textbf{Output}. $b \leftarrow a^2$ \\ | |
4932 \hline \\ | |
4933 Place two arrays of \textbf{MP\_WARRAY} mp\_words named $\hat W$ and $\hat {X}$ on the stack. \\ | |
4934 1. If $b.alloc < 2a.used + 1$ then grow $b$ to $2a.used + 1$ digits. (\textit{mp\_grow}). \\ | |
4935 2. If step 1 failed return(\textit{MP\_MEM}). \\ | |
4936 3. for $ix$ from $0$ to $2a.used$ do \\ | |
4937 \hspace{3mm}3.1 $\hat W_{ix} \leftarrow 0$ \\ | |
4938 \hspace{3mm}3.2 $\hat {X}_{ix} \leftarrow 0$ \\ | |
4939 4. for $ix$ from $0$ to $a.used - 1$ do \\ | |
4940 \hspace{3mm}Compute the square.\\ | |
4941 \hspace{3mm}4.1 $\hat {X}_{ix+ix} \leftarrow \left ( a_{ix} \right )^2$ \\ | |
4942 \\ | |
4943 \hspace{3mm}Compute the double products.\\ | |
4944 \hspace{3mm}4.2 for $iy$ from $ix + 1$ to $a.used - 1$ do \\ | |
4945 \hspace{6mm}4.2.1 $\hat W_{ix+iy} \leftarrow \hat W_{ix+iy} + a_{ix}a_{iy}$ \\ | |
4946 5. $oldused \leftarrow b.used$ \\ | |
4947 6. $b.used \leftarrow 2a.used + 1$ \\ | |
4948 \\ | |
4949 Double the products and propagate the carries simultaneously. \\ | |
4950 7. $\hat W_0 \leftarrow 2 \hat W_0 + \hat {X}_0$ \\ | |
4951 8. for $ix$ from $1$ to $2a.used$ do \\ | |
4952 \hspace{3mm}8.1 $\hat W_{ix} \leftarrow 2 \hat W_{ix} + \hat {X}_{ix}$ \\ | |
4953 \hspace{3mm}8.2 $\hat W_{ix} \leftarrow \hat W_{ix} + \lfloor \hat W_{ix - 1} / \beta \rfloor$ \\ | |
4954 \hspace{3mm}8.3 $b_{ix-1} \leftarrow \hat W_{ix-1} \mbox{ (mod }\beta\mbox{)}$ \\ | |
4955 9. $b_{2a.used} \leftarrow \hat W_{2a.used} \mbox{ (mod }\beta\mbox{)}$ \\ | |
4956 10. if $2a.used + 1 < oldused$ then do \\ | |
4957 \hspace{3mm}10.1 for $ix$ from $2a.used + 1$ to $oldused - 1$ do \\ | |
4958 \hspace{6mm}10.1.1 $b_{ix} \leftarrow 0$ \\ | |
4959 11. Clamp excess digits from $b$. (\textit{mp\_clamp}) \\ | |
4960 12. Return(\textit{MP\_OKAY}). \\ | |
4961 \hline | |
4962 \end{tabular} | |
4963 \end{center} | |
4964 \end{small} | |
4965 \caption{Algorithm fast\_s\_mp\_sqr} | |
4966 \end{figure} | |
4967 | |
4968 \textbf{Algorithm fast\_s\_mp\_sqr.} | |
4969 This algorithm computes the square of an input using the Comba technique. It is designed to be a replacement for algorithm s\_mp\_sqr when | |
4970 the result requires fewer than \textbf{MP\_WARRAY} digits and the input has fewer than $\delta \over 2$ digits. | |
4971 | |
4972 This routine requires two arrays of mp\_words to be placed on the stack. The first array $\hat W$ will hold the double products and the second | |
4973 array $\hat X$ will hold the squares. Though only at most $MP\_WARRAY \over 2$ words of $\hat X$ are used, it has proven faster on most | |
4974 processors to simply make it a full size array. | |
4975 | |
4976 The loop on step 3 will zero the two arrays to prepare them for the squaring step. Step 4.1 computes the square of each digit. Note how | |
4977 it simply assigns the value into the $\hat X$ array, since each column receives at most one square. The nested loop on step 4.2 | |
4978 accumulates in $\hat W$ the sum of the products $a_{ix}a_{iy}$ for each column. They are not doubled until later. | |
4979 | |
4980 After the squaring loop, the products stored in $\hat W$ must be doubled and the carries propagated forwards. It makes sense to do both | |
4981 operations at the same time. The expression $\hat W_{ix} \leftarrow 2 \hat W_{ix} + \hat {X}_{ix}$ computes the sum of the double product and the | |
4982 squares in place. | |
4983 | |
4984 \vspace{+3mm}\begin{small} | |
4985 \hspace{-5.1mm}{\bf File}: bn\_fast\_s\_mp\_sqr.c | |
4986 \vspace{-3mm} | |
4987 \begin{alltt} | |
4988 016 | |
4989 017 /* fast squaring | |
4990 018 * | |
4991 019 * This is the comba method where the columns of the product | |
4992 020 * are computed first then the carries are computed. This | |
4993 021 * has the effect of making a very simple inner loop that | |
4994 022 * is executed the most | |
4995 023 * | |
4996 024 * W2 represents the outer products and W the inner. | |
4997 025 * | |
4998 026 * A further optimizations is made because the inner | |
4999 027 * products are of the form "A * B * 2". The *2 part does | |
5000 028 * not need to be computed until the end which is good | |
5001 029 * because 64-bit shifts are slow! | |
5002 030 * | |
5003 031 * Based on Algorithm 14.16 on pp.597 of HAC. | |
5004 032 * | |
5005 033 */ | |
5006 034 int fast_s_mp_sqr (mp_int * a, mp_int * b) | |
5007 035 \{ | |
5008 036 int olduse, newused, res, ix, pa; | |
5009 037 mp_word W2[MP_WARRAY], W[MP_WARRAY]; | |
5010 038 | |
5011 039 /* calculate size of product and allocate as required */ | |
5012 040 pa = a->used; | |
5013 041 newused = pa + pa + 1; | |
5014 042 if (b->alloc < newused) \{ | |
5015 043 if ((res = mp_grow (b, newused)) != MP_OKAY) \{ | |
5016 044 return res; | |
5017 045 \} | |
5018 046 \} | |
5019 047 | |
5020 048 /* zero temp buffer (columns) | |
5021 049 * Note that there are two buffers. Since squaring requires | |
5022 050 * a outer and inner product and the inner product requires | |
5023 051 * computing a product and doubling it (a relatively expensive | |
5024 052 * op to perform n**2 times if you don't have to) the inner and | |
5025 053 * outer products are computed in different buffers. This way | |
5026 054 * the inner product can be doubled using n doublings instead of | |
5027 055 * n**2 | |
5028 056 */ | |
5029 057 memset (W, 0, newused * sizeof (mp_word)); | |
5030 058 memset (W2, 0, newused * sizeof (mp_word)); | |
5031 059 | |
5032 060 /* This computes the inner product. To simplify the inner N**2 loop | |
5033 061 * the multiplication by two is done afterwards in the N loop. | |
5034 062 */ | |
5035 063 for (ix = 0; ix < pa; ix++) \{ | |
5036 064 /* compute the outer product | |
5037 065 * | |
5038 066 * Note that every outer product is computed | |
5039 067 * for a particular column only once which means that | |
5040 068 * there is no need todo a double precision addition | |
5041 069 * into the W2[] array. | |
5042 070 */ | |
5043 071 W2[ix + ix] = ((mp_word)a->dp[ix]) * ((mp_word)a->dp[ix]); | |
5044 072 | |
5045 073 \{ | |
5046 074 register mp_digit tmpx, *tmpy; | |
5047 075 register mp_word *_W; | |
5048 076 register int iy; | |
5049 077 | |
5050 078 /* copy of left side */ | |
5051 079 tmpx = a->dp[ix]; | |
5052 080 | |
5053 081 /* alias for right side */ | |
5054 082 tmpy = a->dp + (ix + 1); | |
5055 083 | |
5056 084 /* the column to store the result in */ | |
5057 085 _W = W + (ix + ix + 1); | |
5058 086 | |
5059 087 /* inner products */ | |
5060 088 for (iy = ix + 1; iy < pa; iy++) \{ | |
5061 089 *_W++ += ((mp_word)tmpx) * ((mp_word)*tmpy++); | |
5062 090 \} | |
5063 091 \} | |
5064 092 \} | |
5065 093 | |
5066 094 /* setup dest */ | |
5067 095 olduse = b->used; | |
5068 096 b->used = newused; | |
5069 097 | |
5070 098 /* now compute digits | |
5071 099 * | |
5072 100 * We have to double the inner product sums, add in the | |
5073 101 * outer product sums, propagate carries and convert | |
5074 102 * to single precision. | |
5075 103 */ | |
5076 104 \{ | |
5077 105 register mp_digit *tmpb; | |
5078 106 | |
5079 107 /* double first value, since the inner products are | |
5080 108 * half of what they should be | |
5081 109 */ | |
5082 110 W[0] += W[0] + W2[0]; | |
5083 111 | |
5084 112 tmpb = b->dp; | |
5085 113 for (ix = 1; ix < newused; ix++) \{ | |
5086 114 /* double/add next digit */ | |
5087 115 W[ix] += W[ix] + W2[ix]; | |
5088 116 | |
5089 117 /* propagate carry forwards [from the previous digit] */ | |
5090 118 W[ix] = W[ix] + (W[ix - 1] >> ((mp_word) DIGIT_BIT)); | |
5091 119 | |
5092 120 /* store the current digit now that the carry isn't | |
5093 121 * needed | |
5094 122 */ | |
5095 123 *tmpb++ = (mp_digit) (W[ix - 1] & ((mp_word) MP_MASK)); | |
5096 124 \} | |
5097 125 /* set the last value. Note even if the carry is zero | |
5098 126 * this is required since the next step will not zero | |
5099 127 * it if b originally had a value at b->dp[2*a.used] | |
5100 128 */ | |
5101 129 *tmpb++ = (mp_digit) (W[(newused) - 1] & ((mp_word) MP_MASK)); | |
5102 130 | |
5103 131 /* clear high digits of b if there were any originally */ | |
5104 132 for (; ix < olduse; ix++) \{ | |
5105 133 *tmpb++ = 0; | |
5106 134 \} | |
5107 135 \} | |
5108 136 | |
5109 137 mp_clamp (b); | |
5110 138 return MP_OKAY; | |
5111 139 \} | |
5112 \end{alltt} | |
5113 \end{small} | |
5114 | |
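5115 In the implementation the array $W$ plays the role of $\hat W$ (the double products) and $W2$ plays the role of $\hat X$ (the squares). Both arrays are zeroed on lines 57 and 58. Line 71 stores the square directly, and the inner loop on lines 88 to 90 accumulates the undoubled products through the aliases \_W and tmpy. The doubling, the addition of the squares and the carry propagation are deferred to line 110 and the loop on lines 113 to 124, so each is performed only $O(n)$ times. | |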
5116 | |
5117 \subsection{Polynomial Basis Squaring} | |
5118 The same algorithm that performs optimal polynomial basis multiplication can be used to perform polynomial basis squaring. The minor exception | |
5119 is that $\zeta_y = f(y)g(y)$ is actually equivalent to $\zeta_y = f(y)^2$ since $f(y) = g(y)$. The $2n + 1$ | |
5120 multiplications used to find the $\zeta$ relations are therefore replaced by $2n + 1$ squarings. | |
5121 | |
5122 \subsection{Karatsuba Squaring} | |
5123 Let $f(x) = ax + b$ represent the polynomial basis representation of a number to square. | |
5124 Let $h(x) = \left ( f(x) \right )^2$ represent the square of the polynomial. The Karatsuba equation can be modified to square a | |
5125 number with the following equation. | |
5126 | |
5127 \begin{equation} | |
5128 h(x) = a^2x^2 + \left (a^2 + b^2 - (a - b)^2 \right )x + b^2 | |
5129 \end{equation} | |
5130 | |
5131 Upon closer inspection this equation only requires the calculation of three half-sized squares: $a^2$, $b^2$ and $(a - b)^2$. As in | |
5132 Karatsuba multiplication, this algorithm can be applied recursively on the input and will achieve an asymptotic running time of | |
5133 $O \left ( n^{lg(3)} \right )$. | |
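
As a small decimal illustration, with numbers chosen here purely for convenience, let $x = 100$ and square $1234 = 12x + 34$ so that $a = 12$ and $b = 34$. The three half-sized squares are $a^2 = 144$, $b^2 = 1156$ and $(a - b)^2 = (-22)^2 = 484$. The middle coefficient is $144 + 1156 - 484 = 816 = 2 \cdot 12 \cdot 34$, and

\begin{center}
$h(100) = 144 \cdot 10000 + 816 \cdot 100 + 1156 = 1440000 + 81600 + 1156 = 1522756 = 1234^2$
\end{center}

No product of two different values was needed, only three squarings, two additions, one subtraction and the shifts by powers of $x$.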
5134 | |
5135 If the asymptotic times of Karatsuba squaring and multiplication are the same, why not simply use the multiplication algorithm | |
5136 instead? The answer to this arises from the cutoff point for squaring. As in multiplication there exists a cutoff point, at which the | |
5137 time required for a Comba based squaring and a Karatsuba based squaring meet. Due to the overhead inherent in the Karatsuba method, the cutoff | |
5138 point is fairly high. For example, on an AMD Athlon XP processor with $\beta = 2^{28}$, the cutoff point is around 127 digits. | |
5139 | |
5140 Consider squaring a 200 digit number with this technique. It will be split into two 100 digit halves which are subsequently squared. | |
5141 The 100 digit halves will not be squared using Karatsuba, but instead using the faster Comba based squaring algorithm. If Karatsuba multiplication | |
5142 were used instead, the 100 digit numbers would be squared with a slower Comba based multiplication. | |
5143 | |
5144 \newpage\begin{figure}[!here] | |
5145 \begin{small} | |
5146 \begin{center} | |
5147 \begin{tabular}{l} | |
5148 \hline Algorithm \textbf{mp\_karatsuba\_sqr}. \\ | |
5149 \textbf{Input}. mp\_int $a$ \\ | |
5150 \textbf{Output}. $b \leftarrow a^2$ \\ | |
5151 \hline \\ | |
5152 1. Initialize the following temporary mp\_ints: $x0$, $x1$, $t1$, $t2$, $x0x0$ and $x1x1$. \\ | |
5153 2. If any of the initializations on step 1 failed return(\textit{MP\_MEM}). \\ | |
5154 \\ | |
5155 Split the input. e.g. $a = x1\beta^B + x0$ \\ | |
5156 3. $B \leftarrow \lfloor a.used / 2 \rfloor$ \\ | |
5157 4. $x0 \leftarrow a \mbox{ (mod }\beta^B\mbox{)}$ (\textit{mp\_mod\_2d}) \\ | |
5158 5. $x1 \leftarrow \lfloor a / \beta^B \rfloor$ (\textit{mp\_rshd}) \\ | |
5159 \\ | |
5160 Calculate the three squares. \\ | |
5161 6. $x0x0 \leftarrow x0^2$ (\textit{mp\_sqr}) \\ | |
5162 7. $x1x1 \leftarrow x1^2$ \\ | |
5163 8. $t1 \leftarrow x1 - x0$ (\textit{mp\_sub}) \\ | |
5164 9. $t1 \leftarrow t1^2$ \\ | |
5165 \\ | |
5166 Compute the middle term. \\ | |
5167 10. $t2 \leftarrow x0x0 + x1x1$ (\textit{s\_mp\_add}) \\ | |
5168 11. $t1 \leftarrow t2 - t1$ \\ | |
5169 \\ | |
5170 Compute final product. \\ | |
5171 12. $t1 \leftarrow t1\beta^B$ (\textit{mp\_lshd}) \\ | |
5172 13. $x1x1 \leftarrow x1x1\beta^{2B}$ \\ | |
5173 14. $t1 \leftarrow t1 + x0x0$ \\ | |
5174 15. $b \leftarrow t1 + x1x1$ \\ | |
5175 16. Return(\textit{MP\_OKAY}). \\ | |
5176 \hline | |
5177 \end{tabular} | |
5178 \end{center} | |
5179 \end{small} | |
5180 \caption{Algorithm mp\_karatsuba\_sqr} | |
5181 \end{figure} | |
5182 | |
5183 \textbf{Algorithm mp\_karatsuba\_sqr.} | |
5184 This algorithm computes the square of an input $a$ using the Karatsuba technique. This algorithm is very similar to the Karatsuba based | |
5185 multiplication algorithm with the exception that the three half-size multiplications have been replaced with three half-size squarings. | |
5186 | |
5187 The radix point for squaring is simply placed exactly in the middle of the digits when the input has an even number of digits, otherwise it is | |
5188 placed just below the middle. Step 3, 4 and 5 compute the two halves required using $B$ | |
5189 as the radix point. The first two squares in steps 6 and 7 are rather straightforward while the last square is of a more compact form. | |
5190 | |
5191 By expanding $\left (x1 - x0 \right )^2$, the $x1^2$ and $x0^2$ terms in the middle disappear, that is $x1^2 + x0^2 - (x1 - x0)^2 = 2 \cdot x0 \cdot x1$. | |
5192 Now if $5n$ single precision additions and the squaring of an $n$-digit number are together faster than multiplying two $n$-digit numbers and doubling, then | |
5193 this method is faster. Assuming no further recursions occur, the difference can be estimated with the following inequality. | |
5194 | |
5195 Let $p$ represent the cost of a single precision addition and $q$ the cost of a single precision multiplication both in terms of time\footnote{Or | |
5196 machine clock cycles.}. | |
5197 | |
5198 \begin{equation} | |
5199 5pn +{{q(n^2 + n)} \over 2} \le pn + qn^2 | |
5200 \end{equation} | |
5201 | |
5202 For example, on an AMD Athlon XP processor $p = {1 \over 3}$ and $q = 6$. This implies that the following inequality should hold. | |
5203 \begin{center} | |
5204 \begin{tabular}{rcl} | |
5205 ${5n \over 3} + 3n^2 + 3n$ & $<$ & ${n \over 3} + 6n^2$ \\ | |
5206 ${5 \over 3} + 3n + 3$ & $<$ & ${1 \over 3} + 6n$ \\ | |
5207 ${13 \over 9}$ & $<$ & $n$ \\ | |
5208 \end{tabular} | |
5209 \end{center} | |
5210 | |
5211 This results in a cutoff point around $n = 2$. As a consequence it is actually faster to compute the middle term the ``long way'' on processors | |
5212 where multiplication is substantially slower\footnote{On the Athlon there is a 1:17 ratio between clock cycles for addition and multiplication. On | |
5213 the Intel P4 processor this ratio is 1:29 making this method even more beneficial. The only common exception is the ARMv4 processor which has a | |
5214 ratio of 1:7. } than simpler operations such as addition. | |
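
The same estimate can be carried out symbolically. This is only a rearrangement of the cost model above, under the same assumption that no further recursion occurs. Dividing $5pn + {{q(n^2 + n)} \over 2} \le pn + qn^2$ by $n > 0$ and collecting terms gives

\begin{center}
\begin{tabular}{rcl}
$4p + {q \over 2}$ & $\le$ & ${qn \over 2}$ \\
$1 + {8p \over q}$ & $\le$ & $n$ \\
\end{tabular}
\end{center}

With $p = {1 \over 3}$ and $q = 6$ the left hand side is $1 + {8 \over 18} = {13 \over 9}$, in agreement with the table. The cutoff depends only on the ratio of $p$ to $q$, and it approaches $n = 1$ as multiplication becomes more expensive relative to addition.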
5215 | |
5216 \vspace{+3mm}\begin{small} | |
5217 \hspace{-5.1mm}{\bf File}: bn\_mp\_karatsuba\_sqr.c | |
5218 \vspace{-3mm} | |
5219 \begin{alltt} | |
5220 016 | |
5221 017 /* Karatsuba squaring, computes b = a*a using three | |
5222 018 * half size squarings | |
5223 019 * | |
5224 020 * See comments of mp_karatsuba_mul for details. It | |
5225 021 * is essentially the same algorithm but merely | |
5226 022 * tuned to perform recursive squarings. | |
5227 023 */ | |
5228 024 int mp_karatsuba_sqr (mp_int * a, mp_int * b) | |
5229 025 \{ | |
5230 026 mp_int x0, x1, t1, t2, x0x0, x1x1; | |
5231 027 int B, err; | |
5232 028 | |
5233 029 err = MP_MEM; | |
5234 030 | |
5235 031 /* min # of digits */ | |
5236 032 B = a->used; | |
5237 033 | |
5238 034 /* now divide in two */ | |
5239 035 B = B >> 1; | |
5240 036 | |
5241 037 /* init copy all the temps */ | |
5242 038 if (mp_init_size (&x0, B) != MP_OKAY) | |
5243 039 goto ERR; | |
5244 040 if (mp_init_size (&x1, a->used - B) != MP_OKAY) | |
5245 041 goto X0; | |
5246 042 | |
5247 043 /* init temps */ | |
5248 044 if (mp_init_size (&t1, a->used * 2) != MP_OKAY) | |
5249 045 goto X1; | |
5250 046 if (mp_init_size (&t2, a->used * 2) != MP_OKAY) | |
5251 047 goto T1; | |
5252 048 if (mp_init_size (&x0x0, B * 2) != MP_OKAY) | |
5253 049 goto T2; | |
5254 050 if (mp_init_size (&x1x1, (a->used - B) * 2) != MP_OKAY) | |
5255 051 goto X0X0; | |
5256 052 | |
5257 053 \{ | |
5258 054 register int x; | |
5259 055 register mp_digit *dst, *src; | |
5260 056 | |
5261 057 src = a->dp; | |
5262 058 | |
5263 059 /* now shift the digits */ | |
5264 060 dst = x0.dp; | |
5265 061 for (x = 0; x < B; x++) \{ | |
5266 062 *dst++ = *src++; | |
5267 063 \} | |
5268 064 | |
5269 065 dst = x1.dp; | |
5270 066 for (x = B; x < a->used; x++) \{ | |
5271 067 *dst++ = *src++; | |
5272 068 \} | |
5273 069 \} | |
5274 070 | |
5275 071 x0.used = B; | |
5276 072 x1.used = a->used - B; | |
5277 073 | |
5278 074 mp_clamp (&x0); | |
5279 075 | |
5280 076 /* now calc the products x0*x0 and x1*x1 */ | |
5281 077 if (mp_sqr (&x0, &x0x0) != MP_OKAY) | |
5282 078 goto X1X1; /* x0x0 = x0*x0 */ | |
5283 079 if (mp_sqr (&x1, &x1x1) != MP_OKAY) | |
5284 080 goto X1X1; /* x1x1 = x1*x1 */ | |
5285 081 | |
5286 082 /* now calc (x1-x0)**2 */ | |
5287 083 if (mp_sub (&x1, &x0, &t1) != MP_OKAY) | |
5288 084 goto X1X1; /* t1 = x1 - x0 */ | |
5289 085 if (mp_sqr (&t1, &t1) != MP_OKAY) | |
5290 086 goto X1X1; /* t1 = (x1 - x0) * (x1 - x0) */ | |
5291 087 | |
5292 088 /* add x0y0 */ | |
5293 089 if (s_mp_add (&x0x0, &x1x1, &t2) != MP_OKAY) | |
5294 090 goto X1X1; /* t2 = x0x0 + x1x1 */ | |
5295 091 if (mp_sub (&t2, &t1, &t1) != MP_OKAY) | |
5296 092 goto X1X1; /* t1 = x0x0 + x1x1 - (x1-x0)*(x1-x0) */ | |
5297 093 | |
5298 094 /* shift by B */ | |
5299 095 if (mp_lshd (&t1, B) != MP_OKAY) | |
5300 096 goto X1X1; /* t1 = (x0x0 + x1x1 - (x1-x0)*(x1-x0))<<B */ | |
5301 097 if (mp_lshd (&x1x1, B * 2) != MP_OKAY) | |
5302 098 goto X1X1; /* x1x1 = x1x1 << 2*B */ | |
5303 099 | |
5304 100 if (mp_add (&x0x0, &t1, &t1) != MP_OKAY) | |
5305 101 goto X1X1; /* t1 = x0x0 + t1 */ | |
5306 102 if (mp_add (&t1, &x1x1, b) != MP_OKAY) | |
5307 103 goto X1X1; /* t1 = x0x0 + t1 + x1x1 */ | |
5308 104 | |
5309 105 err = MP_OKAY; | |
5310 106 | |
5311 107 X1X1:mp_clear (&x1x1); | |
5312 108 X0X0:mp_clear (&x0x0); | |
5313 109 T2:mp_clear (&t2); | |
5314 110 T1:mp_clear (&t1); | |
5315 111 X1:mp_clear (&x1); | |
5316 112 X0:mp_clear (&x0); | |
5317 113 ERR: | |
5318 114 return err; | |
5319 115 \} | |
5320 \end{alltt} | |
5321 \end{small} | |
5322 | |
5323 This implementation is largely based on the implementation of algorithm mp\_karatsuba\_mul. It uses the same inline style to copy and | |
5324 shift the input into the two halves. The loop from line 53 to line 69 has been modified since only one input exists. The \textbf{used} | |
5325 count of both $x0$ and $x1$ is fixed up and $x0$ is clamped before the calculations begin. At this point $x1$ and $x0$ are valid equivalents | |
5326 to the respective halves as if mp\_rshd and mp\_mod\_2d had been used. | |
5327 | |
5328 By inlining the copy and shift operations the cutoff point for Karatsuba squaring can be lowered. On the Athlon the cutoff point | |
5329 is exactly at the point where Comba squaring can no longer be used (\textit{128 digits}). On slower processors such as the Intel P4 | |
5330 it is actually below the Comba limit (\textit{at 110 digits}). | |
5331 | |
5332 This routine uses the same error trap coding style as mp\_karatsuba\_mul. The error code is preset to \textbf{MP\_MEM} and the temporaries are initialized one at a time. If an initialization fails, control jumps to the label | |
5333 that clears only the temporaries initialized so far. A failure in any later step jumps to the first label, which clears all six. If the algorithm completes without error the error code is set to \textbf{MP\_OKAY} on line 105 and execution falls through the same mp\_clear calls. | |
5336 | |
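
The error trap style can be shown in isolation with a small sketch. The names toy\_init, toy\_clear, toy\_trap and toy\_live are hypothetical stand-ins invented for this example (for mp\_init\_size, mp\_clear and the routine itself); the fail\_at parameter exists only so that each path can be exercised.

```c
#include <assert.h>
#include <stdlib.h>

static int toy_live = 0;   /* number of temporaries currently allocated */

/* Hypothetical stand-in for mp_init_size that can be told to fail. */
static void *toy_init(size_t n, int fail)
{
    void *p = fail ? NULL : malloc(n);
    if (p != NULL) {
        toy_live++;
    }
    return p;
}

/* Hypothetical stand-in for mp_clear. */
static void toy_clear(void *p)
{
    assert(p != NULL);
    free(p);
    toy_live--;
}

/* fail_at selects the step that fails: 1..3 an init, 4 the work, 0 none. */
static int toy_trap(int fail_at)
{
    void *x0, *x1, *t1;
    int err = -1;                       /* pessimistic default, like MP_MEM */

    if ((x0 = toy_init(16, fail_at == 1)) == NULL) goto ERR;
    if ((x1 = toy_init(16, fail_at == 2)) == NULL) goto X0;
    if ((t1 = toy_init(16, fail_at == 3)) == NULL) goto X1;

    if (fail_at == 4) goto T1;          /* simulated failure in the work */

    err = 0;                            /* success, like MP_OKAY */

    /* labels are in reverse order of initialization, so a jump clears
     * exactly the temporaries that exist at that point */
T1: toy_clear(t1);
X1: toy_clear(x1);
X0: toy_clear(x0);
ERR:
    return err;
}
```

Whichever step fails, every temporary that was successfully initialized is cleared exactly once, and the success path runs through the same clean up code.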
5337 \subsection{Toom-Cook Squaring} | |
5338 The Toom-Cook squaring algorithm mp\_toom\_sqr is heavily based on the algorithm mp\_toom\_mul with the exception that squarings are used | |
5339 instead of multiplication to find the five relations. The reader is encouraged to read the description of the latter algorithm and try to | |
5340 derive their own Toom-Cook squaring algorithm. | |
5341 | |
5342 \subsection{High Level Squaring} | |
5343 \newpage\begin{figure}[!here] | |
5344 \begin{small} | |
5345 \begin{center} | |
5346 \begin{tabular}{l} | |
5347 \hline Algorithm \textbf{mp\_sqr}. \\ | |
5348 \textbf{Input}. mp\_int $a$ \\ | |
5349 \textbf{Output}. $b \leftarrow a^2$ \\ | |
5350 \hline \\ | |
5351 1. If $a.used \ge TOOM\_SQR\_CUTOFF$ then \\ | |
5352 \hspace{3mm}1.1 $b \leftarrow a^2$ using algorithm mp\_toom\_sqr \\ | |
5353 2. else if $a.used \ge KARATSUBA\_SQR\_CUTOFF$ then \\ | |
5354 \hspace{3mm}2.1 $b \leftarrow a^2$ using algorithm mp\_karatsuba\_sqr \\ | |
5355 3. else \\ | |
\hspace{3mm}3.1 $digs \leftarrow 2 \cdot a.used + 1$ \\
\hspace{3mm}3.2 If $digs < MP\_WARRAY$ and $a.used \le \delta$ then \\
5358 \hspace{6mm}3.2.1 $b \leftarrow a^2$ using algorithm fast\_s\_mp\_sqr. \\ | |
5359 \hspace{3mm}3.3 else \\ | |
5360 \hspace{6mm}3.3.1 $b \leftarrow a^2$ using algorithm s\_mp\_sqr. \\ | |
5361 4. $b.sign \leftarrow MP\_ZPOS$ \\ | |
5362 5. Return the result of the unsigned squaring performed. \\ | |
5363 \hline | |
5364 \end{tabular} | |
5365 \end{center} | |
5366 \end{small} | |
5367 \caption{Algorithm mp\_sqr} | |
5368 \end{figure} | |
5369 | |
5370 \textbf{Algorithm mp\_sqr.} | |
5371 This algorithm computes the square of the input using one of four different algorithms. If the input is very large and has at least | |
5372 \textbf{TOOM\_SQR\_CUTOFF} or \textbf{KARATSUBA\_SQR\_CUTOFF} digits then either the Toom-Cook or the Karatsuba Squaring algorithm is used. If | |
5373 neither of the polynomial basis algorithms should be used then either the Comba or baseline algorithm is used. | |
5374 | |
5375 \vspace{+3mm}\begin{small} | |
5376 \hspace{-5.1mm}{\bf File}: bn\_mp\_sqr.c | |
5377 \vspace{-3mm} | |
5378 \begin{alltt} | |
5379 016 | |
5380 017 /* computes b = a*a */ | |
5381 018 int | |
5382 019 mp_sqr (mp_int * a, mp_int * b) | |
5383 020 \{ | |
5384 021 int res; | |
5385 022 | |
5386 023 /* use Toom-Cook? */ | |
5387 024 if (a->used >= TOOM_SQR_CUTOFF) \{ | |
5388 025 res = mp_toom_sqr(a, b); | |
5389 026 /* Karatsuba? */ | |
5390 027 \} else if (a->used >= KARATSUBA_SQR_CUTOFF) \{ | |
5391 028 res = mp_karatsuba_sqr (a, b); | |
5392 029 \} else \{ | |
5393 030 /* can we use the fast comba multiplier? */ | |
5394 031 if ((a->used * 2 + 1) < MP_WARRAY && | |
5395 032 a->used < | |
5396 033 (1 << (sizeof(mp_word) * CHAR_BIT - 2*DIGIT_BIT - 1))) \{ | |
5397 034 res = fast_s_mp_sqr (a, b); | |
5398 035 \} else \{ | |
5399 036 res = s_mp_sqr (a, b); | |
5400 037 \} | |
5401 038 \} | |
5402 039 b->sign = MP_ZPOS; | |
5403 040 return res; | |
5404 041 \} | |
5405 \end{alltt} | |
5406 \end{small} | |
5407 | |
5408 \section*{Exercises} | |
5409 \begin{tabular}{cl} | |
5410 $\left [ 3 \right ] $ & Devise an efficient algorithm for selection of the radix point to handle inputs \\ | |
5411 & that have different number of digits in Karatsuba multiplication. \\ | |
5412 & \\ | |
5413 $\left [ 3 \right ] $ & In section 5.3 the fact that every column of a squaring is made up \\ | |
5414 & of double products and at most one square is stated. Prove this statement. \\ | |
5415 & \\ | |
5416 $\left [ 2 \right ] $ & In the Comba squaring algorithm half of the $\hat X$ variables are not used. \\ | |
5417 & Revise algorithm fast\_s\_mp\_sqr to shrink the $\hat X$ array. \\ | |
5418 & \\ | |
5419 $\left [ 3 \right ] $ & Prove the equation for Karatsuba squaring. \\ | |
5420 & \\ | |
5421 $\left [ 1 \right ] $ & Prove that Karatsuba squaring requires $O \left (n^{lg(3)} \right )$ time. \\ | |
5422 & \\ | |
5423 $\left [ 2 \right ] $ & Determine the minimal ratio between addition and multiplication clock cycles \\ | |
5424 & required for equation $6.7$ to be true. \\ | |
5425 & \\ | |
5426 \end{tabular} | |
5427 | |
5428 \chapter{Modular Reduction} | |
5429 \section{Basics of Modular Reduction} | |
5430 \index{modular residue} | |
5431 Modular reduction is an operation that arises quite often within public key cryptography algorithms and various number theoretic algorithms, | |
5432 such as factoring. Modular reduction algorithms are the third class of algorithms of the ``multipliers'' set. A number $a$ is said to be \textit{reduced} | |
5433 modulo another number $b$ by finding the remainder of the division $a/b$. Full integer division with remainder is a topic to be covered | |
5434 in~\ref{sec:division}. | |
5435 | |
Modular reduction is equivalent to solving for $r$ in the equation $a = bq + r$ where $q = \lfloor a/b \rfloor$. The result
5437 $r$ is said to be ``congruent to $a$ modulo $b$'' which is also written as $r \equiv a \mbox{ (mod }b\mbox{)}$. In other vernacular $r$ is known as the | |
5438 ``modular residue'' which leads to ``quadratic residue''\footnote{That's fancy talk for $b \equiv a^2 \mbox{ (mod }p\mbox{)}$.} and | |
5439 other forms of residues. | |
5440 | |
5441 Modular reductions are normally used to create either finite groups, rings or fields. The most common usage for performance driven modular reductions | |
5442 is in modular exponentiation algorithms. That is to compute $d = a^b \mbox{ (mod }c\mbox{)}$ as fast as possible. This operation is used in the | |
RSA and Diffie-Hellman public key algorithms, for example. Modular multiplication and squaring also appear as fundamental operations in
5444 Elliptic Curve cryptographic algorithms. As will be discussed in the subsequent chapter there exist fast algorithms for computing modular | |
5445 exponentiations without having to perform (\textit{in this example}) $b - 1$ multiplications. These algorithms will produce partial results in the | |
range $0 \le x < c^2$ which can be taken advantage of to create several efficient algorithms. Modular reductions have also been used to create
redundancy check algorithms known as CRCs and error correction codes such as Reed-Solomon, and to solve a variety of number theoretic problems.
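
To make the cost of the straightforward approach concrete, the following is a minimal C sketch of exponentiation by repeated multiplication. It is
not part of LibTomMath. The name modexp\_naive and the restriction to moduli below $2^{32}$ are assumptions made so that every partial product fits
in a 64-bit word.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: computes a^b mod c with b - 1 multiplications,
 * reducing after every product.  Assumes b >= 1 and 1 < c < 2^32. */
uint64_t modexp_naive(uint64_t a, uint64_t b, uint64_t c)
{
    assert(b >= 1 && c > 1 && c < ((uint64_t)1 << 32));

    uint64_t base = a % c;
    uint64_t d = base;                 /* d = a^1 mod c */
    for (uint64_t i = 1; i < b; i++) {
        /* the product lies in 0 <= x < c^2 before it is reduced */
        d = (d * base) % c;
    }
    return d;
}
```

For example, modexp\_naive(4, 13, 497) performs twelve multiplications and returns $445$. Each of those products is followed by a reduction, which
is why a fast reduction algorithm matters.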
5448 | |
5449 \section{The Barrett Reduction} | |
5450 The Barrett reduction algorithm \cite{BARRETT} was inspired by fast division algorithms which multiply by the reciprocal to emulate | |
division. Barrett's observation was that the residue $c$ of $a$ modulo $b$ is equal to
5452 | |
5453 \begin{equation} | |
5454 c = a - b \cdot \lfloor a/b \rfloor | |
5455 \end{equation} | |
5456 | |
5457 Since algorithms such as modular exponentiation would be using the same modulus extensively, typical DSP\footnote{It is worth noting that Barrett's paper | |
5458 targeted the DSP56K processor.} intuition would indicate the next step would be to replace $a/b$ by a multiplication by the reciprocal. However, | |
5459 DSP intuition on its own will not work as these numbers are considerably larger than the precision of common DSP floating point data types. | |
5460 It would take another common optimization to optimize the algorithm. | |
5461 | |
5462 \subsection{Fixed Point Arithmetic} | |
5463 The trick used to optimize the above equation is based on a technique of emulating floating point data types with fixed precision integers. Fixed | |
point arithmetic became very popular as it greatly optimized the ``3d-shooter'' genre of games in the mid 1990s when floating point units were
fairly slow if not unavailable. The idea behind fixed point arithmetic is to take a normal $k$-bit integer data type and break it into a $p$-bit
integer part and a $q$-bit fraction part (\textit{where $p+q = k$}).
5467 | |
5468 In this system a $k$-bit integer $n$ would actually represent $n/2^q$. For example, with $q = 4$ the integer $n = 37$ would actually represent the | |
5469 value $2.3125$. To multiply two fixed point numbers the integers are multiplied using traditional arithmetic and subsequently normalized by | |
5470 moving the implied decimal point back to where it should be. For example, with $q = 4$ to multiply the integers $9$ and $5$ they must be converted | |
5471 to fixed point first by multiplying by $2^q$. Let $a = 9(2^q)$ represent the fixed point representation of $9$ and $b = 5(2^q)$ represent the | |
5472 fixed point representation of $5$. The product $ab$ is equal to $45(2^{2q})$ which when normalized by dividing by $2^q$ produces $45(2^q)$. | |
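
The multiplication just described can be sketched in C as follows. This is only an illustration under the assumption that $q = 4$ and that the
values fit in 32 bits. The names FRAC\_BITS, fx\_from\_int and fx\_mul are hypothetical and do not appear in LibTomMath.

```c
#include <assert.h>
#include <stdint.h>

#define FRAC_BITS 4   /* q in the text; an illustrative choice */

/* Convert an integer to fixed point by multiplying by 2^q. */
uint32_t fx_from_int(uint32_t n)
{
    assert(n <= (UINT32_MAX >> FRAC_BITS));   /* must not overflow */
    return n << FRAC_BITS;
}

/* Multiply two fixed point values.  The raw product carries 2q fraction
 * bits, so it is shifted right by q to normalize it back to q bits. */
uint32_t fx_mul(uint32_t a, uint32_t b)
{
    uint64_t wide = (uint64_t)a * b;
    return (uint32_t)(wide >> FRAC_BITS);
}
```

With these helpers fx\_mul(fx\_from\_int(9), fx\_from\_int(5)) yields $720 = 45(2^4)$, matching the example above.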
5473 | |
5474 This technique became popular since a normal integer multiplication and logical shift right are the only required operations to perform a multiplication | |
5475 of two fixed point numbers. Using fixed point arithmetic, division can be easily approximated by multiplying by the reciprocal. If $2^q$ is | |
equivalent to one then $2^q/b$ is equivalent to the fixed point approximation of $1/b$ using real arithmetic. Using this fact dividing an integer
5477 $a$ by another integer $b$ can be achieved with the following expression. | |
5478 | |
5479 \begin{equation} | |
5480 \lfloor a / b \rfloor \mbox{ }\approx\mbox{ } \lfloor (a \cdot \lfloor 2^q / b \rfloor)/2^q \rfloor | |
5481 \end{equation} | |
5482 | |
The precision of the division is proportional to the value of $q$. If the divisor $b$ is used frequently, as is the case with
modular exponentiation, pre-computing $2^q/b$ will allow a division to be performed with a multiplication and a right shift. Both operations
5485 are considerably faster than division on most processors. | |
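
A minimal C sketch of the approximation in the equation above is given below. The helper names fx\_recip and fx\_div are hypothetical, and the
sketch assumes $q < 32$ and $a < 2^{32}$ so that the product $a \cdot \lfloor 2^q / b \rfloor$ fits in a 64-bit word.

```c
#include <assert.h>
#include <stdint.h>

/* floor(2^q / b): the fixed point reciprocal, meant to be computed once. */
uint64_t fx_recip(uint64_t b, unsigned q)
{
    assert(b > 0 && q < 32);
    return ((uint64_t)1 << q) / b;
}

/* Approximates floor(a / b) as floor((a * recip) / 2^q) using only a
 * multiplication and a right shift. */
uint64_t fx_div(uint64_t a, uint64_t recip, unsigned q)
{
    assert(a < ((uint64_t)1 << 32));   /* keeps a * recip within 64 bits */
    return (a * recip) >> q;
}
```

The reciprocal is passed in rather than recomputed so that the cost of the one true division is paid only once per divisor.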
5486 | |
5487 Consider dividing $19$ by $5$. The correct result is $\lfloor 19/5 \rfloor = 3$. With $q = 3$ the reciprocal is $\lfloor 2^q/5 \rfloor = 1$ which | |
5488 leads to a product of $19$ which when divided by $2^q$ produces $2$. However, with $q = 4$ the reciprocal is $\lfloor 2^q/5 \rfloor = 3$ and | |
5489 the result of the emulated division is $\lfloor 3 \cdot 19 / 2^q \rfloor = 3$ which is correct. The value of $2^q$ must be close to or ideally | |
5490 larger than the dividend. In effect if $a$ is the dividend then $q$ should allow $0 \le \lfloor a/2^q \rfloor \le 1$ in order for this approach | |
to work correctly. Plugging this form of division into the original equation the following modular residue equation arises.
5492 | |
5493 \begin{equation} | |
5494 c = a - b \cdot \lfloor (a \cdot \lfloor 2^q / b \rfloor)/2^q \rfloor | |
5495 \end{equation} | |
5496 | |
5497 Using the notation from \cite{BARRETT} the value of $\lfloor 2^q / b \rfloor$ will be represented by the $\mu$ symbol. Using the $\mu$ | |
5498 variable also helps re-inforce the idea that it is meant to be computed once and re-used. | |
5499 | |
5500 \begin{equation} | |
5501 c = a - b \cdot \lfloor (a \cdot \mu)/2^q \rfloor | |
5502 \end{equation} | |
5503 | |
5504 Provided that $2^q \ge a$ this algorithm will produce a quotient that is either exactly correct or off by a value of one. In the context of Barrett | |
5505 reduction the value of $a$ is bound by $0 \le a \le (b - 1)^2$ meaning that $2^q \ge b^2$ is sufficient to ensure the reciprocal will have enough | |
5506 precision. | |
5507 | |
5508 Let $n$ represent the number of digits in $b$. This algorithm requires approximately $2n^2$ single precision multiplications to produce the quotient and | |
5509 another $n^2$ single precision multiplications to find the residue. In total $3n^2$ single precision multiplications are required to | |
5510 reduce the number. | |
5511 | |
5512 For example, if $b = 1179677$ and $q = 41$ ($2^q > b^2$), then the reciprocal $\mu$ is equal to $\lfloor 2^q / b \rfloor = 1864089$. Consider reducing | |
5513 $a = 180388626447$ modulo $b$ using the above reduction equation. The quotient using the new formula is $\lfloor (a \cdot \mu) / 2^q \rfloor = 152913$. | |
5514 By subtracting $152913b$ from $a$ the correct residue $a \equiv 677346 \mbox{ (mod }b\mbox{)}$ is found. | |
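
The example above can be reproduced with the following C sketch of the reduction equation. It is a sketch under stated assumptions rather than the
LibTomMath routine. The names barrett\_mu and barrett\_reduce\_fx are hypothetical, and the operands must be small enough that $a \cdot \mu$ fits
in 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/* mu = floor(2^q / b), computed once per modulus. */
uint64_t barrett_mu(uint64_t b, unsigned q)
{
    assert(b > 1 && q < 64);
    return ((uint64_t)1 << q) / b;
}

/* c = a - b * floor((a * mu) / 2^q), followed by a correction because
 * the estimated quotient may be too small.  Assumes 2^q >= a. */
uint64_t barrett_reduce_fx(uint64_t a, uint64_t b, uint64_t mu, unsigned q)
{
    assert(b > 1);
    assert(mu == 0 || a <= UINT64_MAX / mu);   /* a * mu must not overflow */

    uint64_t quot = (a * mu) >> q;   /* estimated quotient, never too large */
    uint64_t c = a - b * quot;       /* candidate residue */
    while (c >= b) {                 /* fix up a low estimate */
        c -= b;
    }
    return c;
}
```

With $b = 1179677$ and $q = 41$, barrett\_mu returns $1864089$ and barrett\_reduce\_fx applied to $a = 180388626447$ returns $677346$.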
5515 | |
5516 \subsection{Choosing a Radix Point} | |
5517 Using the fixed point representation a modular reduction can be performed with $3n^2$ single precision multiplications. If that were the best | |
5518 that could be achieved a full division\footnote{A division requires approximately $O(2cn^2)$ single precision multiplications for a small value of $c$. | |
5519 See~\ref{sec:division} for further details.} might as well be used in its place. The key to optimizing the reduction is to reduce the precision of | |
5520 the initial multiplication that finds the quotient. | |
5521 | |
5522 Let $a$ represent the number of which the residue is sought. Let $b$ represent the modulus used to find the residue. Let $m$ represent | |
5523 the number of digits in $b$. For the purposes of this discussion we will assume that the number of digits in $a$ is $2m$, which is generally true if | |
5524 two $m$-digit numbers have been multiplied. Dividing $a$ by $b$ is the same as dividing a $2m$ digit integer by a $m$ digit integer. Digits below the | |
5525 $m - 1$'th digit of $a$ will contribute at most a value of $1$ to the quotient because $\beta^k < b$ for any $0 \le k \le m - 1$. Another way to | |
express this is by re-writing $a$ as two parts. If $a' \equiv a \mbox{ (mod }\beta^{m-1}\mbox{)}$ and $a'' = a - a'$ then
${a \over b} = {{a' + a''} \over b}$ which is equivalent to ${a' \over b} + {a'' \over b}$. Since $a' < \beta^{m-1} \le b$ the quotient
is bound by $0 \le {a' \over b} < 1$.
5529 | |
5530 Since the digits of $a'$ do not contribute much to the quotient the observation is that they might as well be zero. However, if the digits | |
5531 ``might as well be zero'' they might as well not be there in the first place. Let $q_0 = \lfloor a/\beta^{m-1} \rfloor$ represent the input | |
5532 with the irrelevant digits trimmed. Now the modular reduction is trimmed to the almost equivalent equation | |
5533 | |
5534 \begin{equation} | |
5535 c = a - b \cdot \lfloor (q_0 \cdot \mu) / \beta^{m+1} \rfloor | |
5536 \end{equation} | |
5537 | |
5538 Note that the original divisor $2^q$ has been replaced with $\beta^{m+1}$ where in this case $q$ is a multiple of $lg(\beta)$. Also note that the | |
5539 exponent on the divisor when added to the amount $q_0$ was shifted by equals $2m$. If the optimization had not been performed the divisor | |
5540 would have the exponent $2m$ so in the end the exponents do ``add up''. Using the above equation the quotient | |
5541 $\lfloor (q_0 \cdot \mu) / \beta^{m+1} \rfloor$ can be off from the true quotient by at most two. The original fixed point quotient can be off | |
by as much as one (\textit{provided the radix point is chosen suitably}) and now that the lower irrelevant digits have been trimmed the quotient
5543 can be off by an additional value of one for a total of at most two. This implies that | |
5544 $0 \le a - b \cdot \lfloor (q_0 \cdot \mu) / \beta^{m+1} \rfloor < 3b$. By first subtracting $b$ times the quotient and then conditionally subtracting | |
5545 $b$ once or twice the residue is found. | |
5546 | |
5547 The quotient is now found using $(m + 1)(m) = m^2 + m$ single precision multiplications and the residue with an additional $m^2$ single | |
5548 precision multiplications, ignoring the subtractions required. In total $2m^2 + m$ single precision multiplications are required to find the residue. | |
5549 This is considerably faster than the original attempt. | |
5550 | |
5551 For example, let $\beta = 10$ represent the radix of the digits. Let $b = 9999$ represent the modulus which implies $m = 4$. Let $a = 99929878$ | |
5552 represent the value of which the residue is desired. In this case $q = 8$ since $10^7 < 9999^2$ meaning that $\mu = \lfloor \beta^{q}/b \rfloor = 10001$. | |
5553 With the new observation the multiplicand for the quotient is equal to $q_0 = \lfloor a / \beta^{m - 1} \rfloor = 99929$. The quotient is then | |
$\lfloor (q_0 \cdot \mu) / \beta^{m+1} \rfloor = 9993$. Subtracting $9993b$ from $a$ yields the correct residue
$a \equiv 9871 \mbox{ (mod }b\mbox{)}$.
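
The trimmed reduction can be sketched in C for word-sized operands as follows. The names ipow and barrett\_reduce\_trim are hypothetical. The sketch
assumes $0 \le a < b^2$, that $\mu = \lfloor \beta^{2m}/b \rfloor$ has been computed beforehand, and that all intermediate values fit in 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/* beta^e for small exponents, where beta is the digit radix. */
static uint64_t ipow(uint64_t beta, unsigned e)
{
    uint64_t r = 1;
    while (e-- > 0) {
        r *= beta;
    }
    return r;
}

/* Reduces a modulo b, where b has m digits in radix beta and
 * mu = floor(beta^(2m) / b). */
uint64_t barrett_reduce_trim(uint64_t a, uint64_t b, uint64_t mu,
                             uint64_t beta, unsigned m)
{
    assert(b > 1 && m >= 1);

    uint64_t q0   = a / ipow(beta, m - 1);          /* drop the low m-1 digits */
    uint64_t quot = (q0 * mu) / ipow(beta, m + 1);  /* estimated quotient */
    uint64_t c    = a - b * quot;                   /* the text bounds this by 3b */

    while (c >= b) {                                /* conditional subtractions */
        c -= b;
    }
    return c;
}
```

Calling barrett\_reduce\_trim(99929878, 9999, 10001, 10, 4) returns $9871$ without entering the loop. The input $a = 9998^2 = 99960004$ is a case
where the estimated quotient is one too small and a single subtraction of $b$ is needed to reach the residue $1$.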
5556 | |
5557 \subsection{Trimming the Quotient} | |
5558 So far the reduction algorithm has been optimized from $3m^2$ single precision multiplications down to $2m^2 + m$ single precision multiplications. As | |
5559 it stands now the algorithm is already fairly fast compared to a full integer division algorithm. However, there is still room for | |
5560 optimization. | |
5561 | |
5562 After the first multiplication inside the quotient ($q_0 \cdot \mu$) the value is shifted right by $m + 1$ places effectively nullifying the lower | |
5563 half of the product. It would be nice to be able to remove those digits from the product to effectively cut down the number of single precision | |
5564 multiplications. If the number of digits in the modulus $m$ is far less than $\beta$ a full product is not required for the algorithm to work properly. | |
5565 In fact the lower $m - 2$ digits will not affect the upper half of the product at all and do not need to be computed. | |
5566 | |
5567 The value of $\mu$ is a $m$-digit number and $q_0$ is a $m + 1$ digit number. Using a full multiplier $(m + 1)(m) = m^2 + m$ single precision | |
5568 multiplications would be required. Using a multiplier that will only produce digits at and above the $m - 1$'th digit reduces the number | |
of single precision multiplications to ${m^2 + m} \over 2$.
5570 | |
5571 \subsection{Trimming the Residue} | |
5572 After the quotient has been calculated it is used to reduce the input. As previously noted the algorithm is not exact and it can be off by a small | |
multiple of the modulus, that is $0 \le a - b \cdot \lfloor (q_0 \cdot \mu) / \beta^{m+1} \rfloor < 3b$. If $b$ is $m$ digits then the
result of the reduction equation is a value of at most $m + 1$ digits (\textit{provided $3 < \beta$}) implying that the upper $m - 1$ digits are
5575 implicitly zero. | |
5576 | |
5577 The next optimization arises from this very fact. Instead of computing $b \cdot \lfloor (q_0 \cdot \mu) / \beta^{m+1} \rfloor$ using a full | |
5578 $O(m^2)$ multiplication algorithm only the lower $m+1$ digits of the product have to be computed. Similarly the value of $a$ can | |
be reduced modulo $\beta^{m+1}$ before the multiple of $b$ is subtracted which simplifies the subtraction as well. A multiplication that produces
5580 only the lower $m+1$ digits requires ${m^2 + 3m - 2} \over 2$ single precision multiplications. | |
5581 | |
5582 With both optimizations in place the algorithm is the algorithm Barrett proposed. It requires $m^2 + 2m - 1$ single precision multiplications which | |
5583 is considerably faster than the straightforward $3m^2$ method. | |
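
As a quick check of this total, adding the cost of the trimmed quotient multiplication to the cost of the trimmed residue multiplication gives

\[
{{m^2 + m} \over 2} + {{m^2 + 3m - 2} \over 2} = {{2m^2 + 4m - 2} \over 2} = m^2 + 2m - 1
\]

For instance, with $m = 32$ digits this amounts to $1087$ single precision multiplications compared to $3m^2 = 3072$ for the straightforward method.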
5584 | |
5585 \subsection{The Barrett Algorithm} | |
5586 \newpage\begin{figure}[!here] | |
5587 \begin{small} | |
5588 \begin{center} | |
5589 \begin{tabular}{l} | |
5590 \hline Algorithm \textbf{mp\_reduce}. \\ | |
5591 \textbf{Input}. mp\_int $a$, mp\_int $b$ and $\mu = \lfloor \beta^{2m}/b \rfloor, m = \lceil lg_{\beta}(b) \rceil, (0 \le a < b^2, b > 1)$ \\ | |
5592 \textbf{Output}. $a \mbox{ (mod }b\mbox{)}$ \\ | |
5593 \hline \\ | |
5594 Let $m$ represent the number of digits in $b$. \\ | |
5595 1. Make a copy of $a$ and store it in $q$. (\textit{mp\_init\_copy}) \\ | |
5596 2. $q \leftarrow \lfloor q / \beta^{m - 1} \rfloor$ (\textit{mp\_rshd}) \\ | |
5597 \\ | |
5598 Produce the quotient. \\ | |
5599 3. $q \leftarrow q \cdot \mu$ (\textit{note: only produce digits at or above $m-1$}) \\ | |
5600 4. $q \leftarrow \lfloor q / \beta^{m + 1} \rfloor$ \\ | |
5601 \\ | |
5602 Subtract the multiple of modulus from the input. \\ | |
5603 5. $a \leftarrow a \mbox{ (mod }\beta^{m+1}\mbox{)}$ (\textit{mp\_mod\_2d}) \\ | |
5604 6. $q \leftarrow q \cdot b \mbox{ (mod }\beta^{m+1}\mbox{)}$ (\textit{s\_mp\_mul\_digs}) \\ | |
5605 7. $a \leftarrow a - q$ (\textit{mp\_sub}) \\ | |
5606 \\ | |
Add $\beta^{m+1}$ if a carry occurred. \\
5608 8. If $a < 0$ then (\textit{mp\_cmp\_d}) \\ | |
5609 \hspace{3mm}8.1 $q \leftarrow 1$ (\textit{mp\_set}) \\ | |
5610 \hspace{3mm}8.2 $q \leftarrow q \cdot \beta^{m+1}$ (\textit{mp\_lshd}) \\ | |
5611 \hspace{3mm}8.3 $a \leftarrow a + q$ \\ | |
5612 \\ | |
Now subtract the modulus if the residue is too large (i.e. quotient too small). \\
5614 9. While $a \ge b$ do (\textit{mp\_cmp}) \\ | |
\hspace{3mm}9.1 $a \leftarrow a - b$ \\
5616 10. Clear $q$. \\ | |
5617 11. Return(\textit{MP\_OKAY}) \\ | |
5618 \hline | |
5619 \end{tabular} | |
5620 \end{center} | |
5621 \end{small} | |
5622 \caption{Algorithm mp\_reduce} | |
5623 \end{figure} | |
5624 | |
5625 \textbf{Algorithm mp\_reduce.} | |
5626 This algorithm will reduce the input $a$ modulo $b$ in place using the Barrett algorithm. It is loosely based on algorithm 14.42 of HAC | |
5627 \cite[pp. 602]{HAC} which is based on the paper from Paul Barrett \cite{BARRETT}. The algorithm has several restrictions and assumptions which must | |
5628 be adhered to for the algorithm to work. | |
5629 | |
First the modulus $b$ is assumed to be positive and greater than one. If the modulus were less than or equal to one then subtracting
5631 a multiple of it would either accomplish nothing or actually enlarge the input. The input $a$ must be in the range $0 \le a < b^2$ in order | |
5632 for the quotient to have enough precision. If $a$ is the product of two numbers that were already reduced modulo $b$, this will not be a problem. | |
5633 Technically the algorithm will still work if $a \ge b^2$ but it will take much longer to finish. The value of $\mu$ is passed as an argument to this | |
5634 algorithm and is assumed to be calculated and stored before the algorithm is used. | |
5635 | |
5636 Recall that the multiplication for the quotient on step 3 must only produce digits at or above the $m-1$'th position. An algorithm called | |
5637 $s\_mp\_mul\_high\_digs$ which has not been presented is used to accomplish this task. The algorithm is based on $s\_mp\_mul\_digs$ except that | |
5638 instead of stopping at a given level of precision it starts at a given level of precision. This optimal algorithm can only be used if the number | |
5639 of digits in $b$ is very much smaller than $\beta$. | |
5640 | |
5641 While it is known that | |
5642 $a \ge b \cdot \lfloor (q_0 \cdot \mu) / \beta^{m+1} \rfloor$ only the lower $m+1$ digits are being used to compute the residue, so an implied | |
5643 ``borrow'' from the higher digits might leave a negative result. After the multiple of the modulus has been subtracted from $a$ the residue must be | |
5644 fixed up in case it is negative. The invariant $\beta^{m+1}$ must be added to the residue to make it positive again. | |
5645 | |
5646 The while loop at step 9 will subtract $b$ until the residue is less than $b$. If the algorithm is performed correctly this step is | |
performed at most twice, and on average once. However, if $a \ge b^2$ then it will iterate substantially more times than it should.
5648 | |
5649 \vspace{+3mm}\begin{small} | |
5650 \hspace{-5.1mm}{\bf File}: bn\_mp\_reduce.c | |
5651 \vspace{-3mm} | |
5652 \begin{alltt} | |
5653 016 | |
5654 017 /* reduces x mod m, assumes 0 < x < m**2, mu is | |
5655 018 * precomputed via mp_reduce_setup. | |
5656 019 * From HAC pp.604 Algorithm 14.42 | |
5657 020 */ | |
5658 021 int | |
5659 022 mp_reduce (mp_int * x, mp_int * m, mp_int * mu) | |
5660 023 \{ | |
5661 024 mp_int q; | |
5662 025 int res, um = m->used; | |
5663 026 | |
5664 027 /* q = x */ | |
5665 028 if ((res = mp_init_copy (&q, x)) != MP_OKAY) \{ | |
5666 029 return res; | |
5667 030 \} | |
5668 031 | |
5669 032 /* q1 = x / b**(k-1) */ | |
5670 033 mp_rshd (&q, um - 1); | |
5671 034 | |
5672 035 /* according to HAC this optimization is ok */ | |
5673 036 if (((unsigned long) um) > (((mp_digit)1) << (DIGIT_BIT - 1))) \{ | |
5674 037 if ((res = mp_mul (&q, mu, &q)) != MP_OKAY) \{ | |
5675 038 goto CLEANUP; | |
5676 039 \} | |
5677 040 \} else \{ | |
5678 041 if ((res = s_mp_mul_high_digs (&q, mu, &q, um - 1)) != MP_OKAY) \{ | |
5679 042 goto CLEANUP; | |
5680 043 \} | |
5681 044 \} | |
5682 045 | |
5683 046 /* q3 = q2 / b**(k+1) */ | |
5684 047 mp_rshd (&q, um + 1); | |
5685 048 | |
5686 049 /* x = x mod b**(k+1), quick (no division) */ | |
5687 050 if ((res = mp_mod_2d (x, DIGIT_BIT * (um + 1), x)) != MP_OKAY) \{ | |
5688 051 goto CLEANUP; | |
5689 052 \} | |
5690 053 | |
5691 054 /* q = q * m mod b**(k+1), quick (no division) */ | |
5692 055 if ((res = s_mp_mul_digs (&q, m, &q, um + 1)) != MP_OKAY) \{ | |
5693 056 goto CLEANUP; | |
5694 057 \} | |
5695 058 | |
5696 059 /* x = x - q */ | |
5697 060 if ((res = mp_sub (x, &q, x)) != MP_OKAY) \{ | |
5698 061 goto CLEANUP; | |
5699 062 \} | |
5700 063 | |
5701 064 /* If x < 0, add b**(k+1) to it */ | |
5702 065 if (mp_cmp_d (x, 0) == MP_LT) \{ | |
5703 066 mp_set (&q, 1); | |
5704 067 if ((res = mp_lshd (&q, um + 1)) != MP_OKAY) | |
5705 068 goto CLEANUP; | |
5706 069 if ((res = mp_add (x, &q, x)) != MP_OKAY) | |
5707 070 goto CLEANUP; | |
5708 071 \} | |
5709 072 | |
5710 073 /* Back off if it's too big */ | |
5711 074 while (mp_cmp (x, m) != MP_LT) \{ | |
5712 075 if ((res = s_mp_sub (x, m, x)) != MP_OKAY) \{ | |
5713 076 goto CLEANUP; | |
5714 077 \} | |
5715 078 \} | |
5716 079 | |
5717 080 CLEANUP: | |
5718 081 mp_clear (&q); | |
5719 082 | |
5720 083 return res; | |
5721 084 \} | |
5722 \end{alltt} | |
5723 \end{small} | |
5724 | |
5725 The first multiplication that determines the quotient can be performed by only producing the digits from $m - 1$ and up. This essentially halves | |
5726 the number of single precision multiplications required. However, the optimization is only safe if $\beta$ is much larger than the number of digits | |
5727 in the modulus. In the source code this is evaluated on lines 36 to 44 where algorithm s\_mp\_mul\_high\_digs is used when it is | |
5728 safe to do so. | |
5729 | |
5730 \subsection{The Barrett Setup Algorithm} | |
5731 In order to use algorithm mp\_reduce the value of $\mu$ must be calculated in advance. Ideally this value should be computed once and stored for | |
5732 future use so that the Barrett algorithm can be used without delay. | |
5733 | |
5734 \begin{figure}[!here] | |
5735 \begin{small} | |
5736 \begin{center} | |
5737 \begin{tabular}{l} | |
5738 \hline Algorithm \textbf{mp\_reduce\_setup}. \\ | |
\textbf{Input}. mp\_int $b$ ($b > 1$) \\
\textbf{Output}. $\mu \leftarrow \lfloor \beta^{2m}/b \rfloor$ \\
5741 \hline \\ | |
5742 1. $\mu \leftarrow 2^{2 \cdot lg(\beta) \cdot m}$ (\textit{mp\_2expt}) \\ | |
5743 2. $\mu \leftarrow \lfloor \mu / b \rfloor$ (\textit{mp\_div}) \\ | |
5744 3. Return(\textit{MP\_OKAY}) \\ | |
5745 \hline | |
5746 \end{tabular} | |
5747 \end{center} | |
5748 \end{small} | |
5749 \caption{Algorithm mp\_reduce\_setup} | |
5750 \end{figure} | |
5751 | |
5752 \textbf{Algorithm mp\_reduce\_setup.} | |
5753 This algorithm computes the reciprocal $\mu$ required for Barrett reduction. First $\beta^{2m}$ is calculated as $2^{2 \cdot lg(\beta) \cdot m}$ which | |
5754 is equivalent and much faster. The final value is computed by taking the integer quotient of $\lfloor \mu / b \rfloor$. | |
5755 | |
5756 \vspace{+3mm}\begin{small} | |
5757 \hspace{-5.1mm}{\bf File}: bn\_mp\_reduce\_setup.c | |
5758 \vspace{-3mm} | |
5759 \begin{alltt} | |
5760 016 | |
5761 017 /* pre-calculate the value required for Barrett reduction | |
018 * For a given modulus "b" it calculates the value required in "a"
5763 019 */ | |
5764 020 int | |
5765 021 mp_reduce_setup (mp_int * a, mp_int * b) | |
5766 022 \{ | |
5767 023 int res; | |
5768 024 | |
5769 025 if ((res = mp_2expt (a, b->used * 2 * DIGIT_BIT)) != MP_OKAY) \{ | |
5770 026 return res; | |
5771 027 \} | |
5772 028 return mp_div (a, b, a, NULL); | |
5773 029 \} | |
5774 \end{alltt} | |
5775 \end{small} | |
5776 | |
5777 This simple routine calculates the reciprocal $\mu$ required by Barrett reduction. Note the extended usage of algorithm mp\_div where the variable | |
which would receive the remainder is passed as NULL. As will be discussed in~\ref{sec:division} the division routine allows both the quotient and the
5779 remainder to be passed as NULL meaning to ignore the value. | |
5780 | |
5781 \section{The Montgomery Reduction} | |
5782 Montgomery reduction\footnote{Thanks to Niels Ferguson for his insightful explanation of the algorithm.} \cite{MONT} is by far the most interesting | |
form of reduction in common use. It computes a modular residue which is not actually equal to the residue of the input but is instead equal to a
5784 residue times a constant. However, as perplexing as this may sound the algorithm is relatively simple and very efficient. | |
5785 | |
5786 Throughout this entire section the variable $n$ will represent the modulus used to form the residue. As will be discussed shortly the value of | |
5787 $n$ must be odd. The variable $x$ will represent the quantity of which the residue is sought. Similar to the Barrett algorithm the input | |
5788 is restricted to $0 \le x < n^2$. To begin the description some simple number theory facts must be established. | |
5789 | |
5790 \textbf{Fact 1.} Adding $n$ to $x$ does not change the residue since in effect it adds one to the quotient $\lfloor x / n \rfloor$. Another way | |
5791 to explain this is that $n$ is (\textit{or multiples of $n$ are}) congruent to zero modulo $n$. Adding zero will not change the value of the residue. | |
5792 | |
5793 \textbf{Fact 2.} If $x$ is even then performing a division by two in $\Z$ is congruent to $x \cdot 2^{-1} \mbox{ (mod }n\mbox{)}$. Actually | |
5794 this is an application of the fact that if $x$ is evenly divisible by any $k \in \Z$ then division in $\Z$ will be congruent to | |
5795 multiplication by $k^{-1}$ modulo $n$. | |
5796 | |
5797 From these two simple facts the following simple algorithm can be derived. | |
5798 | |
5799 \newpage\begin{figure}[!here] | |
5800 \begin{small} | |
5801 \begin{center} | |
5802 \begin{tabular}{l} | |
5803 \hline Algorithm \textbf{Montgomery Reduction}. \\ | |
5804 \textbf{Input}. Integer $x$, $n$ and $k$ \\ | |
5805 \textbf{Output}. $2^{-k}x \mbox{ (mod }n\mbox{)}$ \\ | |
5806 \hline \\ | |
5807 1. for $t$ from $1$ to $k$ do \\ | |
5808 \hspace{3mm}1.1 If $x$ is odd then \\ | |
5809 \hspace{6mm}1.1.1 $x \leftarrow x + n$ \\ | |
5810 \hspace{3mm}1.2 $x \leftarrow x/2$ \\ | |
5811 2. Return $x$. \\ | |
5812 \hline | |
5813 \end{tabular} | |
5814 \end{center} | |
5815 \end{small} | |
5816 \caption{Algorithm Montgomery Reduction} | |
5817 \end{figure} | |
5818 | |
5819 The algorithm reduces the input one bit at a time using the two congruencies stated previously. Inside the loop $n$, which is odd, is | |
5820 added to $x$ if $x$ is odd. This forces $x$ to be even which allows the division by two in $\Z$ to be congruent to a modular division by two. Since | |
5821 $x$ is assumed to be initially much larger than $n$ the addition of $n$ will contribute an insignificant magnitude to $x$. Let $r$ represent the | |
5822 final result of the Montgomery algorithm. If $k > lg(n)$ and $0 \le x < n^2$ then the final result is limited to | |
5823 $0 \le r < \lfloor x/2^k \rfloor + n$. As a result at most a single subtraction is required to get the residue desired. | |
5824 | |
5825 \begin{figure}[here] | |
5826 \begin{small} | |
5827 \begin{center} | |
5828 \begin{tabular}{|c|l|} | |
5829 \hline \textbf{Step number ($t$)} & \textbf{Result ($x$)} \\ | |
5830 \hline $1$ & $x + n = 5812$, $x/2 = 2906$ \\ | |
5831 \hline $2$ & $x/2 = 1453$ \\ | |
5832 \hline $3$ & $x + n = 1710$, $x/2 = 855$ \\ | |
5833 \hline $4$ & $x + n = 1112$, $x/2 = 556$ \\ | |
5834 \hline $5$ & $x/2 = 278$ \\ | |
5835 \hline $6$ & $x/2 = 139$ \\ | |
5836 \hline $7$ & $x + n = 396$, $x/2 = 198$ \\ | |
5837 \hline $8$ & $x/2 = 99$ \\ | |
5838 \hline | |
5839 \end{tabular} | |
5840 \end{center} | |
5841 \end{small} | |
5842 \caption{Example of Montgomery Reduction (I)} | |
5843 \label{fig:MONT1} | |
5844 \end{figure} | |
5845 | |
5846 Consider the example in figure~\ref{fig:MONT1} which reduces $x = 5555$ modulo $n = 257$ when $k = 8$. The result of the algorithm $r = 99$ is | |
5847 congruent to the value of $2^{-8} \cdot 5555 \mbox{ (mod }257\mbox{)}$. When $r$ is multiplied by $2^8$ modulo $257$ the correct residue | |
5848 $r \equiv 158$ is produced. | |
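
The bit-at-a-time algorithm can be sketched in C for word-sized operands as follows. The name mont\_reduce\_bits is hypothetical. The sketch assumes
that $n$ is odd and that $x + n$ never overflows 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Returns a value congruent to 2^(-k) * x (mod n), one bit at a time.
 * The result may still need one subtraction of n to be fully reduced. */
uint64_t mont_reduce_bits(uint64_t x, uint64_t n, unsigned k)
{
    assert((n & 1) == 1);   /* the modulus must be odd */

    for (unsigned t = 1; t <= k; t++) {
        if (x & 1) {
            x += n;         /* Fact 1: adding n leaves the residue unchanged */
        }
        x >>= 1;            /* Fact 2: x is even, so this is x * 2^(-1) mod n */
    }
    return x;
}
```

Calling mont\_reduce\_bits(5555, 257, 8) returns $99$, and multiplying that result by $2^8$ modulo $257$ gives back $158$.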
5849 | |
5850 Let $k = \lfloor lg(n) \rfloor + 1$ represent the number of bits in $n$. The current algorithm requires $2k^2$ single precision shifts | |
5851 and $k^2$ single precision additions. At this rate the algorithm is most certainly slower than Barrett reduction and not terribly useful. | |
5852 Fortunately there exists an alternative representation of the algorithm. | |
5853 | |
5854 \begin{figure}[!here] | |
5855 \begin{small} | |
5856 \begin{center} | |
5857 \begin{tabular}{l} | |
5858 \hline Algorithm \textbf{Montgomery Reduction} (modified I). \\ | |
5859 \textbf{Input}. Integer $x$, $n$ and $k$ \\ | |
5860 \textbf{Output}. $2^{-k}x \mbox{ (mod }n\mbox{)}$ \\ | |
5861 \hline \\ | |
5862 1. for $t$ from $0$ to $k - 1$ do \\ | |
5863 \hspace{3mm}1.1 If the $t$'th bit of $x$ is one then \\ | |
5864 \hspace{6mm}1.1.1 $x \leftarrow x + 2^tn$ \\ | |
5865 2. Return $x/2^k$. \\ | |
5866 \hline | |
5867 \end{tabular} | |
5868 \end{center} | |
5869 \end{small} | |
5870 \caption{Algorithm Montgomery Reduction (modified I)} | |
5871 \end{figure} | |
5872 | |
5873 This algorithm is equivalent since $2^tn$ is a multiple of $n$ and the lower $k$ bits of $x$ are zero by step 2. The number of single | |
5874 precision shifts has now been reduced from $2k^2$ to $k^2 + k$ which is only a small improvement. | |
5875 | |
5876 \begin{figure}[here] | |
5877 \begin{small} | |
5878 \begin{center} | |
5879 \begin{tabular}{|c|l|r|} | |
5880 \hline \textbf{Step number ($t$)} & \textbf{Result ($x$)} & \textbf{Result ($x$) in Binary} \\ | |
5881 \hline -- & $5555$ & $1010110110011$ \\ | |
5882 \hline $1$ & $x + 2^{0}n = 5812$ & $1011010110100$ \\ | |
5883 \hline $2$ & $5812$ & $1011010110100$ \\ | |
5884 \hline $3$ & $x + 2^{2}n = 6840$ & $1101010111000$ \\ | |
5885 \hline $4$ & $x + 2^{3}n = 8896$ & $10001011000000$ \\ | |
5886 \hline $5$ & $8896$ & $10001011000000$ \\ | |
5887 \hline $6$ & $8896$ & $10001011000000$ \\ | |
5888 \hline $7$ & $x + 2^{6}n = 25344$ & $110001100000000$ \\ | |
5889 \hline $8$ & $25344$ & $110001100000000$ \\ | |
5890 \hline -- & $x/2^k = 99$ & \\ | |
5891 \hline | |
5892 \end{tabular} | |
5893 \end{center} | |
5894 \end{small} | |
5895 \caption{Example of Montgomery Reduction (II)} | |
5896 \label{fig:MONT2} | |
5897 \end{figure} | |
5898 | |
5899 Figure~\ref{fig:MONT2} demonstrates the modified algorithm reducing $x = 5555$ modulo $n = 257$ with $k = 8$. | |
5900 With this algorithm a single shift right at the end is the only right shift required to reduce the input instead of $k$ right shifts inside the | |
5901 loop. Note that in the steps numbered $2, 5, 6$ and $8$ the result $x$ is not changed. In those steps the bit of $x$ being examined is already | |
5902 zero, so no multiple of $n$ needs to be added to force that bit of the result to zero. | |
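
A minimal C sketch of the modified algorithm follows. The name mont\_reduce\_shift\_once is hypothetical and the sketch assumes that $x + 2^kn$ fits in a 64-bit word. It is meant only to show that a single shift at the end replaces the $k$ shifts inside the loop.

```c
#include <stdint.h>

/* Montgomery reduction (modified I) sketch: clear bit t of x by adding
 * 2^t * n, then shift right by k bits once at the end.
 * Assumptions: n is odd, x / 2^k < n, and x + 2^k * n fits in 64 bits. */
static uint64_t mont_reduce_shift_once(uint64_t x, uint64_t n, unsigned k)
{
    unsigned t;
    for (t = 0; t < k; t++) {
        if ((x >> t) & 1) {
            x += n << t;     /* forces bit t of x to zero */
        }
    }
    x >>= k;                 /* the low k bits are zero, the division is exact */
    if (x >= n) {
        x -= n;
    }
    return x;
}
```

For $x = 5555$, $n = 257$ and $k = 8$ the loop ends with $x = 25344$ and the function returns $25344/2^8 = 99$, matching figure~\ref{fig:MONT2}.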
5903 | |
5904 \subsection{Digit Based Montgomery Reduction} | |
5905 Instead of computing the reduction on a bit-by-bit basis it is actually much faster to compute it on a digit-by-digit basis. Consider the | |
5906 previous algorithm re-written to compute the Montgomery reduction in this new fashion. | |
5907 | |
5908 \begin{figure}[!here] | |
5909 \begin{small} | |
5910 \begin{center} | |
5911 \begin{tabular}{l} | |
5912 \hline Algorithm \textbf{Montgomery Reduction} (modified II). \\ | |
5913 \textbf{Input}. Integer $x$, $n$ and $k$ \\ | |
5914 \textbf{Output}. $\beta^{-k}x \mbox{ (mod }n\mbox{)}$ \\ | |
5915 \hline \\ | |
5916 1. for $t$ from $0$ to $k - 1$ do \\ | |
5917 \hspace{3mm}1.1 $x \leftarrow x + \mu n \beta^t$ \\ | |
5918 2. Return $x/\beta^k$. \\ | |
5919 \hline | |
5920 \end{tabular} | |
5921 \end{center} | |
5922 \end{small} | |
5923 \caption{Algorithm Montgomery Reduction (modified II)} | |
5924 \end{figure} | |
5925 | |
5926 The value $\mu n \beta^t$ is a multiple of the modulus $n$ meaning that it will not change the residue. If the $t$'th digit of | |
5927 the value $\mu n \beta^t$ equals the negative (modulo $\beta$) of the $t$'th digit of $x$ then the addition will result in a zero digit. This | |
5928 problem breaks down to solving the following congruence. | |
5929 | |
5930 \begin{center} | |
5931 \begin{tabular}{rcl} | |
5932 $x_t + \mu n_0$ & $\equiv$ & $0 \mbox{ (mod }\beta\mbox{)}$ \\ | |
5933 $\mu n_0$ & $\equiv$ & $-x_t \mbox{ (mod }\beta\mbox{)}$ \\ | |
5934 $\mu$ & $\equiv$ & $-x_t/n_0 \mbox{ (mod }\beta\mbox{)}$ \\ | |
5935 \end{tabular} | |
5936 \end{center} | |
5937 | |
5938 In each iteration of the loop on step 1 a new value of $\mu$ must be calculated. The value of $-1/n_0 \mbox{ (mod }\beta\mbox{)}$ is used | |
5939 extensively in this algorithm and should be precomputed. Let $\rho$ represent the negative of the modular inverse of $n_0$ modulo $\beta$, so that $\mu \equiv x_t \rho \mbox{ (mod }\beta\mbox{)}$. | |
5940 | |
5941 For example, let $\beta = 10$ represent the radix. Let $n = 17$ represent the modulus which implies $k = 2$ and $\rho \equiv 7$. Let $x = 33$ | |
5942 represent the value to reduce. | |
5943 | |
5944 \newpage\begin{figure} | |
5945 \begin{center} | |
5946 \begin{tabular}{|c|c|c|} | |
5947 \hline \textbf{Step ($t$)} & \textbf{Value of $x$} & \textbf{Value of $\mu$} \\ | |
5948 \hline -- & $33$ & --\\ | |
5949 \hline $0$ & $33 + \mu n = 50$ & $1$ \\ | |
5950 \hline $1$ & $50 + \mu n \beta = 900$ & $5$ \\ | |
5951 \hline | |
5952 \end{tabular} | |
5953 \end{center} | |
5954 \caption{Example of Montgomery Reduction} | |
5955 \end{figure} | |
5956 | |
5957 The value $900$ is then divided by $\beta^k = 100$ to produce the final result $9$. The first observation is that $9 \nequiv x \mbox{ (mod }n\mbox{)}$ | |
5958 which implies the result is not the modular residue of $x$ modulo $n$. However, recall that the residue is actually multiplied by $\beta^{-k}$ in | |
5959 the algorithm. To get the true residue the value must be multiplied by $\beta^k$. In this case $\beta^k \equiv 15 \mbox{ (mod }n\mbox{)}$ and | |
5960 the correct residue is $9 \cdot 15 \equiv 16 \mbox{ (mod }n\mbox{)}$. | |
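
The digit based reduction can be sketched in C on plain integers for any small radix. The function name mont\_reduce\_digits and its parameter list are hypothetical and chosen only for this illustration. The sketch assumes $(n, \beta) = 1$, $0 \le x < n^2$ and that no intermediate value overflows 64 bits.

```c
#include <stdint.h>

/* Digit-wise Montgomery reduction sketch on plain integers.
 * beta is the radix, k the number of radix-beta digits of n and
 * rho = -1/n_0 (mod beta).  Returns beta^(-k) * x mod n. */
static uint64_t mont_reduce_digits(uint64_t x, uint64_t n, unsigned k,
                                   uint64_t beta, uint64_t rho)
{
    uint64_t bt = 1;                         /* beta^t */
    unsigned t;
    for (t = 0; t < k; t++) {
        uint64_t xt = (x / bt) % beta;       /* t'th digit of x */
        uint64_t mu = (xt * rho) % beta;     /* mu = x_t * rho (mod beta) */
        x += mu * n * bt;                    /* t'th digit of x becomes zero */
        bt *= beta;
    }
    x /= bt;                                 /* exact division by beta^k */
    if (x >= n) {
        x -= n;
    }
    return x;
}
```

The call mont\_reduce\_digits(33, 17, 2, 10, 7) passes through $50$ and $900$ and returns $9$ as in the example above. With $\beta = 2$ and $\rho = 1$ the same function reproduces the bit-wise result $99$ for $x = 5555$, $n = 257$ and $k = 8$.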
5961 | |
5962 \subsection{Baseline Montgomery Reduction} | |
5963 The baseline Montgomery reduction algorithm will produce the residue for any size input. It is designed to be a catch-all algorithm for | |
5964 Montgomery reductions. | |
5965 | |
5966 \newpage\begin{figure}[!here] | |
5967 \begin{small} | |
5968 \begin{center} | |
5969 \begin{tabular}{l} | |
5970 \hline Algorithm \textbf{mp\_montgomery\_reduce}. \\ | |
5971 \textbf{Input}. mp\_int $x$, mp\_int $n$ and a digit $\rho \equiv -1/n_0 \mbox{ (mod }\beta\mbox{)}$. \\ | |
5972 \hspace{11.5mm}($0 \le x < n^2, n > 1, (n, \beta) = 1, \beta^k > n$) \\ | |
5973 \textbf{Output}. $\beta^{-k}x \mbox{ (mod }n\mbox{)}$ \\ | |
5974 \hline \\ | |
5975 1. $digs \leftarrow 2n.used + 1$ \\ | |
5976 2. If $digs < MP\_WARRAY$ and $n.used < \delta$ then \\ | |
5977 \hspace{3mm}2.1 Use algorithm fast\_mp\_montgomery\_reduce instead. \\ | |
5978 \\ | |
5979 Setup $x$ for the reduction. \\ | |
5980 3. If $x.alloc < digs$ then grow $x$ to $digs$ digits. \\ | |
5981 4. $x.used \leftarrow digs$ \\ | |
5982 \\ | |
5983 Eliminate the lower $k$ digits. \\ | |
5984 5. For $ix$ from $0$ to $k - 1$ do \\ | |
5985 \hspace{3mm}5.1 $\mu \leftarrow x_{ix} \cdot \rho \mbox{ (mod }\beta\mbox{)}$ \\ | |
5986 \hspace{3mm}5.2 $u \leftarrow 0$ \\ | |
5987 \hspace{3mm}5.3 For $iy$ from $0$ to $k - 1$ do \\ | |
5988 \hspace{6mm}5.3.1 $\hat r \leftarrow \mu n_{iy} + x_{ix + iy} + u$ \\ | |
5989 \hspace{6mm}5.3.2 $x_{ix + iy} \leftarrow \hat r \mbox{ (mod }\beta\mbox{)}$ \\ | |
5990 \hspace{6mm}5.3.3 $u \leftarrow \lfloor \hat r / \beta \rfloor$ \\ | |
5991 \hspace{3mm}5.4 While $u > 0$ do \\ | |
5992 \hspace{6mm}5.4.1 $iy \leftarrow iy + 1$ \\ | |
5993 \hspace{6mm}5.4.2 $x_{ix + iy} \leftarrow x_{ix + iy} + u$ \\ | |
5994 \hspace{6mm}5.4.3 $u \leftarrow \lfloor x_{ix+iy} / \beta \rfloor$ \\ | |
5995 \hspace{6mm}5.4.4 $x_{ix + iy} \leftarrow x_{ix+iy} \mbox{ (mod }\beta\mbox{)}$ \\ | |
5996 \\ | |
5997 Divide by $\beta^k$ and fix up as required. \\ | |
5998 6. $x \leftarrow \lfloor x / \beta^k \rfloor$ \\ | |
5999 7. If $x \ge n$ then \\ | |
6000 \hspace{3mm}7.1 $x \leftarrow x - n$ \\ | |
6001 8. Return(\textit{MP\_OKAY}). \\ | |
6002 \hline | |
6003 \end{tabular} | |
6004 \end{center} | |
6005 \end{small} | |
6006 \caption{Algorithm mp\_montgomery\_reduce} | |
6007 \end{figure} | |
6008 | |
6009 \textbf{Algorithm mp\_montgomery\_reduce.} | |
6010 This algorithm reduces the input $x$ modulo $n$ in place using the Montgomery reduction algorithm. The algorithm is loosely based | |
6011 on algorithm 14.32 of \cite[pp.601]{HAC} except it merges the multiplication of $\mu n \beta^t$ with the addition in the inner loop. The | |
6012 restrictions on this algorithm are fairly easy to adapt to. First $0 \le x < n^2$ bounds the input to numbers in the same range as | |
6013 for the Barrett algorithm. Additionally if $n > 1$ and $n$ is odd there will exist a modular inverse $\rho$. $\rho$ must be calculated in | |
6014 advance of this algorithm. Finally the variable $k$ is fixed and is simply an alias for $n.used$. | |
6015 | |
6016 Step 2 decides whether a faster Montgomery algorithm can be used. It is based on the Comba technique meaning that there are limits on | |
6017 the size of the input. This algorithm is discussed in sub-section 6.3.3. | |
6018 | |
6019 Step 5 is the main reduction loop of the algorithm. The value of $\mu$ is calculated once per iteration in the outer loop. The inner loop | |
6020 calculates $x + \mu n \beta^{ix}$ by multiplying $\mu n$ and adding the result to $x$ shifted by $ix$ digits. Both the addition and | |
6021 multiplication are performed in the same loop to save time and memory. Step 5.4 will handle any additional carries that escape the inner loop. | |
6022 | |
6023 Using a quick inspection this algorithm requires $k$ single precision multiplications for the outer loop and $k^2$ single precision multiplications | |
6024 in the inner loop, where $k = n.used$. In total it requires $k^2 + k$ single precision multiplications, which compares favourably to Barrett at $k^2 + 2k - 1$ single precision | |
6025 multiplications. | |
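
As a rough illustration, assume a $1024$-bit modulus and $28$-bit digits, so that the modulus occupies $\lceil 1024/28 \rceil = 37$ digits. With $37$ digits in the modulus the two counts work out as follows.

\begin{center}
\begin{tabular}{rcl}
Montgomery & $=$ & $37^2 + 37 = 1406$ \\
Barrett & $=$ & $37^2 + 2 \cdot 37 - 1 = 1442$ \\
\end{tabular}
\end{center}

The saving is only $36$ single precision multiplications, one fewer than the number of digits, so the raw multiplication count by itself is not a large advantage at this size.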
6026 | |
6027 \vspace{+3mm}\begin{small} | |
6028 \hspace{-5.1mm}{\bf File}: bn\_mp\_montgomery\_reduce.c | |
6029 \vspace{-3mm} | |
6030 \begin{alltt} | |
6031 016 | |
6032 017 /* computes xR**-1 == x (mod N) via Montgomery Reduction */ | |
6033 018 int | |
6034 019 mp_montgomery_reduce (mp_int * x, mp_int * n, mp_digit rho) | |
6035 020 \{ | |
6036 021 int ix, res, digs; | |
6037 022 mp_digit mu; | |
6038 023 | |
6039 024 /* can the fast reduction [comba] method be used? | |
6040 025 * | |
6041 026 * Note that unlike in mp_mul you're safely allowed *less* | |
6042 027 * than the available columns [255 per default] since carries | |
6043 028 * are fixed up in the inner loop. | |
6044 029 */ | |
6045 030 digs = n->used * 2 + 1; | |
6046 031 if ((digs < MP_WARRAY) && | |
6047 032 n->used < | |
6048 033 (1 << ((CHAR_BIT * sizeof (mp_word)) - (2 * DIGIT_BIT)))) \{ | |
6049 034 return fast_mp_montgomery_reduce (x, n, rho); | |
6050 035 \} | |
6051 036 | |
6052 037 /* grow the input as required */ | |
6053 038 if (x->alloc < digs) \{ | |
6054 039 if ((res = mp_grow (x, digs)) != MP_OKAY) \{ | |
6055 040 return res; | |
6056 041 \} | |
6057 042 \} | |
6058 043 x->used = digs; | |
6059 044 | |
6060 045 for (ix = 0; ix < n->used; ix++) \{ | |
6061 046 /* mu = ai * rho mod b | |
6062 047 * | |
6063 048 * The value of rho must be precalculated via | |
6064 049 * bn_mp_montgomery_setup() such that | |
6065 050 * it equals -1/n0 mod b this allows the | |
6066 051 * following inner loop to reduce the | |
6067 052 * input one digit at a time | |
6068 053 */ | |
6069 054 mu = (mp_digit) (((mp_word)x->dp[ix]) * ((mp_word)rho) & MP_MASK); | |
6070 055 | |
6071 056 /* a = a + mu * m * b**i */ | |
6072 057 \{ | |
6073 058 register int iy; | |
6074 059 register mp_digit *tmpn, *tmpx, u; | |
6075 060 register mp_word r; | |
6076 061 | |
6077 062 /* alias for digits of the modulus */ | |
6078 063 tmpn = n->dp; | |
6079 064 | |
6080 065 /* alias for the digits of x [the input] */ | |
6081 066 tmpx = x->dp + ix; | |
6082 067 | |
6083 068 /* set the carry to zero */ | |
6084 069 u = 0; | |
6085 070 | |
6086 071 /* Multiply and add in place */ | |
6087 072 for (iy = 0; iy < n->used; iy++) \{ | |
6088 073 /* compute product and sum */ | |
6089 074 r = ((mp_word)mu) * ((mp_word)*tmpn++) + | |
6090 075 ((mp_word) u) + ((mp_word) * tmpx); | |
6091 076 | |
6092 077 /* get carry */ | |
6093 078 u = (mp_digit)(r >> ((mp_word) DIGIT_BIT)); | |
6094 079 | |
6095 080 /* fix digit */ | |
6096 081 *tmpx++ = (mp_digit)(r & ((mp_word) MP_MASK)); | |
6097 082 \} | |
6098 083 /* At this point the ix'th digit of x should be zero */ | |
6099 084 | |
6100 085 | |
6101 086 /* propagate carries upwards as required*/ | |
6102 087 while (u) \{ | |
6103 088 *tmpx += u; | |
6104 089 u = *tmpx >> DIGIT_BIT; | |
6105 090 *tmpx++ &= MP_MASK; | |
6106 091 \} | |
6107 092 \} | |
6108 093 \} | |
6109 094 | |
6110 095 /* at this point the n.used'th least | |
6111 096 * significant digits of x are all zero | |
6112 097 * which means we can shift x to the | |
6113 098 * right by n.used digits and the | |
6114 099 * residue is unchanged. | |
6115 100 */ | |
6116 101 | |
6117 102 /* x = x/b**n.used */ | |
6118 103 mp_clamp(x); | |
6119 104 mp_rshd (x, n->used); | |
6120 105 | |
6121 106 /* if x >= n then x = x - n */ | |
6122 107 if (mp_cmp_mag (x, n) != MP_LT) \{ | |
6123 108 return s_mp_sub (x, n, x); | |
6124 109 \} | |
6125 110 | |
6126 111 return MP_OKAY; | |
6127 112 \} | |
6128 \end{alltt} | |
6129 \end{small} | |
6130 | |
6131 This is the baseline implementation of the Montgomery reduction algorithm. Lines 30 to 35 determine if the Comba based | |
6132 routine can be used instead. Line 54 computes the value of $\mu$ for that particular iteration of the outer loop. | |
6133 | |
6134 The multiplication $\mu n \beta^{ix}$ is performed in one step in the inner loop. The alias $tmpx$ refers to the $ix$'th digit of $x$ and | |
6135 the alias $tmpn$ refers to the modulus $n$. | |
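
The merged multiply and add can be exercised in isolation with a toy version that uses $8$-bit digits held in 32-bit words. Everything in the sketch below is an assumption made for illustration: the names mont\_zero\_low\_digits and mont\_reduce\_small are hypothetical, the digit size is not the library default and the modulus is fixed at two digits.

```c
#include <stdint.h>

#define DIGIT_BIT 8                          /* toy digit size (assumption) */
#define MASK      ((1u << DIGIT_BIT) - 1u)
#define K         2                          /* digits in the modulus (assumption) */

/* Zero the low K digits of x (2K+1 digits, least significant first) by
 * adding mu * n * beta^ix for each ix; multiply and add share one loop. */
static void mont_zero_low_digits(uint32_t x[2 * K + 1], const uint32_t n[K],
                                 uint32_t rho)
{
    int ix, iy;
    for (ix = 0; ix < K; ix++) {
        uint32_t mu = (x[ix] * rho) & MASK;  /* mu = x_ix * rho (mod beta) */
        uint32_t u = 0;                      /* carry */
        for (iy = 0; iy < K; iy++) {
            uint32_t r = mu * n[iy] + x[ix + iy] + u;
            x[ix + iy] = r & MASK;
            u = r >> DIGIT_BIT;
        }
        for (iy = ix + K; u != 0; iy++) {    /* carries escaping the inner loop */
            x[iy] += u;
            u = x[iy] >> DIGIT_BIT;
            x[iy] &= MASK;
        }
    }
}

/* Wrapper on plain integers: returns beta^(-K) * xv mod nv for xv < nv^2. */
static uint32_t mont_reduce_small(uint32_t xv, uint32_t nv, uint32_t rho)
{
    uint32_t x[2 * K + 1], n[K], r = 0;
    int i;
    for (i = 0; i < 2 * K + 1; i++)          /* xv holds at most four 8-bit digits */
        x[i] = (i < 4) ? ((xv >> (DIGIT_BIT * i)) & MASK) : 0;
    for (i = 0; i < K; i++)
        n[i] = (nv >> (DIGIT_BIT * i)) & MASK;
    mont_zero_low_digits(x, n, rho);
    for (i = 0; i <= K; i++)                 /* shift right by K digits */
        r |= x[K + i] << (DIGIT_BIT * i);
    if (r >= nv) {
        r -= nv;
    }
    return r;
}
```

For $n = 257$ the low digit is $n_0 = 1$, hence $\rho = 255$, and mont\_reduce\_small(5555, 257, 255) returns $158$. Since $\beta^k = 2^{16} \equiv 1 \mbox{ (mod }257\mbox{)}$ in this toy setting the result coincides with the ordinary residue of $5555$.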
6136 | |
6137 \subsection{Faster ``Comba'' Montgomery Reduction} | |
6138 | |
6139 The Montgomery reduction requires fewer single precision multiplications than a Barrett reduction. However, it is much slower due to the serial | |
6140 nature of the inner loop. The Barrett reduction algorithm requires two slightly modified multipliers which can be implemented with the Comba | |
6141 technique. The Montgomery reduction algorithm cannot directly use the Comba technique to any significant advantage since the inner loop calculates | |
6142 a $k \times 1$ product $k$ times. | |
6143 | |
6144 The biggest obstacle is that at the $ix$'th iteration of the outer loop the value of $x_{ix}$ is required to calculate $\mu$. This means the | |
6145 carries from $0$ to $ix - 1$ must have been propagated upwards to form a valid $ix$'th digit. The solution, as it turns out, is very simple. | |
6146 Perform a Comba-like multiplication and, inside the outer loop just after the inner loop, fix up the $ix + 1$'th digit by forwarding the carry. | |
6147 | |
6148 With this change in place the Montgomery reduction algorithm can be performed with a Comba style multiplication loop which substantially increases | |
6149 the speed of the algorithm. | |
6150 | |
6151 \newpage\begin{figure}[!here] | |
6152 \begin{small} | |
6153 \begin{center} | |
6154 \begin{tabular}{l} | |
6155 \hline Algorithm \textbf{fast\_mp\_montgomery\_reduce}. \\ | |
6156 \textbf{Input}. mp\_int $x$, mp\_int $n$ and a digit $\rho \equiv -1/n_0 \mbox{ (mod }\beta\mbox{)}$. \\ | |
6157 \hspace{11.5mm}($0 \le x < n^2, n > 1, (n, \beta) = 1, \beta^k > n$) \\ | |
6158 \textbf{Output}. $\beta^{-k}x \mbox{ (mod }n\mbox{)}$ \\ | |
6159 \hline \\ | |
6160 Place an array of \textbf{MP\_WARRAY} mp\_word variables called $\hat W$ on the stack. \\ | |
6161 1. if $x.alloc < n.used + 1$ then grow $x$ to $n.used + 1$ digits. \\ | |
6162 Copy the digits of $x$ into the array $\hat W$ \\ | |
6163 2. For $ix$ from $0$ to $x.used - 1$ do \\ | |
6164 \hspace{3mm}2.1 $\hat W_{ix} \leftarrow x_{ix}$ \\ | |
6165 3. For $ix$ from $x.used$ to $2n.used - 1$ do \\ | |
6166 \hspace{3mm}3.1 $\hat W_{ix} \leftarrow 0$ \\ | |
6167 Eliminate the lower $k$ digits. \\ | |
6168 4. for $ix$ from $0$ to $n.used - 1$ do \\ | |
6169 \hspace{3mm}4.1 $\mu \leftarrow \hat W_{ix} \cdot \rho \mbox{ (mod }\beta\mbox{)}$ \\ | |
6170 \hspace{3mm}4.2 For $iy$ from $0$ to $n.used - 1$ do \\ | |
6171 \hspace{6mm}4.2.1 $\hat W_{iy + ix} \leftarrow \hat W_{iy + ix} + \mu \cdot n_{iy}$ \\ | |
6172 \hspace{3mm}4.3 $\hat W_{ix + 1} \leftarrow \hat W_{ix + 1} + \lfloor \hat W_{ix} / \beta \rfloor$ \\ | |
6173 Propagate carries upwards. \\ | |
6174 5. for $ix$ from $n.used$ to $2n.used + 1$ do \\ | |
6175 \hspace{3mm}5.1 $\hat W_{ix + 1} \leftarrow \hat W_{ix + 1} + \lfloor \hat W_{ix} / \beta \rfloor$ \\ | |
6176 Shift right and reduce modulo $\beta$ simultaneously. \\ | |
6177 6. for $ix$ from $0$ to $n.used + 1$ do \\ | |
6178 \hspace{3mm}6.1 $x_{ix} \leftarrow \hat W_{ix + n.used} \mbox{ (mod }\beta\mbox{)}$ \\ | |
6179 Zero excess digits and fixup $x$. \\ | |
6180 7. if $x.used > n.used + 1$ then do \\ | |
6181 \hspace{3mm}7.1 for $ix$ from $n.used + 1$ to $x.used - 1$ do \\ | |
6182 \hspace{6mm}7.1.1 $x_{ix} \leftarrow 0$ \\ | |
6183 8. $x.used \leftarrow n.used + 1$ \\ | |
6184 9. Clamp excessive digits of $x$. \\ | |
6185 10. If $x \ge n$ then \\ | |
6186 \hspace{3mm}10.1 $x \leftarrow x - n$ \\ | |
6187 11. Return(\textit{MP\_OKAY}). \\ | |
6188 \hline | |
6189 \end{tabular} | |
6190 \end{center} | |
6191 \end{small} | |
6192 \caption{Algorithm fast\_mp\_montgomery\_reduce} | |
6193 \end{figure} | |
6194 | |
6195 \textbf{Algorithm fast\_mp\_montgomery\_reduce.} | |
6196 This algorithm will compute the Montgomery reduction of $x$ modulo $n$ using the Comba technique. It is on most computer platforms significantly | |
6197 faster than algorithm mp\_montgomery\_reduce and algorithm mp\_reduce (\textit{Barrett reduction}). The algorithm has the same restrictions | |
6198 on the input as the baseline reduction algorithm. An additional two restrictions are imposed on this algorithm. The number of digits $k$ in | |
6199 the modulus $n$ must not violate $MP\_WARRAY > 2k +1$ and $k < \delta$. When $\beta = 2^{28}$ this algorithm can be used to reduce modulo | |
6200 a modulus of at most $3,556$ bits in length. | |
6201 | |
6202 As in the other Comba reduction algorithms there is a $\hat W$ array which stores the columns of the product. It is initially filled with the | |
6203 contents of $x$ with the excess digits zeroed. The reduction loop is very similar to the baseline loop at heart. The multiplication on step | |
6204 4.1 can be single precision only since $ab \mbox{ (mod }\beta\mbox{)} \equiv (a \mbox{ mod }\beta)(b \mbox{ mod }\beta)$. Some multipliers such | |
6205 as those on the ARM processors take a variable length time to complete depending on the number of bytes of result they must produce. By performing | |
6206 a single precision multiplication instead half the amount of time is spent. | |
6207 | |
6208 Also note that digit $\hat W_{ix}$ must have the carry from the $ix - 1$'th digit propagated upwards in order for this to work. That is what step | |
6209 4.3 will do. In effect over the $n.used$ iterations of the outer loop the lower $n.used$ columns all have their carries propagated forwards. Note | |
6210 how the upper bits of those same words are not reduced modulo $\beta$. This is because those values will be discarded shortly and there is no | |
6211 point in reducing them. | |
6212 | |
6213 Step 5 will propagate the remainder of the carries upwards. On step 6 the columns are reduced modulo $\beta$ and shifted simultaneously as they are | |
6214 stored in the destination $x$. | |
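
The column approach can be sketched with a toy version that uses $8$-bit digits and 64-bit columns. The function name comba\_mont\_small, the fixed two digit modulus and the digit size are assumptions made for illustration and do not reflect the library defaults. The point of the sketch is that the inner loop performs no carry handling, while one carry is forwarded per outer iteration so that the next $\mu$ is computed from a valid digit.

```c
#include <stdint.h>

#define DIGIT_BIT 8                          /* toy digit size (assumption) */
#define MASK      ((1u << DIGIT_BIT) - 1u)
#define K         2                          /* digits in the modulus (assumption) */

/* Comba-style Montgomery reduction sketch on plain integers.
 * Returns beta^(-K) * xv mod nv for odd nv, xv < nv^2, rho = -1/n_0 (mod beta). */
static uint32_t comba_mont_small(uint32_t xv, uint32_t nv, uint32_t rho)
{
    uint64_t W[2 * K + 1];                   /* double precision columns */
    uint32_t n[K], r = 0;
    int ix, iy;

    for (ix = 0; ix < 2 * K + 1; ix++)       /* xv holds at most four 8-bit digits */
        W[ix] = (ix < 4) ? ((xv >> (DIGIT_BIT * ix)) & MASK) : 0;
    for (ix = 0; ix < K; ix++)
        n[ix] = (nv >> (DIGIT_BIT * ix)) & MASK;

    for (ix = 0; ix < K; ix++) {
        /* single precision product is enough for mu */
        uint32_t mu = (uint32_t) (((W[ix] & MASK) * rho) & MASK);
        for (iy = 0; iy < K; iy++)
            W[ix + iy] += (uint64_t) mu * n[iy];   /* no carries in here */
        W[ix + 1] += W[ix] >> DIGIT_BIT;     /* forward one carry */
    }
    for (ix = K + 1; ix <= 2 * K; ix++)      /* propagate the remaining carries */
        W[ix] += W[ix - 1] >> DIGIT_BIT;

    for (ix = 0; ix <= K; ix++)              /* mask and shift by K digits */
        r |= (uint32_t) (W[ix + K] & MASK) << (DIGIT_BIT * ix);
    if (r >= nv) {
        r -= nv;
    }
    return r;
}
```

For $n = 257$ and $\rho = 255$ the call comba\_mont\_small(5555, 257, 255) returns $158$. The columns are only masked when they are copied out, which mirrors the remark above that the upper bits of the lower words are never reduced.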
6215 | |
6216 \vspace{+3mm}\begin{small} | |
6217 \hspace{-5.1mm}{\bf File}: bn\_fast\_mp\_montgomery\_reduce.c | |
6218 \vspace{-3mm} | |
6219 \begin{alltt} | |
6220 016 | |
6221 017 /* computes xR**-1 == x (mod N) via Montgomery Reduction | |
6222 018 * | |
6223 019 * This is an optimized implementation of mp_montgomery_reduce | |
6224 020 * which uses the comba method to quickly calculate the columns of the | |
6225 021 * reduction. | |
6226 022 * | |
6227 023 * Based on Algorithm 14.32 on pp.601 of HAC. | |
6228 024 */ | |
6229 025 int | |
6230 026 fast_mp_montgomery_reduce (mp_int * x, mp_int * n, mp_digit rho) | |
6231 027 \{ | |
6232 028 int ix, res, olduse; | |
6233 029 mp_word W[MP_WARRAY]; | |
6234 030 | |
6235 031 /* get old used count */ | |
6236 032 olduse = x->used; | |
6237 033 | |
6238 034 /* grow a as required */ | |
6239 035 if (x->alloc < n->used + 1) \{ | |
6240 036 if ((res = mp_grow (x, n->used + 1)) != MP_OKAY) \{ | |
6241 037 return res; | |
6242 038 \} | |
6243 039 \} | |
6244 040 | |
6245 041 /* first we have to get the digits of the input into | |
6246 042 * an array of double precision words W[...] | |
6247 043 */ | |
6248 044 \{ | |
6249 045 register mp_word *_W; | |
6250 046 register mp_digit *tmpx; | |
6251 047 | |
6252 048 /* alias for the W[] array */ | |
6253 049 _W = W; | |
6254 050 | |
6255 051 /* alias for the digits of x*/ | |
6256 052 tmpx = x->dp; | |
6257 053 | |
6258 054 /* copy the digits of a into W[0..a->used-1] */ | |
6259 055 for (ix = 0; ix < x->used; ix++) \{ | |
6260 056 *_W++ = *tmpx++; | |
6261 057 \} | |
6262 058 | |
6263 059 /* zero the high words of W[a->used..m->used*2] */ | |
6264 060 for (; ix < n->used * 2 + 1; ix++) \{ | |
6265 061 *_W++ = 0; | |
6266 062 \} | |
6267 063 \} | |
6268 064 | |
6269 065 /* now we proceed to zero successive digits | |
6270 066 * from the least significant upwards | |
6271 067 */ | |
6272 068 for (ix = 0; ix < n->used; ix++) \{ | |
6273 069 /* mu = ai * m' mod b | |
6274 070 * | |
6275 071 * We avoid a double precision multiplication (which isn't required) | |
6276 072 * by casting the value down to a mp_digit. Note this requires | |
6277 073 * that W[ix-1] have the carry cleared (see after the inner loop) | |
6278 074 */ | |
6279 075 register mp_digit mu; | |
6280 076 mu = (mp_digit) (((W[ix] & MP_MASK) * rho) & MP_MASK); | |
6281 077 | |
6282 078 /* a = a + mu * m * b**i | |
6283 079 * | |
6284 080 * This is computed in place and on the fly. The multiplication | |
6285 081 * by b**i is handled by offseting which columns the results | |
6286 082 * are added to. | |
6287 083 * | |
6288 084 * Note the comba method normally doesn't handle carries in the | |
6289 085 * inner loop In this case we fix the carry from the previous | |
6290 086 * column since the Montgomery reduction requires digits of the | |
6291 087 * result (so far) [see above] to work. This is | |
6292 088 * handled by fixing up one carry after the inner loop. The | |
6293 089 * carry fixups are done in order so after these loops the | |
6294 090 * first m->used words of W[] have the carries fixed | |
6295 091 */ | |
6296 092 \{ | |
6297 093 register int iy; | |
6298 094 register mp_digit *tmpn; | |
6299 095 register mp_word *_W; | |
6300 096 | |
6301 097 /* alias for the digits of the modulus */ | |
6302 098 tmpn = n->dp; | |
6303 099 | |
6304 100 /* Alias for the columns set by an offset of ix */ | |
6305 101 _W = W + ix; | |
6306 102 | |
6307 103 /* inner loop */ | |
6308 104 for (iy = 0; iy < n->used; iy++) \{ | |
6309 105 *_W++ += ((mp_word)mu) * ((mp_word)*tmpn++); | |
6310 106 \} | |
6311 107 \} | |
6312 108 | |
6313 109 /* now fix carry for next digit, W[ix+1] */ | |
6314 110 W[ix + 1] += W[ix] >> ((mp_word) DIGIT_BIT); | |
6315 111 \} | |
6316 112 | |
6317 113 /* now we have to propagate the carries and | |
6318 114 * shift the words downward [all those least | |
6319 115 * significant digits we zeroed]. | |
6320 116 */ | |
6321 117 \{ | |
6322 118 register mp_digit *tmpx; | |
6323 119 register mp_word *_W, *_W1; | |
6324 120 | |
6325 121 /* nox fix rest of carries */ | |
6326 122 | |
6327 123 /* alias for current word */ | |
6328 124 _W1 = W + ix; | |
6329 125 | |
6330 126 /* alias for next word, where the carry goes */ | |
6331 127 _W = W + ++ix; | |
6332 128 | |
6333 129 for (; ix <= n->used * 2 + 1; ix++) \{ | |
6334 130 *_W++ += *_W1++ >> ((mp_word) DIGIT_BIT); | |
6335 131 \} | |
6336 132 | |
6337 133 /* copy out, A = A/b**n | |
6338 134 * | |
6339 135 * The result is A/b**n but instead of converting from an | |
6340 136 * array of mp_word to mp_digit than calling mp_rshd | |
6341 137 * we just copy them in the right order | |
6342 138 */ | |
6343 139 | |
6344 140 /* alias for destination word */ | |
6345 141 tmpx = x->dp; | |
6346 142 | |
6347 143 /* alias for shifted double precision result */ | |
6348 144 _W = W + n->used; | |
6349 145 | |
6350 146 for (ix = 0; ix < n->used + 1; ix++) \{ | |
6351 147 *tmpx++ = (mp_digit)(*_W++ & ((mp_word) MP_MASK)); | |
6352 148 \} | |
6353 149 | |
6354 150 /* zero oldused digits, if the input a was larger than | |
6355 151 * m->used+1 we'll have to clear the digits | |
6356 152 */ | |
6357 153 for (; ix < olduse; ix++) \{ | |
6358 154 *tmpx++ = 0; | |
6359 155 \} | |
6360 156 \} | |
6361 157 | |
6362 158 /* set the max used and clamp */ | |
6363 159 x->used = n->used + 1; | |
6364 160 mp_clamp (x); | |
6365 161 | |
6366 162 /* if A >= m then A = A - m */ | |
6367 163 if (mp_cmp_mag (x, n) != MP_LT) \{ | |
6368 164 return s_mp_sub (x, n, x); | |
6369 165 \} | |
6370 166 return MP_OKAY; | |
6371 167 \} | |
6372 \end{alltt} | |
6373 \end{small} | |
6374 | |
6375 The $\hat W$ array is first filled with digits of $x$ on line 55 then the rest of the digits are zeroed on line 60. Both loops share | |
6376 the same alias variables to make the code easier to read. | |
6377 | |
6378 The value of $\mu$ is calculated in an interesting fashion. First the value $\hat W_{ix}$ is reduced modulo $\beta$ and cast to a mp\_digit. This | |
6379 forces the compiler to use a single precision multiplication and prevents any concerns about loss of precision. Line 110 fixes the carry | |
6380 for the next iteration of the loop by propagating the carry from $\hat W_{ix}$ to $\hat W_{ix+1}$. | |
6381 | |
6382 The for loop on line 129 propagates the rest of the carries upwards through the columns. The for loop on line 146 reduces the columns | |
6383 modulo $\beta$ and shifts them $k$ places at the same time. The alias $\_ \hat W$ actually refers to the array $\hat W$ starting at the $n.used$'th | |
6384 digit, that is $\_ \hat W_{t} = \hat W_{n.used + t}$. | |
6385 | |
6386 \subsection{Montgomery Setup} | |
6387 To calculate the variable $\rho$ a relatively simple algorithm will be required. | |
6388 | |
6389 \begin{figure}[!here] | |
6390 \begin{small} | |
6391 \begin{center} | |
6392 \begin{tabular}{l} | |
6393 \hline Algorithm \textbf{mp\_montgomery\_setup}. \\ | |
6394 \textbf{Input}. mp\_int $n$ ($n > 1$ and $(n, 2) = 1$) \\ | |
6395 \textbf{Output}. $\rho \equiv -1/n_0 \mbox{ (mod }\beta\mbox{)}$ \\ | |
6396 \hline \\ | |
6397 1. $b \leftarrow n_0$ \\ | |
6398 2. If $b$ is even return(\textit{MP\_VAL}) \\ | |
6399 3. $x \leftarrow (((b + 2) \mbox{ AND } 4) << 1) + b$ \\ | |
6400 4. for $k$ from 0 to $\lceil lg(lg(\beta)) \rceil - 2$ do \\ | |
6401 \hspace{3mm}4.1 $x \leftarrow x \cdot (2 - bx)$ \\ | |
6402 5. $\rho \leftarrow \beta - x \mbox{ (mod }\beta\mbox{)}$ \\ | |
6403 6. Return(\textit{MP\_OKAY}). \\ | |
6404 \hline | |
6405 \end{tabular} | |
6406 \end{center} | |
6407 \end{small} | |
6408 \caption{Algorithm mp\_montgomery\_setup} | |
6409 \end{figure} | |
6410 | |
6411 \textbf{Algorithm mp\_montgomery\_setup.} | |
6412 This algorithm will calculate the value of $\rho$ required within the Montgomery reduction algorithms. It uses a very interesting trick | |
6413 to calculate $1/n_0$ when $\beta$ is a power of two. | |
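
The trick relies on a step that doubles the number of correct low order bits of the inverse. As a short justification, suppose $bx \equiv 1 \mbox{ (mod }2^m\mbox{)}$ and write $bx = 1 + e2^m$ for some integer $e$ (the symbols $m$ and $e$ are introduced here only for this illustration). Then

\begin{center}
\begin{tabular}{rcl}
$b \cdot x(2 - bx)$ & $=$ & $(1 + e2^m)(1 - e2^m)$ \\
 & $=$ & $1 - e^22^{2m}$ \\
 & $\equiv$ & $1 \mbox{ (mod }2^{2m}\mbox{)}$ \\
\end{tabular}
\end{center}

so $x(2 - bx)$ is an inverse of $b$ modulo $2^{2m}$. Step 3 produces an inverse that is correct modulo $2^4$, hence each pass through step 4.1 extends it to $2^8$, $2^{16}$, $2^{32}$ and so on until at least $lg(\beta)$ bits are correct.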
6414 | |
6415 \vspace{+3mm}\begin{small} | |
6416 \hspace{-5.1mm}{\bf File}: bn\_mp\_montgomery\_setup.c | |
6417 \vspace{-3mm} | |
6418 \begin{alltt} | |
6419 016 | |
6420 017 /* setups the montgomery reduction stuff */ | |
6421 018 int | |
6422 019 mp_montgomery_setup (mp_int * n, mp_digit * rho) | |
6423 020 \{ | |
6424 021 mp_digit x, b; | |
6425 022 | |
6426 023 /* fast inversion mod 2**k | |
6427 024 * | |
6428 025 * Based on the fact that | |
6429 026 * | |
6430 027 * XA = 1 (mod 2**n) => (X(2-XA)) A = 1 (mod 2**2n) | |
6431 028 * => 2*X*A - X*X*A*A = 1 | |
6432 029 * => 2*(1) - (1) = 1 | |
6433 030 */ | |
6434 031 b = n->dp[0]; | |
6435 032 | |
6436 033 if ((b & 1) == 0) \{ | |
6437 034 return MP_VAL; | |
6438 035 \} | |
6439 036 | |
6440 037 x = (((b + 2) & 4) << 1) + b; /* here x*a==1 mod 2**4 */ | |
6441 038 x *= 2 - b * x; /* here x*a==1 mod 2**8 */ | |
6442 039 #if !defined(MP_8BIT) | |
6443 040 x *= 2 - b * x; /* here x*a==1 mod 2**16 */ | |
6444 041 #endif | |
6445 042 #if defined(MP_64BIT) || !(defined(MP_8BIT) || defined(MP_16BIT)) | |
6446 043 x *= 2 - b * x; /* here x*a==1 mod 2**32 */ | |
6447 044 #endif | |
6448 045 #ifdef MP_64BIT | |
6449 046 x *= 2 - b * x; /* here x*a==1 mod 2**64 */ | |
6450 047 #endif | |
6451 048 | |
6452 049 /* rho = -1/m mod b */ | |
6453 050 *rho = (((mp_digit) 1 << ((mp_digit) DIGIT_BIT)) - x) & MP_MASK; | |
6454 051 | |
6455 052 return MP_OKAY; | |
6456 053 \} | |
6457 \end{alltt} | |
6458 \end{small} | |
6459 | |
6460 This source code computes the value of $\rho$ required to perform Montgomery reduction. It has been modified to avoid performing excess | |
6461 multiplications when $\beta$ is not the default 28-bits. | |
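
The same computation can be written as a loop instead of a chain of preprocessor conditionals. The sketch below is an illustrative variant rather than the library routine: the name mont\_rho and the digit\_bit parameter are hypothetical, and it assumes digits of at most $32$ bits stored in a 32-bit unsigned type whose arithmetic wraps modulo $2^{32}$.

```c
#include <stdint.h>

/* Returns rho = -1/n0 (mod 2^digit_bit) for odd n0 and digit_bit <= 32.
 * Sketch only: the caller is assumed to have checked that n0 is odd. */
static uint32_t mont_rho(uint32_t n0, unsigned digit_bit)
{
    uint32_t mask = (digit_bit == 32) ? 0xFFFFFFFFu : ((1u << digit_bit) - 1u);
    uint32_t x = (((n0 + 2) & 4) << 1) + n0;   /* x * n0 == 1 (mod 2^4) */
    unsigned bits;
    for (bits = 4; bits < digit_bit; bits *= 2) {
        x *= 2 - n0 * x;                       /* doubles the correct bits */
    }
    return (0u - x) & mask;                    /* negate and reduce mod beta */
}
```

For example mont\_rho(1, 8) returns $255$ and mont\_rho(0xEF, 8) returns $241$, since $239 \cdot 15 \equiv 1 \mbox{ (mod }256\mbox{)}$ and $256 - 15 = 241$.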
6462 | |
6463 \section{The Diminished Radix Algorithm} | |
6464 The Diminished Radix method of modular reduction \cite{DRMET} is a fairly clever technique which can be more efficient than either the Barrett | |
6465 or Montgomery methods for certain forms of moduli. The technique is based on the following simple congruence. | |
6466 | |
6467 \begin{equation} | |
6468 (x \mbox{ mod } n) + k \lfloor x / n \rfloor \equiv x \mbox{ (mod }(n - k)\mbox{)} | |
6469 \end{equation} | |
6470 | |
6471 This observation was used in the MMB \cite{MMB} block cipher to create a diffusion primitive. It used the fact that if $n = 2^{31}$ and $k=1$ | |
6472 then an x86 multiplier could produce the 62-bit product and use the ``shrd'' instruction to perform a double-precision right shift. The proof | |
6473 of the above equation is very simple. First write $x$ in the product form. | |
6474 | |
6475 \begin{equation} | |
6476 x = qn + r | |
6477 \end{equation} | |
6478 | |
6479 Now reduce both sides modulo $(n - k)$. | |
6480 | |
6481 \begin{equation} | |
6482 x \equiv qk + r \mbox{ (mod }(n-k)\mbox{)} | |
6483 \end{equation} | |
6484 | |
6485 The variable $n$ reduces modulo $n - k$ to $k$. By putting $q = \lfloor x/n \rfloor$ and $r = x \mbox{ mod } n$ | |
6486 into the equation the original congruence is reproduced, thus concluding the proof. The following algorithm is based on this observation. | |
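
As a small numerical check of the congruence, take the illustrative values $x = 1000$, $n = 256$ and $k = 3$, so that $n - k = 253$.

\begin{center}
\begin{tabular}{rcl}
$(x \mbox{ mod } n) + k \lfloor x / n \rfloor$ & $=$ & $232 + 3 \cdot 3 = 241$ \\
$x \mbox{ mod } (n - k)$ & $=$ & $1000 - 3 \cdot 253 = 241$ \\
\end{tabular}
\end{center}

Here the left hand side is already smaller than $n - k$, so it equals the residue outright. In general it is only congruent to $x$ and may need further reduction.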
6487 | |
6488 \begin{figure}[!here] | |
6489 \begin{small} | |
6490 \begin{center} | |
6491 \begin{tabular}{l} | |
6492 \hline Algorithm \textbf{Diminished Radix Reduction}. \\ | |
6493 \textbf{Input}. Integer $x$, $n$, $k$ \\ | |
6494 \textbf{Output}. $x \mbox{ mod } (n - k)$ \\ | |
6495 \hline \\ | |
6496 1. $q \leftarrow \lfloor x / n \rfloor$ \\ | |
6497 2. $q \leftarrow k \cdot q$ \\ | |
6498 3. $x \leftarrow x \mbox{ (mod }n\mbox{)}$ \\ | |
6499 4. $x \leftarrow x + q$ \\ | |
6500 5. If $x \ge (n - k)$ then \\ | |
6501 \hspace{3mm}5.1 $x \leftarrow x - (n - k)$ \\ | |
6502 \hspace{3mm}5.2 Goto step 1. \\ | |
6503 6. Return $x$ \\ | |
6504 \hline | |
6505 \end{tabular} | |
6506 \end{center} | |
6507 \end{small} | |
6508 \caption{Algorithm Diminished Radix Reduction} | |
6509 \label{fig:DR} | |
6510 \end{figure} | |
6511 | |
6512 This algorithm will reduce $x$ modulo $n - k$ and return the residue. If $0 \le x < (n - k)^2$ then the algorithm will almost always loop | |
6513 once or twice and occasionally three times. For simplicity's sake the value of $x$ is bounded by the following simple polynomial. | |
6514 | |
6515 \begin{equation} | |
6516 0 \le x < n^2 + k^2 - 2nk | |
6517 \end{equation} | |
6518 | |
6519 The true bound is $0 \le x < (n - k - 1)^2$ but this has quite a few more terms. The value of $q$ after step 1 is bounded by the following. | |
6520 | |
6521 \begin{equation} | |
6522 q < n - 2k + k^2/n | |
6523 \end{equation} | |
6524 | |
6525 Since $k^2$ is going to be considerably smaller than $n$ that term will always be zero. The value of $x$ after step 3 is bounded trivially as | |
6526 $0 \le x < n$. Once step 2 has multiplied $q$ by $k$, the sum $x + q$ formed in step 4 is bounded by | |
6527 | |
6528 \begin{equation} | |
6529 0 \le q + x < (k + 1)n - 2k^2 - 1 | |
6530 \end{equation} | |
6531 | |
6532 With a second pass $q$ will be loosely bounded by $0 \le q < k^2$ after step 2 while $x$ will still be loosely bounded by $0 \le x < n$ after step 3. After the second pass it is highly unlikely that the | |
6533 sum in step 4 will exceed $n - k$. In practice fewer than three passes of the algorithm are required to reduce virtually every input in the | |
6534 range $0 \le x < (n - k - 1)^2$. | |
6535 | |
6536 \begin{figure} | |
6537 \begin{small} | |
6538 \begin{center} | |
6539 \begin{tabular}{|l|} | |
6540 \hline | |
6541 $x = 123456789, n = 256, k = 3$ \\ | |
6542 \hline $q \leftarrow \lfloor x/n \rfloor = 482253$ \\ | |
6543 $q \leftarrow q*k = 1446759$ \\ | |
6544 $x \leftarrow x \mbox{ mod } n = 21$ \\ | |
6545 $x \leftarrow x + q = 1446780$ \\ | |
6546 $x \leftarrow x - (n - k) = 1446527$ \\ | |
6547 \hline | |
6548 $q \leftarrow \lfloor x/n \rfloor = 5650$ \\ | |
6549 $q \leftarrow q*k = 16950$ \\ | |
6550 $x \leftarrow x \mbox{ mod } n = 127$ \\ | |
6551 $x \leftarrow x + q = 17077$ \\ | |
6552 $x \leftarrow x - (n - k) = 16824$ \\ | |
6553 \hline | |
6554 $q \leftarrow \lfloor x/n \rfloor = 65$ \\ | |
6555 $q \leftarrow q*k = 195$ \\ | |
6556 $x \leftarrow x \mbox{ mod } n = 184$ \\ | |
6557 $x \leftarrow x + q = 379$ \\ | |
6558 $x \leftarrow x - (n - k) = 126$ \\ | |
6559 \hline | |
6560 \end{tabular} | |
6561 \end{center} | |
6562 \end{small} | |
6563 \caption{Example Diminished Radix Reduction} | |
6564 \label{fig:EXDR} | |
6565 \end{figure} | |
6566 | |
6567 Figure~\ref{fig:EXDR} demonstrates the reduction of $x = 123456789$ modulo $n - k = 253$ when $n = 256$ and $k = 3$. Note that even though $x$ | |
6568 is considerably larger than $(n - k - 1)^2 = 63504$ the algorithm still converges on the modular residue exceedingly fast. In this case only | |
6569 three passes were required to find the residue $x \equiv 126$. | |
6570 | |
6571 | |
6572 \subsection{Choice of Moduli} | |
6573 On the surface this algorithm looks very expensive: each pass requires a division, a multiplication, a modular reduction and possibly a | |
6574 subtraction. The usefulness of this algorithm becomes exceedingly clear when an appropriate modulus is chosen. | |
6575 | |
6576 Division in general is a very expensive operation to perform. The one exception is when the division is by a power of the radix of representation used. | |
6577 Division by ten for example is simple for pencil and paper mathematics since it amounts to shifting the decimal place to the right. Similarly division | |
6578 by two (\textit{or powers of two}) is very simple for binary computers to perform. It would therefore seem logical to choose $n$ of the form $2^p$ | |
6579 which would imply that $\lfloor x / n \rfloor$ is a simple shift of $x$ right $p$ bits. | |
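
A minimal sketch of this observation on 64-bit integers follows. The helper names div\_pow2\_u64 and mod\_pow2\_u64 are hypothetical; they only illustrate that with $n = 2^p$ the quotient is a right shift and the remainder is a mask of the low $p$ bits.

```c
#include <assert.h>
#include <stdint.h>

/* floor(x / 2^p) as a right shift; assumes 0 <= p < 64. */
uint64_t div_pow2_u64(uint64_t x, unsigned p)
{
    assert(p < 64);
    return x >> p;
}

/* x mod 2^p as a mask of the low p bits; assumes 0 <= p < 64. */
uint64_t mod_pow2_u64(uint64_t x, unsigned p)
{
    assert(p < 64);
    return x & (((uint64_t)1 << p) - 1);
}
```

With $x = 1000$ and $p = 8$ these return $3$ and $232$ respectively, matching $1000 = 3 \cdot 256 + 232$.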
6580 | |
6581 However, there is one operation related to division by powers of two that is even faster than this. If $n = \beta^p$ then the division may be | |
6582 performed by moving whole digits to the right $p$ places. In practice division by $\beta^p$ is much faster than division by $2^p$ for any $p$. | |
6583 Also with the choice of $n = \beta^p$ reducing $x$ modulo $n$ merely requires zeroing the digits above the $p-1$'th digit of $x$. | |
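
The digit-level versions of both operations might look as follows. This is a sketch on a plain array of digits stored least significant first; the type digit\_t and both function names are assumptions made for illustration and are not the library's own routines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t digit_t;   /* hypothetical digit type, least significant first */

/* floor(x / beta^p): move whole digits down p places. Returns the new digit count. */
size_t digits_div_beta_p(digit_t *x, size_t used, size_t p)
{
    assert(x != NULL);
    if (p >= used) {
        memset(x, 0, used * sizeof *x);
        return 0;
    }
    memmove(x, x + p, (used - p) * sizeof *x);
    memset(x + (used - p), 0, p * sizeof *x);
    return used - p;
}

/* x mod beta^p: zero every digit at index p or above. Returns the new digit count. */
size_t digits_mod_beta_p(digit_t *x, size_t used, size_t p)
{
    assert(x != NULL);
    if (p < used) {
        memset(x + p, 0, (used - p) * sizeof *x);
        used = p;
    }
    while (used > 0 && x[used - 1] == 0) {
        --used;              /* drop leading zero digits */
    }
    return used;
}
```

Neither routine inspects individual bits, which is why the choice $n = \beta^p$ is cheaper than a general $2^p$.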
6584 | |
6585 Throughout the next section the term ``restricted modulus'' will refer to a modulus of the form $\beta^p - k$ whereas the term ``unrestricted | |
6586 modulus'' will refer to a modulus of the form $2^p - k$. The word ``restricted'' in this case refers to the fact that it is based on the | |
6587 $2^p$ logic except $p$ must be a multiple of $lg(\beta)$. | |
6588 | |
6589 \subsection{Choice of $k$} | |
6590 Now that division and reduction (\textit{steps 1 and 3 of figure~\ref{fig:DR}}) have been optimized to simple digit operations the multiplication by $k$ | |
6591 in step 2 is the most expensive operation. Fortunately the choice of $k$ is not terribly limited. For all intents and purposes it might | |
6592 as well be a single digit. The smaller the value of $k$ is the faster the algorithm will be. | |
6593 | |
6594 \subsection{Restricted Diminished Radix Reduction} | |
6595 The restricted Diminished Radix algorithm can quickly reduce an input modulo a modulus of the form $n = \beta^p - k$. This algorithm can reduce | |
6596 an input $x$ within the range $0 \le x < n^2$ using only a couple of passes of the algorithm demonstrated in figure~\ref{fig:DR}. The implementation | |
6597 of this algorithm has been optimized to avoid additional overhead associated with a division by $\beta^p$, the multiplication by $k$ or the addition | |
6598 of $x$ and $q$. The resulting algorithm is very efficient and can lead to substantial improvements over Barrett and Montgomery reduction when modular | |
6599 exponentiations are performed. | |
6600 | |
6601 \newpage\begin{figure}[!here] | |
6602 \begin{small} | |
6603 \begin{center} | |
6604 \begin{tabular}{l} | |
6605 \hline Algorithm \textbf{mp\_dr\_reduce}. \\ | |
6606 \textbf{Input}. mp\_int $x$, $n$ and a mp\_digit $k = \beta - n_0$ \\ | |
6607 \hspace{11.5mm}($0 \le x < n^2$, $n > 1$, $0 < k < \beta$) \\ | |
6608 \textbf{Output}. $x \mbox{ mod } n$ \\ | |
6609 \hline \\ | |
6610 1. $m \leftarrow n.used$ \\ | |
6611 2. If $x.alloc < 2m$ then grow $x$ to $2m$ digits. \\ | |
6612 3. $\mu \leftarrow 0$ \\ | |
6613 4. for $i$ from $0$ to $m - 1$ do \\ | |
6614 \hspace{3mm}4.1 $\hat r \leftarrow k \cdot x_{m+i} + x_{i} + \mu$ \\ | |
6615 \hspace{3mm}4.2 $x_{i} \leftarrow \hat r \mbox{ (mod }\beta\mbox{)}$ \\ | |
6616 \hspace{3mm}4.3 $\mu \leftarrow \lfloor \hat r / \beta \rfloor$ \\ | |
6617 5. $x_{m} \leftarrow \mu$ \\ | |
6618 6. for $i$ from $m + 1$ to $x.used - 1$ do \\ | |
6619 \hspace{3mm}6.1 $x_{i} \leftarrow 0$ \\ | |
6620 7. Clamp excess digits of $x$. \\ | |
6621 8. If $x \ge n$ then \\ | |
6622 \hspace{3mm}8.1 $x \leftarrow x - n$ \\ | |
6623 \hspace{3mm}8.2 Goto step 3. \\ | |
6624 9. Return(\textit{MP\_OKAY}). \\ | |
6625 \hline | |
6626 \end{tabular} | |
6627 \end{center} | |
6628 \end{small} | |
6629 \caption{Algorithm mp\_dr\_reduce} | |
6630 \end{figure} | |
6631 | |
6632 \textbf{Algorithm mp\_dr\_reduce.} | |
6633 This algorithm will perform the Diminished Radix reduction of $x$ modulo $n$. It has similar restrictions to that of the Barrett reduction | |
6634 with the addition that $n$ must be of the form $n = \beta^m - k$ where $0 < k <\beta$. | |
6635 | |
6636 This algorithm essentially implements the pseudo-code in figure~\ref{fig:DR} except with a slight optimization. The division by $\beta^m$, multiplication by $k$ | |
6637 and addition of $x \mbox{ mod }\beta^m$ are all performed simultaneously inside the loop on step 4. The division by $\beta^m$ is emulated by accessing | |
6638 the term at the $m+i$'th position which is subsequently multiplied by $k$ and added to the term at the $i$'th position. After the loop the $m$'th | |
6639 digit is set to the carry and the upper digits are zeroed. Steps 5 and 6 emulate the reduction modulo $\beta^m$ that should have happened to | |
6640 $x$ before the addition of the multiple of the upper half. | |
6641 | |
6642 At step 8 if $x$ is still greater than or equal to $n$ another pass of the algorithm is required. First $n$ is subtracted from $x$ and then the algorithm resumes | |
6643 at step 3. | |
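
As an illustration, consider a toy decimal radix $\beta = 10$ (an assumption made only for this example; the library uses a power of two) with $n = 97 = 10^2 - 3$, so that $m = 2$ and $k = 3$. Let $x = 5000$, whose digits are $x_3 = 5$ and $x_2 = x_1 = x_0 = 0$. The loop on step 4 then computes

\begin{eqnarray*}
i = 0: & \hat r = 3 \cdot x_2 + x_0 + 0 = 0, & x_0 \leftarrow 0, \hspace{2mm} \mu \leftarrow 0 \\
i = 1: & \hat r = 3 \cdot x_3 + x_1 + 0 = 15, & x_1 \leftarrow 5, \hspace{2mm} \mu \leftarrow 1
\end{eqnarray*}

Step 5 sets $x_2 \leftarrow 1$ and step 6 clears $x_3$, leaving $x = 150$. Since $150 \ge 97$ step 8 subtracts $n$ to give $x = 53$, and the second pass leaves this value unchanged because the upper digits are now zero. Indeed $5000 = 51 \cdot 97 + 53$.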
6644 | |
6645 \vspace{+3mm}\begin{small} | |
6646 \hspace{-5.1mm}{\bf File}: bn\_mp\_dr\_reduce.c | |
6647 \vspace{-3mm} | |
6648 \begin{alltt} | |
6649 016 | |
6650 017 /* reduce "x" in place modulo "n" using the Diminished Radix algorithm. | |
6651 018 * | |
6652 019 * Based on algorithm from the paper | |
6653 020 * | |
6654 021 * "Generating Efficient Primes for Discrete Log Cryptosystems" | |
6655 022 * Chae Hoon Lim, Pil Loong Lee, | |
6656 023 * POSTECH Information Research Laboratories | |
6657 024 * | |
6658 025 * The modulus must be of a special format [see manual] | |
6659 026 * | |
6660 027 * Has been modified to use algorithm 7.10 from the LTM book instead | |
6661 028 * | |
6662 029 * Input x must be in the range 0 <= x <= (n-1)**2 | |
6663 030 */ | |
6664 031 int | |
6665 032 mp_dr_reduce (mp_int * x, mp_int * n, mp_digit k) | |
6666 033 \{ | |
6667 034 int err, i, m; | |
6668 035 mp_word r; | |
6669 036 mp_digit mu, *tmpx1, *tmpx2; | |
6670 037 | |
6671 038 /* m = digits in modulus */ | |
6672 039 m = n->used; | |
6673 040 | |
6674 041 /* ensure that "x" has at least 2m digits */ | |
6675 042 if (x->alloc < m + m) \{ | |
6676 043 if ((err = mp_grow (x, m + m)) != MP_OKAY) \{ | |
6677 044 return err; | |
6678 045 \} | |
6679 046 \} | |
6680 047 | |
6681 048 /* top of loop, this is where the code resumes if | |
6682 049 * another reduction pass is required. | |
6683 050 */ | |
6684 051 top: | |
6685 052 /* aliases for digits */ | |
6686 053 /* alias for lower half of x */ | |
6687 054 tmpx1 = x->dp; | |
6688 055 | |
6689 056 /* alias for upper half of x, or x/B**m */ | |
6690 057 tmpx2 = x->dp + m; | |
6691 058 | |
6692 059 /* set carry to zero */ | |
6693 060 mu = 0; | |
6694 061 | |
6695 062 /* compute (x mod B**m) + k * [x/B**m] inline and inplace */ | |
6696 063 for (i = 0; i < m; i++) \{ | |
6697 064 r = ((mp_word)*tmpx2++) * ((mp_word)k) + *tmpx1 + mu; | |
6698 065 *tmpx1++ = (mp_digit)(r & MP_MASK); | |
6699 066 mu = (mp_digit)(r >> ((mp_word)DIGIT_BIT)); | |
6700 067 \} | |
6701 068 | |
6702 069 /* set final carry */ | |
6703 070 *tmpx1++ = mu; | |
6704 071 | |
6705 072 /* zero words above m */ | |
6706 073 for (i = m + 1; i < x->used; i++) \{ | |
6707 074 *tmpx1++ = 0; | |
6708 075 \} | |
6709 076 | |
6710 077 /* clamp, sub and return */ | |
6711 078 mp_clamp (x); | |
6712 079 | |
6713 080 /* if x >= n then subtract and reduce again | |
6714 081 * Each successive "recursion" makes the input smaller and smaller. | |
6715 082 */ | |
6716 083 if (mp_cmp_mag (x, n) != MP_LT) \{ | |
6717 084 s_mp_sub(x, n, x); | |
6718 085 goto top; | |
6719 086 \} | |
6720 087 return MP_OKAY; | |
6721 088 \} | |
6722 \end{alltt} | |
6723 \end{small} | |
6724 | |
6725 The first step is to grow $x$ as required to $2m$ digits since the reduction is performed in place on $x$. The label on line 51 is where | |
6726 the algorithm will resume if further reduction passes are required. In theory it could be placed at the top of the function; however, the size of | |
6727 the modulus and the question of whether $x$ is large enough are invariant after the first pass, meaning that it would be a waste of time. | |
6728 | |
6729 The aliases $tmpx1$ and $tmpx2$ refer to the digits of $x$ where the latter is offset by $m$ digits. By reading digits from $x$ offset by $m$ digits | |
6730 a division by $\beta^m$ can be simulated virtually for free. The loop on line 63 performs the bulk of the work (\textit{corresponds to step 4 of algorithm 7.11}) | |
6731 in this algorithm. | |
6732 | |
6733 By line 70 the pointer $tmpx1$ points to the $m$'th digit of $x$ which is where the final carry will be placed. Similarly by line 73 the | |
6734 same pointer will point to the $m+1$'th digit where the zeroes will be placed. | |
6735 | |
6736 Since the algorithm is only valid if both $x$ and $n$ are greater than zero an unsigned comparison suffices to determine if another pass is required. | |
6737 With the same logic at line 84 the value of $x$ is known to be greater than or equal to $n$ meaning that an unsigned subtraction can be used | |
6738 as well. Since the destination of the subtraction is the larger of the inputs the call to algorithm s\_mp\_sub cannot fail and the return code | |
6739 does not need to be checked. | |
6740 | |
6741 \subsubsection{Setup} | |
6742 To set up the restricted Diminished Radix algorithm the value $k = \beta - n_0$ is required. This algorithm is not really complicated but is provided for | |
6743 completeness. | |
6744 | |
6745 \begin{figure}[!here] | |
6746 \begin{small} | |
6747 \begin{center} | |
6748 \begin{tabular}{l} | |
6749 \hline Algorithm \textbf{mp\_dr\_setup}. \\ | |
6750 \textbf{Input}. mp\_int $n$ \\ | |
6751 \textbf{Output}. $k = \beta - n_0$ \\ | |
6752 \hline \\ | |
6753 1. $k \leftarrow \beta - n_0$ \\ | |
6754 \hline | |
6755 \end{tabular} | |
6756 \end{center} | |
6757 \end{small} | |
6758 \caption{Algorithm mp\_dr\_setup} | |
6759 \end{figure} | |
6760 | |
6761 \vspace{+3mm}\begin{small} | |
6762 \hspace{-5.1mm}{\bf File}: bn\_mp\_dr\_setup.c | |
6763 \vspace{-3mm} | |
6764 \begin{alltt} | |
6765 016 | |
6766 017 /* determines the setup value */ | |
6767 018 void mp_dr_setup(mp_int *a, mp_digit *d) | |
6768 019 \{ | |
6769 020 /* the casts are required if DIGIT_BIT is one less than | |
6770 021 * the number of bits in a mp_digit [e.g. DIGIT_BIT==31] | |
6771 022 */ | |
6772 023 *d = (mp_digit)((((mp_word)1) << ((mp_word)DIGIT_BIT)) - | |
6773 024 ((mp_word)a->dp[0])); | |
6774 025 \} | |
6775 026 | |
6776 \end{alltt} | |
6777 \end{small} | |
6778 | |
6779 \subsubsection{Modulus Detection} | |
6780 Another useful algorithm is one that detects a restricted Diminished Radix modulus. An integer is said to be | |
6781 of restricted Diminished Radix form if all of the digits are equal to $\beta - 1$ except the least significant digit, which may be any value. | |
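
For example, with a toy decimal radix $\beta = 10$ (assumed purely for illustration) the integer $9997$ is of restricted Diminished Radix form, since every digit above the least significant one equals $\beta - 1 = 9$, whereas $9897$ is not. The corresponding value of $k$ follows directly from the low digit.

\begin{equation}
9997 = 10^4 - 3, \hspace{1em} k = \beta - n_0 = 10 - 7 = 3
\end{equation}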
6782 | |
6783 \begin{figure}[!here] | |
6784 \begin{small} | |
6785 \begin{center} | |
6786 \begin{tabular}{l} | |
6787 \hline Algorithm \textbf{mp\_dr\_is\_modulus}. \\ | |
6788 \textbf{Input}. mp\_int $n$ \\ | |
6789 \textbf{Output}. $1$ if $n$ is in D.R. form, $0$ otherwise \\ | |
6790 \hline | |
6791 1. If $n.used < 2$ then return($0$). \\ | |
6792 2. for $ix$ from $1$ to $n.used - 1$ do \\ | |
6793 \hspace{3mm}2.1 If $n_{ix} \ne \beta - 1$ return($0$). \\ | |
6794 3. Return($1$). \\ | |
6795 \hline | |
6796 \end{tabular} | |
6797 \end{center} | |
6798 \end{small} | |
6799 \caption{Algorithm mp\_dr\_is\_modulus} | |
6800 \end{figure} | |
6801 | |
6802 \textbf{Algorithm mp\_dr\_is\_modulus.} | |
6803 This algorithm determines if a value is in Diminished Radix form. Step 1 rejects obvious cases where fewer than two digits are | |
6804 in the mp\_int. Step 2 tests all but the first digit to see if they are equal to $\beta - 1$. If the algorithm manages to get to | |
6805 step 3 then $n$ must be of Diminished Radix form. | |
6806 | |
6807 \vspace{+3mm}\begin{small} | |
6808 \hspace{-5.1mm}{\bf File}: bn\_mp\_dr\_is\_modulus.c | |
6809 \vspace{-3mm} | |
6810 \begin{alltt} | |
6811 016 | |
6812 017 /* determines if a number is a valid DR modulus */ | |
6813 018 int mp_dr_is_modulus(mp_int *a) | |
6814 019 \{ | |
6815 020 int ix; | |
6816 021 | |
6817 022 /* must be at least two digits */ | |
6818 023 if (a->used < 2) \{ | |
6819 024 return 0; | |
6820 025 \} | |
6821 026 | |
6822 027 /* must be of the form b**k - a [a <= b] so all | |
6823 028 * but the first digit must be equal to -1 (mod b). | |
6824 029 */ | |
6825 030 for (ix = 1; ix < a->used; ix++) \{ | |
6826 031 if (a->dp[ix] != MP_MASK) \{ | |
6827 032 return 0; | |
6828 033 \} | |
6829 034 \} | |
6830 035 return 1; | |
6831 036 \} | |
6832 037 | |
6833 \end{alltt} | |
6834 \end{small} | |
6835 | |
6836 \subsection{Unrestricted Diminished Radix Reduction} | |
6837 The unrestricted Diminished Radix algorithm allows modular reductions to be performed when the modulus is of the form $2^p - k$. This algorithm | |
6838 is a straightforward adaptation of algorithm~\ref{fig:DR}. | |
6839 | |
6840 In general the restricted Diminished Radix reduction algorithm is much faster since it has considerably lower overhead. However, this new | |
6841 algorithm is much faster than either Montgomery or Barrett reduction when the moduli are of the appropriate form. | |
6842 | |
6843 \begin{figure}[!here] | |
6844 \begin{small} | |
6845 \begin{center} | |
6846 \begin{tabular}{l} | |
6847 \hline Algorithm \textbf{mp\_reduce\_2k}. \\ | |
6848 \textbf{Input}. mp\_int $a$ and $n$. mp\_digit $k$ \\ | |
6849 \hspace{11.5mm}($a \ge 0$, $n > 1$, $0 < k < \beta$, $n + k$ is a power of two) \\ | |
6850 \textbf{Output}. $a \mbox{ (mod }n\mbox{)}$ \\ | |
6851 \hline | |
6852 1. $p \leftarrow \lceil lg(n) \rceil$ (\textit{mp\_count\_bits}) \\ | |
6853 2. While $a \ge n$ do \\ | |
6854 \hspace{3mm}2.1 $q \leftarrow \lfloor a / 2^p \rfloor$ (\textit{mp\_div\_2d}) \\ | |
6855 \hspace{3mm}2.2 $a \leftarrow a \mbox{ (mod }2^p\mbox{)}$ (\textit{mp\_mod\_2d}) \\ | |
6856 \hspace{3mm}2.3 $q \leftarrow q \cdot k$ (\textit{mp\_mul\_d}) \\ | |
6857 \hspace{3mm}2.4 $a \leftarrow a + q$ (\textit{s\_mp\_add}) \\ | |
6858 \hspace{3mm}2.5 If $a \ge n$ then do \\ | |
6859 \hspace{6mm}2.5.1 $a \leftarrow a - n$ \\ | |
6860 3. Return(\textit{MP\_OKAY}). \\ | |
6861 \hline | |
6862 \end{tabular} | |
6863 \end{center} | |
6864 \end{small} | |
6865 \caption{Algorithm mp\_reduce\_2k} | |
6866 \end{figure} | |
6867 | |
6868 \textbf{Algorithm mp\_reduce\_2k.} | |
6869 This algorithm quickly reduces an input $a$ modulo an unrestricted Diminished Radix modulus $n$. Division by $2^p$ is emulated with a right | |
6870 shift which makes the algorithm fairly inexpensive to use. | |
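
The same reduction can be sketched on 64-bit integers. The function name reduce\_2k\_u64 and its parameter layout (passing $p$ and $k$ rather than $n$) are assumptions made for this illustration. Because $2^p \equiv k \mbox{ (mod }n\mbox{)}$ the scaled quotient is added to the low $p$ bits, and since $k < 2^p$ the sum never exceeds the value of $a$ at the start of the pass.

```c
#include <assert.h>
#include <stdint.h>

/* Reduce a modulo n = 2^p - k on 64-bit integers (a sketch, not the mp_int code).
 * Assumes 0 < p < 64 and 0 < k < 2^p so that n is positive. */
uint64_t reduce_2k_u64(uint64_t a, unsigned p, uint64_t k)
{
    assert(p > 0 && p < 64);
    const uint64_t two_p = (uint64_t)1 << p;
    assert(k > 0 && k < two_p);
    const uint64_t n = two_p - k;

    while (a >= n) {
        uint64_t q = a >> p;     /* q = floor(a / 2^p) */
        a &= two_p - 1;          /* a = a mod 2^p */
        q *= k;                  /* q = q * k */
        a += q;                  /* fold the scaled quotient back in */
        if (a >= n) {
            a -= n;              /* one subtraction before the next pass */
        }
    }
    return a;
}
```

For instance reduce\_2k\_u64(123456789, 13, 1) reduces modulo $2^{13} - 1 = 8191$ and returns $2037$.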
6871 | |
6872 \vspace{+3mm}\begin{small} | |
6873 \hspace{-5.1mm}{\bf File}: bn\_mp\_reduce\_2k.c | |
6874 \vspace{-3mm} | |
6875 \begin{alltt} | |
6876 016 | |
6877 017 /* reduces a modulo n where n is of the form 2**p - d */ | |
6878 018 int | |
6879 019 mp_reduce_2k(mp_int *a, mp_int *n, mp_digit d) | |
6880 020 \{ | |
6881 021 mp_int q; | |
6882 022 int p, res; | |
6883 023 | |
6884 024 if ((res = mp_init(&q)) != MP_OKAY) \{ | |
6885 025 return res; | |
6886 026 \} | |
6887 027 | |
6888 028 p = mp_count_bits(n); | |
6889 029 top: | |
6890 030 /* q = a/2**p, a = a mod 2**p */ | |
6891 031 if ((res = mp_div_2d(a, p, &q, a)) != MP_OKAY) \{ | |
6892 032 goto ERR; | |
6893 033 \} | |
6894 034 | |
6895 035 if (d != 1) \{ | |
6896 036 /* q = q * d */ | |
6897 037 if ((res = mp_mul_d(&q, d, &q)) != MP_OKAY) \{ | |
6898 038 goto ERR; | |
6899 039 \} | |
6900 040 \} | |
6901 041 | |
6902 042 /* a = a + q */ | |
6903 043 if ((res = s_mp_add(a, &q, a)) != MP_OKAY) \{ | |
6904 044 goto ERR; | |
6905 045 \} | |
6906 046 | |
6907 047 if (mp_cmp_mag(a, n) != MP_LT) \{ | |
6908 048 s_mp_sub(a, n, a); | |
6909 049 goto top; | |
6910 050 \} | |
6911 051 | |
6912 052 ERR: | |
6913 053 mp_clear(&q); | |
6914 054 return res; | |
6915 055 \} | |
6916 056 | |
6917 \end{alltt} | |
6918 \end{small} | |
6919 | |
6920 The algorithm mp\_count\_bits calculates the number of bits in an mp\_int which is used to find the initial value of $p$. The call to mp\_div\_2d | |
6921 on line 31 calculates both the quotient $q$ and the remainder $a$ required. By doing both in a single function call the code size | |
6922 is kept fairly small. The multiplication by $k$ is only performed if $k > 1$. This allows reductions modulo $2^p - 1$ to be performed without | |
6923 any multiplications. | |
6924 | |
6925 The unsigned s\_mp\_add, mp\_cmp\_mag and s\_mp\_sub are used in place of their full sign counterparts since the inputs are only valid if they are | |
6926 positive. By using the unsigned versions the overhead is kept to a minimum. | |
6927 | |
6928 \subsubsection{Unrestricted Setup} | |
6929 To set up this reduction algorithm the value of $k = 2^p - n$ is required. | |
6930 | |
6931 \begin{figure}[!here] | |
6932 \begin{small} | |
6933 \begin{center} | |
6934 \begin{tabular}{l} | |
6935 \hline Algorithm \textbf{mp\_reduce\_2k\_setup}. \\ | |
6936 \textbf{Input}. mp\_int $n$ \\ | |
6937 \textbf{Output}. $k = 2^p - n$ \\ | |
6938 \hline | |
6939 1. $p \leftarrow \lceil lg(n) \rceil$ (\textit{mp\_count\_bits}) \\ | |
6940 2. $x \leftarrow 2^p$ (\textit{mp\_2expt}) \\ | |
6941 3. $x \leftarrow x - n$ (\textit{s\_mp\_sub}) \\ | |
6942 4. $k \leftarrow x_0$ \\ | |
6943 5. Return(\textit{MP\_OKAY}). \\ | |
6944 \hline | |
6945 \end{tabular} | |
6946 \end{center} | |
6947 \end{small} | |
6948 \caption{Algorithm mp\_reduce\_2k\_setup} | |
6949 \end{figure} | |
6950 | |
6951 \textbf{Algorithm mp\_reduce\_2k\_setup.} | |
6952 This algorithm computes the value of $k$ required for the algorithm mp\_reduce\_2k. By making a temporary variable $x$ equal to $2^p$ a subtraction | |
6953 is sufficient to solve for $k$. Alternatively if $n$ has more than one digit the value of $k$ is simply $\beta - n_0$. | |
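
On machine integers the setup can be sketched in a few lines. Both function names below are hypothetical stand-ins (count\_bits\_u64 plays the role of mp\_count\_bits) and the sketch assumes $0 < n < 2^{63}$ so that $2^p$ fits in 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Number of bits needed to represent n (0 for n == 0). */
unsigned count_bits_u64(uint64_t n)
{
    unsigned bits = 0;
    while (n != 0) {
        ++bits;
        n >>= 1;
    }
    return bits;
}

/* k = 2^p - n where p is the bit length of n; assumes 0 < n < 2^63. */
uint64_t reduce_2k_setup_u64(uint64_t n)
{
    assert(n > 0 && n < ((uint64_t)1 << 63));
    const unsigned p = count_bits_u64(n);
    return ((uint64_t)1 << p) - n;
}
```

For $n = 253$ the bit length is $p = 8$ and the result is $k = 256 - 253 = 3$.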
6954 | |
6955 \vspace{+3mm}\begin{small} | |
6956 \hspace{-5.1mm}{\bf File}: bn\_mp\_reduce\_2k\_setup.c | |
6957 \vspace{-3mm} | |
6958 \begin{alltt} | |
6959 016 | |
6960 017 /* determines the setup value */ | |
6961 018 int | |
6962 019 mp_reduce_2k_setup(mp_int *a, mp_digit *d) | |
6963 020 \{ | |
6964 021 int res, p; | |
6965 022 mp_int tmp; | |
6966 023 | |
6967 024 if ((res = mp_init(&tmp)) != MP_OKAY) \{ | |
6968 025 return res; | |
6969 026 \} | |
6970 027 | |
6971 028 p = mp_count_bits(a); | |
6972 029 if ((res = mp_2expt(&tmp, p)) != MP_OKAY) \{ | |
6973 030 mp_clear(&tmp); | |
6974 031 return res; | |
6975 032 \} | |
6976 033 | |
6977 034 if ((res = s_mp_sub(&tmp, a, &tmp)) != MP_OKAY) \{ | |
6978 035 mp_clear(&tmp); | |
6979 036 return res; | |
6980 037 \} | |
6981 038 | |
6982 039 *d = tmp.dp[0]; | |
6983 040 mp_clear(&tmp); | |
6984 041 return MP_OKAY; | |
6985 042 \} | |
6986 \end{alltt} | |
6987 \end{small} | |
6988 | |
6989 \subsubsection{Unrestricted Detection} | |
6990 An integer $n$ is a valid unrestricted Diminished Radix modulus if either of the following is true. | |
6991 | |
6992 \begin{enumerate} | |
6993 \item The number has only one digit. | |
6994 \item The number has more than one digit and every bit from the $lg(\beta)$'th to the most significant is one. | |
6995 \end{enumerate} | |
6996 | |
6997 If either condition is true then there is a power of two $2^p$ such that $0 < 2^p - n < \beta$. If the input is only | |
6998 one digit then it will always be of the correct form. Otherwise all of the bits above the first digit must be one. This arises from the fact | |
6999 that there will be a value of $k$ that when added to the modulus causes a carry in the first digit which propagates all the way to the most | |
7000 significant bit. The resulting sum will be a power of two. | |
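
As a small illustration assume a toy digit size of $lg(\beta) = 4$ bits, so $\beta = 16$ (a value chosen only for this example). The integer $n = 4085 = 111111110101_2$ has $p = 12$ bits and every bit above the first digit is one.

\begin{equation}
k = 2^{12} - 4085 = 11 < \beta
\end{equation}

Adding $k = 1011_2$ to the low digit $0101_2$ gives $10000_2$, and the resulting carry ripples through the eight one bits above it to produce $2^{12}$.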
7001 | |
7002 \begin{figure}[!here] | |
7003 \begin{small} | |
7004 \begin{center} | |
7005 \begin{tabular}{l} | |
7006 \hline Algorithm \textbf{mp\_reduce\_is\_2k}. \\ | |
7007 \textbf{Input}. mp\_int $n$ \\ | |
7008 \textbf{Output}. $1$ if of proper form, $0$ otherwise \\ | |
7009 \hline | |
7010 1. If $n.used = 0$ then return($0$). \\ | |
7011 2. If $n.used = 1$ then return($1$). \\ | |
7012 3. $p \leftarrow \lceil lg(n) \rceil$ (\textit{mp\_count\_bits}) \\ | |
7013 4. for $x$ from $lg(\beta)$ to $p - 1$ do \\ | |
7014 \hspace{3mm}4.1 If the ($x \mbox{ mod }lg(\beta)$)'th bit of the $\lfloor x / lg(\beta) \rfloor$'th digit of $n$ is zero then return($0$). \\ | |
7015 5. Return($1$). \\ | |
7016 \hline | |
7017 \end{tabular} | |
7018 \end{center} | |
7019 \end{small} | |
7020 \caption{Algorithm mp\_reduce\_is\_2k} | |
7021 \end{figure} | |
7022 | |
7023 \textbf{Algorithm mp\_reduce\_is\_2k.} | |
7024 This algorithm quickly determines if a modulus is of the form required for algorithm mp\_reduce\_2k to function properly. | |
7025 | |
7026 \vspace{+3mm}\begin{small} | |
7027 \hspace{-5.1mm}{\bf File}: bn\_mp\_reduce\_is\_2k.c | |
7028 \vspace{-3mm} | |
7029 \begin{alltt} | |
7030 016 | |
7031 017 /* determines if mp_reduce_2k can be used */ | |
7032 018 int mp_reduce_is_2k(mp_int *a) | |
7033 019 \{ | |
7034 020 int ix, iy, iz, iw; | |
7035 021 | |
7036 022 if (a->used == 0) \{ | |
7037 023 return 0; | |
7038 024 \} else if (a->used == 1) \{ | |
7039 025 return 1; | |
7040 026 \} else if (a->used > 1) \{ | |
7041 027 iy = mp_count_bits(a); | |
7042 028 iz = 1; | |
7043 029 iw = 1; | |
7044 030 | |
7045 031 /* Test every bit from the second digit up, must be 1 */ | |
7046 032 for (ix = DIGIT_BIT; ix < iy; ix++) \{ | |
7047 033 if ((a->dp[iw] & iz) == 0) \{ | |
7048 034 return 0; | |
7049 035 \} | |
7050 036 iz <<= 1; | |
7051 037 if (iz > (int)MP_MASK) \{ | |
7052 038 ++iw; | |
7053 039 iz = 1; | |
7054 040 \} | |
7055 041 \} | |
7056 042 \} | |
7057 043 return 1; | |
7058 044 \} | |
7059 045 | |
7060 \end{alltt} | |
7061 \end{small} | |
7062 | |
7063 | |
7064 | |
7065 \section{Algorithm Comparison} | |
7066 So far three very different algorithms for modular reduction have been discussed. Each of the algorithms has its own strengths and weaknesses | |
7067 that make having such a selection very useful. The following table summarizes the three algorithms along with comparisons of work factors. Since | |
7068 all three algorithms have the restriction that $0 \le x < n^2$ and $n > 1$ those limitations are not included in the table. | |
7069 | |
7070 \begin{center} | |
7071 \begin{small} | |
7072 \begin{tabular}{|c|c|c|c|c|c|} | |
7073 \hline \textbf{Method} & \textbf{Work Required} & \textbf{Limitations} & \textbf{$m = 8$} & \textbf{$m = 32$} & \textbf{$m = 64$} \\ | |
7074 \hline Barrett & $m^2 + 2m - 1$ & None & $79$ & $1087$ & $4223$ \\ | |
7075 \hline Montgomery & $m^2 + m$ & $n$ must be odd & $72$ & $1056$ & $4160$ \\ | |
7076 \hline D.R. & $2m$ & $n = \beta^m - k$ & $16$ & $64$ & $128$ \\ | |
7077 \hline | |
7078 \end{tabular} | |
7079 \end{small} | |
7080 \end{center} | |
7081 | |
7082 In theory Montgomery and Barrett reductions would require roughly the same amount of time to complete. However, in practice since Montgomery | |
7083 reduction can be written as a single function with the Comba technique it is much faster. Barrett reduction suffers from the overhead of | |
7084 calling the half precision multipliers, addition and division by $\beta$ algorithms. | |
7085 | |
7086 For almost every cryptographic algorithm Montgomery reduction is the algorithm of choice. The one set of algorithms where Diminished Radix reduction truly | |
7087 shines is those based on the discrete logarithm problem, such as Diffie-Hellman \cite{DH} and ElGamal \cite{ELGAMAL}. In these algorithms | |
7088 primes of the form $\beta^m - k$ can be found and shared amongst users. These primes will allow the Diminished Radix algorithm to be used in | |
7089 modular exponentiation to greatly speed up the operation. | |
7090 | |
7091 | |
7092 | |
7093 \section*{Exercises} | |
7094 \begin{tabular}{cl} | |
7095 $\left [ 3 \right ]$ & Prove that the ``trick'' in algorithm mp\_montgomery\_setup actually \\ | |
7096 & calculates the correct value of $\rho$. \\ | |
7097 & \\ | |
7098 $\left [ 2 \right ]$ & Devise an algorithm to reduce modulo $n + k$ for small $k$ quickly. \\ | |
7099 & \\ | |
7100 $\left [ 4 \right ]$ & Prove that the pseudo-code algorithm ``Diminished Radix Reduction'' \\ | |
7101 & (\textit{figure~\ref{fig:DR}}) terminates. Also prove the probability that it will \\ | |
7102 & terminate within $1 \le k \le 10$ iterations. \\ | |
7103 & \\ | |
7104 \end{tabular} | |
7105 | |
7106 | |
7107 \chapter{Exponentiation} | |
7108 Exponentiation is the operation of raising one variable to the power of another, for example, $a^b$. A variant of exponentiation, computed | |
7109 in a finite field or ring, is called modular exponentiation. This latter style of operation is typically used in public key | |
7110 cryptosystems such as RSA and Diffie-Hellman. The ability to quickly compute modular exponentiations is of great benefit to any | |
7111 such cryptosystem and many methods have been sought to speed it up. | |
7112 | |
7113 \section{Exponentiation Basics} | |
7114 A trivial algorithm would simply multiply $a$ against itself $b - 1$ times to compute the exponentiation desired. However, as $b$ grows in size | |
7115 the number of multiplications becomes prohibitive. Imagine what would happen if $b$ $\approx$ $2^{1024}$ as is the case when computing an RSA signature | |
7116 with a $1024$-bit key. Such a calculation could never be completed as it would simply take far too long. | |
7117 | |
7118 Fortunately there is a very simple algorithm based on the laws of exponents. Recall that $lg_a(a^b) = b$ and that $lg_a(a^ba^c) = b + c$ which | |
7119 are two trivial relationships between the base and the exponent. Let $b_i$ represent the $i$'th bit of $b$ starting from the least | |
7120 significant bit. If $b$ is a $k$-bit integer then the following equation is true. | |
7121 | |
7122 \begin{equation} | |
7123 a^b = \prod_{i=0}^{k-1} a^{2^i \cdot b_i} | |
7124 \end{equation} | |
7125 | |
7126 By taking the base $a$ logarithm of both sides of the equation the following equation is the result. | |
7127 | |
7128 \begin{equation} | |
7129 b = \sum_{i=0}^{k-1}2^i \cdot b_i | |
7130 \end{equation} | |
7131 | |
7132 The term $a^{2^i}$ can be found from the $i - 1$'th term by squaring the term since $\left ( a^{2^i} \right )^2$ is equal to | |
7133 $a^{2^{i+1}}$. This observation forms the basis of essentially all fast exponentiation algorithms. It requires $k$ squarings and on average | |
7134 $k \over 2$ multiplications to compute the result. This is indeed quite an improvement over simply multiplying by $a$ a total of $b-1$ times. | |
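
The method just described can be sketched on 64-bit integers as follows. The function name pow\_right\_to\_left is hypothetical, and the arithmetic simply wraps modulo $2^{64}$ since no modulus is involved at this point. The auxiliary variable holds the running power $a^{2^i}$.

```c
#include <stdint.h>

/* Right-to-left binary exponentiation; the result wraps modulo 2^64. */
uint64_t pow_right_to_left(uint64_t a, uint64_t b)
{
    uint64_t result = 1;
    uint64_t term = a;           /* holds a^(2^i) for the current bit i */

    while (b != 0) {
        if (b & 1) {
            result *= term;      /* bit b_i is set: multiply in a^(2^i) */
        }
        term *= term;            /* a^(2^i) squared gives a^(2^(i+1)) */
        b >>= 1;
    }
    return result;
}
```

With $a = 3$ and $b = 13 = 1101_2$ the product formed is $3^8 \cdot 3^4 \cdot 3^1 = 1594323$.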
7135 | |
7136 While this current method is a considerable speed up there are further improvements to be made. For example, the $a^{2^i}$ term does not need to | |
7137 be computed in an auxiliary variable. Consider the following equivalent algorithm. | |
7138 | |
7139 \begin{figure}[!here] | |
7140 \begin{small} | |
7141 \begin{center} | |
7142 \begin{tabular}{l} | |
7143 \hline Algorithm \textbf{Left to Right Exponentiation}. \\ | |
7144 \textbf{Input}. Integer $a$, $b$ and $k$ \\ | |
7145 \textbf{Output}. $c = a^b$ \\ | |
7146 \hline \\ | |
7147 1. $c \leftarrow 1$ \\ | |
7148 2. for $i$ from $k - 1$ to $0$ do \\ | |
7149 \hspace{3mm}2.1 $c \leftarrow c^2$ \\ | |
7150 \hspace{3mm}2.2 $c \leftarrow c \cdot a^{b_i}$ \\ | |
7151 3. Return $c$. \\ | |
7152 \hline | |
7153 \end{tabular} | |
7154 \end{center} | |
7155 \end{small} | |
7156 \caption{Left to Right Exponentiation} | |
7157 \label{fig:LTOR} | |
7158 \end{figure} | |
7159 | |
7160 This algorithm starts from the most significant bit and works towards the least significant bit. When the $i$'th bit of $b$ is set $a$ is | |
7161 multiplied against the current product. In each iteration the product is squared which doubles the exponent of the individual terms of the | |
7162 product. | |
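
A minimal C sketch of figure~\ref{fig:LTOR} on 64-bit integers is given below. The name pow\_left\_to\_right is an assumption for illustration, and the result wraps modulo $2^{64}$. The sketch assumes that $b$ fits in the $k$ bits scanned.

```c
#include <assert.h>
#include <stdint.h>

/* Left-to-right exponentiation over the k low bits of b; wraps modulo 2^64.
 * Assumes 0 <= k <= 64 and that b fits in k bits. */
uint64_t pow_left_to_right(uint64_t a, uint64_t b, int k)
{
    assert(k >= 0 && k <= 64);
    uint64_t c = 1;                      /* step 1 */

    for (int i = k - 1; i >= 0; i--) {   /* step 2 */
        c *= c;                          /* step 2.1 */
        if ((b >> i) & 1) {
            c *= a;                      /* step 2.2 when b_i = 1 */
        }
    }
    return c;                            /* step 3 */
}
```

Only one accumulator is needed here, in contrast to the two variables of the right-to-left method.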
7163 | |
7164 For example, let $b = 101100_2 \equiv 44_{10}$. The following chart demonstrates the actions of the algorithm. | |
7165 | |
7166 \newpage\begin{figure} | |
7167 \begin{center} | |
7168 \begin{tabular}{|c|c|} | |
7169 \hline \textbf{Value of $i$} & \textbf{Value of $c$} \\ | |
7170 \hline - & $1$ \\ | |
7171 \hline $5$ & $a$ \\ | |
7172 \hline $4$ & $a^2$ \\ | |
7173 \hline $3$ & $a^4 \cdot a$ \\ | |
7174 \hline $2$ & $a^8 \cdot a^2 \cdot a$ \\ | |
7175 \hline $1$ & $a^{16} \cdot a^4 \cdot a^2$ \\ | |
7176 \hline $0$ & $a^{32} \cdot a^8 \cdot a^4$ \\ | |
7177 \hline | |
7178 \end{tabular} | |
7179 \end{center} | |
7180 \caption{Example of Left to Right Exponentiation} | |
7181 \end{figure} | |
7182 | |
7183 When the product $a^{32} \cdot a^8 \cdot a^4$ is simplified it is equal to $a^{44}$ which is the desired exponentiation. This particular algorithm is | |
7184 called ``Left to Right'' because it reads the exponent in that order. All of the exponentiation algorithms that will be presented are of this nature. | |
7185 | |
7186 \subsection{Single Digit Exponentiation} | |
7187 The first algorithm in the series of exponentiation algorithms will be an unbounded algorithm where the exponent is a single digit. It is intended | |
7188 to be used when a small power of an input is required (\textit{e.g. $a^5$}). It is faster than simply multiplying $b - 1$ times for all values of | |
7189 $b$ that are greater than three. | |
7190 | |
7191 \newpage\begin{figure}[!here] | |
7192 \begin{small} | |
7193 \begin{center} | |
7194 \begin{tabular}{l} | |
7195 \hline Algorithm \textbf{mp\_expt\_d}. \\ | |
7196 \textbf{Input}. mp\_int $a$ and mp\_digit $b$ \\ | |
7197 \textbf{Output}. $c = a^b$ \\ | |
7198 \hline \\ | |
7199 1. $g \leftarrow a$ (\textit{mp\_init\_copy}) \\ | |
7200 2. $c \leftarrow 1$ (\textit{mp\_set}) \\ | |
7201 3. for $x$ from 1 to $lg(\beta)$ do \\ | |
7202 \hspace{3mm}3.1 $c \leftarrow c^2$ (\textit{mp\_sqr}) \\ | |
7203 \hspace{3mm}3.2 If $b$ AND $2^{lg(\beta) - 1} \ne 0$ then \\ | |
7204 \hspace{6mm}3.2.1 $c \leftarrow c \cdot g$ (\textit{mp\_mul}) \\ | |
7205 \hspace{3mm}3.3 $b \leftarrow b << 1$ \\ | |
7206 4. Clear $g$. \\ | |
7207 5. Return(\textit{MP\_OKAY}). \\ | |
7208 \hline | |
7209 \end{tabular} | |
7210 \end{center} | |
7211 \end{small} | |
7212 \caption{Algorithm mp\_expt\_d} | |
7213 \end{figure} | |
7214 | |
7215 \textbf{Algorithm mp\_expt\_d.} | |
7216 This algorithm computes the value of $a$ raised to the power of a single digit $b$. It uses the left to right exponentiation algorithm to | |
7217 quickly compute the exponentiation. It is loosely based on algorithm 14.79 of HAC \cite[pp. 615]{HAC} with the difference that the | |
7218 exponent is a fixed width. | |
7219 | |
7220 A copy of $a$ is made first to allow the destination variable $c$ to be the same as the source variable $a$. The result is set to the initial value of | |
7221 $1$ in the subsequent step. | |
7222 | |
7223 Inside the loop the exponent is read from the most significant bit first down to the least significant bit. First $c$ is invariably squared | |
7224 on step 3.1. In the following step if the most significant bit of $b$ is one the copy of $a$ is multiplied against $c$. The value | |
7225 of $b$ is shifted left one bit to make the next bit down from the most significant bit the new most significant bit. In effect each | |
7226 iteration of the loop moves the bits of the exponent $b$ upwards to the most significant location. | |
7227 | |
7228 \vspace{+3mm}\begin{small} | |
7229 \hspace{-5.1mm}{\bf File}: bn\_mp\_expt\_d.c | |
7230 \vspace{-3mm} | |
7231 \begin{alltt} | |
7232 016 | |
7233 017 /* calculate c = a**b using a square-multiply algorithm */ | |
7234 018 int mp_expt_d (mp_int * a, mp_digit b, mp_int * c) | |
7235 019 \{ | |
7236 020 int res, x; | |
7237 021 mp_int g; | |
7238 022 | |
7239 023 if ((res = mp_init_copy (&g, a)) != MP_OKAY) \{ | |
7240 024 return res; | |
7241 025 \} | |
7242 026 | |
7243 027 /* set initial result */ | |
7244 028 mp_set (c, 1); | |
7245 029 | |
7246 030 for (x = 0; x < (int) DIGIT_BIT; x++) \{ | |
7247 031 /* square */ | |
7248 032 if ((res = mp_sqr (c, c)) != MP_OKAY) \{ | |
7249 033 mp_clear (&g); | |
7250 034 return res; | |
7251 035 \} | |
7252 036 | |
7253 037 /* if the bit is set multiply */ | |
7254 038 if ((b & (mp_digit) (((mp_digit)1) << (DIGIT_BIT - 1))) != 0) \{ | |
7255 039 if ((res = mp_mul (c, &g, c)) != MP_OKAY) \{ | |
7256 040 mp_clear (&g); | |
7257 041 return res; | |
7258 042 \} | |
7259 043 \} | |
7260 044 | |
7261 045 /* shift to next bit */ | |
7262 046 b <<= 1; | |
7263 047 \} | |
7264 048 | |
7265 049 mp_clear (&g); | |
7266 050 return MP_OKAY; | |
7267 051 \} | |
7268 \end{alltt} | |
7269 \end{small} | |
7270 | |
7271 Line 28 sets the initial value of the result to $1$. Next the loop on line 30 steps through each bit of the exponent starting from | |
7272 the most significant down towards the least significant. The invariant squaring operation placed on line 32 is performed first. After | |
7273 the squaring the result $c$ is multiplied by the base $g$ if and only if the most significant bit of the exponent is set. The shift on line | |
7274 46 moves all of the bits of the exponent upwards towards the most significant location. | |
7275 | |
7276 \section{$k$-ary Exponentiation} | |
7277 When calculating an exponentiation the most time consuming bottleneck is the multiplications which are in general a small factor | |
7278 slower than squaring. Recall from the previous algorithm that $b_{i}$ refers to the $i$'th bit of the exponent $b$. Suppose instead it referred to | |
7279 the $i$'th $k$-bit digit of the exponent $b$. For $k = 1$ the definitions are synonymous and for $k > 1$ algorithm~\ref{fig:KARY} | |
7280 computes the same exponentiation. A group of $k$ bits from the exponent is called a \textit{window}. That is, it is a small window on only a | |
7281 portion of the entire exponent. Consider the following modification to the basic left to right exponentiation algorithm. | |
7282 | |
7283 \begin{figure}[!here] | |
7284 \begin{small} | |
7285 \begin{center} | |
7286 \begin{tabular}{l} | |
7287 \hline Algorithm \textbf{$k$-ary Exponentiation}. \\ | |
7288 \textbf{Input}. Integer $a$, $b$, $k$ and $t$ \\ | |
7289 \textbf{Output}. $c = a^b$ \\ | |
7290 \hline \\ | |
7291 1. $c \leftarrow 1$ \\ | |
7292 2. for $i$ from $t - 1$ to $0$ do \\ | |
7293 \hspace{3mm}2.1 $c \leftarrow c^{2^k} $ \\ | |
7294 \hspace{3mm}2.2 Extract the $i$'th $k$-bit word from $b$ and store it in $g$. \\ | |
7295 \hspace{3mm}2.3 $c \leftarrow c \cdot a^g$ \\ | |
7296 3. Return $c$. \\ | |
7297 \hline | |
7298 \end{tabular} | |
7299 \end{center} | |
7300 \end{small} | |
7301 \caption{$k$-ary Exponentiation} | |
7302 \label{fig:KARY} | |
7303 \end{figure} | |
7304 | |
7305 The squaring on step 2.1 can be calculated by squaring the value $c$ successively $k$ times. If the values of $a^g$ for $0 < g < 2^k$ have been | |
7306 precomputed this algorithm requires only $t$ multiplications and $tk$ squarings. The table can be generated with $2^{k - 1} - 1$ squarings and | |
7307 $2^{k - 1} + 1$ multiplications. This algorithm assumes that the number of bits in the exponent is evenly divisible by $k$. | |
7308 However, when it is not the remaining $0 < x \le k - 1$ bits can be handled with algorithm~\ref{fig:LTOR}. | |
7309 | |
7310 Suppose $k = 4$ and $t = 100$. This modified algorithm will require $109$ multiplications and $408$ squarings to compute the exponentiation. The | |
7311 original algorithm would on average have required $200$ multiplications and $400$ squarings to compute the same value. The total number of squarings | |
7312 has increased slightly but the number of multiplications has nearly halved. | |
7313 | |
7314 \subsection{Optimal Values of $k$} | |
7315 An optimal value of $k$ will minimize $2^{k} + \lceil n / k \rceil + n - 1$ for a fixed number of bits in the exponent $n$. The simplest | |
7316 approach is to brute force search amongst the values $k = 2, 3, \ldots, 8$ for the lowest result. Table~\ref{fig:OPTK} lists optimal values of $k$ | |
7317 for various exponent sizes and compares the number of multiplication and squarings required against algorithm~\ref{fig:LTOR}. | |
7318 | |
7319 \begin{figure}[here] | |
7320 \begin{center} | |
7321 \begin{small} | |
7322 \begin{tabular}{|c|c|c|c|} | |
7323 \hline \textbf{Exponent (bits)} & \textbf{Optimal $k$} & \textbf{Work at $k$} & \textbf{Work with algorithm~\ref{fig:LTOR}} \\ | |
7324 \hline $16$ & $2$ & $27$ & $24$ \\ | |
7325 \hline $32$ & $3$ & $49$ & $48$ \\ | |
7326 \hline $64$ & $3$ & $92$ & $96$ \\ | |
7327 \hline $128$ & $4$ & $175$ & $192$ \\ | |
7328 \hline $256$ & $4$ & $335$ & $384$ \\ | |
7329 \hline $512$ & $5$ & $645$ & $768$ \\ | |
7330 \hline $1024$ & $6$ & $1257$ & $1536$ \\ | |
7331 \hline $2048$ & $6$ & $2452$ & $3072$ \\ | |
7332 \hline $4096$ & $7$ & $4808$ & $6144$ \\ | |
7333 \hline | |
7334 \end{tabular} | |
7335 \end{small} | |
7336 \end{center} | |
7337 \caption{Optimal Values of $k$ for $k$-ary Exponentiation} | |
7338 \label{fig:OPTK} | |
7339 \end{figure} | |
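
The brute force search described above can be sketched as follows. The function names \texttt{kary\_work} and \texttt{best\_k} are hypothetical, and the ceiling is taken exactly as printed in the formula; a different rounding convention could shift an individual work estimate by one.

```c
#include <assert.h>

/* Work estimate from the text: 2^k + ceil(n / k) + n - 1,
 * for an n-bit exponent and a window of k bits. */
unsigned long kary_work(unsigned long n, unsigned k)
{
    assert(k >= 1 && k < 32);
    return (1UL << k) + (n + k - 1) / k + n - 1;
}

/* Search k = 2, 3, ..., 8 for the lowest estimate.
 * Ties are resolved in favour of the smaller k (an assumption). */
unsigned best_k(unsigned long n)
{
    unsigned k, best = 2;

    for (k = 3; k <= 8; k++) {
        if (kary_work(n, k) < kary_work(n, best)) {
            best = k;
        }
    }
    return best;
}
```

For example \texttt{best\_k(128)} selects $k = 4$ with an estimate of $175$ operations.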
7340 | |
7341 \subsection{Sliding-Window Exponentiation} | |
7342 A simple modification to the previous algorithm is to generate only the upper half of the table, in the range $2^{k-1} \le g < 2^k$. Essentially | |
7343 this is a table for all values of $g$ where the most significant bit of $g$ is a one. However, in order for this to be allowed in the | |
7344 algorithm values of $g$ in the range $0 \le g < 2^{k-1}$ must be avoided. | |
7345 | |
7346 Table~\ref{fig:OPTK2} lists optimal values of $k$ for various exponent sizes and compares the work required against algorithm~\ref{fig:KARY}. | |
7347 | |
7348 \begin{figure}[here] | |
7349 \begin{center} | |
7350 \begin{small} | |
7351 \begin{tabular}{|c|c|c|c|} | |
7352 \hline \textbf{Exponent (bits)} & \textbf{Optimal $k$} & \textbf{Work at $k$} & \textbf{Work with algorithm~\ref{fig:KARY}} \\ | |
7353 \hline $16$ & $3$ & $24$ & $27$ \\ | |
7354 \hline $32$ & $3$ & $45$ & $49$ \\ | |
7355 \hline $64$ & $4$ & $87$ & $92$ \\ | |
7356 \hline $128$ & $4$ & $167$ & $175$ \\ | |
7357 \hline $256$ & $5$ & $322$ & $335$ \\ | |
7358 \hline $512$ & $6$ & $628$ & $645$ \\ | |
7359 \hline $1024$ & $6$ & $1225$ & $1257$ \\ | |
7360 \hline $2048$ & $7$ & $2403$ & $2452$ \\ | |
7361 \hline $4096$ & $8$ & $4735$ & $4808$ \\ | |
7362 \hline | |
7363 \end{tabular} | |
7364 \end{small} | |
7365 \end{center} | |
7366 \caption{Optimal Values of $k$ for Sliding Window Exponentiation} | |
7367 \label{fig:OPTK2} | |
7368 \end{figure} | |
7369 | |
7370 \newpage\begin{figure}[!here] | |
7371 \begin{small} | |
7372 \begin{center} | |
7373 \begin{tabular}{l} | |
7374 \hline Algorithm \textbf{Sliding Window $k$-ary Exponentiation}. \\ | |
7375 \textbf{Input}. Integer $a$, $b$, $k$ and $t$ \\ | |
7376 \textbf{Output}. $c = a^b$ \\ | |
7377 \hline \\ | |
7378 1. $c \leftarrow 1$ \\ | |
7379 2. for $i$ from $t - 1$ to $0$ do \\ | |
7380 \hspace{3mm}2.1 If the $i$'th bit of $b$ is a zero then \\ | |
7381 \hspace{6mm}2.1.1 $c \leftarrow c^2$ \\ | |
7382 \hspace{3mm}2.2 else do \\ | |
7383 \hspace{6mm}2.2.1 $c \leftarrow c^{2^k}$ \\ | |
7384 \hspace{6mm}2.2.2 Extract the $k$ bits from $(b_{i}b_{i-1}\ldots b_{i-(k-1)})$ and store it in $g$. \\ | |
7385 \hspace{6mm}2.2.3 $c \leftarrow c \cdot a^g$ \\ | |
7386 \hspace{6mm}2.2.4 $i \leftarrow i - k$ \\ | |
7387 3. Return $c$. \\ | |
7388 \hline | |
7389 \end{tabular} | |
7390 \end{center} | |
7391 \end{small} | |
7392 \caption{Sliding Window $k$-ary Exponentiation} | |
7393 \end{figure} | |
7394 | |
7395 Similar to the previous algorithm this algorithm must have a special handler when fewer than $k$ bits are left in the exponent. While this | |
7396 algorithm requires the same number of squarings it can potentially have fewer multiplications. The pre-computed table $a^g$ is also half | |
7397 the size as the previous table. | |
7398 | |
7399 Consider the exponent $b = 111101011001000_2 \equiv 31432_{10}$ with $k = 3$ using both algorithms. The first algorithm will divide the exponent up as | |
7400 the following five $3$-bit words $b \equiv \left ( 111, 101, 011, 001, 000 \right )_{2}$. The second algorithm will break the | |
7401 exponent as $b \equiv \left ( 111, 101, 0, 110, 0, 100, 0 \right )_{2}$. The single digits $0$ in the second representation mark where | |
7402 a single squaring took place instead of a squaring and multiplication. In total the first method requires $10$ multiplications and $18$ | |
7403 squarings. The second method requires $8$ multiplications and $18$ squarings. | |
7404 | |
7405 In general the sliding window method is never slower than the generic $k$-ary method and often it is slightly faster. | |
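
The way the sliding window method splits an exponent can be reproduced with a short C sketch. The function \texttt{sliding\_windows} is hypothetical and only records the decomposition. It assumes that when fewer than $k$ bits remain the leftover bits are consumed one at a time, which is one possible form of the special handler mentioned above.

```c
#include <assert.h>
#include <stddef.h>

/* Write the sliding window decomposition of the t-bit exponent b into out
 * as a comma separated string, scanning from the most significant bit.
 * Returns the number of multiplications performed by the main loop
 * (table generation is not counted). */
int sliding_windows(unsigned long b, int t, int k, char *out, size_t outlen)
{
    size_t pos = 0;
    int i = t - 1, j, mults = 0;

    assert(k >= 1 && t >= 1 && t <= 32 && outlen >= (size_t)(2 * t + 1));
    while (i >= 0) {
        if (pos > 0) {
            out[pos++] = ',';
        }
        if (((b >> i) & 1UL) == 0) {
            out[pos++] = '0';              /* zero bit: a single squaring */
            i -= 1;
        } else if (i + 1 >= k) {
            for (j = 0; j < k; j++) {      /* full window starting with a one */
                out[pos++] = ((b >> (i - j)) & 1UL) ? '1' : '0';
            }
            mults += 1;
            i -= k;
        } else {
            out[pos++] = '1';              /* fewer than k bits left */
            mults += 1;
            i -= 1;
        }
    }
    out[pos] = '\0';
    return mults;
}
```

With $b = 31432$, $t = 15$ and $k = 3$ the buffer receives the same seven groups as the second representation in the example, and four multiplications take place in the main loop.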
7406 | |
7407 \section{Modular Exponentiation} | |
7408 | |
7409 Modular exponentiation is essentially computing the power of a base within a finite field or ring. For example, computing | |
7410 $d \equiv a^b \mbox{ (mod }c\mbox{)}$ is a modular exponentiation. Instead of first computing $a^b$ and then reducing it | |
7411 modulo $c$ the intermediate result is reduced modulo $c$ after every squaring or multiplication operation. | |
7412 | |
7413 This guarantees that any intermediate result is bounded by $0 \le d \le c^2 - 2c + 1$ and can be reduced modulo $c$ quickly using | |
7414 one of the algorithms presented in chapter six. | |
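
A minimal sketch of this idea for machine words follows. The function \texttt{modexp\_ltor} is hypothetical, it uses the C remainder operator in place of the reduction algorithms of chapter six, and it assumes a 32-bit modulus so that every intermediate product, which is at most $(c - 1)^2$, fits in 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Compute d = a^b mod c with a left-to-right scan of the exponent,
 * reducing modulo c after every squaring and every multiplication. */
uint32_t modexp_ltor(uint32_t a, uint32_t b, uint32_t c)
{
    uint64_t d, base;
    int i;

    assert(c != 0);
    d = 1 % c;                     /* handles c == 1 as well */
    base = a % c;
    for (i = 31; i >= 0; i--) {
        d = (d * d) % c;           /* square, then reduce */
        if ((b >> i) & 1) {
            d = (d * base) % c;    /* multiply, then reduce */
        }
    }
    return (uint32_t)d;
}
```

For instance \texttt{modexp\_ltor(4, 13, 497)} yields $445$ without ever forming the full value of $4^{13}$.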
7415 | |
7416 Before the actual modular exponentiation algorithm can be written a wrapper algorithm must be written first. This algorithm | |
7417 will allow the exponent $b$ to be negative which is computed as $d \equiv \left (1 / a \right )^{\vert b \vert} \mbox{ (mod }c\mbox{)}$. The | |
7418 value of $(1/a) \mbox{ mod }c$ is computed using the modular inverse (\textit{see \ref{sec;modinv}}). If no inverse exists the algorithm | |
7419 terminates with an error. | |
7420 | |
7421 \begin{figure}[!here] | |
7422 \begin{small} | |
7423 \begin{center} | |
7424 \begin{tabular}{l} | |
7425 \hline Algorithm \textbf{mp\_exptmod}. \\ | |
7426 \textbf{Input}. mp\_int $g$, $x$ and $p$ \\ | |
7427 \textbf{Output}. $y \equiv g^x \mbox{ (mod }p\mbox{)}$ \\ | |
7428 \hline \\ | |
7429 1. If $p.sign = MP\_NEG$ return(\textit{MP\_VAL}). \\ | |
7430 2. If $x.sign = MP\_NEG$ then \\ | |
7431 \hspace{3mm}2.1 $g' \leftarrow g^{-1} \mbox{ (mod }p\mbox{)}$ \\ | |
7432 \hspace{3mm}2.2 $x' \leftarrow \vert x \vert$ \\ | |
7433 \hspace{3mm}2.3 Compute $y \equiv g'^{x'} \mbox{ (mod }p\mbox{)}$ via recursion. \\ | |
7434 3. if $p$ is odd \textbf{OR} $p$ is a D.R. modulus then \\ | |
7435 \hspace{3mm}3.1 Compute $y \equiv g^{x} \mbox{ (mod }p\mbox{)}$ via algorithm mp\_exptmod\_fast. \\ | |
7436 4. else \\ | |
7437 \hspace{3mm}4.1 Compute $y \equiv g^{x} \mbox{ (mod }p\mbox{)}$ via algorithm s\_mp\_exptmod. \\ | |
7438 \hline | |
7439 \end{tabular} | |
7440 \end{center} | |
7441 \end{small} | |
7442 \caption{Algorithm mp\_exptmod} | |
7443 \end{figure} | |
7444 | |
7445 \textbf{Algorithm mp\_exptmod.} | |
7446 The first algorithm which actually performs modular exponentiation is algorithm s\_mp\_exptmod. It is a sliding window $k$-ary algorithm | |
7447 which uses Barrett reduction to reduce the product modulo $p$. The second algorithm mp\_exptmod\_fast performs the same operation | |
7448 except it uses either Montgomery or Diminished Radix reduction. The two latter reduction algorithms are clumped in the same exponentiation | |
7449 algorithm since their arguments are essentially the same (\textit{two mp\_ints and one mp\_digit}). | |
7450 | |
7451 \vspace{+3mm}\begin{small} | |
7452 \hspace{-5.1mm}{\bf File}: bn\_mp\_exptmod.c | |
7453 \vspace{-3mm} | |
7454 \begin{alltt} | |
7455 016 | |
7456 017 | |
7457 018 /* this is a shell function that calls either the normal or Montgomery | |
7458 019 * exptmod functions. Originally the call to the montgomery code was | |
7459 020 * embedded in the normal function but that wasted alot of stack space | |
7460 021 * for nothing (since 99% of the time the Montgomery code would be called) | |
7461 022 */ | |
7462 023 int mp_exptmod (mp_int * G, mp_int * X, mp_int * P, mp_int * Y) | |
7463 024 \{ | |
7464 025 int dr; | |
7465 026 | |
7466 027 /* modulus P must be positive */ | |
7467 028 if (P->sign == MP_NEG) \{ | |
7468 029 return MP_VAL; | |
7469 030 \} | |
7470 031 | |
7471 032 /* if exponent X is negative we have to recurse */ | |
7472 033 if (X->sign == MP_NEG) \{ | |
7473 034 mp_int tmpG, tmpX; | |
7474 035 int err; | |
7475 036 | |
7476 037 /* first compute 1/G mod P */ | |
7477 038 if ((err = mp_init(&tmpG)) != MP_OKAY) \{ | |
7478 039 return err; | |
7479 040 \} | |
7480 041 if ((err = mp_invmod(G, P, &tmpG)) != MP_OKAY) \{ | |
7481 042 mp_clear(&tmpG); | |
7482 043 return err; | |
7483 044 \} | |
7484 045 | |
7485 046 /* now get |X| */ | |
7486 047 if ((err = mp_init(&tmpX)) != MP_OKAY) \{ | |
7487 048 mp_clear(&tmpG); | |
7488 049 return err; | |
7489 050 \} | |
7490 051 if ((err = mp_abs(X, &tmpX)) != MP_OKAY) \{ | |
7491 052 mp_clear_multi(&tmpG, &tmpX, NULL); | |
7492 053 return err; | |
7493 054 \} | |
7494 055 | |
7495 056 /* and now compute (1/G)**|X| instead of G**X [X < 0] */ | |
7496 057 err = mp_exptmod(&tmpG, &tmpX, P, Y); | |
7497 058 mp_clear_multi(&tmpG, &tmpX, NULL); | |
7498 059 return err; | |
7499 060 \} | |
7500 061 | |
7501 062 /* is it a DR modulus? */ | |
7502 063 dr = mp_dr_is_modulus(P); | |
7503 064 | |
7504 065 /* if not, is it a uDR modulus? */ | |
7505 066 if (dr == 0) \{ | |
7506 067 dr = mp_reduce_is_2k(P) << 1; | |
7507 068 \} | |
7508 069 | |
7509 070 /* if the modulus is odd or dr != 0 use the fast method */ | |
7510 071 if (mp_isodd (P) == 1 || dr != 0) \{ | |
7511 072 return mp_exptmod_fast (G, X, P, Y, dr); | |
7512 073 \} else \{ | |
7513 074 /* otherwise use the generic Barrett reduction technique */ | |
7514 075 return s_mp_exptmod (G, X, P, Y); | |
7515 076 \} | |
7516 077 \} | |
7517 078 | |
7518 \end{alltt} | |
7519 \end{small} | |
7520 | |
7521 In order to keep the algorithms in a known state the first step on line 28 is to reject any negative modulus as input. If the exponent is | |
7522 negative the algorithm tries to perform a modular exponentiation with the modular inverse of the base $G$. The temporary variable $tmpG$ is assigned | |
7523 the modular inverse of $G$ and $tmpX$ is assigned the absolute value of $X$. The algorithm will recurse with these new values and a positive | |
7524 exponent. | |
7525 | |
7526 If the exponent is positive the algorithm resumes the exponentiation. Line 63 determines if the modulus is of the restricted Diminished Radix | |
7527 form. If it is not, line 67 attempts to determine if it is of an unrestricted Diminished Radix form. The integer $dr$ will take on one | |
7528 of three values. | |
7529 | |
7530 \begin{enumerate} | |
7531 \item $dr = 0$ means that the modulus is not of either restricted or unrestricted Diminished Radix form. | |
7532 \item $dr = 1$ means that the modulus is of restricted Diminished Radix form. | |
7533 \item $dr = 2$ means that the modulus is of unrestricted Diminished Radix form. | |
7534 \end{enumerate} | |
7535 | |
7536 Line 71 determines if the fast modular exponentiation algorithm can be used. It is allowed if $dr \ne 0$ or if the modulus is odd. Otherwise, | |
7537 the slower s\_mp\_exptmod algorithm is used which uses Barrett reduction. | |
7538 | |
7539 \subsection{Barrett Modular Exponentiation} | |
7540 | |
7541 \newpage\begin{figure}[!here] | |
7542 \begin{small} | |
7543 \begin{center} | |
7544 \begin{tabular}{l} | |
7545 \hline Algorithm \textbf{s\_mp\_exptmod}. \\ | |
7546 \textbf{Input}. mp\_int $g$, $x$ and $p$ \\ | |
7547 \textbf{Output}. $y \equiv g^x \mbox{ (mod }p\mbox{)}$ \\ | |
7548 \hline \\ | |
7549 1. $k \leftarrow lg(x)$ \\ | |
7550 2. $winsize \leftarrow \left \lbrace \begin{array}{ll} | |
7551 2 & \mbox{if }k \le 7 \\ | |
7552 3 & \mbox{if }7 < k \le 36 \\ | |
7553 4 & \mbox{if }36 < k \le 140 \\ | |
7554 5 & \mbox{if }140 < k \le 450 \\ | |
7555 6 & \mbox{if }450 < k \le 1303 \\ | |
7556 7 & \mbox{if }1303 < k \le 3529 \\ | |
7557 8 & \mbox{if }3529 < k \\ | |
7558 \end{array} \right .$ \\ | |
7559 3. Initialize $2^{winsize}$ mp\_ints in an array named $M$ and one mp\_int named $\mu$ \\ | |
7560 4. Calculate the $\mu$ required for Barrett Reduction (\textit{mp\_reduce\_setup}). \\ | |
7561 5. $M_1 \leftarrow g \mbox{ (mod }p\mbox{)}$ \\ | |
7562 \\ | |
7563 Setup the table of small powers of $g$. First find $g^{2^{winsize - 1}}$ and then the consecutive powers above it. \\ | |
7564 6. $k \leftarrow 2^{winsize - 1}$ \\ | |
7565 7. $M_{k} \leftarrow M_1$ \\ | |
7566 8. for $ix$ from 0 to $winsize - 2$ do \\ | |
7567 \hspace{3mm}8.1 $M_k \leftarrow \left ( M_k \right )^2$ (\textit{mp\_sqr}) \\ | |
7568 \hspace{3mm}8.2 $M_k \leftarrow M_k \mbox{ (mod }p\mbox{)}$ (\textit{mp\_reduce}) \\ | |
7569 9. for $ix$ from $2^{winsize - 1} + 1$ to $2^{winsize} - 1$ do \\ | |
7570 \hspace{3mm}9.1 $M_{ix} \leftarrow M_{ix - 1} \cdot M_{1}$ (\textit{mp\_mul}) \\ | |
7571 \hspace{3mm}9.2 $M_{ix} \leftarrow M_{ix} \mbox{ (mod }p\mbox{)}$ (\textit{mp\_reduce}) \\ | |
7572 10. $res \leftarrow 1$ \\ | |
7573 \\ | |
7574 Start Sliding Window. \\ | |
7575 11. $mode \leftarrow 0, bitcnt \leftarrow 1, buf \leftarrow 0, digidx \leftarrow x.used - 1, bitcpy \leftarrow 0, bitbuf \leftarrow 0$ \\ | |
7576 12. Loop \\ | |
7577 \hspace{3mm}12.1 $bitcnt \leftarrow bitcnt - 1$ \\ | |
7578 \hspace{3mm}12.2 If $bitcnt = 0$ then do \\ | |
7579 \hspace{6mm}12.2.1 If $digidx = -1$ goto step 13. \\ | |
7580 \hspace{6mm}12.2.2 $buf \leftarrow x_{digidx}$ \\ | |
7581 \hspace{6mm}12.2.3 $digidx \leftarrow digidx - 1$ \\ | |
7582 \hspace{6mm}12.2.4 $bitcnt \leftarrow lg(\beta)$ \\ | |
7583 Continued on next page. \\ | |
7584 \hline | |
7585 \end{tabular} | |
7586 \end{center} | |
7587 \end{small} | |
7588 \caption{Algorithm s\_mp\_exptmod} | |
7589 \end{figure} | |
7590 | |
7591 \newpage\begin{figure}[!here] | |
7592 \begin{small} | |
7593 \begin{center} | |
7594 \begin{tabular}{l} | |
7595 \hline Algorithm \textbf{s\_mp\_exptmod} (\textit{continued}). \\ | |
7596 \textbf{Input}. mp\_int $g$, $x$ and $p$ \\ | |
7597 \textbf{Output}. $y \equiv g^x \mbox{ (mod }p\mbox{)}$ \\ | |
7598 \hline \\ | |
7599 \hspace{3mm}12.3 $y \leftarrow (buf >> (lg(\beta) - 1))$ AND $1$ \\ | |
7600 \hspace{3mm}12.4 $buf \leftarrow buf << 1$ \\ | |
7601 \hspace{3mm}12.5 if $mode = 0$ and $y = 0$ then goto step 12. \\ | |
7602 \hspace{3mm}12.6 if $mode = 1$ and $y = 0$ then do \\ | |
7603 \hspace{6mm}12.6.1 $res \leftarrow res^2$ \\ | |
7604 \hspace{6mm}12.6.2 $res \leftarrow res \mbox{ (mod }p\mbox{)}$ \\ | |
7605 \hspace{6mm}12.6.3 Goto step 12. \\ | |
7606 \hspace{3mm}12.7 $bitcpy \leftarrow bitcpy + 1$ \\ | |
7607 \hspace{3mm}12.8 $bitbuf \leftarrow bitbuf + (y << (winsize - bitcpy))$ \\ | |
7608 \hspace{3mm}12.9 $mode \leftarrow 2$ \\ | |
7609 \hspace{3mm}12.10 If $bitcpy = winsize$ then do \\ | |
7610 \hspace{6mm}Window is full so perform the squarings and single multiplication. \\ | |
7611 \hspace{6mm}12.10.1 for $ix$ from $0$ to $winsize -1$ do \\ | |
7612 \hspace{9mm}12.10.1.1 $res \leftarrow res^2$ \\ | |
7613 \hspace{9mm}12.10.1.2 $res \leftarrow res \mbox{ (mod }p\mbox{)}$ \\ | |
7614 \hspace{6mm}12.10.2 $res \leftarrow res \cdot M_{bitbuf}$ \\ | |
7615 \hspace{6mm}12.10.3 $res \leftarrow res \mbox{ (mod }p\mbox{)}$ \\ | |
7616 \hspace{6mm}Reset the window. \\ | |
7617 \hspace{6mm}12.10.4 $bitcpy \leftarrow 0, bitbuf \leftarrow 0, mode \leftarrow 1$ \\ | |
7618 \\ | |
7619 No more windows left. Check for residual bits of exponent. \\ | |
7620 13. If $mode = 2$ and $bitcpy > 0$ then do \\ | |
7621 \hspace{3mm}13.1 for $ix$ from $0$ to $bitcpy - 1$ do \\ | |
7622 \hspace{6mm}13.1.1 $res \leftarrow res^2$ \\ | |
7623 \hspace{6mm}13.1.2 $res \leftarrow res \mbox{ (mod }p\mbox{)}$ \\ | |
7624 \hspace{6mm}13.1.3 $bitbuf \leftarrow bitbuf << 1$ \\ | |
7625 \hspace{6mm}13.1.4 If $bitbuf$ AND $2^{winsize} \ne 0$ then do \\ | |
7626 \hspace{9mm}13.1.4.1 $res \leftarrow res \cdot M_{1}$ \\ | |
7627 \hspace{9mm}13.1.4.2 $res \leftarrow res \mbox{ (mod }p\mbox{)}$ \\ | |
7628 14. $y \leftarrow res$ \\ | |
7629 15. Clear $res$, $\mu$ and the $M$ array. \\ | |
7630 16. Return(\textit{MP\_OKAY}). \\ | |
7631 \hline | |
7632 \end{tabular} | |
7633 \end{center} | |
7634 \end{small} | |
7635 \caption{Algorithm s\_mp\_exptmod (continued)} | |
7636 \end{figure} | |
7637 | |
7638 \textbf{Algorithm s\_mp\_exptmod.} | |
7639 This algorithm computes the $x$'th power of $g$ modulo $p$ and stores the result in $y$. It takes advantage of the Barrett reduction | |
7640 algorithm to keep the product small throughout the algorithm. | |
7641 | |
7642 The first two steps determine the optimal window size based on the number of bits in the exponent. The larger the exponent the | |
7643 larger the window size becomes. After a window size $winsize$ has been chosen an array of $2^{winsize}$ mp\_int variables is allocated. This | |
7644 table will hold the values of $g^x \mbox{ (mod }p\mbox{)}$ for $2^{winsize - 1} \le x < 2^{winsize}$. | |
7645 | |
7646 After the table is allocated the first power of $g$ is found. Since $g \ge p$ is allowed it must be first reduced modulo $p$ to make | |
7647 the rest of the algorithm more efficient. The first element of the table at $2^{winsize - 1}$ is found by squaring $M_1$ successively $winsize - 1$ | |
7648 times. The rest of the table elements are found by multiplying the previous element by $M_1$ modulo $p$. | |
7649 | |
7650 Now that the table is available the sliding window may begin. The following list describes the functions of all the variables in the window. | |
7651 \begin{enumerate} | |
7652 \item The variable $mode$ dictates how the bits of the exponent are interpreted. | |
7653 \begin{enumerate} | |
7654 \item When $mode = 0$ the bits are ignored since no non-zero bit of the exponent has been seen yet. For example, if the exponent were simply | |
7655 $1$ then there would be $lg(\beta) - 1$ zero bits before the first non-zero bit. In this case bits are ignored until a non-zero bit is found. | |
7656 \item When $mode = 1$ a non-zero bit has been seen before and a new $winsize$-bit window has not been formed yet. In this mode each $0$ bit | |
7657 read causes a single squaring to be performed. If a non-zero bit is read a new window is created. | |
7658 \item When $mode = 2$ the algorithm is in the middle of forming a window and new bits are appended to the window from the most significant bit | |
7659 downwards. | |
7660 \end{enumerate} | |
7661 \item The variable $bitcnt$ indicates how many bits of the current digit of the exponent are left to be read. When it reaches zero a new digit | |
7662 is fetched from the exponent. | |
7663 \item The variable $buf$ holds the currently read digit of the exponent. | |
7664 \item The variable $digidx$ is an index into the exponent's digits. It starts at the leading digit $x.used - 1$ and moves towards the trailing digit. | |
7665 \item The variable $bitcpy$ indicates how many bits are in the currently formed window. When it reaches $winsize$ the window is flushed and | |
7666 the appropriate operations performed. | |
7667 \item The variable $bitbuf$ holds the current bits of the window being formed. | |
7668 \end{enumerate} | |
7669 | |
7670 All of step 12 is the window processing loop. It will iterate while there are digits available from the exponent to read. The first step | |
7671 inside this loop is to extract a new digit if no more bits are available in the current digit. If there are no bits left a new digit is | |
7672 read and if there are no digits left then the loop terminates. | |
7673 | |
7674 After a digit is made available step 12.3 will extract the most significant bit of the current digit and move all other bits in the digit | |
7675 upwards. In effect the digit is read from most significant bit to least significant bit and since the digits are read from leading to | |
7676 trailing edges the entire exponent is read from most significant bit to least significant bit. | |
7677 | |
7678 At step 12.5 if the $mode$ and currently extracted bit $y$ are both zero the bit is ignored and the next bit is read. This prevents the | |
7679 algorithm from having to perform trivial squaring and reduction operations before the first non-zero bit is read. Steps 12.6 and 12.7--12.10 handle | |
7680 the two cases of $mode = 1$ and $mode = 2$ respectively. | |
7681 | |
7682 \begin{center} | |
7683 \begin{figure}[here] | |
7684 \includegraphics{pics/expt_state.ps} | |
7685 \caption{Sliding Window State Diagram} | |
7686 \label{pic:expt_state} | |
7687 \end{figure} | |
7688 \end{center} | |
7689 | |
7690 By step 13 there are no more digits left in the exponent. However, there may be partial bits in the window left. If $mode = 2$ then | |
7691 a Left-to-Right algorithm is used to process the remaining few bits. | |
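
The state machine described above can be condensed into a sketch for 32-bit operands. This is not the library routine: the name \texttt{sw\_exptmod} is hypothetical, the C remainder operator stands in for Barrett reduction, the exponent is a single 32-bit word rather than an array of digits, and the window size is passed in by the caller instead of being chosen from the exponent length.

```c
#include <assert.h>
#include <stdint.h>

/* Sliding window computation of g^x mod p for 32-bit operands. */
uint32_t sw_exptmod(uint32_t g, uint32_t x, uint32_t p, int winsize)
{
    uint64_t M[16], res;
    uint32_t buf = x;
    int mode = 0, bitcpy = 0, bitbuf = 0, bitcnt, i, y;
    int half;

    assert(p != 0 && winsize >= 2 && winsize <= 4);
    half = 1 << (winsize - 1);

    /* table: M[1] = g mod p, then only the upper half M[half .. 2^winsize - 1] */
    M[1] = g % p;
    M[half] = M[1];
    for (i = 0; i < winsize - 1; i++) {
        M[half] = (M[half] * M[half]) % p;
    }
    for (i = half + 1; i < (1 << winsize); i++) {
        M[i] = (M[i - 1] * M[1]) % p;
    }

    res = 1 % p;
    for (bitcnt = 32; bitcnt > 0; bitcnt--) {
        y = (int)((buf >> 31) & 1);        /* next most significant bit */
        buf <<= 1;

        if (mode == 0 && y == 0) {
            continue;                      /* skip leading zero bits */
        }
        if (mode == 1 && y == 0) {
            res = (res * res) % p;         /* zero bit between windows */
            continue;
        }

        bitbuf |= y << (winsize - ++bitcpy);   /* append bit to the window */
        mode = 2;
        if (bitcpy == winsize) {
            for (i = 0; i < winsize; i++) {
                res = (res * res) % p;
            }
            res = (res * M[bitbuf]) % p;
            bitcpy = 0;
            bitbuf = 0;
            mode = 1;
        }
    }

    /* residual bits of a partially filled window, handled left to right */
    if (mode == 2 && bitcpy > 0) {
        for (i = 0; i < bitcpy; i++) {
            res = (res * res) % p;
            bitbuf <<= 1;
            if ((bitbuf & (1 << winsize)) != 0) {
                res = (res * M[1]) % p;
            }
        }
    }
    return (uint32_t)res;
}
```

With $x = 13 = 1101_2$ and $winsize = 3$ the window $110$ is flushed first and the trailing bit $1$ is handled by the residual loop, so \texttt{sw\_exptmod(4, 13, 497, 3)} returns $445$.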
7692 | |
7693 \vspace{+3mm}\begin{small} | |
7694 \hspace{-5.1mm}{\bf File}: bn\_s\_mp\_exptmod.c | |
7695 \vspace{-3mm} | |
7696 \begin{alltt} | |
7697 016 | |
7698 017 #ifdef MP_LOW_MEM | |
7699 018 #define TAB_SIZE 32 | |
7700 019 #else | |
7701 020 #define TAB_SIZE 256 | |
7702 021 #endif | |
7703 022 | |
7704 023 int s_mp_exptmod (mp_int * G, mp_int * X, mp_int * P, mp_int * Y) | |
7705 024 \{ | |
7706 025 mp_int M[TAB_SIZE], res, mu; | |
7707 026 mp_digit buf; | |
7708 027 int err, bitbuf, bitcpy, bitcnt, mode, digidx, x, y, winsize; | |
7709 028 | |
7710 029 /* find window size */ | |
7711 030 x = mp_count_bits (X); | |
7712 031 if (x <= 7) \{ | |
7713 032 winsize = 2; | |
7714 033 \} else if (x <= 36) \{ | |
7715 034 winsize = 3; | |
7716 035 \} else if (x <= 140) \{ | |
7717 036 winsize = 4; | |
7718 037 \} else if (x <= 450) \{ | |
7719 038 winsize = 5; | |
7720 039 \} else if (x <= 1303) \{ | |
7721 040 winsize = 6; | |
7722 041 \} else if (x <= 3529) \{ | |
7723 042 winsize = 7; | |
7724 043 \} else \{ | |
7725 044 winsize = 8; | |
7726 045 \} | |
7727 046 | |
7728 047 #ifdef MP_LOW_MEM | |
7729 048 if (winsize > 5) \{ | |
7730 049 winsize = 5; | |
7731 050 \} | |
7732 051 #endif | |
7733 052 | |
7734 053 /* init M array */ | |
7735 054 /* init first cell */ | |
7736 055 if ((err = mp_init(&M[1])) != MP_OKAY) \{ | |
7737 056 return err; | |
7738 057 \} | |
7739 058 | |
7740 059 /* now init the second half of the array */ | |
7741 060 for (x = 1<<(winsize-1); x < (1 << winsize); x++) \{ | |
7742 061 if ((err = mp_init(&M[x])) != MP_OKAY) \{ | |
7743 062 for (y = 1<<(winsize-1); y < x; y++) \{ | |
7744 063 mp_clear (&M[y]); | |
7745 064 \} | |
7746 065 mp_clear(&M[1]); | |
7747 066 return err; | |
7748 067 \} | |
7749 068 \} | |
7750 069 | |
7751 070 /* create mu, used for Barrett reduction */ | |
7752 071 if ((err = mp_init (&mu)) != MP_OKAY) \{ | |
7753 072 goto __M; | |
7754 073 \} | |
7755 074 if ((err = mp_reduce_setup (&mu, P)) != MP_OKAY) \{ | |
7756 075 goto __MU; | |
7757 076 \} | |
7758 077 | |
7759 078 /* create M table | |
7760 079 * | |
7761 080 * The M table contains powers of the base, | |
7762 081 * e.g. M[x] = G**x mod P | |
7763 082 * | |
7764 083 * The first half of the table is not | |
7765 084 * computed though accept for M[0] and M[1] | |
7766 085 */ | |
7767 086 if ((err = mp_mod (G, P, &M[1])) != MP_OKAY) \{ | |
7768 087 goto __MU; | |
7769 088 \} | |
7770 089 | |
7771 090 /* compute the value at M[1<<(winsize-1)] by squaring | |
7772 091 * M[1] (winsize-1) times | |
7773 092 */ | |
7774 093 if ((err = mp_copy (&M[1], &M[1 << (winsize - 1)])) != MP_OKAY) \{ | |
7775 094 goto __MU; | |
7776 095 \} | |
7777 096 | |
7778 097 for (x = 0; x < (winsize - 1); x++) \{ | |
7779 098 if ((err = mp_sqr (&M[1 << (winsize - 1)], | |
7780 099 &M[1 << (winsize - 1)])) != MP_OKAY) \{ | |
7781 100 goto __MU; | |
7782 101 \} | |
7783 102 if ((err = mp_reduce (&M[1 << (winsize - 1)], P, &mu)) != MP_OKAY) \{ | |
7784 103 goto __MU; | |
7785 104 \} | |
7786 105 \} | |
7787 106 | |
7788 107 /* create upper table, that is M[x] = M[x-1] * M[1] (mod P) | |
7789 108 * for x = (2**(winsize - 1) + 1) to (2**winsize - 1) | |
7790 109 */ | |
7791 110 for (x = (1 << (winsize - 1)) + 1; x < (1 << winsize); x++) \{ | |
7792 111 if ((err = mp_mul (&M[x - 1], &M[1], &M[x])) != MP_OKAY) \{ | |
7793 112 goto __MU; | |
7794 113 \} | |
7795 114 if ((err = mp_reduce (&M[x], P, &mu)) != MP_OKAY) \{ | |
7796 115 goto __MU; | |
7797 116 \} | |
7798 117 \} | |
7799 118 | |
7800 119 /* setup result */ | |
7801 120 if ((err = mp_init (&res)) != MP_OKAY) \{ | |
7802 121 goto __MU; | |
7803 122 \} | |
7804 123 mp_set (&res, 1); | |
7805 124 | |
7806 125 /* set initial mode and bit cnt */ | |
7807 126 mode = 0; | |
7808 127 bitcnt = 1; | |
7809 128 buf = 0; | |
7810 129 digidx = X->used - 1; | |
7811 130 bitcpy = 0; | |
7812 131 bitbuf = 0; | |
7813 132 | |
7814 133 for (;;) \{ | |
7815 134 /* grab next digit as required */ | |
7816 135 if (--bitcnt == 0) \{ | |
7817 136 /* if digidx == -1 we are out of digits */ | |
7818 137 if (digidx == -1) \{ | |
7819 138 break; | |
7820 139 \} | |
7821 140 /* read next digit and reset the bitcnt */ | |
7822 141 buf = X->dp[digidx--]; | |
7823 142 bitcnt = (int) DIGIT_BIT; | |
7824 143 \} | |
7825 144 | |
7826 145 /* grab the next msb from the exponent */ | |
7827 146 y = (buf >> (mp_digit)(DIGIT_BIT - 1)) & 1; | |
7828 147 buf <<= (mp_digit)1; | |
7829 148 | |
7830 149 /* if the bit is zero and mode == 0 then we ignore it | |
7831 150 * These represent the leading zero bits before the first 1 bit | |
7832 151 * in the exponent. Technically this opt is not required but it | |
7833 152 * does lower the # of trivial squaring/reductions used | |
7834 153 */ | |
7835 154 if (mode == 0 && y == 0) \{ | |
7836 155 continue; | |
7837 156 \} | |
7838 157 | |
7839 158 /* if the bit is zero and mode == 1 then we square */ | |
7840 159 if (mode == 1 && y == 0) \{ | |
7841 160 if ((err = mp_sqr (&res, &res)) != MP_OKAY) \{ | |
7842 161 goto __RES; | |
7843 162 \} | |
7844 163 if ((err = mp_reduce (&res, P, &mu)) != MP_OKAY) \{ | |
7845 164 goto __RES; | |
7846 165 \} | |
7847 166 continue; | |
7848 167 \} | |
7849 168 | |
7850 169 /* else we add it to the window */ | |
7851 170 bitbuf |= (y << (winsize - ++bitcpy)); | |
7852 171 mode = 2; | |
7853 172 | |
7854 173 if (bitcpy == winsize) \{ | |
7855 174 /* ok window is filled so square as required and multiply */ | |
7856 175 /* square first */ | |
7857 176 for (x = 0; x < winsize; x++) \{ | |
7858 177 if ((err = mp_sqr (&res, &res)) != MP_OKAY) \{ | |
7859 178 goto __RES; | |
7860 179 \} | |
7861 180 if ((err = mp_reduce (&res, P, &mu)) != MP_OKAY) \{ | |
7862 181 goto __RES; | |
7863 182 \} | |
7864 183 \} | |
7865 184 | |
7866 185 /* then multiply */ | |
7867 186 if ((err = mp_mul (&res, &M[bitbuf], &res)) != MP_OKAY) \{ | |
7868 187 goto __RES; | |
7869 188 \} | |
7870 189 if ((err = mp_reduce (&res, P, &mu)) != MP_OKAY) \{ | |
7871 190 goto __RES; | |
7872 191 \} | |
7873 192 | |
7874 193 /* empty window and reset */ | |
7875 194 bitcpy = 0; | |
7876 195 bitbuf = 0; | |
7877 196 mode = 1; | |
7878 197 \} | |
7879 198 \} | |
7880 199 | |
7881 200 /* if bits remain then square/multiply */ | |
7882 201 if (mode == 2 && bitcpy > 0) \{ | |
7883 202 /* square then multiply if the bit is set */ | |
7884 203 for (x = 0; x < bitcpy; x++) \{ | |
7885 204 if ((err = mp_sqr (&res, &res)) != MP_OKAY) \{ | |
7886 205 goto __RES; | |
7887 206 \} | |
7888 207 if ((err = mp_reduce (&res, P, &mu)) != MP_OKAY) \{ | |
7889 208 goto __RES; | |
7890 209 \} | |
7891 210 | |
7892 211 bitbuf <<= 1; | |
7893 212 if ((bitbuf & (1 << winsize)) != 0) \{ | |
7894 213 /* then multiply */ | |
7895 214 if ((err = mp_mul (&res, &M[1], &res)) != MP_OKAY) \{ | |
7896 215 goto __RES; | |
7897 216 \} | |
7898 217 if ((err = mp_reduce (&res, P, &mu)) != MP_OKAY) \{ | |
7899 218 goto __RES; | |
7900 219 \} | |
7901 220 \} | |
7902 221 \} | |
7903 222 \} | |
7904 223 | |
7905 224 mp_exch (&res, Y); | |
7906 225 err = MP_OKAY; | |
7907 226 __RES:mp_clear (&res); | |
7908 227 __MU:mp_clear (&mu); | |
7909 228 __M: | |
7910 229 mp_clear(&M[1]); | |
7911 230 for (x = 1<<(winsize-1); x < (1 << winsize); x++) \{ | |
7912 231 mp_clear (&M[x]); | |
7913 232 \} | |
7914 233 return err; | |
7915 234 \} | |
7916 \end{alltt} | |
7917 \end{small} | |
7918 | |
7919 Lines 31 through 41 determine the optimal window size based on the length of the exponent in bits. The window divisions are sorted | |
7920 from smallest to greatest so that in each \textbf{if} statement only one condition must be tested. For example, by the \textbf{if} statement | |
7921 on line 33 the value of $x$ is already known to be greater than $140$. | |
7922 | |
7923 The conditional piece of code beginning on line 47 allows the window size to be restricted to five bits. This logic is used to ensure | |
7924 the table of precomputed powers of $G$ remains relatively small. | |
7925 | |
7926 The for loop on line 60 initializes the $M$ array while lines 61 and 74 compute the value of $\mu$ required for | |
7927 Barrett reduction. | |
7928 | |
7929 -- More later. | |
7930 | |
7931 \section{Quick Power of Two} | |
7932 Calculating $a = 2^b$ can be performed much more quickly than with any of the previous algorithms. Recall that a logical shift left $m << k$ is | |
7933 equivalent to $m \cdot 2^k$. By this logic when $m = 1$ a quick power of two can be achieved. | |
7934 | |
7935 \begin{figure}[!here] | |
7936 \begin{small} | |
7937 \begin{center} | |
7938 \begin{tabular}{l} | |
7939 \hline Algorithm \textbf{mp\_2expt}. \\ | |
7940 \textbf{Input}. integer $b$ \\ | |
7941 \textbf{Output}. $a \leftarrow 2^b$ \\ | |
7942 \hline \\ | |
7943 1. $a \leftarrow 0$ \\ | |
7944 2. If $a.alloc < \lfloor b / lg(\beta) \rfloor + 1$ then grow $a$ appropriately. \\ | |
7945 3. $a.used \leftarrow \lfloor b / lg(\beta) \rfloor + 1$ \\ | |
7946 4. $a_{\lfloor b / lg(\beta) \rfloor} \leftarrow 1 << (b \mbox{ mod } lg(\beta))$ \\ | |
7947 5. Return(\textit{MP\_OKAY}). \\ | |
7948 \hline | |
7949 \end{tabular} | |
7950 \end{center} | |
7951 \end{small} | |
7952 \caption{Algorithm mp\_2expt} | |
7953 \end{figure} | |
7954 | |
7955 \textbf{Algorithm mp\_2expt.} | |
7956 | |
7957 \vspace{+3mm}\begin{small} | |
7958 \hspace{-5.1mm}{\bf File}: bn\_mp\_2expt.c | |
7959 \vspace{-3mm} | |
7960 \begin{alltt} | |
7961 016 | |
7962 017 /* computes a = 2**b | |
7963 018 * | |
7964 019 * Simple algorithm which zeroes the int, grows it then just sets one bit | |
7965 020 * as required. | |
7966 021 */ | |
7967 022 int | |
7968 023 mp_2expt (mp_int * a, int b) | |
7969 024 \{ | |
7970 025 int res; | |
7971 026 | |
7972 027 /* zero a as per default */ | |
7973 028 mp_zero (a); | |
7974 029 | |
7975 030 /* grow a to accommodate the single bit */ | |
7976 031 if ((res = mp_grow (a, b / DIGIT_BIT + 1)) != MP_OKAY) \{ | |
7977 032 return res; | |
7978 033 \} | |
7979 034 | |
7980 035 /* set the used count of where the bit will go */ | |
7981 036 a->used = b / DIGIT_BIT + 1; | |
7982 037 | |
7983 038 /* put the single bit in its place */ | |
7984 039 a->dp[b / DIGIT_BIT] = ((mp_digit)1) << (b % DIGIT_BIT); | |
7985 040 | |
7986 041 return MP_OKAY; | |
7987 042 \} | |
7988 \end{alltt} | |
7989 \end{small} | |
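
As a minimal sketch of the same idea outside the library, the fragment below sets a single bit in a fixed-size digit array. The names \texttt{toy\_2expt}, \texttt{digit\_t} and \texttt{MAX\_DIGITS} are hypothetical and a 28-bit digit is assumed; the real routine grows the mp\_int rather than relying on a fixed capacity.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DIGIT_BIT  28          /* assumed digit width */
#define MAX_DIGITS 8           /* hypothetical fixed capacity for this sketch */

typedef uint32_t digit_t;      /* stand-in for mp_digit */

/* Store 2**b in dp[0..MAX_DIGITS-1] and return the number of used digits. */
static int toy_2expt(digit_t dp[MAX_DIGITS], int b)
{
    assert(b >= 0 && b / DIGIT_BIT < MAX_DIGITS);

    /* zero every digit first, mirroring step 1 of the algorithm */
    memset(dp, 0, MAX_DIGITS * sizeof(digit_t));

    /* the single set bit lives in digit b / DIGIT_BIT at offset b % DIGIT_BIT;
     * the cast keeps the shift in the digit type rather than in int */
    dp[b / DIGIT_BIT] = (digit_t)1 << (b % DIGIT_BIT);

    return b / DIGIT_BIT + 1;
}
```

For instance, with $b = 30$ the bit lands in digit $\lfloor 30 / 28 \rfloor = 1$ at bit offset $2$, so two digits are in use.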
7990 | |
7991 \chapter{Higher Level Algorithms} | |
7992 | |
7993 This chapter discusses the various higher level algorithms that are required to complete a well rounded multiple precision integer package. These | |
7994 routines are less performance oriented than the algorithms of chapters five, six and seven but are no less important. | |
7995 | |
7996 The first section describes a method of integer division with remainder that is universally well known. It provides the signed division logic | |
7997 for the package. The subsequent section discusses a set of algorithms which allow a single digit to be the second operand for a variety of operations. | |
7998 These algorithms serve mostly to simplify other algorithms where small constants are required. The last two sections discuss how to manipulate | |
7999 various representations of integers. For example, converting from an mp\_int to a string of characters. | |
8000 | |
8001 \section{Integer Division with Remainder} | |
8002 \label{sec:division} | |
8003 | |
8004 Integer division, aside from modular exponentiation, is the most intensive algorithm to compute. Like addition, subtraction and multiplication | |
8005 the basis of this algorithm is the long-hand division algorithm taught to school children. Throughout this discussion several common variables | |
8006 will be used. Let $x$ represent the divisor and $y$ represent the dividend. Let $q$ represent the integer quotient $\lfloor y / x \rfloor$ and | |
8007 let $r$ represent the remainder $r = y - x \lfloor y / x \rfloor$. The following simple algorithm will be used to start the discussion. | |
8008 | |
8009 \newpage\begin{figure}[!here] | |
8010 \begin{small} | |
8011 \begin{center} | |
8012 \begin{tabular}{l} | |
8013 \hline Algorithm \textbf{Radix-$\beta$ Integer Division}. \\ | |
8014 \textbf{Input}. integer $x$ and $y$ \\ | |
8015 \textbf{Output}. $q = \lfloor y/x\rfloor, r = y - xq$ \\ | |
8016 \hline \\ | |
8017 1. $q \leftarrow 0$ \\ | |
8018 2. $n \leftarrow \vert \vert y \vert \vert - \vert \vert x \vert \vert$ \\ | |
8019 3. for $t$ from $n$ down to $0$ do \\ | |
8020 \hspace{3mm}3.1 Maximize $k$ such that $kx\beta^t$ is less than or equal to $y$ and $(k + 1)x\beta^t$ is greater. \\ | |
8021 \hspace{3mm}3.2 $q \leftarrow q + k\beta^t$ \\ | |
8022 \hspace{3mm}3.3 $y \leftarrow y - kx\beta^t$ \\ | |
8023 4. $r \leftarrow y$ \\ | |
8024 5. Return($q, r$) \\ | |
8025 \hline | |
8026 \end{tabular} | |
8027 \end{center} | |
8028 \end{small} | |
8029 \caption{Algorithm Radix-$\beta$ Integer Division} | |
8030 \label{fig:raddiv} | |
8031 \end{figure} | |
8032 | |
8033 As children we are taught this very simple algorithm for the case of $\beta = 10$. Almost instinctively several optimizations are taught whose | |
8034 rationale is never explained. For this example let $y = 5471$ represent the dividend and $x = 23$ represent the divisor. | |
8035 | |
8036 To find the first digit of the quotient the value of $k$ must be maximized such that $kx\beta^t$ is less than or equal to $y$ and | |
8037 simultaneously $(k + 1)x\beta^t$ is greater than $y$. Implicitly $k$ is the maximum value the $t$'th digit of the quotient may have. The habitual method | |
8038 used to find the maximum is to ``eyeball'' the two numbers, typically only the leading digits and quickly estimate a quotient. By only using leading | |
8039 digits a much simpler division may be used to form an educated guess at what the value must be. In this case $k = \lfloor 54/23\rfloor = 2$ quickly | |
8040 arises as a possible solution. Indeed $2x\beta^2 = 4600$ is less than $y = 5471$ and simultaneously $(k + 1)x\beta^2 = 6900$ is larger than $y$. | |
8041 As a result $k\beta^2$ is added to the quotient which now equals $q = 200$ and $4600$ is subtracted from $y$ to give a remainder of $y = 871$. | |
8042 | |
8043 Again this process is repeated to produce the quotient digit $k = 3$ which makes the quotient $q = 200 + 3\beta = 230$ and the remainder | |
8044 $y = 871 - 3x\beta = 181$. Finally the last iteration of the loop produces $k = 7$ which leads to the quotient $q = 230 + 7 = 237$ and the | |
8045 remainder $y = 181 - 7x = 20$. The final quotient and remainder found are $q = 237$ and $r = y = 20$ which are indeed correct since | |
8046 $237 \cdot 23 + 20 = 5471$ is true. | |
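
The worked example can be replayed with a short sketch of the radix-$\beta$ division algorithm on ordinary machine words. The function name \texttt{radix\_div} is hypothetical, and the operands are assumed small enough that the products do not overflow an \texttt{unsigned long}.

```c
#include <assert.h>

/* Radix-beta long division on machine words (hypothetical helper).
 * Computes q = floor(y / x) and r = y - x*q one radix-beta digit at a time. */
static void radix_div(unsigned long y, unsigned long x, unsigned long beta,
                      unsigned long *q, unsigned long *r)
{
    unsigned long scale = 1;   /* beta**t for the current digit position t */
    unsigned long k;

    assert(x > 0 && beta > 1);

    /* find the largest beta**t such that x * beta**t <= y */
    while (x * scale <= y / beta) {
        scale *= beta;
    }

    *q = 0;
    for (;;) {
        /* step 3.1: maximize k such that k * x * beta**t <= y */
        k = 0;
        while ((k + 1) * x * scale <= y) {
            k++;
        }
        *q += k * scale;        /* step 3.2 */
        y  -= k * x * scale;    /* step 3.3 */
        if (scale == 1) {
            break;
        }
        scale /= beta;
    }
    *r = y;                     /* step 4 */
}
```

Calling \texttt{radix\_div(5471, 23, 10, \&q, \&r)} walks through the quotient digits $2$, $3$ and $7$ in turn, leaving $q = 237$ and $r = 20$.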
8047 | |
8048 \subsection{Quotient Estimation} | |
8049 \label{sec:divest} | |
8050 As alluded to earlier the quotient digit $k$ can be estimated from only the leading digits of both the divisor and dividend. When $p$ leading | |
8051 digits are used from both the divisor and dividend to form an estimation the accuracy of the estimation rises as $p$ grows. Technically | |
8052 speaking the estimation is based on assuming the lower $\vert \vert y \vert \vert - p$ and $\vert \vert x \vert \vert - p$ digits of the | |
8053 dividend and divisor are zero. | |
8054 | |
8055 The value of the estimation may be off by a few values in either direction and in general is fairly correct. A simplification \cite[pp. 271]{TAOCPV2} | |
8056 of the estimation technique is to use $t + 1$ digits of the dividend and $t$ digits of the divisor, in particular when $t = 1$. The estimate | |
8057 using this technique is never too small. For the following proof let $t = \vert \vert y \vert \vert - 1$ and $s = \vert \vert x \vert \vert - 1$ | |
8058 represent the indices of the most significant digits of the dividend and divisor respectively. | |
8059 | |
8060 \textbf{Proof.}\textit{ The quotient $\hat k = \lfloor (y_t\beta + y_{t-1}) / x_s \rfloor$ is greater than or equal to | |
8061 $k = \lfloor y / (x \cdot \beta^{\vert \vert y \vert \vert - \vert \vert x \vert \vert - 1}) \rfloor$. } | |
8062 The first obvious case is when $\hat k = \beta - 1$ in which case the proof is concluded since the real quotient cannot be larger. For all other | |
8063 cases $\hat k = \lfloor (y_t\beta + y_{t-1}) / x_s \rfloor$ and $\hat k x_s \ge y_t\beta + y_{t-1} - x_s + 1$. The latter portion of the inequality | |
8064 $-x_s + 1$ arises from the fact that the remainder of a truncated integer division by $x_s$ is at most $x_s - 1$. Next a series of | |
8065 inequalities will prove the hypothesis. | |
8066 | |
8067 \begin{equation} | |
8068 y - \hat k x \le y - \hat k x_s\beta^s | |
8069 \end{equation} | |
8070 | |
8071 This is trivially true since $x \ge x_s\beta^s$. Next we replace $\hat kx_s\beta^s$ by the previous inequality for $\hat kx_s$. | |
8072 | |
8073 \begin{equation} | |
8074 y - \hat k x \le y_t\beta^t + \ldots + y_0 - (y_t\beta^t + y_{t-1}\beta^{t-1} - x_s\beta^s + \beta^s) | |
8075 \end{equation} | |
8076 | |
8077 By simplifying the previous inequality the following inequality is formed. | |
8078 | |
8079 \begin{equation} | |
8080 y - \hat k x \le y_{t-2}\beta^{t-2} + \ldots + y_0 + x_s\beta^s - \beta^s | |
8081 \end{equation} | |
8082 | |
8083 Subsequently, | |
8084 | |
8085 \begin{equation} | |
8086 y_{t-2}\beta^{t-2} + \ldots + y_0 + x_s\beta^s - \beta^s < x_s\beta^s \le x | |
8087 \end{equation} | |
8088 | |
8089 Which proves that $y - \hat kx \le x$ and by consequence $\hat k \ge k$ which concludes the proof. \textbf{QED} | |
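
The claim can also be checked by brute force for a small radix. The sketch below is only an illustration under stated assumptions: the divisor has two digits, the dividend has three, and the dividend is less than $x\beta$ so that the true quotient is a single digit. The name \texttt{count\_low\_estimates} is hypothetical.

```c
#include <assert.h>

/* Count the cases in which the estimate falls below the true quotient digit.
 * x = x1*beta + x0 is a two digit divisor with x1 != 0, and y is a dividend
 * of at most three digits with y < x*beta, so floor(y / x) is one digit. */
static unsigned long count_low_estimates(unsigned long beta)
{
    unsigned long x, y, bad = 0;

    assert(beta >= 2 && beta <= 32);   /* keep the exhaustive search small */

    for (x = beta; x < beta * beta; x++) {       /* leading digit is nonzero */
        unsigned long x1 = x / beta;
        for (y = 0; y < x * beta; y++) {
            unsigned long top  = y / beta;       /* the two leading digits of y */
            unsigned long khat = top / x1;       /* estimate from leading digits */
            unsigned long k    = y / x;          /* true quotient digit */
            if (khat > beta - 1) {
                khat = beta - 1;                 /* clamp to a single digit */
            }
            if (khat < k) {
                bad++;
            }
        }
    }
    return bad;
}
```

A return value of zero for a given radix means no counterexample exists within the searched range, which agrees with the proof.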
8090 | |
8091 | |
8092 \subsection{Normalized Integers} | |
8093 For the purposes of division a normalized input is when the divisor's leading digit $x_n$ is greater than or equal to $\beta / 2$. By multiplying both | |
8094 $x$ and $y$ by $j = \lfloor (\beta / 2) / x_n \rfloor$ the quotient remains unchanged and the remainder is simply $j$ times the original | |
8095 remainder. The purpose of normalization is to ensure the leading digit of the divisor is sufficiently large such that the estimated quotient will | |
8096 lie in the domain of a single digit. Consider the maximum dividend $(\beta - 1) \cdot \beta + (\beta - 1)$ and the minimum divisor $\beta / 2$. | |
8097 | |
8098 \begin{equation} | |
8099 {{\beta^2 - 1} \over { \beta / 2}} \le 2\beta - {2 \over \beta} | |
8100 \end{equation} | |
8101 | |
8102 At most the quotient approaches $2\beta$, however, in practice this will not occur since that would imply the previous quotient digit was too small. | |
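
The effect of normalization can be sketched with machine words. Note that this sketch scales by a power of two, as algorithm mp\_div does later, rather than by the multiplier $j$ described above; an 8-bit digit and the names \texttt{norm\_shift} and \texttt{scaling\_preserves\_quotient} are assumptions made only for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define DIGIT_BIT 8                  /* assumed small digit width */
#define BETA      (1u << DIGIT_BIT)

/* Number of left shifts needed to bring a nonzero leading digit up to >= BETA/2. */
static int norm_shift(uint32_t lead)
{
    int s = 0;
    assert(lead > 0 && lead < BETA);
    while (lead < BETA / 2) {
        lead <<= 1;
        s++;
    }
    return s;
}

/* Returns 1 when scaling both operands by 2**s keeps the quotient
 * and multiplies the remainder by 2**s. */
static int scaling_preserves_quotient(uint32_t y, uint32_t x, int s)
{
    uint32_t ys, xs;

    /* reject inputs whose shifted values would not fit in 32 bits */
    assert(x > 0 && s >= 0 && s < DIGIT_BIT);
    assert(y <= (0xFFFFFFFFu >> s) && x <= (0xFFFFFFFFu >> s));

    ys = y << s;
    xs = x << s;
    return (ys / xs == y / x) && (ys % xs == ((y % x) << s));
}
```

With the single digit divisor $23$ the shift count is three, and dividing $5471 \cdot 2^3$ by $23 \cdot 2^3$ still gives the quotient $237$ with the remainder $20 \cdot 2^3 = 160$.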
8103 | |
8104 \subsection{Radix-$\beta$ Division with Remainder} | |
8105 \newpage\begin{figure}[!here] | |
8106 \begin{small} | |
8107 \begin{center} | |
8108 \begin{tabular}{l} | |
8109 \hline Algorithm \textbf{mp\_div}. \\ | |
8110 \textbf{Input}. mp\_int $a, b$ \\ | |
8111 \textbf{Output}. $c = \lfloor a/b \rfloor$, $d = a - bc$ \\ | |
8112 \hline \\ | |
8113 1. If $b = 0$ return(\textit{MP\_VAL}). \\ | |
8114 2. If $\vert a \vert < \vert b \vert$ then do \\ | |
8115 \hspace{3mm}2.1 $d \leftarrow a$ \\ | |
8116 \hspace{3mm}2.2 $c \leftarrow 0$ \\ | |
8117 \hspace{3mm}2.3 Return(\textit{MP\_OKAY}). \\ | |
8118 \\ | |
8119 Setup the quotient to receive the digits. \\ | |
8120 3. Grow $q$ to $a.used + 2$ digits. \\ | |
8121 4. $q \leftarrow 0$ \\ | |
8122 5. $x \leftarrow \vert a \vert , y \leftarrow \vert b \vert$ \\ | |
8123 6. $sign \leftarrow \left \lbrace \begin{array}{ll} | |
8124 MP\_ZPOS & \mbox{if }a.sign = b.sign \\ | |
8125 MP\_NEG & \mbox{otherwise} \\ | |
8126 \end{array} \right .$ \\ | |
8127 \\ | |
8128 Normalize the inputs such that the leading digit of $y$ is greater than or equal to $\beta / 2$. \\ | |
8129 7. $norm \leftarrow (lg(\beta) - 1) - (\lceil lg(y) \rceil \mbox{ (mod }lg(\beta)\mbox{)})$ \\ | |
8130 8. $x \leftarrow x \cdot 2^{norm}, y \leftarrow y \cdot 2^{norm}$ \\ | |
8131 \\ | |
8132 Find the leading digit of the quotient. \\ | |
8133 9. $n \leftarrow x.used - 1, t \leftarrow y.used - 1$ \\ | |
8134 10. $y \leftarrow y \cdot \beta^{n - t}$ \\ | |
8135 11. While ($x \ge y$) do \\ | |
8136 \hspace{3mm}11.1 $q_{n - t} \leftarrow q_{n - t} + 1$ \\ | |
8137 \hspace{3mm}11.2 $x \leftarrow x - y$ \\ | |
8138 12. $y \leftarrow \lfloor y / \beta^{n-t} \rfloor$ \\ | |
8139 \\ | |
8140 Continued on the next page. \\ | |
8141 \hline | |
8142 \end{tabular} | |
8143 \end{center} | |
8144 \end{small} | |
8145 \caption{Algorithm mp\_div} | |
8146 \end{figure} | |
8147 | |
8148 \newpage\begin{figure}[!here] | |
8149 \begin{small} | |
8150 \begin{center} | |
8151 \begin{tabular}{l} | |
8152 \hline Algorithm \textbf{mp\_div} (continued). \\ | |
8153 \textbf{Input}. mp\_int $a, b$ \\ | |
8154 \textbf{Output}. $c = \lfloor a/b \rfloor$, $d = a - bc$ \\ | |
8155 \hline \\ | |
8156 Now find the remainder of the digits. \\ | |
8157 13. for $i$ from $n$ down to $(t + 1)$ do \\ | |
8158 \hspace{3mm}13.1 If $i > x.used$ then jump to the next iteration of this loop. \\ | |
8159 \hspace{3mm}13.2 If $x_{i} = y_{t}$ then \\ | |
8160 \hspace{6mm}13.2.1 $q_{i - t - 1} \leftarrow \beta - 1$ \\ | |
8161 \hspace{3mm}13.3 else \\ | |
8162 \hspace{6mm}13.3.1 $\hat r \leftarrow x_{i} \cdot \beta + x_{i - 1}$ \\ | |
8163 \hspace{6mm}13.3.2 $\hat r \leftarrow \lfloor \hat r / y_{t} \rfloor$ \\ | |
8164 \hspace{6mm}13.3.3 $q_{i - t - 1} \leftarrow \hat r$ \\ | |
8165 \hspace{3mm}13.4 $q_{i - t - 1} \leftarrow q_{i - t - 1} + 1$ \\ | |
8166 \\ | |
8167 Fixup quotient estimation. \\ | |
8168 \hspace{3mm}13.5 Loop \\ | |
8169 \hspace{6mm}13.5.1 $q_{i - t - 1} \leftarrow q_{i - t - 1} - 1$ \\ | |
8170 \hspace{6mm}13.5.2 t$1 \leftarrow 0$ \\ | |
8171 \hspace{6mm}13.5.3 t$1_0 \leftarrow y_{t - 1}, $ t$1_1 \leftarrow y_t,$ t$1.used \leftarrow 2$ \\ | |
8172 \hspace{6mm}13.5.4 $t1 \leftarrow t1 \cdot q_{i - t - 1}$ \\ | |
8173 \hspace{6mm}13.5.5 t$2_0 \leftarrow x_{i - 2}, $ t$2_1 \leftarrow x_{i - 1}, $ t$2_2 \leftarrow x_i, $ t$2.used \leftarrow 3$ \\ | |
8174 \hspace{6mm}13.5.6 If $\vert t1 \vert > \vert t2 \vert$ then goto step 13.5. \\ | |
8175 \hspace{3mm}13.6 t$1 \leftarrow y \cdot q_{i - t - 1}$ \\ | |
8176 \hspace{3mm}13.7 t$1 \leftarrow $ t$1 \cdot \beta^{i - t - 1}$ \\ | |
8177 \hspace{3mm}13.8 $x \leftarrow x - $ t$1$ \\ | |
8178 \hspace{3mm}13.9 If $x.sign = MP\_NEG$ then \\ | |
8179 \hspace{6mm}13.9.1 t$1 \leftarrow y$ \\ | |
8180 \hspace{6mm}13.9.2 t$1 \leftarrow $ t$1 \cdot \beta^{i - t - 1}$ \\ | |
8181 \hspace{6mm}13.9.3 $x \leftarrow x + $ t$1$ \\ | |
8182 \hspace{6mm}13.9.4 $q_{i - t - 1} \leftarrow q_{i - t - 1} - 1$ \\ | |
8183 \\ | |
8184 Finalize the result. \\ | |
8185 14. Clamp excess digits of $q$ \\ | |
8186 15. $c \leftarrow q, c.sign \leftarrow sign$ \\ | |
8187 16. $x.sign \leftarrow a.sign$ \\ | |
8188 17. $d \leftarrow \lfloor x / 2^{norm} \rfloor$ \\ | |
8189 18. Return(\textit{MP\_OKAY}). \\ | |
8190 \hline | |
8191 \end{tabular} | |
8192 \end{center} | |
8193 \end{small} | |
8194 \caption{Algorithm mp\_div (continued)} | |
8195 \end{figure} | |
8196 \textbf{Algorithm mp\_div.} | |
8197 This algorithm will calculate quotient and remainder from an integer division given a dividend and divisor. The algorithm is a signed | |
8198 division and will produce a fully qualified quotient and remainder. | |
8199 | |
8200 First the divisor $b$ must be non-zero which is enforced in step one. If the divisor is larger than the dividend then the quotient is implicitly | |
8201 zero and the remainder is the dividend. | |
8202 | |
8203 After the first two trivial cases of inputs are handled the variable $q$ is set up to receive the digits of the quotient. Two unsigned copies of the | |
8204 divisor $y$ and dividend $x$ are made as well. The core of the division algorithm is an unsigned division and will only work if the values are | |
8205 positive. Now the two values $x$ and $y$ must be normalized such that the leading digit of $y$ is greater than or equal to $\beta / 2$. | |
8206 This is performed by shifting both to the left by enough bits to get the desired normalization. | |
8207 | |
8208 At this point the division algorithm can begin producing digits of the quotient. Recall that the maximum value of the estimation used is | |
8209 $2\beta - {2 \over \beta}$ which means that a digit of the quotient must be first produced by another means. In this case $y$ is shifted | |
8210 to the left (\textit{step ten}) so that it has the same number of digits as $x$. The loop on step eleven will subtract multiples of the | |
8211 shifted copy of $y$ until $x$ is smaller. Since the leading digit of $y$ is greater than or equal to $\beta/2$ this loop will iterate at most two | |
8212 times to produce the desired leading digit of the quotient. | |
8213 | |
8214 Now the remainder of the digits can be produced. The equation $\hat q = \lfloor {{x_i \beta + x_{i-1}}\over y_t} \rfloor$ is used to fairly | |
8215 accurately approximate the true quotient digit. The estimation can in theory produce an estimation as high as $2\beta - {2 \over \beta}$ but by | |
8216 induction the upper quotient digit is correct (\textit{as established on step eleven}) and the estimate must be less than $\beta$. | |
8217 | |
8218 Recall from section~\ref{sec:divest} that the estimation is never too low but may be too high. The next step of the estimation process is | |
8219 to refine the estimation. The loop on step 13.5 uses $x_i\beta^2 + x_{i-1}\beta + x_{i-2}$ and $q_{i - t - 1}(y_t\beta + y_{t-1})$ as a higher | |
8220 order approximation to adjust the quotient digit. | |
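
The two phases of estimation can be sketched for a toy radix. This is an illustration only: $\beta = 16$ is assumed so every product fits in an \texttt{unsigned}, the names \texttt{refine\_digit} and \texttt{count\_mismatches} are hypothetical, and the refinement is written as a plain \texttt{while} loop instead of the increment followed by decrement used in step 13.5.

```c
#include <assert.h>

#define BETA 16u   /* assumed toy radix */

/* Estimate one quotient digit from the top three digits (x2, x1, x0) of the
 * dividend and the top two digits (y1, y0) of the divisor, then refine it. */
static unsigned refine_digit(unsigned x2, unsigned x1, unsigned x0,
                             unsigned y1, unsigned y0)
{
    unsigned lhs = (x2 * BETA + x1) * BETA + x0;   /* three digit dividend */
    unsigned rhs = y1 * BETA + y0;                  /* two digit divisor */
    unsigned qhat;

    /* precondition: the true quotient digit is below BETA */
    assert(y1 > 0 && x2 * BETA + x1 < rhs);

    /* first phase: two leading dividend digits over the leading divisor digit */
    if (x2 == y1) {
        qhat = BETA - 1;
    } else {
        qhat = (x2 * BETA + x1) / y1;
    }

    /* second phase: lower the estimate while it is provably too large */
    while (qhat * rhs > lhs) {
        qhat--;
    }
    return qhat;
}

/* Compare the refined digit with plain integer division over every valid input. */
static unsigned long count_mismatches(void)
{
    unsigned x2, x1, x0, y1, y0;
    unsigned long bad = 0;

    for (y1 = 1; y1 < BETA; y1++)
    for (y0 = 0; y0 < BETA; y0++)
    for (x2 = 0; x2 < BETA; x2++)
    for (x1 = 0; x1 < BETA; x1++)
    for (x0 = 0; x0 < BETA; x0++) {
        unsigned lhs = (x2 * BETA + x1) * BETA + x0;
        unsigned rhs = y1 * BETA + y0;
        if (x2 * BETA + x1 >= rhs) {
            continue;              /* quotient would not fit in one digit */
        }
        if (refine_digit(x2, x1, x0, y1, y0) != lhs / rhs) {
            bad++;
        }
    }
    return bad;
}
```

Within these three and two digit windows the refined value equals the exact quotient; in the full algorithm the lower digits of $x$ and $y$ are ignored at this stage, which is why a final correction may still be needed.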
8221 | |
8222 After both phases of estimation the quotient digit may still be off by a value of one\footnote{This is similar to the error introduced | |
8223 by optimizing Barrett reduction.}. Steps 13.6 through 13.8 subtract the multiple of the divisor from the dividend (\textit{similar to step 3.3 of | |
8224 algorithm~\ref{fig:raddiv}}) and then subsequently add a multiple of the divisor if the quotient was too large. | |
8225 | |
8226 Now that the quotient has been determined, finalizing the result is a matter of clamping the quotient, fixing the sizes and de-normalizing the | |
8227 remainder. An important aspect of this algorithm seemingly overlooked in other descriptions such as that of Algorithm 14.20 HAC \cite[pp. 598]{HAC} | |
8228 is that when the estimations are being made (\textit{inside the loop on step 13.5}) the digits $y_{t-1}$, $x_{i-2}$ and $x_{i-1}$ may lie | |
8229 outside their respective boundaries. For example, if $t = 0$ or $i \le 1$ then the digits would be undefined. In those cases the digits should | |
8230 respectively be replaced with a zero. | |
8231 | |
8232 \vspace{+3mm}\begin{small} | |
8233 \hspace{-5.1mm}{\bf File}: bn\_mp\_div.c | |
8234 \vspace{-3mm} | |
8235 \begin{alltt} | |
8236 016 | |
8237 017 /* integer signed division. | |
8238 018 * c*b + d == a [e.g. a/b, c=quotient, d=remainder] | |
8239 019 * HAC pp.598 Algorithm 14.20 | |
8240 020 * | |
8241 021 * Note that the description in HAC is horribly | |
8242 022 * incomplete. For example, it doesn't consider | |
8243 023 * the case where digits are removed from 'x' in | |
8244 024 * the inner loop. It also doesn't consider the | |
8245 025 * case that y has fewer than three digits, etc.. | |
8246 026 * | |
8247 027 * The overall algorithm is as described as | |
8248 028 * 14.20 from HAC but fixed to treat these cases. | |
8249 029 */ | |
8250 030 int mp_div (mp_int * a, mp_int * b, mp_int * c, mp_int * d) | |
8251 031 \{ | |
8252 032 mp_int q, x, y, t1, t2; | |
8253 033 int res, n, t, i, norm, neg; | |
8254 034 | |
8255 035 /* is divisor zero ? */ | |
8256 036 if (mp_iszero (b) == 1) \{ | |
8257 037 return MP_VAL; | |
8258 038 \} | |
8259 039 | |
8260 040 /* if a < b then q=0, r = a */ | |
8261 041 if (mp_cmp_mag (a, b) == MP_LT) \{ | |
8262 042 if (d != NULL) \{ | |
8263 043 res = mp_copy (a, d); | |
8264 044 \} else \{ | |
8265 045 res = MP_OKAY; | |
8266 046 \} | |
8267 047 if (c != NULL) \{ | |
8268 048 mp_zero (c); | |
8269 049 \} | |
8270 050 return res; | |
8271 051 \} | |
8272 052 | |
8273 053 if ((res = mp_init_size (&q, a->used + 2)) != MP_OKAY) \{ | |
8274 054 return res; | |
8275 055 \} | |
8276 056 q.used = a->used + 2; | |
8277 057 | |
8278 058 if ((res = mp_init (&t1)) != MP_OKAY) \{ | |
8279 059 goto __Q; | |
8280 060 \} | |
8281 061 | |
8282 062 if ((res = mp_init (&t2)) != MP_OKAY) \{ | |
8283 063 goto __T1; | |
8284 064 \} | |
8285 065 | |
8286 066 if ((res = mp_init_copy (&x, a)) != MP_OKAY) \{ | |
8287 067 goto __T2; | |
8288 068 \} | |
8289 069 | |
8290 070 if ((res = mp_init_copy (&y, b)) != MP_OKAY) \{ | |
8291 071 goto __X; | |
8292 072 \} | |
8293 073 | |
8294 074 /* fix the sign */ | |
8295 075 neg = (a->sign == b->sign) ? MP_ZPOS : MP_NEG; | |
8296 076 x.sign = y.sign = MP_ZPOS; | |
8297 077 | |
8298 078 /* normalize both x and y, ensure that y >= b/2, [b == 2**DIGIT_BIT] */ | |
8299 079 norm = mp_count_bits(&y) % DIGIT_BIT; | |
8300 080 if (norm < (int)(DIGIT_BIT-1)) \{ | |
8301 081 norm = (DIGIT_BIT-1) - norm; | |
8302 082 if ((res = mp_mul_2d (&x, norm, &x)) != MP_OKAY) \{ | |
8303 083 goto __Y; | |
8304 084 \} | |
8305 085 if ((res = mp_mul_2d (&y, norm, &y)) != MP_OKAY) \{ | |
8306 086 goto __Y; | |
8307 087 \} | |
8308 088 \} else \{ | |
8309 089 norm = 0; | |
8310 090 \} | |
8311 091 | |
8312 092 /* note hac does 0 based, so if used==5 then its 0,1,2,3,4, e.g. use 4 */ | |
8313 093 n = x.used - 1; | |
8314 094 t = y.used - 1; | |
8315 095 | |
8316 096 /* while (x >= y*b**n-t) do \{ q[n-t] += 1; x -= y*b**\{n-t\} \} */ | |
8317 097 if ((res = mp_lshd (&y, n - t)) != MP_OKAY) \{ /* y = y*b**\{n-t\} */ | |
8318 098 goto __Y; | |
8319 099 \} | |
8320 100 | |
8321 101 while (mp_cmp (&x, &y) != MP_LT) \{ | |
8322 102 ++(q.dp[n - t]); | |
8323 103 if ((res = mp_sub (&x, &y, &x)) != MP_OKAY) \{ | |
8324 104 goto __Y; | |
8325 105 \} | |
8326 106 \} | |
8327 107 | |
8328 108 /* reset y by shifting it back down */ | |
8329 109 mp_rshd (&y, n - t); | |
8330 110 | |
8331 111 /* step 3. for i from n down to (t + 1) */ | |
8332 112 for (i = n; i >= (t + 1); i--) \{ | |
8333 113 if (i > x.used) \{ | |
8334 114 continue; | |
8335 115 \} | |
8336 116 | |
8337 117 /* step 3.1 if xi == yt then set q\{i-t-1\} to b-1, | |
8338 118 * otherwise set q\{i-t-1\} to (xi*b + x\{i-1\})/yt */ | |
8339 119 if (x.dp[i] == y.dp[t]) \{ | |
8340 120 q.dp[i - t - 1] = ((((mp_digit)1) << DIGIT_BIT) - 1); | |
8341 121 \} else \{ | |
8342 122 mp_word tmp; | |
8343 123 tmp = ((mp_word) x.dp[i]) << ((mp_word) DIGIT_BIT); | |
8344 124 tmp |= ((mp_word) x.dp[i - 1]); | |
8345 125 tmp /= ((mp_word) y.dp[t]); | |
8346 126 if (tmp > (mp_word) MP_MASK) | |
8347 127 tmp = MP_MASK; | |
8348 128 q.dp[i - t - 1] = (mp_digit) (tmp & (mp_word) (MP_MASK)); | |
8349 129 \} | |
8350 130 | |
8351 131 /* while (q\{i-t-1\} * (yt * b + y\{t-1\})) > | |
8352 132 xi * b**2 + xi-1 * b + xi-2 | |
8353 133 | |
8354 134 do q\{i-t-1\} -= 1; | |
8355 135 */ | |
8356 136 q.dp[i - t - 1] = (q.dp[i - t - 1] + 1) & MP_MASK; | |
8357 137 do \{ | |
8358 138 q.dp[i - t - 1] = (q.dp[i - t - 1] - 1) & MP_MASK; | |
8359 139 | |
8360 140 /* find left hand */ | |
8361 141 mp_zero (&t1); | |
8362 142 t1.dp[0] = (t - 1 < 0) ? 0 : y.dp[t - 1]; | |
8363 143 t1.dp[1] = y.dp[t]; | |
8364 144 t1.used = 2; | |
8365 145 if ((res = mp_mul_d (&t1, q.dp[i - t - 1], &t1)) != MP_OKAY) \{ | |
8366 146 goto __Y; | |
8367 147 \} | |
8368 148 | |
8369 149 /* find right hand */ | |
8370 150 t2.dp[0] = (i - 2 < 0) ? 0 : x.dp[i - 2]; | |
8371 151 t2.dp[1] = (i - 1 < 0) ? 0 : x.dp[i - 1]; | |
8372 152 t2.dp[2] = x.dp[i]; | |
8373 153 t2.used = 3; | |
8374 154 \} while (mp_cmp_mag(&t1, &t2) == MP_GT); | |
8375 155 | |
8376 156 /* step 3.3 x = x - q\{i-t-1\} * y * b**\{i-t-1\} */ | |
8377 157 if ((res = mp_mul_d (&y, q.dp[i - t - 1], &t1)) != MP_OKAY) \{ | |
8378 158 goto __Y; | |
8379 159 \} | |
8380 160 | |
8381 161 if ((res = mp_lshd (&t1, i - t - 1)) != MP_OKAY) \{ | |
8382 162 goto __Y; | |
8383 163 \} | |
8384 164 | |
8385 165 if ((res = mp_sub (&x, &t1, &x)) != MP_OKAY) \{ | |
8386 166 goto __Y; | |
8387 167 \} | |
8388 168 | |
8389 169 /* if x < 0 then \{ x = x + y*b**\{i-t-1\}; q\{i-t-1\} -= 1; \} */ | |
8390 170 if (x.sign == MP_NEG) \{ | |
8391 171 if ((res = mp_copy (&y, &t1)) != MP_OKAY) \{ | |
8392 172 goto __Y; | |
8393 173 \} | |
8394 174 if ((res = mp_lshd (&t1, i - t - 1)) != MP_OKAY) \{ | |
8395 175 goto __Y; | |
8396 176 \} | |
8397 177 if ((res = mp_add (&x, &t1, &x)) != MP_OKAY) \{ | |
8398 178 goto __Y; | |
8399 179 \} | |
8400 180 | |
8401 181 q.dp[i - t - 1] = (q.dp[i - t - 1] - 1UL) & MP_MASK; | |
8402 182 \} | |
8403 183 \} | |
8404 184 | |
8405 185 /* now q is the quotient and x is the remainder | |
8406 186 * [which we have to normalize] | |
8407 187 */ | |
8408 188 | |
8409 189 /* get sign before writing to c */ | |
8410 190 x.sign = a->sign; | |
8411 191 | |
8412 192 if (c != NULL) \{ | |
8413 193 mp_clamp (&q); | |
8414 194 mp_exch (&q, c); | |
8415 195 c->sign = neg; | |
8416 196 \} | |
8417 197 | |
8418 198 if (d != NULL) \{ | |
8419 199 mp_div_2d (&x, norm, &x, NULL); | |
8420 200 mp_exch (&x, d); | |
8421 201 \} | |
8422 202 | |
8423 203 res = MP_OKAY; | |
8424 204 | |
8425 205 __Y:mp_clear (&y); | |
8426 206 __X:mp_clear (&x); | |
8427 207 __T2:mp_clear (&t2); | |
8428 208 __T1:mp_clear (&t1); | |
8429 209 __Q:mp_clear (&q); | |
8430 210 return res; | |
8431 211 \} | |
8432 \end{alltt} | |
8433 \end{small} | |
8434 | |
8435 The implementation of this algorithm differs slightly from the pseudo code presented previously. In this algorithm either of the quotient $c$ or | |
8436 remainder $d$ may be passed as a \textbf{NULL} pointer which indicates their value is not desired. For example, the C code to call the division | |
8437 algorithm with only the quotient is | |
8438 | |
8439 \begin{verbatim} | |
8440 mp_div(&a, &b, &c, NULL); /* c = [a/b] */ | |
8441 \end{verbatim} | |
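
The optional output convention can be sketched with plain \texttt{long} values. The function \texttt{toy\_div} and its status codes are hypothetical stand-ins, not part of the library; like mp\_div, the sketch truncates the quotient toward zero and gives the remainder the sign of the dividend.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define TOY_OKAY  0    /* hypothetical status codes for this sketch */
#define TOY_VAL  -1

/* Signed division with optional outputs: c receives the quotient and d the
 * remainder; either pointer may be NULL when that value is not wanted. */
static int toy_div(long a, long b, long *c, long *d)
{
    if (b == 0) {
        return TOY_VAL;                 /* division by zero */
    }

    /* the one overflowing case of machine division is excluded */
    assert(!(a == LONG_MIN && b == -1));

    if (c != NULL) {
        *c = a / b;                     /* C99 division truncates toward zero */
    }
    if (d != NULL) {
        *d = a % b;                     /* remainder takes the sign of a */
    }
    return TOY_OKAY;
}
```

A caller that only wants the remainder would pass \texttt{NULL} for the quotient, for example \texttt{toy\_div(-7, 2, NULL, \&r)}.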
8442 | |
8443 Lines 36 and 41 handle the two trivial cases of inputs which are division by zero and dividend smaller than the divisor | |
8444 respectively. After the two trivial cases all of the temporary variables are initialized. Line 75 determines the sign of | |
8445 the quotient and line 76 ensures that both $x$ and $y$ are positive. | |
8446 | |
8447 The number of bits in the leading digit is calculated on line 79. Implicitly an mp\_int with $r$ digits will require $lg(\beta)(r-1) + k$ bits | |
8448 of precision which when reduced modulo $lg(\beta)$ produces the value of $k$. In this case $k$ is the number of bits in the leading digit which is | |
8449 exactly what is required. For the algorithm to operate $k$ must equal $lg(\beta) - 1$ and when it does not the inputs must be normalized by shifting | |
8450 them to the left by $lg(\beta) - 1 - k$ bits. | |
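
The shift count arithmetic can be isolated in a few lines. The sketch below follows the computation in the listing, taking the bit length of the divisor as a plain integer; the name \texttt{norm\_from\_bits} is hypothetical and a 28-bit digit is assumed.

```c
#include <assert.h>

#define DIGIT_BIT 28   /* assumed digit width */

/* Shift count derived from the total bit length of the divisor,
 * following the arithmetic of the listing. */
static int norm_from_bits(int bits)
{
    int k;

    assert(bits > 0);

    /* bits modulo the digit width; a completely full leading digit gives 0 */
    k = bits % DIGIT_BIT;

    if (k < DIGIT_BIT - 1) {
        return (DIGIT_BIT - 1) - k;   /* shift until k would equal DIGIT_BIT - 1 */
    }
    return 0;                         /* already in the required position */
}
```

For a divisor of $33$ bits the value of $k$ is $5$ and the shift count is $22$, after which the bit length is congruent to $lg(\beta) - 1$ modulo $lg(\beta)$.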
8451 | |
8452 Throughout, the variables $n$ and $t$ will represent the highest digit of $x$ and $y$ respectively. These are first used to produce the | |
8453 leading digit of the quotient. The loop beginning on line 112 will produce the remainder of the quotient digits. | |
8454 | |
8455 The conditional ``continue'' on line 113 is used to prevent the algorithm from reading past the leading edge of $x$ which can occur when the | |
8456 algorithm eliminates multiple non-zero digits in a single iteration. This ensures that $x_i$ is always non-zero since by definition the digits | |
8457 of $x$ above the $i$'th position must be zero in order for the quotient to be precise\footnote{Precise as far as integer division is concerned.}. | |
8458 | |
8459 Lines 142, 143 and 150 through 152 manually construct the high accuracy estimations by setting the digits of the two mp\_int | |
8460 variables directly. | |
8461 | |
8462 \section{Single Digit Helpers} | |
8463 | |
8464 This section briefly describes a series of single digit helper algorithms which come in handy when working with small constants. All of | |
8465 the helper functions assume the single digit input is positive and will treat it as such. | |
8466 | |
8467 \subsection{Single Digit Addition and Subtraction} | |
8468 | |
8469 Both addition and subtraction are performed by ``cheating'' and using mp\_set followed by the higher level addition or subtraction | |
8470 algorithms. As a result these algorithms are substantially simpler with a slight cost in performance. | |
8471 | |
8472 \newpage\begin{figure}[!here] | |
8473 \begin{small} | |
8474 \begin{center} | |
8475 \begin{tabular}{l} | |
8476 \hline Algorithm \textbf{mp\_add\_d}. \\ | |
8477 \textbf{Input}. mp\_int $a$ and a mp\_digit $b$ \\ | |
8478 \textbf{Output}. $c = a + b$ \\ | |
8479 \hline \\ | |
8480 1. $t \leftarrow b$ (\textit{mp\_set}) \\ | |
8481 2. $c \leftarrow a + t$ \\ | |
8482 3. Return(\textit{MP\_OKAY}) \\ | |
8483 \hline | |
8484 \end{tabular} | |
8485 \end{center} | |
8486 \end{small} | |
8487 \caption{Algorithm mp\_add\_d} | |
8488 \end{figure} | |
8489 | |
8490 \textbf{Algorithm mp\_add\_d.} | |
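8491 This algorithm initializes a temporary mp\_int with the value of the single digit and uses algorithm mp\_add to add the two values together. The source listing that follows takes a more direct route: it adds the digit in place and propagates the carry, and hands the case of a negative $a$ with $\vert a \vert \ge b$ to mp\_sub\_d. | |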
8492 | |
8493 \vspace{+3mm}\begin{small} | |
8494 \hspace{-5.1mm}{\bf File}: bn\_mp\_add\_d.c | |
8495 \vspace{-3mm} | |
8496 \begin{alltt} | |
8497 016 | |
8498 017 /* single digit addition */ | |
8499 018 int | |
8500 019 mp_add_d (mp_int * a, mp_digit b, mp_int * c) | |
8501 020 \{ | |
8502 021 int res, ix, oldused; | |
8503 022 mp_digit *tmpa, *tmpc, mu; | |
8504 023 | |
8505 024 /* grow c as required */ | |
8506 025 if (c->alloc < a->used + 1) \{ | |
8507 026 if ((res = mp_grow(c, a->used + 1)) != MP_OKAY) \{ | |
8508 027 return res; | |
8509 028 \} | |
8510 029 \} | |
8511 030 | |
8512 031 /* if a is negative and |a| >= b, call c = |a| - b */ | |
8513 032 if (a->sign == MP_NEG && (a->used > 1 || a->dp[0] >= b)) \{ | |
8514 033 /* temporarily fix sign of a */ | |
8515 034 a->sign = MP_ZPOS; | |
8516 035 | |
8517 036 /* c = |a| - b */ | |
8518 037 res = mp_sub_d(a, b, c); | |
8519 038 | |
8520 039 /* fix sign */ | |
8521 040 a->sign = c->sign = MP_NEG; | |
8522 041 | |
8523 042 return res; | |
8524 043 \} | |
8525 044 | |
8526 045 /* old number of used digits in c */ | |
8527 046 oldused = c->used; | |
8528 047 | |
8529 048 /* sign always positive */ | |
8530 049 c->sign = MP_ZPOS; | |
8531 050 | |
8532 051 /* source alias */ | |
8533 052 tmpa = a->dp; | |
8534 053 | |
8535 054 /* destination alias */ | |
8536 055 tmpc = c->dp; | |
8537 056 | |
8538 057 /* if a is positive */ | |
8539 058 if (a->sign == MP_ZPOS) \{ | |
8540 059 /* add digit, after this we're propagating | |
8541 060 * the carry. | |
8542 061 */ | |
8543 062 *tmpc = *tmpa++ + b; | |
8544 063 mu = *tmpc >> DIGIT_BIT; | |
8545 064 *tmpc++ &= MP_MASK; | |
8546 065 | |
8547 066 /* now handle rest of the digits */ | |
8548 067 for (ix = 1; ix < a->used; ix++) \{ | |
8549 068 *tmpc = *tmpa++ + mu; | |
8550 069 mu = *tmpc >> DIGIT_BIT; | |
8551 070 *tmpc++ &= MP_MASK; | |
8552 071 \} | |
8553 072 /* set final carry */ | |
8554 073 ix++; | |
8555 074 *tmpc++ = mu; | |
8556 075 | |
8557 076 /* setup size */ | |
8558 077 c->used = a->used + 1; | |
8559 078 \} else \{ | |
8560 079 /* a was negative and |a| < b */ | |
8561 080 c->used = 1; | |
8562 081 | |
8563 082 /* the result is a single digit */ | |
8564 083 if (a->used == 1) \{ | |
8565 084 *tmpc++ = b - a->dp[0]; | |
8566 085 \} else \{ | |
8567 086 *tmpc++ = b; | |
8568 087 \} | |
8569 088 | |
8570 089 /* setup count so the clearing of oldused | |
8571 090 * can fall through correctly | |
8572 091 */ | |
8573 092 ix = 1; | |
8574 093 \} | |
8575 094 | |
8576 095 /* now zero to oldused */ | |
8577 096 while (ix++ < oldused) \{ | |
8578 097 *tmpc++ = 0; | |
8579 098 \} | |
8580 099 mp_clamp(c); | |
8581 100 | |
8582 101 return MP_OKAY; | |
8583 102 \} | |
8584 103 | |
8585 \end{alltt} | |
8586 \end{small} | |
8587 | |
8588 Clever use of the letter 't'. | |
8589 | |
8590 \subsubsection{Subtraction} | |
8591 The single digit subtraction algorithm mp\_sub\_d is essentially the same except it uses mp\_sub to subtract the digit from the mp\_int. | |
8592 | |
8593 \subsection{Single Digit Multiplication} | |
8594 Single digit multiplication arises enough in division and radix conversion that it ought to be implemented as a special case of the baseline | |
8595 multiplication algorithm. Essentially this algorithm is a modified version of algorithm s\_mp\_mul\_digs where one of the multiplicands | |
8596 only has one digit. | |
8597 | |
8598 \begin{figure}[!here] | |
8599 \begin{small} | |
8600 \begin{center} | |
8601 \begin{tabular}{l} | |
8602 \hline Algorithm \textbf{mp\_mul\_d}. \\ | |
8603 \textbf{Input}. mp\_int $a$ and a mp\_digit $b$ \\ | |
8604 \textbf{Output}. $c = ab$ \\ | |
8605 \hline \\ | |
8606 1. $pa \leftarrow a.used$ \\ | |
8607 2. Grow $c$ to at least $pa + 1$ digits. \\ | |
8608 3. $oldused \leftarrow c.used$ \\ | |
8609 4. $c.used \leftarrow pa + 1$ \\ | |
8610 5. $c.sign \leftarrow a.sign$ \\ | |
8611 6. $\mu \leftarrow 0$ \\ | |
8612 7. for $ix$ from $0$ to $pa - 1$ do \\ | |
8613 \hspace{3mm}7.1 $\hat r \leftarrow \mu + a_{ix}b$ \\ | |
8614 \hspace{3mm}7.2 $c_{ix} \leftarrow \hat r \mbox{ (mod }\beta\mbox{)}$ \\ | |
8615 \hspace{3mm}7.3 $\mu \leftarrow \lfloor \hat r / \beta \rfloor$ \\ | |
8616 8. $c_{pa} \leftarrow \mu$ \\ | |
8617 9. for $ix$ from $pa + 1$ to $oldused$ do \\ | |
8618 \hspace{3mm}9.1 $c_{ix} \leftarrow 0$ \\ | |
8619 10. Clamp excess digits of $c$. \\ | |
8620 11. Return(\textit{MP\_OKAY}). \\ | |
8621 \hline | |
8622 \end{tabular} | |
8623 \end{center} | |
8624 \end{small} | |
8625 \caption{Algorithm mp\_mul\_d} | |
8626 \end{figure} | |
8627 \textbf{Algorithm mp\_mul\_d.} | |
8628 This algorithm quickly multiplies an mp\_int by a small single digit value. It is specially tailored to the job and has minimal overhead. | |
8629 Unlike the full multiplication algorithms this algorithm does not require any significant temporary storage or memory allocations. | |
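The carry loop of steps 6 through 8 can be sketched in isolation. This is a minimal illustration and not the library routine: the names toy\_mul\_d, TOY\_DIGIT\_BIT and TOY\_MASK are hypothetical, and 28-bit digits stored in 32-bit words are assumed.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical toy parameters for this sketch, not the library's types. */
#define TOY_DIGIT_BIT 28
#define TOY_MASK ((((uint32_t)1) << TOY_DIGIT_BIT) - 1)

/* Multiply the n-digit little-endian number a[] by the single digit b.
 * Writes n + 1 digits to c[].  c may be the same array as a because each
 * source digit is read before the matching destination digit is written. */
static void toy_mul_d(const uint32_t *a, size_t n, uint32_t b, uint32_t *c)
{
    uint32_t mu = 0;                               /* carry */
    for (size_t ix = 0; ix < n; ix++) {
        /* double-width product plus incoming carry */
        uint64_t r = (uint64_t)mu + (uint64_t)a[ix] * (uint64_t)b;
        c[ix] = (uint32_t)(r & TOY_MASK);          /* r mod beta       */
        mu = (uint32_t)(r >> TOY_DIGIT_BIT);       /* floor(r / beta)  */
    }
    c[n] = mu;                                     /* final carry      */
}
```

The sketch omits growing, zeroing of old digits and clamping, which the real routine must still perform.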
8630 | |
8631 \vspace{+3mm}\begin{small} | |
8632 \hspace{-5.1mm}{\bf File}: bn\_mp\_mul\_d.c | |
8633 \vspace{-3mm} | |
8634 \begin{alltt} | |
8635 016 | |
8636 017 /* multiply by a digit */ | |
8637 018 int | |
8638 019 mp_mul_d (mp_int * a, mp_digit b, mp_int * c) | |
8639 020 \{ | |
8640 021 mp_digit u, *tmpa, *tmpc; | |
8641 022 mp_word r; | |
8642 023 int ix, res, olduse; | |
8643 024 | |
8644 025 /* make sure c is big enough to hold a*b */ | |
8645 026 if (c->alloc < a->used + 1) \{ | |
8646 027 if ((res = mp_grow (c, a->used + 1)) != MP_OKAY) \{ | |
8647 028 return res; | |
8648 029 \} | |
8649 030 \} | |
8650 031 | |
8651 032 /* get the original destinations used count */ | |
8652 033 olduse = c->used; | |
8653 034 | |
8654 035 /* set the sign */ | |
8655 036 c->sign = a->sign; | |
8656 037 | |
8657 038 /* alias for a->dp [source] */ | |
8658 039 tmpa = a->dp; | |
8659 040 | |
8660 041 /* alias for c->dp [dest] */ | |
8661 042 tmpc = c->dp; | |
8662 043 | |
8663 044 /* zero carry */ | |
8664 045 u = 0; | |
8665 046 | |
8666 047 /* compute columns */ | |
8667 048 for (ix = 0; ix < a->used; ix++) \{ | |
8668 049 /* compute product and carry sum for this term */ | |
8669 050 r = ((mp_word) u) + ((mp_word)*tmpa++) * ((mp_word)b); | |
8670 051 | |
8671 052 /* mask off higher bits to get a single digit */ | |
8672 053 *tmpc++ = (mp_digit) (r & ((mp_word) MP_MASK)); | |
8673 054 | |
8674 055 /* send carry into next iteration */ | |
8675 056 u = (mp_digit) (r >> ((mp_word) DIGIT_BIT)); | |
8676 057 \} | |
8677 058 | |
8678 059 /* store final carry [if any] */ | |
8679 060 *tmpc++ = u; | |
8680 061 | |
8681 062 /* now zero digits above the top */ | |
8682 063 while (ix++ < olduse) \{ | |
8683 064 *tmpc++ = 0; | |
8684 065 \} | |
8685 066 | |
8686 067 /* set used count */ | |
8687 068 c->used = a->used + 1; | |
8688 069 mp_clamp(c); | |
8689 070 | |
8690 071 return MP_OKAY; | |
8691 072 \} | |
8692 \end{alltt} | |
8693 \end{small} | |
8694 | |
8695 In this implementation the destination $c$ may point to the same mp\_int as the source $a$ since the result is written after the digit is | |
8696 read from the source. This function uses pointer aliases $tmpa$ and $tmpc$ for the digits of $a$ and $c$ respectively. | |
8697 | |
8698 \subsection{Single Digit Division} | |
8699 Like the single digit multiplication algorithm, single digit division is also a fairly common algorithm used in radix conversion. Since the | |
8700 divisor is only a single digit a specialized variant of the division algorithm can be used to compute the quotient. | |
8701 | |
8702 \newpage\begin{figure}[!here] | |
8703 \begin{small} | |
8704 \begin{center} | |
8705 \begin{tabular}{l} | |
8706 \hline Algorithm \textbf{mp\_div\_d}. \\ | |
8707 \textbf{Input}. mp\_int $a$ and a mp\_digit $b$ \\ | |
8708 \textbf{Output}. $c = \lfloor a / b \rfloor, d = a - cb$ \\ | |
8709 \hline \\ | |
8710 1. If $b = 0$ then return(\textit{MP\_VAL}).\\ | |
8711 2. If $b = 3$ then use algorithm mp\_div\_3 instead. \\ | |
8712 3. Init $q$ to $a.used$ digits. \\ | |
8713 4. $q.used \leftarrow a.used$ \\ | |
8714 5. $q.sign \leftarrow a.sign$ \\ | |
8715 6. $\hat w \leftarrow 0$ \\ | |
8716 7. for $ix$ from $a.used - 1$ down to $0$ do \\ | |
8717 \hspace{3mm}7.1 $\hat w \leftarrow \hat w \beta + a_{ix}$ \\ | |
8718 \hspace{3mm}7.2 If $\hat w \ge b$ then \\ | |
8719 \hspace{6mm}7.2.1 $t \leftarrow \lfloor \hat w / b \rfloor$ \\ | |
8720 \hspace{6mm}7.2.2 $\hat w \leftarrow \hat w \mbox{ (mod }b\mbox{)}$ \\ | |
8721 \hspace{3mm}7.3 else\\ | |
8722 \hspace{6mm}7.3.1 $t \leftarrow 0$ \\ | |
8723 \hspace{3mm}7.4 $q_{ix} \leftarrow t$ \\ | |
8724 8. $d \leftarrow \hat w$ \\ | |
8725 9. Clamp excess digits of $q$. \\ | |
8726 10. $c \leftarrow q$ \\ | |
8727 11. Return(\textit{MP\_OKAY}). \\ | |
8728 \hline | |
8729 \end{tabular} | |
8730 \end{center} | |
8731 \end{small} | |
8732 \caption{Algorithm mp\_div\_d} | |
8733 \end{figure} | |
8734 \textbf{Algorithm mp\_div\_d.} | |
8735 This algorithm divides the mp\_int $a$ by the single mp\_digit $b$ using an optimized approach. Essentially in every iteration of the | |
8736 algorithm another digit of the dividend is reduced and another digit of quotient produced. Provided $b < \beta$ the value of $\hat w$ | |
8737 after step 7.1 will be limited such that $0 \le \lfloor \hat w / b \rfloor < \beta$. | |
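The loop of step 7 can be sketched on a plain digit array. This is an illustration under stated assumptions, not the library code: toy\_div\_d and TOY\_DIGIT\_BIT are hypothetical names, digits are 28 bits wide, and $b$ is assumed to be non-zero and less than $\beta$.

```c
#include <stdint.h>
#include <stddef.h>

#define TOY_DIGIT_BIT 28   /* hypothetical digit size for this sketch */

/* Divide the n-digit little-endian number a[] by the non-zero digit b.
 * Stores n quotient digits in q[] and returns the remainder.
 * Before each step w < b, so after shifting in a digit w < b * beta
 * and every quotient digit fits in a single digit. */
static uint32_t toy_div_d(const uint32_t *a, size_t n, uint32_t b, uint32_t *q)
{
    uint64_t w = 0;
    for (size_t ix = n; ix-- > 0; ) {          /* most significant digit first */
        w = (w << TOY_DIGIT_BIT) | a[ix];      /* w = w * beta + a[ix]         */
        q[ix] = (uint32_t)(w / b);             /* next quotient digit          */
        w %= b;                                /* carry the remainder down     */
    }
    return (uint32_t)w;
}
```

The explicit test $\hat w \ge b$ of step 7.2 is dropped here since the division yields zero in that case anyway.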
8738 | |
8739 If the divisor $b$ is equal to three a variant of this algorithm is used which is called mp\_div\_3. It replaces the division by three with | |
8740 a multiplication by $\lfloor \beta / 3 \rfloor$ and the appropriate shift and residual fixup. In essence it is much like the Barrett reduction | |
8741 from chapter seven. | |
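The idea of replacing the division by three with a multiplication, a shift and a fixup can be sketched as follows. This is a hedged illustration of the technique as described above, not a copy of mp\_div\_3: the names toy\_div\_3 and TOY\_DIGIT\_BIT are hypothetical and 28-bit digits are assumed.

```c
#include <stdint.h>
#include <stddef.h>

#define TOY_DIGIT_BIT 28   /* hypothetical digit size for this sketch */

/* Divide the n-digit little-endian number a[] by three.
 * Stores n quotient digits in q[] and returns the remainder. */
static uint32_t toy_div_3(const uint32_t *a, size_t n, uint32_t *q)
{
    /* floor(beta / 3), the precomputed reciprocal */
    const uint64_t recip = (((uint64_t)1) << TOY_DIGIT_BIT) / 3;
    uint64_t w = 0;

    for (size_t ix = n; ix-- > 0; ) {
        w = (w << TOY_DIGIT_BIT) | a[ix];
        /* estimate of floor(w / 3); it is never too large */
        uint64_t t = (w * recip) >> TOY_DIGIT_BIT;
        w -= 3 * t;
        /* residual fixup: the estimate may be slightly too small */
        while (w >= 3) {
            t += 1;
            w -= 3;
        }
        q[ix] = (uint32_t)t;
    }
    return (uint32_t)w;
}
```

Because the remainder carried into each step is below three, the product of $\hat w$ and the reciprocal stays below $\beta^2$ and fits in a 64-bit word.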
8742 | |
8743 \vspace{+3mm}\begin{small} | |
8744 \hspace{-5.1mm}{\bf File}: bn\_mp\_div\_d.c | |
8745 \vspace{-3mm} | |
8746 \begin{alltt} | |
8747 016 | |
8748 017 static int s_is_power_of_two(mp_digit b, int *p) | |
8749 018 \{ | |
8750 019 int x; | |
8751 020 | |
8752 021 for (x = 1; x < DIGIT_BIT; x++) \{ | |
8753 022 if (b == (((mp_digit)1)<<x)) \{ | |
8754 023 *p = x; | |
8755 024 return 1; | |
8756 025 \} | |
8757 026 \} | |
8758 027 return 0; | |
8759 028 \} | |
8760 029 | |
8761 030 /* single digit division (based on routine from MPI) */ | |
8762 031 int mp_div_d (mp_int * a, mp_digit b, mp_int * c, mp_digit * d) | |
8763 032 \{ | |
8764 033 mp_int q; | |
8765 034 mp_word w; | |
8766 035 mp_digit t; | |
8767 036 int res, ix; | |
8768 037 | |
8769 038 /* cannot divide by zero */ | |
8770 039 if (b == 0) \{ | |
8771 040 return MP_VAL; | |
8772 041 \} | |
8773 042 | |
8774 043 /* quick outs */ | |
8775 044 if (b == 1 || mp_iszero(a) == 1) \{ | |
8776 045 if (d != NULL) \{ | |
8777 046 *d = 0; | |
8778 047 \} | |
8779 048 if (c != NULL) \{ | |
8780 049 return mp_copy(a, c); | |
8781 050 \} | |
8782 051 return MP_OKAY; | |
8783 052 \} | |
8784 053 | |
8785 054 /* power of two ? */ | |
8786 055 if (s_is_power_of_two(b, &ix) == 1) \{ | |
8787 056 if (d != NULL) \{ | |
8788 057 *d = a->dp[0] & ((1<<ix) - 1); | |
8789 058 \} | |
8790 059 if (c != NULL) \{ | |
8791 060 return mp_div_2d(a, ix, c, NULL); | |
8792 061 \} | |
8793 062 return MP_OKAY; | |
8794 063 \} | |
8795 064 | |
8796 065 /* three? */ | |
8797 066 if (b == 3) \{ | |
8798 067 return mp_div_3(a, c, d); | |
8799 068 \} | |
8800 069 | |
8801 070 /* no easy answer [c'est la vie]. Just division */ | |
8802 071 if ((res = mp_init_size(&q, a->used)) != MP_OKAY) \{ | |
8803 072 return res; | |
8804 073 \} | |
8805 074 | |
8806 075 q.used = a->used; | |
8807 076 q.sign = a->sign; | |
8808 077 w = 0; | |
8809 078 for (ix = a->used - 1; ix >= 0; ix--) \{ | |
8810 079 w = (w << ((mp_word)DIGIT_BIT)) | ((mp_word)a->dp[ix]); | |
8811 080 | |
8812 081 if (w >= b) \{ | |
8813 082 t = (mp_digit)(w / b); | |
8814 083 w -= ((mp_word)t) * ((mp_word)b); | |
8815 084 \} else \{ | |
8816 085 t = 0; | |
8817 086 \} | |
8818 087 q.dp[ix] = (mp_digit)t; | |
8819 088 \} | |
8820 089 | |
8821 090 if (d != NULL) \{ | |
8822 091 *d = (mp_digit)w; | |
8823 092 \} | |
8824 093 | |
8825 094 if (c != NULL) \{ | |
8826 095 mp_clamp(&q); | |
8827 096 mp_exch(&q, c); | |
8828 097 \} | |
8829 098 mp_clear(&q); | |
8830 099 | |
8831 100 return res; | |
8832 101 \} | |
8833 102 | |
8834 \end{alltt} | |
8835 \end{small} | |
8836 | |
8837 Like the implementation of algorithm mp\_div this algorithm allows either of the quotient or remainder to be passed as a \textbf{NULL} pointer to | |
8838 indicate the respective value is not required. This allows a trivial single digit modular reduction algorithm, mp\_mod\_d to be created. | |
8839 | |
8840 The division on line 82 and the remainder computation on line 83 can often be replaced by a single division on most processors. For example, the 32-bit x86 based | |
8841 processors can divide a 64-bit quantity by a 32-bit quantity and produce the quotient and remainder simultaneously. Unfortunately the GCC | |
8842 compiler does not recognize that optimization and will actually produce two function calls to find the quotient and remainder respectively. | |
8843 | |
8844 \subsection{Single Digit Root Extraction} | |
8845 | |
8846 Finding the $n$'th root of an integer is fairly easy as far as numerical analysis is concerned. Algorithms such as the Newton-Raphson approximation | |
8847 (\ref{eqn:newton}) will converge very quickly to a root of a differentiable function $f(x)$ provided the starting estimate is reasonable. | |
8848 | |
8849 \begin{equation} | |
8850 x_{i+1} = x_i - {f(x_i) \over f'(x_i)} | |
8851 \label{eqn:newton} | |
8852 \end{equation} | |
8853 | |
8854 In this case the $n$'th root is desired and $f(x) = x^n - a$ where $a$ is the integer of which the root is desired. The derivative of $f(x)$ is | |
8855 simply $f'(x) = nx^{n - 1}$. Of particular importance is that this algorithm will be used over the integers and not over a continuous domain | |
8856 such as the real numbers. As a result the root found can be above the true root by a few and must be manually adjusted. Ideally at the end of the | |
8857 algorithm the $n$'th root $b$ of an integer $a$ is desired such that $b^n \le a$. | |
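
Substituting $f(x) = x^n - a$ and $f'(x) = nx^{n - 1}$ into equation \ref{eqn:newton} gives the update that is evaluated in each iteration. It is written out here only for reference.

\begin{equation}
x_{i+1} = x_i - {x_i^n - a \over nx_i^{n - 1}}
\end{equation}

Over the integers the quotient is truncated, which is why the final estimate may have to be adjusted downwards.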
8858 | |
8859 \newpage\begin{figure}[!here] | |
8860 \begin{small} | |
8861 \begin{center} | |
8862 \begin{tabular}{l} | |
8863 \hline Algorithm \textbf{mp\_n\_root}. \\ | |
8864 \textbf{Input}. mp\_int $a$ and a mp\_digit $b$ \\ | |
8865 \textbf{Output}. $c^b \le a$ \\ | |
8866 \hline \\ | |
8867 1. If $b$ is even and $a.sign = MP\_NEG$ return(\textit{MP\_VAL}). \\ | |
8868 2. $sign \leftarrow a.sign$ \\ | |
8869 3. $a.sign \leftarrow MP\_ZPOS$ \\ | |
8870 4. t$2 \leftarrow 2$ \\ | |
8871 5. Loop \\ | |
8872 \hspace{3mm}5.1 t$1 \leftarrow $ t$2$ \\ | |
8873 \hspace{3mm}5.2 t$3 \leftarrow $ t$1^{b - 1}$ \\ | |
8874 \hspace{3mm}5.3 t$2 \leftarrow $ t$3 $ $\cdot$ t$1$ \\ | |
8875 \hspace{3mm}5.4 t$2 \leftarrow $ t$2 - a$ \\ | |
8876 \hspace{3mm}5.5 t$3 \leftarrow $ t$3 \cdot b$ \\ | |
8877 \hspace{3mm}5.6 t$3 \leftarrow \lfloor $t$2 / $t$3 \rfloor$ \\ | |
8878 \hspace{3mm}5.7 t$2 \leftarrow $ t$1 - $ t$3$ \\ | |
8879 \hspace{3mm}5.8 If t$1 \ne $ t$2$ then goto step 5. \\ | |
8880 6. Loop \\ | |
8881 \hspace{3mm}6.1 t$2 \leftarrow $ t$1^b$ \\ | |
8882 \hspace{3mm}6.2 If t$2 > a$ then \\ | |
8883 \hspace{6mm}6.2.1 t$1 \leftarrow $ t$1 - 1$ \\ | |
8884 \hspace{6mm}6.2.2 Goto step 6. \\ | |
8885 7. $a.sign \leftarrow sign$ \\ | |
8886 8. $c \leftarrow $ t$1$ \\ | |
8887 9. $c.sign \leftarrow sign$ \\ | |
8888 10. Return(\textit{MP\_OKAY}). \\ | |
8889 \hline | |
8890 \end{tabular} | |
8891 \end{center} | |
8892 \end{small} | |
8893 \caption{Algorithm mp\_n\_root} | |
8894 \end{figure} | |
8895 \textbf{Algorithm mp\_n\_root.} | |
8896 This algorithm finds the integer $n$'th root of an input using the Newton-Raphson approach. It is partially optimized based on the observation | |
8897 that the numerator of ${f(x) \over f'(x)}$ can be derived from a partial denominator. That is at first the denominator is calculated by finding | |
8898 $x^{b - 1}$. This value can then be multiplied by $x$ and have $a$ subtracted from it to find the numerator. This saves a total of $b - 1$ | |
8899 multiplications by t$1$ inside the loop. | |
8900 | |
8901 The initial value of the approximation is t$2 = 2$ which allows the algorithm to start with very small values and quickly converge on the | |
8902 root. Ideally this algorithm is meant to find the $n$'th root of an input where $n$ is bounded by $2 \le n \le 5$. | |
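
As a small worked illustration, consider $a = 24$ and $b = 2$. These values are chosen here for the example and do not appear in the original text. Each row shows one pass through the loop of step 5.

\begin{center}
\begin{tabular}{|c|c|c|c|c|}
\hline \textbf{t$1$} & \textbf{t$1^b - a$} & \textbf{$b \cdot $ t$1^{b - 1}$} & \textbf{Quotient} & \textbf{New t$2$} \\
\hline $2$ & $-20$ & $4$ & $-5$ & $7$ \\
\hline $7$ & $25$ & $14$ & $1$ & $6$ \\
\hline $6$ & $12$ & $12$ & $1$ & $5$ \\
\hline $5$ & $1$ & $10$ & $0$ & $5$ \\
\hline
\end{tabular}
\end{center}

The loop exits once t$1 = $ t$2 = 5$. In step 6 the check finds $5^2 = 25 > 24$, so t$1$ is decremented to $4$, and since $4^2 = 16 \le 24$ the result is $c = 4$.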
8903 | |
8904 \vspace{+3mm}\begin{small} | |
8905 \hspace{-5.1mm}{\bf File}: bn\_mp\_n\_root.c | |
8906 \vspace{-3mm} | |
8907 \begin{alltt} | |
8908 016 | |
8909 017 /* find the n'th root of an integer | |
8910 018 * | |
8911 019 * Result found such that (c)**b <= a and (c+1)**b > a | |
8912 020 * | |
8913 021 * This algorithm uses Newton's approximation | |
8914 022 * x[i+1] = x[i] - f(x[i])/f'(x[i]) | |
8915 023 * which will find the root in log(N) time where | |
8916 024 * each step involves a fair bit. This is not meant to | |
8917 025 * find huge roots [square and cube, etc]. | |
8918 026 */ | |
8919 027 int mp_n_root (mp_int * a, mp_digit b, mp_int * c) | |
8920 028 \{ | |
8921 029 mp_int t1, t2, t3; | |
8922 030 int res, neg; | |
8923 031 | |
8924 032 /* input must be positive if b is even */ | |
8925 033 if ((b & 1) == 0 && a->sign == MP_NEG) \{ | |
8926 034 return MP_VAL; | |
8927 035 \} | |
8928 036 | |
8929 037 if ((res = mp_init (&t1)) != MP_OKAY) \{ | |
8930 038 return res; | |
8931 039 \} | |
8932 040 | |
8933 041 if ((res = mp_init (&t2)) != MP_OKAY) \{ | |
8934 042 goto __T1; | |
8935 043 \} | |
8936 044 | |
8937 045 if ((res = mp_init (&t3)) != MP_OKAY) \{ | |
8938 046 goto __T2; | |
8939 047 \} | |
8940 048 | |
8941 049 /* if a is negative fudge the sign but keep track */ | |
8942 050 neg = a->sign; | |
8943 051 a->sign = MP_ZPOS; | |
8944 052 | |
8945 053 /* t2 = 2 */ | |
8946 054 mp_set (&t2, 2); | |
8947 055 | |
8948 056 do \{ | |
8949 057 /* t1 = t2 */ | |
8950 058 if ((res = mp_copy (&t2, &t1)) != MP_OKAY) \{ | |
8951 059 goto __T3; | |
8952 060 \} | |
8953 061 | |
8954 062 /* t2 = t1 - ((t1**b - a) / (b * t1**(b-1))) */ | |
8955 063 | |
8956 064 /* t3 = t1**(b-1) */ | |
8957 065 if ((res = mp_expt_d (&t1, b - 1, &t3)) != MP_OKAY) \{ | |
8958 066 goto __T3; | |
8959 067 \} | |
8960 068 | |
8961 069 /* numerator */ | |
8962 070 /* t2 = t1**b */ | |
8963 071 if ((res = mp_mul (&t3, &t1, &t2)) != MP_OKAY) \{ | |
8964 072 goto __T3; | |
8965 073 \} | |
8966 074 | |
8967 075 /* t2 = t1**b - a */ | |
8968 076 if ((res = mp_sub (&t2, a, &t2)) != MP_OKAY) \{ | |
8969 077 goto __T3; | |
8970 078 \} | |
8971 079 | |
8972 080 /* denominator */ | |
8973 081 /* t3 = t1**(b-1) * b */ | |
8974 082 if ((res = mp_mul_d (&t3, b, &t3)) != MP_OKAY) \{ | |
8975 083 goto __T3; | |
8976 084 \} | |
8977 085 | |
8978 086 /* t3 = (t1**b - a)/(b * t1**(b-1)) */ | |
8979 087 if ((res = mp_div (&t2, &t3, &t3, NULL)) != MP_OKAY) \{ | |
8980 088 goto __T3; | |
8981 089 \} | |
8982 090 | |
8983 091 if ((res = mp_sub (&t1, &t3, &t2)) != MP_OKAY) \{ | |
8984 092 goto __T3; | |
8985 093 \} | |
8986 094 \} while (mp_cmp (&t1, &t2) != MP_EQ); | |
8987 095 | |
8988 096 /* result can be off by a few so check */ | |
8989 097 for (;;) \{ | |
8990 098 if ((res = mp_expt_d (&t1, b, &t2)) != MP_OKAY) \{ | |
8991 099 goto __T3; | |
8992 100 \} | |
8993 101 | |
8994 102 if (mp_cmp (&t2, a) == MP_GT) \{ | |
8995 103 if ((res = mp_sub_d (&t1, 1, &t1)) != MP_OKAY) \{ | |
8996 104 goto __T3; | |
8997 105 \} | |
8998 106 \} else \{ | |
8999 107 break; | |
9000 108 \} | |
9001 109 \} | |
9002 110 | |
9003 111 /* reset the sign of a first */ | |
9004 112 a->sign = neg; | |
9005 113 | |
9006 114 /* set the result */ | |
9007 115 mp_exch (&t1, c); | |
9008 116 | |
9009 117 /* set the sign of the result */ | |
9010 118 c->sign = neg; | |
9011 119 | |
9012 120 res = MP_OKAY; | |
9013 121 | |
9014 122 __T3:mp_clear (&t3); | |
9015 123 __T2:mp_clear (&t2); | |
9016 124 __T1:mp_clear (&t1); | |
9017 125 return res; | |
9018 126 \} | |
9019 \end{alltt} | |
9020 \end{small} | |
9021 | |
9022 \section{Random Number Generation} | |
9023 | |
9024 Random numbers come up in a variety of activities from public key cryptography to simple simulations and various randomized algorithms. Pollard-Rho | |
9025 factoring for example, can make use of random values as starting points to find factors of a composite integer. In this case the algorithm presented | |
9026 is solely for simulations and not intended for cryptographic use. | |
9027 | |
9028 \newpage\begin{figure}[!here] | |
9029 \begin{small} | |
9030 \begin{center} | |
9031 \begin{tabular}{l} | |
9032 \hline Algorithm \textbf{mp\_rand}. \\ | |
9033 \textbf{Input}. An integer $b$ \\ | |
9034 \textbf{Output}. A pseudo-random number of $b$ digits \\ | |
9035 \hline \\ | |
9036 1. $a \leftarrow 0$ \\ | |
9037 2. If $b \le 0$ return(\textit{MP\_OKAY}) \\ | |
9038 3. Pick a non-zero random digit $d$. \\ | |
9039 4. $a \leftarrow a + d$ \\ | |
9040 5. for $ix$ from 1 to $b - 1$ do \\ | |
9041 \hspace{3mm}5.1 $a \leftarrow a \cdot \beta$ \\ | |
9042 \hspace{3mm}5.2 Pick a random digit $d$. \\ | |
9043 \hspace{3mm}5.3 $a \leftarrow a + d$ \\ | |
9044 6. Return(\textit{MP\_OKAY}). \\ | |
9045 \hline | |
9046 \end{tabular} | |
9047 \end{center} | |
9048 \end{small} | |
9049 \caption{Algorithm mp\_rand} | |
9050 \end{figure} | |
9051 \textbf{Algorithm mp\_rand.} | |
9052 This algorithm produces a pseudo-random integer of $b$ digits. By ensuring that the first digit is non-zero the algorithm also guarantees that the | |
9053 final result has at least $b$ digits. It relies heavily on a third-party random number generator which should ideally generate uniformly all of | |
9054 the integers from $0$ to $\beta - 1$. | |
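The structure of the algorithm can be sketched on a plain digit array. This is a rough illustration and not the library routine: toy\_rand, TOY\_DIGIT\_BIT and TOY\_MASK are hypothetical names, 28-bit digits are assumed, and the digits are stored directly instead of being shifted in. As with the algorithm above it is not suitable for cryptographic use, and on platforms where RAND\_MAX is small the digits will not cover the full range.

```c
#include <stdint.h>
#include <stdlib.h>
#include <stddef.h>

/* Hypothetical toy parameters for this sketch. */
#define TOY_DIGIT_BIT 28
#define TOY_MASK ((((uint32_t)1) << TOY_DIGIT_BIT) - 1)

/* Fill a[0..b-1] (little-endian) with pseudo-random digits such that
 * the most significant digit a[b-1] is non-zero. */
static void toy_rand(uint32_t *a, size_t b)
{
    uint32_t d;

    if (b == 0) {
        return;
    }

    /* pick a non-zero leading digit */
    do {
        d = (uint32_t)rand() & TOY_MASK;
    } while (d == 0);
    a[b - 1] = d;

    /* the remaining b - 1 digits may take any value */
    for (size_t ix = b - 1; ix-- > 0; ) {
        a[ix] = (uint32_t)rand() & TOY_MASK;
    }
}
```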
9055 | |
9056 \vspace{+3mm}\begin{small} | |
9057 \hspace{-5.1mm}{\bf File}: bn\_mp\_rand.c | |
9058 \vspace{-3mm} | |
9059 \begin{alltt} | |
9060 016 | |
9061 017 /* makes a pseudo-random int of a given size */ | |
9062 018 int | |
9063 019 mp_rand (mp_int * a, int digits) | |
9064 020 \{ | |
9065 021 int res; | |
9066 022 mp_digit d; | |
9067 023 | |
9068 024 mp_zero (a); | |
9069 025 if (digits <= 0) \{ | |
9070 026 return MP_OKAY; | |
9071 027 \} | |
9072 028 | |
9073 029 /* first place a random non-zero digit */ | |
9074 030 do \{ | |
9075 031 d = ((mp_digit) abs (rand ())); | |
9076 032 \} while (d == 0); | |
9077 033 | |
9078 034 if ((res = mp_add_d (a, d, a)) != MP_OKAY) \{ | |
9079 035 return res; | |
9080 036 \} | |
9081 037 | |
9082 038 while (--digits > 0) \{ | |
9083 039 if ((res = mp_lshd (a, 1)) != MP_OKAY) \{ | |
9084 040 return res; | |
9085 041 \} | |
9086 042 | |
9087 043 if ((res = mp_add_d (a, ((mp_digit) abs (rand ())), a)) != MP_OKAY) \{ | |
9088 044 return res; | |
9089 045 \} | |
9090 046 \} | |
9091 047 | |
9092 048 return MP_OKAY; | |
9093 049 \} | |
9094 \end{alltt} | |
9095 \end{small} | |
9096 | |
9097 \section{Formatted Representations} | |
9098 The ability to emit a radix-$n$ textual representation of an integer is useful for interacting with human parties. For example, the ability to | |
9099 be given a string of characters such as ``114585'' and turn it into the radix-$\beta$ equivalent would make it easier to enter numbers | |
9100 into a program. | |
9101 | |
9102 \subsection{Reading Radix-n Input} | |
9103 For the purposes of this text we will assume that a simple lower ASCII map (\ref{fig:ASC}) is used to map the values from $0$ to $63$ to | |
9104 printable characters. For example, when the character ``N'' is read it represents the integer $23$. The first $16$ characters of the | |
9105 map are for the common representations up to hexadecimal. After that the map uses the remaining characters of the ``base64'' alphabet (though not in base64 order), which are suitably chosen | |
9106 such that they are printable. While outputting in radix 64 may not be too helpful for human operators it does allow communication via non-binary | |
9107 media. | |
9108 | |
9109 \newpage\begin{figure}[here] | |
9110 \begin{center} | |
9111 \begin{tabular}{cc|cc|cc|cc} | |
9112 \hline \textbf{Value} & \textbf{Char} & \textbf{Value} & \textbf{Char} & \textbf{Value} & \textbf{Char} & \textbf{Value} & \textbf{Char} \\ | |
9113 \hline | |
9114 0 & 0 & 1 & 1 & 2 & 2 & 3 & 3 \\ | |
9115 4 & 4 & 5 & 5 & 6 & 6 & 7 & 7 \\ | |
9116 8 & 8 & 9 & 9 & 10 & A & 11 & B \\ | |
9117 12 & C & 13 & D & 14 & E & 15 & F \\ | |
9118 16 & G & 17 & H & 18 & I & 19 & J \\ | |
9119 20 & K & 21 & L & 22 & M & 23 & N \\ | |
9120 24 & O & 25 & P & 26 & Q & 27 & R \\ | |
9121 28 & S & 29 & T & 30 & U & 31 & V \\ | |
9122 32 & W & 33 & X & 34 & Y & 35 & Z \\ | |
9123 36 & a & 37 & b & 38 & c & 39 & d \\ | |
9124 40 & e & 41 & f & 42 & g & 43 & h \\ | |
9125 44 & i & 45 & j & 46 & k & 47 & l \\ | |
9126 48 & m & 49 & n & 50 & o & 51 & p \\ | |
9127 52 & q & 53 & r & 54 & s & 55 & t \\ | |
9128 56 & u & 57 & v & 58 & w & 59 & x \\ | |
9129 60 & y & 61 & z & 62 & $+$ & 63 & $/$ \\ | |
9130 \hline | |
9131 \end{tabular} | |
9132 \end{center} | |
9133 \caption{Lower ASCII Map} | |
9134 \label{fig:ASC} | |
9135 \end{figure} | |
9136 | |
9137 \newpage\begin{figure}[!here] | |
9138 \begin{small} | |
9139 \begin{center} | |
9140 \begin{tabular}{l} | |
9141 \hline Algorithm \textbf{mp\_read\_radix}. \\ | |
9142 \textbf{Input}. A string $str$ of length $sn$ and radix $r$. \\ | |
9143 \textbf{Output}. The radix-$\beta$ equivalent mp\_int. \\ | |
9144 \hline \\ | |
9145 1. If $r < 2$ or $r > 64$ return(\textit{MP\_VAL}). \\ | |
9146 2. $ix \leftarrow 0$ \\ | |
9147 3. If $str_0 =$ ``-'' then do \\ | |
9148 \hspace{3mm}3.1 $ix \leftarrow ix + 1$ \\ | |
9149 \hspace{3mm}3.2 $sign \leftarrow MP\_NEG$ \\ | |
9150 4. else \\ | |
9151 \hspace{3mm}4.1 $sign \leftarrow MP\_ZPOS$ \\ | |
9152 5. $a \leftarrow 0$ \\ | |
9153 6. for $iy$ from $ix$ to $sn - 1$ do \\ | |
9154 \hspace{3mm}6.1 Let $y$ denote the position in the map of $str_{iy}$. \\ | |
9155 \hspace{3mm}6.2 If $str_{iy}$ is not in the map or $y \ge r$ then goto step 7. \\ | |
9156 \hspace{3mm}6.3 $a \leftarrow a \cdot r$ \\ | |
9157 \hspace{3mm}6.4 $a \leftarrow a + y$ \\ | |
9158 7. If $a \ne 0$ then $a.sign \leftarrow sign$ \\ | |
9159 8. Return(\textit{MP\_OKAY}). \\ | |
9160 \hline | |
9161 \end{tabular} | |
9162 \end{center} | |
9163 \end{small} | |
9164 \caption{Algorithm mp\_read\_radix} | |
9165 \end{figure} | |
9166 \textbf{Algorithm mp\_read\_radix.} | |
9167 This algorithm will read an ASCII string and produce the radix-$\beta$ mp\_int representation of the same integer. A minus symbol ``-'' may precede the | |
9168 string to indicate the value is negative, otherwise it is assumed to be positive. The algorithm will read up to $sn$ characters from the input | |
9169 and will stop as soon as it reads a character that is not in the map or whose value is not less than the radix. This allows numbers to be embedded | |
9170 as part of larger input without any significant problem. | |
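The multiply and add loop of step 6 can be sketched with a machine word in place of an mp\_int. This is a minimal illustration under stated assumptions: toy\_read\_radix and toy\_rmap are hypothetical names, the map is the one from figure \ref{fig:ASC}, and overflow of the 64-bit result is not checked.

```c
#include <stdint.h>
#include <string.h>
#include <ctype.h>

/* The 64-entry map of values to characters used in this text. */
static const char toy_rmap[] =
    "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz+/";

/* Read a number in the given radix from str into *out.
 * Returns 0 on success and -1 if the radix is out of range.
 * Reading stops at the first character that is not a valid digit. */
static int toy_read_radix(const char *str, int radix, int64_t *out)
{
    int neg = 0;
    uint64_t a = 0;

    if (radix < 2 || radix > 64) {
        return -1;
    }
    if (*str == '-') {
        neg = 1;
        ++str;
    }

    for (; *str != '\0'; ++str) {
        /* below radix 36 the conversion is case insensitive */
        char ch = (radix < 36) ? (char)toupper((unsigned char)*str) : *str;
        const char *p = strchr(toy_rmap, ch);

        /* not in the map, or not a digit of this radix: stop reading */
        if (p == NULL || (p - toy_rmap) >= radix) {
            break;
        }
        a = a * (uint64_t)radix + (uint64_t)(p - toy_rmap);
    }

    *out = neg ? -(int64_t)a : (int64_t)a;
    return 0;
}
```

For instance the string ``12xyz'' read in radix 10 yields $12$, since reading stops at the first character that is not a decimal digit.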
9171 | |
9172 \vspace{+3mm}\begin{small} | |
9173 \hspace{-5.1mm}{\bf File}: bn\_mp\_read\_radix.c | |
9174 \vspace{-3mm} | |
9175 \begin{alltt} | |
9176 016 | |
9177 017 /* read a string [ASCII] in a given radix */ | |
9178 018 int mp_read_radix (mp_int * a, char *str, int radix) | |
9179 019 \{ | |
9180 020 int y, res, neg; | |
9181 021 char ch; | |
9182 022 | |
9183 023 /* make sure the radix is ok */ | |
9184 024 if (radix < 2 || radix > 64) \{ | |
9185 025 return MP_VAL; | |
9186 026 \} | |
9187 027 | |
9188 028 /* if the leading digit is a | |
9189 029 * minus set the sign to negative. | |
9190 030 */ | |
9191 031 if (*str == '-') \{ | |
9192 032 ++str; | |
9193 033 neg = MP_NEG; | |
9194 034 \} else \{ | |
9195 035 neg = MP_ZPOS; | |
9196 036 \} | |
9197 037 | |
9198 038 /* set the integer to the default of zero */ | |
9199 039 mp_zero (a); | |
9200 040 | |
9201 041 /* process each digit of the string */ | |
9202 042 while (*str) \{ | |
9203 043 /* if the radix < 36 the conversion is case insensitive | |
9204 044 * this allows numbers like 1AB and 1ab to represent the same value | |
9205 045 * [e.g. in hex] | |
9206 046 */ | |
9207 047 ch = (char) ((radix < 36) ? toupper (*str) : *str); | |
9208 048 for (y = 0; y < 64; y++) \{ | |
9209 049 if (ch == mp_s_rmap[y]) \{ | |
9210 050 break; | |
9211 051 \} | |
9212 052 \} | |
9213 053 | |
9214 054 /* if the char was found in the map | |
9215 055 * and is less than the given radix add it | |
9216 056 * to the number, otherwise exit the loop. | |
9217 057 */ | |
9218 058 if (y < radix) \{ | |
9219 059 if ((res = mp_mul_d (a, (mp_digit) radix, a)) != MP_OKAY) \{ | |
9220 060 return res; | |
9221 061 \} | |
9222 062 if ((res = mp_add_d (a, (mp_digit) y, a)) != MP_OKAY) \{ | |
9223 063 return res; | |
9224 064 \} | |
9225 065 \} else \{ | |
9226 066 break; | |
9227 067 \} | |
9228 068 ++str; | |
9229 069 \} | |
9230 070 | |
9231 071 /* set the sign only if a != 0 */ | |
9232 072 if (mp_iszero(a) != 1) \{ | |
9233 073 a->sign = neg; | |
9234 074 \} | |
9235 075 return MP_OKAY; | |
9236 076 \} | |
9237 \end{alltt} | |
9238 \end{small} | |
9239 | |
9240 \subsection{Generating Radix-$n$ Output} | |
9241 Generating radix-$n$ output is fairly trivial with a division and remainder algorithm. | |
9242 | |
9243 \newpage\begin{figure}[!here] | |
9244 \begin{small} | |
9245 \begin{center} | |
9246 \begin{tabular}{l} | |
9247 \hline Algorithm \textbf{mp\_toradix}. \\ | |
9248 \textbf{Input}. A mp\_int $a$ and an integer $r$\\ | |
9249 \textbf{Output}. The radix-$r$ representation of $a$ \\ | |
9250 \hline \\ | |
9251 1. If $r < 2$ or $r > 64$ return(\textit{MP\_VAL}). \\ | |
9252 2. If $a = 0$ then $str = $ ``$0$'' and return(\textit{MP\_OKAY}). \\ | |
9253 3. $t \leftarrow a$ \\ | |
9254 4. $str \leftarrow$ ``'' \\ | |
9255 5. if $t.sign = MP\_NEG$ then \\ | |
9256 \hspace{3mm}5.1 $str \leftarrow str + $ ``-'' \\ | |
9257 \hspace{3mm}5.2 $t.sign \leftarrow MP\_ZPOS$ \\ | |
9258 6. While ($t \ne 0$) do \\ | |
9259 \hspace{3mm}6.1 $d \leftarrow t \mbox{ (mod }r\mbox{)}$ \\ | |
9260 \hspace{3mm}6.2 $t \leftarrow \lfloor t / r \rfloor$ \\ | |
9261 \hspace{3mm}6.3 Look up $d$ in the map and store the equivalent character in $y$. \\ | |
9262 \hspace{3mm}6.4 $str \leftarrow str + y$ \\ | |
9263 7. If $str_0 = $``$-$'' then \\ | |
9264 \hspace{3mm}7.1 Reverse the digits $str_1, str_2, \ldots str_n$. \\ | |
9265 8. Otherwise \\ | |
9266 \hspace{3mm}8.1 Reverse the digits $str_0, str_1, \ldots str_n$. \\ | |
9267 9. Return(\textit{MP\_OKAY}).\\ | |
9268 \hline | |
9269 \end{tabular} | |
9270 \end{center} | |
9271 \end{small} | |
9272 \caption{Algorithm mp\_toradix} | |
9273 \end{figure} | |
9274 \textbf{Algorithm mp\_toradix.} | |
9275 This algorithm computes the radix-$r$ representation of an mp\_int $a$. The ``digits'' of the representation are extracted by reducing | |
9276 the successive quotients $\lfloor a / r^k \rfloor$ modulo $r$ until $r^k > a$. Note that instead of actually dividing by $r^k$ in | |
9277 each iteration the quotient $\lfloor a / r \rfloor$ is saved for the next iteration. As a result a series of trivial $n \times 1$ divisions | |
9278 are required instead of a series of $n \times k$ divisions. One design flaw of this approach is that the digits are produced in the reverse order | |
9279 (see~\ref{fig:mpradix}). To remedy this flaw the digits must be swapped or simply ``reversed''. | |
9280 | |
9281 \begin{figure} | |
9282 \begin{center} | |
9283 \begin{tabular}{|c|c|c|} | |
9284 \hline \textbf{Value of $a$} & \textbf{Value of $d$} & \textbf{Value of $str$} \\ | |
9285 \hline $1234$ & -- & -- \\ | |
9286 \hline $123$ & $4$ & ``4'' \\ | |
9287 \hline $12$ & $3$ & ``43'' \\ | |
9288 \hline $1$ & $2$ & ``432'' \\ | |
9289 \hline $0$ & $1$ & ``4321'' \\ | |
9290 \hline | |
9291 \end{tabular} | |
9292 \end{center} | |
9293 \caption{Example of Algorithm mp\_toradix.} | |
9294 \label{fig:mpradix} | |
9295 \end{figure} | |
9296 | |
9297 \vspace{+3mm}\begin{small} | |
9298 \hspace{-5.1mm}{\bf File}: bn\_mp\_toradix.c | |
9299 \vspace{-3mm} | |
9300 \begin{alltt} | |
9301 016 | |
9302 017 /* stores a bignum as an ASCII string in a given radix (2..64) */ | |
9303 018 int mp_toradix (mp_int * a, char *str, int radix) | |
9304 019 \{ | |
9305 020 int res, digs; | |
9306 021 mp_int t; | |
9307 022 mp_digit d; | |
9308 023 char *_s = str; | |
9309 024 | |
9310 025 /* check range of the radix */ | |
9311 026 if (radix < 2 || radix > 64) \{ | |
9312 027 return MP_VAL; | |
9313 028 \} | |
9314 029 | |
9315 030 /* quick out if its zero */ | |
9316 031 if (mp_iszero(a) == 1) \{ | |
9317 032 *str++ = '0'; | |
9318 033 *str = '\symbol{92}0'; | |
9319 034 return MP_OKAY; | |
9320 035 \} | |
9321 036 | |
9322 037 if ((res = mp_init_copy (&t, a)) != MP_OKAY) \{ | |
9323 038 return res; | |
9324 039 \} | |
9325 040 | |
9326 041 /* if it is negative output a - */ | |
9327 042 if (t.sign == MP_NEG) \{ | |
9328 043 ++_s; | |
9329 044 *str++ = '-'; | |
9330 045 t.sign = MP_ZPOS; | |
9331 046 \} | |
9332 047 | |
9333 048 digs = 0; | |
9334 049 while (mp_iszero (&t) == 0) \{ | |
9335 050 if ((res = mp_div_d (&t, (mp_digit) radix, &t, &d)) != MP_OKAY) \{ | |
9336 051 mp_clear (&t); | |
9337 052 return res; | |
9338 053 \} | |
9339 054 *str++ = mp_s_rmap[d]; | |
9340 055 ++digs; | |
9341 056 \} | |
9342 057 | |
9343 058 /* reverse the digits of the string. In this case _s points | |
9344 059 * to the first digit [excluding the sign] of the number | |
9345 060 */ | |
9346 061 bn_reverse ((unsigned char *)_s, digs); | |
9347 062 | |
9348 063 /* append a NULL so the string is properly terminated */ | |
9349 064 *str = '\symbol{92}0'; | |
9350 065 | |
9351 066 mp_clear (&t); | |
9352 067 return MP_OKAY; | |
9353 068 \} | |
9354 069 | |
9355 \end{alltt} | |
9356 \end{small} | |
9357 | |
9358 \chapter{Number Theoretic Algorithms} | |
9359 This chapter discusses several fundamental number theoretic algorithms such as the greatest common divisor, least common multiple and Jacobi | |
9360 symbol computation. These algorithms arise as essential components in several key cryptographic algorithms such as the RSA public key algorithm and | |
9361 various Sieve based factoring algorithms. | |
9362 | |
9363 \section{Greatest Common Divisor} | |
9364 The greatest common divisor of two integers $a$ and $b$, often denoted as $(a, b)$, is the largest integer $k$ that divides | |
9365 both $a$ and $b$. That is, $k$ is the largest integer such that $0 \equiv a \mbox{ (mod }k\mbox{)}$ and $0 \equiv b \mbox{ (mod }k\mbox{)}$ hold | |
9366 simultaneously. | |
9367 | |
9368 The most common approach (cite) is to reduce one input modulo another. That is if $a$ and $b$ are divisible by some integer $k$ and if $qa + r = b$ then | |
9369 $r$ is also divisible by $k$. The reduction pattern follows $\left < a , b \right > \rightarrow \left < b, a \mbox{ mod } b \right >$. | |
9370 | |
9371 \newpage\begin{figure}[!here] | |
9372 \begin{small} | |
9373 \begin{center} | |
9374 \begin{tabular}{l} | |
9375 \hline Algorithm \textbf{Greatest Common Divisor (I)}. \\ | |
9376 \textbf{Input}. Two positive integers $a$ and $b$ greater than zero. \\ | |
9377 \textbf{Output}. The greatest common divisor $(a, b)$. \\ | |
9378 \hline \\ | |
9379 1. While ($b > 0$) do \\ | |
9380 \hspace{3mm}1.1 $r \leftarrow a \mbox{ (mod }b\mbox{)}$ \\ | |
9381 \hspace{3mm}1.2 $a \leftarrow b$ \\ | |
9382 \hspace{3mm}1.3 $b \leftarrow r$ \\ | |
9383 2. Return($a$). \\ | |
9384 \hline | |
9385 \end{tabular} | |
9386 \end{center} | |
9387 \end{small} | |
9388 \caption{Algorithm Greatest Common Divisor (I)} | |
9389 \label{fig:gcd1} | |
9390 \end{figure} | |
9391 | |
9392 This algorithm will quickly converge on the greatest common divisor since the residue $r$ tends to diminish rapidly. However, divisions are | |
9393 relatively expensive operations to perform and should ideally be avoided. There is another approach based on a similar relationship of | |
9394 greatest common divisors. The faster approach is based on the observation that if $k$ divides both $a$ and $b$ it will also divide $a - b$. | |
9395 In particular, we would like $a - b$ to decrease in magnitude which implies that $b \ge a$. | |
9396 | |
9397 \begin{figure}[!here] | |
9398 \begin{small} | |
9399 \begin{center} | |
9400 \begin{tabular}{l} | |
9401 \hline Algorithm \textbf{Greatest Common Divisor (II)}. \\ | |
9402 \textbf{Input}. Two positive integers $a$ and $b$ greater than zero. \\ | |
9403 \textbf{Output}. The greatest common divisor $(a, b)$. \\ | |
9404 \hline \\ | |
9405 1. While ($b > 0$) do \\ | |
9406 \hspace{3mm}1.1 Swap $a$ and $b$ such that $a$ is the smallest of the two. \\ | |
9407 \hspace{3mm}1.2 $b \leftarrow b - a$ \\ | |
9408 2. Return($a$). \\ | |
9409 \hline | |
9410 \end{tabular} | |
9411 \end{center} | |
9412 \end{small} | |
9413 \caption{Algorithm Greatest Common Divisor (II)} | |
9414 \label{fig:gcd2} | |
9415 \end{figure} | |
9416 | |
9417 \textbf{Proof} \textit{Algorithm~\ref{fig:gcd2} will return the greatest common divisor of $a$ and $b$.} | |
9418 The algorithm in figure~\ref{fig:gcd2} will eventually terminate: since $b \ge a > 0$ after the swap, the subtraction in step 1.2 produces a value less than $b$. In other | |
9419 words in every iteration the tuple $\left < a, b \right >$ decreases in magnitude until eventually $a = b$. Both $a$ and $b$ are always | |
9420 divisible by the greatest common divisor (\textit{until the last iteration}) and in the last iteration of the algorithm $b = 0$. Therefore, upon | |
9421 entering the last iteration $b = a$ and clearly $(a, a) = a$ which concludes the proof. \textbf{QED}. | |
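
The two algorithms above can be compared directly on machine words. The following is a minimal sketch; the function names gcd\_mod and gcd\_sub and the iteration counters are assumptions introduced here purely to make the difference in convergence visible.

```c
#include <assert.h>

/* Algorithm (I): reduce one input modulo the other.
 * *iters receives the number of loop iterations performed. */
unsigned long gcd_mod(unsigned long a, unsigned long b, unsigned long *iters)
{
    unsigned long r;

    assert(a > 0 && b > 0);
    *iters = 0;
    while (b > 0) {
        r = a % b;
        a = b;
        b = r;
        ++*iters;
    }
    return a;
}

/* Algorithm (II): subtract the smaller value from the larger.
 * *iters receives the number of loop iterations performed. */
unsigned long gcd_sub(unsigned long a, unsigned long b, unsigned long *iters)
{
    unsigned long t;

    assert(a > 0 && b > 0);
    *iters = 0;
    while (b > 0) {
        /* step 1.1: swap so that a is the smaller of the two */
        if (a > b) {
            t = a;
            a = b;
            b = t;
        }
        /* step 1.2 */
        b -= a;
        ++*iters;
    }
    return a;
}
```

Both functions return the same value for the same inputs, but when one input is much larger than the other the subtractive version needs far more iterations than the modular version.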
9422 | |
9423 As a matter of practicality algorithm \ref{fig:gcd2} decreases far too slowly to be useful, especially if $b$ is much larger than $a$ such that | |
9424 $b - a$ is still very much larger than $a$. A simple addition to the algorithm is to divide $b - a$ by a power of some integer $p$ which does | |
9425 not divide the greatest common divisor but will divide $b - a$. In this case ${b - a} \over p$ is also an integer and still divisible by | |
9426 the greatest common divisor. | |
9427 | |
9428 However, instead of factoring $b - a$ to find a suitable value of $p$ the powers of $p$ that $a$ and $b$ have in common can be removed first. | |
9429 Then inside the loop whenever $b - a$ is divisible by some power of $p$ it can be safely removed. | |
9430 | |
9431 \begin{figure}[!here] | |
9432 \begin{small} | |
9433 \begin{center} | |
9434 \begin{tabular}{l} | |
9435 \hline Algorithm \textbf{Greatest Common Divisor (III)}. \\ | |
9436 \textbf{Input}. Two positive integers $a$ and $b$ greater than zero. \\ | |
9437 \textbf{Output}. The greatest common divisor $(a, b)$. \\ | |
9438 \hline \\ | |
9439 1. $k \leftarrow 0$ \\ | |
9440 2. While $a$ and $b$ are both divisible by $p$ do \\ | |
9441 \hspace{3mm}2.1 $a \leftarrow \lfloor a / p \rfloor$ \\ | |
9442 \hspace{3mm}2.2 $b \leftarrow \lfloor b / p \rfloor$ \\ | |
9443 \hspace{3mm}2.3 $k \leftarrow k + 1$ \\ | |
9444 3. While $a$ is divisible by $p$ do \\ | |
9445 \hspace{3mm}3.1 $a \leftarrow \lfloor a / p \rfloor$ \\ | |
9446 4. While $b$ is divisible by $p$ do \\ | |
9447 \hspace{3mm}4.1 $b \leftarrow \lfloor b / p \rfloor$ \\ | |
9448 5. While ($b > 0$) do \\ | |
9449 \hspace{3mm}5.1 Swap $a$ and $b$ such that $a$ is the smallest of the two. \\ | |
9450 \hspace{3mm}5.2 $b \leftarrow b - a$ \\ | |
9451 \hspace{3mm}5.3 While $b$ is divisible by $p$ do \\ | |
9452 \hspace{6mm}5.3.1 $b \leftarrow \lfloor b / p \rfloor$ \\ | |
9453 6. Return($a \cdot p^k$). \\ | |
9454 \hline | |
9455 \end{tabular} | |
9456 \end{center} | |
9457 \end{small} | |
9458 \caption{Algorithm Greatest Common Divisor (III)} | |
9459 \label{fig:gcd3} | |
9460 \end{figure} | |
9461 | |
9462 This algorithm is based on the second except it removes powers of $p$ first and inside the main loop to ensure the tuple $\left < a, b \right >$ | |
9463 decreases more rapidly. The first loop on step two removes powers of $p$ that are in common. A count, $k$, is kept which will represent a common | |
9464 divisor of $p^k$. After step two the remaining common divisor of $a$ and $b$ cannot be divisible by $p$. This means that $p$ can be safely | |
9465 divided out of the difference $b - a$ so long as the division leaves no remainder. | |
9466 | |
9467 In particular the value of $p$ should be chosen such that the division on step 5.3.1 occurs often. It also helps that division by $p$ be easy | |
9468 to compute. The ideal choice of $p$ is two since division by two amounts to a right logical shift. Another important observation is that by | |
9469 step five both $a$ and $b$ are odd. Therefore, the difference $b - a$ must be even which means that each iteration removes at least one bit from the | |
9470 larger of the pair. | |
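
With $p = 2$ the algorithm of figure~\ref{fig:gcd3} can be sketched on machine words as follows. The name gcd\_binary is an assumption made for this illustration. Note the extra test for $b > 0$ in the inner loop: zero is divisible by every $p$, so without the test the loop would never terminate once the difference reaches zero.

```c
#include <assert.h>

/* Sketch of Greatest Common Divisor (III) with p = 2 for two
 * positive machine words.  Hypothetical helper, not library code. */
unsigned long gcd_binary(unsigned long a, unsigned long b)
{
    unsigned k = 0;
    unsigned long t;

    assert(a > 0 && b > 0);

    /* step 2: remove the common factors of two, counting them in k */
    while (((a | b) & 1UL) == 0) {
        a >>= 1;
        b >>= 1;
        ++k;
    }

    /* steps 3 and 4: remove the remaining factors of two */
    while ((a & 1UL) == 0) {
        a >>= 1;
    }
    while ((b & 1UL) == 0) {
        b >>= 1;
    }

    /* step 5: both values are odd here, so b - a is even */
    while (b > 0) {
        if (a > b) {
            t = a;
            a = b;
            b = t;
        }
        b -= a;
        /* step 5.3: guard against b == 0, which is "even" forever */
        while (b > 0 && (b & 1UL) == 0) {
            b >>= 1;
        }
    }

    /* step 6: restore the common power of two */
    return a << k;
}
```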
9471 | |
9472 \subsection{Complete Greatest Common Divisor} | |
9473 The algorithms presented so far cannot handle inputs which are zero or negative. The following algorithm can handle all input cases properly | |
9474 and will produce the greatest common divisor. | |
9475 | |
9476 \newpage\begin{figure}[!here] | |
9477 \begin{small} | |
9478 \begin{center} | |
9479 \begin{tabular}{l} | |
9480 \hline Algorithm \textbf{mp\_gcd}. \\ | |
9481 \textbf{Input}. mp\_int $a$ and $b$ \\ | |
9482 \textbf{Output}. The greatest common divisor $c = (a, b)$. \\ | |
9483 \hline \\ | |
9484 1. If $a = 0$ and $b \ne 0$ then \\ | |
9485 \hspace{3mm}1.1 $c \leftarrow b$ \\ | |
9486 \hspace{3mm}1.2 Return(\textit{MP\_OKAY}). \\ | |
9487 2. If $a \ne 0$ and $b = 0$ then \\ | |
9488 \hspace{3mm}2.1 $c \leftarrow a$ \\ | |
9489 \hspace{3mm}2.2 Return(\textit{MP\_OKAY}). \\ | |
9490 3. If $a = b = 0$ then \\ | |
9491 \hspace{3mm}3.1 $c \leftarrow 0$ \\ | |
9492 \hspace{3mm}3.2 Return(\textit{MP\_OKAY}). \\ | |
9493 4. $u \leftarrow \vert a \vert, v \leftarrow \vert b \vert$ \\ | |
9494 5. $k \leftarrow 0$ \\ | |
9495 6. While $u.used > 0$ and $v.used > 0$ and $u_0 \equiv v_0 \equiv 0 \mbox{ (mod }2\mbox{)}$ \\ | |
9496 \hspace{3mm}6.1 $k \leftarrow k + 1$ \\ | |
9497 \hspace{3mm}6.2 $u \leftarrow \lfloor u / 2 \rfloor$ \\ | |
9498 \hspace{3mm}6.3 $v \leftarrow \lfloor v / 2 \rfloor$ \\ | |
9499 7. While $u.used > 0$ and $u_0 \equiv 0 \mbox{ (mod }2\mbox{)}$ \\ | |
9500 \hspace{3mm}7.1 $u \leftarrow \lfloor u / 2 \rfloor$ \\ | |
9501 8. While $v.used > 0$ and $v_0 \equiv 0 \mbox{ (mod }2\mbox{)}$ \\ | |
9502 \hspace{3mm}8.1 $v \leftarrow \lfloor v / 2 \rfloor$ \\ | |
9503 9. While $v.used > 0$ \\ | |
9504 \hspace{3mm}9.1 If $\vert u \vert > \vert v \vert$ then \\ | |
9505 \hspace{6mm}9.1.1 Swap $u$ and $v$. \\ | |
9506 \hspace{3mm}9.2 $v \leftarrow \vert v \vert - \vert u \vert$ \\ | |
9507 \hspace{3mm}9.3 While $v.used > 0$ and $v_0 \equiv 0 \mbox{ (mod }2\mbox{)}$ \\ | |
9508 \hspace{6mm}9.3.1 $v \leftarrow \lfloor v / 2 \rfloor$ \\ | |
9509 10. $c \leftarrow u \cdot 2^k$ \\ | |
9510 11. Return(\textit{MP\_OKAY}). \\ | |
9511 \hline | |
9512 \end{tabular} | |
9513 \end{center} | |
9514 \end{small} | |
9515 \caption{Algorithm mp\_gcd} | |
9516 \end{figure} | |
9517 \textbf{Algorithm mp\_gcd.} | |
9518 This algorithm will produce the greatest common divisor of two mp\_ints $a$ and $b$. The algorithm was originally based on Algorithm B of | |
9519 Knuth \cite[pp. 338]{TAOCPV2} but has been modified to be simpler to explain. In theory it achieves the same asymptotic working time as | |
9520 Algorithm B and in practice this appears to be true. | |
9521 | |
9522 The first three steps handle the cases where either one of or both inputs are zero. If exactly one input is zero the greatest common divisor is the | |
9523 other input, and it is zero if they are both zero. If the inputs are not trivial then $u$ and $v$ are assigned the absolute values of | |
9524 $a$ and $b$ respectively and the algorithm will proceed to reduce the pair. | |
9525 | |
9526 Step six will divide out any common factors of two and keep track of the count in the variable $k$. After this step two is no longer a | |
9527 factor of the remaining greatest common divisor between $u$ and $v$ and can be safely evenly divided out of either whenever they are even. Steps | |
9528 seven and eight ensure that $u$ and $v$ respectively have no more factors of two. At most one of the while loops will iterate since | |
9529 they cannot both be even. | |
9530 | |
9531 By step nine both $u$ and $v$ are odd which is required for the inner logic. First the pair are swapped such that $v$ is equal to | |
9532 or greater than $u$. This ensures that the subtraction on step 9.2 will always produce a non-negative and even result. Step 9.3 removes any | |
9533 factors of two from the difference $v$ to ensure that in the next iteration of the loop both are once again odd. | |
9534 | |
9535 After $v = 0$ occurs the variable $u$ has the greatest common divisor of the pair $\left < u, v \right >$ just after step six. The result | |
9536 must be adjusted by multiplying by the common factors of two ($2^k$) removed earlier. | |
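
The handling of zero and negative inputs can be illustrated on 64-bit words. This is only a sketch under stated assumptions: gcd\_complete and count\_lsb are hypothetical names, and count\_lsb merely imitates the role of a trailing zero bit count; neither is a LibTomMath routine.

```c
#include <assert.h>

/* Number of trailing zero bits of a non-zero value. */
static unsigned count_lsb(unsigned long long x)
{
    unsigned n = 0;

    assert(x != 0);
    while ((x & 1ULL) == 0) {
        x >>= 1;
        ++n;
    }
    return n;
}

/* Greatest common divisor for any pair of signed inputs.
 * gcd(0, b) = |b|, gcd(a, 0) = |a| and gcd(0, 0) = 0. */
unsigned long long gcd_complete(long long a, long long b)
{
    unsigned long long u, v, t;
    unsigned k, u_lsb, v_lsb;

    /* magnitudes; negating in unsigned arithmetic cannot overflow */
    u = (a < 0) ? 0ULL - (unsigned long long) a : (unsigned long long) a;
    v = (b < 0) ? 0ULL - (unsigned long long) b : (unsigned long long) b;

    /* trivial cases: at least one input is zero */
    if (u == 0) {
        return v;
    }
    if (v == 0) {
        return u;
    }

    /* common power of two, then strip every factor of two */
    u_lsb = count_lsb(u);
    v_lsb = count_lsb(v);
    k = (u_lsb < v_lsb) ? u_lsb : v_lsb;
    u >>= u_lsb;
    v >>= v_lsb;

    /* main loop: u and v are odd on entry to every iteration */
    while (v != 0) {
        if (u > v) {
            t = u;
            u = v;
            v = t;
        }
        v -= u;
        if (v != 0) {
            v >>= count_lsb(v);
        }
    }

    /* multiply by the common power of two removed earlier */
    return u << k;
}
```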
9537 | |
9538 \vspace{+3mm}\begin{small} | |
9539 \hspace{-5.1mm}{\bf File}: bn\_mp\_gcd.c | |
9540 \vspace{-3mm} | |
9541 \begin{alltt} | |
9542 016 | |
9543 017 /* Greatest Common Divisor using the binary method */ | |
9544 018 int mp_gcd (mp_int * a, mp_int * b, mp_int * c) | |
9545 019 \{ | |
9546 020 mp_int u, v; | |
9547 021 int k, u_lsb, v_lsb, res; | |
9548 022 | |
9549 023 /* either zero than gcd is the largest */ | |
9550 024 if (mp_iszero (a) == 1 && mp_iszero (b) == 0) \{ | |
9551 025 return mp_abs (b, c); | |
9552 026 \} | |
9553 027 if (mp_iszero (a) == 0 && mp_iszero (b) == 1) \{ | |
9554 028 return mp_abs (a, c); | |
9555 029 \} | |
9556 030 | |
9557 031 /* optimized. At this point if a == 0 then | |
9558 032 * b must equal zero too | |
9559 033 */ | |
9560 034 if (mp_iszero (a) == 1) \{ | |
9561 035 mp_zero(c); | |
9562 036 return MP_OKAY; | |
9563 037 \} | |
9564 038 | |
9565 039 /* get copies of a and b we can modify */ | |
9566 040 if ((res = mp_init_copy (&u, a)) != MP_OKAY) \{ | |
9567 041 return res; | |
9568 042 \} | |
9569 043 | |
9570 044 if ((res = mp_init_copy (&v, b)) != MP_OKAY) \{ | |
9571 045 goto __U; | |
9572 046 \} | |
9573 047 | |
9574 048 /* must be positive for the remainder of the algorithm */ | |
9575 049 u.sign = v.sign = MP_ZPOS; | |
9576 050 | |
9577 051 /* B1. Find the common power of two for u and v */ | |
9578 052 u_lsb = mp_cnt_lsb(&u); | |
9579 053 v_lsb = mp_cnt_lsb(&v); | |
9580 054 k = MIN(u_lsb, v_lsb); | |
9581 055 | |
9582 056 if (k > 0) \{ | |
9583 057 /* divide the power of two out */ | |
9584 058 if ((res = mp_div_2d(&u, k, &u, NULL)) != MP_OKAY) \{ | |
9585 059 goto __V; | |
9586 060 \} | |
9587 061 | |
9588 062 if ((res = mp_div_2d(&v, k, &v, NULL)) != MP_OKAY) \{ | |
9589 063 goto __V; | |
9590 064 \} | |
9591 065 \} | |
9592 066 | |
9593 067 /* divide any remaining factors of two out */ | |
9594 068 if (u_lsb != k) \{ | |
9595 069 if ((res = mp_div_2d(&u, u_lsb - k, &u, NULL)) != MP_OKAY) \{ | |
9596 070 goto __V; | |
9597 071 \} | |
9598 072 \} | |
9599 073 | |
9600 074 if (v_lsb != k) \{ | |
9601 075 if ((res = mp_div_2d(&v, v_lsb - k, &v, NULL)) != MP_OKAY) \{ | |
9602 076 goto __V; | |
9603 077 \} | |
9604 078 \} | |
9605 079 | |
9606 080 while (mp_iszero(&v) == 0) \{ | |
9607 081 /* make sure v is the largest */ | |
9608 082 if (mp_cmp_mag(&u, &v) == MP_GT) \{ | |
9609 083 /* swap u and v to make sure v is >= u */ | |
9610 084 mp_exch(&u, &v); | |
9611 085 \} | |
9612 086 | |
9613 087 /* subtract smallest from largest */ | |
9614 088 if ((res = s_mp_sub(&v, &u, &v)) != MP_OKAY) \{ | |
9615 089 goto __V; | |
9616 090 \} | |
9617 091 | |
9618 092 /* Divide out all factors of two */ | |
9619 093 if ((res = mp_div_2d(&v, mp_cnt_lsb(&v), &v, NULL)) != MP_OKAY) \{ | |
9620 094 goto __V; | |
9621 095 \} | |
9622 096 \} | |
9623 097 | |
9624 098 /* multiply by 2**k which we divided out at the beginning */ | |
9625 099 if ((res = mp_mul_2d (&u, k, c)) != MP_OKAY) \{ | |
9626 100 goto __V; | |
9627 101 \} | |
9628 102 c->sign = MP_ZPOS; | |
9629 103 res = MP_OKAY; | |
9630 104 __V:mp_clear (&u); | |
9631 105 __U:mp_clear (&v); | |
9632 106 return res; | |
9633 107 \} | |
9634 \end{alltt} | |
9635 \end{small} | |
9636 | |
9637 This function makes use of the macro mp\_iszero and the function mp\_cnt\_lsb. The former evaluates to $1$ if the input mp\_int is equivalent to the | |
9638 integer zero otherwise it evaluates to $0$. The latter returns the number of trailing zero bits of its input, that is, the number of times | |
9639 two divides it evenly, which lets the bit-by-bit while loops of the pseudo-code be replaced by a single shift. The three | |
9640 trivial cases of inputs are handled on lines 24 through 37. After those lines the inputs are assumed to be non-zero. | |
9641 | |
9642 Lines 40 and 44 make local copies $u$ and $v$ of the inputs $a$ and $b$ respectively. At this point the common factors of two | |
9643 must be divided out of the two inputs. Lines 52 through 54 count the factors of two in each value and lines 56 through 65 divide the common power out of both. The local integer $k$ is used to | |
9644 keep track of how many factors of $2$ are pulled out of both values. It is assumed that the number of factors will not exceed the maximum | |
9645 value of a C ``int'' data type\footnote{Strictly speaking no array in C may have more entries than are accessible by an ``int'' so this is not | |
9646 a limitation.}. | |
9647 | |
9648 At this point there are no more common factors of two in the two values. The conditional blocks on lines 68 and 74 remove any independent | |
9649 factors of two such that both $u$ and $v$ are guaranteed to be odd integers before hitting the main body of the algorithm. The while loop | |
9650 on line 80 performs the reduction of the pair until $v$ is equal to zero. The unsigned comparison and subtraction algorithms are used in | |
9651 place of the full signed routines since both values are guaranteed to be positive and the result of the subtraction is guaranteed to be non-negative. | |
9652 | |
9653 \section{Least Common Multiple} | |
9654 The least common multiple of a pair of integers is their product divided by their greatest common divisor. For two integers $a$ and $b$ the | |
9655 least common multiple is normally denoted as $[ a, b ]$ and numerically equivalent to ${ab} \over {(a, b)}$. For example, if $a = 2 \cdot 2 \cdot 3 = 12$ | |
9656 and $b = 2 \cdot 3 \cdot 3 \cdot 7 = 126$ the least common multiple is ${{12 \cdot 126} \over {(12, 126)}} = {1512 \over 6} = 252$. | |
9657 | |
9658 The least common multiple arises often in coding theory as well as number theory. If two functions have periods of $a$ and $b$ respectively they will | |
9659 collide, that is be in synchronous states, after only $[ a, b ]$ iterations. This is why, for example, random number generators based on | |
9660 Linear Feedback Shift Registers (LFSR) tend to use registers with periods which are co-prime (\textit{i.e. the greatest common divisor is one}). | |
9661 Similarly in number theory if a composite $n$ has two distinct prime factors $p$ and $q$ then the maximal order of any unit of $\Z/n\Z$ will be $[ p - 1, q - 1] $. | |
9662 | |
9663 \begin{figure}[!here] | |
9664 \begin{small} | |
9665 \begin{center} | |
9666 \begin{tabular}{l} | |
9667 \hline Algorithm \textbf{mp\_lcm}. \\ | |
9668 \textbf{Input}. mp\_int $a$ and $b$ \\ | |
9669 \textbf{Output}. The least common multiple $c = [a, b]$. \\ | |
9670 \hline \\ | |
9671 1. $c \leftarrow (a, b)$ \\ | |
9672 2. $t \leftarrow a \cdot b$ \\ | |
9673 3. $c \leftarrow \lfloor t / c \rfloor$ \\ | |
9674 4. Return(\textit{MP\_OKAY}). \\ | |
9675 \hline | |
9676 \end{tabular} | |
9677 \end{center} | |
9678 \end{small} | |
9679 \caption{Algorithm mp\_lcm} | |
9680 \end{figure} | |
9681 \textbf{Algorithm mp\_lcm.} | |
9682 This algorithm computes the least common multiple of two mp\_int inputs $a$ and $b$. It computes the least common multiple directly by | |
9683 dividing the product of the two inputs by their greatest common divisor. | |
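
A word-sized sketch of the same computation is given below. The names gcd\_ul and lcm\_ul are assumptions for this illustration. Like the listing that follows, the sketch divides the smaller input by the greatest common divisor before multiplying, which keeps the intermediate value no larger than the final result.

```c
#include <assert.h>

/* Euclid's algorithm on machine words (hypothetical helper). */
unsigned long gcd_ul(unsigned long a, unsigned long b)
{
    unsigned long r;

    while (b > 0) {
        r = a % b;
        a = b;
        b = r;
    }
    return a;
}

/* Least common multiple [a, b] = ab / (a, b) for positive inputs.
 * The result must fit in an unsigned long. */
unsigned long lcm_ul(unsigned long a, unsigned long b)
{
    unsigned long g;

    assert(a > 0 && b > 0);
    g = gcd_ul(a, b);

    /* divide the smaller input by the gcd, then multiply by the other */
    if (a < b) {
        return (a / g) * b;
    }
    return (b / g) * a;
}
```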
9684 | |
9685 \vspace{+3mm}\begin{small} | |
9686 \hspace{-5.1mm}{\bf File}: bn\_mp\_lcm.c | |
9687 \vspace{-3mm} | |
9688 \begin{alltt} | |
9689 016 | |
9690 017 /* computes least common multiple as |a*b|/(a, b) */ | |
9691 018 int mp_lcm (mp_int * a, mp_int * b, mp_int * c) | |
9692 019 \{ | |
9693 020 int res; | |
9694 021 mp_int t1, t2; | |
9695 022 | |
9696 023 | |
9697 024 if ((res = mp_init_multi (&t1, &t2, NULL)) != MP_OKAY) \{ | |
9698 025 return res; | |
9699 026 \} | |
9700 027 | |
9701 028 /* t1 = get the GCD of the two inputs */ | |
9702 029 if ((res = mp_gcd (a, b, &t1)) != MP_OKAY) \{ | |
9703 030 goto __T; | |
9704 031 \} | |
9705 032 | |
9706 033 /* divide the smallest by the GCD */ | |
9707 034 if (mp_cmp_mag(a, b) == MP_LT) \{ | |
9708 035 /* store quotient in t2 such that t2 * b is the LCM */ | |
9709 036 if ((res = mp_div(a, &t1, &t2, NULL)) != MP_OKAY) \{ | |
9710 037 goto __T; | |
9711 038 \} | |
9712 039 res = mp_mul(b, &t2, c); | |
9713 040 \} else \{ | |
9714 041 /* store quotient in t2 such that t2 * a is the LCM */ | |
9715 042 if ((res = mp_div(b, &t1, &t2, NULL)) != MP_OKAY) \{ | |
9716 043 goto __T; | |
9717 044 \} | |
9718 045 res = mp_mul(a, &t2, c); | |
9719 046 \} | |
9720 047 | |
9721 048 /* fix the sign to positive */ | |
9722 049 c->sign = MP_ZPOS; | |
9723 050 | |
9724 051 __T: | |
9725 052 mp_clear_multi (&t1, &t2, NULL); | |
9726 053 return res; | |
9727 054 \} | |
9728 \end{alltt} | |
9729 \end{small} | |
9730 | |
9731 \section{Jacobi Symbol Computation} | |
9732 To explain the Jacobi Symbol we shall first discuss the Legendre function\footnote{More commonly known as the Legendre symbol.} upon which the Jacobi symbol is | |
9733 defined. The Legendre function computes whether or not an integer $a$ is a quadratic residue modulo an odd prime $p$. Numerically it is | |
9734 equivalent to equation \ref{eqn:legendre}. | |
9735 | |
9736 \begin{equation} | |
9737 a^{(p-1)/2} \equiv \begin{array}{rl} | |
9738 -1 & \mbox{if }a\mbox{ is a quadratic non-residue.} \\ | |
9739 0 & \mbox{if }p\mbox{ divides }a\mbox{.} \\ | |
9740 1 & \mbox{if }a\mbox{ is a quadratic residue}. | |
9741 \end{array} \mbox{ (mod }p\mbox{)} | |
9742 \label{eqn:legendre} | |
9743 \end{equation} | |
9744 | |
9745 \textbf{Proof.} \textit{Equation \ref{eqn:legendre} correctly identifies the residue status of an integer $a$ modulo a prime $p$.} | |
9746 An integer $a$ is a quadratic residue if the following equation has a solution. | |
9747 | |
9748 \begin{equation} | |
9749 x^2 \equiv a \mbox{ (mod }p\mbox{)} | |
9750 \label{eqn:root} | |
9751 \end{equation} | |
9752 | |
9753 Consider the following equation. | |
9754 | |
9755 \begin{equation} | |
9756 0 \equiv x^{p-1} - 1 \equiv \left \lbrace \left (x^2 \right )^{(p-1)/2} - a^{(p-1)/2} \right \rbrace + \left ( a^{(p-1)/2} - 1 \right ) \mbox{ (mod }p\mbox{)} | |
9757 \label{eqn:rooti} | |
9758 \end{equation} | |
9759 | |
9760 Whether equation \ref{eqn:root} has a solution or not, equation \ref{eqn:rooti} is true for every $x$ not divisible by $p$. If $a^{(p-1)/2} - 1 \equiv 0 \mbox{ (mod }p\mbox{)}$ | |
9761 then the quantity in the braces must be zero. By reduction, | |
9762 | |
9763 \begin{eqnarray} | |
9764 \left (x^2 \right )^{(p-1)/2} - a^{(p-1)/2} \equiv 0 \nonumber \\ | |
9765 \left (x^2 \right )^{(p-1)/2} \equiv a^{(p-1)/2} \nonumber \\ | |
9766 x^2 \equiv a \mbox{ (mod }p\mbox{)} | |
9767 \end{eqnarray} | |
9768 | |
9769 As a result there must be a solution to the quadratic equation and in turn $a$ must be a quadratic residue. If $p$ does not divide $a$ and $a$ | |
9770 is not a quadratic residue then the only other value $a^{(p-1)/2}$ may be congruent to is $-1$ since | |
9771 \begin{equation} | |
9772 0 \equiv a^{p - 1} - 1 \equiv (a^{(p-1)/2} + 1)(a^{(p-1)/2} - 1) \mbox{ (mod }p\mbox{)} | |
9773 \end{equation} | |
9774 One of the terms on the right hand side must be zero. \textbf{QED} | |
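
Equation \ref{eqn:legendre} translates directly into a modular exponentiation. The following sketch works on machine words; the names mod\_exp and legendre, and the restriction $p < 2^{32}$ (so that products fit in 64 bits), are assumptions made for this illustration. The primality of $p$ is assumed and not checked.

```c
#include <assert.h>

/* base^exp mod m by repeated squaring; requires m < 2^32 so that
 * every intermediate product fits in 64 bits. */
unsigned long long mod_exp(unsigned long long base, unsigned long long exp,
                           unsigned long long m)
{
    unsigned long long result = 1 % m;

    base %= m;
    while (exp > 0) {
        if (exp & 1ULL) {
            result = (result * base) % m;
        }
        base = (base * base) % m;
        exp >>= 1;
    }
    return result;
}

/* Legendre function of a modulo an odd prime p:
 * returns -1, 0 or 1 according to a^((p-1)/2) mod p. */
int legendre(unsigned long long a, unsigned long long p)
{
    unsigned long long r;

    /* p is assumed to be an odd prime; only oddness and size are checked */
    assert(p >= 3 && (p & 1ULL) == 1 && p < (1ULL << 32));

    r = mod_exp(a % p, (p - 1) / 2, p);
    if (r == p - 1) {
        return -1;              /* p - 1 is congruent to -1 modulo p */
    }
    return (int) r;             /* 0 or 1 when p is prime */
}
```

For instance, modulo $7$ the non-zero squares are $1$, $2$ and $4$, so the function yields $1$ for $a = 2$ and $-1$ for $a = 3$.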
9775 | |
9776 \subsection{Jacobi Symbol} | |
9777 The Jacobi symbol is a generalization of the Legendre function for any odd non prime moduli $p$ greater than 2. If $p = \prod_{i=0}^n p_i$ then | |
9778 the Jacobi symbol $\left ( { a \over p } \right )$ is equal to the following equation. | |
9779 | |
9780 \begin{equation} | |
9781 \left ( { a \over p } \right ) = \left ( { a \over p_0} \right ) \left ( { a \over p_1} \right ) \ldots \left ( { a \over p_n} \right ) | |
9782 \end{equation} | |
9783 | |
9784 By inspection if $p$ is prime the Jacobi symbol is equivalent to the Legendre function. The following facts\footnote{See HAC \cite[pp. 72-74]{HAC} for | |
9785 further details.} will be used to derive an efficient Jacobi symbol algorithm. Where $p$ is an odd integer greater than two and $a, b \in \Z$ the | |
9786 following are true. | |
9787 | |
9788 \begin{enumerate} | |
9789 \item $\left ( { a \over p} \right )$ equals $-1$, $0$ or $1$. | |
9790 \item $\left ( { ab \over p} \right ) = \left ( { a \over p} \right )\left ( { b \over p} \right )$. | |
9791 \item If $a \equiv b \mbox{ (mod }p\mbox{)}$ then $\left ( { a \over p} \right ) = \left ( { b \over p} \right )$. | |
9792 \item $\left ( { 2 \over p} \right )$ equals $1$ if $p \equiv 1$ or $7 \mbox{ (mod }8\mbox{)}$. Otherwise, it equals $-1$. | |
9793 \item For odd $a \ge 3$, $\left ( { a \over p} \right ) = \left ( { p \over a} \right ) \cdot (-1)^{(p-1)(a-1)/4}$. More specifically | |
9794 $\left ( { a \over p} \right ) = \left ( { p \over a} \right )$ unless $p \equiv a \equiv 3 \mbox{ (mod }4\mbox{)}$, in which case the sign is reversed. | |
9795 \end{enumerate} | |
9796 | |
9797 Using these facts if $a = 2^k \cdot a'$ then | |
9798 | |
9799 \begin{eqnarray} | |
9800 \left ( { a \over p } \right ) = \left ( {{2^k} \over p } \right ) \left ( {a' \over p} \right ) \nonumber \\ | |
9801 = \left ( {2 \over p } \right )^k \left ( {a' \over p} \right ) | |
9802 \label{eqn:jacobi} | |
9803 \end{eqnarray} | |
9804 | |
9805 By fact five, | |
9806 | |
9807 \begin{equation} | |
9808 \left ( { a \over p } \right ) = \left ( { p \over a } \right ) \cdot (-1)^{(p-1)(a-1)/4} | |
9809 \end{equation} | |
9810 | |
9811 Subsequently by fact three since $p \equiv (p \mbox{ mod }a) \mbox{ (mod }a\mbox{)}$ then | |
9812 | |
9813 \begin{equation} | |
9814 \left ( { a \over p } \right ) = \left ( { {p \mbox{ mod } a} \over a } \right ) \cdot (-1)^{(p-1)(a-1)/4} | |
9815 \end{equation} | |
9816 | |
9817 By putting both observations into equation \ref{eqn:jacobi} the following simplified equation is formed. | |
9818 | |
9819 \begin{equation} | |
9820 \left ( { a \over p } \right ) = \left ( {2 \over p } \right )^k \left ( {{p\mbox{ mod }a'} \over a'} \right ) \cdot (-1)^{(p-1)(a'-1)/4} | |
9821 \end{equation} | |
9822 | |
9823 The value of $\left ( {{p \mbox{ mod }a'} \over a'} \right )$ can be found by using the same equation recursively. The value of | |
9824 $\left ( {2 \over p } \right )^k$ equals $1$ if $k$ is even otherwise it equals $\left ( {2 \over p } \right )$. Using this approach the | |
9825 factors of $p$ do not have to be known. Furthermore, if $(a, p) = 1$ then the algorithm will terminate when the recursion requests the | |
9826 Jacobi symbol computation of $\left ( {1 \over a'} \right )$ which is simply $1$. | |
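
As an illustration of the recursion, consider the values $a = 158$ and $p = 235$ (chosen here only as a worked example). Since $158 = 2 \cdot 79$ we have $k = 1$ and $a' = 79$. Also $235 \equiv 3 \mbox{ (mod }8\mbox{)}$ so $\left ( { 2 \over 235 } \right ) = -1$, and $235 \mbox{ mod } 79 = 77$.

\begin{eqnarray}
\left ( { 158 \over 235 } \right ) & = & \left ( { 2 \over 235 } \right ) \left ( { 77 \over 79 } \right ) \cdot (-1)^{234 \cdot 78 / 4} \nonumber \\
 & = & (-1) \cdot \left ( { 77 \over 79 } \right ) \cdot (-1) = \left ( { 77 \over 79 } \right ) \nonumber \\
 & = & \left ( { 2 \over 77 } \right ) \cdot (-1)^{78 \cdot 76 / 4} = \left ( { 2 \over 77 } \right ) \nonumber \\
 & = & -1 \nonumber
\end{eqnarray}

The exponent $234 \cdot 78 / 4 = 4563$ is odd while $78 \cdot 76 / 4 = 1482$ is even. In the third line $79 \mbox{ mod } 77 = 2$, and the last line follows since $77 \equiv 5 \mbox{ (mod }8\mbox{)}$.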
9827 | |
9828 \newpage\begin{figure}[!here] | |
9829 \begin{small} | |
9830 \begin{center} | |
9831 \begin{tabular}{l} | |
9832 \hline Algorithm \textbf{mp\_jacobi}. \\ | |
9833 \textbf{Input}. mp\_int $a$ and $p$, $a \ge 0$, $p \ge 3$, $p \equiv 1 \mbox{ (mod }2\mbox{)}$ \\ | |
9834 \textbf{Output}. The Jacobi symbol $c = \left ( {a \over p } \right )$. \\ | |
9835 \hline \\ | |
9836 1. If $a = 0$ then \\ | |
9837 \hspace{3mm}1.1 $c \leftarrow 0$ \\ | |
9838 \hspace{3mm}1.2 Return(\textit{MP\_OKAY}). \\ | |
9839 2. If $a = 1$ then \\ | |
9840 \hspace{3mm}2.1 $c \leftarrow 1$ \\ | |
9841 \hspace{3mm}2.2 Return(\textit{MP\_OKAY}). \\ | |
9842 3. $a' \leftarrow a$ \\ | |
9843 4. $k \leftarrow 0$ \\ | |
9844 5. While $a'.used > 0$ and $a'_0 \equiv 0 \mbox{ (mod }2\mbox{)}$ \\ | |
9845 \hspace{3mm}5.1 $k \leftarrow k + 1$ \\ | |
9846 \hspace{3mm}5.2 $a' \leftarrow \lfloor a' / 2 \rfloor$ \\ | |
9847 6. If $k \equiv 0 \mbox{ (mod }2\mbox{)}$ then \\ | |
9848 \hspace{3mm}6.1 $s \leftarrow 1$ \\ | |
9849 7. else \\ | |
9850 \hspace{3mm}7.1 $r \leftarrow p_0 \mbox{ (mod }8\mbox{)}$ \\ | |
9851 \hspace{3mm}7.2 If $r = 1$ or $r = 7$ then \\ | |
9852 \hspace{6mm}7.2.1 $s \leftarrow 1$ \\ | |
9853 \hspace{3mm}7.3 else \\ | |
9854 \hspace{6mm}7.3.1 $s \leftarrow -1$ \\ | |
9855 8. If $p_0 \equiv a'_0 \equiv 3 \mbox{ (mod }4\mbox{)}$ then \\ | |
9856 \hspace{3mm}8.1 $s \leftarrow -s$ \\ | |
9857 9. If $a' \ne 1$ then \\ | |
9858 \hspace{3mm}9.1 $p' \leftarrow p \mbox{ (mod }a'\mbox{)}$ \\ | |
9859 \hspace{3mm}9.2 $s \leftarrow s \cdot \mbox{mp\_jacobi}(p', a')$ \\ | |
9860 10. $c \leftarrow s$ \\ | |
9861 11. Return(\textit{MP\_OKAY}). \\ | |
9862 \hline | |
9863 \end{tabular} | |
9864 \end{center} | |
9865 \end{small} | |
9866 \caption{Algorithm mp\_jacobi} | |
9867 \end{figure} | |
9868 \textbf{Algorithm mp\_jacobi.} | |
9869 This algorithm computes the Jacobi symbol for an arbitrary non-negative integer $a$ with respect to an odd integer $p$ greater than or equal to three. The algorithm | |
9870 is based on algorithm 2.149 of HAC \cite[pp. 73]{HAC}. | |
9871 | |
9872 Step numbers one and two handle the trivial cases of $a = 0$ and $a = 1$ respectively. Step five determines the number of factors of two in the | |
9873 input $a$. If $k$ is even then the term $\left ( { 2 \over p } \right )^k$ must always evaluate to one. If $k$ is odd then the term evaluates to one | |
9874 if $p_0$ is congruent to one or seven modulo eight, otherwise it evaluates to $-1$. After the $\left ( { 2 \over p } \right )^k$ term is handled | |
9875 the $(-1)^{(p-1)(a'-1)/4}$ term is computed and multiplied against the current product $s$. The latter term evaluates to negative one if both $p$ and $a'$ | |
9876 are congruent to three modulo four, otherwise it evaluates to one. | |
9877 | |
9878 By step nine if $a'$ does not equal one a recursion is required. Step 9.1 computes $p' \equiv p \mbox{ (mod }a'\mbox{)}$ and will recurse to compute | |
9879 $\left ( {p' \over a'} \right )$ which is multiplied against the current Jacobi product. | |
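
The steps above can be sketched on machine words as follows. The name jacobi\_ul is an assumption for this illustration; the structure follows the pseudo-code step by step and is not the library routine itself.

```c
#include <assert.h>

/* Jacobi symbol (a | p) for a >= 0 and odd p >= 3, following the
 * steps of the pseudo-code.  Hypothetical word-sized helper. */
int jacobi_ul(unsigned long a, unsigned long p)
{
    unsigned long a1, r;
    unsigned k = 0;
    int s;

    assert(p >= 3 && (p & 1UL) == 1);

    /* steps 1 and 2: trivial cases */
    if (a == 0) {
        return 0;
    }
    if (a == 1) {
        return 1;
    }

    /* steps 3 to 5: write a = a1 * 2^k with a1 odd */
    a1 = a;
    while ((a1 & 1UL) == 0) {
        a1 >>= 1;
        ++k;
    }

    /* steps 6 and 7: evaluate (2 | p)^k */
    if ((k & 1U) == 0) {
        s = 1;
    } else {
        r = p & 7UL;
        s = (r == 1 || r == 7) ? 1 : -1;
    }

    /* step 8: sign change only if p and a1 are both 3 modulo 4 */
    if ((p & 3UL) == 3 && (a1 & 3UL) == 3) {
        s = -s;
    }

    /* step 9: recurse on (p mod a1 | a1); a1 is odd and at least 3 here */
    if (a1 != 1) {
        s *= jacobi_ul(p % a1, a1);
    }
    return s;
}
```

If $a$ and $p$ share a common factor the recursion eventually reaches a first argument of zero and the result is $0$, for example with $a = 5$ and $p = 15$.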
9880 | |
9881 \vspace{+3mm}\begin{small} | |
9882 \hspace{-5.1mm}{\bf File}: bn\_mp\_jacobi.c | |
9883 \vspace{-3mm} | |
9884 \begin{alltt} | |
9885 016 | |
9886 017 /* computes the jacobi c = (a | n) (or Legendre if n is prime) | |
9887 018 * HAC pp. 73 Algorithm 2.149 | |
9888 019 */ | |
9889 020 int mp_jacobi (mp_int * a, mp_int * p, int *c) | |
9890 021 \{ | |
9891 022 mp_int a1, p1; | |
9892 023 int k, s, r, res; | |
9893 024 mp_digit residue; | |
9894 025 | |
9895 026 /* if p <= 0 return MP_VAL */ | |
9896 027 if (mp_cmp_d(p, 0) != MP_GT) \{ | |
9897 028 return MP_VAL; | |
9898 029 \} | |
9899 030 | |
9900 031 /* step 1. if a == 0, return 0 */ | |
9901 032 if (mp_iszero (a) == 1) \{ | |
9902 033 *c = 0; | |
9903 034 return MP_OKAY; | |
9904 035 \} | |
9905 036 | |
9906 037 /* step 2. if a == 1, return 1 */ | |
9907 038 if (mp_cmp_d (a, 1) == MP_EQ) \{ | |
9908 039 *c = 1; | |
9909 040 return MP_OKAY; | |
9910 041 \} | |
9911 042 | |
9912 043 /* default */ | |
9913 044 s = 0; | |
9914 045 | |
9915 046 /* step 3. write a = a1 * 2**k */ | |
9916 047 if ((res = mp_init_copy (&a1, a)) != MP_OKAY) \{ | |
9917 048 return res; | |
9918 049 \} | |
9919 050 | |
9920 051 if ((res = mp_init (&p1)) != MP_OKAY) \{ | |
9921 052 goto __A1; | |
9922 053 \} | |
9923 054 | |
9924 055 /* divide out larger power of two */ | |
9925 056 k = mp_cnt_lsb(&a1); | |
9926 057 if ((res = mp_div_2d(&a1, k, &a1, NULL)) != MP_OKAY) \{ | |
9927 058 goto __P1; | |
9928 059 \} | |
9929 060 | |
9930 061 /* step 4. if k is even set s=1 */ | |
9931 062 if ((k & 1) == 0) \{ | |
9932 063 s = 1; | |
9933 064 \} else \{ | |
9934 065 /* else set s=1 if p = 1/7 (mod 8) or s=-1 if p = 3/5 (mod 8) */ | |
9935 066 residue = p->dp[0] & 7; | |
9936 067 | |
9937 068 if (residue == 1 || residue == 7) \{ | |
9938 069 s = 1; | |
9939 070 \} else if (residue == 3 || residue == 5) \{ | |
9940 071 s = -1; | |
9941 072 \} | |
9942 073 \} | |
9943 074 | |
9944 075 /* step 5. if p == 3 (mod 4) *and* a1 == 3 (mod 4) then s = -s */ | |
9945 076 if ( ((p->dp[0] & 3) == 3) && ((a1.dp[0] & 3) == 3)) \{ | |
9946 077 s = -s; | |
9947 078 \} | |
9948 079 | |
9949 080 /* if a1 == 1 we're done */ | |
9950 081 if (mp_cmp_d (&a1, 1) == MP_EQ) \{ | |
9951 082 *c = s; | |
9952 083 \} else \{ | |
9953 084 /* p1 = p mod a1 */ | |
9954 085 if ((res = mp_mod (p, &a1, &p1)) != MP_OKAY) \{ | |
9955 086 goto __P1; | |
9956 087 \} | |
9957 088 if ((res = mp_jacobi (&p1, &a1, &r)) != MP_OKAY) \{ | |
9958 089 goto __P1; | |
9959 090 \} | |
9960 091 *c = s * r; | |
9961 092 \} | |
9962 093 | |
9963 094 /* done */ | |
9964 095 res = MP_OKAY; | |
9965 096 __P1:mp_clear (&p1); | |
9966 097 __A1:mp_clear (&a1); | |
9967 098 return res; | |
9968 099 \} | |
9969 \end{alltt} | |
9970 \end{small} | |
9971 | |
9972 As a matter of practicality the variable $a'$ as per the pseudo-code is represented by the variable $a1$ since the $'$ symbol is not a valid | |
9973 character in a C variable name. | |
9974 | |
9975 The two simple cases of $a = 0$ and $a = 1$ are handled at the very beginning to simplify the algorithm. If the input is non-trivial the algorithm | |
9976 has to proceed to compute the Jacobi symbol. The variable $s$ is used to hold the current Jacobi product. Note that $s$ is merely a C ``int'' data type since | |
9977 the values it may take are merely $-1$, $0$ and $1$. | |
9978 | |
9979 After a local copy of $a$ is made all of the factors of two are divided out and the total stored in $k$. Technically only the least significant | |
9980 bit of $k$ is required, however, it makes the algorithm simpler to follow to perform an addition. In practice an exclusive-or and addition have the same | |
9981 processor requirements and neither is faster than the other. | |
9982 | |
9983 Lines 61 through 73 determine the value of $\left ( { 2 \over p } \right )^k$. If the least significant bit of $k$ is zero then | |
9984 $k$ is even and the value is one. Otherwise, the value of $s$ depends on which residue class $p$ belongs to modulo eight. The value of | |
9985 $(-1)^{(p-1)(a'-1)/4}$ is computed and multiplied against $s$ on lines 75 through 78. | |
9986 | |
9987 Finally, if $a1$ does not equal one the algorithm must recurse and compute $\left ( {p' \over a'} \right )$. | |
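
The steps described above can be sketched on ordinary machine words. The following is only an illustration and not the library routine: the name jacobi\_u64 is hypothetical, the inputs are assumed to fit in a 64-bit word, and $p$ is assumed to be odd and positive.

```c
#include <stdint.h>

/* Hypothetical helper (not part of LibTomMath): Jacobi symbol (a | p)
 * for a >= 0 and odd p > 0 on 64-bit words. Returns -1, 0 or 1. */
int jacobi_u64(uint64_t a, uint64_t p)
{
    uint64_t a1, residue;
    int k = 0, s;

    /* steps 1 and 2: the trivial cases */
    if (a == 0) return 0;
    if (a == 1) return 1;

    /* step 3: write a = a1 * 2**k with a1 odd */
    a1 = a;
    while ((a1 & 1) == 0) {
        a1 >>= 1;
        ++k;
    }

    /* step 4: (2 | p)**k depends only on the parity of k and on p mod 8 */
    if ((k & 1) == 0) {
        s = 1;
    } else {
        residue = p & 7;
        s = (residue == 1 || residue == 7) ? 1 : -1;
    }

    /* step 5: flip the sign only if p == 3 (mod 4) and a1 == 3 (mod 4) */
    if ((p & 3) == 3 && (a1 & 3) == 3) {
        s = -s;
    }

    /* recurse on (p mod a1 | a1) unless a1 == 1 */
    if (a1 == 1) return s;
    return s * jacobi_u64(p % a1, a1);
}
```

For instance, jacobi\_u64(3, 7) follows the recursive branch once, because $a' = 3$ is not one, and the sign is flipped since both three and seven are congruent to three modulo four.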
9988 | |
9989 \textit{-- Comment about default $s$ and such...} | |
9990 | |
9991 \section{Modular Inverse} | |
9992 \label{sec:modinv} | |
9993 The modular inverse of a number actually refers to the modular multiplicative inverse. Essentially for any integer $a$ such that $(a, p) = 1$ there | |
9994 exists another integer $b$ such that $ab \equiv 1 \mbox{ (mod }p\mbox{)}$. The integer $b$ is called the multiplicative inverse of $a$ which is | |
9995 denoted as $b = a^{-1}$. Technically speaking modular inversion is a well defined operation for any finite ring or field, not just for rings and | |
9996 fields of integers. However, only the integer case will be the matter of discussion here. | |
9997 | |
9998 The simplest approach is to compute the algebraic inverse of the input, that is, to compute $b \equiv a^{\Phi(p) - 1} \mbox{ (mod }p\mbox{)}$. If $\Phi(p)$ is the | |
9999 order of the multiplicative subgroup modulo $p$ then $b$ must be the multiplicative inverse of $a$. The proof is trivial. | |
10000 | |
10001 \begin{equation} | |
10002 ab \equiv a \left (a^{\Phi(p) - 1} \right ) \equiv a^{\Phi(p)} \equiv a^0 \equiv 1 \mbox{ (mod }p\mbox{)} | |
10003 \end{equation} | |
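
As a small worked instance (the values $p = 7$ and $a = 3$ are chosen here purely for illustration), $\Phi(7) = 6$ and therefore

\begin{equation}
b \equiv 3^{6 - 1} = 243 = 34 \cdot 7 + 5 \equiv 5 \mbox{ (mod }7\mbox{)}
\end{equation}

which can be checked directly since $3 \cdot 5 = 15 \equiv 1 \mbox{ (mod }7\mbox{)}$.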
10004 | |
10005 However, as simple as this approach may be it has two serious flaws. It requires that the value of $\Phi(p)$ be known which if $p$ is composite | |
10006 requires all of the prime factors. This approach also is very slow as the size of $p$ grows. | |
10007 | |
10008 A simpler approach is based on the observation that solving for the multiplicative inverse is equivalent to solving the linear | |
10009 Diophantine\footnote{See LeVeque \cite[pp. 40-43]{LeVeque} for more information.} equation. | |
10010 | |
10011 \begin{equation} | |
10012 ab + pq = 1 | |
10013 \end{equation} | |
10014 | |
10015 Where $a$, $b$, $p$ and $q$ are all integers. If such a pair of integers $ \left < b, q \right >$ exists then $b$ is the multiplicative inverse of | |
10016 $a$ modulo $p$. The extended Euclidean algorithm (Knuth \cite[pp. 342]{TAOCPV2}) can be used to solve such equations provided $(a, p) = 1$. | |
10017 However, instead of using that algorithm directly a variant known as the binary Extended Euclidean algorithm will be used in its place. The | |
10018 binary approach is very similar to the binary greatest common divisor algorithm except it will produce a full solution to the Diophantine | |
10019 equation. | |
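
For illustration only (the values are not taken from the library), let $a = 5$ and $p = 12$. The pair $\left < b, q \right > = \left < 5, -2 \right >$ solves the equation since

\begin{equation}
5 \cdot 5 + 12 \cdot (-2) = 25 - 24 = 1
\end{equation}

and so $5^{-1} \equiv 5 \mbox{ (mod }12\mbox{)}$. Note that the modulus in this example is even, so neither operand may be assumed odd when solving the equation.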
10020 | |
10021 \subsection{General Case} | |
10022 \newpage\begin{figure}[!here] | |
10023 \begin{small} | |
10024 \begin{center} | |
10025 \begin{tabular}{l} | |
10026 \hline Algorithm \textbf{mp\_invmod}. \\ | |
10027 \textbf{Input}. mp\_int $a$ and $b$, $(a, b) = 1$, $b \ge 2$, $0 < a < b$. \\ | |
10028 \textbf{Output}. The modular inverse $c \equiv a^{-1} \mbox{ (mod }b\mbox{)}$. \\ | |
10029 \hline \\ | |
10030 1. If $b \le 0$ then return(\textit{MP\_VAL}). \\ | |
10031 2. If $b_0 \equiv 1 \mbox{ (mod }2\mbox{)}$ then use algorithm fast\_mp\_invmod. \\ | |
10032 3. $x \leftarrow \vert a \vert, y \leftarrow b$ \\ | |
10033 4. If $x_0 \equiv y_0 \equiv 0 \mbox{ (mod }2\mbox{)}$ then return(\textit{MP\_VAL}). \\ | |
10034 5. $u \leftarrow x, v \leftarrow y, A \leftarrow 1, B \leftarrow 0, C \leftarrow 0, D \leftarrow 1$ \\ | |
10035 6. While $u.used > 0$ and $u_0 \equiv 0 \mbox{ (mod }2\mbox{)}$ \\ | |
10036 \hspace{3mm}6.1 $u \leftarrow \lfloor u / 2 \rfloor$ \\ | |
10037 \hspace{3mm}6.2 If ($A.used > 0$ and $A_0 \equiv 1 \mbox{ (mod }2\mbox{)}$) or ($B.used > 0$ and $B_0 \equiv 1 \mbox{ (mod }2\mbox{)}$) then \\ | |
10038 \hspace{6mm}6.2.1 $A \leftarrow A + y$ \\ | |
10039 \hspace{6mm}6.2.2 $B \leftarrow B - x$ \\ | |
10040 \hspace{3mm}6.3 $A \leftarrow \lfloor A / 2 \rfloor$ \\ | |
10041 \hspace{3mm}6.4 $B \leftarrow \lfloor B / 2 \rfloor$ \\ | |
10042 7. While $v.used > 0$ and $v_0 \equiv 0 \mbox{ (mod }2\mbox{)}$ \\ | |
10043 \hspace{3mm}7.1 $v \leftarrow \lfloor v / 2 \rfloor$ \\ | |
10044 \hspace{3mm}7.2 If ($C.used > 0$ and $C_0 \equiv 1 \mbox{ (mod }2\mbox{)}$) or ($D.used > 0$ and $D_0 \equiv 1 \mbox{ (mod }2\mbox{)}$) then \\ | |
10045 \hspace{6mm}7.2.1 $C \leftarrow C + y$ \\ | |
10046 \hspace{6mm}7.2.2 $D \leftarrow D - x$ \\ | |
10047 \hspace{3mm}7.3 $C \leftarrow \lfloor C / 2 \rfloor$ \\ | |
10048 \hspace{3mm}7.4 $D \leftarrow \lfloor D / 2 \rfloor$ \\ | |
10049 8. If $u \ge v$ then \\ | |
10050 \hspace{3mm}8.1 $u \leftarrow u - v$ \\ | |
10051 \hspace{3mm}8.2 $A \leftarrow A - C$ \\ | |
10052 \hspace{3mm}8.3 $B \leftarrow B - D$ \\ | |
10053 9. else \\ | |
10054 \hspace{3mm}9.1 $v \leftarrow v - u$ \\ | |
10055 \hspace{3mm}9.2 $C \leftarrow C - A$ \\ | |
10056 \hspace{3mm}9.3 $D \leftarrow D - B$ \\ | |
10057 10. If $u \ne 0$ goto step 6. \\ | |
10058 11. If $v \ne 1$ return(\textit{MP\_VAL}). \\ | |
10059 12. While $C < 0$ do \\ | |
10060 \hspace{3mm}12.1 $C \leftarrow C + b$ \\ | |
10061 13. While $C \ge b$ do \\ | |
10062 \hspace{3mm}13.1 $C \leftarrow C - b$ \\ | |
10063 14. $c \leftarrow C$ \\ | |
10064 15. Return(\textit{MP\_OKAY}). \\ | |
10065 \hline | |
10066 \end{tabular} | |
10067 \end{center} | |
10068 \end{small} | |
10069 \end{figure} | |
10070 \textbf{Algorithm mp\_invmod.} | |
10071 This algorithm computes the modular multiplicative inverse of an integer $a$ modulo an integer $b$. This algorithm is a variation of the | |
10072 extended binary Euclidean algorithm from HAC \cite[pp. 608]{HAC}. It has been modified to only compute the modular inverse and not a complete | |
10073 Diophantine solution. | |
10074 | |
10075 If $b \le 0$ then the modulus is invalid and MP\_VAL is returned. Similarly if both $a$ and $b$ are even then there cannot be a multiplicative | |
10076 inverse for $a$ and the error is reported. | |
10077 | |
10078 The astute reader will observe that steps six through nine are very similar to the binary greatest common divisor algorithm mp\_gcd. In this case | |
10079 the remaining variables of the Diophantine equation are solved for as well. The algorithm terminates when $u = 0$ in which case the solution is | |
10080 | |
10081 \begin{equation} | |
10082 Ca + Db = v | |
10083 \end{equation} | |
10084 | |
10085 If $v$, the greatest common divisor of $a$ and $b$ is not equal to one then the algorithm will report an error as no inverse exists. Otherwise, $C$ | |
10086 is the modular inverse of $a$. The actual value of $C$ is congruent to, but not necessarily equal to, the ideal modular inverse which should lie | |
10087 within $1 \le a^{-1} < b$. Step numbers twelve and thirteen adjust the inverse until it is in range. If the original input $a$ is within $0 < a < b$ | |
10088 then only a couple of additions or subtractions will be required to adjust the inverse. | |
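
The flow of the algorithm can be followed more easily on machine words. The sketch below is an assumption-laden illustration rather than the library code: the name invmod\_i64 is hypothetical, both inputs are assumed to be positive and small enough that the intermediate sums do not overflow a signed 64-bit word, and errors are signalled with $-1$ instead of MP\_VAL.

```c
#include <stdint.h>

/* Hypothetical helper (not part of LibTomMath): computes c = 1/a mod b
 * with the binary extended Euclidean algorithm on signed 64-bit words.
 * Returns 0 on success or -1 if the inputs are invalid or no inverse exists. */
int invmod_i64(int64_t a, int64_t b, int64_t *c)
{
    int64_t x, y, u, v, A = 1, B = 0, C = 0, D = 1;

    if (a <= 0 || b <= 0) return -1;

    x = a;
    y = b;

    /* both even: the gcd is at least two, so no inverse exists */
    if (((x | y) & 1) == 0) return -1;

    u = x;
    v = y;

    /* invariants: A*x + B*y == u and C*x + D*y == v */
    do {
        while ((u & 1) == 0) {
            u /= 2;
            if ((A & 1) || (B & 1)) { A += y; B -= x; }
            A /= 2;
            B /= 2;
        }
        while ((v & 1) == 0) {
            v /= 2;
            if ((C & 1) || (D & 1)) { C += y; D -= x; }
            C /= 2;
            D /= 2;
        }
        if (u >= v) { u -= v; A -= C; B -= D; }
        else        { v -= u; C -= A; D -= B; }
    } while (u != 0);

    /* v now holds gcd(a, b); an inverse exists only if it equals one */
    if (v != 1) return -1;

    /* bring C into the range 0 <= C < b */
    while (C < 0)  C += b;
    while (C >= b) C -= b;

    *c = C;
    return 0;
}
```

With $a = 3$ and $b = 10$ the loop ends with $C = 7$, which is already in range, and indeed $3 \cdot 7 = 21 \equiv 1 \mbox{ (mod }10\mbox{)}$.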
10089 | |
10090 \vspace{+3mm}\begin{small} | |
10091 \hspace{-5.1mm}{\bf File}: bn\_mp\_invmod.c | |
10092 \vspace{-3mm} | |
10093 \begin{alltt} | |
10094 016 | |
10095 017 /* hac 14.61, pp608 */ | |
10096 018 int mp_invmod (mp_int * a, mp_int * b, mp_int * c) | |
10097 019 \{ | |
10098 020 mp_int x, y, u, v, A, B, C, D; | |
10099 021 int res; | |
10100 022 | |
10101 023 /* b cannot be negative */ | |
10102 024 if (b->sign == MP_NEG || mp_iszero(b) == 1) \{ | |
10103 025 return MP_VAL; | |
10104 026 \} | |
10105 027 | |
10106 028 /* if the modulus is odd we can use a faster routine instead */ | |
10107 029 if (mp_isodd (b) == 1) \{ | |
10108 030 return fast_mp_invmod (a, b, c); | |
10109 031 \} | |
10110 032 | |
10111 033 /* init temps */ | |
10112 034 if ((res = mp_init_multi(&x, &y, &u, &v, | |
10113 035 &A, &B, &C, &D, NULL)) != MP_OKAY) \{ | |
10114 036 return res; | |
10115 037 \} | |
10116 038 | |
10117 039 /* x = a, y = b */ | |
10118 040 if ((res = mp_copy (a, &x)) != MP_OKAY) \{ | |
10119 041 goto __ERR; | |
10120 042 \} | |
10121 043 if ((res = mp_copy (b, &y)) != MP_OKAY) \{ | |
10122 044 goto __ERR; | |
10123 045 \} | |
10124 046 | |
10125 047 /* 2. [modified] if x,y are both even then return an error! */ | |
10126 048 if (mp_iseven (&x) == 1 && mp_iseven (&y) == 1) \{ | |
10127 049 res = MP_VAL; | |
10128 050 goto __ERR; | |
10129 051 \} | |
10130 052 | |
10131 053 /* 3. u=x, v=y, A=1, B=0, C=0,D=1 */ | |
10132 054 if ((res = mp_copy (&x, &u)) != MP_OKAY) \{ | |
10133 055 goto __ERR; | |
10134 056 \} | |
10135 057 if ((res = mp_copy (&y, &v)) != MP_OKAY) \{ | |
10136 058 goto __ERR; | |
10137 059 \} | |
10138 060 mp_set (&A, 1); | |
10139 061 mp_set (&D, 1); | |
10140 062 | |
10141 063 top: | |
10142 064 /* 4. while u is even do */ | |
10143 065 while (mp_iseven (&u) == 1) \{ | |
10144 066 /* 4.1 u = u/2 */ | |
10145 067 if ((res = mp_div_2 (&u, &u)) != MP_OKAY) \{ | |
10146 068 goto __ERR; | |
10147 069 \} | |
10148 070 /* 4.2 if A or B is odd then */ | |
10149 071 if (mp_isodd (&A) == 1 || mp_isodd (&B) == 1) \{ | |
10150 072 /* A = (A+y)/2, B = (B-x)/2 */ | |
10151 073 if ((res = mp_add (&A, &y, &A)) != MP_OKAY) \{ | |
10152 074 goto __ERR; | |
10153 075 \} | |
10154 076 if ((res = mp_sub (&B, &x, &B)) != MP_OKAY) \{ | |
10155 077 goto __ERR; | |
10156 078 \} | |
10157 079 \} | |
10158 080 /* A = A/2, B = B/2 */ | |
10159 081 if ((res = mp_div_2 (&A, &A)) != MP_OKAY) \{ | |
10160 082 goto __ERR; | |
10161 083 \} | |
10162 084 if ((res = mp_div_2 (&B, &B)) != MP_OKAY) \{ | |
10163 085 goto __ERR; | |
10164 086 \} | |
10165 087 \} | |
10166 088 | |
10167 089 /* 5. while v is even do */ | |
10168 090 while (mp_iseven (&v) == 1) \{ | |
10169 091 /* 5.1 v = v/2 */ | |
10170 092 if ((res = mp_div_2 (&v, &v)) != MP_OKAY) \{ | |
10171 093 goto __ERR; | |
10172 094 \} | |
10173 095 /* 5.2 if C or D is odd then */ | |
10174 096 if (mp_isodd (&C) == 1 || mp_isodd (&D) == 1) \{ | |
10175 097 /* C = (C+y)/2, D = (D-x)/2 */ | |
10176 098 if ((res = mp_add (&C, &y, &C)) != MP_OKAY) \{ | |
10177 099 goto __ERR; | |
10178 100 \} | |
10179 101 if ((res = mp_sub (&D, &x, &D)) != MP_OKAY) \{ | |
10180 102 goto __ERR; | |
10181 103 \} | |
10182 104 \} | |
10183 105 /* C = C/2, D = D/2 */ | |
10184 106 if ((res = mp_div_2 (&C, &C)) != MP_OKAY) \{ | |
10185 107 goto __ERR; | |
10186 108 \} | |
10187 109 if ((res = mp_div_2 (&D, &D)) != MP_OKAY) \{ | |
10188 110 goto __ERR; | |
10189 111 \} | |
10190 112 \} | |
10191 113 | |
10192 114 /* 6. if u >= v then */ | |
10193 115 if (mp_cmp (&u, &v) != MP_LT) \{ | |
10194 116 /* u = u - v, A = A - C, B = B - D */ | |
10195 117 if ((res = mp_sub (&u, &v, &u)) != MP_OKAY) \{ | |
10196 118 goto __ERR; | |
10197 119 \} | |
10198 120 | |
10199 121 if ((res = mp_sub (&A, &C, &A)) != MP_OKAY) \{ | |
10200 122 goto __ERR; | |
10201 123 \} | |
10202 124 | |
10203 125 if ((res = mp_sub (&B, &D, &B)) != MP_OKAY) \{ | |
10204 126 goto __ERR; | |
10205 127 \} | |
10206 128 \} else \{ | |
10207 129 /* v = v - u, C = C - A, D = D - B */ | |
10208 130 if ((res = mp_sub (&v, &u, &v)) != MP_OKAY) \{ | |
10209 131 goto __ERR; | |
10210 132 \} | |
10211 133 | |
10212 134 if ((res = mp_sub (&C, &A, &C)) != MP_OKAY) \{ | |
10213 135 goto __ERR; | |
10214 136 \} | |
10215 137 | |
10216 138 if ((res = mp_sub (&D, &B, &D)) != MP_OKAY) \{ | |
10217 139 goto __ERR; | |
10218 140 \} | |
10219 141 \} | |
10220 142 | |
10221 143 /* if not zero goto step 4 */ | |
10222 144 if (mp_iszero (&u) == 0) | |
10223 145 goto top; | |
10224 146 | |
10225 147 /* now a = C, b = D, gcd == g*v */ | |
10226 148 | |
10227 149 /* if v != 1 then there is no inverse */ | |
10228 150 if (mp_cmp_d (&v, 1) != MP_EQ) \{ | |
10229 151 res = MP_VAL; | |
10230 152 goto __ERR; | |
10231 153 \} | |
10232 154 | |
10233 155 /* if its too low */ | |
10234 156 while (mp_cmp_d(&C, 0) == MP_LT) \{ | |
10235 157 if ((res = mp_add(&C, b, &C)) != MP_OKAY) \{ | |
10236 158 goto __ERR; | |
10237 159 \} | |
10238 160 \} | |
10239 161 | |
10240 162 /* too big */ | |
10241 163 while (mp_cmp_mag(&C, b) != MP_LT) \{ | |
10242 164 if ((res = mp_sub(&C, b, &C)) != MP_OKAY) \{ | |
10243 165 goto __ERR; | |
10244 166 \} | |
10245 167 \} | |
10246 168 | |
10247 169 /* C is now the inverse */ | |
10248 170 mp_exch (&C, c); | |
10249 171 res = MP_OKAY; | |
10250 172 __ERR:mp_clear_multi (&x, &y, &u, &v, &A, &B, &C, &D, NULL); | |
10251 173 return res; | |
10252 174 \} | |
10253 \end{alltt} | |
10254 \end{small} | |
10255 | |
10256 \subsubsection{Odd Moduli} | |
10257 | |
10258 When the modulus $b$ is odd the variables $A$ and $C$ are fixed and are not required to compute the inverse. In particular by attempting to solve | |
10259 the Diophantine $Cb + Da = 1$ only $B$ and $D$ are required to find the inverse of $a$. | |
10260 | |
10261 The algorithm fast\_mp\_invmod is a direct adaptation of algorithm mp\_invmod with all steps involving either $A$ or $C$ removed. This | |
10262 optimization will halve the time required to compute the modular inverse. | |
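
A reduced loop of this kind might look as follows on machine words. This is a sketch under stated assumptions and not the source of fast\_mp\_invmod: the name fast\_invmod\_i64 is hypothetical, and it assumes the roles $x \leftarrow b$ (the odd modulus) and $y \leftarrow a \mbox{ mod } b$, as suggested by the Diophantine equation $Cb + Da = 1$ above, so that only $B$ and $D$ have to be tracked.

```c
#include <stdint.h>

/* Hypothetical helper (not part of LibTomMath): computes c = 1/a mod b
 * for an odd modulus b > 0, tracking only the coefficients B and D.
 * Returns 0 on success or -1 if b is invalid or no inverse exists. */
int fast_invmod_i64(int64_t a, int64_t b, int64_t *c)
{
    int64_t x, y, u, v, B = 0, D = 1;

    /* the modulus must be positive and odd */
    if (b <= 0 || (b & 1) == 0) return -1;

    x = b;
    y = a % b;
    if (y < 0) y += b;
    if (y == 0) return -1;

    u = x;
    v = y;

    /* invariants modulo x: B*y == u and D*y == v */
    do {
        while ((u & 1) == 0) {
            u /= 2;
            if (B & 1) B -= x;   /* x is odd, so B - x is even */
            B /= 2;
        }
        while ((v & 1) == 0) {
            v /= 2;
            if (D & 1) D -= x;
            D /= 2;
        }
        if (u >= v) { u -= v; B -= D; }
        else        { v -= u; D -= B; }
    } while (u != 0);

    /* v holds gcd(a, b) */
    if (v != 1) return -1;

    /* bring D into the range 0 <= D < b */
    while (D < 0)  D += b;
    while (D >= b) D -= b;

    *c = D;
    return 0;
}
```

Because the modulus is odd, a single parity test on $B$ (or $D$) is enough to decide whether the modulus must be subtracted before halving, which is where the saving over the general routine comes from.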
10263 | |
10264 \section{Primality Tests} | |
10265 | |
10266 An integer $a > 1$ is said to be prime if it is not divisible by any positive integer other than one and itself. For example, $a = 7$ is prime | |
10267 since the integers $2 \ldots 6$ do not evenly divide $a$. By contrast, $a = 6$ is not prime since $a = 6 = 2 \cdot 3$. | |
10268 | |
10269 Prime numbers arise frequently in cryptography as they allow finite fields to be formed. The ability to quickly determine whether an integer is prime or | |
10270 not has been an active subject in cryptography and number theory for considerable time. The algorithms that will be presented are all | |
10271 probabilistic algorithms in that when they report an integer is composite it must be composite. However, when the algorithms report an integer is | |
10272 prime the algorithm may be incorrect. | |
10273 | |
10274 As will be discussed it is possible to limit the probability of error so well that for practical purposes the probability of error might as | |
10275 well be zero. For the purposes of these discussions let $n$ represent the candidate integer of which the primality is in question. | |
10276 | |
10277 \subsection{Trial Division} | |
10278 | |
10279 Trial division means to attempt to evenly divide a candidate integer by small prime integers. If the candidate can be evenly divided it obviously | |
10280 cannot be prime. By dividing by all primes $1 < p \le \sqrt{n}$ this test can actually prove whether an integer is prime. However, such a test | |
10281 would require a prohibitive amount of time as $n$ grows. | |
10282 | |
10283 Instead of dividing by every prime, a smaller, more manageable set of primes may be used. By performing trial division with only a subset | |
10284 of the primes less than $\sqrt{n} + 1$ the algorithm cannot prove if a candidate is prime. However, often it can prove a candidate is not prime. | |
10285 | |
10286 The benefit of this test is that trial division by small values is fairly efficient, especially when compared to the other algorithms that will be | |
10287 discussed shortly. The probability that this approach correctly identifies a composite candidate when tested with all primes up to $q$ is given by | |
10288 $1 - {1.12 \over \ln(q)}$. The graph (\ref{pic:primality}, will be added later) demonstrates the probability of success for the range | |
10289 $3 \le q \le 100$. | |
10290 | |
10291 At approximately $q = 30$ the gain of performing further tests diminishes fairly quickly. At $q = 90$ further testing is generally not going to | |
10292 be of any practical use. In the case of LibTomMath the default limit $q = 256$ was chosen since it is not too high and will eliminate | |
10293 approximately $80\%$ of all candidate integers. The constant \textbf{PRIME\_SIZE} is equal to the number of primes in the test base. The | |
10294 array \_\_prime\_tab is an array of the first \textbf{PRIME\_SIZE} prime numbers. | |
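
To make the diminishing returns concrete, the expression $1 - {1.12 \over \ln(q)}$ can be evaluated at the three limits mentioned above. The figures below are rounded to three decimal places and were computed here only as an illustration.

\begin{center}
\begin{tabular}{|c|c|c|}
\hline $q$ & $\ln(q)$ & $1 - {1.12 \over \ln(q)}$ \\
\hline $30$ & $3.401$ & $0.671$ \\
\hline $90$ & $4.500$ & $0.751$ \\
\hline $256$ & $5.545$ & $0.798$ \\
\hline
\end{tabular}
\end{center}

Tripling the limit from $30$ to $90$ gains roughly eight percentage points, while nearly tripling it again to $256$ gains less than five.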
10295 | |
10296 \begin{figure}[!here] | |
10297 \begin{small} | |
10298 \begin{center} | |
10299 \begin{tabular}{l} | |
10300 \hline Algorithm \textbf{mp\_prime\_is\_divisible}. \\ | |
10301 \textbf{Input}. mp\_int $n$ \\ | |
10302 \textbf{Output}. $c = 1$ if $n$ is divisible by a small prime, otherwise $c = 0$. \\ | |
10303 \hline \\ | |
10304 1. for $ix$ from $0$ to $PRIME\_SIZE - 1$ do \\ | |
10305 \hspace{3mm}1.1 $d \leftarrow n \mbox{ (mod }\_\_prime\_tab_{ix}\mbox{)}$ \\ | |
10306 \hspace{3mm}1.2 If $d = 0$ then \\ | |
10307 \hspace{6mm}1.2.1 $c \leftarrow 1$ \\ | |
10308 \hspace{6mm}1.2.2 Return(\textit{MP\_OKAY}). \\ | |
10309 2. $c \leftarrow 0$ \\ | |
10310 3. Return(\textit{MP\_OKAY}). \\ | |
10311 \hline | |
10312 \end{tabular} | |
10313 \end{center} | |
10314 \end{small} | |
10315 \caption{Algorithm mp\_prime\_is\_divisible} | |
10316 \end{figure} | |
10317 \textbf{Algorithm mp\_prime\_is\_divisible.} | |
10318 This algorithm attempts to determine if a candidate integer $n$ is composite by performing trial divisions. | |
10319 | |
10320 \vspace{+3mm}\begin{small} | |
10321 \hspace{-5.1mm}{\bf File}: bn\_mp\_prime\_is\_divisible.c | |
10322 \vspace{-3mm} | |
10323 \begin{alltt} | |
10324 016 | |
10325 017 /* determines if an integer is divisible by one | |
10326 018 * of the first PRIME_SIZE primes or not | |
10327 019 * | |
10328 020 * sets result to 0 if not, 1 if yes | |
10329 021 */ | |
10330 022 int mp_prime_is_divisible (mp_int * a, int *result) | |
10331 023 \{ | |
10332 024 int err, ix; | |
10333 025 mp_digit res; | |
10334 026 | |
10335 027 /* default to not */ | |
10336 028 *result = MP_NO; | |
10337 029 | |
10338 030 for (ix = 0; ix < PRIME_SIZE; ix++) \{ | |
10339 031 /* what is a mod __prime_tab[ix] */ | |
10340 032 if ((err = mp_mod_d (a, __prime_tab[ix], &res)) != MP_OKAY) \{ | |
10341 033 return err; | |
10342 034 \} | |
10343 035 | |
10344 036 /* is the residue zero? */ | |
10345 037 if (res == 0) \{ | |
10346 038 *result = MP_YES; | |
10347 039 return MP_OKAY; | |
10348 040 \} | |
10349 041 \} | |
10350 042 | |
10351 043 return MP_OKAY; | |
10352 044 \} | |
10353 \end{alltt} | |
10354 \end{small} | |
10355 | |
10356 The algorithm defaults to a return of $0$ in case an error occurs. The values in the prime table are all specified to be in the range of a | |
10357 mp\_digit. The table \_\_prime\_tab is defined in the following file. | |
10358 | |
10359 \vspace{+3mm}\begin{small} | |
10360 \hspace{-5.1mm}{\bf File}: bn\_prime\_tab.c | |
10361 \vspace{-3mm} | |
10362 \begin{alltt} | |
10363 016 const mp_digit __prime_tab[] = \{ | |
10364 017 0x0002, 0x0003, 0x0005, 0x0007, 0x000B, 0x000D, 0x0011, 0x0013, | |
10365 018 0x0017, 0x001D, 0x001F, 0x0025, 0x0029, 0x002B, 0x002F, 0x0035, | |
10366 019 0x003B, 0x003D, 0x0043, 0x0047, 0x0049, 0x004F, 0x0053, 0x0059, | |
10367 020 0x0061, 0x0065, 0x0067, 0x006B, 0x006D, 0x0071, 0x007F, | |
10368 021 #ifndef MP_8BIT | |
10369 022 0x0083, | |
10370 023 0x0089, 0x008B, 0x0095, 0x0097, 0x009D, 0x00A3, 0x00A7, 0x00AD, | |
10371 024 0x00B3, 0x00B5, 0x00BF, 0x00C1, 0x00C5, 0x00C7, 0x00D3, 0x00DF, | |
10372 025 0x00E3, 0x00E5, 0x00E9, 0x00EF, 0x00F1, 0x00FB, 0x0101, 0x0107, | |
10373 026 0x010D, 0x010F, 0x0115, 0x0119, 0x011B, 0x0125, 0x0133, 0x0137, | |
10374 027 | |
10375 028 0x0139, 0x013D, 0x014B, 0x0151, 0x015B, 0x015D, 0x0161, 0x0167, | |
10376 029 0x016F, 0x0175, 0x017B, 0x017F, 0x0185, 0x018D, 0x0191, 0x0199, | |
10377 030 0x01A3, 0x01A5, 0x01AF, 0x01B1, 0x01B7, 0x01BB, 0x01C1, 0x01C9, | |
10378 031 0x01CD, 0x01CF, 0x01D3, 0x01DF, 0x01E7, 0x01EB, 0x01F3, 0x01F7, | |
10379 032 0x01FD, 0x0209, 0x020B, 0x021D, 0x0223, 0x022D, 0x0233, 0x0239, | |
10380 033 0x023B, 0x0241, 0x024B, 0x0251, 0x0257, 0x0259, 0x025F, 0x0265, | |
10381 034 0x0269, 0x026B, 0x0277, 0x0281, 0x0283, 0x0287, 0x028D, 0x0293, | |
10382 035 0x0295, 0x02A1, 0x02A5, 0x02AB, 0x02B3, 0x02BD, 0x02C5, 0x02CF, | |
10383 036 | |
10384 037 0x02D7, 0x02DD, 0x02E3, 0x02E7, 0x02EF, 0x02F5, 0x02F9, 0x0301, | |
10385 038 0x0305, 0x0313, 0x031D, 0x0329, 0x032B, 0x0335, 0x0337, 0x033B, | |
10386 039 0x033D, 0x0347, 0x0355, 0x0359, 0x035B, 0x035F, 0x036D, 0x0371, | |
10387 040 0x0373, 0x0377, 0x038B, 0x038F, 0x0397, 0x03A1, 0x03A9, 0x03AD, | |
10388 041 0x03B3, 0x03B9, 0x03C7, 0x03CB, 0x03D1, 0x03D7, 0x03DF, 0x03E5, | |
10389 042 0x03F1, 0x03F5, 0x03FB, 0x03FD, 0x0407, 0x0409, 0x040F, 0x0419, | |
10390 043 0x041B, 0x0425, 0x0427, 0x042D, 0x043F, 0x0443, 0x0445, 0x0449, | |
10391 044 0x044F, 0x0455, 0x045D, 0x0463, 0x0469, 0x047F, 0x0481, 0x048B, | |
10392 045 | |
10393 046 0x0493, 0x049D, 0x04A3, 0x04A9, 0x04B1, 0x04BD, 0x04C1, 0x04C7, | |
10394 047 0x04CD, 0x04CF, 0x04D5, 0x04E1, 0x04EB, 0x04FD, 0x04FF, 0x0503, | |
10395 048 0x0509, 0x050B, 0x0511, 0x0515, 0x0517, 0x051B, 0x0527, 0x0529, | |
10396 049 0x052F, 0x0551, 0x0557, 0x055D, 0x0565, 0x0577, 0x0581, 0x058F, | |
10397 050 0x0593, 0x0595, 0x0599, 0x059F, 0x05A7, 0x05AB, 0x05AD, 0x05B3, | |
10398 051 0x05BF, 0x05C9, 0x05CB, 0x05CF, 0x05D1, 0x05D5, 0x05DB, 0x05E7, | |
10399 052 0x05F3, 0x05FB, 0x0607, 0x060D, 0x0611, 0x0617, 0x061F, 0x0623, | |
10400 053 0x062B, 0x062F, 0x063D, 0x0641, 0x0647, 0x0649, 0x064D, 0x0653 | |
10401 054 #endif | |
10402 055 \}; | |
10403 \end{alltt} | |
10404 \end{small} | |
10405 | |
10406 Note that there are two possible tables. When an mp\_digit is 7-bits long only the primes up to $127$ may be included, otherwise the primes | |
10407 up to $1619$ are used. Note that the value of \textbf{PRIME\_SIZE} is a constant dependent on the size of a mp\_digit. | |
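
The same table-driven test can be sketched for candidates that fit in a machine word. The names small\_primes and is\_divisible\_by\_small\_prime are hypothetical stand-ins for \_\_prime\_tab and mp\_prime\_is\_divisible, and the table is deliberately cut down to ten entries to keep the example short.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for __prime_tab: only the first ten primes. */
static const uint32_t small_primes[] = {
    2, 3, 5, 7, 11, 13, 17, 19, 23, 29
};

/* Hypothetical helper (not part of LibTomMath): returns 1 if n is
 * divisible by one of the table primes, otherwise 0. */
int is_divisible_by_small_prime(uint64_t n)
{
    size_t ix;
    size_t count = sizeof(small_primes) / sizeof(small_primes[0]);

    for (ix = 0; ix < count; ix++) {
        /* is the residue zero? */
        if (n % small_primes[ix] == 0) {
            return 1;
        }
    }
    return 0;
}
```

Note that a candidate equal to one of the table primes is also reported as divisible, so a caller that wants to accept small primes has to compare against the table separately. A composite such as $961 = 31^2$, whose factors all lie beyond the table, passes undetected, which is why this test can only reject candidates and never prove primality.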
10408 | |
10409 \subsection{The Fermat Test} | |
10410 The Fermat test is probably one of the oldest tests to have a non-trivial probability of success. It is based on the fact that if $n$ is in | |
10411 fact prime then $a^{n} \equiv a \mbox{ (mod }n\mbox{)}$ for all $0 < a < n$. The reason being that if $n$ is prime then the order of | |
10412 the multiplicative sub group is $n - 1$. Any base $a$ must have an order which divides $n - 1$ and as such $a^n$ is equivalent to | |
10413 $a^1 = a$. | |
10414 | |
10415 If $n$ is composite then any given base $a$ does not have to have a period which divides $n - 1$. In which case | |
10416 it is possible that $a^n \nequiv a \mbox{ (mod }n\mbox{)}$. However, this test is not absolute as it is possible that the order | |
10417 of a base will divide $n - 1$ which would then be reported as prime. Such a base yields what is known as a Fermat pseudo-prime. Several | |
10418 integers known as Carmichael numbers will be a pseudo-prime to all valid bases. Fortunately such numbers are extremely rare as $n$ grows | |
10419 in size. | |
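
The behaviour described above, including the failure on pseudo-primes, can be observed with a word-sized sketch. The names powmod\_u64 and fermat\_test\_u64 are hypothetical, and the modulus is assumed to be below $2^{32}$ so that the products fit in 64 bits.

```c
#include <stdint.h>

/* Hypothetical helper: b**e mod m by square-and-multiply.
 * Assumes 0 < m < 2**32 so that the 64-bit products cannot overflow. */
uint64_t powmod_u64(uint64_t b, uint64_t e, uint64_t m)
{
    uint64_t r = 1 % m;

    b %= m;
    while (e > 0) {
        if (e & 1) {
            r = (r * b) % m;
        }
        b = (b * b) % m;
        e >>= 1;
    }
    return r;
}

/* Hypothetical helper (not part of LibTomMath): returns 1 if
 * b**n == b (mod n), that is, if n passes the Fermat test to the
 * base b, otherwise 0. */
int fermat_test_u64(uint64_t n, uint64_t b)
{
    return powmod_u64(b, n, n) == (b % n);
}
```

The prime $97$ passes to the base two and the composite $91 = 7 \cdot 13$ fails. The composite $341 = 11 \cdot 31$ passes to the base two but fails to the base three, so it is a pseudo-prime only for some bases, whereas the Carmichael number $561 = 3 \cdot 11 \cdot 17$ passes regardless of the base chosen.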
10420 | |
10421 \begin{figure}[!here] | |
10422 \begin{small} | |
10423 \begin{center} | |
10424 \begin{tabular}{l} | |
10425 \hline Algorithm \textbf{mp\_prime\_fermat}. \\ | |
10426 \textbf{Input}. mp\_int $a$ and $b$, $a \ge 2$, $1 < b < a$. \\ | |
10427 \textbf{Output}. $c = 1$ if $b^a \equiv b \mbox{ (mod }a\mbox{)}$, otherwise $c = 0$. \\ | |
10428 \hline \\ | |
10429 1. $t \leftarrow b^a \mbox{ (mod }a\mbox{)}$ \\ | |
10430 2. If $t = b$ then \\ | |
10431 \hspace{3mm}2.1 $c \leftarrow 1$ \\ | |
10432 3. else \\ | |
10433 \hspace{3mm}3.1 $c \leftarrow 0$ \\ | |
10434 4. Return(\textit{MP\_OKAY}). \\ | |
10435 \hline | |
10436 \end{tabular} | |
10437 \end{center} | |
10438 \end{small} | |
10439 \caption{Algorithm mp\_prime\_fermat} | |
10440 \end{figure} | |
10441 \textbf{Algorithm mp\_prime\_fermat.} | |
10442 This algorithm determines whether an mp\_int $a$ passes the Fermat test to the base $b$ or not. It uses a single modular exponentiation to | |
10443 determine the result. | |
10444 | |
10445 \vspace{+3mm}\begin{small} | |
10446 \hspace{-5.1mm}{\bf File}: bn\_mp\_prime\_fermat.c | |
10447 \vspace{-3mm} | |
10448 \begin{alltt} | |
10449 016 | |
10450 017 /* performs one Fermat test. | |
10451 018 * | |
10452 019 * If "a" were prime then b**a == b (mod a) since the order of | |
10453 020 * the multiplicative sub-group would be phi(a) = a-1. That means | |
10454 021 * it would be the same as b**(a mod (a-1)) == b**1 == b (mod a). | |
10455 022 * | |
10456 023 * Sets result to 1 if the congruence holds, or zero otherwise. | |
10457 024 */ | |
10458 025 int mp_prime_fermat (mp_int * a, mp_int * b, int *result) | |
10459 026 \{ | |
10460 027 mp_int t; | |
10461 028 int err; | |
10462 029 | |
10463 030 /* default to composite */ | |
10464 031 *result = MP_NO; | |
10465 032 | |
10466 033 /* ensure b > 1 */ | |
10467 034 if (mp_cmp_d(b, 1) != MP_GT) \{ | |
10468 035 return MP_VAL; | |
10469 036 \} | |
10470 037 | |
10471 038 /* init t */ | |
10472 039 if ((err = mp_init (&t)) != MP_OKAY) \{ | |
10473 040 return err; | |
10474 041 \} | |
10475 042 | |
10476 043 /* compute t = b**a mod a */ | |
10477 044 if ((err = mp_exptmod (b, a, a, &t)) != MP_OKAY) \{ | |
10478 045 goto __T; | |
10479 046 \} | |
10480 047 | |
10481 048 /* is it equal to b? */ | |
10482 049 if (mp_cmp (&t, b) == MP_EQ) \{ | |
10483 050 *result = MP_YES; | |
10484 051 \} | |
10485 052 | |
10486 053 err = MP_OKAY; | |
10487 054 __T:mp_clear (&t); | |
10488 055 return err; | |
10489 056 \} | |
10490 \end{alltt} | |
10491 \end{small} | |
10492 | |
10493 \subsection{The Miller-Rabin Test} | |
10494 The Miller-Rabin (citation) test is another primality test which has tighter error bounds than the Fermat test specifically with sequentially chosen | |
10495 candidate integers. The algorithm is based on the observation that if $n$ is prime, $n - 1 = 2^kr$ with $r$ odd, and $b^r \nequiv \pm 1$, then after up to $k - 1$ squarings the | |
10496 value must be equal to $-1$. The squarings are stopped as soon as $-1$ is observed. If the value of $1$ is observed first it means that | |
10497 some value not congruent to $\pm 1$ when squared equals one which cannot occur if $n$ is prime. | |
10498 | |
10499 \begin{figure}[!here] | |
10500 \begin{small} | |
10501 \begin{center} | |
10502 \begin{tabular}{l} | |
10503 \hline Algorithm \textbf{mp\_prime\_miller\_rabin}. \\ | |
10504 \textbf{Input}. mp\_int $a$ and $b$, $a \ge 2$, $1 < b < a$. \\ | |
10505 \textbf{Output}. $c = 1$ if $a$ is a Miller-Rabin prime to the base $b$, otherwise $c = 0$. \\ | |
10506 \hline | |
10507 1. $a' \leftarrow a - 1$ \\ | |
10508 2. $r \leftarrow a'$ \\ | |
10509 3. $c \leftarrow 0, s \leftarrow 0$ \\ | |
10510 4. While $r.used > 0$ and $r_0 \equiv 0 \mbox{ (mod }2\mbox{)}$ \\ | |
10511 \hspace{3mm}4.1 $s \leftarrow s + 1$ \\ | |
10512 \hspace{3mm}4.2 $r \leftarrow \lfloor r / 2 \rfloor$ \\ | |
10513 5. $y \leftarrow b^r \mbox{ (mod }a\mbox{)}$ \\ | |
10514 6. If $y \nequiv \pm 1$ then \\ | |
10515 \hspace{3mm}6.1 $j \leftarrow 1$ \\ | |
10516 \hspace{3mm}6.2 While $j \le (s - 1)$ and $y \nequiv a'$ \\ | |
10517 \hspace{6mm}6.2.1 $y \leftarrow y^2 \mbox{ (mod }a\mbox{)}$ \\ | |
10518 \hspace{6mm}6.2.2 If $y = 1$ then goto step 8. \\ | |
10519 \hspace{6mm}6.2.3 $j \leftarrow j + 1$ \\ | |
10520 \hspace{3mm}6.3 If $y \nequiv a'$ goto step 8. \\ | |
10521 7. $c \leftarrow 1$\\ | |
10522 8. Return(\textit{MP\_OKAY}). \\ | |
10523 \hline | |
10524 \end{tabular} | |
10525 \end{center} | |
10526 \end{small} | |
10527 \caption{Algorithm mp\_prime\_miller\_rabin} | |
10528 \end{figure} | |
10529 \textbf{Algorithm mp\_prime\_miller\_rabin.} | |
10530 This algorithm performs one trial round of the Miller-Rabin algorithm to the base $b$. It will set $c = 1$ if the algorithm cannot determine | |
10531 if $a$ is composite or $c = 0$ if $a$ is provably composite. The values of $s$ and $r$ are computed such that $a' = a - 1 = 2^sr$. | |
10532 | |
10533 If the value $y \equiv b^r$ is congruent to $\pm 1$ then the algorithm cannot prove if $a$ is composite or not. Otherwise, the algorithm will | |
10534 square $y$ up to $s - 1$ times stopping only when $y \equiv -1$. If $y^2 \equiv 1$ and $y \nequiv \pm 1$ then the algorithm can report that $a$ | |
10535 is provably composite. If the algorithm performs $s - 1$ squarings and $y \nequiv -1$ then $a$ is provably composite. If $a$ is not provably | |
10536 composite then it is \textit{probably} prime. | |
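
A single round can be sketched on machine words as follows. The names mr\_powmod and miller\_rabin\_round are hypothetical, the candidate $a$ is assumed to be odd, greater than two and below $2^{32}$, and the base is assumed to satisfy $1 < b < a$.

```c
#include <stdint.h>

/* Hypothetical helper: b**e mod m by square-and-multiply.
 * Assumes 0 < m < 2**32 so that the 64-bit products cannot overflow. */
uint64_t mr_powmod(uint64_t b, uint64_t e, uint64_t m)
{
    uint64_t r = 1 % m;

    b %= m;
    while (e > 0) {
        if (e & 1) {
            r = (r * b) % m;
        }
        b = (b * b) % m;
        e >>= 1;
    }
    return r;
}

/* Hypothetical helper (not part of LibTomMath): one Miller-Rabin round
 * for an odd candidate a > 2 to the base b. Returns 1 if a is probably
 * prime or 0 if a is provably composite. */
int miller_rabin_round(uint64_t a, uint64_t b)
{
    uint64_t n1 = a - 1, r = n1, y;
    int s = 0, j;

    /* write a - 1 = 2**s * r with r odd */
    while ((r & 1) == 0) {
        r >>= 1;
        ++s;
    }

    /* y = b**r mod a */
    y = mr_powmod(b, r, a);

    if (y != 1 && y != n1) {
        /* square up to s - 1 times, stopping early once y == a - 1 */
        for (j = 1; j <= s - 1 && y != n1; j++) {
            y = (y * y) % a;
            if (y == 1) {
                return 0;   /* non-trivial square root of one */
            }
        }
        if (y != n1) {
            return 0;       /* -1 was never reached */
        }
    }
    return 1;
}
```

The Carmichael number $561$, which passes the Fermat test to every base, is rejected here to the base two because a value other than $\pm 1$ squares to one. On the other hand $2047 = 23 \cdot 89$ is accepted to the base two since $2^{1023} \equiv 1 \mbox{ (mod }2047\mbox{)}$, which illustrates why a result of one only means probably prime.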
10537 | |
10538 \vspace{+3mm}\begin{small} | |
10539 \hspace{-5.1mm}{\bf File}: bn\_mp\_prime\_miller\_rabin.c | |
10540 \vspace{-3mm} | |
10541 \begin{alltt} | |
10542 016 | |
10543 017 /* Miller-Rabin test of "a" to the base of "b" as described in | |
10544 018 * HAC pp. 139 Algorithm 4.24 | |
10545 019 * | |
10546 020 * Sets result to 0 if definitely composite or 1 if probably prime. | |
10547 021 * Randomly the chance of error is no more than 1/4 and often | |
10548 022 * very much lower. | |
10549 023 */ | |
10550 024 int mp_prime_miller_rabin (mp_int * a, mp_int * b, int *result) | |
10551 025 \{ | |
10552 026 mp_int n1, y, r; | |
10553 027 int s, j, err; | |
10554 028 | |
10555 029 /* default */ | |
10556 030 *result = MP_NO; | |
10557 031 | |
10558 032 /* ensure b > 1 */ | |
10559 033 if (mp_cmp_d(b, 1) != MP_GT) \{ | |
10560 034 return MP_VAL; | |
10561 035 \} | |
10562 036 | |
10563 037 /* get n1 = a - 1 */ | |
10564 038 if ((err = mp_init_copy (&n1, a)) != MP_OKAY) \{ | |
10565 039 return err; | |
10566 040 \} | |
10567 041 if ((err = mp_sub_d (&n1, 1, &n1)) != MP_OKAY) \{ | |
10568 042 goto __N1; | |
10569 043 \} | |
10570 044 | |
10571 045 /* set 2**s * r = n1 */ | |
10572 046 if ((err = mp_init_copy (&r, &n1)) != MP_OKAY) \{ | |
10573 047 goto __N1; | |
10574 048 \} | |
10575 049 | |
10576 050 /* count the number of least significant bits | |
10577 051 * which are zero | |
10578 052 */ | |
10579 053 s = mp_cnt_lsb(&r); | |
10580 054 | |
10581 055 /* now divide n - 1 by 2**s */ | |
10582 056 if ((err = mp_div_2d (&r, s, &r, NULL)) != MP_OKAY) \{ | |
10583 057 goto __R; | |
10584 058 \} | |
10585 059 | |
10586 060 /* compute y = b**r mod a */ | |
10587 061 if ((err = mp_init (&y)) != MP_OKAY) \{ | |
10588 062 goto __R; | |
10589 063 \} | |
10590 064 if ((err = mp_exptmod (b, &r, a, &y)) != MP_OKAY) \{ | |
10591 065 goto __Y; | |
10592 066 \} | |
10593 067 | |
10594 068 /* if y != 1 and y != n1 do */ | |
10595 069 if (mp_cmp_d (&y, 1) != MP_EQ && mp_cmp (&y, &n1) != MP_EQ) \{ | |
10596 070 j = 1; | |
10597 071 /* while j <= s-1 and y != n1 */ | |
10598 072 while ((j <= (s - 1)) && mp_cmp (&y, &n1) != MP_EQ) \{ | |
10599 073 if ((err = mp_sqrmod (&y, a, &y)) != MP_OKAY) \{ | |
10600 074 goto __Y; | |
10601 075 \} | |
10602 076 | |
10603 077 /* if y == 1 then composite */ | |
10604 078 if (mp_cmp_d (&y, 1) == MP_EQ) \{ | |
10605 079 goto __Y; | |
10606 080 \} | |
10607 081 | |
10608 082 ++j; | |
10609 083 \} | |
10610 084 | |
10611 085 /* if y != n1 then composite */ | |
10612 086 if (mp_cmp (&y, &n1) != MP_EQ) \{ | |
10613 087 goto __Y; | |
10614 088 \} | |
10615 089 \} | |
10616 090 | |
10617 091 /* probably prime now */ | |
10618 092 *result = MP_YES; | |
10619 093 __Y:mp_clear (&y); | |
10620 094 __R:mp_clear (&r); | |
10621 095 __N1:mp_clear (&n1); | |
10622 096 return err; | |
10623 097 \} | |
10624 \end{alltt} | |
10625 \end{small} | |
10626 | |
10627 | |
10628 | |
10629 | |
10630 \backmatter | |
10631 \appendix | |
10632 \begin{thebibliography}{ABCDEF} | |
10633 \bibitem[1]{TAOCPV2} | |
10634 Donald Knuth, \textit{The Art of Computer Programming}, Third Edition, Volume Two, Seminumerical Algorithms, Addison-Wesley, 1998 | |
10635 | |
10636 \bibitem[2]{HAC} | |
10637 A. Menezes, P. van Oorschot, S. Vanstone, \textit{Handbook of Applied Cryptography}, CRC Press, 1996 | |
10638 | |
10639 \bibitem[3]{ROSE} | |
10640 Michael Rosing, \textit{Implementing Elliptic Curve Cryptography}, Manning Publications, 1999 | |
10641 | |
10642 \bibitem[4]{COMBA} | |
10643 Paul G. Comba, \textit{Exponentiation Cryptosystems on the IBM PC}. IBM Systems Journal 29(4): 526-538 (1990) | |
10644 | |
10645 \bibitem[5]{KARA} | |
10646 A. Karatsuba, Doklady Akad. Nauk SSSR 145 (1962), pp. 293-294 | |
10647 | |
10648 \bibitem[6]{KARAP} | |
10649 Andre Weimerskirch and Christof Paar, \textit{Generalizations of the Karatsuba Algorithm for Polynomial Multiplication}, Submitted to Designs, Codes and Cryptography, March 2002 | |
10650 | |
10651 \bibitem[7]{BARRETT} | |
10652 Paul Barrett, \textit{Implementing the Rivest Shamir and Adleman Public Key Encryption Algorithm on a Standard Digital Signal Processor}, Advances in Cryptology, Crypto '86, Springer-Verlag. | |
10653 | |
10654 \bibitem[8]{MONT} | |
10655 P.L.Montgomery. \textit{Modular multiplication without trial division}. Mathematics of Computation, 44(170):519-521, April 1985. | |
10656 | |
10657 \bibitem[9]{DRMET} | |
10658 Chae Hoon Lim and Pil Joong Lee, \textit{Generating Efficient Primes for Discrete Log Cryptosystems}, POSTECH Information Research Laboratories | |
10659 | |
10660 \bibitem[10]{MMB} | |
10661 J. Daemen and R. Govaerts and J. Vandewalle, \textit{Block ciphers based on Modular Arithmetic}, State and {P}rogress in the {R}esearch of {C}ryptography, 1993, pp. 80-89 | |
10662 | |
10663 \bibitem[11]{RSAREF} | |
10664 R.L. Rivest, A. Shamir, L. Adleman, \textit{A Method for Obtaining Digital Signatures and Public-Key Cryptosystems} | |
10665 | |
10666 \bibitem[12]{DHREF} | |
10667 Whitfield Diffie, Martin E. Hellman, \textit{New Directions in Cryptography}, IEEE Transactions on Information Theory, 1976 | |
10668 | |
10669 \bibitem[13]{IEEE} | |
10670 IEEE Standard for Binary Floating-Point Arithmetic (ANSI/IEEE Std 754-1985) | |
10671 | |
10672 \bibitem[14]{GMP} | |
10673 GNU Multiple Precision (GMP), \url{http://www.swox.com/gmp/} | |
10674 | |
10675 \bibitem[15]{MPI} | |
10676 Multiple Precision Integer Library (MPI), Michael Fromberger, \url{http://thayer.dartmouth.edu/~sting/mpi/} | |
10677 | |
10678 \bibitem[16]{OPENSSL} | |
10679 OpenSSL Cryptographic Toolkit, \url{http://openssl.org} | |
10680 | |
10681 \bibitem[17]{LIP} | |
10682 Large Integer Package, \url{http://home.hetnet.nl/~ecstr/LIP.zip} | |
10683 | |
10684 \bibitem[18]{ISOC} | |
10685 JTC1/SC22/WG14, ISO/IEC 9899:1999, ``A draft rationale for the C99 standard.'' | |
10686 | |
10687 \bibitem[19]{JAVA} | |
10688 The Sun Java Website, \url{http://java.sun.com/} | |
10689 | |
10690 \end{thebibliography} | |
10691 | |
10692 \input{tommath.ind} | |
10693 | |
10694 \end{document} |