1 \documentclass[b5paper]{book} |
|
2 \usepackage{hyperref} |
|
3 \usepackage{makeidx} |
|
4 \usepackage{amssymb} |
|
5 \usepackage{color} |
|
6 \usepackage{alltt} |
|
7 \usepackage{graphicx} |
|
8 \usepackage{layout} |
|
9 \def\union{\cup} |
|
10 \def\intersect{\cap} |
|
11 \def\getsrandom{\stackrel{\rm R}{\gets}} |
|
12 \def\cross{\times} |
|
13 \def\cat{\hspace{0.5em} \| \hspace{0.5em}} |
|
14 \def\catn{$\|$} |
|
15 \def\divides{\hspace{0.3em} | \hspace{0.3em}} |
|
16 \def\nequiv{\not\equiv} |
|
17 \def\approx{\raisebox{0.2ex}{\mbox{\small $\sim$}}} |
|
18 \def\lcm{{\rm lcm}} |
|
19 \def\gcd{{\rm gcd}} |
|
20 \def\log{{\rm log}} |
|
21 \def\ord{{\rm ord}} |
|
22 \def\abs{{\mathit{abs}}} |
|
23 \def\rep{{\mathit{rep}}} |
|
24 \def\mod{{\ \mathit{mod}\ }} |
|
25 \renewcommand{\pmod}[1]{\ ({\rm mod\ }{#1})} |
|
26 \newcommand{\floor}[1]{\left\lfloor{#1}\right\rfloor} |
|
27 \newcommand{\ceil}[1]{\left\lceil{#1}\right\rceil} |
|
28 \def\Or{{\rm\ or\ }} |
|
29 \def\And{{\rm\ and\ }} |
|
30 \def\iff{\hspace{1em}\Longleftrightarrow\hspace{1em}} |
|
31 \def\implies{\Rightarrow} |
|
32 \def\undefined{{\rm ``undefined''}} |
|
33 \def\Proof{\vspace{1ex}\noindent {\bf Proof:}\hspace{1em}} |
|
34 \let\oldphi\phi |
|
35 \def\phi{\varphi} |
|
36 \def\Pr{{\rm Pr}} |
|
37 \newcommand{\str}[1]{{\mathbf{#1}}} |
|
38 \def\F{{\mathbb F}} |
|
39 \def\N{{\mathbb N}} |
|
40 \def\Z{{\mathbb Z}} |
|
41 \def\R{{\mathbb R}} |
|
42 \def\C{{\mathbb C}} |
|
43 \def\Q{{\mathbb Q}} |
|
44 \definecolor{DGray}{gray}{0.5} |
|
45 \newcommand{\emailaddr}[1]{\mbox{$<${#1}$>$}} |
|
46 \def\twiddle{\raisebox{0.3ex}{\mbox{\tiny $\sim$}}} |
|
47 \def\gap{\vspace{0.5ex}} |
|
48 \makeindex |
|
49 \begin{document} |
|
50 \frontmatter |
|
51 \pagestyle{empty} |
|
52 \title{Implementing Multiple Precision Arithmetic \\ ~ \\ Draft Edition } |
|
53 \author{\mbox{ |
|
54 %\begin{small} |
|
55 \begin{tabular}{c} |
|
56 Tom St Denis \\ |
|
57 Algonquin College \\ |
|
58 \\ |
|
59 Mads Rasmussen \\ |
|
60 Open Communications Security \\ |
|
61 \\ |
|
62 Greg Rose \\ |
|
63 QUALCOMM Australia \\ |
|
64 \end{tabular} |
|
65 %\end{small} |
|
66 } |
|
67 } |
|
68 \maketitle |
|
69 This text has been placed in the public domain. This text corresponds to the v0.30 release of the |
|
70 LibTomMath project. |
|
71 |
|
72 \begin{alltt} |
|
73 Tom St Denis |
|
74 111 Banning Rd |
|
75 Ottawa, Ontario |
|
76 K2L 1C3 |
|
77 Canada |
|
78 |
|
79 Phone: 1-613-836-3160 |
|
80 Email: [email protected] |
|
81 \end{alltt} |
|
82 |
|
83 This text is formatted to the international B5 paper size of 176mm wide by 250mm tall using the \LaTeX{} |
|
84 {\em book} macro package and the Perl {\em booker} package. |
|
85 |
|
86 \tableofcontents |
|
87 \listoffigures |
|
88 \chapter*{Prefaces to the Draft Edition} |
|
89 I started this text in April 2003 to complement my LibTomMath library. That is, to explain how to implement the functions |
|
90 contained in LibTomMath. The goal is to have a textbook that any Computer Science student can use when implementing their |
|
91 own multiple precision arithmetic. The plan I wanted to follow was to flesh out all the |
|
92 ideas and concepts I had floating around in my head and then refine the text a little bit at a time. As chance |
|
93 would have it, I ended up with my summer off from Algonquin College and was given four solid months to work on the |
|
94 text. |
|
95 |
|
96 Choosing to not waste any time I dove right into the project even before my spring semester was finished. I wrote a bit |
|
97 off and on at first. The moment my exams were finished I jumped into long 12 to 16 hour days. The result after only |
|
98 a couple of months was a ten chapter, three hundred page draft that I quickly had distributed to anyone who wanted |
|
99 to read it. I had Jean-Luc Cooke print copies for me and I brought them to Crypto'03 in Santa Barbara. So far I have |
|
100 managed to grab a certain level of attention. Having people from around the world ask me for copies of the text was certainly |
|
101 rewarding. |
|
102 |
|
103 Now we are past December 2003. By this time I had pictured that I would have at least finished my second draft of the text. |
|
104 Currently I am far off from this goal. I've done partial re-writes of chapters one, two and three but they are not even |
|
105 finished yet. I haven't given up on the project, only had some setbacks. First O'Reilly declined to publish the text, then |
|
106 Addison-Wesley did as well, and Greg is trying another publisher whose name I don't know. However, at this point I want to focus my energy |
|
107 on finishing the book, not securing a contract. |
|
108 |
|
109 So why am I writing this text? It seems like a lot of work, right? Most certainly it is a lot of work writing a textbook. |
|
110 Even the simplest introductory material has to be lined with references and figures. A lot of the text has to be re-written |
|
111 from point form to prose form to ensure an easier read. Why am I doing all this work for free then? Simple. My philosophy |
|
112 is quite simply ``Open Source. Open Academia. Open Minds'' which means that to achieve a goal of open minds, that is, |
|
113 people willing to accept new ideas and explore the unknown, you have to make available material they can access freely |
|
114 without hindrance. |
|
115 |
|
116 I've been writing free software since I was about sixteen but only recently have I hit upon software that people have come |
|
117 to depend upon. I started LibTomCrypt in December 2001 and now several major companies use it as an integral part of their |
|
118 software. Several educational institutions use it as a matter of course and many freelance developers use it as |
|
119 part of their projects. To further my contributions I started the LibTomMath project in December 2002 aimed at providing |
|
120 multiple precision arithmetic routines that students could learn from. That is, to write routines that are not only easy |
|
121 to understand and follow but also provide quite impressive performance considering they are all in standard portable ISO C. |
|
122 |
|
123 The second leg of my philosophy is ``Open Academia'' which is where this textbook comes in. In the end, when all is |
|
124 said and done the text will be useable by educational institutions as a reference on multiple precision arithmetic. |
|
125 |
|
126 At this time I feel I should share a little information about myself. The most common question I was asked at |
|
127 Crypto'03, perhaps just out of professional courtesy, was which school I either taught at or attended. The unfortunate |
|
128 truth is that I neither teach at nor attend a school of academic reputation. I'm currently at Algonquin College which |
|
129 is what I'd like to call a ``somewhat academic but mostly vocational'' college. In other words, job training. |
|
130 |
|
131 I'm a 21 year old computer science student mostly self-taught in the areas I am aware of (which includes a half-dozen |
|
132 computer science fields, a few fields of mathematics and some English). I look forward to teaching someday but I am |
|
133 still far off from that goal. |
|
134 |
|
135 Now it would be improper for me to not introduce the rest of the text's co-authors. While they are only contributing |
|
136 corrections and editorial feedback their support has been tremendously helpful in presenting the concepts laid out |
|
137 in the text so far. Greg has always been there for me. He has tracked my LibTom projects since their inception and even |
|
138 sent cheques to help pay tuition from time to time. His background has provided a wonderful source to bounce ideas off |
|
139 of and improve the quality of my writing. Mads is another fellow who has just ``been there''. I don't even recall what |
|
140 his interest in the LibTom projects is but I'm definitely glad he has been around. His ability to catch logical errors |
|
141 in my written English has saved me on several occasions to say the least. |
|
142 |
|
143 What to expect next? Well this is still a rough draft. I've only had the chance to update a few chapters. However, I've |
|
144 been getting the feeling that people are starting to use my text and I owe them some updated material. My current tentative |
|
145 plan is to edit one chapter every two weeks starting January 4th. It seems insane but my lower course load at college |
|
146 should provide ample time. By Crypto'04 I plan to have a 2nd draft of the text polished and ready to hand out to as many |
|
147 people as will take it. |
|
148 |
|
149 \begin{flushright} Tom St Denis \end{flushright} |
|
150 |
|
151 \newpage |
|
152 I found the opportunity to work with Tom appealing for several reasons: not only could I broaden my own horizons, but I could also |
|
153 help educate others facing the problem of having to handle big number mathematical calculations. |
|
154 |
|
155 This book is Tom's child and he has been caring and fostering the project ever since the beginning with a clear mind of |
|
156 how he wanted the project to turn out. I have helped by proofreading the text and we have had several discussions about |
|
157 the layout and language used. |
|
158 |
|
159 I hold a master's degree in cryptography from the University of Southern Denmark and have always been interested in the |
|
160 practical aspects of cryptography. |
|
161 |
|
162 Having worked in the security consultancy business for several years in S\~{a}o Paulo, Brazil, I have been in touch with a |
|
163 great deal of work in which multiple precision mathematics was needed. Understanding the possibilities for speeding up |
|
164 multiple precision calculations is often very important since we deal with outdated machine architecture where modular |
|
165 reductions, for example, become painfully slow. |
|
166 |
|
167 This text is for people who stop and wonder when examining algorithms such as RSA for the first time and ask |
|
168 themselves, ``You tell me this is only secure for large numbers, fine; but how do you implement these numbers?'' |
|
169 |
|
170 \begin{flushright} |
|
171 Mads Rasmussen |
|
172 |
|
173 S\~{a}o Paulo - SP |
|
174 |
|
175 Brazil |
|
176 \end{flushright} |
|
177 |
|
178 \newpage |
|
179 It's all because I broke my leg. That just happened to be at about the same time that Tom asked for someone to review the section of the book about |
|
180 Karatsuba multiplication. I was laid up, alone and immobile, and thought ``Why not?'' I vaguely knew what Karatsuba multiplication was, but not |
|
181 really, so I thought I could help, learn, and stop myself from watching daytime cable TV, all at once. |
|
182 |
|
183 At the time of writing this, I've still not met Tom or Mads in meatspace. I've been following Tom's progress since his first splash on the |
|
184 sci.crypt Usenet news group. I watched him go from a clueless newbie, to the cryptographic equivalent of a reformed smoker, to a real |
|
185 contributor to the field, over a period of about two years. I've been impressed with his obvious intelligence, and astounded by his productivity. |
|
186 Of course, he's young enough to be my own child, so he doesn't have my problems with staying awake. |
|
187 |
|
188 When I reviewed that single section of the book, in its very earliest form, I was very pleasantly surprised. So I decided to collaborate more fully, |
|
189 and at least review all of it, and perhaps write some bits too. There's still a long way to go with it, and I have watched a number of close |
|
190 friends go through the mill of publication, so I think that the way to go is longer than Tom thinks it is. Nevertheless, it's a good effort, |
|
191 and I'm pleased to be involved with it. |
|
192 |
|
193 \begin{flushright} |
|
194 Greg Rose, Sydney, Australia, June 2003. |
|
195 \end{flushright} |
|
196 |
|
197 \mainmatter |
|
198 \pagestyle{headings} |
|
199 \chapter{Introduction} |
|
200 \section{Multiple Precision Arithmetic} |
|
201 |
|
202 \subsection{What is Multiple Precision Arithmetic?} |
|
203 When we think of long-hand arithmetic such as addition or multiplication we rarely consider the fact that we instinctively |
|
204 raise or lower the precision of the numbers we are dealing with. For example, in decimal we can almost immediately |
|
205 reason that $7$ times $6$ is $42$. However, $42$ has two digits of precision as opposed to the one digit we started with. |
|
206 A further multiplication by, say, $3$ gives the result $126$, which has a still larger precision. In these few examples we have multiple |
|
207 precisions for the numbers we are working with. Despite the various levels of precision a single subset\footnote{With the occasional optimization.} |
|
208 of algorithms can be designed to accommodate them. |
|
209 |
|
210 By way of comparison a fixed or single precision operation would lose precision on various operations. For example, in |
|
211 the decimal system with a fixed precision of one digit, $6 \cdot 7 = 2$ because the leading digit of $42$ is discarded. |
|
212 |
|
213 Essentially at the heart of computer based multiple precision arithmetic are the same long-hand algorithms taught in |
|
214 schools to manually add, subtract, multiply and divide. |
|
215 |
|
216 \subsection{The Need for Multiple Precision Arithmetic} |
|
217 The most prevalent need for multiple precision arithmetic, often referred to as ``bignum'' math, is within the implementation |
|
218 of public-key cryptography algorithms. Algorithms such as RSA \cite{RSAREF} and Diffie-Hellman \cite{DHREF} require |
|
219 integers of significant magnitude to resist known cryptanalytic attacks. For example, at the time of this writing a |
|
220 typical RSA modulus would be greater than $10^{309}$. However, modern programming languages such as ISO C \cite{ISOC} and |
|
221 Java \cite{JAVA} only provide intrinsic support for integers which are relatively small and single precision. |
|
222 |
|
223 \begin{figure}[!h] |
|
224 \begin{center} |
|
225 \begin{tabular}{|r|c|} |
|
226 \hline \textbf{Data Type} & \textbf{Range} \\ |
|
227 \hline char & $-128 \ldots 127$ \\ |
|
228 \hline short & $-32768 \ldots 32767$ \\ |
|
229 \hline long & $-2147483648 \ldots 2147483647$ \\ |
|
230 \hline long long & $-9223372036854775808 \ldots 9223372036854775807$ \\ |
|
231 \hline |
|
232 \end{tabular} |
|
233 \end{center} |
|
234 \caption{Typical Data Types for the C Programming Language} |
|
235 \label{fig:ISOC} |
|
236 \end{figure} |
|
237 |
|
238 The largest data type guaranteed to be provided by the ISO C programming |
|
239 language\footnote{As per the ISO C standard. However, each compiler vendor is allowed to augment the precision as they |
|
240 see fit.} can only represent values up to roughly $10^{19}$ as shown in figure \ref{fig:ISOC}. On its own the C language is |
|
241 insufficient to accommodate the magnitude required for the problem at hand. An RSA modulus of magnitude $10^{19}$ could be |
|
242 trivially factored\footnote{A Pollard-Rho factoring would take only $2^{16}$ time.} on the average desktop computer, |
|
243 rendering any protocol based on the algorithm insecure. Multiple precision algorithms solve this very problem by |
|
244 extending the range of representable integers while using single precision data types. |
|
245 |
|
246 Most advancements in fast multiple precision arithmetic stem from the need for faster and more efficient cryptographic |
|
247 primitives. Faster modular reduction and exponentiation algorithms such as Barrett's algorithm, which have appeared in |
|
248 various cryptographic journals, can render algorithms such as RSA and Diffie-Hellman more efficient. In fact, several |
|
249 major companies such as RSA Security, Certicom and Entrust have built entire product lines on the implementation and |
|
250 deployment of efficient algorithms. |
|
251 |
|
252 However, cryptography is not the only field of study that can benefit from fast multiple precision integer routines. |
|
253 Another auxiliary use of multiple precision integers is high precision floating point data types. |
|
254 The basic IEEE \cite{IEEE} standard floating point type is made up of an integer mantissa $q$, an exponent $e$ and a sign bit $s$. |
|
255 Numbers are given in the form $n = q \cdot b^e \cdot (-1)^s$ where $b = 2$ is the most common base for IEEE. Since IEEE |
|
256 floating point is meant to be implemented in hardware the precision of the mantissa is often fairly small |
|
257 (\textit{23, 52 and 64 bits}). The mantissa is merely an integer and a multiple precision integer could be used to create |
|
258 a mantissa of much larger precision than hardware alone can efficiently support. This approach could be useful where |
|
259 scientific applications must minimize the total output error over long calculations. |
|
260 |
261 Yet another use for large integers is within arithmetic on polynomials of large characteristic (i.e. $GF(p)[x]$ for large $p$). |
|
262 In fact the library discussed within this text has already been used to form a polynomial basis library\footnote{See \url{http://poly.libtomcrypt.org} for more details.}. |
|
263 |
|
264 \subsection{Benefits of Multiple Precision Arithmetic} |
|
265 \index{precision} |
|
266 The benefit of multiple precision representations over single or fixed precision representations is that |
|
267 no precision is lost while representing the result of an operation which requires excess precision. For example, |
|
268 the product of two $n$-bit integers can require up to $2n$ bits of precision to be represented faithfully. A multiple |
|
269 precision algorithm would augment the precision of the destination to accommodate the result while a single precision system |
|
270 would truncate excess bits to maintain a fixed level of precision. |
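As a rough illustration of the difference, the following C sketch multiplies two integers once with the result widened and once with the result cut back to $n$ bits. The function names \texttt{mul\_full} and \texttt{mul\_fixed} are hypothetical and are not part of LibTomMath; the sketch assumes inputs of at most 32 bits so that the full product fits in a 64-bit type.

```c
#include <assert.h>
#include <stdint.h>

/* Full product of two integers of at most 32 bits each.  The operands are
 * widened first, so all (up to) 64 bits of the result are kept. */
uint64_t mul_full(uint32_t a, uint32_t b)
{
    return (uint64_t)a * (uint64_t)b;
}

/* Product under a fixed precision of n bits (1 <= n <= 32): every bit at
 * position n and above is discarded, as a single precision system would do. */
uint32_t mul_fixed(uint32_t a, uint32_t b, unsigned n)
{
    uint64_t mask;

    assert(n >= 1 && n <= 32);
    mask = (UINT64_C(1) << n) - 1;          /* the low n bits set */
    return (uint32_t)(mul_full(a, b) & mask);
}
```

With $n = 8$, for instance, \texttt{mul\_full(255, 255)} gives $65025$ while \texttt{mul\_fixed(255, 255, 8)} keeps only the low eight bits and gives $1$.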
|
271 |
|
272 It is possible to implement algorithms which require large integers with fixed precision algorithms. For example, elliptic |
|
273 curve cryptography (\textit{ECC}) is often implemented on smartcards by fixing the precision of the integers to the maximum |
|
274 size the system will ever need. Such an approach can lead to vastly simpler algorithms which can accommodate the |
|
275 integers required even if the host platform cannot natively accommodate them\footnote{For example, the average smartcard |
|
276 processor has an 8 bit accumulator.}. However, as efficient as such an approach may be, the resulting source code is not |
|
277 normally very flexible. It cannot, at runtime, accommodate inputs of higher magnitude than the designer anticipated. |
|
278 |
|
279 Multiple precision algorithms have the most overhead of any style of arithmetic. For the most part the |
|
280 overhead can be kept to a minimum with careful planning, but overall, they are not well suited for most memory starved |
|
281 platforms. However, multiple precision algorithms do offer the most flexibility in terms of the magnitude of the |
|
282 inputs. That is, the same algorithms based on multiple precision integers can accommodate any reasonable size input |
|
283 without the designer's explicit forethought. This leads to lower cost of ownership for the code as it only has to |
|
284 be written and tested once. |
|
285 |
|
286 \section{Purpose of This Text} |
|
287 The purpose of this text is to instruct the reader regarding how to implement efficient multiple precision algorithms. |
|
288 That is, to explain not only a limited subset of the core theory behind the algorithms but also the various ``house keeping'' |
|
289 elements that are neglected by authors of other texts on the subject. Several well renowned texts \cite{TAOCPV2,HAC} |
|
290 give considerably detailed explanations of the theoretical aspects of algorithms and often very little information |
|
291 regarding the practical implementation aspects. |
|
292 |
|
293 In most cases how an algorithm is explained and how it is actually implemented are two very different concepts. For |
|
294 example, the Handbook of Applied Cryptography (\textit{HAC}), algorithm 14.7 on page 594, gives a relatively simple |
|
295 algorithm for performing multiple precision integer addition. However, the description lacks any discussion concerning |
|
296 the fact that the two integer inputs may be of differing magnitudes. As a result the implementation is not as simple |
|
297 as the text would lead people to believe. Similarly the division routine (\textit{algorithm 14.20, p. 598}) does not |
|
298 discuss how to handle sign or handle the dividend's decreasing magnitude in the main loop (\textit{step \#3}). |
|
299 |
|
300 Neither text discusses several key optimal algorithms required such as ``Comba'' and Karatsuba multipliers |
|
301 and fast modular inversion, which we consider practical oversights. These optimal algorithms are vital to achieve |
|
302 any form of useful performance in non-trivial applications. |
|
303 |
|
304 To solve this problem the focus of this text is on the practical aspects of implementing a multiple precision integer |
|
305 package. As a case study the ``LibTomMath''\footnote{Available at \url{http://math.libtomcrypt.org}} package is used |
|
306 to demonstrate algorithms with real implementations\footnote{In the ISO C programming language.} that have been field |
|
307 tested and work very well. The LibTomMath library is freely available on the Internet for all uses and this text |
|
308 discusses a very large portion of the inner workings of the library. |
|
309 |
|
310 The algorithms that are presented will always include at least one ``pseudo-code'' description followed |
|
311 by the actual C source code that implements the algorithm. The pseudo-code can be used to implement the same |
|
312 algorithm in other programming languages as the reader sees fit. |
|
313 |
|
314 This text shall also serve as a walkthrough of the creation of multiple precision algorithms from scratch, showing |
|
315 the reader how the algorithms fit together as well as where to start on various tasks. |
|
316 |
|
317 \section{Discussion and Notation} |
|
318 \subsection{Notation} |
319 A multiple precision integer of $n$-digits shall be denoted as $x = (x_{n-1}, \ldots, x_1, x_0)_{ \beta }$ and represent |
|
320 the integer $x \equiv \sum_{i=0}^{n-1} x_i\beta^i$. The elements of the array $x$ are said to be the radix $\beta$ digits |
|
321 of the integer. For example, $x = (1,2,3)_{10}$ would represent the integer |
|
322 $1\cdot 10^2 + 2\cdot10^1 + 3\cdot10^0 = 123$. |
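The notation can be made concrete with a short C sketch that evaluates a digit array by Horner's rule. The function name \texttt{digits\_to\_integer} and the choice to store $x_0$ at index zero are assumptions made for this illustration only, and the result is limited to what a 64-bit integer can hold.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Evaluate x = sum_{i=0}^{n-1} digits[i] * radix^i using Horner's rule.
 * digits[0] holds the least significant digit x_0.  The value must fit in
 * 64 bits; this sketch does not detect overflow. */
uint64_t digits_to_integer(const unsigned *digits, size_t n, unsigned radix)
{
    uint64_t x = 0;
    size_t i;

    assert(radix >= 2);
    for (i = n; i > 0; i--) {
        assert(digits[i - 1] < radix);      /* every digit is below the radix */
        x = x * radix + digits[i - 1];      /* shift up one digit, add the next */
    }
    return x;
}
```

Passing the array $\{3, 2, 1\}$ with radix $10$ corresponds to $x = (1,2,3)_{10}$ and returns $123$.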
|
323 |
|
324 \index{mp\_int} |
|
325 The term ``mp\_int'' shall refer to a composite structure which contains the digits of the integer it represents, as well |
|
326 as auxiliary data required to manipulate the data. These additional members are discussed further in section |
|
327 \ref{sec:MPINT}. For the purposes of this text a ``multiple precision integer'' and an ``mp\_int'' are assumed to be |
|
328 synonymous. When an algorithm is specified to accept an mp\_int variable it is assumed the various auxiliary data members |
|
329 are present as well. An expression of the type \textit{variablename.item} implies that it should evaluate to the |
|
330 member named ``item'' of the variable. For example, a string of characters may have a member ``length'' which would |
|
331 evaluate to the number of characters in the string. If the string $a$ equals ``hello'' then it follows that |
|
332 $a.length = 5$. |
|
333 |
|
334 For certain discussions more generic algorithms are presented to help the reader understand the final algorithm used |
|
335 to solve a given problem. When an algorithm is described as accepting an integer input it is assumed the input is |
|
336 a plain integer with no additional multiple-precision members. That is, algorithms that use integers as opposed to |
|
337 mp\_ints as inputs do not concern themselves with the housekeeping operations required such as memory management. These |
|
338 algorithms will be used to establish the relevant theory which will subsequently be used to describe a multiple |
|
339 precision algorithm to solve the same problem. |
|
340 |
|
341 \subsection{Precision Notation} |
342 The variable $\beta$ represents the radix of a single digit of a multiple precision integer and |
|
343 must be of the form $q^p$ for $q, p \in \Z^+$. A single precision variable must be able to represent integers in |
|
344 the range $0 \le x < q \beta$ while a double precision variable must be able to represent integers in the range |
|
345 $0 \le x < q \beta^2$. The extra radix-$q$ factor allows additions and subtractions to proceed without truncation of the |
|
346 carry. Since all modern computers are binary, it is assumed that $q$ is two. |
|
347 |
|
348 \index{mp\_digit} \index{mp\_word} |
|
349 Within the source code that will be presented for each algorithm, the data type \textbf{mp\_digit} will represent |
|
350 a single precision integer type, while the data type \textbf{mp\_word} will represent a double precision integer type. In |
|
351 several algorithms (notably the Comba routines) temporary results will be stored in arrays of double precision mp\_words. |
|
352 For the purposes of this text $x_j$ will refer to the $j$'th digit of a single precision array and $\hat x_j$ will refer to |
|
353 the $j$'th digit of a double precision array. Whenever an expression is to be assigned to a double precision |
|
354 variable it is assumed that all single precision variables are promoted to double precision during the evaluation. |
|
355 Expressions that are assigned to a single precision variable are truncated to fit within the precision of a single |
|
356 precision data type. |
|
357 |
|
358 For example, if $\beta = 10^2$ a single precision data type may represent a value in the |
|
359 range $0 \le x < 10^3$, while a double precision data type may represent a value in the range $0 \le x < 10^5$. Let |
|
360 $a = 23$ and $b = 49$ represent two single precision variables. The single precision product shall be written |
|
361 as $c \leftarrow a \cdot b$ while the double precision product shall be written as $\hat c \leftarrow a \cdot b$. |
|
362 In this particular case, $\hat c = 1127$ and $c = 127$. The most significant digit of the product would not fit |
|
363 in a single precision data type and as a result $c \ne \hat c$. |
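The worked example can be mimicked in C as a minimal sketch. It models the hypothetical decimal machine of the example ($\beta = 10^2$, $q = 10$) with ordinary binary integers and reduces results modulo the stated limits to imitate truncation; the names \texttt{product\_single} and \texttt{product\_double} are illustrative only and do not appear in LibTomMath.

```c
#include <assert.h>
#include <stdint.h>

/* Limits of the hypothetical decimal machine: beta = 10^2 and q = 10. */
#define SINGLE_LIMIT 1000u      /* q * beta   = 10^3 */
#define DOUBLE_LIMIT 100000u    /* q * beta^2 = 10^5 */

/* c <- a * b : the product is truncated to fit a single precision variable. */
uint32_t product_single(uint32_t a, uint32_t b)
{
    assert(a < SINGLE_LIMIT && b < SINGLE_LIMIT);
    return (a * b) % SINGLE_LIMIT;
}

/* c-hat <- a * b : the product is kept in a double precision variable. */
uint32_t product_double(uint32_t a, uint32_t b)
{
    assert(a < SINGLE_LIMIT && b < SINGLE_LIMIT);
    return (a * b) % DOUBLE_LIMIT;
}
```

For $a = 23$ and $b = 49$ the two functions return $127$ and $1127$ respectively, matching $c$ and $\hat c$ above.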
|
364 |
|
365 \subsection{Algorithm Inputs and Outputs} |
|
366 Within the algorithm descriptions all variables are assumed to be scalars of either single or double precision |
|
367 as indicated. The only exception to this rule is when variables have been indicated to be of type mp\_int. This |
|
368 distinction is important as scalars are often used as array indices and various other counters. |
|
369 |
|
370 \subsection{Mathematical Expressions} |
|
371 The $\lfloor \mbox{ } \rfloor$ brackets imply an expression truncated to an integer not greater than the expression |
|
372 itself. For example, $\lfloor 5.7 \rfloor = 5$. Similarly the $\lceil \mbox{ } \rceil$ brackets imply an expression |
|
373 rounded to an integer not less than the expression itself. For example, $\lceil 5.1 \rceil = 6$. Typically when |
|
374 the $/$ division symbol is used the intention is to perform an integer division with truncation. For example, |
|
375 $5/2 = 2$ which will often be written as $\lfloor 5/2 \rfloor = 2$ for clarity. When an expression is written as a |
|
376 fraction a real value division is implied, for example ${5 \over 2} = 2.5$. |
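For non-negative operands these conventions map directly onto C, as the following sketch suggests. The helper names are hypothetical; the sketch relies only on the fact that C integer division discards the fractional part.

```c
#include <assert.h>

/* floor(a / b) for a >= 0 and b > 0: integer division already truncates. */
unsigned floor_div(unsigned a, unsigned b)
{
    assert(b != 0);
    return a / b;
}

/* ceil(a / b) for a >= 0 and b > 0: round up whenever a remainder is left. */
unsigned ceil_div(unsigned a, unsigned b)
{
    assert(b != 0);
    return a / b + (a % b != 0);
}

/* Real valued division, as implied when an expression is written as a fraction. */
double real_div(unsigned a, unsigned b)
{
    assert(b != 0);
    return (double)a / (double)b;
}
```

Here \texttt{floor\_div(5, 2)} gives $2$, \texttt{ceil\_div(5, 2)} gives $3$ and \texttt{real\_div(5, 2)} gives $2.5$.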
|
377 |
378 The norm of a multiple precision integer, for example $\vert \vert x \vert \vert$, will be used to represent the number of digits in the representation |
|
379 of the integer. For example, $\vert \vert 123 \vert \vert = 3$ and $\vert \vert 79452 \vert \vert = 5$. |
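A digit count of this kind can be computed by repeated division, as in the sketch below. The function name \texttt{norm\_digits} is hypothetical, and counting zero as a single digit is a choice made for this sketch, since the text does not define $\vert \vert 0 \vert \vert$.

```c
#include <assert.h>
#include <stdint.h>

/* ||x||: the number of digits needed to write x in the given radix.
 * Zero is counted as one digit, which is an assumption of this sketch. */
unsigned norm_digits(uint64_t x, unsigned radix)
{
    unsigned count = 0;

    assert(radix >= 2);
    do {
        x /= radix;             /* strip the least significant digit */
        count++;
    } while (x != 0);
    return count;
}
```

In radix $10$ the values $123$ and $79452$ yield $3$ and $5$, in agreement with the examples above.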
|
380 |
|
381 \subsection{Work Effort} |
|
382 \index{big-Oh} |
|
383 To measure the efficiency of the specified algorithms, a modified big-Oh notation is used. In this system all |
|
384 single precision operations are considered to have the same cost\footnote{Except where explicitly noted.}. |
|
385 That is a single precision addition, multiplication and division are assumed to take the same time to |
|
386 complete. While this is generally not true in practice, it will simplify the discussions considerably. |
|
387 |
|
388 Some algorithms have slight advantages over others which is why some constants will not be removed in |
|
389 the notation. For example, a normal baseline multiplication (section \ref{sec:basemult}) requires $O(n^2)$ work while a |
|
390 baseline squaring (section \ref{sec:basesquare}) requires $O({{n^2 + n}\over 2})$ work. In standard big-Oh notation these |
|
391 would both be said to be equivalent to $O(n^2)$. However, |
|
392 in the context of this text this is not the case as the magnitude of the inputs will typically be rather small. As a |
|
393 result small constant factors in the work effort will make an observable difference in algorithm efficiency. |
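The two work estimates can be checked by simply counting single precision multiplications, as in the following sketch. It assumes that a baseline multiplication pairs every digit of one input with every digit of the other, and that a baseline squaring computes each product $x_i x_j$ with $i \le j$ only once; the function names are hypothetical.

```c
#include <assert.h>

/* Single precision multiplications in a baseline product of two n-digit
 * inputs: every digit of one operand meets every digit of the other. */
unsigned long mul_work(unsigned n)
{
    unsigned long count = 0;
    unsigned i, j;

    assert(n <= 65535u);        /* keeps the count within 32 bits */
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            count++;
        }
    }
    return count;
}

/* Single precision multiplications in a baseline squaring of an n-digit
 * input: x_i * x_j equals x_j * x_i, so only pairs with i <= j are counted. */
unsigned long sqr_work(unsigned n)
{
    unsigned long count = 0;
    unsigned i, j;

    assert(n <= 65535u);
    for (i = 0; i < n; i++) {
        for (j = i; j < n; j++) {
            count++;
        }
    }
    return count;
}
```

For $n = 8$ the counts are $64$ and $36$, that is $n^2$ and ${{n^2 + n} \over 2}$, so the squaring needs a little over half the multiplications even though both are $O(n^2)$ in standard notation.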
|
394 |
|
395 All of the algorithms presented in this text have a polynomial time work level. That is, of the form |
|
396 $O(n^k)$ for $n, k \in \Z^{+}$. This will help make useful comparisons in terms of the speed of the algorithms and how |
|
397 various optimizations will help pay off in the long run. |
|
398 |
|
399 \section{Exercises} |
|
400 Within the more advanced chapters a section will be set aside to give the reader some challenging exercises related to |
|
401 the discussion at hand. These exercises are not designed to be prize winning problems, but instead to be thought |
|
402 provoking. Wherever possible the problems are forward minded, stating problems that will be answered in subsequent |
|
403 chapters. The reader is encouraged to finish the exercises as they appear to get a better understanding of the |
|
404 subject material. |
|
405 |
|
406 That being said, the problems are designed to affirm knowledge of a particular subject matter. Students in particular |
|
407 are encouraged to verify they can answer the problems correctly before moving on. |
|
408 |
|
409 Similar to the exercises of \cite[p. ix]{TAOCPV2} these exercises are given a scoring system based on the difficulty of |
|
410 the problem. However, unlike \cite{TAOCPV2} the problems do not get nearly as hard. The scoring of these |
|
411 exercises ranges from one (the easiest) to five (the hardest). The following table summarizes the |
|
412 scoring system used. |
|
413 |
|
414 \begin{figure}[h] |
|
415 \begin{center} |
|
416 \begin{small} |
|
417 \begin{tabular}{|c|l|} |
|
418 \hline $\left [ 1 \right ]$ & An easy problem that should only take the reader a matter of \\ |
|
419 & minutes to solve. Usually does not involve much computer time \\ |
|
420 & to solve. \\ |
|
421 \hline $\left [ 2 \right ]$ & An easy problem that involves a marginal amount of computer \\ |
|
422 & time usage. Usually requires a program to be written to \\ |
|
423 & solve the problem. \\ |
|
424 \hline $\left [ 3 \right ]$ & A moderately hard problem that requires a non-trivial amount \\ |
|
425 & of work. Usually involves trivial research and development of \\ |
|
426 & new theory from the perspective of a student. \\ |
|
427 \hline $\left [ 4 \right ]$ & A moderately hard problem that involves a non-trivial amount \\ |
|
428 & of work and research, the solution to which will demonstrate \\ |
|
429 & a higher mastery of the subject matter. \\ |
|
430 \hline $\left [ 5 \right ]$ & A hard problem that involves concepts that are difficult for a \\ |
|
431 & novice to solve. Solutions to these problems will demonstrate a \\ |
|
432 & complete mastery of the given subject. \\ |
|
433 \hline |
|
434 \end{tabular} |
|
435 \end{small} |
|
436 \end{center} |
|
437 \caption{Exercise Scoring System} |
|
438 \end{figure} |
|
439 |
|
440 Problems at the first level are meant to be simple questions that the reader can answer quickly without programming a solution or |
|
441 devising new theory. These problems are quick tests to see if the material is understood. Problems at the second level |
|
442 are also designed to be easy but will require a program or algorithm to be implemented to arrive at the answer. These |
|
443 two levels are essentially entry level questions. |
|
444 |
|
445 Problems at the third level are meant to be a bit more difficult than the first two levels. The answer is often |
|
446 fairly obvious but arriving at an exacting solution requires some thought and skill. These problems will almost always |
|
447 involve devising a new algorithm or implementing a variation of another algorithm previously presented. Readers who can |
|
448 answer these questions will feel comfortable with the concepts behind the topic at hand. |
|
449 |
|
450 Problems at the fourth level are meant to be similar to those of the level three questions except they will require |
|
451 additional research to be completed. The reader will most likely not know the answer right away, nor will the text provide |
|
452 the exact details of the answer until a subsequent chapter. |
|
453 |
|
454 Problems at the fifth level are meant to be the hardest |
|
455 problems relative to all the other problems in the chapter. People who can correctly answer fifth level problems have a |
|
456 mastery of the subject matter at hand. |
|
457 |
|
458 Often problems will be tied together. The purpose of this is to start a chain of thought that will be discussed in future chapters. The reader |
|
459 is encouraged to answer the follow-up problems and try to see how the problems relate to one another. |
|
460 |
|
461 \section{Introduction to LibTomMath} |
|
462 |
|
463 \subsection{What is LibTomMath?} |
|
464 LibTomMath is a free and open source multiple precision integer library written entirely in portable ISO C. By portable it |
|
465 is meant that the library does not contain any code that is computer platform dependent or otherwise problematic to use on |
|
466 any given platform. |
|
467 |
|
468 The library has been successfully tested under numerous operating systems including Unix\footnote{All of these |
|
469 trademarks belong to their respective rightful owners.}, MacOS, Windows, Linux, PalmOS and on standalone hardware such |
|
470 as the Gameboy Advance. The library is designed to contain enough functionality to be able to develop applications such |
|
471 as public key cryptosystems and still maintain a relatively small footprint. |
|
472 |
|
473 \subsection{Goals of LibTomMath} |
|
474 |
|
475 Libraries which obtain the most efficiency are rarely written in a high level programming language such as C. However, |
|
476 even though this library is written entirely in ISO C, considerable care has been taken to optimize the algorithm implementations within the |
|
477 library. Specifically the code has been written to work well with the GNU C Compiler (\textit{GCC}) on both x86 and ARM |
|
478 processors. Wherever possible, highly efficient algorithms, such as Karatsuba multiplication, sliding window |
|
479 exponentiation and Montgomery reduction, have been provided to make the library more efficient. |
|
480 |
|
481 Even with the nearly optimal and specialized algorithms that have been included the Application Programming Interface |
|
482 (\textit{API}) has been kept as simple as possible. Often generic place holder routines will make use of specialized |
|
483 algorithms automatically without the developer's specific attention. One such example is the generic multiplication |
|
484 algorithm \textbf{mp\_mul()} which will automatically use Toom--Cook, Karatsuba, Comba or baseline multiplication |
|
485 based on the magnitude of the inputs and the configuration of the library. |
|
486 |
|
487 Making LibTomMath as efficient as possible is not the only goal of the LibTomMath project. Ideally the library should |
|
488 be source compatible with another popular library which makes it more attractive for developers to use. In this case the |
|
489 MPI library was used as an API template for all the basic functions. MPI was chosen because it is another library that fits |
|
490 in the same niche as LibTomMath. Even though LibTomMath uses MPI as the template for the function names and argument |
|
491 passing conventions, it has been written from scratch by Tom St Denis. |
|
492 |
|
493 The project is also meant to act as a learning tool for students, the logic being that no easy-to-follow ``bignum'' |
|
494 library exists which can be used to teach computer science students how to perform fast and reliable multiple precision |
|
495 integer arithmetic. To this end the source code has been given quite a few comments and algorithm discussion points. |
|
496 |
|
497 \section{Choice of LibTomMath} |
|
498 LibTomMath was chosen as the case study of this text not only because the author of both projects is one and the same but |
|
499 for more worthy reasons. Other libraries such as GMP \cite{GMP}, MPI \cite{MPI}, LIP \cite{LIP} and OpenSSL |
|
500 \cite{OPENSSL} have multiple precision integer arithmetic routines but would not be ideal for this text for |
|
501 reasons that will be explained in the following sub-sections. |
|
502 |
|
503 \subsection{Code Base} |
|
504 The LibTomMath code base is all portable ISO C source code. This means that there are no platform dependent conditional |
|
505 segments of code littered throughout the source. This clean and uncluttered approach to the library means that a |
|
506 developer can more readily discern the true intent of a given section of source code without trying to keep track of |
|
507 what conditional code will be used. |
|
508 |
|
509 The code base of LibTomMath is well organized. Each function is in its own separate source code file |
|
510 which allows the reader to find a given function very quickly. On average there are $76$ lines of code per source |
|
511 file which makes the source very easy to follow. By comparison MPI and LIP are single file projects making code tracing |
|
512 very hard. GMP has many conditional code segments which also hinder tracing. |
|
513 |
|
514 When compiled with GCC for the x86 processor and optimized for speed the entire library is approximately $100$KiB\footnote{The notation ``KiB'' means $2^{10}$ octets, similarly ``MiB'' means $2^{20}$ octets.} |
|
515 which is fairly small compared to GMP (over $250$KiB). LibTomMath is slightly larger than MPI (which compiles to about |
|
516 $50$KiB) but LibTomMath is also much faster and more complete than MPI. |
|
517 |
|
518 \subsection{API Simplicity} |
|
519 LibTomMath is designed after the MPI library and shares the API design. Quite often programs that use MPI will build |
|
520 with LibTomMath without change. The function names correlate directly to the action they perform. Almost all of the |
|
521 functions share the same parameter passing convention. The learning curve is fairly shallow with the API provided |
|
522 which is an extremely valuable benefit for the student and developer alike. |
|
523 |
|
524 The LIP library is an example of a library with an API that is awkward to work with. LIP uses function names that are often ``compressed'' to |
|
525 illegible shorthand. LibTomMath does not share this characteristic. |
|
526 |
|
527 The GMP library also does not return error codes. Instead it uses a POSIX.1 \cite{POSIX1} signal system where errors |
|
528 are signaled to the host application. This happens to be the fastest approach but definitely not the most versatile. In |
|
529 effect a math error (e.g. invalid input, heap error, etc.) can cause a program to stop functioning which is definitely |

530 undesirable in many situations. |
|
531 |
|
532 \subsection{Optimizations} |
|
533 While LibTomMath is certainly not the fastest library (GMP often beats LibTomMath by a factor of two) it does |
|
534 feature a set of optimal algorithms for tasks such as modular reduction, exponentiation, multiplication and squaring. GMP |
|
535 and LIP also feature such optimizations while MPI only uses baseline algorithms with no optimizations. GMP lacks a few |
|
536 of the additional modular reduction optimizations that LibTomMath features\footnote{At the time of this writing GMP |
|
537 only had Barrett and Montgomery modular reduction algorithms.}. |
|
538 |
|
539 LibTomMath is almost always an order of magnitude faster than the MPI library at computationally expensive tasks such as modular |
|
540 exponentiation. In the grand scheme of ``bignum'' libraries LibTomMath is faster than the average library and usually |
|
541 slower than the best libraries such as GMP and OpenSSL by only a small factor. |
|
542 |
|
543 \subsection{Portability and Stability} |
|
544 LibTomMath will build ``out of the box'' on any platform equipped with a modern version of the GNU C Compiler |
|
545 (\textit{GCC}). This means that without changes the library will build without configuration or setting up any |
|
546 variables. LIP and MPI will build ``out of the box'' as well but have numerous known bugs. Most notably the author of |
|
547 MPI has recently stopped working on his library and LIP has long since been discontinued. |
|
548 |
|
549 GMP requires a configuration script to run and will not build out of the box. GMP and LibTomMath are still in active |
|
550 development and are very stable across a variety of platforms. |
|
551 |
|
552 \subsection{Choice} |
|
553 LibTomMath is a relatively compact, well documented, highly optimized and portable library which seems only natural for |
|
554 the case study of this text. Various source files from the LibTomMath project will be included within the text. However, |
|
555 the reader is encouraged to download their own copy of the library to actually be able to work with the library. |
|
556 |
|
557 \chapter{Getting Started} |
|
558 \section{Library Basics} |
|
559 The trick to writing any useful library of source code is to build a solid foundation and work outwards from it. First, |
|
560 a problem along with allowable solution parameters should be identified and analyzed. In this particular case the |
|
561 inability to accommodate multiple precision integers is the problem. Furthermore, the solution must be written |
|
562 as portable source code that is reasonably efficient across several different computer platforms. |
|
563 |
|
564 After a foundation is formed the remainder of the library can be designed and implemented in a hierarchical fashion. |
|
565 That is, the lowest level dependencies are implemented first and the most abstract functions last. For example, |
|
566 before implementing a modular exponentiation algorithm one would implement a modular reduction algorithm. |
|
567 By building outwards from a base foundation instead of using a parallel design methodology the resulting project is |
|
568 highly modular. Being highly modular is a desirable property of any project as it often means the resulting product |
|
569 has a small footprint and updates are easy to perform. |
|
570 |
|
571 Usually when I start a project I will begin with the header files. I define the data types I think I will need and |
|
572 prototype the initial functions that are not dependent on other functions (within the library). After I |
|
573 implement these base functions I prototype more dependent functions and implement them. The process repeats until |
|
574 I implement all of the functions I require. For example, in the case of LibTomMath I implemented functions such as |
|
575 mp\_init() well before I implemented mp\_mul() and even further before I implemented mp\_exptmod(). As an example as to |
|
576 why this design works note that the Karatsuba and Toom-Cook multipliers were written \textit{after} the |
|
577 dependent function mp\_exptmod() was written. Adding the new multiplication algorithms did not require changes to the |
|
578 mp\_exptmod() function itself and lowered the total cost of ownership (\textit{so to speak}) and of development |
|
579 for new algorithms. This methodology allows new algorithms to be tested in a complete framework with relative ease. |
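
The layering described above can be illustrated with a minimal sketch on ordinary machine words. The names \textbf{sk\_mod}, \textbf{sk\_mulmod} and \textbf{sk\_exptmod} are hypothetical and are not part of LibTomMath; the point is only that the top layer calls the layer below it, so a different multiplier could be swapped in without touching the exponentiation routine.

```c
#include <assert.h>
#include <stdint.h>

/* Lowest layer: modular reduction. */
uint64_t sk_mod(uint64_t x, uint64_t m)
{
    assert(m != 0);
    return x % m;
}

/* Middle layer: modular multiplication built on sk_mod().
 * Both operands must fit in 32 bits so the product fits in 64 bits. */
uint64_t sk_mulmod(uint64_t a, uint64_t b, uint64_t m)
{
    assert(a <= UINT32_MAX && b <= UINT32_MAX);
    return sk_mod(a * b, m);
}

/* Top layer: right-to-left binary exponentiation, g^x mod m.
 * It only ever calls the layers below it. */
uint64_t sk_exptmod(uint64_t g, uint64_t x, uint64_t m)
{
    uint64_t result, base;

    assert(m != 0 && m <= UINT32_MAX);
    result = sk_mod(1, m);
    base   = sk_mod(g, m);
    while (x != 0) {
        if (x & 1) {
            result = sk_mulmod(result, base, m);
        }
        base = sk_mulmod(base, base, m);
        x >>= 1;
    }
    return result;
}
```

For instance, sk\_exptmod(2, 10, 1000) evaluates to $24$. Replacing the body of sk\_mulmod() with another algorithm would leave sk\_exptmod() unchanged, which is the property the hierarchical approach is meant to preserve.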
|
580 |
|
581 FIGU,design_process,Design Flow of the First Few Original LibTomMath Functions. |
|
582 |
|
583 Only after the majority of the functions were in place did I pursue a less hierarchical approach to auditing and optimizing |
|
584 the source code. For example, one day I may audit the multipliers and the next day the polynomial basis functions. |
|
585 |
|
586 It only makes sense to begin the text with the preliminary data types and support algorithms required as well. |
|
587 This chapter discusses the core algorithms of the library which are the dependencies of every other algorithm. |
|
588 |
|
589 \section{What is a Multiple Precision Integer?} |
|
590 Recall that most programming languages, in particular ISO C \cite{ISOC}, only have fixed precision data types that on their own cannot |
|
591 be used to represent values larger than their precision will allow. The purpose of multiple precision algorithms is |
|
592 to use fixed precision data types to create and manipulate multiple precision integers which may represent values |
|
593 that are very large. |
|
594 |
|
595 As a well known analogy, school children are taught how to form numbers larger than nine by prepending more radix ten digits. In the decimal system |
|
596 the largest single digit value is $9$. However, by concatenating digits together larger numbers may be represented. Newly prepended digits |
|
597 (\textit{to the left}) are said to be in a different power of ten column. That is, the number $123$ can be described as having a $1$ in the hundreds |
|
598 column, $2$ in the tens column and $3$ in the ones column. Or more formally $123 = 1 \cdot 10^2 + 2 \cdot 10^1 + 3 \cdot 10^0$. Computer based |
|
599 multiple precision arithmetic is essentially the same concept. Larger integers are represented by adjoining fixed |
|
600 precision computer words with the exception that a different radix is used. |
|
601 |
|
602 What most people probably do not think about explicitly are the various other attributes that describe a multiple precision |
|
603 integer. For example, the integer $154_{10}$ has two immediately obvious properties. First, the integer is positive, |
|
604 that is the sign of this particular integer is positive as opposed to negative. Second, the integer has three digits in |
|
605 its representation. There is an additional property that the integer possesses that does not concern pencil-and-paper |

606 arithmetic. The third property is how many digit placeholders are available to hold the integer. |
|
607 |
|
608 The human analogy of this third property is ensuring there is enough space on the paper to write the integer. For example, |
|
609 if one starts writing a large number too far to the right on a piece of paper they will have to erase it and move left. |
|
610 Similarly, computer algorithms must maintain strict control over memory usage to ensure that the digits of an integer |
|
611 will not exceed the allowed boundaries. These three properties make up what is known as a multiple precision |
|
612 integer or mp\_int for short. |
|
613 |
|
614 \subsection{The mp\_int Structure} |
|
615 \label{sec:MPINT} |
|
616 The mp\_int structure is the ISO C based manifestation of what represents a multiple precision integer. The ISO C standard does not provide for |
|
617 any such data type but it does provide for making composite data types known as structures. The following is the structure definition |
|
618 used within LibTomMath. |
|
619 |
|
620 \index{mp\_int} |
|
621 \begin{figure}[here] |
|
622 \begin{center} |
|
623 \begin{small} |
|
624 %\begin{verbatim} |
|
625 \begin{tabular}{|l|} |
|
626 \hline |
|
627 typedef struct \{ \\ |
|
628 \hspace{3mm}int used, alloc, sign;\\ |
|
629 \hspace{3mm}mp\_digit *dp;\\ |
|
630 \} \textbf{mp\_int}; \\ |
|
631 \hline |
|
632 \end{tabular} |
|
633 %\end{verbatim} |
|
634 \end{small} |
|
635 \caption{The mp\_int Structure} |
|
636 \label{fig:mpint} |
|
637 \end{center} |
|
638 \end{figure} |
|
639 |
|
640 The mp\_int structure (fig. \ref{fig:mpint}) can be broken down as follows. |
|
641 |
|
642 \begin{enumerate} |
|
643 \item The \textbf{used} parameter denotes how many digits of the array \textbf{dp} contain the digits used to represent |
|
644 a given integer. The \textbf{used} count must be positive (or zero) and may not exceed the \textbf{alloc} count. |
|
645 |
|
646 \item The \textbf{alloc} parameter denotes how |
|
647 many digits are available in the array to use by functions before it has to increase in size. When the \textbf{used} count |
|
648 of a result would exceed the \textbf{alloc} count all of the algorithms will automatically increase the size of the |
|
649 array to accommodate the precision of the result. |
|
650 |
|
651 \item The pointer \textbf{dp} points to a dynamically allocated array of digits that represent the given multiple |
|
652 precision integer. It is padded with $(\textbf{alloc} - \textbf{used})$ zero digits. The array is maintained in a least |
|
653 significant digit order. As a pencil and paper analogy the array is organized such that the right most digits are stored |
|
654 first starting at the location indexed by zero\footnote{In C all arrays begin at zero.} in the array. For example, |
|
655 if \textbf{dp} contains $\lbrace a, b, c, \ldots \rbrace$ where \textbf{dp}$_0 = a$, \textbf{dp}$_1 = b$, \textbf{dp}$_2 = c$, $\ldots$ then |
|
656 it would represent the integer $a + b\beta + c\beta^2 + \ldots$ |
|
657 |
|
658 \index{MP\_ZPOS} \index{MP\_NEG} |
|
659 \item The \textbf{sign} parameter denotes the sign as either zero/positive (\textbf{MP\_ZPOS}) or negative (\textbf{MP\_NEG}). |
|
660 \end{enumerate} |
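
As a minimal sketch of the layout just described, the following C fragment declares a structure with the same four members and evaluates the sum $a + b\beta + c\beta^2 + \ldots$ for values small enough to fit in a machine word. The names \textbf{sk\_int}, \textbf{sk\_digit} and \textbf{sk\_value}, and the choice to pass the radix $\beta$ as a parameter, are assumptions made for illustration and do not come from LibTomMath.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for illustration; not the LibTomMath definitions. */
typedef uint32_t sk_digit;

enum { SK_ZPOS = 0, SK_NEG = 1 };

typedef struct {
    int       used;   /* digits of dp[] that carry the value  */
    int       alloc;  /* digits available in dp[]             */
    int       sign;   /* SK_ZPOS or SK_NEG                    */
    sk_digit *dp;     /* digits, least significant one first  */
} sk_int;

/* Evaluate dp[0] + dp[1]*beta + dp[2]*beta^2 + ... using Horner's rule.
 * Only meaningful while the magnitude fits in an int64_t. */
int64_t sk_value(const sk_int *a, int64_t beta)
{
    int64_t v = 0;

    assert(a != NULL && a->used >= 0 && a->used <= a->alloc);
    for (int i = a->used - 1; i >= 0; i--) {
        v = v * beta + (int64_t)a->dp[i];
    }
    return (a->sign == SK_NEG) ? -v : v;
}
```

With the digits $\lbrace 3, 2, 1 \rbrace$ and $\beta = 10$ the function returns $123$, which matches the decimal example given earlier: the digit stored at index zero sits in the ones column.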
|
661 |
|
662 \subsubsection{Valid mp\_int Structures} |
|
663 Several rules are placed on the state of an mp\_int structure and are assumed to be followed for reasons of efficiency. |
|
664 The only exceptions are when the structure is passed to initialization functions such as mp\_init() and mp\_init\_copy(). |
|
665 |
|
666 \begin{enumerate} |
|
667 \item The value of \textbf{alloc} may not be less than one. That is \textbf{dp} always points to a previously allocated |
|
668 array of digits. |
|
669 \item The value of \textbf{used} may not exceed \textbf{alloc} and must be greater than or equal to zero. |
|
670 \item The value of \textbf{used} implies the digit at index $(used - 1)$ of the \textbf{dp} array is non-zero. That is, |
|
671 leading zero digits in the most significant positions must be trimmed. |
|
672 \begin{enumerate} |
|
673 \item Digits in the \textbf{dp} array at and above the \textbf{used} location must be zero. |
|
674 \end{enumerate} |
|
675 \item The value of \textbf{sign} must be \textbf{MP\_ZPOS} if \textbf{used} is zero; |
|
676 this represents the mp\_int value of zero. |
|
677 \end{enumerate} |
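
The rules above can be collected into a single predicate. This is only a sketch under the assumption of a structure laid out as in figure \ref{fig:mpint}; the name \textbf{sk\_is\_valid} and the type names are hypothetical, and nothing here is taken from the LibTomMath source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t sk_digit;           /* hypothetical digit type */
enum { SK_ZPOS = 0, SK_NEG = 1 };    /* hypothetical sign codes */

typedef struct {
    int       used, alloc, sign;
    sk_digit *dp;
} sk_int;

/* Return 1 if *a satisfies every rule in the list, 0 otherwise. */
int sk_is_valid(const sk_int *a)
{
    /* rule 1: at least one digit is allocated */
    if (a == NULL || a->dp == NULL || a->alloc < 1) return 0;

    /* rule 2: 0 <= used <= alloc */
    if (a->used < 0 || a->used > a->alloc) return 0;

    /* rule 3: the most significant used digit is non-zero */
    if (a->used > 0 && a->dp[a->used - 1] == 0) return 0;

    /* rule 3(a): digits at and above index used are zero */
    for (int i = a->used; i < a->alloc; i++) {
        if (a->dp[i] != 0) return 0;
    }

    /* rule 4: zero is always stored with a positive sign */
    if (a->used == 0 && a->sign != SK_ZPOS) return 0;

    return 1;
}
```

A predicate of this kind is mainly useful in test code or debugging builds, since running it on every call would give up the efficiency the rules are meant to buy.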
|
678 |
|
679 \section{Argument Passing} |
|
680 A convention of argument passing must be adopted early on in the development of any library. Making the function |
|
681 prototypes consistent will help eliminate many headaches in the future as the library grows to significant complexity. |
|
682 In LibTomMath the multiple precision integer functions accept parameters from left to right as pointers to mp\_int |
|
683 structures. That means that the source (input) operands are placed on the left and the destination (output) on the right. |
|
684 Consider the following examples. |
|
685 |
|
686 \begin{verbatim} |
|
687 mp_mul(&a, &b, &c); /* c = a * b */ |
|
688 mp_add(&a, &b, &a); /* a = a + b */ |
|
689 mp_sqr(&a, &b); /* b = a * a */ |
|
690 \end{verbatim} |
|
691 |
|
692 The left to right order is a fairly natural way to implement the functions since it lets the developer read aloud the |
|
693 functions and make sense of them. For example, the first function would read ``multiply a and b and store in c''. |
|
694 |
|
695 Certain libraries (\textit{LIP by Lenstra for instance}) accept parameters the other way around, to mimic the order |
|
696 of assignment expressions. That is, the destination (output) is on the left and arguments (inputs) are on the right. In |
|
697 truth, it is entirely a matter of preference. In the case of LibTomMath the convention from the MPI library has been |
|
698 adopted. |
|
699 |
|
700 Another very useful design consideration, provided for in LibTomMath, is whether to allow argument sources to also be a |
|
701 destination. For example, the second example (\textit{mp\_add}) adds $a$ to $b$ and stores in $a$. This is an important |
|
702 feature to implement since it allows the calling function to cut down on the number of variables it must maintain. |
|
703 However, to implement this feature specific care has to be given to ensure the destination is not modified before the |
|
704 source is fully read. |
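
One common way to make a destination safe to alias with a source is to compute into a temporary and copy the result out only at the end. The sketch below does this for a toy fixed-size radix-10 multiplier. The names \textbf{sk\_num} and \textbf{sk\_mul} are hypothetical, and the fixed array is an assumption made to keep the example short; it is not how LibTomMath stores its digits.

```c
#include <assert.h>
#include <string.h>

#define SK_DIGITS 8            /* hypothetical fixed capacity */

typedef struct {
    int digit[SK_DIGITS];      /* radix-10 digits, least significant first */
} sk_num;

/* c = a * b, keeping only the low SK_DIGITS digits of the product.
 * c may point to the same object as a or b. */
void sk_mul(const sk_num *a, const sk_num *b, sk_num *c)
{
    sk_num t;

    assert(a != NULL && b != NULL && c != NULL);
    memset(&t, 0, sizeof t);

    for (int i = 0; i < SK_DIGITS; i++) {
        int carry = 0;
        for (int j = 0; i + j < SK_DIGITS; j++) {
            int v = t.digit[i + j] + a->digit[i] * b->digit[j] + carry;
            t.digit[i + j] = v % 10;
            carry = v / 10;
        }
    }

    /* Both inputs have been fully read, so overwriting an alias is safe. */
    *c = t;
}
```

Calling sk\_mul(\&a, \&b, \&a) with $a = 12$ and $b = 34$ leaves $408$ in $a$. Had the inner loop written directly into $c$, the digits of $a$ would have been destroyed while they were still needed.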
|
705 |
|
706 \section{Return Values} |
|
707 A well implemented application, no matter what its purpose, should trap as many runtime errors as possible and return them |
|
708 to the caller. By catching runtime errors a library can be guaranteed to prevent undefined behaviour. However, the end |
|
709 developer can still manage to cause a library to crash. For example, by passing an invalid pointer an application may |
|
710 fault by dereferencing memory not owned by the application. |
|
711 |
|
712 In the case of LibTomMath the only errors that are checked for are related to inappropriate inputs (division by zero for |
|
713 instance) and memory allocation errors. It will not check that the mp\_int passed to any function is valid nor |
|
714 will it check pointers for validity. Any function that can cause a runtime error will return an error code as an |
|
715 \textbf{int} data type with one of the following values (fig. \ref{fig:errcodes}). |
|
716 |
|
717 \index{MP\_OKAY} \index{MP\_VAL} \index{MP\_MEM} |
|
718 \begin{figure}[here] |
|
719 \begin{center} |
|
720 \begin{tabular}{|l|l|} |
|
721 \hline \textbf{Value} & \textbf{Meaning} \\ |
|
722 \hline \textbf{MP\_OKAY} & The function was successful \\ |
|
723 \hline \textbf{MP\_VAL} & One of the input value(s) was invalid \\ |
|
724 \hline \textbf{MP\_MEM} & The function ran out of heap memory \\ |
|
725 \hline |
|
726 \end{tabular} |
|
727 \end{center} |
|
728 \caption{LibTomMath Error Codes} |
|
729 \label{fig:errcodes} |
|
730 \end{figure} |
|
731 |
|
732 When an error is detected within a function it should free any memory it allocated, often during the initialization of |
|
733 temporary mp\_ints, and return as soon as possible. The goal is to leave the system in the same state it was when the |
|
734 function was called. Error checking with this style of API is fairly simple. |
|
735 |
|
736 \begin{verbatim} |
|
737 int err; |
|
738 if ((err = mp_add(&a, &b, &c)) != MP_OKAY) { |
|
739 printf("Error: %s\n", mp_error_to_string(err)); |
|
740 exit(EXIT_FAILURE); |
|
741 } |
|
742 \end{verbatim} |
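
The rule of freeing temporaries and leaving the caller's state untouched on failure can be sketched without any multiple precision code at all. In the fragment below every name (\textbf{sk\_alloc}, \textbf{sk\_free}, \textbf{sk\_square\_plus} and the error codes) is hypothetical; the allocator wrapper exists only so that a failure can be forced and the number of live blocks counted.

```c
#include <assert.h>
#include <stdlib.h>

enum { SK_OKAY = 0, SK_MEM = -2 };   /* hypothetical error codes */

/* Test hooks: live block count and an optional forced failure. */
int sk_live    = 0;
int sk_calls   = 0;
int sk_fail_at = 0;      /* 0 = never fail, k = fail the k-th call */

void *sk_alloc(size_t n)
{
    void *p;
    if (++sk_calls == sk_fail_at) return NULL;
    p = malloc(n);
    if (p != NULL) sk_live++;
    return p;
}

void sk_free(void *p)
{
    if (p != NULL) { free(p); sk_live--; }
}

/* out[i] = in[i]*in[i] + in[i], computed through two temporaries. */
int sk_square_plus(const int *in, int *out, size_t n)
{
    int *sq, *sum;

    assert(in != NULL && out != NULL && n > 0);
    if ((sq = sk_alloc(n * sizeof *sq)) == NULL) return SK_MEM;
    if ((sum = sk_alloc(n * sizeof *sum)) == NULL) {
        sk_free(sq);             /* undo the first allocation before leaving */
        return SK_MEM;
    }
    for (size_t i = 0; i < n; i++) {
        sq[i]  = in[i] * in[i];
        sum[i] = sq[i] + in[i];
    }
    for (size_t i = 0; i < n; i++) out[i] = sum[i];   /* commit on success only */

    sk_free(sum);
    sk_free(sq);
    return SK_OKAY;
}
```

When the second allocation is forced to fail, the function returns the memory error code, the output array is untouched and no block remains allocated, which is the state the caller started from.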
|
743 |
|
744 The GMP \cite{GMP} library uses C style \textit{signals} to flag errors which is of questionable use. Not all errors are fatal |
|
745 and it was not deemed ideal by the author of LibTomMath to force developers to have signal handlers for such cases. |
|
746 |
|
747 \section{Initialization and Clearing} |
|
748 The logical starting point when actually writing multiple precision integer functions is the initialization and |
|
749 clearing of the mp\_int structures. These two algorithms will be used by the majority of the higher level algorithms. |
|
750 |
|
751 Given the basic mp\_int structure an initialization routine must first allocate memory to hold the digits of |
|
752 the integer. Often it is optimal to allocate a sufficiently large pre-set number of digits even though |
|
753 the initial integer will represent zero. If only a single digit were allocated quite a few subsequent re-allocations |
|
754 would occur when operations are performed on the integers. There is a tradeoff between how many default digits to allocate |
|
755 and how many re-allocations are tolerable. Obviously allocating an excessive amount of digits initially will waste |
|
756 memory and become unmanageable. |
|
757 |
|
758 If the memory for the digits has been successfully allocated then the rest of the members of the structure must |
|
759 be initialized. Since the initial state of an mp\_int is to represent the zero integer, the allocated digits must be set |
|
760 to zero. The \textbf{used} count is set to zero and \textbf{sign} is set to \textbf{MP\_ZPOS}. |
|
761 |
|
762 \subsection{Initializing an mp\_int} |
|
763 An mp\_int is said to be initialized if it is set to a valid, preferably default, state such that all of the members of the |
|
764 structure are set to valid values. The mp\_init algorithm will perform such an action. |
|
765 |
|
766 \index{mp\_init} |
|
767 \begin{figure}[here] |
|
768 \begin{center} |
|
769 \begin{tabular}{l} |
|
770 \hline Algorithm \textbf{mp\_init}. \\ |
|
771 \textbf{Input}. An mp\_int $a$ \\ |
|
772 \textbf{Output}. Allocate memory and initialize $a$ to a known valid mp\_int state. \\ |
|
773 \hline \\ |
|
774 1. Allocate memory for \textbf{MP\_PREC} digits. \\ |
|
775 2. If the allocation failed return(\textit{MP\_MEM}) \\ |
|
776 3. for $n$ from $0$ to $MP\_PREC - 1$ do \\ |
|
777 \hspace{3mm}3.1 $a_n \leftarrow 0$\\ |
|
778 4. $a.sign \leftarrow MP\_ZPOS$\\ |
|
779 5. $a.used \leftarrow 0$\\ |
|
780 6. $a.alloc \leftarrow MP\_PREC$\\ |
|
781 7. Return(\textit{MP\_OKAY})\\ |
|
782 \hline |
|
783 \end{tabular} |
|
784 \end{center} |
|
785 \caption{Algorithm mp\_init} |
|
786 \end{figure} |
|
787 |
|
788 \textbf{Algorithm mp\_init.} |
|
789 The purpose of this function is to initialize an mp\_int structure so that the rest of the library can properly |
|
790 manipulate it. It is assumed that the input may not have had any of its members previously initialized which is certainly |
|
791 a valid assumption if the input resides on the stack. |
|
792 |
|
793 Before any of the members such as \textbf{sign}, \textbf{used} or \textbf{alloc} are initialized the memory for |
|
794 the digits is allocated. If this fails the function returns before setting any of the other members. The \textbf{MP\_PREC} |
|
795 name represents a constant\footnote{Defined in the ``tommath.h'' header file within LibTomMath.} |
|
796 used to dictate the minimum precision of newly initialized mp\_int integers. Ideally, it is at least equal to the smallest |
|
797 precision number you'll be working with. |
|
798 |
|
799 Allocating a block of digits up front instead of a single digit reduces the number of (usually slow) |

800 heap operations that later functions will have to perform. If \textbf{MP\_PREC} is set correctly the slack |
|
801 memory and the number of heap operations will be trivial. |
|
802 |
|
803 Once the allocation has been made the digits have to be set to zero as well as the \textbf{used}, \textbf{sign} and |
|
804 \textbf{alloc} members initialized. This ensures that the mp\_int will always represent the default state of zero regardless |
|
805 of the original condition of the input. |
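
A minimal C sketch of the steps of algorithm mp\_init follows. It is not the LibTomMath source: the type and constant names are placeholders chosen for illustration, the default precision of 32 digits is an arbitrary assumption, and plain malloc() is assumed as the allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t sk_digit;               /* hypothetical digit type        */
enum { SK_OKAY = 0, SK_MEM = -2 };       /* hypothetical error codes       */
enum { SK_ZPOS = 0, SK_NEG = 1 };        /* hypothetical sign codes        */
#define SK_PREC 32                       /* hypothetical default precision */

typedef struct {
    int       used, alloc, sign;
    sk_digit *dp;
} sk_int;

int sk_init(sk_int *a)
{
    sk_digit *dp;

    assert(a != NULL);

    /* steps 1 and 2: allocate first so a failure leaves *a untouched */
    dp = malloc(sizeof(sk_digit) * SK_PREC);
    if (dp == NULL) {
        return SK_MEM;
    }

    /* step 3: assign zero to every digit */
    for (int n = 0; n < SK_PREC; n++) {
        dp[n] = 0;
    }

    /* steps 4 to 6: set the remaining members to the state of zero */
    a->dp    = dp;
    a->sign  = SK_ZPOS;
    a->used  = 0;
    a->alloc = SK_PREC;

    return SK_OKAY;                      /* step 7 */
}
```

The caller owns the structure itself, typically as a local variable, and passes its address to the routine; only the digit array lives on the heap.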
|
806 |
|
807 \textbf{Remark.} |
|
808 This function introduces the idiosyncrasy that all iterative loops, commonly initiated with the ``for'' keyword, iterate incrementally |
|
809 when the ``to'' keyword is placed between two expressions. For example, ``for $a$ from $b$ to $c$ do'' means that |
|
810 a subsequent expression (or body of expressions) is to be evaluated up to $c - b + 1$ times so long as $b \le c$. In each |

811 iteration the variable $a$ is substituted for a new integer that lies inclusively between $b$ and $c$. If $b > c$ occurred |
|
812 the loop would not iterate. By contrast if the ``downto'' keyword were used in place of ``to'' the loop would iterate |
|
813 decrementally. |
|
814 |
|
815 EXAM,bn_mp_init.c |
|
816 |
|
817 One immediate observation of this initialization function is that it does not return a pointer to an mp\_int structure. It |
|
818 is assumed that the caller has already allocated memory for the mp\_int structure, typically on the application stack. The |
|
819 call to mp\_init() is used only to initialize the members of the structure to a known default state. |
|
820 |
|
821 Here we see (line @23,XMALLOC@) the memory allocation is performed first. This allows us to exit cleanly and quickly |
|
822 if there is an error. If the allocation fails the routine will return \textbf{MP\_MEM} to the caller to indicate there |
|
823 was a memory error. The function XMALLOC is what actually allocates the memory. Technically XMALLOC is not a function |
|
824 but a macro defined in ``tommath.h''. By default, XMALLOC will evaluate to malloc() which is the C library's built-in |
|
825 memory allocation routine. |
|
826 |
|
827 In order to assure the mp\_int is in a known state the digits must be set to zero. On most platforms this could have been |
|
828 accomplished by using calloc() instead of malloc(). However, to correctly initialize an integer type to a given value in a |
|
829 portable fashion you have to actually assign the value. The for loop (line @28,for@) performs this required |
|
830 operation. |
|
831 |
|
832 After the memory has been successfully initialized the remainder of the members are initialized |
|
833 (lines @29,used@ through @31,sign@) to their respective default states. At this point the algorithm has succeeded and |
|
834 a success code is returned to the calling function. If this function returns \textbf{MP\_OKAY} it is safe to assume the |
|
835 mp\_int structure has been properly initialized and is safe to use with other functions within the library. |
|
836 |
|
837 \subsection{Clearing an mp\_int} |
|
838 When an mp\_int is no longer required by the application, the memory that has been allocated for its digits must be |
|
839 returned to the application's memory pool with the mp\_clear algorithm. |
|
840 |
|
841 \begin{figure}[here] |
|
842 \begin{center} |
|
843 \begin{tabular}{l} |
|
844 \hline Algorithm \textbf{mp\_clear}. \\ |
|
845 \textbf{Input}. An mp\_int $a$ \\ |
|
846 \textbf{Output}. The memory for $a$ shall be deallocated. \\ |
|
847 \hline \\ |
|
848 1. If $a$ has been previously freed then return(\textit{MP\_OKAY}). \\ |
|
849 2. for $n$ from 0 to $a.used - 1$ do \\ |
|
850 \hspace{3mm}2.1 $a_n \leftarrow 0$ \\ |
|
851 3. Free the memory allocated for the digits of $a$. \\ |
|
852 4. $a.used \leftarrow 0$ \\ |
|
853 5. $a.alloc \leftarrow 0$ \\ |
|
854 6. $a.sign \leftarrow MP\_ZPOS$ \\ |
|
855 7. Return(\textit{MP\_OKAY}). \\ |
|
856 \hline |
|
857 \end{tabular} |
|
858 \end{center} |
|
859 \caption{Algorithm mp\_clear} |
|
860 \end{figure} |
|
861 |
|
862 \textbf{Algorithm mp\_clear.} |
|
863 This algorithm accomplishes two goals. First, it clears the digits and the other mp\_int members. This ensures that |
|
864 if a developer accidentally re-uses a cleared structure it is less likely to cause problems. The second goal |
|
865 is to free the allocated memory. |
|
866 |
|
867 The logic behind the algorithm is extended by marking cleared mp\_int structures so that subsequent calls to this |
|
868 algorithm will not try to free the memory multiple times. Cleared mp\_ints are detectable by having a pre-defined invalid |
|
869 digit pointer \textbf{dp} setting. |
|
870 |
|
871 Once an mp\_int has been cleared the mp\_int structure is no longer in a valid state for any other algorithm |
|
872 with the exception of algorithms mp\_init, mp\_init\_copy, mp\_init\_size and mp\_clear. |
|
873 |
|
874 EXAM,bn_mp_clear.c |
|
875 |
|
876 The algorithm only operates on the mp\_int if it hasn't been previously cleared. The if statement (line @23,a->dp != NULL@) |
|
877 checks to see if the \textbf{dp} member is not \textbf{NULL}. If the mp\_int is a valid mp\_int then \textbf{dp} cannot be |
|
878 \textbf{NULL} in which case the if statement will evaluate to true. |
|
879 |
|
880 The digits of the mp\_int are cleared by the for loop (line @25,for@) which assigns a zero to every digit. Similar to mp\_init() |
|
881 the digits are assigned zero instead of using block memory operations (such as memset()) since this is more portable. |
|
882 |
|
The digits are deallocated off the heap via the XFREE macro. Similar to XMALLOC, the XFREE macro actually evaluates to
a standard C library function, in this case free(). Since free() only deallocates the memory, the pointer
still has to be reset to \textbf{NULL} manually (line @33,NULL@).
|
886 |
|
887 Now that the digits have been cleared and deallocated the other members are set to their final values (lines @34,= 0@ and @35,ZPOS@). |
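The double-clear safety described above can be illustrated with a small stand-alone sketch. The type mini\_int, the constant MINI\_ZPOS and the function mini\_clear are hypothetical stand-ins written for this illustration only. They are not the library's definitions, and the sketch calls free() directly instead of going through XFREE.

```c
#include <assert.h>
#include <stdlib.h>

typedef unsigned long mini_digit;   /* hypothetical digit type */
#define MINI_ZPOS 0                 /* hypothetical "zero or positive" sign value */

typedef struct {
    int used, alloc, sign;
    mini_digit *dp;
} mini_int;                         /* hypothetical stand-in for mp_int */

/* Zero the used digits, release the array and mark the structure as
 * cleared by setting dp to NULL. A later call sees dp == NULL and does
 * nothing, so the memory is never freed twice. */
void mini_clear(mini_int *a)
{
    int n;

    assert(a != NULL);
    if (a->dp != NULL) {
        for (n = 0; n < a->used; n++) {
            a->dp[n] = 0;
        }
        free(a->dp);
        a->dp    = NULL;
        a->used  = 0;
        a->alloc = 0;
        a->sign  = MINI_ZPOS;
    }
}
```

Calling mini\_clear twice on the same structure is harmless in this sketch, because the second call finds a \textbf{NULL} digit pointer and returns without touching the heap.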
|
888 |
|
889 \section{Maintenance Algorithms} |
|
890 |
|
The previous sections describe how to initialize and clear an mp\_int structure. To further support operations
that are to be performed on mp\_int structures (such as addition and multiplication) the dependent algorithms must be
able to augment the precision of an mp\_int and
initialize mp\_ints with differing initial conditions.
|
895 |
|
896 These algorithms complete the set of low level algorithms required to work with mp\_int structures in the higher level |
|
897 algorithms such as addition, multiplication and modular exponentiation. |
|
898 |
|
899 \subsection{Augmenting an mp\_int's Precision} |
|
When storing a value in an mp\_int structure, a sufficient number of digits must be available to accommodate the entire
result of an operation without loss of precision. Quite often the size of the array given by the \textbf{alloc} member
is large enough to simply increase the \textbf{used} digit count. However, when the size of the array is too small it
must be re-sized appropriately to accommodate the result. The mp\_grow algorithm will provide this functionality.
|
904 |
|
\newpage\begin{figure}[here]
\begin{center}
\begin{tabular}{l}
\hline Algorithm \textbf{mp\_grow}. \\
\textbf{Input}. An mp\_int $a$ and an integer $b$. \\
\textbf{Output}. $a$ is expanded to accommodate $b$ digits. \\
\hline \\
1. if $a.alloc \ge b$ then return(\textit{MP\_OKAY}) \\
2. $u \leftarrow b\mbox{ (mod }MP\_PREC\mbox{)}$ \\
3. $v \leftarrow b + 2 \cdot MP\_PREC - u$ \\
4. Re-allocate the array of digits $a$ to size $v$ \\
5. If the allocation failed then return(\textit{MP\_MEM}). \\
6. for $n$ from $a.alloc$ to $v - 1$ do \\
\hspace{+3mm}6.1 $a_n \leftarrow 0$ \\
7. $a.alloc \leftarrow v$ \\
8. Return(\textit{MP\_OKAY}) \\
\hline
\end{tabular}
\end{center}
\caption{Algorithm mp\_grow}
\end{figure}
|
926 |
|
927 \textbf{Algorithm mp\_grow.} |
|
928 It is ideal to prevent re-allocations from being performed if they are not required (step one). This is useful to |
|
929 prevent mp\_ints from growing excessively in code that erroneously calls mp\_grow. |
|
930 |
|
The requested digit count is padded up to the next multiple of \textbf{MP\_PREC} plus an additional \textbf{MP\_PREC} (steps two and three).
This helps prevent many trivial reallocations that would grow an mp\_int by trivially small values.
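As a worked illustration of steps two and three, assume purely for the sake of the arithmetic that $MP\_PREC = 64$ (an
illustrative value, not one taken from the text) and that $b = 100$ digits are requested. Then
\[
u = 100 \mbox{ (mod }64\mbox{)} = 36, \qquad v = 100 + 2 \cdot 64 - 36 = 192 = 3 \cdot 64 .
\]
A request that is already a multiple of the padding size, such as $b = 128$, gives $u = 0$ and $v = 128 + 2 \cdot 64 = 256$. In both
cases at least $MP\_PREC$ spare digits remain beyond the request before another re-allocation would be needed.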
|
933 |
|
934 It is assumed that the reallocation (step four) leaves the lower $a.alloc$ digits of the mp\_int intact. This is much |
|
935 akin to how the \textit{realloc} function from the standard C library works. Since the newly allocated digits are |
|
936 assumed to contain undefined values they are initially set to zero. |
|
937 |
|
938 EXAM,bn_mp_grow.c |
|
939 |
|
940 A quick optimization is to first determine if a memory re-allocation is required at all. The if statement (line @23,if@) checks |
|
941 if the \textbf{alloc} member of the mp\_int is smaller than the requested digit count. If the count is not larger than \textbf{alloc} |
|
942 the function skips the re-allocation part thus saving time. |
|
943 |
|
When a re-allocation is performed it is turned into an optimal request to save time in the future. The requested digit count is
padded upwards to the second multiple of \textbf{MP\_PREC} larger than the requested count (line @25, size@). The XREALLOC function is used
to re-allocate the memory. As per the other functions XREALLOC is actually a macro which evaluates to realloc by default. The realloc
function leaves the base of the allocation intact which means the first \textbf{alloc} digits of the mp\_int are the same as before
the re-allocation. All that is left is to clear the newly allocated digits and return.
|
949 |
|
950 Note that the re-allocation result is actually stored in a temporary pointer $tmp$. This is to allow this function to return |
|
951 an error with a valid pointer. Earlier releases of the library stored the result of XREALLOC into the mp\_int $a$. That would |
|
952 result in a memory leak if XREALLOC ever failed. |
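The temporary pointer idiom can be shown in isolation with the following sketch. The names mini\_int, MINI\_PREC and mini\_grow are hypothetical and exist only for this illustration. The sketch calls realloc directly rather than XREALLOC and returns $0$ or $-1$ in place of the library's status codes.

```c
#include <assert.h>
#include <stdlib.h>

typedef unsigned long mini_digit;  /* hypothetical digit type */
#define MINI_PREC 8                /* hypothetical padding granularity */

typedef struct {
    int used, alloc, sign;
    mini_digit *dp;
} mini_int;                        /* hypothetical stand-in for mp_int */

/* Returns 0 on success and -1 on allocation failure. On failure a->dp
 * and a->alloc are left untouched, so the caller still owns a valid
 * array that can be cleared later. */
int mini_grow(mini_int *a, int size)
{
    mini_digit *tmp;
    int n;

    assert(a != NULL && size >= 0);
    if (a->alloc >= size) {
        return 0;                              /* already large enough */
    }

    /* pad the request: v = b + 2 * PREC - (b mod PREC) */
    size += (2 * MINI_PREC) - (size % MINI_PREC);

    /* keep the result in a temporary: if realloc fails it returns NULL
     * and the old block, still referenced by a->dp, is not released */
    tmp = realloc(a->dp, sizeof(mini_digit) * (size_t)size);
    if (tmp == NULL) {
        return -1;
    }
    a->dp = tmp;

    /* the old digits are preserved, only the new tail needs zeroing */
    for (n = a->alloc; n < size; n++) {
        a->dp[n] = 0;
    }
    a->alloc = size;
    return 0;
}
```

Had the result of realloc been written straight into the digit pointer, a failed call would overwrite the only reference to the old block with \textbf{NULL}, which is the leak the paragraph above describes.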
|
953 |
|
954 \subsection{Initializing Variable Precision mp\_ints} |
|
955 Occasionally the number of digits required will be known in advance of an initialization, based on, for example, the size |
|
956 of input mp\_ints to a given algorithm. The purpose of algorithm mp\_init\_size is similar to mp\_init except that it |
|
957 will allocate \textit{at least} a specified number of digits. |
|
958 |
|
959 \begin{figure}[here] |
|
960 \begin{small} |
|
961 \begin{center} |
|
962 \begin{tabular}{l} |
|
963 \hline Algorithm \textbf{mp\_init\_size}. \\ |
|
964 \textbf{Input}. An mp\_int $a$ and the requested number of digits $b$. \\ |
|
965 \textbf{Output}. $a$ is initialized to hold at least $b$ digits. \\ |
|
966 \hline \\ |
|
967 1. $u \leftarrow b \mbox{ (mod }MP\_PREC\mbox{)}$ \\ |
|
968 2. $v \leftarrow b + 2 \cdot MP\_PREC - u$ \\ |
|
969 3. Allocate $v$ digits. \\ |
|
970 4. for $n$ from $0$ to $v - 1$ do \\ |
|
971 \hspace{3mm}4.1 $a_n \leftarrow 0$ \\ |
|
972 5. $a.sign \leftarrow MP\_ZPOS$\\ |
|
973 6. $a.used \leftarrow 0$\\ |
|
974 7. $a.alloc \leftarrow v$\\ |
|
975 8. Return(\textit{MP\_OKAY})\\ |
|
976 \hline |
|
977 \end{tabular} |
|
978 \end{center} |
|
979 \end{small} |
|
980 \caption{Algorithm mp\_init\_size} |
|
981 \end{figure} |
|
982 |
|
983 \textbf{Algorithm mp\_init\_size.} |
|
984 This algorithm will initialize an mp\_int structure $a$ like algorithm mp\_init with the exception that the number of |
|
985 digits allocated can be controlled by the second input argument $b$. The input size is padded upwards so it is a |
|
986 multiple of \textbf{MP\_PREC} plus an additional \textbf{MP\_PREC} digits. This padding is used to prevent trivial |
|
987 allocations from becoming a bottleneck in the rest of the algorithms. |
|
988 |
|
Like algorithm mp\_init, the mp\_int structure is initialized to a default state representing the integer zero. This
particular algorithm is useful if the approximate size of the input is known ahead of time. If the approximation is
correct no further memory re-allocations are required to work with the mp\_int.
|
992 |
|
993 EXAM,bn_mp_init_size.c |
|
994 |
|
995 The number of digits $b$ requested is padded (line @22,MP_PREC@) by first augmenting it to the next multiple of |
|
996 \textbf{MP\_PREC} and then adding \textbf{MP\_PREC} to the result. If the memory can be successfully allocated the |
|
997 mp\_int is placed in a default state representing the integer zero. Otherwise, the error code \textbf{MP\_MEM} will be |
|
998 returned (line @27,return@). |
|
999 |
|
The digits are allocated and set to zero at the same time with the calloc() function (line @25,XCALLOC@). The
\textbf{used} count is set to zero, the \textbf{alloc} count set to the padded digit count and the \textbf{sign} flag set
to \textbf{MP\_ZPOS} to achieve a default valid mp\_int state (lines @29,used@, @30,alloc@ and @31,sign@). If the function
returns successfully then it is correct to assume that the mp\_int structure is in a valid state for the remainder of the
functions to work with.
|
1005 |
|
1006 \subsection{Multiple Integer Initializations and Clearings} |
|
1007 Occasionally a function will require a series of mp\_int data types to be made available simultaneously. |
|
1008 The purpose of algorithm mp\_init\_multi is to initialize a variable length array of mp\_int structures in a single |
|
1009 statement. It is essentially a shortcut to multiple initializations. |
|
1010 |
|
1011 \newpage\begin{figure}[here] |
|
1012 \begin{center} |
|
1013 \begin{tabular}{l} |
|
1014 \hline Algorithm \textbf{mp\_init\_multi}. \\ |
|
1015 \textbf{Input}. Variable length array $V_k$ of mp\_int variables of length $k$. \\ |
|
1016 \textbf{Output}. The array is initialized such that each mp\_int of $V_k$ is ready to use. \\ |
|
1017 \hline \\ |
|
1018 1. for $n$ from 0 to $k - 1$ do \\ |
|
1019 \hspace{+3mm}1.1. Initialize the mp\_int $V_n$ (\textit{mp\_init}) \\ |
|
1020 \hspace{+3mm}1.2. If initialization failed then do \\ |
|
1021 \hspace{+6mm}1.2.1. for $j$ from $0$ to $n$ do \\ |
|
1022 \hspace{+9mm}1.2.1.1. Free the mp\_int $V_j$ (\textit{mp\_clear}) \\ |
|
1023 \hspace{+6mm}1.2.2. Return(\textit{MP\_MEM}) \\ |
|
1024 2. Return(\textit{MP\_OKAY}) \\ |
|
1025 \hline |
|
1026 \end{tabular} |
|
1027 \end{center} |
|
1028 \caption{Algorithm mp\_init\_multi} |
|
1029 \end{figure} |
|
1030 |
|
1031 \textbf{Algorithm mp\_init\_multi.} |
|
1032 The algorithm will initialize the array of mp\_int variables one at a time. If a runtime error has been detected |
|
1033 (\textit{step 1.2}) all of the previously initialized variables are cleared. The goal is an ``all or nothing'' |
|
1034 initialization which allows for quick recovery from runtime errors. |
|
1035 |
|
1036 EXAM,bn_mp_init_multi.c |
|
1037 |
|
This function initializes a variable length list of mp\_int structure pointers. However, instead of having the mp\_int
structures in an actual C array they are simply passed as arguments to the function. This function makes use of the
``...'' argument syntax of the C programming language. The list is terminated with a final \textbf{NULL} argument
appended on the right.
|
1042 |
|
The function uses the ``stdarg.h'' \textit{va} functions to step portably through the arguments to the function. A count
$n$ of successfully initialized mp\_int structures is maintained (line @47,n++@) such that if a failure does occur,
the algorithm can backtrack and free the previously initialized structures (lines @27,if@ to @46,}@).
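The ``all or nothing'' behaviour over a \textbf{NULL} terminated argument list can be sketched as follows. The names mini\_int, mini\_init, mini\_clear and mini\_init\_multi are hypothetical stand-ins for this illustration, the fixed allocation of eight digits is arbitrary, and the return values $0$ and $-1$ take the place of the library's status codes.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdlib.h>

typedef struct {
    int used, alloc, sign;
    unsigned long *dp;
} mini_int;                        /* hypothetical stand-in for mp_int */

/* hypothetical single initializer: 0 on success, -1 on failure */
int mini_init(mini_int *a)
{
    a->dp = calloc(8, sizeof *a->dp);
    if (a->dp == NULL) {
        return -1;
    }
    a->used  = 0;
    a->alloc = 8;
    a->sign  = 0;
    return 0;
}

void mini_clear(mini_int *a)
{
    assert(a != NULL);
    free(a->dp);                   /* free(NULL) is a no-op */
    a->dp = NULL;
    a->used = a->alloc = a->sign = 0;
}

/* Initialize every mini_int in a NULL terminated argument list. If one
 * initialization fails, walk the list a second time and clear the n
 * structures that had already been initialized. */
int mini_init_multi(mini_int *first, ...)
{
    int n = 0, res = 0;
    mini_int *cur = first;
    va_list args;

    va_start(args, first);
    while (cur != NULL) {
        if (mini_init(cur) != 0) {
            va_list undo;

            va_start(undo, first);
            cur = first;
            while (n-- > 0) {
                mini_clear(cur);
                cur = va_arg(undo, mini_int *);
            }
            va_end(undo);
            res = -1;
            break;
        }
        n++;
        cur = va_arg(args, mini_int *);
    }
    va_end(args);
    return res;
}
```

A call would look like mini\_init\_multi(\&a, \&b, \&c, (mini\_int *)NULL). The cast on the terminator is a precaution in this sketch, since a bare \textbf{NULL} passed through ``...'' is not guaranteed to have pointer type on every platform.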
|
1046 |
|
1047 |
|
1048 \subsection{Clamping Excess Digits} |
|
When a function anticipates a result will be $n$ digits it is simpler to assume this is true within the body of
the function instead of checking during the computation. For example, a multiplication of an $i$ digit number by a
$j$ digit number produces a result of at most $i + j$ digits. It is entirely possible that the result is $i + j - 1$ digits
though, with no final carry into the last position. However, suppose the destination had to be first expanded
(\textit{via mp\_grow}) to accommodate $i + j - 1$ digits and then further expanded to accommodate the final carry.
That would be a considerable waste of time since heap operations are relatively slow.
|
1055 |
|
1056 The ideal solution is to always assume the result is $i + j$ and fix up the \textbf{used} count after the function |
|
1057 terminates. This way a single heap operation (\textit{at most}) is required. However, if the result was not checked |
|
1058 there would be an excess high order zero digit. |
|
1059 |
|
1060 For example, suppose the product of two integers was $x_n = (0x_{n-1}x_{n-2}...x_0)_{\beta}$. The leading zero digit |
|
1061 will not contribute to the precision of the result. In fact, through subsequent operations more leading zero digits would |
|
1062 accumulate to the point the size of the integer would be prohibitive. As a result even though the precision is very |
|
1063 low the representation is excessively large. |
|
1064 |
|
1065 The mp\_clamp algorithm is designed to solve this very problem. It will trim high-order zeros by decrementing the |
|
1066 \textbf{used} count until a non-zero most significant digit is found. Also in this system, zero is considered to be a |
|
1067 positive number which means that if the \textbf{used} count is decremented to zero, the sign must be set to |
|
1068 \textbf{MP\_ZPOS}. |
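As a concrete decimal illustration (taking $\beta = 10$ purely for readability), consider two products of two-digit
numbers, each computed into $i + j = 4$ digits. The product $99 \cdot 99 = (9801)_{10}$ fills all four positions, so
\textbf{used} remains four. The product $12 \cdot 34 = (0408)_{10}$ leaves a zero in the most significant position, and a
single iteration of the trimming loop lowers \textbf{used} to three, giving $(408)_{10}$.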
|
1069 |
|
1070 \begin{figure}[here] |
|
1071 \begin{center} |
|
1072 \begin{tabular}{l} |
|
1073 \hline Algorithm \textbf{mp\_clamp}. \\ |
|
1074 \textbf{Input}. An mp\_int $a$ \\ |
|
1075 \textbf{Output}. Any excess leading zero digits of $a$ are removed \\ |
|
1076 \hline \\ |
|
1077 1. while $a.used > 0$ and $a_{a.used - 1} = 0$ do \\ |
|
1078 \hspace{+3mm}1.1 $a.used \leftarrow a.used - 1$ \\ |
|
1079 2. if $a.used = 0$ then do \\ |
|
1080 \hspace{+3mm}2.1 $a.sign \leftarrow MP\_ZPOS$ \\ |
|
1081 \hline \\ |
|
1082 \end{tabular} |
|
1083 \end{center} |
|
1084 \caption{Algorithm mp\_clamp} |
|
1085 \end{figure} |
|
1086 |
|
1087 \textbf{Algorithm mp\_clamp.} |
|
1088 As can be expected this algorithm is very simple. The loop on step one is expected to iterate only once or twice at |
|
1089 the most. For example, this will happen in cases where there is not a carry to fill the last position. Step two fixes the sign for |
|
1090 when all of the digits are zero to ensure that the mp\_int is valid at all times. |
|
1091 |
|
1092 EXAM,bn_mp_clamp.c |
|
1093 |
|
Note on line @27,while@ how the test for the \textbf{used} count is made on the left of the \&\& operator. In the C programming
language the terms to \&\& are evaluated left to right with a boolean short-circuit if any condition fails. This is
important since if the \textbf{used} count is zero the test on the right would fetch below the array. That is obviously
undesirable. The parentheses on line @28,a->used@ are used to make sure the \textbf{used} count is decremented and not
the pointer ``a''.
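The ordering of the two tests can be seen in the following stand-alone sketch. The names mini\_int, MINI\_ZPOS, MINI\_NEG and mini\_clamp are hypothetical and are used here only to keep the example self-contained.

```c
#include <assert.h>

#define MINI_ZPOS 0                /* hypothetical sign values */
#define MINI_NEG  1

typedef struct {
    int used, alloc, sign;
    unsigned long *dp;
} mini_int;                        /* hypothetical stand-in for mp_int */

void mini_clamp(mini_int *a)
{
    assert(a->used >= 0 && a->used <= a->alloc);

    /* used > 0 is tested first, so dp[used - 1] is never read when
     * used is already zero */
    while (a->used > 0 && a->dp[a->used - 1] == 0) {
        --(a->used);
    }

    /* zero is always stored as a non-negative value */
    if (a->used == 0) {
        a->sign = MINI_ZPOS;
    }
}
```

Swapping the two operands of \&\& would read the digit at index $-1$ whenever \textbf{used} is zero, which is the out of bounds access the text warns about.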
|
1099 |
|
1100 \section*{Exercises} |
|
1101 \begin{tabular}{cl} |
|
1102 $\left [ 1 \right ]$ & Discuss the relevance of the \textbf{used} member of the mp\_int structure. \\ |
|
1103 & \\ |
|
1104 $\left [ 1 \right ]$ & Discuss the consequences of not using padding when performing allocations. \\ |
|
1105 & \\ |
|
1106 $\left [ 2 \right ]$ & Estimate an ideal value for \textbf{MP\_PREC} when performing 1024-bit RSA \\ |
|
1107 & encryption when $\beta = 2^{28}$. \\ |
|
1108 & \\ |
|
1109 $\left [ 1 \right ]$ & Discuss the relevance of the algorithm mp\_clamp. What does it prevent? \\ |
|
1110 & \\ |
|
1111 $\left [ 1 \right ]$ & Give an example of when the algorithm mp\_init\_copy might be useful. \\ |
|
1112 & \\ |
|
1113 \end{tabular} |
|
1114 |
|
1115 |
|
1116 %%% |
|
1117 % CHAPTER FOUR |
|
1118 %%% |
|
1119 |
|
1120 \chapter{Basic Operations} |
|
1121 |
|
1122 \section{Introduction} |
|
In the previous chapter a series of low level algorithms were established that dealt with initializing and maintaining
mp\_int structures. This chapter will discuss another set of seemingly non-algebraic algorithms which will form the low
level basis of the entire library. While these algorithms are relatively trivial it is important to understand how they
work before proceeding since these algorithms will be used almost intrinsically in the following chapters.
|
1127 |
|
1128 The algorithms in this chapter deal primarily with more ``programmer'' related tasks such as creating copies of |
|
1129 mp\_int structures, assigning small values to mp\_int structures and comparisons of the values mp\_int structures |
|
1130 represent. |
|
1131 |
|
1132 \section{Assigning Values to mp\_int Structures} |
|
1133 \subsection{Copying an mp\_int} |
|
1134 Assigning the value that a given mp\_int structure represents to another mp\_int structure shall be known as making |
|
1135 a copy for the purposes of this text. The copy of the mp\_int will be a separate entity that represents the same |
|
1136 value as the mp\_int it was copied from. The mp\_copy algorithm provides this functionality. |
|
1137 |
|
1138 \newpage\begin{figure}[here] |
|
1139 \begin{center} |
|
1140 \begin{tabular}{l} |
|
1141 \hline Algorithm \textbf{mp\_copy}. \\ |
|
1142 \textbf{Input}. An mp\_int $a$ and $b$. \\ |
|
1143 \textbf{Output}. Store a copy of $a$ in $b$. \\ |
|
1144 \hline \\ |
|
1145 1. If $b.alloc < a.used$ then grow $b$ to $a.used$ digits. (\textit{mp\_grow}) \\ |
|
1146 2. for $n$ from 0 to $a.used - 1$ do \\ |
|
1147 \hspace{3mm}2.1 $b_{n} \leftarrow a_{n}$ \\ |
|
1148 3. for $n$ from $a.used$ to $b.used - 1$ do \\ |
|
1149 \hspace{3mm}3.1 $b_{n} \leftarrow 0$ \\ |
|
1150 4. $b.used \leftarrow a.used$ \\ |
|
1151 5. $b.sign \leftarrow a.sign$ \\ |
|
1152 6. return(\textit{MP\_OKAY}) \\ |
|
1153 \hline |
|
1154 \end{tabular} |
|
1155 \end{center} |
|
1156 \caption{Algorithm mp\_copy} |
|
1157 \end{figure} |
|
1158 |
|
1159 \textbf{Algorithm mp\_copy.} |
|
This algorithm copies the mp\_int $a$ such that upon successful termination of the algorithm the mp\_int $b$ will
represent the same integer as the mp\_int $a$. The mp\_int $b$ shall be a complete and distinct copy of the
mp\_int $a$ meaning that the mp\_int $a$ can be modified and it shall not affect the value of the mp\_int $b$.
|
1163 |
|
1164 If $b$ does not have enough room for the digits of $a$ it must first have its precision augmented via the mp\_grow |
|
1165 algorithm. The digits of $a$ are copied over the digits of $b$ and any excess digits of $b$ are set to zero (step two |
|
1166 and three). The \textbf{used} and \textbf{sign} members of $a$ are finally copied over the respective members of |
|
1167 $b$. |
|
1168 |
|
\textbf{Remark.} This algorithm also introduces a new idiosyncrasy that will be used throughout the rest of the
text. The error return codes of other algorithms are not explicitly checked in the pseudo-code presented. For example, in
step one of the mp\_copy algorithm the return of mp\_grow is not explicitly checked to ensure it succeeded. Text space is
limited so it is assumed that if an algorithm fails it will clear all temporarily allocated mp\_ints and return
the error code itself. However, the C code presented will demonstrate all of the error handling logic required to
implement the pseudo-code.
|
1175 |
|
1176 EXAM,bn_mp_copy.c |
|
1177 |
|
1178 Occasionally a dependent algorithm may copy an mp\_int effectively into itself such as when the input and output |
|
1179 mp\_int structures passed to a function are one and the same. For this case it is optimal to return immediately without |
|
1180 copying digits (line @24,a == b@). |
|
1181 |
|
The mp\_int $b$ must have enough digits to accommodate the used digits of the mp\_int $a$. If $b.alloc$ is less than
$a.used$ the algorithm mp\_grow is used to augment the precision of $b$ (lines @29,alloc@ to @33,}@). In order to
simplify the inner loop that copies the digits from $a$ to $b$, two aliases $tmpa$ and $tmpb$ point directly at the digits
of the mp\_ints $a$ and $b$ respectively. These aliases (lines @42,tmpa@ and @45,tmpb@) allow the compiler to access the digits without first dereferencing the
mp\_int pointers and then subsequently the pointer to the digits.
|
1187 |
|
After the aliases are established the digits from $a$ are copied into $b$ (lines @48,for@ to @50,}@) and then the excess
digits of $b$ are set to zero (lines @53,for@ to @55,}@). Both ``for'' loops make use of the pointer aliases and in
fact the alias for $b$ is carried through into the second ``for'' loop to clear the excess digits. This optimization
allows the alias to stay in a machine register fairly easily between the two loops.
|
1192 |
|
1193 \textbf{Remarks.} The use of pointer aliases is an implementation methodology first introduced in this function that will |
|
1194 be used considerably in other functions. Technically, a pointer alias is simply a short hand alias used to lower the |
|
1195 number of pointer dereferencing operations required to access data. For example, a for loop may resemble |
|
1196 |
|
\begin{alltt}
for (x = 0; x < 100; x++) \{
    a->num[4]->dp[x] = 0;
\}
\end{alltt}
|
1202 |
|
1203 This could be re-written using aliases as |
|
1204 |
|
\begin{alltt}
mp_digit *tmpa;
tmpa = a->num[4]->dp;
for (x = 0; x < 100; x++) \{
    *tmpa++ = 0;
\}
\end{alltt}
|
1212 |
|
1213 In this case an alias is used to access the |
|
1214 array of digits within an mp\_int structure directly. It may seem that a pointer alias is strictly not required |
|
1215 as a compiler may optimize out the redundant pointer operations. However, there are two dominant reasons to use aliases. |
|
1216 |
|
1217 The first reason is that most compilers will not effectively optimize pointer arithmetic. For example, some optimizations |
|
1218 may work for the Microsoft Visual C++ compiler (MSVC) and not for the GNU C Compiler (GCC). Also some optimizations may |
|
1219 work for GCC and not MSVC. As such it is ideal to find a common ground for as many compilers as possible. Pointer |
|
1220 aliases optimize the code considerably before the compiler even reads the source code which means the end compiled code |
|
1221 stands a better chance of being faster. |
|
1222 |
|
1223 The second reason is that pointer aliases often can make an algorithm simpler to read. Consider the first ``for'' |
|
1224 loop of the function mp\_copy() re-written to not use pointer aliases. |
|
1225 |
|
\begin{alltt}
/* copy all the digits */
for (n = 0; n < a->used; n++) \{
    b->dp[n] = a->dp[n];
\}
\end{alltt}
|
1232 |
|
1233 Whether this code is harder to read depends strongly on the individual. However, it is quantifiably slightly more |
|
1234 complicated as there are four variables within the statement instead of just two. |
|
1235 |
|
1236 \subsubsection{Nested Statements} |
|
1237 Another commonly used technique in the source routines is that certain sections of code are nested. This is used in |
|
1238 particular with the pointer aliases to highlight code phases. For example, a Comba multiplier (discussed in chapter six) |
|
1239 will typically have three different phases. First the temporaries are initialized, then the columns calculated and |
|
1240 finally the carries are propagated. In this example the middle column production phase will typically be nested as it |
|
1241 uses temporary variables and aliases the most. |
|
1242 |
|
The nesting also simplifies the source code as variables that are nested are only valid for their scope. As a result
the various temporary variables required do not propagate into other sections of code.
|
1245 |
|
1246 |
|
1247 \subsection{Creating a Clone} |
|
Another common operation is to make a local temporary copy of an mp\_int argument. To initialize an mp\_int
and then copy another existing mp\_int into the newly initialized mp\_int will be known as creating a clone. This is
useful within functions that need to modify an argument but do not wish to actually modify the original copy. The
mp\_init\_copy algorithm has been designed to help perform this task.
|
1252 |
|
1253 \begin{figure}[here] |
|
1254 \begin{center} |
|
1255 \begin{tabular}{l} |
|
1256 \hline Algorithm \textbf{mp\_init\_copy}. \\ |
|
1257 \textbf{Input}. An mp\_int $a$ and $b$\\ |
|
1258 \textbf{Output}. $a$ is initialized to be a copy of $b$. \\ |
|
1259 \hline \\ |
|
1260 1. Init $a$. (\textit{mp\_init}) \\ |
|
1261 2. Copy $b$ to $a$. (\textit{mp\_copy}) \\ |
|
1262 3. Return the status of the copy operation. \\ |
|
1263 \hline |
|
1264 \end{tabular} |
|
1265 \end{center} |
|
1266 \caption{Algorithm mp\_init\_copy} |
|
1267 \end{figure} |
|
1268 |
|
1269 \textbf{Algorithm mp\_init\_copy.} |
|
1270 This algorithm will initialize an mp\_int variable and copy another previously initialized mp\_int variable into it. As |
|
1271 such this algorithm will perform two operations in one step. |
|
1272 |
|
1273 EXAM,bn_mp_init_copy.c |
|
1274 |
|
1275 This will initialize \textbf{a} and make it a verbatim copy of the contents of \textbf{b}. Note that |
|
1276 \textbf{a} will have its own memory allocated which means that \textbf{b} may be cleared after the call |
|
1277 and \textbf{a} will be left intact. |
|
1278 |
|
1279 \section{Zeroing an Integer} |
|
Resetting an mp\_int to the default state is a common step in many algorithms. The mp\_zero algorithm will be the algorithm used to
perform this task.
|
1282 |
|
1283 \begin{figure}[here] |
|
1284 \begin{center} |
|
1285 \begin{tabular}{l} |
|
1286 \hline Algorithm \textbf{mp\_zero}. \\ |
|
1287 \textbf{Input}. An mp\_int $a$ \\ |
|
1288 \textbf{Output}. Zero the contents of $a$ \\ |
|
1289 \hline \\ |
|
1290 1. $a.used \leftarrow 0$ \\ |
|
1291 2. $a.sign \leftarrow$ MP\_ZPOS \\ |
|
1292 3. for $n$ from 0 to $a.alloc - 1$ do \\ |
|
1293 \hspace{3mm}3.1 $a_n \leftarrow 0$ \\ |
|
1294 \hline |
|
1295 \end{tabular} |
|
1296 \end{center} |
|
1297 \caption{Algorithm mp\_zero} |
|
1298 \end{figure} |
|
1299 |
|
1300 \textbf{Algorithm mp\_zero.} |
|
This algorithm simply resets an mp\_int to the default state.
|
1302 |
|
1303 EXAM,bn_mp_zero.c |
|
1304 |
|
1305 After the function is completed, all of the digits are zeroed, the \textbf{used} count is zeroed and the |
|
1306 \textbf{sign} variable is set to \textbf{MP\_ZPOS}. |
|
1307 |
|
1308 \section{Sign Manipulation} |
|
1309 \subsection{Absolute Value} |
|
1310 With the mp\_int representation of an integer, calculating the absolute value is trivial. The mp\_abs algorithm will compute |
|
1311 the absolute value of an mp\_int. |
|
1312 |
|
1313 \newpage\begin{figure}[here] |
|
1314 \begin{center} |
|
1315 \begin{tabular}{l} |
|
1316 \hline Algorithm \textbf{mp\_abs}. \\ |
|
1317 \textbf{Input}. An mp\_int $a$ \\ |
|
1318 \textbf{Output}. Computes $b = \vert a \vert$ \\ |
|
1319 \hline \\ |
|
1320 1. Copy $a$ to $b$. (\textit{mp\_copy}) \\ |
|
1321 2. If the copy failed return(\textit{MP\_MEM}). \\ |
|
1322 3. $b.sign \leftarrow MP\_ZPOS$ \\ |
|
1323 4. Return(\textit{MP\_OKAY}) \\ |
|
1324 \hline |
|
1325 \end{tabular} |
|
1326 \end{center} |
|
1327 \caption{Algorithm mp\_abs} |
|
1328 \end{figure} |
|
1329 |
|
1330 \textbf{Algorithm mp\_abs.} |
|
This algorithm computes the absolute value of an mp\_int input. First it copies $a$ over $b$. This is an example of an
algorithm where the check in mp\_copy that determines if the source and destination are equal proves useful. This allows,
for instance, the developer to pass the same mp\_int as the source and destination to this function without additional
logic to handle it.
|
1335 |
|
1336 EXAM,bn_mp_abs.c |
|
1337 |
|
1338 \subsection{Integer Negation} |
|
1339 With the mp\_int representation of an integer, calculating the negation is also trivial. The mp\_neg algorithm will compute |
|
1340 the negative of an mp\_int input. |
|
1341 |
|
1342 \begin{figure}[here] |
|
1343 \begin{center} |
|
1344 \begin{tabular}{l} |
|
1345 \hline Algorithm \textbf{mp\_neg}. \\ |
|
1346 \textbf{Input}. An mp\_int $a$ \\ |
|
1347 \textbf{Output}. Computes $b = -a$ \\ |
|
1348 \hline \\ |
|
1349 1. Copy $a$ to $b$. (\textit{mp\_copy}) \\ |
|
1350 2. If the copy failed return(\textit{MP\_MEM}). \\ |
|
1351 3. If $a.used = 0$ then return(\textit{MP\_OKAY}). \\ |
|
4. If $a.sign = MP\_ZPOS$ then do \\
\hspace{3mm}4.1 $b.sign \leftarrow MP\_NEG$. \\
5. else do \\
\hspace{3mm}5.1 $b.sign \leftarrow MP\_ZPOS$. \\
|
1356 6. Return(\textit{MP\_OKAY}) \\ |
|
1357 \hline |
|
1358 \end{tabular} |
|
1359 \end{center} |
|
1360 \caption{Algorithm mp\_neg} |
|
1361 \end{figure} |
|
1362 |
|
1363 \textbf{Algorithm mp\_neg.} |
|
1364 This algorithm computes the negation of an input. First it copies $a$ over $b$. If $a$ has no used digits then |
|
1365 the algorithm returns immediately. Otherwise it flips the sign flag and stores the result in $b$. Note that if |
|
1366 $a$ had no digits then it must be positive by definition. Had step three been omitted then the algorithm would return |
|
1367 zero as negative. |
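The effect of step three can be isolated in a few lines. The function mini\_neg\_sign and the constants MINI\_ZPOS and MINI\_NEG are hypothetical names for this sketch. It computes only the sign that $-a$ should carry, given the \textbf{used} count and \textbf{sign} of $a$.

```c
#include <assert.h>

enum { MINI_ZPOS = 0, MINI_NEG = 1 };   /* hypothetical sign values */

/* Sign of -a for a value with `used` digits and sign `sign`.
 * Zero (used == 0) stays non-negative, so a "negative zero" can
 * never be produced. */
int mini_neg_sign(int used, int sign)
{
    assert(sign == MINI_ZPOS || sign == MINI_NEG);
    if (used == 0) {
        return MINI_ZPOS;
    }
    return (sign == MINI_ZPOS) ? MINI_NEG : MINI_ZPOS;
}
```

Without the test on the \textbf{used} count the function would return the negative sign for an input of zero, which is the invalid state the text describes.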
|
1368 |
|
1369 EXAM,bn_mp_neg.c |
|
1370 |
|
1371 \section{Small Constants} |
|
1372 \subsection{Setting Small Constants} |
|
Often an mp\_int must be set to a relatively small value such as $1$ or $2$. For these cases the mp\_set algorithm is useful.
|
1374 |
|
1375 \begin{figure}[here] |
|
1376 \begin{center} |
|
1377 \begin{tabular}{l} |
|
1378 \hline Algorithm \textbf{mp\_set}. \\ |
|
1379 \textbf{Input}. An mp\_int $a$ and a digit $b$ \\ |
|
1380 \textbf{Output}. Make $a$ equivalent to $b$ \\ |
|
1381 \hline \\ |
|
1382 1. Zero $a$ (\textit{mp\_zero}). \\ |
|
1383 2. $a_0 \leftarrow b \mbox{ (mod }\beta\mbox{)}$ \\ |
|
1384 3. $a.used \leftarrow \left \lbrace \begin{array}{ll} |
|
1385 1 & \mbox{if }a_0 > 0 \\ |
|
1386 0 & \mbox{if }a_0 = 0 |
|
1387 \end{array} \right .$ \\ |
|
1388 \hline |
|
1389 \end{tabular} |
|
1390 \end{center} |
|
1391 \caption{Algorithm mp\_set} |
|
1392 \end{figure} |
|
1393 |
|
1394 \textbf{Algorithm mp\_set.} |
|
This algorithm sets an mp\_int to a small single digit value. Step number 1 ensures that the integer is reset to the default state. The
single digit is set (\textit{modulo $\beta$}) and the \textbf{used} count is adjusted accordingly.
|
1397 |
|
1398 EXAM,bn_mp_set.c |
|
1399 |
|
Line @21,mp_zero@ calls mp\_zero() to clear the mp\_int and reset the sign. Line @22,MP_MASK@ copies the digit
into the least significant location. Note the usage of a new constant \textbf{MP\_MASK}. This constant is used to quickly
reduce an integer modulo $\beta$. Since $\beta$ is of the form $2^k$ for some suitable $k$ it suffices to perform a binary AND with
$MP\_MASK = 2^k - 1$ to perform the reduction. Finally line @23,a->used@ will set the \textbf{used} member with respect to the
digit actually set. This function will always make the integer positive.
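The equivalence between the binary AND and a reduction modulo $\beta$ can be checked with a short sketch. The value $k = 28$ is an assumption made here for illustration, and MINI\_DIGIT\_BIT, MINI\_MASK and mini\_reduce are hypothetical names rather than the library's definitions.

```c
#include <assert.h>

#define MINI_DIGIT_BIT 28                           /* assumed k, so beta = 2^28 */
#define MINI_MASK ((1UL << MINI_DIGIT_BIT) - 1UL)   /* 2^k - 1 */

/* Reduce b modulo beta = 2^k with a single AND. */
unsigned long mini_reduce(unsigned long b)
{
    unsigned long r = b & MINI_MASK;

    /* same value as a true modular reduction */
    assert(r == b % (1UL << MINI_DIGIT_BIT));
    return r;
}
```

The AND keeps the low $k$ bits and discards the rest, which is the remainder after division by $2^k$ for an unsigned value.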
|
1405 |
|
1406 One important limitation of this function is that it will only set one digit. The size of a digit is not fixed, meaning source that uses |
|
1407 this function should take that into account. Only trivially small constants can be set using this function. |
|
1408 |
|
1409 \subsection{Setting Large Constants} |
|
1410 To overcome the limitations of the mp\_set algorithm the mp\_set\_int algorithm is ideal. It accepts a ``long'' |
|
1411 data type as input and will always treat it as a 32-bit integer. |
|
1412 |
|
1413 \begin{figure}[here] |
|
1414 \begin{center} |
|
1415 \begin{tabular}{l} |
|
1416 \hline Algorithm \textbf{mp\_set\_int}. \\ |
|
1417 \textbf{Input}. An mp\_int $a$ and a ``long'' integer $b$ \\ |
|
1418 \textbf{Output}. Make $a$ equivalent to $b$ \\ |
|
1419 \hline \\ |
|
1420 1. Zero $a$ (\textit{mp\_zero}) \\ |
|
1421 2. for $n$ from 0 to 7 do \\ |
|
1422 \hspace{3mm}2.1 $a \leftarrow a \cdot 16$ (\textit{mp\_mul2d}) \\ |
|
1423 \hspace{3mm}2.2 $u \leftarrow \lfloor b / 2^{4(7 - n)} \rfloor \mbox{ (mod }16\mbox{)}$\\ |
|
1424 \hspace{3mm}2.3 $a_0 \leftarrow a_0 + u$ \\ |
|
1425 \hspace{3mm}2.4 $a.used \leftarrow a.used + 1$ \\ |
|
1426 3. Clamp excess used digits (\textit{mp\_clamp}) \\ |
|
1427 \hline |
|
1428 \end{tabular} |
|
1429 \end{center} |
|
1430 \caption{Algorithm mp\_set\_int} |
|
1431 \end{figure} |
|
1432 |
|
1433 \textbf{Algorithm mp\_set\_int.} |
|
1434 The algorithm performs eight iterations of a simple loop where in each iteration four bits from the source are added to the |
|
1435 mp\_int. Step 2.1 will multiply the current result by sixteen making room for four more bits in the less significant positions. In step 2.2 the |
|
1436 next four bits from the source are extracted and in step 2.3 they are added to the mp\_int. In step 2.4 the \textbf{used} digit count is |
|
1437 incremented to reflect the addition. The increment is needed because if all of the digits so far were zero the mp\_int would have |
|
1438 zero digits used and the newly added four bits would be ignored. |
|
1439 |
|
1440 Excess zero digits are trimmed in steps 2.1 and 3 by the higher level algorithms mp\_mul\_2d and mp\_clamp. |
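
The loop of algorithm mp\_set\_int can be tried out on a toy integer type. The sketch below assumes a deliberately small 7-bit digit and a fixed array of eight digits; toy\_int, toy\_clamp, toy\_mul16, toy\_set\_int and toy\_value are hypothetical names standing in for the roles of mp\_int, mp\_clamp, the multiplication by sixteen and mp\_set\_int. It is not the library source.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_DIGIT_BIT 7                        /* assumed: beta = 2^7 */
#define TOY_MASK ((1u << TOY_DIGIT_BIT) - 1u)
#define TOY_PREC 8                             /* fixed capacity in digits */

typedef struct {
    uint32_t dp[TOY_PREC];  /* digits, least significant first */
    int used;               /* number of digits in use */
} toy_int;

/* Trim leading zero digits (the role of mp_clamp). */
void toy_clamp(toy_int *a)
{
    while (a->used > 0 && a->dp[a->used - 1] == 0) {
        a->used--;
    }
}

/* a <- a * 16, a left shift by four bits, followed by a clamp. */
void toy_mul16(toy_int *a)
{
    uint32_t carry = 0;
    int n;
    for (n = 0; n < a->used; n++) {
        uint32_t t = (a->dp[n] << 4) | carry;
        a->dp[n] = t & TOY_MASK;
        carry = t >> TOY_DIGIT_BIT;
    }
    if (carry != 0 && a->used < TOY_PREC) {
        a->dp[a->used++] = carry;
    }
    toy_clamp(a);
}

/* Set a to the 32-bit value b, most significant nibble first. */
void toy_set_int(toy_int *a, uint32_t b)
{
    int n;
    for (n = 0; n < TOY_PREC; n++) {
        a->dp[n] = 0;                              /* step 1 */
    }
    a->used = 0;
    for (n = 0; n < 8; n++) {
        toy_mul16(a);                              /* step 2.1 */
        a->dp[0] += (b >> (4 * (7 - n))) & 15u;    /* steps 2.2 and 2.3 */
        a->used += 1;                              /* step 2.4 */
    }
    toy_clamp(a);                                  /* step 3 */
}

/* Rebuild the 32-bit value from the digits, to check the result. */
uint32_t toy_value(const toy_int *a)
{
    uint32_t v = 0;
    int n;
    for (n = a->used - 1; n >= 0; n--) {
        v = (v << TOY_DIGIT_BIT) | a->dp[n];
    }
    return v;
}
```

With a 7-bit digit a full 32-bit input occupies five digits, which shows that the nibble-at-a-time approach does not depend on the digit size dividing 32.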
|
1441 |
|
1442 EXAM,bn_mp_set_int.c |
|
1443 |
|
1444 This function sets four bits of the number at a time to handle all practical \textbf{DIGIT\_BIT} sizes. The weird |
|
1445 addition on line @38,a->used@ ensures that the newly added bits are counted in the number of used digits. While it may not |
|
1446 seem obvious why the digit counter does not grow exceedingly large, it is because of the shift on line @27,mp_mul_2d@ |
|
1447 as well as the call to mp\_clamp() on line @40,mp_clamp@. Both functions will clamp excess leading digits which keeps |
|
1448 the number of used digits low. |
|
1449 |
|
1450 \section{Comparisons} |
|
1451 \subsection{Unsigned Comparisons} |
|
1452 Comparing two multiple precision integers is performed with the exact same algorithm used to compare two decimal numbers. For example, |
|
1453 to compare $1,234$ to $1,264$ the digits are extracted by their positions. That is we compare $1 \cdot 10^3 + 2 \cdot 10^2 + 3 \cdot 10^1 + 4 \cdot 10^0$ |
|
1454 to $1 \cdot 10^3 + 2 \cdot 10^2 + 6 \cdot 10^1 + 4 \cdot 10^0$ by comparing single digits at a time starting with the highest magnitude |
|
1455 positions. If any leading digit of one integer is greater than a digit in the same position of another integer then obviously it must be greater. |
|
1456 |
|
1457 The first comparison routine that will be developed is the unsigned magnitude compare which will perform a comparison based on the digits of two |
|
1458 mp\_int variables alone. It will ignore the sign of the two inputs. Such a function is useful when an absolute comparison is required or if the |
|
1459 signs are known to agree in advance. |
|
1460 |
|
1461 To facilitate working with the results of the comparison functions three constants are required. |
|
1462 |
|
1463 \begin{figure}[here] |
|
1464 \begin{center} |
|
1465 \begin{tabular}{|r|l|} |
|
1466 \hline \textbf{Constant} & \textbf{Meaning} \\ |
|
1467 \hline \textbf{MP\_GT} & Greater Than \\ |
|
1468 \hline \textbf{MP\_EQ} & Equal To \\ |
|
1469 \hline \textbf{MP\_LT} & Less Than \\ |
|
1470 \hline |
|
1471 \end{tabular} |
|
1472 \end{center} |
|
1473 \caption{Comparison Return Codes} |
|
1474 \end{figure} |
|
1475 |
|
1476 \begin{figure}[here] |
|
1477 \begin{center} |
|
1478 \begin{tabular}{l} |
|
1479 \hline Algorithm \textbf{mp\_cmp\_mag}. \\ |
|
1480 \textbf{Input}. Two mp\_ints $a$ and $b$. \\ |
|
1481 \textbf{Output}. Unsigned comparison results ($a$ to the left of $b$). \\ |
|
1482 \hline \\ |
|
1483 1. If $a.used > b.used$ then return(\textit{MP\_GT}) \\ |
|
1484 2. If $a.used < b.used$ then return(\textit{MP\_LT}) \\ |
|
1485 3. for n from $a.used - 1$ to 0 do \\ |
|
1486 \hspace{+3mm}3.1 if $a_n > b_n$ then return(\textit{MP\_GT}) \\ |
|
1487 \hspace{+3mm}3.2 if $a_n < b_n$ then return(\textit{MP\_LT}) \\ |
|
1488 4. Return(\textit{MP\_EQ}) \\ |
|
1489 \hline |
|
1490 \end{tabular} |
|
1491 \end{center} |
|
1492 \caption{Algorithm mp\_cmp\_mag} |
|
1493 \end{figure} |
|
1494 |
|
1495 \textbf{Algorithm mp\_cmp\_mag.} |
|
1496 By saying ``$a$ to the left of $b$'' it is meant that the comparison is with respect to $a$, that is if $a$ is greater than $b$ it will return |
|
1497 \textbf{MP\_GT} and similar with respect to when $a = b$ and $a < b$. The first two steps compare the number of digits used in both $a$ and $b$. |
|
1498 Obviously if the digit counts differ there would be an imaginary zero digit in the smaller number where the leading digit of the larger number is. |
|
1499 If both have the same number of digits then the actual digits themselves must be compared starting at the leading digit. |
|
1500 |
|
1501 By step three both inputs must have the same number of digits so it is safe to start from either $a.used - 1$ or $b.used - 1$ and count down to |
|
1502 the zero'th digit. If after all of the digits have been compared, no difference is found, the algorithm returns \textbf{MP\_EQ}. |
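
The four steps translate almost directly into C. The following is a minimal sketch operating on plain digit arrays rather than on mp\_int structures; demo\_cmp\_mag and the DEMO\_LT, DEMO\_EQ and DEMO\_GT codes are hypothetical names, and both inputs are assumed to be clamped so that the leading digit is non-zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical return codes used only in this sketch. */
enum { DEMO_LT = -1, DEMO_EQ = 0, DEMO_GT = 1 };

/* Compare two magnitudes stored least significant digit first.
 * Both inputs are assumed to have no leading zero digits. */
int demo_cmp_mag(const uint32_t *a, size_t a_used,
                 const uint32_t *b, size_t b_used)
{
    size_t n;
    if (a_used > b_used) return DEMO_GT;    /* step 1 */
    if (a_used < b_used) return DEMO_LT;    /* step 2 */
    for (n = a_used; n-- > 0; ) {           /* step 3: leading digit first */
        if (a[n] > b[n]) return DEMO_GT;    /* step 3.1 */
        if (a[n] < b[n]) return DEMO_LT;    /* step 3.2 */
    }
    return DEMO_EQ;                         /* step 4 */
}
```

Using decimal digits for readability, comparing $1234$ (stored as 4, 3, 2, 1) against $1264$ (stored as 4, 6, 2, 1) stops at the tens position and reports that the first input is smaller.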
|
1503 |
|
1504 EXAM,bn_mp_cmp_mag.c |
|
1505 |
|
1506 The two if statements on lines @24,if@ and @28,if@ compare the number of digits in the two inputs. These two are performed before all of the digits |
|
1507 are compared since it is a very cheap test to perform and can potentially save considerable time. The implementation given is also not valid |
|
1508 without those two statements. $b.alloc$ may be smaller than $a.used$, meaning that undefined values will be read from $b$ past the end of the |
|
1509 array of digits. |
|
1510 |
|
1511 \subsection{Signed Comparisons} |
|
1512 Comparing with sign considerations is also fairly critical in several routines (\textit{division for example}). Based on an unsigned magnitude |
|
1513 comparison a trivial signed comparison algorithm can be written. |
|
1514 |
|
1515 \begin{figure}[here] |
|
1516 \begin{center} |
|
1517 \begin{tabular}{l} |
|
1518 \hline Algorithm \textbf{mp\_cmp}. \\ |
|
1519 \textbf{Input}. Two mp\_ints $a$ and $b$ \\ |
|
1520 \textbf{Output}. Signed Comparison Results ($a$ to the left of $b$) \\ |
|
1521 \hline \\ |
|
1522 1. if $a.sign = MP\_NEG$ and $b.sign = MP\_ZPOS$ then return(\textit{MP\_LT}) \\ |
|
1523 2. if $a.sign = MP\_ZPOS$ and $b.sign = MP\_NEG$ then return(\textit{MP\_GT}) \\ |
|
1524 3. if $a.sign = MP\_NEG$ then \\ |
|
1525 \hspace{+3mm}3.1 Return the unsigned comparison of $b$ and $a$ (\textit{mp\_cmp\_mag}) \\ |
|
1526 4. Otherwise \\ |
|
1527 \hspace{+3mm}4.1 Return the unsigned comparison of $a$ and $b$ \\ |
|
1528 \hline |
|
1529 \end{tabular} |
|
1530 \end{center} |
|
1531 \caption{Algorithm mp\_cmp} |
|
1532 \end{figure} |
|
1533 |
|
1534 \textbf{Algorithm mp\_cmp.} |
|
1535 The first two steps compare the signs of the two inputs. If the signs do not agree then it can return right away with the appropriate |
|
1536 comparison code. When the signs are equal the digits of the inputs must be compared to determine the correct result. In step |
|
1537 three the unsigned comparison flips the order of the arguments since they are both negative. For instance, if $-a > -b$ then |
|
1538 $\vert a \vert < \vert b \vert$. Step number four will compare the two when they are both positive. |
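
A compact C sketch of the signed comparison follows. The demo\_int structure, the sign constants and the function names are hypothetical simplifications of the mp\_int type, introduced only to show how the argument order is swapped when both inputs are negative.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical codes and sign flags used only in this sketch. */
enum { DEMO_LT = -1, DEMO_EQ = 0, DEMO_GT = 1 };
enum { DEMO_ZPOS = 0, DEMO_NEG = 1 };

typedef struct {
    const uint32_t *dp;  /* digits, least significant first, clamped */
    int used;            /* number of digits in use */
    int sign;            /* DEMO_ZPOS or DEMO_NEG */
} demo_int;

/* Unsigned magnitude comparison. */
int demo_cmp_mag(const demo_int *a, const demo_int *b)
{
    int n;
    if (a->used != b->used) {
        return (a->used > b->used) ? DEMO_GT : DEMO_LT;
    }
    for (n = a->used - 1; n >= 0; n--) {
        if (a->dp[n] != b->dp[n]) {
            return (a->dp[n] > b->dp[n]) ? DEMO_GT : DEMO_LT;
        }
    }
    return DEMO_EQ;
}

/* Signed comparison built on top of the magnitude comparison. */
int demo_cmp(const demo_int *a, const demo_int *b)
{
    if (a->sign != b->sign) {
        /* steps 1 and 2: the negative input is the smaller one */
        return (a->sign == DEMO_NEG) ? DEMO_LT : DEMO_GT;
    }
    if (a->sign == DEMO_NEG) {
        return demo_cmp_mag(b, a);  /* step 3: both negative, swap */
    }
    return demo_cmp_mag(a, b);      /* step 4: both positive */
}
```

Swapping the arguments in the negative case avoids having to negate or remap the result of the magnitude comparison.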
|
1539 |
|
1540 EXAM,bn_mp_cmp.c |
|
1541 |
|
1542 The two if statements on lines @22,if@ and @26,if@ perform the initial sign comparison. If the signs are not equal then whichever |
|
1543 has the positive sign is larger. At line @30,if@, the inputs are compared based on magnitudes. If the signs were both negative then |
|
1544 the unsigned comparison is performed in the opposite direction (\textit{line @31,mp_cmp_mag@}). Otherwise, the signs are assumed to |
|
1545 be both positive and a forward direction unsigned comparison is performed. |
|
1546 |
|
1547 \section*{Exercises} |
|
1548 \begin{tabular}{cl} |
|
1549 $\left [ 2 \right ]$ & Modify algorithm mp\_set\_int to accept as input a variable length array of bits. \\ |
|
1550 & \\ |
|
1551 $\left [ 3 \right ]$ & Give the probability that algorithm mp\_cmp\_mag will have to compare $k$ digits \\ |
|
1552 & of two random integers (of equal magnitude) before a difference is found. \\ |
|
1553 & \\ |
|
1554 $\left [ 1 \right ]$ & Suggest a simple method to speed up the implementation of mp\_cmp\_mag based \\ |
|
1555 & on the observations made in the previous problem. \\ |
|
1556 & |
|
1557 \end{tabular} |
|
1558 |
|
1559 \chapter{Basic Arithmetic} |
|
1560 \section{Introduction} |
|
1561 At this point algorithms for initialization, clearing, zeroing, copying, comparing and setting small constants have been |
|
1562 established. The next logical set of algorithms to develop are addition, subtraction and digit shifting algorithms. These |
|
1563 algorithms make use of the lower level algorithms and are the crucial building blocks for the multiplication algorithms. It is very important |
|
1564 that these algorithms are highly optimized. On their own they are simple $O(n)$ algorithms but they can be called from higher level algorithms |
|
1565 which easily places them at $O(n^2)$ or even $O(n^3)$ work levels. |
|
1566 |
|
1567 MARK,SHIFTS |
|
1568 All of the algorithms within this chapter make use of the logical bit shift operations denoted by $<<$ and $>>$ for left and right |
|
1569 logical shifts respectively. A logical shift is analogous to sliding the decimal point of radix-10 representations. For example, the real |
|
1570 number $0.9345$ is equivalent to $93.45\%$ which is found by sliding the decimal point two places to the right (\textit{multiplying by $\beta^2 = 10^2$}). |
|
1571 Algebraically a binary logical shift is equivalent to a division or multiplication by a power of two. |
|
1572 For example, $a << k = a \cdot 2^k$ while $a >> k = \lfloor a/2^k \rfloor$. |
|
1573 |
|
1574 One significant difference between a logical shift and the way decimals are shifted is that digits below the zero'th position are removed |
|
1575 from the number. For example, consider $1101_2 >> 1$. Using decimal notation this would produce $110.1_2$. However, with a logical shift the |
|
1576 result is $110_2$. |
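
The two identities can be checked directly in C on unsigned values. The helper names demo\_shl and demo\_shr are hypothetical; the sketch assumes 32-bit unsigned operands, a shift count below 32, and a left shift whose result still fits in 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/* a << k, equal to a * 2^k as long as the product fits in 32 bits. */
uint32_t demo_shl(uint32_t a, unsigned k)
{
    assert(k < 32);
    return a << k;
}

/* a >> k, equal to floor(a / 2^k); bits below position zero are dropped. */
uint32_t demo_shr(uint32_t a, unsigned k)
{
    assert(k < 32);
    return a >> k;
}
```

Shifting $1101_2 = 13$ right by one position gives $110_2 = 6$, matching $\lfloor 13/2 \rfloor$.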
|
1577 |
|
1578 \section{Addition and Subtraction} |
|
1579 In common two's complement fixed precision arithmetic negative numbers are easily represented by subtraction from the modulus. For example, with 32-bit integers |
|
1580 $a - b\mbox{ (mod }2^{32}\mbox{)}$ is the same as $a + (2^{32} - b) \mbox{ (mod }2^{32}\mbox{)}$ since $2^{32} \equiv 0 \mbox{ (mod }2^{32}\mbox{)}$. |
|
1581 As a result subtraction can be performed with a trivial series of logical operations and an addition. |
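
As a small illustration, the following C function forms $2^{32} - b$ by inverting the bits of $b$ and adding one, then performs the subtraction as an addition. The name demo\_sub\_via\_add is hypothetical, and the sketch assumes that the 32-bit unsigned type is not narrower than an int, so that the arithmetic wraps modulo $2^{32}$.

```c
#include <assert.h>
#include <stdint.h>

/* a - b (mod 2^32) computed as a + (2^32 - b). */
uint32_t demo_sub_via_add(uint32_t a, uint32_t b)
{
    uint32_t neg_b = ~b + 1u;   /* 2^32 - b, reduced modulo 2^32 */
    return a + neg_b;           /* unsigned addition wraps modulo 2^32 */
}
```

For example, $3 - 10$ yields $2^{32} - 7$, which is the two's complement representation of $-7$.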
|
1582 |
|
1583 However, in multiple precision arithmetic negative numbers are not represented in the same way. Instead a sign flag is used to keep track of the |
|
1584 sign of the integer. As a result signed addition and subtraction are actually implemented as conditional usage of lower level addition or |
|
1585 subtraction algorithms with the sign fixed up appropriately. |
|
1586 |
|
1587 The lower level algorithms will add or subtract integers without regard to the sign flag. That is they will add or subtract the magnitude of |
|
1588 the integers respectively. |
|
1589 |
|
1590 \subsection{Low Level Addition} |
|
1591 An unsigned addition of multiple precision integers is performed with the same long-hand algorithm used to add decimal numbers. That is to add the |
|
1592 trailing digits first and propagate the resulting carry upwards. Since this is a lower level algorithm the name will have a ``s\_'' prefix. |
|
1593 Historically that convention stems from the MPI library where ``s\_'' stood for static functions that were hidden from the developer entirely. |
|
1594 |
|
1595 \newpage |
|
1596 \begin{figure}[!here] |
|
1597 \begin{center} |
|
1598 \begin{small} |
|
1599 \begin{tabular}{l} |
|
1600 \hline Algorithm \textbf{s\_mp\_add}. \\ |
|
1601 \textbf{Input}. Two mp\_ints $a$ and $b$ \\ |
|
1602 \textbf{Output}. The unsigned addition $c = \vert a \vert + \vert b \vert$. \\ |
|
1603 \hline \\ |
|
1604 1. if $a.used > b.used$ then \\ |
|
1605 \hspace{+3mm}1.1 $min \leftarrow b.used$ \\ |
|
1606 \hspace{+3mm}1.2 $max \leftarrow a.used$ \\ |
|
1607 \hspace{+3mm}1.3 $x \leftarrow a$ \\ |
|
1608 2. else \\ |
|
1609 \hspace{+3mm}2.1 $min \leftarrow a.used$ \\ |
|
1610 \hspace{+3mm}2.2 $max \leftarrow b.used$ \\ |
|
1611 \hspace{+3mm}2.3 $x \leftarrow b$ \\ |
|
1612 3. If $c.alloc < max + 1$ then grow $c$ to hold at least $max + 1$ digits (\textit{mp\_grow}) \\ |
|
1613 4. $oldused \leftarrow c.used$ \\ |
|
1614 5. $c.used \leftarrow max + 1$ \\ |
|
1615 6. $u \leftarrow 0$ \\ |
|
1616 7. for $n$ from $0$ to $min - 1$ do \\ |
|
1617 \hspace{+3mm}7.1 $c_n \leftarrow a_n + b_n + u$ \\ |
|
1618 \hspace{+3mm}7.2 $u \leftarrow c_n >> lg(\beta)$ \\ |
|
1619 \hspace{+3mm}7.3 $c_n \leftarrow c_n \mbox{ (mod }\beta\mbox{)}$ \\ |
|
1620 8. if $min \ne max$ then do \\ |
|
1621 \hspace{+3mm}8.1 for $n$ from $min$ to $max - 1$ do \\ |
|
1622 \hspace{+6mm}8.1.1 $c_n \leftarrow x_n + u$ \\ |
|
1623 \hspace{+6mm}8.1.2 $u \leftarrow c_n >> lg(\beta)$ \\ |
|
1624 \hspace{+6mm}8.1.3 $c_n \leftarrow c_n \mbox{ (mod }\beta\mbox{)}$ \\ |
|
1625 9. $c_{max} \leftarrow u$ \\ |
|
1626 10. if $oldused > max$ then \\ |
|
1627 \hspace{+3mm}10.1 for $n$ from $max + 1$ to $oldused - 1$ do \\ |
|
1628 \hspace{+6mm}10.1.1 $c_n \leftarrow 0$ \\ |
|
1629 11. Clamp excess digits in $c$. (\textit{mp\_clamp}) \\ |
|
1630 12. Return(\textit{MP\_OKAY}) \\ |
|
1631 \hline |
|
1632 \end{tabular} |
|
1633 \end{small} |
|
1634 \end{center} |
|
1635 \caption{Algorithm s\_mp\_add} |
|
1636 \end{figure} |
|
1637 |
|
1638 \textbf{Algorithm s\_mp\_add.} |
|
1639 This algorithm is loosely based on algorithm 14.7 of HAC \cite[pp. 594]{HAC} but has been extended to allow the inputs to have different magnitudes. |
|
1640 Coincidentally the description of algorithm A in Knuth \cite[pp. 266]{TAOCPV2} shares the same deficiency as the algorithm from \cite{HAC}. Even the |
|
1641 MIX pseudo machine code presented by Knuth \cite[pp. 266-267]{TAOCPV2} is incapable of handling inputs which are of different magnitudes. |
|
1642 |
|
1643 The first thing that has to be accomplished is to sort out which of the two inputs is the largest. The addition logic |
|
1644 will simply add all of the smallest input to the largest input and store that first part of the result in the |
|
1645 destination. Then it will apply a simpler addition loop to excess digits of the larger input. |
|
1646 |
|
1647 The first two steps will handle sorting the inputs such that $min$ and $max$ hold the digit counts of the two |
|
1648 inputs. The variable $x$ will be an mp\_int alias for the largest input or the second input $b$ if they have the |
|
1649 same number of digits. After the inputs are sorted the destination $c$ is grown as required to accommodate the sum |
|
1650 of the two inputs. The original \textbf{used} count of $c$ is copied and set to the new used count. |
|
1651 |
|
1652 At this point the first addition loop will go through as many digit positions as both inputs have. The carry |
|
1653 variable $u$ is set to zero outside the loop. Inside the loop an ``addition'' step requires three statements to produce |
|
1654 one digit of the sum. First |
|
1655 two digits from $a$ and $b$ are added together along with the carry $u$. The carry of this step is extracted and stored |
|
1656 in $u$ and finally the digit of the result $c_n$ is truncated within the range $0 \le c_n < \beta$. |
|
1657 |
|
1658 Now all of the digit positions that both inputs have in common have been exhausted. If $min \ne max$ then $x$ is an alias |
|
1659 for one of the inputs that has more digits. A simplified addition loop is then used to essentially copy the remaining digits |
|
1660 and the carry to the destination. |
|
1661 |
|
1662 The final carry is stored in $c_{max}$ and digits above $max$ up to $oldused$ are zeroed which completes the addition. |
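
The two addition loops can be sketched in C on plain digit arrays. This is a simplified illustration, not the library routine: it assumes a 28-bit digit held in a 32-bit word, leaves the allocation of $c$ to the caller (room for $max + 1$ digits is assumed), and omits the zeroing of old upper digits. The names demo\_add\_mag, DEMO\_DIGIT\_BIT and DEMO\_MASK are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_DIGIT_BIT 28   /* assumed: beta = 2^28 */
#define DEMO_MASK ((((uint32_t)1) << DEMO_DIGIT_BIT) - 1u)

/* c <- |a| + |b|. Digits are stored least significant first and c must
 * have room for max(a_used, b_used) + 1 digits. Returns the clamped
 * digit count of c. */
size_t demo_add_mag(const uint32_t *a, size_t a_used,
                    const uint32_t *b, size_t b_used,
                    uint32_t *c)
{
    const uint32_t *x;      /* alias for the input with more digits */
    size_t min, max, n;
    uint32_t u = 0;         /* carry */

    if (a_used > b_used) { min = b_used; max = a_used; x = a; }
    else                 { min = a_used; max = b_used; x = b; }

    for (n = 0; n < min; n++) {         /* step 7: digits in common */
        uint32_t t = a[n] + b[n] + u;
        u = t >> DEMO_DIGIT_BIT;        /* extract the carry */
        c[n] = t & DEMO_MASK;           /* reduce modulo beta */
    }
    for (; n < max; n++) {              /* step 8: excess digits of x */
        uint32_t t = x[n] + u;
        u = t >> DEMO_DIGIT_BIT;
        c[n] = t & DEMO_MASK;
    }
    c[max] = u;                         /* step 9: final carry */

    n = max + 1;                        /* step 11: clamp */
    while (n > 0 && c[n - 1] == 0) {
        n--;
    }
    return n;
}
```

Adding $1$ to $\beta^2 - 1$ exercises both loops: the carry ripples through the two digits of the longer input and lands in the third digit of the result.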
|
1663 |
|
1664 |
|
1665 EXAM,bn_s_mp_add.c |
|
1666 |
|
1667 Lines @27,if@ to @35,}@ perform the initial sorting of the inputs and determine the $min$ and $max$ variables. Note that $x$ is a pointer to a |
|
1668 mp\_int assigned to the largest input, in effect it is a local alias. Lines @37,init@ to @42,}@ ensure that the destination is grown to |
|
1669 accommodate the result of the addition. |
|
1670 |
|
1671 Similar to the implementation of mp\_copy this function uses the braced code and local aliases coding style. The three aliases that are on |
|
1672 lines @56,tmpa@, @59,tmpb@ and @62,tmpc@ represent the two inputs and destination variables respectively. These aliases are used to ensure the |
|
1673 compiler does not have to dereference $a$, $b$ or $c$ (respectively) to access the digits of the respective mp\_int. |
|
1674 |
|
1675 The initial carry $u$ is cleared on line @65,u = 0@, note that $u$ is of type mp\_digit which ensures type compatibility within the |
|
1676 implementation. The initial addition loop begins on line @66,for@ and ends on line @75,}@. Similarly the conditional addition loop |
|
1677 begins on line @81,for@ and ends on line @90,}@. The addition is finished with the final carry being stored in $tmpc$ on line @94,tmpc++@. |
|
1678 Note the ``++'' operator on the same line. After line @94,tmpc++@ $tmpc$ will point to the $c.used$'th digit of the mp\_int $c$. This is useful |
|
1679 for the next loop on lines @97,for@ to @99,}@ which set any old upper digits to zero. |
|
1680 |
|
1681 \subsection{Low Level Subtraction} |
|
1682 The low level unsigned subtraction algorithm is very similar to the low level unsigned addition algorithm. The principal difference is that the |
|
1683 unsigned subtraction algorithm requires the result to be positive. That is when computing $a - b$ the condition $\vert a \vert \ge \vert b\vert$ must |
|
1684 be met for this algorithm to function properly. Keep in mind this low level algorithm is not meant to be used in higher level algorithms directly. |
|
1685 This algorithm as will be shown can be used to create functional signed addition and subtraction algorithms. |
|
1686 |
|
1687 MARK,GAMMA |
|
1688 |
|
1689 For this algorithm a new variable is required to make the description simpler. Recall from section 1.3.1 that a mp\_digit must be able to represent |
|
1690 the range $0 \le x < 2\beta$ for the algorithms to work correctly. However, it is allowable that a mp\_digit represent a larger range of values. For |
|
1691 this algorithm we will assume that the variable $\gamma$ represents the number of bits available in a |
|
1692 mp\_digit (\textit{this implies $2^{\gamma} > \beta$}). |
|
1693 |
|
1694 For example, the default for LibTomMath is to use an ``unsigned long'' for the mp\_digit ``type'' while $\beta = 2^{28}$. In ISO C an ``unsigned long'' |
|
1695 data type must be able to represent at least $0 \le x < 2^{32}$ meaning that in this case $\gamma \ge 32$. |
|
1696 |
|
1697 \newpage\begin{figure}[!here] |
|
1698 \begin{center} |
|
1699 \begin{small} |
|
1700 \begin{tabular}{l} |
|
1701 \hline Algorithm \textbf{s\_mp\_sub}. \\ |
|
1702 \textbf{Input}. Two mp\_ints $a$ and $b$ ($\vert a \vert \ge \vert b \vert$) \\ |
|
1703 \textbf{Output}. The unsigned subtraction $c = \vert a \vert - \vert b \vert$. \\ |
|
1704 \hline \\ |
|
1705 1. $min \leftarrow b.used$ \\ |
|
1706 2. $max \leftarrow a.used$ \\ |
|
1707 3. If $c.alloc < max$ then grow $c$ to hold at least $max$ digits. (\textit{mp\_grow}) \\ |
|
1708 4. $oldused \leftarrow c.used$ \\ |
|
1709 5. $c.used \leftarrow max$ \\ |
|
1710 6. $u \leftarrow 0$ \\ |
|
1711 7. for $n$ from $0$ to $min - 1$ do \\ |
|
1712 \hspace{3mm}7.1 $c_n \leftarrow a_n - b_n - u$ \\ |
|
1713 \hspace{3mm}7.2 $u \leftarrow c_n >> (\gamma - 1)$ \\ |
|
1714 \hspace{3mm}7.3 $c_n \leftarrow c_n \mbox{ (mod }\beta\mbox{)}$ \\ |
|
1715 8. if $min < max$ then do \\ |
|
1716 \hspace{3mm}8.1 for $n$ from $min$ to $max - 1$ do \\ |
|
1717 \hspace{6mm}8.1.1 $c_n \leftarrow a_n - u$ \\ |
|
1718 \hspace{6mm}8.1.2 $u \leftarrow c_n >> (\gamma - 1)$ \\ |
|
1719 \hspace{6mm}8.1.3 $c_n \leftarrow c_n \mbox{ (mod }\beta\mbox{)}$ \\ |
|
1720 9. if $oldused > max$ then do \\ |
|
1721 \hspace{3mm}9.1 for $n$ from $max$ to $oldused - 1$ do \\ |
|
1722 \hspace{6mm}9.1.1 $c_n \leftarrow 0$ \\ |
|
1723 10. Clamp excess digits of $c$. (\textit{mp\_clamp}). \\ |
|
1724 11. Return(\textit{MP\_OKAY}). \\ |
|
1725 \hline |
|
1726 \end{tabular} |
|
1727 \end{small} |
|
1728 \end{center} |
|
1729 \caption{Algorithm s\_mp\_sub} |
|
1730 \end{figure} |
|
1731 |
|
1732 \textbf{Algorithm s\_mp\_sub.} |
|
1733 This algorithm performs the unsigned subtraction of two mp\_int variables under the restriction that the result must be positive. That is when |
|
1734 passing variables $a$ and $b$ the condition that $\vert a \vert \ge \vert b \vert$ must be met for the algorithm to function correctly. This |
|
1735 algorithm is loosely based on algorithm 14.9 \cite[pp. 595]{HAC} and is similar to algorithm S in \cite[pp. 267]{TAOCPV2} as well. As was the case |
|
1736 of the algorithm s\_mp\_add both other references lack discussion concerning various practical details such as when the inputs differ in magnitude. |
|
1737 |
|
1738 The initial sorting of the inputs is trivial in this algorithm since $a$ is guaranteed to have at least the same magnitude as $b$. Steps 1 and 2 |
|
1739 set the $min$ and $max$ variables. Unlike the addition routine there is guaranteed to be no carry which means that the final result can be at |
|
1740 most $max$ digits in length as opposed to $max + 1$. Similar to the addition algorithm the \textbf{used} count of $c$ is copied locally and |
|
1741 set to the maximal count for the operation. |
|
1742 |
|
1743 The subtraction loop that begins on step seven is essentially the same as the addition loop of algorithm s\_mp\_add except single precision |
|
1744 subtraction is used instead. Note the use of the $\gamma$ variable to extract the carry (\textit{also known as the borrow}) within the subtraction |
|
1745 loops. Under the assumption that two's complement single precision arithmetic is used this will successfully extract the desired carry. |
|
1746 |
|
1747 For example, consider subtracting $0101_2$ from $0100_2$ where $\gamma = 4$ and $\beta = 2$. The least significant bit will force a carry upwards to |
|
1748 the third bit which will be set to zero after the borrow. After the very first bit has been subtracted $4 - 1 \equiv 0011_2$ will remain. When the |
|
1749 third bit of $0101_2$ is subtracted from the result it will cause another carry. In this case though the carry will be forced to propagate all the |
|
1750 way to the most significant bit. |
|
1751 |
|
1752 Recall that $\beta < 2^{\gamma}$. This means that if a carry does occur just before the $lg(\beta)$'th bit it will propagate all the way to the most |
|
1753 significant bit. Thus, the high order bits of the mp\_digit that are not part of the actual digit will either be all zero, or all one. All that |
|
1754 is needed is a single zero or one bit for the carry. Therefore a single logical shift right by $\gamma - 1$ positions is sufficient to extract the |
|
1755 carry. This method of carry extraction may seem awkward but the reason for it becomes apparent when the implementation is discussed. |
|
1756 |
|
1757 If $b$ has a smaller magnitude than $a$ then step 8 will force the carry and copy operation to propagate through the larger input $a$ into $c$. Step |
|
1758 9 will ensure that any leading digits of $c$ above the $max$'th position are zeroed. |
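
The subtraction loops can be sketched in the same style on plain digit arrays. The fragment assumes $\beta = 2^{28}$ and $\gamma = 32$, expects the caller to supply room for $max$ digits in $c$, and leaves out the zeroing of old upper digits. The names demo\_sub\_mag, DEMO\_DIGIT\_BIT, DEMO\_GAMMA and DEMO\_MASK are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_DIGIT_BIT 28   /* assumed: lg(beta) */
#define DEMO_GAMMA     32   /* assumed: bits in the digit type */
#define DEMO_MASK ((((uint32_t)1) << DEMO_DIGIT_BIT) - 1u)

/* c <- |a| - |b| under the precondition |a| >= |b|. Digits are stored
 * least significant first and c must have room for a_used digits.
 * Returns the clamped digit count of c. */
size_t demo_sub_mag(const uint32_t *a, size_t a_used,
                    const uint32_t *b, size_t b_used,
                    uint32_t *c)
{
    size_t min = b_used, max = a_used, n;
    uint32_t u = 0;     /* borrow */

    assert(a_used >= b_used);

    for (n = 0; n < min; n++) {             /* digits in common */
        uint32_t t = a[n] - b[n] - u;       /* wraps modulo 2^32 on a borrow */
        u = t >> (DEMO_GAMMA - 1);          /* top bit is the borrow */
        c[n] = t & DEMO_MASK;               /* reduce modulo beta */
    }
    for (; n < max; n++) {                  /* excess digits of a */
        uint32_t t = a[n] - u;
        u = t >> (DEMO_GAMMA - 1);
        c[n] = t & DEMO_MASK;
    }

    n = max;                                /* clamp */
    while (n > 0 && c[n - 1] == 0) {
        n--;
    }
    return n;
}
```

Subtracting $1$ from $\beta^2$ shows the borrow propagating: the result is the two-digit value $\beta^2 - 1$ and the former leading digit is clamped away.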
|
1759 |
|
1760 EXAM,bn_s_mp_sub.c |
|
1761 |
|
1762 Lines @24,min@ and @25,max@ perform the initial hardcoded sorting of the inputs. In reality the $min$ and $max$ variables are only aliases and are only |
|
1763 used to make the source code easier to read. Again the pointer alias optimization is used within this algorithm. Lines @42,tmpa@, @43,tmpb@ and @44,tmpc@ initialize the aliases for |
|
1764 $a$, $b$ and $c$ respectively. |
|
1765 |
|
1766 The first subtraction loop occurs on lines @47,u = 0@ through @61,}@. The theory behind the subtraction loop is exactly the same as that for |
|
1767 the addition loop. As remarked earlier there is an implementation reason for using the ``awkward'' method of extracting the carry |
|
1768 (\textit{see line @57, >>@}). The traditional method for extracting the carry would be to shift by $lg(\beta)$ positions and logically AND |
|
1769 the least significant bit. The AND operation is required because all of the bits above the $\lg(\beta)$'th bit will be set to one after a carry |
|
1770 occurs from subtraction. This carry extraction requires two relatively cheap operations to extract the carry. The other method is to simply |
|
1771 shift the most significant bit to the least significant bit thus extracting the carry with a single cheap operation. This optimization only works on |
|
1772 two's complement machines which is a safe assumption to make. |
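
The two ways of extracting the borrow can be placed side by side. The sketch assumes $\beta = 2^{28}$ and $\gamma = 32$, and that the input is the wrapped 32-bit difference of two digits; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_DIGIT_BIT 28   /* assumed: lg(beta) */
#define DEMO_GAMMA     32   /* assumed: bits in the digit type */

/* Traditional method: shift by lg(beta), then AND with one. */
uint32_t demo_borrow_shift_and(uint32_t t)
{
    return (t >> DEMO_DIGIT_BIT) & 1u;
}

/* Single shift: move the most significant bit down to bit zero. */
uint32_t demo_borrow_shift_only(uint32_t t)
{
    return t >> (DEMO_GAMMA - 1);
}
```

Both functions agree whenever the bits above the digit are all zero or all one, which is exactly the situation after a single precision subtraction of two digits.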
|
1773 |
|
1774 If $a$ has a larger magnitude than $b$ an additional loop (\textit{see lines @64,for@ through @73,}@}) is required to propagate the carry through |
|
1775 $a$ and copy the result to $c$. |
|
1776 |
|
1777 \subsection{High Level Addition} |
|
1778 Now that both lower level addition and subtraction algorithms have been established an effective high level signed addition algorithm can be |
|
1779 established. This high level addition algorithm will be what other algorithms and developers will use to perform addition of mp\_int data |
|
1780 types. |
|
1781 |
|
1782 Recall from section 5.2 that an mp\_int represents an integer with an unsigned mantissa (\textit{the array of digits}) and a \textbf{sign} |
|
1783 flag. A high level addition is actually performed as a series of eight separate cases which can be optimized down to three unique cases. |
|
1784 |
|
1785 \begin{figure}[!here] |
|
1786 \begin{center} |
|
1787 \begin{tabular}{l} |
|
1788 \hline Algorithm \textbf{mp\_add}. \\ |
|
1789 \textbf{Input}. Two mp\_ints $a$ and $b$ \\ |
|
1790 \textbf{Output}. The signed addition $c = a + b$. \\ |
|
1791 \hline \\ |
|
1792 1. if $a.sign = b.sign$ then do \\ |
|
1793 \hspace{3mm}1.1 $c.sign \leftarrow a.sign$ \\ |
|
1794 \hspace{3mm}1.2 $c \leftarrow \vert a \vert + \vert b \vert$ (\textit{s\_mp\_add})\\ |
|
1795 2. else do \\ |
|
1796 \hspace{3mm}2.1 if $\vert a \vert < \vert b \vert$ then do (\textit{mp\_cmp\_mag}) \\ |
|
1797 \hspace{6mm}2.1.1 $c.sign \leftarrow b.sign$ \\ |
|
1798 \hspace{6mm}2.1.2 $c \leftarrow \vert b \vert - \vert a \vert$ (\textit{s\_mp\_sub}) \\ |
|
1799 \hspace{3mm}2.2 else do \\ |
|
1800 \hspace{6mm}2.2.1 $c.sign \leftarrow a.sign$ \\ |
|
1801 \hspace{6mm}2.2.2 $c \leftarrow \vert a \vert - \vert b \vert$ \\ |
|
1802 3. Return(\textit{MP\_OKAY}). \\ |
|
1803 \hline |
|
1804 \end{tabular} |
|
1805 \end{center} |
|
1806 \caption{Algorithm mp\_add} |
|
1807 \end{figure} |
|
1808 |
|
1809 \textbf{Algorithm mp\_add.} |
|
1810 This algorithm performs the signed addition of two mp\_int variables. There is no reference algorithm to draw upon from |
|
1811 either \cite{TAOCPV2} or \cite{HAC} since they both only provide unsigned operations. The algorithm is fairly |
|
1812 straightforward but restricted since subtraction can only produce positive results. |
|
1813 |
|
1814 \begin{figure}[here] |
|
1815 \begin{small} |
|
1816 \begin{center} |
|
1817 \begin{tabular}{|c|c|c|c|c|} |
|
1818 \hline \textbf{Sign of $a$} & \textbf{Sign of $b$} & \textbf{$\vert a \vert > \vert b \vert $} & \textbf{Unsigned Operation} & \textbf{Result Sign Flag} \\ |
|
1819 \hline $+$ & $+$ & Yes & $c = a + b$ & $a.sign$ \\ |
|
1820 \hline $+$ & $+$ & No & $c = a + b$ & $a.sign$ \\ |
|
1821 \hline $-$ & $-$ & Yes & $c = a + b$ & $a.sign$ \\ |
|
1822 \hline $-$ & $-$ & No & $c = a + b$ & $a.sign$ \\ |
|
1823 \hline &&&&\\ |
|
1824 |
|
1825 \hline $+$ & $-$ & No & $c = b - a$ & $b.sign$ \\ |
|
1826 \hline $-$ & $+$ & No & $c = b - a$ & $b.sign$ \\ |
|
1827 |
|
1828 \hline &&&&\\ |
|
1829 |
|
1830 \hline $+$ & $-$ & Yes & $c = a - b$ & $a.sign$ \\ |
|
1831 \hline $-$ & $+$ & Yes & $c = a - b$ & $a.sign$ \\ |
|
1832 |
|
1833 \hline |
|
1834 \end{tabular} |
|
1835 \end{center} |
|
1836 \end{small} |
|
1837 \caption{Addition Guide Chart} |
|
1838 \label{fig:AddChart} |
|
1839 \end{figure} |
|
1840 |
|
1841 Figure~\ref{fig:AddChart} lists all of the eight possible input combinations and is sorted to show that only three |
|
1842 specific cases need to be handled. The return codes of the unsigned operations at steps 1.2, 2.1.2 and 2.2.2 are |
|
1843 forwarded to step three to check for errors. This simplifies the description of the algorithm considerably and best |
|
1844 follows how the implementation actually was achieved. |
|
1845 |
|
1846 Also note how the \textbf{sign} is set before the unsigned addition or subtraction is performed. Recall from the descriptions of algorithms |
|
1847 s\_mp\_add and s\_mp\_sub that the mp\_clamp function is used at the end to trim excess digits. The mp\_clamp algorithm will set the \textbf{sign} |
|
1848 to \textbf{MP\_ZPOS} when the \textbf{used} digit count reaches zero. |
|
1849 |
|
1850 For example, consider performing $-a + a$ with algorithm mp\_add. By the description of the algorithm the sign is set to \textbf{MP\_NEG} which would |
|
1851 produce a result of $-0$. However, since the sign is set first and then the unsigned subtraction is performed, the subsequent usage of algorithm mp\_clamp |
|
1852 within algorithm s\_mp\_sub will force $-0$ to become $0$. |
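
The case analysis can be illustrated with a toy sign-magnitude type whose magnitude is a single 64-bit word. This is a deliberate simplification: demo\_sm, demo\_fix\_zero and demo\_add are hypothetical names, overflow of the magnitude is ignored, and demo\_fix\_zero only mimics the part of mp\_clamp that resets the sign of a zero result.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sign flags used only in this sketch. */
enum { DEMO_ZPOS = 0, DEMO_NEG = 1 };

/* Toy signed integer: a sign flag plus a single-word magnitude. */
typedef struct {
    int sign;
    uint64_t mag;
} demo_sm;

/* A zero magnitude is always given a positive sign. */
void demo_fix_zero(demo_sm *c)
{
    if (c->mag == 0) {
        c->sign = DEMO_ZPOS;
    }
}

/* c <- a + b using only unsigned operations on the magnitudes. */
void demo_add(const demo_sm *a, const demo_sm *b, demo_sm *c)
{
    demo_sm r;
    if (a->sign == b->sign) {
        r.sign = a->sign;               /* step 1.1 */
        r.mag = a->mag + b->mag;        /* step 1.2 */
    } else if (a->mag < b->mag) {
        r.sign = b->sign;               /* step 2.1.1 */
        r.mag = b->mag - a->mag;        /* step 2.1.2 */
    } else {
        r.sign = a->sign;               /* step 2.2.1 */
        r.mag = a->mag - b->mag;        /* step 2.2.2 */
    }
    demo_fix_zero(&r);                  /* sign is set first, fixed up last */
    *c = r;
}
```

Computing $-7 + 7$ with this sketch first assigns a negative sign and then resets it once the magnitude is found to be zero, which mirrors the ordering argument made above.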
|
1853 |
|
1854 EXAM,bn_mp_add.c |
|
1855 |
|
1856 The source code follows the algorithm fairly closely. The most notable new source code addition is the usage of the $res$ integer variable which |
|
1857 is used to pass the result of the unsigned operations forward. Unlike in the algorithm, the variable $res$ is merely returned as is without |
|
1858 explicitly checking it and returning the constant \textbf{MP\_OKAY}. The observation is this algorithm will succeed or fail only if the lower |
|
1859 level functions do so. Returning their return code is sufficient. |
|
1860 |
|
1861 \subsection{High Level Subtraction} |
|
1862 The high level signed subtraction algorithm is essentially the same as the high level signed addition algorithm. |
|
1863 |
|
1864 \newpage\begin{figure}[!here] |
|
1865 \begin{center} |
|
1866 \begin{tabular}{l} |
|
1867 \hline Algorithm \textbf{mp\_sub}. \\ |
|
1868 \textbf{Input}. Two mp\_ints $a$ and $b$ \\ |
|
1869 \textbf{Output}. The signed subtraction $c = a - b$. \\ |
|
1870 \hline \\ |
|
1871 1. if $a.sign \ne b.sign$ then do \\ |
|
1872 \hspace{3mm}1.1 $c.sign \leftarrow a.sign$ \\ |
|
1873 \hspace{3mm}1.2 $c \leftarrow \vert a \vert + \vert b \vert$ (\textit{s\_mp\_add}) \\ |
|
1874 2. else do \\ |
|
1875 \hspace{3mm}2.1 if $\vert a \vert \ge \vert b \vert$ then do (\textit{mp\_cmp\_mag}) \\ |
|
1876 \hspace{6mm}2.1.1 $c.sign \leftarrow a.sign$ \\ |
|
1877 \hspace{6mm}2.1.2 $c \leftarrow \vert a \vert - \vert b \vert$ (\textit{s\_mp\_sub}) \\ |
|
1878 \hspace{3mm}2.2 else do \\ |
|
1879 \hspace{6mm}2.2.1 $c.sign \leftarrow \left \lbrace \begin{array}{ll} |
|
1880 MP\_ZPOS & \mbox{if }a.sign = MP\_NEG \\ |
|
1881 MP\_NEG & \mbox{otherwise} \\ |
|
1882 \end{array} \right .$ \\ |
|
1883 \hspace{6mm}2.2.2 $c \leftarrow \vert b \vert - \vert a \vert$ \\ |
|
1884 3. Return(\textit{MP\_OKAY}). \\ |
|
1885 \hline |
|
1886 \end{tabular} |
|
1887 \end{center} |
|
1888 \caption{Algorithm mp\_sub} |
|
1889 \end{figure} |
|
1890 |
|
1891 \textbf{Algorithm mp\_sub.} |
|
1892 This algorithm performs the signed subtraction of two inputs. Similar to algorithm mp\_add there is no reference in either \cite{TAOCPV2} or |
|
1893 \cite{HAC}. Also this algorithm is restricted by algorithm s\_mp\_sub. Chart \ref{fig:SubChart} lists the eight possible inputs and |
|
1894 the operations required. |
|
1895 |
|
1896 \begin{figure}[!here] |
|
1897 \begin{small} |
|
1898 \begin{center} |
|
1899 \begin{tabular}{|c|c|c|c|c|} |
|
1900 \hline \textbf{Sign of $a$} & \textbf{Sign of $b$} & \textbf{$\vert a \vert \ge \vert b \vert $} & \textbf{Unsigned Operation} & \textbf{Result Sign Flag} \\ |
|
1901 \hline $+$ & $-$ & Yes & $c = a + b$ & $a.sign$ \\ |
|
1902 \hline $+$ & $-$ & No & $c = a + b$ & $a.sign$ \\ |
|
1903 \hline $-$ & $+$ & Yes & $c = a + b$ & $a.sign$ \\ |
|
1904 \hline $-$ & $+$ & No & $c = a + b$ & $a.sign$ \\ |
|
1905 \hline &&&& \\ |
|
1906 \hline $+$ & $+$ & Yes & $c = a - b$ & $a.sign$ \\ |
|
1907 \hline $-$ & $-$ & Yes & $c = a - b$ & $a.sign$ \\ |
|
1908 \hline &&&& \\ |
|
1909 \hline $+$ & $+$ & No & $c = b - a$ & $\mbox{opposite of }a.sign$ \\ |
|
1910 \hline $-$ & $-$ & No & $c = b - a$ & $\mbox{opposite of }a.sign$ \\ |
|
1911 \hline |
|
1912 \end{tabular} |
|
1913 \end{center} |
|
1914 \end{small} |
|
1915 \caption{Subtraction Guide Chart} |
|
1916 \label{fig:SubChart} |
|
1917 \end{figure} |
|
1918 |
|
1919 Similar to the case of algorithm mp\_add, the \textbf{sign} is set first, before the unsigned addition or subtraction. That is to prevent the

1920 algorithm from producing $-a - (-a) = -0$ as a result.
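
The chart can be exercised with a small sketch in C. Everything named in it is hypothetical and exists only for illustration: the type toy\_int holds a sign flag and a magnitude in a single unsigned long rather than an array of digits, and toy\_sub is not the library routine. The final test for a zero magnitude stands in for the clamping that the unsigned routines are assumed to perform.

\begin{verbatim}
#include <assert.h>

#define TOY_ZPOS 0   /* zero or positive */
#define TOY_NEG  1   /* negative */

/* hypothetical stand-in for an mp_int: a sign flag and a magnitude */
typedef struct {
   int           sign;
   unsigned long mag;
} toy_int;

/* c = a - b, following the three groups of rows in the chart */
static toy_int toy_sub(toy_int a, toy_int b)
{
   toy_int c;

   assert(a.sign == TOY_ZPOS || a.sign == TOY_NEG);
   assert(b.sign == TOY_ZPOS || b.sign == TOY_NEG);

   if (a.sign != b.sign) {
      /* signs differ: add the magnitudes, keep the sign of a */
      c.sign = a.sign;
      c.mag  = a.mag + b.mag;
   } else if (a.mag >= b.mag) {
      /* signs match and |a| >= |b|: keep the sign of a */
      c.sign = a.sign;
      c.mag  = a.mag - b.mag;
   } else {
      /* signs match and |a| < |b|: swap operands, flip the sign */
      c.sign = (a.sign == TOY_NEG) ? TOY_ZPOS : TOY_NEG;
      c.mag  = b.mag - a.mag;
   }

   /* assumed clamp behaviour: a zero result is always positive */
   if (c.mag == 0) {
      c.sign = TOY_ZPOS;
   }
   return c;
}
\end{verbatim}

For example, $5 - (-3)$ takes the first branch and gives $8$, while $3 - 5$ takes the last branch and gives $-2$.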
|
1921 |
|
1922 EXAM,bn_mp_sub.c |
|
1923 |
|
1924 Much like the implementation of algorithm mp\_add the variable $res$ is used to catch the return code of the unsigned addition or subtraction operations |
|
1925 and forward it to the end of the function. On line @38, != MP_LT@ the ``not equal to'' \textbf{MP\_LT} expression is used to emulate a |
|
1926 ``greater than or equal to'' comparison. |
|
1927 |
|
1928 \section{Bit and Digit Shifting} |
|
1929 MARK,POLY |
|
1930 It is quite common to think of a multiple precision integer as a polynomial in $x$, that is $y = f(\beta)$ where $f(x) = \sum_{i=0}^{n-1} a_i x^i$. |
|
1931 This notation arises within discussion of Montgomery and Diminished Radix Reduction as well as Karatsuba multiplication and squaring. |
|
1932 |
|
1933 In order to facilitate operations on polynomials in $x$ as above, a series of simple ``digit'' algorithms have to be established. That is, to shift

1934 the digits left or right as well as to shift individual bits of the digits left and right. It is important to note that not all ``shift'' operations

1935 are on radix-$\beta$ digits.
|
1936 |
|
1937 \subsection{Multiplication by Two} |
|
1938 |
|
1939 In a binary system where the radix is a power of two, multiplication by two not only arises often in other algorithms, it is also a fairly efficient

1940 operation to perform. A single precision logical shift left is sufficient to multiply a single digit by two.
|
1941 |
|
1942 \newpage\begin{figure}[!here] |
|
1943 \begin{small} |
|
1944 \begin{center} |
|
1945 \begin{tabular}{l} |
|
1946 \hline Algorithm \textbf{mp\_mul\_2}. \\ |
|
1947 \textbf{Input}. One mp\_int $a$ \\ |
|
1948 \textbf{Output}. $b = 2a$. \\ |
|
1949 \hline \\ |
|
1950 1. If $b.alloc < a.used + 1$ then grow $b$ to hold $a.used + 1$ digits. (\textit{mp\_grow}) \\ |
|
1951 2. $oldused \leftarrow b.used$ \\ |
|
1952 3. $b.used \leftarrow a.used$ \\ |
|
1953 4. $r \leftarrow 0$ \\ |
|
1954 5. for $n$ from 0 to $a.used - 1$ do \\ |
|
1955 \hspace{3mm}5.1 $rr \leftarrow a_n >> (lg(\beta) - 1)$ \\ |
|
1956 \hspace{3mm}5.2 $b_n \leftarrow (a_n << 1) + r \mbox{ (mod }\beta\mbox{)}$ \\ |
|
1957 \hspace{3mm}5.3 $r \leftarrow rr$ \\ |
|
1958 6. If $r \ne 0$ then do \\ |
|
1959 \hspace{3mm}6.1 $b_{n + 1} \leftarrow r$ \\ |
|
1960 \hspace{3mm}6.2 $b.used \leftarrow b.used + 1$ \\ |
|
1961 7. If $b.used < oldused - 1$ then do \\ |
|
1962 \hspace{3mm}7.1 for $n$ from $b.used$ to $oldused - 1$ do \\ |
|
1963 \hspace{6mm}7.1.1 $b_n \leftarrow 0$ \\ |
|
1964 8. $b.sign \leftarrow a.sign$ \\ |
|
1965 9. Return(\textit{MP\_OKAY}).\\ |
|
1966 \hline |
|
1967 \end{tabular} |
|
1968 \end{center} |
|
1969 \end{small} |
|
1970 \caption{Algorithm mp\_mul\_2} |
|
1971 \end{figure} |
|
1972 |
|
1973 \textbf{Algorithm mp\_mul\_2.} |
|
1974 This algorithm will quickly multiply an mp\_int by two provided $\beta$ is a power of two. Neither \cite{TAOCPV2} nor \cite{HAC} describes such

1975 an algorithm despite the fact it arises often in other algorithms. The algorithm is set up much like the lower level algorithm s\_mp\_add since

1976 it is for all intents and purposes equivalent to the operation $b = \vert a \vert + \vert a \vert$.
|
1977 |
|
1978 Step 1 grows the destination $b$ as required to accommodate the maximum number of \textbf{used} digits in the result, and step 2 saves its old \textbf{used} count. The \textbf{used} count

1979 of $b$ is set to $a.used$ at step 3. Only if there is a final carry will the \textbf{used} count require adjustment.

1980 

1981 Step 5 is an optimized implementation of the addition loop for this specific case. That is, since the two values being added together

1982 are the same there is no need to perform two reads from the digits of $a$. Step 5.1 performs a single precision shift on the current digit $a_n$ to

1983 obtain what will be the carry for the next iteration. Step 5.2 calculates the $n$'th digit of the result as a single precision shift of $a_n$ plus

1984 the previous carry. Recall from ~SHIFTS~ that $a_n << 1$ is equivalent to $a_n \cdot 2$. An iteration of the addition loop is finished by

1985 forwarding the carry to the next iteration (step 5.3).

1986 

1987 Step 6 takes care of any final carry by setting the $a.used$'th digit of the result to the carry and augmenting the \textbf{used} count of $b$.

1988 Step 7 clears any leading digits of $b$ in case it originally had a larger magnitude than $a$.
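
The carry handling of the doubling loop and the final carry in the algorithm figure can be sketched in C as follows. The sketch assumes 28-bit digits stored in 32-bit words. The names DIGIT\_BIT, DIGIT\_MASK and toy\_mul\_2 are invented for this illustration, and the caller is assumed to have already provided room for one extra digit (the job of mp\_grow). Clearing stale digits and copying the sign are left out.

\begin{verbatim}
#include <assert.h>
#include <stdint.h>

#define DIGIT_BIT  28                                  /* assumed lg(beta) */
#define DIGIT_MASK ((((uint32_t)1) << DIGIT_BIT) - 1)  /* beta - 1 */

/* b = 2a where a has `used` digits, least significant digit first.
 * b must have room for used + 1 digits; b may be the same array as a.
 * Returns the used count of the result. */
static int toy_mul_2(const uint32_t *a, int used, uint32_t *b)
{
   uint32_t r = 0, rr;
   int      n;

   assert(used >= 0);
   for (n = 0; n < used; n++) {
      rr   = a[n] >> (DIGIT_BIT - 1);        /* top bit is next carry */
      b[n] = ((a[n] << 1) + r) & DIGIT_MASK; /* double, add old carry */
      r    = rr;                             /* forward the carry     */
   }
   if (r != 0) {
      b[used++] = r;                         /* final carry: new digit */
   }
   return used;
}
\end{verbatim}

Each digit of $a$ is read exactly once, which is the saving over a general addition of $a$ to itself.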
|
1989 |
|
1990 EXAM,bn_mp_mul_2.c |
|
1991 |
|
1992 This implementation is essentially an optimized implementation of s\_mp\_add for the case of doubling an input. The only noteworthy difference |
|
1993 is the use of the logical shift operator on line @52,<<@ to perform a single precision doubling. |
|
1994 |
|
1995 \subsection{Division by Two} |
|
1996 A division by two can just as easily be accomplished with a logical shift right as multiplication by two can be with a logical shift left. |
|
1997 |
|
1998 \newpage\begin{figure}[!here] |
|
1999 \begin{small} |
|
2000 \begin{center} |
|
2001 \begin{tabular}{l} |
|
2002 \hline Algorithm \textbf{mp\_div\_2}. \\ |
|
2003 \textbf{Input}. One mp\_int $a$ \\ |
|
2004 \textbf{Output}. $b = a/2$. \\ |
|
2005 \hline \\ |
|
2006 1. If $b.alloc < a.used$ then grow $b$ to hold $a.used$ digits. (\textit{mp\_grow}) \\ |
|
2007 2. If the reallocation failed return(\textit{MP\_MEM}). \\ |
|
2008 3. $oldused \leftarrow b.used$ \\ |
|
2009 4. $b.used \leftarrow a.used$ \\ |
|
2010 5. $r \leftarrow 0$ \\ |
|
2011 6. for $n$ from $b.used - 1$ to $0$ do \\ |
|
2012 \hspace{3mm}6.1 $rr \leftarrow a_n \mbox{ (mod }2\mbox{)}$\\ |
|
2013 \hspace{3mm}6.2 $b_n \leftarrow (a_n >> 1) + (r << (lg(\beta) - 1)) \mbox{ (mod }\beta\mbox{)}$ \\ |
|
2014 \hspace{3mm}6.3 $r \leftarrow rr$ \\ |
|
2015 7. If $b.used < oldused - 1$ then do \\ |
|
2016 \hspace{3mm}7.1 for $n$ from $b.used$ to $oldused - 1$ do \\ |
|
2017 \hspace{6mm}7.1.1 $b_n \leftarrow 0$ \\ |
|
2018 8. $b.sign \leftarrow a.sign$ \\ |
|
2019 9. Clamp excess digits of $b$. (\textit{mp\_clamp}) \\ |
|
2020 10. Return(\textit{MP\_OKAY}).\\ |
|
2021 \hline |
|
2022 \end{tabular} |
|
2023 \end{center} |
|
2024 \end{small} |
|
2025 \caption{Algorithm mp\_div\_2} |
|
2026 \end{figure} |
|
2027 |
|
2028 \textbf{Algorithm mp\_div\_2.} |
|
2029 This algorithm will divide an mp\_int by two using logical shifts to the right. Like mp\_mul\_2 it uses a modified low level addition |
|
2030 core as the basis of the algorithm. Unlike mp\_mul\_2 the shift operations work from the leading digit to the trailing digit. The algorithm |
|
2031 could be written to work from the trailing digit to the leading digit; however, it would have to stop one short of $a.used - 1$ digits to prevent
|
2032 reading past the end of the array of digits. |
|
2033 |
|
2034 Essentially the loop at step 6 is similar to that of mp\_mul\_2 except the logical shifts go in the opposite direction and the carry is at the |
|
2035 least significant bit not the most significant bit. |
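
A minimal C sketch of the halving loop follows, again assuming 28-bit digits in 32-bit words. DIGIT\_BIT and toy\_div\_2 are hypothetical names used only here, and the trailing while loop is a stand-in for mp\_clamp.

\begin{verbatim}
#include <assert.h>
#include <stdint.h>

#define DIGIT_BIT 28   /* assumed lg(beta) */

/* b = floor(a / 2) where a has `used` digits, least significant first.
 * b must have room for `used` digits; b may be the same array as a.
 * Returns the used count of the result. */
static int toy_div_2(const uint32_t *a, int used, uint32_t *b)
{
   uint32_t r = 0, rr;
   int      n;

   assert(used >= 0);
   for (n = used - 1; n >= 0; n--) {
      rr   = a[n] & 1;                              /* bit shifted out */
      b[n] = (a[n] >> 1) | (r << (DIGIT_BIT - 1));  /* carry enters on top */
      r    = rr;                                    /* forward the carry */
   }
   /* stand-in for mp_clamp: drop leading zero digits */
   while (used > 0 && b[used - 1] == 0) {
      used--;
   }
   return used;
}
\end{verbatim}

The loop index runs downwards because the carry out of digit $n$ must enter digit $n - 1$.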
|
2036 |
|
2037 EXAM,bn_mp_div_2.c |
|
2038 |
|
2039 \section{Polynomial Basis Operations} |
|
2040 Recall from ~POLY~ that any integer can be represented as a polynomial in $x$ as $y = f(\beta)$. Such a representation is also known as |
|
2041 the polynomial basis \cite[pp. 48]{ROSE}. Given such a notation a multiplication or division by $x$ amounts to shifting whole digits a single |
|
2042 place. The need for such operations arises in several other higher level algorithms such as Barrett and Montgomery reduction, integer |
|
2043 division and Karatsuba multiplication. |
|
2044 |
|
2045 Converting from an array of digits to polynomial basis is very simple. Consider the integer $y \equiv (a_2, a_1, a_0)_{\beta}$ and recall that |
|
2046 $y = \sum_{i=0}^{2} a_i \beta^i$. Simply replace $\beta$ with $x$ and the expression is in polynomial basis. For example, $f(x) = 8x + 9$ is the |
|
2047 polynomial basis representation for $89$ using radix ten. That is, $f(10) = 8(10) + 9 = 89$. |
|
2048 |
|
2049 \subsection{Multiplication by $x$} |
|
2050 |
|
2051 Given a polynomial in $x$ such as $f(x) = a_n x^n + a_{n-1} x^{n-1} + \ldots + a_0$ multiplying by $x$ amounts to shifting the coefficients up one

2052 degree. In this case $f(x) \cdot x = a_n x^{n+1} + a_{n-1} x^n + \ldots + a_0 x$. From a scalar basis point of view multiplying by $x$ is equivalent to
|
2053 multiplying by the integer $\beta$. |
|
2054 |
|
2055 \newpage\begin{figure}[!here] |
|
2056 \begin{small} |
|
2057 \begin{center} |
|
2058 \begin{tabular}{l} |
|
2059 \hline Algorithm \textbf{mp\_lshd}. \\ |
|
2060 \textbf{Input}. One mp\_int $a$ and an integer $b$ \\ |
|
2061 \textbf{Output}. $a \leftarrow a \cdot \beta^b$ (equivalent to multiplication by $x^b$). \\ |
|
2062 \hline \\ |
|
2063 1. If $b \le 0$ then return(\textit{MP\_OKAY}). \\ |
|
2064 2. If $a.alloc < a.used + b$ then grow $a$ to at least $a.used + b$ digits. (\textit{mp\_grow}). \\ |
|
2065 3. If the reallocation failed return(\textit{MP\_MEM}). \\ |
|
2066 4. $a.used \leftarrow a.used + b$ \\ |
|
2067 5. $i \leftarrow a.used - 1$ \\ |
|
2068 6. $j \leftarrow a.used - 1 - b$ \\ |
|
2069 7. for $n$ from $a.used - 1$ to $b$ do \\ |
|
2070 \hspace{3mm}7.1 $a_{i} \leftarrow a_{j}$ \\ |
|
2071 \hspace{3mm}7.2 $i \leftarrow i - 1$ \\ |
|
2072 \hspace{3mm}7.3 $j \leftarrow j - 1$ \\ |
|
2073 8. for $n$ from 0 to $b - 1$ do \\ |
|
2074 \hspace{3mm}8.1 $a_n \leftarrow 0$ \\ |
|
2075 9. Return(\textit{MP\_OKAY}). \\ |
|
2076 \hline |
|
2077 \end{tabular} |
|
2078 \end{center} |
|
2079 \end{small} |
|
2080 \caption{Algorithm mp\_lshd} |
|
2081 \end{figure} |
|
2082 |
|
2083 \textbf{Algorithm mp\_lshd.} |
|
2084 This algorithm multiplies an mp\_int by the $b$'th power of $x$. This is equivalent to multiplying by $\beta^b$. The algorithm differs |
|
2085 from the other algorithms presented so far as it performs the operation in place instead of storing the result in a separate location. The
|
2086 motivation behind this change is due to the way this function is typically used. Algorithms such as mp\_add store the result in an optionally |
|
2087 different third mp\_int because the original inputs are often still required. Algorithm mp\_lshd (\textit{and similarly algorithm mp\_rshd}) is |
|
2088 typically used on values where the original value is no longer required. The algorithm will return success immediately if |
|
2089 $b \le 0$ since the rest of the algorithm is only valid when $b > 0$.
|
2090 |
|
2091 First the destination $a$ is grown as required to accommodate the result. The counters $i$ and $j$ are used to form a \textit{sliding window} over
|
2092 the digits of $a$ of length $b$. The head of the sliding window is at $i$ (\textit{the leading digit}) and the tail at $j$ (\textit{the trailing digit}). |
|
2093 The loop on step 7 copies the digit from the tail to the head. In each iteration the window is moved down one digit. The last loop on |
|
2094 step 8 sets the lower $b$ digits to zero. |
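
The sliding window can be sketched in C on a plain array of digits. The function name toy\_lshd is hypothetical, the digit type is assumed to be a 32-bit word, and the caller is assumed to have reserved $used + b$ digits beforehand (the role of mp\_grow in the algorithm).

\begin{verbatim}
#include <assert.h>
#include <stdint.h>

/* a = a * beta^b in place, digits stored least significant first.
 * a must have room for used + b digits.  Returns the new used count. */
static int toy_lshd(uint32_t *a, int used, int b)
{
   int i, j, n;

   assert(used >= 0);
   if (b <= 0) {
      return used;            /* nothing to do */
   }
   used += b;
   i = used - 1;              /* head of the window (leading digit)  */
   j = used - 1 - b;          /* tail of the window (trailing digit) */
   for (n = used - 1; n >= b; n--) {
      a[i--] = a[j--];        /* copy tail to head, slide down one */
   }
   for (n = 0; n < b; n++) {
      a[n] = 0;               /* zero the lower b digits */
   }
   return used;
}
\end{verbatim}

Copying from the top down means no digit is overwritten before it has been moved.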
|
2095 |
|
2096 \newpage |
|
2097 FIGU,sliding_window,Sliding Window Movement |
|
2098 |
|
2099 EXAM,bn_mp_lshd.c |
|
2100 |
|
2101 The if statement on line @24,if@ ensures that the $b$ variable is greater than zero. The \textbf{used} count is incremented by $b$ before |
|
2102 the copy loop begins. This eliminates the need for an additional variable in the for loop. The variable $top$ on line @42,top@ is an alias
|
2103 for the leading digit while $bottom$ on line @45,bottom@ is an alias for the trailing edge. The aliases form a window of exactly $b$ digits |
|
2104 over the input. |
|
2105 |
|
2106 \subsection{Division by $x$} |
|
2107 |
|
2108 Division by powers of $x$ is easily achieved by shifting the digits right and removing any that will end up to the right of the zero'th digit. |
|
2109 |
|
2110 \newpage\begin{figure}[!here] |
|
2111 \begin{small} |
|
2112 \begin{center} |
|
2113 \begin{tabular}{l} |
|
2114 \hline Algorithm \textbf{mp\_rshd}. \\ |
|
2115 \textbf{Input}. One mp\_int $a$ and an integer $b$ \\ |
|
2116 \textbf{Output}. $a \leftarrow a / \beta^b$ (Divide by $x^b$). \\ |
|
2117 \hline \\ |
|
2118 1. If $b \le 0$ then return. \\ |
|
2119 2. If $a.used \le b$ then do \\ |
|
2120 \hspace{3mm}2.1 Zero $a$. (\textit{mp\_zero}). \\ |
|
2121 \hspace{3mm}2.2 Return. \\ |
|
2122 3. $i \leftarrow 0$ \\ |
|
2123 4. $j \leftarrow b$ \\ |
|
2124 5. for $n$ from 0 to $a.used - b - 1$ do \\ |
|
2125 \hspace{3mm}5.1 $a_i \leftarrow a_j$ \\ |
|
2126 \hspace{3mm}5.2 $i \leftarrow i + 1$ \\ |
|
2127 \hspace{3mm}5.3 $j \leftarrow j + 1$ \\ |
|
2128 6. for $n$ from $a.used - b$ to $a.used - 1$ do \\ |
|
2129 \hspace{3mm}6.1 $a_n \leftarrow 0$ \\ |
|
2130 7. $a.used \leftarrow a.used - b$ \\ |
|
2131 8. Return. \\ |
|
2132 \hline |
|
2133 \end{tabular} |
|
2134 \end{center} |
|
2135 \end{small} |
|
2136 \caption{Algorithm mp\_rshd} |
|
2137 \end{figure} |
|
2138 |
|
2139 \textbf{Algorithm mp\_rshd.} |
|
2140 This algorithm divides the input in place by the $b$'th power of $x$. It is analogous to dividing by $\beta^b$ but much quicker since
|
2141 it does not require single precision division. This algorithm does not actually return an error code as it cannot fail. |
|
2142 |
|
2143 If the input $b$ is less than one the algorithm quickly returns without performing any work. If the \textbf{used} count is less than or equal |
|
2144 to the shift count $b$ then it will simply zero the input and return. |
|
2145 |
|
2146 After the trivial cases of inputs have been handled the sliding window is set up. Much like the case of algorithm mp\_lshd a sliding window that
|
2147 is $b$ digits wide is used to copy the digits. Unlike mp\_lshd the window slides in the opposite direction from the trailing to the leading digit. |
|
2148 Also the digits are copied from the leading to the trailing edge. |
|
2149 |
|
2150 Once the window copy is complete the upper digits must be zeroed and the \textbf{used} count decremented. |
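
The same idea in the other direction can be sketched as follows. As before, toy\_rshd is a hypothetical name and the digit type is assumed to be a 32-bit word. The sketch returns the new \textbf{used} count only because it works on a bare array instead of an mp\_int.

\begin{verbatim}
#include <assert.h>
#include <stdint.h>

/* a = floor(a / beta^b) in place, digits least significant first.
 * Returns the new used count. */
static int toy_rshd(uint32_t *a, int used, int b)
{
   int i, j, n;

   assert(used >= 0);
   if (b <= 0) {
      return used;                 /* nothing to do */
   }
   if (used <= b) {
      for (n = 0; n < used; n++) {
         a[n] = 0;                 /* every digit is shifted out */
      }
      return 0;
   }
   i = 0;                          /* trailing edge of the window */
   j = b;                          /* leading edge of the window  */
   for (n = 0; n < used - b; n++) {
      a[i++] = a[j++];             /* copy leading to trailing, slide up */
   }
   for (n = used - b; n < used; n++) {
      a[n] = 0;                    /* zero the vacated upper digits */
   }
   return used - b;
}
\end{verbatim}

Here the copy runs from the bottom up, again so that no digit is overwritten before it has been read.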
|
2151 |
|
2152 EXAM,bn_mp_rshd.c |
|
2153 |
|
2154 The only noteworthy element of this routine is the lack of a return type. |
|
2155 |
|
2156 -- Will update later to give it a return type...Tom |
|
2157 |
|
2158 \section{Powers of Two} |
|
2159 |
|
2160 Now that algorithms for moving single bits as well as whole digits exist algorithms for moving the ``in between'' distances are required. For |
|
2161 example, to quickly multiply by $2^k$ for any $k$ without using a full multiplier algorithm would prove useful. Instead of performing single |
|
2162 shifts $k$ times to achieve a multiplication by $2^{\pm k}$ a mixture of whole digit shifting and partial digit shifting is employed. |
|
2163 |
|
2164 \subsection{Multiplication by Power of Two} |
|
2165 |
|
2166 \newpage\begin{figure}[!here] |
|
2167 \begin{small} |
|
2168 \begin{center} |
|
2169 \begin{tabular}{l} |
|
2170 \hline Algorithm \textbf{mp\_mul\_2d}. \\ |
|
2171 \textbf{Input}. One mp\_int $a$ and an integer $b$ \\ |
|
2172 \textbf{Output}. $c \leftarrow a \cdot 2^b$. \\ |
|
2173 \hline \\ |
|
2174 1. $c \leftarrow a$. (\textit{mp\_copy}) \\ |
|
2175 2. If $c.alloc < c.used + \lfloor b / lg(\beta) \rfloor + 2$ then grow $c$ accordingly. \\ |
|
2176 3. If the reallocation failed return(\textit{MP\_MEM}). \\ |
|
2177 4. If $b \ge lg(\beta)$ then \\ |
|
2178 \hspace{3mm}4.1 $c \leftarrow c \cdot \beta^{\lfloor b / lg(\beta) \rfloor}$ (\textit{mp\_lshd}). \\ |
|
2179 \hspace{3mm}4.2 If step 4.1 failed return(\textit{MP\_MEM}). \\ |
|
2180 5. $d \leftarrow b \mbox{ (mod }lg(\beta)\mbox{)}$ \\ |
|
2181 6. If $d \ne 0$ then do \\ |
|
2182 \hspace{3mm}6.1 $mask \leftarrow 2^d$ \\ |
|
2183 \hspace{3mm}6.2 $r \leftarrow 0$ \\ |
|
2184 \hspace{3mm}6.3 for $n$ from $0$ to $c.used - 1$ do \\ |
|
2185 \hspace{6mm}6.3.1 $rr \leftarrow c_n >> (lg(\beta) - d) \mbox{ (mod }mask\mbox{)}$ \\ |
|
2186 \hspace{6mm}6.3.2 $c_n \leftarrow (c_n << d) + r \mbox{ (mod }\beta\mbox{)}$ \\ |
|
2187 \hspace{6mm}6.3.3 $r \leftarrow rr$ \\ |
|
2188 \hspace{3mm}6.4 If $r > 0$ then do \\ |
|
2189 \hspace{6mm}6.4.1 $c_{c.used} \leftarrow r$ \\ |
|
2190 \hspace{6mm}6.4.2 $c.used \leftarrow c.used + 1$ \\ |
|
2191 7. Return(\textit{MP\_OKAY}). \\ |
|
2192 \hline |
|
2193 \end{tabular} |
|
2194 \end{center} |
|
2195 \end{small} |
|
2196 \caption{Algorithm mp\_mul\_2d} |
|
2197 \end{figure} |
|
2198 |
|
2199 \textbf{Algorithm mp\_mul\_2d.} |
|
2200 This algorithm multiplies $a$ by $2^b$ and stores the result in $c$. The algorithm uses algorithm mp\_lshd and a derivative of algorithm mp\_mul\_2 to |
|
2201 quickly compute the product. |
|
2202 |
|
2203 First the algorithm will multiply $a$ by $x^{\lfloor b / lg(\beta) \rfloor}$ which will ensure that the remaining multiplier is less than

2204 $\beta$. For example, if $b = 37$ and $\beta = 2^{28}$ then this step will multiply by $x$, leaving a multiplication by $2^{37 - 28} = 2^{9}$

2205 still to perform.
|
2206 |
|
2207 After the digits have been shifted appropriately at most $lg(\beta) - 1$ shifts are left to perform. Step 5 calculates the number of remaining shifts |
|
2208 required. If it is non-zero a modified shift loop is used to calculate the remaining product. |
|
2209 Essentially the loop is a generic version of algorithm mp\_mul\_2 designed to handle any shift count in the range $1 \le d < lg(\beta)$. The $mask$
|
2210 variable is used to extract the upper $d$ bits to form the carry for the next iteration. |
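
The two phases can be sketched together in C. The sketch assumes 28-bit digits in 32-bit words and a non-zero input. The names DIGIT\_BIT, DIGIT\_MASK and toy\_mul\_2d are invented for illustration. Note that the variable mask here holds $2^d - 1$, so that a bitwise AND performs the reduction modulo $2^d$ that the algorithm figure writes as ``mod $mask$''.

\begin{verbatim}
#include <assert.h>
#include <stdint.h>

#define DIGIT_BIT  28                                  /* assumed lg(beta) */
#define DIGIT_MASK ((((uint32_t)1) << DIGIT_BIT) - 1)  /* beta - 1 */

/* c = c * 2^b in place, digits least significant first.
 * c must have room for used + b / DIGIT_BIT + 1 digits.
 * Returns the new used count. */
static int toy_mul_2d(uint32_t *c, int used, int b)
{
   int n, q, d;

   assert(used >= 1 && b >= 0);
   q = b / DIGIT_BIT;              /* whole digits to shift */
   d = b % DIGIT_BIT;              /* bits left over        */

   if (q > 0) {                    /* phase 1: multiply by x^q */
      for (n = used - 1; n >= 0; n--) c[n + q] = c[n];
      for (n = 0; n < q; n++)         c[n] = 0;
      used += q;
   }
   if (d != 0) {                   /* phase 2: multiply by 2^d */
      uint32_t mask = (((uint32_t)1) << d) - 1;
      uint32_t r = 0, rr;
      for (n = 0; n < used; n++) {
         rr   = (c[n] >> (DIGIT_BIT - d)) & mask;  /* upper d bits */
         /* bits pushed past the word are discarded; only the low
          * DIGIT_BIT bits are kept anyway */
         c[n] = ((c[n] << d) | r) & DIGIT_MASK;
         r    = rr;
      }
      if (r != 0) c[used++] = r;   /* final carry */
   }
   return used;
}
\end{verbatim}

With $b = 37$ the first phase shifts by one digit and the second by nine bits, matching the example above.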
|
2211 |
|
2212 This algorithm is loosely measured as an $O(2n)$ algorithm, which means that if the input has $n$ digits it takes $2n$ ``time'' to

2213 complete. It is possible to optimize this algorithm down to an $O(n)$ algorithm at a cost of making the algorithm slightly harder to follow.
|
2214 |
|
2215 EXAM,bn_mp_mul_2d.c |
|
2216 |
|
2217 Notes to be revised when code is updated. -- Tom |
|
2218 |
|
2219 \subsection{Division by Power of Two} |
|
2220 |
|
2221 \newpage\begin{figure}[!here] |
|
2222 \begin{small} |
|
2223 \begin{center} |
|
2224 \begin{tabular}{l} |
|
2225 \hline Algorithm \textbf{mp\_div\_2d}. \\ |
|
2226 \textbf{Input}. One mp\_int $a$ and an integer $b$ \\ |
|
2227 \textbf{Output}. $c \leftarrow \lfloor a / 2^b \rfloor, d \leftarrow a \mbox{ (mod }2^b\mbox{)}$. \\ |
|
2228 \hline \\ |
|
2229 1. If $b \le 0$ then do \\ |
|
2230 \hspace{3mm}1.1 $c \leftarrow a$ (\textit{mp\_copy}) \\ |
|
2231 \hspace{3mm}1.2 $d \leftarrow 0$ (\textit{mp\_zero}) \\ |
|
2232 \hspace{3mm}1.3 Return(\textit{MP\_OKAY}). \\ |
|
2233 2. $c \leftarrow a$ \\ |
|
2234 3. $d \leftarrow a \mbox{ (mod }2^b\mbox{)}$ (\textit{mp\_mod\_2d}) \\ |
|
2235 4. If $b \ge lg(\beta)$ then do \\ |
|
2236 \hspace{3mm}4.1 $c \leftarrow \lfloor c/\beta^{\lfloor b/lg(\beta) \rfloor} \rfloor$ (\textit{mp\_rshd}). \\ |
|
2237 5. $k \leftarrow b \mbox{ (mod }lg(\beta)\mbox{)}$ \\ |
|
2238 6. If $k \ne 0$ then do \\ |
|
2239 \hspace{3mm}6.1 $mask \leftarrow 2^k$ \\ |
|
2240 \hspace{3mm}6.2 $r \leftarrow 0$ \\ |
|
2241 \hspace{3mm}6.3 for $n$ from $c.used - 1$ to $0$ do \\ |
|
2242 \hspace{6mm}6.3.1 $rr \leftarrow c_n \mbox{ (mod }mask\mbox{)}$ \\ |
|
2243 \hspace{6mm}6.3.2 $c_n \leftarrow (c_n >> k) + (r << (lg(\beta) - k))$ \\ |
|
2244 \hspace{6mm}6.3.3 $r \leftarrow rr$ \\ |
|
2245 7. Clamp excess digits of $c$. (\textit{mp\_clamp}) \\ |
|
2246 8. Return(\textit{MP\_OKAY}). \\ |
|
2247 \hline |
|
2248 \end{tabular} |
|
2249 \end{center} |
|
2250 \end{small} |
|
2251 \caption{Algorithm mp\_div\_2d} |
|
2252 \end{figure} |
|
2253 |
|
2254 \textbf{Algorithm mp\_div\_2d.} |
|
2255 This algorithm will divide an input $a$ by $2^b$ and produce the quotient and remainder. The algorithm is designed much like algorithm |
|
2256 mp\_mul\_2d by first using whole digit shifts then single precision shifts. This algorithm will also produce the remainder of the division |
|
2257 by using algorithm mp\_mod\_2d. |
|
2258 |
|
2259 EXAM,bn_mp_div_2d.c |
|
2260 |
|
2261 The implementation of algorithm mp\_div\_2d is slightly different than the algorithm specifies. The remainder $d$ may be optionally |
|
2262 ignored by passing \textbf{NULL} as the pointer to the mp\_int variable. The temporary mp\_int variable $t$ is used to hold the |
|
2263 result of the remainder operation until the end. This allows $d$ and $a$ to represent the same mp\_int without modifying $a$ before |
|
2264 the quotient is obtained. |
|
2265 |
|
2266 The remainder of the source code is essentially the same as the source code for mp\_mul\_2d. (-- Fix this paragraph up later, Tom). |
|
2267 |
|
2268 \subsection{Remainder of Division by Power of Two} |
|
2269 |
|
2270 The last algorithm in the series of polynomial basis power of two algorithms is calculating the remainder of division by $2^b$. This |
|
2271 algorithm benefits from the fact that in two's complement arithmetic $a \mbox{ (mod }2^b\mbox{)}$ is the same as $a$ AND $2^b - 1$.
|
2272 |
|
2273 \begin{figure}[!here] |
|
2274 \begin{small} |
|
2275 \begin{center} |
|
2276 \begin{tabular}{l} |
|
2277 \hline Algorithm \textbf{mp\_mod\_2d}. \\ |
|
2278 \textbf{Input}. One mp\_int $a$ and an integer $b$ \\ |
|
2279 \textbf{Output}. $c \leftarrow a \mbox{ (mod }2^b\mbox{)}$. \\ |
|
2280 \hline \\ |
|
2281 1. If $b \le 0$ then do \\ |
|
2282 \hspace{3mm}1.1 $c \leftarrow 0$ (\textit{mp\_zero}) \\ |
|
2283 \hspace{3mm}1.2 Return(\textit{MP\_OKAY}). \\ |
|
2284 2. If $b > a.used \cdot lg(\beta)$ then do \\ |
|
2285 \hspace{3mm}2.1 $c \leftarrow a$ (\textit{mp\_copy}) \\ |
|
2286 \hspace{3mm}2.2 Return the result of step 2.1. \\ |
|
2287 3. $c \leftarrow a$ \\ |
|
2288 4. If step 3 failed return(\textit{MP\_MEM}). \\ |
|
2289 5. for $n$ from $\lceil b / lg(\beta) \rceil$ to $c.used$ do \\ |
|
2290 \hspace{3mm}5.1 $c_n \leftarrow 0$ \\ |
|
2291 6. $k \leftarrow b \mbox{ (mod }lg(\beta)\mbox{)}$ \\ |
|
2292 7. $c_{\lfloor b / lg(\beta) \rfloor} \leftarrow c_{\lfloor b / lg(\beta) \rfloor} \mbox{ (mod }2^{k}\mbox{)}$. \\ |
|
2293 8. Clamp excess digits of $c$. (\textit{mp\_clamp}) \\ |
|
2294 9. Return(\textit{MP\_OKAY}). \\ |
|
2295 \hline |
|
2296 \end{tabular} |
|
2297 \end{center} |
|
2298 \end{small} |
|
2299 \caption{Algorithm mp\_mod\_2d} |
|
2300 \end{figure} |
|
2301 |
|
2302 \textbf{Algorithm mp\_mod\_2d.} |
|
2303 This algorithm will quickly calculate the value of $a \mbox{ (mod }2^b\mbox{)}$. First if $b$ is less than or equal to zero the |
|
2304 result is set to zero. If $b$ is greater than the number of bits in $a$ then it simply copies $a$ to $c$ and returns. Otherwise, $a$ |
|
2305 is copied to $c$, leading digits are removed and the remaining leading digit is trimmed to the exact bit count.
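
A C sketch of the masking step follows, assuming 28-bit digits in 32-bit words and operating in place on a copy of $a$. The names DIGIT\_BIT and toy\_mod\_2d are hypothetical. One deliberate deviation from the algorithm figure is labelled in the code: the early exit uses $\ge$ instead of $>$ so that the sketch never touches a digit beyond the \textbf{used} count.

\begin{verbatim}
#include <assert.h>
#include <stdint.h>

#define DIGIT_BIT 28   /* assumed lg(beta) */

/* c = c mod 2^b in place, digits least significant first.
 * Returns the new used count. */
static int toy_mod_2d(uint32_t *c, int used, int b)
{
   int n, k;

   assert(used >= 0);
   if (b <= 0) {                    /* anything mod 1 is zero */
      for (n = 0; n < used; n++) c[n] = 0;
      return 0;
   }
   /* deviation from the figure: >= rather than >, see lead-in */
   if (b >= used * DIGIT_BIT) {
      return used;                  /* modulus exceeds the input */
   }
   k = b % DIGIT_BIT;
   /* zero every digit above the one holding bit b - 1 */
   for (n = b / DIGIT_BIT + (k != 0); n < used; n++) c[n] = 0;
   /* keep only the low k bits of the partial digit: AND with 2^k - 1 */
   c[b / DIGIT_BIT] &= (((uint32_t)1) << k) - 1;
   /* stand-in for mp_clamp: drop leading zero digits */
   while (used > 0 && c[used - 1] == 0) used--;
   return used;
}
\end{verbatim}

No division is performed at any point; the remainder is obtained purely by zeroing digits and masking one digit.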
|
2306 |
|
2307 EXAM,bn_mp_mod_2d.c |
|
2308 |
|
2309 -- Add comments later, Tom. |
|
2310 |
|
2311 \section*{Exercises} |
|
2312 \begin{tabular}{cl} |
|
2313 $\left [ 3 \right ] $ & Devise an algorithm that performs $a \cdot 2^b$ for generic values of $b$ \\ |
|
2314 & in $O(n)$ time. \\ |
|
2315 &\\ |
|
2316 $\left [ 3 \right ] $ & Devise an efficient algorithm to multiply by small low Hamming \\

2317 & weight values such as $3$, $5$ and $9$. Extend it to handle all values \\

2318 & up to $64$ with a Hamming weight less than three. \\
|
2319 &\\ |
|
2320 $\left [ 2 \right ] $ & Modify the preceding algorithm to handle values of the form \\ |
|
2321 & $2^k - 1$ as well. \\ |
|
2322 &\\ |
|
2323 $\left [ 3 \right ] $ & Using only algorithms mp\_mul\_2, mp\_div\_2 and mp\_add create an \\ |
|
2324 & algorithm to multiply two integers in roughly $O(2n^2)$ time for \\ |
|
2325 & any $n$-bit input. Note that the time of addition is ignored in the \\ |
|
2326 & calculation. \\ |
|
2327 & \\ |
|
2328 $\left [ 5 \right ] $ & Improve the previous algorithm to have a working time of at most \\ |
|
2329 & $O \left (2^{(k-1)}n + \left ({2n^2 \over k} \right ) \right )$ for an appropriate choice of $k$. Again ignore \\ |
|
2330 & the cost of addition. \\ |
|
2331 & \\ |
|
2332 $\left [ 2 \right ] $ & Devise a chart to find optimal values of $k$ for the previous problem \\ |
|
2333 & for $n = 64 \ldots 1024$ in steps of $64$. \\ |
|
2334 & \\ |
|
2335 $\left [ 2 \right ] $ & Using only algorithms mp\_abs and mp\_sub devise another method for \\ |
|
2336 & calculating the result of a signed comparison. \\ |
|
2337 & |
|
2338 \end{tabular} |
|
2339 |
|
2340 \chapter{Multiplication and Squaring} |
|
2341 \section{The Multipliers} |
|
2342 For most number theoretic problems including certain public key cryptographic algorithms, the ``multipliers'' form the most important subset of |
|
2343 algorithms of any multiple precision integer package. The set of multiplier algorithms includes integer multiplication, squaring and modular reduction
|
2344 where in each of the algorithms single precision multiplication is the dominant operation performed. This chapter will discuss integer multiplication |
|
2345 and squaring, leaving modular reductions for the subsequent chapter. |
|
2346 |
|
2347 The importance of the multiplier algorithms is for the most part driven by the fact that certain popular public key algorithms are based on modular |
|
2348 exponentiation, that is computing $d \equiv a^b \mbox{ (mod }c\mbox{)}$ for some arbitrary choice of $a$, $b$, $c$ and $d$. During a modular |
|
2349 exponentiation the majority\footnote{Roughly speaking a modular exponentiation will spend about 40\% of the time performing modular reductions, |
|
2350 35\% of the time performing squaring and 25\% of the time performing multiplications.} of the processor time is spent performing single precision |
|
2351 multiplications. |
|
2352 |
|
2353 For centuries general purpose multiplication has required a lengthy $O(n^2)$ process, whereby each digit of one multiplicand has to be multiplied
|
2354 against every digit of the other multiplicand. Traditional long-hand multiplication is based on this process; while the techniques can differ the |
|
2355 overall algorithm used is essentially the same. Only ``recently'' have faster algorithms been studied. First Karatsuba multiplication was discovered in |
|
2356 1962. This algorithm can multiply two numbers with considerably fewer single precision multiplications when compared to the long-hand approach. |
|
2357 This technique led to the discovery of polynomial basis algorithms (\textit{good reference?}) and subsequently Fourier Transform based solutions.
|
2358 |
|
2359 \section{Multiplication} |
|
2360 \subsection{The Baseline Multiplication} |
|
2361 \label{sec:basemult} |
|
2362 \index{baseline multiplication} |
|
2363 Computing the product of two integers in software can be achieved using a trivial adaptation of the standard $O(n^2)$ long-hand multiplication |
|
2364 algorithm that school children are taught. The algorithm is considered an $O(n^2)$ algorithm since for two $n$-digit inputs $n^2$ single precision |
|
2365 multiplications are required. More specifically for a $m$ and $n$ digit input $m \cdot n$ single precision multiplications are required. To |
|
2366 simplify most discussions, it will be assumed that the inputs have comparable number of digits. |
|
2367 |
|
2368 The ``baseline multiplication'' algorithm is designed to act as the ``catch-all'' algorithm, only to be used when the faster algorithms cannot be |
|
2369 used. This algorithm does not use any particularly interesting optimizations and should ideally be avoided if possible. One important

2370 facet of this algorithm is that it has been modified to produce only a specified number of output digits. The importance of this
|
2371 modification will become evident during the discussion of Barrett modular reduction. Recall that for a $n$ and $m$ digit input the product |
|
2372 will be at most $n + m$ digits. Therefore, this algorithm can be reduced to a full multiplier by having it produce $n + m$ digits of the product. |
|
2373 |
|
2374 Recall from ~GAMMA~ the definition of $\gamma$ as the number of bits in the type \textbf{mp\_digit}. We shall now extend the variable set to |
|
2375 include $\alpha$ which shall represent the number of bits in the type \textbf{mp\_word}. This implies that $2^{\alpha} > 2 \cdot \beta^2$. The |
|
2376 constant $\delta = 2^{\alpha - 2lg(\beta)}$ will represent the maximal weight of any column in a product (\textit{see ~COMBA~ for more information}). |
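
As a worked instance, assume for illustration only that a digit holds $28$ bits and that the type \textbf{mp\_word} holds $64$ bits, that is $lg(\beta) = 28$ and $\alpha = 64$; neither value is fixed by the definitions above. Then

\[
2 \cdot \beta^2 = 2^{57} < 2^{64} = 2^{\alpha}
\qquad \mbox{and} \qquad
\delta = 2^{\alpha - 2lg(\beta)} = 2^{64 - 56} = 2^{8} = 256 .
\]

Every single precision product is less than $\beta^2 = 2^{56}$, so under these assumptions $\delta = 256$ of them can be summed in one \textbf{mp\_word} without reaching $2^{\alpha}$.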
|
2377 |
|
2378 \newpage\begin{figure}[!here] |
|
2379 \begin{small} |
|
2380 \begin{center} |
|
2381 \begin{tabular}{l} |
|
2382 \hline Algorithm \textbf{s\_mp\_mul\_digs}. \\ |
|
2383 \textbf{Input}. mp\_int $a$, mp\_int $b$ and an integer $digs$ \\ |
|
2384 \textbf{Output}. $c \leftarrow \vert a \vert \cdot \vert b \vert \mbox{ (mod }\beta^{digs}\mbox{)}$. \\ |
|
2385 \hline \\ |
|
2386 1. If min$(a.used, b.used) < \delta$ then do \\ |
|
2387 \hspace{3mm}1.1 Calculate $c = \vert a \vert \cdot \vert b \vert$ by the Comba method (\textit{see algorithm~\ref{fig:COMBAMULT}}). \\ |
|
2388 \hspace{3mm}1.2 Return the result of step 1.1 \\ |
|
2389 \\ |
|
2390 Allocate and initialize a temporary mp\_int. \\ |
|
2391 2. Init $t$ to be of size $digs$ \\ |
|
2392 3. If step 2 failed return(\textit{MP\_MEM}). \\ |
|
2393 4. $t.used \leftarrow digs$ \\ |
|
2394 \\ |
|
2395 Compute the product. \\ |
|
2396 5. for $ix$ from $0$ to $a.used - 1$ do \\ |
|
2397 \hspace{3mm}5.1 $u \leftarrow 0$ \\ |
|
2398 \hspace{3mm}5.2 $pb \leftarrow \mbox{min}(b.used, digs - ix)$ \\ |
|
2399 \hspace{3mm}5.3 If $pb < 1$ then goto step 6. \\ |
|
2400 \hspace{3mm}5.4 for $iy$ from $0$ to $pb - 1$ do \\ |
|
2401 \hspace{6mm}5.4.1 $\hat r \leftarrow t_{iy + ix} + a_{ix} \cdot b_{iy} + u$ \\ |
|
2402 \hspace{6mm}5.4.2 $t_{iy + ix} \leftarrow \hat r \mbox{ (mod }\beta\mbox{)}$ \\ |
|
2403 \hspace{6mm}5.4.3 $u \leftarrow \lfloor \hat r / \beta \rfloor$ \\ |
|
2404 \hspace{3mm}5.5 if $ix + pb < digs$ then do \\ |
|
2405 \hspace{6mm}5.5.1 $t_{ix + pb} \leftarrow u$ \\ |
|
2406 6. Clamp excess digits of $t$. \\ |
|
2407 7. Swap $c$ with $t$ \\ |
|
2408 8. Clear $t$ \\ |
|
2409 9. Return(\textit{MP\_OKAY}). \\ |
|
2410 \hline |
|
2411 \end{tabular} |
|
2412 \end{center} |
|
2413 \end{small} |
|
2414 \caption{Algorithm s\_mp\_mul\_digs} |
|
2415 \end{figure} |
|
2416 |
|
2417 \textbf{Algorithm s\_mp\_mul\_digs.} |
|
2418 This algorithm computes the unsigned product of two inputs $a$ and $b$, limited to an output precision of $digs$ digits. While it may seem |
|
2419 a bit awkward to modify the function from its simple $O(n^2)$ description, the usefulness of partial multipliers will arise in a subsequent |
|
2420 algorithm. The algorithm is loosely based on algorithm 14.12 from \cite[pp. 595]{HAC} and is similar to Algorithm M of Knuth \cite[pp. 268]{TAOCPV2}. |
|
2421 Algorithm s\_mp\_mul\_digs differs from these cited references since it can produce a variable output precision regardless of the precision of the |
|
2422 inputs. |
|
2423 |
|
2424 The first thing this algorithm checks for is whether a Comba multiplier can be used instead. If the minimum digit count of either |
|
2425 input is less than $\delta$, then the Comba method may be used instead. After the Comba method is ruled out, the baseline algorithm begins. A |
|
2426 temporary mp\_int variable $t$ is used to hold the intermediate result of the product. This allows the algorithm to be used to |
|
2427 compute products when either $a = c$ or $b = c$ without overwriting the inputs. |
|
2428 |
|
All of step 5 is the infamous $O(n^2)$ multiplication loop slightly modified to only produce up to $digs$ digits of output. The $pb$ variable
is given the count of digits to read from $b$ inside the nested loop. If $pb < 1$ then no more output digits can be produced and the algorithm
will exit the loop. The best way to think of the loops is as a series of $pb \times 1$ multiplications. That is, in each complete run of the
innermost loop $a_{ix}$ is multiplied against $b$ and the result is added (\textit{with an appropriate shift}) to $t$.
|
2433 |
|
2434 For example, consider multiplying $576$ by $241$. That is equivalent to computing $10^0(1)(576) + 10^1(4)(576) + 10^2(2)(576)$ which is best |
|
2435 visualized in the following table. |
|
2436 |
|
2437 \begin{figure}[here] |
|
2438 \begin{center} |
|
2439 \begin{tabular}{|c|c|c|c|c|c|l|} |
|
2440 \hline && & 5 & 7 & 6 & \\ |
|
2441 \hline $\times$&& & 2 & 4 & 1 & \\ |
|
2442 \hline &&&&&&\\ |
|
2443 && & 5 & 7 & 6 & $10^0(1)(576)$ \\ |
|
2444 &2 & 3 & 6 & 1 & 6 & $10^1(4)(576) + 10^0(1)(576)$ \\ |
|
2445 1 & 3 & 8 & 8 & 1 & 6 & $10^2(2)(576) + 10^1(4)(576) + 10^0(1)(576)$ \\ |
|
2446 \hline |
|
2447 \end{tabular} |
|
2448 \end{center} |
|
2449 \caption{Long-Hand Multiplication Diagram} |
|
2450 \end{figure} |
|
2451 |
|
Each row of the product is added to the result after being shifted to the left (\textit{multiplied by a power of the radix}) by the appropriate
count. That is, in pass $ix$ of the outer loop the product is added starting at the $ix$'th digit of the result.
|
2454 |
|
2455 Step 5.4.1 introduces the hat symbol (\textit{e.g. $\hat r$}) which represents a double precision variable. The multiplication on that step |
|
2456 is assumed to be a double wide output single precision multiplication. That is, two single precision variables are multiplied to produce a |
|
2457 double precision result. The step is somewhat optimized from a long-hand multiplication algorithm because the carry from the addition in step |
|
2458 5.4.1 is propagated through the nested loop. If the carry was not propagated immediately it would overflow the single precision digit |
|
2459 $t_{ix+iy}$ and the result would be lost. |
|
2460 |
|
2461 At step 5.5 the nested loop is finished and any carry that was left over should be forwarded. The carry does not have to be added to the $ix+pb$'th |
|
2462 digit since that digit is assumed to be zero at this point. However, if $ix + pb \ge digs$ the carry is not set as it would make the result |
|
2463 exceed the precision requested. |
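
Steps 5.1 through 5.5 can be sketched on plain digit arrays. The following is a minimal illustration and not the LibTomMath source: the names \texttt{digit\_t}, \texttt{word\_t} and \texttt{mul\_digs\_sketch} are hypothetical, the digits are assumed to be stored least significant first, and $\beta$ is passed as a parameter so that the $576 \cdot 241$ example can be traced in radix $10$.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types for illustration only. */
typedef uint32_t digit_t;   /* single precision digit */
typedef uint64_t word_t;    /* double precision, used for the "hat" variables */

/* t <- a * b (mod beta^digs), following steps 5.1 to 5.5.
 * a, b and t are little-endian digit arrays (every digit < beta),
 * t has room for digs digits and must not overlap a or b. */
void mul_digs_sketch(const digit_t *a, size_t a_used,
                     const digit_t *b, size_t b_used,
                     digit_t *t, size_t digs, word_t beta)
{
    size_t ix, iy, pb;
    word_t u, r;

    /* beta <= 2^32 guarantees that r below cannot overflow 64 bits */
    assert(beta >= 2 && beta <= ((word_t)1 << 32));

    for (ix = 0; ix < digs; ix++) {
        t[ix] = 0;
    }

    /* stopping at ix == digs plays the role of the "pb < 1" exit (step 5.3) */
    for (ix = 0; ix < a_used && ix < digs; ix++) {
        u  = 0;                                           /* step 5.1 */
        pb = (b_used < digs - ix) ? b_used : digs - ix;   /* step 5.2 */

        for (iy = 0; iy < pb; iy++) {                     /* step 5.4 */
            r = (word_t)t[ix + iy] + (word_t)a[ix] * (word_t)b[iy] + u;
            t[ix + iy] = (digit_t)(r % beta);             /* step 5.4.2 */
            u = r / beta;                                 /* step 5.4.3 */
        }
        if (ix + pb < digs) {                             /* step 5.5 */
            t[ix + pb] = (digit_t)u;
        }
    }
}
```

With $a = \left < 6, 7, 5 \right >$ and $b = \left < 1, 4, 2 \right >$ (\textit{least significant digit first}), $\beta = 10$ and $digs = 6$ the array $t$ receives the digits of $138816$. With $digs = 3$ only the low digits $816$ are produced and the higher carries are discarded.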
|
2464 |
|
2465 EXAM,bn_s_mp_mul_digs.c |
|
2466 |
|
2467 Lines @31,if@ to @35,}@ determine if the Comba method can be used first. The conditions for using the Comba routine are that min$(a.used, b.used) < \delta$ and |
|
2468 the number of digits of output is less than \textbf{MP\_WARRAY}. This new constant is used to control |
|
2469 the stack usage in the Comba routines. By default it is set to $\delta$ but can be reduced when memory is at a premium. |
|
2470 |
|
2471 Of particular importance is the calculation of the $ix+iy$'th column on lines @64,mp_word@, @65,mp_word@ and @66,mp_word@. Note how all of the |
|
2472 variables are cast to the type \textbf{mp\_word}, which is also the type of variable $\hat r$. That is to ensure that double precision operations |
|
are used instead of single precision. The multiplication on line @65,) * (@ makes use of a specific GCC optimizer behaviour. At the outset it looks like
the compiler will have to use a double precision multiplication to produce the result required. Such an operation would be horribly slow on most
processors and drag this routine to a crawl. However, GCC is smart enough to realize that double wide output single precision multipliers can be used. For
|
2476 example, the instruction ``MUL'' on the x86 processor can multiply two 32-bit values and produce a 64-bit result. |
|
2477 |
|
2478 \subsection{Faster Multiplication by the ``Comba'' Method} |
|
2479 MARK,COMBA |
|
2480 |
|
2481 One of the huge drawbacks of the ``baseline'' algorithms is that at the $O(n^2)$ level the carry must be computed and propagated upwards. This |
|
makes the nested loop very sequential and hard to unroll and implement in parallel. The ``Comba'' \cite{COMBA} method is named after the little-known
(\textit{in cryptographic venues}) Paul G. Comba, who described a method of implementing fast multipliers that do not require nested
|
2484 carry fixup operations. As an interesting aside it seems that Paul Barrett describes a similar technique in |
|
2485 his 1986 paper \cite{BARRETT} written five years before. |
|
2486 |
|
2487 At the heart of the Comba technique is once again the long-hand algorithm. Except in this case a slight twist is placed on how |
|
2488 the columns of the result are produced. In the standard long-hand algorithm rows of products are produced then added together to form the |
|
2489 final result. In the baseline algorithm the columns are added together after each iteration to get the result instantaneously. |
|
2490 |
|
In the Comba algorithm the columns of the result are produced entirely independently of each other. That is, at the $O(n^2)$ level a
simple multiplication and addition step is performed. The carries of the columns are propagated after the nested loop to reduce the amount
of work required. Succinctly, the first step of the algorithm is to compute the product vector $\vec x$ as follows.
|
2494 |
|
2495 \begin{equation} |
|
2496 \vec x_n = \sum_{i+j = n} a_ib_j, \forall n \in \lbrace 0, 1, 2, \ldots, i + j \rbrace |
|
2497 \end{equation} |
|
2498 |
|
Where $\vec x_n$ is the $n$'th column of the output vector. Consider the following example which computes the vector $\vec x$ for the multiplication
|
2500 of $576$ and $241$. |
|
2501 |
|
2502 \newpage\begin{figure}[here] |
|
2503 \begin{small} |
|
2504 \begin{center} |
|
2505 \begin{tabular}{|c|c|c|c|c|c|} |
|
2506 \hline & & 5 & 7 & 6 & First Input\\ |
|
2507 \hline $\times$ & & 2 & 4 & 1 & Second Input\\ |
|
2508 \hline & & $1 \cdot 5 = 5$ & $1 \cdot 7 = 7$ & $1 \cdot 6 = 6$ & First pass \\ |
|
2509 & $4 \cdot 5 = 20$ & $4 \cdot 7+5=33$ & $4 \cdot 6+7=31$ & 6 & Second pass \\ |
|
2510 $2 \cdot 5 = 10$ & $2 \cdot 7 + 20 = 34$ & $2 \cdot 6+33=45$ & 31 & 6 & Third pass \\ |
|
2511 \hline 10 & 34 & 45 & 31 & 6 & Final Result \\ |
|
2512 \hline |
|
2513 \end{tabular} |
|
2514 \end{center} |
|
2515 \end{small} |
|
2516 \caption{Comba Multiplication Diagram} |
|
2517 \end{figure} |
|
2518 |
|
At this point the vector $\vec x = \left < 10, 34, 45, 31, 6 \right >$ is the result of the first step of the Comba multiplier.
Now the columns must be fixed by propagating the carry upwards. The resultant vector will have one extra dimension over the input vector which is
equivalent to adding a leading zero digit.
|
2522 |
|
2523 \begin{figure}[!here] |
|
2524 \begin{small} |
|
2525 \begin{center} |
|
2526 \begin{tabular}{l} |
|
2527 \hline Algorithm \textbf{Comba Fixup}. \\ |
|
2528 \textbf{Input}. Vector $\vec x$ of dimension $k$ \\ |
|
2529 \textbf{Output}. Vector $\vec x$ such that the carries have been propagated. \\ |
|
2530 \hline \\ |
|
2531 1. for $n$ from $0$ to $k - 1$ do \\ |
|
2532 \hspace{3mm}1.1 $\vec x_{n+1} \leftarrow \vec x_{n+1} + \lfloor \vec x_{n}/\beta \rfloor$ \\ |
|
2533 \hspace{3mm}1.2 $\vec x_{n} \leftarrow \vec x_{n} \mbox{ (mod }\beta\mbox{)}$ \\ |
|
2534 2. Return($\vec x$). \\ |
|
2535 \hline |
|
2536 \end{tabular} |
|
2537 \end{center} |
|
2538 \end{small} |
|
2539 \caption{Algorithm Comba Fixup} |
|
2540 \end{figure} |
|
2541 |
|
With that algorithm and $k = 5$ and $\beta = 10$ the following vector is produced $\vec x= \left < 1, 3, 8, 8, 1, 6 \right >$. In this case
$241 \cdot 576$ is in fact $138816$ and the procedure succeeded. If the algorithm is correct and, as will be demonstrated shortly, more
efficient than the baseline algorithm, why not simply always use this algorithm?
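
Before answering that question, the two steps described so far (\textit{column accumulation followed by the Comba Fixup}) can be sketched as follows. This is a minimal illustration under stated assumptions: the names \texttt{comba\_columns} and \texttt{comba\_fixup} are hypothetical, digits are stored least significant first, and the accumulator type is assumed to be wide enough to hold every column sum.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types for illustration only. */
typedef uint32_t digit_t;   /* single precision digit */
typedef uint64_t word_t;    /* double precision column accumulator */

/* Step one: w[n] <- sum of a[i] * b[j] over all i + j == n.
 * No carries are propagated here. w must hold a_used + b_used entries,
 * which leaves one extra leading entry for the fixup. */
void comba_columns(const digit_t *a, size_t a_used,
                   const digit_t *b, size_t b_used, word_t *w)
{
    size_t i, j;

    for (i = 0; i < a_used + b_used; i++) {
        w[i] = 0;
    }
    for (i = 0; i < a_used; i++) {
        for (j = 0; j < b_used; j++) {
            w[i + j] += (word_t)a[i] * (word_t)b[j];
        }
    }
}

/* Step two: algorithm Comba Fixup on a vector of dimension k.
 * w must hold k + 1 entries; the last one receives the final carry. */
void comba_fixup(word_t *w, size_t k, word_t beta)
{
    size_t n;

    assert(beta >= 2);
    for (n = 0; n < k; n++) {
        w[n + 1] += w[n] / beta;   /* step 1.1 */
        w[n] %= beta;              /* step 1.2 */
    }
}
```

For the inputs $576$ and $241$ with $\beta = 10$ the first routine produces the columns $6, 31, 45, 34, 10$ (\textit{least significant first}) and the fixup with $k = 5$ turns them into the digits of $138816$.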
|
2545 |
|
2546 \subsubsection{Column Weight.} |
|
2547 At the nested $O(n^2)$ level the Comba method adds the product of two single precision variables to each column of the output |
|
independently. A serious obstacle arises if the carry is lost, due to a lack of precision, before the algorithm has a chance to fix
the carries. For example, in the multiplication of two three-digit numbers the third column of output will be the sum of
three single precision multiplications. If the precision of the accumulator for the output digits is less than $3 \cdot (\beta - 1)^2$ then
|
2551 an overflow can occur and the carry information will be lost. For any $m$ and $n$ digit inputs the maximum weight of any column is |
|
2552 min$(m, n)$ which is fairly obvious. |
|
2553 |
|
2554 The maximum number of terms in any column of a product is known as the ``column weight'' and strictly governs when the algorithm can be used. Recall |
|
2555 from earlier that a double precision type has $\alpha$ bits of resolution and a single precision digit has $lg(\beta)$ bits of precision. Given these |
|
2556 two quantities we must not violate the following |
|
2557 |
|
2558 \begin{equation} |
|
2559 k \cdot \left (\beta - 1 \right )^2 < 2^{\alpha} |
|
2560 \end{equation} |
|
2561 |
|
2562 Which reduces to |
|
2563 |
|
2564 \begin{equation} |
|
2565 k \cdot \left ( \beta^2 - 2\beta + 1 \right ) < 2^{\alpha} |
|
2566 \end{equation} |
|
2567 |
|
2568 Let $\rho = lg(\beta)$ represent the number of bits in a single precision digit. By further re-arrangement of the equation the final solution is |
|
2569 found. |
|
2570 |
|
2571 \begin{equation} |
|
2572 k < {{2^{\alpha}} \over {\left (2^{2\rho} - 2^{\rho + 1} + 1 \right )}} |
|
2573 \end{equation} |
|
2574 |
|
The defaults for LibTomMath are $\beta = 2^{28}$ and $\alpha = 64$ which means that $k$ is bounded by $k < 257$. In this configuration
the smaller input may not have more than $256$ digits if the Comba method is to be used. This is quite satisfactory for most applications since
$256$ digits would allow for numbers in the range of $0 \le x < 2^{7168}$ which is much larger than most public key cryptographic algorithms require.
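
The bound can be checked numerically. The sketch below assumes a $64$-bit double precision type and a digit of $\rho$ bits; the function name \texttt{comba\_max\_column\_weight} is hypothetical. Since $2^{64}$ does not fit in a $64$-bit variable, the largest $k$ with $k \cdot (\beta - 1)^2 < 2^{64}$ is computed as $\lfloor (2^{64} - 1) / (\beta - 1)^2 \rfloor$, which is the same integer.

```c
#include <assert.h>
#include <stdint.h>

/* Largest column weight k with k * (beta - 1)^2 < 2^64, where beta = 2^rho.
 * Assumes alpha = 64, i.e. a 64-bit double precision accumulator. */
uint64_t comba_max_column_weight(unsigned rho)
{
    uint64_t m;

    assert(rho >= 1 && rho <= 32);
    m = ((uint64_t)1 << rho) - 1;    /* beta - 1 */
    return UINT64_MAX / (m * m);     /* floor((2^64 - 1) / (beta - 1)^2) */
}
```

For $\rho = 28$ the function returns $256$, in agreement with the bound $k < 257$. Widening the digit to $\rho = 31$ bits would leave room for only four terms per column.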
|
2578 |
|
2579 \newpage\begin{figure}[!here] |
|
2580 \begin{small} |
|
2581 \begin{center} |
|
2582 \begin{tabular}{l} |
|
2583 \hline Algorithm \textbf{fast\_s\_mp\_mul\_digs}. \\ |
|
2584 \textbf{Input}. mp\_int $a$, mp\_int $b$ and an integer $digs$ \\ |
|
2585 \textbf{Output}. $c \leftarrow \vert a \vert \cdot \vert b \vert \mbox{ (mod }\beta^{digs}\mbox{)}$. \\ |
|
2586 \hline \\ |
|
2587 Place an array of \textbf{MP\_WARRAY} double precision digits named $\hat W$ on the stack. \\ |
|
2588 1. If $c.alloc < digs$ then grow $c$ to $digs$ digits. (\textit{mp\_grow}) \\ |
|
2589 2. If step 1 failed return(\textit{MP\_MEM}).\\ |
|
2590 \\ |
|
2591 Zero the temporary array $\hat W$. \\ |
|
2592 3. for $n$ from $0$ to $digs - 1$ do \\ |
|
2593 \hspace{3mm}3.1 $\hat W_n \leftarrow 0$ \\ |
|
2594 \\ |
|
2595 Compute the columns. \\ |
|
2596 4. for $ix$ from $0$ to $a.used - 1$ do \\ |
|
2597 \hspace{3mm}4.1 $pb \leftarrow \mbox{min}(b.used, digs - ix)$ \\ |
|
2598 \hspace{3mm}4.2 If $pb < 1$ then goto step 5. \\ |
|
2599 \hspace{3mm}4.3 for $iy$ from $0$ to $pb - 1$ do \\ |
|
2600 \hspace{6mm}4.3.1 $\hat W_{ix+iy} \leftarrow \hat W_{ix+iy} + a_{ix}b_{iy}$ \\ |
|
2601 \\ |
|
2602 Propagate the carries upwards. \\ |
|
2603 5. $oldused \leftarrow c.used$ \\ |
|
2604 6. $c.used \leftarrow digs$ \\ |
|
2605 7. If $digs > 1$ then do \\ |
|
2606 \hspace{3mm}7.1. for $ix$ from $1$ to $digs - 1$ do \\ |
|
2607 \hspace{6mm}7.1.1 $\hat W_{ix} \leftarrow \hat W_{ix} + \lfloor \hat W_{ix-1} / \beta \rfloor$ \\ |
|
2608 \hspace{6mm}7.1.2 $c_{ix - 1} \leftarrow \hat W_{ix - 1} \mbox{ (mod }\beta\mbox{)}$ \\ |
|
2609 8. else do \\ |
|
2610 \hspace{3mm}8.1 $ix \leftarrow 0$ \\ |
|
2611 9. $c_{ix} \leftarrow \hat W_{ix} \mbox{ (mod }\beta\mbox{)}$ \\ |
|
2612 \\ |
|
2613 Zero excess digits. \\ |
|
2614 10. If $digs < oldused$ then do \\ |
|
2615 \hspace{3mm}10.1 for $n$ from $digs$ to $oldused - 1$ do \\ |
|
2616 \hspace{6mm}10.1.1 $c_n \leftarrow 0$ \\ |
|
2617 11. Clamp excessive digits of $c$. (\textit{mp\_clamp}) \\ |
|
2618 12. Return(\textit{MP\_OKAY}). \\ |
|
2619 \hline |
|
2620 \end{tabular} |
|
2621 \end{center} |
|
2622 \end{small} |
|
2623 \caption{Algorithm fast\_s\_mp\_mul\_digs} |
|
2624 \label{fig:COMBAMULT} |
|
2625 \end{figure} |
|
2626 |
|
2627 \textbf{Algorithm fast\_s\_mp\_mul\_digs.} |
|
2628 This algorithm performs the unsigned multiplication of $a$ and $b$ using the Comba method limited to $digs$ digits of precision. The algorithm |
|
essentially performs the same calculation as algorithm s\_mp\_mul\_digs, just much faster.
|
2630 |
|
2631 The array $\hat W$ is meant to be on the stack when the algorithm is used. The size of the array does not change which is ideal. Note also that |
|
2632 unlike algorithm s\_mp\_mul\_digs no temporary mp\_int is required since the result is calculated directly in $\hat W$. |
|
2633 |
|
2634 The $O(n^2)$ loop on step four is where the Comba method's advantages begin to show through in comparison to the baseline algorithm. The lack of |
|
2635 a carry variable or propagation in this loop allows the loop to be performed with only single precision multiplication and additions. Now that each |
|
2636 iteration of the inner loop can be performed independent of the others the inner loop can be performed with a high level of parallelism. |
|
2637 |
|
2638 To measure the benefits of the Comba method over the baseline method consider the number of operations that are required. If the |
|
2639 cost in terms of time of a multiply and addition is $p$ and the cost of a carry propagation is $q$ then a baseline multiplication would require |
|
$O \left ((p + q)n^2 \right )$ time to multiply two $n$-digit numbers. The Comba method requires only $O(pn^2 + qn)$ time; in practice, however,
the speed increase is much greater. With $O(n)$ space the algorithm can be reduced to $O(pn + qn)$ time by implementing the $n$ multiply
|
2642 and addition operations in the nested loop in parallel. |
|
2643 |
|
2644 EXAM,bn_fast_s_mp_mul_digs.c |
|
2645 |
|
2646 The memset on line @47,memset@ clears the initial $\hat W$ array to zero in a single step. Like the slower baseline multiplication |
|
2647 implementation a series of aliases (\textit{lines @67, tmpx@, @70, tmpy@ and @75,_W@}) are used to simplify the inner $O(n^2)$ loop. |
|
2648 In this case a new alias $\_\hat W$ has been added which refers to the double precision columns offset by $ix$ in each pass. |
|
2649 |
|
2650 The inner loop on lines @83,for@, @84,mp_word@ and @85,}@ is where the algorithm will spend the majority of the time, which is why it has been |
|
2651 stripped to the bones of any extra baggage\footnote{Hence the pointer aliases.}. On x86 processors the multiplication and additions amount to at the |
|
2652 very least five instructions (\textit{two loads, two additions, one multiply}) while on the ARMv4 processors they amount to only three |
|
(\textit{one load, one store, one multiply-add}). For both of the x86 and ARMv4 processors the GCC compiler does a good job of unrolling the loop
|
2654 and scheduling the instructions so there are very few dependency stalls. |
|
2655 |
|
In theory the difference between the baseline and Comba algorithms is a mere $O(qn)$ time difference. However, in the $O(n^2)$ nested loop of the
|
2657 baseline method there are dependency stalls as the algorithm must wait for the multiplier to finish before propagating the carry to the next |
|
2658 digit. As a result fewer of the often multiple execution units\footnote{The AMD Athlon has three execution units and the Intel P4 has four.} can |
|
2659 be simultaneously used. |
|
2660 |
|
2661 \subsection{Polynomial Basis Multiplication} |
|
2662 To break the $O(n^2)$ barrier in multiplication requires a completely different look at integer multiplication. In the following algorithms |
|
2663 the use of polynomial basis representation for two integers $a$ and $b$ as $f(x) = \sum_{i=0}^{n} a_i x^i$ and |
|
2664 $g(x) = \sum_{i=0}^{n} b_i x^i$ respectively, is required. In this system both $f(x)$ and $g(x)$ have $n + 1$ terms and are of the $n$'th degree. |
|
2665 |
|
2666 The product $a \cdot b \equiv f(x)g(x)$ is the polynomial $W(x) = \sum_{i=0}^{2n} w_i x^i$. The coefficients $w_i$ will |
|
2667 directly yield the desired product when $\beta$ is substituted for $x$. The direct solution to solve for the $2n + 1$ coefficients |
|
2668 requires $O(n^2)$ time and would in practice be slower than the Comba technique. |
|
2669 |
|
2670 However, numerical analysis theory indicates that only $2n + 1$ distinct points in $W(x)$ are required to determine the values of the $2n + 1$ unknown |
|
2671 coefficients. This means by finding $\zeta_y = W(y)$ for $2n + 1$ small values of $y$ the coefficients of $W(x)$ can be found with |
|
Gaussian elimination. This technique is also occasionally referred to as the \textit{interpolation technique} since in
|
2673 effect an interpolation based on $2n + 1$ points will yield a polynomial equivalent to $W(x)$. |
|
2674 |
|
2675 The coefficients of the polynomial $W(x)$ are unknown which makes finding $W(y)$ for any value of $y$ impossible. However, since |
|
2676 $W(x) = f(x)g(x)$ the equivalent $\zeta_y = f(y) g(y)$ can be used in its place. The benefit of this technique stems from the |
|
2677 fact that $f(y)$ and $g(y)$ are much smaller than either $a$ or $b$ respectively. As a result finding the $2n + 1$ relations required |
|
2678 by multiplying $f(y)g(y)$ involves multiplying integers that are much smaller than either of the inputs. |
|
2679 |
|
2680 When picking points to gather relations there are always three obvious points to choose, $y = 0, 1$ and $ \infty$. The $\zeta_0$ term |
|
2681 is simply the product $W(0) = w_0 = a_0 \cdot b_0$. The $\zeta_1$ term is the product |
|
2682 $W(1) = \left (\sum_{i = 0}^{n} a_i \right ) \left (\sum_{i = 0}^{n} b_i \right )$. The third point $\zeta_{\infty}$ is less obvious but rather |
|
2683 simple to explain. The $2n + 1$'th coefficient of $W(x)$ is numerically equivalent to the most significant column in an integer multiplication. |
|
2684 The point at $\infty$ is used symbolically to represent the most significant column, that is $W(\infty) = w_{2n} = a_nb_n$. Note that the |
|
2685 points at $y = 0$ and $\infty$ yield the coefficients $w_0$ and $w_{2n}$ directly. |
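
As a small worked illustration, take the earlier example $576 \cdot 241$ with $\beta = 10$ and $n = 2$, so that $f(x) = 5x^2 + 7x + 6$ and $g(x) = 2x^2 + 4x + 1$. The three obvious points are then

\begin{eqnarray}
\zeta_0 &=& 6 \cdot 1 = 6 = w_0 \nonumber \\
\zeta_1 &=& (5 + 7 + 6)(2 + 4 + 1) = 18 \cdot 7 = 126 = w_4 + w_3 + w_2 + w_1 + w_0 \nonumber \\
\zeta_{\infty} &=& 5 \cdot 2 = 10 = w_4 \nonumber
\end{eqnarray}

These values agree with the product polynomial $W(x) = 10x^4 + 34x^3 + 45x^2 + 31x + 6$, whose coefficients are the Comba columns $\left < 10, 34, 45, 31, 6 \right >$ computed earlier and sum to $126$. Substituting $\beta$ for $x$ gives $W(10) = 100000 + 34000 + 4500 + 310 + 6 = 138816$.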
|
2686 |
|
2687 If more points are required they should be of small values and powers of two such as $2^q$ and the related \textit{mirror points} |
|
2688 $\left (2^q \right )^{2n} \cdot \zeta_{2^{-q}}$ for small values of $q$. The term ``mirror point'' stems from the fact that |
|
2689 $\left (2^q \right )^{2n} \cdot \zeta_{2^{-q}}$ can be calculated in the exact opposite fashion as $\zeta_{2^q}$. For |
|
example, when $n = 2$ and $q = 1$ the following two equations are equivalent to the point $\zeta_{2}$ and its mirror.
|
2691 |
|
2692 \begin{eqnarray} |
|
2693 \zeta_{2} = f(2)g(2) = (4a_2 + 2a_1 + a_0)(4b_2 + 2b_1 + b_0) \nonumber \\ |
|
2694 16 \cdot \zeta_{1 \over 2} = 4f({1\over 2}) \cdot 4g({1 \over 2}) = (a_2 + 2a_1 + 4a_0)(b_2 + 2b_1 + 4b_0) |
|
2695 \end{eqnarray} |
|
2696 |
|
2697 Using such points will allow the values of $f(y)$ and $g(y)$ to be independently calculated using only left shifts. For example, when $n = 2$ the |
|
2698 polynomial $f(2^q)$ is equal to $2^q((2^qa_2) + a_1) + a_0$. This technique of polynomial representation is known as Horner's method. |
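
The shift-only evaluation can be sketched as follows. This is an illustration on machine words rather than on mp\_int values: the function names are hypothetical, the coefficients are stored with $a_0$ first, and the caller is assumed to choose inputs small enough that the $64$-bit accumulator does not overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* f(2^q) for f(x) = coef[terms-1] * x^(terms-1) + ... + coef[0],
 * evaluated by Horner's method using left shifts and additions only. */
uint64_t eval_at_pow2(const uint32_t *coef, size_t terms, unsigned q)
{
    uint64_t acc = 0;
    size_t i;

    assert(q < 64);
    for (i = terms; i-- > 0; ) {      /* from the highest coefficient down */
        acc = (acc << q) + coef[i];
    }
    return acc;
}

/* The mirror point (2^q)^n * f(2^-q) with n = terms - 1: the same loop
 * with the coefficients visited in the opposite order. */
uint64_t eval_mirror_at_pow2(const uint32_t *coef, size_t terms, unsigned q)
{
    uint64_t acc = 0;
    size_t i;

    assert(q < 64);
    for (i = 0; i < terms; i++) {     /* from the lowest coefficient up */
        acc = (acc << q) + coef[i];
    }
    return acc;
}
```

For the coefficients $a_0 = 3$, $a_1 = 5$, $a_2 = 7$ and $q = 1$ the first function yields $4 \cdot 7 + 2 \cdot 5 + 3 = 41$ and the second yields $7 + 2 \cdot 5 + 4 \cdot 3 = 29$.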
|
2699 |
|
2700 As a general rule of the algorithm when the inputs are split into $n$ parts each there are $2n - 1$ multiplications. Each multiplication is of |
|
2701 multiplicands that have $n$ times fewer digits than the inputs. The asymptotic running time of this algorithm is |
|
2702 $O \left ( k^{lg_n(2n - 1)} \right )$ for $k$ digit inputs (\textit{assuming they have the same number of digits}). Figure~\ref{fig:exponent} |
|
2703 summarizes the exponents for various values of $n$. |
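
The exponent can be motivated by the usual divide-and-conquer recurrence. As a sketch, assume that evaluating the points and solving the system of equations costs $O(k)$ work, and let $T(k)$ denote the time required to multiply two $k$-digit inputs (\textit{the symbol $T$ is introduced here only for illustration}). Splitting into $n$ parts gives

\begin{equation}
T(k) = (2n - 1) \cdot T \left ( {k \over n} \right ) + O(k)
\end{equation}

The recursion is $lg_n(k)$ levels deep and therefore performs $(2n - 1)^{lg_n(k)} = k^{lg_n(2n - 1)}$ multiplications at the lowest level, which dominates the running time. For $n = 2$ this is $k^{lg(3)}$, roughly $k^{1.585}$.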
|
2704 |
|
2705 \begin{figure} |
|
2706 \begin{center} |
|
2707 \begin{tabular}{|c|c|c|} |
|
2708 \hline \textbf{Split into $n$ Parts} & \textbf{Exponent} & \textbf{Notes}\\ |
|
2709 \hline $2$ & $1.584962501$ & This is Karatsuba Multiplication. \\ |
|
2710 \hline $3$ & $1.464973520$ & This is Toom-Cook Multiplication. \\ |
|
2711 \hline $4$ & $1.403677461$ &\\ |
|
2712 \hline $5$ & $1.365212389$ &\\ |
|
2713 \hline $10$ & $1.278753601$ &\\ |
|
2714 \hline $100$ & $1.149426538$ &\\ |
|
2715 \hline $1000$ & $1.100270931$ &\\ |
|
2716 \hline $10000$ & $1.075252070$ &\\ |
|
2717 \hline |
|
2718 \end{tabular} |
|
2719 \end{center} |
|
2720 \caption{Asymptotic Running Time of Polynomial Basis Multiplication} |
|
2721 \label{fig:exponent} |
|
2722 \end{figure} |
|
2723 |
|
2724 At first it may seem like a good idea to choose $n = 1000$ since the exponent is approximately $1.1$. However, the overhead |
|
of solving for the 1999 terms of $W(x)$ will certainly consume any savings the algorithm could offer for all but exceedingly large
|
2726 numbers. |
|
2727 |
|
2728 \subsubsection{Cutoff Point} |
|
2729 The polynomial basis multiplication algorithms all require fewer single precision multiplications than a straight Comba approach. However, |
|
2730 the algorithms incur an overhead (\textit{at the $O(n)$ work level}) since they require a system of equations to be solved. This makes the |
|
2731 polynomial basis approach more costly to use with small inputs. |
|
2732 |
|
2733 Let $m$ represent the number of digits in the multiplicands (\textit{assume both multiplicands have the same number of digits}). There exists a |
|
2734 point $y$ such that when $m < y$ the polynomial basis algorithms are more costly than Comba, when $m = y$ they are roughly the same cost and |
|
2735 when $m > y$ the Comba methods are slower than the polynomial basis algorithms. |
|
2736 |
|
2737 The exact location of $y$ depends on several key architectural elements of the computer platform in question. |
|
2738 |
|
2739 \begin{enumerate} |
|
2740 \item The ratio of clock cycles for single precision multiplication versus other simpler operations such as addition, shifting, etc. For example |
|
2741 on the AMD Athlon the ratio is roughly $17 : 1$ while on the Intel P4 it is $29 : 1$. The higher the ratio in favour of multiplication the lower |
|
2742 the cutoff point $y$ will be. |
|
2743 |
|
\item The complexity of the linear system of equations (\textit{for the coefficients of $W(x)$}). Generally speaking as the number of splits
grows the complexity grows substantially. Ideally solving the system will only involve addition, subtraction and shifting of integers. This
directly reflects on the ratio previously mentioned.
|
2747 |
|
2748 \item To a lesser extent memory bandwidth and function call overheads. Provided the values are in the processor cache this is less of an |
|
2749 influence over the cutoff point. |
|
2750 |
|
2751 \end{enumerate} |
|
2752 |
|
2753 A clean cutoff point separation occurs when a point $y$ is found such that all of the cutoff point conditions are met. For example, if the point |
|
2754 is too low then there will be values of $m$ such that $m > y$ and the Comba method is still faster. Finding the cutoff points is fairly simple when |
|
2755 a high resolution timer is available. |
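
Once the cutoff points have been measured, choosing a multiplier reduces to a pair of comparisons. The sketch below is only an illustration of that idea: the type \texttt{mul\_method}, the function \texttt{choose\_multiplier} and its parameters are hypothetical names, and the cutoff values are tuning parameters supplied by the caller rather than constants of the library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for illustration only. */
typedef enum { MUL_COMBA, MUL_KARATSUBA, MUL_TOOM } mul_method;

/* Pick a multiplier for two m-digit inputs given measured cutoff points.
 * Below a cutoff the simpler algorithm is assumed to be faster. */
mul_method choose_multiplier(size_t m, size_t kara_cutoff, size_t toom_cutoff)
{
    assert(kara_cutoff <= toom_cutoff);

    if (m >= toom_cutoff) {
        return MUL_TOOM;
    }
    if (m >= kara_cutoff) {
        return MUL_KARATSUBA;
    }
    return MUL_COMBA;
}
```

For instance, with a Karatsuba cutoff of $70$ digits and an arbitrarily chosen Toom-Cook cutoff of $350$ digits, a $50$-digit product would use the Comba method and a $100$-digit product would use Karatsuba.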
|
2756 |
|
2757 \subsection{Karatsuba Multiplication} |
|
2758 Karatsuba \cite{KARA} multiplication when originally proposed in 1962 was among the first set of algorithms to break the $O(n^2)$ barrier for |
|
2759 general purpose multiplication. Given two polynomial basis representations $f(x) = ax + b$ and $g(x) = cx + d$, Karatsuba proved with |
|
2760 light algebra \cite{KARAP} that the following polynomial is equivalent to multiplication of the two integers the polynomials represent. |
|
2761 |
|
2762 \begin{equation} |
|
f(x) \cdot g(x) = acx^2 + ((ac + bd) - (a - b)(c - d))x + bd
|
2764 \end{equation} |
|
2765 |
|
2766 Using the observation that $ac$ and $bd$ could be re-used only three half sized multiplications would be required to produce the product. Applying |
|
2767 this algorithm recursively, the work factor becomes $O(n^{lg(3)})$ which is substantially better than the work factor $O(n^2)$ of the Comba technique. It turns |
|
2768 out what Karatsuba did not know or at least did not publish was that this is simply polynomial basis multiplication with the points |
|
2769 $\zeta_0$, $\zeta_{\infty}$ and $-\zeta_{-1}$. Consider the resultant system of equations. |
|
2770 |
|
2771 \begin{center} |
|
2772 \begin{tabular}{rcrcrcrc} |
|
2773 $\zeta_{0}$ & $=$ & & & & & $w_0$ \\ |
|
2774 $-\zeta_{-1}$ & $=$ & $-w_2$ & $+$ & $w_1$ & $-$ & $w_0$ \\ |
|
2775 $\zeta_{\infty}$ & $=$ & $w_2$ & & & & \\ |
|
2776 \end{tabular} |
|
2777 \end{center} |
|
2778 |
|
2779 By adding the first and last equation to the equation in the middle the term $w_1$ can be isolated and all three coefficients solved for. The simplicity |
|
2780 of this system of equations has made Karatsuba fairly popular. In fact the cutoff point is often fairly low\footnote{With LibTomMath 0.18 it is 70 and 109 digits for the Intel P4 and AMD Athlon respectively.} |
|
2781 making it an ideal algorithm to speed up certain public key cryptosystems such as RSA and Diffie-Hellman. It is worth noting that the point |
|
2782 $\zeta_1$ could be substituted for $-\zeta_{-1}$. In this case the first and third row are subtracted instead of added to the second row. |
|
2783 |
|
2784 \newpage\begin{figure}[!here] |
|
2785 \begin{small} |
|
2786 \begin{center} |
|
2787 \begin{tabular}{l} |
|
2788 \hline Algorithm \textbf{mp\_karatsuba\_mul}. \\ |
|
2789 \textbf{Input}. mp\_int $a$ and mp\_int $b$ \\ |
|
2790 \textbf{Output}. $c \leftarrow \vert a \vert \cdot \vert b \vert$ \\ |
|
2791 \hline \\ |
|
2792 1. Init the following mp\_int variables: $x0$, $x1$, $y0$, $y1$, $t1$, $x0y0$, $x1y1$.\\ |
|
2. If step 1 failed then return(\textit{MP\_MEM}). \\
|
2794 \\ |
|
2795 Split the input. e.g. $a = x1 \cdot \beta^B + x0$ \\ |
|
2796 3. $B \leftarrow \mbox{min}(a.used, b.used)/2$ \\ |
|
2797 4. $x0 \leftarrow a \mbox{ (mod }\beta^B\mbox{)}$ (\textit{mp\_mod\_2d}) \\ |
|
2798 5. $y0 \leftarrow b \mbox{ (mod }\beta^B\mbox{)}$ \\ |
|
2799 6. $x1 \leftarrow \lfloor a / \beta^B \rfloor$ (\textit{mp\_rshd}) \\ |
|
2800 7. $y1 \leftarrow \lfloor b / \beta^B \rfloor$ \\ |
|
2801 \\ |
|
2802 Calculate the three products. \\ |
|
2803 8. $x0y0 \leftarrow x0 \cdot y0$ (\textit{mp\_mul}) \\ |
|
2804 9. $x1y1 \leftarrow x1 \cdot y1$ \\ |
|
2805 10. $t1 \leftarrow x1 - x0$ (\textit{mp\_sub}) \\ |
|
2806 11. $x0 \leftarrow y1 - y0$ \\ |
|
2807 12. $t1 \leftarrow t1 \cdot x0$ \\ |
|
2808 \\ |
|
2809 Calculate the middle term. \\ |
|
2810 13. $x0 \leftarrow x0y0 + x1y1$ \\ |
|
2811 14. $t1 \leftarrow x0 - t1$ \\ |
|
2812 \\ |
|
2813 Calculate the final product. \\ |
|
2814 15. $t1 \leftarrow t1 \cdot \beta^B$ (\textit{mp\_lshd}) \\ |
|
2815 16. $x1y1 \leftarrow x1y1 \cdot \beta^{2B}$ \\ |
|
2816 17. $t1 \leftarrow x0y0 + t1$ \\ |
|
2817 18. $c \leftarrow t1 + x1y1$ \\ |
|
2818 19. Clear all of the temporary variables. \\ |
|
2819 20. Return(\textit{MP\_OKAY}).\\ |
|
2820 \hline |
|
2821 \end{tabular} |
|
2822 \end{center} |
|
2823 \end{small} |
|
2824 \caption{Algorithm mp\_karatsuba\_mul} |
|
2825 \end{figure} |
|
2826 |
|
2827 \textbf{Algorithm mp\_karatsuba\_mul.} |
|
2828 This algorithm computes the unsigned product of two inputs using the Karatsuba multiplication algorithm. It is loosely based on the description |
|
2829 from Knuth \cite[pp. 294-295]{TAOCPV2}. |
|
2830 |
|
2831 \index{radix point} |
|
2832 In order to split the two inputs into their respective halves, a suitable \textit{radix point} must be chosen. The radix point chosen must |
|
2833 be used for both of the inputs meaning that it must be smaller than the smallest input. Step 3 chooses the radix point $B$ as half of the |
|
smallest input \textbf{used} count. After the radix point is chosen the inputs are split into lower and upper halves. Steps 4 and 5
compute the lower halves. Steps 6 and 7 compute the upper halves.
|
2836 |
|
After the halves have been computed the three intermediate half-size products must be computed. Steps 8 and 9 compute the trivial products
$x0 \cdot y0$ and $x1 \cdot y1$. The mp\_int $x0$ is used as a temporary variable after $x1 - x0$ has been computed. By using $x0$ instead
of an additional temporary variable, the algorithm can avoid an additional memory allocation operation.
|
2840 |
|
2841 The remaining steps 13 through 18 compute the Karatsuba polynomial through a variety of digit shifting and addition operations. |
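
A single level of the algorithm can be sketched on machine words. The following is a minimal illustration and not the library routine: the function name \texttt{karatsuba\_mul32} is hypothetical, the inputs are $32$-bit values split at the fixed radix point $2^{16}$, and signed $64$-bit arithmetic stands in for the signed mp\_int subtraction since $(x1 - x0)(y1 - y0)$ may be negative.

```c
#include <assert.h>
#include <stdint.h>

/* One level of Karatsuba on 32-bit inputs split into 16-bit halves:
 * a = x1 * 2^16 + x0 and b = y1 * 2^16 + y0. */
uint64_t karatsuba_mul32(uint32_t a, uint32_t b)
{
    uint32_t x0 = a & 0xFFFFu, x1 = a >> 16;
    uint32_t y0 = b & 0xFFFFu, y1 = b >> 16;
    uint64_t x0y0, x1y1, mid;
    int64_t t1;

    x0y0 = (uint64_t)x0 * y0;                        /* step 8 */
    x1y1 = (uint64_t)x1 * y1;                        /* step 9 */

    /* steps 10 to 12: the third product, which may be negative */
    t1 = ((int64_t)x1 - (int64_t)x0) * ((int64_t)y1 - (int64_t)y0);

    /* steps 13 and 14: x0y0 + x1y1 - t1 equals x1*y0 + x0*y1 */
    mid = (uint64_t)((int64_t)(x0y0 + x1y1) - t1);

    /* steps 15 to 18: shift the terms into place and add them */
    return (x1y1 << 32) + (mid << 16) + x0y0;
}
```

Only three multiplications of $16$-bit quantities are performed. In the multiple precision routine each of them is itself a call to the multiplier, which is where the recursion comes from.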
|
2842 |
|
2843 EXAM,bn_mp_karatsuba_mul.c |
|
2844 |
|
2845 The new coding element in this routine, not seen in previous routines, is the usage of goto statements. The conventional |
|
2846 wisdom is that goto statements should be avoided. This is generally true, however when every single function call can fail, it makes sense |
|
2847 to handle error recovery with a single piece of code. Lines @61,if@ to @75,if@ handle initializing all of the temporary variables |
|
2848 required. Note how each of the if statements goes to a different label in case of failure. This allows the routine to correctly free only |
|
2849 the temporaries that have been successfully allocated so far. |
|
2850 |
|
2851 The temporary variables are all initialized using the mp\_init\_size routine since they are expected to be large. This saves the |
|
2852 additional reallocation that would have been necessary. Also $x0$, $x1$, $y0$ and $y1$ have to be able to hold at least their respective |
|
2853 number of digits for the next section of code. |
|
2854 |
|
2855 The first algebraic portion of the algorithm is to split the two inputs into their halves. However, instead of using mp\_mod\_2d and mp\_rshd |
|
2856 to extract the halves, the respective code has been placed inline within the body of the function. To initialize the halves, the \textbf{used} and |
|
2857 \textbf{sign} members are copied first. The first for loop on line @98,for@ copies the lower halves. Since they are both the same magnitude it |
|
is simpler to calculate both lower halves in a single loop. The for loops on lines @104,for@ and @109,for@ calculate the upper halves $x1$ and
|
2859 $y1$ respectively. |
|
2860 |
|
2861 By inlining the calculation of the halves, the Karatsuba multiplier has a slightly lower overhead and can be used for smaller magnitude inputs. |
|
2862 |
|
When line @152,err@ is reached, the algorithm has completed successfully. The ``error status'' variable $err$ is set to \textbf{MP\_OKAY} so that
|
2864 the same code that handles errors can be used to clear the temporary variables and return. |
|
2865 |
|
2866 \subsection{Toom-Cook $3$-Way Multiplication} |
|
2867 Toom-Cook $3$-Way \cite{TOOM} multiplication is essentially the polynomial basis algorithm for $n = 2$ except that the points are |
|
2868 chosen such that $\zeta$ is easy to compute and the resulting system of equations easy to reduce. Here, the points $\zeta_{0}$, |
|
2869 $16 \cdot \zeta_{1 \over 2}$, $\zeta_1$, $\zeta_2$ and $\zeta_{\infty}$ make up the five required points to solve for the coefficients |
|
of $W(x)$.
|
2871 |
|
2872 With the five relations that Toom-Cook specifies, the following system of equations is formed. |
|
2873 |
|
2874 \begin{center} |
|
2875 \begin{tabular}{rcrcrcrcrcr} |
|
2876 $\zeta_0$ & $=$ & $0w_4$ & $+$ & $0w_3$ & $+$ & $0w_2$ & $+$ & $0w_1$ & $+$ & $1w_0$ \\ |
|
2877 $16 \cdot \zeta_{1 \over 2}$ & $=$ & $1w_4$ & $+$ & $2w_3$ & $+$ & $4w_2$ & $+$ & $8w_1$ & $+$ & $16w_0$ \\ |
|
2878 $\zeta_1$ & $=$ & $1w_4$ & $+$ & $1w_3$ & $+$ & $1w_2$ & $+$ & $1w_1$ & $+$ & $1w_0$ \\ |
|
2879 $\zeta_2$ & $=$ & $16w_4$ & $+$ & $8w_3$ & $+$ & $4w_2$ & $+$ & $2w_1$ & $+$ & $1w_0$ \\ |
|
2880 $\zeta_{\infty}$ & $=$ & $1w_4$ & $+$ & $0w_3$ & $+$ & $0w_2$ & $+$ & $0w_1$ & $+$ & $0w_0$ \\ |
|
2881 \end{tabular} |
|
2882 \end{center} |
|
2883 |
|
A straightforward solution of this system requires $12$ subtractions, two multiplications by a small power of two, two divisions by a small power
of two, two divisions by three and one multiplication by three. All of these $19$ sub-operations require less than quadratic time, meaning that
the algorithm can be faster than a baseline multiplication. However, the greater complexity of this algorithm places the cutoff point
(\textbf{TOOM\_MUL\_CUTOFF}) where Toom-Cook becomes more efficient much higher than the Karatsuba cutoff point.
|
2888 |
|
2889 \begin{figure}[!here] |
|
2890 \begin{small} |
|
2891 \begin{center} |
|
2892 \begin{tabular}{l} |
|
2893 \hline Algorithm \textbf{mp\_toom\_mul}. \\ |
|
2894 \textbf{Input}. mp\_int $a$ and mp\_int $b$ \\ |
|
2895 \textbf{Output}. $c \leftarrow a \cdot b $ \\ |
|
2896 \hline \\ |
|
2897 Split $a$ and $b$ into three pieces. E.g. $a = a_2 \beta^{2k} + a_1 \beta^{k} + a_0$ \\ |
|
2898 1. $k \leftarrow \lfloor \mbox{min}(a.used, b.used) / 3 \rfloor$ \\ |
|
2899 2. $a_0 \leftarrow a \mbox{ (mod }\beta^{k}\mbox{)}$ \\ |
|
2900 3. $a_1 \leftarrow \lfloor a / \beta^k \rfloor$, $a_1 \leftarrow a_1 \mbox{ (mod }\beta^{k}\mbox{)}$ \\ |
|
2901 4. $a_2 \leftarrow \lfloor a / \beta^{2k} \rfloor$, $a_2 \leftarrow a_2 \mbox{ (mod }\beta^{k}\mbox{)}$ \\ |
|
5. $b_0 \leftarrow b \mbox{ (mod }\beta^{k}\mbox{)}$ \\
6. $b_1 \leftarrow \lfloor b / \beta^k \rfloor$, $b_1 \leftarrow b_1 \mbox{ (mod }\beta^{k}\mbox{)}$ \\
7. $b_2 \leftarrow \lfloor b / \beta^{2k} \rfloor$, $b_2 \leftarrow b_2 \mbox{ (mod }\beta^{k}\mbox{)}$ \\
|
2905 \\ |
|
2906 Find the five equations for $w_0, w_1, ..., w_4$. \\ |
|
2907 8. $w_0 \leftarrow a_0 \cdot b_0$ \\ |
|
2908 9. $w_4 \leftarrow a_2 \cdot b_2$ \\ |
|
2909 10. $tmp_1 \leftarrow 2 \cdot a_0$, $tmp_1 \leftarrow a_1 + tmp_1$, $tmp_1 \leftarrow 2 \cdot tmp_1$, $tmp_1 \leftarrow tmp_1 + a_2$ \\ |
|
2910 11. $tmp_2 \leftarrow 2 \cdot b_0$, $tmp_2 \leftarrow b_1 + tmp_2$, $tmp_2 \leftarrow 2 \cdot tmp_2$, $tmp_2 \leftarrow tmp_2 + b_2$ \\ |
|
2911 12. $w_1 \leftarrow tmp_1 \cdot tmp_2$ \\ |
|
2912 13. $tmp_1 \leftarrow 2 \cdot a_2$, $tmp_1 \leftarrow a_1 + tmp_1$, $tmp_1 \leftarrow 2 \cdot tmp_1$, $tmp_1 \leftarrow tmp_1 + a_0$ \\ |
|
2913 14. $tmp_2 \leftarrow 2 \cdot b_2$, $tmp_2 \leftarrow b_1 + tmp_2$, $tmp_2 \leftarrow 2 \cdot tmp_2$, $tmp_2 \leftarrow tmp_2 + b_0$ \\ |
|
2914 15. $w_3 \leftarrow tmp_1 \cdot tmp_2$ \\ |
|
2915 16. $tmp_1 \leftarrow a_0 + a_1$, $tmp_1 \leftarrow tmp_1 + a_2$, $tmp_2 \leftarrow b_0 + b_1$, $tmp_2 \leftarrow tmp_2 + b_2$ \\ |
|
2916 17. $w_2 \leftarrow tmp_1 \cdot tmp_2$ \\ |
|
2917 \\ |
|
2918 Continued on the next page.\\ |
|
2919 \hline |
|
2920 \end{tabular} |
|
2921 \end{center} |
|
2922 \end{small} |
|
2923 \caption{Algorithm mp\_toom\_mul} |
|
2924 \end{figure} |
|
2925 |
|
2926 \newpage\begin{figure}[!here] |
|
2927 \begin{small} |
|
2928 \begin{center} |
|
2929 \begin{tabular}{l} |
|
2930 \hline Algorithm \textbf{mp\_toom\_mul} (continued). \\ |
|
2931 \textbf{Input}. mp\_int $a$ and mp\_int $b$ \\ |
|
2932 \textbf{Output}. $c \leftarrow a \cdot b $ \\ |
|
2933 \hline \\ |
|
2934 Now solve the system of equations. \\ |
|
18. $w_1 \leftarrow w_1 - w_4$, $w_3 \leftarrow w_3 - w_0$ \\
|
2936 19. $w_1 \leftarrow \lfloor w_1 / 2 \rfloor$, $w_3 \leftarrow \lfloor w_3 / 2 \rfloor$ \\ |
|
2937 20. $w_2 \leftarrow w_2 - w_0$, $w_2 \leftarrow w_2 - w_4$ \\ |
|
2938 21. $w_1 \leftarrow w_1 - w_2$, $w_3 \leftarrow w_3 - w_2$ \\ |
|
2939 22. $tmp_1 \leftarrow 8 \cdot w_0$, $w_1 \leftarrow w_1 - tmp_1$, $tmp_1 \leftarrow 8 \cdot w_4$, $w_3 \leftarrow w_3 - tmp_1$ \\ |
|
2940 23. $w_2 \leftarrow 3 \cdot w_2$, $w_2 \leftarrow w_2 - w_1$, $w_2 \leftarrow w_2 - w_3$ \\ |
|
2941 24. $w_1 \leftarrow w_1 - w_2$, $w_3 \leftarrow w_3 - w_2$ \\ |
|
2942 25. $w_1 \leftarrow \lfloor w_1 / 3 \rfloor, w_3 \leftarrow \lfloor w_3 / 3 \rfloor$ \\ |
|
2943 \\ |
|
2944 Now substitute $\beta^k$ for $x$ by shifting $w_0, w_1, ..., w_4$. \\ |
|
2945 26. for $n$ from $1$ to $4$ do \\ |
|
2946 \hspace{3mm}26.1 $w_n \leftarrow w_n \cdot \beta^{nk}$ \\ |
|
2947 27. $c \leftarrow w_0 + w_1$, $c \leftarrow c + w_2$, $c \leftarrow c + w_3$, $c \leftarrow c + w_4$ \\ |
|
2948 28. Return(\textit{MP\_OKAY}) \\ |
|
2949 \hline |
|
2950 \end{tabular} |
|
2951 \end{center} |
|
2952 \end{small} |
|
2953 \caption{Algorithm mp\_toom\_mul (continued)} |
|
2954 \end{figure} |
|
2955 |
|
2956 \textbf{Algorithm mp\_toom\_mul.} |
|
This algorithm computes the product of two mp\_int variables $a$ and $b$ using the Toom-Cook approach. Compared to the Karatsuba multiplication, this
algorithm has a lower asymptotic running time of approximately $O(n^{1.465})$, but at an obvious cost in overhead. In this
description, several statements have been compounded to save space. The intention is that the statements are executed from left to right across
any given step.
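
The exponent follows from the shape of the recursion. The following is only a rough sketch: it assumes both inputs have $n$ digits, and the symbols $T(n)$ (the running time) and $c$ (a constant that lumps together all of the linear-time additions, shifts and small divisions) are introduced here purely for illustration. One level of Toom-Cook replaces a single $n$-digit product with five products of roughly $n/3$ digits each.

\begin{center}
\begin{tabular}{rcl}
$T(n)$ & $=$ & $5 \cdot T \left ( {n \over 3} \right ) + cn$ \\
       & $=$ & $O \left ( n^{lg(5)/lg(3)} \right )$ \\
\end{tabular}
\end{center}

Since $lg(5)/lg(3)$ is roughly $1.465$ while the Karatsuba exponent $lg(3)$ is roughly $1.585$, Toom-Cook wins asymptotically, although the larger constant $c$ is what pushes its cutoff point higher.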
|
2961 |
|
2962 The two inputs $a$ and $b$ are first split into three $k$-digit integers $a_0, a_1, a_2$ and $b_0, b_1, b_2$ respectively. From these smaller |
|
2963 integers the coefficients of the polynomial basis representations $f(x)$ and $g(x)$ are known and can be used to find the relations required. |
|
2964 |
|
The first two relations $w_0$ and $w_4$ are the points $\zeta_{0}$ and $\zeta_{\infty}$ respectively. The relations $w_1, w_2$ and $w_3$ correspond
to the points $16 \cdot \zeta_{1 \over 2}, \zeta_{1}$ and $\zeta_{2}$ respectively. These are found using logical shifts to independently find
$f(y)$ and $g(y)$, which significantly speeds up the algorithm.
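
As a sketch of why shifts and additions suffice, write $f(x) = a_2x^2 + a_1x + a_0$ (the same reasoning applies to $g(x)$ with the digits $b_i$). Steps 10 and 13 can then be read as Horner-style evaluations that use only doublings.

\begin{center}
\begin{tabular}{rcl}
$4 \cdot f \left ( {1 \over 2} \right )$ & $=$ & $2 \cdot (2 \cdot a_0 + a_1) + a_2$ \\
$f(2)$ & $=$ & $2 \cdot (2 \cdot a_2 + a_1) + a_0$ \\
\end{tabular}
\end{center}

Multiplying the two scaled values gives $4f({1 \over 2}) \cdot 4g({1 \over 2}) = 16 \cdot \zeta_{1 \over 2}$, which is where the factor of $16$ comes from and why no fractions ever appear.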
|
2968 |
|
After the five relations $w_0, w_1, \ldots, w_4$ have been computed, the system they represent must be solved in order for the unknown coefficients
$w_1, w_2$ and $w_3$ to be isolated. Steps 18 through 25 perform the system reduction required as previously described. Each step of
the reduction represents the comparable matrix operation that would be performed had this been done with pencil and paper. For example, step 18 indicates
that row $4$ must be subtracted from row $1$ (that is, $w_1 \leftarrow w_1 - w_4$) and simultaneously row $0$ subtracted from row $3$.
|
2973 |
|
Once the coefficients have been isolated, the polynomial $W(x) = \sum_{i=0}^{2n} w_i x^i$ is known. By substituting $\beta^{k}$ for $x$, the integer
result $a \cdot b$ is produced.
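
The whole procedure can be traced by hand on a toy input. The values below are only an illustration: the radix $\beta = 10$ and $k = 1$ are assumptions chosen to keep the numbers small, with $a = 123$ and $b = 456$, so that $a_2, a_1, a_0 = 1, 2, 3$ and $b_2, b_1, b_0 = 4, 5, 6$. In the reduction the first operation subtracts $w_4$ from $w_1$ and $w_0$ from $w_3$.

\begin{center}
\begin{tabular}{ll}
\hline Operation & Value \\
\hline
$w_0 \leftarrow 3 \cdot 6$ & $18$ \\
$w_4 \leftarrow 1 \cdot 4$ & $4$ \\
$w_1 \leftarrow (4 \cdot 3 + 2 \cdot 2 + 1)(4 \cdot 6 + 2 \cdot 5 + 4) = 17 \cdot 38$ & $646$ \\
$w_3 \leftarrow (4 \cdot 1 + 2 \cdot 2 + 3)(4 \cdot 4 + 2 \cdot 5 + 6) = 11 \cdot 32$ & $352$ \\
$w_2 \leftarrow (3 + 2 + 1)(6 + 5 + 4) = 6 \cdot 15$ & $90$ \\
\hline
$w_1 \leftarrow w_1 - w_4$, $w_3 \leftarrow w_3 - w_0$ & $w_1 = 642$, $w_3 = 334$ \\
$w_1 \leftarrow w_1 / 2$, $w_3 \leftarrow w_3 / 2$ & $w_1 = 321$, $w_3 = 167$ \\
$w_2 \leftarrow w_2 - w_0 - w_4$ & $w_2 = 68$ \\
$w_1 \leftarrow w_1 - w_2$, $w_3 \leftarrow w_3 - w_2$ & $w_1 = 253$, $w_3 = 99$ \\
$w_1 \leftarrow w_1 - 8w_0$, $w_3 \leftarrow w_3 - 8w_4$ & $w_1 = 109$, $w_3 = 67$ \\
$w_2 \leftarrow 3w_2 - w_1 - w_3$ & $w_2 = 28$ \\
$w_1 \leftarrow w_1 - w_2$, $w_3 \leftarrow w_3 - w_2$ & $w_1 = 81$, $w_3 = 39$ \\
$w_1 \leftarrow w_1 / 3$, $w_3 \leftarrow w_3 / 3$ & $w_1 = 27$, $w_3 = 13$ \\
\hline
\end{tabular}
\end{center}

Substituting $x = 10$ gives $18 + 27 \cdot 10 + 28 \cdot 10^2 + 13 \cdot 10^3 + 4 \cdot 10^4 = 56088$, which agrees with $123 \cdot 456$. Every division in the trace is exact, as it must be for the method to work over the integers.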
|
2976 |
|
2977 EXAM,bn_mp_toom_mul.c |
|
2978 |
|
|
2980 |
|
2981 \subsection{Signed Multiplication} |
|
Now that algorithms to handle multiplications of every useful dimension have been developed, a rather simple finishing touch is required. So far all
of the multiplication algorithms have been unsigned multiplications, which leaves only a signed multiplication algorithm to be established.
|
2984 |
|
2985 \newpage\begin{figure}[!here] |
|
2986 \begin{small} |
|
2987 \begin{center} |
|
2988 \begin{tabular}{l} |
|
2989 \hline Algorithm \textbf{mp\_mul}. \\ |
|
2990 \textbf{Input}. mp\_int $a$ and mp\_int $b$ \\ |
|
2991 \textbf{Output}. $c \leftarrow a \cdot b$ \\ |
|
2992 \hline \\ |
|
2993 1. If $a.sign = b.sign$ then \\ |
|
2994 \hspace{3mm}1.1 $sign = MP\_ZPOS$ \\ |
|
2995 2. else \\ |
|
2996 \hspace{3mm}2.1 $sign = MP\_ZNEG$ \\ |
|
2997 3. If min$(a.used, b.used) \ge TOOM\_MUL\_CUTOFF$ then \\ |
|
2998 \hspace{3mm}3.1 $c \leftarrow a \cdot b$ using algorithm mp\_toom\_mul \\ |
|
2999 4. else if min$(a.used, b.used) \ge KARATSUBA\_MUL\_CUTOFF$ then \\ |
|
3000 \hspace{3mm}4.1 $c \leftarrow a \cdot b$ using algorithm mp\_karatsuba\_mul \\ |
|
3001 5. else \\ |
|
3002 \hspace{3mm}5.1 $digs \leftarrow a.used + b.used + 1$ \\ |
|
\hspace{3mm}5.2 If $digs < MP\_WARRAY$ and min$(a.used, b.used) \le \delta$ then \\
|
3004 \hspace{6mm}5.2.1 $c \leftarrow a \cdot b \mbox{ (mod }\beta^{digs}\mbox{)}$ using algorithm fast\_s\_mp\_mul\_digs. \\ |
|
3005 \hspace{3mm}5.3 else \\ |
|
3006 \hspace{6mm}5.3.1 $c \leftarrow a \cdot b \mbox{ (mod }\beta^{digs}\mbox{)}$ using algorithm s\_mp\_mul\_digs. \\ |
|
3007 6. $c.sign \leftarrow sign$ \\ |
|
3008 7. Return the result of the unsigned multiplication performed. \\ |
|
3009 \hline |
|
3010 \end{tabular} |
|
3011 \end{center} |
|
3012 \end{small} |
|
3013 \caption{Algorithm mp\_mul} |
|
3014 \end{figure} |
|
3015 |
|
3016 \textbf{Algorithm mp\_mul.} |
|
This algorithm performs the signed multiplication of two inputs. It will make use of any of the four unsigned multiplication algorithms
available when the input is of appropriate size. The \textbf{sign} of the result is not set until the end of the algorithm since algorithm
s\_mp\_mul\_digs will clear it.
|
3020 |
|
3021 EXAM,bn_mp_mul.c |
|
3022 |
|
3023 The implementation is rather simplistic and is not particularly noteworthy. Line @22,?@ computes the sign of the result using the ``?'' |
|
3024 operator from the C programming language. Line @37,<<@ computes $\delta$ using the fact that $1 << k$ is equal to $2^k$. |
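
To make the shift concrete, here is a sketch under stated assumptions. It assumes that $\delta$ is the power of two $2^{\alpha - 2 \cdot lg(\beta)}$, where $\alpha$ (a symbol introduced here) is the number of bits in an mp\_word, and it assumes a 64-bit mp\_word with $\beta = 2^{28}$.

\begin{center}
\begin{tabular}{rcl}
$\delta$ & $=$ & $2^{\alpha - 2 \cdot lg(\beta)}$ \\
         & $=$ & $2^{64 - 2 \cdot 28}$ \\
         & $=$ & $2^{8} = 256$ \\
\end{tabular}
\end{center}

Under these assumptions the Comba multiplier would be selected only when the smaller input has at most $256$ digits. Half of this value is $128$, which matches the Comba squaring limit quoted for the Athlon later in this chapter.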
|
3025 |
|
3026 \section{Squaring} |
|
3027 \label{sec:basesquare} |
|
3028 |
|
3029 Squaring is a special case of multiplication where both multiplicands are equal. At first it may seem like there is no significant optimization |
|
3030 available but in fact there is. Consider the multiplication of $576$ against $241$. In total there will be nine single precision multiplications |
|
3031 performed which are $1\cdot 6$, $1 \cdot 7$, $1 \cdot 5$, $4 \cdot 6$, $4 \cdot 7$, $4 \cdot 5$, $2 \cdot 6$, $2 \cdot 7$ and $2 \cdot 5$. Now consider |
|
3032 the multiplication of $123$ against $123$. The nine products are $3 \cdot 3$, $3 \cdot 2$, $3 \cdot 1$, $2 \cdot 3$, $2 \cdot 2$, $2 \cdot 1$, |
|
3033 $1 \cdot 3$, $1 \cdot 2$ and $1 \cdot 1$. On closer inspection some of the products are equivalent. For example, $3 \cdot 2 = 2 \cdot 3$ |
|
3034 and $3 \cdot 1 = 1 \cdot 3$. |
|
3035 |
|
3036 For any $n$-digit input, there are ${{\left (n^2 + n \right)}\over 2}$ possible unique single precision multiplications required compared to the $n^2$ |
|
3037 required for multiplication. The following diagram gives an example of the operations required. |
|
3038 |
|
3039 \begin{figure}[here] |
|
3040 \begin{center} |
|
3041 \begin{tabular}{ccccc|c} |
|
3042 &&1&2&3&\\ |
|
3043 $\times$ &&1&2&3&\\ |
|
3044 \hline && $3 \cdot 1$ & $3 \cdot 2$ & $3 \cdot 3$ & Row 0\\ |
|
3045 & $2 \cdot 1$ & $2 \cdot 2$ & $2 \cdot 3$ && Row 1 \\ |
|
3046 $1 \cdot 1$ & $1 \cdot 2$ & $1 \cdot 3$ &&& Row 2 \\ |
|
3047 \end{tabular} |
|
3048 \end{center} |
|
3049 \caption{Squaring Optimization Diagram} |
|
3050 \end{figure} |
|
3051 |
|
3052 MARK,SQUARE |
|
3053 Starting from zero and numbering the columns from right to left a very simple pattern becomes obvious. For the purposes of this discussion let $x$ |
|
3054 represent the number being squared. The first observation is that in row $k$ the $2k$'th column of the product has a $\left (x_k \right)^2$ term in it. |
|
3055 |
|
3056 The second observation is that every column $j$ in row $k$ where $j \ne 2k$ is part of a double product. Every non-square term of a column will |
|
3057 appear twice hence the name ``double product''. Every odd column is made up entirely of double products. In fact every column is made up of double |
|
3058 products and at most one square (\textit{see the exercise section}). |
|
3059 |
|
3060 The third and final observation is that for row $k$ the first unique non-square term, that is, one that hasn't already appeared in an earlier row, |
|
3061 occurs at column $2k + 1$. For example, on row $1$ of the previous squaring, column one is part of the double product with column one from row zero. |
|
3062 Column two of row one is a square and column three is the first unique column. |
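
The three observations can be checked against the earlier example of squaring $123$. In this illustration the digits are $x_0 = 3$, $x_1 = 2$ and $x_2 = 1$ in radix $10$, and each column is written as its square term (if any) plus its double products.

\begin{center}
\begin{tabular}{c|l|l|c}
Column & Square term & Double products & Column sum \\
\hline
0 & $3 \cdot 3 = 9$ & none & $9$ \\
1 & none & $2 \cdot (3 \cdot 2) = 12$ & $12$ \\
2 & $2 \cdot 2 = 4$ & $2 \cdot (3 \cdot 1) = 6$ & $10$ \\
3 & none & $2 \cdot (2 \cdot 1) = 4$ & $4$ \\
4 & $1 \cdot 1 = 1$ & none & $1$ \\
\end{tabular}
\end{center}

Weighting each column sum by its power of ten gives $9 + 12 \cdot 10 + 10 \cdot 10^2 + 4 \cdot 10^3 + 1 \cdot 10^4 = 15129 = 123^2$. Only six single precision multiplications were needed, which agrees with ${{\left (n^2 + n \right)}\over 2}$ for $n = 3$.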
|
3063 |
|
3064 \subsection{The Baseline Squaring Algorithm} |
|
3065 The baseline squaring algorithm is meant to be a catch-all squaring algorithm. It will handle any of the input sizes that the faster routines |
|
3066 will not handle. |
|
3067 |
|
3068 \newpage\begin{figure}[!here] |
|
3069 \begin{small} |
|
3070 \begin{center} |
|
3071 \begin{tabular}{l} |
|
3072 \hline Algorithm \textbf{s\_mp\_sqr}. \\ |
|
3073 \textbf{Input}. mp\_int $a$ \\ |
|
3074 \textbf{Output}. $b \leftarrow a^2$ \\ |
|
3075 \hline \\ |
|
3076 1. Init a temporary mp\_int of at least $2 \cdot a.used +1$ digits. (\textit{mp\_init\_size}) \\ |
|
3077 2. If step 1 failed return(\textit{MP\_MEM}) \\ |
|
3078 3. $t.used \leftarrow 2 \cdot a.used + 1$ \\ |
|
3079 4. For $ix$ from 0 to $a.used - 1$ do \\ |
|
3080 \hspace{3mm}Calculate the square. \\ |
|
3081 \hspace{3mm}4.1 $\hat r \leftarrow t_{2ix} + \left (a_{ix} \right )^2$ \\ |
|
3082 \hspace{3mm}4.2 $t_{2ix} \leftarrow \hat r \mbox{ (mod }\beta\mbox{)}$ \\ |
|
3083 \hspace{3mm}Calculate the double products after the square. \\ |
|
3084 \hspace{3mm}4.3 $u \leftarrow \lfloor \hat r / \beta \rfloor$ \\ |
|
3085 \hspace{3mm}4.4 For $iy$ from $ix + 1$ to $a.used - 1$ do \\ |
|
3086 \hspace{6mm}4.4.1 $\hat r \leftarrow 2 \cdot a_{ix}a_{iy} + t_{ix + iy} + u$ \\ |
|
3087 \hspace{6mm}4.4.2 $t_{ix + iy} \leftarrow \hat r \mbox{ (mod }\beta\mbox{)}$ \\ |
|
3088 \hspace{6mm}4.4.3 $u \leftarrow \lfloor \hat r / \beta \rfloor$ \\ |
|
3089 \hspace{3mm}Set the last carry. \\ |
|
3090 \hspace{3mm}4.5 While $u > 0$ do \\ |
|
3091 \hspace{6mm}4.5.1 $iy \leftarrow iy + 1$ \\ |
|
3092 \hspace{6mm}4.5.2 $\hat r \leftarrow t_{ix + iy} + u$ \\ |
|
3093 \hspace{6mm}4.5.3 $t_{ix + iy} \leftarrow \hat r \mbox{ (mod }\beta\mbox{)}$ \\ |
|
3094 \hspace{6mm}4.5.4 $u \leftarrow \lfloor \hat r / \beta \rfloor$ \\ |
|
3095 5. Clamp excess digits of $t$. (\textit{mp\_clamp}) \\ |
|
3096 6. Exchange $b$ and $t$. \\ |
|
3097 7. Clear $t$ (\textit{mp\_clear}) \\ |
|
3098 8. Return(\textit{MP\_OKAY}) \\ |
|
3099 \hline |
|
3100 \end{tabular} |
|
3101 \end{center} |
|
3102 \end{small} |
|
3103 \caption{Algorithm s\_mp\_sqr} |
|
3104 \end{figure} |
|
3105 |
|
3106 \textbf{Algorithm s\_mp\_sqr.} |
|
3107 This algorithm computes the square of an input using the three observations on squaring. It is based fairly faithfully on algorithm 14.16 of HAC |
|
3108 \cite[pp.596-597]{HAC}. Similar to algorithm s\_mp\_mul\_digs, a temporary mp\_int is allocated to hold the result of the squaring. This allows the |
|
3109 destination mp\_int to be the same as the source mp\_int. |
|
3110 |
|
The outer loop of this algorithm begins on step 4. It is best to think of the outer loop as walking down the rows of the partial results, while
the inner loop computes the columns of the partial result. Steps 4.1 and 4.2 compute the square term for each row, and steps 4.3 and 4.4 propagate
the carry and compute the double products.
|
3114 |
|
The requirement that an mp\_word be able to represent the range $0 \le x < 2 \beta^2$ arises from this
very algorithm. The product $a_{ix}a_{iy}$ will lie in the range $0 \le x \le \beta^2 - 2\beta + 1$ which is obviously less than $\beta^2$ meaning that
when it is multiplied by two, it can be properly represented by an mp\_word.
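
The full accumulator in step 4.4.1 also includes a digit of $t$ and the carry, so a slightly longer bound is worth sketching. Assume every digit is at most $\beta - 1$ and that the incoming carry satisfies $u \le 2\beta - 1$ (an inductive hypothesis introduced here, not a statement taken from the algorithm).

\begin{center}
\begin{tabular}{rcl}
$\hat r$ & $\le$ & $2(\beta - 1)^2 + (\beta - 1) + (2\beta - 1)$ \\
         & $=$ & $2\beta^2 - \beta$ \\
         & $<$ & $2\beta^2$ \\
$\lfloor \hat r / \beta \rfloor$ & $\le$ & $2\beta - 1$ \\
\end{tabular}
\end{center}

The last line shows that the outgoing carry obeys the same hypothesis, and the carry leaving step 4.3 is at most $\beta - 1$, so the bound holds for every pass. Note that under this bound the carry $u$ may need one bit more than a single digit.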
|
3118 |
|
3119 Similar to algorithm s\_mp\_mul\_digs, after every pass of the inner loop, the destination is correctly set to the sum of all of the partial |
|
3120 results calculated so far. This involves expensive carry propagation which will be eliminated in the next algorithm. |
|
3121 |
|
3122 EXAM,bn_s_mp_sqr.c |
|
3123 |
|
3124 Inside the outer loop (\textit{see line @32,for@}) the square term is calculated on line @35,r =@. Line @42,>>@ extracts the carry from the square |
|
3125 term. Aliases for $a_{ix}$ and $t_{ix+iy}$ are initialized on lines @45,tmpx@ and @48,tmpt@ respectively. The doubling is performed using two |
|
additions (\textit{see line @57,r + r@}) since this is usually faster than shifting, and otherwise at least as fast.
|
3127 |
|
3128 \subsection{Faster Squaring by the ``Comba'' Method} |
|
3129 A major drawback to the baseline method is the requirement for single precision shifting inside the $O(n^2)$ nested loop. Squaring has an additional |
|
3130 drawback that it must double the product inside the inner loop as well. As for multiplication, the Comba technique can be used to eliminate these |
|
3131 performance hazards. |
|
3132 |
|
3133 The first obvious solution is to make an array of mp\_words which will hold all of the columns. This will indeed eliminate all of the carry |
|
3134 propagation operations from the inner loop. However, the inner product must still be doubled $O(n^2)$ times. The solution stems from the simple fact |
|
3135 that $2a + 2b + 2c = 2(a + b + c)$. That is the sum of all of the double products is equal to double the sum of all the products. For example, |
|
3136 $ab + ba + ac + ca = 2ab + 2ac = 2(ab + ac)$. |
|
3137 |
|
3138 However, we cannot simply double all of the columns, since the squares appear only once per row. The most practical solution is to have two mp\_word |
|
3139 arrays. One array will hold the squares and the other array will hold the double products. With both arrays the doubling and carry propagation can be |
|
3140 moved to a $O(n)$ work level outside the $O(n^2)$ level. |
|
3141 |
|
3142 \newpage\begin{figure}[!here] |
|
3143 \begin{small} |
|
3144 \begin{center} |
|
3145 \begin{tabular}{l} |
|
3146 \hline Algorithm \textbf{fast\_s\_mp\_sqr}. \\ |
|
3147 \textbf{Input}. mp\_int $a$ \\ |
|
3148 \textbf{Output}. $b \leftarrow a^2$ \\ |
|
3149 \hline \\ |
|
3150 Place two arrays of \textbf{MP\_WARRAY} mp\_words named $\hat W$ and $\hat {X}$ on the stack. \\ |
|
3151 1. If $b.alloc < 2a.used + 1$ then grow $b$ to $2a.used + 1$ digits. (\textit{mp\_grow}). \\ |
|
3152 2. If step 1 failed return(\textit{MP\_MEM}). \\ |
|
3153 3. for $ix$ from $0$ to $2a.used + 1$ do \\ |
|
3154 \hspace{3mm}3.1 $\hat W_{ix} \leftarrow 0$ \\ |
|
3155 \hspace{3mm}3.2 $\hat {X}_{ix} \leftarrow 0$ \\ |
|
3156 4. for $ix$ from $0$ to $a.used - 1$ do \\ |
|
3157 \hspace{3mm}Compute the square.\\ |
|
3158 \hspace{3mm}4.1 $\hat {X}_{ix+ix} \leftarrow \left ( a_{ix} \right )^2$ \\ |
|
3159 \\ |
|
3160 \hspace{3mm}Compute the double products.\\ |
|
3161 \hspace{3mm}4.2 for $iy$ from $ix + 1$ to $a.used - 1$ do \\ |
|
3162 \hspace{6mm}4.2.1 $\hat W_{ix+iy} \leftarrow \hat W_{ix+iy} + a_{ix}a_{iy}$ \\ |
|
3163 5. $oldused \leftarrow b.used$ \\ |
|
3164 6. $b.used \leftarrow 2a.used + 1$ \\ |
|
3165 \\ |
|
3166 Double the products and propagate the carries simultaneously. \\ |
|
3167 7. $\hat W_0 \leftarrow 2 \hat W_0 + \hat {X}_0$ \\ |
|
3168 8. for $ix$ from $1$ to $2a.used$ do \\ |
|
3169 \hspace{3mm}8.1 $\hat W_{ix} \leftarrow 2 \hat W_{ix} + \hat {X}_{ix}$ \\ |
|
3170 \hspace{3mm}8.2 $\hat W_{ix} \leftarrow \hat W_{ix} + \lfloor \hat W_{ix - 1} / \beta \rfloor$ \\ |
|
\hspace{3mm}8.3 $b_{ix-1} \leftarrow \hat W_{ix-1} \mbox{ (mod }\beta\mbox{)}$ \\
|
3172 9. $b_{2a.used} \leftarrow \hat W_{2a.used} \mbox{ (mod }\beta\mbox{)}$ \\ |
|
3173 10. if $2a.used + 1 < oldused$ then do \\ |
|
3174 \hspace{3mm}10.1 for $ix$ from $2a.used + 1$ to $oldused$ do \\ |
|
3175 \hspace{6mm}10.1.1 $b_{ix} \leftarrow 0$ \\ |
|
3176 11. Clamp excess digits from $b$. (\textit{mp\_clamp}) \\ |
|
3177 12. Return(\textit{MP\_OKAY}). \\ |
|
3178 \hline |
|
3179 \end{tabular} |
|
3180 \end{center} |
|
3181 \end{small} |
|
3182 \caption{Algorithm fast\_s\_mp\_sqr} |
|
3183 \end{figure} |
|
3184 |
|
3185 \textbf{Algorithm fast\_s\_mp\_sqr.} |
|
3186 This algorithm computes the square of an input using the Comba technique. It is designed to be a replacement for algorithm s\_mp\_sqr when |
|
3187 the number of input digits is less than \textbf{MP\_WARRAY} and less than $\delta \over 2$. |
|
3188 |
|
3189 This routine requires two arrays of mp\_words to be placed on the stack. The first array $\hat W$ will hold the double products and the second |
|
3190 array $\hat X$ will hold the squares. Though only at most $MP\_WARRAY \over 2$ words of $\hat X$ are used, it has proven faster on most |
|
3191 processors to simply make it a full size array. |
|
3192 |
|
The loop on step 3 will zero the two arrays to prepare them for the squaring step. Step 4.1 computes the square terms. Note how
it simply assigns the value into the $\hat X$ array. The nested loop on step 4.2 handles the double products. This loop only
accumulates the sum of the products for each column; they are not doubled until later.
|
3196 |
|
After the squaring loop, the products stored in $\hat W$ must be doubled and the carries propagated forwards. It makes sense to do both
operations at the same time. The expression $\hat W_{ix} \leftarrow 2 \hat W_{ix} + \hat {X}_{ix}$ computes the sum of the double products and the
squares in place.
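
A small trace may help here. It is only a sketch: it assumes the toy radix $\beta = 10$ and the input $a = 123$, so that $a_0 = 3$, $a_1 = 2$ and $a_2 = 1$. The columns $\hat W_{ix}$ and $\hat X_{ix}$ show the arrays as they stand after step 4, and the remaining columns follow steps 7 through 9.

\begin{center}
\begin{tabular}{c|c|c|c|c|c}
$ix$ & $\hat W_{ix}$ & $\hat X_{ix}$ & $2 \hat W_{ix} + \hat X_{ix}$ & With carry in & $b_{ix}$ \\
\hline
0 & $0$ & $9$ & $9$  & $9$  & $9$ \\
1 & $6$ & $0$ & $12$ & $12$ & $2$ \\
2 & $3$ & $4$ & $10$ & $11$ & $1$ \\
3 & $2$ & $0$ & $4$  & $5$  & $5$ \\
4 & $0$ & $1$ & $1$  & $1$  & $1$ \\
\end{tabular}
\end{center}

Reading the digits from $b_4$ down to $b_0$ gives $15129 = 123^2$. The two remaining columns of the $2a.used + 1 = 7$ column result are zero and are removed by the clamp. The inner loop performed no doubling and no carry propagation at all; both were done once per column afterwards.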
|
3200 |
|
3201 EXAM,bn_fast_s_mp_sqr.c |
|
3202 |
|
|
3204 |
|
3205 \subsection{Polynomial Basis Squaring} |
|
The same algorithm that performs optimal polynomial basis multiplication can be used to perform polynomial basis squaring. The minor exception
is that $\zeta_y = f(y)g(y)$ is actually equivalent to $\zeta_y = f(y)^2$ since $f(y) = g(y)$. Instead of performing $2n + 1$
multiplications to find the $\zeta$ relations, $2n + 1$ squaring operations are performed.
|
3209 |
|
3210 \subsection{Karatsuba Squaring} |
|
3211 Let $f(x) = ax + b$ represent the polynomial basis representation of a number to square. |
|
3212 Let $h(x) = \left ( f(x) \right )^2$ represent the square of the polynomial. The Karatsuba equation can be modified to square a |
|
3213 number with the following equation. |
|
3214 |
|
3215 \begin{equation} |
|
3216 h(x) = a^2x^2 + \left (a^2 + b^2 - (a - b)^2 \right )x + b^2 |
|
3217 \end{equation} |
|
3218 |
|
3219 Upon closer inspection this equation only requires the calculation of three half-sized squares: $a^2$, $b^2$ and $(a - b)^2$. As in |
|
3220 Karatsuba multiplication, this algorithm can be applied recursively on the input and will achieve an asymptotic running time of |
|
3221 $O \left ( n^{lg(3)} \right )$. |
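
As a quick numeric illustration (the decimal input and the choice $x = 100$ are assumptions made only to keep the arithmetic readable), consider squaring $1234$ with $a = 12$ and $b = 34$.

\begin{center}
\begin{tabular}{rcl}
$a^2$ & $=$ & $144$ \\
$b^2$ & $=$ & $1156$ \\
$(a - b)^2$ & $=$ & $(-22)^2 = 484$ \\
$a^2 + b^2 - (a - b)^2$ & $=$ & $144 + 1156 - 484 = 816$ \\
$h(100)$ & $=$ & $144 \cdot 100^2 + 816 \cdot 100 + 1156 = 1522756$ \\
\end{tabular}
\end{center}

The result agrees with $1234^2$, and the middle term $816$ equals $2 \cdot 12 \cdot 34$ even though no product of $a$ and $b$ was ever formed. Only the sign of $a - b$ is lost, which is harmless since the difference is squared.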
|
3222 |
|
3223 If the asymptotic times of Karatsuba squaring and multiplication are the same, why not simply use the multiplication algorithm |
|
3224 instead? The answer to this arises from the cutoff point for squaring. As in multiplication there exists a cutoff point, at which the |
|
3225 time required for a Comba based squaring and a Karatsuba based squaring meet. Due to the overhead inherent in the Karatsuba method, the cutoff |
|
3226 point is fairly high. For example, on an AMD Athlon XP processor with $\beta = 2^{28}$, the cutoff point is around 127 digits. |
|
3227 |
|
3228 Consider squaring a 200 digit number with this technique. It will be split into two 100 digit halves which are subsequently squared. |
|
3229 The 100 digit halves will not be squared using Karatsuba, but instead using the faster Comba based squaring algorithm. If Karatsuba multiplication |
|
3230 were used instead, the 100 digit numbers would be squared with a slower Comba based multiplication. |
|
3231 |
|
3232 \newpage\begin{figure}[!here] |
|
3233 \begin{small} |
|
3234 \begin{center} |
|
3235 \begin{tabular}{l} |
|
3236 \hline Algorithm \textbf{mp\_karatsuba\_sqr}. \\ |
|
3237 \textbf{Input}. mp\_int $a$ \\ |
|
3238 \textbf{Output}. $b \leftarrow a^2$ \\ |
|
3239 \hline \\ |
|
3240 1. Initialize the following temporary mp\_ints: $x0$, $x1$, $t1$, $t2$, $x0x0$ and $x1x1$. \\ |
|
3241 2. If any of the initializations on step 1 failed return(\textit{MP\_MEM}). \\ |
|
3242 \\ |
|
3243 Split the input. e.g. $a = x1\beta^B + x0$ \\ |
|
3244 3. $B \leftarrow \lfloor a.used / 2 \rfloor$ \\ |
|
3245 4. $x0 \leftarrow a \mbox{ (mod }\beta^B\mbox{)}$ (\textit{mp\_mod\_2d}) \\ |
|
5. $x1 \leftarrow \lfloor a / \beta^B \rfloor$ (\textit{mp\_rshd}) \\
|
3247 \\ |
|
3248 Calculate the three squares. \\ |
|
3249 6. $x0x0 \leftarrow x0^2$ (\textit{mp\_sqr}) \\ |
|
3250 7. $x1x1 \leftarrow x1^2$ \\ |
|
3251 8. $t1 \leftarrow x1 - x0$ (\textit{mp\_sub}) \\ |
|
3252 9. $t1 \leftarrow t1^2$ \\ |
|
3253 \\ |
|
3254 Compute the middle term. \\ |
|
3255 10. $t2 \leftarrow x0x0 + x1x1$ (\textit{s\_mp\_add}) \\ |
|
3256 11. $t1 \leftarrow t2 - t1$ \\ |
|
3257 \\ |
|
3258 Compute final product. \\ |
|
3259 12. $t1 \leftarrow t1\beta^B$ (\textit{mp\_lshd}) \\ |
|
3260 13. $x1x1 \leftarrow x1x1\beta^{2B}$ \\ |
|
3261 14. $t1 \leftarrow t1 + x0x0$ \\ |
|
3262 15. $b \leftarrow t1 + x1x1$ \\ |
|
3263 16. Return(\textit{MP\_OKAY}). \\ |
|
3264 \hline |
|
3265 \end{tabular} |
|
3266 \end{center} |
|
3267 \end{small} |
|
3268 \caption{Algorithm mp\_karatsuba\_sqr} |
|
3269 \end{figure} |
|
3270 |
|
3271 \textbf{Algorithm mp\_karatsuba\_sqr.} |
|
3272 This algorithm computes the square of an input $a$ using the Karatsuba technique. This algorithm is very similar to the Karatsuba based |
|
3273 multiplication algorithm with the exception that the three half-size multiplications have been replaced with three half-size squarings. |
|
3274 |
|
The radix point for squaring is simply placed exactly in the middle of the digits when the input has an even number of digits, otherwise it is
placed just below the middle. Steps 3, 4 and 5 compute the two halves required using $B$
as the radix point. The first two squares in steps 6 and 7 are rather straightforward while the last square is of a more compact form.
|
3278 |
|
By expanding $\left (x1 - x0 \right )^2$, the $x1^2$ and $x0^2$ terms in the middle disappear, that is $x1^2 + x0^2 - (x1 - x0)^2 = 2 \cdot x0 \cdot x1$.
Now if $5n$ single precision additions and a squaring of $n$ digits are faster than multiplying two $n$-digit numbers and doubling, then
this method is faster. Assuming no further recursions occur, the difference can be estimated with the following inequality.
|
3282 |
|
3283 Let $p$ represent the cost of a single precision addition and $q$ the cost of a single precision multiplication both in terms of time\footnote{Or |
|
3284 machine clock cycles.}. |
|
3285 |
|
3286 \begin{equation} |
|
3287 5pn +{{q(n^2 + n)} \over 2} \le pn + qn^2 |
|
3288 \end{equation} |
|
3289 |
|
3290 For example, on an AMD Athlon XP processor $p = {1 \over 3}$ and $q = 6$. This implies that the following inequality should hold. |
|
3291 \begin{center} |
|
3292 \begin{tabular}{rcl} |
|
3293 ${5n \over 3} + 3n^2 + 3n$ & $<$ & ${n \over 3} + 6n^2$ \\ |
|
3294 ${5 \over 3} + 3n + 3$ & $<$ & ${1 \over 3} + 6n$ \\ |
|
3295 ${13 \over 9}$ & $<$ & $n$ \\ |
|
3296 \end{tabular} |
|
3297 \end{center} |
|
3298 |
|
3299 This results in a cutoff point around $n = 2$. As a consequence it is actually faster to compute the middle term the ``long way'' on processors |
|
3300 where multiplication is substantially slower\footnote{On the Athlon there is a 1:17 ratio between clock cycles for addition and multiplication. On |
|
3301 the Intel P4 processor this ratio is 1:29 making this method even more beneficial. The only common exception is the ARMv4 processor which has a |
|
3302 ratio of 1:7. } than simpler operations such as addition. |
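
The cutoff can be confirmed by evaluating both sides of the inequality directly. This is only a check of the arithmetic above, using the same assumed costs $p = {1 \over 3}$ and $q = 6$.

\begin{center}
\begin{tabular}{c|c|c|l}
$n$ & ${5n \over 3} + 3n^2 + 3n$ & ${n \over 3} + 6n^2$ & Squaring method faster? \\
\hline
1 & ${23 \over 3}$ & ${19 \over 3}$ & no \\
2 & ${64 \over 3}$ & ${74 \over 3}$ & yes \\
\end{tabular}
\end{center}

The left hand side grows as $3n^2$ while the right hand side grows as $6n^2$, so the advantage only widens for larger $n$.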
|
3303 |
|
3304 EXAM,bn_mp_karatsuba_sqr.c |
|
3305 |
|
3306 This implementation is largely based on the implementation of algorithm mp\_karatsuba\_mul. It uses the same inline style to copy and |
|
3307 shift the input into the two halves. The loop from line @54,{@ to line @70,}@ has been modified since only one input exists. The \textbf{used} |
|
3308 count of both $x0$ and $x1$ is fixed up and $x0$ is clamped before the calculations begin. At this point $x1$ and $x0$ are valid equivalents |
|
3309 to the respective halves as if mp\_rshd and mp\_mod\_2d had been used. |
|
3310 |
|
By inlining the copy and shift operations the cutoff point for Karatsuba squaring can be lowered. On the Athlon the cutoff point
is exactly at the point where Comba squaring can no longer be used (\textit{128 digits}). On slower processors such as the Intel P4
it is actually below the Comba limit (\textit{at 110 digits}).
|
3314 |
|
This routine uses the same error trap coding style as mp\_karatsuba\_mul. If the initialization of any temporary variable fails, execution is redirected to
the error trap, which clears the temporaries and returns. If the algorithm completes without error, the error code is set to \textbf{MP\_OKAY} and the same mp\_clear calls are executed normally.
|
3317 |
|
|
3319 |
|
3320 \subsection{Toom-Cook Squaring} |
|
The Toom-Cook squaring algorithm mp\_toom\_sqr is heavily based on the algorithm mp\_toom\_mul with the exception that squarings are used
instead of multiplications to find the five relations. The reader is encouraged to read the description of the latter algorithm and try to
derive their own Toom-Cook squaring algorithm.
|
3324 |
|
3325 \subsection{High Level Squaring} |
|
3326 \newpage\begin{figure}[!here] |
|
3327 \begin{small} |
|
3328 \begin{center} |
|
3329 \begin{tabular}{l} |
|
3330 \hline Algorithm \textbf{mp\_sqr}. \\ |
|
3331 \textbf{Input}. mp\_int $a$ \\ |
|
3332 \textbf{Output}. $b \leftarrow a^2$ \\ |
|
3333 \hline \\ |
|
3334 1. If $a.used \ge TOOM\_SQR\_CUTOFF$ then \\ |
|
3335 \hspace{3mm}1.1 $b \leftarrow a^2$ using algorithm mp\_toom\_sqr \\ |
|
3336 2. else if $a.used \ge KARATSUBA\_SQR\_CUTOFF$ then \\ |
|
3337 \hspace{3mm}2.1 $b \leftarrow a^2$ using algorithm mp\_karatsuba\_sqr \\ |
|
3338 3. else \\ |
|
\hspace{3mm}3.1 $digs \leftarrow 2 \cdot a.used + 1$ \\
\hspace{3mm}3.2 If $digs < MP\_WARRAY$ and $a.used < {\delta \over 2}$ then \\
|
3341 \hspace{6mm}3.2.1 $b \leftarrow a^2$ using algorithm fast\_s\_mp\_sqr. \\ |
|
3342 \hspace{3mm}3.3 else \\ |
|
3343 \hspace{6mm}3.3.1 $b \leftarrow a^2$ using algorithm s\_mp\_sqr. \\ |
|
3344 4. $b.sign \leftarrow MP\_ZPOS$ \\ |
|
3345 5. Return the result of the unsigned squaring performed. \\ |
|
3346 \hline |
|
3347 \end{tabular} |
|
3348 \end{center} |
|
3349 \end{small} |
|
3350 \caption{Algorithm mp\_sqr} |
|
3351 \end{figure} |
|
3352 |
|
3353 \textbf{Algorithm mp\_sqr.} |
|
This algorithm computes the square of the input using one of four different algorithms. If the input has at least
\textbf{TOOM\_SQR\_CUTOFF} digits the Toom-Cook squaring algorithm is used; otherwise, if it has at least \textbf{KARATSUBA\_SQR\_CUTOFF} digits, the Karatsuba squaring algorithm is used. If
neither of the polynomial basis algorithms should be used then either the Comba or baseline algorithm is used.
|
3357 |
|
3358 EXAM,bn_mp_sqr.c |
|
3359 |
|
3360 \section*{Exercises} |
|
3361 \begin{tabular}{cl} |
|
3362 $\left [ 3 \right ] $ & Devise an efficient algorithm for selection of the radix point to handle inputs \\ |
|
3363 & that have different number of digits in Karatsuba multiplication. \\ |
|
3364 & \\ |
|
3365 $\left [ 3 \right ] $ & In ~SQUARE~ the fact that every column of a squaring is made up \\ |
|
3366 & of double products and at most one square is stated. Prove this statement. \\ |
|
3367 & \\ |
|
3368 $\left [ 2 \right ] $ & In the Comba squaring algorithm half of the $\hat X$ variables are not used. \\ |
|
3369 & Revise algorithm fast\_s\_mp\_sqr to shrink the $\hat X$ array. \\ |
|
3370 & \\ |
|
3371 $\left [ 3 \right ] $ & Prove the equation for Karatsuba squaring. \\ |
|
3372 & \\ |
|
3373 $\left [ 1 \right ] $ & Prove that Karatsuba squaring requires $O \left (n^{lg(3)} \right )$ time. \\ |
|
3374 & \\ |
|
3375 $\left [ 2 \right ] $ & Determine the minimal ratio between addition and multiplication clock cycles \\ |
|
3376 & required for equation $6.7$ to be true. \\ |
|
3377 & \\ |
|
3378 \end{tabular} |
|
3379 |
|
3380 \chapter{Modular Reduction} |
|
3381 MARK,REDUCTION |
|
3382 \section{Basics of Modular Reduction} |
|
3383 \index{modular residue} |
|
3384 Modular reduction is an operation that arises quite often within public key cryptography algorithms and various number theoretic algorithms, |
|
3385 such as factoring. Modular reduction algorithms are the third class of algorithms of the ``multipliers'' set. A number $a$ is said to be \textit{reduced} |
|
3386 modulo another number $b$ by finding the remainder of the division $a/b$. Full integer division with remainder is a topic to be covered |
|
3387 in~\ref{sec:division}. |
|
3388 |
|
3389 Modular reduction is equivalent to solving for $r$ in the equation $a = bq + r$ where $q = \lfloor a/b \rfloor$. The result |
|
3390 $r$ is said to be ``congruent to $a$ modulo $b$'' which is also written as $r \equiv a \mbox{ (mod }b\mbox{)}$. In other vernacular $r$ is known as the |
|
3391 ``modular residue'' which leads to ``quadratic residue''\footnote{That's fancy talk for $b \equiv a^2 \mbox{ (mod }p\mbox{)}$.} and |
|
3392 other forms of residues. |
|
3393 |
|
3394 Modular reductions are normally used to create either finite groups, rings or fields. The most common usage for performance driven modular reductions |
|
3395 is in modular exponentiation algorithms. That is to compute $d = a^b \mbox{ (mod }c\mbox{)}$ as fast as possible. This operation is used in the |
|
3396 RSA and Diffie-Hellman public key algorithms, for example. Modular multiplication and squaring also appear as fundamental operations in |
|
3397 Elliptic Curve cryptographic algorithms. As will be discussed in the subsequent chapter there exist fast algorithms for computing modular |
|
3398 exponentiations without having to perform (\textit{in this example}) $b - 1$ multiplications. These algorithms will produce partial results in the |
|
3399 range $0 \le x < c^2$ which can be taken advantage of to create several efficient algorithms. Modular reductions have also been used to create redundancy check |
|
3400 algorithms known as CRCs, error correction codes such as Reed-Solomon, and to solve a variety of number theoretic problems. |
|
3401 |
|
3402 \section{The Barrett Reduction} |
|
3403 The Barrett reduction algorithm \cite{BARRETT} was inspired by fast division algorithms which multiply by the reciprocal to emulate |
|
3404 division. Barrett's observation was that the residue $c$ of $a$ modulo $b$ is equal to |
|
3405 |
|
3406 \begin{equation} |
|
3407 c = a - b \cdot \lfloor a/b \rfloor |
|
3408 \end{equation} |
|
3409 |
|
3410 Since algorithms such as modular exponentiation would be using the same modulus extensively, typical DSP\footnote{It is worth noting that Barrett's paper |
|
3411 targeted the DSP56K processor.} intuition would indicate the next step would be to replace $a/b$ by a multiplication by the reciprocal. However, |
|
3412 DSP intuition on its own will not work as these numbers are considerably larger than the precision of common DSP floating point data types. |
|
3413 It would take another common optimization to optimize the algorithm. |
|
3414 |
|
3415 \subsection{Fixed Point Arithmetic} |
|
3416 The trick used to optimize the above equation is based on a technique of emulating floating point data types with fixed precision integers. Fixed |
|
3417 point arithmetic became very popular as it greatly optimized the ``3d-shooter'' genre of games in the mid 1990s when floating point units were |
|
3418 fairly slow if not unavailable. The idea behind fixed point arithmetic is to take a normal $k$-bit integer data type and break it into a $p$-bit |
|
3419 integer part and a $q$-bit fraction part (\textit{where $p+q = k$}). |
|
3420 |
|
3421 In this system a $k$-bit integer $n$ would actually represent $n/2^q$. For example, with $q = 4$ the integer $n = 37$ would actually represent the |
|
3422 value $2.3125$. To multiply two fixed point numbers the integers are multiplied using traditional arithmetic and subsequently normalized by |
|
3423 moving the implied decimal point back to where it should be. For example, with $q = 4$ to multiply the integers $9$ and $5$ they must be converted |
|
3424 to fixed point first by multiplying by $2^q$. Let $a = 9(2^q)$ represent the fixed point representation of $9$ and $b = 5(2^q)$ represent the |
|
3425 fixed point representation of $5$. The product $ab$ is equal to $45(2^{2q})$ which when normalized by dividing by $2^q$ produces $45(2^q)$. |
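The representation and the normalizing shift can be sketched as follows. This is a minimal illustration in Python. The helper names \texttt{to\_fixed}, \texttt{fixed\_mul} and \texttt{fixed\_value} are hypothetical and chosen only for this example.

```python
Q = 4  # number of fraction bits (q in the text)

def to_fixed(n, q=Q):
    """Convert an integer to fixed point by scaling it by 2**q."""
    return n << q

def fixed_mul(a, b, q=Q):
    """Multiply two fixed point values and renormalize with a right shift."""
    return (a * b) >> q

def fixed_value(n, q=Q):
    """Return the real value that the fixed point integer n represents."""
    return n / (1 << q)
```

Here \texttt{fixed\_mul(to\_fixed(9), to\_fixed(5))} gives $45(2^q) = 720$, and \texttt{fixed\_value(37)} gives $2.3125$.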
|
3426 |
|
3427 This technique became popular since a normal integer multiplication and logical shift right are the only required operations to perform a multiplication |
|
3428 of two fixed point numbers. Using fixed point arithmetic, division can be easily approximated by multiplying by the reciprocal. If $2^q$ is |
|
3429 equivalent to one then $2^q/b$ is equivalent to the fixed point approximation of $1/b$ using real arithmetic. Using this fact dividing an integer |
|
3430 $a$ by another integer $b$ can be achieved with the following expression. |
|
3431 |
|
3432 \begin{equation} |
|
3433 \lfloor a / b \rfloor \mbox{ }\approx\mbox{ } \lfloor (a \cdot \lfloor 2^q / b \rfloor)/2^q \rfloor |
|
3434 \end{equation} |
|
3435 |
|
3436 The precision of the division is proportional to the value of $q$. If the divisor $b$ is used frequently as is the case with |
|
3437 modular exponentiation pre-computing $2^q/b$ will allow a division to be performed with a multiplication and a right shift. Both operations |
|
3438 are considerably faster than division on most processors. |
|
3439 |
|
3440 Consider dividing $19$ by $5$. The correct result is $\lfloor 19/5 \rfloor = 3$. With $q = 3$ the reciprocal is $\lfloor 2^q/5 \rfloor = 1$ which |
|
3441 leads to a product of $19$ which when divided by $2^q$ produces $2$. However, with $q = 4$ the reciprocal is $\lfloor 2^q/5 \rfloor = 3$ and |
|
3442 the result of the emulated division is $\lfloor 3 \cdot 19 / 2^q \rfloor = 3$ which is correct. The value of $2^q$ must be close to or ideally |
|
3443 larger than the dividend. In effect if $a$ is the dividend then $q$ should allow $0 \le \lfloor a/2^q \rfloor \le 1$ in order for this approach |
|
3444 to work correctly. Plugging this form of division into the original equation the following modular residue equation arises. |
|
3445 |
|
3446 \begin{equation} |
|
3447 c = a - b \cdot \lfloor (a \cdot \lfloor 2^q / b \rfloor)/2^q \rfloor |
|
3448 \end{equation} |
|
3449 |
|
3450 Using the notation from \cite{BARRETT} the value of $\lfloor 2^q / b \rfloor$ will be represented by the $\mu$ symbol. Using the $\mu$ |
|
3451 variable also helps reinforce the idea that it is meant to be computed once and re-used. |
|
3452 |
|
3453 \begin{equation} |
|
3454 c = a - b \cdot \lfloor (a \cdot \mu)/2^q \rfloor |
|
3455 \end{equation} |
|
3456 |
|
3457 Provided that $2^q \ge a$ this algorithm will produce a quotient that is either exactly correct or off by a value of one. In the context of Barrett |
|
3458 reduction the value of $a$ is bounded by $0 \le a \le (b - 1)^2$ meaning that $2^q \ge b^2$ is sufficient to ensure the reciprocal will have enough |
|
3459 precision. |
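The off-by-one behaviour of the emulated division can be checked numerically. The sketch below is in Python and its helper names are hypothetical. It compares the emulated quotient against the true quotient for every dividend $0 \le a \le 2^q$.

```python
def reciprocal(b, q):
    """Fixed point approximation of 1/b: floor(2**q / b)."""
    return (1 << q) // b

def approx_div(a, b, q):
    """Approximate floor(a / b) with one multiplication and one right shift."""
    return (a * reciprocal(b, q)) >> q

def max_quotient_error(b, q):
    """Largest gap between the true and emulated quotient for 0 <= a <= 2**q."""
    return max(a // b - approx_div(a, b, q) for a in range((1 << q) + 1))
```

For the divisor $5$ this reproduces the earlier figures: \texttt{approx\_div(19, 5, 3)} yields $2$ and \texttt{approx\_div(19, 5, 4)} yields $3$.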
|
3460 |
|
3461 Let $n$ represent the number of digits in $b$. This algorithm requires approximately $2n^2$ single precision multiplications to produce the quotient and |
|
3462 another $n^2$ single precision multiplications to find the residue. In total $3n^2$ single precision multiplications are required to |
|
3463 reduce the number. |
|
3464 |
|
3465 For example, if $b = 1179677$ and $q = 41$ ($2^q > b^2$), then the reciprocal $\mu$ is equal to $\lfloor 2^q / b \rfloor = 1864089$. Consider reducing |
|
3466 $a = 180388626447$ modulo $b$ using the above reduction equation. The quotient using the new formula is $\lfloor (a \cdot \mu) / 2^q \rfloor = 152913$. |
|
3467 By subtracting $152913b$ from $a$ the correct residue $a \equiv 677346 \mbox{ (mod }b\mbox{)}$ is found. |
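The worked example can be reproduced with a few lines of Python, whose built-in integers are arbitrary precision. This is a sketch of the fixed point reduction equation only, not of the final Barrett algorithm, and the function name is hypothetical.

```python
def barrett_fixed_point(a, b, q):
    """Reduce a modulo b using c = a - b*floor((a*mu)/2**q), then correct."""
    mu = (1 << q) // b            # computed once per modulus in practice
    quotient = (a * mu) >> q      # exact or one too small when 2**q >= a
    c = a - b * quotient
    if c >= b:                    # a single correction is enough in that case
        c -= b
    return mu, quotient, c
```

Calling \texttt{barrett\_fixed\_point(180388626447, 1179677, 41)} returns $\mu = 1864089$, the quotient $152913$ and the residue $677346$.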
|
3468 |
|
3469 \subsection{Choosing a Radix Point} |
|
3470 Using the fixed point representation a modular reduction can be performed with $3n^2$ single precision multiplications. If that were the best |
|
3471 that could be achieved a full division\footnote{A division requires approximately $O(2cn^2)$ single precision multiplications for a small value of $c$. |
|
3472 See~\ref{sec:division} for further details.} might as well be used in its place. The key to optimizing the reduction is to reduce the precision of |
|
3473 the initial multiplication that finds the quotient. |
|
3474 |
|
3475 Let $a$ represent the number of which the residue is sought. Let $b$ represent the modulus used to find the residue. Let $m$ represent |
|
3476 the number of digits in $b$. For the purposes of this discussion we will assume that the number of digits in $a$ is $2m$, which is generally true if |
|
3477 two $m$-digit numbers have been multiplied. Dividing $a$ by $b$ is the same as dividing a $2m$ digit integer by a $m$ digit integer. Digits below the |
|
3478 $m - 1$'th digit of $a$ will contribute at most a value of $1$ to the quotient because $\beta^k \le b$ for any $0 \le k \le m - 1$. Another way to |
|
3479 express this is by re-writing $a$ as two parts. If $a' \equiv a \mbox{ (mod }\beta^{m-1}\mbox{)}$ with $0 \le a' < \beta^{m-1}$ and $a'' = a - a'$ then |
|
3480 ${a \over b} = {{a' + a''} \over b}$ which is equivalent to ${a' \over b} + {a'' \over b}$. Since $a' < \beta^{m-1} \le b$ the quotient |
|
3481 is bounded by $0 \le {a' \over b} < 1$. |
|
3482 |
|
3483 Since the digits of $a'$ do not contribute much to the quotient the observation is that they might as well be zero. However, if the digits |
|
3484 ``might as well be zero'' they might as well not be there in the first place. Let $q_0 = \lfloor a/\beta^{m-1} \rfloor$ represent the input |
|
3485 with the irrelevant digits trimmed. Now the modular reduction is trimmed to the almost equivalent equation |
|
3486 |
|
3487 \begin{equation} |
|
3488 c = a - b \cdot \lfloor (q_0 \cdot \mu) / \beta^{m+1} \rfloor |
|
3489 \end{equation} |
|
3490 |
|
3491 Note that the original divisor $2^q$ has been replaced with $\beta^{m+1}$ where in this case $q$ is a multiple of $lg(\beta)$. Also note that the |
|
3492 exponent of the divisor, $m + 1$, added to the number of digits by which $q_0$ was shifted, $m - 1$, equals $2m$. If the optimization had not been performed the divisor |
|
3493 would have the exponent $2m$ so in the end the exponents do ``add up''. Using the above equation the quotient |
|
3494 $\lfloor (q_0 \cdot \mu) / \beta^{m+1} \rfloor$ can be off from the true quotient by at most two. The original fixed point quotient can be off |
|
3495 by as much as one (\textit{provided the radix point is chosen suitably}) and now that the lower irrelevant digits have been trimmed the quotient |
|
3496 can be off by an additional value of one for a total of at most two. This implies that |
|
3497 $0 \le a - b \cdot \lfloor (q_0 \cdot \mu) / \beta^{m+1} \rfloor < 3b$. By first subtracting $b$ times the quotient and then conditionally subtracting |
|
3498 $b$ once or twice the residue is found. |
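
The bound of two can be confirmed with a short estimate. The following is a sketch added for illustration, assuming $\mu = \lfloor \beta^{2m}/b \rfloor$, $\beta^{m-1} \le b < \beta^m$ and $0 \le a < \beta^{2m}$. Since $\lfloor y \rfloor > y - 1$ for any real $y$,

\begin{center}
\begin{tabular}{rcl}
$q_0 \cdot \mu$ & $>$ & $(a/\beta^{m-1} - 1)(\beta^{2m}/b - 1)$ \\
 & $=$ & $a\beta^{m+1}/b - a/\beta^{m-1} - \beta^{2m}/b + 1$ \\
$(q_0 \cdot \mu)/\beta^{m+1}$ & $>$ & $a/b - a/\beta^{2m} - \beta^{m-1}/b$ \\
 & $>$ & $a/b - 2$ \\
\end{tabular}
\end{center}

The last step uses $a/\beta^{2m} < 1$ and $\beta^{m-1}/b \le 1$. Taking the floor of both sides shows that the estimated quotient is at least $\lfloor a/b \rfloor - 2$. Since both floors only round down, it never exceeds $\lfloor a/b \rfloor$.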
|
3499 |
|
3500 The quotient is now found using $(m + 1)(m) = m^2 + m$ single precision multiplications and the residue with an additional $m^2$ single |
|
3501 precision multiplications, ignoring the subtractions required. In total $2m^2 + m$ single precision multiplications are required to find the residue. |
|
3502 This is considerably faster than the original attempt. |
|
3503 |
|
3504 For example, let $\beta = 10$ represent the radix of the digits. Let $b = 9999$ represent the modulus which implies $m = 4$. Let $a = 99929878$ |
|
3505 represent the value of which the residue is desired. In this case $q = 8$ since $10^7 < 9999^2$ meaning that $\mu = \lfloor \beta^{q}/b \rfloor = 10001$. |
|
3506 With the new observation the multiplicand for the quotient is equal to $q_0 = \lfloor a / \beta^{m - 1} \rfloor = 99929$. The quotient is then |
|
3507 $\lfloor (q_0 \cdot \mu) / \beta^{m+1} \rfloor = 9993$. Subtracting $9993b$ from $a$ yields the correct residue |
|
3508 $a \equiv 9871 \mbox{ (mod }b\mbox{)}$. |
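The trimmed reduction can be sketched in Python as follows. The function name and the tuple it returns are choices made for this illustration. The sketch also counts the conditional subtractions so that the bound of two can be observed.

```python
def barrett_trimmed(a, b, beta=10):
    """Barrett reduction with the low m-1 digits of a trimmed before multiplying."""
    m = 0                                    # number of base-beta digits in b
    while beta ** m <= b:
        m += 1
    mu = beta ** (2 * m) // b                # reciprocal, precomputed in practice
    q0 = a // beta ** (m - 1)                # drop the irrelevant low digits
    quotient = (q0 * mu) // beta ** (m + 1)  # at most two too small
    c = a - b * quotient
    corrections = 0
    while c >= b:                            # conditional subtractions
        c -= b
        corrections += 1
    return mu, q0, quotient, c, corrections
```

For $a = 99929878$ and $b = 9999$ it returns $\mu = 10001$, $q_0 = 99929$, the quotient $9993$ and the residue $9871$ with no corrections required.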
|
3509 |
|
3510 \subsection{Trimming the Quotient} |
|
3511 So far the reduction algorithm has been optimized from $3m^2$ single precision multiplications down to $2m^2 + m$ single precision multiplications. As |
|
3512 it stands now the algorithm is already fairly fast compared to a full integer division algorithm. However, there is still room for |
|
3513 optimization. |
|
3514 |
|
3515 After the first multiplication inside the quotient ($q_0 \cdot \mu$) the value is shifted right by $m + 1$ places effectively nullifying the lower |
|
3516 half of the product. It would be nice to be able to remove those digits from the product to effectively cut down the number of single precision |
|
3517 multiplications. If the number of digits in the modulus $m$ is far less than $\beta$ a full product is not required for the algorithm to work properly. |
|
3518 In fact the lower $m - 2$ digits will not affect the upper half of the product at all and do not need to be computed. |
|
3519 |
|
3520 The value of $\mu$ is a $m$-digit number and $q_0$ is a $m + 1$ digit number. Using a full multiplier $(m + 1)(m) = m^2 + m$ single precision |
|
3521 multiplications would be required. Using a multiplier that will only produce digits at and above the $m - 1$'th digit reduces the number |
|
3522 of single precision multiplications to ${m^2 + m} \over 2$. |
|
3523 |
|
3524 \subsection{Trimming the Residue} |
|
3525 After the quotient has been calculated it is used to reduce the input. As previously noted the algorithm is not exact and it can be off by a small |
|
3526 multiple of the modulus, that is $0 \le a - b \cdot \lfloor (q_0 \cdot \mu) / \beta^{m+1} \rfloor < 3b$. If $b$ is $m$ digits then the |
|
3527 result of the reduction equation is a value of at most $m + 1$ digits (\textit{provided $3 < \beta$}) implying that the upper $m - 1$ digits are |
|
3528 implicitly zero. |
|
3529 |
|
3530 The next optimization arises from this very fact. Instead of computing $b \cdot \lfloor (q_0 \cdot \mu) / \beta^{m+1} \rfloor$ using a full |
|
3531 $O(m^2)$ multiplication algorithm only the lower $m+1$ digits of the product have to be computed. Similarly the value of $a$ can |
|
3532 be reduced modulo $\beta^{m+1}$ before the multiple of $b$ is subtracted which simplifies the subtraction as well. A multiplication that produces |
|
3533 only the lower $m+1$ digits requires ${m^2 + 3m - 2} \over 2$ single precision multiplications. |
|
3534 |
|
3535 With both optimizations in place the algorithm is the algorithm Barrett proposed. It requires $m^2 + 2m - 1$ single precision multiplications which |
|
3536 is considerably faster than the straightforward $3m^2$ method. |
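
The total follows from adding the costs of the two trimmed multiplications, using the counts stated in the two preceding subsections.

\[
{{m^2 + m} \over 2} + {{m^2 + 3m - 2} \over 2} = {{2m^2 + 4m - 2} \over 2} = m^2 + 2m - 1
\]

As an illustration, for $m = 32$ this gives $1087$ single precision multiplications compared to $3m^2 = 3072$.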
|
3537 |
|
3538 \subsection{The Barrett Algorithm} |
|
3539 \newpage\begin{figure}[!here] |
|
3540 \begin{small} |
|
3541 \begin{center} |
|
3542 \begin{tabular}{l} |
|
3543 \hline Algorithm \textbf{mp\_reduce}. \\ |
|
3544 \textbf{Input}. mp\_int $a$, mp\_int $b$ and $\mu = \lfloor \beta^{2m}/b \rfloor, m = \lceil lg_{\beta}(b) \rceil, (0 \le a < b^2, b > 1)$ \\ |
|
3545 \textbf{Output}. $a \mbox{ (mod }b\mbox{)}$ \\ |
|
3546 \hline \\ |
|
3547 Let $m$ represent the number of digits in $b$. \\ |
|
3548 1. Make a copy of $a$ and store it in $q$. (\textit{mp\_init\_copy}) \\ |
|
3549 2. $q \leftarrow \lfloor q / \beta^{m - 1} \rfloor$ (\textit{mp\_rshd}) \\ |
|
3550 \\ |
|
3551 Produce the quotient. \\ |
|
3552 3. $q \leftarrow q \cdot \mu$ (\textit{note: only produce digits at or above $m-1$}) \\ |
|
3553 4. $q \leftarrow \lfloor q / \beta^{m + 1} \rfloor$ \\ |
|
3554 \\ |
|
3555 Subtract the multiple of modulus from the input. \\ |
|
3556 5. $a \leftarrow a \mbox{ (mod }\beta^{m+1}\mbox{)}$ (\textit{mp\_mod\_2d}) \\ |
|
3557 6. $q \leftarrow q \cdot b \mbox{ (mod }\beta^{m+1}\mbox{)}$ (\textit{s\_mp\_mul\_digs}) \\ |
|
3558 7. $a \leftarrow a - q$ (\textit{mp\_sub}) \\ |
|
3559 \\ |
|
3560 Add $\beta^{m+1}$ if a carry occurred. \\ |
|
3561 8. If $a < 0$ then (\textit{mp\_cmp\_d}) \\ |
|
3562 \hspace{3mm}8.1 $q \leftarrow 1$ (\textit{mp\_set}) \\ |
|
3563 \hspace{3mm}8.2 $q \leftarrow q \cdot \beta^{m+1}$ (\textit{mp\_lshd}) \\ |
|
3564 \hspace{3mm}8.3 $a \leftarrow a + q$ \\ |
|
3565 \\ |
|
3566 Now subtract the modulus if the residue is too large (i.e. the quotient was too small). \\ |
|
3567 9. While $a \ge b$ do (\textit{mp\_cmp}) \\ |
|
3568 \hspace{3mm}9.1 $a \leftarrow a - b$ \\ |
|
3569 10. Clear $q$. \\ |
|
3570 11. Return(\textit{MP\_OKAY}) \\ |
|
3571 \hline |
|
3572 \end{tabular} |
|
3573 \end{center} |
|
3574 \end{small} |
|
3575 \caption{Algorithm mp\_reduce} |
|
3576 \end{figure} |
|
3577 |
|
3578 \textbf{Algorithm mp\_reduce.} |
|
3579 This algorithm will reduce the input $a$ modulo $b$ in place using the Barrett algorithm. It is loosely based on algorithm 14.42 of HAC |
|
3580 \cite[pp. 602]{HAC} which is based on the paper from Paul Barrett \cite{BARRETT}. The algorithm has several restrictions and assumptions which must |
|
3581 be adhered to for the algorithm to work. |
|
3582 |
|
3583 First the modulus $b$ is assumed to be positive and greater than one. If the modulus were less than or equal to one then subtracting |
|
3584 a multiple of it would either accomplish nothing or actually enlarge the input. The input $a$ must be in the range $0 \le a < b^2$ in order |
|
3585 for the quotient to have enough precision. If $a$ is the product of two numbers that were already reduced modulo $b$, this will not be a problem. |
|
3586 Technically the algorithm will still work if $a \ge b^2$ but it will take much longer to finish. The value of $\mu$ is passed as an argument to this |
|
3587 algorithm and is assumed to be calculated and stored before the algorithm is used. |
|
3588 |
|
3589 Recall that the multiplication for the quotient on step 3 must only produce digits at or above the $m-1$'th position. An algorithm called |
|
3590 $s\_mp\_mul\_high\_digs$ which has not been presented is used to accomplish this task. The algorithm is based on $s\_mp\_mul\_digs$ except that |
|
3591 instead of stopping at a given level of precision it starts at a given level of precision. This optimal algorithm can only be used if the number |
|
3592 of digits in $b$ is very much smaller than $\beta$. |
|
3593 |
|
3594 While it is known that |
|
3595 $a \ge b \cdot \lfloor (q_0 \cdot \mu) / \beta^{m+1} \rfloor$ only the lower $m+1$ digits are being used to compute the residue, so an implied |
|
3596 ``borrow'' from the higher digits might leave a negative result. After the multiple of the modulus has been subtracted from $a$ the residue must be |
|
3597 fixed up in case it is negative. The invariant $\beta^{m+1}$ must be added to the residue to make it positive again. |
|
3598 |
|
3599 The while loop at step 9 will subtract $b$ until the residue is less than $b$. If the algorithm is performed correctly this step is |
|
3600 performed at most twice, and on average once. However, if $a \ge b^2$ then it will iterate substantially more times than it should. |
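The steps of algorithm mp\_reduce can be followed on ordinary integers. The sketch below is written in Python and is not the library's C implementation. It uses full products where the algorithm trims them, takes $\mu = \lfloor \beta^{2m}/b \rfloor$ as given in the algorithm's input, and its helper names are hypothetical.

```python
def digit_count(b, beta):
    """Number of base-beta digits in b (m in the text)."""
    m = 0
    while b > 0:
        b //= beta
        m += 1
    return m

def reduce_setup(b, beta):
    """Precompute mu = floor(beta**(2m) / b) for the modulus b."""
    return beta ** (2 * digit_count(b, beta)) // b

def barrett_reduce(a, b, mu, beta):
    """Follow steps 1-9 of algorithm mp_reduce on plain integers."""
    m = digit_count(b, beta)
    q = a // beta ** (m - 1)          # step 2: trim the low m-1 digits
    q = q * mu                        # step 3: full product in this sketch
    q = q // beta ** (m + 1)          # step 4: estimate of the quotient
    limit = beta ** (m + 1)
    a = a % limit                     # step 5: keep only the low m+1 digits
    q = (q * b) % limit               # step 6: low m+1 digits of q*b
    a = a - q                         # step 7
    if a < 0:                         # step 8: undo the implied borrow
        a += limit
    while a >= b:                     # step 9: at most two subtractions
        a -= b
    return a
```

With $\beta = 10$ and $b = 9999$ the call \texttt{barrett\_reduce(99929878, 9999, reduce\_setup(9999, 10), 10)} returns $9871$. In this run the subtraction at step 7 is $29878 - 20007$, so only the low $m + 1$ digits of both operands take part.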
|
3601 |
|
3602 EXAM,bn_mp_reduce.c |
|
3603 |
|
3604 The first multiplication that determines the quotient can be performed by only producing the digits from $m - 1$ and up. This essentially halves |
|
3605 the number of single precision multiplications required. However, the optimization is only safe if $\beta$ is much larger than the number of digits |
|
3606 in the modulus. In the source code this is evaluated on lines @36,if@ to @44,}@ where algorithm s\_mp\_mul\_high\_digs is used when it is |
|
3607 safe to do so. |
|
3608 |
|
3609 \subsection{The Barrett Setup Algorithm} |
|
3610 In order to use algorithm mp\_reduce the value of $\mu$ must be calculated in advance. Ideally this value should be computed once and stored for |
|
3611 future use so that the Barrett algorithm can be used without delay. |
|
3612 |
|
3613 \begin{figure}[!here] |
|
3614 \begin{small} |
|
3615 \begin{center} |
|
3616 \begin{tabular}{l} |
|
3617 \hline Algorithm \textbf{mp\_reduce\_setup}. \\ |
|
3618 \textbf{Input}. mp\_int $b$ ($b > 1$) with $m$ digits \\ |
|
3619 \textbf{Output}. $\mu \leftarrow \lfloor \beta^{2m}/b \rfloor$ \\ |
|
3620 \hline \\ |
|
3621 1. $\mu \leftarrow 2^{2 \cdot lg(\beta) \cdot m}$ (\textit{mp\_2expt}) \\ |
|
3622 2. $\mu \leftarrow \lfloor \mu / b \rfloor$ (\textit{mp\_div}) \\ |
|
3623 3. Return(\textit{MP\_OKAY}) \\ |
|
3624 \hline |
|
3625 \end{tabular} |
|
3626 \end{center} |
|
3627 \end{small} |
|
3628 \caption{Algorithm mp\_reduce\_setup} |
|
3629 \end{figure} |
|
3630 |
|
3631 \textbf{Algorithm mp\_reduce\_setup.} |
|
3632 This algorithm computes the reciprocal $\mu$ required for Barrett reduction. First $\beta^{2m}$ is calculated as $2^{2 \cdot lg(\beta) \cdot m}$ which |
|
3633 is equivalent and much faster. The final value is computed by taking the integer quotient of $\lfloor \mu / b \rfloor$. |
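When $\beta$ is a power of two the setup amounts to setting a single bit and performing one division. The sketch below is in Python. The digit size of 28 bits and the function name are assumptions made for the example.

```python
DIGIT_BIT = 28                       # lg(beta); an assumed digit size

def reduce_setup_pow2(b, digit_bit=DIGIT_BIT):
    """Compute mu = floor(beta**(2m) / b) where beta = 2**digit_bit."""
    m = -(-b.bit_length() // digit_bit)    # digits in b, rounded up
    mu = 1 << (2 * digit_bit * m)          # step 1: 2**(2*lg(beta)*m), one set bit
    return mu // b                         # step 2: integer quotient
```

The shift in step 1 avoids computing $\beta^{2m}$ by repeated multiplication.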
|
3634 |
|
3635 EXAM,bn_mp_reduce_setup.c |
|
3636 |
|
3637 This simple routine calculates the reciprocal $\mu$ required by Barrett reduction. Note the extended usage of algorithm mp\_div where the variable |
|
3638 which would receive the remainder is passed as NULL. As will be discussed in~\ref{sec:division} the division routine allows both the quotient and the |
|
3639 remainder to be passed as NULL meaning to ignore the value. |
|
3640 |
|
3641 \section{The Montgomery Reduction} |
|
3642 Montgomery reduction\footnote{Thanks to Niels Ferguson for his insightful explanation of the algorithm.} \cite{MONT} is by far the most interesting |
|
3643 form of reduction in common use. It computes a modular residue which is not actually equal to the residue of the input but is instead equal to the |
|
3644 residue times a constant. However, as perplexing as this may sound the algorithm is relatively simple and very efficient. |
|
3645 |
|
3646 Throughout this entire section the variable $n$ will represent the modulus used to form the residue. As will be discussed shortly the value of |
|
3647 $n$ must be odd. The variable $x$ will represent the quantity of which the residue is sought. Similar to the Barrett algorithm the input |
|
3648 is restricted to $0 \le x < n^2$. To begin the description some simple number theory facts must be established. |
|
3649 |
|
3650 \textbf{Fact 1.} Adding $n$ to $x$ does not change the residue since in effect it adds one to the quotient $\lfloor x / n \rfloor$. Another way |
|
3651 to explain this is that $n$ is (\textit{or multiples of $n$ are}) congruent to zero modulo $n$. Adding zero will not change the value of the residue. |
|
3652 |
|
3653 \textbf{Fact 2.} If $x$ is even then performing a division by two in $\Z$ is congruent to $x \cdot 2^{-1} \mbox{ (mod }n\mbox{)}$. Actually |
|
3654 this is an application of the fact that if $x$ is evenly divisible by any $k \in \Z$ then division in $\Z$ will be congruent to |
|
3655 multiplication by $k^{-1}$ modulo $n$. |
|
3656 |
|
3657 From these two simple facts the following simple algorithm can be derived. |
|
3658 |
|
3659 \newpage\begin{figure}[!here] |
|
3660 \begin{small} |
|
3661 \begin{center} |
|
3662 \begin{tabular}{l} |
|
3663 \hline Algorithm \textbf{Montgomery Reduction}. \\ |
|
3664 \textbf{Input}. Integer $x$, $n$ and $k$ \\ |
|
3665 \textbf{Output}. $2^{-k}x \mbox{ (mod }n\mbox{)}$ \\ |
|
3666 \hline \\ |
|
3667 1. for $t$ from $1$ to $k$ do \\ |
|
3668 \hspace{3mm}1.1 If $x$ is odd then \\ |
|
3669 \hspace{6mm}1.1.1 $x \leftarrow x + n$ \\ |
|
3670 \hspace{3mm}1.2 $x \leftarrow x/2$ \\ |
|
3671 2. Return $x$. \\ |
|
3672 \hline |
|
3673 \end{tabular} |
|
3674 \end{center} |
|
3675 \end{small} |
|
3676 \caption{Algorithm Montgomery Reduction} |
|
3677 \end{figure} |
|
3678 |
|
3679 The algorithm reduces the input one bit at a time using the two congruences stated previously. Inside the loop $n$, which is odd, is |
|
3680 added to $x$ if $x$ is odd. This forces $x$ to be even which allows the division by two in $\Z$ to be congruent to a modular division by two. Since |
|
3681 $x$ is assumed to be initially much larger than $n$ the addition of $n$ will contribute an insignificant magnitude to $x$. Let $r$ represent the |
|
3682 final result of the Montgomery algorithm. If $k > lg(n)$ and $0 \le x < n^2$ then the final result is limited to |
|
3683 $0 \le r < \lfloor x/2^k \rfloor + n$. As a result at most a single subtraction is required to get the residue desired. |
|
3684 |
|
3685 \begin{figure}[here] |
|
3686 \begin{small} |
|
3687 \begin{center} |
|
3688 \begin{tabular}{|c|l|} |
|
3689 \hline \textbf{Step number ($t$)} & \textbf{Result ($x$)} \\ |
|
3690 \hline $1$ & $x + n = 5812$, $x/2 = 2906$ \\ |
|
3691 \hline $2$ & $x/2 = 1453$ \\ |
|
3692 \hline $3$ & $x + n = 1710$, $x/2 = 855$ \\ |
|
3693 \hline $4$ & $x + n = 1112$, $x/2 = 556$ \\ |
|
3694 \hline $5$ & $x/2 = 278$ \\ |
|
3695 \hline $6$ & $x/2 = 139$ \\ |
|
3696 \hline $7$ & $x + n = 396$, $x/2 = 198$ \\ |
|
3697 \hline $8$ & $x/2 = 99$ \\ |
|
3698 \hline |
|
3699 \end{tabular} |
|
3700 \end{center} |
|
3701 \end{small} |
|
3702 \caption{Example of Montgomery Reduction (I)} |
|
3703 \label{fig:MONT1} |
|
3704 \end{figure} |
|
3705 |
|
3706 Consider the example in figure~\ref{fig:MONT1} which reduces $x = 5555$ modulo $n = 257$ when $k = 8$. The result of the algorithm $r = 99$ is |
|
3707 congruent to the value of $2^{-8} \cdot 5555 \mbox{ (mod }257\mbox{)}$. When $r$ is multiplied by $2^8$ modulo $257$ the correct residue |
|
3708 $r \equiv 158$ is produced. |
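The bit-by-bit algorithm translates almost directly into code. The following Python sketch is illustrative only and its function name is hypothetical. It assumes that $n$ is odd, as the text requires.

```python
def montgomery_bitwise(x, n, k):
    """Return a value congruent to 2**(-k) * x modulo the odd modulus n."""
    for _ in range(k):
        if x & 1:          # x odd: adding the odd n makes it even (fact 1)
            x += n
        x >>= 1            # exact division by two (fact 2)
    return x
```

For $x = 5555$, $n = 257$ and $k = 8$ it returns $99$. Multiplying by $2^8$ modulo $257$ recovers the residue $158$.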
|
3709 |
|
3710 Let $k = \lfloor lg(n) \rfloor + 1$ represent the number of bits in $n$. The current algorithm requires $2k^2$ single precision shifts |
|
3711 and $k^2$ single precision additions. At this rate the algorithm is most certainly slower than Barrett reduction and not terribly useful. |
|
3712 Fortunately there exists an alternative representation of the algorithm. |
|
3713 |
|
3714 \begin{figure}[!here] |
|
3715 \begin{small} |
|
3716 \begin{center} |
|
3717 \begin{tabular}{l} |
|
3718 \hline Algorithm \textbf{Montgomery Reduction} (modified I). \\ |
|
3719 \textbf{Input}. Integer $x$, $n$ and $k$ \\ |
|
3720 \textbf{Output}. $2^{-k}x \mbox{ (mod }n\mbox{)}$ \\ |
|
3721 \hline \\ |
|
3722 1. for $t$ from $0$ to $k - 1$ do \\ |
|
3723 \hspace{3mm}1.1 If the $t$'th bit of $x$ is one then \\ |
|
3724 \hspace{6mm}1.1.1 $x \leftarrow x + 2^tn$ \\ |
|
3725 2. Return $x/2^k$. \\ |
|
3726 \hline |
|
3727 \end{tabular} |
|
3728 \end{center} |
|
3729 \end{small} |
|
3730 \caption{Algorithm Montgomery Reduction (modified I)} |
|
3731 \end{figure} |
|
3732 |
|
3733 This algorithm is equivalent since $2^tn$ is a multiple of $n$ and the lower $k$ bits of $x$ are zero by step 2. The number of single |
|
3734 precision shifts has now been reduced from $2k^2$ to $k^2 + k$ which is only a small improvement. |
|
3735 |
|
3736 \begin{figure}[here] |
|
3737 \begin{small} |
|
3738 \begin{center} |
|
3739 \begin{tabular}{|c|l|r|} |
|
3740 \hline \textbf{Step number ($t$)} & \textbf{Result ($x$)} & \textbf{Result ($x$) in Binary} \\ |
|
3741 \hline -- & $5555$ & $1010110110011$ \\ |
|
3742 \hline $1$ & $x + 2^{0}n = 5812$ & $1011010110100$ \\ |
|
3743 \hline $2$ & $5812$ & $1011010110100$ \\ |
|
3744 \hline $3$ & $x + 2^{2}n = 6840$ & $1101010111000$ \\ |
|
3745 \hline $4$ & $x + 2^{3}n = 8896$ & $10001011000000$ \\ |
|
3746 \hline $5$ & $8896$ & $10001011000000$ \\ |
|
3747 \hline $6$ & $8896$ & $10001011000000$ \\ |
|
3748 \hline $7$ & $x + 2^{6}n = 25344$ & $110001100000000$ \\ |
|
3749 \hline $8$ & $25344$ & $110001100000000$ \\ |
|
3750 \hline -- & $x/2^k = 99$ & \\ |
|
3751 \hline |
|
3752 \end{tabular} |
|
3753 \end{center} |
|
3754 \end{small} |
|
3755 \caption{Example of Montgomery Reduction (II)} |
|
3756 \label{fig:MONT2} |
|
3757 \end{figure} |
|
3758 |
|
3759 Figure~\ref{fig:MONT2} demonstrates the modified algorithm reducing $x = 5555$ modulo $n = 257$ with $k = 8$. |
|
3760 With this algorithm a single shift right at the end is the only right shift required to reduce the input instead of $k$ right shifts inside the |
|
3761 loop. Note that in the steps numbered $2, 5, 6$ and $8$ the result $x$ is not changed. In those steps the bit of $x$ being examined is already |
|
3762 zero and the appropriate multiple of $n$ does not need to be added to force that bit of the result to zero. |
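The modified algorithm can be sketched the same way. This Python illustration uses a hypothetical function name and returns both the reduced value and the value of $x$ before the final shift, so that the intermediate result can be inspected.

```python
def montgomery_shifted(x, n, k):
    """Zero the low k bits of x by adding shifted copies of n, then shift once."""
    for t in range(k):
        if (x >> t) & 1:       # bit t still set: cancel it with 2**t * n
            x += n << t
    return x >> k, x           # reduced value and the value before the final shift
```

For $x = 5555$, $n = 257$ and $k = 8$ it returns $99$ together with the intermediate value $25344$.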
|
3763 |
|
3764 \subsection{Digit Based Montgomery Reduction} |
|
3765 Instead of computing the reduction on a bit-by-bit basis it is actually much faster to compute it on a digit-by-digit basis. Consider the |
|
3766 previous algorithm re-written to compute the Montgomery reduction in this new fashion. |
|
3767 |
|
3768 \begin{figure}[!here] |
|
3769 \begin{small} |
|
3770 \begin{center} |
|
3771 \begin{tabular}{l} |
|
3772 \hline Algorithm \textbf{Montgomery Reduction} (modified II). \\ |
|
3773 \textbf{Input}. Integer $x$, $n$ and $k$ \\ |
|
3774 \textbf{Output}. $\beta^{-k}x \mbox{ (mod }n\mbox{)}$ \\ |
|
3775 \hline \\ |
|
3776 1. for $t$ from $0$ to $k - 1$ do \\ |
|
3777 \hspace{3mm}1.1 $x \leftarrow x + \mu n \beta^t$ \\ |
|
3778 2. Return $x/\beta^k$. \\ |
|
3779 \hline |
|
3780 \end{tabular} |
|
3781 \end{center} |
|
3782 \end{small} |
|
3783 \caption{Algorithm Montgomery Reduction (modified II)} |
|
3784 \end{figure} |
|
3785 |
|
3786 The value $\mu n \beta^t$ is a multiple of the modulus $n$ meaning that it will not change the residue. If the first digit of |
|
3787 the value $\mu n \beta^t$ equals the negative (modulo $\beta$) of the $t$'th digit of $x$ then the addition will result in a zero digit. This |
|
3788 problem breaks down to solving the following congruence. |
|
3789 |
|
3790 \begin{center} |
|
3791 \begin{tabular}{rcl} |
|
3792 $x_t + \mu n_0$ & $\equiv$ & $0 \mbox{ (mod }\beta\mbox{)}$ \\ |
|
3793 $\mu n_0$ & $\equiv$ & $-x_t \mbox{ (mod }\beta\mbox{)}$ \\ |
|
3794 $\mu$ & $\equiv$ & $-x_t/n_0 \mbox{ (mod }\beta\mbox{)}$ \\ |
|
3795 \end{tabular} |
|
3796 \end{center} |
|
3797 |
|
3798 In each iteration of the loop on step 1 a new value of $\mu$ must be calculated. The value of $-1/n_0 \mbox{ (mod }\beta\mbox{)}$ is used |
|
3799 extensively in this algorithm and should be precomputed. Let $\rho$ represent the negative of the modular inverse of $n_0$ modulo $\beta$. |
|
3800 |
|
3801 For example, let $\beta = 10$ represent the radix. Let $n = 17$ represent the modulus which implies $k = 2$ and $\rho \equiv 7$. Let $x = 33$ |
|
3802 represent the value to reduce. |
|
3803 |
|
3804 \newpage\begin{figure} |
|
3805 \begin{center} |
|
3806 \begin{tabular}{|c|c|c|} |
|
3807 \hline \textbf{Step ($t$)} & \textbf{Value of $x$} & \textbf{Value of $\mu$} \\ |
|
3808 \hline -- & $33$ & --\\ |
|
3809 \hline $0$ & $33 + \mu n = 50$ & $1$ \\ |
|
3810 \hline $1$ & $50 + \mu n \beta = 900$ & $5$ \\ |
|
3811 \hline |
|
3812 \end{tabular} |
|
3813 \end{center} |
|
3814 \caption{Example of Montgomery Reduction} |
|
3815 \end{figure} |
|
3816 |
|
3817 The final value $900$ is then divided by $\beta^k$ to produce the result $9$. The first observation is that $9 \nequiv x \mbox{ (mod }n\mbox{)}$ |
|
3818 which implies the result is not the modular residue of $x$ modulo $n$. However, recall that the residue is actually multiplied by $\beta^{-k}$ in |
|
3819 the algorithm. To get the true residue the value must be multiplied by $\beta^k$. In this case $\beta^k \equiv 15 \mbox{ (mod }n\mbox{)}$ and |
|
3820 the correct residue is $9 \cdot 15 \equiv 16 \mbox{ (mod }n\mbox{)}$. |
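The digit based reduction, including the computation of $\rho$, can be sketched in Python as follows. The function names are hypothetical. The modular inverse uses the three argument \texttt{pow} with a negative exponent, which assumes Python 3.8 or later.

```python
def montgomery_setup(n, beta):
    """Return rho = -1/n_0 (mod beta); n_0 must be invertible modulo beta."""
    n0 = n % beta
    return (-pow(n0, -1, beta)) % beta     # modular inverse needs Python 3.8+

def montgomery_digits(x, n, k, beta):
    """Return beta**(-k) * x (mod n) by zeroing one base-beta digit per step."""
    rho = montgomery_setup(n, beta)
    for t in range(k):
        x_t = (x // beta ** t) % beta      # t'th digit of the running value
        mu = (x_t * rho) % beta            # makes digit t of x + mu*n*beta**t zero
        x += mu * n * beta ** t
    return x // beta ** k
```

With $\beta = 10$, $n = 17$, $k = 2$ and $x = 33$ the setup yields $\rho = 7$ and the reduction returns $9$.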
|
3821 |
|
3822 \subsection{Baseline Montgomery Reduction} |
|
3823 The baseline Montgomery reduction algorithm will produce the residue for any size input. It is designed to be a catch-all algorithm for |
|
3824 Montgomery reductions. |
|
3825 |
|
3826 \newpage\begin{figure}[!here] |
|
3827 \begin{small} |
|
3828 \begin{center} |
|
3829 \begin{tabular}{l} |
|
3830 \hline Algorithm \textbf{mp\_montgomery\_reduce}. \\ |
|
3831 \textbf{Input}. mp\_int $x$, mp\_int $n$ and a digit $\rho \equiv -1/n_0 \mbox{ (mod }\beta\mbox{)}$. \\ |
|
3832 \hspace{11.5mm}($0 \le x < n^2, n > 1, (n, \beta) = 1, \beta^k > n$) \\ |
|
3833 \textbf{Output}. $\beta^{-k}x \mbox{ (mod }n\mbox{)}$ \\ |
|
3834 \hline \\ |
|
3835 1. $digs \leftarrow 2n.used + 1$ \\ |
|
3836 2. If $digs < MP\_WARRAY$ and $n.used < \delta$ then \\ |
|
3837 \hspace{3mm}2.1 Use algorithm fast\_mp\_montgomery\_reduce instead. \\ |
|
3838 \\ |
|
3839 Setup $x$ for the reduction. \\ |
|
3840 3. If $x.alloc < digs$ then grow $x$ to $digs$ digits. \\ |
|
3841 4. $x.used \leftarrow digs$ \\ |
|
3842 \\ |
|
3843 Eliminate the lower $k$ digits. \\ |
|
3844 5. For $ix$ from $0$ to $k - 1$ do \\ |
|
3845 \hspace{3mm}5.1 $\mu \leftarrow x_{ix} \cdot \rho \mbox{ (mod }\beta\mbox{)}$ \\ |
|
3846 \hspace{3mm}5.2 $u \leftarrow 0$ \\ |
|
3847 \hspace{3mm}5.3 For $iy$ from $0$ to $k - 1$ do \\ |
|
3848 \hspace{6mm}5.3.1 $\hat r \leftarrow \mu n_{iy} + x_{ix + iy} + u$ \\ |
|
3849 \hspace{6mm}5.3.2 $x_{ix + iy} \leftarrow \hat r \mbox{ (mod }\beta\mbox{)}$ \\ |
|
3850 \hspace{6mm}5.3.3 $u \leftarrow \lfloor \hat r / \beta \rfloor$ \\ |
|
3851 \hspace{3mm}5.4 While $u > 0$ do \\ |
|
3852 \hspace{6mm}5.4.1 $iy \leftarrow iy + 1$ \\ |
|
3853 \hspace{6mm}5.4.2 $x_{ix + iy} \leftarrow x_{ix + iy} + u$ \\ |
|
3854 \hspace{6mm}5.4.3 $u \leftarrow \lfloor x_{ix+iy} / \beta \rfloor$ \\ |
|
3855 \hspace{6mm}5.4.4 $x_{ix + iy} \leftarrow x_{ix+iy} \mbox{ (mod }\beta\mbox{)}$ \\ |
|
3856 \\ |
|
3857 Divide by $\beta^k$ and fix up as required. \\ |
|
3858 6. $x \leftarrow \lfloor x / \beta^k \rfloor$ \\ |
|
3859 7. If $x \ge n$ then \\ |
|
3860 \hspace{3mm}7.1 $x \leftarrow x - n$ \\ |
|
3861 8. Return(\textit{MP\_OKAY}). \\ |
|
3862 \hline |
|
3863 \end{tabular} |
|
3864 \end{center} |
|
3865 \end{small} |
|
3866 \caption{Algorithm mp\_montgomery\_reduce} |
|
3867 \end{figure} |
|
3868 |
|
3869 \textbf{Algorithm mp\_montgomery\_reduce.} |
|
3870 This algorithm reduces the input $x$ modulo $n$ in place using the Montgomery reduction algorithm. The algorithm is loosely based |
|
3871 on algorithm 14.32 of \cite[pp.601]{HAC} except it merges the multiplication of $\mu n \beta^t$ with the addition in the inner loop. The |
|
3872 restrictions on this algorithm are fairly easy to adapt to. First $0 \le x < n^2$ bounds the input to numbers in the same range as |
|
3873 for the Barrett algorithm. Additionally if $n > 1$ and $n$ is odd there will exist a modular inverse $\rho$. $\rho$ must be calculated in |
|
3874 advance of this algorithm. Finally the variable $k$ is fixed and is an alias for $n.used$. |
|
3875 |
|
3876 Step 2 decides whether a faster Montgomery algorithm can be used. It is based on the Comba technique meaning that there are limits on |
|
3877 the size of the input. This algorithm is discussed in ~COMBARED~. |
|
3878 |
|
3879 Step 5 is the main reduction loop of the algorithm. The value of $\mu$ is calculated once per iteration in the outer loop. The inner loop |
|
3880 calculates $x + \mu n \beta^{ix}$ by multiplying $\mu n$ and adding the result to $x$ shifted by $ix$ digits. Both the addition and |
|
3881 multiplication are performed in the same loop to save time and memory. Step 5.4 will handle any additional carries that escape the inner loop. |
|
3882 |
|
3883 Using a quick inspection this algorithm requires $k$ single precision multiplications for the outer loop and $k^2$ single precision multiplications |
|
3884 in the inner loop. In total $k^2 + k$ single precision multiplications are required, which compares favourably to Barrett at $k^2 + 2k - 1$ single precision |
|
3885 multiplications. |
|
3886 |
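The main loop and the final fix up can be sketched on raw digit arrays as follows. This is a simplified illustration rather than the library routine: the function name, the fixed $28$-bit digit size and the calling convention (the caller supplies $2k + 1$ digits of storage for $x$) are all assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

#define DIGIT_BIT  28
#define DIGIT_MASK ((((uint64_t)1) << DIGIT_BIT) - 1)

/* Sketch of the baseline reduction on raw digit arrays (illustrative
 * names, not the library interface).  x holds 2k+1 digits, least
 * significant first with unused upper digits zero, n holds k digits and
 * rho is -1/n[0] mod beta.  On return x[0..k-1] holds beta^-k * x mod n. */
void sketch_montgomery_reduce(uint32_t *x, const uint32_t *n, int k,
                              uint32_t rho)
{
    uint64_t mu, u, r;
    int ix, iy, ge;

    assert(k >= 1 && (n[0] & 1) == 1);

    for (ix = 0; ix < k; ix++) {
        mu = ((uint64_t)x[ix] * rho) & DIGIT_MASK;    /* step 5.1 */
        u = 0;                                        /* step 5.2 */
        for (iy = 0; iy < k; iy++) {                  /* step 5.3 */
            r = mu * n[iy] + x[ix + iy] + u;
            x[ix + iy] = (uint32_t)(r & DIGIT_MASK);
            u = r >> DIGIT_BIT;
        }
        for (iy = ix + k; u != 0; iy++) {             /* step 5.4 */
            r = x[iy] + u;
            x[iy] = (uint32_t)(r & DIGIT_MASK);
            u = r >> DIGIT_BIT;
        }
    }

    /* step 6: divide by beta^k by moving k+1 digits down k places */
    for (ix = 0; ix <= k; ix++) {
        x[ix] = x[ix + k];
    }
    for (ix = k + 1; ix <= 2 * k; ix++) {
        x[ix] = 0;
    }

    /* step 7: decide whether x >= n, scanning from the top digit */
    ge = 1;
    if (x[k] == 0) {
        for (ix = k - 1; ix >= 0; ix--) {
            if (x[ix] != n[ix]) {
                ge = (x[ix] > n[ix]);
                break;
            }
        }
    }
    if (ge) {
        u = 0;                                        /* borrow */
        for (ix = 0; ix < k; ix++) {                  /* step 7.1 */
            r = (uint64_t)x[ix] - n[ix] - u;          /* wraps on borrow */
            x[ix] = (uint32_t)(r & DIGIT_MASK);
            u = r >> 63;                              /* 1 if a borrow occurred */
        }
        x[k] -= (uint32_t)u;
    }
}
```

As a check, take $n = 1000\beta + 1$ so that $n_0 = 1$ and $\rho = \beta - 1$, and let $x = 3n + 5\beta^2$. Since $x \equiv 5\beta^2 \mbox{ (mod }n\mbox{)}$ the routine should leave the single digit $5$ in $x$.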
|
3887 EXAM,bn_mp_montgomery_reduce.c |
|
3888 |
|
3889 This is the baseline implementation of the Montgomery reduction algorithm. Lines @30,digs@ to @35,}@ determine if the Comba based |
|
3890 routine can be used instead. Line @47,mu@ computes the value of $\mu$ for that particular iteration of the outer loop. |
|
3891 |
|
3892 The multiplication $\mu n \beta^{ix}$ is performed in one step in the inner loop. The alias $tmpx$ refers to the $ix$'th digit of $x$ and |
|
3893 the alias $tmpn$ refers to the modulus $n$. |
|
3894 |
|
3895 \subsection{Faster ``Comba'' Montgomery Reduction} |
|
3896 MARK,COMBARED |
|
3897 |
|
3898 The Montgomery reduction requires fewer single precision multiplications than a Barrett reduction; however, it is much slower due to the serial |
|
3899 nature of the inner loop. The Barrett reduction algorithm requires two slightly modified multipliers which can be implemented with the Comba |
|
3900 technique. The Montgomery reduction algorithm cannot directly use the Comba technique to any significant advantage since the inner loop calculates |
|
3901 a $k \times 1$ product $k$ times. |
|
3902 |
|
3903 The biggest obstacle is that at the $ix$'th iteration of the outer loop the value of $x_{ix}$ is required to calculate $\mu$. This means the |
|
3904 carries from $0$ to $ix - 1$ must have been propagated upwards to form a valid $ix$'th digit. The solution as it turns out is very simple. |
|
3905 Perform a Comba like multiplier and inside the outer loop just after the inner loop fix up the $ix + 1$'th digit by forwarding the carry. |
|
3906 |
|
3907 With this change in place the Montgomery reduction algorithm can be performed with a Comba style multiplication loop which substantially increases |
|
3908 the speed of the algorithm. |
|
3909 |
|
3910 \newpage\begin{figure}[!here] |
|
3911 \begin{small} |
|
3912 \begin{center} |
|
3913 \begin{tabular}{l} |
|
3914 \hline Algorithm \textbf{fast\_mp\_montgomery\_reduce}. \\ |
|
3915 \textbf{Input}. mp\_int $x$, mp\_int $n$ and a digit $\rho \equiv -1/n_0 \mbox{ (mod }\beta\mbox{)}$. \\ |
|
3916 \hspace{11.5mm}($0 \le x < n^2, n > 1, (n, \beta) = 1, \beta^k > n$) \\ |
|
3917 \textbf{Output}. $\beta^{-k}x \mbox{ (mod }n\mbox{)}$ \\ |
|
3918 \hline \\ |
|
3919 Place an array of \textbf{MP\_WARRAY} mp\_word variables called $\hat W$ on the stack. \\ |
|
3920 1. if $x.alloc < n.used + 1$ then grow $x$ to $n.used + 1$ digits. \\ |
|
3921 Copy the digits of $x$ into the array $\hat W$ \\ |
|
3922 2. For $ix$ from $0$ to $x.used - 1$ do \\ |
|
3923 \hspace{3mm}2.1 $\hat W_{ix} \leftarrow x_{ix}$ \\ |
|
3924 3. For $ix$ from $x.used$ to $2n.used - 1$ do \\ |
|
3925 \hspace{3mm}3.1 $\hat W_{ix} \leftarrow 0$ \\ |
|
3926 Eliminate the lower $k$ digits. \\ |
|
3927 4. for $ix$ from $0$ to $n.used - 1$ do \\ |
|
3928 \hspace{3mm}4.1 $\mu \leftarrow \hat W_{ix} \cdot \rho \mbox{ (mod }\beta\mbox{)}$ \\ |
|
3929 \hspace{3mm}4.2 For $iy$ from $0$ to $n.used - 1$ do \\ |
|
3930 \hspace{6mm}4.2.1 $\hat W_{iy + ix} \leftarrow \hat W_{iy + ix} + \mu \cdot n_{iy}$ \\ |
|
3931 \hspace{3mm}4.3 $\hat W_{ix + 1} \leftarrow \hat W_{ix + 1} + \lfloor \hat W_{ix} / \beta \rfloor$ \\ |
|
3932 Propagate carries upwards. \\ |
|
3933 5. for $ix$ from $n.used$ to $2n.used + 1$ do \\ |
|
3934 \hspace{3mm}5.1 $\hat W_{ix + 1} \leftarrow \hat W_{ix + 1} + \lfloor \hat W_{ix} / \beta \rfloor$ \\ |
|
3935 Shift right and reduce modulo $\beta$ simultaneously. \\ |
|
3936 6. for $ix$ from $0$ to $n.used + 1$ do \\ |
|
3937 \hspace{3mm}6.1 $x_{ix} \leftarrow \hat W_{ix + n.used} \mbox{ (mod }\beta\mbox{)}$ \\ |
|
3938 Zero excess digits and fixup $x$. \\ |
|
3939 7. if $x.used > n.used + 1$ then do \\ |
|
3940 \hspace{3mm}7.1 for $ix$ from $n.used + 1$ to $x.used - 1$ do \\ |
|
3941 \hspace{6mm}7.1.1 $x_{ix} \leftarrow 0$ \\ |
|
3942 8. $x.used \leftarrow n.used + 1$ \\ |
|
3943 9. Clamp excessive digits of $x$. \\ |
|
3944 10. If $x \ge n$ then \\ |
|
3945 \hspace{3mm}10.1 $x \leftarrow x - n$ \\ |
|
3946 11. Return(\textit{MP\_OKAY}). \\ |
|
3947 \hline |
|
3948 \end{tabular} |
|
3949 \end{center} |
|
3950 \end{small} |
|
3951 \caption{Algorithm fast\_mp\_montgomery\_reduce} |
|
3952 \end{figure} |
|
3953 |
|
3954 \textbf{Algorithm fast\_mp\_montgomery\_reduce.} |
|
3955 This algorithm will compute the Montgomery reduction of $x$ modulo $n$ using the Comba technique. It is on most computer platforms significantly |
|
3956 faster than algorithm mp\_montgomery\_reduce and algorithm mp\_reduce (\textit{Barrett reduction}). The algorithm has the same restrictions |
|
3957 on the input as the baseline reduction algorithm. An additional two restrictions are imposed on this algorithm. The number of digits $k$ in the |
|
3958 modulus $n$ must satisfy $MP\_WARRAY > 2k + 1$ and $k < \delta$. When $\beta = 2^{28}$ this algorithm can be used to reduce modulo |
|
3959 a modulus of at most $3,556$ bits in length. |
|
3960 |
|
3961 As in the other Comba reduction algorithms there is a $\hat W$ array which stores the columns of the product. It is initially filled with the |
|
3962 contents of $x$ with the excess digits zeroed. The reduction loop is very similar to the baseline loop at heart. The multiplication on step |
|
3963 4.1 can be single precision only since $ab \mbox{ (mod }\beta\mbox{)} \equiv (a \mbox{ mod }\beta)(b \mbox{ mod }\beta)$. Some multipliers such |
|
3964 as those on the ARM processors take a variable amount of time to complete depending on the number of bytes of result they must produce. By performing |
|
3965 a single precision multiplication instead, half the amount of time is spent. |
|
3966 |
|
3967 Also note that digit $\hat W_{ix}$ must have the carry from the $ix - 1$'th digit propagated upwards in order for this to work. That is what step |
|
3968 4.3 will do. In effect over the $n.used$ iterations of the outer loop the lower $n.used$ columns all have their carries propagated forwards. Note |
|
3969 how the upper bits of those same words are not reduced modulo $\beta$. This is because those values will be discarded shortly and there is no |
|
3970 point. |
|
3971 |
|
3972 Step 5 will propagate the remainder of the carries upwards. On step 6 the columns are reduced modulo $\beta$ and shifted simultaneously as they are |
|
3973 stored in the destination $x$. |
|
3974 |
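The column based approach can be sketched as follows. This is an illustration under stated assumptions and not the library source: the function name is hypothetical, the digit size is fixed at $28$ bits, the array $\hat W$ is sized by a made-up constant MAX\_K instead of \textbf{MP\_WARRAY}, and carry propagation stops at column $2k$ since no higher column is read.

```c
#include <assert.h>
#include <stdint.h>

#define DIGIT_BIT  28
#define DIGIT_MASK ((((uint64_t)1) << DIGIT_BIT) - 1)
#define MAX_K      8   /* hypothetical cap on the digits of n */

/* Comba style Montgomery reduction (illustrative names).  x has room for
 * 2k+1 digits of which xused are significant, n has k <= MAX_K digits
 * and rho is -1/n[0] mod beta.  On return x[0..k-1] holds
 * beta^-k * x mod n. */
void sketch_fast_montgomery_reduce(uint32_t *x, int xused,
                                   const uint32_t *n, int k, uint32_t rho)
{
    uint64_t W[2 * MAX_K + 1];
    uint64_t mu, u, r;
    int ix, iy, ge;

    assert(k >= 1 && k <= MAX_K && xused >= 0 && xused <= 2 * k);

    /* steps 2 and 3: copy x into W and zero the remaining columns */
    for (ix = 0; ix < xused; ix++) {
        W[ix] = x[ix];
    }
    for (; ix <= 2 * k; ix++) {
        W[ix] = 0;
    }

    /* step 4: zero the low digit of each of the lower k columns */
    for (ix = 0; ix < k; ix++) {
        mu = ((W[ix] & DIGIT_MASK) * rho) & DIGIT_MASK;
        for (iy = 0; iy < k; iy++) {
            W[ix + iy] += mu * n[iy];     /* columns are not carried here */
        }
        W[ix + 1] += W[ix] >> DIGIT_BIT;  /* make the next column valid */
    }

    /* step 5: propagate the remaining carries upwards */
    for (ix = k; ix < 2 * k; ix++) {
        W[ix + 1] += W[ix] >> DIGIT_BIT;
    }

    /* steps 6 to 8: shift right k digits and reduce each column mod beta */
    for (ix = 0; ix <= k; ix++) {
        x[ix] = (uint32_t)(W[ix + k] & DIGIT_MASK);
    }
    for (; ix <= 2 * k; ix++) {
        x[ix] = 0;
    }

    /* step 10: subtract n once if x >= n */
    ge = 1;
    if (x[k] == 0) {
        for (ix = k - 1; ix >= 0; ix--) {
            if (x[ix] != n[ix]) {
                ge = (x[ix] > n[ix]);
                break;
            }
        }
    }
    if (ge) {
        u = 0;                                /* borrow */
        for (ix = 0; ix < k; ix++) {
            r = (uint64_t)x[ix] - n[ix] - u;  /* wraps on borrow */
            x[ix] = (uint32_t)(r & DIGIT_MASK);
            u = r >> 63;                      /* 1 if a borrow occurred */
        }
        x[k] -= (uint32_t)u;
    }
}
```

With $64$-bit columns and $28$-bit digits each column can absorb up to $2^8$ products before overflowing, which is where a bound of the form $k < \delta$ comes from. The inner loop contains no carry handling at all, which is the point of the technique.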
|
3975 EXAM,bn_fast_mp_montgomery_reduce.c |
|
3976 |
|
3977 The $\hat W$ array is first filled with digits of $x$ on line @49,for@ then the rest of the digits are zeroed on line @54,for@. Both loops share |
|
3978 the same alias variables to make the code easier to read. |
|
3979 |
|
3980 The value of $\mu$ is calculated in an interesting fashion. First the value $\hat W_{ix}$ is reduced modulo $\beta$ and cast to an mp\_digit. This |
|
3981 forces the compiler to use a single precision multiplication and prevents any concerns about loss of precision. Line @101,>>@ fixes the carry |
|
3982 for the next iteration of the loop by propagating the carry from $\hat W_{ix}$ to $\hat W_{ix+1}$. |
|
3983 |
|
3984 The for loop on line @113,for@ propagates the rest of the carries upwards through the columns. The for loop on line @126,for@ reduces the columns |
|
3985 modulo $\beta$ and shifts them $k$ places at the same time. The alias $\_ \hat W$ actually refers to the array $\hat W$ starting at the $n.used$'th |
|
3986 digit, that is $\_ \hat W_{t} = \hat W_{n.used + t}$. |
|
3987 |
|
3988 \subsection{Montgomery Setup} |
|
3989 To calculate the variable $\rho$ a relatively simple algorithm will be required. |
|
3990 |
|
3991 \begin{figure}[!here] |
|
3992 \begin{small} |
|
3993 \begin{center} |
|
3994 \begin{tabular}{l} |
|
3995 \hline Algorithm \textbf{mp\_montgomery\_setup}. \\ |
|
3996 \textbf{Input}. mp\_int $n$ ($n > 1$ and $(n, 2) = 1$) \\ |
|
3997 \textbf{Output}. $\rho \equiv -1/n_0 \mbox{ (mod }\beta\mbox{)}$ \\ |
|
3998 \hline \\ |
|
3999 1. $b \leftarrow n_0$ \\ |
|
4000 2. If $b$ is even return(\textit{MP\_VAL}) \\ |
|
4001 3. $x \leftarrow (((b + 2) \mbox{ AND } 4) << 1) + b$ \\ |
|
4002 4. for $k$ from 0 to $\lceil lg(lg(\beta)) \rceil - 2$ do \\ |
|
4003 \hspace{3mm}4.1 $x \leftarrow x \cdot (2 - bx)$ \\ |
|
4004 5. $\rho \leftarrow \beta - x \mbox{ (mod }\beta\mbox{)}$ \\ |
|
4005 6. Return(\textit{MP\_OKAY}). \\ |
|
4006 \hline |
|
4007 \end{tabular} |
|
4008 \end{center} |
|
4009 \end{small} |
|
4010 \caption{Algorithm mp\_montgomery\_setup} |
|
4011 \end{figure} |
|
4012 |
|
4013 \textbf{Algorithm mp\_montgomery\_setup.} |
|
4014 This algorithm will calculate the value of $\rho$ required within the Montgomery reduction algorithms. It uses a very interesting trick |
|
4015 to calculate $1/n_0$ when $\beta$ is a power of two. |
|
4016 |
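The trick is a Newton iteration on the inverse: if $bx \equiv 1 \mbox{ (mod }2^j\mbox{)}$ then $x(2 - bx)$ is an inverse of $b$ modulo $2^{2j}$, so each pass doubles the number of correct low bits. The following sketch assumes a $28$-bit digit held in a $32$-bit unsigned type; the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DIGIT_BIT  28
#define DIGIT_MASK ((((uint32_t)1) << DIGIT_BIT) - 1)

/* Returns rho = -1/n0 mod 2^DIGIT_BIT for an odd digit n0.
 * Illustrative sketch only; arithmetic wraps modulo 2^32. */
uint32_t sketch_montgomery_setup(uint32_t n0)
{
    uint32_t b = n0;
    uint32_t x;
    int bits;

    assert((b & 1) == 1);          /* an even n0 has no inverse mod 2^28 */

    x = (((b + 2) & 4) << 1) + b;  /* b * x == 1 (mod 2^4) */
    for (bits = 4; bits < DIGIT_BIT; bits *= 2) {
        x *= 2 - b * x;            /* b * x == 1 modulo 2^(2 * bits) */
    }

    /* negate the inverse modulo beta */
    return ((((uint32_t)1) << DIGIT_BIT) - (x & DIGIT_MASK)) & DIGIT_MASK;
}
```

For a $28$-bit digit the loop body runs three times, giving $8$, $16$ and then $32$ correct bits. The result can be verified by checking that $n_0 \rho + 1$ is a multiple of $\beta$.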
|
4017 EXAM,bn_mp_montgomery_setup.c |
|
4018 |
|
4019 This source code computes the value of $\rho$ required to perform Montgomery reduction. It has been modified to avoid performing excess |
|
4020 multiplications when $\beta$ is not the default 28-bits. |
|
4021 |
|
4022 \section{The Diminished Radix Algorithm} |
|
4023 The Diminished Radix method of modular reduction \cite{DRMET} is a fairly clever technique which can be more efficient than either the Barrett |
|
4024 or Montgomery methods for certain forms of moduli. The technique is based on the following simple congruence. |
|
4025 |
|
4026 \begin{equation} |
|
4027 (x \mbox{ mod } n) + k \lfloor x / n \rfloor \equiv x \mbox{ (mod }(n - k)\mbox{)} |
|
4028 \end{equation} |
|
4029 |
|
4030 This observation was used in the MMB \cite{MMB} block cipher to create a diffusion primitive. It used the fact that if $n = 2^{31}$ and $k=1$ |
|
4031 then an x86 multiplier could produce the 62-bit product and use the ``shrd'' instruction to perform a double-precision right shift. The proof |
|
4032 of the above equation is very simple. First write $x$ in the product form. |
|
4033 |
|
4034 \begin{equation} |
|
4035 x = qn + r |
|
4036 \end{equation} |
|
4037 |
|
4038 Now reduce both sides modulo $(n - k)$. |
|
4039 |
|
4040 \begin{equation} |
|
4041 x \equiv qk + r \mbox{ (mod }(n-k)\mbox{)} |
|
4042 \end{equation} |
|
4043 |
|
4044 The variable $n$ reduces modulo $n - k$ to $k$. By putting $q = \lfloor x/n \rfloor$ and $r = x \mbox{ mod } n$ |
|
4045 into the equation the original congruence is reproduced, thus concluding the proof. The following algorithm is based on this observation. |
|
4046 |
|
4047 \begin{figure}[!here] |
|
4048 \begin{small} |
|
4049 \begin{center} |
|
4050 \begin{tabular}{l} |
|
4051 \hline Algorithm \textbf{Diminished Radix Reduction}. \\ |
|
4052 \textbf{Input}. Integer $x$, $n$, $k$ \\ |
|
4053 \textbf{Output}. $x \mbox{ mod } (n - k)$ \\ |
|
4054 \hline \\ |
|
4055 1. $q \leftarrow \lfloor x / n \rfloor$ \\ |
|
4056 2. $q \leftarrow k \cdot q$ \\ |
|
4057 3. $x \leftarrow x \mbox{ (mod }n\mbox{)}$ \\ |
|
4058 4. $x \leftarrow x + q$ \\ |
|
4059 5. If $x \ge (n - k)$ then \\ |
|
4060 \hspace{3mm}5.1 $x \leftarrow x - (n - k)$ \\ |
|
4061 \hspace{3mm}5.2 Goto step 1. \\ |
|
4062 6. Return $x$ \\ |
|
4063 \hline |
|
4064 \end{tabular} |
|
4065 \end{center} |
|
4066 \end{small} |
|
4067 \caption{Algorithm Diminished Radix Reduction} |
|
4068 \label{fig:DR} |
|
4069 \end{figure} |
|
4070 |
|
4071 This algorithm will reduce $x$ modulo $n - k$ and return the residue. If $0 \le x < (n - k)^2$ then the algorithm will loop almost always |
|
4072 once or twice and occasionally three times. For simplicity's sake the value of $x$ is bounded by the following simple polynomial. |
|
4073 |
|
4074 \begin{equation} |
|
4075 0 \le x < n^2 + k^2 - 2nk |
|
4076 \end{equation} |
|
4077 |
|
4078 The true bound is $0 \le x < (n - k - 1)^2$ but this has quite a few more terms. The value of $q$ after step 1 is bounded by the following. |
|
4079 |
|
4080 \begin{equation} |
|
4081 q < n - 2k + k^2/n |
|
4082 \end{equation} |
|
4083 |
|
4084 Since $k^2$ is going to be considerably smaller than $n$ that term will always be zero. The value of $x$ after step 3 is bounded trivially as |
|
4085 $0 \le x < n$. By step four the sum $x + q$ is bounded by |
|
4086 |
|
4087 \begin{equation} |
|
4088 0 \le q + x < (k + 1)n - 2k^2 - 1 |
|
4089 \end{equation} |
|
4090 |
|
4091 With a second pass $q$ will be loosely bounded by $0 \le q < k^2$ after step 2 while $x$ will still be loosely bounded by $0 \le x < n$ after step 3. After the second pass it is highly unlikely that the |
|
4092 sum in step 4 will exceed $n - k$. In practice fewer than three passes of the algorithm are required to reduce virtually every input in the |
|
4093 range $0 \le x < (n - k - 1)^2$. |
|
4094 |
|
4095 \begin{figure} |
|
4096 \begin{small} |
|
4097 \begin{center} |
|
4098 \begin{tabular}{|l|} |
|
4099 \hline |
|
4100 $x = 123456789, n = 256, k = 3$ \\ |
|
4101 \hline $q \leftarrow \lfloor x/n \rfloor = 482253$ \\ |
|
4102 $q \leftarrow q*k = 1446759$ \\ |
|
4103 $x \leftarrow x \mbox{ mod } n = 21$ \\ |
|
4104 $x \leftarrow x + q = 1446780$ \\ |
|
4105 $x \leftarrow x - (n - k) = 1446527$ \\ |
|
4106 \hline |
|
4107 $q \leftarrow \lfloor x/n \rfloor = 5650$ \\ |
|
4108 $q \leftarrow q*k = 16950$ \\ |
|
4109 $x \leftarrow x \mbox{ mod } n = 127$ \\ |
|
4110 $x \leftarrow x + q = 17077$ \\ |
|
4111 $x \leftarrow x - (n - k) = 16824$ \\ |
|
4112 \hline |
|
4113 $q \leftarrow \lfloor x/n \rfloor = 65$ \\ |
|
4114 $q \leftarrow q*k = 195$ \\ |
|
4115 $x \leftarrow x \mbox{ mod } n = 184$ \\ |
|
4116 $x \leftarrow x + q = 379$ \\ |
|
4117 $x \leftarrow x - (n - k) = 126$ \\ |
|
4118 \hline |
|
4119 \end{tabular} |
|
4120 \end{center} |
|
4121 \end{small} |
|
4122 \caption{Example Diminished Radix Reduction} |
|
4123 \label{fig:EXDR} |
|
4124 \end{figure} |
|
4125 |
|
4126 Figure~\ref{fig:EXDR} demonstrates the reduction of $x = 123456789$ modulo $n - k = 253$ when $n = 256$ and $k = 3$. Note that even though $x$ |
|
4127 is considerably larger than $(n - k - 1)^2 = 63504$ the algorithm still converges on the modular residue exceedingly fast. In this case only |
|
4128 three passes were required to find the residue $x \equiv 126$. |
|
4129 |
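The generic algorithm of figure~\ref{fig:DR} translates almost line for line into C on native integers. The routine below is a minimal sketch with a hypothetical name; it makes no attempt to avoid the division and exists only to replay the example.

```c
#include <assert.h>
#include <stdint.h>

/* Reduce x modulo (n - k) by repeatedly folding the quotient back in.
 * Illustrative sketch; requires 0 < k < n. */
uint64_t toy_dr_reduce(uint64_t x, uint64_t n, uint64_t k)
{
    assert(k > 0 && k < n);

    for (;;) {
        uint64_t q = x / n;   /* step 1 */
        q *= k;               /* step 2 */
        x = (x % n) + q;      /* steps 3 and 4 */
        if (x < n - k) {
            return x;         /* step 6 */
        }
        x -= n - k;           /* step 5.1, then back to step 1 */
    }
}
```

Calling toy\_dr\_reduce(123456789, 256, 3) passes through the values $1446527$, $16824$ and $126$ shown in the figure and returns $126$.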
|
4130 |
|
4131 \subsection{Choice of Moduli} |
|
4132 On the surface this algorithm looks like a very expensive algorithm. It requires a couple of subtractions followed by multiplication and other |
|
4133 modular reductions. The usefulness of this algorithm becomes exceedingly clear when an appropriate modulus is chosen. |
|
4134 |
|
4135 Division in general is a very expensive operation to perform. The one exception is when the division is by a power of the radix of representation used. |
|
4136 Division by ten for example is simple for pencil and paper mathematics since it amounts to shifting the decimal place to the right. Similarly division |
|
4137 by two (\textit{or powers of two}) is very simple for binary computers to perform. It would therefore seem logical to choose $n$ of the form $2^p$ |
|
4138 which would imply that $\lfloor x / n \rfloor$ is a simple shift of $x$ right $p$ bits. |
|
4139 |
|
4140 However, there is one operation related to division by powers of two that is even faster than this. If $n = \beta^p$ then the division may be |
|
4141 performed by moving whole digits to the right $p$ places. In practice division by $\beta^p$ is much faster than division by $2^p$ for any $p$. |
|
4142 Also with the choice of $n = \beta^p$ reducing $x$ modulo $n$ merely requires zeroing the digits above the $p-1$'th digit of $x$. |
|
4143 |
|
4144 Throughout the next section the term ``restricted modulus'' will refer to a modulus of the form $\beta^p - k$ whereas the term ``unrestricted |
|
4145 modulus'' will refer to a modulus of the form $2^p - k$. The word ``restricted'' in this case refers to the fact that it is based on the |
|
4146 $2^p$ logic except $p$ must be a multiple of $lg(\beta)$. |
|
4147 |
|
4148 \subsection{Choice of $k$} |
|
4149 Now that division and reduction (\textit{steps 1 and 3 of figure~\ref{fig:DR}}) have been optimized to simple digit operations the multiplication by $k$ |
|
4150 in step 2 is the most expensive operation. Fortunately the choice of $k$ is not terribly limited. For all intents and purposes it might |
|
4151 as well be a single digit. The smaller the value of $k$ is the faster the algorithm will be. |
|
4152 |
|
4153 \subsection{Restricted Diminished Radix Reduction} |
|
4154 The restricted Diminished Radix algorithm can quickly reduce an input modulo a modulus of the form $n = \beta^p - k$. This algorithm can reduce |
|
4155 an input $x$ within the range $0 \le x < n^2$ using only a couple passes of the algorithm demonstrated in figure~\ref{fig:DR}. The implementation |
|
4156 of this algorithm has been optimized to avoid additional overhead associated with a division by $\beta^p$, the multiplication by $k$ or the addition |
|
4157 of $x$ and $q$. The resulting algorithm is very efficient and can lead to substantial improvements over Barrett and Montgomery reduction when modular |
|
4158 exponentiations are performed. |
|
4159 |
|
4160 \newpage\begin{figure}[!here] |
|
4161 \begin{small} |
|
4162 \begin{center} |
|
4163 \begin{tabular}{l} |
|
4164 \hline Algorithm \textbf{mp\_dr\_reduce}. \\ |
|
4165 \textbf{Input}. mp\_int $x$, $n$ and a mp\_digit $k = \beta - n_0$ \\ |
|
4166 \hspace{11.5mm}($0 \le x < n^2$, $n > 1$, $0 < k < \beta$) \\ |
|
4167 \textbf{Output}. $x \mbox{ mod } n$ \\ |
|
4168 \hline \\ |
|
4169 1. $m \leftarrow n.used$ \\ |
|
4170 2. If $x.alloc < 2m$ then grow $x$ to $2m$ digits. \\ |
|
4171 3. $\mu \leftarrow 0$ \\ |
|
4172 4. for $i$ from $0$ to $m - 1$ do \\ |
|
4173 \hspace{3mm}4.1 $\hat r \leftarrow k \cdot x_{m+i} + x_{i} + \mu$ \\ |
|
4174 \hspace{3mm}4.2 $x_{i} \leftarrow \hat r \mbox{ (mod }\beta\mbox{)}$ \\ |
|
4175 \hspace{3mm}4.3 $\mu \leftarrow \lfloor \hat r / \beta \rfloor$ \\ |
|
4176 5. $x_{m} \leftarrow \mu$ \\ |
|
4177 6. for $i$ from $m + 1$ to $x.used - 1$ do \\ |
|
4178 \hspace{3mm}6.1 $x_{i} \leftarrow 0$ \\ |
|
4179 7. Clamp excess digits of $x$. \\ |
|
4180 8. If $x \ge n$ then \\ |
|
4181 \hspace{3mm}8.1 $x \leftarrow x - n$ \\ |
|
4182 \hspace{3mm}8.2 Goto step 3. \\ |
|
4183 9. Return(\textit{MP\_OKAY}). \\ |
|
4184 \hline |
|
4185 \end{tabular} |
|
4186 \end{center} |
|
4187 \end{small} |
|
4188 \caption{Algorithm mp\_dr\_reduce} |
|
4189 \end{figure} |
|
4190 |
|
4191 \textbf{Algorithm mp\_dr\_reduce.} |
|
4192 This algorithm will perform the Diminished Radix reduction of $x$ modulo $n$. It has similar restrictions to that of the Barrett reduction |
|
4193 with the addition that $n$ must be of the form $n = \beta^m - k$ where $0 < k <\beta$. |
|
4194 |
|
4195 This algorithm essentially implements the pseudo-code in figure~\ref{fig:DR} except with a slight optimization. The division by $\beta^m$, multiplication by $k$ |
|
4196 and addition of $x \mbox{ mod }\beta^m$ are all performed simultaneously inside the loop on step 4. The division by $\beta^m$ is emulated by accessing |
|
4197 the term at the $m+i$'th position which is subsequently multiplied by $k$ and added to the term at the $i$'th position. After the loop the $m$'th |
|
4198 digit is set to the carry and the upper digits are zeroed. Steps 5 and 6 emulate the reduction modulo $\beta^m$ that should have happened to |
|
4199 $x$ before the addition of the multiple of the upper half. |
|
4200 |
|
4201 At step 8 if $x$ is still larger than $n$ another pass of the algorithm is required. First $n$ is subtracted from $x$ and then the algorithm resumes |
|
4202 at step 3. |
|
4203 |
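The fused loop can be sketched on raw digit arrays as shown below. The function name, the fixed $28$-bit digit size and the convention that the caller supplies exactly $2m$ digits for $x$ are assumptions made for this illustration.

```c
#include <assert.h>
#include <stdint.h>

#define DIGIT_BIT  28
#define DIGIT_MASK ((((uint64_t)1) << DIGIT_BIT) - 1)

/* Restricted Diminished Radix reduction (illustrative names).  x holds
 * 2m digits, least significant first, and n = beta^m - k holds m digits.
 * On return x[0..m-1] holds x mod n and the upper digits are zero. */
void sketch_dr_reduce_digits(uint32_t *x, const uint32_t *n, int m,
                             uint32_t k)
{
    uint64_t mu, r;
    int i, ge;

    assert(m >= 1);

    for (;;) {
        mu = 0;                                    /* step 3 */
        for (i = 0; i < m; i++) {                  /* step 4 */
            r = (uint64_t)k * x[m + i] + x[i] + mu;
            x[i] = (uint32_t)(r & DIGIT_MASK);
            mu = r >> DIGIT_BIT;
        }
        x[m] = (uint32_t)mu;                       /* step 5 */
        for (i = m + 1; i < 2 * m; i++) {          /* step 6 */
            x[i] = 0;
        }

        ge = 1;                                    /* step 8: is x >= n? */
        if (x[m] == 0) {
            for (i = m - 1; i >= 0; i--) {
                if (x[i] != n[i]) {
                    ge = (x[i] > n[i]);
                    break;
                }
            }
        }
        if (!ge) {
            return;                                /* step 9 */
        }

        mu = 0;                                    /* step 8.1, borrow */
        for (i = 0; i < m; i++) {
            r = (uint64_t)x[i] - n[i] - mu;        /* wraps on borrow */
            x[i] = (uint32_t)(r & DIGIT_MASK);
            mu = r >> 63;                          /* 1 if a borrow occurred */
        }
        x[m] -= (uint32_t)mu;                      /* then back to step 3 */
    }
}
```

For example with $m = 2$ and $k = 3$ the modulus is $n = \beta^2 - 3$, so $\beta^2 \equiv 3$ and the input $4\beta^3 + 3\beta^2 + 2\beta + 1$ reduces to $14\beta + 10$ in a single pass. The input $\beta^2 - 1 = n + 2$ exercises the subtraction and a second pass.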
|
4204 EXAM,bn_mp_dr_reduce.c |
|
4205 |
|
4206 The first step is to grow $x$ as required to $2m$ digits since the reduction is performed in place on $x$. The label on line @49,top:@ is where |
|
4207 the algorithm will resume if further reduction passes are required. In theory it could be placed at the top of the function; however, the size of |
|
4208 the modulus and question of whether $x$ is large enough are invariant after the first pass meaning that it would be a waste of time. |
|
4209 |
|
4210 The aliases $tmpx1$ and $tmpx2$ refer to the digits of $x$ where the latter is offset by $m$ digits. By reading digits from $x$ offset by $m$ digits |
|
4211 a division by $\beta^m$ can be simulated virtually for free. The loop on line @61,for@ performs the bulk of the work (\textit{corresponds to step 4 of algorithm 7.11}) |
|
4212 in this algorithm. |
|
4213 |
|
4214 By line @68,mu@ the pointer $tmpx1$ points to the $m$'th digit of $x$ which is where the final carry will be placed. Similarly by line @71,for@ the |
|
4215 same pointer will point to the $m+1$'th digit where the zeroes will be placed. |
|
4216 |
|
4217 Since the algorithm is only valid if both $x$ and $n$ are greater than zero an unsigned comparison suffices to determine if another pass is required. |
|
4218 With the same logic at line @82,sub@ the value of $x$ is known to be greater than or equal to $n$ meaning that an unsigned subtraction can be used |
|
4219 as well. Since the destination of the subtraction is the larger of the inputs the call to algorithm s\_mp\_sub cannot fail and the return code |
|
4220 does not need to be checked. |
|
4221 |
|
4222 \subsubsection{Setup} |
|
4223 To set up the restricted Diminished Radix algorithm the value $k = \beta - n_0$ is required. This algorithm is not really complicated but is provided for |
|
4224 completeness. |
|
4225 |
|
4226 \begin{figure}[!here] |
|
4227 \begin{small} |
|
4228 \begin{center} |
|
4229 \begin{tabular}{l} |
|
4230 \hline Algorithm \textbf{mp\_dr\_setup}. \\ |
|
4231 \textbf{Input}. mp\_int $n$ \\ |
|
4232 \textbf{Output}. $k = \beta - n_0$ \\ |
|
4233 \hline \\ |
|
4234 1. $k \leftarrow \beta - n_0$ \\ |
|
4235 \hline |
|
4236 \end{tabular} |
|
4237 \end{center} |
|
4238 \end{small} |
|
4239 \caption{Algorithm mp\_dr\_setup} |
|
4240 \end{figure} |
|
4241 |
|
4242 EXAM,bn_mp_dr_setup.c |
|
4243 |
|
4244 \subsubsection{Modulus Detection} |
|
4245 Another useful algorithm is one that detects a restricted Diminished Radix modulus. An integer is said to be |
|
4246 of restricted Diminished Radix form if all of the digits are equal to $\beta - 1$ except the trailing digit which may be any value. |
|
4247 |
|
4248 \begin{figure}[!here] |
|
4249 \begin{small} |
|
4250 \begin{center} |
|
4251 \begin{tabular}{l} |
|
4252 \hline Algorithm \textbf{mp\_dr\_is\_modulus}. \\ |
|
4253 \textbf{Input}. mp\_int $n$ \\ |
|
4254 \textbf{Output}. $1$ if $n$ is in D.R form, $0$ otherwise \\ |
|
4255 \hline |
|
4256 1. If $n.used < 2$ then return($0$). \\ |
|
4257 2. for $ix$ from $1$ to $n.used - 1$ do \\ |
|
4258 \hspace{3mm}2.1 If $n_{ix} \ne \beta - 1$ return($0$). \\ |
|
4259 3. Return($1$). \\ |
|
4260 \hline |
|
4261 \end{tabular} |
|
4262 \end{center} |
|
4263 \end{small} |
|
4264 \caption{Algorithm mp\_dr\_is\_modulus} |
|
4265 \end{figure} |
|
4266 |
|
4267 \textbf{Algorithm mp\_dr\_is\_modulus.} |
|
4268 This algorithm determines if a value is in Diminished Radix form. Step 1 rejects obvious cases where fewer than two digits are |
|
4269 in the mp\_int. Step 2 tests all but the first digit to see if they are equal to $\beta - 1$. If the algorithm manages to get to |
|
4270 step 3 then $n$ must be of Diminished Radix form. |
|
4271 |
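Both the setup and the detection step are short enough to sketch together. The names below are hypothetical and the digit size is assumed to be $28$ bits; the modulus is passed as a raw digit array with the least significant digit first.

```c
#include <assert.h>
#include <stdint.h>

#define DIGIT_BIT  28
#define DIGIT_MAX  ((((uint32_t)1) << DIGIT_BIT) - 1)   /* beta - 1 */

/* k = beta - n_0 for a restricted Diminished Radix modulus. */
uint32_t sketch_dr_setup(const uint32_t *n)
{
    return (((uint32_t)1) << DIGIT_BIT) - n[0];
}

/* Returns 1 if every digit above the first equals beta - 1. */
int sketch_dr_is_modulus(const uint32_t *n, int used)
{
    int ix;

    if (used < 2) {
        return 0;                  /* step 1 */
    }
    for (ix = 1; ix < used; ix++) {
        if (n[ix] != DIGIT_MAX) {  /* step 2.1 */
            return 0;
        }
    }
    return 1;                      /* step 3 */
}
```

For the two digit modulus $\beta^2 - 3$ the detection routine returns $1$ and the setup routine returns $k = 3$.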
|
4272 EXAM,bn_mp_dr_is_modulus.c |
|
4273 |
|
4274 \subsection{Unrestricted Diminished Radix Reduction} |
|
4275 The unrestricted Diminished Radix algorithm allows modular reductions to be performed when the modulus is of the form $2^p - k$. This algorithm |
|
4276 is a straightforward adaptation of algorithm~\ref{fig:DR}. |
|
4277 |
|
4278 In general the restricted Diminished Radix reduction algorithm is much faster since it has considerably lower overhead. However, this new |
|
4279 algorithm is much faster than either Montgomery or Barrett reduction when the moduli are of the appropriate form. |
|
4280 |
|
4281 \begin{figure}[!here] |
|
4282 \begin{small} |
|
4283 \begin{center} |
|
4284 \begin{tabular}{l} |
|
4285 \hline Algorithm \textbf{mp\_reduce\_2k}. \\ |
|
4286 \textbf{Input}. mp\_int $a$ and $n$. mp\_digit $k$ \\ |
|
4287 \hspace{11.5mm}($a \ge 0$, $n > 1$, $0 < k < \beta$, $n + k$ is a power of two) \\ |
|
4288 \textbf{Output}. $a \mbox{ (mod }n\mbox{)}$ \\ |
|
4289 \hline |
|
4290 1. $p \leftarrow \lceil lg(n) \rceil$ (\textit{mp\_count\_bits}) \\ |
|
4291 2. While $a \ge n$ do \\ |
|
4292 \hspace{3mm}2.1 $q \leftarrow \lfloor a / 2^p \rfloor$ (\textit{mp\_div\_2d}) \\ |
|
4293 \hspace{3mm}2.2 $a \leftarrow a \mbox{ (mod }2^p\mbox{)}$ (\textit{mp\_mod\_2d}) \\ |
|
4294 \hspace{3mm}2.3 $q \leftarrow q \cdot k$ (\textit{mp\_mul\_d}) \\ |
|
4295 \hspace{3mm}2.4 $a \leftarrow a + q$ (\textit{s\_mp\_add}) \\ |
|
4296 \hspace{3mm}2.5 If $a \ge n$ then do \\ |
|
4297 \hspace{6mm}2.5.1 $a \leftarrow a - n$ \\ |
|
4298 3. Return(\textit{MP\_OKAY}). \\ |
|
4299 \hline |
|
4300 \end{tabular} |
|
4301 \end{center} |
|
4302 \end{small} |
|
4303 \caption{Algorithm mp\_reduce\_2k} |
|
4304 \end{figure} |
|
4305 |
|
4306 \textbf{Algorithm mp\_reduce\_2k.} |
|
4307 This algorithm quickly reduces an input $a$ modulo an unrestricted Diminished Radix modulus $n$. Division by $2^p$ is emulated with a right |
|
4308 shift which makes the algorithm fairly inexpensive to use. |
|
4309 |
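On native integers the loop can be sketched as follows. The names are hypothetical and sketch\_count\_bits is only a stand-in for mp\_count\_bits. Note that the folded quotient $q \cdot k$ is added to the remainder, as the congruence $x \equiv qk + r \mbox{ (mod }(2^p - k)\mbox{)}$ requires.

```c
#include <assert.h>
#include <stdint.h>

/* Number of bits in n (hypothetical stand-in for mp_count_bits). */
static int sketch_count_bits(uint64_t n)
{
    int p = 0;

    while (n != 0) {
        p++;
        n >>= 1;
    }
    return p;
}

/* Reduce a modulo n = 2^p - k.  Illustrative sketch on 64-bit integers. */
uint64_t sketch_reduce_2k(uint64_t a, uint64_t n, uint64_t k)
{
    int p = sketch_count_bits(n);                 /* step 1 */
    uint64_t q;

    assert(n > 1 && p < 64);
    assert(n + k == (((uint64_t)1) << p));        /* n + k is a power of two */

    while (a >= n) {                              /* step 2 */
        q = a >> p;                               /* step 2.1 */
        a &= (((uint64_t)1) << p) - 1;            /* step 2.2 */
        q *= k;                                   /* step 2.3 */
        a += q;                                   /* step 2.4 */
        if (a >= n) {                             /* step 2.5 */
            a -= n;
        }
    }
    return a;
}
```

With $n = 253$ and $k = 3$ this reproduces the result $126$ for $a = 123456789$ without a single division.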
|
4310 EXAM,bn_mp_reduce_2k.c |
|
4311 |
|
4312 The algorithm mp\_count\_bits calculates the number of bits in an mp\_int which is used to find the initial value of $p$. The call to mp\_div\_2d |
|
4313 on line @31,mp_div_2d@ calculates both the quotient $q$ and the remainder $a$ required. By doing both in a single function call the code size |
|
4314 is kept fairly small. The multiplication by $k$ is only performed if $k > 1$. This allows reductions modulo $2^p - 1$ to be performed without |
|
4315 any multiplications. |
|
4316 |
|
4317 The unsigned s\_mp\_add, mp\_cmp\_mag and s\_mp\_sub are used in place of their full sign counterparts since the inputs are only valid if they are |
|
4318 positive. By using the unsigned versions the overhead is kept to a minimum. |
|
4319 |
|
4320 \subsubsection{Unrestricted Setup} |
|
4321 To setup this reduction algorithm the value of $k = 2^p - n$ is required. |
|
4322 |
|
4323 \begin{figure}[!here] |
|
4324 \begin{small} |
|
4325 \begin{center} |
|
4326 \begin{tabular}{l} |
|
4327 \hline Algorithm \textbf{mp\_reduce\_2k\_setup}. \\ |
|
4328 \textbf{Input}. mp\_int $n$ \\ |
|
4329 \textbf{Output}. $k = 2^p - n$ \\ |
|
4330 \hline |
|
4331 1. $p \leftarrow \lceil lg(n) \rceil$ (\textit{mp\_count\_bits}) \\ |
|
4332 2. $x \leftarrow 2^p$ (\textit{mp\_2expt}) \\ |
|
4333 3. $x \leftarrow x - n$ (\textit{mp\_sub}) \\ |
|
4334 4. $k \leftarrow x_0$ \\ |
|
4335 5. Return(\textit{MP\_OKAY}). \\ |
|
4336 \hline |
|
4337 \end{tabular} |
|
4338 \end{center} |
|
4339 \end{small} |
|
4340 \caption{Algorithm mp\_reduce\_2k\_setup} |
|
4341 \end{figure} |
|
4342 |
|
4343 \textbf{Algorithm mp\_reduce\_2k\_setup.} |
|
4344 This algorithm computes the value of $k$ required for the algorithm mp\_reduce\_2k. By making a temporary variable $x$ equal to $2^p$ a subtraction |
|
4345 is sufficient to solve for $k$. Alternatively if $n$ has more than one digit the value of $k$ is simply $\beta - n_0$. |
|
4346 |
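A minimal sketch of the setup on native integers is given below; the function name is hypothetical and the modulus is assumed to fit in fewer than $64$ bits.

```c
#include <assert.h>
#include <stdint.h>

/* Returns k = 2^p - n where p is the number of bits in n.
 * Illustrative sketch on 64-bit integers. */
uint64_t sketch_reduce_2k_setup(uint64_t n)
{
    uint64_t t;
    int p = 0;

    for (t = n; t != 0; t >>= 1) {   /* step 1: count the bits of n */
        p++;
    }
    assert(n > 1 && p < 64);

    return (((uint64_t)1) << p) - n; /* steps 2 to 4 */
}
```

For $n = 2^{30} - 3$ the routine returns $3$, which agrees with $\beta - n_0$ when $\beta = 2^{28}$.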
|
4347 EXAM,bn_mp_reduce_2k_setup.c |
|
4348 |
|
4349 \subsubsection{Unrestricted Detection} |
|
4350 An integer $n$ is a valid unrestricted Diminished Radix modulus if either of the following are true. |
|
4351 |
|
4352 \begin{enumerate} |
|
4353 \item The number has only one digit. |
|
4354 \item The number has more than one digit and every bit from the $lg(\beta)$'th to the most significant is one. |
|
4355 \end{enumerate} |
|
4356 |
|
4357 If either condition is true then there is a power of two $2^p$ such that $0 < 2^p - n < \beta$. If the input is only |
|
4358 one digit then it will always be of the correct form. Otherwise all of the bits above the first digit must be one. This arises from the fact |
|
4359 that there will be a value of $k$ that when added to the modulus causes a carry in the first digit which propagates all the way to the most |
|
4360 significant bit. The resulting sum will be a power of two. |
|
4361 |
|
4362 \begin{figure}[!here] |
|
4363 \begin{small} |
|
4364 \begin{center} |
|
4365 \begin{tabular}{l} |
|
4366 \hline Algorithm \textbf{mp\_reduce\_is\_2k}. \\ |
|
4367 \textbf{Input}. mp\_int $n$ \\ |
|
4368 \textbf{Output}. $1$ if of proper form, $0$ otherwise \\ |
|
4369 \hline |
|
4370 1. If $n.used = 0$ then return($0$). \\ |
|
4371 2. If $n.used = 1$ then return($1$). \\ |
|
4372 3. $p \leftarrow \lceil lg(n) \rceil$ (\textit{mp\_count\_bits}) \\ |
|
4373 4. for $x$ from $lg(\beta)$ to $p - 1$ do \\ |

4374 \hspace{3mm}4.1 If the ($x \mbox{ mod }lg(\beta)$)'th bit of the $\lfloor x / lg(\beta) \rfloor$'th digit of $n$ is zero then return($0$). \\ |
|
4375 5. Return($1$). \\ |
|
4376 \hline |
|
4377 \end{tabular} |
|
4378 \end{center} |
|
4379 \end{small} |
|
4380 \caption{Algorithm mp\_reduce\_is\_2k} |
|
4381 \end{figure} |
|
4382 |
|
4383 \textbf{Algorithm mp\_reduce\_is\_2k.} |
|
4384 This algorithm quickly determines if a modulus is of the form required for algorithm mp\_reduce\_2k to function properly. |
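
A minimal sketch of the same test on a single 64-bit word follows. The function name reduce\_is\_2k\_u64 and the parameter digit\_bits (standing in for $lg(\beta)$) are assumptions made for the illustration; the real routine walks the digits of an mp\_int instead.

```c
#include <stdint.h>

/* Hypothetical single-word model of the detection test.
 * digit_bits plays the role of lg(beta) and is assumed to be in 1..63.
 * Returns 1 if n has the form 2^p - k with 0 < k < 2^digit_bits, else 0. */
static int reduce_is_2k_u64(uint64_t n, unsigned digit_bits)
{
    unsigned p = 0;
    uint64_t t;

    if (n == 0) {
        return 0;                          /* no digits in use */
    }
    for (t = n; t != 0; t >>= 1) {
        ++p;                               /* p = bit length of n */
    }
    if (p <= digit_bits) {
        return 1;                          /* a single digit always qualifies */
    }
    for (unsigned x = digit_bits; x < p; ++x) {
        if (((n >> x) & 1u) == 0) {
            return 0;                      /* a zero bit above the first digit */
        }
    }
    return 1;
}
```

With 8-bit digits the value 0xFFFFFF03 passes, while 0xFF7F03 fails because bit 15 is clear.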
|
4385 |
|
4386 EXAM,bn_mp_reduce_is_2k.c |
|
4387 |
|
4388 |
|
4389 |
|
4390 \section{Algorithm Comparison} |
|
4391 So far three very different algorithms for modular reduction have been discussed. Each of the algorithms has its own strengths and weaknesses, |

4392 which makes having such a selection very useful. The following table summarizes the three algorithms along with comparisons of work factors. Since |
|
4393 all three algorithms have the restriction that $0 \le x < n^2$ and $n > 1$ those limitations are not included in the table. |
|
4394 |
|
4395 \begin{center} |
|
4396 \begin{small} |
|
4397 \begin{tabular}{|c|c|c|c|c|c|} |
|
4398 \hline \textbf{Method} & \textbf{Work Required} & \textbf{Limitations} & \textbf{$m = 8$} & \textbf{$m = 32$} & \textbf{$m = 64$} \\ |
|
4399 \hline Barrett & $m^2 + 2m - 1$ & None & $79$ & $1087$ & $4223$ \\ |
|
4400 \hline Montgomery & $m^2 + m$ & $n$ must be odd & $72$ & $1056$ & $4160$ \\ |
|
4401 \hline D.R. & $2m$ & $n = \beta^m - k$ & $16$ & $64$ & $128$ \\ |
|
4402 \hline |
|
4403 \end{tabular} |
|
4404 \end{small} |
|
4405 \end{center} |
|
4406 |
|
4407 In theory Montgomery and Barrett reductions would require roughly the same amount of time to complete. However, in practice since Montgomery |
|
4408 reduction can be written as a single function with the Comba technique it is much faster. Barrett reduction suffers from the overhead of |
|
4409 calling the half precision multipliers, addition and division by $\beta$ algorithms. |
|
4410 |
|
4411 For almost every cryptographic algorithm Montgomery reduction is the algorithm of choice. The one set of algorithms where Diminished Radix reduction truly |

4412 shines is those based on the discrete logarithm problem such as Diffie-Hellman \cite{DH} and ElGamal \cite{ELGAMAL}. In these algorithms |
|
4413 primes of the form $\beta^m - k$ can be found and shared amongst users. These primes will allow the Diminished Radix algorithm to be used in |
|
4414 modular exponentiation to greatly speed up the operation. |
|
4415 |
|
4416 |
|
4417 |
|
4418 \section*{Exercises} |
|
4419 \begin{tabular}{cl} |
|
4420 $\left [ 3 \right ]$ & Prove that the ``trick'' in algorithm mp\_montgomery\_setup actually \\ |
|
4421 & calculates the correct value of $\rho$. \\ |
|
4422 & \\ |
|
4423 $\left [ 2 \right ]$ & Devise an algorithm to reduce modulo $n + k$ for small $k$ quickly. \\ |
|
4424 & \\ |
|
4425 $\left [ 4 \right ]$ & Prove that the pseudo-code algorithm ``Diminished Radix Reduction'' \\ |
|
4426 & (\textit{figure~\ref{fig:DR}}) terminates. Also prove the probability that it will \\ |
|
4427 & terminate within $1 \le k \le 10$ iterations. \\ |
|
4428 & \\ |
|
4429 \end{tabular} |
|
4430 |
|
4431 |
|
4432 \chapter{Exponentiation} |
|
4433 Exponentiation is the operation of raising one variable to the power of another, for example, $a^b$. A variant of exponentiation, computed |
|
4434 in a finite field or ring, is called modular exponentiation. This latter style of operation is typically used in public key |
|
4435 cryptosystems such as RSA and Diffie-Hellman. The ability to quickly compute modular exponentiations is of great benefit to any |
|
4436 such cryptosystem and many methods have been sought to speed it up. |
|
4437 |
|
4438 \section{Exponentiation Basics} |
|
4439 A trivial algorithm would simply multiply $a$ against itself $b - 1$ times to compute the exponentiation desired. However, as $b$ grows in size |
|
4440 the number of multiplications becomes prohibitive. Imagine what would happen if $b$ $\approx$ $2^{1024}$ as is the case when computing an RSA signature |
|
4441 with a $1024$-bit key. Such a calculation could never be completed as it would simply take far too long. |
|
4442 |
|
4443 Fortunately there is a very simple algorithm based on the laws of exponents. Recall that $lg_a(a^b) = b$ and that $lg_a(a^ba^c) = b + c$ which |
|
4444 are two trivial relationships between the base and the exponent. Let $b_i$ represent the $i$'th bit of $b$ starting from the least |
|
4445 significant bit. If $b$ is a $k$-bit integer then the following equation is true. |
|
4446 |
|
4447 \begin{equation} |
|
4448 a^b = \prod_{i=0}^{k-1} a^{2^i \cdot b_i} |
|
4449 \end{equation} |
|
4450 |
|
4451 By taking the base $a$ logarithm of both sides of the equation the following equation is the result. |
|
4452 |
|
4453 \begin{equation} |
|
4454 b = \sum_{i=0}^{k-1}2^i \cdot b_i |
|
4455 \end{equation} |
|
4456 |
|
4457 The term $a^{2^i}$ can be found from the $i - 1$'th term by squaring the term since $\left ( a^{2^i} \right )^2$ is equal to |
|
4458 $a^{2^{i+1}}$. This observation forms the basis of essentially all fast exponentiation algorithms. It requires $k$ squarings and on average |
|
4459 $k \over 2$ multiplications to compute the result. This is indeed quite an improvement over simply multiplying by $a$ a total of $b-1$ times. |
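
The observation can be sketched as a loop that keeps $a^{2^i}$ in an auxiliary variable and multiplies it into the result whenever bit $b_i$ is set. The function name pow\_rtl is hypothetical and the arithmetic is plain 64-bit, so results are only exact while $a^b < 2^{64}$.

```c
#include <stdint.h>

/* Right-to-left binary exponentiation: computes a^b modulo 2^64.
 * "term" holds a^(2^i) for the bit position i currently examined. */
static uint64_t pow_rtl(uint64_t a, uint64_t b)
{
    uint64_t c = 1;       /* running product */
    uint64_t term = a;    /* a^(2^0) */

    while (b != 0) {
        if (b & 1u) {
            c *= term;    /* bit b_i is set: multiply in a^(2^i) */
        }
        term *= term;     /* a^(2^i) becomes a^(2^(i+1)) */
        b >>= 1;
    }
    return c;
}
```

The loop performs one squaring per exponent bit and one multiplication per set bit, which is where the $k$ squarings and roughly $k/2$ multiplications come from.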
|
4460 |
|
4461 While this current method is a considerable speed up there are further improvements to be made. For example, the $a^{2^i}$ term does not need to |
|
4462 be computed in an auxiliary variable. Consider the following equivalent algorithm. |
|
4463 |
|
4464 \begin{figure}[!here] |
|
4465 \begin{small} |
|
4466 \begin{center} |
|
4467 \begin{tabular}{l} |
|
4468 \hline Algorithm \textbf{Left to Right Exponentiation}. \\ |
|
4469 \textbf{Input}. Integer $a$, $b$ and $k$ \\ |
|
4470 \textbf{Output}. $c = a^b$ \\ |
|
4471 \hline \\ |
|
4472 1. $c \leftarrow 1$ \\ |
|
4473 2. for $i$ from $k - 1$ to $0$ do \\ |
|
4474 \hspace{3mm}2.1 $c \leftarrow c^2$ \\ |
|
4475 \hspace{3mm}2.2 $c \leftarrow c \cdot a^{b_i}$ \\ |
|
4476 3. Return $c$. \\ |
|
4477 \hline |
|
4478 \end{tabular} |
|
4479 \end{center} |
|
4480 \end{small} |
|
4481 \caption{Left to Right Exponentiation} |
|
4482 \label{fig:LTOR} |
|
4483 \end{figure} |
|
4484 |
|
4485 This algorithm starts from the most significant bit and works towards the least significant bit. When the $i$'th bit of $b$ is set $a$ is |
|
4486 multiplied against the current product. In each iteration the product is squared which doubles the exponent of the individual terms of the |
|
4487 product. |
|
4488 |
|
4489 For example, let $b = 101100_2 \equiv 44_{10}$. The following chart demonstrates the actions of the algorithm. |
|
4490 |
|
4491 \newpage\begin{figure} |
|
4492 \begin{center} |
|
4493 \begin{tabular}{|c|c|} |
|
4494 \hline \textbf{Value of $i$} & \textbf{Value of $c$} \\ |
|
4495 \hline - & $1$ \\ |
|
4496 \hline $5$ & $a$ \\ |
|
4497 \hline $4$ & $a^2$ \\ |
|
4498 \hline $3$ & $a^4 \cdot a$ \\ |
|
4499 \hline $2$ & $a^8 \cdot a^2 \cdot a$ \\ |
|
4500 \hline $1$ & $a^{16} \cdot a^4 \cdot a^2$ \\ |
|
4501 \hline $0$ & $a^{32} \cdot a^8 \cdot a^4$ \\ |
|
4502 \hline |
|
4503 \end{tabular} |
|
4504 \end{center} |
|
4505 \caption{Example of Left to Right Exponentiation} |
|
4506 \end{figure} |
|
4507 |
|
4508 When the product $a^{32} \cdot a^8 \cdot a^4$ is simplified it is equal to $a^{44}$ which is the desired exponentiation. This particular algorithm is |
|
4509 called ``Left to Right'' because it reads the exponent in that order. All of the exponentiation algorithms that will be presented are of this nature. |
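
A minimal C sketch of the left to right loop is given here so the chart can be checked by hand. The name pow\_ltr is hypothetical, $k$ is assumed to lie between $1$ and $64$, and the arithmetic wraps modulo $2^{64}$.

```c
#include <stdint.h>

/* Left-to-right binary exponentiation on a k-bit exponent (1 <= k <= 64).
 * The result is a^b modulo 2^64. */
static uint64_t pow_ltr(uint64_t a, uint64_t b, unsigned k)
{
    uint64_t c = 1;                       /* step 1 */

    for (unsigned i = k; i-- > 0; ) {     /* step 2: i runs from k-1 down to 0 */
        c *= c;                           /* step 2.1: square the product */
        if ((b >> i) & 1u) {
            c *= a;                       /* step 2.2 when b_i = 1 */
        }
    }
    return c;
}
```

Calling the function with $a = 2$, $b = 44$ and $k = 6$ walks through the same six iterations as the chart and ends with $2^{44}$.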
|
4510 |
|
4511 \subsection{Single Digit Exponentiation} |
|
4512 The first algorithm in the series of exponentiation algorithms will be an unbounded algorithm where the exponent is a single digit. It is intended |
|
4513 to be used when a small power of an input is required (\textit{e.g. $a^5$}). It is faster than simply multiplying $b - 1$ times for all values of |
|
4514 $b$ that are greater than three. |
|
4515 |
|
4516 \newpage\begin{figure}[!here] |
|
4517 \begin{small} |
|
4518 \begin{center} |
|
4519 \begin{tabular}{l} |
|
4520 \hline Algorithm \textbf{mp\_expt\_d}. \\ |
|
4521 \textbf{Input}. mp\_int $a$ and mp\_digit $b$ \\ |
|
4522 \textbf{Output}. $c = a^b$ \\ |
|
4523 \hline \\ |
|
4524 1. $g \leftarrow a$ (\textit{mp\_init\_copy}) \\ |
|
4525 2. $c \leftarrow 1$ (\textit{mp\_set}) \\ |
|
4526 3. for $x$ from 1 to $lg(\beta)$ do \\ |
|
4527 \hspace{3mm}3.1 $c \leftarrow c^2$ (\textit{mp\_sqr}) \\ |
|
4528 \hspace{3mm}3.2 If $b$ AND $2^{lg(\beta) - 1} \ne 0$ then \\ |
|
4529 \hspace{6mm}3.2.1 $c \leftarrow c \cdot g$ (\textit{mp\_mul}) \\ |
|
4530 \hspace{3mm}3.3 $b \leftarrow b << 1$ \\ |
|
4531 4. Clear $g$. \\ |
|
4532 5. Return(\textit{MP\_OKAY}). \\ |
|
4533 \hline |
|
4534 \end{tabular} |
|
4535 \end{center} |
|
4536 \end{small} |
|
4537 \caption{Algorithm mp\_expt\_d} |
|
4538 \end{figure} |
|
4539 |
|
4540 \textbf{Algorithm mp\_expt\_d.} |
|
4541 This algorithm computes the value of $a$ raised to the power of a single digit $b$. It uses the left to right exponentiation algorithm to |
|
4542 quickly compute the exponentiation. It is loosely based on algorithm 14.79 of HAC \cite[pp. 615]{HAC} with the difference that the |
|
4543 exponent is a fixed width. |
|
4544 |
|
4545 A copy of $a$ is made first to allow the destination variable $c$ to be the same as the source variable $a$. The result is set to the initial value of |
|
4546 $1$ in the subsequent step. |
|
4547 |
|
4548 Inside the loop the exponent is read from the most significant bit first down to the least significant bit. First $c$ is invariably squared |
|
4549 on step 3.1. In the following step if the most significant bit of $b$ is one the copy of $a$ is multiplied against $c$. The value |
|
4550 of $b$ is shifted left one bit to make the next bit down from the most significant bit the new most significant bit. In effect each |
|
4551 iteration of the loop moves the bits of the exponent $b$ upwards to the most significant location. |
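
The fixed width loop can be sketched with machine words in place of mp\_ints. The name expt\_d\_sketch is hypothetical, the digit width is assumed to be $32$ bits, and the result wraps modulo $2^{64}$, so this only illustrates the bit handling and not the multiple precision arithmetic.

```c
#include <stdint.h>

#define DIGIT_BITS 32u   /* assumed lg(beta) for this sketch */

/* Hypothetical word-sized model of mp_expt_d: returns a^b modulo 2^64. */
static uint64_t expt_d_sketch(uint64_t a, uint32_t b)
{
    uint64_t g = a;      /* step 1: private copy of the base */
    uint64_t c = 1;      /* step 2 */

    for (unsigned x = 0; x < DIGIT_BITS; ++x) {          /* step 3 */
        c *= c;                                          /* step 3.1 */
        if (b & ((uint32_t)1 << (DIGIT_BITS - 1u))) {    /* step 3.2: test the top bit */
            c *= g;                                      /* step 3.2.1 */
        }
        b <<= 1;                                         /* step 3.3: next bit moves to the top */
    }
    return c;
}
```

Because the loop always runs over the full digit, leading zero bits cost a squaring of $1$ each, which is harmless but not free.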
|
4552 |
|
4553 EXAM,bn_mp_expt_d.c |
|
4554 |
|
4555 Line @29,mp_set@ sets the initial value of the result to $1$. Next the loop on line @31,for@ steps through each bit of the exponent starting from |
|
4556 the most significant down towards the least significant. The invariant squaring operation placed on line @333,mp_sqr@ is performed first. After |
|
4557 the squaring the result $c$ is multiplied by the base $g$ if and only if the most significant bit of the exponent is set. The shift on line |
|
4558 @47,<<@ moves all of the bits of the exponent upwards towards the most significant location. |
|
4559 |
|
4560 \section{$k$-ary Exponentiation} |
|
4561 When calculating an exponentiation the most time consuming bottleneck is the multiplications which are in general a small factor |
|
4562 slower than squaring. Recall from the previous algorithm that $b_{i}$ refers to the $i$'th bit of the exponent $b$. Suppose instead it referred to |
|
4563 the $i$'th $k$-bit digit of the exponent $b$. For $k = 1$ the definitions are synonymous and for $k > 1$ algorithm~\ref{fig:KARY} |

4564 computes the same exponentiation. A group of $k$ bits from the exponent is called a \textit{window}. That is, it is a small window on only a |
|
4565 portion of the entire exponent. Consider the following modification to the basic left to right exponentiation algorithm. |
|
4566 |
|
4567 \begin{figure}[!here] |
|
4568 \begin{small} |
|
4569 \begin{center} |
|
4570 \begin{tabular}{l} |
|
4571 \hline Algorithm \textbf{$k$-ary Exponentiation}. \\ |
|
4572 \textbf{Input}. Integer $a$, $b$, $k$ and $t$ \\ |
|
4573 \textbf{Output}. $c = a^b$ \\ |
|
4574 \hline \\ |
|
4575 1. $c \leftarrow 1$ \\ |
|
4576 2. for $i$ from $t - 1$ to $0$ do \\ |
|
4577 \hspace{3mm}2.1 $c \leftarrow c^{2^k} $ \\ |
|
4578 \hspace{3mm}2.2 Extract the $i$'th $k$-bit word from $b$ and store it in $g$. \\ |
|
4579 \hspace{3mm}2.3 $c \leftarrow c \cdot a^g$ \\ |
|
4580 3. Return $c$. \\ |
|
4581 \hline |
|
4582 \end{tabular} |
|
4583 \end{center} |
|
4584 \end{small} |
|
4585 \caption{$k$-ary Exponentiation} |
|
4586 \label{fig:KARY} |
|
4587 \end{figure} |
|
4588 |
|
4589 The squaring on step 2.1 can be calculated by squaring the value $c$ successively $k$ times. If the values of $a^g$ for $0 < g < 2^k$ have been |
|
4590 precomputed this algorithm requires only $t$ multiplications and $tk$ squarings. The table can be generated with $2^{k - 1} - 1$ squarings and |
|
4591 $2^{k - 1} + 1$ multiplications. This algorithm assumes that the number of bits in the exponent is evenly divisible by $k$. |
|
4592 However, when it is not the remaining $0 < x \le k - 1$ bits can be handled with algorithm~\ref{fig:LTOR}. |
|
4593 |
|
4594 Suppose $k = 4$ and $t = 100$. This modified algorithm will require $109$ multiplications and $407$ squarings to compute the exponentiation. The |
|
4595 original algorithm would on average have required $200$ multiplications and $400$ squarings to compute the same value. The total number of squarings |
|
4596 has increased slightly but the number of multiplications has nearly halved. |
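
A small sketch of the fixed window method with $k = 4$ on a 64-bit exponent is shown next. The name pow\_kary is hypothetical, results wrap modulo $2^{64}$, and the table is filled by repeated multiplication for brevity rather than by the mix of squarings and multiplications counted above.

```c
#include <stdint.h>

#define K 4u                      /* window width chosen for the sketch */
#define TABLE_SIZE (1u << K)      /* 2^K table entries */

/* Fixed-window (k-ary) exponentiation of a 64-bit exponent, modulo 2^64. */
static uint64_t pow_kary(uint64_t a, uint64_t b)
{
    uint64_t table[TABLE_SIZE];
    uint64_t c = 1;

    /* Precompute a^g for 0 <= g < 2^K. */
    table[0] = 1;
    for (unsigned g = 1; g < TABLE_SIZE; ++g) {
        table[g] = table[g - 1] * a;
    }

    /* The 64 exponent bits form t = 64 / K windows, most significant first. */
    for (unsigned i = 64u / K; i-- > 0; ) {
        for (unsigned s = 0; s < K; ++s) {
            c *= c;                                   /* step 2.1: c = c^(2^K) */
        }
        unsigned g = (unsigned)((b >> (i * K)) & (TABLE_SIZE - 1u));   /* step 2.2 */
        c *= table[g];                                /* step 2.3 */
    }
    return c;
}
```

Each window costs $K$ squarings and exactly one multiplication, regardless of how many bits in the window are set.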
|
4597 |
|
4598 \subsection{Optimal Values of $k$} |
|
4599 An optimal value of $k$ will minimize $2^{k} + \lceil n / k \rceil + n - 1$ for a fixed number of bits in the exponent $n$. The simplest |
|
4600 approach is to brute force search amongst the values $k = 2, 3, \ldots, 8$ for the lowest result. Table~\ref{fig:OPTK} lists optimal values of $k$ |
|
4601 for various exponent sizes and compares the number of multiplication and squarings required against algorithm~\ref{fig:LTOR}. |
|
4602 |
|
4603 \begin{figure}[here] |
|
4604 \begin{center} |
|
4605 \begin{small} |
|
4606 \begin{tabular}{|c|c|c|c|} |
|
4607 \hline \textbf{Exponent (bits)} & \textbf{Optimal $k$} & \textbf{Work at $k$} & \textbf{Work with ~\ref{fig:LTOR}} \\ |
|
4608 \hline $16$ & $2$ & $27$ & $24$ \\ |
|
4609 \hline $32$ & $3$ & $49$ & $48$ \\ |
|
4610 \hline $64$ & $3$ & $92$ & $96$ \\ |
|
4611 \hline $128$ & $4$ & $175$ & $192$ \\ |
|
4612 \hline $256$ & $4$ & $335$ & $384$ \\ |
|
4613 \hline $512$ & $5$ & $645$ & $768$ \\ |
|
4614 \hline $1024$ & $6$ & $1257$ & $1536$ \\ |
|
4615 \hline $2048$ & $6$ & $2452$ & $3072$ \\ |
|
4616 \hline $4096$ & $7$ & $4808$ & $6144$ \\ |
|
4617 \hline |
|
4618 \end{tabular} |
|
4619 \end{small} |
|
4620 \end{center} |
|
4621 \caption{Optimal Values of $k$ for $k$-ary Exponentiation} |
|
4622 \label{fig:OPTK} |
|
4623 \end{figure} |
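
The brute force search is short enough to sketch directly. The names kary\_work and optimal\_k are hypothetical. The division is truncated here, which is what reproduces the figures in the table; rounding up instead raises some totals by one without changing which $k$ wins for the sizes listed.

```c
#include <stdint.h>

/* Work estimate 2^k + n/k + n - 1 for an n-bit exponent, using truncating division. */
static uint64_t kary_work(uint64_t n, unsigned k)
{
    return ((uint64_t)1 << k) + n / k + n - 1;
}

/* Returns the k in [2, 8] with the smallest estimate; ties go to the smaller k. */
static unsigned optimal_k(uint64_t n)
{
    unsigned best = 2;

    for (unsigned k = 3; k <= 8; ++k) {
        if (kary_work(n, k) < kary_work(n, best)) {
            best = k;
        }
    }
    return best;
}
```

For a $1024$-bit exponent the search settles on $k = 6$ with an estimate of $1257$ operations.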
|
4624 |
|
4625 \subsection{Sliding-Window Exponentiation} |
|
4626 A simple modification to the previous algorithm is to generate only the upper half of the table in the range $2^{k-1} \le g < 2^k$. Essentially |
|
4627 this is a table for all values of $g$ where the most significant bit of $g$ is a one. However, in order for this to be allowed in the |
|
4628 algorithm values of $g$ in the range $0 \le g < 2^{k-1}$ must be avoided. |
|
4629 |
|
4630 Table~\ref{fig:OPTK2} lists optimal values of $k$ for various exponent sizes and compares the work required against algorithm~\ref{fig:KARY}. |
|
4631 |
|
4632 \begin{figure}[here] |
|
4633 \begin{center} |
|
4634 \begin{small} |
|
4635 \begin{tabular}{|c|c|c|c|} |
|
4636 \hline \textbf{Exponent (bits)} & \textbf{Optimal $k$} & \textbf{Work at $k$} & \textbf{Work with ~\ref{fig:KARY}} \\ |
|
4637 \hline $16$ & $3$ & $24$ & $27$ \\ |
|
4638 \hline $32$ & $3$ & $45$ & $49$ \\ |
|
4639 \hline $64$ & $4$ & $87$ & $92$ \\ |
|
4640 \hline $128$ & $4$ & $167$ & $175$ \\ |
|
4641 \hline $256$ & $5$ & $322$ & $335$ \\ |
|
4642 \hline $512$ & $6$ & $628$ & $645$ \\ |
|
4643 \hline $1024$ & $6$ & $1225$ & $1257$ \\ |
|
4644 \hline $2048$ & $7$ & $2403$ & $2452$ \\ |
|
4645 \hline $4096$ & $8$ & $4735$ & $4808$ \\ |
|
4646 \hline |
|
4647 \end{tabular} |
|
4648 \end{small} |
|
4649 \end{center} |
|
4650 \caption{Optimal Values of $k$ for Sliding Window Exponentiation} |
|
4651 \label{fig:OPTK2} |
|
4652 \end{figure} |
|
4653 |
|
4654 \newpage\begin{figure}[!here] |
|
4655 \begin{small} |
|
4656 \begin{center} |
|
4657 \begin{tabular}{l} |
|
4658 \hline Algorithm \textbf{Sliding Window $k$-ary Exponentiation}. \\ |
|
4659 \textbf{Input}. Integer $a$, $b$, $k$ and $t$ \\ |
|
4660 \textbf{Output}. $c = a^b$ \\ |
|
4661 \hline \\ |
|
4662 1. $c \leftarrow 1$ \\ |
|
4663 2. for $i$ from $t - 1$ to $0$ do \\ |
|
4664 \hspace{3mm}2.1 If the $i$'th bit of $b$ is a zero then \\ |
|
4665 \hspace{6mm}2.1.1 $c \leftarrow c^2$ \\ |
|
4666 \hspace{3mm}2.2 else do \\ |
|
4667 \hspace{6mm}2.2.1 $c \leftarrow c^{2^k}$ \\ |
|
4668 \hspace{6mm}2.2.2 Extract the $k$ bits from $(b_{i}b_{i-1}\ldots b_{i-(k-1)})$ and store it in $g$. \\ |
|
4669 \hspace{6mm}2.2.3 $c \leftarrow c \cdot a^g$ \\ |
|
4670 \hspace{6mm}2.2.4 $i \leftarrow i - k$ \\ |
|
4671 3. Return $c$. \\ |
|
4672 \hline |
|
4673 \end{tabular} |
|
4674 \end{center} |
|
4675 \end{small} |
|
4676 \caption{Sliding Window $k$-ary Exponentiation} |
|
4677 \end{figure} |
|
4678 |
|
4679 Similar to the previous algorithm this algorithm must have a special handler when fewer than $k$ bits are left in the exponent. While this |
|
4680 algorithm requires the same number of squarings it can potentially have fewer multiplications. The pre-computed table $a^g$ is also half |
|
4681 the size of the previous table. |
|
4682 |
|
4683 Consider the exponent $b = 111101011001000_2 \equiv 31432_{10}$ with $k = 3$ using both algorithms. The first algorithm will divide the exponent up as |
|
4684 the following five $3$-bit words $b \equiv \left ( 111, 101, 011, 001, 000 \right )_{2}$. The second algorithm will break the |
|
4685 exponent as $b \equiv \left ( 111, 101, 0, 110, 0, 100, 0 \right )_{2}$. The single $0$ digits in the second representation mark where |
|
4686 a single squaring took place instead of a squaring and multiplication. In total the first method requires $10$ multiplications and $18$ |
|
4687 squarings. The second method requires $8$ multiplications and $18$ squarings. |
|
4688 |
|
4689 In general the sliding window method is never slower than the generic $k$-ary method and often it is slightly faster. |
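
A sketch of the sliding window loop with $k = 3$ on a 64-bit exponent follows. The name pow\_sliding is hypothetical, results wrap modulo $2^{64}$, and the tail of fewer than $k$ bits is handled with the plain left to right step, which is one possible form of the special handler mentioned above.

```c
#include <stdint.h>

#define WIN 3u                    /* window width k, an arbitrary choice for the sketch */
#define HALF (1u << (WIN - 1u))   /* 2^(k-1): first exponent kept in the table */

/* Sliding-window exponentiation of a 64-bit exponent, modulo 2^64. */
static uint64_t pow_sliding(uint64_t a, uint64_t b)
{
    uint64_t table[HALF];         /* table[j] = a^(HALF + j), the upper half only */
    uint64_t c = 1;
    int i = 63;

    /* table[0] = a^(2^(k-1)) by k-1 squarings, the rest by multiplying by a. */
    table[0] = a;
    for (unsigned s = 0; s + 1u < WIN; ++s) {
        table[0] *= table[0];
    }
    for (unsigned j = 1; j < HALF; ++j) {
        table[j] = table[j - 1] * a;
    }

    while (i >= 0) {
        if (((b >> i) & 1u) == 0) {
            c *= c;               /* zero bit: a single squaring */
            --i;
        } else if (i + 1 >= (int)WIN) {
            /* A full window starts at this set bit. */
            unsigned g = (unsigned)((b >> (i - (int)WIN + 1)) & ((1u << WIN) - 1u));
            for (unsigned s = 0; s < WIN; ++s) {
                c *= c;
            }
            c *= table[g - HALF];
            i -= (int)WIN;
        } else {
            /* Fewer than k bits remain and this one is set: left-to-right step. */
            c *= c;
            c *= a;
            --i;
        }
    }
    return c;
}
```

Every window begins with a set bit, so only exponents $g \ge 2^{k-1}$ are ever looked up, which is why half of the table suffices.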
|
4690 |
|
4691 \section{Modular Exponentiation} |
|
4692 |
|
4693 Modular exponentiation is essentially computing the power of a base within a finite field or ring. For example, computing |
|
4694 $d \equiv a^b \mbox{ (mod }c\mbox{)}$ is a modular exponentiation. Instead of first computing $a^b$ and then reducing it |
|
4695 modulo $c$ the intermediate result is reduced modulo $c$ after every squaring or multiplication operation. |
|
4696 |
|
4697 This guarantees that any intermediate result is bounded by $0 \le d \le c^2 - 2c + 1$ and can be reduced modulo $c$ quickly using |
|
4698 one of the algorithms presented in ~REDUCTION~. |
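
The effect of reducing after every operation can be seen in a word-sized sketch. The names mulmod32 and powmod32 are hypothetical, the modulus is assumed to satisfy $1 < c < 2^{32}$ so that every product fits in 64 bits, and the C remainder operator stands in for the reduction algorithms.

```c
#include <stdint.h>

/* (x * y) mod c for c < 2^32. Both operands are reduced first, so the
 * product is at most (c - 1)^2 and fits in 64 bits. */
static uint64_t mulmod32(uint64_t x, uint64_t y, uint64_t c)
{
    return ((x % c) * (y % c)) % c;
}

/* d = a^b mod c by a left-to-right scan of b; assumes 1 < c < 2^32. */
static uint64_t powmod32(uint64_t a, uint64_t b, uint64_t c)
{
    uint64_t d = 1;

    for (unsigned i = 64; i-- > 0; ) {
        d = mulmod32(d, d, c);            /* square, then reduce */
        if ((b >> i) & 1u) {
            d = mulmod32(d, a, c);        /* multiply, then reduce */
        }
    }
    return d;
}
```

The intermediate values never exceed $(c - 1)^2$, which is the bound quoted above.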
|
4699 |
|
4700 Before the actual modular exponentiation algorithm can be written a wrapper algorithm must be written first. This algorithm |
|
4701 will allow the exponent $b$ to be negative which is computed as $d \equiv \left (1 / a \right )^{\vert b \vert} \mbox{ (mod }c\mbox{)}$. The |
|
4702 value of $(1/a) \mbox{ mod }c$ is computed using the modular inverse (\textit{see \ref{sec;modinv}}). If no inverse exists the algorithm |
|
4703 terminates with an error. |
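
The wrapper logic can be sketched with signed 64-bit integers. The names invmod\_sketch and exptmod\_sketch are hypothetical, the return value $-1$ stands in for an error code, and the sketch assumes $1 < c < 2^{31}$ and $b > -2^{63}$ so that no intermediate value overflows. The inverse is found with the extended Euclidean algorithm, which is one common choice and not necessarily the method the library uses.

```c
#include <stdint.h>

/* Modular inverse of a modulo c by the extended Euclidean algorithm.
 * Returns 0 and stores the inverse on success, -1 if gcd(a, c) != 1. */
static int invmod_sketch(int64_t a, int64_t c, int64_t *inv)
{
    int64_t r0 = c, r1 = a % c;
    int64_t t0 = 0, t1 = 1;

    if (r1 < 0) {
        r1 += c;                          /* normalise a into [0, c) */
    }
    while (r1 != 0) {
        int64_t q = r0 / r1;
        int64_t r2 = r0 - q * r1;
        int64_t t2 = t0 - q * t1;
        r0 = r1; r1 = r2;
        t0 = t1; t1 = t2;
    }
    if (r0 != 1) {
        return -1;                        /* no inverse exists */
    }
    *inv = (t0 % c + c) % c;
    return 0;
}

/* d = a^b mod c, allowing b < 0. Returns 0 on success, -1 on error.
 * Assumes 1 < c < 2^31 and b > INT64_MIN. */
static int exptmod_sketch(int64_t a, int64_t b, int64_t c, int64_t *d)
{
    int64_t base, res = 1;

    if (c <= 1) {
        return -1;                        /* reject a negative or trivial modulus */
    }
    if (b < 0) {
        int64_t inv;
        if (invmod_sketch(a, c, &inv) != 0) {
            return -1;
        }
        return exptmod_sketch(inv, -b, c, d);   /* recurse with (1/a) and |b| */
    }
    base = ((a % c) + c) % c;
    while (b > 0) {
        if (b & 1) {
            res = res * base % c;
        }
        base = base * base % c;
        b >>= 1;
    }
    *d = res;
    return 0;
}
```

For example $3^{-1} \equiv 5 \mbox{ (mod }7\mbox{)}$, while base $2$ with modulus $4$ has no inverse and is reported as an error.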
|
4704 |
|
4705 \begin{figure}[!here] |
|
4706 \begin{small} |
|
4707 \begin{center} |
|
4708 \begin{tabular}{l} |
|
4709 \hline Algorithm \textbf{mp\_exptmod}. \\ |
|
4710 \textbf{Input}. mp\_int $g$, $x$ and $p$ \\ |

4711 \textbf{Output}. $y \equiv g^x \mbox{ (mod }p\mbox{)}$ \\ |

4712 \hline \\ |

4713 1. If $p.sign = MP\_NEG$ return(\textit{MP\_VAL}). \\ |

4714 2. If $x.sign = MP\_NEG$ then \\ |

4715 \hspace{3mm}2.1 $g' \leftarrow g^{-1} \mbox{ (mod }p\mbox{)}$ \\ |

4716 \hspace{3mm}2.2 $x' \leftarrow \vert x \vert$ \\ |

4717 \hspace{3mm}2.3 Compute $y \equiv g'^{x'} \mbox{ (mod }p\mbox{)}$ via recursion. \\ |
|
4718 3. if $p$ is odd \textbf{OR} $p$ is a D.R. modulus then \\ |
|
4719 \hspace{3mm}3.1 Compute $y \equiv g^{x} \mbox{ (mod }p\mbox{)}$ via algorithm mp\_exptmod\_fast. \\ |
|
4720 4. else \\ |
|
4721 \hspace{3mm}4.1 Compute $y \equiv g^{x} \mbox{ (mod }p\mbox{)}$ via algorithm s\_mp\_exptmod. \\ |
|
4722 \hline |
|
4723 \end{tabular} |
|
4724 \end{center} |
|
4725 \end{small} |
|
4726 \caption{Algorithm mp\_exptmod} |
|
4727 \end{figure} |
|
4728 |
|
4729 \textbf{Algorithm mp\_exptmod.} |
|
4730 The first algorithm which actually performs modular exponentiation is algorithm s\_mp\_exptmod. It is a sliding window $k$-ary algorithm |
|
4731 which uses Barrett reduction to reduce the product modulo $p$. The second algorithm mp\_exptmod\_fast performs the same operation |
|
4732 except it uses either Montgomery or Diminished Radix reduction. The two latter reduction algorithms are clumped in the same exponentiation |
|
4733 algorithm since their arguments are essentially the same (\textit{two mp\_ints and one mp\_digit}). |
|
4734 |
|
4735 EXAM,bn_mp_exptmod.c |
|
4736 |
|
4737 In order to keep the algorithms in a known state the first step on line @29,if@ is to reject any negative modulus as input. If the exponent is |
|
4738 negative the algorithm tries to perform a modular exponentiation with the modular inverse of the base $G$. The temporary variable $tmpG$ is assigned |
|
4739 the modular inverse of $G$ and $tmpX$ is assigned the absolute value of $X$. The algorithm will recurse with these new values with a positive |
|
4740 exponent. |
|
4741 |
|
4742 If the exponent is positive the algorithm resumes the exponentiation. Line @63,dr_@ determines if the modulus is of the restricted Diminished Radix |
|
4743 form. If it is not line @65,reduce@ attempts to determine if it is of an unrestricted Diminished Radix form. The integer $dr$ will take on one |
|
4744 of three values. |
|
4745 |
|
4746 \begin{enumerate} |
|
4747 \item $dr = 0$ means that the modulus is not of either restricted or unrestricted Diminished Radix form. |
|
4748 \item $dr = 1$ means that the modulus is of restricted Diminished Radix form. |
|
4749 \item $dr = 2$ means that the modulus is of unrestricted Diminished Radix form. |
|
4750 \end{enumerate} |
|
4751 |
|
4752 Line @69,if@ determines if the fast modular exponentiation algorithm can be used. It is allowed if $dr \ne 0$ or if the modulus is odd. Otherwise, |
|
4753 the slower s\_mp\_exptmod algorithm is used which uses Barrett reduction. |
|
4754 |
|
4755 \subsection{Barrett Modular Exponentiation} |
|
4756 |
|
4757 \newpage\begin{figure}[!here] |
|
4758 \begin{small} |
|
4759 \begin{center} |
|
4760 \begin{tabular}{l} |
|
4761 \hline Algorithm \textbf{s\_mp\_exptmod}. \\ |
|
4762 \textbf{Input}. mp\_int $g$, $x$ and $p$ \\ |
|
4763 \textbf{Output}. $y \equiv g^x \mbox{ (mod }p\mbox{)}$ \\ |
|
4764 \hline \\ |
|
4765 1. $k \leftarrow lg(x)$ \\ |
|
4766 2. $winsize \leftarrow \left \lbrace \begin{array}{ll} |
|
4767 2 & \mbox{if }k \le 7 \\ |
|
4768 3 & \mbox{if }7 < k \le 36 \\ |
|
4769 4 & \mbox{if }36 < k \le 140 \\ |
|
4770 5 & \mbox{if }140 < k \le 450 \\ |
|
4771 6 & \mbox{if }450 < k \le 1303 \\ |
|
4772 7 & \mbox{if }1303 < k \le 3529 \\ |
|
4773 8 & \mbox{if }3529 < k \\ |
|
4774 \end{array} \right .$ \\ |
|
4775 3. Initialize $2^{winsize}$ mp\_ints in an array named $M$ and one mp\_int named $\mu$ \\ |
|
4776 4. Calculate the $\mu$ required for Barrett Reduction (\textit{mp\_reduce\_setup}). \\ |
|
4777 5. $M_1 \leftarrow g \mbox{ (mod }p\mbox{)}$ \\ |
|
4778 \\ |
|
4779 Set up the table of small powers of $g$. First find $g^{2^{winsize - 1}}$ and then each following power by multiplying by $g$. \\ |
|
4780 6. $k \leftarrow 2^{winsize - 1}$ \\ |
|
4781 7. $M_{k} \leftarrow M_1$ \\ |
|
4782 8. for $ix$ from 0 to $winsize - 2$ do \\ |
|
4783 \hspace{3mm}8.1 $M_k \leftarrow \left ( M_k \right )^2$ (\textit{mp\_sqr}) \\ |
|
4784 \hspace{3mm}8.2 $M_k \leftarrow M_k \mbox{ (mod }p\mbox{)}$ (\textit{mp\_reduce}) \\ |
|
4785 9. for $ix$ from $2^{winsize - 1} + 1$ to $2^{winsize} - 1$ do \\ |
|
4786 \hspace{3mm}9.1 $M_{ix} \leftarrow M_{ix - 1} \cdot M_{1}$ (\textit{mp\_mul}) \\ |
|
4787 \hspace{3mm}9.2 $M_{ix} \leftarrow M_{ix} \mbox{ (mod }p\mbox{)}$ (\textit{mp\_reduce}) \\ |
|
4788 10. $res \leftarrow 1$ \\ |
|
4789 \\ |
|
4790 Start Sliding Window. \\ |
|
4791 11. $mode \leftarrow 0, bitcnt \leftarrow 1, buf \leftarrow 0, digidx \leftarrow x.used - 1, bitcpy \leftarrow 0, bitbuf \leftarrow 0$ \\ |
|
4792 12. Loop \\ |
|
4793 \hspace{3mm}12.1 $bitcnt \leftarrow bitcnt - 1$ \\ |
|
4794 \hspace{3mm}12.2 If $bitcnt = 0$ then do \\ |
|
4795 \hspace{6mm}12.2.1 If $digidx = -1$ goto step 13. \\ |
|
4796 \hspace{6mm}12.2.2 $buf \leftarrow x_{digidx}$ \\ |
|
4797 \hspace{6mm}12.2.3 $digidx \leftarrow digidx - 1$ \\ |
|
4798 \hspace{6mm}12.2.4 $bitcnt \leftarrow lg(\beta)$ \\ |
|
4799 Continued on next page. \\ |
|
4800 \hline |
|
4801 \end{tabular} |
|
4802 \end{center} |
|
4803 \end{small} |
|
4804 \caption{Algorithm s\_mp\_exptmod} |
|
4805 \end{figure} |
|
4806 |
|
4807 \newpage\begin{figure}[!here] |
|
4808 \begin{small} |
|
4809 \begin{center} |
|
4810 \begin{tabular}{l} |
|
4811 \hline Algorithm \textbf{s\_mp\_exptmod} (\textit{continued}). \\ |
|
4812 \textbf{Input}. mp\_int $g$, $x$ and $p$ \\ |
|
4813 \textbf{Output}. $y \equiv g^x \mbox{ (mod }p\mbox{)}$ \\ |
|
4814 \hline \\ |
|
4815 \hspace{3mm}12.3 $y \leftarrow (buf >> (lg(\beta) - 1))$ AND $1$ \\ |
|
4816 \hspace{3mm}12.4 $buf \leftarrow buf << 1$ \\ |
|
4817 \hspace{3mm}12.5 if $mode = 0$ and $y = 0$ then goto step 12. \\ |
|
4818 \hspace{3mm}12.6 if $mode = 1$ and $y = 0$ then do \\ |
|
4819 \hspace{6mm}12.6.1 $res \leftarrow res^2$ \\ |
|
4820 \hspace{6mm}12.6.2 $res \leftarrow res \mbox{ (mod }p\mbox{)}$ \\ |
|
4821 \hspace{6mm}12.6.3 Goto step 12. \\ |
|
4822 \hspace{3mm}12.7 $bitcpy \leftarrow bitcpy + 1$ \\ |
|
4823 \hspace{3mm}12.8 $bitbuf \leftarrow bitbuf + (y << (winsize - bitcpy))$ \\ |
|
4824 \hspace{3mm}12.9 $mode \leftarrow 2$ \\ |
|
4825 \hspace{3mm}12.10 If $bitcpy = winsize$ then do \\ |
|
4826 \hspace{6mm}Window is full so perform the squarings and single multiplication. \\ |
|
4827 \hspace{6mm}12.10.1 for $ix$ from $0$ to $winsize -1$ do \\ |
|
4828 \hspace{9mm}12.10.1.1 $res \leftarrow res^2$ \\ |
|
4829 \hspace{9mm}12.10.1.2 $res \leftarrow res \mbox{ (mod }p\mbox{)}$ \\ |
|
4830 \hspace{6mm}12.10.2 $res \leftarrow res \cdot M_{bitbuf}$ \\ |
|
4831 \hspace{6mm}12.10.3 $res \leftarrow res \mbox{ (mod }p\mbox{)}$ \\ |
|
4832 \hspace{6mm}Reset the window. \\ |
|
4833 \hspace{6mm}12.10.4 $bitcpy \leftarrow 0, bitbuf \leftarrow 0, mode \leftarrow 1$ \\ |
|
4834 \\ |
|
4835 No more windows left. Check for residual bits of exponent. \\ |
|
4836 13. If $mode = 2$ and $bitcpy > 0$ then do \\ |
|
4837 \hspace{3mm}13.1 for $ix$ from $0$ to $bitcpy - 1$ do \\ |
|
4838 \hspace{6mm}13.1.1 $res \leftarrow res^2$ \\ |
|
4839 \hspace{6mm}13.1.2 $res \leftarrow res \mbox{ (mod }p\mbox{)}$ \\ |
|
4840 \hspace{6mm}13.1.3 $bitbuf \leftarrow bitbuf << 1$ \\ |
|
4841 \hspace{6mm}13.1.4 If $bitbuf$ AND $2^{winsize} \ne 0$ then do \\ |
|
4842 \hspace{9mm}13.1.4.1 $res \leftarrow res \cdot M_{1}$ \\ |
|
4843 \hspace{9mm}13.1.4.2 $res \leftarrow res \mbox{ (mod }p\mbox{)}$ \\ |
|
4844 14. $y \leftarrow res$ \\ |
|
4845 15. Clear $res$, $\mu$ and the $M$ array. \\ |
|
4846 16. Return(\textit{MP\_OKAY}). \\ |
|
4847 \hline |
|
4848 \end{tabular} |
|
4849 \end{center} |
|
4850 \end{small} |
|
4851 \caption{Algorithm s\_mp\_exptmod (continued)} |
|
4852 \end{figure} |
|
4853 |
|
4854 \textbf{Algorithm s\_mp\_exptmod.} |
|
4855 This algorithm computes the $x$'th power of $g$ modulo $p$ and stores the result in $y$. It takes advantage of the Barrett reduction |
|
4856 algorithm to keep the product small throughout the algorithm. |
|
4857 |
|
4858 The first two steps determine the optimal window size based on the number of bits in the exponent. The larger the exponent the |
|
4859 larger the window size becomes. After a window size $winsize$ has been chosen an array of $2^{winsize}$ mp\_int variables is allocated. This |
|
4860 table will hold the values of $g^x \mbox{ (mod }p\mbox{)}$ for $2^{winsize - 1} \le x < 2^{winsize}$. |
|
4861 |
|
4862 After the table is allocated the first power of $g$ is found. Since $g \ge p$ is allowed it must be first reduced modulo $p$ to make |
|
4863 the rest of the algorithm more efficient. The first element of the table at $2^{winsize - 1}$ is found by squaring $M_1$ successively $winsize - 1$ |
|
4864 times. The rest of the table elements are found by multiplying the previous element by $M_1$ modulo $p$. |
|
4865 |
|
4866 Now that the table is available the sliding window may begin. The following list describes the functions of all the variables in the window. |
|
4867 \begin{enumerate} |
|
4868 \item The variable $mode$ dictates how the bits of the exponent are interpreted. |
|
4869 \begin{enumerate} |
|
4870 \item When $mode = 0$ the bits are ignored since no non-zero bit of the exponent has been seen yet. For example, if the exponent were simply |
|
4871 $1$ then there would be $lg(\beta) - 1$ zero bits before the first non-zero bit. In this case bits are ignored until a non-zero bit is found. |
|
4872 \item When $mode = 1$ a non-zero bit has been seen before and a new $winsize$-bit window has not been formed yet. In this mode leading $0$ bits |
|
4873 are read and a single squaring is performed. If a non-zero bit is read a new window is created. |
|
4874 \item When $mode = 2$ the algorithm is in the middle of forming a window and new bits are appended to the window from the most significant bit |
|
4875 downwards. |
|
4876 \end{enumerate} |
|
4877 \item The variable $bitcnt$ indicates how many bits of the current exponent digit are left to be read. When it reaches zero a new digit |
|
4878 is fetched from the exponent. |
|
4879 \item The variable $buf$ holds the currently read digit of the exponent. |
|
4880 \item The variable $digidx$ is an index into the exponent's digits. It starts at the leading digit $x.used - 1$ and moves towards the trailing digit. |
|
4881 \item The variable $bitcpy$ indicates how many bits are in the currently formed window. When it reaches $winsize$ the window is flushed and |
|
4882 the appropriate operations performed. |
|
4883 \item The variable $bitbuf$ holds the current bits of the window being formed. |
|
4884 \end{enumerate} |
|
4885 |
|
4886 All of step 12 is the window processing loop. It will iterate while there are digits available from the exponent to read. The first step |

4887 inside this loop is to extract a new digit if no more bits are available in the current digit. If there are no bits left a new digit is |

4888 read and if there are no digits left then the loop terminates. |
|
4889 |
|
4890 After a digit is made available step 12.3 will extract the most significant bit of the current digit and move all other bits in the digit |
|
4891 upwards. In effect the digit is read from most significant bit to least significant bit and since the digits are read from leading to |
|
4892 trailing edges the entire exponent is read from most significant bit to least significant bit. |
|
4893 |
|
4894 At step 12.5 if the $mode$ and currently extracted bit $y$ are both zero the bit is ignored and the next bit is read. This prevents the |
|
4895 algorithm from having to perform trivial squaring and reduction operations before the first non-zero bit is read. Step 12.6 and 12.7-10 handle |
|
4896 the two cases of $mode = 1$ and $mode = 2$ respectively. |
|
4897 |
|
4898 FIGU,expt_state,Sliding Window State Diagram |
|
4899 |
|
4900 By step 13 there are no more digits left in the exponent. However, a partially filled window may remain. If $mode = 2$ then |
|
4901 a Left-to-Right algorithm is used to process the remaining few bits. |
|
4902 |
|
4903 EXAM,bn_s_mp_exptmod.c |
|
4904 |
|
4905 Lines @26,if@ through @40,}@ determine the optimal window size based on the length of the exponent in bits. The window divisions are sorted |
|
4906 from smallest to greatest so that in each \textbf{if} statement only one condition must be tested. For example, by the \textbf{if} statement |
|
4907 on line @32,if@ the value of $x$ is already known to be greater than $140$. |
|
4908 |
|
4909 The conditional piece of code beginning on line @42,ifdef@ allows the window size to be restricted to five bits. This logic is used to ensure |
|
4910 the table of precomputed powers of $G$ remains relatively small. |
|
4911 |
|
4912 The for loop on line @49,for@ initializes the $M$ array while lines @59,mp_init@ and @62,mp_reduce@ compute the value of $\mu$ required for |
|
4913 Barrett reduction. |
|
4914 |
|
4915 -- More later. |
|
4916 |
|
4917 \section{Quick Power of Two} |
|
4918 Calculating $a = 2^b$ can be performed much more quickly than with any of the previous algorithms. Recall that a logical shift left $m << k$ is |
|
4919 equivalent to $m \cdot 2^k$. By this logic when $m = 1$ a quick power of two can be achieved. |
|
4920 |
|
4921 \begin{figure}[!here] |
|
4922 \begin{small} |
|
4923 \begin{center} |
|
4924 \begin{tabular}{l} |
|
4925 \hline Algorithm \textbf{mp\_2expt}. \\ |
|
4926 \textbf{Input}. integer $b$ \\ |
|
4927 \textbf{Output}. $a \leftarrow 2^b$ \\ |
|
4928 \hline \\ |
|
4929 1. $a \leftarrow 0$ \\ |
|
4930 2. If $a.alloc < \lfloor b / lg(\beta) \rfloor + 1$ then grow $a$ appropriately. \\ |
|
4931 3. $a.used \leftarrow \lfloor b / lg(\beta) \rfloor + 1$ \\ |
|
4932 4. $a_{\lfloor b / lg(\beta) \rfloor} \leftarrow 1 << (b \mbox{ mod } lg(\beta))$ \\ |
|
4933 5. Return(\textit{MP\_OKAY}). \\ |
|
4934 \hline |
|
4935 \end{tabular} |
|
4936 \end{center} |
|
4937 \end{small} |
|
4938 \caption{Algorithm mp\_2expt} |
|
4939 \end{figure} |
|
4940 |
|
4941 \textbf{Algorithm mp\_2expt.} |
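
The digit index and bit offset used in steps three and four can be sketched in C as follows. This is a minimal illustration on a plain digit array, not the library function: \texttt{DIGIT\_BIT}, the \texttt{uint32\_t} digit type and the name \texttt{two\_expt} are assumptions made for the example.

\begin{verbatim}
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DIGIT_BIT 28   /* assumed digit width, lg(beta) */

/* Store 2^b in digits[0..ndigits-1] (least significant digit first).
 * Returns the number of digits used, or 0 if the array is too small. */
static size_t two_expt(uint32_t *digits, size_t ndigits, unsigned b)
{
    size_t idx = b / DIGIT_BIT;          /* floor(b / lg(beta)) */

    if (idx >= ndigits) {
        return 0;                        /* caller must grow the array */
    }
    memset(digits, 0, ndigits * sizeof *digits);
    digits[idx] = (uint32_t)1 << (b % DIGIT_BIT);
    return idx + 1;                      /* the new used count */
}
\end{verbatim}

For instance, with $28$ bit digits the value $2^{30}$ occupies two digits and the upper digit holds $1 << 2 = 4$.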
|
4942 |
|
4943 EXAM,bn_mp_2expt.c |
|
4944 |
|
4945 \chapter{Higher Level Algorithms} |
|
4946 |
|
4947 This chapter discusses the various higher level algorithms that are required to complete a well rounded multiple precision integer package. These |
|
4948 routines are less performance oriented than the algorithms of chapters five, six and seven but are no less important. |
|
4949 |
|
4950 The first section describes a method of integer division with remainder that is universally well known. It provides the signed division logic |
|
4951 for the package. The subsequent section discusses a set of algorithms which allow a single digit to be the second operand for a variety of operations. |
|
4952 These algorithms serve mostly to simplify other algorithms where small constants are required. The last two sections discuss how to manipulate |
|
4953 various representations of integers. For example, converting from an mp\_int to a string of characters. |
|
4954 |
|
4955 \section{Integer Division with Remainder} |
|
4956 \label{sec:division} |
|
4957 |
|
4958 Integer division, aside from modular exponentiation, is the most intensive algorithm to compute. Like addition, subtraction and multiplication |
|
4959 the basis of this algorithm is the long-hand division algorithm taught to school children. Throughout this discussion several common variables |
|
4960 will be used. Let $x$ represent the divisor and $y$ represent the dividend. Let $q$ represent the integer quotient $\lfloor y / x \rfloor$ and |
|
4961 let $r$ represent the remainder $r = y - x \lfloor y / x \rfloor$. The following simple algorithm will be used to start the discussion. |
|
4962 |
|
4963 \newpage\begin{figure}[!here] |
|
4964 \begin{small} |
|
4965 \begin{center} |
|
4966 \begin{tabular}{l} |
|
4967 \hline Algorithm \textbf{Radix-$\beta$ Integer Division}. \\ |
|
4968 \textbf{Input}. integer $x$ and $y$ \\ |
|
4969 \textbf{Output}. $q = \lfloor y/x\rfloor, r = y - xq$ \\ |
|
4970 \hline \\ |
|
4971 1. $q \leftarrow 0$ \\ |
|
4972 2. $n \leftarrow \vert \vert y \vert \vert - \vert \vert x \vert \vert$ \\ |
|
4973 3. for $t$ from $n$ down to $0$ do \\ |
|
4974 \hspace{3mm}3.1 Maximize $k$ such that $kx\beta^t$ is less than or equal to $y$ and $(k + 1)x\beta^t$ is greater. \\ |
|
4975 \hspace{3mm}3.2 $q \leftarrow q + k\beta^t$ \\ |
|
4976 \hspace{3mm}3.3 $y \leftarrow y - kx\beta^t$ \\ |
|
4977 4. $r \leftarrow y$ \\ |
|
4978 5. Return($q, r$) \\ |
|
4979 \hline |
|
4980 \end{tabular} |
|
4981 \end{center} |
|
4982 \end{small} |
|
4983 \caption{Algorithm Radix-$\beta$ Integer Division} |
|
4984 \label{fig:raddiv} |
|
4985 \end{figure} |
|
4986 |
|
4987 As children we are taught this very simple algorithm for the case of $\beta = 10$. Almost instinctively several optimizations are taught whose |
|
4988 rationale is never explained. For this example let $y = 5471$ represent the dividend and $x = 23$ represent the divisor. |
|
4989 |
|
4990 To find the first digit of the quotient the value of $k$ must be maximized such that $kx\beta^t$ is less than or equal to $y$ and |
|
4991 simultaneously $(k + 1)x\beta^t$ is greater than $y$. Implicitly $k$ is the maximum value the $t$'th digit of the quotient may have. The habitual method |
|
4992 used to find the maximum is to ``eyeball'' the two numbers, typically only the leading digits and quickly estimate a quotient. By only using leading |
|
4993 digits a much simpler division may be used to form an educated guess at what the value must be. In this case $k = \lfloor 54/23\rfloor = 2$ quickly |
|
4994 arises as a possible solution. Indeed $2x\beta^2 = 4600$ is less than $y = 5471$ and simultaneously $(k + 1)x\beta^2 = 6900$ is larger than $y$. |
|
4995 As a result $k\beta^2$ is added to the quotient which now equals $q = 200$ and $4600$ is subtracted from $y$ to give a remainder of $y = 871$. |
|
4996 |
|
4997 Again this process is repeated to produce the quotient digit $k = 3$ which makes the quotient $q = 200 + 3\beta = 230$ and the remainder |
|
4998 $y = 871 - 3x\beta = 181$. Finally the last iteration of the loop produces $k = 7$ which leads to the quotient $q = 230 + 7 = 237$ and the |
|
4999 remainder $y = 181 - 7x = 20$. The final quotient and remainder found are $q = 237$ and $r = y = 20$ which are indeed correct since |
|
5000 $237 \cdot 23 + 20 = 5471$ is true. |
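
The worked example can be reproduced with a short C sketch of the algorithm in figure~\ref{fig:raddiv}. It is an illustration on machine words only: the names \texttt{num\_digits} and \texttt{radix\_div} are hypothetical, and the quotient digit $k$ is found by repeated subtraction rather than by the estimation discussed in the next section.

\begin{verbatim}
#include <assert.h>

/* Number of radix-beta digits needed to write v (at least one). */
static unsigned num_digits(unsigned long v, unsigned long beta)
{
    unsigned n = 1;
    while (v >= beta) {
        v /= beta;
        ++n;
    }
    return n;
}

/* q = floor(y / x), r = y - x*q, one radix-beta quotient digit at a time.
 * Assumes x != 0, beta >= 2 and operands small enough that x * beta^t
 * does not overflow an unsigned long. */
static void radix_div(unsigned long y, unsigned long x, unsigned long beta,
                      unsigned long *q, unsigned long *r)
{
    unsigned ny, nx, t, i;

    assert(x != 0 && beta >= 2);
    *q = 0;
    ny = num_digits(y, beta);
    nx = num_digits(x, beta);
    if (ny >= nx) {
        for (t = ny - nx + 1; t-- > 0; ) {   /* t = n down to 0 */
            unsigned long scale = 1, step, k = 0;
            for (i = 0; i < t; i++) {
                scale *= beta;               /* scale = beta^t */
            }
            step = x * scale;                /* x * beta^t */
            while (step <= y) {              /* steps 3.1 and 3.3 */
                y -= step;
                ++k;
            }
            *q += k * scale;                 /* step 3.2 */
        }
    }
    *r = y;                                  /* step 4 */
}
\end{verbatim}

Calling \texttt{radix\_div(5471, 23, 10, \&q, \&r)} walks through the same three iterations as the text, with quotient digits $2$, $3$ and $7$.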
|
5001 |
|
5002 \subsection{Quotient Estimation} |
|
5003 \label{sec:divest} |
|
5004 As alluded to earlier the quotient digit $k$ can be estimated from only the leading digits of both the divisor and dividend. When $p$ leading |
|
5005 digits are used from both the divisor and dividend to form an estimation the accuracy of the estimation rises as $p$ grows. Technically |
|
5006 speaking the estimation is based on assuming the lower $\vert \vert y \vert \vert - p$ and $\vert \vert x \vert \vert - p$ digits of the |
|
5007 dividend and divisor are zero. |
|
5008 |
|
5009 The value of the estimation may be off by a few values in either direction and in general is fairly correct. A simplification \cite[pp. 271]{TAOCPV2} |
|
5010 of the estimation technique is to use $t + 1$ digits of the dividend and $t$ digits of the divisor, in particular when $t = 1$. The estimate |
|
5011 using this technique is never too small. For the following proof let $t = \vert \vert y \vert \vert - 1$ and $s = \vert \vert x \vert \vert - 1$ |
|
5012 represent the indices of the most significant digits of the dividend and divisor respectively. |
|
5013 |
|
5014 \textbf{Proof.}\textit{ The quotient $\hat k = \lfloor (y_t\beta + y_{t-1}) / x_s \rfloor$ is greater than or equal to |
|
5015 $k = \lfloor y / (x \cdot \beta^{\vert \vert y \vert \vert - \vert \vert x \vert \vert - 1}) \rfloor$. } |
|
5016 The first obvious case is when $\hat k = \beta - 1$ in which case the proof is concluded since the real quotient cannot be larger. For all other |
|
5017 cases $\hat k = \lfloor (y_t\beta + y_{t-1}) / x_s \rfloor$ and $\hat k x_s \ge y_t\beta + y_{t-1} - x_s + 1$. The latter portion of the inequality |
|
5018 $-x_s + 1$ arises from the fact that a truncated integer division will give the same quotient for at most $x_s - 1$ values. Next a series of |
|
5019 inequalities will prove the hypothesis. |
|
5020 |
|
5021 \begin{equation} |
|
5022 y - \hat k x \le y - \hat k x_s\beta^s |
|
5023 \end{equation} |
|
5024 |
|
5025 This is trivially true since $x \ge x_s\beta^s$. Next we replace $\hat kx_s\beta^s$ by the previous inequality for $\hat kx_s$. |
|
5026 |
|
5027 \begin{equation} |
|
5028 y - \hat k x \le y_t\beta^t + \ldots + y_0 - (y_t\beta^t + y_{t-1}\beta^{t-1} - x_s\beta^t + \beta^s) |
|
5029 \end{equation} |
|
5030 |
|
5031 By simplifying the previous inequality the following inequality is formed. |
|
5032 |
|
5033 \begin{equation} |
|
5034 y - \hat k x \le y_{t-2}\beta^{t-2} + \ldots + y_0 + x_s\beta^s - \beta^s |
|
5035 \end{equation} |
|
5036 |
|
5037 Subsequently, |
|
5038 |
|
5039 \begin{equation} |
|
5040 y_{t-2}\beta^{t-2} + \ldots + y_0 + x_s\beta^s - \beta^s < x_s\beta^s \le x |
|
5041 \end{equation} |
|
5042 |
|
5043 This proves that $y - \hat kx < x$ and by consequence $\hat k \ge k$ which concludes the proof. \textbf{QED} |
|
5044 |
|
5045 |
|
5046 \subsection{Normalized Integers} |
|
5047 For the purposes of division a normalized input is when the divisor's leading digit $x_n$ is greater than or equal to $\beta / 2$. By multiplying both |
|
5048 $x$ and $y$ by $j = \lfloor \beta / (x_n + 1) \rfloor$ the quotient remains unchanged and the remainder is simply $j$ times the original |
|
5049 remainder. The purpose of normalization is to ensure the leading digit of the divisor is sufficiently large such that the estimated quotient will |
|
5050 lie in the domain of a single digit. Consider the maximum dividend $(\beta - 1) \cdot \beta + (\beta - 1)$ and the minimum divisor $\beta / 2$. |
|
5051 |
|
5052 \begin{equation} |
|
5053 {{\beta^2 - 1} \over { \beta / 2}} \le 2\beta - {2 \over \beta} |
|
5054 \end{equation} |
|
5055 |
|
5056 At most the quotient approaches $2\beta$, however, in practice this will not occur since that would imply the previous quotient digit was too small. |
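
As a small worked check of this bound, take $\beta = 10$ (a value chosen here purely for illustration). The largest two digit dividend is $99$ and the smallest normalized leading digit is $\beta / 2 = 5$, whereas an unnormalized divisor may have a leading digit as small as $1$.

\begin{equation}
\left\lfloor {99 \over 5} \right\rfloor = 19 < 2\beta = 20, \qquad \left\lfloor {99 \over 1} \right\rfloor = 99
\end{equation}

With a normalized divisor the estimate stays below $2\beta$, while without normalization it can approach $\beta^2$.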
|
5057 |
|
5058 \subsection{Radix-$\beta$ Division with Remainder} |
|
5059 \newpage\begin{figure}[!here] |
|
5060 \begin{small} |
|
5061 \begin{center} |
|
5062 \begin{tabular}{l} |
|
5063 \hline Algorithm \textbf{mp\_div}. \\ |
|
5064 \textbf{Input}. mp\_int $a, b$ \\ |
|
5065 \textbf{Output}. $c = \lfloor a/b \rfloor$, $d = a - bc$ \\ |
|
5066 \hline \\ |
|
5067 1. If $b = 0$ return(\textit{MP\_VAL}). \\ |
|
5068 2. If $\vert a \vert < \vert b \vert$ then do \\ |
|
5069 \hspace{3mm}2.1 $d \leftarrow a$ \\ |
|
5070 \hspace{3mm}2.2 $c \leftarrow 0$ \\ |
|
5071 \hspace{3mm}2.3 Return(\textit{MP\_OKAY}). \\ |
|
5072 \\ |
|
5073 Set up the quotient to receive the digits. \\ |
|
5074 3. Grow $q$ to $a.used + 2$ digits. \\ |
|
5075 4. $q \leftarrow 0$ \\ |
|
5076 5. $x \leftarrow \vert a \vert , y \leftarrow \vert b \vert$ \\ |
|
5077 6. $sign \leftarrow \left \lbrace \begin{array}{ll} |
|
5078 MP\_ZPOS & \mbox{if }a.sign = b.sign \\ |
|
5079 MP\_NEG & \mbox{otherwise} \\ |
|
5080 \end{array} \right .$ \\ |
|
5081 \\ |
|
5082 Normalize the inputs such that the leading digit of $y$ is greater than or equal to $\beta / 2$. \\ |
|
5083 7. $norm \leftarrow (lg(\beta) - 1) - (\lceil lg(y) \rceil \mbox{ (mod }lg(\beta)\mbox{)})$ \\ |
|
5084 8. $x \leftarrow x \cdot 2^{norm}, y \leftarrow y \cdot 2^{norm}$ \\ |
|
5085 \\ |
|
5086 Find the leading digit of the quotient. \\ |
|
5087 9. $n \leftarrow x.used - 1, t \leftarrow y.used - 1$ \\ |
|
5088 10. $y \leftarrow y \cdot \beta^{n - t}$ \\ |
|
5089 11. While ($x \ge y$) do \\ |
|
5090 \hspace{3mm}11.1 $q_{n - t} \leftarrow q_{n - t} + 1$ \\ |
|
5091 \hspace{3mm}11.2 $x \leftarrow x - y$ \\ |
|
5092 12. $y \leftarrow \lfloor y / \beta^{n-t} \rfloor$ \\ |
|
5093 \\ |
|
5094 Continued on the next page. \\ |
|
5095 \hline |
|
5096 \end{tabular} |
|
5097 \end{center} |
|
5098 \end{small} |
|
5099 \caption{Algorithm mp\_div} |
|
5100 \end{figure} |
|
5101 |
|
5102 \newpage\begin{figure}[!here] |
|
5103 \begin{small} |
|
5104 \begin{center} |
|
5105 \begin{tabular}{l} |
|
5106 \hline Algorithm \textbf{mp\_div} (continued). \\ |
|
5107 \textbf{Input}. mp\_int $a, b$ \\ |
|
5108 \textbf{Output}. $c = \lfloor a/b \rfloor$, $d = a - bc$ \\ |
|
5109 \hline \\ |
|
5110 Now find the remainder of the digits. \\ |
|
5111 13. for $i$ from $n$ down to $(t + 1)$ do \\ |
|
5112 \hspace{3mm}13.1 If $i > x.used$ then jump to the next iteration of this loop. \\ |
|
5113 \hspace{3mm}13.2 If $x_{i} = y_{t}$ then \\ |
|
5114 \hspace{6mm}13.2.1 $q_{i - t - 1} \leftarrow \beta - 1$ \\ |
|
5115 \hspace{3mm}13.3 else \\ |
|
5116 \hspace{6mm}13.3.1 $\hat r \leftarrow x_{i} \cdot \beta + x_{i - 1}$ \\ |
|
5117 \hspace{6mm}13.3.2 $\hat r \leftarrow \lfloor \hat r / y_{t} \rfloor$ \\ |
|
5118 \hspace{6mm}13.3.3 $q_{i - t - 1} \leftarrow \hat r$ \\ |
|
5119 \hspace{3mm}13.4 $q_{i - t - 1} \leftarrow q_{i - t - 1} + 1$ \\ |
|
5120 \\ |
|
5121 Fixup quotient estimation. \\ |
|
5122 \hspace{3mm}13.5 Loop \\ |
|
5123 \hspace{6mm}13.5.1 $q_{i - t - 1} \leftarrow q_{i - t - 1} - 1$ \\ |
|
5124 \hspace{6mm}13.5.2 t$1 \leftarrow 0$ \\ |
|
5125 \hspace{6mm}13.5.3 t$1_0 \leftarrow y_{t - 1}, $ t$1_1 \leftarrow y_t,$ t$1.used \leftarrow 2$ \\ |
|
5126 \hspace{6mm}13.5.4 $t1 \leftarrow t1 \cdot q_{i - t - 1}$ \\ |
|
5127 \hspace{6mm}13.5.5 t$2_0 \leftarrow x_{i - 2}, $ t$2_1 \leftarrow x_{i - 1}, $ t$2_2 \leftarrow x_i, $ t$2.used \leftarrow 3$ \\ |
|
5128 \hspace{6mm}13.5.6 If $\vert t1 \vert > \vert t2 \vert$ then goto step 13.5. \\ |
|
5129 \hspace{3mm}13.6 t$1 \leftarrow y \cdot q_{i - t - 1}$ \\ |
|
5130 \hspace{3mm}13.7 t$1 \leftarrow $ t$1 \cdot \beta^{i - t - 1}$ \\ |
|
5131 \hspace{3mm}13.8 $x \leftarrow x - $ t$1$ \\ |
|
5132 \hspace{3mm}13.9 If $x.sign = MP\_NEG$ then \\ |
|
5133 \hspace{6mm}13.10 t$1 \leftarrow y$ \\ |
|
5134 \hspace{6mm}13.11 t$1 \leftarrow $ t$1 \cdot \beta^{i - t - 1}$ \\ |
|
5135 \hspace{6mm}13.12 $x \leftarrow x + $ t$1$ \\ |
|
5136 \hspace{6mm}13.13 $q_{i - t - 1} \leftarrow q_{i - t - 1} - 1$ \\ |
|
5137 \\ |
|
5138 Finalize the result. \\ |
|
5139 14. Clamp excess digits of $q$ \\ |
|
5140 15. $c \leftarrow q, c.sign \leftarrow sign$ \\ |
|
5141 16. $x.sign \leftarrow a.sign$ \\ |
|
5142 17. $d \leftarrow \lfloor x / 2^{norm} \rfloor$ \\ |
|
5143 18. Return(\textit{MP\_OKAY}). \\ |
|
5144 \hline |
|
5145 \end{tabular} |
|
5146 \end{center} |
|
5147 \end{small} |
|
5148 \caption{Algorithm mp\_div (continued)} |
|
5149 \end{figure} |
|
5150 \textbf{Algorithm mp\_div.} |
|
5151 This algorithm will calculate quotient and remainder from an integer division given a dividend and divisor. The algorithm is a signed |
|
5152 division and will produce a fully qualified quotient and remainder. |
|
5153 |
|
5154 First the divisor $b$ must be non-zero which is enforced in step one. If the divisor is larger than the dividend then the quotient is implicitly |
|
5155 zero and the remainder is the dividend. |
|
5156 |
|
5157 After the first two trivial cases of inputs are handled the variable $q$ is set up to receive the digits of the quotient. Two unsigned copies of the |
|
5158 divisor $y$ and dividend $x$ are made as well. The core of the division algorithm is an unsigned division and will only work if the values are |
|
5159 positive. Now the two values $x$ and $y$ must be normalized such that the leading digit of $y$ is greater than or equal to $\beta / 2$. |
|
5160 This is performed by shifting both to the left by enough bits to get the desired normalization. |
|
5161 |
|
5162 At this point the division algorithm can begin producing digits of the quotient. Recall that the maximum value of the estimation used is |
|
5163 $2\beta - {2 \over \beta}$ which means that a digit of the quotient must be first produced by another means. In this case $y$ is shifted |
|
5164 to the left (\textit{step ten}) so that it has the same number of digits as $x$. The loop on step eleven will subtract multiples of the |
|
5165 shifted copy of $y$ until $x$ is smaller. Since the leading digit of $y$ is greater than or equal to $\beta/2$ this loop will iterate at most two |
|
5166 times to produce the desired leading digit of the quotient. |
|
5167 |
|
5168 Now the remainder of the digits can be produced. The equation $\hat q = \lfloor {{x_i \beta + x_{i-1}}\over y_t} \rfloor$ is used to fairly |
|
5169 accurately approximate the true quotient digit. The estimation can in theory produce an estimation as high as $2\beta - {2 \over \beta}$ but by |
|
5170 induction the upper quotient digit is correct (\textit{as established on step eleven}) and the estimate must be less than $\beta$. |
|
5171 |
|
5172 Recall from section~\ref{sec:divest} that the estimation is never too low but may be too high. The next step of the estimation process is |
|
5173 to refine the estimation. The loop on step 13.5 uses $x_i\beta^2 + x_{i-1}\beta + x_{i-2}$ and $q_{i - t - 1}(y_t\beta + y_{t-1})$ as a higher |
|
5174 order approximation to adjust the quotient digit. |
|
5175 |
|
5176 After both phases of estimation the quotient digit may still be off by a value of one\footnote{This is similar to the error introduced |
|
5177 by optimizing Barrett reduction.}. Steps 13.6 through 13.8 subtract the multiple of the divisor from the dividend (\textit{similar to step 3.3 of |
|
5178 algorithm~\ref{fig:raddiv}}) and step 13.9 subsequently adds a multiple of the divisor back if the quotient digit was too large. |
|
5179 |
|
5180 Now that the quotient has been determined, finalizing the result is a matter of clamping the quotient, fixing the signs and de-normalizing the |
|
5181 remainder. An important aspect of this algorithm seemingly overlooked in other descriptions such as that of Algorithm 14.20 HAC \cite[pp. 598]{HAC} |
|
5182 is that when the estimations are being made (\textit{inside the loop on step 13.5}) the digits $y_{t-1}$, $x_{i-2}$ and $x_{i-1}$ may lie |
|
5183 outside their respective boundaries. For example, if $t = 0$ or $i \le 1$ then the digits would be undefined. In those cases the digits should |
|
5184 respectively be replaced with a zero. |
|
5185 |
|
5186 EXAM,bn_mp_div.c |
|
5187 |
|
5188 The implementation of this algorithm differs slightly from the pseudo code presented previously. In this algorithm either of the quotient $c$ or |
|
5189 remainder $d$ may be passed as a \textbf{NULL} pointer which indicates their value is not desired. For example, the C code to call the division |
|
5190 algorithm with only the quotient is |
|
5191 |
|
5192 \begin{verbatim} |
|
5193 mp_div(&a, &b, &c, NULL); /* c = [a/b] */ |
|
5194 \end{verbatim} |
|
5195 |
|
5196 Lines @37,if@ and @42,if@ handle the two trivial cases of inputs which are division by zero and dividend smaller than the divisor |
|
5197 respectively. After the two trivial cases all of the temporary variables are initialized. Line @76,neg@ determines the sign of |
|
5198 the quotient and line @77,sign@ ensures that both $x$ and $y$ are positive. |
|
5199 |
|
5200 The number of bits in the leading digit is calculated on line @80,norm@. Implicitly an mp\_int with $r$ digits will require $lg(\beta)(r-1) + k$ bits |
|
5201 of precision which when reduced modulo $lg(\beta)$ produces the value of $k$. In this case $k$ is the number of bits in the leading digit which is |
|
5202 exactly what is required. For the algorithm to operate $k$ must equal $lg(\beta) - 1$ and when it does not the inputs must be normalized by shifting |
|
5203 them to the left by $lg(\beta) - 1 - k$ bits. |
|
5204 |
|
5205 Throughout the variables $n$ and $t$ will represent the highest digit of $x$ and $y$ respectively. These are first used to produce the |
|
5206 leading digit of the quotient. The loop beginning on line @113,for@ will produce the remainder of the quotient digits. |
|
5207 |
|
5208 The conditional ``continue'' on line @114,if@ is used to prevent the algorithm from reading past the leading edge of $x$ which can occur when the |
|
5209 algorithm eliminates multiple non-zero digits in a single iteration. This ensures that $x_i$ is always non-zero since by definition the digits |
|
5210 above the $i$'th position of $x$ must be zero in order for the quotient to be precise\footnote{Precise as far as integer division is concerned.}. |
|
5211 |
|
5212 Lines @142,t1@, @143,t1@ and @150,t2@ through @152,t2@ manually construct the high accuracy estimations by setting the digits of the two mp\_int |
|
5213 variables directly. |
|
5214 |
|
5215 \section{Single Digit Helpers} |
|
5216 |
|
5217 This section briefly describes a series of single digit helper algorithms which come in handy when working with small constants. All of |
|
5218 the helper functions assume the single digit input is positive and will treat it as such. |
|
5219 |
|
5220 \subsection{Single Digit Addition and Subtraction} |
|
5221 |
|
5222 Both addition and subtraction are performed by ``cheating'' and using mp\_set followed by the higher level addition or subtraction |
|
5223 algorithms. As a result these algorithms are substantially simpler with a slight cost in performance. |
|
5224 |
|
5225 \newpage\begin{figure}[!here] |
|
5226 \begin{small} |
|
5227 \begin{center} |
|
5228 \begin{tabular}{l} |
|
5229 \hline Algorithm \textbf{mp\_add\_d}. \\ |
|
5230 \textbf{Input}. mp\_int $a$ and a mp\_digit $b$ \\ |
|
5231 \textbf{Output}. $c = a + b$ \\ |
|
5232 \hline \\ |
|
5233 1. $t \leftarrow b$ (\textit{mp\_set}) \\ |
|
5234 2. $c \leftarrow a + t$ \\ |
|
5235 3. Return(\textit{MP\_OKAY}) \\ |
|
5236 \hline |
|
5237 \end{tabular} |
|
5238 \end{center} |
|
5239 \end{small} |
|
5240 \caption{Algorithm mp\_add\_d} |
|
5241 \end{figure} |
|
5242 |
|
5243 \textbf{Algorithm mp\_add\_d.} |
|
5244 This algorithm initializes a temporary mp\_int with the value of the single digit and uses algorithm mp\_add to add the two values together. |
|
5245 |
|
5246 EXAM,bn_mp_add_d.c |
|
5247 |
|
5248 Clever use of the letter 't'. |
|
5249 |
|
5250 \subsubsection{Subtraction} |
|
5251 The single digit subtraction algorithm mp\_sub\_d is essentially the same except it uses mp\_sub to subtract the digit from the mp\_int. |
|
5252 |
|
5253 \subsection{Single Digit Multiplication} |
|
5254 Single digit multiplication arises enough in division and radix conversion that it ought to be implemented as a special case of the baseline |
|
5255 multiplication algorithm. Essentially this algorithm is a modified version of algorithm s\_mp\_mul\_digs where one of the multiplicands |
|
5256 only has one digit. |
|
5257 |
|
5258 \begin{figure}[!here] |
|
5259 \begin{small} |
|
5260 \begin{center} |
|
5261 \begin{tabular}{l} |
|
5262 \hline Algorithm \textbf{mp\_mul\_d}. \\ |
|
5263 \textbf{Input}. mp\_int $a$ and a mp\_digit $b$ \\ |
|
5264 \textbf{Output}. $c = ab$ \\ |
|
5265 \hline \\ |
|
5266 1. $pa \leftarrow a.used$ \\ |
|
5267 2. Grow $c$ to at least $pa + 1$ digits. \\ |
|
5268 3. $oldused \leftarrow c.used$ \\ |
|
5269 4. $c.used \leftarrow pa + 1$ \\ |
|
5270 5. $c.sign \leftarrow a.sign$ \\ |
|
5271 6. $\mu \leftarrow 0$ \\ |
|
5272 7. for $ix$ from $0$ to $pa - 1$ do \\ |
|
5273 \hspace{3mm}7.1 $\hat r \leftarrow \mu + a_{ix}b$ \\ |
|
5274 \hspace{3mm}7.2 $c_{ix} \leftarrow \hat r \mbox{ (mod }\beta\mbox{)}$ \\ |
|
5275 \hspace{3mm}7.3 $\mu \leftarrow \lfloor \hat r / \beta \rfloor$ \\ |
|
5276 8. $c_{pa} \leftarrow \mu$ \\ |
|
5277 9. for $ix$ from $pa + 1$ to $oldused$ do \\ |
|
5278 \hspace{3mm}9.1 $c_{ix} \leftarrow 0$ \\ |
|
5279 10. Clamp excess digits of $c$. \\ |
|
5280 11. Return(\textit{MP\_OKAY}). \\ |
|
5281 \hline |
|
5282 \end{tabular} |
|
5283 \end{center} |
|
5284 \end{small} |
|
5285 \caption{Algorithm mp\_mul\_d} |
|
5286 \end{figure} |
|
5287 \textbf{Algorithm mp\_mul\_d.} |
|
5288 This algorithm quickly multiplies an mp\_int by a small single digit value. It is specially tailored to the job and has a minimum of overhead. |
|
5289 Unlike the full multiplication algorithms this algorithm does not require any significant temporary storage or memory allocations. |
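
The carry propagation of steps six through ten can be sketched in C on a plain digit array. This is an illustration rather than the library source: \texttt{DIGIT\_BIT}, \texttt{DIGIT\_MASK}, the \texttt{uint32\_t} digit type and the name \texttt{mul\_d} are assumptions, and the caller is assumed to provide room for $pa + 1$ digits in the destination.

\begin{verbatim}
#include <stddef.h>
#include <stdint.h>

#define DIGIT_BIT  28                                /* assumed lg(beta) */
#define DIGIT_MASK ((((uint32_t)1) << DIGIT_BIT) - 1)

/* c = a * b for a pa-digit a and a single digit b.  c must hold pa + 1
 * digits and may alias a, since a[ix] is read before c[ix] is written.
 * Returns the used count of c after clamping. */
static size_t mul_d(const uint32_t *a, size_t pa, uint32_t b, uint32_t *c)
{
    uint64_t mu = 0;                     /* carry */
    size_t   ix;

    for (ix = 0; ix < pa; ix++) {
        uint64_t r = mu + (uint64_t)a[ix] * b;     /* step 7.1 */
        c[ix] = (uint32_t)(r & DIGIT_MASK);        /* r mod beta */
        mu    = r >> DIGIT_BIT;                    /* floor(r / beta) */
    }
    c[pa] = (uint32_t)mu;                          /* step 8 */

    ix = pa + 1;                                   /* clamp leading zeros */
    while (ix > 0 && c[ix - 1] == 0) {
        --ix;
    }
    return ix;
}
\end{verbatim}

The double width accumulator plays the role of $\hat r$; with $28$ bit digits the sum $\mu + a_{ix}b$ always fits in $64$ bits.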
|
5290 |
|
5291 EXAM,bn_mp_mul_d.c |
|
5292 |
|
5293 In this implementation the destination $c$ may point to the same mp\_int as the source $a$ since the result is written after the digit is |
|
5294 read from the source. This function uses pointer aliases $tmpa$ and $tmpc$ for the digits of $a$ and $c$ respectively. |
|
5295 |
|
5296 \subsection{Single Digit Division} |
|
5297 Like the single digit multiplication algorithm, single digit division is also a fairly common algorithm used in radix conversion. Since the |
|
5298 divisor is only a single digit a specialized variant of the division algorithm can be used to compute the quotient. |
|
5299 |
|
5300 \newpage\begin{figure}[!here] |
|
5301 \begin{small} |
|
5302 \begin{center} |
|
5303 \begin{tabular}{l} |
|
5304 \hline Algorithm \textbf{mp\_div\_d}. \\ |
|
5305 \textbf{Input}. mp\_int $a$ and a mp\_digit $b$ \\ |
|
5306 \textbf{Output}. $c = \lfloor a / b \rfloor, d = a - cb$ \\ |
|
5307 \hline \\ |
|
5308 1. If $b = 0$ then return(\textit{MP\_VAL}).\\ |
|
5309 2. If $b = 3$ then use algorithm mp\_div\_3 instead. \\ |
|
5310 3. Init $q$ to $a.used$ digits. \\ |
|
5311 4. $q.used \leftarrow a.used$ \\ |
|
5312 5. $q.sign \leftarrow a.sign$ \\ |
|
5313 6. $\hat w \leftarrow 0$ \\ |
|
5314 7. for $ix$ from $a.used - 1$ down to $0$ do \\ |
|
5315 \hspace{3mm}7.1 $\hat w \leftarrow \hat w \beta + a_{ix}$ \\ |
|
5316 \hspace{3mm}7.2 If $\hat w \ge b$ then \\ |
|
5317 \hspace{6mm}7.2.1 $t \leftarrow \lfloor \hat w / b \rfloor$ \\ |
|
5318 \hspace{6mm}7.2.2 $\hat w \leftarrow \hat w \mbox{ (mod }b\mbox{)}$ \\ |
|
5319 \hspace{3mm}7.3 else\\ |
|
5320 \hspace{6mm}7.3.1 $t \leftarrow 0$ \\ |
|
5321 \hspace{3mm}7.4 $q_{ix} \leftarrow t$ \\ |
|
5322 8. $d \leftarrow \hat w$ \\ |
|
5323 9. Clamp excess digits of $q$. \\ |
|
5324 10. $c \leftarrow q$ \\ |
|
5325 11. Return(\textit{MP\_OKAY}). \\ |
|
5326 \hline |
|
5327 \end{tabular} |
|
5328 \end{center} |
|
5329 \end{small} |
|
5330 \caption{Algorithm mp\_div\_d} |
|
5331 \end{figure} |
|
5332 \textbf{Algorithm mp\_div\_d.} |
|
5333 This algorithm divides the mp\_int $a$ by the single mp\_digit $b$ using an optimized approach. Essentially in every iteration of the |
|
5334 algorithm another digit of the dividend is reduced and another digit of quotient produced. Provided $b < \beta$ the value of $\hat w$ |
|
5335 after step 7.1 will be limited such that $0 \le \lfloor \hat w / b \rfloor < \beta$. |
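
The main loop of the algorithm can be sketched in C as follows. The sketch works on a plain digit array and is not the library routine: \texttt{DIGIT\_BIT}, the \texttt{uint32\_t} digit type and the name \texttt{div\_d} are assumptions made for illustration, and the special case for $b = 3$ is left out.

\begin{verbatim}
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DIGIT_BIT 28   /* assumed digit width, lg(beta) */

/* q[0..used-1] = floor(a / b) for a used-digit a and a single digit b.
 * Returns the remainder.  Requires 0 < b < beta. */
static uint32_t div_d(const uint32_t *a, size_t used, uint32_t b,
                      uint32_t *q)
{
    uint64_t w = 0;                      /* running remainder */
    size_t   ix;

    assert(b != 0 && b < ((uint32_t)1 << DIGIT_BIT));
    for (ix = used; ix-- > 0; ) {        /* leading digit first */
        w = (w << DIGIT_BIT) | a[ix];    /* w = w * beta + a[ix] */
        if (w >= b) {
            q[ix] = (uint32_t)(w / b);   /* below beta since w < b * beta */
            w    %= b;
        } else {
            q[ix] = 0;
        }
    }
    return (uint32_t)w;
}
\end{verbatim}

Because the running remainder is always smaller than $b$ before the next digit is appended, each quotient digit produced fits in a single digit.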
|
5336 |
|
5337 If the divisor $b$ is equal to three a variant of this algorithm is used which is called mp\_div\_3. It replaces the division by three with |
|
5338 a multiplication by $\lfloor \beta / 3 \rfloor$ and the appropriate shift and residual fixup. In essence it is much like the Barrett reduction |
|
5339 from chapter seven. |
|
5340 |
|
5341 EXAM,bn_mp_div_d.c |
|
5342 |
|
5343 Like the implementation of algorithm mp\_div this algorithm allows either of the quotient or remainder to be passed as a \textbf{NULL} pointer to |
|
5344 indicate the respective value is not required. This allows a trivial single digit modular reduction algorithm, mp\_mod\_d to be created. |
|
5345 |
|
5346 The division and remainder on lines @44,/@ and @45,%@ can often be replaced by a single division on most processors. For example, the 32-bit x86 based |
|
5347 processors can divide a 64-bit quantity by a 32-bit quantity and produce the quotient and remainder simultaneously. Unfortunately the GCC |
|
5348 compiler does not recognize that optimization and will actually produce two function calls to find the quotient and remainder respectively. |
|
5349 |
|
5350 \subsection{Single Digit Root Extraction} |
|
5351 |
|
5352 Finding the $n$'th root of an integer is fairly easy as far as numerical analysis is concerned. Algorithms such as the Newton-Raphson approximation |
|
5353 (\ref{eqn:newton}) series typically converge very quickly to a root of a differentiable function $f(x)$ when started from a reasonable initial guess. |
|
5354 |
|
5355 \begin{equation} |
|
5356 x_{i+1} = x_i - {f(x_i) \over f'(x_i)} |
|
5357 \label{eqn:newton} |
|
5358 \end{equation} |
|
5359 |
|
5360 In this case the $n$'th root is desired and $f(x) = x^n - a$ where $a$ is the integer of which the root is desired. The derivative of $f(x)$ is |
|
5361 simply $f'(x) = nx^{n - 1}$. Of particular importance is that this algorithm will be used over the integers and not over a continuous domain |
|
5362 such as the real numbers. As a result the root found can be above the true root by a few and must be manually adjusted. Ideally at the end of the |
|
5363 algorithm the $n$'th root $b$ of an integer $a$ is desired such that $b^n \le a$. |
|
5364 |
|
5365 \newpage\begin{figure}[!here] |
|
5366 \begin{small} |
|
5367 \begin{center} |
|
5368 \begin{tabular}{l} |
|
5369 \hline Algorithm \textbf{mp\_n\_root}. \\ |
|
5370 \textbf{Input}. mp\_int $a$ and a mp\_digit $b$ \\ |
|
5371 \textbf{Output}. $c^b \le a$ \\ |
|
5372 \hline \\ |
|
5373 1. If $b$ is even and $a.sign = MP\_NEG$ return(\textit{MP\_VAL}). \\ |
|
5374 2. $sign \leftarrow a.sign$ \\ |
|
5375 3. $a.sign \leftarrow MP\_ZPOS$ \\ |
|
5376 4. t$2 \leftarrow 2$ \\ |
|
5377 5. Loop \\ |
|
5378 \hspace{3mm}5.1 t$1 \leftarrow $ t$2$ \\ |
|
5379 \hspace{3mm}5.2 t$3 \leftarrow $ t$1^{b - 1}$ \\ |
|
5380 \hspace{3mm}5.3 t$2 \leftarrow $ t$3 $ $\cdot$ t$1$ \\ |
|
5381 \hspace{3mm}5.4 t$2 \leftarrow $ t$2 - a$ \\ |
|
5382 \hspace{3mm}5.5 t$3 \leftarrow $ t$3 \cdot b$ \\ |
|
5383 \hspace{3mm}5.6 t$3 \leftarrow \lfloor $t$2 / $t$3 \rfloor$ \\ |
|
5384 \hspace{3mm}5.7 t$2 \leftarrow $ t$1 - $ t$3$ \\ |
|
5385 \hspace{3mm}5.8 If t$1 \ne $ t$2$ then goto step 5. \\ |
|
5386 6. Loop \\ |
|
5387 \hspace{3mm}6.1 t$2 \leftarrow $ t$1^b$ \\ |
|
5388 \hspace{3mm}6.2 If t$2 > a$ then \\ |
|
5389 \hspace{6mm}6.2.1 t$1 \leftarrow $ t$1 - 1$ \\ |
|
5390 \hspace{6mm}6.2.2 Goto step 6. \\ |
|
5391 7. $a.sign \leftarrow sign$ \\ |
|
5392 8. $c \leftarrow $ t$1$ \\ |
|
5393 9. $c.sign \leftarrow sign$ \\ |
|
5394 10. Return(\textit{MP\_OKAY}). \\ |
|
5395 \hline |
|
5396 \end{tabular} |
|
5397 \end{center} |
|
5398 \end{small} |
|
5399 \caption{Algorithm mp\_n\_root} |
|
5400 \end{figure} |
|
5401 \textbf{Algorithm mp\_n\_root.} |
|
5402 This algorithm finds the integer $n$'th root of an input using the Newton-Raphson approach. It is partially optimized based on the observation |
|
5403 that the numerator of ${f(x) \over f'(x)}$ can be derived from a partial denominator. That is at first the denominator is calculated by finding |
|
5404 $x^{b - 1}$. This value can then be multiplied by $x$ and have $a$ subtracted from it to find the numerator. This saves a total of $b - 1$ |
|
5405 multiplications by t$1$ inside the loop. |
|
5406 |
|
5407 The initial value of the approximation is t$2 = 2$ which allows the algorithm to start with very small values and quickly converge on the |
|
5408 root. Ideally this algorithm is meant to find the $n$'th root of an input where $n$ is bounded by $2 \le n \le 5$. |
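
The iteration and the final adjustment can be sketched in C on machine integers. This is a simplified illustration only: it assumes a non-negative input small enough that every intermediate power fits in an \texttt{int64\_t}, it omits the sign handling of steps one through three, and the names \texttt{ipow} and \texttt{n\_root} are hypothetical.

\begin{verbatim}
#include <stdint.h>

/* t^e by repeated multiplication; the result is assumed to fit. */
static int64_t ipow(int64_t t, unsigned e)
{
    int64_t r = 1;
    while (e-- > 0) {
        r *= t;
    }
    return r;
}

/* Largest c with c^b <= a, for a >= 0 and b >= 1. */
static int64_t n_root(int64_t a, unsigned b)
{
    int64_t t1, t2 = 2, t3;

    do {
        t1 = t2;
        t3 = ipow(t1, b - 1);        /* t1^(b-1), part of the denominator */
        t2 = t3 * t1 - a;            /* numerator f(t1) = t1^b - a        */
        t3 = t3 * (int64_t)b;        /* denominator f'(t1)                */
        t3 = t2 / t3;                /* truncating integer division       */
        t2 = t1 - t3;
    } while (t1 != t2);

    while (ipow(t1, b) > a) {        /* the result may be a little high */
        --t1;
    }
    return t1;
}
\end{verbatim}

For $a = 1000$ and $b = 3$ the iteration settles at $11$, one above the true root, and the adjustment loop lowers it to $10$.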
|
5409 |
|
5410 EXAM,bn_mp_n_root.c |
|
5411 |
|
5412 \section{Random Number Generation} |
|
5413 |
|
5414 Random numbers come up in a variety of activities from public key cryptography to simple simulations and various randomized algorithms. Pollard-Rho |
|
5415 factoring, for example, can make use of random values as starting points to find factors of a composite integer. In this case the algorithm presented |
|
5416 is solely for simulations and not intended for cryptographic use. |
|
5417 |
|
5418 \newpage\begin{figure}[!here] |
|
5419 \begin{small} |
|
5420 \begin{center} |
|
5421 \begin{tabular}{l} |
|
5422 \hline Algorithm \textbf{mp\_rand}. \\ |
|
5423 \textbf{Input}. An integer $b$ \\ |
|
5424 \textbf{Output}. A pseudo-random number of $b$ digits \\ |
|
5425 \hline \\ |
|
5426 1. $a \leftarrow 0$ \\ |
|
5427 2. If $b \le 0$ return(\textit{MP\_OKAY}) \\ |
|
5428 3. Pick a non-zero random digit $d$. \\ |
|
5429 4. $a \leftarrow a + d$ \\ |
|
5. for $ix$ from 1 to $b - 1$ do \\
|
5431 \hspace{3mm}5.1 $a \leftarrow a \cdot \beta$ \\ |
|
5432 \hspace{3mm}5.2 Pick a random digit $d$. \\ |
|
5433 \hspace{3mm}5.3 $a \leftarrow a + d$ \\ |
|
5434 6. Return(\textit{MP\_OKAY}). \\ |
|
5435 \hline |
|
5436 \end{tabular} |
|
5437 \end{center} |
|
5438 \end{small} |
|
5439 \caption{Algorithm mp\_rand} |
|
5440 \end{figure} |
|
5441 \textbf{Algorithm mp\_rand.} |
|
This algorithm produces a pseudo-random integer of $b$ digits. By ensuring that the first digit is non-zero the algorithm also guarantees that the
final result has exactly $b$ digits. It relies heavily on a third-party random number generator which should ideally generate uniformly all of
the integers from $0$ to $\beta - 1$.
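
A minimal sketch of the same idea on a plain digit array follows. It is not the library source: the digit size of $28$ bits, the names \texttt{rand\_digit} and \texttt{rand\_digits}, and the use of the C library \texttt{rand()} as the third-party generator are all assumptions made for illustration. Writing the first digit into the top position is equivalent to the repeated multiply by $\beta$ and add of the pseudo-code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define DIGIT_BIT  28u                                   /* assumed digit size */
#define DIGIT_MASK ((((uint32_t)1) << DIGIT_BIT) - 1u)   /* beta - 1           */

/* One pseudo-random digit in [0, beta - 1]; rand() is only a stand-in. */
static uint32_t rand_digit(void)
{
    uint32_t d = 0;
    int i;

    /* rand() is only guaranteed to supply 15 bits, so mix three calls */
    for (i = 0; i < 3; i++) {
        d = (d << 15) ^ (uint32_t)rand();
    }
    return d & DIGIT_MASK;
}

/* Fill digits[0..b-1], least significant first; returns the digit count. */
int rand_digits(uint32_t *digits, int b)
{
    uint32_t d;
    int ix;

    if (b <= 0) {
        return 0;                    /* the value stays zero */
    }
    /* the most significant digit must be non-zero */
    do {
        d = rand_digit();
    } while (d == 0);
    digits[b - 1] = d;

    /* the remaining b - 1 digits may take any value */
    for (ix = b - 2; ix >= 0; ix--) {
        digits[ix] = rand_digit();
    }
    return b;
}
```

As with the algorithm above, this is suitable for simulations only and should not be used where cryptographic quality randomness is required.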
|
5445 |
|
5446 EXAM,bn_mp_rand.c |
|
5447 |
|
5448 \section{Formatted Representations} |
|
5449 The ability to emit a radix-$n$ textual representation of an integer is useful for interacting with human parties. For example, the ability to |
|
5450 be given a string of characters such as ``114585'' and turn it into the radix-$\beta$ equivalent would make it easier to enter numbers |
|
5451 into a program. |
|
5452 |
|
5453 \subsection{Reading Radix-n Input} |
|
For the purposes of this text we will assume that a simple lower ASCII map (\ref{fig:ASC}) is used to map the values from $0$ to $63$ to
printable characters. For example, when the character ``N'' is read it represents the integer $23$. The first $16$ characters of the
map are the common representations up to hexadecimal. The remaining entries complete the set of characters used by the ``base64'' encoding scheme, which were suitably chosen
such that they are printable. While outputting as base64 may not be too helpful for human operators it does allow communication via non-binary
media.
|
5459 |
|
5460 \newpage\begin{figure}[here] |
|
5461 \begin{center} |
|
5462 \begin{tabular}{cc|cc|cc|cc} |
|
5463 \hline \textbf{Value} & \textbf{Char} & \textbf{Value} & \textbf{Char} & \textbf{Value} & \textbf{Char} & \textbf{Value} & \textbf{Char} \\ |
|
5464 \hline |
|
5465 0 & 0 & 1 & 1 & 2 & 2 & 3 & 3 \\ |
|
5466 4 & 4 & 5 & 5 & 6 & 6 & 7 & 7 \\ |
|
5467 8 & 8 & 9 & 9 & 10 & A & 11 & B \\ |
|
5468 12 & C & 13 & D & 14 & E & 15 & F \\ |
|
5469 16 & G & 17 & H & 18 & I & 19 & J \\ |
|
5470 20 & K & 21 & L & 22 & M & 23 & N \\ |
|
5471 24 & O & 25 & P & 26 & Q & 27 & R \\ |
|
5472 28 & S & 29 & T & 30 & U & 31 & V \\ |
|
5473 32 & W & 33 & X & 34 & Y & 35 & Z \\ |
|
5474 36 & a & 37 & b & 38 & c & 39 & d \\ |
|
5475 40 & e & 41 & f & 42 & g & 43 & h \\ |
|
5476 44 & i & 45 & j & 46 & k & 47 & l \\ |
|
5477 48 & m & 49 & n & 50 & o & 51 & p \\ |
|
5478 52 & q & 53 & r & 54 & s & 55 & t \\ |
|
5479 56 & u & 57 & v & 58 & w & 59 & x \\ |
|
5480 60 & y & 61 & z & 62 & $+$ & 63 & $/$ \\ |
|
5481 \hline |
|
5482 \end{tabular} |
|
5483 \end{center} |
|
5484 \caption{Lower ASCII Map} |
|
5485 \label{fig:ASC} |
|
5486 \end{figure} |
|
5487 |
|
5488 \newpage\begin{figure}[!here] |
|
5489 \begin{small} |
|
5490 \begin{center} |
|
5491 \begin{tabular}{l} |
|
5492 \hline Algorithm \textbf{mp\_read\_radix}. \\ |
|
5493 \textbf{Input}. A string $str$ of length $sn$ and radix $r$. \\ |
|
5494 \textbf{Output}. The radix-$\beta$ equivalent mp\_int. \\ |
|
5495 \hline \\ |
|
5496 1. If $r < 2$ or $r > 64$ return(\textit{MP\_VAL}). \\ |
|
5497 2. $ix \leftarrow 0$ \\ |
|
5498 3. If $str_0 =$ ``-'' then do \\ |
|
5499 \hspace{3mm}3.1 $ix \leftarrow ix + 1$ \\ |
|
5500 \hspace{3mm}3.2 $sign \leftarrow MP\_NEG$ \\ |
|
5501 4. else \\ |
|
5502 \hspace{3mm}4.1 $sign \leftarrow MP\_ZPOS$ \\ |
|
5503 5. $a \leftarrow 0$ \\ |
|
5504 6. for $iy$ from $ix$ to $sn - 1$ do \\ |
|
5505 \hspace{3mm}6.1 Let $y$ denote the position in the map of $str_{iy}$. \\ |
|
5506 \hspace{3mm}6.2 If $str_{iy}$ is not in the map or $y \ge r$ then goto step 7. \\ |
|
5507 \hspace{3mm}6.3 $a \leftarrow a \cdot r$ \\ |
|
5508 \hspace{3mm}6.4 $a \leftarrow a + y$ \\ |
|
5509 7. If $a \ne 0$ then $a.sign \leftarrow sign$ \\ |
|
5510 8. Return(\textit{MP\_OKAY}). \\ |
|
5511 \hline |
|
5512 \end{tabular} |
|
5513 \end{center} |
|
5514 \end{small} |
|
5515 \caption{Algorithm mp\_read\_radix} |
|
5516 \end{figure} |
|
5517 \textbf{Algorithm mp\_read\_radix.} |
|
This algorithm will read an ASCII string and produce the radix-$\beta$ mp\_int representation of the same integer. A minus symbol ``-'' may precede the
string to indicate the value is negative, otherwise it is assumed to be positive. The algorithm will read up to $sn$ characters from the input
and will stop as soon as it reads a character that it cannot map or whose value is not less than the radix $r$. This allows numbers to be embedded
as part of larger input without any significant problem.
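
The reading loop can be sketched for a machine word as follows. This is an illustration only: the name \texttt{read\_radix\_ll} and the return convention are hypothetical, the input is a NUL terminated string rather than a string of length $sn$, and the value is assumed to fit in a \texttt{long long} since no overflow check is made.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* the 64 entry map of the text: position in the string == digit value */
static const char radix_map[] =
    "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz+/";

/* Returns 0 on success and -1 for an invalid radix; the value goes to *out. */
int read_radix_ll(const char *str, int r, long long *out)
{
    long long a = 0;
    int neg = 0;

    if (r < 2 || r > 64) {
        return -1;
    }
    if (*str == '-') {               /* optional leading minus sign */
        neg = 1;
        ++str;
    }
    for (; *str != '\0'; ++str) {
        const char *pos = strchr(radix_map, *str);
        int y;

        if (pos == NULL) {
            break;                   /* character is not in the map */
        }
        y = (int)(pos - radix_map);
        if (y >= r) {
            break;                   /* digit is too large for this radix */
        }
        a = a * r + y;               /* shift in the new digit */
    }
    *out = neg ? -a : a;
    return 0;
}
```

Reading the string ``101 apples'' in radix $2$ produces $5$ because the scan stops at the first character that cannot be mapped.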
|
5522 |
|
5523 EXAM,bn_mp_read_radix.c |
|
5524 |
|
5525 \subsection{Generating Radix-$n$ Output} |
|
5526 Generating radix-$n$ output is fairly trivial with a division and remainder algorithm. |
|
5527 |
|
5528 \newpage\begin{figure}[!here] |
|
5529 \begin{small} |
|
5530 \begin{center} |
|
5531 \begin{tabular}{l} |
|
5532 \hline Algorithm \textbf{mp\_toradix}. \\ |
|
5533 \textbf{Input}. A mp\_int $a$ and an integer $r$\\ |
|
5534 \textbf{Output}. The radix-$r$ representation of $a$ \\ |
|
5535 \hline \\ |
|
5536 1. If $r < 2$ or $r > 64$ return(\textit{MP\_VAL}). \\ |
|
5537 2. If $a = 0$ then $str = $ ``$0$'' and return(\textit{MP\_OKAY}). \\ |
|
5538 3. $t \leftarrow a$ \\ |
|
5539 4. $str \leftarrow$ ``'' \\ |
|
5540 5. if $t.sign = MP\_NEG$ then \\ |
|
5541 \hspace{3mm}5.1 $str \leftarrow str + $ ``-'' \\ |
|
5542 \hspace{3mm}5.2 $t.sign = MP\_ZPOS$ \\ |
|
5543 6. While ($t \ne 0$) do \\ |
|
5544 \hspace{3mm}6.1 $d \leftarrow t \mbox{ (mod }r\mbox{)}$ \\ |
|
5545 \hspace{3mm}6.2 $t \leftarrow \lfloor t / r \rfloor$ \\ |
|
5546 \hspace{3mm}6.3 Look up $d$ in the map and store the equivalent character in $y$. \\ |
|
5547 \hspace{3mm}6.4 $str \leftarrow str + y$ \\ |
|
5548 7. If $str_0 = $``$-$'' then \\ |
|
5549 \hspace{3mm}7.1 Reverse the digits $str_1, str_2, \ldots str_n$. \\ |
|
5550 8. Otherwise \\ |
|
5551 \hspace{3mm}8.1 Reverse the digits $str_0, str_1, \ldots str_n$. \\ |
|
5552 9. Return(\textit{MP\_OKAY}).\\ |
|
5553 \hline |
|
5554 \end{tabular} |
|
5555 \end{center} |
|
5556 \end{small} |
|
5557 \caption{Algorithm mp\_toradix} |
|
5558 \end{figure} |
|
5559 \textbf{Algorithm mp\_toradix.} |
|
This algorithm computes the radix-$r$ representation of an mp\_int $a$. The ``digits'' of the representation are extracted by reducing
the successive quotients $\lfloor a / r^k \rfloor$ of the input modulo $r$ until $r^k > a$. Note that instead of actually dividing by $r^k$ in
each iteration the quotient $\lfloor a / r \rfloor$ is saved for the next iteration. As a result a series of trivial $n \times 1$ divisions
is required instead of a series of $n \times k$ divisions. One design flaw of this approach is that the digits are produced in the reverse order
(see~\ref{fig:mpradix}). To remedy this flaw the digits must be swapped or simply ``reversed''.
|
5565 |
|
5566 \begin{figure} |
|
5567 \begin{center} |
|
5568 \begin{tabular}{|c|c|c|} |
|
5569 \hline \textbf{Value of $a$} & \textbf{Value of $d$} & \textbf{Value of $str$} \\ |
|
5570 \hline $1234$ & -- & -- \\ |
|
5571 \hline $123$ & $4$ & ``4'' \\ |
|
5572 \hline $12$ & $3$ & ``43'' \\ |
|
5573 \hline $1$ & $2$ & ``432'' \\ |
|
5574 \hline $0$ & $1$ & ``4321'' \\ |
|
5575 \hline |
|
5576 \end{tabular} |
|
5577 \end{center} |
|
5578 \caption{Example of Algorithm mp\_toradix.} |
|
5579 \label{fig:mpradix} |
|
5580 \end{figure} |
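
The divide, remainder and reverse pattern can be sketched for a machine word as follows. The name \texttt{to\_radix\_ll} is hypothetical and the caller is assumed to supply a buffer of at least $66$ characters (a sign, up to $64$ binary digits and the terminator); this is not the library routine.

```c
#include <assert.h>
#include <stddef.h>

/* the 64 entry map of the text: position in the string == digit value */
static const char radix_digits[] =
    "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz+/";

/* Writes the radix-r text of a into str; returns 0, or -1 for a bad radix. */
int to_radix_ll(long long a, int r, char *str)
{
    unsigned long long t;
    size_t n = 0, lo, hi;

    if (r < 2 || r > 64) {
        return -1;
    }
    if (a == 0) {
        str[0] = '0';
        str[1] = '\0';
        return 0;
    }
    if (a < 0) {
        str[n++] = '-';
        t = 0ULL - (unsigned long long)a;   /* magnitude of a */
    } else {
        t = (unsigned long long)a;
    }
    lo = n;                                 /* first digit position */
    while (t != 0) {
        str[n++] = radix_digits[t % (unsigned)r];
        t /= (unsigned)r;                   /* keep the quotient for next pass */
    }
    str[n] = '\0';

    /* the digits came out least significant first: reverse them in place */
    for (hi = n - 1; lo < hi; ++lo, --hi) {
        char c = str[lo];
        str[lo] = str[hi];
        str[hi] = c;
    }
    return 0;
}
```

The sign character is written before the digits and is excluded from the reversal, which mirrors steps seven and eight of the pseudo-code.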
|
5581 |
|
5582 EXAM,bn_mp_toradix.c |
|
5583 |
|
5584 \chapter{Number Theoretic Algorithms} |
|
5585 This chapter discusses several fundamental number theoretic algorithms such as the greatest common divisor, least common multiple and Jacobi |
|
5586 symbol computation. These algorithms arise as essential components in several key cryptographic algorithms such as the RSA public key algorithm and |
|
5587 various Sieve based factoring algorithms. |
|
5588 |
|
5589 \section{Greatest Common Divisor} |
|
The greatest common divisor of two integers $a$ and $b$, often denoted as $(a, b)$, is the largest integer $k$ that divides
both $a$ and $b$. That is, $k$ is the largest integer such that $0 \equiv a \mbox{ (mod }k\mbox{)}$ and $0 \equiv b \mbox{ (mod }k\mbox{)}$ hold
simultaneously.
|
5593 |
|
5594 The most common approach (cite) is to reduce one input modulo another. That is if $a$ and $b$ are divisible by some integer $k$ and if $qa + r = b$ then |
|
5595 $r$ is also divisible by $k$. The reduction pattern follows $\left < a , b \right > \rightarrow \left < b, a \mbox{ mod } b \right >$. |
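
As a small worked example (the inputs are chosen only for illustration) the pair $\left < 1071, 462 \right >$ reduces as follows.

\begin{equation}
\left < 1071, 462 \right > \rightarrow \left < 462, 147 \right > \rightarrow \left < 147, 21 \right > \rightarrow \left < 21, 0 \right >
\end{equation}

Here $1071 \mbox{ mod } 462 = 147$, $462 \mbox{ mod } 147 = 21$ and $147 \mbox{ mod } 21 = 0$. The reduction stops once the second entry is zero and the first entry, $21$, is the greatest common divisor $(1071, 462)$.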
|
5596 |
|
5597 \newpage\begin{figure}[!here] |
|
5598 \begin{small} |
|
5599 \begin{center} |
|
5600 \begin{tabular}{l} |
|
5601 \hline Algorithm \textbf{Greatest Common Divisor (I)}. \\ |
|
5602 \textbf{Input}. Two positive integers $a$ and $b$ greater than zero. \\ |
|
5603 \textbf{Output}. The greatest common divisor $(a, b)$. \\ |
|
5604 \hline \\ |
|
5605 1. While ($b > 0$) do \\ |
|
5606 \hspace{3mm}1.1 $r \leftarrow a \mbox{ (mod }b\mbox{)}$ \\ |
|
5607 \hspace{3mm}1.2 $a \leftarrow b$ \\ |
|
5608 \hspace{3mm}1.3 $b \leftarrow r$ \\ |
|
5609 2. Return($a$). \\ |
|
5610 \hline |
|
5611 \end{tabular} |
|
5612 \end{center} |
|
5613 \end{small} |
|
5614 \caption{Algorithm Greatest Common Divisor (I)} |
|
5615 \label{fig:gcd1} |
|
5616 \end{figure} |
|
5617 |
|
This algorithm will quickly converge on the greatest common divisor since the residue $r$ tends to diminish rapidly. However, divisions are
relatively expensive operations to perform and should ideally be avoided. There is another approach based on a similar relationship of
greatest common divisors. The faster approach is based on the observation that if $k$ divides both $a$ and $b$ it will also divide $a - b$.
In particular, we would like $a - b$ to decrease in magnitude which implies that $b \ge a$.
|
5622 |
|
5623 \begin{figure}[!here] |
|
5624 \begin{small} |
|
5625 \begin{center} |
|
5626 \begin{tabular}{l} |
|
5627 \hline Algorithm \textbf{Greatest Common Divisor (II)}. \\ |
|
5628 \textbf{Input}. Two positive integers $a$ and $b$ greater than zero. \\ |
|
5629 \textbf{Output}. The greatest common divisor $(a, b)$. \\ |
|
5630 \hline \\ |
|
5631 1. While ($b > 0$) do \\ |
|
5632 \hspace{3mm}1.1 Swap $a$ and $b$ such that $a$ is the smallest of the two. \\ |
|
5633 \hspace{3mm}1.2 $b \leftarrow b - a$ \\ |
|
5634 2. Return($a$). \\ |
|
5635 \hline |
|
5636 \end{tabular} |
|
5637 \end{center} |
|
5638 \end{small} |
|
5639 \caption{Algorithm Greatest Common Divisor (II)} |
|
5640 \label{fig:gcd2} |
|
5641 \end{figure} |
|
5642 |
|
5643 \textbf{Proof} \textit{Algorithm~\ref{fig:gcd2} will return the greatest common divisor of $a$ and $b$.} |
|
The algorithm in figure~\ref{fig:gcd2} will eventually terminate: since $b \ge a$, the subtraction in step 1.2 produces a value less than $b$. In other
words, in every iteration the tuple $\left < a, b \right >$ decreases in magnitude until eventually $a = b$. Both $a$ and $b$ are always
divisible by the greatest common divisor (\textit{until the last iteration}) and in the last iteration of the algorithm $b = 0$. Therefore, in the
second to last iteration of the algorithm $b = a$ and clearly $(a, a) = a$ which concludes the proof. \textbf{QED}.
|
5648 |
|
As a matter of practicality algorithm \ref{fig:gcd2} decreases far too slowly to be useful, especially if $b$ is much larger than $a$ such that
$b - a$ is still very much larger than $a$. A simple addition to the algorithm is to divide $b - a$ by a power of some integer $p$ which does
not divide the greatest common divisor but will divide $b - a$. In this case ${b - a} \over p$ is also an integer and still divisible by
the greatest common divisor.

However, instead of factoring $b - a$ to find a suitable value of $p$, the powers of $p$ that $a$ and $b$ have in common can be removed first.
Then inside the loop whenever $b - a$ is divisible by some power of $p$ it can be safely removed.
|
5656 |
|
5657 \begin{figure}[!here] |
|
5658 \begin{small} |
|
5659 \begin{center} |
|
5660 \begin{tabular}{l} |
|
5661 \hline Algorithm \textbf{Greatest Common Divisor (III)}. \\ |
|
5662 \textbf{Input}. Two positive integers $a$ and $b$ greater than zero. \\ |
|
5663 \textbf{Output}. The greatest common divisor $(a, b)$. \\ |
|
5664 \hline \\ |
|
5665 1. $k \leftarrow 0$ \\ |
|
5666 2. While $a$ and $b$ are both divisible by $p$ do \\ |
|
5667 \hspace{3mm}2.1 $a \leftarrow \lfloor a / p \rfloor$ \\ |
|
5668 \hspace{3mm}2.2 $b \leftarrow \lfloor b / p \rfloor$ \\ |
|
5669 \hspace{3mm}2.3 $k \leftarrow k + 1$ \\ |
|
5670 3. While $a$ is divisible by $p$ do \\ |
|
5671 \hspace{3mm}3.1 $a \leftarrow \lfloor a / p \rfloor$ \\ |
|
5672 4. While $b$ is divisible by $p$ do \\ |
|
5673 \hspace{3mm}4.1 $b \leftarrow \lfloor b / p \rfloor$ \\ |
|
5674 5. While ($b > 0$) do \\ |
|
5675 \hspace{3mm}5.1 Swap $a$ and $b$ such that $a$ is the smallest of the two. \\ |
|
5676 \hspace{3mm}5.2 $b \leftarrow b - a$ \\ |
|
\hspace{3mm}5.3 While $b > 0$ and $b$ is divisible by $p$ do \\
|
5678 \hspace{6mm}5.3.1 $b \leftarrow \lfloor b / p \rfloor$ \\ |
|
5679 6. Return($a \cdot p^k$). \\ |
|
5680 \hline |
|
5681 \end{tabular} |
|
5682 \end{center} |
|
5683 \end{small} |
|
5684 \caption{Algorithm Greatest Common Divisor (III)} |
|
5685 \label{fig:gcd3} |
|
5686 \end{figure} |
|
5687 |
|
This algorithm is based on algorithm \ref{fig:gcd2} except it removes powers of $p$ first and inside the main loop to ensure the tuple $\left < a, b \right >$
decreases more rapidly. The first loop on step two removes powers of $p$ that are in common. A count, $k$, is kept which will represent a common
divisor of $p^k$. After step two the remaining common divisor of $a$ and $b$ cannot be divisible by $p$. This means that $p$ can be safely
divided out of the difference $b - a$ so long as the division leaves no remainder.

In particular the value of $p$ should be chosen such that the division on step 5.3.1 occurs often. It also helps that division by $p$ be easy
to compute. The ideal choice of $p$ is two since division by two amounts to a right logical shift. Another important observation is that by
step five both $a$ and $b$ are odd. Therefore, the difference $b - a$ must be even which means that each iteration removes at least one bit from the
largest of the pair.
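
As an illustration with $p = 2$ (the inputs are chosen only for this example) consider $a = 12$ and $b = 126$. Step two removes one common factor of two, leaving $\left < 6, 63 \right >$ with $k = 1$. Step three reduces $a$ to $3$ and step four leaves $b = 63$ unchanged. The main loop then proceeds as follows, where each arrow denotes the removal of the factors of two on step 5.3.

\begin{eqnarray}
b = 63 - 3 = 60 & \rightarrow & 15 \nonumber \\
b = 15 - 3 = 12 & \rightarrow & 3 \nonumber \\
b = 3 - 3 = 0 & & \nonumber
\end{eqnarray}

The loop ends with $a = 3$ and the algorithm returns $a \cdot p^k = 3 \cdot 2^1 = 6$, which is indeed $(12, 126)$.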
|
5697 |
|
5698 \subsection{Complete Greatest Common Divisor} |
|
5699 The algorithms presented so far cannot handle inputs which are zero or negative. The following algorithm can handle all input cases properly |
|
5700 and will produce the greatest common divisor. |
|
5701 |
|
5702 \newpage\begin{figure}[!here] |
|
5703 \begin{small} |
|
5704 \begin{center} |
|
5705 \begin{tabular}{l} |
|
5706 \hline Algorithm \textbf{mp\_gcd}. \\ |
|
5707 \textbf{Input}. mp\_int $a$ and $b$ \\ |
|
5708 \textbf{Output}. The greatest common divisor $c = (a, b)$. \\ |
|
5709 \hline \\ |
|
1. If $a = 0$ and $b \ne 0$ then \\
\hspace{3mm}1.1 $c \leftarrow \vert b \vert$ \\
\hspace{3mm}1.2 Return(\textit{MP\_OKAY}). \\
2. If $a \ne 0$ and $b = 0$ then \\
\hspace{3mm}2.1 $c \leftarrow \vert a \vert$ \\
\hspace{3mm}2.2 Return(\textit{MP\_OKAY}). \\
3. If $a = b = 0$ then \\
\hspace{3mm}3.1 $c \leftarrow 0$ \\
|
5718 \hspace{3mm}3.2 Return(\textit{MP\_OKAY}). \\ |
|
5719 4. $u \leftarrow \vert a \vert, v \leftarrow \vert b \vert$ \\ |
|
5720 5. $k \leftarrow 0$ \\ |
|
5721 6. While $u.used > 0$ and $v.used > 0$ and $u_0 \equiv v_0 \equiv 0 \mbox{ (mod }2\mbox{)}$ \\ |
|
5722 \hspace{3mm}6.1 $k \leftarrow k + 1$ \\ |
|
5723 \hspace{3mm}6.2 $u \leftarrow \lfloor u / 2 \rfloor$ \\ |
|
5724 \hspace{3mm}6.3 $v \leftarrow \lfloor v / 2 \rfloor$ \\ |
|
5725 7. While $u.used > 0$ and $u_0 \equiv 0 \mbox{ (mod }2\mbox{)}$ \\ |
|
5726 \hspace{3mm}7.1 $u \leftarrow \lfloor u / 2 \rfloor$ \\ |
|
5727 8. While $v.used > 0$ and $v_0 \equiv 0 \mbox{ (mod }2\mbox{)}$ \\ |
|
5728 \hspace{3mm}8.1 $v \leftarrow \lfloor v / 2 \rfloor$ \\ |
|
5729 9. While $v.used > 0$ \\ |
|
5730 \hspace{3mm}9.1 If $\vert u \vert > \vert v \vert$ then \\ |
|
5731 \hspace{6mm}9.1.1 Swap $u$ and $v$. \\ |
|
5732 \hspace{3mm}9.2 $v \leftarrow \vert v \vert - \vert u \vert$ \\ |
|
5733 \hspace{3mm}9.3 While $v.used > 0$ and $v_0 \equiv 0 \mbox{ (mod }2\mbox{)}$ \\ |
|
5734 \hspace{6mm}9.3.1 $v \leftarrow \lfloor v / 2 \rfloor$ \\ |
|
5735 10. $c \leftarrow u \cdot 2^k$ \\ |
|
5736 11. Return(\textit{MP\_OKAY}). \\ |
|
5737 \hline |
|
5738 \end{tabular} |
|
5739 \end{center} |
|
5740 \end{small} |
|
5741 \caption{Algorithm mp\_gcd} |
|
5742 \end{figure} |
|
5743 \textbf{Algorithm mp\_gcd.} |
|
5744 This algorithm will produce the greatest common divisor of two mp\_ints $a$ and $b$. The algorithm was originally based on Algorithm B of |
|
5745 Knuth \cite[pp. 338]{TAOCPV2} but has been modified to be simpler to explain. In theory it achieves the same asymptotic working time as |
|
5746 Algorithm B and in practice this appears to be true. |
|
5747 |
|
The first three steps handle the cases where either one of or both inputs are zero. If either input is zero the greatest common divisor is the
magnitude of the other input, or zero if they are both zero. If the inputs are not trivial then $u$ and $v$ are assigned the absolute values of
$a$ and $b$ respectively and the algorithm will proceed to reduce the pair.
|
5751 |
|
Step six will divide out any common factors of two and keep track of the count in the variable $k$. After this step, two is no longer a
factor of the remaining greatest common divisor between $u$ and $v$ and can be safely divided out of either whenever they are even. Steps
seven and eight ensure that $u$ and $v$ respectively have no more factors of two. At most one of the while loops will iterate since
they cannot both be even.

By step nine both $u$ and $v$ are odd which is required for the inner logic. First the pair are swapped such that $v$ is equal to
or greater than $u$. This ensures that the subtraction on step 9.2 will always produce a non-negative and even result. Step 9.3 removes any
factors of two from the difference $v$ to ensure that in the next iteration of the loop both are once again odd.
|
5760 |
|
5761 After $v = 0$ occurs the variable $u$ has the greatest common divisor of the pair $\left < u, v \right >$ just after step six. The result |
|
5762 must be adjusted by multiplying by the common factors of two ($2^k$) removed earlier. |
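
The steps described above can be sketched on 64-bit machine words. The function name \texttt{gcd\_i64} is hypothetical and this is not the library source; the sketch returns zero when both inputs are zero and otherwise follows the same shift and subtract pattern.

```c
#include <assert.h>
#include <stdint.h>

/* Greatest common divisor of two signed 64-bit values (binary method). */
uint64_t gcd_i64(int64_t a, int64_t b)
{
    /* work on magnitudes; unsigned negation is defined even for INT64_MIN */
    uint64_t u = (a < 0) ? (uint64_t)0 - (uint64_t)a : (uint64_t)a;
    uint64_t v = (b < 0) ? (uint64_t)0 - (uint64_t)b : (uint64_t)b;
    unsigned k = 0;

    if (u == 0) {
        return v;                    /* covers (0, b) and (0, 0) */
    }
    if (v == 0) {
        return u;
    }
    /* divide out the factors of two that u and v share */
    while (((u | v) & 1u) == 0) {
        u >>= 1;
        v >>= 1;
        ++k;
    }
    /* strip the remaining factors of two; at most one loop iterates */
    while ((u & 1u) == 0) {
        u >>= 1;
    }
    while ((v & 1u) == 0) {
        v >>= 1;
    }
    /* both are odd here: subtract the smaller and shift until v is zero */
    while (v != 0) {
        if (u > v) {
            uint64_t t = u;
            u = v;
            v = t;
        }
        v -= u;                      /* non-negative and even */
        while (v != 0 && (v & 1u) == 0) {
            v >>= 1;
        }
    }
    return u << k;                   /* restore the common factors of two */
}
```

The test \texttt{v != 0} in the innermost loop plays the role of the $v.used > 0$ check in step 9.3: without it a zero difference would be shifted forever.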
|
5763 |
|
5764 EXAM,bn_mp_gcd.c |
|
5765 |
|
5766 This function makes use of the macros mp\_iszero and mp\_iseven. The former evaluates to $1$ if the input mp\_int is equivalent to the |
|
5767 integer zero otherwise it evaluates to $0$. The latter evaluates to $1$ if the input mp\_int represents a non-zero even integer otherwise |
|
5768 it evaluates to $0$. Note that just because mp\_iseven may evaluate to $0$ does not mean the input is odd, it could also be zero. The three |
|
5769 trivial cases of inputs are handled on lines @25,zero@ through @34,}@. After those lines the inputs are assumed to be non-zero. |
|
5770 |
|
5771 Lines @36,if@ and @40,if@ make local copies $u$ and $v$ of the inputs $a$ and $b$ respectively. At this point the common factors of two |
|
5772 must be divided out of the two inputs. The while loop on line @49,while@ iterates so long as both are even. The local integer $k$ is used to |
|
5773 keep track of how many factors of $2$ are pulled out of both values. It is assumed that the number of factors will not exceed the maximum |
|
value of a C ``int'' data type\footnote{Strictly speaking no array in C may have more entries than are accessible by an ``int'' so this is not
a limitation.}.
|
5776 |
|
At this point there are no more common factors of two in the two values. The while loops on lines @60,while@ and @65,while@ remove any independent
factors of two such that both $u$ and $v$ are guaranteed to be odd integers before hitting the main body of the algorithm. The while loop
on line @71, while@ performs the reduction of the pair until $v$ is equal to zero. The unsigned comparison and subtraction algorithms are used in
place of the full signed routines since both values are guaranteed to be positive and the result of the subtraction is guaranteed to be non-negative.
|
5781 |
|
5782 \section{Least Common Multiple} |
|
The least common multiple of a pair of integers is their product divided by their greatest common divisor. For two integers $a$ and $b$ the
least common multiple is normally denoted as $[ a, b ]$ and numerically equivalent to ${ab} \over {(a, b)}$. For example, if $a = 2 \cdot 2 \cdot 3 = 12$
and $b = 2 \cdot 3 \cdot 3 \cdot 7 = 126$ the least common multiple is ${{12 \cdot 126} \over {(12, 126)}} = {1512 \over 6} = 252$.
|
5786 |
|
The least common multiple arises often in coding theory as well as number theory. If two functions have periods of $a$ and $b$ respectively they will
collide, that is be in synchronous states, after only $[ a, b ]$ iterations. This is why, for example, random number generators based on
Linear Feedback Shift Registers (LFSR) tend to use registers with periods which are co-prime (\textit{i.e. the greatest common divisor is one.}).
Similarly in number theory if a composite $n$ has two prime factors $p$ and $q$ then the maximal order of any unit of $\Z/n\Z$ will be $[ p - 1, q - 1] $.
|
5791 |
|
5792 \begin{figure}[!here] |
|
5793 \begin{small} |
|
5794 \begin{center} |
|
5795 \begin{tabular}{l} |
|
5796 \hline Algorithm \textbf{mp\_lcm}. \\ |
|
5797 \textbf{Input}. mp\_int $a$ and $b$ \\ |
|
5798 \textbf{Output}. The least common multiple $c = [a, b]$. \\ |
|
5799 \hline \\ |
|
5800 1. $c \leftarrow (a, b)$ \\ |
|
5801 2. $t \leftarrow a \cdot b$ \\ |
|
5802 3. $c \leftarrow \lfloor t / c \rfloor$ \\ |
|
5803 4. Return(\textit{MP\_OKAY}). \\ |
|
5804 \hline |
|
5805 \end{tabular} |
|
5806 \end{center} |
|
5807 \end{small} |
|
5808 \caption{Algorithm mp\_lcm} |
|
5809 \end{figure} |
|
5810 \textbf{Algorithm mp\_lcm.} |
|
5811 This algorithm computes the least common multiple of two mp\_int inputs $a$ and $b$. It computes the least common multiple directly by |
|
5812 dividing the product of the two inputs by their greatest common divisor. |
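
A minimal sketch of the same computation on 64-bit machine words follows. The names \texttt{gcd\_u64} and \texttt{lcm\_u64} are hypothetical. Unlike the algorithm above, the sketch divides by the greatest common divisor before multiplying; this reordering is a choice made here to keep the intermediate value small in a fixed width type and is not taken from the library.

```c
#include <assert.h>
#include <stdint.h>

/* Greatest common divisor by repeated remainders. */
static uint64_t gcd_u64(uint64_t a, uint64_t b)
{
    while (b != 0) {
        uint64_t r = a % b;
        a = b;
        b = r;
    }
    return a;
}

/* [a, b] = ab / (a, b); the result is assumed to fit in 64 bits. */
uint64_t lcm_u64(uint64_t a, uint64_t b)
{
    uint64_t g, q;

    if (a == 0 || b == 0) {
        return 0;                    /* avoid dividing by (0, 0) = 0 */
    }
    g = gcd_u64(a, b);
    q = a / g;                       /* exact: g divides a */
    assert(q <= UINT64_MAX / b);     /* the product must not overflow */
    return q * b;
}
```

With the inputs of the example above, \texttt{lcm\_u64(12, 126)} evaluates to $252$.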
|
5813 |
|
5814 EXAM,bn_mp_lcm.c |
|
5815 |
|
5816 \section{Jacobi Symbol Computation} |
|
To explain the Jacobi Symbol we shall first discuss the Legendre function (more commonly called the Legendre symbol) on which the Jacobi symbol is
defined. The Legendre function indicates whether or not an integer $a$ is a quadratic residue modulo an odd prime $p$. Numerically it is
equivalent to equation \ref{eqn:legendre}.
|
5820 |
|
\begin{equation}
a^{(p-1)/2} \equiv \left \lbrace \begin{array}{rl}
-1 & \mbox{if }a\mbox{ is a quadratic non-residue.} \\
0 & \mbox{if }p\mbox{ divides }a\mbox{.} \\
1 & \mbox{if }a\mbox{ is a quadratic residue.}
\end{array} \right . \mbox{ (mod }p\mbox{)}
\label{eqn:legendre}
\end{equation}
|
5829 |
|
5830 \textbf{Proof.} \textit{Equation \ref{eqn:legendre} correctly identifies the residue status of an integer $a$ modulo a prime $p$.} |
|
5831 An integer $a$ is a quadratic residue if the following equation has a solution. |
|
5832 |
|
5833 \begin{equation} |
|
5834 x^2 \equiv a \mbox{ (mod }p\mbox{)} |
|
5835 \label{eqn:root} |
|
5836 \end{equation} |
|
5837 |
|
5838 Consider the following equation. |
|
5839 |
|
5840 \begin{equation} |
|
5841 0 \equiv x^{p-1} - 1 \equiv \left \lbrace \left (x^2 \right )^{(p-1)/2} - a^{(p-1)/2} \right \rbrace + \left ( a^{(p-1)/2} - 1 \right ) \mbox{ (mod }p\mbox{)} |
|
5842 \label{eqn:rooti} |
|
5843 \end{equation} |
|
5844 |
|
5845 Whether equation \ref{eqn:root} has a solution or not equation \ref{eqn:rooti} is always true. If $a^{(p-1)/2} - 1 \equiv 0 \mbox{ (mod }p\mbox{)}$ |
|
5846 then the quantity in the braces must be zero. By reduction, |
|
5847 |
|
\begin{eqnarray}
\left (x^2 \right )^{(p-1)/2} - a^{(p-1)/2} & \equiv & 0 \nonumber \\
\left (x^2 \right )^{(p-1)/2} & \equiv & a^{(p-1)/2} \nonumber \\
x^2 & \equiv & a \mbox{ (mod }p\mbox{)}
\end{eqnarray}
|
5853 |
|
As a result there must be a solution to the quadratic equation and in turn $a$ must be a quadratic residue. If $p$ does not divide $a$ and $a$
is not a quadratic residue then the only other value $a^{(p-1)/2}$ may be congruent to is $-1$ since
|
5856 \begin{equation} |
|
5857 0 \equiv a^{p - 1} - 1 \equiv (a^{(p-1)/2} + 1)(a^{(p-1)/2} - 1) \mbox{ (mod }p\mbox{)} |
|
5858 \end{equation} |
|
5859 One of the terms on the right hand side must be zero. \textbf{QED} |
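
Equation \ref{eqn:legendre} translates directly into a modular exponentiation. The following is a minimal sketch for small primes; the names \texttt{powmod\_u32} and \texttt{legendre\_u32} are hypothetical and $p$ is assumed to be an odd prime below $2^{32}$ so that every product fits in 64 bits. The code does not verify that $p$ is prime.

```c
#include <assert.h>
#include <stdint.h>

/* (base^e) mod m for m < 2^32, by square and multiply. */
static uint64_t powmod_u32(uint64_t base, uint64_t e, uint64_t m)
{
    uint64_t result = 1 % m;

    base %= m;
    while (e != 0) {
        if ((e & 1u) != 0) {
            result = (result * base) % m;
        }
        base = (base * base) % m;
        e >>= 1;
    }
    return result;
}

/* Returns -1, 0 or 1 according to the value of a^((p-1)/2) mod p. */
int legendre_u32(uint64_t a, uint32_t p)
{
    uint64_t e;

    assert(p >= 3 && (p & 1u) == 1u);

    e = powmod_u32(a, (p - 1u) / 2u, p);
    if (e == 0) {
        return 0;                    /* p divides a */
    }
    if (e == 1) {
        return 1;                    /* quadratic residue */
    }
    return -1;                       /* e equals p - 1 when p is prime */
}
```

For example $3^2 \equiv 2 \mbox{ (mod }7\mbox{)}$, so \texttt{legendre\_u32(2, 7)} is $1$, while \texttt{legendre\_u32(3, 7)} is $-1$.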
|
5860 |
|
5861 \subsection{Jacobi Symbol} |
|
The Jacobi symbol is a generalization of the Legendre function to any odd modulus $p$ greater than 2, prime or not. If $p = \prod_{i=0}^n p_i$ is the factorization of $p$ into (not necessarily distinct) primes then
the Jacobi symbol $\left ( { a \over p } \right )$ is defined by the following equation.
|
5864 |
|
5865 \begin{equation} |
|
5866 \left ( { a \over p } \right ) = \left ( { a \over p_0} \right ) \left ( { a \over p_1} \right ) \ldots \left ( { a \over p_n} \right ) |
|
5867 \end{equation} |
|
5868 |
|
5869 By inspection if $p$ is prime the Jacobi symbol is equivalent to the Legendre function. The following facts\footnote{See HAC \cite[pp. 72-74]{HAC} for |
|
5870 further details.} will be used to derive an efficient Jacobi symbol algorithm. Where $p$ is an odd integer greater than two and $a, b \in \Z$ the |
|
5871 following are true. |
|
5872 |
|
5873 \begin{enumerate} |
|
5874 \item $\left ( { a \over p} \right )$ equals $-1$, $0$ or $1$. |
|
5875 \item $\left ( { ab \over p} \right ) = \left ( { a \over p} \right )\left ( { b \over p} \right )$. |
|
\item If $a \equiv b \mbox{ (mod }p\mbox{)}$ then $\left ( { a \over p} \right ) = \left ( { b \over p} \right )$.
\item $\left ( { 2 \over p} \right )$ equals $1$ if $p \equiv 1$ or $7 \mbox{ (mod }8\mbox{)}$. Otherwise, it equals $-1$.
\item If $a$ is also odd and greater than two then $\left ( { a \over p} \right ) = \left ( { p \over a} \right ) \cdot (-1)^{(p-1)(a-1)/4}$. More specifically
$\left ( { a \over p} \right ) = \left ( { p \over a} \right )$ unless $p \equiv a \equiv 3 \mbox{ (mod }4\mbox{)}$, in which case the sign is reversed.
|
5880 \end{enumerate} |
|
5881 |
|
Using these facts if $a = 2^k \cdot a'$ with $a'$ odd then

\begin{eqnarray}
\left ( { a \over p } \right ) & = & \left ( {{2^k} \over p } \right ) \left ( {a' \over p} \right ) \nonumber \\
 & = & \left ( {2 \over p } \right )^k \left ( {a' \over p} \right )
\label{eqn:jacobi}
\end{eqnarray}
|
5889 |
|
5890 By fact five, |
|
5891 |
|
5892 \begin{equation} |
|
5893 \left ( { a \over p } \right ) = \left ( { p \over a } \right ) \cdot (-1)^{(p-1)(a-1)/4} |
|
5894 \end{equation} |
|
5895 |
|
5896 Subsequently by fact three since $p \equiv (p \mbox{ mod }a) \mbox{ (mod }a\mbox{)}$ then |
|
5897 |
|
5898 \begin{equation} |
|
5899 \left ( { a \over p } \right ) = \left ( { {p \mbox{ mod } a} \over a } \right ) \cdot (-1)^{(p-1)(a-1)/4} |
|
5900 \end{equation} |
|
5901 |
|
5902 By putting both observations into equation \ref{eqn:jacobi} the following simplified equation is formed. |
|
5903 |
|
5904 \begin{equation} |
|
5905 \left ( { a \over p } \right ) = \left ( {2 \over p } \right )^k \left ( {{p\mbox{ mod }a'} \over a'} \right ) \cdot (-1)^{(p-1)(a'-1)/4} |
|
5906 \end{equation} |
|
5907 |
|
5908 The value of $\left ( {{p \mbox{ mod }a'} \over a'} \right )$ can be found by using the same equation recursively. The value of |
|
5909 $\left ( {2 \over p } \right )^k$ equals $1$ if $k$ is even otherwise it equals $\left ( {2 \over p } \right )$. Using this approach the |
|
5910 factors of $p$ do not have to be known. Furthermore, if $(a, p) = 1$ then the algorithm will terminate when the recursion requests the |
|
5911 Jacobi symbol computation of $\left ( {1 \over a'} \right )$ which is simply $1$. |
|
5912 |
|
5913 \newpage\begin{figure}[!here] |
|
5914 \begin{small} |
|
5915 \begin{center} |
|
5916 \begin{tabular}{l} |
|
5917 \hline Algorithm \textbf{mp\_jacobi}. \\ |
|
5918 \textbf{Input}. mp\_int $a$ and $p$, $a \ge 0$, $p \ge 3$, $p \equiv 1 \mbox{ (mod }2\mbox{)}$ \\ |
|
5919 \textbf{Output}. The Jacobi symbol $c = \left ( {a \over p } \right )$. \\ |
|
5920 \hline \\ |
|
5921 1. If $a = 0$ then \\ |
|
5922 \hspace{3mm}1.1 $c \leftarrow 0$ \\ |
|
5923 \hspace{3mm}1.2 Return(\textit{MP\_OKAY}). \\ |
|
5924 2. If $a = 1$ then \\ |
|
5925 \hspace{3mm}2.1 $c \leftarrow 1$ \\ |
|
5926 \hspace{3mm}2.2 Return(\textit{MP\_OKAY}). \\ |
|
5927 3. $a' \leftarrow a$ \\ |
|
5928 4. $k \leftarrow 0$ \\ |
|
5929 5. While $a'.used > 0$ and $a'_0 \equiv 0 \mbox{ (mod }2\mbox{)}$ \\ |
|
5930 \hspace{3mm}5.1 $k \leftarrow k + 1$ \\ |
|
5931 \hspace{3mm}5.2 $a' \leftarrow \lfloor a' / 2 \rfloor$ \\ |
|
5932 6. If $k \equiv 0 \mbox{ (mod }2\mbox{)}$ then \\ |
|
5933 \hspace{3mm}6.1 $s \leftarrow 1$ \\ |
|
5934 7. else \\ |
|
5935 \hspace{3mm}7.1 $r \leftarrow p_0 \mbox{ (mod }8\mbox{)}$ \\ |
|
5936 \hspace{3mm}7.2 If $r = 1$ or $r = 7$ then \\ |
|
5937 \hspace{6mm}7.2.1 $s \leftarrow 1$ \\ |
|
5938 \hspace{3mm}7.3 else \\ |
|
5939 \hspace{6mm}7.3.1 $s \leftarrow -1$ \\ |
|
5940 8. If $p_0 \equiv a'_0 \equiv 3 \mbox{ (mod }4\mbox{)}$ then \\ |
|
5941 \hspace{3mm}8.1 $s \leftarrow -s$ \\ |
|
5942 9. If $a' \ne 1$ then \\ |
|
5943 \hspace{3mm}9.1 $p' \leftarrow p \mbox{ (mod }a'\mbox{)}$ \\ |
|
5944 \hspace{3mm}9.2 $s \leftarrow s \cdot \mbox{mp\_jacobi}(p', a')$ \\ |
|
5945 10. $c \leftarrow s$ \\ |
|
5946 11. Return(\textit{MP\_OKAY}). \\ |
|
5947 \hline |
|
5948 \end{tabular} |
|
5949 \end{center} |
|
5950 \end{small} |
|
5951 \caption{Algorithm mp\_jacobi} |
|
5952 \end{figure} |
|
5953 \textbf{Algorithm mp\_jacobi.} |
|
This algorithm computes the Jacobi symbol for an arbitrary non-negative integer $a$ with respect to an odd integer $p$ greater than or equal to three. The algorithm
is based on algorithm 2.149 of HAC \cite[pp. 73]{HAC}.

Step numbers one and two handle the trivial cases of $a = 0$ and $a = 1$ respectively. Step five determines the number of factors of two in the
input $a$. If $k$ is even then the term $\left ( { 2 \over p } \right )^k$ must always evaluate to one. If $k$ is odd then the term evaluates to one
if $p_0$ is congruent to one or seven modulo eight, otherwise it evaluates to $-1$. After the $\left ( { 2 \over p } \right )^k$ term is handled
the $(-1)^{(p-1)(a'-1)/4}$ term is computed and multiplied against the current product $s$. The latter term evaluates to negative one if both $p$ and $a'$
are congruent to three modulo four, otherwise it evaluates to one.
|
5962 |
|
5963 By step nine if $a'$ does not equal one a recursion is required. Step 9.1 computes $p' \equiv p \mbox{ (mod }a'\mbox{)}$ and will recurse to compute |
|
5964 $\left ( {p' \over a'} \right )$ which is multiplied against the current Jacobi product. |
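
The recursion can be sketched on 64-bit machine words as follows. The function name \texttt{jacobi\_u64} is hypothetical and this is an illustration of the steps above rather than the library source; as in the text, the variable \texttt{a1} stands for $a'$.

```c
#include <assert.h>
#include <stdint.h>

/* Jacobi symbol (a/p) for a >= 0 and odd p >= 3; returns -1, 0 or 1. */
int jacobi_u64(uint64_t a, uint64_t p)
{
    uint64_t a1;
    unsigned k = 0;
    int s;

    assert(p >= 3 && (p & 1u) == 1u);

    if (a == 0) {
        return 0;
    }
    if (a == 1) {
        return 1;
    }
    /* write a = 2^k * a1 with a1 odd */
    a1 = a;
    while ((a1 & 1u) == 0) {
        a1 >>= 1;
        ++k;
    }
    /* the (2/p)^k term: only the parity of k and p mod 8 matter */
    if ((k & 1u) == 0) {
        s = 1;
    } else {
        unsigned r = (unsigned)(p & 7u);
        s = (r == 1 || r == 7) ? 1 : -1;
    }
    /* the (-1)^((p-1)(a1-1)/4) term: negative only if both are 3 mod 4 */
    if ((p & 3u) == 3 && (a1 & 3u) == 3) {
        s = -s;
    }
    /* recurse on (p mod a1 / a1) unless a1 is already one */
    if (a1 != 1) {
        s *= jacobi_u64(p % a1, a1);
    }
    return s;
}
```

When $a$ and $p$ share a factor the recursion eventually reaches a first argument of zero and the whole product becomes zero, for example \texttt{jacobi\_u64(15, 21)}.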
|
5965 |
|
5966 EXAM,bn_mp_jacobi.c |
|
5967 |
|
As a matter of practicality the variable $a'$ as per the pseudo-code is represented by the variable $a1$ since the $'$ symbol is not a valid character in a C
variable name.
|
5970 |
|
The two simple cases of $a = 0$ and $a = 1$ are handled at the very beginning to simplify the algorithm. If the input is non-trivial the algorithm
has to proceed to compute the Jacobi symbol. The variable $s$ is used to hold the current Jacobi product. Note that $s$ is merely a C ``int'' data type since
the only values it may take are $-1$, $0$ and $1$.
|
5974 |
|
5975 After a local copy of $a$ is made all of the factors of two are divided out and the total stored in $k$. Technically only the least significant |
|
5976 bit of $k$ is required, however, it makes the algorithm simpler to follow to perform an addition. In practice an exclusive-or and addition have the same |
|
5977 processor requirements and neither is faster than the other. |
|
5978 |
|
Lines @59, if@ through @70, }@ determine the value of $\left ( { 2 \over p } \right )^k$. If the least significant bit of $k$ is zero then
$k$ is even and the value is one. Otherwise, the value of $s$ depends on which residue class $p$ belongs to modulo eight. The value of
$(-1)^{(p-1)(a'-1)/4}$ is computed and multiplied against $s$ on lines @73, if@ through @75, }@.
|
5982 |
|
5983 Finally, if $a1$ does not equal one the algorithm must recurse and compute $\left ( {p' \over a'} \right )$. |
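
The flow of the algorithm can also be illustrated on ordinary machine words. The following is only a sketch and not the LibTomMath routine: the name jacobi\_small and the use of an unsigned long in place of an mp\_int are assumptions made for illustration, and $p$ is assumed to be odd and at least three.

\begin{verbatim}
/* Jacobi symbol (a/p) for a >= 0 and odd p >= 3 on machine words.
 * Hypothetical helper, not part of LibTomMath. */
int jacobi_small(unsigned long a, unsigned long p)
{
    unsigned long a1, r;
    int k, s;

    if (a == 0) return 0;               /* trivial case a = 0 */
    if (a == 1) return 1;               /* trivial case a = 1 */

    /* divide out the factors of two so that a = 2^k * a1 */
    a1 = a;
    k  = 0;
    while ((a1 & 1) == 0) {
        a1 >>= 1;
        ++k;
    }

    /* (2/p)^k: only matters when k is odd */
    s = 1;
    if ((k & 1) != 0) {
        r = p & 7;                      /* p mod 8 */
        if (r == 3 || r == 5) s = -1;
    }

    /* (-1)^((p-1)(a1-1)/4): flips the sign when both are 3 mod 4 */
    if ((p & 3) == 3 && (a1 & 3) == 3) s = -s;

    /* recurse on (p mod a1 / a1) unless a1 is already one */
    if (a1 != 1) s *= jacobi_small(p % a1, a1);
    return s;
}
\end{verbatim}

For example, jacobi\_small(3, 7) evaluates to $-1$ since three is not a quadratic residue modulo seven.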
|
5984 |
|
5985 \textit{-- Comment about default $s$ and such...} |
|
5986 |
|
5987 \section{Modular Inverse} |
|
5988 \label{sec:modinv} |
|
The modular inverse of a number actually refers to the modular multiplicative inverse. For any integer $a$ such that $(a, p) = 1$ there
exists another integer $b$ such that $ab \equiv 1 \mbox{ (mod }p\mbox{)}$. The integer $b$ is called the multiplicative inverse of $a$ which is
denoted as $b = a^{-1}$. Technically speaking modular inversion is a well defined operation for any finite ring or field, not just for rings and
fields of integers. However, only the integer case will be discussed here.
|
5993 |
|
The simplest approach is to compute the algebraic inverse of the input, that is, to compute $b \equiv a^{\Phi(p) - 1} \mbox{ (mod }p\mbox{)}$. If $\Phi(p)$ is the
order of the multiplicative subgroup modulo $p$ then $b$ must be the multiplicative inverse of $a$. The proof is trivial.
|
5996 |
|
5997 \begin{equation} |
|
5998 ab \equiv a \left (a^{\Phi(p) - 1} \right ) \equiv a^{\Phi(p)} \equiv a^0 \equiv 1 \mbox{ (mod }p\mbox{)} |
|
5999 \end{equation} |
|
6000 |
|
6001 However, as simple as this approach may be it has two serious flaws. It requires that the value of $\Phi(p)$ be known which if $p$ is composite |
|
6002 requires all of the prime factors. This approach also is very slow as the size of $p$ grows. |
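
When the modulus is a prime that fits in a machine word, $\Phi(p) = p - 1$ and the exponentiation approach can be sketched directly. The names powmod\_small and invmod\_prime\_small are hypothetical, and the sketch assumes $p$ is prime and smaller than $2^{32}$ so that the intermediate products fit in 64 bits.

\begin{verbatim}
#include <stdint.h>

/* b^e mod m by square-and-multiply; assumes m < 2^32 so that the
 * products below cannot overflow a uint64_t. */
static uint64_t powmod_small(uint64_t b, uint64_t e, uint64_t m)
{
    uint64_t r = 1 % m;
    b %= m;
    while (e > 0) {
        if ((e & 1) != 0) r = (r * b) % m;
        b = (b * b) % m;
        e >>= 1;
    }
    return r;
}

/* Inverse of a modulo a PRIME p, computed as a^(Phi(p) - 1) with
 * Phi(p) = p - 1. The result is meaningless if p is composite. */
uint64_t invmod_prime_small(uint64_t a, uint64_t p)
{
    return powmod_small(a, p - 2, p);
}
\end{verbatim}

For instance invmod\_prime\_small(3, 7) yields $5$ since $3 \cdot 5 = 15 \equiv 1 \mbox{ (mod }7\mbox{)}$.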
|
6003 |
|
A more practical approach is based on the observation that solving for the multiplicative inverse is equivalent to solving the linear
Diophantine\footnote{See LeVeque \cite[pp. 40-43]{LeVeque} for more information.} equation.
|
6006 |
|
6007 \begin{equation} |
|
6008 ab + pq = 1 |
|
6009 \end{equation} |
|
6010 |
|
Where $a$, $b$, $p$ and $q$ are all integers. If such a pair of integers $ \left < b, q \right >$ exists then $b$ is the multiplicative inverse of
$a$ modulo $p$. The extended Euclidean algorithm (Knuth \cite[pp. 342]{TAOCPV2}) can be used to solve such equations provided $(a, p) = 1$.
However, instead of using that algorithm directly a variant known as the binary extended Euclidean algorithm will be used in its place. The
binary approach is very similar to the binary greatest common divisor algorithm except it will produce a full solution to the Diophantine
equation.
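
To make the Diophantine view concrete, the classical (non-binary) extended Euclidean algorithm can be sketched on machine words as follows. This is the textbook variant and not the binary algorithm used by LibTomMath; the name xgcd\_small is hypothetical and the inputs are assumed to be small non-negative integers.

\begin{verbatim}
#include <stdint.h>

/* Solve a*b + p*q = g, where g = gcd(a, p), for the pair (b, q).
 * Invariant: s*a + t*p equals the matching remainder r. */
int64_t xgcd_small(int64_t a, int64_t p, int64_t *b, int64_t *q)
{
    int64_t r0 = a, r1 = p;     /* remainders         */
    int64_t s0 = 1, s1 = 0;     /* coefficients of a  */
    int64_t t0 = 0, t1 = 1;     /* coefficients of p  */

    while (r1 != 0) {
        int64_t quot = r0 / r1;
        int64_t tmp;

        tmp = r0 - quot * r1; r0 = r1; r1 = tmp;
        tmp = s0 - quot * s1; s0 = s1; s1 = tmp;
        tmp = t0 - quot * t1; t0 = t1; t1 = tmp;
    }

    *b = s0;
    *q = t0;
    return r0;
}
\end{verbatim}

With $a = 3$ and $p = 7$ the sketch produces $b = -2$ and $q = 1$, and $-2 \equiv 5 \mbox{ (mod }7\mbox{)}$ is the inverse of three modulo seven.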
|
6016 |
|
6017 \subsection{General Case} |
|
6018 \newpage\begin{figure}[!here] |
|
6019 \begin{small} |
|
6020 \begin{center} |
|
6021 \begin{tabular}{l} |
|
6022 \hline Algorithm \textbf{mp\_invmod}. \\ |
|
\textbf{Input}. mp\_int $a$ and $b$, $(a, b) = 1$, $b \ge 2$, $0 < a < b$. \\
|
6024 \textbf{Output}. The modular inverse $c \equiv a^{-1} \mbox{ (mod }b\mbox{)}$. \\ |
|
6025 \hline \\ |
|
6026 1. If $b \le 0$ then return(\textit{MP\_VAL}). \\ |
|
6027 2. If $b_0 \equiv 1 \mbox{ (mod }2\mbox{)}$ then use algorithm fast\_mp\_invmod. \\ |
|
6028 3. $x \leftarrow \vert a \vert, y \leftarrow b$ \\ |
|
6029 4. If $x_0 \equiv y_0 \equiv 0 \mbox{ (mod }2\mbox{)}$ then return(\textit{MP\_VAL}). \\ |
|
5. $u \leftarrow x, v \leftarrow y, A \leftarrow 1, B \leftarrow 0, C \leftarrow 0, D \leftarrow 1$ \\
|
6031 6. While $u.used > 0$ and $u_0 \equiv 0 \mbox{ (mod }2\mbox{)}$ \\ |
|
6032 \hspace{3mm}6.1 $u \leftarrow \lfloor u / 2 \rfloor$ \\ |
|
6033 \hspace{3mm}6.2 If ($A.used > 0$ and $A_0 \equiv 1 \mbox{ (mod }2\mbox{)}$) or ($B.used > 0$ and $B_0 \equiv 1 \mbox{ (mod }2\mbox{)}$) then \\ |
|
6034 \hspace{6mm}6.2.1 $A \leftarrow A + y$ \\ |
|
6035 \hspace{6mm}6.2.2 $B \leftarrow B - x$ \\ |
|
6036 \hspace{3mm}6.3 $A \leftarrow \lfloor A / 2 \rfloor$ \\ |
|
6037 \hspace{3mm}6.4 $B \leftarrow \lfloor B / 2 \rfloor$ \\ |
|
6038 7. While $v.used > 0$ and $v_0 \equiv 0 \mbox{ (mod }2\mbox{)}$ \\ |
|
6039 \hspace{3mm}7.1 $v \leftarrow \lfloor v / 2 \rfloor$ \\ |
|
6040 \hspace{3mm}7.2 If ($C.used > 0$ and $C_0 \equiv 1 \mbox{ (mod }2\mbox{)}$) or ($D.used > 0$ and $D_0 \equiv 1 \mbox{ (mod }2\mbox{)}$) then \\ |
|
6041 \hspace{6mm}7.2.1 $C \leftarrow C + y$ \\ |
|
6042 \hspace{6mm}7.2.2 $D \leftarrow D - x$ \\ |
|
6043 \hspace{3mm}7.3 $C \leftarrow \lfloor C / 2 \rfloor$ \\ |
|
6044 \hspace{3mm}7.4 $D \leftarrow \lfloor D / 2 \rfloor$ \\ |
|
6045 8. If $u \ge v$ then \\ |
|
6046 \hspace{3mm}8.1 $u \leftarrow u - v$ \\ |
|
6047 \hspace{3mm}8.2 $A \leftarrow A - C$ \\ |
|
6048 \hspace{3mm}8.3 $B \leftarrow B - D$ \\ |
|
6049 9. else \\ |
|
6050 \hspace{3mm}9.1 $v \leftarrow v - u$ \\ |
|
6051 \hspace{3mm}9.2 $C \leftarrow C - A$ \\ |
|
6052 \hspace{3mm}9.3 $D \leftarrow D - B$ \\ |
|
6053 10. If $u \ne 0$ goto step 6. \\ |
|
6054 11. If $v \ne 1$ return(\textit{MP\_VAL}). \\ |
|
6055 12. While $C \le 0$ do \\ |
|
6056 \hspace{3mm}12.1 $C \leftarrow C + b$ \\ |
|
6057 13. While $C \ge b$ do \\ |
|
6058 \hspace{3mm}13.1 $C \leftarrow C - b$ \\ |
|
6059 14. $c \leftarrow C$ \\ |
|
6060 15. Return(\textit{MP\_OKAY}). \\ |
|
6061 \hline |
|
6062 \end{tabular} |
|
6063 \end{center} |
|
6064 \end{small} |
|
6065 \end{figure} |
|
6066 \textbf{Algorithm mp\_invmod.} |
|
6067 This algorithm computes the modular multiplicative inverse of an integer $a$ modulo an integer $b$. This algorithm is a variation of the |
|
6068 extended binary Euclidean algorithm from HAC \cite[pp. 608]{HAC}. It has been modified to only compute the modular inverse and not a complete |
|
6069 Diophantine solution. |
|
6070 |
|
If $b \le 0$ then the modulus is invalid and MP\_VAL is returned. Similarly if both $a$ and $b$ are even then there cannot be a multiplicative
inverse for $a$ and the error is reported.
|
6073 |
|
The astute reader will observe that steps six through nine are very similar to the binary greatest common divisor algorithm mp\_gcd. In this case
the remaining unknowns of the Diophantine equation are solved for as well. The algorithm terminates when $u = 0$ in which case the solution is
|
6076 |
|
6077 \begin{equation} |
|
6078 Ca + Db = v |
|
6079 \end{equation} |
|
6080 |
|
If $v$, the greatest common divisor of $a$ and $b$, is not equal to one then the algorithm will report an error as no inverse exists. Otherwise, $C$
is the modular inverse of $a$. The actual value of $C$ is congruent to, but not necessarily equal to, the ideal modular inverse which should lie
within $1 \le a^{-1} < b$. Steps twelve and thirteen adjust the inverse until it is in range. If the original input $a$ is within $0 < a < b$
then only a couple of additions or subtractions will be required to adjust the inverse.
|
6085 |
|
6086 EXAM,bn_mp_invmod.c |
|
6087 |
|
6088 \subsubsection{Odd Moduli} |
|
6089 |
|
6090 When the modulus $b$ is odd the variables $A$ and $C$ are fixed and are not required to compute the inverse. In particular by attempting to solve |
|
6091 the Diophantine $Cb + Da = 1$ only $B$ and $D$ are required to find the inverse of $a$. |
|
6092 |
|
The algorithm fast\_mp\_invmod is a direct adaptation of algorithm mp\_invmod with all steps involving either $A$ or $C$ removed. This
optimization will halve the time required to compute the modular inverse.
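
The reduced algorithm is small enough to sketch on machine words. The following is an illustration under stated assumptions rather than the LibTomMath source: the name invmod\_odd\_small is hypothetical, the modulus $b$ is assumed to be odd and greater than one, and the inputs are assumed small enough that no intermediate value overflows. A return value of zero is used here to signal that no inverse exists.

\begin{verbatim}
#include <stdint.h>

/* Inverse of a modulo an odd b > 1 using only B and D.
 * Invariants: B*y = u (mod x) and D*y = v (mod x). */
int64_t invmod_odd_small(int64_t a, int64_t b)
{
    int64_t x = b, y = a % b;
    int64_t u, v, B = 0, D = 1;

    if (y < 0)  y += b;
    if (y == 0) return 0;               /* a is a multiple of b */
    u = x;
    v = y;

    do {
        while ((u & 1) == 0) {          /* u is non-zero here   */
            u /= 2;
            if (B % 2 != 0) B -= x;     /* make B even first    */
            B /= 2;
        }
        while ((v & 1) == 0) {          /* v is always positive */
            v /= 2;
            if (D % 2 != 0) D -= x;
            D /= 2;
        }
        if (u >= v) { u -= v; B -= D; }
        else        { v -= u; D -= B; }
    } while (u != 0);

    if (v != 1) return 0;               /* gcd(a, b) is not one */
    while (D < 0)  D += b;              /* bring D into range   */
    while (D >= b) D -= b;
    return D;
}
\end{verbatim}

Only two of the four coefficients are updated in each pass, which is where the saving over the general algorithm comes from.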
|
6095 |
|
6096 \section{Primality Tests} |
|
6097 |
|
An integer $a > 1$ is said to be prime if it is not divisible by any positive integer other than one and itself. For example, $a = 7$ is prime
since the integers $2 \ldots 6$ do not evenly divide $a$. By contrast, $a = 6$ is not prime since $a = 6 = 2 \cdot 3$.
|
6100 |
|
Prime numbers arise frequently in cryptography as they allow finite fields to be formed. The ability to quickly determine whether an integer is prime
has been an active subject in cryptography and number theory for a considerable time. The algorithms that will be presented are all
probabilistic algorithms in that when they report an integer is composite it must be composite. However, when the algorithms report an integer is
prime the algorithm may be incorrect.
|
6105 |
|
As will be discussed it is possible to limit the probability of error so well that for practical purposes the probability of error might as
well be zero. For the purposes of these discussions let $n$ represent the candidate integer of which the primality is in question.
|
6108 |
|
6109 \subsection{Trial Division} |
|
6110 |
|
6111 Trial division means to attempt to evenly divide a candidate integer by small prime integers. If the candidate can be evenly divided it obviously |
|
6112 cannot be prime. By dividing by all primes $1 < p \le \sqrt{n}$ this test can actually prove whether an integer is prime. However, such a test |
|
6113 would require a prohibitive amount of time as $n$ grows. |
|
6114 |
|
Instead of dividing by every prime, a smaller, more manageable set of primes may be used. By performing trial division with only a subset
of the primes less than $\sqrt{n} + 1$ the algorithm cannot prove that a candidate is prime. However, it can often prove that a candidate is not prime.
|
6117 |
|
The benefit of this test is that trial division by small values is fairly efficient, especially compared to the other algorithms that will be
discussed shortly. The probability that this approach correctly identifies a composite candidate when tested with all primes up to $q$ is given by
$1 - {1.12 \over \ln(q)}$. The graph (\ref{pic:primality}, will be added later) demonstrates the probability of success for the range
$3 \le q \le 100$.
|
6122 |
|
6123 At approximately $q = 30$ the gain of performing further tests diminishes fairly quickly. At $q = 90$ further testing is generally not going to |
|
6124 be of any practical use. In the case of LibTomMath the default limit $q = 256$ was chosen since it is not too high and will eliminate |
|
6125 approximately $80\%$ of all candidate integers. The constant \textbf{PRIME\_SIZE} is equal to the number of primes in the test base. The |
|
6126 array \_\_prime\_tab is an array of the first \textbf{PRIME\_SIZE} prime numbers. |
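
As a rough check of these figures, the estimate $1 - {1.12 \over \ln(q)}$ can be evaluated by hand at the three limits mentioned. The values below are rounded to two decimal places and assume that $q$ is substituted into the formula exactly as quoted.

\begin{eqnarray*}
q = 30:  & & 1 - {1.12 \over \ln(30)}  \approx 1 - {1.12 \over 3.40} \approx 0.67 \\
q = 90:  & & 1 - {1.12 \over \ln(90)}  \approx 1 - {1.12 \over 4.50} \approx 0.75 \\
q = 256: & & 1 - {1.12 \over \ln(256)} \approx 1 - {1.12 \over 5.55} \approx 0.80
\end{eqnarray*}

Tripling the limit from $30$ to $90$ gains roughly eight percentage points, while going from $90$ to $256$ gains only about five more.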
|
6127 |
|
6128 \begin{figure}[!here] |
|
6129 \begin{small} |
|
6130 \begin{center} |
|
6131 \begin{tabular}{l} |
|
6132 \hline Algorithm \textbf{mp\_prime\_is\_divisible}. \\ |
|
\textbf{Input}. mp\_int $n$ \\
|
6134 \textbf{Output}. $c = 1$ if $n$ is divisible by a small prime, otherwise $c = 0$. \\ |
|
6135 \hline \\ |
|
1. for $ix$ from $0$ to $PRIME\_SIZE - 1$ do \\
|
6137 \hspace{3mm}1.1 $d \leftarrow n \mbox{ (mod }\_\_prime\_tab_{ix}\mbox{)}$ \\ |
|
6138 \hspace{3mm}1.2 If $d = 0$ then \\ |
|
6139 \hspace{6mm}1.2.1 $c \leftarrow 1$ \\ |
|
6140 \hspace{6mm}1.2.2 Return(\textit{MP\_OKAY}). \\ |
|
6141 2. $c \leftarrow 0$ \\ |
|
6142 3. Return(\textit{MP\_OKAY}). \\ |
|
6143 \hline |
|
6144 \end{tabular} |
|
6145 \end{center} |
|
6146 \end{small} |
|
6147 \caption{Algorithm mp\_prime\_is\_divisible} |
|
6148 \end{figure} |
|
6149 \textbf{Algorithm mp\_prime\_is\_divisible.} |
|
6150 This algorithm attempts to determine if a candidate integer $n$ is composite by performing trial divisions. |
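
The same loop can be sketched for a candidate that fits in a machine word. The table small\_primes below is a hypothetical ten-entry stand-in for \_\_prime\_tab, and the name is\_divisible\_small is likewise an assumption made for illustration.

\begin{verbatim}
#include <stddef.h>

/* Hypothetical miniature of the prime table. */
static const unsigned long small_primes[] = {
    2, 3, 5, 7, 11, 13, 17, 19, 23, 29
};
#define SMALL_PRIME_COUNT \
    (sizeof(small_primes) / sizeof(small_primes[0]))

/* Returns 1 if n is divisible by a tabulated prime, otherwise 0.
 * Note that a tabulated prime divides itself, so n = 7 yields 1. */
int is_divisible_small(unsigned long n)
{
    size_t ix;

    for (ix = 0; ix < SMALL_PRIME_COUNT; ix++) {
        if (n % small_primes[ix] == 0) {
            return 1;
        }
    }
    return 0;
}
\end{verbatim}

A result of zero proves nothing on its own. For example $1147 = 31 \cdot 37$ passes this small table untouched, which is why a stronger test must follow.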
|
6151 |
|
6152 EXAM,bn_mp_prime_is_divisible.c |
|
6153 |
|
6154 The algorithm defaults to a return of $0$ in case an error occurs. The values in the prime table are all specified to be in the range of a |
|
6155 mp\_digit. The table \_\_prime\_tab is defined in the following file. |
|
6156 |
|
6157 EXAM,bn_prime_tab.c |
|
6158 |
|
Note that there are two possible tables. When an mp\_digit is 7-bits long only the primes up to $127$ may be included, otherwise the primes
up to $1619$ are used. Note that the value of \textbf{PRIME\_SIZE} is a constant dependent on the size of a mp\_digit.
|
6161 |
|
6162 \subsection{The Fermat Test} |
|
The Fermat test is probably one of the oldest tests to have a non-trivial probability of success. It is based on the fact that if $n$ is in
fact prime then $a^{n} \equiv a \mbox{ (mod }n\mbox{)}$ for all $0 < a < n$. The reason being that if $n$ is prime then the order of
the multiplicative subgroup is $n - 1$. Any base $a$ must have an order which divides $n - 1$ and as such $a^n$ is equivalent to
$a^1 = a$.
|
6167 |
|
If $n$ is composite then the order of a given base $a$ does not have to divide $n - 1$, in which case
it is possible that $a^n \nequiv a \mbox{ (mod }n\mbox{)}$. However, this test is not absolute as it is possible that the order
of a base will divide $n - 1$, in which case the composite $n$ would be reported as prime. Such a base yields what is known as a Fermat pseudo-prime. Certain
integers known as Carmichael numbers are pseudo-primes to all valid bases. Fortunately such numbers are extremely rare as $n$ grows
in size.
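
The test and its weakness can both be seen with word-sized integers. The sketch below is an illustration only: the names powmod\_small and fermat\_small are hypothetical, and $n$ is assumed to be smaller than $2^{32}$ so that the intermediate products fit in 64 bits.

\begin{verbatim}
#include <stdint.h>

/* b^e mod m by square-and-multiply; assumes m < 2^32. */
static uint64_t powmod_small(uint64_t b, uint64_t e, uint64_t m)
{
    uint64_t r = 1 % m;
    b %= m;
    while (e > 0) {
        if ((e & 1) != 0) r = (r * b) % m;
        b = (b * b) % m;
        e >>= 1;
    }
    return r;
}

/* Returns 1 if b^n = b (mod n), that is, if n passes the Fermat
 * test to the base b, and 0 if n is proven composite. */
int fermat_small(uint64_t n, uint64_t b)
{
    return powmod_small(b, n, n) == b % n;
}
\end{verbatim}

The composite $341 = 11 \cdot 31$ passes to the base two but fails to the base three, whereas the Carmichael number $561 = 3 \cdot 11 \cdot 17$ passes to both.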
|
6173 |
|
6174 \begin{figure}[!here] |
|
6175 \begin{small} |
|
6176 \begin{center} |
|
6177 \begin{tabular}{l} |
|
6178 \hline Algorithm \textbf{mp\_prime\_fermat}. \\ |
|
6179 \textbf{Input}. mp\_int $a$ and $b$, $a \ge 2$, $0 < b < a$. \\ |
|
6180 \textbf{Output}. $c = 1$ if $b^a \equiv b \mbox{ (mod }a\mbox{)}$, otherwise $c = 0$. \\ |
|
6181 \hline \\ |
|
6182 1. $t \leftarrow b^a \mbox{ (mod }a\mbox{)}$ \\ |
|
6183 2. If $t = b$ then \\ |
|
\hspace{3mm}2.1 $c \leftarrow 1$ \\
3. else \\
\hspace{3mm}3.1 $c \leftarrow 0$ \\
|
6187 4. Return(\textit{MP\_OKAY}). \\ |
|
6188 \hline |
|
6189 \end{tabular} |
|
6190 \end{center} |
|
6191 \end{small} |
|
6192 \caption{Algorithm mp\_prime\_fermat} |
|
6193 \end{figure} |
|
6194 \textbf{Algorithm mp\_prime\_fermat.} |
|
This algorithm determines whether an mp\_int $a$ is a Fermat probable prime to the base $b$ or not. It uses a single modular exponentiation to
determine the result.
|
6197 |
|
6198 EXAM,bn_mp_prime_fermat.c |
|
6199 |
|
6200 \subsection{The Miller-Rabin Test} |
|
The Miller-Rabin (citation) test is another primality test which has tighter error bounds than the Fermat test specifically with sequentially chosen
candidate integers. The algorithm is based on the observation that if $n$ is prime, $n - 1 = 2^kr$ with $r$ odd and $b^r \nequiv \pm 1$, then after up to $k - 1$ squarings the
value must be equal to $-1$. The squarings are stopped as soon as $-1$ is observed. If the value of $1$ is observed first it means that
some value not congruent to $\pm 1$ when squared equals one which cannot occur if $n$ is prime.
|
6205 |
|
6206 \begin{figure}[!here] |
|
6207 \begin{small} |
|
6208 \begin{center} |
|
6209 \begin{tabular}{l} |
|
6210 \hline Algorithm \textbf{mp\_prime\_miller\_rabin}. \\ |
|
6211 \textbf{Input}. mp\_int $a$ and $b$, $a \ge 2$, $0 < b < a$. \\ |
|
\textbf{Output}. $c = 1$ if $a$ is a Miller-Rabin prime to the base $b$, otherwise $c = 0$. \\
|
6213 \hline |
|
6214 1. $a' \leftarrow a - 1$ \\ |
|
2. $r \leftarrow a'$ \\
|
6216 3. $c \leftarrow 0, s \leftarrow 0$ \\ |
|
6217 4. While $r.used > 0$ and $r_0 \equiv 0 \mbox{ (mod }2\mbox{)}$ \\ |
|
6218 \hspace{3mm}4.1 $s \leftarrow s + 1$ \\ |
|
6219 \hspace{3mm}4.2 $r \leftarrow \lfloor r / 2 \rfloor$ \\ |
|
6220 5. $y \leftarrow b^r \mbox{ (mod }a\mbox{)}$ \\ |
|
6221 6. If $y \nequiv \pm 1$ then \\ |
|
6222 \hspace{3mm}6.1 $j \leftarrow 1$ \\ |
|
6223 \hspace{3mm}6.2 While $j \le (s - 1)$ and $y \nequiv a'$ \\ |
|
6224 \hspace{6mm}6.2.1 $y \leftarrow y^2 \mbox{ (mod }a\mbox{)}$ \\ |
|
6225 \hspace{6mm}6.2.2 If $y = 1$ then goto step 8. \\ |
|
6226 \hspace{6mm}6.2.3 $j \leftarrow j + 1$ \\ |
|
6227 \hspace{3mm}6.3 If $y \nequiv a'$ goto step 8. \\ |
|
6228 7. $c \leftarrow 1$\\ |
|
6229 8. Return(\textit{MP\_OKAY}). \\ |
|
6230 \hline |
|
6231 \end{tabular} |
|
6232 \end{center} |
|
6233 \end{small} |
|
6234 \caption{Algorithm mp\_prime\_miller\_rabin} |
|
6235 \end{figure} |
|
6236 \textbf{Algorithm mp\_prime\_miller\_rabin.} |
|
This algorithm performs one trial round of the Miller-Rabin algorithm to the base $b$. It will set $c = 1$ if the algorithm cannot determine
if $a$ is composite or $c = 0$ if $a$ is provably composite. The values of $s$ and $r$ are computed such that $a' = a - 1 = 2^sr$.
|
6239 |
|
If the value $y \equiv b^r$ is congruent to $\pm 1$ then the algorithm cannot prove if $a$ is composite or not. Otherwise, the algorithm will
square $y$ up to $s - 1$ times stopping only when $y \equiv -1$. If $y^2 \equiv 1$ and $y \nequiv \pm 1$ then the algorithm can report that $a$
is provably composite. If the algorithm performs $s - 1$ squarings and $y \nequiv -1$ then $a$ is provably composite. If $a$ is not provably
composite then it is \textit{probably} prime.
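
A single round can be sketched on machine words as follows. This is an illustration and not the LibTomMath routine: the names powmod\_small and miller\_rabin\_small are hypothetical, the candidate $a$ is assumed to be odd with $3 \le a < 2^{32}$, and the base is assumed to satisfy $0 < b < a$.

\begin{verbatim}
#include <stdint.h>

/* b^e mod m by square-and-multiply; assumes m < 2^32. */
static uint64_t powmod_small(uint64_t b, uint64_t e, uint64_t m)
{
    uint64_t r = 1 % m;
    b %= m;
    while (e > 0) {
        if ((e & 1) != 0) r = (r * b) % m;
        b = (b * b) % m;
        e >>= 1;
    }
    return r;
}

/* One Miller-Rabin round for odd a >= 3 to the base b.
 * Returns 0 if a is provably composite, otherwise 1. */
int miller_rabin_small(uint64_t a, uint64_t b)
{
    uint64_t a1 = a - 1, r = a1, y;
    unsigned s = 0, j;

    /* write a - 1 = 2^s * r with r odd */
    while ((r & 1) == 0) {
        r >>= 1;
        ++s;
    }

    y = powmod_small(b, r, a);
    if (y != 1 && y != a1) {
        /* at most s - 1 squarings, stopping once y = -1 */
        for (j = 1; j < s && y != a1; j++) {
            y = (y * y) % a;
            if (y == 1) return 0;   /* non-trivial root of one */
        }
        if (y != a1) return 0;      /* never reached -1        */
    }
    return 1;
}
\end{verbatim}

The Carmichael number $561$, which the Fermat test cannot catch, is reported composite to the base two. The composite $2047 = 23 \cdot 89$ still passes to the base two but fails to the base three, which is why several bases are normally tried.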
|
6244 |
|
6245 EXAM,bn_mp_prime_miller_rabin.c |
|
6246 |
|
6247 |
|
6248 |
|
6249 |
|
6250 \backmatter |
|
6251 \appendix |
|
6252 \begin{thebibliography}{ABCDEF} |
|
6253 \bibitem[1]{TAOCPV2} |
|
6254 Donald Knuth, \textit{The Art of Computer Programming}, Third Edition, Volume Two, Seminumerical Algorithms, Addison-Wesley, 1998 |
|
6255 |
|
6256 \bibitem[2]{HAC} |
|
6257 A. Menezes, P. van Oorschot, S. Vanstone, \textit{Handbook of Applied Cryptography}, CRC Press, 1996 |
|
6258 |
|
6259 \bibitem[3]{ROSE} |
|
6260 Michael Rosing, \textit{Implementing Elliptic Curve Cryptography}, Manning Publications, 1999 |
|
6261 |
|
6262 \bibitem[4]{COMBA} |
|
6263 Paul G. Comba, \textit{Exponentiation Cryptosystems on the IBM PC}. IBM Systems Journal 29(4): 526-538 (1990) |
|
6264 |
|
6265 \bibitem[5]{KARA} |
|
A. Karatsuba, Doklady Akad. Nauk SSSR 145 (1962), pp.293-294
|
6267 |
|
6268 \bibitem[6]{KARAP} |
|
6269 Andre Weimerskirch and Christof Paar, \textit{Generalizations of the Karatsuba Algorithm for Polynomial Multiplication}, Submitted to Design, Codes and Cryptography, March 2002 |
|
6270 |
|
6271 \bibitem[7]{BARRETT} |
|
6272 Paul Barrett, \textit{Implementing the Rivest Shamir and Adleman Public Key Encryption Algorithm on a Standard Digital Signal Processor}, Advances in Cryptology, Crypto '86, Springer-Verlag. |
|
6273 |
|
6274 \bibitem[8]{MONT} |
|
6275 P.L.Montgomery. \textit{Modular multiplication without trial division}. Mathematics of Computation, 44(170):519-521, April 1985. |
|
6276 |
|
6277 \bibitem[9]{DRMET} |
|
6278 Chae Hoon Lim and Pil Joong Lee, \textit{Generating Efficient Primes for Discrete Log Cryptosystems}, POSTECH Information Research Laboratories |
|
6279 |
|
6280 \bibitem[10]{MMB} |
|
6281 J. Daemen and R. Govaerts and J. Vandewalle, \textit{Block ciphers based on Modular Arithmetic}, State and {P}rogress in the {R}esearch of {C}ryptography, 1993, pp. 80-89 |
|
6282 |
|
6283 \bibitem[11]{RSAREF} |
|
6284 R.L. Rivest, A. Shamir, L. Adleman, \textit{A Method for Obtaining Digital Signatures and Public-Key Cryptosystems} |
|
6285 |
|
6286 \bibitem[12]{DHREF} |
|
6287 Whitfield Diffie, Martin E. Hellman, \textit{New Directions in Cryptography}, IEEE Transactions on Information Theory, 1976 |
|
6288 |
|
6289 \bibitem[13]{IEEE} |
|
6290 IEEE Standard for Binary Floating-Point Arithmetic (ANSI/IEEE Std 754-1985) |
|
6291 |
|
6292 \bibitem[14]{GMP} |
|
6293 GNU Multiple Precision (GMP), \url{http://www.swox.com/gmp/} |
|
6294 |
|
6295 \bibitem[15]{MPI} |
|
6296 Multiple Precision Integer Library (MPI), Michael Fromberger, \url{http://thayer.dartmouth.edu/~sting/mpi/} |
|
6297 |
|
6298 \bibitem[16]{OPENSSL} |
|
6299 OpenSSL Cryptographic Toolkit, \url{http://openssl.org} |
|
6300 |
|
6301 \bibitem[17]{LIP} |
|
6302 Large Integer Package, \url{http://home.hetnet.nl/~ecstr/LIP.zip} |
|
6303 |
|
6304 \bibitem[18]{ISOC} |
|
6305 JTC1/SC22/WG14, ISO/IEC 9899:1999, ``A draft rationale for the C99 standard.'' |
|
6306 |
|
6307 \bibitem[19]{JAVA} |
|
6308 The Sun Java Website, \url{http://java.sun.com/} |
|
6309 |
|
6310 \end{thebibliography} |
|
6311 |
|
6312 \input{tommath.ind} |
|
6313 |
|
6314 \end{document} |