comparison tommath.src @ 190:d8254fc979e9 libtommath-orig LTM_0.35

Initial import of libtommath 0.35
author Matt Johnston <matt@ucc.asn.au>
date Fri, 06 May 2005 08:59:30 +0000
parents d29b64170cf0
children
comparison
equal deleted inserted replaced
142:d29b64170cf0 190:d8254fc979e9
47 \def\gap{\vspace{0.5ex}} 47 \def\gap{\vspace{0.5ex}}
48 \makeindex 48 \makeindex
49 \begin{document} 49 \begin{document}
50 \frontmatter 50 \frontmatter
51 \pagestyle{empty} 51 \pagestyle{empty}
52 \title{Implementing Multiple Precision Arithmetic \\ ~ \\ Draft Edition } 52 \title{Multi--Precision Math}
53 \author{\mbox{ 53 \author{\mbox{
54 %\begin{small} 54 %\begin{small}
55 \begin{tabular}{c} 55 \begin{tabular}{c}
56 Tom St Denis \\ 56 Tom St Denis \\
57 Algonquin College \\ 57 Algonquin College \\
64 \end{tabular} 64 \end{tabular}
65 %\end{small} 65 %\end{small}
66 } 66 }
67 } 67 }
68 \maketitle 68 \maketitle
69 This text has been placed in the public domain. This text corresponds to the v0.30 release of the 69 This text has been placed in the public domain. This text corresponds to the v0.35 release of the
70 LibTomMath project. 70 LibTomMath project.
71 71
72 \begin{alltt} 72 \begin{alltt}
73 Tom St Denis 73 Tom St Denis
74 111 Banning Rd 74 111 Banning Rd
83 This text is formatted to the international B5 paper size of 176mm wide by 250mm tall using the \LaTeX{} 83 This text is formatted to the international B5 paper size of 176mm wide by 250mm tall using the \LaTeX{}
84 {\em book} macro package and the Perl {\em booker} package. 84 {\em book} macro package and the Perl {\em booker} package.
85 85
86 \tableofcontents 86 \tableofcontents
87 \listoffigures 87 \listoffigures
88 \chapter*{Prefaces to the Draft Edition} 88 \chapter*{Prefaces}
89 I started this text in April 2003 to complement my LibTomMath library. That is, explain how to implement the functions 89 When I tell people about my LibTom projects and that I release them as public domain they are often puzzled.
90 contained in LibTomMath. The goal is to have a textbook that any Computer Science student can use when implementing their 90 They ask why I did it and especially why I continue to work on them for free. The best I can explain it is ``Because I can.''
91 own multiple precision arithmetic. The plan I wanted to follow was to flesh out all the 91 Which seems odd and perhaps too terse for adult conversation. I often qualify it with ``I am able, I am willing.'' which
92 ideas and concepts I had floating around in my head and then work on it afterwards refining a little bit at a time. Chance 92 perhaps explains it better. I am the first to admit there is not anything that special with what I have done. Perhaps
93 would have it that I ended up with my summer off from Algonquin College and I was given four months solid to work on the 93 others can see that too and then we would have a society to be proud of. My LibTom projects are what I am doing to give
94 text. 94 back to society in the form of tools and knowledge that can help others in their endeavours.
95 95
96 Choosing to not waste any time I dove right into the project even before my spring semester was finished. I wrote a bit 96 I started writing this book because it was the most logical task to further my goal of open academia. The LibTomMath source
97 off and on at first. The moment my exams were finished I jumped into long 12 to 16 hour days. The result after only 97 code itself was written to be easy to follow and learn from. There are times, however, where pure C source code does not
98 a couple of months was a ten chapter, three hundred page draft that I quickly had distributed to anyone who wanted 98 explain the algorithms properly. Hence this book. The book literally starts with the foundation of the library and works
99 to read it. I had Jean-Luc Cooke print copies for me and I brought them to Crypto'03 in Santa Barbara. So far I have 99 itself outwards to the more complicated algorithms. The use of both pseudo--code and verbatim source code provides a duality
100 managed to grab a certain level of attention; having people from around the world ask me for copies of the text was certainly 100 of ``theory'' and ``practice'' that the computer science students of the world shall appreciate. I never deviate too far
101 rewarding. 101 from relatively straightforward algebra and I hope that this book can be a valuable learning asset.
102 102
103 Now we are past December 2003. By this time I had pictured that I would have at least finished my second draft of the text. 103 This book and indeed much of the LibTom projects would not exist in their current form if it was not for a plethora
104 Currently I am far off from this goal. I've done partial re-writes of chapters one, two and three but they are not even 104 of kind people donating their time, resources and kind words to help support my work. Writing a text of significant
105 finished yet. I haven't given up on the project, only had some setbacks. First O'Reilly declined to publish the text then 105 length (along with the source code) is a tiresome and lengthy process. Currently the LibTom project is four years old,
106 Addison-Wesley and Greg tried another which I don't know the name of. However, at this point I want to focus my energy 106 comprises literally thousands of users and over 100,000 lines of source code, TeX and other material. People like Mads and Greg
107 on finishing the book, not securing a contract. 107 were there at the beginning to encourage me to work well. It is amazing how timely validation from others can boost morale to
108 108 continue the project. Definitely my parents were there for me by providing room and board during the many months of work in 2003.
109 So why am I writing this text? It seems like a lot of work right? Most certainly it is a lot of work writing a textbook. 109
110 Even the simplest introductory material has to be lined with references and figures. A lot of the text has to be re-written 110 To my many friends whom I have met through the years I thank you for the good times and the words of encouragement. I hope I
111 from point form to prose form to ensure an easier read. Why am I doing all this work for free then? Simple. My philosophy 111 honour your kind gestures with this project.
112 is quite simply ``Open Source. Open Academia. Open Minds'' which means that to achieve a goal of open minds, that is, 112
113 people willing to accept new ideas and explore the unknown, you have to make available material they can access freely 113 Open Source. Open Academia. Open Minds.
114 without hindrance.
115
116 I've been writing free software since I was about sixteen but only recently have I hit upon software that people have come
117 to depend upon. I started LibTomCrypt in December 2001 and now several major companies use it as integral portions of their
118 software. Several educational institutions use it as a matter of course and many freelance developers use it as
119 part of their projects. To further my contributions I started the LibTomMath project in December 2002 aimed at providing
120 multiple precision arithmetic routines that students could learn from. That is, write routines that are not only easy
121 to understand and follow but provide quite impressive performance considering they are all in standard portable ISO C.
122
123 The second leg of my philosophy is ``Open Academia'' which is where this textbook comes in. In the end, when all is
124 said and done the text will be useable by educational institutions as a reference on multiple precision arithmetic.
125
126 At this time I feel I should share a little information about myself. The most common question I was asked at
127 Crypto'03, perhaps just out of professional courtesy, was which school I either taught at or attended. The unfortunate
128 truth is that I neither teach at nor attend a school of academic reputation. I'm currently at Algonquin College which
129 is what I'd like to call a ``somewhat academic but mostly vocational'' college. In other words, job training.
130
131 I'm a 21 year old computer science student mostly self-taught in the areas I am aware of (which includes a half-dozen
132 computer science fields, a few fields of mathematics and some English). I look forward to teaching someday but I am
133 still far off from that goal.
134
135 Now it would be improper for me to not introduce the rest of the text's co-authors. While they are only contributing
136 corrections and editorial feedback their support has been tremendously helpful in presenting the concepts laid out
137 in the text so far. Greg has always been there for me. He has tracked my LibTom projects since their inception and even
138 sent cheques to help pay tuition from time to time. His background has provided a wonderful source to bounce ideas off
139 of and improve the quality of my writing. Mads is another fellow who has just ``been there''. I don't even recall what
140 his interest in the LibTom projects is but I'm definitely glad he has been around. His ability to catch logical errors
141 in my written English has saved me on several occasions to say the least.
142
143 What to expect next? Well this is still a rough draft. I've only had the chance to update a few chapters. However, I've
144 been getting the feeling that people are starting to use my text and I owe them some updated material. My current tentative
145 plan is to edit one chapter every two weeks starting January 4th. It seems insane but my lower course load at college
146 should provide ample time. By Crypto'04 I plan to have a 2nd draft of the text polished and ready to hand out to as many
147 people who will take it.
148 114
149 \begin{flushright} Tom St Denis \end{flushright} 115 \begin{flushright} Tom St Denis \end{flushright}
150 116
151 \newpage 117 \newpage
152 I found the opportunity to work with Tom appealing for several reasons, not only could I broaden my own horizons, but also 118 I found the opportunity to work with Tom appealing for several reasons, not only could I broaden my own horizons, but also
935 akin to how the \textit{realloc} function from the standard C library works. Since the newly allocated digits are 901 akin to how the \textit{realloc} function from the standard C library works. Since the newly allocated digits are
936 assumed to contain undefined values they are initially set to zero. 902 assumed to contain undefined values they are initially set to zero.
937 903
938 EXAM,bn_mp_grow.c 904 EXAM,bn_mp_grow.c
939 905
940 A quick optimization is to first determine if a memory re-allocation is required at all. The if statement (line @23,if@) checks 906 A quick optimization is to first determine if a memory re-allocation is required at all. The if statement (line @24,alloc@) checks
941 if the \textbf{alloc} member of the mp\_int is smaller than the requested digit count. If the count is not larger than \textbf{alloc} 907 if the \textbf{alloc} member of the mp\_int is smaller than the requested digit count. If the count is not larger than \textbf{alloc}
942 the function skips the re-allocation part thus saving time. 908 the function skips the re-allocation part thus saving time.
943 909
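As a rough illustrative sketch (not the library's actual bn_mp_grow.c; MP_PREC = 32 and the exact rounding formula are assumed defaults), the growth decision just described might look like:

```c
#include <assert.h>

#define MP_PREC 32  /* assumed default digit allocation block size */

/* Sketch: return the digit count a grow routine would request.  If the
 * current alloc already covers size, no reallocation is needed at all;
 * otherwise the request is padded up past the next multiple of MP_PREC
 * so that many small grows do not each trigger a realloc. */
int grow_target(int alloc, int size)
{
   if (size <= alloc) {
      return alloc;              /* skip the re-allocation entirely */
   }
   return size + (MP_PREC * 2) - (size % MP_PREC);
}
```

Calling this with a request that already fits simply returns the existing allocation, which is the cheap early-out the if statement provides.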
944 When a re-allocation is performed it is turned into an optimal request to save time in the future. The requested digit count is 910 When a re-allocation is performed it is turned into an optimal request to save time in the future. The requested digit count is
945 padded upwards to the 2nd multiple of \textbf{MP\_PREC} larger than \textbf{alloc} (line @25, size@). The XREALLOC function is used 911 padded upwards to the 2nd multiple of \textbf{MP\_PREC} larger than \textbf{alloc} (line @25, size@). The XREALLOC function is used
1308 \section{Sign Manipulation} 1274 \section{Sign Manipulation}
1309 \subsection{Absolute Value} 1275 \subsection{Absolute Value}
1310 With the mp\_int representation of an integer, calculating the absolute value is trivial. The mp\_abs algorithm will compute 1276 With the mp\_int representation of an integer, calculating the absolute value is trivial. The mp\_abs algorithm will compute
1311 the absolute value of an mp\_int. 1277 the absolute value of an mp\_int.
1312 1278
1313 \newpage\begin{figure}[here] 1279 \begin{figure}[here]
1314 \begin{center} 1280 \begin{center}
1315 \begin{tabular}{l} 1281 \begin{tabular}{l}
1316 \hline Algorithm \textbf{mp\_abs}. \\ 1282 \hline Algorithm \textbf{mp\_abs}. \\
1317 \textbf{Input}. An mp\_int $a$ \\ 1283 \textbf{Input}. An mp\_int $a$ \\
1318 \textbf{Output}. Computes $b = \vert a \vert$ \\ 1284 \textbf{Output}. Computes $b = \vert a \vert$ \\
1332 algorithm where the check in mp\_copy that determines if the source and destination are equal proves useful. This allows, 1298 algorithm where the check in mp\_copy that determines if the source and destination are equal proves useful. This allows,
1333 for instance, the developer to pass the same mp\_int as the source and destination to this function without additional 1299 for instance, the developer to pass the same mp\_int as the source and destination to this function without additional
1334 logic to handle it. 1300 logic to handle it.
1335 1301
1336 EXAM,bn_mp_abs.c 1302 EXAM,bn_mp_abs.c
1303
1304 This fairly trivial algorithm first eliminates non--required duplications (line @27,a != b@) and then sets the
1305 \textbf{sign} flag to \textbf{MP\_ZPOS}.
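A hedged stand-alone sketch of the same idea (a toy fixed-size struct, not the real mp\_int, and toy_abs is a hypothetical name): duplicate only when source and destination differ, then force the sign positive.

```c
#include <assert.h>
#include <string.h>

#define MP_ZPOS 0   /* positive */
#define MP_NEG  1   /* negative */

/* Toy stand-in for mp_int: fixed storage, no error handling. */
typedef struct { int used, sign, dp[8]; } toy_int;

/* Sketch of mp_abs: copy only when the source and destination differ
 * (the check mp_copy performs), then set the sign to MP_ZPOS. */
void toy_abs(const toy_int *a, toy_int *b)
{
   if ((const toy_int *)b != a) {
      memcpy(b, a, sizeof(*b));
   }
   b->sign = MP_ZPOS;
}
```

Because of the aliasing check, passing the same structure as both source and destination works without extra logic, mirroring the text above.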
1337 1306
1338 \subsection{Integer Negation} 1307 \subsection{Integer Negation}
1339 With the mp\_int representation of an integer, calculating the negation is also trivial. The mp\_neg algorithm will compute 1308 With the mp\_int representation of an integer, calculating the negation is also trivial. The mp\_neg algorithm will compute
1340 the negative of an mp\_int input. 1309 the negative of an mp\_int input.
1341 1310
1366 $a$ had no digits then it must be positive by definition. Had step three been omitted then the algorithm would return 1335 $a$ had no digits then it must be positive by definition. Had step three been omitted then the algorithm would return
1367 zero as negative. 1336 zero as negative.
1368 1337
1369 EXAM,bn_mp_neg.c 1338 EXAM,bn_mp_neg.c
1370 1339
1340 Like mp\_abs() this function avoids non--required duplications (line @21,a != b@) and then sets the sign. We
1341 have to make sure that only non--zero values get a \textbf{sign} of \textbf{MP\_NEG}. If the mp\_int is zero
1342 then the \textbf{sign} is hard--coded to \textbf{MP\_ZPOS}.
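The zero special case can be sketched as follows (again a toy struct, not the library's bn_mp_neg.c; toy_neg is a hypothetical name): the sign flips, except that a zero value is always stored with a positive sign.

```c
#include <assert.h>
#include <string.h>

#define MP_ZPOS 0
#define MP_NEG  1

typedef struct { int used, sign, dp[8]; } toy_int;

/* Sketch of mp_neg: copy if needed, flip the sign, but hard-code
 * MP_ZPOS when the value is zero so ``-0'' never carries MP_NEG. */
void toy_neg(const toy_int *a, toy_int *b)
{
   int flipped = (a->sign == MP_ZPOS) ? MP_NEG : MP_ZPOS;
   if ((const toy_int *)b != a) {
      memcpy(b, a, sizeof(*b));
   }
   b->sign = (b->used == 0) ? MP_ZPOS : flipped;
}
```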
1343
1371 \section{Small Constants} 1344 \section{Small Constants}
1372 \subsection{Setting Small Constants} 1345 \subsection{Setting Small Constants}
1373 Often a mp\_int must be set to a relatively small value such as $1$ or $2$. For these cases the mp\_set algorithm is useful. 1346 Often a mp\_int must be set to a relatively small value such as $1$ or $2$. For these cases the mp\_set algorithm is useful.
1374 1347
1375 \begin{figure}[here] 1348 \newpage\begin{figure}[here]
1376 \begin{center} 1349 \begin{center}
1377 \begin{tabular}{l} 1350 \begin{tabular}{l}
1378 \hline Algorithm \textbf{mp\_set}. \\ 1351 \hline Algorithm \textbf{mp\_set}. \\
1379 \textbf{Input}. An mp\_int $a$ and a digit $b$ \\ 1352 \textbf{Input}. An mp\_int $a$ and a digit $b$ \\
1380 \textbf{Output}. Make $a$ equivalent to $b$ \\ 1353 \textbf{Output}. Make $a$ equivalent to $b$ \\
1395 This algorithm sets a mp\_int to a small single digit value. Step number 1 ensures that the integer is reset to the default state. The 1368 This algorithm sets a mp\_int to a small single digit value. Step number 1 ensures that the integer is reset to the default state. The
1396 single digit is set (\textit{modulo $\beta$}) and the \textbf{used} count is adjusted accordingly. 1369 single digit is set (\textit{modulo $\beta$}) and the \textbf{used} count is adjusted accordingly.
1397 1370
1398 EXAM,bn_mp_set.c 1371 EXAM,bn_mp_set.c
1399 1372
1400 Line @21,mp_zero@ calls mp\_zero() to clear the mp\_int and reset the sign. Line @22,MP_MASK@ copies the digit 1373 First we zero (line @21,mp_zero@) the mp\_int to make sure that the other members are initialized for a
1401 into the least significant location. Note the usage of a new constant \textbf{MP\_MASK}. This constant is used to quickly 1374 small positive constant. mp\_zero() ensures that the \textbf{sign} is positive and the \textbf{used} count
1402 reduce an integer modulo $\beta$. Since $\beta$ is of the form $2^k$ for any suitable $k$ it suffices to perform a binary AND with 1375 is zero. Next we set the digit and reduce it modulo $\beta$ (line @22,MP_MASK@). After this step we have to
1403 $MP\_MASK = 2^k - 1$ to perform the reduction. Finally line @23,a->used@ will set the \textbf{used} member with respect to the 1376 check if the resulting digit is zero or not. If it is not then we set the \textbf{used} count to one, otherwise
1404 digit actually set. This function will always make the integer positive. 1377 to zero.
1378
1379 We can quickly reduce modulo $\beta$ since it is of the form $2^k$ and a quick binary AND operation with
1380 $2^k - 1$ will perform the same operation.
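The three steps (zero, reduce modulo $\beta$ with a mask, fix the \textbf{used} count) can be sketched like this; the struct, the toy_set name, and $\beta = 2^{28}$ are illustrative assumptions rather than the real bn_mp_set.c:

```c
#include <assert.h>
#include <string.h>

typedef unsigned long mp_digit;
#define DIGIT_BIT 28
#define MP_MASK ((((mp_digit)1) << DIGIT_BIT) - 1)  /* beta - 1 */

typedef struct { int used, sign; mp_digit dp[8]; } toy_int;

/* Sketch of mp_set: zero the whole structure (sign positive, used
 * zero), store one digit reduced modulo beta = 2^28 with a binary
 * AND, then set used to one only if the digit is non-zero. */
void toy_set(toy_int *a, mp_digit b)
{
   memset(a, 0, sizeof(*a));
   a->dp[0] = b & MP_MASK;
   a->used  = (a->dp[0] != 0) ? 1 : 0;
}
```

Note how a value that is an exact multiple of $\beta$ masks down to zero, which is why the \textbf{used} check after the reduction matters.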
1405 1381
1406 One important limitation of this function is that it will only set one digit. The size of a digit is not fixed, meaning that source which uses 1382 One important limitation of this function is that it will only set one digit. The size of a digit is not fixed, meaning that source which uses
1407 this function should take that into account. Only trivially small constants can be set using this function. 1383 this function should take that into account. Only trivially small constants can be set using this function.
1408 1384
1409 \subsection{Setting Large Constants} 1385 \subsection{Setting Large Constants}
1501 By step three both inputs must have the same number of digits so it's safe to start from either $a.used - 1$ or $b.used - 1$ and count down to 1477 By step three both inputs must have the same number of digits so it's safe to start from either $a.used - 1$ or $b.used - 1$ and count down to
1502 the zero'th digit. If after all of the digits have been compared, no difference is found, the algorithm returns \textbf{MP\_EQ}. 1478 the zero'th digit. If after all of the digits have been compared, no difference is found, the algorithm returns \textbf{MP\_EQ}.
1503 1479
1504 EXAM,bn_mp_cmp_mag.c 1480 EXAM,bn_mp_cmp_mag.c
1505 1481
1506 The two if statements on lines @24,if@ and @28,if@ compare the number of digits in the two inputs. These two are performed before all of the digits 1482 The two if statements (lines @24,if@ and @28,if@) compare the number of digits in the two inputs. These two are
1507 are compared since it is a very cheap test to perform and can potentially save considerable time. The implementation given is also not valid 1483 performed before all of the digits are compared since it is a very cheap test to perform and can potentially save
1508 without those two statements. $b.alloc$ may be smaller than $a.used$, meaning that undefined values will be read from $b$ past the end of the 1484 considerable time. The implementation given is also not valid without those two statements. $b.alloc$ may be
1509 array of digits. 1485 smaller than $a.used$, meaning that undefined values will be read from $b$ past the end of the array of digits.
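Over raw little-endian digit arrays the comparison can be sketched as below (cmp_mag is a hypothetical helper, not the library's bn_mp_cmp_mag.c): digit counts are compared first because the test is cheap, and because it keeps the loop from ever reading past the shorter array.

```c
#include <assert.h>

#define MP_LT (-1)
#define MP_EQ 0
#define MP_GT 1

/* Sketch of mp_cmp_mag: compare digit counts first, then compare
 * digits from the most significant position downward. */
int cmp_mag(const int *a, int alen, const int *b, int blen)
{
   int i;
   if (alen > blen) return MP_GT;
   if (alen < blen) return MP_LT;
   for (i = alen - 1; i >= 0; i--) {
      if (a[i] > b[i]) return MP_GT;
      if (a[i] < b[i]) return MP_LT;
   }
   return MP_EQ;
}
```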
1486
1487
1510 1488
1511 \subsection{Signed Comparisons} 1489 \subsection{Signed Comparisons}
1512 Comparing with sign considerations is also fairly critical in several routines (\textit{division for example}). Based on an unsigned magnitude 1490 Comparing with sign considerations is also fairly critical in several routines (\textit{division for example}). Based on an unsigned magnitude
1513 comparison a trivial signed comparison algorithm can be written. 1491 comparison a trivial signed comparison algorithm can be written.
1514 1492
1537 three the unsigned comparison flips the order of the arguments since they are both negative. For instance, if $-a > -b$ then 1515 three the unsigned comparison flips the order of the arguments since they are both negative. For instance, if $-a > -b$ then
1538 $\vert a \vert < \vert b \vert$. Step number four will compare the two when they are both positive. 1516 $\vert a \vert < \vert b \vert$. Step number four will compare the two when they are both positive.
1539 1517
1540 EXAM,bn_mp_cmp.c 1518 EXAM,bn_mp_cmp.c
1541 1519
1542 The two if statements on lines @22,if@ and @26,if@ perform the initial sign comparison. If the signs are not equal then whichever 1520 The two if statements (lines @22,if@ and @26,if@) perform the initial sign comparison. If the signs are not equal then whichever
1543 has the positive sign is larger. At line @30,if@, the inputs are compared based on magnitudes. If the signs were both negative then 1521 has the positive sign is larger. The inputs are compared (line @30,if@) based on magnitudes. If the signs were both
1544 the unsigned comparison is performed in the opposite direction (\textit{line @31,mp_cmp_mag@}). Otherwise, the signs are assumed to 1522 negative then the unsigned comparison is performed in the opposite direction (line @31,mp_cmp_mag@). Otherwise, the signs are assumed to
1545 be both positive and a forward direction unsigned comparison is performed. 1523 be both positive and a forward direction unsigned comparison is performed.
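The same control flow can be sketched over raw arrays (cmp_signed and cmp_mag are hypothetical names, not the real bn_mp_cmp.c): differing signs decide immediately, and matching negative signs reverse the direction of the magnitude comparison.

```c
#include <assert.h>

#define MP_LT (-1)
#define MP_EQ 0
#define MP_GT 1
#define MP_ZPOS 0
#define MP_NEG  1

/* Magnitude comparison over little-endian digit arrays. */
int cmp_mag(const int *a, int alen, const int *b, int blen)
{
   int i;
   if (alen > blen) return MP_GT;
   if (alen < blen) return MP_LT;
   for (i = alen - 1; i >= 0; i--) {
      if (a[i] != b[i]) return (a[i] > b[i]) ? MP_GT : MP_LT;
   }
   return MP_EQ;
}

/* Sketch of mp_cmp: unequal signs decide at once; equal negative
 * signs swap the arguments of the unsigned comparison. */
int cmp_signed(int asign, const int *a, int alen,
               int bsign, const int *b, int blen)
{
   if (asign != bsign) {
      return (asign == MP_NEG) ? MP_LT : MP_GT;
   }
   if (asign == MP_NEG) {
      return cmp_mag(b, blen, a, alen);   /* reversed direction */
   }
   return cmp_mag(a, alen, b, blen);
}
```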
1546 1524
1547 \section*{Exercises} 1525 \section*{Exercises}
1548 \begin{tabular}{cl} 1526 \begin{tabular}{cl}
1549 $\left [ 2 \right ]$ & Modify algorithm mp\_set\_int to accept as input a variable length array of bits. \\ 1527 $\left [ 2 \right ]$ & Modify algorithm mp\_set\_int to accept as input a variable length array of bits. \\
1662 The final carry is stored in $c_{max}$ and digits above $max$ up to $oldused$ are zeroed which completes the addition. 1640 The final carry is stored in $c_{max}$ and digits above $max$ up to $oldused$ are zeroed which completes the addition.
1663 1641
1664 1642
1665 EXAM,bn_s_mp_add.c 1643 EXAM,bn_s_mp_add.c
1666 1644
1667 Lines @27,if@ to @35,}@ perform the initial sorting of the inputs and determine the $min$ and $max$ variables. Note that $x$ is a pointer to a 1645 We first sort (lines @27,if@ to @35,}@) the inputs based on magnitude and determine the $min$ and $max$ variables.
1668 mp\_int assigned to the largest input, in effect it is a local alias. Lines @37,init@ to @42,}@ ensure that the destination is grown to 1646 Note that $x$ is a pointer to an mp\_int assigned to the largest input, in effect it is a local alias. Next we
1669 accommodate the result of the addition. 1647 grow the destination (lines @37,init@ to @42,}@) to ensure that it can accommodate the result of the addition.
1670 1648
1671 Similar to the implementation of mp\_copy this function uses the braced code and local aliases coding style. The three aliases that are on 1649 Similar to the implementation of mp\_copy this function uses the braced code and local aliases coding style. The three aliases that are on
1672 lines @56,tmpa@, @59,tmpb@ and @62,tmpc@ represent the two inputs and destination variables respectively. These aliases are used to ensure the 1650 lines @56,tmpa@, @59,tmpb@ and @62,tmpc@ represent the two inputs and destination variables respectively. These aliases are used to ensure the
1673 compiler does not have to dereference $a$, $b$ or $c$ (respectively) to access the digits of the respective mp\_int. 1651 compiler does not have to dereference $a$, $b$ or $c$ (respectively) to access the digits of the respective mp\_int.
1674 1652
1675 The initial carry $u$ is cleared on line @65,u = 0@, note that $u$ is of type mp\_digit which ensures type compatibility within the 1653 The initial carry $u$ will be cleared (line @65,u = 0@), note that $u$ is of type mp\_digit which ensures type
1676 implementation. The initial addition loop begins on line @66,for@ and ends on line @75,}@. Similarly the conditional addition loop 1654 compatibility within the implementation. The initial addition (line @66,for@ to @75,}@) adds digits from
1677 begins on line @81,for@ and ends on line @90,}@. The addition is finished with the final carry being stored in $tmpc$ on line @94,tmpc++@. 1655 both inputs until the smallest input runs out of digits. Similarly the conditional addition loop
1678 Note the ``++'' operator on the same line. After line @94,tmpc++@ $tmpc$ will point to the $c.used$'th digit of the mp\_int $c$. This is useful 1656 (line @81,for@ to @90,}@) adds the remaining digits from the larger of the two inputs. The addition is finished
1679 for the next loop on lines @97,for@ to @99,}@ which sets any old upper digits to zero. 1657 with the final carry being stored in $tmpc$ (line @94,tmpc++@). Note the ``++'' operator within the same expression.
1658 After line @94,tmpc++@, $tmpc$ will point to the $c.used$'th digit of the mp\_int $c$. This is useful
1659 for the next loop (line @97,for@ to @99,}@) which sets any old upper digits to zero.
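The two addition loops and the final carry can be sketched over raw arrays as follows; s_add is a hypothetical helper (not bn_s_mp_add.c itself) which assumes the first input already has at least as many digits as the second, i.e. the ``sorting'' step has been done.

```c
#include <assert.h>

typedef unsigned long mp_digit;
#define DIGIT_BIT 28
#define MP_MASK ((((mp_digit)1) << DIGIT_BIT) - 1)

/* Sketch of the s_mp_add loops, with alen >= blen: add paired digits
 * with carry, propagate the carry through the rest of a, then store
 * the final carry.  c must have room for alen + 1 digits.  Returns
 * the digit count of the result. */
int s_add(const mp_digit *a, int alen,
          const mp_digit *b, int blen, mp_digit *c)
{
   mp_digit u = 0;   /* carry; an mp_digit for type compatibility */
   int i;
   for (i = 0; i < blen; i++) {    /* both inputs still have digits */
      c[i] = a[i] + b[i] + u;
      u = c[i] >> DIGIT_BIT;       /* carry is the excess high bits */
      c[i] &= MP_MASK;
   }
   for (; i < alen; i++) {         /* only the larger input remains */
      c[i] = a[i] + u;
      u = c[i] >> DIGIT_BIT;
      c[i] &= MP_MASK;
   }
   c[i] = u;                       /* final carry */
   return (u != 0) ? i + 1 : i;
}
```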
1680 1660
1681 \subsection{Low Level Subtraction} 1661 \subsection{Low Level Subtraction}
1682 The low level unsigned subtraction algorithm is very similar to the low level unsigned addition algorithm. The principle difference is that the 1662 The low level unsigned subtraction algorithm is very similar to the low level unsigned addition algorithm. The principle difference is that the
1683 unsigned subtraction algorithm requires the result to be positive. That is when computing $a - b$ the condition $\vert a \vert \ge \vert b\vert$ must 1663 unsigned subtraction algorithm requires the result to be positive. That is when computing $a - b$ the condition $\vert a \vert \ge \vert b\vert$ must
1684 be met for this algorithm to function properly. Keep in mind this low level algorithm is not meant to be used in higher level algorithms directly. 1664 be met for this algorithm to function properly. Keep in mind this low level algorithm is not meant to be used in higher level algorithms directly.
1690 the range $0 \le x < 2\beta$ for the algorithms to work correctly. However, it is allowable that a mp\_digit represent a larger range of values. For 1670 the range $0 \le x < 2\beta$ for the algorithms to work correctly. However, it is allowable that a mp\_digit represent a larger range of values. For
1691 this algorithm we will assume that the variable $\gamma$ represents the number of bits available in a 1671 this algorithm we will assume that the variable $\gamma$ represents the number of bits available in a
1692 mp\_digit (\textit{this implies $2^{\gamma} > \beta$}). 1672 mp\_digit (\textit{this implies $2^{\gamma} > \beta$}).
1693 1673
1694 For example, the default for LibTomMath is to use an ``unsigned long'' for the mp\_digit ``type'' while $\beta = 2^{28}$. In ISO C an ``unsigned long'' 1674 For example, the default for LibTomMath is to use an ``unsigned long'' for the mp\_digit ``type'' while $\beta = 2^{28}$. In ISO C an ``unsigned long''
1695 data type must be able to represent $0 \le x < 2^{32}$ meaning that in this case $\gamma = 32$. 1675 data type must be able to represent $0 \le x < 2^{32}$ meaning that in this case $\gamma \ge 32$.
1696 1676
1697 \newpage\begin{figure}[!here] 1677 \newpage\begin{figure}[!here]
1698 \begin{center} 1678 \begin{center}
1699 \begin{small} 1679 \begin{small}
1700 \begin{tabular}{l} 1680 \begin{tabular}{l}
1757 If $b$ has a smaller magnitude than $a$ then step 9 will force the carry and copy operation to propagate through the larger input $a$ into $c$. Step 1737 If $b$ has a smaller magnitude than $a$ then step 9 will force the carry and copy operation to propagate through the larger input $a$ into $c$. Step
1758 10 will ensure that any leading digits of $c$ above the $max$'th position are zeroed. 1738 10 will ensure that any leading digits of $c$ above the $max$'th position are zeroed.
1759 1739
1760 EXAM,bn_s_mp_sub.c 1740 EXAM,bn_s_mp_sub.c
1761 1741
1762 Lines @24,min@ and @25,max@ perform the initial hardcoded sorting of the inputs. In reality the $min$ and $max$ variables are only aliases and are only 1742 Like low level addition we ``sort'' the inputs, except in this case the sorting is hardcoded
1763 used to make the source code easier to read. Again the pointer alias optimization is used within this algorithm. Lines @42,tmpa@, @43,tmpb@ and @44,tmpc@ initialize the aliases for 1743 (lines @24,min@ and @25,max@). In reality the $min$ and $max$ variables are only aliases and are only
1764 $a$, $b$ and $c$ respectively. 1744 used to make the source code easier to read. Again the pointer alias optimization is used
1765 1745 within this algorithm. The aliases $tmpa$, $tmpb$ and $tmpc$ are initialized
1766 The first subtraction loop occurs on lines @47,u = 0@ through @61,}@. The theory behind the subtraction loop is exactly the same as that for 1746 (lines @42,tmpa@, @43,tmpb@ and @44,tmpc@) for $a$, $b$ and $c$ respectively.
1767 the addition loop. As remarked earlier there is an implementation reason for using the ``awkward'' method of extracting the carry 1747
1768 (\textit{see line @57, >>@}). The traditional method for extracting the carry would be to shift by $\lg(\beta)$ positions and logically AND 1748 The first subtraction loop (lines @47,u = 0@ through @61,}@) subtracts digits from both inputs until the smaller of
1769 the least significant bit. The AND operation is required because all of the bits above the $\lg(\beta)$'th bit will be set to one after a carry 1749 the two inputs has been exhausted. As remarked earlier there is an implementation reason for using the ``awkward''
1770 occurs from subtraction. This carry extraction requires two relatively cheap operations to extract the carry. The other method is to simply 1750 by $\lg(\beta)$ positions and logically AND the least significant bit. The AND operation is required because all of
1771 shift the most significant bit to the least significant bit thus extracting the carry with a single cheap operation. This optimization only works on 1751 by $lg(\beta)$ positions and logically AND the least significant bit. The AND operation is required because all of
1772 two's complement machines which is a safe assumption to make. 1752 the bits above the $\lg(\beta)$'th bit will be set to one after a carry occurs from subtraction. This carry
1773 1753 extraction requires two relatively cheap operations to extract the carry. The other method is to simply shift the
1774 If $a$ has a larger magnitude than $b$ an additional loop (\textit{see lines @64,for@ through @73,}@}) is required to propagate the carry through 1754 most significant bit to the least significant bit thus extracting the carry with a single cheap operation. This
1775 $a$ and copy the result to $c$. 1755 optimization only works on two's complement machines which is a safe assumption to make.
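The single-shift borrow trick can be sketched for one digit as below; sub_digit is a hypothetical helper (not the library's loop body) and a 32-bit ``unsigned int'' mp\_digit with $\gamma = 32$ is assumed.

```c
#include <assert.h>

typedef unsigned int mp_digit;  /* assumed 32-bit type, gamma = 32 */
#define DIGIT_BIT 28
#define MP_MASK ((((mp_digit)1) << DIGIT_BIT) - 1)

/* Sketch of borrow extraction: after an underflow the unsigned
 * difference wraps around, setting every bit above the digit, so on
 * a two's complement machine shifting the type's top bit down to
 * position zero yields the borrow in a single operation. */
mp_digit sub_digit(mp_digit a, mp_digit b, mp_digit *u)
{
   mp_digit t = a - b - *u;
   *u = t >> (8 * sizeof(mp_digit) - 1);  /* gamma - 1 = 31 */
   return t & MP_MASK;
}
```

The traditional alternative, shifting by $\lg(\beta)$ and masking the least significant bit, costs one extra operation per digit, which is the point the text makes.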
1756
1757 If $a$ has a larger magnitude than $b$ an additional loop (lines @64,for@ through @73,}@) is required to propagate
1758 the carry through $a$ and copy the result to $c$.
1776 1759
1777 \subsection{High Level Addition} 1760 \subsection{High Level Addition}
1778 Now that both lower level addition and subtraction algorithms have been established an effective high level signed addition algorithm can be 1761 Now that both lower level addition and subtraction algorithms have been established an effective high level signed addition algorithm can be
1779 established. This high level addition algorithm will be what other algorithms and developers will use to perform addition of mp\_int data 1762 established. This high level addition algorithm will be what other algorithms and developers will use to perform addition of mp\_int data
1780 types. 1763 types.
2096 \newpage 2079 \newpage
2097 FIGU,sliding_window,Sliding Window Movement 2080 FIGU,sliding_window,Sliding Window Movement
2098 2081
2099 EXAM,bn_mp_lshd.c 2082 EXAM,bn_mp_lshd.c
2100 2083
2101 The if statement on line @24,if@ ensures that the $b$ variable is greater than zero. The \textbf{used} count is incremented by $b$ before 2084 The if statement (line @24,if@) ensures that the $b$ variable is greater than zero since we do not interpret negative
2102 the copy loop begins. This eliminates the need for an additional variable in the for loop. The variable $top$ on line @42,top@ is an alias 2085 shift counts properly. The \textbf{used} count is incremented by $b$ before the copy loop begins. This eliminates
2103 for the leading digit while $bottom$ on line @45,bottom@ is an alias for the trailing edge. The aliases form a window of exactly $b$ digits 2086 the need for an additional variable in the for loop. The variable $top$ (line @42,top@) is an alias
2104 over the input. 2087 for the leading digit while $bottom$ (line @45,bottom@) is an alias for the trailing edge. The aliases form a
2088 window of exactly $b$ digits over the input.
2105 2089
2106 \subsection{Division by $x$} 2090 \subsection{Division by $x$}
2107 2091
2108 Division by powers of $x$ is easily achieved by shifting the digits right and removing any that will end up to the right of the zero'th digit. 2092 Division by powers of $x$ is easily achieved by shifting the digits right and removing any that will end up to the right of the zero'th digit.
2109 2093
2149 2133
2150 Once the window copy is complete the upper digits must be zeroed and the \textbf{used} count decremented. 2134 Once the window copy is complete the upper digits must be zeroed and the \textbf{used} count decremented.
2151 2135
2152 EXAM,bn_mp_rshd.c 2136 EXAM,bn_mp_rshd.c
2153 2137
2154 The only noteworthy element of this routine is the lack of a return type. 2138 The only noteworthy element of this routine is the lack of a return type since it cannot fail. Like mp\_lshd() we
2155 2139 form a sliding window except we copy in the other direction. After the window (line @59,for (;@) we then zero
2156 -- Will update later to give it a return type...Tom 2140 the upper digits of the input to make sure the result is correct.
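The opposite-direction window copy can be sketched the same way; again the flat array is a stand-in for the mp\_int digit buffer:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Digit right shift (divide by beta^b): the same sliding window as the
 * left shift, copied in the other direction, followed by zeroing the
 * upper digits.  Nothing here can fail, hence no return value. */
static void rshd(uint32_t *digits, size_t used, size_t b)
{
    size_t i;
    if (b == 0)
        return;
    if (b > used)
        b = used;                      /* everything is shifted out */
    for (i = 0; i < used - b; i++)
        digits[i] = digits[i + b];     /* copy low-to-high */
    for (; i < used; i++)
        digits[i] = 0;                 /* zero the upper digits */
}
```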
2157 2141
2158 \section{Powers of Two} 2142 \section{Powers of Two}
2159 2143
2160 Now that algorithms for moving single bits as well as whole digits exist, algorithms for moving the ``in between'' distances are required. For 2144 Now that algorithms for moving single bits as well as whole digits exist, algorithms for moving the ``in between'' distances are required. For
2161 example, the ability to quickly multiply by $2^k$ for any $k$ without using a full multiplier algorithm would prove useful. Instead of performing single 2145 example, the ability to quickly multiply by $2^k$ for any $k$ without using a full multiplier algorithm would prove useful. Instead of performing single
2212 This algorithm is loosely measured as a $O(2n)$ algorithm which means that if the input is $n$-digits it takes $2n$ ``time'' to 2196 This algorithm is loosely measured as a $O(2n)$ algorithm which means that if the input is $n$-digits it takes $2n$ ``time'' to
2213 complete. It is possible to optimize this algorithm down to a $O(n)$ algorithm at a cost of making the algorithm slightly harder to follow. 2197 complete. It is possible to optimize this algorithm down to a $O(n)$ algorithm at a cost of making the algorithm slightly harder to follow.
2214 2198
2215 EXAM,bn_mp_mul_2d.c 2199 EXAM,bn_mp_mul_2d.c
2216 2200
2217 Notes to be revised when code is updated. -- Tom 2201 The shifting is performed in--place which means the first step (line @24,a != c@) is to copy the input to the
2202 destination. We avoid calling mp\_copy() by making sure the mp\_ints are different. The destination then
2203 has to be grown (line @31,grow@) to accommodate the result.
2204
2205 If the shift count $b$ is larger than $lg(\beta)$ then a call to mp\_lshd() is used to handle all of the multiples
2206 of $lg(\beta)$, leaving only a remaining shift of $lg(\beta) - 1$ or fewer bits. Inside the actual shift
2207 loop (lines @45,if@ to @76,}@) we make use of pre--computed values $shift$ and $mask$. These are used to
2208 extract the carry bit(s) to pass into the next iteration of the loop. The $r$ and $rr$ variables form a
2209 chain between consecutive iterations to propagate the carry.
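The $shift$/$mask$ pre--computation and the $r$/$rr$ carry chain can be sketched for a digit array; the 28--bit digit size is a hypothetical stand-in for mp\_digit:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Shift "used" 28-bit digits left by c bits (0 < c < 28), returning the
 * final carry.  mask selects the c bits that spill into the next digit;
 * r and rr chain the carry between consecutive iterations, as in
 * mp_mul_2d. */
#define DIGIT_BITS 28u
#define DIGIT_MASK 0x0FFFFFFFu

static uint32_t shl_bits(uint32_t *d, size_t used, unsigned c)
{
    unsigned shift = DIGIT_BITS - c;           /* pre-computed shift */
    uint32_t mask  = (1u << c) - 1u;           /* pre-computed mask  */
    uint32_t r = 0;                            /* carry in */
    for (size_t i = 0; i < used; i++) {
        uint32_t rr = (d[i] >> shift) & mask;  /* bits that overflow */
        d[i] = ((d[i] << c) | r) & DIGIT_MASK; /* shift and merge carry */
        r = rr;                                /* carry out -> next carry in */
    }
    return r;
}
```

A whole-digit multiple of $lg(\beta)$ in the shift count would first be handled by a digit shift, exactly as the text describes for mp\_lshd().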
2218 2210
2219 \subsection{Division by Power of Two} 2211 \subsection{Division by Power of Two}
2220 2212
2221 \newpage\begin{figure}[!here] 2213 \newpage\begin{figure}[!here]
2222 \begin{small} 2214 \begin{small}
2261 The implementation of algorithm mp\_div\_2d is slightly different than the algorithm specifies. The remainder $d$ may be optionally 2253 The implementation of algorithm mp\_div\_2d is slightly different than the algorithm specifies. The remainder $d$ may be optionally
2262 ignored by passing \textbf{NULL} as the pointer to the mp\_int variable. The temporary mp\_int variable $t$ is used to hold the 2254 ignored by passing \textbf{NULL} as the pointer to the mp\_int variable. The temporary mp\_int variable $t$ is used to hold the
2263 result of the remainder operation until the end. This allows $d$ and $a$ to represent the same mp\_int without modifying $a$ before 2255 result of the remainder operation until the end. This allows $d$ and $a$ to represent the same mp\_int without modifying $a$ before
2264 the quotient is obtained. 2256 the quotient is obtained.
2265 2257
2266 The remainder of the source code is essentially the same as the source code for mp\_mul\_2d. (-- Fix this paragraph up later, Tom). 2258 The remainder of the source code is essentially the same as the source code for mp\_mul\_2d. The only significant difference is
2259 the direction of the shifts.
2267 2260
2268 \subsection{Remainder of Division by Power of Two} 2261 \subsection{Remainder of Division by Power of Two}
2269 2262
2270 The last algorithm in the series of polynomial basis power of two algorithms is calculating the remainder of division by $2^b$. This 2263 The last algorithm in the series of polynomial basis power of two algorithms is calculating the remainder of division by $2^b$. This
2271 algorithm benefits from the fact that in twos complement arithmetic $a \mbox{ (mod }2^b\mbox{)}$ is the same as $a$ AND $2^b - 1$. 2264 algorithm benefits from the fact that in twos complement arithmetic $a \mbox{ (mod }2^b\mbox{)}$ is the same as $a$ AND $2^b - 1$.
2304 result is set to zero. If $b$ is greater than the number of bits in $a$ then it simply copies $a$ to $c$ and returns. Otherwise, $a$ 2297 result is set to zero. If $b$ is greater than the number of bits in $a$ then it simply copies $a$ to $c$ and returns. Otherwise, $a$
2305 is copied to $b$, leading digits are removed and the remaining leading digit is trimmed to the exact bit count. 2298 is copied to $b$, leading digits are removed and the remaining leading digit is trimmed to the exact bit count.
2306 2299
2307 EXAM,bn_mp_mod_2d.c 2300 EXAM,bn_mp_mod_2d.c
2308 2301
2309 -- Add comments later, Tom. 2302 We first avoid cases of $b \le 0$ by simply mp\_zero()'ing the destination. Next if $2^b$ is larger
2303 than the input we just mp\_copy() the input and return right away. After this point we know we must actually
2304 perform some work to produce the remainder.
2305
2306 Recalling that reducing modulo $2^k$ and a binary ``and'' with $2^k - 1$ are numerically equivalent we can quickly reduce
2307 the number. First we zero any digits above the last digit in $2^b$ (line @41,for@). Next we reduce the
2308 leading digit of both (line @45,&=@) and then mp\_clamp().
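The two reduction steps just named can be sketched on a digit array, again with hypothetical 28--bit digits:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Remainder modulo 2^b: since a (mod 2^b) equals a AND (2^b - 1) in
 * two's complement, zero every digit above the one holding bit b, then
 * mask the leading kept digit.  (If b % DIGIT_BITS == 0 the mask is
 * zero and the whole digit is cleared, which is still correct.) */
#define DIGIT_BITS 28u

static void mod_2d(uint32_t *d, size_t used, unsigned b)
{
    size_t last = b / DIGIT_BITS;                  /* digit holding bit b */
    for (size_t i = last + 1; i < used; i++)
        d[i] = 0;                                  /* zero upper digits */
    if (last < used)
        d[last] &= (1u << (b % DIGIT_BITS)) - 1u;  /* trim leading digit */
}
```

When $2^b$ exceeds the input ($last \ge used$), both steps fall through and the number is left unchanged, matching the early-copy case in the text.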
2310 2309
2311 \section*{Exercises} 2310 \section*{Exercises}
2312 \begin{tabular}{cl} 2311 \begin{tabular}{cl}
2313 $\left [ 3 \right ] $ & Devise an algorithm that performs $a \cdot 2^b$ for generic values of $b$ \\ 2312 $\left [ 3 \right ] $ & Devise an algorithm that performs $a \cdot 2^b$ for generic values of $b$ \\
2314 & in $O(n)$ time. \\ 2313 & in $O(n)$ time. \\
2462 digit since that digit is assumed to be zero at this point. However, if $ix + pb \ge digs$ the carry is not set as it would make the result 2461 digit since that digit is assumed to be zero at this point. However, if $ix + pb \ge digs$ the carry is not set as it would make the result
2463 exceed the precision requested. 2462 exceed the precision requested.
2464 2463
2465 EXAM,bn_s_mp_mul_digs.c 2464 EXAM,bn_s_mp_mul_digs.c
2466 2465
2467 Lines @31,if@ to @35,}@ determine if the Comba method can be used first. The conditions for using the Comba routine are that min$(a.used, b.used) < \delta$ and 2466 First we determine (line @30,if@) if the Comba method can be used since it is faster. The conditions for
2468 the number of digits of output is less than \textbf{MP\_WARRAY}. This new constant is used to control 2467 using the Comba routine are that min$(a.used, b.used) < \delta$ and the number of digits of output is less than
2469 the stack usage in the Comba routines. By default it is set to $\delta$ but can be reduced when memory is at a premium. 2468 \textbf{MP\_WARRAY}. This new constant is used to control the stack usage in the Comba routines. By default it is
2470 2469 set to $\delta$ but can be reduced when memory is at a premium.
2471 Of particular importance is the calculation of the $ix+iy$'th column on lines @64,mp_word@, @65,mp_word@ and @66,mp_word@. Note how all of the 2470
2472 variables are cast to the type \textbf{mp\_word}, which is also the type of variable $\hat r$. That is to ensure that double precision operations 2471 If we cannot use the Comba method we proceed to set up the baseline routine. We allocate the destination mp\_int
2473 are used instead of single precision. The multiplication on line @65,) * (@ makes use of a specific GCC optimizer behaviour. On the outset it looks like 2472 $t$ (line @36,init@) to the exact size of the output to avoid further re--allocations. At this point we now
2474 the compiler will have to use a double precision multiplication to produce the result required. Such an operation would be horribly slow on most 2473 begin the $O(n^2)$ loop.
2475 processors and drag this to a crawl. However, GCC is smart enough to realize that double wide output single precision multipliers can be used. For 2474
2476 example, the instruction ``MUL'' on the x86 processor can multiply two 32-bit values and produce a 64-bit result. 2475 This implementation of multiplication has the caveat that it can be trimmed to only produce a variable number of
2476 digits as output. In each iteration of the outer loop the $pb$ variable is set (line @48,MIN@) to the maximum
2477 number of inner loop iterations.
2478
2479 Inside the inner loop we calculate $\hat r$ as the mp\_word product of the two mp\_digits and the addition of the
2480 carry from the previous iteration. A particularly important observation is that most modern optimizing
2481 C compilers (GCC for instance) can recognize that a $N \times N \rightarrow 2N$ multiplication is all that
2482 is required for the product. In x86 terms for example, this means using the MUL instruction.
2483
2484 Each digit of the product is stored in turn (line @68,tmpt@) and the carry propagated (line @71,>>@) to the
2485 next iteration.
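The inner-loop computation of $\hat r$ can be sketched as follows. This is an illustration with hypothetical 28--bit digits, not the library routine itself:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Baseline O(n^2) multiply trimmed to "digs" output digits.  r_hat
 * plays the role of the mp_word accumulator: one 28x28 -> 56 bit
 * product plus the previous column value and the running carry u. */
#define DIGIT_BITS 28u
#define DIGIT_MASK 0x0FFFFFFFu

static void s_mul_digs(const uint32_t *a, size_t au,
                       const uint32_t *b, size_t bu,
                       uint32_t *c, size_t digs)
{
    for (size_t i = 0; i < digs; i++) c[i] = 0;
    for (size_t ix = 0; ix < au && ix < digs; ix++) {
        uint64_t u = 0;                                 /* carry */
        size_t pb = (bu < digs - ix) ? bu : digs - ix;  /* MIN: trims output */
        for (size_t iy = 0; iy < pb; iy++) {
            uint64_t r_hat = (uint64_t)c[ix + iy]       /* old column value */
                           + (uint64_t)a[ix] * b[iy]    /* N x N -> 2N mul  */
                           + u;                         /* carry in         */
            c[ix + iy] = (uint32_t)(r_hat & DIGIT_MASK);
            u = r_hat >> DIGIT_BITS;                    /* carry out        */
        }
        if (ix + pb < digs)
            c[ix + pb] = (uint32_t)u;                   /* store final carry */
    }
}
```

The cast of both operands to a 64--bit type before the multiply mirrors the mp\_word casts the text describes; the compiler can still emit a single widening multiply.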
2477 2486
2478 \subsection{Faster Multiplication by the ``Comba'' Method} 2487 \subsection{Faster Multiplication by the ``Comba'' Method}
2479 MARK,COMBA 2488 MARK,COMBA
2480 2489
2481 One of the huge drawbacks of the ``baseline'' algorithms is that at the $O(n^2)$ level the carry must be computed and propagated upwards. This 2490 One of the huge drawbacks of the ``baseline'' algorithms is that at the $O(n^2)$ level the carry must be
2482 makes the nested loop very sequential and hard to unroll and implement in parallel. The ``Comba'' \cite{COMBA} method is named after little known 2491 computed and propagated upwards. This makes the nested loop very sequential and hard to unroll and implement
2483 (\textit{in cryptographic venues}) Paul G. Comba who described a method of implementing fast multipliers that do not require nested 2492 in parallel. The ``Comba'' \cite{COMBA} method is named after little known (\textit{in cryptographic venues}) Paul G.
2484 carry fixup operations. As an interesting aside it seems that Paul Barrett describes a similar technique in 2493 Comba who described a method of implementing fast multipliers that do not require nested carry fixup operations. As an
2485 his 1986 paper \cite{BARRETT} written five years before. 2494 interesting aside it seems that Paul Barrett describes a similar technique in his 1986 paper \cite{BARRETT} written
2486 2495 five years before.
2487 At the heart of the Comba technique is once again the long-hand algorithm. Except in this case a slight twist is placed on how 2496
2488 the columns of the result are produced. In the standard long-hand algorithm rows of products are produced then added together to form the 2497 At the heart of the Comba technique is once again the long-hand algorithm. Except in this case a slight
2489 final result. In the baseline algorithm the columns are added together after each iteration to get the result instantaneously. 2498 twist is placed on how the columns of the result are produced. In the standard long-hand algorithm rows of products
2490 2499 are produced then added together to form the final result. In the baseline algorithm the columns are added together
2491 In the Comba algorithm the columns of the result are produced entirely independently of each other. That is at the $O(n^2)$ level a 2500 after each iteration to get the result instantaneously.
2492 simple multiplication and addition step is performed. The carries of the columns are propagated after the nested loop to reduce the amount 2501
2493 of work required. Succinctly the first step of the algorithm is to compute the product vector $\vec x$ as follows. 2502 In the Comba algorithm the columns of the result are produced entirely independently of each other. That is at
2503 the $O(n^2)$ level a simple multiplication and addition step is performed. The carries of the columns are propagated
2504 after the nested loop to reduce the amount of work required. Succinctly the first step of the algorithm is to compute
2505 the product vector $\vec x$ as follows.
2494 2506
2495 \begin{equation} 2507 \begin{equation}
2496 \vec x_n = \sum_{i+j = n} a_ib_j, \forall n \in \lbrace 0, 1, 2, \ldots, i + j \rbrace 2508 \vec x_n = \sum_{i+j = n} a_ib_j, \forall n \in \lbrace 0, 1, 2, \ldots, i + j \rbrace
2497 \end{equation} 2509 \end{equation}
2498 2510
2582 \begin{tabular}{l} 2594 \begin{tabular}{l}
2583 \hline Algorithm \textbf{fast\_s\_mp\_mul\_digs}. \\ 2595 \hline Algorithm \textbf{fast\_s\_mp\_mul\_digs}. \\
2584 \textbf{Input}. mp\_int $a$, mp\_int $b$ and an integer $digs$ \\ 2596 \textbf{Input}. mp\_int $a$, mp\_int $b$ and an integer $digs$ \\
2585 \textbf{Output}. $c \leftarrow \vert a \vert \cdot \vert b \vert \mbox{ (mod }\beta^{digs}\mbox{)}$. \\ 2597 \textbf{Output}. $c \leftarrow \vert a \vert \cdot \vert b \vert \mbox{ (mod }\beta^{digs}\mbox{)}$. \\
2586 \hline \\ 2598 \hline \\
2587 Place an array of \textbf{MP\_WARRAY} double precision digits named $\hat W$ on the stack. \\ 2599 Place an array of \textbf{MP\_WARRAY} single precision digits named $W$ on the stack. \\
2588 1. If $c.alloc < digs$ then grow $c$ to $digs$ digits. (\textit{mp\_grow}) \\ 2600 1. If $c.alloc < digs$ then grow $c$ to $digs$ digits. (\textit{mp\_grow}) \\
2589 2. If step 1 failed return(\textit{MP\_MEM}).\\ 2601 2. If step 1 failed return(\textit{MP\_MEM}).\\
2590 \\ 2602 \\
2591 Zero the temporary array $\hat W$. \\ 2603 3. $pa \leftarrow \mbox{MIN}(digs, a.used + b.used)$ \\
2592 3. for $n$ from $0$ to $digs - 1$ do \\
2593 \hspace{3mm}3.1 $\hat W_n \leftarrow 0$ \\
2594 \\ 2604 \\
2595 Compute the columns. \\ 2605 4. $\_ \hat W \leftarrow 0$ \\
2596 4. for $ix$ from $0$ to $a.used - 1$ do \\ 2606 5. for $ix$ from 0 to $pa - 1$ do \\
2597 \hspace{3mm}4.1 $pb \leftarrow \mbox{min}(b.used, digs - ix)$ \\ 2607 \hspace{3mm}5.1 $ty \leftarrow \mbox{MIN}(b.used - 1, ix)$ \\
2598 \hspace{3mm}4.2 If $pb < 1$ then goto step 5. \\ 2608 \hspace{3mm}5.2 $tx \leftarrow ix - ty$ \\
2599 \hspace{3mm}4.3 for $iy$ from $0$ to $pb - 1$ do \\ 2609 \hspace{3mm}5.3 $iy \leftarrow \mbox{MIN}(a.used - tx, ty + 1)$ \\
2600 \hspace{6mm}4.3.1 $\hat W_{ix+iy} \leftarrow \hat W_{ix+iy} + a_{ix}b_{iy}$ \\ 2610 \hspace{3mm}5.4 for $iz$ from 0 to $iy - 1$ do \\
2611 \hspace{6mm}5.4.1 $\_ \hat W \leftarrow \_ \hat W + a_{tx+iy}b_{ty-iy}$ \\
2612 \hspace{3mm}5.5 $W_{ix} \leftarrow \_ \hat W (\mbox{mod }\beta)$\\
2613 \hspace{3mm}5.6 $\_ \hat W \leftarrow \lfloor \_ \hat W / \beta \rfloor$ \\
2614 6. $W_{pa} \leftarrow \_ \hat W (\mbox{mod }\beta)$ \\
2601 \\ 2615 \\
2602 Propagate the carries upwards. \\ 2616 7. $oldused \leftarrow c.used$ \\
2603 5. $oldused \leftarrow c.used$ \\ 2617 8. $c.used \leftarrow digs$ \\
2604 6. $c.used \leftarrow digs$ \\ 2618 9. for $ix$ from $0$ to $pa$ do \\
2605 7. If $digs > 1$ then do \\ 2619 \hspace{3mm}9.1 $c_{ix} \leftarrow W_{ix}$ \\
2606 \hspace{3mm}7.1. for $ix$ from $1$ to $digs - 1$ do \\ 2620 10. for $ix$ from $pa + 1$ to $oldused - 1$ do \\
2607 \hspace{6mm}7.1.1 $\hat W_{ix} \leftarrow \hat W_{ix} + \lfloor \hat W_{ix-1} / \beta \rfloor$ \\ 2621 \hspace{3mm}10.1 $c_{ix} \leftarrow 0$ \\
2608 \hspace{6mm}7.1.2 $c_{ix - 1} \leftarrow \hat W_{ix - 1} \mbox{ (mod }\beta\mbox{)}$ \\
2609 8. else do \\
2610 \hspace{3mm}8.1 $ix \leftarrow 0$ \\
2611 9. $c_{ix} \leftarrow \hat W_{ix} \mbox{ (mod }\beta\mbox{)}$ \\
2612 \\ 2622 \\
2613 Zero excess digits. \\ 2623 11. Clamp $c$. \\
2614 10. If $digs < oldused$ then do \\ 2624 12. Return MP\_OKAY. \\
2615 \hspace{3mm}10.1 for $n$ from $digs$ to $oldused - 1$ do \\
2616 \hspace{6mm}10.1.1 $c_n \leftarrow 0$ \\
2617 11. Clamp excessive digits of $c$. (\textit{mp\_clamp}) \\
2618 12. Return(\textit{MP\_OKAY}). \\
2619 \hline 2625 \hline
2620 \end{tabular} 2626 \end{tabular}
2621 \end{center} 2627 \end{center}
2622 \end{small} 2628 \end{small}
2623 \caption{Algorithm fast\_s\_mp\_mul\_digs} 2629 \caption{Algorithm fast\_s\_mp\_mul\_digs}
2624 \label{fig:COMBAMULT} 2630 \label{fig:COMBAMULT}
2625 \end{figure} 2631 \end{figure}
2626 2632
2627 \textbf{Algorithm fast\_s\_mp\_mul\_digs.} 2633 \textbf{Algorithm fast\_s\_mp\_mul\_digs.}
2628 This algorithm performs the unsigned multiplication of $a$ and $b$ using the Comba method limited to $digs$ digits of precision. The algorithm 2634 This algorithm performs the unsigned multiplication of $a$ and $b$ using the Comba method limited to $digs$ digits of precision.
2629 essentially performs the same calculation as algorithm s\_mp\_mul\_digs, just much faster. 2635
2630 2636 The outer loop of this algorithm is more complicated than that of the baseline multiplier. This is because on the inside of the
2631 The array $\hat W$ is meant to be on the stack when the algorithm is used. The size of the array does not change which is ideal. Note also that 2637 loop we want to produce one column per pass. This allows the accumulator $\_ \hat W$ to be placed in CPU registers and
2632 unlike algorithm s\_mp\_mul\_digs no temporary mp\_int is required since the result is calculated directly in $\hat W$. 2638 reduces the memory bandwidth to two \textbf{mp\_digit} reads per iteration.
2633 2639
2634 The $O(n^2)$ loop on step four is where the Comba method's advantages begin to show through in comparison to the baseline algorithm. The lack of 2640 The $ty$ variable is set to the minimum count of $ix$ or the number of digits in $b$. That way if $a$ has more digits than
2635 a carry variable or propagation in this loop allows the loop to be performed with only single precision multiplication and additions. Now that each 2641 $b$ this will be limited to $b.used - 1$. The $tx$ variable is set to the distance past $b.used$ the variable
2636 iteration of the inner loop can be performed independent of the others the inner loop can be performed with a high level of parallelism. 2642 $ix$ is. This is used for the immediately subsequent statement where we find $iy$.
2643
2644 The variable $iy$ is the minimum number of digits we can read from either $a$ or $b$ before running out. Computing one column at a time
2645 means we have to scan one integer upwards and the other downwards. $a$ starts at $tx$ and $b$ starts at $ty$. In each
2646 pass we are producing the $ix$'th output column and we note that $tx + ty = ix$. As we move $tx$ upwards we have to
2647 move $ty$ downwards so the equality remains valid. The $iy$ variable is the number of iterations until
2648 $tx \ge a.used$ or $ty < 0$ occurs.
2649
2650 After every inner pass we store the lower half of the accumulator into $W_{ix}$ and then propagate the carry of the accumulator
2651 into the next round by dividing $\_ \hat W$ by $\beta$.
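A column-major sketch of the loop just described, with hypothetical 28--bit digits: note that $tx + ty = ix$ holds on every inner iteration and the accumulator is shifted down by $\lg(\beta)$ bits between columns:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Comba-style column-major multiply: produce one output column per
 * outer pass, keeping the accumulator W_hat in a single (register)
 * variable.  tx scans a upwards while ty scans b downwards so that
 * tx + ty == ix throughout.  Assumes au, bu >= 1 and that c has
 * au + bu entries; a real mp_word must hold up to n*beta^2. */
#define DIGIT_BITS 28u
#define DIGIT_MASK 0x0FFFFFFFu

static void comba_mul(const uint32_t *a, size_t au,
                      const uint32_t *b, size_t bu,
                      uint32_t *c)
{
    size_t pa = au + bu;
    uint64_t W_hat = 0;                            /* running accumulator */
    for (size_t ix = 0; ix < pa; ix++) {
        size_t ty = (bu - 1 < ix) ? bu - 1 : ix;   /* MIN(b.used - 1, ix) */
        size_t tx = ix - ty;
        size_t iy = ((au - tx) < (ty + 1)) ? (au - tx) : (ty + 1);
        for (size_t iz = 0; iz < iy; iz++)
            W_hat += (uint64_t)a[tx + iz] * b[ty - iz];
        c[ix] = (uint32_t)(W_hat & DIGIT_MASK);    /* store this column */
        W_hat >>= DIGIT_BITS;                      /* carry to next column */
    }
}
```

Only two digit reads feed the accumulator per inner iteration, which is the memory-bandwidth point made above.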
2637 2652
2638 To measure the benefits of the Comba method over the baseline method consider the number of operations that are required. If the 2653 To measure the benefits of the Comba method over the baseline method consider the number of operations that are required. If the
2639 cost in terms of time of a multiply and addition is $p$ and the cost of a carry propagation is $q$ then a baseline multiplication would require 2654 cost in terms of time of a multiply and addition is $p$ and the cost of a carry propagation is $q$ then a baseline multiplication would require
2640 $O \left ((p + q)n^2 \right )$ time to multiply two $n$-digit numbers. The Comba method requires only $O(pn^2 + qn)$ time, however in practice, 2655 $O \left ((p + q)n^2 \right )$ time to multiply two $n$-digit numbers. The Comba method requires only $O(pn^2 + qn)$ time, however in practice,
2641 the speed increase is actually much more. With $O(n)$ space the algorithm can be reduced to $O(pn + qn)$ time by implementing the $n$ multiply 2656 the speed increase is actually much more. With $O(n)$ space the algorithm can be reduced to $O(pn + qn)$ time by implementing the $n$ multiply
2642 and addition operations in the nested loop in parallel. 2657 and addition operations in the nested loop in parallel.
2643 2658
2644 EXAM,bn_fast_s_mp_mul_digs.c 2659 EXAM,bn_fast_s_mp_mul_digs.c
2645 2660
2646 The memset on line @47,memset@ clears the initial $\hat W$ array to zero in a single step. Like the slower baseline multiplication 2661 As per the pseudo--code we first calculate $pa$ (line @47,MIN@) as the number of digits to output. Next we begin the outer loop
2647 implementation a series of aliases (\textit{lines @67, tmpx@, @70, tmpy@ and @75,_W@}) are used to simplify the inner $O(n^2)$ loop. 2662 to produce the individual columns of the product. We use the two aliases $tmpx$ and $tmpy$ (lines @61,tmpx@, @62,tmpy@) to point
2648 In this case a new alias $\_\hat W$ has been added which refers to the double precision columns offset by $ix$ in each pass. 2663 inside the two multiplicands quickly.
2649 2664
2650 The inner loop on lines @83,for@, @84,mp_word@ and @85,}@ is where the algorithm will spend the majority of the time, which is why it has been 2665 The inner loop (lines @70,for@ to @72,}@) of this implementation is where the tradeoffs come into play. Originally this Comba
2651 stripped to the bones of any extra baggage\footnote{Hence the pointer aliases.}. On x86 processors the multiplication and additions amount to at the 2666 implementation was ``row--major'' which means it adds to each of the columns in each pass. After the outer loop it would then fix
2652 very least five instructions (\textit{two loads, two additions, one multiply}) while on the ARMv4 processors they amount to only three 2667 the carries. This was very fast except it had an annoying drawback. You had to read an mp\_word and two mp\_digits and write
2653 (\textit{one load, one store, one multiply-add}). For both of the x86 and ARMv4 processors the GCC compiler does a good job of unrolling the loop 2668 one mp\_word per iteration. On processors such as the Athlon XP and P4 this did not matter much since the cache bandwidth
2654 and scheduling the instructions so there are very few dependency stalls. 2669 is very high and it can keep the ALU fed with data. It did, however, matter on older and embedded CPUs where cache is often
2655 2670 slower and also often doesn't exist. This new algorithm only performs two reads per iteration under the assumption that the
2656 In theory the difference between the baseline and comba algorithms is a mere $O(qn)$ time difference. However, in the $O(n^2)$ nested loop of the 2671 compiler has aliased $\_ \hat W$ to a CPU register.
2657 baseline method there are dependency stalls as the algorithm must wait for the multiplier to finish before propagating the carry to the next 2672
2658 digit. As a result fewer of the often multiple execution units\footnote{The AMD Athlon has three execution units and the Intel P4 has four.} can 2673 After the inner loop we store the current accumulator in $W$ and shift $\_ \hat W$ (lines @75,W[ix]@, @78,>>@) to forward it as
2659 be simultaneously used. 2674 a carry for the next pass. After the outer loop we use the final carry (line @82,W[ix]@) as the last digit of the product.
2660 2675
2661 \subsection{Polynomial Basis Multiplication} 2676 \subsection{Polynomial Basis Multiplication}
2662 To break the $O(n^2)$ barrier in multiplication requires a completely different look at integer multiplication. In the following algorithms 2677 To break the $O(n^2)$ barrier in multiplication requires a completely different look at integer multiplication. In the following algorithms
2663 the use of polynomial basis representation for two integers $a$ and $b$ as $f(x) = \sum_{i=0}^{n} a_i x^i$ and 2678 the use of polynomial basis representation for two integers $a$ and $b$ as $f(x) = \sum_{i=0}^{n} a_i x^i$ and
2664 $g(x) = \sum_{i=0}^{n} b_i x^i$ respectively, is required. In this system both $f(x)$ and $g(x)$ have $n + 1$ terms and are of the $n$'th degree. 2679 $g(x) = \sum_{i=0}^{n} b_i x^i$ respectively, is required. In this system both $f(x)$ and $g(x)$ have $n + 1$ terms and are of the $n$'th degree.
2974 Once the coefficients have been isolated, the polynomial $W(x) = \sum_{i=0}^{2n} w_i x^i$ is known. By substituting $\beta^{k}$ for $x$, the integer 2989 Once the coefficients have been isolated, the polynomial $W(x) = \sum_{i=0}^{2n} w_i x^i$ is known. By substituting $\beta^{k}$ for $x$, the integer
2975 result $a \cdot b$ is produced. 2990 result $a \cdot b$ is produced.
2976 2991
2977 EXAM,bn_mp_toom_mul.c 2992 EXAM,bn_mp_toom_mul.c
2978 2993
2979 -- Comments to be added during editing phase. 2994 The first obvious thing to note is that this algorithm is complicated. The complexity is worth it if you are multiplying very
2995 large numbers. For example, a 10,000 digit multiplication takes approximately 99,282,205 fewer single precision multiplications with
2996 Toom--Cook than a Comba or baseline approach (this is a savings of more than 99$\%$). For most ``crypto'' sized numbers this
2997 algorithm is not practical as Karatsuba has a much lower cutoff point.
2998
2999 First we split $a$ and $b$ into three roughly equal portions. This has been accomplished (lines @40,mod@ to @69,rshd@) with
3000 combinations of mp\_rshd() and mp\_mod\_2d() function calls. At this point $a = a2 \cdot \beta^2 + a1 \cdot \beta + a0$ and similarly
3001 for $b$.
3002
3003 Next we compute the five points $w0, w1, w2, w3$ and $w4$. Recall that $w0$ and $w4$ can be computed directly from the portions so
3004 we get those out of the way first (lines @72,mul@ and @77,mul@). Next we compute $w1, w2$ and $w3$ using Horner's method.
3005
3006 After this point we solve for the actual values of $w1, w2$ and $w3$ by reducing the $5 \times 5$ system which is relatively
3007 straightforward.
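The three-way split can be illustrated on a single machine word. This toy version stands in for the mp\_rshd()/mp\_mod\_2d() combinations applied to mp\_ints; the portion width $k$ is an arbitrary choice for the illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Toom-Cook three-way split sketch: with k bits per portion, a is cut
 * into a0, a1, a2 using shifts (the bit analogue of mp_rshd) and masks
 * (mp_mod_2d), so that a = a2*2^(2k) + a1*2^k + a0. */
static void toom_split(uint64_t a, unsigned k,
                       uint64_t *a0, uint64_t *a1, uint64_t *a2)
{
    uint64_t mask = ((uint64_t)1 << k) - 1;
    *a0 = a & mask;            /* mp_mod_2d(a, k)          */
    *a1 = (a >> k) & mask;     /* shift down, then mod_2d  */
    *a2 = a >> (2 * k);        /* shift down twice         */
}
```

Multiplying the recombined portions back together is exactly the polynomial evaluation at $x = 2^k$ that the interpolation step later reverses.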
2980 3008
2981 \subsection{Signed Multiplication} 3009 \subsection{Signed Multiplication}
2982 Now that algorithms to handle multiplications of every useful dimension have been developed, a rather simple finishing touch is required. So far all 3010 Now that algorithms to handle multiplications of every useful dimension have been developed, a rather simple finishing touch is required. So far all
2983 of the multiplication algorithms have been unsigned multiplications which leaves only a signed multiplication algorithm to be established. 3011 of the multiplication algorithms have been unsigned multiplications which leaves only a signed multiplication algorithm to be established.
2984 3012
2985 \newpage\begin{figure}[!here] 3013 \begin{figure}[!here]
2986 \begin{small} 3014 \begin{small}
2987 \begin{center} 3015 \begin{center}
2988 \begin{tabular}{l} 3016 \begin{tabular}{l}
2989 \hline Algorithm \textbf{mp\_mul}. \\ 3017 \hline Algorithm \textbf{mp\_mul}. \\
2990 \textbf{Input}. mp\_int $a$ and mp\_int $b$ \\ 3018 \textbf{Input}. mp\_int $a$ and mp\_int $b$ \\
3063 3091
3064 \subsection{The Baseline Squaring Algorithm} 3092 \subsection{The Baseline Squaring Algorithm}
3065 The baseline squaring algorithm is meant to be a catch-all squaring algorithm. It will handle any of the input sizes that the faster routines 3093 The baseline squaring algorithm is meant to be a catch-all squaring algorithm. It will handle any of the input sizes that the faster routines
3066 will not handle. 3094 will not handle.
3067 3095
3068 \newpage\begin{figure}[!here] 3096 \begin{figure}[!here]
3069 \begin{small} 3097 \begin{small}
3070 \begin{center} 3098 \begin{center}
3071 \begin{tabular}{l} 3099 \begin{tabular}{l}
3072 \hline Algorithm \textbf{s\_mp\_sqr}. \\ 3100 \hline Algorithm \textbf{s\_mp\_sqr}. \\
3073 \textbf{Input}. mp\_int $a$ \\ 3101 \textbf{Input}. mp\_int $a$ \\
3119 Similar to algorithm s\_mp\_mul\_digs, after every pass of the inner loop, the destination is correctly set to the sum of all of the partial 3147 Similar to algorithm s\_mp\_mul\_digs, after every pass of the inner loop, the destination is correctly set to the sum of all of the partial
3120 results calculated so far. This involves expensive carry propagation which will be eliminated in the next algorithm. 3148 results calculated so far. This involves expensive carry propagation which will be eliminated in the next algorithm.
3121 3149
3122 EXAM,bn_s_mp_sqr.c 3150 EXAM,bn_s_mp_sqr.c
3123 3151
3124 Inside the outer loop (\textit{see line @32,for@}) the square term is calculated on line @35,r =@. Line @42,>>@ extracts the carry from the square 3152 Inside the outer loop (line @32,for@) the square term is calculated on line @35,r =@. The carry (line @42,>>@) has been
3125 term. Aliases for $a_{ix}$ and $t_{ix+iy}$ are initialized on lines @45,tmpx@ and @48,tmpt@ respectively. The doubling is performed using two 3153 extracted from the mp\_word accumulator using a right shift. Aliases for $a_{ix}$ and $t_{ix+iy}$ are initialized
3126 additions (\textit{see line @57,r + r@}) since it is usually faster than shifting, if not at least as fast. 3154 (lines @45,tmpx@ and @48,tmpt@) to simplify the inner loop. The doubling is performed using two
3155 additions (line @57,r + r@) since it is usually faster than shifting, if not at least as fast.
3156
3157 The important observation is that the inner loop does not begin at $iy = 0$ like for multiplication. As such the inner loops
3158 get progressively shorter as the algorithm proceeds. This is what leads to the savings compared to using a multiplication to
3159 square a number.
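A sketch of the baseline squaring loops with hypothetical 28--bit digits: the inner loop starts at $iy = ix + 1$, so each cross product is computed once and doubled with two additions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Baseline squaring sketch: b must hold 2*used digits.  The square
 * term a_ix^2 lands on column 2*ix once per outer pass; every
 * off-diagonal product a_ix * a_iy is doubled via r + r. */
#define DIGIT_BITS 28u
#define DIGIT_MASK 0x0FFFFFFFu

static void s_sqr(const uint32_t *a, size_t used, uint32_t *b)
{
    size_t n = 2 * used;
    for (size_t i = 0; i < n; i++) b[i] = 0;
    for (size_t ix = 0; ix < used; ix++) {
        /* square term, added exactly once */
        uint64_t r = (uint64_t)b[2 * ix] + (uint64_t)a[ix] * a[ix];
        b[2 * ix] = (uint32_t)(r & DIGIT_MASK);
        uint64_t u = r >> DIGIT_BITS;                 /* carry from square */
        for (size_t iy = ix + 1; iy < used; iy++) {   /* shrinking inner loop */
            r = (uint64_t)a[ix] * a[iy];
            r = (uint64_t)b[ix + iy] + r + r + u;     /* double via two adds */
            b[ix + iy] = (uint32_t)(r & DIGIT_MASK);
            u = r >> DIGIT_BITS;
        }
        for (size_t iz = ix + used; u != 0 && iz < n; iz++) {
            r = (uint64_t)b[iz] + u;                  /* propagate carry */
            b[iz] = (uint32_t)(r & DIGIT_MASK);
            u = r >> DIGIT_BITS;
        }
    }
}
```

Because $iy$ starts past the diagonal, roughly half the single precision multiplications of a full multiply are avoided, which is the saving the text refers to.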
3127 3160
3128 \subsection{Faster Squaring by the ``Comba'' Method} 3161 \subsection{Faster Squaring by the ``Comba'' Method}
3129 A major drawback to the baseline method is the requirement for single precision shifting inside the $O(n^2)$ nested loop. Squaring has an additional 3162 A major drawback to the baseline method is the requirement for single precision shifting inside the $O(n^2)$ nested loop. Squaring has an additional
3130 drawback that it must double the product inside the inner loop as well. As for multiplication, the Comba technique can be used to eliminate these 3163 drawback that it must double the product inside the inner loop as well. As for multiplication, the Comba technique can be used to eliminate these
3131 performance hazards. 3164 performance hazards.
The first obvious solution is to make an array of mp\_words which will hold all of the columns. This will indeed eliminate all of the carry
propagation operations from the inner loop. However, the inner product must still be doubled $O(n^2)$ times. The solution stems from the simple fact
that $2a + 2b + 2c = 2(a + b + c)$. That is, the sum of all of the double products is equal to double the sum of all the products. For example,
$ab + ba + ac + ca = 2ab + 2ac = 2(ab + ac)$.

However, we cannot simply double all of the columns, since the squares appear only once per row. The most practical solution is to have two
mp\_word arrays. One array will hold the squares and the other array will hold the double products. With both arrays the doubling and
carry propagation can be moved to a $O(n)$ work level outside the $O(n^2)$ level. In this case, we have an even simpler solution in mind.

\newpage\begin{figure}[!here]
\begin{small}
\begin{center}
\begin{tabular}{l}
\hline Algorithm \textbf{fast\_s\_mp\_sqr}. \\
\textbf{Input}. mp\_int $a$ \\
\textbf{Output}. $b \leftarrow a^2$ \\
\hline \\
Place an array of \textbf{MP\_WARRAY} mp\_digits named $W$ on the stack. \\
1. If $b.alloc < 2a.used + 1$ then grow $b$ to $2a.used + 1$ digits. (\textit{mp\_grow}). \\
2. If step 1 failed return(\textit{MP\_MEM}). \\
\\
3. $pa \leftarrow 2 \cdot a.used$ \\
4. $\hat W1 \leftarrow 0$ \\
5. for $ix$ from $0$ to $pa - 1$ do \\
\hspace{3mm}5.1 $\_ \hat W \leftarrow 0$ \\
\hspace{3mm}5.2 $ty \leftarrow \mbox{MIN}(a.used - 1, ix)$ \\
\hspace{3mm}5.3 $tx \leftarrow ix - ty$ \\
\hspace{3mm}5.4 $iy \leftarrow \mbox{MIN}(a.used - tx, ty + 1)$ \\
\hspace{3mm}5.5 $iy \leftarrow \mbox{MIN}(iy, \lfloor \left (ty - tx + 1 \right )/2 \rfloor)$ \\
\hspace{3mm}5.6 for $iz$ from $0$ to $iy - 1$ do \\
\hspace{6mm}5.6.1 $\_ \hat W \leftarrow \_ \hat W + a_{tx + iz}a_{ty - iz}$ \\
\hspace{3mm}5.7 $\_ \hat W \leftarrow 2 \cdot \_ \hat W + \hat W1$ \\
\hspace{3mm}5.8 if $ix$ is even then \\
\hspace{6mm}5.8.1 $\_ \hat W \leftarrow \_ \hat W + \left ( a_{\lfloor ix/2 \rfloor}\right )^2$ \\
\hspace{3mm}5.9 $W_{ix} \leftarrow \_ \hat W (\mbox{mod }\beta)$ \\
\hspace{3mm}5.10 $\hat W1 \leftarrow \lfloor \_ \hat W / \beta \rfloor$ \\
\\
6. $oldused \leftarrow b.used$ \\
7. $b.used \leftarrow 2 \cdot a.used$ \\
8. for $ix$ from $0$ to $pa - 1$ do \\
\hspace{3mm}8.1 $b_{ix} \leftarrow W_{ix}$ \\
9. for $ix$ from $pa$ to $oldused - 1$ do \\
\hspace{3mm}9.1 $b_{ix} \leftarrow 0$ \\
10. Clamp excess digits from $b$. (\textit{mp\_clamp}) \\
11. Return(\textit{MP\_OKAY}). \\
\hline
\end{tabular}
\end{center}
\end{small}
\caption{Algorithm fast\_s\_mp\_sqr}
\end{figure}

\textbf{Algorithm fast\_s\_mp\_sqr.}
This algorithm computes the square of an input using the Comba technique. It is designed to be a replacement for algorithm
s\_mp\_sqr when the number of input digits is less than \textbf{MP\_WARRAY} and less than $\delta \over 2$.
This algorithm is very similar to the Comba multiplier except for a few key differences we shall make note of.

First, we have an accumulator and a carry variable, $\_ \hat W$ and $\hat W1$ respectively. This is because the inner loop
products are to be doubled. If we had added the previous carry in we would be doubling too much. Next we perform an
additional MIN condition on $iy$ (step 5.5) to prevent overlapping digits. For example, $a_3 \cdot a_5$ is equal to
$a_5 \cdot a_3$. In the multiplication case the bounds $5 < a.used$ and $3 \ge 0$ would suffice, but since we double the
sum of the products just outside the inner loop we must avoid summing each such pair twice. This is also a good thing since we
perform fewer multiplications and the routine ends up being faster.

Finally, the last difference is the addition of the ``square'' term outside the inner loop (step 5.8). We add in the square
only to even columns and it is the square of the term at the $\lfloor ix / 2 \rfloor$ position.

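The bound computations of steps 5.2 through 5.10 can be sketched in C as follows. This is an illustrative reduction, not the actual fast\_s\_mp\_sqr source: it uses 8-bit digits with a 32-bit accumulator (and omits the overflow precautions the real routine needs) so that the column bounds, the single doubling, and the even-column square term stand out.

```c
#include <assert.h>
#include <stdint.h>

#define DIGIT_BIT 8
typedef uint32_t word;                 /* accumulator (mp_word)        */
typedef uint8_t  digit;                /* single digit (mp_digit)      */
#define MASK ((word)((1u << DIGIT_BIT) - 1))
#define MIN(x, y) ((x) < (y) ? (x) : (y))

/* Comba squaring sketch: a has n digits, t receives 2n digits */
static void sqr_comba(const digit *a, int n, digit *t)
{
    int  ix, iz, tx, ty, iy;
    word acc, carry = 0;               /* _W and W1 in the pseudo-code */

    for (ix = 0; ix < 2 * n; ix++) {
        acc = 0;                                   /* step 5.1  */
        ty  = MIN(n - 1, ix);                      /* step 5.2  */
        tx  = ix - ty;                             /* step 5.3  */
        iy  = MIN(n - tx, ty + 1);                 /* step 5.4  */
        iy  = MIN(iy, (ty - tx + 1) / 2);          /* step 5.5: stop at the
                                                      midpoint of the column  */
        for (iz = 0; iz < iy; iz++) {              /* step 5.6: sum only the
                                                      distinct cross products */
            acc += (word)a[tx + iz] * (word)a[ty - iz];
        }
        acc = 2 * acc + carry;                     /* step 5.7: double once   */
        if ((ix & 1) == 0) {                       /* step 5.8: even columns
                                                      get one square term     */
            acc += (word)a[ix / 2] * (word)a[ix / 2];
        }
        t[ix] = (digit)(acc & MASK);               /* step 5.9  */
        carry = acc >> DIGIT_BIT;                  /* step 5.10 */
    }
}
```

Tracing the worst case $\{255, 255\}$ (that is, $65535$) shows the carry variable doing its job: the columns produce $\{1, 0, 254, 255\}$, which is $65535^2 = \mbox{0xFFFE0001}$ in base $256$.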
EXAM,bn_fast_s_mp_sqr.c

This implementation is essentially a copy of Comba multiplication with the appropriate changes added to make it faster for
the special case of squaring.

\subsection{Polynomial Basis Squaring}
The same algorithm that performs optimal polynomial basis multiplication can be used to perform polynomial basis squaring. The minor exception
is that $\zeta_y = f(y)g(y)$ is actually equivalent to $\zeta_y = f(y)^2$ since $f(y) = g(y)$. Instead of performing $2n + 1$
multiplications to find the $\zeta$ relations, squaring operations are performed instead.

By inlining the copy and shift operations the cutoff point for Karatsuba multiplication can be lowered. On the Athlon the cutoff point
is exactly at the point where Comba squaring can no longer be used (\textit{128 digits}). On slower processors such as the Intel P4
it is actually below the Comba limit (\textit{at 110 digits}).

This routine uses the same error trap coding style as mp\_karatsuba\_sqr. As the temporary variables are initialized errors are
redirected to the error trap higher up. If the algorithm completes without error the error code is set to \textbf{MP\_OKAY} and
mp\_clears are executed normally.

\subsection{Toom-Cook Squaring}
The Toom-Cook squaring algorithm mp\_toom\_sqr is heavily based on the algorithm mp\_toom\_mul with the exception that squarings are used
instead of multiplications to find the five relations. The reader is encouraged to read the description of the latter algorithm and try to
derive their own Toom-Cook squaring algorithm.

\subsection{High Level Squaring}
\newpage\begin{figure}[!here]
\begin{small}
\section*{Exercises}
\begin{tabular}{cl}
$\left [ 3 \right ] $ & Devise an efficient algorithm for selection of the radix point to handle inputs \\
                      & that have different number of digits in Karatsuba multiplication. \\
                      & \\
$\left [ 2 \right ] $ & In ~SQUARE~ the fact that every column of a squaring is made up \\
                      & of double products and at most one square is stated. Prove this statement. \\
                      & \\
$\left [ 3 \right ] $ & Prove the equation for Karatsuba squaring. \\
                      & \\
$\left [ 1 \right ] $ & Prove that Karatsuba squaring requires $O \left (n^{lg(3)} \right )$ time. \\
                      & \\
$\left [ 2 \right ] $ & Determine the minimal ratio between addition and multiplication clock cycles \\
                      & required for equation $6.7$ to be true. \\
                      & \\
$\left [ 3 \right ] $ & Implement a threaded version of Comba multiplication (and squaring) where you \\
                      & compute subsets of the columns in each thread. Determine a cutoff point where \\
                      & it is effective and add the logic to mp\_mul() and mp\_sqr(). \\
                      & \\
$\left [ 4 \right ] $ & Same as the previous but also modify the Karatsuba and Toom-Cook. You must \\
                      & increase the throughput of mp\_exptmod() for random odd moduli in the range \\
                      & $512 \ldots 4096$ bits significantly ($> 2x$) to complete this challenge. \\
                      & \\
\end{tabular}

\chapter{Modular Reduction}
MARK,REDUCTION
other forms of residues.

Modular reductions are normally used to create either finite groups, rings or fields. The most common usage for performance driven modular reductions
is in modular exponentiation algorithms. That is, to compute $d = a^b \mbox{ (mod }c\mbox{)}$ as fast as possible. This operation is used in the
RSA and Diffie-Hellman public key algorithms, for example. Modular multiplication and squaring also appear as fundamental operations in
elliptic curve cryptographic algorithms. As will be discussed in the subsequent chapter there exist fast algorithms for computing modular
exponentiations without having to perform (\textit{in this example}) $b - 1$ multiplications. These algorithms will produce partial results in the
range $0 \le x < c^2$ which can be taken advantage of to create several efficient algorithms. They have also been used to create redundancy check
algorithms known as CRCs, error correction codes such as Reed-Solomon and to solve a variety of number theoretic problems.

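The observation that partial results stay below $c^2$ can be sketched with a hypothetical word-sized helper (not part of LibTomMath): a square-and-multiply exponentiation in which every product is reduced before the next step, so each reduction only ever sees an input in $0 \le x < c^2$.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not LibTomMath code: compute a^b (mod c) by
 * square-and-multiply, reducing after every multiplication.  Each
 * partial product r*a and a*a is less than c^2 before its reduction,
 * which is exactly the input range the reduction algorithms of this
 * chapter are designed for.  Assumes c < 2^32 so products fit in the
 * 64-bit words used here. */
static uint64_t expt_mod(uint64_t a, uint64_t b, uint64_t c)
{
    uint64_t r = 1 % c;
    a %= c;
    while (b != 0) {
        if (b & 1) {
            r = (r * a) % c;   /* r * a < c^2 before the reduction */
        }
        a = (a * a) % c;       /* a * a < c^2 before the reduction */
        b >>= 1;
    }
    return r;
}
```

For instance, $2^{10} \mbox{ (mod }1000\mbox{)} = 24$, reached with only four squarings and two multiplications rather than $b - 1 = 9$ multiplications.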
\section{The Barrett Reduction}

\subsection{The Barrett Setup Algorithm}
In order to use algorithm mp\_reduce the value of $\mu$ must be calculated in advance. Ideally this value should be computed once and stored for
future use so that the Barrett algorithm can be used without delay.

\newpage\begin{figure}[!here]
\begin{small}
\begin{center}
\begin{tabular}{l}
\hline Algorithm \textbf{mp\_reduce\_setup}. \\
\textbf{Input}. mp\_int $a$ ($a > 1$) \\
\section{Jacobi Symbol Computation}
To explain the Jacobi symbol we shall first discuss the Legendre symbol from which the Jacobi symbol is
defined. The Legendre symbol determines whether or not an integer $a$ is a quadratic residue modulo an odd prime $p$. Numerically it is
equivalent to equation \ref{eqn:legendre}.

\begin{equation}
a^{(p-1)/2} \equiv \begin{array}{rl}
                              -1 & \mbox{if }a\mbox{ is a quadratic non-residue.} \\
                              0  & \mbox{if }p\mbox{ divides }a\mbox{.} \\
                              1  & \mbox{if }a\mbox{ is a quadratic residue}.