comparison tommath.src @ 190:d8254fc979e9 libtommath-orig LTM_0.35
Initial import of libtommath 0.35
author | Matt Johnston <matt@ucc.asn.au> |
---|---|
date | Fri, 06 May 2005 08:59:30 +0000 |
parents | d29b64170cf0 |
children |
142:d29b64170cf0 | 190:d8254fc979e9 |
---|---|
47 \def\gap{\vspace{0.5ex}} | 47 \def\gap{\vspace{0.5ex}} |
48 \makeindex | 48 \makeindex |
49 \begin{document} | 49 \begin{document} |
50 \frontmatter | 50 \frontmatter |
51 \pagestyle{empty} | 51 \pagestyle{empty} |
52 \title{Implementing Multiple Precision Arithmetic \\ ~ \\ Draft Edition } | 52 \title{Multi--Precision Math} |
53 \author{\mbox{ | 53 \author{\mbox{ |
54 %\begin{small} | 54 %\begin{small} |
55 \begin{tabular}{c} | 55 \begin{tabular}{c} |
56 Tom St Denis \\ | 56 Tom St Denis \\ |
57 Algonquin College \\ | 57 Algonquin College \\ |
64 \end{tabular} | 64 \end{tabular} |
65 %\end{small} | 65 %\end{small} |
66 } | 66 } |
67 } | 67 } |
68 \maketitle | 68 \maketitle |
69 This text has been placed in the public domain. This text corresponds to the v0.30 release of the | 69 This text has been placed in the public domain. This text corresponds to the v0.35 release of the |
70 LibTomMath project. | 70 LibTomMath project. |
71 | 71 |
72 \begin{alltt} | 72 \begin{alltt} |
73 Tom St Denis | 73 Tom St Denis |
74 111 Banning Rd | 74 111 Banning Rd |
83 This text is formatted to the international B5 paper size of 176mm wide by 250mm tall using the \LaTeX{} | 83 This text is formatted to the international B5 paper size of 176mm wide by 250mm tall using the \LaTeX{} |
84 {\em book} macro package and the Perl {\em booker} package. | 84 {\em book} macro package and the Perl {\em booker} package. |
85 | 85 |
86 \tableofcontents | 86 \tableofcontents |
87 \listoffigures | 87 \listoffigures |
88 \chapter*{Prefaces to the Draft Edition} | 88 \chapter*{Prefaces} |
89 I started this text in April 2003 to complement my LibTomMath library. That is, explain how to implement the functions | 89 When I tell people about my LibTom projects and that I release them as public domain they are often puzzled. |
90 contained in LibTomMath. The goal is to have a textbook that any Computer Science student can use when implementing their | 90 They ask why I did it and especially why I continue to work on them for free. The best I can explain it is ``Because I can.'' |
91 own multiple precision arithmetic. The plan I wanted to follow was to flesh out all the | 91 Which seems odd and perhaps too terse for adult conversation. I often qualify it with ``I am able, I am willing.'' which |
92 ideas and concepts I had floating around in my head and then work on it afterwards refining a little bit at a time. Chance | 92 perhaps explains it better. I am the first to admit there is not anything that special with what I have done. Perhaps |
93 would have it that I ended up with my summer off from Algonquin College and I was given four months solid to work on the | 93 others can see that too and then we would have a society to be proud of. My LibTom projects are what I am doing to give |
94 text. | 94 back to society in the form of tools and knowledge that can help others in their endeavours. |
95 | 95 |
96 Choosing to not waste any time I dove right into the project even before my spring semester was finished. I wrote a bit | 96 I started writing this book because it was the most logical task to further my goal of open academia. The LibTomMath source |
97 off and on at first. The moment my exams were finished I jumped into long 12 to 16 hour days. The result after only | 97 code itself was written to be easy to follow and learn from. There are times, however, where pure C source code does not |
98 a couple of months was a ten chapter, three hundred page draft that I quickly had distributed to anyone who wanted | 98 explain the algorithms properly. Hence this book. The book literally starts with the foundation of the library and works |
99 to read it. I had Jean-Luc Cooke print copies for me and I brought them to Crypto'03 in Santa Barbara. So far I have | 99 itself outwards to the more complicated algorithms. The use of both pseudo--code and verbatim source code provides a duality |
100 managed to grab a certain level of attention; having people from around the world ask me for copies of the text was certainly | 100 of ``theory'' and ``practice'' that the computer science students of the world shall appreciate. I never deviate too far |
101 rewarding. | 101 from relatively straightforward algebra and I hope that this book can be a valuable learning asset. |
102 | 102 |
103 Now we are past December 2003. By this time I had pictured that I would have at least finished my second draft of the text. | 103 This book and indeed much of the LibTom projects would not exist in their current form if it was not for a plethora |
104 Currently I am far off from this goal. I've done partial re-writes of chapters one, two and three but they are not even | 104 of kind people donating their time, resources and kind words to help support my work. Writing a text of significant |
105 finished yet. I haven't given up on the project, only had some setbacks. First O'Reilly declined to publish the text then | 105 length (along with the source code) is a tiresome and lengthy process. Currently the LibTom project is four years old, |
106 Addison-Wesley, and Greg has tried another which I don't know the name of. However, at this point I want to focus my energy | 106 comprises literally thousands of users and over 100,000 lines of source code, TeX and other material. People like Mads and Greg |
107 onto finishing the book not securing a contract. | 107 were there at the beginning to encourage me to work well. It is amazing how timely validation from others can boost morale to |
108 | 108 continue the project. Definitely my parents were there for me by providing room and board during the many months of work in 2003. |
109 So why am I writing this text? It seems like a lot of work right? Most certainly it is a lot of work writing a textbook. | 109 |
110 Even the simplest introductory material has to be lined with references and figures. A lot of the text has to be re-written | 110 To my many friends whom I have met through the years I thank you for the good times and the words of encouragement. I hope I |
111 from point form to prose form to ensure an easier read. Why am I doing all this work for free then? Simple. My philosophy | 111 honour your kind gestures with this project. |
112 is quite simply ``Open Source. Open Academia. Open Minds'' which means that to achieve a goal of open minds, that is, | 112 |
113 people willing to accept new ideas and explore the unknown you have to make available material they can access freely | 113 Open Source. Open Academia. Open Minds. |
114 without hindrance. | |
115 | |
116 I've been writing free software since I was about sixteen but only recently have I hit upon software that people have come | |
117 to depend upon. I started LibTomCrypt in December 2001 and now several major companies use it as integral portions of their | |
118 software. Several educational institutions use it as a matter of course and many freelance developers use it as | |
119 part of their projects. To further my contributions I started the LibTomMath project in December 2002 aimed at providing | |
120 multiple precision arithmetic routines that students could learn from. That is, write routines that are not only easy | |
121 to understand and follow but provide quite impressive performance considering they are all in standard portable ISO C. | |
122 | |
123 The second leg of my philosophy is ``Open Academia'' which is where this textbook comes in. In the end, when all is | |
124 said and done the text will be useable by educational institutions as a reference on multiple precision arithmetic. | |
125 | |
126 At this time I feel I should share a little information about myself. The most common question I was asked at | |
127 Crypto'03, perhaps just out of professional courtesy, was which school I either taught at or attended. The unfortunate | |
128 truth is that I neither teach at nor attend a school of academic reputation. I'm currently at Algonquin College which | |
129 is what I'd like to call a ``somewhat academic but mostly vocational'' college. In other words, job training. | |
130 | |
131 I'm a 21 year old computer science student mostly self-taught in the areas I am aware of (which includes a half-dozen | |
132 computer science fields, a few fields of mathematics and some English). I look forward to teaching someday but I am | |
133 still far off from that goal. | |
134 | |
135 Now it would be improper for me to not introduce the rest of the text's co-authors. While they are only contributing | |
136 corrections and editorial feedback their support has been tremendously helpful in presenting the concepts laid out | |
137 in the text so far. Greg has always been there for me. He has tracked my LibTom projects since their inception and even | |
138 sent cheques to help pay tuition from time to time. His background has provided a wonderful source to bounce ideas off | |
139 of and improve the quality of my writing. Mads is another fellow who has just ``been there''. I don't even recall what | |
140 his interest in the LibTom projects is but I'm definitely glad he has been around. His ability to catch logical errors | |
141 in my written English has saved me on several occasions to say the least. | |
142 | |
143 What to expect next? Well this is still a rough draft. I've only had the chance to update a few chapters. However, I've | |
144 been getting the feeling that people are starting to use my text and I owe them some updated material. My current tentative | |
145 plan is to edit one chapter every two weeks starting January 4th. It seems insane but my lower course load at college | |
146 should provide ample time. By Crypto'04 I plan to have a 2nd draft of the text polished and ready to hand out to as many | |
147 people who will take it. | |
148 | 114 |
149 \begin{flushright} Tom St Denis \end{flushright} | 115 \begin{flushright} Tom St Denis \end{flushright} |
150 | 116 |
151 \newpage | 117 \newpage |
152 I found the opportunity to work with Tom appealing for several reasons, not only could I broaden my own horizons, but also | 118 I found the opportunity to work with Tom appealing for several reasons, not only could I broaden my own horizons, but also |
935 akin to how the \textit{realloc} function from the standard C library works. Since the newly allocated digits are | 901 akin to how the \textit{realloc} function from the standard C library works. Since the newly allocated digits are |
936 assumed to contain undefined values they are initially set to zero. | 902 assumed to contain undefined values they are initially set to zero. |
937 | 903 |
938 EXAM,bn_mp_grow.c | 904 EXAM,bn_mp_grow.c |
939 | 905 |
940 A quick optimization is to first determine if a memory re-allocation is required at all. The if statement (line @23,if@) checks | 906 A quick optimization is to first determine if a memory re-allocation is required at all. The if statement (line @24,alloc@) checks |
941 if the \textbf{alloc} member of the mp\_int is smaller than the requested digit count. If the count is not larger than \textbf{alloc} | 907 if the \textbf{alloc} member of the mp\_int is smaller than the requested digit count. If the count is not larger than \textbf{alloc} |
942 the function skips the re-allocation part thus saving time. | 908 the function skips the re-allocation part thus saving time. |
943 | 909 |
944 When a re-allocation is performed it is turned into an optimal request to save time in the future. The requested digit count is | 910 When a re-allocation is performed it is turned into an optimal request to save time in the future. The requested digit count is |
945 padded upwards to the 2nd multiple of \textbf{MP\_PREC} larger than \textbf{alloc} (line @25, size@). The XREALLOC function is used | 911 padded upwards to the 2nd multiple of \textbf{MP\_PREC} larger than \textbf{alloc} (line @25, size@). The XREALLOC function is used |
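The padding rule described above can be sketched as a small helper. This is an illustrative function, not the actual bn_mp_grow.c source; the value of MP\_PREC here is an assumed example, not necessarily the library default.

```c
#include <assert.h>

#define MP_PREC 32   /* assumed example value for illustration */

/* Round a requested digit count up to the next multiple of 2*MP_PREC,
 * mirroring the padding step described above, so that repeated small
 * grow requests reuse the same allocation. */
static int padded_size(int size)
{
    return size + ((MP_PREC * 2) - (size % (MP_PREC * 2)));
}
```

With MP\_PREC = 32, a request for a single digit is padded up to 64 digits, so subsequent grows up to 64 digits require no further re-allocation.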
1308 \section{Sign Manipulation} | 1274 \section{Sign Manipulation} |
1309 \subsection{Absolute Value} | 1275 \subsection{Absolute Value} |
1310 With the mp\_int representation of an integer, calculating the absolute value is trivial. The mp\_abs algorithm will compute | 1276 With the mp\_int representation of an integer, calculating the absolute value is trivial. The mp\_abs algorithm will compute |
1311 the absolute value of an mp\_int. | 1277 the absolute value of an mp\_int. |
1312 | 1278 |
1313 \newpage\begin{figure}[here] | 1279 \begin{figure}[here] |
1314 \begin{center} | 1280 \begin{center} |
1315 \begin{tabular}{l} | 1281 \begin{tabular}{l} |
1316 \hline Algorithm \textbf{mp\_abs}. \\ | 1282 \hline Algorithm \textbf{mp\_abs}. \\ |
1317 \textbf{Input}. An mp\_int $a$ \\ | 1283 \textbf{Input}. An mp\_int $a$ \\ |
1318 \textbf{Output}. Computes $b = \vert a \vert$ \\ | 1284 \textbf{Output}. Computes $b = \vert a \vert$ \\ |
1332 algorithm where the check in mp\_copy that determines if the source and destination are equal proves useful. This allows, | 1298 algorithm where the check in mp\_copy that determines if the source and destination are equal proves useful. This allows, |
1333 for instance, the developer to pass the same mp\_int as the source and destination to this function without additional | 1299 for instance, the developer to pass the same mp\_int as the source and destination to this function without additional |
1334 logic to handle it. | 1300 logic to handle it. |
1335 | 1301 |
1336 EXAM,bn_mp_abs.c | 1302 EXAM,bn_mp_abs.c |
1303 | |
1304 This fairly trivial algorithm first eliminates non--required duplications (line @27,a != b@) and then sets the | |
1305 \textbf{sign} flag to \textbf{MP\_ZPOS}. | |
1337 | 1306 |
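The flow of mp\_abs can be sketched with a simplified structure. This is not the LibTomMath source; mp\_int\_sketch is a hypothetical stand-in holding only the fields the algorithm touches, and the structure copy stands in for mp\_copy().

```c
#include <assert.h>

#define MP_ZPOS 0   /* positive */
#define MP_NEG  1   /* negative */

/* Hypothetical stand-in for mp_int with only the fields mp_abs touches. */
typedef struct { int used; int sign; } mp_int_sketch;

/* Sketch of mp_abs: copy only when source and destination differ,
 * then force the sign positive. */
static void mp_abs_sketch(mp_int_sketch *a, mp_int_sketch *b)
{
    if (a != b) {
        *b = *a;          /* stands in for the mp_copy() call */
    }
    b->sign = MP_ZPOS;    /* |a| is never negative */
}
```

Because the copy is skipped when the pointers are equal, passing the same variable as both source and destination is safe, as the text notes.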
1338 \subsection{Integer Negation} | 1307 \subsection{Integer Negation} |
1339 With the mp\_int representation of an integer, calculating the negation is also trivial. The mp\_neg algorithm will compute | 1308 With the mp\_int representation of an integer, calculating the negation is also trivial. The mp\_neg algorithm will compute |
1340 the negative of an mp\_int input. | 1309 the negative of an mp\_int input. |
1341 | 1310 |
1366 $a$ had no digits then it must be positive by definition. Had step three been omitted then the algorithm would return | 1335 $a$ had no digits then it must be positive by definition. Had step three been omitted then the algorithm would return |
1367 zero as negative. | 1336 zero as negative. |
1368 | 1337 |
1369 EXAM,bn_mp_neg.c | 1338 EXAM,bn_mp_neg.c |
1370 | 1339 |
1340 Like mp\_abs() this function avoids non--required duplications (line @21,a != b@) and then sets the sign. We | |
1341 have to make sure that only non--zero values get a \textbf{sign} of \textbf{MP\_NEG}. If the mp\_int is zero | |
1342 then the \textbf{sign} is hard--coded to \textbf{MP\_ZPOS}. |
1343 | |
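The zero-handling rule can be sketched the same way. Again this is a hypothetical simplification, not the bn_mp_neg.c source; the point is only that zero never receives a sign of MP\_NEG.

```c
#include <assert.h>

#define MP_ZPOS 0
#define MP_NEG  1

/* Hypothetical stand-in for mp_int (illustration only). */
typedef struct { int used; int sign; } mp_int_sketch;

/* Sketch of mp_neg: copy when needed, flip the sign, but keep zero positive. */
static void mp_neg_sketch(mp_int_sketch *a, mp_int_sketch *b)
{
    if (a != b) {
        *b = *a;                                      /* stands in for mp_copy() */
    }
    if (b->used == 0) {
        b->sign = MP_ZPOS;                            /* zero is hard-coded positive */
    } else {
        b->sign = (b->sign == MP_ZPOS) ? MP_NEG : MP_ZPOS;
    }
}
```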
1371 \section{Small Constants} | 1344 \section{Small Constants} |
1372 \subsection{Setting Small Constants} | 1345 \subsection{Setting Small Constants} |
1373 Often a mp\_int must be set to a relatively small value such as $1$ or $2$. For these cases the mp\_set algorithm is useful. | 1346 Often a mp\_int must be set to a relatively small value such as $1$ or $2$. For these cases the mp\_set algorithm is useful. |
1374 | 1347 |
1375 \begin{figure}[here] | 1348 \newpage\begin{figure}[here] |
1376 \begin{center} | 1349 \begin{center} |
1377 \begin{tabular}{l} | 1350 \begin{tabular}{l} |
1378 \hline Algorithm \textbf{mp\_set}. \\ | 1351 \hline Algorithm \textbf{mp\_set}. \\ |
1379 \textbf{Input}. An mp\_int $a$ and a digit $b$ \\ | 1352 \textbf{Input}. An mp\_int $a$ and a digit $b$ \\ |
1380 \textbf{Output}. Make $a$ equivalent to $b$ \\ | 1353 \textbf{Output}. Make $a$ equivalent to $b$ \\ |
1395 This algorithm sets a mp\_int to a small single digit value. Step number 1 ensures that the integer is reset to the default state. The | 1368 This algorithm sets a mp\_int to a small single digit value. Step number 1 ensures that the integer is reset to the default state. The |
1396 single digit is set (\textit{modulo $\beta$}) and the \textbf{used} count is adjusted accordingly. | 1369 single digit is set (\textit{modulo $\beta$}) and the \textbf{used} count is adjusted accordingly. |
1397 | 1370 |
1398 EXAM,bn_mp_set.c | 1371 EXAM,bn_mp_set.c |
1399 | 1372 |
1400 Line @21,mp_zero@ calls mp\_zero() to clear the mp\_int and reset the sign. Line @22,MP_MASK@ copies the digit | 1373 First we zero (line @21,mp_zero@) the mp\_int to make sure that the other members are initialized for a |
1401 into the least significant location. Note the usage of a new constant \textbf{MP\_MASK}. This constant is used to quickly | 1374 small positive constant. mp\_zero() ensures that the \textbf{sign} is positive and the \textbf{used} count |
1402 reduce an integer modulo $\beta$. Since $\beta$ is of the form $2^k$ for any suitable $k$ it suffices to perform a binary AND with | 1375 is zero. Next we set the digit and reduce it modulo $\beta$ (line @22,MP_MASK@). After this step we have to |
1403 $MP\_MASK = 2^k - 1$ to perform the reduction. Finally line @23,a->used@ will set the \textbf{used} member with respect to the | 1376 check if the resulting digit is zero or not. If it is not then we set the \textbf{used} count to one, otherwise |
1404 digit actually set. This function will always make the integer positive. | 1377 to zero. |
1378 | |
1379 We can quickly reduce modulo $\beta$ since it is of the form $2^k$ and a quick binary AND operation with | |
1380 $2^k - 1$ will perform the same operation. | |
1405 | 1381 |
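The masking identity can be demonstrated directly. Assuming $\beta = 2^{28}$ (the default mentioned later in the text), a binary AND with $2^{28} - 1$ produces the same result as reduction modulo $\beta$.

```c
#include <assert.h>
#include <stdint.h>

#define DIGIT_BIT 28   /* assumed lg(beta), per the text's default */
#define MP_MASK ((((uint32_t)1) << DIGIT_BIT) - 1)

/* Reduce x modulo beta = 2^28 with a single AND, as described above. */
static uint32_t reduce_mod_beta(uint32_t x)
{
    return x & MP_MASK;   /* equivalent to x % (1 << DIGIT_BIT) */
}
```

This works for any $\beta$ of the form $2^k$: the AND keeps exactly the low $k$ bits, which is the remainder after division by $2^k$.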
1406 One important limitation of this function is that it will only set one digit. The size of a digit is not fixed, meaning source that uses | 1382 One important limitation of this function is that it will only set one digit. The size of a digit is not fixed, meaning source that uses |
1407 this function should take that into account. Only trivially small constants can be set using this function. | 1383 this function should take that into account. Only trivially small constants can be set using this function. |
1408 | 1384 |
1409 \subsection{Setting Large Constants} | 1385 \subsection{Setting Large Constants} |
1501 By step three both inputs must have the same number of digits so it's safe to start from either $a.used - 1$ or $b.used - 1$ and count down to | 1477 By step three both inputs must have the same number of digits so it's safe to start from either $a.used - 1$ or $b.used - 1$ and count down to |
1502 the zero'th digit. If after all of the digits have been compared, no difference is found, the algorithm returns \textbf{MP\_EQ}. | 1478 the zero'th digit. If after all of the digits have been compared, no difference is found, the algorithm returns \textbf{MP\_EQ}. |
1503 | 1479 |
1504 EXAM,bn_mp_cmp_mag.c | 1480 EXAM,bn_mp_cmp_mag.c |
1505 | 1481 |
1506 The two if statements on lines @24,if@ and @28,if@ compare the number of digits in the two inputs. These two are performed before all of the digits | 1482 The two if statements (lines @24,if@ and @28,if@) compare the number of digits in the two inputs. These two are |
1507 are compared since it is a very cheap test to perform and can potentially save considerable time. The implementation given is also not valid | 1483 performed before all of the digits are compared since it is a very cheap test to perform and can potentially save |
1508 without those two statements. $b.alloc$ may be smaller than $a.used$, meaning that undefined values will be read from $b$ past the end of the | 1484 considerable time. The implementation given is also not valid without those two statements. $b.alloc$ may be |
1509 array of digits. | 1485 smaller than $a.used$, meaning that undefined values will be read from $b$ past the end of the array of digits. |
1486 | |
1487 | |
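The shape of the comparison can be sketched as follows. This is not the bn_mp_cmp_mag.c source; mag\_sketch is a hypothetical magnitude-only view with a fixed-size digit array for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define MP_LT (-1)
#define MP_EQ   0
#define MP_GT   1

/* Hypothetical magnitude-only view of an mp_int (illustration only). */
typedef struct { int used; uint32_t dp[8]; } mag_sketch;

/* Sketch of mp_cmp_mag: the cheap digit-count tests come first, then a
 * most-significant-digit-first scan over the equal-length inputs. */
static int mp_cmp_mag_sketch(const mag_sketch *a, const mag_sketch *b)
{
    int n;
    if (a->used > b->used) return MP_GT;   /* more digits: larger magnitude */
    if (a->used < b->used) return MP_LT;
    for (n = a->used - 1; n >= 0; n--) {
        if (a->dp[n] > b->dp[n]) return MP_GT;
        if (a->dp[n] < b->dp[n]) return MP_LT;
    }
    return MP_EQ;
}
```

Note how the digit loop only runs when both counts are equal, which is exactly why the implementation never reads past the smaller input's array.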
1510 | 1488 |
1511 \subsection{Signed Comparisons} | 1489 \subsection{Signed Comparisons} |
1512 Comparing with sign considerations is also fairly critical in several routines (\textit{division for example}). Based on an unsigned magnitude | 1490 Comparing with sign considerations is also fairly critical in several routines (\textit{division for example}). Based on an unsigned magnitude |
1513 comparison a trivial signed comparison algorithm can be written. | 1491 comparison a trivial signed comparison algorithm can be written. |
1514 | 1492 |
1537 three the unsigned comparison flips the order of the arguments since they are both negative. For instance, if $-a > -b$ then | 1515 three the unsigned comparison flips the order of the arguments since they are both negative. For instance, if $-a > -b$ then |
1538 $\vert a \vert < \vert b \vert$. Step number four will compare the two when they are both positive. | 1516 $\vert a \vert < \vert b \vert$. Step number four will compare the two when they are both positive. |
1539 | 1517 |
1540 EXAM,bn_mp_cmp.c | 1518 EXAM,bn_mp_cmp.c |
1541 | 1519 |
1542 The two if statements on lines @22,if@ and @26,if@ perform the initial sign comparison. If the signs are not equal then whichever | 1520 The two if statements (lines @22,if@ and @26,if@) perform the initial sign comparison. If the signs are not equal then whichever |
1543 has the positive sign is larger. At line @30,if@, the inputs are compared based on magnitudes. If the signs were both negative then | 1521 has the positive sign is larger. The inputs are compared (line @30,if@) based on magnitudes. If the signs were both |
1544 the unsigned comparison is performed in the opposite direction (\textit{line @31,mp_cmp_mag@}). Otherwise, the signs are assumed to | 1522 negative then the unsigned comparison is performed in the opposite direction (line @31,mp_cmp_mag@). Otherwise, the signs are assumed to |
1545 be both positive and a forward direction unsigned comparison is performed. | 1523 be both positive and a forward direction unsigned comparison is performed. |
1546 | 1524 |
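The sign logic can be sketched with a one-digit toy integer. This is a hypothetical simplification, not the bn_mp_cmp.c source; toy\_cmp\_mag stands in for the full magnitude comparison.

```c
#include <assert.h>

#define MP_ZPOS 0
#define MP_NEG  1
#define MP_LT (-1)
#define MP_EQ   0
#define MP_GT   1

/* Hypothetical one-digit integer, enough to show the sign handling. */
typedef struct { int sign; int mag; } toy_int;

static int toy_cmp_mag(const toy_int *a, const toy_int *b)
{
    if (a->mag > b->mag) return MP_GT;
    if (a->mag < b->mag) return MP_LT;
    return MP_EQ;
}

/* Sketch of mp_cmp: signs first, then magnitudes; when both inputs are
 * negative the magnitude comparison runs in the opposite direction. */
static int toy_cmp(const toy_int *a, const toy_int *b)
{
    if (a->sign == MP_NEG && b->sign == MP_ZPOS) return MP_LT;
    if (a->sign == MP_ZPOS && b->sign == MP_NEG) return MP_GT;
    if (a->sign == MP_NEG) return toy_cmp_mag(b, a);  /* both negative */
    return toy_cmp_mag(a, b);                         /* both positive */
}
```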
1547 \section*{Exercises} | 1525 \section*{Exercises} |
1548 \begin{tabular}{cl} | 1526 \begin{tabular}{cl} |
1549 $\left [ 2 \right ]$ & Modify algorithm mp\_set\_int to accept as input a variable length array of bits. \\ | 1527 $\left [ 2 \right ]$ & Modify algorithm mp\_set\_int to accept as input a variable length array of bits. \\ |
1662 The final carry is stored in $c_{max}$ and digits above $max$ up to $oldused$ are zeroed which completes the addition. | 1640 The final carry is stored in $c_{max}$ and digits above $max$ up to $oldused$ are zeroed which completes the addition. |
1663 | 1641 |
1664 | 1642 |
1665 EXAM,bn_s_mp_add.c | 1643 EXAM,bn_s_mp_add.c |
1666 | 1644 |
1667 Lines @27,if@ to @35,}@ perform the initial sorting of the inputs and determine the $min$ and $max$ variables. Note that $x$ is a pointer to a | 1645 We first sort (lines @27,if@ to @35,}@) the inputs based on magnitude and determine the $min$ and $max$ variables. |
1668 mp\_int assigned to the largest input, in effect it is a local alias. Lines @37,init@ to @42,}@ ensure that the destination is grown to | 1646 Note that $x$ is a pointer to an mp\_int assigned to the largest input, in effect it is a local alias. Next we |
1669 accommodate the result of the addition. | 1647 grow the destination (lines @37,init@ to @42,}@) to ensure that it can accommodate the result of the addition. |
1670 | 1648 |
1671 Similar to the implementation of mp\_copy this function uses the braced code and local aliases coding style. The three aliases that are on | 1649 Similar to the implementation of mp\_copy this function uses the braced code and local aliases coding style. The three aliases that are on |
1672 lines @56,tmpa@, @59,tmpb@ and @62,tmpc@ represent the two inputs and destination variables respectively. These aliases are used to ensure the | 1650 lines @56,tmpa@, @59,tmpb@ and @62,tmpc@ represent the two inputs and destination variables respectively. These aliases are used to ensure the |
1673 compiler does not have to dereference $a$, $b$ or $c$ (respectively) to access the digits of the respective mp\_int. | 1651 compiler does not have to dereference $a$, $b$ or $c$ (respectively) to access the digits of the respective mp\_int. |
1674 | 1652 |
1675 The initial carry $u$ is cleared on line @65,u = 0@, note that $u$ is of type mp\_digit which ensures type compatibility within the | 1653 The initial carry $u$ will be cleared (line @65,u = 0@), note that $u$ is of type mp\_digit which ensures type |
1676 implementation. The initial addition loop begins on line @66,for@ and ends on line @75,}@. Similarly the conditional addition loop | 1654 compatibility within the implementation. The initial addition (line @66,for@ to @75,}@) adds digits from |
1677 begins on line @81,for@ and ends on line @90,}@. The addition is finished with the final carry being stored in $tmpc$ on line @94,tmpc++@. | 1655 both inputs until the smallest input runs out of digits. Similarly the conditional addition loop |
1678 Note the ``++'' operator on the same line. After line @94,tmpc++@ $tmpc$ will point to the $c.used$'th digit of the mp\_int $c$. This is useful | 1656 (line @81,for@ to @90,}@) adds the remaining digits from the larger of the two inputs. The addition is finished |
1679 for the next loop on lines @97,for@ to @99,}@ which sets any old upper digits to zero. | 1657 with the final carry being stored in $tmpc$ (line @94,tmpc++@). Note the ``++'' operator within the same expression. |
1658 After line @94,tmpc++@, $tmpc$ will point to the $c.used$'th digit of the mp\_int $c$. This is useful | |
1659 for the next loop (line @97,for@ to @99,}@) which sets any old upper digits to zero. |
1680 | 1660 |
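The two addition loops and the final carry store can be sketched over raw digit arrays. This is an illustrative sketch under assumed conventions (little-endian digit arrays, $lg(\beta) = 28$), not the real s_mp_add; it also omits the sorting and growing steps.

```c
#include <assert.h>
#include <stdint.h>

#define DIGIT_BIT 28   /* assumed lg(beta), per the text's default */
#define MP_MASK ((((uint32_t)1) << DIGIT_BIT) - 1)

/* Sketch of the loops described above for little-endian digit arrays
 * with alen >= blen; c needs room for alen + 1 digits. Returns the
 * resulting digit count. Not the real s_mp_add. */
static int s_mp_add_sketch(const uint32_t *a, int alen,
                           const uint32_t *b, int blen, uint32_t *c)
{
    uint32_t u = 0;                    /* the carry */
    int i;
    for (i = 0; i < blen; i++) {       /* both inputs contribute */
        u += a[i] + b[i];
        c[i] = u & MP_MASK;            /* keep the low DIGIT_BIT bits */
        u >>= DIGIT_BIT;               /* shift out the carry */
    }
    for (; i < alen; i++) {            /* conditional loop: propagate
                                          through the longer input */
        u += a[i];
        c[i] = u & MP_MASK;
        u >>= DIGIT_BIT;
    }
    c[i] = u;                          /* store the final carry */
    return (u != 0) ? i + 1 : i;
}
```

For example, adding the one-digit values $\beta - 1$ and $1$ produces a zero digit and a carry into a second digit.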
1681 \subsection{Low Level Subtraction} | 1661 \subsection{Low Level Subtraction} |
1682 The low level unsigned subtraction algorithm is very similar to the low level unsigned addition algorithm. The principle difference is that the | 1662 The low level unsigned subtraction algorithm is very similar to the low level unsigned addition algorithm. The principle difference is that the |
1683 unsigned subtraction algorithm requires the result to be positive. That is when computing $a - b$ the condition $\vert a \vert \ge \vert b\vert$ must | 1663 unsigned subtraction algorithm requires the result to be positive. That is when computing $a - b$ the condition $\vert a \vert \ge \vert b\vert$ must |
1684 be met for this algorithm to function properly. Keep in mind this low level algorithm is not meant to be used in higher level algorithms directly. | 1664 be met for this algorithm to function properly. Keep in mind this low level algorithm is not meant to be used in higher level algorithms directly. |
1690 the range $0 \le x < 2\beta$ for the algorithms to work correctly. However, it is allowable that a mp\_digit represent a larger range of values. For | 1670 the range $0 \le x < 2\beta$ for the algorithms to work correctly. However, it is allowable that a mp\_digit represent a larger range of values. For |
1691 this algorithm we will assume that the variable $\gamma$ represents the number of bits available in a | 1671 this algorithm we will assume that the variable $\gamma$ represents the number of bits available in a |
1692 mp\_digit (\textit{this implies $2^{\gamma} > \beta$}). | 1672 mp\_digit (\textit{this implies $2^{\gamma} > \beta$}). |
1693 | 1673 |
1694 For example, the default for LibTomMath is to use an ``unsigned long'' for the mp\_digit ``type'' while $\beta = 2^{28}$. In ISO C an ``unsigned long'' | 1674 For example, the default for LibTomMath is to use an ``unsigned long'' for the mp\_digit ``type'' while $\beta = 2^{28}$. In ISO C an ``unsigned long'' |
1695 data type must be able to represent $0 \le x < 2^{32}$ meaning that in this case $\gamma = 32$. | 1675 data type must be able to represent $0 \le x < 2^{32}$ meaning that in this case $\gamma \ge 32$. |
1696 | 1676 |
1697 \newpage\begin{figure}[!here] | 1677 \newpage\begin{figure}[!here] |
1698 \begin{center} | 1678 \begin{center} |
1699 \begin{small} | 1679 \begin{small} |
1700 \begin{tabular}{l} | 1680 \begin{tabular}{l} |
1757 If $b$ has a smaller magnitude than $a$ then step 9 will force the carry and copy operation to propagate through the larger input $a$ into $c$. Step | 1737 If $b$ has a smaller magnitude than $a$ then step 9 will force the carry and copy operation to propagate through the larger input $a$ into $c$. Step |
1758 10 will ensure that any leading digits of $c$ above the $max$'th position are zeroed. | 1738 10 will ensure that any leading digits of $c$ above the $max$'th position are zeroed. |
1759 | 1739 |
1760 EXAM,bn_s_mp_sub.c | 1740 EXAM,bn_s_mp_sub.c |
1761 | 1741 |
1762 Lines @24,min@ and @25,max@ perform the initial hardcoded sorting of the inputs. In reality the $min$ and $max$ variables are only aliases and are only | 1742 Like low level addition we ``sort'' the inputs, except in this case the sorting is hardcoded |
1763 used to make the source code easier to read. Again the pointer alias optimization is used within this algorithm. Lines @42,tmpa@, @43,tmpb@ and @44,tmpc@ initialize the aliases for | 1743 (lines @24,min@ and @25,max@). In reality the $min$ and $max$ variables are only aliases and are only |
1764 $a$, $b$ and $c$ respectively. | 1744 used to make the source code easier to read. Again the pointer alias optimization is used |
1765 | 1745 within this algorithm. The aliases $tmpa$, $tmpb$ and $tmpc$ are initialized |
1766 The first subtraction loop occurs on lines @47,u = 0@ through @61,}@. The theory behind the subtraction loop is exactly the same as that for | 1746 (lines @42,tmpa@, @43,tmpb@ and @44,tmpc@) for $a$, $b$ and $c$ respectively. |
1767 the addition loop. As remarked earlier there is an implementation reason for using the ``awkward'' method of extracting the carry | 1747 |
1768 (\textit{see line @57, >>@}). The traditional method for extracting the carry would be to shift by $lg(\beta)$ positions and logically AND | 1748 The first subtraction loop (lines @47,u = 0@ through @61,}@) subtract digits from both inputs until the smaller of |
1769 the least significant bit. The AND operation is required because all of the bits above the $\lg(\beta)$'th bit will be set to one after a carry | 1749 the two inputs has been exhausted. As remarked earlier there is an implementation reason for using the ``awkward'' |
1770 occurs from subtraction. This carry extraction requires two relatively cheap operations to extract the carry. The other method is to simply | 1750 method of extracting the carry (line @57, >>@). The traditional method for extracting the carry would be to shift |
1771 shift the most significant bit to the least significant bit thus extracting the carry with a single cheap operation. This optimization only works on | 1751 by $lg(\beta)$ positions and logically AND the least significant bit. The AND operation is required because all of |
1772 two's complement machines which is a safe assumption to make. | 1752 the bits above the $\lg(\beta)$'th bit will be set to one after a carry occurs from subtraction. This carry |
1773 | 1753 extraction requires two relatively cheap operations to extract the carry. The other method is to simply shift the |
1774 If $a$ has a larger magnitude than $b$ an additional loop (\textit{see lines @64,for@ through @73,}@}) is required to propagate the carry through | 1754 most significant bit to the least significant bit thus extracting the carry with a single cheap operation. This |
1775 $a$ and copy the result to $c$. | 1755 optimization only works on two's complement machines which is a safe assumption to make. |
1756 | |
1757 If $a$ has a larger magnitude than $b$ an additional loop (lines @64,for@ through @73,}@) is required to propagate | |
1758 the carry through $a$ and copy the result to $c$. | |
1776 | 1759 |
1777 \subsection{High Level Addition} | 1760 \subsection{High Level Addition} |
1778 Now that both lower level addition and subtraction algorithms have been established an effective high level signed addition algorithm can be | 1761 Now that both lower level addition and subtraction algorithms have been established an effective high level signed addition algorithm can be |
1779 established. This high level addition algorithm will be what other algorithms and developers will use to perform addition of mp\_int data | 1762 established. This high level addition algorithm will be what other algorithms and developers will use to perform addition of mp\_int data |
1780 types. | 1763 types. |
2096 \newpage | 2079 \newpage |
2097 FIGU,sliding_window,Sliding Window Movement | 2080 FIGU,sliding_window,Sliding Window Movement |
2098 | 2081 |
2099 EXAM,bn_mp_lshd.c | 2082 EXAM,bn_mp_lshd.c |
2100 | 2083 |
2101 The if statement on line @24,if@ ensures that the $b$ variable is greater than zero. The \textbf{used} count is incremented by $b$ before | 2084 The if statement (line @24,if@) ensures that the $b$ variable is greater than zero since we do not interpret negative |
2102 the copy loop begins. This eliminates the need for an additional variable in the for loop. The variable $top$ on line @42,top@ is an alias | 2085 shift counts properly. The \textbf{used} count is incremented by $b$ before the copy loop begins. This eliminates |
2103 for the leading digit while $bottom$ on line @45,bottom@ is an alias for the trailing edge. The aliases form a window of exactly $b$ digits | 2086 the need for an additional variable in the for loop. The variable $top$ (line @42,top@) is an alias |
2104 over the input. | 2087 for the leading digit while $bottom$ (line @45,bottom@) is an alias for the trailing edge. The aliases form a |
2088 window of exactly $b$ digits over the input. | |
2105 | 2089 |
2106 \subsection{Division by $x$} | 2090 \subsection{Division by $x$} |
2107 | 2091 |
2108 Division by powers of $x$ is easily achieved by shifting the digits right and removing any that will end up to the right of the zero'th digit. | 2092 Division by powers of $x$ is easily achieved by shifting the digits right and removing any that will end up to the right of the zero'th digit. |
2109 | 2093 |
2149 | 2133 |
2150 Once the window copy is complete the upper digits must be zeroed and the \textbf{used} count decremented. | 2134 Once the window copy is complete the upper digits must be zeroed and the \textbf{used} count decremented. |
2151 | 2135 |
2152 EXAM,bn_mp_rshd.c | 2136 EXAM,bn_mp_rshd.c |
2153 | 2137 |
2154 The only noteworthy element of this routine is the lack of a return type. | 2138 The only noteworthy element of this routine is the lack of a return type since it cannot fail. Like mp\_lshd() we |
2155 | 2139 form a sliding window except we copy in the other direction. After the window (line @59,for (;@) we then zero |
2156 -- Will update later to give it a return type...Tom | 2140 the upper digits of the input to make sure the result is correct. |
2157 | 2141 |
2158 \section{Powers of Two} | 2142 \section{Powers of Two} |
2159 | 2143 |
2160 Now that algorithms for moving single bits as well as whole digits exist algorithms for moving the ``in between'' distances are required. For | 2144 Now that algorithms for moving single bits as well as whole digits exist algorithms for moving the ``in between'' distances are required. For |
2161 example, to quickly multiply by $2^k$ for any $k$ without using a full multiplier algorithm would prove useful. Instead of performing single | 2145 example, to quickly multiply by $2^k$ for any $k$ without using a full multiplier algorithm would prove useful. Instead of performing single |
2212 This algorithm is loosely measured as a $O(2n)$ algorithm which means that if the input is $n$-digits that it takes $2n$ ``time'' to | 2196 This algorithm is loosely measured as a $O(2n)$ algorithm which means that if the input is $n$-digits that it takes $2n$ ``time'' to |
2213 complete. It is possible to optimize this algorithm down to a $O(n)$ algorithm at a cost of making the algorithm slightly harder to follow. | 2197 complete. It is possible to optimize this algorithm down to a $O(n)$ algorithm at a cost of making the algorithm slightly harder to follow. |
2214 | 2198 |
2215 EXAM,bn_mp_mul_2d.c | 2199 EXAM,bn_mp_mul_2d.c |
2216 | 2200 |
2217 Notes to be revised when code is updated. -- Tom | 2201 The shifting is performed in--place which means the first step (line @24,a != c@) is to copy the input to the |
2202 destination. We avoid calling mp\_copy() by making sure the mp\_ints are different. The destination then | |
2203 has to be grown (line @31,grow@) to accommodate the result. |
2204 | |
2205 If the shift count $b$ is larger than $lg(\beta)$ then a call to mp\_lshd() is used to handle all of the multiples | |
2206 of $lg(\beta)$, leaving a remaining shift of $lg(\beta) - 1$ or fewer bits. Inside the actual shift |
2207 loop (lines @45,if@ to @76,}@) we make use of pre--computed values $shift$ and $mask$. These are used to | |
2208 extract the carry bit(s) to pass into the next iteration of the loop. The $r$ and $rr$ variables form a | |
2209 chain between consecutive iterations to propagate the carry. | |
2218 | 2210 |
2219 \subsection{Division by Power of Two} | 2211 \subsection{Division by Power of Two} |
2220 | 2212 |
2221 \newpage\begin{figure}[!here] | 2213 \newpage\begin{figure}[!here] |
2222 \begin{small} | 2214 \begin{small} |
2261 The implementation of algorithm mp\_div\_2d is slightly different than the algorithm specifies. The remainder $d$ may be optionally | 2253 The implementation of algorithm mp\_div\_2d is slightly different than the algorithm specifies. The remainder $d$ may be optionally |
2262 ignored by passing \textbf{NULL} as the pointer to the mp\_int variable. The temporary mp\_int variable $t$ is used to hold the | 2254 ignored by passing \textbf{NULL} as the pointer to the mp\_int variable. The temporary mp\_int variable $t$ is used to hold the |
2263 result of the remainder operation until the end. This allows $d$ and $a$ to represent the same mp\_int without modifying $a$ before | 2255 result of the remainder operation until the end. This allows $d$ and $a$ to represent the same mp\_int without modifying $a$ before |
2264 the quotient is obtained. | 2256 the quotient is obtained. |
2265 | 2257 |
2266 The remainder of the source code is essentially the same as the source code for mp\_mul\_2d. | 2258 The remainder of the source code is essentially the same as the source code for mp\_mul\_2d. The only significant difference is |
2259 the direction of the shifts. |
2267 | 2260 |
2268 \subsection{Remainder of Division by Power of Two} | 2261 \subsection{Remainder of Division by Power of Two} |
2269 | 2262 |
2270 The last algorithm in the series of polynomial basis power of two algorithms is calculating the remainder of division by $2^b$. This | 2263 The last algorithm in the series of polynomial basis power of two algorithms is calculating the remainder of division by $2^b$. This |
2271 algorithm benefits from the fact that in twos complement arithmetic $a \mbox{ (mod }2^b\mbox{)}$ is the same as $a$ AND $2^b - 1$. | 2264 algorithm benefits from the fact that in twos complement arithmetic $a \mbox{ (mod }2^b\mbox{)}$ is the same as $a$ AND $2^b - 1$. |
2304 result is set to zero. If $b$ is greater than the number of bits in $a$ then it simply copies $a$ to $c$ and returns. Otherwise, $a$ | 2297 result is set to zero. If $b$ is greater than the number of bits in $a$ then it simply copies $a$ to $c$ and returns. Otherwise, $a$ |
2305 is copied to $c$, leading digits are removed and the remaining leading digit is trimmed to the exact bit count. | 2298 is copied to $c$, leading digits are removed and the remaining leading digit is trimmed to the exact bit count. |
2306 | 2299 |
2307 EXAM,bn_mp_mod_2d.c | 2300 EXAM,bn_mp_mod_2d.c |
2308 | 2301 |
2309 -- Add comments later, Tom. | 2302 We first avoid cases of $b \le 0$ by simply mp\_zero()'ing the destination in such cases. Next if $2^b$ is larger |
2303 than the input we just mp\_copy() the input and return right away. After this point we know we must actually | |
2304 perform some work to produce the remainder. | |
2305 | |
2306 Recalling that reducing modulo $2^k$ and a binary ``and'' with $2^k - 1$ are numerically equivalent we can quickly reduce | |
2307 the number. First we zero any digits above the last digit in $2^b$ (line @41,for@). Next we reduce the | |
2308 leading digit of both (line @45,&=@) and then mp\_clamp(). | |
2310 | 2309 |
2311 \section*{Exercises} | 2310 \section*{Exercises} |
2312 \begin{tabular}{cl} | 2311 \begin{tabular}{cl} |
2313 $\left [ 3 \right ] $ & Devise an algorithm that performs $a \cdot 2^b$ for generic values of $b$ \\ | 2312 $\left [ 3 \right ] $ & Devise an algorithm that performs $a \cdot 2^b$ for generic values of $b$ \\ |
2314 & in $O(n)$ time. \\ | 2313 & in $O(n)$ time. \\ |
2462 digit since that digit is assumed to be zero at this point. However, if $ix + pb \ge digs$ the carry is not set as it would make the result | 2461 digit since that digit is assumed to be zero at this point. However, if $ix + pb \ge digs$ the carry is not set as it would make the result |
2463 exceed the precision requested. | 2462 exceed the precision requested. |
2464 | 2463 |
2465 EXAM,bn_s_mp_mul_digs.c | 2464 EXAM,bn_s_mp_mul_digs.c |
2466 | 2465 |
2467 Lines @31,if@ to @35,}@ determine if the Comba method can be used first. The conditions for using the Comba routine are that min$(a.used, b.used) < \delta$ and | 2466 First we determine (line @30,if@) if the Comba method can be used since it is faster. The conditions for |
2468 the number of digits of output is less than \textbf{MP\_WARRAY}. This new constant is used to control | 2467 using the Comba routine are that min$(a.used, b.used) < \delta$ and the number of digits of output is less than |
2469 the stack usage in the Comba routines. By default it is set to $\delta$ but can be reduced when memory is at a premium. | 2468 \textbf{MP\_WARRAY}. This new constant is used to control the stack usage in the Comba routines. By default it is |
2470 | 2469 set to $\delta$ but can be reduced when memory is at a premium. |
2471 Of particular importance is the calculation of the $ix+iy$'th column on lines @64,mp_word@, @65,mp_word@ and @66,mp_word@. Note how all of the | 2470 |
2472 variables are cast to the type \textbf{mp\_word}, which is also the type of variable $\hat r$. That is to ensure that double precision operations | 2471 If we cannot use the Comba method we proceed to setup the baseline routine. We allocate the the destination mp\_int |
2473 are used instead of single precision. The multiplication on line @65,) * (@ makes use of a specific GCC optimizer behaviour. On the outset it looks like | 2472 $t$ (line @36,init@) to the exact size of the output to avoid further re--allocations. At this point we now |
2474 the compiler will have to use a double precision multiplication to produce the result required. Such an operation would be horribly slow on most | 2473 begin the $O(n^2)$ loop. |
2475 processors and drag this to a crawl. However, GCC is smart enough to realize that double wide output single precision multipliers can be used. For | 2474 |
2476 example, the instruction ``MUL'' on the x86 processor can multiply two 32-bit values and produce a 64-bit result. | 2475 This implementation of multiplication has the caveat that it can be trimmed to only produce a variable number of |
2476 digits as output. In each iteration of the outer loop the $pb$ variable is set (line @48,MIN@) to the maximum | |
2477 number of inner loop iterations. | |
2478 | |
2479 Inside the inner loop we calculate $\hat r$ as the mp\_word product of the two mp\_digits and the addition of the | |
2480 carry from the previous iteration. A particularly important observation is that most modern optimizing | |
2481 C compilers (GCC for instance) can recognize that a $N \times N \rightarrow 2N$ multiplication is all that | |
2482 is required for the product. In x86 terms for example, this means using the MUL instruction. | |
2483 | |
2484 Each digit of the product is stored in turn (line @68,tmpt@) and the carry propagated (line @71,>>@) to the | |
2485 next iteration. | |
2477 | 2486 |
2478 \subsection{Faster Multiplication by the ``Comba'' Method} | 2487 \subsection{Faster Multiplication by the ``Comba'' Method} |
2479 MARK,COMBA | 2488 MARK,COMBA |
2480 | 2489 |
2481 One of the huge drawbacks of the ``baseline'' algorithms is that at the $O(n^2)$ level the carry must be computed and propagated upwards. This | 2490 One of the huge drawbacks of the ``baseline'' algorithms is that at the $O(n^2)$ level the carry must be |
2482 makes the nested loop very sequential and hard to unroll and implement in parallel. The ``Comba'' \cite{COMBA} method is named after little known | 2491 computed and propagated upwards. This makes the nested loop very sequential and hard to unroll and implement |
2483 (\textit{in cryptographic venues}) Paul G. Comba who described a method of implementing fast multipliers that do not require nested | 2492 in parallel. The ``Comba'' \cite{COMBA} method is named after little known (\textit{in cryptographic venues}) Paul G. |
2484 carry fixup operations. As an interesting aside it seems that Paul Barrett describes a similar technique in | 2493 Comba who described a method of implementing fast multipliers that do not require nested carry fixup operations. As an |
2485 his 1986 paper \cite{BARRETT} written five years before. | 2494 interesting aside it seems that Paul Barrett describes a similar technique in his 1986 paper \cite{BARRETT} written |
2486 | 2495 five years before. |
2487 At the heart of the Comba technique is once again the long-hand algorithm. Except in this case a slight twist is placed on how | 2496 |
2488 the columns of the result are produced. In the standard long-hand algorithm rows of products are produced then added together to form the | 2497 At the heart of the Comba technique is once again the long-hand algorithm. Except in this case a slight |
2489 final result. In the baseline algorithm the columns are added together after each iteration to get the result instantaneously. | 2498 twist is placed on how the columns of the result are produced. In the standard long-hand algorithm rows of products |
2490 | 2499 are produced then added together to form the final result. In the baseline algorithm the columns are added together |
2491 In the Comba algorithm the columns of the result are produced entirely independently of each other. That is at the $O(n^2)$ level a | 2500 after each iteration to get the result instantaneously. |
2492 simple multiplication and addition step is performed. The carries of the columns are propagated after the nested loop to reduce the amount | 2501 |
2493 of work required. Succinctly, the first step of the algorithm is to compute the product vector $\vec x$ as follows. | 2502 In the Comba algorithm the columns of the result are produced entirely independently of each other. That is at |
2503 the $O(n^2)$ level a simple multiplication and addition step is performed. The carries of the columns are propagated | |
2504 after the nested loop to reduce the amount of work required. Succinctly, the first step of the algorithm is to compute |
2505 the product vector $\vec x$ as follows. | |
2494 | 2506 |
2495 \begin{equation} | 2507 \begin{equation} |
2496 \vec x_n = \sum_{i+j = n} a_ib_j, \forall n \in \lbrace 0, 1, 2, \ldots, i + j \rbrace | 2508 \vec x_n = \sum_{i+j = n} a_ib_j, \forall n \in \lbrace 0, 1, 2, \ldots, i + j \rbrace |
2497 \end{equation} | 2509 \end{equation} |
2498 | 2510 |
2582 \begin{tabular}{l} | 2594 \begin{tabular}{l} |
2583 \hline Algorithm \textbf{fast\_s\_mp\_mul\_digs}. \\ | 2595 \hline Algorithm \textbf{fast\_s\_mp\_mul\_digs}. \\ |
2584 \textbf{Input}. mp\_int $a$, mp\_int $b$ and an integer $digs$ \\ | 2596 \textbf{Input}. mp\_int $a$, mp\_int $b$ and an integer $digs$ \\ |
2585 \textbf{Output}. $c \leftarrow \vert a \vert \cdot \vert b \vert \mbox{ (mod }\beta^{digs}\mbox{)}$. \\ | 2597 \textbf{Output}. $c \leftarrow \vert a \vert \cdot \vert b \vert \mbox{ (mod }\beta^{digs}\mbox{)}$. \\ |
2586 \hline \\ | 2598 \hline \\ |
2587 Place an array of \textbf{MP\_WARRAY} double precision digits named $\hat W$ on the stack. \\ | 2599 Place an array of \textbf{MP\_WARRAY} single precision digits named $W$ on the stack. \\ |
2588 1. If $c.alloc < digs$ then grow $c$ to $digs$ digits. (\textit{mp\_grow}) \\ | 2600 1. If $c.alloc < digs$ then grow $c$ to $digs$ digits. (\textit{mp\_grow}) \\ |
2589 2. If step 1 failed return(\textit{MP\_MEM}).\\ | 2601 2. If step 1 failed return(\textit{MP\_MEM}).\\ |
2590 \\ | 2602 \\ |
2591 Zero the temporary array $\hat W$. \\ | 2603 3. $pa \leftarrow \mbox{MIN}(digs, a.used + b.used)$ \\ |
2592 3. for $n$ from $0$ to $digs - 1$ do \\ | |
2593 \hspace{3mm}3.1 $\hat W_n \leftarrow 0$ \\ | |
2594 \\ | 2604 \\ |
2595 Compute the columns. \\ | 2605 4. $\_ \hat W \leftarrow 0$ \\ |
2596 4. for $ix$ from $0$ to $a.used - 1$ do \\ | 2606 5. for $ix$ from 0 to $pa - 1$ do \\ |
2597 \hspace{3mm}4.1 $pb \leftarrow \mbox{min}(b.used, digs - ix)$ \\ | 2607 \hspace{3mm}5.1 $ty \leftarrow \mbox{MIN}(b.used - 1, ix)$ \\ |
2598 \hspace{3mm}4.2 If $pb < 1$ then goto step 5. \\ | 2608 \hspace{3mm}5.2 $tx \leftarrow ix - ty$ \\ |
2599 \hspace{3mm}4.3 for $iy$ from $0$ to $pb - 1$ do \\ | 2609 \hspace{3mm}5.3 $iy \leftarrow \mbox{MIN}(a.used - tx, ty + 1)$ \\ |
2600 \hspace{6mm}4.3.1 $\hat W_{ix+iy} \leftarrow \hat W_{ix+iy} + a_{ix}b_{iy}$ \\ | 2610 \hspace{3mm}5.4 for $iz$ from 0 to $iy - 1$ do \\ |
2611 \hspace{6mm}5.4.1 $\_ \hat W \leftarrow \_ \hat W + a_{tx+iy}b_{ty-iy}$ \\ | |
2612 \hspace{3mm}5.5 $W_{ix} \leftarrow \_ \hat W (\mbox{mod }\beta)$\\ | |
2613 \hspace{3mm}5.6 $\_ \hat W \leftarrow \lfloor \_ \hat W / \beta \rfloor$ \\ | |
2614 6. $W_{pa} \leftarrow \_ \hat W (\mbox{mod }\beta)$ \\ | |
2601 \\ | 2615 \\ |
2602 Propagate the carries upwards. \\ | 2616 7. $oldused \leftarrow c.used$ \\ |
2603 5. $oldused \leftarrow c.used$ \\ | 2617 8. $c.used \leftarrow digs$ \\ |
2604 6. $c.used \leftarrow digs$ \\ | 2618 9. for $ix$ from $0$ to $pa$ do \\ |
2605 7. If $digs > 1$ then do \\ | 2619 \hspace{3mm}9.1 $c_{ix} \leftarrow W_{ix}$ \\ |
2606 \hspace{3mm}7.1. for $ix$ from $1$ to $digs - 1$ do \\ | 2620 10. for $ix$ from $pa + 1$ to $oldused - 1$ do \\ |
2607 \hspace{6mm}7.1.1 $\hat W_{ix} \leftarrow \hat W_{ix} + \lfloor \hat W_{ix-1} / \beta \rfloor$ \\ | 2621 \hspace{3mm}10.1 $c_{ix} \leftarrow 0$ \\ |
2608 \hspace{6mm}7.1.2 $c_{ix - 1} \leftarrow \hat W_{ix - 1} \mbox{ (mod }\beta\mbox{)}$ \\ | |
2609 8. else do \\ | |
2610 \hspace{3mm}8.1 $ix \leftarrow 0$ \\ | |
2611 9. $c_{ix} \leftarrow \hat W_{ix} \mbox{ (mod }\beta\mbox{)}$ \\ | |
2612 \\ | 2622 \\ |
2613 Zero excess digits. \\ | 2623 11. Clamp $c$. \\ |
2614 10. If $digs < oldused$ then do \\ | 2624 12. Return MP\_OKAY. \\ |
2615 \hspace{3mm}10.1 for $n$ from $digs$ to $oldused - 1$ do \\ | |
2616 \hspace{6mm}10.1.1 $c_n \leftarrow 0$ \\ | |
2617 11. Clamp excessive digits of $c$. (\textit{mp\_clamp}) \\ | |
2618 12. Return(\textit{MP\_OKAY}). \\ | |
2619 \hline | 2625 \hline |
2620 \end{tabular} | 2626 \end{tabular} |
2621 \end{center} | 2627 \end{center} |
2622 \end{small} | 2628 \end{small} |
2623 \caption{Algorithm fast\_s\_mp\_mul\_digs} | 2629 \caption{Algorithm fast\_s\_mp\_mul\_digs} |
2624 \label{fig:COMBAMULT} | 2630 \label{fig:COMBAMULT} |
2625 \end{figure} | 2631 \end{figure} |
2626 | 2632 |
2627 \textbf{Algorithm fast\_s\_mp\_mul\_digs.} | 2633 \textbf{Algorithm fast\_s\_mp\_mul\_digs.} |
2628 This algorithm performs the unsigned multiplication of $a$ and $b$ using the Comba method limited to $digs$ digits of precision. The algorithm | 2634 This algorithm performs the unsigned multiplication of $a$ and $b$ using the Comba method limited to $digs$ digits of precision. |
2629 essentially performs the same calculation as algorithm s\_mp\_mul\_digs, just much faster. | 2635 |
2630 | 2636 The outer loop of this algorithm is more complicated than that of the baseline multiplier. This is because on the inside of the |
2631 The array $\hat W$ is meant to be on the stack when the algorithm is used. The size of the array does not change which is ideal. Note also that | 2637 loop we want to produce one column per pass. This allows the accumulator $\_ \hat W$ to be placed in CPU registers and |
2632 unlike algorithm s\_mp\_mul\_digs no temporary mp\_int is required since the result is calculated directly in $\hat W$. | 2638 reduce the memory bandwidth to two \textbf{mp\_digit} reads per iteration. |
2633 | 2639 |
2634 The $O(n^2)$ loop on step four is where the Comba method's advantages begin to show through in comparison to the baseline algorithm. The lack of | 2640 The $ty$ variable is set to the minimum count of $ix$ or the number of digits in $b$. That way if $a$ has more digits than |
2635 a carry variable or propagation in this loop allows the loop to be performed with only single precision multiplication and additions. Now that each | 2641 $b$ this will be limited to $b.used - 1$. The $tx$ variable is set to the distance past $b.used$ the variable |
2636 iteration of the inner loop can be performed independent of the others the inner loop can be performed with a high level of parallelism. | 2642 $ix$ is. This is used for the immediately subsequent statement where we find $iy$. |
2643 | |
2644 The variable $iy$ is the minimum digits we can read from either $a$ or $b$ before running out. Computing one column at a time | |
2645 means we have to scan one integer upwards and the other downwards. $a$ starts at $tx$ and $b$ starts at $ty$. In each | |
2646 pass we are producing the $ix$'th output column and we note that $tx + ty = ix$. As we move $tx$ upwards we have to | |
2647 move $ty$ downwards so the equality remains valid. The $iy$ variable is the number of iterations until |
2648 $tx \ge a.used$ or $ty < 0$ occurs. | |
2649 | |
2650 After every inner pass we store the lower half of the accumulator into $W_{ix}$ and then propagate the carry of the accumulator | |
2651 into the next round by dividing $\_ \hat W$ by $\beta$. | |
2637 | 2652 |
2638 To measure the benefits of the Comba method over the baseline method consider the number of operations that are required. If the | 2653 To measure the benefits of the Comba method over the baseline method consider the number of operations that are required. If the |
2639 cost in terms of time of a multiply and addition is $p$ and the cost of a carry propagation is $q$ then a baseline multiplication would require | 2654 cost in terms of time of a multiply and addition is $p$ and the cost of a carry propagation is $q$ then a baseline multiplication would require |
2640 $O \left ((p + q)n^2 \right )$ time to multiply two $n$-digit numbers. The Comba method requires only $O(pn^2 + qn)$ time, however in practice, | 2655 $O \left ((p + q)n^2 \right )$ time to multiply two $n$-digit numbers. The Comba method requires only $O(pn^2 + qn)$ time, however in practice, |
2641 the speed increase is actually much more. With $O(n)$ space the algorithm can be reduced to $O(pn + qn)$ time by implementing the $n$ multiply | 2656 the speed increase is actually much more. With $O(n)$ space the algorithm can be reduced to $O(pn + qn)$ time by implementing the $n$ multiply |
2642 and addition operations in the nested loop in parallel. | 2657 and addition operations in the nested loop in parallel. |
2643 | 2658 |
2644 EXAM,bn_fast_s_mp_mul_digs.c | 2659 EXAM,bn_fast_s_mp_mul_digs.c |
2645 | 2660 |
2646 The memset on line @47,memset@ clears the initial $\hat W$ array to zero in a single step. Like the slower baseline multiplication | 2661 As per the pseudo--code we first calculate $pa$ (line @47,MIN@) as the number of digits to output. Next we begin the outer loop |
2647 implementation a series of aliases (\textit{lines @67, tmpx@, @70, tmpy@ and @75,_W@}) are used to simplify the inner $O(n^2)$ loop. | 2662 to produce the individual columns of the product. We use the two aliases $tmpx$ and $tmpy$ (lines @61,tmpx@, @62,tmpy@) to point |
2648 In this case a new alias $\_\hat W$ has been added which refers to the double precision columns offset by $ix$ in each pass. | 2663 inside the two multiplicands quickly. |
2649 | 2664 |
2650 The inner loop on lines @83,for@, @84,mp_word@ and @85,}@ is where the algorithm will spend the majority of the time, which is why it has been | 2665 The inner loop (lines @70,for@ to @72,}@) of this implementation is where the tradeoff comes into play. Originally this Comba |
2651 stripped to the bones of any extra baggage\footnote{Hence the pointer aliases.}. On x86 processors the multiplication and additions amount to at the | 2666 implementation was ``row--major'' which means it adds to each of the columns in each pass. After the outer loop it would then fix |
2652 very least five instructions (\textit{two loads, two additions, one multiply}) while on the ARMv4 processors they amount to only three | 2667 the carries. This was very fast except it had an annoying drawback. You had to read an mp\_word and two mp\_digits and write |
2653 (\textit{one load, one store, one multiply-add}). For both of the x86 and ARMv4 processors the GCC compiler performs a good job at unrolling the loop | 2668 one mp\_word per iteration. On processors such as the Athlon XP and P4 this did not matter much since the cache bandwidth |
2654 and scheduling the instructions so there are very few dependency stalls. | 2669 is very high and it can keep the ALU fed with data. It did, however, matter on older and embedded CPUs where cache is often |
2655 | 2670 slower and also often doesn't exist. This new algorithm only performs two reads per iteration under the assumption that the |
2656 In theory the difference between the baseline and comba algorithms is a mere $O(qn)$ time difference. However, in the $O(n^2)$ nested loop of the | 2671 compiler has aliased $\_ \hat W$ to a CPU register. |
2657 baseline method there are dependency stalls as the algorithm must wait for the multiplier to finish before propagating the carry to the next | 2672 |
2658 digit. As a result fewer of the often multiple execution units\footnote{The AMD Athlon has three execution units and the Intel P4 has four.} can | 2673 After the inner loop we store the current accumulator in $W$ and shift $\_ \hat W$ (lines @75,W[ix]@, @78,>>@) to forward it as |
2659 be simultaneously used. | 2674 a carry for the next pass. After the outer loop we use the final carry (line @82,W[ix]@) as the last digit of the product. |
2660 | 2675 |
2661 \subsection{Polynomial Basis Multiplication} | 2676 \subsection{Polynomial Basis Multiplication} |
2662 To break the $O(n^2)$ barrier in multiplication requires a completely different look at integer multiplication. In the following algorithms | 2677 To break the $O(n^2)$ barrier in multiplication requires a completely different look at integer multiplication. In the following algorithms |
2663 the use of polynomial basis representation for two integers $a$ and $b$ as $f(x) = \sum_{i=0}^{n} a_i x^i$ and | 2678 the use of polynomial basis representation for two integers $a$ and $b$ as $f(x) = \sum_{i=0}^{n} a_i x^i$ and |
2664 $g(x) = \sum_{i=0}^{n} b_i x^i$ respectively, is required. In this system both $f(x)$ and $g(x)$ have $n + 1$ terms and are of the $n$'th degree. | 2679 $g(x) = \sum_{i=0}^{n} b_i x^i$ respectively, is required. In this system both $f(x)$ and $g(x)$ have $n + 1$ terms and are of the $n$'th degree. |
2974 Once the coefficients have been isolated, the polynomial $W(x) = \sum_{i=0}^{2n} w_i x^i$ is known. By substituting $\beta^{k}$ for $x$, the integer | 2989 Once the coefficients have been isolated, the polynomial $W(x) = \sum_{i=0}^{2n} w_i x^i$ is known. By substituting $\beta^{k}$ for $x$, the integer |
2975 result $a \cdot b$ is produced. | 2990 result $a \cdot b$ is produced. |
2976 | 2991 |
2977 EXAM,bn_mp_toom_mul.c | 2992 EXAM,bn_mp_toom_mul.c |
2978 | 2993 |
2979 -- Comments to be added during editing phase. | 2994 The first obvious thing to note is that this algorithm is complicated. The complexity is worth it if you are multiplying very |
2995 large numbers. For example, a 10,000 digit multiplication takes approximately 99,282,205 fewer single precision multiplications with |
2996 Toom--Cook than a Comba or baseline approach (this is a savings of more than 99$\%$). For most ``crypto'' sized numbers this | |
2997 algorithm is not practical as Karatsuba has a much lower cutoff point. | |
2998 | |
2999 First we split $a$ and $b$ into three roughly equal portions. This has been accomplished (lines @40,mod@ to @69,rshd@) with | |
3000 combinations of mp\_rshd() and mp\_mod\_2d() function calls. At this point $a = a2 \cdot \beta^2 + a1 \cdot \beta + a0$ and similarly |
3001 for $b$. | |
3002 | |
3003 Next we compute the five points $w0, w1, w2, w3$ and $w4$. Recall that $w0$ and $w4$ can be computed directly from the portions so | |
3004 we get those out of the way first (lines @72,mul@ and @77,mul@). Next we compute $w1, w2$ and $w3$ using Horner's method. |
3005 | |
3006 After this point we solve for the actual values of $w1, w2$ and $w3$ by reducing the $5 \times 5$ system which is relatively | |
3007 straightforward. |
2980 | 3008 |
2981 \subsection{Signed Multiplication} | 3009 \subsection{Signed Multiplication} |
2982 Now that algorithms to handle multiplications of every useful dimensions have been developed, a rather simple finishing touch is required. So far all | 3010 Now that algorithms to handle multiplications of every useful dimensions have been developed, a rather simple finishing touch is required. So far all |
2983 of the multiplication algorithms have been unsigned multiplications which leaves only a signed multiplication algorithm to be established. | 3011 of the multiplication algorithms have been unsigned multiplications which leaves only a signed multiplication algorithm to be established. |
2984 | 3012 |
2985 \newpage\begin{figure}[!here] | 3013 \begin{figure}[!here] |
2986 \begin{small} | 3014 \begin{small} |
2987 \begin{center} | 3015 \begin{center} |
2988 \begin{tabular}{l} | 3016 \begin{tabular}{l} |
2989 \hline Algorithm \textbf{mp\_mul}. \\ | 3017 \hline Algorithm \textbf{mp\_mul}. \\ |
2990 \textbf{Input}. mp\_int $a$ and mp\_int $b$ \\ | 3018 \textbf{Input}. mp\_int $a$ and mp\_int $b$ \\ |
3063 | 3091 |
3064 \subsection{The Baseline Squaring Algorithm} | 3092 \subsection{The Baseline Squaring Algorithm} |
3065 The baseline squaring algorithm is meant to be a catch-all squaring algorithm. It will handle any of the input sizes that the faster routines | 3093 The baseline squaring algorithm is meant to be a catch-all squaring algorithm. It will handle any of the input sizes that the faster routines |
3066 will not handle. | 3094 will not handle. |
3067 | 3095 |
3068 \newpage\begin{figure}[!here] | 3096 \begin{figure}[!here] |
3069 \begin{small} | 3097 \begin{small} |
3070 \begin{center} | 3098 \begin{center} |
3071 \begin{tabular}{l} | 3099 \begin{tabular}{l} |
3072 \hline Algorithm \textbf{s\_mp\_sqr}. \\ | 3100 \hline Algorithm \textbf{s\_mp\_sqr}. \\ |
3073 \textbf{Input}. mp\_int $a$ \\ | 3101 \textbf{Input}. mp\_int $a$ \\ |
Similar to algorithm s\_mp\_mul\_digs, after every pass of the inner loop, the destination is correctly set to the sum of all of the partial
results calculated so far. This involves expensive carry propagation which will be eliminated in the next algorithm.

EXAM,bn_s_mp_sqr.c

Inside the outer loop (line @32,for@) the square term is calculated on line @35,r =@. The carry (line @42,>>@) has been
extracted from the mp\_word accumulator using a right shift. Aliases for $a_{ix}$ and $t_{ix+iy}$ are initialized
(lines @45,tmpx@ and @48,tmpt@) to simplify the inner loop. The doubling is performed using two
additions (line @57,r + r@) since it is usually faster than shifting, if not at least as fast.

The important observation is that the inner loop does not begin at $iy = 0$ like for multiplication. As such the inner loops
get progressively shorter as the algorithm proceeds. This is what leads to the savings compared to using a multiplication to
square a number.
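The triangular loop structure can be sketched in C. This is a simplified illustration only, operating on flat arrays of 16-bit digits rather than mp\_int structures; the name base\_sqr and the fixed digit width are assumptions of the sketch, not the library interface of s\_mp\_sqr.

```c
#include <stdint.h>

/* Illustrative baseline squaring on 16-bit digits (hypothetical helper,
   not the library routine).  t receives the 2*used digits of a^2. */
void base_sqr(const uint16_t *a, int used, uint16_t *t)
{
    for (int i = 0; i < 2 * used; i++)
        t[i] = 0;
    for (int ix = 0; ix < used; ix++) {
        /* the single square term of this row */
        uint32_t r = (uint32_t)t[2 * ix] + (uint32_t)a[ix] * a[ix];
        t[2 * ix] = (uint16_t)r;
        uint32_t u = r >> 16;                /* carry extracted by right shift */
        /* double products: the inner loop starts at ix + 1, so each row
           is shorter than the last */
        for (int iy = ix + 1; iy < used; iy++) {
            uint32_t p = (uint32_t)a[ix] * a[iy];
            /* doubling via two additions (p + p) instead of a shift */
            uint64_t s = (uint64_t)t[ix + iy] + p + p + u;
            t[ix + iy] = (uint16_t)s;
            u = (uint32_t)(s >> 16);
        }
        /* propagate any remaining carry upwards */
        for (int k = ix + used; u != 0; k++) {
            uint64_t s = (uint64_t)t[k] + u;
            t[k] = (uint16_t)s;
            u = (uint32_t)(s >> 16);
        }
    }
}
```

For example, squaring the two-digit value $\left < 3, 5 \right >$ (that is $3 + 5\beta$ with $\beta = 2^{16}$) yields the digits $\left < 9, 30, 25, 0 \right >$, the expected $9 + 30\beta + 25\beta^2$.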

\subsection{Faster Squaring by the ``Comba'' Method}
A major drawback to the baseline method is the requirement for single precision shifting inside the $O(n^2)$ nested loop. Squaring has an additional
drawback that it must double the product inside the inner loop as well. As for multiplication, the Comba technique can be used to eliminate these
performance hazards.

The first obvious solution is to make an array of mp\_words which will hold all of the columns. This will indeed eliminate all of the carry
propagation operations from the inner loop. However, the inner product must still be doubled $O(n^2)$ times. The solution stems from the simple fact
that $2a + 2b + 2c = 2(a + b + c)$. That is, the sum of all of the double products is equal to double the sum of all the products. For example,
$ab + ba + ac + ca = 2ab + 2ac = 2(ab + ac)$.

However, we cannot simply double all of the columns, since the squares appear only once per row. The most practical solution is to have two
mp\_word arrays. One array will hold the squares and the other array will hold the double products. With both arrays the doubling and
carry propagation can be moved to an $O(n)$ work level outside the $O(n^2)$ level. In this case, we have an even simpler solution in mind.

\newpage\begin{figure}[!here]
\begin{small}
\begin{center}
\begin{tabular}{l}
\hline Algorithm \textbf{fast\_s\_mp\_sqr}. \\
\textbf{Input}. mp\_int $a$ \\
\textbf{Output}. $b \leftarrow a^2$ \\
\hline \\
Place an array of \textbf{MP\_WARRAY} mp\_digits named $W$ on the stack. \\
1. If $b.alloc < 2a.used + 1$ then grow $b$ to $2a.used + 1$ digits. (\textit{mp\_grow}). \\
2. If step 1 failed return(\textit{MP\_MEM}). \\
\\
3. $pa \leftarrow 2 \cdot a.used$ \\
4. $\hat W1 \leftarrow 0$ \\
5. for $ix$ from $0$ to $pa - 1$ do \\
\hspace{3mm}5.1 $\_ \hat W \leftarrow 0$ \\
\hspace{3mm}5.2 $ty \leftarrow \mbox{MIN}(a.used - 1, ix)$ \\
\hspace{3mm}5.3 $tx \leftarrow ix - ty$ \\
\hspace{3mm}5.4 $iy \leftarrow \mbox{MIN}(a.used - tx, ty + 1)$ \\
\hspace{3mm}5.5 $iy \leftarrow \mbox{MIN}(iy, \lfloor \left (ty - tx + 1 \right )/2 \rfloor)$ \\
\hspace{3mm}5.6 for $iz$ from $0$ to $iy - 1$ do \\
\hspace{6mm}5.6.1 $\_ \hat W \leftarrow \_ \hat W + a_{tx + iz}a_{ty - iz}$ \\
\hspace{3mm}5.7 $\_ \hat W \leftarrow 2 \cdot \_ \hat W + \hat W1$ \\
\hspace{3mm}5.8 if $ix$ is even then \\
\hspace{6mm}5.8.1 $\_ \hat W \leftarrow \_ \hat W + \left ( a_{\lfloor ix/2 \rfloor}\right )^2$ \\
\hspace{3mm}5.9 $W_{ix} \leftarrow \_ \hat W (\mbox{mod }\beta)$ \\
\hspace{3mm}5.10 $\hat W1 \leftarrow \lfloor \_ \hat W / \beta \rfloor$ \\
\\
6. $oldused \leftarrow b.used$ \\
7. $b.used \leftarrow 2 \cdot a.used$ \\
8. for $ix$ from $0$ to $pa - 1$ do \\
\hspace{3mm}8.1 $b_{ix} \leftarrow W_{ix}$ \\
9. for $ix$ from $pa$ to $oldused - 1$ do \\
\hspace{3mm}9.1 $b_{ix} \leftarrow 0$ \\
10. Clamp excess digits from $b$. (\textit{mp\_clamp}) \\
11. Return(\textit{MP\_OKAY}). \\
\hline
\end{tabular}
\end{center}
\end{small}
\caption{Algorithm fast\_s\_mp\_sqr}
\end{figure}

\textbf{Algorithm fast\_s\_mp\_sqr.}
This algorithm computes the square of an input using the Comba technique. It is designed to be a replacement for algorithm
s\_mp\_sqr when the number of input digits is less than \textbf{MP\_WARRAY} and less than $\delta \over 2$.
This algorithm is very similar to the Comba multiplier except for a few key differences we shall make note of.

First, we have separate accumulator and carry variables, $\_ \hat W$ and $\hat W1$ respectively. This is because the inner loop
products are to be doubled; if the previous carry were folded into the accumulator before the doubling, it would be doubled as well.
Next, an additional MIN condition is imposed on $iy$ (step 5.5) to prevent overlapping digits. For example, $a_3 \cdot a_5$ is equal to
$a_5 \cdot a_3$. In the multiplication case both products would be accumulated, but since the sum of the products is doubled
just outside the inner loop, each distinct pair must be accumulated only once. This is also a good thing since we perform
fewer multiplications and the routine ends up being faster.

Finally, the last difference is the addition of the ``square'' term outside the inner loop (step 5.8). The square is added
only to even columns and it is the square of the digit at the $\lfloor ix / 2 \rfloor$ position.

EXAM,bn_fast_s_mp_sqr.c

This implementation is essentially a copy of Comba multiplication with the appropriate changes added to make it faster for
the special case of squaring.
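The index juggling of steps 5.2 through 5.10 can be hard to follow from the pseudocode alone. The following sketch mirrors the column-by-column structure on 16-bit digits with a 64-bit accumulator standing in for the mp\_word type; the name comba\_sqr and the digit sizes are assumptions of the sketch, not the library routine.

```c
#include <stdint.h>

/* Illustrative Comba squaring on 16-bit digits (hypothetical helper).
   w receives the 2*used digits of a^2. */
void comba_sqr(const uint16_t *a, int used, uint16_t *w)
{
    int pa = 2 * used;
    uint64_t carry = 0;                              /* W1 in the pseudocode */
    for (int ix = 0; ix < pa; ix++) {
        uint64_t acc = 0;                            /* _W in the pseudocode */
        int ty = (used - 1 < ix) ? used - 1 : ix;    /* MIN(used - 1, ix)   */
        int tx = ix - ty;
        int iy = (used - tx < ty + 1) ? used - tx : ty + 1;
        /* extra MIN: accumulate each distinct pair only once */
        if (iy > (ty - tx + 1) / 2)
            iy = (ty - tx + 1) / 2;
        for (int iz = 0; iz < iy; iz++)
            acc += (uint32_t)a[tx + iz] * a[ty - iz];
        /* double the distinct products, then add the previous carry */
        acc = 2 * acc + carry;
        /* the lone square term lands only in even columns */
        if ((ix & 1) == 0)
            acc += (uint32_t)a[ix / 2] * a[ix / 2];
        w[ix] = (uint16_t)acc;                       /* _W mod beta          */
        carry = acc >> 16;                           /* floor(_W / beta)     */
    }
}
```

Note that the carry is added after the doubling, exactly as in step 5.7, so it is never doubled by mistake.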
\subsection{Polynomial Basis Squaring}
The same algorithm that performs optimal polynomial basis multiplication can be used to perform polynomial basis squaring. The minor exception
is that $\zeta_y = f(y)g(y)$ is actually equivalent to $\zeta_y = f(y)^2$ since $f(y) = g(y)$. Instead of performing $2n + 1$
multiplications to find the $\zeta$ relations, squaring operations are performed instead.

By inlining the copy and shift operations the cutoff point for Karatsuba multiplication can be lowered. On the Athlon the cutoff point
is exactly at the point where Comba squaring can no longer be used (\textit{128 digits}). On slower processors such as the Intel P4
it is actually below the Comba limit (\textit{at 110 digits}).

This routine uses the same error trap coding style as mp\_karatsuba\_sqr. As the temporary variables are initialized errors are
redirected to the error trap higher up. If the algorithm completes without error the error code is set to \textbf{MP\_OKAY} and
mp\_clears are executed normally.

\subsection{Toom-Cook Squaring}
The Toom-Cook squaring algorithm mp\_toom\_sqr is heavily based on the algorithm mp\_toom\_mul with the exception that squarings are used
instead of multiplications to find the five relations. The reader is encouraged to read the description of the latter algorithm and try to
derive their own Toom-Cook squaring algorithm.

\subsection{High Level Squaring}
\newpage\begin{figure}[!here]
\begin{small}
\section*{Exercises}
\begin{tabular}{cl}
$\left [ 3 \right ] $ & Devise an efficient algorithm for selection of the radix point to handle inputs \\
& that have a different number of digits in Karatsuba multiplication. \\
& \\
$\left [ 2 \right ] $ & In ~SQUARE~ the fact that every column of a squaring is made up \\
& of double products and at most one square is stated. Prove this statement. \\
& \\
$\left [ 3 \right ] $ & Prove the equation for Karatsuba squaring. \\
& \\
$\left [ 1 \right ] $ & Prove that Karatsuba squaring requires $O \left (n^{lg(3)} \right )$ time. \\
& \\
$\left [ 2 \right ] $ & Determine the minimal ratio between addition and multiplication clock cycles \\
& required for equation $6.7$ to be true. \\
& \\
$\left [ 3 \right ] $ & Implement a threaded version of Comba multiplication (and squaring) where you \\
& compute subsets of the columns in each thread. Determine a cutoff point where \\
& it is effective and add the logic to mp\_mul() and mp\_sqr(). \\
& \\
$\left [ 4 \right ] $ & Same as the previous but also modify the Karatsuba and Toom-Cook routines. You must \\
& increase the throughput of mp\_exptmod() for random odd moduli in the range \\
& $512 \ldots 4096$ bits significantly ($> 2x$) to complete this challenge. \\
& \\
\end{tabular}

\chapter{Modular Reduction}
MARK,REDUCTION
other forms of residues.

Modular reductions are normally used to create either finite groups, rings or fields. The most common usage for performance driven modular reductions
is in modular exponentiation algorithms. That is to compute $d = a^b \mbox{ (mod }c\mbox{)}$ as fast as possible. This operation is used in the
RSA and Diffie-Hellman public key algorithms, for example. Modular multiplication and squaring also appear as fundamental operations in
elliptic curve cryptographic algorithms. As will be discussed in the subsequent chapter there exist fast algorithms for computing modular
exponentiations without having to perform (\textit{in this example}) $b - 1$ multiplications. These algorithms will produce partial results in the
range $0 \le x < c^2$ which can be taken advantage of to create several efficient algorithms. They have also been used to create redundancy check
algorithms known as CRCs, error correction codes such as Reed-Solomon and to solve a variety of number theoretic problems.
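The shape of the problem can be seen in a single-precision sketch of binary exponentiation. The function name and the restriction $c < 2^{32}$ (so each product fits in 64 bits) are assumptions of the sketch, not part of the library; it merely shows that every partial result lies in $[0, c^2)$ before it is reduced.

```c
#include <stdint.h>

/* Square-and-multiply sketch of d = a^b (mod c) for c < 2^32.
   Every partial product is reduced immediately, so each intermediate
   value is in [0, c^2) before reduction. */
uint64_t expt_mod(uint64_t a, uint64_t b, uint64_t c)
{
    uint64_t d = 1 % c;            /* handles the degenerate c = 1 case */
    a %= c;
    while (b > 0) {
        if (b & 1)
            d = (d * a) % c;       /* multiply step */
        a = (a * a) % c;           /* square step */
        b >>= 1;
    }
    return d;
}
```

For a $k$-bit exponent this performs at most $2k$ modular reductions rather than $b - 1$ multiplications, which is why fast reduction of values below $c^2$ dominates the running time.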

\section{The Barrett Reduction}

\subsection{The Barrett Setup Algorithm}
In order to use algorithm mp\_reduce the value of $\mu$ must be calculated in advance. Ideally this value should be computed once and stored for
future use so that the Barrett algorithm can be used without delay.
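The idea scales down to a single-precision sketch. For a 16-bit modulus $m$ the constant $\mu = \lfloor 2^{32} / m \rfloor$ is computed once, after which any $x < m^2$ can be reduced with two multiplications and a couple of corrective subtractions. The names and digit sizes here are illustrative assumptions, not the mp\_reduce interface.

```c
#include <stdint.h>

/* Barrett reduction sketch for a 16-bit modulus m > 1 (hypothetical
   helpers).  mu = floor(2^32 / m) is the precomputed reciprocal. */
uint32_t barrett_setup16(uint32_t m)
{
    return (uint32_t)((((uint64_t)1) << 32) / m);
}

uint32_t barrett_reduce16(uint64_t x, uint32_t m, uint32_t mu)
{
    /* q underestimates floor(x / m) by at most 2 when x < m^2 */
    uint64_t q = (x * (uint64_t)mu) >> 32;
    uint64_t r = x - q * (uint64_t)m;
    while (r >= m)             /* at most a couple of corrective steps */
        r -= m;
    return (uint32_t)r;
}
```

The expensive division is paid only once, in the setup; each subsequent reduction costs two multiplications and a shift, which is the whole point of precomputing $\mu$.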

\newpage\begin{figure}[!here]
\begin{small}
\begin{center}
\begin{tabular}{l}
\hline Algorithm \textbf{mp\_reduce\_setup}. \\
\textbf{Input}. mp\_int $a$ ($a > 1$) \\
\section{Jacobi Symbol Computation}
To explain the Jacobi symbol we shall first discuss the Legendre symbol, from which the Jacobi symbol is
defined. The Legendre symbol determines whether or not an integer $a$ is a quadratic residue modulo an odd prime $p$. Numerically it is
equivalent to equation \ref{eqn:legendre}.

\begin{equation}
a^{(p-1)/2} \equiv \begin{array}{rl}
-1 & \mbox{if }a\mbox{ is a quadratic non-residue.} \\
0 & \mbox{if }p\mbox{ divides }a\mbox{.} \\
1 & \mbox{if }a\mbox{ is a quadratic residue}.