comparison tommath.tex @ 190:d8254fc979e9 libtommath-orig LTM_0.35

Initial import of libtommath 0.35
author Matt Johnston <matt@ucc.asn.au>
date Fri, 06 May 2005 08:59:30 +0000
parents d29b64170cf0
children
comparison
equal deleted inserted replaced
142:d29b64170cf0 190:d8254fc979e9
47 \def\gap{\vspace{0.5ex}} 47 \def\gap{\vspace{0.5ex}}
48 \makeindex 48 \makeindex
49 \begin{document} 49 \begin{document}
50 \frontmatter 50 \frontmatter
51 \pagestyle{empty} 51 \pagestyle{empty}
52 \title{Implementing Multiple Precision Arithmetic \\ ~ \\ Draft Edition } 52 \title{Multi--Precision Math}
53 \author{\mbox{ 53 \author{\mbox{
54 %\begin{small} 54 %\begin{small}
55 \begin{tabular}{c} 55 \begin{tabular}{c}
56 Tom St Denis \\ 56 Tom St Denis \\
57 Algonquin College \\ 57 Algonquin College \\
64 \end{tabular} 64 \end{tabular}
65 %\end{small} 65 %\end{small}
66 } 66 }
67 } 67 }
68 \maketitle 68 \maketitle
69 This text has been placed in the public domain. This text corresponds to the v0.30 release of the 69 This text has been placed in the public domain. This text corresponds to the v0.35 release of the
70 LibTomMath project. 70 LibTomMath project.
71 71
72 \begin{alltt} 72 \begin{alltt}
73 Tom St Denis 73 Tom St Denis
74 111 Banning Rd 74 111 Banning Rd
83 This text is formatted to the international B5 paper size of 176mm wide by 250mm tall using the \LaTeX{} 83 This text is formatted to the international B5 paper size of 176mm wide by 250mm tall using the \LaTeX{}
84 {\em book} macro package and the Perl {\em booker} package. 84 {\em book} macro package and the Perl {\em booker} package.
85 85
86 \tableofcontents 86 \tableofcontents
87 \listoffigures 87 \listoffigures
88 \chapter*{Prefaces to the Draft Edition} 88 \chapter*{Prefaces}
89 I started this text in April 2003 to complement my LibTomMath library. That is, explain how to implement the functions 89 When I tell people about my LibTom projects and that I release them as public domain they are often puzzled.
90 contained in LibTomMath. The goal is to have a textbook that any Computer Science student can use when implementing their 90 They ask why I did it and especially why I continue to work on them for free. The best I can explain it is ``Because I can.''
91 own multiple precision arithmetic. The plan I wanted to follow was flesh out all the 91 Which seems odd and perhaps too terse for adult conversation. I often qualify it with ``I am able, I am willing.'' which
92 ideas and concepts I had floating around in my head and then work on it afterwards refining a little bit at a time. Chance 92 perhaps explains it better. I am the first to admit there is not anything that special with what I have done. Perhaps
93 would have it that I ended up with my summer off from Algonquin College and I was given four months solid to work on the 93 others can see that too and then we would have a society to be proud of. My LibTom projects are what I am doing to give
94 text. 94 back to society in the form of tools and knowledge that can help others in their endeavours.
95 95
96 Choosing to not waste any time I dove right into the project even before my spring semester was finished. I wrote a bit 96 I started writing this book because it was the most logical task to further my goal of open academia. The LibTomMath source
97 off and on at first. The moment my exams were finished I jumped into long 12 to 16 hour days. The result after only 97 code itself was written to be easy to follow and learn from. There are times, however, where pure C source code does not
98 a couple of months was a ten chapter, three hundred page draft that I quickly had distributed to anyone who wanted 98 explain the algorithms properly. Hence this book. The book literally starts with the foundation of the library and works
99 to read it. I had Jean-Luc Cooke print copies for me and I brought them to Crypto'03 in Santa Barbara. So far I have 99 itself outwards to the more complicated algorithms. The use of both pseudo--code and verbatim source code provides a duality
100 managed to grab a certain level of attention having people from around the world ask me for copies of the text was certain 100 of ``theory'' and ``practice'' that the computer science students of the world shall appreciate. I never deviate too far
101 rewarding. 101 from relatively straightforward algebra and I hope that this book can be a valuable learning asset.
102 102
103 Now we are past December 2003. By this time I had pictured that I would have at least finished my second draft of the text. 103 This book and indeed much of the LibTom projects would not exist in their current form if it was not for a plethora
104 Currently I am far off from this goal. I've done partial re-writes of chapters one, two and three but they are not even 104 of kind people donating their time, resources and kind words to help support my work. Writing a text of significant
105 finished yet. I haven't given up on the project, only had some setbacks. First O'Reilly declined to publish the text then 105 length (along with the source code) is a tiresome and lengthy process. Currently the LibTom project is four years old,
106 Addison-Wesley and Greg is tried another which I don't know the name of. However, at this point I want to focus my energy 106 comprises of literally thousands of users and over 100,000 lines of source code, TeX and other material. People like Mads and Greg
107 onto finishing the book not securing a contract. 107 were there at the beginning to encourage me to work well. It is amazing how timely validation from others can boost morale to
108 108 continue the project. Definitely my parents were there for me by providing room and board during the many months of work in 2003.
109 So why am I writing this text? It seems like a lot of work right? Most certainly it is a lot of work writing a textbook. 109
110 Even the simplest introductory material has to be lined with references and figures. A lot of the text has to be re-written 110 To my many friends whom I have met through the years I thank you for the good times and the words of encouragement. I hope I
111 from point form to prose form to ensure an easier read. Why am I doing all this work for free then? Simple. My philosophy 111 honour your kind gestures with this project.
112 is quite simply ``Open Source. Open Academia. Open Minds'' which means that to achieve a goal of open minds, that is, 112
113 people willing to accept new ideas and explore the unknown you have to make available material they can access freely 113 Open Source. Open Academia. Open Minds.
114 without hinderance.
115
116 I've been writing free software since I was about sixteen but only recently have I hit upon software that people have come
117 to depend upon. I started LibTomCrypt in December 2001 and now several major companies use it as integral portions of their
118 software. Several educational institutions use it as a matter of course and many freelance developers use it as
119 part of their projects. To further my contributions I started the LibTomMath project in December 2002 aimed at providing
120 multiple precision arithmetic routines that students could learn from. That is write routines that are not only easy
121 to understand and follow but provide quite impressive performance considering they are all in standard portable ISO C.
122
123 The second leg of my philosophy is ``Open Academia'' which is where this textbook comes in. In the end, when all is
124 said and done the text will be useable by educational institutions as a reference on multiple precision arithmetic.
125
126 At this time I feel I should share a little information about myself. The most common question I was asked at
127 Crypto'03, perhaps just out of professional courtesy, was which school I either taught at or attended. The unfortunate
128 truth is that I neither teach at or attend a school of academic reputation. I'm currently at Algonquin College which
129 is what I'd like to call ``somewhat academic but mostly vocational'' college. In otherwords, job training.
130
131 I'm a 21 year old computer science student mostly self-taught in the areas I am aware of (which includes a half-dozen
132 computer science fields, a few fields of mathematics and some English). I look forward to teaching someday but I am
133 still far off from that goal.
134
135 Now it would be improper for me to not introduce the rest of the texts co-authors. While they are only contributing
136 corrections and editorial feedback their support has been tremendously helpful in presenting the concepts laid out
137 in the text so far. Greg has always been there for me. He has tracked my LibTom projects since their inception and even
138 sent cheques to help pay tuition from time to time. His background has provided a wonderful source to bounce ideas off
139 of and improve the quality of my writing. Mads is another fellow who has just ``been there''. I don't even recall what
140 his interest in the LibTom projects is but I'm definitely glad he has been around. His ability to catch logical errors
141 in my written English have saved me on several occasions to say the least.
142
143 What to expect next? Well this is still a rough draft. I've only had the chance to update a few chapters. However, I've
144 been getting the feeling that people are starting to use my text and I owe them some updated material. My current tenative
145 plan is to edit one chapter every two weeks starting January 4th. It seems insane but my lower course load at college
146 should provide ample time. By Crypto'04 I plan to have a 2nd draft of the text polished and ready to hand out to as many
147 people who will take it.
148 114
149 \begin{flushright} Tom St Denis \end{flushright} 115 \begin{flushright} Tom St Denis \end{flushright}
150 116
151 \newpage 117 \newpage
152 I found the opportunity to work with Tom appealing for several reasons, not only could I broaden my own horizons, but also 118 I found the opportunity to work with Tom appealing for several reasons, not only could I broaden my own horizons, but also
1043 051 \} 1009 051 \}
1044 052 #endif 1010 052 #endif
1045 \end{alltt} 1011 \end{alltt}
1046 \end{small} 1012 \end{small}
1047 1013
1048 A quick optimization is to first determine if a memory re-allocation is required at all. The if statement (line 23) checks 1014 A quick optimization is to first determine if a memory re-allocation is required at all. The if statement (line 24) checks
1049 if the \textbf{alloc} member of the mp\_int is smaller than the requested digit count. If the count is not larger than \textbf{alloc} 1015 if the \textbf{alloc} member of the mp\_int is smaller than the requested digit count. If the count is not larger than \textbf{alloc}
1050 the function skips the re-allocation part thus saving time. 1016 the function skips the re-allocation part thus saving time.
1051 1017
1052 When a re-allocation is performed it is turned into an optimal request to save time in the future. The requested digit count is 1018 When a re-allocation is performed it is turned into an optimal request to save time in the future. The requested digit count is
1053 padded upwards to 2nd multiple of \textbf{MP\_PREC} larger than \textbf{alloc} (line 26). The XREALLOC function is used 1019 padded upwards to 2nd multiple of \textbf{MP\_PREC} larger than \textbf{alloc} (line 26). The XREALLOC function is used
1588 \hspace{-5.1mm}{\bf File}: bn\_mp\_zero.c 1554 \hspace{-5.1mm}{\bf File}: bn\_mp\_zero.c
1589 \vspace{-3mm} 1555 \vspace{-3mm}
1590 \begin{alltt} 1556 \begin{alltt}
1591 016 1557 016
1592 017 /* set to zero */ 1558 017 /* set to zero */
1593 018 void 1559 018 void mp_zero (mp_int * a)
1594 019 mp_zero (mp_int * a) 1560 019 \{
1595 020 \{ 1561 020 int n;
1596 021 a->sign = MP_ZPOS; 1562 021 mp_digit *tmp;
1597 022 a->used = 0; 1563 022
1598 023 memset (a->dp, 0, sizeof (mp_digit) * a->alloc); 1564 023 a->sign = MP_ZPOS;
1599 024 \} 1565 024 a->used = 0;
1600 025 #endif 1566 025
1567 026 tmp = a->dp;
1568 027 for (n = 0; n < a->alloc; n++) \{
1569 028 *tmp++ = 0;
1570 029 \}
1571 030 \}
1572 031 #endif
1601 \end{alltt} 1573 \end{alltt}
1602 \end{small} 1574 \end{small}
1603 1575
1604 After the function is completed, all of the digits are zeroed, the \textbf{used} count is zeroed and the 1576 After the function is completed, all of the digits are zeroed, the \textbf{used} count is zeroed and the
1605 \textbf{sign} variable is set to \textbf{MP\_ZPOS}. 1577 \textbf{sign} variable is set to \textbf{MP\_ZPOS}.
1607 \section{Sign Manipulation} 1579 \section{Sign Manipulation}
1608 \subsection{Absolute Value} 1580 \subsection{Absolute Value}
1609 With the mp\_int representation of an integer, calculating the absolute value is trivial. The mp\_abs algorithm will compute 1581 With the mp\_int representation of an integer, calculating the absolute value is trivial. The mp\_abs algorithm will compute
1610 the absolute value of an mp\_int. 1582 the absolute value of an mp\_int.
1611 1583
1612 \newpage\begin{figure}[here] 1584 \begin{figure}[here]
1613 \begin{center} 1585 \begin{center}
1614 \begin{tabular}{l} 1586 \begin{tabular}{l}
1615 \hline Algorithm \textbf{mp\_abs}. \\ 1587 \hline Algorithm \textbf{mp\_abs}. \\
1616 \textbf{Input}. An mp\_int $a$ \\ 1588 \textbf{Input}. An mp\_int $a$ \\
1617 \textbf{Output}. Computes $b = \vert a \vert$ \\ 1589 \textbf{Output}. Computes $b = \vert a \vert$ \\
1660 037 \} 1632 037 \}
1661 038 #endif 1633 038 #endif
1662 \end{alltt} 1634 \end{alltt}
1663 \end{small} 1635 \end{small}
1664 1636
1637 This fairly trivial algorithm first eliminates non--required duplications (line 27) and then sets the
1638 \textbf{sign} flag to \textbf{MP\_ZPOS}.
1639
1665 \subsection{Integer Negation} 1640 \subsection{Integer Negation}
1666 With the mp\_int representation of an integer, calculating the negation is also trivial. The mp\_neg algorithm will compute 1641 With the mp\_int representation of an integer, calculating the negation is also trivial. The mp\_neg algorithm will compute
1667 the negative of an mp\_int input. 1642 the negative of an mp\_int input.
1668 1643
1669 \begin{figure}[here] 1644 \begin{figure}[here]
1700 016 1675 016
1701 017 /* b = -a */ 1676 017 /* b = -a */
1702 018 int mp_neg (mp_int * a, mp_int * b) 1677 018 int mp_neg (mp_int * a, mp_int * b)
1703 019 \{ 1678 019 \{
1704 020 int res; 1679 020 int res;
1705 021 if ((res = mp_copy (a, b)) != MP_OKAY) \{ 1680 021 if (a != b) \{
1706 022 return res; 1681 022 if ((res = mp_copy (a, b)) != MP_OKAY) \{
1707 023 \} 1682 023 return res;
1708 024 if (mp_iszero(b) != MP_YES) \{ 1683 024 \}
1709 025 b->sign = (a->sign == MP_ZPOS) ? MP_NEG : MP_ZPOS; 1684 025 \}
1710 026 \} 1685 026
1711 027 return MP_OKAY; 1686 027 if (mp_iszero(b) != MP_YES) \{
1712 028 \} 1687 028 b->sign = (a->sign == MP_ZPOS) ? MP_NEG : MP_ZPOS;
1713 029 #endif 1688 029 \} else \{
1689 030 b->sign = MP_ZPOS;
1690 031 \}
1691 032
1692 033 return MP_OKAY;
1693 034 \}
1694 035 #endif
1714 \end{alltt} 1695 \end{alltt}
1715 \end{small} 1696 \end{small}
1697
1698 Like mp\_abs() this function avoids non--required duplications (line 21) and then sets the sign. We
1699 have to make sure that only non--zero values get a \textbf{sign} of \textbf{MP\_NEG}. If the mp\_int is zero
1700 than the \textbf{sign} is hard--coded to \textbf{MP\_ZPOS}.
1716 1701
1717 \section{Small Constants} 1702 \section{Small Constants}
1718 \subsection{Setting Small Constants} 1703 \subsection{Setting Small Constants}
1719 Often a mp\_int must be set to a relatively small value such as $1$ or $2$. For these cases the mp\_set algorithm is useful. 1704 Often a mp\_int must be set to a relatively small value such as $1$ or $2$. For these cases the mp\_set algorithm is useful.
1720 1705
1721 \begin{figure}[here] 1706 \newpage\begin{figure}[here]
1722 \begin{center} 1707 \begin{center}
1723 \begin{tabular}{l} 1708 \begin{tabular}{l}
1724 \hline Algorithm \textbf{mp\_set}. \\ 1709 \hline Algorithm \textbf{mp\_set}. \\
1725 \textbf{Input}. An mp\_int $a$ and a digit $b$ \\ 1710 \textbf{Input}. An mp\_int $a$ and a digit $b$ \\
1726 \textbf{Output}. Make $a$ equivalent to $b$ \\ 1711 \textbf{Output}. Make $a$ equivalent to $b$ \\
1755 023 \} 1740 023 \}
1756 024 #endif 1741 024 #endif
1757 \end{alltt} 1742 \end{alltt}
1758 \end{small} 1743 \end{small}
1759 1744
1760 Line 20 calls mp\_zero() to clear the mp\_int and reset the sign. Line 21 copies the digit 1745 First we zero (line 20) the mp\_int to make sure that the other members are initialized for a
1761 into the least significant location. Note the usage of a new constant \textbf{MP\_MASK}. This constant is used to quickly 1746 small positive constant. mp\_zero() ensures that the \textbf{sign} is positive and the \textbf{used} count
1762 reduce an integer modulo $\beta$. Since $\beta$ is of the form $2^k$ for any suitable $k$ it suffices to perform a binary AND with 1747 is zero. Next we set the digit and reduce it modulo $\beta$ (line 21). After this step we have to
1763 $MP\_MASK = 2^k - 1$ to perform the reduction. Finally line 22 will set the \textbf{used} member with respect to the 1748 check if the resulting digit is zero or not. If it is not then we set the \textbf{used} count to one, otherwise
1764 digit actually set. This function will always make the integer positive. 1749 to zero.
1750
1751 We can quickly reduce modulo $\beta$ since it is of the form $2^k$ and a quick binary AND operation with
1752 $2^k - 1$ will perform the same operation.
1765 1753
1766 One important limitation of this function is that it will only set one digit. The size of a digit is not fixed, meaning source that uses 1754 One important limitation of this function is that it will only set one digit. The size of a digit is not fixed, meaning source that uses
1767 this function should take that into account. Only trivially small constants can be set using this function. 1755 this function should take that into account. Only trivially small constants can be set using this function.
1768 1756
1769 \subsection{Setting Large Constants} 1757 \subsection{Setting Large Constants}
1934 049 \} 1922 049 \}
1935 050 #endif 1923 050 #endif
1936 \end{alltt} 1924 \end{alltt}
1937 \end{small} 1925 \end{small}
1938 1926
1939 The two if statements on lines 24 and 28 compare the number of digits in the two inputs. These two are performed before all of the digits 1927 The two if statements (lines 24 and 28) compare the number of digits in the two inputs. These two are
1940 are compared since it is a very cheap test to perform and can potentially save considerable time. The implementation given is also not valid 1928 performed before all of the digits are compared since it is a very cheap test to perform and can potentially save
1941 without those two statements. $b.alloc$ may be smaller than $a.used$, meaning that undefined values will be read from $b$ past the end of the 1929 considerable time. The implementation given is also not valid without those two statements. $b.alloc$ may be
1942 array of digits. 1930 smaller than $a.used$, meaning that undefined values will be read from $b$ past the end of the array of digits.
1931
1932
1943 1933
1944 \subsection{Signed Comparisons} 1934 \subsection{Signed Comparisons}
1945 Comparing with sign considerations is also fairly critical in several routines (\textit{division for example}). Based on an unsigned magnitude 1935 Comparing with sign considerations is also fairly critical in several routines (\textit{division for example}). Based on an unsigned magnitude
1946 comparison a trivial signed comparison algorithm can be written. 1936 comparison a trivial signed comparison algorithm can be written.
1947 1937
1998 037 \} 1988 037 \}
1999 038 #endif 1989 038 #endif
2000 \end{alltt} 1990 \end{alltt}
2001 \end{small} 1991 \end{small}
2002 1992
2003 The two if statements on lines 22 and 23 perform the initial sign comparison. If the signs are not the equal then which ever 1993 The two if statements (lines 22 and 23) perform the initial sign comparison. If the signs are not the equal then which ever
2004 has the positive sign is larger. At line 31, the inputs are compared based on magnitudes. If the signs were both negative then 1994 has the positive sign is larger. The inputs are compared (line 31) based on magnitudes. If the signs were both
2005 the unsigned comparison is performed in the opposite direction (\textit{line 33}). Otherwise, the signs are assumed to 1995 negative then the unsigned comparison is performed in the opposite direction (line 33). Otherwise, the signs are assumed to
2006 be both positive and a forward direction unsigned comparison is performed. 1996 be both positive and a forward direction unsigned comparison is performed.
2007 1997
2008 \section*{Exercises} 1998 \section*{Exercises}
2009 \begin{tabular}{cl} 1999 \begin{tabular}{cl}
2010 $\left [ 2 \right ]$ & Modify algorithm mp\_set\_int to accept as input a variable length array of bits. \\ 2000 $\left [ 2 \right ]$ & Modify algorithm mp\_set\_int to accept as input a variable length array of bits. \\
2216 103 \} 2206 103 \}
2217 104 #endif 2207 104 #endif
2218 \end{alltt} 2208 \end{alltt}
2219 \end{small} 2209 \end{small}
2220 2210
2221 Lines 27 to 35 perform the initial sorting of the inputs and determine the $min$ and $max$ variables. Note that $x$ is a pointer to a 2211 We first sort (lines 27 to 35) the inputs based on magnitude and determine the $min$ and $max$ variables.
2222 mp\_int assigned to the largest input, in effect it is a local alias. Lines 37 to 42 ensure that the destination is grown to 2212 Note that $x$ is a pointer to an mp\_int assigned to the largest input, in effect it is a local alias. Next we
2223 accomodate the result of the addition. 2213 grow the destination (37 to 42) ensure that it can accomodate the result of the addition.
2224 2214
2225 Similar to the implementation of mp\_copy this function uses the braced code and local aliases coding style. The three aliases that are on 2215 Similar to the implementation of mp\_copy this function uses the braced code and local aliases coding style. The three aliases that are on
2226 lines 55, 58 and 61 represent the two inputs and destination variables respectively. These aliases are used to ensure the 2216 lines 55, 58 and 61 represent the two inputs and destination variables respectively. These aliases are used to ensure the
2227 compiler does not have to dereference $a$, $b$ or $c$ (respectively) to access the digits of the respective mp\_int. 2217 compiler does not have to dereference $a$, $b$ or $c$ (respectively) to access the digits of the respective mp\_int.
2228 2218
2229 The initial carry $u$ is cleared on line 64, note that $u$ is of type mp\_digit which ensures type compatibility within the 2219 The initial carry $u$ will be cleared (line 64), note that $u$ is of type mp\_digit which ensures type
2230 implementation. The initial addition loop begins on line 65 and ends on line 74. Similarly the conditional addition loop 2220 compatibility within the implementation. The initial addition (line 65 to 74) adds digits from
2231 begins on line 80 and ends on line 90. The addition is finished with the final carry being stored in $tmpc$ on line 93. 2221 both inputs until the smallest input runs out of digits. Similarly the conditional addition loop
2232 Note the ``++'' operator on the same line. After line 93 $tmpc$ will point to the $c.used$'th digit of the mp\_int $c$. This is useful 2222 (line 80 to 90) adds the remaining digits from the larger of the two inputs. The addition is finished
2233 for the next loop on lines 96 to 99 which set any old upper digits to zero. 2223 with the final carry being stored in $tmpc$ (line 93). Note the ``++'' operator within the same expression.
2224 After line 93, $tmpc$ will point to the $c.used$'th digit of the mp\_int $c$. This is useful
2225 for the next loop (line 96 to 99) which set any old upper digits to zero.
2234 2226
2235 \subsection{Low Level Subtraction} 2227 \subsection{Low Level Subtraction}
2236 The low level unsigned subtraction algorithm is very similar to the low level unsigned addition algorithm. The principle difference is that the 2228 The low level unsigned subtraction algorithm is very similar to the low level unsigned addition algorithm. The principle difference is that the
2237 unsigned subtraction algorithm requires the result to be positive. That is when computing $a - b$ the condition $\vert a \vert \ge \vert b\vert$ must 2229 unsigned subtraction algorithm requires the result to be positive. That is when computing $a - b$ the condition $\vert a \vert \ge \vert b\vert$ must
2238 be met for this algorithm to function properly. Keep in mind this low level algorithm is not meant to be used in higher level algorithms directly. 2230 be met for this algorithm to function properly. Keep in mind this low level algorithm is not meant to be used in higher level algorithms directly.
2243 the range $0 \le x < 2\beta$ for the algorithms to work correctly. However, it is allowable that a mp\_digit represent a larger range of values. For 2235 the range $0 \le x < 2\beta$ for the algorithms to work correctly. However, it is allowable that a mp\_digit represent a larger range of values. For
2244 this algorithm we will assume that the variable $\gamma$ represents the number of bits available in a 2236 this algorithm we will assume that the variable $\gamma$ represents the number of bits available in a
2245 mp\_digit (\textit{this implies $2^{\gamma} > \beta$}). 2237 mp\_digit (\textit{this implies $2^{\gamma} > \beta$}).
2246 2238
2247 For example, the default for LibTomMath is to use a ``unsigned long'' for the mp\_digit ``type'' while $\beta = 2^{28}$. In ISO C an ``unsigned long'' 2239 For example, the default for LibTomMath is to use a ``unsigned long'' for the mp\_digit ``type'' while $\beta = 2^{28}$. In ISO C an ``unsigned long''
2248 data type must be able to represent $0 \le x < 2^{32}$ meaning that in this case $\gamma = 32$. 2240 data type must be able to represent $0 \le x < 2^{32}$ meaning that in this case $\gamma \ge 32$.
2249 2241
2250 \newpage\begin{figure}[!here] 2242 \newpage\begin{figure}[!here]
2251 \begin{center} 2243 \begin{center}
2252 \begin{small} 2244 \begin{small}
2253 \begin{tabular}{l} 2245 \begin{tabular}{l}
2385 083 2377 083
2386 084 #endif 2378 084 #endif
2387 \end{alltt} 2379 \end{alltt}
2388 \end{small} 2380 \end{small}
2389 2381
2390 Line 24 and 25 perform the initial hardcoded sorting of the inputs. In reality the $min$ and $max$ variables are only aliases and are only 2382 Like low level addition we ``sort'' the inputs. Except in this case the sorting is hardcoded
2391 used to make the source code easier to read. Again the pointer alias optimization is used within this algorithm. Lines 41, 42 and 43 initialize the aliases for 2383 (lines 24 and 25). In reality the $min$ and $max$ variables are only aliases and are only
2392 $a$, $b$ and $c$ respectively. 2384 used to make the source code easier to read. Again the pointer alias optimization is used
2393 2385 within this algorithm. The aliases $tmpa$, $tmpb$ and $tmpc$ are initialized
2394 The first subtraction loop occurs on lines 46 through 60. The theory behind the subtraction loop is exactly the same as that for 2386 (lines 41, 42 and 43) for $a$, $b$ and $c$ respectively.
2395 the addition loop. As remarked earlier there is an implementation reason for using the ``awkward'' method of extracting the carry 2387
2396 (\textit{see line 56}). The traditional method for extracting the carry would be to shift by $lg(\beta)$ positions and logically AND 2388 The first subtraction loop (lines 46 through 60) subtract digits from both inputs until the smaller of
2397 the least significant bit. The AND operation is required because all of the bits above the $\lg(\beta)$'th bit will be set to one after a carry 2389 the two inputs has been exhausted. As remarked earlier there is an implementation reason for using the ``awkward''
2398 occurs from subtraction. This carry extraction requires two relatively cheap operations to extract the carry. The other method is to simply 2390 method of extracting the carry (line 56). The traditional method for extracting the carry would be to shift
2399 shift the most significant bit to the least significant bit thus extracting the carry with a single cheap operation. This optimization only works on 2391 by $lg(\beta)$ positions and logically AND the least significant bit. The AND operation is required because all of
2400 twos compliment machines which is a safe assumption to make. 2392 the bits above the $\lg(\beta)$'th bit will be set to one after a carry occurs from subtraction. This carry
2401 2393 extraction requires two relatively cheap operations to extract the carry. The other method is to simply shift the
2402 If $a$ has a larger magnitude than $b$ an additional loop (\textit{see lines 63 through 72}) is required to propagate the carry through 2394 most significant bit to the least significant bit thus extracting the carry with a single cheap operation. This
2403 $a$ and copy the result to $c$. 2395 optimization only works on twos compliment machines which is a safe assumption to make.
2396
2397 If $a$ has a larger magnitude than $b$ an additional loop (lines 63 through 72) is required to propagate
2398 the carry through $a$ and copy the result to $c$.
2404 2399
2405 \subsection{High Level Addition} 2400 \subsection{High Level Addition}
2406 Now that both lower level addition and subtraction algorithms have been established an effective high level signed addition algorithm can be 2401 Now that both lower level addition and subtraction algorithms have been established an effective high level signed addition algorithm can be
2407 established. This high level addition algorithm will be what other algorithms and developers will use to perform addition of mp\_int data 2402 established. This high level addition algorithm will be what other algorithms and developers will use to perform addition of mp\_int data
2408 types. 2403 types.
2983 061 \} 2978 061 \}
2984 062 #endif 2979 062 #endif
2985 \end{alltt} 2980 \end{alltt}
2986 \end{small} 2981 \end{small}
2987 2982
2988 The if statement on line 23 ensures that the $b$ variable is greater than zero. The \textbf{used} count is incremented by $b$ before 2983 The if statement (line 23) ensures that the $b$ variable is greater than zero since we do not interpret negative
2989 the copy loop begins. This elminates the need for an additional variable in the for loop. The variable $top$ on line 41 is an alias 2984 shift counts properly. The \textbf{used} count is incremented by $b$ before the copy loop begins. This elminates
2990 for the leading digit while $bottom$ on line 44 is an alias for the trailing edge. The aliases form a window of exactly $b$ digits 2985 the need for an additional variable in the for loop. The variable $top$ (line 41) is an alias
2991 over the input. 2986 for the leading digit while $bottom$ (line 44) is an alias for the trailing edge. The aliases form a
2987 window of exactly $b$ digits over the input.
2992 2988
2993 \subsection{Division by $x$} 2989 \subsection{Division by $x$}
2994 2990
2995 Division by powers of $x$ is easily achieved by shifting the digits right and removing any that will end up to the right of the zero'th digit. 2991 Division by powers of $x$ is easily achieved by shifting the digits right and removing any that will end up to the right of the zero'th digit.
2996 2992
3093 066 \} 3089 066 \}
3094 067 #endif 3090 067 #endif
3095 \end{alltt} 3091 \end{alltt}
3096 \end{small} 3092 \end{small}
3097 3093
3098 The only noteworthy element of this routine is the lack of a return type. 3094 The only noteworthy element of this routine is the lack of a return type since it cannot fail. Like mp\_lshd() we
3099 3095 form a sliding window except we copy in the other direction. After the window (line 59) we then zero
3100 -- Will update later to give it a return type...Tom 3096 the upper digits of the input to make sure the result is correct.
3101 3097
3102 \section{Powers of Two} 3098 \section{Powers of Two}
3103 3099
3104 Now that algorithms for moving single bits as well as whole digits exist algorithms for moving the ``in between'' distances are required. For 3100 Now that algorithms for moving single bits as well as whole digits exist algorithms for moving the ``in between'' distances are required. For
3105 example, to quickly multiply by $2^k$ for any $k$ without using a full multiplier algorithm would prove useful. Instead of performing single 3101 example, to quickly multiply by $2^k$ for any $k$ without using a full multiplier algorithm would prove useful. Instead of performing single
3226 079 \} 3222 079 \}
3227 080 #endif 3223 080 #endif
3228 \end{alltt} 3224 \end{alltt}
3229 \end{small} 3225 \end{small}
3230 3226
3231 Notes to be revised when code is updated. -- Tom 3227 The shifting is performed in--place which means the first step (line 24) is to copy the input to the
3228 destination. We avoid calling mp\_copy() by making sure the mp\_ints are different. The destination then
3229 has to be grown (line 31) to accomodate the result.
3230
3231 If the shift count $b$ is larger than $lg(\beta)$ then a call to mp\_lshd() is used to handle all of the multiples
3232 of $lg(\beta)$. Leaving only a remaining shift of $lg(\beta) - 1$ or fewer bits left. Inside the actual shift
3233 loop (lines 45 to 76) we make use of pre--computed values $shift$ and $mask$. These are used to
3234 extract the carry bit(s) to pass into the next iteration of the loop. The $r$ and $rr$ variables form a
3235 chain between consecutive iterations to propagate the carry.
3232 3236
3233 \subsection{Division by Power of Two} 3237 \subsection{Division by Power of Two}
3234 3238
3235 \newpage\begin{figure}[!here] 3239 \newpage\begin{figure}[!here]
3236 \begin{small} 3240 \begin{small}
3359 The implementation of algorithm mp\_div\_2d is slightly different than the algorithm specifies. The remainder $d$ may be optionally 3363 The implementation of algorithm mp\_div\_2d is slightly different than the algorithm specifies. The remainder $d$ may be optionally
3360 ignored by passing \textbf{NULL} as the pointer to the mp\_int variable. The temporary mp\_int variable $t$ is used to hold the 3364 ignored by passing \textbf{NULL} as the pointer to the mp\_int variable. The temporary mp\_int variable $t$ is used to hold the
3361 result of the remainder operation until the end. This allows $d$ and $a$ to represent the same mp\_int without modifying $a$ before 3365 result of the remainder operation until the end. This allows $d$ and $a$ to represent the same mp\_int without modifying $a$ before
3362 the quotient is obtained. 3366 the quotient is obtained.
3363 3367
3364 The remainder of the source code is essentially the same as the source code for mp\_mul\_2d. (-- Fix this paragraph up later, Tom). 3368 The remainder of the source code is essentially the same as the source code for mp\_mul\_2d. The only significant difference is
3369 the direction of the shifts.
3365 3370
3366 \subsection{Remainder of Division by Power of Two} 3371 \subsection{Remainder of Division by Power of Two}
3367 3372
3368 The last algorithm in the series of polynomial basis power of two algorithms is calculating the remainder of division by $2^b$. This 3373 The last algorithm in the series of polynomial basis power of two algorithms is calculating the remainder of division by $2^b$. This
3369 algorithm benefits from the fact that in twos complement arithmetic $a \mbox{ (mod }2^b\mbox{)}$ is the same as $a$ AND $2^b - 1$. 3374 algorithm benefits from the fact that in twos complement arithmetic $a \mbox{ (mod }2^b\mbox{)}$ is the same as $a$ AND $2^b - 1$.
3418 025 mp_zero (c); 3423 025 mp_zero (c);
3419 026 return MP_OKAY; 3424 026 return MP_OKAY;
3420 027 \} 3425 027 \}
3421 028 3426 028
3422 029 /* if the modulus is larger than the value than return */ 3427 029 /* if the modulus is larger than the value than return */
3423 030 if (b > (int) (a->used * DIGIT_BIT)) \{ 3428 030 if (b >= (int) (a->used * DIGIT_BIT)) \{
3424 031 res = mp_copy (a, c); 3429 031 res = mp_copy (a, c);
3425 032 return res; 3430 032 return res;
3426 033 \} 3431 033 \}
3427 034 3432 034
3428 035 /* copy */ 3433 035 /* copy */
3444 049 \} 3449 049 \}
3445 050 #endif 3450 050 #endif
3446 \end{alltt} 3451 \end{alltt}
3447 \end{small} 3452 \end{small}
3448 3453
3449 -- Add comments later, Tom. 3454 We first avoid cases of $b \le 0$ by simply mp\_zero()'ing the destination in such cases. Next if $2^b$ is larger
3455 than the input we just mp\_copy() the input and return right away. After this point we know we must actually
3456 perform some work to produce the remainder.
3457
3458 Recalling that reducing modulo $2^k$ and a binary ``and'' with $2^k - 1$ are numerically equivalent we can quickly reduce
3459 the number. First we zero any digits above the last digit in $2^b$ (line 41). Next we reduce the
3460 leading digit of both (line 45) and then mp\_clamp().
3450 3461
3451 \section*{Exercises} 3462 \section*{Exercises}
3452 \begin{tabular}{cl} 3463 \begin{tabular}{cl}
3453 $\left [ 3 \right ] $ & Devise an algorithm that performs $a \cdot 2^b$ for generic values of $b$ \\ 3464 $\left [ 3 \right ] $ & Devise an algorithm that performs $a \cdot 2^b$ for generic values of $b$ \\
3454 & in $O(n)$ time. \\ 3465 & in $O(n)$ time. \\
3609 016 3620 016
3610 017 /* multiplies |a| * |b| and only computes upto digs digits of result 3621 017 /* multiplies |a| * |b| and only computes upto digs digits of result
3611 018 * HAC pp. 595, Algorithm 14.12 Modified so you can control how 3622 018 * HAC pp. 595, Algorithm 14.12 Modified so you can control how
3612 019 * many digits of output are created. 3623 019 * many digits of output are created.
3613 020 */ 3624 020 */
3614 021 int 3625 021 int s_mp_mul_digs (mp_int * a, mp_int * b, mp_int * c, int digs)
3615 022 s_mp_mul_digs (mp_int * a, mp_int * b, mp_int * c, int digs) 3626 022 \{
3616 023 \{ 3627 023 mp_int t;
3617 024 mp_int t; 3628 024 int res, pa, pb, ix, iy;
3618 025 int res, pa, pb, ix, iy; 3629 025 mp_digit u;
3619 026 mp_digit u; 3630 026 mp_word r;
3620 027 mp_word r; 3631 027 mp_digit tmpx, *tmpt, *tmpy;
3621 028 mp_digit tmpx, *tmpt, *tmpy; 3632 028
3622 029 3633 029 /* can we use the fast multiplier? */
3623 030 /* can we use the fast multiplier? */ 3634 030 if (((digs) < MP_WARRAY) &&
3624 031 if (((digs) < MP_WARRAY) && 3635 031 MIN (a->used, b->used) <
3625 032 MIN (a->used, b->used) < 3636 032 (1 << ((CHAR_BIT * sizeof (mp_word)) - (2 * DIGIT_BIT)))) \{
3626 033 (1 << ((CHAR_BIT * sizeof (mp_word)) - (2 * DIGIT_BIT)))) \{ 3637 033 return fast_s_mp_mul_digs (a, b, c, digs);
3627 034 return fast_s_mp_mul_digs (a, b, c, digs); 3638 034 \}
3628 035 \} 3639 035
3629 036 3640 036 if ((res = mp_init_size (&t, digs)) != MP_OKAY) \{
3630 037 if ((res = mp_init_size (&t, digs)) != MP_OKAY) \{ 3641 037 return res;
3631 038 return res; 3642 038 \}
3632 039 \} 3643 039 t.used = digs;
3633 040 t.used = digs; 3644 040
3634 041 3645 041 /* compute the digits of the product directly */
3635 042 /* compute the digits of the product directly */ 3646 042 pa = a->used;
3636 043 pa = a->used; 3647 043 for (ix = 0; ix < pa; ix++) \{
3637 044 for (ix = 0; ix < pa; ix++) \{ 3648 044 /* set the carry to zero */
3638 045 /* set the carry to zero */ 3649 045 u = 0;
3639 046 u = 0; 3650 046
3640 047 3651 047 /* limit ourselves to making digs digits of output */
3641 048 /* limit ourselves to making digs digits of output */ 3652 048 pb = MIN (b->used, digs - ix);
3642 049 pb = MIN (b->used, digs - ix); 3653 049
3643 050 3654 050 /* setup some aliases */
3644 051 /* setup some aliases */ 3655 051 /* copy of the digit from a used within the nested loop */
3645 052 /* copy of the digit from a used within the nested loop */ 3656 052 tmpx = a->dp[ix];
3646 053 tmpx = a->dp[ix]; 3657 053
3647 054 3658 054 /* an alias for the destination shifted ix places */
3648 055 /* an alias for the destination shifted ix places */ 3659 055 tmpt = t.dp + ix;
3649 056 tmpt = t.dp + ix; 3660 056
3650 057 3661 057 /* an alias for the digits of b */
3651 058 /* an alias for the digits of b */ 3662 058 tmpy = b->dp;
3652 059 tmpy = b->dp; 3663 059
3653 060 3664 060 /* compute the columns of the output and propagate the carry */
3654 061 /* compute the columns of the output and propagate the carry */ 3665 061 for (iy = 0; iy < pb; iy++) \{
3655 062 for (iy = 0; iy < pb; iy++) \{ 3666 062 /* compute the column as a mp_word */
3656 063 /* compute the column as a mp_word */ 3667 063 r = ((mp_word)*tmpt) +
3657 064 r = ((mp_word)*tmpt) + 3668 064 ((mp_word)tmpx) * ((mp_word)*tmpy++) +
3658 065 ((mp_word)tmpx) * ((mp_word)*tmpy++) + 3669 065 ((mp_word) u);
3659 066 ((mp_word) u); 3670 066
3660 067 3671 067 /* the new column is the lower part of the result */
3661 068 /* the new column is the lower part of the result */ 3672 068 *tmpt++ = (mp_digit) (r & ((mp_word) MP_MASK));
3662 069 *tmpt++ = (mp_digit) (r & ((mp_word) MP_MASK)); 3673 069
3663 070 3674 070 /* get the carry word from the result */
3664 071 /* get the carry word from the result */ 3675 071 u = (mp_digit) (r >> ((mp_word) DIGIT_BIT));
3665 072 u = (mp_digit) (r >> ((mp_word) DIGIT_BIT)); 3676 072 \}
3666 073 \} 3677 073 /* set carry if it is placed below digs */
3667 074 /* set carry if it is placed below digs */ 3678 074 if (ix + iy < digs) \{
3668 075 if (ix + iy < digs) \{ 3679 075 *tmpt = u;
3669 076 *tmpt = u; 3680 076 \}
3670 077 \} 3681 077 \}
3671 078 \} 3682 078
3672 079 3683 079 mp_clamp (&t);
3673 080 mp_clamp (&t); 3684 080 mp_exch (&t, c);
3674 081 mp_exch (&t, c); 3685 081
3675 082 3686 082 mp_clear (&t);
3676 083 mp_clear (&t); 3687 083 return MP_OKAY;
3677 084 return MP_OKAY; 3688 084 \}
3678 085 \} 3689 085 #endif
3679 086 #endif
3680 \end{alltt} 3690 \end{alltt}
3681 \end{small} 3691 \end{small}
3682 3692
3683 Lines 31 to 35 determine if the Comba method can be used first. The conditions for using the Comba routine are that min$(a.used, b.used) < \delta$ and 3693 First we determine (line 30) if the Comba method can be used first since it's faster. The conditions for
3684 the number of digits of output is less than \textbf{MP\_WARRAY}. This new constant is used to control 3694 sing the Comba routine are that min$(a.used, b.used) < \delta$ and the number of digits of output is less than
3685 the stack usage in the Comba routines. By default it is set to $\delta$ but can be reduced when memory is at a premium. 3695 \textbf{MP\_WARRAY}. This new constant is used to control the stack usage in the Comba routines. By default it is
3686 3696 set to $\delta$ but can be reduced when memory is at a premium.
3687 Of particular importance is the calculation of the $ix+iy$'th column on lines 64, 65 and 66. Note how all of the 3697
3688 variables are cast to the type \textbf{mp\_word}, which is also the type of variable $\hat r$. That is to ensure that double precision operations 3698 If we cannot use the Comba method we proceed to setup the baseline routine. We allocate the the destination mp\_int
3689 are used instead of single precision. The multiplication on line 65 makes use of a specific GCC optimizer behaviour. On the outset it looks like 3699 $t$ (line 36) to the exact size of the output to avoid further re--allocations. At this point we now
3690 the compiler will have to use a double precision multiplication to produce the result required. Such an operation would be horribly slow on most 3700 begin the $O(n^2)$ loop.
3691 processors and drag this to a crawl. However, GCC is smart enough to realize that double wide output single precision multipliers can be used. For 3701
3692 example, the instruction ``MUL'' on the x86 processor can multiply two 32-bit values and produce a 64-bit result. 3702 This implementation of multiplication has the caveat that it can be trimmed to only produce a variable number of
3703 digits as output. In each iteration of the outer loop the $pb$ variable is set (line 48) to the maximum
3704 number of inner loop iterations.
3705
3706 Inside the inner loop we calculate $\hat r$ as the mp\_word product of the two mp\_digits and the addition of the
3707 carry from the previous iteration. A particularly important observation is that most modern optimizing
3708 C compilers (GCC for instance) can recognize that a $N \times N \rightarrow 2N$ multiplication is all that
3709 is required for the product. In x86 terms for example, this means using the MUL instruction.
3710
3711 Each digit of the product is stored in turn (line 68) and the carry propagated (line 71) to the
3712 next iteration.
3693 3713
3694 \subsection{Faster Multiplication by the ``Comba'' Method} 3714 \subsection{Faster Multiplication by the ``Comba'' Method}
3695 3715
3696 One of the huge drawbacks of the ``baseline'' algorithms is that at the $O(n^2)$ level the carry must be computed and propagated upwards. This 3716 One of the huge drawbacks of the ``baseline'' algorithms is that at the $O(n^2)$ level the carry must be
3697 makes the nested loop very sequential and hard to unroll and implement in parallel. The ``Comba'' \cite{COMBA} method is named after little known 3717 computed and propagated upwards. This makes the nested loop very sequential and hard to unroll and implement
3698 (\textit{in cryptographic venues}) Paul G. Comba who described a method of implementing fast multipliers that do not require nested 3718 in parallel. The ``Comba'' \cite{COMBA} method is named after little known (\textit{in cryptographic venues}) Paul G.
3699 carry fixup operations. As an interesting aside it seems that Paul Barrett describes a similar technique in 3719 Comba who described a method of implementing fast multipliers that do not require nested carry fixup operations. As an
3700 his 1986 paper \cite{BARRETT} written five years before. 3720 interesting aside it seems that Paul Barrett describes a similar technique in his 1986 paper \cite{BARRETT} written
3701 3721 five years before.
3702 At the heart of the Comba technique is once again the long-hand algorithm. Except in this case a slight twist is placed on how 3722
3703 the columns of the result are produced. In the standard long-hand algorithm rows of products are produced then added together to form the 3723 At the heart of the Comba technique is once again the long-hand algorithm. Except in this case a slight
3704 final result. In the baseline algorithm the columns are added together after each iteration to get the result instantaneously. 3724 twist is placed on how the columns of the result are produced. In the standard long-hand algorithm rows of products
3705 3725 are produced then added together to form the final result. In the baseline algorithm the columns are added together
3706 In the Comba algorithm the columns of the result are produced entirely independently of each other. That is at the $O(n^2)$ level a 3726 after each iteration to get the result instantaneously.
3707 simple multiplication and addition step is performed. The carries of the columns are propagated after the nested loop to reduce the amount 3727
3708 of work requiored. Succintly the first step of the algorithm is to compute the product vector $\vec x$ as follows. 3728 In the Comba algorithm the columns of the result are produced entirely independently of each other. That is at
3729 the $O(n^2)$ level a simple multiplication and addition step is performed. The carries of the columns are propagated
3730 after the nested loop to reduce the amount of work requiored. Succintly the first step of the algorithm is to compute
3731 the product vector $\vec x$ as follows.
3709 3732
3710 \begin{equation} 3733 \begin{equation}
3711 \vec x_n = \sum_{i+j = n} a_ib_j, \forall n \in \lbrace 0, 1, 2, \ldots, i + j \rbrace 3734 \vec x_n = \sum_{i+j = n} a_ib_j, \forall n \in \lbrace 0, 1, 2, \ldots, i + j \rbrace
3712 \end{equation} 3735 \end{equation}
3713 3736
3797 \begin{tabular}{l} 3820 \begin{tabular}{l}
3798 \hline Algorithm \textbf{fast\_s\_mp\_mul\_digs}. \\ 3821 \hline Algorithm \textbf{fast\_s\_mp\_mul\_digs}. \\
3799 \textbf{Input}. mp\_int $a$, mp\_int $b$ and an integer $digs$ \\ 3822 \textbf{Input}. mp\_int $a$, mp\_int $b$ and an integer $digs$ \\
3800 \textbf{Output}. $c \leftarrow \vert a \vert \cdot \vert b \vert \mbox{ (mod }\beta^{digs}\mbox{)}$. \\ 3823 \textbf{Output}. $c \leftarrow \vert a \vert \cdot \vert b \vert \mbox{ (mod }\beta^{digs}\mbox{)}$. \\
3801 \hline \\ 3824 \hline \\
3802 Place an array of \textbf{MP\_WARRAY} double precision digits named $\hat W$ on the stack. \\ 3825 Place an array of \textbf{MP\_WARRAY} single precision digits named $W$ on the stack. \\
3803 1. If $c.alloc < digs$ then grow $c$ to $digs$ digits. (\textit{mp\_grow}) \\ 3826 1. If $c.alloc < digs$ then grow $c$ to $digs$ digits. (\textit{mp\_grow}) \\
3804 2. If step 1 failed return(\textit{MP\_MEM}).\\ 3827 2. If step 1 failed return(\textit{MP\_MEM}).\\
3805 \\ 3828 \\
3806 Zero the temporary array $\hat W$. \\ 3829 3. $pa \leftarrow \mbox{MIN}(digs, a.used + b.used)$ \\
3807 3. for $n$ from $0$ to $digs - 1$ do \\
3808 \hspace{3mm}3.1 $\hat W_n \leftarrow 0$ \\
3809 \\ 3830 \\
3810 Compute the columns. \\ 3831 4. $\_ \hat W \leftarrow 0$ \\
3811 4. for $ix$ from $0$ to $a.used - 1$ do \\ 3832 5. for $ix$ from 0 to $pa - 1$ do \\
3812 \hspace{3mm}4.1 $pb \leftarrow \mbox{min}(b.used, digs - ix)$ \\ 3833 \hspace{3mm}5.1 $ty \leftarrow \mbox{MIN}(b.used - 1, ix)$ \\
3813 \hspace{3mm}4.2 If $pb < 1$ then goto step 5. \\ 3834 \hspace{3mm}5.2 $tx \leftarrow ix - ty$ \\
3814 \hspace{3mm}4.3 for $iy$ from $0$ to $pb - 1$ do \\ 3835 \hspace{3mm}5.3 $iy \leftarrow \mbox{MIN}(a.used - tx, ty + 1)$ \\
3815 \hspace{6mm}4.3.1 $\hat W_{ix+iy} \leftarrow \hat W_{ix+iy} + a_{ix}b_{iy}$ \\ 3836 \hspace{3mm}5.4 for $iz$ from 0 to $iy - 1$ do \\
3837 \hspace{6mm}5.4.1 $\_ \hat W \leftarrow \_ \hat W + a_{tx+iy}b_{ty-iy}$ \\
3838 \hspace{3mm}5.5 $W_{ix} \leftarrow \_ \hat W (\mbox{mod }\beta)$\\
3839 \hspace{3mm}5.6 $\_ \hat W \leftarrow \lfloor \_ \hat W / \beta \rfloor$ \\
3840 6. $W_{pa} \leftarrow \_ \hat W (\mbox{mod }\beta)$ \\
3816 \\ 3841 \\
3817 Propagate the carries upwards. \\ 3842 7. $oldused \leftarrow c.used$ \\
3818 5. $oldused \leftarrow c.used$ \\ 3843 8. $c.used \leftarrow digs$ \\
3819 6. $c.used \leftarrow digs$ \\ 3844 9. for $ix$ from $0$ to $pa$ do \\
3820 7. If $digs > 1$ then do \\ 3845 \hspace{3mm}9.1 $c_{ix} \leftarrow W_{ix}$ \\
3821 \hspace{3mm}7.1. for $ix$ from $1$ to $digs - 1$ do \\ 3846 10. for $ix$ from $pa + 1$ to $oldused - 1$ do \\
3822 \hspace{6mm}7.1.1 $\hat W_{ix} \leftarrow \hat W_{ix} + \lfloor \hat W_{ix-1} / \beta \rfloor$ \\ 3847 \hspace{3mm}10.1 $c_{ix} \leftarrow 0$ \\
3823 \hspace{6mm}7.1.2 $c_{ix - 1} \leftarrow \hat W_{ix - 1} \mbox{ (mod }\beta\mbox{)}$ \\
3824 8. else do \\
3825 \hspace{3mm}8.1 $ix \leftarrow 0$ \\
3826 9. $c_{ix} \leftarrow \hat W_{ix} \mbox{ (mod }\beta\mbox{)}$ \\
3827 \\ 3848 \\
3828 Zero excess digits. \\ 3849 11. Clamp $c$. \\
3829 10. If $digs < oldused$ then do \\ 3850 12. Return MP\_OKAY. \\
3830 \hspace{3mm}10.1 for $n$ from $digs$ to $oldused - 1$ do \\
3831 \hspace{6mm}10.1.1 $c_n \leftarrow 0$ \\
3832 11. Clamp excessive digits of $c$. (\textit{mp\_clamp}) \\
3833 12. Return(\textit{MP\_OKAY}). \\
3834 \hline 3851 \hline
3835 \end{tabular} 3852 \end{tabular}
3836 \end{center} 3853 \end{center}
3837 \end{small} 3854 \end{small}
3838 \caption{Algorithm fast\_s\_mp\_mul\_digs} 3855 \caption{Algorithm fast\_s\_mp\_mul\_digs}
3839 \label{fig:COMBAMULT} 3856 \label{fig:COMBAMULT}
3840 \end{figure} 3857 \end{figure}
3841 3858
3842 \textbf{Algorithm fast\_s\_mp\_mul\_digs.} 3859 \textbf{Algorithm fast\_s\_mp\_mul\_digs.}
3843 This algorithm performs the unsigned multiplication of $a$ and $b$ using the Comba method limited to $digs$ digits of precision. The algorithm 3860 This algorithm performs the unsigned multiplication of $a$ and $b$ using the Comba method limited to $digs$ digits of precision.
3844 essentially peforms the same calculation as algorithm s\_mp\_mul\_digs, just much faster. 3861
3845 3862 The outer loop of this algorithm is more complicated than that of the baseline multiplier. This is because on the inside of the
3846 The array $\hat W$ is meant to be on the stack when the algorithm is used. The size of the array does not change which is ideal. Note also that 3863 loop we want to produce one column per pass. This allows the accumulator $\_ \hat W$ to be placed in CPU registers and
3847 unlike algorithm s\_mp\_mul\_digs no temporary mp\_int is required since the result is calculated directly in $\hat W$. 3864 reduce the memory bandwidth to two \textbf{mp\_digit} reads per iteration.
3848 3865
3849 The $O(n^2)$ loop on step four is where the Comba method's advantages begin to show through in comparison to the baseline algorithm. The lack of 3866 The $ty$ variable is set to the minimum count of $ix$ or the number of digits in $b$. That way if $a$ has more digits than
3850 a carry variable or propagation in this loop allows the loop to be performed with only single precision multiplication and additions. Now that each 3867 $b$ this will be limited to $b.used - 1$. The $tx$ variable is set to the to the distance past $b.used$ the variable
3851 iteration of the inner loop can be performed independent of the others the inner loop can be performed with a high level of parallelism. 3868 $ix$ is. This is used for the immediately subsequent statement where we find $iy$.
3869
3870 The variable $iy$ is the minimum digits we can read from either $a$ or $b$ before running out. Computing one column at a time
3871 means we have to scan one integer upwards and the other downwards. $a$ starts at $tx$ and $b$ starts at $ty$. In each
3872 pass we are producing the $ix$'th output column and we note that $tx + ty = ix$. As we move $tx$ upwards we have to
3873 move $ty$ downards so the equality remains valid. The $iy$ variable is the number of iterations until
3874 $tx \ge a.used$ or $ty < 0$ occurs.
3875
3876 After every inner pass we store the lower half of the accumulator into $W_{ix}$ and then propagate the carry of the accumulator
3877 into the next round by dividing $\_ \hat W$ by $\beta$.
3852 3878
3853 To measure the benefits of the Comba method over the baseline method consider the number of operations that are required. If the 3879 To measure the benefits of the Comba method over the baseline method consider the number of operations that are required. If the
3854 cost in terms of time of a multiply and addition is $p$ and the cost of a carry propagation is $q$ then a baseline multiplication would require 3880 cost in terms of time of a multiply and addition is $p$ and the cost of a carry propagation is $q$ then a baseline multiplication would require
3855 $O \left ((p + q)n^2 \right )$ time to multiply two $n$-digit numbers. The Comba method requires only $O(pn^2 + qn)$ time, however in practice, 3881 $O \left ((p + q)n^2 \right )$ time to multiply two $n$-digit numbers. The Comba method requires only $O(pn^2 + qn)$ time, however in practice,
3856 the speed increase is actually much more. With $O(n)$ space the algorithm can be reduced to $O(pn + qn)$ time by implementing the $n$ multiply 3882 the speed increase is actually much more. With $O(n)$ space the algorithm can be reduced to $O(pn + qn)$ time by implementing the $n$ multiply
3875 028 * required for fast Barrett reduction). 3901 028 * required for fast Barrett reduction).
3876 029 * 3902 029 *
3877 030 * Based on Algorithm 14.12 on pp.595 of HAC. 3903 030 * Based on Algorithm 14.12 on pp.595 of HAC.
3878 031 * 3904 031 *
3879 032 */ 3905 032 */
3880 033 int 3906 033 int fast_s_mp_mul_digs (mp_int * a, mp_int * b, mp_int * c, int digs)
3881 034 fast_s_mp_mul_digs (mp_int * a, mp_int * b, mp_int * c, int digs) 3907 034 \{
3882 035 \{ 3908 035 int olduse, res, pa, ix, iz;
3883 036 int olduse, res, pa, ix, iz; 3909 036 mp_digit W[MP_WARRAY];
3884 037 mp_digit W[MP_WARRAY]; 3910 037 register mp_word _W;
3885 038 register mp_word _W; 3911 038
3886 039 3912 039 /* grow the destination as required */
3887 040 /* grow the destination as required */ 3913 040 if (c->alloc < digs) \{
3888 041 if (c->alloc < digs) \{ 3914 041 if ((res = mp_grow (c, digs)) != MP_OKAY) \{
3889 042 if ((res = mp_grow (c, digs)) != MP_OKAY) \{ 3915 042 return res;
3890 043 return res; 3916 043 \}
3891 044 \} 3917 044 \}
3892 045 \} 3918 045
3893 046 3919 046 /* number of output digits to produce */
3894 047 /* number of output digits to produce */ 3920 047 pa = MIN(digs, a->used + b->used);
3895 048 pa = MIN(digs, a->used + b->used); 3921 048
3896 049 3922 049 /* clear the carry */
3897 050 /* clear the carry */ 3923 050 _W = 0;
3898 051 _W = 0; 3924 051 for (ix = 0; ix < pa; ix++) \{
3899 052 for (ix = 0; ix <= pa; ix++) \{ 3925 052 int tx, ty;
3900 053 int tx, ty; 3926 053 int iy;
3901 054 int iy; 3927 054 mp_digit *tmpx, *tmpy;
3902 055 mp_digit *tmpx, *tmpy; 3928 055
3903 056 3929 056 /* get offsets into the two bignums */
3904 057 /* get offsets into the two bignums */ 3930 057 ty = MIN(b->used-1, ix);
3905 058 ty = MIN(b->used-1, ix); 3931 058 tx = ix - ty;
3906 059 tx = ix - ty; 3932 059
3907 060 3933 060 /* setup temp aliases */
3908 061 /* setup temp aliases */ 3934 061 tmpx = a->dp + tx;
3909 062 tmpx = a->dp + tx; 3935 062 tmpy = b->dp + ty;
3910 063 tmpy = b->dp + ty; 3936 063
3911 064 3937 064 /* this is the number of times the loop will iterrate, essentially
3912 065 /* this is the number of times the loop will iterrate, essentially its 3938 065 while (tx++ < a->used && ty-- >= 0) \{ ... \}
3913 3939 066 */
3914 066 while (tx++ < a->used && ty-- >= 0) \{ ... \} 3940 067 iy = MIN(a->used-tx, ty+1);
3915 067 */ 3941 068
3916 068 iy = MIN(a->used-tx, ty+1); 3942 069 /* execute loop */
3917 069 3943 070 for (iz = 0; iz < iy; ++iz) \{
3918 070 /* execute loop */ 3944 071 _W += ((mp_word)*tmpx++)*((mp_word)*tmpy--);
3919 071 for (iz = 0; iz < iy; ++iz) \{ 3945 072 \}
3920 072 _W += ((mp_word)*tmpx++)*((mp_word)*tmpy--); 3946 073
3921 073 \} 3947 074 /* store term */
3922 074 3948 075 W[ix] = ((mp_digit)_W) & MP_MASK;
3923 075 /* store term */ 3949 076
3924 076 W[ix] = ((mp_digit)_W) & MP_MASK; 3950 077 /* make next carry */
3925 077 3951 078 _W = _W >> ((mp_word)DIGIT_BIT);
3926 078 /* make next carry */ 3952 079 \}
3927 079 _W = _W >> ((mp_word)DIGIT_BIT); 3953 080
3928 080 \} 3954 081 /* store final carry */
3929 081 3955 082 W[ix] = (mp_digit)(_W & MP_MASK);
3930 082 /* setup dest */ 3956 083
3931 083 olduse = c->used; 3957 084 /* setup dest */
3932 084 c->used = digs; 3958 085 olduse = c->used;
3933 085 3959 086 c->used = pa;
3934 086 \{ 3960 087
3935 087 register mp_digit *tmpc; 3961 088 \{
3936 088 tmpc = c->dp; 3962 089 register mp_digit *tmpc;
3937 089 for (ix = 0; ix < digs; ix++) \{ 3963 090 tmpc = c->dp;
3938 090 /* now extract the previous digit [below the carry] */ 3964 091 for (ix = 0; ix < pa+1; ix++) \{
3939 091 *tmpc++ = W[ix]; 3965 092 /* now extract the previous digit [below the carry] */
3940 092 \} 3966 093 *tmpc++ = W[ix];
3941 093 3967 094 \}
3942 094 /* clear unused digits [that existed in the old copy of c] */ 3968 095
3943 095 for (; ix < olduse; ix++) \{ 3969 096 /* clear unused digits [that existed in the old copy of c] */
3944 096 *tmpc++ = 0; 3970 097 for (; ix < olduse; ix++) \{
3945 097 \} 3971 098 *tmpc++ = 0;
3946 098 \} 3972 099 \}
3947 099 mp_clamp (c); 3973 100 \}
3948 100 return MP_OKAY; 3974 101 mp_clamp (c);
3949 101 \} 3975 102 return MP_OKAY;
3950 102 #endif 3976 103 \}
3977 104 #endif
3951 \end{alltt} 3978 \end{alltt}
3952 \end{small} 3979 \end{small}
3953 3980
3954 The memset on line @47,memset@ clears the initial $\hat W$ array to zero in a single step. Like the slower baseline multiplication 3981 As per the pseudo--code we first calculate $pa$ (line 47) as the number of digits to output. Next we begin the outer loop
3955 implementation a series of aliases (\textit{lines 62, 63 and 76}) are used to simplify the inner $O(n^2)$ loop. 3982 to produce the individual columns of the product. We use the two aliases $tmpx$ and $tmpy$ (lines 61, 62) to point
3956 In this case a new alias $\_\hat W$ has been added which refers to the double precision columns offset by $ix$ in each pass. 3983 inside the two multiplicands quickly.
3957 3984
3958 The inner loop on lines 89, 79 and 80 is where the algorithm will spend the majority of the time, which is why it has been 3985 The inner loop (lines 70 to 72) of this implementation is where the tradeoff come into play. Originally this comba
3959 stripped to the bones of any extra baggage\footnote{Hence the pointer aliases.}. On x86 processors the multiplication and additions amount to at the 3986 implementation was ``row--major'' which means it adds to each of the columns in each pass. After the outer loop it would then fix
3960 very least five instructions (\textit{two loads, two additions, one multiply}) while on the ARMv4 processors they amount to only three 3987 the carries. This was very fast except it had an annoying drawback. You had to read a mp\_word and two mp\_digits and write
3961 (\textit{one load, one store, one multiply-add}). For both of the x86 and ARMv4 processors the GCC compiler performs a good job at unrolling the loop 3988 one mp\_word per iteration. On processors such as the Athlon XP and P4 this did not matter much since the cache bandwidth
3962 and scheduling the instructions so there are very few dependency stalls. 3989 is very high and it can keep the ALU fed with data. It did, however, matter on older and embedded cpus where cache is often
3963 3990 slower and also often doesn't exist. This new algorithm only performs two reads per iteration under the assumption that the
3964 In theory the difference between the baseline and comba algorithms is a mere $O(qn)$ time difference. However, in the $O(n^2)$ nested loop of the 3991 compiler has aliased $\_ \hat W$ to a CPU register.
3965 baseline method there are dependency stalls as the algorithm must wait for the multiplier to finish before propagating the carry to the next 3992
3966 digit. As a result fewer of the often multiple execution units\footnote{The AMD Athlon has three execution units and the Intel P4 has four.} can 3993 After the inner loop we store the current accumulator in $W$ and shift $\_ \hat W$ (lines 75, 78) to forward it as
3967 be simultaneously used. 3994 a carry for the next pass. After the outer loop we use the final carry (line 82) as the last digit of the product.
3968 3995
3969 \subsection{Polynomial Basis Multiplication} 3996 \subsection{Polynomial Basis Multiplication}
3970 To break the $O(n^2)$ barrier in multiplication requires a completely different look at integer multiplication. In the following algorithms 3997 To break the $O(n^2)$ barrier in multiplication requires a completely different look at integer multiplication. In the following algorithms
3971 the use of polynomial basis representation for two integers $a$ and $b$ as $f(x) = \sum_{i=0}^{n} a_i x^i$ and 3998 the use of polynomial basis representation for two integers $a$ and $b$ as $f(x) = \sum_{i=0}^{n} a_i x^i$ and
3972 $g(x) = \sum_{i=0}^{n} b_i x^i$ respectively, is required. In this system both $f(x)$ and $g(x)$ have $n + 1$ terms and are of the $n$'th degree. 3999 $g(x) = \sum_{i=0}^{n} b_i x^i$ respectively, is required. In this system both $f(x)$ and $g(x)$ have $n + 1$ terms and are of the $n$'th degree.
4439 \vspace{-3mm} 4466 \vspace{-3mm}
4440 \begin{alltt} 4467 \begin{alltt}
4441 016 4468 016
4442 017 /* multiplication using the Toom-Cook 3-way algorithm 4469 017 /* multiplication using the Toom-Cook 3-way algorithm
4443 018 * 4470 018 *
4444 019 * Much more complicated than Karatsuba but has a lower asymptotic running t 4471 019 * Much more complicated than Karatsuba but has a lower
4445 ime of 4472 020 * asymptotic running time of O(N**1.464). This algorithm is
4446 020 * O(N**1.464). This algorithm is only particularly useful on VERY large 4473 021 * only particularly useful on VERY large inputs
4447 021 * inputs (we're talking 1000s of digits here...). 4474 022 * (we're talking 1000s of digits here...).
4448 022 */ 4475 023 */
4449 023 int mp_toom_mul(mp_int *a, mp_int *b, mp_int *c) 4476 024 int mp_toom_mul(mp_int *a, mp_int *b, mp_int *c)
4450 024 \{ 4477 025 \{
4451 025 mp_int w0, w1, w2, w3, w4, tmp1, tmp2, a0, a1, a2, b0, b1, b2; 4478 026 mp_int w0, w1, w2, w3, w4, tmp1, tmp2, a0, a1, a2, b0, b1, b2;
4452 026 int res, B; 4479 027 int res, B;
4453 027 4480 028
4454 028 /* init temps */ 4481 029 /* init temps */
4455 029 if ((res = mp_init_multi(&w0, &w1, &w2, &w3, &w4, 4482 030 if ((res = mp_init_multi(&w0, &w1, &w2, &w3, &w4,
4456 030 &a0, &a1, &a2, &b0, &b1, 4483 031 &a0, &a1, &a2, &b0, &b1,
4457 031 &b2, &tmp1, &tmp2, NULL)) != MP_OKAY) \{ 4484 032 &b2, &tmp1, &tmp2, NULL)) != MP_OKAY) \{
4458 032 return res; 4485 033 return res;
4459 033 \} 4486 034 \}
4460 034 4487 035
4461 035 /* B */ 4488 036 /* B */
4462 036 B = MIN(a->used, b->used) / 3; 4489 037 B = MIN(a->used, b->used) / 3;
4463 037 4490 038
4464 038 /* a = a2 * B**2 + a1 * B + a0 */ 4491 039 /* a = a2 * B**2 + a1 * B + a0 */
4465 039 if ((res = mp_mod_2d(a, DIGIT_BIT * B, &a0)) != MP_OKAY) \{ 4492 040 if ((res = mp_mod_2d(a, DIGIT_BIT * B, &a0)) != MP_OKAY) \{
4466 040 goto ERR; 4493 041 goto ERR;
4467 041 \} 4494 042 \}
4468 042 4495 043
4469 043 if ((res = mp_copy(a, &a1)) != MP_OKAY) \{ 4496 044 if ((res = mp_copy(a, &a1)) != MP_OKAY) \{
4470 044 goto ERR; 4497 045 goto ERR;
4471 045 \} 4498 046 \}
4472 046 mp_rshd(&a1, B); 4499 047 mp_rshd(&a1, B);
4473 047 mp_mod_2d(&a1, DIGIT_BIT * B, &a1); 4500 048 mp_mod_2d(&a1, DIGIT_BIT * B, &a1);
4474 048 4501 049
4475 049 if ((res = mp_copy(a, &a2)) != MP_OKAY) \{ 4502 050 if ((res = mp_copy(a, &a2)) != MP_OKAY) \{
4476 050 goto ERR; 4503 051 goto ERR;
4477 051 \} 4504 052 \}
4478 052 mp_rshd(&a2, B*2); 4505 053 mp_rshd(&a2, B*2);
4479 053 4506 054
4480 054 /* b = b2 * B**2 + b1 * B + b0 */ 4507 055 /* b = b2 * B**2 + b1 * B + b0 */
4481 055 if ((res = mp_mod_2d(b, DIGIT_BIT * B, &b0)) != MP_OKAY) \{ 4508 056 if ((res = mp_mod_2d(b, DIGIT_BIT * B, &b0)) != MP_OKAY) \{
4482 056 goto ERR; 4509 057 goto ERR;
4483 057 \} 4510 058 \}
4484 058 4511 059
4485 059 if ((res = mp_copy(b, &b1)) != MP_OKAY) \{ 4512 060 if ((res = mp_copy(b, &b1)) != MP_OKAY) \{
4486 060 goto ERR; 4513 061 goto ERR;
4487 061 \} 4514 062 \}
4488 062 mp_rshd(&b1, B); 4515 063 mp_rshd(&b1, B);
4489 063 mp_mod_2d(&b1, DIGIT_BIT * B, &b1); 4516 064 mp_mod_2d(&b1, DIGIT_BIT * B, &b1);
4490 064 4517 065
4491 065 if ((res = mp_copy(b, &b2)) != MP_OKAY) \{ 4518 066 if ((res = mp_copy(b, &b2)) != MP_OKAY) \{
4492 066 goto ERR; 4519 067 goto ERR;
4493 067 \} 4520 068 \}
4494 068 mp_rshd(&b2, B*2); 4521 069 mp_rshd(&b2, B*2);
4495 069 4522 070
4496 070 /* w0 = a0*b0 */ 4523 071 /* w0 = a0*b0 */
4497 071 if ((res = mp_mul(&a0, &b0, &w0)) != MP_OKAY) \{ 4524 072 if ((res = mp_mul(&a0, &b0, &w0)) != MP_OKAY) \{
4498 072 goto ERR; 4525 073 goto ERR;
4499 073 \} 4526 074 \}
4500 074 4527 075
4501 075 /* w4 = a2 * b2 */ 4528 076 /* w4 = a2 * b2 */
4502 076 if ((res = mp_mul(&a2, &b2, &w4)) != MP_OKAY) \{ 4529 077 if ((res = mp_mul(&a2, &b2, &w4)) != MP_OKAY) \{
4503 077 goto ERR; 4530 078 goto ERR;
4504 078 \} 4531 079 \}
4505 079 4532 080
4506 080 /* w1 = (a2 + 2(a1 + 2a0))(b2 + 2(b1 + 2b0)) */ 4533 081 /* w1 = (a2 + 2(a1 + 2a0))(b2 + 2(b1 + 2b0)) */
4507 081 if ((res = mp_mul_2(&a0, &tmp1)) != MP_OKAY) \{ 4534 082 if ((res = mp_mul_2(&a0, &tmp1)) != MP_OKAY) \{
4508 082 goto ERR; 4535 083 goto ERR;
4509 083 \} 4536 084 \}
4510 084 if ((res = mp_add(&tmp1, &a1, &tmp1)) != MP_OKAY) \{ 4537 085 if ((res = mp_add(&tmp1, &a1, &tmp1)) != MP_OKAY) \{
4511 085 goto ERR; 4538 086 goto ERR;
4512 086 \} 4539 087 \}
4513 087 if ((res = mp_mul_2(&tmp1, &tmp1)) != MP_OKAY) \{ 4540 088 if ((res = mp_mul_2(&tmp1, &tmp1)) != MP_OKAY) \{
4514 088 goto ERR; 4541 089 goto ERR;
4515 089 \} 4542 090 \}
4516 090 if ((res = mp_add(&tmp1, &a2, &tmp1)) != MP_OKAY) \{ 4543 091 if ((res = mp_add(&tmp1, &a2, &tmp1)) != MP_OKAY) \{
4517 091 goto ERR; 4544 092 goto ERR;
4518 092 \} 4545 093 \}
4519 093 4546 094
4520 094 if ((res = mp_mul_2(&b0, &tmp2)) != MP_OKAY) \{ 4547 095 if ((res = mp_mul_2(&b0, &tmp2)) != MP_OKAY) \{
4521 095 goto ERR; 4548 096 goto ERR;
4522 096 \} 4549 097 \}
4523 097 if ((res = mp_add(&tmp2, &b1, &tmp2)) != MP_OKAY) \{ 4550 098 if ((res = mp_add(&tmp2, &b1, &tmp2)) != MP_OKAY) \{
4524 098 goto ERR; 4551 099 goto ERR;
4525 099 \} 4552 100 \}
4526 100 if ((res = mp_mul_2(&tmp2, &tmp2)) != MP_OKAY) \{ 4553 101 if ((res = mp_mul_2(&tmp2, &tmp2)) != MP_OKAY) \{
4527 101 goto ERR; 4554 102 goto ERR;
4528 102 \} 4555 103 \}
4529 103 if ((res = mp_add(&tmp2, &b2, &tmp2)) != MP_OKAY) \{ 4556 104 if ((res = mp_add(&tmp2, &b2, &tmp2)) != MP_OKAY) \{
4530 104 goto ERR; 4557 105 goto ERR;
4531 105 \} 4558 106 \}
4532 106 4559 107
4533 107 if ((res = mp_mul(&tmp1, &tmp2, &w1)) != MP_OKAY) \{ 4560 108 if ((res = mp_mul(&tmp1, &tmp2, &w1)) != MP_OKAY) \{
4534 108 goto ERR; 4561 109 goto ERR;
4535 109 \} 4562 110 \}
4536 110 4563 111
4537 111 /* w3 = (a0 + 2(a1 + 2a2))(b0 + 2(b1 + 2b2)) */ 4564 112 /* w3 = (a0 + 2(a1 + 2a2))(b0 + 2(b1 + 2b2)) */
4538 112 if ((res = mp_mul_2(&a2, &tmp1)) != MP_OKAY) \{ 4565 113 if ((res = mp_mul_2(&a2, &tmp1)) != MP_OKAY) \{
4539 113 goto ERR; 4566 114 goto ERR;
4540 114 \} 4567 115 \}
4541 115 if ((res = mp_add(&tmp1, &a1, &tmp1)) != MP_OKAY) \{ 4568 116 if ((res = mp_add(&tmp1, &a1, &tmp1)) != MP_OKAY) \{
4542 116 goto ERR; 4569 117 goto ERR;
4543 117 \} 4570 118 \}
4544 118 if ((res = mp_mul_2(&tmp1, &tmp1)) != MP_OKAY) \{ 4571 119 if ((res = mp_mul_2(&tmp1, &tmp1)) != MP_OKAY) \{
4545 119 goto ERR; 4572 120 goto ERR;
4546 120 \} 4573 121 \}
4547 121 if ((res = mp_add(&tmp1, &a0, &tmp1)) != MP_OKAY) \{ 4574 122 if ((res = mp_add(&tmp1, &a0, &tmp1)) != MP_OKAY) \{
4548 122 goto ERR; 4575 123 goto ERR;
4549 123 \} 4576 124 \}
4550 124 4577 125
4551 125 if ((res = mp_mul_2(&b2, &tmp2)) != MP_OKAY) \{ 4578 126 if ((res = mp_mul_2(&b2, &tmp2)) != MP_OKAY) \{
4552 126 goto ERR; 4579 127 goto ERR;
4553 127 \} 4580 128 \}
4554 128 if ((res = mp_add(&tmp2, &b1, &tmp2)) != MP_OKAY) \{ 4581 129 if ((res = mp_add(&tmp2, &b1, &tmp2)) != MP_OKAY) \{
4555 129 goto ERR; 4582 130 goto ERR;
4556 130 \} 4583 131 \}
4557 131 if ((res = mp_mul_2(&tmp2, &tmp2)) != MP_OKAY) \{ 4584 132 if ((res = mp_mul_2(&tmp2, &tmp2)) != MP_OKAY) \{
4558 132 goto ERR; 4585 133 goto ERR;
4559 133 \} 4586 134 \}
4560 134 if ((res = mp_add(&tmp2, &b0, &tmp2)) != MP_OKAY) \{ 4587 135 if ((res = mp_add(&tmp2, &b0, &tmp2)) != MP_OKAY) \{
4561 135 goto ERR; 4588 136 goto ERR;
4562 136 \} 4589 137 \}
4563 137 4590 138
4564 138 if ((res = mp_mul(&tmp1, &tmp2, &w3)) != MP_OKAY) \{ 4591 139 if ((res = mp_mul(&tmp1, &tmp2, &w3)) != MP_OKAY) \{
4565 139 goto ERR; 4592 140 goto ERR;
4566 140 \} 4593 141 \}
4567 141 4594 142
4568 142 4595 143
4569 143 /* w2 = (a2 + a1 + a0)(b2 + b1 + b0) */ 4596 144 /* w2 = (a2 + a1 + a0)(b2 + b1 + b0) */
4570 144 if ((res = mp_add(&a2, &a1, &tmp1)) != MP_OKAY) \{ 4597 145 if ((res = mp_add(&a2, &a1, &tmp1)) != MP_OKAY) \{
4571 145 goto ERR; 4598 146 goto ERR;
4572 146 \} 4599 147 \}
4573 147 if ((res = mp_add(&tmp1, &a0, &tmp1)) != MP_OKAY) \{ 4600 148 if ((res = mp_add(&tmp1, &a0, &tmp1)) != MP_OKAY) \{
4574 148 goto ERR; 4601 149 goto ERR;
4575 149 \} 4602 150 \}
4576 150 if ((res = mp_add(&b2, &b1, &tmp2)) != MP_OKAY) \{ 4603 151 if ((res = mp_add(&b2, &b1, &tmp2)) != MP_OKAY) \{
4577 151 goto ERR; 4604 152 goto ERR;
4578 152 \} 4605 153 \}
4579 153 if ((res = mp_add(&tmp2, &b0, &tmp2)) != MP_OKAY) \{ 4606 154 if ((res = mp_add(&tmp2, &b0, &tmp2)) != MP_OKAY) \{
4580 154 goto ERR; 4607 155 goto ERR;
4581 155 \} 4608 156 \}
4582 156 if ((res = mp_mul(&tmp1, &tmp2, &w2)) != MP_OKAY) \{ 4609 157 if ((res = mp_mul(&tmp1, &tmp2, &w2)) != MP_OKAY) \{
4583 157 goto ERR; 4610 158 goto ERR;
4584 158 \} 4611 159 \}
4585 159 4612 160
4586 160 /* now solve the matrix 4613 161 /* now solve the matrix
4587 161 4614 162
4588 162 0 0 0 0 1 4615 163 0 0 0 0 1
4589 163 1 2 4 8 16 4616 164 1 2 4 8 16
4590 164 1 1 1 1 1 4617 165 1 1 1 1 1
4591 165 16 8 4 2 1 4618 166 16 8 4 2 1
4592 166 1 0 0 0 0 4619 167 1 0 0 0 0
4593 167 4620 168
4594 168 using 12 subtractions, 4 shifts, 4621 169 using 12 subtractions, 4 shifts,
4595 169 2 small divisions and 1 small multiplication 4622 170 2 small divisions and 1 small multiplication
4596 170 */ 4623 171 */
4597 171 4624 172
4598 172 /* r1 - r4 */ 4625 173 /* r1 - r4 */
4599 173 if ((res = mp_sub(&w1, &w4, &w1)) != MP_OKAY) \{ 4626 174 if ((res = mp_sub(&w1, &w4, &w1)) != MP_OKAY) \{
4600 174 goto ERR; 4627 175 goto ERR;
4601 175 \} 4628 176 \}
4602 176 /* r3 - r0 */ 4629 177 /* r3 - r0 */
4603 177 if ((res = mp_sub(&w3, &w0, &w3)) != MP_OKAY) \{ 4630 178 if ((res = mp_sub(&w3, &w0, &w3)) != MP_OKAY) \{
4604 178 goto ERR; 4631 179 goto ERR;
4605 179 \} 4632 180 \}
4606 180 /* r1/2 */ 4633 181 /* r1/2 */
4607 181 if ((res = mp_div_2(&w1, &w1)) != MP_OKAY) \{ 4634 182 if ((res = mp_div_2(&w1, &w1)) != MP_OKAY) \{
4608 182 goto ERR; 4635 183 goto ERR;
4609 183 \} 4636 184 \}
4610 184 /* r3/2 */ 4637 185 /* r3/2 */
4611 185 if ((res = mp_div_2(&w3, &w3)) != MP_OKAY) \{ 4638 186 if ((res = mp_div_2(&w3, &w3)) != MP_OKAY) \{
4612 186 goto ERR; 4639 187 goto ERR;
4613 187 \} 4640 188 \}
4614 188 /* r2 - r0 - r4 */ 4641 189 /* r2 - r0 - r4 */
4615 189 if ((res = mp_sub(&w2, &w0, &w2)) != MP_OKAY) \{ 4642 190 if ((res = mp_sub(&w2, &w0, &w2)) != MP_OKAY) \{
4616 190 goto ERR; 4643 191 goto ERR;
4617 191 \} 4644 192 \}
4618 192 if ((res = mp_sub(&w2, &w4, &w2)) != MP_OKAY) \{ 4645 193 if ((res = mp_sub(&w2, &w4, &w2)) != MP_OKAY) \{
4619 193 goto ERR; 4646 194 goto ERR;
4620 194 \} 4647 195 \}
4621 195 /* r1 - r2 */ 4648 196 /* r1 - r2 */
4622 196 if ((res = mp_sub(&w1, &w2, &w1)) != MP_OKAY) \{ 4649 197 if ((res = mp_sub(&w1, &w2, &w1)) != MP_OKAY) \{
4623 197 goto ERR; 4650 198 goto ERR;
4624 198 \} 4651 199 \}
4625 199 /* r3 - r2 */ 4652 200 /* r3 - r2 */
4626 200 if ((res = mp_sub(&w3, &w2, &w3)) != MP_OKAY) \{ 4653 201 if ((res = mp_sub(&w3, &w2, &w3)) != MP_OKAY) \{
4627 201 goto ERR; 4654 202 goto ERR;
4628 202 \} 4655 203 \}
4629 203 /* r1 - 8r0 */ 4656 204 /* r1 - 8r0 */
4630 204 if ((res = mp_mul_2d(&w0, 3, &tmp1)) != MP_OKAY) \{ 4657 205 if ((res = mp_mul_2d(&w0, 3, &tmp1)) != MP_OKAY) \{
4631 205 goto ERR; 4658 206 goto ERR;
4632 206 \} 4659 207 \}
4633 207 if ((res = mp_sub(&w1, &tmp1, &w1)) != MP_OKAY) \{ 4660 208 if ((res = mp_sub(&w1, &tmp1, &w1)) != MP_OKAY) \{
4634 208 goto ERR; 4661 209 goto ERR;
4635 209 \} 4662 210 \}
4636 210 /* r3 - 8r4 */ 4663 211 /* r3 - 8r4 */
4637 211 if ((res = mp_mul_2d(&w4, 3, &tmp1)) != MP_OKAY) \{ 4664 212 if ((res = mp_mul_2d(&w4, 3, &tmp1)) != MP_OKAY) \{
4638 212 goto ERR; 4665 213 goto ERR;
4639 213 \} 4666 214 \}
4640 214 if ((res = mp_sub(&w3, &tmp1, &w3)) != MP_OKAY) \{ 4667 215 if ((res = mp_sub(&w3, &tmp1, &w3)) != MP_OKAY) \{
4641 215 goto ERR; 4668 216 goto ERR;
4642 216 \} 4669 217 \}
4643 217 /* 3r2 - r1 - r3 */ 4670 218 /* 3r2 - r1 - r3 */
4644 218 if ((res = mp_mul_d(&w2, 3, &w2)) != MP_OKAY) \{ 4671 219 if ((res = mp_mul_d(&w2, 3, &w2)) != MP_OKAY) \{
4645 219 goto ERR; 4672 220 goto ERR;
4646 220 \} 4673 221 \}
4647 221 if ((res = mp_sub(&w2, &w1, &w2)) != MP_OKAY) \{ 4674 222 if ((res = mp_sub(&w2, &w1, &w2)) != MP_OKAY) \{
4648 222 goto ERR; 4675 223 goto ERR;
4649 223 \} 4676 224 \}
4650 224 if ((res = mp_sub(&w2, &w3, &w2)) != MP_OKAY) \{ 4677 225 if ((res = mp_sub(&w2, &w3, &w2)) != MP_OKAY) \{
4651 225 goto ERR; 4678 226 goto ERR;
4652 226 \} 4679 227 \}
4653 227 /* r1 - r2 */ 4680 228 /* r1 - r2 */
4654 228 if ((res = mp_sub(&w1, &w2, &w1)) != MP_OKAY) \{ 4681 229 if ((res = mp_sub(&w1, &w2, &w1)) != MP_OKAY) \{
4655 229 goto ERR; 4682 230 goto ERR;
4656 230 \} 4683 231 \}
4657 231 /* r3 - r2 */ 4684 232 /* r3 - r2 */
4658 232 if ((res = mp_sub(&w3, &w2, &w3)) != MP_OKAY) \{ 4685 233 if ((res = mp_sub(&w3, &w2, &w3)) != MP_OKAY) \{
4659 233 goto ERR; 4686 234 goto ERR;
4660 234 \} 4687 235 \}
4661 235 /* r1/3 */ 4688 236 /* r1/3 */
4662 236 if ((res = mp_div_3(&w1, &w1, NULL)) != MP_OKAY) \{ 4689 237 if ((res = mp_div_3(&w1, &w1, NULL)) != MP_OKAY) \{
4663 237 goto ERR; 4690 238 goto ERR;
4664 238 \} 4691 239 \}
4665 239 /* r3/3 */ 4692 240 /* r3/3 */
4666 240 if ((res = mp_div_3(&w3, &w3, NULL)) != MP_OKAY) \{ 4693 241 if ((res = mp_div_3(&w3, &w3, NULL)) != MP_OKAY) \{
4667 241 goto ERR; 4694 242 goto ERR;
4668 242 \} 4695 243 \}
4669 243 4696 244
4670 244 /* at this point shift W[n] by B*n */ 4697 245 /* at this point shift W[n] by B*n */
4671 245 if ((res = mp_lshd(&w1, 1*B)) != MP_OKAY) \{ 4698 246 if ((res = mp_lshd(&w1, 1*B)) != MP_OKAY) \{
4672 246 goto ERR; 4699 247 goto ERR;
4673 247 \} 4700 248 \}
4674 248 if ((res = mp_lshd(&w2, 2*B)) != MP_OKAY) \{ 4701 249 if ((res = mp_lshd(&w2, 2*B)) != MP_OKAY) \{
4675 249 goto ERR; 4702 250 goto ERR;
4676 250 \} 4703 251 \}
4677 251 if ((res = mp_lshd(&w3, 3*B)) != MP_OKAY) \{ 4704 252 if ((res = mp_lshd(&w3, 3*B)) != MP_OKAY) \{
4678 252 goto ERR; 4705 253 goto ERR;
4679 253 \} 4706 254 \}
4680 254 if ((res = mp_lshd(&w4, 4*B)) != MP_OKAY) \{ 4707 255 if ((res = mp_lshd(&w4, 4*B)) != MP_OKAY) \{
4681 255 goto ERR; 4708 256 goto ERR;
4682 256 \} 4709 257 \}
4683 257 4710 258
4684 258 if ((res = mp_add(&w0, &w1, c)) != MP_OKAY) \{ 4711 259 if ((res = mp_add(&w0, &w1, c)) != MP_OKAY) \{
4685 259 goto ERR; 4712 260 goto ERR;
4686 260 \} 4713 261 \}
4687 261 if ((res = mp_add(&w2, &w3, &tmp1)) != MP_OKAY) \{ 4714 262 if ((res = mp_add(&w2, &w3, &tmp1)) != MP_OKAY) \{
4688 262 goto ERR; 4715 263 goto ERR;
4689 263 \} 4716 264 \}
4690 264 if ((res = mp_add(&w4, &tmp1, &tmp1)) != MP_OKAY) \{ 4717 265 if ((res = mp_add(&w4, &tmp1, &tmp1)) != MP_OKAY) \{
4691 265 goto ERR; 4718 266 goto ERR;
4692 266 \} 4719 267 \}
4693 267 if ((res = mp_add(&tmp1, c, c)) != MP_OKAY) \{ 4720 268 if ((res = mp_add(&tmp1, c, c)) != MP_OKAY) \{
4694 268 goto ERR; 4721 269 goto ERR;
4695 269 \} 4722 270 \}
4696 270 4723 271
4697 271 ERR: 4724 272 ERR:
4698 272 mp_clear_multi(&w0, &w1, &w2, &w3, &w4, 4725 273 mp_clear_multi(&w0, &w1, &w2, &w3, &w4,
4699 273 &a0, &a1, &a2, &b0, &b1, 4726 274 &a0, &a1, &a2, &b0, &b1,
4700 274 &b2, &tmp1, &tmp2, NULL); 4727 275 &b2, &tmp1, &tmp2, NULL);
4701 275 return res; 4728 276 return res;
4702 276 \} 4729 277 \}
4703 277 4730 278
4704 278 #endif 4731 279 #endif
4705 \end{alltt} 4732 \end{alltt}
4706 \end{small} 4733 \end{small}
4707 4734
4708 -- Comments to be added during editing phase. 4735 The first obvious thing to note is that this algorithm is complicated. The complexity is worth it if you are multiplying very
4736 large numbers. For example, a 10,000 digit multiplication takes approximaly 99,282,205 fewer single precision multiplications with
4737 Toom--Cook than a Comba or baseline approach (this is a savings of more than 99$\%$). For most ``crypto'' sized numbers this
4738 algorithm is not practical as Karatsuba has a much lower cutoff point.
4739
4740 First we split $a$ and $b$ into three roughly equal portions. This has been accomplished (lines 40 to 69) with
4741 combinations of mp\_rshd() and mp\_mod\_2d() function calls. At this point $a = a2 \cdot \beta^2 + a1 \cdot \beta + a0$ and similiarly
4742 for $b$.
4743
4744 Next we compute the five points $w0, w1, w2, w3$ and $w4$. Recall that $w0$ and $w4$ can be computed directly from the portions so
4745 we get those out of the way first (lines 72 and 77). Next we compute $w1, w2$ and $w3$ using Horners method.
4746
4747 After this point we solve for the actual values of $w1, w2$ and $w3$ by reducing the $5 \times 5$ system which is relatively
4748 straight forward.
4709 4749
4710 \subsection{Signed Multiplication} 4750 \subsection{Signed Multiplication}
4711 Now that algorithms to handle multiplications of every useful dimensions have been developed, a rather simple finishing touch is required. So far all 4751 Now that algorithms to handle multiplications of every useful dimensions have been developed, a rather simple finishing touch is required. So far all
4712 of the multiplication algorithms have been unsigned multiplications which leaves only a signed multiplication algorithm to be established. 4752 of the multiplication algorithms have been unsigned multiplications which leaves only a signed multiplication algorithm to be established.
4713 4753
4714 \newpage\begin{figure}[!here] 4754 \begin{figure}[!here]
4715 \begin{small} 4755 \begin{small}
4716 \begin{center} 4756 \begin{center}
4717 \begin{tabular}{l} 4757 \begin{tabular}{l}
4718 \hline Algorithm \textbf{mp\_mul}. \\ 4758 \hline Algorithm \textbf{mp\_mul}. \\
4719 \textbf{Input}. mp\_int $a$ and mp\_int $b$ \\ 4759 \textbf{Input}. mp\_int $a$ and mp\_int $b$ \\
4842 4882
4843 \subsection{The Baseline Squaring Algorithm} 4883 \subsection{The Baseline Squaring Algorithm}
4844 The baseline squaring algorithm is meant to be a catch-all squaring algorithm. It will handle any of the input sizes that the faster routines 4884 The baseline squaring algorithm is meant to be a catch-all squaring algorithm. It will handle any of the input sizes that the faster routines
4845 will not handle. 4885 will not handle.
4846 4886
4847 \newpage\begin{figure}[!here] 4887 \begin{figure}[!here]
4848 \begin{small} 4888 \begin{small}
4849 \begin{center} 4889 \begin{center}
4850 \begin{tabular}{l} 4890 \begin{tabular}{l}
4851 \hline Algorithm \textbf{s\_mp\_sqr}. \\ 4891 \hline Algorithm \textbf{s\_mp\_sqr}. \\
4852 \textbf{Input}. mp\_int $a$ \\ 4892 \textbf{Input}. mp\_int $a$ \\
4902 \hspace{-5.1mm}{\bf File}: bn\_s\_mp\_sqr.c 4942 \hspace{-5.1mm}{\bf File}: bn\_s\_mp\_sqr.c
4903 \vspace{-3mm} 4943 \vspace{-3mm}
4904 \begin{alltt} 4944 \begin{alltt}
4905 016 4945 016
4906 017 /* low level squaring, b = a*a, HAC pp.596-597, Algorithm 14.16 */ 4946 017 /* low level squaring, b = a*a, HAC pp.596-597, Algorithm 14.16 */
4907 018 int 4947 018 int s_mp_sqr (mp_int * a, mp_int * b)
4908 019 s_mp_sqr (mp_int * a, mp_int * b) 4948 019 \{
4909 020 \{ 4949 020 mp_int t;
4910 021 mp_int t; 4950 021 int res, ix, iy, pa;
4911 022 int res, ix, iy, pa; 4951 022 mp_word r;
4912 023 mp_word r; 4952 023 mp_digit u, tmpx, *tmpt;
4913 024 mp_digit u, tmpx, *tmpt; 4953 024
4914 025 4954 025 pa = a->used;
4915 026 pa = a->used; 4955 026 if ((res = mp_init_size (&t, 2*pa + 1)) != MP_OKAY) \{
4916 027 if ((res = mp_init_size (&t, 2*pa + 1)) != MP_OKAY) \{ 4956 027 return res;
4917 028 return res; 4957 028 \}
4918 029 \} 4958 029
4919 030 4959 030 /* default used is maximum possible size */
4920 031 /* default used is maximum possible size */ 4960 031 t.used = 2*pa + 1;
4921 032 t.used = 2*pa + 1; 4961 032
4922 033 4962 033 for (ix = 0; ix < pa; ix++) \{
4923 034 for (ix = 0; ix < pa; ix++) \{ 4963 034 /* first calculate the digit at 2*ix */
4924 035 /* first calculate the digit at 2*ix */ 4964 035 /* calculate double precision result */
4925 036 /* calculate double precision result */ 4965 036 r = ((mp_word) t.dp[2*ix]) +
4926 037 r = ((mp_word) t.dp[2*ix]) + 4966 037 ((mp_word)a->dp[ix])*((mp_word)a->dp[ix]);
4927 038 ((mp_word)a->dp[ix])*((mp_word)a->dp[ix]); 4967 038
4928 039 4968 039 /* store lower part in result */
4929 040 /* store lower part in result */ 4969 040 t.dp[ix+ix] = (mp_digit) (r & ((mp_word) MP_MASK));
4930 041 t.dp[ix+ix] = (mp_digit) (r & ((mp_word) MP_MASK)); 4970 041
4931 042 4971 042 /* get the carry */
4932 043 /* get the carry */ 4972 043 u = (mp_digit)(r >> ((mp_word) DIGIT_BIT));
4933 044 u = (mp_digit)(r >> ((mp_word) DIGIT_BIT)); 4973 044
4934 045 4974 045 /* left hand side of A[ix] * A[iy] */
4935 046 /* left hand side of A[ix] * A[iy] */ 4975 046 tmpx = a->dp[ix];
4936 047 tmpx = a->dp[ix]; 4976 047
4937 048 4977 048 /* alias for where to store the results */
4938 049 /* alias for where to store the results */ 4978 049 tmpt = t.dp + (2*ix + 1);
4939 050 tmpt = t.dp + (2*ix + 1); 4979 050
4940 051 4980 051 for (iy = ix + 1; iy < pa; iy++) \{
4941 052 for (iy = ix + 1; iy < pa; iy++) \{ 4981 052 /* first calculate the product */
4942 053 /* first calculate the product */ 4982 053 r = ((mp_word)tmpx) * ((mp_word)a->dp[iy]);
4943 054 r = ((mp_word)tmpx) * ((mp_word)a->dp[iy]); 4983 054
4944 055 4984 055 /* now calculate the double precision result, note we use
4945 056 /* now calculate the double precision result, note we use 4985 056 * addition instead of *2 since it's easier to optimize
4946 057 * addition instead of *2 since it's easier to optimize 4986 057 */
4947 058 */ 4987 058 r = ((mp_word) *tmpt) + r + r + ((mp_word) u);
4948 059 r = ((mp_word) *tmpt) + r + r + ((mp_word) u); 4988 059
4949 060 4989 060 /* store lower part */
4950 061 /* store lower part */ 4990 061 *tmpt++ = (mp_digit) (r & ((mp_word) MP_MASK));
4951 062 *tmpt++ = (mp_digit) (r & ((mp_word) MP_MASK)); 4991 062
4952 063 4992 063 /* get carry */
4953 064 /* get carry */ 4993 064 u = (mp_digit)(r >> ((mp_word) DIGIT_BIT));
4954 065 u = (mp_digit)(r >> ((mp_word) DIGIT_BIT)); 4994 065 \}
4955 066 \} 4995 066 /* propagate upwards */
4956 067 /* propagate upwards */ 4996 067 while (u != ((mp_digit) 0)) \{
4957 068 while (u != ((mp_digit) 0)) \{ 4997 068 r = ((mp_word) *tmpt) + ((mp_word) u);
4958 069 r = ((mp_word) *tmpt) + ((mp_word) u); 4998 069 *tmpt++ = (mp_digit) (r & ((mp_word) MP_MASK));
4959 070 *tmpt++ = (mp_digit) (r & ((mp_word) MP_MASK)); 4999 070 u = (mp_digit)(r >> ((mp_word) DIGIT_BIT));
4960 071 u = (mp_digit)(r >> ((mp_word) DIGIT_BIT)); 5000 071 \}
4961 072 \} 5001 072 \}
4962 073 \} 5002 073
4963 074 5003 074 mp_clamp (&t);
4964 075 mp_clamp (&t); 5004 075 mp_exch (&t, b);
4965 076 mp_exch (&t, b); 5005 076 mp_clear (&t);
4966 077 mp_clear (&t); 5006 077 return MP_OKAY;
4967 078 return MP_OKAY; 5007 078 \}
4968 079 \} 5008 079 #endif
4969 080 #endif
4970 \end{alltt} 5009 \end{alltt}
4971 \end{small} 5010 \end{small}
4972 5011
4973 Inside the outer loop (\textit{see line 34}) the square term is calculated on line 37. Line 44 extracts the carry from the square 5012 Inside the outer loop (line 33) the square term is calculated on line 36. The carry (line 43) has been
4974 term. Aliases for $a_{ix}$ and $t_{ix+iy}$ are initialized on lines 47 and 50 respectively. The doubling is performed using two 5013 extracted from the mp\_word accumulator using a right shift. Aliases for $a_{ix}$ and $t_{ix+iy}$ are initialized
4975 additions (\textit{see line 59}) since it is usually faster than shifting,if not at least as fast. 5014 (lines 46 and 49) to simplify the inner loop. The doubling is performed using two
5015 additions (line 58) since it is usually faster than shifting, if not at least as fast.
5016
5017 The important observation is that the inner loop does not begin at $iy = 0$ like for multiplication. As such the inner loops
5018 get progressively shorter as the algorithm proceeds. This is what leads to the savings compared to using a multiplication to
5019 square a number.
4976 5020
4977 \subsection{Faster Squaring by the ``Comba'' Method} 5021 \subsection{Faster Squaring by the ``Comba'' Method}
4978 A major drawback to the baseline method is the requirement for single precision shifting inside the $O(n^2)$ nested loop. Squaring has an additional 5022 A major drawback to the baseline method is the requirement for single precision shifting inside the $O(n^2)$ nested loop. Squaring has an additional
4979 drawback that it must double the product inside the inner loop as well. As for multiplication, the Comba technique can be used to eliminate these 5023 drawback that it must double the product inside the inner loop as well. As for multiplication, the Comba technique can be used to eliminate these
4980 performance hazards. 5024 performance hazards.
4982 The first obvious solution is to make an array of mp\_words which will hold all of the columns. This will indeed eliminate all of the carry 5026 The first obvious solution is to make an array of mp\_words which will hold all of the columns. This will indeed eliminate all of the carry
4983 propagation operations from the inner loop. However, the inner product must still be doubled $O(n^2)$ times. The solution stems from the simple fact 5027 propagation operations from the inner loop. However, the inner product must still be doubled $O(n^2)$ times. The solution stems from the simple fact
4984 that $2a + 2b + 2c = 2(a + b + c)$. That is the sum of all of the double products is equal to double the sum of all the products. For example, 5028 that $2a + 2b + 2c = 2(a + b + c)$. That is the sum of all of the double products is equal to double the sum of all the products. For example,
4985 $ab + ba + ac + ca = 2ab + 2ac = 2(ab + ac)$. 5029 $ab + ba + ac + ca = 2ab + 2ac = 2(ab + ac)$.
4986 5030
4987 However, we cannot simply double all of the columns, since the squares appear only once per row. The most practical solution is to have two mp\_word 5031 However, we cannot simply double all of the columns, since the squares appear only once per row. The most practical solution is to have two
4988 arrays. One array will hold the squares and the other array will hold the double products. With both arrays the doubling and carry propagation can be 5032 mp\_word arrays. One array will hold the squares and the other array will hold the double products. With both arrays the doubling and
4989 moved to a $O(n)$ work level outside the $O(n^2)$ level. 5033 carry propagation can be moved to a $O(n)$ work level outside the $O(n^2)$ level. In this case, we have an even simpler solution in mind.
4990 5034
4991 \newpage\begin{figure}[!here] 5035 \newpage\begin{figure}[!here]
4992 \begin{small} 5036 \begin{small}
4993 \begin{center} 5037 \begin{center}
4994 \begin{tabular}{l} 5038 \begin{tabular}{l}
4995 \hline Algorithm \textbf{fast\_s\_mp\_sqr}. \\ 5039 \hline Algorithm \textbf{fast\_s\_mp\_sqr}. \\
4996 \textbf{Input}. mp\_int $a$ \\ 5040 \textbf{Input}. mp\_int $a$ \\
4997 \textbf{Output}. $b \leftarrow a^2$ \\ 5041 \textbf{Output}. $b \leftarrow a^2$ \\
4998 \hline \\ 5042 \hline \\
4999 Place two arrays of \textbf{MP\_WARRAY} mp\_words named $\hat W$ and $\hat {X}$ on the stack. \\ 5043 Place an array of \textbf{MP\_WARRAY} mp\_digits named $W$ on the stack. \\
5000 1. If $b.alloc < 2a.used + 1$ then grow $b$ to $2a.used + 1$ digits. (\textit{mp\_grow}). \\ 5044 1. If $b.alloc < 2a.used + 1$ then grow $b$ to $2a.used + 1$ digits. (\textit{mp\_grow}). \\
5001 2. If step 1 failed return(\textit{MP\_MEM}). \\ 5045 2. If step 1 failed return(\textit{MP\_MEM}). \\
5002 3. for $ix$ from $0$ to $2a.used + 1$ do \\
5003 \hspace{3mm}3.1 $\hat W_{ix} \leftarrow 0$ \\
5004 \hspace{3mm}3.2 $\hat {X}_{ix} \leftarrow 0$ \\
5005 4. for $ix$ from $0$ to $a.used - 1$ do \\
5006 \hspace{3mm}Compute the square.\\
5007 \hspace{3mm}4.1 $\hat {X}_{ix+ix} \leftarrow \left ( a_{ix} \right )^2$ \\
5008 \\ 5046 \\
5009 \hspace{3mm}Compute the double products.\\ 5047 3. $pa \leftarrow 2 \cdot a.used$ \\
5010 \hspace{3mm}4.2 for $iy$ from $ix + 1$ to $a.used - 1$ do \\ 5048 4. $\hat W1 \leftarrow 0$ \\
5011 \hspace{6mm}4.2.1 $\hat W_{ix+iy} \leftarrow \hat W_{ix+iy} + a_{ix}a_{iy}$ \\ 5049 5. for $ix$ from $0$ to $pa - 1$ do \\
5012 5. $oldused \leftarrow b.used$ \\ 5050 \hspace{3mm}5.1 $\_ \hat W \leftarrow 0$ \\
5013 6. $b.used \leftarrow 2a.used + 1$ \\ 5051 \hspace{3mm}5.2 $ty \leftarrow \mbox{MIN}(a.used - 1, ix)$ \\
5052 \hspace{3mm}5.3 $tx \leftarrow ix - ty$ \\
5053 \hspace{3mm}5.4 $iy \leftarrow \mbox{MIN}(a.used - tx, ty + 1)$ \\
5054 \hspace{3mm}5.5 $iy \leftarrow \mbox{MIN}(iy, \lfloor \left (ty - tx + 1 \right )/2 \rfloor)$ \\
5055 \hspace{3mm}5.6 for $iz$ from $0$ to $iz - 1$ do \\
5056 \hspace{6mm}5.6.1 $\_ \hat W \leftarrow \_ \hat W + a_{tx + iz}a_{ty - iz}$ \\
5057 \hspace{3mm}5.7 $\_ \hat W \leftarrow 2 \cdot \_ \hat W + \hat W1$ \\
5058 \hspace{3mm}5.8 if $ix$ is even then \\
5059 \hspace{6mm}5.8.1 $\_ \hat W \leftarrow \_ \hat W + \left ( a_{\lfloor ix/2 \rfloor}\right )^2$ \\
5060 \hspace{3mm}5.9 $W_{ix} \leftarrow \_ \hat W (\mbox{mod }\beta)$ \\
5061 \hspace{3mm}5.10 $\hat W1 \leftarrow \lfloor \_ \hat W / \beta \rfloor$ \\
5014 \\ 5062 \\
5015 Double the products and propagate the carries simultaneously. \\ 5063 6. $oldused \leftarrow b.used$ \\
5016 7. $\hat W_0 \leftarrow 2 \hat W_0 + \hat {X}_0$ \\ 5064 7. $b.used \leftarrow 2 \cdot a.used$ \\
5017 8. for $ix$ from $1$ to $2a.used$ do \\ 5065 8. for $ix$ from $0$ to $pa - 1$ do \\
5018 \hspace{3mm}8.1 $\hat W_{ix} \leftarrow 2 \hat W_{ix} + \hat {X}_{ix}$ \\ 5066 \hspace{3mm}8.1 $b_{ix} \leftarrow W_{ix}$ \\
5019 \hspace{3mm}8.2 $\hat W_{ix} \leftarrow \hat W_{ix} + \lfloor \hat W_{ix - 1} / \beta \rfloor$ \\ 5067 9. for $ix$ from $pa$ to $oldused - 1$ do \\
5020 \hspace{3mm}8.3 $b_{ix-1} \leftarrow W_{ix-1} \mbox{ (mod }\beta\mbox{)}$ \\ 5068 \hspace{3mm}9.1 $b_{ix} \leftarrow 0$ \\
5021 9. $b_{2a.used} \leftarrow \hat W_{2a.used} \mbox{ (mod }\beta\mbox{)}$ \\ 5069 10. Clamp excess digits from $b$. (\textit{mp\_clamp}) \\
5022 10. if $2a.used + 1 < oldused$ then do \\ 5070 11. Return(\textit{MP\_OKAY}). \\
5023 \hspace{3mm}10.1 for $ix$ from $2a.used + 1$ to $oldused$ do \\
5024 \hspace{6mm}10.1.1 $b_{ix} \leftarrow 0$ \\
5025 11. Clamp excess digits from $b$. (\textit{mp\_clamp}) \\
5026 12. Return(\textit{MP\_OKAY}). \\
5027 \hline 5071 \hline
5028 \end{tabular} 5072 \end{tabular}
5029 \end{center} 5073 \end{center}
5030 \end{small} 5074 \end{small}
5031 \caption{Algorithm fast\_s\_mp\_sqr} 5075 \caption{Algorithm fast\_s\_mp\_sqr}
5032 \end{figure} 5076 \end{figure}
5033 5077
5034 \textbf{Algorithm fast\_s\_mp\_sqr.} 5078 \textbf{Algorithm fast\_s\_mp\_sqr.}
5035 This algorithm computes the square of an input using the Comba technique. It is designed to be a replacement for algorithm s\_mp\_sqr when 5079 This algorithm computes the square of an input using the Comba technique. It is designed to be a replacement for algorithm
5036 the number of input digits is less than \textbf{MP\_WARRAY} and less than $\delta \over 2$. 5080 s\_mp\_sqr when the number of input digits is less than \textbf{MP\_WARRAY} and less than $\delta \over 2$.
5037 5081 This algorithm is very similar to the Comba multiplier except with a few key differences we shall make note of.
5038 This routine requires two arrays of mp\_words to be placed on the stack. The first array $\hat W$ will hold the double products and the second 5082
5039 array $\hat X$ will hold the squares. Though only at most $MP\_WARRAY \over 2$ words of $\hat X$ are used, it has proven faster on most 5083 First, we have an accumulator and carry variables $\_ \hat W$ and $\hat W1$ respectively. This is because the inner loop
5040 processors to simply make it a full size array. 5084 products are to be doubled. If we had added the previous carry in we would be doubling too much. Next we perform an
5041 5085 addition MIN condition on $iy$ (step 5.5) to prevent overlapping digits. For example, $a_3 \cdot a_5$ is equal
5042 The loop on step 3 will zero the two arrays to prepare them for the squaring step. Step 4.1 computes the squares of the product. Note how 5086 $a_5 \cdot a_3$. Whereas in the multiplication case we would have $5 < a.used$ and $3 \ge 0$ is maintained since we double the sum
5043 it simply assigns the value into the $\hat X$ array. The nested loop on step 4.2 computes the doubles of the products. This loop 5087 of the products just outside the inner loop we have to avoid doing this. This is also a good thing since we perform
5044 computes the sum of the products for each column. They are not doubled until later. 5088 fewer multiplications and the routine ends up being faster.
5045 5089
5046 After the squaring loop, the products stored in $\hat W$ musted be doubled and the carries propagated forwards. It makes sense to do both 5090 Finally the last difference is the addition of the ``square'' term outside the inner loop (step 5.8). We add in the square
5047 operations at the same time. The expression $\hat W_{ix} \leftarrow 2 \hat W_{ix} + \hat {X}_{ix}$ computes the sum of the double product and the 5091 only to even outputs and it is the square of the term at the $\lfloor ix / 2 \rfloor$ position.
5048 squares in place.
5049 5092
5050 \vspace{+3mm}\begin{small} 5093 \vspace{+3mm}\begin{small}
5051 \hspace{-5.1mm}{\bf File}: bn\_fast\_s\_mp\_sqr.c 5094 \hspace{-5.1mm}{\bf File}: bn\_fast\_s\_mp\_sqr.c
5052 \vspace{-3mm} 5095 \vspace{-3mm}
5053 \begin{alltt} 5096 \begin{alltt}
5054 016 5097 016
5055 017 /* fast squaring 5098 017 /* the jist of squaring...
5056 018 * 5099 018 * you do like mult except the offset of the tmpx [one that
5057 019 * This is the comba method where the columns of the product 5100 019 * starts closer to zero] can't equal the offset of tmpy.
5058 020 * are computed first then the carries are computed. This 5101 020 * So basically you set up iy like before then you min it with
5059 021 * has the effect of making a very simple inner loop that 5102 021 * (ty-tx) so that it never happens. You double all those
5060 022 * is executed the most 5103 022 * you add in the inner loop
5061 023 * 5104 023
5062 024 * W2 represents the outer products and W the inner. 5105 024 After that loop you do the squares and add them in.
5063 025 * 5106 025 */
5064 026 * A further optimizations is made because the inner 5107 026
5065 027 * products are of the form "A * B * 2". The *2 part does 5108 027 int fast_s_mp_sqr (mp_int * a, mp_int * b)
5066 028 * not need to be computed until the end which is good 5109 028 \{
5067 029 * because 64-bit shifts are slow! 5110 029 int olduse, res, pa, ix, iz;
5068 030 * 5111 030 mp_digit W[MP_WARRAY], *tmpx;
5069 031 * Based on Algorithm 14.16 on pp.597 of HAC. 5112 031 mp_word W1;
5070 032 * 5113 032
5071 033 */ 5114 033 /* grow the destination as required */
5072 034 /* the jist of squaring... 5115 034 pa = a->used + a->used;
5073 035 5116 035 if (b->alloc < pa) \{
5074 036 you do like mult except the offset of the tmpx [one that starts closer to ze 5117 036 if ((res = mp_grow (b, pa)) != MP_OKAY) \{
5075 ro] 5118 037 return res;
5076 037 can't equal the offset of tmpy. So basically you set up iy like before then 5119 038 \}
5077 you min it with 5120 039 \}
5078 038 (ty-tx) so that it never happens. You double all those you add in the inner 5121 040
5079 loop 5122 041 /* number of output digits to produce */
5080 039 5123 042 W1 = 0;
5081 040 After that loop you do the squares and add them in. 5124 043 for (ix = 0; ix < pa; ix++) \{
5082 041 5125 044 int tx, ty, iy;
5083 042 Remove W2 and don't memset W 5126 045 mp_word _W;
5084 043 5127 046 mp_digit *tmpy;
5085 044 */ 5128 047
5086 045 5129 048 /* clear counter */
5087 046 int fast_s_mp_sqr (mp_int * a, mp_int * b) 5130 049 _W = 0;
5088 047 \{ 5131 050
5089 048 int olduse, res, pa, ix, iz; 5132 051 /* get offsets into the two bignums */
5090 049 mp_digit W[MP_WARRAY], *tmpx; 5133 052 ty = MIN(a->used-1, ix);
5091 050 mp_word W1; 5134 053 tx = ix - ty;
5092 051 5135 054
5093 052 /* grow the destination as required */ 5136 055 /* setup temp aliases */
5094 053 pa = a->used + a->used; 5137 056 tmpx = a->dp + tx;
5095 054 if (b->alloc < pa) \{ 5138 057 tmpy = a->dp + ty;
5096 055 if ((res = mp_grow (b, pa)) != MP_OKAY) \{ 5139 058
5097 056 return res; 5140 059 /* this is the number of times the loop will iterrate, essentially
5098 057 \} 5141 060 while (tx++ < a->used && ty-- >= 0) \{ ... \}
5099 058 \} 5142 061 */
5100 059 5143 062 iy = MIN(a->used-tx, ty+1);
5101 060 /* number of output digits to produce */ 5144 063
5102 061 W1 = 0; 5145 064 /* now for squaring tx can never equal ty
5103 062 for (ix = 0; ix <= pa; ix++) \{ 5146 065 * we halve the distance since they approach at a rate of 2x
5104 063 int tx, ty, iy; 5147 066 * and we have to round because odd cases need to be executed
5105 064 mp_word _W; 5148 067 */
5106 065 mp_digit *tmpy; 5149 068 iy = MIN(iy, (ty-tx+1)>>1);
5107 066
5108 067 /* clear counter */
5109 068 _W = 0;
5110 069 5150 069
5111 070 /* get offsets into the two bignums */ 5151 070 /* execute loop */
5112 071 ty = MIN(a->used-1, ix); 5152 071 for (iz = 0; iz < iy; iz++) \{
5113 072 tx = ix - ty; 5153 072 _W += ((mp_word)*tmpx++)*((mp_word)*tmpy--);
5114 073 5154 073 \}
5115 074 /* setup temp aliases */ 5155 074
5116 075 tmpx = a->dp + tx; 5156 075 /* double the inner product and add carry */
5117 076 tmpy = a->dp + ty; 5157 076 _W = _W + _W + W1;
5118 077 5158 077
5119 078 /* this is the number of times the loop will iterrate, essentially its 5159 078 /* even columns have the square term in them */
5120 5160 079 if ((ix&1) == 0) \{
5121 079 while (tx++ < a->used && ty-- >= 0) \{ ... \} 5161 080 _W += ((mp_word)a->dp[ix>>1])*((mp_word)a->dp[ix>>1]);
5122 080 */ 5162 081 \}
5123 081 iy = MIN(a->used-tx, ty+1);
5124 082 5163 082
5125 083 /* now for squaring tx can never equal ty 5164 083 /* store it */
5126 084 * we halve the distance since they approach at a rate of 2x 5165 084 W[ix] = (mp_digit)(_W & MP_MASK);
5127 085 * and we have to round because odd cases need to be executed 5166 085
5128 086 */ 5167 086 /* make next carry */
5129 087 iy = MIN(iy, (ty-tx+1)>>1); 5168 087 W1 = _W >> ((mp_word)DIGIT_BIT);
5130 088 5169 088 \}
5131 089 /* execute loop */ 5170 089
5132 090 for (iz = 0; iz < iy; iz++) \{ 5171 090 /* setup dest */
5133 091 _W += ((mp_word)*tmpx++)*((mp_word)*tmpy--); 5172 091 olduse = b->used;
5134 092 \} 5173 092 b->used = a->used+a->used;
5135 093 5174 093
5136 094 /* double the inner product and add carry */ 5175 094 \{
5137 095 _W = _W + _W + W1; 5176 095 mp_digit *tmpb;
5138 096 5177 096 tmpb = b->dp;
5139 097 /* even columns have the square term in them */ 5178 097 for (ix = 0; ix < pa; ix++) \{
5140 098 if ((ix&1) == 0) \{ 5179 098 *tmpb++ = W[ix] & MP_MASK;
5141 099 _W += ((mp_word)a->dp[ix>>1])*((mp_word)a->dp[ix>>1]); 5180 099 \}
5142 100 \} 5181 100
5143 101 5182 101 /* clear unused digits [that existed in the old copy of c] */
5144 102 /* store it */ 5183 102 for (; ix < olduse; ix++) \{
5145 103 W[ix] = _W; 5184 103 *tmpb++ = 0;
5146 104 5185 104 \}
5147 105 /* make next carry */ 5186 105 \}
5148 106 W1 = _W >> ((mp_word)DIGIT_BIT); 5187 106 mp_clamp (b);
5149 107 \} 5188 107 return MP_OKAY;
5150 108 5189 108 \}
5151 109 /* setup dest */ 5190 109 #endif
5152 110 olduse = b->used;
5153 111 b->used = a->used+a->used;
5154 112
5155 113 \{
5156 114 mp_digit *tmpb;
5157 115 tmpb = b->dp;
5158 116 for (ix = 0; ix < pa; ix++) \{
5159 117 *tmpb++ = W[ix] & MP_MASK;
5160 118 \}
5161 119
5162 120 /* clear unused digits [that existed in the old copy of c] */
5163 121 for (; ix < olduse; ix++) \{
5164 122 *tmpb++ = 0;
5165 123 \}
5166 124 \}
5167 125 mp_clamp (b);
5168 126 return MP_OKAY;
5169 127 \}
5170 128 #endif
5171 \end{alltt} 5191 \end{alltt}
5172 \end{small} 5192 \end{small}
5173 5193
5174 -- Write something deep and insightful later, Tom. 5194 This implementation is essentially a copy of Comba multiplication with the appropriate changes added to make it faster for
5195 the special case of squaring.
5175 5196
5176 \subsection{Polynomial Basis Squaring} 5197 \subsection{Polynomial Basis Squaring}
5177 The same algorithm that performs optimal polynomial basis multiplication can be used to perform polynomial basis squaring. The minor exception 5198 The same algorithm that performs optimal polynomial basis multiplication can be used to perform polynomial basis squaring. The minor exception
5178 is that $\zeta_y = f(y)g(y)$ is actually equivalent to $\zeta_y = f(y)^2$ since $f(y) = g(y)$. Instead of performing $2n + 1$ 5199 is that $\zeta_y = f(y)g(y)$ is actually equivalent to $\zeta_y = f(y)^2$ since $f(y) = g(y)$. Instead of performing $2n + 1$
5179 multiplications to find the $\zeta$ relations, squaring operations are performed instead. 5200 multiplications to find the $\zeta$ relations, squaring operations are performed instead.
5387 5408
5388 By inlining the copy and shift operations the cutoff point for Karatsuba multiplication can be lowered. On the Athlon the cutoff point 5409 By inlining the copy and shift operations the cutoff point for Karatsuba multiplication can be lowered. On the Athlon the cutoff point
5389 is exactly at the point where Comba squaring can no longer be used (\textit{128 digits}). On slower processors such as the Intel P4 5410 is exactly at the point where Comba squaring can no longer be used (\textit{128 digits}). On slower processors such as the Intel P4
5390 it is actually below the Comba limit (\textit{at 110 digits}). 5411 it is actually below the Comba limit (\textit{at 110 digits}).
5391 5412
5392 This routine uses the same error trap coding style as mp\_karatsuba\_sqr. As the temporary variables are initialized errors are redirected to 5413 This routine uses the same error trap coding style as mp\_karatsuba\_sqr. As the temporary variables are initialized errors are
5393 the error trap higher up. If the algorithm completes without error the error code is set to \textbf{MP\_OKAY} and mp\_clears are executed normally. 5414 redirected to the error trap higher up. If the algorithm completes without error the error code is set to \textbf{MP\_OKAY} and
5394 5415 mp\_clears are executed normally.
5395 \textit{Last paragraph sucks. re-write! -- Tom}
5396 5416
5397 \subsection{Toom-Cook Squaring} 5417 \subsection{Toom-Cook Squaring}
5398 The Toom-Cook squaring algorithm mp\_toom\_sqr is heavily based on the algorithm mp\_toom\_mul with the exception that squarings are used 5418 The Toom-Cook squaring algorithm mp\_toom\_sqr is heavily based on the algorithm mp\_toom\_mul with the exception that squarings are used
5399 instead of multiplication to find the five relations.. The reader is encouraged to read the description of the latter algorithm and try to 5419 instead of multiplication to find the five relations. The reader is encouraged to read the description of the latter algorithm and try to
5400 derive their own Toom-Cook squaring algorithm. 5420 derive their own Toom-Cook squaring algorithm.
5401 5421
5402 \subsection{High Level Squaring} 5422 \subsection{High Level Squaring}
5403 \newpage\begin{figure}[!here] 5423 \newpage\begin{figure}[!here]
5404 \begin{small} 5424 \begin{small}
5480 \section*{Exercises} 5500 \section*{Exercises}
5481 \begin{tabular}{cl} 5501 \begin{tabular}{cl}
5482 $\left [ 3 \right ] $ & Devise an efficient algorithm for selection of the radix point to handle inputs \\ 5502 $\left [ 3 \right ] $ & Devise an efficient algorithm for selection of the radix point to handle inputs \\
5483 & that have different number of digits in Karatsuba multiplication. \\ 5503 & that have different number of digits in Karatsuba multiplication. \\
5484 & \\ 5504 & \\
5485 $\left [ 3 \right ] $ & In section 5.3 the fact that every column of a squaring is made up \\ 5505 $\left [ 2 \right ] $ & In section 5.3 the fact that every column of a squaring is made up \\
5486 & of double products and at most one square is stated. Prove this statement. \\ 5506 & of double products and at most one square is stated. Prove this statement. \\
5487 & \\ 5507 & \\
5488 $\left [ 2 \right ] $ & In the Comba squaring algorithm half of the $\hat X$ variables are not used. \\
5489 & Revise algorithm fast\_s\_mp\_sqr to shrink the $\hat X$ array. \\
5490 & \\
5491 $\left [ 3 \right ] $ & Prove the equation for Karatsuba squaring. \\ 5508 $\left [ 3 \right ] $ & Prove the equation for Karatsuba squaring. \\
5492 & \\ 5509 & \\
5493 $\left [ 1 \right ] $ & Prove that Karatsuba squaring requires $O \left (n^{lg(3)} \right )$ time. \\ 5510 $\left [ 1 \right ] $ & Prove that Karatsuba squaring requires $O \left (n^{lg(3)} \right )$ time. \\
5494 & \\ 5511 & \\
5495 $\left [ 2 \right ] $ & Determine the minimal ratio between addition and multiplication clock cycles \\ 5512 $\left [ 2 \right ] $ & Determine the minimal ratio between addition and multiplication clock cycles \\
5496 & required for equation $6.7$ to be true. \\ 5513 & required for equation $6.7$ to be true. \\
5497 & \\ 5514 & \\
5515 $\left [ 3 \right ] $ & Implement a threaded version of Comba multiplication (and squaring) where you \\
5516 & compute subsets of the columns in each thread. Determine a cutoff point where \\
5517 & it is effective and add the logic to mp\_mul() and mp\_sqr(). \\
5518 &\\
5519 $\left [ 4 \right ] $ & Same as the previous but also modify the Karatsuba and Toom-Cook. You must \\
5520 & increase the throughput of mp\_exptmod() for random odd moduli in the range \\
5521 & $512 \ldots 4096$ bits significantly ($> 2x$) to complete this challenge. \\
5522 & \\
5498 \end{tabular} 5523 \end{tabular}
5499 5524
5500 \chapter{Modular Reduction} 5525 \chapter{Modular Reduction}
5501 \section{Basics of Modular Reduction} 5526 \section{Basics of Modular Reduction}
5502 \index{modular residue} 5527 \index{modular residue}
5511 other forms of residues. 5536 other forms of residues.
5512 5537
5513 Modular reductions are normally used to create either finite groups, rings or fields. The most common usage for performance driven modular reductions 5538 Modular reductions are normally used to create either finite groups, rings or fields. The most common usage for performance driven modular reductions
5514 is in modular exponentiation algorithms. That is to compute $d = a^b \mbox{ (mod }c\mbox{)}$ as fast as possible. This operation is used in the 5539 is in modular exponentiation algorithms. That is to compute $d = a^b \mbox{ (mod }c\mbox{)}$ as fast as possible. This operation is used in the
5515 RSA and Diffie-Hellman public key algorithms, for example. Modular multiplication and squaring also appears as a fundamental operation in 5540 RSA and Diffie-Hellman public key algorithms, for example. Modular multiplication and squaring also appears as a fundamental operation in
5516 Elliptic Curve cryptographic algorithms. As will be discussed in the subsequent chapter there exist fast algorithms for computing modular 5541 elliptic curve cryptographic algorithms. As will be discussed in the subsequent chapter there exist fast algorithms for computing modular
5517 exponentiations without having to perform (\textit{in this example}) $b - 1$ multiplications. These algorithms will produce partial results in the 5542 exponentiations without having to perform (\textit{in this example}) $b - 1$ multiplications. These algorithms will produce partial results in the
5518 range $0 \le x < c^2$ which can be taken advantage of to create several efficient algorithms. They have also been used to create redundancy check 5543 range $0 \le x < c^2$ which can be taken advantage of to create several efficient algorithms. They have also been used to create redundancy check
5519 algorithms known as CRCs, error correction codes such as Reed-Solomon and solve a variety of number theoeretic problems. 5544 algorithms known as CRCs, error correction codes such as Reed-Solomon and solve a variety of number theoeretic problems.
5520 5545
5521 \section{The Barrett Reduction} 5546 \section{The Barrett Reduction}
5725 016 5750 016
5726 017 /* reduces x mod m, assumes 0 < x < m**2, mu is 5751 017 /* reduces x mod m, assumes 0 < x < m**2, mu is
5727 018 * precomputed via mp_reduce_setup. 5752 018 * precomputed via mp_reduce_setup.
5728 019 * From HAC pp.604 Algorithm 14.42 5753 019 * From HAC pp.604 Algorithm 14.42
5729 020 */ 5754 020 */
5730 021 int 5755 021 int mp_reduce (mp_int * x, mp_int * m, mp_int * mu)
5731 022 mp_reduce (mp_int * x, mp_int * m, mp_int * mu) 5756 022 \{
5732 023 \{ 5757 023 mp_int q;
5733 024 mp_int q; 5758 024 int res, um = m->used;
5734 025 int res, um = m->used; 5759 025
5735 026 5760 026 /* q = x */
5736 027 /* q = x */ 5761 027 if ((res = mp_init_copy (&q, x)) != MP_OKAY) \{
5737 028 if ((res = mp_init_copy (&q, x)) != MP_OKAY) \{ 5762 028 return res;
5738 029 return res; 5763 029 \}
5739 030 \} 5764 030
5740 031 5765 031 /* q1 = x / b**(k-1) */
5741 032 /* q1 = x / b**(k-1) */ 5766 032 mp_rshd (&q, um - 1);
5742 033 mp_rshd (&q, um - 1); 5767 033
5743 034 5768 034 /* according to HAC this optimization is ok */
5744 035 /* according to HAC this optimization is ok */ 5769 035 if (((unsigned long) um) > (((mp_digit)1) << (DIGIT_BIT - 1))) \{
5745 036 if (((unsigned long) um) > (((mp_digit)1) << (DIGIT_BIT - 1))) \{ 5770 036 if ((res = mp_mul (&q, mu, &q)) != MP_OKAY) \{
5746 037 if ((res = mp_mul (&q, mu, &q)) != MP_OKAY) \{ 5771 037 goto CLEANUP;
5747 038 goto CLEANUP; 5772 038 \}
5748 039 \} 5773 039 \} else \{
5749 040 \} else \{ 5774 040 #ifdef BN_S_MP_MUL_HIGH_DIGS_C
5750 041 #ifdef BN_S_MP_MUL_HIGH_DIGS_C 5775 041 if ((res = s_mp_mul_high_digs (&q, mu, &q, um)) != MP_OKAY) \{
5751 042 if ((res = s_mp_mul_high_digs (&q, mu, &q, um - 1)) != MP_OKAY) \{ 5776 042 goto CLEANUP;
5752 043 goto CLEANUP; 5777 043 \}
5753 044 \} 5778 044 #elif defined(BN_FAST_S_MP_MUL_HIGH_DIGS_C)
5754 045 #elif defined(BN_FAST_S_MP_MUL_HIGH_DIGS_C) 5779 045 if ((res = fast_s_mp_mul_high_digs (&q, mu, &q, um)) != MP_OKAY) \{
5755 046 if ((res = fast_s_mp_mul_high_digs (&q, mu, &q, um - 1)) != MP_OKAY) \{ 5780 046 goto CLEANUP;
5756 047 goto CLEANUP; 5781 047 \}
5757 048 \} 5782 048 #else
5758 049 #else 5783 049 \{
5759 050 \{ 5784 050 res = MP_VAL;
5760 051 res = MP_VAL; 5785 051 goto CLEANUP;
5761 052 goto CLEANUP; 5786 052 \}
5762 053 \} 5787 053 #endif
5763 054 #endif 5788 054 \}
5764 055 \} 5789 055
5765 056 5790 056 /* q3 = q2 / b**(k+1) */
5766 057 /* q3 = q2 / b**(k+1) */ 5791 057 mp_rshd (&q, um + 1);
5767 058 mp_rshd (&q, um + 1); 5792 058
5768 059 5793 059 /* x = x mod b**(k+1), quick (no division) */
5769 060 /* x = x mod b**(k+1), quick (no division) */ 5794 060 if ((res = mp_mod_2d (x, DIGIT_BIT * (um + 1), x)) != MP_OKAY) \{
5770 061 if ((res = mp_mod_2d (x, DIGIT_BIT * (um + 1), x)) != MP_OKAY) \{ 5795 061 goto CLEANUP;
5771 062 goto CLEANUP; 5796 062 \}
5772 063 \} 5797 063
5773 064 5798 064 /* q = q * m mod b**(k+1), quick (no division) */
5774 065 /* q = q * m mod b**(k+1), quick (no division) */ 5799 065 if ((res = s_mp_mul_digs (&q, m, &q, um + 1)) != MP_OKAY) \{
5775 066 if ((res = s_mp_mul_digs (&q, m, &q, um + 1)) != MP_OKAY) \{ 5800 066 goto CLEANUP;
5776 067 goto CLEANUP; 5801 067 \}
5777 068 \} 5802 068
5778 069 5803 069 /* x = x - q */
5779 070 /* x = x - q */ 5804 070 if ((res = mp_sub (x, &q, x)) != MP_OKAY) \{
5780 071 if ((res = mp_sub (x, &q, x)) != MP_OKAY) \{ 5805 071 goto CLEANUP;
5781 072 goto CLEANUP; 5806 072 \}
5782 073 \} 5807 073
5783 074 5808 074 /* If x < 0, add b**(k+1) to it */
5784 075 /* If x < 0, add b**(k+1) to it */ 5809 075 if (mp_cmp_d (x, 0) == MP_LT) \{
5785 076 if (mp_cmp_d (x, 0) == MP_LT) \{ 5810 076 mp_set (&q, 1);
5786 077 mp_set (&q, 1); 5811 077 if ((res = mp_lshd (&q, um + 1)) != MP_OKAY)
5787 078 if ((res = mp_lshd (&q, um + 1)) != MP_OKAY) 5812 078 goto CLEANUP;
5788 079 goto CLEANUP; 5813 079 if ((res = mp_add (x, &q, x)) != MP_OKAY)
5789 080 if ((res = mp_add (x, &q, x)) != MP_OKAY) 5814 080 goto CLEANUP;
5790 081 goto CLEANUP; 5815 081 \}
5791 082 \} 5816 082
5792 083 5817 083 /* Back off if it's too big */
5793 084 /* Back off if it's too big */ 5818 084 while (mp_cmp (x, m) != MP_LT) \{
5794 085 while (mp_cmp (x, m) != MP_LT) \{ 5819 085 if ((res = s_mp_sub (x, m, x)) != MP_OKAY) \{
5795 086 if ((res = s_mp_sub (x, m, x)) != MP_OKAY) \{ 5820 086 goto CLEANUP;
5796 087 goto CLEANUP; 5821 087 \}
5797 088 \} 5822 088 \}
5798 089 \} 5823 089
5799 090 5824 090 CLEANUP:
5800 091 CLEANUP: 5825 091 mp_clear (&q);
5801 092 mp_clear (&q); 5826 092
5802 093 5827 093 return res;
5803 094 return res; 5828 094 \}
5804 095 \} 5829 095 #endif
5805 096 #endif
5806 \end{alltt} 5830 \end{alltt}
5807 \end{small} 5831 \end{small}
5808 5832
5809 The first multiplication that determines the quotient can be performed by only producing the digits from $m - 1$ and up. This essentially halves 5833 The first multiplication that determines the quotient can be performed by only producing the digits from $m - 1$ and up. This essentially halves
5810 the number of single precision multiplications required. However, the optimization is only safe if $\beta$ is much larger than the number of digits 5834 the number of single precision multiplications required. However, the optimization is only safe if $\beta$ is much larger than the number of digits
5811 in the modulus. In the source code this is evaluated on lines 36 to 44 where algorithm s\_mp\_mul\_high\_digs is used when it is 5835 in the modulus. In the source code this is evaluated on lines 36 to 43 where algorithm s\_mp\_mul\_high\_digs is used when it is
5812 safe to do so. 5836 safe to do so.
5813 5837
5814 \subsection{The Barrett Setup Algorithm} 5838 \subsection{The Barrett Setup Algorithm}
5815 In order to use algorithm mp\_reduce the value of $\mu$ must be calculated in advance. Ideally this value should be computed once and stored for 5839 In order to use algorithm mp\_reduce the value of $\mu$ must be calculated in advance. Ideally this value should be computed once and stored for
5816 future use so that the Barrett algorithm can be used without delay. 5840 future use so that the Barrett algorithm can be used without delay.
5817 5841
5818 \begin{figure}[!here] 5842 \newpage\begin{figure}[!here]
5819 \begin{small} 5843 \begin{small}
5820 \begin{center} 5844 \begin{center}
5821 \begin{tabular}{l} 5845 \begin{tabular}{l}
5822 \hline Algorithm \textbf{mp\_reduce\_setup}. \\ 5846 \hline Algorithm \textbf{mp\_reduce\_setup}. \\
5823 \textbf{Input}. mp\_int $a$ ($a > 1$) \\ 5847 \textbf{Input}. mp\_int $a$ ($a > 1$) \\
6309 020 * which uses the comba method to quickly calculate the columns of the 6333 020 * which uses the comba method to quickly calculate the columns of the
6310 021 * reduction. 6334 021 * reduction.
6311 022 * 6335 022 *
6312 023 * Based on Algorithm 14.32 on pp.601 of HAC. 6336 023 * Based on Algorithm 14.32 on pp.601 of HAC.
6313 024 */ 6337 024 */
6314 025 int 6338 025 int fast_mp_montgomery_reduce (mp_int * x, mp_int * n, mp_digit rho)
6315 026 fast_mp_montgomery_reduce (mp_int * x, mp_int * n, mp_digit rho) 6339 026 \{
6316 027 \{ 6340 027 int ix, res, olduse;
6317 028 int ix, res, olduse; 6341 028 mp_word W[MP_WARRAY];
6318 029 mp_word W[MP_WARRAY]; 6342 029
6319 030 6343 030 /* get old used count */
6320 031 /* get old used count */ 6344 031 olduse = x->used;
6321 032 olduse = x->used; 6345 032
6322 033 6346 033 /* grow a as required */
6323 034 /* grow a as required */ 6347 034 if (x->alloc < n->used + 1) \{
6324 035 if (x->alloc < n->used + 1) \{ 6348 035 if ((res = mp_grow (x, n->used + 1)) != MP_OKAY) \{
6325 036 if ((res = mp_grow (x, n->used + 1)) != MP_OKAY) \{ 6349 036 return res;
6326 037 return res; 6350 037 \}
6327 038 \} 6351 038 \}
6328 039 \} 6352 039
6329 040 6353 040 /* first we have to get the digits of the input into
6330 041 /* first we have to get the digits of the input into 6354 041 * an array of double precision words W[...]
6331 042 * an array of double precision words W[...] 6355 042 */
6332 043 */ 6356 043 \{
6333 044 \{ 6357 044 register mp_word *_W;
6334 045 register mp_word *_W; 6358 045 register mp_digit *tmpx;
6335 046 register mp_digit *tmpx; 6359 046
6336 047 6360 047 /* alias for the W[] array */
6337 048 /* alias for the W[] array */ 6361 048 _W = W;
6338 049 _W = W; 6362 049
6339 050 6363 050 /* alias for the digits of x*/
6340 051 /* alias for the digits of x*/ 6364 051 tmpx = x->dp;
6341 052 tmpx = x->dp; 6365 052
6342 053 6366 053 /* copy the digits of a into W[0..a->used-1] */
6343 054 /* copy the digits of a into W[0..a->used-1] */ 6367 054 for (ix = 0; ix < x->used; ix++) \{
6344 055 for (ix = 0; ix < x->used; ix++) \{ 6368 055 *_W++ = *tmpx++;
6345 056 *_W++ = *tmpx++; 6369 056 \}
6346 057 \} 6370 057
6347 058 6371 058 /* zero the high words of W[a->used..m->used*2] */
6348 059 /* zero the high words of W[a->used..m->used*2] */ 6372 059 for (; ix < n->used * 2 + 1; ix++) \{
6349 060 for (; ix < n->used * 2 + 1; ix++) \{ 6373 060 *_W++ = 0;
6350 061 *_W++ = 0; 6374 061 \}
6351 062 \} 6375 062 \}
6352 063 \} 6376 063
6353 064 6377 064 /* now we proceed to zero successive digits
6354 065 /* now we proceed to zero successive digits 6378 065 * from the least significant upwards
6355 066 * from the least significant upwards 6379 066 */
6356 067 */ 6380 067 for (ix = 0; ix < n->used; ix++) \{
6357 068 for (ix = 0; ix < n->used; ix++) \{ 6381 068 /* mu = ai * m' mod b
6358 069 /* mu = ai * m' mod b 6382 069 *
6359 070 * 6383 070 * We avoid a double precision multiplication (which isn't required)
6360 071 * We avoid a double precision multiplication (which isn't required) 6384 071 * by casting the value down to a mp_digit. Note this requires
6361 072 * by casting the value down to a mp_digit. Note this requires 6385 072 * that W[ix-1] have the carry cleared (see after the inner loop)
6362 073 * that W[ix-1] have the carry cleared (see after the inner loop) 6386 073 */
6363 074 */ 6387 074 register mp_digit mu;
6364 075 register mp_digit mu; 6388 075 mu = (mp_digit) (((W[ix] & MP_MASK) * rho) & MP_MASK);
6365 076 mu = (mp_digit) (((W[ix] & MP_MASK) * rho) & MP_MASK); 6389 076
6366 077 6390 077 /* a = a + mu * m * b**i
6367 078 /* a = a + mu * m * b**i 6391 078 *
6368 079 * 6392 079 * This is computed in place and on the fly. The multiplication
6369 080 * This is computed in place and on the fly. The multiplication 6393 080 * by b**i is handled by offseting which columns the results
6370 081 * by b**i is handled by offseting which columns the results 6394 081 * are added to.
6371 082 * are added to. 6395 082 *
6372 083 * 6396 083 * Note the comba method normally doesn't handle carries in the
6373 084 * Note the comba method normally doesn't handle carries in the 6397 084 * inner loop In this case we fix the carry from the previous
6374 085 * inner loop In this case we fix the carry from the previous 6398 085 * column since the Montgomery reduction requires digits of the
6375 086 * column since the Montgomery reduction requires digits of the 6399 086 * result (so far) [see above] to work. This is
6376 087 * result (so far) [see above] to work. This is 6400 087 * handled by fixing up one carry after the inner loop. The
6377 088 * handled by fixing up one carry after the inner loop. The 6401 088 * carry fixups are done in order so after these loops the
6378 089 * carry fixups are done in order so after these loops the 6402 089 * first m->used words of W[] have the carries fixed
6379 090 * first m->used words of W[] have the carries fixed 6403 090 */
6380 091 */ 6404 091 \{
6381 092 \{ 6405 092 register int iy;
6382 093 register int iy; 6406 093 register mp_digit *tmpn;
6383 094 register mp_digit *tmpn; 6407 094 register mp_word *_W;
6384 095 register mp_word *_W; 6408 095
6385 096 6409 096 /* alias for the digits of the modulus */
6386 097 /* alias for the digits of the modulus */ 6410 097 tmpn = n->dp;
6387 098 tmpn = n->dp; 6411 098
6388 099 6412 099 /* Alias for the columns set by an offset of ix */
6389 100 /* Alias for the columns set by an offset of ix */ 6413 100 _W = W + ix;
6390 101 _W = W + ix; 6414 101
6391 102 6415 102 /* inner loop */
6392 103 /* inner loop */ 6416 103 for (iy = 0; iy < n->used; iy++) \{
6393 104 for (iy = 0; iy < n->used; iy++) \{ 6417 104 *_W++ += ((mp_word)mu) * ((mp_word)*tmpn++);
6394 105 *_W++ += ((mp_word)mu) * ((mp_word)*tmpn++); 6418 105 \}
6395 106 \} 6419 106 \}
6396 107 \} 6420 107
6397 108 6421 108 /* now fix carry for next digit, W[ix+1] */
6398 109 /* now fix carry for next digit, W[ix+1] */ 6422 109 W[ix + 1] += W[ix] >> ((mp_word) DIGIT_BIT);
6399 110 W[ix + 1] += W[ix] >> ((mp_word) DIGIT_BIT); 6423 110 \}
6400 111 \} 6424 111
6401 112 6425 112 /* now we have to propagate the carries and
6402 113 /* now we have to propagate the carries and 6426 113 * shift the words downward [all those least
6403 114 * shift the words downward [all those least 6427 114 * significant digits we zeroed].
6404 115 * significant digits we zeroed]. 6428 115 */
6405 116 */ 6429 116 \{
6406 117 \{ 6430 117 register mp_digit *tmpx;
6407 118 register mp_digit *tmpx; 6431 118 register mp_word *_W, *_W1;
6408 119 register mp_word *_W, *_W1; 6432 119
6409 120 6433 120 /* nox fix rest of carries */
6410 121 /* nox fix rest of carries */ 6434 121
6411 122 6435 122 /* alias for current word */
6412 123 /* alias for current word */ 6436 123 _W1 = W + ix;
6413 124 _W1 = W + ix; 6437 124
6414 125 6438 125 /* alias for next word, where the carry goes */
6415 126 /* alias for next word, where the carry goes */ 6439 126 _W = W + ++ix;
6416 127 _W = W + ++ix; 6440 127
6417 128 6441 128 for (; ix <= n->used * 2 + 1; ix++) \{
6418 129 for (; ix <= n->used * 2 + 1; ix++) \{ 6442 129 *_W++ += *_W1++ >> ((mp_word) DIGIT_BIT);
6419 130 *_W++ += *_W1++ >> ((mp_word) DIGIT_BIT); 6443 130 \}
6420 131 \} 6444 131
6421 132 6445 132 /* copy out, A = A/b**n
6422 133 /* copy out, A = A/b**n 6446 133 *
6423 134 * 6447 134 * The result is A/b**n but instead of converting from an
6424 135 * The result is A/b**n but instead of converting from an 6448 135 * array of mp_word to mp_digit than calling mp_rshd
6425 136 * array of mp_word to mp_digit than calling mp_rshd 6449 136 * we just copy them in the right order
6426 137 * we just copy them in the right order 6450 137 */
6427 138 */ 6451 138
6428 139 6452 139 /* alias for destination word */
6429 140 /* alias for destination word */ 6453 140 tmpx = x->dp;
6430 141 tmpx = x->dp; 6454 141
6431 142 6455 142 /* alias for shifted double precision result */
6432 143 /* alias for shifted double precision result */ 6456 143 _W = W + n->used;
6433 144 _W = W + n->used; 6457 144
6434 145 6458 145 for (ix = 0; ix < n->used + 1; ix++) \{
6435 146 for (ix = 0; ix < n->used + 1; ix++) \{ 6459 146 *tmpx++ = (mp_digit)(*_W++ & ((mp_word) MP_MASK));
6436 147 *tmpx++ = (mp_digit)(*_W++ & ((mp_word) MP_MASK)); 6460 147 \}
6437 148 \} 6461 148
6438 149 6462 149 /* zero oldused digits, if the input a was larger than
6439 150 /* zero oldused digits, if the input a was larger than 6463 150 * m->used+1 we'll have to clear the digits
6440 151 * m->used+1 we'll have to clear the digits 6464 151 */
6441 152 */ 6465 152 for (; ix < olduse; ix++) \{
6442 153 for (; ix < olduse; ix++) \{ 6466 153 *tmpx++ = 0;
6443 154 *tmpx++ = 0; 6467 154 \}
6444 155 \} 6468 155 \}
6445 156 \} 6469 156
6446 157 6470 157 /* set the max used and clamp */
6447 158 /* set the max used and clamp */ 6471 158 x->used = n->used + 1;
6448 159 x->used = n->used + 1; 6472 159 mp_clamp (x);
6449 160 mp_clamp (x); 6473 160
6450 161 6474 161 /* if A >= m then A = A - m */
6451 162 /* if A >= m then A = A - m */ 6475 162 if (mp_cmp_mag (x, n) != MP_LT) \{
6452 163 if (mp_cmp_mag (x, n) != MP_LT) \{ 6476 163 return s_mp_sub (x, n, x);
6453 164 return s_mp_sub (x, n, x); 6477 164 \}
6454 165 \} 6478 165 return MP_OKAY;
6455 166 return MP_OKAY; 6479 166 \}
6456 167 \} 6480 167 #endif
6457 168 #endif
6458 \end{alltt} 6481 \end{alltt}
6459 \end{small} 6482 \end{small}
6460 6483
6461 The $\hat W$ array is first filled with digits of $x$ on line 48 then the rest of the digits are zeroed on line 55. Both loops share 6484 The $\hat W$ array is first filled with digits of $x$ on line 50 then the rest of the digits are zeroed on line 54. Both loops share
6462 the same alias variables to make the code easier to read. 6485 the same alias variables to make the code easier to read.
6463 6486
6464 The value of $\mu$ is calculated in an interesting fashion. First the value $\hat W_{ix}$ is reduced modulo $\beta$ and cast to a mp\_digit. This 6487 The value of $\mu$ is calculated in an interesting fashion. First the value $\hat W_{ix}$ is reduced modulo $\beta$ and cast to a mp\_digit. This
6465 forces the compiler to use a single precision multiplication and prevents any concerns about loss of precision. Line 110 fixes the carry 6488 forces the compiler to use a single precision multiplication and prevents any concerns about loss of precision. Line 109 fixes the carry
6466 for the next iteration of the loop by propagating the carry from $\hat W_{ix}$ to $\hat W_{ix+1}$. 6489 for the next iteration of the loop by propagating the carry from $\hat W_{ix}$ to $\hat W_{ix+1}$.
6467 6490
6468 The for loop on line 109 propagates the rest of the carries upwards through the columns. The for loop on line 126 reduces the columns 6491 The for loop on line 108 propagates the rest of the carries upwards through the columns. The for loop on line 125 reduces the columns
6469 modulo $\beta$ and shifts them $k$ places at the same time. The alias $\_ \hat W$ actually refers to the array $\hat W$ starting at the $n.used$'th 6492 modulo $\beta$ and shifts them $k$ places at the same time. The alias $\_ \hat W$ actually refers to the array $\hat W$ starting at the $n.used$'th
6470 digit, that is $\_ \hat W_{t} = \hat W_{n.used + t}$. 6493 digit, that is $\_ \hat W_{t} = \hat W_{n.used + t}$.
6471 6494
6472 \subsection{Montgomery Setup} 6495 \subsection{Montgomery Setup}
6473 To calculate the variable $\rho$ a relatively simple algorithm will be required. 6496 To calculate the variable $\rho$ a relatively simple algorithm will be required.
6737 017 /* reduce "x" in place modulo "n" using the Diminished Radix algorithm. 6760 017 /* reduce "x" in place modulo "n" using the Diminished Radix algorithm.
6738 018 * 6761 018 *
6739 019 * Based on algorithm from the paper 6762 019 * Based on algorithm from the paper
6740 020 * 6763 020 *
6741 021 * "Generating Efficient Primes for Discrete Log Cryptosystems" 6764 021 * "Generating Efficient Primes for Discrete Log Cryptosystems"
6742 022 * Chae Hoon Lim, Pil Loong Lee, 6765 022 * Chae Hoon Lim, Pil Joong Lee,
6743 023 * POSTECH Information Research Laboratories 6766 023 * POSTECH Information Research Laboratories
6744 024 * 6767 024 *
6745 025 * The modulus must be of a special format [see manual] 6768 025 * The modulus must be of a special format [see manual]
6746 026 * 6769 026 *
6747 027 * Has been modified to use algorithm 7.10 from the LTM book instead 6770 027 * Has been modified to use algorithm 7.10 from the LTM book instead
6963 \hspace{-5.1mm}{\bf File}: bn\_mp\_reduce\_2k.c 6986 \hspace{-5.1mm}{\bf File}: bn\_mp\_reduce\_2k.c
6964 \vspace{-3mm} 6987 \vspace{-3mm}
6965 \begin{alltt} 6988 \begin{alltt}
6966 016 6989 016
6967 017 /* reduces a modulo n where n is of the form 2**p - d */ 6990 017 /* reduces a modulo n where n is of the form 2**p - d */
6968 018 int 6991 018 int mp_reduce_2k(mp_int *a, mp_int *n, mp_digit d)
6969 019 mp_reduce_2k(mp_int *a, mp_int *n, mp_digit d) 6992 019 \{
6970 020 \{ 6993 020 mp_int q;
6971 021 mp_int q; 6994 021 int p, res;
6972 022 int p, res; 6995 022
6973 023 6996 023 if ((res = mp_init(&q)) != MP_OKAY) \{
6974 024 if ((res = mp_init(&q)) != MP_OKAY) \{ 6997 024 return res;
6975 025 return res; 6998 025 \}
6976 026 \} 6999 026
6977 027 7000 027 p = mp_count_bits(n);
6978 028 p = mp_count_bits(n); 7001 028 top:
6979 029 top: 7002 029 /* q = a/2**p, a = a mod 2**p */
6980 030 /* q = a/2**p, a = a mod 2**p */ 7003 030 if ((res = mp_div_2d(a, p, &q, a)) != MP_OKAY) \{
6981 031 if ((res = mp_div_2d(a, p, &q, a)) != MP_OKAY) \{ 7004 031 goto ERR;
6982 032 goto ERR; 7005 032 \}
6983 033 \} 7006 033
6984 034 7007 034 if (d != 1) \{
6985 035 if (d != 1) \{ 7008 035 /* q = q * d */
6986 036 /* q = q * d */ 7009 036 if ((res = mp_mul_d(&q, d, &q)) != MP_OKAY) \{
6987 037 if ((res = mp_mul_d(&q, d, &q)) != MP_OKAY) \{ 7010 037 goto ERR;
6988 038 goto ERR; 7011 038 \}
6989 039 \} 7012 039 \}
6990 040 \} 7013 040
6991 041 7014 041 /* a = a + q */
6992 042 /* a = a + q */ 7015 042 if ((res = s_mp_add(a, &q, a)) != MP_OKAY) \{
6993 043 if ((res = s_mp_add(a, &q, a)) != MP_OKAY) \{ 7016 043 goto ERR;
6994 044 goto ERR; 7017 044 \}
6995 045 \} 7018 045
6996 046 7019 046 if (mp_cmp_mag(a, n) != MP_LT) \{
6997 047 if (mp_cmp_mag(a, n) != MP_LT) \{ 7020 047 s_mp_sub(a, n, a);
6998 048 s_mp_sub(a, n, a); 7021 048 goto top;
6999 049 goto top; 7022 049 \}
7000 050 \} 7023 050
7001 051 7024 051 ERR:
7002 052 ERR: 7025 052 mp_clear(&q);
7003 053 mp_clear(&q); 7026 053 return res;
7004 054 return res; 7027 054 \}
7005 055 \} 7028 055
7006 056 7029 056 #endif
7007 057 #endif
7008 \end{alltt} 7030 \end{alltt}
7009 \end{small} 7031 \end{small}
7010 7032
7011 The algorithm mp\_count\_bits calculates the number of bits in an mp\_int which is used to find the initial value of $p$. The call to mp\_div\_2d 7033 The algorithm mp\_count\_bits calculates the number of bits in an mp\_int which is used to find the initial value of $p$. The call to mp\_div\_2d
7012 on line 31 calculates both the quotient $q$ and the remainder $a$ required. By doing both in a single function call the code size 7034 on line 30 calculates both the quotient $q$ and the remainder $a$ required. By doing both in a single function call the code size
7013 is kept fairly small. The multiplication by $k$ is only performed if $k > 1$. This allows reductions modulo $2^p - 1$ to be performed without 7035 is kept fairly small. The multiplication by $k$ is only performed if $k > 1$. This allows reductions modulo $2^p - 1$ to be performed without
7014 any multiplications. 7036 any multiplications.
7015 7037
7016 The unsigned s\_mp\_add, mp\_cmp\_mag and s\_mp\_sub are used in place of their full sign counterparts since the inputs are only valid if they are 7038 The unsigned s\_mp\_add, mp\_cmp\_mag and s\_mp\_sub are used in place of their full sign counterparts since the inputs are only valid if they are
7017 positive. By using the unsigned versions the overhead is kept to a minimum. 7039 positive. By using the unsigned versions the overhead is kept to a minimum.
7047 \hspace{-5.1mm}{\bf File}: bn\_mp\_reduce\_2k\_setup.c 7069 \hspace{-5.1mm}{\bf File}: bn\_mp\_reduce\_2k\_setup.c
7048 \vspace{-3mm} 7070 \vspace{-3mm}
7049 \begin{alltt} 7071 \begin{alltt}
7050 016 7072 016
7051 017 /* determines the setup value */ 7073 017 /* determines the setup value */
7052 018 int 7074 018 int mp_reduce_2k_setup(mp_int *a, mp_digit *d)
7053 019 mp_reduce_2k_setup(mp_int *a, mp_digit *d) 7075 019 \{
7054 020 \{ 7076 020 int res, p;
7055 021 int res, p; 7077 021 mp_int tmp;
7056 022 mp_int tmp; 7078 022
7057 023 7079 023 if ((res = mp_init(&tmp)) != MP_OKAY) \{
7058 024 if ((res = mp_init(&tmp)) != MP_OKAY) \{ 7080 024 return res;
7059 025 return res; 7081 025 \}
7060 026 \} 7082 026
7061 027 7083 027 p = mp_count_bits(a);
7062 028 p = mp_count_bits(a); 7084 028 if ((res = mp_2expt(&tmp, p)) != MP_OKAY) \{
7063 029 if ((res = mp_2expt(&tmp, p)) != MP_OKAY) \{ 7085 029 mp_clear(&tmp);
7064 030 mp_clear(&tmp); 7086 030 return res;
7065 031 return res; 7087 031 \}
7066 032 \} 7088 032
7067 033 7089 033 if ((res = s_mp_sub(&tmp, a, &tmp)) != MP_OKAY) \{
7068 034 if ((res = s_mp_sub(&tmp, a, &tmp)) != MP_OKAY) \{ 7090 034 mp_clear(&tmp);
7069 035 mp_clear(&tmp); 7091 035 return res;
7070 036 return res; 7092 036 \}
7071 037 \} 7093 037
7072 038 7094 038 *d = tmp.dp[0];
7073 039 *d = tmp.dp[0]; 7095 039 mp_clear(&tmp);
7074 040 mp_clear(&tmp); 7096 040 return MP_OKAY;
7075 041 return MP_OKAY; 7097 041 \}
7076 042 \} 7098 042 #endif
7077 043 #endif
7078 \end{alltt} 7099 \end{alltt}
7079 \end{small} 7100 \end{small}
7080 7101
7081 \subsubsection{Unrestricted Detection} 7102 \subsubsection{Unrestricted Detection}
7082 An integer $n$ is a valid unrestricted Diminished Radix modulus if either of the following are true. 7103 An integer $n$ is a valid unrestricted Diminished Radix modulus if either of the following are true.
7125 019 \{ 7146 019 \{
7126 020 int ix, iy, iw; 7147 020 int ix, iy, iw;
7127 021 mp_digit iz; 7148 021 mp_digit iz;
7128 022 7149 022
7129 023 if (a->used == 0) \{ 7150 023 if (a->used == 0) \{
7130 024 return 0; 7151 024 return MP_NO;
7131 025 \} else if (a->used == 1) \{ 7152 025 \} else if (a->used == 1) \{
7132 026 return 1; 7153 026 return MP_YES;
7133 027 \} else if (a->used > 1) \{ 7154 027 \} else if (a->used > 1) \{
7134 028 iy = mp_count_bits(a); 7155 028 iy = mp_count_bits(a);
7135 029 iz = 1; 7156 029 iz = 1;
7136 030 iw = 1; 7157 030 iw = 1;
7137 031 7158 031
7138 032 /* Test every bit from the second digit up, must be 1 */ 7159 032 /* Test every bit from the second digit up, must be 1 */
7139 033 for (ix = DIGIT_BIT; ix < iy; ix++) \{ 7160 033 for (ix = DIGIT_BIT; ix < iy; ix++) \{
7140 034 if ((a->dp[iw] & iz) == 0) \{ 7161 034 if ((a->dp[iw] & iz) == 0) \{
7141 035 return 0; 7162 035 return MP_NO;
7142 036 \} 7163 036 \}
7143 037 iz <<= 1; 7164 037 iz <<= 1;
7144 038 if (iz > (mp_digit)MP_MASK) \{ 7165 038 if (iz > (mp_digit)MP_MASK) \{
7145 039 ++iw; 7166 039 ++iw;
7146 040 iz = 1; 7167 040 iz = 1;
7147 041 \} 7168 041 \}
7148 042 \} 7169 042 \}
7149 043 \} 7170 043 \}
7150 044 return 1; 7171 044 return MP_YES;
7151 045 \} 7172 045 \}
7152 046 7173 046
7153 047 #endif 7174 047 #endif
7154 \end{alltt} 7175 \end{alltt}
7155 \end{small} 7176 \end{small}
7592 058 err = mp_exptmod(&tmpG, &tmpX, P, Y); 7613 058 err = mp_exptmod(&tmpG, &tmpX, P, Y);
7593 059 mp_clear_multi(&tmpG, &tmpX, NULL); 7614 059 mp_clear_multi(&tmpG, &tmpX, NULL);
7594 060 return err; 7615 060 return err;
7595 061 #else 7616 061 #else
7596 062 /* no invmod */ 7617 062 /* no invmod */
7597 063 return MP_VAL 7618 063 return MP_VAL;
7598 064 #endif 7619 064 #endif
7599 065 \} 7620 065 \}
7600 066 7621 066
7601 067 #ifdef BN_MP_DR_IS_MODULUS_C 7622 067 /* modified diminished radix reduction */
7602 068 /* is it a DR modulus? */ 7623 068 #if defined(BN_MP_REDUCE_IS_2K_L_C) && defined(BN_MP_REDUCE_2K_L_C)
7603 069 dr = mp_dr_is_modulus(P); 7624 069 if (mp_reduce_is_2k_l(P) == MP_YES) \{
7604 070 #else 7625 070 return s_mp_exptmod(G, X, P, Y, 1);
7605 071 dr = 0; 7626 071 \}
7606 072 #endif 7627 072 #endif
7607 073 7628 073
7608 074 #ifdef BN_MP_REDUCE_IS_2K_C 7629 074 #ifdef BN_MP_DR_IS_MODULUS_C
7609 075 /* if not, is it a uDR modulus? */ 7630 075 /* is it a DR modulus? */
7610 076 if (dr == 0) \{ 7631 076 dr = mp_dr_is_modulus(P);
7611 077 dr = mp_reduce_is_2k(P) << 1; 7632 077 #else
7612 078 \} 7633 078 /* default to no */
7613 079 #endif 7634 079 dr = 0;
7614 080 7635 080 #endif
7615 081 /* if the modulus is odd or dr != 0 use the fast method */ 7636 081
7616 082 #ifdef BN_MP_EXPTMOD_FAST_C 7637 082 #ifdef BN_MP_REDUCE_IS_2K_C
7617 083 if (mp_isodd (P) == 1 || dr != 0) \{ 7638 083 /* if not, is it a unrestricted DR modulus? */
7618 084 return mp_exptmod_fast (G, X, P, Y, dr); 7639 084 if (dr == 0) \{
7619 085 \} else \{ 7640 085 dr = mp_reduce_is_2k(P) << 1;
7620 086 #endif 7641 086 \}
7621 087 #ifdef BN_S_MP_EXPTMOD_C 7642 087 #endif
7622 088 /* otherwise use the generic Barrett reduction technique */ 7643 088
7623 089 return s_mp_exptmod (G, X, P, Y); 7644 089 /* if the modulus is odd or dr != 0 use the montgomery method */
7624 090 #else 7645 090 #ifdef BN_MP_EXPTMOD_FAST_C
7625 091 /* no exptmod for evens */ 7646 091 if (mp_isodd (P) == 1 || dr != 0) \{
7626 092 return MP_VAL; 7647 092 return mp_exptmod_fast (G, X, P, Y, dr);
7627 093 #endif 7648 093 \} else \{
7628 094 #ifdef BN_MP_EXPTMOD_FAST_C 7649 094 #endif
7629 095 \} 7650 095 #ifdef BN_S_MP_EXPTMOD_C
7630 096 #endif 7651 096 /* otherwise use the generic Barrett reduction technique */
7631 097 \} 7652 097 return s_mp_exptmod (G, X, P, Y, 0);
7632 098 7653 098 #else
7633 099 #endif 7654 099 /* no exptmod for evens */
7655 100 return MP_VAL;
7656 101 #endif
7657 102 #ifdef BN_MP_EXPTMOD_FAST_C
7658 103 \}
7659 104 #endif
7660 105 \}
7661 106
7662 107 #endif
7634 \end{alltt} 7663 \end{alltt}
7635 \end{small} 7664 \end{small}
7636 7665
7637 In order to keep the algorithms in a known state the first step on line 28 is to reject any negative modulus as input. If the exponent is 7666 In order to keep the algorithms in a known state the first step on line 28 is to reject any negative modulus as input. If the exponent is
7638 negative the algorithm tries to perform a modular exponentiation with the modular inverse of the base $G$. The temporary variable $tmpG$ is assigned 7667 negative the algorithm tries to perform a modular exponentiation with the modular inverse of the base $G$. The temporary variable $tmpG$ is assigned
7639 the modular inverse of $G$ and $tmpX$ is assigned the absolute value of $X$. The algorithm will recuse with these new values with a positive 7668 the modular inverse of $G$ and $tmpX$ is assigned the absolute value of $X$. The algorithm will recuse with these new values with a positive
7640 exponent. 7669 exponent.
7641 7670
7642 If the exponent is positive the algorithm resumes the exponentiation. Line 69 determines if the modulus is of the restricted Diminished Radix 7671 If the exponent is positive the algorithm resumes the exponentiation. Line 76 determines if the modulus is of the restricted Diminished Radix
7643 form. If it is not line 77 attempts to determine if it is of a unrestricted Diminished Radix form. The integer $dr$ will take on one 7672 form. If it is not line 69 attempts to determine if it is of a unrestricted Diminished Radix form. The integer $dr$ will take on one
7644 of three values. 7673 of three values.
7645 7674
7646 \begin{enumerate} 7675 \begin{enumerate}
7647 \item $dr = 0$ means that the modulus is not of either restricted or unrestricted Diminished Radix form. 7676 \item $dr = 0$ means that the modulus is not of either restricted or unrestricted Diminished Radix form.
7648 \item $dr = 1$ means that the modulus is of restricted Diminished Radix form. 7677 \item $dr = 1$ means that the modulus is of restricted Diminished Radix form.
7649 \item $dr = 2$ means that the modulus is of unrestricted Diminished Radix form. 7678 \item $dr = 2$ means that the modulus is of unrestricted Diminished Radix form.
7650 \end{enumerate} 7679 \end{enumerate}
7651 7680
7652 Line 67 determines if the fast modular exponentiation algorithm can be used. It is allowed if $dr \ne 0$ or if the modulus is odd. Otherwise, 7681 Line 69 determines if the fast modular exponentiation algorithm can be used. It is allowed if $dr \ne 0$ or if the modulus is odd. Otherwise,
7653 the slower s\_mp\_exptmod algorithm is used which uses Barrett reduction. 7682 the slower s\_mp\_exptmod algorithm is used which uses Barrett reduction.
7654 7683
7655 \subsection{Barrett Modular Exponentiation} 7684 \subsection{Barrett Modular Exponentiation}
7656 7685
7657 \newpage\begin{figure}[!here] 7686 \newpage\begin{figure}[!here]
7815 018 #define TAB_SIZE 32 7844 018 #define TAB_SIZE 32
7816 019 #else 7845 019 #else
7817 020 #define TAB_SIZE 256 7846 020 #define TAB_SIZE 256
7818 021 #endif 7847 021 #endif
7819 022 7848 022
7820 023 int s_mp_exptmod (mp_int * G, mp_int * X, mp_int * P, mp_int * Y) 7849 023 int s_mp_exptmod (mp_int * G, mp_int * X, mp_int * P, mp_int * Y, int redmod
7850 e)
7821 024 \{ 7851 024 \{
7822 025 mp_int M[TAB_SIZE], res, mu; 7852 025 mp_int M[TAB_SIZE], res, mu;
7823 026 mp_digit buf; 7853 026 mp_digit buf;
7824 027 int err, bitbuf, bitcpy, bitcnt, mode, digidx, x, y, winsize; 7854 027 int err, bitbuf, bitcpy, bitcnt, mode, digidx, x, y, winsize;
7825 028 7855 028 int (*redux)(mp_int*,mp_int*,mp_int*);
7826 029 /* find window size */ 7856 029
7827 030 x = mp_count_bits (X); 7857 030 /* find window size */
7828 031 if (x <= 7) \{ 7858 031 x = mp_count_bits (X);
7829 032 winsize = 2; 7859 032 if (x <= 7) \{
7830 033 \} else if (x <= 36) \{ 7860 033 winsize = 2;
7831 034 winsize = 3; 7861 034 \} else if (x <= 36) \{
7832 035 \} else if (x <= 140) \{ 7862 035 winsize = 3;
7833 036 winsize = 4; 7863 036 \} else if (x <= 140) \{
7834 037 \} else if (x <= 450) \{ 7864 037 winsize = 4;
7835 038 winsize = 5; 7865 038 \} else if (x <= 450) \{
7836 039 \} else if (x <= 1303) \{ 7866 039 winsize = 5;
7837 040 winsize = 6; 7867 040 \} else if (x <= 1303) \{
7838 041 \} else if (x <= 3529) \{ 7868 041 winsize = 6;
7839 042 winsize = 7; 7869 042 \} else if (x <= 3529) \{
7840 043 \} else \{ 7870 043 winsize = 7;
7841 044 winsize = 8; 7871 044 \} else \{
7842 045 \} 7872 045 winsize = 8;
7843 046 7873 046 \}
7844 047 #ifdef MP_LOW_MEM 7874 047
7845 048 if (winsize > 5) \{ 7875 048 #ifdef MP_LOW_MEM
7846 049 winsize = 5; 7876 049 if (winsize > 5) \{
7847 050 \} 7877 050 winsize = 5;
7848 051 #endif 7878 051 \}
7849 052 7879 052 #endif
7850 053 /* init M array */ 7880 053
7851 054 /* init first cell */ 7881 054 /* init M array */
7852 055 if ((err = mp_init(&M[1])) != MP_OKAY) \{ 7882 055 /* init first cell */
7853 056 return err; 7883 056 if ((err = mp_init(&M[1])) != MP_OKAY) \{
7854 057 \} 7884 057 return err;
7855 058 7885 058 \}
7856 059 /* now init the second half of the array */ 7886 059
7857 060 for (x = 1<<(winsize-1); x < (1 << winsize); x++) \{ 7887 060 /* now init the second half of the array */
7858 061 if ((err = mp_init(&M[x])) != MP_OKAY) \{ 7888 061 for (x = 1<<(winsize-1); x < (1 << winsize); x++) \{
7859 062 for (y = 1<<(winsize-1); y < x; y++) \{ 7889 062 if ((err = mp_init(&M[x])) != MP_OKAY) \{
7860 063 mp_clear (&M[y]); 7890 063 for (y = 1<<(winsize-1); y < x; y++) \{
7861 064 \} 7891 064 mp_clear (&M[y]);
7862 065 mp_clear(&M[1]); 7892 065 \}
7863 066 return err; 7893 066 mp_clear(&M[1]);
7864 067 \} 7894 067 return err;
7865 068 \} 7895 068 \}
7866 069 7896 069 \}
7867 070 /* create mu, used for Barrett reduction */ 7897 070
7868 071 if ((err = mp_init (&mu)) != MP_OKAY) \{ 7898 071 /* create mu, used for Barrett reduction */
7869 072 goto __M; 7899 072 if ((err = mp_init (&mu)) != MP_OKAY) \{
7870 073 \} 7900 073 goto LBL_M;
7871 074 if ((err = mp_reduce_setup (&mu, P)) != MP_OKAY) \{ 7901 074 \}
7872 075 goto __MU; 7902 075
7873 076 \} 7903 076 if (redmode == 0) \{
7874 077 7904 077 if ((err = mp_reduce_setup (&mu, P)) != MP_OKAY) \{
7875 078 /* create M table 7905 078 goto LBL_MU;
7876 079 * 7906 079 \}
7877 080 * The M table contains powers of the base, 7907 080 redux = mp_reduce;
7878 081 * e.g. M[x] = G**x mod P 7908 081 \} else \{
7879 082 * 7909 082 if ((err = mp_reduce_2k_setup_l (P, &mu)) != MP_OKAY) \{
7880 083 * The first half of the table is not 7910 083 goto LBL_MU;
7881 084 * computed though accept for M[0] and M[1] 7911 084 \}
7882 085 */ 7912 085 redux = mp_reduce_2k_l;
7883 086 if ((err = mp_mod (G, P, &M[1])) != MP_OKAY) \{ 7913 086 \}
7884 087 goto __MU; 7914 087
7885 088 \} 7915 088 /* create M table
7886 089 7916 089 *
7887 090 /* compute the value at M[1<<(winsize-1)] by squaring 7917 090 * The M table contains powers of the base,
7888 091 * M[1] (winsize-1) times 7918 091 * e.g. M[x] = G**x mod P
7889 092 */ 7919 092 *
7890 093 if ((err = mp_copy (&M[1], &M[1 << (winsize - 1)])) != MP_OKAY) \{ 7920 093 * The first half of the table is not
7891 094 goto __MU; 7921 094 * computed though accept for M[0] and M[1]
7892 095 \} 7922 095 */
7893 096 7923 096 if ((err = mp_mod (G, P, &M[1])) != MP_OKAY) \{
7894 097 for (x = 0; x < (winsize - 1); x++) \{ 7924 097 goto LBL_MU;
7895 098 if ((err = mp_sqr (&M[1 << (winsize - 1)], 7925 098 \}
7896 099 &M[1 << (winsize - 1)])) != MP_OKAY) \{ 7926 099
7897 100 goto __MU; 7927 100 /* compute the value at M[1<<(winsize-1)] by squaring
7898 101 \} 7928 101 * M[1] (winsize-1) times
7899 102 if ((err = mp_reduce (&M[1 << (winsize - 1)], P, &mu)) != MP_OKAY) \{ 7929 102 */
7900 103 goto __MU; 7930 103 if ((err = mp_copy (&M[1], &M[1 << (winsize - 1)])) != MP_OKAY) \{
7901 104 \} 7931 104 goto LBL_MU;
7902 105 \} 7932 105 \}
7903 106 7933 106
7904 107 /* create upper table, that is M[x] = M[x-1] * M[1] (mod P) 7934 107 for (x = 0; x < (winsize - 1); x++) \{
7905 108 * for x = (2**(winsize - 1) + 1) to (2**winsize - 1) 7935 108 /* square it */
7906 109 */ 7936 109 if ((err = mp_sqr (&M[1 << (winsize - 1)],
7907 110 for (x = (1 << (winsize - 1)) + 1; x < (1 << winsize); x++) \{ 7937 110 &M[1 << (winsize - 1)])) != MP_OKAY) \{
7908 111 if ((err = mp_mul (&M[x - 1], &M[1], &M[x])) != MP_OKAY) \{ 7938 111 goto LBL_MU;
7909 112 goto __MU; 7939 112 \}
7910 113 \} 7940 113
7911 114 if ((err = mp_reduce (&M[x], P, &mu)) != MP_OKAY) \{ 7941 114 /* reduce modulo P */
7912 115 goto __MU; 7942 115 if ((err = redux (&M[1 << (winsize - 1)], P, &mu)) != MP_OKAY) \{
7913 116 \} 7943 116 goto LBL_MU;
7914 117 \} 7944 117 \}
7915 118 7945 118 \}
7916 119 /* setup result */ 7946 119
7917 120 if ((err = mp_init (&res)) != MP_OKAY) \{ 7947 120 /* create upper table, that is M[x] = M[x-1] * M[1] (mod P)
7918 121 goto __MU; 7948 121 * for x = (2**(winsize - 1) + 1) to (2**winsize - 1)
7919 122 \} 7949 122 */
7920 123 mp_set (&res, 1); 7950 123 for (x = (1 << (winsize - 1)) + 1; x < (1 << winsize); x++) \{
7921 124 7951 124 if ((err = mp_mul (&M[x - 1], &M[1], &M[x])) != MP_OKAY) \{
7922 125 /* set initial mode and bit cnt */ 7952 125 goto LBL_MU;
7923 126 mode = 0; 7953 126 \}
7924 127 bitcnt = 1; 7954 127 if ((err = redux (&M[x], P, &mu)) != MP_OKAY) \{
7925 128 buf = 0; 7955 128 goto LBL_MU;
7926 129 digidx = X->used - 1; 7956 129 \}
7927 130 bitcpy = 0; 7957 130 \}
7928 131 bitbuf = 0; 7958 131
7929 132 7959 132 /* setup result */
7930 133 for (;;) \{ 7960 133 if ((err = mp_init (&res)) != MP_OKAY) \{
7931 134 /* grab next digit as required */ 7961 134 goto LBL_MU;
7932 135 if (--bitcnt == 0) \{ 7962 135 \}
7933 136 /* if digidx == -1 we are out of digits */ 7963 136 mp_set (&res, 1);
7934 137 if (digidx == -1) \{ 7964 137
7935 138 break; 7965 138 /* set initial mode and bit cnt */
7936 139 \} 7966 139 mode = 0;
7937 140 /* read next digit and reset the bitcnt */ 7967 140 bitcnt = 1;
7938 141 buf = X->dp[digidx--]; 7968 141 buf = 0;
7939 142 bitcnt = (int) DIGIT_BIT; 7969 142 digidx = X->used - 1;
7940 143 \} 7970 143 bitcpy = 0;
7941 144 7971 144 bitbuf = 0;
7942 145 /* grab the next msb from the exponent */ 7972 145
7943 146 y = (buf >> (mp_digit)(DIGIT_BIT - 1)) & 1; 7973 146 for (;;) \{
7944 147 buf <<= (mp_digit)1; 7974 147 /* grab next digit as required */
7945 148 7975 148 if (--bitcnt == 0) \{
7946 149 /* if the bit is zero and mode == 0 then we ignore it 7976 149 /* if digidx == -1 we are out of digits */
7947 150 * These represent the leading zero bits before the first 1 bit 7977 150 if (digidx == -1) \{
7948 151 * in the exponent. Technically this opt is not required but it 7978 151 break;
7949 152 * does lower the # of trivial squaring/reductions used 7979 152 \}
7950 153 */ 7980 153 /* read next digit and reset the bitcnt */
7951 154 if (mode == 0 && y == 0) \{ 7981 154 buf = X->dp[digidx--];
7952 155 continue; 7982 155 bitcnt = (int) DIGIT_BIT;
7953 156 \} 7983 156 \}
7954 157 7984 157
7955 158 /* if the bit is zero and mode == 1 then we square */ 7985 158 /* grab the next msb from the exponent */
7956 159 if (mode == 1 && y == 0) \{ 7986 159 y = (buf >> (mp_digit)(DIGIT_BIT - 1)) & 1;
7957 160 if ((err = mp_sqr (&res, &res)) != MP_OKAY) \{ 7987 160 buf <<= (mp_digit)1;
7958 161 goto __RES; 7988 161
7959 162 \} 7989 162 /* if the bit is zero and mode == 0 then we ignore it
7960 163 if ((err = mp_reduce (&res, P, &mu)) != MP_OKAY) \{ 7990 163 * These represent the leading zero bits before the first 1 bit
7961 164 goto __RES; 7991 164 * in the exponent. Technically this opt is not required but it
7962 165 \} 7992 165 * does lower the # of trivial squaring/reductions used
7963 166 continue; 7993 166 */
7964 167 \} 7994 167 if (mode == 0 && y == 0) \{
7965 168 7995 168 continue;
7966 169 /* else we add it to the window */ 7996 169 \}
7967 170 bitbuf |= (y << (winsize - ++bitcpy)); 7997 170
7968 171 mode = 2; 7998 171 /* if the bit is zero and mode == 1 then we square */
7969 172 7999 172 if (mode == 1 && y == 0) \{
7970 173 if (bitcpy == winsize) \{ 8000 173 if ((err = mp_sqr (&res, &res)) != MP_OKAY) \{
7971 174 /* ok window is filled so square as required and multiply */ 8001 174 goto LBL_RES;
7972 175 /* square first */ 8002 175 \}
7973 176 for (x = 0; x < winsize; x++) \{ 8003 176 if ((err = redux (&res, P, &mu)) != MP_OKAY) \{
7974 177 if ((err = mp_sqr (&res, &res)) != MP_OKAY) \{ 8004 177 goto LBL_RES;
7975 178 goto __RES; 8005 178 \}
7976 179 \} 8006 179 continue;
7977 180 if ((err = mp_reduce (&res, P, &mu)) != MP_OKAY) \{ 8007 180 \}
7978 181 goto __RES; 8008 181
7979 182 \} 8009 182 /* else we add it to the window */
7980 183 \} 8010 183 bitbuf |= (y << (winsize - ++bitcpy));
7981 184 8011 184 mode = 2;
7982 185 /* then multiply */ 8012 185
7983 186 if ((err = mp_mul (&res, &M[bitbuf], &res)) != MP_OKAY) \{ 8013 186 if (bitcpy == winsize) \{
7984 187 goto __RES; 8014 187 /* ok window is filled so square as required and multiply */
7985 188 \} 8015 188 /* square first */
7986 189 if ((err = mp_reduce (&res, P, &mu)) != MP_OKAY) \{ 8016 189 for (x = 0; x < winsize; x++) \{
7987 190 goto __RES; 8017 190 if ((err = mp_sqr (&res, &res)) != MP_OKAY) \{
7988 191 \} 8018 191 goto LBL_RES;
7989 192 8019 192 \}
7990 193 /* empty window and reset */ 8020 193 if ((err = redux (&res, P, &mu)) != MP_OKAY) \{
7991 194 bitcpy = 0; 8021 194 goto LBL_RES;
7992 195 bitbuf = 0; 8022 195 \}
7993 196 mode = 1; 8023 196 \}
7994 197 \} 8024 197
7995 198 \} 8025 198 /* then multiply */
7996 199 8026 199 if ((err = mp_mul (&res, &M[bitbuf], &res)) != MP_OKAY) \{
7997 200 /* if bits remain then square/multiply */ 8027 200 goto LBL_RES;
7998 201 if (mode == 2 && bitcpy > 0) \{ 8028 201 \}
7999 202 /* square then multiply if the bit is set */ 8029 202 if ((err = redux (&res, P, &mu)) != MP_OKAY) \{
8000 203 for (x = 0; x < bitcpy; x++) \{ 8030 203 goto LBL_RES;
8001 204 if ((err = mp_sqr (&res, &res)) != MP_OKAY) \{ 8031 204 \}
8002 205 goto __RES; 8032 205
8003 206 \} 8033 206 /* empty window and reset */
8004 207 if ((err = mp_reduce (&res, P, &mu)) != MP_OKAY) \{ 8034 207 bitcpy = 0;
8005 208 goto __RES; 8035 208 bitbuf = 0;
8006 209 \} 8036 209 mode = 1;
8007 210 8037 210 \}
8008 211 bitbuf <<= 1; 8038 211 \}
8009 212 if ((bitbuf & (1 << winsize)) != 0) \{ 8039 212
8010 213 /* then multiply */ 8040 213 /* if bits remain then square/multiply */
8011 214 if ((err = mp_mul (&res, &M[1], &res)) != MP_OKAY) \{ 8041 214 if (mode == 2 && bitcpy > 0) \{
8012 215 goto __RES; 8042 215 /* square then multiply if the bit is set */
8013 216 \} 8043 216 for (x = 0; x < bitcpy; x++) \{
8014 217 if ((err = mp_reduce (&res, P, &mu)) != MP_OKAY) \{ 8044 217 if ((err = mp_sqr (&res, &res)) != MP_OKAY) \{
8015 218 goto __RES; 8045 218 goto LBL_RES;
8016 219 \} 8046 219 \}
8017 220 \} 8047 220 if ((err = redux (&res, P, &mu)) != MP_OKAY) \{
8018 221 \} 8048 221 goto LBL_RES;
8019 222 \} 8049 222 \}
8020 223 8050 223
8021 224 mp_exch (&res, Y); 8051 224 bitbuf <<= 1;
8022 225 err = MP_OKAY; 8052 225 if ((bitbuf & (1 << winsize)) != 0) \{
8023 226 __RES:mp_clear (&res); 8053 226 /* then multiply */
8024 227 __MU:mp_clear (&mu); 8054 227 if ((err = mp_mul (&res, &M[1], &res)) != MP_OKAY) \{
8025 228 __M: 8055 228 goto LBL_RES;
8026 229 mp_clear(&M[1]); 8056 229 \}
8027 230 for (x = 1<<(winsize-1); x < (1 << winsize); x++) \{ 8057 230 if ((err = redux (&res, P, &mu)) != MP_OKAY) \{
8028 231 mp_clear (&M[x]); 8058 231 goto LBL_RES;
8029 232 \} 8059 232 \}
8030 233 return err; 8060 233 \}
8031 234 \} 8061 234 \}
8032 235 #endif 8062 235 \}
8063 236
8064 237 mp_exch (&res, Y);
8065 238 err = MP_OKAY;
8066 239 LBL_RES:mp_clear (&res);
8067 240 LBL_MU:mp_clear (&mu);
8068 241 LBL_M:
8069 242 mp_clear(&M[1]);
8070 243 for (x = 1<<(winsize-1); x < (1 << winsize); x++) \{
8071 244 mp_clear (&M[x]);
8072 245 \}
8073 246 return err;
8074 247 \}
8075 248 #endif
8033 \end{alltt} 8076 \end{alltt}
8034 \end{small} 8077 \end{small}
8035 8078
8036 Lines 31 through 41 determine the optimal window size based on the length of the exponent in bits. The window divisions are sorted 8079 Lines 21 through 40 determine the optimal window size based on the length of the exponent in bits. The window divisions are sorted
8037 from smallest to greatest so that in each \textbf{if} statement only one condition must be tested. For example, by the \textbf{if} statement 8080 from smallest to greatest so that in each \textbf{if} statement only one condition must be tested. For example, by the \textbf{if} statement
8038 on line 33 the value of $x$ is already known to be greater than $140$. 8081 on line 32 the value of $x$ is already known to be greater than $140$.
8039 8082
8040 The conditional piece of code beginning on line 47 allows the window size to be restricted to five bits. This logic is used to ensure 8083 The conditional piece of code beginning on line 48 allows the window size to be restricted to five bits. This logic is used to ensure
8041 the table of precomputed powers of $G$ remains relatively small. 8084 the table of precomputed powers of $G$ remains relatively small.
8042 8085
8043 The for loop on line 60 initializes the $M$ array while lines 61 and 74 compute the value of $\mu$ required for 8086 The for loop on line 61 initializes the $M$ array while lines 62 and 77 compute the value of $\mu$ required for
8044 Barrett reduction. 8087 Barrett reduction.
8045 8088
8046 -- More later. 8089 -- More later.
8047 8090
8048 \section{Quick Power of Two} 8091 \section{Quick Power of Two}
8384 046 \} 8427 046 \}
8385 047 8428 047
8386 048 8429 048
8387 049 mp_set(&tq, 1); 8430 049 mp_set(&tq, 1);
8388 050 n = mp_count_bits(a) - mp_count_bits(b); 8431 050 n = mp_count_bits(a) - mp_count_bits(b);
8389 051 if (((res = mp_copy(a, &ta)) != MP_OKAY) || 8432 051 if (((res = mp_abs(a, &ta)) != MP_OKAY) ||
8390 052 ((res = mp_copy(b, &tb)) != MP_OKAY) || 8433 052 ((res = mp_abs(b, &tb)) != MP_OKAY) ||
8391 053 ((res = mp_mul_2d(&tb, n, &tb)) != MP_OKAY) || 8434 053 ((res = mp_mul_2d(&tb, n, &tb)) != MP_OKAY) ||
8392 054 ((res = mp_mul_2d(&tq, n, &tq)) != MP_OKAY)) \{ 8435 054 ((res = mp_mul_2d(&tq, n, &tq)) != MP_OKAY)) \{
8393 055 goto __ERR; 8436 055 goto LBL_ERR;
8394 056 \} 8437 056 \}
8395 057 8438 057
8396 058 while (n-- >= 0) \{ 8439 058 while (n-- >= 0) \{
8397 059 if (mp_cmp(&tb, &ta) != MP_GT) \{ 8440 059 if (mp_cmp(&tb, &ta) != MP_GT) \{
8398 060 if (((res = mp_sub(&ta, &tb, &ta)) != MP_OKAY) || 8441 060 if (((res = mp_sub(&ta, &tb, &ta)) != MP_OKAY) ||
8399 061 ((res = mp_add(&q, &tq, &q)) != MP_OKAY)) \{ 8442 061 ((res = mp_add(&q, &tq, &q)) != MP_OKAY)) \{
8400 062 goto __ERR; 8443 062 goto LBL_ERR;
8401 063 \} 8444 063 \}
8402 064 \} 8445 064 \}
8403 065 if (((res = mp_div_2d(&tb, 1, &tb, NULL)) != MP_OKAY) || 8446 065 if (((res = mp_div_2d(&tb, 1, &tb, NULL)) != MP_OKAY) ||
8404 066 ((res = mp_div_2d(&tq, 1, &tq, NULL)) != MP_OKAY)) \{ 8447 066 ((res = mp_div_2d(&tq, 1, &tq, NULL)) != MP_OKAY)) \{
8405 067 goto __ERR; 8448 067 goto LBL_ERR;
8406 068 \} 8449 068 \}
8407 069 \} 8450 069 \}
8408 070 8451 070
8409 071 /* now q == quotient and ta == remainder */ 8452 071 /* now q == quotient and ta == remainder */
8410 072 n = a->sign; 8453 072 n = a->sign;
8411 073 n2 = (a->sign == b->sign ? MP_ZPOS : MP_NEG); 8454 073 n2 = (a->sign == b->sign ? MP_ZPOS : MP_NEG);
8412 074 if (c != NULL) \{ 8455 074 if (c != NULL) \{
8413 075 mp_exch(c, &q); 8456 075 mp_exch(c, &q);
8414 076 c->sign = n2; 8457 076 c->sign = (mp_iszero(c) == MP_YES) ? MP_ZPOS : n2;
8415 077 \} 8458 077 \}
8416 078 if (d != NULL) \{ 8459 078 if (d != NULL) \{
8417 079 mp_exch(d, &ta); 8460 079 mp_exch(d, &ta);
8418 080 d->sign = n; 8461 080 d->sign = (mp_iszero(d) == MP_YES) ? MP_ZPOS : n;
8419 081 \} 8462 081 \}
8420 082 __ERR: 8463 082 LBL_ERR:
8421 083 mp_clear_multi(&ta, &tb, &tq, &q, NULL); 8464 083 mp_clear_multi(&ta, &tb, &tq, &q, NULL);
8422 084 return res; 8465 084 return res;
8423 085 \} 8466 085 \}
8424 086 8467 086
8425 087 #else 8468 087 #else
8464 126 return res; 8507 126 return res;
8465 127 \} 8508 127 \}
8466 128 q.used = a->used + 2; 8509 128 q.used = a->used + 2;
8467 129 8510 129
8468 130 if ((res = mp_init (&t1)) != MP_OKAY) \{ 8511 130 if ((res = mp_init (&t1)) != MP_OKAY) \{
8469 131 goto __Q; 8512 131 goto LBL_Q;
8470 132 \} 8513 132 \}
8471 133 8514 133
8472 134 if ((res = mp_init (&t2)) != MP_OKAY) \{ 8515 134 if ((res = mp_init (&t2)) != MP_OKAY) \{
8473 135 goto __T1; 8516 135 goto LBL_T1;
8474 136 \} 8517 136 \}
8475 137 8518 137
8476 138 if ((res = mp_init_copy (&x, a)) != MP_OKAY) \{ 8519 138 if ((res = mp_init_copy (&x, a)) != MP_OKAY) \{
8477 139 goto __T2; 8520 139 goto LBL_T2;
8478 140 \} 8521 140 \}
8479 141 8522 141
8480 142 if ((res = mp_init_copy (&y, b)) != MP_OKAY) \{ 8523 142 if ((res = mp_init_copy (&y, b)) != MP_OKAY) \{
8481 143 goto __X; 8524 143 goto LBL_X;
8482 144 \} 8525 144 \}
8483 145 8526 145
8484 146 /* fix the sign */ 8527 146 /* fix the sign */
8485 147 neg = (a->sign == b->sign) ? MP_ZPOS : MP_NEG; 8528 147 neg = (a->sign == b->sign) ? MP_ZPOS : MP_NEG;
8486 148 x.sign = y.sign = MP_ZPOS; 8529 148 x.sign = y.sign = MP_ZPOS;
8488 150 /* normalize both x and y, ensure that y >= b/2, [b == 2**DIGIT_BIT] */ 8531 150 /* normalize both x and y, ensure that y >= b/2, [b == 2**DIGIT_BIT] */
8489 151 norm = mp_count_bits(&y) % DIGIT_BIT; 8532 151 norm = mp_count_bits(&y) % DIGIT_BIT;
8490 152 if (norm < (int)(DIGIT_BIT-1)) \{ 8533 152 if (norm < (int)(DIGIT_BIT-1)) \{
8491 153 norm = (DIGIT_BIT-1) - norm; 8534 153 norm = (DIGIT_BIT-1) - norm;
8492 154 if ((res = mp_mul_2d (&x, norm, &x)) != MP_OKAY) \{ 8535 154 if ((res = mp_mul_2d (&x, norm, &x)) != MP_OKAY) \{
8493 155 goto __Y; 8536 155 goto LBL_Y;
8494 156 \} 8537 156 \}
8495 157 if ((res = mp_mul_2d (&y, norm, &y)) != MP_OKAY) \{ 8538 157 if ((res = mp_mul_2d (&y, norm, &y)) != MP_OKAY) \{
8496 158 goto __Y; 8539 158 goto LBL_Y;
8497 159 \} 8540 159 \}
8498 160 \} else \{ 8541 160 \} else \{
8499 161 norm = 0; 8542 161 norm = 0;
8500 162 \} 8543 162 \}
8501 163 8544 163
8503 165 n = x.used - 1; 8546 165 n = x.used - 1;
8504 166 t = y.used - 1; 8547 166 t = y.used - 1;
8505 167 8548 167
8506 168 /* while (x >= y*b**n-t) do \{ q[n-t] += 1; x -= y*b**\{n-t\} \} */ 8549 168 /* while (x >= y*b**n-t) do \{ q[n-t] += 1; x -= y*b**\{n-t\} \} */
8507 169 if ((res = mp_lshd (&y, n - t)) != MP_OKAY) \{ /* y = y*b**\{n-t\} */ 8550 169 if ((res = mp_lshd (&y, n - t)) != MP_OKAY) \{ /* y = y*b**\{n-t\} */
8508 170 goto __Y; 8551 170 goto LBL_Y;
8509 171 \} 8552 171 \}
8510 172 8553 172
8511 173 while (mp_cmp (&x, &y) != MP_LT) \{ 8554 173 while (mp_cmp (&x, &y) != MP_LT) \{
8512 174 ++(q.dp[n - t]); 8555 174 ++(q.dp[n - t]);
8513 175 if ((res = mp_sub (&x, &y, &x)) != MP_OKAY) \{ 8556 175 if ((res = mp_sub (&x, &y, &x)) != MP_OKAY) \{
8514 176 goto __Y; 8557 176 goto LBL_Y;
8515 177 \} 8558 177 \}
8516 178 \} 8559 178 \}
8517 179 8560 179
8518 180 /* reset y by shifting it back down */ 8561 180 /* reset y by shifting it back down */
8519 181 mp_rshd (&y, n - t); 8562 181 mp_rshd (&y, n - t);
8551 213 mp_zero (&t1); 8594 213 mp_zero (&t1);
8552 214 t1.dp[0] = (t - 1 < 0) ? 0 : y.dp[t - 1]; 8595 214 t1.dp[0] = (t - 1 < 0) ? 0 : y.dp[t - 1];
8553 215 t1.dp[1] = y.dp[t]; 8596 215 t1.dp[1] = y.dp[t];
8554 216 t1.used = 2; 8597 216 t1.used = 2;
8555 217 if ((res = mp_mul_d (&t1, q.dp[i - t - 1], &t1)) != MP_OKAY) \{ 8598 217 if ((res = mp_mul_d (&t1, q.dp[i - t - 1], &t1)) != MP_OKAY) \{
8556 218 goto __Y; 8599 218 goto LBL_Y;
8557 219 \} 8600 219 \}
8558 220 8601 220
8559 221 /* find right hand */ 8602 221 /* find right hand */
8560 222 t2.dp[0] = (i - 2 < 0) ? 0 : x.dp[i - 2]; 8603 222 t2.dp[0] = (i - 2 < 0) ? 0 : x.dp[i - 2];
8561 223 t2.dp[1] = (i - 1 < 0) ? 0 : x.dp[i - 1]; 8604 223 t2.dp[1] = (i - 1 < 0) ? 0 : x.dp[i - 1];
8563 225 t2.used = 3; 8606 225 t2.used = 3;
8564 226 \} while (mp_cmp_mag(&t1, &t2) == MP_GT); 8607 226 \} while (mp_cmp_mag(&t1, &t2) == MP_GT);
8565 227 8608 227
8566 228 /* step 3.3 x = x - q\{i-t-1\} * y * b**\{i-t-1\} */ 8609 228 /* step 3.3 x = x - q\{i-t-1\} * y * b**\{i-t-1\} */
8567 229 if ((res = mp_mul_d (&y, q.dp[i - t - 1], &t1)) != MP_OKAY) \{ 8610 229 if ((res = mp_mul_d (&y, q.dp[i - t - 1], &t1)) != MP_OKAY) \{
8568 230 goto __Y; 8611 230 goto LBL_Y;
8569 231 \} 8612 231 \}
8570 232 8613 232
8571 233 if ((res = mp_lshd (&t1, i - t - 1)) != MP_OKAY) \{ 8614 233 if ((res = mp_lshd (&t1, i - t - 1)) != MP_OKAY) \{
8572 234 goto __Y; 8615 234 goto LBL_Y;
8573 235 \} 8616 235 \}
8574 236 8617 236
8575 237 if ((res = mp_sub (&x, &t1, &x)) != MP_OKAY) \{ 8618 237 if ((res = mp_sub (&x, &t1, &x)) != MP_OKAY) \{
8576 238 goto __Y; 8619 238 goto LBL_Y;
8577 239 \} 8620 239 \}
8578 240 8621 240
8579 241 /* if x < 0 then \{ x = x + y*b**\{i-t-1\}; q\{i-t-1\} -= 1; \} */ 8622 241 /* if x < 0 then \{ x = x + y*b**\{i-t-1\}; q\{i-t-1\} -= 1; \} */
8580 242 if (x.sign == MP_NEG) \{ 8623 242 if (x.sign == MP_NEG) \{
8581 243 if ((res = mp_copy (&y, &t1)) != MP_OKAY) \{ 8624 243 if ((res = mp_copy (&y, &t1)) != MP_OKAY) \{
8582 244 goto __Y; 8625 244 goto LBL_Y;
8583 245 \} 8626 245 \}
8584 246 if ((res = mp_lshd (&t1, i - t - 1)) != MP_OKAY) \{ 8627 246 if ((res = mp_lshd (&t1, i - t - 1)) != MP_OKAY) \{
8585 247 goto __Y; 8628 247 goto LBL_Y;
8586 248 \} 8629 248 \}
8587 249 if ((res = mp_add (&x, &t1, &x)) != MP_OKAY) \{ 8630 249 if ((res = mp_add (&x, &t1, &x)) != MP_OKAY) \{
8588 250 goto __Y; 8631 250 goto LBL_Y;
8589 251 \} 8632 251 \}
8590 252 8633 252
8591 253 q.dp[i - t - 1] = (q.dp[i - t - 1] - 1UL) & MP_MASK; 8634 253 q.dp[i - t - 1] = (q.dp[i - t - 1] - 1UL) & MP_MASK;
8592 254 \} 8635 254 \}
8593 255 \} 8636 255 \}
8610 272 mp_exch (&x, d); 8653 272 mp_exch (&x, d);
8611 273 \} 8654 273 \}
8612 274 8655 274
8613 275 res = MP_OKAY; 8656 275 res = MP_OKAY;
8614 276 8657 276
8615 277 __Y:mp_clear (&y); 8658 277 LBL_Y:mp_clear (&y);
8616 278 __X:mp_clear (&x); 8659 278 LBL_X:mp_clear (&x);
8617 279 __T2:mp_clear (&t2); 8660 279 LBL_T2:mp_clear (&t2);
8618 280 __T1:mp_clear (&t1); 8661 280 LBL_T1:mp_clear (&t1);
8619 281 __Q:mp_clear (&q); 8662 281 LBL_Q:mp_clear (&q);
8620 282 return res; 8663 282 return res;
8621 283 \} 8664 283 \}
8622 284 8665 284
8623 285 #endif 8666 285 #endif
8624 286 8667 286
8868 054 8911 054
8869 055 /* send carry into next iteration */ 8912 055 /* send carry into next iteration */
8870 056 u = (mp_digit) (r >> ((mp_word) DIGIT_BIT)); 8913 056 u = (mp_digit) (r >> ((mp_word) DIGIT_BIT));
8871 057 \} 8914 057 \}
8872 058 8915 058
8873 059 /* store final carry [if any] */ 8916 059 /* store final carry [if any] and increment ix offset */
8874 060 *tmpc++ = u; 8917 060 *tmpc++ = u;
8875 061 8918 061 ++ix;
8876 062 /* now zero digits above the top */ 8919 062
8877 063 while (ix++ < olduse) \{ 8920 063 /* now zero digits above the top */
8878 064 *tmpc++ = 0; 8921 064 while (ix++ < olduse) \{
8879 065 \} 8922 065 *tmpc++ = 0;
8880 066 8923 066 \}
8881 067 /* set used count */ 8924 067
8882 068 c->used = a->used + 1; 8925 068 /* set used count */
8883 069 mp_clamp(c); 8926 069 c->used = a->used + 1;
8884 070 8927 070 mp_clamp(c);
8885 071 return MP_OKAY; 8928 071
8886 072 \} 8929 072 return MP_OKAY;
8887 073 #endif 8930 073 \}
8931 074 #endif
8888 \end{alltt} 8932 \end{alltt}
8889 \end{small} 8933 \end{small}
8890 8934
8891 In this implementation the destination $c$ may point to the same mp\_int as the source $a$ since the result is written after the digit is 8935 In this implementation the destination $c$ may point to the same mp\_int as the source $a$ since the result is written after the digit is
8892 read from the source. This function uses pointer aliases $tmpa$ and $tmpc$ for the digits of $a$ and $c$ respectively. 8936 read from the source. This function uses pointer aliases $tmpa$ and $tmpc$ for the digits of $a$ and $c$ respectively.
9128 037 if ((res = mp_init (&t1)) != MP_OKAY) \{ 9172 037 if ((res = mp_init (&t1)) != MP_OKAY) \{
9129 038 return res; 9173 038 return res;
9130 039 \} 9174 039 \}
9131 040 9175 040
9132 041 if ((res = mp_init (&t2)) != MP_OKAY) \{ 9176 041 if ((res = mp_init (&t2)) != MP_OKAY) \{
9133 042 goto __T1; 9177 042 goto LBL_T1;
9134 043 \} 9178 043 \}
9135 044 9179 044
9136 045 if ((res = mp_init (&t3)) != MP_OKAY) \{ 9180 045 if ((res = mp_init (&t3)) != MP_OKAY) \{
9137 046 goto __T2; 9181 046 goto LBL_T2;
9138 047 \} 9182 047 \}
9139 048 9183 048
9140 049 /* if a is negative fudge the sign but keep track */ 9184 049 /* if a is negative fudge the sign but keep track */
9141 050 neg = a->sign; 9185 050 neg = a->sign;
9142 051 a->sign = MP_ZPOS; 9186 051 a->sign = MP_ZPOS;
9145 054 mp_set (&t2, 2); 9189 054 mp_set (&t2, 2);
9146 055 9190 055
9147 056 do \{ 9191 056 do \{
9148 057 /* t1 = t2 */ 9192 057 /* t1 = t2 */
9149 058 if ((res = mp_copy (&t2, &t1)) != MP_OKAY) \{ 9193 058 if ((res = mp_copy (&t2, &t1)) != MP_OKAY) \{
9150 059 goto __T3; 9194 059 goto LBL_T3;
9151 060 \} 9195 060 \}
9152 061 9196 061
9153 062 /* t2 = t1 - ((t1**b - a) / (b * t1**(b-1))) */ 9197 062 /* t2 = t1 - ((t1**b - a) / (b * t1**(b-1))) */
9154 063 9198 063
9155 064 /* t3 = t1**(b-1) */ 9199 064 /* t3 = t1**(b-1) */
9156 065 if ((res = mp_expt_d (&t1, b - 1, &t3)) != MP_OKAY) \{ 9200 065 if ((res = mp_expt_d (&t1, b - 1, &t3)) != MP_OKAY) \{
9157 066 goto __T3; 9201 066 goto LBL_T3;
9158 067 \} 9202 067 \}
9159 068 9203 068
9160 069 /* numerator */ 9204 069 /* numerator */
9161 070 /* t2 = t1**b */ 9205 070 /* t2 = t1**b */
9162 071 if ((res = mp_mul (&t3, &t1, &t2)) != MP_OKAY) \{ 9206 071 if ((res = mp_mul (&t3, &t1, &t2)) != MP_OKAY) \{
9163 072 goto __T3; 9207 072 goto LBL_T3;
9164 073 \} 9208 073 \}
9165 074 9209 074
9166 075 /* t2 = t1**b - a */ 9210 075 /* t2 = t1**b - a */
9167 076 if ((res = mp_sub (&t2, a, &t2)) != MP_OKAY) \{ 9211 076 if ((res = mp_sub (&t2, a, &t2)) != MP_OKAY) \{
9168 077 goto __T3; 9212 077 goto LBL_T3;
9169 078 \} 9213 078 \}
9170 079 9214 079
9171 080 /* denominator */ 9215 080 /* denominator */
9172 081 /* t3 = t1**(b-1) * b */ 9216 081 /* t3 = t1**(b-1) * b */
9173 082 if ((res = mp_mul_d (&t3, b, &t3)) != MP_OKAY) \{ 9217 082 if ((res = mp_mul_d (&t3, b, &t3)) != MP_OKAY) \{
9174 083 goto __T3; 9218 083 goto LBL_T3;
9175 084 \} 9219 084 \}
9176 085 9220 085
9177 086 /* t3 = (t1**b - a)/(b * t1**(b-1)) */ 9221 086 /* t3 = (t1**b - a)/(b * t1**(b-1)) */
9178 087 if ((res = mp_div (&t2, &t3, &t3, NULL)) != MP_OKAY) \{ 9222 087 if ((res = mp_div (&t2, &t3, &t3, NULL)) != MP_OKAY) \{
9179 088 goto __T3; 9223 088 goto LBL_T3;
9180 089 \} 9224 089 \}
9181 090 9225 090
9182 091 if ((res = mp_sub (&t1, &t3, &t2)) != MP_OKAY) \{ 9226 091 if ((res = mp_sub (&t1, &t3, &t2)) != MP_OKAY) \{
9183 092 goto __T3; 9227 092 goto LBL_T3;
9184 093 \} 9228 093 \}
9185 094 \} while (mp_cmp (&t1, &t2) != MP_EQ); 9229 094 \} while (mp_cmp (&t1, &t2) != MP_EQ);
9186 095 9230 095
9187 096 /* result can be off by a few so check */ 9231 096 /* result can be off by a few so check */
9188 097 for (;;) \{ 9232 097 for (;;) \{
9189 098 if ((res = mp_expt_d (&t1, b, &t2)) != MP_OKAY) \{ 9233 098 if ((res = mp_expt_d (&t1, b, &t2)) != MP_OKAY) \{
9190 099 goto __T3; 9234 099 goto LBL_T3;
9191 100 \} 9235 100 \}
9192 101 9236 101
9193 102 if (mp_cmp (&t2, a) == MP_GT) \{ 9237 102 if (mp_cmp (&t2, a) == MP_GT) \{
9194 103 if ((res = mp_sub_d (&t1, 1, &t1)) != MP_OKAY) \{ 9238 103 if ((res = mp_sub_d (&t1, 1, &t1)) != MP_OKAY) \{
9195 104 goto __T3; 9239 104 goto LBL_T3;
9196 105 \} 9240 105 \}
9197 106 \} else \{ 9241 106 \} else \{
9198 107 break; 9242 107 break;
9199 108 \} 9243 108 \}
9200 109 \} 9244 109 \}
9208 117 /* set the sign of the result */ 9252 117 /* set the sign of the result */
9209 118 c->sign = neg; 9253 118 c->sign = neg;
9210 119 9254 119
9211 120 res = MP_OKAY; 9255 120 res = MP_OKAY;
9212 121 9256 121
9213 122 __T3:mp_clear (&t3); 9257 122 LBL_T3:mp_clear (&t3);
9214 123 __T2:mp_clear (&t2); 9258 123 LBL_T2:mp_clear (&t2);
9215 124 __T1:mp_clear (&t1); 9259 124 LBL_T1:mp_clear (&t1);
9216 125 return res; 9260 125 return res;
9217 126 \} 9261 126 \}
9218 127 #endif 9262 127 #endif
9219 \end{alltt} 9263 \end{alltt}
9220 \end{small} 9264 \end{small}
9270 026 return MP_OKAY; 9314 026 return MP_OKAY;
9271 027 \} 9315 027 \}
9272 028 9316 028
9273 029 /* first place a random non-zero digit */ 9317 029 /* first place a random non-zero digit */
9274 030 do \{ 9318 030 do \{
9275 031 d = ((mp_digit) abs (rand ())); 9319 031 d = ((mp_digit) abs (rand ())) & MP_MASK;
9276 032 \} while (d == 0); 9320 032 \} while (d == 0);
9277 033 9321 033
9278 034 if ((res = mp_add_d (a, d, a)) != MP_OKAY) \{ 9322 034 if ((res = mp_add_d (a, d, a)) != MP_OKAY) \{
9279 035 return res; 9323 035 return res;
9280 036 \} 9324 036 \}
9281 037 9325 037
9282 038 while (digits-- > 0) \{ 9326 038 while (--digits > 0) \{
9283 039 if ((res = mp_lshd (a, 1)) != MP_OKAY) \{ 9327 039 if ((res = mp_lshd (a, 1)) != MP_OKAY) \{
9284 040 return res; 9328 040 return res;
9285 041 \} 9329 041 \}
9286 042 9330 042
9287 043 if ((res = mp_add_d (a, ((mp_digit) abs (rand ())), a)) != MP_OKAY) \{ 9331 043 if ((res = mp_add_d (a, ((mp_digit) abs (rand ())), a)) != MP_OKAY) \{
9374 \hspace{-5.1mm}{\bf File}: bn\_mp\_read\_radix.c 9418 \hspace{-5.1mm}{\bf File}: bn\_mp\_read\_radix.c
9375 \vspace{-3mm} 9419 \vspace{-3mm}
9376 \begin{alltt} 9420 \begin{alltt}
9377 016 9421 016
9378 017 /* read a string [ASCII] in a given radix */ 9422 017 /* read a string [ASCII] in a given radix */
9379 018 int mp_read_radix (mp_int * a, char *str, int radix) 9423 018 int mp_read_radix (mp_int * a, const char *str, int radix)
9380 019 \{ 9424 019 \{
9381 020 int y, res, neg; 9425 020 int y, res, neg;
9382 021 char ch; 9426 021 char ch;
9383 022 9427 022
9384 023 /* make sure the radix is ok */ 9428 023 /* make sure the radix is ok */
9769 040 if ((res = mp_init_copy (&u, a)) != MP_OKAY) \{ 9813 040 if ((res = mp_init_copy (&u, a)) != MP_OKAY) \{
9770 041 return res; 9814 041 return res;
9771 042 \} 9815 042 \}
9772 043 9816 043
9773 044 if ((res = mp_init_copy (&v, b)) != MP_OKAY) \{ 9817 044 if ((res = mp_init_copy (&v, b)) != MP_OKAY) \{
9774 045 goto __U; 9818 045 goto LBL_U;
9775 046 \} 9819 046 \}
9776 047 9820 047
9777 048 /* must be positive for the remainder of the algorithm */ 9821 048 /* must be positive for the remainder of the algorithm */
9778 049 u.sign = v.sign = MP_ZPOS; 9822 049 u.sign = v.sign = MP_ZPOS;
9779 050 9823 050
9783 054 k = MIN(u_lsb, v_lsb); 9827 054 k = MIN(u_lsb, v_lsb);
9784 055 9828 055
9785 056 if (k > 0) \{ 9829 056 if (k > 0) \{
9786 057 /* divide the power of two out */ 9830 057 /* divide the power of two out */
9787 058 if ((res = mp_div_2d(&u, k, &u, NULL)) != MP_OKAY) \{ 9831 058 if ((res = mp_div_2d(&u, k, &u, NULL)) != MP_OKAY) \{
9788 059 goto __V; 9832 059 goto LBL_V;
9789 060 \} 9833 060 \}
9790 061 9834 061
9791 062 if ((res = mp_div_2d(&v, k, &v, NULL)) != MP_OKAY) \{ 9835 062 if ((res = mp_div_2d(&v, k, &v, NULL)) != MP_OKAY) \{
9792 063 goto __V; 9836 063 goto LBL_V;
9793 064 \} 9837 064 \}
9794 065 \} 9838 065 \}
9795 066 9839 066
9796 067 /* divide any remaining factors of two out */ 9840 067 /* divide any remaining factors of two out */
9797 068 if (u_lsb != k) \{ 9841 068 if (u_lsb != k) \{
9798 069 if ((res = mp_div_2d(&u, u_lsb - k, &u, NULL)) != MP_OKAY) \{ 9842 069 if ((res = mp_div_2d(&u, u_lsb - k, &u, NULL)) != MP_OKAY) \{
9799 070 goto __V; 9843 070 goto LBL_V;
9800 071 \} 9844 071 \}
9801 072 \} 9845 072 \}
9802 073 9846 073
9803 074 if (v_lsb != k) \{ 9847 074 if (v_lsb != k) \{
9804 075 if ((res = mp_div_2d(&v, v_lsb - k, &v, NULL)) != MP_OKAY) \{ 9848 075 if ((res = mp_div_2d(&v, v_lsb - k, &v, NULL)) != MP_OKAY) \{
9805 076 goto __V; 9849 076 goto LBL_V;
9806 077 \} 9850 077 \}
9807 078 \} 9851 078 \}
9808 079 9852 079
9809 080 while (mp_iszero(&v) == 0) \{ 9853 080 while (mp_iszero(&v) == 0) \{
9810 081 /* make sure v is the largest */ 9854 081 /* make sure v is the largest */
9813 084 mp_exch(&u, &v); 9857 084 mp_exch(&u, &v);
9814 085 \} 9858 085 \}
9815 086 9859 086
9816 087 /* subtract smallest from largest */ 9860 087 /* subtract smallest from largest */
9817 088 if ((res = s_mp_sub(&v, &u, &v)) != MP_OKAY) \{ 9861 088 if ((res = s_mp_sub(&v, &u, &v)) != MP_OKAY) \{
9818 089 goto __V; 9862 089 goto LBL_V;
9819 090 \} 9863 090 \}
9820 091 9864 091
9821 092 /* Divide out all factors of two */ 9865 092 /* Divide out all factors of two */
9822 093 if ((res = mp_div_2d(&v, mp_cnt_lsb(&v), &v, NULL)) != MP_OKAY) \{ 9866 093 if ((res = mp_div_2d(&v, mp_cnt_lsb(&v), &v, NULL)) != MP_OKAY) \{
9823 094 goto __V; 9867 094 goto LBL_V;
9824 095 \} 9868 095 \}
9825 096 \} 9869 096 \}
9826 097 9870 097
9827 098 /* multiply by 2**k which we divided out at the beginning */ 9871 098 /* multiply by 2**k which we divided out at the beginning */
9828 099 if ((res = mp_mul_2d (&u, k, c)) != MP_OKAY) \{ 9872 099 if ((res = mp_mul_2d (&u, k, c)) != MP_OKAY) \{
9829 100 goto __V; 9873 100 goto LBL_V;
9830 101 \} 9874 101 \}
9831 102 c->sign = MP_ZPOS; 9875 102 c->sign = MP_ZPOS;
9832 103 res = MP_OKAY; 9876 103 res = MP_OKAY;
9833 104 __V:mp_clear (&u); 9877 104 LBL_V:mp_clear (&u);
9834 105 __U:mp_clear (&v); 9878 105 LBL_U:mp_clear (&v);
9835 106 return res; 9879 106 return res;
9836 107 \} 9880 107 \}
9837 108 #endif 9881 108 #endif
9838 \end{alltt} 9882 \end{alltt}
9839 \end{small} 9883 \end{small}
9902 025 return res; 9946 025 return res;
9903 026 \} 9947 026 \}
9904 027 9948 027
9905 028 /* t1 = get the GCD of the two inputs */ 9949 028 /* t1 = get the GCD of the two inputs */
9906 029 if ((res = mp_gcd (a, b, &t1)) != MP_OKAY) \{ 9950 029 if ((res = mp_gcd (a, b, &t1)) != MP_OKAY) \{
9907 030 goto __T; 9951 030 goto LBL_T;
9908 031 \} 9952 031 \}
9909 032 9953 032
9910 033 /* divide the smallest by the GCD */ 9954 033 /* divide the smallest by the GCD */
9911 034 if (mp_cmp_mag(a, b) == MP_LT) \{ 9955 034 if (mp_cmp_mag(a, b) == MP_LT) \{
9912 035 /* store quotient in t2 such that t2 * b is the LCM */ 9956 035 /* store quotient in t2 such that t2 * b is the LCM */
9913 036 if ((res = mp_div(a, &t1, &t2, NULL)) != MP_OKAY) \{ 9957 036 if ((res = mp_div(a, &t1, &t2, NULL)) != MP_OKAY) \{
9914 037 goto __T; 9958 037 goto LBL_T;
9915 038 \} 9959 038 \}
9916 039 res = mp_mul(b, &t2, c); 9960 039 res = mp_mul(b, &t2, c);
9917 040 \} else \{ 9961 040 \} else \{
9918 041 /* store quotient in t2 such that t2 * a is the LCM */ 9962 041 /* store quotient in t2 such that t2 * a is the LCM */
9919 042 if ((res = mp_div(b, &t1, &t2, NULL)) != MP_OKAY) \{ 9963 042 if ((res = mp_div(b, &t1, &t2, NULL)) != MP_OKAY) \{
9920 043 goto __T; 9964 043 goto LBL_T;
9921 044 \} 9965 044 \}
9922 045 res = mp_mul(a, &t2, c); 9966 045 res = mp_mul(a, &t2, c);
9923 046 \} 9967 046 \}
9924 047 9968 047
9925 048 /* fix the sign to positive */ 9969 048 /* fix the sign to positive */
9926 049 c->sign = MP_ZPOS; 9970 049 c->sign = MP_ZPOS;
9927 050 9971 050
9928 051 __T: 9972 051 LBL_T:
9929 052 mp_clear_multi (&t1, &t2, NULL); 9973 052 mp_clear_multi (&t1, &t2, NULL);
9930 053 return res; 9974 053 return res;
9931 054 \} 9975 054 \}
9932 055 #endif 9976 055 #endif
9933 \end{alltt} 9977 \end{alltt}
9935 9979
9936 \section{Jacobi Symbol Computation} 9980 \section{Jacobi Symbol Computation}
9937 To explain the Jacobi Symbol we shall first discuss the Legendre function\footnote{Arrg. What is the name of this?} off which the Jacobi symbol is 9981 To explain the Jacobi Symbol we shall first discuss the Legendre function\footnote{Arrg. What is the name of this?} off which the Jacobi symbol is
9938 defined. The Legendre function computes whether or not an integer $a$ is a quadratic residue modulo an odd prime $p$. Numerically it is 9982 defined. The Legendre function computes whether or not an integer $a$ is a quadratic residue modulo an odd prime $p$. Numerically it is
9939 equivalent to equation \ref{eqn:legendre}. 9983 equivalent to equation \ref{eqn:legendre}.
9984
9985 \textit{-- Tom, don't be an ass, cite your source here...!}
9940 9986
9941 \begin{equation} 9987 \begin{equation}
9942 a^{(p-1)/2} \equiv \begin{array}{rl} 9988 a^{(p-1)/2} \equiv \begin{array}{rl}
9943 -1 & \mbox{if }a\mbox{ is a quadratic non-residue.} \\ 9989 -1 & \mbox{if }a\mbox{ is a quadratic non-residue.} \\
9944 0 & \mbox{if }a\mbox{ divides }p\mbox{.} \\ 9990 0 & \mbox{if }a\mbox{ divides }p\mbox{.} \\
10121 047 if ((res = mp_init_copy (&a1, a)) != MP_OKAY) \{ 10167 047 if ((res = mp_init_copy (&a1, a)) != MP_OKAY) \{
10122 048 return res; 10168 048 return res;
10123 049 \} 10169 049 \}
10124 050 10170 050
10125 051 if ((res = mp_init (&p1)) != MP_OKAY) \{ 10171 051 if ((res = mp_init (&p1)) != MP_OKAY) \{
10126 052 goto __A1; 10172 052 goto LBL_A1;
10127 053 \} 10173 053 \}
10128 054 10174 054
10129 055 /* divide out larger power of two */ 10175 055 /* divide out larger power of two */
10130 056 k = mp_cnt_lsb(&a1); 10176 056 k = mp_cnt_lsb(&a1);
10131 057 if ((res = mp_div_2d(&a1, k, &a1, NULL)) != MP_OKAY) \{ 10177 057 if ((res = mp_div_2d(&a1, k, &a1, NULL)) != MP_OKAY) \{
10132 058 goto __P1; 10178 058 goto LBL_P1;
10133 059 \} 10179 059 \}
10134 060 10180 060
10135 061 /* step 4. if e is even set s=1 */ 10181 061 /* step 4. if e is even set s=1 */
10136 062 if ((k & 1) == 0) \{ 10182 062 if ((k & 1) == 0) \{
10137 063 s = 1; 10183 063 s = 1;
10155 081 if (mp_cmp_d (&a1, 1) == MP_EQ) \{ 10201 081 if (mp_cmp_d (&a1, 1) == MP_EQ) \{
10156 082 *c = s; 10202 082 *c = s;
10157 083 \} else \{ 10203 083 \} else \{
10158 084 /* n1 = n mod a1 */ 10204 084 /* n1 = n mod a1 */
10159 085 if ((res = mp_mod (p, &a1, &p1)) != MP_OKAY) \{ 10205 085 if ((res = mp_mod (p, &a1, &p1)) != MP_OKAY) \{
10160 086 goto __P1; 10206 086 goto LBL_P1;
10161 087 \} 10207 087 \}
10162 088 if ((res = mp_jacobi (&p1, &a1, &r)) != MP_OKAY) \{ 10208 088 if ((res = mp_jacobi (&p1, &a1, &r)) != MP_OKAY) \{
10163 089 goto __P1; 10209 089 goto LBL_P1;
10164 090 \} 10210 090 \}
10165 091 *c = s * r; 10211 091 *c = s * r;
10166 092 \} 10212 092 \}
10167 093 10213 093
10168 094 /* done */ 10214 094 /* done */
10169 095 res = MP_OKAY; 10215 095 res = MP_OKAY;
10170 096 __P1:mp_clear (&p1); 10216 096 LBL_P1:mp_clear (&p1);
10171 097 __A1:mp_clear (&a1); 10217 097 LBL_A1:mp_clear (&a1);
10172 098 return res; 10218 098 return res;
10173 099 \} 10219 099 \}
10174 100 #endif 10220 100 #endif
10175 \end{alltt} 10221 \end{alltt}
10176 \end{small} 10222 \end{small}
10404 026 10450 026
10405 027 /* default to not */ 10451 027 /* default to not */
10406 028 *result = MP_NO; 10452 028 *result = MP_NO;
10407 029 10453 029
10408 030 for (ix = 0; ix < PRIME_SIZE; ix++) \{ 10454 030 for (ix = 0; ix < PRIME_SIZE; ix++) \{
10409 031 /* what is a mod __prime_tab[ix] */ 10455 031 /* what is a mod LBL_prime_tab[ix] */
10410 032 if ((err = mp_mod_d (a, __prime_tab[ix], &res)) != MP_OKAY) \{ 10456 032 if ((err = mp_mod_d (a, ltm_prime_tab[ix], &res)) != MP_OKAY) \{
10411 033 return err; 10457 033 return err;
10412 034 \} 10458 034 \}
10413 035 10459 035
10414 036 /* is the residue zero? */ 10460 036 /* is the residue zero? */
10415 037 if (res == 0) \{ 10461 037 if (res == 0) \{
10429 10475
10430 \vspace{+3mm}\begin{small} 10476 \vspace{+3mm}\begin{small}
10431 \hspace{-5.1mm}{\bf File}: bn\_prime\_tab.c 10477 \hspace{-5.1mm}{\bf File}: bn\_prime\_tab.c
10432 \vspace{-3mm} 10478 \vspace{-3mm}
10433 \begin{alltt} 10479 \begin{alltt}
10434 016 const mp_digit __prime_tab[] = \{ 10480 016 const mp_digit ltm_prime_tab[] = \{
10435 017 0x0002, 0x0003, 0x0005, 0x0007, 0x000B, 0x000D, 0x0011, 0x0013, 10481 017 0x0002, 0x0003, 0x0005, 0x0007, 0x000B, 0x000D, 0x0011, 0x0013,
10436 018 0x0017, 0x001D, 0x001F, 0x0025, 0x0029, 0x002B, 0x002F, 0x0035, 10482 018 0x0017, 0x001D, 0x001F, 0x0025, 0x0029, 0x002B, 0x002F, 0x0035,
10437 019 0x003B, 0x003D, 0x0043, 0x0047, 0x0049, 0x004F, 0x0053, 0x0059, 10483 019 0x003B, 0x003D, 0x0043, 0x0047, 0x0049, 0x004F, 0x0053, 0x0059,
10438 020 0x0061, 0x0065, 0x0067, 0x006B, 0x006D, 0x0071, 0x007F, 10484 020 0x0061, 0x0065, 0x0067, 0x006B, 0x006D, 0x0071, 0x007F,
10439 021 #ifndef MP_8BIT 10485 021 #ifndef MP_8BIT
10545 040 return err; 10591 040 return err;
10546 041 \} 10592 041 \}
10547 042 10593 042
10548 043 /* compute t = b**a mod a */ 10594 043 /* compute t = b**a mod a */
10549 044 if ((err = mp_exptmod (b, a, a, &t)) != MP_OKAY) \{ 10595 044 if ((err = mp_exptmod (b, a, a, &t)) != MP_OKAY) \{
10550 045 goto __T; 10596 045 goto LBL_T;
10551 046 \} 10597 046 \}
10552 047 10598 047
10553 048 /* is it equal to b? */ 10599 048 /* is it equal to b? */
10554 049 if (mp_cmp (&t, b) == MP_EQ) \{ 10600 049 if (mp_cmp (&t, b) == MP_EQ) \{
10555 050 *result = MP_YES; 10601 050 *result = MP_YES;
10556 051 \} 10602 051 \}
10557 052 10603 052
10558 053 err = MP_OKAY; 10604 053 err = MP_OKAY;
10559 054 __T:mp_clear (&t); 10605 054 LBL_T:mp_clear (&t);
10560 055 return err; 10606 055 return err;
10561 056 \} 10607 056 \}
10562 057 #endif 10608 057 #endif
10563 \end{alltt} 10609 \end{alltt}
10564 \end{small} 10610 \end{small}
10636 037 /* get n1 = a - 1 */ 10682 037 /* get n1 = a - 1 */
10637 038 if ((err = mp_init_copy (&n1, a)) != MP_OKAY) \{ 10683 038 if ((err = mp_init_copy (&n1, a)) != MP_OKAY) \{
10638 039 return err; 10684 039 return err;
10639 040 \} 10685 040 \}
10640 041 if ((err = mp_sub_d (&n1, 1, &n1)) != MP_OKAY) \{ 10686 041 if ((err = mp_sub_d (&n1, 1, &n1)) != MP_OKAY) \{
10641 042 goto __N1; 10687 042 goto LBL_N1;
10642 043 \} 10688 043 \}
10643 044 10689 044
10644 045 /* set 2**s * r = n1 */ 10690 045 /* set 2**s * r = n1 */
10645 046 if ((err = mp_init_copy (&r, &n1)) != MP_OKAY) \{ 10691 046 if ((err = mp_init_copy (&r, &n1)) != MP_OKAY) \{
10646 047 goto __N1; 10692 047 goto LBL_N1;
10647 048 \} 10693 048 \}
10648 049 10694 049
10649 050 /* count the number of least significant bits 10695 050 /* count the number of least significant bits
10650 051 * which are zero 10696 051 * which are zero
10651 052 */ 10697 052 */
10652 053 s = mp_cnt_lsb(&r); 10698 053 s = mp_cnt_lsb(&r);
10653 054 10699 054
10654 055 /* now divide n - 1 by 2**s */ 10700 055 /* now divide n - 1 by 2**s */
10655 056 if ((err = mp_div_2d (&r, s, &r, NULL)) != MP_OKAY) \{ 10701 056 if ((err = mp_div_2d (&r, s, &r, NULL)) != MP_OKAY) \{
10656 057 goto __R; 10702 057 goto LBL_R;
10657 058 \} 10703 058 \}
10658 059 10704 059
10659 060 /* compute y = b**r mod a */ 10705 060 /* compute y = b**r mod a */
10660 061 if ((err = mp_init (&y)) != MP_OKAY) \{ 10706 061 if ((err = mp_init (&y)) != MP_OKAY) \{
10661 062 goto __R; 10707 062 goto LBL_R;
10662 063 \} 10708 063 \}
10663 064 if ((err = mp_exptmod (b, &r, a, &y)) != MP_OKAY) \{ 10709 064 if ((err = mp_exptmod (b, &r, a, &y)) != MP_OKAY) \{
10664 065 goto __Y; 10710 065 goto LBL_Y;
10665 066 \} 10711 066 \}
10666 067 10712 067
10667 068 /* if y != 1 and y != n1 do */ 10713 068 /* if y != 1 and y != n1 do */
10668 069 if (mp_cmp_d (&y, 1) != MP_EQ && mp_cmp (&y, &n1) != MP_EQ) \{ 10714 069 if (mp_cmp_d (&y, 1) != MP_EQ && mp_cmp (&y, &n1) != MP_EQ) \{
10669 070 j = 1; 10715 070 j = 1;
10670 071 /* while j <= s-1 and y != n1 */ 10716 071 /* while j <= s-1 and y != n1 */
10671 072 while ((j <= (s - 1)) && mp_cmp (&y, &n1) != MP_EQ) \{ 10717 072 while ((j <= (s - 1)) && mp_cmp (&y, &n1) != MP_EQ) \{
10672 073 if ((err = mp_sqrmod (&y, a, &y)) != MP_OKAY) \{ 10718 073 if ((err = mp_sqrmod (&y, a, &y)) != MP_OKAY) \{
10673 074 goto __Y; 10719 074 goto LBL_Y;
10674 075 \} 10720 075 \}
10675 076 10721 076
10676 077 /* if y == 1 then composite */ 10722 077 /* if y == 1 then composite */
10677 078 if (mp_cmp_d (&y, 1) == MP_EQ) \{ 10723 078 if (mp_cmp_d (&y, 1) == MP_EQ) \{
10678 079 goto __Y; 10724 079 goto LBL_Y;
10679 080 \} 10725 080 \}
10680 081 10726 081
10681 082 ++j; 10727 082 ++j;
10682 083 \} 10728 083 \}
10683 084 10729 084
10684 085 /* if y != n1 then composite */ 10730 085 /* if y != n1 then composite */
10685 086 if (mp_cmp (&y, &n1) != MP_EQ) \{ 10731 086 if (mp_cmp (&y, &n1) != MP_EQ) \{
10686 087 goto __Y; 10732 087 goto LBL_Y;
10687 088 \} 10733 088 \}
10688 089 \} 10734 089 \}
10689 090 10735 090
10690 091 /* probably prime now */ 10736 091 /* probably prime now */
10691 092 *result = MP_YES; 10737 092 *result = MP_YES;
10692 093 __Y:mp_clear (&y); 10738 093 LBL_Y:mp_clear (&y);
10693 094 __R:mp_clear (&r); 10739 094 LBL_R:mp_clear (&r);
10694 095 __N1:mp_clear (&n1); 10740 095 LBL_N1:mp_clear (&n1);
10695 096 return err; 10741 096 return err;
10696 097 \} 10742 097 \}
10697 098 #endif 10743 098 #endif
10698 \end{alltt} 10744 \end{alltt}
10699 \end{small} 10745 \end{small}