dropbear: tommath.tex comparison

comparison tommath.tex @ 386:97db060d0ef5 libtommath-orig libtommath-0.40

Update to LibTomMath 0.40

author	Matt Johnston <matt@ucc.asn.au>
date	Thu, 11 Jan 2007 03:11:15 +0000
parents	91fbc376f010
children

comparison


-:91fbc376f010
+:97db060d0ef5
 \end{tabular}
 %\end{small}
 }
 }
 \maketitle
-This text has been placed in the public domain.  This text corresponds to the v0.35 release of the
+This text has been placed in the public domain.  This text corresponds to the v0.39 release of the
 LibTomMath project.
 \begin{alltt}
 Tom St Denis
 111 Banning Rd
 Ottawa, Ontario
 K2L 1C3
 Canada
 Phone: 1-613-836-3160
 Email: [email protected]
 \end{alltt}
 This text is formatted to the international B5 paper size of 176mm wide by 250mm tall using the \LaTeX{}
 {\em book} macro package and the Perl {\em booker} package.
 Both texts also do not discuss several key optimal algorithms required such as ``Comba'' and Karatsuba multipliers
 and fast modular inversion, which we consider practical oversights.  These optimal algorithms are vital to achieve
 any form of useful performance in non-trivial applications.
 To solve this problem the focus of this text is on the practical aspects of implementing a multiple precision integer
-package.  As a case study the ``LibTomMath''\footnote{Available at \url{http://math.libtomcrypt.org}} package is used
+package.  As a case study the ``LibTomMath''\footnote{Available at \url{http://math.libtomcrypt.com}} package is used
 to demonstrate algorithms with real implementations\footnote{In the ISO C programming language.} that have been field
 tested and work very well.  The LibTomMath library is freely available on the Internet for all uses and this text
 discusses a very large portion of the inner workings of the library.
 The algorithms that are presented will always include at least one ``pseudo-code'' description followed
 037     a->sign  = MP_ZPOS;
 038
 039     return MP_OKAY;
 040   \}
 041   #endif
+042
 \end{alltt}
 \end{small}
 One immediate observation of this initialization function is that it does not return a pointer to an mp\_int structure.  It
 is assumed that the caller has already allocated memory for the mp\_int structure, typically on the application stack.  The
 035       a->alloc = a->used = 0;
 036       a->sign  = MP_ZPOS;
 037     \}
 038   \}
 039   #endif
+040
 \end{alltt}
 \end{small}
 The algorithm only operates on the mp\_int if it hasn't been previously cleared.  The if statement (line 24)
 checks to see if the \textbf{dp} member is not \textbf{NULL}.  If the mp\_int is a valid mp\_int then \textbf{dp} cannot be
 048       \}
 049     \}
 050     return MP_OKAY;
 051   \}
 052   #endif
+053
 \end{alltt}
 \end{small}
 A quick optimization is to first determine if a memory re-allocation is required at all.  The if statement (line 24) checks
 if the \textbf{alloc} member of the mp\_int is smaller than the requested digit count.  If the count is not larger than \textbf{alloc}
 039     \}
 040
 041     return MP_OKAY;
 042   \}
 043   #endif
+044
 \end{alltt}
 \end{small}
 The number of digits $b$ requested is padded (line 23) by first augmenting it to the next multiple of
 \textbf{MP\_PREC} and then adding \textbf{MP\_PREC} to the result.  If the memory can be successfully allocated the
 050       va_end(args);
 051       return res;                /* Assumed ok, if error flagged above. */
 052   \}
 053
 054   #endif
+055
 \end{alltt}
 \end{small}
 This function initializes a variable length list of mp\_int structure pointers.  However, instead of having the mp\_int
 structures in an actual C array they are simply passed as arguments to the function.  This function makes use of the
 035     if (a->used == 0) \{
 036       a->sign = MP_ZPOS;
 037     \}
 038   \}
 039   #endif
+040
 \end{alltt}
 \end{small}
 Note on line 27 how the test for the \textbf{used} count is made on the left of the \&\& operator.  In the C programming
 language the terms to \&\& are evaluated left to right with a boolean short-circuit if any condition fails.  This is
 059     b->used = a->used;
 060     b->sign = a->sign;
 061     return MP_OKAY;
 062   \}
 063   #endif
+064
 \end{alltt}
 \end{small}
 Occasionally a dependent algorithm may copy an mp\_int effectively into itself such as when the input and output
 mp\_int structures passed to a function are one and the same.  For this case it is optimal to return immediately without
 023       return res;
 024     \}
 025     return mp_copy (b, a);
 026   \}
 027   #endif
+028
 \end{alltt}
 \end{small}
 This will initialize \textbf{a} and make it a verbatim copy of the contents of \textbf{b}.  Note that
 \textbf{a} will have its own memory allocated which means that \textbf{b} may be cleared after the call
 027     for (n = 0; n < a->alloc; n++) \{
 028        *tmp++ = 0;
 029     \}
 030   \}
 031   #endif
+032
 \end{alltt}
 \end{small}
 After the function is completed, all of the digits are zeroed, the \textbf{used} count is zeroed and the
 \textbf{sign} variable is set to \textbf{MP\_ZPOS}.
 034     b->sign = MP_ZPOS;
 035
 036     return MP_OKAY;
 037   \}
 038   #endif
+039
 \end{alltt}
 \end{small}
 This fairly trivial algorithm first eliminates non--required duplications (line 27) and then sets the
 \textbf{sign} flag to \textbf{MP\_ZPOS}.
 031     \}
 032
 033     return MP_OKAY;
 034   \}
 035   #endif
+036
 \end{alltt}
 \end{small}
 Like mp\_abs() this function avoids non--required duplications (line 21) and then sets the sign.  We
 have to make sure that only non--zero values get a \textbf{sign} of \textbf{MP\_NEG}.  If the mp\_int is zero
 020     mp_zero (a);
 021     a->dp[0] = b & MP_MASK;
 022     a->used  = (a->dp[0] != 0) ? 1 : 0;
 023   \}
 024   #endif
+025
 \end{alltt}
 \end{small}
 First we zero (line 20) the mp\_int to make sure that the other members are initialized for a
 small positive constant.  mp\_zero() ensures that the \textbf{sign} is positive and the \textbf{used} count
 039     \}
 040     mp_clamp (a);
 041     return MP_OKAY;
 042   \}
 043   #endif
+044
 \end{alltt}
 \end{small}
 This function sets four bits of the number at a time to handle all practical \textbf{DIGIT\_BIT} sizes.  The weird
 addition on line 38 ensures that the newly added in bits are added to the number of digits.  While it may not
 046       \}
 047     \}
 048     return MP_EQ;
 049   \}
 050   #endif
+051
 \end{alltt}
 \end{small}
 The two if statements (lines 24 and 28) compare the number of digits in the two inputs.  These two are
 performed before all of the digits are compared since it is a very cheap test to perform and can potentially save
 034     \} else \{
 035        return mp_cmp_mag(a, b);
 036     \}
 037   \}
 038   #endif
+039
 \end{alltt}
 \end{small}
 The two if statements (lines 22 and 23) perform the initial sign comparison.  If the signs are not equal then whichever
 has the positive sign is larger.   The inputs are compared (line 31) based on magnitudes.  If the signs were both
 100
 101     mp_clamp (c);
 102     return MP_OKAY;
 103   \}
 104   #endif
+105
 \end{alltt}
 \end{small}
 We first sort (lines 27 to 35) the inputs based on magnitude and determine the $min$ and $max$ variables.
 Note that $x$ is a pointer to an mp\_int assigned to the largest input, in effect it is a local alias.  Next we
 080     mp_clamp (c);
 081     return MP_OKAY;
 082   \}
 083
 084   #endif
+085
 \end{alltt}
 \end{small}
 Like low level addition we ``sort'' the inputs.  Except in this case the sorting is hardcoded
 (lines 24 and 25).  In reality the $min$ and $max$ variables are only aliases and are only
 044     \}
 045     return res;
 046   \}
 047
 048   #endif
+049
 \end{alltt}
 \end{small}
 The source code follows the algorithm fairly closely.  The most notable new source code addition is the usage of the $res$ integer variable which
 is used to pass the result of the unsigned operations forward.  Unlike in the algorithm, the variable $res$ is merely returned as is without
 050     \}
 051     return res;
 052   \}
 053
 054   #endif
+055
 \end{alltt}
 \end{small}
 Much like the implementation of algorithm mp\_add the variable $res$ is used to catch the return code of the unsigned addition or subtraction operations
 and forward it to the end of the function.  On line 38 the ``not equal to'' \textbf{MP\_LT} expression is used to emulate a
 073     \}
 074     b->sign = a->sign;
 075     return MP_OKAY;
 076   \}
 077   #endif
+078
 \end{alltt}
 \end{small}
 This implementation is essentially an optimized implementation of s\_mp\_add for the case of doubling an input.  The only noteworthy difference
 is the use of the logical shift operator on line 51 to perform a single precision doubling.
 059     b->sign = a->sign;
 060     mp_clamp (b);
 061     return MP_OKAY;
 062   \}
 063   #endif
+064
 \end{alltt}
 \end{small}
 \section{Polynomial Basis Operations}
 Recall from section 4.3 that any integer can be represented as a polynomial in $x$ as $y = f(\beta)$.  Such a representation is also known as
 058       \}
 059     \}
 060     return MP_OKAY;
 061   \}
 062   #endif
+063
 \end{alltt}
 \end{small}
 The if statement (line 23) ensures that the $b$ variable is greater than zero since we do not interpret negative
 shift counts properly.  The \textbf{used} count is incremented by $b$ before the copy loop begins.  This eliminates
 063
 064     /* remove excess digits */
 065     a->used -= b;
 066   \}
 067   #endif
+068
 \end{alltt}
 \end{small}
 The only noteworthy element of this routine is the lack of a return type since it cannot fail.  Like mp\_lshd() we
 form a sliding window except we copy in the other direction.  After the window (line 59) we then zero
 $\beta$.  For example, if $b = 37$ and $\beta = 2^{28}$ then this step will multiply by $x$ leaving a multiplication by $2^{37 - 28} = 2^{9}$
 left.
 After the digits have been shifted appropriately at most $lg(\beta) - 1$ shifts are left to perform.  Step 5 calculates the number of remaining shifts
 required.  If it is non-zero a modified shift loop is used to calculate the remaining product.
-Essentially the loop is a generic version of algorith mp\_mul2 designed to handle any shift count in the range $1 \le x < lg(\beta)$.  The $mask$
+Essentially the loop is a generic version of algorithm mp\_mul\_2 designed to handle any shift count in the range $1 \le x < lg(\beta)$.  The $mask$
 variable is used to extract the upper $d$ bits to form the carry for the next iteration.
 This algorithm is loosely measured as an $O(2n)$ algorithm which means that if the input has $n$ digits it takes $2n$ ``time'' to
 complete.  It is possible to optimize this algorithm down to a $O(n)$ algorithm at a cost of making the algorithm slightly harder to follow.
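The partial-digit shift described above can be sketched as follows.  This is only an illustration and not the mp\_mul\_2d source: the function name shift\_left\_bits is hypothetical, the digit size of 28 bits is assumed from the $\beta = 2^{28}$ example, and the digits are held in a plain array instead of an mp\_int.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DIGIT_BIT 28                                  /* assumed digit size */
#define MP_MASK   ((((uint32_t)1) << DIGIT_BIT) - 1)  /* low DIGIT_BIT bits  */

/* Hypothetical helper: shift the digit array dp[0..used-1] left by d bits,
 * with 1 <= d < DIGIT_BIT, in place.  Returns the carry out of the top
 * digit, which the caller would store as a new most significant digit. */
uint32_t shift_left_bits(uint32_t *dp, size_t used, unsigned d)
{
    assert(d >= 1 && d < DIGIT_BIT);

    unsigned shift = DIGIT_BIT - d;             /* position of the upper d bits */
    uint32_t mask  = (((uint32_t)1) << d) - 1;  /* selects d bits               */
    uint32_t carry = 0;

    for (size_t i = 0; i < used; i++) {
        /* the upper d bits of this digit become the next digit's carry */
        uint32_t next = (dp[i] >> shift) & mask;
        /* shift, bring in the previous carry, drop bits above DIGIT_BIT */
        dp[i] = ((dp[i] << d) | carry) & MP_MASK;
        carry = next;
    }
    return carry;
}
```

Each digit is read once and written once, so the sketch touches every digit a single time for the bit-level part of the shift.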
 076     \}
 077     mp_clamp (c);
 078     return MP_OKAY;
 079   \}
 080   #endif
+081
 \end{alltt}
 \end{small}
 The shifting is performed in--place which means the first step (line 24) is to copy the input to the
 destination.  We avoid calling mp\_copy() by making sure the mp\_ints are different.  The destination then
 088     \}
 089     mp_clear (&t);
 090     return MP_OKAY;
 091   \}
 092   #endif
+093
 \end{alltt}
 \end{small}
 The implementation of algorithm mp\_div\_2d is slightly different than the algorithm specifies.  The remainder $d$ may be optionally
 ignored by passing \textbf{NULL} as the pointer to the mp\_int variable.    The temporary mp\_int variable $t$ is used to hold the
 t) 1));
 047     mp_clamp (c);
 048     return MP_OKAY;
 049   \}
 050   #endif
+051
 \end{alltt}
 \end{small}
 We first avoid cases of $b \le 0$ by simply mp\_zero()'ing the destination in such cases.  Next if $2^b$ is larger
 than the input we just mp\_copy() the input and return right away.  After this point we know we must actually
 081
 082     mp_clear (&t);
 083     return MP_OKAY;
 084   \}
 085   #endif
+086
 \end{alltt}
 \end{small}
 First we determine (line 30) if the Comba method can be used since it is faster.  The conditions for
 using the Comba routine are that min$(a.used, b.used) < \delta$ and the number of digits of output is less than
 \hspace{3mm}5.3  $iy \leftarrow \mbox{MIN}(a.used - tx, ty + 1)$ \\
 \hspace{3mm}5.4  for $iz$ from 0 to $iy - 1$ do \\
 \hspace{6mm}5.4.1  $\_ \hat W \leftarrow \_ \hat W + a_{tx+iz}b_{ty-iz}$ \\
 \hspace{3mm}5.5  $W_{ix} \leftarrow \_ \hat W (\mbox{mod }\beta)$\\
 \hspace{3mm}5.6  $\_ \hat W \leftarrow \lfloor \_ \hat W / \beta \rfloor$ \\
-6.  $W_{pa} \leftarrow \_ \hat W (\mbox{mod }\beta)$ \\
 \\
-7.  $oldused \leftarrow c.used$ \\
+6.  $oldused \leftarrow c.used$ \\
-8.  $c.used \leftarrow digs$ \\
+7.  $c.used \leftarrow digs$ \\
-9.  for $ix$ from $0$ to $pa$ do \\
+8.  for $ix$ from $0$ to $pa$ do \\
-\hspace{3mm}9.1  $c_{ix} \leftarrow W_{ix}$ \\
+\hspace{3mm}8.1  $c_{ix} \leftarrow W_{ix}$ \\
-10.  for $ix$ from $pa + 1$ to $oldused - 1$ do \\
+9.  for $ix$ from $pa + 1$ to $oldused - 1$ do \\
-\hspace{3mm}10.1 $c_{ix} \leftarrow 0$ \\
+\hspace{3mm}9.1 $c_{ix} \leftarrow 0$ \\
 \\
-11.  Clamp $c$. \\
+10.  Clamp $c$. \\
-12.  Return MP\_OKAY. \\
+11.  Return MP\_OKAY. \\
 \hline
 \end{tabular}
 \end{center}
 \end{small}
 \caption{Algorithm fast\_s\_mp\_mul\_digs}
 067         iy = MIN(a->used-tx, ty+1);
 068
 069         /* execute loop */
 070         for (iz = 0; iz < iy; ++iz) \{
 071            _W += ((mp_word)*tmpx++)*((mp_word)*tmpy--);
-072         \}
+072
-073
+073         \}
-074         /* store term */
+074
-075         W[ix] = ((mp_digit)_W) & MP_MASK;
+075         /* store term */
-076
+076         W[ix] = ((mp_digit)_W) & MP_MASK;
-077         /* make next carry */
+077
-078         _W = _W >> ((mp_word)DIGIT_BIT);
+078         /* make next carry */
-079     \}
+079         _W = _W >> ((mp_word)DIGIT_BIT);
-080
+080    \}
-081     /* store final carry */
+081
-082     W[ix] = (mp_digit)(_W & MP_MASK);
+082     /* setup dest */
-083
+083     olduse  = c->used;
-084     /* setup dest */
+084     c->used = pa;
-085     olduse  = c->used;
+085
-086     c->used = pa;
+086     \{
-087
+087       register mp_digit *tmpc;
-088     \{
+088       tmpc = c->dp;
-089       register mp_digit *tmpc;
+089       for (ix = 0; ix < pa+1; ix++) \{
-090       tmpc = c->dp;
+090         /* now extract the previous digit [below the carry] */
-091       for (ix = 0; ix < pa+1; ix++) \{
+091         *tmpc++ = W[ix];
-092         /* now extract the previous digit [below the carry] */
+092       \}
-093         *tmpc++ = W[ix];
+093
-094       \}
+094       /* clear unused digits [that existed in the old copy of c] */
-095
+095       for (; ix < olduse; ix++) \{
-096       /* clear unused digits [that existed in the old copy of c] */
+096         *tmpc++ = 0;
-097       for (; ix < olduse; ix++) \{
+097       \}
-098         *tmpc++ = 0;
+098     \}
-099       \}
+099     mp_clamp (c);
-100     \}
+100     return MP_OKAY;
-101     mp_clamp (c);
+101   \}
-102     return MP_OKAY;
+102   #endif
-103   \}
+103
-104   #endif
 \end{alltt}
 \end{small}
 As per the pseudo--code we first calculate $pa$ (line 47) as the number of digits to output.  Next we begin the outer loop
 to produce the individual columns of the product.  We use the two aliases $tmpx$ and $tmpy$ (lines 61, 62) to point
 inside the two multiplicands quickly.
-The inner loop (lines 70 to 72) of this implementation is where the tradeoff come into play.  Originally this comba
+The inner loop (lines 70 to 73) of this implementation is where the tradeoff comes into play.  Originally this comba
 implementation was ``row--major'' which means it adds to each of the columns in each pass.  After the outer loop it would then fix
 the carries.  This was very fast except it had an annoying drawback.  You had to read a mp\_word and two mp\_digits and write
 one mp\_word per iteration.  On processors such as the Athlon XP and P4 this did not matter much since the cache bandwidth
 is very high and it can keep the ALU fed with data.  It did, however, matter on older and embedded cpus where cache is often
 slower and also often doesn't exist.  This new algorithm only performs two reads per iteration under the assumption that the
 compiler has aliased $\_ \hat W$ to a CPU register.
-After the inner loop we store the current accumulator in $W$ and shift $\_ \hat W$ (lines 75, 78) to forward it as
+After the inner loop we store the current accumulator in $W$ and shift $\_ \hat W$ (lines 76, 79) to forward it as
-a carry for the next pass.  After the outer loop we use the final carry (line 82) as the last digit of the product.
+a carry for the next pass.  After the outer loop we use the final carry (line 76) as the last digit of the product.
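The column-wise accumulation described above can be sketched on plain digit arrays.  This is a simplified illustration and not the fast\_s\_mp\_mul\_digs source: the name comba\_mul is hypothetical, a 28-bit digit in a 64-bit accumulator is assumed, and the full product of $na + nb$ digits is always produced.

```c
#include <assert.h>
#include <stdint.h>

#define DIGIT_BIT 28                                  /* assumed digit size */
#define MP_MASK   ((((uint64_t)1) << DIGIT_BIT) - 1)

/* Hypothetical sketch: c[0..na+nb-1] = a[0..na-1] * b[0..nb-1], one output
 * column per outer iteration.  The accumulator is a single 64-bit word, so
 * the shorter input is assumed to have at most 256 digits. */
void comba_mul(const uint32_t *a, int na, const uint32_t *b, int nb, uint32_t *c)
{
    assert(na > 0 && nb > 0);
    assert(na <= 256 || nb <= 256);

    int      pa  = na + nb;  /* number of output digits            */
    uint64_t acc = 0;        /* column accumulator, carries forward */

    for (int ix = 0; ix < pa; ix++) {
        /* starting offsets into b (top) and a (bottom) for this column */
        int ty = (ix < nb - 1) ? ix : nb - 1;
        int tx = ix - ty;
        /* number of products that fall in this column */
        int iy = (na - tx < ty + 1) ? (na - tx) : (ty + 1);

        for (int iz = 0; iz < iy; iz++) {
            acc += (uint64_t)a[tx + iz] * (uint64_t)b[ty - iz];
        }
        c[ix] = (uint32_t)(acc & MP_MASK);  /* store term      */
        acc >>= DIGIT_BIT;                  /* make next carry */
    }
}
```

In the last column no products remain, so the stored digit is simply the carry left over from the previous column.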
 \subsection{Polynomial Basis Multiplication}
 To break the $O(n^2)$ barrier in multiplication requires a completely different look at integer multiplication.  In the following algorithms
 the use of polynomial basis representation for two integers $a$ and $b$ as $f(x) = \sum_{i=0}^{n} a_i x^i$ and
 $g(x) = \sum_{i=0}^{n} b_i x^i$ respectively, is required.  In this system both $f(x)$ and $g(x)$ have $n + 1$ terms and are of the $n$'th degree.
 Karatsuba \cite{KARA} multiplication when originally proposed in 1962 was among the first set of algorithms to break the $O(n^2)$ barrier for
 general purpose multiplication.  Given two polynomial basis representations $f(x) = ax + b$ and $g(x) = cx + d$, Karatsuba proved with
 light algebra \cite{KARAP} that the following polynomial is equivalent to multiplication of the two integers the polynomials represent.
 \begin{equation}
-f(x) \cdot g(x) = acx^2 + ((a - b)(c - d) - (ac + bd))x + bd
+f(x) \cdot g(x) = acx^2 + ((a + b)(c + d) - (ac + bd))x + bd
 \end{equation}
 Using the observation that $ac$ and $bd$ could be re-used only three half sized multiplications would be required to produce the product.  Applying
 this algorithm recursively, the work factor becomes $O(n^{lg(3)})$ which is substantially better than the work factor $O(n^2)$ of the Comba technique.  It turns
 out what Karatsuba did not know or at least did not publish was that this is simply polynomial basis multiplication with the points
-$\zeta_0$, $\zeta_{\infty}$ and $-\zeta_{-1}$.  Consider the resultant system of equations.
+$\zeta_0$, $\zeta_{\infty}$ and $\zeta_{1}$.  Consider the resultant system of equations.
 \begin{center}
 \begin{tabular}{rcrcrcrc}
 $\zeta_{0}$ &      $=$ &  &  &  & & $w_0$ \\
-$-\zeta_{-1}$ &    $=$ & $-w_2$ & $+$ & $w_1$ & $-$ & $w_0$ \\
+$\zeta_{1}$ &      $=$ & $w_2$ & $+$ & $w_1$ & $+$ & $w_0$ \\
 $\zeta_{\infty}$ & $=$ & $w_2$ &  & &  & \\
 \end{tabular}
 \end{center}
 By subtracting the first and last equations from the equation in the middle the term $w_1$ can be isolated and all three coefficients solved for.  The simplicity
 of this system of equations has made Karatsuba fairly popular.  In fact the cutoff point is often fairly low\footnote{With LibTomMath 0.18 it is 70 and 109 digits for the Intel P4 and AMD Athlon respectively.}
-making it an ideal algorithm to speed up certain public key cryptosystems such as RSA and Diffie-Hellman.  It is worth noting that the point
+making it an ideal algorithm to speed up certain public key cryptosystems such as RSA and Diffie-Hellman.
-$\zeta_1$ could be substituted for $-\zeta_{-1}$.  In this case the first and third row are subtracted instead of added to the second row.
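As a small illustration of the additive form of the identity, the following sketch multiplies two 32-bit integers using three 16-bit half products.  The name kara\_mul32 and the choice $x = 2^{16}$ are assumptions made for this example only; LibTomMath applies the same idea to mp\_int halves.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch: f = a*x + b and g = c*x + d with x = 2^16.
 * The product is ac*x^2 + ((a + b)(c + d) - (ac + bd))*x + bd. */
uint64_t kara_mul32(uint32_t f, uint32_t g)
{
    uint64_t a = f >> 16, b = f & 0xFFFFu;  /* upper and lower halves of f */
    uint64_t c = g >> 16, d = g & 0xFFFFu;  /* upper and lower halves of g */

    uint64_t ac  = a * c;                        /* first half product  */
    uint64_t bd  = b * d;                        /* second half product */
    uint64_t mid = (a + b) * (c + d) - (ac + bd); /* third half product  */

    /* debug-only check of the identity; compiled out when NDEBUG is set */
    assert(mid == a * d + b * c);

    return (ac << 32) + (mid << 16) + bd;
}
```

Since $(a + b)(c + d) \ge ac + bd$ for non-negative halves, the middle term never goes negative, which is consistent with the unsigned addition and subtraction used in the additive form.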
 \newpage\begin{figure}[!here]
 \begin{small}
 \begin{center}
 \begin{tabular}{l}
 7.  $y1 \leftarrow \lfloor b / \beta^B \rfloor$ \\
 \\
 Calculate the three products. \\
 8.  $x0y0 \leftarrow x0 \cdot y0$ (\textit{mp\_mul}) \\
 9.  $x1y1 \leftarrow x1 \cdot y1$ \\
-10.  $t1 \leftarrow x1 - x0$ (\textit{mp\_sub}) \\
+10.  $t1 \leftarrow x1 + x0$ (\textit{mp\_add}) \\
-11.  $x0 \leftarrow y1 - y0$ \\
+11.  $x0 \leftarrow y1 + y0$ \\
 12.  $t1 \leftarrow t1 \cdot x0$ \\
 \\
 Calculate the middle term. \\
 13.  $x0 \leftarrow x0y0 + x1y1$ \\
-14.  $t1 \leftarrow x0 - t1$ \\
+14.  $t1 \leftarrow t1 - x0$ (\textit{s\_mp\_sub}) \\
 \\
 Calculate the final product. \\
 15.  $t1 \leftarrow t1 \cdot \beta^B$ (\textit{mp\_lshd}) \\
 16.  $x1y1 \leftarrow x1y1 \cdot \beta^{2B}$ \\
 17.  $t1 \leftarrow x0y0 + t1$ \\
 be used for both of the inputs meaning that it must be smaller than the smallest input.  Step 3 chooses the radix point $B$ as half of the
 smallest input \textbf{used} count.  After the radix point is chosen the inputs are split into lower and upper halves.  Steps 4 and 5
 compute the lower halves.  Steps 6 and 7 compute the upper halves.
 After the halves have been computed the three intermediate half-size products must be computed.  Step 8 and 9 compute the trivial products
-$x0 \cdot y0$ and $x1 \cdot y1$.  The mp\_int $x0$ is used as a temporary variable after $x1 - x0$ has been computed.  By using $x0$ instead
+$x0 \cdot y0$ and $x1 \cdot y1$.  The mp\_int $x0$ is used as a temporary variable after $x1 + x0$ has been computed.  By using $x0$ instead
 of an additional temporary variable, the algorithm can avoid an additional memory allocation operation.
 The remaining steps 13 through 18 compute the Karatsuba polynomial through a variety of digit shifting and addition operations.
 \vspace{+3mm}\begin{small}
 023    *
 024    * a = a1 * B**n + a0
 025    * b = b1 * B**n + b0
 026    *
 027    * Then, a * b =>
-028      a1b1 * B**2n + ((a1 - a0)(b1 - b0) + a0b0 + a1b1) * B + a0b0
+028      a1b1 * B**2n + ((a1 + a0)(b1 + b0) - (a0b0 + a1b1)) * B + a0b0
 029    *
 030    * Note that a1b1 and a0b0 are used twice and only need to be
 031    * computed once.  So in total three half size (half # of
 032    * digit) multiplications are performed, a0b0, a1b1 and
-033    * (a1-b1)(a0-b0)
+033    * (a1+a0)(b1+b0)
 034    *
 035    * Note that a multiplication of half the digits requires
 036    * 1/4th the number of single precision multiplications so in
 037    * total after one call 25% of the single precision multiplications
 038    * are saved.  Note also that the call to mp_mul can end up back
 119     if (mp_mul (&x0, &y0, &x0y0) != MP_OKAY)
 120       goto X1Y1;          /* x0y0 = x0*y0 */
 121     if (mp_mul (&x1, &y1, &x1y1) != MP_OKAY)
 122       goto X1Y1;          /* x1y1 = x1*y1 */
 123
-124     /* now calc x1-x0 and y1-y0 */
+124     /* now calc x1+x0 and y1+y0 */
-125     if (mp_sub (&x1, &x0, &t1) != MP_OKAY)
+125     if (s_mp_add (&x1, &x0, &t1) != MP_OKAY)
 126       goto X1Y1;          /* t1 = x1 + x0 */
-127     if (mp_sub (&y1, &y0, &x0) != MP_OKAY)
+127     if (s_mp_add (&y1, &y0, &x0) != MP_OKAY)
 128       goto X1Y1;          /* t2 = y1 + y0 */
 129     if (mp_mul (&t1, &x0, &t1) != MP_OKAY)
-130       goto X1Y1;          /* t1 = (x1 - x0) * (y1 - y0) */
+130       goto X1Y1;          /* t1 = (x1 + x0) * (y1 + y0) */
 131
 132     /* add x0y0 */
 133     if (mp_add (&x0y0, &x1y1, &x0) != MP_OKAY)
 134       goto X1Y1;          /* t2 = x0y0 + x1y1 */
-135     if (mp_sub (&x0, &t1, &t1) != MP_OKAY)
+135     if (s_mp_sub (&t1, &x0, &t1) != MP_OKAY)
-136       goto X1Y1;          /* t1 = x0y0 + x1y1 - (x1-x0)*(y1-y0) */
+136       goto X1Y1;          /* t1 = (x1+x0)*(y1+y0) - (x1y1 + x0y0) */
 137
 138     /* shift by B */
 139     if (mp_lshd (&t1, B) != MP_OKAY)
 140       goto X1Y1;          /* t1 = ((x1+x0)*(y1+y0) - (x1y1 + x0y0))<<B */
 141     if (mp_lshd (&x1y1, B * 2) != MP_OKAY)
 158   X0:mp_clear (&x0);
 159   ERR:
 160     return err;
 161   \}
 162   #endif
+163
 \end{alltt}
 \end{small}
 The new coding element in this routine, not  seen in previous routines, is the usage of goto statements.  The conventional
 wisdom is that goto statements should be avoided.  This is generally true, however when every single function call can fail, it makes sense
 275                       &b2, &tmp1, &tmp2, NULL);
 276        return res;
 277   \}
 278
 279   #endif
+280
 \end{alltt}
 \end{small}
 The first obvious thing to note is that this algorithm is complicated.  The complexity is worth it if you are multiplying very
 large numbers.  For example, a 10,000 digit multiplication takes approximately 99,282,205 fewer single precision multiplications with
 057     \}
 058     c->sign = (c->used > 0) ? neg : MP_ZPOS;
 059     return res;
 060   \}
 061   #endif
+062
 \end{alltt}
 \end{small}
 The implementation is rather simplistic and is not particularly noteworthy.  Line 23 computes the sign of the result using the ``?''
 operator from the C programming language.  Line 47 computes $\delta$ using the fact that $1 << k$ is equal to $2^k$.
 075     mp_exch (&t, b);
 076     mp_clear (&t);
 077     return MP_OKAY;
 078   \}
 079   #endif
+080
 \end{alltt}
 \end{small}
 Inside the outer loop (line 33) the square term is calculated on line 36.  The carry (line 43) has been
 extracted from the mp\_word accumulator using a right shift.  Aliases for $a_{ix}$ and $t_{ix+iy}$ are initialized
 105     \}
 106     mp_clamp (b);
 107     return MP_OKAY;
 108   \}
 109   #endif
+110
 \end{alltt}
 \end{small}
 This implementation is essentially a copy of Comba multiplication with the appropriate changes added to make it faster for
 the special case of squaring.
 Let $f(x) = ax + b$ represent the polynomial basis representation of a number to square.
 Let $h(x) = \left ( f(x) \right )^2$ represent the square of the polynomial.  The Karatsuba equation can be modified to square a
 number with the following equation.
 \begin{equation}
-h(x) = a^2x^2 + \left (a^2 + b^2 - (a - b)^2 \right )x + b^2
+h(x) = a^2x^2 + \left ((a + b)^2 - (a^2 + b^2) \right )x + b^2
 \end{equation}
-Upon closer inspection this equation only requires the calculation of three half-sized squares: $a^2$, $b^2$ and $(a - b)^2$.  As in
+Upon closer inspection this equation only requires the calculation of three half-sized squares: $a^2$, $b^2$ and $(a + b)^2$.  As in
 Karatsuba multiplication, this algorithm can be applied recursively on the input and will achieve an asymptotic running time of
 $O \left ( n^{lg(3)} \right )$.
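As an illustrative worked example (the values are chosen here and are not taken from the library), let $x = 100$, $a = 12$ and $b = 34$ so that $f(x) = 1234$.  The middle coefficient is
\begin{equation}
(a + b)^2 - (a^2 + b^2) = 46^2 - (144 + 1156) = 2116 - 1300 = 816 = 2ab
\end{equation}
and the three squares recombine to
\begin{equation}
h(100) = 144 \cdot 100^2 + 816 \cdot 100 + 1156 = 1522756 = 1234^2
\end{equation}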
 If the asymptotic times of Karatsuba squaring and multiplication are the same, why not simply use the multiplication algorithm
 instead?  The answer to this arises from the cutoff point for squaring.  As in multiplication there exists a cutoff point, at which the
 5.  $x1 \leftarrow \lfloor a / \beta^B \rfloor$ (\textit{mp\_lshd}) \\
 \\
 Calculate the three squares. \\
 6.  $x0x0 \leftarrow x0^2$ (\textit{mp\_sqr}) \\
 7.  $x1x1 \leftarrow x1^2$ \\
-8.  $t1 \leftarrow x1 - x0$ (\textit{mp\_sub}) \\
+8.  $t1 \leftarrow x1 + x0$ (\textit{s\_mp\_add}) \\
 9.  $t1 \leftarrow t1^2$ \\
 \\
 Compute the middle term. \\
 10.  $t2 \leftarrow x0x0 + x1x1$ (\textit{s\_mp\_add}) \\
-11.  $t1 \leftarrow t2 - t1$ \\
+11.  $t1 \leftarrow t1 - t2$ \\
 \\
 Compute final product. \\
 12.  $t1 \leftarrow t1\beta^B$ (\textit{mp\_lshd}) \\
 13.  $x1x1 \leftarrow x1x1\beta^{2B}$ \\
 14.  $t1 \leftarrow t1 + x0x0$ \\
 The radix point for squaring is simply placed exactly in the middle of the digits when the input has an odd number of digits, otherwise it is
 placed just below the middle.  Step 3, 4 and 5 compute the two halves required using $B$
 as the radix point.  The first two squares in steps 6 and 7 are rather straightforward while the last square is of a more compact form.
-By expanding $\left (x1 - x0 \right )^2$, the $x1^2$ and $x0^2$ terms in the middle disappear, that is $x1^2 + x0^2 - (x1 - x0)^2 = 2 \cdot x0 \cdot x1$.
+By expanding $\left (x1 + x0 \right )^2$, the $x1^2$ and $x0^2$ terms in the middle disappear, that is $(x1 + x0)^2 - (x1^2 + x0^2)  = 2 \cdot x0 \cdot x1$.
 Now if $5n$ single precision additions and a squaring of $n$-digits is faster than multiplying two $n$-digit numbers and doubling then
 this method is faster.  Assuming no further recursions occur, the difference can be estimated with the following inequality.
 Let $p$ represent the cost of a single precision addition and $q$ the cost of a single precision multiplication both in terms of time\footnote{Or
 machine clock cycles.}.
 077     if (mp_sqr (&x0, &x0x0) != MP_OKAY)
 078       goto X1X1;           /* x0x0 = x0*x0 */
 079     if (mp_sqr (&x1, &x1x1) != MP_OKAY)
 080       goto X1X1;           /* x1x1 = x1*x1 */
 081
-082     /* now calc (x1-x0)**2 */
+082     /* now calc (x1+x0)**2 */
-083     if (mp_sub (&x1, &x0, &t1) != MP_OKAY)
+083     if (s_mp_add (&x1, &x0, &t1) != MP_OKAY)
 084       goto X1X1;           /* t1 = x1 + x0 */
 085     if (mp_sqr (&t1, &t1) != MP_OKAY)
 086       goto X1X1;           /* t1 = (x1 + x0) * (x1 + x0) */
 087
 088     /* add x0y0 */
 089     if (s_mp_add (&x0x0, &x1x1, &t2) != MP_OKAY)
 090       goto X1X1;           /* t2 = x0x0 + x1x1 */
-091     if (mp_sub (&t2, &t1, &t1) != MP_OKAY)
+091     if (s_mp_sub (&t1, &t2, &t1) != MP_OKAY)
-092       goto X1X1;           /* t1 = x0x0 + x1x1 - (x1-x0)*(x1-x0) */
+092       goto X1X1;           /* t1 = (x1+x0)**2 - (x0x0 + x1x1) */
 093
 094     /* shift by B */
 095     if (mp_lshd (&t1, B) != MP_OKAY)
 096       goto X1X1;           /* t1 = ((x1+x0)**2 - (x0x0 + x1x1))<<B */
 097     if (mp_lshd (&x1x1, B * 2) != MP_OKAY)
 112   X0:mp_clear (&x0);
 113   ERR:
 114     return err;
 115   \}
 116   #endif
+117
 \end{alltt}
 \end{small}
 This implementation is largely based on the implementation of algorithm mp\_karatsuba\_mul.  It uses the same inline style to copy and
 shift the input into the two halves.  The loop from line 53 to line 69 has been modified since only one input exists.  The \textbf{used}
 049     \}
 050     b->sign = MP_ZPOS;
 051     return res;
 052   \}
 053   #endif
+054
 \end{alltt}
 \end{small}
 \section*{Exercises}
 \begin{tabular}{cl}
 091     mp_clear (&q);
 092
 093     return res;
 094   \}
 095   #endif
+096
 \end{alltt}
 \end{small}
 The first multiplication that determines the quotient can be performed by only producing the digits from $m - 1$ and up.  This essentially halves
 the number of single precision multiplications required.  However, the optimization is only safe if $\beta$ is much larger than the number of digits
 025       return res;
 026     \}
 027     return mp_div (a, b, a, NULL);
 028   \}
 029   #endif
+030
 \end{alltt}
 \end{small}
 This simple routine calculates the reciprocal $\mu$ required by Barrett reduction.  Note the extended usage of algorithm mp\_div where the variable
 which would receive the remainder is passed as NULL.  As will be discussed in~\ref{sec:division} the division routine allows both the quotient and the
 \hline $4$ & $x + n = 1112$, $x/2 = 556$ \\
 \hline $5$ & $x/2 = 278$ \\
 \hline $6$ & $x/2 = 139$ \\
 \hline $7$ & $x + n = 396$, $x/2 = 198$ \\
 \hline $8$ & $x/2 = 99$ \\
+\hline $9$ & $x + n = 356$, $x/2 = 178$ \\
 \hline
 \end{tabular}
 \end{center}
 \end{small}
 \caption{Example of Montgomery Reduction (I)}
 \label{fig:MONT1}
 \end{figure}
-Consider the example in figure~\ref{fig:MONT1} which reduces $x = 5555$ modulo $n = 257$ when $k = 8$.  The result of the algorithm $r = 99$ is
+Consider the example in figure~\ref{fig:MONT1} which reduces $x = 5555$ modulo $n = 257$ when $k = 9$ (note $\beta^k = 512$ which is larger than $n$).  The result of
-congruent to the value of $2^{-8} \cdot 5555 \mbox{ (mod }257\mbox{)}$.  When $r$ is multiplied by $2^8$ modulo $257$ the correct residue
+the algorithm $r = 178$ is congruent to the value of $2^{-9} \cdot 5555 \mbox{ (mod }257\mbox{)}$.  When $r$ is multiplied by $2^9$ modulo $257$ the correct residue
 $r \equiv 158$ is produced.
 Let $k = \lfloor lg(n) \rfloor + 1$ represent the number of bits in $n$.  The current algorithm requires $2k^2$ single precision shifts
 and $k^2$ single precision additions.  At this rate the algorithm is most certainly slower than Barrett reduction and not terribly useful.
 Fortunately there exists an alternative representation of the algorithm.
 \begin{figure}[!here]
 \begin{small}
 \begin{center}
 \begin{tabular}{l}
 \hline Algorithm \textbf{Montgomery Reduction} (modified I). \\
-\textbf{Input}.   Integer $x$, $n$ and $k$ \\
+\textbf{Input}.   Integer $x$, $n$ and $k$ ($2^k > n$) \\
 \textbf{Output}.  $2^{-k}x \mbox{ (mod }n\mbox{)}$ \\
 \hline \\
-1.  for $t$ from $0$ to $k - 1$ do \\
+1.  for $t$ from $1$ to $k$ do \\
 \hspace{3mm}1.1  If the $t$'th bit of $x$ is one then \\
 \hspace{6mm}1.1.1  $x \leftarrow x + 2^{t-1}n$ \\
 2.  Return $x/2^k$. \\
 \hline
 \end{tabular}
 \hline $4$ & $x + 2^{3}n = 8896$ & $10001011000000$ \\
 \hline $5$ & $8896$ & $10001011000000$ \\
 \hline $6$ & $8896$ & $10001011000000$ \\
 \hline $7$ & $x + 2^{6}n = 25344$ & $110001100000000$ \\
 \hline $8$ & $25344$ & $110001100000000$ \\
-\hline -- & $x/2^k = 99$ & \\
+\hline $9$ & $x + 2^{8}n = 91136$ & $10110010000000000$ \\
+\hline -- & $x/2^k = 178$ & \\
 \hline
 \end{tabular}
 \end{center}
 \end{small}
 \caption{Example of Montgomery Reduction (II)}
 \label{fig:MONT2}
 \end{figure}
-Figure~\ref{fig:MONT2} demonstrates the modified algorithm reducing $x = 5555$ modulo $n = 257$ with $k = 8$.
+Figure~\ref{fig:MONT2} demonstrates the modified algorithm reducing $x = 5555$ modulo $n = 257$ with $k = 9$.
 With this algorithm a single shift right at the end is the only right shift required to reduce the input instead of $k$ right shifts inside the
 loop.  Note that in the iterations $t = 2, 5, 6$ and $8$ the result $x$ is not changed.  In those iterations the $t$'th bit of $x$ is
 zero and the appropriate multiple of $n$ does not need to be added to force the $t$'th bit of the result to zero.
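The modified algorithm can be sketched in C in the same way.  This is an illustrative word-sized version under the same assumptions as before ($n$ odd, no overflow of an unsigned long), and mont\_reduce\_shifted is a hypothetical name.  In this sketch step $t$ inspects bit $t - 1$ counting from zero and adds $2^{t-1}n$, which matches the rows of figure~\ref{fig:MONT2}.

```c
/* Hypothetical helper (not part of LibTomMath): instead of shifting x right
 * in every pass, shifted multiples of n are added and a single right shift
 * by k bits is performed at the end.
 * Assumes n is odd and that intermediate values fit in an unsigned long. */
static unsigned long mont_reduce_shifted(unsigned long x, unsigned long n, int k)
{
   int t;

   for (t = 1; t <= k; t++) {
      /* step t looks at bit t-1 (counting from zero) */
      if ((x >> (t - 1)) & 1UL) {
         x += n << (t - 1);   /* forces that bit of x to zero */
      }
   }
   return x >> k;             /* the low k bits are now all zero */
}
```

For $x = 5555$, $n = 257$ and $k = 9$ the value of $x$ just before the final shift is $91136$, so the function returns the same result as the bit-by-bit loop.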
 \subsection{Digit Based Montgomery Reduction}
 \begin{figure}[!here]
 \begin{small}
 \begin{center}
 \begin{tabular}{l}
 \hline Algorithm \textbf{Montgomery Reduction} (modified II). \\
-\textbf{Input}.   Integer $x$, $n$ and $k$ \\
+\textbf{Input}.   Integer $x$, $n$ and $k$ ($\beta^k > n$) \\
 \textbf{Output}.  $\beta^{-k}x \mbox{ (mod }n\mbox{)}$ \\
 \hline \\
 1.  for $t$ from $0$ to $k - 1$ do \\
 \hspace{3mm}1.1  $x \leftarrow x + \mu n \beta^t$ \\
 2.  Return $x/\beta^k$. \\
 109     \}
 110
 111     return MP_OKAY;
 112   \}
 113   #endif
+114
 \end{alltt}
 \end{small}
 This is the baseline implementation of the Montgomery reduction algorithm.  Lines 30 to 35 determine if the Comba based
 routine can be used instead.  Line 48 computes the value of $\mu$ for that particular iteration of the outer loop.
 163       return s_mp_sub (x, n, x);
 164     \}
 165     return MP_OKAY;
 166   \}
 167   #endif
+168
 \end{alltt}
 \end{small}
 The $\hat W$ array is first filled with digits of $x$ on line 50 then the rest of the digits are zeroed on line 54.  Both loops share
 the same alias variables to make the code easier to read.
 \textbf{Input}.   mp\_int $n$ ($n > 1$ and $(n, 2) = 1$) \\
 \textbf{Output}.  $\rho \equiv -1/n_0 \mbox{ (mod }\beta\mbox{)}$ \\
 \hline \\
 1.  $b \leftarrow n_0$ \\
 2.  If $b$ is even return(\textit{MP\_VAL}) \\
-3.  $x \leftarrow ((b + 2) \mbox{ AND } 4) << 1) + b$ \\
+3.  $x \leftarrow (((b + 2) \mbox{ AND } 4) << 1) + b$ \\
 4.  for $k$ from 0 to $\lceil lg(lg(\beta)) \rceil - 2$ do \\
 \hspace{3mm}4.1  $x \leftarrow x \cdot (2 - bx)$ \\
 5.  $\rho \leftarrow \beta - x \mbox{ (mod }\beta\mbox{)}$ \\
 6.  Return(\textit{MP\_OKAY}). \\
 \hline
 045   #ifdef MP_64BIT
 046     x *= 2 - b * x;               /* here x*a==1 mod 2**64 */
 047   #endif
 048
 049     /* rho = -1/m mod b */
-050     *rho = (((mp_word)1 << ((mp_word) DIGIT_BIT)) - x) & MP_MASK;
+050     *rho = (unsigned long)(((mp_word)1 << ((mp_word) DIGIT_BIT)) - x) & MP_MAS
+K;
 051
 052     return MP_OKAY;
 053   \}
 054   #endif
+055
 \end{alltt}
 \end{small}
 This source code computes the value of $\rho$ required to perform Montgomery reduction.  It has been modified to avoid performing excess
 multiplications when $\beta$ is not the default 28-bits.
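The idea behind the setup routine can be illustrated on a single 32-bit word.  The sketch below assumes the default 28-bit digit and an odd $n_0$; the names mont\_rho\_sketch, RHO\_DIGIT\_BIT and RHO\_MASK are hypothetical and chosen only for this example.  Each multiplication doubles the number of correct low bits of the inverse, so three passes after the initial four-bit estimate are enough for 28 bits.

```c
#include <stdint.h>

#define RHO_DIGIT_BIT 28
#define RHO_MASK      ((((uint32_t)1) << RHO_DIGIT_BIT) - 1)

/* Hypothetical helper (not part of LibTomMath): computes
 * rho = -1/n0 (mod 2^28) for an odd n0 using Newton iterations.
 * The arithmetic wraps modulo 2^32, which is harmless because only
 * the low 28 bits are kept at the end. */
static uint32_t mont_rho_sketch(uint32_t n0)
{
   uint32_t b = n0;
   uint32_t x;

   x  = (((b + 2) & 4) << 1) + b;   /* x*b == 1 (mod 2^4)  */
   x *= 2 - b * x;                  /* x*b == 1 (mod 2^8)  */
   x *= 2 - b * x;                  /* x*b == 1 (mod 2^16) */
   x *= 2 - b * x;                  /* x*b == 1 (mod 2^32) */

   /* rho = -x (mod 2^28) */
   return ((((uint32_t)1) << RHO_DIGIT_BIT) - x) & RHO_MASK;
}
```

A quick sanity check is that $n_0 \cdot \rho + 1$ has its low 28 bits equal to zero for any odd $n_0$.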
 085       goto top;
 086     \}
 087     return MP_OKAY;
 088   \}
 089   #endif
+090
 \end{alltt}
 \end{small}
 The first step is to grow $x$ as required to $2m$ digits since the reduction is performed in place on $x$.  The label on line 51 is where
 the algorithm will resume if further reduction passes are required.  In theory it could be placed at the top of the function however, the size of
 023      *d = (mp_digit)((((mp_word)1) << ((mp_word)DIGIT_BIT)) -
 024           ((mp_word)a->dp[0]));
 025   \}
 026
 027   #endif
+028
 \end{alltt}
 \end{small}
 \subsubsection{Modulus Detection}
 Another algorithm which will be useful is the ability to detect a restricted Diminished Radix modulus.  An integer is said to be
 034      \}
 035      return 1;
 036   \}
 037
 038   #endif
+039
 \end{alltt}
 \end{small}
 \subsection{Unrestricted Diminished Radix Reduction}
 The unrestricted Diminished Radix algorithm allows modular reductions to be performed when the modulus is of the form $2^p - k$.  This algorithm
 052      mp_clear(&q);
 053      return res;
 054   \}
 055
 056   #endif
+057
 \end{alltt}
 \end{small}
 The algorithm mp\_count\_bits calculates the number of bits in an mp\_int which is used to find the initial value of $p$.  The call to mp\_div\_2d
 on line 30 calculates both the quotient $q$ and the remainder $a$ required.  By doing both in a single function call the code size
 038      *d = tmp.dp[0];
 039      mp_clear(&tmp);
 040      return MP_OKAY;
 041   \}
 042   #endif
+043
 \end{alltt}
 \end{small}
 \subsubsection{Unrestricted Detection}
 An integer $n$ is a valid unrestricted Diminished Radix modulus if either of the following are true.
 043      \}
 044      return MP_YES;
 045   \}
 046
 047   #endif
+048
 \end{alltt}
 \end{small}
 048
 049     mp_clear (&g);
 050     return MP_OKAY;
 051   \}
 052   #endif
+053
 \end{alltt}
 \end{small}
 Line 28 sets the initial value of the result to $1$.  Next the loop on line 30 steps through each bit of the exponent starting from
 the most significant down towards the least significant. The invariant squaring operation placed on line 32 is performed first.  After
 063        return MP_VAL;
 064   #endif
 065     \}
 066
 067   /* modified diminished radix reduction */
-068   #if defined(BN_MP_REDUCE_IS_2K_L_C) && defined(BN_MP_REDUCE_2K_L_C)
+068   #if defined(BN_MP_REDUCE_IS_2K_L_C) && defined(BN_MP_REDUCE_2K_L_C) && defin
+ed(BN_S_MP_EXPTMOD_C)
 069     if (mp_reduce_is_2k_l(P) == MP_YES) \{
 070        return s_mp_exptmod(G, X, P, Y, 1);
 071     \}
 072   #endif
 073
 103     \}
 104   #endif
 105   \}
 106
 107   #endif
+108
 \end{alltt}
 \end{small}
 In order to keep the algorithms in a known state the first step on line 28 is to reject any negative modulus as input.  If the exponent is
 negative the algorithm tries to perform a modular exponentiation with the modular inverse of the base $G$.  The temporary variable $tmpG$ is assigned
 \vspace{+3mm}\begin{small}
 \hspace{-5.1mm}{\bf File}: bn\_s\_mp\_exptmod.c
 \vspace{-3mm}
 \begin{alltt}
-016
+016   #ifdef MP_LOW_MEM
-017   #ifdef MP_LOW_MEM
+017      #define TAB_SIZE 32
-018      #define TAB_SIZE 32
+018   #else
-019   #else
+019      #define TAB_SIZE 256
-020      #define TAB_SIZE 256
+020   #endif
-021   #endif
+021
-022
+022   int s_mp_exptmod (mp_int * G, mp_int * X, mp_int * P, mp_int * Y, int redmod
-023   int s_mp_exptmod (mp_int * G, mp_int * X, mp_int * P, mp_int * Y, int redmod
 e)
-024   \{
+023   \{
-025     mp_int  M[TAB_SIZE], res, mu;
+024     mp_int  M[TAB_SIZE], res, mu;
-026     mp_digit buf;
+025     mp_digit buf;
-027     int     err, bitbuf, bitcpy, bitcnt, mode, digidx, x, y, winsize;
+026     int     err, bitbuf, bitcpy, bitcnt, mode, digidx, x, y, winsize;
-028     int (*redux)(mp_int*,mp_int*,mp_int*);
+027     int (*redux)(mp_int*,mp_int*,mp_int*);
-029
+028
-030     /* find window size */
+029     /* find window size */
-031     x = mp_count_bits (X);
+030     x = mp_count_bits (X);
-032     if (x <= 7) \{
+031     if (x <= 7) \{
-033       winsize = 2;
+032       winsize = 2;
-034     \} else if (x <= 36) \{
+033     \} else if (x <= 36) \{
-035       winsize = 3;
+034       winsize = 3;
-036     \} else if (x <= 140) \{
+035     \} else if (x <= 140) \{
-037       winsize = 4;
+036       winsize = 4;
-038     \} else if (x <= 450) \{
+037     \} else if (x <= 450) \{
-039       winsize = 5;
+038       winsize = 5;
-040     \} else if (x <= 1303) \{
+039     \} else if (x <= 1303) \{
-041       winsize = 6;
+040       winsize = 6;
-042     \} else if (x <= 3529) \{
+041     \} else if (x <= 3529) \{
-043       winsize = 7;
+042       winsize = 7;
-044     \} else \{
+043     \} else \{
-045       winsize = 8;
+044       winsize = 8;
-046     \}
+045     \}
-047
+046
-048   #ifdef MP_LOW_MEM
+047   #ifdef MP_LOW_MEM
-049       if (winsize > 5) \{
+048       if (winsize > 5) \{
-050          winsize = 5;
+049          winsize = 5;
-051       \}
+050       \}
-052   #endif
+051   #endif
-053
+052
-054     /* init M array */
+053     /* init M array */
-055     /* init first cell */
+054     /* init first cell */
-056     if ((err = mp_init(&M[1])) != MP_OKAY) \{
+055     if ((err = mp_init(&M[1])) != MP_OKAY) \{
-057        return err;
+056        return err;
-058     \}
+057     \}
-059
+058
-060     /* now init the second half of the array */
+059     /* now init the second half of the array */
-061     for (x = 1<<(winsize-1); x < (1 << winsize); x++) \{
+060     for (x = 1<<(winsize-1); x < (1 << winsize); x++) \{
-062       if ((err = mp_init(&M[x])) != MP_OKAY) \{
+061       if ((err = mp_init(&M[x])) != MP_OKAY) \{
-063         for (y = 1<<(winsize-1); y < x; y++) \{
+062         for (y = 1<<(winsize-1); y < x; y++) \{
-064           mp_clear (&M[y]);
+063           mp_clear (&M[y]);
-065         \}
+064         \}
-066         mp_clear(&M[1]);
+065         mp_clear(&M[1]);
-067         return err;
+066         return err;
-068       \}
+067       \}
-069     \}
+068     \}
-070
+069
-071     /* create mu, used for Barrett reduction */
+070     /* create mu, used for Barrett reduction */
-072     if ((err = mp_init (&mu)) != MP_OKAY) \{
+071     if ((err = mp_init (&mu)) != MP_OKAY) \{
-073       goto LBL_M;
+072       goto LBL_M;
-074     \}
+073     \}
-075
+074
-076     if (redmode == 0) \{
+075     if (redmode == 0) \{
-077        if ((err = mp_reduce_setup (&mu, P)) != MP_OKAY) \{
+076        if ((err = mp_reduce_setup (&mu, P)) != MP_OKAY) \{
-078           goto LBL_MU;
+077           goto LBL_MU;
-079        \}
+078        \}
-080        redux = mp_reduce;
+079        redux = mp_reduce;
-081     \} else \{
+080     \} else \{
-082        if ((err = mp_reduce_2k_setup_l (P, &mu)) != MP_OKAY) \{
+081        if ((err = mp_reduce_2k_setup_l (P, &mu)) != MP_OKAY) \{
-083           goto LBL_MU;
+082           goto LBL_MU;
-084        \}
+083        \}
-085        redux = mp_reduce_2k_l;
+084        redux = mp_reduce_2k_l;
-086     \}
+085     \}
-087
+086
-088     /* create M table
+087     /* create M table
-089      *
+088      *
-090      * The M table contains powers of the base,
+089      * The M table contains powers of the base,
-091      * e.g. M[x] = G**x mod P
+090      * e.g. M[x] = G**x mod P
-092      *
+091      *
-093      * The first half of the table is not
+092      * The first half of the table is not
-094      * computed though accept for M[0] and M[1]
+093      * computed though accept for M[0] and M[1]
-095      */
+094      */
-096     if ((err = mp_mod (G, P, &M[1])) != MP_OKAY) \{
+095     if ((err = mp_mod (G, P, &M[1])) != MP_OKAY) \{
-097       goto LBL_MU;
+096       goto LBL_MU;
-098     \}
+097     \}
-099
+098
-100     /* compute the value at M[1<<(winsize-1)] by squaring
+099     /* compute the value at M[1<<(winsize-1)] by squaring
-101      * M[1] (winsize-1) times
+100      * M[1] (winsize-1) times
-102      */
+101      */
-103     if ((err = mp_copy (&M[1], &M[1 << (winsize - 1)])) != MP_OKAY) \{
+102     if ((err = mp_copy (&M[1], &M[1 << (winsize - 1)])) != MP_OKAY) \{
-104       goto LBL_MU;
+103       goto LBL_MU;
-105     \}
+104     \}
-106
+105
-107     for (x = 0; x < (winsize - 1); x++) \{
+106     for (x = 0; x < (winsize - 1); x++) \{
-108       /* square it */
+107       /* square it */
-109       if ((err = mp_sqr (&M[1 << (winsize - 1)],
+108       if ((err = mp_sqr (&M[1 << (winsize - 1)],
-110                          &M[1 << (winsize - 1)])) != MP_OKAY) \{
+109                          &M[1 << (winsize - 1)])) != MP_OKAY) \{
-111         goto LBL_MU;
+110         goto LBL_MU;
-112       \}
+111       \}
-113
+112
-114       /* reduce modulo P */
+113       /* reduce modulo P */
-115       if ((err = redux (&M[1 << (winsize - 1)], P, &mu)) != MP_OKAY) \{
+114       if ((err = redux (&M[1 << (winsize - 1)], P, &mu)) != MP_OKAY) \{
-116         goto LBL_MU;
+115         goto LBL_MU;
-117       \}
+116       \}
-118     \}
+117     \}
-119
+118
-120     /* create upper table, that is M[x] = M[x-1] * M[1] (mod P)
+119     /* create upper table, that is M[x] = M[x-1] * M[1] (mod P)
-121      * for x = (2**(winsize - 1) + 1) to (2**winsize - 1)
+120      * for x = (2**(winsize - 1) + 1) to (2**winsize - 1)
-122      */
+121      */
-123     for (x = (1 << (winsize - 1)) + 1; x < (1 << winsize); x++) \{
+122     for (x = (1 << (winsize - 1)) + 1; x < (1 << winsize); x++) \{
-124       if ((err = mp_mul (&M[x - 1], &M[1], &M[x])) != MP_OKAY) \{
+123       if ((err = mp_mul (&M[x - 1], &M[1], &M[x])) != MP_OKAY) \{
-125         goto LBL_MU;
+124         goto LBL_MU;
-126       \}
+125       \}
-127       if ((err = redux (&M[x], P, &mu)) != MP_OKAY) \{
+126       if ((err = redux (&M[x], P, &mu)) != MP_OKAY) \{
-128         goto LBL_MU;
+127         goto LBL_MU;
-129       \}
+128       \}
-130     \}
+129     \}
-131
+130
-132     /* setup result */
+131     /* setup result */
-133     if ((err = mp_init (&res)) != MP_OKAY) \{
+132     if ((err = mp_init (&res)) != MP_OKAY) \{
-134       goto LBL_MU;
+133       goto LBL_MU;
-135     \}
+134     \}
-136     mp_set (&res, 1);
+135     mp_set (&res, 1);
-137
+136
-138     /* set initial mode and bit cnt */
+137     /* set initial mode and bit cnt */
-139     mode   = 0;
+138     mode   = 0;
-140     bitcnt = 1;
+139     bitcnt = 1;
-141     buf    = 0;
+140     buf    = 0;
-142     digidx = X->used - 1;
+141     digidx = X->used - 1;
-143     bitcpy = 0;
+142     bitcpy = 0;
-144     bitbuf = 0;
+143     bitbuf = 0;
-145
+144
-146     for (;;) \{
+145     for (;;) \{
-147       /* grab next digit as required */
+146       /* grab next digit as required */
-148       if (--bitcnt == 0) \{
+147       if (--bitcnt == 0) \{
-149         /* if digidx == -1 we are out of digits */
+148         /* if digidx == -1 we are out of digits */
-150         if (digidx == -1) \{
+149         if (digidx == -1) \{
-151           break;
+150           break;
-152         \}
+151         \}
-153         /* read next digit and reset the bitcnt */
+152         /* read next digit and reset the bitcnt */
-154         buf    = X->dp[digidx--];
+153         buf    = X->dp[digidx--];
-155         bitcnt = (int) DIGIT_BIT;
+154         bitcnt = (int) DIGIT_BIT;
-156       \}
+155       \}
-157
+156
-158       /* grab the next msb from the exponent */
+157       /* grab the next msb from the exponent */
-159       y     = (buf >> (mp_digit)(DIGIT_BIT - 1)) & 1;
+158       y     = (buf >> (mp_digit)(DIGIT_BIT - 1)) & 1;
-160       buf <<= (mp_digit)1;
+159       buf <<= (mp_digit)1;
-161
+160
-162       /* if the bit is zero and mode == 0 then we ignore it
+161       /* if the bit is zero and mode == 0 then we ignore it
-163        * These represent the leading zero bits before the first 1 bit
+162        * These represent the leading zero bits before the first 1 bit
-164        * in the exponent.  Technically this opt is not required but it
+163        * in the exponent.  Technically this opt is not required but it
-165        * does lower the # of trivial squaring/reductions used
+164        * does lower the # of trivial squaring/reductions used
-166        */
+165        */
-167       if (mode == 0 && y == 0) \{
+166       if (mode == 0 && y == 0) \{
-168         continue;
+167         continue;
-169       \}
+168       \}
-170
+169
-171       /* if the bit is zero and mode == 1 then we square */
+170       /* if the bit is zero and mode == 1 then we square */
-172       if (mode == 1 && y == 0) \{
+171       if (mode == 1 && y == 0) \{
-173         if ((err = mp_sqr (&res, &res)) != MP_OKAY) \{
+172         if ((err = mp_sqr (&res, &res)) != MP_OKAY) \{
-174           goto LBL_RES;
+173           goto LBL_RES;
-175         \}
+174         \}
-176         if ((err = redux (&res, P, &mu)) != MP_OKAY) \{
+175         if ((err = redux (&res, P, &mu)) != MP_OKAY) \{
-177           goto LBL_RES;
+176           goto LBL_RES;
-178         \}
+177         \}
-179         continue;
+178         continue;
-180       \}
+179       \}
-181
+180
-182       /* else we add it to the window */
+181       /* else we add it to the window */
-183       bitbuf |= (y << (winsize - ++bitcpy));
+182       bitbuf |= (y << (winsize - ++bitcpy));
-184       mode    = 2;
+183       mode    = 2;
-185
+184
-186       if (bitcpy == winsize) \{
+185       if (bitcpy == winsize) \{
-187         /* ok window is filled so square as required and multiply  */
+186         /* ok window is filled so square as required and multiply  */
-188         /* square first */
+187         /* square first */
-189         for (x = 0; x < winsize; x++) \{
+188         for (x = 0; x < winsize; x++) \{
-190           if ((err = mp_sqr (&res, &res)) != MP_OKAY) \{
+189           if ((err = mp_sqr (&res, &res)) != MP_OKAY) \{
-191             goto LBL_RES;
+190             goto LBL_RES;
-192           \}
+191           \}
-193           if ((err = redux (&res, P, &mu)) != MP_OKAY) \{
+192           if ((err = redux (&res, P, &mu)) != MP_OKAY) \{
-194             goto LBL_RES;
+193             goto LBL_RES;
-195           \}
+194           \}
-196         \}
+195         \}
-197
+196
-198         /* then multiply */
+197         /* then multiply */
-199         if ((err = mp_mul (&res, &M[bitbuf], &res)) != MP_OKAY) \{
+198         if ((err = mp_mul (&res, &M[bitbuf], &res)) != MP_OKAY) \{
-200           goto LBL_RES;
+199           goto LBL_RES;
-201         \}
+200         \}
-202         if ((err = redux (&res, P, &mu)) != MP_OKAY) \{
+201         if ((err = redux (&res, P, &mu)) != MP_OKAY) \{
-203           goto LBL_RES;
+202           goto LBL_RES;
-204         \}
+203         \}
-205
+204
-206         /* empty window and reset */
+205         /* empty window and reset */
-207         bitcpy = 0;
+206         bitcpy = 0;
-208         bitbuf = 0;
+207         bitbuf = 0;
-209         mode   = 1;
+208         mode   = 1;
-210       \}
+209       \}
-211     \}
+210     \}
-212
+211
-213     /* if bits remain then square/multiply */
+212     /* if bits remain then square/multiply */
-214     if (mode == 2 && bitcpy > 0) \{
+213     if (mode == 2 && bitcpy > 0) \{
-215       /* square then multiply if the bit is set */
+214       /* square then multiply if the bit is set */
-216       for (x = 0; x < bitcpy; x++) \{
+215       for (x = 0; x < bitcpy; x++) \{
-217         if ((err = mp_sqr (&res, &res)) != MP_OKAY) \{
+216         if ((err = mp_sqr (&res, &res)) != MP_OKAY) \{
-218           goto LBL_RES;
+217           goto LBL_RES;
-219         \}
+218         \}
-220         if ((err = redux (&res, P, &mu)) != MP_OKAY) \{
+219         if ((err = redux (&res, P, &mu)) != MP_OKAY) \{
-221           goto LBL_RES;
+220           goto LBL_RES;
-222         \}
+221         \}
-223
+222
-224         bitbuf <<= 1;
+223         bitbuf <<= 1;
-225         if ((bitbuf & (1 << winsize)) != 0) \{
+224         if ((bitbuf & (1 << winsize)) != 0) \{
-226           /* then multiply */
+225           /* then multiply */
-227           if ((err = mp_mul (&res, &M[1], &res)) != MP_OKAY) \{
+226           if ((err = mp_mul (&res, &M[1], &res)) != MP_OKAY) \{
-228             goto LBL_RES;
+227             goto LBL_RES;
-229           \}
+228           \}
-230           if ((err = redux (&res, P, &mu)) != MP_OKAY) \{
+229           if ((err = redux (&res, P, &mu)) != MP_OKAY) \{
-231             goto LBL_RES;
+230             goto LBL_RES;
-232           \}
+231           \}
-233         \}
+232         \}
-234       \}
+233       \}
-235     \}
+234     \}
-236
+235
-237     mp_exch (&res, Y);
+236     mp_exch (&res, Y);
-238     err = MP_OKAY;
+237     err = MP_OKAY;
-239   LBL_RES:mp_clear (&res);
+238   LBL_RES:mp_clear (&res);
-240   LBL_MU:mp_clear (&mu);
+239   LBL_MU:mp_clear (&mu);
-241   LBL_M:
+240   LBL_M:
-242     mp_clear(&M[1]);
+241     mp_clear(&M[1]);
-243     for (x = 1<<(winsize-1); x < (1 << winsize); x++) \{
+242     for (x = 1<<(winsize-1); x < (1 << winsize); x++) \{
-244       mp_clear (&M[x]);
+243       mp_clear (&M[x]);
-245     \}
+244     \}
-246     return err;
+245     return err;
-247   \}
+246   \}
-248   #endif
+247   #endif
+248
 \end{alltt}
 \end{small}
-Lines 21 through 40 determine the optimal window size based on the length of the exponent in bits.  The window divisions are sorted
+Lines 31 through 45 determine the optimal window size based on the length of the exponent in bits.  The window divisions are sorted
 from smallest to greatest so that in each \textbf{if} statement only one condition must be tested.  For example, by the \textbf{if} statement
-on line 32 the value of $x$ is already known to be greater than $140$.
+on line 37 the value of $x$ is already known to be greater than $140$.
-The conditional piece of code beginning on line 48 allows the window size to be restricted to five bits.  This logic is used to ensure
+The conditional piece of code beginning on line 47 allows the window size to be restricted to five bits.  This logic is used to ensure
 the table of precomputed powers of $G$ remains relatively small.
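The window selection can be isolated into a small function to make the thresholds easier to experiment with.  This is a sketch that mirrors the chain of comparisons in the listing; the function name choose\_winsize\_sketch and the low\_mem flag (standing in for the MP\_LOW\_MEM build option) are assumptions made for this example.

```c
/* Hypothetical helper (not part of LibTomMath): picks a window size from
 * the bit length of the exponent.  Because the thresholds are tested in
 * increasing order, each branch needs only one comparison. */
static int choose_winsize_sketch(int bits, int low_mem)
{
   int winsize;

   if (bits <= 7) {
      winsize = 2;
   } else if (bits <= 36) {
      winsize = 3;
   } else if (bits <= 140) {
      winsize = 4;
   } else if (bits <= 450) {
      winsize = 5;
   } else if (bits <= 1303) {
      winsize = 6;
   } else if (bits <= 3529) {
      winsize = 7;
   } else {
      winsize = 8;
   }

   /* optional cap that keeps the table of powers small */
   if (low_mem && winsize > 5) {
      winsize = 5;
   }
   return winsize;
}
```

With the cap enabled the table never needs more than $2^5 = 32$ entries, which is the smaller of the two TAB\_SIZE values in the listing.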
-The for loop on line 61 initializes the $M$ array while lines 62 and 77 compute the value of $\mu$ required for
+The for loop on line 60 initializes the $M$ array while lines 71 and 76 through 85 initialize the reduction
-Barrett reduction.
+function that will be used for this modulus.
 -- More later.
 \section{Quick Power of Two}
 Calculating $b = 2^a$ can be performed much quicker than with any of the previous algorithms.  Recall that a logical shift left $m << k$ is
 039     a->dp[b / DIGIT_BIT] = ((mp_digit)1) << (b % DIGIT_BIT);
 040
 041     return MP_OKAY;
 042   \}
 043   #endif
+044
 \end{alltt}
 \end{small}
 \chapter{Higher Level Algorithms}
 283   \}
 284
 285   #endif
 286
 287   #endif
+288
 \end{alltt}
 \end{small}
 The implementation of this algorithm differs slightly from the pseudo code presented previously.  In this algorithm either of the quotient $c$ or
 remainder $d$ may be passed as a \textbf{NULL} pointer which indicates their value is not desired.  For example, the C code to call the division
 \begin{verbatim}
 mp_div(&a, &b, &c, NULL);  /* c = [a/b] */
 \end{verbatim}
-Lines 37 and 44 handle the two trivial cases of inputs which are division by zero and dividend smaller than the divisor
+Lines 108 and 113 handle the two trivial cases of inputs which are division by zero and dividend smaller than the divisor
-respectively.  After the two trivial cases all of the temporary variables are initialized.  Line 105 determines the sign of
+respectively.  After the two trivial cases all of the temporary variables are initialized.  Line 147 determines the sign of
-the quotient and line 76 ensures that both $x$ and $y$ are positive.
+the quotient and line 148 ensures that both $x$ and $y$ are positive.
-The number of bits in the leading digit is calculated on line 105.  Implicitly an mp\_int with $r$ digits will require $lg(\beta)(r-1) + k$ bits
+The number of bits in the leading digit is calculated on line 151.  Implicitly an mp\_int with $r$ digits will require $lg(\beta)(r-1) + k$ bits
 of precision which when reduced modulo $lg(\beta)$ produces the value of $k$.  In this case $k$ is the number of bits in the leading digit which is
 exactly what is required.  For the algorithm to operate $k$ must equal $lg(\beta) - 1$ and when it does not the inputs must be normalized by shifting
 them to the left by $lg(\beta) - 1 - k$ bits.
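The normalization count described above can be sketched as a stand-alone function.  The example assumes 28-bit digits and takes the total bit length of the divisor as its input; norm\_shift\_sketch and NORM\_DIGIT\_BIT are hypothetical names and the code is an illustration of the arithmetic rather than an excerpt from the library.

```c
#define NORM_DIGIT_BIT 28

/* Hypothetical helper (not part of LibTomMath): given the total number of
 * bits in the divisor, returns how many bits both inputs must be shifted
 * left so that the leading digit holds exactly NORM_DIGIT_BIT - 1 bits. */
static int norm_shift_sketch(int total_bits)
{
   /* bits in the leading digit; a completely full leading digit
    * yields 0 here and therefore a shift of NORM_DIGIT_BIT - 1 */
   int k = total_bits % NORM_DIGIT_BIT;

   if (k < NORM_DIGIT_BIT - 1) {
      return (NORM_DIGIT_BIT - 1) - k;
   }
   return 0;   /* already normalized */
}
```

For example, a divisor with two full digits and five bits in its leading digit has $61$ bits in total and must be shifted left by $22$ bits.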
 Throughout the variables $n$ and $t$ will represent the highest digit of $x$ and $y$ respectively.  These are first used to produce the
-leading digit of the quotient.  The loop beginning on line 183 will produce the remainder of the quotient digits.
+leading digit of the quotient.  The loop beginning on line 184 will produce the remainder of the quotient digits.
-The conditional ``continue'' on line 114 is used to prevent the algorithm from reading past the leading edge of $x$ which can occur when the
+The conditional ``continue'' on line 186 is used to prevent the algorithm from reading past the leading edge of $x$ which can occur when the
 algorithm eliminates multiple non-zero digits in a single iteration.  This ensures that $x_i$ is always non-zero since by definition the digits
 above the $i$'th position of $x$ must be zero in order for the quotient to be precise\footnote{Precise as far as integer division is concerned.}.
-Lines 130, 130 and 134 through 134 manually construct the high accuracy estimations by setting the digits of the two mp\_int
+Lines 214, 216 and 222 through 225 manually construct the high accuracy estimations by setting the digits of the two mp\_int
 variables directly.
 \section{Single Digit Helpers}
 This section briefly describes a series of single digit helper algorithms which come in handy when working with small constants.  All of
 037        res = mp_sub_d(a, b, c);
 038
 039        /* fix sign  */
 040        a->sign = c->sign = MP_NEG;
 041
-042        return res;
+042        /* clamp */
-043     \}
+043        mp_clamp(c);
 044
-045     /* old number of used digits in c */
+045        return res;
-046     oldused = c->used;
+046     \}
 047
-048     /* sign always positive */
+048     /* old number of used digits in c */
-049     c->sign = MP_ZPOS;
+049     oldused = c->used;
 050
-051     /* source alias */
+051     /* sign always positive */
-052     tmpa    = a->dp;
+052     c->sign = MP_ZPOS;
 053
-054     /* destination alias */
+054     /* source alias */
-055     tmpc    = c->dp;
+055     tmpa    = a->dp;
 056
-057     /* if a is positive */
+057     /* destination alias */
-058     if (a->sign == MP_ZPOS) \{
+058     tmpc    = c->dp;
-059        /* add digit, after this we're propagating
+059
-060         * the carry.
+060     /* if a is positive */
-061         */
+061     if (a->sign == MP_ZPOS) \{
-062        *tmpc   = *tmpa++ + b;
+062        /* add digit, after this we're propagating
-063        mu      = *tmpc >> DIGIT_BIT;
+063         * the carry.
-064        *tmpc++ &= MP_MASK;
+064         */
-065
+065        *tmpc   = *tmpa++ + b;
-066        /* now handle rest of the digits */
+066        mu      = *tmpc >> DIGIT_BIT;
-067        for (ix = 1; ix < a->used; ix++) \{
+067        *tmpc++ &= MP_MASK;
-068           *tmpc   = *tmpa++ + mu;
+068
-069           mu      = *tmpc >> DIGIT_BIT;
+069        /* now handle rest of the digits */
-070           *tmpc++ &= MP_MASK;
+070        for (ix = 1; ix < a->used; ix++) \{
-071        \}
+071           *tmpc   = *tmpa++ + mu;
-072        /* set final carry */
+072           mu      = *tmpc >> DIGIT_BIT;
-073        ix++;
+073           *tmpc++ &= MP_MASK;
-074        *tmpc++  = mu;
+074        \}
-075
+075        /* set final carry */
-076        /* setup size */
+076        ix++;
-077        c->used = a->used + 1;
+077        *tmpc++  = mu;
-078     \} else \{
+078
-079        /* a was negative and |a| < b */
+079        /* setup size */
-080        c->used  = 1;
+080        c->used = a->used + 1;
-081
+081     \} else \{
-082        /* the result is a single digit */
+082        /* a was negative and |a| < b */
-083        if (a->used == 1) \{
+083        c->used  = 1;
-084           *tmpc++  =  b - a->dp[0];
+084
-085        \} else \{
+085        /* the result is a single digit */
-086           *tmpc++  =  b;
+086        if (a->used == 1) \{
-087        \}
+087           *tmpc++  =  b - a->dp[0];
-088
+088        \} else \{
-089        /* setup count so the clearing of oldused
+089           *tmpc++  =  b;
-090         * can fall through correctly
+090        \}
-091         */
+091
-092        ix       = 1;
+092        /* setup count so the clearing of oldused
-093     \}
+093         * can fall through correctly
-094
+094         */
-095     /* now zero to oldused */
+095        ix       = 1;
-096     while (ix++ < oldused) \{
+096     \}
-097        *tmpc++ = 0;
+097
-098     \}
+098     /* now zero to oldused */
-099     mp_clamp(c);
+099     while (ix++ < oldused) \{
-100
+100        *tmpc++ = 0;
-101     return MP_OKAY;
+101     \}
-102   \}
+102     mp_clamp(c);
 103
-104   #endif
+104     return MP_OKAY;
+105   \}
+106
+107   #endif
+108
 \end{alltt}
 \end{small}
 Clever use of the letter 't'.
 070     mp_clamp(c);
 071
 072     return MP_OKAY;
 073   \}
 074   #endif
+075
 \end{alltt}
 \end{small}
 In this implementation the destination $c$ may point to the same mp\_int as the source $a$ since the result is written after the digit is
 read from the source.  This function uses pointer aliases $tmpa$ and $tmpc$ for the digits of $a$ and $c$ respectively.
 101
 102     return res;
 103   \}
 104
 105   #endif
+106
 \end{alltt}
 \end{small}
 Like the implementation of algorithm mp\_div this algorithm allows either of the quotient or remainder to be passed as a \textbf{NULL} pointer to
 indicate the respective value is not required.  This allows a trivial single digit modular reduction algorithm, mp\_mod\_d to be created.
 123   LBL_T2:mp_clear (&t2);
 124   LBL_T1:mp_clear (&t1);
 125     return res;
 126   \}
 127   #endif
+128
 \end{alltt}
 \end{small}
 \section{Random Number Generation}
 046     \}
 047
 048     return MP_OKAY;
 049   \}
 050   #endif
+051
 \end{alltt}
 \end{small}
 \section{Formatted Representations}
 The ability to emit a radix-$n$ textual representation of an integer is useful for interacting with human parties.  For example, the ability to
 018   int mp_read_radix (mp_int * a, const char *str, int radix)
 019   \{
 020     int     y, res, neg;
 021     char    ch;
 022
-023     /* make sure the radix is ok */
+023     /* zero the digit bignum */
-024     if (radix < 2 || radix > 64) \{
+024     mp_zero(a);
-025       return MP_VAL;
+025
-026     \}
+026     /* make sure the radix is ok */
-027
+027     if (radix < 2 || radix > 64) \{
-028     /* if the leading digit is a
+028       return MP_VAL;
-029      * minus set the sign to negative.
+029     \}
-030      */
+030
-031     if (*str == '-') \{
+031     /* if the leading digit is a
-032       ++str;
+032      * minus set the sign to negative.
-033       neg = MP_NEG;
+033      */
-034     \} else \{
+034     if (*str == '-') \{
-035       neg = MP_ZPOS;
+035       ++str;
-036     \}
+036       neg = MP_NEG;
-037
+037     \} else \{
-038     /* set the integer to the default of zero */
+038       neg = MP_ZPOS;
-039     mp_zero (a);
+039     \}
 040
-041     /* process each digit of the string */
+041     /* set the integer to the default of zero */
-042     while (*str) \{
+042     mp_zero (a);
-043       /* if the radix < 36 the conversion is case insensitive
+043
-044        * this allows numbers like 1AB and 1ab to represent the same  value
+044     /* process each digit of the string */
-045        * [e.g. in hex]
+045     while (*str) \{
-046        */
+046       /* if the radix < 36 the conversion is case insensitive
-047       ch = (char) ((radix < 36) ? toupper (*str) : *str);
+047        * this allows numbers like 1AB and 1ab to represent the same  value
-048       for (y = 0; y < 64; y++) \{
+048        * [e.g. in hex]
-049         if (ch == mp_s_rmap[y]) \{
+049        */
-050            break;
+050       ch = (char) ((radix < 36) ? toupper (*str) : *str);
-051         \}
+051       for (y = 0; y < 64; y++) \{
-052       \}
+052         if (ch == mp_s_rmap[y]) \{
-053
+053            break;
-054       /* if the char was found in the map
+054         \}
-055        * and is less than the given radix add it
+055       \}
-056        * to the number, otherwise exit the loop.
+056
-057        */
+057       /* if the char was found in the map
-058       if (y < radix) \{
+058        * and is less than the given radix add it
-059         if ((res = mp_mul_d (a, (mp_digit) radix, a)) != MP_OKAY) \{
+059        * to the number, otherwise exit the loop.
-060            return res;
+060        */
-061         \}
+061       if (y < radix) \{
-062         if ((res = mp_add_d (a, (mp_digit) y, a)) != MP_OKAY) \{
+062         if ((res = mp_mul_d (a, (mp_digit) radix, a)) != MP_OKAY) \{
 063            return res;
 064         \}
-065       \} else \{
+065         if ((res = mp_add_d (a, (mp_digit) y, a)) != MP_OKAY) \{
-066         break;
+066            return res;
-067       \}
+067         \}
-068       ++str;
+068       \} else \{
-069     \}
+069         break;
-070
+070       \}
-071     /* set the sign only if a != 0 */
+071       ++str;
-072     if (mp_iszero(a) != 1) \{
+072     \}
-073        a->sign = neg;
+073
-074     \}
+074     /* set the sign only if a != 0 */
-075     return MP_OKAY;
+075     if (mp_iszero(a) != 1) \{
-076   \}
+076        a->sign = neg;
-077   #endif
+077     \}
+078     return MP_OKAY;
+079   \}
+080   #endif
+081
 \end{alltt}
 \end{small}
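The digit loop above can be sketched on a plain machine integer. This is a minimal illustration, not the library routine: the names sketch\_rmap and sketch\_read\_radix are hypothetical, the ordering of the 64-character digit map is an assumption, and overflow of the accumulator is deliberately not checked.

```c
#include <ctype.h>
#include <stdint.h>

/* Assumed digit map: digits, upper case, lower case, then '+' and '/'. */
static const char sketch_rmap[] =
    "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz+/";

/* Hypothetical helper: parse str in the given radix into *out.
 * Returns 0 on success and -1 for a radix outside 2..64.
 * Overflow of the 64-bit accumulator is not detected. */
static int sketch_read_radix(int64_t *out, const char *str, int radix)
{
    int64_t acc = 0;
    int     neg = 0, y;
    char    ch;

    /* start from zero so the output is defined even on error */
    *out = 0;

    if (radix < 2 || radix > 64) {
        return -1;
    }

    /* a leading minus sign marks a negative value */
    if (*str == '-') {
        ++str;
        neg = 1;
    }

    while (*str) {
        /* below radix 36 the conversion is case insensitive */
        ch = (char) ((radix < 36) ? toupper((unsigned char) *str) : *str);
        for (y = 0; y < 64; y++) {
            if (ch == sketch_rmap[y]) {
                break;
            }
        }
        /* stop at the first character that is not a digit of this radix */
        if (y >= radix) {
            break;
        }
        acc = acc * radix + y;
        ++str;
    }

    /* zero never becomes negative because -0 == 0 here */
    *out = neg ? -acc : acc;
    return 0;
}
```

As in the listing, parsing stops silently at the first character that is not a valid digit, so a caller that needs strict validation would have to check where the scan ended.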
 \subsection{Generating Radix-$n$ Output}
 Generating radix-$n$ output is fairly trivial with a division and remainder algorithm.
 066     mp_clear (&t);
 067     return MP_OKAY;
 068   \}
 069
 070   #endif
+071
 \end{alltt}
 \end{small}
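The division and remainder approach can be sketched for an unsigned 64-bit value as follows. The function name sketch\_to\_radix, its buffer interface and the digit map ordering are assumptions made for illustration; the library routine works on an mp\_int instead.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: write n in the given radix into buf (capacity cap,
 * including the terminating NUL).  Returns 0 on success and -1 if the
 * radix is outside 2..64 or the buffer is too small. */
static int sketch_to_radix(uint64_t n, char *buf, size_t cap, int radix)
{
    /* assumed digit map, same ordering as in the input sketch */
    static const char map[] =
        "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz+/";
    size_t len = 0, i;

    if (radix < 2 || radix > 64 || cap < 2) {
        return -1;
    }

    /* each remainder is the next least significant digit */
    do {
        if (len + 1 >= cap) {
            return -1;          /* no room for this digit plus the NUL */
        }
        buf[len++] = map[n % (uint64_t) radix];
        n /= (uint64_t) radix;
    } while (n != 0);

    /* digits were produced least significant first, so reverse them */
    for (i = 0; i < len / 2; i++) {
        char t = buf[i];
        buf[i] = buf[len - 1 - i];
        buf[len - 1 - i] = t;
    }
    buf[len] = '\0';
    return 0;
}
```

The do-while form guarantees that zero is emitted as the single digit 0 rather than as an empty string.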
 \chapter{Number Theoretic Algorithms}
 This chapter discusses several fundamental number theoretic algorithms such as the greatest common divisor, least common multiple and Jacobi
 \begin{tabular}{l}
 \hline Algorithm \textbf{mp\_gcd}. \\
 \textbf{Input}.   mp\_int $a$ and $b$ \\
 \textbf{Output}.  The greatest common divisor $c = (a, b)$.  \\
 \hline \\
-1.  If $a = 0$ and $b \ne 0$ then \\
+1.  If $a = 0$ then \\
-\hspace{3mm}1.1  $c \leftarrow b$ \\
+\hspace{3mm}1.1  $c \leftarrow \vert b \vert $ \\
 \hspace{3mm}1.2  Return(\textit{MP\_OKAY}). \\
-2.  If $a \ne 0$ and $b = 0$ then \\
+2.  If $b = 0$ then \\
-\hspace{3mm}2.1  $c \leftarrow a$ \\
+\hspace{3mm}2.1  $c \leftarrow \vert a \vert $ \\
 \hspace{3mm}2.2  Return(\textit{MP\_OKAY}). \\
-3.  If $a = b = 0$ then \\
+3.  $u \leftarrow \vert a \vert, v \leftarrow \vert b \vert$ \\
-\hspace{3mm}3.1  $c \leftarrow 1$ \\
+4.  $k \leftarrow 0$ \\
-\hspace{3mm}3.2  Return(\textit{MP\_OKAY}). \\
+5.  While $u.used > 0$ and $v.used > 0$ and $u_0 \equiv v_0 \equiv 0 \mbox{ (mod }2\mbox{)}$ \\
-4.  $u \leftarrow \vert a \vert, v \leftarrow \vert b \vert$ \\
+\hspace{3mm}5.1  $k \leftarrow k + 1$ \\
-5.  $k \leftarrow 0$ \\
+\hspace{3mm}5.2  $u \leftarrow \lfloor u / 2 \rfloor$ \\
-6.  While $u.used > 0$ and $v.used > 0$ and $u_0 \equiv v_0 \equiv 0 \mbox{ (mod }2\mbox{)}$ \\
+\hspace{3mm}5.3  $v \leftarrow \lfloor v / 2 \rfloor$ \\
-\hspace{3mm}6.1  $k \leftarrow k + 1$ \\
+6.  While $u.used > 0$ and $u_0 \equiv 0 \mbox{ (mod }2\mbox{)}$ \\
-\hspace{3mm}6.2  $u \leftarrow \lfloor u / 2 \rfloor$ \\
+\hspace{3mm}6.1  $u \leftarrow \lfloor u / 2 \rfloor$ \\
-\hspace{3mm}6.3  $v \leftarrow \lfloor v / 2 \rfloor$ \\
+7.  While $v.used > 0$ and $v_0 \equiv 0 \mbox{ (mod }2\mbox{)}$ \\
-7.  While $u.used > 0$ and $u_0 \equiv 0 \mbox{ (mod }2\mbox{)}$ \\
+\hspace{3mm}7.1  $v \leftarrow \lfloor v / 2 \rfloor$ \\
-\hspace{3mm}7.1  $u \leftarrow \lfloor u / 2 \rfloor$ \\
+8.  While $v.used > 0$ \\
-8.  While $v.used > 0$ and $v_0 \equiv 0 \mbox{ (mod }2\mbox{)}$ \\
+\hspace{3mm}8.1  If $\vert u \vert > \vert v \vert$ then \\
-\hspace{3mm}8.1  $v \leftarrow \lfloor v / 2 \rfloor$ \\
+\hspace{6mm}8.1.1  Swap $u$ and $v$. \\
-9.  While $v.used > 0$ \\
+\hspace{3mm}8.2  $v \leftarrow \vert v \vert - \vert u \vert$ \\
-\hspace{3mm}9.1  If $\vert u \vert > \vert v \vert$ then \\
+\hspace{3mm}8.3  While $v.used > 0$ and $v_0 \equiv 0 \mbox{ (mod }2\mbox{)}$ \\
-\hspace{6mm}9.1.1  Swap $u$ and $v$. \\
+\hspace{6mm}8.3.1  $v \leftarrow \lfloor v / 2 \rfloor$ \\
-\hspace{3mm}9.2  $v \leftarrow \vert v \vert - \vert u \vert$ \\
+9.  $c \leftarrow u \cdot 2^k$ \\
-\hspace{3mm}9.3  While $v.used > 0$ and $v_0 \equiv 0 \mbox{ (mod }2\mbox{)}$ \\
+10.  Return(\textit{MP\_OKAY}). \\
-\hspace{6mm}9.3.1  $v \leftarrow \lfloor v / 2 \rfloor$ \\
-10.  $c \leftarrow u \cdot 2^k$ \\
-11.  Return(\textit{MP\_OKAY}). \\
 \hline
 \end{tabular}
 \end{center}
 \end{small}
 \caption{Algorithm mp\_gcd}
 \textbf{Algorithm mp\_gcd.}
 This algorithm will produce the greatest common divisor of two mp\_ints $a$ and $b$.  The algorithm was originally based on Algorithm B of
 Knuth \cite[pp. 338]{TAOCPV2} but has been modified to be simpler to explain.  In theory it achieves the same asymptotic working time as
 Algorithm B and in practice this appears to be true.
-The first three steps handle the cases where either one of or both inputs are zero.  If either input is zero the greatest common divisor is the
+The first two steps handle the cases where either one of or both inputs are zero.  If either input is zero the greatest common divisor is the
 largest input or zero if they are both zero.  If the inputs are not trivial then $u$ and $v$ are assigned the absolute values of
 $a$ and $b$ respectively and the algorithm will proceed to reduce the pair.
-Step six will divide out any common factors of two and keep track of the count in the variable $k$.  After this step two is no longer a
+Step five will divide out any common factors of two and keep track of the count in the variable $k$.  After this step, two is no longer a
 factor of the remaining greatest common divisor between $u$ and $v$ and can be safely evenly divided out of either whenever they are even.  Step
-seven and eight ensure that the $u$ and $v$ respectively have no more factors of two.  At most only one of the while loops will iterate since
+six and seven ensure that the $u$ and $v$ respectively have no more factors of two.  At most only one of the while--loops will iterate since
 they cannot both be even.
-By step nine both of $u$ and $v$ are odd which is required for the inner logic.  First the pair are swapped such that $v$ is equal to
+By step eight both of $u$ and $v$ are odd which is required for the inner logic.  First the pair are swapped such that $v$ is equal to
-or greater than $u$.  This ensures that the subtraction on step 9.2 will always produce a positive and even result.  Step 9.3 removes any
+or greater than $u$.  This ensures that the subtraction on step 8.2 will always produce a positive and even result.  Step 8.3 removes any
 factors of two from the difference $v$ to ensure that in the next iteration of the loop both are once again odd.
 After $v = 0$ occurs the variable $u$ has the greatest common divisor of the pair $\left < u, v \right >$ just after step six.  The result
 must be adjusted by multiplying by the common factors of two ($2^k$) removed earlier.
 019   \{
 020     mp_int  u, v;
 021     int     k, u_lsb, v_lsb, res;
 022
 023     /* either zero than gcd is the largest */
-024     if (mp_iszero (a) == 1 && mp_iszero (b) == 0) \{
+024     if (mp_iszero (a) == MP_YES) \{
 025       return mp_abs (b, c);
 026     \}
-027     if (mp_iszero (a) == 0 && mp_iszero (b) == 1) \{
+027     if (mp_iszero (b) == MP_YES) \{
 028       return mp_abs (a, c);
 029     \}
 030
-031     /* optimized.  At this point if a == 0 then
+031     /* get copies of a and b we can modify */
-032      * b must equal zero too
+032     if ((res = mp_init_copy (&u, a)) != MP_OKAY) \{
-033      */
+033       return res;
-034     if (mp_iszero (a) == 1) \{
+034     \}
-035       mp_zero(c);
+035
-036       return MP_OKAY;
+036     if ((res = mp_init_copy (&v, b)) != MP_OKAY) \{
-037     \}
+037       goto LBL_U;
-038
+038     \}
-039     /* get copies of a and b we can modify */
+039
-040     if ((res = mp_init_copy (&u, a)) != MP_OKAY) \{
+040     /* must be positive for the remainder of the algorithm */
-041       return res;
+041     u.sign = v.sign = MP_ZPOS;
-042     \}
+042
-043
+043     /* B1.  Find the common power of two for u and v */
-044     if ((res = mp_init_copy (&v, b)) != MP_OKAY) \{
+044     u_lsb = mp_cnt_lsb(&u);
-045       goto LBL_U;
+045     v_lsb = mp_cnt_lsb(&v);
-046     \}
+046     k     = MIN(u_lsb, v_lsb);
 047
-048     /* must be positive for the remainder of the algorithm */
+048     if (k > 0) \{
-049     u.sign = v.sign = MP_ZPOS;
+049        /* divide the power of two out */
-050
+050        if ((res = mp_div_2d(&u, k, &u, NULL)) != MP_OKAY) \{
-051     /* B1.  Find the common power of two for u and v */
+051           goto LBL_V;
-052     u_lsb = mp_cnt_lsb(&u);
+052        \}
-053     v_lsb = mp_cnt_lsb(&v);
+053
-054     k     = MIN(u_lsb, v_lsb);
+054        if ((res = mp_div_2d(&v, k, &v, NULL)) != MP_OKAY) \{
-055
+055           goto LBL_V;
-056     if (k > 0) \{
+056        \}
-057        /* divide the power of two out */
+057     \}
-058        if ((res = mp_div_2d(&u, k, &u, NULL)) != MP_OKAY) \{
+058
-059           goto LBL_V;
+059     /* divide any remaining factors of two out */
-060        \}
+060     if (u_lsb != k) \{
-061
+061        if ((res = mp_div_2d(&u, u_lsb - k, &u, NULL)) != MP_OKAY) \{
-062        if ((res = mp_div_2d(&v, k, &v, NULL)) != MP_OKAY) \{
+062           goto LBL_V;
-063           goto LBL_V;
+063        \}
-064        \}
+064     \}
-065     \}
+065
-066
+066     if (v_lsb != k) \{
-067     /* divide any remaining factors of two out */
+067        if ((res = mp_div_2d(&v, v_lsb - k, &v, NULL)) != MP_OKAY) \{
-068     if (u_lsb != k) \{
+068           goto LBL_V;
-069        if ((res = mp_div_2d(&u, u_lsb - k, &u, NULL)) != MP_OKAY) \{
+069        \}
-070           goto LBL_V;
+070     \}
-071        \}
+071
-072     \}
+072     while (mp_iszero(&v) == 0) \{
-073
+073        /* make sure v is the largest */
-074     if (v_lsb != k) \{
+074        if (mp_cmp_mag(&u, &v) == MP_GT) \{
-075        if ((res = mp_div_2d(&v, v_lsb - k, &v, NULL)) != MP_OKAY) \{
+075           /* swap u and v to make sure v is >= u */
-076           goto LBL_V;
+076           mp_exch(&u, &v);
 077        \}
-078     \}
+078
-079
+079        /* subtract smallest from largest */
-080     while (mp_iszero(&v) == 0) \{
+080        if ((res = s_mp_sub(&v, &u, &v)) != MP_OKAY) \{
-081        /* make sure v is the largest */
+081           goto LBL_V;
-082        if (mp_cmp_mag(&u, &v) == MP_GT) \{
+082        \}
-083           /* swap u and v to make sure v is >= u */
+083
-084           mp_exch(&u, &v);
+084        /* Divide out all factors of two */
-085        \}
+085        if ((res = mp_div_2d(&v, mp_cnt_lsb(&v), &v, NULL)) != MP_OKAY) \{
-086
+086           goto LBL_V;
-087        /* subtract smallest from largest */
+087        \}
-088        if ((res = s_mp_sub(&v, &u, &v)) != MP_OKAY) \{
+088     \}
-089           goto LBL_V;
+089
-090        \}
+090     /* multiply by 2**k which we divided out at the beginning */
-091
+091     if ((res = mp_mul_2d (&u, k, c)) != MP_OKAY) \{
-092        /* Divide out all factors of two */
+092        goto LBL_V;
-093        if ((res = mp_div_2d(&v, mp_cnt_lsb(&v), &v, NULL)) != MP_OKAY) \{
+093     \}
-094           goto LBL_V;
+094     c->sign = MP_ZPOS;
-095        \}
+095     res = MP_OKAY;
-096     \}
+096   LBL_V:mp_clear (&u);
-097
+097   LBL_U:mp_clear (&v);
-098     /* multiply by 2**k which we divided out at the beginning */
+098     return res;
-099     if ((res = mp_mul_2d (&u, k, c)) != MP_OKAY) \{
+099   \}
-100        goto LBL_V;
+100   #endif
-101     \}
+101
-102     c->sign = MP_ZPOS;
-103     res = MP_OKAY;
-104   LBL_V:mp_clear (&u);
-105   LBL_U:mp_clear (&v);
-106     return res;
-107   \}
-108   #endif
 \end{alltt}
 \end{small}
 This function makes use of the macros mp\_iszero and mp\_iseven.  The former evaluates to $1$ if the input mp\_int is equivalent to the
 integer zero otherwise it evaluates to $0$.  The latter evaluates to $1$ if the input mp\_int represents a non-zero even integer otherwise
 it evaluates to $0$.  Note that just because mp\_iseven may evaluate to $0$ does not mean the input is odd, it could also be zero.  The three
-trivial cases of inputs are handled on lines 24 through 37.  After those lines the inputs are assumed to be non-zero.
+trivial cases of inputs are handled on lines 23 through 29.  After those lines the inputs are assumed to be non-zero.
-Lines 34 and 40 make local copies $u$ and $v$ of the inputs $a$ and $b$ respectively.  At this point the common factors of two
+Lines 32 and 36 make local copies $u$ and $v$ of the inputs $a$ and $b$ respectively.  At this point the common factors of two
-must be divided out of the two inputs.  The while loop on line 80 iterates so long as both are even.  The local integer $k$ is used to
+must be divided out of the two inputs.  The block starting at line 43 removes common factors of two by first counting the number of trailing
-keep track of how many factors of $2$ are pulled out of both values.  It is assumed that the number of factors will not exceed the maximum
+zero bits in both.  The local integer $k$ is used to keep track of how many factors of $2$ are pulled out of both values.  It is assumed that
-value of a C ``int'' data type\footnote{Strictly speaking no array in C may have more entries than are accessible by an ``int'' so this is not
+the number of factors will not exceed the maximum value of a C ``int'' data type\footnote{Strictly speaking no array in C may have more
-a limitation.}.
+entries than are accessible by an ``int'' so this is not a limitation.}.
-At this point there are no more common factors of two in the two values.  The while loops on lines 80 and 80 remove any independent
+At this point there are no more common factors of two in the two values.  The divisions by a power of two on lines 61 and 67 remove
-factors of two such that both $u$ and $v$ are guaranteed to be an odd integer before hitting the main body of the algorithm.  The while loop
+any independent factors of two such that both $u$ and $v$ are guaranteed to be an odd integer before hitting the main body of the algorithm.  The while loop
-on line 80 performs the reduction of the pair until $v$ is equal to zero.  The unsigned comparison and subtraction algorithms are used in
+on line 72 performs the reduction of the pair until $v$ is equal to zero.  The unsigned comparison and subtraction algorithms are used in
 place of the full signed routines since both values are guaranteed to be positive and the result of the subtraction is guaranteed to be non-negative.
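The same reduction can be sketched on unsigned 64-bit integers, which makes the roles of $u$, $v$ and $k$ easy to trace. The function name sketch\_gcd is hypothetical and the shifts stand in for the mp\_div\_2d and mp\_mul\_2d calls; this is an illustration of the pseudo-code, not the library code.

```c
#include <stdint.h>

/* Hypothetical helper: binary GCD following the pseudo-code steps. */
static uint64_t sketch_gcd(uint64_t u, uint64_t v)
{
    unsigned k = 0;

    /* if either input is zero the result is the other input */
    if (u == 0) {
        return v;
    }
    if (v == 0) {
        return u;
    }

    /* divide out the common factors of two and count them in k */
    while (((u | v) & 1u) == 0) {
        u >>= 1;
        v >>= 1;
        ++k;
    }

    /* remove remaining factors of two; at most one of these loops runs */
    while ((u & 1u) == 0) {
        u >>= 1;
    }
    while ((v & 1u) == 0) {
        v >>= 1;
    }

    /* both are odd here; reduce until v reaches zero */
    while (v != 0) {
        if (u > v) {
            uint64_t t = u;     /* swap so that v >= u */
            u = v;
            v = t;
        }
        v -= u;                 /* odd minus odd is even and non-negative */
        while (v != 0 && (v & 1u) == 0) {
            v >>= 1;
        }
    }

    /* restore the common power of two removed at the start */
    return u << k;
}
```

For example, with inputs 12 and 18 the common factor loop leaves $u = 6$, $v = 9$ and $k = 1$, the reduction ends with $u = 3$, and the result is $3 \cdot 2^1 = 6$.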
 \section{Least Common Multiple}
 The least common multiple of a pair of integers is their product divided by their greatest common divisor.  For two integers $a$ and $b$ the
 least common multiple is normally denoted as $[ a, b ]$ and numerically equivalent to ${ab} \over {(a, b)}$.  For example, if $a = 2 \cdot 2 \cdot 3 = 12$
 051   LBL_T:
 052     mp_clear_multi (&t1, &t2, NULL);
 053     return res;
 054   \}
 055   #endif
+056
 \end{alltt}
 \end{small}
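The identity $[a, b] = ab / (a, b)$ can be sketched on machine integers as below. The names sketch\_gcd\_euclid and sketch\_lcm are hypothetical, the helper uses the plain Euclidean algorithm rather than the binary method, and returning zero when either input is zero is a convention chosen for this sketch.

```c
#include <stdint.h>

/* Hypothetical helper: greatest common divisor by Euclid's algorithm. */
static uint64_t sketch_gcd_euclid(uint64_t a, uint64_t b)
{
    while (b != 0) {
        uint64_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Hypothetical helper: least common multiple of a and b.
 * The result may still overflow 64 bits for large inputs. */
static uint64_t sketch_lcm(uint64_t a, uint64_t b)
{
    if (a == 0 || b == 0) {
        return 0;               /* convention for this sketch */
    }
    /* divide before multiplying; the division is exact because
     * the gcd divides a, and it keeps the intermediate value small */
    return (a / sketch_gcd_euclid(a, b)) * b;
}
```

Dividing one operand by the greatest common divisor before multiplying gives the same value as forming the full product first, but avoids the larger intermediate product.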
 \section{Jacobi Symbol Computation}
 To explain the Jacobi Symbol we shall first discuss the Legendre function\footnote{Arrg.  What is the name of this?} off which the Jacobi symbol is
 096   LBL_P1:mp_clear (&p1);
 097   LBL_A1:mp_clear (&a1);
 098     return res;
 099   \}
 100   #endif
+101
 \end{alltt}
 \end{small}
 As a matter of practicality the variable $a'$ as per the pseudo-code is represented by the variable $a1$ since the $'$ symbol is not valid for a C
 variable name character.
 034   #endif
 035
 036     return MP_VAL;
 037   \}
 038   #endif
+039
 \end{alltt}
 \end{small}
 \subsubsection{Odd Moduli}
 041     \}
 042
 043     return MP_OKAY;
 044   \}
 045   #endif
+046
 \end{alltt}
 \end{small}
 The algorithm defaults to a return of $0$ in case an error occurs.  The values in the prime table are all specified to be in the range of a
 mp\_digit.  The table \_\_prime\_tab is defined in the following file.
 052     0x05F3, 0x05FB, 0x0607, 0x060D, 0x0611, 0x0617, 0x061F, 0x0623,
 053     0x062B, 0x062F, 0x063D, 0x0641, 0x0647, 0x0649, 0x064D, 0x0653
 054   #endif
 055   \};
 056   #endif
+057
 \end{alltt}
 \end{small}
 Note that there are two possible tables.  When an mp\_digit is 7-bits long only the primes up to $127$ may be included, otherwise the primes
 up to $1619$ are used.  Note that the value of \textbf{PRIME\_SIZE} is a constant dependent on the size of a mp\_digit.
 053     err = MP_OKAY;
 054   LBL_T:mp_clear (&t);
 055     return err;
 056   \}
 057   #endif
+058
 \end{alltt}
 \end{small}
 \subsection{The Miller-Rabin Test}
 The Miller-Rabin (citation) test is another primality test which has tighter error bounds than the Fermat test specifically with sequentially chosen
 094   LBL_R:mp_clear (&r);
 095   LBL_N1:mp_clear (&n1);
 096     return err;
 097   \}
 098   #endif
+099
 \end{alltt}
 \end{small}
