========================================================================
* README
========================================================================

Copyright 1991, 1996, 1999, 2000, 2007 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.


			THE GNU MP LIBRARY


GNU MP is a library for arbitrary precision arithmetic, operating on signed
integers, rational numbers, and floating point numbers.  It has a rich set of
functions, and the functions have a regular interface.

GNU MP is designed to be as fast as possible, both for small operands and huge
operands.  The speed is achieved by using fullwords as the basic arithmetic
type, by using fast algorithms, with carefully optimized assembly code for the
most common inner loops for lots of CPUs, and by a general emphasis on speed
(instead of simplicity or elegance).

GNU MP is believed to be faster than any other similar library.  Its advantage
increases with operand sizes for certain operations, since GNU MP in many
cases has asymptotically faster algorithms.

GNU MP is free software and may be freely copied on the terms contained in the
files COPYING* (see the manual for information on which license(s) applies to
which components of GNU MP).


			OVERVIEW OF GNU MP

There are five classes of functions in GNU MP.

 1. Signed integer arithmetic functions (mpz).  These functions are intended
    to be easy to use, with their regular interface.  The associated type is
    `mpz_t'.

 2. Rational arithmetic functions (mpq).  For now, just a small set of
    functions necessary for basic rational arithmetics.  The associated type
    is `mpq_t'.

 3. Floating-point arithmetic functions (mpf).  If the C type `double'
    doesn't give enough precision for your application, declare your
    variables as `mpf_t' instead, set the precision to any number desired,
    and call the functions in the mpf class for the arithmetic operations.

 4. Positive-integer, hard-to-use, very low overhead functions are in the
    mpn class.  No memory management is performed.  The caller must ensure
    enough space is available for the results.  The set of functions is not
    regular, nor is the calling interface.  These functions accept input
    arguments in the form of pairs consisting of a pointer to the least
    significant word, and an integral size telling how many limbs (= words)
    the pointer points to.

    Almost all calculations, in the entire package, are made by calling these
    low-level functions.

 5. Berkeley MP compatible functions.

    To use these functions, include the file "mp.h".  You can test if you are
    using the GNU version by testing if the symbol __GNU_MP__ is defined.

For more information on how to use GNU MP, please refer to the documentation.
It is composed from the file doc/gmp.texi, and can be displayed on the screen
or printed.  How to do that, as well how to build the library, is described in
the INSTALL file in this directory.


			REPORTING BUGS

If you find a bug in the library, please make sure to tell us about it!

You should first check the GNU MP web pages at https://gmplib.org/, under
"Status of the current release".  There will be patches for all known serious
bugs there.

Report bugs to gmp-bugs@gmplib.org.  What information is needed in a useful bug
report is described in the manual.  The same address can be used for suggesting
modifications and enhancements.


----------------
Local variables:
mode: text
fill-column: 78
End:

========================================================================
* demos/calc/README
========================================================================

Copyright 2001 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

This program is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation; either version 3 of the License, or (at your option) any later
version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.  See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with
this program.  If not, see https://www.gnu.org/licenses/.


                   DEMONSTRATION CALCULATOR PROGRAM


This is a simple program, meant only to show one way to use GMP with yacc
and lex to make a calculator.  Usage and comments on the implementation can
be found in calc.y.

Within a GMP build tree, the generated Makefile can be used to build the
program,

	make calc

(or on a DOS system, "make calc.exe").

Elsewhere, once GMP has been installed, the program can be compiled with for
instance

	gcc calc.c calclex.c -lgmp -o calc

Or if GNU readline is used then

	gcc calc.c calclex.c calcread.c -lgmp -lreadline -o calc

(again, on a DOS system "-o calc.exe").

Readline support can be enabled or disabled in calc-config.h.  That file is
created by the GMP ./configure based on the --with-readline option.  The
default is --with-readline=detect, which means to use readline if available.
"yes" can be used to force it to be used, or "no" to not use it.

The supplied calc.c was generated by GNU bison, but a standard yacc should
work too.

The supplied calclex.c was generated by GNU flex, but a standard lex should
work too.  The readline support may or may not work with a standard lex (see
comments with input() in calcread.c).  Note also that a standard lex will
require its library "-ll" on the compile command line.  "./configure" sets
this up in the GMP build tree Makefile.


----------------
Local variables:
mode: text
fill-column: 76
End:

========================================================================
* demos/expr/README
========================================================================

Copyright 2001, 2004 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.


                    GMP EXPRESSION EVALUATION
                    -------------------------


THIS CODE IS PRELIMINARY AND MAY BE SUBJECT TO INCOMPATIBLE CHANGES IN
FUTURE VERSIONS OF GMP.


The files in this directory implement a simple scheme of string based
expression parsing and evaluation, supporting mpz, mpq and mpf.

This will be slower than direct GMP library calls, but may be convenient in
various circumstances, such as while prototyping, or for letting a user
enter values in symbolic form.  "2**5723-7" for example is a lot easier to
enter or maintain than the equivalent written out in decimal.


BUILDING

Nothing in this directory is a normal part of libgmp, and nothing is built
or installed, but various Makefile rules are available to compile
everything.

All the functions are available through a little library (there's no shared
library since upward binary compatibility is not guaranteed).

	make libexpr.a

In a program, prototypes are available using

	#include "expr.h"

run-expr.c is a sample program doing evaluations from the command line.

	make run-expr
	./run-expr '1+2*3'

t-expr.c is self-test program, it prints nothing if successful.

	make t-expr
	./t-expr

The expr*.c sources don't depend on gmp-impl.h and can be compiled with just
a standard installed GMP.  This isn't true of t-expr though, since it uses
some of the internal tests/libtests.la.


SIMPLE USAGE

int mpz_expr (mpz_t res, int base, const char *e, ...);
int mpq_expr (mpq_t res, int base, const char *e, ...);
int mpf_expr (mpf_t res, int base, const char *e, ...);

These functions evaluate simple arithmetic expressions.  For example,

	mpz_expr (result, 0, "123+456", NULL);

Numbers are parsed by mpz_expr and mpq_expr the same as mpz_set_str with the
given base.  mpf_expr follows mpf_set_str, but supporting an "0x" prefix for
hex when base==0.

	mpz_expr (result, 0, "0xAAAA * 0x5555", NULL);

White space, as indicated by <ctype.h> isspace(), is ignored except for the
purpose of separating tokens.

Variables can be included in expressions by putting them in the stdarg list
after the string.  "a", "b", "c" etc in the expression string designate
those values.  For example,

        mpq_t  foo, bar;
        ...
	mpq_expr (q, 10, "2/3 + 1/a + b/2", foo, bar, NULL);

Here "a" will be the value from foo and "b" from bar.  Up to 26 variables
can be included this way.  The NULL must be present to indicate the end of
the list.

Variables can also be written "$a", "$b" etc.  This is necessary when using
bases greater than 10 since plain "a", "b" etc will otherwise be interpreted
as numbers.  For example,

        mpf_t  quux;
        mpf_expr (f, 16, "F00F@-6 * $a", quux, NULL);

All the standard C operators are available, with the usual precedences, plus
"**" for exponentiation at the highest precedence (and right associative).

        Operators      Precedence
         **              220
         ~ ! - (unary)   210
         * / %           200
         + -             190
         << >>           180
         <= < >= >       170
         == !=           160
         &               150
         ^               140
         |               130
         &&              120
         ||              110
         ? :             100/101

Currently only mpz_expr has the bitwise ~ % & ^ and | operators.  The
precedence numbers are of interest in the advanced usage described below.

Various functions are available too.  For example,

        mpz_expr (res, 10, "gcd(123,456,789) * abs(a)", var, NULL);

The following is the full set of functions,

        mpz_expr
            abs bin clrbit cmp cmpabs congruent_p divisible_p even_p fib fac
            gcd hamdist invert jacobi kronecker lcm lucnum max min nextprime
            odd_p perfect_power_p perfect_square_p popcount powm
            probab_prime_p root scan0 scan1 setbit sgn sqrt

        mpq_expr
            abs, cmp, den, max, min, num, sgn

        mpf_expr
            abs, ceil, cmp, eq, floor, integer_p, max, min, reldiff, sgn,
            sqrt, trunc

All these are the same as the GMP library functions, except that min and max
don't exist in the library.  Note also that min, max, gcd and lcm take any
number of arguments, not just two.

mpf_expr does all calculations to the precision of the destination variable.


Expression parsing can succeed or fail.  The return value indicates this,
and will be one of the following

	MPEXPR_RESULT_OK
	MPEXPR_RESULT_BAD_VARIABLE
	MPEXPR_RESULT_BAD_TABLE
	MPEXPR_RESULT_PARSE_ERROR
	MPEXPR_RESULT_NOT_UI

BAD_VARIABLE is when a variable is referenced that hasn't been provided.
For example if "c" is used when only two parameters have been passed.
BAD_TABLE is applicable to the advanced usage described below.

PARSE_ERROR is a general syntax error, returned for any mal-formed input
string.

NOT_UI is returned when an attempt is made to use an operand that's bigger
than an "unsigned long" with a function that's restricted to that range.
For example "fib" is mpz_fib_ui and only accepts an "unsigned long".


ADVANCED USAGE

int mpz_expr_a (const struct mpexpr_operator_t *table,
                mpz_ptr res, int base, const char *e, size_t elen,
                mpz_srcptr var[26])
int mpq_expr_a (const struct mpexpr_operator_t *table,
                mpq_ptr res, int base, const char *e, size_t elen,
                mpq_srcptr var[26])
int mpf_expr_a (const struct mpexpr_operator_t *table,
                mpf_ptr res, int base, unsigned long prec,
                const char *e, size_t elen,
                mpf_srcptr var[26])

These functions are an advanced interface to expression parsing.

The string is taken as pointer and length.  This makes it possible to parse
an expression in the middle of somewhere without copying and null
terminating it.

Variables are an array of 26 pointers to the appropriate operands, or NULL
for variables that are not available.  Any combination of variables can be
given, for example just "x" and "y" (var[23] and var[24]) could be set.

Operators and functions are specified with a table.  This makes it possible
to provide additional operators or functions, or to completely change the
syntax.  The standard tables used by the simple functions above are
available as

	const struct mpexpr_operator_t * const mpz_expr_standard_table;
	const struct mpexpr_operator_t * const mpq_expr_standard_table;
	const struct mpexpr_operator_t * const mpf_expr_standard_table;

struct mpexpr_operator_t is the following

	struct mpexpr_operator_t {
	  const char    *name;
	  mpexpr_fun_t  fun;
	  int           type;
	  int           precedence;
	};

        typedef void (*mpexpr_fun_t) (void);

As an example, the standard mpz_expr table entry for multiplication is as
follows.  See the source code for the full set of standard entries.

	{ "*", (mpexpr_fun_t) mpz_mul, MPEXPR_TYPE_BINARY, 200 },

"name" is the string to parse, "fun" is the function to call for it, "type"
indicates what parameters the function takes (among other things), and
"precedence" sets its operator precedence.

A NULL for "name" indicates the end of the table, so for example an mpf
table with nothing but addition could be

        struct mpexpr_operator_t  table[] = {
          { "+", (mpexpr_fun_t) mpf_add, MPEXPR_TYPE_BINARY, 190 },
          { NULL }
        };

A special type MPEXPR_TYPE_NEW_TABLE makes it possible to chain from one
table to another.  For example the following would add a "mod" operator to
the standard mpz table,

        struct mpexpr_operator_t  table[] = {
        { "mod", (mpexpr_fun_t) mpz_fdiv_r, MPEXPR_TYPE_BINARY, 125 },
        { (const char *) mpz_expr_standard_table, NULL, MPEXPR_TYPE_NEW_TABLE }
        };

Notice the low precedence on "mod", so that for instance "45+26 mod 7"
parses as "(45+26)mod7".


Functions are designated by a precedence of 0.  They always occur as
"foo(expr)" and so have no need for a precedence level.  mpq_abs in the
standard mpq table is

	{ "abs", (mpexpr_fun_t) mpq_abs, MPEXPR_TYPE_UNARY },

Functions expecting no arguments as in "foo()" can be given with
MPEXPR_TYPE_0ARY, or actual constants to be parsed as just "foo" are
MPEXPR_TYPE_CONSTANT.  For example if a "void mpf_const_pi(mpf_t f)"
function existed (which it doesn't) it could be,

	{ "pi", (mpexpr_fun_t) mpf_const_pi, MPEXPR_TYPE_CONSTANT },


Parsing of operator names is done by seeking the table entry with the
longest matching name.  So for instance operators "<" and "<=" exist, and
when presented with "x <= y" the parser matches "<=" because it's longer.

Parsing of function names, on the other hand, is done by requiring a whole
alphanumeric word to match.  For example presented with "fib2zz(5)" the
parser will attempt to find a function called "fib2zz".  A function "fib"
wouldn't be used because it doesn't match the whole word.

The flag MPEXPR_TYPE_WHOLEWORD can be ORed into an operator type to override
the default parsing style.  Similarly MPEXPR_TYPE_OPERATOR into a function.


Binary operators are left associative by default, meaning they're evaluated
from left to right, so for example "1+2+3" is treated as "(1+2)+3".
MPEXPR_TYPE_RIGHTASSOC can be ORed into the operator type to work from right
to left as in "1+(2+3)".  This is generally what's wanted for
exponentiation, and for example the standard mpz table has

        { "**", (mpexpr_fun_t) mpz_pow_ui,
          MPEXPR_TYPE_BINARY_UI | MPEXPR_TYPE_RIGHTASSOC, 220 }

Unary operators are postfix by default.  For example a factorial to be used
as "123!" might be

	{ "!", (mpexpr_fun_t) mpz_fac_ui, MPEXPR_TYPE_UNARY_UI, 215 }

MPEXPR_TYPE_PREFIX can be ORed into the type to get a prefix operator.  For
instance negation (unary minus) in the standard mpf table is

	{ "-", (mpexpr_fun_t) mpf_neg,
          MPEXPR_TYPE_UNARY | MPEXPR_TYPE_PREFIX, 210 },


The same operator can exist as a prefix unary and a binary, or as a prefix
and postfix unary, simply by putting two entries in the table.  While
parsing the context determines which style is sought.  But note that the
same operator can't be both a postfix unary and a binary, since the parser
doesn't try to look ahead to decide which ought to be used.

When there's two entries for an operator, both prefix or both postfix (or
binary), then the first in the table will be used.  This makes it possible
to override an entry in a standard table, for example to change the function
it calls, or perhaps its precedence level.  The following would change mpz
division from tdiv to cdiv,

        struct mpexpr_operator_t  table[] = {
          { "/", (mpexpr_fun_t) mpz_cdiv_q, MPEXPR_TYPE_BINARY, 200 },
          { "%", (mpexpr_fun_t) mpz_cdiv_r, MPEXPR_TYPE_BINARY, 200 },
          { (char *) mpz_expr_standard_table, NULL, MPEXPR_TYPE_NEW_TABLE }
        };


The type field indicates what parameters the given function expects.  The
following styles of functions are supported.  mpz_t is shown, but of course
this is mpq_t for mpq_expr_a, mpf_t for mpf_expr_a, etc.

    MPEXPR_TYPE_CONSTANT     void func (mpz_t result);

    MPEXPR_TYPE_0ARY         void func (mpz_t result);
    MPEXPR_TYPE_I_0ARY       int func (void);

    MPEXPR_TYPE_UNARY        void func (mpz_t result, mpz_t op);
    MPEXPR_TYPE_UNARY_UI     void func (mpz_t result, unsigned long op);
    MPEXPR_TYPE_I_UNARY      int func (mpz_t op);
    MPEXPR_TYPE_I_UNARY_UI   int func (unsigned long op);

    MPEXPR_TYPE_BINARY       void func (mpz_t result, mpz_t op1, mpz_t op2);
    MPEXPR_TYPE_BINARY_UI    void func (mpz_t result,
                                        mpz_t op1, unsigned long op2);
    MPEXPR_TYPE_I_BINARY     int func (mpz_t op1, mpz_t op2);
    MPEXPR_TYPE_I_BINARY_UI  int func (mpz_t op1, unsigned long op2);

    MPEXPR_TYPE_TERNARY      void func (mpz_t result,
                                        mpz_t op1, mpz_t op2, mpz_t op3);
    MPEXPR_TYPE_TERNARY_UI   void func (mpz_t result, mpz_t op1, mpz_t op2,
                                        unsigned long op3);
    MPEXPR_TYPE_I_TERNARY    int func (mpz_t op1, mpz_t op2, mpz_t op3);
    MPEXPR_TYPE_I_TERNARY_UI int func (mpz_t op1, mpz_t op2,
                                       unsigned long op3);

Notice the pattern of "UI" for the last parameter as an unsigned long, or
"I" for the result as an "int" return value.

It's important that the declared type for an operator or function matches
the function pointer given.  Any mismatch will have unpredictable results.

For binary functions, a further type attribute is MPEXPR_TYPE_PAIRWISE which
indicates that any number of arguments should be accepted, and evaluated by
applying the given binary function to them pairwise.  This is used by gcd,
lcm, min and max.  For example the standard mpz gcd is

	{ "gcd", (mpexpr_fun_t) mpz_gcd,
	  MPEXPR_TYPE_BINARY | MPEXPR_TYPE_PAIRWISE },

Some special types exist for comparison operators (or functions).
MPEXPR_TYPE_CMP_LT through MPEXPR_TYPE_CMP_GE expect an MPEXPR_TYPE_I_BINARY
function, returning positive, negative or zero like mpz_cmp and similar.
For example the standard mpf "!=" operator is

	{ "!=", (mpexpr_fun_t) mpf_cmp, MPEXPR_TYPE_CMP_NE, 160 },

But there's no obligation to use these types, for instance the standard mpq
table just uses a plain MPEXPR_TYPE_I_BINARY and mpq_equal for "==".

Further special types MPEXPR_TYPE_MIN and MPEXPR_TYPE_MAX exist to implement
the min and max functions, and they take a function like mpf_cmp similarly.
The standard mpf max function is

	{ "max",  (mpexpr_fun_t) mpf_cmp,
          MPEXPR_TYPE_MAX | MPEXPR_TYPE_PAIRWISE },

These can be used as operators too, for instance the following would be the
>? operator which is a feature of GNU C++,

	{ ">?", (mpexpr_fun_t) mpf_cmp, MPEXPR_TYPE_MAX, 175 },

Other special types are used to define "(" ")" parentheses, "," function
argument separator, "!" through "||" logical booleans, ternary "?"  ":", and
the "$" which introduces variables.  See the sources for how they should be
used.


User definable operator tables will have various uses.  For example,

  - a subset of the C operators, to be rid of infrequently used things
  - a more mathematical syntax like "." for multiply, "^" for powering,
    and "!" for factorial
  - a boolean evaluator with "^" for AND, "v" for OR
  - variables introduced with "%" instead of "$"
  - brackets as "[" and "]" instead of "(" and ")"

The only fixed parts of the parsing are the treatment of numbers, whitespace
and the two styles of operator/function name recognition.

As a final example, the following would be a complete mpz table implementing
some operators with a more mathematical syntax.  Notice there's no need to
preserve the standard precedence values, anything can be used so long as
they're in the desired relation to each other.  There's also no need to have
entries in precedence order, but it's convenient to do so to show what comes
where.

        static const struct mpexpr_operator_t  table[] = {
	  { "^",   (mpexpr_fun_t) mpz_pow_ui,
            MPEXPR_TYPE_BINARY_UI | MPEXPR_TYPE_RIGHTASSOC,           9 },

          { "!",   (mpexpr_fun_t) mpz_fac_ui, MPEXPR_TYPE_UNARY_UI,   8 },
          { "-",   (mpexpr_fun_t) mpz_neg,
            MPEXPR_TYPE_UNARY | MPEXPR_TYPE_PREFIX,                   7 },

          { "*",   (mpexpr_fun_t) mpz_mul,    MPEXPR_TYPE_BINARY,     6 },
          { "/",   (mpexpr_fun_t) mpz_fdiv_q, MPEXPR_TYPE_BINARY,     6 },

          { "+",   (mpexpr_fun_t) mpz_add,    MPEXPR_TYPE_BINARY,     5 },
          { "-",   (mpexpr_fun_t) mpz_sub,    MPEXPR_TYPE_BINARY,     5 },

          { "mod", (mpexpr_fun_t) mpz_mod,    MPEXPR_TYPE_BINARY,     6 },

          { ")",   NULL,                      MPEXPR_TYPE_CLOSEPAREN, 4 },
          { "(",   NULL,                      MPEXPR_TYPE_OPENPAREN,  3 },
          { ",",   NULL,                      MPEXPR_TYPE_ARGSEP,     2 },

          { "$",   NULL,                      MPEXPR_TYPE_VARIABLE,   1 },
          { NULL }
        };


INTERNALS

Operator precedence is implemented using a control and data stack, there's
no C recursion.  When an expression like 1+2*3 is read the "+" is held on
the control stack and 1 on the data stack until "*" has been parsed and
applied to 2 and 3.  This happens any time a higher precedence operator
follows a lower one, or when a right-associative operator like "**" is
repeated.

Parentheses are handled by making "(" a special prefix unary with a low
precedence so a whole following expression is read.  The special operator
")" knows to discard the pending "(".  Function arguments are handled
similarly, with the function pretending to be a low precedence prefix unary
operator, and with "," allowed within functions.  The same special ")"
operator recognises a pending function and will invoke it appropriately.

The ternary "? :" operator is also handled using precedences.  ":" is one
level higher than "?", so when a valid a?b:c is parsed the ":" finds a "?"
on the control stack.  It's a parse error for ":" to find anything else.


FUTURE

The ternary "?:" operator evaluates the "false" side of its pair, which is
wasteful, though it ought to be harmless.  It'd be better if it could
evaluate only the "true" side.  Similarly for the logical booleans "&&" and
"||" if they know their result already.

Functions like MPEXPR_TYPE_BINARY could return a status indicating operand
out of range or whatever, to get an error back through mpz_expr etc.  That
would want to be just an option, since plain mpz_add etc have no such
return.

Could have assignments like "a = b*c" modifying the input variables.
Assignment could be an operator attribute, making it expect an lvalue.
There would want to be a standard table without assignments available
though, so user input could be safely parsed.

The closing parenthesis table entry could specify the type of open paren it
expects, so that "(" and ")" could match and "[" and "]" match but not a
mixture of the two.  Currently "[" and "]" can be added, but there's no
error on writing a mixed expression like "2*(3+4]".  Maybe also there could
be a way to say that functions can only be written with one or the other
style of parens.


----------------
Local variables:
mode: text
fill-column: 76
End:

========================================================================
* mini-gmp/README
========================================================================

Copyright 2011-2013 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.


This is "mini-gmp", a small implementation of a subset of GMP's mpn
and mpz interfaces.

It is intended for applications which need arithmetic on numbers
larger than a machine word, but which don't need to handle very large
numbers very efficiently. Those applications can include a copy of
mini-gmp to get a GMP-compatible interface with small footprint. One
can also arrange for optional linking with the real GMP library, using
mini-gmp as a fallback when for some reason GMP is not available, or
not desired as a dependency.

The supported GMP subset is declared in mini-gmp.h. The implemented
functions are fully compatible with the corresponding GMP functions,
as specified in the GMP manual, with a few exceptions:

  mpz_set_str, mpz_init_set_str, mpz_get_str, mpz_out_str and
  mpz_sizeinbase support only |base| <= 36;
  mpz_export and mpz_import support only NAILS = 0.

  The REALLOC_FUNC and FREE_FUNC registered with
  mp_set_memory_functions does not get the correct size of the
  allocated block in the corresponding argument. mini-gmp always
  passes zero for these rarely used arguments.

The implementation is a single file, mini-gmp.c.

The performance target for mini-gmp is to be at most 10 times slower
than the real GMP library, for numbers of size up to a few hundred
bits. No asymptotically fast algorithms are included in mini-gmp, so
it will be many orders of magnitude slower than GMP for very large
numbers.

You should never "install" mini-gmp. Applications can either just
#include mini-gmp.c (but then, beware that it defines several macros
and functions outside of the advertised interface). Or compile
mini-gmp.c as a separate compilation unit, and use the declarations in
mini-gmp.h.

The tests subdirectory contains a testsuite. To use it, you need GMP
and GNU make. Just run make check in the tests directory. If the
hard-coded compiler settings are not right, you have to either edit the
Makefile or pass overriding values on the make command line (e.g.,
make CC=cc check). Testing is not (yet) as thorough as for the real
GMP.

The current version was put together by Niels Möller
<nisse@lysator.liu.se>, with a fair amount of copy-and-paste from the
GMP sources.

========================================================================
* mpn/README
========================================================================

Copyright 1996, 1999 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.


This directory contains all code for the mpn layer of GMP.

Most subdirectories contain machine-dependent code, written in assembly or C.
The `generic' subdirectory contains default code, used when there is no
machine-dependent replacement for a particular machine.

There is one subdirectory for each ISA family.  Note that e.g., 32-bit SPARC
and 64-bit SPARC are very different ISA's, and thus cannot share any code.

A particular compile will only use code from one subdirectory, and the
`generic' subdirectory.  The ISA-specific subdirectories contain hierachies of
directories for various architecture variants and implementations; the
top-most level contains code that runs correctly on all variants.

========================================================================
* mpn/alpha/README
========================================================================

Copyright 1996, 1997, 1999-2005 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.


This directory contains mpn functions optimized for DEC Alpha processors.

ALPHA ASSEMBLY RULES AND REGULATIONS

The `.prologue N' pseudo op marks the end of instruction that needs special
handling by unwinding.  It also says whether $27 is really needed for computing
the gp.  The `.mask M' pseudo op says which registers are saved on the stack,
and at what offset in the frame.

Cray T3 code is very very different...

"$6" / "$f6" etc is the usual syntax for registers, but on Unicos instead "r6"
/ "f6" is required.  We use the "r6" / "f6" forms, and have m4 defines expand
them to "$6" or "$f6" where necessary.

"0x" introduces a hex constant in gas and DEC as, but on Unicos "^X" is
required.  The X() macro accommodates this difference.

"cvttqc" is required by DEC as, "cvttq/c" is required by Unicos, and gas will
accept either.  We use cvttqc and have an m4 define expand to cvttq/c where
necessary.

"not" as an alias for "ornot r31, ..." is available in gas and DEC as, but not
the Unicos assembler.  The full "ornot" must be used.

"unop" is not available in Unicos.  We make an m4 define to the usual "ldq_u
r31,0(r30)", and in fact use that define on all systems since it comes out the
same.

"!literal!123" etc explicit relocations as per Tru64 4.0 are apparently not
available in older alpha assemblers (including gas prior to 2.12), according to
the GCC manual, so the assembler macro forms must be used (eg. ldgp).


RELEVANT OPTIMIZATION ISSUES

EV4

1. This chip has very limited store bandwidth.  The on-chip L1 cache is write-
   through, and a cache line is transferred from the store buffer to the off-
   chip L2 in as much 15 cycles on most systems.  This delay hurts mpn_add_n,
   mpn_sub_n, mpn_lshift, and mpn_rshift.

2. Pairing is possible between memory instructions and integer arithmetic
   instructions.

3. mulq and umulh are documented to have a latency of 23 cycles, but 2 of these
   cycles are pipelined.  Thus, multiply instructions can be issued at a rate
   of one each 21st cycle.

EV5

1. The memory bandwidth of this chip is good, both for loads and stores.  The
   L1 cache can handle two loads or one store per cycle, but two cycles after a
   store, no ld can issue.

2. mulq has a latency of 12 cycles and an issue rate of 1 each 8th cycle.
   umulh has a latency of 14 cycles and an issue rate of 1 each 10th cycle.
   (Note that published documentation gets these numbers slightly wrong.)

3. mpn_add_n.  With 4-fold unrolling, we need 37 instructions, whereof 12
   are memory operations.  This will take at least
	ceil(37/2) [dual issue] + 1 [taken branch] = 19 cycles
   We have 12 memory cycles, plus 4 after-store conflict cycles, or 16 data
   cache cycles, which should be completely hidden in the 19 issue cycles.
   The computation is inherently serial, with these dependencies:

	       ldq  ldq
		 \  /\
	  (or)   addq |
	   |\   /   \ |
	   | addq  cmpult
	    \  |     |
	     cmpult  |
		 \  /
		  or

   I.e., 3 operations are needed between carry-in and carry-out, making 12
   cycles the absolute minimum for the 4 limbs.  We could replace the `or' with
   a cmoveq/cmovne, which could issue one cycle earlier that the `or', but that
   might waste a cycle on EV4.  The total depth remain unaffected, since cmov
   has a latency of 2 cycles.

     addq
     /   \
   addq  cmpult
     |      \
   cmpult -> cmovne

  Montgomery has a slightly different way of computing carry that requires one
  less instruction, but has depth 4 (instead of the current 3).  Since the code
  is currently instruction issue bound, Montgomery's idea should save us 1/2
  cycle per limb, or bring us down to a total of 17 cycles or 4.25 cycles/limb.
  Unfortunately, this method will not be good for the EV6.

4. addmul_1 and friends: We previously had a scheme for splitting the single-
   limb operand in 21-bits chunks and the multi-limb operand in 32-bit chunks,
   and then use FP operations for every 2nd multiply, and integer operations
   for every 2nd multiply.

   But it seems much better to split the single-limb operand in 16-bit chunks,
   since we save many integer shifts and adds that way.  See powerpc64/README
   for some more details.

EV6

Here we have a really parallel pipeline, capable of issuing up to 4 integer
instructions per cycle.  In actual practice, it is never possible to sustain
more than 3.5 integer insns/cycle due to rename register shortage.  One integer
multiply instruction can issue each cycle.  To get optimal speed, we need to
pretend we are vectorizing the code, i.e., minimize the depth of recurrences.

There are two dependencies to watch out for.  1) Address arithmetic
dependencies, and 2) carry propagation dependencies.

We can avoid serializing due to address arithmetic by unrolling loops, so that
addresses don't depend heavily on an index variable.  Avoiding serializing
because of carry propagation is trickier; the ultimate performance of the code
will be determined of the number of latency cycles it takes from accepting
carry-in to a vector point until we can generate carry-out.

Most integer instructions can execute in either the L0, U0, L1, or U1
pipelines.  Shifts only execute in U0 and U1, and multiply only in U1.

CMOV instructions split into two internal instructions, CMOV1 and CMOV2.  CMOV
split the mapping process (see pg 2-26 in cmpwrgd.pdf), suggesting the CMOV
should always be placed as the last instruction of an aligned 4 instruction
block, or perhaps simply avoided.

Perhaps the most important issue is the latency between the L0/U0 and L1/U1
clusters; a result obtained on either cluster has an extra cycle of latency for
consumers in the opposite cluster.  Because of the dynamic nature of the
implementation, it is hard to predict where an instruction will execute.


REFERENCES

"Alpha Architecture Handbook", version 4, Compaq, October 1998, order number
EC-QD2KC-TE.

"Alpha 21164 Microprocessor Hardware Reference Manual", Compaq, December 1998,
order number EC-QP99C-TE.

"Alpha 21264/EV67 Microprocessor Hardware Reference Manual", revision 1.4,
Compaq, September 2000, order number DS-0028B-TE.

"Compiler Writer's Guide for the Alpha 21264", Compaq, June 1999, order number
EC-RJ66A-TE.

All of the above are available online from

  http://ftp.digital.com/pub/Digital/info/semiconductor/literature/dsc-library.html
  ftp://ftp.compaq.com/pub/products/alphaCPUdocs

"Tru64 Unix Assembly Language Programmer's Guide", Compaq, March 1996, part
number AA-PS31D-TE.

"Digital UNIX Calling Standard for Alpha Systems", Digital Equipment Corp,
March 1996, part number AA-PY8AC-TE.

The above are available online,

  http://h30097.www3.hp.com/docs/pub_page/V40F_DOCS.HTM

(Dunno what h30097 means in this URL, but if it moves try searching for "tru64
online documentation" from the main www.hp.com page.)


----------------
Local variables:
mode: text
fill-column: 79
End:

========================================================================
* mpn/alpha/ev6/nails/README
========================================================================

Copyright 2002, 2005 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.


This directory contains assembly code for nails-enabled 21264.  The code is not
very well optimized.

For addmul_N, as N grows larger, we could make multiple loads together, then do
about 3.3 i/c.  10 cycles after the last load, we can increase to 4 i/c.  This
would surely allow addmul_4 to run at 2 c/l, but the same should be possible
also for addmul_3 and perhaps even addmul_2.


		current		fair		best
Routine		c/l  unroll	c/l  unroll	c/l  i/c
mul_1		3.25		2.75		2.75 3.273
addmul_1	4.0	4	3.5	4 14	3.25 3.385
addmul_2	4.0	1	2.5	2 10	2.25 3.333
addmul_3	3.0	1	2.33	2 14	2    3.333
addmul_4	2.5	1	2.125	2 17	2    3.135

addmul_5			2	1 10
addmul_6			2	1 12
addmul_7			2	1 14

(The "best" column doesn't account for bookkeeping instructions and
thereby assumes infinite unrolling.)

Basecase usages:

1	 addmul_1
2	 addmul_2
3	 addmul_3
4	 addmul_4
5	 addmul_3 + addmul_2	2.3998
6	 addmul_4 + addmul_2
7	 addmul_4 + addmul_3

========================================================================
* mpn/arm/README
========================================================================

Copyright 2002, 2012 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.


This directory contains mpn functions for ARM processors.  It has been
optimised for Cortex-A9, but the code in the top-level directory should run
on all ARM processors at architecture level v4 or later.

========================================================================
* mpn/cray/README
========================================================================

Copyright 2000-2002 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.


The code in this directory works for Cray vector systems such as C90,
J90, T90 (both the CFP variant and the IEEE variant) and SV1.  (For
the T3E and T3D systems, see the `alpha' subdirectory at the same
level as the directory containing this file.)

The cfp subdirectory is for systems utilizing the traditional Cray
floating-point format, and the ieee subdirectory is for the newer
systems that use the IEEE floating-point format.

There are several issues that reduces speed on Cray systems.  For
systems with cfp floating point, the main obstacle is the forming of
128-bit products.  For IEEE systems, adding, and in particular
computing carry is the main issue.  There are no vectorizing
unsigned-less-than instructions, and the sequence that implement that
operation is very long.

Shifting is the only operation that is simple to make fast.  All Cray
systems have a bitblt instructions (Vi Vj,Vj<Ak and Vi Vj,Vj>Ak) that
should be really useful.

For best speed for cfp systems, we need a mul_basecase, since that
reduces the need for carry propagation to a minimum.  Depending on the
size (vn) of the smaller of the two operands (V), we should split U and V
in different chunk sizes:

U split in 2 32-bit parts
V split according to the table:
parts			4	5	6	7	8
bits/part		16	13	11	10	8
max allowed vn		1	8	32	64	256
number of multiplies	8	10	12	14	16
peak cycles/limb	4	5	6	7	8

U split in 3 22-bit parts
V split according to the table:
parts			3	4	5
bits/part		22	16	13
max allowed vn		16	1024	8192
number of multiplies	9	12	15
peak cycles/limb	4.5	6	7.5

U split in 4 16-bit parts
V split according to the table:
parts			4
bits/part		16
max allowed vn		65536
number of multiplies	16
peak cycles/limb	8

(A T90 CPU can accumulate two products per cycle.)

IDEA:
* Rewrite mpn_add_n:
    short cy[n + 1];
    #pragma _CRI ivdep
      for (i = 0; i < n; i++)
	{ s = up[i] + vp[i];
	  rp[i] = s;
	  cy[i + 1] = s < up[i]; }
      more_carries = 0;
    #pragma _CRI ivdep
      for (i = 1; i < n; i++)
	{ s = rp[i] + cy[i];
	  rp[i] = s;
	  more_carries += s < cy[i]; }
      cys = 0;
      if (more_carries)
	{
	  cys = rp[1] < cy[1];
	  for (i = 2; i < n; i++)
	    { rp[i] += cys;
	      cys = rp[i] < cys; }
	}
      return cys + cy[n];

* Write mpn_add3_n for adding three operands.  First add operands 1
  and 2, and generate cy[].  Then add operand 3 to the partial result,
  and accumulate carry into cy[].  Finally propagate carry just like
  in the new mpn_add_n.

IDEA:

Store fewer bits, perhaps 62, per limb.  That brings mpn_add_n time
down to 2.5 cycles/limb and mpn_addmul_1 times to 4 cycles/limb.  By
storing even fewer bits per limb, perhaps 56, it would be possible to
write a mul_mul_basecase that would run at effectively 1 cycle/limb.
(Use VM here to better handle the romb-shaped multiply area, perhaps
rounding operand sizes up to the next power of 2.)

========================================================================
* mpn/ia64/README
========================================================================

Copyright 2000-2005 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.


                      IA-64 MPN SUBROUTINES


This directory contains mpn functions for the IA-64 architecture.


CODE ORGANIZATION

	mpn/ia64          itanium-2, and generic ia64

The code here has been optimized primarily for Itanium 2.  Very few Itanium 1
chips were ever sold, and Itanium 2 is more powerful, so the latter is what
we concentrate on.


CHIP NOTES

The IA-64 ISA keeps instructions three and three in 128 bit bundles.
Programmers/compilers need to put explicit breaks `;;' when there are WAW or
RAW dependencies, with some notable exceptions.  Such "breaks" are typically
at the end of a bundle, but can be put between operations within some bundle
types too.

The Itanium 1 and Itanium 2 implementations can under ideal conditions
execute two bundles per cycle.  The Itanium 1 allows 4 of these instructions
to do integer operations, while the Itanium 2 allows all 6 to be integer
operations.

Taken cloop branches seem to insert a bubble into the pipeline most of the
time on Itanium 1.

Loads to the fp registers bypass the L1 cache and thus get extremely long
latencies, 9 cycles on the Itanium 1 and 6 cycles on the Itanium 2.

The software pipeline stuff using br.ctop instruction causes delays, since
many issue slots are taken up by instructions with zero predicates, and
since many extra instructions are needed to set things up.  These features
are clearly designed for code density, not speed.

Misc pipeline limitations (Itanium 1):
* The getf.sig instruction can only execute in M0.
* At most four integer instructions/cycle.
* Nops take up resources like any plain instructions.

Misc pipeline limitations (Itanium 2):
* The getf.sig instruction can only execute in M0.
* Nops take up resources like any plain instructions.


ASSEMBLY SYNTAX

.align pads with nops in a text segment, but gas 2.14 and earlier
incorrectly byte-swaps its nop bundle in big endian mode (eg. hpux), making
it come out as break instructions.  We use the ALIGN() macro in
mpn/ia64/ia64-defs.m4 when it might be executed across.  That macro
suppresses any .align if the problem is detected by configure.  Lack of
alignment might hurt performance but will at least be correct.

foo:: to create a global symbol is not accepted by gas.  Use separate
".global foo" and "foo:" instead.

.global is the standard global directive.  gas accepts .globl, but hpux "as"
doesn't.

.proc / .endp generates the appropriate .type and .size information for ELF,
so the latter directives don't need to be given explicitly.

.pred.rel "mutex"... is standard for annotating predicate register
relationships.  gas also accepts .pred.rel.mutex, but hpux "as" doesn't.

.pred directives can't be put on a line with a label, like
".Lfoo: .pred ...", the HP assembler on HP-UX 11.23 rejects that.
gas is happy with it, and past versions of HP had seemed ok.

// is the standard comment sequence, but we prefer "C" since it inhibits m4
macro expansion.  See comments in ia64-defs.m4.


REGISTER USAGE

Special:
   r0: constant 0
   r1: global pointer (gp)
   r8: return value
   r12: stack pointer (sp)
   r13: thread pointer (tp)
Caller-saves: r8-r11 r14-r31 f6-f15 f32-f127
Caller-saves but rotating: r32-


================================================================
mpn_add_n, mpn_sub_n:

The current code runs at 1.25 c/l on Itanium 2.

================================================================
mpn_mul_1:

The current code runs at 2 c/l on Itanium 2.

Using a blocked approach, working off of 4 separate places in the operands,
one could make use of the xma accumulation, and approach 1 c/l.

	ldf8 [up]
	xma.l
	xma.hu
	stf8  [wrp]

================================================================
mpn_addmul_1:

The current code runs at 2 c/l on Itanium 2.

It seems possible to use a blocked approach, as with mpn_mul_1.  We should
read rp[] to integer registers, allowing for just one getf.sig per cycle.

	ld8  [rp]
	ldf8 [up]
	xma.l
	xma.hu
	getf.sig
	add+add+cmp+cmp
	st8  [wrp]

These 10 instructions can be scheduled to approach 1.667 cycles, and with
the 4 cycle latency of xma, this means we need at least 3 blocks.  Using
ldfp8 we could approach 1.583 c/l.

================================================================
mpn_submul_1:

The current code runs at 2.25 c/l on Itanium 2.  Getting to 2 c/l requires
ldfp8 with all alignment headache that implies.

================================================================
mpn_addmul_N

For best speed, we need to give up using mpn_addmul_2 as the main multiply
building block, and instead take multiple v limbs per loop.  For the Itanium
1, we need to take about 8 limbs at a time for full speed.  For the Itanium
2, something like mpn_addmul_4 should be enough.

The add+cmp+cmp+add we use on the other codes is optimal for shortening
recurrencies (1 cycle) but the sequence takes up 4 execution slots.  When
recurrency depth is not critical, a more standard 3-cycle add+cmp+add is
better.

/* First load the 8 values from v */
	ldfp8		v0, v1 = [r35], 16;;
	ldfp8		v2, v3 = [r35], 16;;
	ldfp8		v4, v5 = [r35], 16;;
	ldfp8		v6, v7 = [r35], 16;;

/* In the inner loop, get a new U limb and store a result limb. */
	mov		lc = un
Loop:	ldf8		u0 = [r33], 8
	ld8		r0 = [r32]
	xma.l		lp0 = v0, u0, hp0
	xma.hu		hp0 = v0, u0, hp0
	xma.l		lp1 = v1, u0, hp1
	xma.hu		hp1 = v1, u0, hp1
	xma.l		lp2 = v2, u0, hp2
	xma.hu		hp2 = v2, u0, hp2
	xma.l		lp3 = v3, u0, hp3
	xma.hu		hp3 = v3, u0, hp3
	xma.l		lp4 = v4, u0, hp4
	xma.hu		hp4 = v4, u0, hp4
	xma.l		lp5 = v5, u0, hp5
	xma.hu		hp5 = v5, u0, hp5
	xma.l		lp6 = v6, u0, hp6
	xma.hu		hp6 = v6, u0, hp6
	xma.l		lp7 = v7, u0, hp7
	xma.hu		hp7 = v7, u0, hp7
	getf.sig	l0 = lp0
	getf.sig	l1 = lp1
	getf.sig	l2 = lp2
	getf.sig	l3 = lp3
	getf.sig	l4 = lp4
	getf.sig	l5 = lp5
	getf.sig	l6 = lp6
	add+cmp+add	xx, l0, r0
	add+cmp+add	acc0, acc1, l1
	add+cmp+add	acc1, acc2, l2
	add+cmp+add	acc2, acc3, l3
	add+cmp+add	acc3, acc4, l4
	add+cmp+add	acc4, acc5, l5
	add+cmp+add	acc5, acc6, l6
	getf.sig	acc6 = lp7
	st8		[r32] = xx, 8
	br.cloop Loop

	49 insn at max 6 insn/cycle:		8.167 cycles/limb8
	11 memops at max 2 memops/cycle:	5.5 cycles/limb8
	16 fpops at max 2 fpops/cycle:		8 cycles/limb8
	21 intops at max 4 intops/cycle:	5.25 cycles/limb8
	11+21 memops+intops at max 4/cycle	8 cycles/limb8

================================================================
mpn_lshift, mpn_rshift

The current code runs at 1 cycle/limb on Itanium 2.

Using 63 separate loops, we could use the double-word shrp instruction.
That instruction has a plain single-cycle latency.  We need 63 loops since
this instruction only accept immediate count.  That would lead to a somewhat
silly code size, but the speed would be 0.75 c/l on Itanium 2 (by using shrp
each cycle plus shl/shr going down I1 for a further limb every second
cycle).

================================================================
mpn_copyi, mpn_copyd

The current code runs at 0.5 c/l on Itanium 2.  But that is just for L1
cache hit.  The 4-way unrolled loop takes just 2 cycles, and thus load-use
scheduling isn't great.  It might be best to actually use modulo scheduled
loops, since that will allow us to do better load-use scheduling without too
much unrolling.

Depending on size or operand alignment, we get 1 c/l or 0.5 c/l on Itanium
2, according to tune/speed.  Cache bank conflicts?


REFERENCES

Intel Itanium Architecture Software Developer's Manual, volumes 1 to 3,
Intel document 245317-004, 245318-004, 245319-004 October 2002.  Volume 1
includes an Itanium optimization guide.

Intel Itanium Processor-specific Application Binary Interface (ABI), Intel
document 245370-003, May 2001.  Describes C type sizes, dynamic linking,
etc.

Intel Itanium Architecture Assembly Language Reference Guide, Intel document
248801-004, 2000-2002.  Describes assembly instruction syntax and other
directives.

Itanium Software Conventions and Runtime Architecture Guide, Intel document
245358-003, May 2001.  Describes calling conventions, including stack
unwinding requirements.

Intel Itanium Processor Reference Manual for Software Optimization, Intel
document 245473-003, November 2001.

Intel Itanium-2 Processor Reference Manual for Software Development and
Optimization, Intel document 251110-003, May 2004.

All the above documents can be found online at

    http://developer.intel.com/design/itanium/manuals.htm

========================================================================
* mpn/m68k/README
========================================================================

Copyright 2001, 2003, 2004 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.


                      M68K MPN SUBROUTINES


This directory contains mpn functions for various m68k family chips.


CODE ORGANIZATION

	m68k             m68000, m68010, m68060
	m68k/mc68020     m68020, m68030, m68040, and CPU32


The m5200 "coldfire", which is m68000 less a few instructions, currently has
no assembler code support.


STATUS

The code herein is old and poorly maintained.  If somebody really cared, it
could be optimized substantially.  For example,

* mpn_add_n and mpn_sub_n could, with more unrolling be improved from 6 to
  close to 4 c/l (on m68040).

* The multiplication loops could be sped up by using the FPU.

* mpn_lshift by 31 should use the special-case mpn_rshift by 1 code, and
  vice versa mpn_rshift by 31 use the special lshift by 1, when operand
  overlap permits.

* On 68000, mpn_mul_1, mpn_addmul_1 and mpn_submul_1 could check for a
  16-bit multiplier and use two multiplies per limb, not four.

  Similarly various other _1 operations like mpn_mod_1, mpn_divrem_1,
  mpn_divexact_1, mpn_modexact_1c_odd.

* On 68000, mpn_lshift and mpn_rshift could use a roll and mask instead of
  lsrl and lsll.  This promises to be a speedup, effectively trading a 6+2*n
  shift for one or two 4 cycle masks.  Suggested by Jean-Charles Meyrignac.

* config.guess detects 68000, 68010, CPU32 and 68020 by running some code,
  but relies on system information for 030, 040 and 060.  Can they be
  identified by running some code?  Currently this only makes a difference
  to the compiler options selected, since we have no specific asm code for
  those chips.

One novel idea for 68000 would be to use a 16-bit limb instead of 32-bits.
This would suit the native 16x16 multiply, but might make it difficult to
get full value from the native 32x32 add/sub/etc.  This would be an ABI
option, and would select "__GMP_SHORT_LIMB" in gmp.h.

Naturally an entirely new set of asm subroutines would be needed for a
16-bit limb.  Also there's various places in the C code assuming limb>=long,
which would need to be updated, eg. mpz_set_ui.  Some of the nails changes
may have helped cover some of this.


ASM FILES

The .asm files are put through m4 for macro processing, and with the help of
configure give either MIT or Motorola syntax.  The generic mpn/asm-defs.m4
is used, together with mpn/m68k/m68k-defs.m4.  See comments in those files.

Not all possible syntax variations are covered.  GCC config/m68k for
instance has things like $ for immediates on CRDS or reversed cmp order for
AT&T SGS.  These could probably be handled if anyone really needs it.


CALLING CONVENTIONS

The SVR4 standard has an int of 32 bits, and all parameters 32-bit aligned
on the stack.

PalmOS and perhaps various embedded systems intended for 68000 however use
an int of 16 bits and parameters only 16-bit aligned on the stack.  This is
generated by "gcc -mshort" (and is the default for the PalmOS gcc port, we
believe).

The asm files adapt to these two ABIs by checking sizeof(unsigned), coming
through config.m4 as SIZEOF_UNSIGNED.  Only mpn_lshift and mpn_rshift are
affected, all other routines take longs and pointers, which are 32-bits in
both cases.

Strictly speaking the size of an int doesn't determine the stack padding
convention.  But if int is 16 bits then we can definitely say the host
system is not SVR4, and therefore may as well assume we're in 16-bit stack
alignment.


REFERENCES

"Motorola M68000 Family Programmer's Reference Manual", available online,

	http://e-www.motorola.com/brdata/PDFDB/docs/M68000PM.pdf

"System V Application Binary Interface: Motorola 68000 Processor Family
Supplement", AT&T, 1990, ISBN 0-13-877553-6.  Has details of calling
conventions and ELF style PIC coding.


----------------
Local variables:
mode: text
fill-column: 76
End:

========================================================================
* mpn/m88k/README
========================================================================

Copyright 2003 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.


                      M88K MPN SUBROUTINES

This directory contains mpn functions for various m88k family chips.

CODE ORGANIZATION

	m88k             m88000, m88100
	m88k/mc88110     m88110

STATUS

The code herein is old and poorly maintained.

* The .s files assume the system uses a "_" underscore prefix, which
  should be controlled by configure.

* The mc88110/*.S files are using the defunct "sysdep.h" configuration
  scheme and won't compile.

Conversion to the current m4 .asm style wouldn't be difficult.


----------------
Local variables:
mode: text
fill-column: 76
End:

========================================================================
* mpn/mips64/README
========================================================================

Copyright 1996 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.


This directory contains mpn functions optimized for MIPS3.  Example of
processors that implement MIPS3 are R4000, R4400, R4600, R4700, and R8000.

RELEVANT OPTIMIZATION ISSUES

1. On the R4000 and R4400, branches, both the plain and the "likely" ones,
   take 3 cycles to execute.  (The fastest possible loop will take 4 cycles,
   because of the delay insn.)

   On the R4600, branches takes a single cycle

   On the R8000, branches often take no noticeable cycles, as they are
   executed in a separate function unit..

2. The R4000 and R4400 have a load latency of 4 cycles.

3. On the R4000 and R4400, multiplies take a data-dependent number of
   cycles, contrary to the SGI documentation.  There seem to be 3 or 4
   possible latencies.

4. The R1x000 processors can issue one floating-point operation, two integer
   operations, and one memory operation per cycle.  The FPU has very short
   latencies, while the integer multiply unit is non-pipelined.  We should
   therefore write fp based mpn_Xmul_1.

STATUS

Good...

========================================================================
* mpn/pa32/README
========================================================================

Copyright 1996, 1999, 2001, 2002, 2004 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.


This directory contains mpn functions for various HP PA-RISC chips.  Code
that runs faster on the PA7100 and later implementations, is in the pa7100
directory.

RELEVANT OPTIMIZATION ISSUES

  Load and Store timing

On the PA7000 no memory instructions can issue the two cycles after a store.
For the PA7100, this is reduced to one cycle.

The PA7100 has a lookup-free cache, so it helps to schedule loads and the
dependent instruction really far from each other.

STATUS

1. mpn_mul_1 could be improved to 6.5 cycles/limb on the PA7100, using the
   instructions below (but some sw pipelining is needed to avoid the
   xmpyu-fstds delay):

	fldds	s1_ptr

	xmpyu
	fstds	N(%r30)
	xmpyu
	fstds	N(%r30)

	ldws	N(%r30)
	ldws	N(%r30)
	ldws	N(%r30)
	ldws	N(%r30)

	addc
	stws	res_ptr
	addc
	stws	res_ptr

	addib	Loop

2. mpn_addmul_1 could be improved from the current 10 to 7.5 cycles/limb
   (asymptotically) on the PA7100, using the instructions below.  With proper
   sw pipelining and the unrolling level below, the speed becomes 8
   cycles/limb.

	fldds	s1_ptr
	fldds	s1_ptr

	xmpyu
	fstds	N(%r30)
	xmpyu
	fstds	N(%r30)
	xmpyu
	fstds	N(%r30)
	xmpyu
	fstds	N(%r30)

	ldws	N(%r30)
	ldws	N(%r30)
	ldws	N(%r30)
	ldws	N(%r30)
	ldws	N(%r30)
	ldws	N(%r30)
	ldws	N(%r30)
	ldws	N(%r30)
	addc
	addc
	addc
	addc
	addc	%r0,%r0,cy-limb

	ldws	res_ptr
	ldws	res_ptr
	ldws	res_ptr
	ldws	res_ptr
	add
	stws	res_ptr
	addc
	stws	res_ptr
	addc
	stws	res_ptr
	addc
	stws	res_ptr

	addib

3. For the PA8000 we have to stick to using 32-bit limbs before compiler
   support emerges.  But we want to use 64-bit operations whenever possible,
   in particular for loads and stores.  It is possible to handle mpn_add_n
   efficiently by rotating (when s1/s2 are aligned), masking+bit field
   inserting when (they are not).  The speed should double compared to the
   code used today.


LABEL SYNTAX

The HP-UX assembler takes labels starting in column 0 with no colon,

	L$loop  ldws,mb -4(0,%r25),%r22

Gas on hppa GNU/Linux however requires a colon,

	L$loop: ldws,mb -4(0,%r25),%r22

This is covered by using LDEF() from asm-defs.m4.  An alternative would be
to use ".label" which is accepted by both,

		.label  L$loop
		ldws,mb -4(0,%r25),%r22

but that's not as nice to look at, not if you're used to assembler code
having labels in column 0.


REFERENCES

Hewlett Packard, "HP Assembler Reference Manual", 9th edition, June 1998,
part number 92432-90012.


----------------
Local variables:
mode: text
fill-column: 76
End:

========================================================================
* mpn/pa64/README
========================================================================

Copyright 1999, 2001, 2002, 2004 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.


This directory contains mpn functions for 64-bit PA-RISC 2.0.

PIPELINE SUMMARY

The PA8x00 processors have an orthogonal 4-way out-of-order pipeline.  Each
cycle two ALU operations and two MEM operations can issue, but just one of the
MEM operations may be a store.  The two ALU operations can be almost any
combination of non-memory operations.  Unlike every other processor, integer
and fp operations are completely equal here; they both count as just ALU
operations.

Unfortunately, some operations cause hickups in the pipeline.  Combining
carry-consuming operations like ADD,DC with operations that does not set carry
like ADD,L cause long delays.  Skip operations also seem to cause hickups.  If
several ADD,DC are issued consecutively, or if plain carry-generating ADD feed
ADD,DC, stalling does not occur.  We can effectively issue two ADD,DC
operations/cycle.

Latency scheduling is not as important as making sure to have a mix of ALU and
MEM operations, but for full pipeline utilization, it is still a good idea to
do some amount of latency scheduling.

Like for all other processors, RAW memory scheduling is critically important.
Since integer multiplication takes place in the floating-point unit, the GMP
code needs to handle this problem frequently.

STATUS

* mpn_lshift and mpn_rshift run at 1.5 cycles/limb on PA8000 and at 1.0
  cycles/limb on PA8500.  With latency scheduling, the numbers could
  probably be improved to 1.0 cycles/limb for all PA8x00 chips.

* mpn_add_n and mpn_sub_n run at 2.0 cycles/limb on PA8000 and at about
  1.6875 cycles/limb on PA8500.  With latency scheduling, this could
  probably be improved to get close to 1.5 cycles/limb.  A problem is the
  stalling of carry-inputting instructions after instructions that do not
  write to carry.

* mpn_mul_1, mpn_addmul_1, and mpn_submul_1 run at between 5.625 and 6.375
  on PA8500 and later, and about a cycle/limb slower on older chips.  The
  code uses ADD,DC for adjacent limbs, and relies heavily on reordering.


REFERENCES

Hewlett Packard, "64-Bit Runtime Architecture for PA-RISC 2.0", version 3.3,
October 1997.

========================================================================
* mpn/powerpc32/README
========================================================================

Copyright 2002, 2005 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.


                    POWERPC 32-BIT MPN SUBROUTINES


This directory contains mpn functions for various 32-bit PowerPC chips.


CODE ORGANIZATION

	directory	  used for
	================================================
	powerpc           generic, 604, 604e, 744x, 745x
	powerpc/750       740, 750, 7400, 7410


The top-level powerpc directory is currently mostly aimed at 604/604e but
should be reasonable on all powerpcs.


STATUS

The code is quite well optimized for the 604e, other chips have had less
attention.

Altivec SIMD available in 74xx might hold some promise, but unfortunately
GMP only guarantees 32-bit data alignment, so there's lots of fiddling
around with partial operations at the start and end of limb vectors.  A
128-bit limb would be a novel idea, but is unlikely to be practical, since
it would have to work with ordinary +, -, * etc in the C code.

Also, Altivec isn't very well suited for the GMP multiplication needs.
Using floating-point based multiplication has much better better performance
potential for all current powerpcs, both the ones with slow integer multiply
units (603, 740, 750, 7400, 7410) and those with fast (604, 604e, 744x,
745x).  This is because all powerpcs do some level of pipelining in the FPU:

603 and 750 can sustain one fmadd every 2nd cycle.
604 and 604e can sustain one fmadd per cycle.
7400 and 7410 can sustain 3 fmadd in 4 cycles.
744x and 745x can sustain 4 fmadd in 5 cycles.


REGISTER NAMES

The normal powerpc convention is to give registers as plain numbers, like
"mtctr 6", but on Apple MacOS X (powerpc*-*-rhapsody* and
powerpc*-*-darwin*) the assembler demands an "r" like "mtctr r6".  Note
however when register 0 in an instruction means a literal zero the "r" is
omitted, for instance "lwzx r6,0,r7".

The GMP code uses the "r" forms, powerpc-defs.m4 transforms them to plain
numbers according to what GMP_ASM_POWERPC_R_REGISTERS finds is needed.
(Note that this style isn't fully general, as the identifier r4 and the
register r4 will not be distinguishable on some systems.  However, this is
not a problem for the limited GMP assembly usage.)


GLOBAL REFERENCES

Linux non-PIC
	lis	9, __gmp_binvert_limb_table@ha
	rlwinm	11, 5, 31, 25, 31
	la	9, __gmp_binvert_limb_table@l(9)
	lbzx	11, 9, 11

Linux PIC (FIXME)
.LCL0:
	.long .LCTOC1-.LCF0
	bcl	20, 31, .LCF0
.LCF0:
	mflr	30
	lwz	7, .LCL0-.LCF0(30)
	add	30, 7, 30
	lwz	11, .LC0-.LCTOC1(30)
	rlwinm	3, 5, 31, 25, 31
	lbzx	7, 11, 3

AIX (always PIC)
LC..0:
	.tc __gmp_binvert_limb_table[TC],__gmp_binvert_limb_table[RW]
	lwz	9, LC..0(2)
	rlwinm	0, 5, 31, 25, 31
	lbzx	0, 9, 0

Darwin (non-PIC)
	lis	r2, ha16(___gmp_binvert_limb_table)
	rlwinm	r9, r5, 31, 25, 31
	la	r2, lo16(___gmp_binvert_limb_table)(r2)
	lbzx	r0, r2, r9
Darwin (PIC)
	mflr	r0
	bcl	20, 31, L0001$pb
L0001$pb:
	mflr	r7
	mtlr	r0
	addis	r2, r7, ha16(L___gmp_binvert_limb_table$non_lazy_ptr-L0001$pb)
	rlwinm	r9, r5, 31, 25, 31
	lwz	r2, lo16(L___gmp_binvert_limb_table$non_lazy_ptr-L0001$pb)(r2)
	lbzx	r0, r2, r9
------
	.non_lazy_symbol_pointer
L___gmp_binvert_limb_table$non_lazy_ptr:
	.indirect_symbol ___gmp_binvert_limb_table
	.long	0
	.subsections_via_symbols


For GNU/Linux and Darwin, we might want to duplicate __gmp_binvert_limb_table
into the text section in this file.  We should thus be able to reach it like
this:

	blr	L0
L0:	mflr	r2
	rlwinm	r9, r5, 31, 25, 31
	addi	r9, r9, lo16(local_binvert_table-L0)
	lbzx	r0, r2, r9


REFERENCES

PowerPC Microprocessor Family: The Programming Environments for 32-bit
Microprocessors, IBM document G522-0290-01, 2000.

PowerPC 604e RISC Microprocessor User's Manual with Supplement for PowerPC
604 Microprocessor, IBM document G552-0330-00, Freescale document
MPC604EUM/AD, 3/1998.

MPC7410/MPC7400 RISC Microprocessor User's Manual, Freescale document
MPC7400UM/D, rev 1, 11/2002.

MPC7450 RISC Microprocessor Family Reference Manual, Freescale document
MPC7450UM, rev 5, 1/2005.

The above are available online from

	http://www.ibm.com/chips/techlib/techlib.nsf/productfamilies/PowerPC
	http://www.freescale.com/PowerPC


----------------
Local variables:
mode: text
fill-column: 76
End:

========================================================================
* mpn/powerpc64/README
========================================================================

Copyright 1999-2001, 2003-2005 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.


                    POWERPC-64 MPN SUBROUTINES


This directory contains mpn functions for 64-bit PowerPC chips.


CODE ORGANIZATION

	mpn/powerpc64          mode-neutral code
	mpn/powerpc64/mode32   code for mode32
	mpn/powerpc64/mode64   code for mode64


The mode32 and mode64 sub-directories contain code which is for use in the
respective chip mode, 32 or 64.  The top-level directory is code that's
unaffected by the mode.

The "adde" instruction is the main difference between mode32 and mode64.  It
operates on either on a 32-bit or 64-bit quantity according to the chip mode.
Other instructions have an operand size in their opcode and hence don't vary.


POWER3/PPC630 pipeline information:

Decoding is 4-way + branch and issue is 8-way with some out-of-order
capability.

Functional units:
LS1  - ld/st unit 1
LS2  - ld/st unit 2
FXU1 - integer unit 1, handles any simple integer instruction
FXU2 - integer unit 2, handles any simple integer instruction
FXU3 - integer unit 3, handles integer multiply and divide
FPU1 - floating-point unit 1
FPU2 - floating-point unit 2

Memory:		  Any two memory operations can issue, but memory subsystem
		  can sustain just one store per cycle.  No need for data
		  prefetch; the hardware has very sophisticated prefetch logic.
Simple integer:	  2 operations (such as add, rl*)
Integer multiply: 1 operation every 9th cycle worst case; exact timing depends
		  on 2nd operand's most significant bit position (10 bits per
		  cycle).  Multiply unit is not pipelined, only one multiply
		  operation in progress is allowed.
Integer divide:	  ?
Floating-point:	  Any plain 2 arithmetic instructions (such as fmul, fadd, and
		  fmadd), latency 4 cycles.
Floating-point divide:
		  ?
Floating-point square root:
		  ?

POWER3/PPC630 best possible times for the main loops:
shift:	      1.5 cycles limited by integer unit contention.
	      With 63 special loops, one for each shift count, we could
	      reduce the needed integer instructions to 2, which would
	      reduce the best possible time to 1 cycle.
add/sub:      1.5 cycles, limited by ld/st unit contention.
mul:	      18 cycles (average) unless floating-point operations are used,
	      but that would only help for multiplies of perhaps 10 and more
	      limbs.
addmul/submul:Same situation as for mul.


POWER4/PPC970 and POWER5 pipeline information:

This is a very odd pipeline, it is basically a VLIW masquerading as a plain
architecture.  Its issue rules are not made public, and since it is so weird,
it is very hard to figure out any useful information from experimentation.
An example:

  A well-aligned loop with nop's take 3, 4, 6, 7, ... cycles.
    3 cycles for  0,  1,  2,  3,  4,  5,  6,  7 nop's
    4 cycles for  8,  9, 10, 11, 12, 13, 14, 15 nop's
    6 cycles for 16, 17, 18, 19, 20, 21, 22, 23 nop's
    7 cycles for 24, 25, 26, 27 nop's
    8 cycles for 28, 29, 30, 31 nop's
    ... continues regularly


Functional units:
LS1  - ld/st unit 1
LS2  - ld/st unit 2
FXU1 - integer unit 1, handles any integer instruction
FXU2 - integer unit 2, handles any integer instruction
FPU1 - floating-point unit 1
FPU2 - floating-point unit 2

While this is one integer unit less than POWER3/PPC630, the remaining units
are more powerful; here they handle multiply and divide.

Memory:		  2 ld/st.  Stores go to the L2 cache, which can sustain just
		  one store per cycle.
		  L1 load latency: to gregs 3-4 cycles, to fregs 5-6 cycles.
		  Operations that modify the address register might be split
		  to use also an integer issue slot.
Simple integer:	  2 operations every cycle, latency 2.
Integer multiply: 2 operations every 6th cycle, latency 7 cycles.
Integer divide:	  ?
Floating-point:	  Any plain 2 arithmetic instructions (such as fmul, fadd, and
		  fmadd), latency 6 cycles.
Floating-point divide:
		  ?
Floating-point square root:
		  ?


IDEAS

*mul_1: Handling one limb using mulld/mulhdu and two limbs using floating-
point operations should give performance of about 20 cycles for 3 limbs, or 7
cycles/limb.

We should probably split the single-limb operand in 32-bit chunks, and the
multi-limb operand in 16-bit chunks, allowing us to accumulate well in fp
registers.

Problem is to get 32-bit or 16-bit words to the fp registers.  Only 64-bit fp
memops copies bits without fiddling with them.  We might therefore need to
load to integer registers with zero extension, store as 64 bits into temp
space, and then load to fp regs.  Alternatively, load directly to fp space
and add well-chosen constants to get cancellation.  (Other part after given by
subsequent subtraction.)

Possible code mix for load-via-intregs variant:

lwz,std,lfd
fmadd,fmadd,fmul,fmul
fctidz,stfd,ld,fctidz,stfd,ld
add,adde
lwz,std,lfd
fmadd,fmadd,fmul,fmul
fctidz,stfd,ld,fctidz,stfd,ld
add,adde
srd,sld,add,adde,add,adde

========================================================================
* mpn/s390_64/README
========================================================================

Copyright 2011 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.


There are 5 generations of 64-but s390 processors, z900, z990, z9,
z10, and z196.  The current GMP code was optimised for the two oldest,
z900 and z990.


mpn_copyi

This code makes use of a loop around MVC.  It almost surely runs very
close to optimally.  A small improvement could be done by using one
MVC for size 256 bytes, now we use two (we use an extra MVC when
copying any multiple of 256 bytes).


mpn_copyd

We have tried several feed-in variants here, branch tree, jump table
and computed goto.  The fastest (on z990) turned out to be computed
goto.

An approach not tried is EX of LMG and STMG, modifying the register set
on-the-fly.  Using that trick, we could completely avoid using
separate feed-in paths.


mpn_lshift, mpn_rshift

The current code runs at pipeline decode bandwith on z990.


mpn_add_n, mpn_sub_n

The current code is 4-way unrolled.  It should be unrolled more, at
least 8x, in order to reach 2.5 c/l.


mpn_mul_1, mpn_addmul_1, mpn_submul_1

The current code is very naive, but due to the non-pipelined nature of
MLGR on z900 and z990, more sophisticated code would not gain much.

On z10 one would need to cluster at least 4 MLGR together, in order to
reduce stalling.

On z196, one surely want to use unrolling and pipelining, to perhaps
reach around 12 c/l.  A major issue here and on z10 is ALCGR's 3 cycle
stalling.


mpn_mul_2, mpn_addmul_2

At least for older machines (z900, z990) with very slow MLGR, we
should use Karatsuba's algorithm on 2-limb units, making mul_2 and
addmul_2 the main multiplication primitives.  The newer machines might
benefit less from this approach, perhaps in particular z10, where MLGR
clustering is more important.

With Karatsuba, one could hope for around 16 cycles per accumulated
128 cross product, on z990.

========================================================================
* mpn/sparc32/README
========================================================================

Copyright 1996, 2001 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.


This directory contains mpn functions for various SPARC chips.  Code that
runs only on version 8 SPARC implementations, is in the v8 subdirectory.

RELEVANT OPTIMIZATION ISSUES

  Load and Store timing

On most early SPARC implementations, the ST instructions takes multiple
cycles, while a STD takes just a single cycle more than an ST.  For the CPUs
in SPARCstation I and II, the times are 3 and 4 cycles, respectively.
Therefore, combining two ST instructions into a STD when possible is a
significant optimization.

Later SPARC implementations have single cycle ST.

For SuperSPARC, we can perform just one memory instruction per cycle, even
if up to two integer instructions can be executed in its pipeline.  For
programs that perform so many memory operations that there are not enough
non-memory operations to issue in parallel with all memory operations, using
LDD and STD when possible helps.

UltraSPARC-1/2 has very slow integer multiplication.  In the v9 subdirectory,
we therefore use floating-point multiplication.

STATUS

1. On a SuperSPARC, mpn_lshift and mpn_rshift run at 3 cycles/limb, or 2.5
   cycles/limb asymptotically.  We could optimize speed for special counts
   by using ADDXCC.

2. On a SuperSPARC, mpn_add_n and mpn_sub_n runs at 2.5 cycles/limb, or 2
   cycles/limb asymptotically.

3. mpn_mul_1 runs at what is believed to be optimal speed.

4. On SuperSPARC, mpn_addmul_1 and mpn_submul_1 could both be improved by a
   cycle by avoiding one of the add instructions.  See a29k/addmul_1.

The speed of the code for other SPARC implementations is uncertain.

========================================================================
* mpn/sparc64/README
========================================================================

Copyright 1997, 1999-2002 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.


This directory contains mpn functions for 64-bit V9 SPARC

RELEVANT OPTIMIZATION ISSUES

Notation:
  IANY = shift/add/sub/logical/sethi
  IADDLOG = add/sub/logical/sethi
  MEM = ld*/st*
  FA = fadd*/fsub*/f*to*/fmov*
  FM = fmul*

UltraSPARC can issue four instructions per cycle, with these restrictions:
* Two IANY instructions, but only one of these may be a shift.  If there is a
  shift and an IANY instruction, the shift must precede the IANY instruction.
* One FA.
* One FM.
* One branch.
* One MEM.
* IANY/IADDLOG/MEM must be insn 1, 2, or 3 in an issue bundle.  Taken branches
  should not be in slot 4, since that makes the delay insn come from separate
  bundle.
* If two IANY/IADDLOG instructions are to be executed in the same cycle and one
  of these is setting the condition codes, that instruction must be the second
  one.

To summarize, ignoring branches, these are the bundles that can reach the peak
execution speed:

insn1	iany	iany	mem	iany	iany	mem	iany	iany	mem
insn2	iaddlog	mem	iany	mem	iaddlog	iany	mem	iaddlog	iany
insn3	mem	iaddlog	iaddlog	fa	fa	fa	fm	fm	fm
insn4	fa/fm	fa/fm	fa/fm	fm	fm	fm	fa	fa	fa

The 64-bit integer multiply instruction mulx takes from 5 cycles to 35 cycles,
depending on the position of the most significant bit of the first source
operand.  When used for 32x32->64 multiplication, it needs 20 cycles.
Furthermore, it stalls the processor while executing.  We stay away from that
instruction, and instead use floating-point operations.

Floating-point add and multiply units are fully pipelined.  The latency for
UltraSPARC-1/2 is 3 cycles and for UltraSPARC-3 it is 4 cycles.

Integer conditional move instructions cannot dual-issue with other integer
instructions.  No conditional move can issue 1-5 cycles after a load.  (This
might have been fixed for UltraSPARC-3.)

The UltraSPARC-3 pipeline is very simular to the one of UltraSPARC-1/2 , but is
somewhat slower.  Branches execute slower, and there may be other new stalls.
But integer multiply doesn't stall the entire CPU and also has a much lower
latency.  But it's still not pipelined, and thus useless for our needs.

STATUS

* mpn_lshift, mpn_rshift: The current code runs at 2.0 cycles/limb on
  UltraSPARC-1/2 and 2.65 on UltraSPARC-3.  For UltraSPARC-1/2, the IEU0
  functional unit is saturated with shifts.

* mpn_add_n, mpn_sub_n: The current code runs at 4 cycles/limb on
  UltraSPARC-1/2 and 4.5 cycles/limb on UltraSPARC-3.  The 4 instruction
  recurrency is the speed limiter.

* mpn_addmul_1: The current code runs at 14 cycles/limb asymptotically on
  UltraSPARC-1/2 and 17.5 cycles/limb on UltraSPARC-3.  On UltraSPARC-1/2, the
  code sustains 4 instructions/cycle.  It might be possible to invent a better
  way of summing the intermediate 49-bit operands, but it is unlikely that it
  will save enough instructions to save an entire cycle.

  The load-use of the u operand is not enough scheduled for good L2 cache
  performance.  The UltraSPARC-1/2 L1 cache is direct mapped, and since we use
  temporary stack slots that will conflict with the u and r operands, we miss
  to L2 very often.  The load-use of the std/ldx pairs via the stack are
  perhaps over-scheduled.

  It would be possible to save two instructions: (1) The mov could be avoided
  if the std/ldx were less scheduled.  (2) The ldx of the r operand could be
  split into two ld instructions, saving the shifts/masks.

  It should be possible to reach 14 cycles/limb for UltraSPARC-3 if the fp
  operations where rescheduled for this processor's 4-cycle latency.

* mpn_mul_1: The current code is a straightforward edit of the mpn_addmul_1
  code.  It would be possible to shave one or two cycles from it, with some
  labour.

* mpn_submul_1: Simpleminded code just calling mpn_mul_1 + mpn_sub_n.  This
  means that it runs at 18 cycles/limb on UltraSPARC-1/2 and 23 cycles/limb on
  UltraSPARC-3.  It would be possible to either match the mpn_addmul_1
  performance, or in the worst case use one more instruction group.

* US1/US2 cache conflict resolving.  The direct mapped L1 date cache of US1/US2
  is a problem for mul_1, addmul_1 (and a prospective submul_1).  We should
  allocate a larger cache area, and put the stack temp area in a place that
  doesn't cause cache conflicts.

========================================================================
* mpn/x86/README
========================================================================

Copyright 1999-2002 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.


                      X86 MPN SUBROUTINES


This directory contains mpn functions for various 80x86 chips.


CODE ORGANIZATION

	x86               i386, generic
	x86/i486          i486
	x86/pentium       Intel Pentium (P5, P54)
	x86/pentium/mmx   Intel Pentium with MMX (P55)
	x86/p6            Intel Pentium Pro
	x86/p6/mmx        Intel Pentium II, III
	x86/p6/p3mmx      Intel Pentium III
	x86/k6            \ AMD K6
	x86/k6/mmx        /
	x86/k6/k62mmx     AMD K6-2
	x86/k7            \ AMD Athlon
	x86/k7/mmx        /
	x86/pentium4      \
	x86/pentium4/mmx  | Intel Pentium 4
	x86/pentium4/sse2 /


The top-level x86 directory contains blended style code, meant to be
reasonable on all x86s.


STATUS

The code is well-optimized for AMD and Intel chips, but there's nothing
specific for Cyrix chips, nor for actual 80386 and 80486 chips.


ASM FILES

The x86 .asm files are BSD style assembler code, first put through m4 for
macro processing.  The generic mpn/asm-defs.m4 is used, together with
mpn/x86/x86-defs.m4.  See comments in those files.

The code is meant for use with GNU "gas" or a system "as".  There's no
support for assemblers that demand Intel style code.


STACK FRAME

m4 macros are used to define the parameters passed on the stack, and these
act like comments on what the stack frame looks like too.  For example,
mpn_mul_1() has the following.

        defframe(PARAM_MULTIPLIER, 16)
        defframe(PARAM_SIZE,       12)
        defframe(PARAM_SRC,         8)
        defframe(PARAM_DST,         4)

PARAM_MULTIPLIER becomes `FRAME+16(%esp)', and the others similarly.  The
return address is at offset 0, but there's not normally any need to access
that.

FRAME is redefined as necessary through the code so it's the number of bytes
pushed on the stack, and hence the offsets in the parameter macros stay
correct.  At the start of a routine FRAME should be zero.

        deflit(`FRAME',0)
	...
	deflit(`FRAME',4)
	...
	deflit(`FRAME',8)
	...

Helper macros FRAME_pushl(), FRAME_popl(), FRAME_addl_esp() and
FRAME_subl_esp() exist to adjust FRAME for the effect of those instructions,
and can be used instead of explicit definitions if preferred.
defframe_pushl() is a combination FRAME_pushl() and defframe().

There's generally some slackness in redefining FRAME.  If new values aren't
going to get used then the redefinitions are omitted to keep from cluttering
up the code.  This happens for instance at the end of a routine, where there
might be just four pops and then a ret, so FRAME isn't getting used.

Local variables and saved registers can be similarly defined, with negative
offsets representing stack space below the initial stack pointer.  For
example,

	defframe(SAVE_ESI,   -4)
	defframe(SAVE_EDI,   -8)
	defframe(VAR_COUNTER,-12)

	deflit(STACK_SPACE, 12)

Here STACK_SPACE gets used in a "subl $STACK_SPACE, %esp" to allocate the
space, and that instruction must be followed by a redefinition of FRAME
(setting it equal to STACK_SPACE) to reflect the change in %esp.

Definitions for pushed registers are only put in when they're going to be
used.  If registers are just saved and restored with pushes and pops then
definitions aren't made.


ASSEMBLER EXPRESSIONS

Only addition and subtraction seem to be universally available, certainly
that's all the Solaris 8 "as" seems to accept.  If expressions are wanted
then m4 eval() should be used.

In particular note that a "/" anywhere in a line starts a comment in Solaris
"as", and in some configurations of gas too.

	addl	$32/2, %eax           <-- wrong

	addl	$eval(32/2), %eax     <-- right

Binutils gas/config/tc-i386.c has a choice between "/" being a comment
anywhere in a line, or only at the start.  FreeBSD patches 2.9.1 to select
the latter, and from 2.9.5 it's the default for GNU/Linux too.


ASSEMBLER COMMENTS

Solaris "as" doesn't support "#" commenting, using /* */ instead.  For that
reason "C" commenting is used (see asm-defs.m4) and the intermediate ".s"
files have no comments.

Any comments before include(`../config.m4') must use m4 "dnl", since it's
only after the include that "C" is available.  By convention "dnl" is also
used for comments about m4 macros.


TEMPORARY LABELS

Temporary numbered labels like "1:" used as "1f" or "1b" are available in
"gas" and Solaris "as", but not in SCO "as".  Normal L() labels should be
used instead, possibly with a counter to make them unique, see jadcl0() in
x86-defs.m4 for instance.  A separate counter for each macro makes it
possible to nest them, for instance movl_text_address() can be used within
an ASSERT().

"1:" etc must be avoided in gcc __asm__ blocks too.  "%=" for generating a
unique number looks like a good alternative, but is that actually a
documented feature?  In any case this problem doesn't currently arise.


ZERO DISPLACEMENTS

In a couple of places addressing modes like 0(%ebx) with a byte-sized zero
displacement are wanted, rather than (%ebx) with no displacement.  These are
either for computed jumps or to get desirable code alignment.  Explicit
.byte sequences are used to ensure the assembler doesn't turn 0(%ebx) into
(%ebx).  The Zdisp() macro in x86-defs.m4 is used for this.

Current gas 2.9.5 or recent 2.9.1 leave 0(%ebx) as written, but old gas
1.92.3 changes it.  In general changing would be the sort of "optimization"
an assembler might perform, hence explicit ".byte"s are used where
necessary.


SHLD/SHRD INSTRUCTIONS

The %cl count forms of double shift instructions like "shldl %cl,%eax,%ebx"
must be written "shldl %eax,%ebx" for some assemblers.  gas takes either,
Solaris "as" doesn't allow %cl, gcc generates %cl for gas and NeXT (which is
gas), and omits %cl elsewhere.

For GMP an autoconf test GMP_ASM_X86_SHLDL_CL is used to determine whether
%cl should be used, and the macros shldl, shrdl, shldw and shrdw in
mpn/x86/x86-defs.m4 pass through or omit %cl as necessary.  See the comments
with those macros for usage.


IMUL INSTRUCTION

GCC config/i386/i386.md (cvs rev 1.187, 21 Oct 00) under *mulsi3_1 notes
that the following two forms produce identical object code

	imul	$12, %eax
	imul	$12, %eax, %eax

but that the former isn't accepted by some assemblers, in particular the SCO
OSR5 COFF assembler.  GMP follows GCC and uses only the latter form.

(This applies only to immediate operands, the three operand form is only
valid with an immediate.)


DIRECTION FLAG

The x86 calling conventions say that the direction flag should be clear at
function entry and exit.  (See iBCS2 and SVR4 ABI books, references below.)
Although this has been so since the year dot, it's not absolutely clear
whether it's universally respected.  Since it's better to be safe than
sorry, GMP follows glibc and does a "cld" if it depends on the direction
flag being clear.  This happens only in a few places.


POSITION INDEPENDENT CODE

  Coding Style

    Defining the symbol PIC in m4 processing selects SVR4 / ELF style
    position independent code.  This is necessary for shared libraries
    because they can be mapped into different processes at different virtual
    addresses.  Actually, relocations are allowed but text pages with
    relocations aren't shared, defeating the purpose of a shared library.

    The GOT is used to access global data, and the PLT is used for
    functions.  The use of the PLT adds a fixed cost to every function call,
    and the GOT adds a cost to any function accessing global variables.
    These are small but might be noticeable when working with small
    operands.

  Scope

    It's intended, as a matter of policy, that references within libgmp are
    resolved within libgmp.  Certainly there's no need for an application to
    replace any internals, and we take the view that there's no value in an
    application subverting anything documented either.

    Resolving references within libgmp in theory means calls can be made with a
    plain PC-relative call instruction, which is faster and smaller than going
    through the PLT, and data references can be similarly PC-relative, saving a
    GOT entry and fetch from there.  Unfortunately the normal linker behaviour
    doesn't allow us to do this.

    By default an R_386_PC32 PC-relative reference, either for a call or for
    data, is left in libgmp.so by the linker so that it can be resolved at
    runtime to a location in the application or another shared library.  This
    means a text segment relocation which we don't want.

  -Bsymbolic

    Under the "-Bsymbolic" option, the linker resolves references to symbols
    within libgmp.so.  This gives us the desired effect for R_386_PC32,
    ie. it's resolved at link time.  It also resolves R_386_PLT32 calls
    directly to their target without creating a PLT entry (though if this is
    done to normal compiler-generated code it still leaves a setup of %ebx
    to _GLOBAL_OFFSET_TABLE_ which may then be unnecessary).

    Unfortunately -Bsymbolic does bad things to global variables defined in
    a shared library but accessed by non-PIC code from the mainline (or a
    static library).

    The problem is that the mainline needs a fixed data address to avoid
    text segment relocations, so space is allocated in its data segment and
    the value from the variable is copied from the shared library's data
    segment when the library is loaded.  Under -Bsymbolic, however,
    references in the shared library are then resolved still to the shared
    library data area.  Not surprisingly it bombs badly to have mainline
    code and library code accessing different locations for what should be
    one variable.

    Note that this -Bsymbolic effect for the shared library is not just for
    R_386_PC32 offsets which might have been cooked up in assembler, but is
    done also for the contents of GOT entries.  -Bsymbolic simply applies a
    general rule that symbols are resolved first from the local module.

  Visibility Attributes

    GCC __attribute__ ((visibility ("protected"))), which is available in
    recent versions, eg. 3.3, is probably what we'd like to use.  It makes
    gcc generate plain PC-relative calls to indicated functions, and directs
    the linker to resolve references to the given function within the link
    module.

    Unfortunately, as of debian binutils 2.13.90.0.16 at least, the
    resulting libgmp.so comes out with text segment relocations, references
    are not resolved at link time.  If the gcc description is to be believed
    this is this not how it should work.  If a symbol cannot be overridden
    by another module then surely references within that module can be
    resolved immediately (ie. at link time).

  Present

    In any case, all this means that we have no optimizations we can
    usefully make to function or variable usages, neither for assembler nor
    C code.  Perhaps in the future the visibility attribute will work as
    we'd like.


GLOBAL OFFSET TABLE

The magic _GLOBAL_OFFSET_TABLE_ used by code establishing the address of the
GOT sometimes requires an extra underscore prefix.  SVR4 systems and NetBSD
don't need a prefix, OpenBSD does need one.  Note that NetBSD and OpenBSD
are both a.out underscore systems, so the prefix for _GLOBAL_OFFSET_TABLE_
is not simply the same as the prefix for ordinary globals.

In any case in the asm code we write _GLOBAL_OFFSET_TABLE_ and let a macro
in x86-defs.m4 add an extra underscore if required (according to a configure
test).

Old gas 1.92.3 which comes with FreeBSD 2.2.8 gets a segmentation fault when
asked to assemble the following,

        L1:
            addl  $_GLOBAL_OFFSET_TABLE_+[.-L1], %ebx

It seems that using the label in the same instruction it refers to is the
problem, since a nop in between works.  But the simplest workaround is to
follow gcc and omit the +[.-L1] since it does nothing,

            addl  $_GLOBAL_OFFSET_TABLE_, %ebx

Current gas 2.10 generates incorrect object code when %eax is used in such a
construction (with or without +[.-L1]),

            addl  $_GLOBAL_OFFSET_TABLE_, %eax

The R_386_GOTPC gets a displacement of 2 rather than the 1 appropriate for
the 1 byte opcode of "addl $n,%eax".  The best workaround is just to use any
other register, since then it's a two byte opcode+mod/rm.  GCC for example
always uses %ebx (which is needed for calls through the PLT).

A similar problem occurs in an leal (again with or without a +[.-L1]),

            leal  _GLOBAL_OFFSET_TABLE_(%edi), %ebx

This time the R_386_GOTPC gets a displacement of 0 rather than the 2
appropriate for the opcode and mod/rm, making this form unusable.


SIMPLE LOOPS

The overheads in setting up for an unrolled loop can mean that at small
sizes a simple loop is faster.  Making small sizes go fast is important,
even if it adds a cycle or two to bigger sizes.  To this end various
routines choose between a simple loop and an unrolled loop according to
operand size.  The path to the simple loop, or to special case code for
small sizes, is always as fast as possible.

Adding a simple loop requires a conditional jump to choose between the
simple and unrolled code.  The size of a branch misprediction penalty
affects whether a simple loop is worthwhile.

The convention is for an m4 definition UNROLL_THRESHOLD to set the crossover
point, with sizes < UNROLL_THRESHOLD using the simple loop, sizes >=
UNROLL_THRESHOLD using the unrolled loop.  If position independent code adds
a couple of cycles to an unrolled loop setup, the threshold will vary with
PIC or non-PIC.  Something like the following is typical.

	deflit(UNROLL_THRESHOLD, ifdef(`PIC',10,8))

There's no automated way to determine the threshold.  Setting it to a small
value and then to a big value makes it possible to measure the simple and
unrolled loops each over a range of sizes, from which the crossover point
can be determined.  Alternately, just adjust the threshold up or down until
there's no more speedups.


UNROLLED LOOP CODING

The x86 addressing modes allow a byte displacement of -128 to +127, making
it possible to access 256 bytes, which is 64 limbs, without adjusting
pointer registers within the loop.  Dword sized displacements can be used
too, but they increase code size, and unrolling to 64 ought to be enough.

When unrolling to the full 64 limbs/loop, the limb at the top of the loop
will have a displacement of -128, so pointers have to have a corresponding
+128 added before entering the loop.  When unrolling to 32 limbs/loop
displacements 0 to 127 can be used with 0 at the top of the loop and no
adjustment needed to the pointers.

Where 64 limbs/loop is supported, the +128 adjustment is done only when 64
limbs/loop is selected.  Usually the gain in speed using 64 instead of 32 or
16 is small, so support for 64 limbs/loop is generally only for comparison.


COMPUTED JUMPS

When working from least significant limb to most significant limb (most
routines) the computed jump and pointer calculations in preparation for an
unrolled loop are as follows.

	S = operand size in limbs
	N = number of limbs per loop (UNROLL_COUNT)
	L = log2 of unrolling (UNROLL_LOG2)
	M = mask for unrolling (UNROLL_MASK)
	C = code bytes per limb in the loop
	B = bytes per limb (4 for x86)

	computed jump            (-S & M) * C + entrypoint
	subtract from pointers   (-S & M) * B
	initial loop counter     (S-1) >> L
	displacements            0 to B*(N-1)

The loop counter is decremented at the end of each loop, and the looping
stops when the decrement takes the counter to -1.  The displacements are for
the addressing accessing each limb, eg. a load with "movl disp(%ebx), %eax".

Usually the multiply by "C" can be handled without an imul, using instead an
leal, or a shift and subtract.

When working from most significant to least significant limb (eg. mpn_lshift
and mpn_copyd), the calculations change as follows.

	add to pointers          (-S & M) * B
	displacements            0 to -B*(N-1)


OLD GAS 1.92.3

This version comes with FreeBSD 2.2.8 and has a couple of gremlins that
affect GMP code.

Firstly, an expression involving two forward references to labels comes out
as zero.  For example,

		addl	$bar-foo, %eax
	foo:
		nop
	bar:

This should lead to "addl $1, %eax", but it comes out as "addl $0, %eax".
When only one forward reference is involved, it works correctly, as for
example,

	foo:
		addl	$bar-foo, %eax
		nop
	bar:

Secondly, an expression involving two labels can't be used as the
displacement for an leal.  For example,

	foo:
		nop
	bar:
		leal	bar-foo(%eax,%ebx,8), %ecx

A slightly cryptic error is given, "Unimplemented segment type 0 in
parse_operand".  When only one label is used it's ok, and the label can be a
forward reference too, as for example,

		leal	foo(%eax,%ebx,8), %ecx
		nop
	foo:

These problems only affect PIC computed jump calculations.  The workarounds
are just to do an leal without a displacement and then an addl, and to make
sure the code is placed so that there's at most one forward reference in the
addl.


REFERENCES

"Intel Architecture Software Developer's Manual", volumes 1, 2a, 2b, 3a, 3b,
2006, order numbers 253665 through 253669.  Available on-line,

	ftp://download.intel.com/design/Pentium4/manuals/25366518.pdf
	ftp://download.intel.com/design/Pentium4/manuals/25366618.pdf
	ftp://download.intel.com/design/Pentium4/manuals/25366718.pdf
	ftp://download.intel.com/design/Pentium4/manuals/25366818.pdf
	ftp://download.intel.com/design/Pentium4/manuals/25366918.pdf


"System V Application Binary Interface", Unix System Laboratories Inc, 1992,
published by Prentice Hall, ISBN 0-13-880410-9.  And the "Intel386 Processor
Supplement", AT&T, 1991, ISBN 0-13-877689-X.  These have details of calling
conventions and ELF shared library PIC coding.  Versions of both available
on-line,

	http://www.sco.com/developer/devspecs

"Intel386 Family Binary Compatibility Specification 2", Intel Corporation,
published by McGraw-Hill, 1991, ISBN 0-07-031219-2.  (Same as the above 386
ABI supplement.)


----------------
Local variables:
mode: text
fill-column: 76
End:

========================================================================
* mpn/x86/k6/README
========================================================================

Copyright 2000, 2001 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.


			AMD K6 MPN SUBROUTINES


This directory contains code optimized for AMD K6 CPUs, meaning K6, K6-2 and
K6-3.

The mmx subdirectory has MMX code suiting plain K6, the k62mmx subdirectory
has MMX code suiting K6-2 and K6-3.  All chips in the K6 family have MMX,
the separate directories are just so that ./configure can omit them if the
assembler doesn't support MMX.


STATUS

Times for the loops, with all code and data in L1 cache, are as follows.

                                 cycles/limb

	mpn_add_n/sub_n            3.25 normal, 2.75 in-place

	mpn_mul_1                  6.25
	mpn_add/submul_1           7.65-8.4  (varying with data values)

	mpn_mul_basecase           9.25 cycles/crossproduct (approx)
	mpn_sqr_basecase           4.7  cycles/crossproduct (approx)
                                   or 9.2 cycles/triangleproduct (approx)

	mpn_l/rshift               3.0

	mpn_divrem_1              20.0
	mpn_mod_1                 20.0
	mpn_divexact_by3          11.0

	mpn_copyi                  1.0
	mpn_copyd                  1.0


K6-2 and K6-3 have dual-issue MMX and get the following improvements.

	mpn_l/rshift               1.75


Prefetching of sources hasn't yet given any joy.  With the 3DNow "prefetch"
instruction, code seems to run slower, and with just "mov" loads it doesn't
seem faster.  Results so far are inconsistent.  The K6 does a hardware
prefetch of the second cache line in a sector, so the penalty for not
prefetching in software is reduced.


NOTES

All K6 family chips have MMX, but only K6-2 and K6-3 have 3DNow.

Plain K6 executes MMX instructions only in the X pipe, but K6-2 and K6-3 can
execute them in both X and Y (and in both together).

Branch misprediction penalty is 1 to 4 cycles (Optimization Manual
chapter 6 table 12).

Write-allocate L1 data cache means prefetching of destinations is unnecessary.
Store queue is 7 entries of 64 bits each.

Floating point multiplications can be done in parallel with integer
multiplications, but there doesn't seem to be any way to make use of this.


OPTIMIZATIONS

Unrolled loops are used to reduce looping overhead.  The unrolling is
configurable up to 32 limbs/loop for most routines, up to 64 for some.

Sometimes computed jumps into the unrolling are used to handle sizes not a
multiple of the unrolling.  An attractive feature of this is that times
smoothly increase with operand size, but an indirect jump is about 6 cycles
and the setups about another 6, so it depends on how much the unrolled code
is faster than a simple loop as to whether a computed jump ought to be used.

Position independent code is implemented using a call to get eip for
computed jumps and a ret is always done, rather than an addl $4,%esp or a
popl, so the CPU return address branch prediction stack stays synchronised
with the actual stack in memory.  Such a call however still costs 4 to 7
cycles.

Branch prediction, in absence of any history, will guess forward jumps are
not taken and backward jumps are taken.  Where possible it's arranged that
the less likely or less important case is under a taken forward jump.


MMX

Putting emms or femms as late as possible in a routine seems to be fastest.
Perhaps an emms or femms stalls until all outstanding MMX instructions have
completed, so putting it later gives them a chance to complete on their own,
in parallel with other operations (like register popping).

The Optimization Manual chapter 5 recommends using a femms on K6-2 and K6-3
at the start of a routine, in case it's been preceded by x87 floating point
operations.  This isn't done because in gmp programs it's expected that x87
floating point won't be much used and that chances are an mpn routine won't
have been preceded by any x87 code.


CODING

Instructions in general code are shown paired if they can decode and execute
together, meaning two short decode instructions with the second not
depending on the first, only the first using the shifter, no more than one
load, and no more than one store.

K6 does some out of order execution so the pairings aren't essential, they
just show what slots might be available.  When decoding is the limiting
factor things can be scheduled that might not execute until later.


NOTES

Code alignment

- if an opcode/modrm or 0Fh/opcode/modrm crosses a cache line boundary,
  short decode is inhibited.  The cross.pl script detects this.

- loops and branch targets should be aligned to 16 bytes, or ensure at least
  2 instructions before a 32 byte boundary.  This makes use of the 16 byte
  cache in the BTB.

Addressing modes

- (%esi) degrades decoding from short to vector.  0(%esi) doesn't have this
  problem, and can be used as an equivalent, or easier is just to use a
  different register, like %ebx.

- K6 and pre-CXT core K6-2 have the following problem.  (K6-2 CXT and K6-3
  have it fixed, these being cpuid function 1 signatures 0x588 to 0x58F).

  If more than 3 bytes are needed to determine instruction length then
  decoding degrades from direct to long, or from long to vector.  This
  happens with forms like "0F opcode mod/rm" with mod/rm=00-xxx-100 since
  with mod=00 the sib determines whether there's a displacement.

  This affects all MMX and 3DNow instructions, and others with an 0F prefix,
  like movzbl.  The modes affected are anything with an index and no
  displacement, or an index but no base, and this includes (%esp) which is
  really (,%esp,1).

  The cross.pl script detects problem cases.  The workaround is to always
  use a displacement, and to do this with Zdisp if it's zero so the
  assembler doesn't discard it.

  See Optimization Manual rev D page 67 and 3DNow Porting Guide rev B pages
  13-14 and 36-37.

Calls

- indirect jumps and calls are not branch predicted, they measure about 6
  cycles.

Various

- adcl      2 cycles of decode, maybe 2 cycles executing in the X pipe
- bsf       12-27 cycles
- emms      5 cycles
- femms     3 cycles
- jecxz     2 cycles taken, 13 not taken (optimization manual says 7 not taken)
- divl      20 cycles back-to-back
- imull     2 decode, 3 execute
- mull      2 decode, 3 execute (optimization manual decoding sample)
- prefetch  2 cycles
- rcll/rcrl implicit by one bit: 2 cycles
            immediate or %cl count: 11 + 2 per bit for dword
                                    13 + 4 per bit for byte
- setCC	    2 cycles
- xchgl	%eax,reg  1.5 cycles, back-to-back (strange)
        reg,reg   2 cycles, back-to-back


REFERENCES

"AMD-K6 Processor Code Optimization Application Note", AMD publication
number 21924, revision D amendment 0, January 2000.  This describes K6-2 and
K6-3.  Available on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/21924.pdf

"AMD-K6 MMX Enhanced Processor x86 Code Optimization Application Note", AMD
publication number 21828, revision A amendment 0, August 1997.  This is an
older edition of the above document, describing plain K6.  Available
on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/21828.pdf

"3DNow Technology Manual", AMD publication number 21928G/0-March 2000.
This describes the femms and prefetch instructions, but nothing else from
3DNow has been used.  Available on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/21928.pdf

"3DNow Instruction Porting Guide", AMD publication number 22621, revision B,
August 1999.  This has some notes on general K6 optimizations as well as
3DNow.  Available on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22621.pdf


----------------
Local variables:
mode: text
fill-column: 76
End:

========================================================================
* mpn/x86/k7/README
========================================================================

Copyright 2000, 2001 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.


                      AMD K7 MPN SUBROUTINES


This directory contains code optimized for the AMD Athlon CPU.

The mmx subdirectory has routines using MMX instructions.  All Athlons have
MMX, the separate directory is just so that configure can omit it if the
assembler doesn't support MMX.


STATUS

Times for the loops, with all code and data in L1 cache.

                               cycles/limb
	mpn_add/sub_n             1.6

	mpn_copyi                 0.75 or 1.0   \ varying with data alignment
	mpn_copyd                 0.75 or 1.0   /

	mpn_divrem_1             17.0 integer part, 15.0 fractional part
	mpn_mod_1                17.0
	mpn_divexact_by3          8.0

	mpn_l/rshift              1.2

	mpn_mul_1                 3.4
	mpn_addmul/submul_1       3.9

	mpn_mul_basecase          4.42 cycles/crossproduct (approx)
        mpn_sqr_basecase          2.3 cycles/crossproduct (approx)
				  or 4.55 cycles/triangleproduct (approx)

Prefetching of sources hasn't yet been tried.


NOTES

cmov, MMX, 3DNow and some extensions to MMX and 3DNow are available.

Write-allocate L1 data cache means prefetching of destinations is unnecessary.

Floating point multiplications can be done in parallel with integer
multiplications, but there doesn't seem to be any way to make use of this.

Unsigned "mul"s can be issued every 3 cycles.  This suggests 3 is a limit on
the speed of the multiplication routines.  The documentation shows mul
executing in IEU0 (or maybe in IEU0 and IEU1 together), so it might be that,
to get near 3 cycles code has to be arranged so that nothing else is issued
to IEU0.  A busy IEU0 could explain why some code takes 4 cycles and other
apparently equivalent code takes 5.


OPTIMIZATIONS

Unrolled loops are used to reduce looping overhead.  The unrolling is
configurable up to 32 limbs/loop for most routines and up to 64 for some.
The K7 has 64k L1 code cache so quite big unrolling is allowable.

Computed jumps into the unrolling are used to handle sizes not a multiple of
the unrolling.  An attractive feature of this is that times increase
smoothly with operand size, but it may be that some routines should just
have simple loops to finish up, especially when PIC adds between 2 and 16
cycles to get %eip.

Position independent code is implemented using a call to get %eip for the
computed jumps and a ret is always done, rather than an addl $4,%esp or a
popl, so the CPU return address branch prediction stack stays synchronised
with the actual stack in memory.

Branch prediction, in absence of any history, will guess forward jumps are
not taken and backward jumps are taken.  Where possible it's arranged that
the less likely or less important case is under a taken forward jump.


CODING

Instructions in general code have been shown grouped if they can execute
together, which means up to three direct-path instructions which have no
successive dependencies.  K7 always decodes three and has out-of-order
execution, but the groupings show what slots might be available and what
dependency chains exist.

When there's vector-path instructions an effort is made to get triplets of
direct-path instructions in between them, even if there's dependencies,
since this maximizes decoding throughput and might save a cycle or two if
decoding is the limiting factor.


INSTRUCTIONS

adcl       direct
divl       39 cycles back-to-back
lodsl,etc  vector
loop       1 cycle vector (decl/jnz opens up one decode slot)
movd reg   vector
movd mem   direct
mull       issue every 3 cycles, latency 4 cycles low word, 6 cycles high word
popl	   vector (use movl for more than one pop)
pushl	   direct, will pair with a load
shrdl %cl  vector, 3 cycles, seems to be 3 decode too
xorl r,r   false read dependency recognised


REFERENCES

"AMD Athlon Processor X86 Code Optimization Guide", AMD publication number
22007, revision K, February 2002.  Available on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf

"3DNow Technology Manual", AMD publication number 21928G/0-March 2000.
This describes the femms and prefetch instructions.  Available on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/21928.pdf

"AMD Extensions to the 3DNow and MMX Instruction Sets Manual", AMD
publication number 22466, revision D, March 2000.  This describes
instructions added in the Athlon processor, such as pswapd and the extra
prefetch forms.  Available on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22466.pdf

"3DNow Instruction Porting Guide", AMD publication number 22621, revision B,
August 1999.  This has some notes on general Athlon optimizations as well as
3DNow.  Available on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22621.pdf


----------------
Local variables:
mode: text
fill-column: 76
End:

========================================================================
* mpn/x86/p6/README
========================================================================

Copyright 2000, 2001 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.


                      INTEL P6 MPN SUBROUTINES


This directory contains code optimized for Intel P6 class CPUs, meaning
PentiumPro, Pentium II and Pentium III.  The mmx and p3mmx subdirectories
have routines using MMX instructions.


STATUS

Times for the loops, with all code and data in L1 cache, are as follows.
Some of these might be able to be improved.

                               cycles/limb

	mpn_add_n/sub_n           3.7

	mpn_copyi                 0.75
	mpn_copyd                 1.75 (or 0.75 if no overlap)

	mpn_divrem_1             39.0
	mpn_mod_1                21.5
	mpn_divexact_by3          8.5

	mpn_mul_1                 5.5
	mpn_addmul/submul_1       6.35

	mpn_l/rshift              2.5

	mpn_mul_basecase          8.2 cycles/crossproduct (approx)
	mpn_sqr_basecase          4.0 cycles/crossproduct (approx)
				  or 7.75 cycles/triangleproduct (approx)

Pentium II and III have MMX and get the following improvements.

	mpn_divrem_1             25.0 integer part, 17.5 fractional part

	mpn_l/rshift              1.75


NOTES

Write-allocate L1 data cache means prefetching of destinations is unnecessary.

Mispredicted branches have a penalty of between 9 and 15 cycles, and even up
to 26 cycles depending how far speculative execution has gone.  The 9 cycle
minimum penalty comes from the issue pipeline being 9 stages.

A copy with rep movs seems to copy 16 bytes at a time, since speeds for 4,
5, 6 or 7 limb operations are all the same.  The 0.75 cycles/limb would be 3
cycles per 16 byte block.


CODING

Instructions in general code have been shown grouped if they can execute
together, which means up to three instructions with no successive
dependencies, and with only the first being a multiple micro-op.

P6 has out-of-order execution, so the groupings are really only showing
dependent paths where some shuffling might allow some latencies to be
hidden.


REFERENCES

"Intel Architecture Optimization Reference Manual", 1999, revision 001 dated
02/99, order number 245127 (order number 730795-001 is in the document too).
Available on-line:

	http://download.intel.com/design/PentiumII/manuals/245127.htm

"Intel Architecture Optimization Manual", 1997, order number 242816.  This
is an older document mostly about P5 and not as good as the above.
Available on-line:

	http://download.intel.com/design/PentiumII/manuals/242816.htm


----------------
Local variables:
mode: text
fill-column: 76
End:

========================================================================
* mpn/x86/pentium/README
========================================================================

Copyright 1996, 1999-2001, 2003 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.


                   INTEL PENTIUM P5 MPN SUBROUTINES


This directory contains mpn functions optimized for Intel Pentium (P5,P54)
processors.  The mmx subdirectory has additional code for Pentium with MMX
(P55).


STATUS

                                cycles/limb

	mpn_add_n/sub_n            2.375

	mpn_mul_1                 12.0
	mpn_add/submul_1          14.0

	mpn_mul_basecase          14.2 cycles/crossproduct (approx)

	mpn_sqr_basecase           8 cycles/crossproduct (approx)
                                   or 15.5 cycles/triangleproduct (approx)

	mpn_l/rshift               5.375 normal (6.0 on P54)
				   1.875 special shift by 1 bit

	mpn_divrem_1              44.0
	mpn_mod_1                 28.0
	mpn_divexact_by3          15.0

	mpn_copyi/copyd            1.0

Pentium MMX gets the following improvements

	mpn_l/rshift               1.75

	mpn_mul_1                 12.0 normal, 7.0 for 16-bit multiplier


mpn_add_n and mpn_sub_n run at asymptotically 2 cycles/limb.  Due to loop
overhead and other delays (cache refill?), they run at or near 2.5
cycles/limb.

mpn_mul_1, mpn_addmul_1, mpn_submul_1 all run 1 cycle faster than they
should.  Intel documentation says a mul instruction is 10 cycles, but it
measures 9 and the routines using it run as 9.


P55 MMX AND X87

The cost of switching between MMX and x87 floating point on P55 is about 100
cycles (fld1/por/emms for instance).  In order to avoid that the two aren't
mixed and currently that means using MMX and not x87.

MMX offers a big speedup for lshift and rshift, and a nice speedup for
16-bit multipliers in mpn_mul_1.  If fast code using x87 is found then
perhaps the preference for MMX will be reversed.


P54 SHLDL

mpn_lshift and mpn_rshift run at about 6 cycles/limb on P5 and P54, but the
documentation indicates that they should take only 43/8 = 5.375 cycles/limb,
or 5 cycles/limb asymptotically.  The P55 runs them at the expected speed.

It seems that on P54 a shldl or shrdl allows pairing in one following cycle,
but not two.  For example, back to back repetitions of the following

	shldl(	%cl, %eax, %ebx)
	xorl	%edx, %edx
	xorl	%esi, %esi

run at 5 cycles, as expected, but repetitions of the following run at 7
cycles, whereas 6 would be expected (and is achieved on P55),

	shldl(	%cl, %eax, %ebx)
	xorl	%edx, %edx
	xorl	%esi, %esi
	xorl	%edi, %edi
	xorl	%ebp, %ebp

Three xorls run at 7 cycles too, so it doesn't seem to be just that pairing
inhibited is only in the second following cycle (or something like that).

Avoiding this problem would bring P54 shifts down from 6.0 c/l to 5.5 with a
pattern of shift, 2 loads, shift, 2 stores, shift, etc.  A start has been
made on something like that, but it's not yet complete.


OTHER NOTES

Prefetching Destinations

    Pentium doesn't allocate cache lines on writes, unlike most other modern
    processors.  Since the functions in the mpn class do array writes, we
    have to handle allocating the destination cache lines by reading a word
    from it in the loops, to achieve the best performance.

Prefetching Sources

    Prefetching of sources is pointless since there's no out-of-order loads.
    Any load instruction blocks until the line is brought to L1, so it may
    as well be the load that wants the data which blocks.

Data Cache Bank Clashes

    Pairing of memory operations requires that the two issued operations
    refer to different cache banks (ie. different addresses modulo 32
    bytes).  The simplest way to ensure this is to read/write two words from
    the same object.  If we make operations on different objects, they might
    or might not be to the same cache bank.

PIC %eip Fetching

    A simple call $+5 and popl can be used to get %eip, there's no need to
    balance calls and returns since P5 doesn't have any return stack branch
    prediction.

Float Multiplies

    fmul is pairable and can be issued every 2 cycles (with a 4 cycle
    latency for data ready to use).  This is a lot better than integer mull
    or imull at 9 cycles non-pairing.  Unfortunately the advantage is
    quickly eaten away by needing to throw data through memory back to the
    integer registers to adjust for fild and fist being signed, and to do
    things like propagating carry bits.


REFERENCES

"Intel Architecture Optimization Manual", 1997, order number 242816.  This
is mostly about P5, the parts about P6 aren't relevant.  Available on-line:

        http://download.intel.com/design/PentiumII/manuals/242816.htm


----------------
Local variables:
mode: text
fill-column: 76
End:

========================================================================
* mpn/x86/pentium4/README
========================================================================

Copyright 2001 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.


                   INTEL PENTIUM-4 MPN SUBROUTINES


This directory contains mpn functions optimized for Intel Pentium-4.

The mmx subdirectory has routines using MMX instructions, the sse2
subdirectory has routines using SSE2 instructions.  All P4s have these, the
separate directories are just so configure can omit that code if the
assembler doesn't support it.


STATUS

                                cycles/limb

	mpn_add_n/sub_n            4 normal, 6 in-place

	mpn_mul_1                  4 normal, 6 in-place
	mpn_addmul_1               6
	mpn_submul_1               7

	mpn_mul_basecase           6 cycles/crossproduct (approx)

	mpn_sqr_basecase           3.5 cycles/crossproduct (approx)
                                   or 7.0 cycles/triangleproduct (approx)

	mpn_l/rshift               1.75


The shifts ought to be able to go at 1.5 c/l, but not much effort has been
applied to them yet.

In-place operations, and all addmul, submul, mul_basecase and sqr_basecase
calls, suffer from pipeline anomalies associated with write combining and
movd reads and writes to the same or nearby locations.  The movq
instructions do not trigger the same hardware problems.  Unfortunately,
using movq and splitting/combining seems to require too many extra
instructions to help.  Perhaps future chip steppings will be better.


NOTES

The Pentium-4 pipeline "Netburst", provides for quite a number of surprises.
Many traditional x86 instructions run very slowly, requiring use of
alterative instructions for acceptable performance.

adcl and sbbl are quite slow at 8 cycles for reg->reg.  paddq of 32-bits
within a 64-bit mmx register seems better, though the combination
paddq/psrlq when propagating a carry is still a 4 cycle latency.

incl and decl should be avoided, instead use add $1 and sub $1.  Apparently
the carry flag is not separately renamed, so incl and decl depend on all
previous flags-setting instructions.

shll and shrl have a 4 cycle latency, or 8 times the latency of the fastest
integer instructions (addl, subl, orl, andl, and some more).  shldl and
shrdl seem to have 13 and 15 cycles latency, respectively.  Bizarre.

movq mmx -> mmx does have 6 cycle latency, as noted in the documentation.
pxor/por or similar combination at 2 cycles latency can be used instead.
The movq however executes in the float unit, thereby saving MMX execution
resources.  With the right juggling, data moves shouldn't be on a dependent
chain.

L1 is write-through, but the write-combining sounds like it does enough to
not require explicit destination prefetching.

xmm registers so far haven't found a use, but not much effort has been
expended.  A configure test for whether the operating system knows
fxsave/fxrestor will be needed if they're used.


REFERENCES

Intel Pentium-4 processor manuals,

	http://developer.intel.com/design/pentium4/manuals

"Intel Pentium 4 Processor Optimization Reference Manual", Intel, 2001,
order number 248966.  Available on-line:

	http://developer.intel.com/design/pentium4/manuals/248966.htm


----------------
Local variables:
mode: text
fill-column: 76
End:

========================================================================
* mpn/x86_64/README
========================================================================

Copyright 2003, 2004, 2006, 2008 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.


			AMD64 MPN SUBROUTINES


This directory contains mpn functions for AMD64 chips.  It is also useful
for 64-bit Pentiums, and "Core 2".


		     RELEVANT OPTIMIZATION ISSUES

The Opteron and Athlon64 can sustain up to 3 instructions per cycle, but in
practice that is only possible for integer instructions.  But almost any
three integer instructions can issue simultaneously, including any 3 ALU
operations, including shifts.  Up to two memory operations can issue each
cycle.

Scheduling typically requires that load-use instructions are split into
separate load and use instructions.  That requires more decode resources,
and it is rarely a win.  Opteron/Athlon64 have deep out-of-order core.


Optimizing for 64-bit Pentium4 is probably a waste of time, as the most
critical instructions are very poorly implemented here.  Perhaps we could
save a cycle or two, but the most common loops now run at between 10 and 22
cycles, so a saved cycle isn't too exciting.


The new spin of the venerable P6 core, the "Core 2" is much better than the
Pentium4 for the GMP loops.  Its integer pipeline is somewhat similar to to
the Opteron/Athlon64 pipeline, except that the GMP favourites ADC/SBB and
MUL are slower.  Furthermore, an INC/DEC followed by ADC/SBB incur a
pipeline stall of around 10 cycles.  The default mpn_add_n and mpn_sub_n
code suffers badly from the stall.  The code in the core2 subdirectory uses
the almost forgotten instruction JRCXZ for loop control, and updates the
induction variable using LEA.


REFERENCES

"System V Application Binary Interface AMD64 Architecture Processor
Supplement", draft version 0.99, December 2007.
http://www.x86-64.org/documentation/abi.pdf

========================================================================
* tune/README
========================================================================

Copyright 2000-2002, 2004 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.


               GMP SPEED MEASURING AND PARAMETER TUNING


The programs in this directory are for knowledgeable users who want to
measure GMP routines on their machine, and perhaps tweak some settings or
identify things that can be improved.

The programs here are tools, not ready to run solutions.  Nothing is built
in a normal "make all", but various Makefile targets described below exist.

Relatively few systems and CPUs have been tested, so be sure to verify that
results are sensible before relying on them.


MISCELLANEOUS NOTES

--enable-assert

    Don't configure with --enable-assert, since the extra code added by
    assertion checking may influence measurements.

Direct mapped caches

    Some effort has been made to accommodate CPUs with direct mapped caches,
    by putting data blocks more or less contiguously on the stack.  But this
    will depend on TMP_ALLOC using alloca, and even then it may or may not
    be enough.

FreeBSD 4.2 i486 getrusage

    This getrusage seems to be a bit doubtful, it looks like it's
    microsecond accurate, but sometimes ru_utime remains unchanged after a
    time of many microseconds has elapsed.  It'd be good to detect this in
    the time.c initializations, but for now the suggestion is to pretend it
    doesn't exist.

        ./configure ac_cv_func_getrusage=no

NetBSD 1.4.1 m68k macintosh time base

    On this system it's been found getrusage often goes backwards, making it
    unusable (time.c getrusage_backwards_p detects this).  gettimeofday
    sometimes doesn't update atomically when it crosses a 1 second boundary.
    Not sure what to do about this.  Expect possible intermittent failures.

SCO OpenUNIX 8 /etc/hw

    /etc/hw takes about a second to return the cpu frequency, which suggests
    perhaps it's measuring each time it runs.  If this is annoying when
    running the speed program repeatedly then set a GMP_CPU_FREQUENCY
    environment variable (see TIME BASE section below).

Timing on GNU/Linux

    On Linux, timing currently uses the cycle counter. This is unreliable,
    since the counter is not saved and restored at context switches (unlike
    FreeBSD and Solaris where the cycle counter is "virtualized").

    Using the clock_gettime method with CLOCK_PROCESS_CPUTIME_ID (posix) or
    CLOCK_VIRTUAL (BSD) should be more reliable. To get clock_gettime
    with glibc, one has to link with -lrt (which also drags in the pthreads
    threading library). configure.in must be hacked to detect this and
    arrange proper linking. Something like

      old_LIBS="$LIBS"
      AC_SEARCH_LIBS(clock_gettime, rt, [AC_DEFINE(HAVE_CLOCK_GETTIME)])
      TUNE_LIBS="$LIBS"
      LIBS="$old_LIBS"

      AC_SUBST(TUNE_LIBS)

    might work.

Low resolution timebase

    Parameter tuning can be very time consuming if the only timebase
    available is a 10 millisecond clock tick, to the point of being
    unusable.  This is currently the case on VAX and ARM systems.


PARAMETER TUNING

The "tuneup" program runs some tests designed to find the best settings for
various thresholds, like MUL_TOOM22_THRESHOLD.  Its output can be put
into gmp-mparam.h.  The program is built and run with

        make tune

If the thresholds indicated are grossly different from the values in the
selected gmp-mparam.h then there may be a performance boost in applicable
size ranges by changing gmp-mparam.h accordingly.

Be sure to do a full reconfigure and rebuild to get any newly set thresholds
to take effect.  A partial rebuild is enough sometimes, but a fresh
configure and make is certain to be correct.

If a CPU has specific tuned parameters coming from a gmp-mparam.h in one of
the mpn subdirectories then the values from "make tune" should be similar.
But check that the configured CPU is right and there are no machine specific
effects causing a difference.

It's hoped the compiler and options used won't have too much effect on
thresholds, since for most CPUs they ultimately come down to comparisons
between assembler subroutines.  Missing out on the longlong.h macros by not
using gcc will probably have an effect.

Some thresholds produced by the tune program are merely single values chosen
from what's a range of sizes where two algorithms are pretty much the same
speed.  When this happens the program is likely to give somewhat different
values on successive runs.  This is noticeable on the toom3 thresholds for
instance.


SPEED PROGRAM

The "speed" program can be used for measuring and comparing various
routines, and producing tables of data or gnuplot graphs.  Compile it with

	make speed

(Or on DOS systems "make speed.exe".)

Here are some examples of how to use it.  Check the code for all the
options.

Draw a graph of mpn_mul_n, stepping through sizes by 10 or a factor of 1.05
(whichever is greater).

        ./speed -s 10-5000 -t 10 -f 1.05 -P foo mpn_mul_n
	gnuplot foo.gnuplot

Compare mpn_add_n and an mpn_lshift by 1, showing times in cycles and
showing under mpn_lshift the difference between it and mpn_add_n.

	./speed -s 1-40 -c -d mpn_add_n mpn_lshift.1

Using option -c for times in cycles is interesting but normally only
necessary when looking carefully at assembler subroutines.  You might think
it would always give an integer value, but this doesn't happen in practice,
probably due to overheads in the time measurements.

In the free-form output the "#" symbol against a measurement means the
corresponding routine is fastest at that size.  This is a convenient visual
cue when comparing different routines.  The graph data files <name>.data
don't get this since it would upset gnuplot or other data viewers.


TIME BASE

The time measuring method is determined in time.c, based on what the
configured host has available.  A cycle counter is preferred, possibly
supplemented by another method if the counter has a limited range.  A
microsecond accurate getrusage() or gettimeofday() will work quite well too.

The cycle counters (except possibly on alpha) and gettimeofday() will depend
on the machine being otherwise idle, or rather on other jobs not stealing
CPU time from the measuring program.  Short routines (those that complete
within a timeslice) should work even on a busy machine.

Some trouble is taken by speed_measure() in common.c to avoid ill effects
from sporadic interrupts, or other intermittent things (like cron waking up
every minute).  But generally an idle machine will be necessary to be
certain of consistent results.

The CPU frequency is needed to convert between cycles and seconds, or for
when a cycle counter is supplemented by getrusage() etc.  The speed program
will convert as necessary according to the output format requested.  The
tune program will work with either cycles or seconds.

freq.c knows how to get the frequency on some systems, or can measure a
cycle counter against gettimeofday() or getrusage(), but when that fails, or
needs to be overridden, an environment variable GMP_CPU_FREQUENCY can be
used (in Hertz).  For example in "bash" on a 650 MHz machine,

	export GMP_CPU_FREQUENCY=650e6

A high precision time base makes it possible to get accurate measurements in
a shorter time.


EXAMPLE COMPARISONS - VARIOUS

Here are some ideas for things that can be done with the speed program.

There's always going to be a certain amount of overhead in the time
measurements, due to reading the time base, and in the loop that runs a
routine enough times to get a reading of the desired precision.  Noop
functions taking various arguments are available to measure this.  The
"overhead" printed by the speed program each time in its intro is the "noop"
routine, but note that this is just for information, it isn't deducted from
the times printed or anything.

	./speed -s 1 noop noop_wxs noop_wxys

To see how many cycles per limb a routine is taking, look at the time
increase when the size increments, using option -D.  This avoids fixed
overheads in the measuring.  Also, remember many of the assembler routines
have unrolled loops, so it might be necessary to compare times at, say, 16,
32, 48, 64 etc to see what the unrolled part is taking, as opposed to any
finishing off.

        ./speed -s 16-64 -t 16 -C -D mpn_add_n

The -C option on its own gives cycles per limb, but is really only useful at
big sizes where fixed overheads are small compared to the code doing the
real work.  Remember of course memory caching and/or page swapping will
affect results at large sizes.

        ./speed -s 500000 -C mpn_add_n

Once a calculation stops fitting in the CPU data cache, it's going to start
taking longer.  Exactly where this happens depends on the cache priming in
the measuring routines, and on what sort of "least recently used" the
hardware does.  Here's an example for a CPU with a 16kbyte L1 data cache and
32-bit limb, showing a suddenly steeper curve for mpn_add_n at about 2000
limbs.

        ./speed -s 1-4000 -t 5 -f 1.02 -P foo mpn_add_n
	gnuplot foo.gnuplot

When a routine has an unrolled loop for, say, multiples of 8 limbs and then
an ordinary loop for the remainder, it can happen that it's actually faster
to do an operation on, say, 8 limbs than it is on 7 limbs.  The following
draws a graph of mpn_sub_n, to see whether times smoothly increase with
size.

        ./speed -s 1-100 -c -P foo mpn_sub_n
	gnuplot foo.gnuplot

If mpn_lshift and mpn_rshift have special case code for shifts by 1, it
ought to be faster (or at least not slower) than shifting by, say, 2 bits.

        ./speed -s 1-200 -c mpn_rshift.1 mpn_rshift.2

An mpn_lshift by 1 can be done by mpn_add_n adding a number to itself, and
if the lshift isn't faster there's an obvious improvement that's possible.

        ./speed -s 1-200 -c mpn_lshift.1 mpn_add_n_self

On some CPUs (AMD K6 for example) an "in-place" mpn_add_n where the
destination is one of the sources is faster than a separate destination.
Here's an example to see this.  ".1" selects dst==src1 for mpn_add_n (and
mpn_sub_n), for other values see speed.h SPEED_ROUTINE_MPN_BINARY_N_CALL.

        ./speed -s 1-200 -c mpn_add_n mpn_add_n.1

The gmp manual points out that divisions by powers of two should be done
using a right shift because it'll be significantly faster than an actual
division.  The following shows by what factor mpn_rshift is faster than
mpn_divrem_1, using division by 32 as an example.

        ./speed -s 10-20 -r mpn_rshift.5 mpn_divrem_1.32


EXAMPLE COMPARISONS - MULTIPLICATION

mul_basecase takes a ".<r>" parameter. If positive, it gives the second
(smaller) operand size.  For example to show speeds for 3x3 up to 20x3 in
cycles,

        ./speed -s 3-20 -c mpn_mul_basecase.3

A negative ".<-r>" parameter fixes the size of the product to the absolute
value r.  For example to show speeds for 10x10 up to 19x1 in cycles,

        ./speed -s 10-19 -c mpn_mul_basecase.-20

mul_basecase with no parameter does an NxN multiply, so for example to show
speeds in cycles for 1x1, 2x2, 3x3, etc, up to 20x20, in cycles,

        ./speed -s 1-20 -c mpn_mul_basecase

sqr_basecase is implemented by a "triangular" method on most CPUs, making it
up to twice as fast as mul_basecase.  In practice loop overheads and the
products on the diagonal mean it falls short of this.  Here's an example
running the two and showing by what factor an NxN mul_basecase is slower
than an NxN sqr_basecase.  (Some versions of sqr_basecase only allow sizes
below SQR_TOOM2_THRESHOLD, so if it crashes at that point don't worry.)

        ./speed -s 1-20 -r mpn_sqr_basecase mpn_mul_basecase

The technique described above with -CD for showing the time difference in
cycles per limb between two size operations can be done on an NxN
mul_basecase using -E to change the basis for the size increment to N*N.
For instance a 20x20 operation is taken to be doing 400 limbs, and a 16x16
doing 256 limbs.  The following therefore shows the per crossproduct speed
of mul_basecase and sqr_basecase at around 20x20 limbs.

        ./speed -s 16-20 -t 4 -CDE mpn_mul_basecase mpn_sqr_basecase

Of course sqr_basecase isn't really doing NxN crossproducts, but it can be
interesting to compare it to mul_basecase as if it was.  For sqr_basecase
the -F option can be used to base the deltas on N*(N+1)/2 operations, which
is the triangular products sqr_basecase does.  For example,

        ./speed -s 16-20 -t 4 -CDF mpn_sqr_basecase

Both -E and -F are preliminary and might change.  A consistent approach to
using them when claiming certain per crossproduct or per triangularproduct
speeds hasn't really been established, but the increment between speeds in
the range karatsuba will call seems sensible, that being k to k/2.  For
instance, if the karatsuba threshold was 20 for the multiply and 30 for the
square,

        ./speed -s 10-20 -t 10 -CDE mpn_mul_basecase
        ./speed -s 15-30 -t 15 -CDF mpn_sqr_basecase


EXAMPLE COMPARISONS - MALLOC

The gmp manual recommends application programs avoid excessive initializing
and clearing of mpz_t variables (and mpq_t and mpf_t too).  Every new
variable will at a minimum go through an init, a realloc for its first
store, and finally a clear.  Quite how long that takes depends on the C
library.  The following compares an mpz_init/realloc/clear to a 10 limb
mpz_add.  Don't be surprised if the mallocing is quite slow.

        ./speed -s 10 -c mpz_init_realloc_clear mpz_add

On some systems malloc and free are much slower when dynamic linked.  The
speed-dynamic program can be used to see this.  For example the following
measures malloc/free, first static then dynamic.

        ./speed -s 10 -c malloc_free
        ./speed-dynamic -s 10 -c malloc_free

Of course a real world program has big problems if it's doing so many
mallocs and frees that it gets slowed down by a dynamic linked malloc.


EXAMPLE COMPARISONS - STRING CONVERSIONS

mpn_get_str does a binary to string conversion.  The base is specified with
a ".<r>" parameter, or decimal by default.  Power of 2 bases are much faster
than general bases.  The following compares decimal and hex for instance.

        ./speed -s 1-20 -c mpn_get_str mpn_get_str.16

Smaller bases need more divisions to split a given size number, and so are
slower.  The following compares base 3 and base 9.  On small operands 9 will
be nearly twice as fast, though at bigger sizes this reduces since in the
current implementation both divide repeatedly by 3^20 (or 3^40 for 64 bit
limbs) and those divisions come to dominate.

        ./speed -s 1-20 -cr mpn_get_str.3 mpn_get_str.9

mpn_set_str does a string to binary conversion.  The base is specified with
a ".<r>" parameter, or decimal by default.  Power of 2 bases are faster than
general bases on large conversions.

	./speed -s 1-512 -f 2 -c mpn_set_str.8 mpn_set_str.10

mpn_set_str also has some special case code for decimal which is a bit
faster than the general case, basically by giving the compiler a chance to
optimize some multiplications by 10.

	./speed -s 20-40 -c mpn_set_str.9 mpn_set_str.10 mpn_set_str.11


EXAMPLE COMPARISONS - GCDs

mpn_gcd_1 has a threshold for when to reduce using an initial x%y when both
x and y are single limbs.  This isn't tuned currently, but a value can be
established by a measurement like

	./speed -s 10-32 mpn_gcd_1.10

This runs src[0] from 10 to 32 bits, and y fixed at 10 bits.  If the div
threshold is high, say 31 so it's effectively disabled then a 32x10 bit gcd
is done by nibbling away at the 32-bit operands bit-by-bit.  When the
threshold is small, say 1 bit, then an initial x%y is done to reduce it to a
10x10 bit operation.

The threshold in mpn/generic/gcd_1.c or the various assembler
implementations can be tweaked up or down until there's no more speedups on
interesting combinations of sizes.  Note that this affects only a 1x1 limb
operation and so isn't very important.  (An Nx1 limb operation always does
an initial modular reduction, using mpn_mod_1 or mpn_modexact_1_odd.)


SPEED PROGRAM EXTENSIONS

Potentially lots of things could be made available in the program, but it's
been left at only the things that have actually been wanted and are likely
to be reasonably useful in the future.

Extensions should be fairly easy to make though.  speed-ext.c is an example,
in a style that should suit one-off tests, or new code fragments under
development.

many.pl is a script for generating a new speed program supplemented with
alternate versions of the standard routines.  It can be used for measuring
experimental code, or for comparing different implementations that exist
within a CPU family.


THRESHOLD EXAMINING

The speed program can be used to examine the speeds of different algorithms
to check the tune program has done the right thing.  For example to examine
the karatsuba multiply threshold,

	./speed -s 5-40 mpn_mul_basecase mpn_kara_mul_n

When examining the toom3 threshold, remember it depends on the karatsuba
threshold, so the right karatsuba threshold needs to be compiled into the
library first.  The tune program uses specially recompiled versions of
mpn/mul_n.c etc for this reason, but the speed program simply uses the
normal libgmp.la.

Note further that the various routines may recurse into themselves on sizes
far enough above applicable thresholds.  For example, mpn_kara_mul_n will
recurse into itself on sizes greater than twice the compiled-in
MUL_TOOM22_THRESHOLD.

When doing the above comparison between mul_basecase and kara_mul_n what's
probably of interest is mul_basecase versus a kara_mul_n that does one level
of Karatsuba then calls to mul_basecase, but this only happens on sizes less
than twice the compiled MUL_TOOM22_THRESHOLD.  A larger value for that
setting can be compiled-in to avoid the problem if necessary.  The same
applies to toom3 and DC, though in a trickier fashion.

There are some upper limits on some of the thresholds, arising from arrays
dimensioned according to a threshold (mpn_mul_n), or asm code with certain
sized displacements (some x86 versions of sqr_basecase).  So putting huge
values for the thresholds, even just for testing, may fail.


FUTURE

Make a program to check the time base is working properly, for small and
large measurements.  Make it able to test each available method, including
perhaps the apparent resolution of each.

Make a general mechanism for specifying operand overlap, and a syntax like
maybe "mpn_add_n.dst=src2" to select it.  Some measuring routines do this
sort of thing with the "r" parameter currently.


----------------
Local variables:
mode: text
fill-column: 76
End:

========================================================================
* COPYING
========================================================================

                    GNU GENERAL PUBLIC LICENSE
                       Version 3, 29 June 2007

 Copyright (C) 2007 Free Software Foundation, Inc. <http://fsf.org/>
 Everyone is permitted to copy and distribute verbatim copies
 of this license document, but changing it is not allowed.

                            Preamble

  The GNU General Public License is a free, copyleft license for
software and other kinds of works.

  The licenses for most software and other practical works are designed
to take away your freedom to share and change the works.  By contrast,
the GNU General Public License is intended to guarantee your freedom to
share and change all versions of a program--to make sure it remains free
software for all its users.  We, the Free Software Foundation, use the
GNU General Public License for most of our software; it applies also to
any other work released this way by its authors.  You can apply it to
your programs, too.

  When we speak of free software, we are referring to freedom, not
price.  Our General Public Licenses are designed to make sure that you
have the freedom to distribute copies of free software (and charge for
them if you wish), that you receive source code or can get it if you
want it, that you can change the software or use pieces of it in new
free programs, and that you know you can do these things.

  To protect your rights, we need to prevent others from denying you
these rights or asking you to surrender the rights.  Therefore, you have
certain responsibilities if you distribute copies of the software, or if
you modify it: responsibilities to respect the freedom of others.

  For example, if you distribute copies of such a program, whether
gratis or for a fee, you must pass on to the recipients the same
freedoms that you received.  You must make sure that they, too, receive
or can get the source code.  And you must show them these terms so they
know their rights.

  Developers that use the GNU GPL protect your rights with two steps:
(1) assert copyright on the software, and (2) offer you this License
giving you legal permission to copy, distribute and/or modify it.

  For the developers' and authors' protection, the GPL clearly explains
that there is no warranty for this free software.  For both users' and
authors' sake, the GPL requires that modified versions be marked as
changed, so that their problems will not be attributed erroneously to
authors of previous versions.

  Some devices are designed to deny users access to install or run
modified versions of the software inside them, although the manufacturer
can do so.  This is fundamentally incompatible with the aim of
protecting users' freedom to change the software.  The systematic
pattern of such abuse occurs in the area of products for individuals to
use, which is precisely where it is most unacceptable.  Therefore, we
have designed this version of the GPL to prohibit the practice for those
products.  If such problems arise substantially in other domains, we
stand ready to extend this provision to those domains in future versions
of the GPL, as needed to protect the freedom of users.

  Finally, every program is threatened constantly by software patents.
States should not allow patents to restrict development and use of
software on general-purpose computers, but in those that do, we wish to
avoid the special danger that patents applied to a free program could
make it effectively proprietary.  To prevent this, the GPL assures that
patents cannot be used to render the program non-free.

  The precise terms and conditions for copying, distribution and
modification follow.

                       TERMS AND CONDITIONS

  0. Definitions.

  "This License" refers to version 3 of the GNU General Public License.

  "Copyright" also means copyright-like laws that apply to other kinds of
works, such as semiconductor masks.

  "The Program" refers to any copyrightable work licensed under this
License.  Each licensee is addressed as "you".  "Licensees" and
"recipients" may be individuals or organizations.

  To "modify" a work means to copy from or adapt all or part of the work
in a fashion requiring copyright permission, other than the making of an
exact copy.  The resulting work is called a "modified version" of the
earlier work or a work "based on" the earlier work.

  A "covered work" means either the unmodified Program or a work based
on the Program.

  To "propagate" a work means to do anything with it that, without
permission, would make you directly or secondarily liable for
infringement under applicable copyright law, except executing it on a
computer or modifying a private copy.  Propagation includes copying,
distribution (with or without modification), making available to the
public, and in some countries other activities as well.

  To "convey" a work means any kind of propagation that enables other
parties to make or receive copies.  Mere interaction with a user through
a computer network, with no transfer of a copy, is not conveying.

  An interactive user interface displays "Appropriate Legal Notices"
to the extent that it includes a convenient and prominently visible
feature that (1) displays an appropriate copyright notice, and (2)
tells the user that there is no warranty for the work (except to the
extent that warranties are provided), that licensees may convey the
work under this License, and how to view a copy of this License.  If
the interface presents a list of user commands or options, such as a
menu, a prominent item in the list meets this criterion.

  1. Source Code.

  The "source code" for a work means the preferred form of the work
for making modifications to it.  "Object code" means any non-source
form of a work.

  A "Standard Interface" means an interface that either is an official
standard defined by a recognized standards body, or, in the case of
interfaces specified for a particular programming language, one that
is widely used among developers working in that language.

  The "System Libraries" of an executable work include anything, other
than the work as a whole, that (a) is included in the normal form of
packaging a Major Component, but which is not part of that Major
Component, and (b) serves only to enable use of the work with that
Major Component, or to implement a Standard Interface for which an
implementation is available to the public in source code form.  A
"Major Component", in this context, means a major essential component
(kernel, window system, and so on) of the specific operating system
(if any) on which the executable work runs, or a compiler used to
produce the work, or an object code interpreter used to run it.

  The "Corresponding Source" for a work in object code form means all
the source code needed to generate, install, and (for an executable
work) run the object code and to modify the work, including scripts to
control those activities.  However, it does not include the work's
System Libraries, or general-purpose tools or generally available free
programs which are used unmodified in performing those activities but
which are not part of the work.  For example, Corresponding Source
includes interface definition files associated with source files for
the work, and the source code for shared libraries and dynamically
linked subprograms that the work is specifically designed to require,
such as by intimate data communication or control flow between those
subprograms and other parts of the work.

  The Corresponding Source need not include anything that users
can regenerate automatically from other parts of the Corresponding
Source.

  The Corresponding Source for a work in source code form is that
same work.

  2. Basic Permissions.

  All rights granted under this License are granted for the term of
copyright on the Program, and are irrevocable provided the stated
conditions are met.  This License explicitly affirms your unlimited
permission to run the unmodified Program.  The output from running a
covered work is covered by this License only if the output, given its
content, constitutes a covered work.  This License acknowledges your
rights of fair use or other equivalent, as provided by copyright law.

  You may make, run and propagate covered works that you do not
convey, without conditions so long as your license otherwise remains
in force.  You may convey covered works to others for the sole purpose
of having them make modifications exclusively for you, or provide you
with facilities for running those works, provided that you comply with
the terms of this License in conveying all material for which you do
not control copyright.  Those thus making or running the covered works
for you must do so exclusively on your behalf, under your direction
and control, on terms that prohibit them from making any copies of
your copyrighted material outside their relationship with you.

  Conveying under any other circumstances is permitted solely under
the conditions stated below.  Sublicensing is not allowed; section 10
makes it unnecessary.

  3. Protecting Users' Legal Rights From Anti-Circumvention Law.

  No covered work shall be deemed part of an effective technological
measure under any applicable law fulfilling obligations under article
11 of the WIPO copyright treaty adopted on 20 December 1996, or
similar laws prohibiting or restricting circumvention of such
measures.

  When you convey a covered work, you waive any legal power to forbid
circumvention of technological measures to the extent such circumvention
is effected by exercising rights under this License with respect to
the covered work, and you disclaim any intention to limit operation or
modification of the work as a means of enforcing, against the work's
users, your or third parties' legal rights to forbid circumvention of
technological measures.

  4. Conveying Verbatim Copies.

  You may convey verbatim copies of the Program's source code as you
receive it, in any medium, provided that you conspicuously and
appropriately publish on each copy an appropriate copyright notice;
keep intact all notices stating that this License and any
non-permissive terms added in accord with section 7 apply to the code;
keep intact all notices of the absence of any warranty; and give all
recipients a copy of this License along with the Program.

  You may charge any price or no price for each copy that you convey,
and you may offer support or warranty protection for a fee.

  5. Conveying Modified Source Versions.

  You may convey a work based on the Program, or the modifications to
produce it from the Program, in the form of source code under the
terms of section 4, provided that you also meet all of these conditions:

    a) The work must carry prominent notices stating that you modified
    it, and giving a relevant date.

    b) The work must carry prominent notices stating that it is
    released under this License and any conditions added under section
    7.  This requirement modifies the requirement in section 4 to
    "keep intact all notices".

    c) You must license the entire work, as a whole, under this
    License to anyone who comes into possession of a copy.  This
    License will therefore apply, along with any applicable section 7
    additional terms, to the whole of the work, and all its parts,
    regardless of how they are packaged.  This License gives no
    permission to license the work in any other way, but it does not
    invalidate such permission if you have separately received it.

    d) If the work has interactive user interfaces, each must display
    Appropriate Legal Notices; however, if the Program has interactive
    interfaces that do not display Appropriate Legal Notices, your
    work need not make them do so.

  A compilation of a covered work with other separate and independent
works, which are not by their nature extensions of the covered work,
and which are not combined with it such as to form a larger program,
in or on a volume of a storage or distribution medium, is called an
"aggregate" if the compilation and its resulting copyright are not
used to limit the access or legal rights of the compilation's users
beyond what the individual works permit.  Inclusion of a covered work
in an aggregate does not cause this License to apply to the other
parts of the aggregate.

  6. Conveying Non-Source Forms.

  You may convey a covered work in object code form under the terms
of sections 4 and 5, provided that you also convey the
machine-readable Corresponding Source under the terms of this License,
in one of these ways:

    a) Convey the object code in, or embodied in, a physical product
    (including a physical distribution medium), accompanied by the
    Corresponding Source fixed on a durable physical medium
    customarily used for software interchange.

    b) Convey the object code in, or embodied in, a physical product
    (including a physical distribution medium), accompanied by a
    written offer, valid for at least three years and valid for as
    long as you offer spare parts or customer support for that product
    model, to give anyone who possesses the object code either (1) a
    copy of the Corresponding Source for all the software in the
    product that is covered by this License, on a durable physical
    medium customarily used for software interchange, for a price no
    more than your reasonable cost of physically performing this
    conveying of source, or (2) access to copy the
    Corresponding Source from a network server at no charge.

    c) Convey individual copies of the object code with a copy of the
    written offer to provide the Corresponding Source.  This
    alternative is allowed only occasionally and noncommercially, and
    only if you received the object code with such an offer, in accord
    with subsection 6b.

    d) Convey the object code by offering access from a designated
    place (gratis or for a charge), and offer equivalent access to the
    Corresponding Source in the same way through the same place at no
    further charge.  You need not require recipients to copy the
    Corresponding Source along with the object code.  If the place to
    copy the object code is a network server, the Corresponding Source
    may be on a different server (operated by you or a third party)
    that supports equivalent copying facilities, provided you maintain
    clear directions next to the object code saying where to find the
    Corresponding Source.  Regardless of what server hosts the
    Corresponding Source, you remain obligated to ensure that it is
    available for as long as needed to satisfy these requirements.

    e) Convey the object code using peer-to-peer transmission, provided
    you inform other peers where the object code and Corresponding
    Source of the work are being offered to the general public at no
    charge under subsection 6d.

  A separable portion of the object code, whose source code is excluded
from the Corresponding Source as a System Library, need not be
included in conveying the object code work.

  A "User Product" is either (1) a "consumer product", which means any
tangible personal property which is normally used for personal, family,
or household purposes, or (2) anything designed or sold for incorporation
into a dwelling.  In determining whether a product is a consumer product,
doubtful cases shall be resolved in favor of coverage.  For a particular
product received by a particular user, "normally used" refers to a
typical or common use of that class of product, regardless of the status
of the particular user or of the way in which the particular user
actually uses, or expects or is expected to use, the product.  A product
is a consumer product regardless of whether the product has substantial
commercial, industrial or non-consumer uses, unless such uses represent
the only significant mode of use of the product.

  "Installation Information" for a User Product means any methods,
procedures, authorization keys, or other information required to install
and execute modified versions of a covered work in that User Product from
a modified version of its Corresponding Source.  The information must
suffice to ensure that the continued functioning of the modified object
code is in no case prevented or interfered with solely because
modification has been made.

  If you convey an object code work under this section in, or with, or
specifically for use in, a User Product, and the conveying occurs as
part of a transaction in which the right of possession and use of the
User Product is transferred to the recipient in perpetuity or for a
fixed term (regardless of how the transaction is characterized), the
Corresponding Source conveyed under this section must be accompanied
by the Installation Information.  But this requirement does not apply
if neither you nor any third party retains the ability to install
modified object code on the User Product (for example, the work has
been installed in ROM).

  The requirement to provide Installation Information does not include a
requirement to continue to provide support service, warranty, or updates
for a work that has been modified or installed by the recipient, or for
the User Product in which it has been modified or installed.  Access to a
network may be denied when the modification itself materially and
adversely affects the operation of the network or violates the rules and
protocols for communication across the network.

  Corresponding Source conveyed, and Installation Information provided,
in accord with this section must be in a format that is publicly
documented (and with an implementation available to the public in
source code form), and must require no special password or key for
unpacking, reading or copying.

  7. Additional Terms.

  "Additional permissions" are terms that supplement the terms of this
License by making exceptions from one or more of its conditions.
Additional permissions that are applicable to the entire Program shall
be treated as though they were included in this License, to the extent
that they are valid under applicable law.  If additional permissions
apply only to part of the Program, that part may be used separately
under those permissions, but the entire Program remains governed by
this License without regard to the additional permissions.

  When you convey a copy of a covered work, you may at your option
remove any additional permissions from that copy, or from any part of
it.  (Additional permissions may be written to require their own
removal in certain cases when you modify the work.)  You may place
additional permissions on material, added by you to a covered work,
for which you have or can give appropriate copyright permission.

  Notwithstanding any other provision of this License, for material you
add to a covered work, you may (if authorized by the copyright holders of
that material) supplement the terms of this License with terms:

    a) Disclaiming warranty or limiting liability differently from the
    terms of sections 15 and 16 of this License; or

    b) Requiring preservation of specified reasonable legal notices or
    author attributions in that material or in the Appropriate Legal
    Notices displayed by works containing it; or

    c) Prohibiting misrepresentation of the origin of that material, or
    requiring that modified versions of such material be marked in
    reasonable ways as different from the original version; or

    d) Limiting the use for publicity purposes of names of licensors or
    authors of the material; or

    e) Declining to grant rights under trademark law for use of some
    trade names, trademarks, or service marks; or

    f) Requiring indemnification of licensors and authors of that
    material by anyone who conveys the material (or modified versions of
    it) with contractual assumptions of liability to the recipient, for
    any liability that these contractual assumptions directly impose on
    those licensors and authors.

  All other non-permissive additional terms are considered "further
restrictions" within the meaning of section 10.  If the Program as you
received it, or any part of it, contains a notice stating that it is
governed by this License along with a term that is a further
restriction, you may remove that term.  If a license document contains
a further restriction but permits relicensing or conveying under this
License, you may add to a covered work material governed by the terms
of that license document, provided that the further restriction does
not survive such relicensing or conveying.

  If you add terms to a covered work in accord with this section, you
must place, in the relevant source files, a statement of the
additional terms that apply to those files, or a notice indicating
where to find the applicable terms.

  Additional terms, permissive or non-permissive, may be stated in the
form of a separately written license, or stated as exceptions;
the above requirements apply either way.

  8. Termination.

  You may not propagate or modify a covered work except as expressly
provided under this License.  Any attempt otherwise to propagate or
modify it is void, and will automatically terminate your rights under
this License (including any patent licenses granted under the third
paragraph of section 11).

  However, if you cease all violation of this License, then your
license from a particular copyright holder is reinstated (a)
provisionally, unless and until the copyright holder explicitly and
finally terminates your license, and (b) permanently, if the copyright
holder fails to notify you of the violation by some reasonable means
prior to 60 days after the cessation.

  Moreover, your license from a particular copyright holder is
reinstated permanently if the copyright holder notifies you of the
violation by some reasonable means, this is the first time you have
received notice of violation of this License (for any work) from that
copyright holder, and you cure the violation prior to 30 days after
your receipt of the notice.

  Termination of your rights under this section does not terminate the
licenses of parties who have received copies or rights from you under
this License.  If your rights have been terminated and not permanently
reinstated, you do not qualify to receive new licenses for the same
material under section 10.

  9. Acceptance Not Required for Having Copies.

  You are not required to accept this License in order to receive or
run a copy of the Program.  Ancillary propagation of a covered work
occurring solely as a consequence of using peer-to-peer transmission
to receive a copy likewise does not require acceptance.  However,
nothing other than this License grants you permission to propagate or
modify any covered work.  These actions infringe copyright if you do
not accept this License.  Therefore, by modifying or propagating a
covered work, you indicate your acceptance of this License to do so.

  10. Automatic Licensing of Downstream Recipients.

  Each time you convey a covered work, the recipient automatically
receives a license from the original licensors, to run, modify and
propagate that work, subject to this License.  You are not responsible
for enforcing compliance by third parties with this License.

  An "entity transaction" is a transaction transferring control of an
organization, or substantially all assets of one, or subdividing an
organization, or merging organizations.  If propagation of a covered
work results from an entity transaction, each party to that
transaction who receives a copy of the work also receives whatever
licenses to the work the party's predecessor in interest had or could
give under the previous paragraph, plus a right to possession of the
Corresponding Source of the work from the predecessor in interest, if
the predecessor has it or can get it with reasonable efforts.

  You may not impose any further restrictions on the exercise of the
rights granted or affirmed under this License.  For example, you may
not impose a license fee, royalty, or other charge for exercise of
rights granted under this License, and you may not initiate litigation
(including a cross-claim or counterclaim in a lawsuit) alleging that
any patent claim is infringed by making, using, selling, offering for
sale, or importing the Program or any portion of it.

  11. Patents.

  A "contributor" is a copyright holder who authorizes use under this
License of the Program or a work on which the Program is based.  The
work thus licensed is called the contributor's "contributor version".

  A contributor's "essential patent claims" are all patent claims
owned or controlled by the contributor, whether already acquired or
hereafter acquired, that would be infringed by some manner, permitted
by this License, of making, using, or selling its contributor version,
but do not include claims that would be infringed only as a
consequence of further modification of the contributor version.  For
purposes of this definition, "control" includes the right to grant
patent sublicenses in a manner consistent with the requirements of
this License.

  Each contributor grants you a non-exclusive, worldwide, royalty-free
patent license under the contributor's essential patent claims, to
make, use, sell, offer for sale, import and otherwise run, modify and
propagate the contents of its contributor version.

  In the following three paragraphs, a "patent license" is any express
agreement or commitment, however denominated, not to enforce a patent
(such as an express permission to practice a patent or covenant not to
sue for patent infringement).  To "grant" such a patent license to a
party means to make such an agreement or commitment not to enforce a
patent against the party.

  If you convey a covered work, knowingly relying on a patent license,
and the Corresponding Source of the work is not available for anyone
to copy, free of charge and under the terms of this License, through a
publicly available network server or other readily accessible means,
then you must either (1) cause the Corresponding Source to be so
available, or (2) arrange to deprive yourself of the benefit of the
patent license for this particular work, or (3) arrange, in a manner
consistent with the requirements of this License, to extend the patent
license to downstream recipients.  "Knowingly relying" means you have
actual knowledge that, but for the patent license, your conveying the
covered work in a country, or your recipient's use of the covered work
in a country, would infringe one or more identifiable patents in that
country that you have reason to believe are valid.

  If, pursuant to or in connection with a single transaction or
arrangement, you convey, or propagate by procuring conveyance of, a
covered work, and grant a patent license to some of the parties
receiving the covered work authorizing them to use, propagate, modify
or convey a specific copy of the covered work, then the patent license
you grant is automatically extended to all recipients of the covered
work and works based on it.

  A patent license is "discriminatory" if it does not include within
the scope of its coverage, prohibits the exercise of, or is
conditioned on the non-exercise of one or more of the rights that are
specifically granted under this License.  You may not convey a covered
work if you are a party to an arrangement with a third party that is
in the business of distributing software, under which you make payment
to the third party based on the extent of your activity of conveying
the work, and under which the third party grants, to any of the
parties who would receive the covered work from you, a discriminatory
patent license (a) in connection with copies of the covered work
conveyed by you (or copies made from those copies), or (b) primarily
for and in connection with specific products or compilations that
contain the covered work, unless you entered into that arrangement,
or that patent license was granted, prior to 28 March 2007.

  Nothing in this License shall be construed as excluding or limiting
any implied license or other defenses to infringement that may
otherwise be available to you under applicable patent law.

  12. No Surrender of Others' Freedom.

  If conditions are imposed on you (whether by court order, agreement or
otherwise) that contradict the conditions of this License, they do not
excuse you from the conditions of this License.  If you cannot convey a
covered work so as to satisfy simultaneously your obligations under this
License and any other pertinent obligations, then as a consequence you may
not convey it at all.  For example, if you agree to terms that obligate you
to collect a royalty for further conveying from those to whom you convey
the Program, the only way you could satisfy both those terms and this
License would be to refrain entirely from conveying the Program.

  13. Use with the GNU Affero General Public License.

  Notwithstanding any other provision of this License, you have
permission to link or combine any covered work with a work licensed
under version 3 of the GNU Affero General Public License into a single
combined work, and to convey the resulting work.  The terms of this
License will continue to apply to the part which is the covered work,
but the special requirements of the GNU Affero General Public License,
section 13, concerning interaction through a network will apply to the
combination as such.

  14. Revised Versions of this License.

  The Free Software Foundation may publish revised and/or new versions of
the GNU General Public License from time to time.  Such new versions will
be similar in spirit to the present version, but may differ in detail to
address new problems or concerns.

  Each version is given a distinguishing version number.  If the
Program specifies that a certain numbered version of the GNU General
Public License "or any later version" applies to it, you have the
option of following the terms and conditions either of that numbered
version or of any later version published by the Free Software
Foundation.  If the Program does not specify a version number of the
GNU General Public License, you may choose any version ever published
by the Free Software Foundation.

  If the Program specifies that a proxy can decide which future
versions of the GNU General Public License can be used, that proxy's
public statement of acceptance of a version permanently authorizes you
to choose that version for the Program.

  Later license versions may give you additional or different
permissions.  However, no additional obligations are imposed on any
author or copyright holder as a result of your choosing to follow a
later version.

  15. Disclaimer of Warranty.

  THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY
APPLICABLE LAW.  EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT
HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY
OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,
THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE.  THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM
IS WITH YOU.  SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF
ALL NECESSARY SERVICING, REPAIR OR CORRECTION.

  16. Limitation of Liability.

  IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS
THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY
GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE
USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF
DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD
PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS),
EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
SUCH DAMAGES.

  17. Interpretation of Sections 15 and 16.

  If the disclaimer of warranty and limitation of liability provided
above cannot be given local legal effect according to their terms,
reviewing courts shall apply local law that most closely approximates
an absolute waiver of all civil liability in connection with the
Program, unless a warranty or assumption of liability accompanies a
copy of the Program in return for a fee.

                     END OF TERMS AND CONDITIONS

            How to Apply These Terms to Your New Programs

  If you develop a new program, and you want it to be of the greatest
possible use to the public, the best way to achieve this is to make it
free software which everyone can redistribute and change under these terms.

  To do so, attach the following notices to the program.  It is safest
to attach them to the start of each source file to most effectively
state the exclusion of warranty; and each file should have at least
the "copyright" line and a pointer to where the full notice is found.

    <one line to give the program's name and a brief idea of what it does.>
    Copyright (C) <year>  <name of author>

    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program.  If not, see <http://www.gnu.org/licenses/>.

Also add information on how to contact you by electronic and paper mail.

  If the program does terminal interaction, make it output a short
notice like this when it starts in an interactive mode:

    <program>  Copyright (C) <year>  <name of author>
    This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
    This is free software, and you are welcome to redistribute it
    under certain conditions; type `show c' for details.

The hypothetical commands `show w' and `show c' should show the appropriate
parts of the General Public License.  Of course, your program's commands
might be different; for a GUI interface, you would use an "about box".

  You should also get your employer (if you work as a programmer) or school,
if any, to sign a "copyright disclaimer" for the program, if necessary.
For more information on this, and how to apply and follow the GNU GPL, see
<http://www.gnu.org/licenses/>.

  The GNU General Public License does not permit incorporating your program
into proprietary programs.  If your program is a subroutine library, you
may consider it more useful to permit linking proprietary applications with
the library.  If this is what you want to do, use the GNU Lesser General
Public License instead of this License.  But first, please read
<http://www.gnu.org/philosophy/why-not-lgpl.html>.

========================================================================
* COPYING.LESSERv3
========================================================================

		   GNU LESSER GENERAL PUBLIC LICENSE
                       Version 3, 29 June 2007

 Copyright (C) 2007 Free Software Foundation, Inc. <http://fsf.org/>
 Everyone is permitted to copy and distribute verbatim copies
 of this license document, but changing it is not allowed.


  This version of the GNU Lesser General Public License incorporates
the terms and conditions of version 3 of the GNU General Public
License, supplemented by the additional permissions listed below.

  0. Additional Definitions.

  As used herein, "this License" refers to version 3 of the GNU Lesser
General Public License, and the "GNU GPL" refers to version 3 of the GNU
General Public License.

  "The Library" refers to a covered work governed by this License,
other than an Application or a Combined Work as defined below.

  An "Application" is any work that makes use of an interface provided
by the Library, but which is not otherwise based on the Library.
Defining a subclass of a class defined by the Library is deemed a mode
of using an interface provided by the Library.

  A "Combined Work" is a work produced by combining or linking an
Application with the Library.  The particular version of the Library
with which the Combined Work was made is also called the "Linked
Version".

  The "Minimal Corresponding Source" for a Combined Work means the
Corresponding Source for the Combined Work, excluding any source code
for portions of the Combined Work that, considered in isolation, are
based on the Application, and not on the Linked Version.

  The "Corresponding Application Code" for a Combined Work means the
object code and/or source code for the Application, including any data
and utility programs needed for reproducing the Combined Work from the
Application, but excluding the System Libraries of the Combined Work.

  1. Exception to Section 3 of the GNU GPL.

  You may convey a covered work under sections 3 and 4 of this License
without being bound by section 3 of the GNU GPL.

  2. Conveying Modified Versions.

  If you modify a copy of the Library, and, in your modifications, a
facility refers to a function or data to be supplied by an Application
that uses the facility (other than as an argument passed when the
facility is invoked), then you may convey a copy of the modified
version:

   a) under this License, provided that you make a good faith effort to
   ensure that, in the event an Application does not supply the
   function or data, the facility still operates, and performs
   whatever part of its purpose remains meaningful, or

   b) under the GNU GPL, with none of the additional permissions of
   this License applicable to that copy.

  3. Object Code Incorporating Material from Library Header Files.

  The object code form of an Application may incorporate material from
a header file that is part of the Library.  You may convey such object
code under terms of your choice, provided that, if the incorporated
material is not limited to numerical parameters, data structure
layouts and accessors, or small macros, inline functions and templates
(ten or fewer lines in length), you do both of the following:

   a) Give prominent notice with each copy of the object code that the
   Library is used in it and that the Library and its use are
   covered by this License.

   b) Accompany the object code with a copy of the GNU GPL and this license
   document.

  4. Combined Works.

  You may convey a Combined Work under terms of your choice that,
taken together, effectively do not restrict modification of the
portions of the Library contained in the Combined Work and reverse
engineering for debugging such modifications, if you also do each of
the following:

   a) Give prominent notice with each copy of the Combined Work that
   the Library is used in it and that the Library and its use are
   covered by this License.

   b) Accompany the Combined Work with a copy of the GNU GPL and this license
   document.

   c) For a Combined Work that displays copyright notices during
   execution, include the copyright notice for the Library among
   these notices, as well as a reference directing the user to the
   copies of the GNU GPL and this license document.

   d) Do one of the following:

       0) Convey the Minimal Corresponding Source under the terms of this
       License, and the Corresponding Application Code in a form
       suitable for, and under terms that permit, the user to
       recombine or relink the Application with a modified version of
       the Linked Version to produce a modified Combined Work, in the
       manner specified by section 6 of the GNU GPL for conveying
       Corresponding Source.

       1) Use a suitable shared library mechanism for linking with the
       Library.  A suitable mechanism is one that (a) uses at run time
       a copy of the Library already present on the user's computer
       system, and (b) will operate properly with a modified version
       of the Library that is interface-compatible with the Linked
       Version.

   e) Provide Installation Information, but only if you would otherwise
   be required to provide such information under section 6 of the
   GNU GPL, and only to the extent that such information is
   necessary to install and execute a modified version of the
   Combined Work produced by recombining or relinking the
   Application with a modified version of the Linked Version. (If
   you use option 4d0, the Installation Information must accompany
   the Minimal Corresponding Source and Corresponding Application
   Code. If you use option 4d1, you must provide the Installation
   Information in the manner specified by section 6 of the GNU GPL
   for conveying Corresponding Source.)

  5. Combined Libraries.

  You may place library facilities that are a work based on the
Library side by side in a single library together with other library
facilities that are not Applications and are not covered by this
License, and convey such a combined library under terms of your
choice, if you do both of the following:

   a) Accompany the combined library with a copy of the same work based
   on the Library, uncombined with any other library facilities,
   conveyed under the terms of this License.

   b) Give prominent notice with the combined library that part of it
   is a work based on the Library, and explaining where to find the
   accompanying uncombined form of the same work.

  6. Revised Versions of the GNU Lesser General Public License.

  The Free Software Foundation may publish revised and/or new versions
of the GNU Lesser General Public License from time to time. Such new
versions will be similar in spirit to the present version, but may
differ in detail to address new problems or concerns.

  Each version is given a distinguishing version number. If the
Library as you received it specifies that a certain numbered version
of the GNU Lesser General Public License "or any later version"
applies to it, you have the option of following the terms and
conditions either of that published version or of any later version
published by the Free Software Foundation. If the Library as you
received it does not specify a version number of the GNU Lesser
General Public License, you may choose any version of the GNU Lesser
General Public License ever published by the Free Software Foundation.

  If the Library as you received it specifies that a proxy can decide
whether future versions of the GNU Lesser General Public License shall
apply, that proxy's public statement of acceptance of any version is
permanent authorization for you to choose that version for the
Library.

========================================================================
* COPYINGv2
========================================================================

                    GNU GENERAL PUBLIC LICENSE
                       Version 2, June 1991

 Copyright (C) 1989, 1991 Free Software Foundation, Inc.,
 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
 Everyone is permitted to copy and distribute verbatim copies
 of this license document, but changing it is not allowed.

                            Preamble

  The licenses for most software are designed to take away your
freedom to share and change it.  By contrast, the GNU General Public
License is intended to guarantee your freedom to share and change free
software--to make sure the software is free for all its users.  This
General Public License applies to most of the Free Software
Foundation's software and to any other program whose authors commit to
using it.  (Some other Free Software Foundation software is covered by
the GNU Lesser General Public License instead.)  You can apply it to
your programs, too.

  When we speak of free software, we are referring to freedom, not
price.  Our General Public Licenses are designed to make sure that you
have the freedom to distribute copies of free software (and charge for
this service if you wish), that you receive source code or can get it
if you want it, that you can change the software or use pieces of it
in new free programs; and that you know you can do these things.

  To protect your rights, we need to make restrictions that forbid
anyone to deny you these rights or to ask you to surrender the rights.
These restrictions translate to certain responsibilities for you if you
distribute copies of the software, or if you modify it.

  For example, if you distribute copies of such a program, whether
gratis or for a fee, you must give the recipients all the rights that
you have.  You must make sure that they, too, receive or can get the
source code.  And you must show them these terms so they know their
rights.

  We protect your rights with two steps: (1) copyright the software, and
(2) offer you this license which gives you legal permission to copy,
distribute and/or modify the software.

  Also, for each author's protection and ours, we want to make certain
that everyone understands that there is no warranty for this free
software.  If the software is modified by someone else and passed on, we
want its recipients to know that what they have is not the original, so
that any problems introduced by others will not reflect on the original
authors' reputations.

  Finally, any free program is threatened constantly by software
patents.  We wish to avoid the danger that redistributors of a free
program will individually obtain patent licenses, in effect making the
program proprietary.  To prevent this, we have made it clear that any
patent must be licensed for everyone's free use or not licensed at all.

  The precise terms and conditions for copying, distribution and
modification follow.

                    GNU GENERAL PUBLIC LICENSE
   TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION

  0. This License applies to any program or other work which contains
a notice placed by the copyright holder saying it may be distributed
under the terms of this General Public License.  The "Program", below,
refers to any such program or work, and a "work based on the Program"
means either the Program or any derivative work under copyright law:
that is to say, a work containing the Program or a portion of it,
either verbatim or with modifications and/or translated into another
language.  (Hereinafter, translation is included without limitation in
the term "modification".)  Each licensee is addressed as "you".

Activities other than copying, distribution and modification are not
covered by this License; they are outside its scope.  The act of
running the Program is not restricted, and the output from the Program
is covered only if its contents constitute a work based on the
Program (independent of having been made by running the Program).
Whether that is true depends on what the Program does.

  1. You may copy and distribute verbatim copies of the Program's
source code as you receive it, in any medium, provided that you
conspicuously and appropriately publish on each copy an appropriate
copyright notice and disclaimer of warranty; keep intact all the
notices that refer to this License and to the absence of any warranty;
and give any other recipients of the Program a copy of this License
along with the Program.

You may charge a fee for the physical act of transferring a copy, and
you may at your option offer warranty protection in exchange for a fee.

  2. You may modify your copy or copies of the Program or any portion
of it, thus forming a work based on the Program, and copy and
distribute such modifications or work under the terms of Section 1
above, provided that you also meet all of these conditions:

    a) You must cause the modified files to carry prominent notices
    stating that you changed the files and the date of any change.

    b) You must cause any work that you distribute or publish, that in
    whole or in part contains or is derived from the Program or any
    part thereof, to be licensed as a whole at no charge to all third
    parties under the terms of this License.

    c) If the modified program normally reads commands interactively
    when run, you must cause it, when started running for such
    interactive use in the most ordinary way, to print or display an
    announcement including an appropriate copyright notice and a
    notice that there is no warranty (or else, saying that you provide
    a warranty) and that users may redistribute the program under
    these conditions, and telling the user how to view a copy of this
    License.  (Exception: if the Program itself is interactive but
    does not normally print such an announcement, your work based on
    the Program is not required to print an announcement.)

These requirements apply to the modified work as a whole.  If
identifiable sections of that work are not derived from the Program,
and can be reasonably considered independent and separate works in
themselves, then this License, and its terms, do not apply to those
sections when you distribute them as separate works.  But when you
distribute the same sections as part of a whole which is a work based
on the Program, the distribution of the whole must be on the terms of
this License, whose permissions for other licensees extend to the
entire whole, and thus to each and every part regardless of who wrote it.

Thus, it is not the intent of this section to claim rights or contest
your rights to work written entirely by you; rather, the intent is to
exercise the right to control the distribution of derivative or
collective works based on the Program.

In addition, mere aggregation of another work not based on the Program
with the Program (or with a work based on the Program) on a volume of
a storage or distribution medium does not bring the other work under
the scope of this License.

  3. You may copy and distribute the Program (or a work based on it,
under Section 2) in object code or executable form under the terms of
Sections 1 and 2 above provided that you also do one of the following:

    a) Accompany it with the complete corresponding machine-readable
    source code, which must be distributed under the terms of Sections
    1 and 2 above on a medium customarily used for software interchange; or,

    b) Accompany it with a written offer, valid for at least three
    years, to give any third party, for a charge no more than your
    cost of physically performing source distribution, a complete
    machine-readable copy of the corresponding source code, to be
    distributed under the terms of Sections 1 and 2 above on a medium
    customarily used for software interchange; or,

    c) Accompany it with the information you received as to the offer
    to distribute corresponding source code.  (This alternative is
    allowed only for noncommercial distribution and only if you
    received the program in object code or executable form with such
    an offer, in accord with Subsection b above.)

The source code for a work means the preferred form of the work for
making modifications to it.  For an executable work, complete source
code means all the source code for all modules it contains, plus any
associated interface definition files, plus the scripts used to
control compilation and installation of the executable.  However, as a
special exception, the source code distributed need not include
anything that is normally distributed (in either source or binary
form) with the major components (compiler, kernel, and so on) of the
operating system on which the executable runs, unless that component
itself accompanies the executable.

If distribution of executable or object code is made by offering
access to copy from a designated place, then offering equivalent
access to copy the source code from the same place counts as
distribution of the source code, even though third parties are not
compelled to copy the source along with the object code.

  4. You may not copy, modify, sublicense, or distribute the Program
except as expressly provided under this License.  Any attempt
otherwise to copy, modify, sublicense or distribute the Program is
void, and will automatically terminate your rights under this License.
However, parties who have received copies, or rights, from you under
this License will not have their licenses terminated so long as such
parties remain in full compliance.

  5. You are not required to accept this License, since you have not
signed it.  However, nothing else grants you permission to modify or
distribute the Program or its derivative works.  These actions are
prohibited by law if you do not accept this License.  Therefore, by
modifying or distributing the Program (or any work based on the
Program), you indicate your acceptance of this License to do so, and
all its terms and conditions for copying, distributing or modifying
the Program or works based on it.

  6. Each time you redistribute the Program (or any work based on the
Program), the recipient automatically receives a license from the
original licensor to copy, distribute or modify the Program subject to
these terms and conditions.  You may not impose any further
restrictions on the recipients' exercise of the rights granted herein.
You are not responsible for enforcing compliance by third parties to
this License.

  7. If, as a consequence of a court judgment or allegation of patent
infringement or for any other reason (not limited to patent issues),
conditions are imposed on you (whether by court order, agreement or
otherwise) that contradict the conditions of this License, they do not
excuse you from the conditions of this License.  If you cannot
distribute so as to satisfy simultaneously your obligations under this
License and any other pertinent obligations, then as a consequence you
may not distribute the Program at all.  For example, if a patent
license would not permit royalty-free redistribution of the Program by
all those who receive copies directly or indirectly through you, then
the only way you could satisfy both it and this License would be to
refrain entirely from distribution of the Program.

If any portion of this section is held invalid or unenforceable under
any particular circumstance, the balance of the section is intended to
apply and the section as a whole is intended to apply in other
circumstances.

It is not the purpose of this section to induce you to infringe any
patents or other property right claims or to contest validity of any
such claims; this section has the sole purpose of protecting the
integrity of the free software distribution system, which is
implemented by public license practices.  Many people have made
generous contributions to the wide range of software distributed
through that system in reliance on consistent application of that
system; it is up to the author/donor to decide if he or she is willing
to distribute software through any other system and a licensee cannot
impose that choice.

This section is intended to make thoroughly clear what is believed to
be a consequence of the rest of this License.

  8. If the distribution and/or use of the Program is restricted in
certain countries either by patents or by copyrighted interfaces, the
original copyright holder who places the Program under this License
may add an explicit geographical distribution limitation excluding
those countries, so that distribution is permitted only in or among
countries not thus excluded.  In such case, this License incorporates
the limitation as if written in the body of this License.

  9. The Free Software Foundation may publish revised and/or new versions
of the General Public License from time to time.  Such new versions will
be similar in spirit to the present version, but may differ in detail to
address new problems or concerns.

Each version is given a distinguishing version number.  If the Program
specifies a version number of this License which applies to it and "any
later version", you have the option of following the terms and conditions
either of that version or of any later version published by the Free
Software Foundation.  If the Program does not specify a version number of
this License, you may choose any version ever published by the Free Software
Foundation.

  10. If you wish to incorporate parts of the Program into other free
programs whose distribution conditions are different, write to the author
to ask for permission.  For software which is copyrighted by the Free
Software Foundation, write to the Free Software Foundation; we sometimes
make exceptions for this.  Our decision will be guided by the two goals
of preserving the free status of all derivatives of our free software and
of promoting the sharing and reuse of software generally.

                            NO WARRANTY

  11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW.  EXCEPT WHEN
OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.  THE ENTIRE RISK AS
TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU.  SHOULD THE
PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
REPAIR OR CORRECTION.

  12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
POSSIBILITY OF SUCH DAMAGES.

                     END OF TERMS AND CONDITIONS

            How to Apply These Terms to Your New Programs

  If you develop a new program, and you want it to be of the greatest
possible use to the public, the best way to achieve this is to make it
free software which everyone can redistribute and change under these terms.

  To do so, attach the following notices to the program.  It is safest
to attach them to the start of each source file to most effectively
convey the exclusion of warranty; and each file should have at least
the "copyright" line and a pointer to where the full notice is found.

    <one line to give the program's name and a brief idea of what it does.>
    Copyright (C) <year>  <name of author>

    This program is free software; you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation; either version 2 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License along
    with this program; if not, write to the Free Software Foundation, Inc.,
    51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.

Also add information on how to contact you by electronic and paper mail.

If the program is interactive, make it output a short notice like this
when it starts in an interactive mode:

    Gnomovision version 69, Copyright (C) year name of author
    Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
    This is free software, and you are welcome to redistribute it
    under certain conditions; type `show c' for details.

The hypothetical commands `show w' and `show c' should show the appropriate
parts of the General Public License.  Of course, the commands you use may
be called something other than `show w' and `show c'; they could even be
mouse-clicks or menu items--whatever suits your program.

You should also get your employer (if you work as a programmer) or your
school, if any, to sign a "copyright disclaimer" for the program, if
necessary.  Here is a sample; alter the names:

  Yoyodyne, Inc., hereby disclaims all copyright interest in the program
  `Gnomovision' (which makes passes at compilers) written by James Hacker.

  <signature of Ty Coon>, 1 April 1989
  Ty Coon, President of Vice

This General Public License does not permit incorporating your program into
proprietary programs.  If your program is a subroutine library, you may
consider it more useful to permit linking proprietary applications with the
library.  If this is what you want to do, use the GNU Lesser General
Public License instead of this License.

========================================================================
* COPYINGv3
========================================================================

                    GNU GENERAL PUBLIC LICENSE
                       Version 3, 29 June 2007

 Copyright (C) 2007 Free Software Foundation, Inc. <http://fsf.org/>
 Everyone is permitted to copy and distribute verbatim copies
 of this license document, but changing it is not allowed.

                            Preamble

  The GNU General Public License is a free, copyleft license for
software and other kinds of works.

  The licenses for most software and other practical works are designed
to take away your freedom to share and change the works.  By contrast,
the GNU General Public License is intended to guarantee your freedom to
share and change all versions of a program--to make sure it remains free
software for all its users.  We, the Free Software Foundation, use the
GNU General Public License for most of our software; it applies also to
any other work released this way by its authors.  You can apply it to
your programs, too.

  When we speak of free software, we are referring to freedom, not
price.  Our General Public Licenses are designed to make sure that you
have the freedom to distribute copies of free software (and charge for
them if you wish), that you receive source code or can get it if you
want it, that you can change the software or use pieces of it in new
free programs, and that you know you can do these things.

  To protect your rights, we need to prevent others from denying you
these rights or asking you to surrender the rights.  Therefore, you have
certain responsibilities if you distribute copies of the software, or if
you modify it: responsibilities to respect the freedom of others.

  For example, if you distribute copies of such a program, whether
gratis or for a fee, you must pass on to the recipients the same
freedoms that you received.  You must make sure that they, too, receive
or can get the source code.  And you must show them these terms so they
know their rights.

  Developers that use the GNU GPL protect your rights with two steps:
(1) assert copyright on the software, and (2) offer you this License
giving you legal permission to copy, distribute and/or modify it.

  For the developers' and authors' protection, the GPL clearly explains
that there is no warranty for this free software.  For both users' and
authors' sake, the GPL requires that modified versions be marked as
changed, so that their problems will not be attributed erroneously to
authors of previous versions.

  Some devices are designed to deny users access to install or run
modified versions of the software inside them, although the manufacturer
can do so.  This is fundamentally incompatible with the aim of
protecting users' freedom to change the software.  The systematic
pattern of such abuse occurs in the area of products for individuals to
use, which is precisely where it is most unacceptable.  Therefore, we
have designed this version of the GPL to prohibit the practice for those
products.  If such problems arise substantially in other domains, we
stand ready to extend this provision to those domains in future versions
of the GPL, as needed to protect the freedom of users.

  Finally, every program is threatened constantly by software patents.
States should not allow patents to restrict development and use of
software on general-purpose computers, but in those that do, we wish to
avoid the special danger that patents applied to a free program could
make it effectively proprietary.  To prevent this, the GPL assures that
patents cannot be used to render the program non-free.

  The precise terms and conditions for copying, distribution and
modification follow.

                       TERMS AND CONDITIONS

  0. Definitions.

  "This License" refers to version 3 of the GNU General Public License.

  "Copyright" also means copyright-like laws that apply to other kinds of
works, such as semiconductor masks.

  "The Program" refers to any copyrightable work licensed under this
License.  Each licensee is addressed as "you".  "Licensees" and
"recipients" may be individuals or organizations.

  To "modify" a work means to copy from or adapt all or part of the work
in a fashion requiring copyright permission, other than the making of an
exact copy.  The resulting work is called a "modified version" of the
earlier work or a work "based on" the earlier work.

  A "covered work" means either the unmodified Program or a work based
on the Program.

  To "propagate" a work means to do anything with it that, without
permission, would make you directly or secondarily liable for
infringement under applicable copyright law, except executing it on a
computer or modifying a private copy.  Propagation includes copying,
distribution (with or without modification), making available to the
public, and in some countries other activities as well.

  To "convey" a work means any kind of propagation that enables other
parties to make or receive copies.  Mere interaction with a user through
a computer network, with no transfer of a copy, is not conveying.

  An interactive user interface displays "Appropriate Legal Notices"
to the extent that it includes a convenient and prominently visible
feature that (1) displays an appropriate copyright notice, and (2)
tells the user that there is no warranty for the work (except to the
extent that warranties are provided), that licensees may convey the
work under this License, and how to view a copy of this License.  If
the interface presents a list of user commands or options, such as a
menu, a prominent item in the list meets this criterion.

  1. Source Code.

  The "source code" for a work means the preferred form of the work
for making modifications to it.  "Object code" means any non-source
form of a work.

  A "Standard Interface" means an interface that either is an official
standard defined by a recognized standards body, or, in the case of
interfaces specified for a particular programming language, one that
is widely used among developers working in that language.

  The "System Libraries" of an executable work include anything, other
than the work as a whole, that (a) is included in the normal form of
packaging a Major Component, but which is not part of that Major
Component, and (b) serves only to enable use of the work with that
Major Component, or to implement a Standard Interface for which an
implementation is available to the public in source code form.  A
"Major Component", in this context, means a major essential component
(kernel, window system, and so on) of the specific operating system
(if any) on which the executable work runs, or a compiler used to
produce the work, or an object code interpreter used to run it.

  The "Corresponding Source" for a work in object code form means all
the source code needed to generate, install, and (for an executable
work) run the object code and to modify the work, including scripts to
control those activities.  However, it does not include the work's
System Libraries, or general-purpose tools or generally available free
programs which are used unmodified in performing those activities but
which are not part of the work.  For example, Corresponding Source
includes interface definition files associated with source files for
the work, and the source code for shared libraries and dynamically
linked subprograms that the work is specifically designed to require,
such as by intimate data communication or control flow between those
subprograms and other parts of the work.

  The Corresponding Source need not include anything that users
can regenerate automatically from other parts of the Corresponding
Source.

  The Corresponding Source for a work in source code form is that
same work.

  2. Basic Permissions.

  All rights granted under this License are granted for the term of
copyright on the Program, and are irrevocable provided the stated
conditions are met.  This License explicitly affirms your unlimited
permission to run the unmodified Program.  The output from running a
covered work is covered by this License only if the output, given its
content, constitutes a covered work.  This License acknowledges your
rights of fair use or other equivalent, as provided by copyright law.

  You may make, run and propagate covered works that you do not
convey, without conditions so long as your license otherwise remains
in force.  You may convey covered works to others for the sole purpose
of having them make modifications exclusively for you, or provide you
with facilities for running those works, provided that you comply with
the terms of this License in conveying all material for which you do
not control copyright.  Those thus making or running the covered works
for you must do so exclusively on your behalf, under your direction
and control, on terms that prohibit them from making any copies of
your copyrighted material outside their relationship with you.

  Conveying under any other circumstances is permitted solely under
the conditions stated below.  Sublicensing is not allowed; section 10
makes it unnecessary.

  3. Protecting Users' Legal Rights From Anti-Circumvention Law.

  No covered work shall be deemed part of an effective technological
measure under any applicable law fulfilling obligations under article
11 of the WIPO copyright treaty adopted on 20 December 1996, or
similar laws prohibiting or restricting circumvention of such
measures.

  When you convey a covered work, you waive any legal power to forbid
circumvention of technological measures to the extent such circumvention
is effected by exercising rights under this License with respect to
the covered work, and you disclaim any intention to limit operation or
modification of the work as a means of enforcing, against the work's
users, your or third parties' legal rights to forbid circumvention of
technological measures.

  4. Conveying Verbatim Copies.

  You may convey verbatim copies of the Program's source code as you
receive it, in any medium, provided that you conspicuously and
appropriately publish on each copy an appropriate copyright notice;
keep intact all notices stating that this License and any
non-permissive terms added in accord with section 7 apply to the code;
keep intact all notices of the absence of any warranty; and give all
recipients a copy of this License along with the Program.

  You may charge any price or no price for each copy that you convey,
and you may offer support or warranty protection for a fee.

  5. Conveying Modified Source Versions.

  You may convey a work based on the Program, or the modifications to
produce it from the Program, in the form of source code under the
terms of section 4, provided that you also meet all of these conditions:

    a) The work must carry prominent notices stating that you modified
    it, and giving a relevant date.

    b) The work must carry prominent notices stating that it is
    released under this License and any conditions added under section
    7.  This requirement modifies the requirement in section 4 to
    "keep intact all notices".

    c) You must license the entire work, as a whole, under this
    License to anyone who comes into possession of a copy.  This
    License will therefore apply, along with any applicable section 7
    additional terms, to the whole of the work, and all its parts,
    regardless of how they are packaged.  This License gives no
    permission to license the work in any other way, but it does not
    invalidate such permission if you have separately received it.

    d) If the work has interactive user interfaces, each must display
    Appropriate Legal Notices; however, if the Program has interactive
    interfaces that do not display Appropriate Legal Notices, your
    work need not make them do so.

  A compilation of a covered work with other separate and independent
works, which are not by their nature extensions of the covered work,
and which are not combined with it such as to form a larger program,
in or on a volume of a storage or distribution medium, is called an
"aggregate" if the compilation and its resulting copyright are not
used to limit the access or legal rights of the compilation's users
beyond what the individual works permit.  Inclusion of a covered work
in an aggregate does not cause this License to apply to the other
parts of the aggregate.

  6. Conveying Non-Source Forms.

  You may convey a covered work in object code form under the terms
of sections 4 and 5, provided that you also convey the
machine-readable Corresponding Source under the terms of this License,
in one of these ways:

    a) Convey the object code in, or embodied in, a physical product
    (including a physical distribution medium), accompanied by the
    Corresponding Source fixed on a durable physical medium
    customarily used for software interchange.

    b) Convey the object code in, or embodied in, a physical product
    (including a physical distribution medium), accompanied by a
    written offer, valid for at least three years and valid for as
    long as you offer spare parts or customer support for that product
    model, to give anyone who possesses the object code either (1) a
    copy of the Corresponding Source for all the software in the
    product that is covered by this License, on a durable physical
    medium customarily used for software interchange, for a price no
    more than your reasonable cost of physically performing this
    conveying of source, or (2) access to copy the
    Corresponding Source from a network server at no charge.

    c) Convey individual copies of the object code with a copy of the
    written offer to provide the Corresponding Source.  This
    alternative is allowed only occasionally and noncommercially, and
    only if you received the object code with such an offer, in accord
    with subsection 6b.

    d) Convey the object code by offering access from a designated
    place (gratis or for a charge), and offer equivalent access to the
    Corresponding Source in the same way through the same place at no
    further charge.  You need not require recipients to copy the
    Corresponding Source along with the object code.  If the place to
    copy the object code is a network server, the Corresponding Source
    may be on a different server (operated by you or a third party)
    that supports equivalent copying facilities, provided you maintain
    clear directions next to the object code saying where to find the
    Corresponding Source.  Regardless of what server hosts the
    Corresponding Source, you remain obligated to ensure that it is
    available for as long as needed to satisfy these requirements.

    e) Convey the object code using peer-to-peer transmission, provided
    you inform other peers where the object code and Corresponding
    Source of the work are being offered to the general public at no
    charge under subsection 6d.

  A separable portion of the object code, whose source code is excluded
from the Corresponding Source as a System Library, need not be
included in conveying the object code work.

  A "User Product" is either (1) a "consumer product", which means any
tangible personal property which is normally used for personal, family,
or household purposes, or (2) anything designed or sold for incorporation
into a dwelling.  In determining whether a product is a consumer product,
doubtful cases shall be resolved in favor of coverage.  For a particular
product received by a particular user, "normally used" refers to a
typical or common use of that class of product, regardless of the status
of the particular user or of the way in which the particular user
actually uses, or expects or is expected to use, the product.  A product
is a consumer product regardless of whether the product has substantial
commercial, industrial or non-consumer uses, unless such uses represent
the only significant mode of use of the product.

  "Installation Information" for a User Product means any methods,
procedures, authorization keys, or other information required to install
and execute modified versions of a covered work in that User Product from
a modified version of its Corresponding Source.  The information must
suffice to ensure that the continued functioning of the modified object
code is in no case prevented or interfered with solely because
modification has been made.

  If you convey an object code work under this section in, or with, or
specifically for use in, a User Product, and the conveying occurs as
part of a transaction in which the right of possession and use of the
User Product is transferred to the recipient in perpetuity or for a
fixed term (regardless of how the transaction is characterized), the
Corresponding Source conveyed under this section must be accompanied
by the Installation Information.  But this requirement does not apply
if neither you nor any third party retains the ability to install
modified object code on the User Product (for example, the work has
been installed in ROM).

  The requirement to provide Installation Information does not include a
requirement to continue to provide support service, warranty, or updates
for a work that has been modified or installed by the recipient, or for
the User Product in which it has been modified or installed.  Access to a
network may be denied when the modification itself materially and
adversely affects the operation of the network or violates the rules and
protocols for communication across the network.

  Corresponding Source conveyed, and Installation Information provided,
in accord with this section must be in a format that is publicly
documented (and with an implementation available to the public in
source code form), and must require no special password or key for
unpacking, reading or copying.

  7. Additional Terms.

  "Additional permissions" are terms that supplement the terms of this
License by making exceptions from one or more of its conditions.
Additional permissions that are applicable to the entire Program shall
be treated as though they were included in this License, to the extent
that they are valid under applicable law.  If additional permissions
apply only to part of the Program, that part may be used separately
under those permissions, but the entire Program remains governed by
this License without regard to the additional permissions.

  When you convey a copy of a covered work, you may at your option
remove any additional permissions from that copy, or from any part of
it.  (Additional permissions may be written to require their own
removal in certain cases when you modify the work.)  You may place
additional permissions on material, added by you to a covered work,
for which you have or can give appropriate copyright permission.

  Notwithstanding any other provision of this License, for material you
add to a covered work, you may (if authorized by the copyright holders of
that material) supplement the terms of this License with terms:

    a) Disclaiming warranty or limiting liability differently from the
    terms of sections 15 and 16 of this License; or

    b) Requiring preservation of specified reasonable legal notices or
    author attributions in that material or in the Appropriate Legal
    Notices displayed by works containing it; or

    c) Prohibiting misrepresentation of the origin of that material, or
    requiring that modified versions of such material be marked in
    reasonable ways as different from the original version; or

    d) Limiting the use for publicity purposes of names of licensors or
    authors of the material; or

    e) Declining to grant rights under trademark law for use of some
    trade names, trademarks, or service marks; or

    f) Requiring indemnification of licensors and authors of that
    material by anyone who conveys the material (or modified versions of
    it) with contractual assumptions of liability to the recipient, for
    any liability that these contractual assumptions directly impose on
    those licensors and authors.

  All other non-permissive additional terms are considered "further
restrictions" within the meaning of section 10.  If the Program as you
received it, or any part of it, contains a notice stating that it is
governed by this License along with a term that is a further
restriction, you may remove that term.  If a license document contains
a further restriction but permits relicensing or conveying under this
License, you may add to a covered work material governed by the terms
of that license document, provided that the further restriction does
not survive such relicensing or conveying.

  If you add terms to a covered work in accord with this section, you
must place, in the relevant source files, a statement of the
additional terms that apply to those files, or a notice indicating
where to find the applicable terms.

  Additional terms, permissive or non-permissive, may be stated in the
form of a separately written license, or stated as exceptions;
the above requirements apply either way.

  8. Termination.

  You may not propagate or modify a covered work except as expressly
provided under this License.  Any attempt otherwise to propagate or
modify it is void, and will automatically terminate your rights under
this License (including any patent licenses granted under the third
paragraph of section 11).

  However, if you cease all violation of this License, then your
license from a particular copyright holder is reinstated (a)
provisionally, unless and until the copyright holder explicitly and
finally terminates your license, and (b) permanently, if the copyright
holder fails to notify you of the violation by some reasonable means
prior to 60 days after the cessation.

  Moreover, your license from a particular copyright holder is
reinstated permanently if the copyright holder notifies you of the
violation by some reasonable means, this is the first time you have
received notice of violation of this License (for any work) from that
copyright holder, and you cure the violation prior to 30 days after
your receipt of the notice.

  Termination of your rights under this section does not terminate the
licenses of parties who have received copies or rights from you under
this License.  If your rights have been terminated and not permanently
reinstated, you do not qualify to receive new licenses for the same
material under section 10.

  9. Acceptance Not Required for Having Copies.

  You are not required to accept this License in order to receive or
run a copy of the Program.  Ancillary propagation of a covered work
occurring solely as a consequence of using peer-to-peer transmission
to receive a copy likewise does not require acceptance.  However,
nothing other than this License grants you permission to propagate or
modify any covered work.  These actions infringe copyright if you do
not accept this License.  Therefore, by modifying or propagating a
covered work, you indicate your acceptance of this License to do so.

  10. Automatic Licensing of Downstream Recipients.

  Each time you convey a covered work, the recipient automatically
receives a license from the original licensors, to run, modify and
propagate that work, subject to this License.  You are not responsible
for enforcing compliance by third parties with this License.

  An "entity transaction" is a transaction transferring control of an
organization, or substantially all assets of one, or subdividing an
organization, or merging organizations.  If propagation of a covered
work results from an entity transaction, each party to that
transaction who receives a copy of the work also receives whatever
licenses to the work the party's predecessor in interest had or could
give under the previous paragraph, plus a right to possession of the
Corresponding Source of the work from the predecessor in interest, if
the predecessor has it or can get it with reasonable efforts.

  You may not impose any further restrictions on the exercise of the
rights granted or affirmed under this License.  For example, you may
not impose a license fee, royalty, or other charge for exercise of
rights granted under this License, and you may not initiate litigation
(including a cross-claim or counterclaim in a lawsuit) alleging that
any patent claim is infringed by making, using, selling, offering for
sale, or importing the Program or any portion of it.

  11. Patents.

  A "contributor" is a copyright holder who authorizes use under this
License of the Program or a work on which the Program is based.  The
work thus licensed is called the contributor's "contributor version".

  A contributor's "essential patent claims" are all patent claims
owned or controlled by the contributor, whether already acquired or
hereafter acquired, that would be infringed by some manner, permitted
by this License, of making, using, or selling its contributor version,
but do not include claims that would be infringed only as a
consequence of further modification of the contributor version.  For
purposes of this definition, "control" includes the right to grant
patent sublicenses in a manner consistent with the requirements of
this License.

  Each contributor grants you a non-exclusive, worldwide, royalty-free
patent license under the contributor's essential patent claims, to
make, use, sell, offer for sale, import and otherwise run, modify and
propagate the contents of its contributor version.

  In the following three paragraphs, a "patent license" is any express
agreement or commitment, however denominated, not to enforce a patent
(such as an express permission to practice a patent or covenant not to
sue for patent infringement).  To "grant" such a patent license to a
party means to make such an agreement or commitment not to enforce a
patent against the party.

  If you convey a covered work, knowingly relying on a patent license,
and the Corresponding Source of the work is not available for anyone
to copy, free of charge and under the terms of this License, through a
publicly available network server or other readily accessible means,
then you must either (1) cause the Corresponding Source to be so
available, or (2) arrange to deprive yourself of the benefit of the
patent license for this particular work, or (3) arrange, in a manner
consistent with the requirements of this License, to extend the patent
license to downstream recipients.  "Knowingly relying" means you have
actual knowledge that, but for the patent license, your conveying the
covered work in a country, or your recipient's use of the covered work
in a country, would infringe one or more identifiable patents in that
country that you have reason to believe are valid.

  If, pursuant to or in connection with a single transaction or
arrangement, you convey, or propagate by procuring conveyance of, a
covered work, and grant a patent license to some of the parties
receiving the covered work authorizing them to use, propagate, modify
or convey a specific copy of the covered work, then the patent license
you grant is automatically extended to all recipients of the covered
work and works based on it.

  A patent license is "discriminatory" if it does not include within
the scope of its coverage, prohibits the exercise of, or is
conditioned on the non-exercise of one or more of the rights that are
specifically granted under this License.  You may not convey a covered
work if you are a party to an arrangement with a third party that is
in the business of distributing software, under which you make payment
to the third party based on the extent of your activity of conveying
the work, and under which the third party grants, to any of the
parties who would receive the covered work from you, a discriminatory
patent license (a) in connection with copies of the covered work
conveyed by you (or copies made from those copies), or (b) primarily
for and in connection with specific products or compilations that
contain the covered work, unless you entered into that arrangement,
or that patent license was granted, prior to 28 March 2007.

  Nothing in this License shall be construed as excluding or limiting
any implied license or other defenses to infringement that may
otherwise be available to you under applicable patent law.

  12. No Surrender of Others' Freedom.

  If conditions are imposed on you (whether by court order, agreement or
otherwise) that contradict the conditions of this License, they do not
excuse you from the conditions of this License.  If you cannot convey a
covered work so as to satisfy simultaneously your obligations under this
License and any other pertinent obligations, then as a consequence you may
not convey it at all.  For example, if you agree to terms that obligate you
to collect a royalty for further conveying from those to whom you convey
the Program, the only way you could satisfy both those terms and this
License would be to refrain entirely from conveying the Program.

  13. Use with the GNU Affero General Public License.

  Notwithstanding any other provision of this License, you have
permission to link or combine any covered work with a work licensed
under version 3 of the GNU Affero General Public License into a single
combined work, and to convey the resulting work.  The terms of this
License will continue to apply to the part which is the covered work,
but the special requirements of the GNU Affero General Public License,
section 13, concerning interaction through a network will apply to the
combination as such.

  14. Revised Versions of this License.

  The Free Software Foundation may publish revised and/or new versions of
the GNU General Public License from time to time.  Such new versions will
be similar in spirit to the present version, but may differ in detail to
address new problems or concerns.

  Each version is given a distinguishing version number.  If the
Program specifies that a certain numbered version of the GNU General
Public License "or any later version" applies to it, you have the
option of following the terms and conditions either of that numbered
version or of any later version published by the Free Software
Foundation.  If the Program does not specify a version number of the
GNU General Public License, you may choose any version ever published
by the Free Software Foundation.

  If the Program specifies that a proxy can decide which future
versions of the GNU General Public License can be used, that proxy's
public statement of acceptance of a version permanently authorizes you
to choose that version for the Program.

  Later license versions may give you additional or different
permissions.  However, no additional obligations are imposed on any
author or copyright holder as a result of your choosing to follow a
later version.

  15. Disclaimer of Warranty.

  THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY
APPLICABLE LAW.  EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT
HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY
OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,
THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE.  THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM
IS WITH YOU.  SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF
ALL NECESSARY SERVICING, REPAIR OR CORRECTION.

  16. Limitation of Liability.

  IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS
THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY
GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE
USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF
DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD
PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS),
EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
SUCH DAMAGES.

  17. Interpretation of Sections 15 and 16.

  If the disclaimer of warranty and limitation of liability provided
above cannot be given local legal effect according to their terms,
reviewing courts shall apply local law that most closely approximates
an absolute waiver of all civil liability in connection with the
Program, unless a warranty or assumption of liability accompanies a
copy of the Program in return for a fee.

                     END OF TERMS AND CONDITIONS

            How to Apply These Terms to Your New Programs

  If you develop a new program, and you want it to be of the greatest
possible use to the public, the best way to achieve this is to make it
free software which everyone can redistribute and change under these terms.

  To do so, attach the following notices to the program.  It is safest
to attach them to the start of each source file to most effectively
state the exclusion of warranty; and each file should have at least
the "copyright" line and a pointer to where the full notice is found.

    <one line to give the program's name and a brief idea of what it does.>
    Copyright (C) <year>  <name of author>

    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program.  If not, see <https://www.gnu.org/licenses/>.

Also add information on how to contact you by electronic and paper mail.

  If the program does terminal interaction, make it output a short
notice like this when it starts in an interactive mode:

    <program>  Copyright (C) <year>  <name of author>
    This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
    This is free software, and you are welcome to redistribute it
    under certain conditions; type `show c' for details.

The hypothetical commands `show w' and `show c' should show the appropriate
parts of the General Public License.  Of course, your program's commands
might be different; for a GUI interface, you would use an "about box".

  You should also get your employer (if you work as a programmer) or school,
if any, to sign a "copyright disclaimer" for the program, if necessary.
For more information on this, and how to apply and follow the GNU GPL, see
<https://www.gnu.org/licenses/>.

  The GNU General Public License does not permit incorporating your program
into proprietary programs.  If your program is a subroutine library, you
may consider it more useful to permit linking proprietary applications with
the library.  If this is what you want to do, use the GNU Lesser General
Public License instead of this License.  But first, please read
<https://www.gnu.org/philosophy/why-not-lgpl.html>.