GMP Itemized Development Tasks

Copyright 2000-2004, 2006, 2008, 2009 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.

This file is current as of 29 Jan 2014. An up-to-date version is available at https://gmplib.org/tasks.html. Please send comments about this page to gmp-devel@gmplib.org.

These are itemized GMP development tasks. Not all the tasks listed here are suitable for volunteers, but many of them are. Please see the projects file for more sizeable projects.

CAUTION: This file needs updating. Many of the tasks here have either already been taken care of, or have become irrelevant.

Correctness and Completeness

_LONG_LONG_LIMB in gmp.h is not namespace clean. Reported by Patrick Pelissier.
We sort of mentioned _LONG_LONG_LIMB in past releases, so need to be careful about changing it. It used to be a define applications had to set for long long limb systems, but that in particular is no longer relevant now that it's established automatically.
The various reuse.c tests need to force reallocation by calling _mpz_realloc with a small (1 limb) size.
One reuse case is missing from mpX/tests/reuse.c: mpz_XXX(a,a,a).
Make the string reading functions allow the `0x' prefix when the base is explicitly 16. They currently only allow that prefix when the base is unspecified (zero).
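A minimal sketch of the prefix rule proposed in the item above, assuming a helper that only decides how many leading characters to skip. The name hex_prefix_len is hypothetical and is not part of GMP; the real string readers also handle whitespace and signs, which are left out here.

```c
#include <stddef.h>

/* Hypothetical helper, not part of GMP: return the length of a leading
   "0x"/"0X" prefix that should be skipped, or 0 if there is none.
   The prefix is accepted when base is 0 (unspecified) or explicitly 16.  */
size_t
hex_prefix_len (const char *s, int base)
{
  if (base != 0 && base != 16)
    return 0;
  if (s[0] == '0' && (s[1] == 'x' || s[1] == 'X'))
    return 2;
  return 0;
}
```

A caller would advance the string by the returned length and then parse the remaining digits in base 16.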
mpf_eq is not always correct, when one operand is 1000000000... and the other operand is 0111111111..., i.e., extremely close. There is a special case in mpf_sub for this situation; put similar code in mpf_eq. [In progress.]
mpf_eq doesn't implement what gmp.texi specifies. It should not use just whole limbs, but partial limbs. [In progress.]
mpf_set_str doesn't validate its exponent, for instance garbage 123.456eX789X is accepted (and an exponent 0 used), and overflow of a long is not detected.
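One way to get both checks from the item above is to lean on strtol's end pointer and errno. This is only a sketch under that assumption; parse_exponent is a hypothetical name, and mpf_set_str's real exponent handling (including non-decimal bases) is more involved. Note that strtol itself skips leading whitespace and accepts a sign.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper, not GMP code: parse a decimal exponent string.
   Returns 1 and stores the value on success; returns 0 if the string has
   no digits, has trailing garbage, or does not fit in a long.  */
int
parse_exponent (const char *s, long *exp_out)
{
  char *end;
  long val;

  errno = 0;
  val = strtol (s, &end, 10);
  if (end == s || *end != '\0')
    return 0;                   /* no digits, or garbage such as "X789X" */
  if (errno == ERANGE)
    return 0;                   /* value does not fit in a long */
  *exp_out = val;
  return 1;
}
```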
mpf_add doesn't check for a carry from truncated portions of the inputs, and in that respect doesn't implement the "infinite precision followed by truncate" specified in the manual.
Windows DLLs: tests/mpz/reuse.c and tests/mpf/reuse.c initialize global variables with pointers to mpz_add etc, which doesn't work when those routines are coming from a DLL (because they're effectively function pointer global variables themselves). Need to rearrange perhaps to a set of calls to a test function rather than iterating over an array.
mpz_pow_ui: Detect when the result would be more memory than a size_t can represent and raise some suitable exception, probably an alloc call asking for SIZE_T_MAX, and if that somehow succeeds then an abort. Various size overflows of this kind are not handled gracefully, probably resulting in segvs.
In mpz_n_pow_ui, detect when the count of low zero bits exceeds an unsigned long. There's a (small) chance of this happening but still having enough memory to represent the value. Reported by Winfried Dreckmann in for instance mpz_ui_pow_ui (x, 4UL, 1431655766UL).
mpf: Detect exponent overflow and raise some exception. It'd be nice to allow the full mp_exp_t range since that's how it's been in the past, but maybe dropping one bit would make it easier to test if e1+e2 goes out of bounds.
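The out-of-bounds test for e1+e2 can be written without ever evaluating an overflowing signed sum. The sketch below assumes the exponent type is a plain long (as mp_exp_t commonly is) and uses a hypothetical helper name; it keeps the full range rather than dropping a bit.

```c
#include <limits.h>

/* Hypothetical helper: return 1 if e1 + e2 is representable in a long,
   0 if the sum would overflow.  No signed overflow is ever evaluated.  */
int
exp_sum_fits (long e1, long e2)
{
  if (e2 > 0)
    return e1 <= LONG_MAX - e2;
  return e1 >= LONG_MIN - e2;
}
```

If one bit of range were dropped as the item suggests, the sum of two in-range exponents could instead be formed directly and compared against the bounds afterwards.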

Machine Independent Optimization

mpf_cmp: For better cache locality, don't test for low zero limbs until the high limbs fail to give an ordering. Reduce code size by turning the three mpn_cmp's into a single loop stopping when the end of one operand is reached (and then looking for a non-zero in the rest of the other).
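The single-loop shape suggested above might look roughly like this. It is a sketch on raw little-endian limb arrays, assuming both operands are positive with equal exponents so their most significant limbs line up; cmp_high_aligned is a hypothetical name and the sign and exponent handling of the real mpf_cmp is omitted.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch, not GMP code.  Compare two magnitudes stored as
   little-endian limb arrays whose most significant limbs are aligned.
   Returns -1, 0 or 1.  */
int
cmp_high_aligned (const uint64_t *up, size_t un,
                  const uint64_t *vp, size_t vn)
{
  /* Walk down from the high end until one operand is exhausted.  */
  while (un > 0 && vn > 0)
    {
      uint64_t ul = up[--un];
      uint64_t vl = vp[--vn];
      if (ul != vl)
        return ul > vl ? 1 : -1;
    }
  /* Common high limbs are equal: the longer operand is larger only if
     one of its remaining low limbs is non-zero.  */
  while (un > 0)
    if (up[--un] != 0)
      return 1;
  while (vn > 0)
    if (vp[--vn] != 0)
      return -1;
  return 0;
}
```

Low zero limbs are only inspected after the high limbs have failed to give an ordering, which is the cache-locality point made in the item.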
mpf_mul_2exp, mpf_div_2exp: The use of mpn_lshift for any size<=prec means repeated mul_2exp and div_2exp calls accumulate low zero limbs until size==prec+1 is reached. Those zeros will slow down subsequent operations, especially if the value is otherwise only small. If low bits of the low limb are zero, use mpn_rshift so as to not increase the size.
mpn_dc_sqrtrem, mpn_sqrtrem2: Don't use mpn_add_1 and mpn_sub_1 for 1 limb operations, instead ADDC_LIMB and SUBC_LIMB.
mpn_sqrtrem2: Use plain variables for sp[0] and rp[0] calculations, so the compiler needn't worry about aliasing between sp and rp.
mpn_sqrtrem: Some work can be saved in the last step when the remainder is not required, as noted in Paul's paper.
mpq_add, mpq_sub: The gcd fits a single limb with high probability and in this case binvert_limb could be used to calculate the inverse just once for the two exact divisions "op1.den / gcd" and "op2.den / gcd", rather than letting mpn_bdiv_q_1 do it each time. This would require calling mpn_pi1_bdiv_q_1.
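To illustrate computing the inverse once and reusing it, here is a single-limb sketch. It assumes 64-bit limbs and an odd divisor (an even gcd would first need its trailing zero bits shifted out of both divisor and dividends). The function names are hypothetical; this is not the code of binvert_limb or mpn_pi1_bdiv_q_1.

```c
#include <stdint.h>

/* Hypothetical sketch: inverse of an odd d modulo 2^64 by Newton
   iteration.  Each step doubles the number of correct low bits, starting
   from 3 (d*d == 1 mod 8 for any odd d).  */
uint64_t
invert_odd_limb (uint64_t d)
{
  uint64_t inv = d;             /* 3 bits  */
  inv *= 2 - d * inv;           /* 6 bits  */
  inv *= 2 - d * inv;           /* 12 bits */
  inv *= 2 - d * inv;           /* 24 bits */
  inv *= 2 - d * inv;           /* 48 bits */
  inv *= 2 - d * inv;           /* 96 bits, more than the 64 needed */
  return inv;
}

/* Exact single-limb division a / d for odd d, given inv = 1/d mod 2^64.
   Only valid when d divides a exactly.  */
uint64_t
divexact_odd_limb (uint64_t a, uint64_t inv)
{
  return a * inv;
}
```

With the gcd in hand, invert_odd_limb would be called once and the result passed to both exact divisions.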
mpn_gcdext: Don't test count_leading_zeros for zero, instead check the high bit of the operand and avoid invoking count_leading_zeros. This is an optimization on all machines, and significant on machines with slow count_leading_zeros, though it's possible an already normalized operand might not be encountered very often.
Rewrite umul_ppmm to use floating-point for generating the most significant limb (if GMP_LIMB_BITS <= 52 bits). (Peter Montgomery has some ideas on this subject.)
Improve the default umul_ppmm code in longlong.h: Add partial products with fewer operations.
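For reference, the usual portable shape of such a multiply builds the double-limb product from four half-limb partial products. The sketch below assumes 64-bit limbs; the names limb_pair and umul_64 are hypothetical, and the struct return is only for convenience of the example. It shows the baseline to be improved on, not an improved version.

```c
#include <stdint.h>

/* Hypothetical result type: high and low halves of a 128-bit product.  */
typedef struct
{
  uint64_t hi;
  uint64_t lo;
} limb_pair;

/* Portable 64x64 -> 128 bit multiply from four 32x32 -> 64 products.  */
limb_pair
umul_64 (uint64_t u, uint64_t v)
{
  uint64_t ul = u & 0xffffffffu, uh = u >> 32;
  uint64_t vl = v & 0xffffffffu, vh = v >> 32;
  limb_pair r;

  uint64_t x0 = ul * vl;
  uint64_t x1 = ul * vh;
  uint64_t x2 = uh * vl;
  uint64_t x3 = uh * vh;

  x1 += x0 >> 32;               /* cannot overflow */
  x1 += x2;                     /* may wrap around ... */
  if (x1 < x2)
    x3 += (uint64_t) 1 << 32;   /* ... so carry into the high product */

  r.hi = x3 + (x1 >> 32);
  r.lo = (x1 << 32) | (x0 & 0xffffffffu);
  return r;
}
```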
Consider inlining mpz_set_ui. This would be both small and fast, especially for compile-time constants, but would make application binaries depend on having 1 limb allocated to an mpz_t, preventing the "lazy" allocation scheme below.
Consider inlining mpz_[cft]div_ui and maybe mpz_[cft]div_r_ui. A __gmp_divide_by_zero would be needed for the divide by zero test, unless that could be left to mpn_mod_1 (not sure currently whether all the risc chips provoke the right exception there if using mul-by-inverse).
Consider inlining: mpz_fits_s*_p. The setups for LONG_MAX etc would need to go into gmp.h, and on Cray it might, unfortunately, be necessary to forcibly include <limits.h> since there's no apparent way to get SHRT_MAX with an expression (since short and unsigned short can be different sizes).
mpz_powm and mpz_powm_ui aren't very fast on one or two limb moduli, due to a lot of function call overheads. These could perhaps be handled as special cases.
Make sure mpz_powm_ui is never slower than the corresponding computation using mpz_powm.
mpz_powm REDC should do multiplications by g[] using the division method when they're small, since the REDC form of a small multiplier is normally a full size product. Probably would need a new tuned parameter to say what size multiplier is "small", as a function of the size of the modulus.
mpn_gcd might be able to be sped up on small to moderate sizes by improving find_a, possibly just by providing an alternate implementation for CPUs with slowish count_leading_zeros.
mpf_set_str produces low zero limbs when a string has a fraction but is exactly representable, eg. 0.5 in decimal. These could be stripped to save work in later operations.
mpz_and, mpz_ior and mpz_xor should use mpn_and_n etc for the benefit of the small number of targets with native versions of those routines. Need to be careful not to pass size==0. Is some code sharing possible between the mpz routines?
mpf_add: Don't do a copy to avoid overlapping operands unless it's really necessary (currently only sizes are tested, not whether r really is u or v).
mpf_add: Under the check for v having no effect on the result, perhaps test for r==u and do nothing in that case. Currently it looks like an MPN_COPY_INCR will be done to reduce prec+1 limbs to prec.
mpf_div_ui: Instead of padding with low zeros, call mpn_divrem_1 asking for fractional quotient limbs.
mpf_div_ui: Eliminate TMP_ALLOC. When r!=u there's no overlap and the division can be called on those operands. When r==u and is prec+1 limbs, then it's an in-place division. If r==u and not prec+1 limbs, then move the available limbs up to prec+1 and do an in-place there.
mpf_div_ui: Whether the high quotient limb is zero can be determined by testing the dividend for high<divisor. When non-zero, the division can be done on prec dividend limbs instead of prec+1. The result size is also known before the division, so that can be a tail call (once the TMP_ALLOC is eliminated).
mpn_divrem_2 could usefully accept unnormalized divisors and shift the dividend on-the-fly, since this should cost nothing on superscalar processors and avoid the need for temporary copying in mpn_tdiv_qr.
mpf_sqrt: If r!=u, and if u doesn't need to be padded with zeros, then there's no need for the tp temporary.
mpq_cmp_ui could form the num1*den2 and num2*den1 products limb-by-limb from high to low and look at each step for values differing by more than the possible carry bit from the uncalculated portion.
mpq_cmp could do the same high-to-low progressive multiply and compare. The benefits of karatsuba and higher multiplication algorithms are lost, but if it's assumed only a few high limbs will be needed to determine an order then that's fine.
mpn_add_1, mpn_sub_1, mpn_add, mpn_sub: Internally use __GMPN_ADD_1 etc instead of the functions, so they get inlined on all compilers, not just gcc and others with inline recognised in gmp.h. __GMPN_ADD_1 etc are meant mostly to support application inline mpn_add_1 etc and if they don't come out good for internal uses then special forms can be introduced, for instance many internal uses are in-place. Sometimes a block of code is executed based on the carry-out, rather than using it arithmetically, and those places might want to do their own loops entirely.
__gmp_extract_double on 64-bit systems could use just one bitfield for the mantissa extraction, not two, when endianness permits. Might depend on the compiler allowing long long bit fields when that's the only actual 64-bit type.
tal-notreent.c could keep a block of memory permanently allocated. Currently the last nested TMP_FREE releases all memory, so there's an allocate and free every time a top-level function using TMP is called. Would need mp_set_memory_functions to tell tal-notreent.c to release any cached memory when changing allocation functions though.
__gmp_tmp_alloc from tal-notreent.c could be partially inlined. If the current chunk has enough room then a couple of pointers can be updated. Only if more space is required then a call to some sort of __gmp_tmp_increase would be needed. The requirement that TMP_ALLOC is an expression might make the implementation a bit ugly and/or a bit sub-optimal.
```
#define TMP_ALLOC(n)                                        \
  ((ROUND_UP (n) > current->end - current->point ?          \
     __gmp_tmp_increase (ROUND_UP (n)) : 0),                \
     current->point += ROUND_UP (n),                        \
     current->point - ROUND_UP (n))
```
__mp_bases has a lot of data for bases which are pretty much never used. Perhaps the table should just go up to base 16, and have code to generate data above that, if and when required. Naturally this assumes the code would be smaller than the data saved.
__mp_bases field big_base_inverted is only used if USE_PREINV_DIVREM_1 is true, and could be omitted otherwise, to save space.
mpz_get_str, mtox: For power-of-2 bases, which are of course fast, it seems a little silly to make a second pass over the mpn_get_str output to convert to ASCII. Perhaps combine that with the bit extractions.
mpz_gcdext: If the caller requests only the S cofactor (of A), and A<B, then the code ends up generating the cofactor T (of B) and deriving S from that. Perhaps it'd be possible to arrange to get S in the first place by calling mpn_gcdext with A+B,B. This might only be an advantage if A and B are about the same size.
mpz_n_pow_ui does a good job with small bases and stripping powers of 2, but it's perhaps a bit too complicated for what it gains. The simpler mpn_pow_1 is a little faster on small exponents. (Note some of the ugliness in mpz_n_pow_ui is due to supporting mpn_mul_2.)
Perhaps the stripping of 2s in mpz_n_pow_ui should be confined to single limb operands for simplicity and since that's where the greatest gain would be.
Ideally mpn_pow_1 and mpz_n_pow_ui would be merged. The reason mpz_n_pow_ui writes to an mpz_t is that its callers leave it to make a good estimate of the result size. Callers of mpn_pow_1 already know the size by separate means (mp_bases).
mpz_invert should call mpn_gcdext directly.

Machine Dependent Optimization

invert_limb on various processors might benefit from the little Newton iteration done for alpha and ia64.
Alpha 21264: mpn_addlsh1_n could be implemented with mpn_addmul_1, since that code at 3.5 is a touch faster than a separate lshift and add_n at 1.75+2.125=3.875. Or very likely some specific addlsh1_n code could beat both.
Alpha 21264: Improve feed-in code for mpn_mul_1, mpn_addmul_1, and mpn_submul_1.
Alpha 21164: Rewrite mpn_mul_1, mpn_addmul_1, and mpn_submul_1 for the 21164. This should use both integer multiplies and floating-point multiplies. For the floating-point operations, the single-limb multiplier should be split into three 21-bit chunks, or perhaps even better in four 16-bit chunks. Probably possible to reach 9 cycles/limb.
Alpha: GCC 3.4 will introduce __builtin_ctzl, __builtin_clzl and __builtin_popcountl using the corresponding CIX ct instructions, and __builtin_alpha_cmpbge. These should give GCC more information about scheduling etc than the asm blocks currently used in longlong.h and gmp-impl.h.
Alpha Unicos: Apparently there's no alloca on this system, making configure choose the slower malloc-reentrant allocation method. Is there a better way? Maybe variable-length arrays per notes below.
Alpha Unicos 21164, 21264: .align is not used since it pads with garbage. Does the code get the intended slotting required for the claimed speeds? .align at the start of a function would presumably be safe no matter how it pads.
ARM V5: count_leading_zeros can use the clz instruction. For GCC 3.4 and up, do this via __builtin_clzl since then gcc knows it's "predicable".
Itanium: GCC 3.4 introduces __builtin_popcount which can be used instead of an asm block. The builtin should give gcc more opportunities for scheduling, bundling and predication. __builtin_ctz similarly (it just uses popcount as per current longlong.h).
UltraSPARC/64: Optimize mpn_mul_1, mpn_addmul_1, for s2 < 2^32 (or perhaps for any zero 16-bit s2 chunk). Not sure how much this can improve the speed, though, since the symmetry that we rely on is lost. Perhaps we can just gain cycles when s2 < 2^16, or more accurately, when two 16-bit s2 chunks which are 16 bits apart are zero.
UltraSPARC/64: Write native mpn_submul_1, analogous to mpn_addmul_1.
UltraSPARC/64: Write umul_ppmm. Using four "mulx"s either with an asm block or via the generic C code is about 90 cycles. Try using fp operations, and also try using karatsuba for just three "mulx"s.
UltraSPARC/32: Rewrite mpn_lshift, mpn_rshift. Will give 2 cycles/limb. Trivial modifications of mpn/sparc64 should do.
UltraSPARC/32: Write special mpn_Xmul_1 loops for s2 < 2^16.
UltraSPARC/32: Use mulx for umul_ppmm if possible (see commented out code in longlong.h). This is unlikely to save more than a couple of cycles, so perhaps isn't worth bothering with.
UltraSPARC/32: On Solaris gcc doesn't give us __sparc_v9__ or anything to indicate V9 support when -mcpu=v9 is selected. See gcc/config/sol2-sld-64.h. Will need to pass something through from ./configure to select the right code in longlong.h. (Currently nothing is lost because mulx for multiplying is commented out.)
UltraSPARC/32: mpn_divexact_1 and mpn_modexact_1c_odd can use a 64-bit inverse and take 64-bits at a time from the dividend, as per the 32-bit divisor case in mpn/sparc64/mode1o.c. This must be done in assembler, since the full 64-bit registers (%gN) are not available from C.
UltraSPARC/32: mpn_divexact_by3c can work 64-bits at a time using mulx, in assembler. This would be the same as for sparc64.
UltraSPARC: binvert_limb might save a few cycles from masking down to just the useful bits at each point in the calculation, since mulx speed depends on the highest bit set. Either explicit masks or small types like short and int ought to work.
Sparc64 HAL R1 popc: This chip reputedly implements popc properly (see gcc sparc.md). Would need to recognise it as sparchalr1 or something in configure / config.sub / config.guess. popc_limb in gmp-impl.h could use this (per commented out code). count_trailing_zeros could use it too.
PA64: Improve mpn_addmul_1, mpn_submul_1, and mpn_mul_1. The current code runs at 11 cycles/limb. It should be possible to saturate the cache, which will happen at 8 cycles/limb (7.5 for mpn_mul_1). Write special loops for s2 < 2^32; it should be possible to make them run at about 5 cycles/limb.
PPC601: See which of the power or powerpc32 code runs better. Currently the powerpc32 is used, but only because it's the default for powerpc*.
PPC630: Rewrite mpn_addmul_1, mpn_submul_1, and mpn_mul_1. Use both integer and floating-point operations, possibly two floating-point and one integer limb per loop. Split operands into four 16-bit chunks for fast fp operations. Should easily reach 9 cycles/limb (using one int + one fp), but perhaps even 7 cycles/limb (using one int + two fp).
PPC630: mpn_rshift could do the same sort of unrolled loop as mpn_lshift. Some judicious use of m4 might let the two share source code, or with a register to control the loop direction perhaps even share object code.
Implement mpn_mul_basecase and mpn_sqr_basecase for important machines. Helping the generic sqr_basecase.c with an mpn_sqr_diagonal might be enough for some of the RISCs.
POWER2/POWER2SC: Schedule mpn_lshift/mpn_rshift. Will bring time from 1.75 to 1.25 cycles/limb.
X86: Optimize non-MMX mpn_lshift for shifts by 1. (See Pentium code.)
X86: Good authority has it that in the past an inline rep movs would upset GCC register allocation for the whole function. Is this still true in GCC 3? It uses rep movs itself for __builtin_memcpy. Examine the code for some simple and complex functions to find out. Inlining rep movs would be desirable, it'd be both smaller and faster.
Pentium P54: mpn_lshift and mpn_rshift can come down from 6.0 c/l to 5.5 or 5.375 by paying attention to pairing after shrdl and shldl, see mpn/x86/pentium/README.
Pentium P55 MMX: mpn_lshift and mpn_rshift might benefit from some destination prefetching.
PentiumPro: mpn_divrem_1 might be able to use a mul-by-inverse, hoping for maybe 30 c/l.
K7: mpn_lshift and mpn_rshift might be able to do something branch-free for unaligned startups, and shaving one insn from the loop with alternative indexing might save a cycle.
PPC32: Try using fewer registers in the current mpn_lshift. The pipeline is now extremely deep, perhaps unnecessarily deep.
Fujitsu VPP: Vectorize main functions, perhaps in assembly language.
Fujitsu VPP: Write mpn_mul_basecase and mpn_sqr_basecase. This should use a "vertical multiplication method", to avoid carry propagation, splitting one of the operands into 11-bit chunks.
Pentium: mpn_lshift by 31 should use the special rshift by 1 code, and vice versa mpn_rshift by 31 should use the special lshift by 1. This would be best as a jump across to the other routine, could let both live in lshift.asm and omit rshift.asm on finding mpn_rshift already provided.
Cray T3E: Experiment with optimization options. In particular, -hpipeline3 seems promising. We should at least up -O to -O2 or -O3.
Cray: mpn_com and mpn_and_n etc very probably want a pragma like MPN_COPY_INCR.
Cray vector systems: mpn_lshift, mpn_rshift, mpn_popcount and mpn_hamdist are nice and small and could be inlined to avoid function calls.
Cray: Variable length arrays seem to be faster than the tal-notreent.c scheme. Not sure why, maybe they merely give the compiler more information about aliasing (or the lack thereof). Would like to modify TMP_ALLOC to use them, or introduce a new scheme. Memory blocks wanted unconditionally are easy enough, those wanted only sometimes are a problem. Perhaps a special size calculation to ask for a dummy length 1 when unwanted, or perhaps an inlined subroutine duplicating code under each conditional. Don't really want to turn everything into a dog's dinner just because Cray don't offer an alloca.
Cray: mpn_get_str on power-of-2 bases ought to vectorize. Does it? bits_per_digit and the inner loop over bits in a limb might prevent it. Perhaps special cases for binary, octal and hex would be worthwhile (very possibly for all processors too).
S390: BSWAP_LIMB_FETCH looks like it could be done with lrvg, as per glibc sysdeps/s390/s390-64/bits/byteswap.h. This is only for 64-bit mode or something is it, since 32-bit mode has other code? Also, is it worth using for BSWAP_LIMB too, or would that mean a store and re-fetch? Presumably that's what comes out in glibc.

Improve count_leading_zeros for 64-bit machines:

	   if ((x >> 32) == 0) { x <<= 32; cnt += 32; }
	   if ((x >> 48) == 0) { x <<= 16; cnt += 16; }
	   ...
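
Carried through to the last step, the binary search above could be completed as follows. This is a sketch assuming a 64-bit limb and a non-zero argument; clz64 is a hypothetical name, not the count_leading_zeros macro from longlong.h.

```c
#include <stdint.h>

/* Hypothetical complete form of the binary search; x must be non-zero.
   Returns the number of zero bits above the highest set bit.  */
int
clz64 (uint64_t x)
{
  int cnt = 0;
  if ((x >> 32) == 0) { x <<= 32; cnt += 32; }
  if ((x >> 48) == 0) { x <<= 16; cnt += 16; }
  if ((x >> 56) == 0) { x <<= 8;  cnt += 8; }
  if ((x >> 60) == 0) { x <<= 4;  cnt += 4; }
  if ((x >> 62) == 0) { x <<= 2;  cnt += 2; }
  if ((x >> 63) == 0) { cnt += 1; }
  return cnt;
}
```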

IRIX 6 MIPSpro compiler has an __inline which could perhaps be used in __GMP_EXTERN_INLINE. What would be the right way to identify suitable versions of that compiler?
IRIX cc is rumoured to have an _int_mult_upper (in <intrinsics.h> like Cray), but it didn't seem to exist on some IRIX 6.5 systems tried. If it does actually exist somewhere it would very likely be an improvement over a function call to umul.asm.
mpn_get_str final divisions by the base with udiv_qrnd_unnorm could use some sort of multiply-by-inverse on suitable machines. This ends up happening for decimal by presenting the compiler with a run-time constant, but the same for other bases would be good. Perhaps use could be made of the fact base<256.
mpn_umul_ppmm, mpn_udiv_qrnnd: Return a structure like div_t to avoid going through memory, in particular helping RISCs that don't do store-to-load forwarding. Clearly this is only possible if the ABI returns a structure of two mp_limb_ts in registers.
On PowerPC, structures are returned in memory on AIX and Darwin. In SVR4 they're returned in registers, except that draft SVR4 had said memory, so it'd be prudent to check which is done. We can jam the compiler into the right mode if we know how, since all this is purely internal to libgmp. (gcc has an option, though of course gcc doesn't matter since we use inline asm there.)

New Functionality

Maybe add mpz_crr (Chinese Remainder Reconstruction).
Let `0b' and `0B' mean binary input everywhere.
mpz_init and mpq_init could do lazy allocation. Set ALLOC(var) to 0 to indicate nothing allocated, and let _mpz_realloc do the initial alloc. Set z->_mp_d to a dummy that mpz_get_ui and similar can unconditionally fetch from. Niels Möller has had a go at this.
The advantages of the lazy scheme would be:
- Initial allocate would be the size required for the first value stored, rather than getting 1 limb in mpz_init and then more or less immediately reallocating.
- mpz_init would only store magic values in the mpz_t fields, and could be inlined.
- A fixed initializer could even be used by applications, like mpz_t z = MPZ_INITIALIZER;, which might be convenient for globals.
The advantages of the current scheme are:
- mpz_set_ui and other similar routines needn't check the size allocated and can just store unconditionally.
- mpz_set_ui and perhaps others like mpz_tdiv_r_ui and a prospective mpz_set_ull could be inlined.
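The lazy scheme can be pictured with a miniature stand-in type. Everything here (lazy_int, LAZY_INITIALIZER, lazy_get_ui, lazy_set_ui) is hypothetical and much simpler than mpz_t; it only shows the dummy limb that readers can fetch unconditionally, and the allocation check that writers would then need.

```c
#include <stdlib.h>

/* Hypothetical miniature of an mpz_t-like type; not GMP's layout.  */
typedef struct
{
  int alloc;                    /* limbs allocated, 0 means nothing yet */
  int size;                     /* limbs in use */
  unsigned long *d;             /* limb data, or the shared dummy */
} lazy_int;

/* Shared limb that is only ever read, so d[0] is always fetchable.  */
static const unsigned long lazy_dummy_limb = 0;

#define LAZY_INITIALIZER { 0, 0, (unsigned long *) &lazy_dummy_limb }

/* Reader: fetches d[0] unconditionally, no allocation check needed.  */
unsigned long
lazy_get_ui (const lazy_int *z)
{
  unsigned long l = z->d[0];
  return z->size != 0 ? l : 0;
}

/* Writer: the first store performs the initial allocation.
   Returns 0 on success, -1 if malloc fails.  */
int
lazy_set_ui (lazy_int *z, unsigned long v)
{
  if (z->alloc == 0)
    {
      unsigned long *p = malloc (sizeof *p);
      if (p == NULL)
        return -1;
      z->d = p;
      z->alloc = 1;
    }
  z->d[0] = v;
  z->size = (v != 0);
  return 0;
}
```

The extra branch in lazy_set_ui is exactly the cost that the current scheme avoids.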
Add mpf_out_raw and mpf_inp_raw. Make sure format is portable between 32-bit and 64-bit machines, and between little-endian and big-endian machines. A format which MPFR can use too would be good.
mpn_and_n ... mpn_copyd: Perhaps make the mpn logops and copys available in gmp.h, either as library functions or inlines, with the availability of library functions instantiated in the generated gmp.h at build time.
mpz_set_str etc variants taking string lengths rather than null-terminators.
mpz_andn, mpz_iorn, mpz_nand, mpz_nior, mpz_xnor might be useful additions, if they could share code with the current such functions (which should be possible).
mpz_and_ui etc might be of use sometimes. Suggested by Niels Möller.
mpf_set_str and mpf_inp_str could usefully accept 0x, 0b etc when base==0. Perhaps the exponent could default to decimal in this case, with a further 0x, 0b etc allowed there. Eg. 0xFFAA@0x5A. A leading "0" for octal would match the integers, but probably something like "0.123" ought not mean octal.
GMP_LONG_LONG_LIMB or some such could become a documented feature of gmp.h, so applications could know whether to printf a limb using %lu or %Lu.
GMP_PRIdMP_LIMB and similar defines following C99 <inttypes.h> might be of use to applications printing limbs. But if GMP_LONG_LONG_LIMB or whatever is added then perhaps this can easily enough be left to applications.
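A sketch of what such a define might look like, in the style of C99 <inttypes.h>. The names DEMO_LONG_LONG_LIMB, demo_limb_t, DEMO_PRIuLIMB and format_limb are all hypothetical and are not provided by gmp.h; the example uses the C99 spelling "llu" for the long long case.

```c
#include <stdio.h>

/* Hypothetical names, not defined by gmp.h: choose the limb type and its
   printf conversion together.  */
#ifdef DEMO_LONG_LONG_LIMB
typedef unsigned long long demo_limb_t;
#define DEMO_PRIuLIMB "llu"
#else
typedef unsigned long demo_limb_t;
#define DEMO_PRIuLIMB "lu"
#endif

/* Format one limb in decimal into buf; returns snprintf's result.  */
int
format_limb (char *buf, size_t len, demo_limb_t limb)
{
  return snprintf (buf, len, "%" DEMO_PRIuLIMB, limb);
}
```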
gmp_printf could accept %b for binary output. It'd be nice if it worked for plain int etc too, not just mpz_t etc.
gmp_printf in fact could usefully accept an arbitrary base, for both integer and float conversions. A base either in the format string or as a parameter with * should be allowed. Maybe &13b (b for base) or something like that.
gmp_printf could perhaps accept mpq_t for float conversions, eg. "%.4Qf". This would be merely for convenience, but still might be useful. Rounding would be the same as for an mpf_t (ie. currently round-to-nearest, but not actually documented). Alternately, perhaps a separate mpq_get_str_point or some such might be more use. Suggested by Pedro Gimeno.
mpz_rscan0 or mpz_revscan0 or some such searching towards the low end of an integer might match mpz_scan0 nicely. Likewise for scan1. Suggested by Roberto Bagnara.
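As a rough idea of the semantics, a downward scan for a 1 bit on a raw limb array might look like this. The sketch assumes 64-bit limbs stored least significant first and a non-negative value; rscan1 is a hypothetical name and the return convention (-1 for "not found") is only one possible choice.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch, not a GMP function.  Return the index of the
   highest 1 bit at or below start_bit in the n-limb array p, or -1 if
   there is no such bit.  */
long
rscan1 (const uint64_t *p, size_t n, size_t start_bit)
{
  size_t i = start_bit / 64;
  uint64_t limb;

  if (n == 0)
    return -1;
  if (i >= n)
    {
      i = n - 1;                /* start is above the data: use top limb */
      limb = p[i];
    }
  else
    {
      unsigned keep = (unsigned) (start_bit % 64) + 1;
      limb = p[i];
      if (keep < 64)
        limb &= ((uint64_t) 1 << keep) - 1;   /* drop bits above start */
    }
  for (;;)
    {
      if (limb != 0)
        {
          int bit = 63;
          while ((limb >> bit) == 0)
            bit--;
          return (long) (i * 64 + (size_t) bit);
        }
      if (i == 0)
        return -1;
      limb = p[--i];
    }
}
```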
mpz_bit_subset or some such to test whether one integer is a bitwise subset of another might be of use. Some sort of return value indicating whether it's a proper or non-proper subset would be good and wouldn't cost anything in the implementation. Suggested by Roberto Bagnara.
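On raw limbs the test reduces to checking that a has no bit outside b, and the proper versus non-proper distinction does indeed fall out of the same loop. The sketch assumes two non-negative operands padded to the same limb count; bit_subset and its return codes are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch on limb arrays of equal length n; not a GMP
   function.  Returns 0 if a is not a bitwise subset of b, 1 if a is a
   proper subset (b has extra bits), 2 if a equals b.  */
int
bit_subset (const uint64_t *a, const uint64_t *b, size_t n)
{
  int proper = 0;
  size_t i;

  for (i = 0; i < n; i++)
    {
      if ((a[i] & ~b[i]) != 0)
        return 0;               /* a has a bit that b lacks */
      if (a[i] != b[i])
        proper = 1;             /* b has a bit that a lacks */
    }
  return proper ? 1 : 2;
}
```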
mpf_get_ld, mpf_set_ld: Conversions between mpf_t and long double, suggested by Dan Christensen. Other long double routines might be desirable too, but mpf would be a start.
long double is an ANSI-ism, so everything involving it would need to be suppressed on a K&R compiler.
There'd be some work to be done by configure to recognise the format in use, MPFR has a start on this. Often long double is the same as double, which is easy but pretty pointless. A single float format detector macro could look at double then long double.
Sometimes there's a compiler option for the size of a long double, eg. xlc on AIX can use either 64-bit or 128-bit. It's probably simplest to regard this as a compiler compatibility issue, and leave it to users or sysadmins to ensure application and library code is built the same.
mpz_sqrt_if_perfect_square: When mpz_perfect_square_p does its tests it calculates a square root and then discards it. For some applications it might be useful to return that root. Suggested by Jason Moxham.
mpz_get_ull, mpz_set_ull, mpz_get_sll, mpz_set_sll: Conversions for long long. These would aid interoperability, though a mixture of GMP and long long would probably not be too common. Since long long is not always available (it's in C99 and GCC though), disadvantages of using long long in libgmp.a would be:
- Library contents vary according to the build compiler.
- gmp.h would need an ugly #ifdef block to decide if the application compiler could take the long long prototypes.
- Some sort of LIBGMP_HAS_LONGLONG might be wanted to indicate whether the functions are available. (Applications using autoconf could probe the library too.)
It'd be possible to defer the need for long long to application compile time, by having something like mpz_set_2ui called with two halves of a long long. Disadvantages of this would be,
- Bigger code in the application, though perhaps not if a long long is normally passed as two halves anyway.
- mpz_get_ull would be a rather big inline, or would have to be two function calls.
- mpz_get_sll would be a worse inline, and would put the treatment of -0x10..00 into applications (see mpz_get_si correctness above).
- Although having libgmp.a independent of the build compiler is nice, it sort of sacrifices the capabilities of a good compiler to uniformity with inferior ones.
Plain use of long long is probably the lesser evil, if only because it makes best use of gcc. In fact perhaps it would suffice to guarantee long long conversions only when using GCC for both application and library. That would cover free software, and we can worry about selected vendor compilers later.
In C++ the situation is probably clearer, we demand fairly recent C++ so long long should be available always. We'd probably prefer to have the C and C++ the same in respect of long long support, but it would be possible to have it unconditionally in gmpxx.h, by some means or another.
mpz_strtoz parsing the same as strtol. Suggested by Alexander Kruppa.

Configuration

Alpha ev7, ev79: Add code to config.guess to detect these. Believe ev7 will be "3-1307" in the current switch, but need to verify that. (On OSF, current configfsf.guess identifies ev7 using psrinfo, we need to do it ourselves for other systems.)
Alpha OSF: Libtool (version 1.5) doesn't seem to recognise this system is "pic always" and ends up running gcc twice with the same options. This is wasteful, but harmless. Perhaps a newer libtool will be better.
ARM: umul_ppmm in longlong.h always uses umull, but is that available only for M series chips or some such? Perhaps it should be configured in some way.
HPPA: config.guess should recognize 7000, 7100, 7200, and 8x00.
HPPA: gcc 3.2 introduces a -mschedule=7200 etc parameter, which could be driven by an exact hppa cpu type.
Mips: config.guess should say mipsr3000, mipsr4000, mipsr10000, etc. "hinv -c processor" gives lots of information on Irix. Standard config.guess appends "el" to indicate endianness, but AC_C_BIGENDIAN seems the best way to handle that for GMP.
PowerPC: The function descriptor nonsense for AIX is currently driven by *-*-aix*. It might be more reliable to do some sort of feature test, examining the compiler output perhaps. It might also be nice to merge the aix.m4 files into powerpc-defs.m4.
config.m4 is generated only by the configure script, it won't be regenerated by config.status. Creating it as an AC_OUTPUT would work, but it might upset "make" to have things like L$ get into the Makefiles through AC_SUBST. AC_CONFIG_COMMANDS would be the alternative. With some careful m4 quoting the changequote calls might not be needed, which might free up the order in which things had to be output.
Automake: Latest automake has a CCAS, CCASFLAGS scheme. Though we probably wouldn't be using its assembler support we could try to use those variables in compatible ways.
GMP_LDFLAGS could probably be done with plain LDFLAGS already used by automake for all linking. But with a bit of luck the next libtool will pass pretty much all CFLAGS through to the compiler when linking, making GMP_LDFLAGS unnecessary.
mpn/Makeasm.am uses -c and -o together in the .S and .asm rules, but apparently that isn't completely portable (there's an autoconf AC_PROG_CC_C_O test for it). So far we've not had problems, but perhaps the rules could be rewritten to use "foo.s" as the temporary, or to do a suitable "mv" of the result. The only danger from using foo.s would be if a compile failed and the temporary foo.s then looked like the primary source. Hopefully if the SUFFIXES are ordered to have .S and .asm ahead of .s that wouldn't happen. Might need to check.

Random Numbers

_gmp_rand is not particularly fast on the linear congruential algorithm and could stand various improvements:
- Make a second seed area within gmp_randstate_t (or _mp_algdata rather) to save some copying.
- Make a special case for a single limb 2exp modulus, to avoid mpn_mul calls. Perhaps the same for two limbs.
- Inline the lc code, to avoid a function call and TMP_ALLOC for every chunk.
- Perhaps the 2exp and general LC cases should be split, for clarity (if the general case is retained).
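The single limb 2exp special case could look roughly like the following. This is a minimal sketch on a 64-bit word, not GMP code; lc_step_2exp and its parameters are hypothetical names, and a real version would work on mp_limb_t and the seed held in the random state.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not a GMP function: one step of a linear
   congruential generator x' = (a*x + c) mod 2^m, for 1 <= m <= 64.
   Unsigned 64-bit arithmetic already wraps modulo 2^64, so the
   reduction is a single mask and no multi-limb multiply is needed. */
static uint64_t
lc_step_2exp (uint64_t x, uint64_t a, uint64_t c, unsigned m)
{
  uint64_t mask;

  assert (m >= 1 && m <= 64);
  /* Shifting by 64 would be undefined, so treat m == 64 separately. */
  mask = (m == 64) ? UINT64_MAX : (((uint64_t) 1 << m) - 1);
  return (a * x + c) & mask;
}
```

For example, with a=5, c=3 and m=4 the state 1 steps to 8 and then to 11. The point is only that the whole step is one multiply, one add and one mask, with no temporary allocation.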
gmp_randstate_t used for parameters perhaps should become gmp_randstate_ptr the same as other types.
Some of the empirical randomness tests could be included in a "make check". They ought to work everywhere, for a given seed at least.

C++

mpz_class(string), etc: Use the C++ global locale to identify whitespace.
mpf_class(string): Use the C++ global locale decimal point, rather than the C one.
Consider making these variant mpz_set_str etc forms available for mpz_t too, not just mpz_class etc.
mpq_class operator+=: Don't emit an unnecessary mpq_set(q,q) before mpz_addmul etc.
Put various bits of gmpxx.h into libgmpxx, to avoid excessive inlining. Candidates for this would be:
- mpz_class(const char *), etc: since they're normally not fast anyway, and we can hide the exception throw.
- mpz_class(string), etc: to hide the cstr needed to get to the C conversion function.
- mpz_class string, char* etc constructors: likewise to hide the throws and conversions.
- mpz_class::get_str, etc: to hide the char* to string conversion and free. Perhaps mpz_get_str can write directly into a string, to avoid copying.
  Consider making such string returning variants available for use with plain mpz_t etc too.

Miscellaneous

mpz_gcdext and mpn_gcdext ought to document what range of values the generated cofactors can take, and preferably ensure the definition uniquely specifies the cofactors for given inputs. A basic extended Euclidean algorithm or multi-step variant leads to |x|<|b| and |y|<|a| or something like that, but there are probably two solutions under just those restrictions.
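The cofactor ambiguity can be seen with a basic extended Euclidean algorithm on machine words. The following is an illustration only, with a hypothetical ext_gcd function; it says nothing about which cofactors mpz_gcdext actually returns.

```c
#include <assert.h>
#include <stdint.h>

/* Plain extended Euclidean algorithm on machine words, for a > 0 and
   b > 0.  Returns g = gcd(a,b) and stores x, y with a*x + b*y == g.
   Illustration only; this is not how mpz_gcdext is implemented. */
static int64_t
ext_gcd (int64_t a, int64_t b, int64_t *x, int64_t *y)
{
  int64_t r0 = a, r1 = b;
  int64_t s0 = 1, s1 = 0;     /* running cofactors of a */
  int64_t t0 = 0, t1 = 1;     /* running cofactors of b */

  assert (a > 0 && b > 0);
  while (r1 != 0)
    {
      int64_t q = r0 / r1, tmp;
      tmp = r0 - q * r1;  r0 = r1;  r1 = tmp;
      tmp = s0 - q * s1;  s0 = s1;  s1 = tmp;
      tmp = t0 - q * t1;  t0 = t1;  t1 = tmp;
    }
  *x = s0;
  *y = t0;
  return r0;
}
```

For a=240, b=46 this gives g=2 with x=-9, y=47. Adding b/g to x and subtracting a/g from y gives x=14, y=-73, which also satisfies a*x + b*y = 2 and still has |x|<|b| and |y|<|a|, so those bounds alone don't pin down a unique pair.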
demos/factorize.c: use mpz_divisible_ui_p rather than mpz_tdiv_qr_ui. (Of course dividing multiple primes at a time would be better still.)
The various test programs use quite a bit of the main libgmp. This establishes good cross-checks, but it might be better to use simple reference routines where possible. Where it's not possible some attention could be paid to the order of the tests, so a libgmp routine is only used for tests once it seems to be good.
MUL_FFT_THRESHOLD etc: the FFT thresholds should allow a return to a previous k at certain sizes. This arises basically due to the step effect caused by size multiples effectively used for each k. Looking at a graph makes it fairly clear.
__gmp_doprnt_mpf does a rather unattractive round-to-nearest on the string returned by mpf_get_str. Perhaps some variant of mpf_get_str could be made which would better suit.
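For reference, rounding a decimal digit string to nearest involves something like the following. This is a sketch under assumptions, not the code in __gmp_doprnt_mpf; round_digit_string is a hypothetical helper and the string is assumed to hold only digits, with the exponent kept separately by the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not GMP code: round the NUL-terminated decimal
   digit string s (ASCII '0'-'9' only, no sign or point) to its first
   `keep` digits, rounding half up in magnitude, and truncate it there.
   Returns 1 if a carry ran off the top (e.g. "9996" -> "100"), in
   which case the caller must add 1 to the decimal exponent, else 0. */
static int
round_digit_string (char *s, size_t keep)
{
  size_t len = strlen (s);
  size_t i;
  int round_up;

  assert (keep >= 1);
  if (keep >= len)
    return 0;                   /* nothing to drop */

  round_up = (s[keep] >= '5');  /* first dropped digit decides */
  s[keep] = '\0';
  if (! round_up)
    return 0;

  for (i = keep; i-- > 0; )
    {
      if (s[i] != '9')
        {
          s[i]++;
          return 0;
        }
      s[i] = '0';               /* 9 + 1: write 0 and keep carrying */
    }
  s[0] = '1';                   /* all nines: 99..9 becomes 10..0 */
  return 1;
}
```

The carry loop and the exponent adjustment are the awkward parts; a conversion routine that produced correctly rounded digits in the first place would make them unnecessary.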

Aids to Development

Add ASSERTs at the start of each user-visible mpz/mpq/mpf function to check the validity of each mp?_t parameter, in particular to check they've been mp?_inited. This might catch elementary mistakes in user programs. Care would need to be taken over MPZ_TMP_INITed variables used internally. If nothing else then consistency checks like size<=alloc, ptr not NULL and ptr+size not wrapping around the address space, would be possible. A more sophisticated scheme could track _mp_d pointers and ensure only a valid one is used. Such a scheme probably wouldn't be reentrant, not without some help from the system.
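The basic consistency checks might be sketched as below. The struct is a hypothetical stand-in with alloc, size and pointer fields in the spirit of an mpz_t, and num_looks_valid is an invented name; the real field names and types are those in gmp.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an mpz-like object: `alloc` limbs are
   allocated at `d`, and |size| of them are in use (the sign of `size`
   is the sign of the value). */
typedef unsigned long limb_t;
typedef struct
{
  int alloc;
  int size;
  limb_t *d;
} num_struct;

/* Return 1 if the cheap consistency checks pass, 0 otherwise. */
static int
num_looks_valid (const num_struct *n)
{
  int64_t used;
  uintptr_t start;

  if (n == NULL || n->d == NULL)
    return 0;                           /* ptr not NULL */
  if (n->alloc < 0)
    return 0;

  used = n->size;
  if (used < 0)
    used = -used;
  if (used > n->alloc)
    return 0;                           /* size <= alloc */

  /* ptr+size must not wrap around the address space */
  start = (uintptr_t) n->d;
  if ((uintptr_t) used > (UINTPTR_MAX - start) / sizeof (limb_t))
    return 0;

  return 1;
}
```

An uninitialized object holds arbitrary values and could pass these tests by chance, so this would catch many elementary mistakes but not all of them.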
tune/time.c could try to determine at runtime whether getrusage and gettimeofday are reliable. Currently we pretend in configure that the dodgy m68k netbsd 1.4.1 getrusage doesn't exist. If a test might take a long time to run then perhaps cache the result in a file somewhere.
tune/time.c could choose the default precision based on the speed_unittime determined, independent of the method in use.
Cray vector systems: CPU frequency could be determined from sysconf(_SC_CLK_TCK), since it seems to be clock cycle based. Is this true for all Cray systems? Would like some documentation or something to confirm.

Documentation

mpz_inp_str (etc) doesn't say when it stops reading digits.
mpn_get_str isn't terribly clear about how many digits it produces. It'd probably be possible to say at most one leading zero, which is what both it and mpz_get_str currently do. But we want to be careful not to bind ourselves to something that might not suit another implementation.
va_arg doesn't do the right thing with mpz_t etc directly, but instead needs a pointer type like MP_INT*. It'd be good to show how to do this, but we'd either need to document mpz_ptr and friends, or perhaps fall back on something slightly nasty with void*.
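The va_arg issue can be shown without GMP, using stand-in types built the same way. In this sketch num_struct, num_t, num_ptr and num_sum are all hypothetical names; the one-element array typedef mirrors how mpz_t is declared, which is an assumption stated here rather than something the documentation would need to promise.

```c
#include <assert.h>
#include <stdarg.h>

/* Hypothetical stand-in types: the user-visible type is a one-element
   array of a struct, so it decays to a pointer when passed to a
   function. */
typedef struct { long value; } num_struct;
typedef num_struct num_t[1];
typedef num_struct *num_ptr;

/* Sum `count` num_t arguments.  va_arg must name the pointer type;
   va_arg (ap, num_t) would name an array type, which is not what the
   caller actually passed. */
static long
num_sum (int count, ...)
{
  va_list ap;
  long total = 0;
  int i;

  va_start (ap, count);
  for (i = 0; i < count; i++)
    {
      num_ptr p = va_arg (ap, num_ptr);
      total += p->value;
    }
  va_end (ap);
  return total;
}
```

A caller declares num_t a, b and simply writes num_sum (2, a, b); only the function doing the va_arg needs to know about the pointer type.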

Bright Ideas

The following may or may not be feasible, and aren't likely to get done in the near future, but are at least worth thinking about.

Reorganize longlong.h so that we can inline the operations even for the system compiler. When there is no such compiler feature, make calls to stub functions. Write such stub functions for as many machines as possible.
longlong.h could declare when it's using, or would like to use, mpn_umul_ppmm, and the corresponding umul.asm file could be included in libgmp only in that case, the same as is effectively done for __clz_tab. Likewise udiv.asm and perhaps cntlz.asm. This would only be a very small space saving, so perhaps not worth the complexity.
longlong.h could be built at configure time by concatenating or #including fragments from each directory in the mpn path. This would select CPU specific macros the same way as CPU specific assembler code. Code used would no longer depend on cpp predefines, and the current nested conditionals could be flattened out.
mpz_get_si returns 0x80000000 for -0x100000000, whereas it's sort of supposed to return the low 31 (or 63) bits. But this is undocumented, and perhaps not too important.
mpz_init_set* and mpz_realloc could allocate say an extra 16 limbs over what's needed, so as to reduce the chance of having to do a reallocate if the mpz_t grows a bit more. This could only be an option, since it'd badly bloat memory usage in applications using many small values.
mpq functions could perhaps check for numerator or denominator equal to 1, on the assumption that integers or denominator-only values might be expected to occur reasonably often.
count_trailing_zeros is used on more or less uniformly distributed numbers in a couple of places. For some CPUs count_trailing_zeros is slow and it's probably worth handling the frequently occurring 0 to 2 trailing zeros cases specially.
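Special-casing the common small counts could be sketched as follows. ctz_small_first is a hypothetical helper on a 64-bit word, not the longlong.h macro, and the loop merely stands in for whatever slow count_trailing_zeros the CPU has.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: count trailing zeros of a nonzero 64-bit value.
   For uniformly distributed inputs the count is 0 half the time, 1 a
   quarter of the time and 2 an eighth of the time, so the first three
   tests cover 7/8 of all calls. */
static unsigned
ctz_small_first (uint64_t x)
{
  unsigned n;

  assert (x != 0);
  if (x & 1) return 0;
  if (x & 2) return 1;
  if (x & 4) return 2;

  /* Rare case: generic loop standing in for a slow count_trailing_zeros. */
  x >>= 3;
  for (n = 3; (x & 1) == 0; n++)
    x >>= 1;
  return n;
}
```

Whether the extra branches pay off would need measuring per CPU; where count_trailing_zeros is a fast instruction they'd only be overhead.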
mpf_t might like to let the exponent be undefined when size==0, instead of requiring it 0 as now. It should be possible to do size==0 tests before paying attention to the exponent. The advantage is not needing to set exp in the various places a zero result can arise, which avoids some tedium but is otherwise perhaps not too important. Currently mpz_set_f and mpf_cmp_ui depend on exp==0, maybe elsewhere too.
__gmp_allocate_func: Could use GCC __attribute__ ((malloc)) on this, though don't know if it'd do much. GCC 3.0 allows that attribute on functions, but not function pointers (see info node "Attribute Syntax"), so would need a new autoconf test. This can wait until there's a GCC that supports it.
mpz_add_ui contains two __GMPN_COPYs, one from mpn_add_1 and one from mpn_sub_1. If those two routines were opened up a bit maybe that code could be shared. When a copy needs to be done there's no carry to append for the add, and if the copy is non-empty no high zero for the sub.

Old and Obsolete Stuff

The following tasks apply to chips or systems that are old and/or obsolete. It's unlikely anything will be done about them unless someone is actively using them.

Sparc32: The integer based udiv_nfp.asm used to be selected by configure --nfp but that option is gone now that autoconf is used. The file could go somewhere suitable in the mpn search if any chips might benefit from it, though it's possible we don't currently differentiate enough exact cpu types to do this properly.
VAX D and G format double floats are straightforward and could perhaps be handled directly in __gmp_extract_double and maybe in mpn_get_d, rather than falling back on the generic code. (Both formats are detected by configure.)