# Area-Time Optimal VLSI Integer Multiplier with Minimum Computation Time* 

Kurt Mehlhorn<br>Universität der Saarlandes, Fachbereich 10, Saarbrücken, West Germany

AND<br>Franco P. Preparata<br>Coordinated Science Laboratory, University of Illinois, Urbana, Illinois 61801


#### Abstract

According to VLSI theory, $[\log n, \sqrt{n}]$ is the range of computation times for which there may exist an $A T^{2}$-optimal multiplier of $n$-bit integers. Such networks were previously known for the time range $\left[\Omega\left(\log ^{2} n\right), O(\sqrt{n})\right]$; this theoretical question is settled, by exhibition of a class of $A T^{2}$-optimal multipliers with computation times $[\Omega(\log n), O(\sqrt{n})]$. The designs are based on the DFT on a Fermat ring, whose elements are represented in a redundant radix-4 form to ensure $O(1)$ addition time.


## 1. Introduction

Research on efficient integer multiplication schemes, potentially suitable for direct circuit implementation, has been going on for some years. Investigations have focussed on both the realization of practical (and possibly suboptimal) networks and the more subtle question of the existence of optimal networks. Optimality is defined with respect to the customary $A T^{2}$ measure of complexity, which is central to the synchronous VLSI model of computation (Thompson, 1979; Brent and Kung, 1981). Here $A$ is the area of the multiplier chip, while $T$ is the computation time, i.e., the time elapsing between the arrival of the first input bit and the delivery of the last output bit. As is well known (Abelson and Andreae, 1980; Brent and Kung, 1981), any multiplier of two $n$-bit integers must satisfy $A T^{2}=\Omega\left(n^{2}\right)$ and $A=\Omega(n)$ in the VISI model; in addition, standard fan-in constraints of the logic gates yield the lower bound $T=\Omega(\log n)$. These three lower bounds

[^0]indicate that $[\log n, \sqrt{n}]$ is the range of computation times for which there may exist an $A T^{2}$-optimal multiplier.

The search for an $A T^{2}$-optimal integer multiplier began with the suboptimal design of Brent and Kung (1981), for which $A T^{2}=O\left(n^{2} \log ^{3} n\right)$. Subsequently, Preparata and Vuillemin (1981b) proposed a class of optimal designs whose computation time could be selected in the range $\left[\theta\left(\log ^{2} n\right), \theta(\sqrt{n})\right]$. More recently, Preparata (1983) exhibited an optimal mesh-connected multiplier achieving $T=O(\sqrt{n})$. An intriguing feature of all the above optimal designs is the explicit recourse to the Discrete Fourier Transform (DFT), as the device used for computing convolutions. However, none of these optimal designs achieves the minimum computation time $T=\theta(\log n)$. On the other hand there are well-known multiplication algorithms which achieve optimum computation time $T=\theta(\log n)$, e.g., the Wallace tree (1964) and Dadda counting (1965). Both algorithms are not easily embedded into silicon because of their irregular interconnection pattern. More recently, there have been proposals of desings with optimum computation time and nearly optimum $A T^{2}$-measure (Vuillemin, 1983; Becker, 1982; Luk and Vuillemin, 1983; Lengauer and Mehlhorn, 1983). Moreover, some of these designs are eminently practical. We refer the reader to Luk and Vuillemin (1983) for a detailed discussion. All of these designs are based on divide-and-conquer techniques and achieve their speed by the use of a redundant operand representation, which results in $O(1)$ addition time. The most efficient of these designs (Luk and Vuillemin, 1983; Lengauer and Mehlhorn, 1983) achieves $T=O(\log n)$ and $A T^{2}=O\left(n^{2}(\log n)^{2}\right)$.

In this paper we shall exhibit a class of optimal, i.e., $A T^{2}=\theta\left(n^{2}\right)$, designs realizing any computation time in the range $[\Omega(\log n), O(\sqrt{n})]$, thereby realizing the first $A T^{2}$-optimal $O(\log n)$-time multiplier. More generally, the new design settles, at least theoretically, the problem of integer multiplication: there exist optimal designs for the entire spectrum of possible computation times. Our new design incorporates ideas of many of the papers cited above. Not unlike previous optimal designs, it makes essential use of the DFT over a finite ring $G$, which we choose as a Fermat ring. In contrast to previous papers, a low-order DFT is used to achieve fast computation time. More precisely, in order to achieve computation time $O(T)$ we will resort to a $T$-point DFT over a ring of $2^{O(n / T)}$ elements. In Preparata and Vuillemin (1981b) an ( $n / \log n$ )-point DFT over a ring of $2^{o(\log n)}$ elements is used for all achievable values of $T$. Since we compute a DFT in a large ring of $2^{O(n / T)}$ elements, efficient implementations of the ring operations and of the data transfer between computing elements are crucial. We borrow from Preparata (1983) the idea of computing the DFT on a mesh of processing elements. Only communication between adjacent processing elements is required in this case and hence we can provide for a large communication bandwidth without paying too high a penalty in area. Each processing
element of the mesh has the ability to do additions-subtractions over $G$ and multiplications by powers of the root of unity. We choose a redundant representation for ring elements and thus achieve $O(1)$ addition/subtraction time in a small area. Since ring $G$ is a Fermat ring, i.e., the set of integers modulo $m=2^{p}+1$ for some $p$, and since the root of unity used in the DFT is a power of two, multiplication by a power of the root of unity can be essentially reduced to a cyclic shift and a small number of additions/ subtractions. We implement cyclic shifts by means of a cube-connected-cycle network (Preparata and Vuillemin, 1981a). Finally, general multiplications in $G$ are realized by one of the fast, suboptimal designs referred to above. Since only $O(T)$ general multiplicaitions of ring elements (which are essentially $O(N / T)$-bit numbers) are required, we will stay within the desired limits of time and area.

The paper is organized as follows. In Section 2 we review the arithmetic basis for the proposed multiplication scheme. We start with a description of one of the fast designs based on divide-and-conquer. Next we review how integer multiplication can be reduced to polynomial multiplication (convolution). We will then discuss the school-method (for polynomial multiplication) and derive from it a $T=O(\log n), A=O\left(n^{2} / \log n\right)$ design. This design is probably the most practical design proposed in this paper. Finally, we discuss interpolation/evaluation schemes and more specifically the DFT for computing convolutions efficiently. In Section 3 we describe the proposed multiplication scheme in detail. We first review how to compute the DFT on a mesh and then discuss in detail the organization of each mesh module. Appendix 1 contains a discussion of a redundant number representation which we use for the algorithm and Appendix 2 shows how to compute the DFT on a mesh; the latter material is essentially taken from Preparata (1983). Appendix 3 contains a succinct review of the structure and operation of the cube-connected-cycles network.

## 2. Arithmetic Background

In this section we briefly review the arithmetic basis of the proposed multiplication scheme. Specifically, we recall how integer multiplication can be solved by divide-and-conquer techniques and more generally by polynomial multiplication. We will also review a particular VLSI-design based on divide-and-conquer techniques. We will then show how the "schoolmethod" for polynomial multiplication can be used to construct a $T=O(\log n), A T^{2}=O\left(n^{2} \log n\right)$ multiplier. Finally, we will discuss how convolution can be computed by an evaluation/interpolation scheme and we shall describe one particularly efficient specialization of the latter as a Discrete Fourier Transform over a Fermat ring.

Throughout this paper $a$ and $b$ are nonnegative integers in the range $\left[0,2^{n / 2}-1\right]$ ( $n$ even). We use $c=a \times b$ to denote their product and $a_{n-1}, \ldots, a_{0}, b_{n-1}, \ldots, b_{0}, c_{n-1}, \ldots, c_{0}$, where $a_{n / 2}=\cdots=a_{n-1}=b_{n / 2}=\cdots=$ $b_{n-1}=0$, to denote the binary representations of $a, b$, and $c$, respectively.

### 2.1. Integer Multiplication by Divide-and-Conquer

Assume that $n$ is divisible by 4 for the sake of simplicity. We can then write

$$
a=a_{1} 2^{n / 4}+a_{0}, \quad b=b_{1} 2^{n / 4}+b_{0}, \quad c=c_{2} 2^{n / 2}+c_{1} 2^{n / 4}+c_{0}
$$

with $a_{1}, a_{0}, b_{1}, b_{0} \in\left[0,2^{n / 4}-1\right]$ and $c_{2}=a_{1} b_{1}, c_{1}=a_{0} b_{1}+a_{1} b_{0}, c_{0}=a_{0} b_{0}$. Hence we can compute the product of two $n / 2$-bit numbers by computing four products of $n / 4$-orbit numbers and a few sums of $n$-bit numbers. The next observation to make is that addition takes time $O(1)$ if a suitable redundant representation is used (cf. Appendix 1). Hence this scheme will result in a $T=O(\log n)$ multiplier. A VLSI layout with $A=O\left(n^{2} \log n\right)$ is readily obtained and can be found in Luk and Vuillemin (1983).

An interesting improvement upon the technique described above is due to Karazuba and Ofman (1962). They observed that $c_{2}, c_{1}$, and $c_{0}$ can be expressed as

$$
c_{2}=a_{1} b_{1} \quad c_{1}=\left(a_{1}+a_{0}\right)\left(b_{1}+b_{0}\right)-a_{1} b_{1}-a_{0} b_{0} \quad c_{0}=a_{0} b_{0}
$$

and hence three multiplications of numbers of half the length suffice. Again, if combined with a redundant number representation, a $T=O(\log n)$ multiplier results. Also, a VLSI layout with $A=O\left(n^{2}\right)$ is available and can be found in Luk and Vuillemin (1983), and Lengauer and Mehlhorn (1983). Thus $A T^{2}=O\left(n^{2} \log ^{2} n\right)$.

It is important to note that both of the above multipliers are pipelinable (in technical terms, their periods are $O(1)$ ), since at each step of the computation all data lie on a single level of the recursion. Thus the pipelined 3-multiplication multiplier $(\mathrm{P} 3 \mathrm{M})$ can be used to multiply $O(\log n)$ pairs of $n$-bit numbers in time $O(\log n)$. We will exploit this observation below.

### 2.2. Integer Multiplication via Polynomial Multiplication

Integer multiplication by divide-and-conquer is a special case of integer multiplication via polynomial multiplication. Let $k$ be an integer, $k \geqslant 2$. For the divide-and-conquer scheme we have $k=2$. Assume w.l.o.g that $k$ divides $n$. We subdivide the binary representation $a_{n-1} \cdots a_{0}$ of $a$ into $k$ strings of length $n / k$ each and consider each string as the representation of a binary number in the range $\left[0,2^{n / k}-1\right]$. In this way we associate with integer $a$ the


Fig. 1. Illustration of the release-of-the-carries.
polynomial $A(x)=\sum_{j=0}^{k=1} A_{j} x^{j}$ such that $A_{j} \in\left[0,2^{n / k}-1\right], a=A\left(2^{n / k}\right)$ and $A_{k / 2}=\cdots=A_{k-1}=0$. Similarly we associate polynomial $B(x)=\sum_{j=0}^{k-1} B_{j} x^{j}$ with $b$. Note that $A(x)$ and $B(x)$ are really of degree $[k / 2\rceil-1$. Let $C(x)=$ $A(x) \cdot B(x)$ be their product. Then $C(x)$ is of degree $k-1$ and $C\left(2^{n / k}\right)=$ $A\left(2^{n / k}\right) B\left(2^{n / k}\right)=a b$. Also, each coefficient of $C(x)$ lies in the range $\left[0, k 2^{2 n / k}-1\right]$ and can thus be expressed as a $\left(2 n / k+\left[\log _{2} k \mid\right)\right.$-bit number. It follows that the product $a b$ can be obtained by expressing each coefficient of $C(x)$ as a $\left(2 n / k+\left\lceil\log _{2} k\right\rceil\right)$-bit number, by positioning the coefficients $n / k$ bits apart as shown in Fig. 1 and by adding these ( $k-1$ ) numbers. This transformation of $C(x)$ into $c=a b$ is normally referred to as the "release-of-the-carries" and can be performed in time $O(\log n)$ (Preparata and Vuillemin, 1981b).

At this point we have reduced integer multiplication to polynomial multiplication. The naive method for the latter problem is the school-method: compute the $k^{2}$ products $A_{i} B_{j}, 0 \leqslant i, j<k$, and sum appropriate terms. For $k=2$ this leads to the 4 -multiplication recursive scheme described at the beginning of Section 2.1.

We will next show how to combine the P3M multiplier with the "schoolmethod" for convolution in order to obtain an $T=O(\log n), A T^{2}=$ $O\left(n^{2} \log n\right)$ VLSI-multiplier. We will describe two quite different methods.


Fig. 2. Structure of Muller's Multiplier.

The first method is a hybridization with a multiplier due to Muller (1963). The multiplier originally proposed by Muller computes the product of two $s$ digit integers by convolving the two factors (see Fig. 2). It consists of $2 s-1$ cells, and the product is obtained in $2 s-1$ shifts and adds. In the binary case, the adder is a conventional full binary adder and the "elementary multiplier" is just an AND-gate. Suppose now we subdivide each of the $n$-bit operands into $\left\lceil\log _{2} n\right\rceil$ strings of $\left(n /\left\lceil\log _{2} n\right\rceil\right)$-bits each to be viewed as a binary number. We now construct a Muller multiplier with $2\left\lceil\log _{2} n\right\rceil-1$ cells, each of which is adapted to process $\left(n /\left\lceil\log _{2} n\right\rceil\right)$-bit numbers, rather than single bits. The adaptation consists of replacing the AND-gate by a P3M multiplier for $n /\left[\log _{2} n\right]$-bit numbers, and the full adder by a threeoperand adder; the redundant representation is kept throughout, so that $O(1)$-time addition is guaranteed. The sequential feeding of the $\left\lceil\log _{2} n\right\rceil$ strings (each fed in parallel) provides the pipeline input to the P 3 M multipliers, so that the overall multiplication is completed in time $O(\log n)$. As to the chip area, each of the $2\left\lceil\log _{2} n\right\rceil-1$ cells has area $O\left(n^{2} / \log ^{2} n\right)$, thereby resulting in an overall area $O\left(n^{2} / \log n\right)$. It follows that $A T^{2}=O\left(n^{2} \log n\right)$, as claimed earlier.

Remark. Of course, the P3M multiplier used in the design above can be replaced by any other pipelinable fast multiplier. In particular, we might use the $T=O(\log n), A=O\left(n^{2} \log n\right)$ multiplier described in Vuillemin (1983), Luk and Vuillemin (1983), and Becker (1982), and obtain a $T=O(\log n)$, $A=O\left(n^{2}\right)$ multiplier. This design is probably the most practical design proposed in this paper.

An alternative approach has been described in Lengauer and Mehlhorn (1983) and is as follows. Divide the $n$-bit integers $a$ and $b$ into $k=\left(\log _{2} n\right)^{1 / 2}$ strings of $n /\left(\log _{2} n\right)^{1 / 2}$ bits each. We now use a P3M multiplier for $n /\left(\log _{2} n\right)^{1 / 2}$-bit integers in order to compute the $k^{2}=\log _{2} n$ products $A_{i} B_{j}$. Adding up appropriate terms and releasing the carry finishes the multiplication. It is easily seen that the area of the design is dominated by the area of the P3M multiplier and hence is $O\left(n^{2} / \log n\right)$. Thus a $T=O(\log n), A=O\left(n^{2} / \log n\right)$ multiplier results.

We have now described two alternative implementations of the schoolmethod. The first one uses $k=\log _{2} n$ and the second one uses $k=\left(\log _{2} n\right)^{1 / 2}$. In the first implementation we use $\log n \mathrm{P} 3 \mathrm{M}$ multipliers to compute the $\left(\log _{2} n\right)^{2}$ multiplications of $\left(n / \log _{2} n\right)$-bit integers and in the second implementation we use one P3M multiplier to compute the $\log _{2} n$ multiplications of $n /\left(\log _{2} n\right)^{1 / 2}$-bit integers.

At this point it is natural to hope that even more efficient multipliers can be obtained by replacing the school-method for polynomial multiplication by a more efficient scheme. The more efficient schemes are based on the concept of evaluation-interpolation and are described in the next section.

### 2.3. Polynomial Multiplication via Evaluation/Interpolation and the Discrete Fourier Transform

Let $G$ be a division ${ }^{1}$ ring and let $A(x), B(x)$ be polynomials of degree $<k / 2$ over $G$. Let $x_{0}, x_{1}, \ldots, x_{k-1}$ be distinct elements of $G$. Denoting, as usual, by $C(x)$ the degree- $(k-1)$ product $A(x) \cdot B(x)$, we have

$$
C\left(x_{j}\right)=A\left(x_{j}\right) \cdot B\left(x_{j}\right) \quad 0 \leqslant j \leqslant k-1 .
$$

Thus, by evaluating $A(x)$ and $B(x)$ at each of the values $x_{0}, \ldots, x_{k-1}$, we obtain, by means of only $k$ multiplications over $G$, the values $C\left(x_{0}\right), \ldots$, $C\left(x_{k-1}\right)$, from which the $k$ coefficients of $C$ are computed by interpolation.

Karazuba's 3-multiplication method is an instance of evaluation/ interpolation for $k=3$. Let $A(x)=A_{1} x+A_{0}, B(x)=B_{1} x+B_{0}$ be two polynomials of degree 1 . We choose $x_{0}=0, x_{1}=1$ and $x_{2}=\infty$. Then $A\left(x_{0}\right)=A_{0}, \quad A\left(x_{1}\right)=A_{1}+A_{0}, \quad A\left(x_{2}\right)=A_{1}, \quad B\left(x_{0}\right)=B_{0}, \quad B\left(x_{1}\right)=B_{1}+B_{0}$, $B\left(x_{2}\right)=B_{1}$, and $C\left(x_{0}\right)=A_{0} B_{0}, C\left(x_{1}\right)=\left(A_{1}+A_{0}\right)\left(B_{1}+B_{0}\right), C\left(x_{2}\right)=A_{1} B_{1}$. Finally, the interpolation formulae are

$$
\begin{aligned}
& C_{0}=C\left(x_{0}\right) \\
& C_{1}=C\left(x_{1}\right)-C\left(x_{0}\right)-C\left(x_{2}\right) \\
& C_{2}=C\left(x_{2}\right)
\end{aligned}
$$

As one can see from Karazuba's metod, the choice of points $x_{0}, \ldots, x_{k-1}$ is very crucial for the effectiveness of the evaluation/interpolation scheme. It is well known that for large $k$ a good choice for the evaluation points is consecutive powers of an order $-k$ primitive root of unity in $G$. This leads to the Discrete Fourier Transform (DFT) (Aho, Hopcroft, and Ullman, 1974).

In particular we want to choose the commutative ring $G$ such that if $\omega$ is a primitive root of unity of order $k$, multiplication of an element of $G$ by $\omega^{i}$ ( $i=0, \ldots, k-1$ ) can be done very efficiently. One very attrative choice is provided by a Fermat ring, i.e., by the set of the integers modulo a number of the form $2^{p}+1$, for integer $p$ : indeed, as we show in Appendix 1, multiplication by $\omega^{i}$ reduces to a minor variant of left cyclic shift. The suitability of a Fermat ring to our objective is demonstrated by the following property (see Aho et al., 1974, p. 266, Theorem 7.5):

Proposition 1. Let $r$ and $\omega$ be powers of 2 and let $m=\omega^{r / 2}+1$. Letting $\mathbf{Z}_{m}$ be the ring of integers modulo $m$ (a Fermat ring), then $r$ and $\omega$

[^1]have multiplicative inverses in $\mathbf{Z}_{m}$ and $\omega$ is a primitive rth root of unit in $\mathbf{Z}_{m}$.

The arithmetic of Fermat rings is described in Appendix 1.
We close this section with a brief description of the construction we are about to describe. Let $T$ be an integer between $\log n$ and $\sqrt{n}$ (the symbol $T$ is chosen as a reminder that this integer is the "target computation time" of the network). We reduce multiplication of $n$-bit integers to multiplication of polynomials of degree $T$ with coefficients in the range $\left[0,2^{n / T}\right)$. Multiplication of polynomials is based on evaluation/interpolation over a Fermat ring. For the $T$ multiplications in the ring we use one P 3 M multiplier for $(n / T)$-bit integers. Thus all $T$ multiplications take time $O(T+\log n / T)=$ $O(T)$ and area $A=O\left(n^{2} / T^{2}\right)$. Evaluation and interpolation are the DFT and its inverse. Section 3 is devoted to the computation of the order $-T$ (DFT) in a ring of size $2^{O(n / T)}$ in time $O(T)$ and area $O\left(n^{2} / T^{2}\right)$.

## 3. The Multiplier Network

A multiplier network of the type we describe below consists of four major subnetworks, illustrated in Fig. 3. Operands are loaded from the left and intermediate results migrate to the right, residing a certain amount of time in each of the four major subnetworks. Operands are represented with $n$ bits and, denoting by $T \in[\log n, \sqrt{n})]^{2}$ a power of 2 that divides $n$, each operand is subdivided into $T$ strings of $n / T$ bits. For two such operands $a$ and $b$ we have

$$
a=\sum_{i=0}^{T-1} a_{i} 2^{n i / T}, \quad b=\sum_{i=0}^{T-1} b_{i} 2^{n i / T}
$$

where $0 \leqslant a_{i}, b_{i}<2^{n / T}$ for $i=0, \ldots, T / 2-1$ and $a_{i}=b_{i}=0$ for $i \geqslant T / 2$. Analogously we define $c=a \times b$, so that

$$
c_{j}=\sum_{i=0}^{j} a_{i} b_{j-i} \quad 0 \leqslant j<T .
$$

From this expression, it is obvious that $0 \leqslant c_{j}<T 2^{2 n / T}$. We now wish to treat the $a_{i}$ 's, $b_{i}$ 's, and $c_{i}$ 's as members of a Fermat ring $\mathbf{Z}_{m}$. To find the smallest suitable Fermat ring it is sufficient to choose $m=2^{p}+1 \geqslant$ $k T^{2 n / T}>\max _{i=0}^{T-1} c_{i}$, which is verified by

$$
p=3\left\lceil\frac{n}{T^{2}}\right\rceil T \quad \text { for } \quad n \geqslant 16
$$

[^2]

Fig. 3. General scheme of the multiplier.
Therefore, if we select

$$
\begin{aligned}
& p=3\left[\frac{n}{T^{2}}\right] T, \\
& m=2^{p}+1, \\
& \omega=2^{2 p / T}=2^{6\left\{n / T^{2}\right\}}
\end{aligned}
$$

we verify the hypotheses of Proposition 1 with $r=T$, so that $\omega$ is a primitive root of unity in $\mathbf{Z}_{m}$ of order $T$.

Example. For $n=256=2^{8}$, and $T=8$, we have $p=96$ and $\omega=2^{24}$. A 128 -bit operand is subdivided as illustrated below. Each of the four "chunks"

is further embedded into a 97 -bit string, as a member of $\mathbf{Z}_{296+1}$.
Each transfer of data from major subnetwork to major subnetwork (see Fig. 3), as well as from the input and to the output, involves $O(n / T)$ bits at a time: specifically, $(p+1)$ bits between modules and $n / T$ bits in I/O transfers. Thus each transfer uses $O(T)$ time.

The pipelined multiplier is a straightforward variant of a pipelined 3multiplication multiplier: it uses length- $(p+1)$ operands and has area $O\left(n^{2} / T^{2}\right)$, since $p=O(n / T)$. Due to its pipeline structure, it performs multiplications in time $O(T+\log (n / T))=O(T)$ since $T \geqslant \log n$. Thus, the pipelined multiplier subnetwork has area and time obeying the $A T^{2}=\theta\left(n^{2}\right)$ target.

The FFT-engines (both for the direct DFT and for its inverse) are the crucial components of the network and will now be described in some detail. Each consists of $T$ macromodules organized as a $\lfloor\sqrt{T}\rfloor \times\lceil\sqrt{T}\rceil$ mesh (see Fig. 4, where we have implicitly assumed that $T$ is a square). If the length of each side of the macromodule is $O\left(n / T^{3 / 2}\right)$, then each DFT-engine has area $O\left(n^{2} / T^{2}\right)$, sufficient to achieve $A T^{2}$-optimality. Next we shall show that this objective is attainable.


Fig. 4. Architecture of the FFT-engine.
It has been shown in Preparata $(1983)^{3}$ how a mesh-connected architecture of $s \times s$ modules can be used to compute the FFT of $s^{2}$ elements in $O(s)$ "parallel exchange steps" and $O(\operatorname{logs})$ "parallel butterfly steps," where an exchange step involves the exchange of the operands of two adjacent modules and the butterfly involves a multiplication by a power $\omega^{i}$ of the principal root and an addition-subtraction. With this background, each macromodule of the mesh is designed to contain a $\mathbf{Z}_{m}$-operand represented in redundant radix-4 form (i.e., with $3(p / 2+1)$ bits) and must have the following capabilities:

1. Transfer its operand to an adjacent module (or exchange operands with an adjacent module);
2. Add two operands (or, equivalently in the redundant representation, subtract one from the other);
3. Multiply an operand by $\omega^{i}(i=0, \ldots,-1)$.

As noted earlier, we have $O(\sqrt{T})$ operations of type 1 and $O(\log T)$ operations of types 2 and 3 ; thus, since $T$ is our target computation time, the target times are $O(\sqrt{T})$ and $O(T / \log T)$ for the two types of operations, respectively.

The macromodule, which is designed to store an $O(n / T)$-bit operand, will be structured as follows. It contains $n / T^{3 / 2} \theta(\sqrt{T})$-bit shift registers, as illustrated in Fig. 5. The length of either side of the macromodule square is $O\left(n / T^{3 / 2}+T^{1 / 2}\right)=O\left(n / T^{3 / 2}\right)$ since $T \leqslant \sqrt{n}$, thus attaining the desired area. Note that the shift-registers can always be arranged as illustrated, since the register length is of lower order than the length of the macromodule side for all $T$. It is also straightforward to conclude that time $\theta(\sqrt{T})$ for an exchange operation is achieved by the proposed structure, by shifting in parallel the content of each shift-register to the homologous shift-register in one of the adjacent macromodules.

[^3]

Fig. 5. Structure of a macromodule.

The structure obtained so far is also quite adequate for the execution of type 2 operations (additions-subtractions), by simply equipping each register with a serial adder and introducing a few extra wires to transmit the carries between registers and to perform $R$-normalization, as defined in Appendix 1.

More delicate is the implementation of type 3 operations. Since multiplication by $\omega^{i}=2^{6\left[n / T^{2}\right\rceil i}$ is basically a left cyclic shift by $O\left(\left[n / T^{2}\right\rceil i\right)$ positions (for $i=0,1, \ldots, T-1$ ), we must provide an interconnection capable of performing any one of $T$ different cyclic shifts of data blocks of size $O\left(n / T^{2}\right)$. Thus the basic information unit dealt with in type 3 operations is a block of $O\left(n / T^{2}\right)$ bits, which we stipulate to be stored in a micromodule. (Thus a macromodule consists of $T$ micromodules.) For the sake of simplicity, we assume temporarily that $T \leqslant C_{2} n^{2 / 5}$, for some constant $C_{2}$. With this hypothesis each micromodule is an assembly of $O\left(n / T^{5 / 2}\right)$ continguous registers. ( $C_{2}$ is chosen so that this number of registers is at least 1.) The transfer of the content of a micromodule occurs, with a bandwidth equal to the number of its constituent registers, in time $O(\sqrt{T})$.

To perform the desired cyclic shift, we propose to interconnect the $T$ micromodules as an appropriate cube-connected-cycles (CCC) network. ${ }^{4}$ Specifically, we shall realize a CCC of $2^{v-u}$ cycles, each cycle consisting of $2^{u}$ micromodules (referred to as a $2^{u} \times 2^{v-u} \mathrm{CCC}$ ), where $v=\log _{2} T$ and $u=\left\lceil\frac{1}{2} \log _{2} T-\log _{2} \log _{2} T+c\right\rceil$, for a suitable constant $c$. Since it is a functional requirement of the CCC that $2^{u} \geqslant v-u$ (see Appendix 3), we have the condition

$$
2^{\left\lceil\log _{2} T-\log _{2} \log _{2} T+c\right\rceil / 2} \geqslant \log _{2} T-\left\lceil\frac{1}{2} \log _{2} T \log _{2} \log _{2} T+c\right\rceil .
$$

[^4]It can be easily verified that this constraint is always satisfied for $T \geqslant 4$ by choosing $c \simeq 1.308$. Note that this CCC has cycles of length $O(\sqrt{T} / \log T)$ and is capable of performing any of the $T$ prescribed cyclic shifts in a number of steps also $O(\sqrt{T} / \log T)$. Since, as noted earlier, the total available time for a cyclic shift is $O(T / \log T)$, the time available for each CCC-step is $O(\sqrt{T})$, which is exactly the time used to transfer the content of a micromodule. Thus a CCC interconnection realizes the desired computation time for type 3 operations.

We must still verify that the described CCC can be embedded in the macromodule with an insignificant increase in the area (i.e., an area blowup by a constant factor). The modified layout is obtained by dilating, by a factor of 2 , each side of the original layout. The new tracks made available are used to realize the CCC connections: specifically, the upper-right portion is used for the cycle links, whereas the lower-left portion is used for the lateral connections. The scheme is illustrated for a $4 \times 4 \mathrm{CCC}$ in Fig. 6 and requires no further comment.

The same considerations apply to the FFT engine designed to implement the inverse FFT. (The only additional operations are the multiplication of each result by $1 / T=-2^{p} / T$-the negative of a power of two (see Appendix 1).)

Remark. We now consider the case $T>C_{2} n^{2 / 5}$. In this situation each shift register contains $O\left(T / n^{2 / 5}\right)$ micromodules, since a shift register holds $\theta\left(T^{1 / 2}\right)$ bits and a micromodule holds $\theta\left(n / T^{2}\right)$ bits. Also, as before, there are $n / T^{3 / 2}$ shift registers per macromodule.

Within each macromodule we realize a CCC whose nodes are now the registers (not the micromodules as before). The CCC has $2^{v-u}$ cycles, each cycle consisting of $2^{u}$ registers, where $v=\log _{2}\left(n / T^{3 / 2}\right)$ and $u=$ $\log _{2}\left(T^{1 / 2} / \log T\right)$. In particular, the cycle length is $T^{1 / 2} / \log T$. We can clearly embed the CCC with only a constant blowup in area. The interconnection of


Fig. 6. Embedding of a $4 \times 4 \mathrm{CCC}$ into a 16 -element macromodule.
the registers is completed by (switchable) links that have the capacity of reconfiguring the registers into a single $(N / T)$-bit cyclic shift register.

Consider now a cyclic shift by $\left(n / T^{2}\right) i$ positions, $0 \leqslant i<T$, and write

$$
\left(n / T^{2}\right) i=a T^{1 / 2}+b
$$

where $a$ and $b$ are integers with $b<T^{1 / 2}$. We can realize the cyclic shift of the content of a macromodule by $\left(n / T^{2}\right) i$ positions in two steps:
(1) Shift by $b$ positions. This can clearly be done in time $O(b)=O\left(T^{1 / 2}\right)$ by using the registers reconfigured as a single shift register.
(2) Shift by $a \cdot T^{1 / 2}$ positions. Since $a T^{1 / 2}$ is a multiple of the shift register length we can perform such a shift using the CCC. It takes $O\left(T^{1 / 2} / \log T\right)$ CCC-steps, and hence $O(T / \log T)$ time units. (Recall that shift registers are transferred bit-serially.)
We conclude that the total time needed for the shift is $O(T / \log T)$.
Since time $O(T / \log T)$ is available for each shift, as noted earlier, we conclude that we are operating within the desired time bound.

One final comment is in order with regard to the transfer of data from the DFT-engine to the pipeline multiplier (and vice versa). Elements of $\mathrm{Z}_{m}$ have to be transferred in parallel on the entire channel of bandwidth $O(n / T)$, whereas at the completion of the DFT computation each such element is wholly stored in a macromodule. Therefore a preliminary data rearrangement is necessary, that will bring all microoperands of a given macromodule to be aligned (in a given row or column). The data paths necessary for this rearrangement are available, and it is left as an exercise to show that $O(\sqrt{T})$ time suffices to complete this task.

Since all major modules of Fig. 3 have area $O\left(n^{2} / T^{2}\right)$, and the time used for the DFTs and the pointwise multiplications is $O(T)$ (notice that this time adequately accounts for the release-of-the-carries and the conversions in $\mathbf{Z}_{m}$ to nonredundant form), we have the followong conclusions:

Theorem. It is possible to construct VLSI multipliers of $n$-bit numbers with the optimal performance $A T^{2}=\theta\left(n^{2}\right)$ for all computation times $T$ such that $\Omega(\log n) \leqslant T \leqslant O(\sqrt{n})$.

## APPENDIX 1: The Arithmetic of Fermat Rings

The operands considered in this paper are elements of a Fermat ring ${ }^{5} \mathbf{Z}_{m}$ of the integers modulo $m=2^{p}+1$, where $p$ is an even integer (to be chosen later). The operands are also represented in a redundant radix- 4 form, where

[^5]\[

$$
\begin{array}{lllll}
a_{L-1} & \cdots & a_{i} & \cdots & a_{1} \\
a_{0}
\end{array}
$$+
\]

Fig. 7. Illustration of the first step of addition for numbers in redundant radix-4 representation.
the digits belong to the set $\{-3,-2,-1,0,1,2,3\}$. Thus the value of a digit string ( $a_{L-1}, a_{L-2}, \ldots, a_{0}$ ), with $L=p / 2+1$, is

$$
\sum_{i=0}^{L-1} a_{i} 4^{i}
$$

which yields an operand range $R \triangleq\left[-4^{L}+1,4^{L}-1\right] \supseteq \mathbf{Z}_{m}$. (Notice also that $4 m>4^{L}-1$.) We shall call $R$-normalization the operation of bringing a number within the range $R$ modulo $m$, i.e., to go from $x \in Z$ to $y \in R$ with $y=x(\bmod m)$.

We shall discuss the operations of addition/subtraction, multiplication and division by a power of 4 , and conversion between redundant and irredundant forms.
(i) Addition-subtraction in $\mathbf{Z}_{m}$. Since $a=\sum_{i=0}^{L-1} a_{i} 4^{i}$ means $-a=$ $\sum_{i=0}^{L-1}\left(-a_{i}\right) 4^{i}$, subtraction reduces trivially to addition. Suppose then we wish to add modulo $m$ the two numbers in $R$

$$
a=\sum_{i=0}^{L-1} a_{i} 4^{i} \quad \text { and } \quad b=\sum_{i=0}^{L-1} b_{i} 4^{i}
$$

so that their sum is also normalized in $R$. Referring to Fig. 7, for each $i=0, \ldots, L-1$, we first compute the digit pair ( $s_{i}^{*}, c_{i+1}$ ) from the pair $\left(a_{i}, b_{i}\right)$, according to Table I. Notice that $a_{i}+b_{i}=s_{i}^{*}+4 c_{i+1}$, $s_{i}^{*} \in\{-2,-1,0,1,2\}$ and $c_{i+1} \in\{-1,0,1\}$. To obtain the final sum, we distinguish various cases:

TABLE I

| $a_{i}+b_{i}$ | -6 | -5 | -4 | -3 | -2 | -1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
| :---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| $s_{i}^{*}$ | -2 | -1 | 0 | 1 | -2 | -1 | 0 | 1 | 2 | -1 | 0 | 1 | 2 |
| $c_{i+1}$ | -1 | -1 | -1 | -1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |

1. $c_{L}=0$. In this case, the integer represented by

$$
\left(s_{L-1}, \ldots, s_{0}\right), \quad s_{i}=s_{i}^{*}+c_{i} \quad\left(-3 \leqslant s_{i} \leqslant 3, i=0, \ldots, L-1\right)
$$

is a legitimate representation of the sum $(a+b) \bmod m$ in the range $R$.
2. $c_{L} \neq 0$. In this case the result does not belong to $R$, so that a correction is necessary to accomplish $R$-normalization. Specifically, we have

$$
\begin{aligned}
(a+b) \bmod m & =\left(\sum_{j=0}^{L-1} s_{j} 4^{j}+c_{L} 4^{L} \bmod m\right) \bmod m \\
& =\left(\sum_{j=0}^{L-1} s_{j} 4^{j}-4 c_{L}\right) \bmod m
\end{aligned}
$$

since $4^{L}=2^{p+2}=-4 \bmod m$. We further distinguish two subcases:
2.1. $c_{L} s_{1} \neq-3$. In this case the final sum is obtained by simply replacing $s_{1}$ with $s_{1}^{\prime}=s_{1}-c_{L}$, since $-3 \leqslant s_{1}-c_{L} \leqslant-3$.
2.2. $c_{L} s_{1}=-3$. In this case $s_{1}-c_{L}=-4$ or 4 , so the correction of case 2.1 is not applicable. We then apply the same technique to perform the addition $(S+C) \bmod m$, where $S=\sum_{j=0}^{L-1} s_{j} 4^{j}$ and $C=-4 c_{L}$. Letting $s_{1}-c_{L}=s_{1}^{* *}+c_{2}^{\prime}$, we note that $s_{1}^{* *}=0$ (since $s_{1}-c_{L}=-4$ or 4), whence in forming the final sum case 2.2 cannot arise again.

It follows from the preceding discussion that addition can be done in $O(1)$ time.
(ii) Multiplication and division by a power of 4 . Let $a=\sum_{j=0}^{L-1} a_{j} 4^{j}$ and consider the product $a \cdot 4^{s} \bmod m$, for some integer $s$. We have

$$
\begin{aligned}
a 4^{s} \bmod m & =\sum_{j=0}^{L-1} a_{j} 4^{j+s} \bmod m \\
& =\left(\sum_{h=s}^{L-1} a_{h-s} 4^{h}+\sum_{h=L}^{L-1+s} a_{h-s} 4^{h} \bmod m\right) \bmod m \\
& =\left(\sum_{h=s}^{L-1} a_{h-s} 4^{h}+\sum_{i=0}^{s-1} a_{L-s+i} 4^{i}(-4)\right) \bmod m
\end{aligned}
$$

since $4^{L} \bmod m=-4$. Thus multiplication by $4^{s}$ is equivalent to:
(a) cyclically shifting to the left the $L$-digit string by $s$ digit positions;
(b) changing the sign of the $s$ least significant digits of the string obtained in (a) and shifting them one position to the left;


Fig. 8. Illustration of multiplication and division by a power of 4 .
(c) adding the two resulting numbers with the method described above.

These opereations are illustrated schematically in Fig. 8a.
With regard to divisions by powers of 4 , we know that any power of two $2^{s}(s \leqslant p)$ has a multiplicative inverse in $\mathbf{Z}_{m}$, given by $2^{p}+1-2^{p-s}$. The inverse of a power of two in $\mathbf{Z}_{m}$, however, is not a power of two, and so multiplication by it does not exhibit the interesting feature described above. However, since we chose to represent the elements of $\mathbf{Z}_{m}$ in $R$, we represent the multiplicative inverse of $2^{s}$ as $-2^{p-s}$, so that multiplication by it becomes a right cyclic shift by $s$ digit positions, with subsequent negation of the $s$ most significant digits and their shift one further position to the right, as shown in Fig. 8b.

Since sign changes and additions are performed in constant time, the computation time is dominated by the time used by the shift operations.
(iii) Multiplications in $\mathbf{Z}_{m}$. Whatever multiplication scheme we adopt (see Section 2), the result $p$ is a $2 L$-digit number. To bring it within range $R$ ( $L$-digit numbers), we operate as follows:

$$
\begin{aligned}
p=\sum_{i=0}^{2 L-1} p_{i} 4^{i} & =\sum_{i=0}^{L-1} p_{i} 4^{i}+\sum_{i=0}^{L-1} p_{L+i} 4^{L+i} \\
& =\sum_{i=0}^{L-1} p_{i} 4^{i}-\sum_{i=0}^{L-1} p_{L+i} 4^{i} \quad \bmod m \\
& =\sum_{i=0}^{L-1} p_{i} 4^{i}-\sum_{i=0}^{L-1} p_{L-1+i} 4^{i}+4 p_{2 L-1} \quad \bmod m
\end{aligned}
$$

Thus, to perform the $R$-normalization of the product we must perform a shift and two additions over $\mathbf{Z}_{m}$.
(iv) Conversion between binary form and redundant radix-4 form. Without loss of generality, we assume that the input binary numbers are nonnegative and represented in 2 's complement form with $p+2$ bits $b_{p+1}, \ldots, b_{0}\left(b_{p+1}=0\right.$ is the sign bit). The conversion to radix-4 form is
trivially accomplished by inserting 0 to the left of $b_{2 i+1}$ for $i=0,1, \ldots, p / 2$, in constant time.

The conversion to binary form is somewhat more complex. One possible implementation consists of forming two redundant radix-4 numbers $a_{+}$and $a_{-}$, consisting respectively of the positive and negative digits of the given number $a$. Next, each digit of $a_{+}$and $a_{-}$is converted to the binary representation of its modulus, thereby obtaining two binary numbers $a_{+}^{\prime}$ and $a_{-}^{\prime}$; all of these operations take time $O(1)$.

Finally we compute $a^{*}=a_{+}^{\prime}-a_{-}^{\prime}$ by a binary subtraction (in time $O(\log p)$. Notice that $a^{*}$ belongs to $R$ but not necessarily to $\mathbf{Z}_{m}:$ so, to compute $a^{*} \bmod m\left(\mathbf{Z}_{m}\right.$-normalization) it may be necessary to add/subtract $m$ at most four times, since $4 m>4^{L}-1$.

## APPENDIX 2: A Mesh-Connected Network for the Discrete Fourier Transform ${ }^{6}$

Let $G$ be a commutative ring containing a primitive root of unity, $\omega$, of order $k=m^{2}$ in $G$. We then have the following two facts:

A1. The DFT $\left\langle A_{0}, A_{1}, \ldots, A_{k-1}\right\rangle$ of a vector $\left\langle a_{0}, a_{1}, \ldots, a_{k-1}\right\rangle$ can be obtained as a two-dimensional DFT, by arranging the vector in row-major order as $m \times m$ matrix $A=\left\|a_{i j}\right\|$, where $a_{i j}=a_{m i+j}(j<m)$. (Note that indexing starts from 0 rather than from 1.) Letting $A_{r s}=A_{m r+s}$, we then have

$$
\begin{equation*}
A_{r s}=\sum_{i j} a_{i j} \omega^{(m i+j)(m r+s)}=\sum_{j=0}^{m-1}\left(\omega^{m}\right)^{j r}\left(\omega^{s j} \sum_{i=0}^{m-1} a_{i j}\left(\omega^{m}\right)^{i s}\right) . \tag{1}
\end{equation*}
$$

The latter expression suggests the following algorithm
D1. $A_{s j}^{\prime} \leftarrow \sum_{i=0}^{m-1} a_{i j}\left(\omega^{m}\right)^{i s} \quad$ (DFT of each column of the matrix);
D2. $A_{s j}^{\prime \prime} \leftarrow \omega^{s j} A_{s j}^{\prime} \quad$ (local multiplication);
D3. $A_{s r}^{\prime \prime \prime} \leftarrow \sum_{j=0}^{m-1} A_{s j}^{\prime \prime}\left(\omega^{m}\right)^{j r} \quad$ (DFT of each row of the matrix).
(Note that $A_{s r}^{\prime \prime \prime}=A_{r s}$; i.e., the algorithm obtains in reality the transpose of the desired matrix.) This method has already been used in Brent and Kung (1981), where, however, the DFT itself was obtained through matrix multiplication.

[^6]A2. A unidimensional $m$-module array (where $m=2^{r}$ for convenience) can be used to compute the DFT of an $m$-vector, as has been shown in Preparata and Vuillemin (1981a,b). This computation uses $\theta(m)$ exchange steps and $\theta(\log m)$ "butterfly" steps.

Thus, if we have an $m \times m$ mesh of $k$ modules, the columns of the mesh are first used to execute in parallel Step D1 according to the scheme alluded to in A2, and-following the local multiplication D2-the rows of the mesh are finally used to execute in parallel Step D3 (again using the scheme A2).

## APPENDIX 3: Structure and Operation of the Cube-Connected-Cycles Network

The $2^{u} \times 2^{v-u}$ cube-connected-cycles (CCC) ${ }^{7}$ is a network of $2^{v}$ modules, which can be conveniently thought of as a $2^{u} \times 2^{v-u}$ array of processors $P[i, j]\left(0 \leqslant i<2^{u}, 0 \leqslant j<2^{v-u}\right)$, arranged as a matrix where $j$ grows from left to right and $i$ grows from bottom to top. The CCC-processor $P[i, j]$ has number $h=j \cdot 2^{u}+i$. The columns of the $2^{u} \times 2^{v-u}$ arrays are connected as cycles; i.e., there is a connection between $P[i, j]$ and $P\left[(i+1) \bmod 2^{u}, j\right]$ for $0 \leqslant i<2^{u}, 0 \leqslant j<2^{v-u}$. Furthermore, there is a link between processors $P[i, j]$ and $P\left[i, j^{\prime}\right]$, i.e., processors in the same row, if the binary representations of $j$ and $j^{\prime}$ differ exactly in bit position $i$; these links are called lateral connections. A $4 \times 4 \mathrm{CCC}$ is shown in Fig. 9.

It has been shown ${ }^{7}$ that a $2^{v}$-processor CCC emulates the $v$-dimensional binary cube architecture, in executing the algorithms that requires the successive use of the cube dimensions $\left\{E_{0}, \ldots, E_{v-1}\right\}$, either in the order $E_{0}, \ldots, E_{v-1}$ (ASCEND) or in the reverse order (DESCEND). (Such an algorithmic paradigm has been referred to as "recursive combination.") In more detail, and referring for concreteness to the $A S C E N D$ schedule, the CCC cycles emulate cube dimensions $E_{0}, E_{1}, \ldots, E_{u-1}$ (cycle dimensions), whereas the lateral connections are used to emulate cube dimensions $E_{u}$, $E_{u+1}, \ldots, E_{v-1}$ (lateral dimensions). It is therefore clear that, due to the assignment of rows to dimensions, a cycle must contain at least as many processors as there are lateral dimensions; that is,

$$
2^{u} \geqslant v-u
$$

The time used by the CCC to carry out an $A S C E N D$ or $D E S C E N D$ algorithm is proportional to the CCC cycle length.

The operation of cyclic shift has been shown to be a representative of the recursive combination paradigm, and therefore can be executed by the CCC.

[^7]

Fig. 9. A $4 \times 4$ CCC. Processors are labelled with their numbers $(v=4, u=2)$.

Received August 11, 1983; accepted December 21, 1983

## References

Abelson, H. and Andreae, P. (1980), Information transfer and area-time trade-offs for VLSI multiplication, Comm. ACM 23, No. 1, 20-22.
Aho, A. V., Hopcroft, J. E., and Ullman, J. D. (1974), "The Design and Analysis of Computer Algorithms," Addison-Wesley, Reading, Mass.
Becker, B. "Schnelle Multiplizierwerke für VLSI-Implementierung," Technical Report, Uni. des Saarlandes, 1982.
Brent, R. P., and Kung, H. T. (1981), The chip complexity of binary arithmetic, J. Assoc. Comput. Mach. 28, 521-534.
Dadda, L. (1965), Some schemes for parallel multipliers, Alta Frequenza 34, 343-356.
Karazuba, A and Ofman, Y. (1962), Multiplication of multidigit numbers on automata, Dokl. Akad. Nauk SSSR 145, 293-294.
Lengauer, T., and Mehlhorn, K. (1983), VLSI complexity theory, efficient VLSI algorithms and the HILL design system, in "The International Professorship in Computer Science: Algorithmics for VLSI" (Trullemans, Ed.), Academic Press, New York, in press.
Luk, W. K., and Vuillemin, J. E. (1983), "Recursive Implementation of Optimal Time VLSI Integer Multipliers," VLSI 83, Trondheim, Norway, September.
Muller, D. E. (1963), Asynchronous logic and application to information processing, in "Switching Theory in Space Technology" (Aiken and Main, Eds.), Stanford Univ. Press, Stanford, Calif.
Preparata, F. P. (1983), An area-time optimal mesh-connected multiplier of large integers, IEEE. Trans. Comput. C-32, No. 2, 194-198.
Preparata, F. P., and Vuillemin, J. (1981a) The cube-connected-cycles: A versatile network for parallel computation, Comm. ACM 24, No. 5, 300-309.
Preparata, F. P., and Vuillemin, J. (1981b), Area-time optimal VLSI networks for computing integer multiplication and discrete Fourier transform, in "Proceedings, I.C.A.L.P., Haifa, Israel," pp. 29-40.

Schönhage, A., and Strassen, V. (1971), Schnelle Multiplikation grosser Zahlen, Computing 7, 281-292.

Thompson, C. D. (1979), Area-time complexity for VLSI, in "Proceedings, 11 th Annual ACM Symposium on the Theory of Computing (SIGACT)," pp. 81-88.
Vuillemin, J. E. (1983), A very fast multiplication algorithm for VLSI implementation, Integration, VLSI J. 1, No. 1, 33-52.
Wallace, C. S. (1964), A suggestion for a fast multiplier, IEEE Trans. Comput. EC-13, No. 2, 14-17.


[^0]:    * This work was supported by the National Science Foundation under Grants MCS-8105552 and ECS-81-06939; additional support was provided by the Deutsche Forschungsgemeinschaft SFB 124, VLSI-Entwurf und Parallelität.

[^1]:    ${ }^{1}$ Below, this condition on the nature of $G$ will be relaxed.

[^2]:    ${ }^{2}$ We shall discuss later the choice of the upper extreme of this interval.

[^3]:    ${ }^{3}$ For convenience, see Appendix 2 for a review of the technique.

[^4]:    ${ }^{4}$ In Appendix 3 the reader will find a concise description of the CCC.

[^5]:    ${ }^{5}$ Fermat rings were used by Schönhage and Strassen (1971) in their fast multiplication technique.

[^6]:    ${ }^{6}$ Preparata (1983).

[^7]:    ${ }^{7}$ Preparata and Vuillemin (1981a).

