Wednesday, May 28, 2008

P vs NC IV: Linear PRAMs I

1. Prelude

Taken together, the P vs NC paper is quite intimidating in its size and scope. And yet, when it's cut down to bite-sized blogging chunks, the pieces seem to follow inexorably in a manner that leaves one asking, 'how else could it have been'. There's the key inspiration of using parametric complexity, the magic trick we'll see more of today. But the rest of it uses the kind of geometric arguments that are familiar to anyone who's read papers on the complexity of arrangements. Make no mistake: taking care of issues of bit complexity requires delicate arguments. But the main thrust of the approach is surprisingly easy to explain, ONCE we've accepted the magic trick.

In this post and the next, we'll begin to see glimmers of the WHY for the parametric argument. In brief, using parametric complexity allows us to navigate an affine subspace of the space of all possible inputs, and then we let geometry take over and guide us.

2. Linear PRAMs

We're close to seeing the main killer argument at the heart of the paper. But as with many arguments, it's easier if we look at a skeleton argument applied to a somewhat simpler case. We already saw a toy example of this argument, and now we'll move to a more nontrivial model, that of linear PRAMs.

In the first post, we defined a linear PRAM as a PRAM-without-bitops in which all multiplications are by a constant. We'll see a slightly stronger lower bound for computations in this model. Specifically, we will show that for a problem with parametric complexity and bitsize , there exists a large enough constant b such that the problem cannot be solved on a linear PRAM with p processors in time . This generalizes the bound that will hold for the stronger model: namely, time with processors.

Substituting what we've already seen for the parametric complexity of max-flow, we recover a lower bound of time using p processors, for an n-node network.

Notes:
  • For clarity, I'm skipping over the specification of bitlengths: these will fold in later, but throwing in all these numbers right now confuses the flow.
  • As before, the lower bound applies for approximations and randomization: how this happens will be discussed later on.
3. Bounding the shape and number of branchings

The high-level proof strategy first requires us to bound the number of possible branchings of an algorithm in this model, and then requires us to describe the "cells" of these branchings in some geometric manner. As we did earlier, we'll say that two inputs x, x' are t-equivalent if the machine executes the same set of instructions on x and x' for the first t steps of the computation.
This defines an equivalence relation, and our goal will be to estimate the number of equivalence classes in this relation (as a function of t).

Now let's introduce the parametrization. Suppose that each input x is a function of d parameters , where each particular input variable is expressed as a linear function of the parameters. We'll further assume that these linear forms are integral, and the parameters range over integer values with some bounded bitlength.

The partitioning of the inputs x into equivalence classes induces a partitioning of the parameters in the obvious way (two parameter vectors z, z' are t-equivalent if the corresponding x, x' are). Let denote the number of t-equivalent classes in the parameter space. The main result we can show now is that:
For all t, is bounded by
Here's the proof argument (and for those of you who've waded through parametric search, this will be familiar): First of all, remember that since all operations are linear (since one term of each multiplication is a constant), each arithmetic operation is some linear function of the input. From this, and the assumption that memory pointers are not functions of the input (this is the circuit view of the computation), it follows that the contents of each memory location are a linear function of the equivalence class the input lies in, and the time t.

Now imagine simulating the execution of the algorithm on an as-yet unspecified parameter vector z (as we do in parametric search). What this means is that arithmetic operations are linear forms defined over the symbolic value of the parameter. When a branch occurs, the branch outcome is determined by comparing two linear forms, which reduces to testing the sign of a linear expression. Let's take some z in a fixed t-equivalence class C. From the above remarks, each branch involves testing at most p linear forms on z, each of which can be viewed as a hyperplane in . If we now look at the arrangement of these hyperplanes, then each cell represents a fixed set of decisions for the branch outcomes. By standard arguments, the complexity of this arrangement is .

Now how does this arrangement evolve at the next time step ? Clearly, any single piece of the equivalence class C can be split by the arrangement into smaller pieces, based on decisions that are made at the next time step. But from above, there are only such splits possible. Therefore, , yielding the desired bound.

A corollary is that if we track any particular t-equivalence class, it is split by at most pt linear forms (one for each time step and each processor), and so it is sufficient to check the signs of these forms to determine whether any particular z belongs to C.

Finally, taking a union over all possible equivalence classes, we get the following result:
If a set of inputs parametrized by d parameters is accepted in the linear PRAM model with p processors in t time, then there is a way of partitioning using hyperplanes, so that each cell of the resulting arrangement can be labelled 0 or 1 so that an input z is accepted iff it lies in a cell labelled with a 1.
In the next part of this argument, we will show that a problem with high parametric complexity cannot be represented this way: that is, there must be some cell that contains inputs in the language, and inputs not in the language.

Tuesday, May 20, 2008

P vs NC III: Parametric complexity

(note: this is part III in the series on Mulmuley's separation of P and NC without bit ops. For all posts in this series, click on the tag 'p-vs-'nc')

Parametric complexity is the notion of 'intrinsic hardness' that will be used to show the main lower bound. In this post, we'll go over the notion of parametric complexity, and view s-t mincuts (and dually, max flows) in this light.

Consider the max-flow problem on n vertices. It is parametrized by the capacity of each edge. Suppose now that we replace each such parameter by a linear function of a parameter , taking care that for some fixed range [p,q] all such capacities are non-negative. We get a parametrized instance in which the optimal flow f is a function of the parameter . Moreover, since the capacities are linear functions of , f itself is piecewise linear.

Let be the number of breakpoints of f i.e the number of points at which the slope of f changes. To picture this, imagine that some edge is critical for f, in that increasing the capacity on this edge increases f. Clearly, as the parameter increases, the capacity increases linearly, and so does f. At some point though, the capacity becomes large enough that this edge no longer plays a key role in capping f. At this point, some other edge (with capacity changing at a different rate) starts playing this key role, and f starts changing linearly again, but at a different rate.

Now suppose we compute for all possible parametrizations. The maximum value is called $\phi$, and is called the parametric complexity of f. Technically, is also a function of both n, the number of vertices in the graph, and , the bitlength of the input. The main result of this paper is the following two-step dance:
  1. Max-flow has high (exponential) parametric complexity
  2. A problem with high parametric complexity has high parallel complexity
In the rest of this post, we'll look at the first step.

Before we go on, another simple observation. Parametric complexity starts looking a lot less mysterious if you think geometrically. Let's look at min-cut, since it's a bit easier to see what's going on. Any s-t cut is some set of edges, and its capacity is the sum of the corresponding edge capacities. Therefore, the capacity of a cut is a linear function of . Graphically, we can think of this capacity as a line segment starting at and ending at .

We can do this for all cuts in the graph (and there are exponentially many of them). What we get is a familiar beast: the arrangement of the collection of line segments. Now what's the min-cut. For any value of , it's the cut with the smallest value. In other words, the function describing the min-cut as a value of is the lower envelope of this arrangement of line segments, and the function is merely the complexity of this lower envelope (the number of pieces in it).

Now the lower envelope of a collection of lines (note that although these are line segments, they extend across the domain of interest and therefore function as lines for combinatorial purposes) in 2D is linear in the number of lines. Since there are exponentially many cuts, we can expect that the complexity of the lower envelope is exponential. The hard part is showing that all the dependencies (a single edge can participate in many cuts) don't prevent us from achieving this.

Some technical points:
  • We assume that non-numeric parts of an input remain fixed for all values of the parameter (i.e an edge doesn't disappear or reappear)
  • All linear functions come with rational coefficients.
  • The bitsize of the problem is defined as the maximum bitsize of any of the vertex descriptions and the coefficients of the linear functions. Note that although
can vary continuously, interesting events happen at breakpoints that can be described rationally.
High parametric complexity bounds have already been proved for many problems of interest. Combinatorial linear programming with linear bitsizes has a parametric complexity of . Min-cost flow has similar complexity, and the same bound can be proved for max-flow with quadratic bitsizes. Mulmuley in fact reproves this last result with what he says is a simpler argument, and also shows that in contrast, the global min-cut problem has polynomial parametric complexity (which is a good thing, since there's a RNC algorithm for it !).

Let's take a closer look at the parametrized complexity of max-flow (actually its dual, min cut). The goal is to come up with a construction that exhibits a parametric complexity exponential in the input. So we need a graph, and a way to parametrize edge capacities.

The graph itself is quite simple. It has a source s, a sink t, and vertices from 1 through n (call this set A) and 1' through n' (call this set B). s is connected to all the interior vertices, as is t. Each of A and B forms a clique. Moreover, vertices i and j', and j and i', are connected (i and j are different). One possible cut of the graph consists of all vertices in A (along with s). Another consists of all vertices in B (along with s). It will turn out that by choosing weights appropriately, both of these will represent min cuts, as will any set formed by exchanging some vertices from A with their prime counterparts in B. Since there are ways of doing this, we'll get the desired bound.

This is where things start getting a little tricky, and I must confess that I don't have a clean way of explaining what happens next (which is a short way of saying that you might be better off reading the relevant construction !). In brief, the idea works like this: an edge from s to i (or i') is given a capacity , where is a constant, and the capacities are set up to be exponentially decaying: . An edge from i to t is given a capacity , and an edge from i' to t is given capacity . The parameter is set to run from -T to T. Note that at these two extremes, either all edges from A to t have capacity zero, or all edges from B to t have capacity zero (making them good candidates for cut edges).

There are also cross edges to consider. An edge from i (or i') to the opposite side has the "default" capacity . However, edges in the cliques (inside A and B) are weighted differently: the edge from i to j in A has capacity , and this is the same for an edge from i' to j'.

What's the capacity of the cut that contains s and A on one side ? We add all edges from s to B (), and all edges from A to B () and A to t (). Clearly, this is minimized for , and it turns out that this is in fact a min cut. A similar argument works for a cut containing s and B, and .

But what happens as the parameter increases ? Note that it makes no sense to have i and i' on the same side of a cut, because there's no edge between them and so it's always better to keep them separate. This means that the min cut is always "complementary" with respect to A and B, which is useful. It also means that the cut will change by having i and i' swap positions as the parameter increases. The question that remains is how many times this can happen.

It turns out that via a reasonably clever (but with a boring inductive proof) argument, it can be shown that the region spanned by the parameter can be divided into regions by a binary division, and when each region is encoded using a standard bitstring, the string encodes the min cut set (1s denote elements from A, and 0s denote elements from B, and so on). The proof works by arguing that for any region defined by the binary division, the min cut remains invariant, and can be read off from the binary expansion of the region. This is proved by induction on the length of the bitstring.

Finally, since each min cut defines a linear function of the parameter with a different slope, we obtain the desired result: namely that the number of distinct slope changes is exponential.

Monday, May 19, 2008

CCF restructuring

(note: this post is probably only of interest to readers in the US)

James Lee posts a note from the Visioning workshop held ahead of STOC 2008. There's some interesting news about the restructuring of CCF (the "theory" section within the NSF's Computer Science directorate). The previous CCF structure (run by Michael Foster) consisted of:
  • Theoretical Foundations (TF), which included all of theoryCS, as well as geometric modelling, some visualization, information theory, and signal processing.
  • Foundations of Computing processes and artifacts, which among other things included graphics and visualization, as well as some software, architecture, and programming languages
  • Emerging models, which had quantum, nano, and DNA computing.
The new CCF (which will be run by Sampath Kannan), looks like this:
  • Algorithmic Foundations
  • Communication and Information Foundations
  • Hardware and Software Foundations
The first one contains pretty much all of theoryCS as before, but from the sound of it will not have the other areas like information theory and signal processing (which I imagine will be in the second cluster). The algorithmic side of quantum computing (as opposed to the engineering side) will also be folded in.

One interesting comment was that geometry will be folded back into the main cluster. Currently, CG, geometric modelling, and symbolic methods have a separate pool that awards are made through, under the larger TF cluster. It sounds like this will change, and CG will "compete" in the main theory pool.

This will probably make life interesting in the short run. I don't think I'm overstating the case to say here that there's a not trivial amount of hostility between CGers on one hand, and the "regular" STOC/FOCS crowd on the other. If I had a dollar for each time someone vowed never to attend STOC/FOCS because of perceived slights to geometry, I'd never have to write grant proposals. On the other hand, I have heard (from various sources) of people who say things like 'Geometry is not real theory'.

It should be interesting to see how this all shakes out during panel review. I suspect/hope that things will happen reasonably, in that proposals will be binned and panelists assigned appropriately.

Saturday, May 17, 2008

The history of Alice and Bob

The Alice and Bob approach to cryptography is brilliant on so many levels:
  • It personalizes the discussion, rather than dealing with impersonal As and Bs
  • It creates a gender-neutral discussion framework
  • It's much more effective at capturing the imagination of the non-expert public (see point 1).
I was wondering about the origin of this coinage. It's attributed to the original RSA paper, and here's what Ron Rivest had to say about it:
RSA co-founder Rivest, who is a Massachusetts Institute of Technology (MIT) professor, says he came up with Alice and Bob to be able to use "A" and "B" for notation, and that by having one male and one female, the pronouns "he" and "she" could be used in descriptions. Rivest says it is possible that Alice came to mind because he is something of an Alice in Wonderland buff.
If you've wondered about the deeper personal lives of Alice and Bob, go no further than this speech. And for more on the extended Alice and Bob family, check out the wikipedia entry. (xkcd has more on the sad tale of the maligned other woman Eve).

Friday, May 16, 2008

The rectangle method in communication complexity

It's tempting to take way an impression that the P vs NC proof is a one-off kind of thing, with no real lessons for other kinds of lower bounds that one might want to prove. Only time will tell whether this is the case or not.

But I was thinking about communication complexity lower bounds recently. Consider the standard 'rectangle' method for proving a communication complexity lower bound. As usual, we have some (Boolean) function f(x,y) whose answer we wish to compute. Alice has x and Bob has y, both n-bit strings, and we want to exchange as few bits as possible in order to compute the answer (since f is Boolean, it doesn't really matter who ends up with the final answer: they can always pass a bit back to the other party).

A natural description of the communication is in the form of a matrix, where each row is labelled with one of Alice's inputs, and each column is labelled with one of Bob's possible inputs. Now let's say Alice sends the first bit. This bit depends only on x, and so we can partition the rows into two blocks, one containing all inputs for which Alice sends a 1, and the other containing all rows for which she sends a 0. As the communication proceeds, we can further partition the space into equivalence classes of sets of inputs that induce the same communication pattern.

So far, we're doing what most lower bound arguments tend to do: finding equivalence classes of inputs prior to doing some kind of counting (think of the sorting lower bound). But here comes a neat twist.
Each equivalence class thus defined forms a rectangle in the matrix.
The proof of this is not too hard, and is a consequence of the fact that Alice and Bob can only see their side of the pair. Let's say one of these equivalence classes contains the pair of inputs and , which means that for both these input pairs, the conversation transcript is identical.

But now what happens if the input was instead ? Alice sees , and sends the same first bit out that she would have sent if she had seen (because of the equivalence). Bob sees this bit, and looks at , and has no way of telling that Alice's input was NOT , and so sends out the next bit in the transcript. Repeating this argument for the remainder of the conversation, it's easy to see that must also generate the same transcript, as must . Geometrically, this proves that every equivalence class must be a rectangle, and therefore, the matrix consists of rectangular patterns of 1s.

In other words, what we've done is come up with a geometric way of describing equivalent classes of inputs. This means directly that things like EQ(x,y) (is 1 if x=y) need n bits of communication, because the corresponding matrix has 1s along the diagonal only, and there's no way of covering these with fewer than monochromatic rectangles. Again, we're using the geometry induced by the computation to make arguments about the lower bound construction. This geometric structure is used algebraically for the more involved rank-based lower bound argument as well.

I'm not claiming that these are identical arguments to what we're seeing in P vs NC. It's that the idea of exploiting the geometry induced by a set of computations is not as unfamiliar as it might seem.

Sunday, May 11, 2008

P vs NC II: The main theorem, and a (very high level) proof skeleton

After some initial throat-clearing, we're now in a position to state the main result of this paper. To save myself some typing, I'll refer to the model of computation (PRAM without bit operations) as PRAM-wb from now on.

Theorem.
  • Min-cost flow on k nodes cannot be solved in the PRAM-wb model deterministically (or randomized) in (expected) time using parallel processors, even if we assume that the costs and capacities are integers with bit length at most for some large enough positive constants a, b.
  • A similar result holds for max-flow, with the limits on capacities being replaced by
  • All lower bounds hold even if we merely desire an additive approximation.
Corollary.
A min-cost-flow or max-flow problem of total input bitlength N cannot be solved in the PRAM-wb model deterministically (or with randomization) in time (expected) using processors, for some constant c.
Discussion.
Before we dive in, it's useful (or at least it was for me), to take apart the statement of the theorem itself. Firstly, we note that the bounds hold even with randomization, which of course means that the kinds of obstructions that Mulmuley constructs are "frequent". The actual proof goes via the standard Yao-minimax principle, and we'll get to it once we've completed the deterministic lower bound.

Another interesting point is that the lower bound holds for additive approximations as well. I haven't read far enough ahead for this to be obvious, but I suspect that it has something to do with the fact that we'll always be dealing with integers, and so intuitively an approximation that can get us "between" integers will collapse to the right answer.

Finally, a note on the bitlengths. One might argue that if the bitlengths were "constant", the problem would be solvable. This is in fact the case, as Mulmuley discusses: it is actually important that the bitlengths are "long enough". If the bitlengths are short (say O(log n)), then we could read off all the bits efficiently using the procedure described in a previous post, at which point we have access to the full power of PRAM. At this point, we can solve max flow via an RNC algorithm for bipartite matching. So to get the strong bound on the number of processors, we need the bitlengths to be long enough.

But we don't want them to be too long. This constraint on the bitlengths feeds directly into the corollary, since we can relate N and k using the upper bound on the bitlength. However, the larger the bitlengths get, the weaker the bound on the running time expressed in terms of N. So it's actually useful that the problem is hard even for inputs with "smaller" bitlengths.

An overview of the proof strategy.
As I mentioned earlier, the technique used to prove a lower bound for bit extraction is a useful template to follow. Let's consider the basic argument.
  1. Examining the operations permitted by the model, come up with a bound on the number of distinct paths any bounded computation can take.
  2. Come up with a geometric description of the way the space of inputs is carved out by these paths.
  3. Show that if we mark inputs as either being in or out of the language (the decision problem say), that the "intrinsic" complexity of this space prevents the geometric description constructed in (2) from being able to carve out the IN points from the OUT points: loosely, that the model does not have enough geometric expressivity to separate good from bad.
[Aside: stated this way, there's a very VC-dimension feel to the argument. For example, let's say your "model" consists of "things you can do with one hyperplane", and your language consists of "two diagonally opposed points on the unit square", then the famous perceptron result is basically that the "model" can't capture the "language"].

Stated this way, it might not seem to surprising that algebraic geometry starts playing a role in the argument. In order to perform Step (2), we need to talk about invariants under computations that can be expressed as a series of algebraic operations (this is one reason why the omission of bit operations makes things more tractable), and algebraic geometry, which deals (at the most basic level) with the geometry of solutions to algebraic equations, is the right toolkit to use.

It would be remiss of me if I didn't point out that at least some of these ideas are not new (once you're looking at them from high enough). Dobkin and Lipton considered linear decision trees, Steele and Yao generalized these lower bounds to algebraic decision trees, and Ben-Or generalized further to algebraic computation trees (see Jeff Erickson's notes on algebraic lower bounds, and also the postscript on Joel Friedman's more recent work). In all these cases, the rough lower bound argument worked by showing that
  1. The computations expressible in the model could be captured geometrically.
  2. A bound on the computation could be expressed in terms of an intrinsic complexity of the target function. In all the above, the intrinsic complexity was topological: number of connected components, or even a sum of Betti numbers.
  3. The target function had a "high" intrinsic complexity (large number of components etc).
So what's the notion of "intrinsic complexity" in our setting ? This goes to step (3) in the high level sketch, and leads to the notion of parametric complexity. The idea is to consider a specific class of candidate inputs that can be described by parameters: for example, specifying that each edge in a graph has a capacity that's a linear function of some parameter . This is a "magic step", in the sense of Gowers: it seems pulled out of a hat because I don't understand why the parametric lens reveals the lower bound to us (of course, it might be clearer to others: if so, do speak up :)).

One can define a notion of parametric complexity (basically the number of breakpoints in the optimal value as the parameters change), and show that it is high for the problems under consideration. This takes care of one part of step (3): showing that the intrinsic complexity measure is high. The next step is to show that if this is true, there exists a way of parametrizing the inputs so that the IN and OUT inputs are distributed badly (intuitively, we'd want a strategy that doesn't allow large chunks of INs and OUTs to congregate in input space) . Finally, the proof completes by showing that the "low degree" regions that PRAM-wb can carve out cannot separate these badly distributed points.

This is all very vague, and indeed the "details", if I dare call them so, are tremendously complex. It's a good idea to keep this very high level line of attack in one's head as we go forward: in subsequent posts I'll dive down into the proofs and start getting very detailed, and it will be easy to lose the forest for the trees (pun unintended) without a roadmap.

p.s Recent work by Joel Friedman on algebraic geometry-baed methods for proving circuit lower bounds is more directly in the program of research initiated by the work on decision trees. Understanding his work is extremely well beyond my ability, and requires a deep understanding of modern algebraic geometry. I'll leave it to others to try and explain his work for the masses, merely noting that this at least shows ONE other approach based on algebraic geometry for attacking lower bound questions.

Wednesday, May 07, 2008

A note on bit operations.

In the last post,we saw a lower bound for bit probing (specifically, extracting the lower-most bit of a number) in the PRAM-without-bitops model. Now clearly this can be computed efficiently in P, so why isn't this enough of a proof for the separation of the two classes ?

The answer is subtle. In a PRAM model, excluding bit operations is a nontrivial restriction, because, as we saw, there's no parallel-efficient way of extracting bits from a word without bit operations. However, in P, this is not true. In time linear in the bitlength, we can extract every single bit, and then carry on. Thus, P-with-no-bitops is as powerful as P itself !

Mulmuley argues that the "correct" notion of excluding bit operations in P is that of an algorithm that might run in time that depends on the bit length, but without bitops, in the way that algorithms like the ellipsoid method work. But an easier approach is to exclude algorithms whose run time depends on the bit length, yielding the class of strongly polynomial algorithms. Note that this class (which he calls SP) still contains the P-complete problems max-flow and min-cost flow.

Once we do that, extracting bits is no longer an "easy" operation in SP. The "input size" is 1, and so any bit-extracting algorithm is only permitted a constant number of steps when extracting a bit. Therefore, the apparent separation doesn't really hold.

Formally, what is proved in this paper is that SP strictly contains C (the class of PRAM-without-bitops algorithms), via the separation result for max flow.

More math-blogging

Gil Kalai has a blog (at least for the next year) (HT 0xDE)

Monday, May 05, 2008

P vs NC Part I: Preliminaries

We start our exploration of Mulmuley's result on P vs NC with some definitions, and an "easy" example of a lower bound.

(For the definition of P and NC, visit the Complexity Zoo)

1. P-completeness

NC is our proxy for "can be parallelized efficiently" just like P is our proxy for "efficient algorithm". P contains NC, and as with P vs NP, the interesting question is whether the containment is strict: that is, are there problems that are inherently sequential ?

A theorem that separates P and NC will show that there's some problem in P that cannot be parallelized. Obviously, we want to pick a hard problem in P. P-completeness captures the idea of 'hard-core' problems for P, and the equivalent of SAT for P-completeness is the circuit value problem: given a circuit, evaluate it. We'll actually be looking at a different P-complete problem here, but one that's equally familiar: min-cost flow on a network with n nodes.

Remember, this result DOES NOT show that P strictly contains NC. What it shows is that P strictly contains a fairly general (but still restricted) subclass of NC.

2. The PRAM model

A convenient computational model to work with when talking about NC algorithms is the P(arallel) RAM model. PRAMs were all the rage in the 80s when supercomputers were being built, fell out of favor in the 90s when parallel computing lost its sheen, and are possibly making a comeback today in the world of multicore systems, GPU clusters, and MapReduce.

As the name suggests, a PRAM consists of many processors, each able to run a general purpose program with branchings, indirections, and arithmetic operations +, - X. A problem is in NC if it can be implemented to run in polylogarithmic parallel steps on polynomially many machines (where in one parallel step, each machine might perform one step of a computation).

A bit operation on a PRAM is any operation that treats data as bit-strings, such as AND, OR, and other boolean operations on bitstrings, or even 'extract-bit': get a particular bit of a string. All such operations will be excluded in the models considered here.

It will be important to pay attention to input size. As always, the input size can either be described by n, the cardinality of the input (the quantity counting the number of "things"), or by N, the bitsize (the quantity measuring the number of bits needed to write everything down). When we talk about polylogarithmic time and polynomial number of processors, unless specified otherwise, we will be referring to polynomials of n AND N.

In the non-uniform setting, we'll assume that we're presented with a PRAM that for fixed n, N, runs in time t(n, N) with p(n, N) processors. Each input consists either of an integer, or a rational with integer numerator and denominator presented separately (so that we can do division more easily). Division of two rationals is encoded via multiplication, so that p/q is computed as p * 1/q, and 1/q is computed by reversing the numerator and denominator of q.

The cost model is unit-cost, which means that any operation takes constant time, regardless of the relative sizes of the inputs. It is very important to exclude the floor operation when we do this: if not, judicious use of the floor function along with unit cost operations would allow us to solve PSPACE-Complete problems in polynomial time ! Since we are only concerned about lower bounds, assuming unit-cost for operations can only make the lower bound weaker than it actually is, so it's a reasonable thing to do.

It'll actually be useful to consider four variants of the basic PRAM-without-bit operations model.
  1. Arithmetic PRAM: In this model, the running time depends on the cardinality of the input alone, rather than the total bitlength.
  2. Linear PRAM: Here, at least one of the operands in any multiplication operation must be a constant.
  3. Arithmetic Linear PRAM: Combine the two above
  4. PRAM: None of the above restrictions
We will also elide the difference between the various memory access protocols (EREW, CRCW etc) since they only affect the overall running time by a polylogarithmic factor through well-known simulation results.

3. Are these good models ?

The dilemma of lower bounds is the problem of finding models that are rich enough to describe actual algorithms, while being weak enough that we have any hope of proving lower bounds in the first place. Is the PRAM-without-bit-operations a reasonable model of parallel computation ?
Mulmuley argues that it is, by pointing out that this model captures "virtually all" known parallel algorithms for optimization and algebraic problems. Even the weaker models proposed have exemplars: for example, one problem that admits an arithmetic PRAM is solving linear systems over rationals.I am not aware of any algorithm that appears to need bit operations in any nontrivial way, and would welcome pointers to possible candidates.

4. The problem of bit extraction.

Since we've excluded bit operations from consideration, it's natural to wonder how expensive it really is to extract a bit (or in other words, simulate a step from a general PRAM). It turns that that a very simple version of the general proof argument can be used to show a lower bound here, and although it's not terribly relevant, it's useful as a warmup exercise.

Proposition: The lowest order bit of an n-bit operand cannot be extracted in time using processors in the PRAM-without-bitops model, for some large enough constant a.

Why on earth should this be true ? After all, to test for odd/even, all we have to do is divide the number by 2 and take the remainder ! But how do we do that ? the only operations permitted are on integers or rationals, and the operation "5/2" will merely return the rational (5,2). If we had a floor function, we could compute , but we don't have it.

On the other hand, is this tight ? Clearly, if the bitlength is log-bounded, you can guess it by doubling, and then extract bits starting with the most significant one. This can be done in O(l) time sequentially, if the bitlength is l. Generalizing, you can split an l-length string into pieces of size , and work on each piece, starting with the most significant one, for a constant number of parallel time units. On such a piece, you have processors guessing all possible values, and then check which one is correct. Subtracting, you can then proceed to the next piece. This is one way of understanding where the bounds in the proposition come from.

So if we're convinced that the 'form' of the proposition makes sense, let's see how we might go about proving it.

Proof outline:
Assume that there actually is a machine that can return the lowermost bit in the desired bounds , . Suppose we simulate the action of this machine on two inputs x, x' upto time t. If the machine, upto time t, behaves the same on both inputs, then we say that they're in the same t-equivalence class.

Can we bound the total number of such classes ? Note that if we can do that, then we might have a hope of using the pigeonhole principle to argue that at least some class might have too many elements in it: since all elements in an equivalence class will yield the same outcome, we might be able to use this to create a contradiction.

Consider the branch operations executed by the machines at the (t+1)^th step for two t-equivalent inputs x, x'. Since the computation is identical for the two inputs upto the previous step, any branching operations consist of evaluating the sign of some polynomial in x (because all the stored memory locations are polynomials in x). Each of these polynomials has degree at most , and therefore has at most real roots. Collecting all these roots together for all machines, we get a subdivision of the real line into intervals, and (this is the crucial part) in each of these intervals, the answers to sign queries for any of the polynomials does not change.

[Aside: this trick, of constructing an arrangement over a geometric space such that each cell contains a fixed answer to a problem, is fairly common in computational geometry].

What this means is that each t-equivalence class can blow up into only a certain number of (t+1)-equivalence classes (roughly ), and by induction, this means that over a total of t(n) time steps, there can be at most something like equivalence classes.

Now equivalence classes partition the space of possible inputs, but they don't really tell us how: any one partition might have inputs from all over the real line, and knowing the number of equivalence classes in and of itself doesn't tell us why estimating the lowest bit might be difficult. In order to make that leap, we have to further fracture these equivalence classes into actual segments on the real line, such that in each segment, all inputs are t(n)-equivalent.

This is not too hard to do, since we know that all polynomials evaluated in the process have degree at most . Since there are at most t(n)*p(n) polynomials to consider in any computation for any particular equivalence class, the total number of intervals induced by roots of these polynomials is roughly , which is still . Two inputs from any one of these intervals are treated in exactly the same way by the computation.

Now comes a neat application of the pigeonhole principle. Choose a large enough so that . If we label each of the possible integers with their correct output (the lowermost bit), we get an alternating sequence of zeros and ones, and clearly a correct algorithm must create at least intervals to ensure that no interval contains numbers labelled both one and zero. But if we choose a large enough, we can ensure by the pigeonhole principle that there will be at least one interval that is "too long" i.e must contain inputs labelled both one and zero. Since by construction, the algorithm must return the same output on ALL inputs in this interval, it will be wrong on at least one input, which yields a contradiction.

We will use many elements of this basic proof structure as we go along.