The currently favored approach to this is to get as many Opteron cores as you can pay for, rack em up with appropriate storage, and run the whole thing as either a Linux or Solaris grid.
The plus side of this is simply that you can get performance well into the 100TF (linpack TeraFlops) range, the downside is that operating costs will kill you in the long run while some intrinsic bottlenecks limit the top end you can reach.
Specifically, you can heat grow-ops big enough to cover your capital and staffing costs with the thing, but cabling and storage limits will stop performance growth long before you get into the exaflop range.
Worse, the software combines with those hardware bottlenecks to give you diminishing returns to scale - meaning that a 10,000 core machine will need considerably more than half the time needed by a 5,000 core unit to complete a large run.
What that means is that if you generally plan on having two programs running concurrently, you'll get more performance per dollar if you're willing to forget about bragging rights and build two smaller systems instead of one big one.
There are alternatives to x86 grid computing: specifically, IBM has long offered PPC based grids and has now moved, in Cell, to putting a grid directly into the silicon. Right now, for example, you could make yourself a supercomputer that would have placed number one on the supercomputer list for the year 2000 by combining 12 IBM QS-20 dual cell blades with two Sun x4500 storage servers, a big UPS, and a 16 way Infiniband class fabric switch in one IBM blade server chassis.
The upside here is cost: about $400,000 for a theoretical 5TF machine with 44TB of mirrored storage -buy your X4500s with only two 500GB disks, than fill the remaining 22 slots in each machine with third party 1TB disks.
The downside is a combination of software, hardware limitations, and communications bottlenecks. In particular software limitations mean that performance on typical tasks scales less than linearly with the number of processors, hardware limits in the present generation cell reduce double precision throughput, and overall storage and I/O limits mean that you just can't shove enough data and instructions through the available pipes to push this kind of system much beyond 10TF per concurrent run.
So if that's not enough, what do you do?
My advice: rent - until IBM gets the second generation cell out the door. No more double precision limit, faster I/O and, above all, multiple cells per chip - I don't know how many they'll allow, but "slices" containing 9 cells running at 5+ Ghz should be possible, and the ultra dense memory needed to support that is on the way.
Ok, the 10TF grid CPU is probably in the 2010 time frame, but it's coming; so if you really want your own supercomputer before then - consider a couple of smaller units or renting space on someone else's machine, because three years won't get you to cost recovery on a x86 super grid, and by 2011 nothing else is even going to be in the same ball park with cell..
Unless, of course, someone gets in RAM array processing (write an op and two vectors to RAM, read result) or a compilable hardware language to work...