A programmer could, of course, choose the hard places route instead: taking software-mapped memory management and execution parallelism into the application - a decision very roughly comparable to taking on both load and page management for every host in an OpenGrid/OpenMP environment.
There's a reason this stuff is so difficult - and it's almost as hard to get your head around as the problem itself. That reason is simple: nobody really understands how computational parallelism works for non-trivial tasks and, in the absence of a good theoretical model, all our attempts to work the problem have been heuristic - an extended case of learning what seems to work by experience.
For an OpenGrid/OpenMP application this hasn't been much of a problem because most of the system's nominal capacity gets lost in communications delay and process management overhead - 50% efficiency on well-defined, highly repetitive tasks like dense matrix multiplies is considered pretty good. Cell, however, changes the focus, because the point of getting the grid hardware down to the level of a single chip is to cut out most (more than 99%) of that wasted communications time, power use, and process management overhead - meaning that, with the hardware working, it's now obvious that the problem lies in the software.
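To put a figure like that 50% in context, here's a minimal sketch - purely illustrative, nothing from IBM or the OpenMP spec beyond the standard timing calls - of how you'd measure it on a shared-memory box: run a dense matrix multiply serially, run it again with OpenMP threads, and report efficiency as speedup divided by thread count. The matrix size N and the plain triple-loop kernel are arbitrary choices for the example.

    /* Illustrative only: time a dense matrix multiply serially and with
       OpenMP, then report parallel efficiency = speedup / thread count.
       Build with, e.g.: gcc -O2 -fopenmp matmul_eff.c -o matmul_eff */
    #include <stdio.h>
    #include <omp.h>

    #define N 512                      /* matrix dimension - arbitrary */

    static double A[N][N], B[N][N], C[N][N];

    /* Plain triple-loop multiply; 'threads' controls how many OpenMP
       threads share the outer loop. */
    static void multiply(int threads)
    {
        #pragma omp parallel for num_threads(threads)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double sum = 0.0;
                for (int k = 0; k < N; k++)
                    sum += A[i][k] * B[k][j];
                C[i][j] = sum;
            }
    }

    int main(void)
    {
        /* Deterministic input data. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                A[i][j] = (double)(i + j);
                B[i][j] = (double)(i - j);
            }

        int threads = omp_get_max_threads();

        double t0 = omp_get_wtime();
        multiply(1);                             /* serial baseline */
        double serial = omp_get_wtime() - t0;

        t0 = omp_get_wtime();
        multiply(threads);                       /* parallel run */
        double parallel = omp_get_wtime() - t0;

        double speedup = serial / parallel;
        printf("threads=%d  speedup=%.2f  efficiency=%.0f%%\n",
               threads, speedup, 100.0 * speedup / threads);
        return 0;
    }

On a single machine the losses are mostly thread startup and memory contention; spread the same kernel across a grid of hosts and it's the communications and process management overheads discussed above that drag the number down toward that 50% mark.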
And the reason the software isn't there is that we fundamentally don't understand how concurrent integration across non-trivially parallel processes works.
Let me suggest an extreme example from human experience. Sometime in the early sixties Eugene Ormandy was able to rehearse an augmented Philadelphia Orchestra using the full score for Glière's Ilya Murometz - controlling the orchestra while comparing what he expected to what he heard across something over 840,000 separate sound elements in 116 parallel tracks over a 93-minute period - and remember every single mistake before hearing the tape.
We don't know how that works, and until we do, serial computing will continue to beat parallel for both ease of use and achieved hardware efficiency - and that's the key reason IBM and others are working toward terahertz serial-processing CPUs rather than betting everything on parallelism.