Wide finder redux

A CMT-style processor will normally take longer to complete a serial job than something like an x86 Xeon - but in the real world of corporate data centers purely serial jobs are vanishingly rare. What you see instead is multiples: the need to run a serial job many times with only minor changes, and for those CMT turns out to be much faster.

Here's a comment from frequent contributor Erik Engbrecht:

Until Tim Bray started his Widefinder discussions I was a big believer in Coolthreads. The principles behind the technology just make sense. But then I noticed how badly my Intel-based laptop beat the T2 when running serial code. I knew it would - but the magnitude was rather shocking.

So now I'm a skeptic. Don't get me wrong, I would probably buy one for most general enterprise computing workloads. I still think it's a cool technology, and in the future I'm sure those cores will get faster while Intel will just keep on growing more cores. In the end I think Sun has it right, and I think for the most part Sun has it right right now. But at this point the performance characteristics of T2 seem to be somewhat enigmatic.

Here's how Tim Bray introduces the Widefinder project on his own website:

In my Finding Things chapter of Beautiful Code, the first complete program is a little Ruby script that reads the ongoing Apache logfile and figures out which articles have been fetched the most. It's a classic example of the culture, born in Awk, perfected in Perl, of getting useful work done by combining regular expressions and hash tables. I want to figure out how to write an equivalent program that runs fast on modern CPUs with low clock rates but many cores; this is the Wide Finder project.


It's already obvious that Web-oriented frameworks in general, and Java EE in particular, run like a bat out of hell on processors like our Niagara boxes and especially the systems that will be built around the new T2.

But you know, there's lots of interesting computing that doesn't happen in Web frameworks. Like, for example, logfile processing. And since more and more computers are going to start looking more and more like our Niagara-family boxes, we're going to need to figure out how to get this kind of basic unglamorous essential job to run fast on them.

Whatever it is we use to write the many-core-friendly Intergalactic P2P Adaptive Semantic Web4.0 killer apps of the future, well, it has to be a general-purpose programming language, because they always win. Which means it has to have a good way to read a logfile and compute the most popular fetches. You might argue that this program is kind of heavy on I/O and text processing. That's correct; but so is computing in general.
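Bray's "little Ruby script" is essentially one regular expression feeding one hash table; a minimal sketch of that serial approach (the sample lines and the /ongoing/When/ path pattern here are illustrative, not his exact code) looks something like this:

```ruby
# Tally fetches per article with a hash, selecting lines with a regex.
# The embedded sample lines stand in for reading the real logfile.
counts = Hash.new(0)

sample_log = <<~LOG
  1.2.3.4 - - [16/Dec/2007:06:30:00 -0800] "GET /ongoing/When/2007/12/a HTTP/1.1" 200 512
  1.2.3.5 - - [16/Dec/2007:06:30:01 -0800] "GET /ongoing/When/2007/12/b HTTP/1.1" 200 512
  1.2.3.4 - - [16/Dec/2007:06:30:02 -0800] "GET /ongoing/When/2007/12/a HTTP/1.1" 200 512
LOG

sample_log.each_line do |line|
  counts[$1] += 1 if line =~ %r{"GET (/ongoing/When/\S+) }
end

# Print the top ten most-fetched pages (only two distinct pages here)
counts.sort_by { |_, n| -n }.first(10).each do |page, n|
  puts "#{n}: #{page}"
end
```

A real entry would read the logfile line by line rather than a string literal; the point is that the whole job is one regex, one hash, and one sort.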

With that in mind I'd like to propose a Christmas diversion: a narrow loser programming contest in which the goal is to write the ugliest, least efficient, most obfuscated code that will do the Widefinder job: list the top ten most downloaded pages from an apache log.

The rules are simple: it has to work on Linux or Solaris with only open source software.

The test is this: process an apache "combined" log file to correctly list the top ten, and only the top ten, with the correct counts.

To enter your masterpiece: put the code, the results (including run time, file size, and test hardware information), and a comment or two in the talkbacks here.

Assuming we get some entries, I'll create a reader poll to decide who "wins" - and I'll personally send the winner absolutely nothing.

As a serious aside, here's Tim's own defense of the value of his project - from his results page:

Worth Doing · There is a steady drumbeat of commentary along the lines of "WTF? This is a trivial I/O-bound hack you could do on the command line, ridiculously inappropriate for investigating issues of parallelism." To which I say "Horseshit". It's an extremely mainstream, boring, unglamorous file-processing batch job. There's not the slightest whiff of IT Architecture or Social Networking or Service-Orientation about it. Well, that's how lots of life is. And the Ruby version runs faster on my Mac laptop than on the mighty T5120. The T5120 is what most of the world's future computers are going to look like! Houston, we have a problem, and the Wide Finder is a useful lens to examine it.

I believe I've been beating the drums on the software revolution required to make use of non-linear technologies like CMT or Cell for longer than just about anybody else, and I certainly agree with what he says here.

On the other hand, you rarely encounter single jobs this simple in the real world. What's more likely to happen is that you'll accumulate requests for custom reports from the same data - in this case an apache log file. Before today's multi-core machines most people tried to save cycles by processing the input file only once, but with CMT/SMP that doesn't make sense any more.

Instead you go for the maintenance and clarity advantages of writing a separate script for each report - and then run them in parallel after copying the source file to /tmp:

#crontab 30 6 * * 1-5 2>> /usr/adm/custom_reports_log
#murph 12/16/07
cp /var/apache2/logs/access_log /tmp
/usr/murph/custom/tims_top10.pl &
/usr/murph/custom/jacks_refs.pl &
/usr/murph/custom/harrys_crosstab.pl &

Notice first that for this to work as fast as possible your T2 has to be properly set up - with lots of swap split across the two fastest internal disks and ZFS configured for all external storage - something I don't think Bray's test machine is. Secondly, and more importantly, the T2 will run many of these jobs in parallel, first gaining on and then passing Erik's Lintel machine as the workload increases.
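To see why, try some purely hypothetical numbers: say one report takes 60 seconds on the laptop but 180 seconds on the T2, and the T2 runs 32 reports concurrently while the laptop runs them one after another:

```ruby
# Hypothetical timings - the laptop is 3x faster per job,
# but the T2 (say, 32 hardware threads) runs all the jobs at once.
laptop_seconds_per_job = 60
t2_seconds_per_job     = 180
jobs                   = 32

laptop_total = jobs * laptop_seconds_per_job  # jobs queue up serially
t2_total     = t2_seconds_per_job             # all jobs run together

puts "laptop: #{laptop_total}s, T2: #{t2_total}s"
```

With one or two reports the laptop still wins; the crossover comes as soon as the number of simultaneous jobs exceeds the per-job speed ratio.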

For the less serious purposes of the narrow loser contest, however... I'm thinking PostScript - i.e.

% gs -f my_program.ps image_log | lp -dqms

prints a labelled graph.

I (ahem, cough) "may not have time" - but I'm hoping you do and looking forward to any entries I get.