Paranoia, money, and research

Summary: If you demonstrably have no access to the other guy's system and it produces the same results, the burden of proof will shift to the accuser - meaning that, practically speaking, you're off the hook, and so is your department chairman.

/usr/sfw/bin/gtar: /dev/rmt/0: Cannot read: I/O error
/usr/sfw/bin/gtar: /dev/rmt/0: Cannot read: I/O error
/usr/sfw/bin/gtar: Too many errors, quitting
/usr/sfw/bin/gtar: Error is not recoverable: exiting now

That's frightening, but simple-minded compared to a very different kind of recovery problem that's becoming increasingly important in academic and other research-oriented service areas. What's going on is that researchers are coming under increasing financial and career pressure to produce either commercial success or grants at just the same time that changes in science are making simulation both riskier and more important - and some of them are finding ways to make their problems ours.

Consider these two lines:

602 192304:45./// 2512213:55/ 204096414:45/ 2512215:28/ 784608414:07/ 45122304:44/ 11215:29/ 145122305:10/ 756322812:09/ 281922304:44/ 25122614:15/ 25122304:45/ 11215:29/ 195122306:04/ 5220482812:09/ 94258752514:37/ 210242812:07/ 45122304:45/ 7988514:31/ 25122306:08/ 35122810:37/ 411024214:03/ 4510242811:00/ 6512215:29/ 466 25122305:101394/ 1302304:47..///0: 1272304:47..///0: 172304:54/0 1102304:54/0 1112305:55./// 1252304:54..///0: 1282304:51..///0: 35122304:54/ 1312304:47..///0: 1302304:47..///0: 35122304:54/ 1332304:54..///0: 1392304:54..///0: 25122304:54/ 1482304:54..//1,0/7/0,0/0,30: 1482304:540..//1,0/7/0,0/0,30: 25122304:54/ 25122307:23/ 1292304:54..///0: 1542304:540..//1,0/7/0,0/0,378:0 1292304:47..///0: 135215:29//1,0/,6413:640 182304:540/640 25122304:54/ 1312304:54..///0: 2528514:37/ 142306:0700 1482306:070..//1,0/7/0,0/0,30: 1482306:070..//1,0/7/0,0/0,30: 192306:0700 1302304:54..///0:

Now imagine first that the file this came from has slightly over 328 million lines all of which look pretty much like this - and secondly that the guy who owns the data looks at you, the departmental sysadmin, while telling your boss that someone altered the file to sabotage his research.

If this file were real it would represent the outcome of about 3,174,336,000 AMD 2.6GHz CPU seconds and nearly seven months spent waiting for a two rack grid with 176 total cores to produce a result. Researchers, of course, don't care about CPU seconds or hardware limitations - but they do care about waiting time and care even more if that waiting time either lets a competitor publish first or turns out to have been wasted because of an error in the setup or running code.
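
For anyone who wants to check that arithmetic, here is a minimal sketch. It assumes all 176 cores were busy for the whole run, which the figures imply but the story never states, and the function name is made up for illustration.

```python
SECONDS_PER_DAY = 86_400

def wall_clock_days(cpu_seconds, cores):
    """Convert total CPU seconds into elapsed days, assuming every core is busy throughout."""
    return cpu_seconds / cores / SECONDS_PER_DAY

# Figures quoted in the article: 3,174,336,000 CPU seconds on 176 cores.
days = wall_clock_days(3_174_336_000, 176)
print(f"{days:.2f} days of waiting")
```

At roughly 30 days to the month that works out to a bit under seven months, which matches the waiting time quoted above.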

Combine frustration with outside financial pressure and what you get is a recipe for the expression of paranoia - hence our entirely hypothetical researcher's assertion that the file had been sabotaged and consequent threats of both legal and direct action against the guy responsible for running the hardware.

Since I'm making this up as I go along I can imagine that this particular situation was resolved when the system logs showed that some unsung hero in the University's central IT department had taken it upon himself to update every Solaris box on campus through an automated shutdown, patch, and reboot process running every thirty days - thereby surfacing both an IT control hidden in purchasing and a synchronization error in the panic shutdown, checkpointing, and restart code managing the application.

In the more common case, however, the only thing between you and getting fired in disgrace will be your ability to document the stringent application of control procedures designed to safeguard both data and processes - procedures that, if actually applied, would just about cripple the research productivity you're supposed to facilitate.

So what can you do?

There are some things you have to do - for example, using cryptography and Solaris containers or similar means to ensure user privacy, and denying all outsiders, particularly central IT people if you're burdened with those, access to the machine during critical periods.

The more general solution, however, is duplication - convince management that any serious research computing is at risk and that the only known way to combine relationship-based researcher access with retroactively demonstrable processing integrity is to silently copy all work to another system run by somebody else.

If you demonstrably have no access to the other guy's system and it produces the same results, the burden of proof will shift to the accuser - meaning that, practically speaking, you're off the hook, and so is your department chairman.

To implement this, take advantage of two things: continuously falling hardware costs and continuously improving software. If, for example, you work in biotechnology and you get along with the sysadmin over in chemistry you've got the basis for a deal - because he's got the same problem.

Propose that new hardware be duplicated at both sites, use one set of containers to achieve verifiable run-time isolation, and use another pair to swap data and programs. When paranoia erupts and one of you is facing accusations of leaking, distorting, destroying, or otherwise interfering in some user's march to the Nobel - you've got backup: and it's not a tape, it's a person who can testify that the same data with the same programs on similar hardware produced the same results.
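
One way to make "produced the same results" something both sysadmins can write down is to have each site build a checksum manifest of its output directory and compare the two. What follows is only a sketch under stated assumptions: the function names and directory layout are hypothetical, and it assumes the runs are deterministic enough to match byte for byte, which parallel floating point code does not always manage.

```python
import hashlib
from pathlib import Path

def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so multi-gigabyte outputs never sit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(root):
    """Map each file's path relative to root onto its SHA-256 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }

def compare_manifests(local, remote):
    """Return the relative paths that are missing on one side or differ in content."""
    return sorted(
        name
        for name in local.keys() | remote.keys()
        if local.get(name) != remote.get(name)
    )
```

Each site would run build_manifest on its own copy and exchange only the resulting dictionaries, so neither sysadmin needs access to the other's machine; an empty list from compare_manifests is the statement the other guy can testify to.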

Just make sure only you and the department heads know - or the researchers will be signing agreements promising faithfully not to whine at you if only they can, just this once, please, have access to both halves of the facility, because, you know, it's an emergency, and you can trust them, right?

Topic: Hardware


Talkback

2 comments
  • Interesting bunch of problems

    How do you convince a resourceful, determined, intelligent paranoid and their lawyers that the mirror system is identical, down to every last line of every last config file and every last piece of hardware, e.g. CPU stepping? That every last hard drive (including the ones swapped out when they failed) has or had exactly the same firmware on it? Need an (independent, third party) team of experts to do a big audit? Before and after each run? And for there to be a lot of logging etc? Prove that no-one messed with anything deliberately or accidentally (even with good intentions) when the CCTV was down, months ago?

    Containers etc, fine in many ways, but once you're into virtualisation, how will you prove anything?

    Keeping it secret, throughout a large organisation, that each department's simulation system is duplicated. "You've explained what you do on those x racks. What do those other x/2 racks do then - they seem to be running flat out, they must be using about two-thirds of your power supply?" "Um," "I was just over at the other department, they have a similar set-up. Are you guys doing something for the government or something then?"

    Won't encryption slow things down?

    Rather than duplication would the money and effort be better directed at:

    1. Getting some disclaimers drafted? Ten pages of ambiguous stuff that seems to say we make best efforts and if it goes wrong the undersigned certifies that they were warned it probably would because we are not fully competent, and they swore they wouldn't complain anyway, not even if there was deliberate sabotage?

    2. As far as possible including self-checking in simulation software. And doing things in modules and steps, periodically backing up data at defined stages from where the run can be re-started.

    3. In particular, buying really fast hardware and people who can program it to best use the speed (taking us back to previous topics :)

    Further re duplication - what if you get different results? Does our would-be laureate choose the one which is "obviously" right and keep quiet and hope? Or are they just as angry, because it will all have to be done again? Another argument for 2. above
    Ross44
    • Yes - those are problems, but..

      1 - duplication doesn't prove anything by itself, but it does do the one thing that counts: put the onus on the accuser. Most will look at that burden and decide to rethink the hostility.

      2 - of course you have disclaimers, and they're worth the paper...

      3 - If you do get different results, somebody clearly has a problem.

      Oddly there are cases where getting the same results also indicated a failure. For example a failure in the random number generator for Windows should have been caught years ago by financial people getting unusually similar simulation results...
      murph_z