Xerox scanners alter numbers in scanned documents

Summary: Certain Xerox scanners and photocopiers silently transform numbers in documents in some non-default configurations. [UPDATED]

TOPICS: Software

When set to the non-default 'Normal' image quality mode, some Xerox scanners and photocopiers may change characters, including numerals, in scanned documents.

UPDATE: Xerox has posted a blog entry on the scanning issue by Rick Dastin, corporate vice president and president, Office and Solutions Business Group. The company emphasizes that the issue affects few customers: only those who use the scanning functions after changing the quality/compression settings. Affected customers can change the setting back to the default. Alternatively, Xerox will, in the coming weeks, roll out a software update to the devices that will '…disable the highest compression mode thus completely eliminating the possibility for character substitution.'

The problem was first reported earlier today by German researcher D. Kriesel. Kriesel demonstrated that on Xerox WorkCentre models 7535 and 7556, scans of certain documents containing numbers produced images in which the numbers differed from those on the paper original. He suggests several possible implications for users:

  1. Incorrect invoices
  2. Construction plans with incorrect numbers (as will be shown later in the article) even though they look right
  3. Other incorrect construction plans, for example for bridges (danger of life may be the result!)
  4. Incorrect metering of medicine, even worse, I think.

Image caption: The Xerox image compression engine changes numbers at lower quality settings. Source: D. Kriesel (with permission)

Kriesel lists many other devices which he says are affected, but which he has not personally tested himself. He also provides many examples. The images in this story were provided by him.

In response to Kriesel, Xerox issued a statement indicating that the problem was due to settings changes in the device:

The problem stems from a combination of compression level and resolution setting. The devices mentioned are shipped from the factory with a compression level and resolution that produces scanned files which are optimized for viewing or printing while maintaining a reasonable file size. We do not normally see a character substitution issue with the factory default settings however, the defect may be seen at lower quality and resolution settings

The company attributes the problem to the JBIG2 compressor software in the device. In the 'Normal' setting image compression is tuned to produce smaller files at the expense of image quality. With the 'High' setting, the JBIG2 compressor is not used, and the device emphasizes image quality over file size.

Xerox confirmed for me that 'High' is the default setting in the device and Kriesel confirmed for me that the test device was configured to 'Normal.' Kriesel suspects that the reseller from whom the device was purchased made the change.

Around the time Xerox was issuing their statement, Kriesel posted a second blog entry discussing the quality setting. At this point he seemed unaware of the default settings issue. Kriesel tells me, and also says in the second blog entry, that Xerox tech support was unhelpful and obviously unaware of this issue. He says he first contacted them on July 25. 

But it turns out the problem is not news to Xerox's developers. As shown in an image in Kriesel's second post, there is a warning in the user interface for changing image quality:

The normal quality option produces small file sizes by using advanced compression techniques. Image quality is generally acceptable, however, text quality degradation and character substitution errors may occur with some originals.

Kriesel explains repeatedly that OCR was not used in any of these cases. The substitutions occurred in plain image scans. Clearly the Xerox software still attempts, even in such cases, to recognize objects in the image.
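
To see how a scan can swap characters with no OCR involved, consider a toy sketch in Python of dictionary-based symbol matching, the general approach JBIG2 can use for text regions. This is not Xerox's code or a real JBIG2 encoder; the 5x5 glyph bitmaps, the `max_diff` threshold and the function names are all made up for illustration. The idea is only that a patch which looks "close enough" to a stored bitmap gets replaced by that bitmap.

```python
# Toy illustration of lossy symbol matching (NOT Xerox's or JBIG2's real code).
# The glyph bitmaps and threshold values are invented for this example.

SIX = (
    "01110",
    "10000",
    "11110",
    "10001",
    "01110",
)
EIGHT = (
    "01110",
    "10001",
    "01110",
    "10001",
    "01110",
)


def hamming(a, b):
    """Count the pixels that differ between two equally sized bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def compress(patches, max_diff):
    """Replace each patch with an index into a shared dictionary.

    A patch reuses an existing entry when it differs from that entry by at
    most max_diff pixels; otherwise it is stored as a new entry.
    """
    dictionary, indices = [], []
    for patch in patches:
        for i, symbol in enumerate(dictionary):
            if hamming(patch, symbol) <= max_diff:
                indices.append(i)
                break
        else:
            # No stored symbol was close enough, so keep this patch itself.
            dictionary.append(patch)
            indices.append(len(dictionary) - 1)
    return dictionary, indices


def decompress(dictionary, indices):
    """Rebuild the page by pasting dictionary bitmaps back in order."""
    return [dictionary[i] for i in indices]


page = [SIX, EIGHT, SIX]

# Strict matching: only identical patches share an entry, so nothing changes.
strict = decompress(*compress(page, max_diff=0))

# Loose matching: the 8 is "close enough" to the stored 6 and is replaced by it.
loose = decompress(*compress(page, max_diff=2))

print(strict == page)    # True
print(loose[1] == SIX)   # True: the 8 now renders as a crisp, wrong 6
```

In this sketch the loose setting stores one bitmap instead of two, which is where the space saving comes from, and the output still looks clean because every glyph is a sharp copy of a stored symbol. That is why such errors are hard to spot by eye.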

In a final blog post, Kriesel summarizes his communications with Xerox and concludes that the worst part of the problem was support's lack of familiarity with the character substitution errors.

Even if the nature of the issue is settled, the question remains whether a device configured in a permitted way should be capable of such behavior. It turns out that Xerox does provide a fairly clear warning, but it's equally obvious that neither customers nor Xerox's own tech support know about the problem. As Kriesel himself says, it's impossible to know how many important errors have been made because of this bug.

  • Nothing new here.

    Wall Street & the banks have been using these for decades.
    • Scary Thought

      Hopefully any glitches like those shown here stay only on the paper and don't find their way back into any computer system. That could be mayhem
  • No OCR??

    How can you have character substitution without OCR? If you scan you have an image file, tiff, jpeg or whatever.
    I recently did a project of scanning several hundred paper statements and OCR'ing them into
    Excel. 6's into 8's was very common as well as date issues with the forward slash becoming a 1. The bottom line is anything put through an OCR MUST be proofed.
    • OCR quality varies tremendously among products

      I have also found that the quality of name-brand OCR varies tremendously. Awhile back I OCR'd a not-very-good-quality document with Acrobat Standard IX. It came out complete garbage. I ran OCR on it with Nuance PDF Converter Pro 7 and got better than 95% accuracy.

      Because Nuance also puts out Omnipage and claims that is the "cream of the crop", I downloaded that and OCR'd a high school yearbook, something I have done several times with both Acrobat and Converter Pro. Omnipage found "text" in virtually EVERYTHING--photos of classrooms, underclass homeroom photos, senior portraits, etc. In some cases OVER 400 OCR "suspects" on ONE page -- which had about 15 words of text and the rest was photos.
    • This is scanning

      It is obvious on the numbers used in the example, but if you read the original blog post, JBIG2 uses known "swatches" in a dictionary. If an area of the scan "matches" a swatch to a large degree, instead of compressing the area manually, it just uses the swatch out of the dictionary in order to save space.

      That means parts of images would also be subject to such substitution, but would probably be harder to recognise. With numbers, like those shown here, or characters, it is much more obvious to the human eye that the compression algorithm has misrepresented the area and used the wrong swatch.
  • Not Substitution

    This isn't character substitution, it's just image compression applied to text. I'm sure you'll find the same thing if you scanned small text on a number of scanners/copiers with lower resolution/file size settings.

    Anyone who looks at those numbers and assumes they're correct, is a fool. You can clearly see that they've been scanned in and the quality has been affected, making the numerals difficult to read and distinguish. Anyone who receives a document like that should straight away be demanding a better quality one if they need that text.
  • I've Heard Of Lossy Compression ...

    ... but what do you call this? Lying compression?
  • Crappy scans at low resolutions means you get a crappy copy.

    Looking at the sample scans: This is nothing new. Crappy scans at low resolutions means you get a crappy copy. Bump up the quality settings and scan at a higher resolution. Maybe fiddle with the brightness as well.
    • Re: Crappy scans at low resolutions means you get a crappy copy

      But these are not crappy-looking scans; they look perfectly readable, except the information is wrong.
      • They are.

        "But these are not crappy-looking scans"

        Baloney. Our el-cheapo scanner at home makes better scans. As does the camera on my iPhone. It appears as if the scans taken by that Xerox were passed through a blur filter, followed by a sharpen filter. Somewhat typical of lower quality scans.

        For the most part, it's 6's that loop almost all the way around being turned into what appears to be 8's.

        Apparently we do have an explanation for this: A poorly built image compression algorithm.

        . . . personally, I'd rather scan lossless and do the compression on my PC, so I have complete control over the compression. But maybe that's just me.