OCR and Scanning

Optical Character Recognition [OCR] and scanning are often confused. OCR includes scanning the fields to be read and then recognising the characters within them; scanning alone takes an image of a page or line without intelligently recognising the data captured. OCR was originally developed for automatically processing very high volumes of paper such as cheques or utility bill remittance stubs. The characters in the fields that could be recognised were not ordinary type fonts but specialised character sets designed specifically for OCR, known as OCR A and OCR B.
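The distinction can be illustrated with a modern sketch, offered purely as an aid to the reader and not as part of any of the historical systems described here. It assumes the Pillow imaging library, the pytesseract wrapper for the Tesseract OCR engine, and a hypothetical scanned file called form.png.

    # A minimal modern illustration of the scanning/OCR distinction.
    # Assumes Pillow and pytesseract are installed; "form.png" is hypothetical.
    from PIL import Image          # image handling (Pillow)
    import pytesseract             # wrapper around the Tesseract OCR engine

    # "Scanning" produces only an image: a grid of pixels with no meaning attached.
    page = Image.open("form.png")
    print(page.size, page.mode)    # just pixel dimensions and colour mode

    # OCR goes further: it interprets the pixels and returns the characters read.
    text = pytesseract.image_to_string(page)
    print(text)

In other words, the scanner delivers a picture; the recognition step turns that picture into data that a computer can process.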

The early OCR readers were large, cumbersome, noisy pieces of equipment. Moving volumes of ‘used’ paper past a read head at high speed and accurately reading the desired fields is a non-trivial task. Forms design is critical: paper weight, colour, layout and box size have to be carefully controlled. In processing, forms have to be accurately aligned, fields to be read have to be identified and registered, and characters have to be machine-readable. Where an item could not be read accurately, the form was dropped into a ‘reject’ pocket on the reader and the rejects were then re-processed by key punching. These systems were cost-effective from the late 1960s/early 1970s, although acquisition and running costs were high and a resident engineer was needed for daily operations.

Later in the 1970s, OCR readers became more reliable, and efficiency was improved by automating reject repair. This was done by integrating a key-to-disk minicomputer with the OCR reader: rejected characters were displayed on a terminal screen as they had been read, and the operator corrected them by visual verification. A second development was the ability to read handprint [not handwriting]. These systems were called Mixed Media systems because they could capture data both by OCR and by keying.

In 1978, British Rail installed the UK’s first mixed media system with handprint recognition, used for reading timesheets for payroll. Over the next 20 years, OCR was continuously developed so that standard type fonts could be read and handprint recognition became widely available. The readers became smaller, more reliable and less costly.

In 1998, the same conceptual system approach, using commodity rather than proprietary hardware, was used to read Cattle Passports for the UK’s Cattle Tracing System, designed in the aftermath of the Mad Cow Disease [BSE] crisis. The only real common feature was the challenge of setting up and maintaining the systems to a high standard, although the new technology was orders of magnitude easier to work with than the old. Nevertheless, OCR remains a specialist technology.

Scanning as a generic technology grew out of photocopying, with advances in laser technology and digitisation improving the quality and tractability of the scanned image. A good example of scanning being used in data capture is the Atomic Weapons Establishment’s payroll system, which used desk-top scanners connected to PCs. These scanners had recognition software and were ‘state-of-the-art’ at the time.

The Case Studies described here are the only ones to survive; they are broadly representative and indicative of the use of the technology. By the 1990s, desk-top scanners were so common that they were rarely documented as systems. In the early 1990s, software systems such as ROCC’s SEECHECK Forms Processing Solutions were available that ran on commodity hardware and provided comprehensive facilities for all scanning-related and keying-related data capture. From being a separate department in an organisation using somewhat esoteric technology, data capture had become just another desk-top task.