How Information Systems Will Read Handwriting in the Tally of the Census
GAITHERSBURG, MD, 03/15/2000 --
One of the marvels of the electronic era will take place over the next several weeks as the U.S. Census gets under way--the use of automated recognition technology to read the handwriting on millions of Census forms flooding into Census processing centers around the country.
It should come as no surprise that information technology is an integral part of the processing of the Census. But how do scanning and recognition systems cope with the great variances in handwriting, where your letter l looks like someone else's f, where respondents dot an m or an n but never an i, and where someone's printed t can be easily confused with an x? Can computer software determine what the respondent meant to write, and do so quickly and reliably enough to capture the volumes of data that will be represented on more than 120 million forms expected to be processed during this year's Census? And will computers be able to cope with some of the accidents that befall Census forms en route, such as coffee stains, torn corners, or pens that leak ink?
Lockheed Martin systems engineers who worked with the U.S. Census Bureau to develop Data Capture System 2000 (DCS 2000) to process the 2000 Census are absolutely confident that it will. They report that the system will not only handle the huge variances in handwriting but will do so more accurately and with greater confidence levels than any previous system.
Able to Handle All Kinds of Handwriting
"The systems we've developed will be able to deal with all kinds of handwriting types," says Richard E. Taylor, systems architect who led the systems design for Lockheed Martin Mission Systems. "We've tested it exhaustively and validated the results in dry runs and we know that it's fully capable of handling what will be the largest image recognition project ever undertaken. I don't believe there's any system in the world that is as accurate as this one on a data stream as diverse as the Census."
DCS 2000 supports the entire Census processing from check-in of returned forms to the point where the final captured data is forwarded to Census Bureau computers, ready for analysis by scholars and planners, citizenry and media, among others. Developed by Lockheed Martin and the Census Bureau in an industry partnership with scores of equipment and software vendors, the system is installed at four Data Capture Centers across the U.S. and has been sized to handle each center's census form processing load and schedule.
How Does It All Work?
It is the scanning and recognition activity, though, that really is the heart of the operation.
How does it work? How does information penned, penciled, scrawled or printed on a Census form get recorded accurately and become part of the enormous quantity of data that will be captured in this Census?
Census forms will arrive by mail truck at four Bureau of the Census processing centers in Baltimore, Md.; Jeffersonville, Ind.; Phoenix, Ariz.; and Pomona, Calif. At peak times, as many as 15 tractor trailer loads of forms will arrive daily at each center. Within 48 hours of receipt, the forms are checked in and sorted. This allows the Census Bureau to know which households have responded and to plan its non-response follow-up accordingly. After the check-in process, the paper forms will be scanned. This is akin to taking a computer photo of the information. The scanning digitizes all the data entered on the form, including the handwriting. Information marked in boxes with X marks or check marks is captured through extremely accurate optical mark recognition (OMR) technology pioneered in the 1990 Census and enhanced for the 2000 Census.
Fat, Slanted, Little and Tall A's Are All Recognized
The system next evaluates the zones or areas where handwritten entries are expected. If anything is present in one of these zones, the system attempts to recognize it. First, a zone is segmented into characters by looking for breaks in the writing. (Note that respondents are instructed to print.) After segmentation into possible characters, the digitized handwriting images are analyzed according to the sizes and shapes of the characters or alphabet letters written. It's a process called optical character recognition (OCR), and it's done by a type of statistical analysis that is programmed to recognize each alphabet letter in a variety of variations--a big fat A, a little A, an A with a decided slant, a capital A, a lowercase a.
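The idea of segmenting a zone "by looking for breaks in the writing" can be illustrated with a minimal sketch. It assumes the scanned zone is a small grid of pixels (1 for ink, 0 for blank paper) and treats any completely blank column as a break between characters. The function name and the pixel representation are illustrative assumptions, not details of DCS 2000.

```python
def segment_columns(zone):
    """Split a binary image into character candidates at blank columns.

    zone: list of equal-length rows, where 1 means ink and 0 means blank.
    Returns a list of (start, end) column ranges, with end exclusive.
    This is a simplified illustration, not the DCS 2000 algorithm.
    """
    if not zone:
        return []
    width = len(zone[0])
    # A column "has ink" if any row has a 1 in that column.
    has_ink = [any(row[x] for row in zone) for x in range(width)]

    segments = []
    start = None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x                      # a character candidate begins
        elif not ink and start is not None:
            segments.append((start, x))    # a blank column ends it
            start = None
    if start is not None:
        segments.append((start, width))    # candidate runs to the right edge
    return segments


# Two blobs of ink separated by two blank columns.
zone = [
    [1, 1, 0, 0, 1, 0],
    [1, 0, 0, 0, 1, 1],
]
segments = segment_columns(zone)
print(segments)
```

A scheme this simple would fail on touching or overlapping letters, which is one reason the instruction to print matters.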
The Optical Character Recognition (OCR) engine (that's a systems term for the software that does this) comes up with its best judgment of what the letter is and also attaches a confidence factor, a number from 0 to 100 indicating just how sure it is of the letter being recognized. Sometimes the OCR engine also provides a second choice (with a lower, but still credible, confidence level). The functions of the commercial OCR engine are complemented with unique software developed by Lockheed Martin engineers that enhances the evaluation of characters in context. Included in this unique context checking are cross checking of related fields (e.g., age and date of birth), comparison to specialized dictionaries (e.g., list of occupations), and trigram analysis. Trigram analysis is the evaluation of all three-letter combinations in a word to determine its likelihood of being a valid English word. A table is created containing all three-letter combinations of letters in the English alphabet (i.e., AAA, AAB, ..., ZZZ). Associated with each three-letter combination is its frequency of occurrence at the beginning, middle, and end of English words. The initial confidence of accurate recognition is then adjusted, based on the trigram analysis. If, for example, three i's in succession are discovered in a word, the software reduces its confidence rating, since there is no word with three i's in a row. Conversely, a letter combination that appears frequently gets a high rating. If possible, invalid trigrams are replaced with alternative valid trigrams, using alternate recognition choices provided by the OCR engine.
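A minimal sketch of trigram analysis follows. It builds a position-aware trigram table from a tiny, made-up word list and then raises or lowers an OCR confidence factor depending on whether every trigram in a word has been seen at that position. The word list, the penalty and bonus values, and all function names are assumptions chosen for illustration; the actual DCS 2000 tables and adjustment rules are not described in this article.

```python
from collections import Counter


def trigrams_with_position(word):
    """Yield (trigram, position) pairs; position is 'begin', 'middle', or 'end'."""
    word = word.upper()
    last = len(word) - 3
    for i in range(last + 1):
        if i == 0:
            position = "begin"
        elif i == last:
            position = "end"
        else:
            position = "middle"
        yield word[i:i + 3], position


def build_table(words):
    """Count how often each trigram occurs at each position in the word list."""
    table = Counter()
    for word in words:
        table.update(trigrams_with_position(word))
    return table


def adjust_confidence(word, ocr_confidence, table, penalty=30, bonus=5):
    """Lower the confidence if any trigram is unseen; raise it slightly otherwise.

    penalty and bonus are hypothetical values, clamped to the 0-100 scale.
    """
    grams = list(trigrams_with_position(word))
    if not grams:                       # words shorter than three letters
        return ocr_confidence
    if any(table[gram] == 0 for gram in grams):
        return max(0, ocr_confidence - penalty)
    return min(100, ocr_confidence + bonus)


# Toy stand-in for a real dictionary such as a list of occupations.
table = build_table(["TEACHER", "TEACHING", "PREACHER", "FARMER", "FARMING"])

plausible = adjust_confidence("TEACHER", 80, table)
implausible = adjust_confidence("TEIIIER", 80, table)   # contains "III"
print(plausible, implausible)
```

The replacement of invalid trigrams with the OCR engine's second-choice characters is not shown here; it would amount to re-scoring each alternate spelling with the same table and keeping the best one.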
Poorly Written Words Sometimes Need Human Operator Help
Using the combination of the OCR engine, the trigram analysis, and the confidence factors, each word is judged to be either recognized with high confidence or not. Those words recognized with high confidence are placed directly in the output data for the form being processed. Those words not recognized with high confidence are forwarded to a human operator who is presented with images of low confidence words and keys the values. Reasons for failure to recognize a word with high confidence include the presence of characters that are poorly written (an A that resembles an R), words that have been scratched out and rewritten elsewhere on the form, and words that have been written in script rather than printed.
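The routing decision can be sketched in a few lines, assuming each word arrives with a single combined confidence factor on the 0 to 100 scale. The cutoff of 90 is a hypothetical value picked for the example; the article does not state the threshold DCS 2000 uses.

```python
def route_words(words, threshold=90):
    """Split (text, confidence) pairs into auto-accepted words and words for keying.

    threshold is a hypothetical cutoff, not a published DCS 2000 setting.
    """
    accepted, to_keyer = [], []
    for text, confidence in words:
        if confidence >= threshold:
            accepted.append(text)      # goes straight into the output data
        else:
            to_keyer.append(text)      # image is shown to a human operator
    return accepted, to_keyer


accepted, to_keyer = route_words([("SMITH", 97), ("JONES", 62), ("TEACHER", 90)])
print(accepted, to_keyer)
```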
How accurate is the whole process? If the automated results were accepted completely, and no low confidence words were keyed by human operators, approximately 92 percent of the words would be entirely correct. "This is truly superb, when you consider the range of hand printing that must be processed," according to Lockheed Martin's Taylor. However, the Census requires that the data be at least 98 percent accurate, so the low confidence words are forwarded to keyers for human recognition and entry. By having human keyers enter the 25 percent of the words with the lowest automated recognition confidence, the desired accuracy will be achieved.
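The arithmetic behind these figures can be worked through in a short sketch. It assumes, purely for simplicity, that human keyers make no errors; the article does not claim this, and real keyer accuracy would shift the numbers. Under that assumption, the sketch computes how accurate the 75 percent of words accepted automatically must be, and what share of the OCR errors must fall within the keyed 25 percent, for the overall result to reach 98 percent.

```python
def required_auto_accuracy(target, keyed_fraction, keyer_accuracy=1.0):
    """Accuracy needed on auto-accepted words to hit the overall target.

    keyer_accuracy=1.0 (perfect keying) is a simplifying assumption.
    """
    auto_fraction = 1.0 - keyed_fraction
    return (target - keyed_fraction * keyer_accuracy) / auto_fraction


def share_of_errors_to_catch(baseline, target):
    """Fraction of the OCR errors that keying must remove, assuming perfect keying."""
    return (target - baseline) / (1.0 - baseline)


needed = required_auto_accuracy(target=0.98, keyed_fraction=0.25)
share = share_of_errors_to_catch(baseline=0.92, target=0.98)
print(f"auto-accepted words must be {needed:.1%} accurate")
print(f"keying must catch {share:.0%} of the OCR errors")
```

In other words, the confidence factors must concentrate about three quarters of the mistakes in the quarter of the words sent to keyers, which is what makes a reliable confidence measure so important.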
System Reduces Need for Keying Support by 75 Percent
Accuracy at these levels means that the system can be counted on to process and verify the information without the need for extensive keying operations. In fact, the use of the DCS 2000 system has reduced by almost 75 percent the keying support that would have been required if past systems had been used.
In the 1990 Census, the processing made use of mark recognition to read some of the information, but the major portion of the data was laboriously keyed by operators. Microfilming was done to maintain a permanent record of the information. Today, that's no longer necessary since the DCS 2000 has captured the digital image of the data.
Continuous Quality Control to Flag Problems
The DCS 2000 system also includes ongoing quality control that can spot problems virtually as they occur. Taylor explains: "Dynamically, as data from forms is collected, we siphon off data samples and send them to a human operator for keying. This provides a continuous rolling measure of OCR and keyer accuracy. If for any reason there's a problem, we stop the processing on the batch of forms in question in real time and start over again until we're sure that we have it right." Taylor notes that forms are usually processed in batches of about 400 and that an entire batch will be recycled if the overall accuracy of the batch is below the acceptable standard. "Think of the batches as busloads of people--one person doesn't get to the destination until all the others do. Ultimately, our Lockheed Martin role is to guarantee to the Census Bureau that the data collected is accurate."
And what about the coffee spills on forms? "When we encounter a form with coffee spots or tears or other damage, the system in the initial scanning operation spots the problem. It alerts an operator who makes a judgment on whether the form can indeed be read. Sometimes, a corner might be missing, but not in an area where information has been written, and we can continue with processing. If there's any doubt at all about the readability, it goes to a human operator to decipher."
But Taylor would prefer to rely as much as possible on the information system to read and assess the data. "Every time we can eliminate a person's judgment, our accuracy goes up."
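The batch-level check can be sketched as follows, assuming each sampled field is available as a pair of values: what the system captured and what an independent human keyer entered. The sample size, the 98 percent standard applied per batch, and the function name are assumptions for illustration; the article gives only the approximate batch size of 400 forms.

```python
import random


def batch_passes(fields, sample_size, min_accuracy, rng=None):
    """Sample fields from a batch and compare captured values with keyed values.

    fields: list of (captured_value, keyed_value) pairs for one batch.
    Returns True if the sampled agreement rate meets min_accuracy.
    An empty batch passes trivially. All parameters are hypothetical.
    """
    if not fields:
        return True
    rng = rng or random.Random()
    sample = rng.sample(fields, min(sample_size, len(fields)))
    matches = sum(1 for captured, keyed in sample if captured == keyed)
    return matches / len(sample) >= min_accuracy


# A clean batch passes; a batch where nothing agrees is sent back.
clean_batch = [("SMITH", "SMITH")] * 400
bad_batch = [("SM1TH", "SMITH")] * 400

clean_ok = batch_passes(clean_batch, sample_size=40, min_accuracy=0.98)
bad_ok = batch_passes(bad_batch, sample_size=40, min_accuracy=0.98)
print(clean_ok, bad_ok)
```

A failed check applies to the whole batch rather than to individual forms, matching the busload analogy: every form in the batch is reprocessed together.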
A leader in mission critical systems integration and information operations, Lockheed Martin Mission Systems serves customers including U.S. and international defense and civil government agencies. Mission Systems employs approximately 2,700 at major facilities in Gaithersburg, Md., Colorado Springs, Colo., Manassas, Va., and Santa Maria, Calif., and is a business unit of Lockheed Martin Corporation.