NLA Trial index

NLA Trial Articles from 1981

Notes
  1. Accuracy of OCR and overProof is measured against the human corrections. The human corrections in this sample are known to be incomplete and to contain errors of their own, but they are the best we could find automatically in the NLA newspapers corpus: articles tagged as completely corrected, then further filtered to those with at least 3 corrections, at least 40% of lines corrected, and in the lowest third by percentage of non-dictionary words.
  2. Accuracy is measured by a separate process from that used to colour words in this output: the colouring process is heuristic, and not completely accurate.
  3. Colour legend:
    Text - OCR text corrected by human and/or overProof
    Text - human and/or overProof corrections
    Text - discrepancies between human and/or overProof
    Text - human corrections not applied by overProof
  4. Identified overProof corrections are calculated by the statistical calculation process and show those words changed by overProof which ALSO match human corrections. As human corrections are often wrong and incomplete, so too is this list.
  5. Identified overProof non-corrections are calculated by the statistical calculation process and show those words in the overProof output which DO NOT MATCH human corrections. As human corrections are often wrong and incomplete, so too is this list. Words marked as [**VANDALISED] are those which have been changed by overProof but not by the human correction; as before, a missed human correction will be (incorrectly) classified as vandalisation by overProof.
  6. Searchability of unique words refers to the distinct words in an article, and how many are present before and after correction. It is a measure of how many of the words within an article could be used to find the article using a search engine.
  7. Weighted Words refers to a calculation in which common words count for little (a fraction of a word) and unusual words count for more, in proportion to the log of the inverse of their frequency in the corpus. It may be an indicator of how well distinctive words in an article can be searched before and after correction.
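
The two searchability measures in notes 6 and 7 can be sketched as below. This is a minimal illustration, not the report's actual calculation: the tokenisation, the add-one smoothing, the toy corpus counts and all function names are assumptions, and the sketch does not rescale weights into the "fraction of a word" units that note 7 mentions.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lower-case word tokens; this tokenisation is an assumption."""
    return re.findall(r"[a-z0-9']+", text.lower())


def unique_searchability(text, reference):
    """Percentage of the distinct reference words that appear in text."""
    ref_words = set(tokens(reference))
    found = ref_words & set(tokens(text))
    return 100.0 * len(found) / len(ref_words)


def word_weight(word, corpus_counts, corpus_total):
    """Log of the inverse corpus frequency, so rare words weigh more."""
    # Add-one smoothing is an assumption that keeps unseen words finite.
    frequency = (corpus_counts.get(word, 0) + 1) / (corpus_total + 1)
    return math.log(1.0 / frequency)


def weighted_searchability(text, reference, corpus_counts, corpus_total):
    """As unique_searchability, but each word counts by its weight."""
    ref_words = set(tokens(reference))
    text_words = set(tokens(text))
    total = sum(word_weight(w, corpus_counts, corpus_total) for w in ref_words)
    found = sum(word_weight(w, corpus_counts, corpus_total)
                for w in ref_words & text_words)
    return 100.0 * found / total


# Toy corpus counts, invented for illustration only.
corpus_counts = Counter({"the": 1000, "at": 500, "five": 100,
                         "months": 80, "ago": 50, "wedderburn": 1})
corpus_total = sum(corpus_counts.values())

reference = "five months ago at Wedderburn"   # human-corrected text
ocr_text = "five months ago at Wcdderburri"   # raw OCR text

plain = unique_searchability(ocr_text, reference)
weighted = weighted_searchability(ocr_text, reference,
                                  corpus_counts, corpus_total)
```

In this toy case the only missed word is the rare place name, so the weighted score falls below the plain unique-word score. That is the effect note 7 describes: losing a distinctive word costs more than losing a common one.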

Article ID 126826221, Article, Gold nugget, page 12 1981-03-07, The Canberra Times (ACT : 1926 - 1995), 77 words, 4 corrections

Raw OCR | Human Corrected | overProof Corrected
Gold nugget | Gold nugget | Gold nugget
MELBOURNE: The world's largest | MELBOURNE: The world's largest | MELBOURNE: The world's largest
gold nugget, the Hand of Faith, found ; | gold nugget, the Hand of Faith, found | gold nugget, the Hand of Faith, found ;
five months ago at Wcdderburri, Vic | five months ago at Wedderburn, | five months ago at Wedderburn, Vic
Victoria, has left Australia.? | Victoria, has left Australia. | Victoria, has left Australia.?
The 27.2-kilogram nugget with, a; | The 27.2-kilogram nugget with a | The 27.2 kilogram nugget with, a;
$ I million price tag, will either be : | $1 million price tag, will either be | $ I million price tag, will either be
displayed in the Golden Nugget Casino V | displayed in the Golden Nugget Casino | displayed in the Golden Nugget Casino V
in Las Vegas, which already dispays the't | in Las Vegas, which already displays the | in Las Vegas, which already dispays the't
Robins nugget valued at $800,000 (dug " | Robins nugget valued at $800,000 (dug | Robins nugget valued at $800,000 (dug "
up in Bcndigo five years ago), or in a r | up in Bendigo five years ago), or in a | up in Bendigo five years ago), or in a r
.new $140 million complex in Atlanta,'' | new $140 million complex in Atlanta, | new $140 million complex in Atlanta,''
Georgia. _[ | Georgia. | Georgia. _[
Identified overProof corrections: WEDDERBURN BENDIGO
Identified overProof non-corrections: DISPLAYS
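
The way words end up in these two lists (notes 4 and 5) can be sketched as follows. This is a hedged illustration, not overProof's statistical process: it assumes the three versions have already been aligned word by word, and the function name `classify` and its labels are hypothetical.

```python
def classify(ocr, human, overproof):
    """Label one aligned word triple, following notes 4 and 5."""
    if overproof == human:
        # overProof agrees with the human text; it counts as a correction
        # only if the OCR word actually needed changing.
        return "correction" if ocr != human else "unchanged"
    if human == ocr:
        # overProof changed a word that the human left alone.
        return "vandalised"
    # The human changed the word but overProof did not arrive at the same form.
    return "non-correction"


# Word triples (raw OCR, human corrected, overProof corrected).
aligned = [
    ("Wcdderburri", "Wedderburn", "Wedderburn"),
    ("Bcndigo", "Bendigo", "Bendigo"),
    ("dispays", "displays", "dispays"),
    ("nugget", "nugget", "nugget"),
    ("Faith", "Faith", "Fair"),  # hypothetical example, not from the article
]

labels = [classify(ocr, human, overproof) for ocr, human, overproof in aligned]
```

Because the human text is treated as ground truth, a word the human corrector missed but overProof fixed would be labelled "vandalised" here, which is the misclassification note 5 warns about.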
Measure | Word count | OCR accuracy % | overProof accuracy % | Errors corrected %
All Words | 64 | 93.8 | 96.9 | 50.0
Searchability of unique words | 47 | 93.6 | 97.9 | 66.7
Weighted Words |  | 94.5 | 98.2 | 66.7
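
The percentages in this table appear consistent with simple ratios, as the sketch below shows. The counts of correct words (60 and 62 of 64, 44 and 46 of 47) are back-calculated from the published percentages rather than stated in the report, and the function names are hypothetical.

```python
def accuracy_pct(correct, total):
    """Share of words matching the human-corrected text, as a percentage."""
    return round(100.0 * correct / total, 1)


def errors_corrected_pct(total, correct_before, correct_after):
    """Share of the original OCR errors that are gone after correction."""
    errors_before = total - correct_before
    errors_after = total - correct_after
    if errors_before == 0:
        return None  # nothing to correct, so the ratio is undefined
    return round(100.0 * (errors_before - errors_after) / errors_before, 1)


# All Words: 64 words, assumed 60 correct in OCR and 62 after overProof.
all_words = (accuracy_pct(60, 64), accuracy_pct(62, 64),
             errors_corrected_pct(64, 60, 62))

# Unique words: 47 words, assumed 44 correct in OCR and 46 after overProof.
unique_words = (accuracy_pct(44, 47), accuracy_pct(46, 47),
                errors_corrected_pct(47, 44, 46))
```

Under these assumed counts, `all_words` evaluates to (93.8, 96.9, 50.0) and `unique_words` to (93.6, 97.9, 66.7), matching the first two rows. The Weighted Words row cannot be checked this way without the corpus frequencies.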

Accumulated stats for 1 article from year 1981

Measure | Word count | OCR accuracy % | overProof accuracy % | Errors corrected %
All Words | 64 | 93.8 | 96.9 | 50.0
Searchability of unique words | 47 | 93.6 | 97.9 | 67.2
Weighted Words |  | 94.5 | 98.2 | 67.3