NLA Trial Articles from 1966

Notes
  1. Accuracy of OCR and overProof is measured against the human corrections. We know the human corrections in this sample are incomplete and themselves contain errors, but they are the best we could select automatically from the NLA newspapers corpus: articles tagged as completely corrected, then further filtered to those with at least 3 corrections, at least 40% of lines corrected, and a percentage of non-dictionary words in the lowest third.
  2. Accuracy is measured by a separate process from that used to colour words in this output: the colouring process is heuristic, and not completely accurate.
  3. Colour legend:
    Text - OCR text corrected by human and/or overProof
    Text - human and/or overProof corrections
    Text - discrepancies between human and overProof corrections
    Text - human corrections not applied by overProof
  4. Identified overProof corrections are calculated by the statistical calculation process, and show those words changed by overProof which ALSO match human corrections. As human corrections are often wrong and incomplete, so too is this list.
  5. Identified overProof non-corrections are calculated by the statistical calculation process, and show those words in the overProof output which DO NOT MATCH human corrections. As human corrections are often wrong and incomplete, so too is this list. Words marked as [**VANDALISED] are those which have been changed by overProof but not by the human correction; as before, a missed human correction will be (incorrectly) classified as vandalisation by overProof.
  6. Searchability of unique words refers to the distinct words in an article, and how many are present before and after correction. It is a measure of how many of the words within an article could be used to find the article using a search engine.
  7. Weighted Words refers to a calculation in which common words count for little (a fraction of a word) and unusual words count for more, in proportion to the log of the inverse of their frequency in the corpus. It may be an indicator of how well distinctive words in an article can be searched before and after correction.
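
Notes 6 and 7 can be illustrated with a minimal sketch. It is not the calculation used to produce this report: the function names, the add-one smoothing, the choice to weight distinct words only, and the corpus counts are all assumptions made for illustration. The only detail taken from the notes is that a word's weight grows with the log of the inverse of its corpus frequency.

```python
import math


def word_weight(word, corpus_counts, corpus_total):
    """Weight proportional to log(1 / relative corpus frequency)."""
    # Add-one smoothing (an assumption) keeps unseen words finite.
    relative_freq = (corpus_counts.get(word, 0) + 1) / (corpus_total + 1)
    return math.log(1.0 / relative_freq)


def unique_word_searchability(text_words, reference_words):
    """Percentage of distinct reference words that appear in the text."""
    reference = set(reference_words)
    if not reference:
        return 100.0
    return 100.0 * len(reference & set(text_words)) / len(reference)


def weighted_word_searchability(text_words, reference_words,
                                corpus_counts, corpus_total):
    """Same as above, but each distinct word counts by its weight."""
    reference = set(reference_words)
    found = reference & set(text_words)
    total = sum(word_weight(w, corpus_counts, corpus_total) for w in reference)
    if total == 0:
        return 100.0
    hit = sum(word_weight(w, corpus_counts, corpus_total) for w in found)
    return 100.0 * hit / total


# Hypothetical corpus counts, invented for this example only.
corpus_counts = {"the": 1000, "was": 500, "stamp": 10}
corpus_total = 10000

human = ["the", "stamp", "was"]      # treated as the reference text
raw_ocr = ["the", "stamp", "wa<"]    # one common word damaged by OCR

print(unique_word_searchability(raw_ocr, human))
print(weighted_word_searchability(raw_ocr, human, corpus_counts, corpus_total))
```

In this toy case the damaged word is a common one, so the weighted figure comes out higher than the unweighted one; losing a rare word such as a surname would pull the weighted figure down further than the unweighted one.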

Article ID 107889104, Article, 4c Christmas stamp issue, page 10 1966-08-26, The Canberra Times (ACT : 1926 - 1995), 65 words, 4 corrections

Raw OCR                        | Human Corrected               | overProof Corrected
4c Christmas                   | 4c Christmas                  | 4c Christmas
stamp issu£                    | stamp issue                   | stamp issue
Australia's annual Christ      | Australia's annual Christ-    | Australia's annual Christmas
mas postage stamp will be      | mas postage stamp will be     | postage stamp will be
issued at all post offices on  | issued at all post offices on | issued at all post offices on
Wednesday, October 19.         | Wednesday, October 19.        | Wednesday, October 19.
The Postmaster-General,        | The Postmaster-General,       | The Postmaster-General,
Mr Hulme, said that the        | Mr Hulme, said that the       | Mr Hulme, said that the
stamp, of 4c denomination      | stamp, of 4c denomination     | stamp, of 4c denomination
would be olive green and :     | would be olive green and      | would be olive green and
black. Designed by Mr j        | black. Designed by Mr.        | black. Designed by Mr J
Lance Stirling of Ringwood, ■1 | Lance Stirling of Ringwood,   | Lance Stirling of Ringwood, in
Victoria, the stamp wa<        | Victoria, the stamp was       | Victoria, the stamp was
styled after a medieval en     | styled after a medieval en-   | styled after a medieval engraving
graving of the Adoration of    | graving of the Adoration of   | of the Adoration of
the Shepherds. j               | the Shepherds.                | the Shepherds. j
Identified overProof corrections WAS ISSUE
Identified overProof non-corrections VICTORIA [**VANDALISED]
                              | Word count | OCR accuracy % | overProof accuracy % | Errors corrected %
All Words                     | 55         | 94.5           | 98.2                 | 66.7
Searchability of unique words | 42         | 95.2           | 97.6                 | 50.0
Weighted Words                |            | 95.8           | 97.9                 | 50.0
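
The page does not define "Errors corrected %". The sketch below shows one reading that happens to reproduce the figures for this article: the share of OCR errors that are no longer errors after overProof. The function name is hypothetical, and the counts of correct words (52 and 54 of 55, 40 and 41 of 42) are inferred from the rounded percentages rather than stated in the report.

```python
def errors_corrected_pct(total_words, ocr_correct, overproof_correct):
    """Share of OCR errors removed by correction, as a percentage."""
    ocr_errors = total_words - ocr_correct
    remaining_errors = total_words - overproof_correct
    if ocr_errors == 0:
        return 0.0  # nothing to correct; the return value here is an assumption
    return 100.0 * (ocr_errors - remaining_errors) / ocr_errors


# All Words: 55 words, assumed 52 correct in the OCR and 54 after overProof.
print(round(100.0 * 52 / 55, 1))                   # OCR accuracy %
print(round(100.0 * 54 / 55, 1))                   # overProof accuracy %
print(round(errors_corrected_pct(55, 52, 54), 1))  # errors corrected %

# Unique words: 42 words, assumed 40 correct in the OCR and 41 after overProof.
print(round(errors_corrected_pct(42, 40, 41), 1))
```

Under this reading, "All Words" has 3 OCR errors of which 2 are fixed, and the unique-word row has 2 errors of which 1 is fixed.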

Accumulated stats for 1 article from year 1966

                              | Word count | OCR accuracy % | overProof accuracy % | Errors corrected %
All Words                     | 55         | 94.5           | 98.2                 | 67.3
Searchability of unique words | 42         | 95.2           | 97.6                 | 50.0
Weighted Words                |            | 95.8           | 97.9                 | 50.0