NLA Trial Articles from 1980

Notes
  1. Accuracy of OCR and overProof is measured against the human corrections. We know the human corrections in this sample are incomplete and themselves contain errors, but they are the best we could find automatically in the NLA newspapers corpus: articles tagged as completely corrected, then further filtered to those with at least 3 corrections, at least 40% of lines corrected, and a percentage of non-dictionary words in the lowest third.
  2. Accuracy is measured by a separate process from that used to colour words in this output: the colouring process is heuristic, and not completely accurate.
  3. Colour legend:
    Text - OCR text corrected by human and/or overProof
    Text - human and/or overProof corrections
    Text - discrepancies between human and overProof corrections
    Text - human corrections not applied by overProof
  4. Identified overProof corrections are calculated by the statistical calculation process, and show those words changed by overProof which ALSO match human corrections. As human corrections are often wrong and incomplete, so too is this list.
  5. Identified overProof non-corrections are calculated by the statistical calculation process, and show those words in the overProof output which DO NOT MATCH human corrections. As human corrections are often wrong and incomplete, so too is this list. Words marked as [**VANDALISED] are those which have been changed by overProof but not by the human correction; as before, a missed human correction will be (incorrectly) classified as vandalisation by overProof.
  6. Searchability of unique words refers to the distinct words in an article, and how many are present before and after correction. It is a measure of how many of the words within an article could be used to find the article using a search engine.
  7. Weighted Words refers to a calculation in which common words count for little (a fraction of a word) and unusual words count for more, in proportion to the log of the inverse of their frequency in the corpus. It may be an indicator of how well distinctive words in an article can be searched before and after correction.
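
As a rough illustration of notes 6 and 7, the sketch below scores a candidate text against a reference text in two ways: by the share of distinct reference words it contains, and by the same share with each word weighted by the log of its inverse corpus frequency. It is not the process used to produce the figures in this report. The function names, the handling of unseen words, and the toy corpus counts are all assumptions made for the example.

```python
import math


def unique_word_accuracy(candidate_words, reference_words):
    """Percentage of distinct reference words that also occur in the candidate."""
    reference = {w.lower() for w in reference_words}
    candidate = {w.lower() for w in candidate_words}
    if not reference:
        return 100.0
    return 100.0 * len(reference & candidate) / len(reference)


def word_weight(word, corpus_counts, corpus_total):
    """Weight = log(1 / relative frequency).

    Words missing from corpus_counts are treated as having count 1
    (an assumption for this sketch).
    """
    count = corpus_counts.get(word, 1)
    return math.log(corpus_total / count)


def weighted_word_accuracy(candidate_words, reference_words,
                           corpus_counts, corpus_total):
    """Like unique_word_accuracy, but rare words count for more."""
    reference = {w.lower() for w in reference_words}
    candidate = {w.lower() for w in candidate_words}
    total_weight = sum(word_weight(w, corpus_counts, corpus_total)
                       for w in reference)
    if total_weight == 0:
        return 100.0
    found_weight = sum(word_weight(w, corpus_counts, corpus_total)
                       for w in reference & candidate)
    return 100.0 * found_weight / total_weight


# Toy data, invented for illustration (not NLA corpus counts).
corpus_counts = {"the": 1000, "motor": 10, "inn": 10, "parkroyal": 1}
corpus_total = 10000
reference = ["the", "Parkroyal", "Motor", "Inn"]
candidate = ["the", "Park", "royal", "Motor", "Inn"]

unique_pct = unique_word_accuracy(candidate, reference)
weighted_pct = weighted_word_accuracy(candidate, reference,
                                      corpus_counts, corpus_total)
print(unique_pct, weighted_pct)
```

In this toy case the one missing word is the rarest, so the weighted score comes out lower than the unweighted one: losing a distinctive word hurts searchability more than losing a common one.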

Article ID 125617279, Article, NRMA man farewelled, page 21 1980-08-21, The Canberra Times (ACT : 1926 - 1995), 95 words, 3 corrections

Raw OCR | Human Corrected | overProof Corrected
farewelled | NRMA nan farewelled | farewelled
Mr Ken; Wilson, manager of the Canberra | Mr. Ken Wilson, manager of the Canberra | Mr Ken; Wilson, manager of the Canberra
branch of the NRMA for nine, years, was | branch of the NRMA for nine years, was | branch of the NRMA for nine years, was
officially farewelled. at .a function at the | officially farewelled at a function at the | officially farewelled. at a function at the
Parkroyal Motor Inn yesterday. y | Parkroyal Motor Inn yesterday. | Park royal Motor Inn yesterday. by
His replacement, who took up'his appointment | His replacement, who took up his appointment | His replacement, who took up his appointment
on August 11, is Mr Bob Simpson, who has workedi | on August 11, is Mr Bob Simpson, who has worked | on August 11, is Mr Bob Simpson, who has worked
■for/the NRMA for 28 years. j, ' | for the NRMA for 28 years. | for the NRMA for 28 years. a j, '
Mr Simpson said yesterday that he d|d not . | Mr. Simpson said yesterday that he d|d not | Mr Simpson said yesterday that he did not expect
expect any changes in the branch's operation. Its | expect any changes in the branch's operation. Its | any changes in the branch's operation. Its
biggest business was compulsory third-party | biggest business was compulsory third-party | biggest business was compulsory third-party
motor-vehicle insurance. | motor-vehicle insurance. | motor-vehicle insurance.
Mr Ken Wilson, seated, with his successor as | Mr. Ken Wilson, seated, with his successor as | Mr Ken Wilson, seated, with his successor as
manager of the Canberra branch of the | manager of the Canberra branch of the | manager of the Canberra branch of the
NRMA, Mr Bob Simpson. | NRMA, Mr Bob Simpson. | NRMA, Mr Bob Simpson.
Identified overProof corrections: WORKED, UP
Identified overProof non-corrections: PARKROYAL [**VANDALISED], NAN
Measure | Word count | OCR accuracy % | overProof accuracy % | Errors corrected %
All Words | 90 | 92.2 | 96.7 | 57.1
Searchability of unique words | 56 | 94.6 | 96.4 | 33.3
Weighted Words | - | 95.8 | 97.2 | 33.3
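
The "Errors corrected %" figures above appear consistent with the relative drop in error count between the raw OCR and the overProof output. The sketch below reproduces the "All Words" and "Searchability of unique words" values from the word counts and accuracy percentages in the table. The formula and function names are assumptions inferred from the published figures, not a description of the actual statistical calculation process.

```python
def error_count(word_count, accuracy_pct):
    """Recover a whole-number error count from a rounded accuracy percentage."""
    return round(word_count * (100.0 - accuracy_pct) / 100.0)


def errors_corrected_pct(word_count, ocr_accuracy_pct, overproof_accuracy_pct):
    """Percentage of OCR errors no longer present after correction (assumed formula)."""
    before = error_count(word_count, ocr_accuracy_pct)
    after = error_count(word_count, overproof_accuracy_pct)
    if before == 0:
        return 0.0
    return 100.0 * (before - after) / before


# All Words: 90 words, 92.2% -> 96.7% (7 errors down to 3)
all_words = round(errors_corrected_pct(90, 92.2, 96.7), 1)
# Unique words: 56 words, 94.6% -> 96.4% (3 errors down to 2)
unique_words = round(errors_corrected_pct(56, 94.6, 96.4), 1)
print(all_words, unique_words)
```

The Weighted Words row has no word count, so it cannot be checked this way.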

Accumulated stats for 1 article from year 1980

Measure | Word count | OCR accuracy % | overProof accuracy % | Errors corrected %
All Words | 90 | 92.2 | 96.7 | 57.1
Searchability of unique words | 56 | 94.6 | 96.4 | 33.3
Weighted Words | - | 95.8 | 97.2 | 33.3