NLA Trial index

NLA Trial Articles from 1835

Notes
  1. Accuracy of OCR and overProof is measured against the human corrections. We know the human corrections in this sample are incomplete and themselves contain errors, but they are the best we could select automatically from the NLA newspapers corpus: articles tagged as completely corrected, then further filtered to those with at least 3 corrections, at least 40% of lines corrected, and a percentage of non-dictionary words in the lowest third.
  2. Accuracy is measured by a separate process from that used to colour words in this output: the colouring process is heuristic, and not completely accurate.
  3. Colour legend:
    Text - OCR text corrected by human and/or overProof
    Text - human and/or overProof corrections
    Text - discrepancies between human and overProof corrections
    Text - human corrections not applied by overProof
  4. Identified overProof corrections are calculated by the statistical calculation process, and show those words changed by overProof which ALSO match human corrections. As human corrections are often wrong and incomplete, so too is this list.
  5. Identified overProof non-corrections are calculated by the statistical calculation process, and show those words in the overProof output which DO NOT MATCH human corrections. As human corrections are often wrong and incomplete, so too is this list. Words marked as [**VANDALISED] are those which have been changed by overProof but not by the human correction; as before, a missed human correction will be (incorrectly) classified as vandalisation by overProof.
  6. Searchability of unique words refers to the distinct words in an article, and how many are present before and after correction. It is a measure of how many of the words within an article could be used to find the article using a search engine.
  7. Weighted Words refers to a calculation in which common words count for little (a fraction of a word) and unusual words count for more, in proportion to the log of the inverse of their frequency in the corpus. It may be an indicator of how well distinctive words in an article can be searched before and after correction.
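
Notes 6 and 7 can be illustrated with a minimal sketch. This is not the overProof implementation: the tokenisation, the function names, the toy corpus counts and the choice to weight distinct words only are all assumptions made for illustration. The only parts taken from the notes are that searchability counts distinct words, and that a word's weight grows with the log of the inverse of its corpus frequency.

```python
import math
import re


def tokenize(text):
    """Lower-case the text and split it into alphabetic words (assumed tokenisation)."""
    return re.findall(r"[a-z]+", text.lower())


def unique_word_searchability(candidate, reference):
    """Percentage of distinct reference words that also occur in the candidate text."""
    ref = set(tokenize(reference))
    cand = set(tokenize(candidate))
    if not ref:
        return 100.0
    return 100.0 * len(ref & cand) / len(ref)


def word_weight(word, corpus_freq, corpus_total):
    """Weight proportional to log(1 / relative frequency) of the word in the corpus."""
    count = corpus_freq.get(word, 1)  # unseen words treated as count 1 (assumption)
    return math.log(corpus_total / count)


def weighted_accuracy(candidate, reference, corpus_freq, corpus_total):
    """Weighted share of distinct reference words found in the candidate text."""
    ref = sorted(set(tokenize(reference)))
    cand = set(tokenize(candidate))
    total = sum(word_weight(w, corpus_freq, corpus_total) for w in ref)
    found = sum(word_weight(w, corpus_freq, corpus_total) for w in ref if w in cand)
    return 100.0 * found / total if total else 100.0


# Hypothetical corpus counts, invented for this example only.
corpus_freq = {"of": 1000, "daughter": 50, "eldest": 20,
               "sarah": 10, "edmund": 5, "lockyer": 1}
corpus_total = sum(corpus_freq.values())

human = "Sarah, eldest daughter of Edmund Lockyer"
raw = "Sarah, oldest daughter ot Edmund Lockyer"

print(unique_word_searchability(raw, human))
print(weighted_accuracy(raw, human, corpus_freq, corpus_total))
```

Under these assumptions, losing a rare word such as "eldest" lowers the weighted figure far more than losing a common word such as "of", which matches the intent described in note 7.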

Article ID 2200965, Family Notices, Family Notices, page 3 1835-10-31, The Sydney Gazette and New South Wales Advertiser (NSW : 1803 - 1842), 71 words, 4 corrections

Raw OCR | Human Corrected | overProof Corrected
BERTH. | BIRTH. | BIRTH.
On Tuesday, 27lh Instant, at his re- | On Tuesday, 27th Instant, at his re- | On Tuesday, 27th instant, at his re-
sidence George-3treet, tho Lady of Mr. | sidence George-street, the Lady of Mr. | sidence George-street, the Lady of Mr.
Moses Brown, tailor, of a son and | MOSES BROWN, tailor, of a son and | Moses Brown, tailor, of a son and
heir. | heir. | heir.
MARRIED, | MARRIED. | MARRIED,
On Thursday, the 29th October, by | On Thursday, the 29th October, by | On Thursday, the 29th October, by
the Reverend Charles Dickenson, at the | the Reverend Charles Dickenson, at the | the Reverend Charles Dickenson, at the
Field of Mars Church, Dudley, third son | Field of Mars Church, Dudley, third son | Field of Mars Church, Dudley, third son
of the late Francis Frederick North, | of the late Francis Frederick North, | of the late Francis Frederick North,
Esquire, of Hougham Hall, Norfolk and | Esquire, of Rougham Hall, Norfolk and | Esquire, of Rougham Hall, Norfolk and
Hastings, in the County of Sussex, to | Hastings, in the County of Sussex, to | Hastings, in the County of Sussex, to
Sarah, oldest daughter ot Edmund Lock- | Sarah, eldest daughter of Edmund Lock- | Sarah, eldest daughter of Edmund Lockyer,
yer, Esquire, of Ermington. | yer, Esquire, of Ermington. | Esquire, of Ermington.
Identified overProof corrections: ELDEST /GEORGE/STREET|GEORGESTREET ROUGHAM BIRTH
Identified overProof non-corrections: (none listed)
 | Word count | OCR accuracy % | overProof accuracy % | Errors corrected %
All Words | 65 | 90.8 | 100.0 | 100.0
Searchability of unique words | 49 | 91.8 | 100.0 | 100.0
Weighted Words | | 92.8 | 100.0 | 100.0
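
The report does not define how the "Errors corrected %" column is derived. One plausible reading, assumed here and not confirmed by the source, is the share of the OCR error that correction removes. The function name and the zero-error convention below are hypothetical.

```python
def errors_corrected_pct(ocr_accuracy, corrected_accuracy):
    """Share of the OCR error (100 - ocr_accuracy) removed by correction, in percent."""
    ocr_error = 100.0 - ocr_accuracy
    if ocr_error == 0:
        return 100.0  # nothing to correct (assumed convention)
    return 100.0 * (corrected_accuracy - ocr_accuracy) / ocr_error


# Accuracy pairs taken from the three rows of the table above.
for label, ocr, corrected in [
    ("All Words", 90.8, 100.0),
    ("Searchability of unique words", 91.8, 100.0),
    ("Weighted Words", 92.8, 100.0),
]:
    print(label, errors_corrected_pct(ocr, corrected))
```

Under this assumed definition, each row evaluates to 100.0, which agrees with the reported column for this article.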

Accumulated stats for 1 articles from year 1835

 | Word count | OCR accuracy % | overProof accuracy % | Errors corrected %
All Words | 65 | 90.8 | 100.0 | 100.0
Searchability of unique words | 49 | 91.8 | 100.0 | 100.0
Weighted Words | | 92.8 | 100.0 | 100.0