About and Help

Accounts: Test drives are encouraged — required for file upload — getting an account — your account dashboard
Processing: uploading files — retrieving results — processed file formats
Examples: "before and after" examples
Limitations: file sizes — languages & vocab — perfection is impossible — indexing corrected text
Pricing: rates and services
Paper: technical description of overProof delivered to the DATeCH 2014 conference
About Project Computing: who are we?

Accounts

Test Drives are encouraged

We've worked hard to create overProof as a tool which makes searching and using digitised archives much more effective, but only you can assess its suitability for your particular requirements.

We want to make that assessment as easy as possible, so trial accounts let you test overProof on your own content, obligation-free.

You need an account to upload files

Anyone can try overProof by using the Web demo form, but to upload files of OCR'ed text for processing, you'll need an account.

Get an account by emailing us at contact@projectcomputing.com and starting a discussion with us about:

your text sources (books, newspapers, ...), formats (plain, ALTO, hOCR, ...) and languages
your sample text for assessing overProof's effectiveness
your anticipated volumes and required turn-around times
any special requirements you may have.

Your account dashboard

Your account dashboard shows the processing status of your uploaded jobs. You can filter the results by processing status and by the contents of the job and comment metadata fields you provided when you created the job.

Processing

Uploading files

You need an account to upload files.

Files are uploaded using an HTTP binary transfer to supply the file, HTTP basic authentication to supply your account credentials, and HTTP headers to provide the metadata overProof requires.

Each upload is referred to as a job by overProof. You can check on the progress of your jobs using your account's dashboard, which is shown to you when you log in to the overProof website.

The required HTTP headers are:

X-CUSTOMER - your overProof account name
X-JOB - an identifier of the file meaningful to you. It could be the name of the file you are uploading. It is important that you can use this identifier to associate the job with content in your system so that you know how to process overProof's output file. Up to 128 characters in length.
X-COMMENT - a comment you choose to associate with this job. Up to 256 characters in length.
X-FORMAT - the format of the supplied file. Must be one of: plain, alto or hocr.
X-COMPRESSED - whether the supplied file has been compressed using gzip. Must be one of: y or n.

Here's an example of using curl to upload an ALTO file:

curl --user yourAccount:yourPassword \
  --header 'X-CUSTOMER: yourAccount' --header 'X-JOB: your job description' \
  --header 'X-COMMENT: your job comment'  --header 'X-FORMAT: alto' \
  --header 'X-COMPRESSED: y' -i \
  --data-binary '@/some/file/to/upload.gz' http://overproof.projectcomputing.com/supply

You don't have to use curl; you can use any tool or program to generate the simple stream required. Here's a sample stream to upload a file containing two simple lines:

POST /supply HTTP/1.1
Authorization: Basic eW91ckFjY291bnQ6eW91clBhc3N3b3Jk
User-Agent: myDemoProgram
Host: overproof.projectcomputing.com
Accept: */*
X-CUSTOMER: yourAccount
X-JOB: your job description
X-COMMENT: your job comment
X-FORMAT: plain
X-COMPRESSED: n
Content-Length: 18
Content-Type: application/x-www-form-urlencoded

ABC
one two three
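
As one illustration of generating this request without curl, here is a minimal Python sketch using only the standard library. The function names `build_upload_request` and `upload` are our own for this example and are not part of any overProof client library, and decoding the response as UTF-8 is an assumption. Building the request is kept separate from sending it so the headers can be inspected first.

```python
import base64
import urllib.request

SUPPLY_URL = "http://overproof.projectcomputing.com/supply"
VALID_FORMATS = {"plain", "alto", "hocr"}


def build_upload_request(account, password, job, comment, fmt, compressed, body):
    """Build (but do not send) the POST request described above.

    build_upload_request is a hypothetical helper name used for illustration.
    """
    if fmt not in VALID_FORMATS:
        raise ValueError("X-FORMAT must be one of: plain, alto or hocr")
    if len(job) > 128:
        raise ValueError("X-JOB is limited to 128 characters")
    if len(comment) > 256:
        raise ValueError("X-COMMENT is limited to 256 characters")
    # HTTP basic authentication: base64 of "account:password".
    token = base64.b64encode(f"{account}:{password}".encode("utf-8")).decode("ascii")
    headers = {
        "Authorization": "Basic " + token,
        "X-CUSTOMER": account,
        "X-JOB": job,
        "X-COMMENT": comment,
        "X-FORMAT": fmt,
        "X-COMPRESSED": "y" if compressed else "n",
    }
    # urllib normalises the capitalisation of header names (for example
    # "X-format"); HTTP header names are case-insensitive.
    return urllib.request.Request(SUPPLY_URL, data=body, headers=headers, method="POST")


def upload(request):
    """Send the request and return the response body as text (UTF-8 assumed)."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling `upload(build_upload_request("yourAccount", "yourPassword", "your job description", "your job comment", "plain", False, b"ABC\none two three\n"))` would send a request equivalent to the sample stream above.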

The response to the upload request will contain a unique job id generated by overProof which you may use to track the job and download the corrected file. Here's a sample response:

OK: 1382587314752121
Format: plain
Compressed: n
Job: your job description
Comment: your job comment
FileLength: 18
id: 1382587314752121

The job id in the above response is 1382587314752121, which may be provided as a parameter to the overProof showRequest URL:

http://overproof.projectcomputing.com/showRequest/{requestId}
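
If you are scripting uploads, the job id can be pulled out of the response and substituted into this URL. The Python sketch below assumes every response line has the "Name: value" shape shown in the sample response; the layout of error responses is not described here, so treat this as a starting point only. The helper names are our own.

```python
SHOW_REQUEST_URL = "http://overproof.projectcomputing.com/showRequest/{requestId}"


def parse_upload_response(text):
    """Split each 'Name: value' line of the response into a dictionary entry."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so values may themselves contain colons.
        name, separator, value = line.partition(":")
        if separator:
            fields[name.strip()] = value.strip()
    return fields


def show_request_url(fields):
    """Fill the job id from a parsed response into the showRequest URL."""
    return SHOW_REQUEST_URL.format(requestId=fields["id"])
```

For the sample response above, `show_request_url(parse_upload_response(response_text))` gives the showRequest URL ending in `/1382587314752121`.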

Using `curl` to upload multiple files in a directory

The following shell script will upload all files ending in .gz in the current directory (presumed to be gzip-compressed ALTO files), setting the X-JOB header to the file name, which provides a handy way to associate each file with its overProof job:

for f in *.gz; do
curl --user yourAccount:yourPassword \
  --header 'X-CUSTOMER: yourAccount' --header "X-JOB: $f" --header 'X-COMMENT: your job comment' \
  --header 'X-FORMAT: alto'  --header 'X-COMPRESSED: y' \
  -i --data-binary "@$f" http://overproof.projectcomputing.com/supply; done

Using `java` to upload files and retrieve results

Easy-to-use Java classes which upload files and retrieve results are available to account-holders.

Retrieving processing status and results

Account-holders have access to a web dashboard which displays the processing status of all their requests.

They may also use an HTTP REST API to programmatically retrieve processing status and output files.

Processed file formats

ALTO
The ALTO XML standard schema defines the <ALTERNATIVE> element for specifying alternative interpretations of OCR'ed text.
OverProof generates this element in accordance with the standard. For example, given the following snippet of ALTO to correct:
```
<String ID='S14' CONTENT='extensiveenquiry' HPOS='521' VPOS='831' WIDTH='258' HEIGHT='21'
     STYLEREFS='TS4' WC='0.94' CC='6 8 8 9 9 9 8 8 8 9 9 9 8 9 9 9'/>
```
overProof will generate the following ALTO:
```
<String ID='S14' CONTENT='extensiveenquiry' HPOS='521' VPOS='831' WIDTH='258' HEIGHT='21'
     STYLEREFS='TS4' WC='0.94' CC='6 8 8 9 9 9 8 8 8 9 9 9 8 9 9 9'>
     <ALTERNATIVE PURPOSE='overProof' CONTENT='extensive enquiry'/>
</String>
```
As shown, all <ALTERNATIVE> elements generated by overProof contain the standard PURPOSE attribute with a value of overProof.
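
To read these corrections back out of a processed ALTO file, something like the following Python sketch may be enough. It relies only on what the snippet above shows (a CONTENT attribute on both the String and its ALTERNATIVE child) and ignores XML namespaces by comparing local tag names; the function names are our own.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop any '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def overproof_corrections(alto_xml):
    """Return (original, corrected) pairs for every String corrected by overProof."""
    pairs = []
    for element in ET.fromstring(alto_xml).iter():
        if local_name(element.tag) != "String":
            continue
        for child in element:
            # Only ALTERNATIVE elements marked PURPOSE='overProof' are ours.
            if local_name(child.tag) == "ALTERNATIVE" and child.get("PURPOSE") == "overProof":
                pairs.append((element.get("CONTENT"), child.get("CONTENT")))
    return pairs
```

Strings without an overProof ALTERNATIVE are simply skipped, so the result lists only the words that were changed.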
hOCR
The hOCR format is based on HTML. OverProof corrects the contents of span elements with a class attribute equal to ocrx_word, and adds a CORRECTION attribute to corrected span elements, for example:
```
<span class='ocrx_word' id='xword_1_189' title='x_wconf -1'
    CORRECTION='extensive enquiry'>extensiveenquiry</span>
```
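
A short Python sketch of collecting these corrections with the standard library's HTML parser follows. It assumes each ocrx_word span contains only text (no nested spans), as in the example above, and the class and function names are our own.

```python
from html.parser import HTMLParser


class CorrectionCollector(HTMLParser):
    """Collect (original, corrected) pairs from corrected ocrx_word spans."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._pending = None  # correction waiting for its span's text
        self._text = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)  # HTMLParser lower-cases attribute names
        if (tag == "span" and attributes.get("class") == "ocrx_word"
                and "correction" in attributes):
            self._pending = attributes["correction"]
            self._text = []

    def handle_data(self, data):
        if self._pending is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._pending is not None:
            self.pairs.append(("".join(self._text), self._pending))
            self._pending = None


def hocr_corrections(hocr):
    """Return (original, corrected) pairs found in an hOCR document."""
    collector = CorrectionCollector()
    collector.feed(hocr)
    collector.close()
    return collector.pairs
```

Spans that overProof left unchanged carry no CORRECTION attribute and are ignored.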

Plain text

OverProof generates a plain text file with corrections "in place". For example:

Input	Output
, At 'a banquet giwn inthe Parliannent Hous'o,Mdbourno, to' tho Now South Walos legislEtors who, lately proceeded to Mclbauriie to plaj in a Parlianientorj Crickot NIatch, his Bxcelloiiey Sir Hcnry B. Locli, in rcturning tlianks for tho toast of hie health; said "the future of these ureat : Austrdian colonieswas hound up in ; the far wider and greater in tho namo oi Austral asia	At a banquet given in the Parliament House Melbourne, to the New South Wales legislators who, lately proceeded to Melbourne to play in a Parliamentary Cricket Match, his Excellency Sir Henry B. Loch, in returning thanks for the toast of his health said "the future of these great Australian colonies was bound up in the far wider and greater in the name of Australasia

Note: for plain text input, incoming "white space" formatting (such as new lines and paragraphs) is not preserved: all emitted words are separated by a single space. Use ALTO or hOCR if you need to preserve formatting.

Examples

Here's a list of sample articles from evaluation dataset 2 (SMH) and evaluation dataset 3 (U.S. Chronicling America). The first link in each line shows the uncorrected OCR and corrected text, with changed words highlighted. The second link shows the article (in NLA's Trove) or page PDF (in LoC's Chronicling America).

Raw OCR and corrected text	Source and date	Article in source corpus
Land Sale	Sydney Morning Herald, 14 Apr 1843	Trove
To the Editor	Sydney Morning Herald, 11 Jul 1861	Trove
Country Works	Sydney Morning Herald, 21 Jul 1864	Trove
Water Police Court	Sydney Morning Herald, 31 Mar 1871	Trove
Police	Sydney Morning Herald, 17 Oct 1884	Trove
Country News	Sydney Morning Herald, 12 Jan 1889	Trove
Salvation Army	Sydney Morning Herald, 11 Mar 1899	Trove
Antarctic Exploration	Sydney Morning Herald, 20 May 1901	Trove
Personal	Sydney Morning Herald, 30 May 1905	Trove
Explosion	Sydney Morning Herald, 01 Sep 1920	Trove
Quarter Sessions	Sydney Morning Herald, 20 Oct 1923	Trove
School Church	Sydney Morning Herald, 07 Apr 1926	Trove
Explosion at Ryde	Sydney Morning Herald, 19 Jul 1935	Trove
Visit of Americans	Sydney Morning Herald, 23 Jan 1936	Trove
Mothers' Day Appeal	Sydney Morning Herald, 28 Apr 1938	Trove
Supreme Court Divorce	Sydney Morning Herald, 21 Jan 1939	Trove
Bulgarian Reservists	Sydney Morning Herald, 26 Aug 1939	Trove
W H Donald Rescue	Sydney Morning Herald, 26 Feb 1945	Trove
£60,000 Estate	Sydney Morning Herald, 05 Feb 1952	Trove
Political	Cairo Daily Bulletin (Illinois), 21 Jun 1871	Chronicling America
Senate Proceedings (partial)	Cairo Daily Bulletin (Illinois), 21 Dec 1871	Chronicling America
Death of Samuel Morse	Cairo Daily Bulletin (Illinois), 5 Apr 1872	Chronicling America
Maui Notes	The Independent (Honolulu), 25 Jun 1895	Chronicling America
300 letters	The Independent (Honolulu), 21 Jun 1896	Chronicling America
Ladies of the Cannibal Isles	The Independent (Honolulu), 21 Jul 1896	Chronicling America
Great Britain of the East	The Independent (Honolulu), 21 Jun 1897	Chronicling America
Passenger Travel	The Independent (Honolulu), 21 Jun 1898	Chronicling America
Commencements	The Independent (Honolulu), 21 Jun 1898	Chronicling America
Roman mining	Mohave County Miner, 18 Jun 1887	Chronicling America
India copper	Mohave County Miner, 17 Jun 1899	Chronicling America
General Mining News	Mohave County Miner, 20 Jun 1903	Chronicling America
In Solemn Service	San Francisco Call, 1 Jan 1900	Chronicling America
Sparrows attack men	The Washington Times, 21 Jun 1921	Chronicling America
Moyer Prison Job	The Washington Times, 21 Jun 1921	Chronicling America
D.C. Rent Act	The Washington Times, 21 Jun 1921	Chronicling America

Details of our evaluation datasets and improvements in recall, false positives and text readability are discussed here.

Limitations

File size

The default compressed plain text, ALTO and hOCR file size limit is 20MB. Let us know if this is inadequate for your intended usage.
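
If you compress files before upload, a pre-flight check can catch oversized files early. The Python sketch below assumes the limit applies to the gzip-compressed bytes and that "20MB" means 20 × 1024 × 1024 bytes; both are our reading of the sentence above rather than a documented definition, and the function name is our own.

```python
import gzip

# Assumption: "20MB" is read here as 20 * 1024 * 1024 bytes.
SIZE_LIMIT_BYTES = 20 * 1024 * 1024


def prepare_upload_body(raw_bytes, limit=SIZE_LIMIT_BYTES):
    """Gzip the file contents and check the result against the size limit."""
    compressed = gzip.compress(raw_bytes)
    if len(compressed) > limit:
        raise ValueError(f"compressed size {len(compressed)} exceeds {limit} bytes")
    return compressed
```

A body prepared this way would be uploaded with the X-COMPRESSED header set to y.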

Languages and vocabularies

The current production version of overProof has been trained on a "general" vocabulary of the English language.

However, overProof's architecture is language- and vocabulary-neutral. If you need to correct texts in other languages or in specialist vocabularies (such as industrial patents or physical science papers), please contact us: we'd love to work with your data to extend overProof to meet your needs.

The impossibility of perfect correction, and implications for indexing corrected text

Correction can be neither complete nor perfect. Some source materials are of such a poor quality that even expert human correctors can only guess at the original text, even with reference to the image. Some text contains such atypical language that any statistical language approach to analysing it will fail to correct OCR errors within it, and will even introduce new errors: that is, rather than making it better, it will make it worse.

The development of overProof has focused on providing the best correction rate that can be achieved algorithmically, drawing on an immense training corpus and OCR error analysis, but errors are inevitable.

Hence a prudent approach to indexing text will index both the original and the correction. The annoyance of false positives arising from search hits on uncorrected text will typically be outweighed by improved recall, particularly as most corrected words were "nonsense" words before correction, which nobody searches for.
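
To make the idea concrete, here is a toy Python sketch of an inverted index that records both versions of each document. A real search platform would do this through its own field or analyser configuration; the tokenisation here (lower-casing, splitting on white space, trimming edge punctuation) is deliberately simplistic and is an assumption of the example.

```python
from collections import defaultdict


def build_index(documents):
    """Map each lower-cased word to the ids of the documents containing it.

    `documents` maps a document id to an (original_text, corrected_text) pair;
    words from both versions are indexed under the same document id.
    """
    index = defaultdict(set)
    for doc_id, (original, corrected) in documents.items():
        for word in (original + " " + corrected).lower().split():
            token = word.strip(".,;:'\"")
            if token:
                index[token].add(doc_id)
    return index
```

With this index, a search for "parliament" finds an article whose raw OCR read "Parliannent", while a search for the uncorrected form still matches in case the correction was wrong.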

Read more about how we have evaluated overProof's performance here.

Pricing

Pricing varies based on the number of words submitted for correction per month:

Words per month	Cost per month
Less than 1 million	$100
1 million to 10 million	$100 + $10 per million words
10 million to 100 million	$190 + $7 per million words
100 million to 200 million	$820 + $6 per million words
200 million to 1 billion	$1420 + $5 per million words
Over 1 billion	$5420 + $4.80 per million words
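
The base amounts in this table are consistent with each per-million rate applying only to the words above the start of its band: for example, $100 + $10 × 9 = $190 at 10 million words, and $190 + $7 × 90 = $820 at 100 million. Assuming that reading, which the table does not state explicitly, and assuming partial millions are charged pro rata, a monthly cost could be estimated with this Python sketch. Please confirm actual charges with us.

```python
from decimal import Decimal

# (band start in millions of words, base cost in USD,
#  USD per million words above the band start)
TIERS = [
    (0, Decimal("100"), Decimal("0")),
    (1, Decimal("100"), Decimal("10")),
    (10, Decimal("190"), Decimal("7")),
    (100, Decimal("820"), Decimal("6")),
    (200, Decimal("1420"), Decimal("5")),
    (1000, Decimal("5420"), Decimal("4.80")),
]


def monthly_cost(words):
    """Estimate the monthly cost in USD for a number of words (assumed model)."""
    millions = Decimal(words) / Decimal(1_000_000)
    # Pick the highest band whose start does not exceed the volume.
    start, base, rate = max(tier for tier in TIERS if tier[0] <= millions)
    return base + rate * (millions - start)
```

Under these assumptions, 50 million words in a month would cost $190 + $7 × 40 = $470.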

Real-time processing (very rapid turnaround), and highly secure processing (dedicated customer servers) services are available at additional cost.

Costs are quoted in US dollars, and include GST for Australian customers.

Paper

A peer-reviewed technical description of overProof, delivered to the DATeCH 2014 conference (Madrid, 19-20 May 2014), is available as a PDF. Slides and video of the presentation are available from the digitisation.eu blog entry: DATeCH 3rd Session - Postcorrection.

About Project Computing

Project Computing Pty Ltd - since 1983

We're an Australian-based software house. Over the past 30 years we've developed systems used by large commercial and government organisations around the world.

We've recently been heavily involved with the design and implementation of the immensely popular and award-winning Newspaper digitisation and Trove discovery systems at the National Library of Australia.

We specialise in large text processing systems and big data

We've been designing and implementing large-scale text searching systems since the 1980s. We also have long and deep experience with large mathematical models of the type used by overProof, which we can use to assist you with other text corpus processing such as quality analysis, vocabulary extraction, named entity extraction, preliminary search term extraction, sentiment analysis and visualisation.

You can find out more about us here.