Re: new addon: NVDA Advanced OCR.


Sarah k Alawami
 

Actually, I have used one thing that did create the correct tables and charts in a scanned pdf that my teacher gave me. The program is no more now, but I successfully used it to write and cite correct page numbers from these scanned pdfs, and some were quite old. I dunno how it did it, but it tagged everything correctly. Ok, it didn’t describe the images that were present in the scans, but I didn’t need that at the time, just the text, which was perfect. All headings were correct, tables, charts, etc. If anyone is interested, the app was called docuscan. It has now been replaced by I forget what, but it just came out this year and I get the feeling I might need to use this in my studies over the next few years.

 

From: nvda@nvda.groups.io <nvda@nvda.groups.io> On Behalf Of Brian Vogel
Sent: Wednesday, December 8, 2021 1:52 PM
To: nvda@nvda.groups.io
Subject: Re: [nvda] new addon: NVDA Advanced OCR.

 

On Wed, Dec 8, 2021 at 03:46 PM, Gene wrote:

But without experience of a large number of PDF documents, I wouldn’t assume that.

 

-
Gene, I can say, with complete honesty, that I cannot count the number of PDF documents I've dealt with, and in the context of a screen reader.  The general hierarchy of accessibility has been:

1. Image Scanned - Inaccessible unless OCRed, and if OCRed, how well that works depends heavily on when the OCR was done.

2. OCR processed by something designed to do so -  If it's a fairly modern OCR engine, things like columnar text are generally handled with very good flow.  If it was an early OCR engine, not so much.  Document will not have, to quote Mr. Moxley, "proper heading structure, table structures (with appropriately marked headers), accessible links, alt text etc."  OCR engines are generally not that sophisticated, though most can detect tables these days and set them up as such.

3. Created as PDF in a PDF Editor or MS-Word:  100% basic accessibility, but not necessarily "prettified" with all of the above noted features.  I've created quite a few tutorials in MS-Word that I've then saved as PDF that are one to maybe three pages long, and step by step, and I certainly never go to that level of elaboration because of what the content is and how it's to be accessed.  People creating things like church bulletins, flyers, and lots of other simple documents that are often of a "read once and then done" nature are unlikely to ever do so, either.

4. Created as PDF in a PDF Editor and of significant length, and intended for publication and/or a long archival life:  100% maximally accessible with all the features Mr. Moxley noted.

The fact of the matter is that I don't disagree with him, one bit, about what needs to be done to create a maximally accessible PDF if one is creating it from scratch and it is of any significant length.  What I do disagree with is that this is necessary for the vast majority of very short PDFs out there that may or may not have been created as such.

When it comes to PDFs, and particularly PDFs of unknown origin, it's completely unrealistic to call them inaccessible if they don't have the prettification.  I have scanned, and with OCR scanning at the time of scan, things like owner's manuals and service manuals that are hundreds of pages long.  They will not ever have all of the prettification because it's just not possible, but their text content is complete, and searchable.  That's accessible, and in most instances way more than just minimally accessible.

It's way faster for me to find what I'm looking for in these scanned PDFs because they are searchable than it is to find it using the source material, as often certain bits of information are put where you really wouldn't expect to find them and there's nothing in the table of contents nor index or indices that would indicate that.  But if you know the term you're looking for, you can blaze through hundreds of pages very quickly using search functionality.  That's accessible whether you're doing this the sighted way or using a screen reader to do the same thing.  It may not be as nice as it would be had the source material been created as PDF, but there will never come a time where every PDF started out life that way nor where whatever was used to OCR it could possibly produce something with all the features characteristic of a PDF born as PDF.
 
There's basic accessibility and publisher-layout-quality accessibility.  They're not the same thing.  We should, of course, constantly encourage the use of publisher-layout-quality with regard to accessibility where such is warranted.  My 2-page flyer for next week's picnic, as a fictional example, would not be one of those times.  If it's entirely readable, in the expected order, that's good enough.

The perfect should never be the enemy of the good.
--

Brian - Windows 10, 64-Bit, Version 21H1, Build 19043  

The difference between a top-flight creative man and the hack is his ability to express powerful meanings indirectly.

         ~ Vance Packard

 
