OCR Solution


Vlad Dragomir
 

Dear all,

I need to find a way to transform image PDF files into text. These are mostly books that were scanned a while ago, but I haven't had the time and patience to find a solution. Now I really need to do something about this, for professional reasons.

I found this Windows 10 app called KNFB Reader, which seems to do exactly that. However, since this is a rather expensive app, I'd like to ask those who have already used it a few things if I may:
1. How does this app deal with multilingual documents? Is it possible to choose two or more recognition languages for the same book? Most of my books are manuals. I am a language teacher and it is very important that both languages used in a book be accurately detected.
2. Is formatting retained, at least in part?
3. Are there any alternatives to this application? It seems to be the only accessible solution, but I might be wrong.

I would be very grateful if anyone could help me with this.

Most sincerely,

Vlad


 

Hi,

Try robobraille.org. This is a Danish service for converting documents to MP3, Braille and text. I don't know whether you need to recognize Russian letters or the like, but that is also possible with this browser app. It uses Tesseract as the engine for recognizing the text in the document.

Greetings, Wolfram


Vlad Dragomir
 

Many thanks, I do appreciate your answer. Unfortunately, this won't work in my case, since the site only accepts files that do not exceed 64 MB. But I do want to thank you for taking the time to get back to me.

Cheers,

Vlad.


Christo de Klerk
 

Hello Vlad

I bought the KNFB Reader app, but I am rather disappointed with it. It does not at all retain any semblance of formatting. It throws all the text into one ginormous paragraph. I don't think this is what you want.

Maybe you can look at the Office Lens app which is free and available for Windows 10 as well as for smart phones. It retains formatting. It supports language switching and if you are using a synthesiser which also supports language switching, different languages will be spoken correctly. The OCR quality is quite good. Even though it is free, it is better in my opinion than KNFB in every respect except for one: KNFB does the OCR on your local machine or mobile device, while Office Lens does it in the cloud. But I don't think this advantage KNFB has justifies the price. Give Office Lens a bash and see if it will work for you. It really is the best option that I am aware of.

If you can use a mobile device for your OCR, you could also try Prizmo Go which is also free and retains formatting. I am quite impressed with it.

Kind regards

Christo



Antony Stone
 

Given that robobraille.org's OCR system is based on Tesseract, you might be
able to get good enough results simply by running Tesseract on your own
machine.

There are two versions available for Windows which can be downloaded from
https://github.com/tesseract-ocr/tesseract/wiki/Downloads

I'm not recommending it, I have no idea how accessible it is, and I haven't
used it myself, but I would think it's worth a try to see if it does what you
need - otherwise just uninstall and carry on looking elsewhere...


Antony.

--
"A person lives in the UK, but commutes to France daily for work.
He belongs in the UK."

- From UK Revenue & Customs notice 741, page 13, paragraph 3.5.1
- http://tinyurl.com/o7gnm4

Please reply to the list;
please *don't* CC me.


Vlad Dragomir
 

Thanks a lot Christo, I didn't know that Word could do that. I'll definitely try that!

Cheers,

Vlad.


Vlad Dragomir
 

Thanks, anything's worth trying for sure.

Kindest regards.

Vlad.


JM Casey <crystallogic@...>
 

Abbyy Finereader is not too expensive (around $200 US, as I recall, for home users) and will do this. I'm probably going to buy it soon myself. All reports I have had suggest it is fairly accessible, with some caveats that can be worked around.



Christo de Klerk
 

Hi Vlad

No, it is not Word. Office Lens is a standalone app, not part of the MS Office suite.

I am just not sure whether you can feed it a PDF, which seems to be your requirement. You can send image files to it, but I am not sure about PDF. But have a look at it.

Kind regards

Christo



Christo de Klerk
 

I have Abbyy FineReader and I think it would be perfect for Vlad's purposes if he feels he can afford it. I think its quality is of the very best and it converts image PDFs flawlessly. I often use it for that purpose.

Kind regards

Christo






 

This question has also been asked on the Win10 for Screen Reader Users forum, and similar answers were shared there.

I'm now beginning to think that, if large volumes of files are involved, the Tesseract command-line version would be best, with a script that just keeps looping through all the files in a given folder that have matching combinations of languages and running OCR on them. Using a GUI with any significant number of files gets really tedious really quickly.
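To make the looping idea concrete, here is a minimal Python sketch, not a tested workflow. It assumes Tesseract is installed and on the PATH, that the pages of each image PDF have already been exported as image files (as far as I know, Tesseract reads images rather than PDFs), and that the scans are sorted into one sub-folder per language combination, named with Tesseract language codes joined by "+", for example scans/eng+fra. The folder names and helper names are made up for illustration.

```python
import subprocess
from pathlib import Path
from typing import List

# Hypothetical layout: scans/<language codes>/<page image>,
# e.g. scans/eng+fra/page001.png
IMAGE_SUFFIXES = {".png", ".tif", ".tiff", ".jpg", ".jpeg"}


def build_command(image: Path, languages: str, out_dir: Path) -> List[str]:
    """Return the tesseract command line for one page image."""
    # Tesseract appends ".txt" to the output base name by itself.
    output_base = out_dir / image.stem
    return ["tesseract", str(image), str(output_base), "-l", languages]


def ocr_folder(root: Path, out_root: Path) -> None:
    """Loop over every language sub-folder and OCR each image in it."""
    for lang_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        out_dir = out_root / lang_dir.name
        out_dir.mkdir(parents=True, exist_ok=True)
        for image in sorted(lang_dir.iterdir()):
            if image.suffix.lower() in IMAGE_SUFFIXES:
                # The folder name doubles as the language list, e.g. "eng+fra".
                command = build_command(image, lang_dir.name, out_dir)
                subprocess.run(command, check=True)


if __name__ == "__main__":
    ocr_folder(Path("scans"), Path("text"))
```

Started once, this can be left to run unattended; each page ends up as a .txt file under text/<language codes>/.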

--

Brian - Windows 10 Home, 64-Bit, Version 1803, Build 17134  

     Explanations exist; they have existed for all time; there is always a well-known solution to every human problem — neat, plausible, and wrong.

          ~ H.L. Mencken, AKA The Sage of Baltimore

 

 


Nika Tsiklauri
 

Hello! From my experience, some GUI apps start crashing whenever they see large files. I also think that the command-line version would be wonderful. There is an issue, though: there is no official support for the latest version of Tesseract OCR on Windows. Based on my experience, previous versions have terrible recognition quality.
 Best wishes, 
Nick.


Vlad Dragomir
 

I definitely will, thank you!

Regards,

Vlad.


 

On Sun, May 27, 2018 at 11:25 am, Nika Tsiklauri wrote:
latest version of Tesseract OCR  on windows
What are you considering "the latest version"? I can find Windows binaries, as well as installers, for v3.5.1 plus 4.0 alpha here:

                  https://github.com/tesseract-ocr/tesseract/wiki/Downloads

and, following one of the links there to UB Mannheim:  https://github.com/UB-Mannheim/tesseract/wiki 
 


Nika Tsiklauri
 

Thank you, Brian. I’ll definitely try this.
 Best wishes, 
Nick.


Brian's Mail list account <bglists@...>
 

The problem with all of these, including of course the Windows 10 OCR and the add-on for NVDA, is that it's extremely time consuming. It would be really neat if you could just present something with a series of PDFs, or one big one, and come back later to a workable file of text, even if the graphics and diagrams were missed out.
I have not seen this on anything thus far, yet one might suppose those using PDF files might like to have this as a built-in process. Maybe it is in some pro version of the Adobe suite; I don't know. Certainly FineReader is used a lot by other OCR programs. My one issue for English is that i and l can get muddled.
I have a friend called Della, and it always makes her Delia, and a nearby place called New Malden becomes New Maiden. Though this can be funny, it is also a bit irritating!

Brian

bglists@blueyonder.co.uk
Sent via blueyonder.
Please address personal E-mail to:-
briang1@blueyonder.co.uk, putting 'Brian Gaff'
in the display name field.



Brian's Mail list account <bglists@...>
 

There is a new version of this just out. However, some say it actually works better on phones than on the PC, which is odd, as on a phone it's getting a photo and on the PC, one assumes, it's a scanned-in image already.
Brian





 

Any OCR engine requires "passable copy" in order to achieve anything near to reasonable accuracy.  Part of that, and a part that many overlook, is that you need a 300 dpi scan (minimum, it can be denser) to have "passable copy."  There are utilities that "up convert" things scanned at less than 300 dpi to 300 dpi in preparation for OCR processing.
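As a rough illustration of the 300 dpi point, here is a minimal Python sketch of the arithmetic behind "up converting" a scan. The function name and the example page size are made up for illustration. Keep in mind that resampling cannot restore detail the scanner never captured; it only gives the OCR engine larger glyphs to work with.

```python
TARGET_DPI = 300  # the minimum suggested above; denser scans are left alone


def upscale_size(width_px: int, height_px: int, source_dpi: float,
                 target_dpi: float = TARGET_DPI):
    """Return the pixel size an image needs to reach target_dpi."""
    if source_dpi <= 0:
        raise ValueError("source_dpi must be positive")
    if source_dpi >= target_dpi:
        # Already dense enough: no resampling needed.
        return width_px, height_px
    factor = target_dpi / source_dpi
    return round(width_px * factor), round(height_px * factor)


if __name__ == "__main__":
    # An A4 page scanned at 150 dpi is roughly 1240 x 1754 pixels.
    print(upscale_size(1240, 1754, 150))
```

An image editor or command-line image tool would then resample the page to the computed size before it is handed to the OCR engine.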

For those that are Linux or OSX users as well as Windows 10 users there is an already extant utility for batch processing PDF files in particular via scripting so that you don't have to attend to it.  See the website for OCRmyPDF.  One could probably get Windows executables for most of the stuff there as well, but I'd far rather use the extant scripting, and modify it, from Linux running from USB drive than reinvent the wheel (though one would have to use the sudo commands to install the needed packages each time if running from a USB drive).  For anyone who wants some really, really in depth information about preparing images (and "deconstructing" image PDFs) for OCR processing see:  How to Digitize Texts with Open-Source Command-Line Optical Character Recognition (OCR) Software
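For the unattended batch processing described here, a minimal Python sketch along these lines might do. It assumes the ocrmypdf command is installed and on the PATH together with the Tesseract language packs you need. The folder names, the eng+fra language pair and the helper names are placeholders, not something taken from the OCRmyPDF site. The --sidecar option asks OCRmyPDF to write the recognized text to a separate plain-text file next to the searchable PDF.

```python
import subprocess
from pathlib import Path
from typing import List


def build_ocrmypdf_command(pdf: Path, out_dir: Path,
                           languages: str = "eng+fra") -> List[str]:
    """Return the ocrmypdf command line for one image PDF."""
    searchable = out_dir / pdf.name          # PDF with a text layer added
    sidecar = out_dir / (pdf.stem + ".txt")  # plain-text copy of the OCR result
    return ["ocrmypdf", "-l", languages, "--sidecar", str(sidecar),
            str(pdf), str(searchable)]


def ocr_all_pdfs(in_dir: Path, out_dir: Path, languages: str = "eng+fra") -> List[Path]:
    """OCR every PDF in in_dir and return the files that failed."""
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for pdf in sorted(in_dir.glob("*.pdf")):
        result = subprocess.run(build_ocrmypdf_command(pdf, out_dir, languages))
        if result.returncode != 0:
            # Keep going so one bad scan does not stop the whole batch.
            failed.append(pdf)
    return failed


if __name__ == "__main__":
    for pdf in ocr_all_pdfs(Path("books"), Path("ocr_out")):
        print("failed:", pdf)
```

Because failures are collected rather than raised, the script can be started and left alone, and the list printed at the end shows which books need another look.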