TesseractOCR add-on


Rui Fontes
 

Hello!

Version 2022.05 of TesseractOCR has been released as the first public version...


• Authors: Rui Fontes rui.fontes@... and Angelo Abrantes ampa4374@...
• Updated on 28/05/2022

• Download stable version:

https://github.com/ruifontes/tesseractOCR/releases/download/2022.05/tesseractOCR-2022.05.nvda-addon

• Compatibility: NVDA version 2019.3 and beyond


*** Information

This add-on uses the free and open source Tesseract OCR engine to perform optical character recognition on an image file (PDF, JPG, TIF or other) without the need to open it.

It can also scan and recognize a paper document through a WIA-compatible scanner.

In the NVDA Preferences, a category named TesseractOCR is created, where you can set the language to be used for recognition and the type of document to be recognized.
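For readers curious about what "recognizing a file without opening it" can look like, here is a minimal sketch that calls the Tesseract command-line program from Python. It is not the add-on's actual code: the function names are hypothetical, and it assumes a `tesseract` executable is on the PATH and that the input is an image file Tesseract can read directly.

```python
import subprocess
from pathlib import Path


def build_command(image: Path, output_base: Path, language: str = "eng") -> list[str]:
    """Build the argument list for: tesseract IMAGE OUTPUTBASE -l LANG."""
    return ["tesseract", str(image), str(output_base), "-l", language]


def recognize(image: Path, output_base: Path, language: str = "eng") -> str:
    """Run Tesseract and return the recognized text.

    Tesseract appends ".txt" to the output base, so an output base of
    "ocr" produces the file "ocr.txt".
    """
    subprocess.run(build_command(image, output_base, language), check=True)
    return Path(str(output_base) + ".txt").read_text(encoding="utf-8")
```

With this sketch, `recognize(Path("page.png"), Path("ocr"), "por")` would write `ocr.txt` next to the working directory and return its contents, provided the Portuguese language data is installed.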


*** Shortcuts

The default commands are:

Windows+Control+r - to recognize the selected document;

Windows+Control+Shift+r - to scan and recognize a document through the scanner.

Then just wait for ocr.txt to open with the recognized text.

If you want to preserve the recognized text, don't forget to save the document under another name and in another location, as all files in the temporary directory are deleted at the start of the next OCR process!

These commands can be modified in the "Input gestures" dialog, in the "TesseractOCR" section.
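The warning about the temporary directory can be illustrated with a small sketch: copy the result somewhere safe first, because the working folder is emptied before the next recognition. The function names and folder layout below are hypothetical and are not taken from the add-on's source.

```python
import shutil
from pathlib import Path


def keep_copy(ocr_file: Path, destination: Path) -> Path:
    """Copy the recognized text to a location outside the temporary folder."""
    return Path(shutil.copy2(ocr_file, destination))


def clear_directory(folder: Path) -> int:
    """Delete every regular file directly inside folder.

    Returns the number of files removed. Subfolders are left untouched.
    """
    removed = 0
    for entry in folder.iterdir():
        if entry.is_file():
            entry.unlink()
            removed += 1
    return removed
```

Anything saved with `keep_copy` survives a later `clear_directory` call on the working folder, which mirrors the advice to save the document under another name and in another location.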


*** Automatic update

This add-on includes an automatic update feature. The check for a new version runs every time NVDA is loaded. If you want this, go to NVDA, Preferences, Options and, in the add-on category, check the check box.


*** Known problems

• This version only works on 64-bit Windows.
• When the "Various" option is selected in the "Documents type" combo box, the recognized text will probably appear with many blank lines. This is a known problem with Tesseract and, without consuming lots of processing time, I haven't yet found any solution. But I still haven't given up!


*** Languages supported

The supported languages in this version are: Afrikaans, Amharic, Arabic, Bulgarian, Burmese, Catalan/Valencian, Chinese simplified, Chinese traditional, Croatian, Czech, Danish, German, Dutch, English, Finnish, French, Galician, Georgian, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Kannada, Kirghiz, Korean, Latvian, Lithuanian, Macedonian, Nepali, Norwegian, Panjabi, Persian, Polish, Portuguese, Romanian/Moldavian, Russian, Serbian (Latin), Slovak, Slovenian, Spanish, Swedish, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese


*** Image types supported

This add-on supports the following types of files: PDF, jpg, tif, png, bmp, pnm, pbm, pgm, jp2, gif, jfif, jpeg, tiff, spix, webp
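As an illustration only, a check against the list above could be sketched like this. It assumes the decision is made from the file extension alone, which may not be how the add-on actually decides; the names are hypothetical.

```python
from pathlib import Path

# Extensions copied from the list of supported file types above.
SUPPORTED_EXTENSIONS = frozenset({
    "pdf", "jpg", "tif", "png", "bmp", "pnm", "pbm", "pgm",
    "jp2", "gif", "jfif", "jpeg", "tiff", "spix", "webp",
})


def is_supported(file_name: str) -> bool:
    """Return True when the file extension is in the supported list.

    The comparison ignores case, so "scan.PDF" and "scan.pdf" match alike.
    """
    extension = Path(file_name).suffix.lower().lstrip(".")
    return extension in SUPPORTED_EXTENSIONS
```

A file such as `letter.docx`, or an image embedded in an email that is not a file on disk at all, would fall outside this check.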


Best regards,

NVDA Portuguese team


Pranav Lal
 

Hi Rui,

1. Does tesseract have any advantages over the windows built-in OCR?
2. How is this add-on different from the NAO add-on?

Pranav


Rui Fontes
 

Hello!

Compared with NAO, it has the following differences:

1 - It uses Tesseract OCR instead of Windows OCR, and for me it is better...;
2 - It presents results differently: a TXT file instead of a specific interface;
3 - TesseractOCR does not perform OCR on the screen;
4 - TesseractOCR can perform OCR on a paper document through a WIA-compatible scanner.

It is a large add-on because it needs to include the files required to recognize all the supported languages...

Best regards,

Rui Fontes
NVDA Portuguese team


Dan Beaver
 

I tried it and couldn't get it to work for my needs.  I receive an email whenever I have class one mail coming.  If there is class one mail coming to my mailbox then the email includes images of scans of those mail items.  They are included in the email and not attached.


Using the Windows OCR mechanism, it reads what it can from those images. Using this add-on, it says it is an unsupported format, or something to that effect.


Any idea why it won't read these images?


Dan Beaver

Dan Beaver (KA4DAN)


Rui Fontes
 

For images in emails or in other types of windows, you must use the NVDA OCR feature or NAO...

TesseractOCR only performs OCR on image files (PDF, JPG, TIFF and so on) or on a paper document through a scanner...


Rui Fontes


nvdainth@...
 

Hi Rui Fontes

I have tried your add-on and found that it works well. But I have some suggestions and some bugs I'd like to tell you about.

1. Bug: it can't remember the config values.
When NVDA is restarted, your add-on's config values always reset to the defaults.

2. I find that you use the images folder as a cache for the add-on's functionality.
It would be nice if the folder were cleaned regularly, for the privacy of user data.

3. I find that tessdata makes the add-on large, and most users may only need a few languages. This means that not all language options are required.
So, would it be possible for the add-on to offer a language option that lets the user choose a language and then download and install it, without including all the language tessdata files in the add-on's installer?

Thank you for your development work. It is of great benefit to the community.


Rui Fontes
 

Hello!


Thanks for your words!

1 - I am going to check, but I think that if you make changes and save them, or have NVDA save changes on exit, they are preserved...


2 - The images folder is cleaned each time you start an OCR process...


3 - I have to check if I still have space on the server to store more data... Or if the files are stored anywhere on GitHub...


Rui Fontes



Pranav Lal
 

Hi Rui,

I hear you and thanks for the comparison.

Pranav


Ravindran V.S.
 

Hi Rui,
I have tried this add-on and still could not get the OCR to work with it.
If this list permits, I would like to discuss the usage of this OCR add-on here; if not, I would like to have a chat off-list, please.
Let me know when is convenient for you.
Thanks,
Ravi. ..



Kakarla Nageswaraiah
 

Hello,
I didn't find the Telugu language.
Will it be added in the future?
Thanks and regards.

--
కాకర్ల నాగేశ్వరయ్య

K. Nageswaraiah


Rui Fontes
 

I do not have any idea...


Rui Fontes


Mobeen Iqbal
 

Hello Rui.

Thanks for this add-on. I have been trying to get it working, but can't seem to get my scanner to scan. Does it support TWAIN scanners, or does it just support WIA devices? Also, I can't find anywhere to specify which scanner to use, as my system has more than one. I would like to use a TWAIN scanner if possible. Do you know of an add-on which allows the use of a TWAIN scanner for OCR?

Very best wishes,

Mo.



Rui Fontes
 

Hello!


You can try this one:

https://www.dropbox.com/s/n78xf7022gdbsqr/NAPS2TesseractOCR_2022.06.nvda-addon?dl=1


It contains only a few OCR languages, but you can import the ones you need through the configuration panel included in the NVDA settings.

If it does not detect your scanner automatically, try the following:

1 - Go to %appdata%\nvda\addons;

2 - Go to NAPS2TesseractOCR, globalPlugins, NAPS2TesseractOCR, naps2-6.1.2-portable;

3 - Run the app NAPS2.Portable.exe;

4 - Tab until Main Menu;

5 - Arrow right to Profiles and press Enter;

6 - Tab until Edit and press Enter;

7 - Shift+Tab until Driver WIA;

8 - Arrow down to Driver TWAIN;

9 - Tab until Select device, press Enter and select your device;

10 - Tab until Ok and press Enter;

11 - Tab until Finish and press Enter.
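If navigating through the folders in steps 1 to 3 is tedious, the full path can be assembled in one go. The sketch below only joins the folder names given in those steps; the function name is hypothetical, and it assumes the add-on was installed under the default %appdata% location.

```python
import os
from pathlib import PureWindowsPath
from typing import Optional


def naps2_executable(appdata: Optional[str] = None) -> PureWindowsPath:
    """Build the path to NAPS2.Portable.exe inside the add-on folder.

    When appdata is not given, the APPDATA environment variable is used.
    """
    root = appdata if appdata is not None else os.environ.get("APPDATA", "")
    return PureWindowsPath(
        root,
        "nvda",
        "addons",
        "NAPS2TesseractOCR",
        "globalPlugins",
        "NAPS2TesseractOCR",
        "naps2-6.1.2-portable",
        "NAPS2.Portable.exe",
    )
```

The resulting path can be pasted into the Run dialog (Windows+R) to start NAPS2 directly and continue from step 4.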

Best regards,

Rui Fontes
NVDA Portuguese team



Rui Fontes
 

Hello!


The new version already has it (the Telugu language)!

https://github.com/ruifontes/tesseractOCR/releases/download/2022.06.27/tesseractOCR-2022.06.27.nvda-addon

Changes:

- Updated Tesseract from version 5.0 Alpha (64-bit) to 5.1 (32-bit);
- Added several more recognition languages;
- Introduced the option to select a second language to be used in the OCR of documents with multiple languages, and a button to forget it;
- Introduced a new document type, "With auto-orientation", that allows the OCR engine to rotate the image as necessary;
- Introduced beeps to signal that the add-on is working;
- Corrected code to prevent the download languages combo box from not being populated;
- Corrected a problem with controlTypes roles preventing compatibility with NVDA 2020.4;
- Added Russian translation.
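For anyone wondering how a second language and auto-orientation can be expressed on the Tesseract command line, here is a hedged sketch. Tesseract accepts several languages joined with "+" in its `-l` option, and page segmentation mode 1 enables orientation and script detection. Whether the add-on maps "With auto-orientation" to `--psm 1` is an assumption, and the function names are hypothetical.

```python
from typing import Optional


def language_argument(primary: str, secondary: Optional[str] = None) -> str:
    """Join one or two Tesseract language codes, for example "eng+por"."""
    if secondary and secondary != primary:
        return f"{primary}+{secondary}"
    return primary


def tesseract_options(
    primary: str,
    secondary: Optional[str] = None,
    auto_orientation: bool = False,
) -> list[str]:
    """Build the extra command-line options for a recognition run."""
    options = ["-l", language_argument(primary, secondary)]
    if auto_orientation:
        # Assumption: "With auto-orientation" corresponds to --psm 1
        # (automatic page segmentation with orientation and script detection).
        options += ["--psm", "1"]
    return options
```

"Forgetting" the second language then simply means passing `None` for it, so only the primary code is sent to the engine.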

Best regards,

Rui Fontes
NVDA Portuguese team



mk360
 

Hi,

Does the change to Tesseract 5.1 32-bit mean that it will work on 32-bit systems?

Regards,

mk.


Rui Fontes
 

Yes, that is the main reason for the change to 32-bit...


Best regards,

Rui Fontes
NVDA Portuguese team

