Tesseract integration - hacking around

tesseract

#1

Hi everyone,

I needed to OCR many small images (text fields), so ABBYY's limit of 10k OCR operations was not enough.
Fortunately, I managed to replace the command called by the OCR REST service so that it executes my batch script instead.

What you need to do:

  1. change application.properties of OCR to:
#ocr.task.command=["..\\\\Applications\\\\java\\\\bin\\\\java","-Xmx1024m","-jar","ocr-task.jar","$CONFIG"]
ocr.task.command=["ocr-task.bat","-Xmx1024m","-jar","ocr-task.jar","$CONFIG"]
  2. create ocr-task.bat
@set PATH=%PATH%;c:\progs\cygwin64\bin;c:\progs\GnuWin32\bin;
@more %4 | sed -r "s/.*images.*\[(.*)].*/\1/" | sed -r "s/\\\\/\\/g" > img_path.txt
@set /p IMG=<img_path.txt
@copy %IMG% img_data.bmp
@c:\progsl\Tesseract-OCR\tesseract.exe --psm 7 --oem 1 %IMG% stdout > img_data.txt
@copy sample.json_result %4_result
  3. create default json response (sample.json_result)
{ "ready":"true", "message":"OK", "result":["c:\\RPAExpress\\OCR\\img_data.txt"] }
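To make the two sed steps in ocr-task.bat easier to follow, here is a minimal Python sketch of the same extraction: take the text inside the brackets on the line that mentions "images", then collapse doubled backslashes into single ones. The function name and the sample config line are assumptions for illustration only; the real layout of the $CONFIG file is not shown in this thread. Unlike sed, which passes non-matching lines through unchanged, this sketch returns only the first matching line.

```python
import re


def extract_image_path(config_text: str) -> str:
    """Mimic the sed pipeline: capture the [...] content on the 'images'
    line, then turn JSON-escaped '\\\\' into a single backslash."""
    for line in config_text.splitlines():
        # Same pattern as the first sed call: .*images.*\[(.*)].*
        match = re.match(r".*images.*\[(.*)].*", line)
        if match:
            # Same effect as the second sed call: s/\\\\/\\/g
            return match.group(1).replace("\\\\", "\\")
    raise ValueError("no 'images' entry found in config")


# Hypothetical config line (assumption, not taken from the product).
sample = r'{"images":["c:\\RPAExpress\\OCR\\input.bmp"]}'
print(extract_image_path(sample))
```

Note that the surrounding double quotes are kept in the result, just as they are in the batch version, where they conveniently quote the path passed to copy and tesseract.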

You need to install:
Tesseract 4
GnuWin32 sed

Both are downloadable.

Hope this helps until we get a supported feature from the WorkFusion team.


#2

@pmisiewicz,

Thanks for your feedback and ideas! Did you manage to get Tesseract's result into the recorder variable?

After doing the actions you described, I’ve got the following files in my OCR folder:

But the OCR action fails and the OCR result variable is empty.

The Tesseract result (img_data) is also far from perfect.


#3

Yes, I managed to get results. It worked on 1.1.3, so if the JSON interface is similar it should still work.

First you should verify that you can open the image, and that the Tesseract command from ocr-task.bat works correctly.

If it does, the rest is just serving JSON that points to the results file.


#4

And what about Tesseract quality? Have you trained this engine?


#5

I wanted to use the out-of-the-box models just to recognize numbers (money amounts). I hope it works well at large scale.

ABBYY also had some problems, e.g.:

  • confusing 1 with |
  • confusing 3 with 6 (sometimes)

I hope I will have more results regarding quality in 1-2 weeks if I solve some other issues.


#6

For the hack to work in 1.1.4, you need to change sample.json_result to contain "status":"SUCCEEDED":

{ "ready":"true", "status":"SUCCEEDED", "message":"OK", "result":["c:\\RPAExpress\\OCR\\img_data.txt"] }
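Instead of copying a fixed sample.json_result, the response could also be generated per task. The following is only a sketch under assumptions: the field names are the ones shown above, the "<config>_result" naming mirrors the `copy sample.json_result %4_result` line of the batch script, and the function name and demo file name are made up for illustration.

```python
import json
import os
import tempfile


def write_result(config_path: str, text_path: str) -> str:
    """Write '<config_path>_result' pointing at the OCR text file."""
    payload = {
        "ready": "true",
        "status": "SUCCEEDED",  # reported above as required in 1.1.4
        "message": "OK",
        "result": [text_path],
    }
    result_path = config_path + "_result"
    with open(result_path, "w", encoding="utf-8") as fh:
        # json.dump escapes the backslashes in Windows paths for us.
        json.dump(payload, fh)
    return result_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cfg = os.path.join(tmp, "task.json")  # hypothetical config name
        out = write_result(cfg, r"c:\RPAExpress\OCR\img_data.txt")
        with open(out, encoding="utf-8") as fh:
            print(fh.read())
```

Generating the file this way avoids hand-escaping backslashes and makes it easy to point each task at its own text file.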

#7

I have run some more tests on Tesseract, and I must say that the quality of number detection is terrible: around 0% correct. ABBYY on the same images is around 90-100% correct, sometimes missing commas or dots in numbers.

Unfortunately, ABBYY takes at least 4-6 seconds to process one number, compared to around 1 second for Tesseract (after some additional tweaks). I will try to make Tesseract work, since there are lots of config params to modify … or I will find some OCR tuned to detecting numbers … we shall see.


#8

In the end it turns out Tesseract is not that bad :)
I got almost 100% correct after scaling the initial image up to 3 times its size (300%); I used ImageMagick for that.

The only issues were:

  • 0 was always recognized as I (probably because the inside of the 0 looks like an I)
  • commas were sometimes recognized as dots on an image where the letters were dark gray on light gray
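For fields that are known to contain only numbers, a confusion like 0 being read as I could in principle be patched up after OCR. This is a hedged sketch and not something tested in this thread: the substitution table is an assumption based on the confusions mentioned here (I for 0, | for 1), and it would corrupt any field that legitimately contains letters. The comma/dot confusion cannot be repaired this way, since both characters are valid in a number.

```python
# Assumed substitutions, only safe for purely numeric fields.
CONFUSIONS = str.maketrans({"I": "0", "|": "1"})


def normalize_numeric(ocr_text: str) -> str:
    """Strip surrounding whitespace and undo the assumed confusions."""
    return ocr_text.strip().translate(CONFUSIONS)


print(normalize_numeric("1I5,2I\n"))  # prints 105,20
```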

The final ocr-task.bat I used for that was:

@set PATH=C:\progsl\imagick;c:\progs\GnuWin32\bin;%PATH%;
@more %4 | sed -r "s/.*images.*\[(.*)].*/\1/" | sed -r "s/\\\\/\\/g" > img_path.txt
@set /p IMG=<img_path.txt
@copy %IMG% img_data.png
@C:\progsl\imagick\convert.exe img_data.png -resize "300%%" -threshold "50%%" img_data_v01.png
@c:\progsl\Tesseract-OCR\tesseract.exe --psm 7 --oem 3 img_data_v01.png stdout > img_data.txt
@copy sample.json_result %4_result
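For anyone driving the same two steps from a script rather than a batch file, here is a rough Python sketch. The flags are the ones from the batch file above; the executable names `convert` and `tesseract` being on PATH, the function names, and the default file names are assumptions. Note that the `%%` doubling is only needed inside a .bat file, so plain `300%` and `50%` are passed here.

```python
import subprocess


def build_commands(src, scaled="img_data_v01.png",
                   convert="convert", tesseract="tesseract"):
    """Return the ImageMagick and Tesseract argument lists."""
    # Upscale to 300% and binarize at 50%, as in the batch script.
    convert_cmd = [convert, src, "-resize", "300%", "-threshold", "50%", scaled]
    # --psm 7 treats the image as a single text line; output goes to stdout.
    ocr_cmd = [tesseract, "--psm", "7", "--oem", "3", scaled, "stdout"]
    return convert_cmd, ocr_cmd


def run_ocr(src):
    """Run both steps and return the recognized text (needs both tools installed)."""
    convert_cmd, ocr_cmd = build_commands(src)
    subprocess.run(convert_cmd, check=True)
    done = subprocess.run(ocr_cmd, check=True, capture_output=True, text=True)
    return done.stdout


if __name__ == "__main__":
    # Only print the commands; call run_ocr() once the tools are installed.
    for cmd in build_commands("img_data.png"):
        print(" ".join(cmd))
```

Keeping the command construction separate from execution makes it easy to try other resize or threshold values and compare the results.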

You should experiment to find the image transformations that work best for your own images …


#9

I tried this but was getting an error. Are there any steps to be performed on the RPA recorder side when including the OCR action library?