OCR loop through pdf files

Hi,

I am new to WorkFusion RPA. I have a use case: copying data from multiple .pdf files in a loop and transferring it to another website through a login screen.

Please help me with this.

Thanks,

Khalid Ansari

Hi @khalid3k,

There are several threads on the forum on this topic. Here are some links, hope they will be helpful :wink:

Hi Expert,

Is it possible to loop through PDF files, extract data from each one, and save it in Excel columns?

I need to read data from PDF files that are kept in a folder. I use Get Folder Contents to loop through the files. Inside the loop, I can read data from the first file using OCR, but not from the second. I want to extract fields such as name and address from each PDF and save them in Excel. I can already do this for a single PDF file; I need to do it for multiple files.

Could you please guide me through the steps?

Thanks,

Khalid Ansari

Why can't it read the data from the second file? Does it show an error?

Hi,

No, it is not giving any error.

I have attached my flow.

Thanks,

Khalid Ansari

Before the OCR action, you need to add steps to open the PDF file from which you will read the information, like:

  1. press Win + R
  2. type in the path to the file
  3. press Enter

Also, you can put all the OCR actions one after the other, and once the bot has read all the required information, open the Excel file and populate it with the data from the variables. This is more efficient: the bot does not need to switch between OCR and Excel actions all the time, and it only needs to open the Excel file once.
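To show the control flow I mean, here is a minimal Python sketch. It is not WorkFusion code: `list_pdfs`, `collect_rows`, `write_rows` and `read_fields` are hypothetical names, `read_fields` stands in for your "open the file + OCR actions" steps, and a CSV file stands in for the Excel sheet. The point is only the order of operations: read every file first, then write everything once.

```python
import csv
from pathlib import Path


def list_pdfs(folder):
    """Return the PDF files in `folder`, sorted by name."""
    return sorted(p for p in Path(folder).iterdir() if p.suffix.lower() == ".pdf")


def collect_rows(pdf_paths, read_fields):
    """Call `read_fields` once per file and keep every result in memory."""
    rows = []
    for path in pdf_paths:
        # Hypothetical stand-in for "open this PDF, then run the OCR actions".
        fields = read_fields(path)
        rows.append({"file": path.name, **fields})
    return rows


def write_rows(rows, out_path, columns):
    """Open the output file once and write one row per PDF."""
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=columns)
        writer.writeheader()
        writer.writerows(rows)
```

In the recorder, the equivalent would be a loop that only opens each PDF and fills variables, followed by a single block of Excel actions after the data has been collected.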

Hi

Thanks for the quick reply.

I am getting the following error.

Here is my OCR capture for a single file, as all the files have the same format.

Thanks,

Khalid Ansari

Please press Details, copy the whole text of the error, paste it into a text file, and share it here.
Thank you.

Hi,

Thanks for the quick response. Here is the error.

This happens when pulling data from the second PDF file.

Thanks,

Khalid Ansari

Thank you. Does it fail at a specific step, or does it fail as soon as you start playing the recording, before any step runs?
Also, could you please copy the text and upload it here? Part of the text is not visible in the screenshot.

Hi Expert,

It fails while the loop is extracting data from the second PDF. I am getting two errors. I created the OCR action for the first PDF only, as the second PDF uses the same template.

1. Error executing OcrAction
com.workfusion.studio.rpa.recorder.playback.PlaybackException: Error executing TemplateAction[templateName=OcrAction.ftl,id=6,name=Optional[OcrAction],parent=2,nextSibling=7,arguments=ActionArguments[varName=[address],imageName=[D:\workfusion-workspace\rpae_project\DemoOCRFiles\1528429775340-anchor-1528429775511.apng],fullImageName=[1528429775340.png],xsi:type=[recorder:OcrAction, recorder:OcrAction],pollingInterval=[300],active=[true],type=[CONTROL],offsetX=[54],offsetY=[184],delay=[0],width=[131],actionDetails=[(to 'address' rectangle 131 x 73)],height=[73],awaitTimeout=[5000]]]
at com.workfusion.studio.rpa.recorder.playback.flow.StandardControlFlow.execute(StandardControlFlow.java:54)
at com.workfusion.studio.rpa.recorder.playback.action.template.TemplateAction.execute(TemplateAction.java:30)
at com.workfusion.studio.rpa.recorder.playback.action.template.TemplateAction.execute(TemplateAction.java:17)
at com.workfusion.studio.rpa.recorder.playback.player.ActionPlayer.next(ActionPlayer.java:53)
at com.workfusion.studio.rpa.recorder.player.PlaybackLogic.playNextAction(PlaybackLogic.java:153)
at com.workfusion.studio.rpa.recorder.player.PlaybackLogic.run(PlaybackLogic.java:113)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.openqa.selenium.WebDriverException: Image does not found : 1528429775340-anchor-1528429775511.apng
Command duration or timeout: 0 milliseconds
Build info: version: '9.0.0.1', revision: 'e3a0fd7071', time: '2018-05-11T11:35:20.018Z'
System info: host: 'MUM-KHALIDA', ip: '192.168.6.73', os.name: 'Windows 10', os.arch: 'amd64', os.version: '10.0', java.version: '1.8.0_121'
Driver info: com.freedomoss.crowdcontrol.webharvest.selenium.wrapper.RemoteDriverWrapper
Capabilities [{imageSimilarityThreshold=0.8, extra.executor.id={Name=RPA Recorder}, CLOSE_ALL_WINDOWS=false, browserName=universal, javascriptEnabled=true, extra.capabilities.context={"browserType":"universal","startInPrivate":false,"blockImages":false,"maximizeOnStartup":false,"customCapabilities":{"platform":"WINDOWS","javascriptEnabled":true,"SEARCH_ALL_WINDOWS":true,"CLOSE_ALL_WINDOWS":false,"imageSimilarityThreshold":"0.8"},"executorId":{"Name":"RPA Recorder"}}, platformName=WINDOWS, SEARCH_ALL_WINDOWS=true, platform=WINDOWS}]
Session ID: 12c8bdf3-08df-47f9-b228-137e70f4fc8a
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance

2. Stack frame can be removed only by owner
java.lang.IllegalArgumentException: Stack frame can be removed only by owner
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:93)
at com.workfusion.studio.rpa.recorder.playback.player.PlaybackStack.pop(PlaybackStack.java:73)
at com.workfusion.studio.rpa.recorder.playback.action.ForEachAction.execute(ForEachAction.java:64)
at com.workfusion.studio.rpa.recorder.playback.action.ForEachAction.execute(ForEachAction.java:29)
at com.workfusion.studio.rpa.recorder.playback.player.ActionPlayer.next(ActionPlayer.java:53)
at com.workfusion.studio.rpa.recorder.player.PlaybackLogic.playNextAction(PlaybackLogic.java:153)
at com.workfusion.studio.rpa.recorder.player.PlaybackLogic.run(PlaybackLogic.java:113)
at java.lang.Thread.run(Thread.java:745)

Thanks,

Khalid Ansari

Can you share the details of the OCR action in step 6, so we can see what your anchor area and scan area are? If the first PDF scans correctly but the second doesn't, you may have selected an anchor area whose content changes between the first and the second PDF.

Hi,

Thanks for the reply. Here are the OCR details and the anchor area.

It reads the first file, but the second one errors out.

I created the OCR action for the first PDF only, as the second PDF uses the same template.

Thanks,
Khalid Ansari

OK, the template is the same, but is the content in the anchor area also the same in pdf1 and pdf2?

Hi,

Thanks for the help!

I am now able to loop through the PDFs and get the data from each one. I want to save the data in Excel columns such as Bill To Address, Invoice Number, etc. Currently the values are written one below another.

The selected cell holds the invoice number, which should go in the Invoice Number column.

For reference, here are the Excel cell positions.

Thanks,

Khalid Ansari

I think the easiest and cleanest approach would be to use separate OCR actions and save the results in separate variables: one for the bill-to address and one for the invoice number. After that you can set each of them in the Excel cell where you want it. The downside is that it takes longer, as each OCR action takes several seconds to complete, and you only have 1,000 free OCR scans about every 3 months.
Another option: use a split text action on your current OCR result and separate it into two variables (bill-to address and invoice number).
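For the split option, here is a rough Python sketch of the idea. I have not seen your actual OCR text, so this assumes the address comes first and the invoice number follows an "Invoice Number" label; `split_ocr_text` and the `marker` parameter are hypothetical names, not WorkFusion actions.

```python
def split_ocr_text(text, marker="Invoice Number"):
    """Split combined OCR text into (bill_to_address, invoice_number).

    Assumes the address comes first and the invoice number follows `marker`.
    """
    before, found, after = text.partition(marker)
    if not found:
        raise ValueError(f"marker {marker!r} not found in OCR text")
    # Collapse line breaks and repeated spaces so the address fits one cell.
    address = " ".join(before.split())
    # Drop the separator characters that typically follow the label.
    invoice_number = after.strip(" :#\n\t\r")
    return address, invoice_number
```

Each pair then becomes one row, with the address in the Bill To Address column and the number in the Invoice Number column. If your OCR text is laid out differently, the marker and the order would need to change.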
Good luck! :wink:

Hi, would appreciate some help.

How did you manage to get rid of the error you were getting? I'm having the same problem: it reads the first file only and then gives an error.

Thanks in advance!

My PDFs are scanned.