Working with PDF using XPath as a substitute for OCR

Recently I commented on a post asking how to export data from a PDF file without using OCR. My answer: use XPath after opening the PDF file in a browser. In this topic I would like to elaborate a bit more on the subject.

Why is using XPath sometimes better than using OCR?
The simplest answer: it's much faster. In my experience OCR takes 5-10 seconds to extract one variable from a PDF file, whereas with XPath you can extract multiple elements in less than a second. Furthermore, XPath is at least as reliable as OCR. There are, however, a few things to consider when using XPath.

What is Xpath? (reference: https://www.w3schools.com/xml/xpath_intro.asp)
XPath is a major element in the XSLT standard. It can be used to navigate through elements and attributes in an XML document. XPath stands for XML Path Language.
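To make the definition concrete, here is a minimal, self-contained Java sketch that evaluates an XPath expression against a small XML document (the XML content and element names are invented purely for illustration):

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class XPathDemo {
    public static void main(String[] args) throws Exception {
        // A tiny XML document, invented for this example
        String xml = "<invoice><number>12345</number><date>05/02/2018</date></invoice>";

        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        // Navigate to the <date> element and read its text content
        XPath xpath = XPathFactory.newInstance().newXPath();
        String date = xpath.evaluate("/invoice/date", doc);
        System.out.println(date); // prints: 05/02/2018
    }
}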

How can I use XPath?
By opening a PDF file in a web browser, the document is rendered as an XML/HTML structure, which enables you to locate specific elements in a specific file. By using the Web Element action in 'get value' mode, you can provide the XPath of the element you want to extract.
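Outside the WorkFusion Recorder, the same idea can be sketched with plain Selenium in Java. This is an illustration, not the Web Element action itself; the URL is a placeholder and the XPath is the absolute path discussed further below:

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

public class PdfXPathExtract {
    public static void main(String[] args) {
        WebDriver driver = new FirefoxDriver();
        try {
            // Firefox renders the PDF with its built-in viewer (pdf.js),
            // which exposes the text layer as ordinary DOM elements.
            driver.get("http://www.example.com/some.pdf"); // placeholder URL

            // Placeholder XPath: point it at the text element you want to extract
            String value = driver.findElement(
                    By.xpath("/html/body/div[1]/div[2]/div[4]/div/div/div[2]/div[18]"))
                    .getText();
            System.out.println("Extracted: " + value);
        } finally {
            driver.quit();
        }
    }
}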

When I scrape normal web elements I normally use the portable Firefox that comes with WorkFusion as standard, which is efficient at determining XPaths. In this browser you can right-click on an element, go to XPaths, and select an option that is automatically copied to your clipboard and looks something like this:

//span[contains(.,'Demo')]

This XPath refers to a button on a website that is labelled 'Demo' in the XML code. However, if you use this method on a PDF file, you can expect a path like the following:

//div[@style='left: 784.25px; top: 184.617px; font-size: 15px; font-family: sans-serif; transform: scaleX(1.04125);']

As you can see, this XPath contains the relative position of the element, including formatting information. In my experience this reference is too sensitive to very small formatting changes, especially because of the scaleX property. Therefore I use an alternative XPath when extracting elements from a PDF. In these cases I open the PDF file in my normal Firefox browser and select the element using the element picker: open the inspector (F12), select the element picker in the top-left corner of the inspector screen, and click on the element you want to extract. In the inspector the relevant rows will be highlighted, showing markup such as this:

<div style="…">05/02/2018</div>

Right-click on the highlighted element, go to Copy, and select XPath. A string like this will now be copied to your clipboard:

/html/body/div[1]/div[2]/div[4]/div/div/div[2]/div[18]

This path doesn't contain a relative location; it is the technical position of the element within the structure of the XML code, which is far more stable. In our project we always define controls to be sure that the right data is extracted, but we also do this when other techniques (such as OCR) are used.
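To illustrate the idea of a control: after extracting the value, a simple format check can confirm that the right data was picked up. A minimal sketch, assuming the extracted element is a date in dd/MM/yyyy form like the example above:

import java.util.regex.Pattern;

public class ExtractionControl {
    // dd/MM/yyyy pattern, matching sample values such as 05/02/2018
    private static final Pattern DATE = Pattern.compile("\\d{2}/\\d{2}/\\d{4}");

    static String checkedDate(String rawValue) {
        if (rawValue == null || !DATE.matcher(rawValue.trim()).matches()) {
            // Fail fast: the XPath matched an element, but not the one we expected
            throw new IllegalStateException("Extraction control failed: " + rawValue);
        }
        return rawValue.trim();
    }

    public static void main(String[] args) {
        System.out.println(checkedDate("05/02/2018")); // passes the control
    }
}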

A final remark: when should you use OCR? In my opinion OCR is only the better option when reading data from image-based PDFs without readable text.

If there are questions, feel free to comment below.


Morning

I’m currently busy with some OCR work and ran into some issues. I do have a few questions.

  1. Are the PDFs / XML created with code? I'm facing an issue where documents are scanned in, so the PDF really contains an image.
  2. When using the WF OCR service to convert PDF to HTML, the resulting HTML is quite a mess and looks nothing like the PDF. Are there any other tools we can use for this conversion?

Thank you
Alex

Here are the answers to your questions.

  1. I hope you read the post completely. In that case you will have seen that @wilbertyYuUdrW clearly mentions this technique is applicable to readable PDFs, i.e. PDFs that were generated electronically. Therefore there is no need to OCR them; in fact, there is no point in doing so.
  2. Yes. ABBYY only converts the PDF to HTML. After that you need to prepare your training set accordingly and train your ML platform in WorkFusion. The rest will be taken care of by VDS :sunglasses:

Hi

Thank you, I did read the post; I just wanted to confirm.
That is indeed the issue and the plan: the conversion from PDF to HTML is not formatted very well, and the same PDFs with different information are not rendered the same way. For example, sometimes the table is above the document details and other times below or next to them; it's not a consistent conversion.

Regards

Hi, OCR is just the beginning of your scanned-PDF processing. Here is how it's done in WorkFusion SPA.

  1. OCR the document & convert it into HTML.
  2. Use the HTML document to train your ML model.
  3. Once trained, the model will take care of the information extraction.

During the 2nd stage, your trained model will tag the documents accordingly, like this:

<h1 style="width:100%;vertical-align:middle">INVOICE# <invoice_number class="extraction-tag">1588187735</invoice_number></h1>
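Once the model produces tagged output like that, reading the value back out is an ordinary HTML-parsing task. A minimal sketch using jsoup (my choice of parser, not something the platform prescribes):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class TaggedHtmlReader {
    public static void main(String[] args) {
        // The sample tagged output shown above
        String html = "<h1 style=\"width:100%;vertical-align:middle\">INVOICE# "
                + "<invoice_number class=\"extraction-tag\">1588187735</invoice_number></h1>";

        // jsoup happily parses custom tags such as <invoice_number>
        Document doc = Jsoup.parse(html);
        String invoiceNumber = doc.selectFirst("invoice_number.extraction-tag").text();
        System.out.println(invoiceNumber); // prints: 1588187735
    }
}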

A WorkFusion ML Engineer is required to perform steps 2 & 3.

Hi @alex.lopes and @daniel_sa,

Thanks to both of you for your replies. It seems to me that you already have solid input to go further with. Just to confirm: the solution in this post is not meant for image-based PDFs. If there are any additional questions, just let us know.

Kind regards, Wilbert


Hi,

I am trying to scrape data from online documents. The data is in table format; the requirement is to search for data using Ctrl+F and, if it is found, to scrape the full row.

Suppose there are 3 columns and multiple rows. When I search for the data, it is present in one column, and I need to scrape the whole row corresponding to the search value. The online data is dynamic.

Does XPath work here?

Can you please give me a solution for that?

Please find attached sample data.
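Without seeing the sample data, here is a hedged sketch of the idea: if the page uses a standard HTML table, an XPath can select every row that has a cell containing the search text, after which you read all cells of that row. The URL and search value below are placeholders:

import java.util.List;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;

public class RowScraper {
    public static void main(String[] args) {
        WebDriver driver = new FirefoxDriver();
        try {
            driver.get("http://www.example.com/table-page"); // placeholder URL

            String search = "needle"; // the value you would otherwise find with Ctrl+F
            // Select every <tr> that has a <td> containing the search text
            List<WebElement> rows = driver.findElements(
                    By.xpath("//tr[td[contains(normalize-space(.), '" + search + "')]]"));

            for (WebElement row : rows) {
                // Print all cells of each matching row
                for (WebElement cell : row.findElements(By.tagName("td"))) {
                    System.out.print(cell.getText() + "\t");
                }
                System.out.println();
            }
        } finally {
            driver.quit();
        }
    }
}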

Hi,

I am using Selenium to automate PDF verification. I launch the PDF in Firefox and use XPath to identify each element's text and verify it against the expected value. I can get the XPath from the Firefox inspector, but I cannot access the element. Please find more details below.

Versions:
selenium-server-standalone-3.9.1
selenium-server-3.9.1
geckodriver-v0.24.0-win64
Firefox 66.0.3 (64 bit)

pdf: http://www.africau.edu/images/default/sample.pdf

xpath: /html/body/div[1]/div[2]/div[4]/div/div[1]/div[2]/span[2]

Error:

Exception in thread “main” org.openqa.selenium.NoSuchElementException: Unable to locate element: /html/body/div[1]/div[2]/div[4]/div/div[1]/div[2]/span[2]
For documentation on this error, please visit: http://seleniumhq.org/exceptions/no_such_element.html

Hi @balafeb03, can you share your script so we can have a look?

I tried using this XPath, and it worked in Firefox.
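One common cause of NoSuchElementException with Firefox's PDF viewer is that pdf.js renders the text layer asynchronously, so the element may not exist yet when findElement runs. An explicit wait is worth trying; a sketch using the sample PDF and XPath from the post above (the timing diagnosis is my assumption):

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class PdfTextVerification {
    public static void main(String[] args) {
        WebDriver driver = new FirefoxDriver();
        try {
            driver.get("http://www.africau.edu/images/default/sample.pdf");

            // Wait up to 30 seconds for pdf.js to render the text layer
            WebDriverWait wait = new WebDriverWait(driver, 30);
            WebElement element = wait.until(ExpectedConditions.presenceOfElementLocated(
                    By.xpath("/html/body/div[1]/div[2]/div[4]/div/div[1]/div[2]/span[2]")));

            System.out.println("Actual text: " + element.getText());
        } finally {
            driver.quit();
        }
    }
}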

Hi All,
I am able to get the XPath for text in a PDF template file when using Firefox.
But when I edit this template and add some information, the edited text is not detectable by Firefox.

Does anybody have an idea about this?
I have also posted here: https://forum.workfusion.com/t/read-and-edit-pdfs-and-verifying-data/59808

Thanks in advance.

Hi @dhanshree.more1, check the XPath of the element after you add the information to the file. Perhaps it changes after that, so the old XPath doesn't work anymore.