Recently I commented on a post asking how to extract data from a PDF file without using OCR. My answer: use XPath after opening the PDF file in a browser. In this topic I would like to elaborate a bit more on this subject.
Why is using XPath sometimes better than using OCR?
The simplest answer: it's much faster. In my experience OCR takes 5-10 seconds to extract a single value from a PDF file, while with XPath you can extract multiple elements in under a second. Furthermore, XPath is at least as reliable as OCR. There are, however, a few things to consider when using XPath.
What is XPath? (reference: https://www.w3schools.com/xml/xpath_intro.asp)
XPath stands for XML Path Language. It is a major element in the XSLT standard and can be used to navigate through elements and attributes in an XML document.
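To make this concrete, here is a minimal sketch of XPath navigation using Python's built-in xml.etree.ElementTree module, which supports a limited XPath subset. The invoice document and all element names are made up for illustration:

```python
import xml.etree.ElementTree as ET

# A made-up document, just to show the idea of navigating by path.
doc = """
<invoice>
  <customer id="C42">Acme Corp</customer>
  <total currency="EUR">199.95</total>
</invoice>
"""
root = ET.fromstring(doc)

# Navigate by element path:
customer = root.find("./customer")
print(customer.text)            # Acme Corp

# Navigate with a predicate on an attribute:
total = root.find(".//total[@currency='EUR']")
print(total.text)               # 199.95
```

The same path expressions work in any XPath-capable tool; the browser-based approach below simply applies them to the document a PDF viewer renders.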
How can I use XPath?
When you open a PDF file in a web browser, it is rendered as an XML document, which enables you to locate specific elements in that file. Using the Web Element action in "get value" mode, you can provide an XPath to the element you want to extract.
When I scrape ordinary web elements I normally use the Firefox Portable that ships with WorkFusion, which is efficient at determining XPaths. In this browser you can right-click an element, go to XPaths and select an option; the expression is automatically copied to your clipboard and looks something like this:
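The copied expression itself did not survive in this post, but as a hypothetical illustration (made-up page markup, using Python's ElementTree), a text-based XPath such as `.//button[.='Demo']` would select a button by its visible label:

```python
import xml.etree.ElementTree as ET

# Made-up markup standing in for the real page; the actual copied
# expression is not shown in this post.
page = "<div><button>Demo</button><button>Other</button></div>"
root = ET.fromstring(page)

# Select the button whose text content is 'Demo' (Python 3.7+ syntax):
button = root.find(".//button[.='Demo']")
print(button.text)   # Demo
```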
Such an XPath refers to a button on a website that the XML code labels 'Demo'. However, if you use this method on a PDF file, you can expect a path like the following:
//div[@style='left: 784.25px; top: 184.617px; font-size: 15px; font-family: sans-serif; transform: scaleX(1.04125);']
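To see how fragile such a style-based locator is in practice, here is a sketch with made-up markup using Python's ElementTree: change the scaleX value by a fraction and the exact-match expression silently stops matching.

```python
import xml.etree.ElementTree as ET

# Two renderings of the same PDF text layer (made-up markup);
# only the scaleX value differs by a tiny amount.
page_v1 = '<div><div style="left: 784.25px; transform: scaleX(1.04125);">Total</div></div>'
page_v2 = '<div><div style="left: 784.25px; transform: scaleX(1.04119);">Total</div></div>'

# An exact match on the full style attribute, like the expression above:
xpath = './/div[@style="left: 784.25px; transform: scaleX(1.04125);"]'

print(ET.fromstring(page_v1).find(xpath) is not None)  # True
print(ET.fromstring(page_v2).find(xpath) is not None)  # False: the locator breaks
```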
As you can see, this XPath contains the relative position of the element, including formatting information. In my experience this reference is too sensitive to very small formatting changes, especially because of the scaleX property. I therefore use an alternative XPath when extracting elements from a PDF. In these cases I open the PDF file in my normal Firefox browser and select the element using the element picker: open the inspector (F12), click the element picker in the top-left corner of the inspector screen, and then click the element you want to extract. The relevant rows in the inspector will be highlighted, showing code such as this:
Right-click the highlighted text, go to Copy and select XPath. A string like this is now copied to your clipboard:
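The copied string itself is missing from this post; as a hypothetical example with made-up markup, Firefox's Copy XPath option produces a positional, structural path (something like /html/body/div[1]/div[2]), which can be sketched like this:

```python
import xml.etree.ElementTree as ET

# Made-up markup standing in for a PDF viewer's text layer.
page = "<html><body><div><div>Invoice</div><div>Total: 100</div></div></body></html>"
root = ET.fromstring(page)       # root is the <html> element

# ElementTree paths are relative to the root, so drop the leading /html:
value = root.find("./body/div[1]/div[2]")
print(value.text)                # Total: 100
```

Because the path is built from element positions rather than style attributes, it keeps matching even when the rendered coordinates or scaling change.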
This path doesn't contain a relative location, but describes the element's technical position within the structure of the XML code, which is far more stable. In our project we always define controls to verify that the right data is extracted, but we also do this when other techniques (such as OCR) are used.
A final remark: when should you use OCR? In my opinion, OCR is only the better choice for reading data from image-based PDFs without selectable text.
If there are questions, feel free to comment below.