How to extract data from OCR plugin

#1

Hello Team,

Could you please help me extract the content of an image (a .tiff file) using the OCR plugin?
I have written the reference code below for this, but I am not able to extract the data from the .tiff file.

<var-def name="ocr">
  <ocr correct-skew="true" export-format="txt">
    <ocr-image>
      <http url="${document_image_link}"/>
    </ocr-image>
  </ocr>
</var-def>

<s3 bucket="${ocr_results_s3_bucket}">
  <s3-put-public path="${ocr_results_s3_path}/ocr/${java.util.UUID.randomUUID()}.txt" content-type="text/plain" content-disposition="inline">
  </s3-put-public>
</s3>

<script><![CDATA[
  log.info("ocr_scanned_image value" + ocr.getWrappedObject().getClass().toString());
  log.info("ocr_scanned_Image.toString()" + ocr.toString());
]]></script>

Could you please look into this and suggest a solution?

Thanks
Diptiranjan Panda

#2

Hello @diptiranjanpanda.
You can use this article as an example: https://kb.workfusion.com/display/RPAe/OCR#OCR-UsingOCRPlugin.
It seems you missed the <export> section.
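
For illustration, here is a minimal sketch of how the snippet from the first post might look once the recognized text is passed as the body of `<s3-put-public>` and an `<export>` section is added. The variable names `document_image_link`, `ocr_results_s3_bucket`, and `ocr_results_s3_path` come from the original post. The `ocr.get(0).wrappedObject.results['txt']` accessor and the `<single-column>` export are assumptions based on the pattern in the linked article, so please verify them against your WorkFusion version.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<config xmlns="http://web-harvest.sourceforge.net/schema/1.0/config" scriptlang="groovy">

  <!-- Run OCR on the downloaded image; variable names are taken from the first post. -->
  <var-def name="ocr">
    <ocr correct-skew="true" export-format="txt">
      <ocr-image>
        <http url="${document_image_link}"/>
      </ocr-image>
    </ocr>
  </var-def>

  <!-- Upload the recognized text to S3.
       Assumption: the text is available via results['txt'] on the first OCR result. -->
  <var-def name="result">
    <s3 bucket="${ocr_results_s3_bucket}">
      <s3-put-public path="${ocr_results_s3_path}/ocr/${java.util.UUID.randomUUID()}.txt"
                     content-type="text/plain" content-disposition="inline">
        <script return="ocr.get(0).wrappedObject.results['txt']"/>
      </s3-put-public>
    </s3>
  </var-def>

  <!-- The export section that was missing from the original snippet. -->
  <export include-original-data="true">
    <single-column name="result" value="${result}"/>
  </export>

</config>
```

In the original snippet the `<s3-put-public>` element had an empty body, so nothing from the OCR step would have been uploaded even if recognition succeeded.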

#3

I have added this, but I am getting an error.
Below is the code I used in the OCR step to extract the data.

<?xml version="1.0" encoding="UTF-8"?>
<config xmlns="http://web-harvest.sourceforge.net/schema/1.0/config" scriptlang="groovy">
  <script><![CDATA[
    documentLink = "https://d1l00354g.dc01.its.hpecorp.net:8443/ocr-input/training-set-800/DHL_DUBAI/3399203.pdf";
    cacheTable = "dipti";
  ]]></script>

  <var-def name="document">
    <http url="${documentLink}"/>
  </var-def>

  <script><![CDATA[
    if (http.statusCode.toString().matches('^[45]\\d{2}')) {
      throw new RuntimeException("failed downloading the link: " + documentLink);
    }
    documentHash = org.apache.commons.codec.digest.DigestUtils.md5Hex(document.toBinary());
  ]]></script>

  <create-datastore name="${cacheTable}">
    <datastore-column name="key"/>
    <datastore-column name="result"/>
  </create-datastore>

  <var-def name="cachedRecord">
    <datastore name="${cacheTable}" max="1">
      <template>
        select * from @this where "key"='${documentHash}'
      </template>
    </datastore>
  </var-def>

  <case>
    <if condition='${cachedRecord.toString().length() != 0}'>
      <var-def name="result">
        <xpath expression='/row/result/text()'>
          <var name="cachedRecord"/>
        </xpath>
      </var-def>
    </if>
    <else>
      <var-def name="ocr">
        <ocr correct-skew="true" export-format="txt">
          <ocr-image>
            <var name="document"/>
          </ocr-image>
        </ocr>
      </var-def>

      <var-def name="result">
        <s3 bucket="doc-upload">
          <s3-put-public path="ocr/${java.util.UUID.randomUUID()}.txt" content-type="text/plain" content-disposition="inline">
            <script return="ocr.get(0).wrappedObject.results['txt']"/>
          </s3-put-public>
        </s3>
      </var-def>

      <insert-datastore
        datastore-name="${cacheTable}"
        json-value-map='${new com.google.gson.Gson().toJson(["key": documentHash, "result": result.toString()])}'/>
    </else>
  </case>

  <export include-original-data="true">
    <single-column name="result" value="${result}"/>
  </export>
</config>

I am getting the exception below. Could you please look into this?

13:52:37 [ERROR] plugin: fail: 500 null: <?xml version="1.0" encoding="UTF-8" standalone="yes"?>Could not parse multipart servlet request; nested exception is java.io.IOException: The temporary upload location [C:\Users\pandadip\AppData\Local\Temp\tomcat.3957267344229486687.15580\work\Tomcat\localhost\ROOT] is not valid
13:52:37 [ERROR] 500 null: <?xml version="1.0" encoding="UTF-8" standalone="yes"?>Could not parse multipart servlet request; nested exception is java.io.IOException: The temporary upload location [C:\Users\pandadip\AppData\Local\Temp\tomcat.3957267344229486687.15580\work\Tomcat\localhost\ROOT] is not valid
org.springframework.web.client.RestClientResponseException: 500 null: <?xml version="1.0" encoding="UTF-8" standalone="yes"?>Could not parse multipart servlet request; nested exception is java.io.IOException: The temporary upload location [C:\Users\pandadip\AppData\Local\Temp\tomcat.3957267344229486687.15580\work\Tomcat\localhost\ROOT] is not valid

#4

Hello.
Are you running this script from Studio or in Control Tower?

#5

I have run it in both.