Take advantage of automatic data extraction

API & Documentation

 How does the PDF2Data service work? The ideal solution.

 Example of invoice, template and generated XML result + short explanatory video.

 API and sample projects available here: PDF2Data java API v1.5 and PDF2Folder Desktop Client.

 

 Supported file types

Electronic documents

File type | Description | Supported | Notes
pdf | Portable Document Format | Yes | Version 1.7 or earlier (including multipage). Only unsecured PDFs (not password-protected) are supported.
doc / docx | Microsoft Word | Yes* | *Experimental. At the moment only ".doc" is supported; ".docx" support is coming soon.
xls / xlsx | Microsoft Excel | Yes* | *Experimental. At the moment only the A4 page format is supported; information beyond the page borders is moved to the next page.
ppt / pptx | Microsoft PowerPoint | Yes* | *Experimental
rtf | Rich Text Format | Soon |
odt | OpenDocument Text | Yes* | *Experimental
ods | OpenDocument Spreadsheet | Yes* | *Experimental
odp | OpenDocument Presentation | Yes* | *Experimental
sxw | OpenOffice.org 1.0 Text | Soon |
sxc | OpenOffice.org 1.0 Spreadsheet | Soon |
sxi | OpenOffice.org 1.0 Presentation | Soon |
wpd | WordPerfect | Soon |
txt | Plain Text | Yes* | *Experimental
tsv | Tab Separated Values | Soon |
html | HyperText Markup Language | Soon |

OCR documents (scan at 200-300 dpi or higher for best results)

Supported OCR languages: click to see the list.

png | Portable Network Graphics | Yes | Black and white, gray, color
jpeg / jpg | Joint Photographic Experts Group | Yes | Gray, color
jp2 / jpc | JPEG 2000 | Yes | Gray: Part 1; color: Part 1
pdf | Portable Document Format | Yes | Version 1.7 or earlier (including multipage)
tiff / tif | Tagged Image File Format | Yes | Black and white: uncompressed, CCITT3, CCITT4, Packbits, ZIP, LZW; gray: uncompressed, Packbits, JPEG, ZIP, LZW; 24-bit color: uncompressed, JPEG, ZIP, LZW; 1-, 4-, 8-bit palette: uncompressed, Packbits, ZIP, LZW (including multipage TIFF)
gif | Graphics Interchange Format | Yes | Black and white: LZW-compressed; 2-, 3-, 4-, 5-, 6-, 7-, 8-bit palette: LZW-compressed
djvu / djv | DjVu | Yes | Black and white, gray, color
jb2 | JBIG2 | Yes | Black and white

 

 Requesting available page credits

 URL: http://pdf2data.cloudforpeople.com/api/getCredits

Parameter | Type | Description
api_key | string | Required. User authentication. You can find your "api_key" in the "Control Panel" of the PDF2Data web interface.

XML result:

The server returns:

<result>
     <userId>215</userId>
     <userOcrPages>1111</userOcrPages>
     <userElPages>2222</userElPages>
     <userComPages>3333</userComPages>
</result>

In case of error:

see the file status.xml

<status>
           <error code="1800">Internal error.</error>
</status>
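
As a rough illustration, the two replies above could be read on the client side as follows. This is a minimal Python sketch, not part of the official API: the helper names `build_credits_url` and `parse_credits` are hypothetical, and sending `api_key` as a URL query parameter is an assumption, since this page does not state the HTTP method for this call.

```python
import urllib.parse
import xml.etree.ElementTree as ET

BASE_URL = "http://pdf2data.cloudforpeople.com/api"


def build_credits_url(api_key):
    """Build the getCredits URL (assumption: api_key travels as a query parameter)."""
    return f"{BASE_URL}/getCredits?" + urllib.parse.urlencode({"api_key": api_key})


def parse_credits(xml_text):
    """Turn a <result> reply into a dict of integers; raise on a <status> error reply."""
    root = ET.fromstring(xml_text)
    if root.tag == "status":
        error = root.find("error")
        if error is None:
            raise RuntimeError("unexpected <status> reply without <error>")
        raise RuntimeError(f"PDF2Data error {error.get('code')}: {error.text}")
    return {child.tag: int(child.text) for child in root}


sample = """<result>
     <userId>215</userId>
     <userOcrPages>1111</userOcrPages>
     <userElPages>2222</userElPages>
     <userComPages>3333</userComPages>
</result>"""
page_credits = parse_credits(sample)
print(page_credits["userOcrPages"])  # 1111
```

The reply text itself would come from whatever HTTP client you already use; the parsing step is kept separate so it can be tested without network access.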

 

 

 Requesting the list of available templates from the PDF2Data server

 URL: http://pdf2data.cloudforpeople.com/api/listTemplates

Parameter | Type | Description
api_key | string | Required. User authentication. You can find your "api_key" in the "Control Panel" of the PDF2Data web interface.

XML result:

The server returns:

<templates>
    <template id="1" key="Name of key1">Name1</template>
    <template id="2" key="Name of key2">Name2</template>
    <template id="3" key="Name of key3">Name3</template>
</templates>

In case of error:

see the file status.xml

<status>
           <error code="1800">Internal error.</error>
</status>
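
The template list shown above maps naturally onto a list of records. The following Python sketch (the function name `parse_templates` is hypothetical) shows one way to read it, for example to look up a `template_id` by template name before submitting a document.

```python
import xml.etree.ElementTree as ET


def parse_templates(xml_text):
    """Return one dict per <template> element: its id, key and display name."""
    root = ET.fromstring(xml_text)
    if root.tag != "templates":
        # A <status> reply (for example error 1800) ends up here.
        raise ValueError(f"expected <templates>, got <{root.tag}>")
    return [
        {"id": int(node.get("id")), "key": node.get("key"), "name": node.text}
        for node in root.findall("template")
    ]


sample = """<templates>
    <template id="1" key="Name of key1">Name1</template>
    <template id="2" key="Name of key2">Name2</template>
    <template id="3" key="Name of key3">Name3</template>
</templates>"""
templates = parse_templates(sample)
ids_by_name = {item["name"]: item["id"] for item in templates}
print(ids_by_name["Name2"])  # 2
```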

 

 

 Requesting the Template Schema from the PDF2Data server

 URL: http://pdf2data.cloudforpeople.com/api/getTemplateSchema

Parameter | Type | Description
api_key | string | Required. User authentication. You can find your "api_key" in the "Control Panel" of the PDF2Data web interface.
template_id | integer | Required. Template ID.

XML result:

The server returns:

<result>
    .... the content of TemplateSchema
</result>

In case of error:

see the file status.xml

<status>
           <error code="1800">Internal error.</error>
</status>

 

 

 Submitting a document to the PDF2Data server for recognition

 Submit a single document for recognition:

 URL: http://pdf2data.cloudforpeople.com/api/recognize

 Submit a batch of documents for recognition:

 URL: http://pdf2data.cloudforpeople.com/batch/api/recognize

 Required: HTTP method "POST" and content type "multipart/form-data".

Parameter | Type | Description
api_key | string | Required. User authentication. You can find your "api_key" in the "Control Panel" of the PDF2Data web interface.
template_id | integer | Always required. Template ID from the list of available templates. If you have created a template with a corresponding "ID" on our server, you can specify "-1" as template_id and the system will automatically recognize which template to apply to the document. If the document has no appropriate template, the server returns the message "Invalid template_ID".
scanned | Boolean | Required. Specifies whether the document is scanned (true) or electronic (false). If the value is null, the document type is detected automatically.
language | string | Optional. The language to use for OCR processing. You can specify one primary and one secondary OCR language, separated by a comma. See the list of available languages here. If you do not specify a language, the primary and secondary languages set in your Control Panel in the PDF2Data web interface are used.
mimeType | string | Required. MIME type.
file | byte array | Required. File.
batch | Boolean | Used only for batch submission. "True" if the file is composed of many invoices.
split | integer | Used only for batch submission. The splitting step: 0 = split the file by separator sheet (download); 1 = split the file page by page; "n" = split the file every "n" pages.

XML result:

 After the document is submitted, the server returns a "document ID", which must be stored temporarily so that the recognition result can be requested later.

 NB: the recognition result may be ready "instantly" (3-10 sec.) or later (5 min - 24 h). The time depends on the type of document (electronic/OCR) and on whether it is human-controlled or not.

The server returns:

In case of a single document:

<result>

      <documentId>1</documentId>

</result>

In case of a batch of documents:

<result>
       <documents>
           <documentId pageIndex="1-2">1</documentId>
           <documentId pageIndex="3-4">2</documentId>
           <documentId pageIndex="5-8">3</documentId>
      </documents>
</result>

If the document is "in progress" or an error occurs:

see the file status.xml

<status>
          <info code="1100">In progress.</info>

          <error code="1800">Internal error.</error>
</status>
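
To make the submission step more concrete, here is a hedged Python sketch that assembles a "multipart/form-data" POST request for a single document and extracts the document IDs from a reply. It uses only the standard library. The helper names are hypothetical, and two details are assumptions that this page does not confirm: that the document is sent as a form part named "file", and that the Boolean is encoded as the text "true" or "false".

```python
import uuid
import urllib.request
import xml.etree.ElementTree as ET

RECOGNIZE_URL = "http://pdf2data.cloudforpeople.com/api/recognize"


def encode_multipart(fields, file_name, file_bytes, mime_type):
    """Return (content_type, body) for a multipart/form-data POST."""
    boundary = uuid.uuid4().hex
    chunks = []
    for name, value in fields.items():
        header = f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n'
        chunks.append(header.encode("utf-8") + str(value).encode("utf-8") + b"\r\n")
    # Assumption: the document travels as a form part named "file".
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{file_name}"\r\n'
        f"Content-Type: {mime_type}\r\n\r\n"
    )
    chunks.append(file_header.encode("utf-8") + file_bytes + b"\r\n")
    chunks.append(f"--{boundary}--\r\n".encode("utf-8"))
    return f"multipart/form-data; boundary={boundary}", b"".join(chunks)


def build_recognize_request(api_key, template_id, scanned, file_name, file_bytes, mime_type):
    """Build (but do not send) the POST request for a single document."""
    fields = {
        "api_key": api_key,
        "template_id": template_id,
        "scanned": "true" if scanned else "false",  # assumed text encoding of the Boolean
        "mimeType": mime_type,
    }
    content_type, body = encode_multipart(fields, file_name, file_bytes, mime_type)
    return urllib.request.Request(
        RECOGNIZE_URL, data=body, headers={"Content-Type": content_type}, method="POST"
    )


def parse_document_ids(xml_text):
    """Return [(document_id, (first_page, last_page) or None), ...] from a <result> reply."""
    root = ET.fromstring(xml_text)
    if root.tag != "result":
        raise ValueError(f"expected <result>, got <{root.tag}>")
    documents = []
    for node in root.iter("documentId"):
        page_range = None
        page_index = node.get("pageIndex")
        if page_index:
            first, _, last = page_index.partition("-")
            page_range = (int(first), int(last or first))
        documents.append((int(node.text), page_range))
    return documents


request = build_recognize_request(
    "YOUR_API_KEY", -1, False, "invoice.pdf", b"%PDF-1.4 placeholder", "application/pdf"
)
# Sending is left out on purpose: urllib.request.urlopen(request).read().decode()
batch_reply = """<result><documents>
    <documentId pageIndex="1-2">1</documentId>
    <documentId pageIndex="3-4">2</documentId>
    <documentId pageIndex="5-8">3</documentId>
</documents></result>"""
print(parse_document_ids(batch_reply))  # [(1, (1, 2)), (2, (3, 4)), (3, (5, 8))]
```

For a single-document reply the same parser returns one entry whose page range is `None`, so the caller can store the IDs in the same way in both cases.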

 

 

 Getting the "status" of a document from the PDF2Data server

 URL: http://pdf2data.cloudforpeople.com/api/getStatus

Parameter | Type | Description
api_key | string | Required. User authentication. You can find your "api_key" in the "Control Panel" of the PDF2Data web interface.
document_id | integer | Required. Document ID.

XML result:

The server returns:

see the file status.xml

 

<status>
          <info code="1100">In progress.</info>

          <error code="1800">Internal error.</error>
</status>
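
Because a result may take anywhere from seconds to hours, a client will usually poll the status. The Python sketch below shows one possible polling loop; it is an illustration under assumptions, not the official client. The names `is_in_progress` and `wait_for_document` are hypothetical, `fetch_status` stands for any caller-supplied function that returns the reply text of a getStatus call, and the loop simply stops at the first reply that is not info code 1100, since this page does not show what a "finished" status looks like.

```python
import time
import xml.etree.ElementTree as ET

IN_PROGRESS_CODE = "1100"


def is_in_progress(xml_text):
    """True only for a <status> reply that carries info code 1100 and no <error>."""
    root = ET.fromstring(xml_text)
    if root.tag != "status" or root.find("error") is not None:
        return False
    info = root.find("info")
    return info is not None and info.get("code") == IN_PROGRESS_CODE


def wait_for_document(fetch_status, interval_seconds=5.0, max_attempts=12):
    """Call fetch_status() until the reply is no longer 'in progress', then return it."""
    for _ in range(max_attempts):
        reply = fetch_status()
        if not is_in_progress(reply):
            return reply
        time.sleep(interval_seconds)
    raise TimeoutError("document still in progress after polling")


# Canned replies stand in for real HTTP calls in this example.
replies = iter([
    '<status><info code="1100">In progress.</info></status>',
    '<status><info code="1100">In progress.</info></status>',
    '<status><error code="1800">Internal error.</error></status>',
])
final = wait_for_document(lambda: next(replies), interval_seconds=0)
print(final)  # <status><error code="1800">Internal error.</error></status>
```

Given the stated range of 5 minutes to 24 hours for slower documents, a real client would likely use a much longer interval, or increase it between attempts.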

  

 Getting results from PDF2Data server (for both single documents and batch)

 URL: http://pdf2data.cloudforpeople.com/api/getResult

Parameter | Type | Description
api_key | string | Required. User authentication. You can find your "api_key" in the "Control Panel" of the PDF2Data web interface.
document_id | integer | Required. Document ID.
export_format | string | Optional. The format of the returned result: standard XML (if not specified), a CSV file, or a personalized format. To activate a personalized format, please contact us; we can support a large variety of personalized export formats.

XML result:

The server returns:

 

If the result is ready "instantly":

see the file result.xml

 

<result>
    <userName>usermail(at)gmail.com</userName>
    <documentName>DocumentName.pdf</documentName>

    <documentID>DocumentID</documentID>
    <templateName>TemplateName</templateName>
    <creationTime>12312341234</creationTime>

    <associations>
        <pair name="pair1">
            <label>label1</label>
            <value>value1</value>
        </pair>
        
        <table name="table1">
            <columns>
                <column name="column1">Column1</column>
                <column name="column2">Column2</column>
                <column name="column3">Column3</column>
            </columns>
            <row>
                <value>v11</value>
                <value>v12</value>
                <value>v13</value>
            </row>
            <row>
                <value>v21</value>
                <value>v22</value>
                <value>v23</value>
            </row>
            <row>
                <value>v31</value>
                <value>v32</value>
                <value>v33</value>
            </row>
        </table>
                    
        <textbox name="textbox1">textbox1</textbox>

    </associations>
</result>

 

If the document is "in progress" or an error occurs:

see the file status.xml

 

<status>
          <info code="1100">In progress.</info>

          <error code="1800">Internal error.</error>
</status>
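
The standard XML result groups its data under `<associations>` as pairs, tables and textboxes. The following Python sketch (the function name `parse_result` is hypothetical, and the sample is a shortened version of the result above) shows one way to turn that structure into plain dictionaries and lists.

```python
import xml.etree.ElementTree as ET


def parse_result(xml_text):
    """Split the <associations> of a <result> reply into pairs, tables and textboxes."""
    root = ET.fromstring(xml_text)
    if root.tag != "result":
        # A <status> reply ("in progress" or an error) ends up here.
        raise ValueError(f"expected <result>, got <{root.tag}>")
    pairs, tables, textboxes = {}, {}, {}
    associations = root.find("associations")
    for node in (associations if associations is not None else []):
        name = node.get("name")
        if node.tag == "pair":
            pairs[name] = {"label": node.findtext("label"), "value": node.findtext("value")}
        elif node.tag == "table":
            columns = [column.get("name") for column in node.findall("columns/column")]
            # Each row becomes a dict keyed by the column names, in document order.
            tables[name] = [
                dict(zip(columns, [value.text for value in row.findall("value")]))
                for row in node.findall("row")
            ]
        elif node.tag == "textbox":
            textboxes[name] = node.text
    return {
        "documentName": root.findtext("documentName"),
        "pairs": pairs,
        "tables": tables,
        "textboxes": textboxes,
    }


sample = """<result>
    <documentName>DocumentName.pdf</documentName>
    <associations>
        <pair name="pair1"><label>label1</label><value>value1</value></pair>
        <table name="table1">
            <columns>
                <column name="column1">Column1</column>
                <column name="column2">Column2</column>
            </columns>
            <row><value>v11</value><value>v12</value></row>
            <row><value>v21</value><value>v22</value></row>
        </table>
        <textbox name="textbox1">textbox1</textbox>
    </associations>
</result>"""
parsed = parse_result(sample)
print(parsed["pairs"]["pair1"]["value"])       # value1
print(parsed["tables"]["table1"][1]["column2"])  # v22
```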

 

 Error and Info codes

 The list of Error and Info codes is available here.

Code Description
Code

Examples: the document size is > 20 MB; the document format is not supported; the document is "secured"; or other causes.

Code Server error.
XML result:

The server returns:

see the file status.xml

 

<status>
          <info code="1100">In progress.</info>

          <error code="1800">Internal error.</error>
</status>

 

 

Goodies:

For XML viewing/editing: Notepad++.

For timestamp conversion: EpochConverter.

We recommend the ultimate IDE for developers: Eclipse.