Take advantage of automated data extraction

FAQs

 

General

 

 

Data extraction

 

 

Template creation (basics)

 

 


General

 

Q: What is PDF2Data?

 

PDF2Data is a web service that provides highly accurate and scalable data extraction and invoice processing in the cloud. It is designed to automate data extraction from electronic and paper-based documents. "Web service" means that PDF2Data runs on the web, so you cannot download it and install it on your local computer.

 

Q: What can I do with PDF2Data?

 

With PDF2Data you can automate the extraction of all the information or text you need from various types of documents. For example, from invoices you can extract the "Due date", "Invoice nr.", "Description of goods", "Amounts", "Total", etc. You can extract all information or only specific information, tables, values, or pieces of text.

 

Q: What types of documents can I process with PDF2Data?

 

There are almost no limits on what you can process with PDF2Data: you can extract data from invoices, contracts, statistics, catalogs, lists, etc. You can process "electronic" documents (e.g., PDF, DOC, XLS) or scanned "paper-based" documents (e.g., scanned PDF, JPG, PNG, TIFF, and other file formats). The only restriction concerns password-protected files: for these you will need to specify the password.

 

Q: How does the PDF2Data web service work?

 

The PDF2Data web service works in three steps:

Step 1: Upload 1-2 similar documents through the web interface.

Step 2: Create a template for this type of document and apply it to those 1-2 documents to make sure the template works well.

Step 3: Send all other documents of this type to the PDF2Data servers through the API for automatic data extraction, and receive the extracted data from our servers through the same API.
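Step 3 is usually a simple loop over your documents. The sketch below shows only that loop; it does not call the real PDF2Data API. The `extract` callable, its signature, and `fake_extract` are hypothetical stand-ins, and the actual request details (URL, authentication, parameters) must be taken from the API documentation.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical signature: takes a document path and a template name and
# returns the extracted data as an XML string. Replace it with a function
# that performs the real API call described in the PDF2Data API docs.
ExtractFn = Callable[[Path, str], str]


def process_batch(
    documents: Iterable[Path], template: str, extract: ExtractFn
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Send each document for extraction; return (results, errors) by file name."""
    results: Dict[str, str] = {}
    errors: Dict[str, str] = {}
    for doc in documents:
        try:
            results[doc.name] = extract(doc, template)
        except Exception as exc:  # one failed document should not stop the batch
            errors[doc.name] = str(exc)
    return results, errors


def fake_extract(path: Path, template: str) -> str:
    """Stand-in for the real API call, used only to demonstrate the loop."""
    return f"<document name='{path.name}' template='{template}'/>"


if __name__ == "__main__":
    docs = [Path("invoice_001.pdf"), Path("invoice_002.pdf")]
    results, errors = process_batch(docs, "supplier_one", fake_extract)
    for name, xml in results.items():
        print(name, "->", xml)
```

Keeping failures in a separate dictionary makes it easy to resubmit only the documents that did not go through.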

 

Q: I'm not sure I can create a template. Can you help me?

 

The web interface for creating templates is very easy to use. We believe you will be able to create a template yourself; you can watch our Video Tutorials or send us a message. If you are still not sure, we can give you suggestions or even take on the entire template creation process.

 

Q: I have confidential data. What about my privacy?

 

We take your privacy seriously. We do not share your personal information, your uploaded documents, or your extracted results. You can read the full text of our Privacy Statement and Terms and Conditions.

 

Q: How long are my documents stored on the PDF2Data server?

 

All documents processed via the API are stored for 1 month. After that, the documents (and the extracted data) are deleted. Templates and documents you have uploaded through the web interface are not deleted.

 



 

Data extraction

 

Q: What are the input formats?

 

We support many input formats. Please see the full list of supported formats here.

 

Q: What are the output formats for extracted results?

 

The standard output format for your results is XML. An XML file is logically structured and contains all the data you need as a tree of "entities". It is easy to integrate with your software because it is simple and structured. CSV is also supported.
For other output formats, you can send us a request.
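As an illustration of how an XML result might be read in your own software, here is a minimal Python sketch using the standard library. The exact PDF2Data XML schema is not shown in this FAQ, so the element and attribute names in `SAMPLE_XML` (`document`, `label`, `table`, `row`, `cell`, `name`) are assumptions made for the example only.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hypothetical sample: element and attribute names are assumptions,
# not the documented PDF2Data output schema.
SAMPLE_XML = """\
<document>
  <label name="Invoice nr.">2024-001</label>
  <label name="Total">150.00</label>
  <table name="Items">
    <row><cell>Widget</cell><cell>2</cell><cell>75.00</cell></row>
  </table>
</document>
"""


def read_labels(xml_text: str) -> Dict[str, str]:
    """Return a {label name: value} mapping from the extracted XML."""
    root = ET.fromstring(xml_text)
    return {el.get("name"): (el.text or "").strip() for el in root.iter("label")}


def read_table(xml_text: str, table_name: str) -> List[List[str]]:
    """Return the rows of the named table as lists of cell strings."""
    root = ET.fromstring(xml_text)
    for table in root.iter("table"):
        if table.get("name") == table_name:
            return [
                [(cell.text or "").strip() for cell in row.iter("cell")]
                for row in table.iter("row")
            ]
    return []  # no table with that name


if __name__ == "__main__":
    print(read_labels(SAMPLE_XML))
    print(read_table(SAMPLE_XML, "Items"))
```

Once you know the real element names in your results, only the tag and attribute strings need to change.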

 

Q: Can I extract data from multipage documents?

 

Yes, you can extract data from multipage documents. There is no limit on the number of pages, but remember that you pay for every page in the recognized file, even if you extract data from only one page.

 

Q: Can I extract data from multilanguage documents?

 

Yes, you can extract data from multilanguage documents. For fully electronic documents, e.g., electronic PDFs, there are no limitations on supported languages. For scanned documents, there is a list of supported languages.

 

Q: Can I extract data from handwritten documents?

 

At the moment you can extract only typewritten text. Handwritten text is not supported.

 

Q: I submitted a document, but there is an error in the result. Why?

 

We are constantly improving our service, but errors can have different causes, including some beyond our control.
For electronic documents, possible causes are: you applied the wrong template; you set up the template incorrectly; the document is password protected, etc.
For scanned (OCR) documents, possible causes are: you selected the wrong language for OCR recognition; you submitted the file at a very low resolution (it must be at least 200-300 dpi); the text is handwritten, etc.
In any case, you can send us a message by clicking the "Support" button or from our Support page. We will respond within a day.
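If you are unsure whether a scan meets the resolution requirement, you can estimate its resolution from the image size in pixels and the physical size of the page. The sketch below is a plain arithmetic check, not part of PDF2Data; the function names and the A4 default are choices made for this example.

```python
MM_PER_INCH = 25.4
A4_WIDTH_IN = 210 / MM_PER_INCH  # an A4 page is 210 mm wide


def effective_dpi(pixels: int, inches: float) -> float:
    """Resolution along one side: pixel count divided by physical length in inches."""
    if inches <= 0:
        raise ValueError("inches must be positive")
    return pixels / inches


def meets_minimum(
    pixels_wide: int, page_width_in: float = A4_WIDTH_IN, min_dpi: float = 200.0
) -> bool:
    """True if a scan of the given pixel width reaches min_dpi for that page width."""
    return effective_dpi(pixels_wide, page_width_in) >= min_dpi


if __name__ == "__main__":
    for width_px in (1240, 1654, 2480):
        dpi = effective_dpi(width_px, A4_WIDTH_IN)
        print(f"{width_px} px wide A4 scan: about {dpi:.0f} dpi,",
              "ok" if meets_minimum(width_px) else "too low")
```

For other paper sizes, pass the page width in inches as `page_width_in`.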

 



 

Template creation (basics)

 

Q: What is a Template?

 

A Template is a set of elements (entities) used to search for specific information in many documents of the same type.

 

Q: When do I have to create a Template?

 

You need to create a Template when you have many documents of the same type from which you intend to extract specific information and export it to a highly structured file format (XML or CSV) that can be imported into a database. For example, you may want to extract "Date, Invoice Nr, Description, Price, Total due, Tax, etc." from invoices, or "Date, Code, Description, etc." from lists.
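To illustrate the database import mentioned above, here is a minimal Python sketch that loads a CSV result into SQLite using only the standard library. The CSV columns and the `invoices` table are hypothetical; the real column names depend on the elements you define in your template.

```python
import csv
import io
import sqlite3

# Hypothetical CSV export: the real column names depend on your template.
SAMPLE_CSV = """\
Date,Invoice Nr,Description,Total due
2024-01-15,INV-001,Widgets,150.00
2024-01-20,INV-002,Gadgets,89.50
"""


def load_invoices(csv_text: str, conn: sqlite3.Connection) -> int:
    """Insert the CSV rows into an 'invoices' table and return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS invoices "
        "(date TEXT, invoice_nr TEXT, description TEXT, total_due REAL)"
    )
    rows = [
        (r["Date"], r["Invoice Nr"], r["Description"], float(r["Total due"]))
        for r in csv.DictReader(io.StringIO(csv_text))
    ]
    conn.executemany("INSERT INTO invoices VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # in-memory database for the demo
    print(load_invoices(SAMPLE_CSV, conn), "rows imported")
    print(conn.execute("SELECT SUM(total_due) FROM invoices").fetchone()[0])
```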

 

Q: What happens if I apply a Template to another document?

 

Each template must be applied only to documents "of the same type"; otherwise the information to extract will not be detected, or the wrong information will be detected. For example, a template created for an invoice from the supplier "Supplier One" must be applied only to invoices from "Supplier One".

 

Q: What are the elements of a Template?

 

A template has 4 types of elements:

- ID: tells our system which of your suppliers the template relates to.

- LABEL: a pair consisting of a Label and its Value.

- TEXTBOX: a rectangular region of the page. It can have up to 2 anchors to adjust its top and bottom limits.

- TABLE: a table. It has a "header" and a "body". It can have one anchor, which indicates where the bottom of the table ends.
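One way to picture these elements is as a small data model. The Python sketch below is only a mental model built from the descriptions above; the class names, fields, and the `pick_template` matching rule are assumptions for illustration and do not describe how PDF2Data stores or matches templates internally.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom) on the page


@dataclass
class Label:
    label: str  # the label text, e.g. "Invoice nr."; its Value is read at extraction


@dataclass
class Textbox:
    region: Box
    top_anchor: Optional[str] = None     # up to 2 anchors: top and bottom limits
    bottom_anchor: Optional[str] = None


@dataclass
class Table:
    header: Box                          # the body follows below the header
    bottom_anchor: Optional[str] = None  # marks where the table ends


@dataclass
class Template:
    id_text: str                         # words that identify the document type
    labels: List[Label] = field(default_factory=list)
    textboxes: List[Textbox] = field(default_factory=list)
    tables: List[Table] = field(default_factory=list)


def pick_template(page_text: str, templates: List[Template]) -> Optional[Template]:
    """Return the first template whose ID text occurs in the page (case-insensitive)."""
    lowered = page_text.lower()
    for template in templates:
        if template.id_text.lower() in lowered:
            return template
    return None


if __name__ == "__main__":
    one = Template("Supplier One Invoice", labels=[Label("Invoice nr.")])
    two = Template("Supplier Two Invoice")
    match = pick_template("SUPPLIER TWO INVOICE no. 5", [one, two])
    print(match.id_text if match else "no template matches")
```

The sketch also shows why a template should be applied only to documents of the same type: a page that does not contain the ID text matches no template.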

 

Q: How do I create a document ID?

 

Click the "Add ID" button.

Click a word (or 2-3 words), or drag to select several words (see the variants below: the ID can be the "Name + Address + Type" of document, "Name + Type of document", "Phone Number + Type of document", etc.).

Click the green "Ok" arrow:

 

Invoice Identification Variant 1

or

Invoice Identification Variant 2

 

Q: How do I create a Label - Value pair?

 

Click the "Add Label" button.

Click the first word (the Label), or drag to select more words.

Click the second word (the Value), or drag to select more words. Extend the selection of the Value if necessary.

Click the green "Ok" arrow:

 

Label - Value

 or

Label - Value

 or

Label - Value

 or

Label - Value

 

Q: How do I create a Textbox?

 

Click the "Add Textbox" button.

Drag to select an area on the page. Add upper and lower anchors if necessary.

Click the green "Ok" arrow:

 

Textbox without anchors

or

Textbox with upper anchor

or

Textbox with both anchors

 

Q: How do I create a Table?

 

Click the "Add Table" button.

Drag over the area of the table header on the page.

Release the mouse button (a small part of the table body will appear below the header).

Specify an anchor (if necessary).

Click the green "Ok" arrow:

 

Table creating

next

Table creating

next

Table creating

 

Q: How do I extract all text from the page?

 

Click the "Add Textbox" button.

Drag over the entire area of the page.

Click the green "Ok" arrow.

Save the template and click the "Recognize" button.

Save the result by clicking the "Save" button.

 

 Save all text from invoice page
