In this article, Pat Myers, author of the book Intelligent Document Capture with Ephesoft - Second Edition, explains the following:

Different classification types
Other techniques for exporting your documents and metadata

We already know how to configure search classification to enable Ephesoft to recognize an invoice document. There are several other classification types available; we will explain these alternatives now.

Classification types

You can select the process that Ephesoft will use to classify documents by editing your batch class, editing the Document Assembly module, editing the Document Assembler plugin within that module, and then selecting a value for DA Classification Type.

Search

Search classification (also sometimes called Lucene classification) is the default classification method and is recommended for most content. When configured to perform search classification, Ephesoft compares the text on each input page to the text on training documents to determine its confidence that a document is of a certain type.

Image

Image classification is the best option when classification cannot be made based on content. This occurs on forms that do not have a lot of text, or where the textual content is unpredictable but the physical appearance (such as layout, graphics, and formatting) is consistent. Credit card applications that are red dropout forms (where only the user-entered text is visible to the OCR engine) are candidates for this classification technique.

Barcodes

Barcodes can be used for documents that vary in content and layout, like white mail (unformatted correspondence received in the mail). If a barcode is found on the page with a name that matches an Ephesoft document type, Ephesoft will set the current document's type to that type.

Automatic

The automatic classification type tells Ephesoft to use the scores of every classification plugin that is enabled. This may be necessary when no single classification technique will suffice for your batch class, but configuring multiple classification plugins will have a negative impact on Ephesoft's performance.

One document classification

One document classification is a variant of automatic classification. It assembles all the pages in the batch into a single document.

Confidence

Ephesoft calculates confidence scores for each page in a batch. The page scores represent Ephesoft's certainty that the page being considered is the first, middle, or last page within each document type. They are used to classify and assemble the pages into documents. Ephesoft also uses these page scores to create an aggregate score for each document. This score is compared to the confidence threshold for each document type in the batch class definition. Any document that receives a confidence score below the minimum threshold will be flagged for review. A batch with one or more flagged documents will be placed in a queue for review by an operator.

Confidence scores are calculated differently for each classification type.

Search classification

The default classification type is search classification. Search classification separates and classifies documents by using a two-step process. The first step is to collect information about the pages. The Search Classification plugin of the Page Processing module performs this function. The second step is to separate documents and determine their type. This is the responsibility of the Document Assembler plugin.

The Search Classification plugin calculates the initial page scores by comparing the text on the page to the text on the training documents. Multiple scores are generated for each page as Ephesoft finds several matches from samples for any given page. The page scores are then adjusted using weighted values that can be modified in the administrative interface by editing the Search Classification plugin of the Page Processing module. Pages can be weighted on the basis of the page type (first, middle, or last). By default, Ephesoft is configured to reduce the scores for the middle and last pages by 10 percent and 20 percent, respectively, as the first pages are more important when it comes to the separation of documents. This effectively biases Ephesoft in favor of using a page to create a new document (over using it as the middle or last page of a document).
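
The weighting step described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not Ephesoft's implementation: the names PAGE_TYPE_WEIGHTS and weight_page_scores are hypothetical, and the raw scores are made up. The weights assume the default reductions of 10 percent for middle pages and 20 percent for last pages.

```python
# Hypothetical weights mirroring the defaults described above:
# first pages keep their score, middle pages lose 10%, last pages lose 20%.
PAGE_TYPE_WEIGHTS = {"first": 1.0, "middle": 0.9, "last": 0.8}


def weight_page_scores(raw_scores):
    """Apply the page-type weight to each raw score.

    raw_scores maps a page type ("first", "middle", "last") to the raw
    score that a page received for that page type.
    """
    return {
        page_type: score * PAGE_TYPE_WEIGHTS[page_type]
        for page_type, score in raw_scores.items()
    }


if __name__ == "__main__":
    # Made-up raw scores for one page compared against one document type.
    raw = {"first": 20.0, "middle": 20.0, "last": 20.0}
    for page_type, score in weight_page_scores(raw).items():
        print(f"{page_type}: {score:.1f}")
```

With identical raw scores, the first-page interpretation ends up with the highest weighted score, which is the bias toward starting a new document described above.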

The plugin properties of search classification

Using the page scores calculated in the previous step (and adjusted using the weighted values from the Search Classification plugin), Ephesoft calculates all possible document assemblies and selects the result with the highest score.

The score is calculated as follows: First, the scores of all the pages in the assembly are averaged. Ephesoft then adjusts the average by using a multiplier from the Document Assembler plugin. If you look at the plugin settings, you will notice that there are several multipliers available. If the assembly has a first and a last page, for example, the DA Rule First-last Page multiplier will be chosen. An assembly with first, middle, and last pages will use the DA Rule First-middle-last Page multiplier.

The plugin properties of Document Assembler

Suppose, for example, that you have trained a batch class to recognize the first and middle pages of an invoice. If you run a three-page batch through Ephesoft, you might get the following results:

Page 1 is determined to be the first page of an invoice because Invoice_First_Page received the highest score:
- Page 1 compared to Invoice_First_Page receives a score of 30.2
- Page 1 compared to Invoice_Middle_Page receives a score of 4.2
Page 2 is determined to be a middle page of an invoice because Invoice_Middle_Page received the highest score. Because it immediately follows page 1 in the batch, it becomes the second page of the invoice that began on page 1:
- Page 2 compared to Invoice_First_Page receives a score of 2.6
- Page 2 compared to Invoice_Middle_Page receives a score of 12.2
Page 3 is determined to be the first page of an invoice because Invoice_First_Page received the highest score. Since it is a first page, it starts a new document:
- Page 3 compared to Invoice_First_Page receives a score of 31.6
- Page 3 compared to Invoice_Middle_Page receives a score of 3.8

In this case, there is no score for Invoice_Last_Page as there were no last page samples used to train this Ephesoft instance.

When using the drag and drop classification training in Batch Class Management, Ephesoft will automatically place a last page for any document having more than one page. If that is not the only possible last page of the document type, you will have to go into Folder Management and move all samples and files from the last page training for the document type into the middle pages. Once the files are moved, go back into Batch Class Management and click on the Learn Files button to retrain the system.

The first document assembled will be a two-page invoice because Ephesoft found a first page of an invoice followed by a middle page of an invoice. The second document assembled will be a one-page invoice since only the first page of an invoice was found.
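
The separation logic in this example can be approximated with a short Python sketch. It is a deliberate simplification: it picks the best page type for each page and starts a new document at every first page, whereas Ephesoft, as described earlier, evaluates all possible assemblies and keeps the highest-scoring one. The function names are hypothetical; the scores are the ones from the example.

```python
# Page scores from the worked example, keyed by trained page type.
PAGE_SCORES = [
    {"Invoice_First_Page": 30.2, "Invoice_Middle_Page": 4.2},
    {"Invoice_First_Page": 2.6, "Invoice_Middle_Page": 12.2},
    {"Invoice_First_Page": 31.6, "Invoice_Middle_Page": 3.8},
]


def best_page_type(scores):
    """Return the (page type, score) pair with the highest score."""
    return max(scores.items(), key=lambda item: item[1])


def assemble(page_scores):
    """Group pages into documents; every first page starts a new document."""
    documents = []
    for page_number, scores in enumerate(page_scores, start=1):
        page_type, score = best_page_type(scores)
        # Also start a document if the batch does not begin with a first page.
        if page_type.endswith("_First_Page") or not documents:
            documents.append([])
        documents[-1].append((page_number, page_type, score))
    return documents


if __name__ == "__main__":
    for index, document in enumerate(assemble(PAGE_SCORES), start=1):
        pages = [page_number for page_number, _, _ in document]
        print(f"Document {index}: pages {pages}")
```

For the three-page batch, the sketch yields a two-page document (pages 1 and 2) followed by a one-page document (page 3), matching the result described above.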

The confidence scores that each of these documents received are calculated as follows:

Document 1 (pages 1 and 2): (30.2 + 12.2)/2 = 21.2, and 21.2 × 50% = 10.6
- The average page score times the DA Rule First-middle Page multiplier
Document 2 (page 3): 31.6/1 = 31.6, and 31.6 × 50% = 15.8
- The average page score times the DA Rule First Page multiplier

If the Minimum Confidence Scores setting of the Invoice document type is set to 10, then both documents exceed the threshold, and this batch will skip the review step and move directly to extraction. If the setting is 15, then this batch will stop in review because the first document (10.6) falls below the threshold.
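
The calculation and the threshold check can be reproduced with a small Python sketch. The function names and the MULTIPLIERS table are hypothetical; the table holds only the two 50 percent multipliers used in the example, keyed by the page types present in a document.

```python
# Multipliers from the worked example (both 50%). The keys are hypothetical
# labels for the DA Rule settings, not Ephesoft identifiers.
MULTIPLIERS = {
    ("first",): 0.5,           # DA Rule First Page
    ("first", "middle"): 0.5,  # DA Rule First-middle Page
}


def document_confidence(pages):
    """Average the page scores, then apply the multiplier for the page mix.

    pages is a list of (page_type, score) tuples in document order.
    """
    average = sum(score for _, score in pages) / len(pages)
    # dict.fromkeys drops repeated page types while keeping their order.
    shape = tuple(dict.fromkeys(page_type for page_type, _ in pages))
    return average * MULTIPLIERS[shape]


def needs_review(confidence, minimum):
    """A document is flagged when its score is below the minimum threshold."""
    return confidence < minimum


if __name__ == "__main__":
    documents = {
        "Document 1": [("first", 30.2), ("middle", 12.2)],
        "Document 2": [("first", 31.6)],
    }
    for name, pages in documents.items():
        confidence = document_confidence(pages)
        for minimum in (10, 15):
            flagged = needs_review(confidence, minimum)
            print(f"{name}: {confidence:.1f}, minimum {minimum}, review={flagged}")
```

Only Document 1 with a minimum of 15 is flagged, which is why the batch stops in review in that case.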

Barcode classification

Like search classification, barcode classification is a two-step process. In the Page Processing module, pages with barcodes are processed using either the RecoStar plugin or the Barcode Reader plugin. In the Document Assembler plugin, Ephesoft creates a document when the first barcode is found, and all the following pages are appended to that document until a new page with a barcode is found. The barcode value found by the Barcode Reader or RecoStar plugin has to match one of the document type names.

On Linux, Ephesoft will always use the Barcode Reader plugin.
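
The separation rule can be sketched in Python as follows. The function name and the input format (one barcode value per page, or None) are assumptions made for illustration. The source does not say what happens to pages that precede the first barcode or to barcodes that match no document type, so the sketch simply groups leading pages under an unclassified document and treats unmatched barcodes as continuation pages.

```python
def assemble_by_barcode(page_barcodes, document_types):
    """Split a batch into documents at every page with a matching barcode.

    page_barcodes holds one barcode value per page (None when the page has
    no barcode). document_types is the set of document type names.
    """
    documents = []
    for page_number, barcode in enumerate(page_barcodes, start=1):
        if barcode in document_types:
            # A matching barcode starts a new document of that type.
            documents.append({"type": barcode, "pages": [page_number]})
        elif documents:
            # No matching barcode: append to the current document.
            documents[-1]["pages"].append(page_number)
        else:
            # Assumption: pages before the first barcode stay unclassified.
            documents.append({"type": None, "pages": [page_number]})
    return documents


if __name__ == "__main__":
    batch = ["Invoice", None, None, "Letter", None]
    for document in assemble_by_barcode(batch, {"Invoice", "Letter"}):
        print(document["type"], document["pages"])
```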

Image classification

Image classification compares the pixels on the provided documents to the pixels on the trained documents. The more pixels that match the trained document, the higher the confidence score that the document will attain. This is in contrast to search classification, which OCRs the pages and then compares the text.
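
To make the idea of pixel matching concrete, here is a toy Python sketch that scores two same-sized black-and-white bitmaps by the fraction of pixels that agree. It is not Ephesoft's algorithm, which the text does not describe in detail; the function name and the list-of-lists bitmap format are assumptions.

```python
def pixel_match_ratio(page, sample):
    """Return the fraction of positions where two same-sized bitmaps agree.

    Each bitmap is a list of rows, and each row is a list of 0/1 pixels.
    """
    same_shape = len(page) == len(sample) and all(
        len(row_p) == len(row_s) for row_p, row_s in zip(page, sample)
    )
    if not same_shape:
        raise ValueError("bitmaps must have the same dimensions")
    total = sum(len(row) for row in page)
    if total == 0:
        raise ValueError("bitmaps must not be empty")
    matches = sum(
        p == s
        for row_p, row_s in zip(page, sample)
        for p, s in zip(row_p, row_s)
    )
    return matches / total


if __name__ == "__main__":
    page = [[1, 0], [0, 1]]
    sample = [[1, 0], [1, 1]]
    print(pixel_match_ratio(page, sample))
```

A page that matches the trained sample in more positions gets a ratio closer to 1, which corresponds to a higher confidence score.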

When image classification is selected, the Document Assembler plugin uses the image confidence scores to separate and classify documents. The assembly is done using the same algorithm explained in the search classification section.

Automatic classification

Automatic classification uses all enabled classification types. The scores will be combined to come up with an aggregate score per page. This value will be used for assembly and then classification scoring.

Export

We have used the Copy Batch XML plugin to export content to the Ephesoft server's file system. There are a number of additional export options. The CMIS and DB export plugins use standards-based interfaces to allow export to a large number of enterprise content management systems and relational databases. Let's take a look at how to configure these two plugins and then review the other plugins that are available.

CMIS export

The Content Management Interoperability Services (CMIS) API is an open standard for interacting with enterprise document repositories. You can use the CMIS Export plugin to export your scanned content (and associated metadata) to any repository that supports the CMIS standard, such as Alfresco, Documentum, FileNet, or SharePoint. Let's look at how to configure the CMIS Export plugin to send content to Alfresco, a popular open source enterprise content management system.

Ephesoft 4.0 supports CMIS 1.0 and 1.1.

Establish a content model in your CMIS

Suppose that you have an Invoice document type in Ephesoft that has fields for Vendor Name, Invoice Date, and Invoice Total. The first thing that you will want to do is define a custom content model in Alfresco to represent your scanned content. Alfresco defines custom content models in XML files that look like the following:

<type name="acme:invoice">
  <parent>cm:content</parent>
  <properties>
    <property name="acme:vendorName">
      <title>Vendor Name</title>
      <type>d:text</type>
      <mandatory enforced="false">false</mandatory>
      <index enabled="true">
        <atomic>true</atomic>
        <stored>false</stored>
        <tokenised>false</tokenised>
      </index>
    </property>
    <!-- acme:invoiceDate and acme:invoiceTotal properties omitted -->
  </properties>
</type>

Alfresco document type and property names have prefixes to prevent namespace collisions in the content models. We have used an acme prefix in our examples, as would be the case if this implementation were for Acme Corporation. The example above shows a document type, acme:invoice, that extends Alfresco's base document type, cm:content. This custom type has a text property called acme:vendorName. Not shown here are the date property called acme:invoiceDate and the float property called acme:invoiceTotal.

Configure the CMIS Export plugin

After creating the content model, you will need to configure Ephesoft to use CMIS to send the processed content to Alfresco. There are three places in Ephesoft where you need to configure the CMIS export:

The plugin settings in the administrative user interface
The mapping files, in your batch class cmis-plugin-mapping folder
The global configuration file, located in your Ephesoft installation folder here: Application/WEB-INF/classes/META-INF/dcma-cmis/dcma-cmis.properties

Let's start with the plugin settings. From the batch class management interface, select and edit your batch class, the export module, and then the CMIS Export plugin. This comes configured by default with a disabled sample connection to Alfresco's public CMIS server.

The plugin properties of CMIS Export

The CMIS plugin can be configured as follows:

Root Folder Name: This is the name of the destination folder in the document repository where Ephesoft should load the exported documents. In Alfresco, this folder will be created underneath the root folder (which is typically named Company Home).
Upload File Extension: This setting controls whether the documents are uploaded to your document management system as PDF or TIF images.
Server URL: The services provided by CMIS are defined in an XML service document; this is the location of that document. Alfresco 4.0 hosts this file at /alfresco/service/cmis. Alfresco 5.0 hosts this file at /alfresco/api/-default-/public/cmis/versions/1.1/atom.
User Name and Password: This is the authentication information required to connect to the document management system.
Repository Id: Some document management systems are capable of hosting multiple repositories. When this is the case, each repository is listed in the service document with an associated identifier. You should examine the service document to find the identifier for your repository.
Server Switch: This can be used to enable and disable export to your document management system.
Aspect Switch: Alfresco manages dynamically assignable groups of properties called aspects. This switch enables support for aspects.
Export File Name: This is the naming convention for the exported documents.
Export Client Key, Secret Key, Refresh Token, Redirect URL, and Export Network: These properties are used to implement OAuth authentication.
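
To find the value for Repository Id, you can inspect the service document by hand or with a short script. The Python sketch below parses a CMIS AtomPub service document and lists the repository identifiers it contains. The function name is hypothetical, and SAMPLE is a heavily trimmed, made-up service document; a real one, fetched from the Server URL, contains many more elements.

```python
import xml.etree.ElementTree as ET

# XML namespaces used in CMIS AtomPub service documents.
NS = {
    "app": "http://www.w3.org/2007/app",
    "cmis": "http://docs.oasis-open.org/ns/cmis/core/200908/",
    "cmisra": "http://docs.oasis-open.org/ns/cmis/restatom/200908/",
}

# Trimmed, made-up service document used only for this illustration.
SAMPLE = """<app:service xmlns:app="http://www.w3.org/2007/app"
    xmlns:cmis="http://docs.oasis-open.org/ns/cmis/core/200908/"
    xmlns:cmisra="http://docs.oasis-open.org/ns/cmis/restatom/200908/">
  <app:workspace>
    <cmisra:repositoryInfo>
      <cmis:repositoryId>-default-</cmis:repositoryId>
      <cmis:repositoryName>Main Repository</cmis:repositoryName>
    </cmisra:repositoryInfo>
  </app:workspace>
</app:service>"""


def list_repositories(service_document_xml):
    """Return (repository id, repository name) pairs from a service document."""
    root = ET.fromstring(service_document_xml)
    repositories = []
    for workspace in root.findall("app:workspace", NS):
        info = workspace.find("cmisra:repositoryInfo", NS)
        if info is None:
            continue
        repository_id = info.findtext("cmis:repositoryId", namespaces=NS)
        repository_name = info.findtext("cmis:repositoryName", namespaces=NS)
        repositories.append((repository_id, repository_name))
    return repositories


if __name__ == "__main__":
    for repository_id, repository_name in list_repositories(SAMPLE):
        print(repository_id, repository_name)
```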

Document type and property mapping

Next, you need to associate Ephesoft document types with Alfresco document types. Ephesoft's fields also need to be mapped to the properties of Alfresco documents. Edit this file in your batch class configuration area: cmis-plugin-mapping/DLF-Attribute-mapping.properties. This file contains some examples of content mapping. Delete the examples and set up your own mapping, as follows:

Invoice=D:acme:invoice
Invoice.VendorName=acme:vendorName
Invoice.InvoiceDate=acme:invoiceDate
Invoice.InvoiceTotal=acme:invoiceTotal

The first line of this property file associates the document types, and the last three lines associate the fields. When mapping document types, you will need to prepend D: to your document repository's type name. This is the CMIS syntax for representing a document (as opposed to, for example, a folder) in Alfresco.
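
When a mapping file grows, it can help to check it with a script before running a batch. The following Python sketch splits the mapping shown above into document type mappings and field mappings. It is not part of Ephesoft; the function name is hypothetical, and the sketch assumes that document type names contain no dots and that lines starting with # are comments.

```python
MAPPING = """\
Invoice=D:acme:invoice
Invoice.VendorName=acme:vendorName
Invoice.InvoiceDate=acme:invoiceDate
Invoice.InvoiceTotal=acme:invoiceTotal
"""


def parse_mapping(text):
    """Split mapping lines into type mappings and per-type field mappings."""
    type_map, field_map = {}, {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if "." in key:
            # "Invoice.VendorName" maps a field of the Invoice type.
            document_type, field = key.split(".", 1)
            field_map.setdefault(document_type, {})[field] = value
        else:
            # "Invoice" maps the document type itself.
            type_map[key] = value
    return type_map, field_map


if __name__ == "__main__":
    types, fields = parse_mapping(MAPPING)
    for document_type, repository_type in types.items():
        if not repository_type.startswith("D:"):
            print(f"Warning: {document_type} is not mapped to a D: type")
        print(document_type, "->", repository_type, fields.get(document_type, {}))
```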

Aspects are configured in the following batch class configuration file: cmis-plugin-mapping/aspects-mapping.properties.

Global CMIS configuration

The final area where CMIS is configured in Ephesoft is the following file: Application/WEB-INF/classes/META-INF/dcma-cmis/dcma-cmis.properties. This file affects the CMIS configuration of all batch classes.

The most commonly modified setting in this file is the date format. When you map a date field, Ephesoft needs to parse the date in order to reformat the information to match the CMIS specification. The cmis.date_format parameter specifies the date format of the Ephesoft fields that are exported using CMIS. See the Javadoc for the SimpleDateFormat class to learn how to specify date formats.

If your content management system uses Web Services Security (WS-Security) to secure its CMIS web services, you will need to adjust the value of the cmis.security.mode property. This specifies the security mode to use when attempting to connect to the CMIS web services. There are two possible values: basic and wssecurity. The default value, basic, uses HTTP Basic Authentication for the Ephesoft CMIS connection. Set the property to wssecurity to have the CMIS credentials that are configured in the CMIS_EXPORT plugin included in the WS-Security SOAP header of the CMIS web service requests.

If your CMIS web services are not addressable from a single URL, you can configure the location of each service used by Ephesoft. You will see a set of properties that begin with cmis.url. These can be edited to specify where your content management system hosts each service's WSDL.

Database Export

DB Export allows document-level field values and metadata to be exported to relational databases using JDBC. Administrators can map the Ephesoft document fields to the database table columns.

First, go to the system configuration area to create a new connection in Connection Manager:

Connection Manager with connection properties for database export

Next, return to the batch class management area and configure your batch class. If the DB Export plugin is part of this batch class's workflow, then you will be able to configure the plugin from the Modules section.

The configuration of the plugin is simple; its only setting is a switch that enables the plugin.

Plugin properties of database export

In Batch Class Management, under the document type, you can configure DB Export Configuration. Select the correct database connection, and then map the document type fields to the table and columns. Click on Apply to save your changes.

Database export mapping

When the DB Export plugin runs, it will export the extracted field data for each document in the batch.

Sample results of database export
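
Conceptually, the export writes one row per document, with each mapped field landing in its column. The Python sketch below illustrates that idea with an in-memory SQLite database standing in for the JDBC connection. Everything in it is an assumption made for illustration: the table name, the column names, the FIELD_TO_COLUMN mapping, and the sample values do not come from Ephesoft.

```python
import sqlite3

# Hypothetical mapping from Ephesoft field names to table columns.
FIELD_TO_COLUMN = {
    "VendorName": "vendor_name",
    "InvoiceDate": "invoice_date",
    "InvoiceTotal": "invoice_total",
}


def export_documents(connection, table, documents):
    """Insert one row per document, using the field-to-column mapping.

    table and the column names come from trusted configuration, so they are
    interpolated into the SQL; the field values are passed as parameters.
    """
    columns = list(FIELD_TO_COLUMN.values())
    placeholders = ", ".join("?" for _ in columns)
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    rows = [
        tuple(document.get(field) for field in FIELD_TO_COLUMN)
        for document in documents
    ]
    connection.executemany(sql, rows)
    connection.commit()
    return len(rows)


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    connection.execute(
        "CREATE TABLE invoices "
        "(vendor_name TEXT, invoice_date TEXT, invoice_total REAL)"
    )
    batch = [
        {"VendorName": "Acme", "InvoiceDate": "2015-03-15", "InvoiceTotal": 12.5},
        {"VendorName": "Initech", "InvoiceDate": "2015-03-16", "InvoiceTotal": 99.0},
    ]
    export_documents(connection, "invoices", batch)
    for row in connection.execute("SELECT * FROM invoices"):
        print(row)
```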

Other export plugins

Thus far, we have shown you how to export to the local file system or use CMIS and JDBC. These are general-purpose plugins that can be used in a variety of situations. Ephesoft comes with a few other general-purpose plugins such as the CSV plugin and the tabbed PDF plugin.

Ephesoft also provides a handful of plugins to facilitate export into specific content management systems such as Docushare, HPII FileNet, and IBM CM.

To see the list of available plugins, you should edit your batch class and then edit the export module.

Summary

In this article, you have learned about the classification types that Ephesoft supports, how confidence scores are calculated, and additional techniques for exporting your documents and metadata. At this point, you should be able to use Ephesoft to implement intelligent document capture for a wide variety of organizations.