





















































In this article, Pat Myers, author of the book Intelligent Document Capture with Ephesoft - Second Edition, explains the document classification types and export plugins that are available in Ephesoft.
We already know how to configure search classification to enable Ephesoft to recognize an invoice document. Several other classification types are available; we will explain these alternatives now.
You can select the process that Ephesoft will use to classify documents by editing your batch class, editing the Document Assembly module, editing the Document Assembler plugin module within that module, and then selecting a value for DA Classification Type.
Search classification (also sometimes called Lucene classification) is the default classification method and is recommended for most content. When configured to perform search classification, Ephesoft compares the text on each input page to the text on training documents to determine its confidence that a document is of a certain type.
Image classification is the best option when classification cannot be made based on content. This occurs on forms that do not have a lot of text, or where the textual content is unpredictable but the physical appearance (such as layout, graphics, and formatting) is consistent. Credit card applications that are red dropout forms (where only the user-entered text is visible to the OCR engine) are candidates for this classification technique.
Barcodes can be used for documents that vary in content and layout, like white mail (unformatted correspondence received in the mail). If a barcode is found on the page with a name that matches an Ephesoft document type, Ephesoft will set the current document's type to that type.
The automatic classification type tells Ephesoft to use the scores of every classification plugin that is enabled. This may be necessary when no single classification technique will suffice for your batch class, but configuring multiple classification plugins will have a negative impact on Ephesoft's performance.
One document classification is a variant of automatic classification. It assembles all the pages in the batch into a single document.
Ephesoft calculates confidence scores for each page in a batch. The page scores represent Ephesoft's certainty that the page being considered is the first, middle, or last page within each document type. They are used to classify and assemble the pages into documents. Ephesoft also uses these page scores to create an aggregate score for each document. This score is compared to the confidence threshold for each document type in the batch class definition. Any document that receives a confidence score below the minimum threshold will be flagged for review. A batch with one or more flagged documents will be placed in a queue for review by an operator.
Confidence scores are calculated differently for each classification type.
The default classification type is search classification. Search classification separates and classifies documents by using a two-step process. The first step is to collect information about the pages. The Search Classification plugin of the Page Processing module performs this function. The second step is to separate documents and determine their type. This is the responsibility of the Document Assembler plugin.
The plugin properties of search classification
The score is calculated as follows: First, the scores of each page in the assembly are averaged. Ephesoft then adjusts the average by using a multiplier in the Document Assembler plugin. You will notice, looking at the following plugin settings screen, that there are several multipliers available. If the assembly has a first and a last page, for example, the DA Rule first-last Page multiplier will be chosen. An assembly with the first, last, and middle pages will use the "DA Rule First-middle-last Page" multiplier.
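The calculation described above can be sketched in a few lines of Python. This is only an illustration of "average the page scores, then apply a multiplier": the multiplier values and page scores below are hypothetical, and the real multipliers are whatever is configured in the Document Assembler plugin properties.

```python
from statistics import mean

# Hypothetical multipliers, keyed by which page roles appear in the assembly.
# The real values come from the Document Assembler plugin properties.
MULTIPLIERS = {
    frozenset({"first"}): 0.5,
    frozenset({"first", "middle"}): 0.75,
    frozenset({"first", "last"}): 1.0,
    frozenset({"first", "middle", "last"}): 1.0,
}

def document_score(pages):
    """Return the score for one assembled document.

    pages: list of (role, page_score) tuples, where role is
    "first", "middle", or "last".
    """
    roles = frozenset(role for role, _ in pages)
    average = mean(score for _, score in pages)
    # Adjust the average by the multiplier that matches the page roles found.
    return average * MULTIPLIERS[roles]

two_page_document = [("first", 20.0), ("middle", 10.0)]
print(document_score(two_page_document))  # 15.0 * 0.75 = 11.25
```

Keying the multipliers by the set of roles present mirrors the way the text describes choosing a multiplier based on which page types the assembly contains.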
The plugin properties of Document Assembler
Suppose, for example, that you have trained a batch class to recognize the first and middle pages of an invoice. If you run a three-page batch through Ephesoft, you might get the following results:
In this case, there is no score for Invoice_Last_Page as there were no last page samples used to train this Ephesoft instance.
When using the drag and drop classification training in Batch Class Management, Ephesoft will automatically place a last page for any document having more than one page. If that is not the only possible last page of the document type, you will have to go into Folder Management and move all samples and files from the last page training for the document type into the middle pages. Once the files are moved, go back into Batch Class Management and click on the Learn Files button to retrain the system.
The first document assembled will be a two-page invoice because Ephesoft found a first page of an invoice followed by a middle page of an invoice. The second document assembled will be a one-page invoice since only the first page of an invoice was found.
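As a rough illustration of this assembly behaviour, the following Python sketch groups classified pages into documents. It is a simplified assumption about the logic, not Ephesoft's actual algorithm: it assumes each page has already been assigned its best-scoring document type and role, and that a first page always opens a new document.

```python
def assemble(pages):
    """Group classified pages into documents.

    pages: list of (doc_type, role) tuples in batch order, where role is
    "first", "middle", or "last". Simplified sketch: a "first" page, a change
    of document type, or a page after a "last" page opens a new document.
    """
    documents = []
    for doc_type, role in pages:
        starts_new = (
            role == "first"
            or not documents
            or documents[-1]["type"] != doc_type
            or documents[-1]["closed"]
        )
        if starts_new:
            documents.append({"type": doc_type, "pages": [], "closed": False})
        documents[-1]["pages"].append(role)
        if role == "last":
            documents[-1]["closed"] = True
    return documents

# The three-page batch from the example: first, middle, first.
batch = [("Invoice", "first"), ("Invoice", "middle"), ("Invoice", "first")]
print([len(doc["pages"]) for doc in assemble(batch)])  # [2, 1]
```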
The confidence scores that each of these documents received are calculated as follows:
If the Minimum Confidence Scores setting of the Invoice document type is set to 10, then this batch will skip the review step and move directly to extraction. If the Minimum Confidence Scores for the Invoice document type is set to 15, then this batch will stop in review with the first document requiring review.
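The threshold comparison can be sketched as follows. The document scores used here are hypothetical values chosen only to reproduce the behaviour described above (one document scoring between 10 and 15); they are not taken from an actual Ephesoft run.

```python
def documents_needing_review(document_scores, minimum_confidence):
    """Return the names of documents whose score is below the threshold."""
    return [
        name
        for name, score in document_scores.items()
        if score < minimum_confidence
    ]

# Hypothetical scores for the two assembled invoices.
scores = {"Document 1": 12.0, "Document 2": 18.0}

print(documents_needing_review(scores, 10))  # [] -> batch skips review
print(documents_needing_review(scores, 15))  # ['Document 1'] -> batch stops in review
```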
Barcode classification is also a two-step process, similar to search classification. In the Page Processing module, pages with barcodes are processed using either the RecoStar plugin or the Barcode Reader plugin. In the Document Assembler plugin, Ephesoft creates a document when the first barcode is found, and all subsequent pages are appended to that document until a new page with a barcode is found. The barcode value found by the Barcode Reader or RecoStar plugin has to match one of the document type names.
On Linux, Ephesoft will always use the Barcode Reader plugin.
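The barcode-driven separation can be sketched in Python as shown below. This is a simplified model under stated assumptions: the barcode values have already been read for each page, a barcode that does not match a document type is treated like a page without a barcode, and pages that appear before the first matching barcode are left unassigned. How Ephesoft handles those edge cases is not covered here.

```python
def split_on_barcodes(page_barcodes, document_types):
    """Split a batch into documents at each page carrying a matching barcode.

    page_barcodes: one entry per page, the barcode value or None.
    document_types: set of document type names defined in the batch class.
    """
    documents = []
    for page_number, barcode in enumerate(page_barcodes, start=1):
        if barcode in document_types:
            # A matching barcode starts a new document of that type.
            documents.append({"type": barcode, "pages": [page_number]})
        elif documents:
            # Pages without a matching barcode join the current document.
            documents[-1]["pages"].append(page_number)
    return documents

pages = ["Invoice", None, None, "Application", None]
print(split_on_barcodes(pages, {"Invoice", "Application"}))
# [{'type': 'Invoice', 'pages': [1, 2, 3]}, {'type': 'Application', 'pages': [4, 5]}]
```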
Image classification compares the pixels on the provided documents to the pixels on the trained documents. The more pixels that match the trained document, the higher the confidence score the document will attain. This is in contrast to search classification, which OCRs the pages and then compares the text.
When image classification is selected, the Document Assembler plugin uses the image confidence scores to separate and classify documents. The assembly is done using the same algorithm explained in the search classification section.
Automatic classification uses all enabled classification types. The scores will be combined to come up with an aggregate score per page. This value will be used for assembly and then classification scoring.
We have used the Copy Batch XML plugin to export content to the Ephesoft server's file system. There are a number of additional export options. The CMIS and DB export plugins use standards-based interfaces to allow export to a large number of enterprise content management systems and relational databases. Let's take a look at how to configure these two plugins and then review the other plugins that are available.
The Content Management Interoperability Services (CMIS) API is an open standard for interacting with enterprise document repositories. You can use the CMIS Export plugin to export your scanned content (and associated metadata) to any repository that supports the CMIS standard, such as Alfresco, Documentum, FileNet, or SharePoint. Let's look at how to configure the CMIS Export plugin to send content to Alfresco, a popular open source enterprise content management system.
Ephesoft 4.0 supports CMIS 1.0 and 1.1.
Suppose that you have an Invoice document type in Ephesoft that has fields for Vendor Name, Invoice Date, and Invoice Total. The first thing that you will want to do is define a custom content model in Alfresco to represent your scanned content. Alfresco defines custom content models in XML files that look like the following:
<type name="acme:invoice">
  <parent>cm:content</parent>
  <properties>
    <property name="acme:vendorName">
      <title>Vendor Name</title>
      <type>d:text</type>
      <mandatory enforced="false">false</mandatory>
      <index enabled="true">
        <atomic>true</atomic>
        <stored>false</stored>
        <tokenised>false</tokenised>
      </index>
    </property>
    <!-- acme:invoiceDate (date) and acme:invoiceTotal (float) properties omitted -->
  </properties>
</type>
Alfresco document type and property names have prefixes to prevent namespace collisions in the content models. We have used an acme prefix in our examples, as would be the case if this implementation were for Acme Corporation. The example above shows a document type, acme:invoice, that extends Alfresco's base document type, cm:content. This custom type has a text property called acme:vendorName. Not shown here are the date property called acme:invoiceDate and the float property called acme:invoiceTotal.
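After creating the content model, you will need to configure Ephesoft to use CMIS to send the processed content to Alfresco. There are three places in Ephesoft where you need to configure the CMIS export: the CMIS Export plugin settings, the mapping files in the batch class configuration area, and the application-wide dcma-cmis.properties file.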
Let's start with the plugin settings. From the batch class management interface, select and edit your batch class, the export module, and then the CMIS Export plugin. This comes configured by default with a disabled sample connection to Alfresco's public CMIS server.
The plugin properties of CMIS Export
The CMIS plugin can be configured as follows:
Next, you need to associate Ephesoft document types with Alfresco document types. Ephesoft's fields also need to be mapped to the properties of Alfresco documents. Edit this file in your batch class configuration area: cmis-plugin-mapping/DLF-Attribute-mapping.properties. This file contains some examples of content mapping. Delete the examples and set up your own mapping, as follows:
Invoice=D:acme:invoice
Invoice.VendorName=acme:vendorName
Invoice.InvoiceDate=acme:invoiceDate
Invoice.InvoiceTotal=acme:invoiceTotal
The first line of this property file associates the document types, and the last three lines associate the fields. When mapping document types, you will need to prepend D: to the beginning of your document repository's type name. This is the CMIS syntax for representing a document (as opposed to, for example, a folder) in Alfresco.
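To make the structure of this mapping file concrete, the following Python sketch separates the document type associations from the field associations. It is only an illustration of the file's layout; it does not represent how Ephesoft itself reads the file, and the function name parse_mapping is hypothetical.

```python
MAPPING = """\
Invoice=D:acme:invoice
Invoice.VendorName=acme:vendorName
Invoice.InvoiceDate=acme:invoiceDate
Invoice.InvoiceTotal=acme:invoiceTotal
"""

def parse_mapping(text):
    """Split mapping lines into type mappings and per-type field mappings."""
    type_map, field_map = {}, {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        if "." in key:
            # "DocumentType.FieldName" maps an Ephesoft field to a property.
            doc_type, field = key.split(".", 1)
            field_map.setdefault(doc_type, {})[field] = value
        else:
            # A bare document type name maps to the repository type.
            type_map[key] = value
    return type_map, field_map

types, fields = parse_mapping(MAPPING)
print(types)                         # {'Invoice': 'D:acme:invoice'}
print(fields["Invoice"]["VendorName"])  # acme:vendorName
```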
Aspects are configured in the following batch class configuration file: cmis-plugin-mapping/aspects-mapping.properties.
The final area where CMIS is configured in Ephesoft is the following file: Application/WEB-INF/classes/META-INF/dcma-cmis/dcma-cmis.properties. This file affects the CMIS configuration of all batch classes.
The most commonly modified setting in this file is the date format. When you map a date field, Ephesoft needs to parse the date in order to reformat the information to match the CMIS specification. The cmis.date_format parameter specifies how Ephesoft fields that will be exported using CMIS will be formatted. See the JavaDoc for the SimpleDateFormat class to learn how to specify date formats.
If your content management system uses Web Services Security (WSS) to secure its CMIS web services, you will need to adjust the value of the cmis.security.mode property, which specifies the security mode to use when connecting to the CMIS web services. There are two possible values: basic and wssecurity. The default, basic, uses HTTP Basic Authentication. Set the property to wssecurity to have the CMIS credentials that are configured in the CMIS_EXPORT plugin included in the WS-Security SOAP header of the CMIS web service requests.
If your CMIS web services are not addressable from a single URL, you can configure the location of each service used by Ephesoft. You will see a set of properties that begin with cmis.url. These can be edited to specify where your content management system hosts each service's WSDL.
DB Export allows document-level field values and metadata to be exported to relational databases using JDBC. Administrators can map the Ephesoft document fields to the database table columns.
First, go to the system configuration area to create a new connection in Connection Manager:
Connection Manager with connection properties for database export
Next, return to the batch class management area and configure your batch class. If the DB Export plugin is configured into this batch class' workflow, then you will be able to configure the plugin from the Modules section.
The configuration of the plugin is simple: there is only a switch to enable the plugin.
Plugin properties of database export
In Batch Class Management, under the document type, you can configure DB Export Configuration. Select the correct database connection, and then map the document type fields to the table and columns. Click on Apply to save your changes.
Database export mapping
When the DB Export plugin runs, it will export the extracted field data for each document in the batch.
Sample results of database export
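The effect of a field-to-column mapping can be illustrated with a short Python sketch. It uses an in-memory SQLite database as a stand-in for the JDBC connection, and the table name, column names, and sample values are all hypothetical; Ephesoft performs the actual export itself once the mapping is configured.

```python
import sqlite3

# Hypothetical mapping from Ephesoft field names to database columns.
FIELD_TO_COLUMN = {
    "VendorName": "vendor_name",
    "InvoiceDate": "invoice_date",
    "InvoiceTotal": "invoice_total",
}

def export_document(conn, table, fields):
    """Insert one document's extracted fields as a row in the mapped table."""
    columns = [FIELD_TO_COLUMN[name] for name in fields]
    placeholders = ", ".join("?" for _ in columns)
    # Table and column names come from the trusted mapping above;
    # field values are passed as bound parameters.
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    conn.execute(sql, list(fields.values()))

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoices (vendor_name TEXT, invoice_date TEXT, invoice_total REAL)"
)
export_document(
    conn,
    "invoices",
    {"VendorName": "Acme Supplies", "InvoiceDate": "2015-06-30", "InvoiceTotal": 125.50},
)
print(conn.execute("SELECT * FROM invoices").fetchall())
# [('Acme Supplies', '2015-06-30', 125.5)]
```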
Thus far, we have shown you how to export to the local file system or use CMIS and JDBC. These are general-purpose plugins that can be used in a variety of situations. Ephesoft comes with a few other general-purpose plugins such as the CSV plugin and the tabbed PDF plugin.
Ephesoft also provides a handful of plugins to facilitate export into specific content management systems such as Docushare, HPII FileNet, and IBM CM.
To see the list of available plugins, you should edit your batch class and then edit the export module.
In this article, you have learned how to process forms with many different layouts and how to use additional extraction techniques. At this point, you should be able to use Ephesoft to implement intelligent document capture for a wide variety of organizations.