Alfresco 3 Business Solutions: Document Migration Strategies

Alfresco 3 Business Solutions

Practical implementation techniques and guidance for delivering business solutions with Alfresco

by Martin Bergljung | February 2011 | Content Management Open Source

This article covers an important phase of any Document Management project: document migration. Most companies have documents on a network drive that they want to migrate to Alfresco before it goes live. When starting a document migration project, it is important to plan ahead and set up a staging area where users can copy the documents that they want migrated. We will talk about setting up a strategy for document migration and how to plan it. We will also look at different ways of importing documents into Alfresco: via CIFS, via an ACP file, and via an external tool.

In this article by Martin Bergljung, author of Alfresco 3 Business Solutions, you will learn:

  • Different strategies for implementing document migration

 


The Best Money CMS project is now in full swing: the folder structure with business rules has been designed and implemented, and the domain content model has been created. It is now time to start importing existing documents into the Alfresco repository. Most companies that implement an ECM system, and Best Money is no exception, will have a substantial number of files that they want to import, classify, and make searchable in the new CMS.

The planning and preparation for the document migration actually has to start a lot earlier, as there are a lot of things that need to be prepared:

  • Who is going to manage sorting out files that should be migrated?
  • What is the strategy and process for the migration?
  • What sort of classification should be done during the import?
  • What filesystem metadata needs to be preserved during the import?
  • Do we need to write any temporary scripts or rules just for the import?

Document migration strategies

The first thing we need to do is to figure out how the document migration is actually going to be done. There are several ways of making this happen. We will discuss a couple of different ways, such as via the CIFS interface and via tools. There are also some general strategies that apply to any migration method.

General migration strategies

There are some common things that need to be done no matter which import method is used, such as setting up a document migration staging area.

Document staging area

End users need to be able to copy or move the documents that they want to migrate to a staging area that mirrors the new folder structure that we have set up in Alfresco. The best way to set up the staging area is to copy the folder structure from Alfresco via CIFS. When this is done, the end users can start copying files to the staging area. However, it is a good idea to train the users in the new folder structure before they start copying documents to it. We should talk to them about what has changed in the folder structure, what rules and naming conventions have been set up, the idea behind them, and why they should be followed.

If we do not train the end users in the new folder structure, they will not honor it and the old structure will get mixed up with the new structure via document migration, and this is not something that we want. We did plan and implement the new structure for today's requirements and future requirements and we do not want it broken before we even start using the system.

The end users will typically work with the staging area over some time. It is good if they get a couple of weeks for this. It will take them time to think about what documents they want to migrate and if any re-organization is needed. Some documents might also need to be renamed.

Preserving Modified Date on imported documents

We know that Best Money wants all their modified dates on the files to be preserved during an import, as they have a review process that is dependent on it. This means that we have to use an import method that can preserve the Modified Date on the network drive files when they are merged into the Alfresco repository. The CIFS interface cannot be used for this as it sets Modified Date to Current Date.

There are a couple of methods that can be used to import content into the repository and preserve the Modified Date:

  • Create an ACP file via an external tool and then import it
  • Custom code the import with the Foundation API and turn off the Audit Aspect before the import
  • Use an import tool that also has the possibility to turn off the Audit Aspect

At the time of writing (when I am using Alfresco 3.3.3 Enterprise and Alfresco Community 3.4a) there is no easy way to import files and preserve the Modified Date. When a file is added via Alfresco Explorer, Alfresco Share, FTP, CIFS, Foundation API, REST API, and so on, the Created Date and Modified Date are set to "now", so we lose all the Modified Date data that was set on the files on the network drive.

The Created Date, Creator, Modified Date, Modifier, and Access Date are all so-called Audit properties that are automatically managed by Alfresco if a node has the cm:auditable aspect applied. If we try to set these properties during an import via one of the APIs, it will not succeed.

Most people want to import files via CIFS or via an external import tool. Alfresco is working towards supporting preserving dates when using both these methods for import. Currently, there is a solution to add files via the Foundation API and preserve the dates, which can be used by custom tools. The Alfresco product itself also needs this functionality in, for example, the Transfer Service Receiver, so the dates can be preserved when it receives files.

The new solution that enables the use of the Foundation API to set Auditable properties manually has been implemented in version 3.3.2 Enterprise and 3.4a Community. To be able to set audit properties do the following:

  1. Inject the policy behavior filter in the class that should do the property update:

    <property name="behaviourFilter" ref="policyBehaviourFilter"/>

  2. Then, in the class, turn off the audit aspect before the update. This has to be done inside a new transaction, as in the following example:

    RetryingTransactionCallback<Object> txnWork =
        new RetryingTransactionCallback<Object>() {
      public Object execute() throws Exception {
        // Stop Alfresco from managing the audit properties in this transaction
        behaviourFilter.disableBehaviour(ContentModel.ASPECT_AUDITABLE);

  3. Then, in the same transaction, update the Created or Modified Date:

        nodeService.setProperty(nodeRef,
            ContentModel.PROP_MODIFIED, someDate);
        . . .
      }
    };

With JDK 6, the Modified Date is the only file data that we can access, so no other file metadata is available via the CIFS interface. If we use JDK 7, there is a new NIO 2 interface that gives access to more metadata. So, if we are implementing an import tool that creates an ACP file, we could use JDK 7 and preserve Created Date, Modified Date, and potentially other metadata as well.
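
As a minimal sketch of the JDK 7 approach, the following Java class reads the creation and modified dates of a file through the NIO 2 BasicFileAttributes view. The class name FileDates and its Dates holder are made up for illustration, and how the dates are then written into an ACP file or set on a node is deliberately left out. Note that not every file system records a creation time; in that case the JDK returns an implementation-specific default, typically the last modified time.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.Date;

// Hypothetical helper, not part of Alfresco or of any import tool
final class FileDates {

    /** Simple holder for the two dates we want to preserve. */
    static final class Dates {
        final Date created;
        final Date modified;

        Dates(Date created, Date modified) {
            this.created = created;
            this.modified = modified;
        }
    }

    /** Reads creation and last modified time of a file via NIO 2 (JDK 7+). */
    static Dates read(Path file) throws IOException {
        BasicFileAttributes attrs =
            Files.readAttributes(file, BasicFileAttributes.class);
        return new Dates(
            new Date(attrs.creationTime().toMillis()),
            new Date(attrs.lastModifiedTime().toMillis()));
    }
}
```

A call such as FileDates.read(Paths.get("some-file.doc")) returns both dates as java.util.Date values, which is the type that the nodeService.setProperty call shown earlier is given for the Modified Date.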

Post migration processing scripts

When the document migration has been completed, we might want to do further processing of the documents such as setting extra metadata. This is specifically needed when documents are imported into Alfresco via the CIFS interface, which does not allow any custom metadata to be set during the import. There might also be situations, such as in the case of Best Money, where a lot of the imported documents have older filenames (that is, following an older naming convention) with important metadata that should be extracted and applied to the new document nodes.

For post migration processing, JavaScript is a convenient tool to use. We can easily define Lucene queries for the nodes we want to process, as the rules have applied domain document types such as Meeting to the imported documents, and we can use regular expressions to match and extract the metadata we want to apply to the nodes.

Search restrictions when running post migration scripts

What we have to think about though, when running these post migration scripts, is that the repository now contains a lot of content, so each query we run might very well return much more than 1,000 rows. And 1,000 rows is the default max limit that a search will return.

To change this to allow for 5,000 rows to be returned, we have to make some changes to the permission check configuration (Alfresco checks the permissions for each node that is being accessed, so the user running the query is not getting back content that he or she should not have access to). Open the alfresco-global.properties file located in the alfresco/tomcat/shared/classes directory and add the following properties:

# The maximum time spent pruning results (was 10000)
system.acl.maxPermissionCheckTimeMillis=100000
# The maximum number of results to perform permission checks against (was 1000)
system.acl.maxPermissionChecks=5000

Unwanted Modified Date updates when running scripts

So we have turned off the audit feature during document migration, or made some custom code changes to Alfresco, to get the document's Modified Date to be preserved during import. Then we have turned on auditing again so the system behaves in the way the users expect.

The last thing we want now is for all those preserved modified dates to be set to current date when we update metadata. And this is what will happen if we are not running the post migration scripts with the audit feature turned off. So this is important to think about unless you want to start all over again with the document migration.

Versioning problems when running post migration scripts

Another thing that can cause problems is having versioning turned on for the documents that we are updating with the post migration scripts. In that case we might see the following error:

org.alfresco.service.cmr.version.VersionServiceException: 07120018
The current implementation of the version service does not support
the creation of branches.

By default, new versions will be created even when we just update properties/metadata. This can cause errors such as the preceding one, and we might not even be able to check in and check out the document. To prevent this error from popping up, and to turn off versioning during property updates once and for all, we can set the following property at the same time as we set the other domain metadata in the scripts:

legacyContentFile.properties["cm:autoVersionOnUpdateProps"] = false;

Setting this property to false effectively turns off versioning during any property/metadata update for the document.

Another potential problem is that folders have been set up as versionable by mistake. The most likely reason for this is that we forgot to set up the Versioning Rule to only apply to cm:content (and not to "All Items"). Folders in the workspace://SpacesStore store do not support versioning.

The WCM system comes with an AVM store that supports advanced folder versioning and change sets. Note that the WCM system can also store its data in the Workspace store.

So we need to update the versioning rule to apply to the content and remove the versionable aspect from all folders, which have it applied, before we can update any content in these folders. Here is a script that removes the cm:versionable aspect from any folder having it applied:

var store = "workspace://SpacesStore";
var query = "PATH:\"/app:company_home//*\" AND TYPE:\"cm:folder\" " +
    "AND ASPECT:\"cm:versionable\"";
var versionableFolders = search.luceneSearch(store, query);

for each (var versionableFolder in versionableFolders) {
    versionableFolder.removeAspect("cm:versionable");
    logger.log("Removed versionable aspect from folder: " +
        versionableFolder.name);
}
logger.log("Removed versionable aspect from " +
    versionableFolders.length + " folders");

Post migration script to extract legacy meeting metadata

Best Money has a lot of documents that they are migrating to the Alfresco repository. Many of the documents have filenames following a certain naming convention. This is the case for the meeting documents that are imported. The naming convention for the old imported documents is not exactly the same as the new meeting naming convention, so we have to write the regular expression a little bit differently.

An example of a filename with the new naming convention looks like this: 10En-FM.02_3_annex1.doc. The same filename with the old naming convention looks like this: 10Eng-FM.02_3_annex1.doc. The difference is that the old naming convention does not specify a two-character code for the language but instead uses a value from a list that looks like this: Arabic, Chinese, Eng|eng, F|Fr, G|Ger, Indonesian, Jpn, Port, Rus|Russian, Sp, Sw, Tagalog, Turkish. What we are interested in extracting is the language and the department code, and the following script will do that with a regular expression:

// Regular expression definition:
// group 1 = legacy language code, group 2 = department code
var re = new RegExp("^\\d{2}(Arabic|Chinese|Eng|eng|F|Fr|G|Ger|" +
    "Indonesian|Ital|Jpn|Port|Rus|Russian|Sp|Sw|Tagalog|Turkish)-(A|" +
    "HR|FM|FS|FU|IT|M|L).*");

var store = "workspace://SpacesStore";
var query = "+PATH:\"/app:company_home/cm:Meetings//*\" " +
    "+TYPE:\"cm:content\"";
var legacyContentFiles = search.luceneSearch(store, query);

for each (var legacyContentFile in legacyContentFiles) {
    if (re.test(legacyContentFile.name)) {
        var language = getLanguageCode(RegExp.$1);
        var department = RegExp.$2;
        logger.log("Extracted and updated metadata (language=" + language +
            ")(department=" + department + ") for file: " +
            legacyContentFile.name);
        if (legacyContentFile.hasAspect("bmc:document_data")) {
            // Set some metadata extracted from file name
            legacyContentFile.properties["bmc:language"] = language;
            legacyContentFile.properties["bmc:department"] = department;

            // Make sure versioning is not enabled for property updates
            legacyContentFile.properties["cm:autoVersionOnUpdateProps"] =
                false;

            legacyContentFile.save();
        } else {
            logger.log("Aspect bmc:document_data is not set for document " +
                legacyContentFile.name);
        }
    } else {
        logger.log("Did NOT extract metadata from file: " +
            legacyContentFile.name);
    }
}



/**
 * Convert from legacy language code to new 2 char language code
 *
 * @param parsedLanguage legacy language code
 */
function getLanguageCode(parsedLanguage) {
    if (parsedLanguage == "Arabic") {
        return "Ar";
    } else if (parsedLanguage == "Chinese") {
        return "Ch";
    } else if (parsedLanguage == "Eng" || parsedLanguage == "eng") {
        return "En";
    } else if (parsedLanguage == "F" || parsedLanguage == "Fr") {
        return "Fr";
    } else if (parsedLanguage == "G" || parsedLanguage == "Ger") {
        return "Ge";
    } else if (parsedLanguage == "Indonesian") {
        return "In";
    } else if (parsedLanguage == "Ital") {
        return "";
    } else if (parsedLanguage == "Jpn") {
        return "Jp";
    } else if (parsedLanguage == "Port") {
        return "Po";
    } else if (parsedLanguage == "Rus" || parsedLanguage == "Russian") {
        return "Ru";
    } else if (parsedLanguage == "Sp") {
        return "Sp";
    } else if (parsedLanguage == "Sw") {
        return "Sw";
    } else if (parsedLanguage == "Tagalog") {
        return "Ta";
    } else if (parsedLanguage == "Turkish") {
        return "Tu";
    } else {
        logger.log("Invalid parsed language code: " + parsedLanguage);
        return "";
    }
}

This script can be run from any folder and it will search for all documents under the /Company Home/Meetings folder or any of its subfolders. All the documents that are returned by the search are looped through and matched with the regular expression. The regular expression defines two groups: one for the language code and one for the department. So after a document has been matched with the regular expression it is possible to back-reference the values that were matched in the groups by using RegExp.$1 and RegExp.$2.

When the language code and the department code properties are set, we also set the cm:autoVersionOnUpdateProps property, so we do not get any problem with versioning during the update.
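
Before running a script like this against the whole repository, it can be useful to try the regular expression on a handful of sample filenames outside Alfresco. The following is a small standalone Java sketch for doing that. It uses the same alternatives as the regular expression in the script above, but the class name LegacyMeetingName and the parse method are made up for illustration and are not part of the Best Money solution.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical test helper for checking the legacy naming convention
final class LegacyMeetingName {

    // Group 1 = legacy language code, group 2 = department code
    private static final Pattern LEGACY = Pattern.compile(
        "^\\d{2}(Arabic|Chinese|Eng|eng|F|Fr|G|Ger|Indonesian|Ital|Jpn|"
        + "Port|Rus|Russian|Sp|Sw|Tagalog|Turkish)"
        + "-(A|HR|FM|FS|FU|IT|M|L).*");

    /**
     * Returns {language, department} for a filename that follows the
     * legacy convention, or null if the filename does not match.
     */
    static String[] parse(String fileName) {
        Matcher m = LEGACY.matcher(fileName);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }
}
```

With this sketch, the legacy name 10Eng-FM.02_3_annex1.doc yields the language Eng and the department FM, while the new-style name 10En-FM.02_3_annex1.doc does not match, because En is not one of the legacy language values.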


Importing documents via CIFS

When the new staging area has been populated by the end users, it is time to do the actual migration. One way to do this is to use the CIFS interface as it is a simple and well-known way of copying files from one drive to another. When copying the files it's a good idea to copy them in batches, selecting one top folder per batch, for example.

Before starting the migration it is also often necessary to turn off rules that check things such as naming conventions. In Best Money's case, there are going to be documents that do not follow the new naming conventions that we have defined. And when the rules throw an error the document migration will halt.

If we do not turn off the rules, Alfresco might throw errors after different kinds of checks, and we will then most likely get a "Could not find this item" error in Windows Explorer. This is because the node was not added to the repository after the rule error, so when Windows Explorer tries to update it via CIFS, Alfresco reports that the file was not found.

Also, note that an error message that would be displayed in the Alfresco Explorer UI because of a rule is not going to be displayed in Windows Explorer in the same way. Most likely we will get the "Could not find this item" message.

When we are ready to start the document migration, it is best to do it first on a Test Box and sort out any problems. Then when we are confident that the document migration works nicely, we can move onto the Production Box.

An alternative way is to do the migration on Test and then export the complete Test folder structure, including migrated documents, to an ACP file. Remember that you might have to split the exported data up into several ACP files if there is a lot of content and the ACP file gets bigger than 4 GB. This is the max ZIP file size that Java handles (you can try using JDK 7 as it is supposed to handle larger ZIP files). Then wipe out the folder structure on the Production Box and import the ACP files from the Test Box. That will be faster and easier than doing the import via CIFS.
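
One way to plan such a split is to group the top folders into batches by their size on disk. The following Java sketch does this greedily, in the order the folders are given. It is only an illustration: the class name AcpBatchPlanner is made up, the folder sizes are assumed to have been measured beforehand, and the uncompressed size is used as a rough stand-in for the size of the resulting ACP file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical planning helper, not part of Alfresco
final class AcpBatchPlanner {

    /**
     * Groups folders, in iteration order, into batches whose total size
     * does not exceed maxBytes. A single folder that is larger than
     * maxBytes still ends up in a batch of its own and has to be split
     * further by hand.
     */
    static List<List<String>> plan(Map<String, Long> folderSizes,
                                   long maxBytes) {
        List<List<String>> batches = new ArrayList<List<String>>();
        List<String> current = new ArrayList<String>();
        long currentBytes = 0;
        for (Map.Entry<String, Long> entry : folderSizes.entrySet()) {
            long size = entry.getValue();
            // Start a new batch when adding this folder would exceed the limit
            if (!current.isEmpty() && currentBytes + size > maxBytes) {
                batches.add(current);
                current = new ArrayList<String>();
                currentBytes = 0;
            }
            current.add(entry.getKey());
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }
}
```

Each resulting batch would then be exported to its own ACP file. Passing a limit somewhat below 4 GB leaves some headroom for the ZIP overhead.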

Make sure to always copy all documents that should be migrated to the local disk where Alfresco is running for best performance. One should also take an ACP backup of the new file structure on the Test Box before doing the migration. This way it will be easy to go back to the initial start state if something goes wrong with the migration.

Using CIFS is not going to be the fastest way of doing an import. The CIFS interface is quite "chatty": as an example, importing 13,500 documents (9 GB) took around four hours on a Windows 2008 R2 64-bit box with a Xeon E5520 2.34 GHz, 4 CPUs, and 4 GB RAM. The documents were stored on the same disk that Alfresco was running on. So if you are looking at significantly more data to import, it might be better to look at another import method, such as using a purpose-built tool.
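
To get a feel for what this measurement means for a larger migration, it can be turned into a rough estimate. The sketch below assumes that CIFS import time scales linearly with the number of documents, which is a simplification, as file sizes, rules, and hardware all play a part. The class name is made up for illustration, and the only input data is the 13,500 documents in four hours quoted above.

```java
// Hypothetical back-of-the-envelope estimate based on one measurement
final class CifsImportEstimate {

    // Measured: 13,500 documents in roughly 4 hours
    static final double DOCS_PER_HOUR = 13500 / 4.0;

    /** Rough number of hours a CIFS import of the given size would take. */
    static double hoursFor(long documents) {
        return documents / DOCS_PER_HOUR;
    }
}
```

Under this assumption, 100,000 documents would take about 29.6 hours via CIFS, which is the kind of number that makes a purpose-built tool worth considering.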

The following picture shows an overview of CIFS-based document migration:

The picture shows the legacy file server that has all the documents that should be migrated over to Alfresco. On the legacy server, we have set up the staging area that mirrors the folder structure in Alfresco. End users then copy files into the staging area over time, at their own pace.

When the population of the staging area is completed, it is copied via CIFS into Alfresco. Any temporary migration rules are executed during the copy, as are the permanent rules that, for example, set custom types.

Pros and cons with CIFS import

The following are the advantages and disadvantages when using the CIFS interface for importing documents.

Advantages:

  • Easy and well-known interface to work with
  • Does not require any use of external tool or coding

Disadvantages:

  • Slow: CIFS is a rather "chatty" protocol, so the import takes much longer than with, for example, an in-process method
  • Cannot apply custom metadata during import

Importing documents via external tool

We have talked about using CIFS for the import but it has some disadvantages, such as being slow, which we could overcome by using a tool for the import. One good open source tool that we can use is called Alfresco bulk filesystem import and can be downloaded from http://code.google.com/p/alfresco-bulkfilesystem-import/

This tool provides a bulk import process that loads content into the Alfresco repository from the local filesystem. It will update an imported document if it already exists in the repository. This tool is not designed to do a full synchronization of the filesystem with Alfresco, so it will not delete files.

This tool assumes that the imported files are on the disk that is locally accessible to the Alfresco server. This will allow the import code to directly stream from the disk into the repository. Typically this means disk-to-disk streaming, which is far more efficient than any kind of mechanism that requires network I/O.

This tool is different from some other tools in that it executes all import logic in-process via Foundation API calls. This makes it very fast and eliminates any RPC calls over the network. The tool also breaks up large imports into multiple batches, where each batch runs in its own transaction, eliminating problems with long running transactions.
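
The idea of splitting an import into batches can be sketched in plain Java as follows. This is not the code of the bulk import tool; the class BatchedImport and the BatchWork callback are made up for illustration. In a real Alfresco import, the work done for each batch would be run in its own transaction, for example via a RetryingTransactionCallback like the one shown earlier, so that a failure only rolls back that batch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of batch-wise importing
final class BatchedImport {

    /** Hypothetical callback that imports one batch of items. */
    interface BatchWork<T> {
        void run(List<T> batch) throws Exception;
    }

    /** Splits the items into consecutive batches of at most batchSize. */
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<List<T>>();
        for (int i = 0; i < items.size(); i += batchSize) {
            int end = Math.min(i + batchSize, items.size());
            batches.add(new ArrayList<T>(items.subList(i, end)));
        }
        return batches;
    }

    /**
     * Runs the work once per batch and returns the number of failed
     * batches. A failure in one batch does not stop the others.
     */
    static <T> int importAll(List<T> items, int batchSize,
                             BatchWork<T> work) {
        int failed = 0;
        for (List<T> batch : partition(items, batchSize)) {
            try {
                work.run(batch);
            } catch (Exception e) {
                failed++;
            }
        }
        return failed;
    }
}
```

The design choice here is that the batch is the unit of failure: if something goes wrong, only a limited number of files has to be looked at and imported again.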

We can compare the speed of this tool to CIFS by looking at one Alfresco implementation that regularly loaded thousands of image files, each one being several MBs in size. The CIFS import took approximately an hour to load 1,500 image files while this tool could load the same image files into the repository in less than five minutes. So this tool will make a huge difference when there is a lot of content to be imported.

This tool will also allow us to set metadata on any document or folder that is imported into Alfresco. This is supported by plugins so we can easily custom code what metadata should be set. Some plugins exist out of the box, such as:

  • Basic metadata loader: It checks the type of data and sets either cm:content or cm:folder depending on if it is a file or a directory. Also populates the cm:name and cm:title with the filename as on the disk. This plugin is mandatory as it sets the types and names of the nodes. This is done automatically when CIFS is used.
  • Properties file metadata loader: It reads a properties file that is associated with the imported file and sets type, aspects, and properties according to the property file.

Finally, this tool supports preserving the modified dates of imported files. It uses the technique of disabling the Audit Aspect before the import. So if you are building your own import tool it might be worth having a look at the source code for this tool, if you plan to support preserving dates.

Pros and cons with tool import

The following are the advantages and disadvantages of using an external tool for importing documents.

Advantages:

  • Very fast in-process tool that doesn't do RPC calls over the network
  • Splits up import into multiple transactions
  • This tool can be used to apply metadata such as types and aspects during the import
  • Preserves modified dates of imported files

Disadvantages:

  • More complicated than the easy CIFS interface and might require some Java coding to get the best out of the tool.
  • Requires you to install an AMP and restart Alfresco

Importing documents via ACP file

Another way of importing documents is via so-called Alfresco Content Package (ACP) files. An ACP file can be generated from an existing Alfresco installation via the Alfresco Explorer UI or from several available command-line tools.

When importing ACP files from the Alfresco Explorer interface, the folder hierarchy must not already exist, as this will give you duplicate name errors. So the folder hierarchy would have to be wiped out before importing the ACP.

If we did not want to wipe out the folder structure in Production, we could use Alfresco's import command-line tool to do the import. This tool can be told to update nodes if they exist. Unfortunately, the node UUID in the ACP file has to match the node UUID in the folder structure; so this is highly unlikely to work unless the tool can query Alfresco for UUIDs during runtime.

The following are the advantages and disadvantages of using an ACP file for importing documents.

Advantages:

  • It is fast and the import is in-context, so it will be much faster than CIFS as it uses the Foundation API
  • The Alfresco server does not have to be stopped
  • The import is done in one transaction, so either everything gets imported or nothing
  • It preserves Created and Modified Dates (using ACP import might be the only way to preserve dates if you are running an older version of Alfresco)
  • The tool that generates the ACP file can apply metadata such as types to nodes that it adds to the ACP file

Disadvantages:

  • Import is done in a single transaction so might cause long-running transaction problems if a huge amount of data is being imported
  • The ACP file itself has to be copied into Alfresco before import starts
  • More complicated than the easy CIFS interface and might require some Java coding to get the best out of the ACP Generator tool.

Common steps during document migration

No matter what method we use for the actual document import, there are some general steps that we can follow for the document migration process. They are in chronological order:

  1. Set up staging area on the file share (that is, network drive). It should mirror the new folder structure in Alfresco.
  2. Train end users in the new folder structure.
  3. End users copy files to the staging area over a period of time.
  4. Set up any temporary migration rules in Alfresco (Test Box).
  5. Turn off rules that might throw errors and stop migration process (Test Box).
  6. Copy the staging area to the Test Box running Alfresco.
  7. Do the document migration, copy staging area via CIFS or use purpose-built tool (Test Box).
  8. Remove any temporary document migration rules (Test Box).
  9. Turn on any rules that were turned off during migration (Test Box).
  10. Run post migration processing scripts (Test Box).
  11. If the Test Box migration went OK, do the same thing on the Production Box. Alternatively, if the amount of data is not that huge, export all necessary top folders from the Test Box and then import the ACPs into the Production Box.

Summary

In this article we have talked about setting up a strategy for document migration and how to plan it. We have also looked at different ways to import documents into Alfresco.

In the next article we will cover Planning and Implementing Document Migration.



About the Author


Martin Bergljung

Martin Bergljung is a Principal ECM Architect at Ixxus, a UK Platinum Alfresco partner. He has over 25 years of experience in the IT sector, where he has worked with the Java platform since 1997.

Martin began working with Alfresco in 2007, developing an e-mail management extension for Alfresco called OpsMailmanager. In 2009, he started working on Alfresco consulting projects and has worked with customers such as Pearson, World Wildlife Fund, International Financial Data Services, NHS, VHI Healthcare, Virgin Money, Unibet, BNP Paribas, University of Westminster, Aker Oilfield Services, and Amnesty International.

He is a frequent speaker and has delivered talks at Alfresco conferences in London, Berlin, and Barcelona. He is also the author of Alfresco 3 Business Solutions, Packt Publishing.
