Alfresco 3 Business Solutions: Planning and Implementing Document Migration

by Martin Bergljung | February 2011 | Content Management Open Source

The Alfresco bulk filesystem import tool is probably the fastest and most efficient import tool currently available, and it is likely to become many people's favorite for document migration. It also supports the important feature of preserving modified dates during imports. As discussed in the previous article on Document Migration Strategies, CIFS is also widely used for document imports because it is simple and non-intrusive (that is, you do not have to install anything).

In this article by Martin Bergljung, author of Alfresco 3 Business Solutions, you will learn:

  • Planning the document migration
  • Implementing document migration using purpose-built tools

 


Planning document migration

Now we have got a strategy for how to do the document migration and we have several import methods to choose from, but we have not yet thought about planning the document migration. The end users will need time to select and organize the files they want to migrate and we might need some time to write temporary import scripts. So we need to plan this well ahead of production day.

The end users will have to go through all their documents and decide which ones they want to keep and which ones they no longer need. Sometimes the decision to keep a document is not up to the end user but is controlled by regulations, so this requires extra research.

The following screenshot shows the Best Money schedule for document migration:

It is not only electronic files that might need to be imported; sometimes there are paper-based files that need to be scanned and imported. This needs to be planned into the schedule too.

Implementing document migration

So we have a document migration strategy and we have a plan. Now let's see a couple of examples of how we can implement document migration in practice.

Using Alfresco bulk filesystem import tool

A tool such as the Alfresco bulk filesystem import tool is probably what most people will use and it is also the preferred import tool in the Best Money project. So let's start looking at how this tool is used. It is delivered in an AMP and is installed by dropping the AMP into the ALFRESCO_HOME/amps directory and restarting Alfresco.

However, we prefer to install it manually with the Module Management Tool (MMT), as our other AMPs, such as the Best Money AMP, have been installed with the MMT.

Copy the alfresco-bulk-filesystem-import-0.8.amp (or newest version) file into the ALFRESCO_HOME/bin directory. Stop Alfresco and then install the AMP as follows:

C:\Alfresco3.3\bin>java -jar alfresco-mmt.jar install alfresco-bulk-filesystem-import-0.8.amp C:\Alfresco3.3\tomcat\webapps\alfresco.war -verbose

Remove the ALFRESCO_HOME/tomcat/webapps/alfresco directory so that the files contained in the new AMP are picked up when the updated WAR file is exploded on the next restart of Alfresco.

Running Alfresco bulk import tool

The tool provides a UI form in Alfresco Explorer that makes it very simple to do the import. It can be accessed via the http://localhost:8080/alfresco/service/bulk/import/filesystem URL, which will display the following form (you will be prompted to log in first, so make sure to log in with a user that has access to the spaces where you want to upload the content):

Here, the Import directory field is mandatory and specifies the absolute path to the filesystem directory from which to load the documents and folders. It should be specified in an OS-specific format, for example C:\docmigration\meetings or /docmigration/meetings. Note that this directory must be locally accessible to the server where the Alfresco instance is running. It must be either a local filesystem or a locally mounted remote filesystem.

The Target space field is also mandatory and specifies the target space/folder to load the documents and folders into. It is specified as a path starting with /Company Home. The separator character is Unix-style (that is, "/"), regardless of the platform Alfresco is running on. This field includes an AJAX auto-suggest feature, so you may type any part of the target space name, and an AJAX search will be performed to find and display matching items.

The Update existing files checkbox field specifies whether to update files that already exist in the repository (checked) or skip them (unchecked).

The import is started by clicking on the Initiate Bulk Import button. Once an import has been initiated, a status Web Script is displayed that reports on the status of the background import process. This Web Script automatically refreshes every 10 seconds until the import process completes.

For the Best Money project, we have set up a staging area for the document migration where users can add documents to be imported into Alfresco. Let's import the Meetings folder, which looks as follows, in the staging area:

One Committee meeting has been added, and that is what we will use to test the import with the tool. Fill out the Bulk Import form as follows:

Click the Initiate Bulk Import button to start the import. The form should show the progress of the import, and when it has finished we should see something like this:

In this case, the import took 9.5 seconds and 31 documents (totaling 28 MB) were imported and five folders created. If we look at the document nodes, we will see that they all have the bmc:document type applied and the bmc:documentData aspect applied. This is accomplished by a type rule which is added to the Meetings folder. All documents also have the cm:versionable aspect applied via the "Apply Versioning" rule, which is added to the Meetings folder.

Running Alfresco bulk import tool and applying extra metadata

It would be nice, though, if the bmc:language property were populated with the language that each document is written in, as we have three subfolders under the committee meeting "Staff Committee, 12 Nov" for documents in English, French, and Spanish.

We can solve this by using a metadata property file for each document. For example, if we have a document in the doc_migration_stagingarea\Meetings\Committee\2009\Staff Committee, 12 Nov\Eng folder with the name 09Eng-ABC Report.pdf, we can create a property file for it called 09Eng-ABC Report.pdf.metadata.properties with the following content:

bmc\:language=En

Note that the language has to be specified according to the bmc:language_options constraints that have been defined in the content model. The filename pattern for these "shadow" property files is <filename>.<extension>.metadata.properties. Other property files will be imported like any other file.
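
The naming pattern and the escaped colon can be sketched in plain JavaScript. This is only an illustration and not part of the import tool: the helper names escapeKey, escapeValue, and shadowFileFor are made up for this example, and the escaping covers only the common cases (backslash, colon, equals sign, and space in keys, backslash in values).

```javascript
// Illustrative helpers only; these names are not part of the bulk import tool.

// Escape the characters that would otherwise end a key in a Java .properties file.
function escapeKey(key) {
  return key.replace(/[\\:= ]/g, function (ch) {
    return "\\" + ch;
  });
}

// Backslashes in values must be doubled so they are not read as escapes.
function escapeValue(value) {
  return String(value).replace(/\\/g, "\\\\");
}

// Build the name and content of the "shadow" property file for one document.
function shadowFileFor(fileName, properties) {
  var lines = Object.keys(properties).map(function (key) {
    return escapeKey(key) + "=" + escapeValue(properties[key]);
  });
  return {
    name: fileName + ".metadata.properties",
    content: lines.join("\n") + "\n"
  };
}

var shadow = shadowFileFor("09Eng-ABC Report.pdf", { "bmc:language": "En" });
console.log(shadow.name);    // 09Eng-ABC Report.pdf.metadata.properties
console.log(shadow.content); // bmc\:language=En
```

The returned name and content would still have to be written next to the document in the staging area before the import is run.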

Now just re-run the import, with the Update existing files checkbox checked, and this particular file should be updated with the language property set to En. If we look at the details page for this document, we will see the following screenshot:

We can now use this metadata when searching via the Advanced Search dialog in Alfresco Explorer:

Creating these "shadow" property files by hand is of course not very practical when there are thousands of files to import. In that case it is better to have some temporary migration rules handle it, or to do some post-migration processing of the documents with JavaScript.
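
Whichever approach is chosen, it needs a rule for deriving the language from where a document sits in the folder hierarchy. Here is a minimal sketch of such a rule in plain JavaScript. The function name languageForPath is hypothetical, and so is most of the mapping table: only the Eng folder and the En value appear in this article, while the Fre and Spa folder names and the Fr and Es values are assumptions that would have to match the real staging area and the bmc:language_options constraint.

```javascript
// Hypothetical folder-name to language-value mapping.
// Only "Eng" -> "En" is taken from the article; the other entries are assumptions.
var LANGUAGE_BY_FOLDER = { "Eng": "En", "Fre": "Fr", "Spa": "Es" };

// Return the language value for the nearest enclosing language folder,
// or null when the path contains no known language folder.
function languageForPath(path) {
  // Accept both Windows and Unix-style separators
  var parts = path.split(/[\\\/]+/);
  // Walk upwards from the deepest folder, skipping the file name itself
  for (var i = parts.length - 2; i >= 0; i--) {
    if (Object.prototype.hasOwnProperty.call(LANGUAGE_BY_FOLDER, parts[i])) {
      return LANGUAGE_BY_FOLDER[parts[i]];
    }
  }
  return null;
}

console.log(languageForPath(
  "Meetings\\Committee\\2009\\Staff Committee, 12 Nov\\Eng\\09Eng-ABC Report.pdf"
)); // En
```

In a post-migration script running inside Alfresco, the returned value could be assigned to document.properties["bmc:language"] and stored with document.save(), using the document's display path and name as input.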

Using an ACP Generator tool

Sometimes it is useful to be able to do the document migration using an ACP file. For one thing, it preserves modified dates, which is useful when we use older Alfresco versions that do not support preserving modified dates by turning off the Audit aspect. In other cases it might be useful to do a document migration without having to restart the Alfresco server or install extra modules in it, and an ACP import requires neither.

There are a couple of different "ACP generators" that can be found on the Internet, written in different languages and working with different Alfresco versions. The source code for this article (the Chapter 8 code download) comes with an ACP Generator implemented in Java.

It was written so that the ACP Generator works with the latest Alfresco version, uses the Java NIO API for file copy, and can be extended with custom functionality. We will use this ACP Generator in the following example.

The executable jar file for the ACP Generator can be found in the 3340_08_Code\bestmoney\alf_extensions\trunk\_acp_generator\build\dist directory and is called acp_gen-1.0.jar. If you want to build it from scratch, use the package-acpgen-jar ant target located in the 3340_08_Code\bestmoney\alf_extensions\trunk\_acp_generator directory.

Let's try out the ACP Generator by creating a content package from the same documents and folders that we imported with the Alfresco bulk filesystem import tool. To do this we feed the ACP Generator with the source path on the local disk where the documents and folders exist and what name the ACP file should have:

C:\tools\acpgen>java -jar acp_gen-1.0.jar C:\doc_migration_stagingarea\Meetings MEETINGS_FOLDER_HIERARCHY.ACP
ACP File will be created from content in: C:\doc_migration_stagingarea\Meetings
Temp directory for ACP: C:\tools\acpgen\ACPtmp
Temp directory for ACP content: C:\tools\acpgen\ACPtmp\import
ACP Metadata file: C:\tools\acpgen\ACPtmp/import.xml
---- Generating Metadata XML for directory: C:\doc_migration_stagingarea\Meetings
About to navigate directory tree: C:\doc_migration_stagingarea\Meetings
---- Generating Metadata XML for directory: C:\doc_migration_stagingarea\Meetings\Committee
...
Added file (C:\doc_migration_stagingarea\Meetings\Committee\2009\Staff Committee, 12 Nov\Eng\09Eng-ATS Report.pdf) as /import/content0.pdf
...
...

The ACP Generator logs what it is doing, and here we can see log entries about navigating directories, generating metadata, and adding files to the ACP package. We can now import the MEETINGS_FOLDER_HIERARCHY.ACP file from the Alfresco Explorer UI. However, when we navigate to the /Company Home folder and import the ACP file there, it will not work and a "Duplicate child name not allowed: meetings" error message will be displayed.

This is because the import via the Alfresco Explorer UI does not support updating nodes. We can import the ACP file to, for example, /Company Home/Test and it would work fine as there are no duplicate folders in that case. Alfresco supports updating nodes during import by using a command-line tool for the import. The import tool can be passed a parameter called uuidBinding that specifies what to do when encountering duplicate nodes.

The importer tool should be run from the Alfresco\tomcat\webapps\alfresco\WEB-INF directory and the ACP file should be copied into this directory. The importer tool also starts its own embedded Alfresco, so the Alfresco server needs to be shut down before doing the import. Note that the MySQL server needs to be running though. Here is how to use it:

Alfresco\tomcat\webapps\alfresco\WEB-INF>java -cp "classes\alfresco\module;..\..\..\shared\classes;classes;..\..\..\lib\*;..\..\..\common\endorsed\*;lib\*" org.alfresco.tools.Import -user admin -pwd admin -store workspace://SpacesStore -path /app:company_home -uuidBinding UPDATE_EXISTING -verbose MEETINGS_FOLDER_HIERARCHY.ACP
Alfresco Repository Importer
. . .

This does not work either, however, and we will see the "DuplicateChildNodeNameException: Duplicate child name not allowed: meetings" error in the console window. This is because the metadata XML does not contain the correct UUIDs for the folders that already exist in the repository. For the import tool to update a folder that already exists, the metadata XML in the ACP file must specify the same UUID as the folder has in the repository.
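
The reasoning behind this failure can be modelled roughly as follows. This is a deliberately simplified sketch of the behaviour described above, not Alfresco's importer code: the function decideImportAction and its data shapes are hypothetical, and the case-insensitive name comparison is an assumption based on the lowercase "meetings" in the error message.

```javascript
// Simplified model of the UPDATE_EXISTING behaviour described in the text.
// existingChildren: array of { name, uuid } already present under the target folder.
// incoming: { name, uuid } for a node in the ACP metadata (uuid may be missing).
function decideImportAction(existingChildren, incoming) {
  // A matching UUID means the importer can find the node and update it
  var sameUuid = existingChildren.some(function (child) {
    return Boolean(incoming.uuid) && child.uuid === incoming.uuid;
  });
  if (sameUuid) {
    return "update";
  }
  // Without a matching UUID the node is treated as new, so a name clash is an error
  var sameName = existingChildren.some(function (child) {
    return child.name.toLowerCase() === incoming.name.toLowerCase();
  });
  if (sameName) {
    throw new Error("Duplicate child name not allowed: " + incoming.name.toLowerCase());
  }
  return "create";
}

var companyHomeChildren = [
  { name: "Meetings", uuid: "a98682d2-217a-45dd-86cb-73fe01e8dde8" }
];

// With the repository UUID in the metadata, the existing folder is updated
console.log(decideImportAction(companyHomeChildren, {
  name: "Meetings",
  uuid: "a98682d2-217a-45dd-86cb-73fe01e8dde8"
})); // update
```

Calling the same function for a Meetings folder without a UUID throws the duplicate name error, which mirrors what happens when the ACP file carries no UUIDs.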

The ACP Generator tool does not add any UUIDs unless we tell it to. It can take an extra parameter that specifies the name of a property file with UUID mappings. This UUID mapping file can be generated with the following JavaScript:

// Write the mapping file to Company Home, reusing it if it already exists
var filename = "uuidMapping.properties";
var file = companyhome.childByNamePath(filename);

if (file == null) {
    file = companyhome.createFile(filename);
}

if (file != null) {
    // Find every folder below Company Home
    var store = "workspace://SpacesStore";
    var query = "+PATH:\"/app:company_home//*\" +TYPE:\"cm:folder\"";
    var folders = search.luceneSearch(store, query);

    // Emit one "<display path>/<name>=<uuid>" line per folder
    var content = "";
    for each (var folder in folders) {
        var uuid = folder.properties["sys:node-uuid"];
        var pathAndName = folder.displayPath + "/" + folder.name;
        content += pathAndName + "=" + uuid + "\r\n";
    }
    file.content = content;
}

This script will generate a file with mappings as in the following example:

/Company Home/Meetings=a98682d2-217a-45dd-86cb-73fe01e8dde8
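
How the ACP Generator reads this file is not shown in the article. As an illustration of the format only, the following sketch parses such mapping lines into a lookup table in plain JavaScript; the function name parseUuidMappings is hypothetical.

```javascript
// Parse "<path>=<uuid>" lines into an object keyed by folder path.
// Illustrative only; this is not the ACP Generator's own parsing code.
function parseUuidMappings(text) {
  var mappings = {};
  text.split(/\r?\n/).forEach(function (line) {
    // Split on the LAST "=" so that folder names containing "=" stay intact
    var idx = line.lastIndexOf("=");
    if (idx <= 0) {
      return; // skip blank or malformed lines
    }
    var path = line.substring(0, idx).trim();
    var uuid = line.substring(idx + 1).trim();
    if (path && uuid) {
      mappings[path] = uuid;
    }
  });
  return mappings;
}

var mappings = parseUuidMappings(
  "/Company Home/Meetings=a98682d2-217a-45dd-86cb-73fe01e8dde8\r\n"
);
console.log(mappings["/Company Home/Meetings"]); // a98682d2-217a-45dd-86cb-73fe01e8dde8
```

Splitting on the last equals sign is a design choice that relies on UUIDs never containing that character, whereas folder names might.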

The only thing we need to do now is feed this mapping file into the ACP Generator, and it will make sure each existing folder has the correct <sys:node-uuid> element value, so that when we run the import tool again it will work even if the folders exist. Here is how to run the ACP Generator with the UUID mapping file specified:

C:\tools\acpgen>java -jar acp_gen-1.0.jar C:\doc_migration_stagingarea\Meetings MEETINGS_FOLDER_HIERARCHY.ACP uuidMapping.properties

Now run the Alfresco import tool again and it should work fine.

Summary

The Alfresco bulk filesystem import tool is probably the fastest and most efficient import tool currently available, and it is likely to become many people's favorite for document migration. It also supports the important feature of preserving modified dates during imports.

CIFS is also used by many for document imports because of its simplicity and non-intrusiveness (that is, you do not have to install anything). It is, however, slow if you have a lot of data to import, and it cannot handle metadata.


About the Author


Martin Bergljung

Martin Bergljung is a Principal ECM Architect at Ixxus, a UK Platinum Alfresco partner. He has over 25 years of experience in the IT sector, where he has worked with the Java platform since 1997.

Martin began working with Alfresco in 2007, developing an e-mail management extension for Alfresco called OpsMailmanager. In 2009, he started working on Alfresco consulting projects and has worked with customers such as Pearson, World Wildlife Fund, International Financial Data Services, NHS, VHI Healthcare, Virgin Money, Unibet, BNP Paribas, University of Westminster, Aker Oilfield Services, and Amnesty International.

He is a frequent speaker and has delivered talks at Alfresco conferences in London, Berlin, and Barcelona. He is also the author of Alfresco 3 Business Solutions, Packt Publishing.
