Alfresco 3 Business Solutions

By Martin Bergljung

About this book

Alfresco is a renowned, multiple award-winning open source Enterprise Content Management (ECM) system that allows you to design, build, and implement your own ECM solutions. Its modularity and scalability give it advanced features that compare well with those of its commercial counterparts. If you are looking for quick and effective ways to use Alfresco to design and implement business solutions that meet your organizational needs, your search ends with this book.

Welcome to Alfresco 3 Business Solutions, a practical and easy-to-use guide that, instead of teaching you just how to use Alfresco, teaches you how to live Alfresco. It guides you through implementing real-world solutions using real-world scenarios. Each ECM problem is treated as a separate case study and has its own chapter, enabling you to uncover the practical aspects of an ECM implementation. If you want more than just the theoretical details, and are after practical insights into designing, building, and implementing business solutions with Alfresco, then Alfresco 3 Business Solutions is written for you.

This practical companion keeps the preamble short so that you can dive right into the world of business solutions with Alfresco.

Learn the techniques, basic and advanced, required to design and implement different solutions with Alfresco in easy and efficient ways. Learn what you need to know about Document Management, Records Management, and more. Connect Alfresco with directory servers. Learn how to use CIFS and troubleshoot problems. Migrate data when you have an existing network drive with documents and want to merge them into Alfresco. Implement Business Process Design Solutions with Swimlane diagrams. Extract content from Alfresco and build mashups in a portal such as Liferay. Gain insights into mobile access and e-mail integration.

This book will teach you to implement all that and more, in real world environments.

Publication date: February 2011
Publisher: Packt
Pages: 608
ISBN: 9781849513340

 

Chapter 1. The Alfresco Platform

Before we dive into implementing ECM solutions, we are going to have a look at the Alfresco platform. There are some key concepts and features that are important for us to know about before implementing anything on top of Alfresco.

It helps to think about Alfresco as a big toolbox for building Content Management Systems (CMS). Alfresco, out of the box, can obviously be used straightaway but usually you want to configure it and customize it for the organization that is going to use it.

This is important to think about, as otherwise you are missing the full potential of Alfresco. It enables organizations to tweak Alfresco, so that it works with their business processes and business rules. It does not impose a special way of working that the organization has to adopt. Instead, Alfresco adapts to the organization.

In a lot of cases, organizations buy proprietary turnkey solutions that look really good out of the box with predefined content models, domain process definitions, business rules, and so on. However, after a while they usually realize that things are not working exactly as they want them to. Then they realize that customizing the solution will cost far more than if they had started creating it from scratch, or that it might not even be possible to customize functionality in the proprietary solution.

In this chapter, you will learn:

  • Important repository concepts

  • How to use content rules

  • What a metadata extractor is

  • Why content transformers are used

  • How to trigger custom code from events

  • What a Servlet Command is

  • What a subsystem is

  • How the system can be bootstrapped in different ways

  • What user interfaces are available

  • About the directory structure created by the installation

  • How to access content information directly from the database

Throughout the book, we will be working with "Best Money", a fictitious financial institution that offers financial products such as credit cards, loans, and insurance. Best Money wants to complete its range of financial products by offering personal banking products. It is therefore under pressure to improve efficiency by automating business processes, structuring document storage, classifying documents, implementing document lifecycles, improving the level of auditing, managing e-mail content, and meeting many other challenges in a complex business environment subject to heavy regulatory oversight.

Best Money realizes that to do all of this it needs to put in place an Enterprise Content Management solution and it has selected Alfresco.

Platform overview

Alfresco is an open source content management system written entirely in Java that can be run in a standard Servlet container, such as Apache Tomcat or a JEE server, such as JBoss. The Alfresco platform is built using many third-party open source Java libraries and it's good to know about these libraries as we will use many of them when building extensions and solutions.

The platform has many Application Programming Interfaces (APIs) and configuration techniques that we can use to build custom solutions on top of Alfresco.

The following figure gives an overview of the platform:

The Alfresco-specific components, modules, and user interfaces are depicted in a lighter color and the third-party libraries are depicted in a darker color. The Alfresco platform is presented in a layered approach as follows:

  • Repository: The bottom layer comprises the database, the search indices, and the content files.

  • Java Platform: Everything runs on Java, so it is independent of hardware, operating systems, and also of databases as they are accessed via Hibernate.

  • Core: This layer contains all of the modules and libraries used by Alfresco to implement the CMS functionality.

  • APIs: The interface layer contains a variety of application programming interfaces that can be used to communicate with Alfresco both in-process and remotely.

  • Sub-systems: This layer consists of self-contained components that extend the CMS with important functionality that often needs to be configured during installation. Sub-systems can be started and stopped while the server is running.

  • Bootstrap: System integrators can use bootstrap extensions to perform a variety of tasks, for example, to import content or patch the content with some custom metadata.

  • Modules: The modules usually extend the Alfresco system with some major extra functionality such as web content management or records management. We will use a module for all new custom functionality we implement for the Best Money ECM system.

  • User Interfaces (UI): Alfresco comes with a number of user interfaces that can be used to upload and manage content.

Now, let's have a detailed look at each layer starting with the Repository.

 


Repository concepts and definitions


Before doing any custom coding for Alfresco, it is important to get to know the concepts and definitions around the repository. We need to become familiar with terms such as Store and Association, and to understand what we mean when we talk about a Node in the repository.

Repository

When we talk about Alfresco, we often refer to the Alfresco Repository. So what is the repository more specifically? The repository is an abstract term for where everything is stored in Alfresco. It is also often called just "repo" and one of the main packages in the source code is also called repo.

The following figure gives you an overview of the Alfresco Repository:

The repository consists of a hierarchy of different types of nodes. These can be, for example, folder nodes or leaf nodes representing files. Each node is associated with a parent node via a parent-child relationship, except the top root node. Nodes can also be related to each other via source-target associations (that is, peer associations). If the node represents a file, then it is also associated with a file in the filesystem. This is a somewhat simplified view of the repository, as each node actually resides in a store, as shown in the following figure:

The repository is built up by a number of stores and each one of them contains a node hierarchy. The nodes make up the metadata for the content and are stored in the database. The actual content itself, such as document files, is stored in the filesystem.

Stores

There are a couple of stores that you will come into contact with if you work with Alfresco for a while. First, we have the Working Store, which is the main store where metadata for all live content is kept; this store is often referred to as just DM (Document Management). This is the content that you can access from all the different user interface clients. The default behavior when something is deleted from the Working Store via any of the user interfaces is that both the content file and the content metadata end up in a store called the Archive Store.

Note

Content is not permanently removed from the disk when it is in the Archive Store. To physically remove the content from the disk, you need to manage content via the Admin user profile screen or configure a content store cleaner (http://wiki.alfresco.com/wiki/Content_Store_Configuration).

If you turn on "versioning" for a document, then you will see some activity in the Version Store where all the previous versions of a piece of content will be stored. This is called the "version history" and there is one Node created per version history.

Note

The complete file for a previous version is stored and the system does not store the delta between versions.

Whenever you install a new application module such as Records Management or Web Content Management, the data about this module is stored in the System Store. This data includes, for example, the module version number.

Finally, we have the Content Store that contains all the physical files and it lives on the disk compared to the other stores that live in the database. It is called Content Store even though the behavior is not the same as for the stores that live in the database. It is more of an abstract term for where the physical content files are located.

The Content Store

So why are the physical files stored in the filesystem and not in the database as Binary Large Objects (BLOBs)? It would be easier to back up the whole system and also to set up replication if everything was in the database. And system administrators would not have to manage both database space and filesystem space.

There are several reasons why content files are stored in the filesystem:

  • Random access to files: One of the big advantages with Alfresco is that users can keep working the way they are used to by accessing Alfresco as a shared drive (that is, via the CIFS interface). However, this would not be possible if Alfresco was not storing the files in the filesystem, so they can be randomly accessed (sometimes also referred to as direct access). To support frequent updating and reading of database BLOBs would slow down performance of the CIFS interface to an unacceptable level.

  • Real-time streaming: It is a lot easier to stream large content such as video and audio using files. A content file can be streamed directly back to the browser, as the file input stream is copied directly to the HTTP response output stream. If BLOBs were used, you would first have to read the BLOB and then create a temporary file to stream from. Also, when writing BLOBs to the database, a lot of databases require you to know the stream size when inserting the record, so a temporary file needs to be created. Furthermore, some databases such as MySQL have problems sending very large binaries from the JDBC driver to the database.

  • Standard database access: Most database systems support BLOBs with custom access methods and custom objects. These usually perform much better than the JDBC BLOB objects and access methods. So it would be difficult to use Hibernate to access BLOBs in a standard way for all databases. For example, if you wanted to manage BLOBs with Oracle, you would have the best performance using their BLOB object. Also, the caching of BLOBs in databases is known to slow down the rest of the metadata access.

  • Faster access: It is much faster to access content that is stored as files, which means that the user experience is much better and this leads to happier customers.
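The streaming point in the list above can be illustrated with a minimal plain-Java sketch. It is not Alfresco code: the class name ContentStreamingSketch and the method stream are hypothetical, and a ByteArrayOutputStream stands in for the HTTP response output stream. It only shows how a content file can be copied to an output stream in chunks without an intermediate temporary file.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration only; this is not an Alfresco class.
class ContentStreamingSketch {

    // Copies the file to the output stream in fixed-size chunks, so the
    // whole file is never held in memory and no temporary copy is needed.
    static long stream(Path contentFile, OutputStream out) throws IOException {
        long total = 0;
        byte[] buffer = new byte[8192];
        try (InputStream in = Files.newInputStream(contentFile)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
                total += read;
            }
        }
        out.flush();
        return total;
    }

    public static void main(String[] args) throws IOException {
        // A temporary file plays the role of a file in the content store.
        Path file = Files.createTempFile("content", ".bin");
        Files.write(file, "hello content store".getBytes(StandardCharsets.UTF_8));

        // Stand-in for the HTTP response output stream.
        ByteArrayOutputStream response = new ByteArrayOutputStream();
        long copied = stream(file, response);
        System.out.println("Streamed " + copied + " bytes");

        Files.delete(file);
    }
}
```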

Content Store policies

Content Store policies let us decide what media selected content is stored on. Quite often, content has a lifetime during which it is relevant, after which it becomes obsolete; content store policies help us deal with this. We do not want to get rid of the content files, but we can store them on a cheaper, slower-access disk.

So we might use very fast tier 1 storage, such as Solid-State Drives (SSDs), for our most important content files and, based on business policies that we control, gradually move the data to cheaper tier 2 drives, such as Fibre Channel (FC) or Serial ATA drives, as it becomes less important. In this way, we can manage the storage of content more cost-effectively.

So you could have one part of the repository store its files on one disk and another part store its files on another disk. The following figure illustrates this:

In the preceding figure, we can see that the system has been configured to store images on one disk and documents on another disk. This sort of content store configuration can be done with Content Store Selectors.
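To make the routing idea concrete, here is a small plain-Java sketch of choosing a storage location from the MIME type of the content. It does not reproduce how Content Store Selectors are actually configured in Alfresco; the class StoreSelectorSketch, its methods, and the disk paths are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not the Alfresco Content Store Selector.
class StoreSelectorSketch {

    // MIME type prefix -> store root, checked in insertion order.
    private final Map<String, String> storeByMimePrefix = new LinkedHashMap<>();
    private final String defaultStore;

    StoreSelectorSketch(String defaultStore) {
        this.defaultStore = defaultStore;
    }

    // Registers a routing rule and returns this object so calls can be chained.
    StoreSelectorSketch route(String mimePrefix, String storeRoot) {
        storeByMimePrefix.put(mimePrefix, storeRoot);
        return this;
    }

    // Returns the store root of the first matching rule, else the default.
    String select(String mimeType) {
        for (Map.Entry<String, String> rule : storeByMimePrefix.entrySet()) {
            if (mimeType.startsWith(rule.getKey())) {
                return rule.getValue();
            }
        }
        return defaultStore;
    }

    public static void main(String[] args) {
        // The paths below are made up for the example.
        StoreSelectorSketch selector = new StoreSelectorSketch("/mnt/default")
                .route("image/", "/mnt/images")
                .route("application/", "/mnt/documents");

        System.out.println(selector.select("image/png"));
        System.out.println(selector.select("application/pdf"));
        System.out.println(selector.select("video/mp4"));
    }
}
```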

The AVM Store

One store that has not been mentioned so far is the special store introduced for the Alfresco WCM module. It is called the Advanced Versioning Manager (AVM) Store and it is modeled after Apache Subversion to be able to support extra features such as:

  • File-level version control

  • File-level branching

  • Directory-level version control

  • Directory-level branching

  • Store-level version control (snapshots)

  • Store-level branching

These extra features are needed to be able to create user and staging sandboxes, so that web content can be created and previewed independently between the users. A staging environment is also supported where different snapshots of the website can be managed and deployed to production servers.

There are some major differences between the Working Store (DM) and the AVM Store (WCM) that are good to know about when we are planning an ECM project. The following list explains some of the differences:

  • Permissions can be set on object level in DM but only on Web Project level in WCM

  • Types are defined with XML Schema files in WCM but with XML files in DM

  • AVM folders do not support rules as in DM

  • In WCM, we can search only one Web Project at a time, whereas in DM the complete repository is searchable

  • E-mailing with SMTP or IMAP is not supported in WCM, but it is in DM

Content can be cross-copied between the DM store and the AVM store in either direction.

Work is under way to update Alfresco WCM so that it can use the normal Alfresco DM Working Store, which would let web content reside along with all other content and be searchable in the same way.

Store reference

When you work with the application interfaces, you often have to pass a so-called store reference into a method call. This store reference tells Alfresco what store you are working with. A store reference is constructed from two pieces—the protocol and an identifier.

The protocol basically specifies what store you are interested in. For example, two of the protocols are workspace and archive. You also need an identifier to create a store reference and it tells us what kind of store it is, for example, does it contain spaces or is it keeping version information. Most of the time we are accessing a store with folders (that is, spaces) and the identifier is then called SpacesStore.

So if you wanted to access the Working Store from the previous figures, you would create the following store reference: workspace://SpacesStore. And this is the store that you will use most of the time.

The following is a complete list of store references:

  • workspace://SpacesStore: Contains the live content; this is the store reference that will be used in most situations

  • workspace://lightWeightVersionStore: Version history for content

  • workspace://version2Store: Next-generation version history for content

  • archive://SpacesStore: Archived files (that is, deleted files)

  • user://alfrescoUserStore: User management

  • system://system: Installed modules information

  • avm://sitestore: Alfresco WCM content
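The two-part structure of a store reference can be shown with a short plain-Java sketch. Alfresco's API has its own store reference class; the StoreRefSketch class below is only a hypothetical stand-in that splits the textual form into its protocol and identifier.

```java
// Hypothetical stand-in for illustration; not the Alfresco API class.
class StoreRefSketch {

    static final String SEPARATOR = "://";

    final String protocol;    // for example "workspace" or "archive"
    final String identifier;  // for example "SpacesStore"

    StoreRefSketch(String protocol, String identifier) {
        this.protocol = protocol;
        this.identifier = identifier;
    }

    // Splits "protocol://identifier" and rejects input missing either part.
    static StoreRefSketch parse(String text) {
        int idx = text.indexOf(SEPARATOR);
        if (idx <= 0 || idx + SEPARATOR.length() >= text.length()) {
            throw new IllegalArgumentException("Not a store reference: " + text);
        }
        return new StoreRefSketch(
                text.substring(0, idx),
                text.substring(idx + SEPARATOR.length()));
    }

    @Override
    public String toString() {
        return protocol + SEPARATOR + identifier;
    }

    public static void main(String[] args) {
        StoreRefSketch working = StoreRefSketch.parse("workspace://SpacesStore");
        System.out.println("protocol   = " + working.protocol);
        System.out.println("identifier = " + working.identifier);
    }
}
```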

Nodes

Each store in the repository contains nodes and every piece of content that is saved in the repository is represented by a node. This can be a document, a folder, a forum, an e-mail, an image, a person, and so on. Everything in a store is a node. A node is stored in the database and contains the following metadata for a piece of content:

  • Type: A node can be of one type.

  • Aspects: A node can have many aspects.

  • Properties: A node can have properties defined in the type or the aspects. One of the properties points to the actual physical content file.

  • Permissions: Permissions for this node.

  • Associations: Associations to other nodes.

Each node is of a certain Type such as Folder, Content, VersionHistory, Forum, and so on. Each type can have one or more properties associated with it. A node can only be of one type. So a Folder cannot also be a VersionHistory, pretty obvious, but it is good to mention this anyway so that there are no misunderstandings when we start creating custom types.

So what if we wanted to have properties from two different types associated with a node? How would we do that? We would use something called an aspect. A node can be associated with more than one aspect. Examples of aspects are Versionable, Emailed, and Auditable. This means that an MS Word document could be of type Content and be both Versionable and Emailed.

The following figure shows a folder node called User Guides that contains one document called userguide.pdf, which in turn is associated with an image node called logo.png:

Some nodes such as folder nodes are not associated with any content; they just contain metadata, permission settings, and an association to the parent folder.

Root node

All nodes have to have a parent node and the top-level node in the store is called the store root as it does not have a parent node. The root node has an aspect applied to it called sys:aspect_root. It might look like a good idea to search for this aspect to get to the root node in a store, but it does not work as there are other nodes such as the root node for categories that also have this aspect set.

An easy way to get to the root node in any of the stores is to do a Lucene search with PATH:"/" or if we are using the Java Foundation Service API, we can use the Node Service to get the root node for a particular Store Reference.

Node reference

So, we have heard about all these nodes and seen how they can have properties, and so on. But how can one uniquely identify one node in the repository? This is where node references come into the picture. They are used to identify a specific node in one of the stores in the repository. You construct a node reference by combining a Store Reference such as workspace://SpacesStore with an identifier.

The identifier is a Universally Unique Identifier (UUID) and it is generated automatically when a node is created. A UUID looks something like this: 986570b5-4a1b-11dd-823c-f5095e006c11 and it represents a 128-bit value. A complete Node Reference looks like workspace://SpacesStore/986570b5-4a1b-11dd-823c-f5095e006c11.

The node reference is one of the most important concepts when developing custom behavior for Alfresco as it is required by a lot of the application interface methods.
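As a minimal sketch of this structure, the plain-Java class below splits the textual form of a node reference into its store reference and its identifier. The class NodeRefSketch is hypothetical and is not the class used by the Alfresco API; it also assumes, as described above, that the identifier is always a UUID.

```java
import java.util.UUID;

// Hypothetical stand-in for illustration; not the Alfresco API class.
class NodeRefSketch {

    final String storeRef;  // for example "workspace://SpacesStore"
    final UUID id;          // the 128-bit node identifier

    NodeRefSketch(String storeRef, UUID id) {
        this.storeRef = storeRef;
        this.id = id;
    }

    // Splits "protocol://identifier/uuid" at the last slash.
    static NodeRefSketch parse(String text) {
        int scheme = text.indexOf("://");
        int slash = text.lastIndexOf('/');
        // The last slash must come after a non-empty store identifier
        // and must be followed by the node identifier.
        if (scheme <= 0 || slash <= scheme + 3 || slash == text.length() - 1) {
            throw new IllegalArgumentException("Not a node reference: " + text);
        }
        // Assumption: the identifier is a UUID; fromString rejects other text.
        UUID id = UUID.fromString(text.substring(slash + 1));
        return new NodeRefSketch(text.substring(0, slash), id);
    }

    @Override
    public String toString() {
        return storeRef + "/" + id;
    }

    public static void main(String[] args) {
        NodeRefSketch ref = NodeRefSketch.parse(
                "workspace://SpacesStore/986570b5-4a1b-11dd-823c-f5095e006c11");
        System.out.println("store = " + ref.storeRef);
        System.out.println("id    = " + ref.id);
    }
}
```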

Node properties

Properties contain all the information about the Node and are often referred to as metadata. When you create a new node of a certain type, such as Content, a certain number of default properties are set. These are properties such as Name, Created Date, Author, and so on.

What properties are set, and if they are set automatically, depends on the MIME type of the content that is being added to the repository.

Note

Multipurpose Internet Mail Extensions (MIME) is an Internet standard that extends the format of e-mail to support message bodies with multiple parts, text in character sets other than ASCII, non-text attachments, header information in non-ASCII character sets, and so on.

The MIME type is these days referred to as a content type after the header name Content-Type. But in the Alfresco environment they are called MIME types.

Some MIME types, such as the one for an MS Word document, have so-called Metadata Extractors available that automatically extract properties from the content when it is added. So when we add an MS Word document to the repository, we will see that some properties have been filled in automatically via the automatic metadata extraction.

Properties can be defined either as part of a type or as part of an aspect. When you list the properties in the UI, it does not show what type or aspect they belong to. You have to programmatically query the system to find out what properties belong to an aspect or type.

What if we wanted to add a property called Name, but it is already defined for type Content? What do we do then? Every property is scoped within a so-called namespace, so we can have a property called Name in namespace A and in namespace B without any problem.

Node property sheets

So we have all these properties for a node in the repository; how do we display them in the UI? They are displayed through so-called property sheet definitions. The default property sheet displays a number of properties depending on the type of the node and what aspects have been applied to it.

When we add custom types and aspects to a node, we also have to define appropriate property sheets, so that these custom properties are displayed every time we look at this node.

Property sheets are used to display custom properties both in the Alfresco Explorer UI and in the Alfresco Share UI.

Node associations

Most of the nodes in the repository are linked or associated with other nodes. For example, document nodes are associated with folder nodes and e-mail attachment nodes are associated with e-mail nodes.

There are two kinds of associations: a parent -> child association that you, for example, have between a folder and document, and the source -> target association (that is, peer association) that you have between, for example, a marketing brief and produced marketing materials.

If we delete a parent node in a parent -> child association, an automatic cascading delete will be executed deleting all child nodes. This means that if we delete a folder, all the contained documents and subfolders will also be deleted. Deleting any node in a source -> target association does not automatically delete the associated node.
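The difference between the two kinds of association on delete can be modelled in a few lines of plain Java. This is a simplified in-memory sketch under our own assumptions, not how the repository is implemented; the class AssociationSketch and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical in-memory model for illustration; not Alfresco code.
class AssociationSketch {

    static class Node {
        final String name;
        final Node parent;                               // null for a top-level node
        final List<Node> children = new ArrayList<>();   // parent -> child
        final List<Node> targets = new ArrayList<>();    // source -> target (peer)

        Node(String name, Node parent) {
            this.name = name;
            this.parent = parent;
        }
    }

    private final Set<Node> live = new HashSet<>();

    Node create(String name, Node parent) {
        Node node = new Node(name, parent);
        live.add(node);
        if (parent != null) {
            parent.children.add(node);
        }
        return node;
    }

    void associate(Node source, Node target) {
        source.targets.add(target);
    }

    // Cascades down parent -> child associations only; peer targets are untouched.
    void delete(Node node) {
        for (Node child : new ArrayList<>(node.children)) {
            delete(child);
        }
        if (node.parent != null) {
            node.parent.children.remove(node);
        }
        live.remove(node);
    }

    boolean exists(Node node) {
        return live.contains(node);
    }

    public static void main(String[] args) {
        AssociationSketch repo = new AssociationSketch();
        Node guides = repo.create("User Guides", null);
        Node pdf = repo.create("userguide.pdf", guides);
        Node logo = repo.create("logo.png", null);
        repo.associate(pdf, logo);  // peer association

        repo.delete(guides);
        System.out.println("userguide.pdf exists: " + repo.exists(pdf));
        System.out.println("logo.png exists: " + repo.exists(logo));
    }
}
```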

Node associations can also be configured to be displayed in property sheets in the user interface.

QName

When we define a property, aspect, type, and so on for a node in the repository, we will come in contact with something called QName, which is the fully qualified name of, for example, a property including the local name and the namespace it has been defined in.

Here are some examples of QNames:

  • {http://www.alfresco.org/model/content/1.0}description: Defines the description (that is, description is the local name) property that is part of the standard Alfresco content namespace. This property is part of the cm:titled aspect defined in the same namespace.

  • {http://www.bestmoney.com/model/content/1.0}job: Defines the job aspect that is part of the "Best Money" content namespace.

  • {http://www.mycompany.com/model/content/1.0}language_options: Defines a constraint of languages in some company's namespace.

The QName itself does not tell us whether it defines a property, type, constraint, or aspect; it carries no such information. The local name of a QName has to be unique within a namespace. So there can only be one QName with the local name description within the Alfresco namespace http://www.alfresco.org/model/content/1.0. If we wanted to have a description property for another aspect, such as the person aspect, we would have to call it something different, such as persondescription. We could also use a different namespace: {http://www.mycompany.com/model/person/1.0}description.

The namespace is written with quite a long string but can be shortened by using the prefix that has been set for it. The http://www.alfresco.org/model/content/1.0 namespace has the prefix cm. So we can use cm:description to refer to the description property of the titled aspect in the Alfresco content namespace.
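The relationship between the long form and the prefixed form can be sketched in plain Java. The class QNameSketch below is a hypothetical stand-in, not the class from the Alfresco API, and the prefix map is filled in by hand for the example rather than read from the system.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for illustration; not the Alfresco API class.
class QNameSketch {

    final String namespace;
    final String localName;

    QNameSketch(String namespace, String localName) {
        this.namespace = namespace;
        this.localName = localName;
    }

    // Parses the long form "{namespace}localName".
    static QNameSketch parse(String text) {
        int close = text.indexOf('}');
        if (!text.startsWith("{") || close < 2 || close == text.length() - 1) {
            throw new IllegalArgumentException("Not a QName: " + text);
        }
        return new QNameSketch(text.substring(1, close), text.substring(close + 1));
    }

    // Resolves the short form "prefix:localName" using a prefix-to-namespace map.
    static QNameSketch fromPrefixed(String text, Map<String, String> prefixes) {
        int colon = text.indexOf(':');
        if (colon <= 0 || colon == text.length() - 1) {
            throw new IllegalArgumentException("Not a prefixed name: " + text);
        }
        String namespace = prefixes.get(text.substring(0, colon));
        if (namespace == null) {
            throw new IllegalArgumentException("Unknown prefix in: " + text);
        }
        return new QNameSketch(namespace, text.substring(colon + 1));
    }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof QNameSketch)) {
            return false;
        }
        QNameSketch that = (QNameSketch) other;
        return namespace.equals(that.namespace) && localName.equals(that.localName);
    }

    @Override
    public int hashCode() {
        return 31 * namespace.hashCode() + localName.hashCode();
    }

    @Override
    public String toString() {
        return "{" + namespace + "}" + localName;
    }

    public static void main(String[] args) {
        Map<String, String> prefixes = new HashMap<>();
        prefixes.put("cm", "http://www.alfresco.org/model/content/1.0");

        QNameSketch longForm =
                QNameSketch.parse("{http://www.alfresco.org/model/content/1.0}description");
        QNameSketch shortForm = QNameSketch.fromPrefixed("cm:description", prefixes);
        System.out.println("same QName: " + longForm.equals(shortForm));
    }
}
```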

Permissions

We can set individual permissions for every node in the repository. For example, when we create a new folder node, it can be set up to be accessible by only one particular person or one specific group of people. We can also just inherit permissions from the parent node; this is the default behavior if permissions are not manually specified when a node is created.

User groups

Groups of users can be created to better handle permission settings. Usually it is a good idea to set permissions for folders and content via a group. Then we can just add and remove members from the group without having to worry about setting or removing permissions for individual users. And when permissions are changed for a group they are immediately propagated to all users.

Users and groups can be synchronized with an external directory server and then we can manage the group membership in a more centralized way.

There are a couple of groups created automatically when Alfresco is installed, and they all have special meanings, as follows:

  • EVERYONE: This is a group that implicitly has all users as members. This group is not viewable when managing groups via the user interface, but you can use this group when setting up permissions.

  • ALFRESCO_ADMINISTRATORS: Any member of this group has administrator rights, meaning full access to the complete repository. This group has one member as default, which is the admin user.

  • EMAIL_CONTRIBUTORS: One of the Alfresco features is the ability to receive e-mails via a built-in SMTP service. Any e-mails sent into Alfresco cannot be stored unless the user sending the e-mail has been authenticated. The authentication is done via the sender's e-mail address, and the user matching the e-mail address must be a member of this group. If the user is not a member of this group, then the e-mail will not be stored.

Note

Groups can also be used for other things than permission management, for example, sending e-mails to members of a particular group or assigning workflow tasks to members of a group.

Roles

The permission system is role-based where each role groups together one or more individual permissions such as a Write or Read permission. To configure permissions for a node, we first choose what person or group—and this is referred to as the authority—should have access to the node, and then we specify what kind of access the authority should have to the node by associating it with a role.

There are five roles available out of the box and they are:

  • Consumer: Gives the authority Read permission to the node

  • Contributor: Gives the authority Read and Create permission to the node

  • Collaborator: Gives the authority Read, Create, and Update permission to the node

  • Editor: Gives the authority Read and Update permission to the node

  • Coordinator: Gives the authority full access to the node

The permissions specified for each role in the previous list are a generalization of the permissions for each role. It is good to think about the roles like that when discussing with clients and explaining the permissions for each role. For a complete detailed list of the permissions for each role, see the following page: http://www.alfresco.com/help/webclient/concepts/cuh-user-roles-permissions.html.

For Alfresco WCM, see the following page: http://wiki.alfresco.com/wiki/WCM_roles.

Permission groups

Permission groups are used to group together one or more related permissions. We need to worry about permission groups when we, for example, want to define a new role. The low-level permissions in Alfresco have been grouped as follows:

  • Read: Includes permission groups ReadProperties, ReadChildren, and ReadContent

  • Write: WriteProperties, WriteContent

  • Delete: DeleteNode, DeleteChildren

  • AddChildren: CreateChildren, LinkChildren

  • Execute: ExecuteContent

All low-level permissions have also been defined as groups containing only the individual permission. The individual permission name is the same as the group name but with an underscore at the beginning (for example, _ReadProperties).

Then there are higher level permission groups for the roles such as:

  • Consumer: Includes permission group Read

  • Contributor: Consumer, AddChildren, ReadPermissions

  • Editor: Consumer, Write, CheckOut, ReadPermissions

  • Collaborator: Editor, Contributor

  • Coordinator: Full control (can do anything)

If you want to have a look at the complete permission model, it is available in the /tomcat/webapps/alfresco/WEB-INF/classes/alfresco/model/permissionDefinitions.xml file.
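The way higher-level groups build on lower-level ones can be sketched as a recursive expansion in plain Java. The table below is copied from the lists above and is not read from permissionDefinitions.xml; the class PermissionGroupSketch is hypothetical. CheckOut and ReadPermissions are treated as leaves because their contents are not broken down here, and Coordinator is left out because it simply means full control.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration; the table mirrors the lists in this section only.
class PermissionGroupSketch {

    private static final Map<String, List<String>> GROUPS = new HashMap<>();

    static {
        // Low-level groupings
        GROUPS.put("Read", Arrays.asList("ReadProperties", "ReadChildren", "ReadContent"));
        GROUPS.put("Write", Arrays.asList("WriteProperties", "WriteContent"));
        GROUPS.put("Delete", Arrays.asList("DeleteNode", "DeleteChildren"));
        GROUPS.put("AddChildren", Arrays.asList("CreateChildren", "LinkChildren"));
        GROUPS.put("Execute", Arrays.asList("ExecuteContent"));
        // Role-level groupings
        GROUPS.put("Consumer", Arrays.asList("Read"));
        GROUPS.put("Contributor", Arrays.asList("Consumer", "AddChildren", "ReadPermissions"));
        GROUPS.put("Editor", Arrays.asList("Consumer", "Write", "CheckOut", "ReadPermissions"));
        GROUPS.put("Collaborator", Arrays.asList("Editor", "Contributor"));
    }

    // Returns every leaf name reachable from the given group, sorted by name.
    static Set<String> expand(String name) {
        Set<String> result = new TreeSet<>();
        collect(name, result);
        return result;
    }

    private static void collect(String name, Set<String> result) {
        List<String> members = GROUPS.get(name);
        if (members == null) {
            result.add(name);  // not a known group, so treat it as a leaf
            return;
        }
        for (String member : members) {
            collect(member, result);
        }
    }

    public static void main(String[] args) {
        for (String role : Arrays.asList("Consumer", "Contributor", "Editor", "Collaborator")) {
            System.out.println(role + " -> " + expand(role));
        }
    }
}
```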

Owner authority

There is a special authority called Owner, which by default is the user that created a piece of content. The Owner of content always has full access to it (that is, the same as the Coordinator role), no matter what permissions have been set up. Someone can take ownership of content if they have Coordinator or Administrator rights. An extra aspect, cm:ownable, is then applied to the content with a property, cm:owner, that specifies the username of the new owner.

Permission example

For a complete permission example, see the following page: http://wiki.alfresco.com/wiki/Security_and_Authentication#A_Simple_Permissions_Example.

Multi-Tenant

Alfresco is usually installed and used locally at a company or an organization. However, sometimes it might be useful to be able to divide the repository, so that content belonging to one group of users is kept separate from another group.

Maybe we are service providers who want to offer content management solutions as a service and we want to do this from a single Alfresco installation. This can be done by using the Multi-Tenant (MT) feature of Alfresco.

The MT feature enables Alfresco to be configured as a single-instance multi-tenant environment, which enables multiple independent tenants such as companies, organizations, or groups to be hosted on a single instance. This instance can be installed either on a single server or across a cluster of servers.

The MT feature is not enabled by default in Alfresco and has to be turned on by configuration in XML files. There is a special Tenant Administration Console that can be used to create, list, enable, and disable tenants.

 

Core platform


The core Alfresco platform is built on Java, which makes it deployable on any platform with a JVM. Typically, Alfresco is deployed to Apache Tomcat, but it is also possible to deploy Alfresco to Java Enterprise Edition (JEE) servers, such as JBoss, WebSphere, and WebLogic. The platform makes use of many of today's best open source projects to build a first class content management solution.

Open source libraries

One of the reasons Alfresco could build such a powerful content management solution in just a few years is that instead of reinventing the wheel for every needed feature, they looked at what open source projects were available that could provide the functionality that was needed. By using the best open source projects, Alfresco can develop functionality really fast and the solutions are much more stable than if everything was created from scratch.

The following is a list of the major open source projects that are used to implement the core platform:

  • Acegi Security: Used for authentication and method-level authorization

  • Apache Axis: Web Service container

  • Apache Abdera: ATOM syndication format and publishing protocol

  • Apache Chemistry: CMIS AtomPub binding and Abdera CMIS Extension for Web Services binding

  • Apache CXF: Service Framework for CMIS Web Services

  • Apache Commons: A lot of the commons libraries are used such as Codec, Lang, File Upload, HTTP Client

  • Apache PDFBox: Document transformations

  • Apache POI: Access Microsoft Office documents

  • Chiba: WCM Forms Engine

  • EHCache: Level 2 caching

  • FreeMarker: Presentation templates

  • Greenmail: IMAP support

  • Hibernate: Database access via Object-relational mapping

  • iBATIS: SQL mapping layer; replaces Hibernate in some places in Alfresco 3.3 and is planned to replace it completely in future versions of Alfresco, to address scalability and reliability issues

  • Java Mail: Sending e-mails

  • JBOSS jBPM: Workflow engine

  • JGroups: Clustering support via multicast protocol

  • Lucene: Indexing and searching

  • OpenOffice: Transforming office documents to text

  • Spring: Dependency Injection (DI) component container

  • Mozilla Rhino: Server-side JavaScript engine

  • OpenSymphony Quartz: Job scheduler

Services and components

The Alfresco platform is built around many so-called services, such as the Node Service, which is used to manipulate the nodes in the repository, and the Content Service, which is used to manipulate the content that the nodes point to. You can think of a service as the interface to some piece of CMS functionality. Each service has one or more implementations that are sometimes referred to as components.

The core platform uses the Spring Framework as its component container, so anyone familiar with Spring would feel at home looking at the source code and the configuration files. Spring also makes the platform very flexible and easy to customize.

Every service is available as a Spring Bean, so we can easily customize a service by overriding it with our own implementation. If needed, we can change only a minor feature and use the rest as is.

The platform also uses the aspect-oriented features of Spring to support transactions and security in a transparent way. Using an aspect-oriented approach also minimizes code duplication and intrusiveness in the service implementations.

The following figure illustrates:

These service interfaces are often referred to as the Alfresco Foundation Services and they are the lowest level of public interfaces giving access to the Alfresco content management functionality. A Spring client, in the same process as the Alfresco repository, can inject these services into custom code.

The File Folder Service is a little bit different from the rest of the services, as it uses the other services to implement its functionality. So you could say that it is not at the lowest level but abstracts some of the other services.

An important point is that the Foundation Services is where the transaction and security policies are enforced. These policies are declaratively specified and enforced for each service. The service interfaces are also stateless in the design, so all the data (that is, state) needed for a method operation has to be passed in to the service call.

Anyone familiar with the Strategy Design Pattern will see several uses of this pattern in the Alfresco platform. For example, when you configure an authentication method, you select a particular authentication strategy such as LDAP Authentication.

If the authentication strategy you are looking for is not available, such as LDAP server authentication when using CIFS, you can create your own authenticator component and just plug it into the platform. (Chapter 4, Authentication and Synchronization Solutions shows you a solution on how to do just this.)
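The Strategy pattern mentioned above can be illustrated with a few lines of plain Java. This is a minimal sketch under stated assumptions: the interface AuthenticationStrategySketch and the classes InMemoryAuthSketch and LoginServiceSketch are hypothetical names invented for this illustration, and they do not correspond to Alfresco's actual authentication component interfaces.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical strategy interface; Alfresco's real authentication components differ.
interface AuthenticationStrategySketch {
    boolean authenticate(String username, String password);
}

// One concrete strategy: checks credentials against an in-memory map
// (a stand-in for a real user store such as an LDAP directory).
final class InMemoryAuthSketch implements AuthenticationStrategySketch {
    private final Map<String, String> users = new HashMap<>();

    InMemoryAuthSketch add(String username, String password) {
        users.put(username, password);
        return this;
    }

    @Override
    public boolean authenticate(String username, String password) {
        return password != null && password.equals(users.get(username));
    }
}

// The caller only knows the interface, so strategies can be swapped freely.
final class LoginServiceSketch {
    private final AuthenticationStrategySketch strategy;

    LoginServiceSketch(AuthenticationStrategySketch strategy) {
        this.strategy = strategy;
    }

    String login(String username, String password) {
        return strategy.authenticate(username, password)
                ? "welcome " + username
                : "access denied";
    }

    public static void main(String[] args) {
        LoginServiceSketch service =
                new LoginServiceSketch(new InMemoryAuthSketch().add("martin", "secret"));
        System.out.println(service.login("martin", "secret"));
        System.out.println(service.login("martin", "wrong"));
    }
}
```

Plugging in a different authenticator then means passing another implementation of the interface to the service; the calling code stays unchanged.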

Content rules

An important part of any content management platform is to be able to automate business rules. In Alfresco, you can set up content rules to enforce business rules. Rules are applied to folders and they consist of the following parts:

  • General Description: Information about what the rule does

  • Condition: Consists of two parts:

    • When to execute the rule (one or more can be selected):

      • Inbound (when documents are created or added to a folder)

      • Outbound (when documents are deleted or moved from a folder)

      • Update (when documents are updated)

    • Selection criteria (one or more criteria can be specified):

      • Property Values (for example, if document author is martin, if document name contains the text security, and so on)

      • Has Aspect (for example, if document has aspect e-mailed applied)

      • Document of Type or Subtype (for example, if document is of type meeting)

      • All documents (default)

      • And more...

  • Action: What to do when the rule is executed (one or more actions can be selected). Examples of content rule actions are moving a document to another folder, transforming an MS Word document into a PDF, applying versioning, sending an e-mail, and running a script.

We normally set up these content rules from the Alfresco user interface but they can also be imported into the repository in a bootstrap procedure.

Content Rules are defined per folder and if a rule should apply to a lot of different folders, such as an "Apply Versioning" rule, it does not have to be defined for all folders. It is enough to define the rule for one folder and then link to that rule from all other folders that should have the same rule applied.

If more than one rule is defined, it is possible to specify in what order they should be executed. For example, let's say, we have the following rules:

  • Transform MS Word documents to PDF

  • Apply versioning to MS Word documents

If we run the rules on a new MS Word document, we will end up with a PDF version of the document without versioning applied, and versioning applied to the MS Word document. However, if we change the order of the rules so that the versioning is applied first, then the PDF version of the document will also have versioning applied. So it is very useful to be able to set the order of execution for the rules.
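The effect of rule ordering described above can be reproduced with a small model. This is a sketch, not Alfresco code: the RuleOrderSketch class, its Doc type, and the two rule constants are hypothetical, and the sketch assumes that a transformed copy carries over whatever aspects the source document has at the moment the transformation runs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class RuleOrderSketch {
    // Hypothetical stand-in for a document node: a name plus its applied aspects.
    static final class Doc {
        final String name;
        final Set<String> aspects = new HashSet<>();
        Doc(String name) { this.name = name; }
    }

    interface Rule {
        void apply(Doc source, List<Doc> folder);
    }

    // Creates a PDF copy that inherits the aspects the source has right now (assumption).
    static final Rule TRANSFORM_TO_PDF = (source, folder) -> {
        Doc pdf = new Doc(source.name.replace(".doc", ".pdf"));
        pdf.aspects.addAll(source.aspects);
        folder.add(pdf);
    };

    // Marks the source document as versionable.
    static final Rule APPLY_VERSIONING = (source, folder) -> source.aspects.add("versionable");

    // Adds a new Word document to an empty folder and runs the rules in the given order.
    static List<Doc> addWordDocument(List<Rule> rulesInOrder) {
        List<Doc> folder = new ArrayList<>();
        Doc word = new Doc("report.doc");
        folder.add(word);
        for (Rule rule : rulesInOrder) {
            rule.apply(word, folder);
        }
        return folder;
    }

    public static void main(String[] args) {
        for (Doc d : addWordDocument(Arrays.asList(TRANSFORM_TO_PDF, APPLY_VERSIONING))) {
            System.out.println("transform first:  " + d.name + " " + d.aspects);
        }
        for (Doc d : addWordDocument(Arrays.asList(APPLY_VERSIONING, TRANSFORM_TO_PDF))) {
            System.out.println("versioning first: " + d.name + " " + d.aspects);
        }
    }
}
```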

Rules can be executed asynchronously, which is good, as then operations such as adding a document to a folder will not be affected by how long it takes to execute all the enabled rules. The document will just be stored immediately and then at a later time the rules will be executed. What then happens if there is an error executing the rules later on? Can something be done to notify the users or send an e-mail? Yes, a script can be associated with a rule to run if an error occurs.
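The idea of storing the document immediately, running the rule later, and reacting to a failure can be sketched with a standard Java executor. Everything here is hypothetical and only models the behavior described above: the AsyncRuleSketch class is not part of Alfresco, and the notification list stands in for whatever the error script would do, such as sending an e-mail.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

final class AsyncRuleSketch {
    final List<String> stored = Collections.synchronizedList(new ArrayList<String>());
    final List<String> notifications = Collections.synchronizedList(new ArrayList<String>());
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    // Stores the document right away, then queues the rule action for later execution.
    Future<?> addDocument(String name, Runnable ruleAction) {
        stored.add(name); // the add operation does not wait for the rule
        return executor.submit(() -> {
            try {
                ruleAction.run();
            } catch (RuntimeException e) {
                // Compensating step, standing in for the script that runs on error.
                notifications.add("Rule failed for " + name + ": " + e.getMessage());
            }
        });
    }

    // Waits for all queued rule executions to finish; returns false on timeout.
    boolean shutdownAndWait() {
        executor.shutdown();
        try {
            return executor.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        AsyncRuleSketch sketch = new AsyncRuleSketch();
        sketch.addDocument("ok.doc", () -> { });
        sketch.addDocument("bad.doc", () -> { throw new IllegalStateException("transformation failed"); });
        System.out.println("Stored immediately: " + sketch.stored);
        sketch.shutdownAndWait();
        System.out.println("Notifications: " + sketch.notifications);
    }
}
```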

Event model

Sometimes, using rules might not be enough for what we want to do. Let's say that we wanted to execute a business rule just before a node is deleted and a business rule after the node has been deleted. This cannot be done with rules, as they allow us to execute business rules only when nodes are created, deleted, or updated. Other examples are executing business logic after version changes for a document or when an association is deleted.

Also, in some cases a rule will not behave correctly if content is added via CIFS. We might then have to resort to using the event model.

Note

If we, for example, add a rule that enforces a specific document naming convention for a particular folder hierarchy, then this rule will work okay from the Alfresco Explorer and the Alfresco Share user interfaces where it will prohibit the user from adding the document by throwing an exception. However, via CIFS the document will still be added even though an exception was thrown.

In these cases, we can use the Alfresco event model that enables us to execute Java or JavaScript code when an event happens in the system. In Alfresco, these events are called behavior policies.

The events that we can listen to are associated with the service that triggers them. The Content Service events can be found in the org.alfresco.repo.content.ContentServicePolicies class and look like this:

Event (Behavior policy)

Description

onContentUpdate

Called once per node when one or more of its content properties are updated. If you need to know which properties were updated, and the before and after values of those properties, then use the onContentPropertyUpdate method instead.

onContentPropertyUpdate

Called once for each node property that has been updated. The before update and after update value for the property is available.

onContentRead

Called when a content reader is requested for a node that has content. Could, for example, be used to keep statistics for when a document was last accessed, how many times it has been accessed, and so on.

The Copy Service events can be found in the org.alfresco.repo.copy.CopyServicePolicies class:

Event (Behavior policy)

Description

getCopyCallback

Called just before the copying of the node to find out what should be copied. By default, all aspects, the type, properties, and associations will be copied. However, sometimes it is necessary to be able to customize what should be copied and not copied. For example, if some properties should not be copied, then implement this method and exclude these properties from being copied. When implementing this method all aspects, associations, and properties that should be copied need to be specified.

beforeCopy

Called once it has been decided which properties and aspects will be copied, but before the copy occurs.

This allows us to remove cached data based on the destination node before it is overwritten. It is not possible to change what gets copied at this point; that must be done earlier via getCopyCallback.

onCopyComplete

Called when the copying has completed (including any cascading).
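The kind of decision a getCopyCallback implementation makes, namely which properties to copy and which to leave out, can be sketched as a simple filter. This is not the Alfresco copy callback API: the class CopyFilterSketch and the property bmw:approvedBy are hypothetical and used only for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

final class CopyFilterSketch {
    // Returns the properties that should be copied: everything except the excluded names.
    static Map<String, String> propertiesToCopy(Map<String, String> source, Set<String> excluded) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : source.entrySet()) {
            if (!excluded.contains(entry.getKey())) {
                result.put(entry.getKey(), entry.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> source = new LinkedHashMap<>();
        source.put("cm:name", "minutes.doc");
        source.put("cm:title", "Board meeting minutes");
        source.put("bmw:approvedBy", "martin"); // hypothetical custom property
        Set<String> excluded = new HashSet<>(Arrays.asList("bmw:approvedBy"));
        // An approval should not follow the document into its copy.
        System.out.println(propertiesToCopy(source, excluded));
    }
}
```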

The Asynchronous Action Execution Queue events can be found in the org.alfresco.repo.action.AsynchronousActionExecutionQueuePolicies class:

Event (Behavior policy)

Description

onAsyncActionExecute

Called when an asynchronous action has completed execution and the transaction has committed. This event is not linked to a content type or aspect, but to a service implementation.

The Node Service events can be found in the org.alfresco.repo.node.NodeServicePolicies class. There are a lot of them, so to make them easier to survey we can divide them into the following groups:

  • Store events

  • Node events

  • Aspect events

  • Association events

The following node events are related to the store:

Event (Behavior policy)

Description

beforeCreateStore

Called before a new store is created.

onCreateStore

Called just after a new store has been created.

The following node events are related to any node and its properties:

beforeCreateNode

Called before a new node and its parent association is created.

onCreateNode

Called just after a new node and its parent association have been created. Note that this event method is called before onCreateChildAssociation and onUpdateProperties.

onMoveNode

Called when a node has been moved. Note that this method is not called if the node is moved to another store such as from Working Store to Archive Store.

beforeUpdateNode

Called before a node is updated in the following situations:

Adding or removing aspects

Adding, removing, or updating properties

Adding or removing associations

Updating type

onUpdateNode

Called when a node is updated in the following situations:

Just before adding aspects

Just after removing aspects

Just after adding, updating, or removing properties

Adding or removing associations

Updating type

onUpdateProperties

Called when a node's properties are updated in these situations:

Just after adding, updating, or removing properties

Just after a node has been created

beforeDeleteNode

Called before a node is deleted. If the node has children, then this method will be called before each child is deleted. This method will also be called before a node is moved and the old source node is about to be deleted.

onDeleteNode

Called after a node has been deleted. If the node has children, then this method will be called after each child is deleted. This method will also be called after a node is moved and the old source node has been deleted.

The following node events are related to manipulation of a node aspect:

beforeAddAspect

Called before an aspect is added to a node.

onAddAspect

Called after an aspect has been added to a node.

beforeRemoveAspect

Called before an aspect is removed from a node.

onRemoveAspect

Called after an aspect has been removed from a node.

The following node events are related to manipulation of a node association:

beforeCreateNodeAssociation

Never called. (Should be called before a parent<->child or peer association is created).

onCreateNodeAssociation

Never called. (Should be called after a parent<->child or peer association has been created).

beforeCreateChildAssociation

Called before a parent<->child association is created in the following situations:

When a parent<->child association is directly created

When a parent<->child association is indirectly created as a result of a new node being created

When a parent<->child association is indirectly created as a result of a node being moved

onCreateChildAssociation

Called after a parent<->child association has been created in the following situations:

When a parent<->child association is directly created

When a parent<->child association is indirectly created as a result of a new node being created

When a parent<->child association is indirectly created as a result of a node being moved

beforeDeleteChildAssociation

Called before a parent<->child association is deleted in the following situations:

When a parent<->child association is directly deleted

When a parent<->child association is indirectly deleted as a result of a node being deleted

When a parent<->child association is indirectly deleted as a result of a node being moved

When a parent<->child association is indirectly deleted as a result of an aspect being removed from a node

onDeleteChildAssociation

Called after a parent<->child association has been deleted in the following situations:

When a parent<->child association is directly deleted

When a parent<->child association is indirectly deleted as a result of a node being deleted

When a parent<->child association is indirectly deleted as a result of a node being moved

When a parent<->child association is indirectly deleted as a result of an aspect being removed from a node

onCreateAssociation

Called after a peer association has been created.

onDeleteAssociation

Called after a peer association has been deleted.

The Records Management events can be found in the org.alfresco.module.org_alfresco_module_dod5015.RecordsManagementPolicies class:

Event (Behavior policy)

Description

beforeRMActionExecution

Called before a records management action executes.

onRMActionExecution

Called when a records management action has been executed.

beforeCreateReference

Called before creation of a custom reference between two components of a record, such as between an e-mail and its attachment.

onCreateReference

Called after the creation of a custom reference between two components of a record, such as between an e-mail and its attachment.

beforeRemoveReference

Called before removal of a custom reference between two components of a record, such as between an e-mail and its attachment.

onRemoveReference

Called after a custom reference has been removed between two components of a record, such as between an e-mail and its attachment.

The Version Service events can be found in the org.alfresco.repo.version.VersionServicePolicies class:

Event (Behavior policy)

Description

beforeCreateVersion

Called before a new version is created for a document. Also called before the version history is checked.

onCreateVersion

Called immediately before the new version node is created. Use it to determine what the versioning policy for a particular type may be.

afterCreateVersion

Called after the version has been created and after any associations (for example, to the root version) have been created.

calculateVersionLabel

Called when the version label should be calculated.

Implement this method to do version numbering and labeling in a custom way.
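As an illustration of what custom version labeling might compute, the following sketch derives the next label from the previous one. It is not the Alfresco calculateVersionLabel signature: the class VersionLabelSketch, its method, and the "major.minor" scheme that starts at 1.0 or 0.1 are assumptions made for this example.

```java
final class VersionLabelSketch {
    // Computes the next "major.minor" label.
    // A null or empty previous label means the version history is just starting.
    static String nextLabel(String previous, boolean majorChange) {
        if (previous == null || previous.isEmpty()) {
            return majorChange ? "1.0" : "0.1";
        }
        String[] parts = previous.split("\\.");
        int major = Integer.parseInt(parts[0]);
        int minor = Integer.parseInt(parts[1]);
        return majorChange
                ? (major + 1) + ".0"
                : major + "." + (minor + 1);
    }

    public static void main(String[] args) {
        String label = nextLabel(null, true);
        System.out.println(label);
        label = nextLabel(label, false);
        System.out.println(label);
        label = nextLabel(label, true);
        System.out.println(label);
    }
}
```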

Defining custom business logic to be executed when any of these events occur also requires knowledge of the content model that is implemented. When an event handler is registered with the system, the following things need to be specified:

  • Event method name: QName for the event method (for example, {http://www.alfresco.org}onCreateNode)

  • Content model class: QName for type or aspect that should be affected (for example, {http://www.bestmoney.com/content/model/1.0}meeting)

  • Java method name: Name of the method that implements the business logic that should be executed when the event happens

This makes it possible to pinpoint exactly what content should be affected by the business logic when the event occurs. Besides being able to bind the business logic to a particular content model class (that is, type or aspect), it can also be bound to either a content model association or a content model property.

When an operation such as adding a document is executed, it is done in a transaction. During this, any registered event handlers are called in the same transaction and in the order they were registered. Because of this, a faulty event handler could prohibit the system from working properly. Let's say, we install an event handler that triggers when content is added to the system (that is, onCreateNode) and we have made a mistake when coding it.

If this coding mistake results in an exception being thrown, then that rolls back the transaction. This would effectively block anybody from adding content to the system. So if unexpected errors happen after installing a lot of event handlers, it might be a good idea to remove them and see if they are the cause of the problem.
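The following sketch models why a faulty handler blocks content from being added. It is a simplification under stated assumptions: the class HandlerRollbackSketch is hypothetical, and the snapshot-and-restore logic merely stands in for a real transaction rollback.

```java
import java.util.ArrayList;
import java.util.List;

final class HandlerRollbackSketch {
    interface CreateHandler {
        void onCreateNode(String nodeName);
    }

    private final List<String> repository = new ArrayList<>();
    private final List<CreateHandler> handlers = new ArrayList<>();

    void register(CreateHandler handler) {
        handlers.add(handler);
    }

    // Adds a node and calls the handlers in registration order, all as one unit of work.
    boolean addNode(String name) {
        List<String> snapshot = new ArrayList<>(repository);
        try {
            repository.add(name);
            for (CreateHandler handler : handlers) {
                handler.onCreateNode(name);
            }
            return true; // "commit"
        } catch (RuntimeException e) {
            repository.clear();
            repository.addAll(snapshot); // "rollback": the node is not kept
            return false;
        }
    }

    List<String> contents() {
        return new ArrayList<>(repository);
    }

    public static void main(String[] args) {
        HandlerRollbackSketch repo = new HandlerRollbackSketch();
        System.out.println(repo.addNode("first.doc") + " " + repo.contents());
        // A handler with a coding mistake: it always throws.
        repo.register(name -> { throw new NullPointerException("bug in handler"); });
        System.out.println(repo.addNode("second.doc") + " " + repo.contents());
    }
}
```

Once the faulty handler is registered, every call to addNode fails, which mirrors the situation where nobody can add content until the handler is removed or fixed.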

It is possible to do even more fine-tuning of where in the transaction event handlers should be called (that is, compared to just before or after an operation). There are three different stages (Alfresco calls it notification frequencies) where the custom handler could be configured to be invoked:

  • EVERY_EVENT: This is the default if the notification frequency is not specified. The event handler is then just executed wherever it is being invoked in the code. The name of this notification frequency implies that the event handler will be called multiple times, but that is not the case.

  • TRANSACTION_COMMIT: The event handler is queued and invoked at the end of the transaction after it has been committed. A proxy around the event handler manages the queuing.

  • FIRST_EVENT: The event handler is invoked just after the transaction is started. A proxy around the event handler manages this.

The following figure shows an example of how an event handler works:

Here, we have added an event handler that will be called whenever a new document is added to the repository. When that happens, we send an e-mail. The sequence is as follows:

  1. The event handler, onCreateNode, is registered with the Alfresco Event Manager to be triggered for any documents of type cm:content.

  2. Someone uploads a document.

  3. This triggers the Policy Component to check if there are any registered event handlers for the onCreateNode event.

  4. The Policy Component finds one event handler and calls the onCreateNode method in our custom code.

  5. Our custom code sends an e-mail from the onCreateNode method.
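The sequence above can be walked through with a toy registry. This is a sketch and not the Alfresco Policy Component: the class PolicyRegistrySketch and its bind and upload methods are hypothetical, short names are used instead of full QNames, and a list of strings stands in for actually sending e-mail.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

final class PolicyRegistrySketch {
    private final Map<String, List<Consumer<String>>> handlers = new HashMap<>();
    final List<String> sentMail = new ArrayList<>();

    // Step 1: bind a handler to an event name and a content model class.
    void bind(String event, String contentClass, Consumer<String> handler) {
        String key = event + "|" + contentClass;
        List<Consumer<String>> bound = handlers.get(key);
        if (bound == null) {
            bound = new ArrayList<>();
            handlers.put(key, bound);
        }
        bound.add(handler);
    }

    // Steps 2 to 4: an upload looks up the onCreateNode handlers bound to the
    // node's class and calls each one. Returns the number of handlers invoked.
    int upload(String nodeName, String contentClass) {
        List<Consumer<String>> bound = handlers.get("onCreateNode|" + contentClass);
        if (bound == null) {
            return 0;
        }
        for (Consumer<String> handler : bound) {
            handler.accept(nodeName);
        }
        return bound.size();
    }

    public static void main(String[] args) {
        PolicyRegistrySketch registry = new PolicyRegistrySketch();
        // Step 5 happens inside the handler: it "sends" an e-mail.
        registry.bind("onCreateNode", "cm:content",
                name -> registry.sentMail.add("New document: " + name));
        registry.upload("report.doc", "cm:content");
        registry.upload("Projects", "cm:folder"); // no handler bound for folders
        System.out.println(registry.sentMail);
    }
}
```

Because the handler is bound to a specific content model class, uploading a folder does not trigger it, which is the pinpointing effect described earlier for event handler registration.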

Note

When using the event manager it is important to know that it manages synchronous events. So whatever code we implement in the event handler will be executed in the same transaction as the main Alfresco code that triggered the event. This means that the code in the event handler will impact the performance of general Alfresco operations such as adding a document. So we need to be careful about this and use this event system only when it is really necessary.

Metadata extraction

The idea with metadata extraction is to automatically classify content when it is added to the Alfresco repository. The extracted properties are stored together with the physical content as metadata. The more metadata that is stored with a piece of content, the more ways there are to search for it.

The whole idea with a content management system is to be able to manage content in an easier way and metadata plays a big role in that. Metadata helps with:

  • Search: As it gives the possibility to search on individual properties or combinations of properties.

  • Faster access: Metadata is usually indexed, which means that the time it takes to search for content is reduced.

  • Sorting: In Alfresco, documents can be tagged to belong to a certain type of content. For example, you could tag all documents that have to do with running with the word "running". It is then very easy to find and sort all documents that have to do with running.

  • Management: If, for example, there is a piece of custom code that manages meeting documents in some way, then it would be very easy for that code to find relevant documents, if they have proper meeting metadata.

  • Rights: Content rights management could be handled via metadata.

Metadata extractors are used to automatically extract properties from different content formats. Alfresco comes with a number of metadata extractors that are used by default without us having to do anything:

  • PDF: Extracts the author (as cm:author), title (cm:title), subject (cm:description), and created (cm:created) from PDF files using the Apache PDFBox library.

  • MS Office: Extracts the author (as cm:author), title (cm:title), subject (cm:description), createdDateTime (cm:created), and lastSaveDateTime (cm:modified) from Microsoft Office documents (97-2003, 2007) using the Apache POI library. More MS Office document properties can be extracted, but they are not currently saved as metadata: comments, editTime, format, keywords, lastAuthor, lastPrinted, osVersion, thumbnail, pageCount, and wordCount.

  • MSG Email: Extracts the sentDate (as cm:sentDate), originator (cm:originator), addressee (cm:addressee), addressees (cm:addressees), and subjectLine (cm:subjectline) from an Outlook e-mail (that is, .msg) using Apache POI.

  • HTML: Extracts the author (as cm:author), title (cm:title), and description (cm:description) from HTML files using the Apache PDFBox library.

  • OpenOffice: Extracts the creator (as cm:author), title (cm:title), description (cm:description), and creationDate (cm:created) from OpenOffice.org documents using the Apache Tika library. More OpenOffice.org document properties can be extracted, but they are not currently saved as metadata: date, generator, initialCreator, keyword, language, printDate, printedBy, and subject.

  • MIME Email: Extracts the messageFrom (as imap:messageFrom), messageTo (imap:messageTo), messageCc (imap:messageCc), messageSubject (imap:messageSubject), messageSent (imap:dateSent), Thread-Index (imap:threadIndex), Message-ID (imap:messageId), and date (imap:dateReceived) from a MIME e-mail (that is, .eml in RFC822 format) using JavaMail.

  • StarOffice (Oracle Open Office): Extracts the author (as cm:author), title (cm:title), and description (cm:description) from Oracle Open Office documents.

  • DWG: Extracts the author (as cm:author), title (cm:title), and description (cm:description) from drawings (that is, .dwg) produced by several CAD packages such as AutoCAD, IntelliCAD, and Caddie. The Apache Tika library is used for this. There are also a few other drawing properties that are available to set as metadata: keywords, comments, and lastauthor.

There are also experimental metadata extractors such as the MP3 extractor that you would have to manually turn on to test out.

Metadata extractors that can extract more properties than are mapped to metadata by default can be configured to save these extra properties as metadata as well. When there is no metadata extractor available for a file type, a custom extractor can be written and plugged into the system. For more information, see http://wiki.alfresco.com/wiki/Metadata_Extraction.
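The mapping from raw extracted values to content model properties can be thought of as a lookup table. The following is a sketch, not the Alfresco extractor configuration: the class MetadataMappingSketch is hypothetical, the first three mappings follow the PDF extractor description above, and my:keywords is an invented property used to show an extra mapping.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

final class MetadataMappingSketch {
    // Maps raw extracted values to model properties; raw values without a mapping are dropped.
    static Map<String, String> toModelProperties(Map<String, String> raw, Map<String, String> mapping) {
        Map<String, String> result = new TreeMap<>();
        for (Map.Entry<String, String> entry : raw.entrySet()) {
            String target = mapping.get(entry.getKey());
            if (target != null) {
                result.put(target, entry.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> mapping = new HashMap<>();
        mapping.put("author", "cm:author");
        mapping.put("title", "cm:title");
        mapping.put("subject", "cm:description");
        mapping.put("keywords", "my:keywords"); // hypothetical extra mapping

        Map<String, String> raw = new HashMap<>();
        raw.put("author", "martin");
        raw.put("title", "Security policy");
        raw.put("keywords", "security, policy");
        raw.put("pageCount", "12"); // extracted but not mapped, so not saved

        System.out.println(toModelProperties(raw, mapping));
    }
}
```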

Content transformation

Content transformers are an important part of the Alfresco content management system, as they enable content to be indexed by Lucene. All content that should be indexed first needs to be converted to plain text, which is done with transformers.

Content transformers are also very useful when you want to create new content formats for publishing, or the like. For example, when a Word document has been approved, we might want to automatically create a PDF version of it. Another useful feature of transformers is that they can be used to convert images into different formats and sizes.

The Alfresco system comes with a number of content transformers out of the box:

  • Any text to plain text: Converts any textual format to plain text. For example, text/xml to text/plain or application/x-javascript to text/plain (used primarily for indexing).

  • PDF to plain text: Converts PDF files to plain text files using the Apache PDFBox library (used primarily for indexing).

  • Excel to plain text: Converts Microsoft Excel (version 97-2003, 2007) files to plain text files using the Apache POI library (used primarily for indexing).

  • Word to plain text: Converts Microsoft Word (version 97-2003, 2007) files to plain text files using the TextMining library (used primarily for indexing).

  • HTML to plain text: Converts HTML documents to plain text files using the HTML Parser library (used primarily for indexing).

  • E-mail (Outlook) to text: Converts Microsoft Outlook e-mails (that is, .msg) to plain text files using the Apache POI library (used primarily for indexing).

  • E-mail (MIME) to text: Converts RFC822 MIME e-mails (that is, .eml) to plain text files using the Java Mail library (used primarily for indexing).

  • MediaWiki markup to HTML: Converts MediaWiki Markup pages to HTML documents using the Java Wikipedia API library.

  • PDF to image: Converts a PDF file to a PNG image using the Sun PDF Renderer library or the Apache PDFBox library. Converts a PDF file to a JPEG or GIF Image using the ImageMagick tool.

    This transformer can actually transform to a lot of different image formats supported by the ImageMagick tool (over 100) including DPX, EXR, GIF, JPEG, JPEG-2000, PhotoCD, and TIFF (used for thumbnails).

  • Image to image: Converts an image to a different size or format via the ImageMagick tool.

  • Open Office to PDF: Converts OpenDocument, PDF, RTF, Word, Excel, PowerPoint, and Flash files into PDF files using the JODConverter and OpenOffice.org.

  • Open Office to image: Converts Open Office documents into images using the Open Office to PDF transformer and the PDF to image transformer (used for thumbnails).

  • Plain Text to PDF: Converts a plain text file into a PDF file using the Apache PDFBox library.

  • Plain Text to image: Converts a plain text file into an image using the Plain Text to PDF transformer and the PDF to Image transformer.

Transformers can be configured and set up in a couple of ways to reach the end transformation goal. As we have seen in the previous list, most transformers convert directly from one format to another, such as from PDF to plain text.

However, transformers can also be combined to solve a more complex transformation such as a plain text to Image transformation where two or more transformers work together. These transformers are called complex transformers and here is an example of how they work:

In this example, a text document is transformed into a PNG image by first being transformed to a PDF file.
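A complex transformer can be pictured as function composition, where the output of one step is the input of the next. This is only a sketch: the class ComplexTransformerSketch, its Content type, and the two step constants are hypothetical, and the "transformations" just wrap a string rather than producing real PDF or PNG data.

```java
import java.util.Arrays;
import java.util.List;

final class ComplexTransformerSketch {
    // Hypothetical content holder: a MIME type plus a description of the payload.
    static final class Content {
        final String mimetype;
        final String body;
        Content(String mimetype, String body) {
            this.mimetype = mimetype;
            this.body = body;
        }
    }

    interface Transformer {
        Content transform(Content in);
    }

    // Rejects input that a step cannot handle.
    static Content require(Content in, String expected) {
        if (!expected.equals(in.mimetype)) {
            throw new IllegalArgumentException("Expected " + expected + " but got " + in.mimetype);
        }
        return in;
    }

    static final Transformer TEXT_TO_PDF =
            in -> new Content("application/pdf", "pdf(" + require(in, "text/plain").body + ")");

    static final Transformer PDF_TO_PNG =
            in -> new Content("image/png", "png(" + require(in, "application/pdf").body + ")");

    // Builds one transformer that runs the given steps one after the other.
    static Transformer chain(List<Transformer> steps) {
        return in -> {
            Content current = in;
            for (Transformer step : steps) {
                current = step.transform(current);
            }
            return current;
        };
    }

    public static void main(String[] args) {
        Transformer textToPng = chain(Arrays.asList(TEXT_TO_PDF, PDF_TO_PNG));
        Content result = textToPng.transform(new Content("text/plain", "hello"));
        System.out.println(result.mimetype + " " + result.body);
    }
}
```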

If we are not sure whether one particular transformer can always successfully transform one format to the other, we can chain together the so-called failover transformers. Only one of these transformers will do the transformation. When the first transformer gets the job to transform, it will either respond with successful transformation or an exception. If a transformer responds with an exception, then the turn goes to the next transformer in the chain to see if it can do the transformation successfully.
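The failover behavior described above, where the next transformer is tried only when the previous one throws an exception, can be sketched as a simple loop. The class FailoverTransformerSketch and its PdfToPng interface are hypothetical names for this illustration and are not Alfresco classes.

```java
import java.util.Arrays;
import java.util.List;

final class FailoverTransformerSketch {
    interface PdfToPng {
        String transform(String pdfName);
    }

    // Tries each transformer in turn; the first one that does not throw does the job.
    static String transformWithFailover(List<PdfToPng> chain, String pdfName) {
        RuntimeException last = null;
        for (PdfToPng transformer : chain) {
            try {
                return transformer.transform(pdfName);
            } catch (RuntimeException e) {
                last = e; // remember the failure and let the next transformer try
            }
        }
        throw new IllegalStateException("No transformer succeeded for " + pdfName, last);
    }

    public static void main(String[] args) {
        PdfToPng failing = pdf -> { throw new UnsupportedOperationException("cannot render " + pdf); };
        PdfToPng working = pdf -> pdf.replace(".pdf", ".png");
        System.out.println(transformWithFailover(Arrays.asList(failing, working), "report.pdf"));
    }
}
```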

The following example shows two transformers in a chain that can do PDF to PNG transformations:

In some cases, the transformation we would like to do is not possible via a Java Library but only via a command-line executable file. We can then use a special transformer called a runtime executable content transformer and it can be used to run a tool such as ImageMagick or OpenOffice.org from the command line.

For more information, see http://wiki.alfresco.com/wiki/Content_Transformations.

Alfresco Management Beans (JMX)

The core platform contains management beans (JMX) that can be used to configure Alfresco when it is running and also to inspect the health of the running system. By using a standard JMX Console such as JConsole that supports JSR 160 (that is, JMX Remoting) you can:

  • Manage and control the subsystems

  • Change the log level to, for example, debug for some part of the system

  • Turn on and off file servers such as CIFS

  • Set the server in read-only mode

  • Set the server in single user mode

  • Prevent further logins

  • View user session tickets

Some of these features are really useful such as the possibility to turn on debug logging for a specific component without having to stop the server. It might not even be possible to stop the server whenever we want in a production system. Setting up the system in read-only mode is also very useful if, for example, you need to take a snapshot backup of the system's current state for offline debugging.

 

Application Programming Interfaces (APIs)


There are several APIs that you can use when extending the platform such as the low-level Java Foundation Services API or the JavaScript API. When you implement Java extensions delivered in the form of an Application Module Package (AMP), you would mostly use the so-called Foundation Service API, but when you implement Web Scripts and Business Rules it is more convenient to use the JavaScript API.

If you access the repository from a remote application using a client-server approach then there are several REST-based APIs such as the Repository API and the CMIS API that can be used. The higher-level APIs use the Foundation Service API that has transaction management and security built in.

See the next chapter for more information on how to use these APIs.

 

Subsystems


Subsystems are configurable modules responsible for a piece of functionality in the Alfresco content management system. The functionality is often optional such as the IMAP server or can have several different implementations, such as authentication.

A subsystem can be thought of as a mini-server that runs embedded within the main Alfresco server. A subsystem has the following characteristics:

  • It can be started, stopped, and configured independent of the Alfresco server

  • It has its own isolated Spring application context and configuration

Subsystems can be brought up and down independently of each other and of the main server. This design lets an administrator of the system change a single configuration without having to bring down the entire Alfresco system. The advantages are reliability and availability.

Examples of Alfresco subsystems include:

  • Audit: Configuration of audit parameters

  • Authentication: Contains different authentication subsystems such as LDAP

  • E-mail: SMTP support for sending e-mails

  • File servers: CIFS, FTP, and NFS servers

  • IMAP: Internal IMAP server

  • OpenOffice.org transformations: Helps convert office documents to text

  • Synchronization: LDAP synchronization settings

  • Sys admin: It allows real-time control across some general repository parameters

  • Third-party: Owns the SWFTools and ImageMagick content transformers

  • WCM deployment receiver: A built-in WCM deployment target for local AVM to DM content deployments

 

Bootstrap


There are several ways to bootstrap the system with custom functionality or new content.

Patches

In some situations, we want to add something to the system just once during installation and then never do it again. And we want to do this in a controlled way, specifying from what version of the system it is applicable. This is called patching the system or bootstrapping the system. Alfresco uses patches to handle different things such as:

  • Database upgrades

  • Template installations

  • Folder creation

  • Permission updates

  • Group imports, and so on

Every time we do a new Alfresco installation, or an upgrade, the logs will show what patches were executed. Every patch execution is logged in the database with information about whether it was successful or not. If an error occurred, then the database will contain an error message about it. If a patch did not succeed, then Alfresco will try to execute it every time you start the system until it is successful.

We can also set in what order patches should be executed, which is important, as one patch often depends on another patch's updates. Patches are implemented as Java classes and it is possible to create custom patches (we will have a look at that in the next chapter).
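
The run-once, retry-until-successful behavior described above can be modeled with a short Python sketch. It is a simplified model and not how Alfresco implements patches: the Patch class, the patch IDs, and the rule that a patch is skipped until its dependencies have succeeded are all assumptions made for the illustration.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Optional


@dataclass
class Patch:
    patch_id: str
    apply: Callable[[], None]
    depends_on: List[str] = field(default_factory=list)


def run_patches(patches: List[Patch],
                applied_log: Dict[str, Optional[str]]) -> List[str]:
    """Run outstanding patches in dependency order on system start.

    applied_log plays the role of the database log: it maps a patch ID to
    None after a successful run, or to the error message after a failure.
    Returns the IDs of the patches that were attempted on this start.
    """
    def succeeded(patch_id: str) -> bool:
        return patch_id in applied_log and applied_log[patch_id] is None

    by_id = {p.patch_id: p for p in patches}
    order = TopologicalSorter(
        {p.patch_id: p.depends_on for p in patches}).static_order()
    attempted = []
    for patch_id in order:
        patch = by_id[patch_id]
        if succeeded(patch_id):
            continue  # already applied on an earlier start
        if not all(succeeded(dep) for dep in patch.depends_on):
            continue  # assumption: wait until dependencies have succeeded
        attempted.append(patch_id)
        try:
            patch.apply()
            applied_log[patch_id] = None
        except Exception as exc:
            applied_log[patch_id] = str(exc)  # retried on the next start
    return attempted


attempts = {"count": 0}


def create_folders() -> None:
    pass


def import_groups() -> None:
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise RuntimeError("directory server unavailable")


patches = [
    Patch("patch.importGroups", import_groups, ["patch.createFolders"]),
    Patch("patch.createFolders", create_folders),
]
log: Dict[str, Optional[str]] = {}
first_start = run_patches(patches, log)
log_after_first_start = dict(log)
second_start = run_patches(patches, log)
```

On the first start both patches are attempted and the group import fails. On the second start only the failed patch is run again.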

Importers

An importer component is also used to import data into the repository in the same way as a patch is. However, it differs from a patch in that the outcome of executing it is not logged in the database, and to control the execution order we have to use the Spring depends-on configuration. Importer components are also not usually written in Java. An importer component can be used to import the following things into the repository via XML files:

  • Users

  • Groups

  • Scripts

  • Presentation templates

  • Folder hierarchies

  • Documents

Data that should be imported can first be exported via the export command-line tool provided by Alfresco, which produces XML files that can be loaded into the repository via an importer component. So it might make sense to first create what you want to import via the user interface and then export it. The next chapter shows an example of how to configure an importer component.

 

Extension modules


Extension modules are used to extend Alfresco with significant new functionality, such as records management. Extension modules are delivered in so-called Application Module Package (AMP) files.

The following extension modules are available from Alfresco:

  • Web Content Management (WCM) AMP

  • Records Management (RM) AMP

  • SharePoint Protocol Support (VTI) AMP

  • Alfresco Forms Development Kit (FDK) AMP

  • Lotus Quickr Integration AMP

  • MediaWiki Integration AMP

 

Third-party extension modules


A lot of companies develop extensions for Alfresco and deliver them in AMP modules. The following is a list of some available modules:

  • Thumbnails AMP: Adds thumbnails to Alfresco Explorer

  • Enterprise Reporting AMP: Enables running reports from the Alfresco environment

  • OpsMailmanager AMP: Enterprise e-mail management

  • Alfresco Bulk Filesystem Import: Loads content from local filesystem and can preserve modified date during import (see Chapter 8, Document Migration Solutions for more information)

 

User interface clients


There are a couple of different user interfaces that you can use to access Alfresco either from a browser or from the filesystem.

Alfresco Explorer

Alfresco Explorer is the traditional Document Management client that most people use when they access the repository via a web interface. It has access to the complete repository and also offers administration functionality to handle, for example, users, groups, import, and export.

This client is successively being replaced by the Alfresco Share client, which has a much nicer and richer user interface.

However, there are situations when the Alfresco Share client does not yet support all functionality that Alfresco Explorer provides, such as when using advanced workflows or creating and managing web content. In these cases, we still have to use the Alfresco Explorer client.

Likewise, the Alfresco Explorer client does not support the collaboration and sharing features that are available in Alfresco Share.

Alfresco Share

The Alfresco Share client is the new client for Alfresco that started out life as a pure collaboration and sharing platform. After a while, it became very popular and people wanted to use it for more than just collaboration and sharing features. Alfresco responded by adding more functionality to be able to access the document repository via Alfresco Share. Now it also has a lot of the administration features previously only available in Alfresco Explorer.

Alfresco SharePoint

This is not really a standalone client, but more an integration with Microsoft SharePoint functionality. When the SharePoint protocol is enabled in Alfresco, a Microsoft Office program can connect to Alfresco thinking it is connecting to SharePoint.

We can connect to Alfresco from, for example, MS Word as follows:

Here, we are creating a new Document Workspace that will be created as an Alfresco Share site called MyDocSite. We can then save documents directly into Alfresco without having to go through any other Alfresco user interface. This is how it looks in Alfresco Share after creating the workspace and saving a Word 2007 test document called MyTestDoc.docx:

Clicking on this site and then navigating to the document library shows the following:

Note

Unfortunately we cannot save documents anywhere we like in the Repository, only to an Alfresco Share site.

Alfresco Mobile

Alfresco comes with a special web application to support the iPhone. It presents a smaller interface adjusted for the iPhone to be able to access content in an Alfresco Share site.

The interface gives you access to Alfresco Share sites and the possibility to search, view, and edit documents in the Document Library within sites. Users can also view tasks and activities.

For more information about mobile application solutions see Chapter 14,

Alfresco CIFS

The possibility to access files in Alfresco via a shared drive is one of the major benefits with Alfresco, as most people are used to working with a shared drive already. So the transition to a content management system becomes less cumbersome when users can work in the way they have always done.

The shared drive access is probably the interface that most people use on a day-to-day basis. However, there are things that cannot easily be done when using the CIFS interface:

  • Searching for documents based on metadata

  • Entering metadata for added documents

  • Displaying error messages from rules (some rules will not even work when documents are added via this interface)

  • Workflow task management

For more information about the CIFS interface see Chapter 5,

 

The Alfresco installation directory structure


After you have installed Alfresco, you will have a directory structure that looks something like this:

|- Alfresco
   |
   |- /alf_data
   |- /amps
   |- /amps-share
   |- /bin
   |- /extras
   |- /ImageMagick
   |- /install_extension
   |- /licenses
   |- /mysql
   |- /OpenOffice.org
   |- /tomcat
   |- /virtual-tomcat
   |- alf_start.bat (Starts Apache Tomcat and Alfresco)
   |- alf_stop.bat (Stops Apache Tomcat and Alfresco)
   |- alfresco.log (Default location for Alfresco log file)
   |- virtual_start.bat (Starts another Tomcat for WCM preview)
   |- virtual_stop.bat (Stops Apache Tomcat for WCM preview)

It is important to get to know the directory structure of an Alfresco installation, so that you know where to look for certain information and can answer questions from clients.

The alf_data directory

The Alfresco data directory is the most important directory as it contains all the content files and all the content indexes of the repository, for both live content and deleted content. The location of this directory is specified in the alfresco-global.properties file located in the tomcat/shared/classes directory with the property dir.root.

This is the alf_data directory structure:

|- Alfresco
   |- /alf_data
      |- /audit.contentstore
      |- /backup-lucene-indexes
      |- /contentstore
      |- /contentstore.deleted
      |- /lucene-indexes
      |- /mysql
      |- /oouser

The contentstore directory

The contentstore directory contains all the live content files. You will notice, as you start clicking down in the directory hierarchy, that you will not recognize any of the filenames.

For example, if you upload a file called mytext.doc to a folder, you will not find any file by that name in the contentstore directory. It has been stored under a different name that looks more like a reference number.

To find a specific document you have to first go into Alfresco Node Browser and look up the item and its cm:content field. This field looks something like this:

contentUrl=store://2010/2/12/18/1/62781911-635d-4366-80dc-13d4a3e5e4fe.bin|mimetype=application/pdf|size=375530|encoding=utf-8|locale=en_GB_

And in this case the PDF file would be found under: alfresco/alf_data/contentstore/2010/2/12/18/1.

And it would be called: 62781911-635d-4366-80dc-13d4a3e5e4fe.bin (that is, UUID.bin). If you change the extension to .pdf you will be able to open it.
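
The lookup above can also be scripted. The following Python sketch parses a cm:content value like the one shown and maps the store:// URL to a path below the contentstore directory. The helper names are made up for this example, and the contentstore location is taken from the example above.

```python
from pathlib import PurePosixPath
from typing import Dict


def parse_content_field(value: str) -> Dict[str, str]:
    """Split a 'key=value|key=value' cm:content string into a dictionary."""
    return dict(part.split("=", 1) for part in value.split("|") if "=" in part)


def content_path(content_url: str, contentstore_dir: str) -> PurePosixPath:
    """Map a store:// content URL to the file below the contentstore directory."""
    prefix = "store://"
    if not content_url.startswith(prefix):
        raise ValueError(f"unexpected content URL: {content_url}")
    return PurePosixPath(contentstore_dir) / content_url[len(prefix):]


field_value = (
    "contentUrl=store://2010/2/12/18/1/62781911-635d-4366-80dc-13d4a3e5e4fe.bin"
    "|mimetype=application/pdf|size=375530|encoding=utf-8|locale=en_GB_"
)
props = parse_content_field(field_value)
path = content_path(props["contentUrl"], "alfresco/alf_data/contentstore")
```

The mimetype entry in the parsed field tells us which extension to give a copy of the .bin file to be able to open it.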

File versioning

Every version is stored in its entirety. No delta data is kept. You can look up the content store location for the master version, such as:

contentUrl=store://2010/2/16/11/18/d3278bd6-2ad0-4d85-b627-0ce757b985a8.bin

And next to it you will see the .bin files for the other versions.

The contentstore.deleted directory

Whenever you delete a file from the Alfresco repository via any of the user interfaces, it is not physically deleted from disk. Instead, it is moved to an archive store. This store can be found on disk under the contentstore.deleted directory.

If you do not see any files in the contentstore.deleted directory, then your Alfresco installation might have a custom Content Cleanup Listener configuration.

The audit.contentstore directory

Alfresco can be configured to audit all changes to metadata and content and the audit trail is stored in this directory. This directory will usually be empty in a standard installation as audit logging is not turned on by default.

The lucene-indexes and backup-lucene-indexes directories

All content in the repository is indexed via the Lucene index engine (unless you have turned off indexing for some types of content). The index files are kept in the lucene-indexes directory.

If the index ever gets corrupted or deleted, it is possible to do a full re-indexing by setting the index.recovery.mode property to FULL and restarting the system. This property can be found in the repository.properties file located in the tomcat/webapps/alfresco/WEB-INF/classes/alfresco directory. To do a full re-indexing every time you start Alfresco, even if it has been re-deployed, put this property in the alfresco-global.properties file.

Lucene indexes are backed up to the backup-lucene-indexes directory every night at 3 a.m. by a scheduled job.

The mysql directory

If Alfresco is installed from the "Full Installation" file, then MySQL can be installed at the same time. In this case, the data files for MySQL are stored in this directory.

The oouser directory

If Alfresco is installed from the "Full Installation" file, then OpenOffice.org can be installed at the same time. In this case, the user who executes the document conversions uses this directory.

The amps directories

The application modules (that is, AMPs) that are used to extend Alfresco can be found in the following two directories:

|- Alfresco
   |- /amps
   |- /amps-share

The amps directory is used for modules that extend the Alfresco Explorer web application (that is, alfresco.war) and the amps-share directory is used for modules that extend the Alfresco Share web application (that is, share.war).

After putting the AMP files in these directories, you can use the alfresco/apply_amps.bat file to install them into the WAR file.

After a full installation of Alfresco 3.3, the following AMPs can be found in the directories:

|- Alfresco3.3
   |- /amps
   |  |- alfresco-dod5015.amp (Alfresco RM for Explorer UI)
   |  |- alfresco-quickr-unsupported.amp (Lotus Integration)
   |  |- vti-module.amp (MS SharePoint Protocol)
   |- /amps-share
      |- alfresco-dod5015-share.amp (Alfresco RM for Share UI)

The tomcat directory

This is the main application server directory. When you install Alfresco from the "Full installation" file, it includes an Apache Tomcat installation that ends up in this directory.

The tomcat directory structure looks as follows:

|- Alfresco
   |- /tomcat
      |- /bin (binaries to start Tomcat)
      |- /conf (Tomcat configuration, configure SSL for example)
      |- /endorsed
      |- /lib (Tomcat libraries)
      |- /logs (Tomcat log files, except Alfresco log)
      |- /shared (Config files that survive redeployments)
      |- /temp (Temporary files such as for EHCache)
      |- /webapps (Alfresco webapps go here)
      |- /work (JSP compiled into Servlets)

The two most important directories here are the shared and the webapps directories. The shared directory is where you will do all Alfresco configuration that should survive redeployments and Alfresco upgrades. Try to always do the configuration here, because then you know it is not going to be overwritten when somebody installs a new AMP or does an upgrade.

The webapps directory is where you will find the web application WAR files for Alfresco Explorer (that is, alfresco.war), Alfresco Share (that is, share.war), and Alfresco Mobile (that is, mobile.war).

You will also use files in the bin directory when you want to install Alfresco as a Windows service.

 

Getting the Alfresco source code


If you do not yet have the Alfresco source code downloaded, it is time to do so now.

You can access the trunk from svn://svn.alfresco.com/alfresco/HEAD and it is a good idea to update it every week to see what new features are being added. More information about the Alfresco development environment can be found at http://wiki.alfresco.com/wiki/Alfresco_SVN_Development_Environment.

 

The Alfresco database


Normally, we do not have to bother about the database, but there are situations when it is necessary to be familiar with it.

In general, one should not do any CRUD operations directly against the database bypassing the foundation services when building a custom solution on top of Alfresco. This will cause the code to break in the future if the database design is ever changed. Alfresco is required to keep older APIs available for backward compatibility (if they ever change), so it is better to always use the published service APIs.

Query the database directly only when:

  • The customization built with available APIs does not provide acceptable performance and you need a solution that performs satisfactorily

  • Reporting is necessary

  • Information is needed during development for debugging purposes

  • The bootstrapping needs tweaking, such as when you want to run a patch again

Almost any RDBMS can be used, as the platform uses Hibernate as the database access layer. The EHCache library is used for level 2 caching, which improves performance.

DB schema

The Alfresco database schema cannot be found in any .sql file, as the whole database is created via Hibernate the first time Alfresco is started after installation.

To access the tables, use a tool like the SQuirreL SQL Client or just use the mysql command-line utility. The database is called alfresco by default and to access it via JDBC a URL that looks like jdbc:mysql://localhost:3306/alfresco can be used. The default username/password is alfresco/alfresco.

Significant tables

Let's take a look at some of the tables that we might come in contact with or need to query for information.

ALF_NODE

This is the parent table for node metadata and many other tables refer to it with a foreign key. Listing a couple of rows from it looks like this:

What we get is the node UUID, version, what store it is saved in, type QName, and so on. The ID of the rows is used to look up related rows in other tables, such as associated properties.

To select a particular node after looking up its node UUID (that is, from the {http://www.alfresco.org/model/system/1.0}node-uuid property) via the Alfresco Explorer Node Browser, we can execute the following query:

select * from alf_node where uuid='456822c7-8f3e-4129-9804-dfaaab54f47a'

This will give us the node ID to use for further queries.

ALF_NODE_PROPERTIES

This table contains all the properties that have been set as metadata for a particular node.

When we have the node ID, we can query for the associated properties as follows:

select * from alf_node_properties where node_id=1016

We can still see only the values for the properties not what their names are. The name of the property can be looked up via the qname_id.

Notice also that some properties like the default Created Date, Creator, Modifier, and Modified date are not listed. This is because they are part of the ALF_NODE row.

ALF_NODE_ASPECTS

This table contains all aspects that are associated with a node. When we query this table, we get the following result:

This is not very helpful as we cannot see the names of the aspects, just their QName IDs. We would have to link up with the ALF_QNAME table to see the names.

ALF_QNAME

This table contains all the QName definitions and it is referred to from lots of the other tables. For example, here is how to use it together with the ALF_NODE_ASPECTS table:

In this way, we can clearly see what aspects are associated with a node.
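
To experiment with this kind of join without touching a real Alfresco database, we can rebuild the idea in an in-memory SQLite database from Python. The two tables below are heavily simplified stand-ins that contain only the columns discussed in this section, and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified stand-ins for the real tables (assumed columns only).
    CREATE TABLE alf_qname (id INTEGER PRIMARY KEY, local_name TEXT);
    CREATE TABLE alf_node_aspects (node_id INTEGER, qname_id INTEGER);
    INSERT INTO alf_qname VALUES (1, 'titled'), (2, 'auditable'), (3, 'imapemail');
    INSERT INTO alf_node_aspects VALUES (1016, 1), (1016, 2), (1017, 3);
""")

# Resolve the QName IDs of one node's aspects to readable local names.
aspect_names = [row[0] for row in conn.execute(
    """
    SELECT q.local_name
    FROM alf_node_aspects a
    JOIN alf_qname q ON a.qname_id = q.id
    WHERE a.node_id = ?
    ORDER BY q.local_name
    """,
    (1016,),
)]
```

The same SELECT statement can be pasted into the mysql command-line utility with the node ID filled in.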

ALF_APPLIED_PATCH

This table contains information about all executed patches. It records whether they were successful or not, together with any error messages:

Example queries and update statements

The following queries are examples that have been used in real content management deployments.

Querying for number of nodes of a certain type

Let's say you wanted to find out how many e-mails have been stored in the repository so far. Then you could do that with the following query:

SELECT count(*) from ALF_NODE n, ALF_NODE_ASPECTS a, ALF_QNAME q where n.ID = a.NODE_ID and a.QNAME_ID=q.ID and q.LOCAL_NAME='imapemail';

This query will search for all content nodes that have the aspect imapemail applied and count them.

Querying for number of nodes stored in a particular month

If you wanted to build on the previous query and find out how many e-mail nodes have been stored in a particular month, you could do that as follows:

SELECT count(*) from ALF_NODE n, ALF_NODE_ASPECTS a, ALF_QNAME q where n.ID = a.NODE_ID and a.QNAME_ID=q.ID and q.LOCAL_NAME='imapemail' and n.audit_created like '2010-01%';

This will query for the number of e-mail nodes that have been stored in January 2010.
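
Both counting queries can be tried out safely against an in-memory SQLite database. In the following Python sketch the tables are reduced to the columns used by the queries, and the rows and timestamps are invented test data, so the counts only say something about this sample.

```python
import sqlite3
from typing import Optional

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified stand-ins for the real tables (assumed columns only).
    CREATE TABLE alf_node (id INTEGER PRIMARY KEY, audit_created TEXT);
    CREATE TABLE alf_qname (id INTEGER PRIMARY KEY, local_name TEXT);
    CREATE TABLE alf_node_aspects (node_id INTEGER, qname_id INTEGER);
    INSERT INTO alf_qname VALUES (1, 'imapemail'), (2, 'titled');
    INSERT INTO alf_node VALUES
        (1, '2010-01-05T09:00:00'), (2, '2010-01-20T14:30:00'),
        (3, '2010-02-01T08:15:00'), (4, '2010-01-11T11:00:00');
    INSERT INTO alf_node_aspects VALUES (1, 1), (2, 1), (3, 1), (4, 2);
""")


def count_nodes_with_aspect(db: sqlite3.Connection, aspect: str,
                            month: Optional[str] = None) -> int:
    """Count nodes with the given aspect, optionally created in 'YYYY-MM'."""
    sql = """
        SELECT count(*)
        FROM alf_node n
        JOIN alf_node_aspects a ON n.id = a.node_id
        JOIN alf_qname q ON a.qname_id = q.id
        WHERE q.local_name = ?
    """
    params = [aspect]
    if month is not None:
        sql += " AND n.audit_created LIKE ?"
        params.append(month + "%")
    return db.execute(sql, params).fetchone()[0]


total_emails = count_nodes_with_aspect(conn, "imapemail")
january_emails = count_nodes_with_aspect(conn, "imapemail", "2010-01")
```

Passing the aspect name and month as query parameters, instead of building the SQL string by hand, keeps the helper safe to reuse with other values.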

Running a patch again

If you, for some reason, wanted to run a patch again, you can do that as follows:

UPDATE ALF_APPLIED_PATCH SET WAS_EXECUTED=0, SUCCEEDED=0 WHERE ID='patch.applyImapMailboxAspect';

This marks the patch as not executed and not successful, so the next time we start Alfresco, it will execute the patch again.
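
Before running an update like this against a real database, it can be rehearsed on a copy or, as in this Python sketch, on an in-memory SQLite table. The table is a simplified stand-in with only the three columns used in the statement, and patch.someOtherPatch is an invented ID that shows that other rows are left alone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified stand-in for ALF_APPLIED_PATCH (assumed columns only).
    CREATE TABLE alf_applied_patch (
        id TEXT PRIMARY KEY, was_executed INTEGER, succeeded INTEGER);
    INSERT INTO alf_applied_patch VALUES
        ('patch.applyImapMailboxAspect', 1, 1),
        ('patch.someOtherPatch', 1, 1);
""")


def mark_patch_for_rerun(db: sqlite3.Connection, patch_id: str) -> int:
    """Flag one patch as not executed; returns the number of rows changed."""
    with db:  # commits on success, rolls back if the statement fails
        cursor = db.execute(
            "UPDATE alf_applied_patch SET was_executed = 0, succeeded = 0 "
            "WHERE id = ?",
            (patch_id,),
        )
    return cursor.rowcount


updated = mark_patch_for_rerun(conn, "patch.applyImapMailboxAspect")
pending = [row[0] for row in conn.execute(
    "SELECT id FROM alf_applied_patch WHERE succeeded = 0")]
```

Checking the returned row count is a cheap safeguard: a value of 0 means the patch ID was misspelled and nothing was changed.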

 

Summary


This chapter has taken us through most of the features of the Alfresco platform and by now we should be familiar with repository concepts such as stores, nodes, associations, aspects, and types. Everything in the Alfresco repository is represented as a node. If we have business rules that should apply to nodes in the repository, then these can be enforced by implementing the so-called content rules.

Document properties can be automatically extracted with Metadata extractors while the document is being added to the repository. When a document is uploaded to the repository, content transformers can be configured to convert between different formats, such as from an MS Word document to a PDF. Transformers are also used to convert any non-text format to text, so that content can be easily indexed by the Lucene search engine.

If we wanted to send an e-mail every time a document was added anywhere in the repository, then we could use the internal event management system and define a policy for the "add content" event.

Alfresco comes with quite a few user interfaces out of the box, where the traditional JSF based one is called Alfresco Explorer and offers management screens for all features available in Alfresco. This user interface will be replaced by a newer one called Alfresco Share, which is currently the most developed UI. Alfresco CIFS is used to emulate a shared drive for easy migration of users into using Alfresco.

Alfresco uses subsystems for functionality that requires many different implementations, such as authentication, or for optional functionality like support for IMAP. The Alfresco system can be bootstrapped with either patches or importers. Patches are used extensively by Alfresco to do, for example, database upgrades.

We also had a look at the directory structure of an Alfresco installation and one of the most important folders is the alf_data folder that contains the content and the Lucene index. The Alfresco web application is contained under the alfresco/tomcat/webapps/alfresco folder in an installation.

Finally, we examined some of the more important database tables and went through a couple of SQL query examples. This should give us more confidence in how things work together, and we can also create reports directly against the database if necessary.

Now you are probably eager to get going and do some coding. The next chapter introduces the application programming interfaces that can be used to access the repository. For each API, there will be a number of code examples that you can dig into.

About the Author

  • Martin Bergljung

    Martin Bergljung is a Principal ECM Architect at Ixxus, a UK Platinum Alfresco partner. He has over 25 years of experience in the IT sector, where he has worked with the Java platform since 1997.

    Martin began working with Alfresco in 2007, developing an e-mail management extension for Alfresco called OpsMailmanager. In 2009, he started working on Alfresco consulting projects and has worked with customers such as Pearson, World Wildlife Fund, International Financial Data Services, NHS, VHI Healthcare, Virgin Money, Unibet, BPN Paribas, University of Westminster, Aker Oilfield Services, and Amnesty International.

    He is a frequent speaker and has delivered talks at Alfresco conferences in London, Berlin, and Barcelona. He is also the author of Alfresco 3 Business Solutions, Packt Publishing.
