Packt+ | Advance your knowledge in tech

You're reading from Managing Multimedia and Unstructured Data in the Oracle Database

Product type Book

Published in Mar 2013

Publisher Packt

ISBN-13 9781849686921

Pages 504 pages

Edition 1st Edition

Languages

Concepts

Database Administration

Author (1):

MARCEL KRATOCHVIL

Table of Contents (22) Chapters

Managing Multimedia and Unstructured Data in the Oracle Database

Credits

About the Author

Acknowledgement

About the Reviewers

www.PacktPub.com

Preface

What is Unstructured Data?

Understanding Digital Objects

The Multimedia Warehouse

Searching the Multimedia Warehouse

Loading Techniques

Delivery Techniques

Techniques for Creating a Multimedia Database

Tuning

Understanding the Limitations of Oracle Products

Working with the Operating System

The Circa Data Type

Multimedia Case Studies

Proactive Database Tuning

Chapter References

Index

Chapter 5. Loading Techniques

Loading in digital objects can be as simple as just taking them from a directory and storing them in the database. In most businesses, the loading gets more complicated as there is a need to attach metadata to the digital object, to process and transform it, and to even verify and validate the metadata being transferred into the digital object. This chapter covers loading methods, the issues encountered in loading, as well as strategies for attaching metadata to digital objects.

Loading methods

In Oracle, the following are utilities or programming environments available for loading digital objects into the database:

SQL*Loader
PL/SQL calling Oracle Multimedia
PL/SQL calling dbms_lob package
Java
Oracle Spatial
XML DB
External language (some examples include C and PHP)
Oracle Database File System
External application (Photoshop, Word, which uses an API call to save straight to the database)

Though these methods are heavily programmatic, they rely on the notion that the digital objects already exist in the file system. There are potentially other locations, which could be considered as sources for loading into the database:

FTP (and sFTP)
E-mail (as attachments or embedded in the e-mail)
Other databases
Websites (HTTP)
Web services (Multipart MIME)
Embedded in Powerpoint or documents

The digital objects might come in a variety of formats, which require processing before they can be loaded into the database. Some include:

ZIP (Gzip)
RAR
TAR
Encrypted

Finding the images

The situation most business find is that their digital objects are sitting on one or more drives, and they need to find them all. They might want to sort out the good images from the bad ones before loading. If they were copied over from a Apple Macintosh computer, then the Mac header information will be copied as another file, in effect creating a ghost version of it.

If the Mac file is called myimage.jpg, then when transferred to Window or Unix, the file .myimage.jpg will also come across. This file is not a digital image but contains Apple Macintosh's specific-header information. It needs to be ignored.

A business might also want to ignore digital objects below a certain size or ignore those of a certain type. The challenge is sorting out the wheat from the chaff.

Unfortunately, Oracle does not provide any database utilities that can help the business achieve this. In fact, to get a directory listing of a file system it should have one of these options:

Shell out to the...

Loading method

There is no right way to load a digital object into the database. Each business will have the digital objects stored in their own structure and the challenge is to work out where all the data is that relates to them. The following describes some of the situations that most businesses will encounter when they try to determine the loading methodology to use that best meets their business requirement.

Metadata matches to digital object

The scenario is that not all digital objects yet exist, some need to be scanned in, and some need to be taken by a photographer. There is thorough metadata for all the digital objects. This is a scenario found in museums or other organizations, where they manage collections. In this case, the following rule applies:

Metadata: Master list of all objects
Digital object is attached to metadata

The metadata can exist in another database, in XML format, or CSV format. It is loaded in first, effectively becoming a relational record (as covered in Chapter...

Matching existing data to images

When loading multiple digital objects into the database, the challenge most organizations will face is how to match similar objects (combine them into one), to match metadata to the digital object, or to match the digital object to the metadata.

The standard approach is to use the filename to match. If a digital object resides in the filesystem, then it will have a filename. For Windows and Unix, the name will have an extension (.jpg, .doc, .mov, .tif) indicating its type (file suffix). On a Mac, as already covered, the file type is stored in a separate file.

The file type is very important, as it determines how the operating system should treat the digital object. If an attempt is made to open it, the file type will match to an application in the computer, which then knows how to handle it. When using a web browser, the equivalent of the file type is mimetype, which is embedded in the header of the digital object as it's downloaded. The browser then has a...

Data cleansing

As covered, there are two key steps when loading a digital object. They are to load the digital object in and to match existing metadata to it. The ordering can be done either way and the match of the existing metadata is optional.

When the digital object is loaded, it needs to be processed. This includes creating derivatives as well as watermarking or general image cleanup (cropping, sharpening, adjusting, censoring).

Once the meta is attached to the digital object, it might need to be cleansed. The concept is similar to what happens in a data warehouse. Some basic cleansing processes include:

Converting varchar (sets of characters) dates into proper dates. A date might be stored in a varchar field. The dates might be of a mixed format, such as 12th Jan 2010, 10/12/08, Jan 15th 2000. They need to be translated into a standard date format.
Converting varchar numbers into numbers.
Ensuring there are no orphaned relationships (all keys storing relationships match to valid digital...

Loading decisions

When loading digital objects, the following issues need to be addressed as part of the load process:

Where are the digital objects found?
Can the database access them? (security)
Is there sufficient storage to handle them?
If archiving is enabled, is there sufficient storage for the archives?
Will there be different types of digital objects (audio, video, images, documents)?
For each type, what are the processing rules? (watermark, cropping, rotation)
For each type, what derivatives are needed?
For the processing, which server should it be performed?
If the digital objects are loaded singularly, how long will the load take to run? Will it run in a realistic time frame?
Should the load be broken up into separate tasks, which can be run independently and in parallel?
Are the digital objects going to be stored in the database or kept in the filesystem?

Depending on what the objectives are for having the digital object determines the answers to the previous questions. This, in turn, impacts...

Loading step-by-step

When loading objects in, from the filesystem, the following are the steps required to load that digital objects in:

Find the images: This involves looking across one or more file systems in multiple directories. It might involve looking for them in an FTP location or on remote websites.
Filter the results: This involves removing redundant images based on filename, suffix, and even directory location.
Extract metadata: Used when matching the metadata embedded in a digital object with an existing metadata. This is an optional step and can be done later. It might need to be done at this point in case the digital object needs to be rejected if there is no match, or if there is no match to be passed to another area for separate processing. In which case, rather than extracting all the metadata, only the meta value needed to match to the key is extracted.
Matching, combining, or replacing: If the previous step is not performed, then based on the image filename, can information...

Summary

Loading digital objects into the database can involve a number of process steps. Based on the type of business, the steps might be as simple as just loading the images in. It might grow to include matching metadata to digital objects or digital objects to metadata. The challenge is working out a method for correctly matching the digital object to the metadata. Loading can be done in bulk with the challenge here being to locate where all the digital objects are, as they could be spread across different disk systems or even be located on websites, ftp sites, or e-mail servers. In some cases, a workflow process might be needed to ensure the digital object has the correct metadata attached to it.

Exercises

These questions are designed to have the reader go beyond the traditional method of answering questions. They involve using the concepts designed in the chapter and doing additional research on the Internet to come up with the best solution to address the questions raised:

This chapter has covered a number of locations where digital objects can be found. This includes the filesystem, mail server, ftp, and via HTTP on websites. Identify another type of location and determine if the digital objects found there could be loaded into a database.
Define a filename syntax (EBNF or Railroad) that can be used to match a digital object to its metadata. It should be able to deal with the variations of names, relationships, digital object types, and can also include instructions on loading, such as this is the master digital object, or the sRBG colorspace should be used when processed. Keep in mind the maximum file length of 255 characters.
Determine a loading strategy for parallel loading for...