Data | 0 articles | Tech News, Tutorials & Expert Insights

article-image-managing-ibm-cognos-bi-server-components

12 Dec 2013

6 min read

Managing IBM Cognos BI Server Components

12 Dec 2013

(for more resources related to this topic, see here.) Cognos BI architecture The IBM Cognos 10.2 BI architecture is separated into the following three tiers: Web server (gateways) Applications (dispatcher and Content Manager) Data (reporting/querying the database, content store, metric store) Web server (gateways) The user starts a web session with Cognos to connect to the IBM Cognos Connection's web-based interface/application using the web browser (Internet Explorer and Mozilla Firefox are the currently supported browsers). This web request is sent to the web server where the Cognos gateway resides. The gateway is a server-software program that works as an intermediate party between the web server and other servers, such as an application server. The following diagram shows the basic view of the three tiers of the Cognos BI architecture: The Cognos gateway is the starting point from where a request is received and transferred to the BI Server. On receiving a request from the web server, the Cognos gateway applies encryption to the information received, adds necessary environment variables and authentication namespace, and transfers the information to the application server (or dispatcher). Similarly, when the data has been processed and the presentation is ready, it is rendered towards the user's browser via the gateway and web server. The following diagram shows the Tier 1 layer in detail: The gateways must be configured to communicate with the application component (dispatcher) in a distributed environment. To make a failover cluster, more than one BI Server may be configured. The following types of web gateways are supported: CGI: This is also the default gateway. This is a basic gateway. ISAPI: This is for the Windows environment. It is the best for Windows IIS (Internet Information Services). Servlet: This gateway is the best for application servers that are supporting servlets. Apache_mod: This gateway type may be used for the Apache server. The following diagram shows an environment in which the web server is load balanced by two server machines: To improve performance, gateways (if more than one) must be installed and configured on separate machines. The application tier (Cognos BI Server) The application tier comprises one or multiple BI Servers. A server's job is to run user requests, for example, queries, reports, and analysis that are received from a gateway. The GUI environment (IBM Cognos Connection) that appears after logging in is also rendered and presented by Cognos BI Server. Another such example is the Metric Studio interface. The BI Server must include the dispatcher and Content Manager (the Content Manager component may be separated from the dispatcher). The following diagram shows BI Server's Tier 2 in detail: Dispatcher The dispatcher has static handlers to many services. Each request that is received is routed to the corresponding service for further processing. The dispatcher is also responsible for starting all the Cognos services at startup. These services include the system service, report service, report data service, presentation service, Metric Studio service, log service, job service, event management service, Content Manager service, batch report service, delivery service, and many others. When there are multiple dispatchers in a multitier architecture, a dispatcher may also send and route requests to another dispatcher. The URIs for all dispatchers must be known to the Cognos gateway(s). All dispatchers are registered in Content Manager (CM), making it possible for all dispatchers to know each other. A dispatcher grid is formed in this way. To improve the system performance, multiple dispatchers must be installed but on separate computers, and the Content Manager component must also be on a separate server. The following diagram shows how multiple dispatcher servers can be added. Services for the BI Server (dispatcher) Each dispatcher has a set of services, which are listed alphabetically in the following table. When the Cognos service is started from Cognos Configuration, all services are started one by one. The following table shows the dispatcher services and their short descriptions: Service Description Agent service Runs the agent. Annotation service Adds comments to reports. Batch report service Handles background report requests. Content manager cache service Handles cache for frequent queries to enhance performance of Content Manager. Content manager service Performs DML in content store db. Cognos deployment is another task for this service. Delivery service For sending e-mails. Event management service Manages the Event Objects (creation, scheduling, and so on) Graphics service Renders graphics for other services such as report service. Human task service Manages human tasks. Index data service For basic full-text functions for storage and retrieval of terms and indexed summary documents. Index search service For search and drill-through functions, including lists of aliases and examples. Index update service For write, update, delete, and administration-related functions. Job service Runs jobs in coordination with the monitor service. Log service For extensive logging of the Cognos environment (file, database, remote-log server, event viewer, and system log). Metadata service For displaying data lineage information (data source, calculation expressions) for the Cognos studios and viewer. Metric studio service This service is used for providing a user interface to metric studio for monitoring and manipulating system KPIs. Migration service Used for migration from old versions to new versions, especially series 7. Monitor service Works as a timer service-it manages the monitoring and running of tasks that were scheduled or marked as background tasks. Helps in failover and recovery for running tasks. Presentation service This service prepares and displays the presentation layer by converting the XML data to HTML or any other format view. IBM Cognos Connection is also prepared by this service. Query service For managing dynamic query requests. Report data service This service prepares data for other applications; for example mobile, Microsoft Office, and so on. Report service Manages report requests. The output is displayed in IBM Cognos Connection. System service This service defines the BI-Bus API compliant service. It gives more data about the BI configuration parameters. Summary This article covered the IBM Cognos BI architecture. Now you must be familiar with the single tier and multitier architectures and a variety of features and options that Cognos provides. resources for article: further resources on this subject: IBM Cognos Insight [article] Integrating IBM Cognos TM1 with IBM Cognos 8 BI [article] IBM Cognos 10 BI dashboarding components [article]

0
0
4850

Packt

19 Feb 2016

9 min read

What is logistic regression?

Packt

19 Feb 2016

9 min read

0
0
4847

Robi Sen

16 Apr 2015

4 min read

Text Mining with R: Part 2

Robi Sen

16 Apr 2015

4 min read

In Part 1, we covered the basics of doing text mining in R by selecting data, preparing it, cleaning, then performing various operations on it to visualize that data. In this post we look at a simple use case showing how we can derive real meaning and value from a visualization by seeing how a simple word cloud and help you understand the impact of an advertisement. Building the document matrix A common technique in text mining is using a matrix of documents terms called a document term matrix. A document term matrix is simply a matrix where columns are terms and rows are documents that contain the occurrence of specific terms within the document. Or if you reverse the order and have terms as rows and documents as columns it’s called a term document matrix. For example let’s say we have two documents D 1 and D2. For example let’s say we have the documents: D1 = "I like cats" D2 = "I hate cats" Then the document term matrix would look like: I like hate cats D1 1 1 0 1 D2 1 0 1 1 For our project to make a Document term matrix in R all you need to do is use the DocumentTermMatrix() like this: tdm <- DocumentTermMatrix(mycorpus) You can see information on your document term matrix by using print like: print(tdm) <<DocumentTermMatrix (documents: 4688, terms: 18363)>> Non-/sparse entries: 44400/86041344 Sparsity : 100% Maximal term length: 65 Weighting : term frequency (tf) Next because we need to sum up all the values in each term column so that we can drive the frequency of each term occurrence. We also want to sort those values from highest to lowest. You can use this code: m <- as.matrix(tdm) v <- sort(colSums(m),decreasing=TRUE) Next we will use the names() to pull the each term object’s name which in our case is a word. Then we want to build a dataframe from our words associated with their frequency of occurrences. Finally we want to create our word cloud but remove any terms that have an occurrence of less than 45 times to reduce clutter in our wordcloud. You could also use max.words to limit the total number of words in your word cloud. So your final code should look like this: words <- names(v) d <- data.frame(word=words, freq=v) wordcloud(d$word,d$freq,min.freq=45) If you run this in R studio you should see something like the figure which shows the words with highest occurrence in our corpus. The wordcloud object automatically scales the drawn words by the size of their frequency value. From here you can do a lot with your word cloud including change the scale, associate color to various values, and much more. You can read more about wordcloud here. While word clouds are often used on the web for things like blogs, news sites, and other similar use cases they have real value for data analysis beyond just visual indicators for users to find terms of interest. For example if you look at the word cloud we generated you will notice that one of the most popular terms mentioned in tweets is chocolate. Doing a short inspection of our CSV document for the term chocolate we find a lot of people mentioning the word in a variety of contexts but one of the most common is in relationship to a specific super bowl add. For example here is a tweet: Alexalabesky 41673.39 Chocolate chips and peanut butter 0 0 0 Unknown Unknown Unknown Unknown Unknown This appeared after the airing of this advertisement from Butterfinger. So even with this simple R code we can generate real meaning from social media which is the measurable impact of an advertisement during the Super Bowl. Summary In this post we looked at a simple use case showing how we can derive real meaning and value from a visualization by seeing how a simple word cloud and help you understand the impact of an advertisement. About the author Robi Sen, CSO at Department 13, is an experienced inventor, serial entrepreneur, and futurist whose dynamic twenty-plus year career in technology, engineering, and research has led him to work on cutting edge projects for DARPA, TSWG, SOCOM, RRTO, NASA, DOE, and the DOD. Robi also has extensive experience in the commercial space, including the co-creation of several successful start-up companies. He has worked with companies such as UnderArmour, Sony, CISCO, IBM, and many others to help build out new products and services. Robi specializes in bringing his unique vision and thought process to difficult and complex problems allowing companies and organizations to find innovative solutions that they can rapidly operationalize or go to market with.

0
0
4816

article-image-cluster-computing-using-scala

Packt

13 Apr 2016

18 min read

Cluster Computing Using Scala

Packt

13 Apr 2016

18 min read

In this article by Vytautas Jančauskas the author of the book Scientific Computing with Scala, explains the way of writing software to be run on distributed computing clusters. We will learn the MPJ Express library here. (For more resources related to this topic, see here.) Very often when dealing with intense data processing tasks and simulations of physical phenomena, there comes a time when no matter how many CPU cores and memory your workstation has, it is not enough. At times like these, you will want to turn to supercomputing clusters for help. These distributed computing environments consist of many nodes (each node being a separate computer) connected into a computer network using specialized high bandwidth and low latency connections (or if you are on a budget standard Ethernet hardware is often enough). These computers usually utilize a network filesystem allowing each node to see the same files. They communicate using messaging libraries, such as MPI. Your program will run on separate computers and utilize the message passing framework to exchange data via the computer network. Using MPJ Express for distributed computing MPJ Express is a message passing library for distributed computing. It works in programming languages using Java Virtual Machine (JVM). So, we can use it from Scala. It is similar in functionality and programming interface to MPI. If you know MPI, you will be able to use MPJ Express pretty much the same way. The differences specific to Scala are explained in this section. We will start with how to install it. For further reference, visit the MPJ Express website given here: http://mpj-express.org/ Setting up and running MPJ Express The steps to set up and run MPJ Express are as follows: First, download MPJ Express from the following link. The version at the time of this writing is 0.44.http://mpj-express.org/download.php Unpack the archive and refer to the included README file for installation instructions. Currently, you have to set MPJ_HOME to the folder you unpacked the archive to and add the bin folder in that archive to your path. For example, if you are a Linux user using bash as your shell, you can add the following two lines to your .bashrc file (the file is in your home directory at /home/yourusername/.bashrc): export MPJ_HOME=/home/yourusername/mpj export PATH=$MPJ_HOME/bin:$PATH Here, mpj is the folder you extracted the archive you downloaded from the MPJ Express website to. If you are using a different system, you will have to do the equivalent of the above for your system to use MPJ Express. We will want to use MPJ Express with Scala Build Tool (SBT), which we used previously to build and run all of our programs. Create the following directory structure: scalacluster/ lib/ project/ plugins.sbt build.sbt I have chosen to name the project folder asscalacluster here, but you can call it whatever you want. The .jar files in the lib folder will be accessible to your program now. Copy the contents of the lib folder from the mpj directory to this folder. Finally, create an empty build.sbt and plugins.sbt files. Let’s now write and run a simple "Hello, World!" program to test our setup: import mpi._ object MPJTest { def main(args: Array[String]) { MPI.Init(args) val me: Int = MPI.COMM_WORLD.Rank val size: Int = MPI.COMM_WORLD.Size println("Hello, World, I'm <" + me + ">") MPI.Finalize() } } This should be familiar to everyone who has ever used MPI. First, we import everything from the mpj package. Then, we initialize MPJ Express by calling MPI.Initialize, the arguments to MPJ Express will be passed from the command-line arguments you will enter when running the program. The MPI.COMM_WORLD.Rank() function returns the MPJ processes rank. A rank is a unique identifier used to distinguish processes from one another. They are used when you want different processes to do different things. A common pattern is to use the process with rank 0 as the master process and the processes with other ranks as workers. Then, you can use the processes rank to decide what action to take in the program. We also determine how many MPJ processes were launched by checking MPI.COMM_WORLD.Size. Our program will simply print a processes rank for now. We will want to run it. If you don't have a distributed computing cluster readily available, don't worry. You can test your programs locally on your desktop or laptop. The same program will work without changes on clusters as well. To run programs written using MPJ Express, you have to use the mpjrun.sh script. This script will be available to you if you have added the bin folder of the MPJ Express archive to your PATH as described in the section on installing MPJ Express. The mpjrun.sh script will setup the environment for your MPJ Express processes and start said processes. The mpjrun.sh script takes a .jar file, so we need to create one. Unfortunately for us, this cannot easily be done using the sbt package command in the directory containing our program. This worked previously, because we used Scala runtime to execute our programs. MPJ Express uses Java. The problem is that the .jar package created with sbt package does not include Scala's standard library. We need what is called a fat .jar—one that contains all the dependencies within itself. One way of generating it is to use a plugin for SBT called sbt-assembly. The website for this plugin is given here: https://github.com/sbt/sbt-assembly There is a simple way of adding the plugin for use in our project. Remember that project/plugins.sbt file we created? All you need to do is add the following line to it (the line may be different for different versions of the plugin. Consult the website): addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.1") Now, add the following to the build.sbt file you created: lazy val root = (project in file(".")). settings( name := "mpjtest", version := "1.0", scalaVersion := "2.11.7" ) Then, execute the sbt assembly command from the shell to build the .jar file. The file will be put under the following directory if you are using the preceding build.sbt file. That is, if the folder you put the program and build.sbt in is /home/you/cluster: /home/you/cluster/target/scala-2.11/mpjtest-assembly- 1.0.jar Now, you can run the mpjtest-assembly-1.0.jar file as follows: $ mpjrun.sh -np 4 -jar target/scala-2.11/mpjtest-assembly-1.0.jar MPJ Express (0.44) is started in the multicore configuration Hello, World, I'm <0> Hello, World, I'm <2> Hello, World, I'm <3> Hello, World, I'm <1> Argument -np specifies how many processes to run. Since we specified -np 4, four processes will be started by the script. The order of the "Hello, World" messages can differ on your system since the precise order of execution of different processes is undetermined. If you got the output similar to the one shown here, then congratulations, you have done the majority of the work needed to write and deploy applications using MPJ Express. Using Send and Recv MPJ Express processes can communicate using Send and Recv. These methods constitute arguably the simplest and easiest to understand mode of operation that is also probably the most error prone. We will look at these two first. The following are the signatures for the Send and Recv methods: public void Send(java.lang.Object buf, int offset, int count, Datatype datatype, int dest, int tag) throws MPIException public Status Recv(java.lang.Object buf, int offset, int count, Datatype datatype, int source, int tag) throws MPIException Both of these calls are blocking. This means that after calling Send, your process will block (will not execute the instructions following it) until a corresponding Recv is called by another process. Also Recv will block the process, until a corresponding Send happens. By corresponding, we mean that the dest and source arguments of the calls have the values corresponding to receivers and senders ranks, respectively. The two calls will be enough to implement many complicated communication patterns. However, they are prone to various problems such as deadlocks. Also, they are quite difficult to debug, since you have to make sure that each Send has the correct corresponding Recv and vice versa. The parameters for Send and Recv are basically the same. The meanings of those parameters are summarized in the following table: Argument Type Description Buf java.lang.Object It has to be a one-dimensional Java array. When using from Scala, use the Scala array, which is a one-to-one mapping to a Java array. offset int The start of the data you want to pass from the start of the array. Count int This shows the number items of the array you want to pass. datatype Datatype The type of data in the array. Can be one of the following: MPI.BYTE, MPI.CHAR, MPI.SHORT, MPI.BOOLEAN, MPI.INT, MPI.LONG, MPI.FLOAT, MPI.DOUBLE, MPI.OBJECT, MPI.LB, MPI.UB, and MPI.PACKED. dest/source int Either the destination to send the message to or the source to get the message from. You use the rank of the process to identify sources and destinations. tag int Used to tag the message. Can be used to introduce different message types. Can be ignored for most common applications. Let’s look at a simple program using these calls for communication. We will implement a simple master/worker communication pattern: import mpi._ import scala.util.Random object MPJTest { def main(args: Array[String]) { MPI.Init(args) val me: Int = MPI.COMM_WORLD.Rank() val size: Int = MPI.COMM_WORLD.Size() if (me == 0) { Here, we use an if statement to identify who we are based on our rank. Since each process gets a unique rank, this allows us to determine what action should be taken. In our case, we assigned the role of the master to the process with rank 0 and the role of a worker to processes with other ranks: for (i <- 1 until size) { val buf = Array(Random.nextInt(100)) MPI.COMM_WORLD.Send(buf, 0, 1, MPI.INT, i, 0) println("MASTER: Dear <" + i + "> please do work on " + buf(0)) } We iterate over workers, who have the ranks from 1 to whatever is the argument for number of processes you passed to the mpjrun.sh script. Let’s say that number is four. This gives us one master process and three worker processes. So, each process with a rank from 1 to 3 will get a randomly generated number. We have to put that number in an array even though it is a single number. This is because both Send and Recv methods expect an array as their first argument. We then use the Send method to send the data. We specified the array as argument buf, offset of 0, size of 1, type MPI.INT, destination as the for loop index, and tag as 0. This means that each of our three worker processes will receive a (most probably) different number: for (i <- 1 until size) { val buf = Array(0) MPI.COMM_WORLD.Recv(buf, 0, 1, MPI.INT, i, 0) println("MASTER: Dear <" + i + "> thanks for the reply, which was " + buf(0)) } Finally, we collect the results from the workers. For this, we iterate over the worker ranks and use the Recv method on each one of them. We print the result we got from the worker, and this concludes the master's part. We now move on to the workers: } else { val buf = Array(0) MPI.COMM_WORLD.Recv(buf, 0, 1, MPI.INT, 0, 0) println("<" + me + ">: " + "Understood, doing work on " + buf(0)) buf(0) = buf(0) * buf(0) MPI.COMM_WORLD.Send(buf, 0, 1, MPI.INT, 0, 0) println("<" + me + ">: " + "Reporting back") } The workers code is identical for all of them. They receive a message from the master, calculate the square of it, and send it back: MPI.Finalize() } } After you run the program, the results should be akin to the following, which I got when running this program on my system: MASTER: Dear <1> please do work on 71 MASTER: Dear <2> please do work on 12 MASTER: Dear <3> please do work on 55 <1>: Understood, doing work on 71 <1>: Reported back MASTER: Dear <1> thanks for the reply, which was 5041 <3>: Understood, doing work on 55 <2>: Understood, doing work on 12 <2>: Reported back MASTER: Dear <2> thanks for the reply, which was 144 MASTER: Dear <3> thanks for the reply, which was 3025 <3>: Reported back Sending Scala objects in MPJ Express messages Sometimes, the types provided by MPJ Express for use in the Send and Recv methods are not enough. You may want to send your MPJ Express processes a Scala object. A very realistic example of this would be to send an instance of a Scala case class. These can be used to construct more complicated data types consisting of several different basic types. A simple example is a two-dimensional vector consisting of x and y coordinates. This can be sent as a simple array, but more complicated classes can't. For example, you may want to use a case class as the one shown here. It has two attributes of type String and one attribute of type Int. So what do we do with a data type like this? The simplest answer to that problem is to serialize it. Serializing converts an object to a stream of characters or a string that can be sent over the network (or stored to a file or done other things with) and later on deserialized to get the original object back: scala> case class Person(name: String, surname: String, age: Int) defined class Person scala> val a = Person("Name", "Surname", 25) a: Person = Person(Name,Surname,25) A simple way of serializing is to use a format such as XML or JSON. This can be done automatically using a pickling library. Pickling is a term that comes from the Python programming language. It is the automatic conversion of an arbitrary object into a string representation that can later be de-converted to get the original object back. The reconstructed object will behave the same way as it did before conversion. This allows one to store arbitrary objects to files for example. There is a pickling library available for Scala as well. You can of course do serialization in several different ways (for example, using the powerful support for XML available in Scala). We will use the pickling library that is available from the following website for this example: https://github.com/scala/pickling You can install it by adding the following line to your build.sbt file: libraryDependencies += "org.scala-lang.modules" %% "scala- pickling" % "0.10.1" After doing that, use the following import statements to enable easy pickling in your projects: scala> import scala.pickling.Defaults._ import scala.pickling.Defaults._ scala> import scala.pickling.json._ import scala.pickling.json._ Here, you can see how you can then easily use this library to pickle and unpickle arbitrary objects without the use of annoying boiler plate code: scala> val pklA = a.pickle pklA: pickling.json.pickleFormat.PickleType = JSONPickle({ "$type": "Person", "name": "Name", "surname": "Surname", "age": 25 }) scala> val unpklA = pklA.unpickle[Person] unpklA: Person = Person(Name,Surname,25) Let’s see how this would work in an application using MPJ Express for message passing. A program using pickling to send a case class instance in a message is given here: import mpi._ import scala.pickling.Defaults._ import scala.pickling.json._ case class ArbitraryObject(a: Array[Double], b: Array[Int], c: String) Here, we have chosen to define a fairly complex case class, consisting of two arrays of different types and a string: object MPJTest { def main(args: Array[String]) { MPI.Init(args) val me: Int = MPI.COMM_WORLD.Rank() val size: Int = MPI.COMM_WORLD.Size() if (me == 0) { val obj = ArbitraryObject(Array(1.0, 2.0, 3.0), Array(1, 2, 3), "Hello") val pkl = obj.pickle.value.toCharArray MPI.COMM_WORLD.Send(pkl, 0, pkl.size, MPI.CHAR, 1, 0) In the preceding bit of code, we create an instance of our case class. We then pickle it to JSON and get the string representation of said JSON with the value method. However, to send it in an MPJ message, we need to convert it to a one-dimensional array of one of the supported types. Since it is a string, we convert it to a char array. This is done using the toCharArray method: } else if (me == 1) { val buf = new Array[Char](1000) MPI.COMM_WORLD.Recv(buf, 0, 1000, MPI.CHAR, 0, 0) val msg = buf.mkString val obj = msg.unpickle[ArbitraryObject] On the receiving end, we get the raw char array, convert it back to string using mkString method, and then unpickle it using unpickle[T]. This will return an instance of the case class that we can use as any other instance of a case class. It is in its functionality the same object that was sent to us: println(msg) println(obj.c) } MPI.Finalize() } } The following is the result of running the preceding program. It prints out the JSON representation of our object, and also show that we can access the attributes of said object by printing the c attribute. MPJ Express (0.44) is started in the multicore configuration: { "$type": "ArbitraryObject", "a": [ 1.0, 2.0, 3.0 ], "b": [ 1, 2, 3 ], "c": "Hello" } Hello You can use this method to send arbitrary objects in an MPJ Express message. However, this is just one of many ways of doing this. As mentioned previously, an example of another way is to use the XML representation. XML support is strong in Scala, and you can use it to serialize objects as well. This will usually require you to add some boiler plate code to your program to serialize to XML. The method discussed earlier has the advantage of requiring no boiler plate code. Non-blocking communication So far, we examined only blocking (or synchronous) communication between two processes. This means that the process is blocked (halted their execution) until the Send or Recv methods have been completed successfully. This is simple to understand and enough for most cases. The problem with synchronous communication is that you have to be very careful otherwise deadlocks may occur. Deadlocks are situations when processes wait for each other to release a resource first. Mexican standoff including the dining philosophers problem is one of the famous example of Deadlock in Operating System. The point is that if you are unlucky, you may end up with a program that is seemingly stuck and you don't know why. Using nonlocking communication allows you to avoid these problems most of the time. If you think you may be at risk of deadlocks, you will probably want to use it. The signatures for the primary methods used in asynchronous communication are given here: Request Isend(java.lang.Object buf, int offset, int count, Datatype datatype, int dest, int tag) Isend works similar to its Send counterpart. The main differences are that it does not block (the program continues execution after the call rather than waiting for a corresponding send), and then it returns a Request object. This object is used to check the status of your Send request, block until it is complete if required, and so on: Request Irecv(java.lang.Object buf, int offset, int count, Datatype datatype, int src, int tag) Irecv is again the same as Recv only non-blocking and returns a Request object used to handle your receive request. The operation of these methods can be seen in action in the following example: import mpi._ object MPJTest { def main(args: Array[String]) { MPI.Init(args) val me: Int = MPI.COMM_WORLD.Rank() val size: Int = MPI.COMM_WORLD.Size() if (me == 0) { val requests = for (i <- 0 until 10) yield { val buf = Array(i * i) MPI.COMM_WORLD.Isend(buf, 0, 1, MPI.INT, 1, 0) } } else if (me == 1) { for (i <- 0 until 10) { Thread.sleep(1000) val buf = Array[Int](0) val request = MPI.COMM_WORLD.Irecv (buf, 0, 1, MPI.INT, 0, 0) request.Wait() println("RECEIVED: " + buf(0)) } } MPI.Finalize() } } This is a very simplistic example used simply to demonstrate the basics of using the asynchronous message passing methods. First, the process with rank 0 will send 10 messages to process with rank 1 using Isend. Since Isend does not block, the loop will finish quickly and the messages it sent will be buffered until they are retrieved using Irecv. The second process (the one with rank 1) will wait for one second before retrieving each message. This is to demonstrate the asynchronous nature of these methods. The messages are in the buffer waiting to be retrieved. Therefore, Irecv can be used at your leisure when convenient. The Wait() method of the Request object, it returns, has to be used to retrieve results. The Wait() method blocks until the message is successfully received from the buffer. Summary Extremely computationally intensive programs are usually parallelized and run on supercomputing clusters. These clusters consist of multiple networked computers. Communication between these computers is usually done using messaging libraries such as MPI. These allow you to pass data between processes running on different machines in an efficient manner. In this article, you have learned how to use MPJ Express—an MPI like library for JVM. We saw how to carry out process to process communication as well as collective communication. Most important MPJ Express primitives were covered and example programs using them were given. Resources for Article: Further resources on this subject: Differences in style between Java and Scala code[article] Getting Started with JavaFX[article] Integrating Scala, Groovy, and Flex Development with Apache Maven[article]

0
0
4814

Packt

10 Feb 2015

36 min read

Hive in Hadoop

Packt

10 Feb 2015

36 min read

In this article by Garry Turkington and Gabriele Modena, the author of the book Learning Hadoop 2. explain how MapReduce is a powerful paradigm that enables complex data processing that can reveal valuable insights. It does require a different mindset and some training and experience on the model of breaking processing analytics into a series of map and reduce steps. There are several products that are built atop Hadoop to provide higher-level or more familiar views of the data held within HDFS, and Pig is a very popular one. This article will explore the other most common abstraction implemented atop Hadoop: SQL. In this article, we will cover the following topics: What the use cases for SQL on Hadoop are and why it is so popular HiveQL, the SQL dialect introduced by Apache Hive Using HiveQL to perform SQL-like analysis of the Twitter dataset How HiveQL can approximate common features of relational databases such as joins and views (For more resources related to this topic, see here.) Why SQL on Hadoop So far we have seen how to write Hadoop programs using the MapReduce APIs and how Pig Latin provides a scripting abstraction and a wrapper for custom business logic by means of UDFs. Pig is a very powerful tool, but its dataflow-based programming model is not familiar to most developers or business analysts. The traditional tool of choice for such people to explore data is SQL. Back in 2008 Facebook released Hive, the first widely used implementation of SQL on Hadoop. Instead of providing a way of more quickly developing map and reduce tasks, Hive offers an implementation of HiveQL, a query language based on SQL. Hive takes HiveQL statements and immediately and automatically translates the queries into one or more MapReduce jobs. It then executes the overall MapReduce program and returns the results to the user. This interface to Hadoop not only reduces the time required to produce results from data analysis, it also significantly widens the net as to who can use Hadoop. Instead of requiring software development skills, anyone who's familiar with SQL can use Hive. The combination of these attributes is that HiveQL is often used as a tool for business and data analysts to perform ad hoc queries on the data stored on HDFS. With Hive, the data analyst can work on refining queries without the involvement of a software developer. Just as with Pig, Hive also allows HiveQL to be extended by means of User Defined Functions, enabling the base SQL dialect to be customized with business-specific functionality. Other SQL-on-Hadoop solutions Though Hive was the first product to introduce and support HiveQL, it is no longer the only one. There are others, but we will mostly discuss Hive and Impala as they have been the most successful. While introducing the core features and capabilities of SQL on Hadoop however, we will give examples using Hive; even though Hive and Impala share many SQL features, they also have numerous differences. We don't want to constantly have to caveat each new feature with exactly how it is supported in Hive compared to Impala. We'll generally be looking at aspects of the feature set that are common to both, but if you use both products, it's important to read the latest release notes to understand the differences. Prerequisites Before diving into specific technologies, let's generate some data that we'll use in the examples throughout this article. We'll create a modified version of a former Pig script as the main functionality for this. The script in this article assumes that the Elephant Bird JARs used previously are available in the /jar directory on HDFS. The full source code is at https://github.com/learninghadoop2/book-examples/ch7/extract_for_hive.pig, but the core of extract_for_hive.pig is as follows: -- load JSON data tweets = load '$inputDir' using com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad'); -- Tweets tweets_tsv = foreach tweets { generate (chararray)CustomFormatToISO($0#'created_at', 'EEE MMMM d HH:mm:ss Z y') as dt, (chararray)$0#'id_str', (chararray)$0#'text' as text, (chararray)$0#'in_reply_to', (boolean)$0#'retweeted' as is_retweeted, (chararray)$0#'user'#'id_str' as user_id, (chararray)$0#'place'#'id' as place_id; } store tweets_tsv into '$outputDir/tweets' using PigStorage('u0001'); -- Places needed_fields = foreach tweets { generate (chararray)CustomFormatToISO($0#'created_at', 'EEE MMMM d HH:mm:ss Z y') as dt, (chararray)$0#'id_str' as id_str, $0#'place' as place; } place_fields = foreach needed_fields { generate (chararray)place#'id' as place_id, (chararray)place#'country_code' as co, (chararray)place#'country' as country, (chararray)place#'name' as place_name, (chararray)place#'full_name' as place_full_name, (chararray)place#'place_type' as place_type; } filtered_places = filter place_fields by co != ''; unique_places = distinct filtered_places; store unique_places into '$outputDir/places' using PigStorage('u0001'); -- Users users = foreach tweets { generate (chararray)CustomFormatToISO($0#'created_at', 'EEE MMMM d HH:mm:ss Z y') as dt, (chararray)$0#'id_str' as id_str, $0#'user' as user; } user_fields = foreach users { generate (chararray)CustomFormatToISO(user#'created_at', 'EEE MMMM d HH:mm:ss Z y') as dt, (chararray)user#'id_str' as user_id, (chararray)user#'location' as user_location, (chararray)user#'name' as user_name, (chararray)user#'description' as user_description, (int)user#'followers_count' as followers_count, (int)user#'friends_count' as friends_count, (int)user#'favourites_count' as favourites_count, (chararray)user#'screen_name' as screen_name, (int)user#'listed_count' as listed_count; } unique_users = distinct user_fields; store unique_users into '$outputDir/users' using PigStorage('u0001'); Run this script as follows: $ pig –f extract_for_hive.pig –param inputDir=<json input> -param outputDir=<output path> The preceding code writes data into three separate TSV files for the tweet, user, and place information. Notice that in the store command, we pass an argument when calling PigStorage. This single argument changes the default field separator from a tab character to unicode value U0001, or you can also use Ctrl +C + A. This is often used as a separator in Hive tables and will be particularly useful to us as our tweet data could contain tabs in other fields. Overview of Hive We will now show how you can import data into Hive and run a query against the table abstraction Hive provides over the data. In this example, and in the remainder of the article, we will assume that queries are typed into the shell that can be invoked by executing the hive command. Recently a client called Beeline also became available and will likely be the preferred CLI client in the near future. When importing any new data into Hive, there is generally a three-stage process: Create the specification of the table into which the data is to be imported Import the data into the created table Execute HiveQL queries against the table Most of the HiveQL statements are direct analogues to similarly named statements in standard SQL. We assume only a passing knowledge of SQL throughout this article, but if you need a refresher, there are numerous good online learning resources. Hive gives a structured query view of our data, and to enable that, we must first define the specification of the table's columns and import the data into the table before we can execute any queries. A table specification is generated using a CREATE statement that specifies the table name, the name and types of its columns, and some metadata about how the table is stored: CREATE table tweets ( created_at string, tweet_id string, text string, in_reply_to string, retweeted boolean, user_id string, place_id string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY 'u0001' STORED AS TEXTFILE; The statement creates a new table tweets defined by a list of names for columns in the dataset and their data type. We specify that fields are delimited by the Unicode U0001 character and that the format used to store data is TEXTFILE. Data can be imported from a location in HDFS tweets/ into hive using the LOAD DATA statement: LOAD DATA INPATH 'tweets' OVERWRITE INTO TABLE tweets; By default, data for Hive tables is stored on HDFS under /user/hive/warehouse. If a LOAD statement is given a path to data on HDFS, it will not simply copy the data into /user/hive/warehouse, but will move it there instead. If you want to analyze data on HDFS that is used by other applications, then either create a copy or use the EXTERNAL mechanism that will be described later. Once data has been imported into Hive, we can run queries against it. For instance: SELECT COUNT(*) FROM tweets; The preceding code will return the total number of tweets present in the dataset. HiveQL, like SQL, is not case sensitive in terms of keywords, columns, or table names. By convention, SQL statements use uppercase for SQL language keywords, and we will generally follow this when using HiveQL within files, as will be shown later. However, when typing interactive commands, we will frequently take the line of least resistance and use lowercase. If you look closely at the time taken by the various commands in the preceding example, you'll notice that loading data into a table takes about as long as creating the table specification, but even the simple count of all rows takes significantly longer. The output also shows that table creation and the loading of data do not actually cause MapReduce jobs to be executed, which explains the very short execution times. The nature of Hive tables Although Hive copies the data file into its working directory, it does not actually process the input data into rows at that point. Both the CREATE TABLE and LOAD DATA statements do not truly create concrete table data as such; instead, they produce the metadata that will be used when Hive generates MapReduce jobs to access the data conceptually stored in the table but actually residing on HDFS. Even though the HiveQL statements refer to a specific table structure, it is Hive's responsibility to generate code that correctly maps this to the actual on-disk format in which the data files are stored. This might seem to suggest that Hive isn't a real database; this is true, it isn't. Whereas a relational database will require a table schema to be defined before data is ingested and then ingest only data that conforms to that specification, Hive is much more flexible. The less concrete nature of Hive tables means that schemas can be defined based on the data as it has already arrived and not on some assumption of how the data should be, which might prove to be wrong. Though changeable data formats are troublesome regardless of technology, the Hive model provides an additional degree of freedom in handling the problem when, not if, it arises. Hive architecture Until version 2, Hadoop was primarily a batch system. Internally, Hive compiles HiveQL statements into MapReduce jobs. Hive queries have traditionally been characterized by high latency. This has changed with the Stinger initiative and the improvements introduced in Hive 0.13 that we will discuss later. Hive runs as a client application that processes HiveQL queries, converts them into MapReduce jobs, and submits these to a Hadoop cluster either to native MapReduce in Hadoop 1 or to the MapReduce Application Master running on YARN in Hadoop 2. Regardless of the model, Hive uses a component called the metastore, in which it holds all its metadata about the tables defined in the system. Ironically, this is stored in a relational database dedicated to Hive's usage. In the earliest versions of Hive, all clients communicated directly with the metastore, but this meant that every user of the Hive CLI tool needed to know the metastore username and password. HiveServer was created to act as a point of entry for remote clients, which could also act as a single access-control point and which controlled all access to the underlying metastore. Because of limitations in HiveServer, the newest way to access Hive is through the multi-client HiveServer2. HiveServer2 introduces a number of improvements over its predecessor, including user authentication and support for multiple connections from the same client. More information can be found at https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2. Instances of HiveServer and HiveServer2 can be manually executed with the hive --service hiveserver and hive --service hiveserver2 commands, respectively. In the examples we saw before and in the remainder of this article, we implicitly use HiveServer to submit queries via the Hive command-line tool. HiveServer2 comes with Beeline. For compatibility and maturity reasons, Beeline being relatively new, both tools are available on Cloudera and most other major distributions. The Beeline client is part of the core Apache Hive distribution and so is also fully open source. Beeline can be executed in embedded version with the following command: $ beeline -u jdbc:hive2:// Data types HiveQL supports many of the common data types provided by standard database systems. These include primitive types, such as float, double, int, and string, through to structured collection types that provide the SQL analogues to types such as arrays, structs, and unions (structs with options for some fields). Since Hive is implemented in Java, primitive types will behave like their Java counterparts. We can distinguish Hive data types into the following five broad categories: Numeric: tinyint, smallint, int, bigint, float, double, and decimal Date and time: timestamp and date String: string, varchar, and char Collections: array, map, struct, and uniontype Misc: boolean, binary, and NULL DDL statements HiveQL provides a number of statements to create, delete, and alter databases, tables, and views. The CREATE DATABASE <name> statement creates a new database with the given name. A database represents a namespace where table and view metadata is contained. If multiple databases are present, the USE <database name> statement specifies which one to use to query tables or create new metadata. If no database is explicitly specified, Hive will run all statements against the default database. SHOW [DATABASES, TABLES, VIEWS] displays the databases currently available within a data warehouse and which table and view metadata is present within the database currently in use: CREATE DATABASE twitter; SHOW databases; USE twitter; SHOW TABLES; The CREATE TABLE [IF NOT EXISTS] <name> statement creates a table with the given name. As alluded to earlier, what is really created is the metadata representing the table and its mapping to files on HDFS as well as a directory in which to store the data files. If a table or view with the same name already exists, Hive will raise an exception. Both table and column names are case insensitive. In older versions of Hive (0.12 and earlier), only alphanumeric and underscore characters were allowed in table and column names. As of Hive 0.13, the system supports unicode characters in column names. Reserved words, such as load and create, need to be escaped by backticks (the ` character) to be treated literally. The EXTERNAL keyword specifies that the table exists in resources out of Hive's control, which can be a useful mechanism to extract data from another source at the beginning of a Hadoop-based Extract-Transform-Load (ETL) pipeline. The LOCATION clause specifies where the source file (or directory) is to be found. The EXTERNAL keyword and LOCATION clause have been used in the following code: CREATE EXTERNAL TABLE tweets ( created_at string, tweet_id string, text string, in_reply_to string, retweeted boolean, user_id string, place_id string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY 'u0001' STORED AS TEXTFILE LOCATION '${input}/tweets'; This table will be created in metastore, but the data will not be copied into the /user/hive/warehouse directory. Note that Hive has no concept of primary key or unique identifier. Uniqueness and data normalization are aspects to be addressed before loading data into the data warehouse. The CREATE VIEW <view name> … AS SELECT statement creates a view with the given name. For example, we can create a view to isolate retweets from other messages, as follows: CREATE VIEW retweets COMMENT 'Tweets that have been retweeted' AS SELECT * FROM tweets WHERE retweeted = true; Unless otherwise specified, column names are derived from the defining SELECT statement. Hive does not currently support materialized views. The DROP TABLE and DROP VIEW statements remove both metadata and data for a given table or view. When dropping an EXTERNAL table or a view, only metadata will be removed and the actual data files will not be affected. Hive allows table metadata to be altered via the ALTER TABLE statement, which can be used to change a column type, name, position, and comment or to add and replace columns. When adding columns, it is important to remember that only metadata will be changed and not the dataset itself. This means that if we were to add a column in the middle of the table which didn't exist in older files, then while selecting from older data, we might get wrong values in the wrong columns. This is because we would be looking at old files with a new format Similarly, ALTER VIEW <view name> AS <select statement> changes the definition of an existing view. File formats and storage The data files underlying a Hive table are no different from any other file on HDFS. Users can directly read the HDFS files in the Hive tables using other tools. They can also use other tools to write to HDFS files that can be loaded into Hive through CREATE EXTERNAL TABLE or through LOAD DATA INPATH. Hive uses the Serializer and Deserializer classes, SerDe, as well as FileFormat to read and write table rows. A native SerDe is used if ROW FORMAT is not specified or ROW FORMAT DELIMITED is specified in a CREATE TABLE statement. The DELIMITED clause instructs the system to read delimited files. Delimiter characters can be escaped using the ESCAPED BY clause. Hive currently uses the following FileFormat classes to read and write HDFS files: TextInputFormat and HiveIgnoreKeyTextOutputFormat: will read/write data in plain text file format SequenceFileInputFormat and SequenceFileOutputFormat: classes read/write data in the Hadoop SequenceFile format Additionally, the following SerDe classes can be used to serialize and deserialize data: MetadataTypedColumnsetSerDe: This will read/write delimited records such as CSV or tab-separated records ThriftSerDe, and DynamicSerDe: These will read/write Thrift objects JSON As of version 0.13, Hive ships with the native org.apache.hive.hcatalog.data.JsonSerDe JSON SerDe. For older versions of Hive, Hive-JSON-Serde (found at https://github.com/rcongiu/Hive-JSON-Serde) is arguably one of the most feature-rich JSON serialization/deserialization modules. We can use either module to load JSON tweets without any need for preprocessing and just define a Hive schema that matches the content of a JSON document. In the following example, we use Hive-JSON-Serde. As with any third-party module, we load the SerDe JARS into Hive with the following code: ADD JAR JAR json-serde-1.3-jar-with-dependencies.jar; Then, we issue the usual create statement, as follows: CREATE EXTERNAL TABLE tweets ( contributors string, coordinates struct < coordinates: array <float>, type: string>, created_at string, entities struct < hashtags: array <struct < indices: array <tinyint>, text: string>>, … ) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' STORED AS TEXTFILE LOCATION 'tweets'; With this SerDe, we can map nested documents (such as entities or users) to the struct or map types. We tell Hive that the data stored at LOCATION 'tweets' is text (STORED AS TEXTFILE) and that each row is a JSON object (ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'). In Hive 0.13 and later, we can express this property as ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'. Manually specifying the schema for complex documents can be a tedious and error-prone process. The hive-json module (found at https://github.com/hortonworks/hive-json) is a handy utility to analyze large documents and generate an appropriate Hive schema. Depending on the document collection, further refinement might be necessary. In our example, we used a schema generated with hive-json that maps the tweets JSON to a number of struct data types. This allows us to query the data using a handy dot notation. For instance, we can extract the screen name and description fields of a user object with the following code: SELECT user.screen_name, user.description FROM tweets_json LIMIT 10; Avro AvroSerde (https://cwiki.apache.org/confluence/display/Hive/AvroSerDe) allows us to read and write data in Avro format. Starting from 0.14, Avro-backed tables can be created using the STORED AS AVRO statement, and Hive will take care of creating an appropriate Avro schema for the table. Prior versions of Hive are a bit more verbose. This dataset was created using Pig's AvroStorage class, which generated the following schema: { "type":"record", "name":"record", "fields": [ {"name":"topic","type":["null","int"]}, {"name":"source","type":["null","int"]}, {"name":"rank","type":["null","float"]} ] } The table structure is captured in an Avro record, which contains header information (a name and optional namespace to qualify the name) and an array of the fields. Each field is specified with its name and type as well as an optional documentation string. For a few of the fields, the type is not a single value, but instead a pair of values, one of which is null. This is an Avro union, and this is the idiomatic way of handling columns that might have a null value. Avro specifies null as a concrete type, and any location where another type might have a null value needs to be specified in this way. This will be handled transparently for us when we use the following schema. With this definition, we can now create a Hive table that uses this schema for its table specification, as follows: CREATE EXTERNAL TABLE tweets_pagerank ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' WITH SERDEPROPERTIES ('avro.schema.literal'='{ "type":"record", "name":"record", "fields": [ {"name":"topic","type":["null","int"]}, {"name":"source","type":["null","int"]}, {"name":"rank","type":["null","float"]} ] }') STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' LOCATION '${data}/ch5-pagerank'; Then, look at the following table definition from within Hive (note also that HCatalog): DESCRIBE tweets_pagerank; OK topic int from deserializer source int from deserializer rank float from deserializer In the DDL, we told Hive that data is stored in Avro format using AvroContainerInputFormat and AvroContainerOutputFormat. Each row needs to be serialized and deserialized using org.apache.hadoop.hive.serde2.avro.AvroSerDe. The table schema is inferred by Hive from the Avro schema embedded in avro.schema.literal. Alternatively, we can store a schema on HDFS and have Hive read it to determine the table structure. Create the preceding schema in a file called pagerank.avsc—this is the standard file extension for Avro schemas. Then place it on HDFS; we prefer to have a common location for schema files such as /schema/avro. Finally, define the table using the avro.schema.url SerDe property WITH SERDEPROPERTIES ('avro.schema.url'='hdfs://<namenode>/schema/avro/pagerank.avsc'). If Avro dependencies are not present in the classpath, we need to add the Avro MapReduce JAR to our environment before accessing individual fields. Within Hive, on the Cloudera CDH5 VM: ADD JAR /opt/cloudera/parcels/CDH/lib/avro/avro-mapred-hadoop2.jar; We can also use this table like any other. For instance, we can query the data to select the user and topic pairs with a high PageRank: SELECT source, topic from tweets_pagerank WHERE rank >= 0.9; Columnar stores Hive can also take advantage of columnar storage via the ORC (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC) and Parquet (https://cwiki.apache.org/confluence/display/Hive/Parquet) formats. If a table is defined with very many columns, it is not unusual for any given query to only process a small subset of these columns. But even in a SequenceFile each full row and all its columns will be read from disk, decompressed, and processed. This consumes a lot of system resources for data that we know in advance is not of interest. Traditional relational databases also store data on a row basis, and a type of database called columnar changed this to be column-focused. In the simplest model, instead of one file for each table, there would be one file for each column in the table. If a query only needed to access five columns in a table with 100 columns in total, then only the files for those five columns will be read. Both ORC and Parquet use this principle as well as other optimizations to enable much faster queries. Queries Tables can be queried using the familiar SELECT … FROM statement. The WHERE statement allows the specification of filtering conditions, GROUP BY aggregates records, ORDER BY specifies sorting criteria, and LIMIT specifies the number of records to retrieve. Aggregate functions, such as count and sum, can be applied to aggregated records. For instance, the following code returns the top 10 most prolific users in the dataset: SELECT user_id, COUNT(*) AS cnt FROM tweets GROUP BY user_id ORDER BY cnt DESC LIMIT 10 The following are the top 10 most prolific users in the dataset: NULL 7091 1332188053 4 959468857 3 1367752118 3 362562944 3 58646041 3 2375296688 3 1468188529 3 37114209 3 2385040940 3 We can improve the readability of the hive output by setting the following: SET hive.cli.print.header=true; This will instruct hive, though not beeline, to print column names as part of the output. You can add the command to the .hiverc file usually found in the root of the executing user's home directory to have it apply to all hive CLI sessions. HiveQL implements a JOIN operator that enables us to combine tables together. In the Prerequisites section, we generated separate datasets for the user and place objects. Let's now load them into hive using external tables. We first create a user table to store user data, as follows: CREATE EXTERNAL TABLE user ( created_at string, user_id string, `location` string, name string, description string, followers_count bigint, friends_count bigint, favourites_count bigint, screen_name string, listed_count bigint ) ROW FORMAT DELIMITED FIELDS TERMINATED BY 'u0001' STORED AS TEXTFILE LOCATION '${input}/users'; We then create a place table to store location data, as follows: CREATE EXTERNAL TABLE place ( place_id string, country_code string, country string, `name` string, full_name string, place_type string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY 'u0001' STORED AS TEXTFILE LOCATION '${input}/places'; We can use the JOIN operator to display the names of the 10 most prolific users, as follows: SELECT tweets.user_id, user.name, COUNT(tweets.user_id) AS cnt FROM tweets JOIN user ON user.user_id = tweets.user_id GROUP BY tweets.user_id, user.user_id, user.name ORDER BY cnt DESC LIMIT 10; Only equality, outer, and left (semi) joins are supported in Hive. Notice that there might be multiple entries with a given user ID but different values for the followers_count, friends_count, and favourites_count columns. To avoid duplicate entries, we count only user_id from the tweets tables. We can rewrite the previous query as follows: SELECT tweets.user_id, u.name, COUNT(*) AS cnt FROM tweets join (SELECT user_id, name FROM user GROUP BY user_id, name) u ON u.user_id = tweets.user_id GROUP BY tweets.user_id, u.name ORDER BY cnt DESC LIMIT 10; Instead of directly joining the user table, we execute a subquery, as follows: SELECT user_id, name FROM user GROUP BY user_id, name; The subquery extracts unique user IDs and names. Note that Hive has limited support for subqueries, historically only permitting a subquery in the FROM clause of a SELECT statement. Hive 0.13 has added limited support for subqueries within the WHERE clause also. HiveQL is an ever-evolving rich language, a full exposition of which is beyond the scope of this article. A description of its query and ddl capabilities can be found at https://cwiki.apache.org/confluence/display/Hive/LanguageManual. Structuring Hive tables for given workloads Often Hive isn't used in isolation, instead tables are created with particular workloads in mind or needs invoked in ways that are suitable for inclusion in automated processes. We'll now explore some of these scenarios. Partitioning a table With columnar file formats, we explained the benefits of excluding unneeded data as early as possible when processing a query. A similar concept has been used in SQL for some time: table partitioning. When creating a partitioned table, a column is specified as the partition key. All values with that key are then stored together. In Hive's case, different subdirectories for each partition key are created under the table directory in the warehouse location on HDFS. It's important to understand the cardinality of the partition column. With too few distinct values, the benefits are reduced as the files are still very large. If there are too many values, then queries might need a large number of files to be scanned to access all the required data. Perhaps the most common partition key is one based on date. We could, for example, partition our user table from earlier based on the created_at column, that is, the date the user was first registered. Note that since partitioning a table by definition affects its file structure, we create this table now as a non-external one, as follows: CREATE TABLE partitioned_user ( created_at string, user_id string, `location` string, name string, description string, followers_count bigint, friends_count bigint, favourites_count bigint, screen_name string, listed_count bigint ) PARTITIONED BY (created_at_date string) ROW FORMAT DELIMITED FIELDS TERMINATED BY 'u0001' STORED AS TEXTFILE; To load data into a partition, we can explicitly give a value for the partition into which to insert the data, as follows: INSERT INTO TABLE partitioned_user PARTITION( created_at_date = '2014-01-01') SELECT created_at, user_id, location, name, description, followers_count, friends_count, favourites_count, screen_name, listed_count FROM user; This is at best verbose, as we need a statement for each partition key value; if a single LOAD or INSERT statement contains data for multiple partitions, it just won't work. Hive also has a feature called dynamic partitioning, which can help us here. We set the following three variables: SET hive.exec.dynamic.partition = true; SET hive.exec.dynamic.partition.mode = nonstrict; SET hive.exec.max.dynamic.partitions.pernode=5000; The first two statements enable all partitions (nonstrict option) to be dynamic. The third one allows 5,000 distinct partitions to be created on each mapper and reducer node. We can then simply use the name of the column to be used as the partition key, and Hive will insert data into partitions depending on the value of the key for a given row: INSERT INTO TABLE partitioned_user PARTITION( created_at_date ) SELECT created_at, user_id, location, name, description, followers_count, friends_count, favourites_count, screen_name, listed_count, to_date(created_at) as created_at_date FROM user; Even though we use only a single partition column here, we can partition a table by multiple column keys; just have them as a comma-separated list in the PARTITIONED BY clause. Note that the partition key columns need to be included as the last columns in any statement being used to insert into a partitioned table. In the preceding code we use Hive's to_date function to convert the created_at timestamp to a YYYY-MM-DD formatted string. Partitioned data is stored in HDFS as /path/to/warehouse/<database>/<table>/key=<value>. In our example, the partitioned_user table structure will look like /user/hive/warehouse/default/partitioned_user/created_at=2014-04-01. If data is added directly to the filesystem, for instance by some third-party processing tool or by hadoop fs -put, the metastore won't automatically detect the new partitions. The user will need to manually run an ALTER TABLE statement such as the following for each newly added partition: ALTER TABLE <table_name> ADD PARTITION <location>; To add metadata for all partitions not currently present in the metastore we can use: MSCK REPAIR TABLE <table_name>; statement. On EMR, this is equivalent to executing the following statement: ALTER TABLE <table_name> RECOVER PARTITIONS; Notice that both statements will work also with EXTERNAL tables. Overwriting and updating data Partitioning is also useful when we need to update a portion of a table. Normally a statement of the following form will replace all the data for the destination table: INSERT OVERWRITE INTO <table>… If OVERWRITE is omitted, then each INSERT statement will add additional data to the table. Sometimes, this is desirable, but often, the source data being ingested into a Hive table is intended to fully update a subset of the data and keep the rest untouched. If we perform an INSERT OVERWRITE statement (or a LOAD OVERWRITE statement) into a partition of a table, then only the specified partition will be affected. Thus, if we were inserting user data and only wanted to affect the partitions with data in the source file, we could achieve this by adding the OVERWRITE keyword to our previous INSERT statement. We can also add caveats to the SELECT statement. Say, for example, we only wanted to update data for a certain month: INSERT INTO TABLE partitioned_user PARTITION (created_at_date) SELECT created_at , user_id, location, name, description, followers_count, friends_count, favourites_count, screen_name, listed_count, to_date(created_at) as created_at_date FROM user WHERE to_date(created_at) BETWEEN '2014-03-01' and '2014-03-31'; Bucketing and sorting Partitioning a table is a construct that you take explicit advantage of by using the partition column (or columns) in the WHERE clause of queries against the tables. There is another mechanism called bucketing that can further segment how a table is stored and does so in a way that allows Hive itself to optimize its internal query plans to take advantage of the structure. Let's create bucketed versions of our tweets and user tables; note the following additional CLUSTER BY and SORT BY statements in the CREATE TABLE statements: CREATE table bucketed_tweets ( tweet_id string, text string, in_reply_to string, retweeted boolean, user_id string, place_id string ) PARTITIONED BY (created_at string) CLUSTERED BY(user_ID) into 64 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY 'u0001' STORED AS TEXTFILE; CREATE TABLE bucketed_user ( user_id string, `location` string, name string, description string, followers_count bigint, friends_count bigint, favourites_count bigint, screen_name string, listed_count bigint ) PARTITIONED BY (created_at string) CLUSTERED BY(user_ID) SORTED BY(name) into 64 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY 'u0001' STORED AS TEXTFILE; Note that we changed the tweets table to also be partitioned; you can only bucket a table that is partitioned. Just as we need to specify a partition column when inserting into a partitioned table, we must also take care to ensure that data inserted into a bucketed table is correctly clustered. We do this by setting the following flag before inserting the data into the table: SET hive.enforce.bucketing=true; Just as with partitioned tables, you cannot apply the bucketing function when using the LOAD DATA statement; if you wish to load external data into a bucketed table, first insert it into a temporary table, and then use the INSERT…SELECT… syntax to populate the bucketed table. When data is inserted into a bucketed table, rows are allocated to a bucket based on the result of a hash function applied to the column specified in the CLUSTERED BY clause. One of the greatest advantages of bucketing a table comes when we need to join two tables that are similarly bucketed, as in the previous example. So, for example, any query of the following form would be vastly improved: SET hive.optimize.bucketmapjoin=true; SELECT … FROM bucketed_user u JOIN bucketed_tweet t ON u.user_id = t.user_id; With the join being performed on the column used to bucket the table, Hive can optimize the amount of processing as it knows that each bucket contains the same set of user_id columns in both tables. While determining which rows against which to match, only those in the bucket need to be compared against, and not the whole table. This does require that the tables are both clustered on the same column and that the bucket numbers are either identical or one is a multiple of the other. In the latter case, with say one table clustered into 32 buckets and another into 64, the nature of the default hash function used to allocate data to a bucket means that the IDs in bucket 3 in the first table will cover those in both buckets 3 and 35 in the second. Sampling data Bucketing a table can also help while using Hive's ability to sample data in a table. Sampling allows a query to gather only a specified subset of the overall rows in the table. This is useful when you have an extremely large table with moderately consistent data patterns. In such a case, applying a query to a small fraction of the data will be much faster and will still give a broadly representative result. Note, of course, that this only applies to queries where you are looking to determine table characteristics, such as pattern ranges in the data; if you are trying to count anything, then the result needs to be scaled to the full table size. For a non-bucketed table, you can sample in a mechanism similar to what we saw earlier by specifying that the query should only be applied to a certain subset of the table: SELECT max(friends_count) FROM user TABLESAMPLE(BUCKET 2 OUT OF 64 ON name); In this query, Hive will effectively hash the rows in the table into 64 buckets based on the name column. It will then only use the second bucket for the query. Multiple buckets can be specified, and if RAND() is given as the ON clause, then the entire row is used by the bucketing function. Though successful, this is highly inefficient as the full table needs to be scanned to generate the required subset of data. If we sample on a bucketed table and ensure the number of buckets sampled is equal to or a multiple of the buckets in the table, then Hive will only read the buckets in question. For example: SELECT MAX(friends_count) FROM bucketed_user TABLESAMPLE(BUCKET 2 OUT OF 32 on user_id); In the preceding query against the bucketed_user table, which is created with 64 buckets on the user_id column, the sampling, since it is using the same column, will only read the required buckets. In this case, these will be buckets 2 and 34 from each partition. A final form of sampling is block sampling. In this case, we can specify the required amount of the table to be sampled, and Hive will use an approximation of this by only reading enough source data blocks on HDFS to meet the required size. Currently, the data size can be specified as either a percentage of the table, as an absolute data size, or as a number of rows (in each block). The syntax for TABLESAMPLE is as follows, which will sample 0.5 percent of the table, 1 GB of data or 100 rows per split, respectively: TABLESAMPLE(0.5 PERCENT) TABLESAMPLE(1G) TABLESAMPLE(100 ROWS) If these latter forms of sampling are of interest, then consult the documentation, as there are some specific limitations on the input format and file formats that are supported. Writing scripts We can place Hive commands in a file and run them with the -f option in the hive CLI utility: $ cat show_tables.hql show tables; $ hive -f show_tables.hql We can parameterize HiveQL statements by means of the hiveconf mechanism. This allows us to specify an environment variable name at the point it is used rather than at the point of invocation. For example: $ cat show_tables2.hql show tables like '${hiveconf:TABLENAME}'; $ hive -hiveconf TABLENAME=user -f show_tables2.hql The variable can also be set within the Hive script or an interactive session: SET TABLE_NAME='user'; The preceding hiveconf argument will add any new variables in the same namespace as the Hive configuration options. As of Hive 0.8, there is a similar option called hivevar that adds any user variables into a distinct namespace. Using hivevar, the preceding command would be as follows: $ cat show_tables3.hql show tables like '${hivevar:TABLENAME}'; $ hive -hivevar TABLENAME=user –f show_tables3.hql Or we can write the command interactively: SET hivevar_TABLE_NAME='user'; Summary In this article, we learned that in its early days, Hadoop was sometimes erroneously seen as the latest supposed relational database killer. Over time, it has become more apparent that the more sensible approach is to view it as a complement to RDBMS technologies and that, in fact, the RDBMS community has developed tools such as SQL that are also valuable in the Hadoop world. HiveQL is an implementation of SQL on Hadoop and was the primary focus of this article. In regard to HiveQL and its implementations, we covered the following topics: How HiveQL provides a logical model atop data stored in HDFS in contrast to relational databases where the table structure is enforced in advance How HiveQL offers the ability to extend its core set of operators with user-defined code and how this contrasts to the Pig UDF mechanism The recent history of Hive developments, such as the Stinger initiative, that have seen Hive transition to an updated implementation that uses Tez Resources for Article: Further resources on this subject: Big Data Analysis [Article] Understanding MapReduce [Article] Amazon DynamoDB - Modelling relationships, Error handling [Article]

0
0
4808

Packt

13 Dec 2012

15 min read

Meet QlikView

Packt

13 Dec 2012

15 min read

(For more resources related to this topic, see here.) What is QlikView? QlikView is developed by QlikTech, a company that was founded in Sweden in 1993, but has since moved its headquarters to the US. QlikView is a tool used for Business Intelligence, often shortened to BI. Business Intelligence is defined by Gartner, a leading industry analyst firm, as: An umbrella term that includes the application, infrastructure and tools, and best practices that enable access to and analysis of information to improve and optimize decisions and performance. Following this definition, QlikView is a tool that enables access to information in order to analyze this information, which in turn improves and optimizes business decisions and performance. Historically, BI has been very much IT-driven. IT departments were responsible for the entire Business Intelligence life cycle, from extracting the data to delivering the final reports, analyses, and dashboards. While this model works very well for delivering predefined static reports, most businesses find that it does not meet the needs of their business users. As IT tightly controls the data and tools, users often experience long lead-times whenever new questions arise that cannot be answered with the standard reports. How does QlikView differ from traditional BI? QlikTech prides itself in taking an approach to Business Intelligence that is different from what companies such as Oracle, SAP, and IBM—described by QlikTech as traditional BI vendors—are delivering. They aim to put the tools in the hands of business users, allowing them to become self-sufficient because they can perform their own analyses. Independent industry analyst firms have noticed this different approach as well. In 2011, Gartner created a subcategory for Data Discovery tools in its yearly market evaluation, the Magic Quadrant Business Intelligence platform. QlikView was named the poster child for this new category of BI tools. QlikTech chooses to describe itself as a Business Discovery enterprise instead of Data Discovery enterprise. It believes that discovering business insights is much more important than discovering data. The following diagram outlines this paradigm: Besides the difference in who uses the tool — IT users versus business users — there are a few other key features that differentiate QlikView from other solutions. Associative user experience The main difference between QlikView and other BI solutions is the associative user experience. Where traditional BI solutions use predefined paths to navigate and explore data, QlikView allows users to take whatever route they want. This is a far more intuitive way to explore data. QlikTech describes this as "working the way your mind works." An example is shown in the following image. While in a typical BI solution, we would need to start by selecting a Region and then drill down step-by-step through the defined drill path, in QlikView we can choose whatever entry point we like — Region, State, Product, or Sales Person. We are then shown only the data related to that selection, and in our next selection we can go wherever we want. It is infinitely flexible. Additionally, the QlikView user interface allows us to see which data is associated with our selection. For example, the following screenshot (from QlikTech's What's New in QlikView 11 demo document) shows a QlikView Dashboard in which two values are selected. In the Quarter field, Q3 is selected, and in the Sales Reps field, Cart Lynch is selected. We can see this because these values are green, which in QlikView means that they have been selected. When a selection is made, the interface automatically updates to not only show which data is associated with that selection, but also which data is not associated with the selection. Associated data has a white background, while non-associated data has a gray background. Sometimes the associations can be pretty obvious; it is no surprise that the third quarter is associated with the months July, August, and September. However, at other times, some not-so-obvious insights surface, such as the information that Cart Lynch has not sold any products in Germany or Spain. This extra information, not featured in traditional BI tools, can be of great value, as it offers a new starting point for investigation. Technology QlikView's core technological differentiator is that it uses an in-memory data model, which stores all of its data in RAM instead of using disk. As RAM is much faster than disk, this allows for very fast response times, resulting in a very smooth user-experience. Adoption path There is also a difference between QlikView and traditional BI solutions in the way it is typically rolled out within a company. Where traditional BI suites are often implemented top-down—by IT selecting a BI tool for the entire company—QlikView often takes a bottom-up adoption path. Business users in a single department adopt it, and its use spreads out from there. QlikView is free of charge for single-user use. This is called the Personal Edition or PE. Documents created in Personal Edition can be opened by fully-licensed users or deployed on a QlikView server. The limitation is that, with the exception of some documents enabled for PE by QlikTech, you cannot open documents created elsewhere, or even your own documents if they have been opened and saved by another user or server instance. Often, a business user will decide to download QlikView to see if he can solve a business problem. When other users within the department see the software, they get enthusiastic about it, so they too download a copy. To be able to share documents, they decide to purchase a few licenses for the department. Then other departments start to take notice too, and QlikView gains traction within the organization. Before long, IT and senior management also take notice, eventually leading to enterprise-wide adoption of QlikView. QlikView facilitates every step in this process, scaling from single laptop deployments to full enterprise-wide deployments with thousands of users. The following graphic demonstrates this growth within an organization: As the popularity and track record of QlikView have grown, it has gotten more and more visibility at the enterprise level. While the adoption path described before is still probably the most common adoption path, it is not uncommon nowadays for a company to do a top-down, company-wide rollout of QlikView. Exploring data with QlikView Now that we know what QlikView is and how it is different from traditional BI offerings, we will learn how we can explore data within QlikView. Getting QlikView Of course, before we can start exploring, we need to install QlikView. You can download QlikView's Personal Edition from http://www.qlikview.com/download. You will be asked to register on the website, or log in if you have registered before. Registering not only gives you access to the QlikView software, but you can also use it to read and post on the QlikCommunity (http://community.qlikview.com) which is the QlikTech's user forum. This forum is very active and many questions can be answered by either a quick search or by posting a question. Installing QlikView is very straightforward, simply double-click on the executable file and accept all default options offered. After you are done installing it, launch the QlikView application. QlikView will open with the start page set to the Getting Started tab, as seen in the following screenshot: The example we will be using is the Movie Database, which is an example document that is supplied with QlikView. Find this document by scrolling down the Examples list (it is around halfway down the list) and click to open it. The opening screen of the document will now be displayed: Navigating the document Most QlikView documents are organized into multiple sheets. These sheets often display different viewpoints on the same data, or display the same information aggregated to suit the needs of different types of users. An example of the first type of grouping might be a customer or marketing view of the data, an example of the second type of grouping might be a KPI dashboard for executives, with a more in-depth sheet for analysts. Navigating the different sheets in a QlikView document is typically done by using the tabs at the top of the sheet, as shown in the following screenshot. More sophisticated designs may opt to hide the tab row and use buttons to switch between the different sheets. The tabs in the Movie Database document also follow a logical order. An introduction is shown on the Intro tab, followed by a demonstration of the key concept of QlikView on the How QlikView works tab. After the contrast with Traditional OLAP is shown, the associative QlikView Model is introduced. The last two tabs show how this can be leveraged by showing a concrete Dashboard and Analysis: Slicing and dicing your data As we saw when we learned about the associative user experience, any selections made in QlikView are automatically applied to the entire data model. As we will see in the next section, slicing and dicing your data really is as easy as clicking and viewing! List-boxes But where should we click? QlikView lets us select data in a number of ways. A common method is to select a value from a list-box. This is done by clicking in the list-box. Let's switch to the How QlikView works tab to see how this works. We can do this by either clicking on the How QlikView works tab on the top of the sheet, or by clicking on the Get Started button. The selected tab shows two list boxes, one containing Fruits and the other containing Colors. When we select Apple in the Fruits list-box, the screen automatically updates to show the associated data in the Colors list-box: Green and Red. The color Yellow is shown with a gray background to indicate that it is not associated, as seen below, since there are no yellow apples. To select multiple values, all we need to do is hold down Ctrl while we are making our selection. Selections in charts Besides selections in list-boxes, we can also directly select data in charts. Let's jump to the Dashboard tab and see how this is done. The Dashboard tab contains a chart labeled Number of Movies, which lists the number of movies by a particular actor. If we wish to select only the top three actors, we can simply drag the pointer to select them in the chart, instead of selecting them from a list-box: Because the selection automatically cascades to the rest of the model, this also results in the Actor list-box being updated to reflect the new selection: Of course, if we want to select only a single value in a chart, we don't necessarily need to lasso it. Instead, we can just click on the data point to select it. For example, clicking on James Stewart leads to only that actor being selected. Search While list-boxes and lassoing are both very convenient ways of selecting data, sometimes we may not want to scroll down a big list looking for a value that may or may not be there. This is where the search option comes in handy. For example, we may want to run a search for the actor Al Pacino. To do this, we first activate the corresponding list-box by clicking on it. Next, we simply start typing and the list-box will automatically be updated to show all values that match the search string. When we've found the actor we're looking for, Al Pacino in this case, we can click on that value to select it: Sometimes, we may want to select data based on associated values. For example, we may want to select all of the actors that starred in the movie Forrest Gump. While we could just use the Title list-box, there is also another option: associated search. To use associated search, we click on the chevron on the right-hand side of the search box. This expands the search box and any search term we enter will not only be checked against the Actor list-box, but also against the contents of the entire data model. When we type in Forrest Gump, the search box will show that there is a movie with that title, as seen in the screenshot below. If we select that movie and click on Return, all actors which star in the movie will be selected. Bookmarking selections Inevitably, when exploring data in QlikView, there comes a point where we want to save our current selections to be able to return to them later. This is facilitated by the bookmark option. Bookmarks are used to store a selection for later retrieval. Creating a new bookmark To create a new bookmark, we need to open the Add Bookmark dialog. This is done by either pressing Ctrl + B or by selecting Bookmark | Add Bookmark from the menu. In the Add Bookmark dialog, seen in the screenshot below, we can add a descriptive name for the bookmark. Other options allow us to change how the selection is applied (as either a new selection or on top of the existing selection) and if the view should switch to the sheet that was open at the time of creating the bookmark. The Info Text allows for a longer description to be entered that can be shown in a pop-up when the bookmark is selected. Retrieving a bookmark We can retrieve a bookmark by selecting it from the Bookmarks menu, seen here: Undoing selections Fortunately, if we end up making a wrong selection, QlikView is very forgiving. Using the Clear, Back, and Forward buttons in the toolbar, we can easily clear the entire selection, go back to what we had in our previous selections, or go forward again. Just like in our Internet browser, the Back button in QlikView can take us back multiple steps: Changing the view Besides filtering data, QlikView also lets us change the information being displayed. We'll see how this is done in the following sections. Cyclic Groups Cyclic Groups are defined by developers as a list of dimensions that can be switched between users. On the frontend, they are indicated with a circular arrow. For an example of how this works, let's look at the Ratio to Total chart, seen in the following image. By default, this chart shows movies grouped by duration. If we click on the little downward arrow next to the circular arrow, we will see a list of alternative groupings. Click on Decade to switch to the view to movies grouped by decade. Drill down Groups Drill down Groups are defined by the developer as a hierarchical list of dimensions which allows users to drill down to more detailed levels of the data. For example, a very common drill down path is Year | Quarter | Month | Day. On the frontend, drill down groups are indicated with an upward arrow. In the Movies Database document, a drill down can be found on the tab labeled Traditional OLAP. Let's go there. This drill down follows the path Director | Title | Actor. Click on the Director A. Edward Sutherland to drill down to all movies that he directed, shown in the following screenshot. Next, click on Every Day's A Holiday to see which actors starred in that movie. When drilling down, we can always go back to the previous level by clicking on the upward arrow, located at the top of the list-box in this example. Containers Containers are used to alternate between the display of different objects in the same screen space. We can select the individual objects by selecting the corresponding tab within the container. Our Movies Database example includes a container on the Analysis sheet. The container contains two objects, a chart showing Average length of Movies over time and a table showing the Movie List, shown in the following screenshot. The chart is shown by default, you can switch to the Movie List by clicking on the corresponding tab at the top of the object. On the time chart, we can switch between Average length of Movies and Movie List by using the tabs at the top of the container object. But wait, there's more! After all of the slicing, dicing, drilling, and view-switching we've done, there is still the question on our minds: how can we export our selected data to Excel? Fortunately, QlikView is very flexible when it comes to this, we can simply right-click on any object and choose Send to Excel, or, if it has been enabled by the developer, we can click on the XL icon in an object's header. Click on the XL icon in the Movie List table to export the list of currently selected movies to Excel. A word of warning when exporting data When viewing tables with a large number of rows, QlikView is very good at only rendering those rows that are presently visible on the screen. When Export values to Excel is selected, all values must be pulled down into an Excel file. For large data sets, this can take a considerable amount of time and may cause QlikView to become unresponsive while it provides the data.

0
0
4802

Packt

03 Mar 2015

11 min read

MapReduce functions

Packt

03 Mar 2015

11 min read

In this article, by John Zablocki, author of the book, Couchbase Essentials, you will be acquainted to MapReduce and how you'll use it to create secondary indexes for our documents. At its simplest, MapReduce is a programming pattern used to process large amounts of data that is typically distributed across several nodes in parallel. In the NoSQL world, MapReduce implementations may be found on many platforms from MongoDB to Hadoop, and of course, Couchbase. Even if you're new to the NoSQL landscape, it's quite possible that you've already worked with a form of MapReduce. The inspiration for MapReduce in distributed NoSQL systems was drawn from the functional programming concepts of map and reduce. While purely functional programming languages haven't quite reached mainstream status, languages such as Python, C#, and JavaScript all support map and reduce operations. (For more resources related to this topic, see here.) Map functions Consider the following Python snippet: numbers = [1, 2, 3, 4, 5] doubled = map(lambda n: n * 2, numbers) #doubled == [2, 4, 6, 8, 10] These two lines of code demonstrate a very simple use of a map() function. In the first line, the numbers variable is created as a list of integers. The second line applies a function to the list to create a new mapped list. In this case, the map() function is supplied as a Python lambda, which is just an inline, unnamed function. The body of lambda multiplies each number by two. This map() function can be made slightly more complex by doubling only odd numbers, as shown in this code: numbers = [1, 2, 3, 4, 5] defdouble_odd(num): if num % 2 == 0: return num else: return num * 2 doubled = map(double_odd, numbers) #doubled == [2, 2, 6, 4, 10] Map functions are implemented differently in each language or platform that supports them, but all follow the same pattern. An iterable collection of objects is passed to a map function. Each item of the collection is then iterated over with the map function being applied to that iteration. The final result is a new collection where each of the original items is transformed by the map. Reduce functions Like maps, the reduce functions also work by applying a provided function to an iterable data structure. The key difference between the two is that the reduce function works to produce a single value from the input iterable. Using Python's built-in reduce() function, we can see how to produce a sum of integers, as follows: numbers = [1, 2, 3, 4, 5] sum = reduce(lambda x, y: x + y, numbers) #sum == 15 You probably noticed that unlike our map operation, the reduce lambda has two parameters (x and y in this case). The argument passed to x will be the accumulated value of all applications of the function so far, and y will receive the next value to be added to the accumulation. Parenthetically, the order of operations can be seen as ((((1 + 2) + 3) + 4) + 5). Alternatively, the steps are shown in the following list: x = 1, y = 2 x = 3, y = 3 x = 6, y = 4 x = 10, y = 5 x = 15 As this list demonstrates, the value of x is the cumulative sum of previous x and y values. As such, reduce functions are sometimes termed accumulate or fold functions. Regardless of their name, reduce functions serve the common purpose of combining pieces of a recursive data structure to produce a single value. Couchbase MapReduce Creating an index (or view) in Couchbase requires creating a map function written in JavaScript. When the view is created for the first time, the map function is applied to each document in the bucket containing the view. When you update a view, only new or modified documents are indexed. This behavior is known as incremental MapReduce. You can think of a basic map function in Couchbase as being similar to a SQL CREATE INDEX statement. Effectively, you are defining a column or a set of columns, to be indexed by the server. Of course, these are not columns, but rather properties of the documents to be indexed. Basic mapping To illustrate the process of creating a view, first imagine that we have a set of JSON documents as shown here: var books=[ { "id": 1, "title": "The Bourne Identity", "author": "Robert Ludlow" }, { "id": 2, "title": "The Godfather", "author": "Mario Puzzo" }, { "id": 3, "title": "Wiseguy", "author": "Nicholas Pileggi" } ]; Each document contains title and author properties. In Couchbase, to query these documents by either title or author, we'd first need to write a map function. Without considering how map functions are written in Couchbase, we're able to understand the process with vanilla JavaScript: books.map(function(book) { return book.author; }); In the preceding snippet, we're making use of the built-in JavaScript array's map() function. Similar to the Python snippets we saw earlier, JavaScript's map() function takes a function as a parameter and returns a new array with mapped objects. In this case, we'll have an array with each book's author, as follows: ["Robert Ludlow", "Mario Puzzo", "Nicholas Pileggi"] At this point, we have a mapped collection that will be the basis for our author index. However, we haven't provided a means for the index to be able to refer back to its original document. If we were using a relational database, we'd have effectively created an index on the Title column with no way to get back to the row that contained it. With a slight modification to our map function, we are able to provide the key (the id property) of the document as well in our index: books.map(function(book) { return [book.author, book.id]; }); In this slightly modified version, we're including the ID with the output of each author. In this way, the index has its document's key stored with its title. [["The Bourne Identity", 1], ["The Godfather", 2], ["Wiseguy", 3]] We'll soon see how this structure more closely resembles the values stored in a Couchbase index. Basic reducing Not every Couchbase index requires a reduce component. In fact, we'll see that Couchbase already comes with built-in reduce functions that will provide you with most of the reduce behavior you need. However, before relying on only those functions, it's important to understand why you'd use a reduce function in the first place. Returning to the preceding example of the map, let's imagine we have a few more documents in our set, as follows: var books=[ { "id": 1, "title": "The Bourne Identity", "author": "Robert Ludlow" }, { "id": 2, "title": "The Bourne Ultimatum", "author": "Robert Ludlow" }, { "id": 3, "title": "The Godfather", "author": "Mario Puzzo" }, { "id": 4, "title": "The Bourne Supremacy", "author": "Robert Ludlow" }, { "id": 5, "title": "The Family", "author": "Mario Puzzo" }, { "id": 6, "title": "Wiseguy", "author": "Nicholas Pileggi" } ]; We'll still create our index using the same map function because it provides a way of accessing a book by its author. Now imagine that we want to know how many books an author has written, or (assuming we had more data) the average number of pages written by an author. These questions are not possible to answer with a map function alone. Each application of the map function knows nothing about the previous application. In other words, there is no way for you to compare or accumulate information about one author's book to another book by the same author. Fortunately, there is a solution to this problem. As you've probably guessed, it's the use of a reduce function. As a somewhat contrived example, consider this JavaScript: mapped = books.map(function (book) { return ([book.id, book.author]); }); counts = {} reduced = mapped.reduce(function(prev, cur, idx, arr) { var key = cur[1]; if (! counts[key]) counts[key] = 0; ++counts[key] }, null); This code doesn't quite accurately reflect the way you would count books with Couchbase but it illustrates the basic idea. You look for each occurrence of a key (author) and increment a counter when it is found. With Couchbase MapReduce, the mapped structure is supplied to the reduce() function in a better format. You won't need to keep track of items in a dictionary. Couchbase views At this point, you should have a general sense of what MapReduce is, where it came from, and how it will affect the creation of a Couchbase Server view. So without further ado, let's see how to write our first Couchbase view. In fact, there were two to choose from. The bucket we'll use is beer-sample. If you didn't install it, don't worry. You can add it by opening the Couchbase Console and navigating to the Settings tab. Here, you'll find the option to install the bucket, as shown next: First, you need to understand the document structures with which you're working. The following JSON object is a beer document (abbreviated for brevity): { "name": "Sundog", "type": "beer", "brewery_id": "new_holland_brewing_company", "description": "Sundog is an amber ale...", "style": "American-Style Amber/Red Ale", "category": "North American Ale" } As you can see, the beer documents have several properties. We're going to create an index to let us query these documents by name. In SQL, the query would look like this: SELECT Id FROM Beers WHERE Name = ? You might be wondering why the SQL example includes only the Id column in its projection. For now, just know that to query a document using a view with Couchbase, the property by which you're querying must be included in an index. To create that index, we'll write a map function. The simplest example of a map function to query beer documents by name is as follows: function(doc) { emit(doc.name); } This body of the map function has only one line. It calls the built-in Couchbase emit() function. This function is used to signal that a value should be indexed. The output of this map function will be an array of names. The beer-sample bucket includes brewery data as well. These documents look like the following code (abbreviated for brevity): { "name": "Thomas Hooker Brewing", "city": "Bloomfield", "state": "Connecticut", "website": "http://www.hookerbeer.com/", "type": "brewery" } If we reexamine our map function, we'll see an obvious problem; both the brewery and beer documents have a name property. When this map function is applied to the documents in the bucket, it will create an index with documents from either the brewery or beer documents. The problem is that Couchbase documents exist in a single container—the bucket. There is no namespace for a set of related documents. The solution has typically involved including a type or docType property on each document. The value of this property is used to distinguish one document from another. In the case of the beer-sample database, beer documents have type = "beer" and brewery documents have type = "brewery". Therefore, we are easily able to modify our map function to create an index only on beer documents: function(doc) { if (doc.type == "beer") { emit(doc.name); } } The emit() function actually takes two arguments. The first, as we've seen, emits a value to be indexed. The second argument is an optional value and is used by the reduce function. Imagine that we want to count the number of beer types in a particular category. In SQL, we would write the following query: SELECT Category, COUNT(*) FROM Beers GROUP BY Category To achieve the same functionality with Couchbase Server, we'll need to use both map and reduce functions. First, let's write the map. It will create an index on the category property: function(doc) { if (doc.type == "beer") { emit(doc.category, 1); } } The only real difference between our category index and our name index is that we're including an argument for the value parameter of the emit() function. What we'll do with that value is simply count them. This counting will be done in our reduce function: function(keys, values) { return values.length; } In this example, the values parameter will be given to the reduce function as a list of all values associated with a particular key. In our case, for each beer category, there will be a list of ones (that is, [1, 1, 1, 1, 1, 1]). Couchbase also provides a built-in _count function. It can be used in place of the entire reduce function in the preceding example. Now that we've seen the basic requirements when creating an actual Couchbase view, it's time to add a view to our bucket. The easiest way to do so is to use the Couchbase Console. Summary In this article, you learned the purpose of secondary indexes in a key/value store. We dug deep into MapReduce, both in terms of its history in functional languages and as a tool for NoSQL and big data systems. Resources for Article: Further resources on this subject: Map Reduce? [article] Introduction to Mapreduce [article] Working with Apps Splunk [article]

0
0
4795

article-image-oracle-integration-and-consolidation-products

Packt

09 Aug 2011

11 min read

Oracle Integration and Consolidation Products

Packt

09 Aug 2011

11 min read

Oracle Information Integration, Migration, and Consolidation The definitive book and eBook guide to Oracle information integration and migration in a heterogeneous world Data services Data services are at the leading edge of data integration. Traditional data integration involves moving data to a central repository or accessing data virtually through SQL-based interfaces. Data services are a means of making data a 'first class' citizen in your SOA. Recently, the idea of SOA-enabled data services has taken off in the IT industry. This is not any different than accessing data using SQL, JDBC, or ODBC. What is new is that your service-based architecture can now view any database access service as a web service. Service Component Architecture (SCA) plays a big role in data services as now data services created and deployed using Oracle BPEL, Oracle ESB, and other Oracle SOA products can be part of an end-to-end data services platform. No longer do data services deployed in one of the SOA products have to be deployed in another Oracle SOA product. SCA makes it possible to call a BPEL component from Oracle Service Bus and vice versa. Oracle Data Integration Suite Oracle Data Integration (ODI)Suite includes the Oracle Service Bus to publish and subscribe messaging capabilities. Process orchestration capabilities are provided by Oracle BPEL Process Manager, and can be configured to support rule-based, event-based, and data-based delivery services. The Oracle Data Quality for Data Integrator, Oracle Data Profiling products, and Oracle Hyperion Data Relationship Manager provide best-in-class capabilities for data governance, change management hierarchical data management, and provides the foundation for reference data management of any kind. ODI Suite allows you to create data services that can be used in your SCA environment. These data services can be created in ODI, Oracle BPEL or the Oracle Service Bus. You can surround your SCA data services with Oracle Data Quality and Hyperion Data Relationship to cleanse your data and provide master data management. ODI Suite effectively serves two purposes: Bundle Oracle data integration solutions as most customers will need ODI, Oracle BPEL, Oracle Service Bus, and data quality and profiling in order to build a complete data services solution Compete with similar offerings from IBM (InfoSphere Information Server) and Microsoft (BizTalk 2010) that offer complete EII solutions in one offering The ODI Suite data service source and target data sources along with development languages and tools supported are: Data sourceData targetDevelopment languages and toolsERPs, CRMs, B2B systems, flat files, XML data, LDAP, JDBC, ODBCAny data sourceSQL, Java, GUI The most likely instances or use cases when ODI Suite would be the Oracle product or tool selected are: SCA-based data services An end-to-end EII and data migration solution Data services can be used to expose any data source as a service. Once a data service is created, it is accessible and consumable by any web service-enabled product. In the case of Oracle, this is the entire set of products in the Oracle Fusion Middleware Suite. Data consolidation The mainframe was the ultimate solution when it came to data consolidation. All data in an enterprise resided in one or several mainframes that were physically located in a data center. The rise of the hardware and software appliance has created a 'what is old is new again' situation; a hardware and software solution that is sold as one product. Oracle has released the Oracle Exadata appliance and IBM acquired the pure database warehouse appliance company Netezza, HP, and Microsoft announced work on an SQL Server database appliance, and even companies like SAP, EMC, and CICSO are talking about the benefits of database appliances. The difference is (and it is a big difference) that the present architecture is based upon open standards hardware platforms, operating systems, client devices, network protocols, interfaces, and databases. So, you now have a database appliance that is not based upon proprietary operating systems, hardware, network components, software, and data disks. Another very important difference is that enterprise software COTS packages, management tools, and other software infrastructure tools will work across any of these appliance solutions. One of the challenges for customers that run their business on the mainframe is that they are 'locked into' vendor- specific sorting, reporting, job scheduling, system management, and other products usually only offered from IBM, CA, BMC, or Compuware. Mainframe customers also suffer from a lack of choice when it comes to COTS applications. Since appliances are based upon open systems, there is an incredibly large software ecosystem. Oracle Exadata Oracle Exadata is the only database appliance that runs both data warehouse and OLTP applications. Oracle Exadata is an appliance that includes every component an IT organization needs to process information—from a grid database down to the power supply. It is a hardware and software solution that can be up and running in an enterprise in weeks instead of months for typical IT database solutions. Exadata provides high speed data access using a combination of hardware and a database engine that runs at the storage tier. Typical database solutions have to use indexes to retrieve data from storage and then pull large volumes of data into the core database engine, which churns through millions of rows of data to send a handful of row results to the client. Exadata eliminates the need for indexes and data engine processing by placing a lightweight database engine at the storage tier. Therefore, the database engine is only provided with the end result and does not have to utilize complicated indexing schemes, large amounts of CPU, and memory to produce the end results set. Exadata's capabilities to run large OLTP and data warehouse applications, or a large number of smaller OLTP and data warehouse applications on one machine make it a great platform for data consolidation. The first release of Oracle Exadata was based upon HP hardware and was for data warehouses only. The second release came out shortly before Oracle acquired Sun. This release was based upon Sun hardware, but ironically not on Sun Sparc or Solaris (Solaris is now an OS option). The Exadata source and target data sources along with development languages and tools supported are: Data sourceData targetDevelopment languages and toolsAny (depending upon the data source this may involve an intensive migration effort)Oracle ExadataSQL, PL/SQL, Java The most likely instances or use cases when Exadata would be the Oracle product or tool selected are: A move from hundreds of standalone database hardware and software nodes to one database machine A reduction in hardware and software vendors, and one vendor for hardware and software support Keepin It Real The database appliance has become the latest trend in the IT industry. Data warehouse appliances like Netezza have been around for a number of years. Oracle has been the first vendor to offer an open systems database appliance for both DW and OLTP environments. Data grid Instead of consolidating databases physically or accessing the data where it resides, a data grid places the data into an in-memory middle tier. Like physical federation, the data is being placed into a centralized data repository. Unlike physical federation, the data is not placed into a traditional RDBMS system (Oracle database), but into a high-speed memory-based data grid. Oracle offers both a Java and SQL-based data grid solution. The decision of what product to implement often depends on where the corporations system, database, and application developer skills are strongest. If your organization has strong Java or .Net skills and is more comfortable with application servers than databases, then Oracle Coherence is typically the product of choice. If you have strong database administration and SQL skills, then Oracle TimesTen is probably a better solution. The Oracle Exalogic solution takes the data grid to another level by placing Oracle Coherence, along with other Oracle hardware and software solutions, into an appliance. This appliance provides an 'end-to-end' solution or data grid 'in a box'. It reduces management, increases performance, reduces TCO, and eliminates the need for the customer having to build their own hardware and software solution using multiple vendor solutions that may not be certified to work together. Oracle Coherence Oracle Coherence is an in-memory data grid solution that offers next generation Extreme Transaction Processing (XTP). Organizations can predictably scale mission critical applications by using Oracle Coherence to provide fast and reliable access to frequently used data. Oracle Coherence enables customers to push data closer to the application for faster access and greater resource utilization. By automatically and dynamically partitioning data in memory across multiple servers, Oracle Coherence enables continuous data availability and transactional integrity, even in the event of a server failure. Oracle Coherence was purchased from Tangosol Software in 2007. Coherence was an industry-leading middle tier caching solution. The product only offered a Java solution at the time of acquisition, but a .NET offering was already scheduled before the acquisition took place. The Oracle Coherence source and target data sources along with development languages and tools supported are: Data sourceData targetDevelopment languages and toolsJDBC, any data source accessible through Oracle SOA adaptersCoherenceJava, .Net The most likely instances or use cases when Oracle Coherence would be the Oracle product or tool selected are: When it is necessary to replace custom, hand-coded solutions that cache data in middle tier Java or .NET application servers Your company's strengths are in application servers Java or .NET Oracle TimesTen Oracle TimesTen is a data grid/cache offering that has similar characteristics to Oracle Coherence. Both of the solutions offer a product that caches data in the middle tier for high throughput and high transaction volumes. The technology implementations are much different. TimesTen is an in-memory database solution that is accessed through SQL and the data storage mechanism is a relational database. The TimesTen solution data grid can be implemented across a wide area network (WAN) and the nodes that make up the data grid are kept in sync with your back end Oracle database using Oracle Cache Connect. Cache Connect is also used to automatically refresh the TimesTen database on a push or pull basis from your Oracle backend database. Cache Connect can also be used to keep TimesTen databases spread across the global in sync. Oracle TimesTen offers both read and update support, unlike other database in- memory solutions. This means that Oracle TimesTen can be used to run your business even if your backend database is down. The transactions that occur during the downtime are queued and applied to your backend database once it is restored. The other similarity between Oracle Coherence and TimesTen is that they both were acquired technologies. Oracle TimesTen was acquired from the company TimesTen in 2005. The Oracle TimesTen source and target data sources along with development languages and tools supported are: Data sourceData targetDevelopment languages and toolsOracleTimesTenSQL, CLI The most likely instances or use cases when Oracle TimesTen would be the Oracle product or tool selected are: For web-based read-only applications that require a millisecond responseand data close to where request is made For applications where updates need not be reflected back to the user in real-time Oracle Exalogic A simplified explanation of Oracle Exalogic is that it is Exadata for the middle tier application infrastructure. While Exalogic is optimized for enterprise Java, it is also a suitable environment for the thousands of third-party and custom Linux and Solaris applications widely deployed on Java, .NET, Visual Basic, PHP, or any other programming language. The core software components of Exalogic are WebLogic, Coherence, JRocket or Java Hotspot, and Oracle Linux or Solaris. Oracle Exalogic has an optimized version of WebLogic to run Java applications more efficiently and faster than a typical WebLogic implementation. Oracle Exalogic is branded with the Oracle Elastic cloud as an enterprise application consolidation platform. This means that applications can be added on demand and in real-time. Data can be cached in Oracle Coherence for a high speed, centralized, data grid sharable on the cloud. The Exalogic source and target data sources along with development languages and tools supported are: Data sourceData targetDevelopment languages and toolsAny data sourceCoherenceAny language The most likely instances or use cases when Exalogic would be the Oracle product or tool selected are: Enterprise consolidated application server platform Cloud hosted solution Upgrade and Consolidation of hardware or software Oracle Coherence is the product of choice for Java and .NET versed development shops. Oracle TimesTen is more applicable to database-centric and shops more comfortable with SQL.

0
0
4765

Packt

08 Jan 2016

14 min read

Advanced Shiny Functions

Packt

08 Jan 2016

14 min read

In this article by Chris Beeley, author of the book, Web Application Development with R using Shiny - Second Edition, we are going to extend our toolkit by learning about advanced Shiny functions. These allow you to take control of the fine details of your application, including the interface, reactivity, data, and graphics. We will cover the following topics: Learn how to show and hide parts of the interface Change the interface reactively Finely control reactivity, so functions and outputs run at the appropriate time Use URLs and reactive Shiny functions to populate and alter the selections within an interface Upload and download data to and from a Shiny application Produce beautiful tables with the DataTables jQuery library (For more resources related to this topic, see here.) Summary of the application We're going to add a lot of new functionality to the application, and it won't be possible to explain every piece of code before we encounter it. Several of the new functions depend on at least one other function, which means that you will see some of the functions for the first time, whereas a different function is being introduced. It's important, therefore, that you concentrate on whichever function is being explained and wait until later in the article to understand the whole piece of code. In order to help you understand what the code does as you go along it is worth quickly reviewing the actual functionality of the application now. In terms of the functionality, which has been added to the application, it is now possible to select not only the network domain from which browser hits originate but also the country of origin. The draw map function now features a button in the UI, which prevents the application from updating the map each time new data is selected, the map is redrawn only when the button is pushed. This is to prevent minor updates to the data from wasting processor time before the user has finished making their final data selection. A Download report button has been added, which sends some of the output as a static file to a new webpage for the user to print or download. An animated graph of trend has been added; this will be explained in detail in the relevant section. Finally, a table of data has been added, which summarizes mean values of each of the selectable data summaries across the different countries of origin. Downloading data from RGoogleAnalytics The code is given and briefly summarized to give you a feeling for how to use it in the following section. Note that my username and password have been replaced with XXXX; you can get your own user details from the Google Analytics website. Also, note that this code is not included on the GitHub because it requires the username and password to be present in order for it to work: library(RGoogleAnalytics) ### Generate the oauth_token object oauth_token <- Auth(client.id = "xxxx", client.secret = "xxxx") # Save the token object for future sessions save(oauth_token, file = "oauth_token") Once you have your client.id and client.secret from the Google Analytics website, the preceding code will direct you to a browser to authenticate the application and save the authorization within oauth_token. This can be loaded in future sessions to save from reauthenticating each time as follows: # Load the token object and validate for new run load("oauth_token") ValidateToken(oauth_token) The preceding code will load the token in subsequent sessions. The validateToken() function is necessary each time because the authorization will expire after a time this function will renew the authentication: ## list of metrics and dimensions query.list <- Init(start.date = "2013-01-01", end.date = as.character(Sys.Date()), dimensions = "ga:country,ga:latitude,ga:longitude, ga:networkDomain,ga:date", metrics = "ga:users,ga:newUsers,ga:sessions, ga:bounceRate,ga:sessionDuration", max.results = 10000, table.id = "ga:71364313") gadf = GetReportData(QueryBuilder(query.list), token = oauth_token, paginate_query = FALSE) Finally, the metrics and dimensions of interest (for more on metrics and dimensions, see the documentation of the Google Analytics API online) are placed within a list and downloaded with the GetReportData() function as follows: ...[data tidying functions]... save(gadf, file = "gadf.Rdata") The data tidying that is carried out at the end is omitted here for brevity, as you can see at the end the data is saved as gadf.Rdata ready to load within the application. Animation Animation is surprisingly easy. The sliderInput() function, which gives an HTML widget that allows the selection of a number along a line, has an optional animation function that will increment a variable by a set amount every time a specified unit of time elapses. This allows you to very easily produce a graphic that animates. In the following example, we are going to look at the monthly graph and plot a linear trend line through the first 20% of the data (0–20% of the data). Then, we are going to increment the percentage value that selects the portion of the data by 5% and plot a linear through that portion of data (5–25% of the data). Then, increment again from 10% to 30% and plot another line and so on. There is a static image in the following screenshot: The slider input is set up as follows, with an ID, label, minimum value, maximum value, initial value, step between values, and the animation options, giving the delay in milliseconds and whether the animation should loop: sliderInput("animation", "Trend over time", min = 0, max = 80, value = 0, step = 5, animate = animationOptions(interval = 1000, loop = TRUE) ) Having set this up, the animated graph code is pretty simple, looking very much like the monthly graph data except with the linear smooth based on a subset of the data instead of the whole dataset. The graph is set up as before and then a subset of the data is produced on which the linear smooth can be based: groupByDate <- group_by(passData(), YearMonth, networkDomain) %>% summarise(meanSession = mean(sessionDuration, na.rm = TRUE), users = sum(users), newUsers = sum(newUsers), sessions = sum(sessions)) groupByDate$Date <- as.Date(paste0(groupByDate$YearMonth, "01"), format = "%Y%m%d") smoothData <- groupByDate[groupByDate$Date %in% quantile(groupByDate$Date, input$animation / 100, type = 1): quantile(groupByDate$Date, (input$animation + 20) / 100, type = 1), ] We won't get too distracted by this code, but essentially, it tests to see which of the whole date range falls in a range defined by percentage quantiles based on the sliderInput() values. See ?quantile for more information. Finally, the linear smooth is drawn with an extra data argument to tell ggplot2 to base the line only on the smaller smoothData object and not the whole range: ggplot(groupByDate, aes_string(x = "Date", y = input$outputRequired, group = "networkDomain", colour = "networkDomain") ) + geom_line() + geom_smooth(data = smoothData, method = "lm", colour = "black" ) Not bad for a few lines of code. We have both ggplot2 and Shiny to thank for how easy this is. Streamline the UI by hiding elements This is a simple function that you are certainly going to need if you build even a moderately complex application. Those of you who have been doing extra credit exercises and/or experimenting with your own applications will probably have already wished for this or, indeed, have already found it. conditionalPanel() allows you to show/hide UI elements based on other selections within the UI. The function takes a condition (in JavaScript, but the form and syntax will be familiar from many languages) and a UI element and displays the UI only when the condition is true. This has actually used a couple of times in the advanced GA application, and indeed in all the applications, I've ever written of even moderate complexity. We're going to show the option to smooth the trend graph only when the trend graph tab is displayed, and we're going to show the controls for the animated graph only when the animated graph tab is displayed. Naming tabPanel elements In order to allow testing for which tab is currently selected, we're going to have to first give the tabs of the tabbed output names. This is done as follows (with the new code in bold): tabsetPanel(id = "theTabs", # give tabsetPanel a name tabPanel("Summary", textOutput("textDisplay"), value = "summary"), tabPanel("Trend", plotOutput("trend"), value = "trend"), tabPanel("Animated", plotOutput("animated"), value = "animated"), tabPanel("Map", plotOutput("ggplotMap"), value = "map"), tabPanel("Table", DT::dataTableOutput("countryTable"), value = "table") As you can see, the whole panel is given an ID (theTabs) and then each tabPanel is also given a name (summary, trend, animated, map, and table). They are referred to in the server.R file very simply as input$theTabs. Finally, we can make our changes to ui.R to remove parts of the UI based on tab selection: conditionalPanel( condition = "input.theTabs == 'trend'", checkboxInput("smooth", label = "Add smoother?", # add smoother value = FALSE) ), conditionalPanel( condition = "input.theTabs == 'animated'", sliderInput("animation", "Trend over time", min = 0, max = 80, value = 0, step = 5, animate = animationOptions(interval = 1000, loop = TRUE) ) ) As you can see, the condition appears very R/Shiny-like, except with the . operator familiar to JavaScript users in place of $. This is a very simple but powerful way of making sure that your UI is not cluttered with an irrelevant material. Beautiful tables with DataTable The latest version of Shiny has added support to draw tables using the wonderful DataTables jQuery library. This will enable your users to search and sort through large tables very easily. To see DataTable in action, visit the homepage at http://datatables.net/. The version in this application summarizes the values of different variables across the different countries from which browser hits originate and looks as follows: The package can be installed using install.packages("DT") and needs to be loaded in the preamble to the server.R file with library(DT). Once this is done using the package is quite straightforward. There are two functions: one in server.R (renderDataTable) and other in ui.R (dataTableOutput). They are used as following: ### server. R output$countryTable <- DT::renderDataTable ({ groupCountry <- group_by(passData(), country) groupByCountry <- summarise(groupCountry, meanSession = mean(sessionDuration), users = log(sum(users)), sessions = log(sum(sessions)) ) datatable(groupByCountry) }) ### ui.R tabPanel("Table", DT::dataTableOutput("countryTable"), value = "table") Anything that returns a dataframe or a matrix can be used within renderDataTable(). Note that as of Shiny V. 0.12, the Shiny functions renderDataTable() and dataTableOutput() functions are deprecated: you should use the DT equivalents of the same name, as in the preceding code adding DT:: before each function name specifies that the function should be drawn from that package. Reactive user interfaces Another trick you will definitely want up your sleeve at some point is a reactive user interface. This enables you to change your UI (for example, the number or content of radio buttons) based on reactive functions. For example, consider an application that I wrote related to survey responses across a broad range of health services in different areas. The services are related to each other in quite a complex hierarchy, and over time, different areas and services respond (or cease to exist, or merge, or change their name), which means that for each time period the user might be interested in, there would be a totally different set of areas and services. The only sensible solution to this problem is to have the user tell you which area and date range they are interested in and then give them back the correct list of services that have survey responses within that area and date range. The example we're going to look at is a little simpler than this, just to keep from getting bogged down in too much detail, but the principle is exactly the same, and you should not find this idea too difficult to adapt to your own UI. We are going to allow users to constrain their data by the country of origin of the browser hit. Although we could design the UI by simply taking all the countries that exist in the entire dataset and placing them all in a combo box to be selected, it is a lot cleaner to only allow the user to select from the countries that are actually present within the particular date range they have selected. This has the added advantage of preventing the user from selecting any countries of origin, which do not have any browser hits within the currently selected dataset. In order to do this, we are going to create a reactive user interface, that is, one that changes based on data values that come about from user input. Reactive user interface example – server.R When you are making a reactive user interface, the big difference is that instead of writing your UI definition in your ui.R file, you place it in server.R and wrap it in renderUI(). Then, point to it from your ui.R file. Let's have a look at the relevant bit of the server.R file: output$reactCountries <- renderUI({ countryList = unique(as.character(passData()$country)) selectInput("theCountries", "Choose country", countryList) }) The first line takes the reactive dataset that contains only the data between the dates selected by the user and gives all the unique values of countries within it. The second line is a widget type we have not used yet, which generates a combo box. The usual id and label arguments are given, followed by the values that the combo box can take. This is taken from the variable defined in the first line. Reactive user interface example – ui.R The ui.R file merely needs to point to the reactive definition, as shown in the following line of code (just add it in to the list of widgets within sidebarPanel()): uiOutput("reactCountries") You can now point to the value of the widget in the usual way as input$subDomains. Note that you do not use the name as defined in the call to renderUI(), that is, reactCountries, but rather the name as defined within it, that is, theCountries. Progress bars It is quite common within Shiny applications and in analytics generally to have computations or data fetches that take a long time. However, even using all these tools, it will sometimes be necessary for the user to wait some time before their output is returned. In cases like this, it is a good practice to do two things: first, to inform that the server is processing the request and has not simply crashed or otherwise failed, and second to give the user some idea of how much time has elapsed since they requested the output and how much time they have remaining to wait. This is achieved very simply in Shiny using the withProgress() function. This function defaults to measuring progress on a scale from 0 to 1 and produces a loading bar at the top of the application with the information from the message and detail arguments of the loading function. You can see in the following code, the withProgress function is used to wrap a function (in this case, the function that draws the map), with message and detail arguments describing what is happened and an initial value of 0 (value = 0, that is, no progress yet): withProgress(message = 'Please wait', detail = 'Drawing map...', value = 0, { ... function code... } ) As the code is stepped through, the value of progress can steadily be increased from 0 to 1 (for example, in a for() loop) using the following code: incProgress(1/3) The third time this is called, the value of progress will be 1, which indicates that the function has completed (although other values of progress can be selected where necessary, see ?withProgess()). To summarize, the finished code looks as follows: withProgress(message = 'Please wait', detail = 'Drawing map...', value = 0, { ... function code... incProgress(1/3) .. . function code... incProgress(1/3) ... function code... incProgress(1/3) } ) It's very simple. Again, have a look at the application to see it in action. Summary In this article, you have now seen most of the functionality within Shiny. It's a relatively small but powerful toolbox with which you can build a vast array of useful and intuitive applications with comparatively little effort. In this respect, ggplot2 is rather a good companion for Shiny because it too offers you a fairly limited selection of functions with which knowledgeable users can very quickly build many different graphical outputs. Resources for Article: Further resources on this subject: Introducing R, RStudio, and Shiny[article] Introducing Bayesian Inference[article] R ─ Classification and Regression Trees[article]

0
0
4752

article-image-big-data-analysis-r-and-hadoop

Packt

26 Mar 2015

37 min read

Big Data Analysis (R and Hadoop)

Packt

26 Mar 2015

37 min read

0
0
4746

article-image-creating-your-first-collection-simple

Packt

26 Jun 2013

7 min read

Creating your first collection (Simple)

Packt

26 Jun 2013

7 min read

(For more resources related to this topic, see here.) Getting ready Assuming that you have walked through the tutorial, you should be nearly ready with the setup. Still, it does not hurt to go through the checklist: Be familiar that you know how to start your operating system's shell (cmd.exe on Windows, Terminal/iTerm on Mac, and sh/bash/tch/zsh on Unix). Ensure that running the java –version command on the shell's prompt returns at least Version 1.6. You may need to upgrade if you have an older version. Ensure that you know where you unpacked the Solr distribution and the full path to the example directory within that. You needed that directory for the tutorial, but that's also where we are going to start our own Solr instance. That allows us to easily run an embedded Jetty web server and to also find all the additional JAR files that Solr needs to operate properly. Now, create a directory where we will store our indexes and experiments. It can be anywhere on your drive. As Solr can run on any operating system where Java can run, we will use SOLRINDEXING as a name whenever we refer to that directory. Make sure to use absolute path names when substituting with your real path for the directory. How to do it... As our first example, we will create an index that stores and allows for the searching of simplified e-mail information. For now, we will just look at the addr_from and addr_to e-mail addresses and the subject line. You will see that it takes only two simple configuration files to get the basic Solr index working. Under the SOLR-INDEXING directory, create a collection1 directory and inside that create a conf directory. In the conf directory, create two files: schema.xml and solrconfig.xml. The schema.xml file should have the following content: <?xml version="1.0" encoding="UTF-8" ?><schema version="1.5"><fields><field name="id" type="string" indexed="true" stored="true"required="true"/><field name="addr_from" type="string" indexed="true"stored="true" required="true"/><field name="addr_to" type="string" indexed="true"stored="true" required="true"/><field name="subject" type="string" indexed="true"stored="true" required="true"/></fields><uniqueKey>id</uniqueKey><types><fieldType name="string" class="solr.StrField" /></types></schema> The solrconfig.xml file should have the following content: <?xml version="1.0" encoding="UTF-8" ?><config><luceneMatchVersion>LUCENE_43</luceneMatchVersion><requestDispatcher handleSelect="false"><httpCaching never304="true" /></requestDispatcher><requestHandler name="/select" class="solr.SearchHandler" /><requestHandler name="/update" class="solr.UpdateRequestHandler" /><requestHandler name="/admin" class="solr.admin.AdminHandlers" /><requestHandler name="/analysis/field" class="solr.FieldAnalysisRequestHandler" startup="lazy" /></config> That is it. Now, let's start our just-created Solr instance. Open a new shell (we'll need the current one later). On that shell's command prompt, change the directory to the example directory of the Solr distribution and run the following command: java -Dsolr.solr.home=SOLR-INDEXING -jar start.jar Notice that solr.solr.home is not a typo; you do need the solr part twice. And, as always, if you have spaces in your paths (now or later), you may need to escape them in platform-specific ways, such as with backslashes on Unix/Linux or by quoting the whole value. In the window of your shell, you should see a long list of messages that you can safely ignore (at least for now). You can verify that everything is working fine by checking for the following three elements: The long list of messages should finish with a message like Started SocketConnector@0.0.0.0:8983. This means that Solr is now running on port 8983 successfully. You should now have a directory called data, right next to the directory called conf that we created earlier. If you open the web browser and go to the http:// localhost:8983/ solr/, you should see a web-based admin interface that makes testing and troubleshooting your Solr instance much easier. We will be using this interface later, so do spend a couple of minutes clicking around now. Now, let's load some actual content into our collection: Copy post.jar from the Solr distribution's example/exampledocs directory to our root SOLR-INDEXING directory. Create a file called input1.csv in the collection1 directory, next to the conf and data directories with the following three-line content: id,addr_from,addr_to,subjectemail1,fulan@acme.example.com,kari@acme.example.com,"Kari,we need more Junior Java engineers"email2,kari@acme.example.com,maija@acme.example.com,"Updating vacancy description" Run the import command from the command line in the SOLR-INDEXING directory (one long command; do not split it across lines): java -Dauto -Durl=http://localhost:8983/solr/collection1/update -jar post.jar collection1/input1.csv You should see the following in one of the message lines: "1 files indexed". If you now open a web browser and go to http:// localhost:8983/solr/ collection1/select?q=*%3A*&wt=ruby&indent=true, you should see Solr output with all the three documents displayed on the screen in a somewhat readable format. How it works... We have created two files to get our example working. Let's review what they mean and how they fit together: The schema.xml file in the collection's conf directory defines the actual shape of data that you want to store and index. The fields define a structure of a record. Each field has a type, which is also defined in the same file. The field defines whether it is stored, indexed, required, multivalued, or a small number of other, more advanced properties. On the other hand, the field type defines what is actually done to the field when it is indexed and when it is searched. We will explore all of these later. The solrconfig.xml file also in the collection's conf directory defines and tunes the components that make up Solr's runtime environment. At the very least, it needs to define which URLs can be called to add records to a collection (here, /update), which to query a collection (here, /select), and which to do various administrative tasks (here, /admin and /analysis/field). Once Solr started, it created a single collection with the default name of collection1, assigned an update handler to it at the /solr/collection1/update URL and search handler at the /solr/collection1/select URL (as per solrconfig.xml). At that point, Solr was ready for the data to be imported into the four required fields (as per schema.xml). We then proceeded to populate the index from a CSV file (one of many update formats available) and then verified that the records are all present in an indented Ruby format (again, one of many result formats available). Summary This article helped you create a basic Solr collection and populate it with a simple dataset in CSV format. Resources for Article : Further resources on this subject: Integrating Solr: Ruby on Rails Integration [Article] Indexing Data in Solr 1.4 Enterprise Search Server: Part2 [Article] Text Search, your Database or Solr [Article]

0
0
4693

article-image-setting-synchronous-replication

Packt

10 Aug 2015

17 min read

Setting Up Synchronous Replication

Packt

10 Aug 2015

17 min read

0
0
4683

Packt

15 Jun 2015

3 min read

Neural Network in Azure ML

Packt

15 Jun 2015

3 min read

In this article written by Sumit Mund, author of the book Microsoft Azure Machine Learning, we will learn about neural network, which is a kind of machine learning algorithm inspired by the computational models of a human brain. It builds a network of computation units, neurons, or nodes. In a typical network, there are three layers of nodes. First, the input layer, followed by the middle layer or hidden layer, and in the end, the output layer. Neural network algorithms can be used for both classification and regression problems. (For more resources related to this topic, see here.) The number of nodes in a layer depends on the problem and how you construct the network to get the best result. Usually, the number of nodes in an input layer is equal to the number of features in the dataset. For a regression problem, the number of nodes in the output layer is one while for a classification problem, it is equal to the number of class or label. Each node in a layer gets connected to all the nodes in the next layer. Each edge that connects between nodes is assigned a weight. So, a neural network can well be imagined as a weighted directed acyclic graph. In a typical neural network, as shown in the preceding figure, the middle layer or hidden layer contains the number nodes, which are chosen to make the computation right. While there is no formula or agreed convention for this, it is often optimized after trying out different options. Azure Machine Learning supports neural network for regression, two-class classification, and multiclass classification. It provides a separate module for each kind of problem and lets the users tune it with different parameters, such as the number of hidden nodes, number of iterations to train the model, and so on. A special kind of neural network algorithms where there are more than one hidden layers is known as deep networks or deep learning algorithms. Azure Machine Learning allows us to choose the number of hidden nodes as a property value of the neural network module. These kind of neural networks are getting increasingly popular these days because of their remarkable results and because they allow us to model complex and nonlinear scenarios. There are many kinds of deep networks, but recently, a special kind of deep network known as the convolutional neural network got very popular because of its significant performance in image recognition or classification problems. Azure Machine Learning supports the convolutional neural network. For simple networks with three layers, this can be done through a UI just by choosing parameters. However, to build a deep network like a convolutional deep network, it’s not easy to do so through a UI. So, Azure Machine Learning supports a new kind of language called Net#, which allows you to script different kinds of neural network inside ML Studio by defining different node, the connections (edges), and kind of connections. While deep networks are complex to build and train, Net# makes things relatively easy and simple. Though complex, neural networks are very powerful and Azure Machine Learning makes it fun to work with these be it three-layered shallow networks or multilayer deep networks. Resources for Article: Further resources on this subject: Security in Microsoft Azure [article] High Availability, Protection, and Recovery using Microsoft Azure [article] Managing Microsoft Cloud [article]

0
0
4680

article-image-oracle-goldengate-considerations-designing-solution

Packt

24 Feb 2011

8 min read

Oracle GoldenGate: Considerations for Designing a Solution

Packt

24 Feb 2011

8 min read

Oracle GoldenGate 11g Implementer's guide Design, install, and configure high-performance data replication solutions using Oracle GoldenGate The very first book on GoldenGate, focused on design and performance tuning in enterprise-wide environments Exhaustive coverage and analysis of all aspects of the GoldenGate software implementation, including design, installation, and advanced configuration Migrate your data replication solution from Oracle Streams to GoldenGate Design a GoldenGate solution that meets all the functional and non-functional requirements of your system Written in a simple illustrative manner, providing step-by-step guidance with discussion points Goes way beyond the manual, appealing to Solution Architects, System Administrators and Database Administrators At a high level, the design must include the following generic requirements: Hardware Software Network Storage Performance All the above must be factored into the overall system architecture. So let's take a look at some of the options and the key design issues. Replication methods So you have a fast reliable network between your source and target sites. You also have a schema design that is scalable and logically split. You now need to choose the replication architecture; One to One, One to Many, active-active, active-passive, and so on. This consideration may already be answered for you by the sheer fact of what the system has to achieve. Let's take a look at some configuration options. Active-active Let's assume a multi-national computer hardware company has an office in London and New York. Data entry clerks are employed at both sites inputting orders into an Order Management System. There is also a procurement department that updates the system inventory with volumes of stock and new products related to a US or European market. European countries are managed by London, and the US States are managed by New York. A requirement exists where the underlying database systems must be kept in synchronisation. Should one of the systems fail, London users can connect to New York and vice-versa allowing business to continue and orders to be taken. Oracle GoldenGate's active-active architecture provides the best solution to this requirement, ensuring that the database systems on both sides of the pond are kept synchronised in case of failure. Another feature the active-active configuration has to offer is the ability to load balance operations. Rather than have effectively a DR site in both locations, the European users could be allowed access to New York and London systems and viceversa. Should a site fail, then the DR solution could be quickly implemented. Active-passive The active-passive bi-directional configuration replicates data from an active primary database to a full replica database. Sticking with the earlier example, the business would need to decide which site is the primary where all users connect. For example, in the event of a failure in London, the application could be configured to failover to New York. Depending on the failure scenario, another option is to start up the passive configuration, effectively turning the active-passive configuration into active-active. Cascading The Cascading GoldenGate topology offers a number of "drop-off" points that are intermediate targets being populated from a single source. The question here is "what data do I drop at which site?" Once this question has been answered by the business, it is then a case of configuring filters in Replicat parameter files allowing just the selected data to be replicated. All of the data is passed on to the next target where it is filtered and applied again. This type of configuration lends itself to a head office system updating its satellite office systems in a round robin fashion. In this case, only the relevant data is replicated at each target site. Another design, is the Hub and Spoke solution, where all target sites are updated simultaneously. This is a typical head office topology, but additional configuration and resources would be required at the source site to ship the data in a timely manner. The CPU, network, and file storage requirements must be sufficient to accommodate and send the data to multiple targets. Physical Standby A Physical Standby database is a robust Oracle DR solution managed by the Oracle Data Guard product. The Physical Standby database is essentially a mirror copy of its Primary, which lends itself perfectly for failover scenarios. However , it is not easy to replicate data from the Physical Standby database, because it does not generate any of its own redo. That said, it is possible to configure GoldenGate to read the archived standby logs in Archive Log Only (ALO) mode. Despite being potentially slower, it may be prudent to feed a downstream system on the DR site using this mechanism, rather than having two data streams configured from the Primary database. This reduces network bandwidth utilization, as shown in the following diagram: Reducing network traffic is particularly important when there is considerable distance between the primary and the DR site. Networking The network should not be taken for granted. It is a fundamental component in data replication and must be considered in the design process. Not only must it be fast, it must be reliable. In the following paragraphs, we look at ways to make our network resilient to faults and subsequent outages, in an effort to maintain zero downtime. Surviving network outages Probably one of your biggest fears in a replication environment is network failure. Should the network fail, the source trail will fill as the transactions continue on the source database, ultimately filling the filesystem to 100% utilization, causing the Extract process to abend. Depending on the length of the outage, data in the database's redologs may be overwritten causing you the additional task of configuring GoldenGate to extract data from the database's archived logs. This is not ideal as you already have the backlog of data in the trail files to ship to the target site once the network is restored. Therefore, ensure there is sufficient disk space available to accommodate data for the longest network outage during the busiest period. Disks are relatively cheap nowadays. Providing ample space for your trail files will help to reduce the recovery time from the network outage. Redundant networks One of the key components in your GoldenGate implementation is the network. Without the ability to transfer data from the source to the target, it is rendered useless. So, you not only need a fast network but one that will always be available. This is where redundant networks come into play, offering speed and reliability. NIC teaming One method of achieving redundancy is Network Interface Card (NIC) teaming or bonding. Here two or more Ethernet ports can be "coupled" to form a bonded network supporting one IP address. The main goal of NIC teaming is to use two or more Ethernet ports connected to two or more different access network switches thus avoiding a single point of failure. The following diagram illustrates the redundant features of NIC teaming: Linux (OEL/RHEL 4 and above) supports NIC teaming with no additional software requirements. It is purely a matter of network configuration stored in text files in the /etc/sysconfig/network-scripts directory. The following steps show how to configure a server for NIC teaming: First, you need to log on as root user and create a bond0 config file using the vi text editor. # vi /etc/sysconfig/network-scripts/ifcfg-bond0 Append the following lines to it, replacing the IP address with your actual IP address, then save file and exit to shell prompt: DEVICE=bond0 IPADDR=192.168.1.20 NETWORK=192.168.1.0 NETMASK=255.255.255.0 USERCTL=no BOOTPROTO=none ONBOOT=yes Choose the Ethernet ports you wish to bond, and then open both configurations in turn using the vi text editor, replacing ethn with the respective port number. # vi /etc/sysconfig/network-scripts/ifcfg-eth2 # vi /etc/sysconfig/network-scripts/ifcfg-eth4 Modify the configuration as follows: DEVICE=ethn USERCTL=no ONBOOT=yes MASTER=bond0 SLAVE=yes BOOTPROTO=none Save the files and exit to shell prompt. To make sure the bonding module is loaded when the bonding interface (bond0) is brought up, you need to modify the kernel modules configuration file: # vi /etc/modprobe.conf Append the following two lines to the file: alias bond0 bonding options bond0 mode=balance-alb miimon=100 Finally, load the bonding module and restart the network services: # modprobe bonding # service network restart You now have a bonded network that will load balance when both physical networks are available, providing additional bandwidth and enhanced performance. Should one network fail, the available bandwidth will be halved, but the network will still be available. Non-functional requirements (NFRs) Irrespective of the functional requirements, the design must also include the nonfunctional requirements (NFR) in order to achieve the overall goal of delivering a robust, high performance, and stable system. Latency One of the main NFRs is performance. How long does it take to replicate a transaction from the source database to the target? This is known as end-to-end latency that typically has a threshold that must not be breeched in order to satisfy the specified NFR. GoldenGate refers to latency as lag, which can be measured at different intervals in the replication process. These are: Source to Extract: The time taken for a record to be processed by the Extract compared to the commit timestamp on the database Replicat to Target: The time taken for the last record to be processed by the Replicat compared to the record creation time in the trail file A well designed system may encounter spikes in latency but it should never be continuous or growing. Trying to tune GoldenGate when the design is poor is a difficult situation to be in. For the system to perform well you may need to revisit the design.

0
0
4669

article-image-lambdaarchitecture-pattern

Packt

19 Jun 2017

8 min read

LambdaArchitecture Pattern

Packt

19 Jun 2017

8 min read

In this article by Tomcy John and Pankaj Misra, the authors of the book, Data Lake For Enterprises, we will learn about how the data in landscape of big data solutions can be made in near real time and certain practices that can be adopted for realizing Lambda Architecture in context of data lake. (For more resources related to this topic, see here.) The concept of a data lake in an enterprise was driven by certain challenges that Enterprises were facing with the way the data was handled, processed, and stored. Initially all the individual applications in the Enterprise, via a natural evolution cycle, started maintaining huge amounts of data into themselves with almost no reuse to other applications in the same enterprise. These created information silos across arious applications. As the next step of evolution, these individual applications started exposing this data across the organization as a data mart access layer over central data warehouse. While data mart solved one part of the problem, other problems still persisted. These problems were more about data governance, data ownership, data accessibility which were required to be resolved so as to have better availability of enterprise relevant data. This is where a need was felt to have data lakes, that could not only make such data available but also could store any form of data and process it so that data is analyzed and kept ready for consumption by consumer applications. In this article, we will look at some of the critical aspects of a data lake and understand why does it matter for an enterprise. If we need to define the term Data Lake, it can be defined as a vast repository of variety of enterprise wide raw information that can be acquired, processed, analyzed and delivered. The information thus handled could be any type of information ranging from structured, semi-structured data to completely unstructured data. Data Lake is expected to be able to derive Enterprise relevant meaning and insights from this information using various analysis and machine learning algorithms. Lambda architecture and data lake Lambda architecture as a pattern provides ways and means to perform highly scalable, performant, distributed computing on large sets of data and yet provide consistent (eventually) data with required processing both in batch as well as in near real time. Lambda architecture defines ways and means to enable scale out architecture across various data load profiles in an enterprise, with low latency expectations. The architecture pattern became significant with the emergence of big data and enterprise’s focus on real-time analytics and digital transformation. The pattern named Lambda (symbol λ) is indicative of a way by which data comes from two places (batch and speed - the curved parts of the lambda symbol) which then combines and served through the serving layer (the line merging from the curved part). Figure 01 : Lambda Symbol The main layers constituting the Lambda layer are shown below: Figyure 02 : Components of Lambda Architecure In the above high level representation, data is fed to both the batch and speed layer. The batch layer keeps producing and re-computing views at every set batch interval. The speed layer also creates the relevant real-time/speed views. The serving layer orchestrates the query by querying both the batch and speed layer, merges it and sends the result back as results. A practical realization of such a data lake can be illustrated as shown below. The figure below shows multiple technologies used for such a realization, however once the data is acquired from multiple sources and queued in messaging layer for ingestion, the Lambda architecture pattern in form of ingestion layer, batch layer.and speed layer springs into action: Figure 03: Layers in Data Lake Data Acquisition Layer:In an organization, data exists in various forms which can be classified as structured data, semi-structured data, or as unstructured data.One of the key roles expected from the acquisition layer is to be able convert the data into messages that can be further processed in a data lake, hence the acquisition layer is expected to be flexible to accommodate variety of schema specifications at the same time must have a fast connect mechanism to seamlessly push all the translated data messages into the data lake. A typical flow can be represented as shown below. Figure 04: Data Acquisition Layer Messaging Layer: The messaging layer would form the Message Oriented Middleware (MOM) for the data lake architecture and hence would be the primary layer for decoupling the various layers with each other, but with guaranteed delivery of messages.The other aspect of a messaging layer is its ability to enqueue and dequeue messages, as in the case with most of the messaging frameworks. Most of the messaging frameworks provide enqueue and dequeue mechanisms to manage publishing and consumption of messages respectively. Every messaging frameworks provides its own set of libraries to connect to its resources(queues/topics). Figure 05: Message Queue Additionally the messaging layer also can perform the role of data stream producer which can converted the queued data into continuous streams of data which can be passed on for data ingestion. Data Ingestion Layer: A fast ingestion layer is one of the key layers in Lambda Architecture pattern. This layer needs to ensure how fast can data be delivered into working models of Lambda architecture. The data ingestion layer is responsible for consuming the messages from the messaging layer and perform the required transformation for ingesting them into the lambda layer (batch and speed layer) such that the transformed output conforms to the expected storage or processing formats. Figure 06: Data Ingestion Layer Batch Processing: The batch processing layer of lambda architecture is expected to process the ingested data in batches so as to have optimum utilization of system resources, at the same time, long running operations may be applied to the data to ensure high quality of data output, which is also known as Modelled data. The conversion of raw data to a modelled data is the primary responsibility of this layer, wherein the modelled data is the data model which can be served by serving layers of lambda architecture. While Hadoop as a framework has multiple components that can process data as a batch, each data processing in Hadoop is a map reduce process. A map and reduce paradigm of process execution is not a new paradigm, rather it has been used in many application ever since mainframe systems came into existence. It is based on divide and rule and stems from the traditional multi-threading model. The primary mechanism here is to divide the batch across multiple processes and then combine/reduce output of all the processes into a single output. Figure 07: Batch Processing Speed (Near Real Time) Data Processing: This layer is expected to perform near real time processing on data received from ingestion layer. Since the processing is expected to be in near real time, such data processing will need to be quick, fast and efficient, with support and design for high concurrency scenarios and eventually consistent outcome. The real-time processing was often dependent on data like the look-up data and reference data, hence there was a need to have a very fast data layer such that any look-up or reference data does not adversely impact the real-time nature of the processing. Near real time data processing pattern is not very different from the way it is done in batch mode, but the primary difference being that the data is processed as soon as it is available for processing and does not have to be batched, as shown below. Figure 08: Speed (Near Real Time) Processing Data Storage Layer: The data storage layer is very eminent in the lambda architecture pattern as this layer defines the reactivity of the overall solution to the incoming event/data streams. The storage, in context of lambda architecture driven data lake can be classified broadly into non-indexed and indexed data storage. Typically, the batch processing is performed on non-indexed data stored as data blocks for faster batch processing, while speed (near real time processing) is performed on indexed data which can be accessed randomly and supports complex search patterns by means of inverted indexes. Both of these models are depicted below. Figure 09: Non-Indexed and Indexed Data Storage Examples Lambda in action Once all the layers in lambda architecture have performed their respective roles, the data can be exported, exposed via services and can be delivered through other protocols from the data lake. This can also include merging the high quality processed output from batch processing with indexed storage, using technologies and frameworks, so as to provide enriched data for near real time requirements as well with interesting visualizations. Figure 10: Lambda in action Summary In this article we have briefly discussed a practical approach towards implementing a data lake for enterprises by leveraging Lambda architecture pattern. Resources for Article: Further resources on this subject: The Microsoft Azure Stack Architecture [article] System Architecture and Design of Ansible [article] Microservices and Service Oriented Architecture [article]

0
0
4658

How-To Tutorials - Data

Managing IBM Cognos BI Server Components

What is logistic regression?

Text Mining with R: Part 2

Cluster Computing Using Scala

Hive in Hadoop

Meet QlikView

MapReduce functions

Oracle Integration and Consolidation Products

Advanced Shiny Functions

Big Data Analysis (R and Hadoop)

Trending Topics

Creating your first collection (Simple)

Setting Up Synchronous Replication

Neural Network in Azure ML

Oracle GoldenGate: Considerations for Designing a Solution

LambdaArchitecture Pattern

Create a Free Account To Continue Reading

Sign in to activate your 7-day free access