How-To Tutorials - Data

Packt
22 Oct 2013
9 min read

Interacting with your Visualization

The ultimate goal of visualization design is to optimize applications so that they help us perform cognitive work more efficiently.
Ware C. (2012)

The goal of data visualization is to help the audience gain information from a large quantity of raw data quickly and efficiently through metaphor, mental model alignment, and cognitive magnification. So far we have introduced various techniques that use the D3 library to implement many types of visualization. However, we haven't touched on a crucial aspect of visualization: human interaction. Various studies have confirmed the unique value of human interaction in information visualization.

Visualization combined with computational steering allows faster analyses of more sophisticated scenarios...This case study adequately demonstrate that the interaction of a complex model with steering and interactive visualization can extend the applicability of the modelling beyond research
Barrass I. & Leng J (2011)

In this article we will focus on D3's support for human interaction with a visualization, or, as mentioned earlier, learn how to add computational steering capability to your visualization.

Interacting with mouse events

The mouse is the most common human-computer interaction control found on desktop and laptop computers. Even today, with multi-touch devices rising to dominance, touch events are typically still emulated as mouse events, which makes applications designed for the mouse usable through touch. In this recipe we will learn how to handle standard mouse events in D3.

Getting ready

Open your local copy of the following file in your web browser:

https://github.com/NickQiZhu/d3-cookbook/blob/master/src/chapter10/mouse.html

How to do it...

In the following code example we will explore techniques of registering and handling mouse events in D3.
Although in this particular example we are only handling click and mousemove, the techniques used here can be applied easily to all other standard mouse events supported by modern browsers:

    <script type="text/javascript">
        var r = 400;

        var svg = d3.select("body")
            .append("svg");

        var positionLabel = svg.append("text")
            .attr("x", 10)
            .attr("y", 30);

        svg.on("mousemove", function () { //<-A
            printPosition();
        });

        function printPosition() { //<-B
            var position = d3.mouse(svg.node()); //<-C
            positionLabel.text(position);
        }

        svg.on("click", function () { //<-D
            for (var i = 1; i < 5; ++i) {
                var position = d3.mouse(svg.node());

                var circle = svg.append("circle")
                    .attr("cx", position[0])
                    .attr("cy", position[1])
                    .attr("r", 0)
                    .style("stroke-width", 5 / (i))
                    .transition()
                    .delay(Math.pow(i, 2.5) * 50)
                    .duration(2000)
                    .ease('quad-in')
                    .attr("r", r)
                    .style("stroke-opacity", 0)
                    .each("end", function () {
                        d3.select(this).remove();
                    });
            }
        });
    </script>

This recipe generates the following interactive visualization:

Figure: Mouse Interaction

How it works...

In D3, to register an event listener, we invoke the on function on a selection. The given event listener is attached to all selected elements for the specified event (line A). The following code attaches a mousemove event listener that calls printPosition (line B) to display the current mouse position:

    svg.on("mousemove", function () { //<-A
        printPosition();
    });

    function printPosition() { //<-B
        var position = d3.mouse(svg.node()); //<-C
        positionLabel.text(position);
    }

On line C we used the d3.mouse function to obtain the current mouse position relative to the given container element. This function returns a two-element array [x, y].
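To make "relative to the given container element" concrete, here is a minimal sketch in plain JavaScript. It is not how d3.mouse is implemented (this version ignores borders and SVG transforms); the relativePosition helper and the stand-in event and rectangle objects are hypothetical and exist only for illustration.

```javascript
// Hypothetical helper: convert the viewport coordinates of a mouse event
// into coordinates relative to a container's top-left corner.
// Simplified illustration only; it ignores borders and SVG transforms.
function relativePosition(event, rect) {
    return [event.clientX - rect.left, event.clientY - rect.top];
}

// Stand-in values; in a browser these would come from a real mouse event
// and from container.getBoundingClientRect().
var fakeEvent = {clientX: 110, clientY: 80};
var fakeRect = {left: 10, top: 30};

var position = relativePosition(fakeEvent, fakeRect);
console.log(position.join(","));
```

The result has the same [x, y] shape as the array the recipe passes to positionLabel.text().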
After this we also registered an event listener for the mouse click event on line D using the same on function:

    svg.on("click", function () { //<-D
        for (var i = 1; i < 5; ++i) {
            var position = d3.mouse(svg.node());

            var circle = svg.append("circle")
                .attr("cx", position[0])
                .attr("cy", position[1])
                .attr("r", 0)
                .style("stroke-width", 5 / (i)) // <-E
                .transition()
                .delay(Math.pow(i, 2.5) * 50) // <-F
                .duration(2000)
                .ease('quad-in')
                .attr("r", r)
                .style("stroke-opacity", 0)
                .each("end", function () {
                    d3.select(this).remove(); // <-G
                });
        }
    });

Once again, we retrieved the current mouse position using the d3.mouse function and then generated four concentric expanding circles (the loop runs for i from 1 to 4) to simulate the ripple effect. The ripple effect was simulated using a delay that grows with the circle index raised to the power 2.5 (line F) together with a decreasing stroke-width (line E). Finally, when the transition is over, each circle is removed by the transition end listener (line G).

There's more...

Although we have only demonstrated listening on the click and mousemove events in this recipe, you can listen on any event that your browser supports through the on function.
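The ripple timing from the click handler can be tabulated without a browser. The following sketch reuses the delay and stroke-width expressions from lines E and F; the rippleSchedule function name is hypothetical and not part of the recipe.

```javascript
// Compute the delay and stroke width each ripple circle receives,
// using the same expressions as the click handler in the recipe.
function rippleSchedule(count) {
    var schedule = [];
    for (var i = 1; i <= count; ++i) {
        schedule.push({
            index: i,
            delay: Math.pow(i, 2.5) * 50, // milliseconds before the circle starts growing
            strokeWidth: 5 / i            // later circles are drawn thinner
        });
    }
    return schedule;
}

// The recipe's loop runs for i = 1..4, so it creates four circles.
rippleSchedule(4).forEach(function (s) {
    console.log(s.index, Math.round(s.delay), s.strokeWidth);
});
```

Each later circle starts noticeably later and is drawn thinner, which is what produces the fading-ripple impression.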
The following is a list of mouse events that are useful to know when building your interactive visualization:

click: Dispatched when the user clicks a mouse button
dblclick: Dispatched when a mouse button is clicked twice
mousedown: Dispatched when a mouse button is pressed
mouseenter: Dispatched when the mouse is moved onto the boundaries of an element or one of its descendant elements
mouseleave: Dispatched when the mouse is moved off of the boundaries of an element and all of its descendant elements
mousemove: Dispatched when the mouse is moved over an element
mouseout: Dispatched when the mouse is moved off of the boundaries of an element
mouseover: Dispatched when the mouse is moved onto the boundaries of an element
mouseup: Dispatched when a mouse button is released over an element

Interacting with a multi-touch device

Today, with the proliferation of multi-touch devices, any visualization targeting mass consumption needs to support interaction not only through the traditional pointing device, but through multi-touch and gestures as well. In this recipe we will explore the touch support offered by D3 to see how it can be used to build some interesting interactions on multi-touch capable devices.

Getting ready

Open your local copy of the following file in your web browser:

https://github.com/NickQiZhu/d3-cookbook/blob/master/src/chapter10/touch.html

How to do it...

In this recipe we will generate a progress-circle around the user's touch; once the progress is completed, a ripple effect will be triggered around the circle.
However, if the user ends the touch prematurely, we stop the progress-circle without generating the ripples:

    <script type="text/javascript">
        var initR = 100,
            r = 400,
            thickness = 20;

        var svg = d3.select("body")
            .append("svg");

        d3.select("body")
            .on("touchstart", touch)
            .on("touchend", touch);

        function touch() {
            d3.event.preventDefault();

            var arc = d3.svg.arc()
                .outerRadius(initR)
                .innerRadius(initR - thickness);

            var g = svg.selectAll("g.touch")
                .data(d3.touches(svg.node()), function (d) {
                    return d.identifier;
                });

            g.enter()
                .append("g")
                .attr("class", "touch")
                .attr("transform", function (d) {
                    return "translate(" + d[0] + "," + d[1] + ")";
                })
                .append("path")
                .attr("class", "arc")
                .transition().duration(2000)
                .attrTween("d", function (d) {
                    var interpolate = d3.interpolate(
                        {startAngle: 0, endAngle: 0},
                        {startAngle: 0, endAngle: 2 * Math.PI}
                    );
                    return function (t) {
                        return arc(interpolate(t));
                    };
                })
                .each("end", function (d) {
                    if (complete(g))
                        ripples(d);
                    g.remove();
                });

            g.exit().remove().each(function () {
                this.__stopped__ = true;
            });
        }

        function complete(g) {
            return g.node().__stopped__ != true;
        }

        function ripples(position) {
            for (var i = 1; i < 5; ++i) {
                var circle = svg.append("circle")
                    .attr("cx", position[0])
                    .attr("cy", position[1])
                    .attr("r", initR - (thickness / 2))
                    .style("stroke-width", thickness / (i))
                    .transition().delay(Math.pow(i, 2.5) * 50)
                    .duration(2000).ease('quad-in')
                    .attr("r", r)
                    .style("stroke-opacity", 0)
                    .each("end", function () {
                        d3.select(this).remove();
                    });
            }
        }
    </script>

This recipe generates the following interactive visualization on a touch-enabled device:

Figure: Touch Interaction

How it works...
Event listeners for touch events are registered through the D3 selection's on function, just as we did with mouse events in the previous recipe:

    d3.select("body")
        .on("touchstart", touch)
        .on("touchend", touch);

One crucial difference is that we registered the touch event listener on the body element instead of the svg element. Many operating systems and browsers define default touch behaviors, and we want to override them with our custom implementation. This is done through the following function call:

    d3.event.preventDefault();

Once the touch event is triggered, we retrieve the data for all touch points using the d3.touches function, as illustrated by the following code snippet:

    var g = svg.selectAll("g.touch")
        .data(d3.touches(svg.node()), function (d) {
            return d.identifier;
        });

Instead of returning a single two-element array as the d3.mouse function does, d3.touches returns an array of two-element arrays, since there can be multiple touch points for each touch event. Each touch position array has a data structure that looks like the following:

Figure: Touch Position Array

Besides the [x, y] position of the touch point, each position array also carries an identifier that helps you tell the touch points apart. We used this identifier in this recipe to establish object constancy.
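The effect of keying the data join on the identifier can be sketched in plain JavaScript. This is only an illustration of the idea, not D3's data-join code; joinByIdentifier and makeTouch are hypothetical helpers, and the touch values are made up.

```javascript
// Hypothetical helper: compare the identifiers seen in the previous event
// with the current touch list and report which touches are new ("enter"),
// which persist ("update"), and which have ended ("exit").
function joinByIdentifier(previousIds, touches) {
    var currentIds = touches.map(function (t) { return t.identifier; });
    return {
        enter: currentIds.filter(function (id) { return previousIds.indexOf(id) < 0; }),
        update: currentIds.filter(function (id) { return previousIds.indexOf(id) >= 0; }),
        exit: previousIds.filter(function (id) { return currentIds.indexOf(id) < 0; })
    };
}

// Stand-in touch data shaped like the arrays described above:
// an [x, y] array that also carries an identifier property.
function makeTouch(x, y, identifier) {
    var touch = [x, y];
    touch.identifier = identifier;
    return touch;
}

var result = joinByIdentifier([0, 1], [makeTouch(40, 60, 1), makeTouch(200, 90, 2)]);
console.log(result);
```

Because matching uses the identifier rather than the array index, a finger that stays down keeps its element even when another finger is lifted.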
Once the touch data is bound to the selection, a progress circle is generated around the user's finger for each touch:

    g.enter()
        .append("g")
        .attr("class", "touch")
        .attr("transform", function (d) {
            return "translate(" + d[0] + "," + d[1] + ")";
        })
        .append("path")
        .attr("class", "arc")
        .transition().duration(2000).ease('linear')
        .attrTween("d", function (d) { // <-A
            var interpolate = d3.interpolate(
                {startAngle: 0, endAngle: 0},
                {startAngle: 0, endAngle: 2 * Math.PI}
            );
            return function (t) {
                return arc(interpolate(t));
            };
        })
        .each("end", function (d) { // <-B
            if (complete(g))
                ripples(d);
            g.remove();
        });

This is done through a standard arc transition with attribute tweening (line A). Once the transition is over, if the progress-circle has not been canceled by the user, a ripple effect similar to the one in the previous recipe is generated (line B). Since we registered the same listener function, touch, on both the touchstart and touchend events, we can use the following lines to remove the progress-circle and also set a flag indicating that it was stopped prematurely:

    g.exit().remove().each(function () {
        this.__stopped__ = true;
    });

We need to set this stateful flag since there is no way to cancel a transition once it is started; hence, even after the progress-circle element is removed from the DOM tree, the transition will still complete and trigger line B.

There's more...

We have demonstrated touch interaction through the touchstart and touchend events; however, you can use the same pattern to handle any other touch events supported by your browser.
The following list contains the proposed touch event types recommended by W3C:

touchstart: Dispatched when the user places a touch point on the touch surface
touchend: Dispatched when the user removes a touch point from the touch surface
touchmove: Dispatched when the user moves a touch point along the touch surface
touchcancel: Dispatched when a touch point has been disrupted in an implementation-specific manner
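The stopped-flag technique from the How it works section of the touch recipe can be modeled without D3 or a DOM. In this sketch, createProgress is a hypothetical stand-in for one progress circle: stop() plays the role of the exit branch, and end() plays the role of the transition end listener, which runs regardless of whether the touch ended early.

```javascript
// Hypothetical model of one progress circle. `node` stands in for the
// DOM element; `onComplete` stands in for the ripple effect.
function createProgress(onComplete) {
    var node = {__stopped__: false};
    return {
        node: node,
        // Called when the touch ends early (the exit branch in the recipe).
        stop: function () { node.__stopped__ = true; },
        // Called when the transition finishes, whether or not it was stopped.
        end: function () {
            if (node.__stopped__ !== true) onComplete();
        }
    };
}

var fired = [];
var held = createProgress(function () { fired.push("held"); });
var released = createProgress(function () { fired.push("released"); });

released.stop(); // the user lifted the finger early
held.end();      // transition end: flag not set, so the callback runs
released.end();  // transition end still fires, but the flag suppresses it
console.log(fired);
```

Only the touch that was held for the full duration triggers its completion callback.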

Packt
21 Oct 2013
4 min read

Drag

I can't think of a better dragging demonstration than animating with the parallax illusion. The illusion works by having several keyframes rendered in vertical slices and dragging a screen over them to create an animated thingamabob.

Drawing the lines by hand would be tedious, so we're using an image Marco Kuiper created in Photoshop. I asked on Twitter and he said we can use the image, if we check out his other work at marcofolio.net. You can also get the image in the examples repository at https://raw.github.com/Swizec/d3.js-book-examples/master/ch4/parallax_base.png.

We need somewhere to put the parallax:

    var width = 1200,
        height = 450,
        svg = d3.select('#graph')
            .append('svg')
            .attr({width: width, height: height});

We'll use SVG's native support for embedding bitmaps to insert parallax_base.png into the page:

    svg.append('image')
        .attr({'xlink:href': 'parallax_base.png',
               width: width,
               height: height});

The image element's magic stems from its xlink:href attribute. It understands links and even lets us embed images to create self-contained SVGs. To use that, you would prepend an image MIME type to a base64-encoded representation of the image. For instance, the following line is the smallest embedded version of a spacer GIF. Don't worry if you don't know what a spacer GIF is; they were useful up to about 2005.

    data:image/gif;base64,R0lGODlhAQABAID/AMDAwAAAACH5BAEAAAAALAAAAAABAAEAAAICRAEAOw==

Anyway, now that we have the animation base, we need a screen that can be dragged. It's going to be a bunch of carefully calibrated vertical lines:

    var screen_width = 900,
        lines = d3.range(screen_width/6),
        x = d3.scale.ordinal().domain(lines).rangeBands([0, screen_width]);

We'll base the screen off an array of numbers (lines). Since line thickness and density are very important, we divide screen_width by 6: five pixels for a line and one for spacing.
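The arithmetic behind the six-pixel bands can be checked with a small sketch. The bandLayout function below is a simplified, hypothetical stand-in for an ordinal band scale with no padding; it is not D3's implementation.

```javascript
// Simplified, hypothetical re-implementation of an ordinal band layout
// with no padding: split [start, stop] into `count` equal bands.
function bandLayout(count, start, stop) {
    var step = (stop - start) / count;
    var offsets = [];
    for (var i = 0; i < count; ++i) offsets.push(start + step * i);
    return {band: step, offsets: offsets};
}

var screenWidth = 900;
var layout = bandLayout(screenWidth / 6, 0, screenWidth);

// Number of lines, band width in pixels, and the stroke width left
// after reserving one pixel of spacing per band.
console.log(layout.offsets.length, layout.band, layout.band - 1);
```

With 900 pixels split into 150 bands, every band starts on a whole pixel, which is why a width that divides evenly by 6 matters.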
Make sure the value of screen_width is a multiple of 6; otherwise anti-aliasing ruins the effect. The x scale will help us place the lines evenly:

    svg.append('g')
        .selectAll('line')
        .data(lines)
        .enter()
        .append('line')
        .style('shape-rendering', 'crispEdges')
        .attr({stroke: 'black',
               'stroke-width': x.rangeBand()-1,
               x1: function (d) { return x(d); },
               y1: 0,
               x2: function (d) { return x(d); },
               y2: height});

There's nothing particularly interesting here, just stuff you already know. The code goes through the array and draws a new vertical line for each entry. We made absolutely certain there won't be any anti-aliasing by setting shape-rendering to crispEdges.

Time to define and activate a dragging behavior for our group of lines:

    var drag = d3.behavior.drag()
        .origin(Object)
        .on('drag', function () {
            d3.select(this)
                .attr('transform', 'translate('+d3.event.x+', 0)')
                .datum({x: d3.event.x, y: 0});
        });

We created the behavior with d3.behavior.drag(), defined an .origin() accessor, and specified what happens on drag. The behavior automatically translates touch and mouse events to the higher-level drag event. How cool is that!

We need to give the behavior an origin so it knows how to calculate positions relatively; otherwise, the current position is always set to the mouse cursor and objects jump around. It's terrible. Object is the identity function for elements and assumes a datum with x and y coordinates.

The heavy lifting happens inside the drag listener. We get the screen's new position from d3.event.x, move the screen there, and update the attached datum.

All that's left to do is to call drag and make sure to set the attached datum to the current position:

    svg.select('g')
        .datum({x: 0, y: 0})
        .call(drag);

The item looks solid now! Try dragging the screen at different speeds. The parallax effect doesn't work very well on a retina display because the base image gets resized and our screen loses calibration.
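Why an origin stops the element from jumping can be shown with a few lines of plain JavaScript. This is a sketch of the idea only, not the internals of d3.behavior.drag; startDrag and the coordinates used are hypothetical.

```javascript
// Hypothetical model of a drag gesture along x. `origin` is the element's
// position when the drag starts (the datum in the article).
function startDrag(origin, pointerX) {
    var offset = pointerX - origin.x; // where inside the element the pointer grabbed
    return function positionFor(currentPointerX) {
        return currentPointerX - offset;
    };
}

// The screen sits at x = 100 and is grabbed at pointer x = 130.
var positionFor = startDrag({x: 100}, 130);
console.log(positionFor(130)); // no movement yet: the element stays put
console.log(positionFor(150)); // the pointer moved by 20, and so does the element
```

Without the remembered offset, the element's position would be set to the raw pointer coordinate on the first drag event, which is the jump the article warns about.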
Summary

In this article, we looked into the drag behavior of d3. All of this can be done with just click events, but I heartily recommend d3's behaviors module. It makes complex behaviors simpler because the behaviors automatically create the relevant event listeners and let you work at a higher level of abstraction.

Packt
18 Oct 2013
5 min read

Using the Spark Shell

Loading a simple text file

When running a Spark shell and connecting to an existing cluster, you should see something specifying the app ID, like Connected to Spark cluster with app ID app-20130330015119-0001. The app ID will match the application entry shown in the web UI under running applications (by default, it is viewable on port 8080). You can start by downloading a dataset to use for some experimentation. There are a number of datasets that were put together for The Elements of Statistical Learning, which are in a very convenient form for use. Grab the spam dataset using the following command:

    wget http://www-stat.stanford.edu/~tibs/ElemStatLearn/datasets/spam.data

Now load it as a text file into Spark with the following command inside your Spark shell:

    scala> val inFile = sc.textFile("./spam.data")

This loads the spam.data file into Spark, with each line being a separate entry in the RDD (Resilient Distributed Dataset).

Note that if you've connected to a Spark master, it's possible that it will attempt to load the file on one of the different machines in the cluster, so make sure it's available on all the cluster machines. In general, you will want to put your data in HDFS, S3, or a similar file system to avoid this problem. In local mode, you can just load the file directly, for example, sc.textFile([filepath]). To make a file available across all the machines, you can also use the addFile function on the SparkContext by writing the following code:

    scala> import spark.SparkFiles;
    scala> val file = sc.addFile("spam.data")
    scala> val inFile = sc.textFile(SparkFiles.get("spam.data"))

Just like most shells, the Spark shell has a command history. You can press the up arrow key to get to the previous commands. Getting tired of typing, or not sure what method you want to call on an object? Press Tab, and the Spark shell will autocomplete the line of code as best as it can.
For this example, the RDD with each line as an individual string isn't very useful, as our input data is actually space-separated numerical information. Map over the RDD and quickly convert it to a usable format (note that _.toDouble is the same as x => x.toDouble):

    scala> val nums = inFile.map(x => x.split(' ').map(_.toDouble))

Verify that this is what we want by inspecting some elements in the nums RDD and comparing them against the original string RDD. Take a look at the first element of each RDD by calling .first() on the RDDs:

    scala> inFile.first()
    [...]
    res2: String = 0 0.64 0.64 0 0.32 0 0 0 0 0 0 0.64 0 0 0 0.32 0 1.29 1.93 0 0.96 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.778 0 0 3.756 61 278 1
    scala> nums.first()
    [...]
    res3: Array[Double] = Array(0.0, 0.64, 0.64, 0.0, 0.32, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.64, 0.0, 0.0, 0.0, 0.32, 0.0, 1.29, 1.93, 0.0, 0.96, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.778, 0.0, 0.0, 3.756, 61.0, 278.0, 1.0)

Using the Spark shell to run logistic regression

When you run a command and have not specified a left-hand side (that is, leaving out the val x of val x = y), the Spark shell will print the value along with res[number]. The res[number] name can then be used as if we had written val res[number] = y. Now that you have the data in a more usable format, try to do something cool with it!
Use Spark to run logistic regression over the dataset as follows:

    scala> import spark.util.Vector
    import spark.util.Vector
    scala> case class DataPoint(x: Vector, y: Double)
    defined class DataPoint
    scala> def parsePoint(x: Array[Double]): DataPoint = {
        DataPoint(new Vector(x.slice(0,x.size-2)) , x(x.size-1))
    }
    parsePoint: (x: Array[Double])this.DataPoint
    scala> val points = nums.map(parsePoint(_))
    points: spark.RDD[this.DataPoint] = MappedRDD[3] at map at <console>:24
    scala> import java.util.Random
    import java.util.Random
    scala> val rand = new Random(53)
    rand: java.util.Random = java.util.Random@3f4c24
    scala> var w = Vector(nums.first.size-2, _ => rand.nextDouble)
    13/03/31 00:57:30 INFO spark.SparkContext: Starting job: first at <console>:20
    ...
    13/03/31 00:57:30 INFO spark.SparkContext: Job finished: first at <console>:20, took 0.01272858 s
    w: spark.util.Vector = (0.7290865701603526, 0.8009687428076777,
    0.6136632797111822, 0.9783178194773176, 0.3719683631485643,
    0.46409291255379836, 0.5340172959927323, 0.04034252433669905,
    0.3074428389716637, 0.8537414030626244, 0.8415816118493813,
    0.719935849109521, 0.2431646830671812, 0.17139348575456848,
    0.5005137792223062, 0.8915164469396641, 0.7679331873447098,
    0.7887571495335223, 0.7263187438977023, 0.40877063468941244,
    0.7794519914671199, 0.1651264689613885, 0.1807006937030201,
    0.3227972103818231, 0.2777324549716147, 0.20466985600105037,
    0.5823059390134582, 0.4489508737465665, 0.44030858771499415,
    0.6419366305419459, 0.5191533842209496, 0.43170678028084863,
    0.9237523536173182, 0.5175019655845213, 0.47999523211827544,
    0.25862648071479444, 0.020548000101787922, 0.18555332739714137, 0....
    scala> val iterations = 100
    iterations: Int = 100
    scala> import scala.math._
    scala> for (i <- 1 to iterations) {
        val gradient = points.map(p =>
            (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
        ).reduce(_ + _)
        w -= gradient
    }
    [....]
    scala> w
    res27: spark.util.Vector = (0.2912515190246098, 1.05257972144256,
    1.1620192443948825, 0.764385365541841, 1.3340446477767611,
    0.6142105091995632, 0.8561985593740342, 0.7221556020229336,
    0.40692442223198366, 0.8025693176035453, 0.7013618380649754,
    0.943828424041885, 0.4009868306348856, 0.6287356973527756,
    0.3675755379524898, 1.2488466496117185, 0.8557220216380228,
    0.7633511642942988, 6.389181646047163, 1.43344096405385,
    1.729216408954399, 0.4079709812689015, 0.3706358251228279,
    0.8683036382227542, 0.36992902312625897, 0.3918455398419239,
    0.2840295056632881, 0.7757126171768894, 0.4564171647415838,
    0.6960856181900357, 0.6556402580635656, 0.060307680034745986,
    0.31278587054264356, 0.9273189009376189, 0.0538302050535121,
    0.545536066902774, 0.9298009485403773, 0.922750704590723,
    0.072339496591

If things went well, you just used Spark to run logistic regression. Awesome! We have just done a number of things: we defined a class, created an RDD, and created a function. As you can see, the Spark shell is quite powerful. Much of the power comes from it being based on the Scala REPL (Read-Evaluate-Print Loop, the Scala interactive shell), so it inherits all the power of the Scala REPL. That being said, most of the time you will probably want to work with more traditional compiled code rather than in the REPL environment.

Summary

In this article, you have learned how to load data and how to use Spark to run logistic regression.
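To see what one pass of the gradient loop in the logistic regression session computes, here is the same update written in plain JavaScript on a tiny made-up dataset. This is an illustration only and not Spark code: the two data points, the +1/-1 labels, and the dot and gradientStep helpers are all assumptions introduced for this sketch.

```javascript
// Plain-JavaScript illustration of one gradient step from the loop above.
// Assumes labels y of +1 or -1 and a tiny made-up dataset; this is not Spark code.
function dot(a, b) {
    return a.reduce(function (sum, ai, i) { return sum + ai * b[i]; }, 0);
}

function gradientStep(points, w) {
    // Sum over all points of (1 / (1 + exp(-y * (w . x))) - 1) * y * x
    var gradient = points.map(function (p) {
        var scale = (1 / (1 + Math.exp(-p.y * dot(w, p.x))) - 1) * p.y;
        return p.x.map(function (xi) { return scale * xi; });
    }).reduce(function (a, b) {
        return a.map(function (ai, i) { return ai + b[i]; });
    });
    // w -= gradient
    return w.map(function (wi, i) { return wi - gradient[i]; });
}

var points = [{x: [1, 0], y: 1}, {x: [0, 1], y: -1}]; // hypothetical data
var w = gradientStep(points, [0, 0]);
console.log(w);
```

The map step corresponds to points.map(...) in the shell session and the reduce step to .reduce(_ + _); in Spark those two steps run across the cluster, while the weight update happens in the shell.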

Packt
17 Oct 2013
22 min read

Working with the Basic Components That Make Up a Three.js Scene

Creating a scene

We know that for a scene to show anything, we need three types of components:

Component: Description
Camera: It determines what is rendered on the screen
Lights: They have an effect on how materials are shown and are used when creating shadow effects
Objects: These are the main objects that are rendered from the perspective of the camera: cubes, spheres, and so on

The THREE.Scene() object serves as the container for all these different objects. This object itself doesn't have too many options and functions.

Basic functionality of the scene

The best way to explore the functionality of the scene is by looking at an example. I'll use this example to explain the various functions and options that a scene has. When we open this example in the browser, the output will look something like the following screenshot:

Even though the scene looks somewhat empty, it already contains a couple of objects. By looking at the following source code, we can see that we've used the Scene.add(object) function from the THREE.Scene() object to add a THREE.Mesh (the ground plane that you see), a THREE.SpotLight, and a THREE.AmbientLight object. The THREE.Camera object is added automatically by the Three.js library when you render the scene, but can also be added manually if you prefer.

    var scene = new THREE.Scene();
    var camera = new THREE.PerspectiveCamera(45, window.innerWidth / window.innerHeight, 0.1, 1000);
    ...
    var planeGeometry = new THREE.PlaneGeometry(60,40,1,1);
    var planeMaterial = new THREE.MeshLambertMaterial({color: 0xffffff});
    var plane = new THREE.Mesh(planeGeometry,planeMaterial);
    ...
    scene.add(plane);
    var ambientLight = new THREE.AmbientLight(0x0c0c0c);
    scene.add(ambientLight);
    ...
    var spotLight = new THREE.SpotLight( 0xffffff );
    ...
    scene.add( spotLight );

Before we look deeper into the THREE.Scene() object, I'll first explain what you can do in the demonstration, and after that we'll look at some code.
Open this example in your browser and look at the controls at the upper-right corner, as you can see in the following screenshot:

With these controls you can add a cube to the scene, remove the last added cube from the scene, and show all the current objects that the scene contains. The last entry in the control section shows the current number of objects in the scene. What you'll probably notice when you start up the scene is that there are already four objects in the scene. These are the ground plane, the ambient light, the spot light, and the camera that we mentioned earlier. In the following code fragment, we'll look at each of the functions in the control section and start with the easiest one, the addCube() function:

    this.addCube = function() {
        var cubeSize = Math.ceil((Math.random() * 3));
        var cubeGeometry = new THREE.CubeGeometry(cubeSize,cubeSize,cubeSize);
        var cubeMaterial = new THREE.MeshLambertMaterial({color: Math.random() * 0xffffff });
        var cube = new THREE.Mesh(cubeGeometry, cubeMaterial);
        cube.castShadow = true;
        cube.name = "cube-" + scene.children.length;
        cube.position.x = -30 + Math.round((Math.random() * planeGeometry.width));
        cube.position.y = Math.round((Math.random() * 5));
        cube.position.z = -20 + Math.round((Math.random() * planeGeometry.height));
        scene.add(cube);
        this.numberOfObjects = scene.children.length;
    };

This piece of code should be pretty easy to read by now. Not many new concepts are introduced here. When you click on the addCube button, a new THREE.CubeGeometry instance is created with a random whole-number size between one and three (Math.ceil rounds up). Besides a random size, the cube also gets a random color and position in the scene.

A new thing in this piece of code is that we also give the cube a name by using the name attribute. Its name is set to cube- followed by the number of objects currently in the scene (given by the scene.children.length property). So you'll get names of the form cube-N, where N is the number of objects in the scene at the moment the cube was added.
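To make the size and position ranges in addCube() concrete, the following sketch evaluates the same expressions with the random source passed in as a parameter. That parameter, and the cubeSize, cubeX, and constant helpers, are hypothetical changes made for illustration; the plane width of 60 comes from the PlaneGeometry(60,40,1,1) call shown earlier.

```javascript
// Same expressions as addCube(), with the random source passed in
// (a hypothetical change) so the extremes can be inspected.
function cubeSize(random) {
    return Math.ceil(random() * 3);
}

function cubeX(random, planeWidth) {
    return -30 + Math.round(random() * planeWidth);
}

// Returns a fake random source that always yields the same value.
function constant(value) {
    return function () { return value; };
}

// Sizes for small, middle, and large random values.
console.log(cubeSize(constant(0.01)), cubeSize(constant(0.5)), cubeSize(constant(0.99)));
// x positions at both ends of a plane that is 60 units wide.
console.log(cubeX(constant(0), 60), cubeX(constant(0.99), 60));
```

The x position therefore spans the width of the ground plane, centred on the origin.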
A name can be useful for debugging purposes, but can also be used to directly find an object in your scene. If you use the Scene.getChildByName(name) function, you can directly retrieve a specific object and, for instance, change its location. You might wonder what the last line in the previous code snippet does. The numberOfObjects variable is used by our control GUI to list the number of objects in the scene. So whenever we add or remove an object, we set this variable to the updated count.

The next function that we can call from the control GUI is removeCube and, as the name implies, clicking on this button removes the last added cube from the scene. The following code snippet shows how this function is defined:

    this.removeCube = function() {
        var allChildren = scene.children;
        var lastObject = allChildren[allChildren.length-1];
        if (lastObject instanceof THREE.Mesh) {
            scene.remove(lastObject);
            this.numberOfObjects = scene.children.length;
        }
    }

To add an object to the scene we use the add() function. To remove an object from the scene we use the not very surprising remove() function. In the given code fragment we used the children property of the THREE.Scene() object to get the last object that was added. We also need to check whether that object is a Mesh object in order to avoid removing the camera and the lights. After we've removed the object, we once again update the GUI property that holds the number of objects in the scene.

The final button on our GUI is labeled outputObjects. You've probably already clicked on it and nothing seemed to happen.
What this button does is print all the objects that are currently in our scene to the web browser console, as shown in the following screenshot:

The code to output information to the console log makes use of the built-in console object as shown:

    this.outputObjects = function() {
        console.log(scene.children);
    }

This is great for debugging purposes; especially when you name your objects, it's very useful for finding issues and problems with a specific object in your scene. For instance, the properties of the cube-17 object will look like the following code snippet:

    __webglActive: true
    __webglInit: true
    _modelViewMatrix: THREE.Matrix4
    _normalMatrix: THREE.Matrix3
    _vector: THREE.Vector3
    castShadow: true
    children: Array[0]
    eulerOrder: "XYZ"
    frustumCulled: true
    geometry: THREE.CubeGeometry
    id: 20
    material: THREE.MeshLambertMaterial
    matrix: THREE.Matrix4
    matrixAutoUpdate: true
    matrixRotationWorld: THREE.Matrix4
    matrixWorld: THREE.Matrix4
    matrixWorldNeedsUpdate: false
    name: "cube-17"
    parent: THREE.Scene
    position: THREE.Vector3
    properties: Object
    quaternion: THREE.Quaternion
    receiveShadow: false
    renderDepth: null
    rotation: THREE.Vector3
    rotationAutoUpdate: true
    scale: THREE.Vector3
    up: THREE.Vector3
    useQuaternion: false
    visible: true
    __proto__: Object

So far we've seen the following scene-related functionality:

Scene.add(): This method adds an object to the scene
Scene.remove(): This method removes an object from the scene
Scene.children: This property holds a list of all the children in the scene
Scene.getChildByName(): This method gets a specific object from the scene by using the name attribute

These are the most important scene-related functions, and most often you won't need any more. There are, however, a couple of helper functions that could come in handy, and I'd like to show them based on the code that handles the cube rotation. We use a render loop to render the scene.
Let's look at the code snippet for this example:

    function render() {
        stats.update();
        scene.traverse(function(e) {
            if (e instanceof THREE.Mesh && e != plane ) {
                e.rotation.x += controls.rotationSpeed;
                e.rotation.y += controls.rotationSpeed;
                e.rotation.z += controls.rotationSpeed;
            }
        });
        requestAnimationFrame(render);
        renderer.render(scene, camera);
    }

Here we can see that the THREE.Scene.traverse() function is being used. We can pass a function as an argument to the traverse() function. This passed-in function will be called for each child of the scene. In the render() function, we use the traverse() function to update the rotation for each of the cube instances (we explicitly ignore the ground plane). We could also have done this by iterating over the children property array with a for loop.

Before we dive into the Mesh and Geometry object details, I'd like to show you two interesting properties that you can set on the Scene object: fog and overrideMaterial.

Adding the fog effect to the scene

The fog property lets you add a fog effect to the complete scene. The farther away an object is, the more it will be hidden from sight. The following screenshot shows the scene with the fog property enabled:

Enabling the fog property is really easy to do in the Three.js library. Just add the following line of code after you've defined your scene:

    scene.fog = new THREE.Fog( 0xffffff, 0.015, 100 );

Here we are defining a white fog (0xffffff). The last two arguments can be used to tune how the mist will appear. The 0.015 value sets the near property and the 100 value sets the far property. With these properties you can determine where the mist will start and how fast it will get denser. There is also a different way to set the mist for the scene; for this you will have to use the following definition:

    scene.fog = new THREE.FogExp2( 0xffffff, 0.015 );

This time we don't specify the near and far properties, but just the color and the mist density.
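The difference between the two fog definitions can be explored numerically. The functions below are simplified models written for illustration: a straight linear ramp between near and far, and an exponential-squared falloff driven by density. They are assumptions, and the shader code that Three.js actually runs may differ in detail (for example, by smoothing the linear ramp).

```javascript
// Simplified fog models; these are assumptions for illustration and
// may differ in detail from the shader code Three.js actually runs.

// Linear fog: no fog before `near`, full fog beyond `far`.
function linearFogFactor(near, far, distance) {
    var t = (distance - near) / (far - near);
    return Math.min(1, Math.max(0, t));
}

// Exponential-squared fog: thickens with the square of density * distance.
function exp2FogFactor(density, distance) {
    return 1 - Math.exp(-density * density * distance * distance);
}

// Print both factors (0 = clear, 1 = fully fogged) at a few distances.
[0, 25, 50, 100].forEach(function (d) {
    console.log(d, linearFogFactor(0.015, 100, d).toFixed(3), exp2FogFactor(0.015, d).toFixed(3));
});
```

Printing the factors side by side is a quick way to choose between near/far and density before experimenting in the scene itself.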
It's best to experiment a bit with these properties in order to get the effect that you want.

Using the overrideMaterial property

The last property that we will discuss for the scene is the overrideMaterial property, which is used to force a single material on all the objects. When you use this property as shown in the following code snippet, all the objects that you add to the scene will make use of the same material:

    scene.overrideMaterial = new THREE.MeshLambertMaterial({color: 0xffffff});

The scene will be rendered as shown in the following screenshot:

In the preceding screenshot, you can see that all the cube instances are rendered by using the same material and color. In this example we've used a MeshLambertMaterial object as the material. With this material type, you can create non-shiny objects that respond to the lights that you add to the scene.

In this section we've looked at the first of the core concepts of the Three.js library: the scene. The most important thing to remember about the scene is that it is basically a container for all the objects, lights, and cameras that you want to use while rendering. The following table summarizes the most important functions and attributes of the Scene object:

Function/Property | Description
--- | ---
add(object) | Adds an object to the scene. You can also use this function, as we'll see later, to create groups of objects.
children | Returns a list of all the objects that have been added to the scene, including the camera and lights.
getChildByName(name) | When you create an object, you can give it a distinct name by using the name attribute. The Scene object has a function that you can use to directly return an object with a specific name.
remove(object) | If you've got a reference to an object in the scene, you can also remove it from the scene by using this function.
traverse(function) | The children attribute returns a list of all the children in the scene. With the traverse() function we can also access these children by passing in a callback function.
fog | This property allows you to set the fog for the scene. It will render a haze that hides the objects that are far away.
overrideMaterial | With this property you can force all the objects in the scene to use the same material.

In the next section we'll look closely at the objects that you can add to the scene.

Working with the Geometry and Mesh objects

In each of the examples so far you've already seen the geometries and meshes that are being used. For instance, to add a sphere object to the scene we did the following:

    var sphereGeometry = new THREE.SphereGeometry(4, 20, 20);
    var sphereMaterial = new THREE.MeshBasicMaterial({color: 0x7777ff});
    var sphere = new THREE.Mesh(sphereGeometry, sphereMaterial);

We defined the shape of the object (its geometry) and what the object looks like (its material), and combined the two in a mesh that can be added to a scene. In this section we'll look a bit more closely at what the Geometry and Mesh objects are. We'll start with the geometry.

The properties and functions of a geometry

The Three.js library comes with a large set of out-of-the-box geometries that you can use in your 3D scene. Just add a material, create a mesh variable, and you're pretty much done. The following screenshot, from example 04-geometries.html, shows a couple of the standard geometries available in the Three.js library:

We'll explore all the basic and advanced geometries that the Three.js library has to offer later. For now, we'll go into more detail on what the geometry variable actually is.

A geometry in Three.js, and in most other 3D libraries, is basically a collection of points in a 3D space and a number of faces connecting all those points together. Take, for example, a cube: A cube has eight corners. Each of these corners can be defined as a combination of x, y, and z coordinates. So, each cube has eight points in a 3D space.
In the Three.js library, these points are called vertices. A cube has six sides, with one vertex at each corner. In the Three.js library, each of these sides is called a face.

When you use one of the Three.js library-provided geometries, you don't have to define all the vertices and faces yourself. For a cube you only need to define the width, height, and depth. The Three.js library uses that information and creates a geometry with eight vertices at the correct positions and the correct faces. Even though you'd normally use the Three.js library-provided geometries, or generate them automatically, you can still create geometries completely by hand by defining the vertices and faces. This is shown in the following code snippet:

    var vertices = [
        new THREE.Vector3(1,3,1),
        new THREE.Vector3(1,3,-1),
        new THREE.Vector3(1,-1,1),
        new THREE.Vector3(1,-1,-1),
        new THREE.Vector3(-1,3,-1),
        new THREE.Vector3(-1,3,1),
        new THREE.Vector3(-1,-1,-1),
        new THREE.Vector3(-1,-1,1)
    ];
    var faces = [
        new THREE.Face3(0,2,1),
        new THREE.Face3(2,3,1),
        new THREE.Face3(4,6,5),
        new THREE.Face3(6,7,5),
        new THREE.Face3(4,5,1),
        new THREE.Face3(5,0,1),
        new THREE.Face3(7,6,2),
        new THREE.Face3(6,3,2),
        new THREE.Face3(5,7,0),
        new THREE.Face3(7,2,0),
        new THREE.Face3(1,3,4),
        new THREE.Face3(3,6,4)
    ];
    var geom = new THREE.Geometry();
    geom.vertices = vertices;
    geom.faces = faces;
    geom.computeCentroids();
    geom.mergeVertices();

This code shows you how to create a simple cube. We have defined the points that make up this cube in the vertices array. These points are connected to create triangular faces, which are stored in the faces array. For instance, the new THREE.Face3(0,2,1) element creates a triangular face by using the points 0, 2, and 1 from the vertices array.

In this example we have used THREE.Face3 elements to define the six sides of the cube, with two triangles for each side. In the previous versions of the Three.js library, you could also use a quad instead of a triangle.
A quad uses four vertices instead of three to define the face. Whether using quads or triangles is better is a much-debated topic in the 3D modeling world. Basically, quads are often preferred during modeling, since they can be enhanced and smoothed much more easily than triangles. For rendering and game engines, though, working with triangles is easier since every shape can be rendered as a set of triangles.

Using these vertices and faces, we can now create our custom geometry, and use it to create a mesh. I've created an example that you can use to play around with the position of the vertices. In example 05-custom-geometry.html, you can change the position of all the vertices of a cube. This is shown in the following screenshot:

This example, which uses the same setup as all our other examples, has a render loop. Whenever you change one of the properties in the drop-down control box, the cube is rendered correctly based on the changed position of one of the vertices. This isn't something that works out-of-the-box. For performance reasons, the Three.js library assumes that the geometry of a mesh won't change during its lifetime. To get our example to work we need to make sure that the following is added to the code in the render loop:

    mesh.geometry.vertices = vertices;
    mesh.geometry.verticesNeedUpdate = true;
    mesh.geometry.computeFaceNormals();

In the first line of the given code snippet, we point the vertices of the mesh that you see on the screen to an array of the updated vertices. We don't need to reconfigure the faces, since they are still connected to the same points as they were before. After we've set the updated vertices, we need to tell the geometry that the vertices need to be updated. We can do this by setting the verticesNeedUpdate property of the geometry to true. Finally, we recalculate the face normals by using the computeFaceNormals() function so that the complete model is updated.
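Why do the face normals need to be recomputed when a vertex moves? A face normal is derived from the positions of the three vertices of the triangle, so it goes stale as soon as one of them changes. The following is a minimal sketch of that calculation, assuming the usual cross product of two triangle edges. Plain objects stand in for THREE.Vector3, and the helper names (subtract, cross, normalize, faceNormal) are invented for this illustration rather than taken from the library.

```javascript
// Illustrative sketch only: plain {x, y, z} objects stand in for THREE.Vector3.
function subtract(a, b) {
  return { x: a.x - b.x, y: a.y - b.y, z: a.z - b.z };
}

function cross(a, b) {
  return {
    x: a.y * b.z - a.z * b.y,
    y: a.z * b.x - a.x * b.z,
    z: a.x * b.y - a.y * b.x
  };
}

function normalize(v) {
  var length = Math.sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
  return { x: v.x / length, y: v.y / length, z: v.z / length };
}

// Normal of the triangle (a, b, c): the cross product of two of its edges,
// scaled to length 1. The direction depends on the order of the vertices.
function faceNormal(a, b, c) {
  return normalize(cross(subtract(b, a), subtract(c, a)));
}

// Vertices 0, 2, and 1 of the hand-built cube, that is, new THREE.Face3(0,2,1).
var v0 = { x: 1, y: 3, z: 1 };
var v2 = { x: 1, y: -1, z: 1 };
var v1 = { x: 1, y: 3, z: -1 };

var normal = faceNormal(v0, v2, v1);
console.log(normal); // points along the positive x axis
```

For this face the normal points along the positive x axis, which is the outward direction for the side of the cube at x = 1. If you move one of the three vertices and call faceNormal() again, you get a different direction, which is exactly the information that has to be refreshed after editing the vertices.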
The last geometry functionality that we'll look at is the clone() function. We had mentioned that the geometry defines the form, the shape of an object, and combined with a material we can create an object that can be added to the scene to be rendered by the Three.js library. With the clone() function, as the name implies, we can make a copy of the geometry and, for instance, use it to create a different mesh with a different material. In the same example, that is, 05-custom-geometry.html, you can see a clone button at the top of the control GUI, as seen in the following screenshot: If you click on this button, a clone will be made of the geometry as it currently is, and a new object is created with a different material and is added to the scene. The code for this is rather trivial, but is made a bit more complex because of the materials that I have used. Let's take a step back and first look at the code that was used to create the green material for the cube: var materials = [ new THREE.MeshLambertMaterial( { opacity:0.6, color: 0x44ff44, transparent:true } ), new THREE.MeshBasicMaterial( { color: 0x000000, wireframe: true } ) ]; As you can see, I didn't use a single material, but an array of two materials. The reason is that besides showing a transparent green cube, I also wanted to show you the wireframe, since that shows very clearly where the vertices and faces are located. The Three.js library, of course, supports the use of multiple materials when creating a mesh. You can use the SceneUtils.createMultiMaterialObject() function for this as shown: var mesh = THREE.SceneUtils.createMultiMaterialObject( geom,materials); What the Three.js library does in this function is that it doesn't create one THREE.Mesh instance, but it creates one for each material that you have specified, and puts all of these meshes in a group. This group can be used in the same manner that you've used for the Scene object. You can add meshes, get meshes by name, and so on. 
For instance, to add shadows to all the children in this group, we will do the following:

    mesh.children.forEach(function(e) { e.castShadow = true; });

Now back to the clone() function that we were discussing earlier:

    this.clone = function() {
        var cloned = mesh.children[0].geometry.clone();
        var materials = [
            new THREE.MeshLambertMaterial({ opacity: 0.6, color: 0xff44ff, transparent: true }),
            new THREE.MeshBasicMaterial({ color: 0x000000, wireframe: true })
        ];
        var mesh2 = THREE.SceneUtils.createMultiMaterialObject(cloned, materials);
        mesh2.children.forEach(function(e) { e.castShadow = true; });
        mesh2.translateX(5);
        mesh2.translateZ(5);
        mesh2.name = "clone";
        scene.remove(scene.getChildByName("clone"));
        scene.add(mesh2);
    }

This piece of JavaScript is called when the clone button is clicked on. Here we clone the geometry of the first child of the cube. Remember, the mesh variable contains two children: a mesh that uses the MeshLambertMaterial and a mesh that uses the MeshBasicMaterial. Based on this cloned geometry, we create a new mesh, aptly named mesh2. We move this new mesh by using the translateX() and translateZ() functions, remove the previous clone (if present), and add the clone to the scene. That's enough on geometries for now.

The functions and attributes for a mesh

We've already learned that, in order to create a mesh, we need a geometry and one or more materials. Once we have a mesh, we can add it to the scene, and it is rendered. There are a couple of properties that you can use to change where and how this mesh appears in the scene. In the first example, we'll look at the following set of properties and functions:

Function/Property | Description
--- | ---
position | Determines the position of this object relative to the position of its parent. Most often the parent of an object is a THREE.Scene() object.
rotation | With this property you can set the rotation of an object around any of its axes.
scale | This property allows you to scale the object along its x, y, and z axes.
translateX(amount) | Moves the object by the specified amount along the x axis.
translateY(amount) | Moves the object by the specified amount along the y axis.
translateZ(amount) | Moves the object by the specified amount along the z axis.

As always, we have an example ready for you that'll allow you to play around with these properties. If you open up the 06-mesh-properties.html example in your browser, you will get a drop-down menu where you can alter all these properties and directly see the result, as shown in the following screenshot:

Let me walk you through them; I'll start with the position property. We've already seen this property a couple of times, so let's quickly address it. With this property you can set the x, y, and z coordinates of the object. The position of an object is relative to its parent object, which usually is the scene that you have added the object to. We can set an object's position property in three different ways; each coordinate can be set directly as follows:

    cube.position.x = 10;
    cube.position.y = 3;
    cube.position.z = 1;

But we can also set all of them at once:

    cube.position.set(10, 3, 1);

There is also a third option. The position property is a THREE.Vector3 object. This means that we can also do the following to set this object:

    cube.position = new THREE.Vector3(10, 3, 1);

I want to make a quick sidestep before looking at the other properties of this mesh. I mentioned that this position is set relative to the position of its parent. In the previous section on THREE.Geometry, we made use of the THREE.SceneUtils.createMultiMaterialObject() function to create a multimaterial object. I explained that this doesn't really return a single mesh, but a group that contains a mesh based on the same geometry for each material. In our case, it is a group that contains two meshes. If we change the position of one of the meshes that is created, you can clearly see that there really are two distinct objects.
However, if we now move the created group around, the offset will remain the same. These two meshes are shown in the following screenshot:

OK, the next one on the list is the rotation property. You've already seen this property being used a couple of times in this article. With this property, you can set the rotation of the object around one of its axes. You can set this value in the same manner as we did for the position property. A complete rotation, as you might remember from math class, is two pi radians. The following code snippet shows how to configure this:

    cube.rotation.x = 0.5 * Math.PI;
    cube.rotation.set(0.5 * Math.PI, 0, 0);
    cube.rotation = new THREE.Vector3(0.5 * Math.PI, 0, 0);

You can play around with this property by using the 06-mesh-properties.html example.

The next property on our list is one that we haven't talked about: scale. The name pretty much sums up what you can do with this property. You can scale the object along a specific axis. If you set the scale to values smaller than one, the object will shrink as shown:

When you use values larger than one, the object will become larger as shown in the screenshot that follows:

The last part of the mesh that we'll look at in this article is the translate functionality. With translate, you can also change the position of an object, but instead of defining the absolute position of where you want the object to be, you define where the object should move to, relative to its current position. For instance, we've got a sphere object that is added to a scene and its position has been set to (1,2,3). Next, we translate the object along its x axis by calling translateX(4). Its position will now be (5,2,3). If we want to restore the object to its original position, we call translateX(-4). In the 06-mesh-properties.html example, there is a menu tab called translate. From there you can experiment with this functionality. Just set the translate values for the x, y, and z axes, and click on the translate button.
You'll see that the object is being moved to a new position based on these three values.
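The difference between setting a position and translating can be replayed in a few lines. The sketch below is a plain JavaScript stand-in for a mesh, not Three.js code: makeObject() is a made-up helper, and it assumes an object that has not been rotated. In Three.js the translate functions move along the object's own axes, so for a rotated object the resulting position would differ.

```javascript
// Illustrative stand-in for a mesh (not Three.js code). Assumes the object is
// not rotated, so its own x axis coincides with the x axis of its parent.
function makeObject(x, y, z) {
  return {
    position: { x: x, y: y, z: z },
    // Relative move: adds to the current x coordinate instead of replacing it.
    translateX: function (amount) {
      this.position.x += amount;
      return this;
    }
  };
}

var sphere = makeObject(1, 2, 3);

sphere.translateX(4);
console.log(sphere.position); // { x: 5, y: 2, z: 3 }

sphere.translateX(-4);
console.log(sphere.position); // { x: 1, y: 2, z: 3 }
```

Setting position.x = 4 would have placed the sphere at (4,2,3) regardless of where it was before, while translateX(4) depends on the starting position.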

Packt
17 Oct 2013
12 min read

KNIME terminologies

Organizing your work

In KNIME, you store your files in a workspace. When KNIME starts, you can specify which workspace you want to use. The workspaces are not just for files; they also contain settings and logs. It might be a good idea to set up an empty workspace once; then, instead of customizing a new one each time you start a new project, you just copy (extract) it to the place you want to use, and open it with KNIME (or switch to it).

The workspace can contain workflow groups (sometimes referred to as workflow sets) or workflows. The groups are like folders in a filesystem that can help organize your workflows. Workflows are like programs or processes: they describe the steps that should be applied to load, analyze, visualize, or transform the data you have, something like an execution plan. Workflows contain the executable parts, which can be edited using the workflow editor, which in turn is similar to a canvas. Both the groups and the workflows might have metadata associated with them, such as the creation date, author, or comments (even the workspace can contain such information).

Workflows might contain nodes, meta nodes, connections, workflow variables (or just flow variables), workflow credentials, and annotations besides the previously introduced metadata.

Workflow credentials are where you can store your login name and password for different connections. These are kept safe, but you can access them easily. It is safe to share a workflow if you use only the workflow credentials for sensitive information (although the user name will be saved).

Nodes

Each node has a type, which identifies the algorithm associated with the node. You can think of the type as a template; it specifies how to execute for different inputs and parameters, and what the result should be. The nodes are similar to functions (or operators) in programs.
The node types are organized according to the following general types, which specify the color and the shape of the node for easier understanding of workflows. The general types are shown in the following image:

Example representation of different general types of nodes

The nodes are organized in categories; this way, it is easier to find them. Each node has node documentation that describes what can be achieved using that type of node, possibly with use cases or tips. It also contains information about parameters and possible input ports and output ports. (Sometimes the last two are called inports and outports, or even in-ports and out-ports.)

Parameters are usually single values (for example, filename, column name, text, number, date, and so on) associated with an identifier, although having an array of texts is also possible. These are the settings that influence the execution of a node. There are other things that can modify the results, such as workflow variables or any other state observable from KNIME.

Node lifecycle

Nodes can have any of the following states:

Misconfigured (also called IDLE)
Configured
Queued for execution
Running
Executed

There are possible warnings in most of the states, which might be important; you can read them by moving the mouse pointer over the triangle sign.

Meta nodes

Meta nodes look like normal nodes at first sight, although they contain other nodes (or meta nodes) inside them. The associated context of the node might give options for special execution. Usually they help to keep your workflow organized and less scary at first sight.

A user-defined meta node

Ports

The ports are where data in some form flows from one node to another. The most common port type is the data table. Data table ports are represented by white triangles. The input ports (where data is expected to get in) are on the left-hand side of the nodes, while the output ports (where the created data comes out) are on the right-hand side of the nodes.
You cannot mix and match the different kinds of ports. It is also not allowed to connect a node's output to its input or create cycles in the graph of nodes; you have to create a loop if you want to achieve something similar to that.

Currently, all ports in the standard KNIME distribution present the results only when they are ready, although the infrastructure already allows other strategies, such as streaming, where you can view partial results too. The ports might contain information about the data even if their nodes are not yet executed.

Data tables

These are the most common form of port types. They are similar to an Excel sheet or a table in a database. Sometimes these are named example set or data frame. Each data table has a name, a structure (or schema, a table specification), and possibly properties. The structure describes the data present in the table by storing some properties about the columns. In other contexts, columns may be called attributes, variables, or features.

A column can only contain data of a single type (but the types form a hierarchy, and the type at the top of the hierarchy can hold values of any type). Each column has a type, a name, and a position within the table. Besides these, they might also contain further information, for example, statistics about the contained values or color/shape information for visual representation.

There is always something in the data tables that looks like a column, even if it is not really a column. This is where the identifiers for the rows are held, that is, the row keys. There can be multiple rows in the table, just like in most of the other data handling software (similar to observations or records). The row keys are unique (textual) identifiers within the table. They have multiple roles besides that; for example, row keys are usually the labels when showing the data, so always try to find user-friendly identifiers for the rows.
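The two rules described above, one type per column and unique row keys, can be pictured with a toy data structure. The following sketch is not KNIME's API; createTable(), addRow(), and the column definitions are names invented for this illustration, and JavaScript's typeof is used as a stand-in for the column type.

```javascript
// Toy model of a data table (an illustration, not KNIME's API):
// typed columns plus unique textual row keys.
function createTable(columns) {
  var rows = {};    // row key -> array of cell values
  var rowKeys = []; // row keys in insertion order

  return {
    columns: columns,
    addRow: function (rowKey, cells) {
      // Row keys must be unique within the table.
      if (Object.prototype.hasOwnProperty.call(rows, rowKey)) {
        throw new Error("Duplicate row key: " + rowKey);
      }
      if (cells.length !== columns.length) {
        throw new Error("Expected " + columns.length + " cells");
      }
      // Every cell must match the single type of its column.
      cells.forEach(function (cell, i) {
        if (typeof cell !== columns[i].type) {
          throw new Error("Column " + columns[i].name + " expects " + columns[i].type);
        }
      });
      rows[rowKey] = cells;
      rowKeys.push(rowKey);
    },
    getRow: function (rowKey) {
      return rows[rowKey];
    },
    rowCount: function () {
      return rowKeys.length;
    }
  };
}

var table = createTable([
  { name: "formula", type: "string" },
  { name: "weight", type: "number" }
]);
table.addRow("water", ["H2O", 18.015]);
table.addRow("ethanol", ["C2H6O", 46.07]);
console.log(table.rowCount(), table.getRow("water"));
```

Using readable keys such as "water" instead of generated ones follows the advice above: the row key doubles as the label when the data is shown.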
At the intersection of rows and columns are the (data) cells, similar to the data found in Excel sheets or in database tables (in other contexts, these might be called values or fields). There is a special cell that represents the missing values. The missing value is usually represented as a question mark (?). If you have to represent more information about the missing data, you should consider adding a new column for each column where this requirement is present, and add that information there; however, in the original column, you just declare it as missing.

There are multiple cell types in KNIME, and the following table contains the most important ones:

Cell type | Symbol | Remarks
--- | --- | ---
Int cell | I | This represents integral numbers in the range from -2^31 to 2^31-1 (approximately 2E9).
Long cell | L | This represents larger integral numbers, and their range is from -2^63 to 2^63-1 (approximately 9E18).
Double cell | D | This represents real numbers with double (64 bit) floating point precision.
String cell | S | This represents unstructured textual information.
Date and time cell | calendar & clock | With these cells, you can store either date or time.
Boolean cell | B | This represents logical values from the Boolean algebra (true or false); note that you cannot exclude the missing value.
Xml cell | XML | This cell is ideal for structured data.
Set cell | {...} | This cell can contain multiple cells (so it is a collection cell type) of the same type (duplicates are removed and the order of values is not preserved).
List cell | {...} | This is also a collection cell type, but this keeps the order and does not filter out the duplicates.
Unknown type cell | ? | When you have different types of cells in a column (or in a collection cell), this is the generic cell type used.

There are other cell types, for example, the ones for chemical data structures (SMILES, CDK, and so on), for images (SVG cell, PNG cell, and so on), or for documents. This is extensible, so other extensions can define custom data cell types.
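The approximate limits quoted for the Int and Long cells are easy to check. The following sketch simply evaluates the two ranges; it assumes a JavaScript runtime with BigInt support, because the Long limits do not fit into an ordinary JavaScript number without losing precision. The variable names are chosen for this illustration.

```javascript
// Range of the Int cell: -2^31 to 2^31-1.
var intMin = -Math.pow(2, 31);
var intMax = Math.pow(2, 31) - 1;

// Range of the Long cell: -2^63 to 2^63-1. These values exceed the exact
// integer range of JavaScript numbers, so BigInt is used instead.
var longMin = -(2n ** 63n);
var longMax = 2n ** 63n - 1n;

console.log(intMin, intMax);
// -2147483648 2147483647 (about 2E9)

console.log(longMin.toString(), longMax.toString());
// -9223372036854775808 9223372036854775807 (about 9E18)
```

A value such as 3,000,000,000 therefore no longer fits into an Int cell and needs a Long cell.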
Note that any data cell type can contain the missing value.

Port view

The port view allows you to get information about the content of the port. Complete content is available only after the node is executed, but usually some information is available even before that. This is very handy when you are constructing the workflow. You can check the structure of the data, although in the later stages of data exploration during workflow construction you will usually use node views.

Flow variables

Workflows can contain flow variables, which can act as a loop counter, a column name, or even an expression for a node parameter. These are not constants, but you can introduce them at the workspace level as well.

This is a powerful feature; once you master it, you can create workflows you thought were impossible to create using KNIME. A typical use case for them is to assign roles to different columns (by assigning the column names to the role name as a flow variable) and use this information for node configurations. If your workflow has some important parameters that should be adjusted or set before each execution (for example a filename), this is an ideal option to provide these to the user; use the flow variables instead of a preset value that is hard to find. As the automatic generation of figures gets more support, the flow variables will find use there too.

Iterating a range of values or files in a folder should also be done using flow variables.

Node views

Nodes can also have node views associated with them. These help to visualize your data or a model, show the node's internal state, or select a subset of the data using the HiLite feature. An important feature is that a node's views can be opened multiple times. This allows us to compare different options of visualization without taking screenshots or having to remember what it was like, and how you reached that state. You can export these views to image files.

HiLite

The HiLite feature of KNIME is quite unique.
Its purpose is to help identify a group of data that is important or interesting for some reason. This is related to the node views, as this selection is only visible in nodes with node views (for example, it is not available in port views). Support for data highlighting is optional, because not all views support this feature.

The HiLite selection data is based on row keys, and this information can be lost when the row keys change. For this reason, some of the nonview nodes also have an option to keep this information propagated to the adjacent nodes. On the other hand, when the row keys remain the same, the marks in different views point to the same data rows.

It is very important that the HiLite selection is only visible in a well-connected subgraph of the workflow. It can also be available for non-executed nodes (for example, the HiLite Collector node).

The HiLite information is not saved in the workflow, so you should use the HiLite filter node once you are satisfied with your selection to save that state, and you can reset that HiLite later.

Eclipse concepts

Because KNIME is based on the Eclipse platform (http://eclipse.org), it inherits some of its features too. One of them is the workspace model with projects (workflows in the case of KNIME), and another important one is modularity. You can extend KNIME's functionality using plugins and features; sometimes these are named KNIME extensions. The extensions are distributed through update sites, which allow you to install updates or install new software from a local folder, a zip file, or an Internet location.

The help system, the update mechanism (with proxy settings), and the file search feature are also provided by Eclipse. Eclipse's main window is the workbench. The most typical features are the perspectives and the views. Perspectives are about how the parts of the UI are arranged, while these independently configurable parts are the views. These views have nothing to do with node views or port views.
The Eclipse/KNIME views can be detached, closed, moved around, minimized, or maximized within the window. Usually each view can have at most one instance visible (the Console view is an exception). KNIME does not support alternative perspectives (arrangements of views), so it is not important for you; however, you can still reset it to its original state.

It might be important to know that Eclipse keeps the contents of files and folders in a special form. If you generate files, you should refresh the content to load it from the filesystem. You can do this from the context menu, but it can also be automated if you prefer that option.

Preferences

The preferences are associated with the workspace you use. This is where most of the Eclipse and KNIME settings are to be specified. The node parameters are stored in the workflows (which are also within the workspace), and these parameters are not considered to be preferences.

Logging

KNIME has something to tell you about almost every action. Usually you do not need to read these logs. For this reason, KNIME dispatches these messages using different channels. There is a file in the workspace that collects all the messages by default with considerable details. There is even a KNIME/Eclipse view named Console, which contains only the most important details initially.

Summary

In this article we went through the most important terminologies and concepts you will use when using KNIME.

Packt
11 Oct 2013
10 min read

Administrating Solr

Query nesting

You might come across situations wherein you need to nest a query within another query in order to search for a specific keyword or phrase. Let us imagine that you want to run a query using the standard request handler, but you need to embed a query that is parsed by the dismax query parser inside it. Isn't that interesting? We will show you how to do it.

Our example data looks like this:

    <add>
      <doc>
        <field name="id">1</field>
        <field name="title">Reviewed solrcook book</field>
      </doc>
      <doc>
        <field name="id">2</field>
        <field name="title">Some book reviewed</field>
      </doc>
      <doc>
        <field name="id">3</field>
        <field name="title">Another reviewed little book</field>
      </doc>
    </add>

Here, we are going to use the standard query parser to support the Lucene query syntax, but we would like to boost phrases using the dismax query parser. At first it seems impossible to achieve, but don't worry, we will handle it. Let us suppose that we want to find books having the words reviewed and book in their title field and we would like to boost the reviewed book phrase by 10.
Here we go with the query:

    http://localhost:8080/solr/select?q=reviewed+AND+book+AND+_query_:"{!dismax qf=title pf=title^10 v=$qq}"&qq=reviewed+book

The results of the preceding query should look like:

    <?xml version="1.0" encoding="UTF-8"?>
    <response>
      <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">2</int>
        <lst name="params">
          <str name="fl">*,score</str>
          <str name="qq">book reviewed</str>
          <str name="q">book AND reviewed AND _query_:"{!dismax qf=title pf=title^10 v=$qq}"</str>
        </lst>
      </lst>
      <result name="response" numFound="3" start="0" maxScore="0.77966106">
        <doc>
          <float name="score">0.77966106</float>
          <str name="id">2</str>
          <str name="title">Some book reviewed</str>
        </doc>
        <doc>
          <float name="score">0.07087828</float>
          <str name="id">1</str>
          <str name="title">Reviewed solrcook book</str>
        </doc>
        <doc>
          <float name="score">0.07087828</float>
          <str name="id">3</str>
          <str name="title">Another reviewed little book</str>
        </doc>
      </result>
    </response>

Let us focus on the query. The q parameter is built of two parts connected with the AND operator. The first one, reviewed+AND+book, is just a usual query with the logical operator AND. The second part starts with a strange-looking expression, _query_. This expression tells Solr that another query should be made that will affect the results list. We then see the expression stating that Solr should use the dismax query parser (the !dismax part) along with the parameters that will be passed to the parser (qf and pf). The v parameter is an abbreviation for value, and it is used to pass the query string to the nested parser; here v=$qq refers to the separate qq request parameter, so in our case reviewed+book is passed to the dismax query parser.

And that's it! We arrive at the search results that we expected.
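When such a URL is built in code rather than typed into a browser, the quotes, braces, and the caret in the nested query have to be escaped. The following sketch assembles the request with the standard URLSearchParams class, which takes care of the escaping. The function name buildNestedQueryUrl() and its parameters are made up for this illustration; the host, port, and field names are the ones used in the recipe.

```javascript
// Builds the nested-query URL from the recipe. URLSearchParams escapes the
// special characters, so the nested query can be written in readable form.
function buildNestedQueryUrl(baseUrl, mainQuery, phrase) {
  var params = new URLSearchParams();
  // Main query (standard parser) plus the nested dismax query.
  // v=$qq tells the nested parser to read its query string from the qq parameter.
  params.set("q", mainQuery + ' AND _query_:"{!dismax qf=title pf=title^10 v=$qq}"');
  params.set("qq", phrase);
  return baseUrl + "?" + params.toString();
}

var url = buildNestedQueryUrl(
  "http://localhost:8080/solr/select",
  "reviewed AND book",
  "reviewed book"
);
console.log(url);
```

Keeping the phrase in its own qq parameter means that only one value changes between requests, and the nested query string itself can stay constant.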
Stats.jsp

When you click on the Statistics link in the admin interface, you receive a web page of information about the specific index. This information is actually served to the browser as XML linked to an embedded XSL stylesheet, which is then transformed into HTML in the browser. This means that if you perform a GET request on stats.jsp, you will get the XML back, as demonstrated here:

    curl http://localhost:8080/solr/mbartists/admin/stats.jsp

If you open the downloaded file, you will see all the data as XML. The following code is an extract of the available statistics for the cache that stores individual documents and for the standard request handler, with the metrics you might wish to monitor:

    <entry>
      <name>documentCache</name>
      <class>org.apache.solr.search.LRUCache</class>
      <version>1.0</version>
      <description>LRU Cache(maxSize=512, initialSize=512)</description>
      <stats>
        <stat name="lookups">3251</stat>
        <stat name="hits">3101</stat>
        <stat name="hitratio">0.95</stat>
        <stat name="inserts">160</stat>
        <stat name="evictions">0</stat>
        <stat name="size">160</stat>
        <stat name="warmupTime">0</stat>
        <stat name="cumulative_lookups">3251</stat>
        <stat name="cumulative_hits">3101</stat>
        <stat name="cumulative_hitratio">0.95</stat>
        <stat name="cumulative_inserts">150</stat>
        <stat name="cumulative_evictions">0</stat>
      </stats>
    </entry>
    <entry>
      <name>standard</name>
      <class>org.apache.solr.handler.component.SearchHandler</class>
      <version>$Revision: 1052938 $</version>
      <description>Search using components: org.apache.solr.handler.component.QueryComponent, org.apache.solr.handler.component.FacetComponent</description>
      <stats>
        <stat name="handlerStart">1298759020886</stat>
        <stat name="requests">359</stat>
        <stat name="errors">0</stat>
        <stat name="timeouts">0</stat>
        <stat name="totalTime">9122</stat>
        <stat name="avgTimePerRequest">25.409472</stat>
        <stat name="avgRequestsPerSecond">0.446995</stat>
      </stats>
    </entry>

The method of integrating with
monitoring systems varies from system to system. As an example, you may explore ./examples/8/check_solr.rb, a simple Ruby script that queries the core and checks whether the average hit ratio and the average time per request are above a defined threshold:

    ./check_solr.rb -w 13 -c 20 -imtracks
    CRITICAL - Average Time per request more than 20 milliseconds old: 39.5

In the previous example, we defined 20 milliseconds as the threshold, and the average time to serve a request is 39.5 milliseconds (far greater than the threshold we set).

Ping status

Ping status is the outcome of PingRequestHandler, which is primarily used for reporting SolrCore health to a load balancer; that is, this handler has been designed to be used as the endpoint that an HTTP load balancer calls while checking the "health" or "up status" of a Solr server. In simpler terms, ping status denotes the availability of your Solr server (uptime and downtime) for the defined duration. The handler should be configured with some defaults indicating a request that should be executed. If the request succeeds, PingRequestHandler responds with a simple OK status. If the request fails, PingRequestHandler responds with the corresponding HTTP error code. Clients (such as load balancers) can be configured to poll PingRequestHandler and monitor for these types of responses (or for a simple connection failure) to know whether there is a problem with the Solr server. A PingRequestHandler configuration looks something like the following:

    <requestHandler name="/admin/ping" class="solr.PingRequestHandler">
      <lst name="invariants">
        <str name="qt">/search</str><!-- handler to delegate to -->
        <str name="q">some test query</str>
      </lst>
    </requestHandler>

You may also try a more advanced option, which is to configure the handler with a healthcheckFile that can be used to enable/disable the PingRequestHandler.
It would look something like the following:

    <requestHandler name="/admin/ping" class="solr.PingRequestHandler">
      <!-- relative paths are resolved against the data dir -->
      <str name="healthcheckFile">server-enabled.txt</str>
      <lst name="invariants">
        <str name="qt">/search</str><!-- handler to delegate to -->
        <str name="q">some test query</str>
      </lst>
    </requestHandler>

A couple of points that you should know when using the healthcheckFile option are:

  • If the health check file exists, the handler executes the query and returns the status as described previously.
  • If the health check file does not exist, the handler returns an HTTP error even though the server is working fine and the query would have succeeded.

This health check file feature can be used to indicate to some load balancers that the server should be "removed from rotation" for maintenance, upgrades, or any other reason you may wish.

Business rules

You might come across situations in which a customer running an e-store with different types of products (such as jewelry, electronic gadgets, automotive products, and so on) defines a business need that requires the search results to adapt to the search keyword. For instance, imagine a customer requirement in which you need to add facets such as Brand, Model, Lens, Zoom, Flash, Dimension, Display, Battery, Price, and so on whenever a user searches for the keyword "Camera". So far, the requirement is easy and can be achieved in a simple way. Now let us add some complexity: facets such as Year, Make, Model, VIN, Mileage, Price, and so on should be added automatically when the user searches for the keyword "Bike". Worried about how to handle such a complex requirement? This is where business rules come into play. There are a number of rule engines on the market (both proprietary and open source), such as Drools, JRules, and so on, which can be plugged into your Solr.
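To make the requirement concrete before bringing in a rule engine, here is a minimal sketch, assuming a plain lookup table, of the keyword-to-facet mapping described above. The names facetRules and facetsForKeyword are hypothetical; this is neither a Solr nor a Drools API, only an illustration of the decision a business rule would encode.

```javascript
// Hypothetical rule table: search keyword -> facet fields to add.
// The facet names come from the example requirement in the text.
const facetRules = new Map([
  ["camera", ["Brand", "Model", "Lens", "Zoom", "Flash", "Dimension",
              "Display", "Battery", "Price"]],
  ["bike", ["Year", "Make", "Model", "VIN", "Mileage", "Price"]],
]);

// Look up the facets for a keyword, ignoring case and surrounding spaces.
// Unknown keywords get no extra facets.
function facetsForKeyword(keyword) {
  const key = keyword.trim().toLowerCase();
  return facetRules.get(key) || [];
}

console.log(facetsForKeyword("Camera"));
console.log(facetsForKeyword("Bike"));
```

A hard-coded table like this has to be redeployed whenever the business changes its mind, which is the motivation for moving such conditions into externally managed rules.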
Drools

Now let us understand how Drools functions. It injects the rules into working memory, and then it evaluates which custom rules should be triggered based on the conditions stated in the working memory. It is based on if-then clauses, which enable the rules coder to define what condition must be true (using the if or when clause) and what action/event should be triggered when the defined condition is met (using the then clause). Drools conditions are evaluated against any Java objects that the application wishes to inject as input. A business rule is more or less in the following format:

    rule "ruleName"
    when
        // CONDITION
    then
        // ACTION
    end

We will now show you how to write an example rule in Drools:

    rule "WelcomeLucidWorks"
        no-loop
    when
        $respBuilder : ResponseBuilder();
    then
        $respBuilder.rsp.add("welcome", "lucidworks");
    end

The given code snippet checks for a ResponseBuilder object (one of the prime objects that help in processing search requests in a SearchComponent) in the working memory and then adds a key-value pair to that ResponseBuilder (in our case, welcome and lucidworks).

Summary

In this article, we saw how to nest a query within another query, learned about stats.jsp and how to use ping status, and looked at what business rules are, when they prove to be important, and how to write a custom rule using Drools.
Packt
08 Oct 2013
10 min read

Creating Network Graphs with Gephi

Basic network graph terminology

Network graphs are essentially based on the construct of nodes and edges. Nodes represent points or entities within the data, while edges refer to the connections or lines between nodes. Individual nodes might be students in a school, or schools within an educational system, or perhaps agencies within a government structure. Individual nodes may be represented through equal sizes, but can also be depicted as smaller or larger based on the magnitude of a selected measure. For example, a node with many connections may be portrayed as far larger and thus more influential than a sparsely connected node. This approach will provide viewers with a visual cue that shows them where the highest (and lowest) levels of activity occur within a graph.

Nodes will generally be positioned based on the similarity of their individual connections, leading to clusters of individual nodes within a larger network. In most network algorithms, nodes with higher levels of connections will also tend to be positioned near the center of the graph, while those with few connections will move toward the perimeter of the display.

Edges are the connections between nodes, and may be displayed as undirected or directed paths. Undirected relationships indicate a connection that flows in both directions, while a directed relationship moves in a single direction that always originates in one node and moves toward another node. Undirected connections will tend to predominate in most cases, such as in social networks where participant activity flows in both directions. On occasion, we will see directed connections, as in the case of some transportation or technology networks where there are connections that flow in a single direction. Edges may also be weighted, to show the strength of the connection between nodes.
In the case where University A has performed research with both University B and University C, the strength (width) of the edge will show viewers where the stronger relationship exists. If A and B have combined for three projects, while A and C have collaborated on nine projects, we should weight the A to C connection three times that of the A to B connection.

Another commonly used term is neighbors, which is nothing more than a node that is directly connected to a second node. Neighbors can be stated to be one degree apart. Degrees is the term used to refer to the number of connections flowing into (or away from) a node (also known as Degree Centrality), as well as to the number of connections required to connect to another node via the shortest possible path. In complex graphs, you may find nodes that are four, five, or even more degrees away from a distant node, and in some cases two nodes may not be connected at all.

Now that you have a very basic understanding of network graph theory, let's learn about some of the common network graph algorithms.

Common network graph algorithms

Before we introduce you to some specific graph algorithms, we'll briefly discuss some of the theory behind network graphs and introduce you to a few of the terms you will frequently encounter. Network graphs are drawn through positioning nodes and their respective connections relative to one another. In the case of a graph with 8 or 10 nodes, this is a rather simple exercise, and could probably be drawn rather accurately without the help of complex methodologies. However, in the typical case where we have hundreds of nodes with thousands of edges, the task becomes far more complex.
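The terminology introduced earlier (weighted edges, neighbors, and the two meanings of degrees) can be illustrated with a small sketch. This is not how Gephi stores or computes anything; it is a plain JavaScript illustration under the assumption of an undirected edge list, and the node names, the extra nodes D and E, and all function names are hypothetical.

```javascript
// Hypothetical weighted, undirected edge list based on the university example.
const edges = [
  { source: "A", target: "B", weight: 3 }, // three joint projects
  { source: "A", target: "C", weight: 9 }, // nine joint projects
  { source: "C", target: "D", weight: 1 }, // extra node added for illustration
];
const nodes = ["A", "B", "C", "D", "E"]; // E has no connections at all

// Build a neighbor list per node; undirected edges count in both directions.
function buildNeighbors(nodeIds, edgeList) {
  const neighbors = new Map(nodeIds.map((id) => [id, []]));
  for (const { source, target } of edgeList) {
    neighbors.get(source).push(target);
    neighbors.get(target).push(source);
  }
  return neighbors;
}

// Degree of a node: the number of edges attached to it.
function degree(neighbors, id) {
  return neighbors.get(id).length;
}

// Degrees apart: edge count on the shortest path, via breadth-first search.
// Returns null when the two nodes are not connected at all.
function degreesApart(neighbors, from, to) {
  const distance = new Map([[from, 0]]);
  const queue = [from];
  while (queue.length > 0) {
    const current = queue.shift();
    if (current === to) return distance.get(current);
    for (const next of neighbors.get(current)) {
      if (!distance.has(next)) {
        distance.set(next, distance.get(current) + 1);
        queue.push(next);
      }
    }
  }
  return null;
}

const neighbors = buildNeighbors(nodes, edges);
console.log(degree(neighbors, "A"));            // direct neighbors of A
console.log(degreesApart(neighbors, "B", "D")); // path B -> A -> C -> D
console.log(degreesApart(neighbors, "A", "E")); // no path exists
```

Even this toy version hints at why layout becomes hard at scale: every additional node adds more pairwise relationships that a layout has to respect.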
Some of the more prominent graph classes in Gephi include the following:

  • Force-directed algorithms refer to a category of layouts that position elements based on the principles of attraction, repulsion, and gravity
  • Circular algorithms position graph elements around the perimeter of one or more circles, and may allow the user to dictate the order of the elements based on data properties
  • Dual circle layouts position a subset of nodes in the interior of the graph with the remaining nodes around the perimeter, similar to a single circular graph
  • Concentric layouts arrange the graph nodes using an approximately circular graph design, with less connected nodes at the perimeter of the graph and highly connected nodes typically near the center
  • Radial axis layouts provide the user with the ability to determine some portion of the graph layout by defining graph attributes

The type of graph you select may well be dictated by the sort of results you seek. If you wish to feature certain groups within your dataset, one of the layouts that allows you to segment the graph by groups will provide a potentially quick solution to your needs. In this instance, one of the circular or radial axis graphs may prove ideal.

On the other hand, if you are attempting to discover relationships in a complex new dataset, one of the several available Force-directed layouts is likely a better choice. These algorithms will rely on the values in your dataset to determine the positioning within the graph. When choosing one of these approaches, please note that there will often be an extensive runtime to calculate the graph layout, especially as the data becomes more complex. Even on a powerful computer, examples may run for minutes or hours in an attempt to fully optimize the graph. Fortunately, you will have the ability in Gephi to stop these algorithms at any given point, and you will still have a very reasonable, albeit imperfect graph.
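Of the layout families listed above, the circular one is the simplest to picture in code. The following is a minimal sketch, not Gephi's implementation, assuming nodes are ordered by a data property and then spaced evenly around one circle. The function circularLayout and the sample nodes are hypothetical.

```javascript
// Sketch of a single-circle layout: order nodes by a data property,
// then place them at equal angles around a circle centered on (0, 0).
function circularLayout(nodes, radius, orderBy) {
  // Copy before sorting so the caller's array is left untouched.
  const ordered = [...nodes].sort((a, b) => orderBy(a) - orderBy(b));
  return ordered.map((node, i) => {
    const angle = (2 * Math.PI * i) / ordered.length;
    return {
      id: node.id,
      x: radius * Math.cos(angle),
      y: radius * Math.sin(angle),
    };
  });
}

// Hypothetical nodes with a degree property used for ordering.
const sample = [
  { id: "n1", degree: 1 },
  { id: "n2", degree: 4 },
  { id: "n3", degree: 2 },
  { id: "n4", degree: 3 },
];
const positions = circularLayout(sample, 100, (n) => n.degree);
console.log(positions.map((p) => p.id).join(" "));
```

Unlike a force-directed layout, this runs in a single pass with no iteration, which is part of why the fixed layouts are quick but less revealing on unfamiliar data.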
In the next section, we'll look at a few of the standard layouts that are part of the Gephi base package.

Standard network graph layouts

Now that you are somewhat familiar with the types of layout algorithms, we'll take a look at what Gephi offers within the Layout tab. We'll begin with some common Force-directed approaches, and then examine some of the other choices.

One of the best known force algorithms is Force Atlas, which in Gephi provides users with multiple options for drawing the graph. Foremost among these are the Repulsion, Attraction, and Gravity settings. Repulsion strength adjustments will make individual nodes either more or less sensitive to other nodes they differ from. A higher repulsion level, for example, will push these nodes further apart. Conversely, setting the Attraction strength higher will force related nodes closer together. Finally, the Gravity setting will draw nodes closer to the center of the graph if it is set to a high level, and disperse them toward the edges if a very low value is set.

Force Atlas 2 is another layout option that employs slightly different parameters than the original Force Atlas method. You may wish to compare these methods and determine which one gives you better results.

Fruchterman Reingold is one more Force method, albeit one that provides you with just three parameters: Area, Gravity, and Speed. While the general approach is not unlike the Force Atlas algorithms, your results will appear different in a visual sense.

Finally, Gephi provides three Yifan Hu methods. Each of these models (Yifan Hu, Yifan Hu Proportional, and Yifan Hu Multilevel) is likely to run much more rapidly than the methods discussed earlier, while providing generally similar results.

Gephi also provides a variety of methods that do not employ the force approach. Some of the models, as we noted earlier in this article, provide you with more control over the final result.
This may be the result of selecting how to order the nodes, or of which attributes to use in grouping nodes together, either through color or location. In the section above, I referenced several layout options, but in the interest of space we'll take a closer look at two of them—the Circular and Radial Axis layouts. Circular layouts are well suited to relatively small networks, given the limited flexibility of their fixed layout. We can adjust this to some degree by specifying the diameter of the graph, but anything more than a few dozen well-connected nodes often becomes difficult to manage. However, with smaller networks, these layouts can be intriguing, providing us with the ability to see patterns within and between specific groups more easily than we might see them in some other layouts. While this article will not cover any filtering options, those too can be used to help us better utilize the circular layouts, by providing us with the ability to highlight specific groups and their resulting connections. Think of the circle resembling a giant spider web filled with connections, and the filters as tools that help us see specific threads within the web. Our final notes are on Radial Axis layouts, which can provide us with fascinating looks at our data, especially if there are natural groups within the network. Think of a school with several classrooms full of students, for example. Each of these classrooms can be easily identified and grouped, perhaps by color. In a complex force directed graph we may be able to spot each of these groups, but it may become difficult due to the interactions with other classes. In a Radial Axis layout we can dictate the group definitions, forcing each group to be bound together, apart from any other groups. There are pros and cons to this approach, of course, as there are with any of the other methods. 
If we wish to understand how a specific group interacts with another group, this method can prove beneficial, as it isolates these groups visually, making it easier to see connections between them. On the negative side, it is often quite difficult to see connections between members within the group, due to the nature of the layout. As with any layout, it is critical to look at the results and see how they apply to our original need. Always test your data using multiple layout algorithms, so that you wind up with the best possible approach.

Summary

Gephi is an ideal tool for users new to network graph analysis and visualization, as it provides a rich set of tools to create and customize network graphs. The user interface makes it easy to understand basic concepts such as nodes and edges, as well as descriptive terminology such as neighbors, degrees, repulsion, and attraction. New users can move as slowly or as rapidly as they wish, given Gephi's gentle learning curve.

Gephi can also help you see and understand patterns within your data through a variety of sophisticated graph methods that will appeal to both novice and seasoned users. The variety of layout algorithms gives you the opportunity to experiment with multiple layouts as you search for the best approach to display your data. In short, Gephi provides everything needed to produce first-rate network visualizations.

Packt
04 Oct 2013
7 min read

CQL for client applications

Using the Thrift API

The Thrift library is based on the Thrift RPC protocol. High-level clients built over it have been a standard way of building applications for a long time. In this section, we'll explain how to write a client application using CQL as the query language and Thrift as the Java API. When we start Cassandra, by default it listens to Thrift clients (the start_rpc: true property in the CASSANDRA_HOME/conf/cassandra.yaml file enables this). Let's build a small program that connects to Cassandra using the Thrift API and runs CQL 3 queries for reading/writing data in the UserProfiles table we created for the facebook application. The program can be built by performing the following steps:

Downloading the Thrift library : You need to add apache-cassandra-thrift-1.2.x.jar (which is found in the CASSANDRA_HOME/lib folder) to your classpath. If your Java project is mavenized, you need to insert the following entry in pom.xml under the dependency section (the version will vary depending upon your Cassandra server installation):

    <dependency>
      <groupId>org.apache.cassandra</groupId>
      <artifactId>cassandra-thrift</artifactId>
      <version>1.2.5</version>
    </dependency>

Connecting to the Cassandra server : To connect to the Cassandra server on a given host and port, you need to open an org.apache.thrift.transport.TTransport to the Cassandra node and create an instance of org.apache.cassandra.thrift.Cassandra.Client as follows:

    TTransport transport = new TFramedTransport(new TSocket("localhost", 9160));
    TProtocol protocol = new TBinaryProtocol(transport);
    Cassandra.Client client = new Cassandra.Client(protocol);
    transport.open();
    client.set_cql_version("3.0.0");

The default CQL version for Thrift is 2.0.0. You must set it to 3.0.0 if you are writing CQL 3 queries and don't want to see any version-related errors.
After you are done with the transport, close it gracefully (usually at the end of read/write operations) as follows:

    transport.close();

Creating a schema : The executeQuery() utility method accepts a CQL 3 query as a String and runs it:

    CqlResult executeQuery(String query) throws Exception {
        return client.execute_cql3_query(
            ByteBuffer.wrap(query.getBytes("UTF-8")),
            Compression.NONE,
            ConsistencyLevel.ONE);
    }

Now, create the keyspace and the table by directly executing CQL 3 queries:

    // Create keyspace
    executeQuery("CREATE KEYSPACE facebook WITH replication = "
        + "{'class':'SimpleStrategy','replication_factor':3};");
    executeQuery("USE facebook;");
    // Create table
    executeQuery("CREATE TABLE UserProfiles("
        + "email_id text,"
        + "password text,"
        + "name text,"
        + "age int,"
        + "profile_picture blob,"
        + "PRIMARY KEY(email_id)"
        + ");");

Reading/writing data : A couple of records can be inserted as follows:

    executeQuery("USE facebook;");
    executeQuery("INSERT INTO UserProfiles(email_id, password, name, age, profile_picture) VALUES('john.smith@example.com','p4ssw0rd','John Smith',32,0x8e37);");
    executeQuery("INSERT INTO UserProfiles(email_id, password, name, age, profile_picture) VALUES('david.bergin@example.com','guess1t','David Bergin',42,0xc9f1);");

Executing the SELECT query returns a CqlResult, over which we can iterate easily to fetch records:

    CqlResult result = executeQuery("SELECT * FROM facebook.UserProfiles "
        + "WHERE email_id = 'john.smith@example.com';");
    for (CqlRow row : result.getRows()) {
        System.out.println(row.getKey());
    }

Using the Datastax Java driver

The Datastax Java driver is based on the Cassandra binary protocol that was introduced in Cassandra 1.2, and works only with CQL 3. The Cassandra binary protocol is made specifically for Cassandra, in contrast to Thrift, which is a generic framework and has many limitations.
Now, we are going to write a Java program that uses the Datastax Java driver for reading/writing data into Cassandra, by performing the following steps:

Downloading the driver library : This driver library JAR file must be in your classpath in order to build an application using it. If you have a Maven-based Java project, you need to insert the following entry into the pom.xml file under the dependency section:

    <dependency>
      <groupId>com.datastax.cassandra</groupId>
      <artifactId>cassandra-driver-core</artifactId>
      <version>1.0.1</version>
    </dependency>

This driver project is hosted on GitHub (https://github.com/datastax/java-driver). It makes sense to check for and download the latest version.

Configuring Cassandra to listen to native clients : In newer versions of Cassandra, this is enabled by default and Cassandra will listen to clients using the binary protocol. Earlier Cassandra installations may require enabling it. All you have to do is check and enable the start_native_transport property in the CASSANDRA_HOME/conf/cassandra.yaml file by inserting/uncommenting the following line:

    start_native_transport: true

The port that Cassandra will use for listening to native clients is determined by the native_transport_port property. It is possible for Cassandra to listen to both Thrift and native clients simultaneously. If you want to disable Thrift, just set the start_rpc property to false in CASSANDRA_HOME/conf/cassandra.yaml.

Connecting to Cassandra : The com.datastax.driver.core.Cluster class is the entry point for clients to connect to the Cassandra cluster:

    Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();

After you are done using it (usually when the application shuts down), close it gracefully:

    cluster.shutdown();

Creating a session : An object of com.datastax.driver.core.Session allows you to execute a CQL 3 statement.
The following line creates a Session instance:

    Session session = cluster.connect();

Creating a schema : Before reading/writing data, let's create a keyspace and a table similar to UserProfiles in the facebook application we built earlier:

    // Create keyspace
    session.execute("CREATE KEYSPACE facebook WITH replication = "
        + "{'class':'SimpleStrategy','replication_factor':1};");
    session.execute("USE facebook");
    // Create table
    session.execute("CREATE TABLE UserProfiles("
        + "email_id text,"
        + "password text,"
        + "name text,"
        + "age int,"
        + "profile_picture blob,"
        + "PRIMARY KEY(email_id)"
        + ");");

Reading/writing data : We can insert a couple of records as follows:

    session.execute("USE facebook");
    session.execute("INSERT INTO UserProfiles(email_id, password, name, age, profile_picture) VALUES('john.smith@example.com','p4ssw0rd','John Smith',32,0x8e37);");
    session.execute("INSERT INTO UserProfiles(email_id, password, name, age, profile_picture) VALUES('david.bergin@example.com','guess1t','David Bergin',42,0xc9f1);");

Finding and printing records : A SELECT query returns an instance of com.datastax.driver.core.ResultSet. You can fetch individual rows by iterating over it using the com.datastax.driver.core.Row object:

    ResultSet results = session.execute("SELECT * FROM facebook.UserProfiles "
        + "WHERE email_id = 'john.smith@example.com';");
    for (Row row : results) {
        System.out.println("Email: " + row.getString("email_id")
            + "\tName: " + row.getString("name")
            + "\t Age : " + row.getInt("age"));
    }

Deleting records : We can delete a record as follows:

    session.execute("DELETE FROM facebook.UserProfiles WHERE email_id='john.smith@example.com';");

Using high-level clients

In addition to the libraries based on the Thrift and binary protocols, some high-level clients are built to ease development and provide additional services, such as connection pooling, load balancing, failover, secondary indexing, and so on.
Some of them are listed here:

  • Astyanax (https://github.com/Netflix/astyanax): Astyanax is a high-level Java client for Cassandra. It allows you to run both simple and prepared CQL queries.
  • Hector (https://github.com/hector-client/hector): Hector is a high-level client for Cassandra. At the time of writing this book, it supported CQL 2 only (not CQL 3).
  • Kundera (https://github.com/impetus-opensource/Kundera): Kundera is a JPA 2.0-based object datastore mapping library for Cassandra and many other NoSQL datastores. CQL 3 queries are run with Kundera using native queries as described in the JPA specification.

Summary

In this article, we learned how to run CQL queries from client applications using the three approaches described above: the Thrift API, the Datastax Java driver, and high-level clients.

Packt
03 Oct 2013
8 min read

Simple graphs with d3.js

There are many component vendors selling graphing controls. Frequently, these graph libraries are complicated to work with and expensive. When using them, you need to consider what would happen if the vendor went out of business or refused to fix an issue.

For simple graphs, such as the one shown in the preceding screenshot, d3.js (http://d3js.org/) brings a number of functions and a coding style that makes creating graphs a snap. Let's create a graph using d3. The first thing to do is introduce an SVG element to the page. In d3, we'll append an SVG element explicitly:

    var graph = d3.select("#visualization")
        .append("svg")
        .attr("width", 500)
        .attr("height", 500);

d3 relies heavily on the use of method chaining. If you're new to this concept, it is quick to pick up. Each call to a method performs some action and then returns an object, and the next method call operates on this object. So, in this case, the select method returns the div with the id of visualization. Calling append on the selected div adds an SVG element and then returns it. Finally, the attr methods set a property inside the object and then return the object. At first, method chaining may seem odd, but as we move on you'll see that it is a very powerful technique and cleans up the code considerably. Without method chaining, we end up with a lot of temporary variables.

Next, we need to find the maximum element in the data array. Previously we might have used a jQuery each loop to find that element. d3 has built-in array functions that make this much cleaner:

    var maximumValue = d3.max(data, function(item){ return item.value; });

There are similar functions for finding minimums and means. None of these functions do anything you couldn't get by using a JavaScript utility library such as underscore.js (http://underscorejs.org/) or lodash (http://lodash.com/); however, it is convenient to make use of the built-in versions.
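To see what such accessor-based helpers compute, here is a minimal sketch in plain JavaScript with no library at all. The functions maxBy, minBy, and meanBy and the sample data are hypothetical stand-ins, not d3's implementation; for instance, this sketch assumes every item yields a valid number and does nothing special for missing values.

```javascript
// Hypothetical sample data in the shape the article uses (month and value).
const data = [
  { month: "May", value: 12 },
  { month: "June", value: 30 },
  { month: "July", value: 18 },
];

// Largest accessor result over all items.
function maxBy(items, accessor) {
  return items.reduce((best, item) => Math.max(best, accessor(item)), -Infinity);
}

// Smallest accessor result over all items.
function minBy(items, accessor) {
  return items.reduce((best, item) => Math.min(best, accessor(item)), Infinity);
}

// Arithmetic mean of the accessor results (items must not be empty).
function meanBy(items, accessor) {
  const total = items.reduce((sum, item) => sum + accessor(item), 0);
  return total / items.length;
}

const byValue = (item) => item.value;
console.log(maxBy(data, byValue), minBy(data, byValue), meanBy(data, byValue));
```

The accessor argument is the important idea: the same helper works on any array of objects as long as you tell it which field to read.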
Next, we make use of d3's scaling functions:

    var yScale = d3.scale.linear()
        .domain([maximumValue, 0])
        .range([0, 300]);

Scaling functions serve to map from one dataset to another. In our case, we're mapping from the values in our data array to the coordinates in our SVG. We use two different scales here: linear and ordinal. The linear scale is used to map a continuous domain to a continuous range. The mapping is done linearly, so if our domain contained values between 0 and 10, and our range had values between 0 and 100, a value of 6 would map to 60, 3 to 30, and so forth. It seems trivial, but with more complicated domains and ranges, scales are very helpful. Apart from linear scales, there are power and logarithmic scales that may fit your data better.

In our example data, our x values are not continuous; they're not even numeric. For this case we can make use of an ordinal scale:

    var xScale = d3.scale.ordinal()
        .domain(data.map(function(item){ return item.month; }))
        .rangeBands([0, 500], .1);

Ordinal scales map a discrete domain into a continuous range. Here, the domain is the list of months and the range is the width of our SVG. You'll note that instead of using range we use rangeBands. rangeBands splits the range up into chunks, one of which is assigned to each domain item. So, if our domain was {May, June} and the range 0 to 100, for May we would receive a band from 0 to 49 and for June a band from 50 to 100. You'll also note that rangeBands takes an additional parameter, in our case 0.1. This is a padding value that generates a sort of no man's land between each band. This is ideal for creating a bar or column graph, as we may not want the columns touching each other. The padding parameter can take a value between 0 and 1 as a decimal representation of how much of the range should be reserved for padding. 0.25 would reserve 25% of the range for padding. There is also a family of built-in scales that deal with providing colors.
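The linear and band mappings described above can be written out by hand to show the arithmetic involved. This is a simplified sketch, not d3's implementation: the functions linearScale and bandScale are hypothetical, and the band version deliberately leaves out the padding parameter that rangeBands supports.

```javascript
// Linear mapping from a continuous domain [d0, d1] to a range [r0, r1].
function linearScale([d0, d1], [r0, r1]) {
  return (value) => r0 + ((value - d0) * (r1 - r0)) / (d1 - d0);
}

// Simplified band mapping: split the range into equal chunks, one per
// domain item, and return the start of the chunk for a given item.
// Padding between bands is intentionally omitted in this sketch.
function bandScale(domain, [r0, r1]) {
  const bandWidth = (r1 - r0) / domain.length;
  const scale = (item) => r0 + domain.indexOf(item) * bandWidth;
  scale.bandWidth = bandWidth;
  return scale;
}

// The example from the text: domain 0..10 mapped onto range 0..100.
const tenToHundred = linearScale([0, 10], [0, 100]);
console.log(tenToHundred(6), tenToHundred(3));

// An inverted domain, as in yScale, sends the largest value to 0 (the top).
const yLike = linearScale([50, 0], [0, 300]);
console.log(yLike(50), yLike(0));

// Two months sharing a range of 0..100.
const months = bandScale(["May", "June"], [0, 100]);
console.log(months("May"), months("June"), months.bandWidth);
```

Reversing the domain rather than the range is what lets the y scale account for the SVG origin being at the top-left.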
Selecting colors for your visualization can be challenging, as the colors have to be far enough apart to be discernible. If you're color-challenged like me, the scales category10, category20, category20b, and category20c may be for you. You can declare a color scale in the following manner:

    var colorScale = d3.scale.category10()
        .domain(data.map(function(item){ return item.month; }));

The preceding code will assign a different color to each month out of a set of 10 pre-calculated possible colors.

Finally, we need to actually draw our graph:

    var graphData = graph.selectAll(".bar")
        .data(data);

We select all the .bar elements inside the graph using selectAll. Hang on! There aren't any elements inside the graph, let alone elements that match the .bar selector. Typically, selectAll will return a collection of elements matching the selector, just as the $ function does in jQuery. In this case, we're using selectAll as a short-hand method of creating an empty d3 collection that has a data method and can be chained.

We next specify a set of data to union with the data from the existing selection of elements. d3 operates on collections of objects without using looping techniques. This allows for a more declarative style of programming, but can be difficult to grasp immediately. In effect, we're unioning two sets of data: the currently existing data (found using selectAll) and the new data (provided by the data function). This method of dealing with data allows for easy updates to the data elements, should further elements be added or removed later. When new data elements are added, you can select just those elements by using enter(). This prevents repeating actions on existing data. You don't need to redraw the entire image, just the portions that have been updated with new data. Equally, if there are existing elements whose data no longer appears in the new dataset, they can be selected with exit().
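The bookkeeping behind enter() and exit() can be sketched without d3 at all. The following assumes the default case in which data is matched to elements by position (no key function); joinByIndex is a hypothetical name, and the strings standing in for DOM elements are illustrative only.

```javascript
// Sketch of an index-based data join: pair existing elements with new
// data by position, then report what is left over on either side.
function joinByIndex(existingElements, newData) {
  const shared = Math.min(existingElements.length, newData.length);
  return {
    // Elements that already exist and receive a (possibly new) datum.
    update: existingElements
      .slice(0, shared)
      .map((element, i) => ({ element, datum: newData[i] })),
    // Data with no element yet: these need elements created for them.
    enter: newData.slice(shared),
    // Elements with no datum any more: these are candidates for removal.
    exit: existingElements.slice(shared),
  };
}

// First draw: no .bar elements exist, so every datum is in enter.
const firstDraw = joinByIndex([], ["May", "June", "July"]);
console.log(firstDraw.enter.length, firstDraw.exit.length);

// Later draw with less data: the surplus element ends up in exit.
const laterDraw = joinByIndex(["rect1", "rect2", "rect3"], ["May", "June"]);
console.log(laterDraw.update.length, laterDraw.exit);
```

This is why selecting .bar on an empty graph is not a mistake: an empty existing set simply means that all of the data lands in the enter group.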
Typically, you just want to remove those elements, which can be done by running the following:

    graphData.exit().remove()

When we create elements using the newly generated dataset, the data elements will actually be attached to the newly created DOM elements. Creating the elements involves calling append:

    graphData.enter()
        .append("rect")
        .attr("x", function(item){ return xScale(item.month); })
        .attr("y", function(item){ return yScale(item.value); })
        .attr("height", function(item){ return 300 - yScale(item.value); })
        .attr("width", xScale.rangeBand())
        .attr("fill", function(item, index){ return colorScale(item.month); });

The following diagram shows how data() works with new and existing datasets:

You can see in this chunk of code how useful method chaining has become. It makes the code much shorter and more readable than assigning a series of temporary variables or passing the results into standalone methods. The scales also come into their own here. The x coordinate is found simply by scaling the month we have using the ordinal scale. Because that scale takes into account the number of elements as well as the padding, nothing more complicated is needed.

The y coordinate is similarly found using the previously defined yScale. Because the origin in an SVG is in the top-left, we have to subtract the scaled value from the height of the chart (300) to calculate the height of each column. Again, this is a place where we wouldn't generally be using a constant except for the brevity of our example. The width of the column is found by asking the xScale for the width of the bands. Finally, we set the color based on the color scale so it appears as follows:

Transitions

Being able to animate your visualization is a powerful technique for adding that "wow" factor. People are far more likely to stop and pay attention to your visualization if it looks cool. Spending time to make your visualization look nifty can actually have a payback other than getting mad cred from other developers.
d3 makes transitions simple by doing most of the heavy lifting for you. The transitions work by creating gradual changes in the values of properties:

    graph.selectAll(".bar")
        .attr("height", 0)
        .transition()
        .attr("height", 50)

This code will gradually change the height of a .bar from 0 to 50 px. By default, the transition will take 250ms, but this can be changed by chaining a call to duration, and the start can be delayed with a call to delay:

    graph.selectAll(".bar")
        .attr("height", 0)
        .transition()
        .duration(400)
        .delay(100)
        .attr("height", 50)

This transition will wait 100ms and then grow the bar height over the course of 400ms.

Transitions can also be used to change non-numeric attributes such as color. I like to use a few transitions during loading to get attention, but they can also be useful when changing the state of the visualization. Even such things as adding new elements using .data can have a transition attached to them.

If there is a problem with transitions, it is that they are too easy to use. Be careful that you don't overuse them and overwhelm the user. You should also pay attention to the duration of the transition: if it takes too long to execute, users will lose interest.

Summary

This article shed light on the features of d3.js that can be used to create simple graphs. It also covered how to add that "wow" factor to your visualizations using transitions.

Packt
30 Sep 2013
5 min read

Testing a Camel application

Let's start testing. Apache Camel comes with the Camel Test Kit: a set of classes that leverage the capabilities of testing frameworks and extend them with Camel specifics.

To test our application, let's add the Camel Test Kit to our list of dependencies in the POM file, as shown in the following code:

    <dependency>
      <groupId>org.apache.camel</groupId>
      <artifactId>camel-test</artifactId>
      <version>${camel-version}</version>
    </dependency>

At the same time, if you have any JUnit dependency, the best solution would be to delete it for now so that Maven will resolve the dependency and we will get the JUnit version required by Camel.

Let's rewrite our main program a little bit. Change the class App as shown in the following code:

    public class App {
        public static void main(String[] args) throws Exception {
            Main m = new Main();
            m.addRouteBuilder( new AppRoute() );
            m.run();
        }

        static class AppRoute extends RouteBuilder {
            @Override
            public void configure() throws Exception {
                from("stream:in")
                    .to("file:test");
            }
        }
    }

Instead of having an anonymous class extending RouteBuilder, we made it an inner class. That is because we are not going to test the main program. Instead, we are going to test whether our routing works as expected, that is, whether messages from the system input are routed into files in the test directory. At the beginning of the test, we will delete the test directory, and our assertion will be that the test directory exists after we send the message and that it contains exactly one file.

To simplify deleting the test directory at the beginning of the unit test, we will use FileUtils.deleteDirectory from Apache Commons IO. Let's add it to our list of dependencies:

    <dependency>
      <groupId>org.apache.commons</groupId>
      <artifactId>commons-io</artifactId>
      <version>1.3.2</version>
    </dependency>

In our project layout, we have a file src/test/java/com/company/AppTest.java.
This is a unit test that has been created from the Maven artifact that we used to create our application. Now, let's replace the code inside that file with the following code:

    package com.company;

    import org.apache.camel.builder.RouteBuilder;
    import org.apache.camel.test.junit4.CamelTestSupport;
    import org.apache.commons.io.FileUtils;
    import org.junit.After;
    import org.junit.BeforeClass;
    import org.junit.Test;

    import java.io.*;

    public class AppTest extends CamelTestSupport {

        static PipedInputStream in;
        static PipedOutputStream out;
        static InputStream originalIn;

        @Test()
        public void testAppRoute() throws Exception {
            // The trailing newline marks the end of the message for stream:in
            out.write("This is a test message!\n".getBytes());
            // Give the route time to write the file before asserting
            Thread.sleep(2000);
            assertTrue(new File("test").listFiles().length == 1);
        }

        @BeforeClass()
        public static void setup() throws IOException {
            // Replace System.in with a pipe we can write to from the test
            originalIn = System.in;
            out = new PipedOutputStream();
            in = new PipedInputStream(out);
            System.setIn(in);
            FileUtils.deleteDirectory(new File("test"));
        }

        @After()
        public void teardown() throws IOException {
            out.close();
            System.setIn(originalIn);
        }

        @Override
        public boolean isCreateCamelContextPerClass() {
            return false;
        }

        @Override
        protected RouteBuilder createRouteBuilder() throws Exception {
            return new App.AppRoute();
        }
    }

Now we can run mvn compile test from the console and see that the test was run and that it is successful:

Some important things to take note of in the code of our unit test are as follows:

We have extended the CamelTestSupport class for JUnit 4 (see the package it is imported from). There are also classes that support TestNG and JUnit 3.

We have overridden the method createRouteBuilder() to return a RouteBuilder with our routes.

We made our test class create a CamelContext for each test method (annotated by @Test) by making isCreateCamelContextPerClass return false.

System.in has been substituted with a piped stream in the setup() method and has been set back to the original value in the teardown() method.
The trick is in doing it before the CamelContext is created and started (now you see why we create a CamelContext for each test). Also, you may see that after we send the message into the output stream piped to System.in, we made the test thread stop for a couple of seconds to ensure that the message passes through the routes into the file.

In short, our test suite overrides System.in with a piped stream so we can write into System.in from the code, and deletes the test directory before any test in the class runs. Right before the testAppRoute() method, it creates a CamelContext, using routes created by the overridden method createRouteBuilder(). Then it runs the test method, which sends the bytes of the message into the piped stream so that it gets into System.in, where it is read by Camel (note the \n terminating the message). Camel then does what is written in the routes, that is, creates a file in the test directory. To be sure it's done before we do assertions, we make the thread executing the test sleep for 2 seconds. Then, we assert that we do have a file in the test directory at the end of the test.

Our test works, but you can see that it already gets quite hairy with piping streams and making calls to Thread.sleep(), and that's just the beginning. We haven't yet started using external systems, such as FTP servers, web services, and JMS queues.

Another concern is the integration of our application with other systems. Some of them may not have a test environment. In this case, we can't easily control the side effects of our application, the messages that it sends to and receives from those systems, or how the systems interact with our application. To solve this problem, software developers use mocking.

Summary

In this article, we learned about testing a Camel application.
Packt
30 Sep 2013
6 min read

Hadoop and HDInsight in a Heartbeat

Apache Hadoop is the leading Big Data platform that allows you to process large datasets efficiently and at low cost. Other Big Data platforms are MongoDB, Cassandra, and CouchDB. This section describes Apache Hadoop core concepts and its ecosystem.

Core components

The following image shows core Hadoop components:

At the core, Hadoop has two key components:

Hadoop Distributed File System (HDFS)
Hadoop MapReduce (distributed computing for batch jobs)

For example, say we need to store a large file of 1 TB in size and we only have some commodity servers, each with limited storage. Hadoop Distributed File System can help here. We first install Hadoop, then we import the file, which gets split into several blocks that get distributed across all the nodes. Each block is replicated to ensure that there is redundancy. Now we are able to store and retrieve the 1 TB file.

Now that we are able to save the large file, the next obvious need would be to process this large file and get something useful out of it, like a summary report. Processing such a large file would be difficult and/or slow if handled sequentially. Hadoop MapReduce was designed to address this exact problem and process data in parallel across several machines in a fault-tolerant mode. The MapReduce programming model uses simple key-value pairs for computation.

One distinct feature of Hadoop in comparison to other cluster or grid solutions is that Hadoop relies on a "shared nothing" architecture. This means that when the MapReduce program runs, it will use the data local to the node, thereby reducing network I/O and improving performance. Another way to look at this is that when running MapReduce, we bring the code to the location where the data resides. So the code moves and not the data.

HDFS and MapReduce together make a powerful combination, and are the reason why there is so much interest and momentum with the Hadoop project.
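To make the key-value model described above more concrete, here is a minimal in-memory sketch of a word count written in the map, shuffle, and reduce style. It is a single-process illustration only, and it does not use any Hadoop API: the function names (`mapPhase`, `shuffle`, `reducePhase`, `runJob`) are hypothetical, and a real job would run the map and reduce steps on many nodes against blocks stored in HDFS.

```javascript
// Map step: turn one line of text into a list of [key, value] pairs.
function mapPhase(line) {
  return line
    .split(/\s+/)
    .filter((word) => word.length > 0)
    .map((word) => [word.toLowerCase(), 1]);
}

// Shuffle step: group all values that share the same key.
function shuffle(pairs) {
  const groups = new Map();
  for (const [key, value] of pairs) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }
  return groups;
}

// Reduce step: collapse the values of one key into a single result.
function reducePhase(key, values) {
  return [key, values.reduce((sum, v) => sum + v, 0)];
}

// Hypothetical driver: in a cluster, each input split would be mapped on
// the node that holds it; here everything runs in one process.
function runJob(lines) {
  const pairs = lines.flatMap(mapPhase);
  const result = {};
  for (const [key, values] of shuffle(pairs)) {
    const [word, total] = reducePhase(key, values);
    result[word] = total;
  }
  return result;
}

const counts = runJob(["big data big", "data node"]);
console.log(counts); // { big: 2, data: 2, node: 1 }
```

Because each call to `mapPhase` depends only on its own line, the map work can be split across machines without coordination, which is the property that lets the code move to the data.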
Hadoop cluster layout

Each Hadoop cluster has three special master nodes (also known as servers):

NameNode: This is the master for the distributed filesystem and maintains the metadata. This metadata has the listing of all the files and the location of each block of a file, which are stored across the various slaves (worker bees). Without a NameNode, HDFS is not accessible.

Secondary NameNode: This is an assistant to the NameNode. It communicates only with the NameNode to take snapshots of the HDFS metadata at intervals configured at cluster level.

JobTracker: This is the master node for Hadoop MapReduce. It determines the execution plan of the MapReduce program, assigns it to various nodes, monitors all tasks, and ensures that the job is completed by automatically relaunching any task that fails.

All other nodes of the Hadoop cluster are slaves and perform the following two functions:

DataNode: Each node will host several chunks of files known as blocks. It communicates with the NameNode.

TaskTracker: Each node will also serve as a slave to the JobTracker by performing a portion of the map or reduce task, as decided by the JobTracker.

The following image shows a typical Apache Hadoop cluster:

The Hadoop ecosystem

As Hadoop's popularity has increased, several related projects have been created that simplify access to and management of Hadoop. I have organized them as per the stack, from top to bottom. The following image shows the Hadoop ecosystem:

Data access

The following software is typically used to access data in Hadoop:

Hive: It is a data warehouse infrastructure that provides SQL-like access on HDFS. This is suitable for ad hoc queries that abstract MapReduce.

Pig: It is a scripting language, in the same vein as Python, that abstracts MapReduce and is useful for data scientists.

Mahout: It is used to build machine learning and recommendation engines.

MS Excel 2013: With HDInsight, you can connect Excel to HDFS via Hive queries to analyze your data.
Data processing

The following are the key programming tools available for processing data in Hadoop:

MapReduce: This is the Hadoop core component that allows distributed computation across all the TaskTrackers

Oozie: It enables creation of workflow jobs to orchestrate Hive, Pig, and MapReduce tasks

The Hadoop data store

The following are the common data stores in Hadoop:

HBase: It is the distributed and scalable NoSQL (Not only SQL) database that provides a low-latency option that can handle unstructured data

HDFS: It is a Hadoop core component, which is the foundational distributed filesystem

Management and integration

The following are the management and integration software:

ZooKeeper: It is a high-performance coordination service for distributed applications to ensure high availability

HCatalog: It provides abstraction and interoperability across various data processing software such as Pig, MapReduce, and Hive

Flume: It is distributed and reliable software for collecting data from various sources for Hadoop

Sqoop: It is designed for transferring data between HDFS and any RDBMS

Hadoop distributions

Apache Hadoop is open-source software and is repackaged and distributed by vendors offering enterprise support. The following is the listing of popular distributions:

Amazon Elastic MapReduce (cloud, http://aws.amazon.com/elasticmapreduce/)
Cloudera (http://www.cloudera.com/content/cloudera/en/home.html)
EMC Pivotal HD (http://gopivotal.com/)
Hortonworks HDP (http://hortonworks.com/)
MapR (http://mapr.com/)
Microsoft HDInsight (cloud, http://www.windowsazure.com/)

HDInsight distribution differentiator

HDInsight is an enterprise-ready distribution of Hadoop that runs on Windows servers and on the Azure HDInsight cloud service. It is 100 percent compatible with Apache Hadoop. HDInsight was developed by Microsoft in partnership with Hortonworks. Enterprises can now harness the power of Hadoop on Windows servers and the Windows Azure cloud service.
The following are the key differentiators for the HDInsight distribution:

Enterprise-ready Hadoop: HDInsight is backed by Microsoft support, and runs on standard Windows servers. IT teams can leverage Hadoop with Platform as a Service (PaaS), reducing the operations overhead.

Analytics using Excel: With Excel integration, your business users can leverage data in Hadoop and analyze it using PowerPivot.

Integration with Active Directory: HDInsight makes Hadoop reliable and secure with its integration with Windows Active Directory services.

Integration with .NET and JavaScript: .NET developers can leverage the integration, and write map and reduce code using their familiar tools.

Connectors to RDBMS: HDInsight has ODBC drivers to integrate with SQL Server and other relational databases.

Scale using cloud offering: The Azure HDInsight service enables customers to scale quickly as per the project needs and have a seamless interface between HDFS and the Azure storage vault.

JavaScript console: It consists of an easy-to-use JavaScript console for configuring, running, and post-processing Hadoop MapReduce jobs.

Summary

In this article, we reviewed the Apache Hadoop components and the ecosystem of projects that provide a cost-effective way to deal with Big Data problems. We then looked at how Microsoft HDInsight makes the Apache Hadoop solution better through simplified management, integration, development, and reporting.
Packt
30 Sep 2013
23 min read

What is New in 12c

Oracle Database 12c has introduced many new features and enhancements for backup and recovery. This article will introduce you to some of them, and you will have the opportunity to learn in more detail how they could be used in real-life situations. But I cannot start talking about Oracle 12c without first talking about a revolutionary new concept that was introduced with this version of the database product, called the Multitenant Container Database (CDB), which will contain two or more pluggable databases (PDBs). When a container database only contains one PDB, it is called a Single Tenant Container Database. You can also have your database on Oracle 12c using the same format as before 12c; it will be called a non-CDB database and will not allow the use of PDBs.

Pluggable database

We are now able to have multiple databases sharing a single instance and Oracle binaries. Each of the databases will be configurable to a degree and will allow some parameters to be set specifically for itself (because they all share the same initialization parameter file) and, what is better, each database will be completely isolated from the others without any of them knowing that the others exist.

A CDB is a single physical database that contains a root container with the main Oracle data dictionary and at least one PDB with specific application data. A PDB is a portable container with its own data dictionary, including metadata and internal links to the system-supplied objects in the root container, and this PDB will appear to an Oracle Net client as a traditional Oracle database. The CDB also contains a PDB called SEED, which is used as a template when an empty PDB needs to be created.

The following figure shows an example of a CDB with five PDBs:

When creating a database on Oracle 12c, you can now create a CDB with one or more PDBs, and what is even better is that you can easily clone a PDB, or unplug it and plug it into a different server with a preinstalled CDB, if your current server is running out of resources such as CPU or memory.

Many years ago, the introduction of external storage gave us the possibility to store data on external devices and the flexibility to plug and unplug them to any system independent of their OS. For example, you can connect an external device to a system using Windows XP and read your data without any problems. Later you can unplug it and connect it to a laptop running Windows 7 and you will still be able to read your data. Now with the introduction of Oracle pluggable databases, we will be able to do something similar with Oracle when upgrading a PDB, making this process simple and easy. All you will need to do to upgrade a PDB, for example, is:

Unplug your PDB (step 1 in the following figure) from a CDB running 12.1.0.1.

Copy the PDB to the destination location with a CDB that is using a later version such as 12.2.0.1 (step 2 in the following figure).

Plug the PDB into the CDB (step 3 in the following figure), and your PDB is now upgraded to 12.2.0.1.

This new concept is a great solution for database consolidation and is very useful for multitenant SaaS (Software as a Service) providers, improving resource utilization, manageability, integration, and service management.
Some key points about pluggable databases are:

You can have many PDBs inside a single container (a CDB can contain a maximum of 253 PDBs, including the seed PDB)

A PDB is fully backwards compatible with an ordinary pre-12.1 database from an application's perspective, meaning that an application built, for example, to run on Oracle 11.1 will not need to be changed to run on Oracle 12c

A system administrator can connect to a CDB as a whole and see a single system image

If you are not ready to make use of this new concept, you are still able to create a database on Oracle 12c as before, called a non-CDB (non-Container Database)

Each instance in RAC opens the CDB as a whole. A foreground session will see only the single PDB it is connected to and sees it just as a non-CDB

The Resource Manager is extended with some new between-PDB capabilities

Fully integrated with Oracle Enterprise Manager 12c and SQL Developer

Fast provisioning of new databases (empty or as a copy/clone of an existing PDB)

On Clone triggers can be used to scrub or mask data during a clone process

Fast unplug and plug between CDBs

Fast patch or upgrade by unplugging a PDB and plugging it into a different CDB already patched or with a later database version

Separation of duties between DBA and application administrators

Communication between PDBs is allowed via intra-CDB dblinks

Every PDB has a default service with its name in one Listener

An unplugged PDB carries its lineage, Opatch, encryption key info, and much more

All PDBs in a CDB should use the same character set

All PDBs share the same control files, SPFILE, redo log files, flashback log files, and undo

Flashback PDB is not available on 12.1; it is expected to be available with 12.2

Allows multitenancy of Oracle Databases, very useful for centralization, especially if using Exadata

Multitenant Container Database is only available for Oracle Enterprise Edition as a payable option; all other editions of the Oracle database can only deploy non-CDB or Single Tenant pluggable databases

RMAN new features and enhancements

Now we can continue and take a quick, closer look at some of the new features and enhancements introduced in this database version for RMAN.

Container and pluggable database backup and restore

As we saw earlier, the introduction of Oracle 12c and the new pluggable database concept made it possible to easily centralize multiple databases, maintaining the individuality of each one while using a single instance. The introduction of this new concept also forced Oracle to introduce some enhancements to the already existing BACKUP, RESTORE, and RECOVER commands to enable us to make an efficient backup or restore of the complete CDB. This includes all PDBs or just one or more PDBs, or if you want to be more specific, you can also just back up or restore one or more tablespaces from a PDB.

Some examples of how to use the RMAN commands when performing a backup on Oracle 12c are:

    RMAN> BACKUP DATABASE;                       (To back up the CDB + all PDBs)
    RMAN> BACKUP DATABASE root;                  (To back up only the CDB root)
    RMAN> BACKUP PLUGGABLE DATABASE pdb1,pdb2;   (To back up all specified PDBs)
    RMAN> BACKUP TABLESPACE pdb1:example;        (To back up a specific tablespace in a PDB)

Some examples when performing RESTORE operations are:

    RMAN> RESTORE DATABASE;                      (To restore an entire CDB, including all PDBs)
    RMAN> RESTORE DATABASE root;                 (To restore only the root container)
    RMAN> RESTORE PLUGGABLE DATABASE pdb1;       (To restore a specific PDB)
    RMAN> RESTORE TABLESPACE pdb1:example;       (To restore a tablespace in a PDB)

Finally, some examples of RECOVER operations are:

    RMAN> RECOVER DATABASE;                      (Root plus all PDBs)

    RMAN> RUN {
            SET UNTIL SCN 1428;
            RESTORE DATABASE;
            RECOVER DATABASE;
            ALTER DATABASE OPEN RESETLOGS;
          }

    RMAN> RUN {
            RESTORE PLUGGABLE DATABASE pdb1 TO RESTORE POINT one;
            RECOVER PLUGGABLE DATABASE pdb1 TO RESTORE POINT one;
            ALTER PLUGGABLE DATABASE pdb1 OPEN RESETLOGS;
          }

Enterprise Manager Database Express

The Oracle Enterprise Manager Database Console or Database Control that many of us used to manage an entire database is now deprecated and replaced by the new Oracle Enterprise Manager Database Express. This new tool uses Flash technology and allows the DBA to easily manage the configuration, storage, security, and performance of a database. Note that RMAN, Data Pump, and the Oracle Enterprise Manager Cloud Control are now the only tools able to perform backup and recovery operations in a pluggable database environment; in other words, you cannot use the Enterprise Manager Database Express for database backup/recovery operations.

Backup privileges

Oracle Database 12c supports the separation of DBA duties by introducing task-specific, least-privileged administrative privileges for backups that do not require the SYSDBA privilege. The new system privilege introduced with this release is SYSBACKUP. Avoid the use of the SYSDBA privilege for backups unless it is strictly necessary.

When connecting to the database using the AS SYSDBA system privilege, you are able to see any object structure and all the data within the object, whereas if you are connecting using the new system privilege AS SYSBACKUP, you will still be able to see the structure of an object but not the object data. If you try to see any data using the SYSBACKUP privilege, the ORA-01031: insufficient privileges message will be raised.

Tighter security policies require a separation of duties. The new SYSBACKUP privilege facilitates the implementation of the separation of duties, allowing backup and recovery operations to be performed without implicit access to the data, so if access to the data is required for one specific user, it will need to be granted explicitly to this user.
RMAN has introduced some changes when connecting to a database, such as:

TARGET: It will require the user to have the SYSBACKUP (or SYSDBA) administrative privilege to be able to connect to the TARGET database

CATALOG: In earlier versions a user was required to have the RECOVERY_CATALOG_OWNER role assigned to be able to connect to the RMAN catalog; now it will need to have the SYSBACKUP privilege assigned to be able to connect to the catalog

AUXILIARY: It will require the SYSBACKUP (or SYSDBA) administrative privilege to connect to the AUXILIARY database

Some important points about the SYSBACKUP administrative privilege are:

It includes permissions for backup and recovery operations

It does not include data access privileges such as SELECT ANY TABLE that the SYSDBA privilege has

It can be granted to the SYSBACKUP user that is created during the database installation process

It is not used by default: when an RMAN connection string does not contain the AS SYSBACKUP clause, as in the following command, RMAN connects AS SYSDBA:

    $ rman TARGET /

Before connecting as the SYSBACKUP user created during the database creation process, you will need to unlock the account and grant the SYSBACKUP privilege to the user. When you use the GRANT command to give the SYSBACKUP privilege to a user, the username and privilege information will be automatically added to the database password file.

The v$pwfile_users view contains all information regarding users within the database password file and indicates whether a user has been granted any administrative privilege. Let's take a closer look at this view:

    SQL> DESC v$pwfile_users
     Name                          Null?    Type
     ----------------------------- -------- -----------------
     USERNAME                               VARCHAR2(30)
     SYSDBA                                 VARCHAR2(5)
     SYSOPER                                VARCHAR2(5)
     SYSASM                                 VARCHAR2(5)
     SYSBACKUP                              VARCHAR2(5)
     SYSDG                                  VARCHAR2(5)
     SYSKM                                  VARCHAR2(5)
     CON_ID                                 NUMBER

As you can see, this view now contains some new columns, such as:

SYSBACKUP: It indicates if the user is able to connect using the SYSBACKUP privilege

SYSDG: It indicates if the user is able to connect using the SYSDG (new for Data Guard) privilege

SYSKM: It indicates if the user is able to connect using the SYSKM (new for Advanced Security) privilege

CON_ID: It is the ID of the current container. A value of 0 indicates that the row relates to the entire CDB or to an entire traditional database (non-CDB); a value of 1 indicates that the user has access only to root; any other value identifies a specific container ID

To help you clearly understand the use of the SYSBACKUP privilege, let's run a few examples. Let's connect to our newly created database as SYSDBA and take a closer look at the SYSBACKUP privilege:

    $ sqlplus / as sysdba
    SQL> SET PAGES 999
    SQL> SET LINES 99
    SQL> COL USERNAME FORMAT A21
    SQL> COL ACCOUNT_STATUS FORMAT A20
    SQL> COL LAST_LOGIN FORMAT A41
    SQL> SELECT username, account_status, last_login
      2  FROM dba_users
      3  WHERE username = 'SYSBACKUP';

    USERNAME     ACCOUNT_STATUS       LAST_LOGIN
    ------------ -------------------- -----------------------
    SYSBACKUP    EXPIRED & LOCKED

As you can see, the SYSBACKUP account created during the database creation is currently EXPIRED & LOCKED; you will need to unlock this account and grant the SYSBACKUP privilege to it if you want to use this user for any backup and recovery purposes.

For this demo I will use the original SYSBACKUP account, but in a production environment never use the SYSBACKUP account; instead, grant the SYSBACKUP privilege to the user(s) that will be responsible for the backup and recovery operations.
    SQL> ALTER USER sysbackup IDENTIFIED BY "demo" ACCOUNT UNLOCK;

    User altered.

    SQL> GRANT sysbackup TO sysbackup;

    Grant succeeded.

    SQL> SELECT username, account_status
      2  FROM dba_users
      3  WHERE account_status NOT LIKE '%LOCKED';

    USERNAME              ACCOUNT_STATUS
    --------------------- --------------------
    SYS                   OPEN
    SYSTEM                OPEN
    SYSBACKUP             OPEN

We can also easily identify what system privileges and roles are assigned to SYSBACKUP by executing the following SQL statements:

    SQL> COL grantee FORMAT A20
    SQL> SELECT *
      2  FROM dba_sys_privs
      3  WHERE grantee = 'SYSBACKUP';

    GRANTEE       PRIVILEGE                           ADM COM
    ------------- ----------------------------------- --- ---
    SYSBACKUP     ALTER SYSTEM                        NO  YES
    SYSBACKUP     AUDIT ANY                           NO  YES
    SYSBACKUP     SELECT ANY TRANSACTION              NO  YES
    SYSBACKUP     SELECT ANY DICTIONARY               NO  YES
    SYSBACKUP     RESUMABLE                           NO  YES
    SYSBACKUP     CREATE ANY DIRECTORY                NO  YES
    SYSBACKUP     UNLIMITED TABLESPACE                NO  YES
    SYSBACKUP     ALTER TABLESPACE                    NO  YES
    SYSBACKUP     ALTER SESSION                       NO  YES
    SYSBACKUP     ALTER DATABASE                      NO  YES
    SYSBACKUP     CREATE ANY TABLE                    NO  YES
    SYSBACKUP     DROP TABLESPACE                     NO  YES
    SYSBACKUP     CREATE ANY CLUSTER                  NO  YES

    13 rows selected.

    SQL> COL granted_role FORMAT A30
    SQL> SELECT *
      2  FROM dba_role_privs
      3  WHERE grantee = 'SYSBACKUP';

    GRANTEE        GRANTED_ROLE                   ADM DEF COM
    -------------- ------------------------------ --- --- ---
    SYSBACKUP      SELECT_CATALOG_ROLE            NO  YES YES

Here the column ADMIN_OPTION indicates whether or not the user was granted the privilege or role with the ADMIN OPTION, the column DEFAULT_ROLE indicates whether or not the role is designated as a default role for the user, and the column COMMON indicates whether the grant is common to all the containers and pluggable databases available.

SQL and DESCRIBE

As you know, you are able to execute SQL commands and PL/SQL procedures from the RMAN command line. Starting with Oracle 12.1, most SQL commands in RMAN no longer require the use of the SQL prefix or quotes.
You can now run some simple SQL commands in RMAN such as:

    RMAN> SELECT TO_CHAR(sysdate,'dd/mm/yy - hh24:mi:ss')
    2> FROM dual;

    TO_CHAR(SYSDATE,'DD)
    -------------------
    17/09/12 - 02:58:40

    RMAN> DESC v$datafile

     Name                        Null?    Type
     --------------------------- -------- -------------------
     FILE#                                NUMBER
     CREATION_CHANGE#                     NUMBER
     CREATION_TIME                        DATE
     TS#                                  NUMBER
     RFILE#                               NUMBER
     STATUS                               VARCHAR2(7)
     ENABLED                              VARCHAR2(10)
     CHECKPOINT_CHANGE#                   NUMBER
     CHECKPOINT_TIME                      DATE
     UNRECOVERABLE_CHANGE#                NUMBER
     UNRECOVERABLE_TIME                   DATE
     LAST_CHANGE#                         NUMBER
     LAST_TIME                            DATE
     OFFLINE_CHANGE#                      NUMBER
     ONLINE_CHANGE#                       NUMBER
     ONLINE_TIME                          DATE
     BYTES                                NUMBER
     BLOCKS                               NUMBER
     CREATE_BYTES                         NUMBER
     BLOCK_SIZE                           NUMBER
     NAME                                 VARCHAR2(513)
     PLUGGED_IN                           NUMBER
     BLOCK1_OFFSET                        NUMBER
     AUX_NAME                             VARCHAR2(513)
     FIRST_NONLOGGED_SCN                  NUMBER
     FIRST_NONLOGGED_TIME                 DATE
     FOREIGN_DBID                         NUMBER
     FOREIGN_CREATION_CHANGE#             NUMBER
     FOREIGN_CREATION_TIME                DATE
     PLUGGED_READONLY                     VARCHAR2(3)
     PLUGIN_CHANGE#                       NUMBER
     PLUGIN_RESETLOGS_CHANGE#             NUMBER
     PLUGIN_RESETLOGS_TIME                DATE
     CON_ID                               NUMBER

    RMAN> ALTER TABLESPACE users
    2> ADD DATAFILE '/u01/app/oracle/oradata/cdb1/pdb1/user02.dbf' size 50M;

    Statement processed

Remember that the SYSBACKUP privilege does not grant access to the user tables or views, but the SYSDBA privilege does.

Multi-section backups for incremental backups

Oracle Database 11g introduced multi-section backups to allow us to back up and restore very large files using backup sets (remember that Oracle datafiles can be up to 128 TB in size). Now with Oracle Database 12c, we are able to make use of image copies when creating multi-section backups as a complement to the previous backup set functionality. This helps us to reduce image copy creation time for backups, transporting tablespaces, cloning, and doing a TSPITR (tablespace point-in-time recovery); it also improves backups when using Exadata.
The main restrictions on using this enhancement are:

- The COMPATIBLE initialization parameter needs to be set to 12.0 or higher to make use of the new image copy multi-section backup feature
- It is only available for datafiles and cannot be used to back up control files or password files
- It should not be used with a high degree of parallelism when a file resides on a small number of disks; otherwise the processes compete with each other when accessing the same device

Another new feature introduced with multi-section backups is the ability to create multi-section backups for incremental backups. This allows RMAN to back up only the data that has changed since the last backup, which improves the performance of multi-section backups because the sections are processed independently, either serially or in parallel.

Network-based recovery

Restoring and recovering files over the network is supported starting with Oracle Database 12c. We can now recover a standby database and synchronize it with its primary database via the network without the need to ship the archive log files. When the RECOVER command is executed, an incremental backup is created on the primary database. It is then transferred over the network to the physical standby database and applied there to synchronize the standby with the primary database. RMAN uses the SCN from the standby datafile header and creates the incremental backup starting from this SCN on the primary database; in other words, it transfers only the information needed for the synchronization process. If block change tracking is enabled for the primary database, it will be used while creating the incremental backup, making it faster.

A network-based recovery can also be used to replace any missing datafiles, control files, SPFILE, or tablespaces on the primary database, using the corresponding entity from the physical standby database for the recovery operation.
You can also use multi-section backup sets, encryption, or even compression within a network-based recovery.

Active Duplicate

The Active Duplicate feature generates an online backup on the TARGET database and transmits it directly over an inter-instance network connection to the AUXILIARY database for duplication (nothing is written to disk on the source server). This reduces the impact on the TARGET database by offloading the data transfer operation to the AUXILIARY database, and it also reduces the duplication time.

This very useful feature has now received some important enhancements. When it was first introduced in Oracle 11g, it only allowed a push process based on image copies. It now lets us use either the existing push process or the newly introduced pull process from the AUXILIARY database, which is based on backup sets (the pull process is the new default and automatically copies across all datafiles, control files, the SPFILE, and archive log files). It then restores all files and uses a memory script to complete the recovery operation and open the AUXILIARY database. RMAN dynamically determines, based on your DUPLICATE clauses, which process will be used (push or pull). It is quite possible that Oracle will deprecate the push process in a future release of the database.

You can now choose the compression, section size, and encryption to be used during the Active Duplicate process. For example, if you specify the SET ENCRYPTION option before the DUPLICATE command, all the backups sent from the target to the auxiliary database will be encrypted.

To use parallelism effectively, allocate more AUXILIARY channels rather than more TARGET channels, as was done in earlier releases.
Finally, another important enhancement is the possibility to finish the duplication process with the AUXILIARY database not open (the default is to open the AUXILIARY database after the duplication is completed). This option is very useful when you are required to:

- Modify the block change tracking
- Configure fast incremental backups or flashback database settings
- Move the location of the database, for example, to ASM
- Upgrade the AUXILIARY database (because the database must not be opened with RESETLOGS before the upgrade scripts are applied)
- Or when you know that the attempt to open the database would produce errors

To make it clearer, let's take a closer look at the operations RMAN performs when a DUPLICATE command is used:

1. Create an SPFILE for the AUXILIARY instance.
2. Mount the backup control file.
3. Restore the TARGET datafiles on the AUXILIARY database.
4. Perform incomplete recovery using all the available incremental backups and archived redo log files.
5. Shut down and restart the AUXILIARY instance in the NOMOUNT mode.
6. Create a new control file, then create and store the new database ID in the datafiles (this does not happen if the FOR STANDBY clause is in use).
7. Mount and open the duplicate database using the RESETLOGS option, and create the online redo log files by default. If the NOOPEN option is used, the duplicated database will not be opened with RESETLOGS and will remain in the MOUNT state.

Here are some examples of how to use the DUPLICATE command with PDBs:

    RMAN> DUPLICATE TARGET DATABASE TO <CDB1>;
    RMAN> DUPLICATE TARGET DATABASE TO <CDB1> PLUGGABLE DATABASE <PDB1>, <PDB2>, <PDB3>;

Support for the third-party snapshot

In the past, when using a third-party snapshot technology to make a backup or clone of a database, you were forced to put the database in backup mode (BEGIN BACKUP) before executing the storage snapshot.
This requirement is no longer necessary if the following conditions are met:

- The database is crash-consistent at the point of the snapshot
- Write ordering is preserved for each file within the snapshot
- The snapshot stores the time at which the snapshot is completed

If a storage vendor cannot guarantee compliance with these conditions, you must place your database in backup mode before starting the snapshot.

The RECOVER command now has a new option called SNAPSHOT TIME that allows RMAN to recover a snapshot that was taken without the database being in backup mode to a consistent point-in-time. Some examples of how to use this new option are:

    RMAN> RECOVER DATABASE UNTIL TIME '10/12/2012 10:30:00' SNAPSHOT TIME '10/12/2012 10:00:00';
    RMAN> RECOVER DATABASE UNTIL CANCEL SNAPSHOT TIME '10/12/2012 10:00:00';

Only trust your backups after you ensure that they are usable for recovery. In other words, always test your backup methodology first, ensuring that it can be used in the future in case of a disaster.

Cross-platform data transport

Starting with Oracle 12c, you can transport data across platforms using backup sets, and you can also create cross-platform inconsistent tablespace backups (taken when the tablespace is not in read-only mode) using image copies and backup sets. When using backup sets, you can make use of the compression and multi-section options, reducing downtime for tablespace and database platform migrations. RMAN does not catalog backup sets created for cross-platform transport in the control file, and it always takes into consideration the endian format of the platforms and the database open mode.
Before creating a backup set that will be used for a cross-platform data transport, the following prerequisites should be met:

- The COMPATIBLE parameter in the SPFILE should be 12.0 or greater
- The source database must be open in read-only mode when transporting an entire database, because the SYS and SYSAUX tablespaces will participate in the transport process
- If using Data Pump, the database must be open in read-write mode

You can easily check the current compatible value and open_mode of your database by running the following SQL commands:

    SQL> SHOW PARAMETER compatible

    NAME                   TYPE        VALUE
    ---------------------- ----------- ----------------------
    compatible             string      12.0.0.0.0

    SQL> SELECT open_mode FROM v$database;

    OPEN_MODE
    --------------------
    READ WRITE

When making use of the FOR TRANSPORT or the TO PLATFORM clauses in the BACKUP command, you cannot make use of the following clauses:

- CUMULATIVE
- forRecoveryOfSpec
- INCREMENTAL LEVEL n
- keepOption
- notBackedUpSpec
- PROXY
- SECTION SIZE
- TAG
- VALIDATE

Table recovery

In previous versions of Oracle Database, recovering a table to a specific point-in-time was never easy. Oracle has now solved this major issue by introducing the possibility to do a point-in-time recovery of a table, a group of tables, or even table partitions using RMAN, without affecting the remaining database objects. This makes the process easier and faster than ever before. Remember that Oracle has previously introduced features such as database point-in-time recovery (DBPITR), tablespace point-in-time recovery (TSPITR), and Flashback database; this is an evolution of the same technology and principles.
The recovery of tables and table partitions is useful in the following situations:

- To recover a very small set of tables to a particular point-in-time
- To recover a tablespace that is not self-contained to a particular point-in-time (remember that TSPITR can only be used if the tablespace is self-contained)
- To recover tables that are corrupted or were dropped with the PURGE option, so that the FLASHBACK DROP functionality cannot be used
- When logging for a Flashback table is enabled but the flashback target time or SCN is beyond the available undo
- To recover data that was lost after a data definition language (DDL) operation that changed the structure of a table

To recover tables and table partitions from an RMAN backup, the TARGET database must meet the following prerequisites:

- It must be open in READ WRITE mode
- It must be in ARCHIVELOG mode
- The COMPATIBLE parameter must be set to 12.0 or higher

You cannot recover tables or table partitions that belong to the SYS schema or reside in the SYSTEM and SYSAUX tablespaces, nor can you recover them on a standby database.

Now let's take a closer look at the steps to do a table or table partition recovery using RMAN:

1. First check that all the prerequisites for a table recovery are met.
2. Start an RMAN session with the CONNECT TARGET command.
3. Use the RECOVER TABLE command with all the required clauses.
4. RMAN determines which backup contains the data that needs to be recovered, based on the point-in-time specified.
5. RMAN creates an AUXILIARY instance; you can specify the location of the AUXILIARY instance files using the AUXILIARY DESTINATION or SET NEWNAME clause.
6. RMAN recovers the specified objects into the AUXILIARY instance.
7. RMAN creates a Data Pump export dump file that contains the objects.
8. RMAN imports the recovered objects from the dump file into the TARGET database. If you want to import the objects into the TARGET database manually, you can use the NOTABLEIMPORT clause in the RECOVER command.
RMAN optionally offers the possibility to rename the recovered objects in the TARGET database using the REMAP TABLE clause, or to import the recovered objects to a different tablespace using the REMAP TABLESPACE clause.

An example of how to use the new RECOVER TABLE command is:

    RMAN> RECOVER TABLE SCOTT.test
          UNTIL SEQUENCE 5481 THREAD 2
          AUXILIARY DESTINATION '/tmp/recover'
          REMAP TABLE SCOTT.test:my_test;
Packt
24 Sep 2013
6 min read

Introducing QlikView elements

People

People are the only active element of data visualization, and as such, they are the most important. We briefly describe the roles of several people who participate in our project, but we mainly focus on the person who is going to analyze and visualize the data. After the meeting, we get together with our colleague, Samantha, who is the analyst that supports the sales and executive teams. She currently manages a series of highly personalized Excel files that she creates from standard reports generated within the customer invoice and project management system. Her audience ranges from the CEO down to sales managers. She is not a pushover, but she is open to trying new techniques, especially given that the sponsor of this project is the CEO of QDataViz, Inc. As a data discovery user, Samantha possesses the following traits:

Ownership

She has a stake in the project's success or failure. She, along with the company, stands to grow as a result of this project, and most importantly, she is aware of this opportunity.

Driven

She is focused on grasping what we teach her and is self-motivated to continue learning after the project is finished. The cause of her drive is unimportant as long as she remains honest.

Honest

She understands that data is a passive element that is open to diverse interpretations by different people. She resists basing her arguments on deceptive visualization techniques or data omission.

Flexible

She does not endanger her job and company results by following every technological fad or whimsical idea. However, she realizes that technology does change and that a new approach can foment breakthroughs.

Analytical

She loves finding anomalies in the data and being the reason that action is taken to improve QDataViz, Inc. As a means to achieve what she loves, she understands how to apply functions and methods to manipulate data.
Knowledgeable

She is familiar with the company's data, and she understands the indicators needed to analyze its performance. Additionally, she serves as a data source and gives context to analysis.

Team player

She respects the roles of her colleagues and holds them accountable. In turn, she demands respect and is also obliged to meet her responsibilities.

Data

Our next meeting involves Samantha and Ivan, our Information Technology (IT) Director. While Ivan explains the data available in the customer invoice and project management system's well-defined databases, Samantha adds that she has vital data in Microsoft Excel that is missing from those databases. One Excel file contains the sales budget and another contains an additional customer grouping; both files are necessary to present information to the CEO. We take advantage of this discussion to highlight the following characteristics that make data easy to analyze.

Reliable

Ivan is going to document the origin of the tables and fields, which increases Samantha's confidence in the data. He is also going to perform a basic data cleansing and eliminate duplicate records whose only difference is a period, two transposed letters, or an abbreviation. Once the system is operational, Ivan will consider the impact any change in the customer invoice and project management system may have on the data. He will also verify that the data is continually updated while Samantha helps confirm the data's validity.

Detailed

Ivan will preserve as much detail as possible. If he is unable to handle large volumes of data as a whole, he will segment the detailed data by month and reduce the detail of a year's data in a consistent fashion. Conversely, he will consider adding detail by prorating payments between the products of paid invoices in order to maintain a consistent level of detail between invoices and payments.

Formal

An Excel file as a data source is a short-term solution.
While Ivan respects its temporary use to allow for a quick, first release of the data visualization project, he takes responsibility for finding a more stable medium to long-term solution. In the span of a few months, he will consider modifying the invoice system, investing in additional software, or creating a simple portal to upload Excel files to a database.

Flexible

Ivan will not prevent progress solely for bureaucratic reasons. Samantha respects that Ivan's goal is to make data more standardized, secure, and recoverable. However, Ivan knows that if he does not move as quickly as business does, he will become irrelevant as Samantha and others create their own black market of company data.

Referential

Ivan is going to make available manifold perspectives of QDataViz, Inc. He will maintain history, budgets, and forecasts by customers, salespersons, divisions, states, and projects. Additionally, he will support segmenting these dimensions into multiple groups, subgroups, classes, and types.

Tools

We continue our meeting with Ivan and Samantha, but we now change our focus to the tool we will use to foster great data visualization and analysis. We create the following list of basic features we expect from this tool:

Fast and easy implementation

We should be able to learn the tool quickly and be able to deliver a first version of our data visualization project within a matter of weeks. In this fashion, we start receiving a return on our investment within a short period of time.

Business empowerment

Samantha should be able to continue her analysis with little help from us. Also, her audience should be able to easily perform their own lightweight analysis and follow up on the decisions made.

Enterprise-ready

Ivan should be able to maintain hundreds or thousands of users and data volumes that exceed 100 million rows. He should also be able to restrict access to certain data to certain users.
Finally, he needs to have the confidence that the tools will remain available even if a server fails.

Based on these expectations, we talk about data discovery tools, which are increasingly becoming part of the architecture of many organizations. Samantha can use these tools for self-service data analysis. In other words, she can create her own data visualizations without having to depend on pre-built graphs or reports. At the same time, Ivan can be reassured that the tool does not interfere with his goal of providing an enterprise solution that offers scalability, security, and high availability. The data discovery tool we are going to use is QlikView, and the following diagram shows the overall architecture we will build and where this article focuses its attention:

[Figure: overall QlikView architecture]

Summary

In this article, we learned about people, data, and tools, which are essential parts of creating great data visualization and analysis.
Packt
19 Sep 2013
7 min read

Executing PDI jobs from a filesystem (Simple)

Getting ready

To get ready for this article, we first need to check that our Java environment is configured properly; to do this, check that the JAVA_HOME environment variable is set. Although the PDI scripts call other scripts at startup that try to detect our Java execution environment and derive a value for JAVA_HOME, it is a good rule of thumb to have that variable set properly anytime we work with a Java application.

The Kitchen script is in the PDI home directory, so the easiest way to launch it is to add the path to the PDI home directory to the PATH variable. This lets you start the Kitchen script from any place without specifying the absolute path to the Kitchen file location. If you do not do this, you will always have to specify the complete path to the Kitchen script file.

To play with this article, we will use the samples in the directory <book_samples>/sample1; here, <book_samples> is the directory where you unpacked all the samples of the article.

How to do it…

For starting a PDI job in Linux or Mac, use the following steps:

Open the command-line terminal and go to the <book_samples>/sample1 directory.

Let's start the sample job. To identify which job file needs to be started by Kitchen, we need to use the -file argument with the following syntax:

    -file:<complete_filename_to_job_file>

Remember to specify the correct path to the file, either as an absolute path or as a relative path.
The simplest way to start the job is with the following syntax:

    $ kitchen.sh -file:./export-job.kjb

If you're not positioned in the directory where the job files are located, you must specify the complete path to the job file as follows:

    $ kitchen.sh -file:/home/sramazzina/tmp/samples/export-job.kjb

Another option to start our job is to separately specify the name of the directory where the job file is located and then give the name of the job file. To do this, we need to use the -dir argument together with the -file argument. The -dir argument lets you specify the location of the job file directory using the following syntax:

    -dir:<complete_path_to_job_file_directory>

So, if we're located in the same directory where the job resides, to start the job, we can use the following new syntax:

    $ kitchen.sh -dir:. -file:export-job.kjb

If we're starting the job from a different directory than the directory where the job resides, we can use the absolute path and the -dir argument to set the job's directory as follows:

    $ kitchen.sh -dir:/home/sramazzina/tmp/samples -file:export-job.kjb

For starting a PDI job with parameters in Linux or Mac, perform the following steps:

Normally, PDI manages input parameters for the executing job. To set parameters using the command-line script, we use the -param argument to specify the parameters for the job we are going to launch. The syntax is as follows:

    -param:<parameter_name>=<parameter_value>

Our sample job and transformation accept a sample parameter called p_country, which specifies the country whose customers we want to export to a file. Let's suppose we are positioned in the same directory where the job file resides and we want to call our job to extract all the customers for the country U.S.A.
In this case, we can call the Kitchen script using the following syntax:

    $ kitchen.sh -param:p_country=USA -file=./export-job.kjb

Of course, you can apply the -param switch to all the other three cases we detailed previously.

For starting a PDI job in Windows, use the following steps:

In Windows, a PDI job from the filesystem can be started by following the same rules that we saw previously, using the same arguments in the same way. The only difference is in the way we specify the command-line arguments. Any time we start the PDI jobs from Windows, we need to specify the arguments using the / character instead of the - character we used for Linux or Mac. Therefore, this means that:

    -file:<complete_filename_to_job_file>

Will become:

    /file:<complete_filename_to_job_file>

And:

    -dir:<complete_path_to_job_file_directory>

Will become:

    /dir:<complete_path_to_job_file_directory>

From the directory <book_samples>/sample1, if you want to start the job, you can run the Kitchen script using the following syntax:

    C:\temp\samples>Kitchen.bat /file:./export-job.kjb

Regarding the use of PDI parameters in command-line arguments, the second important difference on Windows is that we need to substitute the = character in the parameter assignment syntax with the : character. Therefore, this means that:

    -param:<parameter_name>=<parameter_value>

Will become:

    /param:<parameter_name>:<parameter_value>

From the directory <book_samples>/sample1, if you want to extract all the customers for the country U.S.A., you can start the job using the following syntax:

    C:\temp\samples>Kitchen.bat /param:p_country:USA /file:./export-job.kjb

For starting the PDI transformations, perform the following steps:

The Pan script starts PDI transformations. On Linux or Mac, you can find the pan.sh script in the PDI home directory.
Assuming that you are in the same directory, <book_samples>/sample1, where the transformation is located, you can start a simple transformation with a command in the following way:

    $ pan.sh -file:./read-customers.ktr

If you want to start a transformation by specifying some parameters, you can use the following command:

    $ pan.sh -param:p_country=USA -file:./read-customers.ktr

In Windows, you can use the Pan.bat script, and the sample commands will be as follows:

    C:\temp\samples>Pan.bat /file:./read-customers.ktr

Again, if you want to start a transformation by specifying some parameters, you can use the following command:

    C:\temp\samples>Pan.bat /param:p_country=USA /file:./read-customers.ktr

Summary

In this article, you were guided through simply starting a PDI job using the Kitchen script. In this case, the PDI job we started was stored locally in the computer filesystem, but it could be anywhere in the network in any place that is directly accessible. You learned how to start simple jobs both with and without a set of input parameters previously defined in the job. Using command-line scripts is a fast way to start batches, and it is also the easiest way to schedule our jobs using our operating system's scheduler. The script accepts a set of inline arguments to pass the proper options required by the program to run our job in any specific situation.
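The Linux and Mac invocations of Kitchen shown in this article can be wrapped in a small shell helper. The following is a minimal sketch, not part of PDI: the function name build_kitchen_cmd and the KITCHEN variable are hypothetical, it only uses the -file: and -param:name=value forms that appear in the article, and it prints the command rather than running it, so nothing is executed until you have reviewed it. It assumes simple values without spaces.

```shell
#!/bin/sh
# Sketch: assemble a Kitchen command line from a job file and optional parameters.
# Assumptions (not from PDI itself): the KITCHEN variable and the function name
# are illustrative; parameter values are assumed to contain no spaces.

# Name of the Kitchen script; assumes the PDI home directory is on the PATH.
KITCHEN=${KITCHEN:-kitchen.sh}

# The article recommends having JAVA_HOME set, so warn when it is missing.
if [ -z "${JAVA_HOME:-}" ]; then
    echo "warning: JAVA_HOME is not set" >&2
fi

# build_kitchen_cmd JOB_FILE [NAME=VALUE ...]
# Prints the resulting command line on standard output.
build_kitchen_cmd() {
    job_file=$1
    shift
    cmd="$KITCHEN -file:$job_file"
    # Each remaining argument becomes one -param:name=value option.
    for pair in "$@"; do
        cmd="$cmd -param:$pair"
    done
    printf '%s\n' "$cmd"
}

build_kitchen_cmd ./export-job.kjb p_country=USA
```

Printing the command first makes it easy to paste into a scheduler entry or to inspect before running it; replacing the final printf with an actual invocation is left as a deliberate choice.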
Packt
17 Sep 2013
12 min read

Oracle B2B Overview

B2B environment setup

Here is a list of some OFM concepts that will be used in this article:

- Domain: It is the basic administration unit that includes a special WebLogic Server instance called the Administration Server, and optionally one or many Java components.
- Java component: It is a Java EE application deployed to an Oracle WebLogic Server domain as part of a domain template. For example, SOA Suite is a Java component.
- Managed server: It is an additional WebLogic Server included in a domain, to host Java components such as SOA Suite.

We will use the UNIX operating system for our tutorials. The following table depicts the directory environment variables used throughout the article for configuring the Oracle SOA Suite deployment:

Name | Variable | What It Is | Example
Middleware home | MW_HOME | The top-level directory for all OFM products |
WebLogic Server home | WL_HOME | Contains installed files necessary to host a WebLogic Server | $MW_HOME/wlserver_10.3
Oracle home | SOA_ORACLE_HOME | Oracle SOA Suite product directory | $MW_HOME/Oracle_SOA1
Oracle Common Home | ORACLE_COMMON_HOME | Contains the binary and library files required for the Oracle Enterprise Manager Fusion Middleware Control and Java Required Files (JRF) | $MW_HOME/oracle_common
Domain home | SOA_DOMAIN_HOME | The absolute path of the source domain containing the SOA Suite Java component | $MW_HOME/user_projects/domains/SOADomain
Java home | JAVA_HOME | Specifies the location of JDK (must be 1.6.04 or higher) or JRockit | $MW_HOME/jdk160_29
Ant Home | ANT_HOME | Specifies the location of the Ant archive | $MW_HOME/org.apache.ant_1.7.1

The following figure depicts a snapshot of the SOA Suite directory's hierarchical structure:

[Figure: SOA Suite directory hierarchy]

For the recommended SOA Suite directory location, please refer to the OFM Enterprise Deployment Guide for SOA Suite that can be found at http://docs.oracle.com/cd/E16764_01/core.1111/e12036/toc.htm.

JDeveloper installation tips

JDeveloper is a development tool that will be used in the article.
It is a full-service Integrated Development Environment (IDE), which allows for the development of SOA projects along with a host of other Oracle products, including Java. If it has not been installed yet, one may consider downloading and installing the VM VirtualBox (VBox) Image of the entire package of SOA Suite, B2B, and JDeveloper, provided by Oracle on the Oracle download site found at http://www.oracle.com/technetwork/middleware/soasuite/learnmore/vmsoa-172279.html. All you need to do is install Oracle VM VirtualBox and import the SOA/BPM appliance. This is for evaluation and trial purposes, and is not recommended for production use; however, for following along with the tutorial in the article, it is perfect. The following table shows minimum and recommended requirements for the VBox Image:

                    Minimum    Recommended
    Memory (RAM)    4-6 GB     8 GB
    Disk Space      25 GB      50 GB

While VMs are convenient, they do use quite a bit of disk space and memory. If you don't have a machine that meets the minimum requirements, it will be a challenge to try the exercises. The alternative is to download the bits for the platform you are using from the Oracle download page, then install and configure each software package accordingly, including a JDK, a DB, WebLogic Server, SOA Suite, and JDeveloper, among other things you may need.

If you decide that you have enough system resources to run the VBox Image, here are some of the major steps that you need to perform to download and install it. For the complete set of instructions, please follow the Introduction and Readme file that can be downloaded from http://www.oracle.com/technetwork/middleware/soasuite/learnmore/soabpmvirtualboxreadme-1612068.pdf.

- Download the Introduction and Readme file, and review it.
- Enable hardware virtualization in your PC BIOS if necessary.
- To download and install the VirtualBox software (the engine that runs the virtual machine on your host machine), click on the link Download and install Oracle VM VirtualBox on the download page.
- To download the 7 pieces of the ZIP file, click on each file download ending with 7z.00[1-7] on the download page.
- To download the MD5 Checksum tool if you don't have one, click on the link Download MD5sums if you're on Windows to check your download worked okay on the download page.
- Run the MD5 Checksum tool to verify the 7 downloaded files: md5sums oel5u5-64bit-soabpm-11gr1-ps5-2-0-M.7z.001. Repeat for all 7 files. (This takes quite a while, but it is best to do it, so that you can verify that your download is complete and accurate.)
- Compare the results of the program with the results in the download that ends with .mdsum. They should match exactly.
- Extract the VBox Image from the .001 file using a Zip/Unzip tool. Use a ZIP tool such as 7-Zip (available as freeware for Windows), WinZip, or another tool to extract the .ova file from the 7 files into a single file on your platform. With 7-Zip, if you extract from the first file, it will find the other 6 files and combine them all as it extracts.
- Start VirtualBox and set preferences such as the location of the VBox Image on your disk (follow the instructions in the readme file).
- Import the new .ova file that was extracted from the ZIP file.
- Check settings and adjust memory/CPU.
- Start the appliance (VBox Image).
- Log in as oracle with password oracle (check the Readme).
- Choose the domain type dev_soa_osb.
- Set up a shared folder that you can use to share files between your machine and the virtual machine, and restart the VM.
- Once you are logged back in, start the admin server using the text-based menu.
- Once the server is started, you can start the graphical desktop using the text-based menu.
- Click on the JDeveloper icon on the desktop of the VM to start JDeveloper.
- Choose Default Role when prompted for a role.
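On a Linux or UNIX host, the checksum step in the list above can be sketched with a short shell function instead of checking each part by hand. This is a minimal sketch under stated assumptions: it relies on the GNU md5sum utility rather than the Windows md5sums tool named in the steps, it assumes the checksum list uses the "<md5>  <filename>" layout that md5sum itself produces (the layout of the downloaded .mdsum file is not described in the article), and the function name verify_parts is hypothetical.

```shell
#!/bin/sh
# Sketch: verify all downloaded archive parts against a checksum list.
# Assumptions (not from the article): GNU md5sum is installed, and the
# checksum list has one "<md5>  <filename>" entry per line.

# verify_parts CHECKSUM_FILE
# Returns 0 only when every listed file matches its recorded checksum.
verify_parts() {
    checksum_file=$1
    if [ ! -r "$checksum_file" ]; then
        echo "checksum list not found: $checksum_file" >&2
        return 2
    fi
    # Work in the directory holding the list so relative file names resolve.
    # md5sum -c re-hashes each listed file and fails on any mismatch.
    ( cd "$(dirname "$checksum_file")" && md5sum -c "$(basename "$checksum_file")" )
}
```

A call such as verify_parts ./parts.md5 (with parts.md5 standing in for the downloaded checksum list) prints one OK or FAILED line per part, and the exit status can gate the extraction step so that a corrupted part is caught before the .ova file is unpacked.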
At the time of writing, the latest available version is 11g PS5 (11.1.1.6.0). The VBox Image comes with SOA Suite, Oracle 11g XE Database, and JDeveloper pre-installed on a Linux Virtual Machine. Using the VirtualBox technology, you can run this Linux machine virtually on your laptop, desktop, or on a variety of other platforms. For the purpose of this article, you should choose the dev_soa_osb type of domain.

System requirements

Oracle B2B is installed as a part of the SOA Suite installation. The SOA Suite installation steps are well documented, and are beyond the scope of this article. If you have never installed Oracle SOA Suite 11g, consult the Installation Guide for Oracle SOA Suite and Oracle Business Process Management Suite 11g. It can be downloaded from the Oracle Technology Network (OTN) Documentation downloads page at http://docs.oracle.com/cd/E23943_01/doc.1111/e13925/toc.htm.

There are several important topics that do not have enough coverage in the SOA Suite Installation Guide. One of them is how to prepare the environment for the SOA/B2B installation. To begin, it is important to validate whether your environment meets the minimum requirements specified in the Oracle Fusion Middleware System Requirements and Specifications document. It can be downloaded from the Oracle Technology Network (OTN) Documentation downloads page at http://docs.oracle.com/html/E18558_01/fusion_requirements.htm. This document provides very important SOA Suite installation recommendations, such as the minimum disk space and memory requirements, which can help the IT hardware team with its procurement planning process. For instance, the Oracle recommended hardware and system requirements for SOA Suite are:

- Minimum physical memory required: 2 gigabytes
- Minimum available memory required: 4 gigabytes
- CPU: dual-core Pentium, 1.5 GHz or greater
- Disk space: 15 gigabytes or more

This document also has information about supported databases and database versions.
Another important document that has plenty of relevant information is the Oracle Fusion Middleware 11g Release 1 (11.1.1.x) Certification Matrix. It can be downloaded from the Oracle Technology Network (OTN) Documentation downloads page at http://www.oracle.com/technetwork/middleware/downloads/fmw-11gr1certmatrix.xls. It is indeed a treasure chest of information, and it may save you from a lot of headaches. The last thing anyone wants is to have to re-install just because they missed or did not properly read some recommendations. Here are some important points from the spreadsheet you don't want to miss:
Hardware platform version compatibility with a particular SOA Suite release
Supported JDK versions
Interoperability support for SOA Suite with WebLogic Server
Supported database versions
In conclusion, the following list includes the complete SOA Suite software stack (as used in the article):
Oracle WebLogic Server (10.3.6) (Required)
Repository Creation Utility (RCU) (11.1.1.6.0) (Required)
SOA Suite 11g (11.1.1.6.0) (Required)
JDeveloper (11.1.1.6.0) (Required)
JDeveloper Extension for SOA (Required)
Oracle B2B Document Editor release 7.05 (Optional)
Oracle Database 11g (Required)
Oracle B2B installation and post-installation configuration notes
There are several important installation and post-installation steps that may directly or indirectly impact the B2B component's behavior. Creating a WebLogic domain is one of the most important SOA Suite 11g installation steps. Former BEA engineers who worked with WebLogic before 11g never had to select SOA Suite components while creating a domain, and the process is equally new to Oracle engineers who are familiar only with prior releases of SOA Suite. There are several steps in this process that, if missed, might require a complete re-installation. A common mistake that people make when creating a new domain is that they don't check the Enterprise Manager checkbox.
As a result, Enterprise Manager is not available, meaning that neither instance monitoring and tracking nor access to the B2B configuration properties is available. Avoid this mistake by selecting the Oracle Enterprise Manager checkbox. Oracle Enterprise Manager has been assigned a new name: Enterprise Manager Fusion Middleware Control.
While planning the SOA Suite deployment architecture, it is recommended to choose ahead of time between the following two WebLogic domain configurations:
Developers domain
Production domain
In the Developers domain configuration, SOA Suite is installed as part of the administration server, so a separate managed server is not created. This configuration could be a good choice for a development server, a personal laptop, or any environment where available memory is limited. Always keep in mind that SOA Suite requires up to 4 gigabytes of available memory. To set up the Developers domain, select the Oracle SOA Suite for developers checkbox on the Select Domain Source page, as shown in the following screenshot:
Oracle strongly recommends against using this configuration in a production environment and warns that it will not be supported; that is, Oracle Support won't be able to provide assistance for any issues that occur in this environment.
Conversely, if the Oracle SOA Suite checkbox is selected, as shown in the following screenshot, a managed server will be created with the default name soa_server1. Creating a separate managed server (which is a WebLogic Java Virtual Machine) and deploying SOA Suite to this managed server provides a more scalable configuration.
If SOA Suite for developers was installed, you need to perform the following steps to activate the B2B user interface:
Log in to the Oracle WebLogic Server Administration Console using the following URL: http://<localserver>:7001/console (note that 7001 is the default port unless a different port was chosen during the installation process).
Provide the default administrator account (the WebLogic user, unless it was changed during the installation process).
Select Deployments from the Domain Structure list.
On the bottom-right side of the page, select b2bui from the Deployments list (as shown in the following screenshot).
On the next page, click on the Targets tab.
Select the Component checkbox to enable the Change Target button.
Click on the Change Target button.
Select the AdminServer checkbox and click on the Yes button.
Click on the Activate Changes button.
Click on the Deployments link in the WebLogic domain structure.
The B2B user interface is activated.
If the SOA Suite production configuration was chosen, these steps are not necessary. However, you must first configure Node Manager. To do that, execute the setNMProps script and then start Node Manager:
$ORACLE_COMMON_HOME/common/bin/setNMProps.sh
$MW_HOME/wlserver_n/server/bin/startNodeManager.sh
Oracle B2B web components
Oracle B2B Gateway is deployed as part of the Oracle SOA Service Infrastructure, or SOA-Infra. SOA Infrastructure is a Java EE compliant application running on Oracle WebLogic Server.
Java EE compliant application: It is a wrapper around web applications and Enterprise Java Bean (EJB) applications
Web Application: It usually represents the User Interface Layer, and includes Java Server Pages (JSP), Servlets, HTML, and so on
Servlet: It is a module of Java code that runs as a server-side application
Java Server Pages (JSP): It is a programming technology used to make dynamic web pages
WAR archive: It is an artifact for the web application deployment
Enterprise Java Beans: These are server-side domain objects that fit into a standard component-based architecture for building enterprise applications with Java
EJB application: It is a collection of Enterprise Java Beans
The following table shows a list of B2B web components installed as part of the SOA Infrastructure application. They include an enterprise application, several EJBs, a web application, and a web service. The B2B web application provides a link to the B2B Interface. The B2B MetadataWS Web Service provides Oracle SOA Service Infrastructure with access to the metadata repository. Stateless EJBs are used by the B2B Engine. This table might be helpful for understanding how Oracle B2B integrates with Oracle SOA Suite. It could also be useful while developing a B2B high availability architecture.
Name | Application Type
b2b | Web Application
b2bui | JEE Application
B2BInstanceMessageBean | EJB
B2BStarterBeanWLS | EJB
B2BUtilityBean | EJB
B2BMetadataWS | Web Service
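When working through a deployment or high availability plan, it can help to keep the component list above in a machine-readable form. The following Python sketch simply restates the table as data and groups the names by application type; the constant and function names are hypothetical and are not part of any Oracle API.

```python
from collections import defaultdict

# (name, application type) pairs copied from the table above.
B2B_WEB_COMPONENTS = [
    ("b2b", "Web Application"),
    ("b2bui", "JEE Application"),
    ("B2BInstanceMessageBean", "EJB"),
    ("B2BStarterBeanWLS", "EJB"),
    ("B2BUtilityBean", "EJB"),
    ("B2BMetadataWS", "Web Service"),
]


def group_by_type(components):
    """Group component names by their application type, keeping table order."""
    grouped = defaultdict(list)
    for name, app_type in components:
        grouped[app_type].append(name)
    return dict(grouped)
```

Calling group_by_type(B2B_WEB_COMPONENTS)["EJB"] lists the three EJBs, which is a convenient starting point for checking that each component is targeted to the intended server.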