
How-To Tutorials


Formatting Report Items and Placeholders

Packt
15 Sep 2015
12 min read
In this article by Steven Renders, author of the book Microsoft Dynamics NAV 2015 Professional Reporting, we will see how you can format report items and use placeholders when you design the layout of a report in RDLC. As you will notice when you create a new report layout, by default, amounts or quantities in the report are not formatted in the way we are used to in Dynamics NAV. This is because the dataset generated by Dynamics NAV contains the numerical values without formatting. It sends a separate field with a format code that can be used in the Format property of a textbox in the layout.

(For more resources related to this topic, see here.)

Formatting report items

Numerical fields have a Format property. This Format property is populated by Dynamics NAV and contains, at runtime, an RDL format code that you can use in the Format property of a textbox in Visual Studio. To get started with formatting, perform the following steps:

- When you right-click on a textbox, a menu appears, in which you can select the properties of the textbox, as shown in the following screenshot.
- In the Textbox Properties window, go to Number and then select Custom.
- Click on the Fx button to open Expression Designer and type an expression. The result of the expression will be the value of the property.

In this case, our expression should fetch the value of the format field that belongs to the Quantity field. The expression will be:

    =Fields!Quantity_ItemLedgerEntryFormat.Value

This means that the format of the textbox is fetched from the dataset field Quantity_ItemLedgerEntryFormat. Instead of using Expression Designer, you can also type this expression directly into the Format code textbox, or in the Format property in the properties window of the textbox, as shown in the following screenshot:

Reporting Services and RDLC use .NET Framework formatting strings for the Format property of a textbox. The following is a list of possible format strings:

- C: Currency
- D: Decimal
- E: Scientific
- F: Fixed point
- G: General
- N: Number
- P: Percentage
- R: Round trip
- X: Hexadecimal

After the format string, you can provide a number representing the amount of digits that have to be shown to the right of the decimal point. For example:

- F2 means a fixed point with 2 digits: 1.234,00 or 1,234.00
- F0 means a fixed point with no digits: 1.234 or 1,234

The thousand and decimal separators (. and ,) that are applied, and the currency symbol, depend on the Language property of the report. More information about .NET Framework formatting strings can be found here:

- Custom Numeric Format Strings: http://msdn.microsoft.com/en-us/library/0c899ak8.aspx
- Standard Date and Time Format Strings: http://msdn.microsoft.com/en-us/library/az4se3k1.aspx

As an alternative, you can use custom format strings to define the format value. This is actually how Dynamics NAV populates the Format fields in the dataset. The syntax is:

    #,##0.00

You can use this to define the precision of a numeric field. The following image provides an example:

Why does the Format property sometimes have no effect?

To apply formatting to a textbox, the textbox must contain an expression, for example, =Fields!LineTotal.Value or =1000. When the text in the textbox does not begin with the = sign, the text is interpreted as a string and formatting does not apply.

You can also set the format in the report dataset designer, instead of in the layout. You can do this by using the Format function, as sketched below.
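As a minimal sketch of what that could look like (the exact format string and the place where you assign the result depend on your report, so treat this as an illustration rather than the book's own code), the Format function converts a value into text according to a NAV format string, and the resulting text is what you send to the dataset:

    // Converts a decimal into formatted text; 0 means no fixed length is imposed
    FORMAT("Item Ledger Entry".Quantity, 0, '<Precision,2:2><Standard Format,0>')
    // For example, 1234.5 becomes '1,234.50' (the separators depend on the regional settings)
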
You can do this directly in the dataset in the SourceExpression of any field, or you can do it in the data item triggers, for example the OnAfterGetRecord() trigger. But, if you use an expression in the SourceExpression, you lose the option to use the IncludeCaption property. A good example of a textbox Format property is available here: http://thinkaboutit.be/2015/06/how-do-i-implement-blankzero-or-replacezero-in-a-report.

Using placeholders

If you select a textbox and right-click on it, you open the textbox properties. But, inside the textbox, there's the placeholder. A placeholder is the text, or expression, that becomes the information displayed in the textbox at runtime. The placeholder also has a set of properties that you can set. So you can consider a placeholder as an entity inside a textbox, with its own set of properties, which are, by default, inherited from its parent, the textbox. The following screenshot shows that, when you right-click on the text in a textbox, you can then select its placeholder properties:

A textbox can contain one or more placeholders. By using multiple placeholders in one textbox, you can display multiple fields in one textbox and give them different properties. In the following example, I will add a header to the report, and in the header, I will display the company information. To add a header (and/or footer) to a report, go to the Report menu and select:

- Add Page Header
- Add Page Footer

The following screenshot shows an example of this:

A report can contain a maximum of one header and one footer. As an alternative, you can right-click anywhere in the empty space to the left or right of the report body and add a page header or footer there. The page header and page footer are always shown on every page, except if you decide not to show them for the first and/or last page by using the properties:

- PrintOnFirstPage
- PrintOnLastPage

Dynamically hiding a page header/footer

A page header and footer cannot be hidden dynamically. A workaround would be to put a rectangle in the page header and/or footer and use the Hidden property of the rectangle to show or hide the content of the header/footer dynamically. You need to be aware that, even when you hide the content of the page header/footer, the report viewer will preserve the space. This means that the header/footer is still displayed, but will be empty.

A page header or footer cannot contain a data region. The only controls you can add to a page header or footer are:

- Textbox
- Line
- Rectangle
- Image

So, in the page header, I will add a textbox with a placeholder, as in the following screenshot:

To do this, add a textbox in the page header. Then, drag a field from the dataset into the textbox. Then, add one or more spaces and drag another field into the same textbox. You will notice that the two fields can be selected inside the textbox and, when they are, they become gray. If you right-click on the placeholder, you can see its properties. This is how you can see that it is a placeholder.

It is interesting that the mark-up type for a placeholder can be changed to HTML. This means that, if the placeholder contains HTML, it will be recognized by the report viewer and rendered, as it would be by a browser. The HTML tags that are recognized are the following:

- <A href>
- <FONT>
- <H{n}>, <DIV>, <SPAN>, <P>, <LI>, <HN>
- <B>, <I>, <U>, <S>
- <OL>, <UL>, <LI>

If you use these HTML tags in a badly organized way, they will be interpreted as text and rendered as such.
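As a small illustration (the company name and address below are made up, not taken from the article), a placeholder whose mark-up type is set to HTML could receive a value such as the following, which the report viewer would render as bold, underlined text followed by a mail hyperlink:

    <B><U>Contoso Ltd.</U></B> <A href="mailto:info@contoso.com">info@contoso.com</A>
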
The possibility of using HTML in placeholders creates an opportunity for Dynamics NAV developers. What you can do, for example, is generate the HTML tags in C/AL code and send them to the dataset. By using this approach, you can format text and manage it dynamically via C/AL. You could even use a special setup table in which you let users decide how certain fields should be formatted.

In our example report, I will format the company e-mail address in two ways. First, I will use the placeholder expression to underline the text. Then, I will go to the C/AL code and create a function that formats the e-mail address as a mailto hyperlink. When you run the report, the e-mail address is underlined and there is also a hyperlink; when you click on it, your e-mail client opens. As you can see, the formatting in the placeholder and the formatting in the C/AL code are combined.

Use a codeunit or buffer table

In this example I used a custom function in the report (FormatAsMailto). In real life, it is better to create these types of functions in a separate codeunit, or buffer table, so you can reuse them in other reports.

Important properties – CanGrow and CanShrink

A textbox has many properties, as you can see in the following screenshot. If you right-click a textbox and select the textbox properties, they will open in a separate popup window. In this window, some of the textbox properties are available and they are divided into categories. To see all of the textbox properties, you can use the properties window, which is usually on the right in Visual Studio. Here you can sort the properties or group them using the buttons on top: the first button groups the properties, the second button sorts the properties, and the third button opens the properties popup window.

I am not going to discuss all of the properties, but I would like to draw your attention to CanGrow and CanShrink. These two properties can be set to True or False. If you set CanGrow to True, the height of the textbox will increase if the text, at runtime, is bigger than the width of the textbox. With CanShrink, the height of the textbox may shrink. I do not recommend using these properties, except when really necessary. When a textbox grows, its height increases and it pushes down the content below it. This makes it difficult to predict whether the content of the report will still fit on the page. Also, the effects of CanGrow and CanShrink differ depending on whether you run the report in Preview, export it to PDF, Word, or Excel, or print it.

Example – create an item dashboard report

In this example, I am going to create an item dashboard report. Actually, I will create a first version of the dashboard and then enhance it. The result of the report looks like the following screenshot:

What we need to do is show the inventory of a list of items by location. The report also includes totals and subtotals of the inventory by location and by item, and a grand total. To start, you define a dataset, as follows:

In this dataset, I will start with the item table and, per item, fetch the item ledger entries. The inventory is the sum of the quantities of the item in the item ledger entry table. I have also included a filter, using the PrintOnlyIfDetail property of the item data item. This means that, if an item does not have any ledger entries, it will not be shown in the report. Also, I'm using the item ledger entry table to get the location code and quantity fields.
In the report layout, I will create a group and calculate the inventory via an aggregate function. In real life, there might be many items and ledger entries, so this approach is not the best one. It would be better to use a buffer table or a query object, and calculate the inventory and filter in the dataset instead of in the layout. At this point, my objective is to demonstrate how you can use a Matrix/Tablix to create a layout that has a dynamic number of rows and columns.

Once you have defined the dataset, open the layout and add a matrix control to the report body. In the data cell, use the Quantity field; on the row, use the Item No.; and, on the column, use the Location Code. This will create the following matrix and groups:

Next, modify the expression of the textbox that contains the item number to the following expression:

    =Fields!Description_Item.Value & " (" & Fields!No_Item.Value & ")"

This will display the item description and, between brackets, the item number. Next, change the sorting of the group by item number to sort on the description. Then, add totals for the two groups. This will add an extra column and row to the matrix. Select the Quantity and then select Sum as the aggregate. Then, select the four textboxes and, in the properties, apply the formatting for the quantity field. Finally, you can use different background colors for the textboxes in the total rows and resize the description column, to resemble the layout in the preceding screenshot.

If you save and run the report, you have now created an item dashboard. Notice how easy it is to use the matrix control to create a dashboard. At runtime, the number of columns depends on the number of locations: the matrix has a dynamic number of columns. There is no detail level, because the ledger entries are grouped at row and at column level.

Colors and background colors

When using colors in a report, pay attention to how the report will be printed. Not all printers are color printers, so you need to make sure that your visualization still has the intended effect. That's why I have used gray colors in this example. Colors are sometimes also used by developers as a trick to see, at runtime, where each textbox is displayed and to test report rendering in different formats. If you do this, remember to remove the colors at the end of the development phase of your report.

Summary

Textboxes have a lot of properties and contain placeholders, so we can format information in many ways, including using HTML, which can be managed from C/AL, for example using a layout setup table. It's important to understand how you can format report items in Dynamics NAV, so you can create a consistent look and feel in your reports, as is done inside the Dynamics NAV application.

Resources for Article:

Further resources on this subject:

- Standard Functionality [article]
- Understanding and Creating Simple SSRS Reports [article]
- Understanding master data [article]


Welcome to JavaScript in the full stack

Packt
15 Sep 2015
12 min read
In this article by Mithun Satheesh, the author of the book Web Development with MongoDB and NodeJS, you will not only learn how to use JavaScript to develop a complete single-page web application such as Gmail, but you will also learn how to achieve the following with JavaScript throughout the remaining part of the book:

- Completely power the backend using Node.js and Express.js
- Persist data with a powerful document-oriented database such as MongoDB
- Write dynamic HTML pages using Handlebars.js
- Deploy your entire project to the cloud using services such as Heroku and AWS

With the introduction of Node.js, JavaScript has officially gone in a direction that was never even possible before. Now, you can use JavaScript on the server, and you can also use it to develop full-scale enterprise-level applications. When you combine this with the power of MongoDB and its JSON-powered data, you can work with JavaScript in every layer of your application.

(For more resources related to this topic, see here.)

A short introduction to Node.js

One of the things that people get most confused about while getting acquainted with Node.js is understanding what exactly it is. Is it a different language altogether, is it just a framework, or is it something else? Node.js is definitely not a new language, and it is not just a framework on top of JavaScript either. It can be considered a runtime environment for JavaScript built on top of Google's V8 engine. It provides us with a context where we can write JS code on any platform where Node.js can be installed. That is anywhere!

Now a bit of history: back in 2009, Ryan Dahl gave a presentation at JSConf that changed JavaScript forever. During his presentation, he introduced Node.js to the JavaScript community, and after a roughly 45-minute talk, he concluded it, receiving a standing ovation from the audience in the process. He was inspired to write Node.js after he saw a simple file upload progress bar on Flickr, the image-sharing site. Realizing that the site was going about the whole process the wrong way, he decided that there had to be a better solution.

Now let's go through the features of Node.js that make it unique among server-side programming environments.

The advantage that the V8 engine brings in

The V8 engine was developed by Google and was made open source in 2008. As we all know, JavaScript is an interpreted language, and it will not be as efficient as a compiled language, since each line of code gets interpreted one by one while the code is executed. The V8 engine brings in an efficient model here: the JavaScript code is compiled into machine-level code and execution happens on the compiled code instead of interpreting the JavaScript. But even though Node.js uses the V8 engine, Joyent, the company that maintains Node.js development, does not always update the V8 engine to the latest versions that Google actively releases.

Node is single threaded!

You might be asking how a single-threaded model helps. Typical PHP, ASP.NET, Ruby, or Java based servers follow a model where each client request results in the instantiation of a new thread or even a process. When it comes to Node.js, requests are run on the same thread, with even shared resources. A common question is: what is the advantage of using such a model? To understand this, we should understand the problem that Node.js tries to resolve.
It tries to do asynchronous processing on a single thread to provide more performance and scalability for applications that have to handle a lot of web traffic. Imagine web applications that handle millions of concurrent requests: if the server creates a new thread to handle each request that comes in, it will consume a lot of resources, and we would end up adding more and more servers to scale the application. The single-threaded asynchronous processing model has its advantage in this context, and you can process many more concurrent requests with fewer server-side resources. But there is a downside to this approach: Node.js, by default, will not utilize all the CPU cores available on the server it is running on without using an extra module such as pm2.

The point that Node.js is single threaded doesn't mean that Node doesn't use threads internally. It means that the developer, and the execution context that their code has exposure to, has no control over the threading model used internally by Node.js. If you are new to the concept of threads and processes, we suggest that you go through some preliminary articles on the subject. There are plenty of YouTube videos on the topic as well. The following reference could be used as a starting point: http://www.cs.ucsb.edu/~rich/class/cs170/notes/IntroThreads/

Nonblocking asynchronous execution

One of the most powerful features of Node is that it is event-driven and asynchronous. So how does an asynchronous model work? Imagine you have a block of code and at some nth line you have an operation that is time consuming. What happens to the lines that follow the nth line while this operation gets executed? In normal synchronous programming models, the lines that follow the nth line have to wait until the operation at the nth line completes.

An asynchronous model handles this case differently. To handle this scenario in an asynchronous approach, we need to segment the code that follows the nth line into two sections. The first section is dependent on the result of the operation at the nth line, and the second is independent of that result. We wrap the dependent code in a function with the result of the operation as its parameter and register it as a callback to the operation on its success. Once the operation completes, the callback function is triggered with its result. Meanwhile, we can continue executing the result-independent lines without waiting for the result. So, in this scenario, the execution is never blocked waiting for a process to complete; it just goes on, with callback functions registered on each operation's completion. Simply put, you assign a callback function to an operation, and when Node determines that the completion event has been fired, it will execute your callback function at that moment.

We can look at the following example to understand the asynchronous nature in detail:

    console.log('One');
    console.log('Two');
    setTimeout(function() {
      console.log('Three');
    }, 2000);
    console.log('Four');
    console.log('Five');

In a typical synchronous programming language, executing the preceding code would yield the following output:

    One
    Two
    ... (2 second delay) ...
    Three
    Four
    Five

However, in an asynchronous approach, the following output is seen:

    One
    Two
    Four
    Five
    ... (approx. 2 second delay) ...
    Three

The function that actually logs Three is known as a callback to the setTimeout function.
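As a further sketch of the same pattern with a real asynchronous operation (the file name below is hypothetical), reading a file with Node's fs module has exactly this shape: the result-dependent code lives in the callback, and the independent code keeps running in the meantime:

    var fs = require('fs');

    // Dependent code: runs only once the read completes (or fails)
    fs.readFile('data.txt', 'utf8', function(err, contents) {
      if (err) {
        return console.error(err);
      }
      console.log('File contents:', contents);
    });

    // Independent code: executes immediately, without waiting for the read
    console.log('Reading the file...');
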
If you are still interested in learning more about asynchronous models and the callback concept in JavaScript, the Mozilla Developer Network (MDN) has many articles that explain these concepts in detail.

Node Package Manager

Writing applications with Node is really enjoyable when you realize the sheer wealth of information and tools at your disposal! Using Node's built-in package manager (npm), you can literally find tens of thousands of modules that can be installed and used within your application with just a few keystrokes! One of the reasons for the huge success of Node.js is npm, which is one of the best package managers out there, with a very small learning curve. If this is the first package manager you have been exposed to, you should consider yourself lucky!

On a regular monthly basis, npm handles more than a billion downloads, and it has around 150,000 packages currently available for you to download. You can view the library of available modules by visiting www.npmjs.com. Downloading and installing any module within your application is as simple as executing the npm install package command. Have you written a module that you want to share with the world? You can package it using npm and upload it to the public www.npmjs.org registry just as easily! If you are not sure how a module you installed works, the source code is right there in your project's node_modules folder, waiting to be explored!

Sharing and reusing JavaScript

While you develop web applications, you will always end up writing validations for your UI on both the client and the server: client-side validations are required for a better UI experience, and server-side validations for better security of the app. With two different languages in action, you have the same logic implemented on both the server and the client side. With Node.js, you can share the common functions between server and client, reducing code duplication to a great extent.

Have you ever worked on optimizing the load time of the client-side components of your single-page application (SPA), loaded from template engines such as Underscore? You would end up thinking about a way to share the rendering of templates between server and client at the same time, something some call hybrid templating. Node.js resolves this duplication of client templates better than other server-side technologies, simply because we can use the same JS templating framework and the same templates on both the server and the client.

If you are taking this point lightly, note that the problem it resolves is not just the reuse of validations or templates on server and client. Think about a single-page application being built: you will need to implement subsets of the server-side models in the client-side MV* framework as well. Now think about the templates, models, and controller subsets being shared on both client and server. We are solving a much bigger case of code redundancy.

Not just for building web servers!

Node.js is not just for writing JavaScript on the server side. Yes, we have discussed this point earlier. Node.js sets up the environment for JavaScript code to work anywhere it can be installed. It can be a powerful solution for creating command-line tools, as well as full-featured, locally run applications that have nothing to do with the Web or a browser.
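As a tiny, hypothetical illustration (the file name and logic are made up), a Node script can act as a command-line tool with no web server involved at all:

    // greet.js - a minimal command-line tool
    var name = process.argv[2] || 'world';   // first argument after "node greet.js"
    console.log('Hello, ' + name + '!');

Running node greet.js Alice on a machine with Node.js installed simply prints Hello, Alice! and exits.
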
Grunt.js is a great example of a Node-powered command-line tool that many web developers use daily to automate everyday tasks such as build processes, compiling CoffeeScript, launching Node servers, running tests, and more. In addition to command-line tools, Node is increasingly popular among the hardware crowd with the NodeBots movement. Johnny-Five and Cylon.js are two popular Node libraries that provide a framework for working with robotics; search YouTube for Node robots and you will see a lot of examples. Also, there is a chance that you might be using a text editor developed on Node.js: GitHub's open source editor, Atom, is one such example, and it is hugely popular.

Real-time web with Socket.io

One of the important reasons behind the origin of Node.js was to support real-time web applications. Node.js has a couple of frameworks built for real-time web applications that are hugely popular, namely Socket.IO and SockJS. These frameworks make it quite simple to build instant, collaboration-based applications such as Google Drive and Mozilla's TogetherJS. Before the introduction of WebSockets in modern browsers, real-time behavior was achieved via long polling, which was not a great solution for a real-time experience. While WebSockets is a feature that is only supported in modern browsers, Socket.IO acts as a framework that also features seamless fallback implementations for legacy browsers. If you need to understand more about the use of WebSockets in applications, here is a good resource on MDN that you can explore: https://developer.mozilla.org/en-US/docs/Web/API/WebSockets_API/Writing_WebSocket_client_applications

Networking and file IO

In addition to the powerful nonblocking asynchronous nature of Node, it also has very robust networking and filesystem tools available via its core modules. With Node's networking modules, you can create server and client applications that accept network connections and communicate via streams and pipes.

The origin of io.js

io.js is a fork of Node.js that was created to stay up to date with the latest developments in both V8 and the wider JS community. Joyent was taking care of the Node.js releases, and the process followed for release management lacked an open governance model. This led to scenarios where newer developments in V8, as well as in the JS community, were not incorporated into its releases. For example, if you wanted to write JavaScript using the latest ECMAScript 6 (ES6) features, you had to run Node in harmony mode. Joyent is surely not to be blamed for this, as they were more concerned about the stability of Node.js releases than about frequent updates to the stack. This led to the io.js fork, which is kept up to date with the latest JavaScript and V8 updates. So it's better to keep an eye on the releases of both Node and io.js to stay updated with the Node.js world.

Summary

We discussed the amazing current state of JavaScript and how it can be used to power the full stack of a web application. Not that you needed any convincing in the first place, but I hope you're excited and ready to get started writing web applications using Node.js and MongoDB!

Resources for Article:

Further resources on this subject:

- Introduction and Composition [article]
- Deployment and Maintenance [article]
- Node.js Fundamentals [article]


Writing SOLID JavaScript code with TypeScript

Packt
15 Sep 2015
12 min read
In this article, Remo H. Jansen, author of the book Learning TypeScript, explains that in the early days of software development, developers used to write code with procedural programming languages. In procedural programming languages, programs follow a top-to-bottom approach and the logic is wrapped in functions. New styles of computer programming, such as modular programming and structured programming, emerged when developers realized that procedural computer programs could not provide them with the desired level of abstraction, maintainability, and reusability. The development community created a series of recommended practices and design patterns to improve the level of abstraction and reusability of procedural programming languages, but some of these guidelines required a certain level of expertise. In order to facilitate adherence to these guidelines, a new style of computer programming known as object-oriented programming (OOP) was created.

(For more resources related to this topic, see here.)

Developers quickly noticed some common OOP mistakes and came up with five rules that every OOP developer should follow to create a system that is easy to maintain and extend over time. These five rules are known as the SOLID principles. SOLID is an acronym introduced by Michael Feathers, which stands for the following principles:

- Single responsibility principle (SRP): This principle states that a software component (function, class, or module) should focus on one unique task (have only one responsibility).
- Open/closed principle (OCP): This principle states that software entities should be designed with application growth (new code) in mind (be open to extension), but the application growth should require as few changes to the existing code as possible (be closed for modification).
- Liskov substitution principle (LSP): This principle states that we should be able to replace a class in a program with another class as long as both classes implement the same interface. After replacing the class, no other changes should be required and the program should continue to work as it did originally.
- Interface segregation principle (ISP): This principle states that we should split interfaces that are very large (general-purpose interfaces) into smaller and more specific ones (many client-specific interfaces) so that clients only have to know about the methods that are of interest to them.
- Dependency inversion principle (DIP): This principle states that entities should depend on abstractions (interfaces) as opposed to depending on concretions (classes).

JavaScript does not support interfaces, and most developers find its class support (prototypes) unintuitive. This may lead us to think that writing JavaScript code that adheres to the SOLID principles is not possible. However, with TypeScript we can write truly SOLID JavaScript. In this article we will learn how to write TypeScript code that adheres to the SOLID principles, so that our applications are easy to maintain and extend over time. Let's start by taking a look at interfaces and classes in TypeScript.

Interfaces

The feature that we miss the most when developing large-scale web applications with JavaScript is probably interfaces. Following the SOLID principles can help us to improve the quality of our code, and writing good code is a must when working on a large project.
The problem is that if we attempt to follow the SOLID principles with JavaScript, we will soon realize that without interfaces we will never be able to write truly OOP code that adheres to them. Fortunately for us, TypeScript features interfaces. Wikipedia's definition of interfaces in OOP is:

In object-oriented languages, the term interface is often used to define an abstract type that contains no data or code, but defines behaviors as method signatures.

Implementing an interface can be understood as signing a contract. The interface is a contract, and when we sign it (implement it), we must follow its rules. The interface rules are the signatures of the methods and properties, and we must implement them. Usually, in OOP languages, a class can extend another class and implement one or more interfaces, while an interface can extend one or more interfaces but cannot extend a class. In TypeScript, interfaces don't strictly follow this behavior. The two main differences are that in TypeScript:

- An interface can extend another interface or a class.
- An interface can define data and behavior, as opposed to only behavior.

An interface in TypeScript can be declared using the interface keyword:

    interface IPerson {
        greet(): void;
    }

Classes

Support for classes is another essential feature for writing code that adheres to the SOLID principles. We can create classes in JavaScript using prototypes, but it is not as trivial as it is in other OOP languages such as Java or C#. The ECMAScript 6 (ES6) specification of JavaScript introduces native support for the class keyword, but unfortunately ES6 is not compatible with many old browsers that are still around. However, TypeScript features classes and allows us to use them today, because we can indicate to the compiler which version of JavaScript we would like to target (including ES3, ES5, and ES6). Let's start by declaring a simple class:

    class Person implements IPerson {
        public name : string;
        public surname : string;
        public email : string;
        constructor(name : string, surname : string, email : string) {
            this.email = email;
            this.name = name;
            this.surname = surname;
        }
        greet() {
            alert("Hi!");
        }
    }

    var me : Person = new Person("Remo", "Jansen", "remo.jansen@wolksoftware.com");

We use classes to represent the type of an object or entity. A class is composed of a name, attributes, and methods. The class above is named Person and contains three attributes or properties (name, surname, and email) and two methods (constructor and greet). The class attributes are used to describe the object's characteristics, while the class methods are used to describe its behavior. The class above uses the implements keyword to implement the IPerson interface. All the methods (greet) declared by the IPerson interface must be implemented by the Person class.

A constructor is a special method used by the new keyword to create instances (also known as objects) of our class. We have declared a variable named me, which holds an instance of the class Person. The new keyword uses the Person class's constructor to return an object whose type is Person.

Single Responsibility Principle

This principle states that a software component (usually a class) should have a single responsibility. The Person class above represents a person, including all of its characteristics (attributes) and behaviors (methods).
Now, let's add some e-mail validation logic to showcase the advantages of the SRP:

    class Person {
        public name : string;
        public surname : string;
        public email : string;
        constructor(name : string, surname : string, email : string) {
            this.surname = surname;
            this.name = name;
            if (this.validateEmail(email)) {
                this.email = email;
            } else {
                throw new Error("Invalid email!");
            }
        }
        validateEmail(email : string) {
            var re = /\S+@\S+\.\S+/;
            return re.test(email);
        }
        greet() {
            alert("Hi! I'm " + this.name + ". You can reach me at " + this.email);
        }
    }

When an object doesn't follow the SRP and it knows too much (has too many properties) or does too much (has too many methods), we say that the object is a God object. The preceding Person class is a God object because we have added a method named validateEmail that is not really related to the Person class's behavior.

Deciding which attributes and methods should or should not be part of a class is a relatively subjective decision. If we spend some time analyzing our options, we should be able to find a way to improve the design of our classes. We can refactor the Person class by declaring an Email class, which is responsible for e-mail validation, and using it as an attribute in the Person class:

    class Email {
        public email : string;
        constructor(email : string) {
            if (this.validateEmail(email)) {
                this.email = email;
            } else {
                throw new Error("Invalid email!");
            }
        }
        validateEmail(email : string) {
            var re = /\S+@\S+\.\S+/;
            return re.test(email);
        }
    }

Now that we have an Email class, we can remove the responsibility of validating the e-mails from the Person class and update its email attribute to use the type Email instead of string:

    class Person {
        public name : string;
        public surname : string;
        public email : Email;
        constructor(name : string, surname : string, email : Email) {
            this.email = email;
            this.name = name;
            this.surname = surname;
        }
        greet() {
            alert("Hi!");
        }
    }

Making sure that a class has a single responsibility makes it easier to see what it does and how we can extend or improve it. We can further improve our Person and Email classes by increasing their level of abstraction. For example, when we use the Email class, we don't really need to be aware of the existence of the validateEmail method, so this method could be private or internal (invisible from outside the Email class). As a result, the Email class would be much simpler to understand.

When we increase the level of abstraction of an object, we can say that we are encapsulating that object. Encapsulation is also known as information hiding. For example, the Email class allows us to use e-mails without having to worry about e-mail validation, because the class deals with it for us. We can make this clearer by using access modifiers (public or private) to flag as private all the class attributes and methods that we want to abstract from the usage of the Email class:

    class Email {
        private email : string;
        constructor(email : string) {
            if (this.validateEmail(email)) {
                this.email = email;
            } else {
                throw new Error("Invalid email!");
            }
        }
        private validateEmail(email : string) {
            var re = /\S+@\S+\.\S+/;
            return re.test(email);
        }
        get() : string {
            return this.email;
        }
    }

We can then simply use the Email class without explicitly performing any kind of validation:

    var email = new Email("remo.jansen@wolksoftware.com");

Liskov Substitution Principle

The Liskov Substitution Principle (LSP) states: subtypes must be substitutable for their base types. Let's take a look at an example to understand what this means.
We are going to declare a class whose responsibility is to persist some objects in some kind of storage. We will start by declaring the following interface:

    interface IPersistanceService {
        save(entity : any) : number;
    }

After declaring the IPersistanceService interface, we can implement it. We will use cookies as the storage for the application's data:

    class CookiePersitanceService implements IPersistanceService {
        save(entity : any) : number {
            var id = Math.floor((Math.random() * 100) + 1);
            // Cookie persistance logic...
            return id;
        }
    }

We will continue by declaring a class named FavouritesController, which has a dependency on the IPersistanceService interface:

    class FavouritesController {
        private _persistanceService : IPersistanceService;
        constructor(persistanceService : IPersistanceService) {
            this._persistanceService = persistanceService;
        }
        public saveAsFavourite(articleId : number) {
            return this._persistanceService.save(articleId);
        }
    }

We can finally create an instance of FavouritesController and pass an instance of CookiePersitanceService via its constructor:

    var favController = new FavouritesController(new CookiePersitanceService());

The LSP allows us to replace a dependency with another implementation as long as both implementations are based on the same base type. For example, say we decide to stop using cookies as storage and use the HTML5 local storage API instead, without having to worry about the FavouritesController code being affected by this change:

    class LocalStoragePersitanceService implements IPersistanceService {
        save(entity : any) : number {
            var id = Math.floor((Math.random() * 100) + 1);
            // Local storage persistance logic...
            return id;
        }
    }

We can then replace it without having to make any changes to the FavouritesController class:

    var favController = new FavouritesController(new LocalStoragePersitanceService());

Interface Segregation Principle

In the previous example, our interface was IPersistanceService and it was implemented by the classes LocalStoragePersitanceService and CookiePersitanceService. The interface was consumed by the class FavouritesController, so we say that this class is a client of the IPersistanceService API.

The Interface Segregation Principle (ISP) states that no client should be forced to depend on methods it does not use. To adhere to the ISP, we need to keep in mind that, when we declare the API of our application's components (how two or more software components cooperate and exchange information with each other), declaring many client-specific interfaces is better than declaring one general-purpose interface. Let's take a look at an example. If we are designing an API to control all the elements in a vehicle (engine, radio, heating, navigation, lights, and so on), we could have one general-purpose interface, which allows controlling every single element of the vehicle:

    interface IVehicle {
        getSpeed() : number;
        getVehicleType: string;
        isTaxPayed() : boolean;
        isLightsOn() : boolean;
        isLightsOff() : boolean;
        startEngine() : void;
        acelerate() : number;
        stopEngine() : void;
        startRadio() : void;
        playCd : void;
        stopRadio() : void;
    }

If a class has a dependency on (is a client of) the IVehicle interface but only wants to use the radio methods, we would be facing a violation of the ISP because, as we have already learned, no client should be forced to depend on methods it does not use.
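As a hypothetical sketch of such a client (this class is not part of the book's example), a controller that only needs the radio members would still be coupled to the whole IVehicle API:

    // Depends on the entire IVehicle contract, although it only ever uses the radio methods
    class RadioController {
        private _vehicle : IVehicle;
        constructor(vehicle : IVehicle) {
            this._vehicle = vehicle;
        }
        public turnOn() {
            this._vehicle.startRadio();
        }
        public turnOff() {
            this._vehicle.stopRadio();
        }
    }
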
The solution is to split the IVehicle interface into many client-specific interfaces, so that our class can adhere to the ISP by depending only on IRadio:

    interface IVehicle {
        getSpeed() : number;
        getVehicleType: string;
        isTaxPayed() : boolean;
        isLightsOn() : boolean;
    }

    interface ILights {
        isLightsOn() : boolean;
        isLightsOff() : boolean;
    }

    interface IRadio {
        startRadio() : void;
        playCd : void;
        stopRadio() : void;
    }

    interface IEngine {
        startEngine() : void;
        acelerate() : number;
        stopEngine() : void;
    }

Dependency Inversion Principle

The Dependency Inversion (DI) principle states that we should:

- Depend upon abstractions.
- Not depend upon concretions.

In the previous section, we implemented FavouritesController and we were able to replace one implementation of IPersistanceService with another without having to make any additional change to FavouritesController. This was possible because we followed the DI principle: FavouritesController has a dependency on the IPersistanceService interface (an abstraction) rather than on the LocalStoragePersitanceService or CookiePersitanceService classes (concretions).

The DI principle also allows us to use an inversion of control (IoC) container. An IoC container is a tool used to reduce the coupling between the components of an application. If you want to learn more about IoC, refer to Inversion of Control Containers and the Dependency Injection pattern by Martin Fowler at http://martinfowler.com/articles/injection.html.

Summary

In this article, we looked at classes, interfaces, and the SOLID principles.

Resources for Article:

Further resources on this subject:

- Welcome to JavaScript in the full stack [article]
- Introduction to Spring Web Application in No Time [article]
- Introduction to TypeScript [article]


Hello, Pong!

Packt
15 Sep 2015
19 min read
In this article, written by Alejandro Rodas de Paz and Joseph Howse, authors of the book Python Game Programming By Example, we learn how game development is a highly evolving software development process, and how it has improved continuously since the appearance of the first video games in the 1950s. Nowadays, there is a wide variety of platforms and engines, and this process has been facilitated by the arrival of open source tools.

Python is a free high-level programming language with a design intended for writing readable and concise programs. Thanks to its philosophy, we can create our own games from scratch with just a few lines of code. There are plenty of game frameworks for Python, but for our first game, we will see how we can develop it without any third-party dependency.

We will be covering the following topics:

- Installation of the required software
- Overview of Tkinter, a GUI library included in the Python standard library
- Applying object-oriented programming to encapsulate the logic of our game
- Basic collision and input detection
- Drawing game objects without external assets

(For more resources related to this topic, see here.)

Installing Python

You will need Python 3.4 with Tcl/Tk 8.6 installed on your computer. The latest branch of this version is Python 3.4.3, which can be downloaded from https://www.python.org/downloads/. Here, you can find the official binaries for the most popular platforms, such as Windows and Mac OS. During the installation process, make sure that you check the Tcl/Tk option to include the library.

The code examples included in the book have been tested against Windows 8 and Mac, but can be run on Linux without any modification. Note that some distributions may require you to install the appropriate package for Python 3. For instance, on Ubuntu, you need to install the python3-tk package. Once you have Python installed, you can verify the version by opening Command Prompt or a terminal and executing these lines:

    $ python --version
    Python 3.4.3

After this check, you should be able to start a simple GUI program:

    $ python
    >>> from tkinter import Tk
    >>> root = Tk()
    >>> root.title('Hello, world!')
    >>> root.mainloop()

These statements create a window, change its title, and run indefinitely until the window is closed. Do not close the new window that is displayed when the second statement is executed. Otherwise, it will raise an error because the application has been destroyed. We will use this library in our first game, and the complete documentation of the module can be found at https://docs.python.org/3/library/tkinter.html.

Tkinter and Python 2

The Tkinter module was renamed to tkinter in Python 3. If you have Python 2 installed, simply change the import statement with Tkinter in uppercase, and the program should run as expected.

Overview of Breakout

The Breakout game starts with a paddle and a ball at the bottom of the screen and some rows of bricks at the top. The player must eliminate all the bricks by hitting them with the ball, which rebounds against the borders of the screen, the bricks, and the bottom paddle. As in Pong, the player controls the horizontal movement of the paddle. The player starts the game with three lives, and if she or he misses the ball's rebound and it reaches the bottom border of the screen, one life is lost. The game is over when all the bricks are destroyed, or when the player loses all their lives.
This is a screenshot of the final version of our game:

Basic GUI layout

We will start our game by creating a top-level window, as in the simple program we ran previously. However, this time, we will use two nested widgets: a container frame and the canvas where the game objects will be drawn, as shown here:

With Tkinter, this can easily be achieved using the following code:

    import tkinter as tk

    lives = 3
    root = tk.Tk()
    frame = tk.Frame(root)
    canvas = tk.Canvas(frame, width=600, height=400, bg='#aaaaff')
    frame.pack()
    canvas.pack()
    root.title('Hello, Pong!')
    root.mainloop()

Through the tk alias, we access the classes defined in the tkinter module, such as Tk, Frame, and Canvas. Notice the first argument of each constructor call, which indicates the widget's parent container, and the required pack() calls for displaying the widgets on their parent container. This is not necessary for the Tk instance, since it is the root window.

However, this approach is not exactly object-oriented, since we use global variables and do not define any new class to represent our new data structures. If the code base grows, this can lead to poorly organized projects and highly coupled code. We can start encapsulating the pieces of our game in this way:

    import tkinter as tk

    class Game(tk.Frame):
        def __init__(self, master):
            super(Game, self).__init__(master)
            self.lives = 3
            self.width = 610
            self.height = 400
            self.canvas = tk.Canvas(self, bg='#aaaaff',
                                    width=self.width,
                                    height=self.height)
            self.canvas.pack()
            self.pack()

    if __name__ == '__main__':
        root = tk.Tk()
        root.title('Hello, Pong!')
        game = Game(root)
        game.mainloop()

Our new type, called Game, inherits from the Frame Tkinter class. The class Game(tk.Frame): definition specifies the name of the class and the superclass between parentheses. If you are new to object-oriented programming with Python, this syntax may not sound familiar. In our first look at classes, the most important concepts are the __init__ method and the self variable:

- The __init__ method is a special method that is invoked when a new class instance is created. Here, we set the object attributes, such as the width, the height, and the canvas widget. We also call the parent class initialization with the super(Game, self).__init__(master) statement, so the initial state of the Frame is properly initialized.
- The self variable refers to the object, and it should be the first argument of a method if you want to access the object instance. It is not strictly a language keyword, but the Python convention is to call it self so that other Python programmers won't be confused about the meaning of the variable.

In the preceding snippet, we introduced the if __name__ == '__main__' condition, which is present in many Python scripts. This snippet checks the name of the current module that is being executed, and will prevent starting the main loop if this module was imported from another script. This block is placed at the end of the script, since it requires that the Game class be defined.

New- and old-style classes

You may see the MySuperClass.__init__(self, arguments) syntax in some Python 2 examples, instead of the super call. This is the old-style syntax, the only flavor available up to Python 2.1, and it is maintained in Python 2 for backward compatibility. The super(MyClass, self).__init__(arguments) syntax is the new-style class syntax introduced in Python 2.2. It is the preferred approach, and we will use it throughout this book.
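As a minimal, hypothetical illustration (the class names below are not from the book), the two superclass-call styles look like this side by side:

    class Base(object):
        def __init__(self, master):
            self.master = master

    class OldStyleGame(Base):
        def __init__(self, master):
            Base.__init__(self, master)                 # old-style superclass call

    class NewStyleGame(Base):
        def __init__(self, master):
            super(NewStyleGame, self).__init__(master)  # new-style call used in this book
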
Since no external assets are needed, you can place the set of code files given along with the book (Chapter1_01.py) in any directory and execute the file from the python command line. The main loop will run indefinitely until you click on the close button of the window, or until you kill the process from the command line. This is the starting point of our game, so let's start diving into the Canvas widget and see how we can draw and animate items in it.

Diving into the Canvas widget

So far, we have the window set up and now we can start drawing items on the canvas. The canvas widget is two-dimensional and uses the Cartesian coordinate system. The origin—the (0, 0) ordered pair—is placed at the top-left corner, and the axes can be represented as shown in the following screenshot:

Keeping this layout in mind, we can use two methods of the Canvas widget to draw the paddle, the bricks, and the ball:

    canvas.create_rectangle(x0, y0, x1, y1, **options)
    canvas.create_oval(x0, y0, x1, y1, **options)

Each of these calls returns an integer, which identifies the item handle. This reference will be used later to manipulate the position of the item and its options. The **options syntax represents key/value pairs of additional arguments that can be passed to the method call. In our case, we will use the fill and the tags options.

The x0 and y0 coordinates indicate the top-left corner, and x1 and y1 indicate the bottom-right corner. For instance, we can call canvas.create_rectangle(250, 300, 330, 320, fill='blue', tags='paddle') to create a player's paddle, where:

- The top-left corner is at the coordinates (250, 300).
- The bottom-right corner is at the coordinates (330, 320).
- fill='blue' means that the background color of the item is blue.
- tags='paddle' means that the item is tagged as a paddle. This string will be useful later to find items in the canvas with specific tags.

We will invoke other Canvas methods to manipulate the items and retrieve widget information. The following are the Canvas methods that will be used here:

- canvas.coords(item): Returns the coordinates of the bounding box of an item.
- canvas.move(item, x, y): Moves an item by a horizontal and a vertical offset.
- canvas.delete(item): Deletes an item from the canvas.
- canvas.winfo_width(): Retrieves the canvas width.
- canvas.itemconfig(item, **options): Changes the options of an item, such as the fill color or its tags.
- canvas.bind(event, callback): Binds an input event with the execution of a function. The callback handler receives one parameter of the type Tkinter event.
- canvas.unbind(event): Unbinds the input event so that there is no callback function executed when the event occurs.
- canvas.create_text(*position, **opts): Draws text on the canvas. The position and the options arguments are similar to the ones passed in canvas.create_rectangle and canvas.create_oval.
- canvas.find_withtag(tag): Returns the items with a specific tag.
- canvas.find_overlapping(*position): Returns the items that overlap or are completely enclosed by a given rectangle.

You can check out a complete reference of the event syntax as well as some practical examples at http://effbot.org/tkinterbook/tkinter-events-and-bindings.htm#events.
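As a small, hypothetical illustration of the binding methods listed above (the handler below is not part of the game), this snippet prints the coordinates of every left mouse click on the canvas:

    # Assumes the `canvas` widget created in the earlier snippets
    def on_click(event):
        # event.x and event.y hold the click position in canvas coordinates
        print('Clicked at', event.x, event.y)

    canvas.bind('<Button-1>', on_click)   # <Button-1> is the left mouse button
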
Basic game objects

Before we start drawing all our game items, let's define a base class with the functionality that they will have in common—storing a reference to the canvas and its underlying canvas item, getting information about its position, and deleting the item from the canvas:

    class GameObject(object):
        def __init__(self, canvas, item):
            self.canvas = canvas
            self.item = item

        def get_position(self):
            return self.canvas.coords(self.item)

        def move(self, x, y):
            self.canvas.move(self.item, x, y)

        def delete(self):
            self.canvas.delete(self.item)

Assuming that we have created a canvas widget as shown in our previous code samples, a basic usage of this class and its attributes would be like this:

    item = canvas.create_rectangle(10, 10, 100, 80, fill='green')
    game_object = GameObject(canvas, item)  # create new instance
    print(game_object.get_position())       # [10, 10, 100, 80]
    game_object.move(20, -10)
    print(game_object.get_position())       # [30, 0, 120, 70]
    game_object.delete()

In this example, we created a green rectangle and a GameObject instance with the resulting item. Then we retrieved the position of the item within the canvas, moved it, and calculated the position again. Finally, we deleted the underlying item. The methods that the GameObject class offers will be reused in the subclasses that we will see later, so this abstraction avoids unnecessary code duplication. Now that you have learned how to work with this basic class, we can define separate child classes for the ball, the paddle, and the bricks.

The Ball class

The Ball class will store information about the speed, direction, and radius of the ball. We will simplify the ball's movement, since the direction vector will always be one of the following:

- [1, 1] if the ball is moving towards the bottom-right corner
- [-1, -1] if the ball is moving towards the top-left corner
- [1, -1] if the ball is moving towards the top-right corner
- [-1, 1] if the ball is moving towards the bottom-left corner

(Figure: representation of the possible direction vectors.)

Therefore, by changing the sign of one of the vector components, we will change the ball's direction by 90 degrees. This will happen when the ball bounces against the canvas border, or when it hits a brick or the player's paddle:

    class Ball(GameObject):
        def __init__(self, canvas, x, y):
            self.radius = 10
            self.direction = [1, -1]
            self.speed = 10
            item = canvas.create_oval(x-self.radius, y-self.radius,
                                      x+self.radius, y+self.radius,
                                      fill='white')
            super(Ball, self).__init__(canvas, item)

For now, the object initialization is enough to understand the attributes that the class has. We will cover the ball rebound logic later, when the other game objects are defined and placed in the game canvas.

The Paddle class

The Paddle class represents the player's paddle and has two attributes to store the width and height of the paddle.
A set_ball method will be used to store a reference to the ball, which can be moved with the paddle before the game starts:

    class Paddle(GameObject):
        def __init__(self, canvas, x, y):
            self.width = 80
            self.height = 10
            self.ball = None
            item = canvas.create_rectangle(x - self.width / 2,
                                           y - self.height / 2,
                                           x + self.width / 2,
                                           y + self.height / 2,
                                           fill='blue')
            super(Paddle, self).__init__(canvas, item)

        def set_ball(self, ball):
            self.ball = ball

        def move(self, offset):
            coords = self.get_position()
            width = self.canvas.winfo_width()
            if coords[0] + offset >= 0 and coords[2] + offset <= width:
                super(Paddle, self).move(offset, 0)
                if self.ball is not None:
                    self.ball.move(offset, 0)

The move method is responsible for the horizontal movement of the paddle. Step by step, the following is the logic behind this method:

- The self.get_position() call calculates the current coordinates of the paddle.
- The self.canvas.winfo_width() call retrieves the canvas width.
- If both the minimum and maximum x-axis coordinates, plus the offset produced by the movement, are inside the boundaries of the canvas, this is what happens:
  - The super(Paddle, self).move(offset, 0) call invokes the method with the same name in the Paddle class's parent class, which moves the underlying canvas item.
  - If the paddle still has a reference to the ball (this happens when the game has not been started), the ball is moved as well.

This method will be bound to the input keys so that the player can use them to control the paddle's movement. We will see later how we can use Tkinter to process the input key events. For now, let's move on to the implementation of the last one of our game's components.

The Brick class

Each brick in our game will be an instance of the Brick class. This class contains the logic that is executed when the bricks are hit and destroyed:

    class Brick(GameObject):
        COLORS = {1: '#999999', 2: '#555555', 3: '#222222'}

        def __init__(self, canvas, x, y, hits):
            self.width = 75
            self.height = 20
            self.hits = hits
            color = Brick.COLORS[hits]
            item = canvas.create_rectangle(x - self.width / 2,
                                           y - self.height / 2,
                                           x + self.width / 2,
                                           y + self.height / 2,
                                           fill=color, tags='brick')
            super(Brick, self).__init__(canvas, item)

        def hit(self):
            self.hits -= 1
            if self.hits == 0:
                self.delete()
            else:
                self.canvas.itemconfig(self.item,
                                       fill=Brick.COLORS[self.hits])

As you may have noticed, the __init__ method is very similar to the one in the Paddle class, since it draws a rectangle and stores the width and the height of the shape. In this case, the value of the tags option passed as a keyword argument is 'brick'. With this tag, we can check whether the game is over when the number of remaining items with this tag is zero.

Another difference from the Paddle class is the hit method and the attributes it uses. The class variable called COLORS is a dictionary (a data structure that contains key/value pairs) whose keys are the number of hits that the brick has left, and whose values are the corresponding colors. When a brick is hit, the method execution occurs as follows:

- The number of hits of the brick instance is decreased by 1.
- If the number of hits remaining is 0, self.delete() deletes the brick from the canvas.
- Otherwise, self.canvas.itemconfig() changes the color of the brick.

For instance, if we call this method for a brick with two hits left, we decrease the counter by 1 and the new color will be #999999, which is the value of Brick.COLORS[1]. If the same brick is hit again, the number of remaining hits will become zero and the item will be deleted.
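As a quick, hypothetical usage sketch (assuming a canvas such as the one created in the Game class), the hit logic behaves like this:

    brick = Brick(canvas, 100, 50, 2)   # two hits left, drawn with Brick.COLORS[2]
    brick.hit()                         # one hit left; the fill changes to Brick.COLORS[1]
    brick.hit()                         # zero hits left; the canvas item is deleted
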
Adding the Breakout items

Now that the organization of our items is separated into these top-level classes, we can extend the __init__ method of our Game class:

    class Game(tk.Frame):
        def __init__(self, master):
            super(Game, self).__init__(master)
            self.lives = 3
            self.width = 610
            self.height = 400
            self.canvas = tk.Canvas(self, bg='#aaaaff',
                                    width=self.width,
                                    height=self.height)
            self.canvas.pack()
            self.pack()

            self.items = {}
            self.ball = None
            self.paddle = Paddle(self.canvas, self.width/2, 326)
            self.items[self.paddle.item] = self.paddle
            for x in range(5, self.width - 5, 75):
                self.add_brick(x + 37.5, 50, 2)
                self.add_brick(x + 37.5, 70, 1)
                self.add_brick(x + 37.5, 90, 1)

            self.hud = None
            self.setup_game()
            self.canvas.focus_set()
            self.canvas.bind('<Left>', lambda _: self.paddle.move(-10))
            self.canvas.bind('<Right>', lambda _: self.paddle.move(10))

        def setup_game(self):
            self.add_ball()
            self.update_lives_text()
            self.text = self.draw_text(300, 200, 'Press Space to start')
            self.canvas.bind('<space>', lambda _: self.start_game())

This initialization is more complex than what we had at the beginning of the article. We can divide it into two sections:

Game object instantiation, and their insertion into the self.items dictionary. This attribute contains all the canvas items that can collide with the ball, so we add only the bricks and the player's paddle to it. The keys are the references to the canvas items, and the values are the corresponding game objects. We will use this attribute later in the collision check, when we have the colliding canvas items and need to fetch the corresponding game objects.
Key input binding, via the Canvas widget. The canvas.focus_set() call sets the focus on the canvas, so the input events are directly bound to this widget. Then we bind the left and right keys to the paddle's move() method and the spacebar to trigger the game start.

Thanks to the lambda construct, we can define anonymous functions as event handlers. Since the callback argument of the bind method is a function that receives a Tkinter event as an argument, we define a lambda that ignores the first parameter—lambda _: <expression>.

Our new add_ball and add_brick methods are used to create game objects and perform a basic initialization. While the first one creates a new ball on top of the player's paddle, the second one is a shorthand way of adding a Brick instance:

        def add_ball(self):
            if self.ball is not None:
                self.ball.delete()
            paddle_coords = self.paddle.get_position()
            x = (paddle_coords[0] + paddle_coords[2]) * 0.5
            self.ball = Ball(self.canvas, x, 310)
            self.paddle.set_ball(self.ball)

        def add_brick(self, x, y, hits):
            brick = Brick(self.canvas, x, y, hits)
            self.items[brick.item] = brick

The draw_text method will be used to display text messages in the canvas. The underlying item created with canvas.create_text() is returned, and it can be used to modify the information:

        def draw_text(self, x, y, text, size='40'):
            font = ('Helvetica', size)
            return self.canvas.create_text(x, y, text=text, font=font)

The update_lives_text method displays the number of lives left and changes its text if the message is already displayed.
It is called when the game is initialized—this is when the text is drawn for the first time—and it is also invoked when the player misses a ball rebound:

        def update_lives_text(self):
            text = 'Lives: %s' % self.lives
            if self.hud is None:
                self.hud = self.draw_text(50, 20, text, 15)
            else:
                self.canvas.itemconfig(self.hud, text=text)

We leave start_game unimplemented for now, since it triggers the game loop, and this logic will be added in the next section. Since Python requires a code block for each method, we use the pass statement. This does not execute any operation, and it can be used as a placeholder when a statement is required syntactically:

        def start_game(self):
            pass

If you execute this script, it will display a Tkinter window like the one shown in the following figure. At this point, we can move the paddle horizontally, so we are ready to start the game and hit some bricks!

Summary

We covered the basics of the control flow and the class syntax. We used Tkinter widgets, especially the Canvas widget and its methods, to achieve the functionality needed to develop a game based on collisions and simple input detection. Our Breakout game can be customized as we want. Feel free to change the color defaults, the speed of the ball, or the number of rows of bricks. However, GUI libraries are very limited, and more complex frameworks are required to achieve a wider range of capabilities.

Resources for Article:

Further resources on this subject:
Introspecting Maya, Python, and PyMEL [article]
Understanding the Python regex engine [article]
Ten IPython essentials [article]
Analyzing Financial Data in QlikView
Packt
15 Sep 2015
8 min read
In this article by Diane Blackwood, author of the book QlikView for Finance, the author talks about how QlikView is an easy-to-use business intelligence product designed to facilitate ad hoc relationship analysis. However, it can also be used in formal corporate performance applications by a financial user. It is designed to use a methodology of direct discovery to analyze data from multiple sources. QlikView is designed to allow you to do your own business discovery, take you out of the data management stage and into the data relationship investigation stage. Investigating relationships and outliers in financial data more can lead to effective management. (For more resources related to this topic, see here.) You could use QlikView when you wish to analyze and quickly see trends and exceptions that — with normal financial application-oriented BI products—would not be readily apparent without days of consultant and technology department setup. With QlikView, you can also analyze data relationships that are not measured in monetary units. Certainly, QlikView can be used to analyze sales trends and stock performance, but other relationships soon become apparent when you start using QlikView. Also, with the free downloadable personal edition of QlikView, you can start analyzing your own data right away. QlikView consists of two parts: The sheet: This can contain sheet objects, such as charts or list boxes, which show clickable information. The load script: This stores information about the data and the data sources that the data is coming from. Financial professionals are always using Excel to examine their data, and we can load data from an Excel sheet into QlikView. This can also help you to create a basic document sheet containing a chart. The newest version of QlikView comes with a sample Sales Order data that can be used to investigate and create sheet objects. In order to use data from other file types, you can use the File Wizard (Type) that you start from the Edit Script dialog by clicking on the Table Files button. Using the Edit Script dialog, you can view your data script and edit it in the script and add other data sources. You can also reload your data by clicking on the Reload button. If you just want to analyze data from an existing QlikView file and analyze the information in it, you do not need to work with the script at all. We will use some sample financial data that was downloaded from an ERP system to Excel in order to demonstrate how an analysis might work. Our QlikView Financial Analysis of Cheyenne Company will appear as follows: Figure 1: Our Financial Analysis QlikView Application When we create objects for analysis purposes in QlikView, the drop-down menu shows that there are multiple sheet object types to choose from, such as List Box, Statistics Box, Chart, Input Box, Current Selections Box, MultiBox, Table Box, Button, Text Object, Line/Arrow Object, Slider/Calendar Object, and Bookmark Object. In our example, we chose the Statistic Box Sheet object to add the grand total to our analysis. From this, we can see that the total company is out of balance by $1.59. From an auditor’s point of view, this amount is probably small enough to be immaterial, but, from our point of view as financial professionals, we want to know where our books are falling out of balance. To make our investigation easier, we should add one additional sheet object: a List Box for Company. This is done by right-clicking on the context menu and selecting New Sheet object and then List Box. 
Figure 2: Added Company List Box We can now see that we are actually out of balance in three companies. Cheyenne Co. L.P. is a company out by $1.59, but Cheyenne Holding and Cheyenne National Inc. seem to have balancing entries that balance at the total companies’ level, but these companies don’t balance at the individual company level. We can analyze our data using the list boxes just by selecting a Company and viewing the Account Groups and Cost Centers that are included (white) and excluded (gray). This is the standard color scheme usage of QlikView. Our selected company is shown in green and in the Current Selection Box. By selecting Cheyenne Holding, we would be able to verify that it is indeed a holding company, does not have any manufacturing or sales accounting groups, or cost centers. Alternatively, if we choose Provo, we can see that it is in balance. To load more than one spreadsheet or load from a different data source, we must edit load script. From the Edit Script interface, we can modify and execute a script that connects the QlikView document to an ODBC data source or to data files of different type and grab the data source information as well. Our first script was generated automatically, but scripts can be typed manually, or automatically generated scripts can be modified. Complex script statements must, at least partially, be entered manually. The Edit Script dialog uses autocomplete, so when typing, the program tries to predict what is wanted in the script without having to type it completely. The predictions include words that are part of the script syntax. The script is also color coded by syntax components. The Edit Script interface and behavior may be customized to your preferences by selecting Tools and Editor Preferences. A menu bar is found at the top of the Edit Script dialog with various script-related commands. The most frequently used commands also appear in the toolbar. In the toolbar, there is also a drop-down list for the tabs of the Edit Script wizard. The first script in the Edit Script interface is the automatically generated one that was created by the wizard when we started the QlikView file. The automatically generated script picks up the column names from the Excel file and puts in some default formatting scripting. The language selection that we made during the initial installation of QlikView determines the defaults assigned to this portion of the script. We can add data from multiple sources, such as ODBC links, additional Excel files, sources from the Web, FTP, and even other QlikView files. Our first Excel file, which we used to create the initial QlikView document, is already in our script. It happened to be October 2013 data, but suppose we wanted to add another month such as November data to our analysis? We would just navigate to the Edit Script interface from the File menu and then click on the script itself. Make sure that our cursor is at the bottom of the script after the first Excel file path and description. If you do not position your cursor where you want your additional script information to populate, you may generate your new script code in the middle of your existing script code. If you make a mistake, click on CANCEL and start over. After navigating to the script location where you want to add your new code, click on the Table Files button after the script and towards the center right first button in the column. Click on NEXT through the next four screens unless you need to add column labels. 
Comments can be added to scripts using // for a single line or by surrounding the comment by a beginning /* and an ending */ and comments show up as green. After clicking on the OK button to get out of Script Editor, there is another File menu item that can be used to verify that QlikView has correctly interpreted the joins. This is the Table Viewer menu item. You cannot edit in the Table view, but it is convenient to visualize how the table fields are interacting. Save the changes to the script by clicking on the OK button in the lower-right corner. Now, with the File menu, navigate to Edit Script and then to the Reload menu item and click on it to reload your data; otherwise, your new month of data will not be loaded. If you receive any error messages, the solutions can be researched in QlikView Help. In this case, the column headers were the same, so QlikView knew to add the data from the two spreadsheets together into one table. However, because of this, if we look at our Company List Box and Amount Statistics Box, we see everything added together. Figure 3: Data Doubled after Reload with Additional File The reason this data is doubled is that we do not have any way to split the months or only select October or November. Now that we have more than one month of data, we can add another List Box with the months. This will automatically link up to our Chart and Straight Table Sheet objects to separate our monthly data. Once added, from our new List Box, we can select OCTOBER or NOVEMBER, and our sheet object automatically shows the correct sum of the individual months. We can then use the List Box and linked objects to further analyze our financial data. Summary You can find further find books on QlikView published by Packt on the Packt website http://www.packtpub.com. Some of them are listed as follows: Learning QlikView Data Visualization by Karl Pover Predictive Analytics using Rattle and QlikView by Ferran Garcia Pagans Resources for Article: Further resources on this subject: Common QlikView script errors [article] Securing QlikView Documents [article] Conozca QlikView [article]
Smart Features to Improve Your Efficiency
Packt
15 Sep 2015
11 min read
In this article by Denis Patin and Stefan Rosca authors of the book WebStorm Essentials, we are going to deal with a number of really smart features that will enable you to fundamentally change your approach to web development and learn how to gain maximum benefit from WebStorm. We are going to study the following in this article: On-the-fly code analysis Smart code features Multiselect feature Refactoring facility (For more resources related to this topic, see here.) On-the-fly code analysis WebStorm will preform static code analysis on your code on the fly. The editor will check the code based on the language used and the rules you specify and highlight warnings and errors as you type. This is a very powerful feature that means you don't need to have an external linter and will catch most errors quickly thus making a dynamic and complex language like JavaScript more predictable and easy to use. Runtime error and any other error, such as syntax or performance, are two things. To investigate the first one, you need tests or a debugger, and it is obvious that they have almost nothing in common with the IDE itself (although, when these facilities are integrated into the IDE, such a synergy is better, but that is not it). You can also examine the second type of errors the same way but is it convenient? Just imagine that you need to run tests after writing the next line of code. It is no go! Won't it be more efficient and helpful to use something that keeps an eye on and analyzes each word being typed in order to notify about probable performance issues and bugs, code style and workflow issues, various validation issues, warn of dead code and other likely execution issues before executing the code, to say nothing of reporting inadvertent misprints. WebStorm is the best fit for it. It performs a deep-level analysis of each line, each word in the code. Moreover, you needn't break off your developing process when WebStorm scans your code; it is performed on the fly and thus so called: WebStorm also enables you to get a full inspection report on demand. For getting it, go to the menu: Code | Inspect Code. It pops up the Specify Inspection Scope dialog where you can define what exactly you would like to inspect, and click OK. Depending on what is selected and of what size, you need to wait a little for the process to finish, and you will see the detailed results where the Terminal window is located: You can expand all the items, if needed. To the right of this inspection result list you can see an explanation window. To jump to the erroneous code lines, you can simply click on the necessary item, and you will flip into the corresponding line. Besides simple indicating where some issue is located, WebStorm also unequivocally suggests the ways to eliminate this issue. And you even needn't make any changes yourself—WebStorm already has quick solutions, which you need just to click on, and they will be instantly inserted into the code: Smart code features Being an Integrated Development Environment (IDE) and tending to be intelligent, WebStorm provides a really powerful pack of features by using which you can strongly improve your efficiency and save a lot of time. One of the most useful and hot features is code completion. WebStorm continually analyzes and processes the code of the whole project, and smartly suggests the pieces of code appropriate in the current context, and even more—alongside the method names you can find the usage of these methods. 
Of course, code completion itself is not a fresh innovation; however WebStorm performs it in a much smarter way than other IDEs do. WebStorm can auto-complete a lot things: Class and function names, keywords and parameters, types and properties, punctuation, and even file paths. By default, the code completion facility is on. To invoke it, simply start typing some code. For example, in the following image you can see how WebStorm suggests object methods: You can navigate through the list of suggestions using your mouse or the Up and Down arrow keys. However, the list can be very long, which makes it not very convenient to browse. To reduce it and retain only the things appropriate in the current context, keep on typing the next letters. Besides typing only initial consecutive letter of the method, you can either type something from the middle of the method name, or even use the CamelCase style, which is usually the quickest way of typing really long method names: It may turn out for some reason that the code completion isn't working automatically. To manually invoke it, press Control + Space on Mac or Ctrl + Space on Windows. To insert the suggested method, press Enter; to replace the string next to the current cursor position with the suggested method, press Tab. If you want the facility to also arrange correct syntactic surroundings for the method, press Shift + ⌘ + Enter on Mac or Ctrl + Shift + Enter on Windows, and missing brackets or/and new lines will be inserted, up to the styling standards of the current language of the code. Multiselect feature With the multiple selection (or simply multiselect) feature, you can place the cursor in several locations simultaneously, and when you will type the code it will be applied at all these positions. For example, you need to add different background colors for each table cell, and then make them of twenty-pixel width. In this case, what you need to not perform these identical tasks repeatedly and save a lot of time, is to place the cursor after the <td> tag, press Alt, and put the cursor in each <td> tag, which you are going to apply styling to: Now you can start typing the necessary attribute—it is bgcolor. Note that WebStorm performs smart code completion here too, independently of you typing something on a single line or not. You get empty values for bgcolor attributes, and you fill them out individually a bit later. You need also to change the width so you can continue typing. As cell widths are arranged to be fixed-sized, simply add the value for width attributes as well. What you get in the following image: Moreover, the multiselect feature can select identical values or just words independently, that is, you needn't place the cursor in multiple locations. Let us watch this feature by another example. Say, you changed your mind and decided to colorize not backgrounds but borders of several consecutive cells. You may instantly think of using a simple replace feature but you needn't replace all attribute occurrences, only several consecutive ones. For doing this, you can place the cursor on the first attribute, which you are going to perform changes from, and click Ctrl + G on Mac or Alt + J on Windows as many times as you need. One by one the same attributes will be selected, and you can replace the bgcolor attribute for the bordercolor one: You can also select all occurrences of any word by clicking Ctrl + command + G on Mac or Ctrl + Alt + Shift + J. 
To get out of the multiselect mode you have to click in a different position or use the Esc key. Refactoring facility Throughout the development process, it is almost unavoidable that you have to use refactoring. Also, the bigger code base you have, the more difficult it becomes to control the code, and when you need to refactor some code, you can most likely be up against some issues relating to, examples. naming omission or not taking into consideration function usage. You learned that WebStorm performs a thorough code analysis so it understands what is connected with what and if some changes occur it collates them and decide what is acceptable and what is not to perform in the rest of the code. Let us try a simple example. In a big HTML file you have the following line: <input id="search" type="search" placeholder="search" /> And in a big JavaScript file you have another one: var search = document.getElementById('search'); You decided to rename the id attribute's value of the input element to search_field because it is less confusing. You could simply rename it here but after that you would have to manually find all the occurrences of the word search in the code. It is evident that the word is rather frequent so you would spend a lot of time recognizing usage cases appropriate in the current context or not. And there is a high probability that you forget something important, and even more time will be spent on investigating an issue. Instead, you can entrust WebStorm with this task. Select the code unit to refactor (in our case, it is the search value of the id attribute), and click Shift + T on Mac or Ctrl + Alt + Shift + T on Windows (or simply click the Refactor menu item) to call the Refactor This dialog. There, choose the Rename… item and enter the new name for the selected code unit (search_field in our case). To get only a preview of what will happen during the refactoring process, click the Preview button, and all the changes to apply will be displayed in the bottom. You can walk through the hierarchical tree and either apply the change by clicking the Do Refactor button, or not. If you need a preview, you can simply click the Refactor button. What you will see is that the id attribute got the search_field value, not the type or placeholder values, even if they have the same value, and in the JavaScript file you got getElementById('search_field'). Note that even though WebStorm can perform various smart tasks, it still remains a program, and there can occur some issues caused by so-called artificial intelligence imperfection, so you should always be careful when performing the refactoring. In particular, manually check the var declarations because WebStorm sometimes can apply the changes to them as well but it is not always necessary because of the scope. Of course, it is just a little of what you are enabled to perform with refactoring. The basic things that the refactoring facility allows you to do are as follows: The elements in the preceding screenshot are explained as follows: Rename…: You have already got familiar with this refactoring. Once again, with it you can rename code units, and WebStorm automatically will fix all references of them in the code. The shortcut is Shift + F6. Change Signature…: This feature is used basically for changing function names, and adding/removing, reordering, or renaming function parameters, that is, changing the function signature. The shortcut is ⌘ + F6 for Mac and Ctrl + F6 for Windows. 
Move…: This feature enables you to move files or directories within a project, and it simultaneously repairs all references to these project elements in the code so you needn't manually repair them. The shortcut is F6. Copy…: With this feature, you can copy a file or directory or even a class, with its structure, from one place to another. The shortcut is F5. Safe Delete…: This feature is really helpful. It allows you to safely delete any code or entire files from the project. When performing this refactoring, you will be asked about whether it is needed to inspect comments and strings or all text files for the occurrence of the required piece of code or not. The shortcut is ⌘ + delete for Mac and Alt + Delete for Windows. Variable…: This refactoring feature declares a new variable whereto the result of the selected statement or expression is put. It can be useful when you realize there are too many occurrences of a certain expression so it can be turned into a variable, and the expression can just initialize it. The shortcut is Alt +⌘ + V for Mac and Ctrl + Alt + V for Windows. Parameter…: When you need to add a new parameter to some method and appropriately update its calls, use this feature. The shortcut is Alt + ⌘ + P for Mac and Ctrl + Alt + P for Windows. Method…: During this refactoring, the code block you selected undergoes analysis, through which the input and output variables get detected, and the extracted function receives the output variable as a return value. The shortcut is Alt + ⌘ + M for Mac and Ctrl + Alt + M for Windows. Inline…: The inline refactoring is working contrariwise to the extract method refactoring—it replaces surplus variables with their initializers making the code more compact and concise. The shortcut is Alt + ⌘ + N for Mac and Ctrl + Alt + N for Windows. Summary In this article, you have learned about the most distinctive features of WebStorm, which are the core constituents of improving your efficiency in building web applications. Resources for Article: Further resources on this subject: Introduction to Spring Web Application in No Time [article] Applications of WebRTC [article] Creating Java EE Applications [article]
Slideshow Presentations
Packt
15 Sep 2015
24 min read
 In this article by David Mitchell, author of the book Dart By Example you will be introduced to the basics of how to build a presentation application using Dart. It usually takes me more than three weeks to prepare a good impromptu speech. Mark Twain Presentations make some people shudder with fear, yet they are an undeniably useful tool for information sharing when used properly. The content has to be great and some visual flourish can make it stand out from the crowd. Too many slides can make the most receptive audience yawn, so focusing the presenter on the content and automatically taking care of the visuals (saving the creator from fiddling with different animations and fonts sizes!) can help improve presentations. Compelling content still requires the human touch. (For more resources related to this topic, see here.) Building a presentation application Web browsers are already a type of multimedia presentation application so it is feasible to write a quality presentation program as we explore more of the Dart language. Hopefully it will help us pitch another Dart application to our next customer. Building on our first application, we will use a text based editor for creating the presentation content. I was very surprised how much faster a text based editor is for producing a presentation, and more enjoyable. I hope you experience such a productivity boost! Laying out the application The application will have two modes, editing and presentation. In the editing mode, the screen will be split into two panes. The top pane will display the slides and the lower will contain the editor, and other interface elements. This article will focus on the core creation side of the presentation. The application will be a single Dart project. Defining the presentation format The presentations will be written in a tiny subset of the Markdown format which is a powerful yet simple to read text file based format (much easier to read, type and understand than HTML). In 2004, John Gruber and the late Aaron Swartz created the Markdown language in 2004 with the goal of enabling people to write using an easy-to-read, easy-to-write plain text format. It is used on major websites, such as GitHub.com and StackOverflow.com. Being plain text, Markdown files can be kept and compared in version control. For more detail and background on Markdown see https://en.wikipedia.org/wiki/Markdown A simple titled slide with bullet points would be defined as: #Dart Language +Created By Google +Modern language with a familiar syntax +Structured Web Applications +It is Awesomely productive! I am positive you only had to read that once! This will translate into the following HTML. <h1>Dart Language</h1> <li>Created By Google</li>s <li>Modern language with a familiar syntax</li> <li>Structured Web Applications</li> <li>It is Awesomely productive!</li> Markdown is very easy and fast to parse, which probably explains its growing popularity on the web. It can be transformed into many other formats. Parsing the presentation The content of the TextAreaHtml element is split into a list of individual lines, and processed in a similar manner to some of the features in the Text Editor application using forEach to iterate over the list. Any lines that are blank once any whitespace has been removed via the trim method are ignored. #A New Slide Title +The first bullet point +The second bullet point #The Second Slide Title +More bullet points !http://localhost/img/logo.png #Final Slide +Any questions? 
For each line starting with a # symbol, a new Slide object is created. For each line starting with a + symbol, they are added to this slides bullet point list. For each line is discovered using a ! symbol the slide's image is set (a limit of one per slide). This continues until the end of the presentation source is reached. A sample presentation To get a new user going quickly, there will be an example presentation which can be used as a demonstration and testing the various areas of the application. I chose the last topic that came up round the family dinner table—the coconut! #Coconut +Member of Arecaceae family. +A drupe - not a nut. +Part of daily diets. #Tree +Fibrous root system. +Mostly surface level. +A few deep roots for stability. #Yield +75 fruits on fertile land +30 typically +Fibre has traditional uses #Finally !coconut.png #Any Questions? Presenter project structures The project is a standard Dart web application with index.html as the entry point. The application is kicked off by main.dart which is linked to in index.html, and the application functionality is stored in the lib folder. Source File Description sampleshows.dart    The text for the slideshow application.  lifecyclemixin.dart  The class for the mixin.  slideshow.dart  Data structures for storing the presentation.  slideshowapp.dart  The application object. Launching the application The main function has a very short implementation. void main() { new SlideShowApp(); } Note that the new class instance does not need to be stored in a variable and that the object does not disappear after that line is executed. As we will see later, the object will attach itself to events and streams, keeping the object alive for the lifetime that the page is loaded. Building bullet point slides The presentation is build up using two classes—Slide and SlideShow. The Slide object creates the DivElement used to display the content and the SlideShow contains a list of Slide objects. The SlideShow object is updated as the text source is updated. It also keeps track of which slide is currently being displayed in the preview pane. Once the number of Dart files grows in a project, the DartAnalyzer will recommend naming the library. It is good habit to name every .dart file in a regular project with its own library name. The slideshow.dart file has the keyword library and a name next to it. In Dart, every file is a library, whether it is explicitly declared or not. If you are looking at Dart code online you may stumble across projects with imports that look a bit strange. #import("dart:html"); This is the old syntax for Dart's import mechanism. If you see this it is a sign that other aspects of the code may be out of date too. If you are writing an application in a single project, source files can be arranged in a folder structure appropriate for the project, though keeping the relatives paths manageable is advisable. Creating too many folders is probably means it is time to create a package! Accessing private fields In Dart, as discussed when we covered packages, the privacy is at the library level but it is still possible to have private fields in a class even though Dart does not have the keywords public, protected, and private. A simple return of a private field's value can be performed with a one line function. String getFirstName() => _name; To retrieve this value, a function call is required, for example, Person.getFirstName() however it may be preferred to have a property syntax such as Person.firstName. 
Having private fields and retaining the property syntax in this manner, is possible using the get and set keywords. Using true getters and setters The syntax of Dart also supports get and set via keywords: int get score =>score + bonus; set score(int increase) =>score += increase * level; Using either get/set or simple fields is down to preference. It is perfectly possible to start with simple fields and scale up to getters and setters if more validation or processing is required. The advantage of the get and set keywords in a library, is the intended interface for consumers of the package is very clear. Further it clarifies which methods may change the state of the object and which merely report current values. Mixin it up In object oriented languages, it is useful to build on one class to create a more specialized related class. For example, in the text editor the base dialog class was extended to create alert and confirm pop ups. What if we want to share some functionality but do not want inheritance occurring between the classes? Aggregation can solve this problem to some extent: class A{ classb usefulObject; } The downside is that this requires a longer reference to use: new A().usefulObject.handyMethod(); This problem has been solved in Dart (and other languages) by a mixin class to do this job, allowing the sharing of functionality without forced inheritance or clunky aggregation. In Dart, a mixin must meet the requirements: No constructors in the class declaration. The base class of the mixin must be Object. No calls to a super class are made. mixins are really just classes that are malleable enough to fit into the class hierarchy at any point. A use case for a mixin may be serialization fields and methods that may be required on several classes in an application that are not part of any inheritance chain. abstract class Serialisation { void save() { //Implementation here. } void load(String filename) { //Implementation here. } } The with keyword is used to declare that a class is using a mixin. class ImageRecord extends Record with Serialisation If the class does not have an explicit base class, it is required to specify Object. class StorageReports extends Object with Serialization In Dart, everything is an object, even basic types such as num are objects and not primitive types. The classes int and double are subtypes of num. This is important to know, as other languages have different behaviors. Let's consider a real example of this. main() { int i; print("$i"); } In a language such as Java the expected output would be 0 however the output in Dart is null. If a value is expected from a variable, it is always good practice to initialize it! For the classes Slide and SlideShow, we will use a mixin from the source file lifecyclemixin.dart to record a creation and an editing timestamp. abstract class LifecycleTracker { DateTime _created; DateTime _edited; recordCreateTimestamp() => _created = new DateTime.now(); updateEditTimestamp() => _edited = new DateTime.now(); DateTime get created => _created; DateTime get lastEdited => _edited; } To use the mixin, the recordCreateTimestamp method can be called from the constructor and the updateEditTimestamp from the main edit method. For slides, it makes sense just to record the creation. For the SlideShow class, both the creation and update will be tracked. Defining the core classes The SlideShow class is largely a container objects for a list of Slide objects and uses the mixin LifecycleTracker. 
class SlideShow extends Object with LifecycleTracker { List<Slide> _slides; List<Slide> get slides => _slides; ... The Slide class stores the string for the title and a list of strings for the bullet points. The URL for any image is also stored as a string: class Slide extends Object with LifecycleTracker { String titleText = ""; List<String> bulletPoints; String imageUrl = ""; ... A simple constructor takes the titleText as a parameter and initializes the bulletPoints list. If you want to focus on just-the-code when in WebStorm , double-click on filename title of the tab to expand the source code to the entire window. Double-click again to return to the original layout. For even more focus on the code, go to the View menu and click on Enter Distraction Free Mode. Transforming data into HTML To add the Slide object instance into a HTML document, the strings need to be converted into instances of HTML elements to be added to the DOM (Document Object Model). The getSlideContents() method constructs and returns the entire slide as a single object. DivElement getSlideContents() { DivElement slide = new DivElement(); DivElement title = new DivElement(); DivElement bullets = new DivElement(); title.appendHtml("<h1>$titleText</h1>"); slide.append(title); if (imageUrl.length > 0) { slide.appendHtml("<img src="$imageUrl" /><br/>"); } bulletPoints.forEach((bp) { if (bp.trim().length > 0) { bullets.appendHtml("<li>$bp</li>"); } }); slide.append(bullets); return slide; } The Div elements are constructed as objects (instances of DivElement), while the content is added as literal HTML statements. The method appendHtml is used for this particular task as it renders HTML tags in the text. The regular method appendText puts the entire literal text string (including plain unformatted text of the HTML tags) into the element. So what exactly is the difference? The method appendHtml evaluates the supplied ,HTML, and adds the resultant object node to the nodes of the parent element which is rendered in the browser as usual. The method appendText is useful, for example, to prevent user supplied content affecting the format of the page and preventing malicious code being injected into a web page. Editing the presentation When the source is updated the presentation is updated via the onKeyUp event. This was used in the text editor project to trigger a save to local storage. This is carried out in the build method of the SlideShow class, and follows the pattern we discussed parsing the presentation. build(String src) { updateEditTimestamp(); _slides = new List<Slide>(); Slide nextSlide; src.split("n").forEach((String line) { if (line.trim().length > 0) { // Title - also marks start of the next slide. if (line.startsWith("#")) { nextSlide = new Slide(line.substring(1)); _slides.add(nextSlide); } if (nextSlide != null) { if (line.startsWith("+")) { nextSlide.bulletPoints.add(line.substring(1)); } else if (line.startsWith("!")) { nextSlide.imageUrl = line.substring(1); } } } }); } As an alternative to the startsWith method, the square bracket [] operator could be used for line [0] to retrieve the first character. The startsWith can also take a regular expression or a string to match and a starting index, refer to the dart:core documentation for more information. For the purposes of parsing the presentation, the startsWith method is more readable. Displaying the current slide The slide is displayed via the showSlide method in slideShowApp.dart. 
To preview the current slide, the current index, stored in the field currentSlideIndex, is used to retrieve the desired slide object and the Div rendering method called. showSlide(int slideNumber) { if (currentSlideShow.slides.length == 0) return; slideScreen.style.visibility = "hidden"; slideScreen ..nodes.clear() ..nodes.add(currentSlideShow.slides[slideNumber].getSlideContents ()); rangeSlidePos.value = slideNumber.toString(); slideScreen.style.visibility = "visible"; } The slideScreen is a DivElement which is then updated off screen by setting the visibility style property to hidden The existing content of the DivElement is emptied out by calling nodes.clear() and the slide content is added with nodes.add. The range slider position is set and finally the DivElement is set to visible again. Navigating the presentation A button set with familiar first, previous, next and last slide allow the user to jump around the preview of the presentation. This is carried out by having an index into the list of slides stored in the field slide in the SlideShowApp class. Handling the button key presses The navigation buttons require being set up in an identical pattern in the constructor of the SlideShowApp object. First get an object reference using id, which is the id attribute of the element, and then attaching a handler to the click event. Rather than repeat this code, a simple function can handle the process. setButton(String id, Function clickHandler) { ButtonInputElement btn = querySelector(id); btn.onClick.listen(clickHandler); } As function is a type in Dart, functions can be passed around easily as a parameter. Let us take a look at the button that takes us to the first slide. setButton("#btnFirst", startSlideShow); void startSlideShow(MouseEvent event) { showFirstSlide(); } void showFirstSlide() { showSlide(0); } The event handlers do not directly change the slide, these are carried out by other methods, which may be triggered by other inputs such as the keyboard. Using the function type The SlideShowApp constructor makes use of this feature. Function qs = querySelector; var controls = qs("#controls"); I find the querySelector method a little long to type (though it is a good descriptive of what it does). With Function being types, we can easily create a shorthand version. The constructor spends much of its time selecting and assigning the HTML elements to member fields of the class. One of the advantages of this approach is that the DOM of the page is queried only once, and the reference stored and reused. This is good for performance of the application as, once the application is running, querying the DOM may take much longer. Staying within the bounds Using min and max function from the dart:math package, the index can be kept in range of the current list. void showLastSlide() { currentSlideIndex = max(0, currentSlideShow.slides.length - 1); showSlide(currentSlideIndex); } void showNextSlide() { currentSlideIndex = min(currentSlideShow.slides.length - 1, ++currentSlideIndex); showSlide(currentSlideIndex); } These convenience functions can save a great deal if and else if comparisons and help make code a good degree more readable. Using the slider control The slider control is another new control in the HTML5 standard. This will allow the user to scroll though the slides in the presentation. This control is a personal favorite of mine, as it is so visual and can be used to give very interactive feedback to the user. 
It seemed to be a huge omission from the original form controls in the early generation of web browsers. Even with clear widely accepted features, HTML specifications can take a long time to clear committees and make it into everyday browsers! <input type="range" id="rngSlides" value="0"/> The control has an onChange event which is given a listener in the SlideShowApp constructor. rangeSlidepos.onChange.listen(moveToSlide);rangeSlidepos.onChange .listen(moveToSlide); The control provides its data via a simple string value, which can be converted to an integer via the int.parse method to be used as an index to the presentation's slide list. void moveToSlide(Event event) { currentSlideIndex = int.parse(rangeSlidePos.value); showSlide(currentSlideIndex); } The slider control must be kept in synchronization with any other change in slide display, use of navigation or change in number of slides. For example, the user may use the slider to reach the general area of the presentation, and then adjust with the previous and next buttons. void updateRangeControl() { rangeSlidepos ..min = "0" ..max = (currentSlideShow.slides.length - 1).toString(); } This method is called when the number of slides is changed, and as with working with most HTML elements, the values to be set need converted to strings. Responding to keyboard events Using the keyboard, particularly the arrow (cursor) keys, is a natural way to look through the slides in a presentation even in the preview mode. This is carried out in the SlideShowApp constructor. In Dart web applications, the dart:html package allows direct access to the globalwindow object from any class or function. The Textarea used to input the presentation source will also respond to the arrow keys so there will need to be a check to see if it is currently being used. The property activeElement on the document will give a reference to the control with focus. This reference can be compared to the Textarea, which is stored in the presEditor field, so a decision can be taken on whether to act on the keypress or not. Key Event Code Action Left Arrow  37  Go back a slide. Up Arrow  38  Go to first slide.   Right Arrow  39  Go to next slide.  Down Arrow  40  Go to last slide. Keyboard events, like other events, can be listened to by using a stream event listener. The listener function is an anonymous function (the definition omits a name) that takes the KeyboardEvent as its only parameter. window.onKeyUp.listen((KeyboardEvent e) { if (presEditor != document.activeElement){ if (e.keyCode == 39) showNextSlide(); else if (e.keyCode == 37) showPrevSlide(); else if (e.keyCode == 38) showFirstSlide(); else if (e.keyCode == 40) showLastSlide(); } }); It is a reasonable question to ask how to get the keyboard key codes required to write the switching code. One good tool is the W3C's Key and Character Codes page at http://www.w3.org/2002/09/tests/keys.html, to help with this but it can often be faster to write the handler and print out the event that is passed in! Showing the key help Rather than testing the user's memory, there will be a handy reference to the keyboard shortcuts. This is a simple Div element which is shown and then hidden when the key (remember to press Shift too!) is pressed again by toggling the visibility style from visible to hidden. Listening twice to event streams The event system in Dart is implemented as a stream. One of the advantages of this is that an event can easily have more than one entity listening to the class. 
This is useful, for example in a web application where some keyboard presses are valid in one context but not in another. The listen method is an add operation (accumulative) so the key press for help can be implemented separately. This allows a modular approach which helps reuse as the handlers can be specialized and added as required. window.onKeyUp.listen((KeyboardEvent e) { print(e); //Check the editor does not have focus. if (presEditor != document.activeElement) { DivElement helpBox = qs("#helpKeyboardShortcuts"); if (e.keyCode == 191) { if (helpBox.style.visibility == "visible") { helpBox.style.visibility = "hidden"; } else { helpBox.style.visibility = "visible"; } } } }); In, for example, a game, a common set of event handling may apply to title and introduction screen and the actual in game screen contains additional event handling as a superset. This could be implemented by adding and removing handlers to the relevant event stream. Changing the colors HTML5 provides browsers with full featured color picker (typically browsers use the native OS's color chooser). This will be used to allow the user to set the background color of the editor application itself. The color picker is added to the index.html page with the following HTML: <input id="pckBackColor" type="color"> The implementation is straightforward as the color picker control provides: InputElement cp = qs("#pckBackColor"); cp.onChange.listen( (e) => document.body.style.backgroundColor = cp.value); As the event and property (onChange and value) are common to the input controls the basic InputElement class can be used. Adding a date Most presentations are usually dated, or at least some of the jokes are! We will add a convenient button for the user to add a date to the presentation using the HTML5 input type date which provides a graphical date picker. <input type="date" id="selDate" value="2000-01-01"/> The default value is set in the index.html page as follows: The valueAsDate property of the DateInputElement class provides the Date object which can be added to the text area: void insertDate(Event event) { DateInputElement datePicker = querySelector("#selDate"); if (datePicker.valueAsDate != null) presEditor.value = presEditor.value + datePicker.valueAsDate.toLocal().toString(); } In this case, the toLocal method is used to obtain a string formatted to the month, day, year format. Timing the presentation The presenter will want to keep to their allotted time slot. We will include a timer in the editor to aid in rehearsal. Introducing the stopwatch class The Stopwatch class (from dart:core) provides much of the functionality needed for this feature, as shown in this small command line application: main() { Stopwatch sw = new Stopwatch(); sw.start(); print(sw.elapsed); sw.stop(); print(sw.elapsed); } The elapsed property can be checked at any time to give the current duration. This is very useful class, for example, it can be used to compare different functions to see which is the fastest. Implementing the presentation timer The clock will be stopped and started with a single button handled by the toggleTimer method. A recurring timer will update the duration text on the screen as follows: If the timer is running, the update Timer and the Stopwatch in field slidesTime is stopped. 
No update to the display is required as the user will need to see the final time: void toggleTimer(Event event) { if (slidesTime.isRunning) { slidesTime.stop(); updateTimer.cancel(); } else { updateTimer = new Timer.periodic(new Duration(seconds: 1), (timer) { String seconds = (slidesTime.elapsed.inSeconds % 60).toString(); seconds = seconds.padLeft(2, "0"); timerDisplay.text = "${slidesTime.elapsed.inMinutes}:$seconds"; }); slidesTime ..reset() ..start(); } } The Stopwatch class provides properties for retrieving the elapsed time in minutes and seconds. To format this to minutes and seconds, the seconds portion is determined with the modular division operator % and padded with the string function padLeft. Dart's string interpolation feature is used to build the final string, and as the elapsed and inMinutes properties are being accessed, the {} brackets are required so that the single value is returned. Overview of slides This provides the user with a visual overview of the slides as shown in the following screenshot: The presentation slides will be recreated in a new full screen Div element. This is styled using the fullScreen class in the CSS stylesheet in the SlideShowApp constructor: overviewScreen = new DivElement(); overviewScreen.classes.toggle("fullScreen"); overviewScreen.onClick.listen((e) => overviewScreen.remove()); The HTML for the slides will be identical. To shrink the slides, the list of slides is iterated over, the HTML element object obtained and the CSS class for the slide is set: currentSlideShow.slides.forEach((s) { aSlide = s.getSlideContents(); aSlide.classes.toggle("slideOverview"); aSlide.classes.toggle("shrink"); ... The CSS hover class is set to scale the slide when the mouse enters so a slide can be focused on for review. The classes are set with the toggle method which either adds if not present or removes if they are. The method has an optional parameter: aSlide.classes.toggle('className', condition); The second parameter is named shouldAdd is true if the class is always to be added and false if the class is always to be removed. Handout notes There is nothing like a tangible handout to give attendees to your presentation. This can be achieved with a variation of the overview display: Instead of duplicating the overview code, the function can be parameterized with an optional parameter in the method declaration. This is declared with square brackets [] around the declaration and a default value that is used if no parameter is specified. void buildOverview([bool addNotes = false]) This is called by the presentation overview display without requiring any parameters. buildOverview(); This is called by the handouts display without requiring any parameters. buildOverview(true); If this parameter is set, an additional Div element is added for the Notes area and the CSS is adjust for the benefit of the print layout. Comparing optional positional and named parameters The addNotes parameter is declared as an optional positional parameter, so an optional value can be specified without naming the parameter. The first parameter is matched to the supplied value. To give more flexibility, Dart allows optional parameters to be named. Consider two functions, the first will take named optional parameters and the second positional optional parameters. 
getRecords1(String query,{int limit: 25, int timeOut: 30}) { } getRecords2(String query,[int limit = 80, int timeOut = 99]) { } The first function can be called in more ways: getRecords1(""); getRecords1("", limit:50, timeOut:40); getRecords1("", timeOut:40, limit:65); getRecords1("", limit:50); getRecords1("", timeOut:40); getRecords2(""); getRecords2("", 90); getRecords2("", 90, 50); With named optional parameters, the order they are supplied is not important and has the advantage that the calling code is clearer as to the use that will be made of the parameters being passed. With positional optional parameters, we can omit the later parameters but it works in a strict left to right order so to set the timeOut parameter to a non-default value, limit must also be supplied. It is also easier to confuse which parameter is for which particular purpose. Summary The presentation editor is looking rather powerful with a range of advanced HTML controls moving far beyond text boxes to date pickers and color selectors. The preview and overview help the presenter visualize the entire presentation as they work, thanks to the strong class structure built using Dart mixins and data structures using generics. We have spent time looking at the object basis of Dart, how to pass parameters in different ways and, closer to the end user, how to handle keyboard input. This will assist in the creation of many different types of application and we have seen how optional parameters and true properties can help document code for ourselves and other developers. Hopefully you learned a little about coconuts too. The next step for this application is to improve the output with full screen display, animation and a little sound to capture the audiences' attention. The presentation editor could be improved as well—currently it is only in the English language. Dart's internationalization features can help with this. Resources for Article: Further resources on this subject: Practical Dart[article] Handling the DOM in Dart[article] Dart with JavaScript [article]

Performance by Design

Packt
15 Sep 2015
9 min read
In this article by Shantanu Kumar, author of the book Clojure High Performance Programming - Second Edition, we learn how Clojure is a safe, functional programming language that brings great power and simplicity to the user. Clojure is also dynamically and strongly typed, and has very good performance characteristics. Naturally, every activity performed on a computer has an associated cost. What constitutes acceptable performance varies from one use case and workload to another. In today's world, performance is often the determining factor for several kinds of applications. We will discuss Clojure (which runs on the JVM (Java Virtual Machine)) and its runtime environment in the light of performance, which is the goal of the book. In this article, we will study the basics of performance analysis, including the following:

A whirlwind tour of how the application stack impacts performance
Classifying the performance anticipations by the types of use cases

(For more resources related to this topic, see here.)

Use case classification

The performance requirements and priorities vary across different kinds of use cases. We need to determine what constitutes acceptable performance for each kind of use case. Hence, we classify them to identify their performance model. When it comes to details, there is no sure-shot performance recipe for any kind of use case, but it certainly helps to study their general nature. Note that, in real life, the use cases listed in this section may overlap with each other.

The user-facing software

The performance of user-facing applications is strongly linked to the user's anticipation. A difference of a good number of milliseconds may not be perceptible to the user but, at the same time, a wait of more than a few seconds may not be taken kindly. One important element to normalize the anticipation is to engage the user by providing duration-based feedback. A good idea to deal with such a scenario would be to start the task asynchronously in the background, and poll it from the UI layer to generate duration-based feedback for the user. Another way could be to incrementally render the results to the user to even out the anticipation. Anticipation is not the only factor in user-facing performance. Common techniques like staging or precomputation of data, and other general optimization techniques, can go a long way to improve the user experience with respect to performance. Bear in mind that all kinds of user-facing interfaces fall into this use case category: the Web, mobile web, GUI, command line, touch, voice-operated, gesture...you name it.

Computational and data-processing tasks

Non-trivial compute-intensive tasks demand a proportional amount of computational resources. The CPU, cache, memory, and the efficiency and parallelizability of the computation algorithms are all involved in determining the performance. When the computation is combined with distribution over a network or reading from/staging to disk, I/O bound factors come into play. This class of workloads can be further subclassified into more specific use cases.

A CPU bound computation

A CPU bound computation is limited by the CPU cycles spent on executing it. Arithmetic processing in a loop, small matrix multiplication, determining whether a number is a Mersenne prime, and so on, would be considered CPU bound jobs.
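To make the idea concrete, here is a small sketch of such a job, written in Python rather than Clojure; it is not from the book, just an illustration. The Lucas-Lehmer test for Mersenne primes does nothing but arithmetic in a loop, so available CPU cycles are essentially the only limiting resource:

# A CPU bound job: test whether 2**p - 1 is a Mersenne prime using the
# Lucas-Lehmer test. The loop performs only big-integer arithmetic -- no I/O
# and a tiny working set -- so the CPU alone dictates how fast it finishes.
def lucas_lehmer(p):
    """Return True if 2**p - 1 is prime; p itself is assumed to be prime."""
    if p == 2:
        return True
    m = (1 << p) - 1              # the Mersenne number 2**p - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m       # one big multiply and one modulo per step
    return s == 0

# 2**11213 - 1 is a known Mersenne prime; this keeps a single core busy
# with pure arithmetic for a noticeable amount of time.
print(lucas_lehmer(11213))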
If the algorithm complexity is linked to the number of iterations/operations N, such as O(N), O(N^2), and more, then the performance depends on how big N is, and how many CPU cycles each step takes. For parallelizable algorithms, the performance of such tasks may be enhanced by assigning multiple CPU cores to the task. On virtual hardware, the performance may be impacted if the CPU cycles are available in bursts.

A memory bound task

A memory bound task is limited by the availability and bandwidth of the memory. Examples include large text processing, list processing, and more. For example, specifically in Clojure, the (reduce f (pmap g coll)) operation would be memory bound if coll is a large sequence of big maps, even though we parallelize the operation using pmap here. Note that higher CPU resources cannot help when memory is the bottleneck, and vice versa. Lack of availability of memory may force you to process smaller chunks of data at a time, even if you have enough CPU resources at your disposal. If the maximum speed of your memory is X and your algorithm on a single core accesses the memory at speed X/3, the multicore performance of your algorithm cannot exceed three times the current performance, no matter how many CPU cores you assign to it. The memory architecture (for example, SMP and NUMA) contributes to the memory bandwidth in multicore computers. Performance with respect to memory is also subject to page faults.

A cache bound task

A task is cache bound when its speed is constrained by the amount of cache available. When a task retrieves values from a small number of repeated memory locations, for example, a small matrix multiplication, the values may be cached and fetched from there. Note that CPUs (typically) have multiple layers of cache, and the performance will be at its best when the processed data fits in the cache, but the processing will still happen, more slowly, when the data does not fit into the cache. It is possible to make the most of the cache using cache-oblivious algorithms. Running a higher number of concurrent cache/memory bound threads than CPU cores is likely to flush the instruction pipeline, as well as the cache, at the time of a context switch, likely leading to severely degraded performance.

An input/output bound task

An input/output (I/O) bound task would go faster if the I/O subsystem that it depends on went faster. Disk/storage and network are the most commonly used I/O subsystems in data processing, but it can be a serial port, a USB-connected card reader, or any other I/O device. An I/O bound task may consume very few CPU cycles. Depending on the speed of the device, connection pooling, data compression, asynchronous handling, application caching, and more may help in performance. One notable aspect of I/O bound tasks is that performance is usually dependent on the time spent waiting for connection/seek and the amount of serialization that we do, and hardly on the other resources.

In practice, many data processing workloads are usually a combination of CPU bound, memory bound, cache bound, and I/O bound tasks. The performance of such mixed workloads effectively depends on the even distribution of CPU, cache, memory, and I/O resources over the duration of the operation. A bottleneck situation arises only when one resource gets too busy to make way for another.

Online transaction processing

The online transaction processing (OLTP) systems process the business transactions on demand.
It can sit behind systems such as a user-facing ATM machine, point-of-sale terminal, a network-connected ticket counter, ERP systems, and more. The OLTP systems are characterized by low latency, availability, and data integrity. They run day-to-day business transactions. Any interruption or outage is likely to have a direct and immediate impact on the sales or service. Such systems are expected to be designed for resiliency rather than the delayed recovery from failures. When the performance objective is unspecified, you may like to consider graceful degradation as a strategy. It is a common mistake to ask the OLTP systems to answer analytical queries; something that they are not optimized for. It is desirable of an informed programmer to know the capability of the system, and suggest design changes as per the requirements. Online analytical processing The online analytical processing (OLAP) systems are designed to answer analytical queries in short time. They typically get data from the OLTP operations, and their data model is optimized for querying. They basically provide for consolidation (roll-up), drill-down and slicing, and dicing of data for analytical purposes. They often use specialized data stores that can optimize the ad-hoc analytical queries on the fly. It is important for such databases to provide pivot-table like capability. Often, the OLAP cube is used to get fast access to the analytical data. Feeding the OLTP data into the OLAP systems may entail workflows and multistage batch processing. The performance concern of such systems is to efficiently deal with large quantities of data, while also dealing with inevitable failures and recovery. Batch processing Batch processing is automated execution of predefined jobs. These are typically bulk jobs that are executed during off-peak hours. Batch processing may involve one or more stages of job processing. Often batch processing is clubbed with work-flow automation, where some workflow steps are executed offline. Many of the batch processing jobs work on staging of data, and on preparing data for the next stage of processing to pick up. Batch jobs are generally optimized for the best utilization of the computing resources. Since there is little to moderate the demand to lower the latencies of some particular subtasks, these systems tend to optimize for throughput. A lot of batch jobs involve largely I/O processing and are often distributed over a cluster. Due to distribution, the data locality is preferred when processing the jobs; that is, the data and processing should be local in order to avoid network latency in reading/writing data. Summary We learned about the basics of what it is like to think more deeply about performance. The performance of Clojure applications depend on various factors. For a given application, understanding its use cases, design and implementation, algorithms, resource requirements and alignment with the hardware, and the underlying software capabilities, is essential. Resources for Article: Further resources on this subject: Big Data [article] The Observer Pattern [article] Working with Incanter Datasets [article]

Deploy Node.js Apps with AWS Code Deploy

Ankit Patial
14 Sep 2015
7 min read
As an application developer, you must be familiar with the complexity of deploying apps to a fleet of servers with minimum downtime. AWS introduced a new service called AWS Code Deploy to ease the deployment of applications to EC2 instances on the AWS cloud. Before explaining the full process, I will assume that you are using AWS VPC, that all of your EC2 instances are inside the VPC, and that each instance has an IAM role. Let's see how we can deploy a Node.js application to AWS.

Install AWS Code Deploy Agent

The first thing you need to do is to install aws-codedeploy-agent on each machine that you want your code deployed on. Before installing the agent, please make sure that you have a trust relationship for codedeploy.us-west-2.amazonaws.com and codedeploy.us-east-1.amazonaws.com added to the IAM role that the EC2 instance is using. Not sure what it is? Then click on the top-left dropdown with your account name in the AWS console and select the Security Credentials option; you will be redirected to a new page. Select Roles from the left menu, look for the IAM role that the EC2 instance is using, click it, and scroll to the bottom, where you will see the Edit Trust Relationship button. Click this button to edit the trust relationship, and make sure it looks like the following:

...
"Principal": {
  "Service": [
    "ec2.amazonaws.com",
    "codedeploy.us-west-2.amazonaws.com",
    "codedeploy.us-east-1.amazonaws.com"
  ]
}
...

OK, we are good to install the AWS Code Deploy Agent, so make sure ruby2.0 is installed. Use the following script to install the code deploy agent:

aws s3 cp s3://aws-codedeploy-us-east-1/latest/install ./install-aws-codedeploy-agent --region us-east-1
chmod +x ./install-aws-codedeploy-agent
sudo ./install-aws-codedeploy-agent auto
rm install-aws-codedeploy-agent

Hopefully nothing will go wrong and the agent will be installed, up and running. To check whether it is running or not, try the following command:

sudo service codedeploy-agent status

Let's move to the next step.

Create Code Deploy Application

Login to your AWS account. Under Deployment & Management, click on the Code Deploy link, on the next screen click on the Get Started Now button, and complete the following things:

Choose Custom Deployment and click the Skip Walkthrough button.
Create New Application; the following are the steps to create an application:
- Application Name: the display name for the application you want to deploy.
- Deployment Group Name: this is something similar to environments like LIVE, STAGING, and QA.
- Add Instances: you can choose Amazon EC2 instances by name, group name, and so on. In case you are using the autoscaling feature, you can add that auto scaling group too.
- Deployment Config: a way to specify how we want to deploy the application, whether we want to deploy one server at a time, half of the servers at a time, or all at once.
- Service Role: choose the IAM role that has access to the S3 bucket that we will use to hold code revisions.
- Hit the Create Application button.

OK, we just created a Code Deploy application. Let's hold it here and move to our Node.js app to get it ready for deployment.
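If you prefer scripting this setup over clicking through the console, the same application and deployment group can also be created with the AWS SDK. The following boto3 sketch is not part of the original walkthrough; the application name, group name, tag filter, and role ARN are placeholders, and you should verify the parameters against the current CodeDeploy API documentation before relying on it.

# Hypothetical boto3 sketch of the console steps above (placeholders throughout).
import boto3

codedeploy = boto3.client("codedeploy", region_name="us-east-1")

# Equivalent of "Create New Application" in the console.
codedeploy.create_application(applicationName="MyNodeApp")

# Equivalent of the deployment group settings chosen in the console.
codedeploy.create_deployment_group(
    applicationName="MyNodeApp",
    deploymentGroupName="LIVE",
    deploymentConfigName="CodeDeployDefault.OneAtATime",
    ec2TagFilters=[
        {"Key": "Name", "Value": "my-node-app", "Type": "KEY_AND_VALUE"}
    ],
    serviceRoleArn="arn:aws:iam::123456789012:role/CodeDeployServiceRole",
)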
Code Revision

OK, you have written your app and you are ready to deploy it. The most important thing your app needs is appspec.yml. This file will be used by the code deploy agent to perform various steps during the deployment life cycle. In simple words, the deployment process includes the following steps:

Stop the previous application if it is already deployed; if it is the first time, this step will not exist.
Update to the latest code, for example, copy files to the application directory.
Install new packages or run DB migrations.
Start the application.
Check whether the application is working.
Roll back if something went wrong.

All the above steps seem easy, but they are time consuming and painful to perform each time. Let's see how we can perform these steps easily with AWS Code Deploy. Let's say we have the following appspec.yml file in our code, and also a bin folder in the app that contains executable sh scripts to perform certain things that I will explain next. First of all, take an example of appspec.yml:

version: 0.0
os: linux
files:
  - source: /
    destination: /home/ec2-user/my-app
permissions:
  - object: /
    pattern: "**"
    owner: ec2-user
    group: ec2-user
hooks:
  ApplicationStop:
    - location: bin/app-stop
      timeout: 10
      runas: ec2-user
  AfterInstall:
    - location: bin/install-pkgs
      timeout: 1200
      runas: ec2-user
  ApplicationStart:
    - location: bin/app-start
      timeout: 60
      runas: ec2-user
  ValidateService:
    - location: bin/app-validate
      timeout: 10
      runas: ec2-user

The files section tells Code Deploy which files to copy and the destination to copy them to:

files:
  - source: /
    destination: /home/ec2-user/my-app

We can specify the permissions to be set for the source files on copy:

permissions:
  - object: /
    pattern: "**"
    owner: ec2-user
    group: ec2-user

Hooks are executed in order during the Code Deploy life cycle. We have the ApplicationStop, DownloadBundle, BeforeInstall, Install, AfterInstall, ApplicationStart, and ValidateService hooks, which all have the same syntax:

hooks:
  deployment-lifecycle-event-name:
    - location: script-location
      timeout: timeout-in-seconds
      runas: user-name

location is the relative path from the code root to the script file that you want to execute. timeout is the maximum time a script can run. runas is the OS user to run the script as; sometimes you may want to run a script with different user privileges.

Let's bundle the app: exclude the unwanted files, such as the node_modules folder, and zip it. I use the AWS CLI to deploy my code revisions; you can install awscli from PyPI (the Python Package Index):

sudo pip install awscli

I am using an awscli profile that has access to the S3 code revision bucket in my account. Here is a code sample that can help:

aws --profile simsaw-baas deploy push --no-ignore-hidden-files --application-name MY_CODE_DEPLOY_APP_NAME --s3-location s3://MY_CODE_REVISONS/MY_CODE_DEPLOY_APP_NAME/APP_BUILD --source MY_APP_BUILD.zip

Now the code revision is published to S3, and the same revision is registered with the Code Deploy application named MY_CODE_DEPLOY_APP_NAME (the name of the application you created earlier in the second step). Now go back to the AWS console, Code Deploy.

Deploy Code Revision

Select your Code Deploy application from the application list shown on the Code Deploy Dashboard. It will take you to the next window, where you can see the published revision(s). Expand the revision and click on Deploy This Revision. You will be redirected to a new window with options like application and deployment group. Choose them carefully and hit deploy. Wait for the magic to happen. Code Deploy has another option to deploy your app from GitHub. The process for it will be almost the same, except that you need not push code revisions to S3.

About the author

Ankit Patial has a Masters in Computer Applications, and nine years of experience with custom APIs, web and desktop applications using .NET technologies, ROR, and NodeJs.
As a CTO with SimSaw Inc and Pink Hand Technologies, his job is to learn and help his team to implement the best practices of using Cloud Computing and JavaScript technologies.

Continuous Delivery and Continuous Deployment

Packt
14 Sep 2015
13 min read
 In this article by Jonathan McAllister, author of the book Mastering Jenkins, we will discover all things continuous namely Continuous Delivery, and Continuous Deployment practices. Continuous Delivery represents a logical extension to the continuous integration practices. It expands the automation defined in continuous integration beyond simply building a software project and executing unit tests. Continuous Delivery adds automated deployments and acceptance test verification automation to the solution. To better describe this process, let's take a look at some basic characteristics of Continuous Delivery: The development resources commit changes to the mainline of the source control solution multiple times per day, and the automation system initiates a complete build, deploy, and test validation of the software project Automated tests execute against every change deployed, and help ensure that the software remains in an always-releasable state Every committed change is treated as potentially releasable, and extra care is taken to ensure that incomplete development work is hidden and does not impact readiness of the software Feedback loops are developed to facilitate notifications of failures. This includes build results, test execution reports, delivery status, and user acceptance verification Iterations are short, and feedback is rapid, allowing the business interests to weigh in on software development efforts and propose alterations along the way Business interests, instead of engineering, will decide when to physically release the software project, and as such, the software automation should facilitate this goal (For more resources related to this topic, see here.) As described previously, Continuous Delivery (CD) represents the expansion of Continuous Integration Practices. At the time of writing of this book, Continuous Delivery approaches have been successfully implemented at scale in organizations like Amazon, Wells Fargo, and others. The value of CD derives from the ability to tie software releases to business interests, collect feedback rapidly, and course correct efficiently. The following diagram illustrates the basic automation flow for Continuous Delivery: Figure 8-10: Continuous Delivery workflow As we can see in the preceding diagram, this practice allows businesses to rapidly develop, strategically market, and release software based on pivoting market demands instead of engineering time frames. When implementing a continuous delivery solution, there are a few key points that we should keep in mind: Keep the build fast Illuminate the failures, and recover immediately Make deployments push-button, for any version to any environment Automate the testing and validation operations with defined buckets for each logical test group (unit, smoke, functional and regression) Use feature toggles to avoid branching Get feedback early and often (automation feedback, test feedback, build feedback, UAT feedback) Principles of Continuous Delivery Continuous Delivery was founded on the premise of standardized and defined release processes, automation-based build pipelines, and logical quality gates with rapid feedback loops. In a continuous delivery paradigm, builds flow from development to QA and beyond like water in a pipe. Builds can be promoted from one logical group to another, and the risk of the proposed change is exposed incrementally to a wider audience. 
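One of the key points listed above, using feature toggles to avoid branching, is easy to picture with a minimal sketch. The following Python fragment is purely illustrative (the function and flag names are invented, and toggle values would normally come from configuration management); it shows how incomplete work stays hidden so that every commit to the mainline remains releasable:

# A minimal feature-toggle sketch: in-progress work ships dark behind a flag.
import os

def legacy_checkout(cart):
    return "processed %d items with the legacy flow" % len(cart)

def new_checkout(cart):
    return "processed %d items with the new, in-progress flow" % len(cart)

FEATURE_TOGGLES = {
    # In a real system the value comes from configuration, not source code.
    "new_checkout_flow": os.environ.get("NEW_CHECKOUT_FLOW", "off") == "on",
}

def is_enabled(name):
    return FEATURE_TOGGLES.get(name, False)

def checkout(cart):
    # The unfinished path is only exercised when the toggle is switched on,
    # so the mainline stays potentially releasable at every commit.
    if is_enabled("new_checkout_flow"):
        return new_checkout(cart)
    return legacy_checkout(cart)

print(checkout(["book", "pen"]))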
The practical application of the Continuous Delivery principles lies in frequent commits to the mainline, which, in turn, execute the build pipeline automation suite, pass through automated quality gates for verification, and are individually signed off by business interests (in a best case scenario). The idea of incrementally exposing the risk can be better illustrated through a circle of trust diagram, as follows: Figure 8-11: Circle of Trust for code changes As illustrated in the preceding trust diagram, the number of people exposed to a build expands incrementally as the build passes from one logical development and business group to another. This model places emphasis on verification and attempts to remove waste (time) by exposing the build output only to groups that have a vested interest in the build at that phase. Continuous Delivery in Jenkins Applying the Continuous Delivery principles in Jenkins can be accomplished in a number of ways. That said, there are some definite tips and tricks which can be leveraged to make the implementation a bit less painful. In this section, we will discuss and illustrate some of the more advanced Continuous Delivery tactics and learn how to apply them in Jenkins. Your specific implementation of Continuous Delivery will most definitely be unique to your organization; so, take what is useful, research anything that is missing, and disregard what is useless. Let's get started. Rapid feedback loops Rapid feedback loops are the baseline implementation requirement for Continuous Delivery. Applying this with Jenkins can be accomplished in a pretty slick manner using a combination of the Email-Ext plugin and some HTML template magic. In large-scale Jenkins implementations, it is not wise to manage many e-mail templates, and creating a single transformable one will help save time and effort. Let's take a look how to do this in Jenkins. The Email-Ext plugin provides Jenkins with the capabilities of completely customizable e-mail notifications. It allows the Jenkins system to customize just about every aspect of notifications and can be leveraged as an easy-to-implement, template-based e-mail solution. To begin with, we will need to install the plugin into our Jenkins system. The details for this plugin can be found at the following web address: https://wiki.jenkins-ci.org/display/JENKINS/Email-ext+plugin Once the plug-in has been installed into our Jenkins system, we will need to configure the basic connection details and optional settings. To begin, navigate to the Jenkins administration area and locate the Extended Email Notification section. Jenkins->Manage Jenkins->Configure System On this page, we will need to specify, at a minimum, the following details: SMTP Server SMTP Authentication details (User Name + Password) Reply-to List (nobody@domain.com) System Admin Email Address (located further up on the page) The completed form may look something like the following screenshot: Figure 8-12: Completed form Once the basic SMTP configuration details have been specified, we can then add the Editable Email Notification post build step to our jobs, and configure the e-mail contents appropriately. The following screenshot illustrates the basic configuration options required for the build step to operate: Figure 8-13: Basic configuration options As we can see from the preceding screenshot, environment variables are piped into the plugin via the job's automation to define the e-mail contents, recipient list, and other related details. 
This solution makes for a highly effective feedback loop implementation. Quality gates and approvals Two of the key aspects of Continuous Delivery include the adoption of quality gates and stakeholder approvals. This requires individuals to signoff on a given change or release as it flows through the pipeline. Back in the day, this used to be managed through a Release Signoff sheet, which would often times be maintained manually on paper. In the modern digital age, this is managed through the Promoted builds plugin in Jenkins, whereby we can add LDAP or Active Directory integration to ensure that only authentic users have the access rights required to promote builds. However, there is room to expand this concept and learn some additional tips and tricks, which will ensure that we have a solid and secure implementation. Integrating Jenkins with Lightweight Directory Access Protocol (LDAP) is generally a straightforward exercise. This solution allows a corporate authentication system to be tied directly into Jenkins. This means that once the security integration is configured in Jenkins, we will be able to login to the Jenkins system (UI) by using our corporate account credentials. To connect Jenkins to a corporate authentication engine, we will first need to configure Jenkins to talk to the corporate security servers. This is configured in the Global Security administration area of the Jenkins user interface as shown in the following screenshot: Figure 8-14: Global Security configuration options The global security area of Jenkins allows us to specify the type of authentication that Jenkins will use for users who wish to access the Jenkins system. By default, Jenkins provides a built-in internal database for managing users; we will have to alter this to support LDAP. To configure this system to utilize LDAP, click the LDAP radio button, and enter your LDAP server details as illustrated in the following screenshot: Figure 8-15: LDAP server details Fill out the form with your company's LDAP specifics, and click save. If you happen to get stuck on this configuration, the Jenkins community has graciously provided an additional in-depth documentation. This documentation can be found at the following URL: https://wiki.jenkins-ci.org/display/JENKINS/LDAP+Plugin For users who wish to leverage Active Directory, there is a Jenkins plugin which can facilitate this type of integrated security solution. For more details on this plugin, please consult the plugin page at the following URL: https://wiki.jenkins-ci.org/display/JENKINS/Active+Directory+plugin Once the authentication solution has successfully been configured, we can utilize it to set approvers in the promoted builds plugin. To configure a promotion approver, we will need to edit the desired Jenkins project, and specify the users who should have the promote permissions. The following screenshot shows an example of this configuration: Figure 8-16: Configuration example As we can see, the promoted builds plugin provides an excellent signoff sheet solution. It is complete with access security controls, promotion criteria, and a robust build step implementation solution. Build pipeline(s) workflow and visualization When build pipelines are created initially, the most common practice is to simply daisy chain the jobs together. This is a perfectly reasonable initial-implementation approach, but in the long term, this may get confusing and it may become difficult to track the workflow of daisy-chained jobs. 
To assist with this issue, Jenkins offers a plugin to help visualize the build pipelines, and is appropriately named the Build Pipelines plugin. The details surrounding this plugin can be found at the following web URL: https://wiki.jenkins-ci.org/display/JENKINS/Build+Pipeline+Plugin This plugin provides an additional view option, which is populated by specifying an entry point to the pipeline, detecting upstream and downstream jobs, and creating a visual representation of the pipeline. Upon the initial installation of the plugin, we can see an additional option available to us when we create a new dashboard view. This is illustrated in the following screenshot: Figure 8-17: Dashboard view Upon creating a pipeline view using the build pipeline plugin, Jenkins will present us with a number of configuration options. The most important configuration options are the name of the view and the initial job dropdown selection option, as seen in the following screenshot: Figure 8-18: Pipeline view configuration options Once the basic configuration has been defined, click the OK button to save the view. This will trigger the plugin to perform an initial scan of the linked jobs and generate the pipeline view. An example of a completely developed pipeline is illustrated in the following image: Figure 8-19: Completely developed pipeline This completes the basic configuration of a build pipeline view, which gives us a good visual representation of our build pipelines. There are a number of features and customizations that we could apply to the pipeline view, but we will let you explore those and tweak the solution to your own specific needs. Continuous Deployment Just as Continuous Delivery represents a logical extension of Continuous Integration, Continuous Deployment represents a logical expansion upon the Continuous Delivery practices. Continuous Deployment is very similar to Continuous Delivery in a lot of ways, but it has one key fundamental variance: there are no approval gates. Without approval gates, code commits to the mainline have the potential to end up in the production environment in short order. This type of an automation solution requires a high-level of discipline, strict standards, and reliable automation. It is a practice that has proven valuable for the likes of Etsy, Flickr, and many others. This is because Continuous Deployment dramatically increases the deployment velocity. The following diagram describes both, Continuous Delivery and Continuous Deployment, to better showcase the fundamental difference between, them: Figure 8-20: Differentiation between Continuous Delivery and Continuous Deployment It is important to understand that Continuous Deployment is not for everyone, and is a solution that may not be feasible for some organizations, or product types. For example, in embedded software or Desktop application software, Continuous Deployment may not be a wise solution without properly architected background upgrade mechanisms, as it will most likely alienate the users due to the frequency of the upgrades. On the other hand, it's something that could be applied, fairly easily, to a simple API web service or a SaaS-based web application. If the business unit indeed desires to migrate towards a continuous deployment solution, tight controls on quality will be required to facilitate stability and avoid outages. 
These controls may include any of the following: The required unit testing with code coverage metrics The required a/b testing or experiment-driven development Paired programming Automated rollbacks Code reviews and static code analysis implementations Behavior-driven development (BDD) Test-driven development (TDD) Automated smoke tests in production Additionally, it is important to note that since a Continuous Deployment solution is a significant leap forward, in general, the implementation of the Continuous Delivery practices would most likely be a pre-requisite. This solution would need to be proven stable and trusted prior to the removal of the approval gates. Once removed though, the deployment velocity should significantly increase as a result. The quantifiable value of continuous deployment is well advertised by companies such as Amazon who realized a 78 percent reduction in production outages, and a 60% reduction in downtime minutes due to catastrophic defects. That said, implementing continuous deployment will require a buy-in from the stakeholders and business interests alike. Continuous Deployment in Jenkins Applying the Continuous Deployment practices in Jenkins is actually a simple exercise once Continuous Integration and Continuous Delivery have been completed. It's simply a matter of removing the approve criteria and allowing builds to flow freely through the environments, and eventually to production. The following screenshot shows how to implement this using the Promoted builds plugin: Figure 8-21: Promoted builds plugin implementation Once removed, the build automation solutions will continuously deploy for every commit to the mainline (given that all the automated tests have been passed). Summary With this article of Mastering Jenkins, you should now have a solid understanding of how to advocate for and develop Continuous Delivery, and Continuous Deployment practices at an organization. Resources for Article: Further resources on this subject: Exploring Jenkins [article] Jenkins Continuous Integration [article] What is continuous delivery and DevOps? [article]

Understanding the Datastore

Packt
14 Sep 2015
41 min read
 In this article by Mohsin Hijazee, the author of the book Mastering Google App Engine, we will go through learning, but unlearning something is even harder. The main reason why learning something is hard is not because it is hard in and of itself, but for the fact that most of the times, you have to unlearn a lot in order to learn a little. This is quite true for a datastore. Basically, it is built to scale the so-called Google scale. That's why, in order to be proficient with it, you will have to unlearn some of the things that you know. Your learning as a computer science student or a programmer has been deeply enriched by the relational model so much so that it is natural to you. Anything else may seem quite hard to grasp, and this is the reason why learning Google datastore is quite hard. However, if this were the only glitch in all that, things would have been way simpler because you could ask yourself to forget the relational world and consider the new paradigm afresh. Things have been complicated due to Google's own official documentation, where it presents a datastore in a manner where it seems closer to something such as Django's ORM, Rails ActiveRecord, or SQLAlchemy. However, all of a sudden, it starts to enlist its limitations with a very brief mention or, at times, no mention of why the limitations exist. Since you only know the limitations but not why the limitations are there in the first place, a lack of reason may result to you being unable to work around those limitations or mold your problem space into the new solution space, which is Google datastore. We will try to fix this. Hence, the following will be our goals in this article: To understand BigTable and its data model To have a look at the physical data storage in BigTable and the operations that are available in it To understand how BigTable scales To understand datastore and the way it models data on top of BigTable So, there's a lot more to learn. Let's get started on our journey of exploring datastore. The BigTable If you decided to fetch every web page hosted on the planet, download and store a copy of it, and later process every page to extract data from it, you'll find out that your own laptop or desktop is not good enough to accomplish this task. It has barely enough storage to store every page. Usually, laptops come with 1 TB hard disk drives, and this seems to be quite enough for a person who is not much into video content such as movies. Assuming that there are 2 billion websites, each with an average of 50 pages and each page weighing around 250 KB, it sums up to around 23,000+ TB (or roughly 22 petabytes), which would need 23,000 such laptops to store all the web pages with a 1 TB hard drive in each. Assuming the same statistics, if you are able to download at a whopping speed of 100 MBps, it would take you about seven years to download the whole content to one such gigantic hard drive if you had one in your laptop. Let's suppose that you downloaded the content in whatever time it took and stored it. Now, you need to analyze and process it too. If processing takes about 50 milliseconds per page, it would take about two months to process the entire data that you downloaded. The world would have changed a lot by then already, leaving your data and processed results obsolete. This is the Kind of scale for which BigTable is built. Every Google product that you see—Search Analytics, Finance, Gmail, Docs, Drive, and Google Maps—is built on top of BigTable. 
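The back-of-the-envelope figures above are easy to verify. The following few lines of Python (mine, not from the book) reproduce the storage and download estimates under the same assumptions of 2 billion sites, 50 pages per site, 250 KB per page, and a 100 MBps link:

# Sanity check of the storage and download estimates quoted above.
sites = 2_000_000_000            # 2 billion websites
pages_per_site = 50
page_size_kb = 250

total_kb = sites * pages_per_site * page_size_kb
total_tb = total_kb / 1024**3    # KB -> TB
total_pb = total_tb / 1024       # TB -> PB
print(f"{total_tb:,.0f} TB, roughly {total_pb:.1f} PB")         # ~23,283 TB, ~22.7 PB

bandwidth = 100 * 1024**2        # 100 MBps in bytes per second
seconds = (total_kb * 1024) / bandwidth
print(f"~{seconds / (3600 * 24 * 365):.1f} years at 100 MBps")  # ~7.7 years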
If you want to read more about BigTable, you can go through the academic paper from Google Research, which is available at http://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf. The data model Let's examine the data model of BigTable at a logical level. BigTable is basically a key-value store. So, everything that you store falls under a unique key, just like PHP' arrays, Ruby's hash, or Python's dict: # PHP $person['name'] = 'Mohsin'; # Ruby or Python person['name'] = 'Mohsin' However, this is a partial picture. We will learn the details gradually in a while. So, let's understand this step by step. A BigTable installation can have multiple tables, just like a MySQL database can have multiple tables. The difference here is that a MySQL installation might have multiple databases, which in turn might have multiple tables. However, in the case of BigTable, the first major storage unit is a table. Each table can have hundreds of columns, which can be divided into groups called column families. You can define column families at the time of creating a table. They cannot be altered later, but each column family might have hundreds of columns that you can define even after the creation of the table. The notation that is used to address a column and its column families is like job:title, where job is a column family and title is the column. So here, you have a job column family that stores all the information about the job of the user, and title is supposed to store the job title. However, one of the important facts about these columns is that there's no concept of datatypes in BigTable as you'd encounter in other relational database systems. Everything is just an uninterpreted sequence of bytes, which means nothing to BigTable. What they really mean is just up to you. It might be a very long integer, a string, or a JSON-encoded data. Now, let's turn our attention to the rows. There are two major characteristics of the rows that we are concerned about. First, each row has a key, which must be unique. The contents of the key again consist of an uninterpreted string of bytes that is up to 64 KB in length. A key can be anything that you want it to be. All that's required is that it must be unique within the table, and in case it is not, you will have to overwrite the contents of the row with the same content. Which key should you use for a row in your table? That's the question that requires some consideration. To answer this, you need to understand how the data is actually stored. Till then, you can assume that each key has to be a unique string of bytes within the scope of a table and should be up to 64 KB in length. Now that we know about tables, column families, columns, rows, and row keys, let's look at an example of BigTable that stores 'employees' information. Let's pretend that we are creating something similar to LinkedIn here. So, here's the table: Personal Professional Key(name) personal:lastname personal:age professinal:company professional:designation Mohsin Hijazee 29 Sony Senior Designer Peter Smith 34 Panasonic General Manager Kim Yong 32 Sony Director Ricky Martin 45 Panasonic CTO Paul Jefferson 39 LG Sales Head So, 'this is a sample BigTable. The first column is the name, and we have chosen it as a key. It is of course not a good key, because the first name cannot necessarily be unique, even in small groups, let alone in millions of records. However, for the sake of this example, we will assume that the name is unique. 
Another reason behind assuming the name's uniqueness is that we want to increase our understanding gradually. So, the key point here is that we picked the first name as the row's key for now, but we will improve on this as we learn more. Next, we have two column groups. The personal column group holds all the personal attributes of the employees, and the other column family named professional has all the other attributes pertaining to the professional aspects. When referring to a column within a family, the notation is family:column. So, personal:age contains the age of the employees. If you look at professinal:designation and personal:age, it seems that the first one's contents are strings, while the second one stores integers. That's false. No column stores anything but just plain bytes without any distinction of what they mean. The meaning and interpretation of these bytes is up to the user of the data. From the point of view of BigTable', each column just contains plain old bytes. Another thing that is drastically different from RDBMS is such as MySQL is that each row need not have the same number of columns. Each row can adopt the layout that they want. So, the second row's personal column family can have two more columns that store gender and nationality. For this particular example, the data is in no particular order, and I wrote it down as it came to my mind. Hence, there's no order of any sort in the data at all. To summarize, BigTable is a key-value storage where keys should be unique and have a length that is less than or equal to 64 KB. The columns are divided into column families, which can be created at the time of defining the table, but each column family might have hundreds of columns created as and when needed. Also, contents have no data type and comprise just plain old bytes. There's one minor detail left, which is not important as regards our purpose. However, for the sake of the completeness of the BigTable's data model, I will mention it now. Each value of the column is stored with a timestamp that is accurate to the microseconds, and in this way, multiple versions of a column value are available. The number of last versions that should be kept is something that is configurable at the table level, but since we are not going to deal with BigTable directly, this detail is not important to us. How data is stored? Now that we know about row keys, column families, and columns, we will gradually move towards examining this data model in detail and understand how the data is actually stored. We will examine the logical storage and then dive into the actual structure, as it ends up on the disk. The data that we presented in the earlier table had no order and were listed as they came to my mind. However, while storing, the data is always sorted by the row key. So now, the data will actually be stored like this: personal professional Key(name) personal:lastname personal:age professinal:company professional:designation Kim Yong 32 Sony Director Mohsin Hijazee 29 Sony Senior Designer Paul Jefferson 39 LG Sales Head Peter Smith 34 Panasonic General Manager Ricky Martin 45 Panasonic CTO OK, so what happened here? The name column indicates the key of the table and now, the whole table is sorted by the key. That's exactly how it is stored on the disk as well. 'An important thing about sorting is lexicographic sorting and not semantic sorting. By lexicographic, we mean that they are sorted by the byte value and not by the textness or the semantic sort. 
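A quick illustration of byte-value ordering, sketched in Python rather than BigTable itself: sorting byte strings orders them purely by their byte values, which is exactly how the row keys above end up on disk.

# Keys are ordered by raw byte value, not by any linguistic collation.
keys = [b"Mohsin", b"Kim", b"Ricky", b"Paul", b"Peter"]
print(sorted(keys))
# [b'Kim', b'Mohsin', b'Paul', b'Peter', b'Ricky']

# Byte-value order is not dictionary order: every uppercase letter sorts
# before every lowercase letter because, e.g., ord('Z') == 90 < ord('a') == 97.
print(sorted([b"apple", b"Zebra"]))   # [b'Zebra', b'apple']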
This matters because even within the Latin character set, different languages have different sort orders for letters, such as letters in English versus German and French. However, all of this and the Unicode collation order isn't valid here. It is just sorted by byte values. In our instance, since K has a smaller byte value (because K has a lower ASCII/Unicode value) than letter M, it comes first. Now, suppose that some European language considers and sorts M before K. That's not how the data would be laid out here, because it is a plain, blind, and simple sort. The data is sorted by the byte value, with no regard for the semantic value. In fact, for BigTable, this is not even text. It's just a plain string of bytes. Just a hint. This order of keys is something that we will exploit when modeling data. How? We'll see later. The Physical storage Now that we understand the logical data model and how it is organized, it's time to take a closer look at how this data is actually stored on the disk. On a physical disk, the stored data is sorted by the key. So, key 1 is followed by its respective value, key 2 is followed by its respective value, and so on. At the end of the file, there's a sorted list of just the keys and their offset in the file from the start, which is something like the block to the right: Ignore the block on your left that is labeled Index. We will come back to it in a while. This particular format actually has a name SSTable (String Storage Table) because it has strings (the keys), and they are sorted. It is of course tabular data, and hence the name. Whenever your data is sorted, you have certain advantages, with the first and foremost advantage being that when you look up for an item or a range of items, 'your dataset is sorted. We will discuss this in detail later in this article. Now, if we start from the beginning of the file and read sequentially, noting down every key and then its offset in a format such as key:offset, we effectively create an index of the whole file in a single scan. That's where the first block to your left in the preceding diagram comes from. Since the keys are sorted in the file, we simply read it sequentially till the end of the file, hence effectively creating an index of the data. Furthermore, since this index only contains keys and their offsets in the file, it is much smaller in terms of the space it occupies. Now, assuming that SSTable has a table that is, say, 500 MB in size, we only need to load the index from the end of the file into the memory, and whenever we are asked for a key or a range of keys, we just search within a memory index (thus not touching the disk at all). If we find the data, only then do we seek the disk at the given offset because we know the offset of that particular key from the index that we loaded in the memory. Some limitations Pretty smart, neat, and elegant, you would say! Yes it is. However, there's a catch. If you want to create a new row, key must come in a sorted order, and even if you are sure about where exactly this key should be placed in the file to avoid the need to sort the data, you still need to rewrite the whole file in a new, sorted order along with the index. Hence, large amounts of I/O are required for just a single row insertion. The same goes for deleting a row because now, the file should be sorted and rewritten again. Updates are OK as long as the key itself is not altered because, in that case, it is sort of having a new key altogether. 
This is so because a modified key would have a different place in the sorted order, depending on what the key actually is. Hence, the whole file would be rewritten. Just for an example, say you have a row with the key as all-boys, and then you change the key of that row to x-rays-of-zebra. Now, you will see that after the new modification, the row will end up at nearly the end of the file, whereas previously, it was probably at the beginning of the file because all-boys comes before x-rays-of-zebra when sorted. This seems pretty limiting, and it looks like inserting or removing a key is quite expensive. However, this is not the case, as we will see later. Random writes and deletion There's one last thing that's worth a mention before we examine the operations that are available on a BigTable. We'd like to examine how random writes and the deletion of rows are handled because that seems quite expensive, as we just examined in the preceding section. The idea is very simple. All the read, writes, and removals don't go straight to the disk. Instead, an in-memory SSTable is created along with its index, both of which are empty when created. We'll call it MemTable from this point onwards for the sake of simplicity. Every read checks the index of this table, and if a record is found from here, it's well and good. If it is not, then the index of the SSTable on the disk is checked and the desired row is returned. When a new row has to be read, we don't look at anything and simply enter the row in the MemTable along with its record in the index of this MemTable. To delete a key, we simply mark it deleted in the memory, regardless of whether it is in MemTable or in the on disk table. As shown here the allocation of block into Mem Table: Now, when the size of the MemTable grows up to a certain size, it is written to the disk as a new SSTable. Since this only depends on the size of the MemTable and of course happens much infrequently, it is much faster. Each time the MemTable grows beyond a configured size, it is flushed to the disk as a new SSTable. However, the index of each flushed SSTable is still kept in the memory so that we can quickly check the incoming read requests and locate it in any table without touching the disk. Finally, when the number of SSTables reaches a certain count, the SSTables are merged and collapsed into a single SSTable. Since each SSTable is just a sorted set of keys, a merge sort is applied. This merging process is quite fast. Congratulations! You've just learned the most atomic storage unit in BigData solutions such as BigTable, Hbase, Hypertable, Cassandara, and LevelDB. That's how they actually store and process the data. Now that we know how a big table is actually stored on the disk and how the read and writes are handled, it's time to take a closer look at the available operations. Operations on BigTable Until this point, we know that a BigTable table is a collection of rows that have unique keys up to 64 KB in length and the data is stored according to the lexicographic sort order of the keys. We also examined how it is laid out on the disk and how read, writes, and removals are handled. Now, the question is, which operations are available on this data? The following are the operations that are available to us: Fetching a row by using its key Inserting a new key Deleting a row Updating a row Reading a range of rows from the starting row key to the ending row key Reading Now, the first operation is pretty simple. You have a key, and you want the associated row. 
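The read, write, and delete flow just described can be modeled in a few lines. The following Python sketch is a toy illustration, not BigTable code, and all the names are invented: mutations land in an in-memory table, deletes become markers, and a flush writes the surviving rows to disk in key order while building an index of file offsets.

TOMBSTONE = object()   # marks a key as deleted until the next flush/merge

class MemTable:
    def __init__(self):
        self.rows = {}

    def put(self, key, value):
        self.rows[key] = value            # purely in-memory, no disk I/O

    def delete(self, key):
        self.rows[key] = TOMBSTONE        # just a marker; nothing is rewritten

    def get(self, key):
        value = self.rows.get(key)
        return None if value is TOMBSTONE else value

    def flush(self, path):
        """Write live rows in key order; return {key: offset} as the index."""
        index = {}
        with open(path, "wb") as f:
            for key in sorted(self.rows):
                value = self.rows[key]
                if value is TOMBSTONE:
                    continue              # deleted rows never reach the disk
                index[key] = f.tell()
                f.write(key.encode() + b"\t" + value.encode() + b"\n")
        self.rows.clear()
        return index

mem = MemTable()
mem.put("Kim", "Sony, Director")
mem.put("Mohsin", "Sony, Senior Designer")
mem.delete("Kim")
print(mem.flush("employees.sst"))         # only 'Mohsin' makes it to disk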
Since the whole data set is sorted by the key, all we need to do is perform a binary search on it, and you'll be able to locate your desired row within a few lookups, even within a set of a million rows. In practice, the index at the end of the SSTable is loaded in the memory, and the binary search is actually performed on it. If we take a closer look at this operation in light of what we know from the previous section, the index is already in the memory of the MemTable that we saw in the previous section. In case there are multiple SSTables because MemTable was flushed many times to the disk as it grew too large, all the indexes of all the SSTables are present in the memory, and a quick binary search is performed on them. Writing The second operation that is available to us is the ability to insert a new row. So, we have a key and the values that we want to insert in the table. According to our new knowledge about physical storage and SSTables, we can understand this very well. The write directly happens on the in-memory MemTable and its index is updated, which is also in the memory. Since no disk access is required to write the row as we are writing in memory, the whole file doesn't have to be rewritten on disk, because yet again, all of it is in the memory. This operation is very fast and almost instantaneous. However, if the MemTable grows in size, it will be flushed to the disk as a new SSTable along with the index while retaining a copy of its index in the memory. Finally, we also saw that when the number of SSTables reaches a certain number, they are merged and collapsed to form a new, bigger table. Deleting It seems that since all the keys are in a sorted order on the disk and deleting a key would mean disrupting the sort order, a rewrite of the whole file would be a big I/O overhead. However, it is not, as it can be handled smartly. Since all the indexes, including the MemTable and the tables that were the result of flushing a larger MemTable to the disk, are already in the memory, deleting a row only requires us to find the required key in the in-memory indexes and mark it as deleted. Now, whenever someone tries to read the row, the in-memory indexes will be checked, and although an entry will be there, it will be marked as deleted and won't be returned. When MemTable is being flushed to the disk or multiple tables are being collapsed, this key and the associated row will be excluded in the write process. Hence, they are totally gone from the storage. Updating Updating a row is no different, but it has two cases. The first case is in which not only the values, but also the key is modified. In this case, it is like removing the row with an old key and inserting a row with a new key. We already have seen both of these cases in detail. So, the operation should be obvious. However, the case where only the values are modified is even simpler. We only have to locate the row from the indexes, load it in the memory if it is not already there, and modify. That's all. Scanning a range This last operation is quite interesting. You can scan a range of keys from a starting key to an ending key. For instance, you can return all the rows that have a key greater than or equal to key1 and less than or equal to key2, effectively forming a range. Since the looking up of a single key is a fast operation, we only have to locate the first key of the range. 
Then, we start reading the consecutive keys one after the other till we encounter a key that is greater than key2, at which point, we will stop the scanning, and the keys that we scanned so far are our query's result. This is how it looks like: Name Department Company Chris Harris Research & Development Google Christopher Graham Research & Development LG Debra Lee Accounting Sony Ernest Morrison Accounting Apple Fred Black Research & Development Sony Janice Young Research & Development Google Jennifer Sims Research & Development Panasonic Joyce Garrett Human Resources Apple Joyce Robinson Research & Development Apple Judy Bishop Human Resources Google Kathryn Crawford Human Resources Google Kelly Bailey Research & Development LG Lori Tucker Human Resources Sony Nancy Campbell Accounting Sony Nicole Martinez Research & Development LG Norma Miller Human Resources Sony Patrick Ward Research & Development Sony Paula Harvey Research & Development LG Stephanie Chavez Accounting Sony Stephanie Mccoy Human Resources Panasonic In the preceding table, we said that the starting key will be greater than or equal to Ernest and ending key will be less than or equal to Kathryn. So, we locate the first key that is greater than or equal to Ernest, which happens to be Ernest Morrison. Then, we start scanning further, picking and returning each key as long as it is less than or equal to Kathryn. When we reach Judy, it is less than or equal to Kathryn, but Kathryn isn't. So, this row is not returned. However, the rows before this are returned. This is the last operation that is available to us on BigTable. Selecting a key Now that we have examined the data model and the storage layout, we are in a better position to talk about the key selection for a table. As we know that the stored data is sorted by the key, it does not impact the writing, deleting, and updating to fetch a single row. However, the operation that is impacted by the key is that of scanning a range. Let's think about the previous table again and assume that this table is a part of some system that processes payrolls for companies, and the companies pay us for the task of processing their payroll. Now, let's suppose that Sony asks us to process their data and generate a payroll for them. Right now, we cannot do anything of this kind. We can just make our program scan the whole table, and hence all the records (which might be in millions), and only pick the records where job:company has the value of Sony. This would be inefficient. Instead, what we can do is put this sorted nature of row keys to our service. Select the company name as the key and concatenate the designation and name along with it. 
So, with the concatenated key, the new table looks like this:

Key | Name | Department | Company
Apple-Accounting-Ernest Morrison | Ernest Morrison | Accounting | Apple
Apple-Human Resources-Joyce Garrett | Joyce Garrett | Human Resources | Apple
Apple-Research & Development-Joyce Robinson | Joyce Robinson | Research & Development | Apple
Google-Human Resources-Judy Bishop | Judy Bishop | Human Resources | Google
Google-Human Resources-Kathryn Crawford | Kathryn Crawford | Human Resources | Google
Google-Research & Development-Chris Harris | Chris Harris | Research & Development | Google
Google-Research & Development-Janice Young | Janice Young | Research & Development | Google
LG-Research & Development-Christopher Graham | Christopher Graham | Research & Development | LG
LG-Research & Development-Kelly Bailey | Kelly Bailey | Research & Development | LG
LG-Research & Development-Nicole Martinez | Nicole Martinez | Research & Development | LG
LG-Research & Development-Paula Harvey | Paula Harvey | Research & Development | LG
Panasonic-Human Resources-Stephanie Mccoy | Stephanie Mccoy | Human Resources | Panasonic
Panasonic-Research & Development-Jennifer Sims | Jennifer Sims | Research & Development | Panasonic
Sony-Accounting-Debra Lee | Debra Lee | Accounting | Sony
Sony-Accounting-Nancy Campbell | Nancy Campbell | Accounting | Sony
Sony-Accounting-Stephanie Chavez | Stephanie Chavez | Accounting | Sony
Sony-Human Resources-Lori Tucker | Lori Tucker | Human Resources | Sony
Sony-Human Resources-Norma Miller | Norma Miller | Human Resources | Sony
Sony-Research & Development-Fred Black | Fred Black | Research & Development | Sony
Sony-Research & Development-Patrick Ward | Patrick Ward | Research & Development | Sony

We simply welded the company, department, and name together as the key, and since the table is always sorted by the key, this is how it is laid out. Now, suppose we receive a request from Google to process their data. All we have to do is perform a scan starting from a key greater than or equal to Google and less than L, because L is the next letter. The next request is more specific: Sony asks us to process their data, but only for their accounting department. How do we do that? Quite simply: the starting key is greater than or equal to Sony-Accounting, and the ending key can be Sony-Accountinga, where the trailing a marks the end of the range, so only the Sony accounting rows are scanned and returned.

BigTable – a hands-on approach

Okay, enough of the theory. It is time to take a break and do some hands-on experimentation. By now, we know about 80 percent of BigTable; the remaining 20 percent of the complexity lies in scaling it to more than one machine. Our discussion so far assumed a single-machine environment, as if the BigTable table were sitting on your laptop. You might really want to experiment with what you have learned, and fortunately, given that you have a recent version of Google Chrome or Mozilla Firefox, that is easy: you have something very much like BigTable right there. How? Let me explain. The ideas that we looked at, pertaining to the sorted key-value storage layout, the indexes of the sorted files, and the operations performed on them, including scanning, were extracted by Google into a separate component called LevelDB. Meanwhile, as HTML evolved towards HTML5, a need was felt to store data locally in the browser. Initially, SQLite3 was embedded in browsers, and there was a querying interface for you to play with.
So, all in all, you had an SQL database in the browser, which opened up a lot of possibilities. However, the W3C later deprecated this specification and urged browser vendors not to implement it. Instead of web databases based on SQLite3, browsers now ship databases based on LevelDB, which are key-value stores in which storage is always sorted by key; besides looking up a key, you can scan across a range of keys. Covering the IndexedDB API here would be beyond the scope of this book, but if you want to understand it and see what the theory we discussed looks like in practice, you can try IndexedDB in your browser by visiting http://code.tutsplus.com/tutorials/working-with-indexeddb--net-34673. The concepts of keys and of scanning key ranges are exactly like those we examined here for BigTable, and the concepts about indexes map onto what we will examine in a later section about datastore.

Scaling BigTable to BigData

By now, you have probably understood the data model of BigTable, how it is laid out on disk, and the advantages it offers. To recap once again: a BigTable installation may have many tables, each table may have many column families that are defined at the time the table is created, and each column family may have as many columns as required. Rows are identified by keys, which have a maximum length of 64 KB, and the stored data is sorted by key. We can read, update, and delete a single row, and we can scan a range of rows from a starting key to an ending key. So now the question is, how does this scale? We will give a very high-level overview, neglecting the micro details, to keep things simple and to build a mental model that is useful to us as consumers of BigTable; after all, we are not supposed to clone BigTable's implementation. As we saw earlier, the basic storage unit in BigTable is a file format called SSTable, which stores key-value pairs sorted by key and has an index at its end. We also examined how reads, writes, and deletes work on an in-memory copy of the table that is periodically merged with the tables on disk, and that when the SSTables flushed to disk reach a certain configurable count, they are merged into a bigger table. The view so far covers the data model, its physical layout, and how operations work when the data resides on a single machine, such as your laptop holding a telephone directory for the whole of Europe. However, how does this work at larger scales? Neglecting the minor implementation details and the complexities that arise in distributed systems, the overall architecture and working principles are simple. On a single machine, there is only one SSTable file (or a few, if they have not yet been merged into one) to take care of, and all operations are performed on it. If this file does not fit on a single machine, we have to add another machine, and half of the SSTable will reside on one machine while the other half resides on the other. This split means that each machine holds a range of keys. For instance, if we have 1 million keys (that look like key1, key2, key3, and so on), then the keys from key1 to key500000 might be on one machine, while the keys from key500001 to key1000000 are on the second machine, as the sketch after this paragraph illustrates.
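Here is a small Python sketch of that split. It only illustrates the idea of assigning contiguous key ranges (tablets) to machines; the key format, the zero-padding (added so that keys compare correctly as strings), and the server addresses are assumptions made for the example.

# Hypothetical key-range split: each tablet is (ending_key, server).
def key(n):
    return 'key%07d' % n   # zero-padded so string order matches numeric order

TABLETS = [
    (key(500000),  '192.168.0.2'),   # keys up to key500000
    (key(1000000), '192.168.0.3'),   # keys from key500001 to key1000000
]

def server_for(k):
    # The first tablet whose ending key is >= k holds the row.
    for ending_key, server in TABLETS:
        if k <= ending_key:
            return server
    raise KeyError(k)

print(server_for(key(489087)))   # 192.168.0.2
print(server_for(key(600000)))   # 192.168.0.3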
So, we can say that each machine holds a different key range of the same table. Although the data resides on two different machines, it is still a single table that sprawls over both of them; these partitions, or separate parts, are called tablets. We will keep this system to only two machines and 1 million rows for the sake of discussion, but there are cases with about 20 billion keys sprawling over some 12,000 machines, each machine holding a different range of keys. Let's continue with our small cluster of just two nodes. The problem now is that an external user has no knowledge of which machine holds which portion of the SSTable (and hence which key ranges), so how can a key, say key489087, be located? For this, we add something like a telephone directory, where we look up the table name and the desired key and find out which machine to contact to get the data associated with that key. So, we add another node, called the master. The master again contains a plain SSTable, which is familiar to us, but its key-value pairs are special: since this table contains data about the other BigTable tables, it is called the METADATA table. In the METADATA table, we adopt the following format for the keys:

tablename_ending-row-key

Since we have only two machines and each machine holds one tablet, the METADATA table looks like this:

Key | Value
employees_key500000 | 192.168.0.2
employees_key1000000 | 192.168.0.3

The master stores the location of each tablet server under a row key that is the encoding of the table name and the ending row of the tablet, so the right tablet can be found with a scan. The master assigns tablets to different machines when required; each tablet is about 100 MB to 200 MB in size. So, if we want to fetch a key, all we need to know is the following:

The location of the master server
The table in which we are looking for the key
The key itself

We concatenate the table name with the key and perform a scan on the METADATA table on the master node. Suppose we are looking for key600000 in the employees table. We would actually look for the key employees_key600000 in the table on the master machine. As you are familiar with the scan operation on an SSTable (and METADATA is just an SSTable), we look for a key that is greater than or equal to employees_key600000, which happens to be employees_key1000000. Against that key, the IP address 192.168.0.3 is listed, which means this is the machine we should connect to in order to fetch our data. We said keys and not key because this is a range scan operation, which becomes clearer with another example. Suppose we want to process the rows with keys from key400000 to key800000. Looking at the distribution of data across the machines, half of the required range is on one machine and the other half is on the other. When we consult the METADATA table, two rows are returned, because key400000 is less than key500000 (the ending row key of the first machine's tablet) and key800000 is less than key1000000 (the ending row key of the second machine's tablet). With these two rows returned, we have two locations to fetch our data from; the sketch after this paragraph shows the lookup in code. This also leads to an interesting side-effect, described next.
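The following Python sketch shows this METADATA lookup. The table contents and IP addresses come from the example above; the zero-padding of the numeric part of the keys is an assumption added so that string comparison matches numeric order.

import bisect

def key(n):
    return 'key%07d' % n

# METADATA rows: 'tablename_ending-row-key' -> tablet server, kept sorted by key.
METADATA = [
    ('employees_' + key(500000),  '192.168.0.2'),
    ('employees_' + key(1000000), '192.168.0.3'),
]
META_KEYS = [k for k, _ in METADATA]

def tablets_for_range(table, start_key, end_key):
    """Return the servers holding the tablets that cover [start_key, end_key]."""
    lo = bisect.bisect_left(META_KEYS, table + '_' + start_key)
    hi = bisect.bisect_left(META_KEYS, table + '_' + end_key)
    return [server for _, server in METADATA[lo:hi + 1]]

# Single key: key600000 falls in the tablet ending at key1000000.
print(tablets_for_range('employees', key(600000), key(600000)))   # ['192.168.0.3']

# The range key400000..key800000 spans both tablets, so both servers are returned.
print(tablets_for_range('employees', key(400000), key(800000)))   # ['192.168.0.2', '192.168.0.3']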
As the data resides on two different machines, it can be read or processed in parallel, which improves system performance. This is one reason why, even with larger datasets, the performance of BigTable does not deteriorate as badly as it would on a single, large machine holding all the data.

The datastore thyself

Until now, everything we talked about concerned BigTable, and we did not mention datastore at all. Now is the time to look at datastore in detail, because we understand BigTable quite well. Datastore is effectively a solution built on top of BigTable as a persistent NoSQL layer for Google App Engine. Since a BigTable installation can hold different tables, the data for all applications is stored in six separate tables, where each table stores a different aspect of the data. Don't worry about memorizing the data modeling details for now, as we will look into them in greater detail later. The fundamental unit of storage in datastore is called a property. You can think of a property as a column: it has a name and a type. You group multiple properties into a Kind, which effectively is a Python class and analogous to a table in the RDBMS world. Here's a pseudo code sample:

# 1. Define our Kind and what it looks like.
class Person(object):
    name = StringProperty()
    age = IntegerProperty()

# 2. Create entities of kind Person.
ali = Person(name='Ali', age=24)
bob = Person(name='Bob', age=34)
david = Person(name='David', age=44)
zain = Person(name='Zain', age=54)

# 3. Save them.
ali.put()
bob.put()
david.put()
zain.put()

This looks a lot like an ORM such as Django's ORM, SQLAlchemy, or Rails' ActiveRecord. The Person class is called a Kind in App Engine's terminology, and the StringProperty and IntegerProperty property classes indicate the type of data that is to be stored. Each instance of the Person class, such as ali, is called an entity in App Engine's terminology. Each entity, when stored, has a key that is not only unique throughout your application but, because it is combined with your application ID, is unique across all the applications hosted on Google App Engine. All entities of all kinds for all apps are stored in a single BigTable, and they are stored in such a way that all the property values are serialized and stored in a single BigTable column; no separate columns are defined for each property. This is interesting, and necessary as well, because as Google App Engine's architects we would not know in advance which Kinds people are going to define or the number and types of properties they would use, so it makes sense to serialize the whole thing and store it in one column. So, this is how it looks:

Key | Kind | Data
agtkZXZ-bWdhZS0wMXIQTXIGUGVyc29uIgNBbGkM | Person | {name: 'Ali', age: 24}
agtkZXZ-bWdhZS0wMXIPCxNTVVyc29uIgNBbGkM | Person | {name: 'Bob', age: 34}
agtkZXZ-bWdhZS0wMXIPCxIGUGVyc29uIgNBbBQM | Person | {name: 'David', age: 44}
agtkZXZ-bWdhZS0wMXIPCxIGUGVyc29uIRJ3bGkM | Person | {name: 'Zain', age: 54}

The key appears to be random, but it is not: it is formed by concatenating your application ID, the Kind name (Person here), and either a unique identifier that is auto-generated by Google App Engine or a string supplied by you. The key looks cryptic, but it is not safe to pass it around in public, as someone might decode it and take advantage of it; the little sketch below shows why.
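Here is a small Python sketch of the idea. The real App Engine key encoding is a serialized protocol-buffer format, not this simple pipe-joined string; the layout below is invented purely to show that a base64-encoded key carries its contents in the open.

import base64

def make_key(app_id, kind, entity_id):
    # Invented layout: the real datastore key is a serialized protocol buffer.
    raw = '%s|%s|%d' % (app_id, kind, entity_id)
    return base64.urlsafe_b64encode(raw.encode()).decode()

def decode_key(key):
    return base64.urlsafe_b64decode(key.encode()).decode().split('|')

k = make_key('myapp', 'Person', 42)
print(k)              # e.g. bXlhcHB8UGVyc29ufDQy
print(decode_key(k))  # ['myapp', 'Person', '42'], so anyone can read it back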
Basically, it is just base64 encoded and can easily be decoded to reveal the entity's Kind name and ID. A better way is to encrypt it using a secret key before passing it around in public, and to decrypt it with the same key when it comes back. A gist that serves this purpose is available on GitHub at https://gist.github.com/mohsinhijazee/07cdfc2826a565b50a68. For it to work, you need to edit your app.yaml file so that it includes the following:

libraries:
- name: pycrypto
  version: latest

Then you can call the encrypt() method on the key while passing it around and decrypt it back using the decrypt() method, as follows:

person = Person(name='peter', age=10)
key = person.put()
url_safe_key = key.urlsafe()
safe_to_pass_around = encrypt(SECRET_KEY, url_safe_key)

Now, when you receive a key from the outside, you should first decrypt it and then use it, as follows:

key_from_outside = request.params.get('key')
url_safe_key = decrypt(SECRET_KEY, key_from_outside)
key = ndb.Key(urlsafe=url_safe_key)
person = key.get()

The key object is now good for use. To summarize: get the URL-safe key by calling the ndb.Key.urlsafe() method and encrypt it so that it can be passed around; on the way back, do the reverse. If you really want to see how the encrypt and decrypt operations are implemented, they are reproduced as follows without any documentation/comments, as cryptography is not our main subject:

import os
import base64
from Crypto.Cipher import AES

BLOCK_SIZE = 32
PADDING = '#'

def _pad(data, pad_with=PADDING):
    return data + (BLOCK_SIZE - len(data) % BLOCK_SIZE) * pad_with

def encrypt(secret_key, data):
    cipher = AES.new(_pad(secret_key, '@')[:32])
    return base64.b64encode(cipher.encrypt(_pad(data)))

def decrypt(secret_key, encrypted_data):
    cipher = AES.new(_pad(secret_key, '@')[:32])
    return cipher.decrypt(base64.b64decode(encrypted_data)).rstrip(PADDING)

KEY = 'your-key-super-duper-secret-key-here-only-first-32-characters-are-used'
encrypted = encrypt(KEY, 'Hello, world!')
print encrypted
print decrypt(KEY, encrypted)

More explanation of how this works is given at https://gist.github.com/mohsinhijazee/07cdfc2826a565b50a68. Now, let's come back to our main subject, datastore. As you have seen, all the data is stored in a single column, so if we want to query something, for instance the people who are older than 25, we have no way to do it in the current scheme. How does this work then? Let's examine this next.

Supporting queries

Now, what if we want to get all the people who are older than, say, 30? With the data serialized and dumped as shown in the previous table, this does not seem doable. Datastore solves this problem by putting the sorted values to be queried upon into keys. Here, we want to query by age, so datastore creates a record for each entity in another table called the index table. This index table is nothing but a plain BigTable in which the row keys are built from the property value that you want to query, so a quick lookup and scan are possible, as the sketch below illustrates.
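Here is a Python sketch of how such a query can ride on the index table. The index row key layout (appid-kind-property-value) follows the example table in the next paragraph; the zero-padding of the age values is an assumption added so that numbers compare correctly as strings, and the short entity key names stand in for the long encoded keys shown in the table.

import bisect

# Index rows: 'appid-kind-property-paddedvalue' -> entity key, kept sorted by row key.
AGE_INDEX = [
    ('Myapp-person-age-024', 'key_ali'),
    ('Myapp-person-age-034', 'key_bob'),
    ('Myapp-person-age-044', 'key_david'),
    ('Myapp-person-age-054', 'key_zain'),
]
INDEX_KEYS = [k for k, _ in AGE_INDEX]

def people_older_than(age):
    # 'age > N' becomes a range scan starting just after N on the sorted index.
    start = 'Myapp-person-age-%03d' % (age + 1)
    i = bisect.bisect_left(INDEX_KEYS, start)
    return [entity_key for _, entity_key in AGE_INDEX[i:]]

print(people_older_than(30))   # ['key_bob', 'key_david', 'key_zain']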
Here is how the index table would look:

Key | Entity key
Myapp-person-age-24 | agtkZXZ-bWdhZS0wMXIQTXIGUGVyc29uIgNBbGkM
Myapp-person-age-34 | agtkZXZ-bWdhZS0wMXIPCxNTVVyc29uIgNBbGkM
Myapp-person-age-44 | agtkZXZ-bWdhZS0wMXIPCxIGUGVyc29uIgNBbBQM
Myapp-person-age-54 | agtkZXZ-bWdhZS0wMXIPCxIGUGVyc29uIRJ3bGkM

Implementation details

So, all in all, datastore builds a NoSQL solution on top of BigTable by using the following six tables:

A table to store entities
A table to store entities by kind
A table to store indexes for the property values in ascending order
A table to store indexes for the property values in descending order
A table to store indexes for multiple properties together
A table to keep track of the next unique ID for each Kind

Let us look at each table in turn. The first table stores the entities of all the applications; we examined it in the earlier example. The second table stores entities by Kind, and it is simply metadata that datastore maintains for itself. Think about it: if you want to get all the entities of the Person Kind, the entities table alone, with the operations available on a BigTable table, gives you no way to fetch them. This second table does exactly that, and it looks like this:

Key | Entity key
Myapp-Person-agtkZXZ-bWdhZS0wMXIQTXIGUGVyc29uIgNBbGkM | agtkZXZ-bWdhZS0wMXIQTXIGUGVyc29uIgNBbGkM
Myapp-Person-agtkZXZ-bWdhZS0wMXIQTXIGUGVyc29uIgNBb854 | agtkZXZ-bWdhZS0wMXIQTXIGUGVyc29uIgNBb854
Myapp-Person-agtkZXZ-bWdhZS0wMXIQTXIGUGVy748IgNBbGkM | agtkZXZ-bWdhZS0wMXIQTXIGUGVy748IgNBbGkM

As you can see, this is just a simple BigTable table where the keys follow the [app ID]-[Kind name]-[entity key] pattern. Tables 3, 4, and 5 from the list above are similar to the index table that we examined in the Supporting queries section. This leaves us with the last table. As you know, each stored entity needs a unique key, and since all the entities from all apps are stored in a single table, the keys have to be unique across the whole table. When datastore generates a key for an entity that has to be stored, it combines your application ID and the Kind name of the entity. That much of the key only distinguishes your entities of that Kind from everything else in the table; it does not distinguish your own entities from one another. For that, you need a number appended to it. This is exactly like AUTO INCREMENT in the RDBMS world, where the value of a column is automatically incremented to ensure that it is unique. That is what the last table is for: it keeps track of the last ID used for each Kind of each application, and it looks like this:

Key | Next ID
Myapp-Person | 65

In this table, the key follows the [application ID]-[Kind name] format, and the value is the next ID, which is 65 in this particular case. When a new entity of Kind Person is created, it is assigned 65 as its ID, and the row is updated to hold 66. Our application has only one Kind defined, Person, so there is only one row in this table, because we only keep track of the next ID for that Kind. If we had another Kind, say Group, it would have its own row in this table.

Summary

We started this article with the problem of storing huge amounts of data, processing it in bulk, and randomly accessing it.
This arose from our ambition to store every single web page on earth and process it to extract results from it. We introduced a solution called BigTable and examined its data model. We saw that in BigTable we can define multiple tables, each with multiple column families, which are defined at the time of creating the table. We learned that column families are logical groupings of columns and that new columns can be defined in a column family as needed. We also learned that the data stored in BigTable has no meaning of its own; it is stored as plain bytes, and its interpretation and meaning depend on the user of the data. We also learned that each row in BigTable has a unique row key, which has a maximum length of 64 KB. Lastly, we turned our attention to datastore, a NoSQL storage solution built on top of BigTable for Google App Engine. We briefly covered datastore terminology such as properties (columns), entities (rows), and kinds (tables). We learned that all data is stored across six different BigTable tables, each capturing a different aspect of the data. Most importantly, we learned that all the entities of all the apps hosted on Google App Engine are stored in a single BigTable and that all properties go into a single BigTable column. We also learned how querying is supported by additional index tables that are keyed by the property values and list the corresponding entity keys. This concludes our discussion of Google App Engine's datastore, its underlying technology, how it works, and the related concepts. Next, we will learn how to model our data on top of datastore. What we learned in this chapter will help us enormously in understanding how to better model our data to take full advantage of the underlying mechanisms. Resources for Article: Further resources on this subject: Google Guice [article] The EventBus Class [article] Integrating Google Play Services [article]

Introducing Bayesian Inference

Packt
14 Sep 2015
29 min read
In this article by Dr. Hari M. Kudovely, the author of Learning Bayesian Models with R, we will look at Bayesian inference in depth. The Bayes theorem is the basis for updating beliefs or model parameter values in Bayesian inference, given the observations. In this article, a more formal treatment of Bayesian inference will be given. To begin with, let us try to understand how uncertainties in a real-world problem are treated in the Bayesian approach. (For more resources related to this topic, see here.)

Bayesian view of uncertainty

Classical or frequentist statistics typically takes the view that any physical process generating data containing noise can be modeled by a stochastic model with fixed values of parameters. The parameter values are learned from the observed data through procedures such as maximum likelihood estimation. The essential idea is to search the parameter space for the parameter values that maximize the probability of observing the data seen so far. Neither the uncertainty in the estimation of model parameters from data, nor the uncertainty in the model itself that explains the phenomenon under study, is dealt with in a formal way. The Bayesian approach, on the other hand, treats all sources of uncertainty using probabilities. Therefore, neither the model that explains an observed dataset nor its parameters are fixed; both are treated as uncertain variables. Bayesian inference provides a framework to learn the entire distribution of model parameters, not just the values that maximize the probability of observing the given data. The learning can come both from the evidence provided by observed data and from domain knowledge supplied by experts. There is also a framework to select the best model among the family of models suited to explain a given dataset. Once we have the distribution of model parameters, we can eliminate the effect of the uncertainty of parameter estimation on the future values of a random variable predicted using the learned model. This is done by averaging over the model parameter values through marginalization of the joint probability distribution. Consider the joint probability distribution of N random variables again, written as P(X1, X2, ..., XN | θ, m). This time, we have added one more term, m, to the argument of the probability distribution, in order to indicate explicitly that the parameters θ are generated by the model m. Then, according to the Bayes theorem, the probability distribution of the model parameters, conditioned on the observed data X = {X1, X2, ..., XN} and model m, is given by:

P(θ | X, m) = P(X | θ, m) P(θ | m) / P(X | m)

Formally, the term on the LHS of the equation, P(θ | X, m), is called the posterior probability distribution. The second term appearing in the numerator of the RHS, P(θ | m), is called the prior probability distribution. It represents the prior belief about the model parameters, before observing any data, say, from domain knowledge. Prior distributions can also have parameters, and these are called hyperparameters. The term P(X | θ, m) is the likelihood of model m explaining the observed data. Since P(X | m) does not depend on θ, it can be treated as a normalization constant. The preceding equation can be rewritten in an iterative form as follows:

P(θ | X1:n, m) ∝ P(Xn | θ, m) P(θ | X1:n-1, m)

Here, Xn represents the values of observations that are obtained at time step n, P(θ | X1:n-1, m) is the parameter distribution updated until time step n - 1, and P(θ | X1:n, m) is the model parameter distribution updated after seeing the observations Xn at time step n.
Casting the Bayes theorem in this iterative form is useful for online learning, and it suggests the following:

Model parameters can be learned in an iterative way as more and more data or evidence is obtained
The posterior distribution estimated using the data seen so far can be treated as a prior when the next set of observations arrives
Even if no data is available, one could make predictions based on a prior distribution created using domain knowledge alone

To make these points clear, let's take a simple illustrative example. Consider the case where one is trying to estimate the distribution of the height of males in a given region. The data used for this example is the height measurement in centimeters obtained from M volunteers sampled randomly from the population. We assume that the heights are distributed according to a normal distribution with mean μ and variance σ²:

P(x | μ, σ²) = N(x | μ, σ²)

As mentioned earlier, in classical statistics one tries to estimate the values of μ and σ² from the observed data; apart from the best estimate for each parameter, one can also determine an error term for the estimate. In the Bayesian approach, on the other hand, μ and σ² are also treated as random variables. For simplicity, let's assume σ² is a known constant, and let's assume that the prior distribution for μ is a normal distribution with (hyper)parameters μ0 and σ0². The posterior distribution of μ is then proportional to the product of the likelihood of the M observations and the prior N(μ | μ0, σ0²). It is a simple exercise to expand the terms in the product and complete the squares in the exponential. The resulting posterior distribution P(μ | X) is again a normal distribution, with the following mean:

μ_M = (σ0² / (σ0² + σ²/M)) x̄ + ((σ²/M) / (σ0² + σ²/M)) μ0

Here, x̄ represents the sample mean. The variance satisfies:

1/σ_M² = 1/σ0² + M/σ²

Though these expressions look complex, they have a very simple interpretation. The posterior mean is a weighted sum of the prior mean μ0 and the sample mean x̄; as the sample size M increases, the weight of the sample mean increases and that of the prior decreases. Similarly, the posterior precision (the inverse of the variance) is the sum of the prior precision 1/σ0² and the precision of the sample mean, M/σ². As M increases, the contribution of precision from the observations (the evidence) outweighs that from the prior knowledge. Let's take a concrete example where we consider an age distribution with population mean 5.5 and population standard deviation 0.5, and draw 10,000 samples from this population by using the following R script:

>set.seed(100)
>age_samples <- rnorm(10000,mean = 5.5,sd=0.5)

We can calculate the posterior mean as a function of the number of samples using the following R function:

>age_mean <- function(n){
  mu0 <- 5
  sd0 <- 1
  mus <- mean(age_samples[1:n])
  sds <- sd(age_samples[1:n])
  mu_n <- (sd0^2/(sd0^2 + sds^2/n)) * mus + (sds^2/n/(sd0^2 + sds^2/n)) * mu0
  mu_n
}
>samp <- c(25,50,100,200,400,500,1000,2000,5000,10000)
>mu <- sapply(samp,age_mean,simplify = "array")
>plot(samp,mu,type="b",col="blue",ylim=c(5.3,5.7),xlab="no of samples",ylab="estimate of mean")
>abline(5.5,0)

One can see that, as the number of samples increases, the estimated mean asymptotically approaches the population mean. The initial low value is due to the influence of the prior, which in this case is 5.0. This simple and intuitive picture of how prior knowledge and evidence from observations combine to give the overall parameter estimate holds in any Bayesian inference; only the precise mathematical expression for how they combine differs from problem to problem. A quick numerical illustration follows.
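Using the prior μ0 = 5 and σ0 = 1 from the code above, the known σ = 0.5, and assuming, purely for illustration, that the sample mean of the first M = 100 observations comes out as x̄ = 5.52, the posterior mean works out to:

\mu_{100} = \frac{\sigma_0^2}{\sigma_0^2 + \sigma^2/M}\,\bar{x} + \frac{\sigma^2/M}{\sigma_0^2 + \sigma^2/M}\,\mu_0
          = \frac{1}{1.0025}(5.52) + \frac{0.0025}{1.0025}(5) \approx 5.5187

So, even with only 100 observations, the pull of the prior toward 5 is already tiny, which is exactly the washing-out effect described above.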
Therefore, one could start using a model for prediction with just the prior information, obtained either from domain knowledge or from data collected in the past, and as new observations arrive, the model can be updated using the Bayesian scheme.

Choosing the right prior distribution

In the preceding simple example, we saw that if the likelihood function has the form of a normal distribution, and the prior distribution is chosen as normal, the posterior also turns out to be a normal distribution, and we could get a closed-form analytical expression for the posterior mean. Since the posterior is obtained by multiplying the prior and likelihood functions and normalizing by integration over the parameter variables, the form of the prior distribution has a significant influence on the posterior. This section gives some more details about the different types of prior distributions and guidelines as to which ones to use in a given context. There are different ways of classifying prior distributions in a formal way. One approach is based on how much information a prior provides; in this scheme, prior distributions are classified as informative, weakly informative, least informative, and non-informative. Here, we take more of a practitioner's approach and illustrate some important classes of prior distributions commonly used in practice.

Non-informative priors

Let's start with the case where we do not have any prior knowledge about the model parameters. In this case, we want to express complete ignorance about the model parameters through a mathematical expression. This is achieved through what are called non-informative priors. For example, in the case of a single random variable x that can take any value between -∞ and ∞, a non-informative prior for its mean μ would be the uniform distribution:

P(μ) = constant, for -∞ < μ < ∞

Here, complete ignorance of the parameter value is captured through a uniform distribution function over the parameter space. Note that a uniform distribution over an infinite range is not a proper distribution function, since its integral over the domain is not equal to 1; it is not normalizable. However, one can use an improper distribution function for the prior as long as, once it is multiplied by the likelihood function, the resulting posterior can be normalized. If the parameter of interest is the variance σ², then by definition it can only take non-negative values. In this case, we transform the variable so that the transformed variable has a uniform probability over the range from -∞ to ∞:

λ = ln(σ²), with P(λ) = constant

It is easy to show, using simple differential calculus, that the corresponding non-informative distribution function in the original variable is:

P(σ²) ∝ 1/σ²

Another well-known non-informative prior used in practical applications is the Jeffreys prior, named after the British statistician Harold Jeffreys. This prior is invariant under reparametrization of θ and is defined as proportional to the square root of the determinant of the Fisher information matrix:

P(θ) ∝ sqrt(det I(θ))

Here, it is worth discussing the Fisher information matrix a little. If X is a random variable distributed according to P(X | θ), we may want to know how much information observations of X carry about the unknown parameter θ. This is what the Fisher information matrix provides. It is defined as the second moment of the score (the first derivative of the logarithm of the likelihood function):

I(θ) = E[ (d ln P(X | θ) / dθ)² ]

Let's take a simple two-dimensional problem to understand the Fisher information matrix and the Jeffreys prior. This example is given by Prof. D. Wittman of the University of California.
Let's consider two types of food item: buns and hot dogs. Let's assume that they are generally produced in pairs (a hot dog and a bun), but occasionally hot dogs are also produced independently in a separate process. There are two observables, the number of hot dogs (N_hotdog) and the number of buns (N_bun), and two model parameters, the production rate of pairs (λ_pair) and the production rate of hot dogs alone (λ_dog). We assume that the uncertainty in the measurement of the counts of these two food products is distributed according to the normal distribution, with variances σ_hotdog² and σ_bun², respectively. Since the expected number of hot dogs is λ_pair + λ_dog and the expected number of buns is λ_pair, the Fisher information matrix for this problem works out to be:

I(λ_pair, λ_dog) = [ 1/σ_hotdog² + 1/σ_bun²    1/σ_hotdog² ]
                   [ 1/σ_hotdog²               1/σ_hotdog² ]

In this case, the inverse of the Fisher information matrix corresponds to the covariance matrix of the parameter estimates.

Subjective priors

One of the key strengths of Bayesian statistics compared to classical (frequentist) statistics is that the framework allows one to capture subjective beliefs about any random variable. Usually, people have intuitive feelings about the minimum, maximum, mean, and most probable or peak values of a random variable. For example, if one is interested in the distribution of hourly temperatures in winter in a tropical country, then people who are familiar with tropical climates, or climatology experts, might believe that in winter the temperature can go as low as 15°C and as high as 27°C, with the most probable value being 23°C. This belief can be captured as a prior distribution through the Triangle distribution. The Triangle distribution has three parameters corresponding to the minimum value (a), the most probable value (b), and the maximum value (c). Its mean and variance are given by:

mean = (a + b + c) / 3
variance = (a² + b² + c² - ab - ac - bc) / 18

One can also use a PERT distribution to represent a subjective belief about the minimum, maximum, and most probable value of a random variable. The PERT distribution is a reparametrized Beta distribution defined on the interval [a, c], with shape parameters chosen so that the mode lies at b; in the standard parametrization, α1 = 1 + 4(b - a)/(c - a) and α2 = 1 + 4(c - b)/(c - a). The PERT distribution is commonly used for project completion time analysis, and the name originates from project evaluation and review techniques. Another area where Triangle and PERT distributions are commonly used is risk modeling. Often, people also have a belief about the relative probabilities of values of a random variable. For example, when studying the distribution of ages in a population such as Japan or some European countries, where there are more old people than young, an expert could give relative weights for the probability of different ages in the population. This can be captured through a relative distribution specified by a minimum, a maximum, a set of possible observed values, and their relative weights. Here, min and max represent the minimum and maximum values, {values} represents the set of possible observed values, and {weights} represents their relative weights; in the population age distribution problem, these would be the observed age values and the expert-assigned weights for each of them. The weights need not sum to 1.

Conjugate priors

If both the prior and posterior distributions are in the same family of distributions, then they are called conjugate distributions, and the corresponding prior is called a conjugate prior for the likelihood function. Conjugate priors are very helpful for getting analytical closed-form expressions for the posterior distribution. In the simple example we considered, we saw that when the noise is distributed according to the normal distribution, choosing a normal prior for the mean results in a normal posterior.
The following table gives examples of some well-known conjugate pairs:

Likelihood function | Model parameters | Conjugate prior | Hyperparameters
Binomial | p (probability of success) | Beta | α, β
Poisson | λ (rate) | Gamma | α, β
Categorical | p (probability vector), k (number of categories) | Dirichlet | α1, ..., αk
Univariate normal (known variance σ²) | μ (mean) | Normal | μ0, σ0²
Univariate normal (known mean μ) | σ² (variance) | Inverse Gamma | α, β

Hierarchical priors

Sometimes, it is useful to define prior distributions for the hyperparameters themselves. This is consistent with the Bayesian view that all parameters should be treated as uncertain by using probabilities. These distributions are called hyper-prior distributions. In theory, one can continue this to many levels as a hierarchical model; this is one way of eliciting the optimal prior distribution. For example, P(θ | β) is the prior distribution with a hyperparameter β. We could define a prior distribution for β through a second set of equations, such as P(β | γ), where P(β | γ) is the hyper-prior distribution for the hyperparameter β, parametrized by the hyper-hyper-parameter γ. One can define a prior distribution for γ in the same way and continue the process indefinitely. The practical reason for formalizing such models is that, at some level of the hierarchy, one can define a uniform prior for the hyperparameters, reflecting complete ignorance about their distribution, and effectively truncate the hierarchy. In practical situations, this is typically done at the second level, which in the preceding example corresponds to using a uniform distribution for β. I want to conclude this section by stressing one important point. Though the prior distribution has a significant role in Bayesian inference, one need not worry about it too much, as long as the chosen prior is reasonable and consistent with the domain knowledge and the evidence seen so far. The first reason is that, as we gather more evidence, the significance of the prior gets washed out. Secondly, when we use Bayesian models for prediction, we average over the uncertainty in the estimation of the parameters using the posterior distribution. This averaging is the key ingredient of Bayesian inference and it removes many of the ambiguities in the selection of the right prior.

Estimation of posterior distribution

So far, we discussed the essential concept behind Bayesian inference and also how to choose a prior distribution. Since one needs to compute the posterior distribution of model parameters before one can use the models for prediction, we discuss this task in this section. Though the Bayes rule has a very simple-looking form, the computation of the posterior distribution in a practically usable way is often very challenging. This is primarily because the computation of the normalization constant P(X | m) involves N-dimensional integrals when there are N parameters. Even when one uses a conjugate prior, this computation can be very difficult to carry out analytically or numerically. This was one of the main reasons why Bayesian inference was not used for multivariate modeling until recent decades. In this section, we will look at various approximate ways of computing posterior distributions that are used in practice.

Maximum a posteriori estimation

Maximum a posteriori (MAP) estimation is a point estimate that corresponds to taking the maximum value, or mode, of the posterior distribution.
Though a point estimate does not capture the variability in the parameter estimation, it does take into account the effect of the prior distribution to some extent, compared to maximum likelihood estimation. MAP estimation is sometimes called poor man's Bayesian inference. From the Bayes rule, we have:

θ_MAP = argmax_θ P(θ | X) = argmax_θ P(X | θ) P(θ)

Here, for convenience, we have used the notation X for the N-dimensional vector of observations. The last equality follows because the denominator of the RHS of the Bayes rule is independent of θ. Compare this with the maximum likelihood estimate:

θ_ML = argmax_θ P(X | θ)

The difference between the MAP and ML estimates is that, whereas ML finds the mode of the likelihood function, MAP finds the mode of the product of the likelihood function and the prior.

Laplace approximation

We saw that the MAP estimate just finds the maximum value of the posterior distribution. The Laplace approximation goes one step further and also computes the local curvature around the maximum up to quadratic terms. This is equivalent to assuming that the posterior distribution is approximately Gaussian (normal) around its maximum, which would be the case if the amount of data were large compared to the number of parameters: M >> N. The curvature is captured by an N x N Hessian matrix A, obtained by taking the second derivative of the log of the posterior distribution and evaluating it at θ_MAP:

A = -∇∇ ln P(θ | X, m), evaluated at θ = θ_MAP

Using the definition of conditional probability, the Laplace approximation gives an expression for the marginal likelihood of the form:

ln P(X | m) ≈ ln P(X | θ_MAP, m) + ln P(θ_MAP | m) + (N/2) ln 2π - (1/2) ln det A

In the limit of a large number of samples, one can show that this expression simplifies to:

ln P(X | m) ≈ ln P(X | θ_MAP, m) - (N/2) ln M

This quantity is called the Bayesian information criterion (BIC) and can be used for model selection or model comparison. It is one of the goodness-of-fit measures for a statistical model. Another similar criterion that is commonly used is the Akaike information criterion (AIC), defined as ln P(X | θ_ML, m) - N. Now we will discuss how BIC can be used to compare different models for model selection. In the Bayesian framework, two models such as m1 and m2 are compared using the Bayes factor. The Bayes factor B12 is defined as the ratio of posterior odds to prior odds:

B12 = [P(m1 | X) / P(m2 | X)] / [P(m1) / P(m2)] = P(X | m1) / P(X | m2)

Here, the posterior odds is the ratio of the posterior probabilities of the two models given the data, and the prior odds is the ratio of the prior probabilities of the two models, as given in the preceding equation. If B12 > 1, model m1 is preferred by the data, and if B12 < 1, model m2 is preferred. In reality, it is difficult to compute the Bayes factor exactly because it is difficult to get precise prior probabilities. It can be shown that, in the large-N limit, the difference in BIC values of the two models can be viewed as a rough approximation to the logarithm of the Bayes factor.

Monte Carlo simulations

The two approximations that we have discussed so far, MAP and the Laplace approximation, are useful when the posterior is a sharply peaked function around its maximum. Often, in real-life situations, the posterior has long tails. This is, for example, the case in e-commerce, where the probability of a user purchasing a product has a long tail over the space of all products. So, in many practical situations, both MAP and Laplace approximations fail to give good results. Another approach is to sample directly from the posterior distribution. Monte Carlo simulation is a technique for sampling from the posterior distribution, and it is one of the workhorses of Bayesian inference in practical applications.
In this section, we will introduce the reader to Markov Chain Monte Carlo (MCMC) simulations and discuss two common MCMC methods used in practice. As discussed earlier, let θ = {θ1, θ2, ..., θN} be the set of parameters that we are interested in estimating from the data through the posterior distribution. Consider the case where the parameters are discrete and each parameter has K possible values, so that the parameter vector can be in one of a finite number of states. Set up a Markov process over these states with a transition probability matrix T. The essential idea behind MCMC simulations is that one can choose the transition probabilities in such a way that the steady-state distribution of the Markov chain corresponds to the posterior distribution we are interested in. Once this is done, sampling from the Markov chain output, after it has reached the steady state, gives samples of θ distributed according to the posterior distribution. Now, the question is how to set up the Markov process so that its steady-state distribution corresponds to the posterior of interest. There are two well-known methods for this: the Metropolis-Hastings algorithm and Gibbs sampling. We will discuss both in some detail here.

The Metropolis-Hastings algorithm

The Metropolis-Hastings algorithm was one of the first major algorithms proposed for MCMC. It has a very simple concept, somewhat similar to a hill-climbing algorithm in optimization:

Let θ_t be the state of the system at time step t. To move the system to another state at time step t + 1, generate a candidate state θ* by sampling from a proposal distribution q(θ* | θ_t). The proposal distribution is chosen in such a way that it is easy to sample from.
Accept the proposed move with probability α = min(1, [P(θ* | X) q(θ_t | θ*)] / [P(θ_t | X) q(θ* | θ_t)]). If it is accepted, set θ_{t+1} = θ*; if not, θ_{t+1} = θ_t.
Continue the process until the distribution converges to the steady state.

Here, P(θ | X) is the posterior distribution that we want to simulate. Under certain conditions, the preceding update rule guarantees that, in the large-time limit, the Markov process approaches a steady state distributed according to P(θ | X). The intuition behind the Metropolis-Hastings algorithm is simple. The proposal distribution q(θ* | θ_t) gives the conditional probability of proposing state θ* as the next state given the current state θ_t. Therefore, P(θ_t | X) q(θ* | θ_t) is the probability that the system is currently in state θ_t and will make a transition to state θ* in the next time step, while P(θ* | X) q(θ_t | θ*) is the probability of the reverse move. If the ratio of these two probabilities is more than 1, the move is accepted; otherwise, it is accepted only with a probability given by the ratio. The Metropolis-Hastings algorithm is therefore like a hill-climbing algorithm in which all moves in the upward direction are accepted, while moves in the downward direction are accepted once in a while with a smaller probability. The downward moves help the system avoid getting stuck in local optima. Let's revisit the example of estimating the posterior distribution of the mean and variance of the height of people in a population, discussed in the introductory section. This time, we will estimate the posterior distribution by using the Metropolis-Hastings algorithm.
The following lines of R code do this job:

>set.seed(100)
>mu_t <- 5.5
>sd_t <- 0.5
>age_samples <- rnorm(10000,mean = mu_t,sd = sd_t)
>#function to compute log likelihood
>loglikelihood <- function(x,mu,sigma){
  singlell <- dnorm(x,mean = mu,sd = sigma,log = T)
  sumll <- sum(singlell)
  sumll
}
>#function to compute prior distribution for mean on log scale
>d_prior_mu <- function(mu){
  dnorm(mu,0,10,log=T)
}
>#function to compute prior distribution for std dev on log scale
>d_prior_sigma <- function(sigma){
  dunif(sigma,0,5,log=T)
}
>#function to compute posterior distribution on log scale
>d_posterior <- function(x,mu,sigma){
  loglikelihood(x,mu,sigma) + d_prior_mu(mu) + d_prior_sigma(sigma)
}
>#function to make transition moves
>tran_move <- function(x,dist = .1){
  x + rnorm(1,0,dist)
}
>num_iter <- 10000
>posterior <- array(dim = c(2,num_iter))
>accepted <- array(dim=num_iter - 1)
>theta_posterior <- array(dim=c(2,num_iter))
>values_initial <- list(mu = runif(1,4,8),sigma = runif(1,1,5))
>theta_posterior[1,1] <- values_initial$mu
>theta_posterior[2,1] <- values_initial$sigma
>for (t in 2:num_iter){
  #proposed next values for parameters
  theta_proposed <- c(tran_move(theta_posterior[1,t-1]),
                      tran_move(theta_posterior[2,t-1]))
  p_proposed <- d_posterior(age_samples,mu = theta_proposed[1],
                            sigma = theta_proposed[2])
  p_prev <- d_posterior(age_samples,mu = theta_posterior[1,t-1],
                        sigma = theta_posterior[2,t-1])
  eps <- exp(p_proposed - p_prev)
  # proposal is accepted if posterior density is higher w/ theta_proposed
  # if posterior density is not higher, it is accepted with probability eps
  accept <- rbinom(1,1,prob = min(eps,1))
  accepted[t - 1] <- accept
  if (accept == 1){
    theta_posterior[,t] <- theta_proposed
  } else {
    theta_posterior[,t] <- theta_posterior[,t-1]
  }
}

To plot the resulting posterior distribution, we use the sm package in R:

>library(sm)
>x <- cbind(c(theta_posterior[1,1:num_iter]),c(theta_posterior[2,1:num_iter]))
>xlim <- c(min(x[,1]),max(x[,1]))
>ylim <- c(min(x[,2]),max(x[,2]))
>zlim <- c(0,max(1))
>sm.density(x, xlab = "mu",ylab="sigma", zlab = " ",zlim = zlim, xlim = xlim ,ylim = ylim,col="white")
>title("Posterior density")

The resulting plot shows the joint posterior density of mu and sigma. Though the Metropolis-Hastings algorithm is simple to implement for any Bayesian inference problem, in practice it may not be very efficient in many cases. The main reason for this is that, unless one carefully chooses the proposal distribution q(θ* | θ_t), there will be too many rejections and it will take a large number of updates to reach the steady state. This is particularly the case when the number of parameters is high. There are various modifications of the basic Metropolis-Hastings algorithm that try to overcome these difficulties; we will briefly describe some of them when we discuss the R packages for the Metropolis-Hastings algorithm in the following section.

R packages for the Metropolis-Hastings algorithm

There are several contributed packages in R for MCMC simulation using the Metropolis-Hastings algorithm, and here we describe some popular ones. The mcmc package contributed by Charles J. Geyer and Leif T. Johnson is one of the popular packages in R for MCMC simulations. It has the metrop function for running the basic Metropolis-Hastings algorithm; the metrop function uses a multivariate normal distribution as the proposal distribution. Sometimes, it is useful to make a variable transformation to improve the speed of convergence in MCMC, and the mcmc package has a function named morph for doing this.
Combining these two, the function morph.metrop first transforms the variable, runs the Metropolis algorithm on the transformed density, and converts the results back to the original variable. Apart from the mcmc package, two other useful packages in R are MHadaptive, contributed by Corey Chivers, and the Evolutionary Monte Carlo (EMC) algorithm package by Gopi Goswami.

Gibbs sampling

As mentioned before, the Metropolis-Hastings algorithm suffers from poor convergence, due to too many rejections, if one does not choose a good proposal distribution. To avoid this problem, two physicists, Stuart Geman and Donald Geman, proposed a new algorithm called Gibbs sampling, named after the famous physicist J. W. Gibbs. Currently, Gibbs sampling is the workhorse of MCMC for Bayesian inference. Let θ = {θ1, θ2, ..., θN} be the set of parameters of the model that we wish to estimate:

Start with an initial state θ^(0).
At each time step, update the components one by one, by drawing each θi from its distribution conditional on the most recent values of all the other components: θi ~ P(θi | θ1, ..., θi-1, θi+1, ..., θN, X).
After N such steps, all components of the parameter vector have been updated.
Continue with step 2 until the Markov process converges to a steady state.

Gibbs sampling is a very efficient algorithm since there are no rejections. However, to be able to use Gibbs sampling, the form of the conditional distributions of the posterior distribution should be known.

R packages for Gibbs sampling

Unfortunately, there are not many contributed general-purpose Gibbs sampling packages in R. The gibbs.met package provides two generic functions for performing MCMC in a naive way for a user-defined target distribution. The first function is gibbs_met, which performs Gibbs sampling with each one-dimensional distribution sampled by using the Metropolis algorithm, with a normal distribution as the proposal distribution. The second function, met_gaussian, updates the whole state with an independent normal distribution centered around the previous state. The gibbs.met package is useful for general-purpose MCMC on moderate-dimensional problems. Apart from the general-purpose MCMC packages, there are several packages in R designed to solve particular types of machine-learning problems. The GibbsACOV package can be used for one-way mixed-effects ANOVA and ANCOVA models. The lda package performs collapsed Gibbs sampling for topic (LDA) models. The stocc package fits a spatial occupancy model via Gibbs sampling. The binomlogit package implements an efficient MCMC for binomial logit models. Bmk is a package for diagnostics of MCMC output, and the Bayesian Output Analysis Program (BOA) is another similar package. RBugs is an interface to the well-known OpenBUGS MCMC package. The ggmcmc package is a graphical tool for analyzing MCMC simulations. MCMCglmm is a package for generalized linear mixed models, and BoomSpikeSlab is a package for doing MCMC for spike-and-slab regression. Finally, SamplerCompare is a package (more of a framework) for comparing the performance of various MCMC packages.

Variational approximation

In the variational approximation scheme, one assumes that the posterior distribution P(θ | X) can be approximated by a factorized form:

Q(θ | X) = Q1(θ1 | X) Q2(θ2 | X) ... QN(θN | X)

Note that the factorized form is also a conditional distribution, so each Qi can have a dependence on the other θs through the conditioned variable X. In other words, this is not a trivial factorization that makes each parameter independent.
The advantage of this factorization is that one can choose more analytically tractable forms for the distribution functions Qi. In fact, one can vary the functions Qi so that the product is as close to the true posterior P(θ | X) as possible. This is mathematically formulated as a variational calculus problem, as explained here. We need a measure of the distance between the two probability distributions Q(θ | X) and P(θ | X), where Q(θ | X) = Q1(θ1 | X) Q2(θ2 | X) ... QN(θN | X). One of the standard measures of distance between probability distributions is the Kullback-Leibler divergence, or KL-divergence for short. It is defined as follows:

KL(Q || P) = ∫ Q(θ | X) ln [ Q(θ | X) / P(θ | X) ] dθ

The reason why it is called a divergence and not a distance is that KL(Q || P) is not symmetric with respect to Q and P. Using the relation P(θ | X) = P(X, θ) / P(X), one can rewrite the preceding expression as an equation for ln P(X):

ln P(X) = L(Q) + KL(Q || P)

Here:

L(Q) = ∫ Q(θ | X) ln [ P(X, θ) / Q(θ | X) ] dθ

Note that, in the equation for ln P(X), there is no dependence on Q on the LHS. Therefore, maximizing L(Q) with respect to Q will minimize KL(Q || P), since their sum is a term independent of Q. By choosing analytically tractable functions for Q, one can do this maximization in practice. It results in both an approximation for the posterior and a lower bound for ln P(X), the logarithm of the evidence or marginal likelihood, since KL(Q || P) >= 0. Therefore, variational approximation gives us two quantities in one shot: a posterior distribution that can be used to make predictions about future observations (as explained in the next section), and a lower bound for the evidence that can be used for model selection. How does one implement this minimization of the KL-divergence in practice? Without going into the mathematical details, the final expression for the solution is:

ln Qj*(θj | X) = E_{i ≠ j}[ ln P(X, θ) ] + constant

Here, E_{i ≠ j}[ ln P(X, θ) ] denotes the expectation of the logarithm of the joint distribution P(X, θ) taken over all the parameters θi except θj. The minimization of the KL-divergence therefore leads to a set of coupled equations, one for each Qj, which need to be solved self-consistently to obtain the final solution. Though the variational approximation looks very complex mathematically, it has a very simple, intuitive explanation: the posterior distribution of each parameter θj is obtained by averaging the log of the joint distribution over all the other variables. This is analogous to the mean field theory in physics where, if there are N interacting charged particles, the system can be approximated by saying that each particle sits in a constant external field, which is the average of the fields produced by all the other particles. We will end this section by mentioning a few R packages for variational approximation. The VBmix package can be used for variational approximation in Bayesian mixture models. A similar package is vbdm, used for Bayesian discrete mixture models. The package vbsr is used for variational inference in spike regression regularized linear models.

Prediction of future observations

Once we have the posterior distribution inferred from the data using some of the methods described already, it can be used to predict future observations. The probability of observing a value Y, given the observed data X and the posterior distribution of the parameters θ, is given by:

P(Y | X) = ∫ P(Y | θ) P(θ | X) dθ

Note that, in this expression, the likelihood function P(Y | θ) is averaged using the posterior distribution of the parameter, P(θ | X). This is, in fact, the core strength of Bayesian inference: this Bayesian averaging eliminates the uncertainty in estimating the parameter values and makes the prediction more robust.

Summary

In this article, we covered the basic principles of Bayesian inference.
Summary

In this article, we covered the basic principles of Bayesian inference. Starting with how uncertainty is treated differently in Bayesian statistics compared to classical statistics, we discussed the various components of Bayes' rule in depth. First, we learned about the different types of prior distributions and how to choose the right one for your problem. Then we learned how to estimate the posterior distribution using techniques such as MAP estimation, Laplace approximation, and MCMC simulations.

Resources for Article: Further resources on this subject: Bayesian Network Fundamentals [article] Learning Data Analytics with R and Hadoop [article] First steps with R [article]

Introducing the Boost C++ Libraries

Packt
14 Sep 2015
22 min read
In this article written by John Torjo and Wisnu Anggoro, authors of the book Boost.Asio C++ Network Programming - Second Edition, the authors state that "Many programmers have used libraries since this simplifies the programming process. Because they do not need to write the function from scratch anymore, using a library can save much code development time". In this article, we are going to get acquainted with the Boost C++ libraries. Let us prepare our own compiler and text editor to prove the power of the Boost libraries. As we do so, we will discuss the following topics:

Introducing the C++ standard template library
Introducing Boost C++ libraries
Setting up Boost C++ libraries in MinGW compiler
Building Boost C++ libraries
Compiling code that contains Boost C++ libraries

(For more resources related to this topic, see here.)

Introducing the C++ standard template library

The C++ Standard Template Library (STL) is a generic, template-based library that offers generic containers, among other things. Instead of dealing with dynamic arrays, linked lists, binary trees, or hash tables, programmers can easily use the algorithms and containers that are provided by STL. The STL is structured around containers, iterators, and algorithms, and their roles are as follows:

Containers: Their main role is to manage collections of objects of certain kinds, such as arrays of integers or linked lists of strings.
Iterators: Their main role is to step through the elements of the collections. The working of an iterator is similar to that of a pointer. We can increment the iterator by using the ++ operator and access the value by using the * operator.
Algorithms: Their main role is to process the elements of collections. An algorithm uses an iterator to step through all elements. As it iterates, it processes each element, for example, modifying it. It can also search and sort the elements after it finishes iterating over all of them.

Let us examine the three elements that structure STL by creating the following code:

/* stl.cpp */
#include <vector>
#include <iostream>
#include <algorithm>

int main(void) {
  int temp;
  std::vector<int> collection;
  std::cout << "Please input the collection of integer numbers, input 0 to STOP!\n";
  while(std::cin >> temp) {
    if(temp == 0) break;
    collection.push_back(temp);
  }
  std::sort(collection.begin(), collection.end());
  std::cout << "\nThe sorted collection of your integer numbers:\n";
  for(int i: collection) {
    std::cout << i << std::endl;
  }
}

Name the preceding code stl.cpp, and run the following command to compile it:

g++ -Wall -ansi -std=c++11 stl.cpp -o stl

Before we dissect this code, let us run it to see what happens. This program will ask users to enter as many integers as they want, and then it will sort the numbers. To stop the input and ask the program to start sorting, the user has to input 0. This means that 0 will not be included in the sorting process. Since we do not prevent users from entering non-integer input such as 3.14, or even a string, the program will simply stop waiting for the next number after the user enters a non-integer value. The code yields the following output:

We have entered six integers: 43, 7, 568, 91, 2240, and 56. The last entry is 0 to stop the input process. Then the program starts to sort the numbers and we get the numbers sorted in sequential order: 7, 43, 56, 91, 568, and 2240. Now, let us examine our code to identify the containers, iterators, and algorithms that are contained in the STL.
std::vector<int> collection;

The preceding code snippet uses a container from STL. There are several containers, and we use a vector in the code. A vector manages its elements in a dynamic array, and they can be accessed randomly and directly with the corresponding index. In our code, the container is prepared to hold integer numbers, so we have to define the type of the value inside the angle brackets, <int>. These angle brackets are also called generics in STL.

collection.push_back(temp);
std::sort(collection.begin(), collection.end());

The std::sort() function in the preceding code is an algorithm from STL. It processes the data in the container, while the begin() and end() functions are used to get iterators to the first and the past-the-last elements in the container. Before that, we can see the push_back() function, which is used to append an element to the container.

for(int i: collection) {
  std::cout << i << std::endl;
}

The preceding for block will iterate over each element of the integer collection. Each time an element is visited, we can process it separately. In the preceding example, we show the number to the user. That is how the iterators in STL play their role.

#include <vector>
#include <algorithm>

We include the vector definition to define all vector functions, and the algorithm definition to invoke the sort() function.

Introducing the Boost C++ libraries

The Boost C++ libraries are a set of libraries that complement the C++ standard library. The set contains more than a hundred libraries that we can use to increase our productivity in C++ programming. They are also used when our requirements go beyond what is available in the STL. Boost provides source code under the Boost Software License, which means that it allows us to use, modify, and distribute the libraries for free, even for commercial use. The development of Boost is handled by the Boost community, which consists of C++ developers from around the world. The mission of the community is to develop high-quality libraries as a complement to the STL. Only proven libraries will be added to the Boost libraries.

For detailed information about the Boost libraries, go to www.boost.org. And if you want to contribute to developing libraries for Boost, you can join the developer mailing list at lists.boost.org/mailman/listinfo.cgi/boost. The entire source code of the libraries is available on the official GitHub page at github.com/boostorg.

Advantages of Boost libraries

As we know, using Boost libraries will increase programmer productivity. Moreover, by using Boost libraries, we will get advantages such as these:

It is open source, so we can inspect the source code and modify it if needed.
Its license allows us to develop both open source and closed source projects. It also allows us to commercialize our software freely.
It is well documented, and we can find all its libraries explained along with sample code on the official site.
It supports almost any modern operating system, such as Windows and Linux. It also supports many popular compilers.
It is a complement to STL instead of a replacement. This means that using Boost libraries will ease those programming processes which are not handled by STL yet. In fact, many parts of Boost are included in the standard C++ library.

Preparing Boost libraries for MinGW compiler

Before we can program our C++ applications by using Boost libraries, the libraries need to be configured in order to be recognized by the MinGW compiler. Here we are going to prepare our programming environment so that our compiler is able to use the Boost libraries.
Downloading Boost libraries

The best source from which to download Boost is the official download page. We can go there by pointing our internet browser to www.boost.org/users/download. Find the Download link in the Current Release section. At the time of writing, the current version of the Boost libraries is 1.58.0, but when you read this article, the version may have changed. If so, you can still choose the current release because a higher version should be compatible with a lower one. However, you will have to adjust the version-specific settings we are going to talk about later. Otherwise, choosing the same version will make it easy for you to follow all the instructions in this article.

There are four file formats to choose from for download; they are .zip, .tar.gz, .tar.bz2, and .7z. There is no difference among the four files other than their file size. The largest is the ZIP format and the smallest is the 7Z format. Because of the file size, Boost recommends that we download the 7Z format. See the following image for comparison:

We can see, from the preceding image, that the size of the ZIP version is 123.1 MB while the size of the 7Z version is 65.2 MB. This means that the size of the ZIP version is almost twice that of the 7Z version. Therefore they suggest that you choose the 7Z format to reduce download and decompression time. Let us choose boost_1_58_0.7z to be downloaded and save it to our local storage.

Deploying Boost libraries

After we have got boost_1_58_0.7z in our local storage, decompress it using the 7ZIP application and save the decompressed files to C:\boost_1_58_0. The 7ZIP application can be grabbed from www.7-zip.org/download.html. The directory then should contain file structures as follows:

Instead of browsing to the Boost download page and searching for the Boost version manually, we can go directly to sourceforge.net/projects/boost/files/boost/1.58.0. It will be useful when the 1.58.0 version is not the current release anymore.

Using Boost libraries

Most libraries in Boost are header-only; this means that all declarations and definitions of functions, including namespaces and macros, are visible to the compiler and there is no need to compile them separately. We can now try to use Boost with a program that converts a string into an int value, as follows:

/* lexical.cpp */
#include <boost/lexical_cast.hpp>
#include <string>
#include <iostream>

int main(void) {
  try {
    std::string str;
    std::cout << "Please input first number: ";
    std::cin >> str;
    int n1 = boost::lexical_cast<int>(str);
    std::cout << "Please input second number: ";
    std::cin >> str;
    int n2 = boost::lexical_cast<int>(str);
    std::cout << "The sum of the two numbers is ";
    std::cout << n1 + n2 << "\n";
    return 0;
  } catch (const boost::bad_lexical_cast &e) {
    std::cerr << e.what() << "\n";
    return 1;
  }
}

Open the Notepad++ application, type the preceding code, and save it as lexical.cpp in C:\CPP. Now open the command prompt, point the active directory to C:\CPP, and then type the following command:

g++ -Wall -ansi lexical.cpp -Ic:\boost_1_58_0 -o lexical

We have a new option here, which is -I (the "include" option). This option is used along with the full path of a directory to inform the compiler that we have another header directory that we want to include in our code. Since we store our Boost libraries in c:\boost_1_58_0, we can use -Ic:\boost_1_58_0 as an additional parameter. In lexical.cpp, we apply boost::lexical_cast to convert string type data into int type data.
The program will ask the user to input two numbers and will then automatically find the sum of both numbers. If a user inputs an inappropriate value, it will report that an error has occurred. The Boost.LexicalCast library is provided by Boost for data type casting purposes (converting numeric types such as int, double, or float into string types, and vice versa). Now let us dissect lexical.cpp for a more detailed understanding of what it does:

#include <boost/lexical_cast.hpp>
#include <string>
#include <iostream>

We include boost/lexical_cast.hpp because the boost::lexical_cast function is declared in the lexical_cast.hpp header file, while the string header is included to use std::string and the iostream header is included to use std::cin, std::cout, and std::cerr. We have already seen what functions such as std::cin and std::cout do, so we can skip those lines.

int n1 = boost::lexical_cast<int>(str);
int n2 = boost::lexical_cast<int>(str);

We use the preceding two separate lines to convert the user-provided input string into the int data type. Then, after converting the data type, we sum up both of the int values. We can also see the try-catch block in the preceding code. It is used to catch the error if the user inputs an inappropriate value, that is, anything other than the digits 0 to 9.

catch (const boost::bad_lexical_cast &e) {
  std::cerr << e.what() << "\n";
  return 1;
}

The preceding code snippet will catch errors and inform the user what the error message exactly is by using boost::bad_lexical_cast. We call the e.what() function to obtain the string of the error message.

Now let us run the application by typing lexical at the command prompt. We will get output like the following:

I put 10 for the first input and 20 for the second input. The result is 30 because it just sums up both inputs. But what will happen if I put in a non-numerical value, for instance Packt? Here is the output for that condition:

Once the application finds the error, it will ignore the next statements and go directly to the catch block. By using the e.what() function, the application can get the error message and show it to the user. In our example, we obtain bad lexical cast: source type value could not be interpreted as target as the error message, because we try to assign string data to an int type variable.

Building Boost libraries

As we discussed previously, most libraries in Boost are header-only, but not all of them. There are some libraries that have to be built separately. They are:

Boost.Chrono: This is used to provide a variety of clocks, such as the current time, the range between two times, or calculating the time passed in a process.
Boost.Context: This is used to create higher-level abstractions, such as coroutines and cooperative threads.
Boost.Filesystem: This is used to deal with files and directories, such as obtaining a file path or checking whether a file or directory exists.
Boost.GraphParallel: This is an extension to the Boost Graph Library (BGL) for parallel and distributed computing.
Boost.IOStreams: This is used to write and read data using streams. For instance, it loads the content of a file to memory or writes compressed data in GZIP format.
Boost.Locale: This is used to localize the application, in other words, translate the application interface to the user's language.
Boost.MPI: This is used to develop a program that executes tasks concurrently. MPI itself stands for Message Passing Interface.
Boost.ProgramOptions: This is used to parse command-line options.
Instead of using the argv variable in the main parameter, it uses a double minus (--) to separate each command-line option.
Boost.Python: This is used to parse Python language in C++ code.
Boost.Regex: This is used to apply regular expressions in our code. But if our development environment supports C++11, we do not depend on the Boost.Regex library anymore, since it is available in the regex header file.
Boost.Serialization: This is used to convert objects into a series of bytes that can be saved and then restored again into the same object.
Boost.Signals: This is used to create signals. A signal will trigger an event to run a function on it.
Boost.System: This is used to define errors. It contains four classes: system::error_code, system::error_category, system::error_condition, and system::system_error. All of these classes are inside the boost namespace. It is also supported in the C++11 environment, but because many Boost libraries use Boost.System, it is necessary to keep including Boost.System.
Boost.Thread: This is used to apply threading programming. It provides classes to synchronize access to data from multiple threads. It is also supported in C++11 environments, but it offers extensions, such as the ability to interrupt a thread in Boost.Thread.
Boost.Timer: This is used to measure code performance by using clocks. It measures time passed based on the usual clock and on CPU time, which states how much time has been spent executing the code.
Boost.Wave: This provides a reusable C preprocessor that we can use in our C++ code.

There are also a few libraries that have optional, separately compiled binaries. They are as follows:

Boost.DateTime: It is used to process time data; for instance, calendar dates and time. It has a binary component that is only needed if we use the to_string, from_string, or serialization features. It is also needed if we target our application at Visual C++ 6.x or Borland.
Boost.Graph: It is used to work with graphs, that is, structures of vertices connected by edges. It has a binary component that is only needed if we intend to parse GraphViz files.
Boost.Math: It is used to deal with mathematical formulas. It has binary components for the cmath functions.
Boost.Random: It is used to generate random numbers. It has a binary component which is only needed if we want to use random_device.
Boost.Test: It is used to write and organize test programs and their runtime execution. It can be used in header-only or separately compiled mode, but separate compilation is recommended for serious use.
Boost.Exception: It is used to add data to an exception after it has been thrown. It provides a non-intrusive implementation of exception_ptr for 32-bit _MSC_VER==1310 and _MSC_VER==1400, which requires a separately compiled binary. This is enabled by #define BOOST_ENABLE_NON_INTRUSIVE_EXCEPTION_PTR.

Let us try to recreate the random number generator, but now using the Boost.Random library instead of the std::rand() function from the C++ standard library.
Let us take a look at the following code:

/* rangen_boost.cpp */
#include <boost/random/mersenne_twister.hpp>
#include <boost/random/uniform_int_distribution.hpp>
#include <iostream>

int main(void) {
  int guessNumber;
  std::cout << "Select number among 0 to 10: ";
  std::cin >> guessNumber;
  if(guessNumber < 0 || guessNumber > 10) {
    return 1;
  }
  boost::random::mt19937 rng;
  boost::random::uniform_int_distribution<> ten(0,10);
  int randomNumber = ten(rng);
  if(guessNumber == randomNumber) {
    std::cout << "Congratulation, " << guessNumber << " is your lucky number.\n";
  } else {
    std::cout << "Sorry, I'm thinking about number " << randomNumber << "\n";
  }
  return 0;
}

We can compile the preceding source code by using the following command:

g++ -Wall -ansi -Ic:/boost_1_58_0 rangen_boost.cpp -o rangen_boost

Now, let us run the program. Unfortunately, for the three times that I ran the program, I always obtained the same random number, as follows:

As we can see from this example, we always get the number 8. This is because we apply the Mersenne Twister, a Pseudorandom Number Generator (PRNG), which uses the default seed as its source of randomness, so it will generate the same number every time the program is run. And of course this is not the behavior we expect from the program.

Now, we will rework the program by changing just two lines. First, find the following line:

#include <boost/random/mersenne_twister.hpp>

Change it as follows:

#include <boost/random/random_device.hpp>

Next, find the following line:

boost::random::mt19937 rng;

Change it as follows:

boost::random::random_device rng;

Then, save the file as rangen2_boost.cpp and compile it using the same command we used to compile rangen_boost.cpp. The command will look like this:

g++ -Wall -ansi -Ic:/boost_1_58_0 rangen2_boost.cpp -o rangen2_boost

Sadly, something will go wrong and the compiler will show the following error message:

cc8KWVvX.o:rangen2_boost.cpp:(.text$_ZN5boost6random6detail20generate_uniform_intINS0_13random_deviceEjEET0_RT_S4_S4_N4mpl_5bool_ILb1EEE[_ZN5boost6random6detail20generate_uniform_intINS0_13random_deviceEjEET0_RT_S4_S4_N4mpl_5bool_ILb1EEE]+0x24f): more undefined references to boost::random::random_device::operator()()' follow
collect2.exe: error: ld returned 1 exit status

This is because, as we discussed earlier, the Boost.Random library needs to be compiled separately if we want to use the random_device attribute.

Boost libraries have a system to compile or build Boost itself, called the Boost.Build library. There are two steps we have to complete to install the Boost.Build library. First, run bootstrap by pointing the active directory at the command prompt to C:\boost_1_58_0 and typing the following command:

bootstrap.bat mingw

We use our MinGW compiler as the toolset for compiling the Boost library. Wait a moment, and we will get the following output if the process is successful:

Building Boost.Build engine
Bootstrapping is done. To build, run:
    .\b2
To adjust configuration, edit 'project-config.jam'.
Further information:
    - Command line help: .\b2 --help
    - Getting started guide: http://boost.org/more/getting_started/windows.html
    - Boost.Build documentation: http://www.boost.org/build/doc/html/index.html

In this step, we will find four new files in the Boost library's root directory. They are:

b2.exe: This is an executable file to build the Boost libraries.
bjam.exe: This is exactly the same as b2.exe, but it is a legacy version.
bootstrap.log: This contains logs from the bootstrap process.
project-config.jam: This contains a setting that will be used in the building process when we run b2.exe.

We also find that this step creates a new directory, C:\boost_1_58_0\tools\build\src\engine\bin.ntx86, which contains a bunch of .obj files associated with the Boost libraries that needed to be compiled. After that, run the second step by typing the following command at the command prompt:

b2 install toolset=gcc

Grab yourself a cup of coffee after running that command because it will take about twenty to fifty minutes to finish the process, depending on your system specifications. The last output we will get will be like this:

...updated 12562 targets...

This means that the process is complete and we have now built the Boost libraries. If we check in Explorer, the Boost.Build library adds C:\boost_1_58_0\stage\lib, which contains a collection of static and dynamic libraries that we can use directly in our programs.

bootstrap.bat and b2.exe use msvc (the Microsoft Visual C++ compiler) as the default toolset, and many Windows developers already have msvc installed on their machines. Since we have installed the GCC compiler, we set the mingw and gcc toolset options in Boost's build. If you also have msvc installed and want to use it in Boost's build, the toolset options can be omitted.

Now, let us try to compile the rangen2_boost.cpp file again, but this time with the following command:

c:\CPP>g++ -Wall -ansi -Ic:/boost_1_58_0 rangen2_boost.cpp -Lc:\boost_1_58_0\stage\lib -lboost_random-mgw49-mt-1_58 -lboost_system-mgw49-mt-1_58 -o rangen2_boost

We have two new options here: -L and -l. The -L option is used to define the path that contains the library files if they are not in the active directory. The -l option is used to define the name of the library file, omitting the leading lib in front of the file name. In this case, the original library file name is libboost_random-mgw49-mt-1_58.a, and we omit the lib prefix and the file extension for the -l option.

The new file, called rangen2_boost.exe, will be created in C:\CPP. But before we can run the program, we have to ensure that the directory in which the program is installed contains the two dependency library files. These are libboost_random-mgw49-mt-1_58.dll and libboost_system-mgw49-mt-1_58.dll, and we can get them from the library directory C:\boost_1_58_0\stage\lib. Just to make it easy for us to run the program, run the following copy commands to copy the two library files to C:\CPP:

copy c:\boost_1_58_0\stage\lib\libboost_random-mgw49-mt-1_58.dll c:\cpp
copy c:\boost_1_58_0\stage\lib\libboost_system-mgw49-mt-1_58.dll c:\cpp

And now the program should run smoothly.

In order to create a network application, we are going to use the Boost.Asio library. Boost.Asio is not in the list of libraries that must be compiled separately; it is a header-only library, so it seems that we do not need to build Boost for it. This is true, but since Boost.Asio depends on Boost.System, and Boost.System needs to be built before being used, it is important to build Boost first before we can use it to create our network application.

For the -I and -L options, the compiler does not care whether we use a backslash (\) or a slash (/) to separate each directory name in the path, because the compiler can handle both Windows and Unix path styles.
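As a quick way to confirm that the separately built binaries work before moving on to networking, here is a minimal Boost.Asio sketch (our own example, not part of the original text) that simply blocks on a timer. The compile command assumes the same stage\lib layout and mgw49-mt-1_58 library names used earlier, which will differ if your Boost version or toolset differs:

/* asiotimer.cpp */
#include <boost/asio.hpp>
#include <iostream>

int main(void)
{
    boost::asio::io_service io;
    // A deadline_timer that expires three seconds after construction
    boost::asio::deadline_timer timer(io, boost::posix_time::seconds(3));
    timer.wait();   // blocks until the timer expires
    std::cout << "Timer expired after 3 seconds.\n";
    return 0;
}

A compile command along the same lines as before might look like this (on MinGW, the Winsock library ws2_32 is typically needed as well):

g++ -Wall -ansi -Ic:/boost_1_58_0 asiotimer.cpp -Lc:\boost_1_58_0\stage\lib -lboost_system-mgw49-mt-1_58 -lws2_32 -o asiotimer

If this links and runs, the Boost.System binaries are in place and ready for Boost.Asio network code.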
Summary

We saw that the Boost C++ libraries were developed to complement the standard C++ library. We have also been able to set up our MinGW compiler in order to compile code that contains Boost libraries, and to build the binaries of the libraries which have to be compiled separately. Please remember that though we can use the Boost.Asio library as a header-only library, it is better to build all Boost libraries by using the Boost.Build library. It will be easy for us to use all the libraries without worrying about compilation failures.

Resources for Article: Further resources on this subject: Actors and Pawns[article] What is Quantitative Finance?[article] Program structure, execution flow, and runtime objects [article]

Getting Started – Understanding Citrix XenDesktop and its Architecture

Packt
14 Sep 2015
6 min read
In this article written by Gurpinder Singh, author of the book Troubleshooting Citrix XenDesktop, the author wants us to learn about the following topics:

Hosted shared vs hosted virtual desktops
Citrix FlexCast delivery technology
Modular framework architecture
What's new in XenDesktop 7.x

(For more resources related to this topic, see here.)

Hosted shared desktops (HSD) vs hosted virtual desktops (HVD)

Before going through the XenDesktop architecture, we would first like to explain the difference between the two desktop delivery platforms, HSD and HVD. This is a common question asked by every system administrator whenever there is a discussion about the most suitable desktop delivery platform for an enterprise. The choice of desktop delivery platform depends on the requirements of the enterprise.

Some choose Hosted Shared Desktops (HSD) or Server Based Computing (XenApp) over Hosted Virtual Desktops (XenDesktop); here, a single server desktop is shared among multiple users, and the environment is locked down using Active Directory GPOs. XenApp is a cost-effective platform when compared with XenDesktop, and many small to mid-sized enterprises prefer to choose this platform due to its cost benefits and lower complexity. However, the preceding model does pose some risks to the environment, as the same server is being shared by multiple users, and a proper design plan is required to configure a proper HSD or XenApp Published Desktop environment.

Many enterprises have security and other user-level dependencies for which they prefer to go with the hosted virtual desktop solution. A hosted virtual desktop, or XenDesktop, runs a Windows 7 or Windows 8 desktop as a virtual machine hosted in a data centre. In this model, a single user connects to a single desktop, and therefore there is much less risk of having the desktop configuration impacted for all users. XenDesktop 7.x and later versions now also enable you to deliver server-based desktops (HSD) along with HVD within one product suite. XenDesktop also provides HVD pooled desktops, which work on a shared OS image concept similar to HSD desktops, with the difference that they run a desktop operating system instead of a server operating system.

Please have a look at the following requirements and recommendations, which should give you a fair idea of which delivery platform suits your enterprise:

Customer requirement: The user needs to work on one or two applications and usually does not need to do any updates or installations on their own. Recommended delivery platform: Hosted Shared Desktop.
Customer requirement: The user works on their own core set of applications, for which they need to change system-level settings, perform installations, and so on. Recommended delivery platform: Hosted Virtual Desktop (Dedicated).
Customer requirement: The user works on MS Office and other content creation tools. Recommended delivery platform: Hosted Shared Desktop.
Customer requirement: The user needs to work on CPU- and graphics-intensive applications that require video rendering. Recommended delivery platform: Hosted Virtual Desktop (Blade PCs).
Customer requirement: The user needs admin privileges to work on a specific set of applications. Recommended delivery platform: Hosted Virtual Desktop (Pooled).

You can always have a mixed set of desktop delivery platforms in your environment, focused on the customer's needs and requirements.

Citrix FlexCast delivery technology

Citrix FlexCast is a delivery technology that allows a Citrix administrator to personalize virtual desktops to meet the performance, security, and flexibility requirements of end users. There are different types of user requirements; some users need standard desktops with a standard set of apps, while others require high-performance, personalized desktops.
Citrix has come up with a solution to meet these demands with the Citrix FlexCast technology. You can deliver any kind of virtualized desktop with FlexCast technology; there are five different categories in which FlexCast models are available:

Hosted Shared or HSD
Hosted Virtual Desktop or HVD
Streamed VHD
Local VMs
On-Demand Apps

A detailed discussion of these models is out of scope for this article. To read more about the FlexCast models, please visit http://support.citrix.com/article/CTX139331.

Modular framework architecture

To understand the XenDesktop architecture, it is better to break the architecture down into discrete, independent modules rather than visualizing it as one single, integrated piece. Citrix provides this modularized approach to designing and architecting XenDesktop to solve end customers' sets of requirements and objectives. This modularized approach solves customer requirements by providing a platform that is highly resilient, flexible, and scalable. This reference architecture is based on information gathered by multiple Citrix consultants working on a wide range of XenDesktop implementations. Have a look at the basic components of the XenDesktop architecture that everyone should be aware of before getting involved with troubleshooting:

We won't be spending much time on understanding each component of the reference architecture (http://www.citrix.com/content/dam/citrix/en_us/documents/products-solutions/xendesktop-deployment-blueprint.pdf) in detail, as this is out of scope for this book. We will go through each component quickly.

What's new in XenDesktop 7.x

With the release of Citrix XenDesktop 7, Citrix has introduced a lot of improvements over previous releases. With every new product release, a lot of information is published, and sometimes it becomes very difficult to get the key information that all system administrators would be looking for to understand what has changed and what the key benefits of the new release are. The purpose of this section is to highlight the new key features that XenDesktop 7.x brings to the table for all Citrix administrators. This section will not provide you with all the details regarding the new features and changes that XenDesktop 7.x has introduced, but it highlights the key points that every Citrix administrator should be aware of while administering XenDesktop 7.

Key highlights:

XenApp and XenDesktop are now part of a single setup.
Cloud integration to support desktop deployments on the cloud.
The IMA database doesn't exist anymore; IMA is replaced by FMA (FlexCast Management Architecture).
There are no more zones or ZDCs (Zone Data Collectors).
Microsoft SQL is the only supported database.
Sites are used instead of Farms.
XenApp and XenDesktop now share consoles; Citrix Studio and Desktop Director are used for both products.
The shadowing feature is deprecated; Citrix recommends using Microsoft Remote Assistance instead.
Locally installed applications can be integrated for use with server-based desktops.
HDX and mobility features.
Profile Management is included.
MCS can now be leveraged for both Server and Desktop OS.
MCS now works with KMS.
StoreFront replaces Web Interface.
Remote PC Access.
No more Citrix Streaming Profile Manager; Citrix recommends MS App-V.
The core component is replaced by a VDA agent.

Summary

We should now have a basic understanding of desktop virtualization concepts, the architecture, the new features in XenDesktop 7.x, and the XenDesktop delivery models based on FlexCast technology.
Resources for Article: Further resources on this subject: High Availability, Protection, and Recovery using Microsoft Azure [article] Designing a XenDesktop® Site [article] XenMobile™ Solutions Bundle [article]

Understanding Model-based Clustering

Packt
14 Sep 2015
10 min read
In this article by Ashish Gupta, author of the book Rapid – Apache Mahout Clustering Designs, we will discuss a model-based clustering algorithm. Model-based clustering is used to overcome some of the deficiencies that can occur in the K-means or Fuzzy K-means algorithms. We will discuss the following topics in this article:

Learning model-based clustering
Understanding Dirichlet clustering
Understanding topic modeling

(For more resources related to this topic, see here.)

Learning model-based clustering

In model-based clustering, we assume that the data is generated by a model and try to recover that model from the data. The right model will fit the data better than other models. In the K-means algorithm, we provide the initial set of clusters, and K-means gives us the data points in the clusters. Consider a case where the clusters are not normally distributed; then the cluster assignment produced by K-means will not be good. In this scenario, a model-based clustering algorithm will do the job. Another situation to consider when dividing the clusters is hierarchical clustering, where we need to find out the overlapping information. This situation is also covered by model-based clustering algorithms. If all components are not well separated, a cluster can consist of multiple mixture components.

In simple terms, in model-based clustering, the data is a mixture of two or more components. Each component has an associated probability and is described by a density function. Model-based clustering can capture the hierarchy and the overlap of the clusters at the same time. Partitions are determined by an EM (expectation-maximization) algorithm for maximum likelihood. The resulting models are compared using the Bayesian Information Criterion (BIC). The model with the lowest BIC is preferred. In the equation BIC = -2 log(L) + m log(n), L is the likelihood function, m is the number of free parameters to be estimated, and n is the number of data points.

Understanding Dirichlet clustering

Dirichlet clustering is a model-based clustering method. This algorithm is used to understand the data and cluster the data. Dirichlet clustering is a nonparametric, Bayesian modeling process. It is nonparametric because it can have an infinite number of parameters. Dirichlet clustering is based on the Dirichlet distribution. For this algorithm, we have a probabilistic mixture of a number of models that are used to explain the data. Each data point will be coming from one of the available models. The models are taken from a sample of a prior distribution over models, and points are assigned to these models iteratively. In each iteration, the probability that a point was generated by a particular model is calculated. After the points are assigned to a model, new parameters for each of the models are sampled. This sample is from the posterior distribution of the model parameters, and it considers all the observed data points assigned to the model. This sampling provides more information than normal clustering, as listed below:

As we are assigning points to different models, we can find out how many models are supported by the data.
The other information that we can get is how well the data is described by a model and how well two points are explained by the same model.
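To give a flavor of the mathematics behind this (a standard textbook formulation of the Dirichlet process, not Mahout-specific notation), the iterative assignment step can be written via the so-called Chinese restaurant process. When point $i$ is being assigned, given the assignments $c_{1:i-1}$ of the previous points and a concentration parameter $\alpha$:
$$P(c_i = k \mid c_{1:i-1}) = \frac{n_k}{i - 1 + \alpha}, \qquad P(c_i = \text{new model} \mid c_{1:i-1}) = \frac{\alpha}{i - 1 + \alpha}$$
Here $n_k$ is the number of points already assigned to model $k$. Because there is always a nonzero probability of opening a new model, the number of models is not fixed in advance; it grows as the data demands, which is exactly why the data itself can tell us how many models it supports.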
Topic modeling

In machine learning, topic modeling means finding a topic in a text document using a statistical model. A document on a particular topic contains some particular words. For example, if you are reading an article on sports, there is a high chance that you will encounter words such as football, baseball, Formula One, and the Olympics. So a topic model actually uncovers the hidden sense of an article or document. Topic models are the algorithms that can discover the main themes in a large set of unstructured documents. They uncover the semantic structure of the text. Topic modeling enables us to organize large-scale electronic archives.

Mahout has an implementation of one of the topic modeling algorithms: Latent Dirichlet Allocation (LDA). LDA is a statistical model of a document collection that tries to capture the intuition behind the documents. In normal clustering algorithms, if words having the same meaning don't occur together, then the algorithm will not associate them, but LDA can find out which two words are used in a similar context; in finding such associations, LDA is better than other algorithms.

LDA is a generative, probabilistic model. It is generative because the model is tweaked to fit the data and, using the parameters of the model, we can generate the data on which it fits. It is probabilistic because each topic is modeled as an infinite mixture over an underlying set of topic probabilities. The topic probabilities provide an explicit representation of a document. Graphically, an LDA model can be represented as follows:

The notation used in this image represents the following:

M, N, and K represent the number of documents, the number of words in a document, and the number of topics in a document, respectively.
α is the prior weight of topic k in a document.
β is the prior weight of word w in a topic.
φ is the probability of a word occurring in a topic.
Θ is the topic distribution.
z is the identity of the topic of all the words in all the documents.
w is the identity of all the words in all the documents.

How does LDA work in map-reduce mode? These are the steps that LDA follows in the mapper and reducer phases:

Mapper phase: The program starts with an empty topic model. All the documents are read by different mappers. The probabilities of each topic for each word in the document are calculated.
Reducer phase: The reducer receives the counts of probabilities. These counts are summed and the model is normalized.

This process is iterative, and in each iteration the sum of the probabilities is calculated; the process stops when it stops changing. A parameter, similar to the convergence threshold in K-means, is set to check for changes. In the end, LDA estimates how well the model fits the data.

In Mahout, the Collapsed Variational Bayes (CVB) algorithm is implemented for LDA. LDA uses term frequency vectors as input, not tf-idf vectors. We need to take care of two parameters while running the LDA algorithm: the number of topics and the number of words in the documents. A higher number of topics will produce very low-level topics, while a lower number will produce generalized, high-level topics, such as sports. In Mahout, mean field variational inference is used to estimate the model. It is similar to expectation-maximization for hierarchical Bayesian models. The expectation step reads each document and calculates the probability of each topic for each word in every document. The maximization step takes the counts, sums all the probabilities, and normalizes them.
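Before running LDA in Mahout, it is worth writing down the generative process that LDA assumes, using the notation above (this is the standard textbook formulation, not something specific to Mahout's implementation):
$$\varphi_k \sim \mathrm{Dirichlet}(\beta), \quad k = 1, \ldots, K$$
$$\theta_d \sim \mathrm{Dirichlet}(\alpha), \quad d = 1, \ldots, M$$
$$z_{d,n} \sim \mathrm{Multinomial}(\theta_d), \qquad w_{d,n} \sim \mathrm{Multinomial}(\varphi_{z_{d,n}}), \quad n = 1, \ldots, N$$
In words: each topic k is a distribution over words, each document d is a distribution over topics, and each observed word is generated by first picking a topic from the document's topic distribution and then picking a word from that topic's word distribution. Inference (CVB in Mahout's case) runs this process in reverse, recovering the hidden $\varphi$, $\theta$, and $z$ from the observed words.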
Running LDA using Mahout

To run LDA using Mahout, we will use the 20 Newsgroups dataset. We will convert the corpus to vectors, run LDA on these vectors, and get the resulting topics. Let's run this example to see how topic modeling works in Mahout.

Dataset selection

We will use the 20 Newsgroups dataset for this exercise. Download the 20news-bydate.tar.gz dataset from http://qwone.com/~jason/20Newsgroups/.

Steps to execute CVB (LDA)

Perform the following steps to execute the CVB algorithm:

Create a 20newsdata directory and unzip the data here:

mkdir /tmp/20newsdata
cd /tmp/20newsdata
tar -xzvf /tmp/20news-bydate.tar.gz

There are two folders under 20newsdata: 20news-bydate-test and 20news-bydate-train. Now, create another directory, 20newsdataall, and merge both the training and test data of the newsgroups:

mkdir /tmp/20newsdataall
cp -R /tmp/20newsdata/*/* /tmp/20newsdataall

Create a directory in Hadoop and save this data in HDFS:

hadoop fs -mkdir /user/hue/20newsdata
hadoop fs -put /tmp/20newsdataall /user/hue/20newsdata

Mahout CVB accepts the data in vector format. For this, first we will generate a sequence file from the directory as follows:

bin/mahout seqdirectory -i /user/hue/20newsdata/20newsdataall -o /user/hue/20newsdataseq-out

Convert the sequence file to a sparse vector, using, as discussed earlier, the term frequency weight:

bin/mahout seq2sparse -i /user/hue/20newsdataseq-out/part-m-00000 -o /user/hue/20newsdatavec -lnorm -nv -wt tf

Convert the sparse vector to the matrix input form required by the CVB algorithm:

bin/mahout rowid -i /user/hue/20newsdatavec/tf-vectors -o /user/hue/20newsmatrix

Run the CVB algorithm:

bin/mahout cvb -i /user/hue/20newsmatrix/matrix -o /user/hue/ldaoutput -k 10 -x 20 -dict /user/hue/20newsdatavec/dictionary.file-0 -dt /user/hue/ldatopics -mt /user/hue/ldamodel

The parameters used in the preceding command can be explained as follows:

-i: This is the input path of the document vectors
-o: This is the output path of the topic-term distribution
-k: This is the number of latent topics
-x: This is the maximum number of iterations
-dict: This is the term dictionary file
-dt: This is the output path of the document-topic distribution
-mt: This is the model state path after each iteration

The output of the preceding command can be seen as follows:

Once the command finishes, you will get the information on the screen as follows:

To view the output, run the following command:

bin/mahout vectordump -i /user/hue/ldaoutput/ -d /user/hue/20newsdatavec/dictionary.file-0 -dt sequencefile -vs 10 -sort true -o /tmp/lda-output.txt

The parameters used in the preceding command can be explained as follows:

-i: This is the input location of the CVB output
-d: This is the dictionary file location created during vector creation
-dt: This is the dictionary file type (sequence or text)
-vs: This is the vector size
-sort: This is the flag to put true or false
-o: This is the output location on the local filesystem

Now your output will be saved in the local filesystem. Open the file and you will see an output similar to the following:

From the preceding screenshot, you can see that after running the algorithm you get each term and its probability.

Summary

In this article, we learned about model-based clustering, the Dirichlet process, and topic modeling. In model-based clustering, we try to obtain the model from the data, while the Dirichlet process is used to understand the data.
Topic modeling helps us to identify the topics in an article or in a set of documents. We discussed how Mahout implements topic modeling using Latent Dirichlet Allocation and how it works in map-reduce mode. We also discussed how to use Mahout to find the topic distribution over a set of documents.

Resources for Article: Further resources on this subject: Learning Random Forest Using Mahout[article] Implementing the Naïve Bayes classifier in Mahout[article] Clustering [article]