How-To Tutorials


Swift Power and Performance

Packt
12 Oct 2015
14 min read
In this article by Kostiantyn Koval, author of the book Swift High Performance, we will learn about Swift, its performance and optimization, and how to achieve high performance. (For more resources related to this topic, see here.)

Swift speed

I could guess you are interested in Swift speed and are probably wondering "How fast can Swift be?" Before we even start learning Swift and discovering all the good things about it, let's answer this right here and right now. Let's take an array of 100,000 random numbers, sort it in Swift, Objective-C, and C by using a standard sort function from stdlib (sort for Swift, qsort for C, and compare for Objective-C), and measure how much time it takes. To sort an array with 100,000 integer elements, the timings are:

Swift: 0.00600 sec
C: 0.01396 sec
Objective-C: 0.08705 sec

The winner is Swift! Swift is 14.5 times faster than Objective-C and 2.3 times faster than C. In other examples and experiments, C is usually faster than Objective-C, and Swift is faster still.

Comparing the speed of functions

You know how functions and methods are implemented and how they work. Let's compare the performance and speed of global functions and different method types. For our test, we will use a simple add function. Take a look at the following code snippet:

func add(x: Int, y: Int) -> Int {
    return x + y
}

class NumOperation {
    func addI(x: Int, y: Int) -> Int { return x + y }
    class func addC(x: Int, y: Int) -> Int { return x + y }
    static func addS(x: Int, y: Int) -> Int { return x + y }
}

class BigNumOperation: NumOperation {
    override func addI(x: Int, y: Int) -> Int { return x + y }
    override class func addC(x: Int, y: Int) -> Int { return x + y }
}

For the measurement and code analysis, we use a simple loop in which we call those different methods:

measure("addC") {
    var result = 0
    for i in 0...2000000000 {
        result += NumOperation.addC(i, y: i + 1)
        // result += test different method
    }
    print(result)
}

Here are the results: all the methods perform exactly the same. Even more so, their assembly code looks exactly the same, except for the name of the function call:

Global function: add(10, y: 11)
Static: NumOperation.addS(10, y: 11)
Class: NumOperation.addC(10, y: 11)
Static subclass: BigNumOperation.addS(10, y: 11)
Overridden subclass: BigNumOperation.addC(10, y: 11)

Even though the BigNumOperation addC class function overrides the NumOperation addC function, when you call it directly there is no need for a vtable lookup. The instance method call looks a bit different:

Instance:
let num = NumOperation()
num.addI(10, y: 11)

Subclass overridden instance:
let bigNum = BigNumOperation()
bigNum.addI(10, y: 11)

One difference is that we need to initialize a class and create an instance of the object. In our example, this is not an expensive operation because we do it outside the loop and it takes place only once. The loop calling the instance method looks exactly the same.

As you can see, there is almost no difference between the global function and the static and class methods. The instance method looks a bit different, but it doesn't have any major impact on performance. That holds for simple use cases; in more complex examples, there is a difference between them. Let's take a look at the following code snippet:

let baseNumType = arc4random_uniform(2) == 1 ? BigNumOperation.self : NumOperation.self
for i in 0...loopCount {
    result += baseNumType.addC(i, y: i + 1)
}
print(result)

The only difference we introduced here is that instead of specifying the NumOperation class type at compile time, we choose it randomly at runtime.
Because of this, the Swift compiler doesn't know at compile time which method should be called: BigNumOperation.addC or NumOperation.addC. This small change has an impact on the generated assembly code and on performance.

A summary of the usage of functions and methods

Global functions are the simplest and give the best performance. Too many global functions, however, make the code hard to read and reason about.
Static type methods, which can't be overridden, have the same performance as global functions, but they also provide a namespace (their type name), so the code looks clearer and there is no adverse effect on performance.
Class methods, which can be overridden, could lead to a decrease in performance; they should be used when you need class inheritance. In other cases, static methods are preferred.
The instance method operates on an instance of the object. Use instance methods when you need to operate on the data of that instance.
Make methods final when you don't need to override them. This gives an extra hint to the compiler for optimization, and performance can increase because of it.

Intelligent code

Because Swift is a static and strongly typed language, it can read, understand, and optimize code very well. It tries to avoid the execution of all unnecessary code. For a better explanation, let's take a look at this simple example:

class Object {
    func nothing() {
    }
}

let object = Object()
object.nothing()
object.nothing()

We create an instance of the Object class and call a nothing method. The nothing method is empty, and calling it does nothing. The Swift compiler understands this and removes those method calls. After this, we have only one line of code:

let object = Object()

The Swift compiler can also remove objects that are created but never used. This reduces memory usage and unnecessary function calls, which also reduces CPU usage. In our example, the object instance is not used after the nothing method calls are removed, so the creation of the object can be removed as well. In this way, Swift removes all three lines of code and we end up with no code to execute at all.

Objective-C, in comparison, can't do this optimization. Because it has a dynamic runtime, the nothing method's implementation could be changed to do some work at runtime. That's why Objective-C can't remove empty method calls.

This optimization might not seem like a big win, but let's take a look at another, slightly more complex, example that uses more memory:

class Object {
    let x: Int
    let y: Int
    let z: Int

    init(x: Int) {
        self.x = x
        self.y = x * 2
        self.z = y * 2
    }

    func nothing() {
    }
}

We have added some Int data to our Object class to increase memory usage. Now, the Object instance uses at least 24 bytes (3 * Int size; Int uses 8 bytes on a 64-bit architecture). Let's also try to increase the CPU usage by adding more instructions, using a loop:

for i in 0...1_000_000 {
    let object = Object(x: i)
    object.nothing()
    object.nothing()
}
print("Done")

Integer literals can use the underscore sign (_) to improve readability. So, 1_000_000_000 is the same as 1000000000.

Now we have 3 million instructions and would use 24 million bytes (about 24 MB). This is quite a lot for an operation that actually doesn't do anything. As you can see, we don't use the result of the loop body. For the loop body, Swift does the same optimization as in the previous example and we end up with an empty loop:

for i in 0...1_000_000 {
}

The empty loop can be skipped as well.
As a result, we have saved 24 MB of memory usage and 3 million method calls.

Dangerous functions

There are some functions and instructions that sometimes don't provide any value for the application, but the Swift compiler can't skip them, and that can have a very negative impact on performance.

Console print

Printing a statement to the console is usually used for debugging purposes. The print and debugPrint instructions aren't removed from the application in release mode. Let's explore this code:

for i in 0...1_000_000 {
    print(i)
}

The Swift compiler treats print and debugPrint as valid and important instructions that can't be skipped. Even though this code does nothing, it can't be optimized, because Swift doesn't remove the print statement. As a result, we have 1 million unnecessary instructions.

As you can see, even very simple code that uses the print statement can decrease an application's performance very drastically. The loop with 1,000,000 print statements takes 5 seconds, and that's a lot. It's even worse if you run it in Xcode, where it can take up to 50 seconds. It gets worse still if you add a print instruction to the nothing method of the Object class from the previous example:

func nothing() {
    print(x + y + z)
}

In that case, the loop in which we create an instance of Object and call nothing can't be eliminated, because of the print instruction. Even though Swift can't eliminate the execution of that code completely, it optimizes it by removing the creation of the Object instance and the calls to the nothing method, turning it into a simple loop operation. The compiled code after optimization looks like this:

// Initial source code
for i in 0...1_000_000 {
    let object = Object(x: i)
    object.nothing()
    object.nothing()
}

// Optimized code
var x = 0, y = 0, z = 0
for i in 0...1_000_000 {
    x = i
    y = x * 2
    z = y * 2
    print(x + y + z)
    print(x + y + z)
}

As you can see, this code is far from perfect and has a lot of instructions that don't give us any value. There is a way to improve this code so that the Swift compiler can do the same optimization as it does without print.

Removing print logs

To solve this performance problem, we have to remove the print statements from the code before compiling it. There are different ways of doing this.

Comment out

The first idea is to comment out all print statements in release mode:

//print("A")

This will work, but the next time you want to enable logs, you will need to uncomment that code. This is a very bad and painful practice, and there is a better solution. Commented-out code is bad practice in general. You should be using a source code version control system, such as Git, instead. That way, you can safely remove the unnecessary code and find it in the history if you need it later.

Using a build configuration

We can enable print only in debug mode. To do this, we will use a build configuration to conditionally exclude some code. First, we need to add a Swift compiler custom flag: select the project target, go to Build Settings | Other Swift Flags, and in the Swift Compiler - Custom Flags section add the -D DEBUG flag for debug mode.

After this, you can use the DEBUG configuration flag to enable code only in debug mode. We will define our own print function that generates a print statement only in debug mode.
In release mode, this function will be empty, and the Swift compiler will successfully eliminate it:

func D_print(items: Any..., separator: String = " ", terminator: String = "\n") {
    #if DEBUG
        print(items, separator: separator, terminator: terminator)
    #endif
}

Now, everywhere instead of print, we will use D_print:

func nothing() {
    D_print(x + y + z)
}

You can also create a similar D_debugPrint function. Swift is very smart and does a lot of optimization, but we also have to make our code clear for people to read and for the compiler to optimize. Using a preprocessor adds complexity to your code. Use it wisely, and only in situations when normal if conditions won't work, as in our D_print example.

Improving speed

There are a few techniques that can simply improve code performance. Let's proceed directly to the first one.

final

You can mark a function or property declaration with the final attribute. Adding the final attribute makes it non-overridable: subclasses can't override that method or property. When a method is non-overridable, there is no need to store it in the vtable, and the call to that function can be performed directly, without any function address lookup in the vtable:

class Animal {
    final var name: String = ""
    final func feed() {
    }
}

As you have seen, final methods perform faster than non-final methods. Even such a small optimization can improve an application's performance. It not only improves performance but also makes the code more secure: you prevent a method from being overridden, and so prevent unexpected and incorrect behavior.

Enabling the Whole Module Optimization setting achieves very similar optimization results, but it's better to mark function and property declarations explicitly as final, which reduces the compiler's work and speeds up compilation. The compilation time for big projects with Whole Module Optimization could be up to 5 minutes in Xcode 7.

Inline functions

As you have seen, Swift can do optimization and inline some function calls. This way, there is no performance penalty for calling a function. You can manually enable or disable inlining with the @inline attribute:

@inline(__always) func someFunc() {
}

@inline(never) func someOtherFunc() {
}

Even though you can control inlining manually, it's usually better to leave it to the Swift compiler. Depending on the optimization settings, the Swift compiler applies different inlining techniques. The use case for @inline(__always) would be very simple one-line functions that you always want to be inlined.

Value objects and reference objects

There are many benefits to using immutable value types. Value objects make code not only safer and clearer but also faster. They have better speed and performance than reference objects; here is why.

Memory allocation

A value object can be allocated in stack memory instead of heap memory. Reference objects need to be allocated in heap memory because they can be shared between many owners. Because value objects have only one owner, they can be safely allocated on the stack. Stack memory is way faster than heap memory.

The second advantage is that value objects don't need reference-counting memory management. As they can have only one owner, there is no such thing as reference counting for value objects. With Automatic Reference Counting (ARC) we don't think much about memory management, and it mostly looks transparent to us.
Even though code looks the same when using reference objects and value objects, ARC adds extra retain and release method calls for reference objects.

Avoiding Objective-C

In most cases, Objective-C, with its dynamic runtime, performs slower than Swift. The interoperability between Swift and Objective-C is done so seamlessly that we sometimes use Objective-C types and its runtime in Swift code without knowing it. When you use Objective-C types in Swift code, Swift actually uses the Objective-C runtime for method dispatch. Because of that, Swift can't do the same optimization as for pure Swift types. Let's take a look at a simple example:

for _ in 0...100 {
    _ = NSObject()
}

Let's read this code and make some assumptions about how the Swift compiler would optimize it. The NSObject instance is never used in the loop body, so we could eliminate the creation of the object. After that, we would have an empty loop, which could be eliminated as well. So we could remove all of the code from execution, but actually no code gets eliminated. This happens because Objective-C types use dynamic runtime method dispatch, called message sending.

All standard frameworks, such as Foundation and UIKit, are written in Objective-C, and all types such as NSDate, NSURL, UIView, and UITableView use the Objective-C runtime. They do not perform as fast as Swift types, but in return we get all of these frameworks available for use in Swift, and that is great. There is no way to remove the Objective-C dynamic runtime dispatch from Objective-C types in Swift, so the only thing we can do is learn how to use them wisely.

Summary

In this article, we covered many powerful features of Swift related to Swift's performance and gave some tips on how to solve performance-related issues.

Resources for Article:

Further resources on this subject:
Flappy Swift [article]
Profiling an app [article]
Network Development with Swift [article]


Running Firefox OS Simulators with WebIDE

Packt
12 Oct 2015
9 min read
In this article by Tanay Pant, the author of the book Learning Firefox OS Application Development, you will learn how to use WebIDE and its features. We will start by installing Firefox OS simulators in WebIDE so that we can run and test Firefox OS applications in it. Then we will study how to install and create new applications with WebIDE. Finally, we will cover topics such as using developer tools for applications that run in WebIDE, and uninstalling applications in Firefox OS. In brief, we will go through the following topics:

Getting to know about WebIDE
Installing the Firefox OS simulator
Installing and creating new apps with WebIDE
Using developer tools inside WebIDE
Uninstalling applications in Firefox OS

(For more resources related to this topic, see here.)

Introducing WebIDE

It is now time to have a peek at Firefox OS. You can test your applications in two ways: either by running them on a real device or by running them in the Firefox OS Simulator. Let's go ahead with the latter option, since you might not have a Firefox OS device yet. We will use WebIDE, which comes preinstalled with Firefox, to accomplish this task. If you haven't installed Firefox yet, you can do so from https://www.mozilla.org/en-US/firefox/new/.

WebIDE allows you to install one or several runtimes (different versions) together. You can use WebIDE to install different types of applications, debug them using Firefox's Developer Tools suite, and edit the applications and their manifest using the built-in source editor.

After you install Firefox, open WebIDE by navigating to Tools | Web Developer | WebIDE.

You will notice that on the top-right side of the window there is a Select Runtime option. When you click on it, you will see the Install Simulator option. Select that option, and you will see a page titled Extra Components, which presents a list of Firefox OS simulators. We will install the latest stable and unstable versions of Firefox OS, because we will need both to test our applications in the future. After you successfully install both simulators, click on Select Runtime; both OS versions are now listed.

Let's open Firefox OS 3.0. This opens a new window titled B2G. You should now explore Firefox OS, take a look at its applications, and interact with them. It's all HTML, CSS, and JavaScript. Wonderful, isn't it? Very soon, you will develop applications like these.

Installing and creating new apps using WebIDE

To install or create a new application, click on Open App in the top-left corner of the WebIDE window. You will notice that there are three options: New App, Open Packaged App, and Open Hosted App. For now, think of hosted apps as websites that are served from a web server and stored online on the server itself, but that can still use appcache and indexeddb to store all their assets and data offline, if desired. Packaged apps are distributed in a .zip format; they can be thought of as the source code of the website bundled and distributed in a ZIP file.

Let's now head to the first option in the Open App menu, which is New App. Select the HelloWorld template, enter a project name, and click on OK. After completing this, WebIDE will ask you for the directory where you want to store the application. I have made a new folder named Hello World for this purpose on the desktop.
Now, click on the Open button and finally click on the OK button. This will prepare your app and show details such as the Title, Icon, Description, Location, and App ID of your application. Note that beneath the app title it says Packaged Web. Can you figure out why? As we discussed, it is because we are not serving the application online, but from a packaged directory that holds its source code. This covers the right-hand panel.

In the left-hand panel, we have the directory listing of the application. It contains an icon folder that holds different-sized icons for different screen resolutions. It also contains the app.js file, which is the engine of the application and will contain its functionality; index.html, which contains the markup for the application; and finally the manifest.webapp file, which contains crucial information and various permissions for the application. If you click on any filename, you will notice that the file opens in an in-browser editor where you can edit the files to make changes to your application and save them from here itself.

Let's make some edits in the application, in app.js and index.html. I have replaced World with Firefox everywhere to make it Hello Firefox. Let's make the same changes in the manifest file. The manifest file contains details of your application, such as its name, description, launch path, icons, developer information, and permissions. These details are used to display information about your application in WebIDE and the Firefox Marketplace. The manifest file is in JSON format. I went ahead and edited the developer information in the application as well, to include my name and my website. After saving all the files, you will notice that the information about the app in WebIDE has changed.

It's now time to run the application in Firefox OS. Click on Select Runtime and fire up Firefox OS 3.0. After it is launched, click on the Play button in WebIDE (hovering over it shows the prompt Install and Run). Doing this will install and launch the application on your simulator. Congratulations, you have installed your first Firefox OS application!

Using developer tools inside WebIDE

WebIDE allows you to use Firefox's excellent developer tools for applications that run in the simulator via WebIDE as well. To use them, simply click on the Settings icon (which looks like a wrench) beside the Install and Run icon that you used to get the app installed and running. The icon says Debug App when you hover the cursor over it. Click on it to reveal the developer tools for the app that is running via WebIDE.

Click on Console, and you will see the message Hello Firefox, which we gave as the input to console.log() in the app.js file. Note that it also specifies the App ID of our application while displaying Hello Firefox. I also sent a command via the console, alert('Hello Firefox');, and it simultaneously executed the instruction in the app running in the simulator. As you may have noticed, Firefox OS customizes the look and feel of components such as the alert box (this is browser based).

Our application is running in an iframe in Gaia. Every app, including the keyboard application, runs in an iframe for security reasons. You should go through these tools to get the hang of their debugging capabilities if you haven't done so already!
One more important thing that you should keep in mind is that inline scripts (for example, <a href="#" onclick="alert(this)">Click Me</a>) are forbidden in Firefox OS apps, due to Content Security Policy (CSP) restrictions. CSP restrictions include remote scripts, inline scripts, javascript URIs, the function constructor, dynamic code execution, and plugins such as Flash or Shockwave. Remote styles are also banned. Remote Web Workers and the eval() operator are not allowed for security reasons; they produce a 400 error and security errors respectively upon usage. You are warned about CSP violations when submitting your application to the Firefox OS Marketplace. CSP warnings in the validator will not affect whether your app is accepted into the Marketplace. However, if your app is privileged and violates the CSP, you will be asked to fix this issue in order to get your application accepted.

Browsing other runtime applications

You can also take a look at the source code of the preinstalled (runtime) apps that are present in Firefox OS, or Gaia to be precise. Click on the Hello World button (in the same place where Open App used to be), and this shows you the whole list of runtime apps. I clicked on the Camera application and it showed me the source code of its main.js file. It's completely okay if you are daunted by the huge file. If you find these runtime applications interesting and want to contribute to them, you can refer to the Mozilla Developer Network's articles on developing Gaia, which you can find at https://developer.mozilla.org/en-US/Firefox_OS/Developing_Gaia.

Our application now appears in the App Launcher of the operating system.

Uninstalling applications in Firefox OS

You can remove the project from WebIDE by clicking on the Remove Project button on the home page of the application. However, this will not uninstall the application from the Firefox OS Simulator. The uninstallation system of the operating system is quite similar to iOS. You just have to double tap in OS X to get the Edit screen, from where you can click on the cross button at the top left of the app icon to uninstall the app. You will then get a confirmation screen that warns you that all the data of the application will be deleted along with the app. This takes you back to the Edit screen, where you can click on Done to get back to the home screen.

Summary

In this article, you learned about WebIDE, how to install the Firefox OS simulator in WebIDE, using Firefox OS and installing applications in it, and creating a skeleton application using WebIDE. You then learned how to use developer tools for applications that run in the simulator, and how to browse the other preinstalled runtime applications present in Firefox OS. Finally, you learned about removing a project from WebIDE and uninstalling an application from the operating system.

Resources for Article:

Further resources on this subject:
Learning Node.js for Mobile Application Development [Article]
Introducing Web Application Development in Rails [Article]
One-page Application Development [Article]


Exploring Windows PowerShell 5.0

Packt
12 Oct 2015
16 min read
In this article by Chendrayan Venkatesan, the author of the book Windows PowerShell for .NET Developers, we will cover the following topics:

Basics of Desired State Configuration (DSC)
Parsing structured objects using PowerShell
Exploring package management
Exploring PowerShellGet
Exploring other enhanced features

(For more resources related to this topic, see here.)

Windows PowerShell 5.0 has many significant benefits; to know more about its features, refer to the following link: http://go.microsoft.com/fwlink/?LinkID=512808

A few highlights of Windows PowerShell 5.0 are as follows:

Improved usability
Backward compatibility
The Class and Enum keywords are introduced
Parsing structured objects is made easy using the ConvertFrom-String command
New modules are introduced in Windows PowerShell 5.0, such as Archive and PackageManagement (formerly known as OneGet)
The ISE supports transcription
Using the PowerShellGet module, we can find, install, and publish modules
Runspace debugging can be done using the Microsoft.PowerShell.Utility module

Basics of Desired State Configuration

Desired State Configuration, also known as DSC, is a new management platform in Windows PowerShell. Using DSC, we can deploy and manage configuration data for software services and manage the environment. DSC can be used to streamline datacenters; it was introduced along with Windows Management Framework 4.0 and heavily extended in Windows Management Framework 5.0. A few highlights of DSC in the April 2015 Preview are as follows:

New cmdlets are introduced in WMF 5.0
A few DSC commands are updated, and remarkable changes are made to the configuration management platform in PowerShell 5.0
DSC resources can be built using classes, so there is no need for a MOF file

It's not mandatory to know PowerShell to learn DSC, but it's a great added advantage. Similar to a function, we can also use the Configuration keyword, but there is a huge difference: in DSC everything is declarative, which is a cool thing about Desired State Configuration. Before beginning this exercise, I created a DSCDemo lab machine in the Azure cloud with Windows Server 2012, which is available out of the box, so the default PowerShell version is 4.0.

For now, let's create and define a simple configuration that creates a file on the local host. Yes, a simple New-Item command can do that, but it's an imperative cmdlet, and we need to write a program to tell the computer to create the file if it does not exist. The structure of a DSC configuration is as follows:

Configuration Name
{
    Node ComputerName
    {
        ResourceName <String>
        {
        }
    }
}

To create a simple text file with contents, we use the following code:

Configuration FileDemo
{
    Node $env:COMPUTERNAME
    {
        File FileDemo
        {
            Ensure = 'Present'
            DestinationPath = 'C:\Temp\Demo.txt'
            Contents = 'PowerShell DSC Rocks!'
            Force = $true
        }
    }
}

The key parts of this configuration are as follows:

Using the Configuration keyword, we define a configuration with the name FileDemo; it's a friendly name.
Inside the Configuration block we create a Node block and a File resource for the local host. File is the resource name.
FileDemo is a friendly name for the resource instance, and it's also a string.
The properties of the File resource are set inside the resource block.

Calling the configuration, much like calling a function, creates a MOF file. But wait, the text file itself is not yet created; we have only generated a MOF file.
Take a look at the MOF file structure. We can manually edit the MOF and use it on another machine that has PowerShell 4.0 installed. It's not mandatory to use PowerShell to generate the MOF; if you are comfortable with the format, you can write the MOF file directly.

To explore the available DSC resources, you can execute the following command:

Get-DscResource

The output tells you the following:

How the resources are implemented: Binary, Composite, PowerShell, and so on. In the preceding example, we created a DSC configuration, FileDemo, which is listed as Composite.
The name of the resource.
The module name the resource belongs to.
The properties of the resource.

To see the syntax of a particular DSC resource, we can try the following code, which shows the resource syntax in detail:

Get-DscResource -Name Service -Syntax

Now, let's see how DSC works in its three different phases:

The authoring phase
The staging phase
The "Make it so" phase

The authoring phase

In this phase we create a DSC configuration using PowerShell, and this outputs a MOF file. The FileDemo example, where we created a configuration, is the authoring phase.

The staging phase

In this phase the declarative MOF is staged, per node. DSC has a push and a pull model. Push is simply pushing the configuration to target nodes; the custom providers need to be manually placed on the target machines. In pull mode, we need to build an IIS server that holds the MOF files for the target nodes, and this is well defined by the OData interface. In pull mode, the custom providers are downloaded to the target system.

The "Make it so" phase

This is the phase for enacting the configuration, that is, applying the configuration on the target nodes.

Before we summarize the basics of DSC, let's see a few more DSC commands. We can list them by executing the following command:

Get-Command -Noun DSC*

We are using the PowerShell 4.0 stable release and not 5.0, so the version will not be available in the output. Local Configuration Manager (LCM) is the engine of DSC and it runs on all nodes. LCM is responsible for calling the configuration resources that are included in a DSC configuration script. Try executing the Get-DscLocalConfigurationManager cmdlet to explore its properties. To apply the LCM settings on target nodes, we can use the Set-DscLocalConfigurationManager cmdlet.

Use case of classes in WMF 5.0

Using classes in PowerShell allows IT professionals, system administrators, and system engineers to start learning development in WMF. It's time for us to switch back to Windows PowerShell 5.0, because the Class keyword is supported from version 5.0 onwards.

Why do we need to write classes in PowerShell? Is there any special need? Maybe we will answer this in this section, but this is one reason why I prefer to say that PowerShell is far more than a scripting language. When the Class keyword was introduced, it mainly focused on creating DSC resources. But using classes we can create objects as in any other object-oriented programming language. The documentation used to read that New-Object is not supported, but it has been revised; New-Object is indeed supported. The class we create in Windows PowerShell is a .NET Framework type.

How do you create a PowerShell class? It's easy, just use the Class keyword! The following steps will help you to create a PowerShell class.
Create a class named ClassName: Class ClassName {} is an empty class.
Define properties in the class: Class ClassName {$Prop1, $Prop2}
Instantiate the class: $var = [ClassName]::new()
Now check the output of $var:

Class ClassName
{
    $Prop1
    $Prop2
}
$var = [ClassName]::new()
$var

Let's now have a look at how to create a class and its advantages. Let us define the properties in the class:

Class Catalog
{
    # Properties
    $Model = 'Fujitsu'
    $Manufacturer = 'Life Book S Series'
}
$var = New-Object Catalog
$var

The output shows the class and its members, and we can change a property value by simple assignment.

Now let's create a method with overloads. In the following example, we create a method named SetInformation that accepts two arguments, $mdl and $mfgr, both of the string type. Using $var.SetInformation with no parentheses shows the overload definitions of the method. The code is as follows:

Class Catalog
{
    # Properties
    $Model = 'Fujitsu'
    $Manufacturer = 'Life Book S Series'

    SetInformation([String]$mdl, [String]$mfgr)
    {
        $this.Manufacturer = $mfgr
        $this.Model = $mdl
    }
}
$var = New-Object -TypeName Catalog
$var.SetInformation

# Output
OverloadDefinitions
-------------------
void SetInformation(string mdl, string mfgr)

Let's set the model and manufacturer using SetInformation, as follows:

$var = New-Object -TypeName Catalog
$var.SetInformation('Surface', 'Microsoft')
$var

Inside a PowerShell class we can use PowerShell cmdlets as well; a demo of this follows in the constructor example below. A class also allows us to validate parameters. Let's have a look at the following example:

Class Order
{
    [ValidateSet("Red", "Blue", "Green")]
    $color

    [ValidateSet("Audi")]
    $Manufacturer

    Book($Manufacturer, $color)
    {
        $this.color = $color
        $this.Manufacturer = $Manufacturer
    }
}

The properties $color and $Manufacturer have the ValidateSet attribute with a set of allowed values. Now let's use New-Object and set a property to an argument which doesn't belong to this set:

$var = New-Object Order
$var.color = 'Orange'

Now, we get the following error:

Exception setting "color": "The argument "Orange" does not belong to the set "Red,Blue,Green" specified by the ValidateSet attribute. Supply an argument that is in the set and then try the command again."

Let's set the argument values correctly and get the result using the Book method, as follows:

$var = New-Object Order
$var.Book('Audi', 'Red')
$var

Constructors

A constructor is a special type of method that creates new objects. It has the same name as the class and its return type is void. Multiple constructors are supported, but each one takes different numbers and types of parameters. The following code shows a simple constructor in PowerShell that creates a user in Active Directory (note the use of the New-ADUser cmdlet inside the class):

Class ADUser
{
    $identity
    $Name

    ADUser($Identity, $Name)
    {
        New-ADUser -SamAccountName $Identity -Name $Name
        $this.identity = $Identity
        $this.Name = $Name
    }
}
$var = [ADUser]::new('Dummy', 'Test Case User')
$var

We can also hide properties in a PowerShell class. For example, let's create two properties and hide one.
In theory it just hides the property, but we can still use it, as follows:

Class Hide
{
    [String]$Name
    Hidden $ID
}
$var = [Hide]::new()
$var

Additionally, we can carry out operations such as Get and Set, as shown in the following code:

Class Hide
{
    [String]$Name
    Hidden $ID
}
$var = [Hide]::new()
$var.Id = '23'
$var.Id

This returns 23 as output. To explore more about classes, use help about_Classes -Detailed.

Parsing structured objects using PowerShell

In Windows PowerShell 5.0 a new cmdlet, ConvertFrom-String, has been introduced, and it's available in the Microsoft.PowerShell.Utility module. Using this command, we can parse structured objects from any given string content. To see its documentation, use help ConvertFrom-String -Detailed. The help lists an incorrect parameter, PropertyName, so copy and paste will not work; use help ConvertFrom-String -Parameter * and read the parameter list: it's actually PropertyNames.

Now, let's see an example of using ConvertFrom-String. Consider a scenario where a team has custom code that generates log files for daily health check-up reports of their environment. Unfortunately, the tool delivered by the vendor is an EXE file and no source code is available. The log file format is as follows:

"Error 4356 Lync" , "Warning 6781 SharePoint" , "Information 5436 Exchange", "Error 3432 Lync" , "Warning 4356 SharePoint" , "Information 5432 Exchange"

There are many ways to manipulate these records, but let's see how the PowerShell cmdlet ConvertFrom-String helps us. Using the following code, we simply extract the Type, EventID, and Server:

"Error 4356 Lync" , "Warning 6781 SharePoint" , "Information 5436 Exchange", "Error 3432 Lync" , "Warning 4356 SharePoint" , "Information 5432 Exchange" |
    ConvertFrom-String -PropertyNames Type, EventID, Server

Okay, what's interesting in this? It's cool because the output is now a PSCustomObject, and you can manipulate it as required:

"Error 4356 Lync" , "Warning 6781 SharePoint" , "Information 5436 Exchange", "Error 3432 SharePoint" , "Warning 4356 SharePoint" , "Information 5432 Exchange" |
    ConvertFrom-String -PropertyNames Type, EventID, Server |
    ? {$_.Type -eq 'Error'}

The output shows that Lync and SharePoint have some error logs that need to be taken care of on priority. Since requirements vary, you can use this cmdlet as needed.

ConvertFrom-String has a -Delimiter parameter, which helps us manipulate strings as well. In the following example, the -Delimiter parameter splits on white space and returns the named properties:

"Chen V" | ConvertFrom-String -Delimiter "\s" -PropertyNames "FirstName", "SurName"

This returns FirstName as Chen and SurName as V.

Next, let's walk through using a template to manipulate strings as we need. To do this we use the -TemplateContent parameter; see help ConvertFrom-String -Parameter TemplateContent. Before we begin, we need to create a template file. To do this, let's ping a web site.
Ping www.microsoft.com, and the output returned is as follows:

Pinging e10088.dspb.akamaiedge.net [2.21.47.138] with 32 bytes of data:
Reply from 2.21.47.138: bytes=32 time=37ms TTL=51
Reply from 2.21.47.138: bytes=32 time=35ms TTL=51
Reply from 2.21.47.138: bytes=32 time=35ms TTL=51
Reply from 2.21.47.138: bytes=32 time=36ms TTL=51

Ping statistics for 2.21.47.138:
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 35ms, Maximum = 37ms, Average = 35ms

Now we have the information in some structure. Let's extract the IP and bytes; to do this, I replaced the IP and bytes values in the template with {IP*:2.21.47.138} and {[int32]Bytes:32}:

Pinging e10088.dspb.akamaiedge.net [2.21.47.138] with 32 bytes of data:
Reply from {IP*:2.21.47.138}: bytes={[int32]Bytes:32} time=37ms TTL=51
Reply from {IP*:2.21.47.138}: bytes={[int32]Bytes:32} time=35ms TTL=51
Reply from {IP*:2.21.47.138}: bytes={[int32]Bytes:32} time=36ms TTL=51
Reply from {IP*:2.21.47.138}: bytes={[int32]Bytes:32} time=35ms TTL=51

Ping statistics for 2.21.47.138:
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 35ms, Maximum = 37ms, Average = 35ms

ConvertFrom-String also has a -Debug parameter, with which we can debug our template file. The following example shows the debugging output:

ping www.microsoft.com | ConvertFrom-String -TemplateFile C:\Temp\Template.txt -Debug

As we mentioned earlier, PowerShell 5.0 is a preview release and has a few bugs. Let's ignore those for now and focus on the features which work fine and can be used in an environment.

Exploring package management

In this topic, we will walk you through the features of package management, another great feature of Windows Management Framework 5.0. This was introduced in Windows 10 and was formerly known as OneGet. Using package management, we can automate software discovery, software installation, and inventorying. Do not think about Software Inventory Logging (SIL) for now. As we know, each Windows software installation technology has its own way of doing installations, for example MSI, MSU, and so on. This is a real challenge for IT professionals and developers who want a single, uniform way to automate software installation or deployment. Now we can do it using the PackageManagement module.

To begin with, let's look at the PackageManagement module using the following code:

Get-Module -Name PackageManagement

The output shows that it is a binary module. Okay, how do we know the available cmdlets and their usage? PowerShell has the simplest way to do things, as shown in the following code:

Get-Command -Module PackageManagement

Package providers are the providers connected to package management (OneGet), and package sources are registered for the providers. To view the list of providers and sources, we use the Get-PackageProvider and Get-PackageSource cmdlets.

Now let's have a look at the available packages; in the following example I select just the first 20 packages, for easy viewing. With packages available, let us use the Install-Package cmdlet to install WindowsAzurePowerShell on our Windows 2012 server. We need to ensure that the source is available prior to any installation. To do this, just execute the cmdlet Get-PackageSource. If the chocolatey source doesn't come up in the output, simply execute the following code and do not change any values. This code installs the chocolatey package manager on your machine.
Once the installation is done, we need to restart PowerShell:

Invoke-Expression ((New-Object System.Net.WebClient).DownloadString('https://chocolatey.org/install.ps1'))

Find-Package -Name WindowsAzurePowerShell | Install-Package -Verbose

The command we just saw shows the confirmation dialog for chocolatey, which is the package source. Click on Yes to install the package. The verbose output walks through the following steps:

Installs the prerequisites.
Creates a temporary folder.
Installation successful.

Windows Server 2012 has .NET 4.5 in the box by default, so the verbose output shows False for the .NET 4.5 prerequisite, but WindowsAzurePowerShell is installed successfully.

If you try to install a package with the same name and version that is already available on your system, the cmdlet skips the installation:

Find-Package -Name PowerShellHere | Install-Package -Verbose
VERBOSE: Skipping installed package PowerShellHere 0.0.3

Explore all the package management cmdlets and automate your software deployments.

Exploring PowerShellGet

PowerShellGet is a module available in the Windows PowerShell 5.0 preview. It provides the following cmdlets, among others:

Search through modules in the gallery with Find-Module
Save modules to your system from the gallery with Save-Module
Install modules from the gallery with Install-Module
Update your modules to the latest version with Update-Module
Add your own custom repository with Register-PSRepository

This allows us to find a module in the PowerShell Gallery and install it in our environment. The PS Gallery is a repository of modules. Using the Find-Module cmdlet we get a list of modules available in the PS Gallery; pipe the result to Install-Module to install the one you need. Alternatively, we can save a module and examine it before installation; to do this, use the Save-Module cmdlet. For example, the xJEA module can be installed and removed this way.

We can also publish a module to the PS Gallery, which makes it available over the internet to others. The example module used here is not a great one; all it does is get user information from Active Directory for a given account name, wrapped in a function and saved as a PSM1 file in the module folder. In order to publish the module to the PS Gallery, we need to ensure that the module has a manifest. Following are the steps to publish your module:

Create a PSM1 file.
Create a PSD1 file, that is, a module manifest (also known as a data file).
Get your NuGet API key from the PS Gallery.
Publish your module using the Publish-PSModule cmdlet.

Summary

In this article, we saw that the Windows PowerShell 5.0 preview has a lot of significant features, such as enhancements in PowerShell DSC, cmdlet improvements and new cmdlets, ISE support for transcription, and support for classes. We can create custom DSC resources with easy string manipulation, and a new NetworkSwitch module is introduced, with which we can automate and manage Microsoft-signed network switches.

Resources for Article:

Further resources on this subject:
Windows Phone 8 Applications [article]
The .NET Framework Primer [article]
Unleashing Your Development Skills with PowerShell [article]


Creating a graph application with Python, Neo4j, Gephi & Linkurious.js

Greg Roberts
12 Oct 2015
13 min read
I love Python, and to celebrate Packt Python week, I've spent some time developing an app using some of my favorite tools. The app is a graph visualization of Python and related topics, as well as showing where all our content fits in. The topics are all StackOverflow tags, related by their co-occurrence in questions on the site.

The app is available to view at http://gregroberts.github.io/, and in this blog I'm going to discuss some of the techniques I used to construct the underlying dataset, and how I turned it into an online application.

Graphs, not charts

Graphs are an incredibly powerful tool for analyzing and visualizing complex data. In recent years, many different graph database engines have been developed to make use of this novel manner of representing data. These databases offer many benefits over traditional, relational databases because of how the data is stored and accessed.

Here at Packt, I use a Neo4j graph to store and analyze data about our business. Using the Cypher query language, it's easy to express complicated relations between different nodes succinctly.

It's not just the technical aspect of graphs which makes them appealing to work with. Seeing the connections between bits of data visualized explicitly, as in a graph, helps you to see the data in a different light and make connections that you might not have spotted otherwise. This graph has many uses at Packt, from customer segmentation to product recommendations. In the next section, I describe the process I use to generate recommendations from the database.

Make the connection

For product recommendations, I use what's known as a hybrid filter. This considers both content-based filtering (product x and y are about the same topic) and collaborative filtering (people who bought x also bought y). Each of these methods has strengths and weaknesses, so combining them into one algorithm provides a more accurate signal.

The collaborative aspect is straightforward to implement in Cypher. For a particular product, we want to find out which other product is most frequently bought alongside it. We have all our products and customers stored as nodes, and purchases are stored as edges. Thus, the Cypher query we want looks like this:

MATCH (n:Product {title:'Learning Cypher'})-[r:purchased*2]-(m:Product)
WITH m.title AS suggestion, count(distinct r)/(n.purchased+m.purchased) AS alsoBought
WHERE m<>n
RETURN *
ORDER BY alsoBought DESC

and will very efficiently return the most commonly also-purchased product. When calculating the weight, we divide by the total units sold of both titles, so we get a proportion returned. We do this so we don't just get the titles with the most units; we're effectively calculating the size of the intersection of the two titles' audiences relative to their overall audience size.

The content side of the algorithm looks very similar:

MATCH (n:Product {title:'Learning Cypher'})-[r:is_about*2]-(m:Product)
WITH m.title AS suggestion, count(distinct r)/(length(n.topics)+length(m.topics)) AS alsoAbout
WHERE m<>n
RETURN *
ORDER BY alsoAbout DESC

Implicit in this algorithm is the knowledge that a title is_about a topic of some kind. This could be done manually, but where's the fun in that? In Packt's domain there already exists a huge, well-moderated corpus of technology concepts and their usage: StackOverflow.
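Before turning to the StackOverflow data, here is a minimal, purely illustrative sketch of the hybrid idea described above: take a collaborative score and a content-based score per candidate title and blend them. The titles, the scores, and the 50/50 weighting below are made-up assumptions standing in for the alsoBought and alsoAbout proportions the two Cypher queries return, not values from Packt's actual graph.

# Hypothetical scores standing in for the alsoBought / alsoAbout proportions
collaborative = {"Mastering Cypher": 0.31, "Learning Neo4j": 0.22}
content_based = {"Learning Neo4j": 0.40, "Python Data Analysis": 0.18}

def hybrid_scores(collaborative, content_based, alpha=0.5):
    """Blend the two signals; alpha is the weight given to the collaborative side."""
    candidates = set(collaborative) | set(content_based)
    scored = (
        (title,
         alpha * collaborative.get(title, 0.0)
         + (1 - alpha) * content_based.get(title, 0.0))
        for title in candidates
    )
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

for title, score in hybrid_scores(collaborative, content_based):
    print(title, round(score, 3))

In practice you would feed the real query results into something like this and tune the blend weight against click or purchase data, but the shape of the combination is as simple as shown.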
The tagging system on StackOverflow not only tells us about all the topics developers across the world are using, it also tells us how those topics are related, by looking at the co-occurrence of tags in questions. So in our graph, StackOverflow tags are nodes in their own right, which represent topics. These nodes are connected via edges, which are weighted to reflect their co-occurrence on StackOverflow:

edge_weight(n,m) = (Number of questions tagged with both n & m) / (Number of questions tagged with n or m)

So, to find topics related to a given topic, we could execute a query like this:

MATCH (n:StackOverflowTag {name:'Matplotlib'})-[r:related_to]-(m:StackOverflowTag)
RETURN n.name, r.weight, m.name
ORDER BY r.weight DESC
LIMIT 10

Which would return the following:

   | n.name     | r.weight | m.name
---+------------+----------+--------------------
 1 | Matplotlib | 0.065699 | Plot
 2 | Matplotlib | 0.045678 | Numpy
 3 | Matplotlib | 0.029667 | Pandas
 4 | Matplotlib | 0.023623 | Python
 5 | Matplotlib | 0.023051 | Scipy
 6 | Matplotlib | 0.017413 | Histogram
 7 | Matplotlib | 0.015618 | Ipython
 8 | Matplotlib | 0.013761 | Matplotlib Basemap
 9 | Matplotlib | 0.013207 | Python 2.7
10 | Matplotlib | 0.012982 | Legend

There are many more complex relationships you can define between topics like this, too. You can infer directionality in the relationship by looking at the local network, or you could start constructing hypergraphs using the extensive StackExchange API.

So we have our topics, but we still need to connect our content to topics. To do this, I've used a two-stage process.

Step 1 – Parsing out the topics

We take all the copy (words) pertaining to a particular product as a document representing that product. This includes the title, chapter headings, and all the copy on the website. We use this because it's already been optimized for search, and should thus carry a fair representation of what the title is about. We then parse this document and keep all the words which match the topics we've previously imported.

#...code for fetching all the copy for all the products
key_re = '\W(%s)\W' % '|'.join(re.escape(i) for i in topic_keywords)
for i in documents:
    tags = re.findall(key_re, i['copy'])
    i['tags'] = map(lambda x: tag_lookup[x], tags)

Having done this for each product, we have a bag of words representing each product, where each word is a recognized topic.

Step 2 – Finding the information

From each of these documents, we want to know the topics which are most important for that document. To do this, we use the tf-idf algorithm. tf-idf stands for term frequency, inverse document frequency. The algorithm takes the number of times a term appears in a particular document, and divides it by the proportion of the documents that word appears in. The term frequency factor boosts terms which appear often in a document, whilst the inverse document frequency factor gets rid of terms which are overly common across the entire corpus (for example, the term 'programming' is common in our product copy, and whilst most of the documents ARE about programming, this doesn't provide much discriminating information about each document).

To do all of this, I use Python (obviously) and the excellent scikit-learn library. Tf-idf is implemented in the class sklearn.feature_extraction.text.TfidfVectorizer. This class has lots of options you can fiddle with to get more informative results.
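To make the weighting itself concrete, here is a tiny, self-contained sketch of a tf-idf calculation on a toy corpus. The documents and terms are made up purely for illustration; the real pipeline below uses scikit-learn rather than this hand-rolled version.

import math

# Toy corpus: each document is a bag of recognised topic words (hypothetical data)
docs = [
    ["python", "numpy", "matplotlib", "python"],
    ["python", "django", "web"],
    ["java", "spring", "web"],
]

def tf_idf(term, doc, corpus):
    tf = doc.count(term) / float(len(doc))      # term frequency in this document
    df = sum(1 for d in corpus if term in d)    # number of documents containing the term
    idf = math.log(len(corpus) / float(df))     # inverse document frequency
    return tf * idf

print(tf_idf("numpy", docs[0], docs))   # ~0.27: rare across the corpus, so highly weighted
print(tf_idf("python", docs[0], docs))  # ~0.20: frequent here but common elsewhere too

A distinctive term like 'numpy' ends up with a higher weight than a ubiquitous term like 'python', which is exactly the discriminating behaviour we want when tagging product copy.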
import sklearn.feature_extraction.text as skt

tagger = skt.TfidfVectorizer(input='content',
                             encoding='utf-8',
                             decode_error='replace',
                             strip_accents=None,
                             analyzer=lambda x: x,
                             ngram_range=(1, 1),
                             max_df=0.8,
                             min_df=0.0,
                             norm='l2',
                             sublinear_tf=False)

It's a good idea to use the min_df and max_df arguments of the constructor to cut out the most common/obtuse words and get a more informative weighting. The analyzer argument tells it how to get the words from each document; in our case, the documents are already lists of normalized words, so we don't need anything additional done.

#create vectors of all the documents
vectors = tagger.fit_transform(map(lambda x: x['tags'], rows)).toarray()

#get back the topic names to map to the graph
t_map = tagger.get_feature_names()

jobs = []
for ind, vec in enumerate(vectors):
    features = filter(lambda x: x[1] > 0, zip(t_map, vec))
    doc = documents[ind]
    for topic, weight in features:
        job = '''MERGE (n:StackOverflowTag {name:'%s'})
                 MERGE (m:Product {id:'%s'})
                 CREATE UNIQUE (m)-[:is_about {source:'tf_idf', weight:%f}]-(n)
              ''' % (topic, doc['id'], weight)
        jobs.append(job)

We then execute all of the jobs using Py2neo's Batch functionality.

Having done all of this, we can now relate products to each other in terms of what topics they have in common:

MATCH (n:Product {isbn10:'1783988363'})-[r:is_about]-(a)-[q:is_about]-(m:Product {isbn10:'1783289007'})
WITH a.name as topic, r.weight+q.weight AS weight
RETURN topic
ORDER BY weight DESC
LIMIT 6

Which returns:

   | topic
---+------------------
 1 | Machine Learning
 2 | Image
 3 | Models
 4 | Algorithm
 5 | Data
 6 | Python

Huzzah! I now have a graph into which I can throw any piece of content about programming or software, and it will fit nicely into the network of topics we've developed.

Take a breath

So, that's how the graph came to be. To communicate with Neo4j from Python, I use the excellent py2neo module, developed by Nigel Small. This module has all sorts of handy abstractions to allow you to work with nodes and edges as native Python objects, and then update your Neo instance with any changes you've made.

The graph I've spoken about is used for many purposes across the business, and has grown in size and scope significantly over the last year. For this project, I've taken from this graph everything relevant to Python. I started by getting all of our content which is_about Python, or about a topic related to Python:

titles = [i.n for i in graph.cypher.execute('''MATCH (n)-[r:is_about]-(m:StackOverflowTag {name:'Python'})
                                               return distinct n''')]
t2 = [i.n for i in graph.cypher.execute('''MATCH (n)-[r:is_about]-(m:StackOverflowTag)-[:related_to]-(p:StackOverflowTag {name:'Python'})
                                           where has(n.name) return distinct n''')]
titles.extend(t2)

then hydrated this further by going one or two hops down each path in various directions, to get a large set of topics and content related to Python.

Visualising the graph

Since I started working with graphs, two visualisation tools I've always used are Gephi and Sigma.js. Gephi is a great solution for analysing and exploring graphical data, allowing you to apply a plethora of different layout options, find out more about the statistics of the network, and filter and change how the graph is displayed. Sigma.js is a lightweight JavaScript library which allows you to publish beautiful graph visualizations in a browser, and it copes very well with even very large graphs.
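Before looking at how the graph gets from Gephi onto the web, here is a minimal sketch of how the node and edge tables mentioned in the workflow below might be pulled out of Neo4j and written to CSV for Gephi. It reuses the graph.cypher.execute call from above; the Cypher queries and column names are illustrative assumptions, not the exact ones used for the published app.

import csv

# Pull a node table and an edge table out of the topic graph (illustrative queries)
nodes = graph.cypher.execute(
    "MATCH (n:StackOverflowTag) RETURN id(n) AS Id, n.name AS Label")
edges = graph.cypher.execute(
    "MATCH (n:StackOverflowTag)-[r:related_to]-(m:StackOverflowTag) "
    "RETURN id(n) AS Source, id(m) AS Target, r.weight AS Weight")

with open('nodes.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(['Id', 'Label'])                 # column names Gephi recognises for nodes
    writer.writerows((row.Id, row.Label) for row in nodes)

with open('edges.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(['Source', 'Target', 'Weight'])  # column names Gephi recognises for edges
    writer.writerows((row.Source, row.Target, row.Weight) for row in edges)

The two files drop straight into Gephi's spreadsheet importer, which is the "middle step" described in the workflow that follows.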
Gephi has a great plugin which allows you to export your graph straight into a web page which you can host, share and adapt. More recently, Linkurious have made it their mission to bring graph visualization to the masses. I highly advise trying the demo of their product. It really shows how much value it’s possible to get out of graph based data. Imagine if your Customer Relations team were able to do a single query to view the entire history of a case or customer, laid out as a beautiful graph, full of glyphs and annotations. Linkurious have built their product on top of Sigma.js, and they’ve made available much of the work they’ve done as the open source Linkurious.js. This is essentially Sigma.js, with a few changes to the API, and an even greater variety of plugins. On Github, each plugin has an API page in the wiki and a downloadable demo. It’s worth cloning the repository just to see the things it’s capable of! Publish It! So here’s the workflow I used to get the Python topic graph out of Neo4j and onto the web. Use Py2neo to graph the subgraph of content and topics pertinent to Python, as described above Add to this some other topics linked to the same books to give a fuller picture of the Python “world” Add in topic-topic edges and product-product edges to show the full breadth of connections observed in the data Export all the nodes and edges to csv files Import node and edge tables into Gephi. The reason I’m using Gephi as a middle step is so that I can fiddle with the visualisation in Gephi until it looks perfect. The layout plugin in Sigma is good, but this way the graph is presentable as soon as the page loads, the communities are much clearer, and I’m not putting undue strain on browsers across the world! The layout of the graph has been achieved using a number of plugins. Instead of using the pre-installed ForceAtlas layouts, I’ve used the OpenOrd layout, which I feel really shows off the communities of a large graph. There’s a really interesting and technical presentation about how this layout works here. Export the graph into gexf format, having applied some partition and ranking functions to make it more clear and appealing. Now it’s all down to Linkurious and its various plugins! You can explore the source code of the final page to see all the details, but here I’ll give an overview of the different plugins I’ve used for the different parts of the visualisation: First instantiate the graph object, pointing to a container (note the CSS of the container, without this, the graph won’t display properly: <style type="text/css"> #container { max-width: 1500px; height: 850px; margin: auto; background-color: #E5E5E5; } </style> … <div id="container"></div> … <script> s= new sigma({ container: 'container', renderer: { container: document.getElementById('container'), type: 'canvas' }, settings: { … } }); sigma.parsers.gexf - used for (trivially!) importing a gexf file into a sigma instance sigma.parsers.gexf( 'static/data/Graph1.gexf', s, function(s) { //callback executed once the data is loaded, use this to set up any aspects of the app which depend on the data }); sigma.plugins.filter - Adds the ability to very simply hide nodes/edges based on a callback function which returns a boolean. This powers the filtering widgets on the page. 
<input class="form-control" id="min-degree" type="range" min="0" max="0" value="0"> … function applyMinDegreeFilter(e) { var v = e.target.value; $('#min-degree-val').textContent = v; filter .undo('min-degree') .nodesBy( function(n, options) { return this.graph.degree(n.id) >= options.minDegreeVal; },{ minDegreeVal: +v }, 'min-degree' ) .apply(); }; $('#min-degree').change(applyMinDegreeFilter); sigma.plugins.locate - Adds the ability to zoom in on a single node or collection of nodes. Very useful if you’re filtering a very large initial graph function locateNode (nid) { if (nid == '') { locate.center(1); } else { locate.nodes(nid); } }; sigma.renderers.glyphs - Allows you to add custom glyphs to each node. Useful if you have many types of node. Outro This application has been a very fun little project to build. The improvements to Sigma wrought by Linkurious have resulted in an incredibly powerful toolkit to rapidly generate graph based applications with a great degree of flexibility and interaction potential. None of this would have been possible were it not for Python. Python is my right (left, I’m left handed) hand which I use for almost everything. Its versatility and expressiveness make it an incredibly robust Swiss army knife in any data-analysts toolkit.

Basics of Jupyter Notebook and Python

Packt Editorial Staff
11 Oct 2015
28 min read
In this article by Cyrille Rossant, coming from his book, Learning IPython for Interactive Computing and Data Visualization - Second Edition, we will see how to use IPython console, Jupyter Notebook, and we will go through the basics of Python. Originally, IPython provided an enhanced command-line console to run Python code interactively. The Jupyter Notebook is a more recent and more sophisticated alternative to the console. Today, both tools are available, and we recommend that you learn to use both. [box type="note" align="alignleft" class="" width=""]The first chapter of the book, Chapter 1, Getting Started with IPython, contains all installation instructions. The main step is to download and install the free Anaconda distribution at https://www.continuum.io/downloads (the version of Python 3 64-bit for your operating system).[/box] Launching the IPython console To run the IPython console, type ipython in an OS terminal. There, you can write Python commands and see the results instantly. Here is a screenshot: IPython console The IPython console is most convenient when you have a command-line-based workflow and you want to execute some quick Python commands. You can exit the IPython console by typing exit. [box type="note" align="alignleft" class="" width=""]Let's mention the Qt console, which is similar to the IPython console but offers additional features such as multiline editing, enhanced tab completion, image support, and so on. The Qt console can also be integrated within a graphical application written with Python and Qt. See http://jupyter.org/qtconsole/stable/ for more information.[/box] Launching the Jupyter Notebook To run the Jupyter Notebook, open an OS terminal, go to ~/minibook/ (or into the directory where you've downloaded the book's notebooks), and type jupyter notebook. This will start the Jupyter server and open a new window in your browser (if that's not the case, go to the following URL: http://localhost:8888). Here is a screenshot of Jupyter's entry point, the Notebook dashboard: The Notebook dashboard [box type="note" align="alignleft" class="" width=""]At the time of writing, the following browsers are officially supported: Chrome 13 and greater; Safari 5 and greater; and Firefox 6 or greater. Other browsers may work also. Your mileage may vary.[/box] The Notebook is most convenient when you start a complex analysis project that will involve a substantial amount of interactive experimentation with your code. Other common use-cases include keeping track of your interactive session (like a lab notebook), or writing technical documents that involve code, equations, and figures. In the rest of this section, we will focus on the Notebook interface. [box type="note" align="alignleft" class="" width=""]Closing the Notebook server To close the Notebook server, go to the OS terminal where you launched the server from, and press Ctrl + C. You may need to confirm with y.[/box] The Notebook dashboard The dashboard contains several tabs which are as follows: Files: shows all files and notebooks in the current directory Running: shows all kernels currently running on your computer Clusters: lets you launch kernels for parallel computing A notebook is an interactive document containing code, text, and other elements. A notebook is saved in a file with the .ipynb extension. This file is a plain text file storing a JSON data structure. A kernel is a process running an interactive session. When using IPython, this kernel is a Python process. 
There are kernels in many languages other than Python. [box type="note" align="alignleft" class="" width=""]We follow the convention to use the term notebook for a file, and Notebook for the application and the web interface.[/box] In Jupyter, notebooks and kernels are strongly separated. A notebook is a file, whereas a kernel is a process. The kernel receives snippets of code from the Notebook interface, executes them, and sends the outputs and possible errors back to the Notebook interface. Thus, in general, the kernel has no notion of the Notebook. A notebook is persistent (it's a file), whereas a kernel may be closed at the end of an interactive session and it is therefore not persistent. When a notebook is re-opened, it needs to be re-executed. In general, no more than one Notebook interface can be connected to a given kernel. However, several IPython consoles can be connected to a given kernel. The Notebook user interface To create a new notebook, click on the New button, and select Notebook (Python 3). A new browser tab opens and shows the Notebook interface as follows: A new notebook Here are the main components of the interface, from top to bottom: The notebook name, which you can change by clicking on it. This is also the name of the .ipynb file. The Menu bar gives you access to several actions pertaining to either the notebook or the kernel. To the right of the menu bar is the Kernel name. You can change the kernel language of your notebook from the Kernel menu. The Toolbar contains icons for common actions. In particular, the dropdown menu showing Code lets you change the type of a cell. Following is the main component of the UI: the actual Notebook. It consists of a linear list of cells. We will detail the structure of a cell in the following sections. Structure of a notebook cell There are two main types of cells: Markdown cells and code cells, and they are described as follows: A Markdown cell contains rich text. In addition to classic formatting options like bold or italics, we can add links, images, HTML elements, LaTeX mathematical equations, and more. A code cell contains code to be executed by the kernel. The programming language corresponds to the kernel's language. We will only use Python in this book, but you can use many other languages. You can change the type of a cell by first clicking on a cell to select it, and then choosing the cell's type in the toolbar's dropdown menu showing Markdown or Code. Markdown cells Here is a screenshot of a Markdown cell: A Markdown cell The top panel shows the cell in edit mode, while the bottom one shows it in render mode. The edit mode lets you edit the text, while the render mode lets you display the rendered cell. We will explain the differences between these modes in greater detail in the following section. Code cells Here is a screenshot of a complex code cell: Structure of a code cell This code cell contains several parts, as follows: The Prompt number shows the cell's number. This number increases every time you run the cell. Since you can run cells of a notebook out of order, nothing guarantees that code numbers are linearly increasing in a given notebook. The Input area contains a multiline text editor that lets you write one or several lines of code with syntax highlighting. The Widget area may contain graphical controls; here, it displays a slider. 
The Output area can contain multiple outputs, here: Standard output (text in black) Error output (text with a red background) Rich output (an HTML table and an image here) The Notebook modal interface The Notebook implements a modal interface similar to some text editors such as vim. Mastering this interface may represent a small learning curve for some users. Use the edit mode to write code (the selected cell has a green border, and a pen icon appears at the top right of the interface). Click inside a cell to enable the edit mode for this cell (you need to double-click with Markdown cells). Use the command mode to operate on cells (the selected cell has a gray border, and there is no pen icon). Click outside the text area of a cell to enable the command mode (you can also press the Esc key). Keyboard shortcuts are available in the Notebook interface. Type h to show them. We review here the most common ones (for Windows and Linux; shortcuts for Mac OS X may be slightly different). Keyboard shortcuts available in both modes Here are a few keyboard shortcuts that are always available when a cell is selected: Ctrl + Enter: run the cell Shift + Enter: run the cell and select the cell below Alt + Enter: run the cell and insert a new cell below Ctrl + S: save the notebook Keyboard shortcuts available in the edit mode In the edit mode, you can type code as usual, and you have access to the following keyboard shortcuts: Esc: switch to command mode Ctrl + Shift + -: split the cell Keyboard shortcuts available in the command mode In the command mode, keystrokes are bound to cell operations. Don't write code in command mode or unexpected things will happen! For example, typing dd in command mode will delete the selected cell! Here are some keyboard shortcuts available in command mode: Enter: switch to edit mode Up or k: select the previous cell Down or j: select the next cell y / m: change the cell type to code cell/Markdown cell a / b: insert a new cell above/below the current cell x / c / v: cut/copy/paste the current cell dd: delete the current cell z: undo the last delete operation Shift + =: merge the cell below h: display the help menu with the list of keyboard shortcuts Spending some time learning these shortcuts is highly recommended. References Here are a few references: Main documentation of Jupyter at http://jupyter.readthedocs.org/en/latest/ Jupyter Notebook interface explained at http://jupyter-notebook.readthedocs.org/en/latest/notebook.html A crash course on Python If you don't know Python, read this section to learn the fundamentals. Python is a very accessible language and is even taught to school children. If you have ever programmed, it will only take you a few minutes to learn the basics. Hello world Open a new notebook and type the following in the first cell: In [1]: print("Hello world!") Out[1]: Hello world! Here is a screenshot: "Hello world" in the Notebook [box type="note" align="alignleft" class="" width=""]Prompt string Note that the convention chosen in this article is to show Python code (also called the input) prefixed with In [x]: (which shouldn't be typed). This is the standard IPython prompt. Here, you should just type print("Hello world!") and then press Shift + Enter.[/box] Congratulations! You are now a Python programmer. Variables Let's use Python as a calculator. In [2]: 2 * 2 Out[2]: 4 Here, 2 * 2 is an expression statement. This operation is performed, the result is returned, and IPython displays it in the notebook cell's output. 
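One small convenience worth adding here: IPython stores the most recent output in the special variable _ (and the previous one in __), which lets you chain quick calculations without naming intermediate results. Continuing from the 2 * 2 example (the prompt number shown here is illustrative; yours will differ):

In [3]: _ * 3
Out[3]: 12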
[box type="note" align="alignleft" class="" width=""]Division In Python 3, 3 / 2 returns 1.5 (floating-point division), whereas it returns 1 in Python 2 (integer division). This can be source of errors when porting Python 2 code to Python 3. It is recommended to always use the explicit 3.0 / 2.0 for floating-point division (by using floating-point numbers) and 3 // 2 for integer division. Both syntaxes work in Python 2 and Python 3. See http://python3porting.com/differences.html#integer-division for more details.[/box] Other built-in mathematical operators include +, -, ** for the exponentiation, and others. You will find more details at https://docs.python.org/3/reference/expressions.html#the-power-operator. Variables form a fundamental concept of any programming language. A variable has a name and a value. Here is how to create a new variable in Python: In [3]: a = 2 And here is how to use an existing variable: In [4]: a * 3 Out[4]: 6 Several variables can be defined at once (this is called unpacking): In [5]: a, b = 2, 6 There are different types of variables. Here, we have used a number (more precisely, an integer). Other important types include floating-point numbers to represent real numbers, strings to represent text, and booleans to represent True/False values. Here are a few examples: In [6]: somefloat = 3.1415 sometext = 'pi is about' # You can also use double quotes. print(sometext, somefloat) # Display several variables. Out[6]: pi is about 3.1415 Note how we used the # character to write comments. Whereas Python discards the comments completely, adding comments in the code is important when the code is to be read by other humans (including yourself in the future). String escaping String escaping refers to the ability to insert special characters in a string. For example, how can you insert ' and ", given that these characters are used to delimit a string in Python code? The backslash is the go-to escape character in Python (and in many other languages too). Here are a few examples: In [7]: print("Hello "world"") print("A list:n* item 1n* item 2") print("C:pathonwindows") print(r"C:pathonwindows") Out[7]: Hello "world" A list: * item 1 * item 2 C:pathonwindows C:pathonwindows The special character n is the new line (or line feed) character. To insert a backslash, you need to escape it, which explains why it needs to be doubled as . You can also disable escaping by using raw literals with a r prefix before the string, like in the last example above. In this case, backslashes are considered as normal characters. This is convenient when writing Windows paths, since Windows uses backslash separators instead of forward slashes like on Unix systems. A very common error on Windows is forgetting to escape backslashes in paths: writing "C:path" may lead to subtle errors. You will find the list of special characters in Python at https://docs.python.org/3.4/reference/lexical_analysis.html#string-and-bytes-literals. Lists A list contains a sequence of items. You can concisely instruct Python to perform repeated actions on the elements of a list. Let's first create a list of numbers as follows: In [8]: items = [1, 3, 0, 4, 1] Note the syntax we used to create the list: square brackets [], and commas , to separate the items. 
The built-in function len() returns the number of elements in a list:

In [9]: len(items)
Out[9]: 5

[box type="note" align="alignleft" class="" width=""]Python comes with a set of built-in functions, including print(), len(), max(), functional routines like filter() and map(), and container-related routines like all(), any(), range(), and sorted(). You will find the full list of built-in functions at https://docs.python.org/3.4/library/functions.html.[/box]

Now, let's compute the sum of all elements in the list. Python provides a built-in function for this:

In [10]: sum(items)
Out[10]: 9

We can also access individual elements in the list, using the following syntax:

In [11]: items[0]
Out[11]: 1
In [12]: items[-1]
Out[12]: 1

Note that indexing starts at 0 in Python: the first element of the list is indexed by 0, the second by 1, and so on. Also, -1 refers to the last element, -2 to the penultimate element, and so on. The same syntax can be used to alter elements in the list:

In [13]: items[1] = 9
         items
Out[13]: [1, 9, 0, 4, 1]

We can access sublists with the following syntax:

In [14]: items[1:3]
Out[14]: [9, 0]

Here, 1:3 represents a slice going from element 1 included (this is the second element of the list) to element 3 excluded. Thus, we get a sublist with the second and third element of the original list. The first-included/last-excluded asymmetry leads to an intuitive treatment of overlaps between consecutive slices. Also, note that slicing a list creates a new list (a shallow copy of the selected elements), not a view: changing elements in the sublist does not change them in the original list. (This is different from NumPy arrays, where slicing returns a view on the original data.)

Python provides several other types of containers:

Tuples are immutable and contain a fixed number of elements:

In [15]: my_tuple = (1, 2, 3)
         my_tuple[1]
Out[15]: 2

Dictionaries contain key-value pairs. They are extremely useful and common:

In [16]: my_dict = {'a': 1, 'b': 2, 'c': 3}
         print('a:', my_dict['a'])
Out[16]: a: 1
In [17]: print(my_dict.keys())
Out[17]: dict_keys(['c', 'a', 'b'])

There is no notion of order in a dictionary. However, the native collections module provides an OrderedDict structure that keeps the insertion order (see https://docs.python.org/3.4/library/collections.html).

Sets, like mathematical sets, contain distinct elements:

In [18]: my_set = set([1, 2, 3, 2, 1])
         my_set
Out[18]: {1, 2, 3}

A Python object is mutable if its value can change after it has been created. Otherwise, it is immutable. For example, a string is immutable; to change it, a new string needs to be created. A list, a dictionary, or a set is mutable; elements can be added or removed. By contrast, a tuple is immutable, and it is not possible to change the elements it contains without recreating the tuple. See https://docs.python.org/3.4/reference/datamodel.html for more details.

Loops

We can run through all elements of a list using a for loop:

In [19]: for item in items:
             print(item)
Out[19]: 1
         9
         0
         4
         1

There are several things to note here: The for item in items syntax means that a temporary variable named item is created at every iteration. This variable contains the value of every item in the list, one at a time. Note the colon : at the end of the for statement. Forgetting it will lead to a syntax error! The statement print(item) will be executed for all items in the list. Note the four spaces before print: this is called the indentation. You will find more details about indentation in the next subsection.
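The same for syntax works with the other containers we have just seen. For example, here is a short loop over the key-value pairs of the my_dict dictionary defined earlier (the order in which the pairs come out is arbitrary in this version of Python):

for key, value in my_dict.items():
    print(key, '->', value)

This prints something like a -> 1, c -> 3, b -> 2, one pair per line.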
Python supports a concise syntax to perform a given operation on all elements of a list, as follows: In [20]: squares = [item * item for item in items] squares Out[20]: [1, 81, 0, 16, 1] This is called a list comprehension. A new list is created here; it contains the squares of all numbers in the list. This concise syntax leads to highly readable and Pythonic code. Indentation Indentation refers to the spaces that may appear at the beginning of some lines of code. This is a particular aspect of Python's syntax. In most programming languages, indentation is optional and is generally used to make the code visually clearer. But in Python, indentation also has a syntactic meaning. Particular indentation rules need to be followed for Python code to be correct. In general, there are two ways to indent some text: by inserting a tab character (also referred to as t), or by inserting a number of spaces (typically, four). It is recommended to use spaces instead of tab characters. Your text editor should be configured such that the Tab key on the keyboard inserts four spaces instead of a tab character. In the Notebook, indentation is automatically configured properly; so you shouldn't worry about this issue. The question only arises if you use another text editor for your Python code. Finally, what is the meaning of indentation? In Python, indentation delimits coherent blocks of code, for example, the contents of a loop, a conditional branch, a function, and other objects. Where other languages such as C or JavaScript use curly braces to delimit such blocks, Python uses indentation. Conditional branches Sometimes, you need to perform different operations on your data depending on some condition. For example, let's display all even numbers in our list: In [21]: for item in items: if item % 2 == 0: print(item) Out[21]: 0 4 Again, here are several things to note: An if statement is followed by a boolean expression. If a and b are two integers, the modulo operand a % b returns the remainder from the division of a by b. Here, item % 2 is 0 for even numbers, and 1 for odd numbers. The equality is represented by a double equal sign == to avoid confusion with the assignment operator = that we use when we create variables. Like with the for loop, the if statement ends with a colon :. The part of the code that is executed when the condition is satisfied follows the if statement. It is indented. Indentation is cumulative: since this if is inside a for loop, there are eight spaces before the print(item) statement. Python supports a concise syntax to select all elements in a list that satisfy certain properties. Here is how to create a sublist with only even numbers: In [22]: even = [item for item in items if item % 2 == 0] even Out[22]: [0, 4] This is also a form of list comprehension. Functions Code is typically organized into functions. A function encapsulates part of your code. Functions allow you to reuse bits of functionality without copy-pasting the code. Here is a function that tells whether an integer number is even or not: In [23]: def is_even(number): """Return whether an integer is even or not.""" return number % 2 == 0 There are several things to note here: A function is defined with the def keyword. After def comes the function name. A general convention in Python is to only use lowercase characters, and separate words with an underscore _. A function name generally starts with a verb. The function name is followed by parentheses, with one or several variable names called the arguments. 
These are the inputs of the function. There is a single argument here, named number. No type is specified for the argument. This is because Python is dynamically typed; you could pass a variable of any type. This function would work fine with floating point numbers, for example (the modulo operation works with floating point numbers in addition to integers). The body of the function is indented (and note the colon : at the end of the def statement). There is a docstring wrapped by triple quotes """. This is a particular form of comment that explains what the function does. It is not mandatory, but it is strongly recommended to write docstrings for the functions exposed to the user. The return keyword in the body of the function specifies the output of the function. Here, the output is a Boolean, obtained from the expression number % 2 == 0. It is possible to return several values; just use a comma to separate them (in this case, a tuple of Booleans would be returned). Once a function is defined, it can be called like this: In [24]: is_even(3) Out[24]: False In [25]: is_even(4) Out[25]: True Here, 3 and 4 are successively passed as arguments to the function. Positional and keyword arguments A Python function can accept an arbitrary number of arguments, called positional arguments. It can also accept optional named arguments, called keyword arguments. Here is an example: In [26]: def remainder(number, divisor=2): return number % divisor The second argument of this function, divisor, is optional. If it is not provided by the caller, it will default to the number 2, as shown here: In [27]: remainder(5) Out[27]: 1 There are two equivalent ways of specifying a keyword argument when calling a function. They are as follows: In [28]: remainder(5, 3) Out[28]: 2 In [29]: remainder(5, divisor=3) Out[29]: 2 In the first case, 3 is understood as the second argument, divisor. In the second case, the name of the argument is given explicitly by the caller. This second syntax is clearer and less error-prone than the first one. Functions can also accept arbitrary sets of positional and keyword arguments, using the following syntax: In [30]: def f(*args, **kwargs): print("Positional arguments:", args) print("Keyword arguments:", kwargs) In [31]: f(1, 2, c=3, d=4) Out[31]: Positional arguments: (1, 2) Keyword arguments: {'c': 3, 'd': 4} Inside the function, args is a tuple containing positional arguments, and kwargs is a dictionary containing keyword arguments. Passage by assignment When passing a parameter to a Python function, a reference to the object is actually passed (passage by assignment): If the passed object is mutable, it can be modified by the function If the passed object is immutable, it cannot be modified by the function Here is an example: In [32]: my_list = [1, 2] def add(some_list, value): some_list.append(value) add(my_list, 3) my_list Out[32]: [1, 2, 3] The add() function modifies an object defined outside it (in this case, the object my_list); we say this function has side-effects. A function with no side-effects is called a pure function: it doesn't modify anything in the outer context, and it deterministically returns the same result for any given set of inputs. Pure functions are to be preferred over functions with side-effects. Knowing this can help you spot out subtle bugs. There are further related concepts that are useful to know, including function scopes, naming, binding, and more. 
Here are a couple of links: Passage by reference at https://docs.python.org/3/faq/programming.html#how-do-i-write-a-function-with-output-parameters-call-by-reference Naming, binding, and scope at https://docs.python.org/3.4/reference/executionmodel.html Errors Let's discuss errors in Python. As you learn, you will inevitably come across errors and exceptions. The Python interpreter will most of the time tell you what the problem is, and where it occurred. It is important to understand the vocabulary used by Python so that you can more quickly find and correct your errors. Let's see the following example: In [33]: def divide(a, b): return a / b In [34]: divide(1, 0) Out[34]: --------------------------------------------------------- ZeroDivisionError Traceback (most recent call last) <ipython-input-2-b77ebb6ac6f6> in <module>() ----> 1 divide(1, 0) <ipython-input-1-5c74f9fd7706> in divide(a, b) 1 def divide(a, b): ----> 2 return a / b ZeroDivisionError: division by zero Here, we defined a divide() function, and called it to divide 1 by 0. Dividing a number by 0 is an error in Python. Here, a ZeroDivisionError exception was raised. An exception is a particular type of error that can be raised at any point in a program. It is propagated from the innards of the code up to the command that launched the code. It can be caught and processed at any point. You will find more details about exceptions at https://docs.python.org/3/tutorial/errors.html, and common exception types at https://docs.python.org/3/library/exceptions.html#bltin-exceptions. The error message you see contains the stack trace, the exception type, and the exception message. The stack trace shows all function calls between the raised exception and the script calling point. The top frame, indicated by the first arrow ---->, shows the entry point of the code execution. Here, it is divide(1, 0), which was called directly in the Notebook. The error occurred while this function was called. The next and last frame is indicated by the second arrow. It corresponds to line 2 in our function divide(a, b). It is the last frame in the stack trace: this means that the error occurred there. Object-oriented programming Object-oriented programming (OOP) is a relatively advanced topic. Although we won't use it much in this book, it is useful to know the basics. Also, mastering OOP is often essential when you start to have a large code base. In Python, everything is an object. A number, a string, or a function is an object. An object is an instance of a type (also known as class). An object has attributes and methods, as specified by its type. An attribute is a variable bound to an object, giving some information about it. A method is a function that applies to the object. For example, the object 'hello' is an instance of the built-in str type (string). The type() function returns the type of an object, as shown here: In [35]: type('hello') Out[35]: str There are native types, like str or int (integer), and custom types, also called classes, that can be created by the user. In IPython, you can discover the attributes and methods of any object with the dot syntax and tab completion. For example, typing 'hello'.u and pressing Tab automatically shows us the existence of the upper() method: In [36]: 'hello'.upper() Out[36]: 'HELLO' Here, upper() is a method available to all str objects; it returns an uppercase copy of a string. A useful string method is format(). 
This simple and convenient templating system lets you generate strings dynamically, as shown in the following example: In [37]: 'Hello {0:s}!'.format('Python') Out[37]: Hello Python The {0:s} syntax means "replace this with the first argument of format(), which should be a string". The variable type after the colon is especially useful for numbers, where you can specify how to display the number (for example, .3f to display three decimals). The 0 makes it possible to replace a given value several times in a given string. You can also use a name instead of a position—for example 'Hello {name}!'.format(name='Python'). Some methods are prefixed with an underscore _; they are private and are generally not meant to be used directly. IPython's tab completion won't show you these private attributes and methods unless you explicitly type _ before pressing Tab. In practice, the most important thing to remember is that appending a dot . to any Python object and pressing Tab in IPython will show you a lot of functionality pertaining to that object. Functional programming Python is a multi-paradigm language; it notably supports imperative, object-oriented, and functional programming models. Python functions are objects and can be handled like other objects. In particular, they can be passed as arguments to other functions (also called higher-order functions). This is the essence of functional programming. Decorators provide a convenient syntax construct to define higher-order functions. Here is an example using the is_even() function from the previous Functions section: In [38]: def show_output(func): def wrapped(*args, **kwargs): output = func(*args, **kwargs) print("The result is:", output) return wrapped The show_output() function transforms an arbitrary function func() to a new function, named wrapped(), that displays the result of the function, as follows: In [39]: f = show_output(is_even) f(3) Out[39]: The result is: False Equivalently, this higher-order function can also be used with a decorator, as follows: In [40]: @show_output def square(x): return x * x In [41]: square(3) Out[41]: The result is: 9 You can find more information about Python decorators at https://en.wikipedia.org/wiki/Python_syntax_and_semantics#Decorators and at http://www.thecodeship.com/patterns/guide-to-python-function-decorators/. Python 2 and 3 Let's finish this section with a few notes about Python 2 and Python 3 compatibility issues. There are still some Python 2 code and libraries that are not compatible with Python 3. Therefore, it is sometimes useful to be aware of the differences between the two versions. One of the most obvious differences is that print is a statement in Python 2, whereas it is a function in Python 3. Therefore, print "Hello" (without parentheses) works in Python 2 but not in Python 3, while print("Hello") works in both Python 2 and Python 3. 
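For instance, assuming you want a single module to run unchanged on both interpreters, you can pull Python 3's print function into Python 2 with a __future__ import; here is a minimal sketch of the idea:

from __future__ import print_function  # must appear at the top of the module

# print is now a function on Python 2 as well, so the same call works on both versions
print("Hello", "world", sep=", ")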
There are several non-mutually exclusive options to write portable code that works with both versions: futures: A built-in module supporting backward-incompatible Python syntax 2to3: A built-in Python module to port Python 2 code to Python 3 six: An external lightweight library for writing compatible code Here are a few references: Official Python 2/3 wiki page at https://wiki.python.org/moin/Python2orPython3 The Porting to Python 3 book, by CreateSpace Independent Publishing Platform at http://www.python3porting.com/bookindex.html 2to3 at https://docs.python.org/3.4/library/2to3.html six at https://pythonhosted.org/six/ futures at https://docs.python.org/3.4/library/__future__.html The IPython Cookbook contains an in-depth recipe about choosing between Python 2 and 3, and how to support both. Going beyond the basics You now know the fundamentals of Python, the bare minimum that you will need in this book. As you can imagine, there is much more to say about Python. Following are a few further basic concepts that are often useful and that we cannot cover here, unfortunately. You are highly encouraged to have a look at them in the references given at the end of this section: range and enumerate pass, break, and, continue, to be used in loops Working with files Creating and importing modules The Python standard library provides a wide range of functionality (OS, network, file systems, compression, mathematics, and more) Here are some slightly more advanced concepts that you might find useful if you want to strengthen your Python skills: Regular expressions for advanced string processing Lambda functions for defining small anonymous functions Generators for controlling custom loops Exceptions for handling errors with statements for safely handling contexts Advanced object-oriented programming Metaprogramming for modifying Python code dynamically The pickle module for persisting Python objects on disk and exchanging them across a network Finally, here are a few references: Getting started with Python: https://www.python.org/about/gettingstarted/ A Python tutorial: https://docs.python.org/3/tutorial/index.html The Python Standard Library: https://docs.python.org/3/library/index.html Interactive tutorial: http://www.learnpython.org/ Codecademy Python course: http://www.codecademy.com/tracks/python Language reference (expert level): https://docs.python.org/3/reference/index.html Python Cookbook, by David Beazley and Brian K. Jones, O'Reilly Media (advanced level, highly recommended if you want to become a Python expert) Summary In this article, we have seen how to launch the IPython console and Jupyter Notebook, the different aspects of the Notebook and its user interface, the structure of the notebook cell, keyboard shortcuts that are available in the Notebook interface, and the basics of Python. Introduction to Data Analysis and Libraries Hand Gesture Recognition Using a Kinect Depth Sensor The strange relationship between objects, functions, generators and coroutines

Asynchronous Communication between Components

Packt
09 Oct 2015
12 min read
In this article by Andreas Niedermair, the author of the book Mastering ServiceStack, we will see the communication between asynchronous components. The recent release of .NET has added several new ways to further embrace asynchronous and parallel processing by introducing the Task Parallel Library (TPL) and async and await. (For more resources related to this topic, see here.) The need for asynchronous processing has been there since the early days of programming. Its main concept is to offload the processing to another thread or process to release the calling thread from waiting and it has become a standard model since the rise of GUIs. In such interfaces only one thread is responsible for drawing the GUI, which must not be blocked in order to remain available and also to avoid putting the application in a non-responding state. This paradigm is a core point in distributed systems, at some point, long running operations are offloaded to a separate component, either to overcome blocking or to avoid resource bottlenecks using dedicated machines, which also makes the processing more robust against unexpected application pool recycling and other such issues. A synonym for "fire-and-forget" is "one-way", which is also reflected by the design of static routes of ServiceStack endpoints, where the default is /{format}/oneway/{service}. Asynchronism adds a whole new level of complexity to our processing chain, as some callers might depend on a return value. This problem can be overcome by adding callback or another event to your design. Messaging or in general a producer consumer chain is a fundamental design pattern, which can be applied within the same process or inter-process, on the same or a cross machine to decouple components. Consider the following architecture: The client issues a request to the service, which processes the message and returns a response. The server is known and is directly bound to the client, which makes an on-the-fly addition of servers practically impossible. You'd need to reconfigure the clients to reflect the collection of servers on every change and implement a distribution logic for requests. Therefore, a new component is introduced, which acts as a broker (without any processing of the message, except delivery) between the client and service to decouple the service from the client. This gives us the opportunity to introduce more services for heavy load scenarios by simply registering a new instance to the broker, as shown in the following figure:. I left out the clustering (scaling) of brokers and also the routing of messages on purpose at this stage of introduction. In many cross process scenarios a database is introduced as a broker, which is constantly polled by services (and clients, if there's a response involved) to check whether there's a message to be processed or not. Adding a database as a broker and implementing your own logic can be absolutely fine for basic systems, but for more advanced scenarios it lacks some essential features, which Messages Queues come shipped with. Scalability: Decoupling is the biggest step towards a robust design, as it introduces the possibility to add more processing nodes to your data flow. Resilience: Messages are guaranteed to be delivered and processed as automatic retrying is available for non-acknowledged (processed) messages. If the retry count is exceeded, failed messages are stored in a Dead Letter Queue (DLQ) to be inspected later and are requeued after fixing the issue that caused the failure. 
In case of a partial failure of your infrastructure, clients can still produce messages that get delivered and processed as soon as there is even a single consumer back online. Pushing instead of polling: This is where asynchronism comes into play, as clients do not constantly poll for messages but instead it gets pushed by the broker when there's a new message in their subscribed queue. This minimizes the spinning and wait time, when the timer ticks only for 10 seconds. Guaranteed order: Most Message Queues offer a guaranteed order of the processing under defined conditions (mostly FIFO). Load balancing: With multiple services registered for messages, there is an inherent load balancing so that the heavy load scenarios can be handled better. In addition to this round-robin routing there are other routing logics, such as smallest-mailbox, tail-chopping, or random routing. Message persistence: Message Queues can be configured to persist their data to disk and even survive restarts of the host on which they are running. To overcome the downtime of the Message Queue you can even setup a cluster to offload the demand to other brokers while restarting a single node. Built-in priority: Message Queues usually have separate queues for different messages and even provide a separate in queue for prioritized messages. There are many more features, such as Time to live, security and batching modes, which we will not cover as they are outside the scope of ServiceStack. In the following example we will refer to two basic DTOs: public class Hello : ServiceStack.IReturn<HelloResponse> { public string Name { get; set; } } public class HelloResponse { public string Result { get; set; } } The Hello class is used to send a Name to a consumer that generates a message, which will be enqueued in the Message Queue as well. RabbitMQ RabbitMQ is a mature broker built on top of the Advanced Message Queuing Protocol (AMQP), which makes it possible to solve even more complex scenarios, as shown here: The messages will survive restarts of the RabbitMQ service and the additional guaranty of delivery is accomplished by depending upon an acknowledgement of the receipt (and processing) of the message, by default it is done by ServiceStack for typical scenarios. The client of this Message Queue is located in the ServiceStack.RabbitMq object's NuGet package (it uses the official client in the RabbitMQ.Client package under the hood). You can add additional protocols to RabbitMQ, such as Message Queue Telemetry Transport (MQTT) and Streaming Text Oriented Messaging Protocol (STOMP), with plugins to ease Interop scenarios. Due to its complexity, we will focus on an abstracted interaction with the broker. There are many books and articles available for a deeper understanding of RabbitMQ. A quick overview of the covered scenarios is available at https://www.rabbitmq.com/getstarted.html. The method of publishing a message with RabbitMQ does not differ much from RedisMQ: using ServiceStack; using ServiceStack.RabbitMq; using (var rabbitMqServer = new RabbitMqServer()) { using (var messageProducer = rabbitMqServer.CreateMessageProducer()) { var hello = new Hello { Name = "Demo" }; messageProducer.Publish(hello); } } This will create a Helloobject and publish it to the corresponding queue in RabbitMQ. 
To retrieve this message, we need to register a handler, as shown here: using System; using ServiceStack; using ServiceStack.RabbitMq; using ServiceStack.Text; var rabbitMqServer = new RabbitMqServer(); rabbitMqServer.RegisterHandler<Hello>(message => { var hello = message.GetBody(); var name = hello.Name; var result = "Hello {0}".Fmt(name); result.Print(); return null; }); rabbitMqServer.Start(); "Listening for hello messages".Print(); Console.ReadLine(); rabbitMqServer.Dispose(); This registers a handler for Hello objects and prints a message to the console. In favor of a straightforward example we are omitting all the parameters with default values of the constructor of RabbitMqServer, which will connect us to the local instance at port 5672. To change this, you can either provide a connectionString parameter (and optional credentials) or use a RabbitMqMessageFactory object to customize the connection. Setup Setting up RabbitMQ involves a bit of effort. At first you need to install Erlang from http://www.erlang.org/download.html, which is the runtime for RabbitMQ due to its functional and concurrent nature. Then you can grab the installer from https://www.rabbitmq.com/download.html, which will set RabbitMQ up and running as a service with a default configuration. Processing chain Due to its complexity, the processing chain with any mature Message Queue is different from what you know from RedisMQ. Exchanges are introduced in front of queues to route the messages to their respective queues according to their routing keys: The default exchange name is mx.servicestack (defined in ServiceStack.Messaging.QueueNames.Exchange) and is used in any Publish to call an IMessageProducer or IMessageQueueClient object. With IMessageQueueClient.Publish you can inject a routing key (queueName parameter), to customize the routing of a queue. Failed messages are published to the ServiceStack.Messaging.QueueNames.ExchangeDlq (mx.servicestack.dlq) and routed to queues with the name mq:{type}.dlq. Successful messages are published to ServiceStack.Messaging.QueueNames.ExchangeTopic (mx.servicestack.topic) and routed to the queue mq:{type}.outq. Additionally, there's also a priority queue to the in-queue with the name mq:{type}.priority. If you interact with RabbitMQ on a lower level, you can directly publish to queues and leave the routing via an exchange out of the picture. Each queue has features to define whether the queue is durable, deletes itself after the last consumer disconnected, or which exchange is to be used to publish dead messages with which routing key. More information on the concepts, different exchange types, queues, and acknowledging messages can be found at https://www.rabbitmq.com/tutorials/amqp-concepts.html. Replying directly back to the producer Messages published to a queue are dequeued in FIFO mode, hence there is no guarantee if the responses are delivered to the issuer of the initial message or not. 
To force a response to the originator you can make use of the ReplyTo property of a message: using System; using ServiceStack; using ServiceStack.Messaging; using ServiceStack.RabbitMq; using ServiceStack.Text; var rabbitMqServer = new RabbitMqServer(); var messageQueueClient = rabbitMqServer.CreateMessageQueueClient(); var queueName = messageQueueClient.GetTempQueueName(); var hello = new Hello { Name = "reply to originator" }; messageQueueClient.Publish(new Message<Hello>(hello) { ReplyTo = queueName }); var message = messageQueueClient.Get<HelloResponse>(queueName); var helloResponse = message.GetBody(); This code is more or less identical to the RedisMQ approach, but it does something different under the hood. The messageQueueClient.GetTempQueueName object creates a temporary queue, whose name is generated by ServiceStack.Messaging.QueueNames.GetTempQueueName. This temporary queue does not survive a restart of RabbitMQ, and gets deleted as soon as the consumer disconnects. As each queue is a separate Erlang process, you may encounter the process limits of Erlang and the maximum amount of file descriptors of your OS. Broadcasting a message In many scenarios a broadcast to multiple consumers is required, for example if you need to attach multiple loggers to a system it needs a lower level of implementation. The solution to this requirement is to create a fan-out exchange that will forward the message to all the queues instead of one connected queue, where one queue is consumed exclusively by one consumer, as shown: using ServiceStack; using ServiceStack.Messaging; using ServiceStack.RabbitMq; var fanoutExchangeName = string.Concat(QueueNames.Exchange, ".", ExchangeType.Fanout); var rabbitMqServer = new RabbitMqServer(); var messageProducer= (RabbitMqProducer) rabbitMqServer.CreateMessageProducer(); var channel = messageProducer.Channel; channel.ExchangeDeclare(exchange: fanoutExchangeName, type: ExchangeType.Fanout, durable: true, autoDelete: false, arguments: null); With the cast to RabbitMqProducer we have access to lower level actions, we need to declare and exchange this with the name mx.servicestack.fanout, which is durable and does not get deleted. Now, we need to bind a temporary and an exclusive queue to the exchange: var messageQueueClient = (RabbitMqQueueClient) rabbitMqServer.CreateMessageQueueClient(); var queueName = messageQueueClient.GetTempQueueName(); channel.QueueBind(queue: queueName, exchange: fanoutExchangeName, routingKey: QueueNames<Hello>.In); The call to messageQueueClient.GetTempQueueName() creates a temporary queue, which lives as long as there is just one consumer connected. This queue is bound to the fan-out exchange with the routing key mq:Hello.inq, as shown here: To publish the messages we need to use the RabbitMqProducer object (messageProducer): var hello = new Hello { Name = "Broadcast" }; var message = new Message<Hello>(hello); messageProducer.Publish(queueName: QueueNames<Hello>.In, message: message, exchange: fanoutExchangeName); Even though the first parameter of Publish is named queueName, it is propagated as the routingKey to the underlying PublishMessagemethod call. 
This will publish the message on the newly generated exchange with mq:Hello.inq as the routing key: Now, we need to encapsulate the handling of the message as: var messageHandler = new MessageHandler<Hello>(rabbitMqServer, message => { var hello = message.GetBody(); var name = hello.Name; name.Print(); return null; }); The MessageHandler<T> class is used internally in all the messaging solutions and looks for retries and replies. Now, we need to connect the message handler to the queue. using System; using System.IO; using System.Threading.Tasks; using RabbitMQ.Client; using RabbitMQ.Client.Exceptions; using ServiceStack.Messaging; using ServiceStack.RabbitMq; var consumer = new RabbitMqBasicConsumer(channel); channel.BasicConsume(queue: queueName, noAck: false, consumer: consumer); Task.Run(() => { while (true) { BasicGetResult basicGetResult; try { basicGetResult = consumer.Queue.Dequeue(); } catch (EndOfStreamException) { // this is ok return; } catch (OperationInterruptedException) { // this is ok return; } var message = basicGetResult.ToMessage<Hello>(); messageHandler.ProcessMessage(messageQueueClient, message); } }); This creates a RabbitMqBasicConsumer object, which is used to consume the temporary queue. To process messages we try to dequeuer from the Queue property in a separate task. This example does not handle the disconnects and reconnects from the server and does not integrate with the services (however, both can be achieved). Integrate RabbitMQ in your service The integration of RabbitMQ in a ServiceStack service does not differ overly from RedisMQ. All you have to do is adapt to the Configure method of your host. using Funq; using ServiceStack; using ServiceStack.Messaging; using ServiceStack.RabbitMq; public override void Configure(Container container) { container.Register<IMessageService>(arg => new RabbitMqServer()); container.Register<IMessageFactory>(arg => new RabbitMqMessageFactory()); var messageService = container.Resolve<IMessageService>(); messageService.RegisterHandler<Hello> (this.ServiceController.ExecuteMessage); messageService.Start(); } The registration of an IMessageService is needed for the rerouting of the handlers to your service; and also, the registration of an IMessageFactory is relevant if you want to publish a message in your service with PublishMessage. Summary In this article the messaging pattern was introduced along with all the available clients of existing Message Queues. Resources for Article: Further resources on this subject: ServiceStack applications[article] Web API and Client Integration[article] Building a Web Application with PHP and MariaDB – Introduction to caching [article]

How to Develop for Today's Wearable Tech with Pebble.js

Eugene Safronov
09 Oct 2015
7 min read
Pebble is a smartwatch that pairs with both Android and iPhone devices via Bluetooth. It has an e-paper display with LED backlight, accelerometer and compass sensors. On top of that battery lasts up to a week between charges. From the beginning Pebble team embraced the Developer community which resulted in powerful SDK. Although a primary language for apps development is C, there is a room for JavaScript developers as well. PebbleKit JS The PebbleKit JavaScript framework expands the ability of Pebble app to run JavaScript logic on the phone. It allows fast access to data location from the phone and has API for getting data from the web. Unfortunately app development still requires programming in C. I could recommend some great articles on how to get started. Pebble.js Pebble.js, in contrast to PebbleKit JS, allows developers to create watchapp using only JavaScript code. It is simple yet powerful enough for creating watch apps that fetch and display data from various web services or remotely control other smartdevices. The downside of that approach is connected to the way how Pebble.js works. It is built on top of the standard Pebble SDK and PebbleKit JS. It consists of a C app that runs on the watch and interacts with the phone in order to process user actions. The Pebble.js library provides an API to build user interface and then remotely controls the C app to display it. As a consequence of the described approach watchapp couldn't function without a connection to the phone. On a side note, I would mention that library is open source and still in beta, so breaking API changes are possible. First steps There are 2 options getting started with Pebble.js: Install Pebble SDK on your local machine. This option allows you to customize Pebble.js. Create a CloudPebble account and work with your appwatch projects online. It is the easiest way to begin Pebble development. CloudPebble The CloudPebble environment allows you to write, build and deploy your appwatch applications both on a simulator and a physical device. Everything is stored in the cloud so no headache with compilers, virtual machines or python dependencies (in my case installation of boost-python end up with errors on MacOS). Hello world As an introduction let's build the Hello World application with Pebble.js. Create a new project in CloudPebble: Then write the following code in the app.js file: // Import the UI elements var UI = require('ui'); // Create a simple Card var card = new UI.Card({ title: 'Hello World', body: 'This is your first Pebble app!' }); // Display to the user card.show(); Start the code on Pebble watch or simulator and you will get the same screen as below: StackExchange profile Getting some data from web services like weather or news is easy with an Ajax library call. For example let's construct an app view of your StackExchange profile: var UI = require('ui'); var ajax = require('ajax'); // Create a Card with title and subtitle var card = new UI.Card({ title:'Profile', subtitle:'Fetching...' }); // Display the Card card.show(); // Construct URL // https://api.stackexchange.com/docs/me#order=desc&sort=reputation&filter=default&site=stackoverflow&run=true var API_TOKEN = 'put api token here'; var API_KEY = 'secret key'; var URL = 'https://api.stackexchange.com/2.2/me?key=' + API_KEY + '&order=desc&sort=reputation&access_token=' + API_TOKEN + '&filter=default'; // Make the request ajax( { url: URL, type: 'json' }, function(data) { // Success! 
    console.log('Successfully fetched StackOverflow profile');
    var profile = data.items[0];
    var badges = 'Badges: ' + profile.badge_counts.gold + ' ' +
                 profile.badge_counts.silver + ' ' +
                 profile.badge_counts.bronze;
    // Show it to the user
    card.subtitle('Rep: ' + profile.reputation);
    card.body(badges +
      '\nDaily change: ' + profile.reputation_change_day +
      '\nWeekly change: ' + profile.reputation_change_week +
      '\nMonthly change: ' + profile.reputation_change_month);
  },
  function(error) {
    // Failure!
    console.log('Failed fetching StackOverflow data: ' + JSON.stringify(error));
  }
);

Egg timer

Lastly, I would like to create a small real-life watchapp. I will demonstrate how to compose a timer app for boiling eggs. Let's start with the building blocks that we need:

Window is the basic building block in a Pebble application. It allows you to add different elements and specify a position and size for them.
Menu is a type of Window that displays a standard Pebble menu.
Vibe allows you to trigger vibration on the wrist. It will signal that the eggs are boiled.

Egg size screen

Users are able to select the egg size from the following options: Medium, Large, and Extra-large.

var UI = require('ui');

var menu = new UI.Menu({
  sections: [{
    title: 'Egg size',
    items: [{
      title: 'Medium'
    }, {
      title: 'Large'
    }, {
      title: 'Extra-large'
    }]
  }]
});
menu.show();

Timer selection screen

In the next step, the user selects the timer duration, depending on whether they want soft-boiled or hard-boiled eggs. The second-level menu for the medium size looks like this:

var mdMenu = new UI.Menu({
  sections: [{
    title: 'Egg timer',
    items: [{
      title: 'Runny',
      subtitle: '2m'
    }, {
      title: 'Soft',
      subtitle: '3m'
    }, {
      title: 'Firm',
      subtitle: '4m'
    }, {
      title: 'Hard',
      subtitle: '5m'
    }]
  }]
});

// open the second-level menu from the main one
// (lgMenu and xlMenu are built the same way for the other egg sizes)
menu.on('select', function(e) {
  if (e.itemIndex === 0) {
    mdMenu.show();
  } else if (e.itemIndex === 1) {
    lgMenu.show();
  } else {
    xlMenu.show();
  }
});

Timer screen

When the timer duration is selected, we start a countdown.

// Vibe and Vector2 modules are needed for vibration and for positioning the text element
var Vibe = require('ui/vibe');
var Vector2 = require('vector2');

mdMenu.on('select', onTimerSelect);
lgMenu.on('select', onTimerSelect);
xlMenu.on('select', onTimerSelect);

// timeouts mapping from menu subtitle to seconds
var timeouts = {
  '2m': 120,
  '3m': 180,
  '4m': 240,
  '5m': 300,
  '6m': 360,
  '7m': 420
};

function onTimerSelect(e) {
  var timeout = timeouts[e.item.subtitle];
  timer(timeout);
}

The final bit of the watchapp is to display the timer, show a message, and notify the user with a vibration on the wrist when the time has elapsed.

var readyMessage = new UI.Card({
  title: 'Done',
  body: 'Your eggs are ready!'
});

function timer(timerInSec) {
  var intervalId = setInterval(function() {
    timerInSec--;
    // notify with a double vibration shortly before the end
    if (timerInSec == 1) {
      Vibe.vibrate('double');
    }
    if (timerInSec > 0) {
      timerText.text(getTimeString(timerInSec));
    } else {
      readyMessage.show();
      timerWindow.hide();
      clearInterval(intervalId);
      // notify with a long vibration
      Vibe.vibrate('long');
    }
  }, 1000);

  var timerWindow = new UI.Window();
  var timerText = new UI.Text({
    position: new Vector2(0, 50),
    size: new Vector2(144, 30),
    font: 'bitham-42-light',
    text: getTimeString(timerInSec),
    textAlign: 'center'
  });
  timerWindow.add(timerText);
  timerWindow.show();
  timerWindow.on('hide', function() {
    clearInterval(intervalId);
  });
}

// format the remaining time into a minutes:seconds string
function getTimeString(timeInSec) {
  var minutes = parseInt(timeInSec / 60);
  var seconds = timeInSec % 60;
  return minutes + ':' + (seconds < 10 ?
    ('0' + seconds) : seconds);
}

Conclusion

You can do much more with Pebble.js:

Get accelerometer values
Display complex UIs mixing geometric elements, text, and images
Animate elements on the screen
Use the GPS and localStorage on the phone
Timeline API support is coming

Pebble.js is best suited for quick prototyping and for applications that require access to the Internet. The unfortunate part of writing an application in JavaScript is the requirement of a constant connection to the phone. Pebble.js apps also usually need more power and respond more slowly than a similar native app.

Useful links

Pebble.js tutorial
Pebble blog
Pebble.js docs
Egg timer code

About the author

Eugene Safronov is a software engineer with a proven record of delivering high-quality software. He has extensive experience in building successful teams and adjusting development processes to a project's needs. His primary focuses are the web (.NET and Node.js stacks) and cross-platform mobile development (native and hybrid). He can be found on Twitter as @sejoker.

First Principle and a Useful Way to Think

Packt
08 Oct 2015
8 min read
In this article, by Timothy Washington, author of the book Clojure for Finance, we will cover the following topics: Modeling the stock price activity Function evaluation First-Class functions Lazy evaluation Basic Clojure functions and immutability Namespace modifications and creating our first function (For more resources related to this topic, see here.) Modeling the stock price activity There are many types of banks. Commercial entities (large stores, parking areas, hotels, and so on) that collect and retain credit card information, are either quasi banks, or farm out credit operations to bank-like processing companies. There are more well-known consumer banks, which accept demand deposits from the public. There are also a range of other banks such as commercial banks, insurance companies and trusts, credit unions, and in our case, investment banks. As promised, this article will slowly build up a set of lagging price indicators that follow a moving stock price time series. In order to do that, I think it's useful to touch on stock markets, and to crudely model stock price activity. A stock (or equity) market, is a collection of buyers and sellers trading economic assets (usually companies). The stock (or shares) of those companies can be equities listed on an exchange (New York Stock Exchange, London Stock Exchange, and others), or may be those traded privately. In this exercise, we will do the following: Crudely model the stock price movement, which will give us a test bed for writing our lagging price indicators Introduce some basic features of the Clojure language Function evaluation The Clojure website has a cheatsheet (http://clojure.org/cheatsheet) with all of the language's core functions. The first function we'll look at is rand, a function that randomly gives you a number within a given range. So in your edgar/ project, launch a repl with the lein repl shell command. After a few seconds, you will enter repl (Read-Eval-Print-Loop). Again, Clojure functions are executed by being placed in the first position of a list. The function's arguments are placed directly afterwards. In your repl, evaluate (rand 35) or (rand 99) or (rand 43.21) or any number you fancy Run it many times to see that you can get any different floating point number, within 0 and the upper bound of the number you provided First-Class functions The next functions we'll look at are repeatedly and fn. repeatedly is a function that takes another function and returns an infinite (or length n if supplied) lazy sequence of calls to the latter function. This is our first encounter of a function that can take another function. We'll also encounter functions that return other functions. Described as First-Class functions, this falls out of lambda calculus and is one of the central features of functional programming. As such, we need to wrap our previous (rand 35) call in another function. fn is one of Clojure's core functions, and produces an anonymous, unnamed function. We can now supply this function to repeatedly. In your repl, if you evaluate (take 25 (repeatedly (fn [] (rand 35)))), you should see a long list of floating point numbers with the list's tail elided. Lazy evaluation We only took the first 25 of the (repeatedly (fn [] (rand 35))) result list, because the list (actually a lazy sequence) is infinite. Lazy evaluation (or laziness) is a common feature in functional programming languages. 
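A quick REPL illustration of this point (a sketch; the actual numbers will differ because rand is random):

;; Defining an infinite lazy sequence returns immediately,
;; because no element is computed at definition time
(def prices (repeatedly (fn [] (rand 35))))

;; Elements are only computed when something consumes them
(take 3 prices)
;; => (12.61 30.02 4.87), for example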
Being infinite, Clojure chooses to delay evaluating most of the list until it's needed by some other function that pulls out some values. Laziness benefits us by increasing performance and letting us more easily construct control flow. We can avoid needless calculation, repeated evaluations, and potential error conditions in compound expressions. Let's try to pull out some values with the take function. take itself, returns another lazy sequence, of the first n items of a collection. Evaluating (take 25 (repeatedly (fn [] (rand 35)))) will pull out the first 25 repeatedly calls to rand which generates a float between 0 and 35. Basic Clojure functions and immutability There's many operations we can perform over our result list (or lazy sequence). One of the main approaches of functional programming is to take a data structure, and perform operations over top of it to produce a new data structure, or some atomic result (a string, number, and so on). This may sound inefficient at first. But most FP languages employ something called immutability to make these operations efficient. Immutable data structures are the ones that cannot change once they've been created. This is feasible as most immutable, FP languages use some kind of structural data sharing between an original and a modified version. The idea is that if we run evaluate (conj [1 2 3] 4), the resulting [1 2 3 4] vector shares the original vector of [1 2 3]. The only additional resource that's assigned is for any novelty that's been introduced to the data structure (the 4). There's a more detailed explanation of (for example) Clojure's persistent vectors here: conj: This conjoins an element to a collection—the collection decides where. So conjoining an element to a vector (conj [1 2 3] 4) versus conjoining an element to a list (conj '(1 2 3) 4) yield different results. Try it in your repl. map: This passes a function over one or many lists, yielding another list. (map inc [1 2 3]) increments each element by 1. reduce (or left fold): This passes a function over each element, accumulating one result. (reduce + (take 100 (repeatedly (fn [] (rand 35))))) sums the list. filter: This constrains the input by some condition. >=: This is a conditional function, which tests whether the first argument is greater than or equal to the second function. Try (>= 4 9) and (>= 9 1). fn: This is a function that creates a function. This unnamed or anonymous function can have any instructions you choose to put in there. So if we only want numbers above 12, we can put that assertion in a predicate function. Try entering the below expression into your repl: (take 25 (filter (fn [x] (>= x 12)) (repeatedly (fn [] (rand 35))))) Modifying the namespaces and creating our first function We now have the basis for creating a function. It will return a lazy infinite sequence of floating point numbers, within an upper and lower bound. defn is a Clojure function, which takes an anonymous function, and binds a name to it in a given namespace. A Clojure namespace is an organizational tool for mapping human-readable names to things like functions, named data structures and such. Here, we're going to bind our function to the name generate-prices in our current namespace. You'll notice that our function is starting to span multiple lines. This will be a good time to author the code in your text editor of choice. I'll be using Emacs: Open your text editor, and add this code to the file called src/edgar/core.clj. Make sure that (ns edgar.core) is at the top of that file. 
After adding the following code, you can then restart repl. (load "edgaru/core") uses the load function to load the Clojure code in your in src/edgaru/core.clj: (defn generate-prices [lower-bound upper-bound] (filter (fn [x] (>= x lower-bound)) (repeatedly (fn [] (rand upper-bound))))) The Read-Eval-Print-Loop In our repl, we can pull in code in various namespaces, with the require function. This applies to the src/edgar/core.clj file we've just edited. That code will be in the edgar.core namespace: In your repl, evaluate (require '[edgar.core :as c]). c is just a handy alias we can use instead of the long name. You can then generate random prices within an upper and lower bound. Take the first 10 of them like this (take 10 (c/generate-prices 12 35)). You should see results akin to the following output. All elements should be within the range of 12 to 35: (29.60706184716407 12.507593971664075 19.79939384292759 31.322074615579716 19.737852534147326 25.134649707849572 19.952195022152488 12.94569843904663 23.618693004455086 14.695872710062428) There's a subtle abstraction in the preceding code that deserves attention. (require '[edgar.core :as c]) introduces the quote symbol. ' is the reader shorthand for the quote function. So the equivalent invocation would be (require (quote [edgar.core :as c])). Quoting a form tells the Clojure reader not to evaluate the subsequent expression (or form). So evaluating '(a b c) returns a list of three symbols, without trying to evaluate a. Even though those symbols haven't yet been assigned, that's okay, because that expression (or form) has not yet been evaluated. But that begs a larger question. What is reader? Clojure (and all Lisps) are what's known as homoiconic. This means that Clojure code is also data. And data can be directly output and evaluated as code. The reader is the thing that parses our src/edgar/core.clj file (or (+ 1 1) input from the repl prompt), and produces the data structures that are evaluated. read and eval are the 2 essential processes by which Clojure code runs. The evaluation result is printed (or output), usually to the standard output device. Then we loop the process back to the read function. So, when the repl reads, your src/edgar/two.clj file, it's directly transforming that text representation into data and evaluating it. A few things fall out of that. For example, it becomes simpler for Clojure programs to directly read, transform and write out other Clojure programs. The implications of that will become clearer when we look at macros. But for now, know that there are ways to modify or delay the evaluation process, in this case by quoting a form. Summary In this article, we learned about basic features of the Clojure language and how to model the stock price activity. Besides these, we also learned function evaluation, First-Class functions, the lazy evaluation method, namespace modifications and creating our first function. Resources for Article: Further resources on this subject: Performance by Design[article] Big Data[article] The Observer Pattern [article]

Linux Shell Scripting

Packt
08 Oct 2015
22 min read
This article is by Ganesh Naik, the author of the book Learning Linux Shell Scripting, published by Packt Publication (https://www.packtpub.com/networking-and-servers/learning-linux-shell-scripting). Whoever works with Linux will come across the Shell as first program to work with. The Graphical User Interface (GUI) usage has become very popular due to ease of use. Those who want to take advantage of the power of Linux will use the Shell program by default. Shell is a program, which gives the user direct interaction with the operating system. Let's understand the stages in the evolution of the Linux operating system. Linux was developed as a free and open source substitute for UNIX OS. The chronology can be as follows: The UNIX operating system was developed by Ken Thomson and Dennis Ritchie in 1969. It was released in 1970. They rewrote the UNIX using C language in 1972. In 1991, Linus Torvalds developed the Linux Kernel for the free operating system. (For more resources related to this topic, see here.) Comparison of shells Initially, the UNIX OS used a shell program called Bourne Shell. Then eventually, many more shell programs were developed for different flavors of UNIX. The following is brief information about different shells: Sh: Bourne Shell Csh: C Shell Ksh: Korn Shell Tcsh: Enhanced C Shell Bash: GNU Bourne Again Shell Zsh: extension to Bash, Ksh, and Tcsh Pdksh: extension to KSH A brief comparison of various shells is presented in the following table: Feature Bourne C TC Korn Bash Aliases no yes yes yes yes Command-line editing no no yes yes yes Advanced pattern matching no no no yes yes Filename completion no yes yes yes yes Directory stacks (pushd and popd) no yes yes no yes History no yes yes yes yes Functions yes no no yes yes Key binding no no yes no yes Job control no yes yes yes yes Spelling correction no no yes no yes Prompt formatting no no yes no yes What we see here is that, generally, the syntax of all these shells are 95% similar. Tasks done by shell Whenever we type any text in the shell terminal, it is the responsibility of shell to execute the command properly. The activities done by shell are as follows: Reading text and parsing the entered command Evaluating meta-characters such as wildcards, special characters, or history characters Process io-redirection, pipes, and background processing Signal handling Initialize programs for execution Working in shell Let's get started by opening the terminal, and we will familiarize ourselves with the Bash Shell environment. Open the Linux terminal and type in: $ echo $SHELL /bin/bash The preceding output in terminal says that current shell is /bin/bash such as BASH shell. $ bash --version GNU bash, version 2.05.0(1)-release (i386-redhat-linux-gnu) Our first script – Hello World We will now write our first shell script called hello.sh. You can use any editor of your choice such as vi, gedit, nano, and similar. I prefer to use the vi editor. Create a new file hello.sh as follows: #!/bin/bash # This is comment line echo "Hello World" ls date Save the newly created file. The #!/bin/bash line is called the shebang line. The combination of the characters # and ! is called the magic number. The shell uses this to call the intended shell such as /bin/bash in this case. This should always be the first line in a shell script. The next few lines in the shell script are self-explanatory: Any line starting with # will be treated as comment line. 
Exception to this would be the 1st line with #!/bin/bash echo command will print Hello World on screen ls will display directory content on console date command will show current date and time We can execute the newly created file giving following command: Technique one:$ bash hello.sh Technique two:$ chmod +x hello.sh By running the preceding command, we are adding executable permission to our newly created file. You will learn more about file permission in following sections of this same chapter. $ ./hello.sh By running the preceding command, we are executing hello.sh as executable file. By technique one, we passed filename as an argument to bash shell. The output of executing hello.sh will be as follows: Hello World hello.sh Sun Jan 18 22:53:06 IST 2015 Compiler and interpreter – difference in process In any program development, the following are the two options: Compilation: Using a compiler-based language, such as C, C++, Java and other similar languages Interpreter: Using interpreter-based languages, such as Bash shell scripting When we use compiler-based language, we compile the complete source code, and as a result of compilation, we get a binary executable file. We then execute the binary to check the performance of our program. On the contrary, when we develop shell script, such as an interpreter-based program, every line of the program is input to bash shell. The lines of Shell Script are executed one by one sequentially. Even if the second line of a script has an error, the first line will be executed by the shell interpreter. Command substitution In keyboard, there is one interesting key backward quote such as `. This key is normally situated below the Esc key. If we place text between two successive back quotes, then echo will execute those as commands instead of processing them as plane text. Alternate syntax for $(command) that uses the backtick character `: $(command) or `command` Let's see an example: $ echo "Today is date" Output: Today is date Now modify the command as follows: $ echo "Today is `date`" Or: $ echo "Today is $(date)" Output: Today is Fri Mar 20 15:55:58 IST 2015 Pipes Pipes are used for inter-process communication: $ command_1 | command_2 In this case, the output of command_1 will be send as input to command_2: A simple example is as follows: $ who | wc The preceding simple command will be doing three different activities. First, it will copy output of who command to temporary file. Then, wc will read temporary file and will display the result. Finally, the temporary file will be deleted. Understanding variables Let's learn about creating variables in shell. Declaring variables in Linux is very easy. We just need to use variable names and initialize it with required content: $ person="Ganesh Naik" To get the content of variable, we need to prefix $ before variable: $ echo person person $ echo $person Ganesh Naik The unset command can be used to delete a variable: $ a=20 $ echo $a $ unset a Command unset will clear or remove the variable from shell environment as well: $ person="Ganesh Naik" $ echo $person $ set The set command will show all variables declared in shell. Exporting variable By using export command, we are making variables available in child process or subshell. But if we export variables in child process, the variables will not be available in parent process. 
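A small demonstration of this scoping (a sketch; the script name is just an example):

#!/bin/bash
# parent.sh - run as: bash parent.sh
MESSAGE="visible only here"
export GREETING="visible in child processes"
bash -c 'echo "child sees: $GREETING"'   # prints the exported value
bash -c 'echo "child sees: $MESSAGE"'    # prints an empty value; MESSAGE was not exported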
We can view environment variables by either of the following command: $ env $ printenv Whenever we create a shell script and execute it, a new shell process is created and shell script runs in that process. Any exported variable values are available to the new shell or to any subprocess. We can export any variable by either of the following: $ export NAME $ declare -x NAME Interactive shell scripts – reading user input The read command is a shell built-in command for reading data from a file or a keyboard. The read command receives the input from a keyboard or a file, till it receives a newline character. Then it converts the newline character in null character. Read a value and store it in the variable: read VARIABLE echo $VARIABLE This will receive text from keyboard. The received text will be stored in shell variable VARIABLE. Working with command-line arguments Command line arguments are required for the following reasons: It informs the utility or command which file or group of files to process (reading/writing of files) Command line arguments tell the command/utility which option to use Check the following command line: $ my_program arg1 arg2 arg3 If my_command is a bash shell script, then we can access every command line positional parameters inside the script as following: $0 would contain "my_program" # Command $1 would contain "arg1" # First parameter $2 would contain "arg2" # Second parameter $3 would contain "arg3" # Third parameter Let's create a param.sh script as follows: #!/bin/bash echo "Total number of parameters are = $#" echo "Script name = $0" echo "First Parameter is $1" echo "Second Parameter is $2" echo "Third Parameter is $3" echo "Fourth Parameter is $4" echo "Fifth Parameter is $5" echo "All parameters are = $*" Run the script as follows: $ param.sh London Washington Delhi Dhaka Paris Output: Total number of parameters are = 5 Command is = ./hello.sh First Parameter is London Second Parameter is Washington Third Parameter is Delhi Fourth Parameter is Dhaka Fifth Parameter is Paris All parameters are = London Washington Delhi Dhaka Paris Understanding set Many a times we may not pass arguments on the command line; but, we may need to set parameters internally inside the script. We can declare parameters by the set command as follows: $ set USA Canada UK France $ echo $1 USA $ echo $2 Canada $ echo $3 UK $ echo $4 France Working with arrays Array is a list of variables. For example, we can create array FRUIT, which will contain many fruit names. The array does not have a limit on how many variables it may contain. It can contain any type of data. The first element in an array will have the index value as 0. $ FRUITS=(Mango Banana Apple) $ echo ${FRUITS[*]} Mango Banana Apple $ echo $FRUITS[*] Mango[*] $ echo ${FRUITS[2]} Apple $ FRUITS[3]=Orange $ echo ${FRUITS[*]} Mango Banana Apple Orange Debugging – Tracing execution (option -x) The -x option, short for xtrace, or execution trace, tells the shell to echo each command after performing the substitution steps. This will enable us to see the value of variables and commands. 
We can trace the execution of a shell script as follows:

$ bash -x hello.sh

Alternatively, we can enable debugging inside the script itself by changing the shebang line:

#!/bin/bash -x

Let's test the earlier script:

$ bash -x hello.sh
+ echo Hello student
Hello student
++ date
+ echo The date is Fri May 1 00:18:52 IST 2015
The date is Fri May 1 00:18:52 IST 2015
+ echo Your home shell is /bin/bash
Your home shell is /bin/bash
+ echo Good-bye student
Good-bye student

Summary of various debugging options

-n: Syntax error checking only; no commands are executed.
-v: Runs in verbose mode. The shell echoes each command prior to executing it.
-x: Traces execution. The shell echoes each command after performing the substitution steps, so we can see the values of variables and commands.

Checking exit status of commands

Automation using shell scripts involves checking whether the earlier command executed successfully or failed, whether a file is present or not, and so on. You will learn various constructs such as if, case, and similar, where you will need to check whether certain conditions are true or false. Accordingly, our script should conditionally execute various commands. Let's enter the following command:

$ ls

Using the bash shell, we can check whether the preceding command executed successfully or failed as follows:

$ echo $?

The preceding command will return 0 if the ls command executed successfully. The result will be non-zero, such as 1 or 2 or any other non-zero number, if the command failed. The bash shell stores the status of the last command execution in the ? variable. If we need to check the status of the last command execution, then we should check the content of the ? variable.

Understanding the test command

Let's look at the following example to check the content or value of an expression:

$ test $name = Ganesh
$ echo $?

The result is 0 on success and 1 on failure. In the preceding example, we want to check whether the content of the variable name is the same as Ganesh. To check this, we used the test command, which stores the result of the comparison in the ? variable. We can also use the following syntax for the preceding test command. In this case, we use [ ] instead of the test command and enclose the expression to be evaluated in square brackets:

$ [ $name = Ganesh ]    # Brackets replace the test command
$ echo $?
0

String comparison options for the test command

The following is a summary of the various options for string comparison using test:

-n string: True if the length of the string is non-zero.
-z string: True if the length of the string is zero.
string1 != string2: True if the strings are not equal.
string1 == string2 or string1 = string2: True if the strings are equal.
string1 > string2: True if string1 sorts after string2 lexicographically.
string1 < string2: True if string1 sorts before string2 lexicographically.

Suppose we want to check whether the length of a string is non-zero; we can do it as follows:

test -n $string
or
[ -n $string ]
echo $?

If the result is 0, then we can conclude that the string length is non-zero. If the content of ? is non-zero, then the string has length 0.

Numerical comparison operators for the test command

The following is a summary of the various options for numerical comparison using test.
[integer_1 -eq integer_2]: integer_1 is equal to integer_2
[integer_1 -ne integer_2]: integer_1 is not equal to integer_2
[integer_1 -gt integer_2]: integer_1 is greater than integer_2
[integer_1 -ge integer_2]: integer_1 is greater than or equal to integer_2
[integer_1 -lt integer_2]: integer_1 is less than integer_2
[integer_1 -le integer_2]: integer_1 is less than or equal to integer_2

Let's write a shell script to learn the usage of the various numerical test operators:

#!/bin/bash
num1=10
num2=30
echo $(($num1 < $num2))   # compare for less than
[ $num1 -lt $num2 ]       # compare for less than
echo $?
[ $num1 -ne $num2 ]       # compare for not equal
echo $?
[ $num1 -eq $num2 ]       # compare for equal to
echo $?

File test options for the test command

The following are the various options for file handling operations using the test command:

-b file_name: This checks whether the file is a block special file
-c file_name: This checks whether the file is a character special file
-d file_name: This checks whether the directory exists
-e file_name: This checks whether the file exists
-f file_name: This checks whether the file is a regular file and not a directory
-G file_name: This checks whether the file exists and is owned by the effective group ID
-g file_name: This checks whether the file has the set-group-ID bit set
-k file_name: This checks whether the file has the sticky bit set
-L file_name: This checks whether the file is a symbolic link
-p file_name: This checks whether the file is a named pipe
-O file_name: This checks whether the file exists and is owned by the effective user ID
-r file_name: This checks whether the file is readable
-S file_name: This checks whether the file is a socket
-s file_name: This checks whether the file has a nonzero size
-t fd: This checks whether the file descriptor fd is opened on a terminal
-u file_name: This checks whether the file has the set-user-ID bit set
-w file_name: This checks whether the file is writable
-x file_name: This checks whether the file is executable

File testing binary operators

The following are the various options for binary file operations using test:

[ file_1 -nt file_2 ]: This checks whether file_1 is newer than file_2
[ file_1 -ot file_2 ]: This checks whether file_1 is older than file_2
[ file_1 -ef file_2 ]: This checks whether file_1 and file_2 have the same device or inode numbers

Let's write a script for testing basic file attributes, such as whether it is a file or a folder and whether its size is bigger than 0:

#!/bin/bash
# Check if it is a directory
[ -d work ]
echo $?
# Check if it is a regular file
[ -f test.txt ]
echo $?
# Check if the file has a size greater than 0
[ -s test.txt ]
echo $?

Logical test operators

The following are the various options for logical operations using test:

[ string_1 -a string_2 ]: Both string_1 and string_2 are true
[ string_1 -o string_2 ]: Either string_1 or string_2 is true
[ ! string_1 ]: Not a string_1 match
[[ pattern_1 && pattern_2 ]]: Both pattern_1 and pattern_2 are true
[[ pattern_1 || pattern_2 ]]: Either pattern_1 or pattern_2 is true
[[ ! pattern ]]: Not a pattern match

Reference: the Bash reference manual at http://www.gnu.org/software/bash/

Conditional constructs – if else

We use the if command to check a pattern or a command status and, accordingly, make decisions about which commands of the script to execute. The syntax of the if conditional is as follows:

if command
then
    command
    command
fi

From the preceding syntax, we can clearly understand the working of the if conditional construct.
Initially, the if statement will execute the block of command. If the result of command execution is true or 0, then all the commands which are enclosed between then and fi will be executed. If the status of command execution after if is false or non-zero, then all the commands after then will be ignored and the control of execution will directly go to fi. The simple example for checking the status of last command executed using if construct is as follows: #!/bin/bash if [ $? -eq 0 ] then echo "Command was successful." else echo "Command was successful." fi Whenever we run any command, the exit status of the command will be stored in the ? variable. The if construct will be very useful in checking the status of the last command. Switching case Apart from simple decision making with if, it is also possible to process multiple decision-making operations using command case. In a case statement, the expression contained in a variable is compared with a number of expressions, and for each expression matched, a command is executed. A case statement has the following structure: case variable in value1) command(s) ;; value2) command(s) ;; *) command(s) ;; esac For illustrating switch case scripting example, we will write script as follows. We will ask the user to enter any one of the number from range 1 to 9. We will check the entered number by case command. If the user enters any other number, then we will display the error by giving the Invalid key message. #!/bin/bash echo "Please enter number from 1 to 4" read number case $number in 1) echo "ONE" ;; 2) echo "TWO" ;; 3) echo "Three" ;; 4) echo "FOUR" ;; *) echo "SOME ANOTHER NUMBER" ;; esac Output: Please enter any number from 1 to 4 2 TWO Looping with for command For iterative operations, bash shell uses three types of loops: for, while, and until. By using the for looping command, we can execute set of commands, for a finite number of times, for every item in a list. In the for loop command, the user-defined variable is specified. In the for command, after the in keyword, a list of values can be specified. The user-defined variable will get the value from that list, and all the statements between do and done get executed, until it reaches the end of the list. The purpose of the for loop is to process a list of elements. It has the following syntax. The simple script with for loop could be as follows: for command in clear date cal do sleep 1 $command Done In the preceding script, the commands clear, date, and cal will be called one after the other. The command sleep for one second will be called before every command. If we need to loop continuously or infinitely, then following is the syntax: for ((;;)) do command done Let's write simple script as follows. In this script, we will print variable var 10 times. #!/bin/bash for var in {1..10} do echo $var done Exiting from the current loop iteration with continue With the help of the continue command, it is possible to exit from a current iteration from loop and to resume the next iteration of the loop. We use commands for, while, or until for the loop iterations. The following is the script with the for loop with the continue command, to skip a certain part of loop commands: #!/bin/bash for x in 1 2 3 do echo before $x continue 1 echo after $x done exit 0 Exiting from a loop with break In previous section, we discussed about how continue can be used to exit from the current iteration of a loop. The break command is another way to introduce a new condition within a loop. 
Unlike continue, however, it causes the loop to be terminated altogether if the condition is met. In the following script, we are checking the directory content. If directory is found, then we exit the loop and display the message that the first directory is found. #!/bin/bash rm -rf sample* echo > sample_1 echo > sample_2 mkdir sample_3 echo > sample_4 for file in sample* do if [ -d "$file" ]; then break; fi done echo The first directory is $file rm -rf sample* exit 0 Working with loop using do while Similar to the for command, while is also the command for loop operations. The condition or expression next to while is evaluated. If it is a success or 0, then the commands inside do and done are executed. The purpose of a loop is to test a certain condition and execute a given command while the condition is true (while loop) or until the condition becomes true (until loop). In the following script, we are printing numbers 1 to 10 on the screen by using the while loop. #!/bin/bash declare -i x x=0 while [ $x -le 10 ] do echo $x x=$((x+1)) done Using until The until command is similar to the while command. The given statements in loop are executed, as long as it evaluates the condition as true. As soon as the condition becomes false, then the loop is exited. Syntax: until command do command(s) done In the following script, we are printing numbers 0 to 9 on the screen. When the value of variable x becomes 10, then the until loop stops executing. #!/bin/bash x=0 until [ $x -eq 10 ] do echo $x x=`expr $x + 1` done Understanding functions In the real-word scripts, we break down big tasks or scripts in smaller logical tasks. This modularization of scripts helps in better development and understanding of code. The smaller logical block of script can be written as a function. Functions can be defined on command line or inside scripts. Syntax for defining functions on command line is as follows: functionName { command_1; command_2; . . . } Let's write a very simple function for illustrating the preceding syntax. $ hello() { echo 'Hello world!' ; } We can use the preceding defined function as follows: $ hello Output: Hello world! Functions should be defined at the beginning of script. Command source and '.' Normally, whenever we enter any command, a new process is created. If we want to make functions from scripts to be made available in the current shell, then we need a technique that will run script in the current shell instead of creating a new shell environment. The solution to this problem is using the source or . commands. The commands source and . can be used to run shell script in the current shell instead of creating new process. This helps in declaring function and variables in current shell. The syntax is as follows: $ source filename [arguments] Or: $ . filename [arguments] $ source functions.sh Or: $ . functions.sh If we pass command line arguments, those will be handled inside script as $1, $2 …and similar: $ source functions.sh arg1 arg2 or $ . /path/to/functions.sh arg1 arg2 The source command does not create new shell; but runs the shell scripts in current shell, so that all the variables and functions will be available in current shell for usage. System startup, inittab, and run levels When we power on the Linux system, the shell scripts are run one after another and the Linux system is initialized. These scripts start various services, daemons, start databases, mount discs, and many more applications are started. 
Even during the shutting down of system, certain shell scripts are executed so that important system data and information can be saved to the disk and applications are properly shut down. These are called as boot, startup, and shutdown scripts. These scripts are copied during the installation of the Linux operating system in our computer. As a developer or administrator, understanding these scripts may help you in understating and debugging the Linux system, and if required, you can customize these scripts. During the system startup, as per run level, various scripts are called to initialize the basic operating system. Once the basic operating system in initialized, the user login process starts. This process is explained in the following topics. System-wide settings scripts In the /etc/ folder, the following files are related to user level initialization: /etc/profile: Few distributions will have additional folder such as /etc/profile.d/. All the scripts from the profile.d folder will be executed. /etc/bash.bashrc: Scripts in the /etc/ folder will be called for all the users. Particular user specific initialization scripts are located the in HOME folder of each user. These are as follows: $HOME/.bash_profile: This contains user specific bash environment default settings. This script is called during login process. $HOME/.bash_login: Second user environment initialization script, called during the login process. $HOME/.profile: If present, this script internally calls the .bashrc script file. $HOME/.bashrc: This is an interactive shell or terminal initialization script. If we customize .bashrc such as added new alias commands or declare new function or environment variables, then we should execute .bashrc to take its effect. Summary In this article, you have learned about basic the Linux Shell Scripting along with the shell environment, creating and using variables, command line arguments, various debugging techniques, decision-making techniques and various looping techniques while testing for numeric, strings, logical and file handling-related operations, and the about writing functions. You also learned about system initialization and various initializing script and about how to customizing them. This article has been a very short introduction to Linux shell scripting. For complete information with detailed explanation and numerous sample scripts, you can refer to the book at https://www.packtpub.com/networking-and-servers/learning-linux-shell-scripting. Resources for Article: Further resources on this subject: CoreOS – Overview and Installation[article] Getting Started[article] An Introduction to the Terminal [article]

Creating a City Information App with Customized Table Views

Packt
08 Oct 2015
19 min read
In this article by Cecil Costa, the author of Swift 2 Blueprints, we will cover the following: Project overview Setting it up The first scene Displaying cities information (For more resources related to this topic, see here.) Project overview The idea of this app is to give users information about cities such as the current weather, pictures, history, and cities that are around. How can we do it? Firstly, we have to decide on how the app is going to suggest a city to the user. Of course, the most logical city would be the city where the user is located, which means that we have to use the Core Location framework to retrieve the device's coordinates with the help of GPS. Once we have retrieved the user's location, we can search for cities next to it. To do this, we are going to use a service from http://www.geonames.org/. Other information that will be necessary is the weather. Of course, there are a lot of websites that can give us information on the weather forecast, but not all of them offer an API to use it for your app. In this case, we are going to use the Open Weather Map service. What about pictures? For pictures, we can use the famous Flickr. Easy, isn't it? Now that we have the necessary information, let's start with our app. Setting it up Before we start coding, we are going to register the needed services and create an empty app. First, let's create a user at geonames. Just go to http://www.geonames.org/login with your favorite browser, sign up as a new user, and confirm it when you receive a confirmation e-mail. It may look like everything has been done, however, you still need to upgrade your account to use the API services. Don't worry, it's free! So, open http://www.geonames.org/manageaccount and upgrade your account. Don't use the user demo provided by geonames, even for development. This user exceeds its daily quota very frequently. With geonames, we can receive information on cities by their coordinates, but we don't have the weather forecast and pictures. For weather forecasts, open http://openweathermap.org/register and register a new user and API. Lastly, we need a service for the cities' pictures. In this case, we are going to use Flickr. Just create a Yahoo! account and create an API key at https://www.flickr.com/services/apps/create/. While creating a new app, try to investigate the services available for it and their current status. Unfortunately, the APIs change a lot like their prices, their terms, and even their features. Now, we can start creating the app. Open Xcode, create a new single view application for iOS, and call it Chapter 2 City Info. Make sure that Swift is the main language like the following picture: The first task here is to add a library to help us work with JSON messages. In this case, a library called SwiftyJSON will solve our problem. Otherwise, it would be hard work to navigate through the NSJSONSerialization results. Download the SwiftyJSON library from https://github.com/SwiftyJSON/SwiftyJSON/archive/master.zip, then uncompress it, and copy the SwiftyJSON.swift file in your project. Another very common way of installing third party libraries or frameworks would be to use CocoaPods, which is commonly known as just PODs. This is a dependency manager, which downloads the desired frameworks with their dependencies and updates them. Check https://cocoapods.org/ for more information. Ok, so now it is time to start coding. We will create some functions and classes that should be common for the whole program. 
As you know, many functions return NSError if something goes wrong. However, sometimes errors are detected by our own code, for example when you receive a JSON message with an unexpected structure. For this reason, we are going to create a class that creates custom NSError objects. Once we have it, we will add a new file to the project (command + N) called ErrorFactory.swift and add the following code:

import Foundation

class ErrorFactory {
    static let Domain = "CityInfo"

    enum Code: Int {
        case WrongHttpCode = 100,
             MissingParams = 101,
             AuthDenied = 102,
             WrongInput = 103
    }

    class func error(code: Code) -> NSError {
        let description: String
        let reason: String
        let recovery: String

        switch code {
        case .WrongHttpCode:
            description = NSLocalizedString("Server replied wrong code (not 200, 201 or 304)", comment: "")
            reason = NSLocalizedString("Wrong server or wrong api", comment: "")
            recovery = NSLocalizedString("Check if the server is the right one", comment: "")
        case .MissingParams:
            description = NSLocalizedString("There are some missing params", comment: "")
            reason = NSLocalizedString("Wrong endpoint or API version", comment: "")
            recovery = NSLocalizedString("Check the url and the server version", comment: "")
        case .AuthDenied:
            description = NSLocalizedString("Authorization denied", comment: "")
            reason = NSLocalizedString("User must accept the authorization for using its feature", comment: "")
            recovery = NSLocalizedString("Open user auth panel.", comment: "")
        case .WrongInput:
            description = NSLocalizedString("A parameter was wrong", comment: "")
            reason = NSLocalizedString("Probably a cast wasn't correct", comment: "")
            recovery = NSLocalizedString("Check the input parameters.", comment: "")
        }

        return NSError(domain: ErrorFactory.Domain,
                       code: code.rawValue,
                       userInfo: [
                           NSLocalizedDescriptionKey: description,
                           NSLocalizedFailureReasonErrorKey: reason,
                           NSLocalizedRecoverySuggestionErrorKey: recovery
                       ])
    }
}

The previous code shows the usage of NSError, which requires a domain: a string that differentiates the error type and origin and avoids collisions between error codes. The error code is just an integer that represents the error that occurred. We used an enumeration based on integer values, which is easier for the developer to remember and allows us to convert the enumeration case to an integer with the rawValue property.

The third argument of the NSError initializer is a dictionary that contains messages which can be useful to the user (actually, to the developer). Here, we have three keys:

NSLocalizedDescriptionKey: This contains a basic description of the error
NSLocalizedFailureReasonErrorKey: This explains what caused the error
NSLocalizedRecoverySuggestionErrorKey: This suggests what can be done to avoid this error

As you might have noticed, for these strings we used a function called NSLocalizedString, which will retrieve the message in the corresponding language if it is set in the Localizable.strings file.

So, let's add a new file to our app and call it Helpers.swift; click on it for editing. URLs have special character combinations that represent special characters; for example, a whitespace in a URL is sent as the combination %20 and an open parenthesis is sent as the combination %28. The stringByAddingPercentEncodingWithAllowedCharacters string method allows us to do this character conversion. If you need more information on percent encoding, you can check the Wikipedia entry at https://en.wikipedia.org/wiki/Percent-encoding.
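A quick illustration of what this method does (the city name here is just an example):

let city = "San Francisco"
let encoded = city.stringByAddingPercentEncodingWithAllowedCharacters(
    .URLHostAllowedCharacterSet())
print(encoded!)   // prints "San%20Francisco"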
As we are going to work with web APIs, we will need to encode some text before sending it to the corresponding server. Type the following function to convert a dictionary into a URL-encoded string:

func toUriEncoded(params: [String: String]) -> String {
    var records = [String]()
    for (key, value) in params {
        let valueEncoded = value.stringByAddingPercentEncodingWithAllowedCharacters(.URLHostAllowedCharacterSet())
        records.append("\(key)=\(valueEncoded!)")
    }
    return "&".join(records)
}

Another common task is to dispatch work onto the main queue. You might have already used code like dispatch_async(dispatch_get_main_queue(), {() -> () in … }); however, it is too long. We can reduce it to something like M{…} with this helper:

func M(completion: () -> ()) {
    dispatch_async(dispatch_get_main_queue(), completion)
}

A further common task is requesting JSON messages. To do so, we just need to know the endpoint, the required parameters, and the callback. So, we can start with this function as follows:

func requestJSON(urlString: String, params: [String: String] = [:], completion: (JSON, NSError?) -> Void) {
    let fullUrlString = "\(urlString)?\(toUriEncoded(params))"
    if let url = NSURL(string: fullUrlString) {
        NSURLSession.sharedSession().dataTaskWithURL(url) { (data, response, error) -> Void in
            if error != nil {
                completion(JSON(NSNull()), error)
                return
            }
            var jsonData = data!
            var jsonString = NSString(data: jsonData, encoding: NSUTF8StringEncoding)!

Here, we have to add some tricky code, because the Flickr API always returns a callback function called jsonFlickrApi wrapping the corresponding JSON. This callback must be removed before the JSON text is parsed. We can fix this issue by adding the following code:

            // if it is the Flickr response we have to remove the callback function jsonFlickrApi()
            // from the JSON string
            if (jsonString as String).characters.startsWith("jsonFlickrApi(".characters) {
                jsonString = jsonString.substringFromIndex("jsonFlickrApi(".characters.count)
                let end = (jsonString as String).characters.count - 1
                jsonString = jsonString.substringToIndex(end)
                jsonData = jsonString.dataUsingEncoding(NSUTF8StringEncoding)!
            }

Now, we can complete this function by creating a JSON object and calling the callback:

            let json = JSON(data: jsonData)
            completion(json, nil)
        }.resume()
    } else {
        completion(JSON(NSNull()), ErrorFactory.error(.WrongInput))
    }
}

At this point, the app has a good skeleton, which means that, from now on, we can code the app itself.

The first scene

Create a project group (command + option + N) for the view controllers and move the ViewController.swift file (created by Xcode) to this group. As we are going to have more than one view controller, it is also a good idea to rename it to InitialViewController.swift. Now, open this file and rename its class from ViewController to InitialViewController:

class InitialViewController: UIViewController {

Once the class is renamed, we need to update the corresponding view controller in the storyboard by:

Clicking on the storyboard.
Selecting the view controller (the only one we have till now).
Going to the Identity Inspector by using the command + option + 3 combination. Here, you can update the class name to the new one.
Pressing enter and confirming that the module name is automatically updated from None to the product name.

The following picture demonstrates where you should do this change and how it should look after the change. Great! Now we can draw the scene. Firstly, let's change the view background color.
To do it, select the view that hangs from the view controller. Go to the Attribute Inspector by pressing command+ option + 4, look for background color, and choose other, as shown in the following picture: When the color dialog appears, choose the Color Sliders option at the top and select the RGB Sliders combo box option. Then, you can change the color as per your choice. In this case, let's set it to 250 for the three colors: Before you start a new app, create a mockup of every scene. In this mockup, try to write down the color numbers for the backgrounds, fonts, and so on. Remember that Xcode still doesn't have a way to work with styles as websites do with CSS, meaning that if you have to change the default background color, for example, you will have to repeat it everywhere. On the storyboard's right-hand side, you have the Object Library, which can be easily accessed with the command + option + control + 3 combination. From there, you can search for views, view controllers, and gestures, and drag them to the storyboard or scene. The following picture shows a sample of it: Now, add two labels, a search bar, and a table view. The first label should be the app title, so let's write City Info on it. Change its alignment to center, the font to American Typewriter, and the font size to 24. On the other label, let's do the same, but write Please select your city and its font size should be 18. The scene must result in something similar to the following picture: Do we still need to do anything else on this storyboard scene? The answer is yes. Now it is time for the auto layout, otherwise the scene components will be misplaced when you start the app. There are different ways to add auto layout constraints to a component. An easy way of doing it is by selecting the component by clicking on it like the top label. With the control key pressed, drag it to the other component on which the constraint will be based like the main view. The following picture shows a sample of a constraint being created from a table to the main view: Another way is by selecting the component and clicking on the left or on the middle button, which are to the bottom-right of the interface builder screen. The following picture highlights these buttons: Whatever is your favorite way of adding constraints, you will need the following constraints and values for the current scene: City Info label Center X equals to the center of superview (main view), value 0 City Info label top equals to the top layout guide, value 0 Select your city label top vertical spacing of 8 to the City Info label Select your city label alignment center X to superview, value 0 Search bar top value 8 to select your city label Search bar trailing and leading space 0 to superview Table view top space (space 0) to the search bar Table view trailing and leading space 0 to the search bar Table view bottom 0 to superview Before continuing, it is a good idea to check whether the layout suits for every resolution. To do it, open the assistant editor with command + option + .return and change its view to Preview: Here, you can have a preview of your screen on the device. You can also rotate the screens by clicking on the icon with a square and a arched arrow over it: Click on the plus sign to the bottom-left of the assistant editor to add more screens: Once you are happy with your layout, you can move on to the next step. Although the storyboard is not yet done, we are going to leave it for a while. Click on the InitialViewController.swift file. 
Let's start receiving information about where the device is by using GPS. To do it, import the Core Location framework and set the view controller as its delegate:

import CoreLocation

class InitialViewController: UIViewController, CLLocationManagerDelegate {

After this, we can set the Core Location manager as a property and initialize it in the viewDidLoad method. Type the following code to set locationManager and initialize InitialViewController:

var locationManager = CLLocationManager()

override func viewDidLoad() {
    super.viewDidLoad()
    locationManager.delegate = self
    locationManager.desiredAccuracy = kCLLocationAccuracyThreeKilometers
    locationManager.distanceFilter = 3000
    if locationManager.respondsToSelector(Selector("requestWhenInUseAuthorization")) {
        locationManager.requestWhenInUseAuthorization()
    }
    locationManager.startUpdatingLocation()
}

After initializing the location manager, we have to check whether the GPS is working by implementing the didUpdateLocations method. Right now, we are just going to print the last location:

func locationManager(manager: CLLocationManager!, didUpdateLocations locations: [CLLocation]!) {
    let lastLocation = locations.last!
    print(lastLocation)
}

Now, we can test the app. However, we still need to perform one more step. Go to your Info.plist file by pressing command + option + J and typing the file name. Add a new entry with the NSLocationWhenInUseUsageDescription key, change its type to String, and set its value to This app needs to know your location. This last step has been mandatory since iOS 8. Press play and check that you receive coordinates, although not very frequently.

Displaying cities information

The next step is to create a class to store the information received from the Internet. In this case, we can do it in a straightforward manner by copying the JSON object properties into our class properties. Create a new group called Models and, inside it, a file called CityInfo.swift. There, you can code CityInfo as follows:

class CityInfo {
    var fcodeName: String?
    var wikipedia: String?
    var geonameId: Int!
    var population: Int?
    var countrycode: String?
    var fclName: String?
    var lat: Double!
    var lng: Double!
    var fcode: String?
    var toponymName: String?
    var name: String!
    var fcl: String?

    init?(json: JSON) {
        // if any required field is missing we must not create the object.
        if let name = json["name"].string,
               geonameId = json["geonameId"].int,
               lat = json["lat"].double,
               lng = json["lng"].double {
            self.name = name
            self.geonameId = geonameId
            self.lat = lat
            self.lng = lng
        } else {
            return nil
        }
        self.fcodeName = json["fcodeName"].string
        self.wikipedia = json["wikipedia"].string
        self.population = json["population"].int
        self.countrycode = json["countrycode"].string
        self.fclName = json["fclName"].string
        self.fcode = json["fcode"].string
        self.toponymName = json["toponymName"].string
        self.fcl = json["fcl"].string
    }
}

Note that our initializer has a question mark in its header; this is called a failable initializer. Traditional initializers always return a new instance of the requested object. With failable initializers, however, you can return either a new instance or a nil value, indicating that the object couldn't be constructed.

In this initializer, we used an object of the JSON type, which is a class that belongs to the SwiftyJSON library/framework. You can easily access the members of a json object by using brackets with string indices, like json["field name"], or use brackets with integer indices to access the elements of a json array.
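For example, a quick illustration of this access pattern (the dictionary here is made up just for the demo):

let sample = JSON(["name": "Paris", "population": 2241346, "tags": ["capital", "europe"]])
let cityName = sample["name"].string          // Optional("Paris")
let population = sample["population"].int     // Optional(2241346)
let firstTag = sample["tags"][0].string       // Optional("capital")
let missing = sample["wikipedia"].string      // nil, the key is not present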
Whichever accessor you use, the return type will always be a JSON object, which can't be directly assigned to variables of other built-in types, such as integers, strings, and so on. Casting from a JSON object to a basic type is done by accessing properties named after the destination type, such as .string for casting to string objects, .int for casting to int objects, .array for an array of JSON objects, and so on.

Now, we have to think about how this information is going to be displayed. As we have to display this information repeatedly, a good way to do so would be with a table view. Therefore, we will create a custom table view cell for it. Go to your project navigator, create a new group called Cells, and add a new file called CityInfoCell.swift. Here, we are going to implement a class that inherits from UITableViewCell. Note that the whole object can be configured just by setting the cityInfo property:

import UIKit

class CityInfoCell:UITableViewCell {
    @IBOutlet var nameLabel:UILabel!
    @IBOutlet var coordinates:UILabel!
    @IBOutlet var population:UILabel!
    @IBOutlet var infoImage:UIImageView!

    private var _cityInfo:CityInfo!
    var cityInfo:CityInfo {
        get {
            return _cityInfo
        }
        set (cityInfo){
            self._cityInfo = cityInfo
            self.nameLabel.text = cityInfo.name
            if let population = cityInfo.population {
                self.population.text = "Pop: \(population)"
            } else {
                self.population.text = ""
            }
            self.coordinates.text = String(format: "%.02f, %.02f", cityInfo.lat, cityInfo.lng)
            if let _ = cityInfo.wikipedia {
                self.infoImage.image = UIImage(named: "info")
            }
        }
    }
}

Return to the storyboard and add a table view cell from the Object Library to the table view by dragging it. Click on this table view cell and add three labels and one image view to it. Try to organize it to look similar to the following picture:

Change the labels' font family to American Typewriter, and the font size to 16 for the city name and 12 for the population and location labels. Drag the info.png and noinfo.png images to your Images.xcassets project. Go back to your storyboard and set the image to noinfo in the UIImageView attribute inspector, as shown in the following screenshot:

As you know, we have to set the auto layout constraints. Just remember that the constraints will take the table view cell as the superview. So, here are the constraints that need to be set:

- City name label leading equals 0 to the leading margin (left)
- City name label top equals 0 to the superview top margin
- City name label bottom equals 0 to the superview bottom margin
- City label horizontal space 8 to the population label
- Population leading equals 0 to the superview center X
- Population top equals -8 to the superview top
- Population trailing (right) equals 8 to the noinfo image
- Population bottom equals 0 to the location top
- Population leading equals 0 to the location leading
- Location height equals 21
- Location trailing equals 8 to the image leading
- Location bottom equals 0 to the image bottom
- Image trailing equals 0 to the superview trailing margin
- Image aspect ratio width equals 0 to the image height
- Image bottom equals -8 to the superview bottom
- Image top equals -8 to the superview top

Has everything been done for this table view cell? Of course not. We still need to set its class and connect each component. Select the table view cell and change its class to CityInfoCell:

While we are here, let's also change the cell identifier to cityinfocell.
This way, we can easily instantiate the cell from our code.

Now, you can connect the cell components with the ones we have in the CityInfoCell class and also connect the table view with the view controller:

@IBOutlet var tableView: UITableView!

There are different ways to connect a view with the corresponding property. An easy way is to open the assistant editor with the command + option + enter combination, leaving the storyboard on the left-hand side and the Swift file on the right-hand side. Then, you just need to drag the circle that appears on the left-hand side of the @IBOutlet or @IBAction attribute and connect it with the corresponding visual object on the storyboard.

After this, we need to set the table view delegate and data source, and also the search bar delegate, to the view controller. It means that the InitialViewController class needs to have the following header. Replace the current InitialViewController header with:

class InitialViewController: UIViewController, CLLocationManagerDelegate, UITableViewDataSource, UITableViewDelegate, UISearchBarDelegate {

Connect the table view and search bar delegate and data source with the view controller by control-dragging from the table view to the view controller's icon at the top of the screen, as shown in the following screenshot:

Summary

In this article, you learned how to create a custom NSError, which is the traditional way of reporting that something went wrong. Every time a function returns an NSError, you should try to solve the problem or report what has happened to the user. We also saw the new way of trapping errors with try and catch. This is a new feature in Swift 2, but that doesn't mean it will replace NSError; they will be used in different situations.

Resources for Article:

Further resources on this subject:
Nodes [article]
Network Development with Swift [article]
Playing with Swift [article]
PowerShell – the DSC Architecture

Packt
07 Oct 2015
47 min read
 "As an architect you design for the present, with an awareness of the past for a future which is essentially unknown."                                                                                        – Herman Foster In this article by James Pogran, author of the book Learning Powershell DSC, we will cover the following topics: Push and pull management General workflow Local Configuration Manager DSC Pull Server Deployment considerations (For more resources related to this topic, see here.) Overview DSC enables you to ensure that the components of your server environment have the correct state and configuration. It enables declarative, autonomous, and idempotent deployment, as well as configuration and conformance of standards-based managed elements. By its simplest definition, DSC is a Windows service, a set of configuration files, and a set of PowerShell modules and scripts. Of course, there is more to this; there's push and pull modes, MOF compilation, and module packaging, but this is really all you need to describe DSC architecture at a high level. At a lower level, DSC is much more complex. It has to be complex to handle all the different variations of operations you can throw at it. Something so flexible has to have some complicated inner workings. The beauty of DSC is that these complex inner workings are abstracted away from you most of the time, so you can focus on getting the job done. And if you need to, you can access the complex inner workings and tune them to your needs. To ensure we are all on the same page about the concepts we are going to cover, let's cover some terms. This won't be an exhaustive list, but it will be enough to get us started. Term Description Idempotent An operation that can be performed multiple times with the same result. DSC configuration file A PowerShell script file that contains the DSC DSL syntax and list of DSC Resources to execute. DSC configuration data A PowerShell data file or separate code block that defines data that can change on target nodes. DSC Resource A PowerShell module that contains idempotent functions that brings a target node to a desired state. DSC Cmdlets PowerShell Cmdlets specially made for DSC operations. MOF file Contains the machine-readable version of a DSC configuration file. LCM The DSC engine, which controls all execution of DSC configurations. CM The process of managing configuration on the servers in your environment. Drift A bucket term to indicate the difference between the desired state of a machine and the current state. Compilation Generally a programming term, in this case it refers to the process of turning a DSC configuration file into an MOF file Metadata Data that describes other data. Summarizes basic information about other data in a machine-readable format. Push and pull modes First and foremost, you must understand how DSC gets the information needed to configure a target node from the place it's currently stored to the target node. This may sound counterintuitive; you may be thinking we should be covering syntax or the different file formats in use first. Before we get to where we're going, we have to know how we are getting there. The more established CM products available on the market have coalesced into two approaches: push and pull. Push and pull refer to the directions and methods used to move information from the place where it is stored to the target nodes. It also describes the direction commands being sent to or received by the target nodes. 
Most CM products primarily use the pull method, which means they rely on agents to schedule, distribute, and rectify configurations on target nodes, but have a central server that holds configuration information and data. The server maintains the current state of all the target nodes, while the agent periodically executes configuration runs on the target nodes. This is a simplistic but effective approach, as it enables several highly important features. Because the server has the state of every machine, a queryable record of all servers exists that a user can utilize. At any one point in time, you can see the state of your entire infrastructure at a glance or in granular detail. Configuration runs can be executed on demand against a set of nodes or all nodes. Other popular management products that use this model are Puppet and Chef.

Other CM products primarily use the push method, where a single workstation or user calls the agent directly. The user is solely responsible for scheduling executions and resolving all dependencies that the agent needs. It's a loose but flexible arrangement, as it allows the agents to operate even if there isn't a central server to report the status to. This is called a master-less deployment, in that there isn't anything keeping track of things.

The benefit of this model largely depends on your specific use cases. Some environments need granularity in scheduling and a high level of control over how and when agents perform actions, so they benefit highly from the push model. They choose when to check for drift and when to correct drift on a server-to-server basis or for an entire environment. Common uses for this approach are test and QA environments, where software configurations change frequently and there is a high expectation of customization.

Other environments are less concerned with low-level customization and control and are more focused on ensuring a common state for a large environment (thousands and thousands of servers). Scheduling and controlling each individual server among thousands is less important than knowing that eventually, all servers will be in the same state, no matter how new or old they are. These environments want new servers quickly that conform to an exacting specification without human intervention, so new servers are automatically pointed to a pull server for a configuration assignment.

Both DSC and other management products like Puppet and Chef can operate with and without a central server. Products like Ansible support only this master-less, push-style method of agent management. Choosing which product to use is more a choice of which approach fits your environment best, rather than which product is best.

The push management model

DSC offers a push-based approach that is controlled by a user workstation initiating an execution on agents on target nodes, but there isn't a central server orchestrating things. Push management is very much an interactive process, where the user directly initiates and executes a specified configuration. The following diagram shows the push deployment model:

This diagram shows the steps to perform a push deployment. The next section discusses the DSC workflow, where these steps will be covered, but for now, we see that a push deployment is comprised of three steps: authoring a configuration file, compiling the file to an MOF file, and then finally, executing the MOF on the target node. DSC operates in a push scenario when configurations are manually pushed to target nodes using the Start-DscConfiguration Cmdlet.
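As a rough sketch of that flow (the configuration name, output folder, and computer name below are illustrative placeholders rather than values from this article):

# Dot-source the script that defines the configuration, compile it to a
# folder of MOF files, and push the result to a target node.
. .\TestExample.ps1
TestExample -OutputPath .\TestExample
Start-DscConfiguration -Path .\TestExample -ComputerName 'web1.foobar.com' -Wait -Verbose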
Start-DscConfiguration can be executed interactively, where the configuration is executed and the status is reported back to the user as it runs. It can also be initiated in a fire-and-forget manner as a job on the target node, where the configuration is executed without reporting the status back to the user directly, but is instead logged to the DSC event log.

Pushing configurations allows a great deal of flexibility, and it's the primary way you will test your configurations. Run interactively with the Verbose and Wait parameters, the Start-DscConfiguration Cmdlet shows you a log of every step taken by the LCM, the DSC Resources it executes, and the entire DSC configuration run. A push-based approach also gives you an absolute time at which the target node will have a configuration applied, instead of waiting on a schedule. This is useful in server environments where servers are set up once and stay around for a long time. Push is the easier of the two DSC methods to set up and the more flexible, but it is the harder to maintain in large quantities and in the long term.

The pull management model

DSC offers a pull-based approach that is controlled by agents on target nodes, but there is a central server providing configuration files. This is a marked difference from the push models offered by other CM products. The following diagram shows the pull deployment model. The diagram shows the steps in a pull deployment and also shows how the status is reported to the compliance server. Refer back to the following diagram when we cover Pull Servers later in this article:

DSC operates in a pull scenario when configurations are stored on a DSC Pull Server and pulled by the LCM on each target node. The Pull Server is the harder of the two DSC methods to set up, but the easier to maintain in large node quantities and in the long term.

Pull management works great in server environments that have a lot of transient machines, like cloud or datacenter environments. These kinds of servers are created and destroyed frequently, and DSC applies configurations to them automatically as they check in. Pull Servers are also more scalable, as they can work against thousands of hosts in parallel. This seems counterintuitive, as most pull-based products make the central server the point of drift detection, scheduling, and so on. This isn't the case with a DSC Pull Server, however, as it does not detect drift, compile MOFs, or perform other high-cost actions. Compilation and the like happen on the author workstation or a continuous integration (CI) system, and drift detection and scheduling happen on the agent, so the load is distributed across agents and not placed on the Pull Server.

The general workflow

The following diagram shows the authoring, staging, and execution phases of the DSC workflow. You will notice it does not look much different than the push or pull model diagrams. This similarity is on purpose, as the architecture of DSC allows its usage in either a push or pull deployment to be the same until the execution phase. This reduces the complexity of your configuration files and allows them to be used in either deployment mode without modification. Let's have a look at the entire DSC workflow:

Authoring

In order to tell DSC what state the target node should be in, you have to describe that state in the DSC DSL syntax. The end goal of the DSL syntax is to create an MOF file. This listing and compilation process comprises the entirety of the authoring phase. Even so, you will not be creating the MOF files directly yourself.
The MOF syntax and format is very verbose and detailed, too much so for a human to reliably produce it. You can create an MOF file using a number of different methods: anything from Notepad to third-party tools, not just DSC tooling. Third-party vendors will eventually implement their own compilers, as the operations to compile MOF files are standardized and open for all to use, enabling DSC files to be authored on any operating system.

Syntax

DSC provides a DSL to help you create MOF files. We call the file that holds the DSL syntax the DSC configuration file. Even though it is a PowerShell script file (a text file with a .ps1 extension), it can't do anything on its own. You can try to execute a configuration file all you want; it won't do anything to the system by itself. A DSC configuration file holds the information for the desired state, not the execution code needed to bring the node to that desired state. We talked about this separation of configuration information and execution logic before, and we are going to keep seeing it repeatedly throughout our use of DSC.

The DSC DSL allows both imperative and declarative commands. What this means is that configuration files can both describe what has to be done (declarative) as well as contain PowerShell code that is executed inline (imperative). Declarative code will typically be DSC functions and resource declarations, and will make up the majority of code inside your DSC configuration file. Remember, the purpose of DSC is to express the expected state of the system, which you do by declaring it in these files in a human-readable language. Imperative code will typically make decisions based on metadata provided inside the configuration file, for example, choosing whether to apply a configuration to a target node inside the $AllNodes variable or deciding which files or modules to apply based on some algorithm. You will find that putting a lot of imperative code inside your configuration files will cause maintenance and troubleshooting problems in the future. Generally, a lot of imperative code indicates that you are performing actions or deciding on logic that should be in a DSC Resource, which is the best place to put imperative code.

Compilation

The DSC configuration file is compiled to an MOF format by invoking the declared DSC configuration block inside the DSC configuration file. When this is done, it creates a folder and one or more MOF files inside it. Each MOF file is for a single target node, containing all the configuration information needed to ensure the desired state on the target machine. If at this point you are looking for example code of what this looks like, the section The example workflow has what you are looking for. We will continue explaining MOF compilation here, but if you want to jump ahead, take a look at the example, and come back here when you are done, that's fine.

You can only have one MOF file applied to any target node at any given time. Why one MOF file per target node? This is a good question. Due to the architecture of DSC, an MOF file is the one source of truth for that server. It holds everything that can describe that server so that nothing is missed. With DSC partial configurations, you can have separate DSC configuration blocks to delineate different parts of your installation or environment. This enables multiple teams to collaborate and participate in defining configurations for the environment instead of forcing all teams to work from one monolithic DSC configuration script; a short sketch of this split follows.
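The following is a minimal, hedged sketch of such a split using the PowerShell v5 partial configuration syntax; the block names SqlConfig and OsConfig, and the use of the push refresh mode, are assumptions for illustration only:

[DSCLocalConfigurationManager()]
configuration PartialConfigDemo
{
    Node "localhost"
    {
        # Each team owns one partial configuration; the LCM merges them
        # into a single effective configuration for this node.
        PartialConfiguration SqlConfig
        {
            Description = "SQL role configuration, owned by the SQL team"
            RefreshMode = "Push"
        }
        PartialConfiguration OsConfig
        {
            Description = "Base operating system configuration, owned by operations"
            RefreshMode = "Push"
        }
    }
}

Each team then authors and publishes a configuration whose name matches its block, and the LCM combines the pieces into one document for the node.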
For example, you can have a DSC partial configuration for an SQL server that is handled by the SQL team and another DSC partial configuration for the base operating system configuration that is handled by the operations team. Both partial configurations are used to produce one MOF file for a target node while allowing either DSC partial configuration to be worked on separately. In some cases, it's easier to have a single DSC configuration script that has the logic to determine what a target node needs installed or configured rather than a set of DSC partial configuration files that have to be tracked together by different people. Whichever one you choose is largely determined by your environment. Staging After authoring the configuration files and compiling them into MOF files, the next step is the staging phase. This phase slightly varies if you are using a push or pull model of deployment. When using the push model, the MOF files are pushed to the target node and executed immediately. There isn't much staging with push, as the whole point is to be interactive and immediate. In PowerShell v4, if a target node is managed by a DSC Pull Server, you cannot push the MOF file to it by using the Start-DscConfiguration Cmdlet. In PowerShell v4, a target node is either managed by a DSC Pull Server or not. This distinction is somewhat blurred in PowerShell v5, as a new DSC mode allows a target node to both be managed by a DSC Pull Server and have MOF files pushed to it. When using the pull model, the MOF files are pushed to the DSC Pull Server by the user and then pulled down to target nodes by DSC agents. Because the local LCMs on each target node pull the MOF when they hit the correct interval, MOF files are not immediately processed, and thus are staged. They are only processed when the LCM pulls the MOF from the Pull Server. When attached to a Pull Server, the LCM performs other actions to stage or prepare the target node. The LCM will request all required DSC resources from the Pull Server in order to execute the MOF in the next phase. Whatever process the MOF file uses to get to the target node, the LCM processes the MOF file by naming it pending.mof file and placing it inside the $env:systemRoot/system32/configuration path. If there was an existing MOF file executed before, it takes that file and renames it the previous.mof file. Execution After staging, the MOF files are ready for execution on the target node. An MOF is always executed as soon as it is delivered to a target node, regardless of whether the target node is configured for push or pull management. The LCM does run on a configurable schedule, but this schedule controls when the LCM pulls the new MOFs from the DSC Pull Server and when it checks the system state against the described desired state in the MOF file. When the LCM executes the MOF successfully, it renames the pending.mof file to current.mof file. The following diagram shows the execution phase: The execution phase operates the same no matter which deployment mode is in use, push or pull. However, different operations are started in the pull mode in comparison to the push mode, besides the obvious interactive nature of the push mode. Push executions In the push mode, the LCM expects all DSC resources to be present on the target node. Since the LCM doesn't have a way to know where to get the DSC resources used by the MOF file, it can't get them for you. Before running any push deployment on a target node, you must put all DSC resources needed there first. 
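A minimal sketch of that pre-staging step (the module name, target node, and administrative share path are assumptions used only for illustration):

# Copy a DSC Resource module into the target node's module path before
# pushing, so the LCM can find it when the pushed MOF references it.
Copy-Item -Path '.\xPSDesiredStateConfiguration' `
          -Destination '\\web1.foobar.com\c$\Program Files\WindowsPowerShell\Modules' `
          -Recurse -Force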
If the resources are not present, the execution will fail. Using the Start-DscConfiguration Cmdlet, the MOF files are executed immediately. This kind of execution only happens when the user initiates it. The user can opt for the execution caused by the Start-DscConfiguration Cmdlet to happen interactively and see the output as it happens, or have it happen in the background and complete without any user interaction. The execution can happen again if the LCM ConfigurationMode is set to ApplyAndMonitor or ApplyAndAutoCorrect, but the configuration will only be applied once if ConfigurationMode is set to ApplyOnly.

Pull executions

In the pull mode, the LCM contacts the Pull Server for a new configuration, and the LCM downloads a new one if present. The LCM will parse the MOF and download any DSC resources that are specified in the configuration file, respecting the version number specified there. The MOF file is executed on a schedule that is set in each target node's LCM configuration. The same LCM schedule rules apply to a target node that is attached to a Pull Server as to one that is not attached. The ApplyAndMonitor and ApplyAndAutoCorrect modes will continue to monitor the system state and change it if necessary. If it is set to the ApplyOnly mode, then the LCM will check with the Pull Server to see if there are new MOF files to download, but will only apply them if the last execution failed. The execution happens continuously on a schedule that the LCM was set to use. In the next section, we will cover exactly how the LCM schedules configuration executions.

The example workflow

At this point, a simple example of the workflow you will use will be helpful to explain what we just covered. We will first create an example DSC configuration file. Then, we will compile it to an MOF file and show an example execution using the push deployment model.

A short note about composing configuration files: if you use the built-in PowerShell Integrated Scripting Environment (ISE), then you will have IntelliSense provided as you type. This is useful as you start learning; the popup information can help you as you type things without having to look back at the documentation. The PowerShell ISE also provides on-demand syntax checking, and will look for errors as you type.

The following text would be saved as a TestExample.ps1 file. You will notice this is a standalone file and contains no configuration data. Let's look at the following code snippet, which is a complete example of a DSC configuration file:

# First we declare the configuration
Configuration TestExample {
    # Then we declare the node we are targeting
    Node "localhost" {
        # Then we declare the action we want to perform
        Log ImportantMessage {
            Message = "This has done something important"
        }
    }
}
# Compile the Configuration function
TestExample

We can see the Configuration keyword, which holds all the node statements and DSC Resource statements. Then, the Node keyword is used to declare the target node we are operating on. This can either be hardcoded, as in the example, or be passed in using configuration data. And finally, the resource declaration for the action we want to take is added. In this example, we will output a message to the DSC event log when this is run on the localhost.

We use the term keyword here to describe Configuration and Node. This is slightly inaccurate, as the actual definitions of Configuration and Node are PowerShell functions in the PSDesiredStateConfiguration module. PowerShell functions can also be defined as Cmdlets.
This interchangeability of terms here is partly due to PowerShell's naming flexibility and partly due to informal conventions. It's sometimes a hot topic of contention. To compile this DSC configuration file into an MOF, we run the following script from the PowerShell console: PS C:\Examples> .\TestExample.ps1 Directory: C:\Examples\TestExample Mode LastWriteTime Length Name ---- ------------- ------ ---- -a--- 5/20/2015 7:28 PM 1136 localhost.mof As we can see from the result, compiling the configuration file to an MOF resulted in a folder with the name of the configuration block we just created and with one file called the localhost.mof file. Don't worry too much about reading or understanding the MOF syntax right now. For the most part, you won't be reading or dealing with it directly in your everyday use, but it is useful to know how the configuration block format looks in the MOF format. Let's try the following snippet: /* @TargetNode='localhost' @GeneratedBy=James @GenerationDate=05/20/2015 19:28:50 @GenerationHost=BLUEBOX */ instance of MSFT_LogResource as $MSFT_LogResource1ref { SourceInfo = "C:\\Examples\\TestExample.ps1::8::9::Log"; ModuleName = "PSDesiredStateConfiguration"; ModuleVersion = "1.0"; ResourceID = "[Log]ImportantMessage"; Message = "This has done something important"; }; instance of OMI_ConfigurationDocument { Version="1.0.0"; Author="James"; GenerationDate="05/20/2015 19:28:50"; GenerationHost="BLUEBOX"; }; We can see from this MOF that not only do we programmatically state the intent of this configuration (log a message), but we also note the computer it was compiled on as well as the user that did it. This metadata is used by the DSC engine when applying configurations and reporting statuses back to a Pull Server. Then, we execute this configuration on a target node using the push deployment model by calling the Start-DscConfiguration Cmdlet: PS C:\Examples> Start-DscConfiguration –Path C:\Examples\TestExample –Wait –Verbose VERBOSE: Perform operation 'Invoke CimMethod' with following parameters, ''methodName' = SendConfigurationApply,'className' = MSFT_DSCLocalConfigurationManager,'namespaceName' = root/Microsoft/Windows/DesiredStateConfiguration'. VERBOSE: An LCM method call arrived from computer BLUEBOX with user sid ************. VERBOSE: [BLUEBOX]: LCM: [ Start Set ] VERBOSE: [BLUEBOX]: LCM: [ Start Resource ] [[Log]ImportantMessage] VERBOSE: [BLUEBOX]: LCM: [ Start Test ] [[Log]ImportantMessage] VERBOSE: [BLUEBOX]: LCM: [ End Test ] [[Log]ImportantMessage] in 0.0000 seconds. VERBOSE: [BLUEBOX]: LCM: [ Start Set ] [[Log]ImportantMessage] VERBOSE: [BLUEBOX]: [[Log]ImportantMessage] This has done something important VERBOSE: [BLUEBOX]: LCM: [ End Set ] [[Log]ImportantMessage] in 0.0000 seconds. VERBOSE: [BLUEBOX]: LCM: [ End Resource ] [[Log]ImportantMessage] VERBOSE: [BLUEBOX]: LCM: [ End Set ] in 0.3162 seconds. VERBOSE: Operation 'Invoke CimMethod' complete. VERBOSE: Time taken for configuration job to complete is 0.36 seconds Notice the logging here. We used the Verbose parameter, so we see listed before us every step that DSC took. Each line represents an action DSC is executing, and each has a Start and End word in it, signifying the start and end of each execution even though an execution may span multiple lines. Each INFO, VERBOSE, DEBUG, or ERROR parameter is written both to the console in front of us and also to the DSC event log. Everything done is logged for auditing and historical purposes. 
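If you want to review those entries after the run, the operational DSC event log can be queried directly; a small sketch:

# Read the most recent entries from the DSC operational event log
Get-WinEvent -LogName 'Microsoft-Windows-DSC/Operational' -MaxEvents 20 |
    Select-Object TimeCreated, LevelDisplayName, Message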
An important thing to note is that while everything is logged, not everything is logged to the same place. There are several DSC event logs: Microsoft-Windows-DSC/Operational, Microsoft-Windows-DSC/Analytical, and Microsoft-Windows-DSC/Debug. However, only the Microsoft-Windows-DSC/Operational event log is written to by default; you have to enable the Microsoft-Windows-DSC/Analytical and Microsoft-Windows-DSC/Debug event logs in order to see any events logged there. Any verbose messages are logged in Microsoft-Windows-DSC/Analytical, so beware if you use the Log DSC Resource extensively and intend to find those messages in the logs.

Configuration data

Now that we have covered how deployments work (push and pull) in DSC and covered the workflow (authoring, staging, and execution) for using DSC, we will pause here for a moment to discuss the differences between configuration files and configuration data.

The DSC configuration blocks contain the entirety of the expected state of the target node. The DSL syntax used to describe the state is expressed in one configuration file in a near list format. It expresses all configuration points of the target system and is able to express dependencies between configuration points. DSC configuration data is separated from DSC configuration files to reduce variance and duplication. Some points that are considered data are software version numbers, file path locations, registry setting values, and domain-specific information like server roles or department names.

You may be thinking, what is the difference between the data you put in a configuration file and a configuration data file? The data we put in a configuration file is structural data, data that does not change based on the environment. The data we put in configuration data files is environmental. For example, no matter the environment, a server needs IIS installed in order to serve webpages. The location of the source files for the webpage may change depending on whether the environment is the development environment or the production environment. The structural information (that we need IIS) is contained in the DSC configuration file, and the environmental information (the source file locations) is stored in the configuration data file.

Configuration data can be expressed in DSC in several ways.

Hardcoded data

Configuration data can be hardcoded inside DSC configuration files, but this is not optimal in most cases. You will mostly use this for static sets of information or to reduce redundant code, as shown in the following code snippet:

configuration FooBar
{
    $features = @('Web-Server', 'Web-Asp-Net45')
    Foreach($feature in $features){
        WindowsFeature "Install$($feature)" {
            Name = $feature
        }
    }
}

Parameter-based data

Configuration data can be passed as parameters to a configuration block, like so:

configuration FooBar
{
    param([switch]$foo,$bar)
    if($foo){
        WindowsFeature InstallIIS {
            Name = "Web-Server"
        }
    }elseif($bar){
        WindowsFeature InstallHyperV {
            Name = "Microsoft-Hyper-V"
        }
    }
}
FooBar -Foo

Hashtable data

The most flexible and preferred method is to use the ConfigurationData hashtable. This specifically structured hashtable provides a flexible way of declaring frequently changing data in a format that DSC is able to read and then insert into the MOF file as it compiles it. Don't worry too much if the importance of this feature is not readily apparent.
With the following commands, we define a specifically formatted hashtable called $data:

$data = @{
    # Node specific data
    # Note that this is an array of hashes. It's easy to miss
    # the array designation here
    AllNodes = @(
        # All the WebServers have this identical config
        @{
            NodeName = "*"
            WebsiteName = "FooWeb"
            SourcePath = "C:\FooBar\"
            DestinationPath = "C:\inetpub\FooBar"
            DefaultWebSitePath = "C:\inetpub\wwwroot"
        },
        @{
            NodeName = "web1.foobar.com"
            Role = "Web"
        },
        @{
            NodeName = "web2.foobar.com"
            Role = "Web"
        },
        @{
            NodeName = "sql.foobar.com"
            Role = "Sql"
        }
    );
}

configuration FooBar
{
    # Dynamically find the web nodes from configuration data
    Node $AllNodes.where{$_.Role -eq "Web"}.NodeName
    {
        # Install the IIS role
        WindowsFeature IIS {
            Ensure = "Present"
            Name = "Web-Server"
        }
    }
}

# Pass the configuration data to configuration as follows:
FooBar -ConfigurationData $data

The first item's key is AllNodes, the value of which is an array of hashtables. The content of these hashtables is free form, and can be whatever we need it to be, but it is meant to express the data for each target node. Here, we specify the roles of each node so that inside the configuration, we can perform a where clause and filter for only the nodes that have a web role. If you look back at the $AllNodes definition, you'll see the three nodes we defined (web1, web2, and sql), but also notice one where we just put an * sign in the NodeName field. This is a special convention that tells DSC that all the information in this hashtable is available to all the nodes defined in this AllNodes array. This is an easy way to specify defaults or properties that apply to all the nodes being worked on.

Local Configuration Manager

Now that we have covered how deployments work (push and pull) in DSC and covered the workflow (authoring, staging, and execution) for using DSC, we will talk about how the execution happens on a target node.

The LCM is the PowerShell DSC engine. It is the heart and soul of DSC. It runs on all target nodes and controls the execution of DSC configurations and resources, whether you are using a push or pull deployment model. It is a Windows service, but it is part of the WMI service host, so there is no direct service named LCM for you to look at.

The LCM has a large range of settings that control everything from the scheduling of executions to how the LCM handles configuration drift. LCM settings are settable by DSC itself, although using a slightly different syntax. This allows the LCM settings to be deployed just like DSC configurations, in an automatable and repeatable manner. These settings are applied separately from your DSC configurations, so you will have configuration files for your LCM and separate files for your DSC configurations. This separation means that LCM settings can be applied per server or to all servers, so not all your target nodes have to have the same settings. This is useful if some servers have to have a stricter schedule and tighter control over their drift, whereas others can be checked less often or be more relaxed about drift.

Since the LCM settings are different from DSC settings but describe how DSC operates, they are considered DSC metadata. You will sometimes see them referred to as metadata instead of settings, because they describe the entirety of the process and not just LCM-specific operations. These pieces of information are stored in a separate MOF file from the one the DSC configuration block compiles to.
These files are named with the NodeName field you gave them and appended with meta.mof as the file extension. Anytime you configure the LCM, the *.meta.mof files will be generated.

LCM settings

Common settings that you will configure are listed below. There are more settings available, but these are the ones that are most useful to know right away.

- AllowModuleOverwrite: Allows or disallows DSC resources to be overwritten on the target node. This applies to DSC Pull Server use only.
- ConfigurationMode: Determines the type of operations to perform on this host. For example, if set to ApplyAndAutoCorrect and if the current state does not match the desired state, then DSC applies the corrections needed.
- ConfigurationModeFrequencyMins: The interval in minutes to check if there is configuration drift.
- RebootNodeIfNeeded: Automatically reboot server if configuration requires it when applied.
- RefreshFrequencyMins: How often to check for a new configuration when LCM is attached to a Pull Server.
- RefreshMode: Determines which deployment mode the target is in, push or pull.

The LCM comes with most of these settings set to logical defaults to allow DSC to operate out of the box. You can check what is currently set by issuing the following Get-DscLocalConfigurationManager Cmdlet:

PS C:\Examples> Get-DscLocalConfigurationManager

ActionAfterReboot              : ContinueConfiguration
AllowModuleOverwrite           : False
CertificateID                  :
ConfigurationID                :
ConfigurationMode              : ApplyAndMonitor
ConfigurationModeFrequencyMins : 15
Credential                     :
DebugMode                      : {NONE}
DownloadManagerCustomData      :
DownloadManagerName            :
LCMCompatibleVersions          : {1.0}
LCMState                       : Idle
LCMVersion                     : 1.0
RebootNodeIfNeeded             : False
RefreshFrequencyMins           : 30
RefreshMode                    : PUSH
PSComputerName                 :

Configuration modes

An important setting to call out is the LCM ConfigurationMode setting. As stated earlier, this setting controls how DSC applies the configuration to the target node. There are three available settings: ApplyOnly, ApplyAndMonitor, and ApplyAndAutoCorrect. These settings will allow you to control how the LCM behaves and when it operates. This controls the actions taken when applying the configuration as well as how it handles drift occurring on the target node.

ApplyOnly

When the ApplyOnly mode is set, DSC will apply the configuration and do nothing further unless a new configuration is deployed to the target node. Note that this is a completely new configuration, not a refresh of the currently applied configuration. If the target node's configuration drifts or changes, no action will be taken by DSC. This is useful for a one time configuration of a target node or in cases where it is expected that a new configuration will be pushed at a later point, but some initial setup needs to be done now. This is not a commonly used setting.

ApplyAndMonitor

When the ApplyAndMonitor mode is set, DSC behaves exactly like ApplyOnly, except after the deployment, DSC will monitor the current state for configuration drift. This is the default setting for all DSC agents. It will report back any drift to the DSC logs or Pull Server, but will not act to rectify the drift. This is useful when you want to control when change happens on your servers, but reduces the autonomy DSC can have to correct changes in your infrastructure.

ApplyAndAutoCorrect

When the ApplyAndAutoCorrect mode is set, DSC will apply the configuration to the target node and continue to monitor for configuration drift.
If any drift is detected, it will be logged and the configuration will be reapplied to the target node to bring it back into compliance. This gives DSC the greatest autonomy to ensure your environment is valid and to act on any changes that may occur without your direct input. This is great for fully locked-down environments where variance is not allowed and must be corrected on the next scheduled run without fail.

Refresh modes

While the ConfigurationMode setting determines how DSC behaves in regard to configuration drift, the RefreshMode setting determines how DSC gets the configuration information. At the beginning of this article, we covered the push and pull deployment models, and this setting allows you to change which model the target node uses. By default, all installs are set to the push RefreshMode, which makes sense when you want DSC to work out of the box. Setting it to the pull RefreshMode allows the LCM to work with a central Pull Server.

The LCM configuration

Configuring the LCM is done by authoring an LCM configuration block with the desired settings specified. When compiled, the LCM configuration block produces a file with the extension meta.mof. Applying the meta.mof file is done by using the Set-DscLocalConfigurationManager Cmdlet.

You are not required to write your LCM configuration block in a separate file; it can alternatively be placed inside the DSC configuration file. There are several reasons to separate them. Your settings for the LCM could potentially change more often than your DSC configuration files, and keeping them separated reduces changes to your core files. You could also have different settings for different servers, which you may not want to express or tie down inside your DSC configuration files. It's up to you how you want to organize things.

Compiling the LCM configuration block to MOF is done just like a DSC configuration block, by invoking the name of the LCM configuration you defined. You apply the resulting meta.mof file to the target node using the Set-DscLocalConfigurationManager Cmdlet.

An example LCM configuration

An example LCM configuration is as follows, saved as ExampleLCMConfig.ps1. We could have put this inside a regular DSC configuration file, but it is separated here for a clearer example:

#Declare the configuration
Configuration SetTheLCM {
    # Declare the settings we want configured
    LocalConfigurationManager {
        ConfigurationMode = "ApplyAndAutoCorrect"
        ConfigurationModeFrequencyMins = 120
        RefreshMode = "Push"
        RebootNodeIfNeeded = $true
    }
}
SetTheLCM

To compile this configuration into an MOF file, you execute the configuration file in the PowerShell console:

PS C:\Examples> .\ExampleLCMConfig.ps1

    Directory: C:\Users\James\Desktop\Examples\SetTheLCM

Mode                LastWriteTime     Length Name
----                -------------     ------ ----
-a---         5/20/2015   7:28 PM        984 localhost.meta.mof

As we can see from the output, a localhost.meta.mof file was created inside a folder named after the configuration, SetTheLCM. The filename reminds us again that the LCM settings are considered DSC metadata, so any files or operations related to the LCM get the "meta" moniker. Looking at the contents of the MOF file, we see the same syntax as the MOF file generated by the DSC configuration file.
Let's have a look at the following snippet: /* @TargetNode='localhost' @GeneratedBy=James @GenerationDate=05/20/2015 19:28:50 @GenerationHost=BLUEBOX */ instance of MSFT_DSCMetaConfiguration as $MSFT_DSCMetaConfiguration1ref { RefreshMode = "Push"; ConfigurationModeFrequencyMins = 120; ConfigurationMode = "ApplyAndAutoCorrect"; RebootNodeIfNeeded = True; }; instance of OMI_ConfigurationDocument { Version="1.0.0"; Author="James"; GenerationDate="05/20/2015 19:28:50"; GenerationHost="BLUEBOX"; }; We then execute the LCM configuration by using the Set-DscLocalConfigurationManager cmdlet: PS C:\Examples> Set-DscLocalConfigurationManager -Path .\SetTheLCM\ -Verbose VERBOSE: Performing the operation "Start-DscConfiguration: SendMetaConfigurationApply" on target "MSFT_DSCLocalConfigurationManager". VERBOSE: Perform operation 'Invoke CimMethod' with following parameters, ''methodName' = SendMetaConfigurationApply,'className' = MSFT_DSCLocalConfigurationManager,'namespaceName' = root/Microsoft/Windows/DesiredStateConfiguration'. VERBOSE: An LCM method call arrived from computer BLUEBOX with user sid *********************. VERBOSE: [BLUEBOX]: LCM: [ Start Set ] VERBOSE: [BLUEBOX]: LCM: [ Start Resource ] [MSFT_DSCMetaConfiguration] VERBOSE: [BLUEBOX]: LCM: [ Start Set ] [MSFT_DSCMetaConfiguration] VERBOSE: [BLUEBOX]: LCM: [ End Set ] [MSFT_DSCMetaConfiguration] in 0.0520 seconds. VERBOSE: [BLUEBOX]: LCM: [ End Resource ] [MSFT_DSCMetaConfiguration] VERBOSE: [BLUEBOX]: LCM: [ End Set ] in 0.2555 seconds. VERBOSE: Operation 'Invoke CimMethod' complete. VERBOSE: Set-DscLocalConfigurationManager finished in 0.235 seconds. The DSC Pull Server The DSC Pull Server is your one stop central solution for managing a large environment using DSC. In the beginning of this article, we talked about the two deployment modes of DSC: push and pull. A DSC Pull Server operates with target nodes configured to be in the pull deployment mode. What is a DSC Pull Server? A DSC Pull Server is an IIS website that exposes an OData endpoint that responds to requests from the LCM configured on each target node and provides DSC configuration files and DSC Resources for download. That was a lot of acronyms and buzzwords, so let's take this one by one. IIS is an acronym for Internet Information Services, which is the set of components that allow you to host websites on a Windows server. OData is an acronym for Open Data Protocol, which defines a standard for querying and consuming RESTful APIs. One last thing to cover before we move on. A DSC pull server can be configured to use Server Message Block (SMB) shares instead of HTTP to distribute MOF files and DSC resources. This changes the distribution mechanism, but not much more internally to the DSC server. What does the Pull Server do for us? Since the LCM handles the scheduling and executing of the MOF files, what does the Pull Server do? The Pull Server operates as a single management point for all DSC operations. By deploying MOF files to the Pull Server, you control the configuration of any target node attached to it. Automatic and continuous configuration As a central location for all target nodes to report to, a Pull Server provides an automatic deployment of configurations. Once a target node's LCM is configured, it automatically will pull configurations and dependent files without requiring input from you. It will also do this continuously and on schedule, without requiring extra input from you. 
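As a hedged sketch of what attaching a node to a Pull Server looks like with the PowerShell v4 LCM settings (the server URL and the ConfigurationID GUID are placeholders; PowerShell v5 uses a different, block-based syntax for the same settings):

Configuration AttachToPullServer {
    Node "localhost" {
        LocalConfigurationManager {
            ConfigurationID = "b1948d2b-2b80-4c4a-9913-ae6dcbf23a4d"   # placeholder GUID
            RefreshMode = "Pull"
            DownloadManagerName = "WebDownloadManager"
            DownloadManagerCustomData = @{
                ServerUrl = "http://pullserver.foobar.com:8080/PSDSCPullServer.svc"
                AllowUnsecureConnection = "True"
            }
            ConfigurationMode = "ApplyAndAutoCorrect"
            RefreshFrequencyMins = 30
        }
    }
}
AttachToPullServer
# Apply it with: Set-DscLocalConfigurationManager -Path .\AttachToPullServer -Verbose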
Repository

The Pull Server is the central repository for all the MOF files and DSC Resources that the LCM uses to schedule and execute. With the push model, you are responsible for distributing the DSC Resources and MOF files to the target nodes yourself. A DSC Pull Server provides them to the target nodes on demand and ensures they have the correct version.

Reporting

The Pull Server tracks the status of every target node that uses it, so it also has another role, called a reporting server. You can query the server for the status of all the nodes in your environment, and the Pull Server will return information on their last run. A reporting server stores the pull operation status, configuration, and node information in a database. Reporting endpoints can be used to periodically check the status of the nodes to see whether their configurations are in sync with the Pull Server or not.

The PowerShell team transitioned from calling this a compliance server to calling it a reporting server during the PowerShell v5 development cycle.

Security

A Pull Server can be set up to use HTTPS, or SMB with NTFS permissions, for the MOF and DSC Resource repositories. This controls access to the DSC configuration files and DSC Resources, and also encrypts them over the wire. You will most likely at some point have to provide credentials for one of your settings or DSC Resources. Certificates can be used to encrypt the credentials being used in the DSC configurations. It would be foolish to enter the credentials directly inside the DSC configuration files, as they would be in plain text that anyone could read. By setting up and using certificates to encrypt the credentials, only the servers with the correct certificates can read them.

Setting up a DSC Pull Server

You would think that with so many dependencies, setting up a DSC Pull Server would be hard. Actually, it's a perfect example of using DSC to configure a server! Again, don't worry if some of this is still not clear; we will cover making DSC configuration files in more detail later.

Pull Server settings

A Pull Server has several configuration points for each of the roles it performs. These can either be set manually or through DSC itself. The Pull Server settings are:

- EndpointName: Configures the name of the OData endpoint.
- Port: The port the service listens on.
- CertificateThumbPrint: The SSL certificate thumbprint the web service uses.
- PhysicalPath: The install path of the DSC service.
- ModulePath: The path to the DSC Resources and modules.
- ConfigurationPath: The working directory for the DSC service.

The compliance server settings are:

- EndpointName: Configures the name of the OData endpoint.
- Port: The port the service listens on.
- CertificateThumbPrint: The SSL certificate thumbprint the web service uses.
- PhysicalPath: The install path of the DSC service.

Installing the DSC server

The following example is adapted from the one provided by the PowerShell team in the xPSDesiredStateConfiguration module. Just as when we showed an example DSC configuration in the authoring phase, don't get too caught up in the syntax that follows. Examine the structure and notice how much it looks like a list of what we need. Running this on a target node sets up everything needed to make it a Pull Server, ready to go from the moment it is finished.
The first step is to make a text file called SetupPullServer.ps1 with the following content:

# Declare our configuration here
Configuration SetupPullServer {
    Import-DSCResource -ModuleName xPSDesiredStateConfiguration
    # Declare the node we are targeting
    Node "localhost" {
        # Declare we need the DSC-Service installed
        WindowsFeature DSCServiceFeature {
            Ensure = "Present"
            Name = "DSC-Service"
        }
        # Declare what settings the Pull Server should have
        xDscWebService PSDSCPullServer {
            Ensure = "Present"
            State = "Started"
            EndpointName = "PSDSCPullServer"
            Port = 8080
            CertificateThumbPrint = "AllowUnencryptedTraffic"
            PhysicalPath = "$env:SystemDrive\inetpub\wwwroot\PSDSCPullServer"
            ModulePath = "$env:PROGRAMFILES\WindowsPowerShell\DscService\Modules"
            ConfigurationPath = "$env:PROGRAMFILES\WindowsPowerShell\DscService\Configuration"
            DependsOn = "[WindowsFeature]DSCServiceFeature"
        }
        # Declare what settings the Compliance Server should have
        xDscWebService PSDSCComplianceServer {
            Ensure = "Present"
            State = "Started"
            EndpointName = "PSDSCComplianceServer"
            Port = 9080
            PhysicalPath = "$env:SystemDrive\inetpub\wwwroot\PSDSCComplianceServer"
            CertificateThumbPrint = "AllowUnencryptedTraffic"
            IsComplianceServer = $true
            DependsOn = @("[WindowsFeature]DSCServiceFeature","[xDSCWebService]PSDSCPullServer")
        }
    }
}

The next step is to invoke the DSC Configuration Cmdlet to produce an MOF file. By now, we don't need to show the output MOF file, as we have covered that already. We then run the Start-DscConfiguration Cmdlet against the resulting folder, and the Pull Server is set up.

A good thing to remember when you eventually try to use this DSC configuration script to make a DSC Pull Server is that you can't make a client operating system a Pull Server. If you are working on a Windows 8.1 or 10 desktop while trying out these examples, some of them might not work for you because you are on a desktop OS. For example, the WindowsFeature DSC Resource only works on the server OS, whereas the WindowsOptionalFeature DSC Resource operates on the desktop OS. You will have to check each DSC resource to find out what OS or platforms it supports, just like you would check the release notes of software to find out the supported system requirements.

Adding MOF files to a Pull Server

Adding an MOF file to a Pull Server is slightly more involved than using an MOF with the push mode. You still compile the MOF with the same steps we outlined in the Authoring section earlier in this article. Pull Servers require MOFs to use checksums to determine when an MOF has changed for a given target node. They also require the MOF filename to be the ConfigurationID of the target node. A unique identifier is much easier to work with than the names a given target node is using. This is typically done only once per server, and the identifier is kept for the lifetime of the server. It is usually decided when configuring the LCM for that target node.

The first step is to take the compiled MOF and rename it with the unique identifier we assigned when we created the configuration for it.
In this example, we will assign a newly created GUID as shown: PS C:\Examples> Rename-Item -Path .\TestExample\localhost.mof -NewName "$([GUID]::NewGuid().ToString()).mof" PS C:\Examples> ls .\TestExample\ Directory: C:\TestExample Mode LastWriteTime Length Name ---- ------------- ------ ---- -a--- 5/20/2015 10:52 PM 1136 b1948d2b-2b80-4c4a-9913-ae6dcbf23a4d.mof The next step is to run the New-DSCCheckSum Cmdlet to generate a checksum for the MOF files in the TestExample folder as shown: PS C:\Examples> New-DSCCheckSum -ConfigurationPath .\TestExample\ -OutPath .\TestExample\ -Verbose VERBOSE: Create checksum file 'C:\Examples\TestExample\\b1948d2b-2b80-4c4a-9913-ae6dcbf23a4d.mof.checksum' PS C:\Examples> ls .\TestExample\ Directory: C:\TestExample Mode LastWriteTime Length Name ---- ------------- ------ ---- -a--- 5/21/2015 10:52 PM 1136 b1948d2b-2b80-4c4a-9913-ae6dcbf23a4d.mof -a--- 5/22/2015 10:52 PM 64 b1948d2b-2b80-4c4a-9913-ae6dcbf23a4d.mof.checksum PS C:\Examples> gc .\TestExample\b1948d2b-2b80-4c4a-9913-ae6dcbf23a4d.mof.checksum A62701D45833CEB2A39FE1917B527D983329CA8698951DC094335E6654FD37A6 The next step is to copy the checksum file and MOF file to the Pull Server MOF directory. This is typically located in C:\Program Files\WindowsPowerShell\DscService\Configuration path on the Pull Server, although it's configurable so it might have been changed in your deployment. Adding DSC Resources to a Pull Server In push mode, you can place a DSC Resource module folder in a PowerShell module path (any of the paths defined in the $env:PSModulePath path) and things will work out fine. A Pull Server requires that DSC Resources be placed in a specific directory and compressed into a ZIP format with a specific name in order for the Pull Server to recognize and be able to transfer the resource to the target node. Here is our example DSC Resource in a folder on our system. We are using the experimental xPSDesiredStateConfiguration resource provided by Microsoft, but these steps can apply to your custom resources, as well as shown in the following command: PS C:\Examples> ls . Directory: C:\Examples Mode LastWriteTime Length Name ---- ------------- ------ ---- d---- 5/20/2015 10:52 PM xPSDesiredStateConfiguration The first step is to compress the DSC Resource folder into a ZIP file. You may be tempted to use the .NET System.IO.Compression.Zip file classes to compress the folder to a ZIP file. In DSC v4, you cannot use these classes, as they create a ZIP file that the LCM cannot read correctly. This is a fault in the DSC code that reads the archive files However, in DSC v5, they have fixed this so that you can still use System.IO.Compression.zip file. A potentially easier option in PowerShell v5 is to use the built-in Compress-Archive Cmdlet to accomplish this. The only way to make a ZIP file for DSC v4 is either to use the built-in compression facility in Windows Explorer, a third-party utility like 7zip, or the COM Shell.Application object in a script. PS C:\Examples> ls . Directory: C:\Examples Mode LastWriteTime Length Name ---- ------------- ------ ---- d---- 5/20/2015 10:52 PM xPSDesiredStateConfiguration d---- 5/20/2015 10:52 PM xPSDesiredStateConfiguration.zip Once you have your ZIP file, we rename the file to MODULENAME_#.#.#.#.zip, where MODULENAME is the official name of the module and the #.#.#.# refers to the version of the DSC resource module we are working with. This version is not the version of the DSC Resource inside the module, but the version of the DSC Resource root module. 
You will find the correct version in the top-level psd1 file inside the root directory of the module. Let's have a look at the following example:

PS C:\Examples> ls .

    Directory: C:\Examples

Mode                LastWriteTime     Length Name
----                -------------     ------ ----
d----         5/20/2015  10:52 PM            xPSDesiredStateConfiguration
d----         5/20/2015  10:52 PM            xPSDesiredStateConfiguration_3.2.0.0.zip

As with MOF files, DSC needs a checksum in order to identify each DSC Resource. The next step is to run the New-DscCheckSum Cmdlet against our ZIP file and receive our checksum:

PS C:\Examples> New-DSCCheckSum -ConfigurationPath .\xPSDesiredStateConfiguration_3.2.0.0.zip -OutPath . -Verbose
VERBOSE: Create checksum file 'C:\Examples\xPSDesiredStateConfiguration_3.2.0.0.zip.checksum'
PS C:\Examples> ls .

    Directory: C:\Examples

Mode                LastWriteTime     Length Name
----                -------------     ------ ----
-a---         5/21/2015  10:52 PM       1136 xPSDesiredStateConfiguration_3.2.0.0.zip
-a---         5/22/2015  10:52 PM         64 xPSDesiredStateConfiguration_3.2.0.0.zip.checksum

The final step is to copy the ZIP file and the checksum file up to the C:\Program Files\WindowsPowerShell\DscService\Modules path on the Pull Server. Once completed, the previous steps provide a working Pull Server. You configure your target nodes using the steps outlined in the previous section on the LCM, and your target nodes will start pulling configurations.

Deployment considerations

By this point, we have covered the architecture and the two different ways to deploy DSC in your environment. When choosing the deployment method, you should be aware of some additional considerations and observations that have come through experience using DSC in production.

General observations

You will generally use DSC push mode deployments to test new configurations or perform one-off configurations of servers. While you can use push mode against several servers at once, you lose the benefits of the Pull Server. Setting up a DSC Pull Server is the best option for a large set of nodes or for environments that frequently build and destroy servers. It does have a significant learning curve in setting up the DSC Resources and MOF files, but once done, it is reliably repeatable without additional effort.

When using Pull Servers, each target node is assigned a configuration ID that is required to be unique and is expected to stay with that server for its lifetime. There is currently no built-in tracking of configuration IDs inside DSC or in the Pull Server, and there are no checks to avoid duplicate collisions. This is by design, as it allows greater deployment flexibility. You can choose to have a unique ID for every target node in your environment or have one single ID for a group of systems. An example of sharing a configuration ID is a web farm that creates and destroys VMs based on demand during certain time periods. Since they all have the same configuration ID, they all get the same configuration with significantly less work on your part (not having to make multiple MOF files and maintain lists of IDs for temporary nodes). Maintaining a list of used IDs and which targets they refer to is currently up to you. Some have used the Active Directory ID of the target node as an identifier. This is awkward to support, as we often run configurations on target nodes before they are joined to an AD domain. We recommend using a GUID as an identifier and keeping it with the configuration data files where the node identifiers are kept: in a source control system.
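Since the configuration ID is decided when configuring the LCM, here is a minimal DSC v4-style sketch of an LCM meta-configuration for a pull client. The GUID (reusing the one generated earlier) and the Pull Server URL are placeholders, and your environment may need different refresh settings:

# A sketch only: the ConfigurationID and ServerUrl values below are placeholders
Configuration ConfigurePullClient {
    Node "localhost" {
        LocalConfigurationManager {
            ConfigurationID           = "b1948d2b-2b80-4c4a-9913-ae6dcbf23a4d"
            RefreshMode               = "Pull"
            DownloadManagerName       = "WebDownloadManager"
            DownloadManagerCustomData = @{
                ServerUrl               = "http://pullserver:8080/PSDSCPullServer.svc"
                AllowUnsecureConnection = "True"
            }
        }
    }
}
ConfigurePullClient
Set-DscLocalConfigurationManager -Path .\ConfigurePullClient -Verbose

Keeping meta-configuration scripts such as this in source control alongside the configuration data makes it easy to track which GUID belongs to which target.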
LCM gotchas

The LCM service runs under the system account and so has high-privilege access to the system. However, the system account is not a user account, which causes trouble when you assume that DSC can perform an action just because you were able to perform it yourself a moment ago. Common gotchas include accessing network file shares or accessing parts of the system that require user credentials. These will typically fail with a generic Access Denied error, which will most likely lead you down the wrong path when troubleshooting. Unfortunately, the only way to know this beforehand is to hope that the DSC Resource or application you are executing documents the permissions it needs to run. Some DSC Resources have parameters that accept a PSCredential object for this very purpose, so be sure to inspect examples or the DSC Resource itself to find out how best to handle access permissions. Trial and error will prove things one way or the other for you here.

As described in the execution phase of The General workflow, frequent executions often cause problems when you first deploy using push or pull, try out new configurations, or troubleshoot existing ones. If a configuration run was interrupted or stopped mid-run, a pending.mof file is often left in place. This signals to DSC that a configuration is either in flight or that something else occurred and it should not run. When you try to run another configuration, you get an error saying that a configuration is currently in flight. To solve this, you need to delete the pending.mof file before running the Update-DscConfiguration or Start-DscConfiguration -Force Cmdlet.

Deployment mode differences

When used with a DSC Pull Server, the LCM does a lot of work for you. It will pull down the DSC Resources required by your DSC configuration file automatically, instead of you having to copy them there yourself. It will also report the status back to the Pull Server, so you can see the status of all your targets in one place. When used in push mode, the LCM still does all the work of applying your DSC configuration file for you, but it does not do as much as it does in pull mode: it does not automatically download dependent DSC Resources for you.

Summary

In this article, we have identified the three phases of DSC use and the two different deployment models. We then covered how the phases and models work together to comprise the architecture of DSC. And lastly, we covered how the LCM and Pull Server work separately and together.

Resources for Article:

Further resources on this subject:

Working with PowerShell [article]
Installing/upgrading PowerShell [article]
Managing Files, Folders, and Registry Items Using PowerShell [article]

Managing Pools for Desktops

Packt
07 Oct 2015
14 min read
In this article by Andrew Alloway, the author of VMware Horizon View High Availability, we will review strategies for providing High Availability for various types of VMware Horizon View desktop pools.

(For more resources related to this topic, see here.)

Overview of pools

VMware Horizon View provides administrators with the ability to automatically provision and manage pools of desktops. As part of provisioning desktops, we must also consider how we will continue service for individual users in the event of a host or storage failure. Generally, High Availability requirements fall into two categories for each pool: stateless desktops, where the user information is not stored on the VM between sessions, and stateful desktops, where the user information is stored on the desktop between sessions.

Stateless desktops

In a stateless configuration, we are not required to store data on the virtual desktops between user sessions. This allows us to use local storage instead of shared storage for our HA strategies, as we can tolerate host failures without the use of a shared disk. We can achieve a stateless desktop configuration using roaming profiles and/or View Persona profiles. This can greatly reduce cost and maintenance requirements for View deployments. Stateless desktops are typical in the following environments:

Task workers: A group of workers whose tasks are well known and who all share a common set of core applications. Task workers can use roaming profiles to maintain data between user sessions. In a multi-shift environment, having stateless desktops means we only need to provision as many desktops as will be in use at any one time. Task worker setups are typically found in data entry, call centers, finance (accounts payable and accounts receivable), classrooms (in some situations), laboratories, and healthcare terminals.

Kiosk users: A group of users that do not log in; logins are typically automatic or without credentials. Kiosk users are typically untrusted users, so kiosk VMs should be locked down and restricted to only the core applications that need to be run. Kiosks are typically refreshed after logoff or at scheduled times after hours. Kiosks can be found in situations such as airline check-in stations, library terminals, classrooms (in some situations), customer service terminals, customer self-serve, and digital signage.

Stateful desktops

Stateful desktops have some advantages, such as reduced IOPS and higher disk performance, due to the ability to choose thick provisioning. Stateful desktops are desktops that require user data to be stored on the VM or desktop host between user sessions. These machines are typically required by users who will extensively customize their desktop in non-trivial ways, require complex or unique applications that are not shared by a large group, or require the ability to modify their VM. Stateful desktops are typically used by users who require the ability to modify the installed applications, developers, IT administrators, unique or specialized users, department managers, and VIP staff/managers.

Dedicated pools

Dedicated pools are View desktops provisioned using thin or thick provisioning. Dedicated pools are typically used for stateful desktop deployments. Each desktop can be provisioned with a dedicated persistent disk used for storing the user profile and data. Once assigned a desktop, that user will always log into the same desktop, ensuring that their profile is kept constant.
During OS refreshes, rebalances, and recomposes, the OS disk is reverted back to the base image. Dedicated pools with persistent disks offer simplicity for managing desktops, as minimal profile management takes place; it is all managed by the View Composer/View Connection Server. It also ensures that applications that store profile data will almost always be able to retrieve that data on the next login, meaning that the administrator doesn't have to track down applications that incorrectly store data outside the roaming profile folder.

HA considerations for dedicated pools

Dedicated pools unfortunately have very difficult HA requirements. Storing the user profile with the VM means that the VM has to be stored and maintained in an HA-aware fashion. This almost always results in a shared disk solution being required for dedicated pools. In the event of a host outage, other hosts connected to the same storage can start up the VM. For shared storage, we can use NFS, iSCSI, Fibre Channel, or VMware Virtual SAN storage. Consider investing in storage systems with primary and backup controllers, as we will be dependent on the disk controllers always being available. Backups are also a must with this system, as there are very few recovery options in the event of a storage array failure.

Floating Pools

Floating pools are pools of desktops where any user can be assigned to any desktop in the pool upon login. Floating pools are generally used for stateless desktop deployments. They can be used with roaming profiles or View Persona to provide a consistent user experience on login. Since floating pool desktops are treated as disposable VMs, we open up additional options for HA. Floating pool desktops are given two local disks: the OS disk, which is a replica of the assigned base VM, and the disposable disk, where the page file, hibernation file, and temp drive are located. When floating pools are refreshed, recomposed, or rebalanced, all changes made to the desktop by the users are lost. This is due to the disposable disk being discarded between refreshes and the OS disk being reverted back to the base image. As such, any session information, such as the profile, temp directory, and software changes, is lost between refreshes. Refreshes can be scheduled to occur after logoff or after every X days, or they can be triggered manually.

HA considerations for floating pools

Floating pools can be protected in several ways, depending on the environment. Since floating pools can be deployed on local storage, we can protect against a host failure by provisioning the floating pool VMs on multiple separate hosts. In the event of a host failure, the remaining virtual desktops will be used to log users in; if there is free capacity in the cluster, more virtual desktops will be provisioned on other hosts. For environments with shared storage, floating pools can still be deployed on the shared storage, but it is a good idea to have a secondary shared storage device or a highly available storage device. In the event of a storage failure, the VMs can be started on the secondary storage device. VMware Virtual SAN is inherently HA-safe, and there is no need for a secondary datastore when using Virtual SAN.

Many floating pool environments will utilize a profile management solution such as roaming profiles or View Persona Management. In these situations, it is essential to set up a redundant storage location for View Persona profiles and/or roaming profiles.
In practice, a Windows DFS share is a convenient and easy way to guard profiles against loss in the event of an outage. DFS can be configured to replicate changes made to the profile in real time between hosts. If the Windows DFS servers are provisioned as VMs on shared storage, make sure to create a DRS rule to separate the VMs onto different hosts. Where possible, DFS servers should be stored on separate disk arrays to ensure the data is preserved in the event of a disk array or storage processor failure. For more information regarding Windows DFS, visit https://technet.microsoft.com/en-us/library/jj127250.aspx.

Manual pools

Manual pools are custom dedicated desktops for each user. A VM is manually built for each user who is using the manual pool. Manual pools are stateful pools that generally do not utilize profile management technologies such as View Persona or roaming profiles. Like dedicated pools, once a user is assigned to a VM, they will always log into the same VM; as such, the HA requirements for manual pools are very similar to those of dedicated pools. Manual desktops can be configured in almost any manner desired by the administrator, and there is no requirement for more than one disk to be attached to a manual pool desktop.

Manual pools can also be configured to use physical hardware as the desktop, such as blade servers, desktop computers, or even laptops. In this situation, there are limited high availability options without investing in exotic and expensive hardware. As a best practice, the physical hosts should be built with redundant power supplies, ECC RAM, and mirrored hard disks, depending on budget and HA requirements. There should also be a good backup strategy for managing physical hosts connected to manual pools.

HA considerations for manual pools

Manual pools, like dedicated pools, have difficult HA requirements. Storing the user profile with the VM means that the VM has to be stored and maintained in an HA-aware fashion. This almost always results in a shared disk solution being required for manual pools. In the event of a host outage, other hosts connected to the same storage can start up the VM. For shared storage, we can use NFS, iSCSI, Fibre Channel, or VMware VSAN storage. Consider investing in storage systems with primary and backup controllers, as we will be dependent on the disk controllers always being available. Backups are also a must with this system, as there are very few recovery options in the event of a storage array failure. VSAN deployments are inherently HA-safe and are excellent candidates for manual pool storage.

Manual pools, given their static nature, also have the option of using replication technology to back up the VMs onto another disk. You can use VMware vSphere Replication for automatic replication, or use one of the variety of storage replication solutions offered by storage and backup vendors. In some cases, it may be possible to use Fault Tolerance on the virtual desktops for true high availability; note that this would limit the individual VMs to a single vCPU, which may be undesirable.

Remote Desktop services pools

Remote Desktop Services pools (RDS pools) are pools where the remote session or application is hosted on a Windows Remote Desktop Server. The application or remote session runs under the user's credentials. Usually, all the user data is stored locally on the Remote Desktop Server, but it can also be stored remotely using roaming profiles or View Persona profiles. Folder Redirection to a central network location is also used with RDS pools.
Typical uses for Remote Desktop Services include migrating users off legacy RDS environments, hosting applications, and providing access to troublesome applications or applications with large memory footprints. The Windows Remote Desktop Server can be either a VM or a standalone physical host. It can be combined with Windows clustering technology to provide scalability and high availability. You can also deploy a load balancer solution to manage connections between multiple Windows Remote Desktop Servers.

Remote Desktop services pool HA considerations

Remote Desktop Services HA revolves around protecting individual RDS VMs or provisioning a cluster of RDS servers. When a single VM is deployed with RDS, it is generally best to use vSphere HA and clustering features to protect the VM. If the RDS resources are larger than is practical for a VM, then we must focus on protecting the individual host or clustering multiple hosts. When the Windows Remote Desktop Server is deployed as a VM, the following options are available:

Protect the VM with VMware HA, using shared storage: This allows vCenter to fail over the VM to another host in the event of a host failure. vSphere will be responsible for starting the VM on another host. The VM will resume from a crashed state.

Replicate the virtual machine to separate disks on separate hosts using VMware Virtual SAN: Same as above, but in this case the VM has been replicated to another host using Virtual SAN technology. The remote VM will be started up from a crashed state, using the last consistent hard drive image that was replicated.

Use replication technologies such as vSphere Replication: The VM will be periodically synchronized to a remote host. In the event of a host failure, we can manually activate the remotely synchronized VM.

Use a vendor's storage-level replication: In this case, we allow our storage vendor's replication technology to provide a redundant backup. This protects us in the event of a storage or host failure. The failover can be automated or manual; consult your storage vendor for more information.

Protect the VM using backup technologies: This provides redundancy in the sense that we won't lose the VM if it fails. Unfortunately, you are at the mercy of your restore process to bring the VM back to life. The VM will resume from a crashed state. Always keep backups of production servers.

For RDS servers running on a dedicated physical server, we could utilize the following:

Redundant power supplies: Redundant power supplies will keep the server going while a PSU is being replaced or becomes defective. It is also a good idea to have two separate power sources for each power supply. Simple things such as a power bar going faulty or tripping a breaker could bring down the server if there are not two independent power sources.

Uninterruptible Power Supply (UPS): Battery backups are always a must for production-level equipment. Make sure to scale the UPS to provide adequate power and duration for your environment.

Redundant network interfaces: In rare circumstances, a NIC can go bad or a cable can be damaged. In this case, redundant NICs will prevent a server outage. Remember that to protect against a switch outage, we should plug the NICs into separate switches.

Mirrored or redundant disks: Hard drives are among the most common components to fail in computers. Mirrored hard drives or RAID configurations are a must for production-level equipment.

Two or more hosts: Clustering physical servers will ensure that host failures won't cause downtime.
Consider multi-site configurations for even more redundancy.

Shared strategies for VMs and hardware:

Provide High Availability to the RDS servers using Microsoft Network Load Balancing (NLB): Microsoft Network Load Balancing can provide load balancing to the RDS servers directly. In this situation, the clients connect to a single IP managed by the NLB, and each connection is assigned to one of the servers.

Provide High Availability using a load balancer to manage sessions between RDS servers: A hardware or software load balancer can be used instead of Microsoft Network Load Balancing. Load balancer vendors provide a wide variety of capabilities and features for their load balancers; consult your load balancer vendor for best practices.

Use DNS round robin to alternate between RDS hosts: This is one of the most cost-effective load balancing methods. It has the drawback of not being able to balance the load or to direct clients away from failed hosts. Updating DNS may delay adding new capacity to the cluster or delay removing a failed host from it.

Remote Desktop Connection Broker with High Availability: We can provide RDS failover using the Connection Broker feature of our RDS server. For more information regarding Remote Desktop Connection Broker with High Availability, see https://technet.microsoft.com/en-us/library/ff686148%28WS.10%29.aspx.

Here is an example topology using physical or virtual Microsoft RDS servers: we use a load balancing technology for the View Connection Servers, as described in the previous chapter, and then connect to the RDS servers via either a load balancer, DNS round robin, or a cluster IP.

Summary

In this article, we covered the concept of stateful and stateless desktops and the consequences and techniques for supporting each in a highly available environment.

Resources for Article:

Further resources on this subject:

Working with Virtual Machines [article]
Storage Scalability [article]
Upgrading VMware Virtual Infrastructure Setups [article]

Monitoring OpenStack Networks

Packt
07 Oct 2015
6 min read
In this article by Chandan Dutta Chowdhury and Sriram Subramanian, authors of the book OpenStack Networking Cookbook, we will explore various means to monitor network resource utilization using Ceilometer. We will cover the following topics:

Virtual Machine bandwidth monitoring
L3 bandwidth monitoring

(For more resources related to this topic, see here.)

Introduction

Due to the dynamic nature of virtual infrastructure and multiple users sharing the same cloud platform, the OpenStack administrator needs to track how the tenants use the resources. The data can also help in capacity planning by giving an estimate of the capacity of the physical devices and the trends of resource usage. The OpenStack Ceilometer project provides you with a telemetry service. It can measure the usage of resources by collecting statistics across the various OpenStack components. The usage data is collected over the message bus or by polling the various components. OpenStack Neutron provides Ceilometer with the statistics related to virtual networks. The following figure shows you how Ceilometer interacts with the Neutron and Nova services:

To implement these recipes, we will use an OpenStack setup as described in the following screenshot: this setup has two compute nodes and one node for the controller and networking services.

Virtual Machine bandwidth monitoring

OpenStack Ceilometer collects the resource utilization of virtual machines by running a Ceilometer compute agent on all the compute nodes. These agents collect the various metrics related to each virtual machine running on the compute node. The collected data is periodically sent to the Ceilometer collector over the message bus. In this recipe, we will learn how to use the Ceilometer client to check the bandwidth utilization of a virtual machine.

Getting ready

For this recipe, you will need the following information:

The SSH login credentials for a node where the OpenStack client packages are installed
A shell RC file that initializes the environment variables for the CLI

How to do it…

The following steps will show you how to determine the bandwidth utilization of a virtual machine:

1. Using the appropriate credentials, SSH into the OpenStack node installed with the OpenStack client packages.
2. Source the shell RC file to initialize the environment variables required for the CLI commands.
3. Use the nova list command to find the ID of the virtual machine instance that is to be monitored.
4. Use the ceilometer resource-list | grep <virtual-machine-id> command to find the resource ID of the network port associated with the virtual machine. Note down the resource ID of the virtual port for use in the later commands. The virtual port resource ID is a combination of the virtual machine ID and the name of the tap interface for the virtual port; it is named in the form instance-<virtual-machine-id>-<tap-interface-name>.
5. Use ceilometer meter-list -q resource=<virtual-port-resource-id> to find the meters associated with the network port on the virtual machine.
6. Next, use ceilometer statistics -m <meter-name> -q resource=<virtual-port-resource-id> to view the network usage statistics, using the meters that we discovered in the last step.

Ceilometer stores the port bandwidth data for the incoming and outgoing packets and bytes, and their rates.
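Putting the preceding steps together, a typical session looks like the following sketch. The RC filename, instance ID, and port resource ID are placeholders for the values from your own setup, and network.outgoing.bytes is just one example meter; network.incoming.bytes and the corresponding packet and rate meters can be queried the same way:

source <shell-rc-file>                                # initialize the CLI environment variables
nova list                                             # note the ID of the instance to monitor
ceilometer resource-list | grep <instance-id>         # note the virtual port resource ID
ceilometer meter-list -q resource=<port-resource-id>  # list the meters for that port
ceilometer statistics -m network.outgoing.bytes -q resource=<port-resource-id>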
How it works…

The OpenStack Ceilometer compute agent collects the statistics related to the network ports connected to the virtual machines and posts them on the message bus. These statistics are collected by the Ceilometer collector daemon. The Ceilometer client can be used to query a meter and filter the statistical data based on the resource ID.

L3 bandwidth monitoring

OpenStack Neutron provides you with metering commands to enable Layer 3 (L3) traffic monitoring. The metering commands create a label that can hold a list of packet-matching rules. Neutron counts any L3 packet that matches these rules and associates it with the metering label. In this recipe, we will learn how to use the L3 traffic monitoring commands of Neutron to enable packet counting.

Getting ready

For this recipe, we will use a virtual machine that is connected to a network that, in turn, is connected to a router. The following figure describes the topology. We will use a network called private with a CIDR of 10.10.10.0/24. For this recipe, you will need the following information:

The SSH login credentials for a node where the OpenStack client packages are installed
A shell RC file that initializes the environment variables for the CLI
The name of the L3 metering label
The CIDR for which the traffic needs to be measured

How to do it…

The following steps will show you how to enable monitoring of traffic to or from any L3 network:

1. Using the appropriate credentials, SSH into the OpenStack node installed with the OpenStack client packages.
2. Source the shell RC file to initialize the environment variables required for the CLI commands.
3. Use the neutron meter-label-create command to create a metering label. Note the label ID, as this will be used later with the Ceilometer commands.
4. Use the neutron meter-label-rule-create command to create a rule that associates a network address with the label that we created in the last step. In our case, we will count any packet that reaches the gateway from the CIDR 10.10.10.0/24 network to which the virtual machine is connected.
5. Use the ceilometer meter-list command with the resource filter to find the meters associated with the label resource.
6. Use the ceilometer statistics command to view the number of packets matching the metering label.

Packet counting is now enabled, and the bandwidth statistics can be viewed using Ceilometer.

How it works…

The Neutron monitoring agent implements the packet counting meter in the L3 router. It uses iptables to implement a packet counter. The Neutron agent collects the counter statistics periodically and posts them on the message bus, where they are collected by the Ceilometer collector daemon.

Summary

In this article, we learned about ways to monitor the usage of virtual and physical networking resources. The resource utilization data can be used to bill the users of a public cloud and to debug infrastructure-related problems.

Resources for Article:

Further resources on this subject:

Using the OpenStack Dashboard [Article]
Installing Red Hat CloudForms on Red Hat OpenStack [Article]
Securing OpenStack Networking [Article]

Integrating Elasticsearch with the Hadoop ecosystem

Packt
07 Oct 2015
14 min read
In this article by Vishal Shukla, author of the book Elasticsearch for Hadoop, we will take a look at how ES-Hadoop can integrate with Pig and Spark with ease. Elasticsearch is great at getting insights into the indexed data. The Hadoop ecosystem does a great job in making Hadoop easily usable for different users by providing comfortable interfaces; some examples are Hive and Pig. Apart from these, Hadoop integrates well with other computing engines and platforms, such as Spark and Cascading.

(For more resources related to this topic, see here.)

Pigging out Elasticsearch

For many use cases, Pig is one of the easiest ways to fiddle around with data in the Hadoop ecosystem. Pig wins when it comes to ease of use and simple syntax for designing data flow pipelines without getting into complex programming. Assuming that you know Pig, we will cover how to move data to and from Elasticsearch. If you don't know Pig yet, never mind: you can still carry on with the steps, and by the end of the article, you will at least know how to use Pig to perform data ingestion and reading with Elasticsearch.

Setting up Apache Pig for Elasticsearch

Let's start by setting up Apache Pig. At the time of writing this article, the latest Pig version available is 0.15.0. You can use the following steps to set up that version:

First, download the Pig distribution using the following command:
$ sudo wget -O /usr/local/pig.tar.gz http://mirrors.sonic.net/apache/pig/pig-0.15.0/pig-0.15.0.tar.gz

Then, extract Pig to the desired location and rename it to a convenient name:
$ cd /usr/local
$ sudo tar -xvf pig.tar.gz
$ sudo mv pig-0.15.0 pig

Now, export the required environment variables by appending the following two lines to the /home/eshadoop/.bashrc file:
export PIG_HOME=/usr/local/pig
export PATH=$PATH:$PIG_HOME/bin

You can either log out and log in again to see the newly set environment variables, or source the environment configuration with the following command:
$ source ~/.bashrc

Now, start the job history server daemon with the following command:
$ mr-jobhistory-daemon.sh start historyserver

You should see the Pig console with the following command:
$ pig
grunt>

It's easy to forget to start the job history daemon once you restart your machine or VM. You can configure this daemon to run on startup, or you will need to start it manually.

Now, we have Pig up and running. In order to use Pig with Elasticsearch, we must ensure that the ES-Hadoop JAR file is available in the Pig classpath. Let's take the ES-Hadoop JAR file and import it to HDFS using the following steps:

First, download the ES-Hadoop JAR used to develop the examples in this article, as shown in the following command:
$ wget http://central.maven.org/maven2/org/elasticsearch/elasticsearch-hadoop/2.1.1/elasticsearch-hadoop-2.1.1.jar

Then, create a convenient location for the JAR as follows:
$ sudo mkdir /opt/lib

Now, import the JAR to HDFS:
$ hadoop fs -mkdir /lib
$ hadoop fs -put elasticsearch-hadoop-2.1.1.jar /lib/elasticsearch-hadoop-2.1.1.jar

Throughout this article, we will use a crime dataset that is tailored from the open dataset provided at https://data.cityofchicago.org/. This tailored dataset can be downloaded from http://www.packtpub.com/support, where all the code files required for this article are available. Once you have downloaded the dataset, import it to HDFS at /ch07/crimes_dataset.csv.

Importing data to Elasticsearch

Let's import the crime dataset to Elasticsearch using Pig with ES-Hadoop.
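If the dataset is not in HDFS yet, the import step mentioned above can be done with a couple of standard commands; this is only a sketch and assumes the tailored CSV was downloaded to the current directory:

$ hadoop fs -mkdir -p /ch07
$ hadoop fs -put crimes_dataset.csv /ch07/crimes_dataset.csv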
ES-Hadoop provides the EsStorage class as a Pig Storage implementation. In order to use the EsStorage class, you need to have the ES-Hadoop JAR registered with Pig. You can register a JAR located in the local filesystem, HDFS, or another shared filesystem. The REGISTER command registers a JAR file that contains UDFs (user-defined functions) with Pig, as shown in the following code:

grunt> REGISTER hdfs://localhost:9000/lib/elasticsearch-hadoop-2.1.1.jar;

Then, load the CSV data file as a relation with the following code:

grunt> SOURCE = load '/ch07/crimes_dataset.csv' using PigStorage(',') as (id:chararray, caseNumber:chararray, date:datetime, block:chararray, iucr:chararray, primaryType:chararray, description:chararray, location:chararray, arrest:boolean, domestic:boolean, lat:double, lon:double);

This command reads the CSV fields and maps each token in the data to the respective field in the preceding command. The resulting relation, SOURCE, is a Bag data structure that contains multiple tuples.

Now, generate the target Pig relation with a structure that closely matches the target Elasticsearch index mapping, as shown in the following code:

grunt> TARGET = foreach SOURCE generate id, caseNumber, date, block, iucr, primaryType, description, location, arrest, domestic, TOTUPLE(lon, lat) AS geoLocation;

Here, we need a nested object named geoLocation in the target Elasticsearch document. We can achieve this with a tuple that represents the lat and lon fields; TOTUPLE() helps us to create this tuple. We then assign the geoLocation alias to this tuple.

Let's store the TARGET relation to the Elasticsearch index with the following code:

grunt> STORE TARGET INTO 'esh_pig/crimes' USING org.elasticsearch.hadoop.pig.EsStorage('es.http.timeout = 5m', 'es.index.auto.create = true', 'es.mapping.names=arrest:isArrest, domestic:isDomestic', 'es.mapping.id=id');

We can specify the target index and type in which to store the indexed documents. The EsStorage class can accept multiple Elasticsearch configurations: es.mapping.names maps a Pig field name to the Elasticsearch document's field name, and you can use Pig's id field to assign a custom _id value to the Elasticsearch document using the es.mapping.id option. Similarly, you can set the _ttl and _timestamp metadata fields as well.

Pig uses just one reducer in the default configuration. It is recommended to change this behavior to a parallelism that matches the number of shards available, as shown in the following command:

grunt> SET default_parallel 5;

Pig also combines the input splits, irrespective of their size. This makes it efficient for small files by reducing the number of mappers, but it can cause performance issues for large files. You can disable this behavior in the Pig script, as shown in the following command:

grunt> SET pig.splitCombination FALSE;

Executing the preceding commands will create the Elasticsearch index and import the crime data documents. If you observe the created documents in Elasticsearch, you can see that the geoLocation value is an array in the [-87.74274476, 41.87404405] format. This is because, by default, ES-Hadoop ignores the tuple field names and simply converts the tuple to an ordered array.
If you wish to make your geoLocation field look like a key/value-based object with lat and lon keys, you can do so by including the following configuration in EsStorage:

es.mapping.pig.tuple.use.field.names=true

Writing from the JSON source

If you have your input as a well-formed JSON file, you can avoid conversions and transformations and directly pass the JSON documents to Elasticsearch for indexing. You may have the JSON data in Pig as chararray, bytearray, or in any other form that translates to well-formed JSON by calling the toString() method, as shown in the following code:

grunt> JSON_DATA = LOAD '/ch07/crimes.json' USING PigStorage() AS (json:chararray);
grunt> STORE JSON_DATA INTO 'esh_pig/crimes_json' USING org.elasticsearch.hadoop.pig.EsStorage('es.input.json=true');

Type conversions

Take a look at the type mapping of the esh_pig index in Elasticsearch. It maps the geoLocation type to double. This is because Elasticsearch inferred the double type based on the field type we specified in Pig. To map geoLocation to geo_point, you must create the Elasticsearch mapping for it manually before executing the script. Although Elasticsearch provides data type detection based on the type of each field in the incoming document, it is always good to create the type mapping beforehand in Elasticsearch. This is a one-time activity; you can then run the MapReduce, Pig, Hive, Cascading, or Spark jobs multiple times. This will avoid any surprises in the type detection. For your reference, here is a list of some of the Pig and Elasticsearch field types that map to each other (the list doesn't include the absolutely intuitive type mappings):

Pig type        Elasticsearch type
chararray       string
bytearray       binary
tuple           array (default) or object
bag             array
map             object
bigdecimal      not supported
biginteger      not supported

Reading data from Elasticsearch

Reading data from Elasticsearch using Pig is as simple as writing a single command with the Elasticsearch query. Here is a snippet that prints the tuples for crimes related to theft:

grunt> REGISTER hdfs://localhost:9000/lib/elasticsearch-hadoop-2.1.1.jar
grunt> ES = LOAD 'esh_pig/crimes' using org.elasticsearch.hadoop.pig.EsStorage('{"query" : { "term" : { "primaryType" : "theft" } } }');
grunt> dump ES;

Executing the preceding commands will print the tuples to the Pig console.

Giving Spark to Elasticsearch

Spark is a distributed computing system that provides a huge performance boost compared to Hadoop MapReduce. It works on the abstraction of RDDs (Resilient Distributed Datasets), which can be created for any data residing in Hadoop. Without any surprises, ES-Hadoop provides easy integration with Spark by enabling the creation of RDDs from the data in Elasticsearch. Spark's increasing support for integrating with various data sources, such as HDFS, Parquet, Avro, S3, Cassandra, relational databases, and streaming data, makes it special when it comes to data integration. This means that when you use ES-Hadoop with Spark, you can make all these sources integrate with Elasticsearch easily.
Setting up Spark

In order to set up Apache Spark to execute a job, perform the following steps:

First, download the Apache Spark distribution with the following command:
$ sudo wget -O /usr/local/spark.tgz http://www.apache.org/dyn/closer.cgi/spark/spark-1.4.1/spark-1.4.1-bin-hadoop2.4.tgz

Then, extract Spark to the desired location and rename it to a convenient name, as shown in the following commands:
$ cd /usr/local
$ sudo tar -xvf spark.tgz
$ sudo mv spark-1.4.1-bin-hadoop2.4 spark

Importing data to Elasticsearch

To import the crime dataset to Elasticsearch with Spark, let's see how we can write a Spark job. We will continue using Java to write Spark jobs for consistency. Here are the driver program's snippets:

SparkConf conf = new SparkConf().setAppName("esh-spark").setMaster("local[4]");
conf.set("es.index.auto.create", "true");
JavaSparkContext context = new JavaSparkContext(conf);

Set up the SparkConf object to configure the Spark job. As always, you can also set most options (such as es.index.auto.create) and other configurations that we have seen throughout the article. Using this configuration, we created the JavaSparkContext object as follows:

JavaRDD<String> textFile = context.textFile("hdfs://localhost:9000/ch07/crimes_dataset.csv");

Read the crime data CSV file as a JavaRDD. Here, the RDD is still of type String, with each element representing a line:

JavaRDD<Crime> dataSplits = textFile.map(new Function<String, Crime>() {
    @Override
    public Crime call(String line) throws Exception {
        CSVParser parser = CSVParser.parse(line, CSVFormat.RFC4180);
        Crime c = new Crime();
        CSVRecord record = parser.getRecords().get(0);
        c.setId(record.get(0));
        ..
        ..
        String lat = record.get(10);
        String lon = record.get(11);
        Map<String, Double> geoLocation = new HashMap<>();
        geoLocation.put("lat", StringUtils.isEmpty(lat) ? null : Double.parseDouble(lat));
        geoLocation.put("lon", StringUtils.isEmpty(lon) ? null : Double.parseDouble(lon));
        c.setGeoLocation(geoLocation);
        return c;
    }
});

In the preceding snippet, we called the map() method on the JavaRDD to map each input line to a Crime object. Note that we created a simple JavaBean class called Crime that implements the Serializable interface and maps to the Elasticsearch document structure. Using CSVParser, we parsed each field into the Crime object. We mapped the nested geoLocation object by embedding a Map in the Crime object; this map is populated with the lat and lon fields. The map() method returns another JavaRDD that contains the Crime objects, as shown in the following code:

JavaEsSpark.saveToEs(dataSplits, "esh_spark/crimes");

Save the JavaRDD<Crime> to Elasticsearch with the JavaEsSpark class provided by ES-Hadoop. For all the ES-Hadoop integrations, such as Pig, Hive, Cascading, Apache Storm, and Spark, you can use all the standard ES-Hadoop configurations and techniques. This includes dynamic/multi-resource writes with a pattern similar to esh_spark/{primaryType}, and using JSON strings to directly import the data to Elasticsearch as well. To control the Elasticsearch document metadata being indexed, you can use the saveToEsWithMeta() method of JavaEsSpark. You can pass an instance of JavaPairRDD that contains Tuple2<Metadata, Object>, where Metadata represents a map of the key/value pairs of the document metadata fields, such as id, ttl, timestamp, and version.

Using SparkSQL

ES-Hadoop also bridges Elasticsearch with the SparkSQL module. SparkSQL 1.3+ provides the DataFrame abstraction, which represents a collection of Row objects.
We will not discuss the details of DataFrame here. ES-Hadoop lets you persist your DataFrame instance to Elasticsearch transparently. Let's see how we can do this with the following code:

SQLContext sqlContext = new SQLContext(context);
DataFrame df = sqlContext.createDataFrame(dataSplits, Crime.class);

Create an SQLContext instance using the JavaSparkContext instance. Using the SQLContext instance, you can create a DataFrame by calling the createDataFrame() method and passing the existing JavaRDD<T> and Class<T>, where T is a JavaBean class that implements the Serializable interface. Note that the class instance passed in is required to infer a schema for the DataFrame. If you wish to use a non-JavaBean-based RDD, you can create the schema manually. The article's source code contains the implementations of both approaches for your reference. Take a look at the following code:

JavaEsSparkSQL.saveToEs(df, "esh_sparksql/crimes_reflection");

Once you have the DataFrame instance, you can save it to Elasticsearch with the JavaEsSparkSQL class, as shown in the preceding code.

Reading data from Elasticsearch

Here is the snippet of SparkEsReader that finds crimes related to theft:

JavaRDD<Map<String, Object>> esRDD = JavaEsSpark.esRDD(context, "esh_spark/crimes",
        "{\"query\" : { \"term\" : { \"primaryType\" : \"theft\" } } }").values();
for (Map<String, Object> item : esRDD.collect()) {
    System.out.println(item);
}

We used the same JavaEsSpark class to create an RDD with the documents that match the Elasticsearch query.

Using SparkSQL

ES-Hadoop provides an org.elasticsearch.spark.sql data source provider to read data from Elasticsearch using SparkSQL, as shown in the following code:

Map<String, String> options = new HashMap<>();
options.put("pushdown", "true");
options.put("es.nodes", "localhost");

DataFrame df = sqlContext.read()
    .options(options)
    .format("org.elasticsearch.spark.sql")
    .load("esh_sparksql/crimes_reflection");

The preceding code snippet uses the org.elasticsearch.spark.sql data source to load data from Elasticsearch. You can set the pushdown option to true to push the query execution down to Elasticsearch. This greatly increases efficiency, as the query execution is collocated where the data resides, as shown in the following code:

df.registerTempTable("crimes");
DataFrame theftCrimes = sqlContext.sql("SELECT * FROM crimes WHERE primaryType='THEFT'");
for (Row row : theftCrimes.javaRDD().collect()) {
    System.out.println(row);
}

We registered a temporary table for the DataFrame and executed the SQL query on the SQLContext. Note that we need to collect the final results locally in order to print them in the driver class.

Summary

In this article, we looked at various Hadoop ecosystem technologies. We set up Pig with ES-Hadoop and developed a script to interact with Elasticsearch. You also learned how to use ES-Hadoop to integrate Elasticsearch with Spark and empower it with the powerful SQL engine SparkSQL.

Resources for Article:

Further resources on this subject:

Extending ElasticSearch with Scripting [Article]
Elasticsearch Administration [Article]
Downloading and Setting Up ElasticSearch [Article]

Introduction to Data Analysis and Libraries

Packt
07 Oct 2015
13 min read
In this article by Martin Czygan and Phuong Vothihong, the authors of the book Getting Started with Python Data Analysis, we introduce data analysis and its supporting libraries. Data is raw information that can exist in any form, usable or not. We can easily get data everywhere in our lives; for example, the price of gold today is $1.158 per ounce. On its own, this has no meaning beyond describing the gold price; it also shows that data is only useful in context.

(For more resources related to this topic, see here.)

When we relate pieces of data to each other, information appears and allows us to expand our knowledge beyond the range of our senses. When we possess gold price data gathered over time, one piece of information we might derive is that the price has risen continuously from $1.152 to $1.158 over three days. This is useful to someone who tracks gold prices.

Knowledge helps people to create value in their lives and work. It is based on information that is organized, synthesized, or summarized to enhance comprehension, awareness, or understanding. It represents a state or potential for action and decisions. Knowing that, when the gold price increases continuously for three days, it tends to decrease slightly on the next day is useful knowledge. The following figure illustrates the steps from data to knowledge; we call this process the data analysis process, and we will introduce it in the next section.

In this article, we will cover the following topics:

Data analysis and its process
An overview of libraries for data analysis in different programming languages
Common Python data analysis libraries

Data analysis and process

Data is getting bigger and more diversified every day. Therefore, analyzing and processing data to advance human knowledge or to create value are big challenges. To tackle these challenges, you will need domain knowledge and a variety of skills, drawing from areas such as computer science, artificial intelligence (AI) and machine learning (ML), statistics and mathematics, and the knowledge domain, as shown in the following figure.

Let's go through data analysis and its areas of knowledge:

Computer science: We need this knowledge to provide abstractions for efficient data processing. Basic Python programming experience is required. We will introduce the Python libraries used in data analysis.

Artificial intelligence and machine learning: If computer science knowledge helps us to program data analysis tools, artificial intelligence and machine learning help us to model the data and learn from it in order to build smart products.

Statistics and mathematics: We cannot extract useful information from raw data if we do not use statistical techniques or mathematical functions.

Knowledge domain: Besides technology and general techniques, it is important to have an insight into the specific domain. What do the data fields mean? What data do we need to collect?

Based on this expertise, we explore and analyze raw data by applying the above techniques, step by step. Data analysis is a process composed of the following steps:

Data requirements: We have to define what kind of data will be collected, based on the requirements or problem analysis. For example, if we want to detect a user's behavior while reading news on the internet, we should be aware of visited article links, date and time, article categories, and the user's time spent on different pages.

Data collection: Data can be collected from a variety of sources: mobile devices, personal computers, cameras, or recording devices.
It may also be obtained in different ways: through communication, events, and interactions between person and person, person and device, or device and device. Data appears whenever and wherever in the world. The problem is: how can we find and gather it to solve our problem? This is the mission of this step.

Data processing: Data that is initially obtained must be processed or organized for analysis. This process is performance-sensitive: how fast can we create, insert, update, or query data? When building a real product that has to process big data, we should consider this step carefully. What kind of database should we use to store the data? What kind of data structure is suitable for our purposes, such as analysis, statistics, or visualization?

Data cleaning: After being processed and organized, the data may still contain duplicates or errors. Therefore, we need a cleaning step to reduce those situations and increase the quality of the results in the following steps. Common tasks include record matching, deduplication, and column segmentation. Depending on the type of data, we can apply several types of data cleaning. For example, a user's history on a news website might contain a lot of duplicate rows, because the user might have refreshed certain pages many times. For our specific issue, these rows might not carry any meaning when we explore the user's behavior, so we should remove them before saving the data to our database. Another situation we may encounter is click fraud on news, where someone just wants to improve their website ranking or sabotage the website. In this case, the data will not help us to explore a user's behavior. We can use thresholds to check whether a page-visit event comes from a real person or from malicious software.

Exploratory data analysis: Now, we can start to analyze the data through a variety of techniques referred to as exploratory data analysis. We may detect additional problems for data cleaning or discover requests for further data. Therefore, these steps may be iterative and repeated throughout the whole data analysis process. Data visualization techniques are also used to examine the data in graphs or charts. Visualization often facilitates the understanding of datasets, especially if they are large or high-dimensional.

Modelling and algorithms: A lot of mathematical formulas and algorithms may be applied to detect or predict useful knowledge from the raw data. For example, we can use similarity measures to cluster users who have exhibited similar news reading behavior and recommend articles of interest to them next time. Alternatively, we can detect users' gender based on their news reading behavior by applying classification models such as Support Vector Machines (SVM) or linear regression. Depending on the problem, we may use different algorithms to get an acceptable result. It can take a lot of time to evaluate the accuracy of the algorithms and to choose the best one to implement for a certain product.

Data product: The goal of this step is to build data products that receive data as input and generate output according to the problem requirements. We will apply computer science knowledge to implement our selected algorithms, as well as to manage the data storage.

Overview of libraries in data analysis

There are numerous data analysis libraries that help us to process and analyze data. They use different programming languages and have different advantages as well as disadvantages when solving various data analysis problems.
Now, we will introduce some common libraries that may be useful for you. They should give you an overview of the libraries in the field; however, the rest of this article focuses on Python-based libraries. Some of the libraries that use the Java language for data analysis are as follows:

Weka: This is the library that I first got familiar with when I learned about data analysis. It has a graphical user interface that allows you to run experiments on a small dataset. This is great if you want to get a feel for what is possible in the data processing space. However, if you build a complex product, I think it is not the best choice because of its performance: sketchy API design, non-optimal algorithms, and little documentation (http://www.cs.waikato.ac.nz/ml/weka/).

Mallet: This is another Java library that is used for statistical natural language processing, document classification, clustering, topic modelling, information extraction, and other machine learning applications on text. There is an add-on package for Mallet, called GRMM, that contains support for inference in general graphical models and for training Conditional Random Fields (CRFs) with arbitrary graphical structure. In my experience, the library's performance as well as its algorithms are better than Weka's. However, its only focus is on text processing problems. The reference page is at http://mallet.cs.umass.edu/.

Mahout: This is Apache's machine learning framework built on top of Hadoop; its goal is to build a scalable machine learning library. It looks promising, but comes with all the baggage and overhead of Hadoop. The homepage is at http://mahout.apache.org/.

Spark: This is a relatively new Apache project, supposedly up to a hundred times faster than Hadoop. It is also a scalable library that consists of common machine learning algorithms and utilities. Development can be done in Python as well as in any JVM language. The reference page is at https://spark.apache.org/docs/1.5.0/mllib-guide.html.

Here are a few libraries that are implemented in C++:

Vowpal Wabbit: This library is a fast out-of-core learning system sponsored by Microsoft Research and, previously, Yahoo! Research. It has been used to learn a tera-feature (10^12 features) dataset on 1,000 nodes in one hour. More information can be found in the publication at http://arxiv.org/abs/1110.4198.

MultiBoost: This package is a multiclass, multilabel, and multitask classification boosting software implemented in C++. If you use this software, you should refer to the paper published in 2012 in the Journal of Machine Learning Research: MultiBoost: A Multi-purpose Boosting Package, by D. Benbouzid, R. Busa-Fekete, N. Casagrande, F.-D. Collin, and B. Kégl.

MLpack: This is also a C++ machine learning library, developed by the Fundamental Algorithmic and Statistical Tools Laboratory (FASTLab) at Georgia Tech. It focuses on scalability, speed, and ease of use, and was presented at the BigLearning workshop of NIPS 2011. Its homepage is at http://www.mlpack.org/about.html.

Caffe: The last C++ library we want to mention is Caffe. This is a deep learning framework made with expression, speed, and modularity in mind. It is developed by the Berkeley Vision and Learning Center (BVLC) and community contributors. You can find more information about it at http://caffe.berkeleyvision.org/.

Other libraries for data processing and analysis are as follows:

Statsmodels: This is a great Python library for statistical modelling and is mainly used for predictive and exploratory analysis.
Modular toolkit for data processing (MDP): This is a collection of supervised and unsupervised learning algorithms and other data processing units that can be combined into data processing sequences and more complex feed-forward network architectures (http://mdp-toolkit.sourceforge.net/index.html).

Orange: This is an open source data visualization and analysis tool for novices and experts. It is packed with features for data analysis and has add-ons for bioinformatics and text mining. It contains an implementation of self-organizing maps, which sets it apart from the other projects as well (http://orange.biolab.si/).

Mirador: This is a tool for the visual exploration of complex datasets that supports Mac and Windows. It enables users to discover correlation patterns and derive new hypotheses from data.

RapidMiner: This is another GUI-based tool for data mining, machine learning, and predictive analysis (https://rapidminer.com/).

Theano: This bridges the gap between Python and lower-level languages. Theano gives very significant performance gains, particularly for large matrix operations, and is therefore a good choice for deep learning models. However, it is not easy to debug because of the additional compilation layer.

Natural Language Toolkit (NLTK): This is written in Python and has very unique and salient features.

I could not list all the libraries for data analysis here; however, the libraries above are enough to keep you busy for a long time learning and building data analysis applications.

Python libraries in data analysis

Python is a multi-platform, general-purpose programming language that can run on Windows, Linux/Unix, and Mac OS X, and has been ported to the Java and .NET virtual machines as well. It has a powerful standard library. In addition, it has many libraries for data analysis: Pylearn2, Hebel, Pybrain, Pattern, MontePython, and MILK. We will cover some common Python data analysis libraries: NumPy, Pandas, Matplotlib, PyMongo, and scikit-learn. Now, to help you get started, I will briefly present an overview of each library for those who are less familiar with the scientific Python stack.

NumPy

One of the fundamental packages used for scientific computing with Python is NumPy. Among other things, it contains the following:

A powerful N-dimensional array object
Sophisticated (broadcasting) functions for performing array computations
Tools for integrating C/C++ and Fortran code
Useful linear algebra operations, Fourier transforms, and random number capabilities

Besides this, it can also be used as an efficient multidimensional container of generic data. Arbitrary data types can be defined and integrated with a wide variety of databases.

Pandas

Pandas is a Python package that supports rich data structures and functions for analyzing data, and is developed by the PyData Development Team. It is focused on the improvement of Python's data libraries.
Pandas consists of the following things:

A set of labelled array data structures, the primary of which are Series, DataFrame, and Panel
Index objects enabling both simple axis indexing and multilevel/hierarchical axis indexing
An integrated group-by engine for aggregating and transforming datasets
Date range generation and custom date offsets
Input/output tools that load and save data from flat files or the PyTables/HDF5 format
Memory-efficient versions of the standard data structures
Moving window statistics, and static and moving window linear/panel regression

Because of these features, Pandas is an ideal tool for systems that need complex data structures or high-performance time series functions, such as financial data analysis applications.

Matplotlib

Matplotlib is the single most used Python package for 2D graphics. It provides both a very quick way to visualize data from Python and publication-quality figures in many formats: line plots, contour plots, scatter plots, or Basemap plots. It comes with a set of default settings but allows customizing all kinds of properties; we can easily create our charts with the defaults for almost every property in Matplotlib.

PyMongo

MongoDB is a type of NoSQL database. It is highly scalable, robust, and perfect to work with for JavaScript-based web applications, because we can store data as JSON documents and use flexible schemas. PyMongo is a Python distribution containing tools for working with MongoDB. Many tools have also been written for working with PyMongo to add more features, such as MongoKit, Humongolus, MongoAlchemy, and Ming.

scikit-learn

scikit-learn is an open source machine learning library for the Python programming language. It supports various machine learning models, such as classification, regression, and clustering algorithms, and interoperates with the Python numerical and scientific libraries NumPy and SciPy. The latest scikit-learn version is 0.16.1, published in April 2015.

Summary

In this article, we presented three main points. First, we examined the relationship between raw data, information, and knowledge. We then discussed an overview of data analysis and its processing steps. Finally, we introduced a few common libraries that are useful for practical data analysis applications; among those, we will focus on the Python libraries for data analysis.

Resources for Article:

Further resources on this subject:

Exploiting Services with Python [Article]
Basics of Jupyter Notebook and Python [Article]
How to do Machine Learning with Python [Article]