How-To Tutorials

Packt
04 Mar 2015
12 min read

Your first FuelPHP application in 7 easy steps

In this article by Sébastien Drouyer, author of the book FuelPHP Application Development Blueprints, we will see that FuelPHP is an open source PHP framework using the latest technologies. Its large community regularly creates and improves packages and extensions, and the framework’s core is constantly evolving. As a result, FuelPHP is a very complete solution for developing web applications.

In this article, we will also see how easy it is for developers to create their first website using the PHP oil utility.

The target application

Suppose you are a zoo manager and you want to keep track of the monkeys you are looking after. For each monkey, you want to save:

  • Its name
  • Whether it is still in the zoo
  • Its height
  • A description input where you can enter custom information

You want a very simple interface with five major features. You want to be able to:

  • Create new monkeys
  • Edit existing ones
  • List all monkeys
  • View a detailed file for each monkey
  • Delete monkeys

These five features, very common in computer applications, are part of the basic Create, Read, Update and Delete (CRUD) operations.

Installing the environment

The FuelPHP framework needs the following three components:

  • Webserver: The most common solution is Apache
  • PHP interpreter: Version 5.3 or above
  • Database: We will use the most popular one, MySQL

The installation and configuration procedures for these components depend on the operating system you use. We provide some directions here to get you started in case you are not used to installing your development environment. Please note, though, that these are very generic guidelines. Feel free to search the web for more information, as there are countless resources on the topic.

Windows

A complete and very popular solution is to install WAMP. This will install Apache, MySQL and PHP, in other words everything you need to get started.
It can be accessed at the following URL: http://www.wampserver.com/en/

Mac

PHP and Apache are generally installed on the latest version of the OS, so you just have to install MySQL. To do that, we recommend reading the official documentation: http://dev.mysql.com/doc/refman/5.1/en/macosx-installation.html

A very convenient solution for those of you with less system administration experience is to install MAMP, the equivalent of WAMP for the Mac operating system. It can be downloaded from the following URL: http://www.mamp.info/en/downloads/

Ubuntu

As this is the most popular Linux distribution, we will limit our instructions to Ubuntu. You can install a complete environment by executing the following command lines:

    # Apache, MySQL, PHP
    sudo apt-get install lamp-server^

    # PHPMyAdmin allows you to handle the administration of MySQL DB
    sudo apt-get install phpmyadmin

    # Curl is useful for doing web requests
    sudo apt-get install curl libcurl3 libcurl3-dev php5-curl

    # Enabling the rewrite module as it is needed by FuelPHP
    sudo a2enmod rewrite

    # Restarting Apache to apply the new configuration
    sudo service apache2 restart

Getting the FuelPHP framework

There are four common ways to download FuelPHP:

  • Downloading and unzipping the compressed package, which can be found on the FuelPHP website.
  • Executing the FuelPHP quick command-line installer.
  • Downloading and installing FuelPHP using Composer.
  • Cloning the FuelPHP GitHub repository. This is a little more complicated but allows you to select exactly the version (or even the commit) you want to install.

The easiest way is to download and unzip the compressed package located at: http://fuelphp.com/files/download/28

You can get more information about this step in Chapter 1 of FuelPHP Application Development Blueprints, which can be accessed freely.
It is also well-documented on the website installation instructions page: http://fuelphp.com/docs/installation/instructions.html

Installation directory and Apache configuration

Now that you know how to install FuelPHP in a given directory, we will explain where to install it and how to configure Apache.

The simplest way

The simplest way is to install FuelPHP in the root folder of your web server (generally the /var/www directory on Linux systems). If you install FuelPHP in the DIR directory inside the root folder (/var/www/DIR), you will be able to access your project at the following URL: http://localhost/DIR/public/

However, be warned that FuelPHP was not designed to be deployed this way, and if you publish your project like this on a production server, it will introduce security issues that you will have to handle. In such cases, we recommend the second way, explained in the next section, although you might not have the choice, for instance if you plan to publish your project on a shared host. Complete and up-to-date documentation about this issue can be found on the FuelPHP installation instructions page: http://fuelphp.com/docs/installation/instructions.html

By setting up a virtual host

Another way is to create a virtual host to access your application. You will need a *nix environment and a little more Apache and system administration skill, but the benefit is that it is more secure and you will be able to choose your working directory. You will need to change two files:

  • Your Apache virtual host file(s), in order to link a virtual host to your application
  • Your system hosts file, in order to redirect the wanted URL to your virtual host

In both cases, the location of the files depends heavily on your operating system and the server environment you are using, so you will have to find them yourself (if you are using a common configuration, you will have no trouble finding instructions on the web).
In the following example, we will set up your system to call your application when requesting the my.app URL in your local environment. Let’s first edit the virtual host file(s); add the following code at the end:

    <VirtualHost *:80>
        ServerName my.app
        DocumentRoot YOUR_APP_PATH/public
        SetEnv FUEL_ENV "development"
        <Directory YOUR_APP_PATH/public>
            DirectoryIndex index.php
            AllowOverride All
            Order allow,deny
            Allow from all
        </Directory>
    </VirtualHost>

Then, open your system hosts file and add the following line at the end:

    127.0.0.1 my.app

Depending on your environment, you might need to restart Apache after that. You can now access your website at the following URL: http://my.app/

Checking that everything works

Whether you used a virtual host or not, the FuelPHP welcome page should now appear when you access your website.

Congratulations! You have just successfully installed the FuelPHP framework. The welcome page shows some recommended directions to continue your project.

Database configuration

As we will store our monkeys in a MySQL database, it is time to configure FuelPHP to use our local database. If you open fuel/app/config/db.php, all you will see is an empty array, but this configuration file is merged with fuel/app/config/ENV/db.php, ENV being the current FuelPHP environment, which in this case is development. You should therefore open fuel/app/config/development/db.php:

    <?php
    //...
    return array(
        'default' => array(
            'connection' => array(
                'dsn'      => 'mysql:host=localhost;dbname=fuel_dev',
                'username' => 'root',
                'password' => 'root',
            ),
        ),
    );

You should adapt this array to your local configuration, particularly the database name (currently set to fuel_dev), the username, and the password. You must create your project’s database manually.

Scaffolding

Now that the database configuration is set, we can generate a scaffold. For that, we will use the generate feature of the oil utility.
Open the command-line utility and go to your website root directory. To generate a scaffold for a new model, you will need to enter the following line:

    php oil generate scaffold/crud MODEL ATTR_1:TYPE_1 ATTR_2:TYPE_2 ...

Where:

  • MODEL is the model name
  • ATTR_1, ATTR_2… are the names of the model’s attributes
  • TYPE_1, TYPE_2… are the types of each attribute

In our case, it should be:

    php oil generate scaffold/crud monkey name:string still_here:bool height:float description:text

Here we are telling oil to generate a scaffold for the monkey model with the following attributes:

  • name: The name of the monkey. Its type is string and the associated MySQL column type will be VARCHAR(255).
  • still_here: Whether or not the monkey is still in the facility. Its type is boolean and the associated MySQL column type will be TINYINT(1).
  • height: Height of the monkey. Its type is float and its associated MySQL column type will be FLOAT.
  • description: Description of the monkey. Its type is text and its associated MySQL column type will be TEXT.

You can do much more using the oil generate feature, such as generating models, controllers, migrations, tasks, packages and so on. We will see some of these in the FuelPHP Application Development Blueprints book, and we also recommend taking a look at the official documentation: http://fuelphp.com/docs/packages/oil/generate.html

When you press Enter, you will see the following lines appear:

    Creating migration: APPPATH/migrations/001_create_monkeys.php
    Creating model: APPPATH/classes/model/monkey.php
    Creating controller: APPPATH/classes/controller/monkey.php
    Creating view: APPPATH/views/monkey/index.php
    Creating view: APPPATH/views/monkey/view.php
    Creating view: APPPATH/views/monkey/create.php
    Creating view: APPPATH/views/monkey/edit.php
    Creating view: APPPATH/views/monkey/_form.php
    Creating view: APPPATH/views/template.php

Where APPPATH is your website directory/fuel/app.
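As an aside, the ATTR:TYPE arguments of the scaffold command can be read as a small mapping from attribute names to MySQL column types. The following Python 3 sketch only illustrates the mapping described above; parse_scaffold_args and COLUMN_TYPES are hypothetical names invented for this example, and this is not how oil itself is implemented.

```python
# Hypothetical illustration, not part of FuelPHP: mirrors the type mapping
# described in the text (string, bool, float, text).
COLUMN_TYPES = {
    "string": "VARCHAR(255)",
    "bool": "TINYINT(1)",
    "float": "FLOAT",
    "text": "TEXT",
}


def parse_scaffold_args(args):
    """Turn ['name:string', ...] into a list of (attribute, column type) pairs."""
    columns = []
    for arg in args:
        # Split 'name:string' into the attribute name and its type
        name, _, type_name = arg.partition(":")
        columns.append((name, COLUMN_TYPES[type_name]))
    return columns


columns = parse_scaffold_args(
    "name:string still_here:bool height:float description:text".split()
)
for name, column_type in columns:
    print(name, column_type)
```

Running it prints one attribute and its column type per line, in the same order as the arguments given on the command line.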
Oil has generated nine files for us:

  • A migration file, containing all the necessary information to create the model’s associated table
  • The model
  • A controller
  • Five view files and a template file

More explanation about these files and how they interact with each other can be found in Chapter 1 of the FuelPHP Application Development Blueprints book, which is freely available. For those of you who are not yet familiar with MVC and HMVC frameworks, don’t worry; the chapter contains an introduction to the most important concepts.

Migrating

One of the generated files was APPPATH/migrations/001_create_monkeys.php. It is a migration file and contains the required information to create our monkey table. Notice that the name is structured as VER_NAME, where VER is the version number and NAME is the name of the migration.

If you execute the following command line:

    php oil refine migrate

all migration files that have not yet been executed will be executed, from the oldest version to the latest (001, 002, 003, and so on). Once all files have been executed, oil will display the latest version number.

If you then take a look at your database, you will observe that not one, but two tables have been created:

  • monkeys: As expected, a table has been created to hold your monkeys. Notice that the table name is the plural of the word we typed when generating the scaffold; this transformation is done internally using the Inflector::pluralize method. The table will contain the specified columns (name, still_here, height, and description), the id column, and also created_at and updated_at. These columns respectively store the time an object was created and updated, and are added by default each time you generate your models. It is possible, though, not to generate them by using the --no-timestamp argument.
  • migration: This other table was created automatically. It keeps track of the migrations that have been executed.
If you look into its content, you will see that it already contains one row; this is the migration you just executed. Notice that the row records not only the migration itself, but also a type and a name. This is because migration files can be placed in many locations, such as modules or packages.

The oil utility allows you to do much more. Don’t hesitate to take a look at the official documentation: http://fuelphp.com/docs/packages/oil/intro.html

Or, again, read Chapter 1 of FuelPHP Application Development Blueprints, which is available for free.

Using your application

Now that we have generated the code and migrated the database, our application is ready to be used. Request the following URL:

  • If you created a virtual host: http://my.app/monkey
  • Otherwise (don’t forget to replace DIR): http://localhost/DIR/public/monkey

As you can see, this webpage is intended to display the list of all monkeys, but since none have been added, the list is empty.

Let’s add a new monkey by clicking on the Add new Monkey button. The creation form should appear, and you can enter your monkey’s information there. The form is certainly not perfect (for instance, the Still here field uses a standard input although a checkbox would be more appropriate), but it is a great start. All we will have to do is refine the code a little.

Once you have added several monkeys, you can take another look at the listing page. Again, this is a great start, though we might want to refine it. Each item in the list has three associated actions: View, Edit, and Delete.

Let’s first click on View. Again, this is a great start, though we will refine this webpage. You can return to the listing by clicking on Back, or edit the monkey file by clicking on Edit. Whether accessed from the listing page or the view page, Edit displays the same form as when creating a new monkey, except that the form is prefilled.
Finally, if you click on Delete, a confirmation box will appear to prevent accidental clicks.

Want to learn more?

Don’t hesitate to check out Chapter 1 of FuelPHP Application Development Blueprints, which is freely available on Packt Publishing’s website. In this chapter, you will find a more thorough introduction to FuelPHP, and we will show how to improve this first application.

We also recommend exploring the FuelPHP website, which contains a lot of useful information and excellent documentation: http://www.fuelphp.com

There is much more to discover about this wonderful framework.

Summary

In this article, we learned how to install the FuelPHP environment, choose its installation directory, and generate and use a first application with the oil utility.

Resources for Article:

Further resources on this subject:

  • PHP Magic Features [Article]
  • FuelPHP [Article]
  • Building a To-do List with Ajax [Article]
Joe Masilotti
04 Mar 2015
8 min read

Test Driving UITableViews with Cedar

One of the first things a developer does when learning iOS development is to display a list of items to the user. In iOS we use UITableViews to show one-dimensional tables of information. In practice they look like a long list of data and should be used in that way. UITableViews get their information from a UITableViewDataSource, which responds to a few methods describing how many cells there are and what information each cell contains.

This post is a step-by-step guide to test driving UITableViews in iOS. All code samples use the behavior-driven testing framework Cedar. Cedar can be installed as a CocoaPod by adding the following to your Podfile:

    target 'Specs' do
      pod 'Cedar'
    end

Follow this guide for installation and configuration instructions if you are having trouble or want a crash course on the framework.

Unit-Style Approach

One way to test table views is to follow a unit-style approach on the data source. The goal there is to call single public methods and assert that the correct state was altered or the return value was configured correctly. The target for unit testing a UITableView is its UITableViewDataSource property. The tests for this are fairly straightforward as they call -tableView:cellForRowAtIndexPath: and -tableView:numberOfRowsInSection: directly.

For example, let's say we want our controller to display a table with the current list of iPhones. Our mental assertions are that this table should show a single section with nine items, one for each of the iPhone, iPhone 3G, iPhone 3GS, iPhone 4, iPhone 4s, iPhone 5, iPhone 5s, iPhone 6, and iPhone 6 Plus.

The unit tests will follow a very similar pattern. Since a table defaults to one section we don't need to write a test asserting the number of sections. We can just test that there are nine cells and assume that if the text of the first and last cells is correct, everything is working.
    describe(@"ViewController", ^{
        __block ViewController *subject;

        beforeEach(^{
            subject = [[ViewController alloc] init];
        });

        describe(@"-tableView:numberOfRowsInSection:", ^{
            it(@"should have nine cells", ^{
                [subject tableView:subject.tableView numberOfRowsInSection:0] should equal(9);
            });
        });

        describe(@"-tableView:cellForRowAtIndexPath:", ^{
            __block UITableViewCell *cell;

            context(@"the first cell", ^{
                beforeEach(^{
                    NSIndexPath *indexPath = [NSIndexPath indexPathForRow:0 inSection:0];
                    cell = [subject tableView:subject.tableView cellForRowAtIndexPath:indexPath];
                });

                it(@"should display 'iPhone'", ^{
                    cell.textLabel.text should equal(@"iPhone");
                });
            });

            context(@"the last cell", ^{
                beforeEach(^{
                    NSIndexPath *indexPath = [NSIndexPath indexPathForRow:8 inSection:0];
                    cell = [subject tableView:subject.tableView cellForRowAtIndexPath:indexPath];
                });

                it(@"should display 'iPhone 6 Plus'", ^{
                    cell.textLabel.text should equal(@"iPhone 6 Plus");
                });
            });
        });
    });

Now the good part about these tests is that they are easy to follow and straight to the point. When we ask how many items there are we expect the right amount. And when we want to ensure the first cell is set up correctly we test just that.

Issues

Unfortunately there are a few problems with this approach. The biggest issue is that we can get these tests to pass without actually displaying anything on the screen. A simple implementation of these two methods in our controller will make everything green but has no guarantee that a table view is on the screen (or that one even exists!). The first step in remedying this is to write a test asserting that the table view is a subview.

Another, albeit minor, issue is we are breaking encapsulation; we are exposing that our controller conforms to the UITableViewDataSource protocol. Let's see what we can do about these two problems.

Benefits

Don't think that unit-style is bad, it just has different uses.
If you have an app that uses multiple instances of the same kind of table view, you will see benefits from this approach. This is because all you would need to do in your controller is ensure that the right type of data source was configured. You could take this one step further by injecting the array of items to display and unit testing that. Then you have a repeatable unit of code that shows a list of data conforming to your app's specifications, which is quite powerful.

Behavior-Driven Approach

Let's take a more behavioral approach to our problem. Our goal is to display to the user the list of iPhones. If we care about what the user sees, what is the closest way of replicating that? How about what cells are visible to the user?

From Apple's documentation, -visibleCells on UITableView:

    Returns the table cells that are visible in the receiver.

This sounds interesting. Let's restructure our tests to run assertions on the cells that the user sees, not some made-up world of delegates and data sources.

    describe(@"when the view loads", ^{
        beforeEach(^{
            subject.view should_not be_nil;
            [subject.view layoutIfNeeded];
        });

        it(@"should display the first iPhone, first", ^{
            UITableViewCell *firstCell = subject.tableView.visibleCells.firstObject;
            firstCell.textLabel.text should equal(@"iPhone");
        });

        it(@"should display the iPhone 6 Plus, last", ^{
            UITableViewCell *lastCell = subject.tableView.visibleCells.lastObject;
            lastCell.textLabel.text should equal(@"iPhone 6 Plus");
        });
    });

Note that in the beforeEach we assert that the view should exist. This is to kick off the controller's view lifecycle methods, namely -loadView and -viewDidLoad. We then tell its view to lay out its subviews if need be. This ensures that anything we add as a subview has its layout constraints configured and applied.

To get this to pass we have a few things to take care of.
  • Create the backing array of iPhones
  • Create the table view and add it as a subview
  • Become the data source and respond to the calls

The first one is easy so let's knock that out first.

    @interface ViewController () <UITableViewDataSource>
    @property (nonatomic) UITableView *tableView;
    @property (nonatomic, strong) NSArray *iPhones;
    @end

    @implementation ViewController

    - (instancetype)init {
        if (self = [super init]) {
            self.iPhones = @[ @"iPhone", @"iPhone 3G", @"iPhone 3GS",
                              @"iPhone 4", @"iPhone 4s", @"iPhone 5",
                              @"iPhone 5s", @"iPhone 6", @"iPhone 6 Plus" ];
        }
        return self;
    }

Note the opening up of the -tableView property in the interface extension. This allows us to keep it private in the header and the outside world while still being able to modify it internally.

Next let's add the table view and its auto layout constraints.

    - (void)viewDidLoad {
        [super viewDidLoad];
        self.tableView = [[UITableView alloc] init];
        [self.view addSubview:self.tableView];
        [self addTableViewConstraints];
    }

    #pragma mark - Private

    - (void)addTableViewConstraints {
        self.tableView.translatesAutoresizingMaskIntoConstraints = NO;
        NSDictionary *views = @{ @"tableView": self.tableView };
        [self.view addConstraints:[NSLayoutConstraint constraintsWithVisualFormat:@"V:|[tableView]|"
                                                                          options:kNilOptions
                                                                          metrics:nil
                                                                            views:views]];
        [self.view addConstraints:[NSLayoutConstraint constraintsWithVisualFormat:@"H:|[tableView]|"
                                                                          options:kNilOptions
                                                                          metrics:nil
                                                                            views:views]];
    }

Since we aren't working with Storyboards or xibs/nibs we create the table view manually and add it as a subview. We also will need to add some simple auto layout constraints to have it fill the screen. Check out Apple's Auto Layout by Example guide if you would like a deeper explanation.

Finally let's get to the meat of the issue and respond to the data source methods.
    #pragma mark - <UITableViewDataSource>

    - (NSInteger)tableView:(UITableView *)tableView numberOfRowsInSection:(NSInteger)section {
        return self.iPhones.count;
    }

    - (UITableViewCell *)tableView:(UITableView *)tableView cellForRowAtIndexPath:(NSIndexPath *)indexPath {
        UITableViewCell *cell = [tableView dequeueReusableCellWithIdentifier:kCellIdentifier
                                                                forIndexPath:indexPath];
        cell.textLabel.text = self.iPhones[indexPath.row];
        return cell;
    }

We also need to become the data source of the table, so do that and register the cell in -viewDidLoad.

    [self.tableView registerClass:[UITableViewCell class] forCellReuseIdentifier:kCellIdentifier];
    self.tableView.dataSource = self;

Finally add the constant to the top of the file.

    NSString * const kCellIdentifier = @"CellIdentifier";

What's interesting about this approach is that the tests will not pass until you have every line correct. This helps ensure that what is happening under spec is closer to the real experience of the app. For example, having a table view on the screen and responding to the data source calls, but not assigning the data source, won't get you anywhere. In the unit approach you could have done just that but still seen your tests go green.

Benefits of Behavior Testing

When testing behavior you put yourself in a world that more closely represents the state of the app when a user is interacting with it. It also enables you to test collaboration between objects without having to single out one very small piece of the architecture. This means it can be easy to get carried away and start writing full integration tests from controllers. If you keep to testing only one or two layers of abstraction, in this case the table view through its data source, your code and specs remain easy to read and understand.

As a side effect, this approach enabled us to hide some implementation details in the production code. This means we are freer to do a green-to-green refactor without having to change our specs.
For example, we could extract the UITableViewDataSource into its own object and know that it works correctly when all of the existing tests still pass. If we then wanted to reuse that collaborator, we could extract the specs and have it stand on its own. Or, if our backing array turned into an NSDictionary and we looked everything up by key, nothing in our tests would have to change.

There are many styles of testing and even more ways to test Objective-C code and the Cocoa Touch framework. Behavior testing is just one approach, and it is the one that has proved the most flexible and easy to understand for me. What other techniques and methods have you implemented to ensure code coverage on your own iOS apps?

About the author

Joe Masilotti is a test-driven iOS developer living in Brooklyn, NY. He contributes to open-source testing tools on GitHub and talks about development, cooking, and craft beer on Twitter.

Packt
04 Mar 2015
22 min read

Python functions – Avoid repeating code

In this article by Silas Toms, author of the book ArcPy and ArcGIS – Geospatial Analysis with Python, we will see how programming languages share a concept that has aided programmers for decades: functions. The idea of a function, loosely speaking, is to create blocks of code that will perform an action on a piece of data, transforming it as required by the programmer and returning the transformed data back to the main body of code.

Functions are used because they solve many different needs within programming. Functions reduce the need to write repetitive code, which in turn reduces the time needed to create a script. They can be used to create ranges of numbers (the range() function), or to determine the maximum value of a list (the max() function), or to create a SQL statement to select a set of rows from a feature class. They can even be copied and used in another script or included as part of a module that can be imported into scripts. Function reuse has the added bonus of making programming more useful and less of a chore. When a scripter starts writing functions, it is a major step towards making programming part of a GIS workflow.

Technical definition of functions

Functions, also called subroutines or procedures in other programming languages, are blocks of code that have been designed to either accept input data and transform it, or provide data to the main program when called without any input required. In theory, a function will only transform data that has been provided to it as a parameter; it should not change any other part of the script that has not been included in the function. To make this possible, the concept of namespaces is invoked. Namespaces make it possible to use a variable name within a function, and allow it to represent a value, while also using the same variable name in another part of the script.
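As a small illustration of that last point, the following sketch uses the same name, multiplier, both at the top level of a script and inside a function. The names triple and multiplier are made up for this example.

```python
multiplier = 10  # module-level (global) name


def triple(value):
    # This assignment creates a separate, local name inside the function;
    # it does not touch the module-level multiplier defined above.
    multiplier = 3
    return value * multiplier


result = triple(4)
print(result)      # 12
print(multiplier)  # 10, unchanged by the call
```

The function works with its own copy of the name, so the value used by the rest of the script stays as it was.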
Namespaces become especially important when importing modules from other programmers; within that module and its functions, a variable might have the same name as a variable within the main script.

In a high-level programming language such as Python, there is built-in support for functions, including the ability to define function names and the data inputs (also known as parameters). Functions are created using the keyword def plus a function name, along with parentheses that may or may not contain parameters. Parameters can also be defined with default values, so parameters only need to be passed to the function when they differ from the default. The values that are returned from the function are also easily defined.

A first function

Let's create a function to get a feel for what is possible when writing functions. First, we need to define the function by providing the def keyword and a name along with the parentheses. The firstFunction() will return a string when called:

    def firstFunction():
        'a simple function returning a string'
        return "My First Function"

    >>> firstFunction()

The output is as follows:

    'My First Function'

Notice that this function has a documentation string or doc string (a simple function returning a string) that describes what the function does; this string can be read later to find out what the function does, using the __doc__ attribute:

    >>> print firstFunction.__doc__

The output is as follows:

    a simple function returning a string

The function is defined and given a name, and then the parentheses are added followed by a colon. The following lines must then be indented (a good IDE will add the indentation automatically). The function does not have any parameters, so the parentheses are empty. The function then uses the keyword return to return a value, in this case a string, from the function.

Next, the function is called by adding parentheses to the function name.
When it is called, it will do what it has been instructed to do: return the string.

Functions with parameters

Now let's create a function that accepts parameters and transforms them as needed. This function will accept a number and multiply it by 3:

    def secondFunction(number):
        'this function multiplies numbers by 3'
        return number * 3

    >>> secondFunction(4)

The output is as follows:

    12

The function has one flaw, however; there is no assurance that the value passed to the function is a number. A string, for example, would silently be repeated three times instead of being multiplied. We need to add a conditional to the function to make sure it only processes numbers:

    def secondFunction(number):
        'this function multiplies numbers by 3'
        if type(number) == type(1) or type(number) == type(1.0):
            return number * 3

    >>> secondFunction(4.0)

The output is as follows:

    12.0

    >>> secondFunction(4)

The output is as follows:

    12

    >>> secondFunction("String")
    >>>

The function now accepts a parameter, checks what type of data it is, and returns a multiple of the parameter whether it is an integer or a float. If it is a string or some other data type, as shown in the last example, no value is returned.

There is one more adjustment to the simple function that we should discuss: parameter defaults. By including default values in the definition of the function, we avoid having to provide parameters that rarely change. If, for instance, we wanted a different multiplier than 3 in the simple function, we would define it like this:

    def thirdFunction(number, multiplier=3):
        'this function multiplies numbers by a multiplier (3 by default)'
        if type(number) == type(1) or type(number) == type(1.0):
            return number * multiplier

    >>> thirdFunction(4)

The output is as follows:

    12

    >>> thirdFunction(4,5)

The output is as follows:

    20

The function will work when only the number to be multiplied is supplied, as the multiplier has a default value of 3. However, if we need another multiplier, the value can be adjusted by adding another value when calling the function.
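The type comparison used above can also be written with isinstance(). The following Python 3 sketch is one possible variation rather than the article's method: it assumes that raising an error is preferable to silently returning nothing, and multiply is a hypothetical name chosen for this example.

```python
import numbers


def multiply(number, multiplier=3):
    """Return number * multiplier, raising TypeError for non-numeric input."""
    for value in (number, multiplier):
        # bool is a subclass of int in Python, so it is rejected explicitly
        if isinstance(value, bool) or not isinstance(value, numbers.Number):
            raise TypeError("expected a number, got %r" % (value,))
    return number * multiplier


print(multiply(4))                # uses the default multiplier of 3
print(multiply(4, multiplier=5))  # the default is overridden by name
```

Because both arguments go through the same check, a call such as multiply("String") stops with a TypeError instead of returning None.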
Note that the multiplier (the second value) doesn't have to be a number, as there is no type checking on it. Also, the parameters with default values must follow the parameters with no defaults (alternatively, all parameters can have a default value, and the parameters can be supplied to the function in order or by name).

Using functions to replace repetitive code

One of the main uses of functions is to ensure that the same code does not have to be written over and over. The first portion of the script that we could convert into a function is the three ArcPy function calls. Doing so will allow the script to be applicable to any of the stops in the Bus Stop feature class and have an adjustable buffer distance:

    import arcpy

    bufferDist = 400
    buffDistUnit = "Feet"
    lineName = '71 IB'
    busSignage = 'Ferry Plaza'
    sqlStatement = "NAME = '{0}' AND BUS_SIGNAG = '{1}'"

    def selectBufferIntersect(selectIn, selectOut, bufferOut,
                              intersectIn, intersectOut, sqlStatement,
                              bufferDist, buffDistUnit, lineName, busSignage):
        'a function to perform a bus stop analysis'
        # Select the bus stops matching the line name and signage
        arcpy.Select_analysis(selectIn, selectOut,
                              sqlStatement.format(lineName, busSignage))
        # Buffer the selected stops; both placeholders need an argument
        arcpy.Buffer_analysis(selectOut, bufferOut,
                              "{0} {1}".format(bufferDist, buffDistUnit),
                              "FULL", "ROUND", "NONE", "")
        # Intersect the buffer with the second input feature class
        arcpy.Intersect_analysis("{0} #;{1} #".format(bufferOut, intersectIn),
                                 intersectOut, "ALL", "", "INPUT")
        return intersectOut

This function demonstrates how the analysis can be adjusted to accept the input and output feature class variables as parameters, along with some new variables. The function adds a variable to replace the SQL statement and variables to adjust the bus stop, and also tweaks the buffer distance statement so that both the distance and the unit can be adjusted. The feature class name variables, defined earlier in the script, have all been replaced with local variable names; while the global variable names could have been retained, doing so would reduce the portability of the function.
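The string templates are the part of this function that is easiest to get wrong, and they can be checked without ArcGIS installed. The short Python 3 sketch below reuses the sqlStatement template from the article; bufferTemplate and the printed messages are illustrative names only.

```python
sqlStatement = "NAME = '{0}' AND BUS_SIGNAG = '{1}'"
bufferTemplate = "{0} {1}"  # distance followed by its unit

where_clause = sqlStatement.format('71 IB', 'Ferry Plaza')
buffer_distance = bufferTemplate.format(400, "Feet")

print(where_clause)     # NAME = '71 IB' AND BUS_SIGNAG = 'Ferry Plaza'
print(buffer_distance)  # 400 Feet

# Every numbered placeholder needs a matching argument; otherwise
# str.format() raises an IndexError.
try:
    bufferTemplate.format(400)
except IndexError as error:
    print("missing argument:", error)
```

Printing the formatted strings before handing them to a geoprocessing tool is a cheap way to confirm that the where clause and the buffer distance look as intended.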
The next function will accept the result of the selectBufferIntersect() function and search it using the Search Cursor, passing the results into a dictionary. The dictionary will then be returned from the function for later use:

    def createResultDic(resultFC):
        'search results of analysis and create results dictionary'
        dataDictionary = {}
        with arcpy.da.SearchCursor(resultFC, ["STOPID","POP10"]) as cursor:
            for row in cursor:
                busStopID = row[0]
                pop10 = row[1]
                if busStopID not in dataDictionary.keys():
                    dataDictionary[busStopID] = [pop10]
                else:
                    dataDictionary[busStopID].append(pop10)
        return dataDictionary

This function only requires one parameter: the feature class returned from the selectBufferIntersect() function. The results dictionary is first created, then populated by the search cursor, with the STOPID attribute used as the key and the census block population attribute (POP10) added to a list assigned to that key.

The dictionary, having been populated with the grouped data, is returned from the function for use in the final function, createCSV(). This function accepts the dictionary and the name of the output CSV file as a string:

    def createCSV(dictionary, csvname):
        'a function takes a dictionary and creates a CSV file'
        with open(csvname, 'wb') as csvfile:
            csvwriter = csv.writer(csvfile, delimiter=',')
            for busStopID in dictionary.keys():
                popList = dictionary[busStopID]
                averagePop = sum(popList)/len(popList)
                data = [busStopID, averagePop]
                csvwriter.writerow(data)

The final function creates the CSV using the csv module. The name of the file, a string, is now a customizable parameter (meaning the output name can be any valid file path ending in the .csv extension).
The csvfile file object is passed to the csv module's writer() function and the result is assigned to the variable csvwriter. The dictionary is then accessed and processed, and each result is passed as a list to csvwriter to be written to the CSV file. The writer's writerow() method formats each list as a row of comma-separated values and writes it to the file. Open the CSV file with Excel or a text editor such as Notepad.

To run the functions, we will call them in the script following the function definitions:

    analysisResult = selectBufferIntersect(Bus_Stops, Inbound71,
                                           Inbound71_400ft_buffer,
                                           CensusBlocks2010, Intersect71Census,
                                           bufferDist, lineName, busSignage)
    dictionary = createResultDic(analysisResult)
    createCSV(dictionary, r'C:\Projects\Output\Averages.csv')

Now, the script has been divided into three functions, which replace the code of the first modified script. The modified script looks like this:

    # -*- coding: utf-8 -*-
    # ---------------------------------------------------------------------------
    # 8662_Chapter4Modified1.py
    # Created on: 2014-04-22 21:59:31.00000
    #   (generated by ArcGIS/ModelBuilder)
    # Description:
    # Adjusted by Silas Toms
    # 2014 05 05
    # ---------------------------------------------------------------------------

    # Import arcpy module
    import arcpy
    import csv

    # Local variables:
    Bus_Stops = r"C:\Projects\PacktDB.gdb\SanFrancisco\Bus_Stops"
    CensusBlocks2010 = r"C:\Projects\PacktDB.gdb\SanFrancisco\CensusBlocks2010"
    Inbound71 = r"C:\Projects\PacktDB.gdb\Chapter3Results\Inbound71"
    Inbound71_400ft_buffer = r"C:\Projects\PacktDB.gdb\Chapter3Results\Inbound71_400ft_buffer"
    Intersect71Census = r"C:\Projects\PacktDB.gdb\Chapter3Results\Intersect71Census"
    bufferDist = 400
    lineName = '71 IB'
    busSignage = 'Ferry Plaza'

    def selectBufferIntersect(selectIn, selectOut, bufferOut, intersectIn,
                              intersectOut, bufferDist, lineName, busSignage):
        arcpy.Select_analysis(selectIn,
                              selectOut,
                              "NAME = '{0}' AND BUS_SIGNAG = '{1}'".format(lineName, busSignage))
        arcpy.Buffer_analysis(selectOut,
                              bufferOut,
                              "{0} Feet".format(bufferDist),
                              "FULL", "ROUND", "NONE", "")
        arcpy.Intersect_analysis("{0} #;{1} #".format(bufferOut, intersectIn),
                                 intersectOut, "ALL", "", "INPUT")
        return intersectOut

    def createResultDic(resultFC):
        dataDictionary = {}
        with arcpy.da.SearchCursor(resultFC,
                                   ["STOPID","POP10"]) as cursor:
            for row in cursor:
                busStopID = row[0]
                pop10 = row[1]
                if busStopID not in dataDictionary.keys():
                    dataDictionary[busStopID] = [pop10]
                else:
                    dataDictionary[busStopID].append(pop10)
        return dataDictionary

    def createCSV(dictionary, csvname):
        with open(csvname, 'wb') as csvfile:
            csvwriter = csv.writer(csvfile, delimiter=',')
            for busStopID in dictionary.keys():
                popList = dictionary[busStopID]
                averagePop = sum(popList)/len(popList)
                data = [busStopID, averagePop]
                csvwriter.writerow(data)

    analysisResult = selectBufferIntersect(Bus_Stops, Inbound71,
                                           Inbound71_400ft_buffer,
                                           CensusBlocks2010, Intersect71Census,
                                           bufferDist, lineName, busSignage)
    dictionary = createResultDic(analysisResult)
    createCSV(dictionary, r'C:\Projects\Output\Averages.csv')
    print "Data Analysis Complete"

Further generalization of the functions

While we have created functions from the original script that can be used to extract more data about bus stops in San Francisco, our new functions are still very specific to the dataset and analysis for which they were created. That specificity is acceptable for a long and laborious one-off analysis for which creating reusable functions is not necessary. The first use of functions is to get rid of the need to repeat code. The next goal is to then make that code reusable.
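The grouping and averaging logic in createResultDic() and createCSV() does not depend on ArcPy, so it can be tried on its own. The sketch below is an illustration under stated assumptions: it is written for Python 3 (unlike the Python 2 script above), the function names groupPopulations and averagesToCSV are hypothetical, the sample rows are made-up values standing in for the search cursor, and the CSV is written to an in-memory buffer instead of a file.

```python
import csv
import io
from collections import defaultdict

def groupPopulations(rows):
    'group (stopID, population) pairs into a dictionary of lists'
    grouped = defaultdict(list)
    for busStopID, pop10 in rows:
        grouped[busStopID].append(pop10)
    return dict(grouped)

def averagesToCSV(dictionary):
    'return CSV text with one "stopID,average" row per bus stop'
    buffer = io.StringIO()
    csvwriter = csv.writer(buffer, delimiter=',', lineterminator='\n')
    for busStopID in sorted(dictionary):
        popList = dictionary[busStopID]
        # float() forces true division regardless of the Python version
        averagePop = float(sum(popList)) / len(popList)
        csvwriter.writerow([busStopID, averagePop])
    return buffer.getvalue()

# Made-up sample data standing in for the rows of the search cursor
sampleRows = [(101, 40), (102, 10), (101, 60), (102, 15)]
grouped = groupPopulations(sampleRows)
csvText = averagesToCSV(grouped)
```

Using defaultdict removes the need for the explicit "if key not in dictionary" test, and the float() call is worth noting because in Python 2 the expression sum(popList)/len(popList) performs integer division when all the populations are integers.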
Let's discuss some ways in which we can convert the functions from one-offs into reusable functions or even modules. First, let's examine the first function: def selectBufferIntersect(selectIn,selectOut,bufferOut,intersectIn,                          intersectOut, bufferDist,lineName, busSignage ):    arcpy.Select_analysis(selectIn,                          selectOut,                          "NAME = '{0}' AND BUS_SIGNAG = '{1}'".format(lineName, busSignage))    arcpy.Buffer_analysis(selectOut,                          bufferOut,                          "{0} Feet".format(bufferDist),                          "FULL", "ROUND", "NONE", "")    arcpy.Intersect_analysis("{0} #;{1} #".format(bufferOut,intersectIn),                              intersectOut, "ALL", "", "INPUT")    return intersectOut This function appears to be pretty specific to the bus stop analysis. It's so specific, in fact, that while there are a few ways in which we can tweak it to make it more general (that is, useful in other scripts that might not have the same steps involved), we should not convert it into a separate function. When we create a separate function, we introduce too many variables into the script in an effort to simplify it, which is a counterproductive effort. Instead, let's focus on ways to generalize the ArcPy tools themselves. The first step will be to split the three ArcPy tools and examine what can be adjusted with each of them. The Select tool should be adjusted to accept a string as the SQL select statement. The SQL statement can then be generated by another function or by parameters accepted at runtime. For instance, if we wanted to make the script accept multiple bus stops for each run of the script (for example, the inbound and outbound stops for each line), we could create a function that would accept a list of the desired stops and a SQL template, and would return a SQL statement to plug into the Select tool. 
Here is an example of how it would look:

    def formatSQLIN(dataList, sqlTemplate):
        'a function to generate a SQL statement'
        sql = sqlTemplate  # for example, "OBJECTID IN "
        # Separate the values with commas so that the IN list is valid SQL
        values = ", ".join(str(data) for data in dataList)
        sql += "(" + values + ")"
        return sql

    def formatSQL(dataList, sqlTemplate):
        'a function to generate a SQL statement'
        sql = ''
        for count, data in enumerate(dataList):
            if count != len(dataList)-1:
                sql += sqlTemplate.format(data) + ' OR '
            else:
                sql += sqlTemplate.format(data)
        return sql

    >>> dataVals = [1,2,3,4]
    >>> sqlOID = "OBJECTID = {0}"
    >>> sql = formatSQL(dataVals, sqlOID)
    >>> print sql

The output is as follows:

    OBJECTID = 1 OR OBJECTID = 2 OR OBJECTID = 3 OR OBJECTID = 4

This new function, formatSQL(), is a very useful function. Let's review what it does by comparing the function to the results following it. The function is defined to accept two parameters: a list of values and a SQL template. The first local variable is the empty string sql, which will be added to using string addition. For each value in the list, the function uses string formatting to insert the value into the SQL template, and then appends the formatted template to the sql string (note that sql += is equivalent to sql = sql +). Also, an operator (OR) is used to make the SQL statement inclusive of all data rows that match the pattern. This function uses the built-in enumerate function to count the iterations of the list; once it has reached the last value in the list, the operator is not added to the SQL statement.
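The same result can be produced without counting iterations by formatting every value first and then joining the pieces with str.join(). This is offered only as an alternative sketch; the name formatSQLJoin is hypothetical and not part of the original script.

```python
def formatSQLJoin(dataList, sqlTemplate):
    'build a SQL clause by formatting each value and joining with OR'
    pieces = [sqlTemplate.format(data) for data in dataList]
    # join() only places the separator between items, so there is no
    # trailing operator to strip and no need for enumerate()
    return " OR ".join(pieces)

sqlOR = formatSQLJoin([1, 2, 3, 4], "OBJECTID = {0}")
```

An empty list simply yields an empty string, which the calling code would need to handle before passing the clause to a tool.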
Note that we could also add one more parameter to the function to make it possible to use an AND operator instead of OR, while still keeping OR as the default: def formatSQL2(dataList, sqlTemplate, operator=" OR "):    'a function to generate a SQL statement'    sql = ''    for count, data in enumerate(dataList):        if count != len(dataList)-1:            sql += sqlTemplate.format(data) + operator        else:            sql += sqlTemplate.format(data)    return sql   >>> sql = formatSQL2(dataVals, sqlOID," AND ") >>> print sql The output is as follows: OBJECTID = 1 AND OBJECTID = 2 AND OBJECTID = 3 AND OBJECTID = 4 While it would make no sense to use an AND operator on ObjectIDs, there are other cases where it would make sense, hence leaving OR as the default while allowing for AND. Either way, this function can now be used to generate our bus stop SQL statement for multiple stops (ignoring, for now, the bus signage field): >>> sqlTemplate = "NAME = '{0}'" >>> lineNames = ['71 IB','71 OB'] >>> sql = formatSQL2(lineNames, sqlTemplate) >>> print sql The output is as follows: NAME = '71 IB' OR NAME = '71 OB' However, we can't ignore the Bus Signage field for the inbound line, as there are two starting points for the line, so we will need to adjust the function to accept multiple values: def formatSQLMultiple(dataList, sqlTemplate, operator=" OR "):    'a function to generate a SQL statement'    sql = ''    for count, data in enumerate(dataList):        if count != len(dataList)-1:            sql += sqlTemplate.format(*data) + operator        else:            sql += sqlTemplate.format(*data)    return sql   >>> sqlTemplate = "(NAME = '{0}' AND BUS_SIGNAG = '{1}')" >>> lineNames = [('71 IB', 'Ferry Plaza'),('71 OB','48th Avenue')] >>> sql = formatSQLMultiple(lineNames, sqlTemplate) >>> print sql The output is as follows: (NAME = '71 IB' AND BUS_SIGNAG = 'Ferry Plaza') OR (NAME = '71 OB' AND BUS_SIGNAG = '48th Avenue') The slight difference in this function, the 
asterisk before the data variable, allows the values inside the data variable to be correctly formatted into the SQL template by unpacking (exploding) the values within the tuple. Notice that the SQL template has been created to segregate each conditional by using parentheses. The function(s) are now ready for reuse, and the SQL statement is now ready for insertion into the Select tool:

    sql = formatSQLMultiple(lineNames, sqlTemplate)
    arcpy.Select_analysis(Bus_Stops, Inbound71, sql)

Next up is the Buffer tool. We have already taken steps towards making it generalized by adding a variable for the distance. In this case, we will only add one more variable to it, a unit variable that will make it possible to adjust the buffer unit from feet to meters or any other allowed unit. We will leave the other defaults alone. Here is an adjusted version of the Buffer tool:

    bufferDist = 400
    bufferUnit = "Feet"
    arcpy.Buffer_analysis(Inbound71,
                          Inbound71_400ft_buffer,
                          "{0} {1}".format(bufferDist, bufferUnit),
                          "FULL", "ROUND", "NONE", "")

Now, both the buffer distance and the buffer unit are controlled by variables defined earlier in the script, which makes them easy to adjust if it is decided that the original distance was not sufficient.

The next step towards adjusting the ArcPy tools is to write a function that will allow any number of feature classes to be intersected together using the Intersect tool. This new function will be similar to the previous formatSQL functions, as it will use string formatting and addition to process a list of feature classes into the string format that the Intersect tool accepts.
However, as this function will be built to be as general as possible, it must be designed to accept any number of feature classes to be intersected:

    def formatIntersect(features):
        'a function to generate an intersect string'
        formatString = ''
        for count, feature in enumerate(features):
            if count != len(features)-1:
                formatString += feature + " #;"
            else:
                formatString += feature + " #"
        # Return only after every feature has been added to the string
        return formatString

    >>> shpNames = ["example.shp","example2.shp"]
    >>> iString = formatIntersect(shpNames)
    >>> print iString

The output is as follows:

    example.shp #;example2.shp #

Now that we have written the formatIntersect() function, all that needs to be created is a list of the feature classes to be passed to the function. The string returned by the function can then be passed to the Intersect tool:

    intersected = [Inbound71_400ft_buffer, CensusBlocks2010]
    iString = formatIntersect(intersected)
    # Process: Intersect
    arcpy.Intersect_analysis(iString,
                             Intersect71Census, "ALL", "", "INPUT")

Because we avoided creating a function that only fits this script or analysis, we now have two (or more) useful functions that can be applied in later analyses, and we know how to manipulate the ArcPy tools to accept the data that we want to supply to them.

Summary

In this article, we discussed how to take autogenerated code and make it generalized, while adding functions that can be reused in other scripts and will make the generation of the necessary code components, such as SQL statements, much easier.

Resources for Article:

Further resources on this subject:
  • Enterprise Geodatabase [article]
  • Adding Graphics to the Map [article]
  • Image classification and feature extraction from images [article]

Packt
04 Mar 2015
20 min read

Writing Consumers

This article by Nishant Garg, the author of the book Learning Apache Kafka Second Edition, focuses on the details of writing consumers. Consumers are the applications that consume the messages published by Kafka producers and process the data extracted from them. Like producers, consumers can be different in nature, such as applications doing real-time or near real-time analysis, applications with NoSQL or data warehousing solutions, backend services, consumers for Hadoop, or other subscriber-based solutions. These consumers can also be implemented in different languages such as Java, C, and Python.

In this article, we will focus on the following topics:

  • The Kafka Consumer API
  • Java-based Kafka consumers
  • Java-based Kafka consumers consuming partitioned messages

At the end of the article, we will explore some of the important properties that can be set for a Kafka consumer. So, let's start.

The preceding diagram explains the high-level working of the Kafka consumer when consuming the messages. The consumer subscribes to a specific topic on the Kafka broker. It then issues a fetch request to the lead broker for the partition it wants to consume, specifying the message offset (the position from which reading should begin). The Kafka consumer therefore works in the pull model and always pulls all available messages after its current position in the Kafka log (the Kafka internal data representation).

While subscribing, the consumer connects to any of the live nodes and requests metadata about the leaders for the partitions of a topic. This allows the consumer to communicate directly with the lead broker receiving the messages. Kafka topics are divided into a set of ordered partitions and, within a consumer group, each partition is consumed by one consumer only. As messages are consumed, the consumer advances its offset to the next message to be read from the partition.
This offset records what has been consumed and also provides the flexibility of deliberately rewinding to an old offset and re-consuming the partition. In the next few sections, we will discuss the API provided by Kafka for writing Java-based custom consumers.

All the Kafka classes referred to in this article are actually written in Scala.

Kafka consumer APIs

Kafka provides two types of API for Java consumers:

  • High-level API
  • Low-level API

The high-level consumer API

The high-level consumer API is used when only data is needed and the handling of message offsets is not required. This API hides broker details from the consumer and allows effortless communication with the Kafka cluster by providing an abstraction over the low-level implementation. The high-level consumer stores the last offset read from a specific partition (the position within the partition where the consumer left off) in ZooKeeper. This offset is stored based on the consumer group name provided to Kafka at the beginning of the process.

The consumer group name is unique and global across the Kafka cluster, and any new consumers with an in-use consumer group name may cause ambiguous behavior in the system. When a new process is started with an existing consumer group name, Kafka triggers a rebalance between the new and existing process threads for the consumer group. After the rebalance, some messages that are intended for a new process may go to an old process, causing unexpected results. To avoid this ambiguous behavior, any existing consumers should be shut down before starting new consumers for an existing consumer group name.
The following are the classes that are imported to write Java-based basic consumers using the high-level consumer API for a Kafka cluster:

  • ConsumerConnector: Kafka provides the ConsumerConnector interface (interface ConsumerConnector) that is further implemented by the ZookeeperConsumerConnector class (kafka.javaapi.consumer.ZookeeperConsumerConnector). This class is responsible for all the interaction a consumer has with ZooKeeper. The following is the class diagram for the ConsumerConnector class:

  • KafkaStream: Objects of the kafka.consumer.KafkaStream class are returned by the createMessageStreams call from the ConsumerConnector implementation. A list of KafkaStream objects is returned for each topic, and each stream can in turn create an iterator over its messages. The following is the Scala-based class declaration:

        class KafkaStream[K,V](private val queue: BlockingQueue[FetchedDataChunk],
                               consumerTimeoutMs: Int,
                               private val keyDecoder: Decoder[K],
                               private val valueDecoder: Decoder[V],
                               val clientId: String)

    Here, the parameters K and V specify the type for the partition key and message value, respectively. In the createMessageStreams call from the ConsumerConnector class, clients can specify the number of desired streams, where each stream object is used for single-threaded processing. These stream objects may represent the merging of multiple unique partitions.

  • ConsumerConfig: The kafka.consumer.ConsumerConfig class encapsulates the property values required for establishing the connection with ZooKeeper, such as the ZooKeeper URL, ZooKeeper session timeout, and ZooKeeper sync time. It also contains the property values required by the consumer, such as the group ID.

A high-level API-based working consumer example is discussed after the next section.

The low-level consumer API

The high-level API does not allow consumers to control interactions with brokers.
Also known as "simple consumer API", the low-level consumer API is stateless and provides fine grained control over the communication between Kafka broker and the consumer. It allows consumers to set the message offset with every request raised to the broker and maintains the metadata at the consumer's end. This API can be used by both online as well as offline consumers such as Hadoop. These types of consumers can also perform multiple reads for the same message or manage transactions to ensure the message is consumed only once. Compared to the high-level consumer API, developers need to put in extra effort to gain low-level control within consumers by keeping track of offsets, figuring out the lead broker for the topic and partition, handling lead broker changes, and so on. In the low-level consumer API, consumers first query the live broker to find out the details about the lead broker. Information about the live broker can be passed on to the consumers either using a properties file or from the command line. The topicsMetadata() method of the kafka.javaapi.TopicMetadataResponse class is used to find out metadata about the topic of interest from the lead broker. For message partition reading, the kafka.api.OffsetRequest class defines two constants: EarliestTime and LatestTime, to find the beginning of the data in the logs and the new messages stream. These constants also help consumers to track which messages are already read. The main class used within the low-level consumer API is the SimpleConsumer (kafka.javaapi.consumer.SimpleConsumer) class. The following is the class diagram for the SimpleConsumer class:   A simple consumer class provides a connection to the lead broker for fetching the messages from the topic and methods to get the topic metadata and the list of offsets. 
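The idea that a low-level consumer keeps its own offset and can re-read messages can be illustrated with a toy model. The sketch below is written in Python for brevity and does not use any Kafka client library: the class ToyPartitionConsumer, its methods, and the in-memory list standing in for a partition log are all hypothetical and exist only to show the bookkeeping involved.

```python
class ToyPartitionConsumer(object):
    'a toy model of consumer-managed offsets; this is not a Kafka client'

    def __init__(self, log, offset=0):
        self.log = log          # the partition, modeled as a list of messages
        self.offset = offset    # position of the next message to read

    def fetch(self, maxMessages):
        'pull up to maxMessages starting at the current offset'
        batch = self.log[self.offset:self.offset + maxMessages]
        self.offset += len(batch)
        return batch

    def seek(self, offset):
        'move to an arbitrary offset, for example to re-read old messages'
        self.offset = offset

partition = ["m0", "m1", "m2", "m3", "m4"]
consumer = ToyPartitionConsumer(partition)
first = consumer.fetch(2)      # reads m0 and m1
second = consumer.fetch(2)     # reads m2 and m3
consumer.seek(1)               # rewind to an earlier offset
replayed = consumer.fetch(2)   # reads m1 and m2 again
```

In a real low-level consumer, the same state has to survive restarts and lead broker changes, which is the extra effort described above.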
A few more important classes for building different request objects are FetchRequest (kafka.api.FetchRequest), OffsetRequest (kafka.javaapi.OffsetRequest), OffsetFetchRequest (kafka.javaapi.OffsetFetchRequest), OffsetCommitRequest (kafka.javaapi.OffsetCommitRequest), and TopicMetadataRequest (kafka.javaapi.TopicMetadataRequest).

All the examples in this article are based on the high-level consumer API. For examples based on the low-level consumer API, refer to https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example.

Simple Java consumers

Now we will start writing a single-threaded simple Java consumer developed using the high-level consumer API for consuming the messages from a topic. This SimpleHLConsumer class is used to fetch a message from a specific topic and consume it, assuming that there is a single partition within the topic.

Importing classes

As a first step, we need to import the following classes:

    import kafka.consumer.ConsumerConfig;
    import kafka.consumer.ConsumerIterator;
    import kafka.consumer.KafkaStream;
    import kafka.javaapi.consumer.ConsumerConnector;

Defining properties

As a next step, we need to define properties for making a connection with ZooKeeper and pass these properties to the Kafka consumer using the following code:

    Properties props = new Properties();
    props.put("zookeeper.connect", "localhost:2181");
    props.put("group.id", "testgroup");
    props.put("zookeeper.session.timeout.ms", "500");
    props.put("zookeeper.sync.time.ms", "250");
    props.put("auto.commit.interval.ms", "1000");
    new ConsumerConfig(props);

Now let us see the major properties mentioned in the code:

  • zookeeper.connect: This property specifies the ZooKeeper <node:port> connection detail that is used to find the ZooKeeper running instance in the cluster. In the Kafka cluster, ZooKeeper is used to store offsets of messages consumed for a specific topic and partition by this consumer group.
  • group.id: This property specifies the name for the consumer group shared by all the consumers within the group. This is also the process name used by ZooKeeper to store offsets.
  • zookeeper.session.timeout.ms: This property specifies the ZooKeeper session timeout in milliseconds and represents the amount of time Kafka will wait for ZooKeeper to respond to a request before giving up and continuing to consume messages.
  • zookeeper.sync.time.ms: This property specifies the ZooKeeper sync time in milliseconds between the ZooKeeper leader and the followers.
  • auto.commit.interval.ms: This property defines the frequency in milliseconds at which consumer offsets get committed to ZooKeeper.

Reading messages from a topic and printing them

As a final step, we need to read the message using the following code:

    Map<String, Integer> topicMap = new HashMap<String, Integer>();
    // 1 represents the single thread
    topicMap.put(topic, new Integer(1));

    Map<String, List<KafkaStream<byte[], byte[]>>> consumerStreamsMap =
        consumer.createMessageStreams(topicMap);

    // Get the list of message streams for each topic, using the default decoder.
    List<KafkaStream<byte[], byte[]>> streamList = consumerStreamsMap.get(topic);

    for (final KafkaStream<byte[], byte[]> stream : streamList) {
      ConsumerIterator<byte[], byte[]> consumerIte = stream.iterator();
      while (consumerIte.hasNext())
        System.out.println("Message from Single Topic :: "
            + new String(consumerIte.next().message()));
    }

So the complete program will look like the following code:

    package kafka.examples.ch5;

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;

    import kafka.consumer.ConsumerConfig;
    import kafka.consumer.ConsumerIterator;
    import kafka.consumer.KafkaStream;
    import kafka.javaapi.consumer.ConsumerConnector;

    public class SimpleHLConsumer {

      private final ConsumerConnector consumer;
      private final String topic;

      public SimpleHLConsumer(String zookeeper, String groupId, String topic) {
        consumer = kafka.consumer.Consumer
            .createJavaConsumerConnector(createConsumerConfig(zookeeper, groupId));
        this.topic = topic;
      }

      private static ConsumerConfig createConsumerConfig(String zookeeper,
          String groupId) {
        Properties props = new Properties();
        props.put("zookeeper.connect", zookeeper);
        props.put("group.id", groupId);
        props.put("zookeeper.session.timeout.ms", "500");
        props.put("zookeeper.sync.time.ms", "250");
        props.put("auto.commit.interval.ms", "1000");

        return new ConsumerConfig(props);
      }

      public void testConsumer() {
        Map<String, Integer> topicMap = new HashMap<String, Integer>();

        // Define single thread for topic
        topicMap.put(topic, new Integer(1));

        Map<String, List<KafkaStream<byte[], byte[]>>> consumerStreamsMap =
            consumer.createMessageStreams(topicMap);

        List<KafkaStream<byte[], byte[]>> streamList =
            consumerStreamsMap.get(topic);

        for (final KafkaStream<byte[], byte[]> stream : streamList) {
          ConsumerIterator<byte[], byte[]> consumerIte = stream.iterator();
          while (consumerIte.hasNext())
            System.out.println("Message from Single Topic :: "
                + new String(consumerIte.next().message()));
        }
        if (consumer != null)
          consumer.shutdown();
      }

      public static void main(String[] args) {
        String zooKeeper = args[0];
        String groupId = args[1];
        String topic = args[2];
        SimpleHLConsumer simpleHLConsumer = new SimpleHLConsumer(
            zooKeeper, groupId, topic);
        simpleHLConsumer.testConsumer();
      }
    }

Before running this, make sure you have created the topic kafkatopic from the command line:

    [root@localhost kafka_2.9.2-0.8.1.1]# bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 3 --topic kafkatopic

Before compiling and running a Java-based Kafka program in the console, make sure you download the slf4j-1.7.7.tar.gz file from http://www.slf4j.org/download.html and copy slf4j-log4j12-1.7.7.jar contained within slf4j-1.7.7.tar.gz to the /opt/kafka_2.9.2-0.8.1.1/libs directory. Also add all the libraries available in /opt/kafka_2.9.2-0.8.1.1/libs to the classpath using the following commands:

    [root@localhost kafka_2.9.2-0.8.1.1]# export KAFKA_LIB=/opt/kafka_2.9.2-0.8.1.1/libs
    [root@localhost kafka_2.9.2-0.8.1.1]# export CLASSPATH=.:$KAFKA_LIB/jopt-simple-3.2.jar:$KAFKA_LIB/kafka_2.9.2-0.8.1.1.jar:$KAFKA_LIB/log4j-1.2.15.jar:$KAFKA_LIB/metrics-core-2.2.0.jar:$KAFKA_LIB/scala-library-2.9.2.jar:$KAFKA_LIB/slf4j-api-1.7.2.jar:$KAFKA_LIB/slf4j-log4j12-1.7.7.jar:$KAFKA_LIB/snappy-java-1.0.5.jar:$KAFKA_LIB/zkclient-0.3.jar:$KAFKA_LIB/zookeeper-3.3.4.jar

Multithreaded Java consumers

The previous example is a very basic example of a consumer that consumes messages from a single broker with no explicit partitioning of messages within the topic. Let's jump to the next level and write another program that consumes messages from multiple partitions connecting to single/multiple topics.
A multithreaded, high-level, consumer-API-based design is usually based on the number of partitions in the topic and follows a one-to-one mapping approach between the thread and the partitions within the topic. For example, if four partitions are defined for any topic, as a best practice, only four threads should be initiated with the consumer application to read the data; otherwise, some conflicting behavior, such as threads never receiving a message or a thread receiving messages from multiple partitions, may occur. Also, receiving multiple messages will not guarantee that the messages will be placed in order. For example, a thread may receive two messages from the first partition and three from the second partition, then three more from the first partition, followed by some more from the first partition, even if the second partition has data available. Let's move further on. Importing classes As a first step, we need to import the following classes: import kafka.consumer.ConsumerConfig; import kafka.consumer.ConsumerIterator; import kafka.consumer.KafkaStream; import kafka.javaapi.consumer.ConsumerConnector; Defining properties As the next step, we need to define properties for making a connection with Zookeeper and pass these properties to the Kafka consumer using the following code: Properties props = new Properties(); props.put("zookeeper.connect", "localhost:2181"); props.put("group.id", "testgroup"); props.put("zookeeper.session.timeout.ms", "500"); props.put("zookeeper.sync.time.ms", "250"); props.put("auto.commit.interval.ms", "1000"); new ConsumerConfig(props); The preceding properties have already been discussed in the previous example. For more details on Kafka consumer properties, refer to the last section of this article. 
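The advice earlier in this section to match the thread count to the partition count can be made concrete with a small calculation. The sketch below is written in Python for brevity; it splits partition numbers into contiguous ranges per thread as a simple illustration, and it is not presented as Kafka's exact assignment algorithm. The function name assignPartitions is hypothetical.

```python
def assignPartitions(partitionCount, threadCount):
    'split partition numbers into contiguous ranges, one range per thread'
    perThread = partitionCount // threadCount
    extra = partitionCount % threadCount
    assignment = {}
    start = 0
    for thread in range(threadCount):
        # the first "extra" threads each take one additional partition
        size = perThread + (1 if thread < extra else 0)
        assignment[thread] = list(range(start, start + size))
        start += size
    return assignment

balanced = assignPartitions(4, 4)        # one partition per thread
tooFewThreads = assignPartitions(4, 2)   # each thread reads two partitions
tooManyThreads = assignPartitions(4, 6)  # two threads receive nothing
```

With four partitions, four threads give the one-to-one mapping described above, two threads each receive messages from multiple partitions, and six threads leave two threads that never receive a message.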
Reading the message from threads and printing it

The only difference in this section from the previous section is that we first create a thread pool and get the Kafka streams associated with each thread within the thread pool, as shown in the following code:

    // Define thread count for each topic
    topicMap.put(topic, new Integer(threadCount));

    // Here we have used a single topic but we can also add
    // multiple topics to topicMap
    Map<String, List<KafkaStream<byte[], byte[]>>> consumerStreamsMap =
        consumer.createMessageStreams(topicMap);

    List<KafkaStream<byte[], byte[]>> streamList = consumerStreamsMap.get(topic);

    // Launching the thread pool
    executor = Executors.newFixedThreadPool(threadCount);

The complete program listing for the multithreaded Kafka consumer based on the Kafka high-level consumer API is as follows:

    package kafka.examples.ch5;

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    import kafka.consumer.ConsumerConfig;
    import kafka.consumer.ConsumerIterator;
    import kafka.consumer.KafkaStream;
    import kafka.javaapi.consumer.ConsumerConnector;

    public class MultiThreadHLConsumer {

      private ExecutorService executor;
      private final ConsumerConnector consumer;
      private final String topic;

      public MultiThreadHLConsumer(String zookeeper, String groupId, String topic) {
        consumer = kafka.consumer.Consumer
            .createJavaConsumerConnector(createConsumerConfig(zookeeper, groupId));
        this.topic = topic;
      }

      private static ConsumerConfig createConsumerConfig(String zookeeper,
          String groupId) {
        Properties props = new Properties();
        props.put("zookeeper.connect", zookeeper);
        props.put("group.id", groupId);
        props.put("zookeeper.session.timeout.ms", "500");
        props.put("zookeeper.sync.time.ms", "250");
        props.put("auto.commit.interval.ms", "1000");

        return new ConsumerConfig(props);
      }

      public void shutdown() {
        if (consumer != null)
          consumer.shutdown();
        if (executor != null)
          executor.shutdown();
      }

      public void testMultiThreadConsumer(int threadCount) {
        Map<String, Integer> topicMap = new HashMap<String, Integer>();

        // Define thread count for each topic
        topicMap.put(topic, new Integer(threadCount));

        // Here we have used a single topic but we can also add
        // multiple topics to topicMap
        Map<String, List<KafkaStream<byte[], byte[]>>> consumerStreamsMap =
            consumer.createMessageStreams(topicMap);

        List<KafkaStream<byte[], byte[]>> streamList =
            consumerStreamsMap.get(topic);

        // Launching the thread pool
        executor = Executors.newFixedThreadPool(threadCount);

        // Create one message-consuming task per stream
        int count = 0;
        for (final KafkaStream<byte[], byte[]> stream : streamList) {
          final int threadNumber = count;
          executor.submit(new Runnable() {
            public void run() {
              ConsumerIterator<byte[], byte[]> consumerIte = stream.iterator();
              while (consumerIte.hasNext()) {
                System.out.println("Thread Number " + threadNumber + ": "
                    + new String(consumerIte.next().message()));
              }
              System.out.println("Shutting down Thread Number: " + threadNumber);
            }
          });
          count++;
        }
        // The consumer and executor are shut down from main() through
        // shutdown(), after the threads have had time to consume messages.
      }

      public static void main(String[] args) {
        String zooKeeper = args[0];
        String groupId = args[1];
        String topic = args[2];
        int threadCount = Integer.parseInt(args[3]);
        MultiThreadHLConsumer multiThreadHLConsumer =
            new MultiThreadHLConsumer(zooKeeper, groupId, topic);
        multiThreadHLConsumer.testMultiThreadConsumer(threadCount);
        try {
          Thread.sleep(10000);
        } catch (InterruptedException ie) {
        }
        multiThreadHLConsumer.shutdown();
      }
    }
Compile the preceding program. Before running it, make sure that your cluster is running as a multi-broker cluster (comprising either single or multiple nodes). Once your multi-broker cluster is up, create a topic with four partitions and set the replication factor to 2 using the following command:

    [root@localhost kafka-0.8]# bin/kafka-topics.sh --zookeeper localhost:2181 --create --topic kafkatopic --partitions 4 --replication-factor 2

The Kafka consumer property list

The following is a list of a few important properties that can be configured for high-level, consumer-API-based Kafka consumers. The Scala class kafka.consumer.ConsumerConfig provides implementation-level details for consumer configurations. For a complete list, visit http://kafka.apache.org/documentation.html#consumerconfigs.

Property name | Description | Default value
group.id | Defines a unique identity for the set of consumers within the same consumer group. | (none)
consumer.id | Specified for the Kafka consumer and generated automatically if not defined. | null
zookeeper.connect | Specifies the Zookeeper connection string, <hostname:port/chroot/path>. Kafka uses Zookeeper to store offsets of messages consumed for a specific topic and partition by the consumer group. /chroot/path defines the data location in a global Zookeeper namespace. | (none)
client.id | Specified by the Kafka client with each request and used to identify the client making the requests. | ${group.id}
zookeeper.session.timeout.ms | Defines the time (in milliseconds) for a Kafka consumer to wait for a Zookeeper heartbeat before it is declared dead and a rebalance is initiated. | 6000
zookeeper.connection.timeout.ms | Defines the maximum waiting time (in milliseconds) for the client to establish a connection with Zookeeper. | 6000
zookeeper.sync.time.ms | Defines how far (in milliseconds) a Zookeeper follower can lag behind the Zookeeper leader. | 2000
auto.commit.enable | Enables a periodic commit to Zookeeper of the offsets of messages already fetched by the consumer. In the event of consumer failures, these committed offsets are used as a starting position by the new consumers. | true
auto.commit.interval.ms | Defines the frequency (in milliseconds) at which the consumed offsets are committed to Zookeeper. | 60 * 1000
auto.offset.reset | Defines what to do if no initial offset is available in Zookeeper or the offset is out of range. Possible values are: largest (reset to the largest offset), smallest (reset to the smallest offset), anything else (throw an exception). | largest
consumer.timeout.ms | Throws an exception to the consumer if no message is available for consumption after the specified interval. | -1

Summary

In this article, we have learned how to write basic consumers and learned about some advanced levels of Java consumers that consume messages from partitions.

Resources for Article:

Further resources on this subject:
Introducing Kafka? [article]
Introduction To Apache Zookeeper [article]
Creating Apache Jmeter™ Test Workbench [article]

Packt
04 Mar 2015
18 min read

Prototyping Arduino Projects using Python

In this article by Pratik Desai, the author of Python Programming for Arduino, we will cover the following topics:

Working with pyFirmata methods
Servomotor – moving the motor to a certain angle
The Button() widget – interfacing GUI with Arduino and LEDs

(For more resources related to this topic, see here.)

Working with pyFirmata methods

The pyFirmata package provides useful methods to bridge the gap between Python and Arduino's Firmata protocol. Although these methods are described with specific examples, you can use them in various different ways. This section also provides a detailed description of a few additional methods.

Setting up the Arduino board

To set up your Arduino board in a Python program using pyFirmata, follow the steps below in order. We have distributed the code that is required for the setup process into small code snippets in each step. While writing your code, use the snippets that are appropriate for your application. You can always refer to the example Python files containing the complete code. Before we go ahead, make sure that your Arduino board is equipped with the latest version of the StandardFirmata program and is connected to your computer.

Depending upon the Arduino board that is being utilized, start by importing the appropriate pyFirmata class into the Python code. Currently, the inbuilt pyFirmata classes only support the Arduino Uno and Arduino Mega boards:

    from pyfirmata import Arduino

In case of Arduino Mega, use the following line of code:

    from pyfirmata import ArduinoMega

Before we start executing any methods that are associated with handling pins, the Arduino board has to be set up properly. To perform this task, we first have to identify the USB port to which the Arduino board is connected and assign this location to a variable in the form of a string object. For Mac OS X, the port string should look approximately like this:

    port = '/dev/cu.usbmodemfa1331'

For Windows, use the following string structure:

    port = 'COM3'

In the case of the Linux operating system, use the following line of code:

    port = '/dev/ttyACM0'

The port's location might be different according to your computer configuration. You can identify the correct location of your Arduino USB port by using the Arduino IDE.

Once you have imported the Arduino class and assigned the port to a variable object, it's time to engage Arduino with pyFirmata and assign the resulting board object to another variable:

    board = Arduino(port)

Similarly, for Arduino Mega, use this:

    board = ArduinoMega(port)

The synchronization between the Arduino board and pyFirmata requires some time. Adding sleep time between the preceding assignment and the next set of instructions can help to avoid any issues that are related to serial port buffering. The easiest way to add sleep time is to use the sleep(time) function from Python's built-in time module:

    from time import sleep
    sleep(1)

The sleep() function takes seconds as the parameter, and a floating-point number can be used to provide the specific sleep time. For example, for 200 milliseconds, it will be sleep(0.2).

At this point, you have successfully synchronized your Arduino Uno or Arduino Mega board to the computer using pyFirmata. What if you want to use a different variant of the Arduino board (other than Arduino Uno or Arduino Mega)?

Any board layout in pyFirmata is defined as a dictionary object. The following is a sample of the dictionary object for the Arduino board:

    arduino = {
        'digital' : tuple(x for x in range(14)),
        'analog' : tuple(x for x in range(6)),
        'pwm' : (3, 5, 6, 9, 10, 11),
        'use_ports' : True,
        'disabled' : (0, 1) # Rx, Tx, Crystal
        }

For your variant of the Arduino board, you have to first create a custom dictionary object. To create this object, you need to know the hardware layout of your board.
For example, an Arduino Nano board has a layout similar to a regular Arduino board, but it has eight instead of six analog ports. Therefore, the preceding dictionary object can be customized as follows:

    nano = {
        'digital' : tuple(x for x in range(14)),
        'analog' : tuple(x for x in range(8)),
        'pwm' : (3, 5, 6, 9, 10, 11),
        'use_ports' : True,
        'disabled' : (0, 1) # Rx, Tx, Crystal
        }

As you have already synchronized the Arduino board earlier, modify the layout of the board using the setup_layout(layout) method:

    board.setup_layout(nano)

This command will modify the default layout of the synchronized Arduino board to the Arduino Nano layout or any other variant for which you have customized the dictionary object.

Configuring Arduino pins

Once your Arduino board is synchronized, it is time to configure the digital and analog pins that are going to be used as part of your program. The Arduino board has digital I/O pins and analog input pins that can be utilized to perform various operations. As we already know, some of these digital pins are also capable of PWM.

The direct method

Before we start writing or reading any data to these pins, we have to first assign modes to these pins. In an Arduino sketch, we use the pinMode function for this operation, for example, pinMode(11, INPUT). Similarly, in pyFirmata, this assignment is performed by setting the mode attribute of a pin on the board object, as shown in the following code snippet:

    from pyfirmata import Arduino
    from pyfirmata import INPUT, OUTPUT, PWM

    # Setting up Arduino board
    port = '/dev/cu.usbmodemfa1331'
    board = Arduino(port)

    # Assigning modes to digital pins
    board.digital[13].mode = OUTPUT
    board.analog[0].mode = INPUT

The pyFirmata library includes constants for the INPUT and OUTPUT modes, which have to be imported before you use them. The preceding example configures digital pin 13 as an output and analog pin 0 as an input.

The mode is set on the variable assigned to the configured Arduino board, using the digital[] and analog[] arrays to index the pin. The pyFirmata library also supports additional modes such as PWM and SERVO. The PWM mode is used to get analog results from digital pins, while the SERVO mode lets a digital pin set the angle of a servomotor shaft between 0 and 180 degrees. If you are using any of these modes, import the appropriate constants from the pyFirmata library. Once they are imported from the pyFirmata package, the modes for the appropriate pins can be assigned using the following lines of code:

    board.digital[3].mode = PWM
    board.digital[10].mode = SERVO

Assigning pin modes

The direct method of configuring a pin is mostly used for single-line calls. In a project containing a large amount of code and complex logic, it is convenient to assign a pin with its role to a variable object. With an assignment like this, you can later use the assigned variable throughout the program for various actions, instead of calling the direct method every time you need to use that pin. In pyFirmata, this assignment can be performed using the get_pin(pin_def) method:

    from pyfirmata import Arduino
    port = '/dev/cu.usbmodemfa1331'
    board = Arduino(port)

    # pin mode assignment
    ledPin = board.get_pin('d:13:o')

The get_pin() method lets you assign pin modes using the pin_def string parameter, 'd:13:o'. The three components of pin_def are pin type, pin number, and pin mode, separated by colons (:). The pin types (analog and digital) are denoted with a and d respectively. The get_pin() method supports three modes: i for input, o for output, and p for PWM. In the previous code sample, 'd:13:o' specifies the digital pin 13 as an output. In another example, if you want to set up the analog pin 1 as an input, the parameter string will be 'a:1:i'.

Working with pins

Now that you have configured your Arduino pins, it's time to start performing actions using them.
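The pin_def format is simple enough to illustrate with a few lines of plain Python. The following sketch uses a hypothetical helper, parse_pin_def(), which is not part of pyFirmata; it only shows how a string such as 'd:13:o' breaks down into the three components described above, and it covers only the three modes named in the text.

```python
# Hypothetical helper for illustration; pyFirmata does this parsing internally.
PIN_TYPES = {'a': 'analog', 'd': 'digital'}
PIN_MODES = {'i': 'input', 'o': 'output', 'p': 'pwm'}


def parse_pin_def(pin_def):
    """Split a 'type:number:mode' string into (type, number, mode)."""
    parts = pin_def.split(':')
    if len(parts) != 3:
        raise ValueError("expected 'type:number:mode', got %r" % pin_def)
    pin_type, number, mode = parts
    if pin_type not in PIN_TYPES:
        raise ValueError("unknown pin type %r" % pin_type)
    if mode not in PIN_MODES:
        raise ValueError("unknown pin mode %r" % mode)
    return PIN_TYPES[pin_type], int(number), PIN_MODES[mode]


led = parse_pin_def('d:13:o')     # digital pin 13 as an output
sensor = parse_pin_def('a:1:i')   # analog pin 1 as an input
```

Keeping the whole pin description in one short string is what makes get_pin() convenient in larger programs: the pin's role is readable at the point where the variable is created.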
Two different types of methods are supported while working with pins: reporting methods and I/O operation methods.

Reporting data

When pins get configured in a program as analog input pins, they start sending input values to the serial port. If the program does not utilize this incoming data, the data starts getting buffered at the serial port and quickly overflows. The pyFirmata library provides the reporting and iterator methods to deal with this phenomenon.

The enable_reporting() method is used to set the input pin to start reporting. This method needs to be called before performing a reading operation on the pin:

    board.analog[3].enable_reporting()

Once the reading operation is complete, the pin can be set to disable reporting:

    board.analog[3].disable_reporting()

In the preceding example, we assumed that you have already set up the Arduino board and configured the mode of the analog pin 3 as INPUT.

The pyFirmata library also provides the Iterator() class to read and handle data over the serial port. While working with analog pins, we recommend that you start an iterator thread in the main loop to update the pin value to the latest one. If the iterator is not used, the buffered data might overflow your serial port. This class is defined in the util module of the pyFirmata package and needs to be imported before it is used in the code:

    from pyfirmata import Arduino, util
    from time import sleep

    # Setting up the Arduino board
    port = 'COM3'
    board = Arduino(port)
    sleep(5)

    # Start Iterator to avoid serial overflow
    it = util.Iterator(board)
    it.start()

Manual operations

As we have configured the Arduino pins to suitable modes and set their reporting characteristics, we can start working with them. The pyFirmata library provides the write() and read() methods for the configured pins.

The write() method

The write() method is used to write a value to the pin. If the pin's mode is set to OUTPUT, the value parameter is a Boolean, that is, 0 or 1:

    board.digital[pin].mode = OUTPUT
    board.digital[pin].write(1)

If you have used the alternative method of assigning the pin's mode, you can use the write() method as follows:

    ledPin = board.get_pin('d:13:o')
    ledPin.write(1)

In case of the PWM signal, the Arduino accepts a value between 0 and 255 that represents a duty cycle between 0 and 100 percent. The pyFirmata library simplifies working with PWM values: instead of a value between 0 and 255, you can just provide a float value between 0 and 1.0. For example, if you want a 50 percent duty cycle (2.5V analog value), you can specify 0.5 with the write() method. The pyFirmata library will take care of the translation and send the appropriate value, that is, roughly 127, to the Arduino board via the Firmata protocol:

    board.digital[pin].mode = PWM
    board.digital[pin].write(0.5)

Similarly, for the indirect method of assignment, you can use code similar to the following, choosing one of the PWM-capable pins (3, 5, 6, 9, 10, or 11 on the Arduino Uno):

    pwmPin = board.get_pin('d:11:p')
    pwmPin.write(0.5)

If you are using the SERVO mode, you need to provide the value in degrees between 0 and 180. Unfortunately, the SERVO mode is only applicable for direct assignment of the pins and will be available in future for indirect assignments:

    board.digital[pin].mode = SERVO
    board.digital[pin].write(90)

The read() method

The read() method returns the current value at the specified Arduino pin. When the Iterator() class is being used, the value received using this method is the latest updated value at the serial port. When you read a digital pin, you can get only one of two values, HIGH or LOW, which translate to 1 or 0 in Python:

    board.digital[pin].read()

The analog pins of Arduino linearly translate the input voltages between 0 and +5V to values between 0 and 1023. However, in pyFirmata, the voltages between 0 and +5V are linearly translated into float values between 0 and 1.0.
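The two scalings described in this section, duty cycle to PWM value and input voltage to a reading, are plain linear arithmetic. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, a 5V reference is assumed as in the text, and values are truncated to integers, which may differ from the rounding pyFirmata or the Arduino actually applies.

```python
# Illustrative arithmetic only; not pyFirmata's internal implementation.
PWM_MAX = 255    # 8-bit PWM range used by the Arduino
ADC_MAX = 1023   # 10-bit analog-to-digital converter range
V_REF = 5.0      # reference voltage assumed in the text


def duty_to_pwm(duty):
    """Map a duty cycle in [0.0, 1.0] to an integer in [0, 255] (truncating)."""
    if not 0.0 <= duty <= 1.0:
        raise ValueError("duty must be between 0.0 and 1.0")
    return int(duty * PWM_MAX)


def volts_to_adc(volts):
    """Map a voltage in [0, 5] V to the 0..1023 value an Arduino sketch sees."""
    if not 0.0 <= volts <= V_REF:
        raise ValueError("volts must be between 0 and 5")
    return int(volts / V_REF * ADC_MAX)


def volts_to_float(volts):
    """Map a voltage in [0, 5] V to the 0.0..1.0 float that read() reports."""
    if not 0.0 <= volts <= V_REF:
        raise ValueError("volts must be between 0 and 5")
    return volts / V_REF


half_duty = duty_to_pwm(0.5)        # 127 with truncation
one_volt_adc = volts_to_adc(1.0)    # 204
one_volt_float = volts_to_float(1.0)  # 0.2
```

Working in the 0 to 1.0 range on the Python side means the same code can be reused regardless of the resolution of the underlying hardware.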
For example, if the voltage at the analog pin is 1V, an Arduino program will measure a value somewhere around 204, but you will receive the float value as 0.2 while using pyFirmata's read() method in Python.

Servomotor – moving the motor to a certain angle

Servomotors are widely used electronic components in applications such as pan-tilt camera control, robotic arms, mobile robot movements, and so on, where precise movement of the motor shaft is required. This precise control of the motor shaft is possible because of the position sensing decoder, which is an integral part of the servomotor assembly. A standard servomotor allows the angle of the shaft to be set between 0 and 180 degrees. The pyFirmata library provides the SERVO mode, which can be used on every digital pin. This prototyping exercise provides a template and guidelines to interface a servomotor with Python.

Connections

Typically, a servomotor has wires that are color-coded red, black, and yellow, which connect to the power, ground, and signal of the Arduino board respectively. Connect the power and the ground of the servomotor to the 5V and the ground of the Arduino board. As displayed in the following diagram, connect the yellow signal wire to the digital pin 13:

If you want to use any other digital pin, make sure that you change the pin number in the Python program in the next section. Once you have made the appropriate connections, let's move on to the Python program.

The Python code

The Python file containing this code is named servoCustomAngle.py and is located in the code bundle of this book, which can be downloaded from https://www.packtpub.com/books/content/support/19610. Open this file in your Python editor.

Like other examples, the starting section of the program contains the code to import the libraries and set up the Arduino board:

    from pyfirmata import Arduino, SERVO
    from time import sleep

    # Setting up the Arduino board
    port = 'COM5'
    board = Arduino(port)
    # Need to give some time to pyFirmata and Arduino to synchronize
    sleep(5)

Now that you have Python ready to communicate with the Arduino board, let's configure the digital pin that is going to be used to connect the servomotor to the Arduino board. We will complete this task by setting the mode of pin 13 to SERVO:

    # Set mode of the pin 13 as SERVO
    pin = 13
    board.digital[pin].mode = SERVO

The setServoAngle(pin, angle) custom function takes the pin to which the servomotor is connected and the custom angle as input parameters. This function can be used as a part of various large projects that involve servos:

    # Custom function to set Servo motor angle
    def setServoAngle(pin, angle):
      board.digital[pin].write(angle)
      sleep(0.015)

In the main logic of this template, we want to incrementally move the motor shaft in one direction until it reaches the maximum achievable angle (180 degrees) and then move it back to the original position with the same incremental speed. In the while loop, we will ask the user to provide input to continue this routine, which will be captured using the raw_input() function. The user can enter the character y to continue this routine or enter any other character to abort the loop:

    # Testing the function by rotating motor in both directions
    while True:
      # range() excludes its stop value, so 181 is needed to reach 180
      for i in range(0, 181):
        setServoAngle(pin, i)
      # Sweep back down to the original position (0 degrees)
      for i in range(180, -1, -1):
        setServoAngle(pin, i)

      # Continue or break the testing process
      i = raw_input("Enter 'y' to continue or Enter to quit: ")
      if i == 'y':
        pass
      else:
        board.exit()
        break

While working with all these prototyping examples, we used the direct communication method by using digital and analog pins to connect the sensor with Arduino.
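Python's range() excludes its stop value, which is easy to overlook when writing sweep loops like the one above. As a minimal sketch, the hypothetical helper sweep_angles() below (not part of pyFirmata or the book's code bundle) builds the full list of angles for one sweep, from 0 up to the maximum and back down to 0, so that the end points can be checked without any hardware attached.

```python
def sweep_angles(max_angle=180, step=1):
    """Return the angles for one sweep: 0 up to max_angle, then back to 0."""
    if step <= 0:
        raise ValueError("step must be positive")
    # range() stops before its stop value, hence max_angle + 1
    up = list(range(0, max_angle + 1, step))
    # Walk back over the same angles, skipping the repeated top value
    down = up[-2::-1]
    return up + down


coarse = sweep_angles(180, 45)   # [0, 45, 90, 135, 180, 135, 90, 45, 0]
fine = sweep_angles()            # one-degree steps: 0..180, then 179..0
```

Each angle in the returned list could then be passed to a function such as setServoAngle(pin, angle) in a single loop.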
Now, let's get familiar with another widely used communication method between Arduino and the sensors. This is called I2C communication.

The Button() widget – interfacing GUI with Arduino and LEDs

Now that you have had your first hands-on experience in creating a Python graphical interface, let's integrate Arduino with it. Python makes it easy to interface various heterogeneous packages with each other, and that is what you are going to do. In the next coding exercise, we will use Tkinter and pyFirmata to make the GUI work with Arduino. In this exercise, we are going to use the Button() widget to control the LEDs interfaced with the Arduino board.

Before we jump to the exercises, let's build the circuit that we will need for all upcoming programs. The following is a Fritzing diagram of the circuit, where we use two different colored LEDs, each with a series (current-limiting) resistor. Connect these LEDs to digital pins 10 and 11 on your Arduino Uno board, as displayed in the following diagram:

While working with the code provided in this section, you will have to replace the Arduino port that is used to define the board variable according to your operating system. Also, make sure that you provide the correct pin number in the code if you are planning to use any pins other than 10 and 11. For some exercises, you will have to use the PWM pins, so make sure that you have the correct pins.

You can use the entire code snippet as a Python file and run it. However, this might not be possible in the upcoming exercises due to the length of the program and the complexity involved. For the Button() widget exercise, open the exampleButton.py file. The code contains three main components:

pyFirmata and Arduino configurations
Tkinter widget definitions for a button
The LED blink function that gets executed when you press the button

As you can see in the following code snippet, we have first imported libraries and initialized the Arduino board using the pyFirmata methods. For this exercise, we are only going to work with one LED, and we have initialized only the ledPin variable for it:

    import Tkinter
    import pyfirmata
    from time import sleep
    port = '/dev/cu.usbmodemfa1331'
    board = pyfirmata.Arduino(port)
    sleep(5)
    ledPin = board.get_pin('d:11:o')

As we are using the pyFirmata library for all the exercises in this article, make sure that you have uploaded the latest version of the standard Firmata sketch on your Arduino board.

In the second part of the code, we have initialized the root Tkinter widget as top and provided a title string. We have also fixed the minimum size of this window using the minsize() method. In order to get more familiar with the root widget, you can play around with the minimum and maximum size of the window:

    top = Tkinter.Tk()
    top.title("Blink LED using button")
    top.minsize(300,30)

The Button() widget is a standard Tkinter widget that is mostly used to obtain manual, external input from the user. Like the Label() widget, the Button() widget can be used to display text or images. Unlike the Label() widget, it can be associated with actions or methods that run when it is pressed. When the button is pressed, Tkinter executes the methods or commands specified by the command option:

    startButton = Tkinter.Button(top,
                                 text="Start",
                                 command=onStartButtonPress)
    startButton.pack()

In this initialization, the function associated with the button is onStartButtonPress and the "Start" string is displayed as the title of the button. Similarly, the top object specifies the parent or the root widget. Once the button is instantiated, you will need to use the pack() method to make it available in the main window. Note that the onStartButtonPress function has to be defined before the line that creates the button is executed.

The onStartButtonPress() function includes the code that is required to blink the LED and change the state of the button. A button's state can be NORMAL, ACTIVE, or DISABLED. If it is not specified, the default state of any button is NORMAL. The ACTIVE and DISABLED states are useful in applications where repeated pressing of the button needs to be avoided. After turning the LED on using the write(1) method, we will add a time delay of 5 seconds using the sleep(5) function before turning it off with the write(0) method:

    def onStartButtonPress():
      startButton.config(state=Tkinter.DISABLED)
      ledPin.write(1)
      # LED is on for the fixed amount of time specified below
      sleep(5)
      ledPin.write(0)
      startButton.config(state=Tkinter.ACTIVE)

At the end of the program, we will execute the mainloop() method to initiate the Tkinter loop. Until this function is executed, the main window won't appear.

To run the code, make appropriate changes to the Arduino board variable and execute the program. A window with a button and a title bar will appear as the output of the program. Clicking on the Start button will turn on the LED on the Arduino board for the specified time delay. Meanwhile, when the LED is on, you will not be able to click on the Start button again. In this particular program, we haven't provided sufficient code to safely disengage the Arduino board; this will be covered in upcoming exercises.

Summary

In this article, we learned about the Python library pyFirmata, which interfaces Arduino with your computer using the Firmata protocol. We built a prototype using pyFirmata and Arduino to control a servomotor and also developed another one with a GUI, based on the Tkinter library, to control LEDs.

Resources for Article:

Further resources on this subject:
Python Functions : Avoid Repeating Code? [article]
Python 3 Designing Tasklist Application [article]
The Five Kinds Of Python Functions Python 3.4 Edition [article]

Packt
04 Mar 2015
10 min read

Deployment Scenarios

This article by Andrea Gazzarini, author of the book Apache Solr Essentials, contains information on the various ways in which you can deploy Solr, including key features and pros and cons for each scenario. Solr has a wide range of deployment alternatives, from monolithic to distributed indexes and standalone to clustered instances. We will organize this article by deployment scenarios, with a growing level of complexity. This article will cover the following topics:

Sharding
Replication: master, slave, and repeaters

(For more resources related to this topic, see here.)

Standalone instance

All the examples use a standalone instance of Solr, that is, one or more cores managed by a Solr deployment hosted in a standalone servlet container (for example, Jetty, Tomcat, and so on). This kind of deployment is useful for development because, as you learned, it is very easy to start and debug. Besides, it can also be suitable for a production context if you don't have strict non-functional requirements and have a small or medium amount of data. I have used a standalone instance to provide autocomplete services for small and medium intranet systems. Anyway, the main features of this kind of deployment are simplicity and maintainability; one simple node acts as both an indexer and a searcher. The following diagram depicts a standalone instance with two cores:

Shards

When a monolithic index becomes too large for a single node or when additions, deletions, or queries take too long to execute, the index can be split into multiple pieces called shards. The previous sentence highlights a logical and theoretical evolution path of a Solr index. However, this (in general) is valid for all the scenarios we will describe. It is strongly recommended that you perform a preliminary analysis of your data and its estimated growth factor in order to decide from the beginning the right configuration that suits your requirements.
Although it is possible to split an existing index into shards (https://lucene.apache.org/core/4_10_3/misc/org/apache/lucene/index/PKIndexSplitter.html), things definitely become easier if you start directly with a distributed index (if you need it, of course). The index is split so that each shard contains a disjoint subset of the entire index. Solr will query and merge results across those shards. The following diagram illustrates a Solr deployment with 3 nodes; this deployment consists of two cores (C1 and C2) divided into three shards (S1, S2, and S3):

When using shards, only query requests are distributed. This means that it's up to the indexer to add and distribute the data across nodes, and to subsequently forward a change request (that is, delete, replace, and commit) for a given document to the appropriate shard (the shard that owns the document). The Solr Wiki recommends a simple, hash-based algorithm to determine the shard where a given document should be indexed:

    documentId.hashCode() % numServers

Using this approach is also useful in order to know in advance where to send delete or update requests for a given document.

On the opposite side, a searcher client will send a query request to any node, but it has to specify an additional shards parameter that declares the target shards that will be queried. In the following example, assuming that two shards are hosted in two servers listening on ports 8080 and 8081, the same request when sent to both nodes will produce the same result:

    http://localhost:8080/solr/c1/query?q=*:*&shards=localhost:8080/solr/c1,localhost:8081/solr/c2
    http://localhost:8081/solr/c2/query?q=*:*&shards=localhost:8080/solr/c1,localhost:8081/solr/c2

When sending a query request, a client can optionally include, in the field list (the fl parameter), a pseudofield associated with the [shard] transformer. In this case, as a part of each returned document, there will be additional information indicating the owning shard. This is an example of such a request:

    http://localhost:8080/solr/c1/query?q=*:*&shards=localhost:8080/solr/c1,localhost:8081/solr/c2&fl=*,src_shard:[shard]

Here is the corresponding response (note the pseudofield aliased as src_shard):

    <result name="response" numFound="192" start="0">
      <doc>
        <str name="id">9920</str>
        <str name="brand">Fender</str>
        <str name="model">Jazz Bass</str>
        <arr name="artist">
          <str>Marcus Miller</str>
        </arr>
        <str name="series">Marcus Miller signature</str>
        <str name="src_shard">localhost:8080/solr/shard1</str>
      </doc>
      …
      <doc>
        <str name="id">4392</str>
        <str name="brand">Music Man</str>
        <str name="model">Sting Ray</str>
        <arr name="artist"><str>Tony Levin</str></arr>
        <str name="series">5 strings DeLuxe</str>
        <str name="src_shard">localhost:8081/solr/shard2</str>
      </doc>
    </result>

The following are a few things to keep in mind when using this deployment scenario:

The schema must have a uniqueKey field. This field must be declared as stored and indexed; in addition, it is supposed to be unique across all shards.
Inverse Document Frequency (IDF) calculations cannot be distributed. IDF is computed per shard.
Joins between documents belonging to different shards are not supported.
If a shard receives both index and query requests, the index may change during a query execution, thus compromising the outgoing results (for example, a matching document that has been deleted).

Master/slaves scenario

In a master/slaves scenario, there are two types of Solr servers: an indexer (the master) and one or more searchers (the slaves). The master is the server that manages the index. It receives update requests and applies those changes. A searcher, on the other hand, is a Solr server that exposes search services to external clients.
The index, in terms of data files, is replicated from the indexer to the searcher through HTTP by means of a built-in RequestHandler that must be configured on both the indexer side and searcher side (within the solrconfig.xml configuration file). On the indexer (master), a replication configuration looks like this: <requestHandler    name="/replication"  class="solr.ReplicationHandler">    <lst name="master">      <str name="replicateAfter">startup</str>      <str name="replicateAfter">optimize</str>      <str name="confFiles">schema.xml,stopwords.txt</str>    </lst> </requestHandler> The replication mechanism can be configured to be triggered after one of the following events: Commit: A commit has been applied Optimize: The index has been optimized Startup: The Solr instance has started In the preceding example, we want the index to be replicated after startup and optimize commands. Using the confFiles parameter, we can also indicate a set of configuration files (schema.xml and stopwords.txt, in the example) that must be replicated together with the index. Remember that changes on those files don't trigger any replication. Only a change in the index, in conjunction with one of the events we defined in the replicateAfter parameter, will mark the index (and the configuration files) as replicable. On the searcher side, the configuration looks like the following: <requestHandler name="/replication" class="solr.ReplicationHandler"> <lst name="slave">    <str name="masterUrl">http://<localhost>:<port>/solrmaster</str>    <str name="pollInterval">00:00:10</str> </lst> </requestHandler> You can see that a searcher periodically keeps polling the master (the pollInterval parameter) to check whether a newer version of the index is available. If it is, the searcher will start the replication mechanism by issuing a request to the master, which is completely unaware of the searchers. The replicability status of the index is actually indicated by a version number. 
If the searcher has the same version as the master, it means the index is the same. If the versions are different, it means that a newer version of the index is available on the master, and replication can start. Other than separating responsibilities, this deployment configuration allows us to have a so-called diamond architecture, consisting of one indexer and several searchers. When the replication is triggered, each searcher in the ring will receive a whole copy of the index. This allows the following: Load balancing of the incoming (query) requests. An increment to the availability of the whole system. In the event of a server crash, the other searchers will continue to serve the incoming requests. The following diagram illustrates a master/slave deployment scenario with one indexer, three searchers, and two cores: If the searchers are in several geographically dislocated data centers, an additional role called repeater can be configured in each data center in order to rationalize the replication data traffic flow between nodes. A repeater is simply a node that acts as both a master and a slave. It is a slave of the main master, and at the same time, it acts as master of the searchers within the same data center, as shown in this diagram: Shards with replication This scenario combines shards and replication in order to have a scalable system with high throughput and availability. There is one indexer and one or more searchers for each shard, allowing load balancing between (query) shard requests. The following diagram illustrates a scenario with two cores, three shards, one indexer, and (due to problems with available space), only one searcher for each shard: The drawback of this approach is undoubtedly the overall growing complexity of the system that requires more effort in terms of maintainability, manageability, and system administration. 
In addition to this, each searcher is an independent node, and we don't have a central administration console where a system administrator can get a quick overview of system health.
Summary
In this article, we described various ways in which you can deploy Solr. Each deployment scenario has specific features, advantages, and drawbacks that make a choice ideal for one context and bad for another. A good thing is that the different scenarios are not strictly exclusive; they follow an incremental approach. In an ideal context, things should start immediately with the perfect scenario that fits your needs. However, unless your requirements are clear right from the start, you can begin with a simple configuration and then change it, depending on how your application evolves.
Packt
04 Mar 2015
20 min read

AngularJS Performance

In this article by Chandermani, the author of AngularJS by Example, we focus our discussion on the performance aspect of AngularJS. For most scenarios, we can all agree that AngularJS is insanely fast. For standard size views, we rarely see any performance bottlenecks. But many views start small and then grow over time. And sometimes the requirement dictates we build large pages/views with a sizable amount of HTML and data. In such a case, there are things that we need to keep in mind to provide an optimal user experience. Take any framework and the performance discussion on the framework always requires one to understand the internal working of the framework. When it comes to Angular, we need to understand how Angular detects model changes. What are watches? What is a digest cycle? What roles do scope objects play? Without a conceptual understanding of these subjects, any performance guidance is merely a checklist that we follow without understanding the why part. Let's look at some pointers before we begin our discussion on performance of AngularJS: The live binding between the view elements and model data is set up using watches. When a model changes, one or many watches linked to the model are triggered. Angular's view binding infrastructure uses these watches to synchronize the view with the updated model value. Model change detection only happens when a digest cycle is triggered. Angular does not track model changes in real time; instead, on every digest cycle, it runs through every watch to compare the previous and new values of the model to detect changes. A digest cycle is triggered when $scope.$apply is invoked. 
A number of directives and services internally invoke $scope.$apply:
Directives such as ng-click, ng-mouse* do it on user action
Services such as $http and $resource do it when a response is received from the server
$timeout or $interval call $scope.$apply when they lapse
A digest cycle tracks the old value of the watched expression and compares it with the new value to detect if the model has changed. Simply put, the digest cycle is a workflow used to detect model changes. A digest cycle runs multiple times till the model data is stable and no watch is triggered. Once you have a clear understanding of the digest cycle, watches, and scopes, we can look at some performance guidelines that can help us manage views as they start to grow.
Performance guidelines
When building any Angular app, any performance optimization boils down to:
Minimizing the number of binding expressions and hence watches
Making sure that binding expression evaluation is quick
Optimizing the number of digest cycles that take place
The next few sections provide some useful pointers in this direction. Remember, a lot of these optimizations may only be necessary if the view is large.
Keeping the page/view small
The sanest advice is to keep the amount of content available on a page small. The user cannot interact with or process too much data on a page, so remember that screen real estate is at a premium and only keep necessary details on a page. The less content there is, the fewer the binding expressions; hence, fewer watches and less processing are required during the digest cycle. Remember, each watch adds to the overall execution time of the digest cycle. The time required for a single watch can be insignificant but, after combining hundreds and maybe thousands of them, they start to matter. Angular's data binding infrastructure is insanely fast and relies on a rudimentary dirty check that compares the old and the new values.
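The cost described above can be made concrete with a toy dirty-checking loop. This is a minimal sketch in plain JavaScript and not Angular's implementation; createScope, watch, digest, and evaluationCount are hypothetical names chosen for illustration. It only aims to show that every digest pass evaluates every registered watch, and that at least one extra pass is needed to confirm the model is stable.

```javascript
// Toy dirty-checking loop; hypothetical names, not Angular's real Scope API.
function createScope(model) {
  const watchers = [];
  let evaluations = 0; // how many watch expressions have been evaluated so far

  return {
    model,
    // Register a watch: `getter` reads a value, `listener` reacts to a change.
    watch(getter, listener) {
      watchers.push({ getter, listener, last: undefined });
    },
    // Run passes over every watch until one full pass sees no change.
    digest() {
      let dirty;
      let passes = 0;
      do {
        dirty = false;
        passes += 1;
        if (passes > 10) throw new Error('model did not stabilize');
        for (const w of watchers) {
          evaluations += 1;
          const value = w.getter(model);
          if (value !== w.last) {
            w.listener(value, w.last);
            w.last = value;
            dirty = true;
          }
        }
      } while (dirty);
      return passes;
    },
    evaluationCount() {
      return evaluations;
    },
  };
}

const scope = createScope({ title: 'Workout', count: 1 });
const rendered = {};
scope.watch(m => m.title, v => { rendered.title = v; });
scope.watch(m => m.count, v => { rendered.count = v; });

const passes = scope.digest();
console.log(passes, scope.evaluationCount()); // logs: 2 4
```

With two watches, a single digest already costs four evaluations (two passes over two watches). The same arithmetic applies to a view with thousands of bindings, which is why trimming watches pays off.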
Check out the Stack Overflow (SO) post (http://stackoverflow.com/questions/9682092/databinding-in-angularjs), where Misko Hevery (creator of Angular) talks about how data binding works in Angular. Data binding also adds to the memory footprint of the application. Each watch has to track the current and previous value of a data-binding expression to compare and verify if data has changed. Keeping a page/view small may not always be possible, and the view may grow. In such a case, we need to make sure that the number of bindings does not grow exponentially (linear growth is OK) with the page size. The next two tips can help minimize the number of bindings in the page and should be seriously considered for large views.
Optimizing watches for read-once data
In any Angular view, there is always content that, once bound, does not change. Any read-only data on the view can fall into this category. This implies that once the data is bound to the view, we no longer need watches to track model changes, as we don't expect the model to update. Is it possible to remove the watch after one-time binding? Angular versions before 1.3 do not have anything built in, but a community project, bindonce (https://github.com/Pasvaz/bindonce), fills this gap. Angular 1.3 added native support for bind and forget. Using the syntax {{::title}}, we can achieve one-time binding. If you are on Angular 1.3 or later, use it!
Hiding (ng-show) versus conditional rendering (ng-if/ng-switch) content
You have learned two ways to conditionally render content in Angular. The ng-show/ng-hide directive shows/hides the DOM element based on the expression provided and ng-if/ng-switch creates and destroys the DOM based on an expression. For some scenarios, ng-if can be really beneficial as it can reduce the number of binding expressions/watches for the DOM content not rendered.
Consider the following example:
<div ng-if='user.isAdmin'>
  <div ng-include="'admin-panel.html'"></div>
</div>
The snippet renders an admin panel if the user is an admin. With ng-if, if the user is not an admin, the ng-include directive template is neither requested nor rendered, saving us all the bindings and watches that are part of the admin-panel.html view. From the preceding discussion, it may seem that we should get rid of all ng-show/ng-hide directives and use ng-if. Well, not really! It again depends; for small pages, ng-show/ng-hide works just fine. Also, remember that there is a cost to creating and destroying the DOM. If the expression to show/hide flips too often, this will mean too many DOM create-and-destroy cycles, which are detrimental to the overall performance of the app.
Expressions being watched should not be slow
Since watches are evaluated very often, the expression being watched should return results fast. The first way we can make sure of this is by using properties instead of functions to bind expressions. These expressions are as follows:
{{user.name}}
ng-show='user.Authorized'
The preceding code is always better than this:
{{getUserName()}}
ng-show='isUserAuthorized(user)'
Try to minimize function expressions in bindings. If a function expression is required, make sure that the function returns a result quickly. Make sure a function being watched does not:
Make any remote calls
Use $timeout/$interval
Perform sorting/filtering
Perform DOM manipulation (this can happen inside directive implementation)
Or perform any other time-consuming operation
Be sure to avoid such operations inside a bound function. To reiterate, Angular will evaluate a watched expression multiple times during every digest cycle just to know if the return value (a model) has changed and the view needs to be synchronized.
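The difference between binding to a function and binding to a property can be sketched with a small simulation. This is an illustrative sketch in plain JavaScript, not Angular code: simulateDigests is a hypothetical helper that stands in for repeated evaluation of a bound expression, and refreshUserName is an assumed helper that a controller would call when the model changes.

```javascript
// Simulated digest: each pass re-evaluates the bound expression.
// `simulateDigests` is a hypothetical helper, not an Angular API.
function simulateDigests(expression, passes) {
  let last;
  for (let i = 0; i < passes; i += 1) {
    last = expression();
  }
  return last;
}

const user = { firstName: 'Ada', lastName: 'Lovelace' };

// Function binding: the result is rebuilt on every evaluation.
let functionCalls = 0;
function getUserName() {
  functionCalls += 1;
  return [user.firstName, user.lastName].join(' ');
}

// Property binding: compute once when the model changes, then just read it.
let computeCalls = 0;
function refreshUserName() {
  computeCalls += 1;
  user.name = [user.firstName, user.lastName].join(' ');
}
refreshUserName();

const fromFunction = simulateDigests(getUserName, 50);
const fromProperty = simulateDigests(() => user.name, 50);

console.log(functionCalls, computeCalls, fromFunction === fromProperty);
// logs: 50 1 true
```

Both bindings produce the same text, but the function version does its work once per evaluation while the property version does it once per model change. The gap widens quickly when the function sorts, filters, or touches the DOM.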
Minimizing the deep model watch When using $scope.$watch to watch for model changes in controllers, be careful while setting the third $watch function parameter to true. The general syntax of watch looks like this: $watch(watchExpression, listener, [objectEquality]); In the standard scenario, Angular does an object comparison based on the reference only. But if objectEquality is true, Angular does a deep comparison between the last value and new value of the watched expression. This can have an adverse memory and performance impact if the object is large. Handling large datasets with ng-repeat The ng-repeat directive undoubtedly is the most useful directive Angular has. But it can cause the most performance-related headaches. The reason is not because of the directive design, but because it is the only directive that allows us to generate HTML on the fly. There is always the possibility of generating enormous HTML just by binding ng-repeat to a big model list. Some tips that can help us when working with ng-repeat are: Page data and use limitTo: Implement a server-side paging mechanism when a number of items returned are large. Also use the limitTo filter to limit the number of items rendered. Its syntax is as follows: <tr ng-repeat="user in users |limitTo:pageSize">…</tr> Look at modules such as ngInfiniteScroll (http://binarymuse.github.io/ngInfiniteScroll/) that provide an alternate mechanism to render large lists. Use the track by expression: The ng-repeat directive for performance tries to make sure it does not unnecessarily create or delete HTML nodes when items are added, updated, deleted, or moved in the list. To achieve this, it adds a $$hashKey property to every model item allowing it to associate the DOM node with the model item. We can override this behavior and provide our own item key using the track by expression such as: <tr ng-repeat="user in users track by user.id">…</tr> This allows us to use our own mechanism to identify an item. 
Using your own track by expression has a distinct advantage over the default hash key approach. Consider an example where you make an initial AJAX call to get users:
$scope.getUsers().then(function(users){ $scope.users = users;})
Later, we refresh the data from the server and do something similar again:
$scope.users = users;
With user.id as a key, Angular is able to determine which elements were added, deleted, or moved, and it creates or deletes DOM nodes only for those elements. Remaining elements are not touched by ng-repeat (internal bindings are still evaluated). This saves a lot of CPU cycles for the browser as fewer DOM elements are created and destroyed.
Do not bind ng-repeat to a function expression: Using a function's return value for ng-repeat can also be problematic, depending upon how the function is implemented. Consider a repeat with this:
<tr ng-repeat="user in getUsers()">…</tr>
And consider the controller getUsers function with this:
$scope.getUsers = function() {
  var orderBy = $filter('orderBy');
  return orderBy($scope.users, predicate);
}
Angular is going to evaluate this expression and hence call this function every time the digest cycle takes place. A lot of CPU cycles are wasted sorting user data again and again. It is better to use scope properties and presort the data before binding.
Minimize filters in views, use filter elements in the controller: Filters defined on ng-repeat are also evaluated every time the digest cycle takes place. For large lists, if the same filtering can be implemented in the controller, we can avoid constant filter evaluation. This holds true for any filter function that is used with arrays including filter and orderBy.
Avoiding mouse-movement tracking events
The ng-mousemove, ng-mouseenter, ng-mouseleave, and ng-mouseover directives can just kill performance.
If an expression is attached to any of these event directives, Angular triggers a digest cycle every time the corresponding event occurs and for events like mouse move, this can be a lot. We have already seen this behavior when working with 7 Minute Workout, when we tried to show a pause overlay on the exercise image when the mouse hovers over it. Avoid them at all costs. If we just want to trigger some style changes on mouse events, CSS is a better tool.
Avoiding calling $scope.$apply
Angular is smart enough to call $scope.$apply at appropriate times without us explicitly calling it. This can be confirmed from the fact that the only place we have seen and used $scope.$apply is within directives. The ng-click and updateOnBlur directives use $scope.$apply to transition from a DOM event handler execution to an Angular execution context. Even when wrapping a jQuery plugin, we may need to do a similar transition for an event raised by the jQuery plugin. Other than this, there is no reason to use $scope.$apply. Remember, every invocation of $apply results in the execution of a complete digest cycle. The $timeout and $interval services take a Boolean argument invokeApply. If set to false, the lapsed $timeout/$interval service does not call $scope.$apply or trigger a digest cycle. Therefore, if you are going to perform background operations that do not require $scope and the view to be updated, set the last argument to false. Always use Angular wrappers such as $timeout and $interval over the standard JavaScript objects/functions to avoid manually calling $scope.$apply. These wrapper functions internally call $scope.$apply. Also, understand the difference between $scope.$apply and $scope.$digest. $scope.$apply triggers $rootScope.$digest, which evaluates all application watches, whereas $scope.$digest only performs dirty checks on the current scope and its children.
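The difference in reach between the two calls can be sketched with a toy scope tree. This is a simplified model in plain JavaScript under stated assumptions: ToyScope is a hypothetical class, each scope only stores a count of its watches, and a single pass is counted rather than the repeated passes a real digest performs.

```javascript
// Toy scope tree; hypothetical class, not Angular's real Scope implementation.
class ToyScope {
  constructor(watchCount, parent = null) {
    this.watchCount = watchCount; // number of watches registered on this scope
    this.parent = parent;
    this.children = [];
    if (parent) parent.children.push(this);
  }

  // Walk up to the root of the tree.
  root() {
    return this.parent ? this.parent.root() : this;
  }

  // Modeled on $digest: check this scope and its descendants only.
  // Returns how many watches were checked in one pass.
  digest() {
    let checked = this.watchCount;
    for (const child of this.children) {
      checked += child.digest();
    }
    return checked;
  }

  // Modeled on $apply: the check always starts from the root scope.
  apply() {
    return this.root().digest();
  }
}

const rootScope = new ToyScope(5);
const pageA = new ToyScope(100, rootScope);
const pageB = new ToyScope(200, rootScope);
const widget = new ToyScope(10, pageA);

console.log(widget.digest(), pageA.digest(), widget.apply());
// logs: 10 110 315
```

In this model, a change confined to the widget costs 10 checks through digest and 315 through apply, because apply also visits pageB and the root, which the change cannot affect.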
If we are sure that the model changes are not going to affect anything other than the child scopes, we can use $scope.$digest instead of $scope.$apply. Lazy-loading, minification, and creating multiple SPAs I hope you are not assuming that the apps that we have built will continue to use the numerous small script files that we have created to separate modules and module artefacts (controllers, directives, filters, and services). Any modern build system has the capability to concatenate and minify these files and replace the original file reference with a unified and minified version. Therefore, like any JavaScript library, use minified script files for production. The problem with the Angular bootstrapping process is that it expects all Angular application scripts to be loaded before the application can bootstrap. We cannot load modules, controllers, or in fact, any of the other Angular constructs on demand. This means we need to provide every artefact required by our app, upfront. For small applications, this is not a problem as the content is concatenated and minified; also, the Angular application code itself is far more compact as compared to the traditional JavaScript of jQuery-based apps. But, as the size of the application starts to grow, it may start to hurt when we need to load everything upfront. There are at least two possible solutions to this problem; the first one is about breaking our application into multiple SPAs. Breaking applications into multiple SPAs This advice may seem counterintuitive as the whole point of SPAs is to get rid of full page loads. By creating multiple SPAs, we break the app into multiple small SPAs, each supporting parts of the overall app functionality. When we say app, it implies a combination of the main (such as index.html) page with ng-app and all the scripts/libraries and partial views that the app loads over time. 
For example, we can break the Personal Trainer application into a Workout Builder app and a Workout Runner app. Both have their own start up page and scripts. Common scripts such as the Angular framework scripts and any third-party libraries can be referenced in both the applications. On similar lines, common controllers, directives, services, and filters too can be referenced in both the apps. The way we have designed Personal Trainer makes it easy to achieve our objective. The segregation into what belongs where has already been done. The advantage of breaking an app into multiple SPAs is that only relevant scripts related to the app are loaded. For a small app, this may be an overkill but for large apps, it can improve the app performance. The challenge with this approach is to identify what parts of an application can be created as independent SPAs; it totally depends upon the usage pattern of the application. For example, assume an application has an admin module and an end consumer/user module. Creating two SPAs, one for admin and the other for the end customer, is a great way to keep user-specific features and admin-specific features separate. A standard user may never transition to the admin section/area, whereas an admin user can still work on both areas; but transitioning from the admin area to a user-specific area will require a full page refresh. If breaking the application into multiple SPAs is not possible, the other option is to perform the lazy loading of a module. Lazy-loading modules Lazy-loading modules or loading module on demand is a viable option for large Angular apps. But unfortunately, Angular itself does not have any in-built support for lazy-loading modules. Furthermore, the additional complexity of lazy loading may be unwarranted as Angular produces far less code as compared to other JavaScript framework implementations. Also once we gzip and minify the code, the amount of code that is transferred over the wire is minimal. 
If we still want to try our hands on lazy loading, there are two libraries that can help: ocLazyLoad (https://github.com/ocombe/ocLazyLoad): This is a library that uses script.js to load modules on the fly angularAMD (http://marcoslin.github.io/angularAMD): This is a library that uses require.js to lazy load modules With lazy loading in place, we can delay the loading of a controller, directive, filter, or service script, until the page that requires them is loaded. The overall concept of lazy loading seems to be great but I'm still not sold on this idea. Before we adopt a lazy-load solution, there are things that we need to evaluate: Loading multiple script files lazily: When scripts are concatenated and minified, we load the complete app at once. Contrast it to lazy loading where we do not concatenate but load them on demand. What we gain in terms of lazy-load module flexibility we lose in terms of performance. We now have to make a number of network requests to load individual files. Given these facts, the ideal approach is to combine lazy loading with concatenation and minification. In this approach, we identify those feature modules that can be concatenated and minified together and served on demand using lazy loading. For example, Personal Trainer scripts can be divided into three categories: The common app modules: This consists of any script that has common code used across the app and can be combined together and loaded upfront The Workout Runner module(s): Scripts that support workout execution can be concatenated and minified together but are loaded only when the Workout Runner pages are loaded. The Workout Builder module(s): On similar lines to the preceding categories, scripts that support workout building can be combined together and served only when the Workout Builder pages are loaded. As we can see, there is a decent amount of effort required to refactor the app in a manner that makes module segregation, concatenation, and lazy loading possible. 
The effect on unit and integration testing: We also need to evaluate the effect of lazy-loading modules in unit and integration testing. The way we test is also affected with lazy loading in place. This implies that, if lazy loading is added as an afterthought, the test setup may require tweaking to make sure existing tests still run.
Given these facts, we should evaluate our options and check whether we really need lazy loading or we can manage by breaking a monolithic SPA into multiple smaller SPAs.
Caching remote data wherever appropriate
Caching data is one of the oldest tricks to improve any webpage/application performance. Analyze your GET requests and determine what data can be cached. Once such data is identified, it can be cached from a number of locations. Data cached outside the app can be cached in:
Servers: The server can cache repeated GET requests to resources that do not change very often. This whole process is transparent to the client and the implementation depends on the server stack used.
Browsers: In this case, the browser caches the response. Browser caching depends upon the server sending HTTP cache headers such as ETag and cache-control to guide the browser about how long a particular resource can be cached. Browsers can honor these cache headers and cache data appropriately for future use.
If server and browser caching is not available or if we also want to incorporate any amount of caching in the client app, we do have some choices:
Cache data in memory: A simple Angular service can cache the HTTP response in memory. Since an Angular app is an SPA, the data is not lost unless the page refreshes.
This is how a service function looks when it caches data:
var workouts;
service.getWorkouts = function () {
  if (workouts) return $q.resolve(workouts);
  return $http.get("/workouts").then(function (response){
      workouts = response.data;
      return workouts;
  });
};
The implementation caches a list of workouts into the workouts variable for future use. The first request makes an HTTP call to retrieve data, but subsequent requests just return the cached data wrapped in a promise. The usage of $q.resolve makes sure that the function always returns a promise.
Angular $http cache: Angular's $http service comes with a configuration option cache. When set to true, $http caches the response of the particular GET request into a local cache (again an in-memory cache). Here is how we cache a GET request:
$http.get(url, { cache: true});
Angular keeps this cache for the lifetime of the app, and clearing it is not easy. We need to get hold of the cache dedicated to caching HTTP responses and clear the cache key manually.
The caching strategy of an application is never complete without a cache invalidation strategy. With a cache, there is always a possibility that the cached data is out of sync with the actual data store. We cannot affect the server-side caching behavior from the client; consequently, let's focus on how to perform cache invalidation (clearing) for the two client-side caching mechanisms described earlier. If we use the first approach to cache data, we are responsible for clearing the cache ourselves. In the case of the second approach, the default $http service does not support clearing the cache.
We either need to get hold of the underlying $http cache store and clear the cache key manually (as shown here) or implement our own cache that manages cache data and invalidates the cache based on some criteria:
var cache = $cacheFactory.get('$http');
cache.remove("http://myserver/workouts"); //full url
Using Batarang to measure performance
Batarang (a Chrome extension), as we have already seen, is an extremely handy tool for Angular applications. Using Batarang to visualize app usage is like looking at an X-ray of the app. It allows us to:
View the scope data, scope hierarchy, and how the scopes are linked to HTML elements
Evaluate the performance of the application
Check the application dependency graph, helping us understand how components are linked to each other, and with other framework components
If we enable Batarang and then play around with our application, Batarang captures performance metrics for all watched expressions in the app. This data is presented as a graph available on the Performance tab inside Batarang. When building an app, use Batarang to gauge the most expensive watches and take corrective measures, if required. Play around with Batarang and see what other features it has.
This brings us to the end of the performance guidelines that we wanted to share in this article. Some of these guidelines are preventive measures that we should take to make sure we get optimal app performance whereas others are there to help when the performance is not up to the mark.
Summary
In this article, we looked at the ever-so-important topic of performance, where you learned ways to optimize the performance of an Angular app.
Packt
04 Mar 2015
19 min read

Native MS Security Tools and Configuration

This article, written by Santhosh Sivarajan, the author of Getting Started with Windows Server Security, will introduce another powerful Microsoft tool called Microsoft Security Compliance Manager (SCM). As its name suggests, it is a platform for managing and maintaining your security and compliance policies. At this point, we have established baseline security based on your business requirements, using Microsoft SCW. These policies can be a pure reflection of your business requirements. However, in an enterprise world, you have to consider compliance, regulations, other industry standards, and best practices to maximize the effectiveness of the security policy. That's where Microsoft SCM can provide more business value. We will talk more about the included SCM baselines later in the article. The goal of the article is to walk you through the configuration and administration process of Microsoft SCM and explain how it can be used in an enterprise environment to support your security needs. Then we will talk about a method to maintain the desired state of the server using a Microsoft tool called Attack Surface Analyzer (ASA). At the end of the article, you will see an option to add more security restrictions using another Microsoft tool called AppLocker.
Microsoft SCM
Microsoft SCM is a centralized security and compliance policy manager product from Microsoft. It is a standalone application. Microsoft develops these baselines and best practice recommendations based on customer feedback and other agencies' recommendations. These policies are regularly reviewed and updated. So, it is important that you are using the latest policy baseline. If there is a new policy, you will be able to download and update the baseline from the Microsoft SCM console itself.
Since Microsoft SCM supports multiple input and output formats such as XML, Group Policy Objects (GPO), Desired Configuration Management (DCM), Security Content Automation Protocol (SCAP), and so on, it can be a centralized platform for your network infrastructure and other security and compliance products. It is also possible to integrate SCM with Microsoft System Center 2012 Process Pack for IT GRC. More details can be found at http://technet.microsoft.com/en-us/library/dd206732.aspx. Installing Microsoft SCM We will start with the installation process. As mentioned earlier, it is a standalone product. It uses Microsoft SQL Server 2008 or higher as the database. If you don't have a SQL database already installed on your system, the SCM installation process will automatically install Microsoft SQL Server 2008 Express Edition. You can perform the following steps to install Microsoft SCM: Download Microsoft Security Compliance Manager from http://www.microsoft.com/en-us/download/details.aspx?id=16776. Double-click on Security_Compliance_Manager_Setup.exe to start the installation process. Click on Next on the welcome window. Make sure to select the Always check for SCM and baseline updates option. Accept the License Agreement option and click on Next. Select the installation folder from the Installation Folder window by clicking on the Browse button. Click on Next. On the Microsoft SQL Server 2008 Express window, click on Next to install Microsoft SQL Server 2008 Express Edition. If you have Microsoft SQL Server already installed on your system, you can select the correct server details from this window. Accept the License Agreement option for SQL Server 2008 Express and click on Next. Click on Install on the Ready to Install window to begin the installation. You will see the progress in the Installing the Microsoft Security Compliance Manager window. If it asks you to restart the computer, click on OK. Click on Finish to complete the installation. 
This section provides a high-level overview of the product before starting the administration and management process. The left pane of the SCM console provides the list of all available baselines. This is the baseline library inside SCM. The center pane displays more information based on your policy selection from the baseline library. The right pane, also called the Actions pane, provides commands and options to manage your policies. As you can see in the following screenshot, it provides a few options to export these policies into different formats. So, if you have a different compliance manager tool, you can use these files with your existing tool. SCM – Export options. In line with other products, Microsoft SCM supports different severity levels: critical, important, optional, and none. On a custom policy, the severity level can be changed to None, Important, Optional, or Critical based on your requirements. For each of these settings, you will see additional details and reference articles (CCE, OVAL, and so on) in the Setting Details section.

Administering Microsoft SCM

This section provides you with an overview of Microsoft SCM and some administration procedures to create and manage policies. These tasks can be achieved by performing the following steps: Open Security Compliance Manager. If you see a Download Updates popup window, click on the Download button to start the download and complete the database update process. Security Compliance Manager consists of two main sections: Custom Baselines and Microsoft Baselines. We will go through the details later in this article. SCM – Baselines. Expand Microsoft Baselines. Since we are focusing more on Windows Server 2012, I will start with this section. Select the Windows Server 2012 node. This node contains predefined security policies based on Microsoft and industry best practices. I will use the predefined WS2012 Web Server Security template for this exercise.
You will not be able to make changes to the settings in the default template. If you need to make changes, you can make a copy of the template and make changes there. Select the WS2012 Web Server Security template. From the right pane, select the Duplicate option. In the Duplicate window, enter the name for this new security policy. Click on Save. The new template will be saved under the Custom Baselines node. You can review the policy and make necessary changes in the newly created policy.

Creating and implementing security policies

At this point, you have installed SCM and are familiar with the basic administration tasks. From this section onwards, you will be working on a real-world scenario where you will be exporting a policy from Active Directory, importing it into SCM, merging it with an SCM baseline, and importing it back into Active Directory. In this section, our goal is to export the existing web server policy, merge it with an SCM baseline, and import it back into Active Directory.

Exporting GPO from Active Directory

We will start by exporting the existing web server policy from Active Directory. The following steps can be performed to export (back up) an Active Directory GPO-based policy: Open the Group Policy Management console. Expand Forest | Domain | Domain Name | Group Policy Objects. Right-click on the appropriate GPO and select Back Up. GPO – Back up. In the Back Up Group Policy Object window, enter the Location and Description details for the backup file. Click on the Back Up button to start the backup operation. You will see the progress in the Backup window. Click on OK when it completes the backup operation. A GPO can also be backed up using the Backup-GPO PowerShell cmdlet. The following is an example:

Backup-GPO -Name "WebServerbaselineV2.0" -Path "D:\Backup" -Comment "Baseline Backup"

The backup is written to a folder named with a GUID that identifies the backup.

Importing GPO into SCM

An exported GPO-based policy can be imported directly into SCM.
An administrator can perform the following steps to complete this task: Open Microsoft Security Compliance Manager. From the Import section on the right pane, select the GPO Backup (Folder) option. SCM – Import. In the Browse For Folder window, select the GPO backup folder. Click on OK. In the GPO Name window, confirm or change the baseline name. Click on OK. In the SCM Log window, you will see the status. Click on OK to close the window. You will see the imported policy under Custom Baselines | GPO Import | Policy Name. Currently, SCM supports importing from GPO backup and SCM CAB files. If you have some other policy or baseline (for example, DISA STIGs) that you would like to import into SCM, you need to import these policies into Active Directory first, and then export them as a GPO backup before you can import them into SCM.

Merging imported GPO with the SCM baseline policy

The third step in this process is to merge the imported policy with the SCM baseline policy. Keep in mind that some configurations and settings will be lost when you merge an existing GPO with the SCM baseline policy. For example, service-related or ACL configurations may not be preserved when you associate and merge with an SCM baseline policy. If you have these types of configuration in your GPO and want to retain them, you may need to split the GPO and use two separate GPOs. During the import, SCM tries to map these configurations to settings in its library in order to preserve them. Settings that do not match or map are dropped from the new baseline policy. For this exercise, my assumption is that you don't have custom configurations or settings in the imported policy. The following steps can be used to associate and merge a GPO-based policy into an SCM-based policy: Select the imported policy in Microsoft Security Compliance Manager.
From the right pane, select the Associate option from the Baseline section. Selecting the Associate option. From the Associate Product with GPO window, select the appropriate baseline policy. Since we are working with a Windows Server 2012 policy, I will be selecting Windows Server 2012 as the product. If you have a different operating system, select the correct policy from the product list. Click on Associate. Your custom policy must have unique settings in the baseline policy in order to associate a custom policy with the SCM baseline policy; otherwise, the Associate button will be grayed out. Enter a name for this policy in the Baseline Policy window. You will see this policy in the Custom Baselines | Windows Server 2012 section. Select this policy. From the right pane, select the Compare/Merge option from the Baseline section. Selecting the Compare / Merge option. Now you have associated your policy with an SCM baseline policy. The next step is to compare and merge your policy with a baseline SCM policy. From the Compare Baseline window, select the appropriate baseline policy. Since we are working with a web server baseline, we will be selecting WS2012 Web Server Security 1.0 as the policy. Click on OK. You will see the result in the Compare Baselines window. You can review the settings that differ and the settings that match here. Since we are planning to merge these two policies, we will be selecting the Merge Baselines option. You will see the summary report in the Merge Baselines window. Click on OK. In the Specify a name for the merged baseline window, enter a new name for this policy. Click on OK. This merged policy will be stored in the Custom Baselines | Windows Server 2012 section.

Exporting the SCM baseline policy

At this point, you have created a new policy that contains your custom policy and best practices provided by SCM. The next step is to export this policy to a supported format.
Since we are dealing with Active Directory and GPO, we will be exporting it into a GPO-based policy. You can perform the following steps to export an SCM policy to a GPO-based backup policy: Select the policy from Microsoft Security Compliance Manager. From the Export section, select the GPO Backup (Folder) option. GPO Backup (Folder). From the Browse for Folder window, select the folder to store this policy in. Click on OK.

Importing a policy into Active Directory

The final step in this process is to import these settings back to Active Directory. This can be achieved by using Group Policy Management Console (GPMC). The following steps can be used to import an SCM-based policy into Active Directory: Open Group Policy Management Console. Expand Forest | Domain | Domain Name | Group Policy Objects. Right-click on the appropriate policy. Select the Import Settings option. The Import Settings option. Click on Next in the Welcome window. It is always a best practice to back up the existing settings. Click on Backup to continue with the backup operation. Once you have completed the backup, click on Next in the Backup GPO window. In the Backup Location window, select the backup location folder. Click on Next. Confirm the GPO name in the Source GPO window. Click on Next. You will see the scanning settings in the Scanning Backup window. Click on Next to continue. Click on Finish in the Completing the Import Settings Wizard window to complete the import operation. Click on OK in the Import window.

Maintaining and monitoring the integrity of a baseline policy

Once you have baseline security in place, whether it is a true business policy or a combination of business and industry practices, you will need to maintain this state to ensure security and integrity. The idea is to compare your baseline image with the current image in order to validate the settings. There are many ways to achieve this.
Microsoft has a free tool called Attack Surface Analyzer (ASA) that can be used to compare two states of the system. The details and capabilities of this tool can be found at http://www.microsoft.com/en-us/download/details.aspx?id=24487.

Microsoft ASA

An administrator can perform the following steps to install, configure, and generate an Attack Surface Report using Microsoft ASA: Download Attack Surface Analyzer from http://www.microsoft.com/en-us/download/details.aspx?id=24487. Complete the installation. It is a standalone, simple MSI installation process. Open the Attack Surface Analyzer tool. The first step is to create the baseline state. Select the Run New Scan option and enter a name for the CAB file. Click on Run Scan to start the scanning process. You will see the status and progress in the Collecting Data window. When it completes, it will create a CAB file with the result. The second step in this process is to capture the current state of the server so that it can be compared with the baseline. You will need to create another report (the product CAB) to compare with the baseline CAB. Select the Run New Scan option again and enter a name for the product CAB file. Click on Run Scan to start the scanning process. Complete the CAB creation process. The third step in the process is to compare the baseline CAB with the product CAB to get the delta. Select the Generate Standard Attack Surface Report option. In the Select Options section, select the baseline CAB name, select the product CAB name, and enter a name for the attack report. Click on Generate to start the process. You will see the status in the Running Analysis window. The report will be opened automatically in the web browser. This report has three sections: Report Summary, Security Issues, and Attack Surface. The following is an example of a Security Issues report.

Application control and management

At this point, you have a baseline policy for your server platform.
Now we can add more restrictions based on your requirements to provide a more secure environment. In the following section, my plan is to introduce an option to "blacklist" and "whitelist" some of the applications using a built-in native option called AppLocker. The details of the AppLocker application can be found at http://technet.microsoft.com/en-us/library/hh831409.aspx.

AppLocker

AppLocker policies are part of Application Control Policies in GPOs. There are four types of built-in rules: Executable, Windows Installer, Script, and Packaged app rules. Before you create or enforce a policy, you need to perform an inventory check to identify the current usage of these applications in your environment. AppLocker has an inventory process called Auditing that helps you to achieve this. In this scenario, our goal is to block unauthorized access of the NLTEST application from all servers.

Creating a policy

As the first step, you need to identify the current usage of the application in your environment. The following steps can be performed to create a new AppLocker policy in an Active Directory environment: Open Group Policy Management Console. Expand Forest | Domain | Domain Name. Right-click on the Group Policy Objects node and select New. Enter a name for the GPO in the New GPO window. Leave Source Starter GPO as (none). Click on OK. This will create a new blank GPO in the Group Policy Objects node. We will be using this GPO to configure the AppLocker settings. Right-click on the newly created GPO and select Edit. This will open the Group Policy Management Editor window. Expand Policies | Windows Settings | Security Settings | AppLocker. Right-click on Executable Rules and select Create Default Rules. These default rules allow users and built-in administrators to run default programs and administrators to run files and applications. Based on your requirements, you can modify and delete these rules.
The default AppLocker rules allow everyone to run files located in the Windows and Program Files folders, and allow members of the built-in Administrators group to run all files. The default AppLocker rule. Expand Policies | Windows Settings | Security Settings | AppLocker. Right-click on Executable Rules and select Create New Rule. Click on Next in the Create Executable Rules window. In the Permission window, select Deny. In the User or Group section, click on Select and select the Server Admins group. Here, I have created a security group with all server administrators in that group. In the Conditions window, select the File Hash option. Click on Next. In the File Hash window, select the correct file name using the Browse File option. In this scenario, I will be selecting the NLTEST.exe file. Click on Next. In the Name and Description window, select or enter an appropriate name for this rule. Click on Create.

Auditing a policy

The next step in this process is to audit the previously created policies to ensure that there will not be any adverse effects on your environment. An administrator can perform the following steps to audit an existing policy in an Active Directory environment: Right-click on AppLocker (Policies | Windows Settings | Security Settings) and go to Properties. On the Enforcement tab, select appropriate rule types as Configured. From the drop-down list, select the rule as Audit only. Click on OK. GPO – AppLocker policy. You can see the application usage and history in the Event log. Open Event Viewer. Navigate to Applications and Services Logs | Microsoft | Windows | AppLocker. Based on your policy configuration, you will see the appropriate event information in the AppLocker section. In an enterprise world, manually checking the items in an event log is not going to be a viable option. You have a few options available to automate this process.
You can forward the event log to a central server (Event Forwarding) and verify from that single console, or you can use the Get-WinEvent PowerShell cmdlet to collect these events remotely. The following section provides an option to evaluate these logs using the Get-WinEvent PowerShell cmdlet. By default, AppLocker events are located in the Applications and Services Logs | Microsoft | Windows | AppLocker section of the Event Viewer. The following command collects all AppLocker-related events from Server01 and writes them to the output file Server01.txt:

Get-WinEvent -ComputerName "SERVER01.MYINFRALAB.COM" -LogName *AppLocker* | fl | out-file Server01.txt

Typical events that you will see in the event log include 8002 (the file was allowed to run), 8003 (the file was allowed to run, but would have been blocked if the policy were enforced), and 8004 (the file was blocked). If you have multiple computers to evaluate, you can create a simple PowerShell script to automatically input the computer names. The following is a sample PowerShell script. The Servers.txt file will be your input file that contains all of the server names, one per line:

$OutPut = "C:\Input\Output.txt"
Get-Content "C:\Input\Servers.txt" | Foreach-Object {
    # Write the server name as a header, then append that server's AppLocker events
    $_ | out-file $OutPut -Append -Encoding ascii
    Get-WinEvent -ComputerName $_ -LogName *AppLocker* | fl | out-file $OutPut -Append -Encoding ascii
}

Implementing the policy

Once you have verified the audit result, you can enforce the policy using the AppLocker GPO. The following steps can be used to implement the AppLocker GPO in an Active Directory environment: Open Group Policy Management Console. Expand the Forest | Domain | Domain Name | Group Policy Objects node. Right-click on the Server Application Restriction GPO and select Edit. This will open a Group Policy Management Editor MMC window. Opening the Group Policy Management Editor MMC window. From Group Policy Management Editor, expand Policies | Windows Settings | Security Settings. Right-click on AppLocker and select Properties. In the AppLocker Properties window, change Executable rules to Enforce rules.
Click on OK. Close the Group Policy Management Editor MMC window. The new policy will apply to the server based on your Active Directory replication interval and GPO refresh cycle. You can use the GPUPDATE /Force command to force a GPO refresh on a local server. Two different results are possible. The user Johndoe was denied the execution of the NLTEST.exe application. Since the following user was part of the Server Admins group, the user was allowed to execute the NLTEST.exe application. Some additional security recommendations to consider when installing and configuring AppLocker are included at http://technet.microsoft.com/en-us/library/ee844118(WS.10).aspx.

AppLocker and PowerShell

AppLocker supports PowerShell, and it has a PowerShell module called AppLocker. An administrator can create, test, and troubleshoot the AppLocker policies using these cmdlets. You need to import the AppLocker module before these cmdlets can be used. The following are the supported cmdlets in the module: Get-AppLockerFileInformation, Get-AppLockerPolicy, New-AppLockerPolicy, Set-AppLockerPolicy, and Test-AppLockerPolicy.

Summary

We started this article with baseline security for your server platform, which was originally created using Microsoft SCW. In this article, you learned how to incorporate this policy with the baseline and best practice recommendations using Microsoft SCM. Then you used AppLocker to enforce more application-based security. We also learned how to monitor the state of the server and compare it with the baseline to identify security vulnerabilities and issues using Microsoft ASA.

Packt
04 Mar 2015
38 min read

KnockoutJS Templates

In this article by Jorge Ferrando, author of the book KnockoutJS Essentials, we are going to talk about how to design our templates with the native engine, and then about mechanisms and external libraries we can use to improve the Knockout template engine. When our code begins to grow, it's necessary to split it into several parts to keep it maintainable. When we split JavaScript code, we are talking about modules, classes, functions, libraries, and so on. When we talk about HTML, we call these parts templates. KnockoutJS has a native template engine that we can use to manage our HTML. It is very simple, but it also has a big inconvenience: all templates must be loaded in the current HTML page. This is not a problem if our app is small, but it could become one as our application needs more and more templates.

Preparing the project

First of all, we are going to add some style to the page. Add a file called style.css into the css folder. Add a reference in the index.html file, just below the bootstrap reference.
The following is the content of the file: .container-fluid { margin-top: 20px; } .row { margin-bottom: 20px; } .cart-unit { width: 80px; } .btn-xs { font-size:8px; } .list-group-item { overflow: hidden; } .list-group-item h4 { float:left; width: 100px; } .list-group-item .input-group-addon { padding: 0; } .btn-group-vertical > .btn-default { border-color: transparent; } .form-control[disabled], .form-control[readonly] { background-color: transparent !important; } Now remove all the content from the body tag except for the script tags and paste in these lines: <div class="container-fluid"> <div class="row" id="catalogContainer">    <div class="col-xs-12"       data-bind="template:{name:'header'}"></div>    <div class="col-xs-6"       data-bind="template:{name:'catalog'}"></div>    <div id="cartContainer" class="col-xs-6 well hidden"       data-bind="template:{name:'cart'}"></div> </div> <div class="row hidden" id="orderContainer"     data-bind="template:{name:'order'}"> </div> <div data-bind="template: {name:'add-to-catalog-modal'}"></div> <div data-bind="template: {name:'finish-order-modal'}"></div> </div> Let's review this code. We have two row classes. They will be our containers. The first container is named with the id value as catalogContainer and it will contain the catalog view and the cart. The second one is referenced by the id value as orderContainer and we will set our final order there. We also have two more <div> tags at the bottom that will contain the modal dialogs to show the form to add products to our catalog and the other one will contain a modal message to tell the user that our order is finished. Along with this code you can see a template binding inside the data-bind attribute. This is the binding that Knockout uses to bind templates to the element. It contains a name parameter that represents the ID of a template. 
<div class="col-xs-12" data-bind="template:{name:'header'}"></div>

In this example, this <div> element will contain the HTML that is inside the <script> tag with the ID header.

Creating templates

Template elements are commonly declared at the bottom of the body, just above the <script> tags that have references to our external libraries. We are going to define some templates and then we will talk about each one of them:

<!-- templates -->
<script type="text/html" id="header"></script>
<script type="text/html" id="catalog"></script>
<script type="text/html" id="add-to-catalog-modal"></script>
<script type="text/html" id="cart-widget"></script>
<script type="text/html" id="cart-item"></script>
<script type="text/html" id="cart"></script>
<script type="text/html" id="order"></script>
<script type="text/html" id="finish-order-modal"></script>

Each template name is descriptive enough by itself, so it's easy to know what we are going to put inside them. Let's see where we place each template on the screen. Notice that the cart-item template will be repeated for each item in the cart collection. Modal templates will appear only when a modal dialog is displayed. Finally, the order template is hidden until we click to confirm the order. In the header template, we will have the title and the menu of the page. The add-to-catalog-modal template will contain the modal that shows the form to add a product to our catalog. The cart-widget template will show a summary of our cart. The cart-item template will contain the template of each item in the cart. The cart template will have the layout of the cart. The order template will show the final list of products we want to buy and a button to confirm our order.
The header template Let's begin with the HTML markup that should contain the header template: <script type="text/html" id="header"> <h1>    Catalog </h1>   <button class="btn btn-primary btn-sm" data-toggle="modal"     data-target="#addToCatalogModal">    Add New Product </button> <button class="btn btn-primary btn-sm" data-bind="click:     showCartDetails, css:{ disabled: cart().length < 1}">    Show Cart Details </button> <hr/> </script> We define a <h1> tag, and two <button> tags. The first button tag is attached to the modal that has the ID #addToCatalogModal. Since we are using Bootstrap as the CSS framework, we can attach modals by ID using the data-target attribute, and activate the modal using the data-toggle attribute. The second button will show the full cart view and it will be available only if the cart has items. To achieve this, there are a number of different ways. The first one is to use the CSS-disabled class that comes with Twitter Bootstrap. This is the way we have used in the example. CSS binding allows us to activate or deactivate a class in the element depending on the result of the expression that is attached to the class. The other method is to use the enable binding. This binding enables an element if the expression evaluates to true. We can use the opposite binding, which is named disable. There is a complete documentation on the Knockout website http://knockoutjs.com/documentation/enable-binding.html: <button class="btn btn-primary btn-sm" data-bind="click:   showCartDetails, enable: cart().length > 0"> Show Cart Details </button>   <button class="btn btn-primary btn-sm" data-bind="click:   showCartDetails, disable: cart().length < 1"> Show Cart Details </button> The first method uses CSS classes to enable and disable the button. The second method uses the HTML attribute, disabled. We can use a third option, which is to use a computed observable. 
We can create a computed observable variable in our view-model that returns true or false depending on the length of the cart:

//in the viewmodel. Remember to expose it
var cartHasProducts = ko.computed(function(){
  return (cart().length > 0);
});

//HTML
<button class="btn btn-primary btn-sm" data-bind="click: showCartDetails, enable: cartHasProducts">
  Show Cart Details
</button>

To show the cart, we will use the click binding. Now we should go to our viewmodel.js file and add all the information we need to make this template work:

var cart = ko.observableArray([]);
var showCartDetails = function () {
  if (cart().length > 0) {
    $("#cartContainer").removeClass("hidden");
  }
};

And you should expose these two objects in the view-model:

return {
  searchTerm: searchTerm,
  catalog: filteredCatalog,
  newProduct: newProduct,
  totalItems: totalItems,
  addProduct: addProduct,
  cart: cart,
  showCartDetails: showCartDetails
};

The catalog template

The next step is to define the catalog template just below the header template:

<script type="text/html" id="catalog">
<div class="input-group">
   <span class="input-group-addon">
     <i class="glyphicon glyphicon-search"></i> Search
   </span>
   <input type="text" class="form-control" data-bind="textInput: searchTerm">
</div>
<table class="table">
   <thead>
   <tr>
     <th>Name</th>
     <th>Price</th>
     <th>Stock</th>
     <th></th>
   </tr>
   </thead>
   <tbody data-bind="foreach:catalog">
   <tr data-bind="style:{color: stock() < 5 ? 'red' : 'black'}">
     <td data-bind="text:name"></td>
     <td data-bind="text:price"></td>
     <td data-bind="text:stock"></td>
     <td>
       <button class="btn btn-primary"
         data-bind="click:$parent.addToCart">
         <i class="glyphicon glyphicon-plus-sign"></i> Add
       </button>
     </td>
   </tr>
   </tbody>
   <tfoot>
   <tr>
     <td colspan="3">
       <strong>Items:</strong><span
          data-bind="text:catalog().length"></span>
     </td>
     <td colspan="1">
       <span data-bind="template:{name:'cart-widget'}"></span>
     </td>
   </tr>
   </tfoot>
</table>
</script>

Now, each line uses the style binding to alert the user, while they are shopping, that the stock is running low. The style binding works the same way that the CSS binding does with classes. It allows us to add style attributes depending on the value of the expression. In this case, the color of the text in the line must be black if the stock is five or more, and red if it is four or less. We can use other CSS attributes, so feel free to try other behaviors. For example, set the line of the catalog to green if the element is inside the cart. We should remember that if an attribute has dashes, you should wrap it in single quotes. For example, background-color will throw an error, so you should write 'background-color'. When we work with bindings that are activated depending on the values of the viewmodel, it is good practice to use computed observables. Therefore, we can create a computed value in our product model that returns the value of the color that should be displayed:

//In the Product.js
var _lineColor = ko.computed(function(){
  return (_stock() < 5) ? 'red' : 'black';
});
return { lineColor: _lineColor };

//In the template
<tr data-bind="style:{color: lineColor}"> ... </tr>

It would be even better if we create a class in our style.css file that is called stock-alert and we use the CSS binding. Note that, despite its name, hasStock is true when the stock is low:

//In the style file
.stock-alert { color: #f00; }

//In the Product.js
var _hasStock = ko.computed(function(){
  return (_stock() < 5);
});
return { hasStock: _hasStock };

//In the template
<tr data-bind="css: {'stock-alert': hasStock}"> ... </tr>

Now, look inside the <tfoot> tag.

<td colspan="1">
  <span data-bind="template:{name:'cart-widget'}"></span>
</td>

As you can see, we can have nested templates. In this case, we have the cart-widget template inside our catalog template.
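To make the idea of nesting concrete, here is a minimal sketch of how named templates can be resolved recursively. It is not how Knockout implements the template binding: the templates object stands in for the <script type="text/html"> blocks, and the {{template:name}} marker and the render function are hypothetical names invented for this illustration.

```javascript
// Hypothetical registry standing in for the <script type="text/html"> blocks.
// The {{template:name}} marker is invented for this sketch; it is not Knockout syntax.
var templates = {
  'catalog': '<table><tfoot>{{template:cart-widget}}</tfoot></table>',
  'cart-widget': '<span>Items in cart</span>'
};

// Render a template by name, replacing each marker with the rendered
// markup of the template it points to. The "seen" list guards against
// a template that (directly or indirectly) includes itself.
function render(name, seen) {
  seen = seen || [];
  if (!Object.prototype.hasOwnProperty.call(templates, name)) {
    throw new Error('Unknown template: ' + name);
  }
  if (seen.indexOf(name) !== -1) {
    throw new Error('Circular template reference: ' + name);
  }
  var chain = seen.concat(name);
  return templates[name].replace(/\{\{template:([\w-]+)\}\}/g, function (match, child) {
    return render(child, chain);
  });
}

var html = render('catalog');
console.log(html);
```

The point of the sketch is only that a parent template refers to a child by ID and the engine expands it in place, so each piece can stay small and be reused in several parents.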
Nesting templates gives us the possibility of building very complex templates by splitting them into very small pieces and combining them, which keeps our code clean and maintainable. Finally, look at the last cell of each row:

<td>
  <button class="btn btn-primary" data-bind="click:$parent.addToCart">
    <i class="glyphicon glyphicon-plus-sign"></i> Add
  </button>
</td>

Look at how we call the addToCart method using the magic variable $parent. Knockout gives us some magic words to navigate through the different contexts we have in our app. In this case, we are inside the foreach: catalog loop, so the current context is a single product, and we want to call a method that lies one level up, in the view-model. We can use the magic variable called $parent. There are other variables we can use when we are inside a Knockout context. There is complete documentation on the Knockout website at http://knockoutjs.com/documentation/binding-context.html. In this project, we are not going to use all of them, but we are going to quickly explain these binding context variables, just to understand them better. If we don't know how many levels deep we are, we can navigate to the top of the view-model using the magic word $root. When we have many parents, we can get the magic array $parents and access each parent using indexes, for example, $parents[0], $parents[1]. Imagine that you have a list of categories where each category contains a list of products. These products are a list of IDs and the category has a method to get the name of its products. We can use the $parents array to obtain the reference to the category:

<ul data-bind="foreach: {data: categories}">
  <li data-bind="text: $data.name"></li>
  <ul data-bind="foreach: {data: $data.products, as: 'prod'}">
    <li data-bind="text: $parents[0].getProductName(prod.ID)"></li>
  </ul>
</ul>

Look how helpful the as attribute is inside the foreach binding. It makes code more readable.
Inside a foreach loop, you can also access each item using the $data magic variable, and you can access the position index that each element has in the collection using the $index magic variable. For example, if we have a list of products, we can do this:

<ul data-bind="foreach: cart">
  <li><span data-bind="text:$index"></span> - <span data-bind="text:$data.name"></span></li>
</ul>

This should display:

0 - Product 1
1 - Product 2
2 - Product 3
...

KnockoutJS magic variables to navigate through contexts

Now that we know more about what binding variables are, let's go back to our code. We are now going to write the addToCart method. We are going to define the cart items in our js/models folder. Create a file called CartProduct.js and insert the following code in it:

//js/models/CartProduct.js
var CartProduct = function (product, units) {
  "use strict";
  var _product = product,
    _units = ko.observable(units);
  var subtotal = ko.computed(function(){
    return _product.price() * _units();
  });
  var addUnit = function () {
    var u = _units();
    var _stock = _product.stock();
    if (_stock === 0) {
      return;
    }
    _units(u+1);
    _product.stock(--_stock);
  };
  var removeUnit = function () {
    var u = _units();
    var _stock = _product.stock();
    if (u === 0) {
      return;
    }
    _units(u-1);
    _product.stock(++_stock);
  };
  return {
    product: _product,
    units: _units,
    subtotal: subtotal,
    addUnit: addUnit,
    removeUnit: removeUnit
  };
};

Each cart product is composed of the product itself and the units of the product we want to buy. We will also have a computed field that contains the subtotal of the line. We should give the object the responsibility for managing its units and the stock of the product. For this reason, we have added the addUnit and removeUnit methods. These methods add one unit or remove one unit of the product when they are called, and adjust the stock of the product accordingly.
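The unit and stock bookkeeping described above can be tried out without loading Knockout. The following is only a sketch under stated assumptions: observable is a hypothetical stand-in that mimics the read/write call style of ko.observable (it does not notify subscribers), and makeProduct and makeCartProduct are simplified helpers invented here, with subtotal as a plain function instead of a computed observable.

```javascript
// Hypothetical stand-in for ko.observable: call with no argument to read,
// with one argument to write. It does not notify subscribers.
function observable(initial) {
  var value = initial;
  return function (next) {
    if (arguments.length === 0) { return value; }
    value = next;
  };
}

// Simplified product holding only the fields the cart logic touches.
function makeProduct(price, stock) {
  return { price: observable(price), stock: observable(stock) };
}

// Same unit/stock rules as CartProduct: adding a unit takes one from the
// stock, removing a unit gives one back, and both refuse to go below zero.
function makeCartProduct(product, units) {
  var _units = observable(units);
  return {
    units: _units,
    subtotal: function () { return product.price() * _units(); },
    addUnit: function () {
      if (product.stock() === 0) { return; }
      _units(_units() + 1);
      product.stock(product.stock() - 1);
    },
    removeUnit: function () {
      if (_units() === 0) { return; }
      _units(_units() - 1);
      product.stock(product.stock() + 1);
    }
  };
}

var product = makeProduct(10, 2);
var line = makeCartProduct(product, 0);
line.addUnit();
line.addUnit();
line.addUnit(); // ignored: the stock is already 0
console.log(line.units(), product.stock(), line.subtotal()); // 2 0 20
line.removeUnit();
console.log(line.units(), product.stock(), line.subtotal()); // 1 1 10
```

Keeping both counters inside one object means the units in the cart and the remaining stock cannot drift apart, which is the reason the article gives this responsibility to the cart product.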
We should reference this JavaScript file in our index.html file with the other <script> tags. In the viewmodel, we should create a cart array and expose it in the return statement, as we have done earlier:

    var cart = ko.observableArray([]);

It's time to write the addToCart method:

    var addToCart = function (data) {
      var item = null;
      var tmpCart = cart();
      var n = tmpCart.length;
      // Look for a cart line that already holds this product
      while (n--) {
        if (tmpCart[n].product.id() === data.id()) {
          item = tmpCart[n];
        }
      }
      if (item) {
        item.addUnit();
      } else {
        item = new CartProduct(data, 0);
        item.addUnit();
        tmpCart.push(item);
      }
      cart(tmpCart);
    };

This method searches for the product in the cart. If it exists, it adds one unit to it; if not, it creates a new cart line. Since the cart is an observable array, we read the underlying array, manipulate it, and write it back, because we need to access each product object to know whether the product is already in the cart. Remember that observable arrays do not observe the objects they contain, just the array itself (items being added, removed, or reordered).

The add-to-catalog-modal template

This is a very simple template.
We just wrap the code to add a product to a Bootstrap modal: <script type="text/html" id="add-to-catalog-modal"> <div class="modal fade" id="addToCatalogModal">    <div class="modal-dialog">      <div class="modal-content">        <form class="form-horizontal" role="form"           data-bind="with:newProduct">          <div class="modal-header">            <button type="button" class="close"               data-dismiss="modal">              <span aria-hidden="true">&times;</span>              <span class="sr-only">Close</span>            </button><h3>Add New Product to the Catalog</h3>          </div>          <div class="modal-body">            <div class="form-group">              <div class="col-sm-12">                <input type="text" class="form-control"                  placeholder="Name" data-bind="textInput:name">              </div>            </div>            <div class="form-group">              <div class="col-sm-12">                <input type="text" class="form-control"                   placeholder="Price" data-bind="textInput:price">              </div>            </div>            <div class="form-group">              <div class="col-sm-12">                <input type="text" class="form-control"                   placeholder="Stock" data-bind="textInput:stock">              </div>            </div>          </div>          <div class="modal-footer">            <div class="form-group">              <div class="col-sm-12">                <button type="submit" class="btn btn-default"                  data-bind="{click:$parent.addProduct}">                  <i class="glyphicon glyphicon-plus-sign">                  </i> Add Product                </button>              </div>            </div>          </div>        </form>      </div><!-- /.modal-content -->    </div><!-- /.modal-dialog --> </div><!-- /.modal --> </script> The cart-widget template This template gives the user information quickly about how many items are in the cart and how much all 
of them cost: <script type="text/html" id="cart-widget"> Total Items: <span data-bind="text:totalItems"></span> Price: <span data-bind="text:grandTotal"></span> </script> We should define totalItems and grandTotal in our viewmodel: var totalItems = ko.computed(function(){ var tmpCart = cart(); var total = 0; tmpCart.forEach(function(item){    total += parseInt(item.units(),10); }); return total; }); var grandTotal = ko.computed(function(){ var tmpCart = cart(); var total = 0; tmpCart.forEach(function(item){    total += (item.units() * item.product.price()); }); return total; }); Now you should expose them in the return statement, as we always do. Don't worry about the format now, you will learn how to format currency or any kind of data in the future. Now you must focus on learning how to manage information and how to show it to the user. The cart-item template The cart-item template displays each line in the cart: <script type="text/html" id="cart-item"> <div class="list-group-item" style="overflow: hidden">    <button type="button" class="close pull-right" data-bind="click:$root.removeFromCart"><span>&times;</span></button>    <h4 class="" data-bind="text:product.name"></h4>    <div class="input-group cart-unit">      <input type="text" class="form-control" data-bind="textInput:units" readonly/>        <span class="input-group-addon">          <div class="btn-group-vertical">            <button class="btn btn-default btn-xs"               data-bind="click:addUnit">              <i class="glyphicon glyphicon-chevron-up"></i>            </button>            <button class="btn btn-default btn-xs"               data-bind="click:removeUnit">              <i class="glyphicon glyphicon-chevron-down"></i>            </button>          </div>        </span>    </div> </div> </script> We set an x button in the top-right of each line to easily remove a line from the cart. 
As you can see, we have used the $root magic variable to navigate to the top context because we are going to use this template inside a foreach loop, which means this template will be in the loop context. If we consider this template as an isolated element, we can't be sure how deep we are in the context navigation. To be sure that we reach the right context to call the removeFromCart method, it's better to use $root instead of $parent in this case. The code for removeFromCart should lie in the viewmodel context and should look like this:

    var removeFromCart = function (data) {
      var units = data.units();
      var stock = data.product.stock();
      // Return the units of the removed line to the product stock
      data.product.stock(units + stock);
      cart.remove(data);
    };

Notice that in the addToCart method, we got the array that is inside the observable because we needed to navigate inside the elements of the array. In this case, we don't need to: Knockout observable arrays have a method called remove that removes the object we pass as a parameter, if that object is in the array. Remember that the data context is always passed as the first parameter to the function we use in click events.

The cart template

The cart template should display the layout of the cart:

    <script type="text/html" id="cart">
      <button type="button" class="close pull-right"
        data-bind="click:hideCartDetails">
        <span>&times;</span>
      </button>
      <h1>Cart</h1>
      <div data-bind="template: {name: 'cart-item', foreach:cart}"
        class="list-group"></div>
      <div data-bind="template:{name:'cart-widget'}"></div>
      <button class="btn btn-primary btn-sm"
        data-bind="click:showOrder">
        Confirm Order
      </button>
    </script>

It's important that you notice the template binding that we have just below <h1>Cart</h1>. We are binding a template with an array using the foreach argument. With this binding, Knockout renders the cart-item template for each element inside the cart collection.
This considerably reduces the code we write in each template and in addition makes them more readable. We have once again used the cart-widget template to show the total items and the total amount. This is one of the good features of templates, we can reuse content over and over. Observe that we have put a button at the top-right of the cart to close it when we don't need to see the details of our cart, and the other one to confirm the order when we are done. The code in our viewmodel should be as follows: var hideCartDetails = function () { $("#cartContainer").addClass("hidden"); }; var showOrder = function () { $("#catalogContainer").addClass("hidden"); $("#orderContainer").removeClass("hidden"); }; As you can see, to show and hide elements we use jQuery and CSS classes from the Bootstrap framework. The hidden class just adds the display: none style to the elements. We just need to toggle this class to show or hide elements in our view. Expose these two methods in the return statement of your view-model. We will come back to this when we need to display the order template. This is the result once we have our catalog and our cart:   The order template Once we have clicked on the Confirm Order button, the order should be shown to us, to review and confirm if we agree. 
<script type="text/html" id="order"> <div class="col-xs-12">    <button class="btn btn-sm btn-primary"       data-bind="click:showCatalog">      Back to catalog    </button>    <button class="btn btn-sm btn-primary"       data-bind="click:finishOrder">      Buy & finish    </button> </div> <div class="col-xs-6">    <table class="table">      <thead>      <tr>        <th>Name</th>        <th>Price</th>        <th>Units</th>        <th>Subtotal</th>      </tr>      </thead>      <tbody data-bind="foreach:cart">      <tr>        <td data-bind="text:product.name"></td>        <td data-bind="text:product.price"></td>        <td data-bind="text:units"></td>        <td data-bind="text:subtotal"></td>      </tr>      </tbody>      <tfoot>      <tr>        <td colspan="3"></td>        <td>Total:<span data-bind="text:grandTotal"></span></td>      </tr>      </tfoot>    </table> </div> </script> Here we have a read-only table with all cart lines and two buttons. One is to confirm, which will show the modal dialog saying the order is completed, and the other gives us the option to go back to the catalog and keep on shopping. There is some code we need to add to our viewmodel and expose to the user: var showCatalog = function () { $("#catalogContainer").removeClass("hidden"); $("#orderContainer").addClass("hidden"); }; var finishOrder = function() { cart([]); hideCartDetails(); showCatalog(); $("#finishOrderModal").modal('show'); }; As we have done in previous methods, we add and remove the hidden class from the elements we want to show and hide. The finishOrder method removes all the items of the cart because our order is complete; hides the cart and shows the catalog. It also displays a modal that gives confirmation to the user that the order is done.  
Order details template The finish-order-modal template The last template is the modal that tells the user that the order is complete: <script type="text/html" id="finish-order-modal"> <div class="modal fade" id="finishOrderModal">    <div class="modal-dialog">            <div class="modal-content">        <div class="modal-body">        <h2>Your order has been completed!</h2>        </div>        <div class="modal-footer">          <div class="form-group">            <div class="col-sm-12">              <button type="submit" class="btn btn-success"                 data-dismiss="modal">Continue Shopping              </button>            </div>          </div>        </div>      </div><!-- /.modal-content -->    </div><!-- /.modal-dialog --> </div><!-- /.modal --> </script> The following screenshot displays the output:   Handling templates with if and ifnot bindings You have learned how to show and hide templates with the power of jQuery and Bootstrap. This is quite good because you can use this technique with any framework you want. The problem with this type of code is that since jQuery is a DOM manipulation library, you need to reference elements to manipulate them. This means you need to know over which element you want to apply the action. Knockout gives us some bindings to hide and show elements depending on the values of our view-model. Let's update the show and hide methods and the templates. Add both the control variables to your viewmodel and expose them in the return statement. var visibleCatalog = ko.observable(true); var visibleCart = ko.observable(false); Now update the show and hide methods: var showCartDetails = function () { if (cart().length > 0) {    visibleCart(true); } };   var hideCartDetails = function () { visibleCart(false); };   var showOrder = function () { visibleCatalog(false); };   var showCatalog = function () { visibleCatalog(true); }; We can appreciate how the code becomes more readable and meaningful. 
Now, update the cart template, the catalog template, and the order template. In index.html, consider this line: <div class="row" id="catalogContainer"> Replace it with the following line: <div class="row" data-bind="if: visibleCatalog"> Then consider the following line: <div id="cartContainer" class="col-xs-6 well hidden"   data-bind="template:{name:'cart'}"></div> Replace it with this one: <div class="col-xs-6" data-bind="if: visibleCart"> <div class="well" data-bind="template:{name:'cart'}"></div> </div> It is important to know that the if binding and the template binding can't share the same data-bind attribute. This is why we go from one element to two nested elements in this template. In other words, this example is not allowed: <div class="col-xs-6" data-bind="if:visibleCart,   template:{name:'cart'}"></div> Finally, consider this line: <div class="row hidden" id="orderContainer"   data-bind="template:{name:'order'}"> Replace it with this one: <div class="row" data-bind="ifnot: visibleCatalog"> <div data-bind="template:{name:'order'}"></div> </div> With the changes we have made, showing or hiding elements now depends on our data and not on our CSS. This is much better because now we can show and hide any element we want using the if and ifnot binding. 
Let's review, roughly speaking, how our files are now: We have our index.html file that has the main container, templates, and libraries: <!DOCTYPE html> <html> <head> <title>KO Shopping Cart</title> <meta name="viewport" content="width=device-width,     initial-scale=1"> <link rel="stylesheet" type="text/css"     href="css/bootstrap.min.css"> <link rel="stylesheet" type="text/css" href="css/style.css"> </head> <body>   <div class="container-fluid"> <div class="row" data-bind="if: visibleCatalog">    <div class="col-xs-12"       data-bind="template:{name:'header'}"></div>    <div class="col-xs-6"       data-bind="template:{name:'catalog'}"></div>    <div class="col-xs-6" data-bind="if: visibleCart">      <div class="well" data-bind="template:{name:'cart'}"></div>    </div> </div> <div class="row" data-bind="ifnot: visibleCatalog">    <div data-bind="template:{name:'order'}"></div> </div> <div data-bind="template: {name:'add-to-catalog-modal'}"></div> <div data-bind="template: {name:'finish-order-modal'}"></div> </div>   <!-- templates --> <script type="text/html" id="header"> ... </script> <script type="text/html" id="catalog"> ... </script> <script type="text/html" id="add-to-catalog-modal"> ... </script> <script type="text/html" id="cart-widget"> ... </script> <script type="text/html" id="cart-item"> ... </script> <script type="text/html" id="cart"> ... </script> <script type="text/html" id="order"> ... </script> <script type="text/html" id="finish-order-modal"> ... 
    </script>
    <!-- libraries -->
    <script type="text/javascript" src="js/vendors/jquery.min.js"></script>
    <script type="text/javascript" src="js/vendors/bootstrap.min.js"></script>
    <script type="text/javascript" src="js/vendors/knockout.debug.js"></script>
    <script type="text/javascript" src="js/models/product.js"></script>
    <script type="text/javascript" src="js/models/CartProduct.js"></script>
    <script type="text/javascript" src="js/viewmodel.js"></script>
    </body>
    </html>

We also have our viewmodel.js file:

    var vm = (function () {
      "use strict";
      var visibleCatalog = ko.observable(true);
      var visibleCart = ko.observable(false);
      var catalog = ko.observableArray([...]);
      var cart = ko.observableArray([]);
      var newProduct = {...};
      var totalItems = ko.computed(function(){...});
      var grandTotal = ko.computed(function(){...});
      var searchTerm = ko.observable("");
      var filteredCatalog = ko.computed(function () {...});
      var addProduct = function (data) {...};
      var addToCart = function(data) {...};
      var removeFromCart = function (data) {...};
      var showCartDetails = function () {...};
      var hideCartDetails = function () {...};
      var showOrder = function () {...};
      var showCatalog = function () {...};
      var finishOrder = function() {...};
      return {
        searchTerm: searchTerm,
        catalog: filteredCatalog,
        cart: cart,
        newProduct: newProduct,
        totalItems: totalItems,
        grandTotal: grandTotal,
        addProduct: addProduct,
        addToCart: addToCart,
        removeFromCart: removeFromCart,
        visibleCatalog: visibleCatalog,
        visibleCart: visibleCart,
        showCartDetails: showCartDetails,
        hideCartDetails: hideCartDetails,
        showOrder: showOrder,
        showCatalog: showCatalog,
        finishOrder: finishOrder
      };
    })();
    ko.applyBindings(vm);

For debugging, it is useful to expose the view-model as a global variable. This is not good practice in production environments, but it helps while you are debugging your application:
    window.vm = vm;

Now you have easy access to your view-model from the browser debugger or from your IDE debugger. In addition to the product model, we have created a new model called CartProduct:

    var CartProduct = function (product, units) {
      "use strict";
      var _product = product,
        _units = ko.observable(units);
      var subtotal = ko.computed(function(){...});
      var addUnit = function () {...};
      var removeUnit = function () {...};
      return {
        product: _product,
        units: _units,
        subtotal: subtotal,
        addUnit: addUnit,
        removeUnit: removeUnit
      };
    };

You have learned how to manage templates with Knockout, but maybe you have noticed that having all templates in the index.html file is not the best approach. We are going to talk about two mechanisms. The first one is more home-made and the second one is an external library used by lots of Knockout developers, created by Jim Cowart, called Knockout.js-External-Template-Engine (https://github.com/ifandelse/Knockout.js-External-Template-Engine).

Managing templates with jQuery

Since we want to load templates from different files, let's move all our templates to a folder called views and make one file per template. Each file will have the same name as the ID of the template it contains. So if the template has the ID cart-item, the file should be called cart-item.html and will contain the full cart-item template:

    <script type="text/html" id="cart-item"></script>

Figure: The views folder with all templates

Now in the viewmodel.js file, remove the last line (ko.applyBindings(vm)) and add this code:

    var templates = [
      'header',
      'catalog',
      'cart',
      'cart-item',
      'cart-widget',
      'order',
      'add-to-catalog-modal',
      'finish-order-modal'
    ];

    // Number of templates still being loaded
    var busy = templates.length;
    templates.forEach(function (tpl) {
      "use strict";
      $.get('views/' + tpl + '.html').then(function (data) {
        $('body').append(data);
        busy--;
        // Apply the bindings only after the last template has arrived
        if (!busy) {
          ko.applyBindings(vm);
        }
      });
    });

This code gets all the templates we need and appends them to the body.
Once all the templates are loaded, we call the applyBindings method. We should do it this way because we are loading templates asynchronously and we need to make sure that we bind our view-model only when all templates are loaded. This is good enough to make our code more maintainable and readable, but it is still problematic if we need to handle lots of templates. Furthermore, if we have nested folders, listing all our templates in one array becomes a headache. There should be a better approach.

Managing templates with koExternalTemplateEngine

We have seen two ways of loading templates. Both are good enough to manage a small number of templates, but when the lines of code begin to grow, we need something that allows us to forget about template management. We just want to call a template and get the content. For this purpose, Jim Cowart's library, koExternalTemplateEngine, is perfect. This project was abandoned by the author in 2014, but it is still a good library that we can use when we develop simple projects. We just need to download the library into the js/vendors folder and then link it in our index.html file just below the Knockout library:

    <script type="text/javascript" src="js/vendors/knockout.debug.js"></script>
    <script type="text/javascript"
      src="js/vendors/koExternalTemplateEngine_all.min.js"></script>

Now you should configure it in the viewmodel.js file. Remove the templates array and the forEach statement, and add these three lines of code:

    infuser.defaults.templateSuffix = ".html";
    infuser.defaults.templateUrl = "views";
    ko.applyBindings(vm);

Here, infuser is a global variable that we use to configure the template engine. We should indicate which suffix our templates will have and in which folder they will be. We don't need the <script type="text/html" id="template-id"></script> tags any more, so we should remove them from each file. Now everything should be working, and we did not need much code to get there.
KnockoutJS has its own template engine, but you can see that adding new ones is not difficult. If you have experience with other template engines such as jQuery Templates, Underscore, or Handlebars, just load them in your index.html file and use them; there is no problem with that. This is why Knockout is beautiful: you can use any tool you like with it.

You have learned a lot of things in this article, haven't you?

Knockout gives us the css binding to activate and deactivate CSS classes according to an expression.
We can use the style binding to add CSS rules to elements.
The template binding helps us to manage templates that are already loaded in the DOM.
We can iterate over collections with the foreach binding.
Inside a foreach, Knockout gives us some magic variables such as $parent, $parents, $index, $data, and $root.
We can use the as option along with the foreach binding to get an alias for each element.
We can show and hide content using just jQuery and CSS.
We can show and hide content using the bindings if, ifnot, and visible.
jQuery helps us to load Knockout templates asynchronously.
You can use the koExternalTemplateEngine plugin to manage templates in a more efficient way. The project is abandoned but it is still a good solution.

Summary

In this article, you have learned how to split an application using templates that share the same view-model. Now that we know the basics, it would be interesting to extend the application. Maybe we can try to create a detailed view of the product, or maybe we can give the user the option to register where to send the order.
Packt
03 Mar 2015
14 min read

SciPy for Signal Processing

In this article by Sergio J. Rojas G. and Erik A Christensen, authors of the book Learning SciPy for Numerical and Scientific Computing - Second Edition, we will focus on the usage of some of the most commonly used routines in three SciPy modules: scipy.signal, scipy.ndimage, and scipy.fftpack, which are used for signal processing, multidimensional image processing, and computing Fourier transforms, respectively.

We define a signal as data that measures a time-varying or spatially varying phenomenon. Sound or electrocardiograms are excellent examples of time-varying quantities, while images embody the quintessential spatially varying cases. Moving images are treated with the techniques of both types of signals.

The field of signal processing treats four aspects of this kind of data: its acquisition, quality improvement, compression, and feature extraction. SciPy has many routines to treat tasks effectively in any of these four areas. All of them are included in two low-level modules (scipy.signal being the main module, with an emphasis on time-varying data, and scipy.ndimage, for images). Many of the routines in these two modules are based on the Discrete Fourier Transform of the data. SciPy has an extensive package of applications and definitions of these background algorithms, scipy.fftpack, which we will cover first.

Discrete Fourier Transforms

The Discrete Fourier Transform (DFT from now on) transforms any signal from its time/space domain into a related signal in the frequency domain. This allows us not only to analyze the different frequencies of the data, but also to perform faster filtering operations, when used properly. It is possible to turn a signal in the frequency domain back to its time/spatial domain, thanks to the Inverse Fourier Transform. We will not go into detail of the mathematics behind these operators, since we assume familiarity at some level with this theory.
We will focus on syntax and applications instead. The basic routines in the scipy.fftpack module compute the DFT and its inverse, for discrete signals in any dimension, which are fft and ifft (one dimension), fft2 and ifft2 (two dimensions), and fftn and ifftn (any number of dimensions). All of these routines assume that the data is complex valued. If we know beforehand that a particular dataset is actually real valued, and should offer real-valued frequencies, we use rfft and irfft instead, for a faster algorithm. All these routines are designed so that composition with their inverses always yields the identity. The syntax is the same in all cases, as follows: fft(x[, n, axis, overwrite_x]) The first parameter, x, is always the signal in any array-like form. Note that fft performs one-dimensional transforms. This means in particular, that if x happens to be two-dimensional, for example, fft will output another two-dimensional array where each row is the transform of each row of the original. We can change it to columns instead, with the optional parameter, axis. The rest of parameters are also optional; n indicates the length of the transform, and overwrite_x gets rid of the original data to save memory and resources. We usually play with the integer n when we need to pad the signal with zeros, or truncate it. For higher dimension, n is substituted by shape (a tuple), and axis by axes (another tuple). To better understand the output, it is often useful to shift the zero frequencies to the center of the output arrays with fftshift. The inverse of this operation, ifftshift, is also included in the module. 
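As a quick check of the behavior just described, the following minimal sketch applies fft, ifft, and fftshift to a tiny array. The sample values and the variable names (x, grid, padded, and so on) are made up for illustration only; the calls themselves use the scipy.fftpack routines named above.

```python
import numpy as np
from scipy.fftpack import fft, ifft, fftshift, ifftshift

# A short real-valued test signal (values chosen only for illustration).
x = np.array([1.0, 2.0, 0.0, -1.0])

# Composing fft with its inverse recovers the original signal (up to rounding).
roundtrip = ifft(fft(x))

# n=8 pads the signal with zeros before transforming, so the output has 8 bins.
padded = fft(x, n=8)

# On a two-dimensional input, fft transforms each row independently by default.
grid = np.vstack([x, 2 * x])
rows = fft(grid)
cols = fft(grid, axis=0)  # transform along the columns instead

# fftshift moves the zero-frequency bin to the center; ifftshift undoes it.
shifted = fftshift(padded)
restored = ifftshift(shifted)

print(np.allclose(roundtrip, x), padded.shape, rows.shape, cols.shape)
```

Note that the result of ifft is a complex array even when the input was real, so the comparison with x is done with a tolerance rather than with exact equality.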
The following code shows some of these routines in action, when applied to a checkerboard image: >>> import numpy >>> from scipy.fftpack import fft,fft2, fftshift >>> import matplotlib.pyplot as plt >>> B=numpy.ones((4,4)); W=numpy.zeros((4,4)) >>> signal = numpy.bmat("B,W;W,B") >>> onedimfft = fft(signal,n=16) >>> twodimfft = fft2(signal,shape=(16,16)) >>> plt.figure() >>> plt.gray() >>> plt.subplot(121,aspect='equal') >>> plt.pcolormesh(onedimfft.real) >>> plt.colorbar(orientation='horizontal') >>> plt.subplot(122,aspect='equal') >>> plt.pcolormesh(fftshift(twodimfft.real)) >>> plt.colorbar(orientation='horizontal') >>> plt.show() Note how the first four rows of the one-dimensional transform are equal (and so are the last four), while the two-dimensional transform (once shifted) presents a peak at the origin, and nice symmetries in the frequency domain. In the following screenshot (obtained from the preceding code), the left-hand side image is fft and the right-hand side image is fft2 of a 2 x 2 checkerboard signal: The scipy.fftpack module also offers the Discrete Cosine Transform with its inverse (dct, idct) as well as many differential and pseudo-differential operators defined in terms of all these transforms: diff (for derivative/integral), hilbert and ihilbert (for the Hilbert transform), tilbert and itilbert (for the h-Tilbert transform of periodic sequences), and so on. Signal construction To aid in the construction of signals with predetermined properties, the scipy.signal module has a nice collection of the most frequent one-dimensional waveforms in the literature: chirp and sweep_poly (for the frequency-swept cosine generator), gausspulse (a Gaussian modulated sinusoid) and sawtooth and square (for the waveforms with those names). They all take as their main parameter a one-dimensional ndarray representing the times at which the signal is to be evaluated. Other parameters control the design of the signal, according to frequency or time constraints. 
Let's take a look into the following code snippet, which illustrates the use of these one dimensional waveforms that we just discussed: >>> import numpy >>> from scipy.signal import chirp, sawtooth, square, gausspulse >>> import matplotlib.pyplot as plt >>> t=numpy.linspace(-1,1,1000) >>> plt.subplot(221); plt.ylim([-2,2]) >>> plt.plot(t,chirp(t,f0=100,t1=0.5,f1=200))   # plot a chirp >>> plt.subplot(222); plt.ylim([-2,2]) >>> plt.plot(t,gausspulse(t,fc=10,bw=0.5))     # Gauss pulse >>> plt.subplot(223); plt.ylim([-2,2]) >>> t*=3*numpy.pi >>> plt.plot(t,sawtooth(t))                     # sawtooth >>> plt.subplot(224); plt.ylim([-2,2]) >>> plt.plot(t,square(t))                       # Square wave >>> plt.show() Generated by this code, the following diagram shows waveforms for chirp (upper-left), gausspulse (upper-right), sawtooth (lower-left), and square (lower-right): The usual method of creating signals is to import them from the file. This is possible by using purely NumPy routines, for example fromfile: fromfile(file, dtype=float, count=-1, sep='') The file argument may point to either a file or a string, the count argument is used to determine the number of items to read, and sep indicates what constitutes a separator in the original file/string. For images, we have the versatile routine, imread in either the scipy.ndimage or scipy.misc module: imread(fname, flatten=False) The fname argument is a string containing the location of an image. The routine infers the type of file, and reads the data into an array, accordingly. In case the flatten argument is turned to True, the image is converted to gray scale. Note that, in order to work, the Python Imaging Library (PIL) needs to be installed. It is also possible to load .wav files for analysis, with the read and write routines from the wavfile submodule in the scipy.io module. 
For instance, given any audio file with this format, say audio.wav, the command rate,data = scipy.io.wavfile.read("audio.wav") assigns an integer value to the rate variable, indicating the sample rate of the file (in samples per second), and a NumPy ndarray to the data variable, containing the sample values of the audio signal. If we wish to write some one-dimensional ndarray data into an audio file of this kind, with the sample rate given by the rate variable, we may do so by issuing the following command:

    >>> scipy.io.wavfile.write("filename.wav",rate,data)

Filters

A filter is an operation on signals that either removes features or extracts some component. SciPy has a very complete set of known filters, as well as the tools to allow construction of new ones. The complete list of filters in SciPy is long, and we encourage the reader to explore the help documents of the scipy.signal and scipy.ndimage modules for the complete picture. In these pages, we will introduce some of the filters most used in audio and image processing. We start by creating a signal worth filtering:

    >>> from numpy import sin, cos, pi, linspace
    >>> f=lambda t: cos(pi*t) + 0.2*sin(5*pi*t+0.1) + 0.2*sin(30*pi*t) + 0.1*sin(32*pi*t+0.1) + 0.1*sin(47*pi*t+0.8)
    >>> t=linspace(0,4,400); signal=f(t)

We first test the classical smoothing filter of Wiener and Kolmogorov, wiener. We present in a plot the original signal (in black) and the corresponding filtered data, with a choice of a Wiener window of size 55 samples (in blue).
Next, we compare the result of applying the median filter, medfilt, with a kernel of the same size as before (in red):

    >>> from scipy.signal import wiener, medfilt
    >>> import matplotlib.pylab as plt
    >>> plt.plot(t,signal,'k')
    >>> plt.plot(t,wiener(signal,mysize=55),'b',linewidth=3)
    >>> plt.plot(t,medfilt(signal,kernel_size=55),'r',linewidth=3)
    >>> plt.show()

This gives us a graph comparing the two smoothing filters (wiener is the one that has its starting point just below 0.5 and medfilt has its starting point just above 0.5).

Most of the filters in the scipy.signal module can be adapted to work on arrays of any dimension. But in the particular case of images, we prefer to use the implementations in the scipy.ndimage module, since they are coded with these objects in mind. For instance, to perform a median filter on an image for smoothing, we use scipy.ndimage.median_filter. Let's see an example. We will start by loading Lena into an array and corrupting the image with Gaussian noise (zero mean and standard deviation of 16):

    >>> from scipy.stats import norm     # Gaussian distribution
    >>> import matplotlib.pyplot as plt
    >>> import scipy.misc
    >>> import scipy.ndimage
    >>> plt.gray()
    >>> lena=scipy.misc.lena().astype(float)
    >>> plt.subplot(221);
    >>> plt.imshow(lena)
    >>> lena+=norm(loc=0,scale=16).rvs(lena.shape)
    >>> plt.subplot(222);
    >>> plt.imshow(lena)
    >>> denoised_lena = scipy.ndimage.median_filter(lena,3)
    >>> plt.subplot(224);
    >>> plt.imshow(denoised_lena)

The set of filters for images comes in two flavors: statistical and morphological. For example, among the filters of statistical nature, we have the Sobel algorithm, oriented to the detection of edges (singularities along curves). Its syntax is as follows:

    sobel(image, axis=-1, output=None, mode='reflect', cval=0.0)

The optional parameter, axis, indicates the dimension in which the computations are performed. By default, this is always the last axis (-1).
The mode parameter, which is one of the strings 'reflect', 'constant', 'nearest', 'mirror', or 'wrap', indicates how to handle the border of the image, in case there is insufficient data to perform the computations there. In case the mode is 'constant', we may indicate the value to use in the border, with the cval parameter. Let's look into the following code snippet, which illustrates the use of the sobel filter:

>>> from scipy.ndimage.filters import sobel
>>> import numpy
>>> lena=scipy.misc.lena()
>>> sblX=sobel(lena,axis=0); sblY=sobel(lena,axis=1)
>>> sbl=numpy.hypot(sblX,sblY)
>>> plt.subplot(223);
>>> plt.imshow(sbl)
>>> plt.show()

The following screenshot illustrates Lena (upper-left) and noisy Lena (upper-right) with the preceding two filters in action: edge map with sobel (lower-left) and median filter (lower-right):

Morphology

We also have the possibility of creating and applying filters to images based on mathematical morphology, both to binary and gray-scale images. The four basic morphological operations are opening (binary_opening), closing (binary_closing), dilation (binary_dilation), and erosion (binary_erosion). Note that the syntax for each of these filters is very simple, since we only need two ingredients: the signal to filter and the structuring element to perform the morphological operation. Let's take a look at the general syntax for these morphological operations:

binary_operation(signal, structuring_element)

We may use combinations of these four basic morphological operations to create more complex filters for removal of holes, hit-or-miss transforms (to find the location of specific patterns in binary images), denoising, edge detection, and many more. The SciPy module also allows for creating some common filters using the preceding syntax.
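As a rough illustration of the general syntax above, the following sketch applies the four basic operations to a small synthetic binary image. The 7x7 test image and the variable names are made up for this example; the functions are the scipy.ndimage routines named earlier. When no structuring element is passed, SciPy falls back to a cross-shaped element (a pixel and its four direct neighbours).

```python
import numpy as np
from scipy import ndimage

# Synthetic test data (an assumption for this sketch): a solid 3x3 square
# in the middle of a 7x7 binary image.
image = np.zeros((7, 7), dtype=bool)
image[2:5, 2:5] = True

# Default structuring element: a cross (pixel plus its four direct neighbours).
eroded = ndimage.binary_erosion(image)
dilated = ndimage.binary_dilation(image)
opened = ndimage.binary_opening(image)   # erosion followed by dilation
closed = ndimage.binary_closing(image)   # dilation followed by erosion

# Explicit structuring element: a full 3x3 block of ones.
block = np.ones((3, 3), dtype=bool)
eroded_block = ndimage.binary_erosion(image, structure=block)

for name, result in [("eroded", eroded), ("dilated", dilated),
                     ("opened", opened), ("closed", closed),
                     ("eroded_block", eroded_block)]:
    print(name, int(result.sum()), "pixels set")
```

Eroding the 3x3 square leaves only its centre pixel, opening turns that pixel back into a cross, and closing returns the original square unchanged.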
To locate the letter e in an image of text, for example, we can use the ready-made hit-or-miss transform:

>>> binary_hit_or_miss(text, letterE)

For comparative purposes, let's use this command in the following code snippet:

>>> import numpy
>>> import scipy.ndimage
>>> import matplotlib.pylab as plt
>>> from scipy.ndimage.morphology import binary_hit_or_miss
>>> text = scipy.ndimage.imread('CHAP_05_input_textImage.png')
>>> letterE = text[37:53,275:291]
>>> HitorMiss = binary_hit_or_miss(text, structure1=letterE, origin1=1)
>>> eLocation = numpy.where(HitorMiss==True)
>>> x=eLocation[1]; y=eLocation[0]
>>> plt.imshow(text, cmap=plt.cm.gray, interpolation='nearest')
>>> plt.autoscale(False)
>>> plt.plot(x,y,'wo',markersize=10)
>>> plt.axis('off')
>>> plt.show()

The output for the preceding lines of code is generated as follows:

For gray-scale images, we may use a structuring element (structuring_element) or a footprint. The syntax is, therefore, a little different:

grey_operation(signal, [structuring_element, footprint, size, ...])

If we desire to use a completely flat and rectangular structuring element (all ones), then it is enough to indicate the size as a tuple. For instance, to perform gray-scale dilation with a flat element of size (15,15) on our classical image of Lena, we issue the following command:

>>> grey_dilation(lena, size=(15,15))

The last kind of morphological operations coded in the scipy.ndimage module perform distance and feature transforms. Distance transforms create a map that assigns to each pixel of an object its distance to the nearest background pixel. Feature transforms provide the index of the closest background element instead. These operations are used to decompose images into different labels. We may even choose different metrics such as Euclidean distance, chessboard distance, and taxicab distance.
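To make the effect of the metric concrete, here is a minimal sketch on a synthetic 5x5 image with a single background pixel in the centre. The test image and variable names are assumptions for illustration only. SciPy measures, for every nonzero pixel, the distance to the nearest zero-valued pixel.

```python
import numpy as np
from scipy import ndimage

# Synthetic image (assumption for this sketch): all foreground (1)
# except a single background pixel (0) in the centre.
image = np.ones((5, 5), dtype=int)
image[2, 2] = 0

# The same distance map computed under three different metrics.
euclid = ndimage.distance_transform_bf(image, metric='euclidean')
taxicab = ndimage.distance_transform_bf(image, metric='taxicab')
chess = ndimage.distance_transform_bf(image, metric='chessboard')

# Feature transform: for each pixel, the index of the closest background pixel.
indices = ndimage.distance_transform_bf(image, return_distances=False,
                                        return_indices=True)

# Compare the corner pixel (0, 0), which is two rows and two columns
# away from the background pixel at (2, 2).
print("euclidean :", euclid[0, 0])
print("taxicab   :", taxicab[0, 0])
print("chessboard:", chess[0, 0])
print("closest background pixel:", tuple(indices[:, 0, 0]))
```

For the corner pixel, the three metrics give the square root of 8, then 4, then 2, which shows how much the choice of metric changes the resulting map.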
The syntax for the distance transform using a brute-force algorithm (distance_transform_bf) is as follows:

distance_transform_bf(signal, metric='euclidean', sampling=None, return_distances=True, return_indices=False, distances=None, indices=None)

We indicate the metric with a string such as 'euclidean', 'taxicab', or 'chessboard'. If we desire to obtain the feature transform instead, we switch return_distances to False and return_indices to True. Similar routines are available with more sophisticated algorithms: distance_transform_cdt (using chamfering for taxicab and chessboard distances) and, for Euclidean distance, distance_transform_edt. All of these use a similar syntax.

Summary

In this article, we explored signal processing in any dimension, including the treatment of signals in frequency space by means of their Discrete Fourier Transforms. These correspond to the fftpack, signal, and ndimage modules.

Resources for Article:

Further resources on this subject:
  • Signal Processing Techniques [article]
  • SciPy for Computational Geometry [article]
  • Move Further with NumPy Modules [article]
Packt
03 Mar 2015
28 min read

Elasticsearch Administration

In this article by Rafał Kuć and Marek Rogoziński, authors of the book Mastering Elasticsearch, Second Edition, we will talk more about the Elasticsearch configuration and the new features introduced in Elasticsearch 1.0 and higher. By the end of this article, you will have learned about:

  • Configuring the discovery and recovery modules
  • Using the Cat API that allows a human-readable insight into the cluster status
  • The backup and restore functionality
  • Federated search

Discovery and recovery modules

When starting your Elasticsearch node, one of the first things that Elasticsearch does is look for a master node that has the same cluster name and is visible in the network. If a master node is found, the starting node joins the already formed cluster. If no master is found, then the node itself is selected as a master (of course, if the configuration allows such behavior). The process of forming a cluster and finding nodes is called discovery. The module responsible for discovery has two main purposes: electing a master and discovering new nodes within a cluster. After the cluster is formed, a process called recovery is started. During the recovery process, Elasticsearch reads the metadata and the indices from the gateway, and prepares the shards that are stored there to be used. After the recovery of the primary shards is done, Elasticsearch should be ready for work and should continue with the recovery of all the replicas (if they are present). In this section, we will take a deeper look at these two modules and discuss the configuration possibilities Elasticsearch gives us and the consequences of changing them. Note that the information provided in the Discovery and recovery modules section is an extension of what we already wrote in Elasticsearch Server Second Edition, published by Packt Publishing.
Discovery configuration As we have already mentioned multiple times, Elasticsearch was designed to work in a distributed environment. This is the main difference when comparing Elasticsearch to other open source search and analytics solutions available. With such assumptions, Elasticsearch is very easy to set up in a distributed environment, and we are not forced to set up additional software to make it work like this. By default, Elasticsearch assumes that the cluster is automatically formed by the nodes that declare the same cluster.name setting and can communicate with each other using multicast requests. This allows us to have several independent clusters in the same network. There are a few implementations of the discovery module that we can use, so let's see what the options are. Zen discovery Zen discovery is the default mechanism that's responsible for discovery in Elasticsearch and is available by default. The default Zen discovery configuration uses multicast to find other nodes. This is a very convenient solution: just start a new Elasticsearch node and everything works—this node will be joined to the cluster if it has the same cluster name and is visible by other nodes in that cluster. This discovery method is perfectly suited for development time, because you don't need to care about the configuration; however, it is not advised that you use it in production environments. Relying only on the cluster name is handy but can also lead to potential problems and mistakes, such as the accidental joining of nodes. Sometimes, multicast is not available for various reasons or you don't want to use it for these mentioned reasons. In the case of bigger clusters, the multicast discovery may generate too much unnecessary traffic, and this is another valid reason why it shouldn't be used for production. For these cases, Zen discovery allows us to use the unicast mode. 
When using the unicast Zen discovery, a node that is not a part of the cluster will send a ping request to all the addresses specified in the configuration. By doing this, it informs all the specified nodes that it is ready to be a part of the cluster and can either join an existing cluster or form a new one. Of course, after the node joins the cluster, it gets the cluster topology information, but the initial connection is only made to the specified list of hosts. Remember that even when using unicast Zen discovery, the Elasticsearch node still needs to have the same cluster name as the other nodes. If you want to know more about the differences between multicast and unicast ping methods, refer to these URLs: http://en.wikipedia.org/wiki/Multicast and http://en.wikipedia.org/wiki/Unicast. If you still want to learn about the configuration properties of multicast Zen discovery, let's look at them.

Multicast Zen discovery configuration

The multicast part of the Zen discovery module exposes the following settings:

  • discovery.zen.ping.multicast.address (the default: all available interfaces): This is the interface used for the communication, given as an address or an interface name.
  • discovery.zen.ping.multicast.port (the default: 54328): This port is used for communication.
  • discovery.zen.ping.multicast.group (the default: 224.2.2.4): This is the multicast address to send messages to.
  • discovery.zen.ping.multicast.buffer_size (the default: 2048): This is the size of the buffer used for multicast messages.
  • discovery.zen.ping.multicast.ttl (the default: 3): This is the time for which a multicast message lives. Every time a packet crosses a router, the TTL is decreased. This limits the area in which the transmission can be received. Note that routers can have threshold values that are compared against the TTL, so the TTL value may not exactly match the number of routers that a packet can cross.
  • discovery.zen.ping.multicast.enabled (the default: true): Setting this property to false turns off the multicast. You should disable multicast if you are planning to use the unicast discovery method.

The unicast Zen discovery configuration

The unicast part of Zen discovery provides the following configuration options:

  • discovery.zen.ping.unicast.hosts: This is the initial list of nodes in the cluster. The list can be defined as a list or as an array of hosts. Every host can be given as a name (or an IP address) and can have a port or port range added. For example, the value of this property can look like this: ["master1", "master2:8181", "master3[9300-9400]"]. The hosts list for the unicast discovery doesn't need to be a complete list of Elasticsearch nodes in your cluster, because once the node is connected to one of the mentioned nodes, it will be informed about all the others that form the cluster.
  • discovery.zen.ping.unicast.concurrent_connects (the default: 10): This is the maximum number of concurrent connections unicast discovery will use. If you have a lot of nodes that the initial connection should be made to, it is advised that you increase the default value.

Master node

One of the main purposes of discovery, apart from connecting to other nodes, is to choose a master node: a node that will take care of and manage all the other nodes. This process is called master election and is a part of the discovery module. No matter how many master eligible nodes there are, each cluster will only have a single master node active at a given time. If there is more than one master eligible node present in the cluster, one of the others can be elected as the master when the original master fails and is removed from the cluster.

Configuring master and data nodes

By default, Elasticsearch allows every node to be a master node and a data node.
However, in certain situations, you may want to have worker nodes, which will only hold the data or process the queries, and master nodes that will only be used for managing the cluster. One of these situations is handling a massive amount of data, where data nodes should be as performant as possible, and there shouldn't be any delay in master nodes' responses.

Configuring data-only nodes

To set the node to only hold data, we need to instruct Elasticsearch that we don't want such a node to be a master node. In order to do this, we add the following properties to the elasticsearch.yml configuration file:

node.master: false
node.data: true

Configuring master-only nodes

To set the node not to hold data and only to be a master node, we need to instruct Elasticsearch that we don't want such a node to hold data. In order to do that, we add the following properties to the elasticsearch.yml configuration file:

node.master: true
node.data: false

Configuring the query processing-only nodes

For large enough deployments, it is also wise to have nodes that are only responsible for aggregating query results from other nodes. Such nodes should be configured as nonmaster and nondata, so they should have the following properties in the elasticsearch.yml configuration file:

node.master: false
node.data: false

Please note that the node.master and the node.data properties are set to true by default, but we tend to include them for configuration clarity.

The master election configuration

We already wrote about the master election configuration in Elasticsearch Server Second Edition, but this topic is very important, so we decided to refresh our knowledge about it. Imagine that you have a cluster that is built of 10 nodes. Everything is working fine until, one day, your network fails and three of your nodes are disconnected from the cluster, but they still see each other.
Because of the Zen discovery and the master election process, the nodes that got disconnected elect a new master and you end up with two clusters with the same name with two master nodes. Such a situation is called a split-brain and you must avoid it as much as possible. When a split-brain happens, you end up with two (or more) clusters that won't join each other until the network (or any other) problems are fixed. If you index your data during this time, you may end up with data loss and unrecoverable situations when the nodes get joined together after the network split. In order to prevent split-brain situations or at least minimize the possibility of their occurrences, Elasticsearch provides a discovery.zen.minimum_master_nodes property. This property defines a minimum amount of master eligible nodes that should be connected to each other in order to form a cluster. So now, let's get back to our cluster; if we set the discovery.zen.minimum_master_nodes property to 50 percent of the total nodes available plus one (which is six, in our case), we would end up with a single cluster. Why is that? Before the network failure, we would have 10 nodes, which is more than six nodes, and these nodes would form a cluster. After the disconnections of the three nodes, we would still have the first cluster up and running. However, because only three nodes disconnected and three is less than six, these three nodes wouldn't be allowed to elect a new master and they would wait for reconnection with the original cluster. Zen discovery fault detection and configuration Elasticsearch runs two detection processes while it is working. The first process is to send ping requests from the current master node to all the other nodes in the cluster to check whether they are operational. The second process is a reverse of that—each of the nodes sends ping requests to the master in order to verify that it is still up and running and performing its duties. 
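Stepping back to the split-brain example for a moment, the arithmetic behind discovery.zen.minimum_master_nodes can be sketched in a few lines of Python. This is only an illustration of the rule described above (50 percent of the master eligible nodes plus one); the function names are hypothetical and nothing here talks to an actual cluster.

```python
def minimum_master_nodes(master_eligible_nodes):
    """Majority rule from the text: 50 percent (rounded down) plus one."""
    return master_eligible_nodes // 2 + 1


def may_elect_master(visible_master_eligible_nodes, minimum):
    """A group of nodes may elect a master only if it reaches the minimum."""
    return visible_master_eligible_nodes >= minimum


# The example from the text: 10 nodes, then 3 of them get cut off.
minimum = minimum_master_nodes(10)
print("discovery.zen.minimum_master_nodes:", minimum)
for group_size in (7, 3):
    allowed = may_elect_master(group_size, minimum)
    print(f"group of {group_size} nodes may elect a master: {allowed}")
```

Only the group of seven reaches the minimum of six, so the three disconnected nodes cannot form a second cluster.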
However, if we have a slow network or our nodes are in different hosting locations, the default fault detection configuration may not be sufficient. Because of this, the Elasticsearch discovery module exposes three properties that we can change:

  • discovery.zen.fd.ping_interval: This defaults to 1s and specifies how often the node will send ping requests to the target node.
  • discovery.zen.fd.ping_timeout: This defaults to 30s and specifies how long the node will wait for a response to the ping request. If your nodes are 100 percent utilized or your network is slow, you may consider increasing this value.
  • discovery.zen.fd.ping_retries: This defaults to 3 and specifies the number of ping request retries before the target node is considered not operational. You can increase this value if your network has a high number of lost packets (or you can fix your network).

There is one more thing that we would like to mention. The master node is the only node that can change the state of the cluster. To keep cluster state updates in a proper sequence, the Elasticsearch master node processes cluster state update requests one at a time, makes the changes locally, and sends the request to all the other nodes so that they can synchronize their state. The master node waits a given time for the nodes to respond, and once that time passes or all the nodes have responded with an acknowledgment, it proceeds with the next cluster state update request. To change the time the master node waits for all the other nodes to respond, modify the default of 30 seconds by setting the discovery.zen.publish_timeout property. Increasing the value may be needed for huge clusters working in an overloaded network.

The Amazon EC2 discovery

Amazon, in addition to selling goods, has a few popular services such as selling storage or computing power in a pay-as-you-go model.
The service called Amazon Elastic Compute Cloud (EC2) provides server instances and, of course, they can be used to install and run Elasticsearch clusters (among many other things, as these are normal Linux machines). This is convenient: you pay for the instances that are needed in order to handle the current traffic or to speed up calculations, and you shut down unnecessary instances when the traffic is lower. Elasticsearch works well on EC2, but due to the nature of the environment, some features may work slightly differently. One of these features is discovery, because Amazon EC2 doesn't support multicast discovery. Of course, we can switch to unicast discovery, but sometimes we want to be able to automatically discover nodes and, with unicast, we need to at least provide the initial list of hosts. However, there is an alternative: we can use the Amazon EC2 plugin, a plugin that combines the multicast and unicast discovery methods using the Amazon EC2 API. Make sure that during the setup of EC2 instances, you set up communication between them (on ports 9200 and 9300 by default). This is crucial in order for Elasticsearch nodes to communicate with each other, which is required for the cluster to function. Of course, this communication depends on the network.bind_host and network.publish_host (or network.host) settings.

The EC2 plugin installation

Installing this plugin is as simple as installing most other plugins.
In order to install it, we should run the following command:

bin/plugin install elasticsearch/elasticsearch-cloud-aws/2.4.0

The EC2 plugin's generic configuration

This plugin has several configuration settings that we need to provide in order for the EC2 discovery to work:

  • cloud.aws.access_key: Amazon access key, one of the credential values you can find in the Amazon configuration panel
  • cloud.aws.secret_key: Amazon secret key; similar to the previously mentioned access_key setting, it can be found in the EC2 configuration panel

The last thing is to inform Elasticsearch that we want to use a new discovery type by setting the discovery.type property to the ec2 value and turning off multicast.

Optional EC2 discovery configuration options

The previously mentioned settings are sufficient to run the EC2 discovery, but in order to control the EC2 discovery plugin behavior, Elasticsearch exposes additional settings:

  • cloud.aws.region: This region will be used to connect with Amazon EC2 web services. You can choose the region where your instance resides, for example, eu-west-1 for Ireland. The possible values include eu-west, sa-east, us-east, us-west-1, us-west-2, ap-southeast-1, and ap-southeast-2.
  • cloud.aws.ec2.endpoint: If you are using EC2 API services, instead of defining a region, you can provide an address of the AWS endpoint, for example, ec2.eu-west-1.amazonaws.com.
  • cloud.aws.protocol: This is the protocol that should be used by the plugin to connect to Amazon Web Services endpoints. By default, Elasticsearch will use the HTTPS protocol (which means setting the value of the property to https). We can also change this behavior and set the property to http for the plugin to use HTTP without encryption. We are also allowed to overwrite the cloud.aws.protocol settings for each service by using the cloud.aws.ec2.protocol and cloud.aws.s3.protocol properties (the possible values are the same: https and http).
  • cloud.aws.proxy_host: Elasticsearch allows us to define a proxy that will be used to connect to AWS endpoints. The cloud.aws.proxy_host property should be set to the address of the proxy that should be used.
  • cloud.aws.proxy_port: The second property related to the AWS endpoints proxy allows us to specify the port on which the proxy is listening. The cloud.aws.proxy_port property should be set to the port on which the proxy listens.
  • discovery.ec2.ping_timeout (the default: 3s): This is the time to wait for the response to the ping message sent to the other node. After this time, the nonresponsive node will be considered dead and removed from the cluster. Increasing this value makes sense when dealing with network issues or when we have a lot of EC2 nodes.

The EC2 nodes scanning configuration

The last group of settings we want to mention allows us to configure a very important thing when building a cluster that works inside the EC2 environment: the ability to filter the available Elasticsearch nodes in our Amazon Elastic Compute Cloud network. The Elasticsearch EC2 plugin exposes the following properties that can help us configure its behavior:

  • discovery.ec2.host_type: This allows us to choose the host type that will be used to communicate with other nodes in the cluster. The values we can use are private_ip (the default one; the private IP address will be used for communication), public_ip (the public IP address will be used for communication), private_dns (the private hostname will be used for communication), and public_dns (the public hostname will be used for communication).
  • discovery.ec2.groups: This is a comma-separated list of security groups. Only nodes that fall within these groups can be discovered and included in the cluster.
  • discovery.ec2.availability_zones: This is an array or comma-separated list of availability zones. Only nodes within the specified availability zones will be discovered and included in the cluster.
  • discovery.ec2.any_group (this defaults to true): Setting this property to false will force the EC2 discovery plugin to discover only those nodes that reside in an Amazon instance that falls into all of the defined security groups. The default value requires only a single group to be matched.
  • discovery.ec2.tag: This is a prefix for a group of EC2-related settings. When you launch your Amazon EC2 instances, you can define tags, which can describe the purpose of the instance, such as the customer name or environment type. Then, you use these defined settings to limit discovery nodes. Let's say you define a tag named environment with a value of qa. In the configuration, you can now specify discovery.ec2.tag.environment: qa, and only nodes running on instances with this tag will be considered for discovery.
  • cloud.node.auto_attributes: When this is set to true, Elasticsearch will add EC2-related node attributes (such as the availability zone or group) to the node properties and will allow us to use them to adjust the Elasticsearch shard allocation and configure the shard placement.

Other discovery implementations

The Zen discovery and EC2 discovery are not the only discovery types that are available. There are two more discovery types that are developed and maintained by the Elasticsearch team, and these are:

  • Azure discovery: https://github.com/elasticsearch/elasticsearch-cloud-azure
  • Google Compute Engine discovery: https://github.com/elasticsearch/elasticsearch-cloud-gce

In addition to these, there are a few discovery implementations provided by the community, such as the ZooKeeper discovery for older versions of Elasticsearch (https://github.com/sonian/elasticsearch-zookeeper).

The gateway and recovery configuration

The gateway module allows us to store all the data that is needed for Elasticsearch to work properly.
This means that not only is the data in Apache Lucene indices stored, but also all the metadata (for example, index allocation settings), along with the mappings configuration for each index. Whenever the cluster state is changed, for example, when the allocation properties are changed, the cluster state will be persisted by using the gateway module. When the cluster is started up, its state will be loaded using the gateway module and applied. One should remember that when configuring different nodes and different gateway types, indices will use the gateway type configuration present on the given node. If an index state should not be stored using the gateway module, one should explicitly set the index gateway type to none. The gateway recovery process Let's say explicitly that the recovery process is used by Elasticsearch to load the data stored with the use of the gateway module in order for Elasticsearch to work. Whenever a full cluster restart occurs, the gateway process kicks in to load all the relevant information we've mentioned—the metadata, the mappings, and of course, all the indices. When the recovery process starts, the primary shards are initialized first, and then, depending on the replica state, they are initialized using the gateway data, or the data is copied from the primary shards if the replicas are out of sync. Elasticsearch allows us to configure when the cluster data should be recovered using the gateway module. We can tell Elasticsearch to wait for a certain number of master eligible or data nodes to be present in the cluster before starting the recovery process. However, one should remember that when the cluster is not recovered, all the operations performed on it will not be allowed. This is done in order to avoid modification conflicts. Configuration properties Before we continue with the configuration, we would like to say one more thing. 
As you know, Elasticsearch nodes can play different roles: they can be data nodes (the ones that hold data), they can have a master role, or they can be used only for request handling, which means not holding data and not being master eligible. Remembering all this, let's now look at the gateway configuration properties that we are allowed to modify:

  • gateway.recover_after_nodes: This is an integer number that specifies how many nodes should be present in the cluster for the recovery to happen. For example, when set to 5, at least 5 nodes (it doesn't matter whether they are data or master eligible nodes) must be present for the recovery process to start.
  • gateway.recover_after_data_nodes: This is an integer number that allows us to set how many data nodes should be present in the cluster for the recovery process to start.
  • gateway.recover_after_master_nodes: This is another gateway configuration option that allows us to set how many master eligible nodes should be present in the cluster for the recovery to start.
  • gateway.recover_after_time: This allows us to set how much time to wait before the recovery process starts after the conditions defined by the preceding properties are met. If we set this property to 5m, we tell Elasticsearch to start the recovery process 5 minutes after all the defined conditions are met. The default value for this property is 5m, starting from Elasticsearch 1.3.0.

Let's imagine that we have six nodes in our cluster, out of which four are data eligible. We also have an index that is built of three shards, which are spread across the cluster. The last two nodes are master eligible and they don't hold the data. What we would like to configure is the recovery process to be delayed for 3 minutes after the four data nodes are present.
Our gateway configuration could look like this:

gateway.recover_after_data_nodes: 4
gateway.recover_after_time: 3m

Expectations on nodes

In addition to the already mentioned properties, we can also specify properties that will force the recovery process of Elasticsearch. These properties are:

  • gateway.expected_nodes: This is the number of nodes expected to be present in the cluster for the recovery to start immediately. If you don't need the recovery to be delayed, it is advised that you set this property to the number of nodes (or at least most of them) that the cluster will be formed from, because that will guarantee that the latest cluster state will be recovered.
  • gateway.expected_data_nodes: This is the number of data eligible nodes expected to be present in the cluster for the recovery process to start immediately.
  • gateway.expected_master_nodes: This is the number of master eligible nodes expected to be present in the cluster for the recovery process to start immediately.

Now, let's get back to our previous example. We know that when all six nodes are connected and are in the cluster, we want the recovery to start. So, in addition to the preceding configuration, we would add the following property:

gateway.expected_nodes: 6

So the whole configuration would look like this:

gateway.recover_after_data_nodes: 4
gateway.recover_after_time: 3m
gateway.expected_nodes: 6

The preceding configuration says that the recovery process will be delayed for 3 minutes once four data nodes join the cluster and will begin immediately after six nodes are in the cluster (it doesn't matter whether they are data nodes or master eligible nodes).

The local gateway

With the release of Elasticsearch 0.20 (and some of the releases from 0.19 versions), all the gateway types, apart from the default local gateway type, were deprecated. It is advised that you do not use them, because they will be removed in future versions of Elasticsearch.
This is still not the case, but if you want to avoid full data reindexation, you should only use the local gateway type, and this is why we won't discuss all the other types. The local gateway type uses a local storage available on a node to store the metadata, mappings, and indices. In order to use this gateway type and the local storage available on the node, there needs to be enough disk space to hold the data with no memory caching. The persistence to the local gateway is different from the other gateways that are currently present (but deprecated). The writes to this gateway are done in a synchronous manner in order to ensure that no data will be lost during the write process. In order to set the type of gateway that should be used, one should use the gateway.type property, which is set to local by default. There is one additional thing regarding the local gateway of Elasticsearch that we didn't talk about—dangling indices. When a node joins a cluster, all the shards and indices that are present on the node, but are not present in the cluster, will be included in the cluster state. Such indices are called dangling indices, and we are allowed to choose how Elasticsearch should treat them. Elasticsearch exposes the gateway.local.auto_import_dangling property, which can take the value of yes (the default value that results in importing all dangling indices into the cluster), close (results in importing the dangling indices into the cluster state but keeps them closed by default), and no (results in removing the dangling indices). When setting the gateway.local.auto_import_dangling property to no, we can also set the gateway.local.dangling_timeout property (defaults to 2h) to specify how long Elasticsearch will wait while deleting the dangling indices. The dangling indices feature can be nice when we restart old Elasticsearch nodes, and we don't want old indices to be included in the cluster. 
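Looking back at the six-node example from the gateway settings above, the interplay between gateway.recover_after_data_nodes, gateway.recover_after_time, and gateway.expected_nodes can be modelled with a small Python sketch. This is a simplified model of the behavior described in the text, not Elasticsearch code; the GatewaySettings class and the recovery_delay function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GatewaySettings:
    """Hypothetical container mirroring three gateway.* properties."""
    recover_after_data_nodes: int = 0
    recover_after_time_seconds: int = 0
    expected_nodes: int = 0  # 0 means "not set" in this sketch


def recovery_delay(settings: GatewaySettings, data_nodes: int,
                   total_nodes: int) -> Optional[int]:
    """Seconds to wait before recovery starts, or None if it cannot start yet."""
    # All expected nodes are present: recovery starts immediately.
    if settings.expected_nodes and total_nodes >= settings.expected_nodes:
        return 0
    # Enough data nodes are present: recovery starts after the configured delay.
    if data_nodes >= settings.recover_after_data_nodes:
        return settings.recover_after_time_seconds
    # Conditions not met: the cluster keeps waiting for more nodes.
    return None


# The configuration from the example: 4 data nodes, 3 minutes, 6 expected nodes.
example = GatewaySettings(recover_after_data_nodes=4,
                          recover_after_time_seconds=180,
                          expected_nodes=6)

for data_nodes, total_nodes in [(3, 4), (4, 4), (4, 6)]:
    delay = recovery_delay(example, data_nodes, total_nodes)
    print(f"{data_nodes} data nodes, {total_nodes} nodes in total -> {delay}")
```

With three data nodes the model keeps waiting, with four data nodes it waits 180 seconds, and with all six nodes present it starts recovery at once.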
Low-level recovery configuration

We discussed that we can use the gateway to configure the behavior of the Elasticsearch recovery process, but in addition to that, Elasticsearch allows us to configure the recovery process itself. We decided that it would be good to mention these properties in the section dedicated to gateway and recovery.

Cluster-level recovery configuration

The recovery configuration is specified mostly on the cluster level and allows us to set general rules for the recovery module to work with. These settings are:

  • indices.recovery.concurrent_streams: This defaults to 3 and specifies the number of concurrent streams that are allowed to be opened in order to recover a shard from its source. The higher the value of this property, the more pressure will be put on the networking layer; however, the recovery may be faster, depending on your network usage and throughput.
  • indices.recovery.max_bytes_per_sec: By default, this is set to 20MB and specifies the maximum amount of data that can be transferred per second during shard recovery. In order to disable data transfer limiting, one should set this property to 0. Similar to the number of concurrent streams, this property allows us to control the network usage of the recovery process. Setting this property to higher values may result in higher network utilization and a faster recovery process.
  • indices.recovery.compress: This is set to true by default and allows us to define whether Elasticsearch should compress the data that is transferred during the recovery process. Setting this to false may lower the pressure on the CPU, but it will also result in more data being transferred over the network.
  • indices.recovery.file_chunk_size: This is the chunk size used to copy the shard data from the source shard. By default, it is set to 512KB and is compressed if the indices.recovery.compress property is set to true.
  • indices.recovery.translog_ops: This defaults to 1000 and specifies how many transaction log operations should be transferred between shards in a single request during the recovery process.
  • indices.recovery.translog_size: This is the chunk size used to copy the shard transaction log data from the source shard. By default, it is set to 512KB and is compressed if the indices.recovery.compress property is set to true.

In the versions prior to Elasticsearch 0.90.0, there was the indices.recovery.max_size_per_sec property, but it was deprecated, and it is suggested that you use the indices.recovery.max_bytes_per_sec property instead. However, if you are using an Elasticsearch version older than 0.90.0, it may be worth remembering this.

All the previously mentioned settings can be updated using the Cluster Update API, or they can be set in the elasticsearch.yml file.

Index-level recovery settings

In addition to the values mentioned previously, there is a single property that can be set on a per-index basis. The property can be set both in the elasticsearch.yml file and using the indices Update Settings API, and it is called index.recovery.initial_shards. In general, Elasticsearch will only recover a particular shard when there is a quorum of shards present and if that quorum can be allocated. A quorum is 50 percent of the shards for the given index plus one. By using the index.recovery.initial_shards property, we can change what Elasticsearch will take as a quorum. This property can be set to one of the following values:

  • quorum: 50 percent plus one of the shards for the given index need to be present and be allocable. This is the default value.
  • quorum-1: 50 percent of the shards for a given index need to be present and be allocable.
  • full: All of the shards for the given index need to be present and be allocable.
  • full-1: All but one of the shards for the given index need to be present and be allocable.
  • integer value: Any integer, such as 1, 2, or 5, specifies the number of shards that need to be present and allocable. For example, setting this value to 2 means that at least two shards need to be present and allocable.

It is good to know about this property, but the default value will be sufficient for most deployments.

Summary

In this article, we focused on the Elasticsearch configuration and the new features that were introduced in Elasticsearch 1.0. We configured discovery and recovery, and we used the human-friendly Cat API. In addition to that, we used the backup and restore functionality, which allowed easy backup and recovery of our indices. Finally, we looked at what federated search is and how to search and index data to multiple clusters, while still using all the functionalities of Elasticsearch and being connected to a single node.

If you want to dig deeper, buy the book Mastering Elasticsearch, Second Edition, which takes a simple step-by-step approach to using Elasticsearch and will enhance your knowledge further.

Resources for Article:

Further resources on this subject:
  • Downloading and Setting Up ElasticSearch [Article]
  • Indexing the Data [Article]
  • Driving Visual Analyses with Automobile Data (Python) [Article]

Packt
03 Mar 2015
11 min read

MapReduce functions

In this article by John Zablocki, author of the book Couchbase Essentials, you will be introduced to MapReduce and how you'll use it to create secondary indexes for your documents.

At its simplest, MapReduce is a programming pattern used to process large amounts of data that is typically distributed across several nodes in parallel. In the NoSQL world, MapReduce implementations may be found on many platforms from MongoDB to Hadoop, and of course, Couchbase. Even if you're new to the NoSQL landscape, it's quite possible that you've already worked with a form of MapReduce. The inspiration for MapReduce in distributed NoSQL systems was drawn from the functional programming concepts of map and reduce. While purely functional programming languages haven't quite reached mainstream status, languages such as Python, C#, and JavaScript all support map and reduce operations.

Map functions

Consider the following Python snippet:

    numbers = [1, 2, 3, 4, 5]
    doubled = list(map(lambda n: n * 2, numbers))
    # doubled == [2, 4, 6, 8, 10]

These two lines of code demonstrate a very simple use of a map() function. In the first line, the numbers variable is created as a list of integers. The second line applies a function to the list to create a new mapped list. (The list() call is needed on Python 3, where map() returns a lazy iterator rather than a list.) In this case, the function passed to map() is supplied as a Python lambda, which is just an inline, unnamed function. The body of the lambda multiplies each number by two.

This map() call can be made slightly more complex by doubling only odd numbers, as shown in this code:

    numbers = [1, 2, 3, 4, 5]

    def double_odd(num):
        if num % 2 == 0:
            return num
        else:
            return num * 2

    doubled = list(map(double_odd, numbers))
    # doubled == [2, 2, 6, 4, 10]

Map functions are implemented differently in each language or platform that supports them, but all follow the same pattern. An iterable collection of objects is passed to a map function.
The map function is then applied to each item of the collection in turn. The final result is a new collection where each of the original items is transformed by the map.

Reduce functions

Like maps, reduce functions also work by applying a provided function to an iterable data structure. The key difference between the two is that the reduce function works to produce a single value from the input iterable. Using Python's reduce() function (a built-in on Python 2, and part of the functools module on Python 3), we can see how to produce a sum of integers, as follows:

    from functools import reduce  # needed on Python 3

    numbers = [1, 2, 3, 4, 5]
    total = reduce(lambda x, y: x + y, numbers)
    # total == 15

You probably noticed that unlike our map operation, the reduce lambda has two parameters (x and y in this case). The argument passed to x will be the accumulated value of all applications of the function so far, and y will receive the next value to be added to the accumulation. Parenthetically, the order of operations can be seen as ((((1 + 2) + 3) + 4) + 5). Alternatively, the steps are shown in the following list:

  • x = 1, y = 2
  • x = 3, y = 3
  • x = 6, y = 4
  • x = 10, y = 5
  • x = 15

As this list demonstrates, the value of x is the cumulative sum of previous x and y values. As such, reduce functions are sometimes termed accumulate or fold functions. Regardless of their name, reduce functions serve the common purpose of combining pieces of a recursive data structure to produce a single value.

Couchbase MapReduce

Creating an index (or view) in Couchbase requires creating a map function written in JavaScript. When the view is created for the first time, the map function is applied to each document in the bucket containing the view. When you update a view, only new or modified documents are indexed. This behavior is known as incremental MapReduce.

You can think of a basic map function in Couchbase as being similar to a SQL CREATE INDEX statement. Effectively, you are defining a column, or a set of columns, to be indexed by the server.
Of course, these are not columns, but rather properties of the documents to be indexed.

Basic mapping

To illustrate the process of creating a view, first imagine that we have a set of JSON documents as shown here:

    var books = [
        { "id": 1, "title": "The Bourne Identity", "author": "Robert Ludlow" },
        { "id": 2, "title": "The Godfather", "author": "Mario Puzzo" },
        { "id": 3, "title": "Wiseguy", "author": "Nicholas Pileggi" }
    ];

Each document contains title and author properties. In Couchbase, to query these documents by either title or author, we'd first need to write a map function. Without considering how map functions are written in Couchbase, we're able to understand the process with vanilla JavaScript:

    books.map(function(book) {
        return book.author;
    });

In the preceding snippet, we're making use of the built-in JavaScript array's map() function. Similar to the Python snippets we saw earlier, JavaScript's map() function takes a function as a parameter and returns a new array with mapped objects. In this case, we'll have an array with each book's author, as follows:

    ["Robert Ludlow", "Mario Puzzo", "Nicholas Pileggi"]

At this point, we have a mapped collection that will be the basis for our author index. However, we haven't provided a means for the index to be able to refer back to its original document. If we were using a relational database, we'd have effectively created an index on the Author column with no way to get back to the row that contained it. With a slight modification to our map function, we are able to provide the key (the id property) of the document as well in our index:

    books.map(function(book) {
        return [book.author, book.id];
    });

In this slightly modified version, we're including the ID with the output of each author. In this way, the index has its document's key stored with its author:
    [["Robert Ludlow", 1], ["Mario Puzzo", 2], ["Nicholas Pileggi", 3]]

We'll soon see how this structure more closely resembles the values stored in a Couchbase index.

Basic reducing

Not every Couchbase index requires a reduce component. In fact, we'll see that Couchbase already comes with built-in reduce functions that will provide you with most of the reduce behavior you need. However, before relying on only those functions, it's important to understand why you'd use a reduce function in the first place. Returning to the preceding example of the map, let's imagine we have a few more documents in our set, as follows:

    var books = [
        { "id": 1, "title": "The Bourne Identity", "author": "Robert Ludlow" },
        { "id": 2, "title": "The Bourne Ultimatum", "author": "Robert Ludlow" },
        { "id": 3, "title": "The Godfather", "author": "Mario Puzzo" },
        { "id": 4, "title": "The Bourne Supremacy", "author": "Robert Ludlow" },
        { "id": 5, "title": "The Family", "author": "Mario Puzzo" },
        { "id": 6, "title": "Wiseguy", "author": "Nicholas Pileggi" }
    ];

We'll still create our index using the same map function because it provides a way of accessing a book by its author. Now imagine that we want to know how many books an author has written, or (assuming we had more data) the average number of pages written by an author. These questions are not possible to answer with a map function alone. Each application of the map function knows nothing about the previous application. In other words, there is no way for you to compare or accumulate information about one author's book to another book by the same author.

Fortunately, there is a solution to this problem. As you've probably guessed, it's the use of a reduce function. As a somewhat contrived example, consider this JavaScript:

    var mapped = books.map(function (book) {
        return [book.id, book.author];
    });

    // Accumulate a count per author in a plain object.
    var counts = mapped.reduce(function (acc, cur) {
        var key = cur[1]; // the author
        acc[key] = (acc[key] || 0) + 1;
        return acc;
    }, {});
    // counts == { "Robert Ludlow": 3, "Mario Puzzo": 2, "Nicholas Pileggi": 1 }

This code doesn't quite accurately reflect the way you would count books with Couchbase, but it illustrates the basic idea. You look for each occurrence of a key (author) and increment a counter when it is found. With Couchbase MapReduce, the mapped structure is supplied to the reduce() function in a better format. You won't need to keep track of items in a dictionary.

Couchbase views

At this point, you should have a general sense of what MapReduce is, where it came from, and how it will affect the creation of a Couchbase Server view. So without further ado, let's see how to write our first Couchbase view. There were two sample buckets to choose from during installation; the one we'll use is beer-sample. If you didn't install it, don't worry. You can add it by opening the Couchbase Console and navigating to the Settings tab. Here, you'll find the option to install the bucket.

First, you need to understand the document structures with which you're working. The following JSON object is a beer document (abbreviated):

    {
        "name": "Sundog",
        "type": "beer",
        "brewery_id": "new_holland_brewing_company",
        "description": "Sundog is an amber ale...",
        "style": "American-Style Amber/Red Ale",
        "category": "North American Ale"
    }

As you can see, the beer documents have several properties. We're going to create an index to let us query these documents by name. In SQL, the query would look like this:

    SELECT Id FROM Beers WHERE Name = ?

You might be wondering why the SQL example includes only the Id column in its projection. For now, just know that to query a document using a view with Couchbase, the property by which you're querying must be included in an index. To create that index, we'll write a map function. The simplest example of a map function to query beer documents by name is as follows:

    function(doc) {
        emit(doc.name);
    }

The body of this map function has only one line.
It calls the built-in Couchbase emit() function. This function is used to signal that a value should be indexed. The output of this map function will be an array of names.

The beer-sample bucket includes brewery data as well. These documents look like the following code (abbreviated):

    {
        "name": "Thomas Hooker Brewing",
        "city": "Bloomfield",
        "state": "Connecticut",
        "website": "http://www.hookerbeer.com/",
        "type": "brewery"
    }

If we reexamine our map function, we'll see an obvious problem: both the brewery and beer documents have a name property. When this map function is applied to the documents in the bucket, it will create an index containing both brewery and beer documents. The problem is that Couchbase documents exist in a single container: the bucket. There is no namespace for a set of related documents. The solution has typically involved including a type or docType property on each document. The value of this property is used to distinguish one kind of document from another. In the case of the beer-sample database, beer documents have type = "beer" and brewery documents have type = "brewery". Therefore, we are easily able to modify our map function to create an index only on beer documents:

    function(doc) {
        if (doc.type == "beer") {
            emit(doc.name);
        }
    }

The emit() function actually takes two arguments. The first, as we've seen, is the key to be indexed. The second argument is an optional value and is used by the reduce function.

Imagine that we want to count the number of beers in each category. In SQL, we would write the following query:

    SELECT Category, COUNT(*) FROM Beers GROUP BY Category

To achieve the same functionality with Couchbase Server, we'll need to use both map and reduce functions. First, let's write the map.
It will create an index on the category property:

    function(doc) {
        if (doc.type == "beer") {
            emit(doc.category, 1);
        }
    }

The only real difference between our category index and our name index is that we're including an argument for the value parameter of the emit() function. What we'll do with those values is simply count them. This counting will be done in our reduce function:

    function(keys, values) {
        return values.length;
    }

In this example, the values parameter will be given to the reduce function as a list of all values associated with a particular key. In our case, for each beer category, there will be a list of ones (that is, [1, 1, 1, 1, 1, 1]). Couchbase also provides a built-in _count function. It can be used in place of the entire reduce function in the preceding example.

Now that we've seen the basic requirements when creating an actual Couchbase view, it's time to add a view to our bucket. The easiest way to do so is to use the Couchbase Console.

Summary

In this article, you learned the purpose of secondary indexes in a key/value store. We dug deep into MapReduce, both in terms of its history in functional languages and as a tool for NoSQL and big data systems.

Resources for Article:

Further resources on this subject:
  • Map Reduce? [article]
  • Introduction to Mapreduce [article]
  • Working with Apps Splunk [article]

Packt
03 Mar 2015
13 min read

Performance Considerations

In this article by Dayong Du, the author of Apache Hive Essentials, we will look at the different performance considerations when using Hive.

Although Hive is built to deal with big data, we still cannot ignore the importance of performance. Most of the time, a Hive query can rely on the smart query optimizer to find the best execution strategy, as well as on the default best-practice settings from vendor packages. However, as experienced users, we should learn more about the theory and practice of performance tuning in Hive, especially when working in a performance-sensitive project or environment. We will start with the utilities available in Hive for finding potential issues that cause poor performance. Then, we introduce the best practices for performance in the areas of queries and jobs.

Performance utilities

Hive provides the EXPLAIN and ANALYZE statements that can be used as utilities to check and identify the performance of queries.

The EXPLAIN statement

Hive provides an EXPLAIN command to return a query execution plan without running the query. We can use an EXPLAIN command for queries if we have a doubt or a concern about performance. The EXPLAIN command also helps us see the difference between two or more queries written for the same purpose. The syntax for EXPLAIN is as follows:

    EXPLAIN [EXTENDED|DEPENDENCY|AUTHORIZATION] hive_query

The following keywords can be used:

  • EXTENDED: This provides additional information for the operators in the plan, such as file pathname and abstract syntax tree.
  • DEPENDENCY: This provides a JSON format output that contains a list of tables and partitions that the query depends on. It is available since Hive 0.10.0.
  • AUTHORIZATION: This lists all entities that need to be authorized to run the Hive query, including inputs and outputs, as well as authorization failures, if any. It is available since Hive 0.14.0.

A typical query plan contains the following three sections.
We will also have a look at an example later:

  • Abstract syntax tree (AST): Hive uses a parser generator called ANTLR (see http://www.antlr.org/) to automatically generate a syntax tree for HQL. We can usually ignore this.
  • Stage dependencies: This lists all dependencies and the number of stages used to run the query.
  • Stage plans: This contains important information, such as operators and sort orders, for running the job.

The following is what a typical query plan looks like. From the following example, we can see that the AST section is not shown since the EXTENDED keyword is not used with EXPLAIN. In the STAGE DEPENDENCIES section, both Stage-0 and Stage-1 are independent root stages. In the STAGE PLANS section, Stage-1 has one map and one reduce, referred to by Map Operator Tree and Reduce Operator Tree. Inside each Map/Reduce Operator Tree section, all operators corresponding to Hive query keywords, as well as expressions and aggregations, are listed. The Stage-0 stage does not have a map or a reduce. It is just a Fetch operation.

    jdbc:hive2://> EXPLAIN SELECT sex_age.sex, count(*)
    . . . . . . .> FROM employee_partitioned
    . . . . . . .> WHERE year=2014 GROUP BY sex_age.sex LIMIT 2;
    +-----------------------------------------------------------------------------+
    | Explain
    +-----------------------------------------------------------------------------+
    | STAGE DEPENDENCIES:
    |   Stage-1 is a root stage
    |   Stage-0 is a root stage
    |
    | STAGE PLANS:
    |   Stage: Stage-1
    |     Map Reduce
    |       Map Operator Tree:
    |           TableScan
    |             alias: employee_partitioned
    |             Statistics: Num rows: 0 Data size: 227 Basic stats: PARTIAL Column stats: NONE
    |             Select Operator
    |               expressions: sex_age (type: struct<sex:string,age:int>)
    |               outputColumnNames: sex_age
    |               Statistics: Num rows: 0 Data size: 227 Basic stats: PARTIAL Column stats: NONE
    |               Group By Operator
    |                 aggregations: count()
    |                 keys: sex_age.sex (type: string)
    |                 mode: hash
    |                 outputColumnNames: _col0, _col1
    |                 Statistics: Num rows: 0 Data size: 227 Basic stats: PARTIAL Column stats: NONE
    |                 Reduce Output Operator
    |                   key expressions: _col0 (type: string)
    |                   sort order: +
    |                   Map-reduce partition columns: _col0 (type: string)
    |                   Statistics: Num rows: 0 Data size: 227 Basic stats: PARTIAL Column stats: NONE
    |                   value expressions: _col1 (type: bigint)
    |       Reduce Operator Tree:
    |         Group By Operator
    |           aggregations: count(VALUE._col0)
    |           keys: KEY._col0 (type: string)
    |           mode: mergepartial
    |           outputColumnNames: _col0, _col1
    |           Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
    |           Select Operator
    |             expressions: _col0 (type: string), _col1 (type: bigint)
    |             outputColumnNames: _col0, _col1
    |             Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
    |             Limit
    |               Number of rows: 2
    |               Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
    |               File Output Operator
    |                 compressed: false
    |                 Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
    |                 table:
    |                     input format: org.apache.hadoop.mapred.TextInputFormat
    |                     output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
    |                     serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
    |
    |   Stage: Stage-0
    |     Fetch Operator
    |       limit: 2
    +-----------------------------------------------------------------------------+
    53 rows selected (0.26 seconds)

The ANALYZE statement

Hive statistics are a collection of data that describes the objects in the Hive database in more detail, such as the number of rows, number of files, and raw data size. Statistics are metadata about Hive data. Hive supports statistics at the table, partition, and column level. These statistics serve as an input to the Hive Cost-Based Optimizer (CBO), which picks the query plan with the lowest cost in terms of the system resources required to complete the query.

Since Hive 0.10.0, the statistics are gathered through the ANALYZE statement on tables, partitions, and columns, as given in the following examples:

    jdbc:hive2://> ANALYZE TABLE employee COMPUTE STATISTICS;
    No rows affected (27.979 seconds)
    jdbc:hive2://> ANALYZE TABLE employee_partitioned
    . . . . . . .> PARTITION(year=2014, month=12) COMPUTE STATISTICS;
    No rows affected (45.054 seconds)
    jdbc:hive2://> ANALYZE TABLE employee_id COMPUTE STATISTICS
    . . . . . . .> FOR COLUMNS employee_id;
    No rows affected (41.074 seconds)

Once the statistics are built, we can check them with the DESCRIBE EXTENDED/FORMATTED statement. In the table/partition output, we can find the statistics information inside the parameters, such as parameters:{numFiles=1, COLUMN_STATS_ACCURATE=true, transient_lastDdlTime=1417726247, numRows=4, totalSize=227, rawDataSize=223}. The following is an example:

    jdbc:hive2://> DESCRIBE EXTENDED employee_partitioned
    . . . . . . .> PARTITION(year=2014, month=12);
    jdbc:hive2://> DESCRIBE EXTENDED employee;
    …
    parameters:{numFiles=1, COLUMN_STATS_ACCURATE=true, transient_lastDdlTime=1417726247, numRows=4, totalSize=227, rawDataSize=223}).
    jdbc:hive2://> DESCRIBE FORMATTED employee.name;
    +----------+-----------+-----+-----+-----------+----------------+-------------+-------------+
    | col_name | data_type | min | max | num_nulls | distinct_count | avg_col_len | max_col_len |
    +----------+-----------+-----+-----+-----------+----------------+-------------+-------------+
    | name     | string    |     |     | 0         | 5              | 5.6         | 7           |
    +----------+-----------+-----+-----+-----------+----------------+-------------+-------------+
    +-----------+------------+-------------------+
    | num_trues | num_falses | comment           |
    +-----------+------------+-------------------+
    |           |            | from deserializer |
    +-----------+------------+-------------------+
    3 rows selected (0.116 seconds)

Hive statistics are persisted in the metastore to avoid computing them every time. For newly created tables and/or partitions, statistics are automatically computed by default if we enable the following setting:

    jdbc:hive2://> SET hive.stats.autogather=true;

Hive logs

Logs provide useful information to find out how a Hive query/job runs. By checking the Hive logs, we can identify runtime problems and issues that may cause bad performance. There are two types of logs available in Hive: system log and job log.

The system log contains the Hive running status and issues. It is configured in {HIVE_HOME}/conf/hive-log4j.properties. The following three lines for the Hive log can be found there:

    hive.root.logger=WARN,DRFA
    hive.log.dir=/tmp/${user.name}
    hive.log.file=hive.log

To change the logging level, we can either modify the preceding lines in hive-log4j.properties (applies to all users) or set it from the Hive CLI (only applies to the current user and current session) as follows:

    hive --hiveconf hive.root.logger=DEBUG,console

The job log contains Hive query information and is saved at the same place, /tmp/${user.name}, by default as one file for each Hive user session.
We can override it in hive-site.xml with the hive.querylog.location property. If a Hive query generates MapReduce jobs, those logs can also be viewed through the Hadoop JobTracker Web UI.

Job and query optimization

Job and query optimization covers the experience and skills needed to improve performance in the areas of job-running mode, JVM reuse, parallel job execution, and JOIN query optimization.

Local mode

Hadoop can run in standalone, pseudo-distributed, and fully distributed mode. Most of the time, we need to configure Hadoop to run in fully distributed mode. When the data to process is small, starting distributed data processing is an overhead, since launching a job in fully distributed mode takes more time than processing the data. Since Hive 0.7.0, Hive supports automatic conversion of a job to run in local mode with the following settings:

    jdbc:hive2://> SET hive.exec.mode.local.auto=true; --default false
    jdbc:hive2://> SET hive.exec.mode.local.auto.inputbytes.max=50000000;
    jdbc:hive2://> SET hive.exec.mode.local.auto.input.files.max=5; --default 4

A job must satisfy the following conditions to run in the local mode:

  • The total input size of the job is lower than hive.exec.mode.local.auto.inputbytes.max
  • The total number of map tasks is less than hive.exec.mode.local.auto.input.files.max
  • The total number of reduce tasks required is 1 or 0

JVM reuse

By default, Hadoop launches a new JVM for each map or reduce task and runs the tasks in parallel. When a map or reduce task is lightweight and runs for only a few seconds, the JVM startup process could be a significant overhead. The MapReduce framework (version 1 only, not YARN) has an option to reuse a JVM by sharing it to run mappers/reducers serially instead of in parallel. JVM reuse applies to map or reduce tasks in the same job. Tasks from different jobs will always run in a separate JVM.
To enable the reuse, we can set the maximum number of tasks for a single job for JVM reuse using the mapred.job.reuse.jvm.num.tasks property. Its default value is 1:

    jdbc:hive2://> SET mapred.job.reuse.jvm.num.tasks=5;

We can also set the value to -1 to indicate that all the tasks for a job will run in the same JVM.

Parallel execution

Hive queries are commonly translated into a number of stages that are executed in sequence by default. These stages are not always dependent on each other. Instead, they can run in parallel to reduce the overall job running time. We can enable this feature with the following settings:

    jdbc:hive2://> SET hive.exec.parallel=true; -- default false
    jdbc:hive2://> SET hive.exec.parallel.thread.number=16;
    -- default 8, it defines the max number for running in parallel

Parallel execution will increase the cluster utilization. If the utilization of a cluster is already very high, parallel execution will not help much in terms of overall performance.

Join optimization

Here, we'll briefly review the key settings for join improvement.

Common join

The common join is also called reduce side join. It is the basic join in Hive and works most of the time. For common joins, we need to make sure the big table is on the right-most side or is specified by a hint, as follows:

    /*+ STREAMTABLE(stream_table_name) */

Map join

Map join is used when one of the join tables is small enough to fit in memory, so it is very fast but limited. Since Hive 0.7.0, Hive can convert to a map join automatically with the following settings:

    jdbc:hive2://> SET hive.auto.convert.join=true; --default false
    jdbc:hive2://> SET hive.mapjoin.smalltable.filesize=600000000; --default 25M
    jdbc:hive2://> SET hive.auto.convert.join.noconditionaltask=true;
    --default false. Set to true so that map join hint is not needed
    jdbc:hive2://> SET hive.auto.convert.join.noconditionaltask.size=10000000;
    --The default value controls the size of table to fit in memory

Once autoconvert is enabled, Hive will automatically check whether the file size of the smaller table is bigger than the value specified by hive.mapjoin.smalltable.filesize. If it is, Hive will use a common join. If the file size is smaller than this threshold, Hive will try to convert the common join into a map join. Once autoconvert join is enabled, there is no need to provide map join hints in the query.

Bucket map join

Bucket map join is a special type of map join applied on bucket tables. To enable bucket map join, we need to enable the following settings:

    jdbc:hive2://> SET hive.auto.convert.join=true; --default false
    jdbc:hive2://> SET hive.optimize.bucketmapjoin=true; --default false

In a bucket map join, all the join tables must be bucket tables and must be joined on the bucket columns. In addition, the number of buckets in the bigger tables must be a multiple of the number of buckets in the small tables.

Sort merge bucket (SMB) join

SMB is the join performed on bucket tables that have the same sorted, bucket, and join condition columns. It reads data from both bucket tables and performs common joins (map and reduce triggered) on the bucket tables. We need to enable the following properties to use SMB:

    jdbc:hive2://> SET hive.input.format=
    . . . . . . .> org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
    jdbc:hive2://> SET hive.auto.convert.sortmerge.join=true;
    jdbc:hive2://> SET hive.optimize.bucketmapjoin=true;
    jdbc:hive2://> SET hive.optimize.bucketmapjoin.sortedmerge=true;
    jdbc:hive2://> SET hive.auto.convert.sortmerge.join.noconditionaltask=true;

Sort merge bucket map (SMBM) join

SMBM join is a special bucket join that triggers a map-side join only. Unlike map join, it can avoid caching all rows in memory.
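To see why a sort-merge style join does not need to hold a whole table in memory, the following is a minimal Python sketch of the merge step. It is a conceptual illustration only, not Hive's implementation: the function name sort_merge_join and the sample rows are hypothetical, and both inputs are assumed to be already sorted by the join key, as rows in matching sorted buckets would be.

```python
from itertools import groupby
from operator import itemgetter


def sort_merge_join(left, right):
    """Inner-join two iterables of (key, value) pairs that are sorted by key."""
    left_groups = groupby(left, key=itemgetter(0))
    right_groups = groupby(right, key=itemgetter(0))
    end = (None, None)  # sentinel meaning "this side is exhausted"
    lkey, lrows = next(left_groups, end)
    rkey, rrows = next(right_groups, end)
    while lrows is not None and rrows is not None:
        if lkey < rkey:
            lkey, lrows = next(left_groups, end)
        elif lkey > rkey:
            rkey, rrows = next(right_groups, end)
        else:
            # Only the rows for the current key are buffered, never a whole table.
            right_values = [value for _, value in rrows]
            for _, left_value in lrows:
                for right_value in right_values:
                    yield (lkey, left_value, right_value)
            lkey, lrows = next(left_groups, end)
            rkey, rrows = next(right_groups, end)


# Hypothetical sample data, sorted by key on both sides.
left_rows = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
right_rows = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]
joined = list(sort_merge_join(left_rows, right_rows))
print(joined)
```

Because both sides advance in key order, each input is read once and memory use is bounded by the largest group of rows sharing one key, which is the property that sorted buckets make possible.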
To perform SMBM joins, the join tables must have the same bucket, sort, and join condition columns. To enable such joins, we need to enable the following settings:

    jdbc:hive2://> SET hive.auto.convert.join=true;
    jdbc:hive2://> SET hive.auto.convert.sortmerge.join=true;
    jdbc:hive2://> SET hive.optimize.bucketmapjoin=true;
    jdbc:hive2://> SET hive.optimize.bucketmapjoin.sortedmerge=true;
    jdbc:hive2://> SET hive.auto.convert.sortmerge.join.noconditionaltask=true;
    jdbc:hive2://> SET hive.auto.convert.sortmerge.join.bigtable.selection.policy=org.apache.hadoop.hive.ql.optimizer.TableSizeBasedBigTableSelectorForAutoSMJ;

Skew join

When working with data that has a highly uneven distribution, data skew can occur in such a way that a small number of compute nodes must handle the bulk of the computation. The following settings tell Hive to optimize properly if data skew happens:

    jdbc:hive2://> SET hive.optimize.skewjoin=true; -- If there is data skew in join, set it to true. Default is false.
    jdbc:hive2://> SET hive.skewjoin.key=100000; -- This is the default value. If the number of rows for a key is bigger than
                                                 -- this, the new keys will be sent to the other unused reducers.

Skewed data can occur in GROUP BY data too. To optimize it, we need the following setting to enable skew data optimization in the GROUP BY result:

    SET hive.groupby.skewindata=true;

Once configured, Hive will first trigger an additional MapReduce job whose map output is distributed randomly to the reducers to avoid data skew. For more information about Hive join optimization, please refer to the Apache Hive wiki available at https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization and https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization.

Summary

In this article, we first covered how to identify performance bottlenecks using the EXPLAIN and ANALYZE statements. Then, we discussed job and query optimization in Hive.
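As a rough illustration of the GROUP BY skew optimization described above (an extra job that spreads rows randomly before the final aggregation), the idea can be sketched in a few lines of Python. This is not Hive's implementation: the function two_stage_count, the reducer count, and the sample keys are all assumptions made for the example.

```python
import random
from collections import Counter


def two_stage_count(keys, num_reducers=4, seed=0):
    """Count keys in two stages so that a hot key is spread over all reducers.

    Hypothetical helper for illustration only; it mimics the idea behind
    hive.groupby.skewindata, not Hive's actual execution plan.
    """
    rng = random.Random(seed)
    # Stage 1: send each row to a random reducer, which keeps partial counts.
    partials = [Counter() for _ in range(num_reducers)]
    for key in keys:
        partials[rng.randrange(num_reducers)][key] += 1
    # Stage 2: merge the partial counts by key to get the final result.
    totals = Counter()
    for partial in partials:
        totals.update(partial)
    return partials, totals


# Made-up, heavily skewed input: one hot key and three rare keys.
rows = ["hot"] * 1000 + ["a", "b", "c"] * 5
partials, totals = two_stage_count(rows)
print(totals["hot"], [p["hot"] for p in partials])
```

No single reducer in the first stage sees all the rows of the hot key, which is the point of the additional job; the second stage only has to merge a handful of partial counts per key.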
Packt
03 Mar 2015
17 min read

Basics of Programming in Julia

In this article by Ivo Balbaert, author of the book Getting Started with Julia Programming, we will explore how Julia interacts with the outside world, reading from standard input and writing to standard output, files, networks, and databases. Julia provides asynchronous networking I/O using the libuv library. We will see how to handle data in Julia. We will also discover the parallel processing model of Julia.

In this article, the following topics are covered:

  • Working with files (including the CSV files)
  • Using DataFrames

Working with files

To work with files, we need the IOStream type. IOStream is a type with the supertype IO and has the following characteristics:

  • The fields are given by names(IOStream):
    4-element Array{Symbol,1}:  :handle   :ios    :name   :mark
  • The types are given by IOStream.types:
    (Ptr{None}, Array{Uint8,1}, String, Int64)

The file handle is a pointer of the type Ptr, which is a reference to the file object. Opening and reading a line-oriented file with the name example.dat is very easy:

    # code in Chapter 8\io.jl
    fname = "example.dat"
    f1 = open(fname)

fname is a string that contains the path to the file, escaping special characters with \ when necessary; for example, in Windows, when the file is in the test folder on the D: drive, this would become d:\\test\\example.dat. The f1 variable is now an IOStream(<file example.dat>) object.

To read all lines one after the other into an array, use data = readlines(f1), which returns:

    3-element Array{Union(ASCIIString,UTF8String),1}:
     "this is line 1.\r\n"
     "this is line 2.\r\n"
     "this is line 3."

For processing line by line, only a simple loop is now needed:

    for line in data
      println(line) # or process line
    end
    close(f1)

Always close the IOStream object to clean up and release resources. If you want to read the file into one string, use readall.
Use this only for relatively small files because of the memory consumption; this can also be a potential problem when using readlines.

There is a convenient shorthand with the do syntax for opening a file, applying a function process, and closing it automatically. This goes as follows (file is the IOStream object in this code):

    open(fname) do file
        process(file)
    end

The do block creates an anonymous function and passes it to open. Thus, the previous code example is equivalent to open(process, fname). Use the same syntax for processing a file fname line by line without the memory overhead of the previous methods, for example:

    open(fname) do file
        for line in eachline(file)
            print(line) # or process line
        end
    end

Writing a file requires first opening it with a "w" flag, then writing strings to it with write, print, or println, and then closing the file handle, which flushes the IOStream object to the disk:

    fname = "example2.dat"
    f2 = open(fname, "w")
    write(f2, "I write myself to a file\n") # returns 25 (bytes written)
    println(f2, "even with println!")
    close(f2)

Opening a file with the "w" option will clear the file if it exists. To append to an existing file, use "a".

To process all the files in the current folder (or a given folder as an argument to readdir()), use this for loop:

    for file in readdir()
      # process file
    end

Reading and writing CSV files

A CSV file is a comma-separated file. The data fields in each line are separated by commas "," or another delimiter such as semicolons ";". These files are the de facto standard for exchanging small and medium amounts of tabular data. Such files are structured so that one line contains data about one data object, so we need a way to read and process the file line by line. As an example, we will use the data file Chapter 8\winequality.csv, which contains 1,599 sample measurements with 12 data columns per sample, such as pH and alcohol, separated by semicolons.
[Screenshot: the top 20 rows of winequality.csv]

In general, the readdlm function is used to read in the data from CSV files:

    # code in Chapter 8\csv_files.jl:
    fname = "winequality.csv"
    data = readdlm(fname, ';')

The second argument is the delimiter character (here, it is ;). The resulting data is a 1600x12 Array{Any,2}; its element type is Any because no common type could be found:

    "fixed acidity"   "volatile acidity"      "alcohol"   "quality"
     7.4               0.7                     9.4         5.0
     7.8               0.88                    9.8         5.0
     7.8               0.76                    9.8         5.0
     …

If the data file is comma separated, reading it is even simpler with the following command:

    data2 = readcsv(fname)

The problem with what we have done until now is that the headers (the column titles) were read as part of the data. Fortunately, we can pass the argument header=true to let Julia put the first line in a separate array. The data array then naturally gets the correct datatype, Float64. We can also specify the type explicitly, such as this:

    data3 = readdlm(fname, ';', Float64, '\n', header=true)

The third argument here is the type of data, which is a numeric type, String, or Any. The next argument is the line separator character, and the fifth indicates whether or not there is a header line with the field (column) names. If so, then data3 is a tuple with the data as the first element and the header as the second, in our case, (1599x12 Array{Float64,2}, 1x12 Array{String,2}). (There are other optional arguments to readdlm; see its help.) In this case, the actual data is given by data3[1] and the header by data3[2].

Let's continue working with the variable data. The data forms a matrix, and we can get the rows and columns of data using the normal array-matrix syntax.
For example, the third row is given by row3 = data[3, :] with data:

    7.8  0.88  0.0  2.6  0.098  25.0  67.0  0.9968  3.2  0.68  9.8  5.0

representing the measurements for all the characteristics of a certain wine. The measurements of a certain characteristic for all wines are given by a data column, for example, col3 = data[ :, 3] represents the measurements of citric acid and returns a column vector:

    1600-element Array{Any,1}:
     "citric acid"
     0.0  0.0  0.04  0.56  0.0  0.0 …  0.08  0.08  0.1  0.13  0.12  0.47

If we need columns 2-4 (volatile acidity to residual sugar) for all wines, extract the data with x = data[:, 2:4]. If we need these measurements only for the wines on rows 70-75, get these with y = data[70:75, 2:4], returning a 6 x 3 Array{Any,2} output as follows:

    0.32   0.57  2.0
    0.705  0.05  1.9
    …
    0.675  0.26  2.1

To get a matrix with the data from columns 3, 6, and 11, execute the following command:

    z = [data[:,3] data[:,6] data[:,11]]

It would be useful to create a type Wine in the code. For example, if the data is to be passed around functions, it will improve the code quality to encapsulate all the data in a single data type, like this:

    type Wine
        fixed_acidity::Array{Float64}
        volatile_acidity::Array{Float64}
        citric_acid::Array{Float64}
        # other fields
        quality::Array{Float64}
    end

Then, we can create objects of this type to work with them, like in any other object-oriented language, for example, wine1 = Wine(data[1, :]...), where the elements of the row are splatted with the ... operator into the Wine constructor.

To write to a CSV file, the simplest way is to use the writecsv function for a comma separator, or the writedlm function if you want to specify another separator. For example, to write an array data to a file partial.dat, you need to execute the following command:

    writedlm("partial.dat", data, ';')

If more control is necessary, you can easily combine the more basic functions from the previous section.
For example, the following code snippet writes 10 tuples of three numbers each to a file:

    # code in Chapter 8\tuple_csv.jl
    fname = "savetuple.csv"
    csvfile = open(fname,"w")
    # writing headers:
    write(csvfile, "ColName A, ColName B, ColName C\n")
    for i = 1:10
      tup(i) = tuple(rand(Float64,3)...)
      write(csvfile, join(tup(i),","), "\n")
    end
    close(csvfile)

Using DataFrames

If you measure n variables (each of a different type) of a single object of observation, then you get a table with n columns for each object row. If there are m observations, then we have m rows of data. For example, given student grades as data, you might want to compute the average grade for each socioeconomic group, where grade and socioeconomic group are both columns in the table, and there is one row per student.

The DataFrame is the most natural representation to work with such an (m x n) table of data. DataFrames are similar to pandas DataFrames in Python or data.frame in R. A DataFrame is a more specialized tool than a normal array for working with tabular and statistical data, and it is defined in the DataFrames package, a popular Julia library for statistical work. Install it in your environment by typing Pkg.add("DataFrames") in the REPL. Then, import it into your current workspace with using DataFrames. Do the same for the packages DataArrays and RDatasets (which contains a collection of example datasets mostly used in the R literature).

A common case in statistical data is that data values can be missing (the information is not known). The DataArrays package provides us with the unique value NA, which represents a missing value and has the type NAtype. The result of a computation that contains NA values mostly cannot be determined, for example, 42 + NA returns NA. (Julia v0.4 also has a new Nullable{T} type, which allows you to specify the type of a missing value.)
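The rule that a missing value contaminates any computation it touches can be illustrated outside Julia. The following is a minimal Python sketch of that propagation rule, not the DataArrays implementation; NAType, NA, na_sum, and drop_na are hypothetical names introduced only for this example.

```python
class NAType:
    """Hypothetical missing-value marker: any addition involving it yields NA."""

    def __repr__(self):
        return "NA"

    def __add__(self, other):
        return self

    # Also handle the case where NA is the right-hand operand (42 + NA).
    __radd__ = __add__


NA = NAType()


def na_sum(values):
    """Sum values left to right; one NA makes the whole result NA."""
    total = 0
    for value in values:
        total = total + value
    return total


def drop_na(values):
    """Return only the values that are actually present."""
    return [value for value in values if value is not NA]


measurements = [7, 3, NA, 5]
print(42 + NA)                     # NA
print(na_sum(measurements))        # NA
print(sum(drop_na(measurements)))  # 15
```

Dropping the missing entries first, as in the last line, is only safe when ignoring them does not bias the result.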
A DataArray{T} array is a data structure that can be n-dimensional, behaves like a standard Julia array, and can contain values of the type T, but it can also contain the missing (Not Available) values NA and can work efficiently with them. To construct them, use the @data macro:

    # code in Chapter 8\dataarrays.jl
    using DataArrays
    using DataFrames
    dv = @data([7, 3, NA, 5, 42])

This returns 5-element DataArray{Int64,1}: 7  3   NA  5 42.

The sum of these numbers is given by sum(dv) and returns NA. One can also assign NA values to the array with dv[5] = NA; then, dv becomes [7, 3, NA, 5, NA]. Converting this data structure to a normal array fails: convert(Array, dv) returns ERROR: NAException.

How do we get rid of these NA values, supposing we can do so safely? We can use the dropna function, for example, sum(dropna(dv)) returns 15. If you know that you can replace them with a value v, use the array function:

    repl = -1
    sum(array(dv, repl)) # returns 13

A DataFrame is a kind of in-memory database, versatile in the ways you can work with the data. It consists of columns with names such as Col1, Col2, Col3, and so on. Each of these columns is a DataArray that has its own type, and the data they contain can be referred to by the column names as well, so we have substantially more forms of indexing. Unlike two-dimensional arrays, columns in a DataFrame can be of different types. One column might, for instance, contain the names of students and should therefore be a string. Another column could contain their age and should be an integer.

We construct a DataFrame from the program data as follows:

    # code in Chapter 8\dataframes.jl
    using DataFrames
    # constructing a DataFrame:
    df = DataFrame()
    df[:Col1] = 1:4
    df[:Col2] = [e, pi, sqrt(2), 42]
    df[:Col3] = [true, false, true, false]
    show(df)

Notice that the column headers are used as symbols.
This returns the following 4 x 3 DataFrame object:

    4x3 DataFrame
    | Row | Col1 | Col2    | Col3  |
    |-----|------|---------|-------|
    | 1   | 1    | 2.71828 | true  |
    | 2   | 2    | 3.14159 | false |
    | 3   | 3    | 1.41421 | true  |
    | 4   | 4    | 42.0    | false |

We could also have used the full constructor as follows:

    df = DataFrame(Col1 = 1:4, Col2 = [e, pi, sqrt(2), 42],
       Col3 = [true, false, true, false])

You can refer to the columns either by an index (the column number) or by a name. Both of the following expressions return the same output:

    show(df[2])
    show(df[:Col2])

This gives the following output:

    [2.718281828459045, 3.141592653589793, 1.4142135623730951,42.0]

To show the rows or subsets of rows and columns, use the familiar slice (:) syntax, for example:

  • To get the first row, execute df[1, :]. This returns 1x3 DataFrame.
    | Row | Col1 | Col2    | Col3 |
    |-----|------|---------|------|
    | 1   | 1    | 2.71828 | true |
  • To get the second and third row, execute df[2:3, :]
  • To get only the second column from the previous result, execute df[2:3, :Col2]. This returns [3.141592653589793, 1.4142135623730951].
  • To get the second and third column from the second and third row, execute df[2:3, [:Col2, :Col3]], which returns the following output:
    2x2 DataFrame
    | Row | Col2    | Col3  |
    |-----|---------|-------|
    | 1   | 3.14159 | false |
    | 2   | 1.41421 | true  |

The following functions are very useful when working with DataFrames:

  • The head(df) and tail(df) functions show you the first six and the last six lines of data respectively.
  • The names function gives the names of the columns: names(df). It returns 3-element Array{Symbol,1}:  :Col1  :Col2  :Col3.
  • The eltypes function gives the data types of the columns: eltypes(df). It gives the output as 3-element Array{Type{T<:Top},1}:  Int64  Float64  Bool.
  • The describe function tries to give some useful summary information about the data in the columns, depending on the type. For example, describe(df) gives for column 2 (which is numeric) the min, max, median, mean, number, and percentage of NAs:
    Col2
    Min      1.4142135623730951
    1st Qu.  2.392264761937558
    Median   2.929937241024419
    Mean     12.318522011105483
    3rd Qu.  12.856194490192344
    Max      42.0
    NAs      0
    NA%      0.0%

To load in data from a local CSV file, use the method readtable. The returned object is of type DataFrame:

    # code in Chapter 8\dataframes.jl
    using DataFrames
    fname = "winequality.csv"
    data = readtable(fname, separator = ';')
    typeof(data) # DataFrame
    size(data) # (1599,12)

[Screenshot: a fraction of the resulting DataFrame output]

The readtable method also supports reading in gzipped CSV files. Writing a DataFrame to a file can be done with the writetable function, which takes the filename and the DataFrame as arguments, for example, writetable("dataframe1.csv", df). By default, writetable will use the delimiter specified by the filename extension and write the column names as headers. Both readtable and writetable support numerous options for special cases. Refer to the docs for more information (http://dataframesjl.readthedocs.org/en/latest/).

To demonstrate some of the power of DataFrames, here are some queries you can do:

  • Make a vector with only the quality information: data[:quality]
  • Give the wines with alcohol percentage equal to 9.5, for example, data[ data[:alcohol] .== 9.5, :]. Here, we use the .== operator, which does element-wise comparison. data[:alcohol] .== 9.5 returns an array of Boolean values (true for datapoints where :alcohol is 9.5, and false otherwise). data[boolean_array, : ] selects those rows where boolean_array is true.
  • Count the number of wines grouped by quality with by(data, :quality, data -> size(data, 1)), which returns the following:
    6x2 DataFrame
    | Row | quality | x1  |
    |-----|---------|-----|
    | 1   | 3       | 10  |
    | 2   | 4       | 53  |
    | 3   | 5       | 681 |
    | 4   | 6       | 638 |
    | 5   | 7       | 199 |
    | 6   | 8       | 18  |

The DataFrames package contains the by function, which takes three arguments:

  • A DataFrame, here data
  • A column to split the DataFrame on, here quality
  • A function or an expression to apply to each subset of the DataFrame, here data -> size(data, 1), which gives us the number of wines for each quality value

Another easy way to get the distribution among quality is to execute the histogram function hist, as in hist(data[:quality]), which gives the counts over the range of quality: (2.0:1.0:8.0,[10,53,681,638,199,18]). More precisely, this is a tuple with the first element corresponding to the edges of the histogram bins, and the second denoting the number of items in each bin. So there are, for example, 10 wines with quality between 2 and 3, and so on.

To extract the counts as a variable count of type Vector, we can execute _, count = hist(data[:quality]); the _ means that we ignore the first element of the tuple. To obtain the quality classes as a DataArray class, we will execute the following:

    class = sort(unique(data[:quality]))

We can now construct a df_quality DataFrame with the class and count columns as df_quality = DataFrame(qual=class, no=count). This gives the following output:

    6x2 DataFrame
    | Row | qual | no  |
    |-----|------|-----|
    | 1   | 3    | 10  |
    | 2   | 4    | 53  |
    | 3   | 5    | 681 |
    | 4   | 6    | 638 |
    | 5   | 7    | 199 |
    | 6   | 8    | 18  |

To deepen your understanding and learn about the other features of Julia DataFrames (such as joining, reshaping, and sorting), refer to the documentation available at http://dataframesjl.readthedocs.org/en/latest/.
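The split, count, and tabulate pattern used above (group the rows by quality, count each group, then pair the sorted classes with their counts) is language independent. Here is a minimal Python sketch of the same steps; the quality scores are made up for the example and are not taken from winequality.csv, and the names qualities, counts, classes, and table are assumptions.

```python
from collections import Counter

# Hypothetical quality scores for ten wines (not the real dataset).
qualities = [5, 5, 6, 7, 5, 6, 3, 8, 6, 5]

# Split by value and count each group, like by(data, :quality, ...).
counts = Counter(qualities)

# Sorted distinct classes, like sort(unique(data[:quality])).
classes = sorted(counts)

# One (qual, no) row per class, like DataFrame(qual=class, no=count).
table = [(quality, counts[quality]) for quality in classes]

for quality, number in table:
    print(quality, number)
```

Building the table from the sorted classes keeps the rows in a stable order, which a plain dictionary of counts does not promise by itself.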
Other file formats

Julia can work with other human-readable file formats through specialized packages:

  • For JSON, use the JSON package. The parse method converts JSON strings into Dictionaries, and the json method turns any Julia object into a JSON string.
  • For XML, use the LightXML package
  • For YAML, use the YAML package
  • For HDF5 (a common format for scientific data), use the HDF5 package
  • For working with Windows INI files, use the IniFile package

Summary

In this article, we discussed the basics of working with files, CSV data, and DataFrames in Julia.
Packt
03 Mar 2015
11 min read

Getting Started with PostgreSQL

In this article by Ibrar Ahmed, Asif Fayyaz, and Amjad Shahzad, authors of the book PostgreSQL Developer's Guide, we will come across the basic features and functions of PostgreSQL, such as writing queries using psql, data definition in tables, and data manipulation from tables.

PostgreSQL is widely considered to be one of the most stable database servers available today, with multiple features that include:

  • A wide range of built-in types
  • MVCC
  • New SQL enhancements, including foreign keys, primary keys, and constraints
  • Open source code, maintained by a team of developers
  • Trigger and procedure support with multiple procedural languages
  • Extensibility in the sense of adding new data types and the client language

Since the early releases of PostgreSQL (from version 6.0, that is), many changes have been made, with each new major version adding new and more advanced features. The current version is PostgreSQL 9.4 and is available from several sources and in various binary formats.

Writing queries using psql

Throughout this article, we will use a warehouse database called warehouse_db. In this section, I will show you how you can create such a database, providing you with sample code for assistance. You will need to do the following:

We are assuming here that you have successfully installed PostgreSQL and faced no issues. Now, you will need to connect with the default database that is created by the PostgreSQL installer.
To do this, navigate from your command line to the default installation path, which is /opt/PostgreSQL/9.4/bin, and execute the following command, which will prompt for the postgres user password that you provided during the installation:

    /opt/PostgreSQL/9.4/bin$ ./psql -U postgres
    Password for user postgres:

This logs you in to the default database as the user postgres, and you will see the following on your command line:

    psql (9.4beta1)
    Type "help" for help
    postgres=#

You can then create a new database called warehouse_db using the following statement in the terminal:

    postgres=# CREATE DATABASE warehouse_db;

You can then connect with the warehouse_db database using the following command:

    postgres=# \c warehouse_db

You are now connected to the warehouse_db database as the user postgres, and you will have the following warehouse_db shell:

    warehouse_db=#

Let's summarize what we have achieved so far. We connected with the default database postgres and created a warehouse_db database successfully. It's now time to actually write queries using psql and perform some Data Definition Language (DDL) and Data Manipulation Language (DML) operations, which we will cover in the following sections.

In PostgreSQL, we can have multiple databases. Inside the databases, we can have multiple extensions and schemas. Inside each schema, we can have database objects such as tables, views, sequences, procedures, and functions.

We are first going to create a schema named record and then we will create some tables in this schema. To create a schema named record in the warehouse_db database, use the following statement:

    warehouse_db=# CREATE SCHEMA record;

Creating, altering, and truncating a table

In this section, we will learn about creating a table, altering the table definition, and truncating the table.

Creating tables

Now, let's perform some DDL operations starting with creating tables.
To create a table named warehouse_tbl, execute the following statements:

    warehouse_db=# CREATE TABLE warehouse_tbl
    (
      warehouse_id INTEGER NOT NULL,
      warehouse_name TEXT NOT NULL,
      year_created INTEGER,
      street_address TEXT,
      city CHARACTER VARYING(100),
      state CHARACTER VARYING(2),
      zip CHARACTER VARYING(10),
      CONSTRAINT "PRIM_KEY" PRIMARY KEY (warehouse_id)
    );

The preceding statements created the table warehouse_tbl, which has the primary key warehouse_id. Now that you are familiar with the table creation syntax, let's create a sequence and use it in a table. You can create the hist_id_seq sequence using the following statement:

    warehouse_db=# CREATE SEQUENCE hist_id_seq;

The preceding CREATE SEQUENCE command creates a new sequence number generator. This involves creating and initializing a new special single-row table with the name hist_id_seq. The user issuing the command will own the generator. You can now create the table that uses the hist_id_seq sequence with the following statement:

    warehouse_db=# CREATE TABLE history
    (
      history_id INTEGER NOT NULL DEFAULT nextval('hist_id_seq'),
      date TIMESTAMP WITHOUT TIME ZONE,
      amount INTEGER,
      data TEXT,
      customer_id INTEGER,
      warehouse_id INTEGER,
      CONSTRAINT "PRM_KEY" PRIMARY KEY (history_id),
      CONSTRAINT "FORN_KEY" FOREIGN KEY (warehouse_id)
        REFERENCES warehouse_tbl(warehouse_id)
    );

The preceding query creates a history table in the warehouse_db database, and the history_id column uses the sequence as its default input value.

In this section, we learned how to create a table and how to use a sequence inside the table creation syntax.

Altering tables

Now that we have learned how to create multiple tables, we can practice some ALTER TABLE commands in this section. With the ALTER TABLE command, we can add, remove, or rename table columns.
First, the following example adds the phone_no column to the previously created table warehouse_tbl:

    warehouse_db=# ALTER TABLE warehouse_tbl ADD COLUMN phone_no INTEGER;

We can then verify that the column has been added by describing the table as follows:

    warehouse_db=# \d warehouse_tbl
               Table "public.warehouse_tbl"
         Column     |          Type          | Modifiers
    ----------------+------------------------+-----------
     warehouse_id   | integer                | not null
     warehouse_name | text                   | not null
     year_created   | integer                |
     street_address | text                   |
     city           | character varying(100) |
     state          | character varying(2)   |
     zip            | character varying(10)  |
     phone_no       | integer                |
    Indexes:
        "PRIM_KEY" PRIMARY KEY, btree (warehouse_id)
    Referenced by:
        TABLE "history" CONSTRAINT "FORN_KEY" FOREIGN KEY (warehouse_id) REFERENCES warehouse_tbl(warehouse_id)

To drop a column from a table, we can use the following statement:

    warehouse_db=# ALTER TABLE warehouse_tbl DROP COLUMN phone_no;

We can then verify that the column has been removed by describing the table again as follows:

    warehouse_db=# \d warehouse_tbl
               Table "public.warehouse_tbl"
         Column     |          Type          | Modifiers
    ----------------+------------------------+-----------
     warehouse_id   | integer                | not null
     warehouse_name | text                   | not null
     year_created   | integer                |
     street_address | text                   |
     city           | character varying(100) |
     state          | character varying(2)   |
     zip            | character varying(10)  |
    Indexes:
        "PRIM_KEY" PRIMARY KEY, btree (warehouse_id)
    Referenced by:
        TABLE "history" CONSTRAINT "FORN_KEY" FOREIGN KEY (warehouse_id) REFERENCES warehouse_tbl(warehouse_id)

Truncating tables

The TRUNCATE command is used to remove all rows from a table without providing any criteria. In the case of the DELETE command, the user has to provide the delete criteria using the WHERE clause. Because the history table references warehouse_tbl through a foreign key, PostgreSQL rejects a plain TRUNCATE TABLE warehouse_tbl; adding CASCADE also truncates the referencing history table. To truncate data from the table, we can use the following statement:

    warehouse_db=# TRUNCATE TABLE warehouse_tbl CASCADE;

We can then verify that the warehouse_tbl table has been truncated by performing a SELECT COUNT(*) query on it using the following statement:

    warehouse_db=# SELECT COUNT(*) FROM warehouse_tbl;
     count
    -------
         0
    (1 row)

Inserting, updating, and deleting data from tables

In this section, we will play around with data and learn how to insert, update, and delete data from a table.

Inserting data

So far, we have learned how to create and alter a table. Now it's time to play around with some data. Let's start by inserting records in the warehouse_tbl table using the following command snippet:

    warehouse_db=# INSERT INTO warehouse_tbl
    (
      warehouse_id,
      warehouse_name,
      year_created,
      street_address,
      city,
      state,
      zip
    )
    VALUES
    (
      1,
      'Mark Corp',
      2009,
      '207-F Main Service Road East',
      'New London',
      'CT',
      4321
    );

We can then verify that the record has been inserted by performing a SELECT query on the warehouse_tbl table as follows:

    warehouse_db=# SELECT warehouse_id, warehouse_name, street_address
                   FROM warehouse_tbl;
     warehouse_id | warehouse_name |        street_address
    --------------+----------------+------------------------------
                1 | Mark Corp      | 207-F Main Service Road East
    (1 row)

Updating data

Once we have inserted data in our table, we should know how to update it.
This can be done using the following statement:

    warehouse_db=# UPDATE warehouse_tbl SET year_created=2010 WHERE year_created=2009;

To verify that a record is updated, let's perform a SELECT query on the warehouse_tbl table as follows:

    warehouse_db=# SELECT warehouse_id, year_created FROM warehouse_tbl;
     warehouse_id | year_created
    --------------+--------------
                1 |         2010
    (1 row)

Deleting data

To delete data from a table, we can use the DELETE command. Let's add a few records to the table and then delete data on the basis of certain conditions:

    warehouse_db=# INSERT INTO warehouse_tbl
    (
      warehouse_id, warehouse_name, year_created,
      street_address, city, state, zip
    )
    VALUES
    (
      2, 'Bill & Co', 2014, 'Lilly Road', 'New London', 'CT', 4321
    );
    warehouse_db=# INSERT INTO warehouse_tbl
    (
      warehouse_id, warehouse_name, year_created,
      street_address, city, state, zip
    )
    VALUES
    (
      3, 'West point', 2013, 'Down Town', 'New London', 'CT', 4321
    );

We can then delete data from the warehouse_tbl table, where warehouse_name is Bill & Co, by executing the following statement:

    warehouse_db=# DELETE FROM warehouse_tbl WHERE warehouse_name='Bill & Co';

To verify that a record has been deleted, we will execute the following SELECT query:

    warehouse_db=# SELECT warehouse_id, warehouse_name
                   FROM warehouse_tbl
                   WHERE warehouse_name='Bill & Co';
     warehouse_id | warehouse_name
    --------------+----------------
    (0 rows)

The DELETE command is used to remove rows from a table, whereas the DROP command is used to drop a complete table. The TRUNCATE command is used to empty the whole table.

Summary

In this article, we learned how to use SQL for a collection of everyday DBMS tasks in an easy-to-use, practical way. We also saw how to build a complete database using DDL (create, alter, and truncate) and DML (insert, update, and delete) operations.
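As a recap of the DDL and DML steps above, the same create, insert, update, and delete sequence can be replayed from a script. The sketch below is an assumption-laden stand-in: it uses Python's built-in sqlite3 module with an in-memory SQLite database instead of PostgreSQL, keeps only three of the warehouse_tbl columns, and replaces TRUNCATE with an unqualified DELETE because SQLite has no TRUNCATE command.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for warehouse_db (assumption).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: a reduced version of warehouse_tbl with only three columns.
cur.execute(
    """CREATE TABLE warehouse_tbl (
           warehouse_id   INTEGER NOT NULL PRIMARY KEY,
           warehouse_name TEXT NOT NULL,
           year_created   INTEGER
       )"""
)

# DML: insert the three warehouses used in the article.
cur.executemany(
    "INSERT INTO warehouse_tbl VALUES (?, ?, ?)",
    [(1, "Mark Corp", 2009), (2, "Bill & Co", 2014), (3, "West point", 2013)],
)

# DML: update one row, then delete another by name.
cur.execute("UPDATE warehouse_tbl SET year_created = 2010 WHERE year_created = 2009")
cur.execute("DELETE FROM warehouse_tbl WHERE warehouse_name = ?", ("Bill & Co",))

remaining = cur.execute(
    "SELECT warehouse_id, warehouse_name, year_created "
    "FROM warehouse_tbl ORDER BY warehouse_id"
).fetchall()

# SQLite has no TRUNCATE; a DELETE without WHERE empties the table instead.
cur.execute("DELETE FROM warehouse_tbl")
count = cur.execute("SELECT COUNT(*) FROM warehouse_tbl").fetchone()[0]
conn.close()

print(remaining, count)
```

Passing values through ? placeholders rather than formatting them into the SQL string keeps names such as Bill & Co from needing manual quoting.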