How-To Tutorials

28 Oct 2013

21 min read

Getting into the Store

28 Oct 2013

(For more resources related to this topic, see here.) This all starts by visiting https://appdev.microsoft.com/StorePortals, which will get you to the store dashboard that you use to submit and manage your applications. If you already have an account you'll just log in here and proceed. If not, we'll take a look at ways of getting it set up. There are a couple of ways to get a store account, which you will need before you can submit any game or application to the store. There are also two different types of accounts: Individual accounts Company accounts In most cases you will only need the first option. It's cheaper and easier to get, and you won't require the enterprise features provided by the company account for a game. For this reason we'll focus on the individual account. To register you'll need a credit card for verification, even if you gain a free account another way. Just follow the registration instructions, pay the fee, and complete verification, after which you'll be ready to go. Free accounts Students and developers with MSDN subscriptions can access registration codes that waive the fee for a minimum of one year. If you meet either of these requirements you can gain a code using the respective methods, and use that code during the registration process to set the fee to zero. Students can access their free accounts using the DreamSpark service that Microsoft runs. To access this you need to create an account on www.dreamspark.com and create an account. From there follow the steps to verify your student status and visit https://www.dreamspark.com/Student/Windows-Store-Access.aspx to get your registration code. If you have access to an MSDN subscription you can use this to gain a store account for free. Just log in to your account and in your account benefits overview you should be able to generate your registration code. Submitting your game So your game is polished and ready to go. What do you need to do to get it in the store? First log in to the dashboard and select Submit an App from the menu on the left. Here you can see the steps required to submit the app. This may look like a lot to do, but don't worry. Most of these are very simple to resolve and can be done before you even start working on the game. The first step is to choose a name for your game, and this can be done whenever you want. By reserving a name and creating the application entry you have a year to submit your application, giving you plenty of time to complete it. This is why it's a good idea to jump in and register your application once you have a name for it. If you change your mind later and want a different name you can always change it. The next step is to choose how and where you will sell your game. The other thing you need to choose here are the markets you want to sell your game in. This can be an area you need to be careful of, because the markets you choose here define the localization or content you need to watch for in your game. Certain markets are restrictive and including content that isn't appropriate for a market you say you want to sell in can cause you to fail the certification process. Once that is done you need to choose when you want to release your game—you can choose to release as soon as certification finishes or on a specific date, and then you choose the app category, which in this case will be Games. Don't forget to specify the genre of your game as the sub-category so players can find it. The final option on the Selling Details page which applies to us is the Hardware requirements section. Here we define the DirectX feature-level required for the game, and the minimum RAM required to run it. This is important because the store can help ensure that players don't try to play your game on systems that cannot run it. The next section allows you to define the in-app offers that will be made available to players. The Age rating and rating certificates section allows you to define the minimum age required to play the game, as well as submit official ratings certificates from ratings boards so that they may be displayed in the store to meet legal requirements. The latter part is optional in some cases, and may affect where you can submit your game depending on local laws. Aside from official ratings, all applications and games submitted to the store require a voluntary rating, chosen from one of the following age options: 3+ 7+ 12+ 16+ 18+ While all content is checked, the 7+ and 3+ ratings both have extra checks because of the extra requirements for those age ranges. The 3+ rating is especially restrictive as apps submitted with that age limit may not contain features that could connect to online services, collect personal information, or use the webcam or microphone. To play it safe it's recommended the 12+ rating is chosen, and if you're still uncertain, higher is safer. GDF Certificates The other entry required here if you have official ratings certificates is a GDF file. This is a Game Definition File, which defines the different ratings in a single location and provides the necessary information to display the rating and inform any parental settings. To do this you need to use the GDFMAKER.exe utility that ships with the Windows 8 SDK, and generate a GDF file which you can submit to the store. Alongside that you need to create a DLL containing that file (as a resource) without any entry point to include in the application package. For full details on how to create the GDF as well as the DLL, view the following MSDN article: http://msdn.microsoft.com/en-us/library/windows/apps/hh465153.aspx The final section before you need to submit your compiled application package is the cryptography declaration. For most games you should be able to declare that you aren't using any cryptography within the game and quickly move through this step. If you are using cryptography, including encrypting game saves or data files, you will need to declare that here and follow the instructions to either complete the step or provide an Export Control Classification Number (ECCN). Now you need to upload the compiled app package before you can continue, so we'll take a look at what it takes to do that before you continue. App packages To submit your game to the store, you need to package it up in a format that makes it easy to upload, and easy for the store to distribute. This is done by compiling the application as an .appx file. But before that happens we need to ensure we have defined all of the required metadata, and fulfill the certification requirements, otherwise we'll be uploading a package only to fail soon after. Part of this is done through the application manifest editor, which is accessible in Visual Studio by double-clicking on the Package.appxmanifest file in solution explorer. This editor is where you specify the name that will be seen in the start menu, as well as the icons used by the application. To pass certification all icons have to be provided at 100 percent DPI, which is referred to as Scale 100 in the editor. Icon/Image Base resolution Required Standard 150 x 150 px Yes Wide 310 x 150 px If Wide Tile Enabled Small 30 x 30 px Yes Store 50 x 50 px Yes Badge 24 x 24 px If Toasts Enabled Splash 620 x 300 px Yes If you wish to provide a higher quality images for people running on high DPI setups, you can do so with a simple filename change. If you add scale-XXX to your filename, just before the extension, and replace XXX with one of the following values, Windows will automatically make use of it at the appropriate DPI. scale-100 scale-140 scale-180 In the following image you can see the options available for editing the visual assets in the application. These all apply to the start menu and application start-up experience, including the splash screen and toast notifications. Toast Notifications in Windows 8 are pop-up notifications that slide in from the edge of the screen and show the users some information for a short period of time. They can click on the toast to open the application if they want. Alongside Live Tiles, Toast Notifications allow you to give the user information when the application is not running (although they work when the application is running). The previous table shows the different images you required for a Windows 8 application, and if they are mandatory or just required in certain situations. Note that this does not include the imagery required for the store, which includes some screenshots of the application and optional promotional art in case you want your application to be featured. You must replace all of the required icons with your own. Automated checks during certification will detect the use of the default "box" icon shown in the previous screenshot and automatically fail the submission. Capabilities Once you have the visual aspects in place, you need to declare the capabilities that the application will receive. Your game may not need any, however you should still only specify what you need to run, as some of these capabilities come with extra implications and non-obvious requirements. Adding a privacy policy One of those requirements is the privacy policy. Even if you are creating a game, there may be situations where you are collecting private information, which requires you to have a privacy policy. The biggest issue here is connecting to the internet. If your game marks any of the Internet capabilities in the manifest, you automatically trigger a check for a privacy policy as private information—in this case an IP address—is being shared. To avoid failing certification for this, you need to put together a privacy policy if you collect privacy information, or use any of the capabilities that would indicate you collect information. These include the Internet capabilities as well as location, webcam, and microphone capabilities. This privacy policy just needs to describe what you will do with the information, and directly mention your game and publisher name. Once you have the policy written, it needs to be posted in two locations. The first is a publicly accessible website, which you will provide a link to when filling out the description after uploading your game. The second is within the game itself. It is recommended you place this policy in the Windows 8 provided settings menu, which you can build using XAML or your own code. If you're going with a completely native Windows 8 application you may want to display the policy in your own way and link to it from options within your game. Declarations Once you've indicated the capabilities you want, you need to declare any operating system integration you've done. For most games you won't use this, however if you're taking advantage of Windows 8 features such as share targets (the destination for data shared using the Share Charm), or you have a Game Definition File, you will need to declare it here and provide the required information for the operating system. In the case of the GDF, you need to provide the file so that the parental controls system can make use of the ratings to appropriately control access. Certification kit The next step is to make sure you aren't going to fail the automated tests during certification. Microsoft provides the same automated tests used when you submit your app in the Windows Application Certification Kit (WACK). WACK is installed by default with Visual Studio 2012 or higher version. There are two ways to run the test: after you build your application package, or by running the kit directly against an installed app. We'll look at the latter first, as you might want to run the test on your deployed test game well before you build anything for the store. This is also the only way to run the WACK on a WinRT device, if you want to cover all bases. If you haven't already deployed or tested your app, deploy it using the Build menu in Visual Studio and then search for the Windows App Cert Kit using the start menu (just start typing). When you run this you will be given an option to choose which type of application you want to validate. In this case we want to select the Windows Store App option, which will then give you access to the list of apps installed on your machine. From there it's just a matter of selecting the app you want and starting the test. At this point you will want to leave your machine alone until the automated tests are complete. Any interference could lead to an incorrect failure of the certification tests. The results will indicate ways you can fix any issues; however, you should be fine for most of the tests. The biggest issues will arise from third party libraries that haven't been developed or ported to Windows 8. In this case the only option is to fix them yourself (if they're open source) or find an alternative. Once you have the test passing, or you feel confident that it won't be an issue, you need to create app packages that are compatible with the store. At this point your game will be associated with the submission you have created in the Windows Store dashboard so that it is prepared for upload. Creating your app packages To do this, right click on your game project in Visual Studio and click on Create App Packages inside the Store menu. Once you do that, you'll be asked if you want to create a package for the store. The difference between the two options comes down to how the package is signed. If you choose No here, you can create a package with your test certificate, which can be distributed for testing. These packages must be manually installed and cannot be submitted to the store. You can, however, use this type of package on other machines to install your game for testers to try out. Choosing No will give you a folder with a .ps1 file (Powershell), which you can run to execute the install script. Choosing Yes at this option will take you to a login screen where you can enter your Windows Store developer account details. Once you've logged in you will be presented with a list of applications that you have registered with the store. If you haven't yet reserved the name of your application, you can click on the Reserve Name link, which will take you directly to the appropriate page in the store dashboard. Otherwise select the name of the game you're trying to build and click on Next. The next screen will allow you to specify which architectures to build for, and the version number of the built package. As this is a C++ game, we need to provide separate packages for the ARM, x86, and x64 builds, depending on what you want to support. Simply providing an x86 and ARM build will cover the entire market; 64 bit can be nice to have if you need a lot of memory, but ultimately it is optional, and some users may not even be able to run x64 code. When you're ready click on Create and Visual Studio will proceed to build your game and compile the requested packages, placing them in the directory specified. If you've built for the store, you will need the .appxupload files from this directory when you proceed to upload your game. Once the build has completed you will be asked if you want to launch the Windows Application Certification Kit. As mentioned previously this will test your game for certification failures, and if you're submitting to the store it's strongly recommended you run this. Doing so at this screen will automatically deploy the built package and run the test, so ensure you have a little bit of time to let it run. Uploading and submitting Now that you have a built app package you can return to the store dashboard to submit your game. Just edit the submission you made previously and enter the Packages section, which will take you to the page where you can upload the appxupload file. Once you have successfully uploaded your game you will gain access to the next section, the Description. This is where you define the details that will be displayed in the store. This is also where your marketing skills come into play as you prepare the content that will hopefully get players to buy your game. You start with the description of your game, and any big feature bullet points you want to emphasize. This is the best place to mention any reviews or praise, as well as give a quick description that will help the players decide if they want to try your game. You can have a number of app features listed; however, like any "back of the box" bullet points, keep them short and exciting. Along with the description, the store requires at least one screenshot to display to the potential player. These screenshots need to be of the entire screen, and that means they need to be at least 1366x768, which is the minimum resolution of Windows 8. These are also one of the best ways to promote your game, so ensure you take some great screenshots that show off the fun and appeal of your game. There are a few ways to take a screenshot of your game. If you're testing in the simulator you can use the screenshot icon on the right toolbar of the simulator to take the screenshot. If not, you can use Windows Key + Prt Scr SysRq to take a screenshot of your entire screen, and then use that (or edit it if you have multiple monitors). Screenshots taken with either of these tools can be found in the Screenshots folder within your Pictures library. There are two other small pieces of information that are required during this stage: Copyright Info and Support contact info. For the support info, an e-mail address will usually suffice. At this point you can also include your website and, if applicable to your game, a link to the privacy policy included in your game. Note that if you require a privacy policy, it must be included in two places: your game, and the privacy policy field on this form. The last items you may want to add here are promotional images. These images are intended for use in store promotions and allow Microsoft to easily feature your game with larger promotional imagery in prominent locations within the store. If you are serious about maximizing the reach of your game, you will want to include these images. If you don't, the number of places your game can be featured will be reduced. At a minimum the 414x180 px image should be included if you want some form of promotion. Now you're almost done! The next section allows you to leave notes for the testing team. This is where you would leave test account details for any features in your game that require an account so that they can test those features. This is also the location to leave any notes about testing in case there are situations where you can point out any features that might not be obvious. In certain situations you may have an exemption from Microsoft for a certification requirement; this would be where you include that exemption. When every step has been completed and you have tick marks in all of the stages, the Submit for Certification button will unlock, allowing you to complete your submission and send it off for certification. At this stage a number of automated tests will run before human testers will try your game on a variety of devices to ensure it fits the requirements for the store. If all goes well, you will receive an email notifying you of your successful certification and, depending on if you set the release date as ASAP, you will find your game in the store a few hours later (it may take a few hours to appear in the store once you receive an email informing you that your game or app is in the store). Certification tips Your first stop should be the certification requirements page, which lists all of the current requirements your game will be tested for: http://msdn.microsoft.com/en-us/library/windows/apps/hh694083.aspx. There are some requirements that you should take note of, and in this section we'll take a look at ways to help ensure you pass those requirements. Privacy The first of course is the privacy policy. As mentioned before, if your game collects any sort of personal information, you will need that policy in two places: In full text within the game Accessible through an Internet link The default app template generated by Visual Studio automatically enables the Internet capability, and by simply having that enabled you require a privacy policy. If you aren't connecting to the Internet at all in your game, you should always ensure that none of the Internet options are enabled before you package your game. If you share any personal information, then you need to provide players with a method of opting in to the sharing. This could be done by gating the functionality behind a login screen. Note that this functionality can be locked away, and the requirement doesn't demand that you find a way to remain fully functional even if the user opts out. Features One requirement is that your game support both touch input and keyboard/mouse input. You can easily support this by using an input system like the one described in this article; however, by supporting touch input you get mouse input for free and technically fulfill this requirement. It's all about how much effort you want to put into the experience your player will have, and that's why including gamepad input is recommended as some players may want to use a connected Xbox 360 gamepad for their input device in games. Legacy APIs Although your game might run while using legacy APIs, it won't pass certification. This is checked through an automated test that also occurs during the WACK testing process, so you can easily check if you have used any illegal APIs. This often arises in third party libraries which make use of parts of the standard IO library such as the console, or the insecure versions of functions such as strcpy or fopen. Some of these APIs don't exist in WinRT for good reason; the console, for example, just doesn't exist, so calling APIs that work directly with the console makes no sense, and isn't allowed. Debug Another issue that may arise through the use of third-party libraries is that some of them may be compiled in debug mode. This could present issues at runtime for your app, and the packaging system will happily include these when compiling your game, unless it has to compile them itself. This is detected by the WACK and can be resolved by finding a release mode version, or recompiling the library. WACK The final tip is: run WACK. This kit quickly and easily finds most of the issues you may encounter during certification, and you see the issues immediately rather than waiting for it to fail during the certification process. Your final step before submitting to the store should be to run WACK, and even while developing it's a good idea to compile in release mode and run the tests to just make sure nothing is broken. Summary By now you should know how to submit your game to the store, and get through certification with little to no issues. We've looked at what you require for the store including imagery and metadata, as well as how to make use of the Windows Application Certification Kit to find problems early on and fix them up without waiting hours or days for certification to fail your game. One area unique to games that we have covered in this article is game ratings. If you're developing your game for certain markets where ratings are required, or if you are developing children's games, you may need to get a rating certificate, and hopefully you have an idea of where to look to do this. Resources for Article: Further resources on this subject: Introduction to Game Development Using Unity 3D [Article] HTML5 Games Development: Using Local Storage to Store Game Data [Article] Unity Game Development: Interactions (Part 1) [Article]

0
0
4710

article-image-getting-started-codeblocks

Packt

28 Oct 2013

7 min read

Getting Started with Code::Blocks

Packt

28 Oct 2013

7 min read

0
0
8147

Packt

28 Oct 2013

12 min read

Low-Level Index Control

Packt

28 Oct 2013

12 min read

(For more resources related to this topic, see here.) Altering Apache Lucene scoring With the release of Apache Lucene 4.0 in 2012, all the users of this great, full text search library, were given the opportunity to alter the default TF/IDF based algorithm. Lucene API was changed to allow easier modification and extension of the scoring formula. However, that was not the only change that was made to Lucene when it comes to documents score calculation. Lucene 4.0 was shipped with additional similarity models, which basically allows us to use different scoring formula for our documents. In this section we will take a deeper look at what Lucene 4.0 brings and how those features were incorporated into ElasticSearch. Setting per-field similarity Since ElasticSearch 0.90, we are allowed to set a different similarity for each of the fields we have in our mappings. For example, let's assume that we have the following simple mapping that we use, in order to index blog posts (stored in the posts_no_similarity.json file): { "mappings" : { "post" : { "properties" : { "id" : { "type" : "long", "store" : "yes", "precision_step" : "0" }, "name" : { "type" : "string", "store" : "yes", "index" : "analyzed" }, "contents" : { "type" : "string", "store" : "no", "index" : "analyzed" } } } } } What we would like to do is, use the BM25 similarity model for the name field and the contents field. In order to do that, we need to extend our field definitions and add the similarity property with the value of the chosen similarity name. Our changed mappings (stored in the posts_similarity.json file) would appear as shown in the following code: { "mappings" : { "post" : { "properties" : { "id" : { "type" : "long", "store" : "yes", "precision_step" : "0" }, "name" : { "type" : "string", "store" : "yes", "index" : "analyzed", "similarity" : "BM25" }, "contents" : { "type" : "string", "store" : "no", "index" : "analyzed", "similarity" : "BM25" } } } } } And that's all, nothing more is needed. After the preceding change, Apache Lucene will use the BM25 similarity to calculate the score factor for the name and contents fields. In case of the Divergence from randomness and Information based similarity model, we need to configure some additional properties to specify the behavior of those similarities. How to do that is covered in the next part of the current section. Default codec properties When using the default codec we are allowed to configure the following properties: min_block_size: It specifies the minimum block size Lucene term dictionary uses to encode blocks. It defaults to 25. max_block_size: It specifies the maximum block size Lucene term dictionary uses to encode blocks. It defaults to 48. Direct codec properties The direct codec allows us to configure the following properties: min_skip_count: It specifies the minimum number of terms with a shared prefix to allow writing of a skip pointer. It defaults to 8. low_freq_cutoff: The codec will use a single array object to hold postings and positions that have document frequency lower than this value. It defaults to 32. Memory codec properties By using the memory codec we are allowed to alter the following properties: pack_fst: It is a Boolean option that defaults to false and specifies if the memory structure that holds the postings should be packed into the FST. Packing into FST will reduce the memory needed to hold the data. acceptable_overhead_ratio: It is a compression ratio of the internal structure specified as a float value which defaults to 0.2. When using the 0 value, there will be no additional memory overhead but the returned implementation may be slow. When using the 0.5 value, there can be a 50 percent memory overhead, but the implementation will be fast. Values higher than 1 are also possible, but may result in high memory overhead. Pulsing codec properties When using the pulsing codec we are allowed to use the same properties as with the default codec and in addition to them one more property, which is described as follows: freq_cut_off: It defaults to 1. The document frequency at which the postings list will be written into the term dictionary. The documents with the frequency equal to or less than the value of freq_cut_off will be processed. Bloom filter-based codec properties If we want to configure a bloom filter based codec, we can use the bloom_filter type and set the following properties: delegate: It specifies the name of the codec we want to wrap, with the bloom filter. ffp: It is a value between 0 and 1.0 which specifies the desired false positive probability. We are allowed to set multiple probabilities depending on the amount of documents per Lucene segment. For example, the default value of 10k=0.01, 1m=0.03 specifies that the fpp value of 0.01 will be used when the number of documents per segment is larger than 10.000 and the value of 0.03 will be used when the number of documents per segment is larger than one million. For example, we could configure our custom bloom filter based codec to wrap a direct posting format as shown in the following code (stored in posts_bloom_custom.json file): { "settings" : { "index" : { "codec" : { "postings_format" : { "custom_bloom" : { "type" : "bloom_filter", "delegate" : "direct", "ffp" : "10k=0.03, 1m=0.05" } } } } }, "mappings" : { "post" : { "properties" : { "id" : { "type" : "long", "store" : "yes", "precision_step" : "0" }, "name" : { "type" : "string", "store" : "yes", "index" : "analyzed", "postings_format" : "custom_bloom" }, "contents" : { "type" : "string", "store" : "no", "index" : "analyzed" } } } } } NRT, flush, refresh, and transaction log In an ideal search solution, when new data is indexed it is instantly available for searching. At the first glance it is exactly how ElasticSearch works even in multiserver environments. But this is not the truth (or at least not all the truth) and we will show you why it is like this. Let's index an example document to the newly created index by using the following command: curl -XPOST localhost:9200/test/test/1 -d '{ "title": "test" }' Now, we will replace this document and immediately we will try to find it. In order to do this, we'll use the following command chain: curl –XPOST localhost:9200/test/test/1 -d '{ "title": "test2" }' ; curl localhost:9200/test/test/_search?pretty The preceding command will probably result in the response, which is very similar to the following response: {"ok":true,"_index":"test","_type":"test","_id":"1","_version":2}{ "took" : 1, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 1.0, "hits" : [ { "_index" : "test", "_type" : "test", "_id" : "1", "_score" : 1.0, "_source" : { "title": "test" } } ] } } The first line starts with a response to the indexing command—the first command. As you can see everything is correct, so the second, search query should return the document with the title field test2, however, as you can see it returned the first document. What happened? But before we give you the answer to the previous question, we should take a step backward and discuss about how underlying Apache Lucene library makes the newly indexed documents available for searching. Updating index and committing changes The segments are independent indices, which means that queries that are run in parallel to indexing, from time to time should add newly created segments to the set of those segments that are used for searching. Apache Lucene does that by creating subsequent (because of write-once nature of the index) segments_N files, which list segments in the index. This process is called committing. Lucene can do this in a secure way—we are sure that all changes or none of them hits the index. If a failure happens, we can be sure that the index will be in consistent state. Let's return to our example. The first operation adds the document to the index, but doesn't run the commit command to Lucene. This is exactly how it works. However, a commit is not enough for the data to be available for searching. Lucene library use an abstraction class called Searcher to access index. After a commit operation, the Searcher object should be reopened in order to be able to see the newly created segments. This whole process is called refresh. For performance reasons ElasticSearch tries to postpone costly refreshes and by default refresh is not performed after indexing a single document (or a batch of them), but the Searcher is refreshed every second. This happens quite often, but sometimes applications require the refresh operation to be performed more often than once every second. When this happens you can consider using another technology or requirements should be verified. If required, there is possibility to force refresh by using ElasticSearch API. For example, in our example we can add the following command: curl –XGET localhost:9200/test/_refresh If we add the preceding command before the search, ElasticSearch would respond as we had expected. Changing the default refresh time The time between automatic Searcher refresh can be changed by using the index.refresh_interval parameter either in the ElasticSearch configuration file or by using the update settings API. For example: curl -XPUT localhost:9200/test/_settings -d '{ "index" : { "refresh_interval" : "5m" } }' The preceding command will change the automatic refresh to be done every 5 minutes. Please remember that the data that are indexed between refreshes won't be visible by queries. As we said, the refresh operation is costly when it comes to resources. The longer the period of refresh is, the faster your indexing will be. If you are planning for very high indexing procedure when you don't need your data to be visible until the indexing ends, you can consider disabling the refresh operation by setting the index.refresh_interval parameter to -1 and setting it back to its original value after the indexing is done. The transaction log Apache Lucene can guarantee index consistency and all or nothing indexing, which is great. But this fact cannot ensure us that there will be no data loss when failure happens while writing data to the index (for example, when there isn't enough space on the device, the device is faulty or there aren't enough file handlers available to create new index files). Another problem is that frequent commit is costly in terms of performance (as you recall, a single commit will trigger a new segment creation and this can trigger the segments to merge). ElasticSearch solves those issues by implementing transaction log. Transaction log holds all uncommitted transactions and from time to time, ElasticSearch creates a new log for subsequent changes. When something goes wrong, transaction log can be replayed to make sure that none of the changes were lost. All of these tasks are happening automatically, so, the user may not be aware of the fact that commit was triggered at a particular moment. In ElasticSearch, the moment when the information from transaction log is synchronized with the storage (which is Apache Lucene index) and transaction log is cleared is called flushing. Please note the difference between flush and refresh operations. In most of the cases refresh is exactly what you want. It is all about making new data available for searching. From the opposite side, the flush operation is used to make sure that all the data is correctly stored in the index and transaction log can be cleared. In addition to automatic flushing, it can be forced manually using the flush API. For example, we can run a command to flush all the data stored in the transaction log for all indices, by running the following command: curl –XGET localhost:9200/_flush Or we can run the flush command for the particular index, which in our case is the one called library: curl –XGET localhost:9200/library/_flush curl –XGET localhost:9200/library/_refresh In the second example we used it together with the refresh, which after flushing the data opens a new searcher. The transaction log configuration If the default behavior of the transaction log is not enough ElasticSearch allows us to configure its behavior when it comes to the transaction log handling. The following parameters can be set in the elasticsearch.yml file as well as using index settings update API to control transaction log behavior: index.translog.flush_threshold_period: It defaults to 30 minutes (30m). It controls the time, after which flush will be forced automatically even if no new data was being written to it. In some cases this can cause a lot of I/O operation, so sometimes it's better to do flush more often with less data being stored in it. index.translog.flush_threshold_ops: It specifies the maximum number of operations after which the flush operation will be performed. It defaults to 5000. index.translog.flush_threshold_size: It specifies the maximum size of the transaction log. If the size of the transaction log is equal to or greater than the parameter, the flush operation will be performed. It defaults to 200 MB. index.translog.disable_flush: This option disables automatic flush. By default flushing is enabled, but sometimes it is handy to disable it temporarily, for example, during import of large amount of documents. All of the mentioned parameters are specified for an index of our choice, but they are defining the behavior of the transaction log for each of the index shards. Of course, in addition to setting the preceding parameters in the elasticsearch.yml file, they can also be set by using Settings Update API. For example: curl -XPUT localhost:9200/test/_settings -d '{ "index" : { "translog.disable_flush" : true } }' The preceding command was run before the import of a large amount of data, which gave us a performance boost for indexing. However, one should remember to turn on flushing when the import is done. Near Real Time GET Transaction log gives us one more feature for free that is, real-time GET operation, which provides the possibility of returning the previous version of the document including non-committed versions. The real-time GET operation fetches data from the index, but first it checks if a newer version of that document is available in the transaction log. If there is no flushed document, data from the index is ignored and a newer version of the document is returned—the one from the transaction log. In order to see how it works, you can replace the search operation in our example with the following command: curl -XGET localhost:9200/test/test/1?pretty ElasticSearch should return the result similar to the following: { "_index" : "test", "_type" : "test", "_id" : "1", "_version" : 2, "exists" : true, "_source" : { "title": "test2" } } If you look at the result, you would see that again, the result was just as we expected and no trick with refresh was required to obtain the newest version of the document.

0
0
2106

Packt

28 Oct 2013

29 min read

Availability Management

Packt

28 Oct 2013

29 min read

0
0
2757

Packt

25 Oct 2013

5 min read

Gesture

Packt

25 Oct 2013

5 min read

(For more resources related to this topic, see here.) Gestures While there are ongoing arguments in the courts of America at the time of writing over who invented the likes of dragging images, it is without a doubt that a key feature of iOS is the ability to use gestures. To put it simply, when you tap the screen to start an app or select a part of an image to enlarge it or anything like that, you are using gestures. A gesture (in terms of iOS) is any touch interaction between the UI and the device. With iOS 6, there are six gestures the user has the ability to use. These gestures, along with brief explanations, have been listed in the following table: Class Name and type Gesture UIPanGesture Recognizer PanGesture; Continuous type Pan images or over-sized views by dragging across the screen UISwipeGesture Recognizer SwipeGesture; Continuous type Similar to panning, except it is a swipe UITapGesture Recognizer TapGesture; Discrete type Tap the screen a number of times (configurable) UILongPressGesture Recognizer LongPressGesture; Discrete type Hold the finger down on the screen UIPinchGesture Recognizer PinchGesture; Continuous type Zoom by pinching an area and moving your fingers in or out UIRotationGesture Recognizer RotationGesture; Continuous type Rotate by moving your fingers in opposite directions Gestures can be added by programming or via Xcode. The available gestures are listed in the following screenshot with the rest of the widgets on the right-hand side of the designer: To add a gesture, drag the gesture you want to use under the view on the View bar (shown in the following screenshot): Design the UI as you want and while pressing the Ctrl key, drag the gesture to what you want to recognize using the gesture. In my example, the object you want to recognize is anywhere on the screen. Once you have connected the gesture to what you want to recognize, you will see the configurable options of the gesture. The Taps field is the number of taps required before the Recognizer is triggered, and the Touches field is the number of points onscreen required to be touched for the Recognizer to be triggered. When you come to connect up the UI, the gesture must also be added. Gesture code When using Xcode, it is simple to code gestures. The class defined in the Xcode design for the tapping gesture is called tapGesture and is used in the following code: private int tapped = 0; public override void ViewDidLoad() { base.ViewDidLoad(); tapGesture.AddTarget(this, new Selector("screenTapped")); View.AddGestureRecognizer(tapGesture); } [Export("screenTapped")]public void SingleTap(UIGestureRecognizer s) { tapped++; lblCounter.Text = tapped.ToString(); } There is nothing really amazing to the code; it just displays how many times the screen has been tapped. The Selector method is called by the code when the tap has been seen. The method name doesn't make any difference as long as the Selector and Export names are the same. Types When the gesture types were originally described, they were given a type. The type reflects the number of messages sent to the Selector method. A discrete one generates a single message. A continuous one generates multiple messages, which requires the Selector method to be more complex. The complexity is added by the Selector method having to check the State of the gesture to decide on what to do with what message and whether it has been completed. Adding a gesture in code It is not a requirement that Xcode be used to add a gesture. To perform the same task in the following code as my preceding code did in Xcode is easy. The code will be as follows: UITapGestureRecognizer t'pGesture = new UITapGestureRecognizer() { NumberOfTapsRequired = 1 }; The rest of the code from AddTarget can then be used. Continuous types The following code, a Pinch Recognizer, shows a simple rescaling. There are a couple of other states that I'll explain after the code. The only difference in the designer code is that I have UIImageView instead of a label and a UIPinchGestureRecognizer class instead of a UITapGestureRecognizer class. public override void ViewDidLoad() { base.ViewDidLoad(); uiImageView.Image =UIImage.FromFile("graphics/image.jpg") Scale(new SizeF(160f, 160f); pinchGesture.AddTarget(this, new Selector("screenTapped")); uiImageView.AddGestureRecognizer(pinchGesture); } [Export("screenTapped")]public void SingleTap(UIGestureRecognizer s) { UIPinchGestureRecognizer pinch = (UIPinchGestureRecognizer)s; float scale = 0f; PointF location; switch(s.State) { case UIGestureRecognizerState.Began: Console.WriteLine("Pinch begun"); location = s.LocationInView(s.View); break; case UIGestureRecognizerState.Changed: Console.WriteLine("Pinch value changed"); scale = pinch.Scale; uiImageView.Image = UIImage FromFile("graphics/image.jpg") Scale(new SizeF(160f, 160f), scale); break; case UIGestureRecognizerState.Cancelled: Console.WriteLine("Pinch cancelled"); uiImageView.Image = UIImage FromFile("graphics/image.jpg") Scale(new SizeF(160f, 160f)); scale = 0f; break; case UIGestureRecognizerState.Recognized: Console.WriteLine("Pinch recognized"); break; } } Other UIGestureRecognizerState values The following table gives a list of other Recognizer states: State Description Notes Possible Default state; gesture hasn't been recognized Used by all gestures Failed Gesture failed No messages sent for this state Translation Direction of pan Used in the pan gesture Velocity Speed of pan Used in the pan gesture In addition to these, it should be noted that discrete types only use Possible and Recognized states. Summary Gestures certainly can add a lot to your apps. They can enable the user to speed around an image, move about a map, enlarge and reduce, as well as select areas of anything on a view. Their flexibility underpins why iOS is recognized as being an extremely versatile device for users to manipulate images, video, and anything else on-screen. Resources for Article: Further resources on this subject: Creating and configuring a basic mobile application [Article] So, what is XenMobile? [Article] Building HTML5 Pages from Scratch [Article]

0
0
7402

article-image-authenticating-your-application-devise

Packt

25 Oct 2013

11 min read

Authenticating Your Application with Devise

Packt

25 Oct 2013

11 min read

(For more resources related to this topic, see here.) Signing in using authentication other than e-mails By default, Devise only allows e-mails to be used for authentication. For some people, this condition will lead to the question, "What if I want to use some other field besides e-mail? Does Devise allow that?" The answer is yes; Devise allows other attributes to be used to perform the sign-in process. For example, I will use username as a replacement for e-mail, and you can change it later with whatever you like, including userlogin, adminlogin, and so on. We are going to start by modifying our user model. Create a migration file by executing the following command inside your project folder: $ rails generate migration add_username_to_users username:string This command will produce a file, which is depicted by the following screenshot: The generated migration file Execute the migrate (rake db:migrate) command to alter your users table, and it will add a new column named username. You need to open the Devise's main configuration file at config/initializers/devise.rb and modify the code: config.authentication_keys = [:username] config.case_insensitive_keys = [:username] config.strip_whitespace_keys = [:username] You have done enough modification to your Devise configuration, and now you have to modify the Devise views to add a username field to your sign-in and sign-up pages. By default, Devise loads its views from its gemset code. The only way to modify the Devise views is to generate copies of its views. This action will automatically override its default views. To do this, you can execute the following command: $ rails generate devise:views It will generate some files, which are shown in the following screenshot: Devise views files As I have previously mentioned, these files can be used to customize another view. But we are going to talk about it a little later in this article. Now, you have the views and you can modify some files to insert the username field. These files are listed as follows: app/views/devise/sessions/new.html.erb: This is a view file for the sign-up page. Basically, all you need to do is change the email field into the username field. #app/views/devise/sessions/new.html.erb <h2>Sign in</h2> <%= notice %> <%= alert %> <%= form_for(resource, :as => resource_name, :url => session_path (resource_name)) do |f| %> <div><%= f.label :username %><br /> <%= f.text_field :username, :autofocus => true %><div> <div><%= f.label :password %><br /> <%= f.password_field :password %></div> <% if devise_mapping.rememberable? -%> <div><%= f.check_box :remember_me %> <%= f.label :remember_me %></div> <% end -%> <div><%= f.submit "Sign in" %></div> <% end %> %= render "devise/shared/links" %> You are now allowed to sign in with your username. The modification will be shown, as depicted in the following screenshot: The sign-in page with username app/views/devise/registrations/new.html.erb: This file is a view file for the registration page. It is a bit different from the sign-up page; in this file, you need to add the username field, so that the user can fill in their username when they perform the registration. #app/views/devise/registrations/new.html.erb <h2>Sign Up</h2> <%= form_for() do |f| %> <%= devise_error_messages! %> <div><%= f.label :email %><br /> <%= f.email_field :email, :autofocus => true %></div> <div><%= f.label :username %><br /> <%= f.text_field :username %></div> <div><%= f.label :password %><br /> <%= f.password_field :password %></div> <div><%= f.label :password_confirmation %><br /> <%= f.password_field :password_confirmation %></div> <div><%= f.submit "Sign up" %></div> <% end %> <%= render "devise/shared/links" %> Especially for registration, you need to perform extra modifications. Mass assignment rules written in the app/controller/application_controller.rb file, and now, we are going to modify them a little. Add username to the sanitizer for sign-in and sign-up, and you will have something as follows: #these codes are written inside configure_permitted_parameters function devise_parameter_sanitizer.for(:sign_in) {|u| u.permit(:email, :username )} devise_parameter_sanitizer.for(:sign_up) {|u| u.permit(:email, :username, :password, :password_confirmation)} These changes will allow you to perform a sign-up along with the username data. The result of the preceding example is shown in the following screenshot: The sign-up page with username I want to add a new case for your sign-in, which is only one field for username and e-mail. This means that you can sign in either with your e-mail ID or username like in Twitter's sign-in form. Based on what we have done before, you already have username and email columns; now, open /app/models/user.rb and add the following line: attr_accessor :signin Next, you need to change the authentication keys for Devise. Open /config/initializers/devise.rb and change the value for config.authentication_keys, as shown in the following code snippet: config.authentication_keys = [ :signin ] Let's go back to our user model. You have to override the lookup function that Devise uses when performing a sign-in. To do this, add the following method inside your model class: def self.find_first_by_auth_conditions(warden_conditions) conditions = warden_conditions.dup where(conditions).where(["lower(username) = :value OR lower(email) = :value", { :value => signin.downcase }]).first end As an addition, you can add a validation for your username, so it will be case insensitive. Add the following validation code into your user model: validates :username, :uniqueness => {:case_sensitive => false} Please open /app/controller/application_controller.rb and make sure you have this code to perform parameter filtering: before_filter :configure_permitted_parameters, if: :devise_controller? protected def configure_permitted_parameters devise_parameter_sanitizer.for(:sign_in) {|u| u.permit(:signin)} devise_parameter_sanitizer.for(:sign_up) {|u| u.permit(:email, : username, :password, :password_confirmation)} end We're almost there! Currently, I assume that you've already stored an account that contains the e-mail ID and username. So, you just need to make a simple change in your sign-in view file (/app/views/devise/sessions/new.html.erb). Make sure that the file contains this code: <h2>Sign in</h2> <%= notice %> <%= alert %> <%= form_for(resource, :as => resource_name, :url => session_path (resource_name)) do |f| %> <div><%= f.label "Username or Email" %><br /> <%= f.text_field :signin, :autofocus => true %></div> <div><%= f.label :password %><br /> <%= f.password_field :password %></div> <% if devise_mapping.rememberable? -%> <div><%= f.check_box :remember_me %> <%= f.label :remember_me %> </div> <% end -%> <div><%= f.submit "Sign in" %></div> <% end %> <%= render "devise/shared/links" %> You can see that you don't have a username or email field anymore. The field is now replaced by a single field named :signin that will accept either the e-mail ID or the username. It's efficient, isn't it? Updating the user account Basically, you are already allowed to access your user account when you activate the registerable module in the model. To access the page, you need to log in first and then go to /users/edit. The page is as shown in the following screenshot: The edit account page But, what if you want to edit your username or e-mail ID? How will you do that? What if you have extra information in your users table, such as addresses, birth dates, bios, and passwords as well? How will you edit these? Let me show you how to edit your user data including your password, or edit your user data without editing your password. Editing your data, including the password: To perform this action, the first thing that you need to do is modify your view. Your view should contain the following code: <div><%= f.label :username %><br /> <%= f.text_field :username %></div> Now, we are going to overwrite Devise's logic. To do this, you have to create a new controller named registrations_controller. Please use the rails command to generate the controller, as shown: $ rails generate controller registrations update It will produce a file located at app/controllers/. Open the file and make sure you write this code within the controller class: class RegistrationsController < Devise::RegistrationsController def update new_params = params.require(:user).permit(:email,:username, : current_password, :password,:password_confirmation) @user = User.find(current_user.id) if @user.update_with_password(new_params) set_flash_message :notice, :updated sign_in @user, :bypass => true redirect_to after_update_path_for(@user) else render "edit" end end end Let's look at the code. Currently, Rails 4 has a new method in organizing whitelist attributes. Therefore, before performing mass assignment attributes, you have to prepare your data. This is done in the first line of the update method. Now, if you see the code, there's a method defined by Devise named update_with_password. This method will use mass assignment attributes with the provided data. Since we have prepared it before we used it, it will be fine. Next, you have to edit your route file a bit. You should modify the rule defined by Devise, so instead of using the original controller, Devise will use the controller you created before. The modification should look as follows: devise_for :users, :controllers => {:registrations => "registrations"} Now you have modified the original user edit page, and it will be a little different. You can turn on your Rails server and see it in action. The view is as depicted in the following screenshot: The modified account edit page Now, try filling up these fields one by one. If you are filling them with different values, you will be updating all the data (e-mail, username, and password), and this sounds dangerous. You can modify the controller to have better data update security, and it all depends on your application's workflows and rules. Editing your data, excluding the password: Actually, you already have what it takes to update data without changing your password. All you need to do is modify your registrations_controller.rb file. Your update function should be as follows: class RegistrationsController < Devise::RegistrationsController def update new_params = params.require(:user).permit(:email,:username, : current_password, :password,:password_confirmation) change_password = true if params[:user][:password].blank? params[:user].delete("password") params[:user].delete("password_confirmation") new_params = params.require(:user).permit(:email,:username) change_password = false end @user = User.find(current_user.id) is_valid = false if change_password is_valid = @user.update_with_password(new_params) else @user.update_without_password(new_params) end if is_valid set_flash_message :notice, :updated sign_in @user, :bypass => true redirect_to after_update_path_for(@user) else render "edit" end end end The main difference from the previous code is now you have an algorithm that will check whether the user intends to update your data with their password or not. If not, the code will call the update_without_password method. Now, you have codes that allow you to edit with/without a password. Now, refresh your browser and try editing with or without a password. It won't be a problem anymore. Summary Now, I believe that you will be able to make your own Rails application with Devise. You should be able to make your own customizations based on your needs. Resources for Article: Further resources on this subject: Integrating typeahead.js into WordPress and Ruby on Rails [Article] Facebook Application Development with Ruby on Rails [Article] Designing and Creating Database Tables in Ruby on Rails [Article]

0
0
10681

How-To Tutorials

Packt

25 Oct 2013

11 min read

Social-Engineer Toolkit

Packt

25 Oct 2013

11 min read

(For more resources related to this topic, see here.) Social engineering is an act of manipulating people to perform actions that they don't intend to do. A cyber-based, socially engineered scenario is designed to trap a user into performing activities that can lead to the theft of confidential information or some malicious activity. The reason for the rapid growth of social engineering amongst hackers is that it is difficult to break the security of a platform, but it is far easier to trick the user of that platform into performing unintentional malicious activity. For example, it is difficult to break the security of Gmail in order to steal someone's password, but it is easy to create a socially engineered scenario where the victim can be tricked to reveal his/her login information by sending a fake login/phishing page. The Social-Engineer Toolkit is designed to perform such tricking activities. Just like we have exploits and vulnerabilities for existing software and operating systems, SET is a generic exploit of humans in order to break their own conscious security. It is an official toolkit available at https://www.trustedsec.com/, and it comes as a default installation with BackTrack 5. In this article, we will analyze the aspect of this tool and how it adds more power to the Metasploit framework. We will mainly focus on creating attack vectors and managing the configuration file, which is considered the heart of SET. So, let's dive deeper into the world of social engineering. Getting started with the Social-Engineer Toolkit (SET) Let's start our introductory recipe about SET, where we will be discussing SET on different platforms. Getting ready SET can be downloaded for different platforms from its official website: https://www.trustedsec.com/. It has both the GUI version, which runs through the browser, and the command-line version, which can be executed from the terminal. It comes pre-installed in BackTrack, which will be our platform for discussion in this article. How to do it... To launch SET on BackTrack, start the terminal window and pass the following path: root@bt:~# cd /pentest/exploits/set root@bt:/pentest/exploits/set# ./set Copyright 2012, The Social-Engineer Toolkit (SET) All rights reserved. Select from the menu: If you are using SET for the first time, you can update the toolkit to get the latest modules and fix known bugs. To start the updating process, we will pass the svn update command. Once the toolkit is updated, it is ready for use. The GUI version of SET can be accessed by navigating to Applications | BackTrack | Exploitation tools | Social-Engineer Toolkit. How it works... SET is a Python-based automation tool that creates a menu-driven application for us. Faster execution and the versatility of Python make it the preferred language for developing modular tools, such as SET. It also makes it easy to integrate the toolkit with web servers. Any open source HTTP server can be used to access the browser version of SET. Apache is typically considered the preferable server while working with SET. There's more... Sometimes, you may have an issue upgrading to the new release of SET in BackTrack 5 R3. Try out the following steps: You should remove the old SET using the following command: dpkg –r set We can remove SET in two ways. First, we can trace the path to /pentest/exploits/set, making sure we are in the directory and then opt for the 'rm' command for removing all files present there. Or, we can use the method shown previously. Then, for reinstallation, we can download its clone using the following command: Git clone https://github.com/trustedsec/social-engineer-toolkit /set Working with the SET config file In this recipe, we will take a close look at the SET config file, which contains default values for different parameters that are used by the toolkit. The default configuration works fine with most of the attacks, but there can be situations when you have to modify the settings according to the scenario and requirements. So, let's see what configuration settings are available in the config file. Getting ready To launch the config file, move to the config file and open the set_config file: root@bt:/pentest/exploits/set# nano config/set_config The configuration file will be launched with some introductory statements, as shown in the following screenshot: How to do it... Let's go through it step-by-step: First, we will see what configuration settings are available for us: # DEFINE THE PATH TO METASPLOIT HERE, FOR EXAMPLE /pentest/exploits/framework3 METASPLOIT_PATH=/pentest/exploits/framework3 The first configuration setting is related to the Metasploit installation directory. Metasploit is required by SET for proper functioning, as it picks up payloads and exploits from the framework: # SPECIFY WHAT INTERFACE YOU WANT ETTERCAP TO LISTEN ON, IF NOTHING WILL DEFAULT # EXAMPLE: ETTERCAP_INTERFACE=wlan0 ETTERCAP_INTERFACE=eth0 # # ETTERCAP HOME DIRECTORY (NEEDED FOR DNS_SPOOF) ETTERCAP_PATH=/usr/share/ettercap Ettercap is a multipurpose sniffer for switched LAN. Ettercap section can be used to perform LAN attacks like DNS poisoning, spoofing etc. The above SET setting can be used to either set ettercap ON of OFF depending upon the usability. # SENDMAIL ON OR OFF FOR SPOOFING EMAIL ADDRESSES SENDMAIL=OFF The sendmail e-mail server is primarily used for e-mail spoofing. This attack will work only if the target's e-mail server does not implement reverse lookup. By default, its value is set to OFF. The following setting shows one of the most widely used attack vectors of SET. This configuration will allow you to sign a malicious Java applet with your name or with any fake name, and then it can be used to perform a browser-based Java applet infection attack: # CREATE SELF-SIGNED JAVA APPLETS AND SPOOF PUBLISHER NOTE THIS REQUIRES YOU TO # INSTALL ---> JAVA 6 JDK, BT4 OR UBUNTU USERS: apt-get install openjdk-6-jdk # IF THIS IS NOT INSTALLED IT WILL NOT WORK. CAN ALSO DO apt-get install sun-java6-jdk SELF_SIGNED_APPLET=OFF We will discuss this attack vector in detail in a later recipe, that is, the Spear phishing attack vector . This attack vector will also require JDK to be installed on your system. Let's set its value to ON, as we will be discussing this attack in detail: SELF_SIGNED_APPLET=ON # AUTODETECTION OF IP ADDRESS INTERFACE UTILIZING GOOGLE, SET THIS ON IF YOU WANT # SET TO AUTODETECT YOUR INTERFACE AUTO_DETECT=ON The AUTO_DETECT flag is used by SET to auto-discover the network settings. It will enable SET to detect your IP address if you are using NAT/Port forwarding, and it allows you to connect to the external Internet. The following setting is used to set up the Apache web server to perform web-based attack vectors. It is always preferred to set it to ON for better attack performance: # USE APACHE INSTEAD OF STANDARD PYTHON WEB SERVERS, THIS WILL INCREASE SPEED OF # THE ATTACK VECTOR APACHE_SERVER=OFF # # PATH TO THE APACHE WEBROOT APACHE_DIRECTORY=/var/www The following setting is used to set up the SSL certificate while performing web attacks. Several bugs and issues have been reported for the WEBATTACK_SSL setting of SET. So, it is recommended to keep this flag OFF: # TURN ON SSL CERTIFICATES FOR SET SECURE COMMUNICATIONS THROUGH WEB_ATTACK VECTOR WEBATTACK_SSL=OFF The following setting can be used to build a self-signed certificate for web attacks, but there will be a warning message saying Untrusted certificate. Hence, it is recommended to use this option wisely to avoid alerting the target user: # PATH TO THE PEM FILE TO UTILIZE CERTIFICATES WITH THE WEB ATTACK VECTOR (REQUIRED) # YOU CAN CREATE YOUR OWN UTILIZING SET, JUST TURN ON SELF_SIGNED_CERT # IF YOUR USING THIS FLAG, ENSURE OPENSSL IS INSTALLED! # SELF_SIGNED_CERT=OFF The following setting is used to enable or disable the Metasploit listener once the attack is executed: # DISABLES AUTOMATIC LISTENER - TURN THIS OFF IF YOU DON'T WANT A METASPLOIT LISTENER IN THE BACKGROUND. AUTOMATIC_LISTENER=ON The following configuration will allow you to use SET as a standalone toolkit without using Metasploit functionalities, but it is always recommended to use Metasploit along with SET in order to increase the penetration testing performance: # THIS WILL DISABLE THE FUNCTIONALITY IF METASPLOIT IS NOT INSTALLED AND YOU JUST WANT TO USE SETOOLKIT OR RATTE FOR PAYLOADS # OR THE OTHER ATTACK VECTORS. METASPLOIT_MODE=ON These are a few important configuration settings available for SET. Proper knowledge of the config file is essential to gain full control over the SET. How it works... The SET config file is the heart of the toolkit, as it contains the default values that SET will pick while performing various attack vectors. A misconfigured SET file can lead to errors during the operation, so it is essential to understand the details defined in the config file in order to get the best results. The How to do it... section clearly reflects the ease with which we can understand and manage the config file. Working with the spear-phishing attack vector A spear-phishing attack vector is an e-mail attack scenario that is used to send malicious mails to target/specific user(s). In order to spoof your own e-mail address, you will require a sendmail server. Change the config setting to SENDMAIL=ON. If you do not have sendmail installed on your machine, then it can be downloaded by entering the following command: root@bt:~# apt-get install sendmail Reading package lists... Done Getting ready Before we move ahead with a phishing attack, it is imperative for us to know how the e-mail system works. Recipient e-mail servers, in order to mitigate these types of attacks, deploy gray-listing, SPF records validation, RBL verification, and content verification. These verification processes ensure that a particular e-mail arrived from the same e-mail server as its domain. For example, if a spoofed e-mail address, <richyrich@gmail.com>, arrives from the IP 202.145.34.23, it will be marked as malicious, as this IP address does not belong to Gmail. Hence, in order to bypass these security measures, the attacker should ensure that the server IP is not present in the RBL/SURL list. As the spear-phishing attack relies heavily on user perception, the attacker should conduct a recon of the content that is being sent and should ensure that the content looks as legitimate as possible. Spear-phishing attacks are of two types—web-based content and payload-based content. How to do it... The spear-phishing module has three different attack vectors at our disposal: Let's analyze each of them. Passing the first option will start our mass-mailing attack. The attack vector starts with selecting a payload. You can select any vulnerability from the list of available Metasploit exploit modules. Then, we will be prompted to select a handler that can connect back to the attacker. The options will include setting the vnc server or executing the payload and starting the command line, and so on. The next few steps will be starting the sendmail server, setting a template for a malicious file format, and selecting a single or mass-mail attack: Finally, you will be prompted to either choose a known mail service, such as Gmail or Yahoo, or use your own server: 1. Use a gmail Account for your email attack. 2. Use your own server or open relay set:phishing>1 set:phishing> From address (ex: moo@example.com):bigmoney@gmail.com set:phishing> Flag this message/s as high priority? [yes|no]:y Setting up your own server cannot be very reliable, as most of the mail services follow a reverse lookup to make sure that the e-mail has generated from the same domain name as the address name. Let's analyze another attack vector of spear-fishing. Creating a file format payload is another attack vector in which we can generate a file format with a known vulnerability and send it via e-mail to attack our target. It is preferred to use MS Word-based vulnerabilities, as they are difficult to detect whether they are malicious or not, so they can be sent as an attachment via an e-mail: set:phishing> Setup a listener [yes|no]:y [-] *** [-] * WARNING: Database support has been disabled [-] *** At last, we will be prompted on whether we want to set up a listener or not. The Metasploit listener will begin and we will wait for the user to open the malicious file and connect back to the attacking system. The success of e-mail attacks depends on the e-mail client that we are targeting. So, a proper analysis of this attack vector is essential. How it works... As discussed earlier, the spear-phishing attack vector is a social engineering attack vector that targets specific users. An e-mail is sent from the attacking machine to the target user(s). The e-mail will contain a malicious attachment, which will exploit a known vulnerability on the target machine and provide a shell connectivity to the attacker. The SET automates the entire process. The major role that social engineering plays here is setting up a scenario that looks completely legitimate to the target, fooling the target into downloading the malicious file and executing it.

0
0
17989

How-To Tutorials

Packt

25 Oct 2013

5 min read

The solvers – these great unknown

Packt

25 Oct 2013

5 min read

(For more resources related to this topic, see here.) Variable-step versus fixed-step solvers A variable-step solver will choose to use a shorter time step when the signal derivative is changing faster in order to obtain a better approximation, while it will use a longer step when the result changes at a slower pace. The most used variable-step solver is the ode45 solver (this is the default Simulink solver as well). A fixed-step solver will, on the other hand, always use the same step size. The simplest is the ode1 solver (the famous Euler method). We've already run our example with a fixed-step solver. Let's change this to a variable-step solver by opening the Model Configuration Parameters window and setting the ode45 solver. We now have the option to configure more Solver parameters; we'll allow it to use a Max step size of 1 second and a Min step size of 0.2 seconds, setting other parameters to the default value. Running the simulation now will give us the following result: We can see that the step size is indeed variable: the second value has been computed after only 0.2 seconds, and then the step size grew bigger and stayed at 1 second. We got a good approximation in the problematic region while using longer steps in the almost constant region, thus achieving a faster simulation time than a fixed-step solver with a step size of 0.2 seconds. But how does the solver determine when it's necessary to reduce the step size? The answer is by looking at the tolerances defined in the Model Configuration Parameters window. Relative tolerance defines the maximum error as a percentage of the actual value. The default value (1e-3) means that an error of 0.1 percent is accepted. Absolute tolerance defines the maximum error as a scalar value. When it is set to auto, Simulink will assume 1e-6 initially (that is, 0.000001), then change it to the maximum error calculated with the relative tolerance. The Shape preservation option, off ( Disable all ) by default, helps to increase the accuracy when the solution's derivative varies rapidly at the cost of a more computing-intensive simulation (thus increasing the overall time to compute the result). Finally, there's the option Number of consecutive min steps , which defines how many times the solver can use a time step smaller than the minimum step size violations defined before issuing a warning or an error. Let's set the Relative tolerance parameter to 1e-5 and run the simulation. We'll obtain this result: This time, the solver chose to use more shorter steps while the signal was changing to improve the accuracy, thanks to the relative tolerance we've set; then the step size grew gradually back to 1 second. Variable-step solvers give us a good trade-off between result approximation and overall simulation times in models, whereas fixed-step solvers are too computationally intensive to be used efficiently. Fixed-step solvers are required for code generation and real-time control purposes because Simulink can't go back in time to give a more accurate result. Continuous versus discrete A system can be continuous or discrete based on the presence of integrators or derivatives in the model (these are called continuous states). In other words, if the system is represented by the means of one or more differential equations, the system is continuous. Our cruise controller is continuous because it has an integral component. The continuous solver can choose to perform several iteration cycles in a single time step to reach the best possible approximation of the final result. Such solvers fall into the ODE ( Ordinary Differential Equation solvers) category. If a system is discrete, there are no differential equations but only memories and mathematical operands. A discrete solver is unable to perform the numerical integration required by continuous states. The continuous system requires continuous solvers, while Simulink will switch to a discrete solver for discrete systems even if a continuous solver has been specified. Try to set a discrete solver in our example system and run a simulation. An error window will appear, telling us that a discrete solver can't be used because the model contains continuous states. Stiff versus nonstiff In a stiff continuous system, certain solvers are unable to compute the result unless the step size is small enough; thus, a stiff solver is needed. These systems can be identified by an overall slowly changing solution that can vary rapidly in a very small time period. Our example is mildly stiff: the ode45 solver had to use a small step size to compute the result close to t = 0, but it managed to get the job done by changing the time step. A more appropriate choice would have been the ode23t solver (moderately stiff), which would have given the following result: Let's choose another nonstiff solver, the fixed-step ode1 , implementing the Euler method. Set the Fixed step size parameter to 1.5 and run the simulation: That's interesting: Using the Euler method with a long time step shows the result to be oscillating around the mathematical result, but the solution is still convergent. Setting the step size to 2 seconds will show that it can't converge anymore, and any step size higher than 2 seconds will make the simulation diverge from the correct result and go towards infinite oscillation. The following figure shows the result with a step size of 2.1 and a simulation time of 20 seconds: The following article presents stiff solvers in a way that is easy to understand: http://blogs.mathworks.com/seth/2012/07/03/why-do-we-need-stiff-ode-solvers/. The documentation center, accessible by pressing F1 , has detailed information about the solver types and parameters. This information can be found by navigating to Simulink | Simulation | Configure simulation . Summary In this article, we've learned how solvers work and how to choose the right solver to run the simulations Resources for Article: Further resources on this subject: 2-Dimensional Image Filtering [Article] GoboLinux: An Interview with Lucas Villa Real [Article] The NGINX HTTP Server [Article]

0
0
23587

article-image-managing-test-structure-robot-framework

Packt

25 Oct 2013

5 min read

Managing Test Structure with Robot Framework

Packt

25 Oct 2013

5 min read

(For more resources related to this topic, see here.) To manage the increased number of tests and to ensure their execution and quick and effective analysis, robot framework can be used which is a tool written in python which can be used to write different tests that may call other testing tools/software which perform actual tests and report back to the test framework. You should use robot framework because of various advantages that it offers, some of which is detailed in the following section: Uniformity XTest compliant Library support Flexible Distributive Open Uniformity Robot framework tests are uniform and readable from the end user's perspective and even if different technologies are involved in different tests for the same software, the test structure (since it is written in robot framework) for all the tests is consistent across the test project. XTest Compliant Robot framework is also xtest compliant, which means that it follows the same set of testing jargon and process as in other popular testing tools like JUnit, NUnit, and so on; also the developer associated with writing tests in robot framework has the least of surprises. Library Support Robot framework is supported by a vide variety of libraries for third party tools and given its robust community, it is easy to create a custom library as well. Flexible Not only the tests are uniform, but they may also be written according to the needs of the testing project. It is not uncommon to extend robot framework using keywords and associate tests to various keywords. This reduces complexity and brings about changes in writing tests. Test writing strategies like simple, keyword driven or behavior driven tests can be created and the same test can be written in many interesting ways. Distributive Robot framework is network friendly tool, and can also be used remotely to collect and share information. It also works with various network protocols straight out of box and presents advantages of a mature tool that can communicate and control tests present remotely. Open Owing to the open source nature(licensed under Apache version 2.0) of the robot framework, it is possible to see how it has been created and not get limited to its binary. For large scale customization, the source can directly be tweaked to perform work as desired. Environments Being a python based tool, it can work in python and python based environments such as jyton or ironpython which ensures that tests can be executed for newer application platforms like the java and .Net natively. The installations of various environments can be obtained directly or by building the source code . As mentioned by the build script, by using appropriate python instance to call the install.py, robot framework can be installed in different environments where the different executable are generated pertaining to the environment. For python environment, following executable are generated: pybot and rebot. For java environment, following executable are generated: jybot and jyrebot. For .net environment, following executable are generated: ipybot and ipyrebot. Note that in all three environments, the test structure and syntax is the same. What differs is the target applications which could be clr/jvm dependant and running robot framework on the same environment offers it the flexibility to access the software under test (SUT) more clearly. Test Naming Conventions In order to name tests, robot framework is very peculiar; it uses the configuration file and folder names to determine the execution order and test naming. For example, consider the following test hierarchy: application/ tests/ Test1.txt Other tests/ Another test.txt Running the pybot in the application folder will result in the following: The testsuite will be named after the root of test configuration, in this case, tests. Next, the Test1.txt will result in a nested testsuite, Test1 within testsuite tests. Similarly the testsuite another test will be nested under the tests/other tests testsuites. In order to organize the results, it is important to ensure proper naming. So, instead of using spaces, underscore '_' should be used. In order to get proper ordering irrespective of the test names, prefixing the test configuration with numbers can be done. This test hierarchy will therefore result in more understandable tests as well as the execution order of the tests can easily be predetermined. As mentioned, the test hierarchy is similar to any other XUnit based testing solution. Here, note that you can go on inserting test suites with themselves, which is different from other testing tool hierarchies. This approach gives more flexibility over the tests and introduces a new level of nesting. Running tests Running tests in robot framework is also very descriptive and the results of the tests can instantly be seen. In a similar manner, the results of the test can be viewed not only in the form of a detailed report, but also in the form of a log. The test results are also saved in an XML format where the tests can be recreated/merged/compared with other tests at a later date. This is achieved through the rebot command which can recreate the test structure in the generated report and log files through reading the metadata present in the XML file. Summary Through this article, you can clearly see the benefits of grouping and structuring the tests using robot framework and allowing for scalability in the tests by making the tests encapsulated within one another through the use of nested test suites. Robot Framework will not only integrated with your testing tool/solution, but also act as an assistant for your tests and manage, run and report their results back to you in a concise manner. Resources for Article: Further resources on this subject: Cross-browser-distributed testing [Article] Testing Backbone.js Application [Article] Python Testing: Installing the Robot Framework [Article]

0
0
8155

How-To Tutorials

article-image-scratching-surface-zend-framework-2

Packt

25 Oct 2013

11 min read

Scratching the Surface of Zend Framework 2

Packt

25 Oct 2013

11 min read

Bootstrap your app There are two ways to bootstrap your ZF2 app. The default is less flexible but handles the entire configuration, and the manual is really flexible but you have to take care of everything. The goal of the bootstrap is to provide to the application, ZendMvcApplication, with all the components and dependencies needed to successfully handle a request. A Zend Framework 2 application relies on the following six components: Configuration array ServiceManager instance EventManager instance ModuleManager instance Request object Response object As these are the pillars of a ZF2 application, we will take a look at how these components are configured to bootstrap the app. To begin with, we will see how the components interact from a high perspective and then we will jump into details of how each one works. When a new request arrives to our application, ZF2 needs to set up the environment to be able to fulfill it. This process implies reading configuration files and creating the required objects and services; attach them to the events that are going to be used and finally create the request object based on the request data. Once we have the request object, ZF2 will tell the router to do his job and will inspect the request object to determine who is responsible for processing the data. Once a controller and action has been identified as the one in charge of the request, ZF2 dispatches it and gives the controller/action the control of the program in order to execute the code that will interpret the request and will do something with it. This can be from accepting an uploaded image to showing a sign-up form and also changing data on an external database. When the controller processes the data, sometimes a view object is generated to encapsulate the data that we should send to the client who made the request, and a response object is created. After we have a response object, ZF2 sends it to the browser and the request ends. Now that we have seen a very simple overview of the lifecycle of a request we will jump into the details of how each object works, the options available and some examples of each one. Configuration array Let's dissect the first component of the list by taking a look at the index.php file: chdir(dirname(__DIR__)); // Setup autoloading require 'init_autoloader.php'; // Run the application! ZendMvcApplication::init(require 'config/application.config.php')->run(); As you can see we only do three things. The first thing is we change the current folder for the convenience of making everything relative to the root folder. Then we require the autoloader file; we will examine this file later. Finally, we initialize a ZendMvcApplication object by passing a configuration file and only then does the run method get called. The configuration file looks like the following code snippet: return array( 'modules' => array( 'Application', ), 'module_listener_options' => array( 'config_glob_paths' => array( 'config/autoload/{,*.}{global,local}.php', ), 'module_paths' => array( './module', './vendor', ), ), ); This file will return an array containing the configuration options for the application. Two options are used: modules and module_listener_options. As ZF2 uses a module organization approach, we should add the modules that we want to use on the application here. The second option we are using is passed as configuration to the ModuleManager object. The config_glob_path array is used when scanning the folders in search of config files and the module_paths array is used to tell ModuleManager a set of paths where the module resides. ZF2 uses a module approach to organize files. A module can contain almost anything, simple PHP files, view scripts, images, CSS, JavaScript, and so on. This approach will allow us to build reusable blocks of functionality and we will adhere to this while developing our project. PSR-0 and autoloaders Before continuing with the key components, let's take a closer look at the init_autoloader.php file used in the index.php file. As is stated on the first block comment, this file is more complicated than it's supposed to be. This is because ZF2 will try to set up different loading mechanisms and configurations. if (file_exists('vendor/autoload.php')) { $loader = include 'vendor/autoload.php'; } The first thing is to check if there is an autoload.php file inside the vendor folder; if it's found, we will load it. This is because the user might be using composer, in which case composer will provide a PSR-0 class loader. Also, this will register the namespaces defined by composer on the loader. PSR-0 is an autoloading standard proposed by the PHP Framework Interop Group (http://www.php-fig.org/) that describes the mandatory requirements for autoloader interoperability between frameworks. Zend Framework 2 is one of the projects that adheres to it. if (getenv('ZF2_PATH')) { $zf2Path = getenv('ZF2_PATH'); } elseif (get_cfg_var('zf2_path')) { $zf2Path = get_cfg_var('zf2_path'); } elseif (is_dir('vendor/ZF2/library')) { $zf2Path = 'vendor/ZF2/library'; } In the next section we will try to get the path of the ZF2 files from different sources. We will first try to get it from the environment, if not, we'll try from a directive value in the php.ini file. Finally, if the previous methods fail the code, we will try to check whether a specific folder exists inside the vendor folder. if ($zf2Path) { if (isset($loader)) { $loader->add('Zend', $zf2Path); } else { include $zf2Path . '/Zend/Loader/AutoloaderFactory.php'; ZendLoaderAutoloaderFactory::factory(array( 'ZendLoaderStandardAutoloader' => array( 'autoregister_zf' => true ) )); } } Finally, if the framework is found by any of these methods, based on the existence of the composer autoloader, the code will just add the Zend namespace or will instantiate an internal autoloader, ZendLoaderAutoloader, and use it as a default. As you can see, there are multiple ways to set up the autoloading mechanism on ZF2 and at the end what matters is which one you prefer, as all of them in essence will behave the same. ServiceManager After all this execution of code, we arrive at the last section of the index.php file where we actually instantiate the ZendMvcApplication object. As we said, there are two methods of creating an instance of ZendMvcApplication. In the default approach, we call the static method init of the class by passing an optional configuration as the first parameter. This method will take care of instantiating a new ServiceManager object, storing the configuration inside, loading the modules specified in the configuration, and getting a configured ZendMvcApplication. ServiceManager is a service/object locator that implements the Service Locator design pattern; its responsibility is to retrieve other objects. $serviceManager = new ServiceManager( new ServiceServiceManagerConfig($smConfig) ); $serviceManager->setService('ApplicationConfig', $configuration); $serviceManager->get('ModuleManager')->loadModules(); return $serviceManager->get('Application')->bootstrap(); As you can see, the init method calls the bootstrap() method of the ZendMvcApplication instance. Service Locator is a design pattern used in software development to encapsulate the process of obtaining other objects. The concept is based on a central repository that stores the objects and also knows how to create them if required. EventManager This component is designed to provide multiple functionalities. It can be used to implement simple observer patterns, and also can be used to do aspect-oriented design or even create event-driven architectures. The basic operations you can do over these components is attaching and detaching listeners to named events, trigger events, and interrupting the execution of listeners when an event is fired. Let's see a couple of examples on how to attach to an event and how to fire them: //Registering an event listener $events = new EventManager(); $events->attach(array('EVENT_NAME'), $callback); //Triggering an event $events->trigger('EVENT_NAME', $this, $params); Inside the bootstrap method of ZendMvcApplication, we are registering the events of RouteListener, DispatchListener, and ViewManager. After that, the code is instantiating a new custom event called MvcEvent that will be used as the target when firing events. Finally, this piece of code will fire the bootstrap event. ModuleManager Zend Framework 2 introduces a completely redesigned ModuleManager. This new module has been built with simplicity, flexibility, and reuse in mind. These modules can hold everything from PHP to images, CSS, library code, views, and so on. The responsibility of this component in the bootstrap process of an app is loading the available modules specified by the config file. This is accomplished by the following code line located in the init method of ZendMvcApplication: $serviceManager->get('ModuleManager')->loadModules(); This line, when executed, will retrieve the list of modules located at the config file and will load each module. Each module has to contain a file called Module.php with the initialization of the components of the module if needed. This will allow the module manager to retrieve the configuration of the module. Let's see the usual content of this file: namespace MyModule; class Module { public function getAutoloaderConfig() { return array( 'ZendLoaderClassMapAutoloader' => array( __DIR__ . '/autoload_classmap.php', ), 'ZendLoaderStandardAutoloader' => array( 'namespaces' => array( __NAMESPACE__ => __DIR__ . '/src/' . __NAMESPACE__, ), ), ); } public function getConfig() { return include __DIR__ . '/config/module.config.php'; } } As you can see we are defining a method called getAutoloaderConfig() that provides the configuration for the autoloader to ModuleManager. The last method getConfig() is used to provide the configuration of the module to ModuleManager; for example, this will contain the routes handled by the module. Request object This object encapsulates all data related to a request and allows the developer to interact with the different parts of a request. This object is used in the constructor of ZendMvcApplication and also is set inside MvcEvent to be able to retrieve when some events are fired. Response object This object encapsulates all the parts of an HTTP response and provides the developer with a fluent interface to set all the response data. This object is used in the same way as the request object. Basically it is instantiated on the constructor and added to MvcEvent to be able to interact with it across all the events and classes. The request object As we said, the request object will encapsulate all the data related to a request and provide the developer with a fluent API to access the data. Let's take a look at the details of the request object in order to understand how to use it and what it can offer to us: use ZendHttpRequest; $string = "GET /foo HTTP/1.1rnrnSome Content"; $request = Request::fromString($string); $request->getMethod(); $request->getUri(); $request->getUriString(); $request->getVersion(); $request->getContent(); This example comes directly from the documentation and shows how a request object can be created from a string, and then access some data related with the request using the methods provided. So, every time we need to know something related to the request, we will access this object to get the data we need. If we check the code on ZendHttpPhpEnvironmentRequest.php, the first thing we can notice is that the data is populated on the constructor using the superglobal arrays. All this data is processed and then populated inside the object to be able to expose it in a standard way using methods. To manipulate the URI of the request you can get/set the data with three methods, two getters and one setter. The only difference with the getters is that one returns a plain string and the other returns an HttpUri object. getUri() and getUriString() setUri() To retrieve the data passed in the request, there are a few specialized methods depending on the data you want to get: getQuery() getPost() getFiles() getHeader() and getHeaders() About the request method, the object has a general way to know the method used, returning a string or nine specialized functions that will test specific methods based on the RFC 2616, which defines the standard methods for an HTTP request. getMethod() isOptions() isGet() isHead() isPost() isPut() isDelete() isTrace() isConnect() isPatch() Finally, two more methods are available in this object that will test special requests such as AJAX and requests made by a flash object. isXmlHttpRequest() isFlashRequest() Notice that the data stored on the superglobal arrays when populated on the object are converted from an Array to a Parameters object. The Parameters object lives in the Stdlib section of ZF2, a folder where common objects can be found and used across the framework. In this case, the Parameters class is an extension of ArrayObject and implements ParametersInterface that will bring ArrayAccess, Countable, Serializable, and Traversable functionality to the parameters stored inside the object. The goal with this object is to provide a common interface to access data stored in the superglobal arrays. This expands the ways you can interact with the data in an object-oriented approach.

0
0
1839

article-image-designing-sample-applications-and-creating-reports

Packt

25 Oct 2013

6 min read

Designing Sample Applications and Creating Reports

Packt

25 Oct 2013

6 min read

0
0
1715

Packt

25 Oct 2013

28 min read

About Cassandra

Packt

25 Oct 2013

28 min read

(For more resources related to this topic, see here.) So, if Cassandra is so good at everything, why not everyone drop whatever database they are using and jump start with Cassandra? This is a natural question. Some applications require strong ACID compliance, such as a booking system. If you are a person who goes by statistics, you'd ask how Cassandra fares with other existing data stores. TilmannRabl et al in their paper, Solving Big Data Challenges for Enterprise Application Performance Management (http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf), told that, "In terms of scalability, there is a clear winner throughout our experiments. Cassandra achieves the highest throughput for the maximum number of nodes in all experiments with a linear increasing throughput from one to 12 nodes. This comes at the price of a high write and read latency. Cassandra's performance is best for high insertion rates." If you go through the paper, Cassandra wins in almost all the criteria. Equipped with proven concepts of distributed computing, made to reliably serve from commodity servers, and simple and easy maintenance, Cassandra is one of the most scalable, fastest, and very robust NoSQL database. So, the next natural question is what makes Cassandra so blazing fast? Let us dive deeper into the Cassandra architecture. Cassandra architecture Cassandra is a relative latecomer in the distributed data-store war. It takes advantage of two proven and closely similar data-store mechanisms, namely Google BigTable, a distributed storage system for structured data, 2006 (http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/bigtable-osdi06.pdf [2006]), and Amazon Dynamo, Amazon's highly available key-value store, 2007 (http://www.read.seas.harvard.edu/~kohler/class/cs239-w08/decandia07dynamo.pdf [2007]). Figure 2.3: Read throughputs shows linear scaling of Cassandra Like BigTable, it has tabular data presentation. It is not tabular in the strictest sense. It is rather a dictionary-like structure where each entry holds another sorted dictionary/map. This model is more powerful than the usual key-value store and it is named as column family. The properties such as Eventual Consistency and decentralization are taken from Dynamo. For now, assume a column family as a giant spreadsheet, such as MS Excel. But unlike spreadsheets, each row is identified by a row key with a number (token), and unlike spreadsheets, each cell can have its own unique name within the row. The columns in the rows are sorted by this unique column name. Also, since the number of rows is allowed to be very large (1.7*(10)^38), we distribute the rows uniformly across all the available machines by dividing the rows in equal token groups. These rows create a Keyspace. Keyspace is a set of all the tokens (row IDs). Ring representation Cassandra cluster is denoted as a ring. The idea behind this representation is to show token distribution. Let's take an example. Assume that there is a partitioner that generates tokens from zero to 127 and you have four Cassandra machines to create a cluster. To allocate equal load, we need to assign each of the four nodes to bear an equal number of tokens. So, the first machine will be responsible for tokens one to 32, the second will hold 33 to 64, the third, 65 to 96, and the fourth, 97 to 127 and 0. If you mark each node with the maximum token number that it can hold, the cluster looks like a ring. (Figure 2.22) Partitioner is the hash function that determines the range of possible row keys. Cassandra uses a partitioner to calculate the token equivalent to a row key (row ID). Figure 2.4: Token ownership and distribution in a balanced Cassandra ring When you start to configure Cassandra, one thing that you may want to set is the maximum token number that a particular machine could hold. This property can be set in the Cassandra.yaml file as initial_token. One thing that may confuse a beginner is that the value of the initial token is what the last token owns. Be aware that nodes can be rebalanced and these tokens can be changed as the new nodes join or old nodes get discarded. This is the initial token because this is just the initial value, and it may be changed later. How Cassandra works Diving into various components of Cassandra without having a context is really a frustrating experience. It does not makes sense why you are studying SSTable, MemTable, and Log Structured Merge (LSM) tree without being able to see how they fit into functionality and performance guarantees that Cassandra gives. So, first, we will see Cassandra's write and read mechanism. It is possible that some of the terms that we encounter during this discussion may not be immediately understandable. A rough overview of the Cassandra components is as shown in the following figure: Figure 2.5: Main components of the Cassandra service The main class of Storage Layer is StorageProxy. It handles all the requests. Messaging Layer is responsible for internode communications like gossip. Apart from this, process-level structures keep a rough idea about the actual data containers and where they live. There are four data buckets that you need to know. MemTable is a hash table-like structure that stays in memory. It contains actual column data. SSTable is the disk version of MemTables. When MemTables are full, SSTables are persisted to the hard disk. Bloom filters is a probabilistic data structure that lives in memory. It helps Cassandra to quickly detect which SSTable does not have the requested data. CommitLog is the usual commit log that contains all the mutations that are to be applied. It lives on the disk and helps to replay uncommitted changes. With this primer, we can start looking into how write and read works in Cassandra. We will see more explanation later. Write in action To write, clients need to connect to any of the Cassandra nodes and send a write request. This node is called as the coordinator node. When a node in Cassandra cluster receives a write request, it delegates it to a service called StorageProxy. This node may or may not be the right place to write the data to. The task of StorageProxy is to get the nodes (all the replicas) that are responsible to hold the data that is going to be written. It utilizes a replication strategy to do that. Once the replica nodes are identified, it sends the RowMutation message to them, the node waits for replies from these nodes, but it does not wait for all the replies to come. It only waits for as many responses as are enough to satisfy the client's minimum number of successful writes defined by ConsistencyLevel. So, the following figure and steps after that show all that can happen during a write mechanism: Figure 2.6: A simplistic representation of the write mechanism. The figure on the left represents the node-local activities on receipt of the write request If FailureDetector detects that there aren't enough live nodes to satisfy ConsistencyLevel, the request fails. If FailureDetector gives a green signal, but writes time-out after the request is sent due to infrastructure problems or due to extreme load, StorageProxy writes a local hint to replay when the failed nodes come back to life. This is called hinted handoff. One might think that hinted handoff may be responsible for Cassandra's eventual consistency. But it's not entirely true. If the coordinator node gets shut down or dies due to hardware failure and hints on this machine cannot be forwarded, eventual consistency will not occur. The Anti-entropy mechanism is responsible for consistency rather than hinted handoff. Anti-entropy makes sure that all replicas are in sync. If the replica nodes are distributed across datacenters, it will be a bad idea to send individual messages to all the replicas in other datacenters. It rather sends the message to one replica in each datacenter with a header instructing it to forward the request to other replica nodes in that datacenter. Now, the data is received by the node that should actually store that data. The data first gets appended to CommitLog, and pushed to a MemTable for the appropriate column family in the memory. When MemTable gets full, it gets flushed to the disk in a sorted structure named SSTable. With lots of flushes, the disk gets plenty of SSTables. To manage SSTables, a compaction process runs. This process merges data from smaller SSTables to one big sorted file. Read in action Similar to a write case, when StorageProxy of the node that a client is connected to gets the request, it gets a list of nodes containing this key based on Replication Strategy. StorageProxy then sorts the nodes based on their proximity to itself. The proximity is determined by the Snitch function that is set up for this cluster. Basically, there are the following types of Snitch: SimpleSnitch: A closer node is the one that comes first when moving clockwise in the ring. (A ring is when all the machines in the cluster are placed in a circular fashion with each having a token number. When you walk clockwise, the token value increases. At the end, it snaps back to the first node.) AbstractNetworkTopologySnitch: Implementation of the Snitch function works like this: nodes on the same rack are closest. The nodes in the same datacenter but in different rack are closer than those in other datacenters, but farther than the nodes in the same rack. Nodes in different datacenters are the farthest. To a node, the nearest node will be the one on the same rack. If there is no node on the same rack, the nearest node will be the one that lives in the same datacenter, but on a different rack. If there is no node in the datacenter, any nearest neighbor will be the one in the other datacenter. DynamicSnitch: This Snitch determines closeness based on recent performance delivered by a node. So, a quick-responding node is perceived closer than a slower one, irrespective of their location closeness or closeness in the ring. This is done to avoid overloading a slow-performing node. Now that we have the list of nodes that have desired row keys, it's time to pull data from them. The coordinator node (the one that the client is connected to) sends a command to the closest node to perform read (we'll discuss local read in a minute) and return the data. Now, based on ConsistencyLevel, other nodes will send a command to perform a read operation and send just the digest of the result. If we have Read Repair (discussed later) enabled, the remaining replica nodes will be sent a message to compute the digest of the command response. Let's take an example: say you have five nodes containing a row key K (that is, replication factor (RF) equals 5). Your read ConsistencyLevel is three. Then the closest of the five nodes will be asked for the data. And the second and third closest nodes will be asked to return the digest. We still have two left to be queried. If read Repair is not enabled, they will not be touched for this request. Otherwise, these two will be asked to compute digest. The request to the last two nodes is done in the background, after returning the result. This updates all the nodes with the most recent value, making all replicas consistent. So, basically, in all scenarios, you will have a maximum one wrong response. But with correct read and write consistency levels, we can guarantee an up-to-date response all the time. Let's see what goes within a node. Take a simple case of a read request looking for a single column within a single row. First, the attempt is made to read from MemTable, which is rapid-fast since there exists only one copy of data. This is the fastest retrieval. If the data is not found there, Cassandra looks into SSTable. Now, remember from our earlier discussion that we flush MemTables to disk as SSTables and later when compaction mechanism wakes up, it merges those SSTables. So, our data can be in multiple SSTables. Figure 2.7: A simplified representation of the read mechanism. The bottom image shows processing on the read node. Numbers in circles shows the order of the event. BF stands for Bloom Filter Each SSTable is associated with its Bloom Filter built on the row keys in the SSTable. Bloom Filters are kept in memory, and used to detect if an SSTable may contain (false positive) the row data. Now, we have the SSTables that may contain the row key. The SSTables get sorted in reverse chronological order (latest first). Apart from Bloom Filter for row keys, there exists one Bloom Filter for each row in the SSTable. This secondary Bloom Filter is created to detect whether the requested column names exist in the SSTable. Now, Cassandra will take SSTables one by one from younger to older. And use the index file to locate the offset for each column value for that row key and the Bloom filter associated with the row (built on the column name). On Bloom filter being positive for the requested column, it looks into the SSTable file to read the column value. Note that we may have a column value in other yet-to-be-read SSTables, but that does not matter, because we are reading the most recent SSTables first, and any value that was written earlier to it does not matter. So, the value gets returned as soon as the first column in the most recent SSTable is allocated. Components of Cassandra We have gone through how read and write takes place in highly distributed Cassandra clusters. It's time to look into individual components of it a little deeper. Messaging service Messaging service is the mechanism that manages internode socket communication in a ring. Communications, for example, gossip, read, read digest, write, and so on, processed via a messaging service, can be assumed as a gateway messaging server running at each node. To communicate, each node creates two socket connections per node. This implies that if you have 101 nodes, there will be 200 open sockets on each node to handle communication with other nodes. The messages contain a verb handler within them that basically tells the receiving node a couple of things: how to deserialize the payload message and what handler to execute for this particular message. The execution is done by the verb handlers (sort of an event handler). The singleton that orchestrates the messaging service mechanism is org.apache.cassandra.net.MessagingService. Gossip Cassandra uses the gossip protocol for internode communication. As the name suggests, the protocol spreads information in the same way an office rumor does. It can also be compared to a virus spread. There is no central broadcaster, but the information (virus) gets transferred to the whole population. It's a way for nodes to build the global map of the system with a small number of local interactions. Cassandra uses gossip to find out the state and location of other nodes in the ring (cluster). The gossip process runs every second and exchanges information with at the most three other nodes in the cluster. Nodes exchange information about themselves and other nodes that they come to know about via some other gossip session. This causes all the nodes to eventually know about all the other nodes. Like everything else in Cassandra, gossip messages have a version number associated with it. So, whenever two nodes gossip, the older information about a node gets overwritten with a newer one. Cassandra uses an Anti-entropy version of gossip protocol that utilizes Merkle trees (discussed later) to repair unread data. Implementation-wise the gossip task is handled by the org.apache.cassandra.gms.Gossiper class. Gossiper maintains a list of live and dead endpoints (the unreachable endpoints). At every one-second interval, this module starts a gossip round with a randomly chosen node. A full round of gossip consists of three messages. A node X sends a syn message to a node Y to initiate gossip. Y, on receipt of this syn message, sends an ack message back to X. To reply to this ack message, X sends an ack2 message to Y completing a full message round. Figure 2.8: Two nodes gossiping Failure detection Failure detection is one of the fundamental features of any robust and distributed system. A good failure detection mechanism implementation makes a fault-tolerant system such as Cassandra. The failure detector that Cassandra uses is a variation of The ϕ accrual failure detector (2004) by Xavier Défago et al. (The phi accrual detector research paper is available at http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.106.3350.) The idea behind a FailureDetector is to detect a communication failure and take appropriate actions based on the state of the remote node. Unlike traditional failure detectors, phi accrual failure detector does not emit a Boolean alive or dead (true or false, trust or suspect) value. Instead, it gives a continuous value to the application and the application is left to decide the level of severity and act accordingly. This continuous suspect value is called phi (ϕ). Partitioner Cassandra is a distributed database management system. This means it takes a single logical database and distributes it over one or more machines in the database cluster. So, when you insert some data in Cassandra, it assigns each row to a row key; and based on that row key, Cassandra assigns that row to one of the nodes that's responsible for managing it. Let's try to understand this. Cassandra inherits the data model from Google's BigTable (BigTable research paper can be found at http://research.google.com/archive/bigtable.html.). This means we can roughly assume that the data is stored in some sort of a table that has an unlimited number of columns (not unlimited, Cassandra limits the maximum number of columns to be two billion) with rows binded with a unique key, namely, row key. Now, your terabytes of data on one machine will be restrictive from multiple points of views. One is disk space, another being limited parallel processing, and if not duplicated, a source of single point of failure. What Cassandra does is, it defines some rules to slice data across rows and assigns which node in the cluster is responsible for holding which slice. This task is done by a partitioner. There are several types of partitioners to choose from. In short, Cassandra (as of Version 1.2) offers three partitioners as follows: RandomPartitioner: It uses MD5 hashing to distribute data across the cluster. Cassandra 1.1.x and precursors have this as the default partitioner. Murmur3Partitioner: It uses Murmur hash to distribute the data. It performs better than RandomPartitioner. It is the default partitioner from Cassandra Version 1.2 onwards. ByteOrderPartitioner: Keeps keys distributed across the cluster by key bytes. This is an ordered distribution, so the rows are stored in lexical order. This distribution is commonly discouraged because it may cause a hotspot. Replication Cassandra runs on commodity hardware, and works reliably in network partitions. However, this comes with a cost: replication. To avoid data inaccessibility in case a node goes down or becomes unavailable, one must replicate data to more than one node. Replication brings features such as fault tolerance and no single point of failure to the system. Cassandra provides more than one strategy to replicate the data, and one can configure the replication factor while creating keyspace. Log Structured Merge tree Cassandra (also, HBase) is heavily influenced by Log Structured Merge (LSM). It uses an LSM tree-like mechanism to store data on a disk. The writes are sequential (in append fashion) and the data storage is contiguous. This makes writes in Cassandra superfast, because there is no seek involved. Contrast this with an RBDMS system that is based on the B+ Tree (http://en.wikipedia.org/wiki/B%2B_tree) implementation. LSM tree advocates the following mechanism to store data: note down the arriving modification into a log file (CommitLog), push the modification/new data into memory (MemTable) for faster lookup, and when the system has gathered enough updates in memory or after a certain threshold time, flush this data to a disk in a structured store file (SSTable). The logs corresponding to the updates that are flushed can now be discarded. Figure 2.12: Log Structured Merge (LSM) Trees CommitLog One of the promises that Cassandra makes to the end users is durability. In conventional terms (or in ACID terminology), durability guarantees that a successful transaction (write, update) will survive permanently. This means once Cassandra says write successful that means the data is persisted and will survive system failures. It is done the same way as in any DBMS that guarantees durability: by writing the replayable information to a file before responding to a successful write. This log is called the CommitLog in the Cassandra realm. This is what happens. Any write to a node gets tracked by org.apache.cassandra.db.commitlog.CommitLog, which writes the data with certain metadata into the CommitLog file in such a manner that replaying this will recreate the data. The purpose of this exercise is to ensure there is no data loss. If due to some reason the data could not make it into MemTable or SSTable, the system can replay the CommitLog to recreate the data. MemTable MemTable is an in-memory representation of column family. It can be thought of as a cached data. MemTable is sorted by key. Data in MemTable is sorted by row key. Unlike CommitLog, which is append-only, MemTable does not contain duplicates. A new write with a key that already exists in the MemTable overwrites the older record. This being in memory is both fast and efficient. The following is an example: Write 1: {k1: [{c1, v1}, {c2, v2}, {c3, v3}]} In CommitLog (new entry, append): {k1: [{c1, v1},{c2, v2}, {c3, v3}]} In MemTable (new entry, append): {k1: [{c1, v1}, {c2, v2}, {c3, v3}]} Write 2: {k2: [{c4, v4}]} In CommitLog (new entry, append): {k1: [{c1, v1}, {c2, v2}, {c3, v3}]} {k2: [{c4, v4}]} In MemTable (new entry, append): {k1: [{c1, v1}, {c2, v2}, {c3, v3}]} {k2: [{c4, v4}]} Write 3: {k1: [{c1, v5}, {c6, v6}]} In CommitLog (old entry, append): {k1: [{c1, v1}, {c2, v2}, {c3, v3}]} {k2: [{c4, v4}]} {k1: [{c1, v5}, {c6, v6}]} In MemTable (old entry, update): {k1: [{c1, v5}, {c2, v2}, {c3, v3}, {c6, v6}]} {k2: [{c4, v4}]} Cassandra Version 1.1.1 uses SnapTree (https://github.com/nbronson/snaptree) for MemTable representation, which claims it to be "... a drop-in replacement for ConcurrentSkipListMap, with the additional guarantee that clone() is atomic and iteration has snapshot isolation." See also copy-on-write and compare-and-swap (http://en.wikipedia.org/wiki/Copy-on-write, http://en.wikipedia.org/wiki/Compare-and-swap). Any write gets written first to CommitLog and then to MemTable. SSTable SSTable is a disk representation of the data. MemTables gets flushed to disk to immutable SSTables. All the writes are sequential, which makes this process fast. So, the faster the disk speed, the quicker the flush operation. The SSTables eventually get merged in the compaction process and the data gets organized properly into one file. This extra work in compaction pays off during reads. SSTables have three components: Bloom filter, index files, and datafiles. Bloom filter Bloom filter is a litmus test for the availability of certain data in storage (collection). But unlike a litmus test, a Bloom filter may result in false positives: that is, it says that a data exists in the collection associated with the Bloom filter, when it actually does not. A Bloom filter never results in a false negative. That is, it never states that a data is not there while it is. The reason to use Bloom filter, even with its false-positive defect, is because it is superfast and its implementation is really simple. Cassandra uses Bloom filters to determine whether an SSTable has the data for a particular row key. Bloom filters are unused for range scans, but they are good candidates for index scans. This saves a lot of disk I/O that might take in a full SSTable scan, which is a slow process. That's why it is used in Cassandra, to avoid reading many, many SSTables, which can become a bottleneck. Index files Index files are companion files of SSTables. The same as Bloom filter, there exists one index file per SSTable. It contains all the row keys in the SSTable and its offset at which the row starts in the datafile. At startup, Cassandra reads every 128th key (configurable) into the memory (sampled index). When the index is looked for a row key (after Bloom filter hinted that the row key might be in this SSTable), Cassandra performs a binary search on the sampled index in memory. Followed by a positive result from the binary search, Cassandra will have to read a block in the index file from the disk starting from the nearest value lower than the value that we are looking for. Datafiles Datafiles are the actual data. They contain row keys, metadata, and columns (partial or full). Reading data from datafiles is just one disk seek followed by a sequential read, as offset to a row key is already obtained from the associated index file. Compaction As mentioned earlier, a read require may require Cassandra to read across multiple SSTables to get a result. This is wasteful, costs multiple (disk) seeks, may require a conflict resolution, and if there are too many, SSTables were created. To handle this problem, Cassandra has a process in place, namely, compaction. Compaction merges multiple SSTable files into one. Off the shelf, Cassandra offers two types of compaction mechanism: Size Tiered Compaction Strategy and Level Compaction Strategy. The compaction process starts when the number of SSTables on disk reaches a certain threshold (N: configurable). Although the merge process is a little I/O intensive, it benefits in the long term with a lower number of disk seeks during reads. Apart from this, there are a few other benefits of compaction as follows: Removal of expired tombstones (Cassandra v0.8+) Merging row fragments Rebuilds primary and secondary indexes Tombstones Cassandra is a complex system with its data distributed among CommitLogs, MemTables, and SSTables on a node. The same data is then replicated over replica nodes. So, like everything else in Cassandra, deletion is going to be eventful. Deletion, to an extent, follows an update pattern except Cassandra tags the deleted data with a special value, and marks it as a tombstone. This marker helps future queries, compaction, and conflict resolution. Let's step further down and see what happens when a column from a column family is deleted. A client connected to a node (coordinator node, but it may not be the one holding the data that we are going to mutate), issues a delete command for a column C, in a column family CF. If the consistency level is satisfied, the delete command gets processed. When a node, containing the row key, receives a delete request, it updates or inserts the column in MemTable with a special value, namely, tombstone. The tombstone basically has the same column name as the previous one; the value is set to UNIX epoch. The timestamp is set to what the client has passed. When a MemTable is flushed to SSTable, all tombstones go into it as any regular column. On the read side, when the data is read locally on the node and it happens to have multiple versions of it in different SSTables, they are compared and the latest value is taken as the result of reconciliation. If a tombstone turns out to be a result of reconciliation, it is made a part of the result that this node returns. So, at this level, if a query has a deleted column, this exists in the result. But the tombstones will eventually be filtered out of the result before returning it back to the client. So, a client can never see a value that is a tombstone. Hinted handoff When we last talked about durability, we observed Cassandra provides CommitLogs to provide write durability. This is good. But what if the node, where the writes are going to be, is itself dead? No communication will keep anything new to be written to the node. Cassandra, inspired by Dynamo, has a feature called hinted handoff. In short, it's the same as taking a quick note locally that X cannot be contacted. Here is the mutation (operation that requires modification of the data such as insert, delete, and update) M that will be required to be replayed when it comes back. The coordinator node (the node which the client is connected to) on receipt of a mutation/write request, forwards it to appropriate replicas that are alive. If this fulfills the expected consistency level, write is assumed successful. The write requests to a node that does not respond to a write request or is known to be dead (via gossip) and is stored locally in the system.hints table. This hint contains the mutation. When a node comes to know via gossip that a node is recovered, it replays all the hints it has in store for that node. Also, every 10 minutes, it keeps checking any pending hinted handoffs to be written. Why worry about hinted handoff when you have written to satisfy consistency level? Wouldn't it eventually get repaired? Yes, that's right. Also, hinted handoff may not be the most reliable way to repair a missed write. What if the node that has hinted handoff dies? This is a reason why we do not count on hinted handoff as a mechanism to provide consistency (except for the case of the consistency level, ANY) guarantee; it's a single point of failure. The purpose of hinted handoff is, one, to make restored nodes quickly consistent with the other live ones; and two, to provide extreme write availability when consistency is not required. Read repair and Anti-entropy Cassandra promises eventual consistency and read repair is the process which does that part. Read repair, as the name suggests, is the process of fixing inconsistencies among the replicas at the time of read. What does that mean? Let's say we have three replica nodes A, B, and C that contain a data X. During an update, X is updated to X1 in replicas A and B. But it is failed in replica C for some reason. On a read request for data X, the coordinator node asks for a full read from the nearest node (based on the configured Snitch) and digest of data X from other nodes to satisfy consistency level. The coordinator node compares these values (something like digest(full_X) == digest_from_node_C). If it turns out that the digests are the same as the digests of full read, the system is consistent and the value is returned to the client. On the other hand, if there is a mismatch, full data is retrieved and reconciliation is done and the client is sent the reconciled value. After this, in background, all the replicas are updated with the reconciled value to have a consistent view of data on each node. See Figure 2.1 as shown: Figure 2.16: Image showing read repair dynamics. 1. Client queries for data x, from a node C (coordinator). 2. C gets data from replicas R1, R2, and R3; reconciles. 3. Sends reconciled data to client. 4. If there is a mismatch across replicas, repair is invoked. Merkle tree Merkle tree (A digital signature Based On A Conventional Encryption Function by Merkle, R. (1988), available at http://www.cse.msstate.edu/~ramkumar/merkle2.pdf.) is a hash tree where leaves of the tree hashes hold actual data in a column family and non-leaf nodes hold hashes of their children. The unique advantage of Merkle tree is a whole subtree can be validated just by looking at the value of the parent node. So, if nodes on two replica servers have the same hash values, then the underlying data is consistent and there is no need to synchronize. If one node passes the whole Merkle tree of a column family to another node, it can determine all the inconsistencies. Figure 2.17: Merkle tree to determine mismatch in hash values at parent nodes due to the difference in underlying data Summary By now, you are familiar with all the nuts and bolts of Cassandra. It is understandable that it may be a lot to take in for someone new to NoSQL systems. It is okay if you do not have complete clarity at this point. As you start working with Cassandra, tweaking it, experimenting with it, and going through the Cassandra mailing list discussions or talks, you will start to come across stuff that you have read in this article and it will start to make sense, and perhaps you may want to come back and refer to this article to improve clarity. It is not required to understand this article fully to be able to write queries, set up clusters, maintain clusters, or do anything else related to Cassandra. A general sense of this article will take you far enough to work extremely well with Cassandra-based projects. Resources for Article: Further resources on this subject: Apache Cassandra: Libraries and Applications [Article] Apache Felix Gogo [Article] Migration from Apache to Lighttpd [Article]

0
0
3608

article-image-image-classification-and-feature-extraction-images

Packt

25 Oct 2013

3 min read

Image classification and feature extraction from images

Packt

25 Oct 2013

3 min read

(For more resources related to this topic, see here.) Classifying images Automated Remote Sensing ( ARS ) is rarely ever done in the visible spectrum. The most commonly available wavelengths outside of the visible spectrum are infrared and near-infrared. The following scene is a thermal image (band 10) from a fairly recent Landsat 8 flyover of the US Gulf Coast from New Orleans, Louisiana to Mobile, Alabama. Major natural features in the image are labeled so you can orient yourself: Because every pixel in that image has a reflectance value, it is information. Python can "see" those values and pick out features the same way we intuitively do by grouping related pixel values. We can colorize pixels based on their relation to each other to simplify the image and view related features. This technique is called classification. Classifying can range from fairly simple groupings based only on some value distribution algorithm derived from the histogram to complex methods involving training data sets and even computer learning and artificial intelligence. The simplest forms are called unsupervised classifications, whereas methods involving some sort of training data to guide the computer are called supervised. It should be noted that classification techniques are used across many fields, from medical doctors trying to spot cancerous cells in a patient's body scan, to casinos using facial-recognition software on security videos to automatically spot known con-artists at blackjack tables. To introduce remote sensing classification we'll just use the histogram to group pixels with similar colors and intensities and see what we get. First you'll need to download the Landsat 8 scene here: http://geospatialpython.googlecode.com/files/thermal.zip Instead of our histogram() function from previous examples, we'll use the version included with NumPy that allows you to easily specify a number of bins and returns two arrays with the frequency as well as the ranges of the bin values. We'll use the second array with the ranges as our class definitions for the image. The lut or look-up table is an arbitrary color palette used to assign colors to classes. You can use any colors you want. import gdalnumeric # Input file name (thermal image) src = "thermal.tif" # Output file name tgt = "classified.jpg" # Load the image into numpy using gdal srcArr = gdalnumeric.LoadFile(src) # Split the histogram into 20 bins as our classes classes = gdalnumeric.numpy.histogram(srcArr, bins=20)[1] # Color look-up table (LUT) - must be len(classes)+1. # Specified as R,G,B tuples lut = [[255,0,0],[191,48,48],[166,0,0],[255,64,64], [255,115,115],[255,116,0],[191,113,48],[255,178,115], [0,153,153],[29,115,115],[0,99,99],[166,75,0], [0,204,0],[51,204,204],[255,150,64],[92,204,204],[38,153,38], [0,133,0],[57,230,57],[103,230,103],[184,138,0]] # Starting value for classification start = 1 # Set up the RGB color JPEG output image rgb = gdalnumeric.numpy.zeros((3, srcArr.shape[0], srcArr.shape[1],), gdalnumeric.numpy.float32) # Process all classes and assign colors for i in range(len(classes)): mask = gdalnumeric.numpy.logical_and(start <= srcArr, srcArr <= classes[i]) for j in range(len(lut[i])): rgb[j] = gdalnumeric.numpy.choose(mask, (rgb[j], lut[i][j])) start = classes[i]+1 # Save the image gdalnumeric.SaveArray(rgb.astype(gdalnumeric.numpy.uint8), tgt, format="JPEG") The following image is our classification output, which we just saved as a JPEG. We didn't specify the prototype argument when saving as an image, so it has no georeferencing information. This result isn't bad for a very simple unsupervised classification. The islands and coastal flats show up as different shades of green. The clouds were isolated as shades of orange and dark blues. We did have some confusion inland where the land features were colored the same as the Gulf of Mexico. We could further refine this process by defining the class ranges manually instead of just using the histogram.

0
0
13101

Packt

25 Oct 2013

9 min read

General Considerations

Packt

25 Oct 2013

9 min read

0
0
11391

article-image-advanced-system-management

Packt

25 Oct 2013

11 min read

Advanced System Management

Packt

25 Oct 2013

11 min read

(For more resources related to this topic, see here.) Beyond backups Of course, backups are not the only issue with managing multiple, remote systems. In particular, managing such multiple configurations using a centralized application is often desirable. Configuration management One of the issues frequently faced by administrators is that of having multiple, remote systems all with similar software for the most part, but with minor differences in what is installed or running. Debian provides several packages that can help manage such an environment in a unified manner. Two of the more popular packages, both available in Debian, are FAI and Puppet. While we don't have the space to go into details, both applications are described briefly here. Fully Automated Installation Fully Automated Installation (FAI) focuses on managing Linux installations, and is developed using Debian, although it works with many different distributions, not just Debian. FAI uses a class concept for categorizing similar systems, and provides a good deal of flexibility and customization via hooks. FAI provides for unattended, automatic installation as well as tools for monitoring and updating groups of systems. FAI is frequently used for creating and maintaining clusters. More information is available at http://fai-project.org/. Puppet Probably the best known application for distributed management is Puppet, developed by Puppet Labs. Unlike FAI, only the Open Source edition is free, the Enterprise edition, which has many additional features, is not. Puppet does include support for environments other than Linux. The desired configuration is described in a custom, high-level definition language, and distributed to systems with installed clients. Unlike FAI, Puppet does not provide its own bare metal remote installation method, but does use existing methods (such as kickstart) to provide this function. A number of companies that make heavy use of distributed and clustered systems use Puppet to manage their environments. More information is available at http://puppetlabs.com/. Other packages There are other packages that can be used to manage a distributed environment, such as Chef and BCFG2. While simpler than Puppet or FAI, they support similar functions and have been used in some distributed and clustered environments. The use of FAI, Puppet, and others in cluster management warrants a brief look at clustering next, and what packages in Debian support clustering. Clusters A cluster is a group of systems that work together in such a way that the whole functions as a single unit. Such clusters can be loosely coupled or tightly coupled. A loosely coupled environment, each system is complete in itself, and can handle all of the tasks any of the other systems can handle. The environment provides mechanisms for redundancy, load sharing, and fail-over between systems, and is often called a High Availability (HA) cluster. In a tightly coupled environment, the systems involved are highly dependent on one another, often sharing memory and disk storage, and all work on the same task together. The environment provides mechanisms for data sharing, avoiding storage conflicts, keeping the systems in synchronization, and splitting up tasks appropriately. This design is often used in super-computing environments. Clustering is an advanced technique that involves more than just installing and configuring software. It also involves hardware integration, and systems and network design, and implementation. Along with the URLs mentioned below, a good text on the subject is Building Clustered Linux Systems, by Robert W. Lucke, Prentice Hall. Here we will only touch the very basics, along with what tools Debian provides. Let's take a brief look at each environment, and some of the tools used to create them. High Availability clusters Two primary functions are required to implement a high availability cluster: A way to handle load balancing and individual host fail-over. A way to synchronize storage so that all servers provide the same view of the data they serve. Debian includes meta packages that bring together software from the Linux High Availability project, including cluster-agents and resource-agents, two of the higher-level meta packages. These packages install various agents that are useful in coordinating and managing load balancing and fail-over. In some cases, a master server is designated to distribute the processing load among other servers. Data synchronization is handled by using shared storage and any of the filesystems that provide for multiple accesses and shared files, such as NFS or AFS. High Availability clusters generally use standard software, along with software that is readily available to manage the dynamics of such environments. Beowulf clusters In addition to the considerations for High Availability clusters, more tightly coupled environments such as Beowulf clusters also require an infrastructure to manage and distribute computing tasks. There are several web pages devoted to creating a Beowulf cluster using Debian as well as packages that aid in creating such a cluster. One such page is https://wiki.debian.org/StartaBeowulf, a Debian Wiki page on Beowulf basics. The manual for FAI also has a section on creating a Beowulf cluster. Books are available as well. Debian provides several packages that are helpful in building such a cluster, such as the OpenMPI libraries for message passing, and various utilities that run commands on multiple systems, such as those in the kadif package. There are even projects that have released scripts and live CDs that allow you to set up a cluster quickly (one such project is the PelicanHPC project, developed for Debian Lenny, hosted at http://www.pelicanhpc.org/. This type of cluster is not something that you can set up and go. Beowulf and other tightly coupled clusters are intended for highly parallel computing, and the programs that do the actual computing must be designed specifically for such an environment. That said, some packages for specific parallel computations do exist in Debian, such as nwchem, which provides several applications for computational chemistry that take advantage of parallelism. Common tools Some common components of clusters have already been mentioned, such as the OpenMPI libraries. Aside from the meta-packages already mentioned, the redhat-cluster suite of tools is available in Debian, as well as many useful libraries, scheduling tools, and failover tools such as booth. All of these can be found using apt-cache or Synaptic by searching for "cluster". Webmin Many administrators will never have to administer a cluster, and many won't be responsible for a large number of systems requiring central backup solutions. However, even administering a single system using command line tools and text editors can be a chore. Even clusters sometimes require administrative tasks on individual systems. Fortunately, there is an application that can ease many administrative tasks, is easy to use, and can handle many aspects of Linux administration. It is called Webmin. Up until Debian Sarge, Webmin was a part of Debian distributions. However, the Debian developer in charge of packaging it had difficulty keeping up with the frequent releases, and it was eventually dropped from Debian. However, the upstream Webmin developers maintain current packages that install cleanly. Some users have reported issues because Webmin does not always handle configuration files exactly as Debian intends, but it most certainly attempts to handle them in a compatible manner, and while some users have experienced problems with upgrades, many administrators are quite happy with Webmin. As long as you are willing to deal with conflicts during upgrades, or restrict use of modules that have major configuration impacts, you will find Webmin quite useful. Installing Webmin Webmin may be installed by adding the following lines to your apt sources file: deb http://download.webmin.com/download/repository sarge contrib deb http://webmin.mirror.somersettechsolutions.co.uk/repository sarge contrib Usually, this is added to a separate webmin.list file in /etc/apt/sources.list.d. The use of 'sarge' for the release name in the configuration is not a mistake. Since Webmin was dropped after the Sarge release (Debian 3.1), the developers update the repository as it is and haven't bothered changing it to keep up with the Debian code names. However, the versions available in the repository are compatible with any Debian release since 3.1. After updating your cache file, Webmin can be installed and maintained using apt-get, aptitude, or Synaptic. Also, if you request a Webmin upgrade from within Webmin itself on a Debian system, it will use the proper Debian package to upgrade. Using Webmin Webmin runs in the background, and provides an HTTP or HTTPS server on localhost port 10,000. You can use any web browser to connect to http://localhost:10000/ to access Webmin. Upon first installation, only the root user or those in a group allowed to use sudo to access the root account, may log in but Webmin users can be managed separately or in conjunction with local users. Webmin provides extensive and easy to understand menus and icons for various configuration tasks. Webmin is also highly modular and extensible, and an extensive list of standard modules is included with the base package. It is not possible to cover Webmin as fully here as it deserves, but a short list of some of its capabilities includes: Configuration of Webmin itself (the server, users, modules, and security) Local system user and password management Filesystem management Bootup and service management CRON job management Software updates Basic filesystem backups Authentication and security configuration APACHE, DNS, SSH, and FTP (if you're using ProFTP) configuration User mail management Qmail or sendmail configuration Network and Firewall configuration and management Bandwidth monitoring Printer management There are even modules that apply to clusters. Also, Webmin can search and allow access to other Webmin servers on the local network or you can define remote servers manually. This allows a central Webmin server, installed on a particular system, to be the gateway to all of the other servers in your environment, essentially providing a single point of access to manage all Webmin enabled servers. Webmin and Debian Webmin understands the configuration file layout of many distributions. The main problem is when a particular module does not handle certain types of configuration in the way the Debian developers prefer, which can make package upgrades somewhat difficult. This can be handled in a couple of ways. Most modules provide a means to edit configuration files directly, so if you have read the Debian documentation you can modify the configuration appropriately to use Debian specific configuration techniques. Or, you may choose to allow Webmin to modify files as it sees fit, and handle any conflicts manually when you upgrade the software involved. Finally, you can avoid those modules involved with specific software that are more likely to cause problems. One such module is Apache, which doesn't use links from sites-enabled to sites-available. Rather, it configures directly in the sites-enabled directory. Some administrators create the configuration in Webmin, and then move and link the files. Others prefer to manually configure Apache outside of Webmin. Webmin modules are constantly changing, and some actually recognize the Debian file layouts well, so it is not possible to give a comprehensive list of modules to avoid at this time. Best practice when using Webmin is to read the documentation and check the configuration files for specific software prior to using Webmin. Then, after configuring with Webmin, check the files again to determine whether changes may be required to work within the particular package's Debian configuration framework. Based upon this, you can decide whether to continue to configure using Webmin or switch back to manual configuration of that particular software. Webmin security Security is always a concern when remote access to a system is involved. Webmin handles this by requiring authentication and providing for detailed access restrictions that provide a layer of control beyond the firewall. Webmin users can be defined separately, or certain local users can be designated. Access to the various modules in Webmin can be restricted to certain users or groups of users, and detailed logs of Webmin actions are kept. Usermin In addition to Webmin, there is a server called Usermin which may be installed from the same repository as Webmin. It allows individual users to perform a number of functions more easily, such as changing their password, accessing their files, read and manage their email, and managing some aspects of their user profile. It is also modular and has the same security features as Webmin. Summary Several powerful and flexible central backup solutions exist that help manage backups for multiple remote servers and sites. Debian provides packages that assist in building High Availability and Beowulf style multiprocessing clusters as well. And, whether you are involved in managing clusters or not, or even a single system, Webmin can ease an administrator's tasks. Resources for Article: Further resources on this subject: Customizing a Linux kernel [Article] Microsoft SharePoint 2010 Administration: Farm Governance [Article] Testing Workflows for Microsoft Dynamics AX 2009 Administration [Article]

0
0
5659

Getting into the Store

Getting Started with Code::Blocks

Low-Level Index Control

Availability Management

Gesture

Authenticating Your Application with Devise

Social-Engineer Toolkit

The solvers – these great unknown

Managing Test Structure with Robot Framework

Scratching the Surface of Zend Framework 2

Trending Topics

Designing Sample Applications and Creating Reports

About Cassandra

Image classification and feature extraction from images

General Considerations

Advanced System Management

Create a Free Account To Continue Reading

Sign in to activate your 7-day free access