
How-To Tutorials

7019 Articles
Packt
17 Sep 2010
10 min read

Code Analysis and Debugging Tools in Microsoft Dynamics NAV 2009

The NAV tools and techniques that you use to determine what code to modify and to help you debug modifications are essentially the same. The goal in the first case is to focus your modifications so that you have the minimum effect on the standard code. This results in multiple benefits: smaller pieces of well-focused code are easier to debug, easier to document, easier to maintain, and easier to upgrade.

Because of NAV's relatively tight structure and unique combination of features, it is not unusual to spend significantly more time determining the right way to make a modification than it actually takes to code the modification. Obviously this depends on the type of modification being made. Unfortunately, the lack of documentation regarding the internals of NAV also contributes to the extended analysis time required to design modifications. The following sections review some of the tools and techniques you can use to analyze and test.

Developer's Toolkit

To paraphrase the introduction in the NAV Developer's Toolkit documentation, the Toolkit is designed to help you analyze the source code. This makes it easier to design and develop application customizations and to perform updates. The Developer's Toolkit is not part of the standard product distribution, but is available to all Microsoft Partners for NAV for download from the Partner website. While it takes a few minutes to set up the Developer's Toolkit for the database on which you will be working, the investment is worthwhile. Follow the instructions in the Developer's Toolkit manual for creating and loading your Toolkit database. The Help files in the Developer's Toolkit are also useful.

As of late 2009, the current NAV Developer's Toolkit is V3.00. V3.00 does not deal with the new features of NAV 2009 associated with the Role Tailored Client or three-tier functionality.
The NAV Developer's Toolkit has two major categories of tools: the Compare and Merge Tools, and the Source Analyzer.

The Compare and Merge Tools are useful anytime you want to compare a production database's objects to an unmodified set of objects to identify what has been changed. This might be in the process of upgrading the database to a new version or simply to better understand the contents of a database when you are about to embark on a new modification adventure.

The Source Analyzer tools are the more general-purpose set of tools. Once you have loaded the source information for all your objects into the Developer's Toolkit database, you will be able to quickly generate a variety of useful code analyses. The starting point for your code analyses will be the Object Administrator view as shown in the following screenshot:

When you get to this point, it's worthwhile experimenting with various menu options for each of the object types to get comfortable with the environment and how the tools work. Not only are there several tool options, but there are also multiple viewing options. Some will be more useful than others depending on the specifics of the modification task you are addressing as well as your working habits.

Relations to Tables

With rare exceptions, table relations are defined between tables. The Toolkit allows you to select an object and request analysis of the defined relations between elements in that object and various tables. As a test of how the Relations to Tables analysis works, we will expand our Table entry in the Object Administrator to show all the tables.
Then we will choose the Location table, right-click, and choose the option to view its Relations to other Tables, with the result shown in the following screenshot:

If we want to see more detail, we can right-click on the Location table name in the right window, choose the Expand All option, and see the results as shown in the following screenshot:

This shows us the Relations to Tables, with the relating (from) field and the related (to) field both showing in each line.

Relations from Objects

If you are checking to see what objects have a relationship pointing back to a particular table (the inverse of what we just looked at), you can find that out in essentially the same fashion. Right-click on the table of interest and choose the Relations from Objects option. If you want to see both sets of relationships in the same display, you can right-click on the table name in the right window and choose the Relations to Tables option. At that point your display will show both sets of relationships, as shown in the following screenshot for the table Reservation Entry:

Source Access

On any of these screens you can select one of the relationships and drill down further into the detail of the underlying C/AL code. There is a search tool, the Source Finder. When you highlight one of the identified relationships and access the Code Viewer, the Toolkit will show you the object code where the relationship is defined.

Where Used

The Developer's Toolkit contains other tools that are also quite valuable to you as a developer. The idea of Where Used is fairly simple: list all the places where an element is used within the total library of source information. There are two different types of Where Used.

The Toolkit's first type of Where Used is powerful because it can search for uses of whole tables, key sequences, or individual fields. Many developers also use other tools (primarily developer's text editors) to accomplish some of this.
However, the Developer's Toolkit is specifically designed for use with C/AL and C/SIDE.

The second type of Where Used is Where Used With. This version of the Toolkit Where Used tool allows you to focus the search. Selecting the Where Used With Options brings up the screen in the following screenshot. As you can see, the degree of control you have over the search is extensive. Screenshots of the other three tabs of the Where Used With Options form follow:

Trying it out

To really appreciate the capabilities and flexibility of the Developer's Toolkit, you must work with it to address a real-life task. For example, what if your firm was in a market where the merger of firms was a frequent occurrence? In order to manage this, the manager of accounting might decide that the system needs to be able to merge the data for two customers, including accounting and sales history, under a single customer number.

To do that, you must first find all the instances of the Customer No. referenced in keys of other tables. The tool to do this in the Developer's Toolkit is the Source Finder.

Calling up the Source Finder, first use Reset to clear all fields. Then enter the name of the field you are looking for, in this case Customer No., as shown in the following screenshot:

Now specify that you are only looking for information contained in Tables, as shown in the following screenshot:

Next, specify that the search should only be in Keys, as shown in the following screenshot:

Your initial results will look like those in the following screenshot:

This data can be further constrained through the use of Filters (for example, to find only Key 1 entries) and can be sorted by clicking on a column head.

Of course, as mentioned earlier, it will help you to experiment along the way. Don't make the mistake of thinking the Developer's Toolkit is the only tool you need to use. At the same time, don't make the mistake of ignoring this tool just because it won't do everything.
Working in exported text code

As mentioned a little earlier, some developers export objects into text files, then use a text editor to manipulate them. Let us take a look at an object that has been exported into text and opened in a text editor. We will use one of the tables that are part of our ICAN development, the Donor Type table (50001), as shown in the following screenshot:

The general structure of all exported objects is similar, with differences that you would expect for the different objects. For example, Table objects have no Sections, but Report objects do. You can also see here that this particular table contains no C/AL-coded logic, as those statements would be quoted in the text listing.

You can see by looking at this table object text screenshot that you could easily search for instances of the string Code throughout the text export of the entire system, but it would be more difficult to look for references to the Donor Type form/page, Form50002. And, while you can find the instances of Code with your text editor, it would be quite difficult to differentiate those instances that relate to the Donor Type table from those in any other table. This includes those that have nothing to do with our ICAN system enhancement, as well as those simply defined in an object as Global Variables. However, the Developer's Toolkit can make that differentiation.

If you were determined to use a text editor to find all instances of "Donor Type".Code, you could do the following. Rename the field in question to something unique; C/SIDE will rename all the references to this field. Then export all the sources to text and use your text editor (or even Microsoft Word) to find the unique name. You must either remember to return the field in the database to its original name, or you must be working in a temporary "work copy" of the database, which you will shortly discard. Otherwise, you will have quite a mess.
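The rename-and-search approach above can be sketched in a few lines of Java instead of a text editor. This is only an illustration under stated assumptions: it assumes each exported object starts with a header line of the form OBJECT <type> <number> <name>, the marker ZZ_DonorTypeCode is a hypothetical unique name given to the field before exporting, and the sample export lines are simplified stand-ins rather than a real NAV export.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExportSearch {
    // Assumed shape of an object header line, e.g. "OBJECT Table 50001 Donor Type".
    private static final Pattern HEADER =
        Pattern.compile("^OBJECT (\\w+) (\\d+) (.*)$");

    /** Returns one "Type Number Name: line N" entry per line that contains the marker. */
    public static List<String> findMarker(List<String> lines, String marker) {
        List<String> hits = new ArrayList<>();
        String current = "(no object)";
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            Matcher m = HEADER.matcher(line);
            if (m.matches()) {
                // Remember which object the following lines belong to.
                current = m.group(1) + " " + m.group(2) + " " + m.group(3).trim();
            } else if (line.contains(marker)) {
                hits.add(current + ": line " + (i + 1));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Simplified, made-up export content for illustration only.
        List<String> export = List.of(
            "OBJECT Table 50001 Donor Type",
            "{",
            "  { 1 ; ; ZZ_DonorTypeCode ; Code10 }",
            "}",
            "OBJECT Form 50002 Donor Types",
            "{",
            "  SourceExpr=ZZ_DonorTypeCode }",
            "}");
        findMarker(export, "ZZ_DonorTypeCode").forEach(System.out::println);
    }
}
```

Because every hit is reported together with the object it was found in, the result is closer to what the Toolkit's Where Used gives you than a plain text search would be, although it still depends entirely on the temporary rename.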
One task that needs to be done occasionally is to renumber an object or to change a reference inside an object that refers to a no longer existing element. The C/SIDE editor may not let you do that easily, or in some cases, not at all. In such a case, the best answer is to export the object into text, make the change there, and then import it back in as modified. Be careful though. When you import a text object, C/SIDE does not check to see if you are overwriting another instance of that object number. C/SIDE makes that check when you import a fob (that is, a compiled object) and warns you. If you must do renumbering, you should check the NAV forums on the Internet for the renumbering tools that are available there.

Theoretically, you could write all of your C/AL code with a text editor and then import the result. Given the difficulty of such a task and the usefulness of the tools embedded in C/SIDE, such an approach would be foolish. However, there are occasions when it is very helpful to simply view an object "flattened out" in text format. In a report where you may have overlapping logic in multiple data items and in several control triggers as well, the only way to see all the logic at once is in text format. You can use any text editor you like, Notepad or Word or one of the visual programming editors; the exported object is just text.

You need to cope with the fact that when you export a large number of objects in one pass, they all end up in the same text file. That makes the exported file relatively difficult to use. The solution is to split that file into individual text files, named logically, one for each NAV object. There are several freeware tools to do just that, available from the NAV forums on the Internet. Two excellent NAV forums are www.mibuso.com and www.dynamicsuser.net.
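To show what such a splitting tool does, here is a minimal Java sketch. It is not one of the forum tools mentioned above. It assumes that every object in the combined export begins with a header line of the form OBJECT <type> <number> <name>, and the file naming scheme (Type_Number_Name.txt) is a hypothetical choice made for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExportSplitter {
    // Assumed shape of an object header line, e.g. "OBJECT Table 50001 Donor Type".
    private static final Pattern HEADER =
        Pattern.compile("^OBJECT (\\w+) (\\d+) (.*)$");

    /** Groups the lines of a combined export by object, keyed by a suggested file name. */
    public static Map<String, List<String>> split(List<String> lines) {
        Map<String, List<String>> files = new LinkedHashMap<>();
        List<String> current = null;
        for (String line : lines) {
            Matcher m = HEADER.matcher(line);
            if (m.matches()) {
                // A new object starts here: derive a file name and open a new bucket.
                String safeName = m.group(3).trim().replaceAll("[^A-Za-z0-9]+", "_");
                String fileName = m.group(1) + "_" + m.group(2) + "_" + safeName + ".txt";
                current = new ArrayList<>();
                files.put(fileName, current);
            }
            if (current != null) {
                current.add(line); // lines before the first header are ignored
            }
        }
        return files;
    }

    public static void main(String[] args) {
        // Simplified, made-up export content for illustration only.
        List<String> export = List.of(
            "OBJECT Table 50001 Donor Type", "{", "}",
            "OBJECT Form 50002 Donor Types", "{", "}");
        split(export).forEach((name, content) ->
            System.out.println(name + " (" + content.size() + " lines)"));
    }
}
```

Each map entry could then be written to disk with java.nio.file.Files.write, giving one logically named text file per NAV object.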

Packt
03 Sep 2013
19 min read

Oracle ADF Essentials – Adding Business Logic

Adding logic to business components

By default, a business component does not have an explicit Java class. When you want to add Java logic, however, you generate the relevant Java class from the Java tab of the business component. On the Java tab, you also decide which of your methods are to be made available to other objects by choosing to implement a Client Interface. Methods that implement a client interface show up in the Data Control palette and can be called from outside the object.

Logic in entity objects

Remember that entity objects are closest to your database tables; most often, you will have one entity object for every table in the database. This makes the entity object a good place to put data logic that must always be executed. If you place, for example, validation logic in an entity object, it will be applied no matter which view object attempts to change data.

In the database or in an entity object?

Much of the business logic you can place in an entity object can also be placed in the database using database triggers. If other systems are accessing your database tables, business logic should go into the database as much as possible.

Overriding accessors

To use Java in entity objects, you open an entity object and select the Java tab. When you click on the pencil icon, the Select Java Options dialog opens as shown in the following screenshot:

In this dialog, you can select to generate Accessors (the setXxx() and getXxx() methods for all the attributes) as well as Data Manipulation Methods (the doDML() method; there is more on this later). When you click on OK, the entity object class is generated for you. You can open it by clicking on the hyperlink, or you can find it in the Application Navigator panel as a new node under the entity object.

If you look inside this file, you will find the following. Your class should start with an import section that contains a statement that imports your EntityImpl class.
If you have set up your framework extension classes correctly, this could be import com.adfessentials.adf.framework.EntityImpl. You will have to click on the plus sign in the left margin to expand the import section. The Structure panel in the bottom-left shows an overview of the class, including all the methods it contains. You will see a lot of setter and getter methods like getFirstName() and setFirstName(), as shown in the following screenshot:

There is also a doDML() method, described later.

If you were to decide, for example, that last name should always be stored in upper case, you could change the setLastName() method to:

    public void setLastName(String value) {
        // Guard against null so that clearing the attribute does not throw
        setAttributeInternal(LASTNAME, value == null ? null : value.toUpperCase());
    }

Working with database triggers

If you decide to keep some of your business logic in database triggers, your triggers might change the values that get passed from the entity object. Because the entity object caches values to save database work, you need to make sure that the entity object stays in sync with the database even if a trigger changes a value. You do this by using the Refresh on Update property.

To find this property, select the Attributes subtab on the left and then select the attribute that might get changed. At the bottom of the screen, you see various settings for the attribute, with the Refresh settings in the top-right of the Details tab as shown in the following screenshot:

Check the Refresh on Update property checkbox if a database trigger might change the attribute value. This makes the ADF framework requery the database after an update has been issued.

The related Refresh on Insert property doesn't work if you are using MySQL and your primary key is generated with AUTO_INCREMENT or set by a trigger. ADF doesn't know the primary key and therefore cannot find the newly inserted row after inserting it.
It does work if you are running against an Oracle database, because Oracle SQL syntax has a special RETURNING construct that allows the entity object to get the newly created primary key back.

Overriding doDML()

After the setters and getters, the doDML() method is the one that most often gets overridden. This method is called whenever an entity object wants to execute a Data Manipulation Language (DML) statement like INSERT, UPDATE, or DELETE. This offers you a way to add additional processing; for example, checking that the account balance is zero before allowing a customer to be deleted. In this case, you would add logic to check the account balance, and if the deletion is allowed, call super.doDML() to invoke normal processing.

Another example would be to implement logical delete (records only change state and are not actually deleted from the table). In this case, you would override doDML() as follows:

    @Override
    protected void doDML(int operation, TransactionEvent e) {
        // Turn a physical DELETE into an UPDATE of the same row
        if (operation == DML_DELETE) {
            operation = DML_UPDATE;
        }
        super.doDML(operation, e);
    }

As is probably obvious from the code, this simply replaces a DELETE operation with an UPDATE before it calls the doDML() method of its superclass (your framework extension EntityImpl, which passes the task on to the Oracle-supplied EntityImpl class). Of course, you also need to change the state of the entity object row, for example, in the remove() method. You can find fully functional examples of this approach on various blogs, for example at http://myadfnotebook.blogspot.dk/2012/02/updating-flag-when-deleting-entity-in.html.

You also have the option of completely replacing normal doDML() method processing by simply not calling super.doDML(). This could be the case if you want all your data modifications to go via a database procedure; for example, to insert an actor, you would have to call insertActor with first name and last name.
In this case, you would write something like:

    @Override
    protected void doDML(int operation, TransactionEvent e) {
        CallableStatement cstmt = null;
        if (operation == DML_INSERT) {
            String insStmt = "{call insertActor (?,?)}";
            cstmt = getDBTransaction().createCallableStatement(insStmt, 0);
            try {
                // Bind the two procedure parameters from the attribute getters
                cstmt.setString(1, getFirstName());
                cstmt.setString(2, getLastName());
                cstmt.execute();
            } catch (Exception ex) {
                …
            } finally {
                …
            }
        }
    }

If the operation is an insert, the preceding code uses the current transaction (via the getDBTransaction() method) to create a CallableStatement that calls insertActor(?,?). Next, it binds the two parameters (indicated by the question marks in the statement string) to the values for first name and last name (by calling the getter methods for these two attributes). Finally, the code block finishes with a normal catch clause to handle SQL errors and a finally clause to close open objects. Again, fully working examples are available in the documentation and on the Internet in various blog posts.

Normally, you would implement this kind of override in the framework extension EntityImpl class, with additional logic to allow the framework extension class to recognize which specific entity object the operation applies to and which database procedure to call.

Data validation

With the techniques you have just seen, you can implement every kind of business logic your requirements call for. One requirement, however, is so common that it has been built right into the ADF framework: data validation.

Declarative validation

The simplest kind of validation is where you compare one individual attribute to a limit, a range, or a number of fixed values. For this kind of validation, no code is necessary at all. You simply select the Business Rules subtab in the entity object, select an attribute, and click on the green plus sign to add a validation rule.
The Add Validation Rule dialog appears as shown in the following screenshot:

You have a number of options for Rule Type; depending on your choice here, the Rule Definition tab changes to allow you to define the parameters for the rule.

On the Failure Handling tab, you can define whether the validation is an error (that must be corrected) or a warning (that the user can override), and you define a message text as shown in the following screenshot:

You can even define variable message tokens by using curly brackets { } in your message text. If you do so, a token will automatically be added to the Token Message Expressions section of the dialog, where you can assign it any value using Expression Language. Click on the Help button in the dialog for more information on this.

If your application might ever conceivably be needed in a different language, use the magnifying glass icon to define a resource string stored in a separate resource bundle. This allows your application to have multiple resource bundles, one for each different user interface language.

There is also a Validation Execution tab that allows you to specify under which condition your rule should be applied. This can be useful if your logic is complex and resource intensive. If you do not enter anything here, your rule is always executed.

Regular expression validation

One of the especially powerful declarative validations is the Regular Expression validation. A regular expression is a very compact notation that can define the format of a string; this is very useful for checking e-mail addresses, phone numbers, and so on. To use this, set Rule Type to Regular Expression as shown in the following screenshot:

JDeveloper offers you a few predefined regular expressions, for example, the validation for e-mails as shown in the preceding screenshot.
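To get a feel for what such a format check does, the following plain Java sketch applies a regular expression to a few strings using java.util.regex. The pattern is a deliberately simple, hypothetical one written for this illustration; it is not the expression that JDeveloper predefines, and it is far from a complete e-mail address check.

```java
import java.util.regex.Pattern;

public class EmailFormatCheck {
    // Hypothetical, simplified pattern: local part, "@", domain, dot, 2+ letters.
    private static final Pattern SIMPLE_EMAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    /** Returns true only if the whole string matches the simplified pattern. */
    public static boolean looksLikeEmail(String value) {
        return value != null && SIMPLE_EMAIL.matcher(value).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeEmail("jane.doe@example.com")); // true
        System.out.println(looksLikeEmail("jane.doe@example"));     // false
        System.out.println(looksLikeEmail("not an address"));       // false
    }
}
```

In the declarative rule you would only enter the expression itself; the Java wrapper here just makes it easy to try an expression against sample values before using it.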
Even though you can find lots of predefined regular expressions on the Internet, someone from your team should understand the basics of regular expression syntax so you can create the exact expression you need.

Groovy scripts

You can also set Rule Type to Script to get a free-format box where you can write a Groovy expression. Groovy is a scripting language for the Java platform that works well together with Java; see http://groovy.codehaus.org/ for more information on Groovy. Oracle has published a white paper on Groovy in ADF (http://www.oracle.com/technetwork/developer-tools/jdev/introduction-to-groovy-128837.pdf), and there is also information on Groovy in the JDeveloper help.

Method validation

If none of these methods for data validation fit your need, you can of course always resort to writing code. To do this, set Rule Type to Method and provide an error message. If you leave the Create and Select Method checkbox checked when you click on OK, JDeveloper will automatically create a method with the right signature and add it to the Java class for the entity object. The autogenerated validation method for Length (in the Film entity object) would look as follows:

    /**
     * Validation method for Length.
     */
    public boolean validateLength(Integer length) {
        return true;
    }

It is your task to fill in the logic and return either true (if validation is OK) or false (if the data value does not meet the requirements). If validation fails, ADF will automatically display the message you defined for this validation rule.

Logic in view objects

View objects represent the dataset you need for a specific part of the application, typically a specific screen or part of a screen. You can create Java objects for either an entire view object (an XxxImpl.java class, where Xxx is the name of your view object) or for a specific row (an XxxRowImpl.java class).
A view object class contains methods to work with the entire dataset that the view object represents; for example, methods to apply view criteria or re-execute the underlying database query. The view row class contains methods to work with an individual record of data, mainly methods to set and get attribute values for one specific record.

Overriding accessors

As with entity objects, you can override the accessors (setters and getters) for view objects. To do this, you use the Java subtab in the view object and click on the pencil icon next to Java Classes to generate Java. You can select to generate a view row class including accessors to ask JDeveloper to create a view row implementation class as shown in the following screenshot:

This will create an XxxRowImpl class (for example, RentalVORowImpl) with setter and getter methods for all attributes. The code will look something like the following code snippet:

    …
    public class RentalVORowImpl extends ViewRowImpl {
        …
        /**
         * This is the default constructor (do not remove).
         */
        public RentalVORowImpl() {
        }
        …
        /**
         * Gets the attribute value for title using the alias name
         * Title.
         * @return the title
         */
        public String getTitle() {
            return (String) getAttributeInternal(TITLE);
        }

        /**
         * Sets <code>value</code> as attribute value for title using
         * the alias name Title.
         * @param value value to set the title
         */
        public void setTitle(String value) {
            setAttributeInternal(TITLE, value);
        }
        …
    }

You can change all of these to manipulate data before it is delivered to the entity object or to return a processed version of an attribute value. For transient attributes, you can write code in the implementation class to determine which value to return.

You can also use Groovy expressions to determine values for transient attributes. This is done on the Value subtab for the attribute by setting Value Type to Expression and filling in the Value field with a Groovy expression.
See the Oracle white paper on Groovy in ADF (http://www.oracle.com/technetwork/developer-tools/jdev/introduction-to-groovy-128837.pdf) or the JDeveloper help for details.

Change view criteria

Another example of coding in a view object is to dynamically change which view criteria are applied to the view object. It is possible to define many view criteria on a view object; when you add a view object instance to an application module, you decide which of the available view criteria to apply to that specific view object instance.

However, you can also programmatically change which view criteria are applied to a view object. This can be useful if you want to have buttons to control which subset of data to display; in the example application, you could imagine a button to "show only overdue rentals" that would apply an extra view criterion to a rental view object.

Because the view criteria apply to the whole dataset, view criteria methods go into the view object, not the view row object. You generate a Java class for the view object from the Java Options dialog in the same way as you generate Java for the view row object. In the Java Options dialog, select the option to generate the view object class as shown in the following screenshot:

A simple example of programmatically applying a view criterion would be a method to apply an already defined view criterion called OverdueCriterion to a view object. This would look like this in the view object class:

    public void showOnlyOverdue() {
        ViewCriteria vc = getViewCriteria("OverdueCriterion");
        applyViewCriteria(vc);
        executeQuery();
    }

View criteria often have bind variables; for example, you could have a view criterion called OverdueByDaysCriterion that uses a bind variable OverdueDayLimit. When you generate Java for the view object, the default option of Include bind variable accessors (shown in the preceding screenshot) will create a setOverdueDayLimit() method if you have an OverdueDayLimit bind variable.
A method in the view object to which we apply this criterion might look like the following code snippet:

    public void showOnlyOverdueByDays(int days) {
        ViewCriteria vc = getViewCriteria("OverdueByDaysCriterion");
        setOverdueDayLimit(days);
        applyViewCriteria(vc);
        executeQuery();
    }

If you want to call these methods from the user interface, you must create a client interface for them (on the Java subtab in the view object). This will make your method available in the Data Control palette, ready to be dragged onto a page and dropped as a button.

When you change the view criteria and execute the query, only the content of the view object changes; the screen does not automatically repaint itself. In order to ensure that the screen refreshes, you need to set the PartialTriggers property of the data table to point to the ID of the button that changes the view criteria. For more on partial page rendering, see the Oracle Fusion Middleware Web User Interface Developer's Guide for Oracle Application Development Framework (http://docs.oracle.com/cd/E37975_01/web.111240/e16181/af_ppr.htm).

Logic in application modules

You've now seen how to add logic to both entity objects and view objects. However, you can also add custom logic to application modules. An application module is the place where logic that does not belong to a specific view object goes; for example, calls to stored procedures that involve data from multiple view objects.

To generate a Java class for an application module, you navigate to the Java subtab in the application module and select the pencil icon next to the Java Classes heading. Typically, you create Java only for the application module class and not for the application module definition.

You can also add your own logic here that gets called from the user interface, or you can override the existing methods in the application module.
A typical method to override is prepareSession(), which gets called before the application module establishes a connection to the database; if you need to, for example, call stored procedures or do other kinds of initialization before accessing the database, an application module method is a good place to do so. Remember that you need to define your own methods as client methods on the Java tab of the application module for the method to be available to be called from elsewhere in the application.

Because the application module handles the transaction, it also contains methods such as beforeCommit(), beforeRollback(), afterCommit(), afterRollback(), and so on. The doDML() method on any entity object that is part of the transaction is executed before any of the application module's methods.

Adding logic to the user interface

Logic in the user interface is implemented in the form of managed beans. These are Java classes that are registered with the task flow and automatically instantiated by the ADF framework. ADF operates with various memory scopes; you have to decide on a scope when you define a managed bean.

Adding a bean method to a button

The simplest way to add logic to the user interface is to drop a button (af:commandButton) onto a page or page fragment and then double-click on it. This brings up the Bind Action Property dialog as shown in the following screenshot:

If you leave Method Binding selected and click on New, the Create Managed Bean dialog appears as shown in the following screenshot:

In this dialog, you can give your bean a name, provide a class name (typically the same as the bean name), and select a scope. The backingBean scope is a good scope for logic that is only used for one action when the user clicks on the button and which does not need to store any state for later. Leaving the Generate Class If It Does Not Exist checkbox checked asks JDeveloper to create the class for you.
When you click on OK, JDeveloper will automatically suggest a method for you in the Method dropdown (based on the ID of the button you double-clicked on). In the Method field, provide a more useful name and click on OK to add the new class and open it in the editor. You will see a method with your chosen name, as shown in the following code snippet:

    public String rentDvd() {
        // Add event code here...
        return null;
    }

Obviously, you place your code inside this method. If you accidentally left the default method name and ended up with something like cb5_action(), you can right-click on the method name and navigate to Refactor | Rename to give it a more descriptive name.

Note that JDeveloper automatically sets the Action property for your button to match the scope, bean name, and method name. This might be something like #{backingBeanScope.RentalBean.rentDvd}.

Adding a bean to a task flow

Your beans should always be part of a task flow. If you're not adding logic to a button, or you just want more control over the process, you can also create a backing bean class first and then add it to the task flow.

A bean class is a regular Java class created by navigating to File | New | Java Class. When you have created the class, you open the task flow where you want to use it and select the Overview tab. On the Managed Beans subtab, you can use the green plus to add your bean. Simply give it a name, point to the class you created, and select a memory scope.

Accessing UI components from beans

In a managed bean, you often want to refer to various user interface elements. This is done by mapping each element to a property in the bean. For example, if you have an af:inputText component that you want to refer to in a bean, you create a private variable of type RichInputText in the bean (with setter and getter methods) and set the Binding property (under the Advanced heading) to point to that bean variable using Expression Language.
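The shape of such a bean property can be sketched as follows. This is an illustration only: the nested RichInputText class is a minimal stand-in declared here so that the example compiles without the ADF libraries (in a real project you would import the ADF Faces class oracle.adf.view.rich.component.rich.input.RichInputText instead), and the property name titleInput is a hypothetical choice.

```java
public class RentalBean {
    /** Stand-in for the ADF Faces component class; only here so the sketch compiles alone. */
    public static class RichInputText {
        private Object value;
        public Object getValue() { return value; }
        public void setValue(Object value) { this.value = value; }
    }

    // The bean property that the component's Binding property would point to.
    private RichInputText titleInput;

    public void setTitleInput(RichInputText titleInput) {
        this.titleInput = titleInput;
    }

    public RichInputText getTitleInput() {
        return titleInput;
    }

    /** Action method: reads the current value of the bound component. */
    public String rentDvd() {
        Object title = (titleInput == null) ? null : titleInput.getValue();
        System.out.println("Renting: " + title);
        return null; // no navigation outcome
    }

    public static void main(String[] args) {
        RentalBean bean = new RentalBean();
        RichInputText input = new RichInputText();
        input.setValue("Casablanca");
        bean.setTitleInput(input); // the framework would normally do this via the binding
        bean.rentDvd();
    }
}
```

With a bean registered as RentalBean in backingBean scope, the Binding property of the component would then hold an expression along the lines of #{backingBeanScope.RentalBean.titleInput}.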
When creating a page or page fragment, you have the option (on the Managed Bean tab) to automatically have JDeveloper create corresponding attributes for you. The Managed Bean tab is shown in the following screenshot: Leave it on the default setting of Do Not Automatically Expose UI Components in a Managed Bean . If you select one of the options to automatically expose UI elements, your bean will acquire a lot of attributes that you don't need, which will make your code unnecessarily complex and slow. However, while learning ADF, you might want to try this out to see how the bean attributes and the Binding property work together. If you do activate this setting, it applies to every page and fragment you create until you explicitly deselect this option. Summary In this article, you have seen some examples of how to add Java code to your application to implement the specific business logic your application needs. There are many, many more places and ways to add logic –– as you work with ADF, you will continually come across new business requirements that force you to figure out how to add code to your application in new ways. Fortunately, there are other books, websites, online tutorials and training that you can use to add to your ADF skill set –– refer to http://www.adfessentials.com for a starting point. Resources for Article : Further resources on this subject: Oracle Tools and Products [Article] Managing Oracle Business Intelligence [Article] Oracle Integration and Consolidation Products [Article]

Packt
18 Apr 2013
28 min read

The NGINX HTTP Server

(For more resources related to this topic, see here.) NGINX's architecture NGINX consists of a single master process and multiple worker processes. Each of these is single-threaded and designed to handle thousands of connections simultaneously. The worker process is where most of the action takes place, as this is the component that handles client requests. NGINX makes use of the operating system's event mechanism to respond quickly to these requests. The NGINX master process is responsible for reading the configuration, handling sockets, spawning workers, opening log files, and compiling embedded Perl scripts. The master process is the one that responds to administrative requests via signals. The NGINX worker process runs in a tight event loop to handle incoming connections. Each NGINX module is built into the worker, so that any request processing, filtering, handling of proxy connections, and much more is done within the worker process. Due to this worker model, the operating system can handle each process separately and schedule the processes to run optimally on each processor core. If there are any processes that would block a worker, such as disk I/O, more workers than cores can be configured to handle the load. There are also a small number of helper processes that the NGINX master process spawns to handle dedicated tasks. Among these are the cache loader and cache manager processes. The cache loader is responsible for preparing the metadata for worker processes to use the cache. The cache manager process is responsible for checking cache items and expiring invalid ones. NGINX is built in a modular fashion. The master process provides the foundation upon which each module may perform its function. Each protocol and handler is implemented as its own module. The individual modules are chained together into a pipeline to handle connections and process requests. 
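The worker model described above is controlled from the main context of the configuration file. The fragment below is a minimal sketch rather than something taken from the article: the values 4 and 1024 are illustrative assumptions and should be adapted to the number of processor cores and the expected load.

```nginx
# main context: one worker per CPU core is a common starting point;
# configure more workers than cores if blocking disk I/O is expected
worker_processes 4;

events {
    # upper bound on simultaneous connections handled by each worker
    worker_connections 1024;
}
```

With these assumed values, the master process would spawn four single-threaded workers, each running its own event loop.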
After a request is handled, it is then passed on to a series of filters, in which the response is processed. One of these filters is responsible for processing subrequests, one of NGINX's most powerful features. Subrequests are how NGINX can return the results of a request that differs from the URI that the client sent. Depending on the configuration, they may be multiply nested and call other subrequests. Filters can collect the responses from multiple subrequests and combine them into one response to the client. The response is then finalized and sent to the client. Along the way, multiple modules come into play. See http://www.aosabook.org/en/nginx.html for a detailed explanation of NGINX internals. We will be exploring the http module and a few helper modules in the remainder of this article. The HTTP core module The http module is NGINX's central module, which handles all interactions with clients over HTTP. We will have a look at the directives in the rest of this section, again divided by type. The server The server directive starts a new context. We have already seen examples of its usage throughout the book so far. One aspect that has not yet been examined in-depth is the concept of a default server. A default server in NGINX means that it is the first server defined in a particular configuration with the same listen IP address and port as another server. A default server may also be denoted by the default_server parameter to the listen directive. The default server is useful to define a set of common directives that will then be reused for subsequent servers listening on the same IP address and port: server { listen 127.0.0.1:80; server_name default.example.com; server_name_in_redirect on; } server { listen 127.0.0.1:80; server_name www.example.com; } In this example, the www.example.com server will have the server_name_in_redirect directive set to on as well as the default.example.com server. 
Note that this would also work if both servers had no listen directive, since they would still both match the same IP address and port number (that of the default value for listen, which is *:80). Inheritance, though, is not guaranteed. There are only a few directives that are inherited, and which ones are inherited changes over time.

A better use for the default server is to handle any request that comes in on that IP address and port, and does not have a Host header. If you do not want the default server to handle requests without a Host header, it is possible to define an empty server_name directive. This server will then match those requests.

    server {
        server_name "";
    }

The following table summarizes the directives relating to server:

Table: HTTP server directives

Directive: Explanation
port_in_redirect: Determines whether or not the port will be specified in a redirect issued by NGINX.
server: Creates a new configuration context, defining a virtual host. The listen directive specifies the IP address(es) and port(s); the server_name directive lists the Host header values that this context matches.
server_name: Configures the names that a virtual host may respond to.
server_name_in_redirect: Activates using the first value of the server_name directive in any redirect issued by NGINX within this context.
server_tokens: Disables sending the NGINX version string in error messages and the Server response header (default value is on).

Logging

NGINX has a very flexible logging model. Each level of configuration may have an access log. In addition, more than one access log may be specified per level, each with a different log_format. The log_format directive allows you to specify exactly what will be logged, and needs to be defined within the http section. The path to the log file itself may contain variables, so that you can build a dynamic configuration.
The following example describes how this can be put into practice:

    http {
        log_format vhost '$host $remote_addr - $remote_user [$time_local] '
                         '"$request" $status $body_bytes_sent '
                         '"$http_referer" "$http_user_agent"';
        log_format downloads '$time_iso8601 $host $remote_addr '
                             '"$request" $status $body_bytes_sent $request_time';
        open_log_file_cache max=1000 inactive=60s;
        access_log logs/access.log;
        server {
            server_name ~^(www\.)?(.+)$;
            access_log logs/combined.log vhost;
            access_log logs/$2/access.log;
            location /downloads {
                access_log logs/downloads.log downloads;
            }
        }
    }

The following table describes the directives used in the preceding code:

Table: HTTP logging directives

Directive: Explanation
access_log: Describes where and how access logs are to be written. The first parameter is a path to the file where the logs are to be stored. Variables may be used in constructing the path. The special value off disables the access log. An optional second parameter indicates the log_format that will be used to write the logs. If no second parameter is configured, the predefined combined format is used. An optional third parameter indicates the size of the buffer if write buffering should be used to record the logs. If write buffering is used, this size cannot exceed the size of the atomic disk write for that filesystem. If this third parameter is gzip, then the buffered logs will be compressed on-the-fly, provided that the nginx binary was built with the zlib library. A final flush parameter indicates the maximum length of time buffered log data may remain in memory before being flushed to disk.
log_format: Specifies which fields should appear in the log file and what format they should take. See the next table for a description of the log-specific variables.
log_not_found: Disables reporting of 404 errors in the error log (default value is on).
log_subrequest: Enables logging of subrequests in the access log (default value is off).
open_log_file_cache: Stores a cache of open file descriptors used in access_logs with a variable in the path. The parameters used are:
    max: The maximum number of file descriptors present in the cache
    inactive: NGINX will wait this amount of time for something to be written to this log before its file descriptor is closed
    min_uses: The file descriptor has to be used this number of times within the inactive period in order to remain open
    valid: NGINX will check this often to see if the file descriptor still matches a file with the same name
    off: Disables the cache

In the following example, log entries will be compressed at a gzip level of 4. The buffer size is the default of 64 KB and will be flushed to disk at least every minute.

    access_log /var/log/nginx/access.log.gz combined gzip=4 flush=1m;

Note that when specifying gzip, the log_format parameter is not optional. The default combined log_format is constructed like this:

    log_format combined '$remote_addr - $remote_user [$time_local] '
                        '"$request" $status $body_bytes_sent '
                        '"$http_referer" "$http_user_agent"';

As you can see, line breaks may be used to improve readability. They do not affect the log_format itself. Any variables may be used in the log_format directive. The variables in the following table which are marked with an asterisk (*) are specific to logging and may only be used in the log_format directive. The others may be used elsewhere in the configuration, as well.

Table: Log format variables

Variable Name: Value
$body_bytes_sent: The number of bytes sent to the client, excluding the response header.
$bytes_sent: The number of bytes sent to the client.
$connection: A serial number, used to identify unique connections.
$connection_requests: The number of requests made through a particular connection.
$msec: The time in seconds, with millisecond resolution.
$pipe *: Indicates if the request was pipelined (p) or not (.).
$request_length * The length of the request, including the HTTP method, URI, HTTP protocol, header, and request body. $request_time The request processing time, with millisecond resolution, from the first byte received from the client to the last byte sent to the client. $status The response status. $time_iso8601 * Local time in ISO8601 format. $time_local * Local time in common log format (%d/%b/%Y:%H:%M:%S %z). In this section, we have focused solely on access_log and how that can be configured. You can also configure NGINX to log errors. Finding files In order for NGINX to respond to a request, it passes it to a content handler, determined by the configuration of the location directive. The unconditional content handlers are tried first: perl, proxy_pass, flv, mp4, and so on. If none of these is a match, the request is passed to one of the following, in order: random index, index, autoindex, gzip_static, static. Requests with a trailing slash are handled by one of the index handlers. If gzip is not activated, then the static module handles the request. How these modules find the appropriate file or directory on the filesystem is determined by a combination of certain directives. The root directive is best defined in a default server directive, or at least outside of a specific location directive, so that it will be valid for the whole server: server { root /home/customer/html; location / { index index.html index.htm; } location /downloads { autoindex on; } } In the preceding example any files to be served are found under the root /home/customer/html. If the client entered just the domain name, NGINX will try to serve index.html. If that file does not exist, then NGINX will serve index.htm. When a user enters the /downloads URI in their browser, they will be presented with a directory listing in HTML format. This makes it easy for users to access sites hosting software that they would like to download. 
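The directory listing produced by autoindex can be tuned further. The following is a minimal sketch, assuming the standard autoindex module; the two additional directives are not part of the article's example and are shown only as an illustration of how the /downloads location could be extended.

```nginx
location /downloads {
    autoindex on;
    # show rounded, human-readable sizes (KB, MB) instead of exact byte counts
    autoindex_exact_size off;
    # show file times in the server's local time zone instead of UTC
    autoindex_localtime on;
}
```

Both directives only affect how the generated HTML listing is rendered; the files themselves are still located by appending the URI to the root.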
NGINX will automatically rewrite the URI of a directory so that the trailing slash is present, and then issue an HTTP redirect. NGINX appends the URI to the root to find the file to deliver to the client. If this file does not exist, the client receives a 404 Not Found error message. If you don't want the error message to be returned to the client, one alternative is to try to deliver a file from different filesystem locations, falling back to a generic page, if none of those options are available. The try_files directive can be used as follows: location / { try_files $uri $uri/ backups/$uri /generic-not-found.html; } As a security precaution, NGINX can check the path to a file it's about to deliver, and if part of the path to the file contains a symbolic link, it returns an error message to the client: server { root /home/customer/html; disable_symlinks if_not_owner from=$document_root; } In the preceding example, NGINX will return a "Permission Denied" error if a symlink is found after /home/customer/html, and that symlink and the file it points to do not both belong to the same user ID. The following table summarizes these directives: Table: HTTP file-path directives Directive Explanation disable_symlinks Determines if NGINX should perform a symbolic link check on the path to a file before delivering it to the client. The following parameters are recognized: off : Disables checking for symlinks (default) on: If any part of a path is a symlink, access is denied if_not_owner: If any part of a path contains a symlink in which the link and the referent have different owners, access to the file is denied from=part: When specified, the path up to part is not checked for symlinks, everything afterward is according to either the on or if_not_owner parameter root Sets the path to the document root. Files are found by appending the URI to the value of this directive. try_files Tests the existence of files given as parameters. 
If none of the previous files are found, the last entry is used as a fallback, so ensure that this path or named location exists, or is set to return a status code indicated by =<status code>.

Name resolution

If logical names instead of IP addresses are used in an upstream or *_pass directive, NGINX will by default use the operating system's resolver to get the IP address, which is what it really needs to connect to that server. This will happen only once, the first time upstream is requested, and won't work at all if a variable is used in the *_pass directive. It is possible, though, to configure a separate resolver for NGINX to use. By doing this, you can override the TTL returned by DNS, as well as use variables in the *_pass directives.

    server {
        resolver 192.168.100.2 valid=300s;
    }

Table: Name resolution directives

Directive: Explanation
resolver: Configures one or more name servers to be used to resolve upstream server names into IP addresses. An optional valid parameter overrides the TTL of the domain name record.

In order to get NGINX to resolve an IP address anew, place the logical name into a variable. When NGINX resolves that variable, it implicitly makes a DNS look-up to find the IP address. For this to work, a resolver directive must be configured:

    server {
        resolver 192.168.100.2;
        location / {
            set $backend upstream.example.com;
            proxy_pass http://$backend;
        }
    }

Of course, by relying on DNS to find an upstream, you are dependent on the resolver always being available. When the resolver is not reachable, a gateway error occurs. In order to make the client wait time as short as possible, the resolver_timeout directive should be set low. The gateway error can then be handled by an error_page designed for that purpose.

    server {
        resolver 192.168.100.2;
        resolver_timeout 3s;
        error_page 504 /gateway-timeout.html;
        location / {
            proxy_pass http://upstream.example.com;
        }
    }

Client interaction

There are a number of ways in which NGINX can interact with clients.
This can range from attributes of the connection itself (IP address, timeouts, keepalive, and so on) to content negotiation headers. The directives listed in the following table describe how to set various headers and response codes to get the clients to request the correct page or serve up that page from its own cache: Table: HTTP client interaction directives Directive Explanation default_type Sets the default MIME type of a response. This comes into play if the MIME type of the file cannot be matched to one of those specified by the types directive. error_page Defines a URI to be served when an error level response code is encountered. Adding an = parameter allows the response code to be changed. If the argument to this parameter is left empty, the response code will be taken from the URI, which must in this case be served by an upstream server of some sort. etag Disables automatically generating the ETag response header for static resources (default is on). if_modified_since Controls how the modification time of a response is compared to the value of the If-Modified-Since request header: off: The If-Modified-Since header is ignored exact: An exact match is made (default) before: The modification time of the response is less than or equal to the value of the If-Modified-Since header ignore_invalid_headers Disables ignoring headers with invalid names (default is on). A valid name is composed of ASCII letters, numbers, the hyphen, and possibly the underscore (controlled by the underscores_in_headers directive). merge_slashes Disables the removal of multiple slashes. The default value of on means that NGINX will compress two or more / characters into one. recursive_error_pages Enables doing more than one redirect using the error_page directive (default is off). types Sets up a map of MIME types to file name extensions. NGINX ships with a conf/mime.types file that contains most MIME type mappings. 
Using include to load this file should be sufficient for most purposes. underscores_in_headers Enables the use of the underscore character in client request headers. If left at the default value off , evaluation of such headers is subject to the value of the ignore_invalid_headers directive. The error_page directive is one of NGINX's most flexible. Using this directive, we may serve any page when an error condition presents. This page could be on the local machine, but could also be a dynamic page produced by an application server, and could even be a page on a completely different site. http { # a generic error page to handle any server-level errors error_page 500 501 502 503 504 share/examples/nginx/50x.html; server { server_name www.example.com; root /home/customer/html; # for any files not found, the page located at # /home/customer/html/404.html will be delivered error_page 404 /404.html; location / { # any server-level errors for this host will be directed # to a custom application handler error_page 500 501 502 503 504 = @error_handler; } location /microsite { # for any non-existent files under the /microsite URI, # the client will be shown a foreign page error_page 404 http://microsite.example.com/404.html; } # the named location containing the custom error handler location @error_handler { # we set the default type here to ensure the browser # displays the error page correctly default_type text/html; proxy_pass http://127.0.0.1:8080; } } } Using limits to prevent abuse We build and host websites because we want users to visit them. We want our websites to always be available for legitimate access. This means that we may have to take measures to limit access to abusive users. We may define "abusive" to mean anything from one request per second to a number of connections from the same IP address. 
Abuse can also take the form of a DDOS (distributed denial-of-service) attack, where bots running on multiple machines around the world all try to access the site as many times as possible at the same time. In this section, we will explore methods to counter each type of abuse to ensure that our websites are available. First, let's take a look at the different configuration directives that will help us achieve our goal: Table: HTTP limits directives Directive Explanation limit_conn Specifies a shared memory zone (configured with limit_conn_zone) and the maximum number of connections that are allowed per key value. limit_conn_log_level When NGINX limits a connection due to the limit_conn directive, this directive specifies at which log level that limitation is reported. limit_conn_zone Specifies the key to be limited in limit_conn as the first parameter. The second parameter, zone, indicates the name of the shared memory zone used to store the key and current number of connections per key and the size of that zone (name:size). limit_rate Limits the rate (in bytes per second) at which clients can download content. The rate limit works on a connection level, meaning that a single client could increase their throughput by opening multiple connections. limit_rate_after Starts the limit_rate after this number of bytes have been transferred. limit_req Sets a limit with bursting capability on the number of requests for a specific key in a shared memory store (configured with limit_req_zone). The burst can be specified with the second parameter. If there shouldn't be a delay in between requests up to the burst, a third parameter nodelay needs to be configured. limit_req_log_level When NGINX limits the number of requests due to the limit_req directive, this directive specifies at which log level that limitation is reported. A delay is logged at a level one less than the one indicated here. limit_req_zone Specifies the key to be limited in limit_req as the first parameter. 
The second parameter, zone, indicates the name of the shared memory zone used to store the key and current number of requests per key and the size of that zone (name:size). The third parameter, rate, configures the number of requests per second (r/s) or per minute (r/m) before the limit is imposed.
max_ranges: Sets the maximum number of ranges allowed in a byte-range request. Specifying 0 disables byte-range support.

Here we limit access to 10 connections per unique IP address. This should be enough for normal browsing, as modern browsers open two to three connections per host. Keep in mind, though, that any users behind a proxy will all appear to come from the same address. So observe the logs for error code 503 (Service Unavailable), meaning that this limit has come into effect:

    http {
        limit_conn_zone $binary_remote_addr zone=connections:10m;
        limit_conn_log_level notice;
        server {
            limit_conn connections 10;
        }
    }

Limiting access based on a rate looks almost the same, but works a bit differently. When limiting how many pages per unit of time a user may request, NGINX will insert a delay after the first page request, up to a burst. This may or may not be what you want, so NGINX offers the possibility to remove this delay with the nodelay parameter:

    http {
        limit_req_zone $binary_remote_addr zone=requests:10m rate=1r/s;
        limit_req_log_level warn;
        server {
            limit_req zone=requests burst=10 nodelay;
        }
    }

Using $binary_remote_addr

We use the $binary_remote_addr variable in the preceding example to know exactly how much space storing an IP address will take. Because the binary form of the address has a fixed length, each stored state has a fixed size: 32 bytes on 32-bit platforms and 64 bytes on 64-bit platforms. So the 10m zone we configured previously is capable of holding up to 320,000 states on 32-bit platforms or 160,000 states on 64-bit platforms.

We can also limit the bandwidth per client. This way we can ensure that a few clients don't take up all the available bandwidth.
One caveat, though: the limit_rate directive works on a connection basis. A single client that is allowed to open multiple connections will still be able to get around this limit:

    location /downloads {
        limit_rate 500k;
    }

Alternatively, we can allow a kind of bursting to freely download smaller files, but make sure that larger ones are limited:

    location /downloads {
        limit_rate_after 1m;
        limit_rate 500k;
    }

Combining these different rate limitations enables us to create a configuration that is very flexible as to how and where clients are limited:

    http {
        limit_conn_zone $binary_remote_addr zone=ips:10m;
        limit_conn_zone $server_name zone=servers:10m;
        # zone referenced by limit_conn in the /downloads location below
        limit_conn_zone $binary_remote_addr zone=connections:10m;
        limit_req_zone $binary_remote_addr zone=requests:10m rate=1r/s;
        limit_conn_log_level notice;
        limit_req_log_level warn;
        reset_timedout_connection on;
        server {
            # these limits apply to the whole virtual server
            limit_conn ips 10;
            # only 1000 simultaneous connections to the same server_name
            limit_conn servers 1000;
            location /search {
                # here we want only the /search URL to be rate-limited
                limit_req zone=requests burst=3 nodelay;
            }
            location /downloads {
                # using limit_conn to ensure that each client is
                # bandwidth-limited with no getting around it
                limit_conn connections 1;
                limit_rate_after 1m;
                limit_rate 500k;
            }
        }
    }

Restricting access

In the previous section, we explored ways to limit abusive access to websites running under NGINX. Now we will take a look at ways to restrict access to a whole website or certain parts of it. Access restriction can take two forms here: restricting to a certain set of IP addresses, or restricting to a certain set of users. These two methods can also be combined to satisfy requirements that some users can access the website either from a certain set of IP addresses or if they are able to authenticate with a valid username and password.
The following directives will help us achieve these goals: Table: HTTP access module directives Directive Explanation allow Allows access from this IP address, network, or all. auth_basic Enables authentication using HTTP Basic Authentication. The parameter string is used as the realm name. If the special value off is used, this indicates that the auth_basic value of the parent configuration level is negated. auth_basic_user_file Indicates the location of a file of username:password:comment tuples used to authenticate users. The password field needs to be encrypted with the crypt algorithm. The comment field is optional. deny Denies access from this IP address, network, or all. satisfy Allows access if all or any of the preceding directives grant access. The default value all indicates that a user must come from a specific network address and enter the correct password. To restrict access to clients coming from a certain set of IP addresses, the allow and deny directives can be used as follows: location /stats { allow 127.0.0.1; deny all; } This configuration will allow access to the /stats URI from the localhost only. To restrict access to authenticated users, the auth_basic and auth_basic_user_file directives are used as follows: server { server_name restricted.example.com; auth_basic "restricted"; auth_basic_user_file conf/htpasswd; } Any user wanting to access restricted.example.com would need to provide credentials matching those in the htpasswd file located in the conf directory of NGINX's root. The entries in the htpasswd file can be generated using any available tool that uses the standard UNIX crypt() function. 
For example, the following Ruby script will generate a file of the appropriate format:

    #!/usr/bin/env ruby
    # set up the command-line options
    require 'optparse'
    OptionParser.new do |o|
      o.on('-f FILE') { |file| $file = file }
      o.on('-u', "--username USER") { |u| $user = u }
      o.on('-p', "--password PASS") { |p| $pass = p }
      o.on('-c', "--comment COMM (optional)") { |c| $comm = c }
      o.on('-h') { puts o; exit }
      o.parse!
      # file, user, and password are all required
      if $file.nil? or $user.nil? or $pass.nil?
        puts o; exit
      end
    end
    # initialize an array of ASCII characters to be used for the salt
    ascii = ('a'..'z').to_a + ('A'..'Z').to_a + ('0'..'9').to_a + [ ".", "/" ]
    $lines = []
    begin
      # read in the current http auth file
      File.open($file) do |f|
        f.each_line { |l| $lines << l }
      end
    rescue Errno::ENOENT
      # if the file doesn't exist (first use), start with an empty list
      $lines = []
    end
    # remove the user from the current list, since this is the one we're editing
    $lines.reject! { |line| line =~ /\A#{Regexp.escape($user)}:/ }
    # generate a crypt()ed password
    pass = $pass.crypt(ascii[rand(64)] + ascii[rand(64)])
    # if there's a comment, insert it
    if $comm
      $lines << "#{$user}:#{pass}:#{$comm}\n"
    else
      $lines << "#{$user}:#{pass}\n"
    end
    # write out the new file, creating or truncating it as necessary
    File.open($file, 'w') do |f|
      $lines.each { |l| f << l }
    end

Save this file as http_auth_basic.rb and give it a filename (-f), a user (-u), and a password (-p), and it will generate entries appropriate to use in NGINX's auth_basic_user_file directive:

    $ ./http_auth_basic.rb -f htpasswd -u testuser -p 123456

To handle scenarios where a username and password should only be entered if not coming from a certain set of IP addresses, NGINX has the satisfy directive.
The any parameter is used here for this either/or scenario:

    server {
        server_name intranet.example.com;
        location / {
            auth_basic "intranet: please login";
            auth_basic_user_file conf/htpasswd-intranet;
            allow 192.168.40.0/24;
            allow 192.168.50.0/24;
            deny all;
            satisfy any;
        }
    }

If, instead, the requirements are for a configuration in which the user must come from a certain IP address and provide authentication, the all parameter is the default. So, we omit the satisfy directive itself and include only allow, deny, auth_basic, and auth_basic_user_file:

    server {
        server_name stage.example.com;
        location / {
            auth_basic "staging server";
            auth_basic_user_file conf/htpasswd-stage;
            allow 192.168.40.0/24;
            allow 192.168.50.0/24;
            deny all;
        }
    }

Streaming media files

NGINX is capable of serving certain video media types. The flv and mp4 modules, included in the base distribution, can perform what is called pseudo-streaming. This means that NGINX will seek to a certain location in the video file, as indicated by the start request parameter. In order to use the pseudo-streaming capabilities, the corresponding module needs to be included at compile time: --with-http_flv_module for Flash Video (FLV) files and/or --with-http_mp4_module for H.264/AAC files. The following directives will then become available for configuration:

Table: HTTP streaming directives

Directive: Explanation
flv: Activates the flv module for this location.
mp4: Activates the mp4 module for this location.
mp4_buffer_size: Sets the initial buffer size for delivering MP4 files.
mp4_max_buffer_size: Sets the maximum size of the buffer used to process MP4 metadata.

Activating FLV pseudo-streaming for a location is as simple as just including the flv keyword:

    location /videos {
        flv;
    }

There are more options for MP4 pseudo-streaming, as the H.264 format includes metadata that needs to be parsed. Seeking is available once the "moov atom" has been parsed by the player.
So to optimize performance, ensure that the metadata is at the beginning of the file. If an error message such as the following shows up in the logs, the mp4_max_buffer_size needs to be increased: mp4 moov atom is too large mp4_max_buffer_size can be increased as follows: location /videos { mp4; mp4_buffer_size 1m; mp4_max_buffer_size 20m; } Predefined variables NGINX makes constructing configurations based on the values of variables easy. Not only can you instantiate your own variables by using the set or map directives, but there are also predefined variables used within NGINX. They are optimized for quick evaluation and the values are cached for the lifetime of a request. You can use any of them as a key in an if statement, or pass them on to a proxy. A number of them may prove useful if you define your own log file format. If you try to redefine any of them, though, you will get an error message as follows: <timestamp> [emerg] <master pid>#0: the duplicate "<variable_name>" variable in <path-to-configuration-file>:<line-number> They are also not made for macro expansion in the configuration—they are mostly used at run time. Summary In this article, we have explored a number of directives used to make NGINX serve files over HTTP. Not only does the http module provide this functionality, but there are also a number of helper modules that are essential to the normal operation of NGINX. These helper modules are enabled by default. Combining the directives of these various modules enables us to build a configuration that meets our needs. We explored how NGINX finds files based on the URI requested. We examined how different directives control how the HTTP server interacts with the client, and how the error_page directive can be used to serve a number of needs. Limiting access based on bandwidth usage, request rate, and number of connections is all possible. We saw, too, how we can restrict access based on either IP address or through requiring authentication. 
We explored how to use NGINX's logging capabilities to capture just the information we want. Pseudo-streaming was examined briefly, as well. NGINX provides us with a number of variables that we can use to construct our configurations. Resources for Article : Further resources on this subject: Nginx HTTP Server FAQs [Article] Nginx Web Services: Configuration and Implementation [Article] Using Nginx as a Reverse Proxy [Article]
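The difference between satisfy any and the default satisfy all, covered earlier in this article, comes down to how two independent checks are combined. The following is a minimal Python sketch of that decision only, not NGINX's implementation; the function names ip_allowed and access_granted are hypothetical, and the networks are the ones used in the allow directives of the examples.

```python
from ipaddress import ip_address, ip_network

# Networks copied from the "allow" directives in the article's examples.
ALLOWED_NETWORKS = [ip_network("192.168.40.0/24"), ip_network("192.168.50.0/24")]


def ip_allowed(client_ip: str) -> bool:
    """Mimic 'allow <net>; deny all;': only the listed networks pass."""
    addr = ip_address(client_ip)
    return any(addr in net for net in ALLOWED_NETWORKS)


def access_granted(client_ip: str, authenticated: bool, satisfy: str = "all") -> bool:
    """Combine the IP check and the auth check (hypothetical helper)."""
    checks = [ip_allowed(client_ip), authenticated]
    if satisfy == "any":
        return any(checks)  # either condition is enough
    return all(checks)      # default: both conditions are required


# Intranet style (satisfy any): an internal address needs no password.
print(access_granted("192.168.40.7", authenticated=False, satisfy="any"))
# Staging style (default all): an internal address alone is not enough.
print(access_granted("192.168.40.7", authenticated=False))
```

With any, an outside client can still get in by logging in; with all, it is rejected regardless of credentials.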

Packt
16 Oct 2009
5 min read

Human-readable Rules with Drools JBoss Rules 5.0(Part 2)

Drools Agenda Before we talk about how to manage rule execution order, we have to understand Drools Agenda. When an object is inserted into the knowledge session, Drools tries to match this object with all of the possible rules. If a rule has all of its conditions met, its consequence can be executed. We say that a rule is activated. Drools records this event by placing this rule onto its agenda (it is a collection of activated rules). As you may imagine, many rules can be activated, and also deactivated, depending on what objects are in the rule session. After the fireAllRules method call, Drools picks one rule from the agenda and executes its consequence. It may or may not cause further activations or deactivations. This continues until the Drools Agenda is empty. The purpose of the agenda is to manage the execution order of rules. Methods for managing rule execution order The following are the methods for managing the rule execution order (from the user's perspective). They can be viewed as alternatives to ruleflow. All of them are defined as rule attributes. salience: This is the most basic one. Every rule has a salience value. By default it is set to 0. Rules with higher salience value will fire first. The problem with this approach is that it is hard to maintain. If we want to add new rule with some priority, we may have to shift the priorities of existing rules. It is often hard to figure out why a rule has certain salience, so we have to comment every salience value. It creates an invisible dependency on other rules. activation-group: This used to be called xor-group. When two or more rules with the same activation group are on the agenda, Drools will fire just one of them. agenda-group: Every rule has an agenda group. By default it is MAIN. However, it can be overridden. This allows us to partition Drools Agenda into multiple groups that can be executed separately. The figure above shows partitioned Agenda with activated rules. 
The matched rules are coming from left and going into Agenda. One rule is chosen from the Agenda at a time and then executed/fired. At runtime, we can programmatically set the active Agenda group (through the getAgenda().getAgendaGroup(String agendaGroup).setFocus() method of KnowledgeRuntime), or declaratively, by setting the rule attribute auto-focus to true. When a rule is activated and has this attribute set to true, the active agenda group is automatically changed to rule's agenda group. Drools maintains a stack of agenda groups. Whenever the focus is set to a different agenda group, Drools adds this group onto this stack. When there are no rules to fire in the current agenda group, Drools pops from the stack and sets the agenda group to the next one. Agenda groups are similar to ruleflow groups with the exception that ruleflow groups are not stacked. Note that only one instance of each of these attributes is allowed per rule (for example, a rule can only be in one ruleflow-group ; however, it can also define salience within that group). Ruleflow As we've already said, ruleflow can externalize the execution order from the rule definitions. Rules just define a ruleflow-group attribute, which is similar to agenda-group. It is then used to define the execution order. A simple ruleflow (in the example.rf file) is shown in the following screenshot: The preceding screenshot shows a ruleflow opened with the Drools Eclipse plugin. On the lefthand side are the components that can be used when building a ruleflow. On the righthand side is the ruleflow itself. It has a Start node which goes to ruleflow group called Group 1. After it finishes execution, an Action is executed, then the flow continues to another ruleflow group called Group 2, and finally it finishes at an End node. Ruleflow definitions are stored in a file with the .rf extension. This file has an XML format and defines the structure and layout for presentational purposes. 
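The interplay of salience, agenda groups and the focus stack described above can be illustrated with a small model. This is a simplified Python sketch, not the Drools API: the Activation and Agenda classes and their methods are hypothetical, and it ignores the fact that firing a rule can create or cancel further activations.

```python
from dataclasses import dataclass, field


@dataclass
class Activation:
    rule: str
    salience: int = 0            # default salience is 0
    agenda_group: str = "MAIN"   # default agenda group is MAIN


@dataclass
class Agenda:
    # Focus stack: MAIN sits at the bottom, the last entry is the active group.
    stack: list = field(default_factory=lambda: ["MAIN"])
    groups: dict = field(default_factory=dict)

    def add(self, activation: Activation) -> None:
        self.groups.setdefault(activation.agenda_group, []).append(activation)

    def set_focus(self, group: str) -> None:
        self.stack.append(group)

    def fire_all(self) -> list:
        fired = []
        while self.stack:
            pending = self.groups.get(self.stack[-1], [])
            if not pending:
                self.stack.pop()  # nothing left to fire: return to the previous group
                continue
            # Highest salience fires first; ties keep insertion order.
            best = max(pending, key=lambda a: a.salience)
            pending.remove(best)
            fired.append(best.rule)
        return fired


agenda = Agenda()
agenda.add(Activation("log"))
agenda.add(Activation("validate", salience=10))
agenda.add(Activation("discount", agenda_group="pricing"))
agenda.set_focus("pricing")
print(agenda.fire_all())  # ['discount', 'validate', 'log']
```

The focused pricing group is drained first, then control falls back to MAIN, where the rule with the higher salience fires before the default one.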
Another useful rule attribute for managing which rules can be activated is lock-on-active. It is a special form of the no-loop attribute. It can be used in combination with ruleflow-group or agenda-group. If it is set to true, and an agenda/ruleflow group becomes active/focused, it discards any further activations for the rule until a different group becomes active. Please note that activations that are already on the agenda will be fired. A ruleflow consists of various nodes. Each node has a name, type, and other specific attributes. You can see and change these attributes by opening the standard Properties view in Eclipse while editing the ruleflow file. The basic node types are as follows: Start End Action RuleFlowGroup Split Join They are discussed in the following sections. Start It is the initial node. The flow begins here. Each ruleflow needs one start node. This node has no incoming connection—just one outgoing connection. End It is a terminal node. When execution reaches this node, the whole ruleflow is terminated (all of the active nodes are canceled). This node has one incoming connection and no outgoing connections. Action Used to execute some arbitrary block of code. It is similar to the rule consequence—it can reference global variables and can specify dialect. RuleFlowGroup This node will activate a ruleflow-group, as specified by its RuleFlowGroup attribute. It should match the value in ruleflow-group rule attribute.  

Packt
22 Mar 2013
15 min read

Managing Network Layout

(For more resources related to this topic, see here.) There are two main approaches to working with network structure in Nagios Core: Host parent definitions allow an administrator to define a hierarchy of connectivity to monitored hosts from the "point of view" of the Nagios Core server. An example might be a server with the monitored address in another subnet linked to the Nagios Core server by a router. If the router enters a DOWN state, it triggers Nagios Core's host reachability logic to automatically determine which hosts become inaccessible, and flags these as UNREACHABLE rather than DOWN, allowing refined notification behavior. Host and service dependencies allow the formalization of relationships between hosts or services, usually for the purposes of suppressing unnecessary notifications. An example might be a service that tests a login to a mail service, that itself requires a database service to work properly. If Nagios Core finds that the database service and the login service are both down, a service dependency allows the suppressing of the notification about the login service; the administrator would therefore only be notified about the database service being down, which is more likely to be the actual problem. There is some overlap of functionality here, but the general pattern is that host parent definitions describe the structure of your network from the vantage point of your monitoring server, and host and service dependencies describe the way it functions, independent of the monitoring server. We will define both parent definitions and dependencies in this article, with the primary goal of filtering and improving the notifications that Nagios Core sends in response to failed checks, which can assist greatly in diagnosing problems. 
We'll also look at another, more subtle benefit of establishing host parent definitions: making the network map of the Nagios Core web interface useful. Once a basic hierarchy is set up, we'll show how to customize the map's appearance (including defining icons for hosts), to make it generally useful as a network weather map.

Creating a network host hierarchy

In this recipe, we'll learn how to establish a parent-child relationship for two hosts in a very simple network, in order to take advantage of Nagios Core's reachability logic. Changing this configuration is very simple; it involves adding only one directive, and optionally changing some notification options.

Getting ready

You will need to be running a Nagios Core 3.0 or newer server, and have at least two hosts, one of which is only reachable via the other. The host that allows communications with the other is the parent host. You should be reasonably confident that a loss of connectivity to the parent host necessarily implies that the child host becomes unreachable from the monitoring server. Access to the web interface of Nagios Core would also be useful, as making this change will change the appearance of the network map, discussed in the Using the network map recipe in this article.

Our example will use a Nagios Core monitoring server, olympus.naginet, monitoring three hosts:

calpe.naginet, a router
janus.naginet, another router
corsica.naginet, a web server

The hosts are connected as shown in the following diagram:

Note that the Nagios Core server olympus.naginet is only able to communicate with the corsica.naginet web server if the router calpe.naginet is working correctly. If calpe.naginet were to enter a DOWN state, we would see corsica.naginet enter a DOWN state too:

This is a little misleading, as we don't actually know whether corsica.naginet is down. It might be, but with the router in between the hosts not working correctly, Nagios Core has no way of knowing.
A more informative and accurate status for the host would be UNREACHABLE; this is what the configuration we're about to add will arrange.

How to do it...

We can configure a parent-child relationship for our two hosts as follows:

Change to the objects configuration directory for Nagios Core. The default path is /usr/local/nagios/etc/objects. If you've put the definition for your host in a different file, then move to its directory instead.

# cd /usr/local/nagios/etc/objects

Edit the file containing the definition for the child host. In our example, the child host is corsica.naginet, the web server. The host definition might look similar to the following code snippet:

define host {
    use          linux-server
    host_name    corsica.naginet
    alias        corsica
    address      10.128.0.71
}

Add a new parents directive to the host's definition, and give it the same value as the host_name directive of the host on which it is dependent for connectivity. In our example, that host is calpe.naginet.

define host {
    use          linux-server
    host_name    corsica.naginet
    alias        corsica
    address      10.128.0.71
    parents      calpe.naginet
}

Validate the configuration and restart the Nagios Core server:

# /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
# /etc/init.d/nagios restart

With this done, if the parent host enters a DOWN state and the child host can't be contacted, then the child host will enter an UNREACHABLE state rather than also being flagged as DOWN:

The child host's contacts will also receive UNREACHABLE notifications instead of DOWN notifications for the child host, provided the u flag is included in notification_options for the host, and host_notification_options for the contacts.

How it works...

This is a simple application of Nagios Core's reachability logic. When the check to calpe.naginet fails for the first time, Nagios Core notes that it is a parent host for one child host, corsica.naginet.
If during checks for the child host it finds it cannot communicate with it, it flags an UNREACHABLE state instead of the DOWN state, firing a different notification event. The primary advantages to this are twofold: The DOWN notification is only sent for the nearest problem parent host. All other hosts beyond that host fire UNREACHABLE notifications. This means that Nagios Core's reachability logic automatically determines the point of failure from its perspective, which can be very handy in diagnosing which host is actually experiencing a problem. If the host is a parent to a large number of other hosts, the configuration can be arranged not to send urgent notifications for UNREACHABLE hosts. There may not be much point sending a hundred pager or e-mail messages to an administrator when a very central router goes down; they know there are problems with the downstream hosts, so all we would be doing is distracting them with useless information. With a little planning and some knowledge of the network, all we need to do is add a few parents directives to host definitions to build a simple network structure, and Nagios Core will behave much more intelligently as a result. This is one of the easiest ways to refine the notification behavior of Nagios Core; it can't be recommended enough! There's more... Note that a child host can itself be a parent to other hosts in turn, allowing a nesting network structure. Perhaps in another situation, we might find that the corsica.naginet server is two routers away from the monitoring server: In this case, not only is corsica.naginet the child host of calpe.naginet, but calpe. naginet is itself the child host of janus.naginet. 
We could specify this relationship in exactly the same way:

define host {
    use          linux-router
    host_name    calpe.naginet
    alias        calpe
    address      10.128.0.129
    parents      janus.naginet
}

It's also possible to set multiple parents for a host, if there are two possible paths to the same machine:

define host {
    use          linux-server
    host_name    corsica.naginet
    alias        corsica
    address      10.128.0.71
    parents      calpe.naginet,janus.naginet
}

With this configuration, corsica.naginet would only be deemed UNREACHABLE if both of its parent hosts were down. This kind of configuration is useful to account for redundant paths in a network; use cases could include spanning tree technologies, or dynamic routing failover.

After you've set up a good basic structure for your network using the parents directive, definitely check out the Using the network map recipe in this article to get some automatic visual feedback about your network's structure as generated from your new configuration.

See also

The Using the network map and Establishing a host dependency recipes in this article

Using the network map

In this recipe, we'll examine our network hierarchy in the network map (or status map) in the Nagios Core web interface. The network map takes the form of a generated graphic showing the hierarchy of hosts and their current states. You can learn how to establish such a hierarchy in the recipe Creating a network host hierarchy in this article. The network map allows filtering to show specific hosts, and clicking on hosts to navigate through larger networks.

Getting ready

You will need to be running a Nagios Core 3.0 or newer server, and have access to its web interface. You will also need permission to view the states of hosts, preferably all hosts.
You can arrange this by adding your username in the authorized_for_all_hosts directive, normally in /usr/local/nagios/etc/cgi.cfg; for example, for the user tom, we might configure the directive to read as follows: authorized_for_all_hosts=nagiosadmin,tom By default, the nagiosadmin user should have all the necessary permissions to view the complete map. The network map is not particularly useful without at least a few hosts configured and arranged in a hierarchy, so if you have not set any parents directives for your hosts, then you may wish to read the Creating a network host hierarchy recipe in this article first, and arrange your monitored hosts as it explains. How to do it... We can inspect the network map for our newly configured host hierarchy like so: Log in to the Nagios Core web interface. Click on the Map item in the menu on the left: You should be presented with a generated graphic showing all the hosts in your network that your user has permissions to view: Hover over any host with the mouse to see a panel breaking down the host's current state: By default, the network map is centered around the Nagios Process icon. Try clicking on one of your hosts to recenter the map; in this example, it's recentered on calpe.naginet: How it works... The network map is automatically generated from your host configuration. By default, it arranges the hosts in sectors, radiating outward from the central Nagios Process icon, using lines to show dependencies, and adjusting background colors to green for UP states, and red for DOWN or UNREACHABLE states. This map is generated via the GD2 library, written by Thomas Boutell. It takes the form of a linked image map. This means that you can simply right-click the image to save it while the network is in a particular state for later reference, and also that individual nodes can be clicked to recenter the map around the nominated host. 
This is particularly useful for networks with a large number of hosts and very many levels of parent/child host relationships. There's more... Note that the form in the panel in the top-right allows customizing the appearance of the map directly: Layout Method: This allows you to select the algorithm used to arrange and draw the hosts. It's worth trying each of these to see which you prefer for your particular network layout. Scaling factor: Change the value here to reduce or increase the size of the map image; values between 0.0 and 1.0 will reduce the image's size, and values above 1.0 will increase it. Drawing Layers: If your hosts are organized into hostgroups, you can filter the map to only display hosts belonging to particular groups. Layer mode: If you selected any host groups in the Drawing Layers option, this allows you to select whether you want to include hosts in those groups in the map, or exclude them from it. Suppress popups: If you find the yellow information popups that appear when hovering over hosts annoying, then you can turn them off by selecting this checkbox. After selecting or changing any one of these options, you will need to click on Update to apply them. The appearance of the status map can be configured well beyond this by changing directives in the Nagios Core configuration file, and adding some directives to your hosts; take a look at the recipes under the See also section of this recipe for some examples of how this is done. See also The Customizing appearance of the network map , Choosing icons for hosts, Specifying coordinates for a host on the network map, and Using the network map as an overlay recipes in this article Choosing icons for hosts In this recipe, we'll learn how to select graphics for hosts, to appear in various parts of the Nagios Core web interface. This is done by adding directives to a host to specify the paths to appropriate images to represent it. 
Adding these definitions has no effect on Nagios Core's monitoring behavior; they are mostly cosmetic changes, although it's useful to see at a glance whether a particular node is a server or a workstation, particularly on the network map.

Getting ready

You will need to be running a Nagios Core 3.0 or newer server, and have access to its web interface. You must also be able to edit the configuration files for the server. It's a good idea to check that you actually have the required images installed. The default set of icons is included in /usr/local/nagios/share/images/logos. Don't confuse this with its parent directory, images, which contains images used as part of the Nagios Core web interface itself. In the logos directory, you should find a number of images in various formats. In this example, we're interested in the router and rack-server icons:

$ ls /usr/local/nagios/share/images/logos/{router,rack-server}.*
/usr/local/nagios/share/images/logos/rack-server.gd2
/usr/local/nagios/share/images/logos/rack-server.gif
/usr/local/nagios/share/images/logos/router.gd2
/usr/local/nagios/share/images/logos/router.gif

To get the full benefit of the icons, you'll likely want to be familiar with using the network map, and have access to view it with the appropriate hosts in your own Nagios Core instance. The network map is introduced in the Using the network map recipe in this article.

How to do it...

We can define images to be used in displaying our host as follows:

Change to the objects configuration directory for Nagios Core. The default path is /usr/local/nagios/etc/objects. If you've put the definition for your host in a different file, then move to its directory instead.

# cd /usr/local/nagios/etc/objects

Add three new directives to each of the hosts to which you want to apply the icons.
In this example, the rack-server icon is assigned to corsica.naginet, and the router icon to both janus.naginet and calpe.naginet:

define host {
    use                linux-server
    host_name          corsica.naginet
    alias              corsica
    address            10.128.0.71
    icon_image         rack-server.gif
    icon_image_alt     Rack Server
    statusmap_image    rack-server.gd2
}

define host {
    use                linux-router
    host_name          janus.naginet
    alias              janus
    address            10.128.0.128
    icon_image         router.gif
    icon_image_alt     Router
    statusmap_image    router.gd2
}

define host {
    use                linux-router
    host_name          calpe.naginet
    alias              calpe
    address            10.128.0.129
    icon_image         router.gif
    icon_image_alt     Router
    statusmap_image    router.gd2
}

Validate the configuration and restart the Nagios Core server:

# /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
# /etc/init.d/nagios restart

With this done, a visit to the status map should display the appropriate hosts with icons rather than question marks:

The Hosts list should also include a scaled-down version of the image:

How it works...

When a host list, service list, or network status map is generated, it checks for the presence of icon_image or statusmap_image values for each host object, reads the appropriate image if defined, and includes that as part of its processing. The network status map defaults to displaying only a question mark in the absence of a value for the statusmap_image directive. Note that for the statusmap_image directive, we chose the .gd2 version of the icon rather than the .gif version. This is for performance reasons; the status map is generated with the GD2 library, which deals more efficiently with its native .gd2 format. The icon_image_alt directive defines the value for the alt attribute when the image is displayed in an <img> HTML tag. Most web browsers will show this text after briefly hovering over the icon.
Nagios Core 3.0 allows you to put these directives in a separate hostextinfo object, but this object type is officially deprecated as of Nagios Core 4.0, so it's recommended to avoid it. There's more If you have a number of hosts that need to share the same image, it's a good practice to inherit from a common host template with the appropriate directives set. For our example, we might define a template as follows: define host { name router-icon icon_image router.gif icon_image_alt Router statusmap_image router.gd2 register 0 } We could then apply the image settings directly to both our routers simply by inheriting from that template, by adding it to the use directive: define host { use linux-router,router-icon host_name janus.naginet alias janus address 10.128.0.128 } define host { use linux-router,router-icon host_name calpe.naginet alias calpe address 10.128.0.129 } If you don't like the included icon set, there are many icon sets available online on the Nagios Exchange site at http://exchange.nagios.org/. If you want, you could even make your own, out of pictures of your physical hardware, saved in standard PNG or GD2 format. See also The Using the network map and Specifying coordinates for a host on the network map recipes in this article
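The reachability rule that the parents directive enables, as described in the Creating a network host hierarchy recipe, can be modelled in a few lines. This is a simplified Python sketch of the decision only, not Nagios Core's actual code; the function name classify_failed_host and the dictionary layout are assumptions made for illustration, while the host names come from the recipe.

```python
def classify_failed_host(host, parents, states):
    """Return the state for a host whose own check has just failed.

    parents maps a host name to the list of its parent host names;
    states maps a host name to "UP", "DOWN" or "UNREACHABLE".
    """
    host_parents = parents.get(host, [])
    if not host_parents:
        # No parents: treated as directly attached to the monitoring server.
        return "DOWN"
    if any(states.get(p) == "UP" for p in host_parents):
        # At least one path is alive, so the host itself is the problem.
        return "DOWN"
    # Every path to the host is broken; we cannot tell what state it is in.
    return "UNREACHABLE"


# corsica.naginet with two redundant parents, as in the multiple-parents example.
parents = {"corsica.naginet": ["calpe.naginet", "janus.naginet"]}

print(classify_failed_host(
    "corsica.naginet", parents,
    {"calpe.naginet": "DOWN", "janus.naginet": "UP"}))    # DOWN
print(classify_failed_host(
    "corsica.naginet", parents,
    {"calpe.naginet": "DOWN", "janus.naginet": "DOWN"}))  # UNREACHABLE
```

This matches the behaviour stated in the recipe: with two parents, the child is only deemed UNREACHABLE when both of them are down.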

Peter Zignego
12 Oct 2016
5 min read

Server-side Swift: Building a Slack Bot, Part 1

As a remote iOS developer, I love Slack. It’s my meeting room and my water cooler over the course of a work day. If you’re not familiar with Slack, it is a group communication tool popular in Silicon Valley and beyond. What makes Slack valuable beyond replacing email as the go-to communication method for businesses is that it is more than chat; it is a platform. Thanks to Slack’s open attitude toward developers with its API, hundreds of developers have been building what have become known as Slack bots. There are many different libraries available to help you start writing your Slack bot, covering a wide range of programming languages. I wrote a library in Apple’s new programming language (Swift) for this very purpose, called SlackKit.

SlackKit wasn’t very practical initially, because it only ran on iOS and OS X. On the modern web, you need to support Linux to deploy on Amazon Web Services, Heroku, or hosted server companies such as Linode and Digital Ocean. But last June, Apple open sourced Swift, including official support for Linux (Ubuntu 14 and 15 specifically). This made it possible to deploy Swift code on Linux servers, and developers hit the ground running to build out the infrastructure needed to make Swift a viable language for server applications.

Even with this huge developer effort, it is still early days for server-side Swift. Apple’s Linux Foundation port is a huge undertaking, as is the work to port libdispatch, a concurrency framework that provides much of the underpinning for Foundation. In addition to rough official tooling, writing code for server-side Swift can be a bit like hitting a moving target, with biweekly snapshot releases and multiple, ABI-incompatible versions to target.

Zewo to Sixty on Linux

Fortunately, there are some good options for deploying Swift code on servers right now, even with Apple’s libraries in flux. I’m going to focus on one in particular: Zewo.
Zewo is modular by design, allowing us to use the Swift Package Manager to pull in only what we need instead of a monolithic framework. It’s open source and has a great community of developers that spans the globe. If you’re interested in the world of server-side Swift, you should get involved! Oh, and of course they have a Slack. Using Zewo and a few other open source libraries, I was able to build a version of SlackKit that runs on Linux.

A Swift Tutorial

In this two-part post series I have detailed a step-by-step guide to writing a Slack bot in Swift and deploying it to Heroku. I’m going to be using OS X, but this is also achievable on Linux using the editor of your choice.

Prerequisites

Install Homebrew:

/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

Install swiftenv:

brew install kylef/formulae/swiftenv

Configure your shell:

echo 'if which swiftenv > /dev/null; then eval "$(swiftenv init -)"; fi' >> ~/.bash_profile

Download and install the latest Zewo-compatible snapshot:

swiftenv install DEVELOPMENT-SNAPSHOT-2016-05-09-a
swiftenv local DEVELOPMENT-SNAPSHOT-2016-05-09-a

Install and link OpenSSL:

brew install openssl
brew link openssl --force

Let’s Keep Score

The sample application we’ll be building is a leaderboard for Slack, like PlusPlus++ by Betaworks. It works like this: add a point for every @thing++, subtract a point for every @thing--, and show a leaderboard when asked @botname leaderboard.

First, we need to create the directory for our application and initialize the basic project structure.
mkdir leaderbot && cd leaderbot
swift build --init

Next, we need to edit Package.swift to add our dependency, SlackKit:

import PackageDescription

let package = Package(
    name: "Leaderbot",
    targets: [],
    dependencies: [
        .Package(url: "https://github.com/pvzig/SlackKit.git", majorVersion: 0, minor: 0),
    ]
)

SlackKit is dependent on several Zewo libraries, but thanks to the Swift Package Manager, we don’t have to worry about importing them explicitly.

Then we need to build our dependencies:

swift build

And our development environment (we need to pass in some linker flags so that swift build knows where to find the version of OpenSSL we installed via Homebrew and the C modules that some of our Zewo libraries depend on):

swift build -Xlinker -L$(pwd)/.build/debug/ -Xswiftc -I/usr/local/include -Xlinker -L/usr/local/lib -X

In Part 2, I will show all of the Swift code, how to get an API token, how to test the app and deploy it on Heroku, and finally how to launch it.

Disclaimer

The Linux version of SlackKit should be considered an alpha release. It’s a fun tech demo to show what’s possible with Swift on the server, not something to be relied upon. Feel free to report issues you come across.

About the author

Peter Zignego is an iOS developer in Durham, North Carolina. He writes at bytesized.co, tweets @pvzig, and freelances at Launch Software.
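The scoring rules in the Let’s Keep Score section (a point for every @thing++, minus a point for every @thing--) are independent of Slack and of Swift. Since the Swift code is left to Part 2, here is a minimal Python sketch of the counting logic only; the regular expression, the function names apply_message and leaderboard, and the choice of which characters may appear in a name are all assumptions, not the bot's actual implementation.

```python
import re
from collections import Counter

# Matches "@name++" or "@name--"; the allowed name characters are an assumption.
SCORE_PATTERN = re.compile(r"@(\w+)(\+\+|--)")


def apply_message(scores: Counter, text: str) -> None:
    """Update the scores in place for every @thing++ / @thing-- in a message."""
    for name, op in SCORE_PATTERN.findall(text):
        scores[name] += 1 if op == "++" else -1


def leaderboard(scores: Counter) -> list:
    """Return (name, score) pairs, highest score first, ties sorted by name."""
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


scores = Counter()
apply_message(scores, "great talk @swift++ and again @swift++")
apply_message(scores, "build broke, @jenkins--")
print(leaderboard(scores))  # [('swift', 2), ('jenkins', -1)]
```

A real bot would feed apply_message from incoming message events and reply with the leaderboard when asked; those parts depend on the Slack API and are not shown here.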

Xavier Bruhiere
18 Sep 2015
8 min read

A Development Workflow with Docker

In this post, we're going to explore the sacred developer workflow, and how we can leverage modern technologies to craft a very opinionated and trendy setup. Such a topic involves a lot of personal taste, so we will mostly focus on ideas that have the potential to increase developer happiness, productivity and software quality. The tools used in this article made my life easier, but feel free to pick what you like and swap what you don't with your own arsenal. While it is a good idea to stick with mature tools and seriously learn how to master them, you should keep an open mind and periodically monitor what's new. Software development evolves at an intense pace and smart people regularly come up with new projects that can help us to be better at what we do.

To keep things concrete and challenge our hypotheses, we're going to develop a development tool. Our small command line application will manage the creation, listing and destruction of project tickets. We will write it in node.js to enjoy a scripting language, a very large ecosystem and a nice integration with yeoman. This last reason foreshadows future features and probably a post about them.

Code Setup

The code has been tested under Ubuntu 14.10, io.js version 1.8.1 and npm version 2.8.3. As this post focuses on the workflow, rather than on the code, I'll keep everything as simple as possible and assume you have a basic knowledge of docker and developing with node. Now let's build the basic structure of a new node project.

code/ ➜ tree
.
├── package.json
├── bin
│   └── iago.js
├── lib
│   └── notebook.js
└── test
    ├── mocha.opts
    └── notebook.js

Some details:

bin/iago.js is the command line entry point.
lib/notebook.js exports the methods to interact with tickets.
test/ uses mocha and chai for unit-testing.
package.json provides information on the project:

{
  "name": "iago",
  "version": "0.1.0",
  "description": "Ticket management",
  "bin": {
    "iago": "./bin/iago.js"
  }
}

Build Automation

As TDD advocates, let's start with a failing test.

// test/notebook.js
// Mocha - the fun, simple, flexible JavaScript test framework
// Chai - Assertion Library
var expect = require('chai').expect;
var notebook = require('../lib/notebook');

describe('new note', function() {
  beforeEach(function(done) {
    // Reset the database used to store tickets before each test, to keep them independent
    notebook.backend.remove();
    done();
  });

  it('should be empty', function() {
    expect(notebook.backend.size()).to.equal(0);
  });
});

In order to run it, we first need to install node, npm, mocha and chai. Ideally, we share the same software versions as the rest of the team, on the same OS. Hopefully, it won't collide with other projects we develop on the same machine, and the production environment is exactly the same. Or we could use docker and not bother.

# -it --rm             start a new container, automatically removed once done
# --volume $PWD:/app   make our code available from within the container
# --workdir /app       set default working dir in project's root
# iojs                 use official io.js image
# npm install ...      install test libraries and save them in package.json
$ docker run -it --rm \
    --volume $PWD:/app \
    --workdir /app \
    iojs \
    npm install --save-dev mocha chai

This one-liner installs mocha and chai locally in node_modules/. With nothing more than docker installed, we can now run tests.

$ docker run -it --rm --volume $PWD:/app --workdir /app iojs node_modules/.bin/mocha

Having dependencies bundled along with the project lets us use the stack container as is. This approach extends remarkably well to other languages: Ruby has Bundler and Go has Godep. Let's make the test pass with the following implementation of our notebook.
    // lib/notebook.js
    /*jslint node: true */
    'use strict';

    var path = require('path');
    // Flat JSON file database built on lodash API
    var low = require('lowdb');
    // Pretty unicode tables for the CLI with Node.JS
    var table = require('cli-table');

    /**
     * Storage with sane defaults
     * @param {string} dbPath - Flat (json) file Lowdb will use
     * @param {string} dbName - Lowdb database name
     */
    function db(dbPath, dbName) {
      dbPath = dbPath || process.env.HOME + '/.iago.json';
      dbName = dbName || 'notebook';
      console.log('using', dbPath, 'storage');
      return low(dbPath)(dbName);
    }

    module.exports = {
      backend: db(),

      write: function(title, content, owner, labels) {
        var note = {
          meta: {
            project: path.basename(process.cwd()),
            date: new Date(),
            status: 'created',
            owner: owner,
            labels: labels,
          },
          title: title,
          ticket: content,
        };
        console.log('writing new note:', title);
        this.backend.push(note);
      },

      list: function() {
        var i = 0;
        var grid = new table({head: ['title', 'note', 'author', 'date']});
        var dump = db().cloneDeep();
        for (; i < dump.length; i++) {
          grid.push([
            dump[i].title,
            dump[i].ticket,
            // write() stores the author under meta.owner
            dump[i].meta.owner,
            dump[i].meta.date
          ]);
        }
        console.log(grid.toString());
      },

      done: function(title) {
        var notes = db().remove({title: title});
        console.log('note', notes[0].title, 'removed');
      }
    };

Again we install dependencies and re-run tests.

    # Install lowdb and cli-table locally
    docker run -it --rm --volume $PWD:/app --workdir /app iojs npm install lowdb cli-table
    # Successful tests
    docker run -it --rm --volume $PWD:/app --workdir /app iojs node_modules/.bin/mocha

To sum up, so far:

The iojs container gives us a consistent node stack.
By mapping the code as a volume and bundling the dependencies locally, we can run tests or execute anything.

In the second part, we will try to automate the process and integrate those ideas smoothly into our workflow.

Coding Environment

Containers provide a consistent way to package environments and distribute them. This is ideal to set up a development machine and share it with the team / world.
The following Dockerfile builds such an artifact:

    # Save it as provision/Dockerfile
    FROM ruby:latest

    RUN apt-get update && apt-get install -y tmux vim zsh
    RUN gem install tmuxinator

    ENV EDITOR "vim"
    # Inject development configuration
    ADD workspace.yml /root/.tmuxinator/workspace.yml

    ENTRYPOINT ["tmuxinator"]
    CMD ["start", "workspace"]

Tmux is a popular terminal multiplexer, and tmuxinator lets us easily control how to organize and navigate terminal windows. The configuration below sets up a single window split in three:

The main pane, where we can move around and edit files
The test pane, where tests continuously run on file changes
The repl pane, with a running interpreter

    # Save as provision/workspace.yml
    name: workspace
    # We find the same code path as earlier
    root: /app

    windows:
      - workspace:
          layout: main-vertical
          panes:
            - zsh
            # Watch files and rerun tests
            - docker exec -it code_worker_1 node_modules/.bin/mocha --watch
            - repl:
              # In case the worker container is still bootstrapping
              - sleep 3
              - docker exec -it code_worker_1 node

Let's dig into what's behind docker exec -it code_worker_1 node_modules/.bin/mocha --watch.

Workflow Deployment

This command supposes that an iojs container, named code_worker_1, is running. So we have two containers to orchestrate, and docker compose is a very elegant solution for that. The configuration file below describes how to run them.
    # This container has the necessary tech stack
    worker:
      image: iojs
      volumes:
        - .:/app
      working_dir: /app
      # Just hang around
      # The other container will be in charge of running interesting commands
      command: "while true; do echo hello world; sleep 10; done"

    # This one is our development environment
    workspace:
      # Build the dockerfile we described earlier
      build: ./provision
      # Make docker client available within the container
      volumes:
        - /var/run/docker.sock:/var/run/docker.sock
        - /usr/bin/docker:/usr/bin/docker
      # Make the code available within the container
      volumes_from:
        - worker
      stdin_open: true
      tty: true

Yaml gives us a very declarative expression of our machines. Let's breathe some life into them.

    $ # Run in detached mode
    $ docker-compose up -d
    $ # ...
    $ docker-compose ps
    Name               Command                       State
    -----------------------------------------------------
    code_worker_1      while true; do echo hello w   Up
    code_workspace_1   tmuxinator start workspace    Up

The code stack and the development environment are ready. We can reach them with docker attach code_workspace_1 and find a tmux session as configured above, with tests and repl in place. Once done, press ctrl-p + ctrl-q to detach the session from the container, and run docker-compose stop to stop both machines. Next time we develop on this project, a simple docker-compose up -d will bring back the entire stack and our favorite tools.

What's Next

We combined a lot of tools, but most of them use configuration files we can tweak. These are only the basics of a really promising line of thought. Indeed, we could easily consider more sophisticated development environments, with personal dotfiles and a better provisioning system. This is also true for the stack container, which could be dedicated to android code and run on a powerful 16GB RAM remote server. Containers unlock new potential for deployment, but also for development.
The consistency those technologies bring to the table should encourage best practices and automation, and help us write more reliable code, faster. Otherwise: [comic, courtesy of xkcd]

About the author

Xavier Bruhiere is the CEO of Hive Tech. He contributes to many community projects, including Oculus Rift, Myo, Docker and Leap Motion. In his spare time he enjoys playing tennis, the violin and the guitar. You can reach him at @XavierBruhiere.
Packt
20 Oct 2009
13 min read

Debugging REST Web Services

(For more resources on this subject, see here.)

Message tracing

The first symptom you will notice when you run into problems is that the client does not behave the way you want it to. For example, there may be no output, or the wrong output. Since the outcome of running a REST client depends on the request that you send over the wire and the response that you receive over the wire, one of the first things to do is capture the messages and verify that they are in the expected format.

REST services and clients interact using messages, usually in pairs of request and response. So if there are problems, they are caused by errors in the messages being exchanged.

Sometimes the user only has control over a REST client and does not have access to the implementation details of the service. Sometimes the user implements the REST service for others to consume. Sometimes the Web browser acts as the client. Sometimes a PHP application on a server acts as a REST client. Irrespective of where the client is and where the service is, you can use message capturing tools to capture messages and try to figure out the problem.

Because the service and client use messages to interact with each other, we can always place a message capturing tool in the middle. The tool does not have to run on the same machine as the client or the service; it can run on either machine, or on a third machine.

The following figure illustrates how the message interaction looks with a message capturing tool in place.

If the REST client is a Web browser and we want to capture the request and response involved in a message interaction, we have to point the Web browser to the message capturing tool and let the tool send the request to the service on behalf of the Web browser.
Then, since it is the tool that sent the request to the service, the service responds to the tool. The message capturing tool in turn sends the response it received from the service to the Web browser. In this scenario, the tool in the middle has access to both the request and the response, so it can display those messages for us to inspect.

When the client is not working, here is the list of things that you might need to check:

Whether the client sends a message
Whether you receive a response from the service
Whether the request message sent by the client is in the correct format, including the HTTP headers
Whether the response sent by the server is in the correct format, including the HTTP headers

In order to check the above, you need a message capturing tool to trace the messages. There are multiple tools that you can use to capture the messages that are sent from the client to the service and vice versa.

Wireshark (http://www.wireshark.org/) is one such tool that can capture any network traffic. It is an open-source tool, available under the GNU General Public License version 2. However, it can be a bit complicated if you are looking for a simple tool.

Apache TCPMon (http://ws.apache.org/commons/tcpmon/) is another tool, designed to trace web services messages. It is Java based. Because TCPMon is a message capturing tool, it can intercept messages sent between client and service and, as explained earlier, can be run on the client machine, the server machine or a third independent machine. The only catch is that you need Java installed on your system to run it. You can also find a C-based implementation of a similar tool with Apache Axis2/C (http://ws.apache.org/axis2/c). However, that tool does not have a graphical user interface.
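Whichever tool you pick, what it shows you is the raw text of each HTTP message, and the checks listed above come down to reading that text. As a rough sketch only, assuming a captured message has been copied out of the tool as a string, the following JavaScript splits it into start line, headers and body so each part can be inspected. The function name parseHttpMessage is hypothetical and is not part of TCPMon, Wireshark or any other tool mentioned here; the parsing is deliberately simplified (no folded headers, no repeated header names).

```javascript
// Hypothetical helper for inspecting a captured HTTP message.
// Simplified on purpose: one value per header name, no folded header lines.
function parseHttpMessage(raw) {
  // Treat CRLF and LF line endings the same way
  var normalized = raw.replace(/\r\n/g, '\n');

  // Headers and body are separated by the first empty line
  var separator = normalized.indexOf('\n\n');
  var head = separator === -1 ? normalized : normalized.slice(0, separator);
  var body = separator === -1 ? '' : normalized.slice(separator + 2);

  var lines = head.split('\n');
  var headers = {};
  lines.slice(1).forEach(function (line) {
    var colon = line.indexOf(':');
    if (colon > 0) {
      // Header names are case-insensitive, so store them in lower case
      var name = line.slice(0, colon).trim().toLowerCase();
      headers[name] = line.slice(colon + 1).trim();
    }
  });

  return { startLine: lines[0], headers: headers, body: body };
}

// Example with a made-up captured response
var captured = 'HTTP/1.1 404 Not Found\r\n' +
  'Content-Type: text/html\r\n' +
  'Content-Length: 9\r\n' +
  '\r\n' +
  'Not Found';
var message = parseHttpMessage(captured);
console.log(message.startLine);               // prints "HTTP/1.1 404 Not Found"
console.log(message.headers['content-type']); // prints "text/html"
```

With the message split this way, a missing start line, an unexpected status code or an absent header can be spotted without reading the whole capture by eye.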
There is a set of steps, more or less the same across all of these tools, that you need to follow in order to prepare the tool for capturing messages:

Define the target host name
Define the target port number
Define the listen port number

The target host name is the name of the host machine on which the service is running. As an example, if we want to debug the request sent to the Yahoo spelling suggestion service, hosted at http://search.yahooapis.com/WebSearchService/V1/spellingSuggestion, the host name is search.yahooapis.com. We can use either the name of the host or its IP address, because the tools accept both formats in place of the host name. As an example, if the service is hosted on the local machine, we could use either localhost or 127.0.0.1 in place of the host name.

The target port number is the port on which the web server hosting the service is listening; usually this is 80. As an example, for the Yahoo spelling suggestion service, hosted at http://search.yahooapis.com/WebSearchService/V1/spellingSuggestion, the target port number is 80. Note that when the service URL does not mention a port number, we can always use the default. If the service is running on a port other than 80, the port number appears in the URL after the host name, preceded by the character ':'. As an example, if our web server is running on port 8080 on the local machine, the service URL would be similar to http://localhost:8080/rest/04/library/book.php. Here, the host name is localhost and the target port is 8080.

The listen port is the port on which the tool listens to capture the messages from the client before sending them to the service. For example, say that we want to use port 9090 as our listen port to capture the messages while using the Yahoo spelling suggestion service. Under normal circumstances, we would use a URL similar to the following with the web browser to send the request to the service.
    http://search.yahooapis.com/WebSearchService/V1/spellingSuggestion?appid=YahooDemo&query=apocalipto

When we want to send this request through the message capturing tool, with the tool listening on port 9090 and running on the local machine, we use the following URL with the web browser in place of the original URL.

    http://localhost:9090/WebSearchService/V1/spellingSuggestion?appid=YahooDemo&query=apocalipto

Note that we are not sending this request directly to search.yahooapis.com, but rather to the tool listening on port 9090 on localhost. Once the tool receives the request, it captures it, forwards it to the target host, receives the response and forwards that response to the web browser.

The following figure shows the Apache TCPMon tool. You can see localhost being used as the target host, 80 as the target port number and 9090 as the listen port number. Once you fill in these fields and click on the Add button, a new tab is added to the tool. Its pane, shown in the next figure, displays the messages as the tool passes them between the client and the service.

Before you can capture the messages, there is one more step: change the client code to point to port 9090, since our monitoring tool is now listening on that port. Originally, we were using port 80

    $url = 'http://localhost:80/rest/04/library/book.php';

or just

    $url = 'http://localhost/rest/04/library/book.php';

because the default port number used by a web server is port 80, and the client was talking directly to the service. However, with the tool in place, we are going to make the client talk to the tool listening on port 9090. The tool in turn will talk to the service. Note that in this sample all three parties, the client, the service and the tool, run on the same machine.
So we will keep using localhost as our host name. Now we are going to change the service endpoint address used by the client to contain port 9090. This makes sure that the client talks to the tool.

    $url = 'http://localhost:9090/rest/04/library/book.php';

As you can see, the tool has captured the request and the response. The request appears at the top and the response at the bottom. The request is a GET request to the resource located at /rest/04/library/book.php. The response is a success response, with the HTTP 200 OK code. After the HTTP headers comes the response body, which is in XML.

As mentioned earlier, the first step in debugging is to verify whether the client has sent a request and whether the service has responded. In the above example, we have both the request and the response in place. If either is missing, we need to check what is wrong on the corresponding side.

If the client request is missing, check the following in the code:

Are you using the correct URL in the client?
Have you written the request to the wire in the client? Usually this is done by the function curl_exec when using Curl.

If the response is missing, check the following:

Are you connected to the network? Your service may be hosted on a remote machine.
Have you written a response from the service? That is, have you returned the correct string value from the service? In PHP, this is usually done by using the echo function to write the required response to the wire. If you are using a PHP framework, you may have to use the framework-specific mechanism. For example, if you are using the Zend_Rest_Server class, you have to call the handle() method to make sure that the response is sent to the client.

Here is a sample error scenario. As you can see, the response is 404 Not Found. And if you look at the request, you see that there is a typo in it.
We have missed the 'k' in our resource URL, so we sent the request to /rest/04/library/boo.php, which does not exist, whereas the correct resource URL is /rest/04/library/book.php.

Next, let us look at the Yahoo search example discussed earlier to explore some more advanced concepts. We want to capture the request sent by the web browser and the response sent by the server for this request:

    http://search.yahooapis.com/WebSearchService/V1/spellingSuggestion?appid=YahooDemo&query=apocalipto

As discussed earlier, the target host name is search.yahooapis.com and the target port number is 80. Let's use 9091 as the listen port. Let us use the web browser to send the request through the tool so that we can capture the request and response. Since the tool is listening on port 9091, we use the following URL with the web browser.

    http://localhost:9091/WebSearchService/V1/spellingSuggestion?appid=YahooDemo&query=apocalipto

When you use the above URL with the web browser, the web browser sends the request to the tool, and the tool gets the response from the service and forwards it to the web browser. We can see that the web browser gets the response. However, if we look at the messages captured by TCPMon, we see that the service has sent some binary data instead of XML data, even though the Web browser is displaying the response in XML format.

So what went wrong? In fact, nothing is wrong. The service sent the data in binary format because the web browser requested that format. If you look closely at the request that was sent, you will see the following.
    GET /WebSearchService/V1/spellingSuggestion?appid=YahooDemo&query=apocalipto HTTP/1.1
    Host: search.yahooapis.com:9091
    User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.9) Gecko/20071025 Firefox/2.0.0.9
    Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
    Accept-Language: en-us,en;q=0.7,zh-cn;q=0.3
    Accept-Encoding: gzip,deflate
    Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
    Keep-Alive: 300
    Connection: keep-alive

In the request, the web browser has used the following HTTP header.

    Accept-Encoding: gzip,deflate

This tells the service that the web browser can handle data that comes in gzip compressed format, so the service sends the data to the web browser compressed. Obviously, it is not possible to look into the XML messages and debug them if the response is compressed, so ideally we should capture the messages in XML format. To do this, we can modify the request message in the TCPMon pane itself and resend it. First remove the line

    Accept-Encoding: gzip,deflate

Then click on the Resend button. Once we click on the Resend button, we get the response in XML format.

Errors in building XML

While forming XML as a request payload or response payload, we can run into errors through simple mistakes. Some are easy to spot, but some are not. Most XML errors can be avoided by following a simple rule of thumb: each opening XML tag should have an equivalent closing tag. Forgetting the closing tag is the most common mistake made while building XML payloads.

In the above diagram, if you look carefully in the circle, the ending tag for the book element is missing. A new starting tag for a new book is started before the first book is closed. This causes the XML parsing on the client side to fail. In this case I am using the Library system sample, and here is the PHP source code causing the problem.
    echo "<books>";
    while ($line = mysql_fetch_array($result, MYSQL_ASSOC)) {
        echo "<book>";
        foreach ($line as $key => $col_value) {
            echo "<$key>$col_value</$key>";
        }
        //echo "</book>";
    }
    echo "</books>";

Here I have intentionally commented out the line that prints the closing tag to demonstrate the error scenario. However, while writing this code, I could just as well have forgotten that line, leaving a bug in the system.

While looking for XML-related errors, you can use the manual technique that we just used: look for missing tags. If the process looks complicated and you cannot find any XML errors in the response or request that you are trying to debug, you can copy the XML captured with the tool and run it through an XML validator. For example, you can use an online tool such as http://www.w3schools.com/XML/xml_validator.asp. You can also check if the XML file is well formed using an XML parser.
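The rule of thumb above, that each opening tag needs a matching closing tag, can also be checked mechanically on the XML copied out of the capturing tool. The following JavaScript is a minimal sketch of that idea and is not taken from the Library sample; the function name findTagMismatch is hypothetical. It assumes simple markup like the payloads shown here and is not a substitute for a real XML parser: it ignores comments, CDATA sections and attribute values that contain '>'.

```javascript
// Hypothetical helper: returns null when every opening tag has a matching
// closing tag, otherwise a short description of the first problem found.
function findTagMismatch(xml) {
  var stack = [];
  // Captures: optional leading "/", the tag name, optional trailing "/"
  var tagPattern = /<(\/?)([A-Za-z_][\w.\-]*)[^>]*?(\/?)>/g;
  var match;

  while ((match = tagPattern.exec(xml)) !== null) {
    var isClosing = match[1] === '/';
    var name = match[2];
    var isSelfClosing = match[3] === '/';

    if (isSelfClosing) {
      continue; // <tag/> opens and closes itself
    }
    if (!isClosing) {
      stack.push(name); // remember which tag is waiting to be closed
      continue;
    }

    var expected = stack.pop();
    if (expected !== name) {
      return 'found </' + name + '> but expected ' +
        (expected ? '</' + expected + '>' : 'no closing tag');
    }
  }

  // Anything left on the stack was opened but never closed
  return stack.length ? 'missing </' + stack[stack.length - 1] + '>' : null;
}

// Output shaped like the buggy PHP loop above, where </book> is never printed
var broken = '<books><book><id>1</id><book><id>2</id></books>';
console.log(findTagMismatch(broken)); // prints "found </books> but expected </book>"
```

A check like this only reports the first mismatch and says nothing about validity against a schema, so a validator or a proper parser remains the more reliable option once the obvious missing tags are fixed.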
Packt
22 Jul 2013
16 min read

Vaadin project with Spring and Handling login with Spring

(For more resources related to this topic, see here.)

Setting up a Vaadin project with Spring in Maven

We will set up a new Maven project for a Vaadin application that uses the Spring framework. We will use a Java annotation-driven approach for Spring configuration instead of XML configuration files. This means that we will reduce the usage of XML to the necessary minimum (XML fans, don't worry, there will still be enough XML to edit).

In this recipe, we will set up a Spring project where we define a bean that can be obtained from the Spring application context in the Vaadin code. As the final result, we will greet a lady named Adela, so we display the text Hi Adela! on the screen. The brilliant thing about this is that we get the greeting text from the bean that we define via Spring.

Getting ready

First, we create a new Maven project.

    mvn archetype:generate \
      -DarchetypeGroupId=com.vaadin \
      -DarchetypeArtifactId=vaadin-archetype-application \
      -DarchetypeVersion=LATEST \
      -Dpackaging=war \
      -DgroupId=com.packtpub.vaadin \
      -DartifactId=vaadin-with-spring \
      -Dversion=1.0

More information about Maven and Vaadin can be found at https://vaadin.com/book/-/page/getting-started.maven.html.

How to do it...

Carry out the following steps in order to set up a Vaadin project with Spring in Maven:

First, we need to add the necessary dependencies.
Just add the following Maven dependencies into the pom.xml file:

    <dependency>
      <groupId>org.springframework</groupId>
      <artifactId>spring-core</artifactId>
      <version>${spring.version}</version>
    </dependency>
    <dependency>
      <groupId>org.springframework</groupId>
      <artifactId>spring-context</artifactId>
      <version>${spring.version}</version>
    </dependency>
    <dependency>
      <groupId>org.springframework</groupId>
      <artifactId>spring-web</artifactId>
      <version>${spring.version}</version>
    </dependency>
    <dependency>
      <groupId>cglib</groupId>
      <artifactId>cglib</artifactId>
      <version>2.2.2</version>
    </dependency>

In the preceding code, we are referring to the spring.version property. Make sure the Spring version is defined inside the properties tag in the pom.xml file.

    <properties>
      …
      <spring.version>3.1.2.RELEASE</spring.version>
    </properties>

At the time of writing, the latest version of Spring was 3.1.2. Check the latest version of the Spring framework at http://www.springsource.org/spring-framework.

The last step in the Maven configuration file is to add the new repository into pom.xml. Maven needs to know where to download the Spring dependencies from.

    <repositories>
      …
      <repository>
        <id>springsource-repo</id>
        <name>SpringSource Repository</name>
        <url>http://repo.springsource.org/release</url>
      </repository>
    </repositories>

Now we need to add a few lines of XML into the src/main/webapp/WEB-INF/web.xml deployment descriptor file. At this point, we make the first step in connecting Spring with Vaadin. The location of the AppConfig class needs to match the full class name of the configuration class.
    <context-param>
      <param-name>contextClass</param-name>
      <param-value>org.springframework.web.context.support.AnnotationConfigWebApplicationContext</param-value>
    </context-param>
    <context-param>
      <param-name>contextConfigLocation</param-name>
      <param-value>com.packtpub.vaadin.AppConfig</param-value>
    </context-param>
    <listener>
      <listener-class>org.springframework.web.context.ContextLoaderListener</listener-class>
    </listener>

Create a new class AppConfig inside the com.packtpub.vaadin package and annotate it with the @Configuration annotation. Then create a new @Bean definition as shown:

    package com.packtpub.vaadin;

    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;

    @Configuration
    public class AppConfig {

        @Bean(name = "userService")
        public UserService helloWorld() {
            return new UserServiceImpl();
        }
    }

To complete the recipe, we need a class that represents a domain object. Create a new class called User.

    public class User {
        private String name;
        // generate getters and setters for name field
    }

UserService is a simple interface defining a single method called getUser(). When the getUser() method is called in this recipe, we always create and return a new instance of the user (in the future, we could add parameters, for example a login, and fetch the user from the database). UserServiceImpl is the implementation of this interface. As mentioned, we could replace that implementation with something smarter than just returning a new instance of the same user every time the getUser() method is called.

    public interface UserService {
        public User getUser();
    }

    public class UserServiceImpl implements UserService {

        @Override
        public User getUser() {
            User user = new User();
            user.setName("Adela");
            return user;
        }
    }

Almost everything is ready now. We just make a new UI and get the application context, from which we get the bean. Then, we call the service and obtain a user that we show in the browser.
After we are done with the UI, we can run the application.

    import javax.servlet.ServletContext;
    import javax.servlet.http.HttpSession;

    import com.vaadin.server.VaadinRequest;
    import com.vaadin.server.WrappedHttpSession;
    import com.vaadin.server.WrappedSession;
    import com.vaadin.ui.Label;
    import com.vaadin.ui.UI;
    import com.vaadin.ui.VerticalLayout;
    import org.springframework.context.ApplicationContext;
    import org.springframework.web.context.support.WebApplicationContextUtils;

    public class AppUI extends UI {

        private ApplicationContext context;

        @Override
        protected void init(VaadinRequest request) {
            UserService service = getUserService(request);
            User user = service.getUser();
            String name = user.getName();
            Label lblUserName = new Label("Hi " + name + " !");

            VerticalLayout layout = new VerticalLayout();
            layout.setMargin(true);
            setContent(layout);
            layout.addComponent(lblUserName);
        }

        // Looks up the Spring application context through the servlet context
        private UserService getUserService(VaadinRequest request) {
            WrappedSession session = request.getWrappedSession();
            HttpSession httpSession = ((WrappedHttpSession) session).getHttpSession();
            ServletContext servletContext = httpSession.getServletContext();
            context = WebApplicationContextUtils.getRequiredWebApplicationContext(servletContext);
            return (UserService) context.getBean("userService");
        }
    }

Run the following Maven commands in order to compile the widget set and run the application:

    mvn package
    mvn jetty:run

How it works...

In the first step, we added the dependencies on Spring. There was one additional dependency on cglib, the Code Generation Library. This library is required by the @Configuration annotation and is used by Spring for making proxy objects. More information about cglib can be found at http://cglib.sourceforge.net

Then, we added contextClass, contextConfigLocation and ContextLoaderListener to the web.xml file. All of these are needed in order to initialize the application context properly. Thanks to this, we are able to get the application context by calling the following code:

    WebApplicationContextUtils.getRequiredWebApplicationContext(servletContext);

Then, we made UserService, which is not a real service in this case (a real one was outside the scope of this recipe). We will have a look at how to declare Spring services in the following recipes.

In the last step, we got the application context by using the WebApplicationContextUtils class from Spring.
    WrappedSession session = request.getWrappedSession();
    HttpSession httpSession = ((WrappedHttpSession) session).getHttpSession();
    ServletContext servletContext = httpSession.getServletContext();
    context = WebApplicationContextUtils.getRequiredWebApplicationContext(servletContext);

Then, we obtained an instance of UserService from the Spring application context.

    UserService service = (UserService) context.getBean("userService");
    User user = service.getUser();

We can also obtain a bean without knowing its name, because it can be looked up by type, like this: context.getBean(UserService.class).

There's more...

Using the @Autowired annotation in classes that are not managed by Spring (classes that are not defined in AppConfig in our case) will not work, so no instances will be set via the @Autowired annotation.

Handling login with Spring

We will create a login functionality in this recipe. The user will be able to log in as admin or client. We will not use a database in this recipe. We will use a dummy service where we just hardcode two users. The first user will be "admin" and the second user will be "client". There will also be two authorities (or roles), ADMIN and CLIENT. We will use Java annotation-driven Spring configuration.

Getting ready

Create a new Maven project from the Vaadin archetype.

    mvn archetype:generate \
      -DarchetypeGroupId=com.vaadin \
      -DarchetypeArtifactId=vaadin-archetype-application \
      -DarchetypeVersion=LATEST \
      -Dpackaging=war \
      -DgroupId=com.app \
      -DartifactId=vaadin-spring-login \
      -Dversion=1.0

The Maven archetype generates the basic structure of the project. We will add the packages and classes, so the project will have the following directory and file structure:

How to do it...
Carry out the following steps in order to create a login with the Spring framework:

We need to add Maven dependencies in pom.xml for spring-core, spring-context, spring-web, spring-security-core, spring-security-config, and cglib (cglib is required by the @Configuration annotation from Spring).

    <dependency>
      <groupId>org.springframework</groupId>
      <artifactId>spring-core</artifactId>
      <version>${spring.version}</version>
    </dependency>
    <dependency>
      <groupId>org.springframework</groupId>
      <artifactId>spring-context</artifactId>
      <version>${spring.version}</version>
    </dependency>
    <dependency>
      <groupId>org.springframework</groupId>
      <artifactId>spring-web</artifactId>
      <version>${spring.version}</version>
    </dependency>
    <dependency>
      <groupId>org.springframework.security</groupId>
      <artifactId>spring-security-core</artifactId>
      <version>${spring.version}</version>
    </dependency>
    <dependency>
      <groupId>org.springframework.security</groupId>
      <artifactId>spring-security-config</artifactId>
      <version>${spring.version}</version>
    </dependency>
    <dependency>
      <groupId>cglib</groupId>
      <artifactId>cglib</artifactId>
      <version>2.2.2</version>
    </dependency>

Now we edit the web.xml file, so Spring knows we want to use the annotation-driven configuration approach. The path to the AppConfig class must match the full class name (including the package name).

    <context-param>
      <param-name>contextClass</param-name>
      <param-value>org.springframework.web.context.support.AnnotationConfigWebApplicationContext</param-value>
    </context-param>
    <context-param>
      <param-name>contextConfigLocation</param-name>
      <param-value>com.app.config.AppConfig</param-value>
    </context-param>
    <listener>
      <listener-class>org.springframework.web.context.ContextLoaderListener</listener-class>
    </listener>

We referred to the AppConfig class in the previous step. Let's implement that class now. AppConfig needs to be annotated with the @Configuration annotation, so Spring can accept it as the context configuration class.
We also add the @ComponentScan annotation, which makes sure that Spring scans the specified packages for Spring components. The package names inside the @ComponentScan annotation need to match the packages that we want to include for scanning. When a component (a class that is annotated with the @Component annotation) is found and there is an @Autowired annotation inside, the auto wiring happens automatically.

    package com.app.config;

    import com.app.auth.AuthManager;
    import com.app.service.UserService;
    import com.app.ui.LoginFormListener;
    import com.app.ui.LoginView;
    import com.app.ui.UserView;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.ComponentScan;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.context.annotation.Scope;

    @Configuration
    @ComponentScan(basePackages = {"com.app.ui", "com.app.auth", "com.app.service"})
    public class AppConfig {

        @Bean
        public AuthManager authManager() {
            AuthManager res = new AuthManager();
            return res;
        }

        @Bean
        public UserService userService() {
            UserService res = new UserService();
            return res;
        }

        @Bean
        public LoginFormListener loginFormListener() {
            return new LoginFormListener();
        }
    }

We are defining three beans in AppConfig. We will implement them in this step. AuthManager will take care of the login process.

    package com.app.auth;

    import com.app.service.UserService;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.security.authentication.AuthenticationManager;
    import org.springframework.security.authentication.BadCredentialsException;
    import org.springframework.security.authentication.UsernamePasswordAuthenticationToken;
    import org.springframework.security.core.Authentication;
    import org.springframework.security.core.AuthenticationException;
    import org.springframework.security.core.GrantedAuthority;
    import org.springframework.security.core.userdetails.UserDetails;
    import org.springframework.stereotype.Component;

    import java.util.Collection;

    @Component
    public class AuthManager implements AuthenticationManager {

        @Autowired
        private UserService userService;

        public Authentication authenticate(Authentication auth) throws AuthenticationException {
            String username = (String) auth.getPrincipal();
            String password = (String) auth.getCredentials();
            UserDetails user = userService.loadUserByUsername(username);
            if (user != null && user.getPassword().equals(password)) {
                Collection<? extends GrantedAuthority> authorities = user.getAuthorities();
                return new UsernamePasswordAuthenticationToken(username, password, authorities);
            }
            throw new BadCredentialsException("Bad Credentials");
        }
    }

UserService will fetch a user based on the passed login. UserService will be used by AuthManager.

    package com.app.service;

    import org.springframework.security.core.GrantedAuthority;
    import org.springframework.security.core.authority.GrantedAuthorityImpl;
    import org.springframework.security.core.authority.SimpleGrantedAuthority;
    import org.springframework.security.core.userdetails.UserDetails;
    import org.springframework.security.core.userdetails.UserDetailsService;
    import org.springframework.security.core.userdetails.UsernameNotFoundException;
    import org.springframework.security.core.userdetails.User;
    import org.springframework.stereotype.Service;

    import java.util.ArrayList;
    import java.util.List;

    public class UserService implements UserDetailsService {

        @Override
        public UserDetails loadUserByUsername(String username) throws UsernameNotFoundException {
            List<GrantedAuthority> authorities = new ArrayList<GrantedAuthority>();
            // fetch user from e.g. DB
            if ("client".equals(username)) {
                authorities.add(new SimpleGrantedAuthority("CLIENT"));
                User user = new User(username, "pass", true, true, false, false, authorities);
                return user;
            }
            if ("admin".equals(username)) {
                authorities.add(new SimpleGrantedAuthority("ADMIN"));
                User user = new User(username, "pass", true, true, false, false, authorities);
                return user;
            } else {
                return null;
            }
        }
    }

LoginFormListener is just a listener that initiates the login process, so it cooperates with AuthManager.

    package com.app.ui;

    import com.app.auth.AuthManager;
    import com.vaadin.navigator.Navigator;
    import com.vaadin.ui.*;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.security.authentication.UsernamePasswordAuthenticationToken;
    import org.springframework.security.core.Authentication;
    import org.springframework.security.core.AuthenticationException;
    import org.springframework.security.core.context.SecurityContextHolder;
    import org.springframework.stereotype.Component;

    @Component
    public class LoginFormListener implements Button.ClickListener {

        @Autowired
        private AuthManager authManager;

        @Override
        public void buttonClick(Button.ClickEvent event) {
            try {
                Button source = event.getButton();
                LoginForm parent = (LoginForm) source.getParent();
                String username = parent.getTxtLogin().getValue();
                String password = parent.getTxtPassword().getValue();

                UsernamePasswordAuthenticationToken request =
                        new UsernamePasswordAuthenticationToken(username, password);
                Authentication result = authManager.authenticate(request);
                SecurityContextHolder.getContext().setAuthentication(result);

                AppUI current = (AppUI) UI.getCurrent();
                Navigator navigator = current.getNavigator();
                navigator.navigateTo("user");
            } catch (AuthenticationException e) {
                Notification.show("Authentication failed: " + e.getMessage());
            }
        }
    }

The login form will be made as a separate Vaadin component.
We will use the application context and that way we get bean from the application context by ourselves. So, we are not using auto wiring in LoginForm. package com.app.ui; import com.vaadin.ui.*; import org.springframework.context.ApplicationContext; public class LoginForm extends VerticalLayout { private TextField txtLogin = new TextField("Login: "); private PasswordField txtPassword = new PasswordField("Password: "); private Button btnLogin = new Button("Login"); public LoginForm() { addComponent(txtLogin); addComponent(txtPassword); addComponent(btnLogin); LoginFormListener loginFormListener = getLoginFormListener(); btnLogin.addClickListener(loginFormListener); } public LoginFormListener getLoginFormListener() { AppUI ui = (AppUI) UI.getCurrent(); ApplicationContext context = ui.getApplicationContext(); return context.getBean(LoginFormListener.class); } public TextField getTxtLogin() { return txtLogin; } public PasswordField getTxtPassword() { return txtPassword; } } We will use Navigator for navigating between different views in our Vaadin application. We make two views. The first is for login and the second is for showing the user detail when the user is logged into the application. Both classes will be in the com.app.ui package. LoginView will contain just the components that enable a user to log in (text fields and button). public class LoginView extends VerticalLayout implements View { public LoginView() { LoginForm loginForm = new LoginForm(); addComponent(loginForm); } @Override public void enter(ViewChangeListener.ViewChangeEvent event) { } }; UserView needs to identify whether the user is logged in or not. For this, we will use SecurityContextHolder that obtains the SecurityContext that holds the authentication data. If the user is logged in, then we display some data about him/her. If not, then we navigate him/her to the login form. 
public class UserView extends VerticalLayout implements View { public void enter(ViewChangeListener.ViewChangeEvent event) { removeAllComponents(); SecurityContext context = SecurityContextHolder.getContext(); Authentication authentication = context.getAuthentication(); if (authentication != null && authentication.isAuthenticated()) { String name = authentication.getName(); Label labelLogin = new Label("Username: " + name); addComponent(labelLogin); Collection<? extends GrantedAuthority> authorities = authentication.getAuthorities(); for (GrantedAuthority ga : authorities) { String authority = ga.getAuthority(); if ("ADMIN".equals(authority)) { Label lblAuthority = new Label("You are the administrator. "); addComponent(lblAuthority); } else { Label lblAuthority = new Label("Granted Authority: " + authority); addComponent(lblAuthority); } } Button logout = new Button("Logout"); LogoutListener logoutListener = new LogoutListener(); logout.addClickListener(logoutListener); addComponent(logout); } else { Navigator navigator = UI.getCurrent().getNavigator(); navigator.navigateTo("login"); } } } We have mentioned LogoutListener in the previous step. Here is how that class could look: public class LogoutListener implements Button.ClickListener { @Override public void buttonClick(Button.ClickEvent clickEvent) { SecurityContextHolder.clearContext(); UI.getCurrent().close(); Navigator navigator = UI.getCurrent().getNavigator(); navigator.navigateTo("login"); } } Everything is ready for the final AppUI class. In this class, we put in to practice all that we have created in the previous steps. We need to get the application context. That is done in the first lines of code in the init method. In order to obtain the application context, we need to get the session from the request, and from the session get the servlet context. Then, we use the Spring utility class, WebApplicationContextUtils, and we find the application context by using the previously obtained servlet context. 
After that, we set up the navigator. @PreserveOnRefresh public class AppUI extends UI { private ApplicationContext applicationContext; @Override protected void init(VaadinRequest request) { WrappedSession session = request.getWrappedSession(); HttpSession httpSession = ((WrappedHttpSession) session).getHttpSession(); ServletContext servletContext = httpSession.getServletContext(); applicationContext = WebApplicationContextUtils. getRequiredWebApplicationContext(servletContext); Navigator navigator = new Navigator(this, this); navigator.addView("login", LoginView.class); navigator.addView("user", UserView.class); navigator.navigateTo("login"); setNavigator(navigator); } public ApplicationContext getApplicationContext() { return applicationContext; } } Now we can run the application. The password for usernames client and admin is pass. mvn package mvn jetty:run How it works... There are two tricky parts from the development point of view while making the application: First is how to get the Spring application context in Vaadin. For this, we need to make sure that contextClass, contextConfigLocation, and ContextLoaderListener are defined in the web.xml file. Then we need to know how to get Spring application context from the VaadinRequest. We certainly need a reference to the application context in UI, so we define the applicationContext class field together with the public getter (because we need access to the application context from other classes, to get Spring beans). The second part, which is a bit tricky, is the AppConfig class. That class represents annotated Spring application configuration (which is referenced from the web.xml file). We needed to define what packages Spring should scan for components. For this, we have used the @ComponentScan annotation. The important thing to keep in mind is that the @Autowired annotation will work only for Spring managed beans that we have defined in AppConfig. 
When we try to add the @Autowired annotation to a plain Vaadin component that Spring does not manage, the autowired reference stays null because no autowiring happens. It is up to us to decide which instances Spring should manage and where we retrieve beans from the Spring application context ourselves.

Summary

In this article, we saw how to add Spring to a Maven project. We also looked at handling login with Spring.

Further resources on this subject:
Vaadin Portlets in Liferay User Interface Development [Article]
Creating a Basic Vaadin Project [Article]
Getting Started with Ext GWT [Article]

Steve Miles
15 Jun 2023
8 min read

Implementing Azure AD Protection with ChatGPT

Introduction

Cybersecurity professionals face numerous challenges daily, from threat detection to incident response. The advent of AI-powered language models, also called Generative AI, such as ChatGPT or Google's Bard, has revolutionized how experts approach their tasks. In this tutorial, we will explore how ChatGPT can assist cybersecurity professionals in performing various tasks efficiently and effectively. From analyzing logs and conducting risk assessments to developing incident response strategies, ChatGPT's capabilities can be harnessed to streamline workflows and enhance productivity. In this blog, let's dive into the practical applications and benefits of integrating Generative AI into (cyber)security operations.

In this article, we will cover a tutorial on implementing Azure AD Identity Protection with ChatGPT and also cover certain other areas of cybersecurity where GPT can be beneficial.

Implementing Azure AD Identity Protection with ChatGPT

Azure AD Identity Protection helps organizations safeguard their Azure Active Directory (Azure AD) identities by detecting and mitigating identity-related risks. In this section, we will explore how ChatGPT can assist in implementing Azure AD Identity Protection through code examples using Python and the Microsoft Graph API.

1. Set up the Environment

Before we begin, ensure that you have the following prerequisites in place:
- Python is installed on your machine.
- The requests library is installed. You can install it using the following command: pip install requests
- An Azure AD application is registered with the appropriate permissions to access Azure AD Identity Protection.

2. Acquire Access Token

To interact with the Microsoft Graph API, we must acquire an access token.
Use the following Python code to obtain the access token:

```python
import requests

# Azure AD application details
tenant_id = 'YOUR_TENANT_ID'
client_id = 'YOUR_CLIENT_ID'
client_secret = 'YOUR_CLIENT_SECRET'

# Microsoft identity platform token endpoint
token_url = f'https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token'

# Request an access token using the client credentials flow
payload = {
    'grant_type': 'client_credentials',
    'client_id': client_id,
    'client_secret': client_secret,
    'scope': 'https://graph.microsoft.com/.default'
}
response = requests.post(token_url, data=payload)
if response.status_code == 200:
    access_token = response.json()['access_token']
else:
    # Stop here: the later steps cannot run without a token
    raise SystemExit(f'Error: Failed to obtain access token ({response.status_code})')
```

Make sure to replace the placeholders with your Azure AD application details.

3. Query Azure AD Identity Protection Data with ChatGPT

Now that we have the access token, we can retrieve Azure AD Identity Protection data from Microsoft Graph and ask ChatGPT for insights. Use the following code example to interact with the model and retrieve identity protection insights:

```python
import openai

openai.api_key = 'YOUR_OPENAI_API_KEY'

def query_model(question):
    # Uses the legacy Completions interface of the openai package (pre-1.0)
    response = openai.Completion.create(
        engine='text-davinci-003',
        prompt=question,
        max_tokens=100,
        temperature=0.5,
        n=1,
        stop=None,
    )
    if response.choices:
        return response.choices[0].text.strip()
    else:
        return None

# Example question for querying Azure AD Identity Protection data
question = "Which users has Azure AD Identity Protection recently flagged as risky, and how should we respond?"

# Microsoft Graph API endpoint for risky users
graph_api_url = 'https://graph.microsoft.com/v1.0/identityProtection/riskyUsers'

# Send API request with the access token
headers = {
    'Authorization': f'Bearer {access_token}',
    'Content-Type': 'application/json'
}
response = requests.get(graph_api_url, headers=headers)
if response.status_code == 200:
    risky_users = response.json()
    # Process the response as needed
    # ...

    # Query the AI model for insights or recommendations.
    # Note: as written, only `question` is sent to the model; include
    # relevant fields from `risky_users` in the prompt if the model
    # should reason about your actual data.
    insights = query_model(question)
    if insights:
        print("Identity Protection Insights:")
        print(insights)
    else:
        print("Error: Failed to obtain insights from the AI model")
else:
    print("Error: Failed to retrieve risky users data from Azure AD Identity Protection")
```

Ensure you have appropriate permissions and update the `graph_api_url` with the relevant endpoint for the Azure AD Identity Protection data you want to retrieve.

4. Interpret and Utilize Insights

Once you obtain insights from the AI model, interpret and use them to improve your identity protection practices. This could involve taking proactive measures to mitigate risks, investigating suspicious activities, or implementing additional security measures based on the recommendations provided.

Remember to adapt the code examples to your specific requirements and refer to the Microsoft Graph API documentation for available endpoints and data structures: https://learn.microsoft.com/en-us/graph/

Other application areas

1. Analyzing Log Files

One of the most important aspects of cybersecurity is analyzing log files for suspicious activity and potential security breaches. ChatGPT can help businesses automate this process. When log files are supplied to the model, ChatGPT can quickly identify patterns, anomalies, and potentially malicious activities. This analysis allows cybersecurity professionals to focus on the most important issues, saving valuable time and effort.
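As a minimal sketch of the log-analysis idea above, the following helper pre-filters a log excerpt locally and builds a prompt that could then be passed to a model, for example through a function like `query_model`. The log format, the function names `summarize_failed_logins` and `build_log_prompt`, and the threshold are illustrative assumptions rather than part of any library or of the author's workflow.

```python
import re
from collections import Counter

# Hypothetical log format: "<timestamp> LOGIN_FAILED user=<name> ip=<address>"
FAILED_LOGIN = re.compile(r"LOGIN_FAILED user=(\S+) ip=(\S+)")

def summarize_failed_logins(lines, threshold=3):
    """Return {ip: count} for source IPs with at least `threshold` failed logins."""
    counts = Counter()
    for line in lines:
        match = FAILED_LOGIN.search(line)
        if match:
            counts[match.group(2)] += 1
    return {ip: n for ip, n in counts.items() if n >= threshold}

def build_log_prompt(lines, threshold=3):
    """Build a prompt asking a model to assess the noisiest source IPs.

    Returns None when nothing crosses the threshold, so no request is needed.
    """
    suspicious = summarize_failed_logins(lines, threshold)
    if not suspicious:
        return None
    details = "\n".join(
        f"- {ip}: {n} failed logins" for ip, n in sorted(suspicious.items())
    )
    return (
        "The following source IPs had repeated failed logins:\n"
        f"{details}\n"
        "Summarize the likely causes and suggest next investigation steps."
    )

# Illustrative sample data (documentation-range IP addresses)
sample_log = [
    "2023-06-15T10:00:01Z LOGIN_FAILED user=alice ip=203.0.113.7",
    "2023-06-15T10:00:03Z LOGIN_FAILED user=alice ip=203.0.113.7",
    "2023-06-15T10:00:05Z LOGIN_OK user=bob ip=198.51.100.2",
    "2023-06-15T10:00:09Z LOGIN_FAILED user=carol ip=198.51.100.2",
    "2023-06-15T10:00:12Z LOGIN_FAILED user=alice ip=203.0.113.7",
]

prompt = build_log_prompt(sample_log)
```

Pre-filtering in this way keeps the prompt short and avoids sending raw log lines, which may contain sensitive data, to an external service; whether that trade-off suits a given environment is a judgment call.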
In addition, ChatGPT's ability to create human-readable summaries of log data simplifies the interpretation and communication of findings for stakeholders.

2. Conducting Risk Assessments

Conducting a comprehensive risk assessment is essential to understanding an organization's security posture. ChatGPT can help in this process by providing context and insights. By interacting with the model, organizations can ask specific questions about potential vulnerabilities, attacks, or best practices related to their risk assessments. ChatGPT's feedback adds to knowledge of the organization's security environment and offers actionable insights that help businesses identify and prioritize risks and remediation tasks.

3. Developing Incident Response Strategies

Time is of the essence in a cybersecurity incident. Generative AI can be an invaluable tool for developing effective incident response mechanisms. By leveraging its natural language processing capabilities, businesses can use ChatGPT to brainstorm and optimize response processes. The model can provide recommendations based on historical data, industry standards, and best practices, helping to create robust and efficient incident response systems. Generative AI can understand and generate human-like responses, making it a useful virtual security analyst for cybersecurity professionals in high-pressure and time-sensitive situations.

4. Automating Routine Tasks

Cybersecurity professionals often face an increasing volume and velocity of repetitive, time-consuming tasks, such as vulnerability assessments, log analysis, and updating firewall rules. Generative AI can help automate these routine tasks, freeing experts to focus on complex, high-value organizational security challenges. By integrating ChatGPT with existing automation frameworks, organizations can create chatbot-like interfaces that interact with the model to perform pre-defined actions.
This approach increases productivity and reduces the risk of human error associated with manual processing.

5. Enhancing Threat Intelligence Analysis

Effective threat reporting is essential for proactive cybersecurity defenses. Generative AI can enhance threat intelligence data analysis by extracting insights from a vast repository of security information. By asking about emerging threats, known vulnerabilities, or attack techniques, administrators can gain a deeper understanding of the ongoing threat landscape. ChatGPT's ability to understand complex security issues enhances the accuracy and relevance of threat intelligence reports, contributing to timely decision-making.

Conclusion

In conclusion, ChatGPT can make it easier and more efficient to implement Azure AD Identity Protection. As the cybersecurity landscape continues to evolve, businesses must embrace AI-powered solutions to stay ahead of malicious actors. Generative AI provides valuable support for various cybersecurity tasks, including log analysis, risk assessment, incident response planning, workflow automation, and threat intelligence analysis, enabling cybersecurity professionals to streamline their workflow, increase productivity, and make more informed decisions. While it is important to exercise proper judgment and handle credentials carefully when implementing AI models, integrating Generative AI such as ChatGPT into the cybersecurity industry offers significant opportunities for businesses to manage their tasks faster, more accurately, and more efficiently.

Author Bio

Steve Miles (SMiles) is the CTO responsible for the tools and technologies selection for the cloud practice of a multi-billion turnover IT distributor based in the UK and Ireland. He is also a multi-cloud and hybrid technology strategist with 20+ years of telco, co-location, hosted data center, hybrid, and multi-cloud infrastructure experience.
Steve is an Alibaba Cloud MVP (Most Valuable Professional), as well as a Microsoft Azure MVP (Most Valuable Professional) and MCT (Microsoft Certified Trainer). He is a published freelance author of Microsoft technology and certification guides, as well as an editorial and technical reviewer. Amongst many hybrid/cloud-based certifications, he is Alibaba Cloud Certified, with 20+ Cloud/Hybrid based Microsoft certifications, 14 of those being in Azure.

His roles have included network security architect, global solutions architect, public cloud security solutions architect, and Azure practice technical lead. He currently works for a leading multi-cloud distributor based in the UK and Dublin in a cloud and hybrid technology leadership role.

His first Microsoft certification was on Windows NT. He is an MCP, MCITP, MCSA, and MCSE for Windows Server and many other Microsoft products. He also holds multiple Microsoft Fundamentals, Associate, Expert, and Specialty certifications in Azure Security, Identity, Network, M365, and D365. He also holds multiple security and networking vendor certifications, as well as PRINCE2 and ITIL, and is associated with industry bodies such as the CIF, ISCA, and IISP.

Author of the book: Azure Security Cookbook
Packt
30 Dec 2009
6 min read

Uploading Videos and Sound Files on Your Posts Using Apache Roller 4.0

Using videos in your posts

It's time to learn how to insert video files in your posts. You can just insert one as an HTML link, but who does that anymore? It's a good way to drive your prospective readers away! In today's world, you need to offer your audience the easiest, quickest, and most attractive way to see what you have to offer. My job is to show you how to do that with your Roller weblog, so let's get to it!

Time for action – uploading and inserting videos on your posts

In this exercise, I'll show you how to upload a video file to your blog server and then insert it into a post using Apache Roller:

1. Open your web browser and log into Roller. The New Entry page will appear. Click on the File Uploads link from the Create & Edit tab.
2. Scroll down the File Uploads page until you locate the Manage Uploaded Files section. Type video in the New Directory field and click on the Create button. Roller will show a success message.
3. Scroll down the page again until you locate the video folder in the Manage Uploaded Files section, and click on it. Roller will take you to the same File Uploads page, but this time you'll be inside the video directory.
4. Now click on the first Browse... button of the Upload files for use in weblog main section. On the File Upload dialog, go to the folder where you downloaded the support files for this article, and double-click on the FirstFrame.png image to select it. The name of this file will show up in the first textbox of the File Uploads page.
5. Now click on the second Browse... button and double-click on the showvbox.mp4 file, so that its name appears in the second textbox of the File Uploads page.
6. Repeat the process with the third Browse... button and the showvbox_controller.swf file.
7. Click on the Upload button to upload the three files to the video directory in your blog server. Roller will return a success page listing the URLs of the uploaded files. Don't forget to write down the three URLs from that message; you'll need them when inserting the video into an entry (post).
8. Click on the New Entry link from the Create & Edit tab to go to the New Entry page and create a new post for your blog. Type Ubuntu Linux Virtual Machine Inside a Windows XP PC in the Title field, select Open Source in the Category field, and type virtualbox windows xp linux ubuntu in the Tags field. In the Content field, type Here's a sample video of my Ubuntu Linux Virtual Machine, running inside a Windows XP PC with VirtualBox:
9. Click on the Toggle HTML Source button from the HTML editor and write the following code below the text you typed in step 8 (the http://localhost/roller/main/resource/video/ URLs must correspond to the URLs Roller returned in step 7):

<object height="498" width="640" id="csSWF" classid="clsid:d27cdb6e-ae6d-11cf-96b8-444553540000" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=9,0,115,0">
<param name="src" value="http://localhost/roller/main/resource/video/showvbox_controller.swf" />
<param name="bgcolor" value="#1a1a1a" />
<param name="quality" value="best" />
<param name="allowScriptAccess" value="always" />
<param name="allowFullScreen" value="true" />
<param name="scale" value="showall" />
<param name="flashVars" value="autostart=false" />
<embed height="498" width="640" name="csSWF" src="http://localhost/roller/main/resource/video/showvbox_controller.swf" bgcolor="#1a1a1a" quality="best" allowscriptaccess="always" allowfullscreen="true" scale="showall" flashvars="autostart=false&amp;thumb=http://localhost/roller/main/resource/video/FirstFrame.png&amp;thumbscale=45&amp;color=0x000000,0x000000" pluginspage="http://www.macromedia.com/shockwave/download/index.cgi?P1_Prod_Version=ShockwaveFlash" />
</object>

10. Click on the Save as Draft button to save a draft of your post. Then scroll down the page and click on the Full Preview button to see how your post will look in your blog before publishing it. The preview will open in a new tab in your web browser.
11. Click on the play button and the video will begin to play. When finished, close the preview tab and click on the Post to Weblog button to publish your new post. You can log out from Roller now.

What just happened?

The previous exercise showed you how to upload a video to your blog server and insert it in a post. As you can see, videos are a little more complicated than plain images. In this case, we used the following files:

FirstFrame.png: The thumbnail image that shows up before the video begins to play
showvbox.mp4: The video file
showvbox_controller.swf: The controller that plays back the showvbox.mp4 file

The video was produced with Camtasia Studio, a screen recording application from TechSmith (http://www.techsmith.com). If you want to practice with your own videos, you can download a 30-day free trial version of Camtasia Studio from the following URL: http://www.techsmith.com/download/camtasiatrial.asp.

What would happen if you wanted to embed a video from your camera or cell phone? Well, you can embed it directly in your blog, but the best thing to do is use a software application such as Camtasia Studio to create the .swf controller and the HTML embed code automatically. Then you just need to change the .swf controller and thumbnail URLs, as you did in step 9 of the previous exercise. You can use the same embed code to insert a different video in your blog; just be sure to change the http://localhost/roller/main/resource/video/ URLs. You could also upload your video to YouTube instead of uploading it into Roller, as we'll see later in this article.

Packt
15 Jan 2013
19 min read

Application Packaging in VMware ThinApp 4.7 Essentials

(For more resources related to this topic, see here.) The capture and build environment You cannot write a book about a packaging format without discussing the environment used to create the packages. The environment you use to capture an installation is of great importance. ThinApp uses a method of snapshotting when capturing an application installation. This means you create a snapshot (Pre-Installation Snapshot) of the current state of the machine. After modifying the environment, you create another snapshot, the Post-Installation Snapshot. The differences between the two snapshots represent the changes made by the installer. This should be all the information you need in order to run the application. Many packaging products use snapshotting as a method of capturing changes. The alternative would be to try to hook into the installer itself. Both methods have their pros and cons. Using snapshot is much more flexible. You don't even have to run an installer. You can copy files and create registry keys manually and it will all be captured. But, your starting point will decide the outcome. If your machine already contains the Java Runtime Environment ( JRE ) and the application you are capturing requires Java, then you will not be able to capture the JRE. Since it was already there when you ran the pre-install snapshot, it will not be a part of the captured differences. This means your package would require Java installed or it will fail to run. The package will not be self-contained. The other method, monitoring the installer, will be more independent of the capturing environment but will not support all the installer formats and will not support manual tweaking during capture. Nothing is black or white. Snapshotting can be made a little more independent of the capture environment. When an installer discovers components already installed, it can register itself to the same components. 
ThinApp will recognize this, investigate which files are related to a component, and mark them for inclusion in the package. But this is not a bulletproof method. So the general rule is to make sure your environment allows ThinApp to capture all required dependencies of the application.

ThinApp packages are able to support multiple operating systems with one single package. This is a great feature and really helps in lowering the overall administration of maintaining an application. The possibility of running the same package on your Windows XP clients, Windows 7 machines, and your XenApp servers is unique. Most other packaging formats require you to maintain one package per environment.

The easiest method to package an application is to capture it on the platform where it will run. Normally you can achieve an out-of-the-box success rate of 60 to 80 percent. Out of the box means you have not tweaked the project in any way. The package might not be ready for production, but it will run on a clean machine that does not have the application locally installed.

If you want to support multiple operating systems, you should capture on the lowest platform you need to support. Most of the time this would be Windows XP. From ThinApp's point of view, Windows XP and Windows Server 2003 are of the same generation, and Windows 7 and Windows Server 2008 R2 are of the same generation. Most installers are environment aware. They investigate the target platform, and if an installer discovers a Windows 7 operating system, it knows that some files are already present in the same or a newer version than required. Installing on Windows XP with no service pack forces those required files to be installed locally, and therefore captured by the capturing process. Having these files captured from an installation made on Windows XP rarely conflicts with running the application on Windows 7 and helps you achieve multiple OS support.
Creating a package for one single operating system is of course the easiest task. Creating a package supporting multiple operating systems, all being 32-bit systems is a little harder. How hard depends on the application. Creating a package supporting many different OS and both 32-bit and 64-bit versions is the hardest. But it is doable. It just requires a little extra packaging effort. Some applications cannot run on a 64-bit OS, but most applications offer some kind of work around. If the application contains 16-bit code, then it's impossible to make it run on a 64-bit environment. 64-bit environments cannot handle 16-bit code. Your only workaround for those scenarios is whole machine virtualization technologies. VMware Workstation, VMware View, Citrix XenDesktop, Microsoft Med-V, and many others offer you the capability to access a virtualized 32-bit operating system on your 64-bit machine. In general, you should use an environment that is as clean as possible. This will guarantee that all your packages include as many dependencies as possible, making them portable and robust. But it's not written in stone. If you are capturing an add-on to Microsoft Office, then Microsoft Office has to be locally installed in your capturing environment or the add-on installer would fail to run. You must design your capture environment to allow flexibility. Sometimes you capture on Windows XP, the next application might be captured on Windows 7 64-bit. The next day you'll capture on a machine having JRE installed, or Microsoft Office. The use of virtual machines is a must. Physical machines are supported but the hours spent on reverting to a clean state to start the capture of the next application makes it virtually useless. My capture environment is my MacBook Pro running VMware Fusion and several virtual machines such as Windows XP, Windows Vista, Windows 7, Windows 2003 Server, and of course Windows Server 2008. 
All VMs have several snapshots (states of the virtual machine) so I can easily jump back and forth between clean, Microsoft Office-installed, and different service pack and browser states. Yes, it will require some serious disk space. I'm always low on free disk space. No matter how big the disks you buy are, your project folders and virtual machines will eat it all. I have two disks in my MacBook: one SSD, where I keep most of my virtual machines, and one traditional hard disk, where I keep all my project folders.

The best capture environments I've ever seen have been hosted on VMware vSphere and ESX. Using server hardware to run client operating systems makes them fast as lightning. Snapshotting your VMs takes seconds, as does reverting snapshots. Access to the virtual machines hosted on VMware ESX can be achieved using a console within the vSphere client or basic RDP. The only downside I can see to using an ESX environment is that you cannot do packaging offline, while traveling.

The next logical question is whether the capture machine should be a domain member or standalone. This depends, but I always prefer to capture on standalone machines. This way I know that group policies will not mess with my capture process. No restrictions will be blocking me from doing what I need to do. But again, sometimes you simply cannot capture an application without having access to a backend infrastructure. Then your capture machine must be on the corporate network, and most of the time that means it has to be a domain member. If possible, try putting the machine in a special Organizational Unit (OU) where you limit the number of restrictions.

If at all possible, make sure you don't have antivirus installed on your capturing environment. I know that some enterprises have strict policies forcing even packaging machines to be protected by antivirus. But be careful.
There is no way of telling what your antivirus may decide to do to your application's installation and the whole capture process. Most installer manuals clearly state that you should disable any antivirus during installation. They do that for a reason. Antivirus scanning logs and all that follows will also pollute your project folder. It will probably not break your package, but I am a strong believer in delivering clean and optimized packages. So having an antivirus means you will have to spend some time cleaning up your project folders. Alternatively, you can include areas where the antivirus changes content in snapshot.ini, the Setup Capture exclusion list.

Entry points and the data container

An entry point is the doorway into the virtual environment for the end users. An entry point specifies what will be launched within the virtual environment. Mostly an entry point points to an executable, for example, winword.exe. But an entry point doesn't have to point to an executable. You can point an entry point to whatever kind of file you want, as long as the file type has a file association made to it. Whatever is associated to the file type will be launched within the virtual environment. If no file type association exists, you will get the standard operating system dialog box, asking you which application to open the file with.

The name of the entry point must use an .exe extension. When the user double-clicks on an entry point, we are asking the operating system to execute the ThinApp runtime. Entry points are managed in Package.ini. You'll find them at the end of the Package.ini file.

The data container is the file where ThinApp stores the whole virtual environment and the ThinApp runtime. There can only be one data container per project. The content in the data container is an exact copy of the representation of the virtual environment found in your project folder. The data in the data container is in read-only format.
It's the packagers who change the content by rebuilding the project. An end user cannot change the content of the data container.

An entry point can be a data container. Setup Capture will recommend not using an entry point as a data container if Setup Capture believes that the package will be large (200 MB to 300 MB). The reason for this is that the icon of the entry point may take up to 20 minutes to be displayed. This is a feature of the operating system and there's nothing you can do about it. It's therefore better to store the data container in a separate file and keep your entry points small, so that the icons are displayed quickly. Setup Capture will force you to use a separate data container when the size is calculated to be larger than 1.5 GB. Windows has a size limitation for executable files: it will refuse to execute an .exe file larger than 2 GB.

The name of the data container can be anything. You do not have to name it with the .dat extension. It doesn't have to have a file extension at all. If you're using a separate data container, you must keep the data container in the same folder as your entry points.

Let's take a closer look at the data container and entry point section of Package.ini. You'll find the data container and entry points at the end of the Package.ini file.
The following is an example Package.ini file from a virtualized Mozilla Firefox:

    [Mozilla Firefox.exe]
    Source=%ProgramFilesDir%\Mozilla Firefox\firefox.exe
    ;Change ReadOnlyData to bin\Package.ro.tvr to build with old versions(4.6.0 or earlier) of tools
    ReadOnlyData=Package.ro.tvr
    WorkingDirectory=%ProgramFilesDir%\Mozilla Firefox
    FileTypes=.htm.html.shtml.xht.xhtml
    Protocols=FirefoxURL;ftp;http;https
    Shortcuts=%Desktop%;%Programs%\Mozilla Firefox;%AppData%\Microsoft\Internet Explorer\Quick Launch

    [Mozilla Firefox (Safe Mode).exe]
    Disabled=1
    Source=%ProgramFilesDir%\Mozilla Firefox\firefox.exe
    Shortcut=Mozilla Firefox.exe
    WorkingDirectory=%ProgramFilesDir%\Mozilla Firefox
    CommandLine="%ProgramFilesDir%\Mozilla Firefox\firefox.exe" -safe-mode
    Shortcuts=%Programs%\Mozilla Firefox

A step-by-step explanation for the parameters is given as follows:

    [Mozilla Firefox.exe]

Within [] is the name of the entry point. This is the name the end user will see. Make sure to use .exe as your file extension.

    Source=%ProgramFilesDir%\Mozilla Firefox\firefox.exe

The Source parameter points to the target of the entry point, that is, what will be launched when the user clicks on the entry point. The source can either be a virtualized or physical file. The target will be launched within the virtual environment no matter where it lives.

    ReadOnlyData=Package.ro.tvr

ReadOnlyData indicates that this entry point is in fact a data container as well.

    WorkingDirectory=%ProgramFilesDir%\Mozilla Firefox

This specifies the working directory for the executable launched. This is often a very important parameter. If you do not specify a working directory, the active working directory will be the location of your package. A lot of software depends on having its working directory set to the application's own folder in the program files directory.

    FileTypes=.htm.html.shtml.xht.xhtml

This is used when registering the entry point. It specifies which file extensions should be associated with this entry point.
The previous example registers .htm, .html, and so on to the virtualized Mozilla Firefox.

    Protocols=FirefoxURL;ftp;http;https

This is used when registering the entry point. It specifies which protocols should be associated with this entry point. The previous example registers http, https, and so on to the virtualized Mozilla Firefox.

    Shortcuts=%Desktop%;%Programs%\Mozilla Firefox

The Shortcuts parameter is also used when registering your entry points. It decides where shortcuts will be created. The previous example creates shortcuts to the virtualized Mozilla Firefox on the Start menu in a folder called Mozilla Firefox, as well as a shortcut on the user's desktop.

    [Mozilla Firefox (Safe Mode).exe]
    Disabled=1

Disabled means this entry point will not be created during the build of your project.

    Source=%ProgramFilesDir%\Mozilla Firefox\firefox.exe
    Shortcut=Mozilla Firefox.exe

Shortcut tells this entry point what its data container is named. If you change the data container's name, you will have to change the Shortcut parameter on all entry points using the data container.

    WorkingDirectory=%ProgramFilesDir%\Mozilla Firefox
    CommandLine="%ProgramFilesDir%\Mozilla Firefox\firefox.exe" -safe-mode

CommandLine allows you to specify hardcoded parameters to the executable. You use the native parameters supported by the virtualized application.

    Shortcuts=%Programs%\Mozilla Firefox

There are many more parameters related to entry points. The following are some more examples with descriptions:

    [Microsoft Office Enterprise 2007.dat]
    Source=%ProgramFilesDir%\Microsoft Office\Office12\OSA.EXE
    ;Change ReadOnlyData to bin\Package.ro.tvr to build with old versions(4.6.0 or earlier) of tools
    ReadOnlyData=Package.ro.tvr
    MetaDataContainerOnly=1

MetaDataContainerOnly indicates that this is a separate data container.
    [Microsoft Office Excel 2007.exe]
    Source=%ProgramFilesDir%\Microsoft Office\Office12\EXCEL.EXE
    Shortcut=Microsoft Office Enterprise 2007.dat
    FileTypes=.csv.dqy.iqy.slk.xla.xlam.xlk.xll.xlm.xls.xlsb.xlshtml.xlsm.xlsx.xlt.xlthtml.xltm.xltx.xlw
    Comment=Perform calculations, analyze information, and visualize data in spreadsheets by using Microsoft Office Excel.

Comment allows you to specify text to be displayed when hovering your mouse over the shortcut to the entry point.

    ObjectTypes=Excel.Addin;Excel.AddInMacroEnabled;Excel.Application;Excel.Application.12;Excel.Backup;Excel.Chart;Excel.Chart.8;Excel.CSV;Excel.Macrosheet;Excel.Sheet;Excel.Sheet.12;Excel.Sheet.8;Excel.SheetBinaryMacroEnabled;Excel.SheetBinaryMacroEnabled.12;Excel.SheetMacroEnabled;Excel.SheetMacroEnabled.12;Excel.SLK;Excel.Template;Excel.Template.8;Excel.TemplateMacroEnabled;Excel.Workspace;Excel.XLL

This specifies the object types that will be associated with the entry point when it is registered.

    Shortcuts=%Programs%\Microsoft Office
    StatusBarDisplayName=WordProcessor

StatusBarDisplayName lets you change the name displayed in the ThinApp splash screen. In this example, WordProcessor will be displayed as the title.

    Icon=%ProgramFilesDir%\Microsoft Office\Office12\EXCEL.ico

Icon allows you to specify an icon for your entry point. Most of the time ThinApp will display the correct icon without this parameter. You can point to an executable to use its built-in icons as well. You can select a different icon set by appending ,1 or ,2 and so on to the icon path, for example:

    Icon=%ProgramFilesDir%\Microsoft Office\Office12\EXCEL.EXE,1

The most common entry points are probably cmd.exe and regedit.exe. You'll find them in all Package.ini files, but they are disabled by default. Since cmd.exe and regedit.exe most likely weren't modified during Setup Capture, they are not part of the virtual environment. So the source will be the native cmd.exe and regedit.exe. These two entry points are the packagers' best friends.
Using these entry points allows a packager to investigate the environment known to the virtualized application. What you can see using cmd.exe or regedit.exe is what the application sees. This is a great help when troubleshooting.

A typical example of packaging an add-on to a natively installed application is packaging JRE when you want the local Internet Explorer to be able to use it. Creating an entry point within your Java package using native Internet Explorer as the source is a perfect method of dealing with it. Now you can offer a separate shortcut to the user, allowing users to choose when to use native Java or when to use virtualized Java. ThinApp's isolation will allow virtualization of one Java version running on a machine with another version natively installed. The only problem with this approach is how you educate your users about when to use which shortcut. ThinDirect, discussed later in this article, in the Virtualizing Internet Explorer 6 section, will allow you to automatically point the user to the right browser.

There are many use cases for launching something natively within a virtualized environment. You may face troublesome Excel add-ons. Virtualizing them will protect against conflicts, but you must launch native Excel within the environment of the add-on for it to work. Here you could use the fact that many Excel add-ons use .xla files as the typical entry point to the add-on. Create your entry point using the .xla file as source and you will be able to launch any Excel version that is natively installed.

When you use a non-executable as your entry point source, remember that the name of your entry point must still end in .exe. The following is an example of an entry point using a text file as source:

    [ReadMe.exe]
    Source=%Drive_C%\Temp\readme.txt
    ReadOnlyData=Package.ro.tvr

Running ReadMe.exe will launch whatever is associated to handle .txt files. The application will run within the virtualized environment of the package.
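The entry point rules described above can be checked mechanically. The following is a minimal Python sketch, not part of ThinApp itself, that reads Package.ini-style text with the standard configparser module and reports, for each section, whether it carries the data container (ReadOnlyData), which data container it points to (Shortcut), whether it is disabled, and whether its name ends in .exe. The function name summarize_entry_points and the trimmed sample text are hypothetical, and a real Package.ini contains other sections that this sketch would treat as if they were entry points.

```python
import configparser

# Trimmed, illustrative sample based on the Firefox example in the text.
SAMPLE = r"""
[Mozilla Firefox.exe]
Source=%ProgramFilesDir%\Mozilla Firefox\firefox.exe
ReadOnlyData=Package.ro.tvr
WorkingDirectory=%ProgramFilesDir%\Mozilla Firefox

[Mozilla Firefox (Safe Mode).exe]
Disabled=1
Source=%ProgramFilesDir%\Mozilla Firefox\firefox.exe
Shortcut=Mozilla Firefox.exe
"""


def summarize_entry_points(ini_text):
    """Return a dict mapping each section name to a few entry point facts."""
    # interpolation=None stops configparser from interpreting %FolderMacro%
    # values as its own interpolation syntax.
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep parameter names case-sensitive
    parser.read_string(ini_text)
    summary = {}
    for name in parser.sections():
        section = parser[name]
        summary[name] = {
            "is_data_container": "ReadOnlyData" in section,
            "data_container": section.get("Shortcut"),
            "disabled": section.get("Disabled", "0") == "1",
            "name_ends_with_exe": name.lower().endswith(".exe"),
        }
    return summary


if __name__ == "__main__":
    for name, info in summarize_entry_points(SAMPLE).items():
        print(name, info)
```

A check like this could be run before a build to spot an entry point whose Shortcut value no longer matches the data container's name.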
The project folder

The project folder is where the packager spends most of his or her time. The capturing process is just a means to create the project folder. You could just as easily create your own project folder from scratch. I admit that manually creating a project folder representing a Microsoft Office installation would be far from easy, but in theory it can be done.

There is some default content in all project folders. Let's capture nothing and investigate what it is. During Setup Capture, to speed things up, disable the majority of the search locations. This way the pre and post scans will take close to no time at all. Run Setup Capture. In the Ready to Prescan step, click on Advanced Scan Locations.... Exclude all but one location from the scanning, as shown in the following screenshot:

Since we want to capture nothing, there is no point in scanning all locations. Normally you don't have to modify the advanced scan locations. After pressing Prescan, wait for Postscan to become available and click on it when possible, without modifying anything in your capturing environment. Accept CMD.EXE as your entry point and accept all defaults throughout the wizard. Your project folder will look like the following screenshot:

Even a capture with no changes produces a project folder with folder macros and default isolation modes. Let's explore the defaults prepopulated by the Setup Capture wizard. This is the minimum project folder content that Setup Capture will ever generate. As a packager you are expected to clean up unnecessary folders from the project folder, so your final project folder may very well contain a smaller number of folder macros.

Folder macros are ThinApp's variables. %ProgramFilesDir% will be translated to C:\Program Files on an English Windows installation, but when the same package runs on a Swedish OS, %ProgramFilesDir% will point to C:\Program. Folder macros are the key to ThinApp packages' portability.
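To make the idea of folder macros concrete, here is a small Python sketch of how a macro such as %ProgramFilesDir% could resolve differently per machine. It only illustrates the concept and is not how the ThinApp runtime is implemented. The mapping tables and the expand_macros function are hypothetical, and the two values for %ProgramFilesDir% are the English and Swedish locations mentioned above.

```python
import re

# Hypothetical per-machine macro tables. Only %ProgramFilesDir% comes from
# the text; the table structure is an assumption made for illustration.
ENGLISH_WINDOWS = {"ProgramFilesDir": r"C:\Program Files"}
SWEDISH_WINDOWS = {"ProgramFilesDir": r"C:\Program"}


def expand_macros(path, macros):
    """Replace each known %Name% macro and leave unknown macros untouched."""
    def substitute(match):
        return macros.get(match.group(1), match.group(0))
    return re.sub(r"%([A-Za-z_]+)%", substitute, path)


packaged_path = r"%ProgramFilesDir%\Mozilla Firefox\firefox.exe"
print(expand_macros(packaged_path, ENGLISH_WINDOWS))
print(expand_macros(packaged_path, SWEDISH_WINDOWS))
```

The package stores only the macro form of the path, which is why the same project folder can describe both machines.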
If we explore the filesystem part of the project folder, we'll see the default isolation modes prepopulated by Setup Capture. These are applied as defaults no matter what default filesystem isolation mode you choose during the Setup Capture wizard. This confuses some people. I'm often told that a certain package is using WriteCopy or Merged as the isolation mode. Well, that's just the default used when no other isolation mode is specified. A proper project folder should have isolation modes specified on all locations of importance, basically making the default isolation mode of no importance. The prepopulated isolation modes are there to make sure most applications run out of the box when ThinApped. You are expected to change these to suit your application and environment.

Let's look at some examples of default isolation modes. %AppData%, the location where most applications store user settings, uses WriteCopy by default. This is to make sure that you sandbox all user settings. %SystemRoot% and %SystemSystem% have WriteCopy as their default isolation modes, allowing a virtualized application to see the operating system files without allowing it to modify C:\Windows and C:\Windows\System32. %SystemSystem%\spool, representing C:\Windows\System32\Spool, has Merged as its default. This way print jobs will be spooled to the native location, allowing the printer to pick up the print job. %Desktop% (user's desktop folder) and %Personal% (user's document folder) have Merged by default.

When ThinApp generates the project folder, it uses the following logic to decide which isolation mode to prepopulate other locations with. The same logic is used within the registry as well.

  • Modified locations will get WriteCopy as the isolation mode
  • New locations will get Full as their isolation mode
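The two rules above amount to a small decision function. The following Python sketch restates them. The function and parameter names are hypothetical, and the text does not say what happens to locations the installer never touched, so the sketch returns None for those rather than guessing.

```python
def prepopulated_isolation_mode(existed_before_capture, changed_during_capture):
    """Return the isolation mode suggested by the two rules in the text.

    Returns None for untouched locations, which the rules do not cover.
    """
    if not existed_before_capture:
        return "Full"        # new location created during the capture
    if changed_during_capture:
        return "WriteCopy"   # existing location that was modified
    return None


print(prepopulated_isolation_mode(existed_before_capture=False, changed_during_capture=True))
print(prepopulated_isolation_mode(existed_before_capture=True, changed_during_capture=True))
```

These are only starting values. As the text stresses, the packager is expected to review them for each application.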

Pravin Dhandre
08 Dec 2017
7 min read

How to implement discrete convolution on a 2D dataset

Note: This article is an excerpt from the book by Rodolfo Bonnin titled Machine Learning for Developers. A question frequently asked by developers across the globe is, "How do I get started in Machine Learning?" One reason could be the vastness of the subject area. This book is a systematic guide that teaches you how to implement various Machine Learning techniques and their day-to-day application and development.

In the tutorial given below, we implement convolution in a practical example to see it applied to a real image and get an intuitive idea of its effect. We will use different kernels to detect high-detail features and execute a subsampling operation to get a smaller, brighter image. This is a simple, intuitive implementation of the discrete convolution concept, applied to a sample image with different types of kernel.

Let's import the required libraries. As we will implement the algorithms in the clearest possible way, we will just use the minimum necessary ones, such as NumPy:

    import matplotlib.pyplot as plt
    import imageio
    import numpy as np

Using the imread method of the imageio package, let's read the image (imported as three equal channels, as it is grayscale). We then slice the first channel, convert it to floating point, and show it using matplotlib:

    # The np.float alias was removed in NumPy 1.24; the built-in float is equivalent
    arr = imageio.imread("b.bmp")[:, :, 0].astype(float)
    plt.imshow(arr, cmap=plt.get_cmap('binary_r'))
    plt.show()

Now it's time to define the kernel convolution operation. As we did previously, we will simplify the operation to a 3 x 3 kernel in order to better understand the border conditions. apply3x3kernel will apply the kernel over all the elements of the image, returning a new equivalent image.
Note that we are restricting the kernels to 3 x 3 for simplicity, and so the 1 pixel border of the image won't have a new value because we are not taking padding into consideration:

    class ConvolutionalOperation:
        def apply3x3kernel(self, image, kernel):
            # Simple 3x3 kernel operation
            newimage = np.array(image)
            # Visit every interior pixel; the 1 pixel border keeps its original value
            for m in range(1, image.shape[0] - 1):
                for n in range(1, image.shape[1] - 1):
                    newelement = 0
                    for i in range(0, 3):
                        for j in range(0, 3):
                            newelement = newelement + image[m - 1 + i][n - 1 + j] * kernel[i][j]
                    newimage[m][n] = newelement
            return newimage

As we saw in the previous sections, the different kernel configurations highlight different elements and properties of the original image, building filters that in conjunction can specialize in very high-level features, such as eyes, ears, and doors, after many epochs of training.

Here, we will generate a dictionary of kernels with a name as the key, and the coefficients of the kernel arranged in a 3 x 3 array. The Blur filter is equivalent to calculating a weighted average of the 3 x 3 point neighborhood, Identity simply returns the pixel value as is, Laplacian is a classic derivative filter that highlights borders, and the two Sobel filters respond to intensity changes along the horizontal direction (Left Sobel) and along the vertical direction (Upper Sobel):

    kernels = {"Blur": [[1./16., 1./8., 1./16.], [1./8., 1./4., 1./8.], [1./16., 1./8., 1./16.]],
               "Identity": [[0., 0., 0.], [0., 1., 0.], [0., 0., 0.]],
               # 4-neighbour discrete Laplacian
               "Laplacian": [[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]],
               "Left Sobel": [[1., 0., -1.], [2., 0., -2.], [1., 0., -1.]],
               "Upper Sobel": [[1., 2., 1.], [0., 0., 0.], [-1., -2., -1.]]}

Let's create a ConvolutionalOperation object and generate a comparative chart of the kernel outputs to see how they compare:

    conv = ConvolutionalOperation()
    fig = plt.figure(figsize=(30, 30))
    j = 1
    for key, value in kernels.items():
        axs = fig.add_subplot(3, 2, j)
        axs.set_title(key)  # label each panel with the kernel name
        out = conv.apply3x3kernel(arr, value)
        plt.imshow(out, cmap=plt.get_cmap('binary_r'))
        j = j + 1
    plt.show()
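As a cross-check on the loop-based operator, the 3 x 3 kernel operation can also be sketched with NumPy slicing instead of four nested loops. This is an illustrative alternative rather than the book's implementation. The function name apply3x3kernel_vectorized is made up here, the input is assumed to be a 2D array of at least 3 x 3 pixels, and, as described above, the 1 pixel border keeps its original value and the kernel is applied without flipping.

```python
import numpy as np


def apply3x3kernel_vectorized(image, kernel):
    """Apply a 3x3 kernel to every interior pixel; border pixels keep their value."""
    image = np.asarray(image, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    out = image.copy()
    rows, cols = image.shape
    acc = np.zeros((rows - 2, cols - 2))
    for i in range(3):
        for j in range(3):
            # Shifted view of the image that lines up with kernel element (i, j)
            acc += kernel[i, j] * image[i:i + rows - 2, j:j + cols - 2]
    out[1:-1, 1:-1] = acc
    return out


identity = [[0., 0., 0.], [0., 1., 0.], [0., 0., 0.]]
sample = np.arange(25, dtype=float).reshape(5, 5)
# The identity kernel should leave the image unchanged
print(np.array_equal(apply3x3kernel_vectorized(sample, identity), sample))
```

Each of the nine iterations multiplies a whole shifted copy of the image by one kernel coefficient, so the per-pixel Python loops disappear while the arithmetic stays the same.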
In the resulting chart you can clearly see how the kernels have detected several high-detail features in the image: the Identity (unit) kernel leaves the image unchanged, the Laplacian acts as an edge finder, the Left Sobel and Upper Sobel kernels act as left and upper border detectors, and Blur smooths the image.

Having reviewed the main characteristics of the convolution operation for the continuous and discrete fields, we can conclude by saying that, basically, convolution kernels highlight or hide patterns. Depending on the trained or (in our example) manually set parameters, we can begin to discover many elements in the image, such as orientation and edges in different dimensions. We may also cover some unwanted details or outliers by blurring kernels, for example. Additionally, by piling layers of convolutions, we can even highlight higher-order composite elements, such as eyes or ears.

This characteristic of convolutional neural networks is their main advantage over previous data-processing techniques: we can determine with great flexibility the primary components of a certain dataset, and represent further samples as a combination of these basic building blocks. Now it's time to look at another type of layer that is commonly used in combination with the former: the pooling layer.

Subsampling operation (pooling)

The subsampling operation consists of applying a kernel (of varying dimensions) and reducing the extension of the input dimensions by dividing the image into m x n blocks and taking one element representing each block, thus reducing the image resolution by some determinate factor. In the case of a 2 x 2 kernel, each image dimension will be reduced by half. The most well-known operations are maximum (max pool), average (avg pool), and minimum (min pool). The following image gives you an idea of how to apply a 2 x 2 maxpool kernel, applied to a one-channel 16 x 16 matrix.
It just maintains the maximum value of the internal zone it covers:

Now that we have seen this simple mechanism, let's ask ourselves, what's the main purpose of it? The main purpose of subsampling layers is related to the convolutional layers: to reduce the quantity and complexity of information while retaining the most important information elements. In other words, they build a compact representation of the underlying information.

Now it's time to write a simple pooling operator. It's much easier and more direct to write than a convolutional operator, and in this case we will only be implementing max pooling, which chooses the brightest pixel in each 2 x 2 vicinity and projects it to the final image:

    class PoolingOperation:
        def apply2x2pooling(self, image, stride):
            # Simple 2x2 kernel operation; the blocks do not overlap and
            # the stride argument is not used by this implementation
            newimage = np.zeros((int(image.shape[0] / 2), int(image.shape[1] / 2)), np.float32)
            for m in range(0, image.shape[0] - 1, 2):
                for n in range(0, image.shape[1] - 1, 2):
                    newimage[int(m / 2), int(n / 2)] = np.max(image[m:m + 2, n:n + 2])
            return newimage

Let's apply the newly created pooling operation. As you can see, the final image is much more blocky, and the details, in general, are brighter:

    pool = PoolingOperation()
    fig = plt.figure(figsize=(20, 10))
    axs = fig.add_subplot(1, 2, 1)
    plt.imshow(arr, cmap=plt.get_cmap('binary_r'))
    out = pool.apply2x2pooling(arr, 1)
    axs = fig.add_subplot(1, 2, 2)
    plt.imshow(out, cmap=plt.get_cmap('binary_r'))
    plt.show()

Here you can see the differences, even though they are subtle. The final image is of lower precision, and the chosen pixels, being the maximum of their neighborhood, produce a brighter image.

This simple implementation with various kernels illustrates the working mechanism of the discrete convolution operation on a 2D dataset.
Using various kernels and the subsampling operation, hidden patterns in the dataset are revealed, and max pooling yields a smaller, brighter image that serves as a compact representation of the dataset. If you found this article interesting, do check out Machine Learning for Developers to learn about advancements in deep learning, adversarial networks, and popular programming frameworks, and to prepare yourself for the ubiquitous field of machine learning.
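The 2 x 2 max pooling described in this article can also be sketched without explicit loops by reshaping the image into 2 x 2 blocks and taking the maximum of each block. This is an illustrative alternative under stated assumptions, not the book's code. The function name max_pool_2x2 is hypothetical, the input is assumed to be a 2D array, and a trailing odd row or column is simply dropped.

```python
import numpy as np


def max_pool_2x2(image):
    """2x2 max pooling with non-overlapping blocks; odd trailing rows/columns are dropped."""
    image = np.asarray(image)
    rows = image.shape[0] // 2 * 2
    cols = image.shape[1] // 2 * 2
    # Axes 1 and 3 of the reshaped array index the position inside each 2x2 block
    blocks = image[:rows, :cols].reshape(rows // 2, 2, cols // 2, 2)
    return blocks.max(axis=(1, 3))


sample = np.array([[1, 2, 5, 6],
                   [3, 4, 7, 8],
                   [9, 8, 1, 0],
                   [7, 6, 2, 3]])
print(max_pool_2x2(sample))  # prints [[4 8] and [9 3]] as a 2x2 array
```

Swapping max for mean or min in the last line of the function would give the average and minimum pooling variants mentioned earlier.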
Packt
22 Oct 2014
6 min read

Understanding the context of BDD

In this article by Sujoy Acharya, author of Mockito Essentials, you will learn about the BDD concepts and BDD examples. You will also learn about how BDD can help you minimize project failure risks.

This section of the article deals with software development strategies, their drawbacks, and ways to overcome the shortcomings of traditional approaches. The following strategies are applied to deliver software products to customers:

  • Top-down or waterfall approach
  • Bottom-up approach

We'll cover these two approaches in the following sections. The following key people/roles/stakeholders are involved in software development:

  • Customers: They explore the concept and identify the high-level goal of the system, such as automating the expense claim process
  • Analysts: They analyze the requirements, work with the customer to understand the system, and build the system requirement specifications
  • Designers/architects: They visualize the system, design the baseline architecture, identify the components and their interactions, and handle the nonfunctional requirements, such as scalability and availability
  • Developers: They construct the system from the design and specification documents
  • Testers: They design test cases and verify the implementation
  • Operational folks: They install the software as per the customer's environment
  • Maintenance team: They handle bugs and monitor the system's health
  • Managers: They act as facilitators and keep track of the progress and schedule

Exploring the top-down strategy

In the top-down strategy, analysts analyze the requirements and hand over the use cases / functional specifications to the designers and architects for designing the system. The architects/designers design the baseline architecture, identify the system components and interactions, and then pass the design over to the developers for implementation.
The testers then verify the implementation (and might report bugs for fixing), and finally, the software is deployed to the customer's environment. The following diagram depicts the top-down flow from requirement engineering to maintenance:

The biggest drawback of this approach is the cost of rework. For instance, if the development team finds that a requirement is not feasible, they consult the design or analysis team. Then the architects or analysts look at the issue and rework the analysis or design. This has a cascading effect, so the cost of rework is very high.

Customers rarely know what they want before they see the system in action. Building everything all at once is a quick way to cause your requirements to change. Even without the difference in cost of requirement changes, you'll have fewer changes if you write the requirements later in the process, when you have a partially working product that the customer can see and everybody has more information about how the product will work.

Exploring the bottom-up strategy

In the bottom-up strategy, the requirement is broken into small chunks and each chunk is designed, developed, and unit tested separately, and finally, the chunks are integrated. The individual base elements of the system are first specified in great detail. These elements are then linked together to form larger subsystems, which in turn are linked until a complete top-level system is formed.

Each subsystem is developed in isolation from the other subsystems, so integration is very important in the bottom-up approach. If integration fails, the cost and effort of building the subsystems are jeopardized. Suppose you are building a healthcare system with three subsystems, namely, patient management, receivable management, and the claims module. If the patient module cannot talk to the claims module, the system fails. The effort of building the patient management and claims management subsystems is just wasted.
Agile development methodology would suggest building the functionality feature by feature across subsystems, that is, building a very basic patient management and claims management subsystem to make the functionality work initially, and then adding more to both simultaneously, to support each new feature that is required.

Finding the gaps

In real-life projects, the following is the percentage of feature usage:

  • 60 percent of features are never used
  • 30 percent of features are occasionally used
  • 10 percent of features are frequently used

However, in the top-down approach, the analyst pays attention to and brainstorms system requirements for all the features. Time is therefore spent building a system where 90 percent of the features are either never used or only occasionally used. Instead, by using the bottom-up approach, we can identify the high-value features and start building those rather than paying attention to the low-priority features.

In the bottom-up approach, subsystems are built in isolation from each other, and this causes integration problems. If we prioritize the requirements and start with the highest priority feature, design the feature, build it, unit test it, integrate it, and then show a demo to the stakeholders (customers, analysts, product managers, and so on), we can easily identify the gaps and reduce the risk of rework. We can then pick the next feature and follow the steps (designing, coding, testing, and getting feedback from the customers), and finally integrate the feature with the existing system. This reduces the integration issues of the bottom-up approach.

The following figure represents the approach. Each feature is analyzed, designed, coded, tested, and integrated separately.

An example of a requirement could be "login failure error messages appear red and in bold", while a feature could be "incorrect logins are rejected".
Typically, a feature should be a little larger and a useful standalone bit of functionality, rather than a specific single requirement for that functionality.

Another problem associated with software development is communication; each stakeholder has a different vocabulary, and this gets in the way of a common understanding. The following are the best practices to minimize software delivery risks:

  • Focus on high-value, frequently used features.
  • Build a common vocabulary for the stakeholders; a domain-specific language that anybody can understand.
  • No more big, fat upfront design. Evolve the design with the requirements, iteratively.
  • Code to satisfy the current requirement. Don't code for a future requirement, which may or may not be delivered. Follow the YAGNI (You Aren't Going to Need It) principle.
  • Build a test safety net for each requirement.
  • Integrate the code with the system and rerun the regression test.
  • Get feedback from the stakeholders and make immediate changes.

BDD recommends the preceding practices.

Summary

This article covered the BDD concepts and BDD examples.

Packt
30 Mar 2015
33 min read

Basic Concepts of Machine Learning and Logistic Regression Example in Mahout

In this article by Chandramani Tiwary, author of the book Learning Apache Mahout, we will discuss some core concepts of machine learning and the steps of building a logistic regression classifier in Mahout.

The purpose of this article is to understand the core concepts of machine learning. We will focus on the steps involved in resolving different types of problems and on the application areas of machine learning. In particular, we will cover the following topics:

  • Supervised learning
  • Unsupervised learning
  • The recommender system
  • Model efficacy

A wide range of software applications today try to replace or augment human judgment. Artificial Intelligence is a branch of computer science that has long been trying to replicate human intelligence. A subset of AI, referred to as machine learning, tries to build intelligent systems by using data. For example, a machine learning system can learn to classify different species of flowers, or to group related news items together to form categories such as news, sports, politics, and so on. For each of these tasks, the corresponding algorithm looks at the data and tries to learn from it.

Supervised learning

Supervised learning deals with training algorithms with labeled data, that is, inputs for which the outcome or target variables are known, and then predicting the outcome/target with the trained model for unseen future data. For example, historical e-mail data will have individual e-mails marked as ham or spam; this data is then used for training a model that can predict future e-mails as ham or spam. Supervised learning problems can be broadly divided into two major areas, classification and regression.
Classification deals with predicting categorical variables or classes; for example, whether an e-mail is ham or spam, or whether a customer is going to renew a subscription (such as a postpaid telecom subscription) or not. This target variable is discrete and has a predefined set of values. Regression deals with a target variable that is continuous. For example, when we need to predict house prices, the target variable price is continuous and doesn't have a predefined set of values. In order to solve a given problem of supervised learning, one has to perform the following steps.

Determine the objective

The first major step is to define the objective of the problem. The typical objectives that need to be defined are the identification of class labels, the acceptable prediction accuracy, how far in the future the prediction is required, and whether insight or accuracy of classification is the driving factor. For example, for a churn classification problem, we could define the objective as identifying customers who are most likely to churn within three months. In this case, the class label from the historical data would be whether a customer has churned or not, and the objective would call for insights into the reasons for the churn and a prediction of churn at least three months in advance.

Decide the training data

After the objective of the problem has been defined, the next step is to decide what training data should be used. The training data is directly guided by the objective of the problem to be solved. For example, in the case of an e-mail classification system, it would be historical e-mails, related metadata, and a label marking each e-mail as spam or ham. For the problem of churn analysis, the different data points collected about a customer, such as product usage, support cases, and so on, together with a target label for whether the customer has churned or is active, form the training data.
Churn analytics is a major problem area for a lot of business domains such as BFSI, telecommunications, and SaaS. Churn is applicable in circumstances where there is a concept of term-bound subscription. For example, postpaid telecom customers subscribe for a monthly term and can choose to renew or cancel their subscription. A customer who cancels this subscription is called a churned customer.

Create and clean the training set

The next step in a machine learning project is to gather and clean the dataset. The sample dataset needs to be representative of the real-world data, though all available data should be used rather than a sample, if possible. For example, if we assume that 10 percent of e-mails are spam, then our sample should ideally start with 10 percent spam and 90 percent ham. Thus, a set of input rows and corresponding target labels are gathered from data sources such as warehouses, logs, or operational database systems. Cleaning data for data quality purposes forms part of this process. Training data inclusion criteria should also be explored in this step. An example of this in the case of customer analytics is to decide the minimum age or type of customers to use in the training set, for example including only customers who have been subscribed for at least six months.

Feature extraction

Determine and create the feature set from the training data. Features or predictor variables are representations of the training data that are used as input to a model. Feature extraction involves transforming and summarizing that data. The performance of the learned model depends strongly on its input feature set. Feature extraction requires a good understanding of the data and is aided by domain expertise.
For example, for churn analytics, we use demography information from the CRM, product adoption (phone usage in the case of telecom), age of customer, and payment and subscription history as the features for the model. The number of features extracted should be neither too large nor too small; feature extraction is more art than science, and an optimum feature representation is usually reached after some iterations. Typically, the dataset is constructed such that each row corresponds to one outcome of the target variable. For example, in the churn problem, the training dataset would be constructed so that every row represents a customer.

Train the models

We need to try out different supervised learning algorithms. This step is called training the model and is an iterative process where you might try building different training samples and try out different combinations of features. For example, we may choose to use support vector machines or decision trees depending upon the objective of the study, the type of problem, and the available data. Machine learning algorithms can be bucketed into groups based on the ability of a user to interpret how the predictions were arrived at. If the model can be interpreted easily, it is called a white box model, for example decision tree and logistic regression; if the model cannot be interpreted easily, it is called a black box model, for example support vector machine (SVM). If the objective is to gain insight, a white box model such as decision tree or logistic regression can be used, and if robust prediction is the criterion, then algorithms such as neural networks or support vector machines can be used.

While training a model, there are a few techniques that we should keep in mind, like bagging and boosting.

Bagging

Bootstrap aggregating, which is also known as bagging, is a technique where the data is taken from the original dataset S times to make S new datasets. The datasets are the same size as the original.
Each dataset is built by randomly selecting examples from the original with replacement. By with replacement we mean that you can select the same example more than once. This property allows you to have values in the new dataset that are repeated, and some values from the original won't be present in the new set. Bagging helps in reducing the variance of a model and can be used to train different models using the same datasets. The final conclusion is arrived at after considering the output of each model.

For example, let's assume our data is a, b, c, d, e, f, g, and h. By sampling our data five times, we can create five different samples as follows:

  • Sample 1: a, b, c, c, e, f, g, h
  • Sample 2: a, b, c, d, d, f, g, h
  • Sample 3: a, b, c, c, e, f, h, h
  • Sample 4: a, b, c, e, e, f, g, h
  • Sample 5: a, b, b, e, e, f, g, h

As we sample with replacement, we get the same examples more than once. Now we can train five different models using the five sample datasets. For the prediction, each model provides an output; let's assume the classes are yes and no. The final outcome is the class with the maximum votes. If three models say yes and two say no, then the final prediction would be class yes.

Boosting

Boosting is a technique similar to bagging. In boosting and bagging, you always use the same type of classifier. But in boosting, the different classifiers are trained sequentially. Each new classifier is trained based on the performance of those already trained, and gives greater weight to examples that were misclassified by the previous classifier. In this way, boosting focuses new classifiers in the sequence on previously misclassified data. Boosting also differs from bagging in its approach to calculating the final prediction. The output is calculated from a weighted sum of all classifiers, as opposed to the method of equal weights used in bagging.
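The two ways of combining models described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are made up for this example, the votes stand in for the outputs of five already trained models, and the weights in the weighted vote are hypothetical numbers rather than values produced by a real boosting run.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) examples from data, with replacement."""
    return [rng.choice(data) for _ in data]


def majority_vote(predictions):
    """Bagging-style combination: every model's vote counts equally."""
    return Counter(predictions).most_common(1)[0][0]


def weighted_vote(predictions, weights):
    """Boosting-style combination: each vote is scaled by the model's weight."""
    totals = {}
    for label, weight in zip(predictions, weights):
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=totals.get)


rng = random.Random(42)  # fixed seed so the run is repeatable
data = list("abcdefgh")

# Five bootstrap samples, each the same size as the original data.
samples = [bootstrap_sample(data, rng) for _ in range(5)]
for number, sample in enumerate(samples, start=1):
    print(f"Sample {number}: {', '.join(sorted(sample))}")

# Stand-ins for the outputs of five trained models.
votes = ["yes", "yes", "yes", "no", "no"]
weights = [0.1, 0.1, 0.1, 0.5, 0.4]  # hypothetical per-model weights

print(majority_vote(votes))           # yes (three votes against two)
print(weighted_vote(votes, weights))  # no (0.9 total weight against 0.3)
```

With equal weights the three yes votes win, while the weighted vote lets two strongly weighted models overrule three weakly weighted ones.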
The weights assigned to the classifier output in boosting are based on the performance of the classifier in the previous iteration.

Validation

After collecting the training set and extracting the features, you need to train the model and validate it on unseen samples. There are many approaches for creating the unseen sample, called the validation set. We will discuss a couple of them here.

Holdout-set validation

One approach to creating the validation set is to divide the feature set into train and test samples. We use the train set to train the model and the test set to validate it. The actual percentage split varies from case to case, but commonly it is split at 70 percent train and 30 percent test. It is also not uncommon to create three sets: train, test, and validation. The train and test sets are created from data out of all considered time periods, but the validation set is created from the most recent data.

K-fold cross validation

Another approach is to divide the data into k equal-size folds or parts, and then use k-1 of them for training and one for testing. The process is repeated k times so that each fold is used as a validation set once, and the metrics are collected over all the runs. The general standard is to use k as 10, which is called 10-fold cross-validation.

Evaluation

The objective of evaluation is to test the generalization of a classifier. By generalization, we mean how well the model performs on future data. Ideally, evaluation should be done on an unseen sample, separate from the validation sample, or by cross-validation. There are standard metrics to evaluate a classifier against. There are a few things that we should keep in mind while training a classifier.

Bias-variance trade-off

The first aspect to keep in mind is the trade-off between bias and variance. To understand the meaning of bias and variance, let's assume that we have several different, but equally good, training datasets for a specific supervised learning problem.
We train different models using the same technique; for example, we build different decision trees using the different training datasets available. Bias measures how far off, in general, a model's predictions are from the correct value. Bias can be measured as the average difference between a predicted output and its actual value. A learning algorithm is biased for a particular input X if, when trained on different training sets, it is systematically incorrect when predicting the output for X. Variance is how greatly the predictions for a given point vary between different realizations of the model. A learning algorithm has high variance for a particular input X if it predicts different output values for X when trained on different training sets.

Generally, there will be a trade-off between bias and variance. A learning algorithm with low bias must be flexible so that it can fit the data well. But if the learning algorithm is too flexible, it will fit each training dataset differently, and hence have high variance. A key aspect of many supervised learning methods is that they are able to adjust this trade-off between bias and variance.

The plot on the top left is the scatter plot of the original data. The plot on the top right is a fit with high bias; the error in prediction in this case will be high. The bottom left image is a fit with high variance; the model is very flexible, and the error on the training set is low, but the prediction on unseen data will have a much higher degree of error as compared to the training set. The bottom right plot is an optimum fit with a good trade-off of bias and variance. The model explains the data well and will perform in a similar way for unseen data too.

If the bias-variance trade-off is not optimized, it leads to problems of under-fitting and over-fitting. The plot shows a visual representation of the bias-variance trade-off. Over-fitting occurs when an estimator is too flexible and tries to fit the data too closely.
High variance and low bias lead to over-fitting of data. Under-fitting occurs when a model is not flexible enough to capture the underlying trends in the observed data. Low variance and high bias lead to under-fitting of data.

Function complexity and amount of training data

The second aspect to consider is the amount of training data needed to properly represent the learning task. The amount of data required grows with the complexity of the data and the learning task at hand. For example, if the features in the data have low interaction and are few in number, we could train a model with a small amount of data. In this case, a learning algorithm with high bias and low variance is better suited. But if the learning task at hand is complex and has a large number of features with a higher degree of interaction, then a large amount of training data is required. In this case, a learning algorithm with low bias and high variance is better suited. It is difficult to actually determine the amount of data needed, but the complexity of the task provides some indication.

Dimensionality of the input space

A third aspect to consider is the dimensionality of the input space. By dimensionality, we mean the number of features the training set has. If the input feature set has a very high number of features, any machine learning algorithm will require a huge amount of data to build a good model. In practice, it is advisable to remove any extra dimensionality before training the model; this is likely to improve the accuracy of the learned function. Techniques like feature selection and dimensionality reduction can be used for this.

Noise in data

The fourth issue is noise. Noise refers to inaccuracies in data due to various issues. Noise can be present either in the predictor variables or in the target variable. Both lead to model inaccuracies and reduce the generalization of the model.
In practice, there are several approaches to alleviate noise in the data; the first is to identify and then remove the noisy training examples prior to training the supervised learning algorithm, and the second is to use an early stopping criterion to prevent over-fitting.

Unsupervised learning

Unsupervised learning deals with unlabeled data. The objective is to observe structure in data and find patterns. Tasks like cluster analysis, association rule mining, outlier detection, dimensionality reduction, and so on can be modeled as unsupervised learning problems. As the tasks involved in unsupervised learning vary vastly, there is no single process outline that we can follow. We will follow the process of some of the most common unsupervised learning problems.

Cluster analysis

Cluster analysis is a subset of unsupervised learning that aims to create groups of similar items from a set of items. A real-life example could be clustering movies according to various attributes like genre, length, ratings, and so on. Cluster analysis helps us identify interesting groups of objects. These could be items we encounter in day-to-day life, such as movies or songs grouped according to taste, or users grouped in terms of their demography or purchasing patterns. Let's consider a small example so you understand what we mean by interesting groups and appreciate the power of clustering. We will use the Iris dataset, which is a standard dataset used for academic research; it contains 150 observations of five variables: sepal length, sepal width, petal length, petal width, and species.

The first plot shows petal length against petal width. Each color represents a different species. The second plot shows the groups identified by clustering the data. Looking at the plots, we can see that the plot of petal length against petal width clearly separates the species of the Iris flower, and that clustering groups flowers of the same species together.
Cluster analysis can be used to identify interesting patterns in data. The process of clustering involves these four steps, each of which is discussed in the sections ahead:

  • Objective
  • Feature representation
  • Algorithm for clustering
  • A stopping criterion

Objective

What do we want to cluster? This is an important question. Let's assume we have a large customer base for some kind of an e-commerce site and we want to group the customers together. How do we want to group them? Do we want to group our users according to their demography, such as age, location, income, and so on, or are we interested in grouping them by something else, such as their purchasing behavior? A clear objective is a good start, though it is not uncommon to start without an objective and see what can be done with the available data.

Feature representation

As with any machine learning task, feature representation is important for cluster analysis too. Creating derived features, summarizing data, and converting categorical variables to continuous variables are some of the common tasks. The feature representation needs to represent the objective of clustering. For example, if the objective is to cluster users based upon purchasing behavior, then features should be derived from purchase transaction and user demography information. If the objective is to cluster documents, then features should be extracted from the text of the document.

Feature normalization

To compare the feature vectors, we need to normalize them. Normalization could be across rows or across columns. In most cases, both are normalized.

Row normalization

The objective of normalizing rows is to make the objects to be clustered comparable. Let's assume we are clustering organizations based upon their e-mailing behavior. Organizations can be very large or very small, but the objective is to capture the e-mailing behavior irrespective of the size of the organization. In this scenario, we need to figure out a way to normalize the rows representing each organization so that they can be compared.
In this case, dividing by the user count in each respective organization could give us a good feature representation. Row normalization is mostly driven by the business domain and requires domain expertise.

Column normalization

The range of data across columns varies across datasets. The unit could be different, or the range of columns could be different, or both. There are many ways of normalizing data. Which technique to use varies from case to case and depends upon the objective. A few of them are discussed here.

Rescaling

The simplest method is to rescale the range of each feature so that all features lie on a common scale and none dominates merely because of its units. The aim is to scale the range to [0, 1] or [−1, 1]. For the range [0, 1]:

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$

Here x is the original value and x' the rescaled value.

Standardization

Feature standardization makes the values of each feature in the data have zero mean and unit variance. In general, we first calculate the mean and standard deviation for each feature and then subtract the mean from each value of the feature. Then, we divide the mean-subtracted values of each feature by its standard deviation:

$$X_s = \frac{X - \mathrm{mean}(X)}{\mathrm{std}(X)}$$

Here std(X) is the standard deviation of X.

A notion of similarity and dissimilarity

Once we have the objective defined, it leads to the idea of similarity and dissimilarity of objects or data points. Since we need to group things together based on similarity, we need a way to measure similarity. Likewise, to keep dissimilar things apart, we need a notion of dissimilarity. This idea is represented in machine learning by the idea of a distance measure. A distance measure, as the name suggests, is used to measure the distance between two objects or data points.

Euclidean distance measure

The Euclidean distance measure is the most commonly used and intuitive distance measure. For two points a and b with n coordinates each:

$$d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$$

Squared Euclidean distance measure

The standard Euclidean distance, when squared, places progressively greater weight on objects that are farther apart as compared to the nearer objects.
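To tie column normalization to the Euclidean measures just introduced, here is a minimal Python sketch. It assumes the usual min-max formula for rescaling to [0, 1] and the population standard deviation for standardization; the customer figures are made-up values chosen only to show how a feature with a large range dominates the distance until the columns are rescaled.

```python
import math


def rescale(column):
    """Min-max rescaling to [0, 1]; assumes the column is not constant."""
    low, high = min(column), max(column)
    return [(x - low) / (high - low) for x in column]


def standardize(column):
    """Subtract the mean, then divide by the population standard deviation."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / len(column))
    return [(x - mean) / std for x in column]


def squared_euclidean(p, q):
    """Sum of squared coordinate differences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def euclidean(p, q):
    """Square root of the squared Euclidean distance."""
    return math.sqrt(squared_euclidean(p, q))


# Hypothetical customers: age in years and yearly income.
ages = [25, 35, 45]
incomes = [20000, 60000, 100000]

raw = list(zip(ages, incomes))
scaled = list(zip(rescale(ages), rescale(incomes)))

# On raw values the income column swamps the age column.
print(euclidean(raw[0], raw[1]))

# After rescaling, both columns contribute equally.
print(euclidean(scaled[0], scaled[1]))  # about 0.707
print(euclidean(scaled[0], scaled[2]))  # about 1.414, twice as far

# Squaring turns "twice as far" into "four times as far".
print(squared_euclidean(scaled[0], scaled[1]))  # 0.5
print(squared_euclidean(scaled[0], scaled[2]))  # 2.0

print(standardize(ages))
```

The last two pairs of prints show the effect described above: doubling the Euclidean distance quadruples the squared distance, so far-apart objects weigh progressively more.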
The equation to calculate the squared Euclidean measure for two points a and b with n coordinates each is shown here:

$$d(a, b) = \sum_{i=1}^{n} (a_i - b_i)^2$$

Manhattan distance measure

The Manhattan distance measure is defined as the sum of the absolute differences of the coordinates of two points. It is the distance between two points measured along axes at right angles. In a plane with p1 at (x1, y1) and p2 at (x2, y2), it is |x1 - x2| + |y1 - y2|. In general:

$$d(a, b) = \sum_{i=1}^{n} |a_i - b_i|$$

Cosine distance measure

The cosine distance measure measures the angle between the vectors from the origin to two points. When this angle is small, the vectors must be pointing in the same direction, and so in some sense the points are close. The cosine of this angle is near one when the angle is small, and decreases as it gets larger. The cosine distance equation subtracts the cosine value from one in order to give a proper distance, which is 0 when close and larger otherwise. The cosine distance measure doesn't account for the length of the two vectors; all that matters is that the points are in the same direction from the origin. Also note that the cosine distance measure ranges from 0.0, if the two vectors are along the same direction, to 2.0, when the two vectors are in opposite directions:

$$d(a, b) = 1 - \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$$

Tanimoto distance measure

The Tanimoto distance measure, like the cosine distance measure, measures the angle between two points, as well as the relative distance between the points:

$$d(a, b) = 1 - \frac{a \cdot b}{\lVert a \rVert^2 + \lVert b \rVert^2 - a \cdot b}$$

Apart from the standard distance measures, we can also define our own distance measure. A custom distance measure can be explored when the existing ones are not able to measure the similarity between items.

Algorithm for clustering

The type of clustering algorithm to be used is driven by the objective of the problem at hand. There are several options, and the predominant ones are density-based clustering, distance-based clustering, distribution-based clustering, and hierarchical clustering.

A stopping criterion

We need to know when to stop the clustering process.
The stopping criterion could be decided in different ways: one way is to stop when the cluster centroids don't move beyond a certain margin after multiple iterations, a second way is to stop when the density of the clusters has stabilized, and a third way is based upon the number of iterations, for example stopping the algorithm after 100 iterations. The stopping criterion depends upon the algorithm used, the goal being to stop when we have good enough clusters.

Logistic regression

Logistic regression is a probabilistic classification model. It provides the probability of a particular instance belonging to a class. It is used to predict the probability of binary outcomes. Logistic regression is computationally inexpensive, is relatively easy to implement, and can be interpreted easily.

Logistic regression belongs to the class of discriminative models. The other class of algorithms is generative models. Let's try to understand the differences between the two. Suppose we have some input data represented by X and a target variable Y; the learning task is to estimate P(Y|X), the conditional probability of Y occurring given X. A generative model concerns itself with learning the joint probability P(Y, X), whereas a discriminative model directly learns the conditional probability P(Y|X) from the training set, which is the actual objective of classification. A generative model first learns P(Y, X), and then gets to P(Y|X) by conditioning on X using Bayes' theorem. In more intuitive terms, generative models first learn the distribution of the data, and so they model how the data is actually generated. Discriminative models, however, don't try to learn the underlying data distribution; they are concerned with finding the decision boundaries for the classification. Since generative models learn the distribution, it is possible to generate synthetic samples of X, Y. This is not possible with discriminative models.
Some common examples of generative and discriminative models are as follows:

  • Generative: naïve Bayes, Latent Dirichlet allocation
  • Discriminative: Logistic regression, SVM, Neural networks

Logistic regression belongs to the family of statistical techniques called regression. For regression problems and a few other optimization problems, we first define a hypothesis, then define a cost function, and optimize it using an optimization algorithm such as gradient descent. The optimization algorithm tries to find the regression coefficients that best fit the data. Let's assume that the target variable is Y and the predictor variable or feature is X. Any regression problem starts with defining the hypothesis function, for example, an equation of the predictor variable such as

$$h_\theta(x) = \theta_0 + \theta_1 x$$

then defines a cost function, and then tweaks the weights; in this case, $\theta_0$ and $\theta_1$ are tweaked to minimize or maximize the cost function by using an optimization algorithm. For logistic regression, the predicted target needs to fall between zero and one. We start by defining the hypothesis function for it:

$$h_\theta(x) = f(\theta^{T} x), \qquad f(z) = \frac{1}{1 + e^{-z}}$$

Here, f(z) is the sigmoid or logistic function that has a range of zero to one, x holds the features (the feature vector of an instance, that is, a row of the feature matrix), and $\theta$ is the vector of weights.

The next step is to define the cost function, which measures the difference between predicted and actual values. The objective of the optimization algorithm here is to find $\theta$, that is, to fit the regression coefficients so that the difference between predicted and actual target values is minimized. We use gradient descent as the optimization algorithm. To find the local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient of that function at the current point. This will give us the optimum value of the vector $\theta$ once we achieve the stopping criterion.
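The procedure just described can be sketched in plain Python. The text does not name the cost function, so this sketch assumes the usual log loss (cross-entropy), whose gradient for one example is (prediction minus label) times the feature values. The learning rate, tolerance, iteration cap, and the tiny dataset are all made-up values for illustration, and the loop stops when the largest weight change falls below the tolerance or the iteration cap is reached.

```python
import math


def sigmoid(z):
    """Logistic function; maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, features):
    """Hypothesis: sigmoid of the weighted sum of the features."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)))


def train(rows, labels, rate=0.1, tolerance=1e-6, max_iterations=10000):
    """Batch gradient descent on the average log loss."""
    weights = [0.0] * len(rows[0])
    for _ in range(max_iterations):
        # Accumulate the gradient over the whole dataset.
        gradient = [0.0] * len(weights)
        for features, label in zip(rows, labels):
            error = predict(weights, features) - label
            for j, x in enumerate(features):
                gradient[j] += error * x
        # Step against the gradient, scaled by the learning rate.
        step = [rate * g / len(rows) for g in gradient]
        weights = [w - s for w, s in zip(weights, step)]
        # Stop once the weights have effectively stopped changing.
        if max(abs(s) for s in step) < tolerance:
            break
    return weights


# Made-up data: a constant 1.0 (bias term) followed by one feature.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
rows = [[1.0, x] for x in xs]

weights = train(rows, labels)
print(weights)
print(predict(weights, [1.0, 0.5]))  # low probability of class 1
print(predict(weights, [1.0, 4.0]))  # high probability of class 1
```

The constant first feature plays the role of the intercept, so the same update rule handles the bias and the slope.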
The stopping criterion is met when the change in the weight vector falls below a certain threshold, although sometimes it could be set to a predefined number of iterations. Logistic regression falls into the category of white box techniques and can be interpreted.

Features or variables are of two major types, categorical and continuous, defined as follows:

  • Categorical variable: This is a variable or feature that can take on a limited, and usually fixed, number of possible values. For example, variables such as industry, zip code, and country are categorical variables.
  • Continuous variable: This is a variable that can take on any value between its minimum value and maximum value or range. For example, variables such as age, price, and so on, are continuous variables.

Mahout logistic regression command line

Mahout employs a modified version of gradient descent called stochastic gradient descent. The optimization algorithm discussed previously, gradient descent, uses the whole dataset on each update. This is fine for a hundred or so examples, but with billions of data points containing thousands of features, it is unnecessarily expensive in terms of computational resources. An alternative to this method is to update the weights using only one instance at a time. This is known as stochastic gradient descent. Stochastic gradient descent is an example of an online learning algorithm. It is known as an online learning algorithm because we can incrementally update the classifier as new data comes in, rather than all at once. The all-at-once method is known as batch processing.

We will now train and test a logistic regression algorithm using Mahout. We will also discuss both command line and code examples. The first step is to get the data and explore it.

Getting the data

The dataset required for this article is included in the code repository that comes with this book. It is present in the learningApacheMahout/data/chapter4 directory.
If you wish to download the data, it can be downloaded from the UCI link. The UCI is a repository for many datasets for machine learning. You can check out the other datasets available for further practice via this link: http://archive.ics.uci.edu/ml/datasets.html.

Create a folder in your home directory with the following commands:

    cd $HOME
    mkdir bank_data
    cd bank_data

Download the data in the bank_data directory:

    wget http://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip

Unzip the file using whichever utility you like; we use unzip:

    unzip bank-additional.zip
    cd bank-additional

We are interested in the file bank-additional-full.csv. Copy the file to the learningApacheMahout/data/chapter4 directory. The file is semicolon delimited and the values are enclosed in double quotes ("); it also has a header line with the column names. We will use sed to preprocess the data. sed is a very powerful stream editor in Linux, and the command to use it is as follows:

    sed -e 's/STRING_TO_REPLACE/STRING_TO_REPLACE_IT/g' fileName > Output_fileName

For in-place editing, the command is as follows:

    sed -i 's/STRING_TO_REPLACE/STRING_TO_REPLACE_IT/g' fileName

The commands to replace ; with , and to remove " are as follows:

    sed -e 's/;/,/g' bank-additional-full.csv > input_bank_data.csv
    sed -i 's/"//g' input_bank_data.csv

The dataset contains demographic and previous campaign-related data about a client, and the outcome of whether or not the client subscribed to the term deposit. We are interested in training a model that can predict whether a client will subscribe to a term deposit, given the input data.
The following table shows various input variables along with their types:

    Column name      Description                                                          Variable type
    age              The age of the client                                                Numeric
    job              The type of job, for example, entrepreneur, housemaid, management   Categorical
    marital          The marital status                                                   Categorical
    education        The education level                                                  Categorical
    default          Whether the client has defaulted on credit                           Categorical
    housing          Whether the client has a housing loan                                Categorical
    loan             Whether the client has a personal loan                               Categorical
    contact          The contact communication type                                       Categorical
    month            The last contact month of the year                                   Categorical
    day_of_week      The last contact day of the week                                     Categorical
    duration         The last contact duration, in seconds                                Numeric
    campaign         The number of contacts                                               Numeric
    pdays            The number of days that passed since the last contact                Numeric
    previous         The number of contacts performed before this campaign                Numeric
    poutcome         The outcome of the previous marketing campaign                       Categorical
    emp.var.rate     The employment variation rate (quarterly indicator)                  Numeric
    cons.price.idx   The consumer price index (monthly indicator)                         Numeric
    cons.conf.idx    The consumer confidence index (monthly indicator)                    Numeric
    euribor3m        The euribor three month rate (daily indicator)                       Numeric
    nr.employed      The number of employees (quarterly indicator)                        Numeric

Model building via command line

Mahout provides a command-line implementation of logistic regression. We will first build a model using the command-line implementation. Logistic regression does not have a MapReduce implementation, but as it uses stochastic gradient descent, it is pretty fast, even for large datasets. The Mahout Java class is OnlineLogisticRegression in the org.apache.mahout.classifier.sgd package.
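To make the idea of updating the weights one instance at a time concrete, here is a minimal Python sketch of an online learner. It is not Mahout's OnlineLogisticRegression and does not reproduce its API or its regularization; the class name, method names, learning rate, number of passes, and the toy data are all assumptions made for this illustration.

```python
import math
import random


def sigmoid(z):
    """Logistic function; maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


class OnlineLogisticSketch:
    """Toy online learner that updates its weights per training instance."""

    def __init__(self, num_features, rate=0.1):
        self.weights = [0.0] * num_features
        self.rate = rate

    def classify(self, features):
        """Estimated probability that the instance belongs to class 1."""
        return sigmoid(sum(w * x for w, x in zip(self.weights, features)))

    def train(self, label, features):
        """One stochastic gradient step using a single instance."""
        error = label - self.classify(features)
        self.weights = [w + self.rate * error * x
                        for w, x in zip(self.weights, features)]


# Made-up data: a constant 1.0 (bias term) followed by one feature.
points = [(0.5, 0), (1.0, 0), (1.5, 0), (2.5, 0),
          (2.0, 1), (3.0, 1), (3.5, 1), (4.0, 1)]
examples = [([1.0, x], y) for x, y in points]

rng = random.Random(0)  # fixed seed so the run is repeatable
model = OnlineLogisticSketch(num_features=2)
for _ in range(200):            # number of passes over the data
    rng.shuffle(examples)       # visit the instances in random order
    for features, label in examples:
        model.train(label, features)

print(model.weights)
print(model.classify([1.0, 0.5]))  # low probability of class 1
print(model.classify([1.0, 4.0]))  # high probability of class 1
```

Because each call to train touches only one instance, the same model object could keep learning as new rows arrive, which is what makes the approach suitable for large datasets.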
Splitting the dataset

To split a dataset, we can use the Mahout split command. Let's look at the split command arguments:

    mahout split --help

We need to remove the header line before running the split command, because the split command makes no special allowance for header lines. The header would be treated as an ordinary row and could land on any line of either split file. We first remove the header line from the input_bank_data.csv file and copy the file into an input directory:

    sed -i '1d' input_bank_data.csv
    mkdir input_bank
    cp input_bank_data.csv input_bank

Logistic regression in Mahout is implemented for single-machine execution. We set the variable MAHOUT_LOCAL to instruct Mahout to execute in local mode:

    export MAHOUT_LOCAL=TRUE
    mahout split --input input_bank --trainingOutput train_data --testOutput test_data -xm sequential --randomSelectionPct 30

This creates two datasets, with the split based on the number passed to the --randomSelectionPct argument. The split command can run on both Hadoop and the local file system. For the current execution, it runs in local mode on the local file system and splits the data into two sets: 70 percent as training data in the train_data directory and 30 percent as test data in the test_data directory.
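The effect of a 70/30 random split can be sketched in a few lines of Python. This is only an illustration of the idea, not how mahout split is implemented; the function name split_lines, the seed, and the toy rows are hypothetical.

```python
import random


def split_lines(lines, test_pct=30, seed=7):
    """Shuffle the lines and hold out roughly test_pct percent as test data."""
    rng = random.Random(seed)
    shuffled = list(lines)  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_pct / 100)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical stand-in for the header-less rows of input_bank_data.csv.
lines = [f"row{i}" for i in range(10)]
train, test = split_lines(lines)
print(len(train), len(test))
```

Every row ends up in exactly one of the two sets, which is also why the header has to be removed first: a header left in the file would simply be shuffled along with the data rows.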
Next, we restore the header line to the train and test files as follows:

    sed -i '1s/^/age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y\n/' train_data/input_bank_data.csv
    sed -i '1s/^/age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y\n/' test_data/input_bank_data.csv

Train the model command line option

Let's have a look at some important and commonly used parameters and their descriptions:

    mahout trainlogistic --help

    --help         print this list
    --quiet        be extra quiet
    --input        input directory from where to get the training data
    --output       output directory to store the model
    --target       the name of the target variable
    --categories   the number of target categories to be considered
    --predictors   a list of predictor variables
    --types        a list of predictor variable types (numeric, word or text)
    --passes       the number of times to pass over the input data
    --lambda       the amount of coefficient decay to use
    --rate         the learning rate (learningRate)
    --noBias       do not include a bias term
    --features     the number of internal hashed features to use

    mahout trainlogistic --input train_data/input_bank_data.csv --output model --target y \
        --predictors age job marital education default housing loan contact month day_of_week duration campaign pdays previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed \
        --types n w w w w w w w w w n n n n w n n n n n \
        --features 20 --passes 100 --rate 50 --categories 2

We pass the input filename and the output folder name, identify the target variable name using the --target option, the predictors using the --predictors option, and the type of each predictor using the --types option.
Numeric predictors are represented using 'n', and categorical predictors are represented using 'w'. The learning rate passed using --rate is used by gradient descent to determine the size of each step. We pass the maximum number of passes over the data as 100 and the number of categories as 2.

The output is given below. It represents 'y', the target variable, as a sum of predictor variables multiplied by their coefficients, or weights. As we have not included the --noBias option, we see the intercept term in the equation:

y ~ -990.322*Intercept Term + -131.624*age + -11.436*campaign + -990.322*cons.conf.idx + -14.006*cons.price.idx + -15.447*contact=cellular + -9.738*contact=telephone + 5.943*day_of_week=fri + -988.624*day_of_week=mon + 10.551*day_of_week=thu + 11.177*day_of_week=tue + -131.624*day_of_week=wed + -8.061*default=no + 12.301*default=unknown + -131.541*default=yes + 6210.316*duration + -17.755*education=basic.4y + 4.618*education=basic.6y + 8.780*education=basic.9y + -11.501*education=high.school + 0.492*education=illiterate + 17.412*education=professional.course + 6202.572*education=university.degree + -979.771*education=unknown + -189.978*emp.var.rate + -6.319*euribor3m + -21.495*housing=no + -14.435*housing=unknown + 6210.316*housing=yes + -190.295*job=admin.
+ 23.169*job=blue-collar + 6202.200*job=entrepreneur + 6202.200*job=housemaid + -3.208*job=management + -15.447*job=retired + 1.781*job=self-employed + 11.396*job=services + -6.637*job=student + 6202.572*job=technician + -9.976*job=unemployed + -4.575*job=unknown + -12.143*loan=no + -0.386*loan=unknown + -197.722*loan=yes + -12.308*marital=divorced + -9.185*marital=married + -1004.328*marital=single + 8.559*marital=unknown + -11.501*month=apr + 9.110*month=aug + -1180.300*month=dec + -189.978*month=jul + 14.316*month=jun + -124.764*month=mar + 6203.997*month=may + -0.884*month=nov + -9.761*month=oct + 12.301*month=sep + -990.322*nr.employed + -189.978*pdays + -14.323*poutcome=failure + 4.874*poutcome=nonexistent + -7.191*poutcome=success + 1.698*previous

Interpreting the output

The output of the trainlogistic command is an equation representing the sum of all predictor variables multiplied by their respective coefficients. Each coefficient gives the change in the log-odds of the outcome for a one-unit increase in the corresponding feature or predictor variable. Odds are the ratio of the probability that an event occurs to the probability that it does not occur. Taking the natural logarithm of the odds gives the log-odds. Let's take an example to understand it better. Let's assume that the probability of some event E occurring is 75 percent:

P(E) = 75% = 75/100 = 3/4

The probability of E not happening is as follows:

1 - P(E) = 25% = 25/100 = 1/4

The odds in favor of E occurring are P(E)/(1 - P(E)) = 3:1, and the odds against it are 1:3. This shows that the event is three times more likely to occur than not to occur. The log-odds would be ln(3), which is approximately 1.099.

For example, in the output above the coefficient of age is -131.624, so a unit increase in age decreases the log-odds of the client subscribing to a term deposit by 131.624. Similarly, the coefficient of cons.conf.idx is -990.322, so a unit increase in cons.conf.idx decreases the log-odds by 990.322.
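The odds and log-odds arithmetic for the 75 percent example can be checked with a short Python sketch. It assumes the natural logarithm, which is the convention for the logit in logistic regression; the function names and the coefficient value 0.5 are hypothetical and are not taken from Mahout or from the model output.

```python
import math


def odds(p):
    """Odds in favor of an event that has probability p."""
    return p / (1.0 - p)


def log_odds(p):
    """Natural logarithm of the odds (the logit of p)."""
    return math.log(odds(p))


def probability_from_log_odds(z):
    """Invert the logit: turn log-odds back into a probability."""
    return 1.0 / (1.0 + math.exp(-z))


p_event = 0.75
print(odds(p_event))      # 3.0, that is, odds of 3:1
print(log_odds(p_event))  # ln(3), roughly 1.0986

# A coefficient is added to the log-odds for each unit increase in its variable.
coefficient = 0.5  # hypothetical value, for illustration only
shifted = log_odds(p_event) + coefficient
print(probability_from_log_odds(shifted))
```

Adding a coefficient to the log-odds is the same as multiplying the odds by e raised to that coefficient, which is often the easier way to read a fitted model.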
In each case, the change in log-odds attributed to one variable is measured while keeping the other variables at the same value.

Testing the model

After the model is trained, it's time to test the model's performance by using a validation set. Mahout has the runlogistic command for this purpose. Its options can be listed as follows:

    mahout runlogistic --help

We first run the following command on the training data:

    mahout runlogistic --auc --confusion --input train_data/input_bank_data.csv --model model

    AUC = 0.59
    confusion: [[25189.0, 2613.0], [424.0, 606.0]]
    entropy: [[NaN, NaN], [-45.3, -7.1]]

To get the scores for each instance, we use the --scores option as follows:

    mahout runlogistic --scores --input train_data/input_bank_data.csv --model model

To test the model on the test data, we pass the test file created during the split process as follows:

    mahout runlogistic --auc --confusion --input test_data/input_bank_data.csv --model model

    AUC = 0.60
    confusion: [[10743.0, 1118.0], [192.0, 303.0]]
    entropy: [[NaN, NaN], [-45.2, -7.5]]

Prediction

Mahout doesn't have an out-of-the-box command-line tool for using a logistic regression model to predict new samples. Note that new samples won't have the target label y; that is the value we need to predict. There is a workaround, though: we can use mahout runlogistic to generate predictions by adding a dummy column as the y target variable and filling it with arbitrary values. The runlogistic command expects the target variable to be present, which is why the dummy column is added. We can then get the predicted scores using the --scores option.

Summary

In this article, we covered the basic machine learning concepts. We also saw the logistic regression example in Mahout.

Resources for Article:

Further resources on this subject:
Implementing the Naïve Bayes classifier in Mahout [article]
Learning Random Forest Using Mahout [article]
Understanding the HBase Ecosystem [article]