
How-To Tutorials - Data

The Splunk Interface

Packt
10 Aug 2015
17 min read
In this article by Vincent Bumgarner and James D. Miller, authors of the book Implementing Splunk - Second Edition, we will walk through the most common elements in the Splunk interface and touch upon concepts that will be covered in greater detail later. You may want to dive right into the search section, but an overview of the user interface elements might save you some frustration later. We will cover the following topics:

Logging in and app selection
A detailed explanation of the search interface widgets
A quick overview of the admin interface

Logging into Splunk

The Splunk GUI (Splunk is also accessible through its command-line interface (CLI) and REST API) is web-based, which means that no client needs to be installed. Newer browsers with fast JavaScript engines, such as Chrome, Firefox, and Safari, work better with the interface. As of Splunk Version 6.2.0, no browser extensions are required. Splunk Versions 4.2 and earlier require Flash to render graphs. Flash can still be used by older browsers, or for older apps that reference Flash explicitly.

The default port for a Splunk installation is 8000. The address will look like http://mysplunkserver:8000 or http://mysplunkserver.mycompany.com:8000. If you have installed Splunk on your local machine, the address can be some variant of http://localhost:8000, http://127.0.0.1:8000, http://machinename:8000, or http://machinename.local:8000.

Once you determine the address, the first page you will see is the login screen. The default username is admin with the password changeme. The first time you log in, you will be prompted to change the password for the admin user. It is a good idea to change this password to prevent unwanted changes to your deployment.

By default, accounts are configured and stored within Splunk. Authentication can be configured to use another system, for instance Lightweight Directory Access Protocol (LDAP). By default, Splunk authenticates locally. If LDAP is set up, the order is as follows: LDAP / Local.
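Because the REST API mentioned above is always available alongside the web interface, the same login can also be performed from a script. The following minimal sketch assumes the default management port 8089 (separate from the web port 8000 described here), a hypothetical hostname, changed admin credentials, and the Python requests library; it is an illustration based on Splunk's documented login endpoint, not an example from the book.

# Minimal sketch: authenticate against Splunk's REST API and keep the session key.
# Assumptions: management port 8089, hypothetical hostname, placeholder password.
import requests
import xml.etree.ElementTree as ET

SPLUNK_MGMT = "https://mysplunkserver:8089"   # hypothetical host

resp = requests.post(
    f"{SPLUNK_MGMT}/services/auth/login",
    data={"username": "admin", "password": "your_new_password"},
    verify=False,  # self-signed certificate on a test instance
)
resp.raise_for_status()

# The response is XML containing a <sessionKey> element
session_key = ET.fromstring(resp.text).findtext("sessionKey")
headers = {"Authorization": f"Splunk {session_key}"}
print("Authenticated, session key obtained")

The returned session key is then passed in an Authorization header of the form Splunk <key> on subsequent REST calls, as the later search example shows.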
The home app

After logging in, the default app is the Launcher app (some may refer to this as Home). This app is a launching pad for apps and tutorials. In earlier versions of Splunk, the Welcome tab provided two important shortcuts: Add data and the Launch search app. In version 6.2.0, the Home app is divided into distinct areas, or panes, that provide easy access to Explore Splunk Enterprise (Add Data, Splunk Apps, Splunk Docs, and Splunk Answers) as well as Apps (the App management page), Search & Reporting (the link to the Search app), and an area where you can set your default dashboard (Choose a home dashboard).

The Explore Splunk Enterprise pane shows links to:

Add Data: This links to the Add Data to Splunk page. This interface is a great start for getting local data flowing into Splunk (making it available to Splunk users). The Preview data interface takes an enormous amount of complexity out of configuring dates and line breaking.
Splunk Apps: This allows you to find and install more apps from the Splunk Apps Marketplace (http://apps.splunk.com). This marketplace is a useful resource where Splunk users and employees post Splunk apps, mostly free but some premium ones as well.
Splunk Answers: This is one of your links to the wealth of Splunk documentation available, specifically http://answers.splunk.com, where you can engage with the Splunk community on Splunkbase (https://splunkbase.splunk.com/) and learn how to get the most out of your Splunk deployment.

The Apps section shows the apps that have GUI elements on your instance of Splunk. App is an overloaded term in Splunk. An app doesn't necessarily have a GUI at all; it is simply a collection of configurations wrapped into a directory structure that means something to Splunk.

Search & Reporting is the link to the Splunk Search & Reporting app. Beneath the Search & Reporting link, Splunk provides an outline which, when you hover over it, displays a Find More Apps balloon tip. Clicking on the link opens the same Browse more apps page as the Splunk Apps link mentioned earlier.

Choose a home dashboard provides an intuitive way to select an existing (simple XML) dashboard and set it as part of your Splunk Welcome or Home page. This gives you a familiar starting point each time you enter Splunk. The following image displays the Choose Default Dashboard dialog. Once you select an existing dashboard from the dropdown list, it will be part of your welcome screen every time you log into Splunk, until you change it. There are no dashboards installed by default beyond those provided by the Search & Reporting app; once you have created additional dashboards, they can be selected as the default.

The top bar

The bar across the top of the window contains information about where you are, as well as quick links to preferences, other apps, and administration. The current app is specified in the upper-left corner. The following image shows the upper-left Splunk bar when using the Search & Reporting app. Clicking on the text takes you to the default page for that app. In most apps, the text next to the logo is simply changed, but the whole block can be customized with logos and alternate text by modifying the app's CSS.

The upper-right corner of the window contains action links that are almost always available:

The name of the user who is currently logged in appears first. In this case, the user is Administrator. Clicking on the username allows you to select Edit Account (which will take you to the Your account page) or Logout. Logout ends the session and forces the user to log in again.

The following screenshot shows what the Your account page looks like. This form presents the global preferences that a user is allowed to change. Other settings that affect users are configured through permissions on objects and settings on roles. (Note: preferences can also be configured using the CLI or by modifying specific Splunk configuration files.)

Full name and Email address are stored for the administrator's convenience.
Time zone can be changed for the logged-in user. This is a new feature in Splunk 4.3. Setting the time zone only affects the time zone used to display the data. It is very important that the date is parsed properly when events are indexed.
Default app controls the starting page after login. Most users will want to change this to search.
Restart backgrounded jobs controls whether unfinished queries should run again if Splunk is restarted.
Set password allows you to change your password. This is only relevant if Splunk is configured to use internal authentication.
For instance, if the system is configured to use Windows Active Directory via LDAP (a very common configuration), users must change their password in Windows.

Messages allows you to view any system-level error messages you may have pending. When there is a new message for you to review, a notification displays as a count next to the Messages menu. You can click the X to remove a message.

The Settings link presents the user with the configuration pages for all Splunk Knowledge objects, Distributed Environment settings, System and Licensing, Data, and Users and Authentication settings. If you do not see some of these options, you do not have the permissions to view or edit them.

The Activity menu lists shortcuts to Splunk Jobs, Triggered Alerts, and System Activity views. You can click Jobs (to open the search jobs manager window, where you can view and manage currently running searches), click Triggered Alerts (to view scheduled alerts that are triggered), or click System Activity (to see dashboards about user activity and the status of the system).

Help lists links to video Tutorials, Splunk Answers, the Splunk Contact Support portal, and online Documentation.

Find can be used to search for objects within your Splunk Enterprise instance. For example, if you type in error, it returns the saved objects that contain the term error. These saved objects include Reports, Dashboards, Alerts, and so on. You can also search for error in the Search & Reporting app by clicking Open error in search.

The search & reporting app

The Search & Reporting app (or just the search app) is where most actions in Splunk start. This app is a dashboard where you will begin your searching.

The summary view

Within the Search & Reporting app, the user is presented with the Summary view, which contains information about the data that that user searches by default. This is an important distinction—in a mature Splunk installation, not all users will always search all data by default. On your first trip into Search & Reporting, you can access the Splunk documentation related to What to Search and How to Search. Once you have at least some data indexed, Splunk will provide some statistics on the available data under What to Search (remember that this reflects only the indexes that this particular user searches by default; there are other events that are indexed by Splunk, including events that Splunk indexes about itself).

In previous versions of Splunk, panels such as the All indexed data panel provided statistics for a user's indexed data. Other panels gave a breakdown of data using three important pieces of metadata—Source, Sourcetype, and Hosts. In the current version—6.2.0—you access this information by clicking on the button labeled Data Summary, which presents a dialog that splits the information into three tabs—Hosts, Sources, and Sourcetypes.

A host is a captured hostname for an event. In the majority of cases, the host field is set to the name of the machine where the data originated. There are cases where this is not known, so the host can also be configured arbitrarily.

A source in Splunk is a unique path or name. In a large installation, there may be thousands of machines submitting data, but all data on the same path across these machines counts as one source.
When the data source is not a file, the value of the source can be arbitrary, for instance, the name of a script or network port.

A source type is an arbitrary categorization of events. There may be many sources, across many hosts, in the same source type. For instance, given the sources /var/log/access.2012-03-01.log and /var/log/access.2012-03-02.log on the hosts fred and wilma, you could reference all these logs with source type access or any other name that you like.

Let's move on now and discuss each of the Splunk widgets (just below the app name). The first widget is the navigation bar. As a general rule, within Splunk, items with downward triangles are menus; items without a downward triangle are links. Next we find the Search bar. This is where the magic starts. We'll go into great detail shortly.

Search

Okay, we've finally made it to search. This is where the real power of Splunk lies. For our first search, we will search for the word error (searches are not case sensitive). Click in the search bar, type the word error, and then either press Enter or click on the magnifying glass to the right of the bar. Upon initiating the search, we are taken to the search results page. Note that the search we just executed was across All time (the default); to change the search time, you can use the Splunk time picker.

Actions

Let's inspect the elements on this page. Below the Search bar, we have the event count, action icons, and menus. Starting from the left, we have the following:

The number of events matched by the base search. Technically, this may not be the number of results pulled from disk, depending on your search. Also, if your query uses commands, this number may not match what is shown in the event listing.
Job: This opens the Search job inspector window, which provides very detailed information about the query that was run.
Pause: This causes the current search to stop locating events but keeps the job open. This is useful if you want to inspect the current results to determine whether you want to continue a long-running search.
Stop: This stops the execution of the current search but keeps the results generated so far. This is useful when you have found enough and want to inspect or share the results found so far.
Share: This shares the search job. This option extends the job's lifetime to seven days and sets the read permissions to everyone.
Export: This exports the results. Select this option to output to CSV, raw events, XML, or JavaScript Object Notation (JSON) and specify the number of results to export (a scripted equivalent is sketched after this list).
Print: This formats the page for printing and instructs the browser to print.
Smart Mode: This controls the search experience. You can set it to speed up searches by cutting down on the event data it returns and, additionally, by reducing the number of fields that Splunk will extract by default from the data (Fast mode). You can, otherwise, set it to return as much event information as possible (Verbose mode). In Smart mode (the default setting), it toggles search behavior based on the type of search you're running.
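The same first search, and the Export action, can be reproduced from a script against the REST API. The sketch below reuses the SPLUNK_MGMT address and the headers built in the earlier login example and assumes the standard search jobs endpoint in one-shot mode; treat it as an illustration rather than the book's material.

# Sketch: run the "error" search in one-shot mode and fetch results as JSON,
# roughly what the Export action does from the UI. Assumes SPLUNK_MGMT and
# headers from the login example above.
import requests

resp = requests.post(
    f"{SPLUNK_MGMT}/services/search/jobs",
    headers=headers,
    params={"output_mode": "json"},
    data={
        "search": "search error | head 10",  # SPL needs the leading 'search' keyword here
        "exec_mode": "oneshot",              # run synchronously, return results directly
        "earliest_time": "-24h",             # narrower than the UI default of All time
    },
    verify=False,
)
resp.raise_for_status()
for result in resp.json().get("results", []):
    print(result.get("_time"), result.get("_raw", "")[:120])

Swapping the search string for a transforming search such as search error | stats count by sourcetype is what would populate the Statistics tab described in the next section.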
Timeline

Now we'll skip to the timeline below the action icons. Along with providing a quick overview of the event distribution over a period of time, the timeline is also a very useful tool for selecting sections of time. Placing the pointer over the timeline displays a pop-up for the number of events in that slice of time. Clicking on the timeline selects the events for a particular slice of time. Clicking and dragging selects a range of time. Once you have selected a period of time, clicking on Zoom to selection changes the time frame and reruns the search for that specific slice of time. Repeating this process is an effective way to drill down to specific events. Deselect shows all events for the time range selected in the time picker. Zoom out changes the window of time to a larger period around the events in the current time frame.

The field picker

To the left of the search results, we find the field picker. This is a great tool for discovering patterns and filtering search results.

Fields

The field list contains two lists:

Selected Fields, which have their values displayed under the search event in the search results
Interesting Fields, which are other fields that Splunk has picked out for you

Above the field list are two links: Hide Fields and All Fields. Hide Fields hides the field list area from view; All Fields takes you to the Selected Fields window.

Search results

We are almost through with all the widgets on the page. We still have a number of items to cover in the search results section though, just to be thorough. As you can see in the previous screenshot, at the top of this section, we have the number of events displayed. When viewing all results in their raw form, this number will match the number above the timeline. This value can be changed either by making a selection on the timeline or by using other search commands. Next, we have the action icons (described earlier) that affect these particular results. Under the action icons, we have four results tabs:

Events list, which will show the raw events. This is the default view when running a simple search, as we have done so far.
Patterns, which streamlines event pattern detection. It displays a list of the most common patterns among the set of events returned by your search. Each of these patterns represents the number of events that share a similar structure.
Statistics, which populates when you run a search with transforming commands such as stats, top, chart, and so on. The previous keyword search for error does not display any results in this tab because it does not have any transforming commands.
Visualization, which is populated by transforming searches. The results area of the Visualization tab includes a chart and the statistics table used to generate the chart. Not all searches are eligible for visualization.

Under the tabs just described is the timeline.

Options

Beneath the timeline (starting at the left) is a row of option links that include:

Show Fields: shows the Selected Fields screen
List: allows you to select an output option (Raw, List, or Table) for displaying the search results
Format: provides the ability to set Result display options, such as Show row numbers, Wrap results, the Max lines (to display), and Drilldown as on or off
NN Per Page: is where you can indicate the number of results to show per page (10, 20, or 50)

To the right are options that you can use to choose a page of results, and to change the number of events per page. In prior versions of Splunk, these options were available from the Results display options popup dialog.

The events viewer

Finally, we make it to the actual events. Let's examine a single event. Starting at the left, we have:

Event Details: Clicking here (indicated by the right-facing arrow) opens the selected event, providing specific information about the event by type, field, and value, and allows you to perform specific actions on a particular event field.
In addition, Splunk version 6.2.0 offers a button labeled Event Actions to access workflow actions, a few of which are always available:

Build Eventtype: Event types are a way to name events that match a certain query.
Extract Fields: This launches an interface for creating custom field extractions.
Show Source: This pops up a window with a simulated view of the original source.

The event number: Raw search results are always returned in the order most recent first.

Next to appear are any workflow actions that have been configured. Workflow actions let you create new searches or links to other sites, using data from an event.

Next comes the parsed date from this event, displayed in the time zone selected by the user. This is an important and often confusing distinction. In most installations, everything is in one time zone—the servers, the user, and the events. When one of these three things is not in the same time zone as the others, things can get confusing.

Next, we see the raw event itself. This is what Splunk saw as an event. With no help, Splunk can do a good job finding the date and breaking lines appropriately, but as we will see later, with a little help, event parsing can be more reliable and more efficient.

Below the event are the fields that were selected in the field picker. Clicking on a value adds that field value to the search.

Summary

As you have seen, the Splunk GUI provides a rich interface for working with search results. We have really only scratched the surface and will cover more elements later.


Oracle BI Publisher 11g: Learning the new XPT format

Packt
31 Oct 2011
4 min read
We will cover the following topics in this article:

The Layout Editor presentation
Designing a Layout
Export options

The Layout Editor

First, you have to choose a predefined layout from the Create Report interface. This interface displays a list of predefined layouts. You can add your own predefined layouts to this list and make them available for your later use, or even for all users. After choosing a layout from the Basic Templates or the Shared Templates group, the Layout Editor interface is displayed.

Designing a Layout

In the Layout Editor interface, you have tools to perform activities such as:

Insert a component: Select the desired component from the Components pane on the left or from the Insert tab of the toolbar and drag-and-drop it into the design area.
Set component properties: Set the component properties from the Properties pane on the left or from the component-specific tab of the toolbar (only for the most commonly used components).
Insert a data element: Drag the element from the Data Source pane to the design area. Precise dropping areas are marked in the design view; for example, in a chart you have the marked areas Drop Value Here, Drop Label Here, and Drop Series Here.
Set page layout options: Use the Page Layout tab and the Properties pane.
Save the Layout: Use the activity icons from the toolbar on the right side.

In the following sections, a few elements will be inserted to complete our report design. You will also see the steps that you need to follow when inserting and setting the properties of these components.

Text elements

In order to change the settings of Text elements, follow the steps given here:

Click on the Insert tab and choose the Text Item component from the toolbar.
Click on the Text tab and set a font color for your text using the Font Color icon on the toolbar.
Set the text margins using the Properties pane.

In this way, we obtain the desired report title. In order to insert data elements into our report's components, we will use a Data Model that includes the LOAN_DATE, LOAN_PERIOD, PRICE, TITLE, and YEAR fields.

Charts

In order to create charts, follow the steps given here:

Click on the Insert tab and choose the Chart component from the toolbar.
Select the newly inserted chart and go to the Chart tab on the toolbar to set the chart type (Vertical Bar in this example) and the chart style (Project in this example).
Drag the LOAN_PERIOD and PRICE fields from the Data Source (in the left pane) over the Drop Value Here area of the design view.
Drop the TITLE field from the Data Source over the Drop Label Here area.

Data tables

In order to create data tables, follow the steps given here:

Click on the Insert tab and choose the Data Table component from the toolbar.
Drag the fields LOAN_DATE, LOAN_PERIOD, TITLE, and YEAR from the Data Source over the area marked as Drop a Data Item Here.
Select the LOAN_DATE column and, in the Properties pane, set the Formatting Mask to yyyy-mm-dd.
For each column of the table, enter a suitable value for Width in the Properties pane. For example, the first column has a width of 1.00 inch.


Overview of FIM 2010 R2

Packt
03 Sep 2012
18 min read
The following picture shows a high-level overview of the FIM family and the components relevant to an FIM 2010 R2 implementation. Within the FIM family, there are some parts that can live by themselves and others that depend on other parts. But, in order to fully utilize the power of FIM 2010 R2, you should have all parts in place. At the center, we have FIM Service and FIM Synchronization Service (FIM Sync). The key to a successful implementation of FIM 2010 R2 is to understand how these two components work—by themselves as well as together.

The history of FIM 2010 R2

Let us go through a short summary of the versions preceding FIM 2010 R2. In 1999, Microsoft bought a company called Zoomit. They had a product called VIA—a directory synchronization product. Microsoft incorporated Zoomit VIA into Microsoft Metadirectory Services (MMS). MMS was only available as a Microsoft Consulting Services solution. In 2003, Microsoft released Microsoft Identity Integration Server (MIIS), and this was the first publicly available version of the synchronization engine today known as FIM 2010 R2 Synchronization Service. In 2005, Microsoft bought a company called Alacris. They had a product called IdNexus, which was used to manage certificates and smart cards. Microsoft renamed it Certificate Lifecycle Manager (CLM). In 2007, Microsoft took MIIS (now with Service Pack 2) and CLM and slammed them together into a new product called Identity Lifecycle Manager 2007 (ILM 2007). Despite the name, ILM 2007 was basically a directory synchronization tool with a certificate management side-kicker. Finally, in 2010, Microsoft released Forefront Identity Manager 2010 (FIM 2010).

FIM 2010 was a whole new thing, but as we will see, the old parts from MIIS and CLM are still there. The most fundamental change in FIM 2010 was the addition of the FIM Service component. The most important news was that FIM Service added workflow capability to the synchronization engine. Many identity management operations that used to require a lot of coding were suddenly available without a single line of code. In FIM 2010 R2, Microsoft added the FIM Reporting component and also made significant improvements to the other components.

FIM Synchronization Service (FIM Sync)

FIM Synchronization Service is the oldest member of the FIM family. Anyone who has worked with MIIS back in 2003 will feel quite at home with it. Visually, the management tools look the same. FIM Synchronization Service can actually work by itself, without any other component of FIM 2010 R2 being present. We will then basically get the same functionality as MIIS had back in 2003. FIM Synchronization Service is the heart of FIM, which pumps the data around, causing information about identities to flow from one system to another.

Let's look at the pieces that make up the FIM Synchronization Service. As we can see, there are lots of acronyms and concepts that need a little explaining. On the right-hand side of FIM Synchronization Service, we have the Metaverse (MV). The Metaverse is used to collect all the information about all the identities managed by FIM. On the other side, we have the Connected Data Source (CDS). A Connected Data Source is the database, directory, or file, among others, that the synchronization service imports information regarding the managed identities from, and/or exports this information to. To talk to different kinds of Connected Data Sources, FIM Synchronization Service uses adapters that are called Management Agents (MA).
In FIM 2010 R2, we will start to use the term Connectors instead. But, as the user interface in FIM Synchronization Manager still uses the term Management Agent, that is the term used here as well.

The Management Agent stores a representation of the objects in the CDS in its Connector Space (CS). When stored in the Connector Space, we refer to the objects as holograms. If we were to look into this a little deeper, we would find that the holograms (objects) are actually stored in multiple instances so that the Management Agent can keep track of the changes to the objects in the Connector Space. In order to synchronize information from/to different Connected Data Sources, we connect the objects in the Connector Space with the corresponding object in the Metaverse. By collecting information from all Connected Data Sources, the synchronization engine aggregates the information about the object from all the Connected Data Sources into the Metaverse object. This way, the Metaverse will only contain one representation of the object (for example, a user).

To describe the data flow within the synchronization service, let's look at the previous diagram and follow a typical scenario. The scenario is this—we want information in our Human Resource (HR) system to govern how users appear in Active Directory (AD) and in our e-mail system.

Import users from HR: The bottom CDS could be our HR system. We configure a Management Agent to import users from HR to the corresponding CS.
Projection to Metaverse: As there is no corresponding user in the MV that we can connect to, we tell the MA to create a new object in the MV. The process of creating new objects in the MV is called Projection. To transfer information from the HR CS to the MV, we configure Inbound Synchronization Rules.
Import and join users from AD: The middle CDS could be Active Directory (AD). We configure a Management Agent to import users from AD. Because there are objects in the MV, we can now tell the Management Agent to try to match the user objects from AD to the objects in the MV. Connecting existing objects in a Connector Space to an existing object in the Metaverse is called Joining. In order for the synchronization service to know which objects to connect, some kind of unique information must be present, to get a one-to-one mapping between the object in the CS and the object in the Metaverse.
Synchronize information from HR to AD: Once the Metaverse object has a connector to both the HR CS and the AD CS, we can move information from the HR CS to the AD CS. We can, for example, use the employee status information in the HR system to modify the userAccountControl attribute of the AD account. In order to modify the AD CS object, we configure an Outbound Synchronization Rule that will tell the synchronization service how to update the CS object based on the information in the MV object. Synchronizing, however, does not modify the user object in AD; it only modifies the hologram representation of the user in the AD Connector Space.
Export information to AD: In order to actually change any information in a Connected Data Source, we need to tell the MA to export the changes. During export, the MA updates the objects in the CDS with the changes it has made to the hologram in the Connector Space.
Provision users to the e-mail system: The top CDS could be our e-mail system. As users are not present in this system, we would like the synchronization service to create new objects in the CS for the e-mail system. The process of creating new objects in a Connector Space is called Provisioning.

Projection, Joining, and Provisioning all create a connector between the Metaverse object and the Connector Space object, making it possible to synchronize identity information between different Connected Data Sources. A key concept to understand here is that we do not configure synchronization between Connected Data Sources or between Connector Spaces. We synchronize between each Connector Space and the Metaverse. Looking at the previous example, we can see that when information flows from HR to AD, we configure the following (a toy sketch of this flow appears after the list):

The HR MA to import data to the HR CS
Inbound synchronization from the HR CS to the MV
Outbound synchronization from the MV to the AD CS
The AD MA to export the data to AD
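The following is a purely conceptual toy sketch of that flow. It is not FIM code and uses no FIM API; plain Python dictionaries stand in for the HR and AD connector spaces and the Metaverse, just to make projection, joining, and provisioning concrete.

# Toy illustration only -- not FIM. Dictionaries model the connector spaces
# (CS) and the Metaverse (MV); the attribute names and values are made up.
hr_cs = {"1001": {"employeeID": "1001", "name": "Ann", "status": "Active"}}
ad_cs = {"CN=Ann": {"employeeID": "1001", "userAccountControl": 512}}
metaverse = {}

# Projection: HR objects create new Metaverse objects (inbound sync from HR).
for obj in hr_cs.values():
    metaverse[obj["employeeID"]] = dict(obj)

# Joining: AD objects are matched to existing Metaverse objects on a unique key.
connectors = {dn: obj["employeeID"] for dn, obj in ad_cs.items()
              if obj["employeeID"] in metaverse}

# Outbound sync: Metaverse state drives the AD connector space (the real export
# to AD would happen afterwards).
for dn, mv_id in connectors.items():
    if metaverse[mv_id]["status"] == "Fired":
        ad_cs[dn]["userAccountControl"] = 514   # disabled-account flag

# Provisioning: create brand-new objects in the e-mail system's connector space.
mail_cs = {mv_id: {"mail": f"{mv['name'].lower()}@example.com"}
           for mv_id, mv in metaverse.items()}
print(connectors, ad_cs, mail_cs)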
Management Agents

Management Agents, or Connectors as some people call them, are the entities that enable FIM to talk to different kinds of data sources. Basically, we can say that FIM can talk to any type of data source, but it only has built-in Management Agents for some. If the data source is really old, we might even have to use the extensibility platform and write our own Management Agent, or buy a Management Agent from a third-party supplier. At http://aka.ms/FIMPartnerMA, we can find a list of Management Agents supplied by Microsoft Partners. For a complete list of Management Agents built in and available from Microsoft, please look at http://aka.ms/FIMMA.

With R2, a new Management Agent for Extensible Connectivity 2.0 (ECMA 2.0) is released, introducing new ways of making custom Management Agents. We will see updated versions of most third-party Management Agents as soon as they are migrated to the new ECMA 2.0 platform. Microsoft will also ship new Management Agents using the new ECMA 2.0 platform. Writing our own MA is one way of solving problems communicating with odd data sources, but there might be other solutions to the problem that require less coding.

Non-declarative vs. declarative synchronization

If you are using FIM Synchronization Service the old way, like we did in MIIS or ILM 2007, it is called non-declarative synchronization. We usually call that classic synchronization and will also use that term in this article. If we use the FIM Service logic to control it all, it is called declarative synchronization. As classic synchronization usually involves writing code and declarative does not, we will also find references calling declarative synchronization codeless. In fact, it was quite possible, in some scenarios, to have codeless synchronization—even in the old MIIS or ILM 2007—using classic synchronization. The fact also remains that there are very few FIM 2010 R2 implementations that are indeed code free. In some cases you might even mix the two. This could be due either to migration from MIIS/ILM 2007 to FIM 2010 R2 or to the decision that it is cheaper/quicker/easier to solve a particular problem using classic synchronization.

Password synchronization

This should be the last resort to achieve some kind of Single Sign-On (SSO). Instead of implementing password synchronization, we try to make our customers look at other ways, such as Kerberos or Federation, to get SSO. There are, however, many cases where password synchronization is the best option to maintain passwords in different systems. Not all environments can utilize Kerberos or Federation and therefore need the FIM password synchronization feature to maintain passwords in different Connected Data Sources.
This feature typically uses Active Directory as the source of password changes, either by installing and configuring Password Change Notification Service (PCNS) on the Domain Controllers or by using FIM Service as the source for the password change. FIM Synchronization Service then updates the password on the connected object in the Connected Data Sources that are configured as password synchronization targets. In order for FIM to set the password in a target system, the Management Agent used to connect to that specific CDS needs to support this. Most Management Agents available today support password management or can be configured to do so.

FIM Service Management Agent

A very special Management Agent is the one connecting FIM Synchronization Service to FIM Service. Many of the rules we apply to other types of Management Agents do not apply to this one. If you have experience working with classic synchronization in MIIS or ILM 2007, you will find that this Management Agent does not work as the others do.

FIM Service

If FIM Synchronization Service is the heart pumping information, FIM Service is the brain (sorry FIM CM, but your brain is not as impressive). FIM Service plays many roles in FIM, and during the design phase the capabilities of FIM Service are often in focus. FIM Service allows you to enforce the identity management policy within your organization and also make sure you are compliant at all times. FIM Service has its own database, where it stores the information about the identities it manages.

Request pipeline

In order to make any changes to objects in the FIM Service database, we need to work our way through the FIM Service request pipeline. Every request is made to the web service interface and follows the ensuing flow:

The Request Processor workflow receives the request and evaluates the token (who?) and the request type (what?).
Permission is checked to see if the request is allowed.
Management Policy Rules are evaluated.
If an Authentication workflow is required, it is serialized and run as an interactive workflow.
If an Authorization workflow is required, it is parallelized and run as an asynchronous workflow.
The object in the FIM Service database is modified according to the request.
If an Action workflow is required, follow-up workflows are run.

As we can see, a request to FIM Service may trigger three types of workflows. With the installation of FIM 2010 R2, we get a few workflows that cover many basic requirements, but this is one of the situations where custom coding or third-party workflows might be required in order to fulfill the identity management policy within the organization.

Authentication workflow (AuthN) is used when the request requires additional authentication. An example of this is when a user tries to reset his password—the AuthN workflow will ask the anonymous user to authenticate using the QA Gate.
Authorization workflow (AuthZ) is used when the request requires authorization from someone else. An example of this is when a user is added to a group, but the policy states that the owner of the group needs to approve the request.
Action workflow is used for many types of follow-up actions—it could be sending a notification e-mail or modifying attributes, among many other things.

FIM Service Management Agent

FIM Service Management Agent, as we discussed earlier, is responsible for synchronizing data between FIM Service and FIM Synchronization Service.
We said then that this MA is a bit special, and even from the FIM Service perspective it works a little differently. A couple of examples of the special relationship between the FIM Service MA and FIM Service are as follows:

Any request made by the FIM Service MA will bypass any AuthN and AuthZ workflows.
As a performance enhancer, the FIM Service MA is allowed to make changes directly to the FIM Service database in FIM 2010 R2, without using the request pipeline described earlier.

Management Policy Rules (MPRs)

The way we control what can be done, or what should happen, is by defining Management Policy Rules (MPRs) within FIM Service. The MPR is our tool to enforce the identity management policies within our organization. There are two types of MPRs—Request and Set Transition. A Request MPR is used to define how the request pipeline should behave on a particular request. If a request comes in and there is no Request MPR matching the request, it will fail. A Set Transition MPR is used to detect changes in objects and react upon that change. For example, if my EmployeeStatus is changed to Fired, my Active Directory (AD) account should be disabled.

A Set is used within FIM Service to group objects. We define rules that govern the criteria for an object to be part of a Set. For example, we can create a Set that contains all users with Fired as EmployeeStatus. As objects satisfy these criteria and transition into the Set, we can define a Set Transition MPR to make things such as disabling the AD account happen. We can also define an MPR that applies to the transition out from a Set. Sets are also used to configure permissions within FIM Service. Using Sets allows us to configure very granular permissions in scenarios where FIM Service is used for user self-service.

FIM Portal

FIM Portal is usually the starting point for administrators who will configure FIM Service. The configuration of FIM Service is usually done using FIM Portal, but it may also be configured using PowerShell or even your own custom interface. FIM Portal can also be used for self-service scenarios, allowing users to manage some aspects of the identity management process. FIM Portal is actually an ASP.NET application using Microsoft SharePoint as a foundation, and it can be modified in many ways.

Self Service Password Reset (SSPR)

The Self Service Password Reset (SSPR) feature of FIM is a special case, where most components used to implement it are built in. The default method uses what is called a QA Gate. FIM 2010 R2 also has built-in methods for using a One Time Password (OTP) that can be sent using either SMS or e-mail services. In short, the QA Gate works in the following way:

The administrator defines a number of questions.
Users register for SSPR and provide answers to the questions.
Users are presented with the same questions when a password reset is needed.
Giving the correct answers identifies the user and allows them to reset their password.

Once the FIM administrator has used FIM Portal to configure the password reset feature, the end user can register his answers to the QA Gate. If the organization has deployed FIM Password Reset Extension to the end user's Windows client, the process of registration and reset can be performed directly from the Windows client. If not, the user can register and reset his password using the password registration and reset portals.
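As a purely conceptual illustration of a question-and-answer gate, the following sketch registers salted hashes of a user's answers and later verifies them before a reset would be allowed. It is not FIM's implementation or API; the questions, hashing scheme, and storage are made up for the example.

# Conceptual sketch of a QA-style gate: register answers, then verify them.
# NOT how FIM stores or checks registered answers -- illustration only.
import hashlib, hmac, os

QUESTIONS = ["First school?", "Favourite colour?"]          # defined by the admin

def _digest(answer: str, salt: bytes) -> bytes:
    return hashlib.pbkdf2_hmac("sha256", answer.strip().lower().encode(), salt, 100_000)

def register(answers):
    """User registration: keep only a salt and a hash per answer."""
    records = []
    for a in answers:
        salt = os.urandom(16)
        records.append((salt, _digest(a, salt)))
    return records

def verify(records, answers):
    """Reset time: every answer must match before the reset is allowed."""
    return all(hmac.compare_digest(_digest(a, salt), stored)
               for (salt, stored), a in zip(records, answers))

stored = register(["Springfield Elementary", "Blue"])
print(verify(stored, ["springfield elementary", "blue"]))   # True
print(verify(stored, ["Shelbyville", "blue"]))              # False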
FIM Reporting

The Reporting component is brand new in FIM 2010 R2. In earlier versions of FIM, as well as in the older MIIS and ILM, reporting was typically achieved either by buying third-party add-ons or by developing custom solutions based on SQL Server Reporting Services. The purpose of Reporting is to give you a chance to view historical data. There are a few reports built into FIM 2010 R2, but many organizations will develop their own reports that comply with their identity management policies. The implementation of FIM 2010 R2 will, however, be a little more complex if you want the Reporting component. This is because the engine used to generate the reports is the Data Warehouse component of Microsoft System Center Service Manager (SCSM). There are a number of reasons for using the existing reporting capabilities in SCSM; the main one is that it is easy to extend.

FIM Certificate Management (FIM CM)

Certificate Management is the outcast member of the FIM family. FIM CM can be, and often is, used by itself, without any other parts of FIM being present. It is also the component with the poorest integration with the other components. If we look at it, we will find that it hasn't changed much since its predecessor, Certificate Lifecycle Manager (CLM), was released. FIM CM is mainly focused on managing smart cards, but it can also be used to manage and trace any type of certificate request.

The basic concept of FIM CM is that a smart card is requested using the FIM CM portal. Information regarding all requests is stored in the FIM CM database. The Certification Authority, which handles the issuing of the certificates, is configured to report the status back to the FIM CM database. The FIM CM portal also contains a workflow engine, so that the FIM CM admin can configure features such as e-mail notifications as a part of the policies.

Certificate Management portal

FIM Certificate Management uses a portal to interact with users and administrators. The FIM CM portal is an ASP.NET 2.0 website where, for example:

Administrators can configure the policies that govern the processes around certificate management
End users can manage their smart cards for purposes such as renewing and changing PIN codes
Help desks can use the portal to, for example, request temporary smart cards or reset PINs

Licensing

We put this part in here not to tell you how FIM 2010 R2 is licensed, but rather to tell you that it is complex. Since Microsoft has a habit of changing the way they license their products, we will not put any license details into writing. Depending on what parts you are using and, in some cases, how you are using them, you need to buy different licenses. FIM 2010 R2 (at the time of writing) uses both server licenses and Client Access Licenses (CALs). In almost every FIM project the licensing cost is negligible compared to the gain obtained by implementing it. But even so, please make sure to contact your Microsoft licensing partner, or your Microsoft contact, to clear up any questions you might have around licensing. If you do not have Microsoft System Center Service Manager (SCSM), it is stated (at the time of writing) that you can install and use SCSM for FIM Reporting without having to buy SCSM licenses. Read more about FIM licensing at http://aka.ms/FIMLicense.

Summary

As you can see, Microsoft Forefront Identity Manager 2010 R2 is not just one product, but a family of products.
In this article, we have given you a short overview of the different components, and we have seen how, together, they can mitigate the challenges that The Company has identified around its identity management.


Exact Inference Using Graphical Models

Packt
25 Jun 2014
7 min read
Complexity of inference

A graphical model can be used to answer both probability queries and MAP queries. The most straightforward way to use the model is to generate the joint distribution and sum out all the variables except the ones we are interested in. However, generating the joint distribution causes an exponential blowup in the number of entries we have to specify. In the worst case, exact inference is NP-hard. By the word exact, we mean specifying the probability values with a certain precision (say, five digits after the decimal point).

Suppose we tone down our precision requirements (for example, to only two digits after the decimal point). Is the (approximate) inference task any easier? Unfortunately not—even approximate inference is NP-hard; that is, obtaining values that are even slightly better than random guessing (50 percent, or a probability of 0.5) takes exponential time. It might seem like inference is a hopeless task, but that is only in the worst case. In general cases, we can use exact inference to solve certain classes of real-world problems (such as Bayesian networks that have a small number of discrete random variables). Of course, for larger problems, we have to resort to approximate inference.

Real-world issues

Since inference is NP-hard, inference engines are written in languages that are as close to bare metal as possible, usually C or C++. From Python, the options are:

Use Python implementations of inference algorithms. Complete and mature packages for these are uncommon.
Use inference engines that have a Python interface, such as Stan (mc-stan.org). This choice strikes a good balance between running Python code and a fast inference implementation.
Use inference engines that do not have a Python interface, which is true for the majority of the inference engines out there. A fairly comprehensive list can be found at http://en.wikipedia.org/wiki/Bayesian_network#Software. The use of Python here is limited to creating a file that describes the model in a format that the inference engine can consume.

In this article, we will stick to the first two choices in the list. We will use native Python implementations (of inference algorithms) to peek into the interiors of the inference algorithms while running toy-sized problems, and then use an external inference engine with Python interfaces to try out a more real-world problem.

The tree algorithm

We will now look at another class of exact inference algorithms based on message passing. Message passing is a general mechanism, and there exist many variations of message passing algorithms. We shall look at a short snippet of the clique tree message passing algorithm (which is sometimes called the junction tree algorithm). Other versions of the message passing algorithm are used in approximate inference as well.

We initiate the discussion by clarifying some of the terms used. A cluster graph is an arrangement of a network where groups of variables are placed in clusters. It is similar to a factor in that each cluster has a set of variables in its scope. The message passing algorithm is all about passing messages between clusters. As an analogy, consider the gossip going on at a party, where Shelly and Clair are in a conversation. If Shelly knows B, C, and D, and she is chatting with Clair who knows D, E, and F (note that the only person they know in common is D), they can share information (or pass messages) about their common friend D.
In the message passing algorithm, two clusters are connected by a Separation Set (sepset), which contains the variables common to both clusters. Using the preceding example, the two clusters {B, C, D} and {D, E, F} are connected by the sepset {D}, which contains the only variable common to both clusters. In the next section, we shall learn about the implementation details of the junction tree algorithm. We will first understand the four stages of the algorithm and then use code snippets to learn about it from an implementation perspective.

The four stages of the junction tree algorithm

In this section, we will discuss the four stages of the junction tree algorithm.

In the first stage, the Bayes network is converted into a secondary structure called a join tree (alternate names for this structure in the literature are junction tree, cluster tree, or clique tree). The transformation from the Bayes network to the junction tree proceeds as per the following steps:

We construct a moral graph by changing all the directed edges to undirected edges. The parents of every node with a V-structure entering it are connected with an edge. We have seen an example of this process (in the VE algorithm) called moralization, a possible reference to connecting (apparently unmarried) parents that have a child (node); a small sketch of this step follows the list.
Then, we selectively add edges to the moral graph to create a triangulated graph. A triangulated graph is an undirected graph in which every cycle of more than three nodes has a chord, so the longest chordless cycle is of length 3.
From the triangulated graph, we identify the subsets of nodes called cliques.
Starting with the cliques as clusters, we arrange the clusters to form an undirected tree called the join tree, which satisfies the running intersection property. This property states that if a node appears in two cliques, it should also appear in all the cliques on the path that connects the two cliques.
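In the spirit of the code snippets promised above, here is a small pure-Python sketch of the moralization step; the three-edge network is made up for illustration and no inference library is assumed.

# Sketch of moralization: connect co-parents, then drop edge directions.
from itertools import combinations

# Directed edges parent -> child of a toy Bayes network (made up)
directed = [("A", "C"), ("B", "C"), ("C", "D")]

# Collect the parents of every node
parents = {}
for p, c in directed:
    parents.setdefault(c, set()).add(p)

# Moral graph: undirected versions of all original edges ...
moral = {frozenset(e) for e in directed}
# ... plus an edge between every pair of parents that share a child (V-structure)
for ps in parents.values():
    for u, v in combinations(sorted(ps), 2):
        moral.add(frozenset((u, v)))

print(sorted(tuple(sorted(e)) for e in moral))
# [('A', 'B'), ('A', 'C'), ('B', 'C'), ('C', 'D')]

Triangulation, clique identification, and the join tree construction then operate on this undirected moral graph.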
In the second stage, the potentials at each cluster are initialized. The potentials are similar to a CPD or a table: they hold a list of values against each assignment to the variables in their scope. Both clusters and sepsets contain a set of potentials. The term potential is used as opposed to probability because, in Markov networks, the values of the potentials are not obliged to sum to 1.

The third stage consists of message passing, or belief propagation, between neighboring clusters. Each message consists of a belief the cluster has about a particular variable. Each message can be passed asynchronously, but a cluster has to wait for information from other clusters before it collates that information and passes it on to the next cluster. It can be useful to think of a tree-structured cluster graph, where the message passing happens in two stages: an upward pass stage and a downward pass stage. Only after a node receives messages from the leaf nodes will it send a message to its parent (in the "upward pass"), and only after the node receives a message from its parents will it send a message to its children (in the "downward pass"). The message passing stage completes when each cluster-sepset pair has consistent beliefs. Recall that a cluster connected to a sepset shares variables with it; the potential over those shared variables obtained from either the cluster or the sepset has the same value, which is why it is said that the cluster graph has consistent beliefs or that the cliques are calibrated.

Once the whole cluster graph has consistent beliefs, the fourth stage is marginalization, where we can query the marginal distribution for any variable in the graph.

Summary

We first explored the inference problem, where we studied the types of inference. We then learned that inference is NP-hard and understood that, for large networks, exact inference is infeasible.


Securing Data at Rest in Oracle 11g

Packt
23 Oct 2012
11 min read
Introduction

The Oracle physical database files are primarily protected by filesystem privileges. An attacker who has read permissions on these files will be able to steal the entire database or critical information such as datafiles containing credit card numbers, social security numbers, or other types of private information. Other threats are related to data theft from storage media where the physical database resides. The same applies to unprotected backups or dumps that can be easily restored or imported. The data in the database is stored in a proprietary format that is quite easy to decipher. There are several sites and specialized tools available to extract data from datafiles, backups, and dumps, known generically as Data Unloading (DUL) tools. These tools are usually the last resort when the database is corrupted and there is no backup available for restore and recovery. As you have probably already guessed, they can also be used by an attacker for data extraction from stolen databases or dumps (summary descriptions and links to several DUL tools can be found at http://www.oracle-internals.com/?p=17). The technology behind DUL utilities is based on understanding how Oracle keeps the data in datafiles behind the scenes (a very good article about Oracle datafile internals, written by Rodrigo Righetti, can be found at http://docs.google.com/Doc?id=df2mxgvb_1dgb9fv). Once you decipher the mechanism, you will be able to build your own tool with little effort.

One of the best methods for protecting data at rest is encryption. We can enumerate the following data encryption methods, described in this chapter, for use with an Oracle database:

Operating system proprietary filesystem or block-based encryption
Cryptographic APIs, especially DBMS_CRYPTO, used for column encryption
Transparent Data Encryption for encrypting columns, tablespaces, dumps, and RMAN backups

Using block device encryption

By using block device encryption, the data is encrypted and decrypted at block-device level. The block device can be formatted with a filesystem. The decryption is performed once the filesystem is mounted by the operating system, transparently for users. This type of encryption protects best against media theft and can be used for datafile placement. In this recipe we will add a new disk and implement block-level encryption with Linux Unified Key Setup-on-disk-format (LUKS).

Getting ready

All steps will be performed on nodeorcl1 as root.

How to do it...

Shut down nodeorcl1, then add a new disk to the nodeorcl1 system and boot it. Our new device will be seen by the operating system as /dev/sdb.

Next, create a new partition /dev/sdb1 using fdisk as follows:

[root@nodeorcl1 ~]# fdisk /dev/sdb
WARNING: DOS-compatible mode is deprecated. It's strongly recommended to switch off the mode (command 'c') and change display units to sectors (command 'u').
Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-5577, default 1):
Using default value 1
Last cylinder, +cylinders or +size{K,M,G} (1-5577, default 5577):
Using default value 5577
Command (m for help): w
The partition table has been altered!
Calling ioctl() to re-read partition table.
Syncing disks.

Format and add a passphrase for encryption on the /dev/sdb1 device with the cryptsetup utility as follows:

[root@nodeorcl1 dev]# cryptsetup luksFormat /dev/sdb1

WARNING!
========
This will overwrite data on /dev/sdb1 irrevocably.

Are you sure?
(Type uppercase yes): YES
Enter LUKS passphrase: P5;@o[]klopY&P]
Verify passphrase: P5;@o[]klopY&P]
[root@nodeorcl1 dev]#

Access to the encrypted device is not performed directly; all operations go through a device-mapper. Open the device-mapper for /dev/sdb1 as follows:

[root@nodeorcl1 mapper]# cryptsetup luksOpen /dev/sdb1 storage
Enter passphrase for /dev/sdb1: P5;@o[]klopY&P]
[root@nodeorcl1 mapper]# ls -al /dev/mapper/storage
lrwxrwxrwx. 1 root root 7 Sep 23 20:03 /dev/mapper/storage -> ../dm-4

Formatting with a filesystem must also be performed on the device-mapper. Format the device-mapper with the ext4 filesystem as follows:

[root@nodeorcl1 mapper]# mkfs.ext4 /dev/mapper/storage
mke2fs 1.41.12 (17-May-2010)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
…
This filesystem will be automatically checked every 38 mounts or 180 days, whichever comes first. Use tune2fs -c or -i to override.
[root@nodeorcl1 mapper]#

Next we will configure the device-mapper /dev/mapper/storage for automatic mount during boot. Create a directory called storage that will be used as the mount point:

[root@nodeorcl1 storage]# mkdir /storage

The mapper device /dev/mapper/storage can be mounted as a normal device:

[root@nodeorcl1 storage]# mount /dev/mapper/storage /storage

To make the mount persistent across reboots, add /storage as the mount point for /dev/mapper/storage. First add the mapper-device name into /etc/crypttab:

[root@nodeorcl1 storage]# echo "storage /dev/sdb1" > /etc/crypttab

Add the complete mapper-device path, mount point, and filesystem type in /etc/fstab as follows:

/dev/mapper/storage /storage ext4 defaults 1 2

Reboot the system:

[root@nodeorcl1 storage]# shutdown -r now

At the boot sequence, the passphrase for /storage will be requested. If no passphrase is typed, the mapper device will not be mounted.

How it works...

Block device encryption is implemented to work below the filesystem level. Once the device is offline, the data appears like a large blob of random data. There is no way to determine what kind of filesystem and data it contains.

There's more...

To dump information about the encrypted device, execute the following command:

[root@nodeorcl1 dev]# cryptsetup luksDump /dev/sdb1
LUKS header information for /dev/sdb1
Version:        1
Cipher name:    aes
Cipher mode:    cbc-essiv:sha256
Hash spec:      sha1
Payload offset: 4096
MK bits:        256
MK digest:      2c 7a 4c 96 9d db 63 1c f0 15 0b 2c f0 1a d9 9b 8c 0c 92 4b
MK salt:        59 ce 2d 5b ad 8f 22 ea 51 64 c5 06 7b 94 ca 38 65 94 ce 79 ac 2e d5 56 42 13 88 ba 3e 92 44 fc
MK iterations:  51750
UUID:           21d5a994-3ac3-4edc-bcdc-e8bfbf5f66f1
Key Slot 0: ENABLED
  Iterations:          207151
  Salt:                89 97 13 91 1c f4 c8 74 e9 ff 39 bc d3 28 5e 90 bf 6b 9a c0 6d b3 a0 21 13 2b 33 43 a7 0c f1 85
  Key material offset: 8
  AF stripes:          4000
Key Slot 1: DISABLED
Key Slot 2: DISABLED
Key Slot 3: DISABLED
Key Slot 4: DISABLED
Key Slot 5: DISABLED
Key Slot 6: DISABLED
Key Slot 7: DISABLED
[root@nodeorcl1 ~]#

Using filesystem encryption with eCryptfs

The eCryptfs filesystem is implemented as an encryption/decryption layer interposed between a mounted filesystem and the kernel. The data is encrypted and decrypted automatically at filesystem access. It can be used for backups or for placing sensitive files on transportable or fixed storage media. In this recipe we will install eCryptfs and demonstrate some of its capabilities.
Getting ready All steps will be performed on nodeorcl1. How to do it... eCryptfs is shipped and bundled with the Red Hat installation kit. The eCryptfs package depends on the trousers package. As the root user, first install the trousers package, followed by the ecryptfs-utils package: [root@nodeorcl1 Packages]# rpm -Uhv trousers-0.3.4-4.el6.x86_64.rpm warning: trousers-0.3.4-4.el6.x86_64.rpm: Header V3 RSA/SHA256 Signature, key ID fd431d51: NOKEY Preparing... ###################################### ##### [100%] 1:trousers ###################################### ##### [100%] [root@nodeorcl1 Packages]# rpm -Uhv ecryptfs-utils-82-6.el6.x86_64.rpm warning: ecryptfs-utils-82-6.el6.x86_64.rpm: Header V3 RSA/SHA256 Signature, key ID fd431d51: NOKEY Preparing... ###################################### ##### [100%] 1:ecryptfs-utils ###################################### ##### [100%] Create a directory that will be mounted with the eCryptfs filesystem and set the oracle user as the owner: [root@nodeorcl1 ~]# mkdir /ecryptedfiles [root@nodeorcl1 ~]# chown -R oracle:oinstall /ecryptedfiles Mount /ecryptedfiles to itself using the eCryptfs filesystem. Use the default values for all options and use a strong passphrase as follows: [root@nodeorcl1 hashkeys]# mount -t ecryptfs /ecryptedfiles /ecryptedfiles Select key type to use for newly created files: 1) openssl 2) tspi 3) passphrase Selection: 3 Passphrase: lR%5_+KO}Pi_$2E Select cipher: 1) aes: blocksize = 16; min keysize = 16; max keysize = 32 (not loaded) 2) blowfish: blocksize = 16; min keysize = 16; max keysize = 56 (not loaded) 3) des3_ede: blocksize = 8; min keysize = 24; max keysize = 24 (not loaded) 4) cast6: blocksize = 16; min keysize = 16; max keysize = 32 (not loaded) 5) cast5: blocksize = 8; min keysize = 5; max keysize = 16 (not loaded) Selection [aes]: Select key bytes: 1) 16 2) 32 3) 24 Selection [16]: Enable plaintext passthrough (y/n) [n]: Enable filename encryption (y/n) [n]: y Filename Encryption Key (FNEK) Signature [d395309aaad4de06]: Attempting to mount with the following options: ecryptfs_unlink_sigs ecryptfs_fnek_sig=d395309aaad4de06 ecryptfs_key_bytes=16 ecryptfs_cipher=aes ecryptfs_sig=d395309aaad4de06 Mounted eCryptfs [root@nodeorcl1 hashkeys]# Switch to the oracle user and export the HR schema to the /ecryptedfiles directory as follows: [oracle@nodeorcl1 ~]$ export NLS_LANG=AMERICAN_AMERICA.AL32UTF8 [oracle@nodeorcl1 ~]$ exp system file=/ecryptedfiles/hr.dmp owner=HR statistics=none Export: Release 11.2.0.3.0 - Production on Sun Sep 23 20:49:30 2012 Copyright (c) 1982, 2011, Oracle and/or its affiliates. All rights reserved. Password: Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production With the Partitioning, OLAP, Data Mining and Real Application Testing options Export done in AL32UTF8 character set and AL16UTF16 NCHAR character set About to export specified users ... …………………………………………………………………………………………………………….. . . exporting table LOCATIONS 23 rows exported . . exporting table REGIONS 4 rows exported . …………………………………………………………………………………………………….. . exporting post-schema procedural objects and actions . exporting statistics Export terminated successfully without warnings. [oracle@nodeorcl1 ~]$ If you open the hr.dmp file with the strings command, you will be able to see the content of the dump file: [root@nodeorcl1 ecryptedfiles]# strings hr.dmp | more ………………………………………………………………………………………………………………………………………. 
CREATE TABLE "COUNTRIES" ("COUNTRY_ID" CHAR(2) CONSTRAINT "COUNTRY_ID_NN" NOT NULL ENABLE, "COUNTRY_NAME" VARCHAR2(40), "REGION_ID" NUMBER, CONSTRAINT "COUNTRY_C_ID_PK" PRIMARY KEY ("COUNTRY_ID") ENABLE ) ORGANIZATION INDEX PCTFREE 10 INITRANS 2 MAXTRANS 255 STORAGE(INITIAL 65536 NEXT 1048576 MINEXTENTS 1 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT) TABLESPACE "EXAMPLE" NOLOGGING NOCOMPRESS PCTTHRESHOLD 50 INSERT INTO "COUNTRIES" ("COUNTRY_ID", "COUNTRY_NAME", "REGION_ ID") VALUES (:1, :2, :3) Argentina Australia Belgium Brazil Canada Next, as root, unmount /ecryptedfiles as follows: [root@nodeorcl1 /]# umount /ecryptedfiles/ If we list the content of the /ecryptedfiles directory now, we should see that the file names and content are encrypted: [root@nodeorcl1 /]# cd /ecryptedfiles/ [root@nodeorcl1 ecryptedfiles]# ls ECRYPTFS_FNEK_ENCRYPTED.FWbHZH0OehHS.URqPdiytgZHLV5txs-bH4KKM4Sx2qGR2by6i00KoaCBwE-- [root@nodeorcl1 ecryptedfiles]# [root@nodeorcl1 ecryptedfiles]# more ECRYPTFS_FNEK_ENCRYPTED.FWbHZH0OehHS.URqPdiytgZHLV5txs-bH4KKM4Sx2qGR2by6i00KoaCBwE-- ………………………………………………………………………………………………………………………………… 9$Eî□□KdgQNK□□v□□ S□□J□□□ h□□□ PIi'ʼn□□R□□□□□siP □b □`)3 □W □W( □□□□c!□□8□E.1'□R□7bmhIN□□--(15%) …………………………………………………………………………………………………………………………………. To make the file accessible again, mount the /ecryptedfiles filesystem by passing the same parameters and passphrase as performed in step 3. How it works... eCryptfs is mapped in the kernel Virtual File System (VFS), similar to other filesystems such as ext3, ext4, and ReiserFS. All calls on the filesystem will go first through the eCryptfs mount point and then to the underlying filesystem found on the mount point (ext3, ext4, JFS, ReiserFS). The key used for encryption is retrieved from the user session key ring, and the kernel cryptographic API is used for encryption and decryption of file content. Communication with the kernel is performed by the eCryptfs daemon. The file data content is encrypted for each file with a distinct, randomly generated File Encryption Key (FEK); the FEK is encrypted with a File Encryption Key Encryption Key (FEKEK), resulting in an Encrypted File Encryption Key (EFEK) that is stored in the header of the file. There's more... On Oracle Solaris you can implement filesystem encryption using the ZFS built-in filesystem encryption capabilities. On IBM AIX you can use EFS.
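To see for yourself how much readable information an unprotected export leaks, the strings check used above can be reproduced with a few lines of Python. This is only an illustrative sketch and is not part of the recipe; the file path and script name are assumptions. Run it against the dump while the eCryptfs filesystem is mounted, and again against the encrypted copy that is visible when it is not mounted, to compare the output:

import re
import sys

def printable_runs(path, min_len=8):
    # yield runs of printable ASCII of at least min_len bytes, much like the strings command
    pattern = re.compile(rb'[\x20-\x7e]{%d,}' % min_len)
    with open(path, 'rb') as f:
        data = f.read()  # fine for a demo; stream in chunks for very large dumps
    for match in pattern.finditer(data):
        yield match.group().decode('ascii')

if __name__ == '__main__':
    # hypothetical usage: python printable_runs.py /ecryptedfiles/hr.dmp
    for run in printable_runs(sys.argv[1]):
        print(run)

Against the mounted, decrypted view of hr.dmp this prints the DDL and row values shown above; against the encrypted copy seen without the passphrase, only noise comes back.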
What is Quantitative Finance?

Packt
20 Jun 2014
11 min read
(For more resources related to this topic, see here.) Discipline 1 – finance (financial derivatives) In general, a financial derivative is a contract between two parties who agree to exchange one or more cash flows in the future. The value of these cash flows depends on some future event, for example, the value of some stock index or interest rate being above or below some predefined level. The activation or triggering of this future event thus depends on the behavior of a variable quantity known as the underlying. Financial derivatives receive their name because they derive their value from the behavior of another financial instrument. As such, financial derivatives do not have an intrinsic value in themselves (in contrast to bonds or stocks); their price depends entirely on the underlying. A critical feature of derivative contracts is thus that their future cash flows are probabilistic and not deterministic. The future cash flows in a derivative contract are contingent on some future event. That is why derivatives are also known as contingent claims. This feature makes these types of contracts difficult to price. The following are the most common types of financial derivatives: futures, forwards, options, and swaps. Futures and forwards are financial contracts between two parties. One party agrees to buy the underlying from the other party at some predetermined date (the maturity date) for some predetermined price (the delivery price). An example could be a one-month forward contract on one ounce of silver. The underlying is the price of one ounce of silver. No exchange of cash flows occurs at inception (today, t=0); the exchange occurs only at maturity (t=T). Here t represents the variable time. Forwards are contracts negotiated privately between two parties (in other words, Over The Counter (OTC)), while futures are negotiated at an exchange. Options are financial contracts between two parties. One party (called the holder of the option) pays a premium to the other party (called the writer of the option) in order to have the right, but not the obligation, to buy some particular asset (the underlying) for some particular price (the strike price) at some particular date in the future (the maturity date). This type of contract is called a European Call contract. Example 1 Consider a one-month call contract on the S&P 500 index. The underlying in this case will be the value of the S&P 500 index. There are cash flows both at inception (today, t=0) and at maturity (t=T). At inception (t=0), the premium is paid, while at maturity (t=T), the holder of the option will choose between two possible scenarios, depending on the value of the underlying at maturity S(T): Scenario A, to exercise his/her right and buy the underlying asset for K; or Scenario B, to do nothing. The option holder will choose Scenario A if the value of the underlying at maturity is above the value of the strike, that is, S(T)>K. This will guarantee him/her a profit of S(T)-K. The option holder will choose Scenario B if the value of the underlying at maturity is below the value of the strike, that is, S(T)<K. This limits his/her losses to zero. Example 2 An Interest Rate Swap (IRS) is a financial contract between two parties A and B who agree to exchange cash flows at regular intervals during a given period of time (the life of a contract). 
Typically, the cash flows from A to B are indexed to a fixed rate of interest, while the cash flows from B to A are indexed to a floating interest rate. The set of fixed cash flows is known as the fixed leg, while the set of floating cash flows is known as the floating leg. The cash flows occur at regular intervals during the life of the contract between inception (t=0) and maturity (t=T). An example could be a fixed-for-floating IRS in which party A pays a rate of 5 percent on the agreed notional N every three months and receives EURIBOR3M on the agreed notional N every three months. Example 3 A futures contract on a stock index also involves a single future cash flow (the delivery price) to be paid at the maturity of the contract. However, the payoff in this case is uncertain because how much profit I will get from this operation will depend on the value of the underlying at maturity. If the price of the underlying is above the delivery price, then the payoff I get (denoted by the function H) is positive (indicating a profit) and corresponds to the difference between the value of the underlying at maturity S(T) and the delivery price K. If the price of the underlying is below the delivery price, then the payoff I get is negative (indicating a loss) and corresponds to the difference between the delivery price K and the value of the underlying at maturity S(T). This characteristic can be summarized in the following payoff formula: Equation 1 Here, H(S(T)) is the payoff at maturity, which is a function of S(T). Financial derivatives are very important to the modern financial markets. According to the Bank for International Settlements (BIS), as of December 2012 the amounts outstanding for OTC derivative contracts worldwide were: foreign exchange derivatives, 67,358 billion USD; interest rate derivatives, 489,703 billion USD; equity-linked derivatives, 6,251 billion USD; commodity derivatives, 2,587 billion USD; and credit default swaps, 25,069 billion USD. For more information, see http://www.bis.org/statistics/dt1920a.pdf. Discipline 2 – mathematics We need mathematical models to capture both the future evolution of the underlying and the probabilistic nature of the contingent cash flows we encounter in financial derivatives. Regarding the contingent cash flows, these can be represented in terms of the payoff function H(S(T)) for the specific derivative we are considering. Because S(T) is a stochastic variable, the value of H(S(T)) ought to be computed as an expectation E[H(S(T))]. And in order to compute this expectation, we need techniques that allow us to predict or simulate the behavior of the underlying S(T) into the future, so as to be able to compute the value of S(T) and finally be able to compute the mean value of the payoff E[H(S(T))]. Regarding the behavior of the underlying, typically, this is formalized using Stochastic Differential Equations (SDEs), such as Geometric Brownian Motion (GBM), as follows: Equation 2 The previous equation fundamentally says that the change in a stock price (dS) can be understood as the sum of two effects: a deterministic effect (the first term on the right-hand side) and a stochastic term (the second term on the right-hand side). The parameter μ is called the drift, and the parameter σ is called the volatility. S is the stock price, dt is a small time interval, and dW is an increment in the Wiener process. This model is the most common model used to describe the behavior of stocks, commodities, and foreign exchange. 
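Written out in standard notation, the two equations referenced above take the following form. This is a reconstruction from the descriptions in the text (using the drift μ and volatility σ introduced above) rather than a copy of the original figures:

% Equation 1: payoff of the index futures position at maturity
H\bigl(S(T)\bigr) = S(T) - K

% Equation 2: Geometric Brownian Motion for the underlying
\mathrm{d}S = \mu\, S\, \mathrm{d}t + \sigma\, S\, \mathrm{d}W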
Other models exist, such as jump, local volatility, and stochastic volatility models that enhance the description of the dynamics of the underlying. Regarding the numerical methods, these correspond to ways in which the formal expression described in the mathematical model (usually in continuous time) is transformed into an approximate representation that can be used for calculation (usually in discrete time). This means that the SDE that describes the evolution of the price of some stock index into the future, such as the FTSE 100, is changed to describe the evolution at discrete intervals. An approximate representation of an SDE can be calculated using the Euler approximation as follows: Equation 3 The preceding equation needs to be solved in an iterative way for each time interval between now and the maturity of the contract. If these time intervals are days and the contract has a maturity of 30 days from now, then we compute tomorrow's price in terms of todays. Then we compute the day after tomorrow as a function of tomorrow's price and so on. In order to price the derivative, we require to compute the expected payoff E[H(ST)] at maturity and then discount it to the present. In this way, we would be able to compute what should be the fair premium associated with a European option contract with the help of the following equation: Equation 4 Discipline 3 – informatics (C++ programming) What is the role of C++ in pricing derivatives? Its role is fundamental. It allows us to implement the actual calculations that are required in order to solve the pricing problem. Using the preceding techniques to describe the dynamics of the underlying, we require to simulate many potential future scenarios describing its evolution. Say we ought to price a futures contract on the EUR/USD exchange rate with one year maturity. We have to simulate the future evolution of EUR/USD for each day for the next year (using equation 3). We can then compute the payoff at maturity (using equation 1). However, in order to compute the expected payoff (using equation 4), we need to simulate thousands of such possible evolutions via a technique known as Monte Carlo simulation. The set of steps required to complete this process is known as an algorithm. To price a derivative, we ought to construct such algorithm and then implement it in an advanced programming language such as C++. Of course C++ is not the only possible choice, other languages include Java, VBA, C#, Mathworks Matlab, and Wolfram Mathematica. However, C++ is an industry standard because it's flexible, fast, and portable. Also, through the years, several numerical libraries have been created to conduct complex numerical calculations in C++. Finally, C++ is a powerful modern object-oriented language. It is always difficult to strike a balance between clarity and efficiency. We have aimed at making computer programs that are self-contained (not too object oriented) and self-explanatory. More advanced implementations are certainly possible, particularly in the context of larger financial pricing libraries in a corporate context. In this article, all the programs are implemented with the newest standard C++11 using Code::Blocks (http://www.codeblocks.org) and MinGW (http://www.mingw.org). The Bento Box template A Bento Box is a single portion take-away meal common in Japanese cuisine. Usually, it has a rectangular form that is internally divided in compartments to accommodate the various types of portions that constitute a meal. 
In this article, we use the metaphor of the Bento Box to describe a visual template to facilitate, organize, and structure the solution of derivative problems. The Bento Box template is simply a form that we will fill sequentially with the different elements that we require to price derivatives in a logical structured manner. The Bento Box template when used to price a particular derivative is divided into four areas or boxes, each containing information critical for the solution of the problem. The following figure illustrates a generic template applicable to all derivatives: The Bento Box template – general case The following figure shows an example of the Bento Box template as applied to a simple European Call option: The Bento Box template – European Call option In the preceding figure, we have filled the various compartments, starting in the top-left box and proceeding clockwise. Each compartment contains the details about our specific problem, taking us in sequence from the conceptual (box 1: derivative contract) to the practical (box 4: algorithm), passing through the quantitative aspects required for the solution (box 2: mathematical model and box 3: numerical method). Summary This article gave an overview of the main elements of Quantitative Finance as applied to pricing financial derivatives. The Bento Box template technique will be used to organize our approach to solve problems in pricing financial derivatives. We will assume that we are in possession with enough information to fill box 1 (derivative contract). Resources for Article: Further resources on this subject: Application Development in Visual C++ - The Tetris Application [article] Getting Started with Code::Blocks [article] Creating and Utilizing Custom Entities [article]
Using Execnet for Parallel and Distributed Processing with NLTK

Packt
09 Nov 2010
8 min read
Python Text Processing with NLTK 2.0 Cookbook: Use Python's NLTK suite of libraries to maximize your Natural Language Processing capabilities. Quickly get to grips with Natural Language Processing – with Text Analysis, Text Mining, and beyond. Learn how machines and crawlers interpret and process natural languages. Easily work with huge amounts of data and learn how to handle distributed processing. Part of Packt's Cookbook series: each recipe is a carefully organized sequence of instructions to complete the task as efficiently as possible. Introduction NLTK is great for in-memory single-processor natural language processing. However, there are times when you have a lot of data to process and want to take advantage of multiple CPUs, multi-core CPUs, and even multiple computers. Or perhaps you want to store frequencies and probabilities in a persistent, shared database so multiple processes can access it simultaneously. For the first case, we'll be using execnet to do parallel and distributed processing with NLTK. For the second case, you'll learn how to use the Redis data structure server/database to store frequency distributions and more. Distributed tagging with execnet Execnet is a distributed execution library for Python. It allows you to create gateways and channels for remote code execution. A gateway is a connection from the calling process to a remote environment. The remote environment can be a local subprocess or an SSH connection to a remote node. A channel is created from a gateway and handles communication between the channel creator and the remote code. Since many NLTK processes require 100 percent CPU utilization during computation, execnet is an ideal way to distribute that computation for maximum resource usage. You can create one gateway per CPU core, and it doesn't matter whether the cores are in your local computer or spread across remote machines. In many situations, you only need to have the trained objects and data on a single machine, and can send the objects and data to the remote nodes as needed. Getting ready You'll need to install execnet for this to work. It should be as simple as sudo pip install execnet or sudo easy_install execnet. The current version of execnet, as of this writing, is 1.0.8. The execnet homepage, which has API documentation and examples, is at http://codespeak.net/execnet/. How to do it... We start by importing the required modules, as well as an additional module remote_tag.py that will be explained in the next section. We also need to import pickle so we can serialize the tagger. Execnet does not natively know how to deal with complex objects such as a part-of-speech tagger, so we must dump the tagger to a string using pickle.dumps(). We'll use the default tagger that's used by the nltk.tag.pos_tag() function, but you could load and dump any pre-trained part-of-speech tagger as long as it implements the TaggerI interface. Once we have a serialized tagger, we start execnet by making a gateway with execnet.makegateway(). The default gateway creates a Python subprocess, and we can call the remote_exec() method with the remote_tag module to create a channel. With an open channel, we send over the serialized tagger and then the first tokenized sentence of the treebank corpus. You don't have to do any special serialization of simple types such as lists and tuples, since execnet already knows how to handle serializing the built-in types. 
Now if we call channel.receive(), we get back a tagged sentence that is equivalent to the first tagged sentence in the treebank corpus, so we know the tagging worked. We end by exiting the gateway, which closes the channel and kills the subprocess. >>> import execnet, remote_tag, nltk.tag, nltk.data >>> from nltk.corpus import treebank >>> import cPickle as pickle >>> tagger = pickle.dumps(nltk.data.load(nltk.tag._POS_TAGGER)) >>> gw = execnet.makegateway() >>> channel = gw.remote_exec(remote_tag) >>> channel.send(tagger) >>> channel.send(treebank.sents()[0]) >>> tagged_sentence = channel.receive() >>> tagged_sentence == treebank.tagged_sents()[0] True >>> gw.exit() Visually, the communication process looks like this: How it works... The gateway's remote_exec() method takes a single argument that can be one of the following three types: a string of code to execute remotely, the name of a pure function that will be serialized and executed remotely, or the name of a pure module whose source will be executed remotely. We use the third option with the remote_tag.py module, which is defined as follows: import cPickle as pickle if __name__ == '__channelexec__': tagger = pickle.loads(channel.receive()) for sentence in channel: channel.send(tagger.tag(sentence)) A pure module is a module that is self-contained. It can only access Python modules that are available where it executes, and does not have access to any variables or states that exist wherever the gateway is initially created. To detect that the module is being executed by execnet, you can look at the __name__ variable. If it's equal to '__channelexec__', then it is being used to create a remote channel. This is similar to doing if __name__ == '__main__' to check if a module is being executed on the command line. The first thing we do is call channel.receive() to get the serialized tagger, which we load using pickle.loads(). You may notice that channel is not imported anywhere; that's because it is included in the global namespace of the module. Any module that execnet executes remotely has access to the channel variable in order to communicate with the channel creator. Once we have the tagger, we iteratively tag() each tokenized sentence that we receive from the channel. This allows us to tag as many sentences as the sender wants to send, as iteration will not stop until the channel is closed. What we've essentially created is a compute node for part-of-speech tagging that dedicates 100 percent of its resources to tagging whatever sentences it receives. As long as the channel remains open, the node is available for processing. There's more... This is a simple example that opens a single gateway and channel. But execnet can do a lot more, such as opening multiple channels to increase parallel processing, as well as opening gateways to remote hosts over SSH to do distributed processing. Multiple channels We can create multiple channels, one per gateway, to make the processing more parallel. Each gateway creates a new subprocess (or remote interpreter if using an SSH gateway) and we use one channel per gateway for communication. Once we've created two channels, we can combine them using the MultiChannel class, which allows us to iterate over the channels, and make a receive queue to receive messages from each channel. After creating each channel and sending the tagger, we cycle through the channels to send an even number of sentences to each channel for tagging. Then we collect all the responses from the queue. 
A call to queue.get() will return a 2-tuple of (channel, message) in case you need to know which channel the message came from. If you don't want to wait forever, you can also pass a timeout keyword argument with the maximum number of seconds you want to wait, as in queue.get(timeout=4). This can be a good way to handle network errors. Once all the tagged sentences have been collected, we can exit the gateways. Here's the code: >>> import itertools >>> gw1 = execnet.makegateway() >>> gw2 = execnet.makegateway() >>> ch1 = gw1.remote_exec(remote_tag) >>> ch1.send(tagger) >>> ch2 = gw2.remote_exec(remote_tag) >>> ch2.send(tagger) >>> mch = execnet.MultiChannel([ch1, ch2]) >>> queue = mch.make_receive_queue() >>> channels = itertools.cycle(mch) >>> for sentence in treebank.sents()[:4]: ... channel = channels.next() ... channel.send(sentence) >>> tagged_sentences = [] >>> for i in range(4): ... channel, tagged_sentence = queue.get() ... tagged_sentences.append(tagged_sentence) >>> len(tagged_sentences) 4 >>> gw1.exit() >>> gw2.exit() Local versus remote gateways The default gateway spec is popen, which creates a Python subprocess on the local machine. This means execnet.makegateway() is equivalent to execnet.makegateway('popen'). If you have passwordless SSH access to a remote machine, then you can create a remote gateway using execnet.makegateway('ssh=remotehost'), where remotehost should be the hostname of the machine. An SSH gateway spawns a new Python interpreter for executing the code remotely. As long as the code you're using for remote execution is pure, you only need a Python interpreter on the remote machine. Channels work exactly the same no matter what kind of gateway is used; the only difference will be communication time. This means you can mix and match local subprocesses with remote interpreters to distribute your computations across many machines in a network. There are many more details on gateways in the API documentation at http://codespeak.net/execnet/basics.html.
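Building on this recipe, the multiple-channel pattern can be wrapped into a small reusable helper. The sketch below is illustrative only: the function name and the gateway specs passed in are assumptions, and it relies on the remote_tag module and the pickled tagger created earlier in the recipe. Results are returned in the order they arrive from the queue, which is not necessarily the input order.

import itertools
import execnet
import remote_tag  # the pure module defined in this recipe

def distributed_tag(tagger, sentences, specs=('popen', 'popen')):
    # one gateway (and one channel) per spec; specs may mix 'popen' and 'ssh=host'
    gateways = [execnet.makegateway(spec) for spec in specs]
    channels = []
    for gw in gateways:
        ch = gw.remote_exec(remote_tag)
        ch.send(tagger)  # each node needs its own copy of the serialized tagger
        channels.append(ch)
    mch = execnet.MultiChannel(channels)
    queue = mch.make_receive_queue()
    # distribute the tokenized sentences round-robin over the channels
    cycle = itertools.cycle(channels)
    for sentence in sentences:
        next(cycle).send(sentence)
    # collect one tagged sentence per input; the timeout guards against a dead node
    tagged = [queue.get(timeout=30)[1] for _ in sentences]
    for gw in gateways:
        gw.exit()
    return tagged

Called as distributed_tag(tagger, treebank.sents()[:100], specs=('popen', 'ssh=remotehost')), it would split the tagging work between a local subprocess and a remote host, assuming passwordless SSH access as described above.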
Integrating IBM Cognos TM1 with IBM Cognos 8 BI

Packt
16 Dec 2011
4 min read
(For more resources on IBM, see here.) Before proceeding with the actual steps of the recipe, we will take note of the following integration considerations: The measures Dimension in the TM1 Cube needs to be explicitly identified. A Data Source that points to the TM1 Cube needs to be created in IBM Cognos Connection. A new Data Source can also be created from IBM Cognos Framework Manager, but for the sake of simplicity we will be creating it from IBM Cognos Connection itself. The created Data Source is used in the IBM Cognos Framework Manager Model to create a Metadata Package and publish it to IBM Cognos Connection. The Metadata Package can be used to create reports, generate queries, slice and dice, or perform event management using one of the designer studios available in IBM Cognos BI. We will focus on each of the above steps in this recipe, where we will be using one of the Cubes created as part of the demodata TM1 Server application, and we will be using the Cube as a Data Source in the IBM Cognos BI layer. Getting ready Ensure that the TM1 Admin Server service is started and the demodata TM1 Server is running. We should have the IBM Cognos 8 BI Server running and IBM Cognos 8 Framework Manager installed. How to do it... Open the TM1 Architect and right-click on the Sales_Plan Cube. Click on Properties. In the Measures Dimension box, click on Sales_Plan_Measures and then for Time Dimension click on Months. Note that the preceding step is compulsory if we want to use the Cube as a Data Source for the BI layer. We need to explicitly define a measures dimension and a time dimension. Click on OK and minimize the TM1 Architect, keeping the server running. Now from the Start menu, open IBM Cognos Framework Manager, which is a desktop-based tool used to create metadata models. Create a new project from IBM Cognos 8 Framework Manager. Enter the Project name as Demodata and provide the Location where the model file will be located. Note that each project generates a .cpf file which can be opened in IBM Cognos Framework Manager. Provide valid user credentials so that IBM Cognos Framework Manager can link to a running IBM Cognos BI Server setup. Users and roles are defined by the IBM Cognos BI admin user. Choose English as the authoring language when the Select Language list comes up. This will open the Metadata Wizard - Select Metadata Source. We use the Metadata Wizard to create a new Data Source or point to an existing Data Source. In the Metadata Wizard make sure that Data Sources is selected and click on the Next button. In the next screen, click on the New button to create a new Data Source by the name of TM1_Demodata_Sales_Plan. This will open a New data source wizard, where we need to specify the name of the Data Source. On the next screen, it will ask for the Data Source Type, for which we will specify TM1 from the drop-down, as we want to create a new Data Source based on the TM1 Cube Sales_Plan. On the next screen specify the connection parameters. For Administration Host we can specify a name or localhost, depending on the name of the server. In our case, we have specified the name of the server as ankitgar, hence we are using an actual name instead of localhost. In the case of TM1 sitting on another server within the network, we will provide the IP address or name of the host in UNC format. Test the connection to verify whether the connection to the TM1 Cube is successful. Click on Close and proceed. Click on the Finish button to complete the creation of the Data Source. 
The new Data Source is created on the Cognos 8 Server and can now be used by anyone with valid privileges given by the admin user. It's just a connection to the Sales_Plan TM1 Cube, which can now be used to create metadata models and, hence, reports and queries that perform the various functions suggested in the preceding sections. Now it will return to the Metadata Wizard as shown, with the new Data Source appearing in the list of already created Data Sources. Click on the newly created Data Source and then on the Next button. It will display all available Cubes on the DemoData TM1 Server, the machine name being the server name (localhost/ankitgar). Click on the Sales_Plan cube and then on Next.
WebLogic Server

Packt
15 Mar 2017
24 min read
In this article by Adrian Ward, Christian Screen, and Haroun Khan, the authors of the book Oracle Business Intelligence Enterprise Edition 12c - Second Edition, we will talk in a little more detail about the enterprise application server that is at the core of Oracle Fusion Middleware: WebLogic. Oracle WebLogic Server is a scalable, enterprise-ready Java Platform Enterprise Edition (Java EE) application server. Its infrastructure supports the deployment of many types of distributed applications. It is also an ideal foundation for building service-oriented architecture (SOA) applications. You can already see why BEA was a perfect acquisition for Oracle years ago. Or, more to the point, a perfect core for Fusion Middleware. (For more resources related to this topic, see here.) The WebLogic Server is a robust application in itself. In Oracle BI 12c, the WebLogic Server is crucial to the overall implementation, not just from installation but throughout the Oracle BI 12c lifecycle, which now takes advantage of the WebLogic Management Framework. Learning the management components of WebLogic Server that ultimately control the Oracle BI components is critical to the success of an implementation. These management areas within the WebLogic Server are referred to as the WebLogic Administration Server, WebLogic Managed Server(s), and the WebLogic Node Manager. A Few WebLogic Server Nuances Before we move on to a description of each of those areas within WebLogic, it is also important to understand that the WebLogic Server software that is used for the installation of the Oracle BI product suite carries a limited license. Although the software itself is the full enterprise version and carries full functionality, the license that ships with Oracle BI 12c is not a full enterprise license for WebLogic Server, so your organization cannot use it to spin off other siloed JEE deployments on other non-OBIEE servers. Two further nuances are worth noting. Clustered from the installation: the WebLogic Server license provided with out-of-the-box Oracle BI 12c does not allow for horizontal scale-out. An enterprise WebLogic Server license needs to be obtained for this advanced functionality. Contains an embedded Web/HTTP server, not Oracle HTTP Server (OHS): WebLogic Server does not contain a separate HTTP server with the installation. The Oracle BI Enterprise Deployment Guide (available on oracle.com) discusses separating the Application Tier from the Web/HTTP tier, suggesting Oracle HTTP Server. These items are simply a few nuances of the product suite in relation to Oracle BI 12c. Most software products contain a short list such as this one. However, the better you understand the nuances, the easier it will be to ensure that you have a more successful implementation. It also allows your team to be as prepared in advance as possible. Be sure to consult your Oracle sales representative to assist with licensing concerns. Despite these nuances, we highly recommend that, in order to learn more about the installation features, configuration options, administration, and maintenance of WebLogic, you not only research it in relation to Oracle BI, but also in relation to its standalone form. That is to say that there is much more information at large on the topic of WebLogic Server itself than on WebLogic Server as it relates to Oracle BI. Understanding this approach to self-educating or web searching should provide you with more efficient results. WebLogic Domain The highest unit of management for controlling the WebLogic Server installation is called a domain. 
A domain is a logically related group of WebLogic Server resources that you manage as a unit. A domain always includes, and is centrally managed by, one Administration Server. Additional WebLogic Server instances, which are controlled by the Administration Server for the domain, are called Managed Servers. The configuration for all the servers in the domain is stored in the configuration repository, the config.xml file, which resides on the machine hosting the Administration Server. Upon installing and configuring Oracle BI 12c, the domain, bi, is established within the WebLogic Server. This domain is the recommended name for each Oracle BI 12c implementation and should not be modified. The domain path for the bi domain may appear as ORACLE_HOME/user_projects/domains/bi. This directory for the bi domain is also referred to as the DOMAIN_HOME or BI_DOMAIN folder WebLogic Administration Server The WebLogic Server is an enterprise software suite that manages a myriad of application server components, mainly focusing on Java technology. It is also comprised of many ancillary components, which enables the software to scale well, and also makes it a good choice for distributed environments and high-availability. Clearly, it is good enough to be at the core of Oracle Fusion Middleware. One of the most crucial components of WebLogic Server is WebLogic Administration Server. When installing the WebLogic Server software, the Administration Server is automatically installed with it. It is the Administration Server that not only controls all subsequent WebLogic Server instances, called Managed Servers, but it also controls such aspects as authentication-provider security (for example, LDAP) and other application-server-related configurations. WebLogic Server installs on the operating system and ultimately runs as a service on that machine. The WebLogic Server can be managed in several ways. The two main methods are via the Graphical User Interface (GUI) web application called WebLogic Administration Console, or via command line using the WebLogic Scripting Tool (WLST). You access the Administration Console from any networked machine using a web-based client (that is, a web browser) that can communicate with the Administration Server through the network and/or firewall. The WebLogic Administration Server and the WebLogic Server are basically synonymous. If the WebLogic Server is not running, the WebLogic Administration Console will be unavailable as well. WebLogic Managed Server Web applications, Enterprise Java Beans (EJB), and other resources are deployed onto one or more Managed Servers in a WebLogic Domain. A managed server is an instance of a WebLogic Server in a WebLogic Server domain. Each WebLogic Server domain has at least one instance, which acts as the Administration Server just discussed. One administration server per domain must exist, but one or more managed servers may exist in the WebLogic Server domain. In a production deployment, Oracle BI is deployed into its own managed server. The Oracle BI installer installs two WebLogic server instances, the Admin Server and a managed server, bi_server1. Oracle BI is deployed into the managed server bi_server1, and is configured by default to resolve to port 19502; the Admin Server resolves to port 19500. Historically, this has been port 9704 for the Oracle BI managed server, and port 7001 for the Admin Server. 
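Because WLST is a Python-based (Jython) shell, simple administrative checks like this can be scripted. The following is a minimal sketch rather than a production script: the credentials are placeholders, and the t3://localhost:19500 URL merely assumes the default Admin Server port mentioned above.

# save as check_bi_server.py and run with wlst.sh check_bi_server.py (wlst.cmd on Windows)
ADMIN_URL = 't3://localhost:19500'   # Admin Server listen address (assumed default)
ADMIN_USER = 'weblogic'              # placeholder credentials; use a secure store in practice
ADMIN_PASSWORD = 'welcome1'

# connect to the running Administration Server
connect(ADMIN_USER, ADMIN_PASSWORD, ADMIN_URL)

# report the lifecycle state of the Oracle BI managed server
state('bi_server1', 'Server')

disconnect()
exit()

A check of this kind is handy in start-up and shutdown scripts, since the Administration Console itself is unavailable whenever the Admin Server is down.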
When administering the WebLogic Server via the Administration Console, the WebLogic Administration Server instance appears in the same list of servers, which also includes any managed servers. As a best practice, the WebLogic Administration Server should be used for configuration and management of the WebLogic Server only, and not contain any additionally deployed applications, EJBs, and so on. One thing to note is that the Enterprise Manager Fusion Control is actually a JEE application deployed to the Administration Server instance, which is why its web client is accessible under the same port as the Admin Server. It is not necessarily a native application deployment to the core WebLogic Server, but gets deployed and configured during the Oracle BI installation and configuration process automatically. In the deployments page within the Administration Console, you will find a deployment named em. WebLogic Node Manager The general idea behind Node Manager is that it takes on somewhat of a middle-man role. That is to say, the Node Manager provides a communication tunnel between the WebLogic Administration Server and any Managed Servers configured within the WebLogic Domain. When the WebLogic Server environment is contained on a single physical server, it may be difficult to recognize the need for a Node Manager. It is very necessary and, as part of any of your ultimate start-up and shutdown scripts for Oracle BI, the Node Manager lifecycle management will have to be a part of that process. Node Manager’s real power comes into play when Oracle BI is scaled out horizontally on one or more physical servers. Each scaled-out deployment of WebLogic Server will contain a Node Manager. If the Node Manager is not running on the server on which the Managed Server is deployed, then the core Administration Server will not be able to issue start or stop commands to that server. As such, if the Node Manager is down, communication with the overall cluster will be affected. The following figure shows how machines A, B, and C are physically separated, each containing a Node Manager. You can see that the Administration Server communicates to the Node Managers, and not the Managed Servers, directly: System tools controlled by WebLogic We briefly discussed the WebLogic Administration Console, which controls the administrative configuration of the WebLogic Server Domain. This includes the components managed within it, such as security, deployed applications, and so on. The other management tool that provides control of the deployed Oracle BI application ancillary deployments, libraries, and several other configurations, is called the Enterprise Manager Fusion Middleware Control. This seems to be a long name for single web-based tool. As such, the name is often shortened to “Fusion Control” or “Enterprise Manager.” Reference to either abbreviated title in the context of Oracle BI should ensure fellow Oracle BI teammates understand what you mean. Security It would be difficult to discuss the overall architecture of Oracle BI without at least giving some mention to how the basics of security, authentication, and authorization are applied. By default, installing Oracle WebLogic Server provides a default Lightweight Directory Access Protocol (LDAP) server, referred to as the WebLogic Server Embedded LDAP server. This is a standards-compliant LDAP system, which acts as the default authentication method for out-of-the-box Oracle BI. 
Integration of secondary LDAP providers, such as Oracle Internet Directory (OID) or Microsoft Active Directory (MSAD), is crucial to leveraging most organizations' identity-management systems. The combination of multiple authentication providers is possible; in fact, it is commonplace. For example, a configuration may wish to have users that exist in both the Embedded LDAP server and MSAD authenticate and have access to Oracle BI. Potentially, an organization may want another set of users to be stored in a relational database repository, or have a set of relational database tables control the authorization that users have in relation to the Oracle BI system. WebLogic Server provides configuration opportunities for each of these scenarios. Oracle BI security incorporates the Fusion Middleware Security model, Oracle Platform Security Services (OPSS). This has a positive influence over managing all aspects of Oracle BI, as it provides a very granular level of authorization and a large number of authentication and authorization-integration mechanisms. OPSS also introduces to Oracle BI the concept of managing privileges by application role instead of directly by user or group. It abides by open standards to integrate with security mechanisms that are growing in popularity, such as the Security Assertion Markup Language (SAML) 2.0. Other well-known single-sign-on mechanisms such as SiteMinder and Oracle Access Manager already have pre-configured integration points within Oracle BI Fusion Control. Oracle BI 12c and Oracle BI 11g security is managed differently from the legacy Oracle BI 10g versions. Oracle BI 12c no longer has backward compatibility with the legacy version of Oracle BI 10g, and the focus should be on following the new security configuration best practices of Oracle BI 12c: an Oracle BI best practice is to manage security by Application Roles; understanding the differences between the Identity Store, Credential Store, and Policy Store is critical for advanced security configuration and maintenance; and, as of Oracle BI 12c, the OPSS metadata is now stored in a relational repository, which is installed as part of the RCU-schemas installation process that takes place prior to executing the Oracle BI 12c installation on the application server. Managing by Application Roles In Oracle BI 11g, the default security model is the Oracle Fusion Middleware security model, which has a very broad scope. A universal Information Technology security-administration best practice is to set permissions or privileges for a specific point of access on a group, and not on individual users. The same idea applies here, except that there is another, enterprise-level aggregation of users and even groups, called an Application Role. Application Roles can contain other application roles, groups, or individual users. Access privileges to a certain object, such as a folder, web page, or column, should always be assigned to an application role. Application roles for Oracle BI can be managed in the Oracle Enterprise Manager Fusion Middleware Control interface. They can also be scripted using the WLST command-line interface. Security Providers Fusion Middleware security can seem complex at first, but knowing the correct terminology and understanding how the most important components communicate with each other and the application at large is extremely important as it relates to security management. 
Oracle BI uses three main repositories for accessing authentication and authorization information, all of which are explained in the following sections. Identity Store Identity Store is the authentication provider, which may also provide authorization metadata. A simple mnemonic here is that this store tells Oracle BI how to “Identify” any users attempting to access the system. An example of creating an Identity Store would be to configure an LDAP system such as Oracle Internet Directory or Microsoft Active Directory to reference users within an organization. These LDAP configurations are referred to as Authentication Providers. Credential Store The credential store is ultimately for advanced Oracle configurations. You may touch upon this when establishing an enterprise Oracle BI deployment, but not much thereafter, unless integrating the Oracle BI Action Framework or something equally as complex. Ultimately, the credential store does exactly what its name implies – it stores credentials. Specifically, it is used to store credentials of other applications, which the core application (that is, Oracle BI) may access at a later time without having to re-enter said credentials. An example of this would be integrating Oracle BI with the Oracle Enterprise Management (EPM) suite. In this example, let's pretend there is an internal requirement at Company XYZ for users to access an Oracle BI dashboard. Upon viewing said dashboard, if a report with discrepancies is viewed, the user requires the ability to click on a link, which opens an Oracle EPM Financial Report containing more details about the concern. If not all users accessing the Oracle BI dashboard have credentials to access to the Oracle EPM environment directly, how could they open and view the report without being prompted for credentials? The answer would be that the credential store would be configured with the credentials of a central user having access to the Oracle EPM environment. This central user's credentials (encrypted, of course) are passed along with the dashboard viewer's request and presto, access! Policy Store The policy store is quite unique to Fusion Middleware security and leverages a security standard referred to as XACML, which ultimately provides granular access and privilege control for an enterprise application. This is one of the reasons why managing by Application Roles becomes so important. It is the individual Application Roles that are assigned policies defining access to information within Oracle BI. Stated another way, the application privileges, such as the ability to administer the Oracle BI RPD, are assigned to a particular application role, and these associations are defined in the policy store. The following figure shows how each area of security management is controlled: These three types of security providers within Oracle Fusion Middleware are integral to Oracle BI architecture. Further recommended research on this topic would be to look at Oracle Fusion Middleware Security, OPSS, and the Application Development Framework (ADF). System Requirements The first thing to recognize with infrastructure requirements prior to deploying Oracle BI 12c is that its memory and processor requirements have increased since previous versions. The Java Application server, WebLogic Server, installs with the full version of its software (though under a limited/restricted license, as already discussed). A multitude of additional Java libraries and applications are also deployed. 
Be prepared for a recommended minimum 8 to 16 GB Random Access Memory (RAM) requirement for an Enterprise deployment, and a 6 to 8 GB RAM minimum requirement for a workstation deployment. Client tools Oracle BI 12c has a separate client tools installation that requires Microsoft Windows XP or a more recent version of the Windows Operating System (OS). The Oracle BI 12c client tools provide the majority of client-to-server management capabilities required for normal day-to-day maintenance of the Oracle BI repository and related artefacts. The client-tools installation is usually reserved for Oracle BI developers who architect and maintain the Oracle BI metadata repository, better known as the RPD, which stems from its binary file extension (.rpd). The Oracle BI 12c client-tools installation provides each workstation with the Administration tool, Job Manager, and all command-line Application Programming Interface (API) executables. In Oracle BI 12c, a 64-bit Windows OS is a requirement for installing the Oracle BI Development Client tools. It has been observed that, with some initial releases of the Oracle BI 12c client tools, ODBC DSN connectivity does not work in Windows Server 2012. Therefore, utilizing Windows Server 2012 as a development environment will be ineffective if attempting to open the Administration Tool and connect to the OBIEE Server in online mode. Multi-User Development Environment One of the key features when developing with Oracle BI is the ability for multiple metadata developers to develop simultaneously. Although the use of the term "simultaneously" can vary among the technical communities, the use of concurrent development within the Oracle BI suite requires Oracle BI's Multi-User Development Environment (MUD) configuration, or some other process developed by third-party Oracle partners. The MUD configuration itself is fairly straightforward and ultimately relies on the Oracle BI administrator's ability to divide metadata modeling responsibilities into projects. Projects – which are usually defined and delineated by logical fact table definitions – can be assigned to one or more metadata developers. In previous versions of Oracle BI, a metadata developer could install the entire Oracle BI product suite on an up-to-date laptop or commodity desktop workstation and successfully develop, test, and deploy an Oracle BI metadata model. The system requirements of Oracle BI 12c require a significant amount of processor and RAM capacity in order to perform development efforts on a standard-issue workstation or laptop. If an organization currently leverages the Oracle BI multi-user development environment, or plans to do so with the current release, this raises a couple of questions: How do we get our developers the best environment suitable for developing our metadata? Do we need to procure new hardware? Microsoft Windows is a requirement for the Oracle BI client tools. However, the Oracle BI client tool does not include the server component of the Oracle BI environment. It only allows for connecting from the developer's workstation to the Oracle BI server instance. In a multi-user development environment, this poses a serious problem as only one metadata repository (RPD) can exist on any one Oracle BI server instance at any given time. 
If two developers are working from their respective workstations at the same time and wish to see their latest modifications published in a rapid application development (RAD) cycle, this type of iterative effort fails, as one developer's published changes will overwrite the other’s in real-time. To resolve the issue there are two recommended solutions. The first is an obvious localized solution. This solution merely upgrades the Oracle BI developers’ workstations or laptops to comply with the minimum requirements for installing the full Oracle BI environment on said machines. This upgrade should be both memory- (RAM) and processor- (MHz) centric. 16GB+ RAM and a dual-core processor are recommended. A 64-bit operating system kernel is required. Without an upgraded workstation from which to work, Oracle BI metadata developers will sit at a disadvantage for general iterative metadata development, and will especially be disenfranchised if interfacing within a multi-user development environment. The second solution is one that takes advantage of virtual machines’ (VM) capacity within the organization. Virtual machines have become a staple within most Information Technology departments, as they are versatile and allow for speedy proposition of server environments. For this scenario, it is recommended to create a virtual-machine template of an Oracle BI environment from which to duplicate and “stand up” individual virtual machine images for each metadata developer on the Oracle BI development team. This effectively provides each metadata developer with their own Oracle BI development environment server, which contains the fully deployed Oracle BI server environment. Each developer then has the ability to develop and test iteratively by connecting to their assigned virtual server, without fear that their efforts will conflict with another developer's. The following figure illustrates how an Oracle BI MUD environment can leverage either upgraded developer-workstation hardware or VM images, to facilitate development: This article does not cover the installation, configuration, or best practices for developing in a MUD environment. However, the Oracle BI development team deserves a lot of credit for documenting these processes in unprecedented detail. The Oracle BI 11g MUD documentation provides a case study, which conveys best practices for managing a complex Oracle BI development lifecycle. When ready to deploy a MUD environment, it is highly recommended to peruse this documentation first. The information in this section merely seeks to convey best practice in facilitating a developer’s workstation when using MUD. Certifications matrix Oracle BI 12c largely complies with the overall Fusion Middleware infrastructure. This common foundation allows for a centralized model to communicate with operating systems, web servers, and other ancillary components that are compliant. Oracle does a good job of updating a certification matrix for each Fusion Middleware application suite per respective product release. The certification matrix for Oracle BI 12c can be found on the Oracle website at the following locations: http://www.oracle.com/technetwork/middleware/fusion-middleware/documentation/fmw-1221certmatrix-2739738.xlsx and http://www.oracle.com/technetwork/middleware/ias/downloads/fusion-certification-100350.html. The certification matrix document is usually provided in Microsoft Excel format and should be referenced before any project or deployment of Oracle BI begins. 
This will ensure that infrastructure components such as the selected operating system, web server, web browsers, LDAP server, and so on, will actually work when integrated with the product suite. Scaling out Oracle BI 12c There are several reasons why an organization may wish to expand their Oracle BI footprint. This can range anywhere from requiring a highly available environment to achieving high levels of concurrent usage over time. The number of total end users, the number of total concurrent end users, the volume of queries, the size of the underlying data warehouse, and cross-network latency are even more factors to consider. Scaling out an environment has the potential to solve performance issues and stabilize the environment. When scoping out the infrastructure for an Oracle BI deployment, there are several crucial decisions to be made. These decisions can be greatly assisted by preparing properly, using Oracle's recommended guides for clustering and deploying Oracle BI on an enterprise scale. Pre-Configuration Run-Down Configuring the Oracle BI product suite, specifically when involving scaling out or setting up high-availability (HA), takes preparation. Proactively taking steps to understand what it takes to correctly establish or pre-configure the infrastructure required to support any level of fault tolerance and high-availability is critical. Even if the decision to scale-out from the initial Oracle BI deployment hasn't been made, if the potential exists, proper planning is recommended. Shared Storage We would be remiss not to highlight one of the most important concepts of scaling out Oracle BI, specifically for High-Availability: shared storage. The idea of shared storage is that, in a fault-tolerance environment, there are binary files and other configuration metadata that needs to be shared across the nodes. If these common elements were not shared, then, if one node were to fail, there is a potential loss of data. Most importantly is that, in a highly available Oracle BI environment, there can be only one WebLogic Administration Server running for that environment at any one time. A HA configuration makes one Administration Server active while the other is passive. If the appropriate pre-configuration steps for shared storage (as well as other items in the high-availability guide) are not properly completed, one should not expect accurate results from their environment. OBIEE 12c requires you to modify the Singleton Data Directory (SDD) for your Oracle BI configuration found at ORACLE_HOME/user_projects/domains/bi/data, so that the files within that path are moved to a shared storage location that would be mounted to the scaled-out servers on which a HA configuration would be implemented. To change this, one would need to modify the ORACLE_HOME/user_projects/domains/bi/config/fmwconfig/bienv/core/bi-environment.xml file to set the path of the bi:singleton-data-directory element to the full path of the shared mounted file location that contains a copy of the bidata folder, which will be referenced by one ore more scaled-out HA Oracle 12c servers. 
For example, change the XML file element: <bi:singleton-data-directory>/oraclehome/user_projects/domains/bi/bidata/</bi:singleton-data-directory> to reflect a shared NAS or SAN mount whose folder names and structure are in line with the IT team's standard naming conventions, where the /bidata folder is the folder from the main Oracle BI 12c instance that gets copied to the shared directory: <bi:singleton-data-directory>/mount02/obiee_shared_settings/bidata/</bi:singleton-data-directory> Clustering A major benefit of Oracle BI's ability to leverage WebLogic Server as the Java application server tier is that, per the default installation, Oracle BI gets established in a clustered architecture. No additional configuration is necessary to set this architecture in motion. Clearly, installing Oracle BI on a single server only provides a single server with which to interface; however, upon doing so, Oracle BI is installed into a single-node clustered-application-server environment. Additional clustered nodes of Oracle BI can then be configured to establish and expand the server, either horizontally or vertically. Vertical vs Horizontal With respect to the enterprise architecture and infrastructure of the Oracle BI environment, a clustered environment can be expanded in one of two ways: horizontally (scale-out) and vertically (scale-up). A horizontal expansion is the typical expansion type when clustering. It is represented by installing and configuring the application on a separate physical server, with reference to the main server application. A vertical expansion is usually represented by expanding the application on the same physical server on which the main server application resides. A horizontally expanded system can then, additionally, be vertically expanded. There are benefits to both scaling options. The decision to scale the system one way or the other is usually predicated on the cost of additional physical servers, server limitations, resources such as memory or processors, or an increase in usage activity by end users. Some considerations that may be used to assess which approach is best for your specific implementation are as follows:
- Load-balancing capabilities and the need for an Active-Active versus Active-Passive architecture
- The need for failover or high availability
- Costs for processor and memory enhancements versus the cost of new servers
- The anticipated increase in concurrent user queries
- Any realized decrease in performance due to an increase in user activity
Oracle BI Server (System Component) Cluster Controller When discussing scaling out the Oracle BI Server cluster, it is a common mistake to confuse WebLogic Server application clustering with the Oracle BI Server Cluster Controller. Currently, Oracle BI can only have a single metadata repository (RPD) reference associated with an Oracle BI Server deployment instance at any single point in time. Because of this, the Oracle BI Server engine leverages a failover concept to ensure that some level of high availability exists when the environment is scaled. In an Oracle BI scaled-out clustered environment, a secondary node, which has an instance of Oracle BI installed, will contain a secondary Oracle BI Server engine. From the main Oracle BI Managed Server containing the primary Oracle BI Server instance, the secondary Oracle BI Server instance gets established as the failover server engine using the Oracle BI Server Cluster Controller. This configuration takes place in the Enterprise Manager Fusion Middleware Control console. 
Based on this configuration, the scaled-out Oracle BI Server engine acts in an active-passive mode. That is to say, when the main Oracle BI Server engine instance fails, the secondary, or passive, Oracle BI Server engine becomes active to route requests and field queries. Summary This article introduced the key considerations for scaling out Oracle BI 12c, covering the certification matrix, shared storage, WebLogic Server clustering, and the Oracle BI Server Cluster Controller. Resources for Article: Further resources on this subject: Oracle 12c SQL and PL/SQL New Features [article] Schema Validation with Oracle JDeveloper - XDK 11g [article] Creating external tables in your Oracle 10g/11g Database [article]

article-image-query-completesuggest
Packt
25 May 2015
37 min read

Query complete/suggest

This article by the authors David Smiley, Eric Pugh, Kranti Parisa, and Matt Mitchel of the book, Apache Solr Enterprise Search Server - Third Edition, covers one of the most effective features of a search user interface—automatic/instant-search or completion of query input in a search input box. It is typically displayed as a drop-down menu that appears automatically after typing. There are several ways this can work: (For more resources related to this topic, see here.) Instant-search: Here, the menu is populated with search results. Each row is a document, just like the regular search results are, and as such, choosing one takes the user directly to the information instead of a search results page. At your discretion, you might opt to consider the last word partially typed. Examples of this are the URL bar in web browsers and various person search services. This is particularly effective for quick lookup scenarios against identifying information such as a name/title/identifier. It's less effective for broader searches. It's commonly implemented either with edge n-grams or with the Suggester component. Query log completion: If your application has sufficient query volume, then you should perform the query completion against previously executed queries that returned results. The pop-up menu is then populated with queries that others have typed. This is what Google does. It takes a bit of work to set this up. To get the query string and other information, you could write a custom search component, or parse Solr's log files, or hook into the logging system and parse it there. The query strings could be appended to a plain query log file, or inserted into a database, or added directly to a Solr index. Putting the data into a database before it winds up in a Solr index affords more flexibility on how to ultimately index it in Solr. Finally, at this point, you could index the field with an EdgeNGramTokenizer and perform searches against it, or use a KeywordTokenizer and then use one of the approaches listed for query term completion below. We recommend reading this excellent article by Jay Hill on doing this with EdgeNGrams at http://lucidworks.com/blog/auto-suggest-from-popular-queries-using-edgengrams/. Monitor your user's queries! Even if you don't plan to do query log completion, you should capture useful information about each request for ancillary usage analysis, especially to monitor which searches return no results. Capture the request parameters, the response time, the result count, and add a timestamp. Query term completion: The last word of the user's query is searched within the index as a prefix, and other indexed words starting with that prefix are provided. This type is an alternative to query log completion and it's easy to implement. There are several implementation approaches: facet the word using facet.prefix, use Solr's Suggester feature, or use the Terms component. You should consider these choices in that order. Facet/Field value completion: This is similar to query term completion, but it is done on data that you would facet or filter on. The pop-up menu of choices will ideally give suggestions across multiple fields with a label telling you which field each suggestion is for, and the value will be the exact field value, not the subset of it that the user typed. This is particularly useful when there are many possible filter choices. We've seen it used at Mint.com and elsewhere to great effect, but it is under-utilized in our opinion. 
If you don't have many fields to search, then the Suggester component could be used with one dictionary per field. Otherwise, build a search index dedicated to this information that contains one document per field and value pair, and use an edge n-gram approach to search it. There are other interesting query completion concepts we've seen on sites too, and some of these can be combined effectively. First, we'll cover a basic approach to instant-search using edge n-grams. Next, we'll describe three approaches to implementing query term completion—it's a popular type of query completion, and these approaches highlight different technologies within Solr. Lastly, we'll cover an approach to implement field-value suggestions for one field at a time, using the Suggester search component. Instant-search via edge n-grams As mentioned in the beginning of this section, instant-search is a technique in which a partial query is used to suggest a set of relevant documents, not terms. It's great for quickly finding documents by name or title, skipping the search results page. Here, we'll briefly describe how you might implement this approach using edge n-grams, which you can think of as a set of token prefixes. This is much faster than the equivalent wildcard query because the prefixes are all indexed. The edge n-gram technique is arguably more flexible than other suggest approaches: it's possible to do custom sorting or boosting, to use the highlighter easily to highlight the query, to offer infix suggestions (it isn't limited to matching titles left-to-right), and it's possible to filter the suggestions with a filter query, such as the current navigation filter state in the UI. It should be noted, though, that this technique is more complicated and increases indexing time and index size. It's also not quite as fast as the Suggester component. One of the key components to this approach is the EdgeNGramFilterFactory component, which creates a series of tokens for each input token for all possible prefix lengths. The field type definition should apply this filter to the index analyzer only, not the query analyzer. Enhancements to the field type could include adding filters such as LowerCaseFilterFactory, TrimFilterFactory, ASCIIFoldingFilterFactory, or even a PatternReplaceFilterFactory for normalizing repetitive spaces. Furthermore, you should set omitTermFreqAndPositions=true and omitNorms=true in the field type since these index features consume a lot of space and won't be needed. The Solr Admin Analysis tool can really help with the design of the perfect field type configuration. Don't hesitate to use this tool! A minimalist query for this approach is to simply query the n-grams field directly; since the field already contains prefixes, this just works. It's even better to have only the last word in the query search this field while the other words search a field indexed normally for keyword search. Here's an example: assuming a_name_wordedge is an n-grams based field and the user's search text box contains simple mi: http://localhost:8983/solr/mbartists/select?defType=edismax&qf=a_name&q.op=AND&q=simple a_name_wordedge:mi. The search client here inserted a_name_wordedge: before the last word. The combination of field type definition flexibility (custom filters and so on), and the ability to use features such as DisMax, custom boosting/sorting, and even highlighting, really make this approach worth exploring. 
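To make the moving parts concrete, here is a minimal, illustrative schema.xml sketch for such a field type; it is not taken from the book's bundled configuration, so the field type name, the tokenizer choice, and the gram sizes are assumptions that you would tune to your own data:

<fieldType name="text_wordedge" class="solr.TextField"
           omitTermFreqAndPositions="true" omitNorms="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <!-- index time only: emit all prefixes of each token -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>
<field name="a_name_wordedge" type="text_wordedge" indexed="true" stored="false"/>
<copyField source="a_name" dest="a_name_wordedge"/>

The copyField directive keeps the n-grams field populated from the normally indexed name field at index time, so a query like the one shown above can combine both fields.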
Query term completion via facet.prefix Most people don't realize that faceting can be used to implement query term completion, but it can. This approach has the unique and valuable benefit of returning completions filtered by filter queries (such as faceted navigation state) and by query words prior to the last one being completed. This means the completion suggestions should yield matching results, which is not the case for the other techniques. However, there are limits to its scalability in terms of memory use and inappropriateness for real-time search applications. Faceting on a tokenized field is going to use an entry in the field value cache (based on UnInvertedField) to hold all words in memory. It will use a hefty chunk of memory for many words, and it's going to take a non-trivial amount of time to build this cache on every commit during the auto-warming phase. For a data point, consider MusicBrainz's largest field: t_name (track name). It has nearly 700K words in it. It consumes nearly 100 MB of memory and it took 33 seconds to initialize on my machine. The mandatory initialization per commit makes this approach unsuitable for real-time-search applications. Measure this for yourself. Perform a trivial query to trigger its initialization and measure how long it takes. Then search Solr's statistics page for fieldValueCache. The size is given in bytes next to memSize. This statistic is also logged quite clearly. For this example, we have a search box searching track names and it contains the following: michael ja All of the words here except the last one become the main query for the term suggest. For our example, this is just michael. If there isn't anything, then we'd want to ensure that the request handler used would search for all documents. The faceted field is a_spell, and we want to sort by frequency. We also want there to be at least one occurrence, and we don't want more than five suggestions. We don't need the actual search results, either. This leaves the facet.prefix faceting parameter to make this work. This parameter filters the facet values to those starting with this value. Remember that facet values are the final result of text analysis, and therefore are probably lowercased for fields you might want to do term completion on. You'll need to pre-process the prefix value similarly, or else nothing will be found. We're going to set this to ja, the last word that the user has partially typed. Here is the URL for such a search http://localhost:8983/solr/mbartists/select?q=michael&df=a_spell&wt=json&omitHeader=true&indent=on&facet=on&rows=0&facet.limit=5&facet.mincount=1&facet.field=a_spell&facet.prefix=ja. When setting this up for real, we recommend creating a request handler just for term completion with many of these parameters defined there, so that they can be configured separately from your application. In this example, we're going to use Solr's JSON response format. Here is the result: { "response":{"numFound":1919,"start":0,"docs":[]}, "facet_counts":{    "facet_queries":{},    "facet_fields":{      "a_spell":[        "jackson",17,        "james",15,        "jason",4,        "jay",4,        "jacobs",2]},    "facet_dates":{},    "facet_ranges":{}}} This is exactly the information needed to populate a pop-up menu of choices that the user can conveniently choose from. However, there are some issues to be aware of with this feature: You may want to retain the case information of what the user is typing so that it can then be re-applied to the Solr results. 
Remember that facet.prefix will probably need to be lowercased, depending on text analysis. If stemming text analysis is performed on the field at the time of indexing, then the user might get completion choices that are clearly wrong. Most stemmers, namely Porter-based ones, stem off the suffix to an invalid word. Consider using a minimal stemmer, if any. For stemming and other text analysis reasons, you might want to create a separate field with suitable text analysis just for this feature. In our example here, we used a_spell on purpose because spelling suggestions and term completion have the same text analysis requirements. If you would like to perform term completion of multiple fields, then you'll be disappointed that you can't do so directly. The easiest way is to combine several fields at index time. Alternatively, a query searching multiple fields with faceting configured for multiple fields can be performed. It would be up to you to merge the faceting results based on ordered counts. Query term completion via the Suggester A high-speed approach to implement term completion, called the Suggester, was introduced in Version 3 of Solr. Until Solr 4.7, the Suggester was an extension of the spellcheck component. It can still be used that way, but it now has its own search component, which is how you should use it. Similar to spellcheck, it's not necessarily as up to date as your index and it needs to be built. However, the Suggester only takes a couple of seconds or so for this usually, and you are not forced to do this per commit, unlike with faceting. The Suggester is generally very fast—a handful of milliseconds per search at most for common setups. The performance characteristics are largely determined by a configuration choice (shown later) called lookupImpl, in which we recommend WFSTLookupFactory for query term completion (but not for other suggestion types). Additionally, the Suggester uniquely includes a method of loading its dictionary from a file that optionally includes a sorting weight. We're going to use it for MusicBrainz's artist name completion. The following is in our solrconfig.xml: <requestHandler name="/a_term_suggest" class="solr.SearchHandler" startup="lazy"> <lst name="defaults">    <str name="suggest">true</str>    <str name="suggest.dictionary">a_term_suggest</str>    <str name="suggest.count">5</str> </lst> <arr name="components">    <str>aTermSuggester</str> </arr> </requestHandler>    <searchComponent name="aTermSuggester" class="solr.SuggestComponent"> <lst name="suggester">    <str name="name">a_term_suggest</str>    <str name="lookupImpl">WFSTLookupFactory</str>    <str name="field">a_spell</str>    <!-- <float name="threshold">0.005</float> -->    <str name="buildOnOptimize">true</str> </lst> </searchComponent> The first part of this is a request handler definition just for using the Suggester. The second part of this is an instantiation of the SuggestComponent search component. The dictionary here is loaded from the a_spell field in the main index, but if a file is desired, then you can provide the sourceLocation parameter. The document frequency threshold for suggestions is commented here because MusicBrainz has unique names that we don't want filtered out. However, in common scenarios, this threshold is advised. The Suggester needs to be built, which is the process of building the dictionary from its source into an optimized memory structure. 
If you set storeDir, it will also save it such that the next time Solr starts, it will load automatically and be ready. If you try to get suggestions before it's built, there will be no results. The Suggester only takes a couple of seconds or so to build and so we recommend building it automatically on startup via a firstSearcher warming query in solrconfig.xml. If you are using Solr 5.0, then this is simplified by adding a buildOnStartup Boolean to the Suggester's configuration. To be kept up to date, it needs to be rebuilt from time to time. If commits are infrequent, you should use the buildOnCommit setting. We've chosen the buildOnOptimize setting as the dataset is optimized after it's completely indexed; and then, it's never modified. Realistically, you may need to schedule a URL fetch to trigger the build, as well as incorporate it into any bulk data loading scripts you develop. Now, let's issue a request to the Suggester. Here's a completion for the incomplete query string sma http://localhost:8983/solr/mbartists/a_term_suggest?q=sma&wt=json. And here is the output, indented: { "responseHeader":{    "status":0,    "QTime":1}, "suggest":{"a_term_suggest":{    "sma":{      "numFound":5,      "suggestions":[{        "term":"sma",        "weight":3,        "payload":""},      {        "term":"small",        "weight":110,        "payload":""},      {        "term":"smart",        "weight":50,        "payload":""},      {        "term":"smash",        "weight":36,        "payload":""},      {        "term":"smalley",        "weight":9,        "payload":""}]}}}} If the input is found, it's listed first; then suggestions are presented in weighted order. In the case of an index-based source, the weights are, by default, the document frequency of the value. For more information about the Suggester, see the Solr Reference Guide at https://cwiki.apache.org/confluence/display/solr/Suggester. You'll find information on lookupImpl alternatives and other details. However, some secrets of the Suggester are still undocumented, buried in the code. Look at the factories for more configuration options. Query term completion via the Terms component The Terms component is used to expose raw indexed term information, including term frequency, for an indexed field. It has a lot of options for paging into this voluminous data and filtering out terms by term frequency. The Terms component has the benefit of using no Java heap memory, and consequently, there is no initialization penalty. It's always up to date with the indexed data, like faceting but unlike the Suggester. The performance is typically good, but for high query load on large indexes, it will suffer compared to the other approaches. An interesting feature unique to this approach is a regular expression term match option. This can be used for case-insensitive matching, but it probably doesn't scale to many terms. For more information about this component, visit the Solr Reference Guide at https://cwiki.apache.org/confluence/display/solr/The+Terms+Component. Field-value completion via the Suggester In this example, we'll show you how to suggest complete field values. This might be used for instant-search navigation by a document name or title, or it might be used to filter results by a field. It's particularly useful for fields that you facet on, but it will take some work to integrate into the search user experience. This can even be used to complete multiple fields at once by specifying suggest.dictionary multiple times. 
To complete values across many fields at once, you should consider an alternative approach than what is described here. For example, use a dedicated suggestion index of each name-value pair and use an edge n-gram technique or shingling. We'll use the Suggester once again, but using a slightly different configuration. Using AnalyzingLookupFactory as the lookupImpl, this Suggester will be able to specify a field type for query analysis and another as the source for suggestions. Any tokenizer or filter can be used in the analysis chain (lowercase, stop words, and so on). We're going to reuse the existing textSpell field type for this example. It will take care of lowercasing the tokens and throwing out stop words. For the suggestion source field, we want to return complete field values, so a string field will be used; we can use the existing a_name_sort field for this, which is close enough. Here's the required configuration for the suggest component: <searchComponent name="aNameSuggester" class="solr.SuggestComponent"> <lst name="suggester">    <str name="name">a_name_suggest</str>    <str name="lookupImpl">AnalyzingLookupFactory</str>    <str name="field">a_name_sort</str>    <str name="buildOnOptimize">true</str>    <str name="storeDir">a_name_suggest</str>    <str name="suggestAnalyzerFieldType">textSpell</str> </lst> </searchComponent> And here is the request handler and component: <requestHandler name="/a_name_suggest" class="solr.SearchHandler" startup="lazy"> <lst name="defaults">    <str name="suggest">true</str>    <str name="suggest.dictionary">a_name_suggest</str>    <str name="suggest.count">5</str> </lst> <arr name="components">    <str>aNameSuggester</str> </arr> </requestHandler> We've set up the Suggester to build the index of suggestions after an optimize command. On a modestly powered laptop, the build time was about 5 seconds. Once the build is complete, the /a_name_suggest handler will return field values for any matching query. Here's an example that will make use of this Suggester: http://localhost:8983/solr/mbartists/a_name_suggest?wt=json&omitHeader=true&q=The smashing,pum. Here's the response from that query: { "spellcheck":{    "suggestions":[      "The smashing,pum",{        "numFound":1,        "startOffset":0,        "endOffset":16,        "suggestion":["Smashing Pumpkins, The"]},      "collation","(Smashing Pumpkins, The)"]}} As you can see, the Suggester is able to deal with the mixed case. Ignore The (a stop word) and also the , (comma) we inserted, as this is how our analysis is configured. Impressive! It's worth pointing out that there's a lot more that can be done here, depending on your needs, of course. It's entirely possible to add synonyms, additional stop words, and different tokenizers to the analysis chain. There are other interesting lookupImpl choices. FuzzyLookupFactory can suggest completions that are similarly typed to the input query; for example, words that are similar in spelling, or just typos. AnalyzingInfixLookupFactory is a Suggester that can provide completions from matching prefixes anywhere in the field value, not just the beginning. Other ones are BlendedInfixLookupFactory and FreeTextLookupFactory. See the Solr Reference Guide for further information. Summary In this article we learned about the query complete/suggest feature. We saw the different ways by which we can implement this feature. 
Resources for Article: Further resources on this subject: Apache Solr and Big Data – integration with MongoDB [article] Tuning Solr JVM and Container [article] Apache Solr PHP Integration [article]
article-image-gradient-descent-work
Packt
03 Feb 2016
11 min read

Gradient Descent at Work

In this article by Alberto Boschetti and Luca Massaron, authors of the book Regression Analysis with Python, we will learn about gradient descent, feature scaling, and a simple implementation. (For more resources related to this topic, see here.) As an alternative to the usual classical optimization algorithms, the gradient descent technique is able to minimize the cost function of a linear regression analysis using far fewer computations. In terms of complexity, gradient descent ranks in the order O(n*p), thus making the learning of regression coefficients feasible even with a large n (the number of observations) and a large p (the number of variables). The method works by leveraging a simple heuristic that gradually converges to the optimal solution starting from a random one. Explained in simple words, it resembles walking blind in the mountains. If you want to descend to the lowest valley, even if you don't know and can't see the path, you can proceed approximately by going downhill for a while, then stopping, then heading downhill again, and so on, always moving at each stage toward where the surface descends, until you arrive at a point where you cannot descend anymore. Hopefully, at that point, you will have reached your destination. In such a situation, your only risk is to pass by an intermediate valley (where there is a wood or a lake, for instance) and mistake it for your desired arrival because the land stops descending there. In an optimization process, such a situation is defined as a local minimum (whereas your target is the global minimum, the best minimum possible) and it is a possible outcome of your journey downhill, depending on the function you are working on minimizing. The good news, in any case, is that the error function of the linear model family is bowl-shaped (technically, our cost function is convex), and it is unlikely that you can get stuck anywhere if you properly descend. The necessary steps to work out a gradient-descent-based solution are hereby described. Given our cost function for a set of coefficients (the vector w), J(w) = 1/(2n) * Σi (xi·w - yi)², that is, half the mean of the squared prediction errors over the n observations, we first start by choosing a random initialization for w by choosing some random numbers (taken from a standardized normal curve, for instance, having zero mean and unit variance). Then, we start reiterating an update of the values of w (opportunely using the gradient descent computations) until the marginal improvement from the previous J(w) is small enough to let us figure out that we have finally reached an optimal minimum. We can opportunely update our coefficients, separately one by one, by subtracting from each of them a portion alpha (α, the learning rate) of the partial derivative of the cost function: wj = wj - α * ∂J(w)/∂wj. Here, in our formula, wj is to be intended as a single coefficient (we are iterating over them). After resolving the partial derivative, the final form of the update is: wj = wj - α * (1/n) * Σi (xi·w - yi) * xij. Simplifying everything, our gradient for the coefficient of xj is just the average of the prediction errors (the predicted values minus the actual y values) multiplied by their respective xj values. We have to notice that by introducing more parameters to be estimated during the optimization procedure, we are actually introducing more dimensions to our line of fit (turning it into a hyperplane, a multidimensional surface), and such dimensions have certain commonalities and differences to be taken into account. Alpha, called the learning rate, is very important in the process, because if it is too large, it may cause the optimization to detour and fail. 
You have to think of each gradient as a jump or a run in a direction. If you take it fully, you may happen to pass over the optimal minimum and end up on another rising slope. Too many consecutive long steps may even force you to climb up the cost slope, worsening your initial position (as measured by the cost function, the summed squared loss that scores the overall fit). Using a small alpha, the gradient descent won't jump beyond the solution, but it may take much longer to reach the desired minimum. How to choose the right alpha is a matter of trial and error. Anyway, starting from an alpha such as 0.01 is never a bad choice, based on our experience in many optimization problems. Naturally, the gradient, given the same alpha, will in any case produce shorter steps as you approach the solution. Visualizing the steps in a graph can really give you a hint about whether the gradient descent is working out a solution or not. Though conceptually quite simple (it is based on an intuition that we have surely all applied ourselves, moving step by step toward where we can optimize our result), gradient descent is very effective and indeed scalable when working with real data. Such interesting characteristics elevated it to be the core optimization algorithm in machine learning, not limited to just the linear model family, but also, for instance, extended to neural networks for the process of back propagation that updates all the weights of the neural net in order to minimize the training errors. Surprisingly, gradient descent is also at the core of another complex machine learning algorithm, the gradient boosting tree ensembles, where we have an iterative process minimizing the errors using a simpler learning algorithm (a so-called weak learner, because it is limited by a high bias) to progress toward the optimization. Several of Scikit-learn's linear models in the linear_model module (most notably SGDRegressor) are powered by gradient descent variants, making Scikit-learn our favorite choice while working on data science projects with large and big data. Feature scaling While using the classical statistical approach, not the machine learning one, working with multiple features requires attention while estimating the coefficients, because their similarities can cause a variance inflation of the estimates. Moreover, multicollinearity between variables also bears other drawbacks, because it can make matrix inversion (the matrix operation at the core of normal equation coefficient estimation) very difficult, if not impossible, a problem due to the mathematical limitations of that algorithm. Gradient descent, instead, is not affected at all by reciprocal correlation, allowing the estimation of reliable coefficients even in the presence of perfect collinearity. Anyway, though quite resistant to the problems that affect other approaches, gradient descent's simplicity renders it vulnerable to other common problems, such as the different scale present in each feature. In fact, some features in your data may be represented by measurements in units, others in decimals, and others in thousands, depending on what aspect of reality each feature represents. 
For instance, in the dataset we decide to take as an example, the Boston houses dataset (http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html), a feature is the average number of rooms (a float ranging from about 5 to over 8), others are the percentage of certain pollutants in the air (float between 0 and 1), and so on, mixing very different measurements. When it is the case that the features have a different scale, though the algorithm will be processing each of them separately, the optimization will be dominated by the variables with the more extensive scale. Working in a space of dissimilar dimensions will require more iterations before convergence to a solution (and sometimes, there could be no convergence at all). The remedy is very easy; it is just necessary to put all the features on the same scale. Such an operation is called feature scaling. Feature scaling can be achieved through standardization or normalization. Normalization rescales all the values in the interval between zero and one (usually, but different ranges are also possible), whereas standardization operates removing the mean and dividing by the standard deviation to obtain a unit variance. In our case, standardization is preferable both because it easily permits retuning the obtained standardized coefficients into their original scale and because, centering all the features at the zero mean, it makes the error surface more tractable by many machine learning algorithms, in a much more effective way than just rescaling the maximum and minimum of a variable. An important reminder while applying feature scaling is that changing the scale of the features implies that you will have to use rescaled features also for predictions. A simple implementation Let's try the algorithm first using the standardization based on the Scikit-learn preprocessing module: import numpy as np import random from sklearn.datasets import load_boston from sklearn.preprocessing import StandardScaler from sklearn.linear_model import LinearRegression   boston = load_boston() standardization = StandardScaler() y = boston.target X = np.column_stack((standardization.fit_transform(boston.data), np.ones(len(y)))) In the preceding code, we just standardized the variables using the StandardScaler class from Scikit-learn. This class can fit a data matrix, record its column means and standard deviations, and operate a transformation on itself as well as on any other similar matrixes, standardizing the column data. By means of this method, after fitting, we keep a track of the means and standard deviations that have been used because they will come handy if afterwards we will have to recalculate the coefficients using the original scale. 
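As a small, hedged aside (this helper is not part of the book's code), once the optimization shown below has produced the standardized coefficient vector w, with the intercept as its last element, the means and standard deviations stored by the fitted StandardScaler let you re-express the coefficients on the original feature scale. In recent Scikit-learn versions these are exposed as the mean_ and scale_ attributes (older releases exposed std_ instead of scale_):

# Hypothetical helper: map standardized coefficients back to the original scale
w_arr = np.asarray(w)                      # optimize() below may return a plain list
orig_coefs = w_arr[:-1] / standardization.scale_
orig_intercept = w_arr[-1] - np.sum(w_arr[:-1] * standardization.mean_ / standardization.scale_)
print("Coefficients on the original scale: " + ', '.join("%0.4f" % c for c in orig_coefs))
print("Intercept on the original scale: %0.4f" % orig_intercept)

Predictions made with orig_coefs and orig_intercept on the raw, unscaled features will match those made with w on the standardized ones, which is a handy sanity check.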
Now, we just record a few functions for the following computations: def random_w(p):     return np.array([np.random.normal() for j in range(p)])   def hypothesis(X, w):     return np.dot(X,w)   def loss(X, w, y):     return hypothesis(X, w) - y   def squared_loss(X, w, y):     return loss(X, w, y)**2   def gradient(X, w, y):     gradients = list()     n = float(len(y))     for j in range(len(w)):         gradients.append(np.sum(loss(X, w, y) * X[:,j]) / n)     return gradients   def update(X, w, y, alpha=0.01):     return [t - alpha*g for t, g in zip(w, gradient(X, w, y))]   def optimize(X, y, alpha=0.01, eta = 10**-12, iterations = 1000):     w = random_w(X.shape[1])     for k in range(iterations):         SSL = np.sum(squared_loss(X,w,y))         new_w = update(X,w,y, alpha=alpha)         new_SSL = np.sum(squared_loss(X,new_w,y))         w = new_w         if k>=5 and (new_SSL - SSL <= eta and new_SSL - SSL >= -eta):             return w     return w We can now calculate our regression coefficients: w = optimize(X, y, alpha = 0.02, eta = 10**-12, iterations = 20000) print ("Our standardized coefficients: " +   ', '.join(map(lambda x: "%0.4f" % x, w))) Our standardized coefficients: -0.9204, 1.0810, 0.1430, 0.6822, -2.0601, 2.6706, 0.0211, -3.1044, 2.6588, -2.0759, -2.0622, 0.8566, -3.7487, 22.5328 A simple comparison with Scikit-learn's solution can prove if our code worked fine: sk=LinearRegression().fit(X[:,:-1],y) w_sk = list(sk.coef_) + [sk.intercept_] print ("Scikit-learn's standardized coefficients: " + ', '.join(map(lambda x: "%0.4f" % x, w_sk))) Scikit-learn's standardized coefficients: -0.9204, 1.0810, 0.1430, 0.6822, -2.0601, 2.6706, 0.0211, -3.1044, 2.6588, -2.0759, -2.0622, 0.8566, -3.7487, 22.5328 A noticeable particular to mention is our choice of alpha. After some tests, the value of 0.02 has been chosen for its good performance on this very specific problem. Alpha is the learning rate and, during optimization, it can be fixed or changed according to a line search method, modifying its value in order to minimize the cost function at each single step of the optimization process. In our example, we opted for a fixed learning rate and we had to look for its best value by trying a few optimization values and deciding on which minimized the cost in the minor number of iterations. Summary In this article we learned about gradient descent, its feature scaling and a simple implementation using an algorithm based on Scikit-learn preprocessing module. Resources for Article:   Further resources on this subject: Optimization Techniques [article] Saving Time and Memory [article] Making Your Data Everything It Can Be [article]

Securing the Hadoop Ecosystem

Packt
20 Nov 2013
6 min read
(For more resources related to this topic, see here.)

Each ecosystem component has its own security challenges and needs to be configured uniquely, based on its architecture, in order to secure it. Each of these ecosystem components has end users directly accessing the component, or a backend service accessing the Hadoop core components (HDFS and MapReduce). The following are the topics that we'll be covering in this article:

Configuring authentication and authorization for the following Hadoop ecosystem components: Hive, Oozie, Flume, HBase, Sqoop, and Pig
Best practices in configuring secured Hadoop components
Configuring Kerberos for Hadoop ecosystem components

The Hadoop ecosystem is growing continuously and maturing with increasing enterprise adoption. In this section, we look at some of the most important Hadoop ecosystem components, their architecture, and how they can be secured.

Securing Hive

Hive provides the ability to run SQL queries over the data stored in HDFS. Hive provides the Hive query engine, which converts Hive queries submitted by the user into a pipeline of MapReduce jobs that are submitted to Hadoop (JobTracker or ResourceManager) for execution. The results of the MapReduce executions are then presented back to the user or stored in HDFS. The following figure shows a high-level interaction of a business user working with Hive to run Hive queries on Hadoop:

There are multiple ways a Hadoop user can interact with Hive and run Hive queries; these are as follows:

The user can directly run Hive queries using the Command Line Interface (CLI). The CLI connects to the Hive metastore using the metastore server and invokes the Hive query engine directly to execute the query on the cluster.
Custom applications written in Java and other languages interact with Hive using HiveServer. HiveServer, internally, uses the metastore server and the Hive query engine to execute the query on the cluster.

To secure Hive in the Hadoop ecosystem, the following interactions should be secured:

User interaction with the Hive CLI or HiveServer
User roles and privileges need to be enforced, to ensure users have access only to authorized data
The interaction between Hive and Hadoop (JobTracker or ResourceManager) has to be secured, and the user roles and privileges should be propagated to the Hadoop jobs

To ensure secure Hive user interaction, the user has to be authenticated by HiveServer or the CLI before running any jobs on the cluster. The user first uses the kinit command to fetch a Kerberos ticket. This ticket is stored in the credential cache and used to authenticate with Kerberos-enabled systems. Once the user is authenticated, Hive submits the job to Hadoop (JobTracker or ResourceManager). Hive needs to impersonate the user to execute MapReduce on the cluster.

From Hive version 0.11 onwards, HiveServer2 was introduced. The earlier HiveServer had serious security limitations related to user authentication. HiveServer2 supports Kerberos and LDAP authentication for user authentication. When HiveServer2 is configured for LDAP authentication, Hive users are managed in the LDAP store, but it is Hive itself that submits the MapReduce jobs to Hadoop on behalf of those users. Thus, if we configure HiveServer2 to use LDAP, only the user authentication between the client and HiveServer2 is addressed; the interaction of Hive with Hadoop remains insecure, and Hive MapReduce jobs will be able to access other users' data in the Hadoop cluster.
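Before looking at how Kerberos addresses this, here is a small, hedged sketch of what the client side of the kinit-then-connect flow described above can look like in practice. PyHive is not part of this article, and the host name and query below are illustrative assumptions; the only real prerequisites are that a valid Kerberos ticket has already been obtained with kinit and that the kerberos_service_name matches the service part of the hive/_HOST@REALM principal.

# Illustrative sketch only (PyHive is an assumption, not used in this article).
# Run `kinit user@YOUR-REALM.COM` first so a ticket is in the credential cache.
from pyhive import hive

conn = hive.connect(
    host='hiveserver2.example.com',   # hypothetical HiveServer2 host
    port=10000,                       # default HiveServer2 port
    auth='KERBEROS',                  # authenticate with the cached Kerberos ticket
    kerberos_service_name='hive'      # service part of hive/_HOST@YOUR-REALM.COM
)
cursor = conn.cursor()
cursor.execute('SHOW TABLES')         # any simple query to verify the connection
print(cursor.fetchall())
cursor.close()
conn.close()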
On the other hand, when we use Kerberos authentication for Hive users with HiveServer2, the same user is impersonated to execute MapReduce on the Hadoop cluster. So it is recommended that, in a production environment, we configure HiveServer2 with Kerberos to have seamless authentication and access control for the users submitting Hive queries. The credential store for the Kerberos KDC can be configured to be LDAP, so that we can centrally manage the credentials of the end users.

To set up secured Hive interactions, we need to perform the following steps:

One of the key steps in securing Hive interaction is to ensure that the Hive user is impersonated in Hadoop, as Hive executes a MapReduce job on the Hadoop cluster. To achieve this goal, we need to add the hive.server2.enable.impersonation configuration in hive-site.xml, and hadoop.proxyuser.hive.hosts and hadoop.proxyuser.hive.groups in core-site.xml.

<property>
  <name>hive.server2.authentication</name>
  <value>KERBEROS</value>
</property>
<property>
  <name>hive.server2.authentication.kerberos.principal</name>
  <value>hive/_HOST@YOUR-REALM.COM</value>
</property>
<property>
  <name>hive.server2.authentication.kerberos.keytab</name>
  <value>/etc/hive/conf/hive.keytab</value>
</property>
<property>
  <name>hive.server2.enable.impersonation</name>
  <description>Enable user impersonation for HiveServer2</description>
  <value>true</value>
</property>

Securing Hive using Sentry

In the previous section, we saw how Hive authentication can be enforced using Kerberos, and how user privileges are enforced through user impersonation in Hadoop. Sentry is one of the latest entrants in the Hadoop ecosystem, and it provides fine-grained user authorization for the data that is stored in Hive. Sentry provides fine-grained, role-based authorization to Hive and Impala. Sentry uses HiveServer2 and the metastore server to execute the queries on the Hadoop platform. However, user impersonation is turned off in HiveServer2 when Sentry is used. Sentry enforces user privileges on the Hadoop data using the Hive metastore. Sentry supports authorization policies per database/schema, which can be leveraged to enforce user management policies. More details on Sentry are available at the following URL:

http://www.cloudera.com/content/cloudera/en/products/cdh/sentry.html

Summary

In this article, we learned how to configure Kerberos for Hadoop ecosystem components. We also looked at how to secure Hive using Sentry.

Resources for Article:

Further resources on this subject:
Advanced Hadoop MapReduce Administration [Article]
Managing a Hadoop Cluster [Article]
Making Big Data Work for Hadoop and Solr [Article]

Python Text Processing with NLTK 2: Transforming Chunks and Trees

Packt
16 Dec 2010
10 min read
Python Text Processing with NLTK 2.0 Cookbook

Use Python's NLTK suite of libraries to maximize your Natural Language Processing capabilities.

Quickly get to grips with Natural Language Processing – with Text Analysis, Text Mining, and beyond
Learn how machines and crawlers interpret and process natural languages
Easily work with huge amounts of data and learn how to handle distributed processing
Part of Packt's Cookbook series: Each recipe is a carefully organized sequence of instructions to complete the task as efficiently as possible

Introduction

This article will show you how to do various transforms on both chunks and trees. The chunk transforms are for grammatical correction and rearranging phrases without loss of meaning. The tree transforms give you ways to modify and flatten deep parse trees. The functions detailed in these recipes modify data, as opposed to learning from it. That means it's not safe to apply them indiscriminately. A thorough knowledge of the data you want to transform, along with a few experiments, should help you decide which functions to apply and when.

Whenever the term chunk is used in this article, it could refer to an actual chunk extracted by a chunker, or it could simply refer to a short phrase or sentence in the form of a list of tagged words. What's important in this article is what you can do with a chunk, not where it came from.

Filtering insignificant words

Many of the most commonly used words are insignificant when it comes to discerning the meaning of a phrase. For example, in the phrase "the movie was terrible", the most significant words are "movie" and "terrible", while "the" and "was" are almost useless. You could get the same meaning if you took them out, such as "movie terrible" or "terrible movie". Either way, the sentiment is the same. In this recipe, we'll learn how to remove the insignificant words, and keep the significant ones, by looking at their part-of-speech tags.

Getting ready

First, we need to decide which part-of-speech tags are significant and which are not. Looking through the treebank corpus for stopwords yields the following table of insignificant words and tags:

Word    Tag
a       DT
all     PDT
an      DT
and     CC
or      CC
that    WDT
the     DT

Other than CC, all the tags end with DT. This means we can filter out insignificant words by looking at the tag's suffix.

How to do it...

In transforms.py there is a function called filter_insignificant(). It takes a single chunk, which should be a list of tagged words, and returns a new chunk without any insignificant tagged words. It defaults to filtering out any tags that end with DT or CC.

def filter_insignificant(chunk, tag_suffixes=['DT', 'CC']):
    good = []
    for word, tag in chunk:
        ok = True
        for suffix in tag_suffixes:
            if tag.endswith(suffix):
                ok = False
                break
        if ok:
            good.append((word, tag))
    return good

Now we can use it on the part-of-speech tagged version of "the terrible movie".

>>> from transforms import filter_insignificant
>>> filter_insignificant([('the', 'DT'), ('terrible', 'JJ'), ('movie', 'NN')])
[('terrible', 'JJ'), ('movie', 'NN')]

As you can see, the word "the" is eliminated from the chunk.

How it works...

filter_insignificant() iterates over the tagged words in the chunk. For each tag, it checks if that tag ends with any of the tag_suffixes. If it does, then the tagged word is skipped. However, if the tag is ok, then the tagged word is appended to a new good chunk that is returned.

There's more...
The way filter_insignificant() is defined, you can pass in your own tag suffixes if DT and CC are not enough, or are incorrect for your case. For example, you might decide that possessive words and pronouns such as "you", "your", "their", and "theirs" are no good, but DT and CC words are ok. The tag suffixes would then be PRP and PRP$. Following is an example of this function:

>>> filter_insignificant([('your', 'PRP$'), ('book', 'NN'), ('is', 'VBZ'), ('great', 'JJ')], tag_suffixes=['PRP', 'PRP$'])
[('book', 'NN'), ('is', 'VBZ'), ('great', 'JJ')]

Filtering insignificant words can be a good complement to stopword filtering for purposes such as search engine indexing, querying, and text classification.

Correcting verb forms

It's fairly common to find incorrect verb forms in real-world language. For example, the correct form of "is our children learning?" is "are our children learning?". The verb "is" should only be used with singular nouns, while "are" is for plural nouns, such as "children". We can correct these mistakes by creating verb correction mappings that are used depending on whether there's a plural or singular noun in the chunk.

Getting ready

We first need to define the verb correction mappings in transforms.py. We'll create two mappings, one for plural to singular, and another for singular to plural.

plural_verb_forms = {
    ('is', 'VBZ'): ('are', 'VBP'),
    ('was', 'VBD'): ('were', 'VBD')
}

singular_verb_forms = {
    ('are', 'VBP'): ('is', 'VBZ'),
    ('were', 'VBD'): ('was', 'VBD')
}

Each mapping has a tagged verb that maps to another tagged verb. These initial mappings cover the basics: is to are, was to were, and vice versa.

How to do it...

In transforms.py there is a function called correct_verbs(). Pass it a chunk with incorrect verb forms, and you'll get a corrected chunk back. It uses a helper function, first_chunk_index(), to search the chunk for the position of the first tagged word for which pred returns True.

def first_chunk_index(chunk, pred, start=0, step=1):
    l = len(chunk)
    end = l if step > 0 else -1
    for i in range(start, end, step):
        if pred(chunk[i]):
            return i
    return None

def correct_verbs(chunk):
    vbidx = first_chunk_index(chunk, lambda (word, tag): tag.startswith('VB'))
    # if no verb found, do nothing
    if vbidx is None:
        return chunk
    verb, vbtag = chunk[vbidx]
    nnpred = lambda (word, tag): tag.startswith('NN')
    # find nearest noun to the right of verb
    nnidx = first_chunk_index(chunk, nnpred, start=vbidx+1)
    # if no noun found to right, look to the left
    if nnidx is None:
        nnidx = first_chunk_index(chunk, nnpred, start=vbidx-1, step=-1)
    # if no noun found, do nothing
    if nnidx is None:
        return chunk
    noun, nntag = chunk[nnidx]
    # get correct verb form and insert into chunk
    if nntag.endswith('S'):
        chunk[vbidx] = plural_verb_forms.get((verb, vbtag), (verb, vbtag))
    else:
        chunk[vbidx] = singular_verb_forms.get((verb, vbtag), (verb, vbtag))
    return chunk

When we call it on a part-of-speech tagged "is our children learning" chunk, we get back the correct form, "are our children learning".

>>> from transforms import correct_verbs
>>> correct_verbs([('is', 'VBZ'), ('our', 'PRP$'), ('children', 'NNS'), ('learning', 'VBG')])
[('are', 'VBP'), ('our', 'PRP$'), ('children', 'NNS'), ('learning', 'VBG')]

We can also try this with a singular noun and an incorrect plural verb.
>>> correct_verbs([('our', 'PRP$'), ('child', 'NN'), ('were', 'VBD'), ('learning', 'VBG')])
[('our', 'PRP$'), ('child', 'NN'), ('was', 'VBD'), ('learning', 'VBG')]

In this case, "were" becomes "was" because "child" is a singular noun.

How it works...

The correct_verbs() function starts by looking for a verb in the chunk. If no verb is found, the chunk is returned with no changes. Once a verb is found, we keep the verb, its tag, and its index in the chunk. Then we look on either side of the verb to find the nearest noun, starting on the right, and only looking to the left if no noun is found on the right. If no noun is found at all, the chunk is returned as is. But if a noun is found, then we look up the correct verb form depending on whether or not the noun is plural. Plural nouns are tagged with NNS, while singular nouns are tagged with NN. This means we can check the plurality of a noun by seeing if its tag ends with S. Once we get the corrected verb form, it is inserted into the chunk to replace the original verb form.

To make searching through the chunk easier, we define a function called first_chunk_index(). It takes a chunk, a lambda predicate, the starting index, and a step increment. The predicate function is called with each tagged word until it returns True. If it never returns True, then None is returned. The starting index defaults to zero and the step increment to one. As you'll see in upcoming recipes, we can search backwards by overriding start and setting step to -1. This small utility function will be a key part of subsequent transform functions.

Swapping verb phrases

Swapping the words around a verb can eliminate the passive voice from particular phrases. For example, "the book was great" can be transformed into "the great book".

How to do it...

In transforms.py there is a function called swap_verb_phrase(). It swaps the right-hand side of the chunk with the left-hand side, using the verb as the pivot point. It uses the first_chunk_index() function defined in the previous recipe to find the verb to pivot around.

def swap_verb_phrase(chunk):
    # find location of verb
    vbpred = lambda (word, tag): tag != 'VBG' and tag.startswith('VB') and len(tag) > 2
    vbidx = first_chunk_index(chunk, vbpred)
    if vbidx is None:
        return chunk
    return chunk[vbidx+1:] + chunk[:vbidx]

Now we can see how it works on the part-of-speech tagged phrase "the book was great".

>>> from transforms import swap_verb_phrase
>>> swap_verb_phrase([('the', 'DT'), ('book', 'NN'), ('was', 'VBD'), ('great', 'JJ')])
[('great', 'JJ'), ('the', 'DT'), ('book', 'NN')]

The result is "great the book". This phrase clearly isn't grammatically correct, so read on to learn how to fix it.

How it works...

Using first_chunk_index() from the previous recipe, we start by finding the first matching verb that is not a gerund (a word that ends in "ing") tagged with VBG. Once we've found the verb, we return the chunk with the right side before the left, and remove the verb. The reason we don't want to pivot around a gerund is that gerunds are commonly used to describe nouns, and pivoting around one would remove that description. Here's an example where you can see how not pivoting around a gerund is a good thing:

>>> swap_verb_phrase([('this', 'DT'), ('gripping', 'VBG'), ('book', 'NN'), ('is', 'VBZ'), ('fantastic', 'JJ')])
[('fantastic', 'JJ'), ('this', 'DT'), ('gripping', 'VBG'), ('book', 'NN')]

If we had pivoted around the gerund, the result would be "book is fantastic this", and we'd lose the gerund "gripping".

There's more...
Filtering insignificant words makes the final result more readable. By filtering either before or after swap_verb_phrase(), we get "fantastic gripping book" instead of "fantastic this gripping book".

>>> from transforms import swap_verb_phrase, filter_insignificant
>>> swap_verb_phrase(filter_insignificant([('this', 'DT'), ('gripping', 'VBG'), ('book', 'NN'), ('is', 'VBZ'), ('fantastic', 'JJ')]))
[('fantastic', 'JJ'), ('gripping', 'VBG'), ('book', 'NN')]
>>> filter_insignificant(swap_verb_phrase([('this', 'DT'), ('gripping', 'VBG'), ('book', 'NN'), ('is', 'VBZ'), ('fantastic', 'JJ')]))
[('fantastic', 'JJ'), ('gripping', 'VBG'), ('book', 'NN')]

Either way, we get a shorter grammatical chunk with no loss of meaning.
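One compatibility note that is not part of the original recipes: the code above uses Python 2's tuple parameter unpacking in lambdas (lambda (word, tag): ...), which was removed in Python 3. If you want to try these transforms on Python 3, the predicates just need to index into the tagged-word tuple instead, as in the following hedged sketch; everything else can stay as it is.

# Python 3-friendly predicates: tuple unpacking in function parameters was
# removed in Python 3, so index into the (word, tag) tuple instead.
vb_pred = lambda wt: wt[1].startswith('VB')                                    # any verb
nn_pred = lambda wt: wt[1].startswith('NN')                                    # any noun
pivot_pred = lambda wt: wt[1] != 'VBG' and wt[1].startswith('VB') and len(wt[1]) > 2

# They are passed to first_chunk_index() exactly as before, for example:
# vbidx = first_chunk_index(chunk, vb_pred)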

How to embed Einstein dashboards on Salesforce Classic

Amey Varangaonkar
21 Mar 2018
5 min read
[box type="note" align="" class="" width=""]The following excerpt is taken from the book Learning Einstein Analytics written by Santosh Chitalkar. This book highlights the key techniques and know-how to unlock critical insights from your data using Salesforce Einstein Analytics.[/box]

With Einstein Analytics, users have the power to embed their dashboards in various third-party applications and even in their own web applications. In this article, we will show how to embed an Einstein dashboard on Salesforce Classic.

In order to start embedding the dashboard, let's create a sample dashboard by performing the following steps:

1. Navigate to Analytics Studio | Create | Dashboard.
2. Add three chart widgets on the dashboard. Click on the Chart button in the middle and select the Opportunity dataset. Select Measures as Sum of Amount and select BillingCountry under Group by. Click on Done.
3. Repeat the second step for the second widget, but select Account Source under Group by and make it a donut chart.
4. Repeat the second step for the third widget, but select Stage under Group by and make it a funnel chart.
5. Click on Save and enter Embedding Opportunities in the title field, as shown in the following screenshot:

Now that we have created a dashboard, let's embed this dashboard in Salesforce Classic. In order to start embedding the dashboard, exit from the Einstein Analytics platform and go to Classic mode. The user can embed the dashboard on the record detail page layout in Salesforce Classic. The user can view the dashboard, drill in, and apply a filter, just like in the Einstein Analytics window. Let's add the dashboard to the account detail page by performing the following steps:

1. Navigate to Setup | Customize | Accounts | Page Layouts, as shown in the following screenshot:
2. Click on Edit of Account Layout and it will open a page layout editor, which has two parts: a palette on the upper portion of the screen, and the page layout on the lower portion of the screen. The palette contains the user interface elements that you can add to your page layout, such as Fields, Buttons, Links and Actions, and Related Lists, as shown in the following screenshot:
3. Click on the Wave Analytics Assets option from the palette and you can see all the dashboards in the right-side panel.
4. Drag and drop a section onto the page layout, name it Einstein Dashboard, and click on OK.
5. Drag and drop the dashboard which you wish to add to the record detail page. We are going to add Embedding Opportunities. Click on Save.
6. Go to any account record and you should see the new section with the embedded dashboard:

Users can easily configure the embedded dashboards by using attributes. To access the dashboard properties, go to edit the page layout again, and go to the section where we added the dashboard to the layout. Hover over the dashboard and click on the Tool icon. It will open an Asset Properties window:

The Asset Properties window gives the user the option to change the following features:

Width (in pixels or %): This feature allows you to adjust the width of the dashboard section.
Height (in pixels): This feature allows you to adjust the height of the dashboard section.
Show Title: This feature allows you to display or hide the title of the dashboard.
Show Sharing Icon: By default, the share icon is disabled; this option gives the user the flexibility to include the share icon on the dashboard.
Show Header: This feature allows you to display or hide the header.
Hide on error: This feature gives you control over whether the Analytics asset appears if there is an error.
Field mapping: Last but not least, field mapping is used to filter the data on the dashboard down to what is relevant to the record. To set up the dashboard to show only the data that's relevant to the record being viewed, use field mapping. Field mapping links data fields in the dashboard to the object's fields.

We are using the Embedding Opportunities dashboard, so let's add field mapping to it. The following is the format for field mapping:

{
  "datasets": {
    "datasetName": [{
      "fields": ["Actual Field name from object"],
      "filter": {"operator": "matches", "values": ["$dataset Fieldname"]}
    }]
  }
}

Let's add field mapping for account by using the following format:

{
  "datasets": {
    "Account": [{
      "fields": ["Name"],
      "filter": {"operator": "matches", "values": ["$Name"]}
    }]
  }
}

If your dashboard uses multiple datasets, then you can use the following format:

{
  "datasets": {
    "datasetName1": [{
      "fields": ["Actual Field name from object"],
      "filter": {"operator": "matches", "values": ["$dataset1 Fieldname"]}
    }],
    "datasetName2": [{
      "fields": ["Actual Field name from object"],
      "filter": {"operator": "matches", "values": ["$dataset2 Fieldname"]}
    }]
  }
}

Let's add field mapping for account and opportunities:

{
  "datasets": {
    "Opportunities": [{
      "fields": ["Account.Name"],
      "filter": {"operator": "matches", "values": ["$Name"]}
    }],
    "Account": [{
      "fields": ["Name"],
      "filter": {"operator": "matches", "values": ["$Name"]}
    }]
  }
}

Now that we have added the field mapping, save the page layout and go to an actual record. Observe that the dashboard is now filtered per record, as shown in the following screenshot:

To summarize, we saw that it's fairly easy to embed your custom dashboards in Salesforce. Similarly, you can do so on other platforms such as Lightning, Visualforce pages, and even on your own websites and web applications. If you are keen to learn more, you may check out the book Learning Einstein Analytics.
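One closing practical note that is not in the original excerpt: the field mapping has to be well-formed JSON, and a missing brace or a stray capital letter in a key is easy to overlook when typing it into the Asset Properties window. A quick way to check a mapping before pasting it in is to run it through any JSON parser; the following sketch uses Python's standard json module purely as an illustration.

# Sanity-check a field-mapping snippet before pasting it into the Asset
# Properties window: json.loads() raises an error on malformed JSON.
import json

mapping = '''
{
  "datasets": {
    "Account": [{
      "fields": ["Name"],
      "filter": {"operator": "matches", "values": ["$Name"]}
    }]
  }
}
'''

try:
    json.loads(mapping)
    print("Field mapping is valid JSON")
except ValueError as err:   # json.JSONDecodeError is a subclass of ValueError
    print("Invalid field mapping:", err)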

Elasticsearch – Spicing Up a Search Using Geo

Packt
23 Jul 2015
18 min read
A geo point refers to the latitude and longitude of a point on Earth. Each location on Earth has its own unique latitude and longitude. Elasticsearch is aware of geo-based points and allows you to perform various operations on top of them. In many contexts, a geolocation component is also required to obtain various functionalities. For example, you might need to search for all the nearby restaurants that serve Chinese food, or find the nearest cab that is free. In another situation, you might need to find which state a particular geo point belongs to, in order to understand where you are currently standing.

This article by Vineeth Mohan, author of the book Elasticsearch Blueprints, is modeled such that all the examples mentioned are related to a real-life scenario, restaurant searching, for better understanding. Here, we take the example of sorting restaurants based on geographical preferences. A number of cases, ranging from the simple, such as finding the nearest restaurant, to the more complex, such as categorization of restaurants based on distance, are covered in this article. What makes Elasticsearch unique and powerful is the fact that you can combine a geo operation with any other normal search query to yield results combining both the location data and the query data.

(For more resources related to this topic, see here.)

Restaurant search

Let's consider creating a search portal for restaurants. The following are its requirements:

To find the nearest restaurant with Chinese cuisine, which has the word ChingYang in its name.
To decrease the importance of all restaurants outside city limits.
To find the distance between the restaurant and the current point for each of the preceding restaurant matches.
To find whether the person is within a particular city's limits or not.
To aggregate all restaurants within a distance of 10 km. That is, for a radius of the first 10 km, we have to compute the number of restaurants. For the next 10 km, we need to compute the number of restaurants, and so on.

Data modeling for restaurants

Firstly, we need to look at the aspects of the data and model it as a JSON document for Elasticsearch to make sense of it. A restaurant has a name, its location information, and a rating. To store the location information, Elasticsearch has a provision to understand latitude and longitude information and has features to conduct searches based on it. Hence, it would be best to use this feature. Let's see how we can do this.

First, let's see what our document should look like:

{
  "name" : "Tamarind restaurant",
  "location" : {
      "lat" : 1.10,
      "lon" : 1.54
  }
}

Now, let's define the schema for the same:

curl -X PUT "http://$hostname:9200/restaurants" -d '{
   "index": {
       "number_of_shards": 1,
       "number_of_replicas": 1
   },
   "analysis":{
           "analyzer":{
                   "flat" : {
               "type" : "custom",
               "tokenizer" : "keyword",
               "filter" : "lowercase"
           }
       }
   }
}'

echo
curl -X PUT "http://$hostname:9200/restaurants/restaurant/_mapping" -d '{
   "restaurant" : {
   "properties" : {
       "name" : { "type" : "string" },
       "location" : { "type" : "geo_point", "accuracy" : "1km" }
   }}
}'

Let's now index some documents in this index. An example would be the Tamarind restaurant data shown in the previous section.
We can index the data as follows:

curl -XPOST 'http://localhost:9200/restaurants/restaurant' -d '{
   "name": "Tamarind restaurant",
   "location": {
       "lat": 1.1,
       "lon": 1.54
   }
}'

Likewise, we can index any number of documents. For the sake of convenience, we have indexed only a total of five restaurants for this article. The latitude and longitude should be in this format. Elasticsearch also accepts two other formats (geohash and lat_lon), but let's stick to this one. As we have mapped the field location to the type geo_point, Elasticsearch is aware of what this information means and how to act upon it.

The nearest hotel problem

Let's assume that we are at a particular point where the latitude is 1.234 and the longitude is 2.132. We need to find the nearest restaurants to this location. For this purpose, the function_score query is the best option. We can use the decay (Gauss) functionality of the function_score query to achieve this:

curl -XPOST 'http://localhost:9200/restaurants/_search' -d '{
"query": {
   "function_score": {
     "functions": [
       {
         "gauss": {
           "location": {
             "scale": "1km",
             "origin": [
               1.231,
               1.012
             ]
           }
         }
       }
     ]
   }
}
}'

Here, we tell Elasticsearch to give a higher score to the restaurants that are near the reference point we gave it. The closer a restaurant is, the higher its importance.

Maximum distance covered

Now, let's move on to another example: finding restaurants that are within 10 km of my current position. Those that are beyond 10 km are of no interest to me. So, this essentially forms a circle with a radius of 10 km around my current position, as shown in the following map:

Our best bet here is to use a geo distance filter. It can be used as follows:

curl -XPOST 'http://localhost:9200/restaurants/_search' -d '{
"query": {
   "filtered": {
     "filter": {
       "geo_distance": {
         "distance": "10km",
         "location": {
           "lat": 1.232,
           "lon": 1.112
         }
       }
     }
   }
}
}'

Inside city limits

Next, I need to consider only those restaurants that are inside a particular city's limits; the rest are of no interest to me. As the city shown in the following map is rectangular in nature, this makes my job easier:

Now, to see whether a geo point is inside a rectangle, we can use the bounding box filter. A rectangle is defined by feeding in its top-left point and bottom-right point. Let's assume that the city lies within the following rectangle, with the top-left point as X and Y and the bottom-right point as A and B:

curl -XPOST 'http://localhost:9200/restaurants/_search' -d '{
"query": {
   "filtered": {
     "query": {
       "match_all": {}
     },
     "filter": {
       "geo_bounding_box": {
         "location": {
           "top_left": {
             "lat": 2,
             "lon": 0
           },
           "bottom_right": {
             "lat": 0,
             "lon": 2
           }
         }
       }
     }
   }
}
}'

Distance values between the current point and each restaurant

Now, consider the scenario where you need to find the distance between the user's location and each restaurant. How can we achieve this requirement? We can use scripts: the current geo coordinates are passed to the script, and the query then computes the distance to each matching restaurant, as in the following code.
Here, the current location is given as (1, 2):

curl -XPOST 'http://localhost:9200/restaurants/_search?pretty' -d '{
"script_fields": {
   "distance": {
     "script": "doc['"'"'location'"'"'].arcDistanceInKm(1, 2)"
   }
},
"fields": [
   "name"
],
"query": {
   "match": {
     "name": "chinese"
   }
}
}'

We have used the function called arcDistanceInKm in the preceding query, which accepts the geo coordinates and then returns the distance between that point and the locations matched by the query. Note that the unit of the calculated distance is kilometers (km). You might have noticed a long list of quotes and double quotes before and after location in the script mentioned previously. This is the standard format and, if we don't use it, a format error is returned while processing. The distances are calculated from the current point to the filtered hotels and are returned in the distance field of the response, as shown in the following code:

{
"took" : 3,
"timed_out" : false,
"_shards" : {
   "total" : 1,
   "successful" : 1,
   "failed" : 0
},
"hits" : {
   "total" : 2,
   "max_score" : 0.7554128,
   "hits" : [ {
     "_index" : "restaurants",
     "_type" : "restaurant",
     "_id" : "AU08uZX6QQuJvMORdWRK",
     "_score" : 0.7554128,
     "fields" : {
       "distance" : [ 112.92927483176413 ],
       "name" : [ "Great chinese restaurant" ]
     }
   }, {
     "_index" : "restaurants",
     "_type" : "restaurant",
     "_id" : "AU08uZaZQQuJvMORdWRM",
     "_score" : 0.7554128,
     "fields" : {
       "distance" : [ 137.61635969665923 ],
       "name" : [ "Great chinese restaurant" ]
     }
   } ]
}
}

Note that the distances measured from the current point to the hotels are direct distances and not road distances.

Restaurant out of city limits

One of my friends called me and asked me to join him on his journey to the next city. As we were leaving the city, he was particular that he wanted to eat at a restaurant outside our city's limits, but not in the next city. The requirement was translated to any restaurant that is a minimum of 15 km and a maximum of 100 km from the center of the city. Hence, we have something like a donut in which we have to conduct our search, as shown in the following map:

The area inside the donut is a match, but the area outside is not. For this donut area calculation, we have the geo_distance_range filter to our rescue. Here, we can apply the minimum and maximum distances in the from and to fields to populate the results, as shown in the following code:

curl -XPOST 'http://localhost:9200/restaurants/_search' -d '{
"query": {
   "filtered": {
     "query": {
       "match_all": {}
     },
     "filter": {
       "geo_distance_range": {
         "from": "15km",
         "to": "100km",
         "location": {
           "lat": 1.232,
           "lon": 1.112
         }
       }
     }
   }
}
}'

Restaurant categorization based on distance

In an e-commerce solution for searching restaurants, it's often necessary to increase the searchable characteristics of the application. This means that if we are able to give a snapshot of results beyond the top-10 results, it adds to the searchable characteristics of the search. For example, if we are able to show how many restaurants serve Indian, Thai, or other cuisines, it would actually help the user to get a better idea of the result set.
In a similar manner, if we can tell them whether a restaurant is nearby, at a medium distance, or far away, we can really strike a chord in the restaurant search user experience, as shown in the following map:

Implementing this is not hard, as we have something called the distance range aggregation. In this aggregation type, we can handcraft the distance ranges we are interested in and create a bucket for each of them. We can also define the key name we need, as shown in the following code:

curl -XPOST 'http://localhost:9200/restaurants/_search' -d '{
"aggs": {
   "distanceRanges": {
     "geo_distance": {
       "field": "location",
       "origin": "1.231, 1.012",
       "unit": "meters",
       "ranges": [
         {
           "key": "Near by Locations",
           "to": 200
         },
         {
           "key": "Medium distance Locations",
           "from": 200,
           "to": 2000
         },
         {
           "key": "Far Away Locations",
           "from": 2000
         }
       ]
     }
   }
}
}'

In the preceding code, we categorized the restaurants under three distance ranges: the nearby hotels (less than 200 meters), the medium-distance hotels (from 200 meters to 2,000 meters), and the far away ones (greater than 2,000 meters). This logic was translated into the preceding Elasticsearch query, from which we received the following results:

{
"took": 44,
"timed_out": false,
"_shards": {
   "total": 1,
   "successful": 1,
   "failed": 0
},
"hits": {
   "total": 5,
   "max_score": 0,
   "hits": [
        ]
},
"aggregations": {
   "distanceRanges": {
     "buckets": [
       {
         "key": "Near by Locations",
         "from": 0,
         "to": 200,
         "doc_count": 1
       },
       {
         "key": "Medium distance Locations",
         "from": 200,
         "to": 2000,
         "doc_count": 0
       },
       {
         "key": "Far Away Locations",
         "from": 2000,
         "doc_count": 4
       }
     ]
   }
}
}

In the results, we can see how many restaurants there are in each distance range, indicated by the doc_count field.

Aggregating restaurants based on their nearness

In the previous example, we saw the aggregation of restaurants, based on their distance from the current point, into three different categories. Now, we can consider another scenario in which we classify the restaurants on the basis of the geohash grids that they belong to. This kind of classification can be advantageous if the user would like to get a geographical picture of how the restaurants are distributed. Here is the code for a geohash-based aggregation of restaurants:

curl -XPOST 'http://localhost:9200/restaurants/_search?pretty' -d '{
"size": 0,
"aggs": {
   "DifferentGrids": {
     "geohash_grid": {
       "field": "location",
       "precision": 6
     },
     "aggs": {
       "restaurants": {
         "top_hits": {}
       }
     }
   }
}
}'

You can see from the preceding code that we used the geohash aggregation, which is named DifferentGrids, and that the precision here is set to 6. The precision field value can be varied within the range of 1 to 12, with 1 being the lowest and 12 the highest precision. Also, we used another aggregation named restaurants inside the DifferentGrids aggregation. The restaurants aggregation uses the top_hits query to fetch the aggregated details from the DifferentGrids aggregation, which would otherwise return only the key and doc_count values.
So, running the preceding code gives us the following result:

{
   "took":5,
   "timed_out":false,
   "_shards":{
     "total":1,
     "successful":1,
     "failed":0
   },
   "hits":{
     "total":5,
     "max_score":0,
     "hits":[
       ]
   },
   "aggregations":{
     "DifferentGrids":{
         "buckets":[
           {
              "key":"s009",
              "doc_count":2,
              "restaurants":{... }
           },
           {
              "key":"s01n",
              "doc_count":1,
              "restaurants":{... }
           },
           {
              "key":"s00x",
              "doc_count":1,
              "restaurants":{... }
           },
           {
              "key":"s00p",
              "doc_count":1,
              "restaurants":{... }
           }
         ]
     }
   }
}

As we can see from the response, there are four buckets with the key values s009, s01n, s00x, and s00p. These key values represent the different geohash grids that the restaurants belong to. From the preceding result, we can evidently say that the s009 grid contains two restaurants inside it and all the other grids contain one each. A pictorial representation of the previous aggregation would be like the one shown on the following map:

Summary

We found that Elasticsearch can handle geo points and various geo-specific operations. A few geo-specific and geo point operations that we covered in this article were searching for nearby restaurants (restaurants inside a circle), searching for restaurants within a distance range (restaurants inside a donut-shaped area), searching for restaurants inside a city (restaurants inside a rectangle), searching for restaurants inside a polygon, and categorization of restaurants by proximity. Apart from these, we can use Kibana, a flexible and powerful visualization tool provided by Elasticsearch, for geo-based operations.

Resources for Article:

Further resources on this subject:
Elasticsearch Administration [article]
Extending ElasticSearch with Scripting [article]
Indexing the Data [article]