
How-To Tutorials - Data

The Splunk Interface

Packt
10 Aug 2015
17 min read
In this article by Vincent Bumgarner and James D. Miller, authors of the book Implementing Splunk - Second Edition, we will walk through the most common elements in the Splunk interface and touch upon concepts that will be covered in greater detail later. You may want to dive right into the search section, but an overview of the user interface elements might save you some frustration later. We will cover the following topics:

Logging in and app selection
A detailed explanation of the search interface widgets
A quick overview of the admin interface

Logging into Splunk

The Splunk GUI (Splunk is also accessible through its command-line interface (CLI) and REST API) is web-based, which means that no client needs to be installed. Newer browsers with fast JavaScript engines, such as Chrome, Firefox, and Safari, work better with the interface. As of Splunk Version 6.2.0, no browser extensions are required. Splunk Versions 4.2 and earlier require Flash to render graphs. Flash can still be used by older browsers, or for older apps that reference Flash explicitly.

The default port for a Splunk installation is 8000. The address will look like http://mysplunkserver:8000 or http://mysplunkserver.mycompany.com:8000. If you have installed Splunk on your local machine, the address can be some variant of http://localhost:8000, http://127.0.0.1:8000, http://machinename:8000, or http://machinename.local:8000.

Once you determine the address, the first page you will see is the login screen. The default username is admin with the password changeme. The first time you log in, you will be prompted to change the password for the admin user. It is a good idea to change this password to prevent unwanted changes to your deployment.

By default, accounts are configured and stored within Splunk. Authentication can be configured to use another system, for instance Lightweight Directory Access Protocol (LDAP). By default, Splunk authenticates locally. If LDAP is set up, the order is as follows: LDAP / Local.
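Because the REST API mentioned above is always available alongside the web interface, the same login can also be performed from a script. The following minimal sketch assumes the default management port 8089 (separate from the web port 8000 described here), a hypothetical hostname, changed admin credentials, and the Python requests library; it is an illustration based on Splunk's documented login endpoint, not an example from the book.

# Minimal sketch: authenticate against Splunk's REST API and keep the session key.
# Assumptions: management port 8089, hypothetical hostname, placeholder password.
import requests
import xml.etree.ElementTree as ET

SPLUNK_MGMT = "https://mysplunkserver:8089"   # hypothetical host

resp = requests.post(
    f"{SPLUNK_MGMT}/services/auth/login",
    data={"username": "admin", "password": "your_new_password"},
    verify=False,  # self-signed certificate on a test instance
)
resp.raise_for_status()

# The response is XML containing a <sessionKey> element
session_key = ET.fromstring(resp.text).findtext("sessionKey")
headers = {"Authorization": f"Splunk {session_key}"}
print("Authenticated, session key obtained")

The returned session key is then passed in an Authorization header of the form Splunk <key> on subsequent REST calls, as the later search example shows.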
The home app

After logging in, the default app is the Launcher app (some may refer to this as Home). This app is a launching pad for apps and tutorials. In earlier versions of Splunk, the Welcome tab provided two important shortcuts: Add data and the Launch search app. In version 6.2.0, the Home app is divided into distinct areas, or panes, that provide easy access to Explore Splunk Enterprise (Add Data, Splunk Apps, Splunk Docs, and Splunk Answers) as well as Apps (the App management page), Search & Reporting (the link to the Search app), and an area where you can set your default dashboard (Choose a home dashboard).

The Explore Splunk Enterprise pane shows links to:

Add Data: This links to the Add Data to Splunk page. This interface is a great start for getting local data flowing into Splunk (making it available to Splunk users). The Preview data interface takes an enormous amount of complexity out of configuring dates and line breaking.
Splunk Apps: This allows you to find and install more apps from the Splunk Apps Marketplace (http://apps.splunk.com). This marketplace is a useful resource where Splunk users and employees post Splunk apps, mostly free but some premium ones as well.
Splunk Answers: This is one of your links to the wealth of Splunk documentation available, specifically http://answers.splunk.com, where you can engage with the Splunk community on Splunkbase (https://splunkbase.splunk.com/) and learn how to get the most out of your Splunk deployment.

The Apps section shows the apps that have GUI elements on your instance of Splunk. App is an overloaded term in Splunk. An app doesn't necessarily have a GUI at all; it is simply a collection of configurations wrapped into a directory structure that means something to Splunk.

Search & Reporting is the link to the Splunk Search & Reporting app. Beneath the Search & Reporting link, Splunk provides an outline which, when you hover over it, displays a Find More Apps balloon tip. Clicking on the link opens the same Browse more apps page as the Splunk Apps link mentioned earlier.

Choose a home dashboard provides an intuitive way to select an existing (simple XML) dashboard and set it as part of your Splunk Welcome or Home page. This gives you a familiar starting point each time you enter Splunk. The following image displays the Choose Default Dashboard dialog. Once you select an existing dashboard from the dropdown list, it will be part of your welcome screen every time you log into Splunk, until you change it. There are no dashboards installed by default beyond those provided by the Search & Reporting app; once you have created additional dashboards, they can be selected as the default.

The top bar

The bar across the top of the window contains information about where you are, as well as quick links to preferences, other apps, and administration. The current app is specified in the upper-left corner. The following image shows the upper-left Splunk bar when using the Search & Reporting app. Clicking on the text takes you to the default page for that app. In most apps, the text next to the logo is simply changed, but the whole block can be customized with logos and alternate text by modifying the app's CSS.

The upper-right corner of the window contains action links that are almost always available:

The name of the user who is currently logged in appears first. In this case, the user is Administrator. Clicking on the username allows you to select Edit Account (which will take you to the Your account page) or Logout. Logout ends the session and forces the user to log in again.

The following screenshot shows what the Your account page looks like. This form presents the global preferences that a user is allowed to change. Other settings that affect users are configured through permissions on objects and settings on roles. (Note: preferences can also be configured using the CLI or by modifying specific Splunk configuration files.)

Full name and Email address are stored for the administrator's convenience.
Time zone can be changed for the logged-in user. This is a new feature in Splunk 4.3. Setting the time zone only affects the time zone used to display the data. It is very important that the date is parsed properly when events are indexed.
Default app controls the starting page after login. Most users will want to change this to search.
Restart backgrounded jobs controls whether unfinished queries should run again if Splunk is restarted.
Set password allows you to change your password. This is only relevant if Splunk is configured to use internal authentication.
For instance, if the system is configured to use Windows Active Directory via LDAP (a very common configuration), users must change their password in Windows.

Messages allows you to view any system-level error messages you may have pending. When there is a new message for you to review, a notification displays as a count next to the Messages menu. You can click the X to remove a message.

The Settings link presents the user with the configuration pages for all Splunk Knowledge objects, Distributed Environment settings, System and Licensing, Data, and Users and Authentication settings. If you do not see some of these options, you do not have the permissions to view or edit them.

The Activity menu lists shortcuts to Splunk Jobs, Triggered Alerts, and System Activity views. You can click Jobs (to open the search jobs manager window, where you can view and manage currently running searches), click Triggered Alerts (to view scheduled alerts that are triggered), or click System Activity (to see dashboards about user activity and the status of the system).

Help lists links to video Tutorials, Splunk Answers, the Splunk Contact Support portal, and online Documentation.

Find can be used to search for objects within your Splunk Enterprise instance. For example, if you type in error, it returns the saved objects that contain the term error. These saved objects include Reports, Dashboards, Alerts, and so on. You can also search for error in the Search & Reporting app by clicking Open error in search.

The search & reporting app

The Search & Reporting app (or just the search app) is where most actions in Splunk start. This app is a dashboard where you will begin your searching.

The summary view

Within the Search & Reporting app, the user is presented with the Summary view, which contains information about the data that that user searches by default. This is an important distinction—in a mature Splunk installation, not all users will always search all data by default. On your first trip into Search & Reporting, you can access the Splunk documentation related to What to Search and How to Search. Once you have at least some data indexed, Splunk will provide some statistics on the available data under What to Search (remember that this reflects only the indexes that this particular user searches by default; there are other events that are indexed by Splunk, including events that Splunk indexes about itself).

In previous versions of Splunk, panels such as the All indexed data panel provided statistics for a user's indexed data. Other panels gave a breakdown of data using three important pieces of metadata—Source, Sourcetype, and Hosts. In the current version—6.2.0—you access this information by clicking on the button labeled Data Summary, which presents a dialog that splits the information into three tabs—Hosts, Sources, and Sourcetypes.

A host is a captured hostname for an event. In the majority of cases, the host field is set to the name of the machine where the data originated. There are cases where this is not known, so the host can also be configured arbitrarily.

A source in Splunk is a unique path or name. In a large installation, there may be thousands of machines submitting data, but all data on the same path across these machines counts as one source.
When the data source is not a file, the value of the source can be arbitrary, for instance, the name of a script or network port.

A source type is an arbitrary categorization of events. There may be many sources, across many hosts, in the same source type. For instance, given the sources /var/log/access.2012-03-01.log and /var/log/access.2012-03-02.log on the hosts fred and wilma, you could reference all these logs with source type access or any other name that you like.

Let's move on now and discuss each of the Splunk widgets (just below the app name). The first widget is the navigation bar. As a general rule, within Splunk, items with downward triangles are menus; items without a downward triangle are links. Next we find the Search bar. This is where the magic starts. We'll go into great detail shortly.

Search

Okay, we've finally made it to search. This is where the real power of Splunk lies. For our first search, we will search for the word error (searches are not case sensitive). Click in the search bar, type the word error, and then either press Enter or click on the magnifying glass to the right of the bar. Upon initiating the search, we are taken to the search results page. Note that the search we just executed was across All time (the default); to change the search time, you can use the Splunk time picker.

Actions

Let's inspect the elements on this page. Below the Search bar, we have the event count, action icons, and menus. Starting from the left, we have the following:

The number of events matched by the base search. Technically, this may not be the number of results pulled from disk, depending on your search. Also, if your query uses commands, this number may not match what is shown in the event listing.
Job: This opens the Search job inspector window, which provides very detailed information about the query that was run.
Pause: This causes the current search to stop locating events but keeps the job open. This is useful if you want to inspect the current results to determine whether you want to continue a long-running search.
Stop: This stops the execution of the current search but keeps the results generated so far. This is useful when you have found enough and want to inspect or share the results found so far.
Share: This shares the search job. This option extends the job's lifetime to seven days and sets the read permissions to everyone.
Export: This exports the results. Select this option to output to CSV, raw events, XML, or JavaScript Object Notation (JSON) and specify the number of results to export (a scripted equivalent is sketched after this list).
Print: This formats the page for printing and instructs the browser to print.
Smart Mode: This controls the search experience. You can set it to speed up searches by cutting down on the event data it returns and, additionally, by reducing the number of fields that Splunk will extract by default from the data (Fast mode). You can, otherwise, set it to return as much event information as possible (Verbose mode). In Smart mode (the default setting), it toggles search behavior based on the type of search you're running.
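The same first search, and the Export action, can be reproduced from a script against the REST API. The sketch below reuses the SPLUNK_MGMT address and the headers built in the earlier login example and assumes the standard search jobs endpoint in one-shot mode; treat it as an illustration rather than the book's material.

# Sketch: run the "error" search in one-shot mode and fetch results as JSON,
# roughly what the Export action does from the UI. Assumes SPLUNK_MGMT and
# headers from the login example above.
import requests

resp = requests.post(
    f"{SPLUNK_MGMT}/services/search/jobs",
    headers=headers,
    params={"output_mode": "json"},
    data={
        "search": "search error | head 10",  # SPL needs the leading 'search' keyword here
        "exec_mode": "oneshot",              # run synchronously, return results directly
        "earliest_time": "-24h",             # narrower than the UI default of All time
    },
    verify=False,
)
resp.raise_for_status()
for result in resp.json().get("results", []):
    print(result.get("_time"), result.get("_raw", "")[:120])

Swapping the search string for a transforming search such as search error | stats count by sourcetype is what would populate the Statistics tab described in the next section.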
Timeline

Now we'll skip to the timeline below the action icons. Along with providing a quick overview of the event distribution over a period of time, the timeline is also a very useful tool for selecting sections of time. Placing the pointer over the timeline displays a pop-up for the number of events in that slice of time. Clicking on the timeline selects the events for a particular slice of time. Clicking and dragging selects a range of time. Once you have selected a period of time, clicking on Zoom to selection changes the time frame and reruns the search for that specific slice of time. Repeating this process is an effective way to drill down to specific events. Deselect shows all events for the time range selected in the time picker. Zoom out changes the window of time to a larger period around the events in the current time frame.

The field picker

To the left of the search results, we find the field picker. This is a great tool for discovering patterns and filtering search results.

Fields

The field list contains two lists:

Selected Fields, which have their values displayed under the search event in the search results
Interesting Fields, which are other fields that Splunk has picked out for you

Above the field list are two links: Hide Fields and All Fields. Hide Fields hides the field list area from view; All Fields takes you to the Selected Fields window.

Search results

We are almost through with all the widgets on the page. We still have a number of items to cover in the search results section though, just to be thorough. As you can see in the previous screenshot, at the top of this section, we have the number of events displayed. When viewing all results in their raw form, this number will match the number above the timeline. This value can be changed either by making a selection on the timeline or by using other search commands. Next, we have the action icons (described earlier) that affect these particular results. Under the action icons, we have four results tabs:

Events list, which will show the raw events. This is the default view when running a simple search, as we have done so far.
Patterns, which streamlines event pattern detection. It displays a list of the most common patterns among the set of events returned by your search. Each of these patterns represents the number of events that share a similar structure.
Statistics, which populates when you run a search with transforming commands such as stats, top, chart, and so on. The previous keyword search for error does not display any results in this tab because it does not have any transforming commands.
Visualization, which is populated by transforming searches. The results area of the Visualization tab includes a chart and the statistics table used to generate the chart. Not all searches are eligible for visualization.

Under the tabs just described is the timeline.

Options

Beneath the timeline (starting at the left) is a row of option links that include:

Show Fields: shows the Selected Fields screen
List: allows you to select an output option (Raw, List, or Table) for displaying the search results
Format: provides the ability to set Result display options, such as Show row numbers, Wrap results, the Max lines (to display), and Drilldown as on or off
NN Per Page: is where you can indicate the number of results to show per page (10, 20, or 50)

To the right are options that you can use to choose a page of results, and to change the number of events per page. In prior versions of Splunk, these options were available from the Results display options popup dialog.

The events viewer

Finally, we make it to the actual events. Let's examine a single event. Starting at the left, we have:

Event Details: Clicking here (indicated by the right-facing arrow) opens the selected event, providing specific information about the event by type, field, and value, and allows you to perform specific actions on a particular event field.
In addition, Splunk version 6.2.0 offers a button labeled Event Actions to access workflow actions, a few of which are always available:

Build Eventtype: Event types are a way to name events that match a certain query.
Extract Fields: This launches an interface for creating custom field extractions.
Show Source: This pops up a window with a simulated view of the original source.

The event number: Raw search results are always returned in the order most recent first.

Next to appear are any workflow actions that have been configured. Workflow actions let you create new searches or links to other sites, using data from an event.

Next comes the parsed date from this event, displayed in the time zone selected by the user. This is an important and often confusing distinction. In most installations, everything is in one time zone—the servers, the user, and the events. When one of these three things is not in the same time zone as the others, things can get confusing.

Next, we see the raw event itself. This is what Splunk saw as an event. With no help, Splunk can do a good job finding the date and breaking lines appropriately, but as we will see later, with a little help, event parsing can be more reliable and more efficient.

Below the event are the fields that were selected in the field picker. Clicking on a value adds that field value to the search.

Summary

As you have seen, the Splunk GUI provides a rich interface for working with search results. We have really only scratched the surface and will cover more elements later.


Oracle BI Publisher 11g: Learning the new XPT format

Packt
31 Oct 2011
4 min read
We will cover the following topics in this article:

The Layout Editor presentation
Designing a Layout
Export options

The Layout Editor

First, you have to choose a predefined layout from the Create Report interface. This interface displays a list of predefined layouts. You can add your own predefined layouts to this list and make them available for your later use, or even for all users. After choosing a layout from the Basic Templates or the Shared Templates group, the Layout Editor interface is displayed.

Designing a Layout

In the Layout Editor interface, you have tools to perform activities such as:

Insert a component: Select the desired component from the Components pane on the left or from the Insert tab of the toolbar and drag-and-drop it into the design area.
Set component properties: Set the component properties from the Properties pane on the left or from the component-specific tab of the toolbar (only for the most commonly used components).
Insert a data element: Drag the element from the Data Source pane to the design area. Precise dropping areas are marked in the design view; for example, in a chart you have the marked areas Drop Value Here, Drop Label Here, and Drop Series Here.
Set page layout options: Use the Page Layout tab and the Properties pane.
Save the Layout: Use the activity icons from the toolbar on the right side.

In the following sections, a few elements will be inserted to complete our report design. You will also see the steps that you need to follow when inserting and setting the properties of these components.

Text elements

In order to change the settings of Text elements, follow the steps given here:

Click on the Insert tab and choose the Text Item component from the toolbar.
Click on the Text tab and set a font color for your text using the Font Color icon on the toolbar.
Set the text margins using the Properties pane.

In this way, we obtain the desired report title. In order to insert data elements into our report's components, we will use a Data Model that includes the LOAN_DATE, LOAN_PERIOD, PRICE, TITLE, and YEAR fields.

Charts

In order to create charts, follow the steps given here:

Click on the Insert tab and choose the Chart component from the toolbar.
Select the newly inserted chart and go to the Chart tab on the toolbar to set the chart type (Vertical Bar in this example) and the chart style (Project in this example).
Drag the LOAN_PERIOD and PRICE fields from the Data Source (in the left pane) over the Drop Value Here area of the design view.
Drop the TITLE field from the Data Source over the Drop Label Here area.

Data tables

In order to create data tables, follow the steps given here:

Click on the Insert tab and choose the Data Table component from the toolbar.
Drag the fields LOAN_DATE, LOAN_PERIOD, TITLE, and YEAR from the Data Source over the area marked as Drop a Data Item Here.
Select the LOAN_DATE column and, in the Properties pane, set the Formatting Mask to yyyy-mm-dd.
For each column of the table, enter a suitable value for Width in the Properties pane. For example, the first column has a width of 1.00 inch.


Overview of FIM 2010 R2

Packt
03 Sep 2012
18 min read
The following picture shows a high-level overview of the FIM family and the components relevant to an FIM 2010 R2 implementation. Within the FIM family, there are some parts that can live by themselves and others that depend on other parts. But, in order to fully utilize the power of FIM 2010 R2, you should have all parts in place. At the center, we have FIM Service and FIM Synchronization Service (FIM Sync). The key to a successful implementation of FIM 2010 R2 is to understand how these two components work—by themselves as well as together.

The history of FIM 2010 R2

Let us go through a short summary of the versions preceding FIM 2010 R2. In 1999, Microsoft bought a company called Zoomit. They had a product called VIA—a directory synchronization product. Microsoft incorporated Zoomit VIA into Microsoft Metadirectory Services (MMS). MMS was only available as a Microsoft Consulting Services solution. In 2003, Microsoft released Microsoft Identity Integration Server (MIIS), and this was the first publicly available version of the synchronization engine today known as FIM 2010 R2 Synchronization Service. In 2005, Microsoft bought a company called Alacris. They had a product called IdNexus, which was used to manage certificates and smart cards. Microsoft renamed it Certificate Lifecycle Manager (CLM). In 2007, Microsoft took MIIS (now with Service Pack 2) and CLM and slammed them together into a new product called Identity Lifecycle Manager 2007 (ILM 2007). Despite the name, ILM 2007 was basically a directory synchronization tool with a certificate management side-kicker. Finally, in 2010, Microsoft released Forefront Identity Manager 2010 (FIM 2010).

FIM 2010 was a whole new thing, but as we will see, the old parts from MIIS and CLM are still there. The most fundamental change in FIM 2010 was the addition of the FIM Service component. The most important news was that FIM Service added workflow capability to the synchronization engine. Many identity management operations that used to require a lot of coding were suddenly available without a single line of code. In FIM 2010 R2, Microsoft added the FIM Reporting component and also made significant improvements to the other components.

FIM Synchronization Service (FIM Sync)

FIM Synchronization Service is the oldest member of the FIM family. Anyone who has worked with MIIS back in 2003 will feel quite at home with it. Visually, the management tools look the same. FIM Synchronization Service can actually work by itself, without any other component of FIM 2010 R2 being present. We will then basically get the same functionality as MIIS had back in 2003. FIM Synchronization Service is the heart of FIM, which pumps the data around, causing information about identities to flow from one system to another.

Let's look at the pieces that make up the FIM Synchronization Service. As we can see, there are lots of acronyms and concepts that need a little explaining. On the right-hand side of FIM Synchronization Service, we have the Metaverse (MV). The Metaverse is used to collect all the information about all the identities managed by FIM. On the other side, we have the Connected Data Source (CDS). A Connected Data Source is the database, directory, or file, among others, that the synchronization service imports information regarding the managed identities from, and/or exports this information to. To talk to different kinds of Connected Data Sources, FIM Synchronization Service uses adapters that are called Management Agents (MA).
In FIM 2010 R2, we will start to use the term Connectors instead. But, as the user interface in FIM Synchronization Manager still uses the term Management Agent, that is the term used here as well.

The Management Agent stores a representation of the objects in the CDS in its Connector Space (CS). When stored in the Connector Space, we refer to the objects as holograms. If we were to look into this a little deeper, we would find that the holograms (objects) are actually stored in multiple instances so that the Management Agent can keep track of the changes to the objects in the Connector Space. In order to synchronize information from/to different Connected Data Sources, we connect the objects in the Connector Space with the corresponding object in the Metaverse. By collecting information from all Connected Data Sources, the synchronization engine aggregates the information about the object from all the Connected Data Sources into the Metaverse object. This way, the Metaverse will only contain one representation of the object (for example, a user).

To describe the data flow within the synchronization service, let's look at the previous diagram and follow a typical scenario. The scenario is this—we want information in our Human Resource (HR) system to govern how users appear in Active Directory (AD) and in our e-mail system.

Import users from HR: The bottom CDS could be our HR system. We configure a Management Agent to import users from HR to the corresponding CS.
Projection to Metaverse: As there is no corresponding user in the MV that we can connect to, we tell the MA to create a new object in the MV. The process of creating new objects in the MV is called Projection. To transfer information from the HR CS to the MV, we configure Inbound Synchronization Rules.
Import and join users from AD: The middle CDS could be Active Directory (AD). We configure a Management Agent to import users from AD. Because there are objects in the MV, we can now tell the Management Agent to try to match the user objects from AD to the objects in the MV. Connecting existing objects in a Connector Space to an existing object in the Metaverse is called Joining. In order for the synchronization service to know which objects to connect, some kind of unique information must be present, to get a one-to-one mapping between the object in the CS and the object in the Metaverse.
Synchronize information from HR to AD: Once the Metaverse object has a connector to both the HR CS and the AD CS, we can move information from the HR CS to the AD CS. We can, for example, use the employee status information in the HR system to modify the userAccountControl attribute of the AD account. In order to modify the AD CS object, we configure an Outbound Synchronization Rule that will tell the synchronization service how to update the CS object based on the information in the MV object. Synchronizing, however, does not modify the user object in AD; it only modifies the hologram representation of the user in the AD Connector Space.
Export information to AD: In order to actually change any information in a Connected Data Source, we need to tell the MA to export the changes. During export, the MA updates the objects in the CDS with the changes it has made to the hologram in the Connector Space.
Provision users to the e-mail system: The top CDS could be our e-mail system. As users are not present in this system, we would like the synchronization service to create new objects in the CS for the e-mail system. The process of creating new objects in a Connector Space is called Provisioning.

Projection, Joining, and Provisioning all create a connector between the Metaverse object and the Connector Space object, making it possible to synchronize identity information between different Connected Data Sources. A key concept to understand here is that we do not configure synchronization between Connected Data Sources or between Connector Spaces. We synchronize between each Connector Space and the Metaverse. Looking at the previous example, we can see that when information flows from HR to AD, we configure the following (a toy sketch of this flow appears after the list):

The HR MA to import data to the HR CS
Inbound synchronization from the HR CS to the MV
Outbound synchronization from the MV to the AD CS
The AD MA to export the data to AD
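The following is a purely conceptual toy sketch of that flow. It is not FIM code and uses no FIM API; plain Python dictionaries stand in for the HR and AD connector spaces and the Metaverse, just to make projection, joining, and provisioning concrete.

# Toy illustration only -- not FIM. Dictionaries model the connector spaces
# (CS) and the Metaverse (MV); the attribute names and values are made up.
hr_cs = {"1001": {"employeeID": "1001", "name": "Ann", "status": "Active"}}
ad_cs = {"CN=Ann": {"employeeID": "1001", "userAccountControl": 512}}
metaverse = {}

# Projection: HR objects create new Metaverse objects (inbound sync from HR).
for obj in hr_cs.values():
    metaverse[obj["employeeID"]] = dict(obj)

# Joining: AD objects are matched to existing Metaverse objects on a unique key.
connectors = {dn: obj["employeeID"] for dn, obj in ad_cs.items()
              if obj["employeeID"] in metaverse}

# Outbound sync: Metaverse state drives the AD connector space (the real export
# to AD would happen afterwards).
for dn, mv_id in connectors.items():
    if metaverse[mv_id]["status"] == "Fired":
        ad_cs[dn]["userAccountControl"] = 514   # disabled-account flag

# Provisioning: create brand-new objects in the e-mail system's connector space.
mail_cs = {mv_id: {"mail": f"{mv['name'].lower()}@example.com"}
           for mv_id, mv in metaverse.items()}
print(connectors, ad_cs, mail_cs)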
Management Agents

Management Agents, or Connectors as some people call them, are the entities that enable FIM to talk to different kinds of data sources. Basically, we can say that FIM can talk to any type of data source, but it only has built-in Management Agents for some. If the data source is really old, we might even have to use the extensibility platform and write our own Management Agent, or buy a Management Agent from a third-party supplier. At http://aka.ms/FIMPartnerMA, we can find a list of Management Agents supplied by Microsoft Partners. For a complete list of Management Agents built in and available from Microsoft, please look at http://aka.ms/FIMMA.

With R2, a new Management Agent for Extensible Connectivity 2.0 (ECMA 2.0) is released, introducing new ways of making custom Management Agents. We will see updated versions of most third-party Management Agents as soon as they are migrated to the new ECMA 2.0 platform. Microsoft will also ship new Management Agents using the new ECMA 2.0 platform. Writing our own MA is one way of solving problems communicating with odd data sources, but there might be other solutions to the problem that require less coding.

Non-declarative vs. declarative synchronization

If you are using FIM Synchronization Service the old way, like we did in MIIS or ILM 2007, it is called non-declarative synchronization. We usually call that classic synchronization and will also use that term in this article. If we use the FIM Service logic to control it all, it is called declarative synchronization. As classic synchronization usually involves writing code and declarative does not, we will also find references calling declarative synchronization codeless. In fact, it was quite possible, in some scenarios, to have codeless synchronization—even in the old MIIS or ILM 2007—using classic synchronization. The fact also remains that there are very few FIM 2010 R2 implementations that are indeed code free. In some cases you might even mix the two. This could be due either to migration from MIIS/ILM 2007 to FIM 2010 R2 or to the decision that it is cheaper/quicker/easier to solve a particular problem using classic synchronization.

Password synchronization

This should be the last resort to achieve some kind of Single Sign-On (SSO). Instead of implementing password synchronization, we try to make our customers look at other ways, such as Kerberos or Federation, to get SSO. There are, however, many cases where password synchronization is the best option to maintain passwords in different systems. Not all environments can utilize Kerberos or Federation and therefore need the FIM password synchronization feature to maintain passwords in different Connected Data Sources.
This feature typically uses Active Directory as the source of password changes, either by installing and configuring Password Change Notification Service (PCNS) on the Domain Controllers or by using FIM Service as the source for the password change. FIM Synchronization Service then updates the password on the connected object in the Connected Data Sources that are configured as password synchronization targets. In order for FIM to set the password in a target system, the Management Agent used to connect to that specific CDS needs to support this. Most Management Agents available today support password management or can be configured to do so.

FIM Service Management Agent

A very special Management Agent is the one connecting FIM Synchronization Service to FIM Service. Many of the rules we apply to other types of Management Agents do not apply to this one. If you have experience working with classic synchronization in MIIS or ILM 2007, you will find that this Management Agent does not work as the others do.

FIM Service

If FIM Synchronization Service is the heart pumping information, FIM Service is the brain (sorry FIM CM, but your brain is not as impressive). FIM Service plays many roles in FIM, and during the design phase the capabilities of FIM Service are often in focus. FIM Service allows you to enforce the identity management policy within your organization and also make sure you are compliant at all times. FIM Service has its own database, where it stores the information about the identities it manages.

Request pipeline

In order to make any changes to objects in the FIM Service database, we need to work our way through the FIM Service request pipeline. Every request is made to the web service interface and follows the ensuing flow:

The Request Processor workflow receives the request and evaluates the token (who?) and the request type (what?).
Permission is checked to see if the request is allowed.
Management Policy Rules are evaluated.
If an Authentication workflow is required, it is serialized and run as an interactive workflow.
If an Authorization workflow is required, it is parallelized and run as an asynchronous workflow.
The object in the FIM Service database is modified according to the request.
If an Action workflow is required, follow-up workflows are run.

As we can see, a request to FIM Service may trigger three types of workflows. With the installation of FIM 2010 R2, we get a few workflows that cover many basic requirements, but this is one of the situations where custom coding or third-party workflows might be required in order to fulfill the identity management policy within the organization.

Authentication workflow (AuthN) is used when the request requires additional authentication. An example of this is when a user tries to reset his password—the AuthN workflow will ask the anonymous user to authenticate using the QA Gate.
Authorization workflow (AuthZ) is used when the request requires authorization from someone else. An example of this is when a user is added to a group, but the policy states that the owner of the group needs to approve the request.
Action workflow is used for many types of follow-up actions—it could be sending a notification e-mail or modifying attributes, among many other things.

FIM Service Management Agent

FIM Service Management Agent, as we discussed earlier, is responsible for synchronizing data between FIM Service and FIM Synchronization Service.
We said then that this MA is a bit special, and even from the FIM Service perspective it works a little differently. A couple of examples of the special relationship between the FIM Service MA and FIM Service are as follows:

Any request made by the FIM Service MA will bypass any AuthN and AuthZ workflows.
As a performance enhancer, the FIM Service MA is allowed to make changes directly to the FIM Service database in FIM 2010 R2, without using the request pipeline described earlier.

Management Policy Rules (MPRs)

The way we control what can be done, or what should happen, is by defining Management Policy Rules (MPRs) within FIM Service. The MPR is our tool to enforce the identity management policies within our organization. There are two types of MPRs—Request and Set Transition. A Request MPR is used to define how the request pipeline should behave on a particular request. If a request comes in and there is no Request MPR matching the request, it will fail. A Set Transition MPR is used to detect changes in objects and react upon that change. For example, if my EmployeeStatus is changed to Fired, my Active Directory (AD) account should be disabled.

A Set is used within FIM Service to group objects. We define rules that govern the criteria for an object to be part of a Set. For example, we can create a Set that contains all users with Fired as EmployeeStatus. As objects satisfy these criteria and transition into the Set, we can define a Set Transition MPR to make things such as disabling the AD account happen. We can also define an MPR that applies to the transition out from a Set. Sets are also used to configure permissions within FIM Service. Using Sets allows us to configure very granular permissions in scenarios where FIM Service is used for user self-service.

FIM Portal

FIM Portal is usually the starting point for administrators who will configure FIM Service. The configuration of FIM Service is usually done using FIM Portal, but it may also be configured using PowerShell or even your own custom interface. FIM Portal can also be used for self-service scenarios, allowing users to manage some aspects of the identity management process. FIM Portal is actually an ASP.NET application using Microsoft SharePoint as a foundation, and it can be modified in many ways.

Self Service Password Reset (SSPR)

The Self Service Password Reset (SSPR) feature of FIM is a special case, where most components used to implement it are built in. The default method uses what is called a QA Gate. FIM 2010 R2 also has built-in methods for using a One Time Password (OTP) that can be sent using either SMS or e-mail services. In short, the QA Gate works in the following way:

The administrator defines a number of questions.
Users register for SSPR and provide answers to the questions.
Users are presented with the same questions when a password reset is needed.
Giving the correct answers identifies the user and allows them to reset their password.

Once the FIM administrator has used FIM Portal to configure the password reset feature, the end user can register his answers to the QA Gate. If the organization has deployed FIM Password Reset Extension to the end user's Windows client, the process of registration and reset can be performed directly from the Windows client. If not, the user can register and reset his password using the password registration and reset portals.
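As a purely conceptual illustration of a question-and-answer gate, the following sketch registers salted hashes of a user's answers and later verifies them before a reset would be allowed. It is not FIM's implementation or API; the questions, hashing scheme, and storage are made up for the example.

# Conceptual sketch of a QA-style gate: register answers, then verify them.
# NOT how FIM stores or checks registered answers -- illustration only.
import hashlib, hmac, os

QUESTIONS = ["First school?", "Favourite colour?"]          # defined by the admin

def _digest(answer: str, salt: bytes) -> bytes:
    return hashlib.pbkdf2_hmac("sha256", answer.strip().lower().encode(), salt, 100_000)

def register(answers):
    """User registration: keep only a salt and a hash per answer."""
    records = []
    for a in answers:
        salt = os.urandom(16)
        records.append((salt, _digest(a, salt)))
    return records

def verify(records, answers):
    """Reset time: every answer must match before the reset is allowed."""
    return all(hmac.compare_digest(_digest(a, salt), stored)
               for (salt, stored), a in zip(records, answers))

stored = register(["Springfield Elementary", "Blue"])
print(verify(stored, ["springfield elementary", "blue"]))   # True
print(verify(stored, ["Shelbyville", "blue"]))              # False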
FIM Reporting

The Reporting component is brand new in FIM 2010 R2. In earlier versions of FIM, as well as in the older MIIS and ILM, reporting was typically achieved either by buying third-party add-ons or by developing custom solutions based on SQL Server Reporting Services. The purpose of Reporting is to give you a chance to view historical data. There are a few reports built into FIM 2010 R2, but many organizations will develop their own reports that comply with their identity management policies. The implementation of FIM 2010 R2 will, however, be a little more complex if you want the Reporting component. This is because the engine used to generate the reports is the Data Warehouse component of Microsoft System Center Service Manager (SCSM). There are a number of reasons for using the existing reporting capabilities in SCSM; the main one is that it is easy to extend.

FIM Certificate Management (FIM CM)

Certificate Management is the outcast member of the FIM family. FIM CM can be, and often is, used by itself, without any other parts of FIM being present. It is also the component with the poorest integration with the other components. If we look at it, we will find that it hasn't changed much since its predecessor, Certificate Lifecycle Manager (CLM), was released. FIM CM is mainly focused on managing smart cards, but it can also be used to manage and trace any type of certificate request.

The basic concept of FIM CM is that a smart card is requested using the FIM CM portal. Information regarding all requests is stored in the FIM CM database. The Certification Authority, which handles the issuing of the certificates, is configured to report the status back to the FIM CM database. The FIM CM portal also contains a workflow engine, so that the FIM CM admin can configure features such as e-mail notifications as a part of the policies.

Certificate Management portal

FIM Certificate Management uses a portal to interact with users and administrators. The FIM CM portal is an ASP.NET 2.0 website where, for example:

Administrators can configure the policies that govern the processes around certificate management
End users can manage their smart cards for purposes such as renewing and changing PIN codes
Help desks can use the portal to, for example, request temporary smart cards or reset PINs

Licensing

We put this part in here not to tell you how FIM 2010 R2 is licensed, but rather to tell you that it is complex. Since Microsoft has a habit of changing the way they license their products, we will not put any license details into writing. Depending on what parts you are using and, in some cases, how you are using them, you need to buy different licenses. FIM 2010 R2 (at the time of writing) uses both server licenses and Client Access Licenses (CALs). In almost every FIM project the licensing cost is negligible compared to the gain obtained by implementing it. But even so, please make sure to contact your Microsoft licensing partner, or your Microsoft contact, to clear up any questions you might have around licensing. If you do not have Microsoft System Center Service Manager (SCSM), it is stated (at the time of writing) that you can install and use SCSM for FIM Reporting without having to buy SCSM licenses. Read more about FIM licensing at http://aka.ms/FIMLicense.

Summary

As you can see, Microsoft Forefront Identity Manager 2010 R2 is not just one product, but a family of products.
In this article, we have given you a short overview of the different components, and we have seen how, together, they can mitigate the challenges that The Company has identified around its identity management.


Exact Inference Using Graphical Models

Packt
25 Jun 2014
7 min read
Complexity of inference

A graphical model can be used to answer both probability queries and MAP queries. The most straightforward way to use the model is to generate the joint distribution and sum out all the variables except the ones we are interested in. However, generating the joint distribution causes an exponential blowup in the number of entries we have to specify. In the worst case, exact inference is NP-hard. By the word exact, we mean specifying the probability values with a certain precision (say, five digits after the decimal point).

Suppose we tone down our precision requirements (for example, to only two digits after the decimal point). Is the (approximate) inference task any easier? Unfortunately not—even approximate inference is NP-hard; that is, obtaining values that are even slightly better than random guessing (50 percent, or a probability of 0.5) takes exponential time. It might seem like inference is a hopeless task, but that is only in the worst case. In general cases, we can use exact inference to solve certain classes of real-world problems (such as Bayesian networks that have a small number of discrete random variables). Of course, for larger problems, we have to resort to approximate inference.

Real-world issues

Since inference is NP-hard, inference engines are written in languages that are as close to bare metal as possible, usually C or C++. From Python, the options are:

Use Python implementations of inference algorithms. Complete and mature packages for these are uncommon.
Use inference engines that have a Python interface, such as Stan (mc-stan.org). This choice strikes a good balance between running Python code and a fast inference implementation.
Use inference engines that do not have a Python interface, which is true for the majority of the inference engines out there. A fairly comprehensive list can be found at http://en.wikipedia.org/wiki/Bayesian_network#Software. The use of Python here is limited to creating a file that describes the model in a format that the inference engine can consume.

In this article, we will stick to the first two choices in the list. We will use native Python implementations (of inference algorithms) to peek into the interiors of the inference algorithms while running toy-sized problems, and then use an external inference engine with Python interfaces to try out a more real-world problem.

The tree algorithm

We will now look at another class of exact inference algorithms based on message passing. Message passing is a general mechanism, and there exist many variations of message passing algorithms. We shall look at a short snippet of the clique tree message passing algorithm (which is sometimes called the junction tree algorithm). Other versions of the message passing algorithm are used in approximate inference as well.

We initiate the discussion by clarifying some of the terms used. A cluster graph is an arrangement of a network where groups of variables are placed in clusters. It is similar to a factor in that each cluster has a set of variables in its scope. The message passing algorithm is all about passing messages between clusters. As an analogy, consider the gossip going on at a party, where Shelly and Clair are in a conversation. If Shelly knows B, C, and D, and she is chatting with Clair who knows D, E, and F (note that the only person they know in common is D), they can share information (or pass messages) about their common friend D.
In the message passing algorithm, two clusters are connected by a Separation Set (sepset), which contains the variables common to both clusters. Using the preceding example, the two clusters {B, C, D} and {D, E, F} are connected by the sepset {D}, which contains the only variable common to both clusters. In the next section, we shall learn about the implementation details of the junction tree algorithm. We will first understand the four stages of the algorithm and then use code snippets to learn about it from an implementation perspective.

The four stages of the junction tree algorithm

In this section, we will discuss the four stages of the junction tree algorithm.

In the first stage, the Bayes network is converted into a secondary structure called a join tree (alternate names for this structure in the literature are junction tree, cluster tree, or clique tree). The transformation from the Bayes network to the junction tree proceeds as per the following steps:

We construct a moral graph by changing all the directed edges to undirected edges. The parents of every node with a V-structure entering it are connected with an edge. We have seen an example of this process (in the VE algorithm) called moralization, a possible reference to connecting (apparently unmarried) parents that have a child (node); a small sketch of this step follows the list.
Then, we selectively add edges to the moral graph to create a triangulated graph. A triangulated graph is an undirected graph in which every cycle of more than three nodes has a chord, so the longest chordless cycle is of length 3.
From the triangulated graph, we identify the subsets of nodes called cliques.
Starting with the cliques as clusters, we arrange the clusters to form an undirected tree called the join tree, which satisfies the running intersection property. This property states that if a node appears in two cliques, it should also appear in all the cliques on the path that connects the two cliques.
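In the spirit of the code snippets promised above, here is a small pure-Python sketch of the moralization step; the three-edge network is made up for illustration and no inference library is assumed.

# Sketch of moralization: connect co-parents, then drop edge directions.
from itertools import combinations

# Directed edges parent -> child of a toy Bayes network (made up)
directed = [("A", "C"), ("B", "C"), ("C", "D")]

# Collect the parents of every node
parents = {}
for p, c in directed:
    parents.setdefault(c, set()).add(p)

# Moral graph: undirected versions of all original edges ...
moral = {frozenset(e) for e in directed}
# ... plus an edge between every pair of parents that share a child (V-structure)
for ps in parents.values():
    for u, v in combinations(sorted(ps), 2):
        moral.add(frozenset((u, v)))

print(sorted(tuple(sorted(e)) for e in moral))
# [('A', 'B'), ('A', 'C'), ('B', 'C'), ('C', 'D')]

Triangulation, clique identification, and the join tree construction then operate on this undirected moral graph.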
In the second stage, the potentials at each cluster are initialized. The potentials are similar to a CPD or a table: they hold a list of values against each assignment to the variables in their scope. Both clusters and sepsets contain a set of potentials. The term potential is used as opposed to probability because, in Markov networks, the values of the potentials are not obliged to sum to 1.

The third stage consists of message passing, or belief propagation, between neighboring clusters. Each message consists of a belief the cluster has about a particular variable. Each message can be passed asynchronously, but a cluster has to wait for information from other clusters before it collates that information and passes it on to the next cluster. It can be useful to think of a tree-structured cluster graph, where the message passing happens in two stages: an upward pass stage and a downward pass stage. Only after a node receives messages from the leaf nodes will it send a message to its parent (in the "upward pass"), and only after the node receives a message from its parents will it send a message to its children (in the "downward pass"). The message passing stage completes when each cluster-sepset pair has consistent beliefs. Recall that a cluster connected to a sepset shares variables with it; the potential over those shared variables obtained from either the cluster or the sepset has the same value, which is why it is said that the cluster graph has consistent beliefs or that the cliques are calibrated.

Once the whole cluster graph has consistent beliefs, the fourth stage is marginalization, where we can query the marginal distribution for any variable in the graph.

Summary

We first explored the inference problem, where we studied the types of inference. We then learned that inference is NP-hard and understood that, for large networks, exact inference is infeasible.


Securing Data at Rest in Oracle 11g

Packt
23 Oct 2012
11 min read
Introduction

The Oracle physical database files are primarily protected by filesystem privileges. An attacker who has read permissions on these files will be able to steal the entire database or critical information such as datafiles containing credit card numbers, social security numbers, or other types of private information. Other threats are related to data theft from storage media where the physical database resides. The same applies to unprotected backups or dumps that can be easily restored or imported. The data in the database is stored in a proprietary format that is quite easy to decipher. There are several sites and specialized tools available to extract data from datafiles, backups, and dumps, known generically as Data Unloading (DUL) tools. These tools are usually the last resort when the database is corrupted and there is no backup available for restore and recovery. As you have probably already guessed, they can also be used by an attacker for data extraction from stolen databases or dumps (summary descriptions and links to several DUL tools can be found at http://www.oracle-internals.com/?p=17). The technology behind DUL utilities is based on understanding how Oracle keeps the data in datafiles behind the scenes (a very good article about Oracle datafile internals, written by Rodrigo Righetti, can be found at http://docs.google.com/Doc?id=df2mxgvb_1dgb9fv). Once you decipher the mechanism, you will be able to build your own tool with little effort.

One of the best methods for protecting data at rest is encryption. We can enumerate the following data encryption methods, described in this chapter, for use with an Oracle database:

Operating system proprietary filesystem or block-based encryption
Cryptographic APIs, especially DBMS_CRYPTO, used for column encryption
Transparent Data Encryption for encrypting columns, tablespaces, dumps, and RMAN backups

Using block device encryption

By using block device encryption, the data is encrypted and decrypted at block-device level. The block device can be formatted with a filesystem. The decryption is performed once the filesystem is mounted by the operating system, transparently for users. This type of encryption protects best against media theft and can be used for datafile placement. In this recipe we will add a new disk and implement block-level encryption with Linux Unified Key Setup-on-disk-format (LUKS).

Getting ready

All steps will be performed on nodeorcl1 as root.

How to do it...

Shut down nodeorcl1, then add a new disk to the nodeorcl1 system and boot it. Our new device will be seen by the operating system as /dev/sdb.

Next, create a new partition /dev/sdb1 using fdisk as follows:

[root@nodeorcl1 ~]# fdisk /dev/sdb
WARNING: DOS-compatible mode is deprecated. It's strongly recommended to switch off the mode (command 'c') and change display units to sectors (command 'u').
Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-5577, default 1):
Using default value 1
Last cylinder, +cylinders or +size{K,M,G} (1-5577, default 5577):
Using default value 5577
Command (m for help): w
The partition table has been altered!
Calling ioctl() to re-read partition table.
Syncing disks.

Format and add a passphrase for encryption on the /dev/sdb1 device with the cryptsetup utility as follows:

[root@nodeorcl1 dev]# cryptsetup luksFormat /dev/sdb1

WARNING!
========
This will overwrite data on /dev/sdb1 irrevocably.

Are you sure?
(Type uppercase yes): YES
Enter LUKS passphrase: P5;@o[]klopY&P]
Verify passphrase: P5;@o[]klopY&P]
[root@nodeorcl1 dev]#

Access to the encrypted device is not performed directly; all operations go through a device-mapper. Open the device-mapper for /dev/sdb1 as follows:

[root@nodeorcl1 mapper]# cryptsetup luksOpen /dev/sdb1 storage
Enter passphrase for /dev/sdb1: P5;@o[]klopY&P]
[root@nodeorcl1 mapper]# ls -al /dev/mapper/storage
lrwxrwxrwx. 1 root root 7 Sep 23 20:03 /dev/mapper/storage -> ../dm-4

Formatting with a filesystem must also be performed on the device-mapper. Format the device-mapper with the ext4 filesystem as follows:

[root@nodeorcl1 mapper]# mkfs.ext4 /dev/mapper/storage
mke2fs 1.41.12 (17-May-2010)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
…
This filesystem will be automatically checked every 38 mounts or 180 days, whichever comes first. Use tune2fs -c or -i to override.
[root@nodeorcl1 mapper]#

Next we will configure the device-mapper /dev/mapper/storage for automatic mount during boot. Create a directory called storage that will be used as the mount point:

[root@nodeorcl1 storage]# mkdir /storage

The mapper device /dev/mapper/storage can be mounted as a normal device:

[root@nodeorcl1 storage]# mount /dev/mapper/storage /storage

To make the mount persistent across reboots, add /storage as the mount point for /dev/mapper/storage. First add the mapper-device name into /etc/crypttab:

[root@nodeorcl1 storage]# echo "storage /dev/sdb1" > /etc/crypttab

Add the complete mapper-device path, mount point, and filesystem type in /etc/fstab as follows:

/dev/mapper/storage /storage ext4 defaults 1 2

Reboot the system:

[root@nodeorcl1 storage]# shutdown -r now

At the boot sequence, the passphrase for /storage will be requested. If no passphrase is typed, the mapper device will not be mounted.

How it works...

Block device encryption is implemented to work below the filesystem level. Once the device is offline, the data appears like a large blob of random data. There is no way to determine what kind of filesystem and data it contains.

There's more...

To dump information about the encrypted device, execute the following command:

[root@nodeorcl1 dev]# cryptsetup luksDump /dev/sdb1
LUKS header information for /dev/sdb1
Version:        1
Cipher name:    aes
Cipher mode:    cbc-essiv:sha256
Hash spec:      sha1
Payload offset: 4096
MK bits:        256
MK digest:      2c 7a 4c 96 9d db 63 1c f0 15 0b 2c f0 1a d9 9b 8c 0c 92 4b
MK salt:        59 ce 2d 5b ad 8f 22 ea 51 64 c5 06 7b 94 ca 38 65 94 ce 79 ac 2e d5 56 42 13 88 ba 3e 92 44 fc
MK iterations:  51750
UUID:           21d5a994-3ac3-4edc-bcdc-e8bfbf5f66f1
Key Slot 0: ENABLED
  Iterations:          207151
  Salt:                89 97 13 91 1c f4 c8 74 e9 ff 39 bc d3 28 5e 90 bf 6b 9a c0 6d b3 a0 21 13 2b 33 43 a7 0c f1 85
  Key material offset: 8
  AF stripes:          4000
Key Slot 1: DISABLED
Key Slot 2: DISABLED
Key Slot 3: DISABLED
Key Slot 4: DISABLED
Key Slot 5: DISABLED
Key Slot 6: DISABLED
Key Slot 7: DISABLED
[root@nodeorcl1 ~]#

Using filesystem encryption with eCryptfs

The eCryptfs filesystem is implemented as an encryption/decryption layer interposed between a mounted filesystem and the kernel. The data is encrypted and decrypted automatically at filesystem access. It can be used for backups or for placing sensitive files on transportable or fixed storage media. In this recipe we will install eCryptfs and demonstrate some of its capabilities.
Getting ready All steps will be performed on nodeorcl1. How to do it... eCryptfs is shipped and bundled with the Red Hat installation kit. The eCryptfs package depends on the trousers package. As the root user, first install the trousers package, followed by the ecryptfs-utils package: [root@nodeorcl1 Packages]# rpm -Uhv trousers-0.3.4-4.el6.x86_64.rpm warning: trousers-0.3.4-4.el6.x86_64.rpm: Header V3 RSA/SHA256 Signature, key ID fd431d51: NOKEY Preparing... ###################################### ##### [100%] 1:trousers ###################################### ##### [100%] [root@nodeorcl1 Packages]# rpm -Uhv ecryptfs-utils-82-6.el6.x86_64.rpm warning: ecryptfs-utils-82-6.el6.x86_64.rpm: Header V3 RSA/SHA256 Signature, key ID fd431d51: NOKEY Preparing... ###################################### ##### [100%] 1:ecryptfs-utils ###################################### ##### [100%] Create a directory that will be mounted with the eCryptfs filesystem and set the oracle user as the owner: [root@nodeorcl1 ~]# mkdir /ecryptedfiles [root@nodeorcl1 ~]# chown -R oracle:oinstall /ecryptedfiles Mount /ecryptedfiles to itself using the eCryptfs filesystem. Use the default values for all options and use a strong passphrase as follows: [root@nodeorcl1 hashkeys]# mount -t ecryptfs /ecryptedfiles /ecryptedfiles Select key type to use for newly created files: 1) openssl 2) tspi 3) passphrase Selection: 3 Passphrase: lR%5_+KO}Pi_$2E Select cipher: 1) aes: blocksize = 16; min keysize = 16; max keysize = 32 (not loaded) 2) blowfish: blocksize = 16; min keysize = 16; max keysize = 56 (not loaded) 3) des3_ede: blocksize = 8; min keysize = 24; max keysize = 24 (not loaded) 4) cast6: blocksize = 16; min keysize = 16; max keysize = 32 (not loaded) 5) cast5: blocksize = 8; min keysize = 5; max keysize = 16 (not loaded) Selection [aes]: Select key bytes: 1) 16 2) 32 3) 24 Selection [16]: Enable plaintext passthrough (y/n) [n]: Enable filename encryption (y/n) [n]: y Filename Encryption Key (FNEK) Signature [d395309aaad4de06]: Attempting to mount with the following options: ecryptfs_unlink_sigs ecryptfs_fnek_sig=d395309aaad4de06 ecryptfs_key_bytes=16 ecryptfs_cipher=aes ecryptfs_sig=d395309aaad4de06 Mounted eCryptfs [root@nodeorcl1 hashkeys]# Switch to the oracle user and export the HR schema to the /ecryptedfiles directory as follows: [oracle@nodeorcl1 ~]$ export NLS_LANG=AMERICAN_AMERICA.AL32UTF8 [oracle@nodeorcl1 ~]$ exp system file=/ecryptedfiles/hr.dmp owner=HR statistics=none Export: Release 11.2.0.3.0 - Production on Sun Sep 23 20:49:30 2012 Copyright (c) 1982, 2011, Oracle and/or its affiliates. All rights reserved. Password: Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production With the Partitioning, OLAP, Data Mining and Real Application Testing options Export done in AL32UTF8 character set and AL16UTF16 NCHAR character set About to export specified users ... …………………………………………………………………………………………………………….. . . exporting table LOCATIONS 23 rows exported . . exporting table REGIONS 4 rows exported . …………………………………………………………………………………………………….. . exporting post-schema procedural objects and actions . exporting statistics Export terminated successfully without warnings. [oracle@nodeorcl1 ~]$ If you open the hr.dmp file with the strings command, you will be able to see the content of the dump file: [root@nodeorcl1 ecryptedfiles]# strings hr.dmp | more ………………………………………………………………………………………………………………………………………. 
CREATE TABLE "COUNTRIES" ("COUNTRY_ID" CHAR(2) CONSTRAINT "COUNTRY_ID_NN" NOT NULL ENABLE, "COUNTRY_NAME" VARCHAR2(40), "REGION_ID" NUMBER, CONSTRAINT "COUNTRY_C_ID_PK" PRIMARY KEY ("COUNTRY_ID") ENABLE ) ORGANIZATION INDEX PCTFREE 10 INITRANS 2 MAXTRANS 255 STORAGE(INITIAL 65536 NEXT 1048576 MINEXTENTS 1 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT) TABLESPACE "EXAMPLE" NOLOGGING NOCOMPRESS PCTTHRESHOLD 50 INSERT INTO "COUNTRIES" ("COUNTRY_ID", "COUNTRY_NAME", "REGION_ ID") VALUES (:1, :2, :3) Argentina Australia Belgium Brazil Canada Next, as root, unmount /ecryptedfiles as follows: [root@nodeorcl1 /]# umount /ecryptedfiles/ If we list the content of the /ecryptedfiles directory now, we should see that the file names and content are encrypted: [root@nodeorcl1 /]# cd /ecryptedfiles/ [root@nodeorcl1 ecryptedfiles]# ls ECRYPTFS_FNEK_ENCRYPTED.FWbHZH0OehHS.URqPdiytgZHLV5txs-bH4KKM4Sx2qGR2by6i00KoaCBwE-- [root@nodeorcl1 ecryptedfiles]# [root@nodeorcl1 ecryptedfiles]# more ECRYPTFS_FNEK_ENCRYPTED.FWbHZH0OehHS.URqPdiytgZHLV5txs-bH4KKM4Sx2qGR2by6i00KoaCBwE-- ………………………………………………………………………………………………………………………………… 9$Eî□□KdgQNK□□v□□ S□□J□□□ h□□□ PIi'ʼn□□R□□□□□siP □b □`)3 □W □W( □□□□c!□□8□E.1'□R□7bmhIN□□--(15%) …………………………………………………………………………………………………………………………………. To make the file accessible again, mount the /ecryptedfiles filesystem by passing the same parameters and passphrase as performed in step 3. How it works... eCryptfs is mapped in the kernel Virtual File System (VFS), similar to other filesystems such as ext3, ext4, and ReiserFS. All calls on the filesystem will go first through the eCryptfs mount point and then to the underlying filesystem found on the mount point (ext3, ext4, JFS, ReiserFS). The key used for encryption is retrieved from the user session key ring, and the kernel cryptographic API is used for encryption and decryption of file content. Communication with the kernel is performed by the eCryptfs daemon. The file data content is encrypted for each file with a distinct, randomly generated File Encryption Key (FEK); the FEK is encrypted with a File Encryption Key Encryption Key (FEKEK), resulting in an Encrypted File Encryption Key (EFEK) that is stored in the header of the file. There's more... On Oracle Solaris you can implement filesystem encryption using the ZFS built-in filesystem encryption capabilities. On IBM AIX you can use EFS.
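To see for yourself how much readable information an unprotected export leaks, the strings check used above can be reproduced with a few lines of Python. This is only an illustrative sketch and is not part of the recipe; the file path and script name are assumptions. Run it against the dump while the eCryptfs filesystem is mounted, and again against the encrypted copy that is visible when it is not mounted, to compare the output:

import re
import sys

def printable_runs(path, min_len=8):
    # yield runs of printable ASCII of at least min_len bytes, much like the strings command
    pattern = re.compile(rb'[\x20-\x7e]{%d,}' % min_len)
    with open(path, 'rb') as f:
        data = f.read()  # fine for a demo; stream in chunks for very large dumps
    for match in pattern.finditer(data):
        yield match.group().decode('ascii')

if __name__ == '__main__':
    # hypothetical usage: python printable_runs.py /ecryptedfiles/hr.dmp
    for run in printable_runs(sys.argv[1]):
        print(run)

Against the mounted, decrypted view of hr.dmp this prints the DDL and row values shown above; against the encrypted copy seen without the passphrase, only noise comes back.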
What is Quantitative Finance?

Packt
20 Jun 2014
11 min read
(For more resources related to this topic, see here.) Discipline 1 – finance (financial derivatives) In general, a financial derivative is a contract between two parties who agree to exchange one or more cash flows in the future. The value of these cash flows depends on some future event, for example, the value of some stock index or interest rate being above or below some predefined level. The activation or triggering of this future event thus depends on the behavior of a variable quantity known as the underlying. Financial derivatives receive their name because they derive their value from the behavior of another financial instrument. As such, financial derivatives do not have an intrinsic value in themselves (in contrast to bonds or stocks); their price depends entirely on the underlying. A critical feature of derivative contracts is thus that their future cash flows are probabilistic and not deterministic. The future cash flows in a derivative contract are contingent on some future event. That is why derivatives are also known as contingent claims. This feature makes these types of contracts difficult to price. The following are the most common types of financial derivatives: futures, forwards, options, and swaps. Futures and forwards are financial contracts between two parties. One party agrees to buy the underlying from the other party at some predetermined date (the maturity date) for some predetermined price (the delivery price). An example could be a one-month forward contract on one ounce of silver. The underlying is the price of one ounce of silver. No exchange of cash flows occurs at inception (today, t=0); the exchange occurs only at maturity (t=T). Here t represents the variable time. Forwards are contracts negotiated privately between two parties (in other words, Over The Counter (OTC)), while futures are negotiated at an exchange. Options are financial contracts between two parties. One party (called the holder of the option) pays a premium to the other party (called the writer of the option) in order to have the right, but not the obligation, to buy some particular asset (the underlying) for some particular price (the strike price) at some particular date in the future (the maturity date). This type of contract is called a European Call contract. Example 1 Consider a one-month call contract on the S&P 500 index. The underlying in this case will be the value of the S&P 500 index. There are cash flows both at inception (today, t=0) and at maturity (t=T). At inception (t=0), the premium is paid, while at maturity (t=T), the holder of the option will choose between two possible scenarios, depending on the value of the underlying at maturity S(T): Scenario A, to exercise his/her right and buy the underlying asset for K; or Scenario B, to do nothing. The option holder will choose Scenario A if the value of the underlying at maturity is above the value of the strike, that is, S(T)>K. This will guarantee him/her a profit of S(T)-K. The option holder will choose Scenario B if the value of the underlying at maturity is below the value of the strike, that is, S(T)<K. This limits his/her losses to zero. Example 2 An Interest Rate Swap (IRS) is a financial contract between two parties A and B who agree to exchange cash flows at regular intervals during a given period of time (the life of a contract). 
Typically, the cash flows from A to B are indexed to a fixed rate of interest, while the cash flows from B to A are indexed to a floating interest rate. The set of fixed cash flows is known as the fixed leg, while the set of floating cash flows is known as the floating leg. The cash flows occur at regular intervals during the life of the contract between inception (t=0) and maturity (t=T). An example could be a fixed-for-floating IRS in which party A pays a rate of 5 percent on the agreed notional N every three months and receives EURIBOR3M on the agreed notional N every three months. Example 3 A futures contract on a stock index also involves a single future cash flow (the delivery price) to be paid at the maturity of the contract. However, the payoff in this case is uncertain because how much profit I will get from this operation will depend on the value of the underlying at maturity. If the price of the underlying is above the delivery price, then the payoff I get (denoted by the function H) is positive (indicating a profit) and corresponds to the difference between the value of the underlying at maturity S(T) and the delivery price K. If the price of the underlying is below the delivery price, then the payoff I get is negative (indicating a loss) and corresponds to the difference between the delivery price K and the value of the underlying at maturity S(T). This characteristic can be summarized in the following payoff formula: Equation 1 Here, H(S(T)) is the payoff at maturity, which is a function of S(T). Financial derivatives are very important to the modern financial markets. According to the Bank for International Settlements (BIS), as of December 2012 the amounts outstanding for OTC derivative contracts worldwide were: foreign exchange derivatives, 67,358 billion USD; interest rate derivatives, 489,703 billion USD; equity-linked derivatives, 6,251 billion USD; commodity derivatives, 2,587 billion USD; and credit default swaps, 25,069 billion USD. For more information, see http://www.bis.org/statistics/dt1920a.pdf. Discipline 2 – mathematics We need mathematical models to capture both the future evolution of the underlying and the probabilistic nature of the contingent cash flows we encounter in financial derivatives. Regarding the contingent cash flows, these can be represented in terms of the payoff function H(S(T)) for the specific derivative we are considering. Because S(T) is a stochastic variable, the value of H(S(T)) ought to be computed as an expectation E[H(S(T))]. And in order to compute this expectation, we need techniques that allow us to predict or simulate the behavior of the underlying S(T) into the future, so as to be able to compute the value of S(T) and finally be able to compute the mean value of the payoff E[H(S(T))]. Regarding the behavior of the underlying, typically, this is formalized using Stochastic Differential Equations (SDEs), such as Geometric Brownian Motion (GBM), as follows: Equation 2 The previous equation fundamentally says that the change in a stock price (dS) can be understood as the sum of two effects: a deterministic effect (the first term on the right-hand side) and a stochastic term (the second term on the right-hand side). The parameter μ is called the drift, and the parameter σ is called the volatility. S is the stock price, dt is a small time interval, and dW is an increment in the Wiener process. This model is the most common model used to describe the behavior of stocks, commodities, and foreign exchange. 
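Written out in standard notation, the two equations referenced above take the following form. This is a reconstruction from the descriptions in the text (using the drift μ and volatility σ introduced above) rather than a copy of the original figures:

% Equation 1: payoff of the index futures position at maturity
H\bigl(S(T)\bigr) = S(T) - K

% Equation 2: Geometric Brownian Motion for the underlying
\mathrm{d}S = \mu\, S\, \mathrm{d}t + \sigma\, S\, \mathrm{d}W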
Other models exist, such as jump, local volatility, and stochastic volatility models that enhance the description of the dynamics of the underlying. Regarding the numerical methods, these correspond to ways in which the formal expression described in the mathematical model (usually in continuous time) is transformed into an approximate representation that can be used for calculation (usually in discrete time). This means that the SDE that describes the evolution of the price of some stock index into the future, such as the FTSE 100, is changed to describe the evolution at discrete intervals. An approximate representation of an SDE can be calculated using the Euler approximation as follows: Equation 3 The preceding equation needs to be solved in an iterative way for each time interval between now and the maturity of the contract. If these time intervals are days and the contract has a maturity of 30 days from now, then we compute tomorrow's price in terms of todays. Then we compute the day after tomorrow as a function of tomorrow's price and so on. In order to price the derivative, we require to compute the expected payoff E[H(ST)] at maturity and then discount it to the present. In this way, we would be able to compute what should be the fair premium associated with a European option contract with the help of the following equation: Equation 4 Discipline 3 – informatics (C++ programming) What is the role of C++ in pricing derivatives? Its role is fundamental. It allows us to implement the actual calculations that are required in order to solve the pricing problem. Using the preceding techniques to describe the dynamics of the underlying, we require to simulate many potential future scenarios describing its evolution. Say we ought to price a futures contract on the EUR/USD exchange rate with one year maturity. We have to simulate the future evolution of EUR/USD for each day for the next year (using equation 3). We can then compute the payoff at maturity (using equation 1). However, in order to compute the expected payoff (using equation 4), we need to simulate thousands of such possible evolutions via a technique known as Monte Carlo simulation. The set of steps required to complete this process is known as an algorithm. To price a derivative, we ought to construct such algorithm and then implement it in an advanced programming language such as C++. Of course C++ is not the only possible choice, other languages include Java, VBA, C#, Mathworks Matlab, and Wolfram Mathematica. However, C++ is an industry standard because it's flexible, fast, and portable. Also, through the years, several numerical libraries have been created to conduct complex numerical calculations in C++. Finally, C++ is a powerful modern object-oriented language. It is always difficult to strike a balance between clarity and efficiency. We have aimed at making computer programs that are self-contained (not too object oriented) and self-explanatory. More advanced implementations are certainly possible, particularly in the context of larger financial pricing libraries in a corporate context. In this article, all the programs are implemented with the newest standard C++11 using Code::Blocks (http://www.codeblocks.org) and MinGW (http://www.mingw.org). The Bento Box template A Bento Box is a single portion take-away meal common in Japanese cuisine. Usually, it has a rectangular form that is internally divided in compartments to accommodate the various types of portions that constitute a meal. 
In this article, we use the metaphor of the Bento Box to describe a visual template to facilitate, organize, and structure the solution of derivative problems. The Bento Box template is simply a form that we will fill sequentially with the different elements that we require to price derivatives in a logical structured manner. The Bento Box template when used to price a particular derivative is divided into four areas or boxes, each containing information critical for the solution of the problem. The following figure illustrates a generic template applicable to all derivatives: The Bento Box template – general case The following figure shows an example of the Bento Box template as applied to a simple European Call option: The Bento Box template – European Call option In the preceding figure, we have filled the various compartments, starting in the top-left box and proceeding clockwise. Each compartment contains the details about our specific problem, taking us in sequence from the conceptual (box 1: derivative contract) to the practical (box 4: algorithm), passing through the quantitative aspects required for the solution (box 2: mathematical model and box 3: numerical method). Summary This article gave an overview of the main elements of Quantitative Finance as applied to pricing financial derivatives. The Bento Box template technique will be used to organize our approach to solve problems in pricing financial derivatives. We will assume that we are in possession with enough information to fill box 1 (derivative contract). Resources for Article: Further resources on this subject: Application Development in Visual C++ - The Tetris Application [article] Getting Started with Code::Blocks [article] Creating and Utilizing Custom Entities [article]
Using Execnet for Parallel and Distributed Processing with NLTK

Packt
09 Nov 2010
8 min read
Python Text Processing with NLTK 2.0 Cookbook: Use Python's NLTK suite of libraries to maximize your Natural Language Processing capabilities. Quickly get to grips with Natural Language Processing – with Text Analysis, Text Mining, and beyond. Learn how machines and crawlers interpret and process natural languages. Easily work with huge amounts of data and learn how to handle distributed processing. Part of Packt's Cookbook series: each recipe is a carefully organized sequence of instructions to complete the task as efficiently as possible. Introduction NLTK is great for in-memory single-processor natural language processing. However, there are times when you have a lot of data to process and want to take advantage of multiple CPUs, multi-core CPUs, and even multiple computers. Or perhaps you want to store frequencies and probabilities in a persistent, shared database so multiple processes can access it simultaneously. For the first case, we'll be using execnet to do parallel and distributed processing with NLTK. For the second case, you'll learn how to use the Redis data structure server/database to store frequency distributions and more. Distributed tagging with execnet Execnet is a distributed execution library for Python. It allows you to create gateways and channels for remote code execution. A gateway is a connection from the calling process to a remote environment. The remote environment can be a local subprocess or an SSH connection to a remote node. A channel is created from a gateway and handles communication between the channel creator and the remote code. Since many NLTK processes require 100 percent CPU utilization during computation, execnet is an ideal way to distribute that computation for maximum resource usage. You can create one gateway per CPU core, and it doesn't matter whether the cores are in your local computer or spread across remote machines. In many situations, you only need to have the trained objects and data on a single machine, and can send the objects and data to the remote nodes as needed. Getting ready You'll need to install execnet for this to work. It should be as simple as sudo pip install execnet or sudo easy_install execnet. The current version of execnet, as of this writing, is 1.0.8. The execnet homepage, which has API documentation and examples, is at http://codespeak.net/execnet/. How to do it... We start by importing the required modules, as well as an additional module remote_tag.py that will be explained in the next section. We also need to import pickle so we can serialize the tagger. Execnet does not natively know how to deal with complex objects such as a part-of-speech tagger, so we must dump the tagger to a string using pickle.dumps(). We'll use the default tagger that's used by the nltk.tag.pos_tag() function, but you could load and dump any pre-trained part-of-speech tagger as long as it implements the TaggerI interface. Once we have a serialized tagger, we start execnet by making a gateway with execnet.makegateway(). The default gateway creates a Python subprocess, and we can call the remote_exec() method with the remote_tag module to create a channel. With an open channel, we send over the serialized tagger and then the first tokenized sentence of the treebank corpus. You don't have to do any special serialization of simple types such as lists and tuples, since execnet already knows how to handle serializing the built-in types. 
Now if we call channel.receive(), we get back a tagged sentence that is equivalent to the first tagged sentence in the treebank corpus, so we know the tagging worked. We end by exiting the gateway, which closes the channel and kills the subprocess. >>> import execnet, remote_tag, nltk.tag, nltk.data >>> from nltk.corpus import treebank >>> import cPickle as pickle >>> tagger = pickle.dumps(nltk.data.load(nltk.tag._POS_TAGGER)) >>> gw = execnet.makegateway() >>> channel = gw.remote_exec(remote_tag) >>> channel.send(tagger) >>> channel.send(treebank.sents()[0]) >>> tagged_sentence = channel.receive() >>> tagged_sentence == treebank.tagged_sents()[0] True >>> gw.exit() Visually, the communication process looks like this: How it works... The gateway's remote_exec() method takes a single argument that can be one of the following three types: a string of code to execute remotely, the name of a pure function that will be serialized and executed remotely, or the name of a pure module whose source will be executed remotely. We use the third option with the remote_tag.py module, which is defined as follows: import cPickle as pickle if __name__ == '__channelexec__': tagger = pickle.loads(channel.receive()) for sentence in channel: channel.send(tagger.tag(sentence)) A pure module is a module that is self-contained. It can only access Python modules that are available where it executes, and does not have access to any variables or states that exist wherever the gateway is initially created. To detect that the module is being executed by execnet, you can look at the __name__ variable. If it's equal to '__channelexec__', then it is being used to create a remote channel. This is similar to doing if __name__ == '__main__' to check if a module is being executed on the command line. The first thing we do is call channel.receive() to get the serialized tagger, which we load using pickle.loads(). You may notice that channel is not imported anywhere; that's because it is included in the global namespace of the module. Any module that execnet executes remotely has access to the channel variable in order to communicate with the channel creator. Once we have the tagger, we iteratively tag() each tokenized sentence that we receive from the channel. This allows us to tag as many sentences as the sender wants to send, as iteration will not stop until the channel is closed. What we've essentially created is a compute node for part-of-speech tagging that dedicates 100 percent of its resources to tagging whatever sentences it receives. As long as the channel remains open, the node is available for processing. There's more... This is a simple example that opens a single gateway and channel. But execnet can do a lot more, such as opening multiple channels to increase parallel processing, as well as opening gateways to remote hosts over SSH to do distributed processing. Multiple channels We can create multiple channels, one per gateway, to make the processing more parallel. Each gateway creates a new subprocess (or remote interpreter if using an SSH gateway) and we use one channel per gateway for communication. Once we've created two channels, we can combine them using the MultiChannel class, which allows us to iterate over the channels, and make a receive queue to receive messages from each channel. After creating each channel and sending the tagger, we cycle through the channels to send an even number of sentences to each channel for tagging. Then we collect all the responses from the queue. 
A call to queue.get() will return a 2-tuple of (channel, message) in case you need to know which channel the message came from. If you don't want to wait forever, you can also pass a timeout keyword argument with the maximum number of seconds you want to wait, as in queue.get(timeout=4). This can be a good way to handle network errors. Once all the tagged sentences have been collected, we can exit the gateways. Here's the code: >>> import itertools >>> gw1 = execnet.makegateway() >>> gw2 = execnet.makegateway() >>> ch1 = gw1.remote_exec(remote_tag) >>> ch1.send(tagger) >>> ch2 = gw2.remote_exec(remote_tag) >>> ch2.send(tagger) >>> mch = execnet.MultiChannel([ch1, ch2]) >>> queue = mch.make_receive_queue() >>> channels = itertools.cycle(mch) >>> for sentence in treebank.sents()[:4]: ... channel = channels.next() ... channel.send(sentence) >>> tagged_sentences = [] >>> for i in range(4): ... channel, tagged_sentence = queue.get() ... tagged_sentences.append(tagged_sentence) >>> len(tagged_sentences) 4 >>> gw1.exit() >>> gw2.exit() Local versus remote gateways The default gateway spec is popen, which creates a Python subprocess on the local machine. This means execnet.makegateway() is equivalent to execnet.makegateway('popen'). If you have passwordless SSH access to a remote machine, then you can create a remote gateway using execnet.makegateway('ssh=remotehost'), where remotehost should be the hostname of the machine. An SSH gateway spawns a new Python interpreter for executing the code remotely. As long as the code you're using for remote execution is pure, you only need a Python interpreter on the remote machine. Channels work exactly the same no matter what kind of gateway is used; the only difference will be communication time. This means you can mix and match local subprocesses with remote interpreters to distribute your computations across many machines in a network. There are many more details on gateways in the API documentation at http://codespeak.net/execnet/basics.html.
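Building on this recipe, the multiple-channel pattern can be wrapped into a small reusable helper. The sketch below is illustrative only: the function name and the gateway specs passed in are assumptions, and it relies on the remote_tag module and the pickled tagger created earlier in the recipe. Results are returned in the order they arrive from the queue, which is not necessarily the input order.

import itertools
import execnet
import remote_tag  # the pure module defined in this recipe

def distributed_tag(tagger, sentences, specs=('popen', 'popen')):
    # one gateway (and one channel) per spec; specs may mix 'popen' and 'ssh=host'
    gateways = [execnet.makegateway(spec) for spec in specs]
    channels = []
    for gw in gateways:
        ch = gw.remote_exec(remote_tag)
        ch.send(tagger)  # each node needs its own copy of the serialized tagger
        channels.append(ch)
    mch = execnet.MultiChannel(channels)
    queue = mch.make_receive_queue()
    # distribute the tokenized sentences round-robin over the channels
    cycle = itertools.cycle(channels)
    for sentence in sentences:
        next(cycle).send(sentence)
    # collect one tagged sentence per input; the timeout guards against a dead node
    tagged = [queue.get(timeout=30)[1] for _ in sentences]
    for gw in gateways:
        gw.exit()
    return tagged

Called as distributed_tag(tagger, treebank.sents()[:100], specs=('popen', 'ssh=remotehost')), it would split the tagging work between a local subprocess and a remote host, assuming passwordless SSH access as described above.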
Integrating IBM Cognos TM1 with IBM Cognos 8 BI

Packt
16 Dec 2011
4 min read
(For more resources on IBM, see here.) Before proceeding with the actual steps of the recipe, we will take note of the following integration considerations: The measures Dimension in the TM1 Cube needs to be explicitly identified. A Data Source that points to the TM1 Cube needs to be created in IBM Cognos Connection. A new Data Source can also be created from IBM Cognos Framework Manager, but for the sake of simplicity we will be creating it from IBM Cognos Connection itself. The created Data Source is used in the IBM Cognos Framework Manager Model to create a Metadata Package and publish it to IBM Cognos Connection. The Metadata Package can be used to create reports, generate queries, slice and dice, or perform event management using one of the designer studios available in IBM Cognos BI. We will focus on each of the above steps in this recipe, where we will be using one of the Cubes created as part of the demodata TM1 Server application, and we will be using the Cube as a Data Source in the IBM Cognos BI layer. Getting ready Ensure that the TM1 Admin Server service is started and the demodata TM1 Server is running. We should have the IBM Cognos 8 BI Server running and IBM Cognos 8 Framework Manager installed. How to do it... Open the TM1 Architect and right-click on the Sales_Plan Cube. Click on Properties. In the Measures Dimension box, click on Sales_Plan_Measures and then for Time Dimension click on Months. Note that the preceding step is compulsory if we want to use the Cube as a Data Source for the BI layer. We need to explicitly define a measures dimension and a time dimension. Click on OK and minimize the TM1 Architect, keeping the server running. Now from the Start menu, open IBM Cognos Framework Manager, which is a desktop-based tool used to create metadata models. Create a new project from IBM Cognos 8 Framework Manager. Enter the Project name as Demodata and provide the Location where the model file will be located. Note that each project generates a .cpf file which can be opened in IBM Cognos Framework Manager. Provide valid user credentials so that IBM Cognos Framework Manager can link to a running IBM Cognos BI Server setup. Users and roles are defined by the IBM Cognos BI admin user. Choose English as the authoring language when the Select Language list comes up. This will open the Metadata Wizard - Select Metadata Source. We use the Metadata Wizard to create a new Data Source or point to an existing Data Source. In the Metadata Wizard make sure that Data Sources is selected and click on the Next button. In the next screen, click on the New button to create a new Data Source by the name of TM1_Demodata_Sales_Plan. This will open a New data source wizard, where we need to specify the name of the Data Source. On the next screen, it will ask for the Data Source Type, for which we will specify TM1 from the drop-down, as we want to create a new Data Source based on the TM1 Cube Sales_Plan. On the next screen specify the connection parameters. For Administration Host we can specify a name or localhost, depending on the name of the server. In our case, we have specified the name of the server as ankitgar, hence we are using an actual name instead of localhost. In the case of TM1 sitting on another server within the network, we will provide the IP address or name of the host in UNC format. Test the connection to verify whether the connection to the TM1 Cube is successful. Click on Close and proceed. Click on the Finish button to complete the creation of the Data Source. 
The new Data Source is created on the Cognos 8 Server and can now be used by anyone with valid privileges given by the admin user. It's just a connection to the Sales_Plan TM1 Cube, which can now be used to create metadata models and, hence, reports and queries that perform the various functions suggested in the preceding sections. Now it will return to the Metadata Wizard as shown, with the new Data Source appearing in the list of already created Data Sources. Click on the newly created Data Source and then on the Next button. It will display all available Cubes on the DemoData TM1 Server, the machine name being the server name (localhost/ankitgar). Click on the Sales_Plan cube and then on Next.
WebLogic Server

Packt
15 Mar 2017
24 min read
In this article by Adrian Ward, Christian Screen, and Haroun Khan, the authors of the book Oracle Business Intelligence Enterprise Edition 12c - Second Edition, we will talk in a little more detail about the enterprise application server that is at the core of Oracle Fusion Middleware: WebLogic. Oracle WebLogic Server is a scalable, enterprise-ready Java Platform Enterprise Edition (Java EE) application server. Its infrastructure supports the deployment of many types of distributed applications. It is also an ideal foundation for building service-oriented architecture (SOA) applications. You can already see why BEA was a perfect acquisition for Oracle years ago. Or, more to the point, a perfect core for Fusion Middleware. (For more resources related to this topic, see here.) The WebLogic Server is a robust application in itself. In Oracle BI 12c, the WebLogic Server is crucial to the overall implementation, not just from installation but throughout the Oracle BI 12c lifecycle, which now takes advantage of the WebLogic Management Framework. Learning the management components of WebLogic Server that ultimately control the Oracle BI components is critical to the success of an implementation. These management areas within the WebLogic Server are referred to as the WebLogic Administration Server, WebLogic Managed Server(s), and the WebLogic Node Manager. A Few WebLogic Server Nuances Before we move on to a description of each of those areas within WebLogic, it is also important to understand that the WebLogic Server software that is used for the installation of the Oracle BI product suite carries a limited license. Although the software itself is the full enterprise version and carries full functionality, the license that ships with Oracle BI 12c is not a full enterprise license for WebLogic Server, so your organization cannot use it to spin off other siloed JEE deployments on other non-OBIEE servers. Two further nuances are worth noting. Clustered from the installation: the WebLogic Server license provided with out-of-the-box Oracle BI 12c does not allow for horizontal scale-out. An enterprise WebLogic Server license needs to be obtained for this advanced functionality. Contains an embedded Web/HTTP server, not Oracle HTTP Server (OHS): WebLogic Server does not contain a separate HTTP server with the installation. The Oracle BI Enterprise Deployment Guide (available on oracle.com) discusses separating the Application Tier from the Web/HTTP tier, suggesting Oracle HTTP Server. These items are simply a few nuances of the product suite in relation to Oracle BI 12c. Most software products contain a short list such as this one. However, the better you understand the nuances, the easier it will be to ensure that you have a more successful implementation. It also allows your team to be as prepared in advance as possible. Be sure to consult your Oracle sales representative to assist with licensing concerns. Despite these nuances, we highly recommend that, in order to learn more about the installation features, configuration options, administration, and maintenance of WebLogic, you not only research it in relation to Oracle BI, but also in relation to its standalone form. That is to say that there is much more information at large on the topic of WebLogic Server itself than on WebLogic Server as it relates to Oracle BI. Understanding this approach to self-educating or web searching should provide you with more efficient results. WebLogic Domain The highest unit of management for controlling the WebLogic Server installation is called a domain. 
A domain is a logically related group of WebLogic Server resources that you manage as a unit. A domain always includes, and is centrally managed by, one Administration Server. Additional WebLogic Server instances, which are controlled by the Administration Server for the domain, are called Managed Servers. The configuration for all the servers in the domain is stored in the configuration repository, the config.xml file, which resides on the machine hosting the Administration Server. Upon installing and configuring Oracle BI 12c, the domain, bi, is established within the WebLogic Server. This domain is the recommended name for each Oracle BI 12c implementation and should not be modified. The domain path for the bi domain may appear as ORACLE_HOME/user_projects/domains/bi. This directory for the bi domain is also referred to as the DOMAIN_HOME or BI_DOMAIN folder WebLogic Administration Server The WebLogic Server is an enterprise software suite that manages a myriad of application server components, mainly focusing on Java technology. It is also comprised of many ancillary components, which enables the software to scale well, and also makes it a good choice for distributed environments and high-availability. Clearly, it is good enough to be at the core of Oracle Fusion Middleware. One of the most crucial components of WebLogic Server is WebLogic Administration Server. When installing the WebLogic Server software, the Administration Server is automatically installed with it. It is the Administration Server that not only controls all subsequent WebLogic Server instances, called Managed Servers, but it also controls such aspects as authentication-provider security (for example, LDAP) and other application-server-related configurations. WebLogic Server installs on the operating system and ultimately runs as a service on that machine. The WebLogic Server can be managed in several ways. The two main methods are via the Graphical User Interface (GUI) web application called WebLogic Administration Console, or via command line using the WebLogic Scripting Tool (WLST). You access the Administration Console from any networked machine using a web-based client (that is, a web browser) that can communicate with the Administration Server through the network and/or firewall. The WebLogic Administration Server and the WebLogic Server are basically synonymous. If the WebLogic Server is not running, the WebLogic Administration Console will be unavailable as well. WebLogic Managed Server Web applications, Enterprise Java Beans (EJB), and other resources are deployed onto one or more Managed Servers in a WebLogic Domain. A managed server is an instance of a WebLogic Server in a WebLogic Server domain. Each WebLogic Server domain has at least one instance, which acts as the Administration Server just discussed. One administration server per domain must exist, but one or more managed servers may exist in the WebLogic Server domain. In a production deployment, Oracle BI is deployed into its own managed server. The Oracle BI installer installs two WebLogic server instances, the Admin Server and a managed server, bi_server1. Oracle BI is deployed into the managed server bi_server1, and is configured by default to resolve to port 19502; the Admin Server resolves to port 19500. Historically, this has been port 9704 for the Oracle BI managed server, and port 7001 for the Admin Server. 
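Because WLST is a Python-based (Jython) shell, simple administrative checks like this can be scripted. The following is a minimal sketch rather than a production script: the credentials are placeholders, and the t3://localhost:19500 URL merely assumes the default Admin Server port mentioned above.

# save as check_bi_server.py and run with wlst.sh check_bi_server.py (wlst.cmd on Windows)
ADMIN_URL = 't3://localhost:19500'   # Admin Server listen address (assumed default)
ADMIN_USER = 'weblogic'              # placeholder credentials; use a secure store in practice
ADMIN_PASSWORD = 'welcome1'

# connect to the running Administration Server
connect(ADMIN_USER, ADMIN_PASSWORD, ADMIN_URL)

# report the lifecycle state of the Oracle BI managed server
state('bi_server1', 'Server')

disconnect()
exit()

A check of this kind is handy in start-up and shutdown scripts, since the Administration Console itself is unavailable whenever the Admin Server is down.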
When administering the WebLogic Server via the Administration Console, the WebLogic Administration Server instance appears in the same list of servers, which also includes any managed servers. As a best practice, the WebLogic Administration Server should be used for configuration and management of the WebLogic Server only, and not contain any additionally deployed applications, EJBs, and so on. One thing to note is that the Enterprise Manager Fusion Control is actually a JEE application deployed to the Administration Server instance, which is why its web client is accessible under the same port as the Admin Server. It is not necessarily a native application deployment to the core WebLogic Server, but gets deployed and configured during the Oracle BI installation and configuration process automatically. In the deployments page within the Administration Console, you will find a deployment named em. WebLogic Node Manager The general idea behind Node Manager is that it takes on somewhat of a middle-man role. That is to say, the Node Manager provides a communication tunnel between the WebLogic Administration Server and any Managed Servers configured within the WebLogic Domain. When the WebLogic Server environment is contained on a single physical server, it may be difficult to recognize the need for a Node Manager. It is very necessary and, as part of any of your ultimate start-up and shutdown scripts for Oracle BI, the Node Manager lifecycle management will have to be a part of that process. Node Manager’s real power comes into play when Oracle BI is scaled out horizontally on one or more physical servers. Each scaled-out deployment of WebLogic Server will contain a Node Manager. If the Node Manager is not running on the server on which the Managed Server is deployed, then the core Administration Server will not be able to issue start or stop commands to that server. As such, if the Node Manager is down, communication with the overall cluster will be affected. The following figure shows how machines A, B, and C are physically separated, each containing a Node Manager. You can see that the Administration Server communicates to the Node Managers, and not the Managed Servers, directly: System tools controlled by WebLogic We briefly discussed the WebLogic Administration Console, which controls the administrative configuration of the WebLogic Server Domain. This includes the components managed within it, such as security, deployed applications, and so on. The other management tool that provides control of the deployed Oracle BI application ancillary deployments, libraries, and several other configurations, is called the Enterprise Manager Fusion Middleware Control. This seems to be a long name for single web-based tool. As such, the name is often shortened to “Fusion Control” or “Enterprise Manager.” Reference to either abbreviated title in the context of Oracle BI should ensure fellow Oracle BI teammates understand what you mean. Security It would be difficult to discuss the overall architecture of Oracle BI without at least giving some mention to how the basics of security, authentication, and authorization are applied. By default, installing Oracle WebLogic Server provides a default Lightweight Directory Access Protocol (LDAP) server, referred to as the WebLogic Server Embedded LDAP server. This is a standards-compliant LDAP system, which acts as the default authentication method for out-of-the-box Oracle BI. 
Integration of secondary LDAP providers, such as Oracle Internet Directory (OID) or Microsoft Active Directory (MSAD), is crucial to leveraging most organizations' identity-management systems. The combination of multiple authentication providers is possible; in fact, it is commonplace. For example, a configuration may wish to have users that exist in both the Embedded LDAP server and MSAD authenticate and have access to Oracle BI. Potentially, an organization may want another set of users to be stored in a relational database repository, or have a set of relational database tables control the authorization that users have in relation to the Oracle BI system. WebLogic Server provides configuration opportunities for each of these scenarios. Oracle BI security incorporates the Fusion Middleware Security model, Oracle Platform Security Services (OPSS). This has a positive influence over managing all aspects of Oracle BI, as it provides a very granular level of authorization and a large number of authentication and authorization-integration mechanisms. OPSS also introduces to Oracle BI the concept of managing privileges by application role instead of directly by user or group. It abides by open standards to integrate with security mechanisms that are growing in popularity, such as the Security Assertion Markup Language (SAML) 2.0. Other well-known single-sign-on mechanisms such as SiteMinder and Oracle Access Manager already have pre-configured integration points within Oracle BI Fusion Control. Oracle BI 12c and Oracle BI 11g security is managed differently from the legacy Oracle BI 10g versions. Oracle BI 12c no longer has backward compatibility with the legacy version of Oracle BI 10g, and the focus should be on following the new security configuration best practices of Oracle BI 12c: an Oracle BI best practice is to manage security by Application Roles; understanding the differences between the Identity Store, Credential Store, and Policy Store is critical for advanced security configuration and maintenance; and, as of Oracle BI 12c, the OPSS metadata is now stored in a relational repository, which is installed as part of the RCU-schemas installation process that takes place prior to executing the Oracle BI 12c installation on the application server. Managing by Application Roles In Oracle BI 11g, the default security model is the Oracle Fusion Middleware security model, which has a very broad scope. A universal Information Technology security-administration best practice is to set permissions or privileges for a specific point of access on a group, and not on individual users. The same idea applies here, except that there is another, enterprise-level aggregation of users and even groups, called an Application Role. Application Roles can contain other application roles, groups, or individual users. Access privileges to a certain object, such as a folder, web page, or column, should always be assigned to an application role. Application roles for Oracle BI can be managed in the Oracle Enterprise Manager Fusion Middleware Control interface. They can also be scripted using the WLST command-line interface. Security Providers Fusion Middleware security can seem complex at first, but knowing the correct terminology and understanding how the most important components communicate with each other and the application at large is extremely important as it relates to security management. 
Oracle BI uses three main repositories for accessing authentication and authorization information, all of which are explained in the following sections. Identity Store Identity Store is the authentication provider, which may also provide authorization metadata. A simple mnemonic here is that this store tells Oracle BI how to “Identify” any users attempting to access the system. An example of creating an Identity Store would be to configure an LDAP system such as Oracle Internet Directory or Microsoft Active Directory to reference users within an organization. These LDAP configurations are referred to as Authentication Providers. Credential Store The credential store is ultimately for advanced Oracle configurations. You may touch upon this when establishing an enterprise Oracle BI deployment, but not much thereafter, unless integrating the Oracle BI Action Framework or something equally as complex. Ultimately, the credential store does exactly what its name implies – it stores credentials. Specifically, it is used to store credentials of other applications, which the core application (that is, Oracle BI) may access at a later time without having to re-enter said credentials. An example of this would be integrating Oracle BI with the Oracle Enterprise Management (EPM) suite. In this example, let's pretend there is an internal requirement at Company XYZ for users to access an Oracle BI dashboard. Upon viewing said dashboard, if a report with discrepancies is viewed, the user requires the ability to click on a link, which opens an Oracle EPM Financial Report containing more details about the concern. If not all users accessing the Oracle BI dashboard have credentials to access to the Oracle EPM environment directly, how could they open and view the report without being prompted for credentials? The answer would be that the credential store would be configured with the credentials of a central user having access to the Oracle EPM environment. This central user's credentials (encrypted, of course) are passed along with the dashboard viewer's request and presto, access! Policy Store The policy store is quite unique to Fusion Middleware security and leverages a security standard referred to as XACML, which ultimately provides granular access and privilege control for an enterprise application. This is one of the reasons why managing by Application Roles becomes so important. It is the individual Application Roles that are assigned policies defining access to information within Oracle BI. Stated another way, the application privileges, such as the ability to administer the Oracle BI RPD, are assigned to a particular application role, and these associations are defined in the policy store. The following figure shows how each area of security management is controlled: These three types of security providers within Oracle Fusion Middleware are integral to Oracle BI architecture. Further recommended research on this topic would be to look at Oracle Fusion Middleware Security, OPSS, and the Application Development Framework (ADF). System Requirements The first thing to recognize with infrastructure requirements prior to deploying Oracle BI 12c is that its memory and processor requirements have increased since previous versions. The Java Application server, WebLogic Server, installs with the full version of its software (though under a limited/restricted license, as already discussed). A multitude of additional Java libraries and applications are also deployed. 
Be prepared for a recommended minimum 8 to 16 GB Random Access Memory (RAM) requirement for an Enterprise deployment, and a 6 to 8 GB RAM minimum requirement for a workstation deployment. Client tools Oracle BI 12c has a separate client tools installation that requires Microsoft Windows XP or a more recent version of the Windows Operating System (OS). The Oracle BI 12c client tools provide the majority of client-to-server management capabilities required for normal day-to-day maintenance of the Oracle BI repository and related artefacts. The client-tools installation is usually reserved for Oracle BI developers who architect and maintain the Oracle BI metadata repository, better known as the RPD, which stems from its binary file extension (.rpd). The Oracle BI 12c client-tools installation provides each workstation with the Administration tool, Job Manager, and all command-line Application Programming Interface (API) executables. In Oracle BI 12c, a 64-bit Windows OS is a requirement for installing the Oracle BI Development Client tools. It has been observed that, with some initial releases of the Oracle BI 12c client tools, ODBC DSN connectivity does not work in Windows Server 2012. Therefore, utilizing Windows Server 2012 as a development environment will be ineffective if attempting to open the Administration Tool and connect to the OBIEE Server in online mode. Multi-User Development Environment One of the key features when developing with Oracle BI is the ability for multiple metadata developers to develop simultaneously. Although the use of the term "simultaneously" can vary among the technical communities, the use of concurrent development within the Oracle BI suite requires Oracle BI's Multi-User Development Environment (MUD) configuration, or some other process developed by third-party Oracle partners. The MUD configuration itself is fairly straightforward and ultimately relies on the Oracle BI administrator's ability to divide metadata modeling responsibilities into projects. Projects – which are usually defined and delineated by logical fact table definitions – can be assigned to one or more metadata developers. In previous versions of Oracle BI, a metadata developer could install the entire Oracle BI product suite on an up-to-date laptop or commodity desktop workstation and successfully develop, test, and deploy an Oracle BI metadata model. The system requirements of Oracle BI 12c require a significant amount of processor and RAM capacity in order to perform development efforts on a standard-issue workstation or laptop. If an organization currently leverages the Oracle BI multi-user development environment, or plans to do so with the current release, this raises a couple of questions: How do we get our developers the best environment suitable for developing our metadata? Do we need to procure new hardware? Microsoft Windows is a requirement for the Oracle BI client tools. However, the Oracle BI client tool does not include the server component of the Oracle BI environment. It only allows for connecting from the developer's workstation to the Oracle BI server instance. In a multi-user development environment, this poses a serious problem as only one metadata repository (RPD) can exist on any one Oracle BI server instance at any given time. 
If two developers are working from their respective workstations at the same time and wish to see their latest modifications published in a rapid application development (RAD) cycle, this type of iterative effort fails, as one developer's published changes will overwrite the other’s in real-time. To resolve the issue there are two recommended solutions. The first is an obvious localized solution. This solution merely upgrades the Oracle BI developers’ workstations or laptops to comply with the minimum requirements for installing the full Oracle BI environment on said machines. This upgrade should be both memory- (RAM) and processor- (MHz) centric. 16GB+ RAM and a dual-core processor are recommended. A 64-bit operating system kernel is required. Without an upgraded workstation from which to work, Oracle BI metadata developers will sit at a disadvantage for general iterative metadata development, and will especially be disenfranchised if interfacing within a multi-user development environment. The second solution is one that takes advantage of virtual machines’ (VM) capacity within the organization. Virtual machines have become a staple within most Information Technology departments, as they are versatile and allow for speedy proposition of server environments. For this scenario, it is recommended to create a virtual-machine template of an Oracle BI environment from which to duplicate and “stand up” individual virtual machine images for each metadata developer on the Oracle BI development team. This effectively provides each metadata developer with their own Oracle BI development environment server, which contains the fully deployed Oracle BI server environment. Each developer then has the ability to develop and test iteratively by connecting to their assigned virtual server, without fear that their efforts will conflict with another developer's. The following figure illustrates how an Oracle BI MUD environment can leverage either upgraded developer-workstation hardware or VM images, to facilitate development: This article does not cover the installation, configuration, or best practices for developing in a MUD environment. However, the Oracle BI development team deserves a lot of credit for documenting these processes in unprecedented detail. The Oracle BI 11g MUD documentation provides a case study, which conveys best practices for managing a complex Oracle BI development lifecycle. When ready to deploy a MUD environment, it is highly recommended to peruse this documentation first. The information in this section merely seeks to convey best practice in facilitating a developer’s workstation when using MUD. Certifications matrix Oracle BI 12c largely complies with the overall Fusion Middleware infrastructure. This common foundation allows for a centralized model to communicate with operating systems, web servers, and other ancillary components that are compliant. Oracle does a good job of updating a certification matrix for each Fusion Middleware application suite per respective product release. The certification matrix for Oracle BI 12c can be found on the Oracle website at the following locations: http://www.oracle.com/technetwork/middleware/fusion-middleware/documentation/fmw-1221certmatrix-2739738.xlsx and http://www.oracle.com/technetwork/middleware/ias/downloads/fusion-certification-100350.html. The certification matrix document is usually provided in Microsoft Excel format and should be referenced before any project or deployment of Oracle BI begins. 
This will ensure that infrastructure components such as the selected operating system, web server, web browsers, LDAP server, and so on, will actually work when integrated with the product suite. Scaling out Oracle BI 12c There are several reasons why an organization may wish to expand their Oracle BI footprint. This can range anywhere from requiring a highly available environment to achieving high levels of concurrent usage over time. The number of total end users, the number of total concurrent end users, the volume of queries, the size of the underlying data warehouse, and cross-network latency are even more factors to consider. Scaling out an environment has the potential to solve performance issues and stabilize the environment. When scoping out the infrastructure for an Oracle BI deployment, there are several crucial decisions to be made. These decisions can be greatly assisted by preparing properly, using Oracle's recommended guides for clustering and deploying Oracle BI on an enterprise scale. Pre-Configuration Run-Down Configuring the Oracle BI product suite, specifically when involving scaling out or setting up high-availability (HA), takes preparation. Proactively taking steps to understand what it takes to correctly establish or pre-configure the infrastructure required to support any level of fault tolerance and high-availability is critical. Even if the decision to scale-out from the initial Oracle BI deployment hasn't been made, if the potential exists, proper planning is recommended. Shared Storage We would be remiss not to highlight one of the most important concepts of scaling out Oracle BI, specifically for High-Availability: shared storage. The idea of shared storage is that, in a fault-tolerance environment, there are binary files and other configuration metadata that needs to be shared across the nodes. If these common elements were not shared, then, if one node were to fail, there is a potential loss of data. Most importantly is that, in a highly available Oracle BI environment, there can be only one WebLogic Administration Server running for that environment at any one time. A HA configuration makes one Administration Server active while the other is passive. If the appropriate pre-configuration steps for shared storage (as well as other items in the high-availability guide) are not properly completed, one should not expect accurate results from their environment. OBIEE 12c requires you to modify the Singleton Data Directory (SDD) for your Oracle BI configuration found at ORACLE_HOME/user_projects/domains/bi/data, so that the files within that path are moved to a shared storage location that would be mounted to the scaled-out servers on which a HA configuration would be implemented. To change this, one would need to modify the ORACLE_HOME/user_projects/domains/bi/config/fmwconfig/bienv/core/bi-environment.xml file to set the path of the bi:singleton-data-directory element to the full path of the shared mounted file location that contains a copy of the bidata folder, which will be referenced by one ore more scaled-out HA Oracle 12c servers. 
For example, change the XML file element: <bi:singleton-data-directory>/oraclehome/user_projects/domains/bi/bidata/</bi:singleton-data-directory> to reflect a shared NAS or SAN mount whose folder names and structure are in line with the IT team's standard naming conventions, where the /bidata folder is the folder from the main Oracle BI 12c instance that gets copied to the shared directory: <bi:singleton-data-directory>/mount02/obiee_shared_settings/bidata/</bi:singleton-data-directory> Clustering A major benefit of Oracle BI's ability to leverage WebLogic Server as the Java application server tier is that, per the default installation, Oracle BI gets established in a clustered architecture. No additional configuration is necessary to set this architecture in motion. Clearly, installing Oracle BI on a single server only provides a single server with which to interface; however, upon doing so, Oracle BI is installed into a single-node clustered-application-server environment. Additional clustered nodes of Oracle BI can then be configured to establish and expand the server, either horizontally or vertically. Vertical vs Horizontal With respect to the enterprise architecture and infrastructure of the Oracle BI environment, a clustered environment can be expanded in one of two ways: horizontally (scale-out) and vertically (scale-up). A horizontal expansion is the typical expansion type when clustering. It is represented by installing and configuring the application on a separate physical server, with reference to the main server application. A vertical expansion is usually represented by expanding the application on the same physical server on which the main server application resides. A horizontally expanded system can then, additionally, be vertically expanded. There are benefits to both scaling options. The decision to scale the system one way or the other is usually predicated on the cost of additional physical servers, server limitations, resources such as memory or processors, or an increase in usage activity by end users. Some considerations that may be used to assess which approach is best for your specific implementation are as follows:
- Load-balancing capabilities and the need for an Active-Active versus Active-Passive architecture
- The need for failover or high availability
- Costs for processor and memory enhancements versus the cost of new servers
- The anticipated increase in concurrent user queries
- Any realized decrease in performance due to an increase in user activity
Oracle BI Server (System Component) Cluster Controller When discussing scaling out the Oracle BI Server cluster, it is a common mistake to confuse WebLogic Server application clustering with the Oracle BI Server Cluster Controller. Currently, Oracle BI can only have a single metadata repository (RPD) reference associated with an Oracle BI Server deployment instance at any single point in time. Because of this, the Oracle BI Server engine leverages a failover concept to ensure that some level of high availability exists when the environment is scaled. In an Oracle BI scaled-out clustered environment, a secondary node, which has an instance of Oracle BI installed, will contain a secondary Oracle BI Server engine. From the main Oracle BI Managed Server containing the primary Oracle BI Server instance, the secondary Oracle BI Server instance gets established as the failover server engine using the Oracle BI Server Cluster Controller. This configuration takes place in the Enterprise Manager Fusion Middleware Control console. 
Based on this configuration, the scaled-out Oracle BI Server engine acts in an active-passive mode. That is to say, when the main Oracle BI Server engine instance fails, the secondary, or passive, Oracle BI Server engine becomes active to route requests and field queries. Summary This article introduced the key considerations for scaling out Oracle BI 12c, covering the certification matrix, shared storage, WebLogic Server clustering, and the Oracle BI Server Cluster Controller. Resources for Article: Further resources on this subject: Oracle 12c SQL and PL/SQL New Features [article] Schema Validation with Oracle JDeveloper - XDK 11g [article] Creating external tables in your Oracle 10g/11g Database [article]

article-image-query-completesuggest
Packt
25 May 2015
37 min read

Query complete/suggest

This article by the authors David Smiley, Eric Pugh, Kranti Parisa, and Matt Mitchel of the book, Apache Solr Enterprise Search Server - Third Edition, covers one of the most effective features of a search user interface—automatic/instant-search or completion of query input in a search input box. It is typically displayed as a drop-down menu that appears automatically after typing. There are several ways this can work: (For more resources related to this topic, see here.) Instant-search: Here, the menu is populated with search results. Each row is a document, just like the regular search results are, and as such, choosing one takes the user directly to the information instead of a search results page. At your discretion, you might opt to consider the last word partially typed. Examples of this are the URL bar in web browsers and various person search services. This is particularly effective for quick lookup scenarios against identifying information such as a name/title/identifier. It's less effective for broader searches. It's commonly implemented either with edge n-grams or with the Suggester component. Query log completion: If your application has sufficient query volume, then you should perform the query completion against previously executed queries that returned results. The pop-up menu is then populated with queries that others have typed. This is what Google does. It takes a bit of work to set this up. To get the query string and other information, you could write a custom search component, or parse Solr's log files, or hook into the logging system and parse it there. The query strings could be appended to a plain query log file, or inserted into a database, or added directly to a Solr index. Putting the data into a database before it winds up in a Solr index affords more flexibility on how to ultimately index it in Solr. Finally, at this point, you could index the field with an EdgeNGramTokenizer and perform searches against it, or use a KeywordTokenizer and then use one of the approaches listed for query term completion below. We recommend reading this excellent article by Jay Hill on doing this with EdgeNGrams at http://lucidworks.com/blog/auto-suggest-from-popular-queries-using-edgengrams/. Monitor your user's queries! Even if you don't plan to do query log completion, you should capture useful information about each request for ancillary usage analysis, especially to monitor which searches return no results. Capture the request parameters, the response time, the result count, and add a timestamp. Query term completion: The last word of the user's query is searched within the index as a prefix, and other indexed words starting with that prefix are provided. This type is an alternative to query log completion and it's easy to implement. There are several implementation approaches: facet the word using facet.prefix, use Solr's Suggester feature, or use the Terms component. You should consider these choices in that order. Facet/Field value completion: This is similar to query term completion, but it is done on data that you would facet or filter on. The pop-up menu of choices will ideally give suggestions across multiple fields with a label telling you which field each suggestion is for, and the value will be the exact field value, not the subset of it that the user typed. This is particularly useful when there are many possible filter choices. We've seen it used at Mint.com and elsewhere to great effect, but it is under-utilized in our opinion. 
If you don't have many fields to search, then the Suggester component could be used with one dictionary per field. Otherwise, build a search index dedicated to this information that contains one document per field and value pair, and use an edge n-gram approach to search it. There are other interesting query completion concepts we've seen on sites too, and some of these can be combined effectively. First, we'll cover a basic approach to instant-search using edge n-grams. Next, we'll describe three approaches to implementing query term completion—it's a popular type of query completion, and these approaches highlight different technologies within Solr. Lastly, we'll cover an approach to implement field-value suggestions for one field at a time, using the Suggester search component. Instant-search via edge n-grams As mentioned in the beginning of this section, instant-search is a technique in which a partial query is used to suggest a set of relevant documents, not terms. It's great for quickly finding documents by name or title, skipping the search results page. Here, we'll briefly describe how you might implement this approach using edge n-grams, which you can think of as a set of token prefixes. This is much faster than the equivalent wildcard query because the prefixes are all indexed. The edge n-gram technique is arguably more flexible than other suggest approaches: it's possible to do custom sorting or boosting, to use the highlighter easily to highlight the query, to offer infix suggestions (it isn't limited to matching titles left-to-right), and it's possible to filter the suggestions with a filter query, such as the current navigation filter state in the UI. It should be noted, though, that this technique is more complicated and increases indexing time and index size. It's also not quite as fast as the Suggester component. One of the key components to this approach is the EdgeNGramFilterFactory component, which creates a series of tokens for each input token for all possible prefix lengths. The field type definition should apply this filter to the index analyzer only, not the query analyzer. Enhancements to the field type could include adding filters such as LowerCaseFilterFactory, TrimFilterFactory, ASCIIFoldingFilterFactory, or even a PatternReplaceFilterFactory for normalizing repetitive spaces. Furthermore, you should set omitTermFreqAndPositions=true and omitNorms=true in the field type since these index features consume a lot of space and won't be needed. The Solr Admin Analysis tool can really help with the design of the perfect field type configuration. Don't hesitate to use this tool! A minimalist query for this approach is to simply query the n-grams field directly; since the field already contains prefixes, this just works. It's even better to have only the last word in the query search this field while the other words search a field indexed normally for keyword search. Here's an example: assuming a_name_wordedge is an n-grams based field and the user's search text box contains simple mi: http://localhost:8983/solr/mbartists/select?defType=edismax&qf=a_name&q.op=AND&q=simple a_name_wordedge:mi. The search client here inserted a_name_wordedge: before the last word. The combination of field type definition flexibility (custom filters and so on), and the ability to use features such as DisMax, custom boosting/sorting, and even highlighting, really make this approach worth exploring. 
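To make the moving parts concrete, here is a minimal, illustrative schema.xml sketch for such a field type; it is not taken from the book's bundled configuration, so the field type name, the tokenizer choice, and the gram sizes are assumptions that you would tune to your own data:

<fieldType name="text_wordedge" class="solr.TextField"
           omitTermFreqAndPositions="true" omitNorms="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <!-- index time only: emit all prefixes of each token -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>
<field name="a_name_wordedge" type="text_wordedge" indexed="true" stored="false"/>
<copyField source="a_name" dest="a_name_wordedge"/>

The copyField directive keeps the n-grams field populated from the normally indexed name field at index time, so a query like the one shown above can combine both fields.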
Query term completion via facet.prefix Most people don't realize that faceting can be used to implement query term completion, but it can. This approach has the unique and valuable benefit of returning completions filtered by filter queries (such as faceted navigation state) and by query words prior to the last one being completed. This means the completion suggestions should yield matching results, which is not the case for the other techniques. However, there are limits to its scalability in terms of memory use and inappropriateness for real-time search applications. Faceting on a tokenized field is going to use an entry in the field value cache (based on UnInvertedField) to hold all words in memory. It will use a hefty chunk of memory for many words, and it's going to take a non-trivial amount of time to build this cache on every commit during the auto-warming phase. For a data point, consider MusicBrainz's largest field: t_name (track name). It has nearly 700K words in it. It consumes nearly 100 MB of memory and it took 33 seconds to initialize on my machine. The mandatory initialization per commit makes this approach unsuitable for real-time-search applications. Measure this for yourself. Perform a trivial query to trigger its initialization and measure how long it takes. Then search Solr's statistics page for fieldValueCache. The size is given in bytes next to memSize. This statistic is also logged quite clearly. For this example, we have a search box searching track names and it contains the following: michael ja All of the words here except the last one become the main query for the term suggest. For our example, this is just michael. If there isn't anything, then we'd want to ensure that the request handler used would search for all documents. The faceted field is a_spell, and we want to sort by frequency. We also want there to be at least one occurrence, and we don't want more than five suggestions. We don't need the actual search results, either. This leaves the facet.prefix faceting parameter to make this work. This parameter filters the facet values to those starting with this value. Remember that facet values are the final result of text analysis, and therefore are probably lowercased for fields you might want to do term completion on. You'll need to pre-process the prefix value similarly, or else nothing will be found. We're going to set this to ja, the last word that the user has partially typed. Here is the URL for such a search http://localhost:8983/solr/mbartists/select?q=michael&df=a_spell&wt=json&omitHeader=true&indent=on&facet=on&rows=0&facet.limit=5&facet.mincount=1&facet.field=a_spell&facet.prefix=ja. When setting this up for real, we recommend creating a request handler just for term completion with many of these parameters defined there, so that they can be configured separately from your application. In this example, we're going to use Solr's JSON response format. Here is the result: { "response":{"numFound":1919,"start":0,"docs":[]}, "facet_counts":{    "facet_queries":{},    "facet_fields":{      "a_spell":[        "jackson",17,        "james",15,        "jason",4,        "jay",4,        "jacobs",2]},    "facet_dates":{},    "facet_ranges":{}}} This is exactly the information needed to populate a pop-up menu of choices that the user can conveniently choose from. However, there are some issues to be aware of with this feature: You may want to retain the case information of what the user is typing so that it can then be re-applied to the Solr results. 
Remember that facet.prefix will probably need to be lowercased, depending on text analysis. If stemming text analysis is performed on the field at the time of indexing, then the user might get completion choices that are clearly wrong. Most stemmers, namely Porter-based ones, stem off the suffix to an invalid word. Consider using a minimal stemmer, if any. For stemming and other text analysis reasons, you might want to create a separate field with suitable text analysis just for this feature. In our example here, we used a_spell on purpose because spelling suggestions and term completion have the same text analysis requirements. If you would like to perform term completion of multiple fields, then you'll be disappointed that you can't do so directly. The easiest way is to combine several fields at index time. Alternatively, a query searching multiple fields with faceting configured for multiple fields can be performed. It would be up to you to merge the faceting results based on ordered counts. Query term completion via the Suggester A high-speed approach to implement term completion, called the Suggester, was introduced in Version 3 of Solr. Until Solr 4.7, the Suggester was an extension of the spellcheck component. It can still be used that way, but it now has its own search component, which is how you should use it. Similar to spellcheck, it's not necessarily as up to date as your index and it needs to be built. However, the Suggester only takes a couple of seconds or so for this usually, and you are not forced to do this per commit, unlike with faceting. The Suggester is generally very fast—a handful of milliseconds per search at most for common setups. The performance characteristics are largely determined by a configuration choice (shown later) called lookupImpl, in which we recommend WFSTLookupFactory for query term completion (but not for other suggestion types). Additionally, the Suggester uniquely includes a method of loading its dictionary from a file that optionally includes a sorting weight. We're going to use it for MusicBrainz's artist name completion. The following is in our solrconfig.xml: <requestHandler name="/a_term_suggest" class="solr.SearchHandler" startup="lazy"> <lst name="defaults">    <str name="suggest">true</str>    <str name="suggest.dictionary">a_term_suggest</str>    <str name="suggest.count">5</str> </lst> <arr name="components">    <str>aTermSuggester</str> </arr> </requestHandler>    <searchComponent name="aTermSuggester" class="solr.SuggestComponent"> <lst name="suggester">    <str name="name">a_term_suggest</str>    <str name="lookupImpl">WFSTLookupFactory</str>    <str name="field">a_spell</str>    <!-- <float name="threshold">0.005</float> -->    <str name="buildOnOptimize">true</str> </lst> </searchComponent> The first part of this is a request handler definition just for using the Suggester. The second part of this is an instantiation of the SuggestComponent search component. The dictionary here is loaded from the a_spell field in the main index, but if a file is desired, then you can provide the sourceLocation parameter. The document frequency threshold for suggestions is commented here because MusicBrainz has unique names that we don't want filtered out. However, in common scenarios, this threshold is advised. The Suggester needs to be built, which is the process of building the dictionary from its source into an optimized memory structure. 
If you set storeDir, it will also save it such that the next time Solr starts, it will load automatically and be ready. If you try to get suggestions before it's built, there will be no results. The Suggester only takes a couple of seconds or so to build and so we recommend building it automatically on startup via a firstSearcher warming query in solrconfig.xml. If you are using Solr 5.0, then this is simplified by adding a buildOnStartup Boolean to the Suggester's configuration. To be kept up to date, it needs to be rebuilt from time to time. If commits are infrequent, you should use the buildOnCommit setting. We've chosen the buildOnOptimize setting as the dataset is optimized after it's completely indexed; and then, it's never modified. Realistically, you may need to schedule a URL fetch to trigger the build, as well as incorporate it into any bulk data loading scripts you develop. Now, let's issue a request to the Suggester. Here's a completion for the incomplete query string sma http://localhost:8983/solr/mbartists/a_term_suggest?q=sma&wt=json. And here is the output, indented: { "responseHeader":{    "status":0,    "QTime":1}, "suggest":{"a_term_suggest":{    "sma":{      "numFound":5,      "suggestions":[{        "term":"sma",        "weight":3,        "payload":""},      {        "term":"small",        "weight":110,        "payload":""},      {        "term":"smart",        "weight":50,        "payload":""},      {        "term":"smash",        "weight":36,        "payload":""},      {        "term":"smalley",        "weight":9,        "payload":""}]}}}} If the input is found, it's listed first; then suggestions are presented in weighted order. In the case of an index-based source, the weights are, by default, the document frequency of the value. For more information about the Suggester, see the Solr Reference Guide at https://cwiki.apache.org/confluence/display/solr/Suggester. You'll find information on lookupImpl alternatives and other details. However, some secrets of the Suggester are still undocumented, buried in the code. Look at the factories for more configuration options. Query term completion via the Terms component The Terms component is used to expose raw indexed term information, including term frequency, for an indexed field. It has a lot of options for paging into this voluminous data and filtering out terms by term frequency. The Terms component has the benefit of using no Java heap memory, and consequently, there is no initialization penalty. It's always up to date with the indexed data, like faceting but unlike the Suggester. The performance is typically good, but for high query load on large indexes, it will suffer compared to the other approaches. An interesting feature unique to this approach is a regular expression term match option. This can be used for case-insensitive matching, but it probably doesn't scale to many terms. For more information about this component, visit the Solr Reference Guide at https://cwiki.apache.org/confluence/display/solr/The+Terms+Component. Field-value completion via the Suggester In this example, we'll show you how to suggest complete field values. This might be used for instant-search navigation by a document name or title, or it might be used to filter results by a field. It's particularly useful for fields that you facet on, but it will take some work to integrate into the search user experience. This can even be used to complete multiple fields at once by specifying suggest.dictionary multiple times. 
To complete values across many fields at once, you should consider an alternative approach than what is described here. For example, use a dedicated suggestion index of each name-value pair and use an edge n-gram technique or shingling. We'll use the Suggester once again, but using a slightly different configuration. Using AnalyzingLookupFactory as the lookupImpl, this Suggester will be able to specify a field type for query analysis and another as the source for suggestions. Any tokenizer or filter can be used in the analysis chain (lowercase, stop words, and so on). We're going to reuse the existing textSpell field type for this example. It will take care of lowercasing the tokens and throwing out stop words. For the suggestion source field, we want to return complete field values, so a string field will be used; we can use the existing a_name_sort field for this, which is close enough. Here's the required configuration for the suggest component: <searchComponent name="aNameSuggester" class="solr.SuggestComponent"> <lst name="suggester">    <str name="name">a_name_suggest</str>    <str name="lookupImpl">AnalyzingLookupFactory</str>    <str name="field">a_name_sort</str>    <str name="buildOnOptimize">true</str>    <str name="storeDir">a_name_suggest</str>    <str name="suggestAnalyzerFieldType">textSpell</str> </lst> </searchComponent> And here is the request handler and component: <requestHandler name="/a_name_suggest" class="solr.SearchHandler" startup="lazy"> <lst name="defaults">    <str name="suggest">true</str>    <str name="suggest.dictionary">a_name_suggest</str>    <str name="suggest.count">5</str> </lst> <arr name="components">    <str>aNameSuggester</str> </arr> </requestHandler> We've set up the Suggester to build the index of suggestions after an optimize command. On a modestly powered laptop, the build time was about 5 seconds. Once the build is complete, the /a_name_suggest handler will return field values for any matching query. Here's an example that will make use of this Suggester: http://localhost:8983/solr/mbartists/a_name_suggest?wt=json&omitHeader=true&q=The smashing,pum. Here's the response from that query: { "spellcheck":{    "suggestions":[      "The smashing,pum",{        "numFound":1,        "startOffset":0,        "endOffset":16,        "suggestion":["Smashing Pumpkins, The"]},      "collation","(Smashing Pumpkins, The)"]}} As you can see, the Suggester is able to deal with the mixed case. Ignore The (a stop word) and also the , (comma) we inserted, as this is how our analysis is configured. Impressive! It's worth pointing out that there's a lot more that can be done here, depending on your needs, of course. It's entirely possible to add synonyms, additional stop words, and different tokenizers to the analysis chain. There are other interesting lookupImpl choices. FuzzyLookupFactory can suggest completions that are similarly typed to the input query; for example, words that are similar in spelling, or just typos. AnalyzingInfixLookupFactory is a Suggester that can provide completions from matching prefixes anywhere in the field value, not just the beginning. Other ones are BlendedInfixLookupFactory and FreeTextLookupFactory. See the Solr Reference Guide for further information. Summary In this article we learned about the query complete/suggest feature. We saw the different ways by which we can implement this feature. 
Resources for Article: Further resources on this subject: Apache Solr and Big Data – integration with MongoDB [article] Tuning Solr JVM and Container [article] Apache Solr PHP Integration [article]
article-image-gradient-descent-work
Packt
03 Feb 2016
11 min read

Gradient Descent at Work

In this article by Alberto Boschetti and Luca Massaron, authors of the book Regression Analysis with Python, we will learn about gradient descent, feature scaling, and a simple implementation. (For more resources related to this topic, see here.) As an alternative to the usual classical optimization algorithms, the gradient descent technique is able to minimize the cost function of a linear regression analysis using far fewer computations. In terms of complexity, gradient descent ranks in the order O(n*p), thus making the learning of regression coefficients feasible even with a large n (the number of observations) and a large p (the number of variables). The method works by leveraging a simple heuristic that gradually converges to the optimal solution starting from a random one. Explained in simple words, it resembles walking blind in the mountains. If you want to descend to the lowest valley, even if you don't know and can't see the path, you can proceed approximately by going downhill for a while, then stopping, then heading downhill again, and so on, always moving at each stage toward where the surface descends, until you arrive at a point where you cannot descend anymore. Hopefully, at that point, you will have reached your destination. In such a situation, your only risk is to pass by an intermediate valley (where there is a wood or a lake, for instance) and mistake it for your desired arrival because the land stops descending there. In an optimization process, such a situation is defined as a local minimum (whereas your target is the global minimum, the best minimum possible) and it is a possible outcome of your journey downhill, depending on the function you are working on minimizing. The good news, in any case, is that the error function of the linear model family is bowl-shaped (technically, our cost function is convex), and it is unlikely that you can get stuck anywhere if you properly descend. The necessary steps to work out a gradient-descent-based solution are hereby described. Given our cost function for a set of coefficients (the vector w), J(w) = 1/(2n) * Σi (xi·w - yi)², that is, half the mean of the squared prediction errors over the n observations, we first start by choosing a random initialization for w by choosing some random numbers (taken from a standardized normal curve, for instance, having zero mean and unit variance). Then, we start reiterating an update of the values of w (opportunely using the gradient descent computations) until the marginal improvement from the previous J(w) is small enough to let us figure out that we have finally reached an optimal minimum. We can opportunely update our coefficients, separately one by one, by subtracting from each of them a portion alpha (α, the learning rate) of the partial derivative of the cost function: wj = wj - α * ∂J(w)/∂wj. Here, in our formula, wj is to be intended as a single coefficient (we are iterating over them). After resolving the partial derivative, the final form of the update is: wj = wj - α * (1/n) * Σi (xi·w - yi) * xij. Simplifying everything, our gradient for the coefficient of xj is just the average of the prediction errors (the predicted values minus the actual y values) multiplied by their respective xj values. We have to notice that by introducing more parameters to be estimated during the optimization procedure, we are actually introducing more dimensions to our line of fit (turning it into a hyperplane, a multidimensional surface), and such dimensions have certain commonalities and differences to be taken into account. Alpha, called the learning rate, is very important in the process, because if it is too large, it may cause the optimization to detour and fail. 
You have to think of each gradient as a jump or a run in a direction. If you take it fully, you may happen to pass over the optimal minimum and end up on another rising slope. Too many consecutive long steps may even force you to climb up the cost slope, worsening your initial position (as measured by the cost function, the summed squared loss that scores the overall fit). Using a small alpha, the gradient descent won't jump beyond the solution, but it may take much longer to reach the desired minimum. How to choose the right alpha is a matter of trial and error. Anyway, starting from an alpha such as 0.01 is never a bad choice, based on our experience in many optimization problems. Naturally, the gradient, given the same alpha, will in any case produce shorter steps as you approach the solution. Visualizing the steps in a graph can really give you a hint about whether the gradient descent is working out a solution or not. Though conceptually quite simple (it is based on an intuition that we have surely all applied ourselves, moving step by step toward where we can optimize our result), gradient descent is very effective and indeed scalable when working with real data. Such interesting characteristics elevated it to be the core optimization algorithm in machine learning, not limited to just the linear model family, but also, for instance, extended to neural networks for the process of back propagation that updates all the weights of the neural net in order to minimize the training errors. Surprisingly, gradient descent is also at the core of another complex machine learning algorithm, the gradient boosting tree ensembles, where we have an iterative process minimizing the errors using a simpler learning algorithm (a so-called weak learner, because it is limited by a high bias) to progress toward the optimization. Several of Scikit-learn's linear models in the linear_model module (most notably SGDRegressor) are powered by gradient descent variants, making Scikit-learn our favorite choice while working on data science projects with large and big data. Feature scaling While using the classical statistical approach, not the machine learning one, working with multiple features requires attention while estimating the coefficients, because their similarities can cause a variance inflation of the estimates. Moreover, multicollinearity between variables also bears other drawbacks, because it can make matrix inversion (the matrix operation at the core of normal equation coefficient estimation) very difficult, if not impossible, a problem due to the mathematical limitations of that algorithm. Gradient descent, instead, is not affected at all by reciprocal correlation, allowing the estimation of reliable coefficients even in the presence of perfect collinearity. Anyway, though quite resistant to the problems that affect other approaches, gradient descent's simplicity renders it vulnerable to other common problems, such as the different scale present in each feature. In fact, some features in your data may be represented by measurements in units, others in decimals, and others in thousands, depending on what aspect of reality each feature represents. 
For instance, in the dataset we decide to take as an example, the Boston houses dataset (http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html), a feature is the average number of rooms (a float ranging from about 5 to over 8), others are the percentage of certain pollutants in the air (float between 0 and 1), and so on, mixing very different measurements. When it is the case that the features have a different scale, though the algorithm will be processing each of them separately, the optimization will be dominated by the variables with the more extensive scale. Working in a space of dissimilar dimensions will require more iterations before convergence to a solution (and sometimes, there could be no convergence at all). The remedy is very easy; it is just necessary to put all the features on the same scale. Such an operation is called feature scaling. Feature scaling can be achieved through standardization or normalization. Normalization rescales all the values in the interval between zero and one (usually, but different ranges are also possible), whereas standardization operates removing the mean and dividing by the standard deviation to obtain a unit variance. In our case, standardization is preferable both because it easily permits retuning the obtained standardized coefficients into their original scale and because, centering all the features at the zero mean, it makes the error surface more tractable by many machine learning algorithms, in a much more effective way than just rescaling the maximum and minimum of a variable. An important reminder while applying feature scaling is that changing the scale of the features implies that you will have to use rescaled features also for predictions. A simple implementation Let's try the algorithm first using the standardization based on the Scikit-learn preprocessing module: import numpy as np import random from sklearn.datasets import load_boston from sklearn.preprocessing import StandardScaler from sklearn.linear_model import LinearRegression   boston = load_boston() standardization = StandardScaler() y = boston.target X = np.column_stack((standardization.fit_transform(boston.data), np.ones(len(y)))) In the preceding code, we just standardized the variables using the StandardScaler class from Scikit-learn. This class can fit a data matrix, record its column means and standard deviations, and operate a transformation on itself as well as on any other similar matrixes, standardizing the column data. By means of this method, after fitting, we keep a track of the means and standard deviations that have been used because they will come handy if afterwards we will have to recalculate the coefficients using the original scale. 
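As a small, hedged aside (this helper is not part of the book's code), once the optimization shown below has produced the standardized coefficient vector w, with the intercept as its last element, the means and standard deviations stored by the fitted StandardScaler let you re-express the coefficients on the original feature scale. In recent Scikit-learn versions these are exposed as the mean_ and scale_ attributes (older releases exposed std_ instead of scale_):

# Hypothetical helper: map standardized coefficients back to the original scale
w_arr = np.asarray(w)                      # optimize() below may return a plain list
orig_coefs = w_arr[:-1] / standardization.scale_
orig_intercept = w_arr[-1] - np.sum(w_arr[:-1] * standardization.mean_ / standardization.scale_)
print("Coefficients on the original scale: " + ', '.join("%0.4f" % c for c in orig_coefs))
print("Intercept on the original scale: %0.4f" % orig_intercept)

Predictions made with orig_coefs and orig_intercept on the raw, unscaled features will match those made with w on the standardized ones, which is a handy sanity check.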
Now, we just record a few functions for the following computations: def random_w(p):     return np.array([np.random.normal() for j in range(p)])   def hypothesis(X, w):     return np.dot(X,w)   def loss(X, w, y):     return hypothesis(X, w) - y   def squared_loss(X, w, y):     return loss(X, w, y)**2   def gradient(X, w, y):     gradients = list()     n = float(len(y))     for j in range(len(w)):         gradients.append(np.sum(loss(X, w, y) * X[:,j]) / n)     return gradients   def update(X, w, y, alpha=0.01):     return [t - alpha*g for t, g in zip(w, gradient(X, w, y))]   def optimize(X, y, alpha=0.01, eta = 10**-12, iterations = 1000):     w = random_w(X.shape[1])     for k in range(iterations):         SSL = np.sum(squared_loss(X,w,y))         new_w = update(X,w,y, alpha=alpha)         new_SSL = np.sum(squared_loss(X,new_w,y))         w = new_w         if k>=5 and (new_SSL - SSL <= eta and new_SSL - SSL >= -eta):             return w     return w We can now calculate our regression coefficients: w = optimize(X, y, alpha = 0.02, eta = 10**-12, iterations = 20000) print ("Our standardized coefficients: " +   ', '.join(map(lambda x: "%0.4f" % x, w))) Our standardized coefficients: -0.9204, 1.0810, 0.1430, 0.6822, -2.0601, 2.6706, 0.0211, -3.1044, 2.6588, -2.0759, -2.0622, 0.8566, -3.7487, 22.5328 A simple comparison with Scikit-learn's solution can prove if our code worked fine: sk=LinearRegression().fit(X[:,:-1],y) w_sk = list(sk.coef_) + [sk.intercept_] print ("Scikit-learn's standardized coefficients: " + ', '.join(map(lambda x: "%0.4f" % x, w_sk))) Scikit-learn's standardized coefficients: -0.9204, 1.0810, 0.1430, 0.6822, -2.0601, 2.6706, 0.0211, -3.1044, 2.6588, -2.0759, -2.0622, 0.8566, -3.7487, 22.5328 A noticeable particular to mention is our choice of alpha. After some tests, the value of 0.02 has been chosen for its good performance on this very specific problem. Alpha is the learning rate and, during optimization, it can be fixed or changed according to a line search method, modifying its value in order to minimize the cost function at each single step of the optimization process. In our example, we opted for a fixed learning rate and we had to look for its best value by trying a few optimization values and deciding on which minimized the cost in the minor number of iterations. Summary In this article we learned about gradient descent, its feature scaling and a simple implementation using an algorithm based on Scikit-learn preprocessing module. Resources for Article:   Further resources on this subject: Optimization Techniques [article] Saving Time and Memory [article] Making Your Data Everything It Can Be [article]

Securing the Hadoop Ecosystem

Packt
20 Nov 2013
6 min read
(For more resources related to this topic, see here.)

Each ecosystem component has its own security challenges and needs to be configured uniquely, based on its architecture, in order to secure it. Each of these ecosystem components has end users directly accessing the component, or a backend service accessing the Hadoop core components (HDFS and MapReduce). The following are the topics that we'll be covering in this article:

Configuring authentication and authorization for the following Hadoop ecosystem components: Hive, Oozie, Flume, HBase, Sqoop, and Pig
Best practices in configuring secured Hadoop components
Configuring Kerberos for Hadoop ecosystem components

The Hadoop ecosystem is growing continuously and maturing with increasing enterprise adoption. In this section, we look at some of the most important Hadoop ecosystem components, their architecture, and how they can be secured.

Securing Hive

Hive provides the ability to run SQL queries over the data stored in HDFS. Hive provides the Hive query engine, which converts Hive queries submitted by the user into a pipeline of MapReduce jobs that are submitted to Hadoop (JobTracker or ResourceManager) for execution. The results of the MapReduce executions are then presented back to the user or stored in HDFS. The following figure shows a high-level interaction of a business user working with Hive to run Hive queries on Hadoop:

There are multiple ways a Hadoop user can interact with Hive and run Hive queries; these are as follows:

The user can directly run Hive queries using the Command Line Interface (CLI). The CLI connects to the Hive metastore using the metastore server and invokes the Hive query engine directly to execute the query on the cluster.
Custom applications written in Java and other languages interact with Hive using HiveServer. HiveServer, internally, uses the metastore server and the Hive query engine to execute the query on the cluster.

To secure Hive in the Hadoop ecosystem, the following interactions should be secured:

User interaction with the Hive CLI or HiveServer
User roles and privileges need to be enforced, to ensure users have access only to authorized data
The interaction between Hive and Hadoop (JobTracker or ResourceManager) has to be secured, and the user roles and privileges should be propagated to the Hadoop jobs

To ensure secure Hive user interaction, the user has to be authenticated by HiveServer or the CLI before running any jobs on the cluster. The user first uses the kinit command to fetch a Kerberos ticket. This ticket is stored in the credential cache and used to authenticate with Kerberos-enabled systems. Once the user is authenticated, Hive submits the job to Hadoop (JobTracker or ResourceManager). Hive needs to impersonate the user to execute MapReduce on the cluster.

From Hive version 0.11 onwards, HiveServer2 was introduced. The earlier HiveServer had serious security limitations related to user authentication. HiveServer2 supports Kerberos and LDAP authentication for user authentication. When HiveServer2 is configured for LDAP authentication, Hive users are managed in the LDAP store, but it is Hive itself that submits the MapReduce jobs to Hadoop on behalf of those users. Thus, if we configure HiveServer2 to use LDAP, only the user authentication between the client and HiveServer2 is addressed; the interaction of Hive with Hadoop remains insecure, and Hive MapReduce jobs will be able to access other users' data in the Hadoop cluster.
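Before looking at how Kerberos addresses this, here is a small, hedged sketch of what the client side of the kinit-then-connect flow described above can look like in practice. PyHive is not part of this article, and the host name and query below are illustrative assumptions; the only real prerequisites are that a valid Kerberos ticket has already been obtained with kinit and that the kerberos_service_name matches the service part of the hive/_HOST@REALM principal.

# Illustrative sketch only (PyHive is an assumption, not used in this article).
# Run `kinit user@YOUR-REALM.COM` first so a ticket is in the credential cache.
from pyhive import hive

conn = hive.connect(
    host='hiveserver2.example.com',   # hypothetical HiveServer2 host
    port=10000,                       # default HiveServer2 port
    auth='KERBEROS',                  # authenticate with the cached Kerberos ticket
    kerberos_service_name='hive'      # service part of hive/_HOST@YOUR-REALM.COM
)
cursor = conn.cursor()
cursor.execute('SHOW TABLES')         # any simple query to verify the connection
print(cursor.fetchall())
cursor.close()
conn.close()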
On the other hand, when we use Kerberos authentication for Hive users with HiveServer2, the same user is impersonated to execute MapReduce on the Hadoop cluster. So it is recommended that, in a production environment, we configure HiveServer2 with Kerberos to have seamless authentication and access control for the users submitting Hive queries. The credential store for the Kerberos KDC can be configured to be LDAP, so that we can centrally manage the credentials of the end users.

To set up secured Hive interactions, we need to perform the following steps:

One of the key steps in securing Hive interaction is to ensure that the Hive user is impersonated in Hadoop, as Hive executes a MapReduce job on the Hadoop cluster. To achieve this goal, we need to add the hive.server2.enable.impersonation configuration in hive-site.xml, and hadoop.proxyuser.hive.hosts and hadoop.proxyuser.hive.groups in core-site.xml.

<property>
  <name>hive.server2.authentication</name>
  <value>KERBEROS</value>
</property>
<property>
  <name>hive.server2.authentication.kerberos.principal</name>
  <value>hive/_HOST@YOUR-REALM.COM</value>
</property>
<property>
  <name>hive.server2.authentication.kerberos.keytab</name>
  <value>/etc/hive/conf/hive.keytab</value>
</property>
<property>
  <name>hive.server2.enable.impersonation</name>
  <description>Enable user impersonation for HiveServer2</description>
  <value>true</value>
</property>

Securing Hive using Sentry

In the previous section, we saw how Hive authentication can be enforced using Kerberos, and how user privileges are enforced through user impersonation in Hadoop. Sentry is one of the latest entrants in the Hadoop ecosystem, and it provides fine-grained user authorization for the data that is stored in Hive. Sentry provides fine-grained, role-based authorization to Hive and Impala. Sentry uses HiveServer2 and the metastore server to execute the queries on the Hadoop platform. However, user impersonation is turned off in HiveServer2 when Sentry is used. Sentry enforces user privileges on the Hadoop data using the Hive metastore. Sentry supports authorization policies per database/schema, which can be leveraged to enforce user management policies. More details on Sentry are available at the following URL:

http://www.cloudera.com/content/cloudera/en/products/cdh/sentry.html

Summary

In this article, we learned how to configure Kerberos for Hadoop ecosystem components. We also looked at how to secure Hive using Sentry.

Resources for Article:

Further resources on this subject:
Advanced Hadoop MapReduce Administration [Article]
Managing a Hadoop Cluster [Article]
Making Big Data Work for Hadoop and Solr [Article]

Python Text Processing with NLTK 2: Transforming Chunks and Trees

Packt
16 Dec 2010
10 min read
Python Text Processing with NLTK 2.0 Cookbook

Use Python's NLTK suite of libraries to maximize your Natural Language Processing capabilities.

Quickly get to grips with Natural Language Processing – with Text Analysis, Text Mining, and beyond
Learn how machines and crawlers interpret and process natural languages
Easily work with huge amounts of data and learn how to handle distributed processing
Part of Packt's Cookbook series: Each recipe is a carefully organized sequence of instructions to complete the task as efficiently as possible

Introduction

This article will show you how to do various transforms on both chunks and trees. The chunk transforms are for grammatical correction and rearranging phrases without loss of meaning. The tree transforms give you ways to modify and flatten deep parse trees. The functions detailed in these recipes modify data, as opposed to learning from it. That means it's not safe to apply them indiscriminately. A thorough knowledge of the data you want to transform, along with a few experiments, should help you decide which functions to apply and when.

Whenever the term chunk is used in this article, it could refer to an actual chunk extracted by a chunker, or it could simply refer to a short phrase or sentence in the form of a list of tagged words. What's important in this article is what you can do with a chunk, not where it came from.

Filtering insignificant words

Many of the most commonly used words are insignificant when it comes to discerning the meaning of a phrase. For example, in the phrase "the movie was terrible", the most significant words are "movie" and "terrible", while "the" and "was" are almost useless. You could get the same meaning if you took them out, such as "movie terrible" or "terrible movie". Either way, the sentiment is the same. In this recipe, we'll learn how to remove the insignificant words, and keep the significant ones, by looking at their part-of-speech tags.

Getting ready

First, we need to decide which part-of-speech tags are significant and which are not. Looking through the treebank corpus for stopwords yields the following table of insignificant words and tags:

Word    Tag
a       DT
all     PDT
an      DT
and     CC
or      CC
that    WDT
the     DT

Other than CC, all the tags end with DT. This means we can filter out insignificant words by looking at the tag's suffix.

How to do it...

In transforms.py there is a function called filter_insignificant(). It takes a single chunk, which should be a list of tagged words, and returns a new chunk without any insignificant tagged words. It defaults to filtering out any tags that end with DT or CC.

def filter_insignificant(chunk, tag_suffixes=['DT', 'CC']):
    good = []
    for word, tag in chunk:
        ok = True
        for suffix in tag_suffixes:
            if tag.endswith(suffix):
                ok = False
                break
        if ok:
            good.append((word, tag))
    return good

Now we can use it on the part-of-speech tagged version of "the terrible movie".

>>> from transforms import filter_insignificant
>>> filter_insignificant([('the', 'DT'), ('terrible', 'JJ'), ('movie', 'NN')])
[('terrible', 'JJ'), ('movie', 'NN')]

As you can see, the word "the" is eliminated from the chunk.

How it works...

filter_insignificant() iterates over the tagged words in the chunk. For each tag, it checks if that tag ends with any of the tag_suffixes. If it does, then the tagged word is skipped. However, if the tag is ok, then the tagged word is appended to a new good chunk that is returned.

There's more...
The way filter_insignificant() is defined, you can pass in your own tag suffixes if DT and CC are not enough, or are incorrect for your case. For example, you might decide that possessive words and pronouns such as "you", "your", "their", and "theirs" are no good, but DT and CC words are ok. The tag suffixes would then be PRP and PRP$. Following is an example of this function:

>>> filter_insignificant([('your', 'PRP$'), ('book', 'NN'), ('is', 'VBZ'), ('great', 'JJ')], tag_suffixes=['PRP', 'PRP$'])
[('book', 'NN'), ('is', 'VBZ'), ('great', 'JJ')]

Filtering insignificant words can be a good complement to stopword filtering for purposes such as search engine indexing, querying, and text classification.

Correcting verb forms

It's fairly common to find incorrect verb forms in real-world language. For example, the correct form of "is our children learning?" is "are our children learning?". The verb "is" should only be used with singular nouns, while "are" is for plural nouns, such as "children". We can correct these mistakes by creating verb correction mappings that are used depending on whether there's a plural or singular noun in the chunk.

Getting ready

We first need to define the verb correction mappings in transforms.py. We'll create two mappings, one for plural to singular, and another for singular to plural.

plural_verb_forms = {
    ('is', 'VBZ'): ('are', 'VBP'),
    ('was', 'VBD'): ('were', 'VBD')
}

singular_verb_forms = {
    ('are', 'VBP'): ('is', 'VBZ'),
    ('were', 'VBD'): ('was', 'VBD')
}

Each mapping has a tagged verb that maps to another tagged verb. These initial mappings cover the basics: is to are, was to were, and vice versa.

How to do it...

In transforms.py there is a function called correct_verbs(). Pass it a chunk with incorrect verb forms, and you'll get a corrected chunk back. It uses a helper function, first_chunk_index(), to search the chunk for the position of the first tagged word for which pred returns True.

def first_chunk_index(chunk, pred, start=0, step=1):
    l = len(chunk)
    end = l if step > 0 else -1
    for i in range(start, end, step):
        if pred(chunk[i]):
            return i
    return None

def correct_verbs(chunk):
    vbidx = first_chunk_index(chunk, lambda (word, tag): tag.startswith('VB'))
    # if no verb found, do nothing
    if vbidx is None:
        return chunk
    verb, vbtag = chunk[vbidx]
    nnpred = lambda (word, tag): tag.startswith('NN')
    # find nearest noun to the right of verb
    nnidx = first_chunk_index(chunk, nnpred, start=vbidx+1)
    # if no noun found to right, look to the left
    if nnidx is None:
        nnidx = first_chunk_index(chunk, nnpred, start=vbidx-1, step=-1)
    # if no noun found, do nothing
    if nnidx is None:
        return chunk
    noun, nntag = chunk[nnidx]
    # get correct verb form and insert into chunk
    if nntag.endswith('S'):
        chunk[vbidx] = plural_verb_forms.get((verb, vbtag), (verb, vbtag))
    else:
        chunk[vbidx] = singular_verb_forms.get((verb, vbtag), (verb, vbtag))
    return chunk

When we call it on a part-of-speech tagged "is our children learning" chunk, we get back the correct form, "are our children learning".

>>> from transforms import correct_verbs
>>> correct_verbs([('is', 'VBZ'), ('our', 'PRP$'), ('children', 'NNS'), ('learning', 'VBG')])
[('are', 'VBP'), ('our', 'PRP$'), ('children', 'NNS'), ('learning', 'VBG')]

We can also try this with a singular noun and an incorrect plural verb.
>>> correct_verbs([('our', 'PRP$'), ('child', 'NN'), ('were', 'VBD'), ('learning', 'VBG')])
[('our', 'PRP$'), ('child', 'NN'), ('was', 'VBD'), ('learning', 'VBG')]

In this case, "were" becomes "was" because "child" is a singular noun.

How it works...

The correct_verbs() function starts by looking for a verb in the chunk. If no verb is found, the chunk is returned with no changes. Once a verb is found, we keep the verb, its tag, and its index in the chunk. Then we look on either side of the verb to find the nearest noun, starting on the right, and only looking to the left if no noun is found on the right. If no noun is found at all, the chunk is returned as is. But if a noun is found, then we look up the correct verb form depending on whether or not the noun is plural. Plural nouns are tagged with NNS, while singular nouns are tagged with NN. This means we can check the plurality of a noun by seeing if its tag ends with S. Once we get the corrected verb form, it is inserted into the chunk to replace the original verb form.

To make searching through the chunk easier, we define a function called first_chunk_index(). It takes a chunk, a lambda predicate, the starting index, and a step increment. The predicate function is called with each tagged word until it returns True. If it never returns True, then None is returned. The starting index defaults to zero and the step increment to one. As you'll see in upcoming recipes, we can search backwards by overriding start and setting step to -1. This small utility function will be a key part of subsequent transform functions.

Swapping verb phrases

Swapping the words around a verb can eliminate the passive voice from particular phrases. For example, "the book was great" can be transformed into "the great book".

How to do it...

In transforms.py there is a function called swap_verb_phrase(). It swaps the right-hand side of the chunk with the left-hand side, using the verb as the pivot point. It uses the first_chunk_index() function defined in the previous recipe to find the verb to pivot around.

def swap_verb_phrase(chunk):
    # find location of verb
    vbpred = lambda (word, tag): tag != 'VBG' and tag.startswith('VB') and len(tag) > 2
    vbidx = first_chunk_index(chunk, vbpred)
    if vbidx is None:
        return chunk
    return chunk[vbidx+1:] + chunk[:vbidx]

Now we can see how it works on the part-of-speech tagged phrase "the book was great".

>>> from transforms import swap_verb_phrase
>>> swap_verb_phrase([('the', 'DT'), ('book', 'NN'), ('was', 'VBD'), ('great', 'JJ')])
[('great', 'JJ'), ('the', 'DT'), ('book', 'NN')]

The result is "great the book". This phrase clearly isn't grammatically correct, so read on to learn how to fix it.

How it works...

Using first_chunk_index() from the previous recipe, we start by finding the first matching verb that is not a gerund (a word that ends in "ing") tagged with VBG. Once we've found the verb, we return the chunk with the right side before the left, and remove the verb. The reason we don't want to pivot around a gerund is that gerunds are commonly used to describe nouns, and pivoting around one would remove that description. Here's an example where you can see how not pivoting around a gerund is a good thing:

>>> swap_verb_phrase([('this', 'DT'), ('gripping', 'VBG'), ('book', 'NN'), ('is', 'VBZ'), ('fantastic', 'JJ')])
[('fantastic', 'JJ'), ('this', 'DT'), ('gripping', 'VBG'), ('book', 'NN')]

If we had pivoted around the gerund, the result would be "book is fantastic this", and we'd lose the gerund "gripping".

There's more...
Filtering insignificant words makes the final result more readable. By filtering either before or after swap_verb_phrase(), we get "fantastic gripping book" instead of "fantastic this gripping book".

>>> from transforms import swap_verb_phrase, filter_insignificant
>>> swap_verb_phrase(filter_insignificant([('this', 'DT'), ('gripping', 'VBG'), ('book', 'NN'), ('is', 'VBZ'), ('fantastic', 'JJ')]))
[('fantastic', 'JJ'), ('gripping', 'VBG'), ('book', 'NN')]
>>> filter_insignificant(swap_verb_phrase([('this', 'DT'), ('gripping', 'VBG'), ('book', 'NN'), ('is', 'VBZ'), ('fantastic', 'JJ')]))
[('fantastic', 'JJ'), ('gripping', 'VBG'), ('book', 'NN')]

Either way, we get a shorter grammatical chunk with no loss of meaning.
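One compatibility note that is not part of the original recipes: the code above uses Python 2's tuple parameter unpacking in lambdas (lambda (word, tag): ...), which was removed in Python 3. If you want to try these transforms on Python 3, the predicates just need to index into the tagged-word tuple instead, as in the following hedged sketch; everything else can stay as it is.

# Python 3-friendly predicates: tuple unpacking in function parameters was
# removed in Python 3, so index into the (word, tag) tuple instead.
vb_pred = lambda wt: wt[1].startswith('VB')                                    # any verb
nn_pred = lambda wt: wt[1].startswith('NN')                                    # any noun
pivot_pred = lambda wt: wt[1] != 'VBG' and wt[1].startswith('VB') and len(wt[1]) > 2

# They are passed to first_chunk_index() exactly as before, for example:
# vbidx = first_chunk_index(chunk, vb_pred)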

How to embed Einstein dashboards on Salesforce Classic

Amey Varangaonkar
21 Mar 2018
5 min read
[box type="note" align="" class="" width=""]The following excerpt is taken from the book Learning Einstein Analytics written by Santosh Chitalkar. This book highlights the key techniques and know-how to unlock critical insights from your data using Salesforce Einstein Analytics.[/box]

With Einstein Analytics, users have the power to embed their dashboards in various third-party applications and even in their own web applications. In this article, we will show how to embed an Einstein dashboard on Salesforce Classic.

In order to start embedding the dashboard, let's create a sample dashboard by performing the following steps:

1. Navigate to Analytics Studio | Create | Dashboard.
2. Add three chart widgets on the dashboard. Click on the Chart button in the middle and select the Opportunity dataset. Select Measures as Sum of Amount and select BillingCountry under Group by. Click on Done.
3. Repeat the second step for the second widget, but select Account Source under Group by and make it a donut chart.
4. Repeat the second step for the third widget, but select Stage under Group by and make it a funnel chart.
5. Click on Save and enter Embedding Opportunities in the title field, as shown in the following screenshot:

Now that we have created a dashboard, let's embed this dashboard in Salesforce Classic. In order to start embedding the dashboard, exit from the Einstein Analytics platform and go to Classic mode. The user can embed the dashboard on the record detail page layout in Salesforce Classic. The user can view the dashboard, drill in, and apply a filter, just like in the Einstein Analytics window. Let's add the dashboard to the account detail page by performing the following steps:

1. Navigate to Setup | Customize | Accounts | Page Layouts, as shown in the following screenshot:
2. Click on Edit of Account Layout and it will open a page layout editor, which has two parts: a palette on the upper portion of the screen, and the page layout on the lower portion of the screen. The palette contains the user interface elements that you can add to your page layout, such as Fields, Buttons, Links and Actions, and Related Lists, as shown in the following screenshot:
3. Click on the Wave Analytics Assets option from the palette and you can see all the dashboards in the right-side panel.
4. Drag and drop a section onto the page layout, name it Einstein Dashboard, and click on OK.
5. Drag and drop the dashboard which you wish to add to the record detail page. We are going to add Embedding Opportunities. Click on Save.
6. Go to any account record and you should see the new section with the embedded dashboard:

Users can easily configure the embedded dashboards by using attributes. To access the dashboard properties, go to edit the page layout again, and go to the section where we added the dashboard to the layout. Hover over the dashboard and click on the Tool icon. It will open an Asset Properties window:

The Asset Properties window gives the user the option to change the following features:

Width (in pixels or %): This feature allows you to adjust the width of the dashboard section.
Height (in pixels): This feature allows you to adjust the height of the dashboard section.
Show Title: This feature allows you to display or hide the title of the dashboard.
Show Sharing Icon: By default, the share icon is disabled; this option gives the user the flexibility to include the share icon on the dashboard.
Show Header: This feature allows you to display or hide the header.
Hide on error: This feature gives you control over whether the Analytics asset appears if there is an error.
Field mapping: Last but not least, field mapping is used to filter the data on the dashboard down to what is relevant to the record. To set up the dashboard to show only the data that's relevant to the record being viewed, use field mapping. Field mapping links data fields in the dashboard to the object's fields.

We are using the Embedding Opportunities dashboard, so let's add field mapping to it. The following is the format for field mapping:

{
  "datasets": {
    "datasetName": [{
      "fields": ["Actual Field name from object"],
      "filter": {"operator": "matches", "values": ["$dataset Fieldname"]}
    }]
  }
}

Let's add field mapping for account by using the following format:

{
  "datasets": {
    "Account": [{
      "fields": ["Name"],
      "filter": {"operator": "matches", "values": ["$Name"]}
    }]
  }
}

If your dashboard uses multiple datasets, then you can use the following format:

{
  "datasets": {
    "datasetName1": [{
      "fields": ["Actual Field name from object"],
      "filter": {"operator": "matches", "values": ["$dataset1 Fieldname"]}
    }],
    "datasetName2": [{
      "fields": ["Actual Field name from object"],
      "filter": {"operator": "matches", "values": ["$dataset2 Fieldname"]}
    }]
  }
}

Let's add field mapping for account and opportunities:

{
  "datasets": {
    "Opportunities": [{
      "fields": ["Account.Name"],
      "filter": {"operator": "matches", "values": ["$Name"]}
    }],
    "Account": [{
      "fields": ["Name"],
      "filter": {"operator": "matches", "values": ["$Name"]}
    }]
  }
}

Now that we have added the field mapping, save the page layout and go to an actual record. Observe that the dashboard is now filtered per record, as shown in the following screenshot:

To summarize, we saw that it's fairly easy to embed your custom dashboards in Salesforce. Similarly, you can do so on other platforms such as Lightning, Visualforce pages, and even on your own websites and web applications. If you are keen to learn more, you may check out the book Learning Einstein Analytics.
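One closing practical note that is not in the original excerpt: the field mapping has to be well-formed JSON, and a missing brace or a stray capital letter in a key is easy to overlook when typing it into the Asset Properties window. A quick way to check a mapping before pasting it in is to run it through any JSON parser; the following sketch uses Python's standard json module purely as an illustration.

# Sanity-check a field-mapping snippet before pasting it into the Asset
# Properties window: json.loads() raises an error on malformed JSON.
import json

mapping = '''
{
  "datasets": {
    "Account": [{
      "fields": ["Name"],
      "filter": {"operator": "matches", "values": ["$Name"]}
    }]
  }
}
'''

try:
    json.loads(mapping)
    print("Field mapping is valid JSON")
except ValueError as err:   # json.JSONDecodeError is a subclass of ValueError
    print("Invalid field mapping:", err)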

Elasticsearch – Spicing Up a Search Using Geo

Packt
23 Jul 2015
18 min read
A geo point refers to the latitude and longitude of a point on Earth. Each location on Earth has its own unique latitude and longitude. Elasticsearch is aware of geo-based points and allows you to perform various operations on top of them. In many contexts, a geolocation component is also required to obtain various functionalities. For example, you might need to search for all the nearby restaurants that serve Chinese food, or find the nearest cab that is free. In another situation, you might need to find which state a particular geo point belongs to, in order to understand where you are currently standing.

This article by Vineeth Mohan, author of the book Elasticsearch Blueprints, is modeled such that all the examples mentioned are related to a real-life scenario, restaurant searching, for better understanding. Here, we take the example of sorting restaurants based on geographical preferences. A number of cases, ranging from the simple, such as finding the nearest restaurant, to the more complex, such as categorization of restaurants based on distance, are covered in this article. What makes Elasticsearch unique and powerful is the fact that you can combine a geo operation with any other normal search query to yield results combining both the location data and the query data.

(For more resources related to this topic, see here.)

Restaurant search

Let's consider creating a search portal for restaurants. The following are its requirements:

To find the nearest restaurant with Chinese cuisine, which has the word ChingYang in its name.
To decrease the importance of all restaurants outside city limits.
To find the distance between the restaurant and the current point for each of the preceding restaurant matches.
To find whether the person is within a particular city's limits or not.
To aggregate all restaurants within a distance of 10 km. That is, for a radius of the first 10 km, we have to compute the number of restaurants. For the next 10 km, we need to compute the number of restaurants, and so on.

Data modeling for restaurants

Firstly, we need to look at the aspects of the data and model it as a JSON document for Elasticsearch to make sense of it. A restaurant has a name, its location information, and a rating. To store the location information, Elasticsearch has a provision to understand latitude and longitude information and has features to conduct searches based on it. Hence, it would be best to use this feature. Let's see how we can do this.

First, let's see what our document should look like:

{
  "name" : "Tamarind restaurant",
  "location" : {
      "lat" : 1.10,
      "lon" : 1.54
  }
}

Now, let's define the schema for the same:

curl -X PUT "http://$hostname:9200/restaurants" -d '{
   "index": {
       "number_of_shards": 1,
       "number_of_replicas": 1
   },
   "analysis":{
           "analyzer":{
                   "flat" : {
               "type" : "custom",
               "tokenizer" : "keyword",
               "filter" : "lowercase"
           }
       }
   }
}'

echo
curl -X PUT "http://$hostname:9200/restaurants/restaurant/_mapping" -d '{
   "restaurant" : {
   "properties" : {
       "name" : { "type" : "string" },
       "location" : { "type" : "geo_point", "accuracy" : "1km" }
   }}
}'

Let's now index some documents in this index. An example would be the Tamarind restaurant data shown in the previous section.
We can index the data as follows:

curl -XPOST 'http://localhost:9200/restaurants/restaurant' -d '{
   "name": "Tamarind restaurant",
   "location": {
       "lat": 1.1,
       "lon": 1.54
   }
}'

Likewise, we can index any number of documents. For the sake of convenience, we have indexed only a total of five restaurants for this article. The latitude and longitude should be in this format. Elasticsearch also accepts two other formats (geohash and lat_lon), but let's stick to this one. As we have mapped the field location to the type geo_point, Elasticsearch is aware of what this information means and how to act upon it.

The nearest hotel problem

Let's assume that we are at a particular point where the latitude is 1.234 and the longitude is 2.132. We need to find the nearest restaurants to this location. For this purpose, the function_score query is the best option. We can use the decay (Gauss) functionality of the function_score query to achieve this:

curl -XPOST 'http://localhost:9200/restaurants/_search' -d '{
"query": {
   "function_score": {
     "functions": [
       {
         "gauss": {
           "location": {
             "scale": "1km",
             "origin": [
               1.231,
               1.012
             ]
           }
         }
       }
     ]
   }
}
}'

Here, we tell Elasticsearch to give a higher score to the restaurants that are near the reference point we gave it. The closer a restaurant is, the higher its importance.

Maximum distance covered

Now, let's move on to another example: finding restaurants that are within 10 km of my current position. Those that are beyond 10 km are of no interest to me. So, this essentially forms a circle with a radius of 10 km around my current position, as shown in the following map:

Our best bet here is to use a geo distance filter. It can be used as follows:

curl -XPOST 'http://localhost:9200/restaurants/_search' -d '{
"query": {
   "filtered": {
     "filter": {
       "geo_distance": {
         "distance": "10km",
         "location": {
           "lat": 1.232,
           "lon": 1.112
         }
       }
     }
   }
}
}'

Inside city limits

Next, I need to consider only those restaurants that are inside a particular city's limits; the rest are of no interest to me. As the city shown in the following map is rectangular in nature, this makes my job easier:

Now, to see whether a geo point is inside a rectangle, we can use the bounding box filter. A rectangle is defined by feeding in its top-left point and bottom-right point. Let's assume that the city lies within the following rectangle, with the top-left point as X and Y and the bottom-right point as A and B:

curl -XPOST 'http://localhost:9200/restaurants/_search' -d '{
"query": {
   "filtered": {
     "query": {
       "match_all": {}
     },
     "filter": {
       "geo_bounding_box": {
         "location": {
           "top_left": {
             "lat": 2,
             "lon": 0
           },
           "bottom_right": {
             "lat": 0,
             "lon": 2
           }
         }
       }
     }
   }
}
}'

Distance values between the current point and each restaurant

Now, consider the scenario where you need to find the distance between the user's location and each restaurant. How can we achieve this requirement? We can use scripts: the current geo coordinates are passed to the script, and the query then computes the distance to each matching restaurant, as in the following code.
Here, the current location is given as (1, 2):

curl -XPOST 'http://localhost:9200/restaurants/_search?pretty' -d '{
"script_fields": {
   "distance": {
     "script": "doc['"'"'location'"'"'].arcDistanceInKm(1, 2)"
   }
},
"fields": [
   "name"
],
"query": {
   "match": {
     "name": "chinese"
   }
}
}'

We have used the function called arcDistanceInKm in the preceding query, which accepts the geo coordinates and then returns the distance between that point and the locations matched by the query. Note that the unit of the calculated distance is kilometers (km). You might have noticed a long list of quotes and double quotes before and after location in the script mentioned previously. This is the standard format and, if we don't use it, a format error is returned while processing. The distances are calculated from the current point to the filtered hotels and are returned in the distance field of the response, as shown in the following code:

{
"took" : 3,
"timed_out" : false,
"_shards" : {
   "total" : 1,
   "successful" : 1,
   "failed" : 0
},
"hits" : {
   "total" : 2,
   "max_score" : 0.7554128,
   "hits" : [ {
     "_index" : "restaurants",
     "_type" : "restaurant",
     "_id" : "AU08uZX6QQuJvMORdWRK",
     "_score" : 0.7554128,
     "fields" : {
       "distance" : [ 112.92927483176413 ],
       "name" : [ "Great chinese restaurant" ]
     }
   }, {
     "_index" : "restaurants",
     "_type" : "restaurant",
     "_id" : "AU08uZaZQQuJvMORdWRM",
     "_score" : 0.7554128,
     "fields" : {
       "distance" : [ 137.61635969665923 ],
       "name" : [ "Great chinese restaurant" ]
     }
   } ]
}
}

Note that the distances measured from the current point to the hotels are direct distances and not road distances.

Restaurant out of city limits

One of my friends called me and asked me to join him on his journey to the next city. As we were leaving the city, he was particular that he wanted to eat at a restaurant outside our city's limits, but not in the next city. The requirement was translated to any restaurant that is a minimum of 15 km and a maximum of 100 km from the center of the city. Hence, we have something like a donut in which we have to conduct our search, as shown in the following map:

The area inside the donut is a match, but the area outside is not. For this donut area calculation, we have the geo_distance_range filter to our rescue. Here, we can apply the minimum and maximum distances in the from and to fields to populate the results, as shown in the following code:

curl -XPOST 'http://localhost:9200/restaurants/_search' -d '{
"query": {
   "filtered": {
     "query": {
       "match_all": {}
     },
     "filter": {
       "geo_distance_range": {
         "from": "15km",
         "to": "100km",
         "location": {
           "lat": 1.232,
           "lon": 1.112
         }
       }
     }
   }
}
}'

Restaurant categorization based on distance

In an e-commerce solution for searching restaurants, it's often necessary to increase the searchable characteristics of the application. This means that if we are able to give a snapshot of results beyond the top-10 results, it adds to the searchable characteristics of the search. For example, if we are able to show how many restaurants serve Indian, Thai, or other cuisines, it would actually help the user to get a better idea of the result set.
In a similar manner, if we can tell them whether a restaurant is nearby, at a medium distance, or far away, we can really strike a chord in the restaurant search user experience, as shown in the following map:

Implementing this is not hard, as we have something called the distance range aggregation. In this aggregation type, we can handcraft the distance ranges we are interested in and create a bucket for each of them. We can also define the key name we need, as shown in the following code:

curl -XPOST 'http://localhost:9200/restaurants/_search' -d '{
"aggs": {
   "distanceRanges": {
     "geo_distance": {
       "field": "location",
       "origin": "1.231, 1.012",
       "unit": "meters",
       "ranges": [
         {
           "key": "Near by Locations",
           "to": 200
         },
         {
           "key": "Medium distance Locations",
           "from": 200,
           "to": 2000
         },
         {
           "key": "Far Away Locations",
           "from": 2000
         }
       ]
     }
   }
}
}'

In the preceding code, we categorized the restaurants under three distance ranges: the nearby hotels (less than 200 meters), the medium-distance hotels (from 200 meters to 2,000 meters), and the far away ones (greater than 2,000 meters). This logic was translated into the preceding Elasticsearch query, from which we received the following results:

{
"took": 44,
"timed_out": false,
"_shards": {
   "total": 1,
   "successful": 1,
   "failed": 0
},
"hits": {
   "total": 5,
   "max_score": 0,
   "hits": [
        ]
},
"aggregations": {
   "distanceRanges": {
     "buckets": [
       {
         "key": "Near by Locations",
         "from": 0,
         "to": 200,
         "doc_count": 1
       },
       {
         "key": "Medium distance Locations",
         "from": 200,
         "to": 2000,
         "doc_count": 0
       },
       {
         "key": "Far Away Locations",
         "from": 2000,
         "doc_count": 4
       }
     ]
   }
}
}

In the results, we can see how many restaurants there are in each distance range, indicated by the doc_count field.

Aggregating restaurants based on their nearness

In the previous example, we saw the aggregation of restaurants, based on their distance from the current point, into three different categories. Now, we can consider another scenario in which we classify the restaurants on the basis of the geohash grids that they belong to. This kind of classification can be advantageous if the user would like to get a geographical picture of how the restaurants are distributed. Here is the code for a geohash-based aggregation of restaurants:

curl -XPOST 'http://localhost:9200/restaurants/_search?pretty' -d '{
"size": 0,
"aggs": {
   "DifferentGrids": {
     "geohash_grid": {
       "field": "location",
       "precision": 6
     },
     "aggs": {
       "restaurants": {
         "top_hits": {}
       }
     }
   }
}
}'

You can see from the preceding code that we used the geohash aggregation, which is named DifferentGrids, and that the precision here is set to 6. The precision field value can be varied within the range of 1 to 12, with 1 being the lowest and 12 the highest precision. Also, we used another aggregation named restaurants inside the DifferentGrids aggregation. The restaurants aggregation uses the top_hits query to fetch the aggregated details from the DifferentGrids aggregation, which would otherwise return only the key and doc_count values.
So, running the preceding code gives us the following result:

{
   "took":5,
   "timed_out":false,
   "_shards":{
     "total":1,
     "successful":1,
     "failed":0
   },
   "hits":{
     "total":5,
     "max_score":0,
     "hits":[
       ]
   },
   "aggregations":{
     "DifferentGrids":{
         "buckets":[
           {
              "key":"s009",
              "doc_count":2,
              "restaurants":{... }
           },
           {
              "key":"s01n",
              "doc_count":1,
              "restaurants":{... }
           },
           {
              "key":"s00x",
              "doc_count":1,
              "restaurants":{... }
           },
           {
              "key":"s00p",
              "doc_count":1,
              "restaurants":{... }
           }
         ]
     }
   }
}

As we can see from the response, there are four buckets with the key values s009, s01n, s00x, and s00p. These key values represent the different geohash grids that the restaurants belong to. From the preceding result, we can evidently say that the s009 grid contains two restaurants inside it and all the other grids contain one each. A pictorial representation of the previous aggregation would be like the one shown on the following map:

Summary

We found that Elasticsearch can handle geo points and various geo-specific operations. A few geo-specific and geo point operations that we covered in this article were searching for nearby restaurants (restaurants inside a circle), searching for restaurants within a distance range (restaurants inside a donut-shaped area), searching for restaurants inside a city (restaurants inside a rectangle), searching for restaurants inside a polygon, and categorization of restaurants by proximity. Apart from these, we can use Kibana, a flexible and powerful visualization tool provided by Elasticsearch, for geo-based operations.

Resources for Article:

Further resources on this subject:
Elasticsearch Administration [article]
Extending ElasticSearch with Scripting [article]
Indexing the Data [article]