Data | Tech News, Tutorials & Expert Insights

08 Jan 2016

14 min read

Advanced Shiny Functions


In this article by Chris Beeley, author of the book Web Application Development with R using Shiny - Second Edition, we are going to extend our toolkit by learning about advanced Shiny functions. These allow you to take control of the fine details of your application, including the interface, reactivity, data, and graphics. We will cover the following topics:

Learn how to show and hide parts of the interface
Change the interface reactively
Finely control reactivity, so functions and outputs run at the appropriate time
Use URLs and reactive Shiny functions to populate and alter the selections within an interface
Upload and download data to and from a Shiny application
Produce beautiful tables with the DataTables jQuery library

Summary of the application

We're going to add a lot of new functionality to the application, and it won't be possible to explain every piece of code before we encounter it. Several of the new functions depend on at least one other function, which means that you will see some functions for the first time while a different function is being introduced. It's important, therefore, that you concentrate on whichever function is being explained and wait until later in the article to understand the whole piece of code. To help you understand what the code does as you go along, it is worth quickly reviewing the functionality of the application now.

In terms of new functionality, it is now possible to select not only the network domain from which browser hits originate but also the country of origin. The draw map function now features a button in the UI that prevents the application from updating the map each time new data is selected; the map is redrawn only when the button is pushed. This prevents minor updates to the data from wasting processor time before the user has finished making their final data selection.
A Download report button has been added, which sends some of the output as a static file to a new webpage for the user to print or download. An animated graph of trend has been added; this will be explained in detail in the relevant section. Finally, a table of data has been added, which summarizes mean values of each of the selectable data summaries across the different countries of origin.

Downloading data from RGoogleAnalytics

The code is given and briefly summarized in the following section to give you a feeling for how to use it. Note that my credentials have been replaced with xxxx; you can get your own user details from the Google Analytics website. Also, note that this code is not included on GitHub because it requires the credentials to be present in order for it to work:

    library(RGoogleAnalytics)

    ### Generate the oauth_token object
    oauth_token <- Auth(client.id = "xxxx", client.secret = "xxxx")

    # Save the token object for future sessions
    save(oauth_token, file = "oauth_token")

Once you have your client.id and client.secret from the Google Analytics website, the preceding code will direct you to a browser to authenticate the application and save the authorization within oauth_token. This can be loaded in future sessions to save reauthenticating each time, as follows:

    # Load the token object and validate for new run
    load("oauth_token")
    ValidateToken(oauth_token)

The preceding code will load the token in subsequent sessions.
The ValidateToken() function is necessary each time because the authorization expires after a while; this function renews the authentication.

Finally, the metrics and dimensions of interest (for more on metrics and dimensions, see the documentation of the Google Analytics API online) are placed within a list and downloaded with the GetReportData() function as follows:

    ## list of metrics and dimensions
    query.list <- Init(start.date = "2013-01-01",
                       end.date = as.character(Sys.Date()),
                       dimensions = "ga:country,ga:latitude,ga:longitude,ga:networkDomain,ga:date",
                       metrics = "ga:users,ga:newUsers,ga:sessions,ga:bounceRate,ga:sessionDuration",
                       max.results = 10000,
                       table.id = "ga:71364313")

    gadf = GetReportData(QueryBuilder(query.list),
                         token = oauth_token, paginate_query = FALSE)

    ...[data tidying functions]...

    save(gadf, file = "gadf.Rdata")

The data tidying that is carried out at the end is omitted here for brevity. As you can see, at the end the data is saved as gadf.Rdata, ready to load within the application.

Animation

Animation is surprisingly easy. The sliderInput() function, which gives an HTML widget that allows the selection of a number along a line, has an optional animate argument that will increment a variable by a set amount every time a specified unit of time elapses. This allows you to very easily produce a graphic that animates.

In the following example, we are going to look at the monthly graph and plot a linear trend line through the first 20% of the data (0–20% of the data). Then, we are going to increment the percentage value that selects the portion of the data by 5% and plot a linear trend line through that portion of data (5–25% of the data). Then, we increment again to 10–30% and plot another line, and so on.
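The sliding window described above can be sketched in a few lines of base R. This is only an illustration of the arithmetic, assuming the window is always 20 percentage points wide and stops once its upper edge reaches 100%; the names starts and windows are hypothetical and do not appear in the application.

```r
# Start of each window: 0%, 5%, 10%, ... (assumed to stop at 80% so that
# the 20-point-wide window ends exactly at 100%)
starts <- seq(0, 80, by = 5)

# Each animation frame fits the trend line to the data between from and to
windows <- data.frame(from = starts, to = starts + 20)

head(windows, 3)   # the first three frames of the animation
nrow(windows)      # number of frames in one loop of the animation
```

Each row corresponds to one position of the slider, so the animation simply steps through the rows in order.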
[Screenshot: static image of the animated trend graph]

The slider input is set up as follows, with an ID, label, minimum value, maximum value, initial value, step between values, and the animation options, giving the delay in milliseconds and whether the animation should loop:

    sliderInput("animation", "Trend over time",
                min = 0, max = 80, value = 0, step = 5,
                animate = animationOptions(interval = 1000, loop = TRUE)
    )

Having set this up, the animated graph code is pretty simple, looking very much like the monthly graph data except with the linear smooth based on a subset of the data instead of the whole dataset. The graph is set up as before and then a subset of the data is produced on which the linear smooth can be based:

    groupByDate <- group_by(passData(), YearMonth, networkDomain) %>%
      summarise(meanSession = mean(sessionDuration, na.rm = TRUE),
                users = sum(users),
                newUsers = sum(newUsers),
                sessions = sum(sessions))

    groupByDate$Date <- as.Date(paste0(groupByDate$YearMonth, "01"),
                                format = "%Y%m%d")

    smoothData <- groupByDate[groupByDate$Date %in%
      quantile(groupByDate$Date, input$animation / 100, type = 1):
      quantile(groupByDate$Date, (input$animation + 20) / 100, type = 1), ]

We won't get too distracted by this code, but essentially, it tests which dates in the whole range fall within a window defined by percentage quantiles based on the sliderInput() values. See ?quantile for more information.

Finally, the linear smooth is drawn with an extra data argument to tell ggplot2 to base the line only on the smaller smoothData object and not the whole range:

    ggplot(groupByDate,
           aes_string(x = "Date", y = input$outputRequired,
                      group = "networkDomain", colour = "networkDomain")
    ) +
      geom_line() +
      geom_smooth(data = smoothData, method = "lm", colour = "black")

Not bad for a few lines of code. We have both ggplot2 and Shiny to thank for how easy this is.
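To see what the quantile-based subsetting does, here is a small sketch on made-up data, independent of Shiny. It assumes 24 monthly dates and a slider value of 25; dates, pct, lo, hi, and inWindow are illustrative names. It compares each date against the two quantile boundaries rather than using the %in% test on a sequence, which is meant to pick out the same window but is not the author's exact code.

```r
# Made-up monthly dates standing in for groupByDate$Date
dates <- seq(as.Date("2013-01-01"), by = "month", length.out = 24)

# Hypothetical slider value (input$animation in the application)
pct <- 25

# type = 1 returns actual observed dates rather than interpolated values
lo <- quantile(dates, pct / 100, type = 1)
hi <- quantile(dates, (pct + 20) / 100, type = 1)

# Keep only the dates between the two quantile boundaries
inWindow <- dates[dates >= lo & dates <= hi]
inWindow
```

Changing pct moves the window along the date range, which is what the slider animation does on each tick.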
Streamline the UI by hiding elements

This is a simple function that you are certainly going to need if you build even a moderately complex application. Those of you who have been doing extra credit exercises and/or experimenting with your own applications will probably have already wished for this or, indeed, have already found it. conditionalPanel() allows you to show/hide UI elements based on other selections within the UI. The function takes a condition (in JavaScript, but the form and syntax will be familiar from many languages) and a UI element and displays the UI only when the condition is true. This has actually been used a couple of times in the advanced GA application, and indeed in all the applications I've ever written of even moderate complexity. We're going to show the option to smooth the trend graph only when the trend graph tab is displayed, and we're going to show the controls for the animated graph only when the animated graph tab is displayed.

Naming tabPanel elements

In order to allow testing for which tab is currently selected, we first have to give names to the tabs of the tabbed output. This is done as follows (the new code is the id argument of tabsetPanel() and the value argument of each tabPanel()):

    tabsetPanel(id = "theTabs", # give tabsetPanel a name
      tabPanel("Summary", textOutput("textDisplay"), value = "summary"),
      tabPanel("Trend", plotOutput("trend"), value = "trend"),
      tabPanel("Animated", plotOutput("animated"), value = "animated"),
      tabPanel("Map", plotOutput("ggplotMap"), value = "map"),
      tabPanel("Table", DT::dataTableOutput("countryTable"), value = "table")
    )

As you can see, the whole panel is given an ID (theTabs) and then each tabPanel is also given a name (summary, trend, animated, map, and table). They are referred to in the server.R file very simply as input$theTabs.
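As a rough illustration of how the tab names surface on the server side, the following minimal sketch builds a two-tab application and echoes the value of input$theTabs. It is a stripped-down stand-in, not the article's application: the currentTab output and the placeholder tab contents are hypothetical, and it assumes the shiny package is installed.

```r
library(shiny)

ui <- fluidPage(
  tabsetPanel(id = "theTabs",
    tabPanel("Summary", "Summary content", value = "summary"),
    tabPanel("Trend", "Trend content", value = "trend")
  ),
  textOutput("currentTab")  # hypothetical output, for illustration only
)

server <- function(input, output) {
  # input$theTabs holds the value of whichever tabPanel is selected
  output$currentTab <- renderText({
    paste("Selected tab:", input$theTabs)
  })
}

app <- shinyApp(ui, server)
# runApp(app)  # uncomment to launch the application in a browser
```

Clicking between the tabs should switch the displayed text between the two value strings, which are the same strings that a conditionalPanel() condition tests against.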
Finally, we can make our changes to ui.R to remove parts of the UI based on tab selection:

    conditionalPanel(
      condition = "input.theTabs == 'trend'",
      checkboxInput("smooth", label = "Add smoother?", # add smoother
                    value = FALSE)
    ),
    conditionalPanel(
      condition = "input.theTabs == 'animated'",
      sliderInput("animation", "Trend over time",
                  min = 0, max = 80, value = 0, step = 5,
                  animate = animationOptions(interval = 1000, loop = TRUE)
      )
    )

As you can see, the condition appears very R/Shiny-like, except with the . operator familiar to JavaScript users in place of $. This is a very simple but powerful way of making sure that your UI is not cluttered with irrelevant material.

Beautiful tables with DataTable

The latest version of Shiny has added support for drawing tables using the wonderful DataTables jQuery library. This will enable your users to search and sort through large tables very easily. To see DataTable in action, visit the homepage at http://datatables.net/. The version in this application summarizes the values of different variables across the different countries from which browser hits originate and looks as follows:

[Screenshot: the country summary table rendered with DataTables]

The package can be installed using install.packages("DT") and needs to be loaded in the preamble to the server.R file with library(DT). Once this is done, using the package is quite straightforward. There are two functions: one in server.R (renderDataTable) and the other in ui.R (dataTableOutput). They are used as follows:

    ### server.R
    output$countryTable <- DT::renderDataTable({
      groupCountry <- group_by(passData(), country)
      groupByCountry <- summarise(groupCountry,
        meanSession = mean(sessionDuration),
        users = log(sum(users)),
        sessions = log(sum(sessions))
      )
      datatable(groupByCountry)
    })

    ### ui.R
    tabPanel("Table", DT::dataTableOutput("countryTable"), value = "table")

Anything that returns a dataframe or a matrix can be used within renderDataTable(). Note that as of Shiny version 0.12, the Shiny functions renderDataTable() and dataTableOutput() are deprecated: you should use the DT equivalents of the same name, as in the preceding code. Adding DT:: before each function name specifies that the function should be drawn from that package.

Reactive user interfaces

Another trick you will definitely want up your sleeve at some point is a reactive user interface. This enables you to change your UI (for example, the number or content of radio buttons) based on reactive functions. For example, consider an application that I wrote related to survey responses across a broad range of health services in different areas. The services are related to each other in quite a complex hierarchy, and over time, different areas and services respond (or cease to exist, or merge, or change their name), which means that for each time period the user might be interested in, there would be a totally different set of areas and services. The only sensible solution to this problem is to have the user tell you which area and date range they are interested in and then give them back the correct list of services that have survey responses within that area and date range.

The example we're going to look at is a little simpler than this, just to keep from getting bogged down in too much detail, but the principle is exactly the same, and you should not find this idea too difficult to adapt to your own UI. We are going to allow users to constrain their data by the country of origin of the browser hit. Although we could design the UI by simply taking all the countries that exist in the entire dataset and placing them all in a combo box to be selected, it is a lot cleaner to only allow the user to select from the countries that are actually present within the particular date range they have selected. This has the added advantage of preventing the user from selecting any countries of origin that do not have any browser hits within the currently selected dataset.
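The idea of offering only the countries present in the selected date range can be tried outside Shiny first. The sketch below uses a small made-up data frame; hits and countriesInRange() are hypothetical names introduced for illustration, and the real application works from its own reactive dataset instead.

```r
# Made-up browser hits standing in for the Google Analytics data
hits <- data.frame(
  date = as.Date(c("2015-01-05", "2015-02-10", "2015-03-15", "2015-03-20")),
  country = c("France", "Spain", "France", "Japan"),
  stringsAsFactors = FALSE
)

# Return the countries that have at least one hit between from and to
countriesInRange <- function(data, from, to) {
  inRange <- data[data$date >= from & data$date <= to, ]
  sort(unique(as.character(inRange$country)))
}

# Only countries with hits in March 2015 are offered
countriesInRange(hits, as.Date("2015-03-01"), as.Date("2015-03-31"))
```

A list built this way is what would be handed to the selection widget, so a country with no hits in the chosen period never appears as an option.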
In order to do this, we are going to create a reactive user interface, that is, one that changes based on data values that come about from user input.

Reactive user interface example – server.R

When you are making a reactive user interface, the big difference is that instead of writing your UI definition in your ui.R file, you place it in server.R and wrap it in renderUI(). Then, point to it from your ui.R file. Let's have a look at the relevant bit of the server.R file:

    output$reactCountries <- renderUI({
      countryList = unique(as.character(passData()$country))
      selectInput("theCountries", "Choose country", countryList)
    })

The first line takes the reactive dataset that contains only the data between the dates selected by the user and gives all the unique values of countries within it. The second line is a widget type we have not used yet, which generates a combo box. The usual id and label arguments are given, followed by the values that the combo box can take. This is taken from the variable defined in the first line.

Reactive user interface example – ui.R

The ui.R file merely needs to point to the reactive definition, as shown in the following line of code (just add it to the list of widgets within sidebarPanel()):

    uiOutput("reactCountries")

You can now point to the value of the widget in the usual way as input$theCountries. Note that you do not use the name as defined in the call to renderUI(), that is, reactCountries, but rather the name as defined within it, that is, theCountries.

Progress bars

It is quite common within Shiny applications and in analytics generally to have computations or data fetches that take a long time. Even using all the tools covered so far, it will sometimes be necessary for the user to wait some time before their output is returned.
In cases like this, it is good practice to do two things: first, to inform the user that the server is processing the request and has not simply crashed or otherwise failed, and second, to give the user some idea of how much time has elapsed since they requested the output and how much time they have remaining to wait. This is achieved very simply in Shiny using the withProgress() function. This function defaults to measuring progress on a scale from 0 to 1 and produces a loading bar at the top of the application with the information from the message and detail arguments. As you can see in the following code, the withProgress() function is used to wrap a function (in this case, the function that draws the map), with message and detail arguments describing what is happening and an initial value of 0 (value = 0, that is, no progress yet):

    withProgress(message = 'Please wait',
                 detail = 'Drawing map...', value = 0, {
      ... function code...
    }
    )

As the code is stepped through, the value of progress can steadily be increased from 0 to 1 (for example, in a for() loop) using the following code:

    incProgress(1/3)

The third time this is called, the value of progress will be 1, which indicates that the function has completed (although other values of progress can be selected where necessary; see ?withProgress). To summarize, the finished code looks as follows:

    withProgress(message = 'Please wait',
                 detail = 'Drawing map...', value = 0, {
      ... function code...
      incProgress(1/3)
      ... function code...
      incProgress(1/3)
      ... function code...
      incProgress(1/3)
    }
    )

It's very simple. Again, have a look at the application to see it in action.

Summary

In this article, you have now seen most of the functionality within Shiny. It's a relatively small but powerful toolbox with which you can build a vast array of useful and intuitive applications with comparatively little effort.
In this respect, ggplot2 is rather a good companion for Shiny because it too offers you a fairly limited selection of functions with which knowledgeable users can very quickly build many different graphical outputs.


Packt

26 Mar 2015

37 min read

Big Data Analysis (R and Hadoop)


Packt

26 Jun 2013

7 min read

Creating your first collection (Simple)


Getting ready

Assuming that you have walked through the tutorial, you should be nearly ready with the setup. Still, it does not hurt to go through the checklist:

Make sure you know how to start your operating system's shell (cmd.exe on Windows, Terminal/iTerm on Mac, and sh/bash/tcsh/zsh on Unix).
Ensure that running the java -version command on the shell's prompt returns at least Version 1.6. You may need to upgrade if you have an older version.
Ensure that you know where you unpacked the Solr distribution and the full path to the example directory within that. You needed that directory for the tutorial, but that's also where we are going to start our own Solr instance. That allows us to easily run an embedded Jetty web server and to also find all the additional JAR files that Solr needs to operate properly.
Now, create a directory where we will store our indexes and experiments. It can be anywhere on your drive. As Solr can run on any operating system where Java can run, we will use SOLR-INDEXING as a name whenever we refer to that directory. Make sure to use absolute path names when substituting with your real path for the directory.

How to do it...

As our first example, we will create an index that stores and allows for the searching of simplified e-mail information. For now, we will just look at the addr_from and addr_to e-mail addresses and the subject line. You will see that it takes only two simple configuration files to get the basic Solr index working.

Under the SOLR-INDEXING directory, create a collection1 directory and inside that create a conf directory. In the conf directory, create two files: schema.xml and solrconfig.xml.
The schema.xml file should have the following content:

    <?xml version="1.0" encoding="UTF-8" ?>
    <schema version="1.5">
      <fields>
        <field name="id" type="string" indexed="true" stored="true" required="true"/>
        <field name="addr_from" type="string" indexed="true" stored="true" required="true"/>
        <field name="addr_to" type="string" indexed="true" stored="true" required="true"/>
        <field name="subject" type="string" indexed="true" stored="true" required="true"/>
      </fields>
      <uniqueKey>id</uniqueKey>
      <types>
        <fieldType name="string" class="solr.StrField" />
      </types>
    </schema>

The solrconfig.xml file should have the following content:

    <?xml version="1.0" encoding="UTF-8" ?>
    <config>
      <luceneMatchVersion>LUCENE_43</luceneMatchVersion>
      <requestDispatcher handleSelect="false">
        <httpCaching never304="true" />
      </requestDispatcher>
      <requestHandler name="/select" class="solr.SearchHandler" />
      <requestHandler name="/update" class="solr.UpdateRequestHandler" />
      <requestHandler name="/admin" class="solr.admin.AdminHandlers" />
      <requestHandler name="/analysis/field" class="solr.FieldAnalysisRequestHandler" startup="lazy" />
    </config>

That is it. Now, let's start our just-created Solr instance. Open a new shell (we'll need the current one later). On that shell's command prompt, change the directory to the example directory of the Solr distribution and run the following command:

    java -Dsolr.solr.home=SOLR-INDEXING -jar start.jar

Notice that solr.solr.home is not a typo; you do need the solr part twice. And, as always, if you have spaces in your paths (now or later), you may need to escape them in platform-specific ways, such as with backslashes on Unix/Linux or by quoting the whole value.

In the window of your shell, you should see a long list of messages that you can safely ignore (at least for now). You can verify that everything is working fine by checking for the following three elements:

The long list of messages should finish with a message like Started SocketConnector@0.0.0.0:8983.
This means that Solr is now running on port 8983 successfully.
You should now have a directory called data, right next to the directory called conf that we created earlier.
If you open the web browser and go to http://localhost:8983/solr/, you should see a web-based admin interface that makes testing and troubleshooting your Solr instance much easier. We will be using this interface later, so do spend a couple of minutes clicking around now.

Now, let's load some actual content into our collection:

Copy post.jar from the Solr distribution's example/exampledocs directory to our root SOLR-INDEXING directory.
Create a file called input1.csv in the collection1 directory, next to the conf and data directories, with the following three-line content:

    id,addr_from,addr_to,subject
    email1,fulan@acme.example.com,kari@acme.example.com,"Kari,we need more Junior Java engineers"
    email2,kari@acme.example.com,maija@acme.example.com,"Updating vacancy description"

Run the import command from the command line in the SOLR-INDEXING directory (one long command; do not split it across lines):

    java -Dauto -Durl=http://localhost:8983/solr/collection1/update -jar post.jar collection1/input1.csv

You should see the following in one of the message lines: "1 files indexed".

If you now open a web browser and go to http://localhost:8983/solr/collection1/select?q=*%3A*&wt=ruby&indent=true, you should see Solr output with both documents displayed on the screen in a somewhat readable format.

How it works...

We have created two files to get our example working. Let's review what they mean and how they fit together:

The schema.xml file in the collection's conf directory defines the actual shape of data that you want to store and index. The fields define the structure of a record. Each field has a type, which is also defined in the same file. The field definition specifies whether the field is stored, indexed, required, or multivalued, or has one of a small number of other, more advanced properties.
On the other hand, the field type defines what is actually done to the field when it is indexed and when it is searched. We will explore all of these later.
The solrconfig.xml file, also in the collection's conf directory, defines and tunes the components that make up Solr's runtime environment. At the very least, it needs to define which URLs can be called to add records to a collection (here, /update), which to query a collection (here, /select), and which to do various administrative tasks (here, /admin and /analysis/field).

Once Solr started, it created a single collection with the default name of collection1 and assigned an update handler to it at the /solr/collection1/update URL and a search handler at the /solr/collection1/select URL (as per solrconfig.xml). At that point, Solr was ready for the data to be imported into the four required fields (as per schema.xml). We then proceeded to populate the index from a CSV file (one of many update formats available) and then verified that the records are all present in an indented Ruby format (again, one of many result formats available).

Summary

This article helped you create a basic Solr collection and populate it with a simple dataset in CSV format.


Packt

10 Aug 2015

17 min read

Setting Up Synchronous Replication


Packt

15 Jun 2015

3 min read

Neural Network in Azure ML


In this article written by Sumit Mund, author of the book Microsoft Azure Machine Learning, we will learn about neural networks, a kind of machine learning algorithm inspired by computational models of the human brain. A neural network is built from computation units called neurons or nodes. In a typical network, there are three layers of nodes: first the input layer, then the middle or hidden layer, and finally the output layer. Neural network algorithms can be used for both classification and regression problems.

The number of nodes in a layer depends on the problem and how you construct the network to get the best result. Usually, the number of nodes in the input layer is equal to the number of features in the dataset. For a regression problem, the number of nodes in the output layer is one, while for a classification problem, it is equal to the number of classes or labels. Each node in a layer is connected to all the nodes in the next layer. Each edge connecting two nodes is assigned a weight. So, a neural network can well be imagined as a weighted directed acyclic graph.

In a typical neural network, as shown in the preceding figure, the middle layer or hidden layer contains a number of nodes chosen to make the computation work well. While there is no formula or agreed convention for this, it is often optimized after trying out different options.

Azure Machine Learning supports neural networks for regression, two-class classification, and multiclass classification. It provides a separate module for each kind of problem and lets the users tune it with different parameters, such as the number of hidden nodes, the number of iterations to train the model, and so on.

A special kind of neural network algorithm in which there is more than one hidden layer is known as a deep network or deep learning algorithm.
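Because every node connects to every node in the next layer, the number of weighted edges follows directly from the layer sizes. The short R sketch below illustrates that count; countEdges() and the example layer sizes are hypothetical, and bias terms, which real implementations usually add, are deliberately left out.

```r
# Count the weighted edges in a fully connected feed-forward network.
# layerSizes gives the number of nodes per layer, input first, output last.
countEdges <- function(layerSizes) {
  n <- length(layerSizes)
  # every node in layer i connects to every node in layer i + 1
  sum(layerSizes[-n] * layerSizes[-1])
}

# Hypothetical classifier: 4 features, 5 hidden nodes, 3 classes
countEdges(c(4, 5, 3))      # 4 * 5 + 5 * 3 edges

# The same inputs and outputs with a second hidden layer of 5 nodes
countEdges(c(4, 5, 5, 3))
```

The second call shows how adding a hidden layer increases the number of weights to be learned, which is one reason deeper networks are more demanding to train.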
Azure Machine Learning allows us to choose the number of hidden nodes as a property value of the neural network module. Deep networks are becoming increasingly popular because of their remarkable results and because they allow us to model complex and nonlinear scenarios. There are many kinds of deep networks, but recently, a special kind known as the convolutional neural network has become very popular because of its strong performance in image recognition and classification problems.

Azure Machine Learning supports the convolutional neural network. For simple networks with three layers, this can be done through a UI just by choosing parameters. However, a deep network such as a convolutional network is not easy to build through a UI. So, Azure Machine Learning supports a language called Net#, which allows you to script different kinds of neural networks inside ML Studio by defining the nodes, the connections (edges), and the kinds of connections. While deep networks are complex to build and train, Net# makes things relatively easy and simple.

Though complex, neural networks are very powerful, and Azure Machine Learning makes it fun to work with them, be they three-layered shallow networks or multilayer deep networks.


Packt

24 Feb 2011

8 min read

Oracle GoldenGate: Considerations for Designing a Solution


Oracle GoldenGate 11g Implementer's guide

Design, install, and configure high-performance data replication solutions using Oracle GoldenGate

The very first book on GoldenGate, focused on design and performance tuning in enterprise-wide environments
Exhaustive coverage and analysis of all aspects of the GoldenGate software implementation, including design, installation, and advanced configuration
Migrate your data replication solution from Oracle Streams to GoldenGate
Design a GoldenGate solution that meets all the functional and non-functional requirements of your system
Written in a simple illustrative manner, providing step-by-step guidance with discussion points
Goes way beyond the manual, appealing to Solution Architects, System Administrators and Database Administrators

At a high level, the design must include the following generic requirements:

Hardware
Software
Network
Storage
Performance

All the above must be factored into the overall system architecture. So let's take a look at some of the options and the key design issues.

Replication methods

So you have a fast reliable network between your source and target sites. You also have a schema design that is scalable and logically split. You now need to choose the replication architecture: One to One, One to Many, active-active, active-passive, and so on. This consideration may already be answered for you by what the system has to achieve. Let's take a look at some configuration options.

Active-active

Let's assume a multi-national computer hardware company has offices in London and New York. Data entry clerks are employed at both sites inputting orders into an Order Management System. There is also a procurement department that updates the system inventory with volumes of stock and new products related to a US or European market. European countries are managed by London, and the US States are managed by New York.
A requirement exists where the underlying database systems must be kept in synchronisation. Should one of the systems fail, London users can connect to New York and vice versa, allowing business to continue and orders to be taken. Oracle GoldenGate's active-active architecture provides the best solution to this requirement, ensuring that the database systems on both sides of the pond are kept synchronised in case of failure.

Another feature the active-active configuration has to offer is the ability to load balance operations. Rather than effectively having a DR site in both locations, the European users could be allowed access to both the New York and London systems, and vice versa. Should a site fail, then the DR solution could be quickly implemented.

Active-passive

The active-passive bi-directional configuration replicates data from an active primary database to a full replica database. Sticking with the earlier example, the business would need to decide which site is the primary where all users connect. For example, in the event of a failure in London, the application could be configured to fail over to New York. Depending on the failure scenario, another option is to start up the passive configuration, effectively turning the active-passive configuration into active-active.

Cascading

The Cascading GoldenGate topology offers a number of "drop-off" points that are intermediate targets being populated from a single source. The question here is "what data do I drop at which site?" Once this question has been answered by the business, it is then a case of configuring filters in Replicat parameter files allowing just the selected data to be replicated. All of the data is passed on to the next target where it is filtered and applied again. This type of configuration lends itself to a head office system updating its satellite office systems in a round robin fashion. In this case, only the relevant data is replicated at each target site.
Another design is the Hub and Spoke solution, where all target sites are updated simultaneously. This is a typical head office topology, but additional configuration and resources would be required at the source site to ship the data in a timely manner. The CPU, network, and file storage must be sufficient to accommodate the data and send it to multiple targets.

Physical Standby

A Physical Standby database is a robust Oracle DR solution managed by the Oracle Data Guard product. The Physical Standby database is essentially a mirror copy of its Primary, which lends itself perfectly to failover scenarios. However, it is not easy to replicate data from the Physical Standby database, because it does not generate any of its own redo. That said, it is possible to configure GoldenGate to read the archived standby logs in Archive Log Only (ALO) mode. Despite being potentially slower, it may be prudent to feed a downstream system on the DR site using this mechanism, rather than having two data streams configured from the Primary database. This reduces network bandwidth utilization, as shown in the following diagram:

Reducing network traffic is particularly important when there is considerable distance between the primary and the DR site.

Networking

The network should not be taken for granted. It is a fundamental component in data replication and must be considered in the design process. Not only must it be fast, it must be reliable. In the following paragraphs, we look at ways to make our network resilient to faults and subsequent outages, in an effort to maintain zero downtime.

Surviving network outages

Probably one of your biggest fears in a replication environment is network failure. Should the network fail, the source trail will fill as the transactions continue on the source database, ultimately filling the filesystem to 100% utilization and causing the Extract process to abend.
Depending on the length of the outage, data in the database's redologs may be overwritten causing you the additional task of configuring GoldenGate to extract data from the database's archived logs. This is not ideal as you already have the backlog of data in the trail files to ship to the target site once the network is restored. Therefore, ensure there is sufficient disk space available to accommodate data for the longest network outage during the busiest period. Disks are relatively cheap nowadays. Providing ample space for your trail files will help to reduce the recovery time from the network outage. Redundant networks One of the key components in your GoldenGate implementation is the network. Without the ability to transfer data from the source to the target, it is rendered useless. So, you not only need a fast network but one that will always be available. This is where redundant networks come into play, offering speed and reliability. NIC teaming One method of achieving redundancy is Network Interface Card (NIC) teaming or bonding. Here two or more Ethernet ports can be "coupled" to form a bonded network supporting one IP address. The main goal of NIC teaming is to use two or more Ethernet ports connected to two or more different access network switches thus avoiding a single point of failure. The following diagram illustrates the redundant features of NIC teaming: Linux (OEL/RHEL 4 and above) supports NIC teaming with no additional software requirements. It is purely a matter of network configuration stored in text files in the /etc/sysconfig/network-scripts directory. The following steps show how to configure a server for NIC teaming: First, you need to log on as root user and create a bond0 config file using the vi text editor. 
    # vi /etc/sysconfig/network-scripts/ifcfg-bond0

Append the following lines to it, replacing the IP address with your actual IP address, then save the file and exit to the shell prompt:

    DEVICE=bond0
    IPADDR=192.168.1.20
    NETWORK=192.168.1.0
    NETMASK=255.255.255.0
    USERCTL=no
    BOOTPROTO=none
    ONBOOT=yes

Choose the Ethernet ports you wish to bond, and then open both configurations in turn using the vi text editor:

    # vi /etc/sysconfig/network-scripts/ifcfg-eth2
    # vi /etc/sysconfig/network-scripts/ifcfg-eth4

Modify each configuration as follows, replacing ethn with the respective port name (eth2 or eth4 in this example):

    DEVICE=ethn
    USERCTL=no
    ONBOOT=yes
    MASTER=bond0
    SLAVE=yes
    BOOTPROTO=none

Save the files and exit to the shell prompt. To make sure the bonding module is loaded when the bonding interface (bond0) is brought up, you need to modify the kernel modules configuration file:

    # vi /etc/modprobe.conf

Append the following two lines to the file:

    alias bond0 bonding
    options bond0 mode=balance-alb miimon=100

Finally, load the bonding module and restart the network services:

    # modprobe bonding
    # service network restart

You now have a bonded network that will load balance when both physical networks are available, providing additional bandwidth and enhanced performance. Should one network fail, the available bandwidth will be halved, but the network will still be available.

Non-functional requirements (NFRs)

Irrespective of the functional requirements, the design must also include the non-functional requirements (NFRs) in order to achieve the overall goal of delivering a robust, high-performance, and stable system.

Latency

One of the main NFRs is performance. How long does it take to replicate a transaction from the source database to the target? This is known as end-to-end latency, and it typically has a threshold that must not be breached in order to satisfy the specified NFR. GoldenGate refers to latency as lag, which can be measured at different intervals in the replication process.
These are:

- Source to Extract: The time taken for a record to be processed by the Extract compared to the commit timestamp on the database
- Replicat to Target: The time taken for the last record to be processed by the Replicat compared to the record creation time in the trail file

A well-designed system may encounter spikes in latency, but the lag should never be continuous or growing. Trying to tune GoldenGate when the design is poor is a difficult situation to be in. For the system to perform well, you may need to revisit the design.
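The distinction between an occasional spike and lag that keeps growing can be checked mechanically. The following is a minimal Python sketch, not part of GoldenGate: the function names, the window size, and the idea of comparing consecutive window averages are all assumptions made for illustration. It expects a list of lag samples in seconds, collected at regular intervals by whatever monitoring you already have.

```python
from statistics import mean


def window_means(samples, window):
    """Average the lag samples over consecutive, non-overlapping windows."""
    return [
        mean(samples[i:i + window])
        for i in range(0, len(samples) - window + 1, window)
    ]


def lag_is_growing(samples, window=3, min_windows=3):
    """Return True when the average lag rose in each of the last
    `min_windows` windows, which suggests a backlog rather than a spike."""
    means = window_means(samples, window)
    if len(means) < min_windows:
        return False  # not enough data to judge a trend
    recent = means[-min_windows:]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


# Hypothetical lag samples in seconds.
spike = [2, 2, 3, 2, 40, 3, 2, 3, 2]         # one spike, then recovery
backlog = [2, 3, 4, 6, 8, 9, 15, 20, 30]     # steadily increasing lag

print(lag_is_growing(spike))    # False
print(lag_is_growing(backlog))  # True
```

Averaging over windows keeps a single slow transaction from raising an alarm, while a sustained upward trend still shows up after a few windows.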



Packt

19 Jun 2017

8 min read

LambdaArchitecture Pattern


In this article by Tomcy John and Pankaj Misra, the authors of the book Data Lake For Enterprises, we will learn how data in a big data landscape can be made available in near real time, and look at certain practices that can be adopted for realizing Lambda Architecture in the context of a data lake. (For more resources related to this topic, see here.)

The concept of a data lake in an enterprise was driven by certain challenges that enterprises were facing with the way data was handled, processed, and stored. Initially, the individual applications in the enterprise, via a natural evolution cycle, started maintaining huge amounts of data within themselves, with almost no reuse by other applications in the same enterprise. This created information silos across various applications. As the next step of evolution, these individual applications started exposing this data across the organization as a data mart access layer over a central data warehouse. While data marts solved one part of the problem, other problems persisted. These problems were mostly about data governance, data ownership, and data accessibility, which had to be resolved in order to improve the availability of enterprise-relevant data. This is where the need for data lakes was felt: they could not only make such data available but could also store any form of data and process it, so that the data is analyzed and kept ready for consumption by consumer applications. In this article, we will look at some of the critical aspects of a data lake and understand why it matters for an enterprise.

If we need to define the term data lake, it can be defined as a vast repository of a variety of enterprise-wide raw information that can be acquired, processed, analyzed, and delivered. The information thus handled could be of any type, ranging from structured and semi-structured data to completely unstructured data.
A data lake is expected to be able to derive enterprise-relevant meaning and insights from this information using various analysis and machine learning algorithms.

Lambda architecture and data lake

Lambda architecture as a pattern provides ways and means to perform highly scalable, performant, distributed computing on large sets of data, and yet provide (eventually) consistent data with the required processing both in batch and in near real time. Lambda architecture defines ways and means to enable a scale-out architecture across various data load profiles in an enterprise, with low latency expectations. The architecture pattern became significant with the emergence of big data and enterprises' focus on real-time analytics and digital transformation. The pattern's name, Lambda (symbol λ), is indicative of the way data comes from two places (batch and speed - the curved parts of the lambda symbol), which is then combined and served through the serving layer (the line merging from the curved part).

Figure 01: Lambda Symbol

The main layers constituting the Lambda architecture are shown below:

Figure 02: Components of Lambda Architecture

In the above high-level representation, data is fed to both the batch and speed layers. The batch layer keeps producing and re-computing views at every set batch interval. The speed layer also creates the relevant real-time/speed views. The serving layer orchestrates a query by querying both the batch and speed layers, merging their output, and sending the merged result back. A practical realization of such a data lake is shown in Figure 03.
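Before moving on to that realization, the serving layer's merge step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the views are plain dictionaries of counts, the metric is additive (so batch and speed values can simply be summed), and the names `merge_views` and `query` are hypothetical rather than taken from any framework.

```python
def merge_views(batch_view, speed_view):
    """Combine the precomputed batch view with the speed view, which
    holds only the events that arrived after the last batch run."""
    merged = dict(batch_view)
    for key, value in speed_view.items():
        merged[key] = merged.get(key, 0) + value
    return merged


def query(key, batch_view, speed_view):
    """Answer a query from both layers, as the serving layer does."""
    return merge_views(batch_view, speed_view).get(key, 0)


# Hypothetical page-hit counts.
batch_view = {"page_a": 100, "page_b": 40}   # recomputed at each batch interval
speed_view = {"page_a": 3, "page_c": 1}      # events since the last batch run

print(query("page_a", batch_view, speed_view))  # 103
print(query("page_c", batch_view, speed_view))  # 1
```

When the next batch run completes, the batch view absorbs the newer events and the corresponding entries are dropped from the speed view, which is why the combined answer is only eventually consistent.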
The figure below shows multiple technologies used for such a realization. However, once the data is acquired from multiple sources and queued in the messaging layer for ingestion, the Lambda architecture pattern, in the form of the ingestion layer, batch layer, and speed layer, springs into action:

Figure 03: Layers in Data Lake

Data Acquisition Layer: In an organization, data exists in various forms, which can be classified as structured, semi-structured, or unstructured data. One of the key roles expected of the acquisition layer is to convert the data into messages that can be further processed in a data lake. The acquisition layer is therefore expected to be flexible enough to accommodate a variety of schema specifications, and at the same time it must have a fast connect mechanism to seamlessly push all the translated data messages into the data lake. A typical flow can be represented as shown below.

Figure 04: Data Acquisition Layer

Messaging Layer: The messaging layer forms the Message Oriented Middleware (MOM) for the data lake architecture and hence is the primary layer for decoupling the various layers from each other, while guaranteeing delivery of messages. The other aspect of a messaging layer is its ability to enqueue and dequeue messages, which most messaging frameworks provide to manage the publishing and consumption of messages respectively. Every messaging framework provides its own set of libraries to connect to its resources (queues/topics).

Figure 05: Message Queue

Additionally, the messaging layer can perform the role of a data stream producer, converting the queued data into continuous streams of data that can be passed on for data ingestion.

Data Ingestion Layer: A fast ingestion layer is one of the key layers in the Lambda Architecture pattern. This layer determines how fast data can be delivered into the working models of Lambda architecture.
The data ingestion layer is responsible for consuming the messages from the messaging layer and performing the required transformation for ingesting them into the lambda layer (batch and speed layer), such that the transformed output conforms to the expected storage or processing formats.

Figure 06: Data Ingestion Layer

Batch Processing: The batch processing layer of lambda architecture is expected to process the ingested data in batches so as to make optimum use of system resources. At the same time, long-running operations may be applied to the data to ensure high quality of the data output, which is also known as modelled data. The conversion of raw data to modelled data is the primary responsibility of this layer, where the modelled data is the data model that can be served by the serving layer of lambda architecture. While Hadoop as a framework has multiple components that can process data as a batch, each data processing job in Hadoop is a map reduce process. The map and reduce paradigm of process execution is not new; it has been used in many applications ever since mainframe systems came into existence. It is based on divide and conquer and stems from the traditional multi-threading model. The primary mechanism here is to divide the batch across multiple processes and then combine/reduce the output of all the processes into a single output.

Figure 07: Batch Processing

Speed (Near Real Time) Data Processing: This layer is expected to perform near real time processing on data received from the ingestion layer. Since the processing is expected to be in near real time, it needs to be fast and efficient, with support and design for high concurrency scenarios and an eventually consistent outcome.
Real-time processing often depends on data such as look-up data and reference data, so a very fast data layer is needed to ensure that retrieving look-up or reference data does not adversely impact the real-time nature of the processing. The near real time data processing pattern is not very different from the way processing is done in batch mode; the primary difference is that the data is processed as soon as it is available and does not have to be batched, as shown below.

Figure 08: Speed (Near Real Time) Processing

Data Storage Layer: The data storage layer is very important in the lambda architecture pattern, as this layer defines the reactivity of the overall solution to the incoming event/data streams. The storage, in the context of a lambda architecture driven data lake, can be classified broadly into non-indexed and indexed data storage. Typically, batch processing is performed on non-indexed data stored as data blocks for faster batch processing, while speed (near real time) processing is performed on indexed data, which can be accessed randomly and supports complex search patterns by means of inverted indexes. Both of these models are depicted below.

Figure 09: Non-Indexed and Indexed Data Storage Examples

Lambda in action

Once all the layers in lambda architecture have performed their respective roles, the data can be exported, exposed via services, and delivered through other protocols from the data lake. This can also include merging the high-quality processed output from batch processing with indexed storage, using technologies and frameworks, so as to provide enriched data for near real time requirements as well as interesting visualizations.

Figure 10: Lambda in action

Summary

In this article we have briefly discussed a practical approach towards implementing a data lake for enterprises by leveraging the Lambda architecture pattern.
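As a closing illustration of the indexed storage described under the Data Storage Layer, the following minimal Python sketch shows how an inverted index supports search over randomly accessed data. It is a toy under stated assumptions: documents are short strings keyed by a numeric ID, tokenization is a plain whitespace split, and the names `build_inverted_index` and `search` are hypothetical. Production search engines add analysis, ranking, and on-disk structures that are not shown here.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lower-cased term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, *terms):
    """Return the IDs of documents that contain every given term."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*postings) if postings else set()


# Hypothetical documents.
documents = {
    1: "order shipped to London",
    2: "order cancelled in New York",
    3: "London office opened",
}
index = build_inverted_index(documents)

print(search(index, "order", "london"))  # {1}
print(search(index, "paris"))            # set()
```

Because a lookup touches only the postings of the queried terms, the cost of a search does not depend on scanning every stored block, which is the property the speed layer relies on.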
Resources for Article: Further resources on this subject: The Microsoft Azure Stack Architecture [article] System Architecture and Design of Ansible [article] Microservices and Service Oriented Architecture [article]


Packt

16 Feb 2016

1 min read

R vs Pandas


This article focuses on comparing pandas with R, the statistical package on which much of pandas' functionality is modeled. It is intended as a guide for R users who wish to use pandas, and for users who wish to replicate functionality that they have seen in the R code in pandas. It focuses on some key features available to R users and shows how to achieve similar functionality in pandas by using some illustrative examples. This article assumes that you have the R statistical package installed. If not, it can be downloaded and installed from here: http://www.r-project.org/. By the end of the article, data analysis users should have a good grasp of the data analysis capabilities of R as compared to pandas, enabling them to transition to or use pandas, should they need to. The various topics addressed in this article include the following: R data types and their pandas equivalents Slicing and selection Arithmetic operations on datatype columns Aggregation and GroupBy Matching Split-apply-combine Melting and reshaping Factors and categorical data


Packt

14 Sep 2015

10 min read

PostgreSQL in Action


In this article by Salahadin Juba, Achim Vannahme, and Andrey Volkov, authors of the book Learning PostgreSQL, we introduce PostgreSQL. PostgreSQL (pronounced Post-Gres-Q-L), or Postgres, is an open source, object-relational database management system. It emphasizes extensibility, creativity, and compatibility. It competes with major relational database vendors and products, such as Oracle, MySQL, SQL Server, and others. It is used by different sectors, including government agencies and the public and private sectors. It is cross-platform and runs on most modern operating systems, including Windows, Mac, and Linux flavors. It conforms to SQL standards and is ACID compliant. (For more resources related to this topic, see here.)

An overview of PostgreSQL

PostgreSQL has many rich features. It provides enterprise-level services, including performance and scalability. It has a very supportive community and very good documentation.

The history of PostgreSQL

The name PostgreSQL comes from post-Ingres database. The history of PostgreSQL can be summarized as follows:

Academia: University of California at Berkeley (UC Berkeley)
- 1977-1985, Ingres project: Michael Stonebraker created an RDBMS according to the formal relational model.
- 1986-1994, postgres: Michael Stonebraker created postgres in order to support complex data types and the object-relational model.
- 1995, Postgres95: Andrew Yu and Jolly Chen replaced the postgres query language (PostQUEL) with an extended subset of SQL.

Industry
- 1996, PostgreSQL: Several developers dedicated a lot of labor and time to stabilizing Postgres95. The first open source version was released on January 29, 1997. With the introduction of new features and enhancements, and with the start of the open source project, the Postgres95 name was changed to PostgreSQL.

PostgreSQL began at version 6, with a very strong starting point, by taking advantage of several years of research and development.
Being open source with a very good reputation, PostgreSQL has attracted hundreds of developers. Currently, PostgreSQL has innumerable extensions and a very active community.

Advantages of PostgreSQL

PostgreSQL provides many features that attract developers, administrators, architects, and companies.

Business advantages of PostgreSQL

PostgreSQL is free, open source software (OSS); it has been released under the PostgreSQL license, which is similar to the BSD and MIT licenses. The PostgreSQL license is highly permissive, and PostgreSQL is not subject to monopoly and acquisition. This gives a company the following advantages. There is no licensing cost associated with PostgreSQL. The number of deployments of PostgreSQL is unlimited. It allows a more profitable business model. PostgreSQL is SQL standards compliant, so finding professional developers is not very difficult. PostgreSQL is easy to learn, and porting code from another database vendor to PostgreSQL is cost efficient. Also, PostgreSQL administrative tasks are easy to automate, so the staffing cost is significantly reduced. PostgreSQL is cross-platform, and it has drivers for all modern programming languages, so there is no need to change the company policy about the software stack in order to use PostgreSQL. PostgreSQL is scalable and has high performance. PostgreSQL is very reliable; it rarely crashes. Also, PostgreSQL is ACID compliant, which means that it can tolerate some hardware failure. In addition, it can be configured and installed as a cluster to ensure high availability (HA).

User advantages of PostgreSQL

PostgreSQL is very attractive for developers, administrators, and architects; it has rich features that enable developers to perform tasks in an agile way. The following are some attractive features for the developer: There is a new release almost every year; so far, starting from Postgres95, there have been 23 major releases.
Very good documentation and an active community enable developers to find and solve problems quickly. The PostgreSQL manual is more than 2,500 pages long. A rich extension repository enables developers to focus on the business logic and to meet requirement changes easily. The source code is available free of charge and can be customized and extended without a huge effort. Rich clients and administrative tools enable developers to perform routine tasks, such as describing database objects, exporting and importing data, and dumping and restoring databases, very quickly. Database administration tasks do not require a lot of time and can be automated. PostgreSQL can be integrated easily with other database management systems, giving software architects good flexibility when putting software designs together.

Applications of PostgreSQL

PostgreSQL can be used for a variety of applications. The main PostgreSQL application domains can be classified into two categories:

- Online transactional processing (OLTP): OLTP is characterized by a large number of CRUD operations, very fast processing of operations, and maintaining data integrity in a multi-access environment. The performance is measured in the number of transactions per second.
- Online analytical processing (OLAP): OLAP is characterized by a small number of requests, complex queries that involve data aggregation, a huge amount of data from different sources and in different formats, and data mining and historical data analysis.

OLTP is used to model business operations, such as customer relationship management (CRM). OLAP applications are used for business intelligence, decision support, reporting, and planning. An OLTP database is relatively small compared to an OLAP database. OLTP normally follows relational model concepts, such as normalization, when designing the database, while OLAP is less relational and the schema is often star shaped.
Unlike OLTP, the main operation of OLAP is data retrieval. OLAP data is often generated by a process called Extract, Transform and Load (ETL). ETL is used to load data into the OLAP database from different data sources and different formats. PostgreSQL can be used out of the box for OLTP applications. For OLAP, there are many extensions and tools to support it, such as the PostgreSQL COPY command and Foreign Data Wrappers (FDW).

Success stories

PostgreSQL is used in many application domains, including communication, media, geographical, and e-commerce applications. Many companies provide consultation as well as commercial services, such as migrating proprietary RDBMSs to PostgreSQL in order to cut licensing costs. These companies often influence and enhance PostgreSQL by developing and submitting new features. The following are a few companies that use PostgreSQL: Skype uses PostgreSQL to store user chats and activities. Skype has also contributed to PostgreSQL by developing a set of tools called Skytools. Instagram is a social networking service that enables its users to share pictures and photos. Instagram has more than 100 million active users. The American Chemical Society (ACS): More than one terabyte of data for their journal archive is stored using PostgreSQL. In addition to the preceding list of companies, PostgreSQL is used by HP, VMware, and Heroku. PostgreSQL is used by many scientific communities and organizations, such as NASA, due to its extensibility and rich data types.

Forks

There are more than 20 PostgreSQL forks; PostgreSQL's extensible APIs make postgres a great candidate to fork. Over the years, many groups have forked PostgreSQL and contributed their findings back to PostgreSQL. The following is a list of popular PostgreSQL forks: HadoopDB is a hybrid between the PostgreSQL RDBMS and MapReduce technologies to target analytical workloads. Greenplum is a proprietary DBMS that was built on the foundation of PostgreSQL.
It utilizes the shared-nothing and massively parallel processing (MPP) architectures. It is used as a data warehouse and for analytical workloads. The EnterpriseDB advanced server is a proprietary DBMS that provides Oracle capabilities to cap Oracle fees. Postgres-XC (eXtensible Cluster) is a multi-master PostgreSQL cluster based on the shared-nothing architecture. It emphasizes write-scalability and provides the same APIs to applications that PostgreSQL provides. Vertica is a column-oriented database system, which was started by Michael Stonebraker in 2005 and acquired by HP in 2011. Vertica reused the SQL parser, semantic analyzer, and standard SQL rewrites from the PostgreSQL implementation. Netezza is a popular data warehouse appliance solution that was started as a PostgreSQL fork. Amazon Redshift is a popular data warehouse management system based on PostgreSQL 8.0.2. It is mainly designed for OLAP applications.

The PostgreSQL architecture

PostgreSQL uses the client/server model; the client and server programs could be on different hosts. The communication between the client and server is normally done via TCP/IP or Unix domain sockets. PostgreSQL can handle multiple connections from a client. A common PostgreSQL program consists of the following operating system processes:

- Client process or program (frontend): The database frontend application performs a database action. The frontend could be a web server that wants to display a web page or a command-line tool to perform maintenance tasks. PostgreSQL provides frontend tools, such as psql, createdb, dropdb, and createuser.
- Server process (backend): The server process manages database files, accepts connections from client applications, and performs actions on behalf of the client; the server process name is postgres.
PostgreSQL forks a new process for each new connection; thus, the client and server processes communicate with each other without the intervention of the server main process (postgres), and they have a certain lifetime determined by accepting and terminating a client connection. The abstract architecture of PostgreSQL The aforementioned abstract, conceptual PostgreSQL architecture can give an overview of PostgreSQL's capabilities and interactions with the client as well as the operating system. The PostgreSQL server can be divided roughly into four subsystems as follows: Process manager: The process manager manages client connections, such as the forking and terminating processes. Query processor: When a client sends a query to PostgreSQL, the query is parsed by the parser, and then the traffic cop determines the query type. A Utility query is passed to the utilities subsystem. The Select, insert, update, and delete queries are rewritten by the rewriter, and then an execution plan is generated by the planner; finally, the query is executed, and the result is returned to the client. Utilities: The utilities subsystem provides the means to maintain the database, such as claiming storage, updating statistics, exporting and importing data with a certain format, and logging. Storage manager: The storage manager handles the memory cache, disk buffers, and storage allocation. Almost all PostgreSQL components can be configured, including the logger, planner, statistical analyzer, and storage manager. PostgreSQL configuration is governed by the application nature, such as OLAP and OLTP. The following diagram shows the PostgreSQL abstract, conceptual architecture: PostgreSQL's abstract, conceptual architecture The PostgreSQL community PostgreSQL has a very cooperative, active, and organized community. In the last 8 years, the PostgreSQL community published eight major releases. Announcements are brought to developers via the PostgreSQL weekly newsletter. 
There are dozens of mailing lists organized into categories, such as users, developers, and associations. Examples of user mailing lists are pgsql-general, pgsql-docs, and pgsql-bugs. pgsql-general is a very important mailing list for beginners. All non-bug-related questions about PostgreSQL installation, tuning, basic administration, PostgreSQL features, and general discussions are submitted to this list. The PostgreSQL community runs a blog aggregation service called Planet PostgreSQL (https://planet.postgresql.org/). Several PostgreSQL developers and companies use this service to share their experience and knowledge.

Summary

PostgreSQL is an open source, object-relational database system. It supports many advanced features and complies with the ANSI-SQL:2008 standard. It has won industry recognition and user appreciation. The PostgreSQL slogan "The world's most advanced open source database" reflects the sophistication of the PostgreSQL features. PostgreSQL is a result of many years of research and collaboration between academia and industry. Companies in their infancy often favor PostgreSQL due to licensing costs. PostgreSQL can aid profitable business models. PostgreSQL is also favoured by many developers because of its capabilities and advantages.

Resources for Article: Further resources on this subject: Introducing PostgreSQL 9 [article] PostgreSQL – New Features [article] Installing PostgreSQL [article]



Sasank Chilamkurthy

12 Oct 2016

5 min read

Solving an NLP Problem with Keras, Part 1


In a previous two-part post series on Keras, I introduced Convolutional Neural Networks (CNNs) and the Keras deep learning framework. We used them to solve a Computer Vision (CV) problem involving traffic sign recognition. Now, in this two-part post series, we will solve a Natural Language Processing (NLP) problem with Keras. Let's begin.

The Problem and the Dataset

The problem we are going to tackle is Natural Language Understanding. The aim is to extract the meaning of speech utterances. In its general form this is still an unsolved problem, so we reduce it to a solvable, practical one: understanding the speaker in a limited context. In particular, we want to identify the intent of a speaker asking for information about flights. The dataset we are going to use is the Airline Travel Information System (ATIS) dataset. It was collected by DARPA in the early 90s. ATIS consists of spoken queries on flight-related information. An example utterance is "I want to go from Boston to Atlanta on Monday". Understanding this is then reduced to identifying arguments like Destination and Departure Day. This task is called slot-filling. Here is an example sentence and its labels. You will observe that labels are encoded in an Inside Outside Beginning (IOB) representation. Let's look at the dataset:

Words  | Show | flights | from | Boston | to | New   | York  | today
Labels | O    | O       | O    | B-dept | O  | B-arr | I-arr | B-date

The ATIS official split contains 4,978/893 sentences for a total of 56,590/9,198 words (average sentence length is 15) in the train/test set. The number of classes (different slots) is 128, including the O label (NULL). Unseen words in the test set are encoded by the <UNK> token, and each digit is replaced with the string DIGIT; that is, 20 is converted to DIGITDIGIT.

Our approach to the problem is to use:

- Word embeddings
- Recurrent neural networks

I'll talk about these briefly in the following sections.
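To make the IOB labels and the DIGIT substitution described above concrete, here is a minimal Python sketch. It is not the preprocessing code used for this post: the function names `normalize_digits` and `decode_slots` are hypothetical, and the decoder assumes a well-formed label sequence, treating a stray I- label with no matching B- label as O.

```python
import re


def normalize_digits(token):
    """Replace every digit with the string DIGIT, e.g. '20' -> 'DIGITDIGIT'."""
    return re.sub(r"\d", "DIGIT", token)


def decode_slots(words, labels):
    """Group IOB-labelled words into (slot_name, text) pairs."""
    slots = []
    name, parts = None, []
    for word, label in zip(words, labels):
        if label.startswith("B-"):
            # A B- label always opens a new slot, closing any open one.
            if name is not None:
                slots.append((name, " ".join(parts)))
            name, parts = label[2:], [word]
        elif label.startswith("I-") and name == label[2:]:
            # An I- label continues the slot that is currently open.
            parts.append(word)
        else:
            # O (or a stray I-) closes any open slot.
            if name is not None:
                slots.append((name, " ".join(parts)))
            name, parts = None, []
    if name is not None:
        slots.append((name, " ".join(parts)))
    return slots


words = "Show flights from Boston to New York today".split()
labels = ["O", "O", "O", "B-dept", "O", "B-arr", "I-arr", "B-date"]

print(decode_slots(words, labels))
# [('dept', 'Boston'), ('arr', 'New York'), ('date', 'today')]
print(normalize_digits("20"))  # DIGITDIGIT
```

The B-/I- distinction is what lets the decoder keep "New York" together as one arrival slot instead of two separate ones.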
Word Embeddings

Word embeddings map words to vectors in a high-dimensional space. These embeddings can learn the semantic and syntactic information of words: similar words end up close to each other in this space, and dissimilar words end up far apart. Embeddings can be learned either from large amounts of text such as Wikipedia, or specifically for a given problem. We will take the second approach for this problem.

As an illustration, here are the nearest neighbors in the word embedding space for some of the words. The first row lists the query words, and each column beneath lists that word's neighbors. This embedding space was learned by the model that we'll define later in the post:

| sunday     | delta       | california   | boston      | august    | time       | car       |
| wednesday  | continental | colorado     | nashville   | september | schedule   | rental    |
| saturday   | united      | florida      | toronto     | july      | times      | limousine |
| friday     | american    | ohio         | chicago     | june      | schedules  | rentals   |
| monday     | eastern     | georgia      | phoenix     | december  | dinnertime | cars      |
| tuesday    | northwest   | pennsylvania | cleveland   | november  | ord        | taxi      |
| thursday   | us          | north        | atlanta     | april     | f28        | train     |
| wednesdays | nationair   | tennessee    | milwaukee   | october   | limo       | limo      |
| saturdays  | lufthansa   | minnesota    | columbus    | january   | departure  | ap        |
| sundays    | midwest     | michigan     | minneapolis | may       | sfo        | later     |

Recurrent Neural Networks

Convolutional layers can be a great way to pool local information, but they do not really capture the sequentiality of data. Recurrent Neural Networks (RNNs) help us tackle sequential information like natural language. If we are going to predict properties of the current word, we had better remember the words before it too. An RNN has an internal state (memory) that stores a summary of the sequence it has seen so far. This allows us to use RNNs to solve complicated word tagging problems such as Part Of Speech (POS) tagging or slot filling, as in our case.

The following diagram illustrates the internals of an RNN:

[Diagram: an RNN unrolled over time. Source: Nature]

Let's briefly go through the diagram:

x_1, x_2, ..., x_(t-1), x_t, x_(t+1), ... is the input to the RNN.
s_t is the hidden state of the RNN at step t. It is computed from the state at step t-1 as s_t = f(U x_t + W s_(t-1)). Here f is a nonlinearity such as tanh or ReLU.
o_t is the output at step t, computed as o_t = f(V s_t).
U, V, W are the learnable parameters of the RNN.

For our problem, we will pass a sequence of word embeddings as the input to the RNN.

Putting it all together

Now that we've set up the problem and have an understanding of the basic blocks, let's code it up. Since we are using the IOB representation for labels, it's not simple to calculate the scores of our model. We therefore use the conlleval Perl script to compute the F1 scores. I've adapted the code from here for the data preprocessing and score calculation. The complete code is available at GitHub:

$ git clone https://github.com/chsasank/ATIS.keras.git
$ cd ATIS.keras

I recommend using Jupyter Notebook to run and experiment with the snippets from the tutorial.

$ jupyter notebook

Conclusion

In part 2, we will load the data using data.load.atisfull(). We will also define the Keras model, and then we will train the model. To measure the accuracy of the model, we'll use model.predict_on_batch() and metrics.accuracy.conlleval(). And finally, we will improve our model to achieve better results.

About the author

Sasank Chilamkurthy works at Fractal Analytics. His work involves deep learning on medical images obtained from radiology and pathology. He is mainly interested in computer vision.
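As a supplement to the Recurrent Neural Networks section of this post, the recurrence s_t = f(U x_t + W s_(t-1)) and the output o_t = f(V s_t) can be written out directly. The following is a minimal pure-Python sketch, not the Keras model from part 2: the helper names `matvec` and `rnn_forward`, the zero initial state, and the tiny hand-picked weight matrices are all assumptions made for illustration.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def rnn_forward(inputs, U, W, V, f=math.tanh):
    """Apply s_t = f(U x_t + W s_(t-1)) and o_t = f(V s_t) over a sequence."""
    state = [0.0] * len(W)  # assumed initial state s_0 = 0
    states, outputs = [], []
    for x_t in inputs:
        pre_activation = [a + b for a, b in zip(matvec(U, x_t), matvec(W, state))]
        state = [f(p) for p in pre_activation]
        states.append(state)
        outputs.append([f(o) for o in matvec(V, state)])
    return states, outputs


# Toy sizes: 3-dimensional inputs, 2-dimensional hidden state, 1 output.
U = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
W = [[0.1, 0.0], [0.0, 0.1]]
V = [[1.0, -1.0]]
sequence = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

states, outputs = rnn_forward(sequence, U, W, V)
print(states)
print(outputs)
```

Because the same U, W, and V are reused at every step, the state after the second input depends on the first input as well, which is the memory that makes RNNs suitable for tagging each word in context.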


Packt

06 Sep 2011

8 min read

NHibernate 3: Creating a Sample Application


Packt

17 Mar 2016

21 min read

Integrating Imagery with Creating and Styling Features in OpenLayers 3


Packt

21 Jul 2014

9 min read

Sharding in Action


Packt

01 Mar 2016

32 min read

Breaking the Bank


Packt

08 Oct 2009

5 min read

Debugging the Scheduler in Oracle 11g Databases


Unix: all releases

Something that has not been made very clear in the Oracle Scheduler documentation is that redirection cannot be used in jobs (<, >, >>, |, &&, ||). As a result, many developers have tried to use it. So, let's keep in mind that we cannot use redirection, neither in 11g nor in older releases of the database.

The scripts must be executable, so don't forget to set the execution bits. This might seem like knocking down an open door, but it's easily forgotten.

The user (who is the process owner of the external job and is nobody:nobody by default) should be able to execute the $ORACLE_HOME/bin/extjob file. In Unix, this means that the user should have execution permissions on all the parent directories of this file. This is not something specific to Oracle; it's just the way a Unix file system works. Really! Check it out. Since 10gR1, Oracle does not give execution privileges to others. A simple test for this is to try starting SQL*Plus as a regular user who is neither the Oracle installation user nor a member of the DBA group. If you get all kinds of errors, the permissions are not correct, assuming that the environment variables (ORACLE_HOME and PATH) are set up correctly. The $ORACLE_HOME/install/changePerm.sh script can fix the permissions within ORACLE_HOME (for 10g). In Oracle 11g, this changed again and the script is no longer needed.

The Scheduler interprets the return code of external jobs and records it in the *_scheduler_job_run_details view. This interpretation can be very misleading, especially when using your own exit codes. For example, when you code your script to check the number of arguments, and code an exit 1 when you find the incorrect number of arguments, the error number is translated to ORA-27301: OS failure message: No such file or directory by Oracle using the code in errno.h. In 11g, the Scheduler also records the return code in the error# column.
This lets us recognize the error code better and find where it is raised in the script that ran, provided the error codes are unique within the script.

When Oracle first introduced the Scheduler, the requirements changed quickly between releases. Here are the most important changes, which can cause problems when the files mentioned are not set up exactly as listed:

10.2.0.1: $ORACLE_HOME/bin/extjob should be owned by the user who runs the jobs (process owner) and have 6550 permissions (setuid process owner). In the regular notation shown by ls -l, the privileges should be -r-sr-s---.

10.2.0.2: $ORACLE_HOME/rdbms/admin/externaljob.ora should be owned by root. This file is owned by the Oracle user (the user who installed Oracle) and the Oracle install group with 644 permissions or -rw-r--r--, as shown by ls -l. This file controls which operating system user is going to be the process owner, or which user is going to run the job. The default contents of this file are as shown in the following screenshot:

$ORACLE_HOME/bin/extjob must be setuid root (permissions 4750 or -rwsr-x---) and executable for the Oracle install group, where setuid root means that root should be the owner of the file. This also means that while executing this binary, we temporarily get root privileges on the system.

$ORACLE_HOME/bin/extjobo should have normal 755 or -rwxr-xr-x permissions, and be owned by the normal Oracle software owner and group. If this file is missing, just copy it from $ORACLE_HOME/bin/extjob. On AIX, this is the first release that has external job support.

11g release: In 11g, the same files as in 10.2.0.2 exist with the same permissions. In addition, $ORACLE_HOME/bin/jssu is owned by root and the Oracle install group with setuid root (permissions 4750 or -rwsr-x---).

It is undoubtedly best to stop using the old 10g external jobs and migrate to the 11g external jobs with credentials as soon as possible.
The security of the remote external jobs is better because they use credentials instead of falling back to the contents of a single file in $ORACLE_HOME/, and the flexibility is much better. In 11g, the process owner of the remote external jobs is controlled by the credential and not by a file.

Windows usage

On Windows, this is a little easier with regard to file system security. The OracleJobscheduler service must exist in a running state, and the user who runs this service should have the Log on as a batch job privilege. A .bat file cannot be run directly, but should be called as an argument of cmd.exe, for example:

--/
BEGIN
  DBMS_SCHEDULER.create_job(
    job_name            => 'env_windows',
    job_type            => 'EXECUTABLE',
    number_of_arguments => 2,
    job_action          => 'C:\windows\system32\cmd.exe',
    auto_drop           => FALSE,
    enabled             => FALSE);
  DBMS_SCHEDULER.set_job_argument_value('env_windows', 1, '/c');
  DBMS_SCHEDULER.set_job_argument_value('env_windows', 2, 'd:\temp\test.bat');
END;
/

This job, named env_windows, calls cmd.exe, which in turn runs the script named test.bat that we created in d:\temp. When the script we want to call needs arguments, they should be listed from argument number 3 onwards.
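Returning to the Unix ownership and permission requirements listed earlier in this article, a quick way to verify them is to compare each file's mode bits and owner against the expected values. The following Python sketch is an illustration only and not an Oracle-supplied tool: the helper name `check_file` and the `EXPECTED` table are hypothetical, the values are the 10.2.0.2 and 11g ones quoted above, and externaljob.ora is left out because its ownership requirement is best confirmed against your own release.

```python
import os
import stat


def check_file(path, expected_mode, expected_uid=None):
    """Return a list of problems; an empty list means the file matches."""
    if not os.path.exists(path):
        return [f"{path}: file is missing"]
    info = os.stat(path)
    problems = []
    # S_IMODE keeps the permission bits, including setuid and setgid.
    actual_mode = stat.S_IMODE(info.st_mode)
    if actual_mode != expected_mode:
        problems.append(
            f"{path}: mode is {actual_mode:04o}, expected {expected_mode:04o}"
        )
    if expected_uid is not None and info.st_uid != expected_uid:
        problems.append(
            f"{path}: owner uid is {info.st_uid}, expected {expected_uid}"
        )
    return problems


# (path relative to ORACLE_HOME, expected mode, expected owner uid or None)
EXPECTED = [
    ("bin/extjob", 0o4750, 0),     # setuid root, -rwsr-x---
    ("bin/extjobo", 0o755, None),  # normal -rwxr-xr-x, Oracle software owner
    ("bin/jssu", 0o4750, 0),       # 11g only: setuid root, -rwsr-x---
]

oracle_home = os.environ.get("ORACLE_HOME")
if oracle_home:
    for relative_path, mode, uid in EXPECTED:
        full_path = os.path.join(oracle_home, relative_path)
        for problem in check_file(full_path, mode, uid):
            print(problem)
```

Run it as a user who can read the directories under ORACLE_HOME; no output means the checked files match the listed modes and owners.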


