Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Events
Videos
Audiobooks
Packt Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds

How-To Tutorials - Data

1229 Articles
article-image-advanced-shiny-functions
Packt
08 Jan 2016
14 min read
Save for later

Advanced Shiny Functions

Packt
08 Jan 2016
14 min read
In this article by Chris Beeley, author of the book, Web Application Development with R using Shiny - Second Edition, we are going to extend our toolkit by learning about advanced Shiny functions. These allow you to take control of the fine details of your application, including the interface, reactivity, data, and graphics. We will cover the following topics: Learn how to show and hide parts of the interface Change the interface reactively Finely control reactivity, so functions and outputs run at the appropriate time Use URLs and reactive Shiny functions to populate and alter the selections within an interface Upload and download data to and from a Shiny application Produce beautiful tables with the DataTables jQuery library (For more resources related to this topic, see here.) Summary of the application We're going to add a lot of new functionality to the application, and it won't be possible to explain every piece of code before we encounter it. Several of the new functions depend on at least one other function, which means that you will see some of the functions for the first time, whereas a different function is being introduced. It's important, therefore, that you concentrate on whichever function is being explained and wait until later in the article to understand the whole piece of code. In order to help you understand what the code does as you go along it is worth quickly reviewing the actual functionality of the application now. In terms of the functionality, which has been added to the application, it is now possible to select not only the network domain from which browser hits originate but also the country of origin. The draw map function now features a button in the UI, which prevents the application from updating the map each time new data is selected, the map is redrawn only when the button is pushed. This is to prevent minor updates to the data from wasting processor time before the user has finished making their final data selection. A Download report button has been added, which sends some of the output as a static file to a new webpage for the user to print or download. An animated graph of trend has been added; this will be explained in detail in the relevant section. Finally, a table of data has been added, which summarizes mean values of each of the selectable data summaries across the different countries of origin. Downloading data from RGoogleAnalytics The code is given and briefly summarized to give you a feeling for how to use it in the following section. Note that my username and password have been replaced with XXXX; you can get your own user details from the Google Analytics website. Also, note that this code is not included on the GitHub because it requires the username and password to be present in order for it to work: library(RGoogleAnalytics) ### Generate the oauth_token object oauth_token <- Auth(client.id = "xxxx", client.secret = "xxxx") # Save the token object for future sessions save(oauth_token, file = "oauth_token") Once you have your client.id and client.secret from the Google Analytics website, the preceding code will direct you to a browser to authenticate the application and save the authorization within oauth_token. This can be loaded in future sessions to save from reauthenticating each time as follows: # Load the token object and validate for new run load("oauth_token") ValidateToken(oauth_token) The preceding code will load the token in subsequent sessions. The validateToken() function is necessary each time because the authorization will expire after a time this function will renew the authentication: ## list of metrics and dimensions query.list <- Init(start.date = "2013-01-01", end.date = as.character(Sys.Date()), dimensions = "ga:country,ga:latitude,ga:longitude, ga:networkDomain,ga:date", metrics = "ga:users,ga:newUsers,ga:sessions, ga:bounceRate,ga:sessionDuration", max.results = 10000, table.id = "ga:71364313") gadf = GetReportData(QueryBuilder(query.list), token = oauth_token, paginate_query = FALSE) Finally, the metrics and dimensions of interest (for more on metrics and dimensions, see the documentation of the Google Analytics API online) are placed within a list and downloaded with the GetReportData() function as follows: ...[data tidying functions]... save(gadf, file = "gadf.Rdata") The data tidying that is carried out at the end is omitted here for brevity, as you can see at the end the data is saved as gadf.Rdata ready to load within the application. Animation Animation is surprisingly easy. The sliderInput() function, which gives an HTML widget that allows the selection of a number along a line, has an optional animation function that will increment a variable by a set amount every time a specified unit of time elapses. This allows you to very easily produce a graphic that animates. In the following example, we are going to look at the monthly graph and plot a linear trend line through the first 20% of the data (0–20% of the data). Then, we are going to increment the percentage value that selects the portion of the data by 5% and plot a linear through that portion of data (5–25% of the data). Then, increment again from 10% to 30% and plot another line and so on. There is a static image in the following screenshot: The slider input is set up as follows, with an ID, label, minimum value, maximum value, initial value, step between values, and the animation options, giving the delay in milliseconds and whether the animation should loop: sliderInput("animation", "Trend over time", min = 0, max = 80, value = 0, step = 5, animate = animationOptions(interval = 1000, loop = TRUE) ) Having set this up, the animated graph code is pretty simple, looking very much like the monthly graph data except with the linear smooth based on a subset of the data instead of the whole dataset. The graph is set up as before and then a subset of the data is produced on which the linear smooth can be based: groupByDate <- group_by(passData(), YearMonth, networkDomain) %>% summarise(meanSession = mean(sessionDuration, na.rm = TRUE), users = sum(users), newUsers = sum(newUsers), sessions = sum(sessions)) groupByDate$Date <- as.Date(paste0(groupByDate$YearMonth, "01"), format = "%Y%m%d") smoothData <- groupByDate[groupByDate$Date %in% quantile(groupByDate$Date, input$animation / 100, type = 1): quantile(groupByDate$Date, (input$animation + 20) / 100, type = 1), ] We won't get too distracted by this code, but essentially, it tests to see which of the whole date range falls in a range defined by percentage quantiles based on the sliderInput() values. See ?quantile for more information. Finally, the linear smooth is drawn with an extra data argument to tell ggplot2 to base the line only on the smaller smoothData object and not the whole range: ggplot(groupByDate, aes_string(x = "Date", y = input$outputRequired, group = "networkDomain", colour = "networkDomain") ) + geom_line() + geom_smooth(data = smoothData, method = "lm", colour = "black" ) Not bad for a few lines of code. We have both ggplot2 and Shiny to thank for how easy this is. Streamline the UI by hiding elements This is a simple function that you are certainly going to need if you build even a moderately complex application. Those of you who have been doing extra credit exercises and/or experimenting with your own applications will probably have already wished for this or, indeed, have already found it. conditionalPanel() allows you to show/hide UI elements based on other selections within the UI. The function takes a condition (in JavaScript, but the form and syntax will be familiar from many languages) and a UI element and displays the UI only when the condition is true. This has actually used a couple of times in the advanced GA application, and indeed in all the applications, I've ever written of even moderate complexity. We're going to show the option to smooth the trend graph only when the trend graph tab is displayed, and we're going to show the controls for the animated graph only when the animated graph tab is displayed. Naming tabPanel elements In order to allow testing for which tab is currently selected, we're going to have to first give the tabs of the tabbed output names. This is done as follows (with the new code in bold): tabsetPanel(id = "theTabs", # give tabsetPanel a name tabPanel("Summary", textOutput("textDisplay"), value = "summary"), tabPanel("Trend", plotOutput("trend"), value = "trend"), tabPanel("Animated", plotOutput("animated"), value = "animated"), tabPanel("Map", plotOutput("ggplotMap"), value = "map"), tabPanel("Table", DT::dataTableOutput("countryTable"), value = "table") As you can see, the whole panel is given an ID (theTabs) and then each tabPanel is also given a name (summary, trend, animated, map, and table). They are referred to in the server.R file very simply as input$theTabs. Finally, we can make our changes to ui.R to remove parts of the UI based on tab selection: conditionalPanel( condition = "input.theTabs == 'trend'", checkboxInput("smooth", label = "Add smoother?", # add smoother value = FALSE) ), conditionalPanel( condition = "input.theTabs == 'animated'", sliderInput("animation", "Trend over time", min = 0, max = 80, value = 0, step = 5, animate = animationOptions(interval = 1000, loop = TRUE) ) ) As you can see, the condition appears very R/Shiny-like, except with the . operator familiar to JavaScript users in place of $. This is a very simple but powerful way of making sure that your UI is not cluttered with an irrelevant material. Beautiful tables with DataTable The latest version of Shiny has added support to draw tables using the wonderful DataTables jQuery library. This will enable your users to search and sort through large tables very easily. To see DataTable in action, visit the homepage at http://datatables.net/. The version in this application summarizes the values of different variables across the different countries from which browser hits originate and looks as follows: The package can be installed using install.packages("DT") and needs to be loaded in the preamble to the server.R file with library(DT). Once this is done using the package is quite straightforward. There are two functions: one in server.R (renderDataTable) and other in ui.R (dataTableOutput). They are used as following: ### server. R output$countryTable <- DT::renderDataTable ({ groupCountry <- group_by(passData(), country) groupByCountry <- summarise(groupCountry, meanSession = mean(sessionDuration), users = log(sum(users)), sessions = log(sum(sessions)) ) datatable(groupByCountry) }) ### ui.R tabPanel("Table", DT::dataTableOutput("countryTable"), value = "table") Anything that returns a dataframe or a matrix can be used within renderDataTable(). Note that as of Shiny V. 0.12, the Shiny functions renderDataTable() and dataTableOutput() functions are deprecated: you should use the DT equivalents of the same name, as in the preceding code adding DT:: before each function name specifies that the function should be drawn from that package. Reactive user interfaces Another trick you will definitely want up your sleeve at some point is a reactive user interface. This enables you to change your UI (for example, the number or content of radio buttons) based on reactive functions. For example, consider an application that I wrote related to survey responses across a broad range of health services in different areas. The services are related to each other in quite a complex hierarchy, and over time, different areas and services respond (or cease to exist, or merge, or change their name), which means that for each time period the user might be interested in, there would be a totally different set of areas and services. The only sensible solution to this problem is to have the user tell you which area and date range they are interested in and then give them back the correct list of services that have survey responses within that area and date range. The example we're going to look at is a little simpler than this, just to keep from getting bogged down in too much detail, but the principle is exactly the same, and you should not find this idea too difficult to adapt to your own UI. We are going to allow users to constrain their data by the country of origin of the browser hit. Although we could design the UI by simply taking all the countries that exist in the entire dataset and placing them all in a combo box to be selected, it is a lot cleaner to only allow the user to select from the countries that are actually present within the particular date range they have selected. This has the added advantage of preventing the user from selecting any countries of origin, which do not have any browser hits within the currently selected dataset. In order to do this, we are going to create a reactive user interface, that is, one that changes based on data values that come about from user input. Reactive user interface example – server.R When you are making a reactive user interface, the big difference is that instead of writing your UI definition in your ui.R file, you place it in server.R and wrap it in renderUI(). Then, point to it from your ui.R file. Let's have a look at the relevant bit of the server.R file: output$reactCountries <- renderUI({ countryList = unique(as.character(passData()$country)) selectInput("theCountries", "Choose country", countryList) }) The first line takes the reactive dataset that contains only the data between the dates selected by the user and gives all the unique values of countries within it. The second line is a widget type we have not used yet, which generates a combo box. The usual id and label arguments are given, followed by the values that the combo box can take. This is taken from the variable defined in the first line. Reactive user interface example – ui.R The ui.R file merely needs to point to the reactive definition, as shown in the following line of code (just add it in to the list of widgets within sidebarPanel()): uiOutput("reactCountries") You can now point to the value of the widget in the usual way as input$subDomains. Note that you do not use the name as defined in the call to renderUI(), that is, reactCountries, but rather the name as defined within it, that is, theCountries. Progress bars It is quite common within Shiny applications and in analytics generally to have computations or data fetches that take a long time. However, even using all these tools, it will sometimes be necessary for the user to wait some time before their output is returned. In cases like this, it is a good practice to do two things: first, to inform that the server is processing the request and has not simply crashed or otherwise failed, and second to give the user some idea of how much time has elapsed since they requested the output and how much time they have remaining to wait. This is achieved very simply in Shiny using the withProgress() function. This function defaults to measuring progress on a scale from 0 to 1 and produces a loading bar at the top of the application with the information from the message and detail arguments of the loading function. You can see in the following code, the withProgress function is used to wrap a function (in this case, the function that draws the map), with message and detail arguments describing what is happened and an initial value of 0 (value = 0, that is, no progress yet): withProgress(message = 'Please wait', detail = 'Drawing map...', value = 0, { ... function code... } ) As the code is stepped through, the value of progress can steadily be increased from 0 to 1 (for example, in a for() loop) using the following code: incProgress(1/3) The third time this is called, the value of progress will be 1, which indicates that the function has completed (although other values of progress can be selected where necessary, see ?withProgess()). To summarize, the finished code looks as follows: withProgress(message = 'Please wait', detail = 'Drawing map...', value = 0, { ... function code... incProgress(1/3) .. . function code... incProgress(1/3) ... function code... incProgress(1/3) } ) It's very simple. Again, have a look at the application to see it in action. Summary In this article, you have now seen most of the functionality within Shiny. It's a relatively small but powerful toolbox with which you can build a vast array of useful and intuitive applications with comparatively little effort. In this respect, ggplot2 is rather a good companion for Shiny because it too offers you a fairly limited selection of functions with which knowledgeable users can very quickly build many different graphical outputs. Resources for Article: Further resources on this subject: Introducing R, RStudio, and Shiny[article] Introducing Bayesian Inference[article] R ─ Classification and Regression Trees[article]
Read more
  • 0
  • 0
  • 4752

article-image-big-data-analysis-r-and-hadoop
Packt
26 Mar 2015
37 min read
Save for later

Big Data Analysis (R and Hadoop)

Packt
26 Mar 2015
37 min read
This article is written by Yu-Wei, Chiu (David Chiu), the author of Machine Learning with R Cookbook. In this article, we will cover the following topics: Preparing the RHadoop environment Installing rmr2 Installing rhdfs Operating HDFS with rhdfs Implementing a word count problem with RHadoop Comparing the performance between an R MapReduce program and a standard R program Testing and debugging the rmr2 program Installing plyrmr Manipulating data with plyrmr Conducting machine learning with RHadoop Configuring RHadoop clusters on Amazon EMR (For more resources related to this topic, see here.) RHadoop is a collection of R packages that enables users to process and analyze big data with Hadoop. Before understanding how to set up RHadoop and put it in to practice, we have to know why we need to use machine learning to big-data scale. The emergence of Cloud technology has made real-time interaction between customers and businesses much more frequent; therefore, the focus of machine learning has now shifted to the development of accurate predictions for various customers. For example, businesses can provide real-time personal recommendations or online advertisements based on personal behavior via the use of a real-time prediction model. However, if the data (for example, behaviors of all online users) is too large to fit in the memory of a single machine, you have no choice but to use a supercomputer or some other scalable solution. The most popular scalable big-data solution is Hadoop, which is an open source framework able to store and perform parallel computations across clusters. As a result, you can use RHadoop, which allows R to leverage the scalability of Hadoop, helping to process and analyze big data. In RHadoop, there are five main packages, which are: rmr: This is an interface between R and Hadoop MapReduce, which calls the Hadoop streaming MapReduce API to perform MapReduce jobs across Hadoop clusters. To develop an R MapReduce program, you only need to focus on the design of the map and reduce functions, and the remaining scalability issues will be taken care of by Hadoop itself. rhdfs: This is an interface between R and HDFS, which calls the HDFS API to access the data stored in HDFS. The use of rhdfs is very similar to the use of the Hadoop shell, which allows users to manipulate HDFS easily from the R console. rhbase: This is an interface between R and HBase, which accesses Hbase and is distributed in clusters through a Thrift server. You can use rhbase to read/write data and manipulate tables stored within HBase. plyrmr: This is a higher-level abstraction of MapReduce, which allows users to perform common data manipulation in a plyr-like syntax. This package greatly lowers the learning curve of big-data manipulation. ravro: This allows users to read avro files in R, or write avro files. It allows R to exchange data with HDFS. In this article, we will start by preparing the Hadoop environment, so that you can install RHadoop. We then cover the installation of three main packages: rmr, rhdfs, and plyrmr. Next, we will introduce how to use rmr to perform MapReduce from R, operate an HDFS file through rhdfs, and perform a common data operation using plyrmr. Further, we will explore how to perform machine learning using RHadoop. Lastly, we will introduce how to deploy multiple RHadoop clusters on Amazon EC2. Preparing the RHadoop environment As RHadoop requires an R and Hadoop integrated environment, we must first prepare an environment with both R and Hadoop installed. Instead of building a new Hadoop system, we can use the Cloudera QuickStart VM (the VM is free), which contains a single node Apache Hadoop Cluster and R. In this recipe, we will demonstrate how to download the Cloudera QuickStart VM. Getting ready To use the Cloudera QuickStart VM, it is suggested that you should prepare a 64-bit guest OS with either VMWare or VirtualBox, or the KVM installed. If you choose to use VMWare, you should prepare a player compatible with WorkStation 8.x or higher: Player 4.x or higher, ESXi 5.x or higher, or Fusion 4.x or higher. Note, 4 GB of RAM is required to start VM, with an available disk space of at least 3 GB. How to do it... Perform the following steps to set up a Hadoop environment using the Cloudera QuickStart VM: Visit the Cloudera QuickStart VM download site (you may need to update the link as Cloudera upgrades its VMs , the current version of CDH is 5.3) at http://www.cloudera.com/content/cloudera/en/downloads/quickstart_vms/cdh-5-2-x.html. A screenshot of the Cloudera QuickStart VM download site Depending on the virtual machine platform installed on your OS, choose the appropriate link (you may need to update the link as Cloudera upgrades its VMs) to download the VM file: To download VMWare: You can visit https://downloads.cloudera.com/demo_vm/vmware/cloudera-quickstart-vm-5.2.0-0-vmware.7z To download KVM: You can visit https://downloads.cloudera.com/demo_vm/kvm/cloudera-quickstart-vm-5.2.0-0-kvm.7z To download VirtualBox: You can visit https://downloads.cloudera.com/demo_vm/virtualbox/cloudera-quickstart-vm-5.2.0-0-virtualbox.7z Next, you can start the QuickStart VM using the virtual machine platform installed on your OS. You should see the desktop of Centos 6.2 in a few minutes. The screenshot of Cloudera QuickStart VM. You can then open a terminal and type hadoop, which will display a list of functions that can operate a Hadoop cluster. The terminal screenshot after typing hadoop Open a terminal and type R. Access an R session and check whether version 3.1.1 is already installed in the Cloudera QuickStart VM. If you cannot find R installed in the VM, please use the following command to install R: $ yum install R R-core R-core-devel R-devel How it works... Instead of building a Hadoop system on your own, you can use the Hadoop VM application provided by Cloudera (the VM is free). The QuickStart VM runs on CentOS 6.2 with a single node Apache Hadoop cluster, Hadoop Ecosystem module, and R installed. This helps you to save time, instead of requiring you to learn how to install and use Hadoop. The QuickStart VM requires you to have a computer with a 64-bit guest OS, at least 4 GB of RAM, 3 GB of disk space, and either VMWare, VirtualBox, or KVM installed. As a result, you may not be able to use this version of VM on some computers. As an alternative, you could consider using Amazon's Elastic MapReduce instead. We will illustrate how to prepare a RHadoop environment in EMR in the last recipe of this article. Setting up the Cloudera QuickStart VM is simple. Download the VM from the download site and then open the built image with either VMWare, VirtualBox, or KVM. Once you can see the desktop of CentOS, you can then access the terminal and type hadoop to see whether Hadoop is working; then, type R to see whether R works in the QuickStart VM. See also Besides using the Cloudera QuickStart VM, you may consider using a Sandbox VM provided by Hontonworks or MapR. You can find Hontonworks Sandbox at http://hortonworks.com/products/hortonworks-sandbox/#install and mapR Sandbox at https://www.mapr.com/products/mapr-sandbox-hadoop/download. Installing rmr2 The rmr2 package allows you to perform big data processing and analysis via MapReduce on a Hadoop cluster. To perform MapReduce on a Hadoop cluster, you have to install R and rmr2 on every task node. In this recipe, we will illustrate how to install rmr2 on a single node of a Hadoop cluster. Getting ready Ensure that you have completed the previous recipe by starting the Cloudera QuickStart VM and connecting the VM to the Internet, so that you can proceed with downloading and installing the rmr2 package. How to do it... Perform the following steps to install rmr2 on the QuickStart VM: First, open the terminal within the Cloudera QuickStart VM. Use the permission of the root to enter an R session: $ sudo R You can then install dependent packages before installing rmr2: > install.packages(c("codetools", "Rcpp", "RJSONIO", "bitops", "digest", "functional", "stringr", "plyr", "reshape2", "rJava", "caTools")) Quit the R session: > q() Next, you can download rmr-3.3.0 to the QuickStart VM. You may need to update the link if Revolution Analytics upgrades the version of rmr2: $ wget --no-check-certificate https://raw.githubusercontent.com/RevolutionAnalytics/rmr2/3.3.0/build/rmr2_3.3.0.tar.gz You can then install rmr-3.3.0 to the QuickStart VM: $ sudo R CMD INSTALL rmr2_3.3.0.tar.gz Lastly, you can enter an R session and use the library function to test whether the library has been successfully installed: $ R > library(rmr2) How it works... In order to perform MapReduce on a Hadoop cluster, you have to install R and RHadoop on every task node. Here, we illustrate how to install rmr2 on a single node of a Hadoop cluster. First, open the terminal of the Cloudera QuickStart VM. Before installing rmr2, we first access an R session with root privileges and install dependent R packages. Next, after all the dependent packages are installed, quit the R session and use the wget command in the Linux shell to download rmr-3.3.0 from GitHub to the local filesystem. You can then begin the installation of rmr2. Lastly, you can access an R session and use the library function to validate whether the package has been installed. See also To see more information and read updates about RHadoop, you can refer to the RHadoop wiki page hosted on GitHub: https://github.com/RevolutionAnalytics/RHadoop/wiki Installing rhdfs The rhdfs package is the interface between R and HDFS, which allows users to access HDFS from an R console. Similar to rmr2, one should install rhdfs on every task node, so that one can access HDFS resources through R. In this recipe, we will introduce how to install rhdfs on the Cloudera QuickStart VM. Getting ready Ensure that you have completed the previous recipe by starting the Cloudera QuickStart VM and connecting the VM to the Internet, so that you can proceed with downloading and installing the rhdfs package. How to do it... Perform the following steps to install rhdfs: First, you can download rhdfs 1.0.8 from GitHub. You may need to update the link if Revolution Analytics upgrades the version of rhdfs: $wget --no-check-certificate https://raw.github.com/ RevolutionAnalytics/rhdfs/master/build/rhdfs_1.0.8.tar.gz Next, you can install rhdfs under the command-line mode: $ sudo HADOOP_CMD=/usr/bin/hadoop R CMD INSTALL rhdfs_1.0.8.tar.gz You can then set up JAVA_HOME. The configuration of JAVA_HOME depends on the installed Java version within the VM: $ sudo JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera R CMD javareconf Last, you can set up the system environment and initialize rhdfs. You may need to update the environment setup if you use a different version of QuickStart VM: $ R > Sys.setenv(HADOOP_CMD="/usr/bin/hadoop") > Sys.setenv(HADOOP_STREAMING="/usr/lib/hadoop-mapreduce/hadoop-streaming-2.5.0-cdh5.2.0.jar") > library(rhdfs) > hdfs.init() How it works... The package, rhdfs, provides functions so that users can manage HDFS using R. Similar to rmr2, you should install rhdfs on every task node, so that one can access HDFS through the R console. To install rhdfs, you should first download rhdfs from GitHub. You can then install rhdfs in R by specifying where the HADOOP_CMD is located. You must configure R with Java support through the command, javareconf. Next, you can access R and configure where HADOOP_CMD and HADOOP_STREAMING are located. Lastly, you can initialize rhdfs via the rhdfs.init function, which allows you to begin operating HDFS through rhdfs. See also To find where HADOOP_CMD is located, you can use the which hadoop command in the Linux shell. In most Hadoop systems, HADOOP_CMD is located at /usr/bin/hadoop. As for the location of HADOOP_STREAMING, the streaming JAR file is often located in /usr/lib/hadoop-mapreduce/. However, if you cannot find the directory, /usr/lib/Hadoop-mapreduce, in your Linux system, you can search the streaming JAR by using the locate command. For example: $ sudo updatedb $ locate streaming | grep jar | more Operating HDFS with rhdfs The rhdfs package is an interface between Hadoop and R, which can call an HDFS API in the backend to operate HDFS. As a result, you can easily operate HDFS from the R console through the use of the rhdfs package. In the following recipe, we will demonstrate how to use the rhdfs function to manipulate HDFS. Getting ready To proceed with this recipe, you need to have completed the previous recipe by installing rhdfs into R, and validate that you can initial HDFS via the hdfs.init function. How to do it... Perform the following steps to operate files stored on HDFS: Initialize the rhdfs package: > Sys.setenv(HADOOP_CMD="/usr/bin/hadoop") > Sys.setenv(HADOOP_STREAMING="/usr/lib/hadoop-mapreduce/hadoop-streaming-2.5.0-cdh5.2.0.jar") > library(rhdfs) > hdfs.init () You can then manipulate files stored on HDFS, as follows:     hdfs.put: Copy a file from the local filesystem to HDFS: > hdfs.put('word.txt', './')     hdfs.ls: Read the list of directory from HDFS: > hdfs.ls('./')     hdfs.copy: Copy a file from one HDFS directory to another: > hdfs.copy('word.txt', 'wordcnt.txt')     hdfs.move : Move a file from one HDFS directory to another: > hdfs.move('wordcnt.txt', './data/wordcnt.txt')     hdfs.delete: Delete an HDFS directory from R: > hdfs.delete('./data/')     hdfs.rm: Delete an HDFS directory from R: > hdfs.rm('./data/')     hdfs.get: Download a file from HDFS to a local filesystem: > hdfs.get(word.txt', '/home/cloudera/word.txt')     hdfs.rename: Rename a file stored on HDFS: hdfs.rename('./test/q1.txt','./test/test.txt')     hdfs.chmod: Change the permissions of a file or directory: > hdfs.chmod('test', permissions= '777')     hdfs.file.info: Read the meta information of the HDFS file: > hdfs.file.info('./') Also, you can write stream to the HDFS file: > f = hdfs.file("iris.txt","w") > data(iris) > hdfs.write(iris,f) > hdfs.close(f) Lastly, you can read stream from the HDFS file: > f = hdfs.file("iris.txt", "r") > dfserialized = hdfs.read(f) > df = unserialize(dfserialized) > df > hdfs.close(f) How it works... In this recipe, we demonstrate how to manipulate HDFS using the rhdfs package. Normally, you can use the Hadoop shell to manipulate HDFS, but if you would like to access HDFS from R, you can use the rhdfs package. Before you start using rhdfs, you have to initialize rhdfs with hdfs.init(). After initialization, you can operate HDFS through the functions provided in the rhdfs package. Besides manipulating HDFS files, you can exchange streams to HDFS through hdfs.read and hdfs.write. We, therefore, demonstrate how to write a data frame in R to an HDFS file, iris.txt, using hdfs.write. Lastly, you can recover the written file back to the data frame using the hdfs.read function and the unserialize function. See also To initialize rhdfs, you have to set HADOOP_CMD and HADOOP_STREAMING in the system environment. Instead of setting the configuration each time you're using rhdfs, you can put the configurations in the .rprofile file. Therefore, every time you start an R session, the configuration will be automatically loaded. Implementing a word count problem with RHadoop To demonstrate how MapReduce works, we illustrate the example of a word count, which counts the number of occurrences of each word in a given input set. In this recipe, we will demonstrate how to use rmr2 to implement a word count problem. Getting ready In this recipe, we will need an input file as our word count program input. You can download the example input from https://github.com/ywchiu/ml_R_cookbook/tree/master/CH12. How to do it... Perform the following steps to implement the word count program: First, you need to configure the system environment, and then load rmr2 and rhdfs into an R session. You may need to update the use of the JAR file if you use a different version of QuickStart VM: > Sys.setenv(HADOOP_CMD="/usr/bin/hadoop") > Sys.setenv(HADOOP_STREAMING="/usr/lib/hadoop-mapreduce/hadoop-streaming-2.5.0-cdh5.2.0.jar ") > library(rmr2) > library(rhdfs) > hdfs.init() You can then create a directory on HDFS and put the input file into the newly created directory: > hdfs.mkdir("/user/cloudera/wordcount/data") > hdfs.put("wc_input.txt", "/user/cloudera/wordcount/data") Next, you can create a map function: > map = function(.,lines) { keyval( +   unlist( +     strsplit( +       x = lines, +       split = " +")), +   1)} Create a reduce function: > reduce = function(word, counts) +   keyval(word, sum(counts)) + } Call the MapReduce program to count the words within a document: > hdfs.root = 'wordcount' > hdfs.data = file.path(hdfs.root, 'data') > hdfs.out = file.path(hdfs.root, 'out') > wordcount = function (input, output=NULL) { + mapreduce(input=input, output=output, input.format="text", map=map, + reduce=reduce) + } > out = wordcount(hdfs.data, hdfs.out) Lastly, you can retrieve the top 10 occurring words within the document: > results = from.dfs(out) > results$key[order(results$val, decreasing = TRUE)][1:10] How it works... In this recipe, we demonstrate how to implement a word count using the rmr2 package. First, we need to configure the system environment and load rhdfs and rmr2 into R. Then, we specify the input of our word count program from the local filesystem into the HDFS directory, /user/cloudera/wordcount/data, via the hdfs.put function. Next, we begin implementing the MapReduce program. Normally, we can divide the MapReduce program into the map and reduce functions. In the map function, we first use the strsplit function to split each line into words. Then, as the strsplit function returns a list of words, we can use the unlist function to character vectors. Lastly, we can return key-value pairs with each word as a key and the value as one. As the reduce function receives the key-value pair generated from the map function, the reduce function sums the count and returns the number of occurrences of each word (or key). After we have implemented the map and reduce functions, we can submit our job via the mapreduce function. Normally, the mapreduce function requires four inputs, which are the HDFS input path, the HDFS output path, the map function, and the reduce function. In this case, we specify the input as wordcount/data, output as wordcount/out, mapfunction as map, reduce function as reduce, and wrap the mapreduce call in function, wordcount. Lastly, we call the function, wordcount and store the output path in the variable, out. We can use the from.dfs function to load the HDFS data into the results variable, which contains the mapping of words and number of occurrences. We can then generate the top 10 occurring words from the results variable. See also In this recipe, we demonstrate how to write an R MapReduce program to solve a word count problem. However, if you are interested in how to write a native Java MapReduce program, you can refer to http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html. Comparing the performance between an R MapReduce program and a standard R program Those not familiar with how Hadoop works may often see Hadoop as a remedy for big data processing. Some might believe that Hadoop can return the processed results for any size of data within a few milliseconds. In this recipe, we will compare the performance between an R MapReduce program and a standard R program to demonstrate that Hadoop does not perform as quickly as some may believe. Getting ready In this recipe, you should have completed the previous recipe by installing rmr2 into the R environment. How to do it... Perform the following steps to compare the performance of a standard R program and an R MapReduce program: First, you can implement a standard R program to have all numbers squared: > a.time = proc.time() > small.ints2=1:100000 > result.normal = sapply(small.ints2, function(x) x^2) > proc.time() - a.time To compare the performance, you can implement an R MapReduce program to have all numbers squared: > b.time = proc.time() > small.ints= to.dfs(1:100000) > result = mapreduce(input = small.ints, map = function(k,v)       cbind(v,v^2)) > proc.time() - b.time How it works... In this recipe, we implement two programs to square all the numbers. In the first program, we use a standard R function, sapply, to square the sequence from 1 to 100,000. To record the program execution time, we first record the processing time before the execution in a.time, and then subtract a.time from the current processing time after the execution. Normally, the execution takes no more than 10 seconds. In the second program, we use the rmr2 package to implement a program in the R MapReduce version. In this program, we also record the execution time. Normally, this program takes a few minutes to complete a task. The performance comparison shows that a standard R program outperforms the MapReduce program when processing small amounts of data. This is because a Hadoop system often requires time to spawn daemons, job coordination between daemons, and fetching data from data nodes. Therefore, a MapReduce program often takes a few minutes to a couple of hours to finish the execution. As a result, if you can fit your data in the memory, you should write a standard R program to solve the problem. Otherwise, if the data is too large to fit in the memory, you can implement a MapReduce solution. See also In order to check whether a job will run smoothly and efficiently in Hadoop, you can run a MapReduce benchmark, MRBench, to evaluate the performance of the job: $ hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-test.jar mrbench -numRuns 50 Testing and debugging the rmr2 program Since running a MapReduce program will require a considerable amount of time, varying from a few minutes to several hours, testing and debugging become very important. In this recipe, we will illustrate some techniques you can use to troubleshoot an R MapReduce program. Getting ready In this recipe, you should have completed the previous recipe by installing rmr2 into an R environment. How to do it... Perform the following steps to test and debug an R MapReduce program: First, you can configure the backend as local in rmr.options: > rmr.options(backend = 'local') Again, you can execute the number squared MapReduce program mentioned in the previous recipe: > b.time = proc.time() > small.ints= to.dfs(1:100000) > result = mapreduce(input = small.ints, map = function(k,v)       cbind(v,v^2)) > proc.time() - b.time In addition to this, if you want to print the structure information of any variable in the MapReduce program, you can use the rmr.str function: > out = mapreduce(to.dfs(1), map = function(k, v) rmr.str(v)) Dotted pair list of 14 $ : language mapreduce(to.dfs(1), map = function(k, v) rmr.str(v)) $ : language mr(map = map, reduce = reduce, combine = combine, vectorized.reduce, in.folder = if (is.list(input)) {     lapply(input, to.dfs.path) ...< $ : language c.keyval(do.call(c, lapply(in.folder, function(fname) {     kv = get.data(fname) ... $ : language do.call(c, lapply(in.folder, function(fname) {     kv = get.data(fname) ... $ : language lapply(in.folder, function(fname) {     kv = get.data(fname) ... $ : language FUN("/tmp/Rtmp813BFJ/file25af6e85cfde"[[1L]], ...) $ : language unname(tapply(1:lkv, ceiling((1:lkv)/(lkv/(object.size(kv)/10^6))), function(r) {     kvr = slice.keyval(kv, r) ... $ : language tapply(1:lkv, ceiling((1:lkv)/(lkv/(object.size(kv)/10^6))), function(r) {     kvr = slice.keyval(kv, r) ... $ : language lapply(X = split(X, group), FUN = FUN, ...) $ : language FUN(X[[1L]], ...) $ : language as.keyval(map(keys(kvr), values(kvr))) $ : language is.keyval(x) $ : language map(keys(kvr), values(kvr)) $ :length 2 rmr.str(v) ..- attr(*, "srcref")=Class 'srcref' atomic [1:8] 1 34 1 58 34 58 1 1 .. .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x3f984f0> v num 1 How it works... In this recipe, we introduced some debugging and testing techniques you can use while implementing the MapReduce program. First, we introduced the technique to test a MapReduce program in a local mode. If you would like to run the MapReduce program in a pseudo distributed or fully distributed mode, it would take you a few minutes to several hours to complete the task, which would involve a lot of wastage of time while troubleshooting your MapReduce program. Therefore, you can set the backend to the local mode in rmr.options so that the program will be executed in the local mode, which takes lesser time to execute. Another debugging technique is to list the content of the variable within the map or reduce function. In an R program, you can use the str function to display the compact structure of a single variable. In rmr2, the package also provides a function named rmr.str, which allows you to print out the content of a single variable onto the console. In this example, we use rmr.str to print the content of variables within a MapReduce program. See also For those who are interested in the option settings for the rmr2 package, you can refer to the help document of rmr.options: > help(rmr.options) Installing plyrmr The plyrmr package provides common operations (as found in plyr or reshape2) for users to easily perform data manipulation through the MapReduce framework. In this recipe, we will introduce how to install plyrmr on the Hadoop system. Getting ready Ensure that you have completed the previous recipe by starting the Cloudera QuickStart VM and connecting the VM to the Internet. Also, you need to have the rmr2 package installed beforehand. How to do it... Perform the following steps to install plyrmr on the Hadoop system: First, you should install libxml2-devel and curl-devel in the Linux shell: $ yum install libxml2-devel $ sudo yum install curl-devel You can then access R and install the dependent packages: $ sudo R > Install.packages(c(" Rcurl", "httr"), dependencies = TRUE > Install.packages("devtools", dependencies = TRUE) > library(devtools) > install_github("pryr", "hadley") > install.packages(c(" R.methodsS3", "hydroPSO"), dependencies = TRUE > q() Next, you can download plyrmr 0.5.0 and install it on Hadoop VM. You may need to update the link if Revolution Analytics upgrades the version of plyrmr: $ wget -no-check-certificate https://raw.github.com/RevolutionAnalytics/plyrmr/master/build/plyrmr_0.5.0.tar.gz $ sudo R CMD INSTALL plyrmr_0.5.0.tar.gz Lastly, validate the installation: $ R > library(plyrmr) How it works... Besides writing an R MapReduce program using the rmr2 package, you can use the plyrmr to manipulate data. The plyrmr package is similar to hive and pig in the Hadoop ecosystem, which is the abstraction of the MapReduce program. Therefore, we can implement an R MapReduce program in plyr style instead of implementing the map f and reduce functions. To install plyrmr, first install the package of libxml2-devel and curl-devel, using the yum install command. Then, access R and install the dependent packages. Lastly, download the file from GitHub and install plyrmr in R. See also To read more information about plyrmr, you can use the help function to refer to the following document: > help(package=plyrmr) Manipulating data with plyrmr While writing a MapReduce program with rmr2 is much easier than writing a native Java version, it is still hard for nondevelopers to write a MapReduce program. Therefore, you can use plyrmr, a high-level abstraction of the MapReduce program, so that you can use plyr-like operations to manipulate big data. In this recipe, we will introduce some operations you can use to manipulate data. Getting ready In this recipe, you should have completed the previous recipes by installing plyrmr and rmr2 in R. How to do it... Perform the following steps to manipulate data with plyrmr: First, you need to load both plyrmr and rmr2 into R: > library(rmr2) > library(plyrmr) You can then set the execution mode to the local mode: > plyrmr.options(backend="local") Next, load the Titanic dataset into R: > data(Titanic) > titanic = data.frame(Titanic) Begin the operation by filtering the data: > where( +   Titanic, + Freq >=100) You can also use a pipe operator to filter the data: > titanic %|% where(Freq >=100) Put the Titanic data into HDFS and load the path of the data to the variable, tidata: > tidata = to.dfs(data.frame(Titanic), output = '/tmp/titanic') > tidata Next, you can generate a summation of the frequency from the Titanic data: > input(tidata) %|% transmute(sum(Freq)) You can also group the frequency by sex: > input(tidata) %|% group(Sex) %|% transmute(sum(Freq)) You can then sample 10 records out of the population: > sample(input(tidata), n=10) In addition to this, you can use plyrmr to join two datasets: > convert_tb = data.frame(Label=c("No","Yes"), Symbol=c(0,1)) ctb = to.dfs(convert_tb, output = 'convert') > as.data.frame(plyrmr::merge(input(tidata), input(ctb), by.x="Survived", by.y="Label")) > file.remove('convert') How it works... In this recipe, we introduce how to use plyrmr to manipulate data. First, we need to load the plyrmr package into R. Then, similar to rmr2, you have to set the backend option of plyrmr as the local mode. Otherwise, you will have to wait anywhere between a few minutes to several hours if plyrmr is running on Hadoop mode (the default setting). Next, we can begin the data manipulation with data filtering. You can choose to call the function nested inside the other function call in step 4. On the other hand, you can use the pipe operator, %|%, to chain multiple operations. Therefore, we can filter data similar to step 4, using pipe operators in step 5. Next, you can input the dataset into either the HDFS or local filesystem, using to.dfs in accordance with the current running mode. The function will generate the path of the dataset and save it in the variable, tidata. By knowing the path, you can access the data using the input function. Next, we illustrate how to generate a summation of the frequency from the Titanic dataset with the transmute and sum functions. Also, plyrmr allows users to sum up the frequency by gender. Additionally, in order to sample data from a population, you can also use the sample function to select 10 records out of the Titanic dataset. Lastly, we demonstrate how to join two datasets using the merge function from plyrmr. See also Here we list some functions that can be used to manipulate data with plyrmr. You may refer to the help function for further details on their usage and functionalities: Data manipulation: bind.cols: This adds new columns select: This is used to select columns where: This is used to select rows transmute: This uses all of the above plus their summaries From reshape2: melt and dcast: It converts long and wide data frames Summary: count quantile sample Extract: top.k bottom.k Conducting machine learning with RHadoop At this point, some may believe that the use of RHadoop can easily solve machine learning problems of big data via numerous existing machine learning packages. However, you cannot use most of these to solve machine learning problems as they cannot be executed in the MapReduce mode. In the following recipe, we will demonstrate how to implement a MapReduce version of linear regression and compare this version with the one using the lm function. Getting ready In this recipe, you should have completed the previous recipe by installing rmr2 into the R environment. How to do it... Perform the following steps to implement a linear regression in MapReduce: First, load the cats dataset from the MASS package: > library(MASS) > data(cats) > X = matrix(cats$Bwt) > y = matrix(cats$Hwt) You can then generate a linear regression model by calling the lm function: > model = lm(y~X) > summary(model)   Call: lm(formula = y ~ X)   Residuals:    Min    1Q Median     3Q     Max -3.5694 -0.9634 -0.0921 1.0426 5.1238   Coefficients:            Estimate Std. Error t value Pr(>|t|)   (Intercept) -0.3567     0.6923 -0.515   0.607   X             4.0341     0.2503 16.119   <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1   Residual standard error: 1.452 on 142 degrees of freedom Multiple R-squared: 0.6466, Adjusted R-squared: 0.6441 F-statistic: 259.8 on 1 and 142 DF, p-value: < 2.2e-16 You can now make a regression plot with the given data points and model: > plot(y~X) > abline(model, col="red") Linear regression plot of cats dataset Load rmr2 into R: > Sys.setenv(HADOOP_CMD="/usr/bin/hadoop") > Sys.setenv(HADOOP_STREAMING="/usr/lib/hadoop-mapreduce/hadoop-> streaming-2.5.0-cdh5.2.0.jar") > library(rmr2) > rmr.options(backend="local") You can then set up X and y values: > X = matrix(cats$Bwt) > X.index = to.dfs(cbind(1:nrow(X), X)) > y = as.matrix(cats$Hwt) Make a Sum function to sum up the values: > Sum = +   function(., YY) +     keyval(1, list(Reduce('+', YY))) Compute Xtx in MapReduce, Job1: > XtX = +   values( +     from.dfs( +       mapreduce( +         input = X.index, +         map = +           function(., Xi) { +             Xi = Xi[,-1] +              keyval(1, list(t(Xi) %*% Xi))}, +         reduce = Sum, +         combine = TRUE)))[[1]] You can then compute Xty in MapReduce, Job2: Xty = +   values( +     from.dfs( +       mapreduce( +         input = X.index, +         map = function(., Xi) { +           yi = y[Xi[,1],] +           Xi = Xi[,-1] +           keyval(1, list(t(Xi) %*% yi))}, +         reduce = Sum, +         combine = TRUE)))[[1]] Lastly, you can derive the coefficient from XtX and Xty: > solve(XtX, Xty)          [,1] [1,] 3.907113 How it works... In this recipe, we demonstrate how to implement linear logistic regression in a MapReduce fashion in R. Before we start the implementation, we review how traditional linear models work. We first retrieve the cats dataset from the MASS package. We then load X as the body weight (Bwt) and y as the heart weight (Hwt). Next, we begin to fit the data into a linear regression model using the lm function. We can then compute the fitted model and obtain the summary of the model. The summary shows that the coefficient is 4.0341 and the intercept is -0.3567. Furthermore, we draw a scatter plot in accordance with the given data points and then draw a regression line on the plot. As we cannot perform linear regression using the lm function in the MapReduce form, we have to rewrite the regression model in a MapReduce fashion. Here, we would like to implement a MapReduce version of linear regression in three steps, which are: calculate the Xtx value with the MapReduce, job1, calculate the Xty value with MapReduce, job2, and then derive the coefficient value: In the first step, we pass the matrix, X, as the input to the map function. The map function then calculates the cross product of the transposed matrix, X, and, X. The reduce function then performs the sum operation defined in the previous section. In the second step, the procedure of calculating Xty is similar to calculating XtX. The procedure calculates the cross product of the transposed matrix, X, and, y. The reduce function then performs the sum operation. Lastly, we use the solve function to derive the coefficient, which is 3.907113. As the results show, the coefficients computed by lm and MapReduce differ slightly. Generally speaking, the coefficient computed by the lm model is more accurate than the one calculated by MapReduce. However, if your data is too large to fit in the memory, you have no choice but to implement linear regression in the MapReduce version. See also You can access more information on machine learning algorithms at: https://github.com/RevolutionAnalytics/rmr2/tree/master/pkg/tests Configuring RHadoop clusters on Amazon EMR Until now, we have only demonstrated how to run a RHadoop program in a single Hadoop node. In order to test our RHadoop program on a multi-node cluster, the only thing you need to do is to install RHadoop on all the task nodes (nodes with either task tracker for mapreduce version 1 or node manager for map reduce version 2) of Hadoop clusters. However, the deployment and installation is time consuming. On the other hand, you can choose to deploy your RHadoop program on Amazon EMR, so that you can deploy multi-node clusters and RHadoop on every task node in only a few minutes. In the following recipe, we will demonstrate how to configure RHadoop cluster on an Amazon EMR service. Getting ready In this recipe, you must register and create an account on AWS, and you also must know how to generate a EC2 key-pair before using Amazon EMR. For those who seek more information on how to start using AWS, please refer to the tutorial provided by Amazon at http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EC2_GetStarted.html. How to do it... Perform the following steps to configure RHadoop on Amazon EMR: First, you can access the console of the Amazon Web Service (refer to https://us-west-2.console.aws.amazon.com/console/) and find EMR in the analytics section. Then, click on EMR. Access EMR service from AWS console. You should find yourself in the cluster list of the EMR dashboard (refer to https://us-west-2.console.aws.amazon.com/elasticmapreduce/home?region=us-west-2#cluster-list::); click on Create cluster. Cluster list of EMR Then, you should find yourself on the Create Cluster page (refer to https://us-west-2.console.aws.amazon.com/elasticmapreduce/home?region=us-west-2#create-cluster:). Next, you should specify Cluster name and Log folder S3 location in the cluster configuration. Cluster configuration in the create cluster page You can then configure the Hadoop distribution on Software Configuration. Configure the software and applications Next, you can configure the number of nodes within the Hadoop cluster. Configure the hardware within Hadoop cluster You can then specify the EC2 key-pair for the master node login. Security and access to the master node of the EMR cluster To set up RHadoop, one has to perform bootstrap actions to install RHadoop on every task node. Please write a file named bootstrapRHadoop.sh, and insert the following lines within the file: echo 'install.packages(c("codetools", "Rcpp", "RJSONIO", "bitops", "digest", "functional", "stringr", "plyr", "reshape2", "rJava", "caTools"), repos="http://cran.us.r-project.org")' > /home/hadoop/installPackage.R sudo Rscript /home/hadoop/installPackage.R wget --no-check-certificate https://raw.githubusercontent.com/RevolutionAnalytics/rmr2/master/build/rmr2_3.3.0.tar.gz sudo R CMD INSTALL rmr2_3.3.0.tar.gz wget --no-check-certificate https://raw.github.com/RevolutionAnalytics/rhdfs/master/build/rhdfs_1.0.8.tar.gz sudo HADOOP_CMD=/home/hadoop/bin/hadoop R CMD INSTALL rhdfs_1.0.8.tar.gz You should upload bootstrapRHadoop.sh to S3. You now need to add the bootstrap action with Custom action, and add s3://<location>/bootstrapRHadoop.sh within the S3 location. Set up the bootstrap action Next, you can click on Create cluster to launch the Hadoop cluster. Create the cluster Lastly, you should see the master public DNS when the cluster is ready. You can now access the terminal of the master node with your EC2-key pair: A screenshot of the created cluster How it works... In this recipe, we demonstrate how to set up RHadoop on Amazon EMR. The benefit of this is that you can quickly create a scalable, on demand Hadoop with just a few clicks within a few minutes. This helps save you time from building and deploying a Hadoop application. However, you have to pay for the number of running hours for each instance. Before using Amazon EMR, you should create an AWS account and know how to set up the EC2 key-pair and the S3. You can then start installing RHadoop on Amazon EMR. In the first step, access the EMR cluster list and click on Create cluster. You can see a list of configurations on the Create cluster page. You should then set up the cluster name and log folder in the S3 location in the cluster configuration. Next, you can set up the software configuration and choose the Hadoop distribution you would like to install. Amazon provides both its own distribution and the MapR distribution. Normally, you would skip this section unless you have concerns about the default Hadoop distribution. You can then configure the hardware by specifying the master, core, and task node. By default, there is only one master node, and two core nodes. You can add more core and task nodes if you like. You should then set up the key-pair to login to the master node. You should next make a file containing all the start scripts named bootstrapRHadoop.sh. After the file is created, you should save the file in the S3 storage. You can then specify custom action in Bootstrap Action with bootstrapRHadoop.sh as the Bootstrap script. Lastly, you can click on Create cluster and wait until the cluster is ready. Once the cluster is ready, one can see the master public DNS and can use the EC2 key-pair to access the terminal of the master node. Beware! Terminate the running instance if you do not want to continue using the EMR service. Otherwise, you will be charged per instance for every hour you use. See also Google also provides its own cloud solution, the Google compute engine. For those who would like to know more, please refer to https://cloud.google.com/compute/. Summary In this article, we started by preparing the Hadoop environment, so that you can install RHadoop. We then covered the installation of three main packages: rmr, rhdfs, and plyrmr. Next, we introduced how to use rmr to perform MapReduce from R, operate an HDFS file through rhdfs, and perform a common data operation using plyrmr. Further, we explored how to perform machine learning using RHadoop. Lastly, we introduced how to deploy multiple RHadoop clusters on Amazon EC2. Resources for Article: Further resources on this subject: Warming Up [article] Derivatives Pricing [article] Using R for Statistics, Research, and Graphics [article]
Read more
  • 0
  • 0
  • 4746

article-image-creating-your-first-collection-simple
Packt
26 Jun 2013
7 min read
Save for later

Creating your first collection (Simple)

Packt
26 Jun 2013
7 min read
(For more resources related to this topic, see here.) Getting ready Assuming that you have walked through the tutorial, you should be nearly ready with the setup. Still, it does not hurt to go through the checklist: Be familiar that you know how to start your operating system's shell (cmd.exe on Windows, Terminal/iTerm on Mac, and sh/bash/tch/zsh on Unix). Ensure that running the java –version command on the shell's prompt returns at least Version 1.6. You may need to upgrade if you have an older version. Ensure that you know where you unpacked the Solr distribution and the full path to the example directory within that. You needed that directory for the tutorial, but that's also where we are going to start our own Solr instance. That allows us to easily run an embedded Jetty web server and to also find all the additional JAR files that Solr needs to operate properly. Now, create a directory where we will store our indexes and experiments. It can be anywhere on your drive. As Solr can run on any operating system where Java can run, we will use SOLRINDEXING as a name whenever we refer to that directory. Make sure to use absolute path names when substituting with your real path for the directory. How to do it... As our first example, we will create an index that stores and allows for the searching of simplified e-mail information. For now, we will just look at the addr_from and addr_to e-mail addresses and the subject line. You will see that it takes only two simple configuration files to get the basic Solr index working. Under the SOLR-INDEXING directory, create a collection1 directory and inside that create a conf directory. In the conf directory, create two files: schema.xml and solrconfig.xml. The schema.xml file should have the following content: <?xml version="1.0" encoding="UTF-8" ?><schema version="1.5"><fields><field name="id" type="string" indexed="true" stored="true"required="true"/><field name="addr_from" type="string" indexed="true"stored="true" required="true"/><field name="addr_to" type="string" indexed="true"stored="true" required="true"/><field name="subject" type="string" indexed="true"stored="true" required="true"/></fields><uniqueKey>id</uniqueKey><types><fieldType name="string" class="solr.StrField" /></types></schema> The solrconfig.xml file should have the following content: <?xml version="1.0" encoding="UTF-8" ?><config><luceneMatchVersion>LUCENE_43</luceneMatchVersion><requestDispatcher handleSelect="false"><httpCaching never304="true" /></requestDispatcher><requestHandler name="/select" class="solr.SearchHandler" /><requestHandler name="/update" class="solr.UpdateRequestHandler" /><requestHandler name="/admin" class="solr.admin.AdminHandlers" /><requestHandler name="/analysis/field" class="solr.FieldAnalysisRequestHandler" startup="lazy" /></config> That is it. Now, let's start our just-created Solr instance. Open a new shell (we'll need the current one later). On that shell's command prompt, change the directory to the example directory of the Solr distribution and run the following command: java -Dsolr.solr.home=SOLR-INDEXING -jar start.jar Notice that solr.solr.home is not a typo; you do need the solr part twice. And, as always, if you have spaces in your paths (now or later), you may need to escape them in platform-specific ways, such as with backslashes on Unix/Linux or by quoting the whole value. In the window of your shell, you should see a long list of messages that you can safely ignore (at least for now). You can verify that everything is working fine by checking for the following three elements: The long list of messages should finish with a message like Started SocketConnector@0.0.0.0:8983. This means that Solr is now running on port 8983 successfully. You should now have a directory called data, right next to the directory called conf that we created earlier. If you open the web browser and go to the http:// localhost:8983/ solr/, you should see a web-based admin interface that makes testing and troubleshooting your Solr instance much easier. We will be using this interface later, so do spend a couple of minutes clicking around now. Now, let's load some actual content into our collection: Copy post.jar from the Solr distribution's example/exampledocs directory to our root SOLR-INDEXING directory. Create a file called input1.csv in the collection1 directory, next to the conf and data directories with the following three-line content: id,addr_from,addr_to,subjectemail1,fulan@acme.example.com,kari@acme.example.com,"Kari,we need more Junior Java engineers"email2,kari@acme.example.com,maija@acme.example.com,"Updating vacancy description" Run the import command from the command line in the SOLR-INDEXING directory (one long command; do not split it across lines): java -Dauto -Durl=http://localhost:8983/solr/collection1/update -jar post.jar collection1/input1.csv You should see the following in one of the message lines: "1 files indexed". If you now open a web browser and go to http:// localhost:8983/solr/ collection1/select?q=*%3A*&wt=ruby&indent=true, you should see Solr output with all the three documents displayed on the screen in a somewhat readable format. How it works... We have created two files to get our example working. Let's review what they mean and how they fit together: The schema.xml file in the collection's conf directory defines the actual shape of data that you want to store and index. The fields define a structure of a record. Each field has a type, which is also defined in the same file. The field defines whether it is stored, indexed, required, multivalued, or a small number of other, more advanced properties. On the other hand, the field type defines what is actually done to the field when it is indexed and when it is searched. We will explore all of these later. The solrconfig.xml file also in the collection's conf directory defines and tunes the components that make up Solr's runtime environment. At the very least, it needs to define which URLs can be called to add records to a collection (here, /update), which to query a collection (here, /select), and which to do various administrative tasks (here, /admin and /analysis/field). Once Solr started, it created a single collection with the default name of collection1, assigned an update handler to it at the /solr/collection1/update URL and search handler at the /solr/collection1/select URL (as per solrconfig.xml). At that point, Solr was ready for the data to be imported into the four required fields (as per schema.xml). We then proceeded to populate the index from a CSV file (one of many update formats available) and then verified that the records are all present in an indented Ruby format (again, one of many result formats available). Summary This article helped you create a basic Solr collection and populate it with a simple dataset in CSV format. Resources for Article : Further resources on this subject: Integrating Solr: Ruby on Rails Integration [Article] Indexing Data in Solr 1.4 Enterprise Search Server: Part2 [Article] Text Search, your Database or Solr [Article]
Read more
  • 0
  • 0
  • 4693

article-image-setting-synchronous-replication
Packt
10 Aug 2015
17 min read
Save for later

Setting Up Synchronous Replication

Packt
10 Aug 2015
17 min read
In this article by the author, Hans-Jürgen Schönig, of the book, PostgreSQL Replication, Second Edition, we learn how to set up synchronous replication. In asynchronous replication, data is submitted and received by the slave (or slaves) after the transaction has been committed on the master. During the time between the master's commit and the point when the slave actually has fully received the data, it can still be lost. Here, you will learn about the following topics: Making sure that no single transaction can be lost Configuring PostgreSQL for synchronous replication Understanding and using application_name The performance impact of synchronous replication Optimizing replication for speed Synchronous replication can be the cornerstone of your replication setup, providing a system that ensures zero data loss. (For more resources related to this topic, see here.) Synchronous replication setup Synchronous replication has been made to protect your data at all costs. The core idea of synchronous replication is that a transaction must be on at least two servers before the master returns success to the client. Making sure that data is on at least two nodes is a key requirement to ensure no data loss in the event of a crash. Setting up synchronous replication works just like setting up asynchronous replication. Just a handful of parameters discussed here have to be changed to enjoy the blessings of synchronous replication. However, if you are about to create a setup based on synchronous replication, we recommend getting started with an asynchronous setup and gradually extending your configuration and turning it into synchronous replication. This will allow you to debug things more easily and avoid problems down the road. Understanding the downside to synchronous replication The most important thing you have to know about synchronous replication is that it is simply expensive. Synchronous replication and its downsides are two of the core reasons for which we have decided to include all this background information in this book. It is essential to understand the physical limitations of synchronous replication, otherwise you could end up in deep trouble. When setting up synchronous replication, try to keep the following things in mind: Minimize the latency Make sure you have redundant connections Synchronous replication is more expensive than asynchronous replication Always cross-check twice whether there is a real need for synchronous replication In many cases, it is perfectly fine to lose a couple of rows in the event of a crash. Synchronous replication can safely be skipped in this case. However, if there is zero tolerance, synchronous replication is a tool that should be used. Understanding the application_name parameter In order to understand a synchronous setup, a config variable called application_name is essential, and it plays an important role in a synchronous setup. In a typical application, people use the application_name parameter for debugging purposes, as it allows users to assign a name to their database connection. It can help track bugs, identify what an application is doing, and so on: test=# SHOW application_name; application_name ------------------ psql (1 row)   test=# SET application_name TO 'whatever'; SET test=# SHOW application_name; application_name ------------------ whatever (1 row) As you can see, it is possible to set the application_name parameter freely. The setting is valid for the session we are in, and will be gone as soon as we disconnect. The question now is: What does application_name have to do with synchronous replication? Well, the story goes like this: if this application_name value happens to be part of synchronous_standby_names, the slave will be a synchronous one. In addition to that, to be a synchronous standby, it has to be: connected streaming data in real-time (that is, not fetching old WAL records) Once a standby becomes synced, it remains in that position until disconnection. In the case of cascaded replication (which means that a slave is again connected to a slave), the cascaded slave is not treated synchronously anymore. Only the first server is considered to be synchronous. With all of this information in mind, we can move forward and configure our first synchronous replication. Making synchronous replication work To show you how synchronous replication works, this article will include a full, working example outlining all the relevant configuration parameters. A couple of changes have to be made to the master. The following settings will be needed in postgresql.conf on the master: wal_level = hot_standby max_wal_senders = 5   # or any number synchronous_standby_names = 'book_sample' hot_standby = on # on the slave to make it readable Then we have to adapt pg_hba.conf. After that, the server can be restarted and the master is ready for action. We recommend that you set wal_keep_segments as well to keep more transaction logs. We also recommend setting wal_keep_segments to keep more transaction logs on the master database. This makes the entire setup way more robust. It is also possible to utilize replication slots. In the next step, we can perform a base backup just as we have done before. We have to call pg_basebackup on the slave. Ideally, we already include the transaction log when doing the base backup. The --xlog-method=stream parameter allows us to fire things up quickly and without any greater risks. The --xlog-method=stream and wal_keep_segments parameters are a good combo, and in our opinion, should be used in most cases to ensure that a setup works flawlessly and safely. We have already recommended setting hot_standby on the master. The config file will be replicated anyway, so you save yourself one trip to postgresql.conf to change this setting. Of course, this is not fine art but an easy and pragmatic approach. Once the base backup has been performed, we can move ahead and write a simple recovery.conf file suitable for synchronous replication, as follows: iMac:slavehs$ cat recovery.conf primary_conninfo = 'host=localhost                    application_name=book_sample                    port=5432'   standby_mode = on The config file looks just like before. The only difference is that we have added application_name to the scenery. Note that the application_name parameter must be identical to the synchronous_standby_names setting on the master. Once we have finished writing recovery.conf, we can fire up the slave. In our example, the slave is on the same server as the master. In this case, you have to ensure that those two instances will use different TCP ports, otherwise the instance that starts second will not be able to fire up. The port can easily be changed in postgresql.conf. After these steps, the database instance can be started. The slave will check out its connection information and connect to the master. Once it has replayed all the relevant transaction logs, it will be in synchronous state. The master and the slave will hold exactly the same data from then on. Checking the replication Now that we have started the database instance, we can connect to the system and see whether things are working properly. To check for replication, we can connect to the master and take a look at pg_stat_replication. For this check, we can connect to any database inside our (master) instance, as follows: postgres=# x Expanded display is on. postgres=# SELECT * FROM pg_stat_replication; -[ RECORD 1 ]----+------------------------------ pid            | 62871 usesysid         | 10 usename         | hs application_name | book_sample client_addr     | ::1 client_hostname | client_port     | 59235 backend_start   | 2013-03-29 14:53:52.352741+01 state           | streaming sent_location   | 0/30001E8 write_location   | 0/30001E8 flush_location   | 0/30001E8 replay_location | 0/30001E8 sync_priority   | 1 sync_state       | sync This system view will show exactly one line per slave attached to your master system. The x command will make the output more readable for you. If you don't use x to transpose the output, the lines will be so long that it will be pretty hard for you to comprehend the content of this table. In expanded display mode, each column will be in one line instead. You can see that the application_name parameter has been taken from the connect string passed to the master by the slave (which is book_sample in our example). As the application_name parameter matches the master's synchronous_standby_names setting, we have convinced the system to replicate synchronously. No transaction can be lost anymore because every transaction will end up on two servers instantly. The sync_state setting will tell you precisely how data is moving from the master to the slave. You can also use a list of application names, or simply a * sign in synchronous_standby_names to indicate that the first slave has to be synchronous. Understanding performance issues At various points in this book, we have already pointed out that synchronous replication is an expensive thing to do. Remember that we have to wait for a remote server and not just the local system. The network between those two nodes is definitely not something that is going to speed things up. Writing to more than one node is always more expensive than writing to only one node. Therefore, we definitely have to keep an eye on speed, otherwise we might face some pretty nasty surprises. Consider what you have learned about the CAP theory earlier in this book. Synchronous replication is exactly where it should be, with the serious impact that the physical limitations will have on performance. The main question you really have to ask yourself is: do I really want to replicate all transactions synchronously? In many cases, you don't. To prove our point, let's imagine a typical scenario: a bank wants to store accounting-related data as well as some logging data. We definitely don't want to lose a couple of million dollars just because a database node goes down. This kind of data might be worth the effort of replicating synchronously. The logging data is quite different, however. It might be far too expensive to cope with the overhead of synchronous replication. So, we want to replicate this data in an asynchronous way to ensure maximum throughput. How can we configure a system to handle important as well as not-so-important transactions nicely? The answer lies in a variable you have already seen earlier in the book—the synchronous_commit variable. Setting synchronous_commit to on In the default PostgreSQL configuration, synchronous_commit has been set to on. In this case, commits will wait until a reply from the current synchronous standby indicates that it has received the commit record of the transaction and has flushed it to the disk. In other words, both servers must report that the data has been written safely. Unless both servers crash at the same time, your data will survive potential problems (crashing of both servers should be pretty unlikely). Setting synchronous_commit to remote_write Flushing to both disks can be highly expensive. In many cases, it is enough to know that the remote server has accepted the XLOG and passed it on to the operating system without flushing things to the disk on the slave. As we can be pretty certain that we don't lose two servers at the very same time, this is a reasonable compromise between performance and consistency with respect to data protection. Setting synchronous_commit to off The idea is to delay WAL writing to reduce disk flushes. This can be used if performance is more important than durability. In the case of replication, it means that we are not replicating in a fully synchronous way. Keep in mind that this can have a serious impact on your application. Imagine a transaction committing on the master and you wanting to query that data instantly on one of the slaves. There would still be a tiny window during which you can actually get outdated data. Setting synchronous_commit to local The local value will flush locally but not wait for the replica to respond. In other words, it will turn your transaction into an asynchronous one. Setting synchronous_commit to local can also cause a small time delay window, during which the slave can actually return slightly outdated data. This phenomenon has to be kept in mind when you decide to offload reads to the slave. In short, if you want to replicate synchronously, you have to ensure that synchronous_commit is set to either on or remote_write. Changing durability settings on the fly Changing the way data is replicated on the fly is easy and highly important to many applications, as it allows the user to control durability on the fly. Not all data has been created equal, and therefore, more important data should be written in a safer way than data that is not as important (such as log files). We have already set up a full synchronous replication infrastructure by adjusting synchronous_standby_names (master) along with the application_name (slave) parameter. The good thing about PostgreSQL is that you can change your durability requirements on the fly: test=# BEGIN; BEGIN test=# CREATE TABLE t_test (id int4); CREATE TABLE test=# SET synchronous_commit TO local; SET test=# x Expanded display is on. test=# SELECT * FROM pg_stat_replication; -[ RECORD 1 ]----+------------------------------ pid             | 62871 usesysid         | 10 usename         | hs application_name | book_sample client_addr     | ::1 client_hostname | client_port     | 59235 backend_start   | 2013-03-29 14:53:52.352741+01 state           | streaming sent_location   | 0/3026258 write_location   | 0/3026258 flush_location   | 0/3026258 replay_location | 0/3026258 sync_priority   | 1 sync_state       | sync   test=# COMMIT; COMMIT In this example, we changed the durability requirements on the fly. This will make sure that this very specific transaction will not wait for the slave to flush to the disk. Note, as you can see, sync_state has not changed. Don't be fooled by what you see here; you can completely rely on the behavior outlined in this section. PostgreSQL is perfectly able to handle each transaction separately. This is a unique feature of this wonderful open source database; it puts you in control and lets you decide which kind of durability requirements you want. Understanding the practical implications and performance We have already talked about practical implications as well as performance implications. But what good is a theoretical example? Let's do a simple benchmark and see how replication behaves. We are performing this kind of testing to show you that various levels of durability are not just a minor topic; they are the key to performance. Let's assume a simple test: in the following scenario, we have connected two equally powerful machines (3 GHz, 8 GB RAM) over a 1 Gbit network. The two machines are next to each other. To demonstrate the impact of synchronous replication, we have left shared_buffers and all other memory parameters as default, and only changed fsync to off to make sure that the effect of disk wait is reduced to practically zero. The test is simple: we use a one-column table with only one integer field and 10,000 single transactions consisting of just one INSERT statement: INSERT INTO t_test VALUES (1); We can try this with full, synchronous replication (synchronous_commit = on): real 0m6.043s user 0m0.131s sys 0m0.169s As you can see, the test has taken around 6 seconds to complete. This test can be repeated with synchronous_commit = local now (which effectively means asynchronous replication): real 0m0.909s user 0m0.101s sys 0m0.142s In this simple test, you can see that the speed has gone up by us much as six times. Of course, this is a brute-force example, which does not fully reflect reality (this was not the goal anyway). What is important to understand, however, is that synchronous versus asynchronous replication is not a matter of a couple of percentage points or so. This should stress our point even more: replicate synchronously only if it is really needed, and if you really have to use synchronous replication, make sure that you limit the number of synchronous transactions to an absolute minimum. Also, please make sure that your network is up to the job. Replicating data synchronously over network connections with high latency will kill your system performance like nothing else. Keep in mind that throwing expensive hardware at the problem will not solve the problem. Doubling the clock speed of your servers will do practically nothing for you because the real limitation will always come from network latency. The performance penalty with just one connection is definitely a lot larger than that with many connections. Remember that things can be done in parallel, and network latency does not make us more I/O or CPU bound, so we can reduce the impact of slow transactions by firing up more concurrent work. When synchronous replication is used, how can you still make sure that performance does not suffer too much? Basically, there are a couple of important suggestions that have proven to be helpful: Use longer transactions: Remember that the system must ensure on commit that the data is available on two servers. We don't care what happens in the middle of a transaction, because anybody outside our transaction cannot see the data anyway. A longer transaction will dramatically reduce network communication. Run stuff concurrently: If you have more than one transaction going on at the same time, it will be beneficial to performance. The reason for this is that the remote server will return the position inside the XLOG that is considered to be processed safely (flushed or accepted). This method ensures that many transactions can be confirmed at the same time. Redundancy and stopping replication When talking about synchronous replication, there is one phenomenon that must not be left out. Imagine we have a two-node cluster replicating synchronously. What happens if the slave dies? The answer is that the master cannot distinguish between a slow and a dead slave easily, so it will start waiting for the slave to come back. At first glance, this looks like nonsense, but if you think about it more deeply, you will figure out that synchronous replication is actually the only correct thing to do. If somebody decides to go for synchronous replication, the data in the system must be worth something, so it must not be at risk. It is better to refuse data and cry out to the end user than to risk data and silently ignore the requirements of high durability. If you decide to use synchronous replication, you must consider using at least three nodes in your cluster. Otherwise, it will be very risky, and you cannot afford to lose a single node without facing significant downtime or risking data loss. Summary Here, we outlined the basic concept of synchronous replication, and showed how data can be replicated synchronously. We also showed how durability requirements can be changed on the fly by modifying PostgreSQL runtime parameters. PostgreSQL gives users the choice of how a transaction should be replicated, and which level of durability is necessary for a certain transaction. Resources for Article: Further resources on this subject: Introducing PostgreSQL 9 [article] PostgreSQL – New Features [article] Installing PostgreSQL [article]
Read more
  • 0
  • 0
  • 4683

article-image-neural-network-azure-ml
Packt
15 Jun 2015
3 min read
Save for later

Neural Network in Azure ML

Packt
15 Jun 2015
3 min read
In this article written by Sumit Mund, author of the book Microsoft Azure Machine Learning, we will learn about neural network, which is a kind of machine learning algorithm inspired by the computational models of a human brain. It builds a network of computation units, neurons, or nodes. In a typical network, there are three layers of nodes. First, the input layer, followed by the middle layer or hidden layer, and in the end, the output layer. Neural network algorithms can be used for both classification and regression problems. (For more resources related to this topic, see here.) The number of nodes in a layer depends on the problem and how you construct the network to get the best result. Usually, the number of nodes in an input layer is equal to the number of features in the dataset. For a regression problem, the number of nodes in the output layer is one while for a classification problem, it is equal to the number of class or label. Each node in a layer gets connected to all the nodes in the next layer. Each edge that connects between nodes is assigned a weight. So, a neural network can well be imagined as a weighted directed acyclic graph. In a typical neural network, as shown in the preceding figure, the middle layer or hidden layer contains the number nodes, which are chosen to make the computation right. While there is no formula or agreed convention for this, it is often optimized after trying out different options. Azure Machine Learning supports neural network for regression, two-class classification, and multiclass classification. It provides a separate module for each kind of problem and lets the users tune it with different parameters, such as the number of hidden nodes, number of iterations to train the model, and so on. A special kind of neural network algorithms where there are more than one hidden layers is known as deep networks or deep learning algorithms. Azure Machine Learning allows us to choose the number of hidden nodes as a property value of the neural network module. These kind of neural networks are getting increasingly popular these days because of their remarkable results and because they allow us to model complex and nonlinear scenarios. There are many kinds of deep networks, but recently, a special kind of deep network known as the convolutional neural network got very popular because of its significant performance in image recognition or classification problems. Azure Machine Learning supports the convolutional neural network. For simple networks with three layers, this can be done through a UI just by choosing parameters. However, to build a deep network like a convolutional deep network, it’s not easy to do so through a UI. So, Azure Machine Learning supports a new kind of language called Net#, which allows you to script different kinds of neural network inside ML Studio by defining different node, the connections (edges), and kind of connections. While deep networks are complex to build and train, Net# makes things relatively easy and simple. Though complex, neural networks are very powerful and Azure Machine Learning makes it fun to work with these be it three-layered shallow networks or multilayer deep networks. Resources for Article: Further resources on this subject: Security in Microsoft Azure [article] High Availability, Protection, and Recovery using Microsoft Azure [article] Managing Microsoft Cloud [article]
Read more
  • 0
  • 0
  • 4680

article-image-oracle-goldengate-considerations-designing-solution
Packt
24 Feb 2011
8 min read
Save for later

Oracle GoldenGate: Considerations for Designing a Solution

Packt
24 Feb 2011
8 min read
  Oracle GoldenGate 11g Implementer's guide Design, install, and configure high-performance data replication solutions using Oracle GoldenGate The very first book on GoldenGate, focused on design and performance tuning in enterprise-wide environments Exhaustive coverage and analysis of all aspects of the GoldenGate software implementation, including design, installation, and advanced configuration Migrate your data replication solution from Oracle Streams to GoldenGate Design a GoldenGate solution that meets all the functional and non-functional requirements of your system Written in a simple illustrative manner, providing step-by-step guidance with discussion points Goes way beyond the manual, appealing to Solution Architects, System Administrators and Database Administrators          At a high level, the design must include the following generic requirements: Hardware Software Network Storage Performance All the above must be factored into the overall system architecture. So let's take a look at some of the options and the key design issues. Replication methods So you have a fast reliable network between your source and target sites. You also have a schema design that is scalable and logically split. You now need to choose the replication architecture; One to One, One to Many, active-active, active-passive, and so on. This consideration may already be answered for you by the sheer fact of what the system has to achieve. Let's take a look at some configuration options. Active-active Let's assume a multi-national computer hardware company has an office in London and New York. Data entry clerks are employed at both sites inputting orders into an Order Management System. There is also a procurement department that updates the system inventory with volumes of stock and new products related to a US or European market. European countries are managed by London, and the US States are managed by New York. A requirement exists where the underlying database systems must be kept in synchronisation. Should one of the systems fail, London users can connect to New York and vice-versa allowing business to continue and orders to be taken. Oracle GoldenGate's active-active architecture provides the best solution to this requirement, ensuring that the database systems on both sides of the pond are kept synchronised in case of failure. Another feature the active-active configuration has to offer is the ability to load balance operations. Rather than have effectively a DR site in both locations, the European users could be allowed access to New York and London systems and viceversa. Should a site fail, then the DR solution could be quickly implemented. Active-passive The active-passive bi-directional configuration replicates data from an active primary database to a full replica database. Sticking with the earlier example, the business would need to decide which site is the primary where all users connect. For example, in the event of a failure in London, the application could be configured to failover to New York. Depending on the failure scenario, another option is to start up the passive configuration, effectively turning the active-passive configuration into active-active. Cascading The Cascading GoldenGate topology offers a number of "drop-off" points that are intermediate targets being populated from a single source. The question here is "what data do I drop at which site?" Once this question has been answered by the business, it is then a case of configuring filters in Replicat parameter files allowing just the selected data to be replicated. All of the data is passed on to the next target where it is filtered and applied again. This type of configuration lends itself to a head office system updating its satellite office systems in a round robin fashion. In this case, only the relevant data is replicated at each target site. Another design, is the Hub and Spoke solution, where all target sites are updated simultaneously. This is a typical head office topology, but additional configuration and resources would be required at the source site to ship the data in a timely manner. The CPU, network, and file storage requirements must be sufficient to accommodate and send the data to multiple targets. Physical Standby A Physical Standby database is a robust Oracle DR solution managed by the Oracle Data Guard product. The Physical Standby database is essentially a mirror copy of its Primary, which lends itself perfectly for failover scenarios. However , it is not easy to replicate data from the Physical Standby database, because it does not generate any of its own redo. That said, it is possible to configure GoldenGate to read the archived standby logs in Archive Log Only (ALO) mode. Despite being potentially slower, it may be prudent to feed a downstream system on the DR site using this mechanism, rather than having two data streams configured from the Primary database. This reduces network bandwidth utilization, as shown in the following diagram: Reducing network traffic is particularly important when there is considerable distance between the primary and the DR site. Networking The network should not be taken for granted. It is a fundamental component in data replication and must be considered in the design process. Not only must it be fast, it must be reliable. In the following paragraphs, we look at ways to make our network resilient to faults and subsequent outages, in an effort to maintain zero downtime. Surviving network outages Probably one of your biggest fears in a replication environment is network failure. Should the network fail, the source trail will fill as the transactions continue on the source database, ultimately filling the filesystem to 100% utilization, causing the Extract process to abend. Depending on the length of the outage, data in the database's redologs may be overwritten causing you the additional task of configuring GoldenGate to extract data from the database's archived logs. This is not ideal as you already have the backlog of data in the trail files to ship to the target site once the network is restored. Therefore, ensure there is sufficient disk space available to accommodate data for the longest network outage during the busiest period. Disks are relatively cheap nowadays. Providing ample space for your trail files will help to reduce the recovery time from the network outage. Redundant networks One of the key components in your GoldenGate implementation is the network. Without the ability to transfer data from the source to the target, it is rendered useless. So, you not only need a fast network but one that will always be available. This is where redundant networks come into play, offering speed and reliability. NIC teaming One method of achieving redundancy is Network Interface Card (NIC) teaming or bonding. Here two or more Ethernet ports can be "coupled" to form a bonded network supporting one IP address. The main goal of NIC teaming is to use two or more Ethernet ports connected to two or more different access network switches thus avoiding a single point of failure. The following diagram illustrates the redundant features of NIC teaming: Linux (OEL/RHEL 4 and above) supports NIC teaming with no additional software requirements. It is purely a matter of network configuration stored in text files in the /etc/sysconfig/network-scripts directory. The following steps show how to configure a server for NIC teaming: First, you need to log on as root user and create a bond0 config file using the vi text editor. # vi /etc/sysconfig/network-scripts/ifcfg-bond0 Append the following lines to it, replacing the IP address with your actual IP address, then save file and exit to shell prompt: DEVICE=bond0 IPADDR=192.168.1.20 NETWORK=192.168.1.0 NETMASK=255.255.255.0 USERCTL=no BOOTPROTO=none ONBOOT=yes Choose the Ethernet ports you wish to bond, and then open both configurations in turn using the vi text editor, replacing ethn with the respective port number. # vi /etc/sysconfig/network-scripts/ifcfg-eth2 # vi /etc/sysconfig/network-scripts/ifcfg-eth4 Modify the configuration as follows: DEVICE=ethn USERCTL=no ONBOOT=yes MASTER=bond0 SLAVE=yes BOOTPROTO=none Save the files and exit to shell prompt. To make sure the bonding module is loaded when the bonding interface (bond0) is brought up, you need to modify the kernel modules configuration file: # vi /etc/modprobe.conf Append the following two lines to the file: alias bond0 bonding options bond0 mode=balance-alb miimon=100 Finally, load the bonding module and restart the network services: # modprobe bonding # service network restart You now have a bonded network that will load balance when both physical networks are available, providing additional bandwidth and enhanced performance. Should one network fail, the available bandwidth will be halved, but the network will still be available. Non-functional requirements (NFRs) Irrespective of the functional requirements, the design must also include the nonfunctional requirements (NFR) in order to achieve the overall goal of delivering a robust, high performance, and stable system. Latency One of the main NFRs is performance. How long does it take to replicate a transaction from the source database to the target? This is known as end-to-end latency that typically has a threshold that must not be breeched in order to satisfy the specified NFR. GoldenGate refers to latency as lag, which can be measured at different intervals in the replication process. These are: Source to Extract: The time taken for a record to be processed by the Extract compared to the commit timestamp on the database Replicat to Target: The time taken for the last record to be processed by the Replicat compared to the record creation time in the trail file A well designed system may encounter spikes in latency but it should never be continuous or growing. Trying to tune GoldenGate when the design is poor is a difficult situation to be in. For the system to perform well you may need to revisit the design.  
Read more
  • 0
  • 0
  • 4669
Unlock access to the largest independent learning library in Tech for FREE!
Get unlimited access to 7500+ expert-authored eBooks and video courses covering every tech area you can think of.
Renews at $19.99/month. Cancel anytime
article-image-lambdaarchitecture-pattern
Packt
19 Jun 2017
8 min read
Save for later

LambdaArchitecture Pattern

Packt
19 Jun 2017
8 min read
In this article by Tomcy John and Pankaj Misra, the authors of the book, Data Lake For Enterprises, we will learn about how the data in landscape of big data solutions can be made in near real time and certain practices that can be adopted for realizing Lambda Architecture in context of data lake. (For more resources related to this topic, see here.) The concept of a data lake in an enterprise was driven by certain challenges that Enterprises were facing with the way the data was handled, processed, and stored. Initially all the individual applications in the Enterprise, via a natural evolution cycle, started maintaining huge amounts of data into themselves with almost no reuse to other applications in the same enterprise. These created information silos across arious applications. As the next step of evolution, these individual applications started exposing this data across the organization as a data mart access layer over central data warehouse. While data mart solved one part of the problem, other problems still persisted. These problems were more about data governance, data ownership, data accessibility which were required to be resolved so as to have better availability of enterprise relevant data. This is where a need was felt to have data lakes, that could not only make such data available but also could store any form of data and process it so that data is analyzed and kept ready for consumption by consumer applications. In this article, we will look at some of the critical aspects of a data lake and understand why does it matter for an enterprise. If we need to define the term Data Lake, it can be defined as a vast repository of variety of enterprise wide raw information that can be acquired, processed, analyzed and delivered. The information thus handled could be any type of information ranging from structured, semi-structured data to completely unstructured data. Data Lake is expected to be able to derive Enterprise relevant meaning and insights from this information using various analysis and machine learning algorithms. Lambda architecture and data lake Lambda architecture as a pattern provides ways and means to perform highly scalable, performant, distributed computing on large sets of data and yet provide consistent (eventually) data with required processing both in batch as well as in near real time. Lambda architecture defines ways and means to enable scale out architecture across various data load profiles in an enterprise, with low latency expectations. The architecture pattern became significant with the emergence of big data and enterprise’s focus on real-time analytics and digital transformation. The pattern named Lambda (symbol λ) is indicative of a way by which data comes from two places (batch and speed - the curved parts of the lambda symbol) which then combines and served through the serving layer (the line merging from the curved part). Figure 01 : Lambda Symbol  The main layers constituting the Lambda layer are shown below: Figyure 02 : Components of Lambda Architecure In the above high level representation, data is fed to both the batch and speed layer. The batch layer keeps producing and re-computing views at every set batch interval. The speed layer also creates the relevant real-time/speed views. The serving layer orchestrates the query by querying both the batch and speed layer, merges it and sends the result back as results. A practical realization of such a data lake can be illustrated as shown below. The figure below shows multiple technologies used for such a realization, however once the data is acquired from multiple sources and queued in messaging layer for ingestion, the Lambda architecture pattern in form of ingestion layer, batch layer.and speed layer springs into action: Figure 03: Layers in Data Lake Data Acquisition Layer:In an organization, data exists in various forms which can be classified as structured data, semi-structured data, or as unstructured data.One of the key roles expected from the acquisition layer is to be able convert the data into messages that can be further processed in a data lake, hence the acquisition layer is expected to be flexible to accommodate variety of schema specifications at the same time must have a fast connect mechanism to seamlessly push all the translated data messages into the data lake. A typical flow can be represented as shown below. Figure 04: Data Acquisition Layer Messaging Layer: The messaging layer would form the Message Oriented Middleware (MOM) for the data lake architecture and hence would be the primary layer for decoupling the various layers with each other, but with guaranteed delivery of messages.The other aspect of a messaging layer is its ability to enqueue and dequeue messages, as in the case with most of the messaging frameworks. Most of the messaging frameworks provide enqueue and dequeue mechanisms to manage publishing and consumption of messages respectively. Every messaging frameworks provides its own set of libraries to connect to its resources(queues/topics). Figure 05: Message Queue Additionally the messaging layer also can perform the role of data stream producer which can converted the queued data into continuous streams of data which can be passed on for data ingestion. Data Ingestion Layer: A fast ingestion layer is one of the key layers in Lambda Architecture pattern. This layer needs to ensure how fast can data be delivered into working models of Lambda architecture.  The data ingestion layer is responsible for consuming the messages from the messaging layer and perform the required transformation for ingesting them into the lambda layer (batch and speed layer) such that the transformed output conforms to the expected storage or processing formats. Figure 06: Data Ingestion Layer Batch Processing: The batch processing layer of lambda architecture is expected to process the ingested data in batches so as to have optimum utilization of system resources, at the same time, long running operations may be applied to the data to ensure high quality of data output, which is also known as Modelled data. The conversion of raw data to a modelled data is the primary responsibility of this layer, wherein the modelled data is the data model which can be served by serving layers of lambda architecture. While Hadoop as a framework has multiple components that can process data as a batch, each data processing in Hadoop is a map reduce process. A map and reduce paradigm of process execution is not a new paradigm, rather it has been used in many application ever since mainframe systems came into existence. It is based on divide and rule and stems from the traditional multi-threading model. The primary mechanism here is to divide the batch across multiple processes and then combine/reduce output of all the processes into a single output. Figure 07: Batch Processing Speed (Near Real Time) Data Processing: This layer is expected to perform near real time processing on data received from ingestion layer. Since the processing is expected to be in near real time, such data processing will need to be quick, fast and efficient, with support and design for high concurrency scenarios and eventually consistent outcome. The real-time processing was often dependent on data like the look-up data and reference data, hence there was a need to have a very fast data layer such that any look-up or reference data does not adversely impact the real-time nature of the processing. Near real time data processing pattern is not very different from the way it is done in batch mode, but the primary difference being that the data is processed as soon as it is available for processing and does not have to be batched, as shown below. Figure 08: Speed (Near Real Time) Processing Data Storage Layer: The data storage layer is very eminent in the lambda architecture pattern as this layer defines the reactivity of the overall solution to the incoming event/data streams. The storage, in context of lambda architecture driven data lake can be classified broadly into non-indexed and indexed data storage. Typically, the batch processing is performed on non-indexed data stored as data blocks for faster batch processing, while speed (near real time processing) is performed on indexed data which can be accessed randomly and supports complex search patterns by means of inverted indexes. Both of these models are depicted below. Figure 09: Non-Indexed and Indexed Data Storage Examples Lambda in action Once all the layers in lambda architecture have performed their respective roles, the data can be exported, exposed via services and can be delivered through other protocols from the data lake. This can also include merging the high quality processed output from batch processing with indexed storage, using technologies and frameworks, so as to provide enriched data for near real time requirements as well with interesting visualizations. Figure 10: Lambda in action Summary In this article we have briefly discussed a practical approach towards implementing a data lake for enterprises by leveraging Lambda architecture pattern. Resources for Article: Further resources on this subject: The Microsoft Azure Stack Architecture [article] System Architecture and Design of Ansible [article] Microservices and Service Oriented Architecture [article]
Read more
  • 0
  • 0
  • 4658

article-image-r-vs-pandas
Packt
16 Feb 2016
1 min read
Save for later

R vs Pandas

Packt
16 Feb 2016
1 min read
This article focuses on comparing pandas with R, the statistical package on which much of pandas' functionality is modeled. It is intended as a guide for R users who wish to use pandas, and for users who wish to replicate functionality that they have seen in the R code in pandas. It focuses on some key features available to R users and shows how to achieve similar functionality in pandas by using some illustrative examples. This article assumes that you have the R statistical package installed. If not, it can be downloaded and installed from here: http://www.r-project.org/. By the end of the article, data analysis users should have a good grasp of the data analysis capabilities of R as compared to pandas, enabling them to transition to or use pandas, should they need to. The various topics addressed in this article include the following: R data types and their pandas equivalents Slicing and selection Arithmetic operations on datatype columns Aggregation and GroupBy Matching Split-apply-combine Melting and reshaping Factors and categorical data
Read more
  • 0
  • 0
  • 4657

article-image-postgresql-action
Packt
14 Sep 2015
10 min read
Save for later

PostgreSQL in Action

Packt
14 Sep 2015
10 min read
In this article by Salahadin Juba, Achim Vannahme, and Andrey Volkov, authors of the book Learning PostgreSQL, we will discuss PostgreSQL (pronounced Post-Gres-Q-L) or Postgres is an open source, object-relational database management system. It emphasizes extensibility, creativity, and compatibility. It competes with major relational database vendors, such as Oracle, MySQL, SQL servers, and others. It is used by different sectors, including government agencies and the public and private sectors. It is cross-platform and runs on most modern operating systems, including Windows, Mac, and Linux flavors. It conforms to SQL standards and it is ACID complaint. (For more resources related to this topic, see here.) An overview of PostgreSQL PostgreSQL has many rich features. It provides enterprise-level services, including performance and scalability. It has a very supportive community and very good documentation. The history of PostgreSQL The name PostgreSQL comes from post-Ingres database. the history of PostgreSQL can be summarized as follows: Academia: University of California at Berkeley (UC Berkeley) 1977-1985, Ingres project: Michael Stonebraker created RDBMS according to the formal relational model 1986-1994, postgres: Michael Stonebraker created postgres in order to support complex data types and the object-relational model. 1995, Postgres95: Andrew Yu and Jolly Chen changed postgres to postgres query language (P) with an extended subset of SQL. Industry 1996, PostgreSQL: Several developers dedicated a lot of labor and time to stabilize Postgres95. The first open source version was released on January 29, 1997. With the introduction of new features, and enhancements, and at the start of open source projects, the Postgres95 name was changed to PostgreSQL. PostgreSQL began at version 6, with a very strong starting point by taking advantage of several years of research and development. Being an open source with a very good reputation, PostgreSQL has attracted hundreds of developers. Currently, PostgreSQL has innumerable extensions and a very active community. Advantages of PostgreSQL PostgreSQL provides many features that attract developers, administrators, architects, and companies. Business advantages of PostgreSQL PostgreSQL is free, open source software (OSS); it has been released under the PostgreSQL license, which is similar to the BSD and MIT licenses. The PostgreSQL license is highly permissive, and PostgreSQL is not a subject to monopoly and acquisition. This gives the company the following advantages. There is no associated licensing cost to PostgreSQL. The number of deployments of PostgreSQL is unlimited. A more profitable business model. PostgreSQL is SQL standards compliant. Thus finding professional developers is not very difficult. PostgreSQL is easy to learn and porting code from one database vendor to PostgreSQL is cost efficient. Also, PostgreSQL administrative tasks are easy to automate. Thus, the staffing cost is significantly reduced. PostgreSQL is cross-platform, and it has drivers for all modern programming languages; so, there is no need to change the company policy about the software stack in order to use PostgreSQL. PostgreSQL is scalable and it has a high performance. PostgreSQL is very reliable; it rarely crashes. Also, PostgreSQL is ACID compliant, which means that it can tolerate some hardware failure. In addition to that, it can be configured and installed as a cluster to ensure high availability (HA). User advantages of PostgreSQL PostgreSQL is very attractive for developers, administrators, and architects; it has rich features that enable developers to perform tasks in an agile way. The following are some attractive features for the developer: There is a new release almost each year; until now, starting from Postgres95, there have been 23 major releases. Very good documentation and an active community enable developers to find and solve problems quickly. The PostgreSQL manual is over than 2,500 pages in length. A rich extension repository enables developers to focus on the business logic. Also, it enables developers to meet requirement changes easily. The source code is available free of charge, it can be customized and extended without a huge effort. Rich clients and administrative tools enable developers to perform routine tasks, such as describing database objects, exporting and importing data, and dumping and restoring databases, very quickly. Database administration tasks do not requires a lot of time and can be automated. PostgreSQL can be integrated easily with other database management systems, giving software architecture good flexibility in putting software designs. Applications of PostgreSQL PostgreSQL can be used for a variety of applications. The main PostgreSQL application domains can be classified into two categories: Online transactional processing (OLTP): OLTP is characterized by a large number of CRUD operations, very fast processing of operations, and maintaining data integrity in a multiaccess environment. The performance is measured in the number of transactions per second. Online analytical processing (OLAP): OLAP is characterized by a small number of requests, complex queries that involve data aggregation, and a huge amount of data from different sources, with different formats and data mining and historical data analysis. OLTP is used to model business operations, such as customer relationship management (CRM). OLAP applications are used for business intelligence, decision support, reporting, and planning. An OLTP database size is relatively small compared to an OLAP database. OLTP normally follows the relational model concepts, such as normalization when designing the database, while OLAP is less relational and the schema is often star shaped. Unlike OLTP, the main operation of OLAP is data retrieval. OLAP data is often generated by a process called Extract, Transform and Load (ETL). ETL is used to load data into the OLAP database from different data sources and different formats. PostgreSQL can be used out of the box for OLTP applications. For OLAP, there are many extensions and tools to support it, such as the PostgreSQL COPY command and Foreign Data Wrappers (FDW). Success stories PostgreSQL is used in many application domains, including communication, media, geographical, and e-commerce applications. Many companies provide consultation as well as commercial services, such as migrating proprietary RDBMS to PostgreSQL in order to cut off licensing costs. These companies often influence and enhance PostgreSQL by developing and submitting new features. The following are a few companies that use PostgreSQL: Skype uses PostgreSQL to store user chats and activities. Skype has also affected PostgreSQL by developing many tools called Skytools. Instagram is a social networking service that enables its user to share pictures and photos. Instagram has more than 100 million active users. The American Chemical Society (ACS): More than one terabyte of data for their journal archive is stored using PostgreSQL. In addition to the preceding list of companies, PostgreSQL is used by HP, VMware, and Heroku. PostgreSQL is used by many scientific communities and organizations, such as NASA, due to its extensibility and rich data types. Forks There are more than 20 PostgreSQL forks; PostgreSQL extensible APIs makes postgres a great candidate to fork. Over years, many groups have forked PostgreSQL and contributed their findings to PostgreSQL. The following is a list of popular PostgreSQL forks: HadoopDB is a hybrid between the PostgreSQL, RDBMS, and MapReduce technologies to target analytical workload. Greenplum is a proprietary DBMS that was built on the foundation of PostgreSQL. It utilizes the shared-nothing and massively parallel processing (MPP) architectures. It is used as a data warehouse and for analytical workloads. The EnterpriseDB advanced server is a proprietary DBMS that provides Oracle capabilities to cap Oracle fees. Postgres-XC (eXtensible Cluster) is a multi-master PostgreSQL cluster based on the shared-nothing architecture. It emphasis write-scalability and provides the same APIs to applications that PostgreSQL provides. Vertica is a column-oriented database system, which was started by Michael Stonebraker in 2005 and acquisitioned by HP in 2011. Vertica reused the SQL parser, semantic analyzer, and standard SQL rewrites from the PostgreSQL implementation. Netzza is a popular data warehouse appliance solution that was started as a PostgreSQL fork. Amazon Redshift is a popular data warehouse management system based on PostgreSQL 8.0.2. It is mainly designed for OLAP applications. The PostgreSQL architecture PostgreSQL uses the client/server model; the client and server programs could be on different hosts. The communication between the client and server is normally done via TCP/IP protocols or Linux sockets. PostgreSQL can handle multiple connections from a client. A common PostgreSQL program consists of the following operating system processes: Client process or program (frontend): The database frontend application performs a database action. The frontend could be a web server that wants to display a web page or a command-line tool to perform maintenance tasks. PostgreSQL provides frontend tools, such as psql, createdb, dropdb, and createuser. Server process (backend): The server process manages database files, accepts connections from client applications, and performs actions on behalf of the client; the server process name is postgres. PostgreSQL forks a new process for each new connection; thus, the client and server processes communicate with each other without the intervention of the server main process (postgres), and they have a certain lifetime determined by accepting and terminating a client connection. The abstract architecture of PostgreSQL The aforementioned abstract, conceptual PostgreSQL architecture can give an overview of PostgreSQL's capabilities and interactions with the client as well as the operating system. The PostgreSQL server can be divided roughly into four subsystems as follows: Process manager: The process manager manages client connections, such as the forking and terminating processes. Query processor: When a client sends a query to PostgreSQL, the query is parsed by the parser, and then the traffic cop determines the query type. A Utility query is passed to the utilities subsystem. The Select, insert, update, and delete queries are rewritten by the rewriter, and then an execution plan is generated by the planner; finally, the query is executed, and the result is returned to the client. Utilities: The utilities subsystem provides the means to maintain the database, such as claiming storage, updating statistics, exporting and importing data with a certain format, and logging. Storage manager: The storage manager handles the memory cache, disk buffers, and storage allocation. Almost all PostgreSQL components can be configured, including the logger, planner, statistical analyzer, and storage manager. PostgreSQL configuration is governed by the application nature, such as OLAP and OLTP. The following diagram shows the PostgreSQL abstract, conceptual architecture: PostgreSQL's abstract, conceptual architecture The PostgreSQL community PostgreSQL has a very cooperative, active, and organized community. In the last 8 years, the PostgreSQL community published eight major releases. Announcements are brought to developers via the PostgreSQL weekly newsletter. There are dozens of mailing lists organized into categories, such as users, developers, and associations. Examples of user mailing lists are pgsql-general, psql-doc, and psql-bugs. pgsql-general is a very important mailing list for beginners. All non-bug-related questions about PostgreSQL installation, tuning, basic administration, PostgreSQL features, and general discussions are submitted to this list. The PostgreSQL community runs a blog aggregation service called Planet PostgreSQL—https://planet.postgresql.org/. Several PostgreSQL developers and companies use this service to share their experience and knowledge. Summary PostgreSQL is an open source, object-oriented relational database system. It supports many advanced features and complies with the ANSI-SQL:2008 standard. It has won industry recognition and user appreciation. The PostgreSQL slogan "The world's most advanced open source database" reflects the sophistication of the PostgreSQL features. PostgreSQL is a result of many years of research and collaboration between academia and industry. Companies in their infancy often favor PostgreSQL due to licensing costs. PostgreSQL can aid profitable business models. PostgreSQL is also favoured by many developers because of its capabilities and advantages. Resources for Article: Further resources on this subject: Introducing PostgreSQL 9 [article] PostgreSQL – New Features [article] Installing PostgreSQL [article]
Read more
  • 0
  • 0
  • 4653

article-image-solving-nlp-problem-keras-part-1
Sasank Chilamkurthy
12 Oct 2016
5 min read
Save for later

Solving an NLP Problem with Keras, Part 1

Sasank Chilamkurthy
12 Oct 2016
5 min read
In a previous two-part post series on Keras, I introduced Convolutional Neural Networks(CNNs) and the Keras deep learning framework. We used them to solve a Computer Vision (CV) problem involving traffic sign recognition. Now, in this two-part post series, we will solve a Natural Language Processing (NLP) problem with Keras. Let’s begin. The Problem and the Dataset The problem we are going to tackle is Natural Language Understanding. The aim is to extract the meaning of speech utterances. This is still an unsolved problem. Therefore, we can break this problem into a solvable practical problem of understanding the speaker in a limited context. In particular, we want to identify the intent of a speaker asking for information about flights. The dataset we are going to use is Airline Travel Information System (ATIS). This dataset was collected by DARPA in the early 90s. ATIS consists of spoken queries on flight related information. An example utterance is I want to go from Boston to Atlanta on Monday. Understanding this is then reduced to identifying arguments like Destination and Departure Day. This task is called slot-filling. Here is an example sentence and its labels. You will observe that labels are encoded in an Inside Outside Beginning (IOB) representation. Let’s look at the dataset: |Words | Show | flights | from | Boston | to | New | York| today| |Labels| O | O | O |B-dept | O|B-arr|I-arr|B-date| The ATIS official split contains 4,978/893 sentences for a total of 56,590/9,198 words (average sentence length is 15) in the train/test set. The number of classes (different slots) is 128, including the O label (NULL). Unseen words in the test set are encoded by the <UNK> token, and each digit is replaced with string DIGIT;that is,20 is converted to DIGITDIGIT. Our approach to the problem is to use: Word embeddings Recurrent neural networks I'll talk about these briefly in the following sections. Word Embeddings Word embeddings map words to a vector in a high-dimensional space. These word embeddings can actually learn the semantic and syntactic information of words. For instance, they can understand that similar words are close to each other in this space and dissimilar words are far apart. This can be learned either using large amounts of text like Wikipedia, or specifically for a given problem. We will take the second approach for this problem. As an illustation, I have shown here the nearest neighbors in the word embedding space for some of the words. This embedding space was learned by the model that we’ll define later in the post: sunday delta california boston august time car wednesday continental colorado nashville september schedule rental saturday united florida toronto july times limousine friday american ohio chicago june schedules rentals monday eastern georgia phoenix december dinnertime cars tuesday northwest pennsylvania cleveland november ord taxi thursday us north atlanta april f28 train wednesdays nationair tennessee milwaukee october limo limo saturdays lufthansa minnesota columbus january departure ap sundays midwest michigan minneapolis may sfo later Recurrent Neural Networks Convolutional layers can be a great way to pool local information, but they do not really capture the sequentiality of data. Recurrent Neural Networks (RNNs) help us tackle sequential information like natural language. If we are going to predict properties of the current word, we better remember the words before it too. An RNN has such an internal state/memory that stores the summary of the sequence it has seen so far. This allows us to use RNNs to solve complicated word tagging problems such as Part Of Speech (POS) tagging or slot filling, as in our case. The following diagram illustrates the internals of RNN:  Source: Nature RNN Let's briefly go through the diagram: Is the input to the RNN.   x_1,x_2,...,x_(t-1),x_t,x_(t+1)... Is the hidden state of the RNN at the step.  st This is computed based on the state at the step. t-1 As st=f(Uxt+Ws(t-1)) Here f is a nonlinearity such astanh or ReLU. ot Is the output at the step. t Computed as:ot=f(Vst)U,V,W Are the learnable parameters of RNN. For our problem, we will pass a word embeddings’ sequence as the input to the RNN. Putting it all together Now that we've setup the problem and have an understanding of the basic blocks, let's code it up. Since we are using the IOB representation for labels, it's not simpleto calculate the scores of our model. We therefore use the conlleval perl script to compute the F1 Scores. I've adapted the code from here for the data preprocessing and score calculation. The complete code is available at GitHub: $ git clone https://github.com/chsasank/ATIS.keras.git $ cd ATIS.keras I recommend using jupyter notebook to run and experiment with the snippets from the tutorial. $ jupyter notebook Conclusion In part 2, we will load the data using data.load.atisfull(). We will also define the Keras model, and then we will train the model. To measure the accuracy of the model, we’ll use model.predict_on_batch() and metrics.accuracy.conlleval(). And finally, we will improve our model to achieve better results. About the author Sasank Chilamkurthy works at Fractal Analytics. His work involves deep learning on medical images obtained from radiology and pathology. He is mainly interested in computer vision.
Read more
  • 0
  • 0
  • 4638
article-image-nhibernate-3-creating-sample-application
Packt
06 Sep 2011
8 min read
Save for later

NHibernate 3: Creating a Sample Application

Packt
06 Sep 2011
8 min read
  (For more resources on NHibernate, see here.)   Prepare our development environment We assume that you have a computer at hand which has Windows Vista, Windows 7, Windows Server 2003 or Windows Server 2008 installed. If you are using an Apple computer, then you can install, for example, Windows 7 as a virtual machine. First, install Microsoft Visual Studio 2010 Professional, Microsoft Visual C# 2010 Express or Microsoft Visual Basic 2010 Express on your system. The Express editions of Visual Studio can be downloaded from http://www.microsoft.com/express/windows. Note that NHibernate 3.x can also be used with the 2008 editions of Microsoft Visual Studio, but not with any older versions. NHibernate 3.x is based on the .NET framework version 3.5, and thus only works with IDEs that support this or a higher version of the .NET framework. Additionally, note that if you don't want to use Visual Studio, then there are at least two other free OSS options available to you: MonoDevelop is an IDE primarily designed for C# and other .NET languages. MonoDevelop makes it easy for developers to port .NET applications created with Visual Studio to Linux and to maintain a single code base for all platforms. MonoDevelop 2.4 or higher can be downloaded from http://monodevelop.com/download. SharpDevelop is a free IDE for C#, VB.NET, and Boo projects on Microsoft's .NET platform. It is open source. SharpDevelop 3.2 or higher can be downloaded from http://sharpdevelop.net/OpenSource/SD/Default.aspx. Furthermore, note that NHibernate also works on Mono: http://www.mono-project.com. Next, we need a relational database to play with. NHibernate supports all major relational databases like Oracle, MS SQL Server, MySQL, and so on. We will use MS SQL Server as our Relational Database Management System (RDBMS). Microsoft SQL Server is the most used RDBMS in conjunction with NHibernate and, in general, with .NET projects. The SQL Server driver for NHibernate is one of the most tested drivers in NHibernate's suite of unit tests, and when specific new features come out, it is likely that they will be first supported by this driver. Install the free Microsoft SQL Server 2008 R2 Express on your system if you have not already done so during the install of Visual Studio. You can download the express edition of MS SQL Server from here http://www.microsoft.com/express/Database/. For our samples, it really doesn't matter which version you download: the 32-bit or the 64-bit version. Just take the one that matches best with the bitness of your operating system. Make sure that you install SQL Server with the default instance name of SQL Express. Make sure you also download and install the free SQL Server Management Studio Express (SSMS) from the following link: http://www.microsoft.com/download/en/details.aspx?id=22985 Now, we are ready to tackle NHibernate. We can download NHibernate 3.1.0 GA from Source Forge http://sourceforge.net/projects/nhibernate/. The download consists of a single ZIP file containing the following content, as shown in the screenshot: The binaries that are always needed when developing an NHibernate based application can be found in the Required_Bins folder. Opening this folder, we find the files as shown in the following screenshot: Note that if you are downloading version 3.1 or newer of NHibernate, you will no longer find the two DLLs, Antlr3.Runtime.dll and Remotion.Data.Linq.dll, in the ZIP file that were present in version 3.0. The reason is that they have been IL merged into the NHibernate.dll. If we want to use lazy loading with NHibernate (and we surely will), then we also have to use some additional files which can be found in the Required_For_LazyLoading folder. Lazy loading is a technique that is used to load certain parts of the data only when really needed, which is when the code accesses it. There are three different options at hand. We want to choose Castle. The corresponding folder contains these files, as shown in the following screenshot: As we are also using Fluent NHibernate, we want to download the corresponding binaries too. Go grab the binaries from the Fluent NHibernate website and copy them to the appropriate location on your system. In either case, there is no installer available or needed. We just have to copy a bunch of files to a folder we define. Please download Fluent NHibernate, which also contains the binaries for NHibernate, from here (http://fluentnhibernate.org/downloads), as shown in the following screenshot. Make sure you download the binaries for NHibernate 3.1 and not an earlier version. Save the ZIP file you just downloaded to a location where you can easily find it for later usage. The ZIP file contains the files shown in the following screenshot: The only additional files regarding the direct NHibernate download are the FluentNHibernate.* files. On the other hand, we do not have the XSD schema files (nhibernate-configuration.xsd and nhibernate-mapping.xsd) included in this package and we'll want to copy those from the NHibernate package when implementing our sample.   Defining a model After we have successfully downloaded the necessary NHibernate and Fluent NHibernate files, we are ready to start implementing our first application using NHibernate. Let's first model the problem domain we want to create the application for. The domain for which we want to build our application is a product inventory system. With the application, we want to be able to manage a list of products for a small grocery store. The products shall be grouped by category. A category consists of a name and a short description. The product on the other hand has a name, a short description, a category, a unit price, a reorder level, and a flag to determine whether it is discontinued or not. To uniquely identify each category and product, they each have an ID. If we draw a class diagram of the model just described, then it would look similar to the following screenshot: Unfortunately, the class designer used to create the preceding diagram is only available in the professional version of Visual Studio and not in the free Express editions.   Time for action – Creating the product inventory model Let's implement the model for our simple product inventory system. First, we want to define a location on our system, where we will put all our code that we create. Create a folder called NH3BeginnersGuide on your file system. Inside this new folder, create another folder called lib. This is the place where we will put all the assemblies needed to develop an application using NHibernate and Fluent NHibernate. Locate the ZIP file containing the Fluent NHibernate files that you downloaded in the first section of this article. Extract all files to the lib folder created in the preceding step. Open Visual Studio and create a new project. Choose WPF Application as the project template. Call the project Chapter2. Make sure that the solution you create will be saved in the folder NH3BeginnersGuide you created in the preceding step. When using VS 2008 Pro, you can do this when creating the new project. If, on the other hand, you use the Express edition of Visual Studio, then you choose the location when you first save your project. (Move the mouse over the image to enlarge.) Add a new class to the project and call it Category. To this class, add a virtual (auto-) property called Id, which is of the int type. Also, add two other virtual properties of the string type, called Name and Description. The code should look similar to the following code snippet: namespace Chapter2{ public class Category { public virtual int Id { get; set; } public virtual string Name { get; set; } public virtual string Description { get; set; } }} Downloading the example code You can download the example code files here. Add another class to the project and call it Product. To this class, add the properties, as shown in the following code snippet. The type of the respective property is given in parenthesis: Id (int), Name (string), Description (string), Category (Category), UnitPrice (decimal), ReorderLevel (int), and Discontinued (bool). The resulting code should look similar to the following code snippet: namespace Chapter2{ public class Product { public virtual int Id { get; set; } public virtual string Name { get; set; } public virtual string Description { get; set; } public virtual Category Category { get; set; } public virtual decimal UnitPrice { get; set; } public virtual int ReorderLevel { get; set; } public virtual bool Discontinued { get; set; } }} What just happened? We have implemented the two classes Category and Product, which define our simple domain model. Each attribute of the entity is implemented as a virtual property of the class. To limit the amount of code necessary to define the entities, we use auto properties. Note that the properties are all declared as virtual. This is needed as NHibernate uses lazy loading by default.  
Read more
  • 0
  • 0
  • 4616

article-image-integrating-imagery-creating-and-styling-features-openlayers-3
Packt
17 Mar 2016
21 min read
Save for later

Integrating Imagery with Creating and Styling Features in OpenLayers 3

Packt
17 Mar 2016
21 min read
This article by Peter J Langley, author of the book OpenLayers 3.x Cookbook, sheds some light on three of the most important and talked about features of the OpenLayers library. (For more resources related to this topic, see here.) Introduction This article shows us the basics and the important things that we need to know when we start creating our first web-mapping application with OpenLayers. As we will see in this and the following recipes, OpenLayers is a big and complex framework, but at the same time, it is also very powerful and flexible. Although we're now spoilt for choice when it comes to picking a JavaScript mapping library (as we are with most JavaScript libraries and frameworks), OpenLayers is a mature, fully-featured, and well-supported library. In contrast to other libraries, such as Leaflet (http://leafletjs.com), which focus on a smaller download size in order to provide only the most common functionality as standard, OpenLayers tries to implement all the required things that a developer could need to create a web Geographic Information System (GIS) application. One aspect of OpenLayers 3 that immediately differentiates itself from OpenLayers 2, is that it's been built with the Google Closure library (https://developers.google.com/closure). Google Closure provides an extensive range of modular cross-browser JavaScript utility methods that OpenLayers 3 selectively includes. In GIS, a real-world phenomenon is represented by the concept of a feature. It can be a place, such as a city or a village, it can be a road or a railway, it can be a region, a lake, the border of a country, or something entirely arbitrary. Features can have a set of attributes, such as population, length, and so on. These can be represented visually through the use of points, lines, polygons, and so on, using some visual style: color, radius, width, and so on. OpenLayers offers us a great degree of flexibility when styling features. We can use static styles or dynamic styles influenced by feature attributes. Styles can be created through various methods, such as from style functions (ol.style.StyleFunction), or by applying new style instances (ol.style.Style) directly to a feature or layer. Let's take a look at all of this in the following recipes. Adding WMS layers Web Map Service (WMS) is a standard developed by the Open Geospatial Consortium (OGC), which is implemented by many geospatial servers, among which we can find the free and open source projects, GeoServer (http://geoserver.org) and MapServer (http://mapserver.org). More information on WMS can be found at http://en.wikipedia.org/wiki/Web_Map_Service. As a basic summary, a WMS server is a normal HTTP web server that accepts requests with some GIS-related parameters (such as projection, bounding box, and so on) and returns map tiles forming a mosaic that covers the requested bounding box. Here's the finished recipe's outcome using a WMS layer that covers the extent of the USA: We are going to work with remote WMS servers, so it is not necessary that you have one installed yourself. Note that we are not responsible for these servers, and that they may have problems, or they may not be available any longer when you read this section. Any other WMS server can be used, but the URL and layer name must be known. How to do it… We will add two WMS layers to work with. To do this, perform the following steps: Create an HTML file and add the OpenLayers dependencies. In particular, create the HTML to hold the map and the layer panel: <div id="js-map" class="map"></div> <div class="pane"> <h1>WMS layers</h1> <p>Select the WMS layer you wish to view:</p> <select id="js-layers" class="layers"> <option value="-10527519,3160212,4">Temperature " (USA)</option> <option value="-408479,7213209,6">Bedrock (UK)</option> </select> </div> Create the map instance with the default OpenStreetMap layer: var map = new ol.Map({ view: new ol.View({ zoom: 4, center: [-10527519, 3160212] }), target: 'js-map', layers: [ new ol.layer.Tile({ source: new ol.source.OSM() }) ] }); Add the first WMS layer to the map: map.addLayer(new ol.layer.Tile({ source: new ol.source.TileWMS({ url: 'http://gis.srh.noaa.gov/arcgis/services/' + 'NDFDTemps/MapServer/WMSServer', params: { LAYERS: 16, FORMAT: 'image/png', TRANSPARENT: true }, attributions: [ new ol.Attribution({ html: 'Data provided by the ' + '<a href="http://noaa.gov">NOAA</a>.' }) ] }), opacity: 0.50 })); Add the second WMS layer to the map: map.addLayer(new ol.layer.Tile({ source: new ol.source.TileWMS({ url: 'http://ogc.bgs.ac.uk/cgi-bin/' + 'BGS_Bedrock_and_Superficial_Geology/wms', params: { LAYERS: 'BGS_EN_Bedrock_and_Superficial_Geology' }, attributions: [ new ol.Attribution({ html: 'Contains <a href="http://bgs.ac.uk">' + 'British Geological Survey</a> ' + 'materials &copy; NERC 2015' }) ] }), opacity: 0.85 })); Finally, add the layer-switching logic: document.getElementById('js-layers') .addEventListener('change', function() { var values = this.value.split(','); var view = map.getView(); view.setCenter([ parseFloat(values[0]), parseFloat(values[1]) ]); view.setZoom(values[2]); }); How it works… The HTML and CSS divide the page into two sections: one for the map, and the other for the layer-switching panel. The top part of our custom JavaScript file creates a new map instance with a single OpenStreetMap layer; this layer will become the background for the WMS layers in order to provide some context. Let's spend the rest of our time concentrating on how the WMS layers are created. WMS layers are encapsulated within the ol.layer.Tile layer type. The source is an instance of ol.source.TileWMS, which is a subclass of ol.source.TileImage. The ol.source.TileImage class is behind many source types, such as Bing Maps, and custom OpenStreetMap layers that are based on XYZ format. When using ol.source.TileWMS, we must at least pass in the URL of the WMS server and a layers parameter. Let's breakdown the first WMS layer as follows: map.addLayer(new ol.layer.Tile({ source: new ol.source.TileWMS({ url: 'http://gis.srh.noaa.gov/arcgis/services/NDFDTemps/' + 'MapServer/WMSServer', params: { LAYERS: 16, FORMAT: 'image/png', TRANSPARENT: true }, attributions: [ new ol.Attribution({ html: 'Data provided by the ' + '<a href="http://noaa.gov">NOAA</a>.' }) ] }), opacity: 0.50 })); For the url property of the source, we provide the URL of the WMS server from NOAA (http://www.noaa.gov). The params property expects an object of key/value pairs. The content of this is appended to the previous URL as query string parameters, for example, http://gis.srh.noaa.gov/arcgis/services/NDFDTemps/MapServer/WMSServer?LAYERS=16. As mentioned earlier, at minimum, this object requires the LAYERS property with a value. We request for the layer by the name of 16. Along with this parameter, we also explicitly ask for the tile images to be in the .PNG format (FORMAT: 'image/png') and that the background of the tiles be transparent (TRANSPARENT: true) rather than white, which would undesirably block out the background map layer. The default values for format and transparency are already image or PNG and false, respectively. This means you don't need to pass them in as parameters, OpenLayers will do it for you. We've shown you this for learning purposes, but this isn't strictly necessary. There are also other parameters that OpenLayers fills in for you if not specified, such as service (WMS), version (1.3.0), request (GetMap), and so on. For the attributions property, we created a new attribution instance to cover our usage of the WMS service, which simply contains a string of HTML linking back to the NOAA website. Lastly, we set the opacity property of the layer to 50% (0.50), which suitably overlays the OpenStreetMap layer underneath: map.addLayer(new ol.layer.Tile({ source: new ol.source.TileWMS({ url: 'http://ogc.bgs.ac.uk/cgi-bin/' + 'BGS_Bedrock_and_Superficial_Geology/wms', params: { LAYERS: 'BGS_EN_Bedrock_and_Superficial_Geology' }, attributions: [ new ol.Attribution({ html: 'Contains <a href="http://bgs.ac.uk">' + 'British Geological Survey</a> ' + 'materials &copy; NERC 2015' }) ] }), opacity: 0.85 })); Check the WMS standard to know which parameters you can use within the params property. The use of layers is mandatory, so you always need to specify this value. This layer from the British Geological Survey (http://bgs.ac.uk) follows the same structure as the previous WMS layer. Similarly, we provided a source URL and a layers parameter for the HTTP request. The layer name is a string rather than a number this time, which is delimited by underscores. The naming convention is at the discretion of the WMS service itself. Like earlier, an attribution instance has been added to the layer, which contains a string of HTML linking back to the BGS website, covering our usage of the WMS service. The opacity property of this layer is a little less transparent than the last one, at 85% (0.85): document.getElementById('js-layers') .addEventListener('change', function() { var values = this.value.split(','); var view = map.getView(); view.setCenter([ parseFloat(values[0]), parseFloat(values[1]) ]); view.setZoom(values[2]); }); Finally, we added a change-event listener and handler to the select menu containing both the WMS layers. If you recall from the HTML, an option's value contains a comma-delimited string. For example, the Bedrock WMS layer option looks like this: <option value="-408479,7213209,6">Bedrock (UK)</option> This translates to x coordinate, y coordinate, and zoom level. With this in mind when the change event fires, we store the value of the newly-selected option in a variable named values. The split JavaScript method creates a three-item array from the string. The array now contains the xy coordinates and the zoom level, respectively. We store a reference to the view into a variable, namely view, as it's accessed more than once within the event handler. The map view is then centered to the new location with the setCenter method. We've made sure to convert the string values into float types for OpenLayers, via the parseFloat JavaScript method. The zoom level is then set via the setZoom method. Continuing with the Bedrock example, it will recenter at -408479, 7213209 with zoom level 6. Integrating with custom WMS services plays an essential role in many web-mapping applications. Learning how we did this in this recipe should give you a good idea of how to integrate with any other WMS services that you may use. There's more… It's worth mentioning that WMS services do not necessarily cover a global extent, and they will more likely cover only subset extents of the world. Case in point, the NOAA WMS layer covers only USA, and the BGS WMS layer only covers the UK. During this topic, we only looked at the request type of GetMap, but there's also a request type called GetCapabilities. Using the GetCapabilities request parameter on the same URL endpoint returns the capabilities (such as extent) that a WMS server supports. This is discussed in much more detail later in this book. If you don't specify the type of projection, the view default projection will be used. In our case, this will be EPSG:3857, which is passed up in a parameter named CRS (it's named SRS for the GetMap version requests less than 1.3.0). If you want to retrieve WMS tiles in different projections, you need to ensure that the WMS server supports that particular format. WMS servers return images no matter whether there is information in the bounding box that we are requesting or not. Taking this recipe as an example, if the viewable extent of the map is only the UK, blank images will get returned for WMS layer requests made for USA (via the NOAA tile requests). You can prevent these unnecessary HTTP requests by setting the visibility of any layers that do not cover the extent of the area being viewed to false. There are some useful methods of the ol.source.TileWMS class that are worth being aware of, such as updateParams, which can be used to set parameters for the WMS request, and getUrls, which return the URLs used for the WMS source. Creating features programmatically Loading data from an external source is not the only way to work with vector layers. Imagine a web-mapping application where users can create new features on the fly: landing zones, perimeters, areas of interest, and so on, and add them to a vector layer with some style. This scenario requires the ability to create and add the features programmatically. In this recipe, we will take a look at some of the ways to create a selection of features programmatically. How to do it… Here, we'll create some features programmatically without any file importing. The following instructions show you how this is done: Start by creating a new HTML file with the required OpenLayers dependencies. In particular, add the div element to hold the map: <div id="js-map"></div> Create an empty JavaScript file and instantiate a map with a background raster layer: var map = new ol.Map({ view: new ol.View({ zoom: 3, center: [-2719935, 3385243] }), target: 'js-map', layers: [ new ol.layer.Tile({ source: new ol.source.MapQuest({layer: 'osm'}) }) ] }); Create the point and circle features: var point = new ol.Feature({ geometry: new ol.geom.Point([-606604, 3228700]) }); var circle = new ol.Feature( new ol.geom.Circle([-391357, 4774562], 9e5) ); Create the line and polygon features: var line = new ol.Feature( new ol.geom.LineString([ [-371789, 6711782], [1624133, 4539747] ]) ); var polygon = new ol.Feature( new ol.geom.Polygon([[ [606604, 4285365], [1506726, 3933143], [1252344, 3248267], [195678, 3248267] ]]) ); Create the vector layer and add features to the layer: map.addLayer(new ol.layer.Vector({ source: new ol.source.Vector({ features: [point, circle, line, polygon] }) })); How it works… Although we've created some random features for this recipe, features in mapping applications would normally represent some phenomenon of the real world with an appropriate geometry and a style associated with it. Let's go over the programmatic feature creation and how they are added to a vector: layervar point = new ol.Feature({ geometry: new ol.geom.Point([-606604, 3228700]) }); Features are instances of ol.Feature. This constructor contains many useful methods, such as clone, setGeometry, getStyle, and others. When creating an instance of ol.Feature, we must either pass in a geometry of type ol.geom.Geometry or an object containing properties. We demonstrate both variations throughout this recipe. For the point feature, we pass in a configuration object. The only property that we supply is geometry. There are other properties available, such as style, and the use of custom properties to set the feature attributes ourselves, which come with getters and setters. The geometry is an instance of ol.geom.Point. The ol.geom class provides a variety of other feature types that we don't get to see in this recipe, such as MultiLineString and MultiPoint. The pointgeometry type simply requires an ol.Coordinate type array (xy coordinates): var circle = new ol.Feature( new ol.geom.Circle([-391357, 4774562], 9e5) ); Remember to express the coordinates in the appropriate projection, such as the one used by the view, or translate the coordinates yourself. So, for now, all features will be rendered with the default OpenLayers styling. The circle feature follows almost the same structure as the point feature. This time, however, we don't pass in a configuration object to ol.Feature, but instead, we directly instantiate an ol.geom.Geometry type of Circle. The circle geometry takes an array of coordinates and a second parameter for the radius. 9e5 or 9e+5 is exponential notation for 900,000. The circle geometry also has useful methods, such as getCenter and setRadius: var line = new ol.Feature( new ol.geom.LineString([ [-371789, 6711782], [1624133, 4539747] ]) ); The only noticeable difference with the LineString feature is that ol.geom.LineString expects an array of coordinate arrays. For more advanced line strings, use the ol.geom.MultiLineString geometry type (more information about them can be found on the OpenLayers API documentation: http://openlayers.org/en/v3.13.0/apidoc/): The LineString feature also has useful methods, such as getLength: var polygon = new ol.Feature( new ol.geom.Polygon([[ [606604, 4285365], [1506726, 3933143], [1252344, 3248267], [195678, 3248267] ]]) ); The final feature, a Polygon geometry type differs slightly from the LineString feature as it expects an ol.Coordinate type array within an array within another wrapping array. This is because the constructor (ol.geom.Polygon) expects an array of rings with each ring representing an array of coordinates. Ideally, each ring should be closed. The polygon feature also has useful methods, such as getArea and getLinearRing: map.addLayer(new ol.layer.Vector({ source: new ol.source.Vector({ features: [point, circle, line, polygon] }) })); The OGC's Simple Feature Access specification (http://www.opengeospatial.org/standards/sfa) contains an in-depth description of the standard. It also contains a UML class diagram where you can see all the geometry classes and hierarchy. Finally, we create the vector layer, with a vector source instance and then add all four features into an array and pass it to the features property. All the features we've created are subclasses of ol.geom.SimpleGeometry. This class provides useful base methods, such as getExtent and getFirstCoordinate. All features have a getType method that can be used to identify the type of feature, for example, 'Point' or 'LineString'. There's more… Sometimes, the polygon features may represent a region with a hole in it. To create the hollow part of a polygon, we use the LinearRing geometry. The outcome is best explained with the following screenshot: You can see that the polygon has a section cut out of it. To achieve this geometry, we must create the polygon in a slightly different way. Here are the steps: Create the polygon geometry: var polygon = new ol.geom.Polygon([[ [606604, 4285365], [1506726, 3933143], [1252344, 3248267], [195678, 3248267] ]]); Create and add the linear ring to the polygon geometry: polygon.appendLinearRing( new ol.geom.LinearRing([ [645740, 3766816], [1017529, 3786384], [1017529, 3532002], [626172, 3532002] ]) ); Create the completed feature: var polygonFeature = new ol.Feature(polygon); Finish off by adding the polygon feature to the vector layer: vectorLayer.getSource().addFeature(polygonFeature); We won't break this logic down any further, as it's quite self explanatory. Now, we're comfortable with geometry creation. The ol.geom.LinearRing feature can only be used in conjunction with a polygon geometry, not as a standalone feature. Styling features based on geometry type We can summarize that there are two ways to style a feature. The first is by applying the style to the layer so that every feature inherits the styling. The second is to apply the styling options directly to the feature, which we'll see with this recipe. This recipe shows you how we can choose which flavor of styling to apply to a feature depending on the geometry type. We will apply the style directly to the feature using the ol.Feature method, setStyle. When a point geometry type is detected, we will actually style the representing geometry as a star, rather than the default circle shape. Other styling will be applied when a geometry type of line string is detected and here's what the output of the recipe will look like: How to do it… To customize the feature styling based on the geometry type, follow these steps: Create the HTML file with OpenLayers dependencies, the jQuery library, and a div element that will hold the map instance. Create a custom JavaScript file and initialize a new map instance: var map = new ol.Map({ view: new ol.View({ zoom: 4, center: [-10732981, 4676723] }), target: 'js-map', layers: [ new ol.layer.Tile({ source: new ol.source.MapQuest({layer: 'osm'}) }) ] Create a new vector layer and add it to the map. Have the source loader function retrieve the GeoJSON file, format the response, then pass it through our custom modifyFeatures method (which we'll implement next) before adding the features to the vector source: var vectorLayer = new ol.layer.Vector({ source: new ol.source.Vector({ loader: function() { $.ajax({ type: 'GET', url: 'features.geojson', context: this }).done(function(data) { var format = new ol.format.GeoJSON(); var features = format.readFeatures(data); this.addFeatures(modifyFeatures(features)); }); } }) }); map.addLayer(vectorLayer); Finish off by implementing the modifyFeatures function so that it transforms the projection of the geometry and styles the feature that are based on the geometry type: function modifyFeatures(features) { features.forEach(function(feature) { var geometry = feature.getGeometry(); geometry.transform('EPSG:4326', 'EPSG:3857'); if (geometry.getType() === 'Point') { feature.setStyle( new ol.style.Style({ image: new ol.style.RegularShape({ fill: new ol.style.Fill({ color: [255, 0, 0, 0.6] }), stroke: new ol.style.Stroke({ width: 2, color: 'blue' }), points: 5, radius1: 25, radius2: 12.5 }) }) ); } if (geometry.getType() === 'LineString') { feature.setStyle( new ol.style.Style({ stroke: new ol.style.Stroke({ color: [255, 255, 255, 1], width: 3, lineDash: [8, 6] }) }) ); } }); return features; } How it works… Let's briefly look over the loader function of the vector source before we take a closer examination of the logic behind the styling: loader: function() { $.ajax({ type: 'GET', url: 'features.geojson', context: this }).done(function(data) { var format = new ol.format.GeoJSON(); var features = format.readFeatures(data); this.addFeatures(modifyFeatures(features)); }); } Our external resource contains points and line strings in the format of GeoJSON. So we must create a new instance of ol.format.GeoJSON so that we can read in the data (format.readFeatures(data)) of the AJAX response to build out a collection of OpenLayers features. Before adding the group of features straight into the vector source (this refers to the vector source here), we pass the array of features through our modifyFeatures method. This method will apply all the necessary styling to each feature, then return the modified features in place, and feed the result into the addFeatures method. Let's break down the contents our modifyFeatures method: function modifyFeatures(features) { features.forEach(function(feature) { var geometry = feature.getGeometry(); geometry.transform('EPSG:4326', 'EPSG:3857'); The logic begins by looping over each feature in the array using the JavaScript array method, forEach. The first argument passed into the anonymous iterator function is the (feature) feature. Within the loop iteration, we store the feature's geometry into a variable, namely geometry, as it's accessed more than once during the loop iteration. Unbeknown to you, the projection of coordinates within the GeoJSON file are in longitude/latitude, the EPSG:4326 projection code. The map's view, however, is in the EPSG:3857 projection. To ensure they appear where intended on the map, we use the transform geometry method, which takes the source and the destination projections as arguments and converts the coordinates of the geometry in place: if (geometry.getType() === 'Point') { feature.setStyle( new ol.style.Style({ image: new ol.style.RegularShape({ fill: new ol.style.Fill({ color: [255, 0, 0, 0.6] }), stroke: new ol.style.Stroke({ width: 2, color: 'blue' }), points: 5, radius1: 25, radius2: 12.5 }) }) ); } Next up is a conditional check on whether or not the geometry is a type of Point. The geometry instance has the getType method for this kind of purpose. Inline of the setStyle method of the feature instance, we create a new style object from the ol.style.Style constructor. The only direct property that we're interested in is the image property. By default, point geometries are styled as a circle. Instead, we want to style the point as a star. We can achieve this through the use of the ol.style.RegularShape constructor. We set up a fill style with color and a stroke style with width and color. The points property specifies the number of points for the star. In the case of a polygon shape, it represents the number of sides. The radius1 and radius2 properties are specifically to design star shapes for the configuration of the inner and outer radius, respectively: if (geometry.getType() === 'LineString') { feature.setStyle( new ol.style.Style({ stroke: new ol.style.Stroke({ color: [255, 255, 255, 1], width: 3, lineDash: [8, 6] }) }) ); } The final piece of the method has a conditional check on the geometry type of LineString. If this is the case, we style this geometry type differently to the point geometry type. We provide a stroke style with a color, width,property and a custom lineDash. The lineDash array declares a line length of 8 followed by a gap length of 6. Summary In this article we looked at how to integrate WMS layers to our map from a basic HTTP web server by passing in some GIS related parameters. We also saw how to create and add features to our vector layer with some styling the idea behind this particular recipe was to enable the user to create the features programmatically without any file importing. We also saw how to style these features by applying styling option to the features individually based on their geometry type rather than styling the layer. Resources for Article: Further resources on this subject: What is OpenLayers?[article] Getting Started with OpenLayers[article] Creating Simple Maps with OpenLayers 3[article]
Read more
  • 0
  • 0
  • 4607

article-image-sharding-action
Packt
21 Jul 2014
9 min read
Save for later

Sharding in Action

Packt
21 Jul 2014
9 min read
(For more resources related to this topic, see here.) Preparing the environment Before jumping into configuring and setting up the cluster network, we have to check some parameters and prepare the environment. To enable sharding for a database or collection, we have to configure some configuration servers that hold the cluster network metadata and shards information. Other parts of the cluster network use these configuration servers to get information about other shards. In production, it's recommended to have exactly three configuration servers in different machines. The reason for establishing each shard on a different server is to improve the safety of data and nodes. If one of the machines crashes, the whole cluster won't be unavailable. For the testing and developing environment, you can host all the configuration servers on a single server.Besides, we have two more parts for our cluster network, shards and mongos, or query routers. Query routers are the interface for all clients. All read/write requests are routed to this module, and the query router or mongos instance, using configuration servers, route the request to the corresponding shard. The following diagram shows the cluster network, modules, and the relation between them: It's important that all modules and parts have network access and are able to connect to each other. If you have any firewall, you should configure it correctly and give proper access to all cluster modules. Each configuration server has an address that routes to the target machine. We have exactly three configuration servers in our example, and the following list shows the hostnames: cfg1.sharding.com cfg2.sharding.com cfg3.sharding.com In our example, because we are going to set up a demo of sharding feature, we deploy all configuration servers on a single machine with different ports. This means all configuration servers addresses point to the same server, but we use different ports to establish the configuration server. For production use, all things will be the same, except you need to host the configuration servers on separate machines. In the next section, we will implement all parts and finally connect all of them together to start the sharding server and run the cluster network. Implementing configuration servers Now it's time to start the first part of our sharding. Establishing a configuration server is as easy as running a mongod instance using the --configsvr parameter. The following scheme shows the structure of the command: mongod --configsvr --dbpath <path> --port <port> If you don't pass the dbpath or port parameters, the configuration server uses /data/configdb as the path to store data and port 27019 to execute the instance. However, you can override the default values using the preceding command. If this is the first time that you have run the configuration server, you might be faced with some issues due to the existence of dbpath. Before running the configuration server, make sure that you have created the path; otherwise, you will see an error as shown in the following screenshot: You can simply create the directory using the mkdir command as shown in the following line of command: mkdir /data/configdb Also, make sure that you are executing the instance with sufficient permission level; otherwise, you will get an error as shown in the following screenshot: The problem is that the mongod instance can't create the lock file because of the lack of permission. To address this issue, you should simply execute the command using a root or administrator permission level. After executing the command using the proper permission level, you should see a result like the following screenshot: As you can see now, we have a configuration server for the hostname cfg1.sharding.com with port 27019 and with dbpath as /data/configdb. Also, there is a web console to watch and control the configuration server running on port 28019. By pointing the web browser to the address http://localhost:28019/, you can see the console. The following screenshot shows a part of this web console: Now, we have the first configuration server up and running. With the same method, you can launch other instances, that is, using /data/configdb2 with port 27020 for the second configuration server, and /data/configdb3 with port 27021 for the third configuration server. Configuring mongos instance After configuring the configuration servers, we should bind them to the core module of clustering. The mongos instance is responsible to bind all modules and parts together to make a complete sharding core. This module is simple and lightweight, and we can host it on the same machine that hosts other modules, such as configuration servers. It doesn't need a separate directory to store data. The mongos process uses port 27017 by default, but you can change the port using the configuration parameters. To define the configuration servers, you can use the configuration file or command-line parameters. Create a new file using your text editor in the /etc/ directory and add the following configuring settings: configdb = cfg1.sharding.com:27019, cfg2.sharding.com:27020 cfg3.sharding.com:27021 To execute and run the mongos instance, you can simply use the following command: mongos -f /etc/mongos.conf After executing the command, you should see an output like the following screenshot: Please note that if you have a configuration server that has been already used in a different sharding network, you can't use the existing data directory. You should create a new and empty data directory for the configuration server. Currently, we have mongos and all configuration servers that work together pretty well. In the next part, we will add shards to the mongos instance to complete the whole network. Managing mongos instance Now it's time to add shards and split whole dataset into smaller pieces. For production use, each shard should be a replica set network, but for the development and testing environment, you can simply add a single mongod instances to the cluster. To control and manage the mongos instance, we can simply use the mongo shell to connect to the mongos and execute commands. To connect to the mongos, you use the following command: mongo --host <mongos hostname> --port <mongos port> For instance, our mongos address is mongos1.sharding.com and the port is 27017. This is depicted in the following screenshot: After connecting to the mongos instance, we have a command environment, and we can use it to add, remove, or modify shards, or even get the status of the entire sharding network. Using the following command, you can get the status of the sharding network: sh.status() The following screenshot illustrates the output of this command: Because we haven't added any shards to sharding, you see an error that says there are no shards in the sharding network. Using the sh.help() command, you can see all commands as shown in the following screenshot: Using the sh.addShard() function, you can add shards to the network. Adding shards to mongos After connecting to the mongos, you can add shards to sharding. Basically, you can add two types of endpoints to the mongos as a shard; replica set or a standalone mongod instance. MongoDB has a sh namespace and a function called addShard(), which is used to add a new shard to an existing sharding network. Here is the example of a command to add a new shard. This is shown in the following screenshot: To add a replica set to mongos you should follow this scheme: setname/server:port For instance, if you have a replica set with the name of rs1, hostname mongod1.replicaset.com, and port number 27017, the command will be as follows: sh.addShard("rs1/mongod1.replicaset.com:27017") Using the same function, we can add standalone mongod instances. So, if we have a mongod instance with the hostname mongod1.sharding.com listening on port 27017, the command will be as follows: sh.addShard("mongod1.sharding.com:27017") You can use a secondary or primary hostname to add the replica set as a shard to the sharding network. MongoDB will detect the primary and use the primary node to interact with sharding. Now, we add the replica set network using the following command: sh.addShard("rs1/mongod1.replicaset.com:27017") If everything goes well, you won't see any output from the console, which means the adding process was successful. This is shown in the following screenshot: To see the status of sharding, you can use the sh.status() command. This is demonstrated in the following screenshot: Next, we will establish another standalone mongod instance and add it to sharding. The port number of mongod is 27016 and the hostname is mongod1.sharding.com. The following screenshot shows the output after starting the new mongod instance: Using the same approach, we will add the preceding node to sharding. This is shown in the following screenshot: It's time to see the sharding status using the sh.status() command: As you can see in the preceding screenshot, now we have two shards. The first one is a replica set with the name rs1, and the second shard is a standalone mongod instance on port 27016. If you create a new database on each shard, MongoDB syncs this new database with the mongos instance. Using the show dbs command, you can see all databases from all shards as shown in the following screenshot: The configuration database is an internal database that MongoDB uses to configure and manage the sharding network. Currently, we have all sharding modules working together. The last and final step is to enable sharding for a database and collection. Summary In this article, we prepared the environment for sharding of a database. We also learned about the implementation of a configuration server. Next, after configuring the configuration servers, we saw how to bind them to the core module of clustering. Resources for Article: Further resources on this subject: MongoDB data modeling [article] Ruby with MongoDB for Web Development [article] Dart Server with Dartling and MongoDB [article]
Read more
  • 0
  • 0
  • 4599
article-image-breaking-bank
Packt
01 Mar 2016
32 min read
Save for later

Breaking the Bank

Packt
01 Mar 2016
32 min read
In this article by Jon Jenkins, author of the book Learning Xero, covers the Xero core bank functionalities, including one of the most innovative tools of our time: automated bank feeds. We will walk through how to set up the different types of bank account you may have and the most efficient way to reconcile your bank accounts. If they don't reconcile, you will be shown how you can spot and correct any errors. Automated bank feeds have revolutionized the way in which a bank reconciliation is carried out and the speed at which you can do it. Thanks to Xero, there is no longer an excuse to not keep on top of your bank accounts and therefore maintain accurate and up-to-date information for your business's key decision makers. These are the topics we'll be covering in this article: Setting up a bank feed Using rules to speed up the process Completing a bank reconciliation Dealing with common bank reconciliation errors (For more resources related to this topic, see here.) Bank overview Reconciling bank accounts has never been as easy or as quick, and we are only just at the beginning of this journey as Xero continues to push the envelope. Xero is working with banks to not only bring bank data into your accounting system but to also push it back again. That's right; you could mark a supplier invoice paid in Xero and it could actually send the payment from your bank account. Dashboard When you log in to Xero, you are presented with the dashboard, which gives a great overview of what is going on within the business. It is also an excellent place to navigate to the main parts of Xero that you will need. If you have several bank accounts, the summary pane that shows the cash in and out during a month, as shown below, is very useful as you can hover over the bar chart to get a quick snapshot of the total cash in and out for the month with no effort at all. If you want the chart to sit at the top of your dashboard, click on Edit Dashboard at the bottom of the page, and drag and drop the chart. When finished, click on Done at the bottom of the page to lock them in place. Reconciling bank accounts is fundamental to good bookkeeping, and only once the accounts have been reconciled do you know that your records are up-to-date. It isn't worth spending lots of time looking at reports if the bank accounts haven't been reconciled, as there may be important items missing from your records. By default, all bank accounts added will be shown on your dashboard, which shows the main account information, the number of unreconciled items, and the balance per Xero and per statement. You may wish to just see a few key bank accounts, in which case you can turn some of them off by going to Accounts | Bank Accounts, where you will see a list of all bank accounts. Here, you can choose to remove the bank accounts showing on the dashboard by unchecking the Show account on Dashboard option. You can also choose the Change order option, which allows you to decide the order in which you see the bank accounts on the dashboard. Click on the up and down arrows to move the accounts as required. Bank feeds If you did not set up a bank account when you were setting up Xero, then we recommend you do that now, as you cannot set up a bank feed without one. You can do this from the dashboard by clicking on the Add Bank Account button or by navigating to Accounts | Bank Accounts and Add Bank Account. Then, you are presented with of list of options including Bank Account, Credit Card, or PayPal. Enter your account details as requested and click on Save. It is very important at this stage to note that you may be presented with several options for your bank as they offer different accounts. If you choose the wrong one, your feed will not work. Some banks charge for a direct bank feed; you do not have to adhere to this, so ignore the feeds ending with Direct Bank Feed and select the alternative one. The difference between a Direct Bank Feed and the Yodlee service that Xero uses is that the data comes directly from the bank and not via a third party, so is deemed to be more reliable. Now that you have a bank account, you can add the feed by clicking on the Get bank feeds button, as shown in the following screenshot: On the Activate Bank Feed page, the fields will vary depending on the bank you use and the type of account you selected earlier. Enter the User Credentials as requested. You will then see a screen with the text Connecting to Bank, which states it might take a few minutes, so please bear with it; you are almost there. When prompted, select the bank account from the dropdown called Select the matching bank feed... that matches the account you are adding and choose whether you wish to import from a certain date or collect as much data as possible. How far it goes back varies by bank. If you are converting from another system, it would be wise to use the conversion date as the date from which you wish to start transactions, in order to avoid bringing in transactions processed in your old system (that is if your conversion date was May 31, 2015, you would use June 1, 2015. Once you are happy with the information provided, click on OK. If you have several accounts, such as a savings account, then simply follow the process again for each account. Refresh feed Each bank feed is different, and some will automatically update; others, however, require refreshing. You can usually tell which bank accounts require a manual refresh, as they will show the following message at the bottom of the bank account tile on the dashboard: To refresh the feed from the dashboard, find the account to update and click the Manage button in the top right-hand corner and then Refresh Bank Feed. You can also do this from within Accounts | Bank Accounts | Manage Accounts | Refresh Bank Feed. To update your bank feed, you will need to refresh the feed each time you want to reconcile the bank account. Import statements You get over most disappointments in life, unlike when you find out that the bank account you have does not have a bank feed available. Your options here are simple: go and change banks. But if that is too much hassle for you, then you could always just import a file. Xero accepts OFX, QIF, and CSV. Should your bank offer a selection, go with them in this order. OFX and QIF files are formatted and should import without too many problems. CSV, on the other hand, is a different matter. Each bank CSV download will come in a different format and will need some work before it is ready for importing. This takes some time, so I would recommend using the Xero Help Guide and searching Import a Bank Statement to get the file formatted correctly. If you only do things on a monthly basis, uploading a statement is not too much of a chore. We would say at this point that the automated bank feed is one of the most revolutionary things to come out of accounting software, so not using it is a crime. You simply are not enjoying the full benefits of using Xero and cloud software without it. Petty cash It probably costs the business more to find missing petty cash discrepancies than the discrepancy totals. Our advice is simple: try not to maintain a petty cash account if you can; it is just one more thing to reconcile and manage. We would advocate using a company debit card where possible, as the transactions will then go through your main bank account and you will know what is missing. Get staff to submit an expense claim, and if that is too much hassle, treat payments in and out as if they have gone through the director's loan account, as that is what happens in most small businesses. Should you wish to operate a petty cash account, you will need to mark the transactions as reconciled manually, as there is no feed to match bank payments and receipts against. In order to do this, you must first turn on the ability to mark transactions as reconciled manually. This can be found hiding in the Help section. When you click on Help, you should then see an option to Enable Mark as Reconciled, which you will need to click to turn on this functionality. Now that you have the ability to Mark as Reconciled, you can begin to reconcile the petty cash. Go to Manage | Reconcile Account. You will be presented with the four tabs below (if Cash Coding is not turned on for your user role, you will not see that tab). The Reconcile tab should be blank, as there is no feed or import in place. You will want to go to Account transactions, which is where the items you have marked as paid from petty cash will live. Underneath this section, you will also find a button called New Transactions, where you can create transactions on the fly that you may have missed or are not going to raise a sales invoice or supplier bill for. You can see from the following screenshot that we have added an example of a Spend Money transaction, but you can also create a Receive Money transaction when clicking on the New Transaction button. Click on Save when you have finished entering the details of your transaction. If you have outstanding sales invoices or purchase invoices that have been paid through the petty cash account, then you will need to mark them as paid using the petty cash bank account in order for them to appear in the Account Transactions section. To do this, navigate to those items and then complete the Make a Payment or Receive a Payment section in the bottom left-hand corner. From the main Account transactions screen, you can then mark your transactions as reconciled. Check off the items you wish to reconcile using the checkboxes on the left, then click on More | Mark as Reconciled. When you have completed the process, your bank account balance in Xero should match that of your petty cash sheet. The status of transactions marked as Reconciled Manually will change to Reconciled in black. When a transaction is reconciled, it has come in from either an import or a bank feed, and it will be green. If it is unreconciled, it will be orange. Loan account A loan account works in the same way as a bank account, and we would recommend that you set it up if a feed is available from your bank. Managing loans in Xero is easy, as you can set up a bank rule for the interest and use the Transfer facility to reconcile any loan repayments. Credit cards Like adding bank accounts, you can add credit cards from the dashboard by clicking on the Add Bank Account button or by navigating to Accounts |Bank Accounts |Add Bank Account | Credit Card. Add a card You may see several options for the bank you use, so double-check you are using the right option or the feed will not work. If there is no feed set up for your particular bank or credit card account, you will be notified as follows: In this case, you will need to either import a statement in either the OFX, QIF, or CSV format, or manually reconcile your credit card account. The process is the same as that detailed above for reconciling petty cash and will be matched against your credit card statement. If a feed is available, enter the login credentials requested to set up the feed in the same fashion as when adding a new bank account feed. Common issues Credit cards act in a different way than a bank account, in that each card is an account on its own, separate from the main account on which interest and balance payments are allocated. This means that even if you have just one business credit card, you will, in effect, have two accounts with the bank. You can add a feed for each account if you wish, but for the main account, the only transactions that will go through it are any interest accrued and balance payments to clear the card. We would suggest you set up the credit card account as a feed, as this is where you will see most transactions and therefore save the most time in processing. Each time interest is accrued, you will need to post it as a Spend Money transaction, and each time you make a payment, the amount will be a Transfer from one of your other accounts. Both these transactions will need to be marked as Reconciled Manually, as they will not appear on your credit card feed setup. This is done in the same way as oultlined in the Petty cash section earlier PayPal Just like a bank account, you can sync Xero with your PayPal account, even in multiple currencies. The ability to do this, coupled with using bank rules, can help supercharge your processing ability and cut down on posting errors. Add a feed There is a little bit of configuration required to set up a PayPal account. Go to Accounts | Bank Accounts | Add Bank Account | PayPal. Add the account name as you wish for it to appear in Xero and the currency for the account. To use the feed (why wouldn't you?), check  the Set up automatic PayPal import option, which will then bring up the other options shown in the following screenshot: As previously suggested, if converting from another system, then import transactions from the conversion dates, as with all previous transactions, should be dealt with in your old accounting system. Click on Save, and you will receive an e-mail from Xero to confirm your PayPal e-mail address. Click on the activation link in the e-mail. To complete the setup process, you need to update your PayPal settings in order for Xero to turn on the automatic feeds. In PayPal, go to My Account | Profile | My Selling Tools. Next to the API access, click on Update | Option 1. The box should then be Grant API Permission. In the Third Party Permission field, enter paypal_api1.xero.com, then click on Lookup. Under Available Permissions, make sure you check the following options: Click on Add, and you have finished the setup process. If you have multiple currency accounts, then complete this process for each currency. Bank rules Bank rules give you the ability to automate the processing of recurring payments and receipts based on your chosen criteria, and they can be a massive time saver if you have many transactions to deal with. An example would be the processing of PayPal fees on your PayPal account. Rather than having to process each one, you could set up a bank rule to deal with the transaction. Bank rules cannot be used for bank transfers or allocating supplier payment or customer receipts. Add bank rules You can add a bank rule directly from the bank statement line by clicking on Create rule above the bank line details. This means waiting for something to come through the bank first, which we think makes sense, as that detail is taken into consideration when setting up the bank rule, making it simpler to set up. You can also enter bank rules you know will need adding by going to the relevant bank account and clicking Manage Account | Bank Rules | Create Rule. We have broken the bank rule down into different sections. Section 1 allows you to set the conditions that must be present in order for the bank rule to trigger. Using equals means the payee or description in this example must match exactly. If you were to change it to contains, then only part of the description need be present. This can be very useful when the description contains a reference number that changes each month. You do not want the bank rule to fail, so you might choose to remove the reference number and change the condition to contains instead. You must set at least one condition. Section 2 allows you to set a contact, which we suggest you do; otherwise, you will have to do this on the Bank Account screen each time before being able to reconcile that bank statement line. Section 3 allows you to fix a value to an account code. This can be useful if the bank rule you are setting up contains an element of a fixed amount and variable amount. An example might be a telephone bill where the line rental is fixed and the balance is for call charges that will vary month by month. Section 4 allows you to allocate a percentage to an account code. If there is not a fixed value amount in section 3, you can just use section 4 and post 100% of the cost to the account code of your choice. Likewise, if you had a bank statement line that you wanted to split between account codes, then you could do so by entering a second line and using the percentage column. Section 5 allows you to set a reference to be used when the bank rule runs and there are five options. We would suggest not using the by me during bank rec option, as this again creates extra work, since you will have to fill it in each time before you can reconcile that bank statement line. Section 6 allows you to choose which bank account you want the bank rule to run on. This is useful if you start paying for an item out of a different bank account, as you can edit the rule and change the bank account rather than having to create the rule all over again. Section 7 allows you to set a title for the bank rule. Use something that will make it easy for you to remember when on the Bank Rules screen. Edit bank rules If your bank rules are not firing the way you expected or at all, then you will want to edit them to get them right. It is worth spending time in this area, as once you have mastered setting up bank rules, they will save you time. To edit a bank rule, you will need to navigate to Accounts | Bank Accounts | Manage Accounts | Bank Rules. Click on the bank rule you wish to edit, make the necessary adjustments, and then click on Save. You will know if the bank rule is working, as it appears like the following when reconciling your bank account. If you do not wish to apply the rule, you can click on Don't apply rule in the bottom-left corner, or if you wish to check what it is going to do, click on View details first to verify the bank rule is going to post where you prefer. Re-order bank rules The order in which your bank rules sit is the order in which Xero runs them. This is important to remember if you have bank rules set up that may conflict with each other and not return the result you were expecting. An example might be you purchasing different items from a supermarket, such as petrol and stationery. In this example, we will call it Xeroco. In most instances, the bank statement line will show two different descriptions or references, in this case Xeroco for the main store and Xeroco Fuel for the gas station. You will need to set up your rules carefully, as using only contains for Xeroco will mean that your postings could end up going to the wrong account code. You would want the Xeroco Fuel bank rule to sit above the Xeroco rule. Because they both contain the same description, if Xeroco was first, it would always trigger and everything would get posted to stationery, including the petrol. If you set Xeroco Fuel as the first bank rule to run if the bank statement line does not contain both words, it will continue and then run the Xeroco rule, which would prove successful. Gas will get posted to fuel and stationery will get posted to stationery. You can drag and drop bank rules to change the order in which they run. Hover over the two dots to the left of the bank rule number and you can drag them to the appropriate position. Bank reconciliation Bank reconciliation is one of the main drivers in knowing when your books and records are up-to-date. The introduction of automated bank feeds has revolutionized the way in which we can complete a bank reconciliation, which is the process of matching what has gone through the bank account and what has been posted in your accounting system. Below are some ways to utilize all the functionality in the Xero toolkit. Auto Suggest Xero is an intuitive system; it learns how you process transactions and is also able to make suggestions based on what you have posted. As shown below, Xero has found a match for the bank statement line on the left, which is why the right-hand panel is now green and you can see the OK button to reconcile the transaction, provided you are happy it is the correct selection. You can choose to turn Auto Suggest off. At the bottom of each page in the bank screen, you will find a checkbox, as shown in the following screenshot. Simply uncheck the Suggest previous entries box to turn it off. The more you use Xero, the better it learns, so we would advise sticking with it. It is not a substitute for checking, however, so please check before hitting the OK button. Find & Match When Xero cannot find a match and you know the bank statement line in question probably has an associated invoice or bill in the system, you can use Find & Match in the upper right-hand corner, as shown in the following screenshot: You can look through the list of unreconciled bank transactions shown in the panel or you can opt to use the Search facility and search by name, reference, or amount. In this example, you can see that we have now found two transactions from SMART Agency that total the £4,500 spent. As you can see in the following screenshot, there is also an option next to the monetary amounts that will allow you to split the transaction. This is useful if someone has not paid the invoice in full. If the amount received was only £2,500 in total, for example, you could use Split to allocate £1,000 against the first transaction and £1,500 against the second transaction. When checked off, these turn green, and you can click on OK to reconcile that bank statement line. If you cannot find a match, you will need to investigate what it relates to and if you are missing some paperwork. If we had been clever when making the original supplier payment, we could have used the Batch Payment option in Accounts |Purchases |Awaiting Payment, checking off the items that make up the amount paid, and Batch Payment would have enabled us to tell Xero that there was a payment made totaling £4,500. Auto Suggest would have picked this up, making the reconciliation easier and quicker. You have already done the hard bit by working out how much to pay suppliers; you don't want to have to do it again when an amount comes through the bank and you can't remember what it was for. It is also good practice, as it means you will not inadvertently pay the same supplier again since the bill will be marked as paid. The same can be done for customer invoices using the Deposit button. This is very helpful when receiving check deposits or remittance advice well in advance of the actual receipt. By marking the invoices as paid, you will not chase customers for money, unnecessarily causing bad feelings along the way. Create There will be occasions when you will not have a bill or invoice in Xero from which to reconcile the bank statement line. In these situations, you will need to create a transaction to clear the bank statement line. In the example below, you can see that we have entered who the contact is, what the cost relates to, and added why we spent the money. Xero will now allow us to clear the bank statement line, as the OK button is visible. We would suggest that this option be used sparingly, as you should have paperwork posted into Xero in the form of a bill or invoice to deal with the majority of your bank statement lines. Bank transfers When you receive money in or transfer money out to another bank account, you have set up within Xero a very simple way to deal with those transactions. Click on the Transfer tab and choose the bank account from the dropdown. You will then be able to reconcile that bank statement line. In the account that you have made the transfer to, you will find that Xero will make the auto suggest for you when you reconcile that bank account. Discuss and comments When you are performing the bank reconciliation, you may find that you get stuck and you genuinely do not know what to do with it. This is where the Discuss tab, shown in the following screenshot, can help: You can simply enter a note to yourself for someone else in the business or for your advisor to take a look at. Don't forget to click on Save when you are done. If someone can answer your query, they can then enter their comment in the Discuss tab and save it. Note that at present there is no notification process when you save a comment in the Discuss tab, so you are reliant on someone regularly checking it. You will see something similar to the note underneath the business name when you log in to Xero, so you can see there is a comment that needs action. Reconciliation Report This is the major tool in your armory to check whether your bank reconciles. There is no greater feeling in life than your bank account reconciling and there being no unpresented items left hanging around. To run the report from within a bank account, click on the Reconciliation Report button next to the Manage Account button. From here, you can choose which bank account you wish to run the report from, so you do not need to keep moving between the accounts, and also a date option as you will probably want to run the report to various dates, especially if you encounter a problem. On the reconciliation report, you will see the balance in Xero, which is the cashbook balance (that is what would be in the bank if all the outstanding payments and receipts posted in Xero cleared and all the bank statement lines that have come from the bank feed were processed). The outstanding payments and receipts are invoices and bills you have marked as paid in Xero but have not been matched to a bank statement line yet. You need to keep an eye on these, as older unreconciled items would normally indicate a misallocation or uncleared item. Plus Un-Reconciled Bank Statement Lines are those items that have come through on a feed but have not yet been allocated to something in Xero. This might mean that there are missing bills or sales invoices in Xero, for example. Statement Balance is the number that should match what is on your online or paper statement, whether it is a bank account, credit card, or PayPal account. If the figures do not match, then it will need investigating. In the next section, we have highlighted some of the things that may have caused the imbalance and some ideas of what to do to rectify the situation. Manual reconciliation If you are unable to set up a bank feed or import a statement, then you can still reconcile bank accounts in Xero; it just feels a bit like going back in time. To complete a manual reconciliation, you will need to follow the same process as used to process petty cash, as discussed earlier in this article. Common errors and corrections Despite all the innovation and technological advances Xero has made in this area, there are still things that can go wrong—some human, some machine. The main thing is to recognize this and know how to deal with it in the event that it happens. We have highlighted some of the more common issues and resolutions in the following subsections. Account not reconciling There is no greater feeling than when you get that big green checkmark telling you that you have reconciled all your transactions. Fantastic job done, you think! But not quite. You need to check your bank statement, credit card statement, loan statement, or PayPal statement to make sure it definitely matches as per the preceding bank reconciliation report section. The job's not done until you have completed the manual check. Duplicated statement lines With all things technology, there is a chance that things can go wrong, and every now and again, you may find that your bank feed has duplicated a line item. This is why it is so important to check the actual statement against that in Xero. It is the only way to truly know if the accounts reconcile. A direct bank feed that costs money is deemed to be more robust, and some feeds through Yodlee are better than others. It all depends on your bank, so it is worth checking with the Xero community to get some guidance from fellow Xeroes. If you are convinced that you have a duplicated bank statement line, you can choose to delete it by finding the offending item in your account and clicking on the cross in the top left-hand corner of the bank statement line. When you hover over the cross, the item will turn red. Use this sparingly and only when you know you have a duplicated line. Missing statement lines As with duplicated statement lines, there is also the possibility of a bank statement line not being synced, and this can be picked up when checking that the Xero bank account figure matches that of your online or paper bank statement. If they do not match, then we recommend using the reconciliation report and working backwards month by month and then week by week to try and isolate the date at which the bank last reconciled. Once you have narrowed it down to a week, you can then start doing it day by day until you find the date, and then check off the items in Xero against those on your bank statement until you find the missing items. If there are several missing items, we would probably suggest doing an import via OFX, QIF, or CSV, but if there are only a few, then it would probably be best to enter them manually and then mark them as reconciled so the bank will reconcile. Remove & Redo We know you are great at what you do, but everyone has an off day. If you have allocated something incorrectly, you can easily amend it. auto suggest is fantastic, but you may get carried away and just keep hitting that the OK button without paying enough attention. This is particularly problematic for businesses dealing with lots of invoices for similar amounts. If you do find that you have made an error, then you can remove the original allocation and redo it. You can do this by going to Accounts | Bank Accounts | Manage Account | Reconcile Account | Account Transactions. As you can see in the following screenshot, once you have found the offending item, you can check it off and then click on Remove & Redo. You will also find on the right-hand side a Search button, which will allow you to search for particular transactions rather than having to scroll through endless pages. If you happen to be in the actual bank transaction when you identify a problem, then you can click on Options | Remove & Redo. This will then push the transaction back in to the bank account screen to reconcile again. Note that if you Remove & Redo a manually entered bank transaction, it will not reappear in the bank account for you to reconcile, as it was never there in the first place. What you will need to do is post the payment or receipt against the correct invoice or bill, and then manually mark it as reconciled again or create another spend or receive money transaction. Manually marked as reconciled A telltale sign that someone has inadvertently caused a problem is when you look at the Account Transactions tab in the bank account and there is a sea of green and reconciled statuses, and then you spot the odd black reconciled status. This is an indication that something has been posted to that account and marked as reconciled manually. This will need investigating, as it may be genuine, such as some missing bank statement lines, or it could be that someone has made a mistake and it needs to be removed. Understanding reports If the bank account does not reconcile and it is not something obvious, then we would suggest looking at the bank statements imported into Xero to see if there are any obvious problems. Go to Accounts | Bank Accounts | Manage Account | Bank Statements. From this screen, have a look to see if there is any overlap of dates imported and then drill into the statements to check for anything that doesn't look right. If you do come across duplicated lines, you can remove them by checking off the box on the left and then clicking on the Delete button. You can see below that the bank statement line has been grayed out and has a status of Deleted next to it. If you later discover that you have made a mistake, then you can restore the bank statement line by checking off the box again but clicking on Restore this time. If you have incorrectly imported a duplicate statement or the bank feed has done so rather than deleting the transactions, you can choose to delete the entire statement. This can be achieved by clicking on the Delete Entire Statement button at the bottom-left of the Bank Statements screen: Make sure you have checked the Also delete reconciled transactions for this statement option before clicking Delete. If you are deleting the statement because it is incorrect, it only makes sense that you also clear any transactions associated with this statement to avoid further issues. Summary We have successfully added bank feeds that are now ready for automating the bank reconciliation process. In this article, we ran through the major bank functions and set up your bank feeds, exploring how to set up bank rules to make the bank reconciliation task even easier and quicker. On top of that, we also explored the possibilities of what could go wrong, but more importantly, how to identify errors and put them right. One of the biggest bookkeeping tasks you will undertake should now seem a lot easier. Resources for Article:   Further resources on this subject: Probability of R? [article] Dealing with a Mess [article] Essbase ASO (Aggregate Storage Option) [article]
Read more
  • 0
  • 0
  • 4591

article-image-debugging-scheduler-oracle-11g-databases
Packt
08 Oct 2009
5 min read
Save for later

Debugging the Scheduler in Oracle 11g Databases

Packt
08 Oct 2009
5 min read
Unix—all releases Something that has not been made very clear in the Oracle Scheduler documentation is that redirection cannot be used in jobs (<, >, >>, |, &&, ||). Therefore, many developers have tried to use it. So, let's keep in mind that we cannot use redirection, not in 11g as well as older releases of the database. The scripts must be executable, so don't forget to set the execution bits. This might seem like knocking down an open door, but it's easily forgotten. The user (who is the process owner of the external job and is nobody:nobody by default) should be able to execute the $ORACLE_HOME/bin/extjob file. In Unix, this means that the user should have execution permissions on all the parent directories of this file. This is not something specific to Oracle; it's just the way a Unix file system works. Really! Check it out. Since 10gR1, Oracle does not give execution privileges to others. A simple test for this is to try starting SQL*Plus as a user who is neither the Oracle installation user, nor a member of the DBA group—but a regular user. If you get all kinds of errors, then it implies that the permissions are not correct, assuming that the environment variables (ORACLE_HOME and PATH) are set up correctly. The $ORACLE_HOME/install/changePerm.sh script can fix the permissions within ORACLE_HOME (for 10g). In Oracle 11g, this again changed and is no longer needed. The Scheduler interprets the return code of external jobs and records it in the *_scheduler_job_run_details view. This interpretation can be very misleading, especially when using your own exit codes. For example, when you code your script to check the number of arguments, and code an exit 1 when you find the incorrect number of arguments, the error number is translated to ORA-27301: OS failure message:No such file or directory by Oracle using the code in errno.h. In 11g, the Scheduler also records the return code in the error# column. This lets us recognize the error code better and find where it is raised in the script that ran, when the error codes are unique within the script. When Oracle started with Scheduler, there were some quick changes. Here are the most important changes listed that could cause us problems when the definitions of the mentioned files are not exactly as listed: 10.2.0.1: $ORACLE_HOME/bin/extjob should be owned by the user who runs the jobs (process owner) and have 6550 permissions (setuid process owner). In a regular notation, that is what ls –l shows, and the privileges should be -r-sr-s---. 10.2.0.2: $ORACLE_HOME/rdbms/admin/externaljob.ora should be owned by root. This file is owned by the Oracle user (the user who installed Oracle) and the Oracle install group with 644 permissions or -rw-r—-r--, as shown by ls –l. This file controls which operating system user is going to be the process owner, or which user is going to run the job. The default contents of this file are as shown in the following screenshot: $ORACLE_HOME/bin/extjob must be the setuid root (permissions 4750 or -rwsr-x---) and executable for the Oracle install group, where the setuid root means that the root should be the owner of the file. This also means that while executing this binary, we temporarily get root privileges on the system. $ ORACLE_HOME/bin/extjobo should have normal 755 or -rwxr-xr-x permissions,and be owned by the normal Oracle software owner and group. If this file is missing, just copy it from $ORACLE_HOME/bin/extjob. On AIX, this is the first release that has external job support. 11g release: I n 11g, the same files as in 10.2.0.2 exist with the same permissions. But $ORACLE_HOME/bin/jssu is owned by root and the Oracle install group with the setuid root (permissions 4750 or -rwsr-x---). It is undoubtedly best to stop using the old 10g external jobs and migrate to the 11g external jobs with credentials as soon as possible. The security of the remote external jobs is better because of the use of credentials instead of falling back to the contents of a single file in $ORACLE_HOME/, and the flexibility is much better. In 11g, the process owner of the remote external jobs is controlled by the credential and not by a file. Windows usage On Windows, this is a little easier with regard to file system security. The OracleJobscheduler service must exist in a running state, and the user who runs this service should have the Logon as batch job privilege. A .batfile cannot be run directly, but should be called as an argument of cmd.exe, for example: --/BEGINDBMS_SCHEDULER.create_job(job_name => 'env_windows',job_type => 'EXECUTABLE',number_of_arguments => 2,job_action => 'C:windowssystem32cmd.exe',auto_drop => FALSE,enabled => FALSE);DBMS_SCHEDULER.set_job_argument_value('env_windows',1,'/c');DBMS_SCHEDULER.set_job_argument_value('env_windows',2,'d:temptest.bat');end;/ This job named env_windows calls cmd.exe, which eventually runs the script named test.bat that we created in d:temp. When the script we want to call needs arguments, they should be listed from argument number 3 onwards.
Read more
  • 0
  • 0
  • 4590
Modal Close icon
Modal Close icon