
How-To Tutorials - Data

1210 Articles

Security considerations

Packt
21 Nov 2013
9 min read
Security considerations

One general piece of advice that applies to every type of application development is to build the software with security in mind from the start. Implementing the required features first and only later modifying them to enforce security is more expensive and more error-prone; the two should be done simultaneously. In this article we raise security awareness and then look at which measures we can apply in order to have more secure applications.

Use TLS

TLS (Transport Layer Security) is the cryptographic protocol that resulted from the standardization of Netscape's proprietary SSL protocol (Version 3.0). As a result, various documents and specifications use the terms TLS and SSL interchangeably, even though the protocols actually differ. From a security standpoint, it is recommended that all requests sent from the client during the execution of a grant flow are made over TLS; in fact, it is recommended that TLS be used on both sides of the connection. OAuth 2.0 relies heavily on TLS to maintain the confidentiality and integrity of the data exchanged over the network between the client and the server. In contrast, OAuth 1.0 did not mandate TLS, and parts of the authorization flow (on both the server side and the client side) had to deal with cryptography themselves, which resulted in implementations of varying quality, some good and some sloppy.

When we make an HTTP request (for example, in order to execute an OAuth 2.0 grant flow), the HTTP client library used to execute the request has to be configured to use TLS in order to make the connection secure. TLS should be used by the client application when sending requests to both the authorization and resource servers, and by the servers themselves, resulting in an end-to-end TLS-protected connection. If end-to-end protection cannot be established, it is advisable to reduce the scope and lifetime of the access tokens issued by the authorization server.

The OAuth 2.0 specification states that the use of TLS is mandatory when sending requests to the authorization and token endpoints and when sending requests using password authentication. Access tokens, refresh tokens, username and password combinations, and client credentials must be transmitted over TLS. With TLS in place, attackers trying to intercept or eavesdrop on the information exchanged during the grant flow will not be able to do so; without it, they can capture an access token, an authorization code, a username and password combination, or other critical information. The use of TLS therefore prevents man-in-the-middle attacks and the replaying of already fulfilled requests (replay attacks). By replaying requests, attackers can issue themselves new access tokens or replay requests against resource servers and modify or delete data belonging to the resource owner. Last but not least, the authorization server can enforce the use of TLS on every endpoint in order to reduce the risk of phishing attacks.
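To make the point about configuring the HTTP client concrete, here is a minimal sketch, in Python and using only the standard library, of sending a token request over a verified TLS connection; the host name, endpoint path, authorization code, and redirect URI are placeholder values assumed for illustration, not details from the original text.

import http.client
import ssl
import urllib.parse

# The default context verifies the server certificate and host name;
# we additionally refuse anything older than TLS 1.2.
ctx = ssl.create_default_context()
ctx.minimum_version = ssl.TLSVersion.TLSv1_2

conn = http.client.HTTPSConnection("auth.example.com", context=ctx)
body = urllib.parse.urlencode({
    "grant_type": "authorization_code",
    "code": "placeholder-authorization-code",
    "redirect_uri": "https://client.example.com/callback",
})
conn.request("POST", "/oauth/token", body,
             {"Content-Type": "application/x-www-form-urlencoded"})
response = conn.getresponse()
print(response.status, response.read().decode())

Because the connection is rejected when certificate validation fails, an eavesdropper cannot silently intercept or downgrade the token exchange, which is exactly the guarantee the grant flow relies on.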
Ensure web server application protection

For client applications that are actually web applications deployed on a server, there are numerous protection measures that can be taken so that the server, the database, and the configuration files are kept safe. The list is not exhaustive and can vary between scenarios and environments; some of the key measures are as follows:

- Install the recommended security additions and tools for the web and database servers in use.
- Restrict remote administrator access to the people who require it (for example, for server maintenance and application monitoring).
- Regulate which server user can have which roles, and regulate permissions for the resources available to them.
- Disable or remove unnecessary services on the server.
- Regulate the database connections so that they are only available to the client application.
- Close unnecessary open ports on the server; leaving them open can give an advantage to the attacker.
- Configure protection against SQL injection.
- Configure database and file encryption for vital information stored (credentials and so on). Avoid storing credentials in plain text format.
- Keep the software components that are in use updated in order to avoid security exploitation.
- Avoid security misconfiguration.

It is important to keep in mind what kind of web server it is, which database is used, which modules the client application uses, and on which services the client application depends, so that we can research how to apply the security measures appropriately. OWASP (Open Web Application Security Project) provides additional documentation on security measures and describes the industry's best practices regarding software security. It is an additional resource recommended for reference and research on this topic, and can be found at https://www.owasp.org.

Ensure mobile and desktop application protection

Mobile and desktop applications can be installed on devices and machines that are part of internal/enterprise or external environments. They are more vulnerable than applications deployed on regulated server environments, and attackers have a better chance of extracting the source code and other data that ships with them. In order to provide the best possible security, some of the key measures are as follows (a minimal storage-encryption sketch follows this list):

- Use secure storage mechanisms provided by additional programming libraries and by features offered by the operating system for which the application is developed.
- In multiuser operating systems, store user-specific data such as credentials or access and refresh tokens in locations that are not available to other users on the same system. As mentioned previously, credentials should not be stored in plain text format and should be encrypted.
- If using an embedded database (such as SQLite in most cases), enforce security measures against SQL injection and encrypt the vital information (or encrypt the whole embedded database).
- For mobile devices, advise the end user to use the device lock (usually a PIN, password, or face unlock).
- Implement an optional PIN or password lock at the application level that the end user can activate if desired (which can also serve as an alternative to the previous locking measure).
- Sanitize and validate the values from any input fields used in the application, in order to avoid code injection, which can lead to changed behavior or exposure of data stored by the client application.
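As a small illustration of the encrypted-storage advice in the list above, the following Python sketch uses the third-party cryptography package to keep a refresh token on disk in encrypted form; the file names, and the simplification of holding the key in a local file instead of the operating system's secure storage, are assumptions made for the example only.

from pathlib import Path
from cryptography.fernet import Fernet

KEY_FILE = Path("app_secret.key")      # ideally held in the OS-provided secure storage
TOKEN_FILE = Path("refresh_token.enc")

def save_refresh_token(token: str) -> None:
    # Create the key once and reuse it; the token itself is never written in plain text.
    if not KEY_FILE.exists():
        KEY_FILE.write_bytes(Fernet.generate_key())
    fernet = Fernet(KEY_FILE.read_bytes())
    TOKEN_FILE.write_bytes(fernet.encrypt(token.encode()))

def load_refresh_token() -> str:
    fernet = Fernet(KEY_FILE.read_bytes())
    return fernet.decrypt(TOKEN_FILE.read_bytes()).decode()

On a real mobile or desktop deployment the key would normally live in the OS-provided secure storage mentioned in the list above, so that other users and applications on the same machine cannot read it.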
When the application is ready to be packaged for production use (to be used by end users), perform code analysis to obfuscate the code and remove unused code. This produces a client application that is smaller in file size, performs the same, and is harder to reverse engineer. As usual, for additional reference and research we can refer to the OAuth 2.0 threat model RFC document, to OWASP, and to security documentation specific to the programming language, tools, libraries, and operating system that the client application is built for.

Utilize the state parameter

As mentioned, this parameter maintains state between the request and the callback. Even though it is optional, it is highly advisable to use it and to validate that the value in the callback response is equal to the one that was sent. When setting the value for the state parameter in the request:

- Don't use predictable values that can be guessed by attackers.
- Don't repeat the same value often between requests.
- Don't use values that can contain and expose some internal business logic of the system and could be used maliciously if discovered.
- Use session values: if the user agent (with which the user has authenticated and approved the authorization request) has its session cookie available, calculate a hash from it and use that as the state value.
- Or use a string generator: if a session variable is not available, we can use a programmatically generated value. Some real-world implementations do this by generating unique identifiers and using them as state values, commonly by generating a random UUID (universally unique identifier) and converting it to a hexadecimal value.
- Keep track of which state value was set for which request (user session in most cases) and redirect URI, in order to validate that the returned response contains an equal value.

Use refresh tokens when available

For client applications that have obtained an access token and a refresh token along with it, upon access token expiry it is good practice to request a new access token by using the refresh token instead of going through the whole grant flow again. With this measure we transmit less data over the network and give an attacker less traffic to monitor.

Request the needed scope only

As briefly mentioned previously in this article, it is highly advisable to specify only the required scope when requesting an access token, instead of requesting the maximum scope available. With this measure, if an attacker gets hold of the access token, he can take damaging actions only up to the level specified by the scope, and not more. This minimizes the damage until the token is revoked and invalidated.

Summary

In this article we learned what data is to be protected, what features OAuth 2.0 contains regarding information security, and which precautions we should take into consideration.

Resources for Article:

Further resources on this subject:

Deploying a Vert.x application [Article]
Building tiny Web-applications in Ruby using Sinatra [Article]
Fine Tune the View layer of your Fusion Web Application [Article]


Machine Learning Examples Applicable to Businesses

Packt
25 Nov 2014
7 min read
The purpose of this article by Michele Usuelli, author of the book R Machine Learning Essentials, is to show how machine learning helps in solving a business problem.

Predicting the output

The past marketing campaign targeted part of the customer base. Among another 1,000 clients, how do we identify the 100 that are keener to subscribe? We can build a model that learns from the data and estimates which clients are more similar to the ones that subscribed in the previous campaign. For each client, the model estimates a score that is higher if the client is more likely to subscribe. There are different machine learning models for determining the scores, and we use two well-performing techniques, as follows:

- Logistic regression: a variation of linear regression used to predict a binary output
- Random forest: an ensemble based on decision trees that works well in the presence of many features

In the end, we need to choose one of the two techniques. Cross-validation methods allow us to estimate model accuracy; starting from that, we can measure the accuracy of both options and pick the one that performs better. After choosing the most appropriate machine learning algorithm, we could optimize it further using cross-validation. However, in order to avoid overcomplicating the model building, we don't perform any feature selection or parameter optimization. These are the steps to build and evaluate the models:

Load the randomForest package containing the random forest algorithm:

library('randomForest')

Define the formula specifying the output and the variable names. The formula is in the format output ~ feature1 + feature2 + ...:

arrayFeatures <- names(dtBank)
arrayFeatures <- arrayFeatures[arrayFeatures != 'output']
formulaAll <- paste('output', '~')
formulaAll <- paste(formulaAll, arrayFeatures[1])
for(nameFeature in arrayFeatures[-1]){
  formulaAll <- paste(formulaAll, '+', nameFeature)
}
formulaAll <- formula(formulaAll)

Initialize the table containing all the testing sets:

dtTestBinded <- data.table()

Define the number of iterations:

nIter <- 10

Start a for loop:

for(iIter in 1:nIter){

Define the training and the test datasets:

indexTrain <- sample(
  x = c(TRUE, FALSE),
  size = nrow(dtBank),
  replace = T,
  prob = c(0.8, 0.2)
)
dtTrain <- dtBank[indexTrain]
dtTest <- dtBank[!indexTrain]

Select a subset of the test set in such a way that we have the same number of rows with output == 0 and output == 1. First, we split dtTest into two parts (dtTest0 and dtTest1) on the basis of the output and count the number of rows of each part (n0 and n1). Then, as dtTest0 has more rows, we randomly select n1 of them. In the end, we redefine dtTest by binding dtTest0 and dtTest1, as follows:

dtTest1 <- dtTest[output == 1]
dtTest0 <- dtTest[output == 0]
n0 <- nrow(dtTest0)
n1 <- nrow(dtTest1)
dtTest0 <- dtTest0[sample(x = 1:n0, size = n1)]
dtTest <- rbind(dtTest0, dtTest1)

Build the random forest model using randomForest. The formula argument defines the relationship between variables and the data argument defines the training dataset. In order to avoid overcomplicating the model, all the other parameters are left at their defaults:

modelRf <- randomForest(
  formula = formulaAll,
  data = dtTrain
)

Build the logistic regression model using glm, which is a function used to build Generalized Linear Models (GLM). GLMs are a generalization of linear regression and allow us to define a link function that connects the linear predictor with the outputs.
The input is the same as for the random forest (including data = dtTrain), with the addition of family = binomial(logit), which specifies that the regression is logistic:

modelLr <- glm(
  formula = formulaAll,
  data = dtTrain,
  family = binomial(logit)
)

Predict the output of the random forest. The function is predict and its main arguments are object, defining the model, and newdata, defining the test set, as follows:

dtTest[, outputRf := predict(object = modelRf, newdata = dtTest, type='response')]

Predict the output of the logistic regression, using predict in a similar way to the random forest. The other argument is type='response', which is necessary in the case of the logistic regression:

dtTest[, outputLr := predict(object = modelLr, newdata = dtTest, type='response')]

Add the new test set to dtTestBinded:

dtTestBinded <- rbind(dtTestBinded, dtTest)

End the for loop:

}

We built dtTestBinded, which contains the output column defining which clients subscribed and the scores estimated by the models. By comparing the scores with the real output, we can validate the model performance. In order to explore dtTestBinded, we can build a chart showing how the scores of the non-subscribing clients are distributed. Then, we add the distribution of the subscribing clients to the chart and compare them. In this way, we can see the difference between the scores of the two groups. Since we use the same chart for the random forest and for the logistic regression, we define a function that builds the chart, following the given steps:

Define the function and its input, which includes the data table and the name of the score column:

plotDistributions <- function(dtTestBinded, colPred){

Compute the distribution density for the clients that didn't subscribe. With output == 0, we extract the clients not subscribing, and using density, we define a density object. The adjust parameter defines the smoothing bandwidth, which is a parameter of the way we build the curve starting from the data. The bandwidth can be interpreted as the level of detail:

densityLr0 <- dtTestBinded[
  output == 0,
  density(get(colPred), adjust = 0.5)
]

Compute the distribution density for the clients that subscribed:

densityLr1 <- dtTestBinded[
  output == 1,
  density(get(colPred), adjust = 0.5)
]

Define the colors in the chart using rgb. The colors are transparent red and transparent blue:

col0 <- rgb(1, 0, 0, 0.3)
col1 <- rgb(0, 0, 1, 0.3)

Build the plot with the density of the clients not subscribing. Here, polygon is a function that adds the area to the chart:

plot(densityLr0, xlim = c(0, 1), main = 'density')
polygon(densityLr0, col = col0, border = 'black')

Add the clients that subscribed to the chart:

polygon(densityLr1, col = col1, border = 'black')

Add the legend:

legend(
  'top',
  c('0', '1'),
  pch = 16,
  col = c(col0, col1)
)

End the function:

return()
}

Now, we can use plotDistributions on the random forest output:

par(mfrow = c(1, 1))
plotDistributions(dtTestBinded, 'outputRf')

The resulting chart plots the score on the x-axis and the density on the y-axis; the density is proportional to the number of clients that subscribed with similar scores. Since we don't have a client for each possible score, assuming a level of detail of 0.01, the density curve is smoothed, in the sense that the density at each score is the average over the data with similar scores. The red and blue areas represent the non-subscribing and subscribing clients respectively, and the violet area comes from the overlap of the two curves.
For each score, we can identify which density is higher. If the higher curve is red, a client with that score is more likely to belong to the non-subscribing group, and vice versa. For the random forest, most of the non-subscribing client scores are between 0 and 0.2 and the density peak is around 0.05. The subscribing clients have more spread-out scores, generally higher, with a peak around 0.1. The two distributions overlap considerably, so it is not easy to identify which clients will subscribe from their scores alone. However, if the marketing campaign targets all customers with a score higher than 0.3, they will most likely belong to the blue cluster. In conclusion, using the random forest we are able to identify a small set of customers that are very likely to subscribe.

Summary

In this article, you learned how to predict an output using appropriate machine learning techniques.

Resources for Article:

Further resources on this subject:

Using R for Statistics, Research, and Graphics [article]
Machine Learning in Bioinformatics [article]
Learning Data Analytics with R and Hadoop [article]


Working with Import Process (Intermediate)

Packt
30 Aug 2013
5 min read
Getting ready

The first thing we need to do is download the latest version of Sqoop from http://www.apache.org/dist/sqoop/ and extract it on your machine. I will refer to the Sqoop installation directory as $SQOOP_HOME. The prerequisites for the Sqoop import process are:

- An installed and running relational database management system (MySQL).
- An installed and running Hadoop cluster.
- The $HADOOP_HOME environment variable set.

The common arguments of the import process are as follows:

- --connect <jdbc-uri>: Specifies the server or database to connect to, including the port. Example: --connect jdbc:mysql://host:port/databaseName
- --connection-manager <class-name>: Specifies the connection manager class name.
- --driver <class-name>: Specifies the fully qualified name of the JDBC driver class.
- --password <password>: Sets the authentication password required to connect to the input source.
- --username <username>: Sets the authentication username.

How to do it

Let's see how to work with the import process. First, we will import a single RDBMS table into Hadoop.

Query 1:

$ bin/sqoop import --connect jdbc:mysql://localhost:3306/db1 --username root --password password --table tableName --target-dir /user/abc/tableName

The output files in HDFS contain the rows of the table in comma-separated form. Next, we will look at how to import only selected rows and selected columns of an RDBMS table into Hadoop.

Query 2: Import selected columns

$ bin/sqoop import --connect jdbc:mysql://localhost:3306/db1 --username root --password password --table student --target-dir /user/abc/student --columns "student_id,address,name"

Query 3: Import selected rows

$ bin/sqoop import --connect jdbc:mysql://localhost:3306/db1 --username root --password password --table student --target-dir /user/abc/student --where 'student_id<100'

Query 4: Import selected columns of selected rows

$ bin/sqoop import --connect jdbc:mysql://localhost:3306/db1 --username root --password password --table student --target-dir /user/abc/student --columns "student_id,address,name" --where 'student_id<100'

How it works...

Now let's see how the preceding steps work.

Import single table: Apart from the common arguments of the import process explained previously, this part covers some other arguments that are required to import a table into the Hadoop Distributed File System:

- --table <table-name>: The name of the input table to fetch.
- --target-dir <dir>: The location of the output/target directory in HDFS.
- --direct: Use a non-JDBC-based access mechanism for faster database access.
- --options-file <file-path>: Command-line options that are common to most commands can be put in an options file for convenience.

Query 1 runs a MapReduce job and imports all the rows of the given table to HDFS (where /user/abc/tableName is the location of the output files). The records imported into HDFS preserve their original column order, which means that if the input table contains four columns A, B, C, and D, the content of the HDFS file will look like:

A1, B1, C1, D1
A2, B2, C2, D2

Import selected columns: By default, the import query selects all columns of the input table, but we can select a subset of columns by specifying a comma-separated list in the --columns argument. Query 2 fetches only three columns (student_id, address, and name) of the student table.
If the import query contains the --columns argument, then the order of the columns in the output files is the same as the order specified in the --columns argument. The output in HDFS will look like:

student_id, address, name
1, Delhi, XYZ
2, Mumbai, PQR
..........

If the query instead lists the columns in the order "address, name, student_id", then the output in HDFS will look like:

address, name, student_id
Delhi, XYZ, 1
Mumbai, PQR, 2
.............

Import selected rows: By default, all the rows of the input table are imported into HDFS, but we can control which rows are imported by using the --where argument in the import statement. Query 3 imports only those rows whose student_id column value is less than 100. Query 4 uses both the --columns and --where arguments in one statement; for Query 4, Sqoop internally generates a query of the form "select student_id, address, name from student where student_id<100".

There's more...

This section covers some more examples of the import process.

Import all tables: So far we have imported a single table into HDFS; this section introduces the import-all-tables tool, with which we can import a set of tables from an RDBMS into HDFS. The import-all-tables tool creates a separate directory in HDFS for each RDBMS table. The following are the mandatory conditions for the import-all-tables tool:

- All tables must have a single primary key column.
- You must intend to import all the columns of each table.
- No --where, --columns, or --query arguments are permitted.

Example, Query 5:

$ bin/sqoop import-all-tables --connect jdbc:mysql://localhost:3306/db1 --username root --password password

This query imports all tables (tableName and tableName1) of database db1 into HDFS, with one output directory per table.

Summary

We learned a lot in this article: how to import a single RDBMS table into HDFS, how to import selected columns and selected rows, and how to import a set of RDBMS tables.

Resources for Article:

Further resources on this subject:

Introduction to Logging in Tomcat 7 [Article]
Configuring Apache and Nginx [Article]
Geronimo Architecture: Part 2 [Article]

Working with axes (Should know)

Packt
30 Oct 2013
4 min read
Getting ready

We start with the same boilerplate that we used when creating basic charts.

How to do it...

The following code creates some sample data that grows exponentially. We then use the transform and tickSize settings on the Y axis to adjust how our data is displayed:

...
<script>
var data = [], i;
for (i = 1; i <= 50; i++) {
  data.push([i, Math.exp(i / 10, 2)]);
}
$('#sampleChart').plot(
  [ data ],
  {
    yaxis: {
      transform: function (v) {
        return v == 0 ? v : Math.log(v);
      },
      tickSize: 50
    }
  }
);
</script>
...

Flot draws a chart with a logarithmic Y axis, so that our exponential data is easier to read. Next, we use Flot's ability to display multiple axes on the same chart as follows:

...
var sine = [];
for (i = 0; i < Math.PI * 2; i += 0.1) {
  sine.push([i, Math.sin(i)]);
}
var cosine = [];
for (i = 0; i < Math.PI * 2; i += 0.1) {
  cosine.push([i, Math.cos(i) * 20]);
}
$('#sampleChart').plot(
  [
    { label: 'sine', data: sine },
    { label: 'cosine', data: cosine, yaxis: 2 }
  ],
  {
    yaxes: [ {}, { position: 'right' } ]
  }
);
...

Flot draws the two series overlapping each other. The Y axis for the sine series is drawn on the left by default and the Y axis for the cosine series is drawn on the right as specified.

How it works...

The transform setting expects a function that takes a value, which is the y coordinate of our data, and returns a transformed value. In this case, we calculate the logarithm of our original data value so that our exponential data is displayed on a linear scale. We also use the tickSize setting to ensure that our labels do not overlap after the axis has been transformed. The yaxis setting under the series object is a number that specifies which axis the series should be associated with. When we specify the number 2, Flot automatically draws a second axis on the chart. We then use the yaxes setting to specify that the second axis should be positioned on the right of the chart. In this case, the sine data ranges from -1.0 to 1.0, whereas the cosine data ranges from -20 to 20. The cosine axis is drawn on the right and is independent of the sine axis.

There's more...

Flot doesn't have a built-in ability to interact with axes, but it does give you all the information you need to construct a solution.

Making axes interactive

Here, we use Flot's getAxes method to add interactivity to our axes as follows:

...
var showFahrenheit = false,
  temperatureFormatter = function (val, axis) {
    if (showFahrenheit) {
      val = val * 9 / 5 + 32;
    }
    return val.toFixed(1);
  },
  drawPlot = function () {
    var plot = $.plot(
      '#sampleChart',
      [[[0, 0], [1, 3], [3, 1]]],
      {
        yaxis: {
          tickFormatter: temperatureFormatter
        }
      }
    );
    var plotPlaceholder = plot.getPlaceholder();
    $.each(plot.getAxes(), function (i, axis) {
      var box = axis.box;
      var axisTarget = $('<div />');
      axisTarget.
        css({
          position: 'absolute',
          left: box.left,
          top: box.top,
          width: box.width,
          height: box.height
        }).
        click(function () {
          showFahrenheit = !showFahrenheit;
          drawPlot();
        }).
        appendTo(plotPlaceholder);
    });
  };
drawPlot();
...

First, note that we use a different way of creating a plot. Instead of calling the plot method on a jQuery collection that matches the placeholder element, we use the plot method directly from the jQuery object. This gives us immediate access to the Flot object, which we use to get the axes of our chart.
You could have also used the following data method to gain access to the Flot object:

var plot = $('#sampleChart').plot(...).data('plot');

Once we have the Flot object, we use the getAxes method to retrieve a list of axis objects. We use jQuery's each method to iterate over each axis and we create a div element that acts as a target for interaction. We set the div element's CSS so that it has the same position and size as the axis' bounding box, and we attach an event handler to the click event before appending the div element to the plot's placeholder element. In this case, the event handler toggles a Boolean flag and redraws the plot. The flag determines whether the axis labels are displayed in Fahrenheit or Celsius, by changing the result of the function specified in the tickFormatter setting.

Summary

Now we will be able to customize a chart's axes, transform the shape of a graph by using a logarithmic scale, display multiple data series with their own independent axes, and make the axes interactive.

Resources for Article:

Further resources on this subject:

Getting started with your first jQuery plugin [Article]
OpenCart Themes: Styling Effects of jQuery Plugins [Article]
The Basics of WordPress and jQuery Plugin [Article]


Obtaining a binary backup

Packt
04 Apr 2013
6 min read
Getting ready

Next we need to modify the postgresql.conf file so that our database runs in the proper mode for this type of backup. Change the following configuration variables:

wal_level = archive
max_wal_senders = 5

Then we must allow a superuser to connect to the replication database, which is used by pg_basebackup. We do that by adding the following line to pg_hba.conf:

local replication postgres peer

Finally, restart the database instance to commit the changes.

How to do it...

Though it is only one command, pg_basebackup requires at least one switch to obtain a binary backup. Execute the following command to create the backup in a new directory named db_backup:

$> pg_basebackup -D db_backup -x

How it works...

In PostgreSQL, WAL stands for Write Ahead Log. By changing wal_level to archive, those logs are written in a format compatible with pg_basebackup and other replication-based tools. By increasing max_wal_senders from the default of zero, the database will allow tools to connect and request data files; in this case, up to five streams can request data files simultaneously. This maximum should be sufficient for all but the most advanced systems.

The pg_hba.conf file is essentially a connection access control list (ACL). Since pg_basebackup uses the replication protocol to obtain data files, we need to allow local connections to request replication. Next, we send the backup itself to a directory (-D) named db_backup. This directory will effectively contain a complete copy of the binary files that make up the database. Finally, we added the -x flag to include the transaction logs (xlogs), which the database will require to start if we want to use this backup. When we get into more complex scenarios, we will exclude this option, but for now it greatly simplifies the process.

There's more...

The pg_basebackup tool is actually fairly complicated. There is a lot more involved under the hood.

Viewing backup progress

For manually invoked backups, we may want to know how long the process might take and its current status. Luckily, pg_basebackup has a progress indicator, which we enable with the following command:

$> pg_basebackup -P -D db_backup

Like many of the other switches, -P can be combined with the tape archive format, standalone backups, database clones, and so on. This is clearly not necessary for automated backup routines, but it can be useful for one-off backups monitored by an administrator.

Compressed tape archive backups

Many binary backup files come in the TAR (Tape Archive) format, which we can activate using the -F flag and setting it to t for TAR. Several Unix backup tools can directly process this type of backup, and most administrators are familiar with it. If we want compressed output, we can set the -z flag, which is especially useful in the case of large databases. For our sample database, we should see almost a 20x compression ratio. Try the following command:

$> pg_basebackup -Ft -z -D db_backup

The backup file itself will be named base.tar.gz within the db_backup directory, reflecting its status as a compressed tape archive. If the database contains extra tablespaces, each becomes a separate compressed archive. Each file can be extracted to a separate location, such as a different set of disks, for very complicated database instances. For the sake of this example, we ignored the possible presence of tablespaces other than the pg_default tablespace included in every installation. User-created tablespaces will greatly complicate your backup process.
Making the backup standalone

By specifying -x, we tell the database that we want a "complete" backup. This means we could extract or copy the backup anywhere and start it as a fully qualified database. As mentioned before, the flag means that we want to include the transaction logs, which are how the database recovers from crashes, checks integrity, and performs other important tasks. The following is the command again, for reference:

$> pg_basebackup -x -D db_backup

When combined with the TAR output format and compression, standalone binary backups are perfect for archiving to tape for later retrieval, as each backup is compressed and self-contained. By default, pg_basebackup does not include transaction logs, because many (possibly most) administrators back these up separately. These files have multiple uses, and putting them in the basic backup would duplicate effort and make backups larger than necessary. We include them at this point because it is still too early for such complicated scenarios. We will get there eventually, of course.

Database clones

Because pg_basebackup operates through PostgreSQL's replication protocol, it can execute remotely. For instance, if the database were on a server named Production and we wanted a copy on a server named Recovery, we could execute the following command from Recovery:

$> pg_basebackup -h Production -x -D /full/db/path

For this to work, we would also need this line in pg_hba.conf for the Recovery host:

host replication postgres Recovery trust

Though we set the authentication method to trust, this is not recommended for a production server installation. However, it is sufficient to allow Recovery to copy all data from Production. With the -x flag, it also means that the database can be started and kept online in case of emergency. It is both a backup and a running server.

Parallel compression

Compression is very CPU intensive, but there are some utilities capable of threading the process. Tools such as pbzip2 or pigz can do the compression instead. Unfortunately, this only works in the case of a single tablespace (the default one; if you create more, this will not work). The following is the command for compression using pigz:

$> pg_basebackup -Ft -D - | pigz -j 4 > db_backup.tar.gz

It uses four threads of compression and sets the backup directory to standard output (-) so that pigz can process the output itself.

Summary

In this article we saw the process of obtaining a binary backup. Though this process is more complex and tedious, it is also much faster.

Further resources on this subject:

Introduction to PostgreSQL 9
Backup in PostgreSQL 9
Recovery in PostgreSQL 9


Author Podcast - Ronald Rood discusses the birth of Oracle Scheduler

Packt
05 Aug 2010
1 min read
You can listen to the podcast here, or hit play in the media player below.

Diving into Data – Search and Report

Packt
17 Oct 2016
11 min read
In this article by Josh Diakun, Paul R Johnson, and Derek Mock, authors of the book Splunk Operational Intelligence Cookbook - Second Edition, we will cover the basic ways to search data in Splunk and how to make raw event data readable.

The ability to search machine data is one of Splunk's core functions, and it should come as no surprise that many other features and functions of Splunk are heavily driven off searches. Everything from basic reports and dashboards to data models and fully featured Splunk applications is powered by Splunk searches behind the scenes. Splunk has its own search language known as the Search Processing Language (SPL). SPL contains hundreds of search commands, most of which also have several functions, arguments, and clauses. While a basic understanding of SPL is required in order to search your data in Splunk effectively, you are not expected to know all the commands! Even the most seasoned ninjas do not know all the commands and regularly refer to the Splunk manuals, website, or Splunk Answers (http://answers.splunk.com). To get you on your way with SPL, be sure to check out the search command cheat sheet and download the handy quick reference guide available at http://docs.splunk.com/Documentation/Splunk/latest/SearchReference/SplunkEnterpriseQuickReferenceGuide.

Searching

Searches in Splunk usually start with a base search, followed by a number of commands that are delimited by one or more pipe (|) characters. The result of a command or search to the left of the pipe is used as the input for the next command to the right of the pipe. Multiple pipes are often found in a Splunk search to continually refine data results as needed. As we go through this article, this concept will become very familiar to you. Splunk allows you to search for anything that might be found in your log data. For example, the most basic search in Splunk might be a search for a keyword such as error or an IP address such as 10.10.12.150. However, searching for a single word or IP over the terabytes of data that might potentially be in Splunk is not very efficient. Therefore, we can use SPL and a number of Splunk commands to really refine our searches. The more refined and granular the search, the faster it runs and the quicker you get to the data you are looking for! When searching in Splunk, try to filter as much as possible before the first pipe (|) character, as this will save CPU and disk I/O. Also, pick your time range wisely. Often, it helps to run the search over a small time range when testing it and then extend the range once the search provides what you need.

Boolean operators

There are three different types of Boolean operators available in Splunk: AND, OR, and NOT. Case sensitivity is important here, and these operators must be in uppercase to be recognized by Splunk. The AND operator is implied by default and is not needed, but it does no harm if used. For example, searching for error OR success would return all the events that contain either the word error or the word success. Searching for error success would return all the events that contain the words error and success; another way to write this is error AND success. Searching web access logs for error OR success NOT mozilla would return all the events that contain either the word error or success, but not those events that also contain the word mozilla.
Common commands

There are many commands in Splunk that you will likely use on a daily basis when searching data within Splunk. These common commands are outlined in the following list:

- chart/timechart: Outputs results in a tabular and/or time-based format for use by Splunk charts.
- dedup: De-duplicates results based upon specified fields, keeping the most recent match.
- eval: Evaluates new or existing fields and values. There are many different functions available for eval.
- fields: Specifies the fields to keep or remove in search results.
- head: Keeps the first X (as specified) rows of results.
- lookup: Looks up fields against an external source or list, to return additional field values.
- rare: Identifies the least common values of a field.
- rename: Renames fields.
- replace: Replaces the values of fields with another value.
- search: Permits subsequent searching and filtering of results.
- sort: Sorts results in either ascending or descending order.
- stats: Performs statistical operations on the results. There are many different functions available for stats.
- table: Formats the results into a tabular output.
- tail: Keeps only the last X (as specified) rows of results.
- top: Identifies the most common values of a field.
- transaction: Merges events into a single event based upon a common transaction identifier.

Time modifiers

The drop-down time range picker in the Graphical User Interface (GUI) to the right of the Splunk search bar allows users to select from a number of different preset and custom time ranges. However, in addition to using the GUI, you can also specify time ranges directly in your search string using the earliest and latest time modifiers. When a time modifier is used in this way, it automatically overrides any time range that might be set in the GUI time range picker. The earliest and latest time modifiers can accept a number of different time units: seconds (s), minutes (m), hours (h), days (d), weeks (w), months (mon), quarters (q), and years (y). Time modifiers can also make use of the @ symbol to round down and snap to a specified time unit. For example, searching for sourcetype=access_combined earliest=-1d@d latest=-1h will search all the access_combined events from midnight a day ago until an hour ago from now. Note that the snap (@) rounds down, so if it were 12 p.m. now, we would be searching from midnight a day and a half ago until 11 a.m. today.

Working with fields

Fields in Splunk can be thought of as keywords that have one or more values. These fields are fully searchable by Splunk. At a minimum, every data source that comes into Splunk will have the source, host, index, and sourcetype fields, but some sources might have hundreds of additional fields. If the raw log data contains key-value pairs or is in a structured format such as JSON or XML, then Splunk will automatically extract the fields and make them searchable. Splunk can also be told how to extract fields from the raw log data in the backend props.conf and transforms.conf configuration files. Searching for specific field values is simple. For example, sourcetype=access_combined status!=200 will search for events with a sourcetype field value of access_combined that have a status field with a value other than 200.
Splunk has a number of built-in, pre-trained sourcetypes that ship with Splunk Enterprise and that might work out of the box with common data sources. These are available at http://docs.splunk.com/Documentation/Splunk/latest/Data/Listofpretrainedsourcetypes. In addition, Technical Add-Ons (TAs), which contain event types and field extractions for many other common data sources such as Windows events, are available from the Splunk app store at https://splunkbase.splunk.com.

Saving searches

Once you have written a nice search in Splunk, you may wish to save it so that you can use it again at a later date or use it for a dashboard. Saved searches in Splunk are known as Reports. To save a search in Splunk, you simply click on the Save As button on the top right-hand side of the main search bar and select Report.

Making raw event data readable

When a basic search is executed in Splunk from the search bar, the search results are displayed in a raw event format by default. To many users, this raw event information is not particularly readable, and valuable information is often clouded by other less valuable data within the event. Additionally, if the events span several lines, only a few events can be seen on the screen at any one time. In this recipe, we will write a Splunk search to demonstrate how we can leverage Splunk commands to make raw event data readable, tabulating events and displaying only the fields we are interested in.

Getting ready

You should be familiar with the Splunk search bar and search results area.

How to do it...

Follow the given steps to search and tabulate the selected event data:

1. Log in to your Splunk server.
2. Select the Search & Reporting application from the drop-down menu located in the top left-hand side of the screen.
3. Set the time range picker to Last 24 hours and type the following search into the Splunk search bar:

index=main sourcetype=access_combined

Then, click on Search or hit Enter. Splunk will return the results of the search and display the raw search events under the search bar.

4. Rerun the search, but this time add the table command as follows:

index=main sourcetype=access_combined | table _time, referer_domain, method, uri_path, status, JSESSIONID, useragent

Splunk will now return the same number of events, but instead of presenting the raw events, the data will be in a nicely formatted table displaying only the fields we specified. This is much easier to read!

5. Save this search by clicking on Save As and then on Report. Give the report the name cp02_tabulated_webaccess_logs and click on Save. On the next screen, click on Continue Editing to return to the search.

How it works...

Let's break down the search piece by piece:

- index=main: All the data in Splunk is held in one or more indexes. While not strictly necessary, it is good practice to specify the index(es) to search, as this will ensure a more precise search.
- sourcetype=access_combined: This tells Splunk to search only the data associated with the access_combined sourcetype, which, in our case, is the web access logs.
- | table _time, referer_domain, method, uri_path, status, JSESSIONID, useragent: Using the table command, we take the result of the search to the left of the pipe and tell Splunk to return the data in a tabular format. Splunk will only display the fields specified after the table command in the table of results.

In this recipe, you used the table command. The table command can have a noticeable performance impact on large searches.
It should be used towards the end of a search, once all the other processing on the data by the other Splunk commands has been performed. The stats command is more efficient than the table command and should be used in place of table where possible. However, be aware that stats and table are two very different commands.

There's more...

The table command is very useful in situations where we wish to present data in a readable format. Additionally, tabulated data in Splunk can be downloaded as a CSV file, which many users find useful for offline processing in spreadsheet software or for sending to others. There are some other ways we can leverage the table command to make our raw event data readable.

Tabulating every field

Often, there are situations where we want to present every event within the data in a tabular format, without having to specify each field one by one. To do this, we simply use a wildcard (*) character as follows:

index=main sourcetype=access_combined | table *

Removing fields, then tabulating everything else

While tabulating every field using the wildcard (*) character is useful, you will notice that a number of Splunk internal fields, such as _raw, appear in the table. We can use the fields command before the table command to remove these fields as follows:

index=main sourcetype=access_combined | fields - sourcetype, index, _raw, source, date*, linecount, punct, host, time*, eventtype | table *

If we do not include the minus (-) character after the fields command, Splunk will keep the specified fields and remove all the other fields.

Summary

In this article, along with an introduction to Splunk, we covered how to make raw event data readable.

Resources for Article:

Further resources on this subject:

Splunk's Input Methods and Data Feeds [Article]
The Splunk Interface [Article]
The Splunk Web Framework [Article]


Reporting with Microsoft SQL

Packt
11 Mar 2014
7 min read
SQL Server 2012 Power View – Self-service Reporting

Self-service reporting is when business users have the ability to create personalized reports and analytical queries without requiring the IT department to get involved. There is some basic work that the IT department must do, namely creating the various data marts that the reporting tools will use and deploying those reporting tools. However, once that is done, IT is freed from creating reports and can work on other tasks. Instead, the people who know the data best, the business users, are able to build the reports.

Here is a typical scenario that occurs when a self-service reporting solution is not in place: a business user wants a report created, so they fill out a report request that gets routed to IT. The IT department is backlogged with report requests, so it takes them weeks to get back to the user. When they do, they interview the user to get more details about exactly what data the user wants on the report and how the report should look (the business requirements). The IT person may not know the data that well, so they have to be educated by the user on what the data means. This leads to mistakes in understanding what the user is requesting: the IT person may take away an incorrect assumption about what data the report should contain or how it should look. Then the IT person goes back and creates the report. A week or so goes by and he shows the user the report. Then they hear things from the user such as "that is not correct" or "that is not what I meant". The IT person fixes the report and presents it to the user once again. More problems are noticed, fixes are made, and this cycle is repeated four to five times before the report is finally up to the user's satisfaction. In the end, a lot of time has been wasted by the business user and the IT person, and the finished version of the report took far longer than it should have.

This is where a self-service reporting tool such as Power View comes in. It is so intuitive and easy to use that most business users can start developing reports with it with little or no training. The interface is so visually appealing that it makes report writing fun. This results in users creating their own reports, thereby empowering businesses to make timely, proactive decisions and explore issues much more effectively than ever before.

In this article, we will cover the major features and functions of Power View, including the setup, the various ways to start Power View, data visualizations, the user interface, data models, deploying and sharing reports, multiple views, chart highlighting, slicing, filters, sorting, exporting to PowerPoint, and finally, design tips. We will also talk about PowerPivot and the Business Intelligence Semantic Model (BISM). By the end of the article, you should be able to jump right in and start creating reports.

Getting started

Power View was first introduced as a new integrated reporting feature of SQL Server 2012 (Enterprise or BI Edition) with SharePoint 2010 Enterprise Edition. It has also been seamlessly integrated and built directly into Excel 2013, where it is available as an add-in that you can simply enable (although it is not possible to share Power View reports between SharePoint and Excel). Power View allows users to quickly create highly visual and interactive reports via a What You See Is What You Get (WYSIWYG) interface.
One example of a report you can build with Power View is a sales dashboard that includes various types of visualizations; another is a promotion dashboard that makes heavy use of slicers along with a bar chart and tables. We will start by discussing PowerPivot and BISM and will then go over the setup procedures for the two possible ways to use Power View: through SharePoint or via Excel 2013.

PowerPivot

It is important to understand what PowerPivot is and how it relates to Power View. PowerPivot is a data analysis add-on for Microsoft Excel. With it, you can mash large amounts of data together and then analyze and aggregate it all in one workbook, bypassing the Excel maximum worksheet size of one million rows. It uses a powerful data engine to analyze and query large volumes of data very quickly. There are many data sources that you can use to import data into PowerPivot. Once the data is imported, it becomes part of a data model, which is simply a collection of tables that have relationships between them. Since the data is in Excel, it is immediately available to PivotTables, PivotCharts, and Power View. PowerPivot is implemented in an application window separate from Excel that gives you the ability to do such things as insert and delete columns, format text, hide columns from client tools, change column names, and add images.

Once you complete your changes, you have the option of uploading (publishing) the PowerPivot workbook to a PowerPivot Gallery or document library (on a BI site) in SharePoint. A PowerPivot Gallery is a special type of SharePoint document library that provides document and preview management for published Excel workbooks that contain PowerPivot data. Publishing allows you to share the data model inside PowerPivot with others. To publish your PowerPivot workbook to SharePoint, perform the following steps:

1. Open the Excel file that contains the PowerPivot workbook.
2. Select the File tab on the ribbon.
3. If using Excel 2013, click on Save As, then click on Browse and enter the SharePoint location of the PowerPivot Gallery. If using Excel 2010, click on Save & Send, click on Save to SharePoint, and then click on Browse.
4. Click on Save and the file will be uploaded to SharePoint and immediately made available to others.

A Power View report can be built from the PowerPivot workbook in the PowerPivot Gallery in SharePoint or from the PowerPivot workbook in an Excel 2013 file.

Business Intelligence Semantic Model

Business Intelligence Semantic Model (BISM) is a new data model that was introduced by Microsoft in SQL Server 2012. It is a single unified BI platform that exposes one model for all end-user experiences. It is a hybrid model with two storage implementations: the multidimensional data model (formerly called OLAP) and the tabular data model, which uses the xVelocity engine (formerly called VertiPaq), both hosted in SQL Server Analysis Services (SSAS). The tabular data model provides the architecture and optimization in the same format as the data storage method used by PowerPivot, which uses an in-memory analytics engine to deliver fast access to tabular data. Tabular data models are built using SQL Server Data Tools (SSDT) and can be created from scratch or by importing a PowerPivot data model contained within an Excel workbook.
Once the model is complete, it is deployed to an SSAS server instance configured for tabular storage mode so that it is available for others to use. This provides a great way to create a self-service BI solution and then grow it into a department solution and then an enterprise solution, as follows:

- Self-service solution: A business user loads data into PowerPivot and analyzes the data, making improvements along the way.
- Department solution: The Excel file that contains the PowerPivot workbook is deployed to a SharePoint site used by the department (where the active data model actually resides in an SSAS instance and not in the Excel file). Department members use and enhance the data model over time.
- Enterprise solution: The PowerPivot data model from the SharePoint site is imported into a tabular data model by the IT department. Security is added and then the model is deployed to SSAS so that the entire company can use it.

Summary

In this article, we learned about the features of Power View and how it is an excellent tool for self-service reporting. We also talked about PowerPivot and how it relates to Power View.

Resources for Article:

Further resources on this subject:

Microsoft SQL Server 2008 High Availability: Installing Database Mirroring [Article]
Microsoft SQL Server 2008 - Installation Made Easy [Article]
Best Practices for Microsoft SQL Server 2008 R2 Administration [Article]

IBM Cognos 10 Business Intelligence

Packt
16 Jul 2012
7 min read
Introducing IBM Cognos 10 BI

Cognos Connection

In this recipe we will be exploring Cognos Connection, which is the user interface presented to the user when he/she logs in to IBM Cognos 10 BI for the first time. IBM Cognos 10 BI, once installed and configured, can be accessed through the Web using supported web browsers. For a list of supported web browsers, refer to the Installation and Configuration Guide shipped with the product.

Getting ready

As stated earlier, make sure that IBM Cognos 10 BI is installed and configured. Install and configure the GO Sales and GO Data Warehouse samples. Use the gateway URI to log on to the web interface called Cognos Connection.

How to do it...

To explore Cognos Connection, perform the following steps:

1. Log on to Cognos Connection using the gateway URI, which may be similar to http://<HostName>:<PortNumber>/ibmcognos/cgi-bin/cognos.cgi.
2. Take note of the Cognos Connection interface. It has the GO Sales and GO Data Warehouse samples visible.
3. Note the blue-colored folder icon shown in the preceding screenshot. It represents metadata model packages that are published to Cognos Connection using the Cognos Framework Manager tool. These packages have objects that represent business data objects, relationships, and calculations, which can be used to author reports and dashboards. Refer to the book IBM Cognos TM1 Cookbook by Packt Publishing to learn how to create metadata model packages.
4. From the toolbar, click on Launch. This will open a menu showing the different studios, each having different functionality, as shown in the following screenshot:
5. We will use Business Insight and Business Insight Advanced, which are the first two choices in the preceding menu. These are the two components used to create and view dashboards. For the other options, refer to the corresponding books by the same publisher; for instance, refer to the book IBM Cognos 8 Report Studio Cookbook to know more about creating and distributing complex reports. Query Studio and Analysis Studio are meant to provide business users with the facility to slice and dice business data themselves. Event Studio is meant to define business situations and corresponding actions.
6. Coming back to Cognos Connection, note that a yellow-colored folder icon represents a user-defined folder, which may or may not contain other published metadata model packages, reports, dashboards, and other content. In our case, we have a user-defined folder called Samples. This was created when we installed and configured the samples shipped with the product.
7. Click on the New Folder icon on the toolbar to create a user-defined folder. Other options are also visible here, for instance to create a new dashboard.
8. Click on the user-defined folder, Samples, to view its contents, as shown in the following screenshot:
9. As shown in the preceding screenshot, it has more such folders, each having its own content. The top part of the pane shows the navigation path. Let's navigate deeper into Models | Business Insight Samples to show some sample dashboards created using IBM Cognos Business Insight, as shown in the following screenshot:
10. Click on one of these links to view the corresponding dashboard. For instance, click on Sales Dashboard (Interactive) to view the dashboard, as shown in the following screenshot:
11. The dashboard can also be opened in the authoring tool, IBM Cognos Business Insight, by clicking on the icon on the extreme right in Cognos Connection.
It will show the same result as shown in the preceding screenshot. We will see the Business Insight interface in detail later in this article.

How it works...

Cognos Connection is the primary user interface that the user sees when he/she logs in for the first time. Business data first has to be identified and imported into the metadata model using the Cognos Framework Manager tool. Relationships (inner/outer joins) and calculations are then created, and the resultant metadata model package is published to the IBM Cognos 10 BI server, where it becomes available on Cognos Connection. Users are given access to the appropriate studios on Cognos Connection, according to their needs. Analyses, reports, and dashboards are then created and distributed using one of these studios. The preceding sample used Business Insight, for instance. Later sections in this article will look more into Business Insight and Business Insight Advanced. The next section focuses on the Business Insight interface details from a navigation perspective.

Exploring IBM Cognos Business Insight User Interface

In this recipe we will explore the IBM Cognos Business Insight user interface in more detail. We will explore various areas of the UI, each dedicated to performing different actions.

Getting ready

As stated earlier, we will be exploring different sections of Cognos Business Insight. Hence, make sure that the IBM Cognos 10 BI installation is up and running and the samples are set up properly. We will start the recipe assuming that the IBM Cognos Connection window is already open on the screen.

How to do it...

To explore the IBM Cognos Business Insight user interface, perform the following steps:

1. In the IBM Cognos Connection window, navigate to Business Insight Samples, as shown in the following screenshot:
2. Click on one of the dashboards, for instance Marketing Dashboard, to open the dashboard in Business Insight. Different areas are labeled, as shown in the following figure:
3. The overall layout is termed the Dashboard. The topmost toolbar is called the Application bar. The Application bar contains different icons to manage the dashboard as a whole. For instance, we can create, open, e-mail, share, or save the dashboard using one of the icons on the Application bar. The user can explore the different icons on the Application bar by hovering the mouse pointer over them. Hovering displays a tooltip, which has brief but self-explanatory help text.
4. Similarly, there is a Widget toolbar for every widget, which gets activated when the user clicks on the corresponding widget. When the mouse focus moves away from the widget, the Widget toolbar disappears. It has various options, for instance to refresh the widget data, print as PDF, resize to fit content, and so on. It also provides the user with the capability to change the chart type as well as the color palette. All these options have help text associated with them, which is activated on mouse hover.
5. The Content tab and Content pane show the list of objects available on Cognos Connection. The directory structure on Cognos Connection can be navigated using the Content pane and Content tab, and hence available objects can be added to or removed from the dashboard. Drag-and-drop functionality is provided, as a result of which creating and editing a dashboard becomes as simple as moving objects between the Dashboard area and Cognos Connection.
6. The Toolbox tab displays additional widgets. The Slider Filter and Select Value Filter widgets allow the user to filter report content.
The other toolbox widgets allow the user to add more report content to the dashboard, such as HTML content, images, RSS feeds, and rich text.

How it works...

In the preceding section, we saw the basic areas of Business Insight. More than one user can log on to the IBM Cognos 10 BI server and create various objects on Cognos Connection. These objects include packages, reports, cubes, templates, and statistics, to name a few. These objects can be created using one or more of the tools available to users. For instance, reports can be created using one of the available studios. Cubes can be created using IBM Cognos TM1 or IBM Cognos Transformer and published on Cognos Connection. Metadata model packages can be created using IBM Cognos Framework Manager and published on Cognos Connection. These objects can then be dragged, dropped, and formatted as standalone objects in Cognos Business Insight, and hence dashboards can be created.

HBase Administration, Performance Tuning, Hadoop

Packt
21 Aug 2012
7 min read
Setting up Hadoop to spread disk I/O

Modern servers usually have multiple disk devices to provide large storage capacities. These disks are usually configured as RAID arrays in their factory settings. This is good for many cases, but not for Hadoop. The Hadoop slave node stores HDFS data blocks and MapReduce temporary files on its local disks. These local disk operations benefit from using multiple independent disks to spread disk I/O. In this recipe, we will describe how to set up Hadoop to use multiple disks to spread its disk I/O.

Getting ready

We assume you have multiple disks for each DataNode node. These disks are in a JBOD (Just a Bunch Of Disks) or RAID0 configuration. Assume that the disks are mounted at /mnt/d0, /mnt/d1, ..., /mnt/dn, and the user who starts HDFS has write permission on each mount point.

How to do it...

In order to set up Hadoop to spread disk I/O, follow these instructions:

1. On each DataNode node, create directories on each disk for HDFS to store its data blocks: code 1
2. Add the following code to the HDFS configuration file (hdfs-site.xml): code 2
3. Sync the modified hdfs-site.xml file across the cluster: code 3
4. Restart HDFS: code 4

How it works...

We recommend JBOD or RAID0 for the DataNode disks, because you don't need the redundancy of RAID, as HDFS ensures its data redundancy using replication between nodes. So, there is no data loss when a single disk fails.

Which one to choose, JBOD or RAID0? You will theoretically get better performance from a JBOD configuration than from a RAID configuration. This is because, in a RAID configuration, you have to wait for the slowest disk in the array to complete before the entire write operation can complete, which makes the average I/O time equivalent to the slowest disk's I/O time. In a JBOD configuration, operations on a faster disk will complete independently of the slower ones, which makes the average I/O time faster than the slowest one. However, enterprise-class RAID cards might make a big difference. You might want to benchmark your JBOD and RAID0 configurations before deciding which one to go with.

For both JBOD and RAID0 configurations, you will have the disks mounted at different paths. The key point here is to set the dfs.data.dir property to all the directories created on each disk. The dfs.data.dir property specifies where the DataNode should store its local blocks. By setting it to comma-separated multiple directories, the DataNode stores its blocks across all the disks in round-robin fashion. This causes Hadoop to efficiently spread disk I/O to all the disks.

Warning: Do not leave blanks between the directory paths in the dfs.data.dir property value, or it won't work as expected.

You will need to sync the changes across the cluster and restart HDFS to apply them.

There's more...

If you run MapReduce, you might also like to set up MapReduce to spread its disk I/O, as MapReduce stores its temporary files on the TaskTracker's local file system:

1. On each TaskTracker node, create directories on each disk for MapReduce to store its intermediate data files: code 5
2. Add the following to MapReduce's configuration file (mapred-site.xml): code 6
3. Sync the modified mapred-site.xml file across the cluster and restart MapReduce.

MapReduce generates a lot of temporary files on the TaskTrackers' local disks during its execution. Like HDFS, setting up multiple directories on different disks helps spread MapReduce disk I/O significantly.
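As a rough sketch only — the mount points /mnt/d0 and /mnt/d1 and the dfs/data and mapred/local subdirectories below are assumptions, so substitute whatever directories you actually created on your disks — the multi-directory settings described in this recipe typically look like the following in hdfs-site.xml and mapred-site.xml:

    <!-- hdfs-site.xml: one entry per disk, comma-separated, with no spaces between paths -->
    <property>
      <name>dfs.data.dir</name>
      <value>/mnt/d0/dfs/data,/mnt/d1/dfs/data</value>
    </property>

    <!-- mapred-site.xml: spread the TaskTracker's intermediate files the same way -->
    <property>
      <name>mapred.local.dir</name>
      <value>/mnt/d0/mapred/local,/mnt/d1/mapred/local</value>
    </property>

After syncing the modified files to every slave node, restart HDFS (and MapReduce, if you changed mapred-site.xml) so that the DataNode and TaskTracker pick up the new directories.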
Using network topology script to make Hadoop rack-aware

Hadoop has the concept of "Rack Awareness". Administrators are able to define the rack of each DataNode in the cluster. Making Hadoop rack-aware is extremely important because:

- Rack awareness prevents data loss
- Rack awareness improves network performance

In this recipe, we will describe how to make Hadoop rack-aware and why it is important.

Getting ready

You will need to know the rack to which each of your slave nodes belongs. Log in to the master node as the user who started Hadoop.

How to do it...

The following steps describe how to make Hadoop rack-aware:

1. Create a topology.sh script and store it under the Hadoop configuration directory. Change the path for topology.data, in line 3, to fit your environment: code 7
   Don't forget to set the execute permission on the script file: code 8
2. Create a topology.data file, as shown in the following snippet; change the IP addresses and racks to fit your environment: code 9
3. Add the following to the Hadoop core configuration file (core-site.xml): code 10
4. Sync the modified files across the cluster and restart HDFS and MapReduce.
5. Make sure HDFS is now rack-aware. If everything works well, you should be able to find something like the following snippet in your NameNode log file:

2012-03-10 13:43:17,284 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /dc1/rack3/10.160.19.149:50010
2012-03-10 13:43:17,297 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /dc1/rack1/10.161.30.108:50010
2012-03-10 13:43:17,429 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /dc1/rack2/10.166.221.198:50010

6. Make sure MapReduce is now rack-aware. If everything works well, you should be able to find something like the following snippet in your JobTracker log file:

2012-03-10 13:50:38,341 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /dc1/rack3/ip-10-160-19-149.us-west-1.compute.internal
2012-03-10 13:50:38,485 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /dc1/rack1/ip-10-161-30-108.us-west-1.compute.internal
2012-03-10 13:50:38,569 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /dc1/rack2/ip-10-166-221-198.us-west-1.compute.internal

How it works...

The following diagram shows the concept of Hadoop rack awareness:

Each block of the HDFS files will be replicated to multiple DataNodes, to prevent loss of all the data copies due to the failure of one machine. However, if all copies of the data happen to be replicated on DataNodes in the same rack, and that rack fails, all the data copies will be lost. To avoid this, the NameNode needs to know the network topology in order to use that information to make intelligent data replication. As shown in the previous diagram, with the default replication factor of three, two data copies will be placed on machines in the same rack, and another one will be put on a machine in a different rack. This ensures that a single rack failure won't result in the loss of all data copies.

Normally, two machines in the same rack have more bandwidth and lower latency between them than two machines in different racks. With the network topology information, Hadoop is able to maximize network performance by reading data from the proper DataNodes: if data is available on the local machine, Hadoop reads it from there; if not, Hadoop tries reading from a machine in the same rack; and if it is available on neither, data is read from machines in different racks.

In step 1, we create a topology.sh script. The script takes DNS names as arguments and returns network topology (rack) names as the output.
The mapping of DNS names to network topology is provided by the topology.data file, which was created in step 2. If an entry is not found in the topology.data file, the script returns /default/rack as the default rack name. Note that we use IP addresses, and not hostnames, in the topology.data file. There is a known bug whereby Hadoop does not correctly process hostnames that start with the letters "a" to "f"; check HADOOP-6682 for more details.

In step 3, we set the topology.script.file.name property in core-site.xml, telling Hadoop to invoke topology.sh to resolve DNS names to network topology names. After restarting Hadoop, as shown in the logs of steps 5 and 6, HDFS and MapReduce add the correct rack name as a prefix to the DNS name of each slave node. This indicates that HDFS and MapReduce rack awareness work well with the aforementioned settings.
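Because the code listings for this recipe depend on your environment, the following is only a sketch of what steps 1 to 3 commonly look like. The Hadoop configuration directory /usr/local/hadoop/conf is an assumption, the rack names simply mirror the sample logs above, and the IP addresses are the ones shown in those logs.

    #!/bin/bash
    # topology.sh -- Hadoop passes one or more node names/IP addresses as arguments;
    # the script prints the rack for each one, or /default/rack if it is not listed.
    HADOOP_CONF=/usr/local/hadoop/conf          # assumed path; adjust to your installation

    while [ $# -gt 0 ] ; do
      nodeArg=$1
      exec < ${HADOOP_CONF}/topology.data       # lookup table created in step 2
      result=""
      while read line ; do
        ar=( $line )
        if [ "${ar[0]}" = "$nodeArg" ] ; then
          result="${ar[1]}"
        fi
      done
      shift
      if [ -z "$result" ] ; then
        echo -n "/default/rack "
      else
        echo -n "$result "
      fi
    done

A matching topology.data file maps each slave's IP address to its rack:

    10.160.19.149   /dc1/rack3
    10.161.30.108   /dc1/rack1
    10.166.221.198  /dc1/rack2

And the core-site.xml entry from step 3 simply points Hadoop at the script:

    <property>
      <name>topology.script.file.name</name>
      <value>/usr/local/hadoop/conf/topology.sh</value>
    </property>

Remember to make the script executable, and to keep IP addresses rather than hostnames in topology.data, as discussed above.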