Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Events
Videos
Audiobooks
Packt Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds

How-To Tutorials

7019 Articles
Packt
26 Dec 2013
21 min read
Save for later

Implementing the Naïve Bayes classifier in Mahout

Packt
26 Dec 2013
21 min read
(for more resources related to this topic, see here.) Bayes was a Presbyterian priest who died giving his "Tractatus Logicus" to the prints in 1795. The interesting fact is that we had to wait a whole century for the Boolean calculus before Bayes' work came to light in the scientific community. The corpus of Bayes' study was conditional probability. Without entering too much into mathematical theory, we define conditional probability as the probability of an event that depends on the outcome of another event. In this article, we are dealing with a particular type of algorithm, a classifier algorithm. Given a dataset, that is, a set of observations of many variables, a classifier is able to assign a new observation to a particular category. So, for example, consider the following table: Outlook Temperature Temperature Humidity Humidity Windy Play Numeric Nominal Numeric Nominal Overcast 83 Hot 86 High FALSE Yes Overcast 64 Cool 65 Normal TRUE Yes Overcast 72 Mild 90 High TRUE Yes Overcast 81 Hot 75 Normal FALSE Yes Rainy 70 Mild 96 High FALSE Yes Rainy 68 Cool 80 Normal FALSE Yes Rainy 65 Cool 70 Normal TRUE No Rainy 75 Mild 80 Normal FALSE Yes Rainy 71 Mild 91 High TRUE No Sunny 85 Hot 85 High FALSE No Sunny 80 Hot 90 High TRUE No Sunny 72 Mild 95 High FALSE No Sunny 69 Cool 70 Normal FALSE Yes Sunny 75 Mild 70 Normal TRUE Yes The table itself is composed of a set of 14 observations consisting of 7 different categories: temperature (numeric), temperature (nominal), humidity (numeric), and so on. The classifier takes some of the observations to train the algorithm and some as testing it, to create a decision for a new observation that is not contained in the original dataset. There are many types of classifiers that can do this kind of job. The classifier algorithms are part of the supervised learning data-mining tasks that use training data to infer an outcome. The Naïve Bayes classifier uses the assumption that the fact, on observation, belongs to a particular category and is independent from belonging to any other category. Other types of classifiers present in Mahout are the logistic regression, random forests, and boosting. Refer to the page https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms for more information. This page is updated with the algorithm type, actual integration in Mahout, and other useful information. Moving out of this context, we could describe the Naïve Bayes algorithm as a classification algorithm that uses the conditional probability to transform an initial set of weights into a weight matrix, whose entries (row by column) detail the probability that one weight is associated to the other weight. In this article's recipes, we will use the same algorithm provided by the Mahout example source code that uses the Naïve Bayes classifier to find the relation between works of a set of documents. Our recipe can be easily extended to any kind of document or set of documents. We will only use the command line so that once the environment is set up, it will be easy for you to reproduce our recipe. Our dataset is divided into two parts: the training set and the testing set. The training set is used to instruct the algorithm on the relation it needs to find. The testing set is used to test the algorithm using some unrelated input. Let us now get a first-hand taste of how to use the Naïve Bayes classifier. Using the Mahout text classifier to demonstrate the basic use case The Mahout binaries contain ready-to-use scripts for using and understanding the classical Mahout dataset. We will use this dataset for testing or coding. Basically, the code is nothing more than following the Mahout ready-to-use script with the corrected parameter and the path settings done. This recipe will describe how to transform the raw text files into weight vectors that are needed by the Naïve Bayes algorithm to create the model. The steps involved are the following: Converting the raw text file into a sequence file Creating vector files from the sequence files Creating our working vectors Getting ready The first step is to download the datasets. The dataset is freely available at the following link: http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz. For classification purposes, other datasets can be found at the following URL: http://sci2s.ugr.es/keel/category.php?cat=clas#sub2. The dataset contains a post of 20 newsgroups dumped in a text file for the purpose of machine learning. Anyway, we could have also used other documents for testing purposes, but we will suggest how to do this later in the recipe. Before proceeding, in the command line, we need to set up the working folder where we decompress the original archive to have shorter commands when we need to insert the full path of the folder. In our case, the working folder is /mnt/new; so, our working folder's command-line variables will be set using the following command: export WORK_DIR=/mnt/new/ You can create a new folder and change the WORK_DIR bash variable accordingly. Do not forget that to have these examples running, you need to run the various commands with a user that has the HADOOP_HOME and MAHOUT_HOME variables in its path. To download the dataset, we only need to open up a terminal console and give the following command: wget http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz Once your working dataset is downloaded, decompress it using the following command: tar –xvzf 20news-bydate.tar.gz You should see the folder structure as shown in the following screenshot: The second step is to sequence the whole input file to transform them into Hadoop sequence files. To do this, you need to transform the two folders into a single one. However, this is only a pedagogical passage, but if you have multiple files containing the input texts, you could parse them separately by invoking the command multiple times. Using the console command, we can group them together as a whole by giving the following command in sequence: rm -rf ${WORK_DIR}/20news-all mkdir ${WORK_DIR}/20news-all cp -R ${WORK_DIR}/20news-bydate*/*/* ${WORK_DIR}/20news-all Now, we should have our input folder, which is the 20news-all folder, ready to be used: The following screenshot shows a bunch of files, all in the same folder: By looking at one single file, we should see the underlying structure that we will transform. The structure is as follows: From: xxx Subject: yyyyy Organization: zzzz X-Newsreader: rusnews v1.02 Lines: 50 jaeger@xxx (xxx) writes: >In article xxx writes: >>zzzz "How BCCI adapted the Koran rules of banking". The >>Times. August 13, 1991. > > So, let's see. If some guy writes a piece with a title that implies > something is the case then it must be so, is that it? We obviously removed the e-mail address, but you can open this file to see its content. For any newsgroup of 20 news items that are present on the dataset, we have a number of files, each of them containing a single post to a newsgroup without categorization. Following our initial tasks, we need to now transform all these files into Hadoop sequence files. To do this, you need to just type the following command: ./mahout seqdirectory -i ${WORK_DIR}/20news-all -o ${WORK_DIR}/20news-seq This command brings every file contained in the 20news-all folder and transforms them into a sequence file. As you can see, the number of corresponding sequence files is not one to one with the number of input files. In our case, the generated sequence files from the original 15417 text files are just one chunck-0 file. It is also possible to declare the number of output files and the mappers involved in this data transformation. We invite the reader to test the different parameters and their uses by invoking the following command: ./mahout seqdirectory --help The following table describes the various options that can be used with the seqdirectory command: Parameter Description --input (-i) input his gives the path to the job input directory. --output (-o) output The directory pathname for the output. --overwrite (-ow) If present, overwrite the output directory before running the job. --method (-xm) method The execution method to use: sequential or mapreduce. The default is mapreduce. --chunkSize (-chunk) chunkSize The chunkSize values in megabyte. The default is 64 Mb. --fileFilterClass (-filter) fileFilterClass The name of the class to use for file parsing.The default is org.apache.mahout.text.PrefixAdditionFilter. --keyPrefix (-prefix) keyPrefix The prefix to be prepended to the key of the sequence file. --charset (-c) charset The name of the character encoding of the input files.The default is UTF-8. --overwrite (-ow) If present, overwrite the output directory before running the job. --help (-h) Prints the help menu to the command console. --tempDir tempDir If specified, tells Mahout to use this as a temporary folder. --startPhase startPhase Defines the first phase that needs to be run. --endPhase endPhase Defines the last phase that needs to be run To examine the outcome, you can use the Hadoop command-line option fs. So, for example, if you would like to see what is in the chunck-0 file, you could type in the following command: hadoop fs -text $WORK_DIR/20news-seq/chunck-0 | more In our case, the result is as follows: /67399 From:xxx Subject: Re: Imake-TeX: looking for beta testers Organization: CS Department, Dortmund University, Germany Lines: 59 Distribution: world NNTP-Posting-Host: tommy.informatik.uni-dortmund.de In article <xxxxx>, yyy writes: |> As I announced at the X Technical Conference in January, I would like |> to |> make Imake-TeX, the Imake support for using the TeX typesetting system, |> publically available. Currently Imake-TeX is in beta test here at the |> computer science department of Dortmund University, and I am looking |> for |> some more beta testers, preferably with different TeX and Imake |> installations. The Hadoop command is pretty simple, and the syntax is as follows: hadoop fs –text <input file> In the preceding syntax, <input file> is the sequence file whose content you will see. Our sequence files have been created, and until now, there has been no analysis of the words and the text itself. The Naïve Bayes algorithm does not work directly with the words and the raw text, but with the weighted vector associated to the original document. So now, we need to transform the raw text into vectors of weights and frequency. To do this, we type in the following command: ./mahout seq2sparse -i ${WORK_DIR}/20news-seq -o ${WORK_DIR}/20news- vectors -lnorm -nv -wt tfidf The following command parameters are described briefly: The -lnorm parameter instructs the vector to use the L_2 norm as a distance The -nv parameter is an optional parameter that outputs the vector as namedVector The -wt parameter instructs which weight function needs to be used We end the data-preparation process with this step. Now, we have the weight vector files that are created and ready to be used by the Naïve Bayes algorithm. We will clear a little while this last step algorithm. This part is about tuning the algorithm for better performance of the Naïve Bayes classifier. How to do it… Now that we have generated the weight vectors, we need to give them to the training algorithm. But if we train the classifier against the whole set of data, we will not be able to test the accuracy of the classifier. To avoid this, you need to divide the vector files into two sets called the 80-20 split. This is a good data-mining approach because if you have any algorithm that should be instructed on a dataset, you should divide the whole bunch of data into two sets: one for training and one for testing your algorithm. A good dividing percentage is shown to be 80 percent and 20 percent, meaning that the training data should be 80 percent of the total while the testing ones should be the remaining 20 percent. To split data, we use the following command: ./mahout split -i ${WORK_DIR}/20news-vectors/tfidf-vectors --trainingOutput ${WORK_DIR}/20news-train-vectors --testOutput ${WORK_DIR}/20news-test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential As result of this command, we will have two new folders containing the training and testing vectors. Now, it is time to train our Naïves Bayes algorithm on the training set of vectors, and the command that is used is pretty easy: ./mahout trainnb -i ${WORK_DIR}/20news-train-vectors -el -o ${WORK_DIR}/model -li ${WORK_DIR}/labelindex -ow Once finished, we have our training model ready to be tested against the remaining 20 percent of the initial input vectors. The final console command is as follows: ./mahout testnb -i ${WORK_DIR}/20news-test-vectors -m ${WORK_DIR}/model -l ${WORK_DIR}/labelindex\ -ow -o ${WORK_DIR}/20news-testing The following screenshot shows the output of the preceding command: How it works... We have given certain commands and we have seen the outcome, but you've done this without an understanding of why we did it and above all, why we chose certain parameters. The whole sequence could be meaningless, even for an experienced coder. Let us now go a little deeper in each step of our algorithm. Apart from downloading the data, we can divide our Naïve Bayes algorithm into three main steps: Data preparation Data training Data testing In general, these are the three procedures for mining data that should be followed. The data preparation steps involve all the operations that are needed to create the dataset in the format that is required for the data mining procedure. In this case, we know that the original format was a bunch of files containing text, and we transformed them into a sequence file format. The main purpose of this is to have a format that can be handled by the map reducing algorithm. This phase is a general one as the input format is not ready to be used as it is in most cases. Sometimes, we also need to merge some data if they are divided into different sources. Sometimes, we also need to use Sqoop for extracting data from different datasources. Data training is the crucial part; from the original dataset, we extract the information that is relevant to our data mining tasks, and we bring some of them to train our model. In our case, we are trying to classify if a document can be inserted in a certain category based on the frequency of some terms in it. This will lead to a classifier that using another document can state if this document is under a previously found category. The output is a function that is able to determinate this association. Next, we need to evaluate this function because it is possible that one good classification in the learning phase is not so good when using a different document. This three-phased approach is essential in all classification tasks. The main difference relies on the type of classifier to be used in the training and testing phase. In this case, we use Naïve Bayes, but other classifiers can be used as well. In the Mahout framework, the available classifiers are Naïve Bayes, Decision Forest, and Logistic Regression. As we have seen, the data preparation consists basically of creating two series of files that will be used for training and testing purposes. The step to transform the raw text file into a Hadoop sequence format is pretty easy; so, we won't spend too long on it. But the next step is the most important one during data preparation. Let us recall it: mahout seq2sparse -i ${WORK_DIR}/20news-seq -o ${WORK_DIR}/20news- vectors -lnorm -nv -wt tfidf This computational step basically grabs the whole text from the chunck-0 sequence file and starts parsing it to extract information from the words contained in it. The input parameters tell the utility to work in the following ways: The -i parameter is used to declare the input folder where all the sequence files are stored The -o parameter is used to create the output folder where the vector containing the weights is stored The -nv parameter tells Mahout that the output format should be in the namedVector format The -wt parameter tells which frequency function to use for evaluating the weight of every term to a category The -lnorm parameter is a function used to normalize the weights using the L_2 distance The -ow: parameter overwrites the previously generated output results The -m: parameter gives the minimum log-likelihood ratio The whole purpose of this computation step is to transform the sequence files that contain the documents' raw text in the sequence files containing vectors that count the frequency of the term. Obviously, there are some different functions that count the frequency of a term within the whole set of documents. So, in Mahout, the possible values for the wt parameter are tf and tfidf. The Tf value is the simpler one and counts the frequency of the term. This means that the frequency of the Wi term inside the set of documents is the ratio between the total occurrence of the word over the total number of words. The second one considers the sum of every term frequency using a logarithmic function like this one: In the preceding formula, Wi is the TF-IDF weight of the word indexed by i. N is the total number of documents. DFi is the frequency of the i word in all the documents. In this preprocessing phase, we notice that we index the whole corpus of documents so that we are sure that even if we divide or split in the next phase, the documents are not affected. We compute a word frequency; this means that the word was contained in the training or testing set. So, the reader should grasp the fact that changing this parameter can affect the final weight vectors; so, based on the same text, we could have very different outcomes. The lnorm value basically means that while the weight can be a number ranging from 0 to an upper positive integer, they are normalized to 1 as the maximum possible weight for a word inside the frequency range. The following screenshot shows the output of the output folder: Various folders are created for storing the word count, frequency, and so on. Basically, this is because the Naïve Bayes classifier works by removing all periods and punctuation marks from the text. Then, from every text, it extracts the categories and the words. The final vector file can be seen in the tfidf-vectors folder, and for dumping vector files to normal text ones, you can use the vectordump command as follows: mahout vectordump -i ${WORK_DIR}/20news-vectors/tfidf-vectors/ part-r-00000 –o ${WORK_DIR}/20news-vectors/tfidf-vectors/part-r-00000dump The dictionary files and word files are sequence files containing the association within the unique key/word created by the MapReduce algorithm using the command: hadoop fs -text $WORK_DIR/20news-vectors/dictionary.file-0 | more one can see for example adrenal_gland 12912 adrenaline 12913 adrenaline.com 12914| The splitting of the dataset into training and testing is done by using the split command-line option of Mahout. The interesting parameter in this case is that randomSelectionPct equals 40. It uses a random selection to evaluate which point belongs to the training or the testing dataset. Now comes the interesting part. We are ready to train using the Naïve Bayes algorithm. The output of this algorithm is the model folder that contains the model in the form of a binary file. This file represents the Naïve Bayes model that holds the weight Matrix, the feature and label sums, and the weight normalizer vectors generated so far. Now that we have the model, we test it on the training set. The outcome is directly shown on the command line in terms of a confusion matrix. The following screenshot shows the format in which we can see our result. Finally, we test our classifier on the test vector generated by the split instruction. The output in this case is a confusion matrix. Its format is as shown in the following screenshot: We are now going to provide details on how this matrix should be interpreted. As you can see, we have the total classified instances that tell us how many sentences have been analyzed. Above this, we have the correctly/incorrectly classified instances. In our case, this means that on a test set of weighted vectors, we have nearly 90 percent of the corrected classified sentences against an error of 9 percent. But if we go through the matrix row by row, we can see at the end that we have different newsgroups. So, a is equal to alt.atheism and b is equal to comp.graphics. So, a first look at the detailed confusion matrix tells us that we did the best in classification against the rec.sport.hockey newsgroup, with a value of 418 that is the highest we have. If we take a look at the corresponding row, we understand that of these 418 classified sentences, we have 403/412; so, 97 percent of all of the sentences were found in the rec.sport.hockey newsgroup. But if we take a look at the comp.os.ms-windows.miscwe newsgroup, we can see overall performance is low. The sentences are not so centered around the same new newsgroup; so, it means that we find and classify the sentences in ms-windows in another newsgroup, and so we do not have a good classification. This is reasonable as sports terms like "hockey" are really limited to the hockey world, while sentences about Microsoft could be found both on Microsoft specific newsgroups and in other newsgroups. We encourage you to give another run to the testing phase on the training phase to see the output of the confusion matrix by giving the following command: ./bin/mahout testnb -i ${WORK_DIR}/20news-train-vectors -m ${WORK_DIR}/model -l ${WORK_DIR}/labelindex -ow -o ${WORK_DIR}/20news-testing As you can see, the input folder is the same for the training phase, and in this case, we have the following confusion matrix: In this case, we can see it using the same set both as the training and testing phase. The first consequence is that we have a rise in the correctly classified sentences by an order of 10 percent, which is even bigger if you remember that in terms of weighted vectors with respect to the testing phase, we have a size that is four times greater. But probably the most important thing is that the best classification has now moved from the hockey newsgroup to the sci.electronics newsgroup. There's more We use exactly the same procedure used by the Mahout examples contained in the binaries folder that we downloaded. But you should now be aware that starting all process need only to change the input files from the initial folder. So, for the willing reader, we suggest you download another raw text file and perform all the steps in another type of file to see the changes that we have compared to the initial input text. We would suggest that non-native English readers also look at the differences that we have by changing the initial input set with one not written in English. Since the whole text is transformed using only weight vectors, the outcome does not depend on the difference between languages but only on the probability of finding certain word couples. As a final step, using the same input texts, you could try to change the way the algorithm normalizes and counts the words to create the vector sparse weights. This could be easily done by changing, for example, the -wt tfidf parameter into the command line Mahout seq2sparce. So, for example, an alternative run of the seq2sparce Mahout could be the following one: mahout seq2sparse -i ${WORK_DIR}/20news-seq -o ${WORK_DIR}/20news- vectors -lnorm -nv -wt tfidf Finally, we not only choose to run the Naïve Bayes classifier for classifying words in a text document but also the algorithm that uses vectors of weights so that, for example, it would be easy to create your own vector weights.
Read more
  • 0
  • 0
  • 3207

article-image-how-bridge-client-server-gap-using-ajax-part-ii
Packt
15 Oct 2009
7 min read
Save for later

How to Bridge the Client-Server Gap using AJAX (Part II)

Packt
15 Oct 2009
7 min read
AJAX and events Suppose we wanted to allow each dictionary term name to control the display of the definition that follows; clicking on the term name would show or hide the associated definition. With the techniques we have seen so far, this should be pretty straightforward: $(document).ready(function() { $('.term').click(function() { $(this).siblings('.definition').slideToggle(); });}); When a term is clicked, this code finds siblings of the element that have a class of definition, and slides them up or down as appropriate. All seems in order, but a click does nothing with this code. Unfortunately, the terms have not yet been added to the document when we attach the click handlers. Even if we managed to attach click handlers to these items, once we clicked on a different letter the handlers would no longer be attached. This is a common problem with areas of a page populated by AJAX. A popular solution is to rebind handlers each time the page area is refreshed. This can be cumbersome, however, as the event binding code needs to be called each time anything causes the DOM structure of the page to change. We can implement event delegation, actually binding the event to an ancestor element that never changes. In this case, we'll attach the click handler to the document using .live() and catch our clicks that way: $(document).ready(function() { $('.term').live('click', function() { $(this).siblings('.definition').slideToggle(); });}); The .live() method tells the browser to observe all clicks anywhere on the page. If (and only if) the clicked element matches the .term selector, then the handler is executed. Now the toggling behavior will take place on any term, even if it is added by a later AJAX transaction. Security limitations For all its utility in crafting dynamic web applications, XMLHttpRequest (the underlying browser technology behind jQuery's AJAX implementation) is subject to strict boundaries. To prevent various cross-site scripting attacks, it is not generally possible to request a document from a server other than the one that hosts the original page. This is generally a positive situation. For example, some cite the implementation of JSON parsing by using eval() as insecure. If malicious code is present in the data file, it could be run by the eval() call. However, since the data file must reside on the same server as the web page itself, the ability to inject code in the data file is largely equivalent to the ability to inject code in the page directly. This means that, for the case of loading trusted JSON files, eval() is not a significant security concern. There are many cases, though, in which it would be beneficial to load data from a third-party source. There are several ways to work around the security limitations and allow this to happen. One method is to rely on the server to load the remote data, and then provide it when requested by the client. This is a very powerful approach as the server can perform pre-processing on the data as needed. For example, we could load XML files containing RSS news feeds from several sources, aggregate them into a single feed on the server, and publish this new file for the client when it is requested. To load data from a remote location without server involvement, we have to get a bit sneakier. A popular approach for the case of loading foreign JavaScript files is injecting <script> tags on demand. Since jQuery can help us insert new DOM elements, it is simple to do this: $(document.createElement('script')) .attr('src', 'http://example.com/example.js') .appendTo('head'); In fact, the $.getScript() method will automatically adapt to this technique if it detects a remote host in its URL argument, so even this is handled for us. The browser will execute the loaded script, but there is no mechanism to retrieve results from the script. For this reason, the technique requires cooperation from the remote host. The loaded script must take some action, such as setting a global variable that has an effect on the local environment. Services that publish scripts that are executable in this way will also provide an API with which to interact with the remote script. Another option is to use the <iframe> HTML tag to load remote data. This element allows any URL to be used as the source for its data fetching, even if it does not match the host page's server. The data can be loaded and easily displayed on the current page. Manipulating the data, however, typically requires the same cooperation needed for the <script> tag approach; scripts inside the <iframe> need to explicitly provide the data to objects in the parent document. Using JSONP for remote data The idea of using <script> tags to fetch JavaScript files from a remote source can be adapted to pull in JSON files from another server as well. To do this, we need to slightly modify the JSON file on the server, however. There are several mechanisms for doing this, one of which is directly supported by jQuery: JSON with Padding, or JSONP. The JSONP file format consists of a standard JSON file that has been wrapped in parentheses and prepended with an arbitrary text string. This string, the "padding", is determined by the client requesting the data. Because of the parentheses, the client can either cause a function to be called or a variable to be set depending on what is sent as the padding string. A PHP implementation of the JSONP technique is quite simple: <?php print($_GET['callback'] .'('. $data .')');?> Here, $data is a variable containing a string representation of a JSON file. When this script is called, the callback query string parameter is prepended to the resulting file that gets returned to the client. To demonstrate this technique, we need only slightly modify our earlier JSON example to call this remote data source instead. The $.getJSON() function makes use of a special placeholder character, ?, to achieve this. $(document).ready(function() { var url = 'http://examples.learningjquery.com/jsonp/g.php'; $('#letter-g a').click(function() { $.getJSON(url + '?callback=?', function(data) { $('#dictionary').empty(); $.each(data, function(entryIndex, entry) { var html = '<div class="entry">'; html += '<h3 class="term">' + entry['term'] + '</h3>'; html += '<div class="part">' + entry['part'] + '</div>'; html += '<div class="definition">'; html += entry['definition']; if (entry['quote']) { html += '<div class="quote">'; $.each(entry['quote'], function(lineIndex, line) { html += '<div class="quote-line">' + line + '</div>'; }); if (entry['author']) { html += '<div class="quote-author">' + entry['author'] + '</div>'; } html += '</div>'; } html += '</div>'; html += '</div>'; $('#dictionary').append(html); }); }); return false; });}); We normally would not be allowed to fetch JSON from a remote server (examples.learningjquery.com in this case). However, since this file is set up to provide its data in the JSONP format, we can obtain the data by appending a query string to our URL, using ? as a placeholder for the value of the callback argument. When the request is made, jQuery replaces the ? for us, parses the result, and passes it to the success function as data just as if this were a local JSON request. Note that the same security cautions hold here as before; whatever the server decides to return to the browser will execute on the user's computer. The JSONP technique should only be used with data coming from a trusted source.
Read more
  • 0
  • 0
  • 3204

article-image-managing-oracle-business-intelligence
Packt
25 Mar 2013
2 min read
Save for later

Managing Oracle Business Intelligence

Packt
25 Mar 2013
2 min read
(For more resources related to this topic, see here.) Getting ready Oracle Business Intelligence Version 11.1.1.6 is installed on a host other than the OMS host. The agent needs to be up and running on the host where Oracle Business Intelligence is installed. How to do it... To discover the OBIEE 11g target, perform the following steps: Log in to Enterprise Manager Cloud Control. From the Targets tab, select the Middleware option from the drop-down menu. Click on the + Add button on the Middleware home page. Select the Oracle Fusion Middleware/Weblogic Domain option from the drop-down list. Populate the Oracle Business Intelligence Weblogic Domain details on the Add Fusion Middleware Farm: Find Targets screen. Select Administration Server Host of the Oracle Business Intelligence host. Specify the Oracle Business Intelligence domain admin server port, and add a username, a password. Also provide a meaningful Unique Domain Identifier. The default name is left unchanged in this example. Click on Continue. Click on Close. Click on Add Targets to add the discovered components of the Oracle Business Intelligence domain. Click on Close. Verify the Add Fusion Middleware Farm: Results screen. Expand the Show Targets Details tab to view all the components of Oracle Business Intelligence. Click on OK. Click on the Farm01_bifoundation_domain link to monitor and manage all the Oracle Business Intelligence components. How it works... This recipe describes the steps to discover the Oracle Fusion Middleware/WebLogic Domain target. The Fusion Middleware plugin Version 12.1.0.3 is embedded with Cloud Control 12.1.0.2 and is certified to discover components of the Oracle Business Intelligence target, which can be managed and monitored using EM Cloud Control. There's more... Multiple Oracle Business Intelligence instances can be managed and monitored more efficiently using EM Cloud Control 12c. Enterprise Manager displays the performance of the managed targets data in a user-friendly graphical format. Summary This article deals with the steps required to discover the Oracle Fusion Middleware/WebLogic Domain target. Resources for Article : Further resources on this subject: Unveil the Power of Your Business Data with Oracle Discoverer [Article] Oracle Integration and Consolidation Products [Article] Oracle Business Intelligence : Getting Business Information from Data [Article]
Read more
  • 0
  • 0
  • 3202

article-image-controls-and-widgets
Packt
10 Aug 2015
25 min read
Save for later

Controls and Widgets

Packt
10 Aug 2015
25 min read
In this article by Chip Lambert and Shreerang Patwardhan, author of the book, Mastering jQuery Mobile, we will take our Civic Center application to the next level and in the process of doing so, we will explore different widgets. We will explore the touch events provided by the jQuery Mobile framework further and then take a look at how this framework interacts with third-party plugins. We will be covering the following different widgets and topics in this article: Collapsible widget Listview widget Range slider widget Radio button widget Touch events Third-party plugins HammerJs FastClick Accessibility (For more resources related to this topic, see here.) Widgets We already made use of widgets as part of the Civic Center application. "Which? Where? When did that happen? What did I miss?" Don't panic as you have missed nothing at all. All the components that we use as part of the jQuery Mobile framework are widgets. The page, buttons, and toolbars are all widgets. So what do we understand about widgets from their usage so far? One thing is pretty evident, widgets are feature-rich and they have a lot of things that are customizable and that can be tweaked as per the requirements of the design. These customizable things are pretty much the methods and events that these small plugins offer to the developers. So all in all: Widgets are feature rich, stateful plugins that have a complete lifecycle, along with methods and events. We will now explore a few widgets as discussed before and we will start off with the collapsible widget. A collapsible widget, more popularly known as the accordion control, is used to display and style a cluster of related content together to be easily accessible to the user. Let's see this collapsible widget in action. Pull up the index.html file. We will be adding the collapsible widget to the facilities page. You can jump directly to the content div of the facilities page. We will replace the simple-looking, unordered list and add the collapsible widget in its place. Add the following code in place of the <ul>...<li></li>...</ul> portion: <div data-role="collapsibleset"> <div data-role="collapsible"> <h3>Banquet Halls</h3> <p>List of banquet halls will go here</p> </div> <div data-role="collapsible"> <h3>Sports Arena</h3> <p>List of sports arenas will go here</p> </div> <div data-role="collapsible">    <h3>Conference Rooms</h3> <p>List of conference rooms will come here</p> </div> <div data-role="collapsible"> <h3>Ballrooms</h3> <p>List of ballrooms will come here</p> </div> </div> That was pretty simple. As you must have noticed, we are creating a group of collapsibles defined by div with data-role="collapsibleset". Inside this div, we have multiple div elements each with data-role of "collapsible". These data roles instruct the framework to style div as a collapsible. Let's break individual collapsibles further. Each collapsible div has to have a heading tag (h1-h6), which acts as the title for that collapsible. This heading can be followed by any HTML structure that is required as per your application's design. In our application, we added a paragraph tag with some dummy text for now. We will soon be replacing this text with another widget—listview. Before we proceed to look at how we will be doing this, let's see what the facilities page is looking like right now: Now let's take a look at another widget that we will include in our project—the listview widget. The listview widget is a very important widget from the mobile website stand point. The listview widget is highly customizable and can play an important role in the navigation system of your web application as well. In our application, we will include listview within the collapsible div elements that we have just created. Each collapsible will hold the relevant list items which can be linked to a detailed page for each item. Without further discussion, let's take a look at the following code. We have replaced the contents of the first collapsible list item within the paragraph tag with the code to include the listview widget. We will break up the code and discuss the minute details later: <div data-role="collapsible"> <h3>Banquet Halls</h3> <p> <span>We have 3 huge banquet halls named after 3 most celebrated Chef's from across the world.</span> <ul data-role="listview" data-inset="true"> <li> <a href="#">Gordon Ramsay</a> </li> <li> <a href="#">Anthony Bourdain</a> </li> <li> <a href="#">Sanjeev Kapoor</a> </li> </ul> </p> </div> That was pretty simple, right? We replaced the dummy text from the paragraph tag with a span that has some details concerning what that collapsible list is about, and then we have an unordered list with data-role="listview" and some property called data-inset="true". We have seen several data-roles before, and this one is no different. This data-role attribute informs the framework to style the unordered list, such as a tappable button, while a data-inset property informs the framework to apply the inset appearance to the list items. Without this property, the list items would stretch from edge to edge on the mobile device. Try setting the data-inset property to false or removing the property altogether. You will see the results for yourself. Another thing worth noticing in the preceding code is that we have included an anchor tag within the li tags. This anchor tag informs the framework to add a right arrow icon on the extreme right of that list item. Again, this icon is customizable, along with its position and other styling attributes. Right now, our facilities page should appear as seen in the following image: We will now add similar listview widgets within the remaining three collapsible items. The content for the next collapsible item titled Sports Arena should be as follows. Once added, this collapsible item, when expanded, should look as seen in the screenshot that follows the code: <div data-role="collapsible">    <h3>Sports Arena</h3>    <p>        <span>We have 3 huge sport arenas named after 3 most celebrated sport personalities from across the world.       </span>        <ul data-role="listview" data-inset="true">            <li>                <a href="#">Sachin Tendulkar</a>            </li>            <li>                <a href="#">Roger Federer</a>            </li>            <li>                <a href="#">Usain Bolt</a>            </li>        </ul>    </p> </div> The code for the listview widgets that should be included in the next collapsible item titled Conference Rooms. Once added, this collapsible, item when expanded, should look as seen in the image that follows the code: <div data-role="collapsible">    <h3>Conference Rooms</h3>    <p>        <span>            We have 3 huge conference rooms named after 3 largest technology companies.        </span>        <ul data-role="listview" data-inset="true">            <li>                <a href="#">Google</a>            </li>            <li>                <a href="#">Twitter</a>            </li>            <li>                <a href="#">Facebook</a>            </li>        </ul>    </p> </div> The final collapsible list item – Ballrooms – should hold the following code, to include its share of the listview items: <div data-role="collapsible">    <h3>Ballrooms</h3>    <p>        <span>            We have 3 huge ball rooms named after 3 different dance styles from across the world.        </span>        <ul data-role="listview" data-inset="true">            <li>                <a href="#">Ballet</a>            </li>            <li>                <a href="#">Kathak</a>            </li>            <li>                <a href="#">Paso Doble</a>            </li>        </ul>    </p> </div> After adding these listview items, our facilities page should look as seen in the following image: The facilities page now looks much better than it did earlier, and we now understand a couple more very important widgets available in jQuery Mobile—the collapsible widget and the listview Widget. We will now explore two form widgets – slider widget and the radio buttons widget. For this, we will be enhancing our catering page. Let's build a simple tool that will help the visitors of this site estimate the food expense based on the number of guests and the type of cuisine that they choose. Let's get started then. First, we will add the required HTML, to include the slider widget and the radio buttons widget. Scroll down to the content div of the catering page, where we have the paragraph tag containing some text about the Civic Center's catering services. Add the following code after the paragraph tag: <form>    <label style="font-weight: bold; padding: 15px 0px;" for="slider">Number of guests</label>    <input type="range" name="slider" id="slider" data-highlight="true" min="50" max="1000" value="50">    <fieldset data-role="controlgroup" id="cuisine-choices">        <legend style="font-weight: bold; padding: 15px 0px;">Choose your cuisine</legend>        <input type="radio" name="cuisine-choice" id="cuisine-choice-cont" value="15" checked="checked" />        <label for="cuisine-choice-cont">Continental</label>        <input type="radio" name="cuisine-choice" id="cuisine-choice-mex" value="12" />        <label for="cuisine-choice-mex">Mexican</label>        <input type="radio" name="cuisine-choice" id="cuisine-choice-ind" value="14" />        <label for="cuisine-choice-ind">Indian</label>    </fieldset>    <p>        The approximate cost will be: <span style="font-weight: bold;" id="totalCost"></span>    </p> </form> That is not much code, but we are adding and initializing two new form widgets here. Let's take a look at the code in detail: <label style="font-weight: bold; padding: 15px 0px;" for="slider">Number of guests</label> <input type="range" name="slider" id="slider" data-highlight="true" min="50" max="1000" value="50"> We are initializing our first form widget here—the slider widget. The slider widget is an input element of the type range, which accepts a minimum value and maximum value and a default value. We will be using this slider to accept the number of guests. Since the Civic Center can cater to a maximum of 1,000 people, we will set the maximum limit to 1,000 and we expect that we have at least 50 guests, so we set a minimum value of 50. Since the minimum number of guests that we cater for is 50, we set the input's default value to 50. We also set the data-highlight attribute value to true, which informs the framework that the selected area on the slider should be highlighted. Next comes the group of radio buttons. The most important attribute to be considered here is the data-role="controlgroup" set on the fieldset element. Adding this data-role combines the radio buttons into one single group, which helps inform the user that one of the radio buttons is to be selected. This gives a visual indication to the user that one radio button out of the whole lot needs to be selected. The values assigned to each of the radio inputs here indicate the cost per person for that particular cuisine. This value will help us calculate the final dollar value for the number of selected guests and the type of cuisine. Whenever you are using the form widgets, make sure you have the form elements in the hierarchy as required by the jQuery Mobile framework. When the elements are in the required hierarchy, the framework can apply the required styles. At the end of the previous code snippet, we have a paragraph tag where we will populate the approximate cost of catering for the selected number of guests and the type of cuisine selected. The catering page should now look as seen in the following image. Right now, we only have the HTML widgets in place. When you drag the slider or select different radio buttons, you will only see the UI interactions of these widgets and the UI treatments that the framework applies to these widgets. However, the total cost will not be populated yet. We will need to write some JavaScript logic to determine this value, and we will take a look at this in a minute. Before moving to the JavaScript part, make sure you have all the code that is needed: Now let's take a look at the magic part of the code (read JavaScript) that is going to make our widgets usable for the visitors of this Civic Center web application. Add the following JavaScript code in the script tag at the very end of our index.html file: $(document).on('pagecontainershow', function(){    var guests = 50;    var cost = 35;    var totalCost;    $("#slider").on("slidestop", function(event, ui){        guests = $('#slider').val();        totalCost = costCal();        $("#totalCost").text("$" + totalCost);    });    $("input:radio[name=cuisine-choice]").on("click", function() {        cost = $(this).val();        var totalCost = costCal();        $("#totalCost").text("$" + totalCost);    });    function costCal(){        return guests * cost;    } }); That is a pretty small chunk of code and pretty simple too. We will be looking at a few very important events that are part of the framework and that come in very handy when developing web applications with jQuery Mobile. One of the most important things that you must have already noticed is that we are not making use of the customary $(document).on('ready', function(){ in Jquery, but something that looks as the following code: $(document).on('pagecontainershow', function(){ The million dollar question here is "why doesn't DOM already work in jQuery Mobile?" As part of jQuery, the first thing that we often learn to do is execute our jQuery code as soon as the DOM is ready, and this is identified using the $(document).ready function. In jQuery Mobile, pages are requested and injected into the same DOM as the user navigates from one page to another and so the DOM ready event is as useful as it executes only for the first page. Now we need an event that should execute when every page loads, and $(document).pagecontainershow is the one. The pagecontainershow element is triggered on the toPage after the transition animation has completed. The pagecontainershow element is triggered on the pagecontainer element and not on the actual page. In the function, we initialize the guests and the cost variables to 50 and 35 respectively, as the minimum number of guests we can have is 50 and the "Continental" cuisine is selected by default, which has a value of 35. We will be calculating the estimated cost when the user changes the number of guests or selects a different radio button. This brings us to the next part of our code. We need to get the value of the number of guests as soon as the user stops sliding the slider. jQuery Mobile provides us with the slidestop event for this very purpose. As soon as the user stops sliding, we get the value of the slider and then call the costCal function, which returns a value that is the number of guests multiplied by the cost of the selected cuisine per person. We then display this value in the paragraph at the bottom for the user to get an estimated cost. We will discuss some more about the touch events that are available as part of the jQuery Mobile framework in the next section. When the user selects a different radio button, we retrieve the value of the selected radio button, call the costCal function again, and update the value displayed in the paragraph at the bottom of our page. If you have the code correct and your functions are all working fine, you should see something similar to the following image: Input with touch We will take a look at a couple of touch events, which are tap and taphold. The tap event is triggered after a quick touch; whereas the taphold event is triggered after a sustained, long press touch. The jQuery Mobile tap event is the gesture equivalent of the standard click event that is triggered on the release of the touch gesture. The following snippet of code should help you incorporate the tap event when you need to use it in your application: $(".selector").on("tap", function(){    console.log("tap event is triggered"); }); The jQuery Mobile taphold event triggers after a sustained, complete touch event, which is more commonly known as the long press event. The taphold event fires when the user taps and holds for a minimum of 750 milliseconds. You can also change the default value, but we will come to that in a minute. First, let's see how the taphold event is used: $(".selector").on("taphold", function(){    console.log("taphold event is triggered"); }); Now to change the default value for the long press event, we need to set the value for the following piece of code: $.event.special.tap.tapholdThreshold Working with plugins A number of times, we will come across scenarios where the capabilities of the framework are just not sufficient for all the requirements of your project. In such scenarios, we have to make use of third-party plugins in our project. We will be looking at two very interesting plugins in the course of this article, but before that, you need to understand what jQuery plugins exactly are. A jQuery plugin is simply a new method that has been used to extend jQuery's prototype object. When we include the jQuery plugin as part of our code, this new method becomes available for use within your application. When selecting jQuery plugins for your jQuery Mobile web application, make sure that the plugin is optimized for mobile devices and incorporates touch events as well, based on your requirements. The first plugin that we are going to look at today is called FastClick and is developed by FT Labs. This is an open source plugin and so can be used as part of your application. FastClick is a simple, easy-to-use library designed to eliminate the 300 ms delay between a physical tap and the firing on the click event on mobile browsers. Wait! What are we talking about? What is this 300 ms delay between tap and click? What exactly are we discussing? Sure. We understand the confusion. Let's explain this 300 ms delay issue. The click events have a 300 ms delay on touch devices, which makes web applications feel laggy on a mobile device and doesn't give users a native-like feel. If you go to a site that isn't mobile-optimized, it starts zoomed out. You have to then either pinch and zoom or double tap some content so that it becomes readable. The double-tap is a performance killer, because with every tap we have to wait to see whether it might be a double tap—and this wait is 300 ms. Here is how it plays out: touchstart touchend Wait 300ms in case of another tap click This pause of 300 ms applies to click events in JavaScript, but also other click-based interactions such as links and form controls. Most mobile web browsers out there have this 300 ms delay on the click events, but now a few modern browsers such as Chrome and FireFox for Android and iOS are removing this 300 ms delay. However, if you are supporting the older Android and iOS versions, with older mobile browsers, you might want to consider including the FastClick plugin in your application, which helps resolve this problem. Let's take a look at how we can use this plugin in any web application. First, you need to download the plugin files, or clone their GitHub repository here: https://github.com/ftlabs/fastclick. Once you have done that, include a reference to the plugin's JavaScript file in your application: <script type="application/javascript" src="path/fastclick.js"></script> Make sure that the script is loaded prior to instantiating FastClick on any element of the page. FastClick recommends you to instantiate the plugin on the body element itself. We can do this using the following piece of code: $(function){    FastClick.attach(document.body); } That is it! Your application is now free of the 300 ms click delay issue and will work as smooth as a native application. We have just provided you with an introduction to the FastClick plugin. There are several more features that this plugin provides. Make sure you visit their website—https://github.com/ftlabs/fastclick—for more details on what the plugin has to offer. Another important plugin that we will look at is HammerJs. HammerJs, again is an open source library that helps recognize gestures made by touch, mouse, and pointerEvents. Now, you would say that the jQuery Mobile framework already takes care of this, so why do we need a third-party plugin again? True, jQuery Mobile supports a variety of touch events such as tap, tap and hold, and swipe, as well as the regular mouse events, but what if in our application we want to make use of some touch gestures such as pan, pinch, rotate, and so on, which are not supported by jQuery Mobile by default? This is where HammerJs comes into the picture and plays nicely along with jQuery Mobile. Including HammerJS in your web application code is extremely simple and straightforward, like the FastClick plugin. You need to download the plugin files and then add a reference to the plugin JavaScript file: <script type="application/javascript" src="path/hammer.js"></script> Once you have included the plugin, you need to create a new instance on the Hammer object and then start using the plugin for all the touch gestures you need to support: var hammerPan = new Hammer(element_name, options); hammerPan.on('pan', function(){    console.log("Inside Pan event"); }); By default, Hammer adds a set of events—tap, double tap, swipe, pan, press, pinch, and rotate. The pinch and rotate recognizers are disabled by default, but can be turned on as and when required. HammerJS offers a lot of features that you might want to explore. Make sure you visit their website—http://hammerjs.github.io/ to understand the different features the library has to offer and how you can integrate this plugin within your existing or new jQuery Mobile projects. Accessibility Most of us today cannot imagine our lives without the Internet and our smartphones. Some will even argue that the Internet is the single largest revolutionary invention of all time that has touched numerous lives across the globe. Now, at the click of a mouse or the touch of your fingertip, the world is now at your disposal, provided you can use the mouse, see the screen, and hear the audio—impairments might make it difficult for people to access the Internet. This makes us wonder about how people with disabilities would use the Internet, their frustration in doing so, and the efforts that must be taken to make websites accessible to all. Though estimates vary on this, most studies have revealed that about 15% of the world's population have some kind of disability. Not all of these people would have an issue with accessing the web, but let's assume 5% of these people would face a problem in accessing the web. This 5% is also a considerable amount of users, which cannot be ignored by businesses on the web, and efforts must be taken in the right direction to make the web accessible to these users with disabilities. jQuery Mobile framework comes with built-in support for accessibility. jQuery Mobile is built with accessibility and universal access in mind. Any application that is built using jQuery Mobile is accessible via the screen reader as well. When you make use of the different jQuery Mobile widgets in your application, unknowingly you are also adding support for web accessibility into your application. jQuery Mobile framework adds all the necessary aria attributes to the elements in the DOM. Let's take a look at how the DOM looks for our facilities page: Look at the highlighted Events button in the top right corner and its corresponding HTML (also highlighted) in the developer tools. You will notice that there are a few attributes added to the anchor tag that start with aria-. We did not add any of these aria- attributes when we wrote the code for the Events button. jQuery Mobile library takes care of these things for you. The accessibility implementation is an ongoing process and the awesome developers at jQuery Mobile are working towards improving the support every new release. We spoke about aria- attributes, but what do they really represent? WAI - ARIA stands for Web Accessibility Initiative – Accessible Rich Internet Applications. This was a technical specification published by the World Wide Web Consortium (W3C) and basically specifies how to increase the accessibility of web pages. ARIA specifies the roles, properties, and states of a web page that make it accessible to all users. Accessibility is extremely vast, hence covering every detail of it is not possible. However, there is excellent material available on the Internet on this topic and we encourage you to read and understand this. Try to implement accessibility into your current or next project even if it is not based on jQuery Mobile. Web accessibility is an extremely important thing that should be considered, especially when you are building web applications that will be consumed by a huge consumer base—on e-commerce websites for example. Summary In this article, we made use of some of the available widgets from the jQuery Mobile framework and we built some interactivity into our existing Civic Center application. The widgets that we used included the range slider, the collapsible widget, the listview widget, and the radio button widget. We evaluated and looked at how to use two different third-party plugins—FastClick and HammerJs. We concluded the article by taking a look at the concept of Web Accessibility. Resources for Article: Further resources on this subject: Creating Mobile Dashboards [article] Speeding up Gradle builds for Android [article] Saying Hello to Unity and Android [article]
Read more
  • 0
  • 0
  • 3200

article-image-getting-and-running-mysql-python
Packt
24 Dec 2010
8 min read
Save for later

Getting Up and Running with MySQL for Python

Packt
24 Dec 2010
8 min read
  MySQL for Python Integrate the flexibility of Python and the power of MySQL to boost the productivity of your Python applications Implement the outstanding features of Python's MySQL library to their full potential See how to make MySQL take the processing burden from your programs Learn how to employ Python with MySQL to power your websites and desktop applications Apply your knowledge of MySQL and Python to real-world problems instead of hypothetical scenarios A manual packed with step-by-step exercises to integrate your Python applications with the MySQL database server Getting MySQL for Python How you get MySQL for Python depends on your operating system and the level of authorization you have on it. In the following subsections, we walk through the common operating systems and see how to get MySQL for Python on each. Using a package manager (only on Linux) Package managers are used regularly on Linux, but none come by default with Macintosh and Windows installations. So users of those systems can skip this section. A package manager takes care of downloading, unpacking, installing, and configuring new software for you. In order to use one to install software on your Linux installation, you will need administrative privileges. Administrative privileges on a Linux system can be obtained legitimately in one of the following three ways: Log into the system as the root user (not recommended) Switch user to the root user using su Use sudo to execute a single command as the root user The first two require knowledge of the root user's password. Logging into a system directly as the root user is not recommended due to the fact that there is no indication in the system logs as to who used the root account. Logging in as a normal user and then switching to root using su is better because it keeps an account of who did what on the machine and when. Either way, if you access the root account, you must be very careful because small mistakes can have major consequences. Unlike other operating systems, Linux assumes that you know what you are doing if you access the root account and will not stop you from going so far as deleting every file on the hard drive. Unless you are familiar with Linux system administration, it is far better, safer, and more secure to prefix the sudo command to the package manager call. This will give you the benefit of restricting use of administrator-level authority to a single command. The chances of catastrophic mistakes are therefore mitigated to a great degree. More information on any of these commands is available by prefacing either man or info before any of the preceding commands (su, sudo). Which package manager you use depends on which of the two mainstream package management systems your distribution uses. Users of RedHat or Fedora, SUSE, or Mandriva will use the RPM Package Manager (RPM) system. Users of Debian, Ubuntu, and other Debian-derivatives will use the apt suite of tools available for Debian installations. Each package is discussed in the following: Using RPMs and yum If you use SUSE, RedHat, or Fedora, the operating system comes with the yum package manager . You can see if MySQLdb is known to the system by running a search (here using sudo): sudo yum search mysqldb If yum returns a hit, you can then install MySQL for Python with the following command: sudo yum install mysqldb Using RPMs and urpm If you use Mandriva, you will need to use the urpm package manager in a similar fashion. To search use urpmq: sudo urpmq mysqldb And to install use urpmi: sudo urpmi mysqldb Using apt tools on Debian-like systems Whether you run a version of Ubuntu, Xandros, or Debian, you will have access to aptitude, the default Debian package manager. Using sudo we can search for MySQLdb in the apt sources using the following command: sudo aptitude search mysqldb On most Debian-based distributions, MySQL for Python is listed as python-mysqldb. Once you have found how apt references MySQL for Python, you can install it using the following code: sudo aptitude install python-mysqldb Using a package manager automates the entire process so you can move to the section Importing MySQL for Python. Using an installer for Windows Windows users will need to use the older 1.2.2 version of MySQL for Python. Using a web browser, go to the following link: http://sourceforge.net/projects/mysql-python/files/ This page offers a listing of all available files for all platforms. At the end of the file listing, find mysql-python and click on it. The listing will unfold to show folders containing versions of MySQL for Python back to 0.9.1. The version we want is 1.2.2. Windows binaries do not currently exist for the 1.2.3 version of MySQL for Python. To get them, you would need to install a C compiler on your Windows installation and compile the binary from source. Click on 1.2.2 and unfold the file listing. As you will see, the Windows binaries are differentiated by Python version—both 2.4 and 2.5 are supported. Choose the one that matches your Python installation and download it. Note that all available binaries are for 32-bit Windows installations, not 64-bit. After downloading the binary, installation is a simple matter of double-clicking the installation EXE file and following the dialogue. Once the installation is complete, the module is ready for use. So go to the section Importing MySQL for Python. Using an egg file One of the easiest ways to obtain MySQL for Python is as an egg file, and it is best to use one of those files if you can. Several advantages can be gained from working with egg files such as: They can include metadata about the package, including its dependencies They allow for the use of egg-aware software, a helpful level of abstraction Eggs can, technically, be placed on the Python executable path and used without unpacking They save the user from installing packages for which they do not have the appropriate version of software They are so portable that they can be used to extend the functionality of third-party applications Installing egg handling software One of the best known egg utilities—Easy Install, is available from the PEAK Developers' Center at http://peak.telecommunity.com/DevCenter/EasyInstall. How you install it depends on your operating system and whether you have package management software available. In the following section, we look at several ways to install Easy Install on the most common systems. Using a package manager (Linux) On Ubuntu you can try the following to install the easy_install tool (if not available already): shell> sudo aptitude install python-setuptools On RedHat or CentOS you can try using the yum package manager: shell> sudo yum install python-setuptools On Mandriva use urpmi: shell> sudo urpmi python-setuptools You must have administrator privileges to do the installations just mentioned. Without a package manager (Mac, Linux) If you do not have access to a Linux package manager, but nonetheless have a Unix variant as your operating system (for example, Mac OS X), you can install Python's setuptools manually. Go to: http://pypi.python.org/pypi/setuptools#files Download the relevant egg file for your Python version. When the file is downloaded, open a terminal and change to the download directory. From there you can run the egg file as a shell script. For Python 2.5, the command would look like this: sh setuptools-0.6c11-py2.5.egg This will install several files, but the most important one for our purposes is easy_install, usually located in /usr/bin. On Microsoft Windows On Windows, one can download the setuptools suite from the following URL: http://pypi.python.org/pypi/setuptools#files From the list located there, select the most appropriate Windows executable file. Once the download is completed, double-click the installation file and proceed through the dialogue. The installation process will set up several programs, but the one important for our purposes is easy_install.exe. Where this is located will differ by installation and may require using the search function from the Start Menu. On 64-bit Windows, for example, it may be in the Program Files (x86) directory. If in doubt, do a search. On Windows XP with Python 2.5, it is located here: C:Python25Scriptseasy_install.exe Note that you may need administrator privileges to perform this installation. Otherwise, you will need to install the software for your own use. Depending on the setup of your system, this may not always work. Installing software on Windows for your own use requires the following steps: Copy the setuptools installation file to your Desktop. Right-click on it and choose the runas option. Enter the name of the user who has enough rights to install it (presumably yourself) After the software has been installed, ensure that you know the location of the easy_install.exe file. You will need it to install MySQL for Python.
Read more
  • 0
  • 0
  • 3200

article-image-categories-and-attributes-magento-part-2
Packt
22 Oct 2009
7 min read
Save for later

Categories and Attributes in Magento: Part 2

Packt
22 Oct 2009
7 min read
Time for action: Creating Attributes In this section we will create an Attribute set for our store. First, we will create Attributes. Then, we will create the set. Before you begin Because Attributes are the main tool for describing your Products, it is important to make the best use of them. Plan which Attributes you want to use. What aspects or characteristics of your Products will a customer want to search for? Make those Attributes. What aspects of your Products will a customer want to choose? Make these Attributes, too. Attributes are organized into Attribute Sets. Each set is a collection of Attributes. You should create different sets to describe the different types of Products that you want to sell. In our coffee store, we will create two Attribute Sets: one for Single Origin coffees and one for Blends. They will differ in only one way. For Single Origin coffees, we will have an Attribute showing the country or region where the coffee is grown. We will not have this Attribute for blends because the coffees used in a blend can come from all over the world. Our sets will look like the following: Single Origin Attribute set Blended Attribute set Name Name Description Description Image Image Grind Grind Roast Roast Origin SKU SKU Price Price Size Size   Now, let's create the Attributes and put them into sets. The result of the following directions will be several new Attributes and two new Attribute Sets: If you haven't already, log in to your site's backend, which we call the Administrative Panel: Select Catalog | Attributes | Manage Attributes. list of all the Attributes is displayed. These attributes have been created for you. Some of these Attributes (such as color, cost, and description) are visible to your customers. Other Attributes affect the display of a Product, but your customers will never see them. For example, custom_design can be used to specify the name of a custom layout, which will be applied to a Product's page. Your customers will never see the name of the custom layout. We will add our own attributes to this list. Click the Add New Attribute button. The New Product Attribute page displays: There are two tabs on this page: Properties and Manage Label / Options. You are in the Properties tab. The Attribute Properties section contains settings that only the Magento administrator (you) will see. These settings are values that you will use when working with the Attribute. The Frontend Properties section contains settings that affect how this Attribute will be presented to your shoppers. We will cover each setting on this page. Attribute Code is the name of the Attribute. Your customers will never see this value. You will use it when managing the Attribute. Refer back to the list of Attributes that appeared in Step 2. The Attribute identifier appears in the first column, labelled Attribute Code. The Attribute Code must contain only lowercase letters, numbers, and the underscore character. And, it must begin with a letter. The Scope of this Attribute can be set as Store View, Website, or Global. For now, you can leave it set to the default—Store View. The other values become useful when you use one Magento installation to create multiple stores or multiple web sites. That is beyond the scope of this quick-start guide. After you assign an Attribute set to a Product, you will fill in values for the Attributes. For example, suppose you assign a set that contains the attributes color, description, price, and image. You will then need to enter the color, description, price, and image for that Product. Notice that each of the Attributes in that set is a different kind of data. For color, you would probably want to use a drop-down list to make selecting the right color quick and easy. This would also avoid using different terms for the same color such as "Red" and "Magenta". For description, you would probably want to use a freeform text field. For price, you would probably want to use a field that accepts only numbers, and that requires you to use two decimal places. And for image, you would want a field that enables you to upload a picture. The field Catalog Input Type for Store Owner enables you to select the kind of data that this Attribute will hold: In our example we are creating an Attribute called roast. When we assign this value to a Product, we want to select a single value for this field from a list of choices. So, we will select Dropdown. If you select Dropdown or Multiple Select for this field, then under the Manage Label/Options tab, you will need to enter the list of choices (the list of values) for this field. If you select Yes for Unique Value, then no two products can have the same value for this Attribute. For example, if I made roast a unique Attribute, that means only one kind of coffee in my store could be a Light roast, only one kind of coffee could be a French roast, only one could be Espresso, and so on. For an Attribute such as roast, this wouldn't make much sense. However, if this Attribute was the SKU of the Product, then I might want to make it unique. That would prevent me from entering the same SKU number for two different Products. If you select Yes for Values Required, then you must select or enter a value for this Attribute. You will not be able to save a Product with this Attribute if you leave it blank. In the case of roast, it makes sense to require a value. Our customers would not buy a coffee without knowing what kind of roast the coffee has. Input Validation for Store Owner causes Magento to check the value entered for an Attribute, and confirm that it is the right kind of data. When entering a value for this Attribute, if you do not enter the kind of data selected, then Magento gives you a warning message. The Apply To field determines which Product Types can have this Attribute applied to them. Remember that the three Product Types in Magento are Simple, Grouped, and Configurable. Recall that in our coffee store, if a type of coffee comes in only one roast, then it would be a Simple Product. And, if the customer gets to choose the roast, it would be a Configurable Product. So we want to select at least Simple Product and Configurable Product for the Apply To field: But what about Grouped Product? We might sell several different types of coffee in one package, which would make it a Grouped Product. For example, we might sell a Grouped Product that consists of a pound of Hawaiian Kona and a pound of Jamaican Blue Mountain. We could call this group something like "Island Coffees". If we applied the Attribute roast to this Grouped Product, then both types of coffee would be required to have the same roast. However, if Kona is better with a lighter roast and Blue Mountain is better with a darker roast, then we don't want them to have the same roast. So in our coffee store, we will not apply the Attribute roast to Grouped Products. When we sell coffees in special groupings, we will select the roast for each coffee. You will need to decide which Product Types each Attribute can be applied to. If you are the store owner and the only one using your site, you will know which Attributes should be applied to which Products. So, you can safely choose All Product Types for this setting.
Read more
  • 0
  • 0
  • 3197
Unlock access to the largest independent learning library in Tech for FREE!
Get unlimited access to 7500+ expert-authored eBooks and video courses covering every tech area you can think of.
Renews at $19.99/month. Cancel anytime
article-image-jasperreports-36-creating-report-model-beans-java-applications
Packt
26 Jun 2010
3 min read
Save for later

JasperReports 3.6: Creating a Report from Model Beans of Java Applications

Packt
26 Jun 2010
3 min read
(For more resources on JasperReports, see here.) Getting ready You need a Java JAR file that contains class files for the JavaBeans required for this recipe. A custInvoices.jar file is contained in the source code (chap4). Unzip the source code file for this article and copy the Task5 folder from the unzipped source code to a location of your choice. How to do it... Let's start using Java objects as data storage units. Open the ModelBeansReport.jrxml file from the Task5 folder of the source code for this article (chapt 4). The Designer tab of iReport shows a report containing data in the Title, Column Header, Customer Group Header1 and Detail 1 sections, as shown in the following screenshot: If you have not made any database connection so far in your iReport installation, you will see an Empty datasource shown selected in a drop-down list just below the main menu. Click on the Report Datasources icon, shown encircled to the right of the drop-down list in the following screenshot: A new window named Connections / Datasources will open, as shown next. This window lists an Empty data source as well as the datasources you have made so far. Click the New button at the top-right of the Connections / Datasources window. This will open a new Datasource selection window, as shown in the following screenshot: Select JavaBeans set datasource from the datasource types, as shown next. Click the Next button. A new window named JavaBeans set datasource will open, as shown in the following screenshot: Enter CustomerInvoicesJavaBeans as the name of your new connection in the text box beside the Name field, as shown in the following screenshot: Enter com.CustomerInvoicesFactory as the name of the factory class in the text box beside the Factory class field, as shown in the following screenshot: This com.CustomerInvoicesFactory class provides iReport with access to JavaBeans that contain your data. Enter getBeanCollection as the name of the static method in the text box beside The static method... field, as shown in the following screenshot: Leave the rest of the fields at their default values. Click the Test button to test your new connection to the JavaBeans datasource. You will see an Exception message dialog. This exception message occurs because iReport can't find your factory class. Dismiss the message box by clicking OK. Click the Save button at the bottom of the JavaBeans set datasource window and close the Connections / Datasources window as well.
Read more
  • 0
  • 0
  • 3196

article-image-setup-routine-enterprise-spring-application
Packt
14 Jan 2016
6 min read
Save for later

Setup Routine for an Enterprise Spring Application

Packt
14 Jan 2016
6 min read
In this article by Alex Bretet, author of the book Spring MVC Cookbook, you will learn to install Eclipse for Java EE developers and Java SE 8. (For more resources related to this topic, see here.) Introduction The choice of the Eclipse IDE needs to be discussed as there is some competition in this domain. Eclipse is popular in the Java community for being an active open source product; it is consequently accessible online to anyone with no restrictions. It also provides, among other usages, a very good support to web implementations, particularly to MVC approaches. Why use the Spring Framework? The Spring Framework and its community have also contributed to pull forward the Java platform for more than a decade. Presenting the whole framework in detail would require us to write more than a article. However, the core functionality based on the principles of Inversion of Control and Dependency Injection through a performant access to the bean repository allows massive reusability. Staying lightweight, the Spring Framework secures great scaling capabilities and could probably suit all modern architectures. The following recipe is about downloading and installing the Eclipse IDE for JEE developers and downloading and installing JDK 8 Oracle Hotspot. Getting ready This first sequence could appear as redundant or unnecessary with regard to your education or experience. For instance, you will, for sure, stay away from unidentified bugs (integration or development). You will also be assured of experiencing the same interfaces as the presented screenshots and figures. Also, because third-party products are living, you will not have to face the surprise of encountering unexpected screens or windows. How to do it... You need to perform the following steps to install the Eclipse IDE: Download a distribution of the Eclipse IDE for Java EE developers. We will be using in this article an Eclipse Luna distribution. We recommend you to install this version, which can be found at https://www.eclipse.org/downloads/packages/eclipse-ide-java-ee-developers/lunasr1, so that you can follow along with our guidelines and screenshots completely. Download a Luna distribution for the OS and environment of your choice: The product to be downloaded is not a binary installer but a ZIP archive. If you feel confident enough to use another version (more actual) of the Eclipse IDE for Java EE developers, all of them can be found at https://www.eclipse.org/downloads. For the upcoming installations, on Windows, a few target locations are suggested to be at the root directory C:. To avoid permission-related issues, it would be better if your Windows user is configured to be a local administrator. If you can't be part of this group, feel free to target installation directories you have write access to. Extract the downloaded archive into an eclipse directory:     If you are on Windows, archive into the C:Users{system.username}eclipse directory     If you are using Linux, archive into the /home/usr/{system.username}/eclipse directory     If you are using Mac OS X, archive into the /Users/{system.username}/eclipse directory Select and download a JDK 8. We suggest you to download the Oracle Hotspot JDK. Hotspot is a performant JVM implementation that has originally been built by Sun Microsystems. Now owned by Oracle, the Hotspot JRE and JDK are downloadable for free. Choose the product corresponding to your machine through Oracle website's link, http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html. To avoid a compatibility issue later on, do stay consistent with the architecture choice (32 or 64 bits) that you have made earlier for the Eclipse archive. Install the JDK 8. On Windows, perform the following steps:     Execute the downloaded file and wait until you reach the next installation step.     On the installation-step window, pay attention to the destination directory and change it to C:javajdk1.8.X_XX (X_XX as the latest current version. We will be using jdk1.8.0_25 in this article.)     Also, it won't be necessary to install an external JRE, so uncheck the Public JRE feature. On a Linux/Mac OS, perform the following steps:     Download the tar.gz archive corresponding to your environment.     Change the current directory to where you want to install Java. For easier instructions, let's agree on the /usr/java directory.     Move the downloaded tar.gz archive to this current directory.     Unpack the archive with the following command line, targeting the name of your archive: tar zxvf jdk-8u25-linux-i586.tar.gz (This example is for a binary archive corresponding to a Linux x86 machine).You must end up with the /usr/java/jdk1.8.0_25 directory structure that contains the subdirectories /bin, /db, /jre, /include, and so on. How it works… Eclipse for Java EE developers We have installed the Eclipse IDE for Java EE developers. Comparatively to Eclipse IDE for Java developers, there are some additional packages coming along, such as Java EE Developer Tools, Data Tools Platform, and JavaScript Development Tools. This version is appreciated for its capability to manage development servers as part of the IDE itself and its capability to customize the Project Facets and to support JPA. The Luna version is officially Java SE 8 compatible; it has been a decisive factor here. Choosing a JVM The choice of JVM implementation could be discussed over performance, memory management, garbage collection, and optimization capabilities. There are lots of different JVM implementations, and among them, a couple of open source solutions, such as OpenJDK and IcedTea (RedHat). It really depends on the application requirements. We have chosen Oracle Hotspot from experience, and from reference implementations, deployed it in production; it can be trusted for a wide range of generic purposes. Hotspot also behaves very well to run Java UI applications. Eclipse is one of them. Java SE 8 If you haven't already played with Scala or Clojure, it is time to take the functional programming train! With Java SE 8, Lambda expressions reduce the amount of code dramatically with an improved readability and maintainability. We won't implement only this Java 8 feature, but it being probably the most popular, it must be highlighted as it has given a massive credit to the paradigm change. It is important nowadays to feel familiar with these patterns. Summary In this article, you learned how to install Eclipse for Java EE developers and Java SE 8. Resources for Article: Further resources on this subject: Support for Developers of Spring Web Flow 2[article] Design with Spring AOP[article] Using Spring JMX within Java Applications[article]
Read more
  • 0
  • 0
  • 3194

article-image-logos-inkscape
Packt
03 Nov 2010
6 min read
Save for later

Logos in Inkscape

Packt
03 Nov 2010
6 min read
  Inkscape 0.48 Essentials for Web Designers Use the fascinating Inkscape graphics editor to create attractive layout designs, images, and icons for your website The first book on the newly released Inkscape version 0.48, with an exclusive focus on web design Comprehensive coverage of all aspects of Inkscape required for web design Incorporate eye-catching designs, patterns, and other visual elements to spice up your web pages Learn how to create your own Inkscape templates in addition to using the built-in ones Written in a simple illustrative manner, which will appeal to web designers and experienced Inkscape users alike        Here's an example of a web page with a logo as a major design element: Logos as the cornerstone of the design Logos are the graphical representation or emblem for a company or organization—sometimes even individuals use them to promote instant recognition. They can be a graphic (a combination of symbols or icons), a combination of graphics and text, or graphical forms of text. Why are they important in web design? Since most companies want to be recognized by their logo alone—the logo is the critical piece of the design. It needs prominent placement to work flawlessly with the design. Best practices for creating logos There are a lot of guidelines and principles to the best logo designs. And they start with some simple ideas that have been reworked and discussed intensely since the start of the Internet. But it never hurts to review the best practices. You want your company logos to be: Simple: That's right, you want to keep them clean, simple, neat, and intensely easy to recreate. If you nail this attribute, the others listed below will be easy to achieve. Memorable: Think of all the great company logos. You remember them in your mind's eye very easily right? That's because they are unique and in essence simple. These two attributes together make some of the best company logos today. Timeless: These logos will last many years. This not only saves the company or individual money, but it also increases the memorability of the logo and brand of the company. Versatile: Any logo that can be used in print (color and black and white), digital media, television, any size, letterhead, billboards, and small iconic statements along the bottom of web pages or promotional materials—is a successful logo. You never know where a logo might be placed, especially on the web. You want something that can be used in a prominent location on the company web site itself, but also something that works in a small thumbnail space for social media or cell phone applications. Appropriate: We want the logo to be appropriate for the company it is representing. The right colors, images, and more will go along way in giving the company credibility immediately upon first glance by any consumer or potential client. It can also prove to be a great indication of the services one can expect from the company itself. Seems easy enough right? It is, after some practice and some processes are in place. It never hurts to have a loose process to work with clients to determine their needs and wants in a logo. Some may already have a logo and want to keep parts of the design and revamp others while other clients might be so new they haven't ever had a logo before. As a start, here's a brief process for working with clients and discussing logos. Information gathering There's no better place to start than to open the floor for discussion. Here's just the start of what you can ask or gather from your client in an initial information gathering meeting: Does the client already have a logo? If yes, do they intend to keep that logo to use in the web design? Again if yes, get the source files. Hopefully they are in vector graphic format so they are scalable and usable right away in the web design. Are they interested in a logo redesign? This can be beneficial if they are rebranding themselves as a business or having a 'grand re-opening' of some sort. It can breathe life into a stale business and sometimes garner some new interest. If yes, is it a complete (open to anything) redesign? Or are there certain elements that need to stay? Sometimes color is important, or a certain font or even a certain graphical element needs to stay within the logo. Listen and take notes; it is important to work with the client to try to fulfill their needs as much as possible. If the client is open to a complete redesign, brainstorm a bit with them about their needs and wants. Colors, fonts, graphical ideas. Don't be afraid to bring out some paper and pencils and start sketching some ideas. Sometimes it can be most productive to work through some rough ideas this way to get a feel for what the client likes most and not. Consider it a working session. Try to understand where the client wants to use this logo most prominently. Keeping that in mind you will design something that is versatile and could be used in most mediums; you still want to know where they plan to use it the most. That way, you can tailor the logo as much as possible for that space—especially if you can use more color. What are the primary goals of the company? What is their mission statement? Does the client already have brand guidelines to consider? Creating initial designs After the initial informational session it is your turn to start designing. Take the paper and pencil sketches (if you had any) from the initial meeting and expand on them. In fact, spend a bit more time with your team and flesh out a few more of those ideas in a true brainstorming session. It can be beneficial to start this way first before jumping on to the computer and getting caught in details like typeface and effects. Once you have some solid ideas, bring it over to the computer and start designing. Focus on only three of your best ideas. That way you bring only your best to the client to review and discuss. Much like with the web design process, the logo design process takes a very similar route. You bring design mock ups to your client to review, give feedback, redesign, and then you go back and design some more—all until you get approvals. And then you—being an Inkscape expert—can build and then export the logo in any number of vector formats for use in almost any medium.
Read more
  • 0
  • 0
  • 3193

article-image-google-app-engine
Packt
05 Feb 2015
11 min read
Save for later

Google App Engine

Packt
05 Feb 2015
11 min read
In this article by Massimiliano Pippi, author of the book Python for Google App Engine, in this article, you will learn how to write a web application and seeing the platform in action. Web applications commonly provide a set of features such as user authentication and data storage. App Engine provides the services and tools needed to implement such features. (For more resources related to this topic, see here.) In this article, we will see: Details of the webapp2 framework How to authenticate users Storing data on Google Cloud Datastore Building HTML pages using templates Experimenting on the Notes application To better explore App Engine and Cloud Platform capabilities, we need a real-world application to experiment on; something that's not trivial to write, with a reasonable list of requirements. A good candidate is a note-taking application; we will name it Notes. Notes enable the users to add, remove, and modify a list of notes; a note has a title and a body of text. Users can only see their personal notes, so they must authenticate before using the application. The main page of the application will show the list of notes for logged-in users and a form to add new ones. The code from the helloworld example is a good starting point. We can simply change the name of the root folder and the application field in the app.yaml file to match the new name we chose for the application, or we can start a new project from scratch named notes. Authenticating users The first requirement for our Notes application is showing the home page only to users who are logged in and redirect others to the login form; the users service provided by App Engine is exactly what we need and adding it to our MainHandler class is quite simple: import webapp2 from google.appengine.api import users class MainHandler(webapp2.RequestHandler): def get(self): user = users.get_current_user() if user is not None: self.response.write('Hello Notes!') else: login_url = users.create_login_url(self.request.uri) self.redirect(login_url) app = webapp2.WSGIApplication([ ('/', MainHandler) ], debug=True) The user package we import on the second line of the previous code provides access to users' service functionalities. Inside the get() method of the MainHandler class, we first check whether the user visiting the page has logged in or not. If they have, the get_current_user() method returns an instance of the user class provided by App Engine and representing an authenticated user; otherwise, it returns None as output. If the user is valid, we provide the response as we did before; otherwise, we redirect them to the Google login form. The URL of the login form is returned using the create_login_url() method, and we call it, passing as a parameter the URL we want to redirect users to after a successful authentication. In this case, we want to redirect users to the same URL they are visiting, provided by webapp2 in the self.request.uri property. The webapp2 framework also provides handlers with a redirect() method we can use to conveniently set the right status and location properties of the response object so that the client browsers will be redirected to the login page. HTML templates with Jinja2 Web applications provide rich and complex HTML user interfaces, and Notes is no exception but, so far, response objects in our applications contained just small pieces of text. We could include HTML tags as strings in our Python modules and write them in the response body but we can imagine how easily it could become messy and hard to maintain the code. We need to completely separate the Python code from HTML pages and that's exactly what a template engine does. A template is a piece of HTML code living in its own file and possibly containing additional, special tags; with the help of a template engine, from the Python script, we can load this file, properly parse special tags, if any, and return valid HTML code in the response body. App Engine includes in the Python runtime a well-known template engine: the Jinja2 library. To make the Jinja2 library available to our application, we need to add this code to the app.yaml file under the libraries section: libraries: - name: webapp2 version: "2.5.2" - name: jinja2 version: latest We can put the HTML code for the main page in a file called main.html inside the application root. We start with a very simple page: <!DOCTYPE html> <html> <head lang="en"> <meta charset="UTF-8"> <title>Notes</title> </head> <body> <div class="container"> <h1>Welcome to Notes!</h1> <p> Hello, <b>{{user}}</b> - <a href="{{logout_url}}">Logout</a> </p> </div> </body> </html> Most of the content is static, which means that it will be rendered as standard HTML as we see it but there is a part that is dynamic and whose content depend on which data will be passed at runtime to the rendering process. This data is commonly referred to as template context. What has to be dynamic is the username of the current user and the link used to log out from the application. The HTML code contains two special elements written in the Jinja2 template syntax, {{user}} and {{logout_url}}, that will be substituted before the final output occurs. Back to the Python script; we need to add the code to initialize the template engine before the MainHandler class definition: import os import jinja2 jinja_env = jinja2.Environment( loader=jinja2.FileSystemLoader(os.path.dirname(__file__))) The environment instance stores engine configuration and global objects, and it's used to load templates instances; in our case, instances are loaded from HTML files on the filesystem in the same directory as the Python script. To load and render our template, we add the following code to the MainHandler.get() method: class MainHandler(webapp2.RequestHandler): def get(self): user = users.get_current_user() if user is not None: logout_url = users.create_logout_url(self.request.uri) template_context = { 'user': user.nickname(), 'logout_url': logout_url, } template = jinja_env.get_template('main.html') self.response.out.write( template.render(template_context)) else: login_url = users.create_login_url(self.request.uri) self.redirect(login_url) Similar to how we get the login URL, the create_logout_url() method provided by the user service returns the absolute URI to the logout procedure that we assign to the logout_url variable. We then create the template_context dictionary that contains the context values we want to pass to the template engine for the rendering process. We assign the nickname of the current user to the user key in the dictionary and the logout URL string to the logout_url key. The get_template() method from the jinja_env instance takes the name of the file that contains the HTML code and returns a Jinja2 template object. To obtain the final output, we call the render() method on the template object passing in the template_context dictionary whose values will be accessed, specifying their respective keys in the HTML file with the template syntax elements {{user}} and {{logout_url}}. Handling forms The main page of the application is supposed to list all the notes that belong to the current user but there isn't any way to create such notes at the moment. We need to display a web form on the main page so that users can submit details and create a note. To display a form to collect data and create notes, we put the following HTML code right below the username and the logout link in the main.html template file: {% if note_title %} <p>Title: {{note_title}}</p> <p>Content: {{note_content}}</p> {% endif %} <h4>Add a new note</h4> <form action="" method="post"> <div class="form-group"> <label for="title">Title:</label> <input type="text" id="title" name="title" /> </div> <div class="form-group"> <label for="content">Content:</label> <textarea id="content" name="content"></textarea> </div> <div class="form-group"> <button type="submit">Save note</button> </div> </form> Before showing the form, a message is displayed only when the template context contains a variable named note_title. To do this, we use an if statement, executed between the {% if note_title %} and {% endif %} delimiters; similar delimiters are used to perform for loops or assign values inside a template. The action property of the form tag is empty; this means that upon form submission, the browser will perform a POST request to the same URL, which in this case is the home page URL. As our WSGI application maps the home page to the MainHandler class, we need to add a method to this class so that it can handle POST requests: class MainHandler(webapp2.RequestHandler): def get(self): user = users.get_current_user() if user is not None: logout_url = users.create_logout_url(self.request.uri) template_context = { 'user': user.nickname(), 'logout_url': logout_url, } template = jinja_env.get_template('main.html') self.response.out.write( template.render(template_context)) else: login_url = users.create_login_url(self.request.uri) self.redirect(login_url) def post(self): user = users.get_current_user() if user is None: self.error(401) logout_url = users.create_logout_url(self.request.uri) template_context = { 'user': user.nickname(), 'logout_url': logout_url, 'note_title': self.request.get('title'), 'note_content': self.request.get('content'), } template = jinja_env.get_template('main.html') self.response.out.write( template.render(template_context)) When the form is submitted, the handler is invoked and the post() method is called. We first check whether a valid user is logged in; if not, we raise an HTTP 401: Unauthorized error without serving any content in the response body. Since the HTML template is the same served by the get() method, we still need to add the logout URL and the user name to the context. In this case, we also store the data coming from the HTML form in the context. To access the form data, we call the get() method on the self.request object. The last three lines are boilerplate code to load and render the home page template. We can move this code in a separate method to avoid duplication: def _render_template(self, template_name, context=None): if context is None: context = {} template = jinja_env.get_template(template_name) return template.render(context) In the handler class, we will then use something like this to output the template rendering result: self.response.out.write( self._render_template('main.html', template_context)) We can try to submit the form and check whether the note title and content are actually displayed above the form. Summary Thanks to App Engine, we have already implemented a rich set of features with a relatively small effort so far. We have discovered some more details about the webapp2 framework and its capabilities, implementing a nontrivial request handler. We have learned how to use the App Engine users service to provide users authentication. We have delved into some fundamental details of Datastore and now we know how to structure data in grouped entities and how to effectively retrieve data with ancestor queries. In addition, we have created an HTML user interface with the help of the Jinja2 template library, learning how to serve static content such as CSS files. Resources for Article: Further resources on this subject: Machine Learning in IPython with scikit-learn [Article] Introspecting Maya, Python, and PyMEL [Article] Driving Visual Analyses with Automobile Data (Python) [Article]
Read more
  • 0
  • 0
  • 3192
article-image-probability-r
Packt
23 Feb 2016
17 min read
Save for later

Probability of R?

Packt
23 Feb 2016
17 min read
It's time for us to put descriptive statistics down for the time being. It was fun for a while, but we're no longer content just determining the properties of observed data; now we want to start making deductions about data we haven't observed. This leads us to the realm of inferential statistics. In data analysis, probability is used to quantify uncertainty of our deductions about unobserved data. In the land of inferential statistics, probability reigns queen. Many regard her as a harsh mistress, but that's just a rumor. (For more resources related to this topic, see here.) Basic probability Probability measures the likeliness that a particular event will occur. When mathematicians (us, for now!) speak of an event, we are referring to a set of potential outcomes of an experiment, or trial, to which we can assign a probability of occurrence. Probabilities are expressed as a number between 0 and 1 (or as a percentage out of 100). An event with a probability of 0 denotes an impossible outcome, and a probability of 1 describes an event that is certain to occur. The canonical example of probability at work is a coin flip. In the coin flip event, there are two outcomes: the coin lands on heads, or the coin lands on tails. Pretending that coins never land on their edge (they almost never do), those two outcomes are the only ones possible. The sample space (the set of all possible outcomes), therefore, is {heads, tails}. Since the entire sample space is covered by these two outcomes, they are said to be collectively exhaustive. The sum of the probabilities of collectively exhaustive events is always 1. In this example, the probability that the coin flip will yield heads or yield tails is 1; it is certain that the coin will land on one of those. In a fair and correctly balanced coin, each of those two outcomes is equally likely. Therefore, we split the probability equally among the outcomes: in the event of a coin flip, the probability of obtaining heads is 0.5, and the probability of tails is 0.5 as well. This is usually denoted as follows: The probability of a coin flip yielding either heads or tails looks like this: And the probability of a coin flip yielding both heads and tails is denoted as follows: The two outcomes, in addition to being collectively exhaustive, are also mutually exclusive. This means that they can never co-occur. This is why the probability of heads and tails is 0; it just can't happen. The next obligatory application of beginner probability theory is in the case of rolling a standard six-sided die. In the event of a die roll, the sample space is {1, 2, 3, 4, 5, 6}. With every roll of the die, we are sampling from this space. In this event, too, each outcome is equally likely, except now we have to divide the probability across six outcomes. In the following equation, we denote the probability of rolling a 1 as P(1): Rolling a 1 or rolling a 2 is not collectively exhaustive (we can still roll a 3, 4, 5, or 6), but they are mutually exclusive; we can't roll a 1 and 2. If we want to calculate the probability of either one of two mutually exclusive events occurring, we add the probabilities: While rolling a 1 or rolling a 2 aren't mutually exhaustive, rolling 1 and not rolling a 1 are. This is usually denoted in this manner: These two events—and all events that are both collectively exhaustive and mutually exclusive—are called complementary events. Our last pedagogical example in the basic probability theory is using a deck of cards. Our deck has 52 cards—4 for each number from 2 to 10 and 4 each of Jack, Queen, King, and Ace (no Jokers!). Each of these 4 cards belong to one suit, either a Heart, Club, Spade or Diamond. There are, therefore, 13 cards in each suit. Further, every Heart and Diamond card is colored red, and every Spade and Club are black. From this, we can deduce the following probabilities for the outcome of randomly choosing a card: What, then, is the probability of getting a black card and an Ace? Well, these events are conditionally independent, meaning that the probability of either outcome does not affect the probability of the other. In cases like these, the probability of event A and event B is the product of the probability of A and the probability of B. Therefore: Intuitively, this makes sense, because there are two black Aces out of a possible 52. What about the probability that we choose a red card and a Heart? These two outcomes are not conditionally independent, because knowing that the card is red has a bearing on the likelihood that the card is also a Heart. In cases like these, the probability of event A and B is denoted as follows: Where P(A|B) means the probability of A given B. For example, if we represent A as drawing a Heart and B as drawing a red card, P(A | B) means what's the probability of drawing a heart if we know that the card we drew was red?. Since a red card is equally likely to be a Heart or a Diamond, P(A|B) is 0.5. Therefore: In the preceding equation, we used the form P(B) P(A|B). Had we used the form P(A) P(B|A), we would have got the same answer: So, these two forms are equivalent: For kicks, let's divide both sides of the equation by P(B). That yields the following equivalence: This equation is known as Bayes' Theorem. This equation is very easy to derive, but its meaning and influence is profound. In fact, it is one of the most famous equations in all of mathematics. Bayes' Theorem has been applied to and proven useful in an enormous amount of different disciplines and contexts. It was used to help crack the German Enigma code during World War II, saving the lives of millions. It was also used recently, and famously, by Nate Silver to help correctly predict the voting patterns of 49 states in the 2008 US presidential election. At its core, Bayes' Theorem tells us how to update the probability of a hypothesis in light of new evidence. Due to this, the following formulation of Bayes' Theorem is often more intuitive: where H is the hypothesis and E is the evidence. Let's see an example of Bayes' Theorem in action! There's a hot new recreational drug on the scene called Allighate (or Ally for short). It's named as such because it makes its users go wild and act like an alligator. Since the effect of the drug is so deleterious, very few people actually take the drug. In fact, only about 1 in every thousand people (0.1%) take it. Frightened by fear-mongering late-night news, Daisy Girl, Inc., a technology consulting firm, ordered an Allighate testing kit for all of its 200 employees so that it could offer treatment to any employee who has been using it. Not sparing any expense, they bought the best kit on the market; it had 99% sensitivity and 99% specificity. This means that it correctly identified drug users 99 out of 100 times, and only falsely identified a non-user as a user once in every 100 times. When the results finally came back, two employees tested positive. Though the two denied using the drug, their supervisor, Ronald, was ready to send them off to get help. Just as Ronald was about to send them off, Shanice, a clever employee from the statistics department, came to their defense. Ronald incorrectly assumed that each of the employees who tested positive were using the drug with 99% certainty and, therefore, the chances that both were using it was 98%. Shanice explained that it was actually far more likely that neither employee was using Allighate. How so? Let's find out by applying Bayes' theorem! Let's focus on just one employee right now; let H be the hypothesis that one of the employees is using Ally, and E represent the evidence that the employee tested positive. We want to solve the left side of the equation, so let's plug in values. The first part of the right side of the equation, P(Positive Test | Ally User), is called the likelihood. The probability of testing positive if you use the drug is 99%; this is what tripped up Ronald—and most other people when they first heard of the problem. The second part, P(Ally User), is called the prior. This is our belief that any one person has used the drug before we receive any evidence. Since we know that only .1% of people use Ally, this would be a reasonable choice for a prior. Finally, the denominator of the equation is a normalizing constant, which ensures that the final probability in the equation will add up to one of all possible hypotheses. Finally, the value we are trying to solve, P(Ally user | Positive Test), is the posterior. It is the probability of our hypothesis updated to reflect new evidence. In many practical settings, computing the normalizing factor is very difficult. In this case, because there are only two possible hypotheses, being a user or not, the probability of finding the evidence of a positive test is given as follows: Which is: (.99 * .001) + (.01 * .999) = 0.01098 Plugging that into the denominator, our final answer is calculated as follows: Note that the new evidence, which favored the hypothesis that the employee was using Ally, shifted our prior belief from .001 to .09. Even so, our prior belief about whether an employee was using Ally was so extraordinarily low, it would take some very very strong evidence indeed to convince us that an employee was an Ally user. Ignoring the prior probability in cases like these is known as base-rate fallacy. Shanice assuaged Ronald's embarrassment by assuring him that it was a very common mistake. Now to extend this to two employees: the probability of any two employees both using the drug is, as we now know, .01 squared, or 1 million to one. Squaring our new posterior yields, we get .0081. The probability that both employees use Ally, even given their positive results, is less than 1%. So, they are exonerated. Sally is a different story, though. Her friends noticed her behavior had dramatically changed as of late—she snaps at co-workers and has taken to eating pencils. Her concerned cubicle-mate even followed her after work and saw her crawl into a sewer, not to emerge until the next day to go back to work. Even though Sally passed the drug test, we know that it's likely (almost certain) that she uses Ally. Bayes' theorem gives us a way to quantify that probability! Our prior is the same, but now our likelihood is pretty much as close to 1 as you can get - after all, how many non-Ally users do you think eat pencils and live in sewers? A tale of two interpretations Though it may seem strange to hear, there is actually a hot philosophical debate about what probability really is. Though there are others, the two primary camps into which virtually all mathematicians fall are the frequentist camp and the Bayesian camp. The frequentist interpretation describes probability as the relative likelihood of observing an outcome in an experiment when you repeat the experiment multiple times. Flipping a coin is a perfect example; the probability of heads converges to 50% as the number of times it is flipped goes to infinity. The frequentist interpretation of probability is inherently objective; there is a true probability out there in the world, which we are trying to estimate. The Bayesian interpretation, however, views probability as our degree of belief about something. Because of this, the Bayesian interpretation is subjective; when evidence is scarce, there are sometimes wildly different degrees of belief among different people. Described in this manner, Bayesianism may scare many people off, but it is actually quite intuitive. For example, when a meteorologist describes the probability of rain as 70%, people rarely bat an eyelash. But this number only really makes sense within a Bayesian framework because exact meteorological conditions are not repeatable, as is required by frequentist probability. Not simply a heady academic exercise, these two interpretations lead to different methodologies in solving problems in data analysis. Many times, both approaches lead to similar results. Though practitioners may strongly align themselves with one side over another, good statisticians know that there's a time and a place for both approaches. Though Bayesianism as a valid way of looking at probability is debated, Bayes theorem is a fact about probability and is undisputed and non-controversial. Sampling from distributions Observing the outcome of trials that involve a random variable, a variable whose value changes due to chance, can be thought of as sampling from a probability distribution—one that describes the likelihood of each member of the sample space occurring. That sentence probably sounds much scarier than it needs to be. Take a die roll for example. Figure 4.1: Probability distribution of outcomes of a die roll Each roll of a die is like sampling from a discrete probability distribution for which each outcome in the sample space has a probability of 0.167 or 1/6. This is an example of a uniform distribution, because all the outcomes are uniformly as likely to occur. Further, there are a finite number of outcomes, so this is a discrete uniform distribution (there also exist continuous uniform distributions). Flipping a coin is like sampling from a uniform distribution with only two outcomes. More specifically, the probability distribution that describes coin-flip events is called a Bernoulli distribution—it's a distribution describing only two events. Parameters We use probability distributions to describe the behavior of random variables because they make it easy to compute with and give us a lot of information about how a variable behaves. But before we perform computations with probability distributions, we have to specify the parameters of those distributions. These parameters will determine exactly what the distribution looks like and how it will behave. For example, the behavior of both a 6-sided die and a 12-sided die is modeled with a uniform distribution. Even though the behavior of both the dice is modeled as uniform distributions, the behavior of each is a little different. To further specify the behavior of each distribution, we detail its parameter; in the case of the (discrete) uniform distribution, the parameter is called n. A uniform distribution with parameter n has n equally likely outcomes of probability 1 / n. The n for a 6-sided die and a 12-sided die is 6 and 12 respectively. For a Bernoulli distribution, which describes the probability distribution of an event with only two outcomes, the parameter is p. Outcome 1 occurs with probability p, and the other outcome occurs with probability 1 - p, because they are collectively exhaustive. The flip of a fair coin is modeled as a Bernoulli distribution with p = 0.5. Imagine a six-sided die with one side labeled 1 and the other five sides labeled 2. The outcome of the die roll trials can be described with a Bernoulli distribution, too! This time, p = 0.16 (1/6). Therefore, the probability of not rolling a 1 is 5/6. The binomial distribution The binomial distribution is a fun one. Like our uniform distribution described in the previous section, it is discrete. When an event has two possible outcomes, success or failure, this distribution describes the number of successes in a certain number of trials. Its parameters are n, the number of trials, and p, the probability of success. Concretely, a binomial distribution with n=1 and p=0.5 describes the behavior of a single coin flip—if we choose to view heads as successes (we could also choose to view tails as successes). A binomial distribution with n=30 and p=0.5 describes the number of heads we should expect. Figure 4.2: A binomial distribution (n=30, p=0.5) On average, of course, we would expect to have 15 heads. However, randomness is the name of the game, and seeing more or fewer heads is totally expected. How can we use the binomial distribution in practice?, you ask. Well, let's look at an application. Larry the Untrustworthy Knave—who can only be trusted some of the time—gives us a coin that he alleges is fair. We flip it 30 times and observe 10 heads. It turns out that the probability of getting exactly 10 heads on 30 flips is about 2.8%*. We can use R to tell us the probability of getting 10 or fewer heads using the pbinom function:   > pbinom(10, size=30, prob=.5)   [1] 0.04936857 It appears as if the probability of this occurring, in a correctly balanced coin, is roughly 5%. Do you think we should take Larry at his word? *If you're interested The way we determined the probability of getting exactly 10 heads is by using the probability formula for Bernoulli trials. The probability of getting k successes in n trials is equal to: where p is the probability of getting one success and: The normal distribution When we described the normal distribution and how ubiquitous it is? The behavior of many random variables in real life is very well described by a normal distribution with certain parameters. The two parameters that uniquely specify a normal distribution are µ (mu) and σ (sigma). µ, the mean, describes where the distribution's peak is located and σ, the standard deviation, describes how wide or narrow the distribution is. Figure 4.3: Normal distributions with different parameters The distribution of heights of American females is approximately normally distributed with parameters µ= 65 inches and σ= 3.5 inches. Figure 4.4: Normal distributions with different parameters With this information, we can easily answer questions about how probable it is to choose, at random, US women of certain heights. We can't really answer the question What is the probability that we choose a person who is exactly 60 inches?, because virtually no one is exactly 60 inches. Instead, we answer questions about how probable it is that a random person is within a certain range of heights. What is the probability that a randomly chosen woman is 70 inches or taller? If you recall, the probability of a height within a range is the area under the curve, or the integral over that range. In this case, the range we will integrate looks like this: Figure 4.5: Area under the curve of the height distribution from 70 inches to positive infinity    > f <- function(x){ dnorm(x, mean=65, sd=3.5) }   > integrate(f, 70, Inf)   0.07656373 with absolute error < 2.2e-06 The preceding R code indicates that there is a 7.66% chance of randomly choosing a woman who is 70 inches or taller. Luckily for us, the normal distribution is so popular and well studied, that there is a function built into R, so we don't need to use integration ourselves.   > pnorm(70, mean=65, sd=3.5)   [1] 0.9234363  The pnorm function tells us the probability of choosing a woman who is shorter than 70 inches. If we want to find P (> 70 inches), we can either subtract this value by 1 (which gives us the complement) or use the optional argument lower.tail=FALSE. If you do this, you'll see that the result matches the 7.66% chance we arrived at earlier. Summary You can check out similar books published by Packt Publishing on R (https://www.packtpub.com/tech/r): Unsupervised Learning with R by Erik Rodríguez Pacheco (https://www.packtpub.com/big-data-and-business-intelligence/unsupervised-learning-r) R Data Science Essentials by Raja B. Koushik and Sharan Kumar Ravindran (https://www.packtpub.com/big-data-and-business-intelligence/r-data-science-essentials) Resources for Article: Further resources on this subject: Dealing With A Mess [article] Navigating The Online Drupal Community [article] Design With Spring AOP [article]
Read more
  • 0
  • 0
  • 3191

article-image-docker-hosts
Packt
07 Mar 2016
19 min read
Save for later

Docker Hosts

Packt
07 Mar 2016
19 min read
In this article by Scott Gallagher, the author of Securing Docker, we are glad you decided to read this article and we want to make sure that the resources you are using are being secured in proper ways to ensure system integrity and data loss prevention. It is also important to understand why you should care about the security. If data loss prevention doesn't scare you already, thinking about the worst possible scenario—a full system compromise and the possibility of your secret designs being leaked or stolen by others—might help to reinforce security. In this article, we will be taking a look at securing Docker hosts and will be covering the following topics: Docker host overview Discussing Docker host Virtualization and isolation Attack surface of Docker daemon Securing Docker hosts Docker Machine SELinux and AppArmor Auto-patching hosts (For more resources related to this topic, see here.) Docker host overview Before we get in depth and dive in, let's first take a step back and review exactly what the Docker host is. In this section, we will look at the Docker host itself to get an understanding of what we are referring to when we are talking about the Docker host. We will also be looking at the virtualization and isolation that Docker uses to ensure security. Discussing Docker host When we think of a Docker host, what comes to our mind? If you put it in terms of virtual machines that almost all of us are familiar with, we would then compare what a VM host is compared with a Docker host. A VM host is what the virtual machines actually run on top of. Typically, this is something like VMware ESXi if you are using VMware or Windows Server if you are using Hyper-V. Let's take a look at how they are as compared so that you can get a visual representation of the two, as shown in the following diagram: The preceding image depicts the similarities between a VM host and Docker host. As stated previously, the host of any service is simply the system that the underlying virtual machines or containers in Docker run on top of. Therefore, a host is the operating system or service that contains and operates the underlying systems that you install and set up a service on such as web servers, databases, and more. Virtualization and isolation To understand how Docker hosts can be secured, we must first understand how the Docker host is set up and what items are contained in the Docker host. Again, like VM hosts, they contain the operating system that the underlying service operates on. With VMs, you are creating a whole new operating system on top of this VM host operating system. However, on Docker, you are not doing that and are sharing the Linux kernel that the Docker host is using. Let's take a look at the following diagram to help us represent this: As we can see from the preceding image, there is a distinct difference between how items are set up on a VM host and on a Docker host. On a VM host, each virtual machine has all of its own items inclusive to itself. Each containerized application brings its own set of libraries, whether it is Windows or Linux. Now, on the Docker host, we don't see that. We see that they share the Linux kernel version that is being used on the Docker host. That being said, there are some security aspects that need to be addressed on the Docker host side of things. Now, on the VM host side, if someone does compromise a virtual machine, the operating system is isolated to just that one virtual machine. Back on the Docker host side of things, if the kernel is compromised on the Docker host, then all the containers running on that host are now at high risk as well. So, now you should see how important it is that we focus on security when it comes to Docker hosts. Docker hosts do use some isolation that will help protect against kernel or container compromises in a few ways. Two of these ways are by implementing namespaces and cgroups. Before we can discuss how they help, let's first give a definition for each of them. Kernel namespaces as they are commonly known as provide a form of isolation for the containers that will be running on your hosts. What does this mean? This means that each container that you run on top of your Docker hosts will be given its own network stack so that it doesn't get privileged access to another containers socket or interfaces. However, by default, all Docker containers are sitting on the bridged interface so that they can communicate with each other easily. Think of the bridged interface as a network switch that all the containers are connected to. Namespaces also provide isolation for processes and mount isolation. Processes running in one container can't affect or even see processes running in another Docker container. Isolation for mount points is also on a container by container basis. This means that mount points on one container can't see or interact with mount points on another container. On the other hand, control groups are what control and limit resources for containers that will be running on top of your Docker hosts. What does this boil down to, meaning how will it benefit you? It means that cgroups, as they will be called going forward, help each container get its fair share of memory disk I/O, CPU, and much more. So, a container cannot bring down an entire host by exhausting all the resources available on it. This will help to ensure that even if an application is misbehaving that the other containers won't be affected by this application and your other applications can be assured uptime. Attack surface of Docker daemon While Docker does ease some of the complicated work in the virtualization world, it is easy to forget to think about the security implications of running containers on your Docker hosts. The largest concern you need to be aware of is that Docker requires root privileges to operate. For this reason, you need to be aware of who has access to your Docker hosts and the Docker daemon as they will have full administrative access to all your Docker containers and images on your Docker host. They can start new containers, stop existing ones, remove images, pull new images, and even reconfigure running containers as well by injecting commands into them. They can also extract sensitive information like passwords and certificates from the containers. For this reason, make sure to also separate important containers if you do need to keep separate controls on who has access to your Docker daemon. This is for containers where people have a need for access to the Docker host where the containers are running. If a user needs API access then that is different and separation might not be necessary. For example, keep containers that are sensitive on one Docker host, while keeping normal operation containers running on another Docker host and grant permissions for other staff access to the Docker daemon on the unprivileged host. If possible, it is also recommended to drop the setuid and setgid capabilities from containers that will be running on your hosts. If you are going to run Docker, it's recommended to only use Docker on this server and not other applications. Docker also starts containers with a very restricted set of capabilities, which helps in your favor to address security concerns. To drop the setuid or setgid capabilities when you start a Docker container you will do something similar to the following: $ docker run -d --cap-drop SETGID --cap-drop SETUID nginx This would start the nginx container and would drop the SETGID and SETUID capabilities for the container. Docker's end goal is to map the root user to a non-root user that exists on the Docker host. They also are working towards allowing the Docker daemon to run without requiring root privileges. These future improvements will only help facilitate how much focus Docker does take when they are implementing their feature sets. Protecting the Docker daemon To protect the Docker daemon even more, we can secure the communications that our Docker daemon is using. We can do this by generating certificates and keys. There are are few terms to understand before we dive into the creation of the certificates and keys. A Certificate Authority (CA) is an entity that issues certificates. This certificate certifies the ownership of the public key by the subject that is specified in the certificate. By doing this, we can ensure that your Docker daemon will only accept communication from other daemons that have a certificate that was also signed by the same CA. Now, we will be looking at how to ensure that the containers you will be running on top of your Docker hosts will be secure in a few pages; however, first and foremost, you want to make sure the Docker daemon is running securely. To do this, there are some parameters you will need to enable for when the daemon starts. Some of the things you will need beforehand will be as follows: Create a CA. $ openssl genrsa -aes256 -out ca-key.pem 4096 Generating RSA private key, 4096 bit long modulus ......................................................................................................................................................................................................................++ ....................................................................++ e is 65537 (0x10001) Enter pass phrase for ca-key.pem: Verifying - Enter pass phrase for ca-key.pem: You will need to specify two values, pass phrase and pass phrase. This needs to be between 4 and 1023 characters. Anything less than 4 or more than 1023 won't be accepted. $ openssl req -new -x509 -days <number_of_days> -key ca-key.pem -sha256 -out ca.pem Enter pass phrase for ca-key.pem: You are about to be asked to enter information that will be incorporated into your certificate request. What you are about to enter is what is called a Distinguished Name or a DN. There are quite a few fields but you can leave some blank For some fields there will be a default value, If you enter '.', the field will be left blank. ----- Country Name (2 letter code) [AU]:US State or Province Name (full name) [Some-State]:Pennsylvania Locality Name (eg, city) []: Organization Name (eg, company) [Internet Widgits Pty Ltd]: Organizational Unit Name (eg, section) []: Common Name (e.g. server FQDN or YOUR name) []: Email Address []: There are a couple of items you will need. You will need pass phrase you entered earlier for ca-key.pem. You will also need the Country, State, city, Organization Name, Organizational Unit Name, fully qualified domain name (FQDN), and Email Address; to be able to finalize the certificate. Create a client key and signing certificate. $ openssl genrsa -out key.pem 4096 $ openssl req -subj '/CN=<client_DNS_name>' -new -key key.pem -out client.csr Sign the public key. $ openssl x509 -req -days <number_of_days> -sha256 -in client.csr -CA ca.pem -CAkey ca-key.pem -CAcreateserial -out cert.em Change permissions. $ chmod -v 0400 ca-key.pem key.pem server-key.em $ chmod -v 0444 ca.pem server-cert.pem cert.em Now, you can make sure that your Docker daemon only accepts connections from these that you provide the signed certificates to: $ docker daemon --tlsverify --tlscacert=ca.pem --tlscert=server-certificate.pem --tlskey=server-key.pem -H=0.0.0.0:2376 Make sure that the certificate files are in the directory you are running the command from or you will need to specify the full path to the certificate file. On each client, you will need to run the following: $ docker --tlsverify --tlscacert=ca.pem --tlscert=cert.pem --tlskey=key.pem -H=<$DOCKER_HOST>:2376 version Again, the location of the certificates is important. Make sure to either have them in a directory where you plan to run the preceding command or specify the full path to the certificate and key file locations. You can read more about using TLS by default with your Docker daemon by going to the following link:http://docs.docker.com/engine/articles/https/ For more reading on Docker Secure Deployment Guidelines, the following link provides a table that can be used to gain insight into some other items you can utilize as well:https://github.com/GDSSecurity/Docker-Secure-Deployment-Guidelines Some of the highlights from that website are: Collecting security and audit logs Utilizing the privileged switch when running Docker containers Device control groups Mount points Security audits Securing Docker hosts Where do we start to secure our hosts? What tools do we need to start with? We will take a look at using Docker Machine in this section and how to ensure the hosts that we are creating are being created in a secure manner. Docker hosts are like the front door of your house, if you don't secure them properly, then anybody can just walk right in. We will also take a look at Security-Enhanced Linux (SELinux) and AppArmor to ensure that you have an extra layer of security on top of the hosts that you are creating. Lastly, we will take a look at some of the operating systems that support and do auto patching of their operating systems when a security vulnerability is discovered. Docker Machine Docker Machine is the tool that allows you to install the Docker daemon onto your virtual hosts. You can then manage these Docker hosts with Docker Machine. Docker Machine can be installed either through the Docker Toolbox on Windows and Mac. If you are using Linux, you will install Docker Machine through a simple curl command: $ curl -L https://github.com/docker/machine/releases/download/v0.6.0/docker-machine-`uname -s`-`uname -m` > /usr/local/bin/docker-machine && $ chmod +x /usr/local/bin/docker-machine The first command installs Docker Machine into the /usr/local/bin directory and the second command changes the permissions on the file and sets it to executable. We will be using Docker Machine in the following walkthrough to set up a new Docker host. Docker Machine is what you should be or will be using to set up your hosts. For this reason, we will start with it to ensure your hosts are set up in a secure manner. We will take a look at how you can tell if your hosts are secure when you create them using the Docker Machine tool. Let's take a look at what it looks like when you create a Docker host using Docker Machine, as follows: $ docker-machine create --driver virtualbox host1 Running pre-create checks... Creating machine... Waiting for machine to be running, this may take a few minutes... Machine is running, waiting for SSH to be available... Detecting operating system of created instance... Provisioning created instance... Copying certs to the local machine directory... Copying certs to the remote machine...   Setting Docker configuration on the remote daemon... From the preceding output, as the create is running, it is doing things such as creating the machine, waiting for SSH to become available, performing actions, copying the certificates to the correct location, and setting up the Docker configuration, as follows: To see how to connect Docker to this machine, run: docker-machine env host1 $ docker-machine env host1 export DOCKER_TLS_VERIFY="1" export DOCKER_HOST="tcp://192.168.99.100:2376" export DOCKER_CERT_PATH="/Users/scottpgallagher/.docker/machine/machines/host1" export DOCKER_MACHINE_NAME="host1" # Run this command to configure your shell: # eval "$(docker-machine env host1)" The preceding commands output shows the commands that were run to set this machine up as the one that Docker commands will now run against:  eval "$(docker-machine env host1)" We can now run the regular Docker commands, such as docker info, and it will return information from host1, now that we have set it as our environment. We can see from the preceding highlighted output that the host is being set up secure from the start from two of the export lines. Here is the first highlighted line by itself: export DOCKER_TLS_VERIFY="1" From the other highlighted output, DOCKER_TLS_VERIFY is being set to 1 or true. Here is the second highlighted line by itself: export DOCKER_HOST="tcp://192.168.99.100:2376" We are setting the host to operate on the secure port of 2376 as opposed to the insecure port of 2375. We can also gain this information by running the following command: $ docker-machine ls NAME      ACTIVE   DRIVER       STATE     URL                         SWARM                     host1              *        virtualbox     Running   tcp://192.168.99.100:2376   Make sure to check the TLS switch options that can be used with Docker Machine if you have used the previous instructions to setup your Docker hosts and Docker containers to use TLS. These switches would be helpful if you have existing certificates that you want to use as well. These switches can be found by running the following command: $ docker-machine --help   Options:   --debug, -D      Enable debug mode   -s, --storage-path "/Users/scottpgallagher/.docker/machine"   Configures storage path [$MACHINE_STORAGE_PATH]   --tls-ca-cert      CA to verify remotes against [$MACHINE_TLS_CA_CERT]   --tls-ca-key      Private key to generate certificates [$MACHINE_TLS_CA_KEY]   --tls-client-cert     Client cert to use for TLS [$MACHINE_TLS_CLIENT_CERT]   --tls-client-key       Private key used in client TLS auth [$MACHINE_TLS_CLIENT_KEY]   --github-api-token     Token to use for requests to the Github API [$MACHINE_GITHUB_API_TOKEN]   --native-ssh      Use the native (Go-based) SSH implementation. [$MACHINE_NATIVE_SSH]   --help, -h      show help   --version, -v      print the version You can also regenerate TLS certificates for a machine using the regenerate-certs subcommand in the event that you want that peace of mind or that your keys do get compromised. An example command would look similar to the following command: $ docker-machine regenerate-certs host1    Regenerate TLS machine certs?  Warning: this is irreversible. (y/n): y Regenerating TLS certificates Copying certs to the local machine directory... Copying certs to the remote machine... Setting Docker configuration on the remote daemon... SELinux and AppArmor Most Linux operating systems are based on the fact that they can leverage SELinux or AppArmor for more advanced access controls to files or locations on the operating system. With these components, you can limit a containers ability to execute a program as the root user with root privileges. Docker does ship a security model template that comes with AppArmor and Red Hat comes with SELinux policies as well for Docker. You can utilize these provided templates to add an additional layer of security on top of your environments. For more information about SELinux and Docker, I would recommend visiting the following website: https://www.mankier.com/8/docker_selinux While, on the other hand, if you are in the market for some more reading on AppArmor and Docker, I would recommend visiting the following website: https://github.com/docker/docker/tree/master/contrib/apparmor Here you will find a template.go file, which is the template that Docker ships with its application that is the AppArmor template. Auto-patching hosts If you really want to get into advanced Docker hosts, then you could use CoreOS and Amazon Linux AMI, which perform auto-patching, both in a different way. While CoreOS will patch your operating system when a security update comes out and will reboot your operating system, the Amazon Linux AMI will complete the updates when you reboot. So when choosing which operating system to use when you are setting up your Docker hosts, make sure to take into account the fact that both of these operating system implement some form of auto-patching, but in a different way. You will want to make sure you are implementing some type of scaling or failover to address the needs of something that is running on CoreOS so that there is no downtime when a reboot occurs to patch the operating system. Summary In this article, we looked at how to secure our Docker hosts. The Docker hosts are the fire line of defense as they are the starting point where your containers will be running and communicating with each other and end users. If these aren't secure, then there is no purpose of moving forward with anything else. You learned how to set up the Docker daemon to run securely running TLS by generating the appropriate certificates for both the host and the clients. We also looked at the virtualization and isolation benefits of using Docker containers, but make sure to remember the attack surface of the Docker daemon too. Other items included how to use Docker Machine to easily create Docker hosts on secure operating systems with secure communication and ensure that they are being setup using secure methods when you use it to setup your containers. Using items such as SELinux and AppArmor also help to improve your security footprint as well. Lastly, we covered some Docker host operating systems that you can use for auto-patching as well, such as CoreOS and Amazon Linux AMI. Resources for Article:   Further resources on this subject: Understanding Docker [article] Hands On with Docker Swarm [article] Introduction to Docker [article]
Read more
  • 0
  • 0
  • 3190

article-image-creating-wcf-service-business-object-and-data-submission-silverlight-4
Packt
23 Apr 2010
10 min read
Save for later

Creating a WCF Service, Business Object and Data Submission with Silverlight 4

Packt
23 Apr 2010
10 min read
Data applications When building applications that utilize data, it is important to start with defining what data you are going to collect and how it will be stored once collected. In the last chapter, we created a Silverlight application to post a collection of ink strokes to the server. We are going to expand the inkPresenter control to allow a user to submit additional information. Most developers would have had experience building business object layers, and with Silverlight we can still make use of these objects, either by using referenced class projects/libraries or by consuming WCF services and utilizing the associated data contracts. Time for action – creating a business object We'll create a business object that can be used by both Silverlight and our ASP.NET application. To accomplish this, we'll create the business object in our ASP.NET application, define it as a data contract, and expose it to Silverlight via our WCF service. Start Visual Studio and open the CakeORamaData solution. When we created the solution, we originally created a Silverlight application and an ASP.NET web project. In the web project, add a reference to the System.Runtime.Serialization assembly. Right-click on the web project and choose to add a new class. Name this class ServiceObjects and click OK. In the ServiceObjects class file, replace the existing code with the following code: using System; using System.Runtime.Serialization; namespace CakeORamaData.Web { [DataContract] public class CustomerCakeIdea { [DataMember] public string CustomerName { get; set; } [DataMember] public string PhoneNumber { get; set; } [DataMember] public string Email { get; set; } [DataMember] public DateTime EventDate { get; set; } [DataMember] public StrokeInfo[] Strokes { get; set; } } [DataContract] public class StrokeInfo { [DataMember] public double Width { get; set; } [DataMember] public double Height { get; set; } [DataMember] public byte[] Color { get; set; } [DataMember] public byte[] OutlineColor { get; set; } [DataMember] public StylusPointInfo[] Points { get; set; } } [DataContract] public class StylusPointInfo { [DataMember] public double X { get; set; } [DataMember] public double Y { get; set; } } } What we are doing here is defining the data that we'll be collecting from the customer. What just happened? We just added a business object that will be used by our WCF service and our Silverlight application. We added serialization attributes to our class, so that it can be serialized with WCF and consumed by Silverlight. The [DataContract] and [DataMember] attributes are the serialization attributes that WCF will use when serializing our business object for transmission. WCF provides an opt-in model, meaning that types used with WCF must include these attributes in order to participate in serialization. The [DataContract] attribute is required, however if you wish to, you can use the [DataMember] attribute on any of the properties of the class. By default, WCF will use the System.Runtime.Serialization.DataContractSerialzer to serialize the DataContract classes into XML. The .NET Framework also provides a NetDataContractSerializer which includes CLR information in the XML or the JsonDataContractSerializer that will convert the object into JavaScript Object Notation (JSON). The WebGet attribute provides an easy way to define which serializer is used. For more information on these serializers and the WebGet attribute visit the following MSDN web sites: http://msdn.microsoft.com/en-us/library/system.runtime.serialization.datacontractserializer.aspx. http://msdn.microsoft.com/en-us/library/system.runtime.serialization.netdatacontractserializer.aspx. http://msdn.microsoft.com/en-us/library/system.runtime.serialization.json.datacontractjsonserializer.aspx. http://msdn.microsoft.com/en-us/library/system.servicemodel.web.webgetattribute.aspx. Windows Communication Foundation (WCF) Windows Communication Foundation (WCF) provides a simplified development experience for connected applications using the service oriented programming model. WCF builds upon and improves the web service model by providing flexible channels in which to connect and communicate with a web service. By utilizing these channels developers can expose their services to a wide variety of client applications such as Silverlight, Windows Presentation Foundation and Windows Forms. Service oriented applications provide a scalable and reusable programming model, allowing applications to expose limited and controlled functionality to a variety of consuming clients such as web sites, enterprise applications, smart clients, and Silverlight applications. When building WCF applications the service contract is typically defined by an interface decorated with attributes that declare the service and the operations. Using an interface allows the contract to be separated from the implementation and is the standard practice with WCF. You can read more about Windows Communication Foundation on the MSDN website at: http://msdn.microsoft.com/en-us/netframework/aa663324.aspx. Time for action – creating a Silverlight-enabled WCF service Now that we have our business object, we need to define a WCF service that can accept the business object and save the data to an XML file. With the CakeORamaData solution open, right-click on the web project and choose to add a new folder, rename it to Services. Right-click on the web project again and choose to add a new item. Add a new WCF Service named CakeService.svc to the Services folder. This will create an interface and implementation files for our WCF service. Avoid adding the Silverlight-enabled WCF service, as this adds a service that goes against the standard design patterns used with WCF: The standard design practice with WCF is to create an interface that defines the ServiceContract and OperationContracts of the service. The interface is then provided, a default implementation on the server. When the service is exposed through metadata, the interface will be used to define the operations of the service and generate the client classes. The Silverlight-enabled WCF service does not create an interface, just an implementation, it is there as a quick entry point into WCF for developers new to the technology. Replace the code in the ICakeService.cs file with the definition below. We are defining a contract with one operation that allows a client application to submit a CustomerCakeIdea instance: using System; using System.Collections.Generic; using System.Linq; using System.Runtime.Serialization; using System.ServiceModel; using System.Text; namespace CakeORamaData.Web.Services { // NOTE: If you change the interface name "ICakeService" here, you must also update the reference to "ICakeService" in Web.config. [ServiceContract] public interface ICakeService { [OperationContract] void SubmitCakeIdea(CustomerCakeIdea idea); } } The CakeService.svc.cs file will contain the implementation of our service interface. Add the following code to the body of the CakeService.svc.cs file to save the customer information to an XML file: using System; using System.ServiceModel.Activation; using System.Xml; namespace CakeORamaData.Web.Services { // NOTE: If you change the class name "CakeService" here, you must also update the reference to "CakeService" in Web.config. [AspNetCompatibilityRequirements(RequirementsMode = AspNetCompatibilityRequirementsMode.Allowed)] public class CakeService : ICakeService { public void SubmitCakeIdea(CustomerCakeIdea idea) { if (idea == null) return; using (var writer = XmlWriter.Create(String.Format(@"C: ProjectsCakeORamaCustomerData{0}.xml", idea.CustomerName))) { writer.WriteStartDocument(); //<customer> writer.WriteStartElement("customer"); writer.WriteAttributeString("name", idea.CustomerName); writer.WriteAttributeString("phone", idea.PhoneNumber); writer.WriteAttributeString("email", idea.Email); // <eventDate></eventDate> writer.WriteStartElement("eventDate"); writer.WriteValue(idea.EventDate); writer.WriteEndElement(); // <strokes> writer.WriteStartElement("strokes"); if (idea.Strokes != null && idea.Strokes.Length > 0) { foreach (var stroke in idea.Strokes) { // <stroke> writer.WriteStartElement("stroke"); writer.WriteAttributeString("width", stroke.Width. ToString()); writer.WriteAttributeString("height", stroke.Height. ToString()); writer.WriteStartElement("color"); writer.WriteAttributeString("a", stroke.Color[0]. ToString()); writer.WriteAttributeString("r", stroke.Color[1]. ToString()); writer.WriteAttributeString("g", stroke.Color[2]. ToString()); writer.WriteAttributeString("b", stroke.Color[3]. ToString()); writer.WriteEndElement(); writer.WriteStartElement("outlineColor"); writer.WriteAttributeString("a", stroke. OutlineColor[0].ToString()); writer.WriteAttributeString("r", stroke. OutlineColor[1].ToString()); writer.WriteAttributeString("g", stroke. OutlineColor[2].ToString()); writer.WriteAttributeString("b", stroke. OutlineColor[3].ToString()); writer.WriteEndElement(); if (stroke.Points != null && stroke.Points.Length > 0) { writer.WriteStartElement("points"); foreach (var point in stroke.Points) { writer.WriteStartElement("point"); writer.WriteAttributeString("x", point. X.ToString()); writer.WriteAttributeString("y", point. Y.ToString()); writer.WriteEndElement(); } writer.WriteEndElement(); } // </stroke> writer.WriteEndElement(); } } // </strokes> writer.WriteEndElement(); //</customer> writer.WriteEndElement(); writer.WriteEndDocument(); } } } } We added the AspNetCompatibilityRequirements attribute to our CakeService implementation. This attribute is required in order to use a WCF service from within ASP.NET. Open Windows Explorer and create the path C:ProjectsCakeORamaCustomerData on your hard drive to store the customer XML files. One thing to note is that you will need to grant write permission to this directory for the ASP.NET user account when in a production environment. When adding a WCF service through Visual Studio, binding information is added to the web.config file. The default binding for WCF is wsHttpBinding, which is not a valid binding for Silverlight. The valid bindings for Silverlight are basicHttpBinding, binaryHttpBinding (implemented with a customBinding), and netTcpBinding. We need to modify the web.config, so that Silverlight can consume the service. Open the web.config file and add this customBinding section to the <system.serviceModel> node: <bindings> <customBinding> <binding name="customBinding0"> <binaryMessageEncoding /> <httpTransport> <extendedProtectionPolicy policyEnforcement="Never" /> </httpTransport> </binding> </customBinding> </bindings> We'll need to change the <service> node in the web.config to use our new customBinding, (we use the customBinding to implement binary HTTP which sends the information as a binary stream to the service), rather than the wsHttpbinding from: <service behaviorConfiguration="CakeORamaData.Web.Services. CakeServiceBehavior" name="CakeORamaData.Web.Services.CakeService"> <endpoint address="" binding="wsHttpBinding" contract="CakeORamaData.Web.Services.ICakeService"> <identity> <dns value="localhost" /> </identity> </endpoint> <endpoint address="mex" binding="mexHttpBinding" contract="IM etadataExchange" /> </service> To the following: <service behaviorConfiguration="CakeORamaData.Web.Services. CakeServiceBehavior" name="CakeORamaData.Web.Services.CakeService"> <endpoint address="" binding="customBinding" bindingConfiguratio n="customBinding0" contract="CakeORamaData.Web.Services.ICakeService" /> <endpoint address="mex" binding="mexHttpBinding" contract="IMeta dataExchange" /> </service> Set the start page to the CakeService.svc file, then build and run the solution. We will be presented with the following screen, which lets us know that the service and bindings are set up correctly: Our next step is to add the service reference to Silverlight. On the Silverlight project, right-click on the References node and choose to Add a Service Reference: On the dialog that opens, click the Discover button and choose the Services in Solution option. Visual Studio will search the current solution for any services: Visual Studio will find our CakeService and all we have to do is change the Namespace to something that makes sense such as Services and click the OK button: We can see that Visual Studio has added some additional references and files to our project. Developers used to WCF or Web Services will notice the assembly references and the Service References folder: Silverlight creates a ServiceReferences.ClientConfig file that stores the configuration for the service bindings. If we open this file, we can take a look at the client side bindings to our WCF service. These bindings tell our Silverlight application how to connect to the WCF service and the URL where it is located: <configuration> <system.serviceModel> <bindings> <customBinding> <binding name="CustomBinding_ICakeService"> <binaryMessageEncoding /> <httpTransport maxReceivedMessageSize="2147483647" maxBufferSize="2147483647"> <extendedProtectionPolicy policyEnforcemen t="Never" /> </httpTransport> </binding> </customBinding> </bindings> <client> <endpoint address="http://localhost:2268/Services/ CakeService.svc" binding="customBinding" bindingConfiguration="Cust omBinding_ICakeService" contract="Services.ICakeService" name="CustomBinding_ICakeService" /> </client> </system.serviceModel> </configuration>
Read more
  • 0
  • 0
  • 3189
article-image-developing-opencart-themes
Packt
06 Jul 2015
13 min read
Save for later

Developing OpenCart Themes

Packt
06 Jul 2015
13 min read
In this article by Rupak Nepali, author of the book OpenCart Theme and Module Development, we learn about the basic features of OpenCart and its functionalities. Similarly, you will learn about global methods used in OpenCart. (For more resources related to this topic, see here.) The features of OpenCart The latest version of OpenCart at the time of writing this article is 2.0.1.1, which boasts of a multitude of great features: Modern and fully responsive design, OCMod (virtual file modification) A redesigned admin area and frontend More payment gateways included in the standard download Event notification system Custom form fields An unlimited module instance system to increase functionality Its pre-existing features include the following: Open source nature Templatable for changing the presentation section It also supports: Downloadable products Unlimited categories, products, manufacturers Multilanguage Multicurrency Product reviews and ratings PCI-compliant Automatic image resizing Multiple tax rates related products Unlimited information pages Shipping weight calculation Discount coupon system It is search engine optimized and has backup and restoration tools. It provides features such as printable invoices, sales reports, error logging, multistore features, multiple marketplace integration tools such as OpenBay Pro, and many more. Now, let's start with some basic general setting that will be helpful in creating our theme and module. Advantages of using Bootstrap in OpenCart themes The following can be the advantages of using Bootstrap in an OpenCart 2 theme: Speeds up development and saves time: There are many ready-made components, such as those available at http://getbootstrap.com/components/, which can be used directly in the template like we can use buttons, alert messages, many typography tables, forms, and many JavaScript functionalities. These are made responsive by default. So, there is no need to spend much time checking each device, which ultimately helps decrease development time and save time. Responsiveness: Bootstrap is made for devices of all shapes. So, using the conventions provided by bootstrap, it is easy to gain responsiveness in the site and design. Can upgrade easily: If we create our OpenCart theme with bootstrap, we can easily upgrade Bootstrap with little effort. There is no need to invest lots of time searching for upgrades of CSS and devices. Things to remember while making an OpenCart theme Here are a few things to take care of before stepping into HTML and CSS to create an OpenCart theme: In the header section, you can include a logo section, search section, currency section, language section, category menu section as well as mini-cart section. You can also include links to the home page, wish list page, account pages, shopping cart page, and checkout page. You can even show telephone numbers. These are provided by default. In the footer section, you can include links to information pages, customer service pages (such as a contact us page), a return page, a site map page, extra links (such as brands, gift vouchers, affiliates, and specials), and links to pages such as my account page, the order history, wish list, and newsletter. Include CSS modules in the style sheet, such as .box, .box .box-heading, .box .box-content, and so on, as clients can add many extra modules to fulfill their requirements. So, if we do not include these, then the design of the extra module may be hampered. Include CSS that supports three-column structure as well as right-column-activated-two-column structure, left-column-activated-two-column structure, and one-column structure in such a way that the following happens: if the left columns are deactivated, then the right-column structure is activated. If the right columns are deactivated, then the left-column structure is activated. Finally, if both the columns are deactivated, then one column structure is activated. The following diagram shows the four styles of a theme: Include only the modified files and folders. If the CSS does not find the referenced files, then it takes them from the default folder. Try to create a folder structure like what is shown in this screenshot: Prepare CSS for the buttons, checkout steps, cart pages, table, heading, and carousel. Steps to create new theme based on default theme The following steps help create a new theme based on the default theme: Navigate to the catalog/view/themefolder and create a new folder. Let's name it packttheme. Now navigate to the catalog/view/theme/defaultfolder, copy the image and stylesheetfolder, go to the catalog/view/theme/packtthemefolder, and paste it here. Go to the catalog/view/theme/packtthemefolder and create a new folder, named template. Next, navigate to the catalog/view/theme/default/template folder, copy the commonfolder, go to the catalog/view/theme/packttheme/templatefolder, and paste the commonfolder there. Now the folder structure looks like this: Open catalog/view/theme/packttheme/header.tpl in your favorite text editor, and find and replace the word default with packttheme. After performing the replacement, save header.tpl, and your default base theme is ready for use. Now log in to the admin section and go to Administrator | System | Setting. Edit the store for which you wish to change the theme, and click on the Store tab. Then choose packttheme in the template select box. Next, click on save. Now refresh the frontend, and packttheme will be activated. Now we can make changes to the CSS or the template files, and see the changes. Sometimes, we need theme-specific JavaScript. In such cases, we can create a javascript folder, or another similar folder as per the requirement. Global library methods OpenCart has many predefined methods that can be called anywhere, for example, in controller, model, as well as view template files. You can find system-level library files at system/library/. We will show you how methods can be written and what their functions are. Affiliate (affiliate.php) You can find most of the affiliate code written in the affiliate section. You can check out the files at catalog/controller/affiliate/ and catalog/model/affiliate/. Here is a list of methods we can use for the affiliate library: When an e-mail and password are passed to this method, it logs in to the affiliate section if the username (e-mail) and password match among the affiliates. You can find this code at catalog/controller/affiliate/ login.php on validate method: $this->affiliate->login($email, $password); The affiliate gets logged out. This means the affiliate ID is cleared and its session is destroyed. Also, the affiliate's first name, last name, e-mail, telephone, and fax are given an empty value: $this->affiliate->logout(); Check whether the affiliate is logged in. If you like to show a message to the logged-in affiliate only, then you can use this code: $this->affiliate->isLogged(); When we echo the following line, it will show the ID of the active affiliate:: if ($this->affiliate->isLogged()){echo "Welcome to the Affiliate Section";} else {echo "You are not at Affiliate Section";}$this->affiliate->getId(); Changes made in the catalog folder The following changes need to be made in the catalogfolder: Go to catalog/model/shipping and copy weight.php. Paste it in the same folder and rename it to totalcost.php. Open it and find the following line: classModelShippingWeight extends Model { Change the class name to this: classModelShippingTotalcost extends Model { Now, find weightand replace all its occurrences with totalcost. After performing the replacement, find the following lines of code: $totalcost = $this->cart->gettotalcost(); Make the changes as shown here: $totalcost = $this->cart->getSubTotal(); Our requirement is to show the shipping cost as per the total cost purchased, so we have made the change you just saw. Now, find these lines of code: if ((string)$cost != '') { $quote_data['totalcost_' . $result['geo_zone_id']] = array( 'code'=>'totalcost.totalcost_'.$result['geo_zone_id'], 'title'=>$result['name'].'('.$this->language->get('text_ totalcost') . ' ' . $this->totalcost->format($totalcost, $this- >config->get('config_totalcost_class_id')) . ')', 'cost' => $cost, 'tax_class_id' => $this->config->get('totalcost_tax_class_id'), 'text'=> $this->currency->format($this->tax->calculate($cost, $this->config->get('totalcost_tax_class_id'), $this->config- >get('config_tax'))) ); } In them, consider the following lines: 'title'=>$result['name'].'('.$this->language->get('text_totalcost') . ' ' . $this->totalcost->format($totalcost, $this->config->get('config_totalcost_class_id')) . ')', Make this change, as we only need the name: 'title' => $result['name'], Weight has different classes, such as kilogram, gram, pound, and so on, but in our total cost purchased, we did not have any class specified so we removed it. Save the file. Go to catalog/language/english/shipping and copy weight.php. Paste it in the same folder and rename it to totalcost.php. Open it, find Weight, and replace it with Total Cost. With these changes, the module is ready to install. Go to Admin | Extensions | Shipping, and then find Total Cost Based Shipping. Click on Install and grant permission to modify and access to the user. Then, edit to con figure it. In the General tab, change the status to Enabled. Other tabs are loaded as per the geo zones setting. The default geo zones for OpenCart are set as UK Shipping and UK VAT. Now, insert the value for Total Cost versus Rates. If the subtotal reaches 25, then the shipping cost is 10; if it reaches 50, then the shipping cost is 12; and if it reaches 100, then the shipping cost is 15. So, we have inserted 25:10, 50:12, 100:15. If the customer tries to order more than the inserted total cost, then no shipping is activated. In this way, you can now clone the shipping modules and make changes to the logic as per your requirement. Database tables for feedback Let's begin by creating tables in the database. We all know that OpenCart has multistore abilities, supports multiple languages, and can be displayed in multiple layouts, so when creating a database we should take this into consideration. Four tables are created: feedback, feedback_description, feedback_to_layout, and feedback_to_store. A feedback table is created for saving feedback-related data, and feedback_description is created for storing multiple-language data for the feedback. A feedback_to_layout table is created for saving the association of layout to feedback, and a feedback_to_store table is created for saving the association of store to feedback. When we install OpenCart, we use oc_ as a database prefix as shown in the following image and query. A database prefix is defined in config.php where mostly you find something such as define('DB_PREFIX', 'oc_'); , or what you entered when installing OpenCart. We create the oc_feedback table. It saves the status, sort order, date added, and feedback ID. Then we create the oc_feedback_description table, where we will save the feedback writer's name, feedback given, and language ID, for multiple languages. Then we create the oc_feedback_to_store table to save the store ID and feedback ID and keep the relationship between feedback and whichever store's feedback is to be shown. Finally, we create the oc_feedback_to_layout table to save the feedback_id and layout_id to show the feedback for the layout you want. This diagram shows the database schema: Creating the template file for the frontend Go to catalog/view/theme/default/template and create a feedback folder. Then create a feedback.tpl file and insert this code: <?php echo $header; ?><div class="container"><ul class="breadcrumb"><?php foreach ($breadcrumbs as $breadcrumb) { ?><li><a href="<?php echo $breadcrumb['href']; ?>"><?php echo$breadcrumb['text']; ?></a></li><?php } ?></ul> The preceding code shows the breadcrumbs array. The following code shows the list of feedback: <div class="row"><?php echo $column_left; ?><?php if ($column_left && $column_right) { ?><?php $class = 'col-sm-6'; ?><?php } elseif ($column_left || $column_right) { ?><?php $class = 'col-sm-9'; ?><?php } else { ?><?php $class = 'col-sm-12'; ?><?php } ?><div id="content" class="<?php echo $class; ?>"><?php echo $content_top; ?><h1><?php echo $heading_title; ?></h1><?php foreach ($feedbacks as $feedback) { ?><div class="col-xs-12"><div class="row"><h4><?php echo $feedback['author']; ?></h4><p><?php echo $feedback['description']; ?></p><hr /></div></div><?php } ?> This shows the list of feedback authors and descriptions, as shown in the following screenshot: To show the pagination in the template file, we have to insert the following lines of code into whichever part we want to show the pagination in: <div class="row"><div class="col-sm-6 text-left"><?php echo $pagination; ?></div><div class="col-sm-6 text-right"><?php echo $results; ?></div></div> It shows the pagination in the template file, and we mostly show the pagination at the bottom, so paste it at the end of the feedback.tpl file: <?php if (!$feedbacks) { ?><p><?php echo $text_empty; ?></p><div class="buttons"><div class="pull-right"><a href="<?php echo $continue; ?>"class="btn btn-primary"><?php echo $button_continue; ?></a></div></div><?php } ?> If there are no feedbacks, then a message similar to There are no feedbacks to showis shown, as per the language file: <?php echo $content_bottom; ?></div><?php echo $column_right; ?></div></div><?php echo $footer; ?> In this way, the template file is completed and so is our feedback management. You can create a module and show it as a module as well. To view the list of feedback at the frontend, we have to use a link similar to http://www.example.com/index.php?route=feedback/feedback, and insert the link somewhere in the templates so that visitors will be able to see the feedback list. Like this, you can extend to show the feedback as a module, and create a form at the frontend from which visitors can submit feedback. You can find these at demo code. Try this first and check out the code if you need any help. We have made the code files as descriptive as possible. Summary Using OpenCart themes, you can customize the presentation layer of OpenCart. Likewise, if you can code OpenCart's extensions or modules, then you can also customize the functionality of the OpenCart e-commerce framework and make an e-commerce site easier to administer and look better Resources for Article: Further resources on this subject: OpenCart Themes: Using the jCarousel Plugin [Article] Implementing OpenCart Modules [Article] OpenCart Themes: Styling Effects of jQuery Plugins [Article]
Read more
  • 0
  • 0
  • 3188

Packt
17 Feb 2016
29 min read
Save for later

Spark – Architecture and First Program

Packt
17 Feb 2016
29 min read
In this article by Sumit Gupta and Shilpi Saxena, the authors of Real-Time Big Data Analytics, we will discuss the architecture of Spark and its various components in detail. We will also briefly talk about the various extensions/libraries of Spark, which are developed over the core Spark framework. (For more resources related to this topic, see here.) Spark is a general-purpose computing engine that initially focused to provide solutions to the iterative and interactive computations and workloads. For example, machine learning algorithms, which reuse intermediate or working datasets across multiple parallel operations. The real challenge with iterative computations was the dependency of the intermediate data/steps on the overall job. This intermediate data needs to be cached in the memory itself for faster computations because flushing and reading from a disk was an overhead, which, in turn, makes the overall process unacceptably slow. The creators of Apache Spark not only provided scalability, fault tolerance, performance, distributed data processing but also provided in-memory processing of distributed data over the cluster of nodes. To achieve this, a new layer abstraction of distributed datasets that is partitioned over the set of machines (cluster) was introduced, which can be cached in the memory to reduce the latency. This new layer of abstraction was known as resilient distributed datasets (RDD). RDD, by definition, is an immutable (read-only) collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. It is important to note that Spark is capable of performing in-memory operations, but at the same time, it can also work on the data stored on the disk. High-level architecture Spark provides a well-defined and layered architecture where all its layers and components are loosely coupled and integration with external components/libraries/extensions is performed using well-defined contracts. Here is the high-level architecture of Spark 1.5.1 and its various components/layers: The preceding diagram shows the high-level architecture of Spark. Let's discuss the roles and usage of each of the architecture components: Physical machines: This layer represents the physical or virtual machines/nodes on which Spark jobs are executed. These nodes collectively represent the total capacity of the cluster with respect to the CPU, memory, and data storage. Data storage layer: This layer provides the APIs to store and retrieve the data from the persistent storage area to Spark jobs/applications. This layer is used by Spark workers to dump data on the persistent storage whenever the cluster memory is not sufficient to hold the data. Spark is extensible and capable of using any kind of filesystem. RDD, which hold the data, are agnostic to the underlying storage layer and can persist the data in various persistent storage areas, such as local filesystems, HDFS, or any other NoSQL database such as HBase, Cassandra, MongoDB, S3, and Elasticsearch. Resource manager: The architecture of Spark abstracts out the deployment of the Spark framework and its associated applications. Spark applications can leverage cluster managers such as YARN (http://tinyurl.com/pcymnnf) and Mesos (http://mesos.apache.org/) for the allocation and deallocation of various physical resources, such as the CPU and memory for the client jobs. The resource manager layer provides the APIs that are used to request for the allocation and deallocation of available resource across the cluster. Spark core libraries: The Spark core library represents the Spark Core engine, which is responsible for the execution of the Spark jobs. It contains APIs for in-memory distributed data processing and a generalized execution model that supports a wide variety of applications and languages. Spark extensions/libraries: This layer represents the additional frameworks/APIs/libraries developed by extending the Spark core APIs to support different use cases. For example, Spark SQL is one such extension, which is developed to perform ad hoc queries and interactive analysis over large datasets. The preceding architecture should be sufficient enough to understand the various layers of abstraction provided by Spark. All the layers are loosely coupled, and if required, can be replaced or extended as per the requirements. Spark extensions is one such layer that is widely used by architects and developers to develop custom libraries. Let's move forward and talk more about Spark extensions, which are available for developing custom applications/jobs. Spark extensions/libraries In this section, we will briefly discuss the usage of various Spark extensions/libraries that are available for different use cases. The following are the extensions/libraries available with Spark 1.5.1: Spark Streaming: Spark Streaming, as an extension, is developed over the core Spark API. It enables scalable, high-throughput, and fault-tolerant stream processing of live data streams. Spark Streaming enables the ingestion of data from various sources, such as Kafka, Flume, Kinesis, or TCP sockets. Once the data is ingested, it can be further processed using complex algorithms that are expressed with high-level functions, such as map, reduce, join, and window. Finally, the processed data can be pushed out to filesystems, databases, and live dashboards. In fact, Spark Streaming also facilitates the usage Spark's machine learning and graph processing algorithms on data streams. For more information, refer to http://spark.apache.org/docs/latest/streaming-programming-guide.html. Spark MLlib: Spark MLlib is another extension that provides the distributed implementation of various machine learning algorithms. Its goal is to make practical machine learning library scalable and easy to use. It provides implementation of various common machine learning algorithms used for classification, regression, clustering, and many more. For more information, refer to http://spark.apache.org/docs/latest/mllib-guide.html. Spark GraphX: GraphX provides the API to create a directed multigraph with properties attached to each vertex and edges. It also provides the various common operators used for the aggregation and distributed implementation of various graph algorithms, such as PageRank and triangle counting. For more information, refer to http://spark.apache.org/docs/latest/graphx-programming-guide.html. Spark SQL: Spark SQL provides the distributed processing of structured data and facilitates the execution of relational queries, which are expressed in a structured query language. (http://en.wikipedia.org/wiki/SQL). It provides the high level of abstraction known as DataFrames, which is a distributed collection of data organized into named columns. For more information, refer to http://spark.apache.org/docs/latest/sql-programming-guide.html. SparkR: R (https://en.wikipedia.org/wiki/R_(programming_language) is a popular programming language used for statistical computing and performing machine learning tasks. However, the execution of the R language is single threaded, which makes it difficult to leverage in order to process large data (TBs or PBs). R can only process the data that fits into the memory of a single machine. In order to overcome the limitations of R, Spark introduced a new extension: SparkR. SparkR provides an interface to invoke and leverage Spark distributed execution engine from R, which allows us to run large-scale data analysis from the R shell. For more information, refer to http://spark.apache.org/docs/latest/sparkr.html. All the previously listed Spark extension/libraries are part of the standard Spark distribution. Once we install and configure Spark, we can start using APIs that are exposed by the extensions. Apart from the earlier extensions, Spark also provides various other external packages that are developed and provided by the open source community. These packages are not distributed with the standard Spark distribution, but they can be searched and downloaded from http://spark-packages.org/. Spark packages provide libraries/packages for integration with various data sources, management tools, higher level domain-specific libraries, machine learning algorithms, code samples, and other Spark content. Let's move on to the next section where we will dive deep into the Spark packaging structure and execution model, and we will also talk about various other Spark components. Spark packaging structure and core APIs In this section, we will briefly talk about the packaging structure of the Spark code base. We will also discuss core packages and APIs, which will be frequently used by the architects and developers to develop custom applications with Spark. Spark is written in Scala (http://www.scala-lang.org/), but for interoperability, it also provides the equivalent APIs in Java and Python as well. For brevity, we will only talk about the Scala and Java APIs, and for Python APIs, users can refer to https://spark.apache.org/docs/1.5.1/api/python/index.html. A high-level Spark code base is divided into the following two packages: Spark extensions: All APIs for a particular extension is packaged in its own package structure. For example, all APIs for Spark Streaming are packaged in the org.apache.spark.streaming.* package, and the same packaging structure goes for other extensions: Spark MLlib—org.apache.spark.mllib.*, Spark SQL—org.apcahe.spark.sql.*, Spark GraphX—org.apache.spark.graphx.*. For more information, refer to http://tinyurl.com/q2wgar8 for Scala APIs and http://tinyurl.com/nc4qu5l for Java APIs. Spark Core: Spark Core is the heart of Spark and provides two basic components: SparkContext and SparkConfig. Both of these components are used by each and every standard or customized Spark job or Spark library and extension. The terms/concepts Context and Config are not new and more or less they have now become a standard architectural pattern. By definition, a Context is an entry point of the application that provides access to various resources/features exposed by the framework, whereas a Config contains the application configurations, which helps define the environment of the application. Let's move on to the nitty-gritty of the Scala APIs exposed by Spark Core: org.apache.spark: This is the base package for all Spark APIs that contains a functionality to create/distribute/submit Spark jobs on the cluster. org.apache.spark.SparkContext: This is the first statement in any Spark job/application. It defines the SparkContext and then further defines the custom business logic that is is provided in the job/application. The entry point for accessing any of the Spark features that we may want to use or leverage is SparkContext, for example, connecting to the Spark cluster, submitting jobs, and so on. Even the references to all Spark extensions are provided by SparkContext. There can be only one SparkContext per JVM, which needs to be stopped if we want to create a new one. The SparkContext is immutable, which means that it cannot be changed or modified once it is started. org.apache.spark.rdd.RDD.scala: This is another important component of Spark that represents the distributed collection of datasets. It exposes various operations that can be executed in parallel over the cluster. The SparkContext exposes various methods to load the data from HDFS or the local filesystem or Scala collections, and finally, create an RDD on which various operations such as map, filter, join, and persist can be invoked. RDD also defines some useful child classes within the org.apache.spark.rdd.* package such as PairRDDFunctions to work with key/value pairs, SequenceFileRDDFunctions to work with Hadoop sequence files, and DoubleRDDFunctions to work with RDDs of doubles. We will read more about RDD in the subsequent sections. org.apache.spark.annotation: This package contains the annotations, which are used within the Spark API. This is the internal Spark package, and it is recommended that you do not to use the annotations defined in this package while developing our custom Spark jobs. The three main annotations defined within this package are as follows: DeveloperAPI: All those APIs/methods, which are marked with DeveloperAPI, are for advance usage where users are free to extend and modify the default functionality. These methods may be changed or removed in the next minor or major releases of Spark. Experimental: All functions/APIs marked as Experimental are officially not adopted by Spark but are introduced temporarily in a specific release. These methods may be changed or removed in the next minor or major releases. AlphaComponent: The functions/APIs, which are still being tested by the Spark community, are marked as AlphaComponent. These are not recommended for production use and may be changed or removed in the next minor or major releases. org.apache.spark.broadcast: This is one of the most important packages, which are frequently used by developers in their custom Spark jobs. It provides the API for sharing the read-only variables across the Spark jobs. Once the variables are defined and broadcast, they cannot be changed. Broadcasting the variables and data across the cluster is a complex task, and we need to ensure that an efficient mechanism is used so that it improves the overall performance of the Spark job and does not become an overhead. Spark provides two different types of implementations of broadcasts—HttpBroadcast and TorrentBroadcast. The HttpBroadcast broadcast leverages the HTTP server to fetch/retrieve the data from Spark driver. In this mechanism, the broadcast data is fetched through an HTTP Server running at the driver itself and further stored in the executor block manager for faster accesses. The TorrentBroadcast broadcast, which is also the default implementation of the broadcast, maintains its own block manager. The first request to access the data makes the call to its own block manager, and if not found, the data is fetched in chunks from the executor or driver. It works on the principle of BitTorrent and ensures that the driver is not the bottleneck in fetching the shared variables and data. Spark also provides accumulators, which work like broadcast, but provide updatable variables shared across the Spark jobs but with some limitations. You can refer to https://spark.apache.org/docs/1.5.1/api/scala/index.html#org.apache.spark.Accumulator. org.apache.spark.io: This provides implementation of various compression libraries, which can be used at block storage level. This whole package is marked as Developer API, so developers can extend and provide their own custom implementations. By default, it provides three implementations: LZ4, LZF, and Snappy. org.apache.spark.scheduler: This provides various scheduler libraries, which help in job scheduling, tracking, and monitoring. It defines the directed acyclic graph (DAG) scheduler (http://en.wikipedia.org/wiki/Directed_acyclic_graph). The Spark DAG scheduler defines the stage-oriented scheduling where it keeps track of the completion of each RDD and the output of each stage and then computes DAG, which is further submitted to the underlying org.apache.spark.scheduler.TaskScheduler API that executes them on the cluster. org.apache.spark.storage: This provides APIs for structuring, managing, and finally, persisting the data stored in RDD within blocks. It also keeps tracks of data and ensures that it is either stored in memory, or if the memory is full, it is flushed to the underlying persistent storage area. org.apache.spark.util: These are the utility classes used to perform common functions across the Spark APIs. For example, it defines MutablePair, which can be used as an alternative to Scala's Tuple2 with the difference that MutablePair is updatable while Scala's Tuple2 is not. It helps in optimizing memory and minimizing object allocations. Spark execution model – master worker view Let's move on to the next section where we will dive deep into the Spark execution model, and we will also talk about various other Spark components. Spark essentially enables the distributed in-memory execution of a given piece of code. We discussed the Spark architecture and its various layers in the previous section. Let's also discuss its major components, which are used to configure the Spark cluster, and at the same time, they will be used to submit and execute our Spark jobs. The following are the high-level components involved in setting up the Spark cluster or submitting a Spark job: Spark driver: This is the client program, which defines SparkContext. The entry point for any job that defines the environment/configuration and the dependencies of the submitted job is SparkContext. It connects to the cluster manager and requests resources for further execution of the jobs. Cluster manager/resource manager/Spark master: The cluster manager manages and allocates the required system resources to the Spark jobs. Furthermore, it coordinates and keeps track of the live/dead nodes in a cluster. It enables the execution of jobs submitted by the driver on the worker nodes (also called Spark workers) and finally tracks and shows the status of various jobs running by the worker nodes. Spark worker/executors: A worker actually executes the business logic submitted by the Spark driver. Spark workers are abstracted and are allocated dynamically by the cluster manager to the Spark driver for the execution of submitted jobs. The following diagram shows the high-level components and the master worker view of Spark: The preceding diagram depicts the various components involved in setting up the Spark cluster, and the same components are also responsible for the execution of the Spark job. Although all the components are important, but let's briefly discuss the cluster/resource manager, as it defines the deployment model and allocation of resources to our submitted jobs. Spark enables and provides flexibility to choose our resource manager. As of Spark 1.5.1, the following are the resource managers or deployment models that are supported by Spark: Apache Mesos: Apache Mesos (http://mesos.apache.org/) is a cluster manager that provides efficient resource isolation and sharing across distributed applications or frameworks. It can run Hadoop, MPI, Hypertable, Spark, and other frameworks on a dynamically shared pool of nodes. Apache Mesos and Spark are closely related to each other (but they are not the same). The story started way back in 2009 when Mesos was ready and there were talks going on about the ideas/frameworks that can be developed on top of Mesos, and that's exactly how Spark was born. Refer to http://spark.apache.org/docs/latest/running-on-mesos.html for more information on running Spark jobs on Apache Mesos. Hadoop YARN: Hadoop 2.0 (http://tinyurl.com/qypb4xm), also known as YARN, was a complete change in the architecture. It was introduced as a generic cluster computing framework that was entrusted with the responsibility of allocating and managing the resources required to execute the varied jobs or applications. It introduced new daemon services, such as the resource manager (RM), node manager (NM), and application master (AM), which are responsible for managing cluster resources, individual nodes, and respective applications. YARN also introduced specific interfaces/guidelines for application developers where they can implement/follow and submit or execute their custom applications on the YARN cluster. The Spark framework implements the interfaces exposed by YARN and provides the flexibility of executing the Spark applications on YARN. Spark applications can be executed in the following two different modes in YARN: YARN client mode: In this mode, the Spark driver executes the client machine (the machine used for submitting the job), and the YARN application master is just used for requesting the resources from YARN. All our logs and sysouts (println) are printed on the same console, which is used to submit the job. YARN cluster mode: In this mode, the Spark driver runs inside the YARN application master process, which is further managed by YARN on the cluster, and the client can go away just after submitting the application. Now as our Spark driver is executed on the YARN cluster, our application logs/sysouts (println) are also written in the log files maintained by YARN and not on the machine that is used to submit our Spark job. For more information on executing Spark applications on YARN, refer to http://spark.apache.org/docs/latest/running-on-yarn.html. Standalone mode: The Core Spark distribution contains the required APIs to create an independent, distributed, and fault tolerant cluster without any external or third-party libraries or dependencies. Local mode: Local mode should not be confused with standalone mode. In local mode, Spark jobs can be executed on a local machine without any special cluster setup by just passing local[N] as the master URL, where N is the number of parallel threads. Writing and executing our first Spark program In this section, we will install/configure and write our first Spark program in Java and Scala. Hardware requirements Spark supports a variety of hardware and software platforms. It can be deployed on commodity hardware and also supports deployments on high-end servers. Spark clusters can be provisioned either on cloud or on-premises. Though there is no single configuration or standards, which can guide us through the requirements of Spark, but to create and execute Spark examples provided in this article, it would be good to have a laptop/desktop/server with the following configuration: RAM: 8 GB. CPU: Dual core or Quad core. DISK: SATA drives with a capacity of 300 GB to 500 GB with 15 k RPM. Operating system: Spark supports a variety of platforms that include various flavors of Linux (Ubuntu, HP-UX, RHEL, and many more) and Windows. For our examples, we will recommend that you use Ubuntu for the deployment and execution of examples. Spark core is coded in Scala, but it offers several development APIs in different languages, such as Scala, Java, and Python, so that developers can choose their preferred weapon for coding. The dependent software may vary based on the programming languages but still there are common sets of software for configuring the Spark cluster and then language-specific software for developing Spark jobs. In the next section, we will discuss the software installation steps required to write/execute Spark jobs in Scala and Java on Ubuntu as the operating system. Installation of the basic softwares In this section, we will discuss the various steps required to install the basic software, which will help us in the development and execution of our Spark jobs. Spark Perform the following steps to install Spark: Download the Spark compressed tarball from http://d3kbcqa49mib13.cloudfront.net/spark-1.5.1-bin-hadoop2.4.tgz. Create a new directory spark-1.5.1 on your local filesystem and extract the Spark tarball into this directory. Execute the following command on your Linux shell in order to set SPARK_HOME as an environment variable: export SPARK_HOME=<Path of Spark install Dir> Now, browse your SPARK_HOME directory and it should look similar to the following screenshot: Java Perform the following steps to install Java: Download and install Oracle Java 7 from http://www.oracle.com/technetwork/java/javase/install-linux-self-extracting-138783.html. Execute the following command on your Linux shell to set JAVA_HOME as an environment variable: export JAVA_HOME=<Path of Java install Dir> Scala Perform the following steps to install Scala: Download the Scala 2.10.5 compressed tarball from http://downloads.typesafe.com/scala/2.10.5/scala-2.10.5.tgz?_ga=1.7758962.1104547853.1428884173. Create a new directory, Scala 2.10.5, on your local filesystem and extract the Scala tarball into this directory. Execute the following commands on your Linux shell to set SCALA_HOME as an environment variable, and add the Scala compiler to the $PATH system: export SCALA_HOME=<Path of Scala install Dir> Next, execute the command in the following screenshot to ensure that the Scala runtime and Scala compiler is available and the version is 2.10.x: Spark 1.5.1 supports the 2.10.5 version of Scala, so it is advisable to use the same version to avoid any runtime exceptions due to mismatch of libraries. Eclipse Perform the following steps to install Eclipse: Based on your hardware configuration, download Eclipse Luna (4.4) from http://www.eclipse.org/downloads/packages/eclipse-ide-java-eedevelopers/lunasr2: Next, install the IDE for Scala in Eclipse itself so that we can write and compile our Scala code inside Eclipse (http://scala-ide.org/download/current.html). We are now done with the installation of all the required software. Let's move on and configure our Spark cluster. Configuring the Spark cluster The first step to configure the Spark cluster is to identify the appropriate resource manager. We discussed the various resource managers in the Spark execution model – master worker view section (Yarn, Mesos, and standalone). Standalone is the most preferred resource manager for development because it is simple/quick and does not require installation of any other component or software. We will also configure the standalone resource manager for all our Spark examples, and for more information on Yarn and Mesos, refer to the Spark execution model – master worker view section. Perform the following steps to bring up an independent cluster using Spark binaries: The first step to set up the Spark cluster is to bring up the master node, which will track and allocate the systems' resource. Open your Linux shell and execute the following command: $SPARK_HOME/sbin/start-master.sh The preceding command will bring up your master node, and it will also enable a UI, the Spark UI to monitor the nodes/jobs in the Spark cluster, http://<host>:8080/. The <host> is the domain name of the machine on which the master is running. Next, let's bring up our worker node, which will execute our Spark jobs. Execute the following command on the same Linux shell: $SPARK_HOME/bin/spark-class org.apache.spark.deploy.worker.Worker <Spark-Master> & In the preceding command, replace the <Spark-Master> with the Spark URL, which is shown at the top of the Spark UI, just beside Spark master at. The preceding command will start the Spark worker process in the background and the same will also be reported in the Spark UI. The Spark UI shown in the preceding screenshot shows the three different sections, providing the following information: Workers: This reports the health of a worker node, which is alive or dead and also provides drill-down to query the status and details logs of the various jobs executed by that specific worker node Running applications: This shows the applications that are currently being executed in the cluster and also provides drill-down and enables viewing of application logs Completed application: This is the same functionality as running applications; the only difference being that it shows the jobs, which are finished We are done!!! Our Spark cluster is up and running and ready to execute our Spark jobs with one worker node. Let's move on and write our first Spark application in Scala and Java and further execute it on our newly created cluster. Coding Spark job in Scala In this section, we will code our first Spark job in Scala, and we will also execute the same job on our newly created Spark cluster and will further analyze the results. This is our first Spark job, so we will keep it simple. We will use the Chicago crimes dataset for August 2015and will count the number of crimes reported in August 2015. Perform the following steps to code the Spark job in Scala for aggregating the number of crimes in August 2015: Open Eclipse and create a Scala project called Spark-Examples. Expand your newly created project and modify the version of the Scala library container to 2.10. This is done to ensure that the version of Scala libraries used by Spark and the custom jobs developed/deployed are the same. Next, open the properties of your project Spark-Examples and add the dependencies for the all libraries packaged with the Spark distribution, which can be found at $SPARK_HOME/lib. Next, create a chapter.six Scala package, and in this package, define a new Scala object by the name of ScalaFirstSparkJob. Define a main method in the Scala object and also import SparkConfand SparkContext. Now, add the following code to the main method of ScalaFirstSparkJob: object ScalaFirstSparkJob { def main(args: Array[String]) { println("Creating Spark Configuration") //Create an Object of Spark Configuration val conf = new SparkConf() //Set the logical and user defined Name of this Application conf.setAppName("My First Spark Scala Application") println("Creating Spark Context") //Create a Spark Context and provide previously created //Object of SparkConf as an reference. val ctx = new SparkContext(conf) //Define the location of the file containing the Crime Data val file = "file:///home/ec2-user/softwares/crime-data/ Crimes_-Aug-2015.csv"; println("Loading the Dataset and will further process it") //Loading the Text file from the local file system or HDFS //and converting it into RDD. //SparkContext.textFile(..) - It uses the Hadoop's //TextInputFormat and file is broken by New line Character. //Refer to http://hadoop.apache.org/docs/r2.6.0/api/org/ apache/hadoop/mapred/TextInputFormat.html //The Second Argument is the Partitions which specify the parallelism. //It should be equal or more then number of Cores in the cluster. val logData = ctx.textFile(file, 2) //Invoking Filter operation on the RDD, and counting the number of lines in the Data loaded in RDD. //Simply returning true as "TextInputFormat" have already divided the data by "\n" //So each RDD will have only 1 line. val numLines = logData.filter(line => true).count() //Finally Printing the Number of lines. println("Number of Crimes reported in Aug-2015 = " + numLines) } } We are now done with the coding! Our first Spark job in Scala is ready for execution. Now, from Eclipse itself, export your project as a .jar fie, name it spark-examples.jar, and save this .jar file in the root of $SPARK_HOME. Next, open your Linux console, go to $SPARK_HOME, and execute the following command: $SPARK_HOME/bin/spark-submit --class chapter.six.ScalaFirstSparkJob --master spark://ip-10-166-191-242:7077 spark-examples.jar In the preceding command, ensure that the value given to --masterparameter is the same as it is shown on your Spark UI. The Spark-submit is a utility script, which is used to submit the Spark jobs to the cluster. As soon as you click on Enter and execute the preceding command, you will see lot of activity (log messages) on the console, and finally, you will see the output of your job at the end: Isn't that simple! As we move forward and discuss Spark more, you will appreciate the ease of coding and simplicity provided by Spark for creating, deploying, and running jobs in a distributed framework. Your completed job will also be available for viewing at the Spark UI: The preceding image shows the status of our first Scala job on the UI. Now let's move forward and develop the same Job using Spark Java APIs. Coding Spark job in Java Perform the following steps to code the Spark job in Java for aggregating the number of crimes in August 2015: Open your Spark-Examples Eclipse project (created in the previous section). Add a new chapter.six.JavaFirstSparkJobJava file, and add the following code snippet: import org.apache.spark.SparkConf; import org.apache.spark.api.java.JavaRDD; import org.apache.spark.api.java.JavaSparkContext; import org.apache.spark.api.java.function.Function; public class JavaFirstSparkJob { public static void main(String[] args) { System.out.println("Creating Spark Configuration"); // Create an Object of Spark Configuration SparkConf javaConf = new SparkConf(); // Set the logical and user defined Name of this Application javaConf.setAppName("My First Spark Java Application"); System.out.println("Creating Spark Context"); // Create a Spark Context and provide previously created //Objectx of SparkConf as an reference. JavaSparkContext javaCtx = new JavaSparkContext(javaConf); System.out.println("Loading the Crime Dataset and will further process it"); String file = "file:///home/ec2-user/softwares/crime-data/Crimes_-Aug-2015.csv"; JavaRDD<String> logData = javaCtx.textFile(file); //Invoking Filter operation on the RDD. //And counting the number of lines in the Data loaded //in RDD. //Simply returning true as "TextInputFormat" have already divided the data by "\n" //So each RDD will have only 1 line. long numLines = logData.filter(new Function<String, Boolean>() { public Boolean call(String s) { return true; } }).count(); //Finally Printing the Number of lines System.out.println("Number of Crimes reported in Aug-2015 = "+numLines); javaCtx.close(); } } Next, compile the preceding JavaFirstSparkJob from Eclipse itself and perform steps 7, 8, and 9 of the previous section in which we executed the Spark Scala job. We are done! Analyze the output on the console; it should be the same as the output of the Scala job, which we executed in the previous section. Troubleshooting – tips and tricks In this section, we will talk about troubleshooting tips and tricks, which are helpful in solving the most common errors encountered while working with Spark. Port numbers used by Spark Spark binds various network ports for communication within the cluster/nodes and also exposes the monitoring information of jobs to developers and administrators. There may be instances where the default ports used by Spark may not be available or may be blocked by the network firewall which in turn will result in modifying the default Spark ports for master/worker or driver. Here is a list of all the ports utilized by Spark and their associated parameters, which need to be configured for any changes (http://spark.apache.org/docs/latest/security.html#configuring-ports-for-network-security). Classpath issues – class not found exception Classpath is the most common issue and it occurs frequently in distributed applications. Spark and its associated jobs run in a distributed mode on a cluster. So, if your Spark job is dependent upon external libraries, then we need to ensure that we package them into a single JAR fie and place it in a common location or the default classpath of all worker nodes or define the path of the JAR file within SparkConf itself: val sparkConf = new SparkConf().setAppName("myapp").setJars(<path of Jar file>)) Other common exceptions In this section, we will talk about few of the common errors/issues/exceptions encountered by architects/developers when they set up Spark or execute Spark jobs: Too many open files: This increases the ulimit on your Linux OS by executingsudo ulimit –n 20000. Version of Scala: Spark 1.5.1 supports Scala 2.10, so if you have multiple versions of Scala deployed on your box, then ensure that all versions are the same, that is, Scala 2.10. Out of memory on workers in standalone mode: This configures SPARK_WORKER_MEMORY in $SPARK_HOME/conf/spark-env.sh. By default, it provides a total memory of 1 G to workers, but at the same time, you should analyze and ensure that you are not loading or caching too much data on worker nodes. Out of memory in applications executed on worker nodes: This configures spark.executor.memory in your SparkConf, as follows: val sparkConf = new SparkConf().setAppName("myapp") .set("spark.executor.memory", "1g") The preceding tips will help you solve basic issues in setting up Spark clusters, but as you move ahead, there could be more complex issues, which are beyond the basic setup, and for all those issues, post your queries at http://stackoverflow.com/questions/tagged/apache-spark or mail at user@spark.apache.org. Summary In this article, we discussed the architecture of Spark and its various components. We also configured our Spark cluster and executed our first Spark job in Scala and Java. Resources for Article:   Further resources on this subject: Data mining [article] Python Data Science Up and Running [article] The Design Patterns Out There and Setting Up Your Environment [article]
Read more
  • 0
  • 0
  • 3188
Modal Close icon
Modal Close icon