How-To Tutorials

7019 Articles
Packt
27 Jul 2010
5 min read

Flash 10 Multiplayer Game: The Lobby and New Game Screen Implementation

The lobby screen implementation

In this section, we will learn how to implement the room display within the lobby.

Lobby screen in Hello World

Upon login, the first thing the player needs to do is enter the lobby. Once the player has logged in to the server successfully, the default behavior of PulseGame in PulseUI is to call the enterLobby API. The following is the implementation within PulseGame:

    protected function postInit():void {
        m_netClient.enterLobby();
    }

Once the player has successfully entered the lobby, the client starts listening to all the room updates that happen in the lobby. These updates include newly created rooms and changes to existing room objects, for example, a change to the player count of a game room or a change of host.

Customizing lobby screen

In PulseUI, the lobby screen is displayed immediately after a successful login. It is drawn over whatever the outline object has drawn onto the screen. The following elements are added when the lobby screen is shown to the player:

  • Search lobby UI
  • Available game rooms
  • Game room scroll buttons
  • Buttons for creating a new game room
  • Navigation buttons to the top ten and register screens

When the lobby is asked to hide, the lobby UI elements are taken off the screen to make way for the incoming screen. For our initial game prototype, we don't need to make any changes. The PulseUI framework already offers the essential lobby functionality for any kind of multiplayer game. However, the one place where you may want to add more detail is in what gets displayed for each room within the lobby.

Customizing game room display

The room display is controlled by the class RoomsDisplay, an instance of which is contained in GameLobbyScreen. RoomsDisplay contains a number of RoomDisplay instances, one for each room being displayed.
To modify what gets displayed in each room display, we work inside a class that is subclassed from RoomDisplay. The following figure shows the containment of the Pulse layer classes and shows what we need to subclass in order to modify the room display:

In all cases, we would subclass PulseGame (MyGame). To have our own subclass of the lobby screen, we first need to create a class (MyGameLobbyScreen) that inherits from GameLobbyScreen. In addition, we need to override the method initLobbyScreen as shown below:

    protected override function initLobbyScreen():void {
        m_gameLobbyScreen = new MyGameLobbyScreen();
    }

To provide our own RoomsDisplay, we need to create a subclass (MyRoomsDisplay) that inherits from the RoomsDisplay class, and override the method in GameLobbyScreen that creates the RoomsDisplay, as shown below:

    protected override function createRoomsDisplay():void {
        m_roomsDisplay = new MyRoomsDisplay();
    }

Finally, we do similar subclassing for MyRoomDisplay and override the method that creates the RoomDisplay in MyRoomsDisplay as follows:

    protected override function createRoomDisplay(room:GameRoomClient):RoomDisplay {
        return new MyRoomDisplay(room);
    }

Now that we have hooked up our own implementation of RoomDisplay, we are free to add any additional information we like. To add additional sprites, we simply need to override the init method of GameRoom and provide our additional sprites.

Filtering rooms to display

It is up to the game developer to either display all the rooms currently created or just the ones that are available to join. We may override the shouldShowRoom method in the subclass of RoomsDisplay (MyRoomsDisplay) to change the default behavior. The default behavior is to show only rooms that are available to join, as well as rooms that allow players to join even after the game has started.
The following is the default method implementation:

    protected function shouldShowRoom(room:GameRoomClient):Boolean {
        var show:Boolean;
        // Rooms that accept players after the game has started are always shown.
        show = (room.getRoomType() == GameConstants.ROOM_ALLOW_POST_START);
        if (show == true)
            return true;
        else {
            // Otherwise, show the room only while it is in the waiting state.
            return (room.getRoomStatus() == GameConstants.ROOM_STATE_WAITING);
        }
    }

Lobby and room-related API

Upon successful login, every game implementation must call the enterLobby method.

    public function enterLobby(gameLobbyId:String = "DefaultLobby"):void

You may pass a null string if you only wish to have one default lobby. The client then receives the following notification, which reports whether the request to enter the lobby was successful or not. At this point, the game screen should switch to the lobby screen.

    function onEnteredLobby(error:int):void

If entering the lobby was successful, the client starts to receive onNewGameRoom notifications, one for each room that was found active in the entered lobby. The implementation should draw the corresponding game room with its details on the lobby screen.

    function onNewGameRoom(room:GameRoomClient):void

The client may also receive other lobby-related notifications, such as onUpdateGameRoom for any room updates and onRemoveGameRoom for any room objects that no longer exist in the lobby.

    function onUpdateGameRoom(room:GameRoomClient):void
    function onRemoveGameRoom(room:GameRoomClient):void

If the player wishes to join an existing game room in the lobby, you simply call joinGameRoom and pass the corresponding room object.

    public function joinGameRoom(gameRoom:GameRoomClient):void

In response to a join request, the server notifies the requesting client whether the action succeeded or failed via the game client callback method.
    function onJoinedGameRoom(gameRoomId:int, error:int):void

A player already in a game room may leave the room and go back to the lobby by calling the following API:

    public function leaveGameRoom():void

If the player successfully left the room, the calling game client receives the notification via the following callback API:

    function onLeaveGameRoom(error:int):void
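
The room notifications and the default room filter described in this article can be modeled in a few lines. The following Python sketch is only an illustration of that logic, not Pulse code: the constant values, the Room class, and the LobbyModel class are hypothetical stand-ins, since the article does not show the real GameConstants values or the internals of GameRoomClient.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for GameConstants values (the real values are not given).
ROOM_ALLOW_POST_START = "allow_post_start"
ROOM_NORMAL = "normal"
ROOM_STATE_WAITING = "waiting"
ROOM_STATE_PLAYING = "playing"


@dataclass
class Room:
    """Hypothetical, simplified stand-in for GameRoomClient."""
    room_id: int
    room_type: str
    status: str


def should_show_room(room):
    """Mirror the default filter: late-join rooms always show, others only while waiting."""
    if room.room_type == ROOM_ALLOW_POST_START:
        return True
    return room.status == ROOM_STATE_WAITING


class LobbyModel:
    """Tracks the rooms reported by the three lobby notifications."""

    def __init__(self):
        self.rooms = {}

    def on_new_game_room(self, room):
        self.rooms[room.room_id] = room

    def on_update_game_room(self, room):
        self.rooms[room.room_id] = room

    def on_remove_game_room(self, room):
        self.rooms.pop(room.room_id, None)

    def visible_room_ids(self):
        return sorted(r.room_id for r in self.rooms.values() if should_show_room(r))


lobby = LobbyModel()
lobby.on_new_game_room(Room(1, ROOM_NORMAL, ROOM_STATE_WAITING))
lobby.on_new_game_room(Room(2, ROOM_NORMAL, ROOM_STATE_PLAYING))
lobby.on_new_game_room(Room(3, ROOM_ALLOW_POST_START, ROOM_STATE_PLAYING))
print(lobby.visible_room_ids())  # [1, 3]
```

Room 2 is hidden because it has already started and does not allow late joins, while room 3 stays visible even though its game is in progress.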

Packt
28 Oct 2010
5 min read

Joomla! 1.5 Top Extensions for Using Languages

Joomla! 1.5 Top Extensions Cookbook
Over 80 great recipes for taking control of Joomla! Extensions

  • Set up and use the best extensions available for Joomla!
  • Covers extensions for just about every use of Joomla!
  • Packed with recipes to help you get the most out of the Joomla! extensions
  • Part of Packt's Cookbook series: each recipe is a carefully organized sequence of instructions to complete the task as efficiently as possible

Introduction

One of the greatest features of Joomla! is that you can build a multilingual website. The Joomla! interface can be displayed in many languages. You can simply download the translation pack for the required language and install it in Joomla!. If no translation pack exists for your desired language, you can translate the interface by editing the language files directly or by using the translation manager component. The translation manager component allows you to visually translate your site's interface into any language, right from the Joomla! administration area. After completing the translation, you can package it and share it with others, so that they can install it on other Joomla! sites.

Besides translating the Joomla! interface, you can translate a site's contents into your desired language. The GTranslate component allows you to translate your site's content into 55 languages using Google's translation service.

Adding a language to your site

Joomla! can build a multilingual site. A site interface can be in multiple languages, using different locales. In this recipe, you will learn how to add an additional language to a Joomla! site.

Getting ready...

Joomla! translations are available in major languages. First, decide which language you want to add to your site. For example, we want to add French to a Joomla! website. A French translation for Joomla! 1.5 is available to download at the Joomla! Extensions Directory, http://extensions.joomla.org/extensions/languages/translations-for-joomla.
Download this extension from http://joomlacode.org/gf/project/french/frs/, and install it from the Extensions | Install/Uninstall screen.

How to do it...

After installation, follow these steps:

1. From the Joomla! administration panel, click on Extensions | Language Manager. This will show the Language Manager screen, listing the installed languages for the site:
   Note that the default language for the site is marked with a yellow star in the Default column.
2. To make the newly installed language, French (Fr), the default language for your site, select the language and click on the Default button in the toolbar.
3. Preview the site's frontend and you will find the site's interface (not content) in French. For example, the Login Form module will look like the following screenshot:
4. To change the language of the administration panel, click on Administrator in the Language Manager screen, select a language from the list, and set it as the default language for the administrator backend.

See also...

Adding a translation will only show the Joomla! interface in that language. The content of the site is not translated or displayed in the selected language. Also note that we still don't have a mechanism to select our desired language. All of these things can be done using the Joom!Fish extension, which is discussed in the recipe Manually Translating Your Joomla! Site's Content into Your Desired Language.

Translating language files for your site

Joomla!'s translations are available in most major languages. However, you may want to change the translations and have your own translation in your desired language. For that case, Joomla! provides a mechanism to translate the Joomla! interface language. In this recipe, you will learn how to translate language files for your site from the administration backend.

Getting ready...
Translation Manager is a popular extension that can help you translate the site's language files right from the administration backend, without opening a text editor. Download this extension from http://joomlacode.org/gf/project/joomla_1_5_tr1/frs/ and install it from the Extensions | Install/Uninstall screen.

How to do it...

After installation, follow these steps:

1. From the Joomla! administration panel, click on Components | Translation Manager. This will show the Translation Manager screen, listing all of the installed languages for the site and the administration backend.
2. To change any language translation, select that language, for example Site [en-GB] English(United Kingdom), and click on the View Files button. This will show the language files for that language.
3. Now select a file, such as com_banners, and click on the Edit button. This shows the string editing screen for the com_banners.ini file. Change the strings accordingly, and click on the Save button in the toolbar.
4. To add a new language, click on New in the Translation Manager screen. This will show the Create New Language screen:
5. In the Language Details section, configure the following:
  • Client: Select who will be the client for this translation: Administrator, Installation, or Site. If you want to translate the administrator interface, select Administrator. We want to translate the site's frontend, therefore we select Site.
  • Language ISO tag: Type the ISO tag for the language. For example, if we want to translate into Bengali, type the ISO code bn-BD.
  • Name: Type the language name, for example, Bangla.
  • Description: Type a short description for the translation.
  • Legacy Name: Type the traditional name of the language, for example, bn for bn-BD.
  • Language Locales: Type the locale code for the language.
  • Windows Code page: Specify the code page for the language. The default is iso-8859-1. For the Bangla language it will be utf-8.
  • PDF Font: Specify the font family to be used for displaying the PDF in that language.
  • Right-to-Left: Specify Yes if the language is to be read from right to left (for example, Arabic).
6. In the Author Details section, provide the translator's name (probably your name), e-mail address, website URL, version number for the translation, creation date, the copyright holder's name, and URL to the license document.
7. When done, click on the Save button in the toolbar. This saves the language definition and you will see the language name on the Translation Manager screen:
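
The string editing step in this recipe changes the strings stored in a language file such as com_banners.ini. As a rough illustration of what such an edit amounts to, the following Python sketch assumes the file holds one KEY=Value pair per line, with comment lines starting with # or ; (this layout is an assumption, not something the recipe states). The sample keys and the helper function names are made up for the example.

```python
def parse_language_ini(text):
    """Parse KEY=Value lines, skipping blank lines and comment lines."""
    strings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that have no '=' separator
            strings[key.strip()] = value.strip()
    return strings


def apply_overrides(strings, overrides):
    """Return a new mapping in which translated strings replace the originals."""
    merged = dict(strings)
    merged.update(overrides)
    return merged


def dump_language_ini(strings):
    """Serialize the mapping back to KEY=Value lines, sorted by key."""
    return "\n".join(f"{key}={value}" for key, value in sorted(strings.items()))


# Hypothetical sample content, not copied from a real Joomla! language file.
sample = """# sample language strings
BANNER=Banner
CLICKS=Clicks
"""

original = parse_language_ini(sample)
translated = apply_overrides(original, {"CLICKS": "Clics"})
print(dump_language_ini(translated))
```

Keeping the original mapping untouched and producing a merged copy makes it easy to compare the translated strings against the source language before saving.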

Packt
19 Apr 2010
5 min read

Customizing Kubrik with Wordpress

Getting ready

The first step is to make a list of the changes that you want to make. Here's what we've come up with:

  • Increase width to 980px
  • Increase font sizes in sidebar
  • Use a graphic header

We've picked 980px as our target width because this size is optimized for a 1024px-wide screen resolution and works well in a grid layout. Several CSS adjustments will be necessary to realize this modification, as well as an image editing program (we will be using Photoshop).

How to do it...

To increase the page width, the first step is to determine which entries in the CSS stylesheet control the width. Using Firebug to inspect the page (as seen below), we find that the selector #page has a value of 760px for the width property, and #header has a width of 758px (less because there is a 1px left margin). The .narrowcolumn selector gives the main content column a width of 450px, and #sidebar has a width of 190px. Finally, #footer has a width of 760px.

So, we will increase #page and #footer to 980px, and #header to 978px. Let's apply all of the additional 220px width to .narrowcolumn. Taking note of the existing 45px left margin, our new value for the width property will be 700px. That means the #sidebar width will remain at 190px, but its margin-left will need to be increased from 545px to 765px.

Click on Appearance | Editor. In the right-hand column, below the Templates heading, click on style.css. Scroll past the section that says /* Begin Typography & Colors */, until you get to the section that says /* Begin Structure */. Make the following changes to the stylesheet (style.css), commenting as appropriate to document your changes.
    #page {
        background-color: white;
        margin: 20px auto;
        padding: 0;
        width: 980px; /* increased from 760px */
        border: 1px solid #959596;
    }

    #header {
        background-color: #73a0c5;
        margin: 0 0 0 1px;
        padding: 0;
        height: 200px;
        width: 978px; /* increased from 758px */
    }

    .narrowcolumn {
        float: left;
        padding: 0 0 20px 45px;
        margin: 0px 0 0;
        width: 700px; /* increased from 450px */
    }

    #sidebar {
        margin-left: 765px; /* increased from 545px */
        padding: 20px 0 10px;
        width: 190px;
    }

    #footer {
        padding: 0;
        margin: 0 auto;
        width: 980px; /* increased from 760px */
        clear: both;
    }

Adjustments via Photoshop

We'll also need to use an image editing program to modify the three background images that create the rounded corners: kubrickbg-ltr.jpg, kubrickheader.jpg, and kubrickfooter.jpg. In this example, we modify kubrickbg-ltr.jpg (the background image for #page), a 760px image.

Open up the image in Photoshop, select all, copy, create a new document (with a white or transparent background), and paste (Ctrl-A, Ctrl-C, Ctrl-N, Ctrl-V). Increase the canvas size (Image | Canvas Size) to 980px, keeping the image anchored on the left-hand side by clicking on the left-pointing arrow. Select one half of the image with the Rectangular Marquee Tool, then cut and paste. Use the Move Tool to drag the new layer to the right-hand side of the canvas.

In this case, it does not matter if you can see the transparent background or if your selection was exactly one half of the image. Since the middle of the image is simply a white background, we are really only concerned with the borders on the left and right. The following screenshot shows the background image cut in half and moved over:

Use Save for Web and Devices, exporting as a jpg. Then, replace the existing kubrickbg-ltr.jpg with your modified version via FTP.

The steps are similar for both kubrickheader.jpg and kubrickfooter.jpg. Increase the canvas size and copy/paste from the existing image to increase the image size without stretching or distortion.
The only difference is that you need to copy and paste different parts of the image in order to preserve the background gradient and/or top and bottom borders.

To complete our theme customization, the width of .widecolumn will need to be increased from 450px to 700px (and the 150px margin should be converted to a 45px margin, the same as .narrowcolumn). Also, the kubrickbgwide.jpg background image will need to be modified with an image editing program to increase its size from 760px to 980px. Then, the individual post view will look as good as the homepage. By following the same steps as above, you should now be prepared to make this final customization yourself.

Our next goal is to increase the sizes of the sidebar fonts. Firebug helps us to pinpoint the relevant CSS. #sidebar h2 has a font-size of 1.2em (around line 123 of style.css). Let's change this to 1.75em. #sidebar has a font-size of 1em. Let's increase this to 1.25em.

To use a graphic in the header, open up kubrickheader.jpg in a new Photoshop document. Use the magic wand tool to select and delete the blue gradient with rounded corners. Now, use the rounded rectangle tool to insert your own custom header area. You can apply another gradient, if desired. We chose to apply a bevel and emboss texture to our grey rectangle. Then, we pasted in some photos, decreasing their opacity to 50%.

In a short time, we've been able to modify Kubrick by rewriting CSS and using an image editing program. This is the most basic technique for theme modification. Here is the result:
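
As a side check on the width changes made earlier in this recipe, the horizontal layout can be verified with simple arithmetic. The Python sketch below uses only the pixel values quoted in the stylesheet; the function name and the dictionary keys are made up for illustration, and the model assumes the content column's outer width is its left padding plus its width.

```python
def layout_report(page, pad_left, col_width, side_margin, side_width):
    """Return the horizontal extents of the content column and the sidebar."""
    column_right = pad_left + col_width      # right edge of .narrowcolumn
    sidebar_right = side_margin + side_width  # right edge of #sidebar
    return {
        "column_right": column_right,
        "gap": side_margin - column_right,    # space between column and sidebar
        "sidebar_right": sidebar_right,
        "fits": column_right <= side_margin and sidebar_right <= page,
    }


# Original values: 760px page, 450px column, sidebar at 545px.
old = layout_report(760, 45, 450, 545, 190)
# Modified values: 980px page, 700px column, sidebar at 765px.
new = layout_report(980, 45, 700, 765, 190)

print(old)
print(new)
```

With these values, the content column grows from 450px to 700px, the gap before the sidebar narrows from 50px to 20px, and the sidebar's right edge (955px) still fits inside the 980px page.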

Packt
07 Jul 2011
5 min read

Building Financial Functions into Excel 2010

Excel 2010 Financials Cookbook
Powerful techniques for financial organization, analysis, and presentation in Microsoft Excel

In the previous articles, we focused on manipulating data within and outside of Excel in order to prepare to make financial decisions. Now that the data has been prepared, rearranged, or otherwise adjusted, we are able to leverage the functions within Excel to make actual decisions. Utilizing these functions and the individual scenarios, we will be able to reduce the uncertainty that results from poor analysis.

Since this article uses financial scenarios to demonstrate the various functions, it is important to note that these scenarios take certain "unknowns" for granted and make a number of assumptions in order to minimize the complexity of the calculation. Real-world scenarios will require a greater focus on calculating and accounting for all variables.

Determining standard deviation for assessing risk

In the recipes mentioned so far, we have shown the importance of monitoring and analyzing frequency to determine the likelihood that an event will occur. Standard deviation allows for an analysis of the data in a different manner: through its variability. With standard deviation, we can determine the basic top and bottom thresholds of the data, and plot general movement within those thresholds to see how much the data varies. This variability allows the calculation of risk within investments.

As a financial manager, you must determine the risk associated with investing capital in order to gain a return. In this particular instance, you will invest in stock. In order to minimize loss of investment capital, you must compare the risk of investing in two different stocks, Stock A and Stock B.
In this recipe, we will use standard deviation to determine which stock, A or B, presents the higher risk of loss.

How to do it...

We will begin by entering the selling prices of Stock A and Stock B in columns A and B, respectively:

Within this list of selling prices, at first glance we can see that Stock B has a higher selling price. The stock opening price and selling price over the course of 52 weeks almost always remain above those of Stock A. As an investor looking to gain a higher return, we may wish to choose Stock B based on this cursory review; however, a high selling price does not negate the need for consistency.

In cell C2, enter the formula =STDEV(A2:A53) and press Enter:

In cell C3, enter the formula =STDEV(B2:B53) and press Enter:

We can see from the calculation that Stock B has a standard deviation of over $20, whereas Stock A's standard deviation is just over $9:

Given this information, we can determine that Stock A presents a lower risk than Stock B. Based on past performance, the price of Stock A typically deviates from its average by about $9, whereas the price of Stock B typically deviates by about $20.

How it works...

The STDEV function in Excel treats the given numbers as a sample drawn from a larger population. It measures how far the values typically fall from their average: the more widely the prices swing around the mean, the larger the standard deviation. Excel also includes the function STDEVP, which treats the data as the complete population. STDEV is the appropriate choice if you are calculating standard deviation on a subset of data (for example, six months out of an entire year), while STDEVP is appropriate when the data set contains every value of interest.
If we translate these numbers into a line graph with standard deviation bars, as shown in the following screenshot for Stock A, you can see the selling prices of the stock and how they travel within the deviation range:

If we do the same for Stock B, as shown in the following screenshot, you can see the selling prices of the stock and how they travel within the deviation range:

The bars shown on the graphs represent the standard deviation as calculated by Excel. We can see visually that not only does Stock B represent a greater risk with its larger deviation, but many of its prices also fall below the deviation range, representing further risk to the investor. For a finance manager with funds to invest, Stock A represents the lower-risk investment.

There's more...

Standard deviation can be calculated for almost any data set. For this recipe, we calculated deviation over the course of one year; however, if we expand the data to include multiple years, we can further determine long-term risk. While Stock B represents a high short-term risk, a long-term analysis may show Stock B to be a less risky investment. Combining standard deviation with a five-number summary analysis, we can gain further risk and performance information.
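
The same comparison can be sketched outside Excel. The Python example below is a minimal illustration using the standard library's statistics module, where stdev is the sample standard deviation (the counterpart of STDEV) and pstdev is the population standard deviation (the counterpart of STDEVP). The price lists are invented for the example, since the article's 52 weeks of data are not reproduced here, and the function name risk_summary is hypothetical.

```python
import statistics

# Hypothetical weekly selling prices, not the article's data.
stock_a = [20.0, 22.0, 19.0, 21.0, 23.0, 20.0]
stock_b = [40.0, 65.0, 30.0, 70.0, 45.0, 50.0]


def risk_summary(prices):
    """Summarize a price series by its mean and both standard deviations."""
    return {
        "mean": statistics.mean(prices),
        "sample_sd": statistics.stdev(prices),       # counterpart of Excel STDEV
        "population_sd": statistics.pstdev(prices),  # counterpart of Excel STDEVP
    }


summary_a = risk_summary(stock_a)
summary_b = risk_summary(stock_b)

# The stock whose prices swing more widely around the mean is the riskier one.
riskier = "Stock B" if summary_b["sample_sd"] > summary_a["sample_sd"] else "Stock A"

print(summary_a)
print(summary_b)
print(riskier)
```

For the same data, the sample standard deviation is always slightly larger than the population standard deviation, because it divides by n - 1 rather than n.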

Valentina Alto
02 Jun 2023
2 min read

ChatGPT for Information Retrieval and Competitive Intelligence

This article is an excerpt from the book Modern Generative AI with ChatGPT and OpenAI Models, by Valentina Alto. This book will provide you with insights into the inner workings of LLMs and guide you through creating your own language models.

Information retrieval and competitive intelligence are fields where ChatGPT is a game-changer. It can retrieve information from its knowledge base and reframe it in an original way.

One example is using ChatGPT as a search engine to provide summaries, reviews, and recommendations for books. Alternatively, we could ask for suggestions for a new book we wish to read based on our preferences.

If we design the prompt with specific information, ChatGPT can serve as a tool for pointing us towards the right references for research or studies, for example, by asking ChatGPT to list relevant references for feedforward neural networks.

ChatGPT can also be useful for competitive intelligence, for example, by generating a list of existing books with similar content, or by providing advice on how to be competitive in the market. ChatGPT can also suggest improvements to book content to make it stand out.

Overall, ChatGPT can be a valuable assistant for information retrieval and competitive intelligence. However, it's important to remember that the knowledge base cutoff is 2021, so real-time information may not be available.

About the Author

Valentina Alto graduated in 2021 in Data Science. Since 2020 she has been working at Microsoft as an Azure Solution Specialist and, since 2022, she has focused on Data & AI workloads within the Manufacturing and Pharmaceutical industry. She has been working on customers' projects closely with system integrators to deploy cloud architecture with a focus on data lakehouse and DWH, data integration and engineering, IoT and real-time analytics, Azure Machine Learning, Azure cognitive services (including Azure OpenAI Service), and Power BI for dashboarding.

She holds a BSc in Finance and an MSc degree in Data Science from Bocconi University, Milan, Italy. Since her academic journey she has been writing tech articles about Statistics, Machine Learning, Deep Learning, and AI for various publications. She has also written a book about the fundamentals of Machine Learning with Python.

You can connect with Valentina on LinkedIn and Medium.

Packt
20 Feb 2015
13 min read

The Spark Programming Model

In this article by Nick Pentreath, author of the book Machine Learning with Spark, we will give a high-level overview of Spark's design and introduce the SparkContext object as well as the Spark shell, which we will use to interactively explore the basics of the Spark programming model.

While this section provides a brief overview and examples of using Spark, we recommend that you read the following documentation to get a detailed understanding:

  • Spark Quick Start: http://spark.apache.org/docs/latest/quick-start.html
  • Spark Programming guide, which covers Scala, Java, and Python: http://spark.apache.org/docs/latest/programming-guide.html

SparkContext and SparkConf

The starting point of writing any Spark program is SparkContext (or JavaSparkContext in Java). SparkContext is initialized with an instance of a SparkConf object, which contains various Spark cluster-configuration settings (for example, the URL of the master node).

Once initialized, we will use the various methods found in the SparkContext object to create and manipulate distributed datasets and shared variables. The Spark shell (available in Scala and Python, but unfortunately not in Java) takes care of this context initialization for us, but the following lines of code show an example of creating a context running in the local mode in Scala:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("Test Spark App").setMaster("local[4]")
    val sc = new SparkContext(conf)

This creates a context running in the local mode with four threads, with the name of the application set to Test Spark App.
If we wish to use default configuration values, we could also call the following simple constructor for our SparkContext object, which works in exactly the same way:

    val sc = new SparkContext("local[4]", "Test Spark App")

The Spark shell

Spark supports writing programs interactively using either the Scala or Python REPL (that is, the Read-Eval-Print-Loop, or interactive shell). The shell provides instant feedback as we enter code, as this code is immediately evaluated. In the Scala shell, the return result and type are also displayed after a piece of code is run.

To use the Spark shell with Scala, simply run ./bin/spark-shell from the Spark base directory. This will launch the Scala shell and initialize SparkContext, which is available to us as the Scala value, sc. Your console output should look similar to the following screenshot:

To use the Python shell with Spark, simply run the ./bin/pyspark command. Like the Scala shell, the Python SparkContext object should be available as the Python variable sc. You should see an output similar to the one shown in this screenshot:

Resilient Distributed Datasets

The core of Spark is a concept called the Resilient Distributed Dataset (RDD). An RDD is a collection of "records" (strictly speaking, objects of some type) that is distributed or partitioned across many nodes in a cluster (for the purposes of the Spark local mode, the single multithreaded process can be thought of in the same way). An RDD in Spark is fault-tolerant; this means that if a given node or task fails (for some reason other than erroneous user code, such as hardware failure, loss of communication, and so on), the RDD can be reconstructed automatically on the remaining nodes and the job will still complete.
Creating RDDs

RDDs can be created from existing collections, for example, in the Scala Spark shell that you launched earlier:

    val collection = List("a", "b", "c", "d", "e")
    val rddFromCollection = sc.parallelize(collection)

RDDs can also be created from Hadoop-based input sources, including the local filesystem, HDFS, and Amazon S3. A Hadoop-based RDD can utilize any input format that implements the Hadoop InputFormat interface, including text files, other standard Hadoop formats, HBase, Cassandra, and many more. The following code is an example of creating an RDD from a text file located on the local filesystem:

    val rddFromTextFile = sc.textFile("LICENSE")

The preceding textFile method returns an RDD where each record is a String object that represents one line of the text file.

Spark operations

Once we have created an RDD, we have a distributed collection of records that we can manipulate. In Spark's programming model, operations are split into transformations and actions. Generally speaking, a transformation operation applies some function to all the records in the dataset, changing the records in some way. An action typically runs some computation or aggregation operation and returns the result to the driver program where SparkContext is running.

Spark operations are functional in style. For programmers familiar with functional programming in Scala or Python, these operations should seem natural. For those without experience in functional programming, don't worry; the Spark API is relatively easy to learn.

One of the most common transformations that you will use in Spark programs is the map operator. This applies a function to each record of an RDD, thus mapping the input to some new output. For example, the following code fragment takes the RDD we created from a local text file and applies the size function to each record in the RDD. Remember that we created an RDD of Strings.
Using map, we can transform each string to an integer, thus returning an RDD of Ints:

    val intsFromStringsRDD = rddFromTextFile.map(line => line.size)

You should see output similar to the following line in your shell; this indicates the type of the RDD:

    intsFromStringsRDD: org.apache.spark.rdd.RDD[Int] = MappedRDD[5] at map at <console>:14

In the preceding code, we saw the => syntax used. This is the Scala syntax for an anonymous function, which is a function that is not a named method (that is, one defined using the def keyword in Scala or Python, for example).

The line => line.size syntax means that we are applying a function where the input variable is to the left of the => operator, and the output is the result of the code to the right of the => operator. In this case, the input is line, and the output is the result of calling line.size. In Scala, this function that maps a string to an integer is expressed as String => Int.

This syntax saves us from having to separately define functions every time we use methods such as map; this is useful when the function is simple and will only be used once, as in this example.

Now, we can apply a common action operation, count, to return the number of records in our RDD:

    intsFromStringsRDD.count

The result should look something like the following console output:

    14/01/29 23:28:28 INFO SparkContext: Starting job: count at <console>:17
    ...
    14/01/29 23:28:28 INFO SparkContext: Job finished: count at <console>:17, took 0.019227 s
    res4: Long = 398

Perhaps we want to find the average length of each line in this text file.
We can first use the sum function to add up all the lengths of all the records and then divide the sum by the number of records:

    val sumOfRecords = intsFromStringsRDD.sum
    val numRecords = intsFromStringsRDD.count
    val aveLengthOfRecord = sumOfRecords / numRecords

The result will be as follows:

    aveLengthOfRecord: Double = 52.06030150753769

Spark operations, in most cases, return a new RDD, with the exception of most actions, which return the result of a computation (such as Long for count and Double for sum in the preceding example). This means that we can naturally chain together operations to make our program flow more concise and expressive. For example, the same result as the one in the preceding line of code can be achieved using the following code:

    val aveLengthOfRecordChained = rddFromTextFile.map(line => line.size).sum / rddFromTextFile.count

An important point to note is that Spark transformations are lazy. That is, invoking a transformation on an RDD does not immediately trigger a computation. Instead, transformations are chained together and are effectively only computed when an action is called. This allows Spark to be more efficient by only returning results to the driver when necessary so that the majority of operations are performed in parallel on the cluster.

This means that if your Spark program never uses an action operation, it will never trigger an actual computation, and you will not get any results. For example, the following code will simply return a new RDD that represents the chain of transformations:

    val transformedRDD = rddFromTextFile.map(line => line.size).filter(size => size > 10).map(size => size * 2)

This returns the following result in the console:

    transformedRDD: org.apache.spark.rdd.RDD[Int] = MappedRDD[8] at map at <console>:14

Notice that no actual computation happens and no result is returned.
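The lazy behavior described here has a loose analogue in plain Python generators. The following is only an illustrative sketch in Python, not Spark code: the names size_of, evaluated, lengths, long_lengths, and doubled are hypothetical, and the sample lines are made up. Building the generator chain does no work, and the work happens only when the chain is consumed, which plays the role of a Spark action.

```python
# Plain-Python analogy for lazy transformations; this is NOT Spark code.
evaluated = []  # records which inputs have actually been processed


def size_of(line):
    """Return the length of a line and note that it was evaluated."""
    evaluated.append(line)
    return len(line)


lines = ["a short line", "tiny", "a somewhat longer line of text"]

# Building the chain does no work, much like chaining map and filter on an RDD.
lengths = (size_of(line) for line in lines)
long_lengths = (n for n in lengths if n > 10)
doubled = (n * 2 for n in long_lengths)
work_before_action = len(evaluated)

# Consuming the chain plays the role of an action such as sum.
total = sum(doubled)
work_after_action = len(evaluated)
```

Here work_before_action stays at zero because nothing has been consumed yet; only the call to sum pulls the records through all three steps.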
If we now call an action, such as sum, on the resulting RDD, the computation will be triggered:

    val computation = transformedRDD.sum

You will now see that a Spark job is run, and it results in the following console output:

    ...
    14/11/27 21:48:21 INFO SparkContext: Job finished: sum at <console>:16, took 0.193513 s
    computation: Double = 60468.0

The complete list of transformations and actions possible on RDDs as well as a set of more detailed examples are available in the Spark programming guide (located at http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations) and in the Scala API documentation (located at http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD).

Caching RDDs

One of the most powerful features of Spark is the ability to cache data in memory across a cluster. This is achieved through use of the cache method on an RDD:

    rddFromTextFile.cache

Calling cache on an RDD tells Spark that the RDD should be kept in memory. The first time an action is called on the RDD that initiates a computation, the data is read from its source and put into memory. Hence, the first time such an operation is called, the time it takes to run the task is partly dependent on the time it takes to read the data from the input source. However, when the data is accessed the next time (for example, in subsequent queries in analytics or iterations in a machine learning model), the data can be read directly from memory, thus avoiding expensive I/O operations and speeding up the computation, in many cases, by a significant factor.
If we now call the count or sum function on our cached RDD, we will see that the RDD is loaded into memory:

    val aveLengthOfRecordChained = rddFromTextFile.map(line => line.size).sum / rddFromTextFile.count

Indeed, in the following output, we see that the dataset was cached in memory on the first call, taking up approximately 62 KB and leaving us with around 297 MB of memory free:

    ...
    14/01/30 06:59:27 INFO MemoryStore: ensureFreeSpace(63454) called with curMem=32960, maxMem=311387750
    14/01/30 06:59:27 INFO MemoryStore: Block rdd_2_0 stored as values to memory (estimated size 62.0 KB, free 296.9 MB)
    14/01/30 06:59:27 INFO BlockManagerMasterActor$BlockManagerInfo: Added rdd_2_0 in memory on 10.0.0.3:55089 (size: 62.0 KB, free: 296.9 MB)
    ...

Now, we will call the same function again:

    val aveLengthOfRecordChainedFromCached = rddFromTextFile.map(line => line.size).sum / rddFromTextFile.count

We will see from the console output that the cached data is read directly from memory:

    ...
    14/01/30 06:59:34 INFO BlockManager: Found block rdd_2_0 locally
    ...

Spark also allows more fine-grained control over caching behavior. You can use the persist method to specify what approach Spark uses to cache data. More information on RDD caching can be found here: http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence.

Broadcast variables and accumulators

Another core feature of Spark is the ability to create two special types of variables: broadcast variables and accumulators.

A broadcast variable is a read-only variable that is made available from the driver program that runs the SparkContext object to the nodes that will execute the computation. This is very useful in applications that need to make the same data available to the worker nodes in an efficient manner, such as machine learning algorithms.
Spark makes creating broadcast variables as simple as calling a method on SparkContext as follows:

    val broadcastAList = sc.broadcast(List("a", "b", "c", "d", "e"))

The console output shows that the broadcast variable was stored in memory, taking up approximately 488 bytes, and it also shows that we still have around 297 MB available to us:

    14/01/30 07:13:32 INFO MemoryStore: ensureFreeSpace(488) called with curMem=96414, maxMem=311387750
    14/01/30 07:13:32 INFO MemoryStore: Block broadcast_1 stored as values to memory (estimated size 488.0 B, free 296.9 MB)
    broadcastAList: org.apache.spark.broadcast.Broadcast[List[String]] = Broadcast(1)

A broadcast variable can be accessed from nodes other than the driver program that created it (that is, the worker nodes) by calling value on the variable:

    sc.parallelize(List("1", "2", "3")).map(x => broadcastAList.value ++ x).collect

This code creates a new RDD with three records from a collection (in this case, a Scala List) of ("1", "2", "3"). In the map function, it returns a new collection with the relevant record from our new RDD appended to the broadcastAList that is our broadcast variable.

Notice that we used the collect method in the preceding code. This is a Spark action that returns the entire RDD to the driver as a Scala (or Python or Java) collection.

We will often use collect when we wish to apply further processing to our results locally within the driver program.

Note that collect should generally only be used in cases where we really want to return the full result set to the driver and perform further processing. If we try to call collect on a very large dataset, we might run out of memory on the driver and crash our program.

It is preferable to perform as much heavy-duty processing on our Spark cluster as possible, preventing the driver from becoming a bottleneck. In many cases, however, collecting results to the driver is necessary, such as during iterations in many machine learning models.
On inspecting the result, we will see that for each of the three records in our new RDD, we now have a record that is our original broadcast List, with the new element appended to it (that is, there is now either "1", "2", or "3" at the end):

    ...
    14/01/31 10:15:39 INFO SparkContext: Job finished: collect at <console>:15, took 0.025806 s
    res6: Array[List[Any]] = Array(List(a, b, c, d, e, 1), List(a, b, c, d, e, 2), List(a, b, c, d, e, 3))

An accumulator is also a variable that is broadcast to the worker nodes. The key difference between a broadcast variable and an accumulator is that while the broadcast variable is read-only, the accumulator can be added to. There are limitations to this; in particular, the addition must be an associative operation so that the global accumulated value can be correctly computed in parallel and returned to the driver program. Each worker node can only access and add to its own local accumulator value, and only the driver program can access the global value. Accumulators are also accessed within the Spark code using the value method.

For more details on broadcast variables and accumulators, see the Shared Variables section of the Spark Programming Guide: http://spark.apache.org/docs/latest/programming-guide.html#shared-variables.

Summary

In this article, we learned the basics of Spark's programming model and API using the interactive Scala console.
Packt
07 Sep 2015
26 min read

Leveraging Python in the World of Big Data

We are generating more and more data day by day. We have generated more data this century than in the previous century, and we are currently only 15 years into this century. Big data is the new buzzword and everyone is talking about it. It brings new possibilities. Google Translate is able to translate between many languages, thanks to big data. We are able to decode our human genome due to it. We can predict the failure of a turbine and do the required maintenance on it because of big data.

There are three Vs of big data and they are defined as follows:

  • Volume: This defines the size of the data. Facebook has petabytes of data on its users.
  • Velocity: This is the rate at which data is generated.
  • Variety: Data is not only in a tabular form. We can get data from text, images, and sound. Data comes in the form of JSON, XML, and other types as well.

In this article by Samir Madhavan, author of Mastering Python for Data Science, we'll learn how to use Python in the world of big data by doing the following:

  • Understanding Hadoop
  • Writing a MapReduce program in Python
  • Using a Hadoop library

What is Hadoop?

According to the Apache Hadoop website, Hadoop stores data in a distributed manner and helps in computing it. It is designed to scale up from a single server to thousands of machines, each offering local computation and storage. Hadoop was created by Doug Cutting and Mike Cafarella in the year 2005. It was named after Doug Cutting's son's toy elephant.

The programming model

MapReduce is a programming paradigm that expresses a large distributed computation as a sequence of distributed operations on large datasets of key-value pairs. The MapReduce framework makes use of a cluster of machines and executes MapReduce jobs across these machines. There are two phases in MapReduce: a mapping phase and a reduce phase. The input data to MapReduce is a set of key-value pairs.
During the mapping phase, Hadoop splits the data into smaller pieces, which are then fed to the mappers. These mappers are distributed across machines within the cluster. Each mapper takes the input key-value pairs and generates intermediate key-value pairs by invoking a user-defined function on them.

After the mapper phase, Hadoop sorts the intermediate dataset by key and generates a set of key-value tuples so that all the values belonging to a particular key are together.

During the reduce phase, the reducer takes in the intermediate key-value pairs and invokes a user-defined function, which then generates an output key-value pair. Hadoop distributes the reducers across the machines and assigns a set of key-value pairs to each of the reducers.

Figure: Data processing through MapReduce

The MapReduce architecture

MapReduce has a master-slave architecture, where the JobTracker is the master and the TaskTracker is the slave. When a MapReduce program is submitted to Hadoop, the JobTracker assigns the mapping and reducing tasks to the TaskTrackers, which then take over the task of executing the program.

The Hadoop DFS

Hadoop's distributed filesystem has been designed to store very large datasets in a distributed manner. It was inspired by the Google File System, which is a proprietary distributed filesystem designed by Google. The data in HDFS is stored in a sequence of blocks, and all blocks are of the same size except for the last block. The block sizes are configurable in Hadoop.

Hadoop's DFS architecture

It also has a master/slave architecture where NameNode is the master machine and DataNode is the slave machine. The actual data is stored in the data nodes. The NameNode keeps track of where the data is stored and whether it has the required replication. It also helps in managing the filesystem by creating, deleting, and moving directories and files in the filesystem.

Python MapReduce

Hadoop can be downloaded and installed from https://hadoop.apache.org/.
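The map, sort, and reduce phases described earlier can be sketched on a single machine in plain Python. This is a minimal illustration under stated assumptions, not how Hadoop is implemented: map_phase, shuffle_phase, reduce_phase, and the sample records are hypothetical names chosen for this example, and everything runs in one process rather than across a cluster.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(records):
    """Emit an intermediate (word, 1) pair for every word in every record."""
    for record in records:
        for word in record.split():
            yield (word, 1)


def shuffle_phase(pairs):
    """Sort by key and group the values so that each key appears once."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]


def reduce_phase(grouped):
    """Collapse the values of each key into a single output pair."""
    for key, values in grouped:
        yield key, sum(values)


# Made-up input records, for illustration only.
records = ["dolly dolly max", "max jack tim max"]
counts = dict(reduce_phase(shuffle_phase(map_phase(records))))
```

The sort in shuffle_phase is what lets the reduce step see all the values of one key together, which is the same property the reducers rely on in Hadoop.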
We'll be using the Hadoop streaming API to execute our Python MapReduce program in Hadoop. The Hadoop Streaming API helps in using any program that has a standard input and output as a MapReduce program.

We'll be writing three MapReduce programs using Python; they are as follows:

  • A basic word count
  • Getting the sentiment score of each review
  • Getting the overall sentiment score from all the reviews

The basic word count

We'll start with the word count MapReduce. Save the following code in a word_mapper.py file:

    #!/usr/bin/env python
    import sys

    for l in sys.stdin:
        # Leading and trailing white space is removed
        l = l.strip()
        # The line is split into words
        word_tokens = l.split()
        # A key-value pair is output for each word
        for w in word_tokens:
            print('%s\t%s' % (w, 1))

In the preceding mapper code, each line of the file is stripped of the leading and trailing white spaces. The line is then divided into word tokens, and each token is output as a key with a value of 1.

Save the following code in a word_reducer.py file:

    #!/usr/bin/env python
    import sys

    current_word_token = None
    current_counter = 0
    word_token = None

    # STDIN input
    for l in sys.stdin:
        # Leading and trailing white space is removed
        l = l.strip()
        # Input from the mapper is parsed
        word_token, counter = l.split('\t', 1)
        # The count is converted to int
        try:
            counter = int(counter)
        except ValueError:
            # If the count is not a number, ignore the line
            continue
        # Since Hadoop sorts the mapper output by key, the following
        # if/else statement works
        if current_word_token == word_token:
            current_counter += counter
        else:
            if current_word_token:
                print('%s\t%s' % (current_word_token, current_counter))
            current_counter = counter
            current_word_token = word_token

    # The last word is output
    if current_word_token is not None:
        print('%s\t%s' % (current_word_token, current_counter))

In the preceding code, we use the current_word_token variable to keep track of the current word that is being counted.
In the for loop, we use the word_token variable and a counter to get the key and the value out of each key-value pair. We then convert the counter to an int type. In the if/else statement, if the word_token value is the same as the previous one, which is held in current_word_token, then we keep adding to its count. If a new word arrives, we output the previous word and its count. The last if statement outputs the last word.

We can check whether the mapper is working fine by using the following command (the script must be executable and start with a #! line for this invocation to work):

    $ echo 'dolly dolly max max jack tim max' | ./BigData/word_mapper.py

The output of the preceding command is shown as follows:

    dolly	1
    dolly	1
    max	1
    max	1
    jack	1
    tim	1
    max	1

Now, we can check whether the reducer is also working fine by piping the sorted mapper output to the reducer:

    $ echo "dolly dolly max max jack tim max" | ./BigData/word_mapper.py | sort -k1,1 | ./BigData/word_reducer.py

The output of the preceding command is shown as follows:

    dolly	2
    jack	1
    max	3
    tim	1

Now, let's try to apply the same code on a local file containing the summary of Moby Dick:

    $ cat ./Data/mobydick_summary.txt | ./BigData/word_mapper.py | sort -k1,1 | ./BigData/word_reducer.py

The output of the preceding command is shown as follows:

    a	28
    A	2
    abilities	1
    aboard	3
    about	2

A sentiment score for each review

We'll extend this to write a MapReduce program to determine the sentiment score for each review.
Write the following code in the senti_mapper.py file:

    #!/usr/bin/env python
    import sys

    positive_words = open('positive-words.txt').read().split('\n')
    negative_words = open('negative-words.txt').read().split('\n')

    def sentiment_score(text, pos_list, neg_list):
        positive_score = 0
        negative_score = 0
        for w in text.split(' '):
            if w in pos_list:
                positive_score += 1
            if w in neg_list:
                negative_score += 1
        return positive_score - negative_score

    for l in sys.stdin:
        # Leading and trailing white space is removed
        l = l.strip()
        # Convert to lower case
        l = l.lower()
        # Getting the sentiment score
        score = sentiment_score(l, positive_words, negative_words)
        # A key-value pair is output
        print('%s\t%s' % (l, score))

In the preceding code, we used the sentiment_score function, which returns the number of positive words minus the number of negative words in a text. For each line, we strip the leading and trailing white spaces, convert it to lowercase, and then get the sentiment score for the review. Finally, we output the sentence and its score.

For this program, we don't require a reducer as we can calculate the sentiment in the mapper itself and we just have to output the sentiment score.

Let's test whether the mapper is working fine locally with a file containing the reviews for Jurassic World:

    $ cat ./Data/jurassic_world_review.txt | ./BigData/senti_mapper.py

    there is plenty here to divert, but little to leave you enraptored. such is the fate of the sequel: bigger. louder. fewer teeth.	0
    if you limit your expectations for jurassic world to "more teeth," it will deliver on that promise. if you dare to hope for anything more-relatable characters, narrative coherence-you'll only set yourself up for disappointment.	-1
    there's a problem when the most complex character in a film is the dinosaur	-2
    not so much another bloated sequel as it is the fruition of dreams deferred in the previous films. too bad the genre dictates that those dreams are once again destined for disaster.	-2

We can see that our program is able to calculate the sentiment score well.
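The scoring logic can also be tried without Hadoop or the dictionary files. The sketch below is a variation on the mapper, not the article's script: it assumes small made-up word sets in place of positive-words.txt and negative-words.txt, uses sets instead of lists so that each lookup is fast, and splits on any whitespace. Like the original, it does not strip punctuation, so a word followed by a comma or period will not match.

```python
def sentiment_score(text, pos_words, neg_words):
    """Return (number of positive words) - (number of negative words)."""
    score = 0
    for w in text.lower().split():
        if w in pos_words:
            score += 1
        if w in neg_words:
            score -= 1
    return score


# Tiny made-up dictionaries, for illustration only.
pos_words = {"fine", "entertaining", "good"}
neg_words = {"bloated", "problem", "disaster"}

reviews = [
    "A perfectly fine movie and entertaining enough",
    "Not so much another bloated sequel",
    "There is a problem when the plot is a disaster",
]
scores = [sentiment_score(r, pos_words, neg_words) for r in reviews]
```

With real dictionaries of several thousand words, the set lookup avoids scanning the whole list for every word of every review.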
The overall sentiment score

To calculate the overall sentiment score, we require a reducer, and we'll use the same mapper but with slight modifications. Here is the mapper code that we'll use, stored in the overall_senti_mapper.py file:

    #!/usr/bin/env python
    import sys
    import hashlib

    positive_words = open('./Data/positive-words.txt').read().split('\n')
    negative_words = open('./Data/negative-words.txt').read().split('\n')

    def sentiment_score(text, pos_list, neg_list):
        positive_score = 0
        negative_score = 0
        for w in text.split(' '):
            if w in pos_list:
                positive_score += 1
            if w in neg_list:
                negative_score += 1
        return positive_score - negative_score

    for l in sys.stdin:
        # Leading and trailing white space is removed
        l = l.strip()
        # Convert to lower case
        l = l.lower()
        # Getting the sentiment score
        score = sentiment_score(l, positive_words, negative_words)
        # Hashing the review to use the hash as the key
        # (on Python 3, pass bytes: hashlib.md5(l.encode('utf-8')))
        hash_object = hashlib.md5(l)
        # A key-value pair is output
        print('%s\t%s' % (hash_object.hexdigest(), score))

This mapper code is similar to the previous mapper code, but here we hash each review with MD5 and output the hash as the key.

Here is the reducer code that is used to determine the overall sentiment score of the movie. Store the following code in the overall_senti_reducer.py file:

    #!/usr/bin/env python
    import sys

    total_score = 0

    # STDIN input
    for l in sys.stdin:
        # Input from the mapper is parsed
        key, score = l.split('\t', 1)
        # The score is converted to int
        try:
            score = int(score)
        except ValueError:
            # If the score is not a number, ignore the line
            continue
        # Updating the total score
        total_score += score

    print('%s' % (total_score,))

In the preceding code, we parse the score out of each line and keep adding it to the total_score variable. Finally, we output the total_score variable, which shows the sentiment of the movie.
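The hash-keyed mapper output and the summing reducer can be exercised in memory before involving Hadoop. The following is a sketch under stated assumptions rather than the article's scripts: it targets Python 3 (where hashlib.md5 needs bytes), map_line and reduce_lines are hypothetical helper names, and the scores are passed in directly instead of being computed from the dictionaries.

```python
import hashlib


def map_line(review, score):
    """Format one mapper output line: MD5 of the review, a tab, the score."""
    key = hashlib.md5(review.encode("utf-8")).hexdigest()
    return "%s\t%s" % (key, score)


def reduce_lines(lines):
    """Sum the scores, skipping lines whose score is not an integer."""
    total = 0
    for line in lines:
        _key, _sep, raw_score = line.strip().partition("\t")
        try:
            total += int(raw_score)
        except ValueError:
            continue
    return total


# Made-up reviews and scores, plus one malformed line, for illustration only.
mapped = [
    map_line("a perfectly fine movie", 4),
    map_line("bloated sequel", -2),
    "malformed line without a score",
]
overall = reduce_lines(sorted(mapped))
```

Sorting the mapped lines mimics the sort step between mapper and reducer; the total does not depend on the order, which is why a single running sum is enough here.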
Let's locally test the overall sentiment on Jurassic World, which is a good movie, and then test the sentiment for the movie Unfinished Business, which was critically deemed poor:

    $ cat ./Data/jurassic_world_review.txt | ./BigData/overall_senti_mapper.py | sort -k1,1 | ./BigData/overall_senti_reducer.py
    19

    $ cat ./Data/unfinished_business_review.txt | ./BigData/overall_senti_mapper.py | sort -k1,1 | ./BigData/overall_senti_reducer.py
    -8

We can see that our code is working well, and we also see that Jurassic World has a positive score, which suggests that people liked it. In contrast, Unfinished Business has a negative score, which suggests that people didn't like it much.

Deploying the MapReduce code on Hadoop

We'll create a directory for data on Moby Dick, Jurassic World, and Unfinished Business in the HDFS tmp folder:

    $ hadoop fs -mkdir /tmp/moby_dick
    $ hadoop fs -mkdir /tmp/jurassic_world
    $ hadoop fs -mkdir /tmp/unfinished_business

Let's check whether the folders have been created:

    $ hadoop fs -ls /tmp/
    Found 6 items
    drwxrwxrwx   - mapred hadoop          0 2014-11-14 15:42 /tmp/hadoop-mapred
    drwxr-xr-x   - samzer hadoop          0 2015-06-18 18:31 /tmp/jurassic_world
    drwxrwxrwx   - hdfs   hadoop          0 2014-11-14 15:41 /tmp/mapred
    drwxr-xr-x   - samzer hadoop          0 2015-06-18 18:31 /tmp/moby_dick
    drwxr-xr-x   - samzer hadoop          0 2015-06-16 18:17 /tmp/temp635459726
    drwxr-xr-x   - samzer hadoop          0 2015-06-18 18:31 /tmp/unfinished_business

Once the folders are created, let's copy the data files to the respective folders.
    $ hadoop fs -copyFromLocal ./Data/mobydick_summary.txt /tmp/moby_dick
    $ hadoop fs -copyFromLocal ./Data/jurassic_world_review.txt /tmp/jurassic_world
    $ hadoop fs -copyFromLocal ./Data/unfinished_business_review.txt /tmp/unfinished_business

Let's verify that the files have been copied:

    $ hadoop fs -ls /tmp/moby_dick
    $ hadoop fs -ls /tmp/jurassic_world
    $ hadoop fs -ls /tmp/unfinished_business

    Found 1 items
    -rw-r--r--   3 samzer hadoop       5973 2015-06-18 18:34 /tmp/moby_dick/mobydick_summary.txt
    Found 1 items
    -rw-r--r--   3 samzer hadoop       3185 2015-06-18 18:34 /tmp/jurassic_world/jurassic_world_review.txt
    Found 1 items
    -rw-r--r--   3 samzer hadoop       2294 2015-06-18 18:34 /tmp/unfinished_business/unfinished_business_review.txt

We can see that the files have been copied successfully.

With the following command, we'll execute our mapper and reducer scripts in Hadoop. In this command, we define the mapper, reducer, input, and output file locations, and then use Hadoop streaming to execute our scripts.

Let's execute the word count program first:

    $ hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-*streaming*.jar -file ./BigData/word_mapper.py -mapper word_mapper.py -file ./BigData/word_reducer.py -reducer word_reducer.py -input /tmp/moby_dick/* -output /tmp/moby_output

Let's verify that the word count MapReduce program is working successfully:

    $ hadoop fs -cat /tmp/moby_output/*

The output of the preceding command is shown as follows:

    (Queequeg	1
    A	2
    Africa	1
    Africa,	1
    After	1
    Ahab	13
    Ahab,	1
    Ahab's	6
    All	1
    American	1
    As	1
    At	1
    Bedford,	1
    Bildad	1
    Bildad,	1
    Boomer,	2
    Captain	1
    Christmas	1
    Day	1
    Delight,	1
    Dick	6
    Dick,	2

The program is working as intended. Now, we'll deploy the program that calculates the sentiment score for each of the reviews.
Note that we can add the positive and negative dictionary files to the Hadoop streaming job with additional -file options. The command as printed here repeats the word count invocation; for the sentiment job, the mapper would be senti_mapper.py and the output directory would be /tmp/jurassic_output, which is the directory read in the next step:

    $ hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-*streaming*.jar -file ./BigData/word_mapper.py -mapper word_mapper.py -file ./BigData/word_reducer.py -reducer word_reducer.py -input /tmp/moby_dick/* -output /tmp/moby_output

In the preceding code, we use the hadoop command with the Hadoop streaming JAR file and then define the mapper and reducer files, and finally, the input and output directories in Hadoop.

Let's check the sentiment scores of the movie reviews:

    $ hadoop fs -cat /tmp/jurassic_output/*

The output of the preceding command is shown as follows:

    "jurassic world," like its predecessors, fills up the screen with roaring, slathering, earth-shaking dinosaurs, then fills in mere humans around the edges. it's a formula that works as well in 2015 as it did in 1993.	3
    a perfectly fine movie and entertaining enough to keep you watching until the closing credits.	4
    an angry movie with a tragic moral ... meta-adoration and criticism ends with a genetically modified dinosaur fighting off waves of dinosaurs.	-3
    if you limit your expectations for jurassic world to "more teeth," it will deliver on that promise. if you dare to hope for anything more-relatable characters, narrative coherence-you'll only set yourself up for disappointment.	-1

This program is also working as intended. Now, we'll try out the overall sentiment of a movie:

    $ hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-*streaming*.jar -file ./BigData/overall_senti_mapper.py -mapper

Let's verify the result:

    $ hadoop fs -cat /tmp/unfinished_business_output/*

The output of the preceding command is shown as follows:

    -8

We can see that the overall sentiment score comes out correctly from MapReduce.

Here is a screenshot of the JobTracker status page:

The preceding image shows a portal where the jobs submitted to the JobTracker can be viewed and the status can be seen.
This can be seen on port 50030 of the master system, which is the default port of the JobTracker web UI. From the preceding image, we can see that a job is running, and the status above the image shows that the job has been completed successfully.

File handling with Hadoopy

Hadoopy is a Python library that provides an API to interact with Hadoop to manage files and perform MapReduce on them. Hadoopy can be downloaded from http://www.Hadoopy.com/en/latest/tutorial.html#installing-Hadoopy.

Let's try to put a few files in Hadoop through Hadoopy in a directory created within HDFS, called data:

    $ hadoop fs -mkdir data

Here is the code that puts the data into HDFS:

    import hadoopy
    import os

    hdfs_path = ''

    def read_local_dir(local_path):
        # Yield the path of every regular file in the local directory
        for fn in os.listdir(local_path):
            path = os.path.join(local_path, fn)
            if os.path.isfile(path):
                yield path

    def main():
        local_path = './BigData/dummy_data'
        for file in read_local_dir(local_path):
            hadoopy.put(file, 'data')
            print("The file %s has been put into hdfs" % (file,))

    if __name__ == '__main__':
        main()

The output of the preceding code is shown as follows:

    The file ./BigData/dummy_data/test9 has been put into hdfs
    The file ./BigData/dummy_data/test7 has been put into hdfs
    The file ./BigData/dummy_data/test1 has been put into hdfs
    The file ./BigData/dummy_data/test8 has been put into hdfs
    The file ./BigData/dummy_data/test6 has been put into hdfs
    The file ./BigData/dummy_data/test5 has been put into hdfs
    The file ./BigData/dummy_data/test3 has been put into hdfs
    The file ./BigData/dummy_data/test4 has been put into hdfs
    The file ./BigData/dummy_data/test2 has been put into hdfs

In the preceding code, we list all the files in a directory and then put each of the files into Hadoop using the put() method of Hadoopy.
Let's check whether all the files have been put into HDFS:

    $ hadoop fs -ls data

The output of the preceding command is shown as follows:

    Found 9 items
    -rw-r--r--   3 samzer hadoop          0 2015-06-23 00:19 data/test1
    -rw-r--r--   3 samzer hadoop          0 2015-06-23 00:19 data/test2
    -rw-r--r--   3 samzer hadoop          0 2015-06-23 00:19 data/test3
    -rw-r--r--   3 samzer hadoop          0 2015-06-23 00:19 data/test4
    -rw-r--r--   3 samzer hadoop          0 2015-06-23 00:19 data/test5
    -rw-r--r--   3 samzer hadoop          0 2015-06-23 00:19 data/test6
    -rw-r--r--   3 samzer hadoop          0 2015-06-23 00:19 data/test7
    -rw-r--r--   3 samzer hadoop          0 2015-06-23 00:19 data/test8
    -rw-r--r--   3 samzer hadoop          0 2015-06-23 00:19 data/test9

So, we have successfully put the files into HDFS.

Pig

Pig is a platform that has a very expressive language to perform data transformations and querying. Code in Pig is written as a script, which gets compiled to MapReduce programs that execute on Hadoop.

Figure: The Pig logo

Pig helps in reducing the complexity of raw-level MapReduce programs, and enables the user to perform fast transformations.

Pig Latin is the textual language, which can be learned from http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html.

We'll cover how to find the 10 most frequently occurring words with Pig, and then we'll see how you can create a function in Python that can be used in Pig.

Let's start with the word count.
Here is the Pig Latin code, which you can save in the pig_wordcount.pig file:

    data = load '/tmp/moby_dick/';
    word_token = foreach data generate flatten(TOKENIZE((chararray)$0)) as word;
    group_word_token = group word_token by word;
    count_word_token = foreach group_word_token generate COUNT(word_token) as cnt, group;
    sort_word_token = ORDER count_word_token by cnt DESC;
    top10_word_count = LIMIT sort_word_token 10;
    DUMP top10_word_count;

In the preceding code, we load the summary of Moby Dick, which is then tokenized line by line, that is, split into individual words. The flatten function converts the collection of word tokens in a line into one row per word. We then group by word and take the count of each word. Finally, we sort the counts in descending order and limit the result to the first 10 rows to get the 10 most frequently occurring words.

Let's execute the preceding Pig script:

    $ pig ./BigData/pig_wordcount.pig

The output of the preceding command is shown as follows:

    (83,the)
    (36,and)
    (28,a)
    (25,of)
    (24,to)
    (15,his)
    (14,Ahab)
    (14,Moby)
    (14,is)
    (14,in)

We are able to get our top 10 words. Let's now create a user-defined function (UDF) with Python, which will be used in Pig. We'll define two user-defined functions to score the positive and negative sentiments of a sentence.

The following code is the UDF used to score the positive sentiment, and it's available in the positive_sentiment.py file:

    positive_words = [ 'a+', 'abound', 'abounds', 'abundance', 'abundant', 'accessable', 'accessible', 'acclaim', 'acclaimed', 'acclamation', 'acco$ ]

    @outputSchema("pnum:int")
    def sentiment_score(text):
        positive_score = 0
        for w in text.split(' '):
            if w in positive_words:
                positive_score += 1
        return positive_score

In the preceding code, we define the positive word list, which is used by the sentiment_score() function. The function checks for the positive words in a sentence and finally outputs their total count.
There is an outputSchema() decorator that is used to tell Pig what type of data is being output, which in our case is int.

Here is the code to score the negative sentiment, and it's available in the negative_sentiment.py file. The code is almost identical to the positive sentiment code:

    negative_words = ['2-faced', '2-faces', 'abnormal', 'abolish', 'abominable', 'abominably', 'abominate', 'abomination', 'abort', 'aborted', 'ab$....]

    @outputSchema("nnum:int")
    def sentiment_score(text):
        negative_score = 0
        for w in text.split(' '):
            if w in negative_words:
                negative_score -= 1
        return negative_score

The following code is used by Pig to score the sentiments of the Jurassic World reviews, and it's available in the pig_sentiment.pig file:

    register 'positive_sentiment.py' using org.apache.pig.scripting.jython.JythonScriptEngine as positive;
    register 'negative_sentiment.py' using org.apache.pig.scripting.jython.JythonScriptEngine as negative;
    data = load '/tmp/jurassic_world/*';
    feedback_sentiments = foreach data generate LOWER((chararray)$0) as feedback, positive.sentiment_score(LOWER((chararray)$0)) as psenti, negative.sentiment_score(LOWER((chararray)$0)) as nsenti;
    average_sentiments = foreach feedback_sentiments generate feedback, psenti + nsenti;
    dump average_sentiments;

In the preceding Pig script, we first register the Python UDF scripts using the register command and give them an appropriate name. We then load our Jurassic World reviews. We then convert the reviews to lowercase and score the positive and negative sentiments of each review. Finally, we add the two scores to get the overall sentiment of a review.

Let's execute the Pig script and see the results:

    $ pig ./BigData/pig_sentiment.pig

The output of the preceding command is shown as follows:

    (there is plenty here to divert, but little to leave you enraptored. such is the fate of the sequel: bigger. louder. fewer teeth.,0)
    (if you limit your expectations for jurassic world to "more teeth," it will deliver on that promise.
if you dare to hope for anything more-relatable characters, narrative coherence-you'll only set yourself up for disappointment.,-1) (there's a problem when the most complex character in a film is the dinosaur,-2) (not so much another bloated sequel as it is the fruition of dreams deferred in the previous films. too bad the genre dictates that those dreams are once again destined for disaster.,-2) (a perfectly fine movie and entertaining enough to keep you watching until the closing credits.,4) (this fourth installment of the jurassic park film series shows some wear and tear, but there is still some gas left in the tank. time is spent to set up the next film in the series. they will keep making more of these until we stop watching.,0) We have successfully scored the sentiments of the Jurassic World review using the Python UDF in Pig. Python with Apache Spark Apache Spark is a computing framework that works on top of HDFS and provides an alternative way of computing that is similar to MapReduce. It was developed by AmpLab of UC Berkeley. Spark does its computation mostly in the memory because of which, it is much faster than MapReduce, and is well suited for machine learning as it's able to handle iterative workloads really well.   Spark uses the programming abstraction of RDDs (Resilient Distributed Datasets) in which data is logically distributed into partitions, and transformations can be performed on top of this data. Python is one of the languages that is used to interact with Apache Spark, and we'll create a program to perform the sentiment scoring for each review of Jurassic Park as well as the overall sentiment. You can install Apache Spark by following the instructions at https://spark.apache.org/docs/1.0.1/spark-standalone.html. 
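To make the idea of partitions and transformations a little more concrete, here is a minimal sketch in plain Python (deliberately not PySpark). The data and the count_good() helper are invented for illustration, and a real RDD additionally provides lazy evaluation, fault tolerance, and distribution across machines, none of which is modelled here.

```python
from functools import reduce
from operator import add

# Hypothetical data: four records split into two partitions.
partitions = [
    ["good film", "bad plot"],
    ["good cast", "good fun"],
]


def count_good(line):
    """Transformation applied to each record independently."""
    return line.split().count("good")


# "map": apply the transformation to every record inside each partition.
mapped = [[count_good(line) for line in part] for part in partitions]

# "reduce": combine the results per partition first, then across partitions.
partials = [reduce(add, part, 0) for part in mapped]
total = reduce(add, partials, 0)

print(mapped)
print(partials)
print(total)
```

Because each partition is processed on its own before the partial results are combined, the per-partition work could in principle run on different machines, which is the property Spark exploits.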
Scoring the sentiment

Here is the Python code to score the sentiment:

from __future__ import print_function
import sys
from operator import add
from pyspark import SparkContext

# One word per line in each file.
positive_words = open('positive-words.txt').read().split('\n')
negative_words = open('negative-words.txt').read().split('\n')

def sentiment_score(text, pos_list, neg_list):
    positive_score = 0
    negative_score = 0
    for w in text.split(' '):
        if w in pos_list:
            positive_score += 1
        if w in neg_list:
            negative_score += 1
    return positive_score - negative_score

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: sentiment <file>", file=sys.stderr)
        exit(-1)
    sc = SparkContext(appName="PythonSentiment")
    lines = sc.textFile(sys.argv[1], 1)
    # Pair each line with its sentiment score.
    scores = lines.map(lambda x: (x, sentiment_score(x.lower(), positive_words, negative_words)))
    output = scores.collect()
    for (key, score) in output:
        print("%s: %i" % (key, score))
    sc.stop()

In the preceding code, we define our standard sentiment_score() function, which we'll be reusing. The if statement checks that exactly one argument, the text file, was passed to the script. The sc variable is a SparkContext object with the PythonSentiment app name. The filename in the argument is passed into Spark through the textFile() method of the sc variable. In the map() function of Spark, we define a lambda function that receives each line of the text file and returns the line together with its sentiment score. The output variable collects the result, and finally, we print the result on the screen. Let's score the sentiment of each of the reviews of Jurassic World. Replace <hostname> with your hostname:

$ ~/spark-1.3.0-bin-cdh4/bin/spark-submit --master spark://<hostname>:7077 ./BigData/spark_sentiment.py hdfs://localhost:8020/tmp/jurassic_world/*

We'll get the following output for the preceding command:

There is plenty here to divert but little to leave you enraptured.
Such is the fate of the sequel: Bigger, Louder, Fewer teeth: 0
If you limit your expectations for Jurassic World to more teeth, it will deliver on this promise. If you dare to hope for anything more—relatable characters or narrative coherence—you'll only set yourself up for disappointment:-1

We can see that our Spark program was able to score the sentiment of each review. The number at the end of each output line is the sentiment score: the higher the score, the more positive the review, and the further below zero the score, the more negative the review.

We use the spark-submit command with the following parameters:
The master node of the Spark system
The Python script containing the transformation commands
An argument to the Python script

The overall sentiment

Here is a Spark program to score the overall sentiment of all the reviews:

from __future__ import print_function
import sys
from operator import add
from pyspark import SparkContext

# One word per line in each file.
positive_words = open('positive-words.txt').read().split('\n')
negative_words = open('negative-words.txt').read().split('\n')

def sentiment_score(text, pos_list, neg_list):
    positive_score = 0
    negative_score = 0
    for w in text.split(' '):
        if w in pos_list:
            positive_score += 1
        if w in neg_list:
            negative_score += 1
    return positive_score - negative_score

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: Overall Sentiment <file>", file=sys.stderr)
        exit(-1)
    sc = SparkContext(appName="PythonOverallSentiment")
    lines = sc.textFile(sys.argv[1], 1)
    # Emit every score under the same key, then sum the scores for that key.
    scores = lines.map(lambda x: ("Total", sentiment_score(x.lower(), positive_words, negative_words))) \
                  .reduceByKey(add)
    output = scores.collect()
    for (key, score) in output:
        print("%s: %i" % (key, score))
    sc.stop()

In the preceding code, we have added a reduceByKey() method, which reduces the values by adding them together, and we have also defined the key as Total, so that all the scores are
reduced based on a single key. Let's try out the preceding code to get the overall sentiment of Jurassic World. Replace <hostname> with your hostname:

$ ~/spark-1.3.0-bin-cdh4/bin/spark-submit --master spark://<hostname>:7077 ./BigData/spark_overall_sentiment.py hdfs://localhost:8020/tmp/jurassic_world/*

The output of the preceding command is shown as follows:

Total: 19

We can see that Spark has given an overall sentiment score of 19. The applications that get executed on Spark can be viewed in the browser on port 8080 of the Spark master. That page shows the number of Spark nodes, the applications that are currently running, and the applications that have already been executed.

Summary

In this article, you were introduced to big data and learned how the Hadoop software works and the architecture associated with it. You then learned how to create a mapper and a reducer for a MapReduce program, how to test it locally, and how to deploy it to Hadoop. You were then introduced to the Hadoopy library, which you used to put files into Hadoop. You also learned about Pig and how to create a user-defined function with it. Finally, you learned about Apache Spark, which is an alternative to MapReduce, and how to use it to perform distributed computing. With this article, we have come to the end of our journey, and you should be in a position to perform data science tasks with Python. From here on, you can participate in Kaggle competitions at https://www.kaggle.com/ to improve your data science skills with real-world problems. This will fine-tune your skills and help you understand how to solve analytical problems. You can also sign up for Andrew Ng's course on machine learning at https://www.coursera.org/learn/machine-learning to understand the nuances behind machine learning algorithms.
Resources for Article: Further resources on this subject: Bizarre Python[article] Predicting Sports Winners with Decision Trees and pandas[article] Optimization in Python [article]

Packt
23 Oct 2009
3 min read

Building Queries Visually in MySQL Query Browser

MySQL Query Browser, one of the open source MySQL GUI tools from MySQL AB, is used for building MySQL database queries visually. In MySQL Query Browser, you build database queries using just your mouse—click, drag and drop! MySQL Query Browser has plenty of visual query building functions and features. This article shows two examples, building Join and Master-detail queries. These examples will demonstrate some of these functions and features. Join Query A pop-up query toolbar will appear when you drag a table or column from the Object Browser’s Schemata tab to the Query Area. You drop the table or column on the pop-up query toolbar’s button to build your query. The following example demonstrates the use of the pop-up query toolbar to build a join query that involves three tables and two types of join (equi and left outer). Drag and drop the product table from the Schemata to Add Table(s) button. A SELECT query on the product table is written in the Query Area. Drag and drop the item table from Schemata to the JOIN Table(s) button on the Pop-up Query Toolbar. The two tables are joined on the foreign-key, product_code. If no foreign-key relationship exists, the drag and drop won’t have any effect. Drag and drop the order table from Schemata to the LEFT OUTER JOIN button on the Pop-up Query Toolbar. Maximize query area by pressing F11. You get a larger query area, and your lines are sequentially numbered (for easier identification). Move the FROM clause to its next line, by putting your cursor just before the FROM word and press Enter. Similarly, move the ON clause to its next line. Now, you can see all lines completely, and that the item table is left join to the order table on their foreign-key relationship column, the order_number column. As of now our query is SELECT *, i.e. selecting all columns from all tables. Let’s now select the columns we’d like to show at the query’s output. 
For example, drag and drop order_number from the item table, product_name from the product table, and then quantity from the item table. (If necessary, expand the table folders to see their columns.) The order in which you select the columns is reflected in the SELECT clause (from left to right). Note that you can't select a column from the left-joined order table (if you try, nothing will happen). Next, add an additional condition. Drag and drop the amount column on the WHERE button in the Pop-up Query Toolbar. The column is added, with an AND, to the WHERE clause of the query. Type in its condition value, for example, > 1000. To finalize the query, drag and drop product_name on the ORDER button, and then order_number (from the item table, not the order table) on the GROUP button. You'll see that the GROUP BY and ORDER BY clauses are placed in the correct order, that is, GROUP BY before ORDER BY, regardless of your drag-and-drop sequence. To test your query, click the Execute button. Your query should run without errors and display its output in the query area (below the query).
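For readers who want to see the kind of statement these drag-and-drop steps build up, here is a rough sketch that runs a similar query against an in-memory SQLite database from Python. Everything about the schema is an assumption made for illustration: the column types, the sample rows, and in particular the placement of the amount column on the item table are not taken from the article, and SQLite stands in for MySQL. The table name order is quoted because it is a reserved word.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema and rows, loosely modelled on the tables named above.
conn.executescript("""
CREATE TABLE product (product_code TEXT PRIMARY KEY, product_name TEXT);
CREATE TABLE "order" (order_number INTEGER PRIMARY KEY);
CREATE TABLE item (
    order_number INTEGER,
    product_code TEXT,
    quantity INTEGER,
    amount REAL
);
INSERT INTO product VALUES ('P1', 'Widget'), ('P2', 'Gadget');
INSERT INTO "order" VALUES (1), (2);
INSERT INTO item VALUES (1, 'P1', 2, 1500), (2, 'P2', 1, 500), (3, 'P2', 5, 2000);
""")

# Equi-join product to item, left outer join item to order,
# then filter, group, and sort as in the walkthrough.
query = """
SELECT i.order_number, p.product_name, i.quantity
FROM product p
JOIN item i ON i.product_code = p.product_code
LEFT OUTER JOIN "order" o ON o.order_number = i.order_number
WHERE i.amount > 1000
GROUP BY i.order_number
ORDER BY p.product_name
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The item row for order 3 has no matching row in the order table, yet it still appears in the result, which is the effect of the left outer join. SQLite accepts the non-aggregated columns alongside GROUP BY; a MySQL server running in a strict SQL mode may reject the same statement.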

Packt
17 Aug 2010
7 min read

Flash Multiplayer Virtual World: Setting up SmartFoxServer with Third-party HTTP and Database Server

We are going to download and install an Apache and MySQL server package. These packages come with an easy installer that auto-configures most of the server settings. They also install some essential tools that help beginners manage the server, such as a GUI server administration panel.

Installing WAMP on Windows

WampServer is an open source HTTP and database server solution for Windows. WAMP stands for Windows, Apache, MySQL, and PHP. Go to http://www.wampserver.com/en/download.php. Click on Download WampServer to download the installer. Run the installer with all default settings. The server is now configured and ready. WampServer can be launched from Start | Programs | WampServer | Start WampServer. It appears in the task bar, and the server management operations can be found by clicking the WampServer icon. We can start the server by choosing to put the server online in the menu.

Installing MAMP on Mac OS X

Similar to WampServer, MAMP is a one-package web server solution that stands for Mac, Apache, MySQL, and PHP. The MAMP package can be downloaded at http://www.mamp.info/. Download the MAMP package from the official website. Double-click on the downloaded MAMP dmg file to mount it. Drag the MAMP folder into the Applications folder. To run the MAMP server, go to Applications | MAMP and double-click on MAMP.app.

Installing LAMP on Linux

Following the same naming convention, the "L" stands for Linux here. Different Linux distributions use different ways to install applications. Some distributions have no one-click install method, which requires us to install Apache and MySQL individually. Others provide a graphical user interface that installs LAMP by just selecting it in the applications list. We will use Ubuntu to demonstrate the installation of LAMP. Launch the terminal from Applications | Accessories | Terminal. Type the following command to install LAMP:
sudo tasksel install lamp-server The installer will progress and configure different modules. A dialog will prompt several times asking for a new MySQL root password. You can set your own MySQL password, while in the example we will leave the root password blank. After the completion of the installation, the MySQL server is set up as service in the system. It runs automatically and we do not need to manually launch it to use it. Connecting SmartFoxServer and MySQL server SmartFoxServer is a Java application and Java database connection driver is needed to connect from SmartFoxServer to MySQL database. Downloading JDBC Driver for MySQL JDBC is a Java database connection driver that we need to establish connections between the Java-based SmartFoxServer and the MySQL server. The JDBC driver for MySQL is called Connector/J. We are going to install it to enable MySQL connection from SmartFoxServer. Go to http://dev.mysql.com/downloads/connector/j/5.1.html in web browser. Download the Platform Independent Zip Archive. It may ask you to log in to MySQL.com account. Click on No thanks, just take me to the downloads! to bypass the login step. Choose a mirror to download by clicking on HTTP. Setting up the JDBC driver The MySQL Java connector comes with a bunch of files. We only need two among them. Extract the mysql-connector-java-5.1.10.zip file to a temporary folder. Open the folder and find the mysql-connector-java-5.1.10-bin.jar file. Copy that jar file into SmartFoxServer installation directory | jre | lib | ext. Go into the src directory of the extracted directory and copy the org directory to SmartFoxServer installation directory | jre | lib | ext. Configuring the server settings The configuration file of SmartFoxServer is an XML file that allows us to configure many server settings. It can configure the initial zone or room creation, server address, admin authorization, value tuning for performance, and a lot more. 
We are going to set the database connection for testing our setup in this article (core settings are out of scope of this article). The configuration file is called config.xml and is located in the SmartFoxServer installation directory under the Server directory. Configuring MySQL server connection in SmartFoxServer Open the config.xml in your favorite text editor. Go to line 203 of the config.xml. This line should be within the structure of a Zone tag with name as dbZone. Change the lines 203-218 from the config.xml: Original code: <DatabaseManager active="false"> <Driver>sun.jdbc.odbc.JdbcOdbcDriver</Driver> <ConnectionString>jdbc:odbc:sfsTest</ConnectionString> <!-- Example connecting to MySQL <Driver>org.gjt.mm.mysql.Driver</Driver> <ConnectionString>jdbc:mysql://192.168.0.1:3306/sfsTest </ConnectionString> --> <UserName>yourname</UserName> <Password>yourpassword</Password> <TestSQL><![CDATA[SELECT COUNT(*) FROM contacts]]></TestSQL> Replace the code in lines 203-218 with the following code: <DatabaseManager active="true"> <Driver>org.gjt.mm.mysql.Driver</Driver> <ConnectionString>jdbc:mysql://127.0.0.1:3306/mysql </ConnectionString> <UserName>root</UserName> <Password></Password> <TestSQL><![CDATA[SELECT NOW()]]></TestSQL> The new setting activates the DatabaseManager and configures the JDBC driver to the MySQL connector that we just downloaded. We also changed the user name and password of the connection to the database to "root" and empty password. We will use the empty password through out the development process but it is strongly recommended to set your own database user password. There is a TestSQL setting where we can write a simple database query so that the SmartFoxServer will try to run it to test if the database connection is correct. As we have not created any new databases for the virtual world, we will test the database connection by querying the current server time. 
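Before restarting the server, it can be handy to double-check the edited block programmatically. The sketch below uses Python's standard xml.etree.ElementTree module on a small stand-alone fragment; the fragment is a hypothetical reduction of config.xml (it assumes DatabaseManager sits directly inside the dbZone Zone element and closes the elements that the excerpt above leaves open), and summarize_db_manager() is a helper name invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment modelled on the settings above; a real config.xml
# contains many more elements around and inside the Zone.
CONFIG_FRAGMENT = """
<Zone name="dbZone">
  <DatabaseManager active="true">
    <Driver>org.gjt.mm.mysql.Driver</Driver>
    <ConnectionString>jdbc:mysql://127.0.0.1:3306/mysql</ConnectionString>
    <UserName>root</UserName>
    <Password></Password>
    <TestSQL><![CDATA[SELECT NOW()]]></TestSQL>
  </DatabaseManager>
</Zone>
"""


def summarize_db_manager(xml_text):
    """Return the zone name and the key DatabaseManager settings."""
    zone = ET.fromstring(xml_text)
    manager = zone.find("DatabaseManager")
    return {
        "zone": zone.get("name"),
        "active": manager.get("active") == "true",
        "driver": manager.findtext("Driver"),
        "test_sql": manager.findtext("TestSQL"),
    }


summary = summarize_db_manager(CONFIG_FRAGMENT)
print(summary)
```

A check like this catches the most common slip, leaving active set to "false", before the server is restarted. It does not verify that the driver jar is installed or that MySQL is reachable; only the server log can confirm that.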
Restarting the server We’ve just set up the connection between SmartFoxServer and third-party database. It is time to test the new setting by restarting the SmartFoxServer. To stop the SmartFoxServer in Windows and Linux, press Ctrl + C. To stop it in Mac OS X, click on the Cancel button in the SmartFoxServer log window. There is a log that appears as usual after we start up the server again. It is important to check the log carefully every time the config.xml is changed. The logfile can provide details of any errors that occur when it tries to load the configure file. For example, if we configure the database connection just now but forget to activate the DatabaseManager, the server will start up correctly. Then you may spend a lot of time debugging why the database connection is not working until you find that the DatabaseManager is not active at all. This happened to me several times while I was developing my first flash virtual world. If the server is running with the new database connection settings, the following lines will be appearing in the log. There can be different database manager settings for each zone. When checking the log, we should be aware which zone the log is referring to. We are configuring the database manager of dbZone zone now. DB Manager Activated ( org.gjt.mm.mysql.Driver ) Zone: dbZone If we forget to activate the DatabaseManager, we will not see the DB Manager Activated wording. Instead, the following message may appear in the log: DB Manager is not active in this Zone! Moreover, if the SmartFoxServer faces some fatal error on start up, it will terminate itself with more detailed error logs. The following lines are an example for error logs that appear when the MySQL connector file is missing: Can’t load db driver: org.gjt.mm.mysql.Driver [ Servre ] > DbManager could not retrive a connection. Java.sql.SQLException: Configuration file not found DbManagerException: The Test SQL statement failed! Please check your configuration. 
These lines state that the testing SQL failed to run, which we just set to test the connection. It also describes what exception has caused this error to help the debugging.

Packt
21 May 2014
19 min read

An Introduction to the Terminal

(For more resources related to this topic, see here.) Why should we use the terminal? With Mint containing a complete suite of graphical tools, one may wonder why it is useful to learn and use the terminal at all. Depending on the type of user, learning how to execute commands in a terminal may or may not be beneficial. If you are a user who intends to use Linux only for basic purposes such as browsing the Internet, checking e-mails, playing games, editing documents, printing, watching videos, listening to music, and so on, terminal commands may not be a useful skill to learn as all of these activities (as well as others) are best handled by a graphical desktop environment. However, the real value of the terminal in Linux comes with advanced administration. Some administrative activities are faster using shell commands than using the GUI. For example, if you wanted to edit the /etc/fstab file, it would take fewer steps to type sudo nano /etc/fstab than it would to open a file manager with root permissions, navigate to the /etc directory, find the fstab file, and click on it to open it. This is especially true if all you want to do is make a quick change. Similarly, typing sudo apt-get install geany may be faster if you already know the name of the package you want, compared to opening up Mint Software Manager, waiting for it to load, finding the geany package, and installing it. On older and slower systems, the overhead caused by graphical programs may delay execution time. Another value in the Linux Shell is scripting. With a script, you can create a text file with a list of commands and instructions and execute all of the commands contained within a single execution. For example, you can create a list of packages that you would prefer to install on your system, type them out in a text file, and add your distribution package's installation command at the beginning of the list. Now, you can install all of your favorite programs with a single command. 
If you save this script for later, you can execute it any time you reinstall Linux Mint so that you can immediately have access to all your favorite programs. If you are administering a server, you can create a script to check the overall health of the system at various times, check for security intrusions, or even configure servers to send you weekly reports on just about anything you'd like to keep yourself updated on. There are entire books dedicated to scripting, so we won't go in detail about it in this article. However, by the end of the article, we will create a script to demonstrate how to do so. Accessing the shell When it comes to Linux, there is very rarely (if ever) a single way to do anything. Just like you have your pick between desktop environments, text editors, browsers, and just about anything else, you also have a choice when it comes to accessing a Linux terminal to execute shell commands. As a matter of fact, you even have a choice on which terminal emulator to use in order to interpret your commands. Linux Mint comes bundled with an application called the GNOME Terminal. This application is actually developed for a completely different desktop environment (GNOME) but is included in Mint because the Mint developers did not create their own terminal emulator for Cinnamon. The GNOME Terminal did the job very well, so there was no need to reinvent the wheel. Once you open the GNOME Terminal, it is ready to do your bidding right away. The following screenshot shows the GNOME terminal, ready for action: As mentioned earlier, there are other terminal emulators that are available. One of the popular terminal emulators is Konsole. It typically comes bundled with Linux distributions, which feature the KDE environment (such as Mint's own KDE edition). In addition, there is also the xfce4-terminal, which comes bundled with the Xfce environment. 
Although each terminal emulator is generally geared toward the desktop environment that features it, there's nothing stopping you from installing them if you find that GNOME Terminal doesn't suit your needs. However, each of the terminal emulators generally function in much the same way, and you may not notice much of a difference, especially when you're starting out. You may be wondering what exactly a terminal emulator is. A terminal emulator is a windowed application that runs in a graphical environment (such as Cinnamon in Mint) that provides you with a terminal window through which you can execute shell commands to interact with the system. In essence, a terminal emulator is emulating what a full-screen terminal may look like, but in an application window. Each terminal emulator in Linux gives you the ability to interact with that distribution's chosen shell, and as each of the various terminal emulators interact with the same shell, you won't notice anything unique about them regarding how commands are run. The differences between one terminal emulator and another are usually in the form of features in the graphical user interface, which surround the terminal window, such as being able to open new terminal windows in tabs instead of separate instances and even open transparent windows so that you can see what is behind your terminal window as you type. While learning about Linux, you'll often hear the term Bash when referring to the shell. Bash is a type of command interpreter that Linux uses; however, there are several others, including (but not limited to) the C shell, the Dash shell, and the Korn shell. When you interact with your Linux distribution through a terminal emulator, you are actually interacting with its shell. Bash itself is a successor to Bourne shell (originally created by Stephen Bourne) and is an acronym for "Bourne Again Shell." 
All distributions virtually include Bash as their default shell; it's the closest shell to a standard one in terms of shells that Linux has. As you start out on your Linux journey, Bash is the only shell you should concern yourself with and the only shell that will be covered in this article. Scripts are generally written against the shell environment in which they are intended to run. This is why when you read about writing scripts in Linux, you'll see them referred to as Bash Scripts as Bash is the target shell and pretty much the standard Linux shell. In addition, terminal emulators aren't the only way to access the Linux shell for entering commands. In fact, you don't even need to install a terminal emulator. You can use TTY (Teletype) terminals, which are full-screen terminals available for your use, by simply pressing a combination of keys on your keyboard. When you switch to a TTY terminal, you are switching away from your desktop environment to a dedicated text-mode console. You can access a TTY terminal by pressing Alt + Ctrl and one of the function keys (F1 through F6) at the same time. To switch back to Cinnamon, press Alt + Ctrl + F8. Not all distributions handle TTY terminals in the same way. For example, some start the desktop environment on TTY 7 (Alt + Ctrl + F7), and others may have a different number of TTYs available. If you are using a different flavor of Mint and Alt + Ctrl + F8 doesn't bring you back to your desktop environment, try Alt + Ctrl + F7 instead. You should notice that the terminal number changes each time you switch between TTY terminals. For example, if you press Alt + Ctrl + F1, you should see a heading that looks similar to Linux Mint XX ReleaseName HostName tty1 (notice the tty number at the end). If you press Alt + Ctrl + F2, you'll see a heading similar to Linux Mint XX ReleaseName HostName tty2. You should notice right away that the TTY number corresponds to the function key you used to access it. 
The benefit of a TTY is that it is an environment separate from your desktop environment, where you can run commands and large jobs. You can have a separate command running in each TTY, each independent of the others, without occupying space in your desktop environment. However, not everyone will find TTYs useful. It all depends on your use case and personal preferences. Regardless of how you access a terminal in Linux to practice entering your commands, all the examples in this article will work fine. In fact, it doesn't even matter if you use the bundled GNOME Terminal or another terminal emulator. Feel free to play around as each of them handles commands in the same way and will work fine for the purposes of this article. Executing commands While utilizing the shell and entering commands, you will find yourself in a completely different world compared to your desktop environment. While using the shell, you'll enter a command, wait for a confirmation that the command was successful (if applicable), and then you will be brought back to the prompt so that you can execute another command. In many cases, the shell simply returns to the prompt with no output. This constitutes a success. Be warned though; the Linux shell makes no assumptions. If you type something incorrectly, you will either see an error message or produce unexpected output. If you tell the shell to delete a file and you direct it to the wrong one, it typically won't prompt for confirmation and will bypass the trash folder. The Linux Shell does exactly what you tell it to, not necessarily what you want it to. Don't let that scare you though. The Linux Shell is very logical and easy to learn. However, with great power comes great responsibility. To get started, open your terminal emulator. You can either open the GNOME Terminal (you will find it in the application menu under Accessories or pinned to the left pane of the application menu by default) or switch to a TTY by pressing Ctrl + Alt +F1. 
You'll see a prompt that will look similar to the following:

username@hostname ~ $

Let's take a moment to examine the prompt. The first part of the prompt displays the username that the commands will be executed as. When you first open a terminal, it is opened under the user account that opened it. The second part of the prompt is the host name of the computer, which will be whatever you named it during the installation. Next, the path is displayed. In the preceding example, it's simply a tilde (~). The ~ character in Linux represents the currently logged-in user's home directory. Thus, in the preceding prompt, we can see that the current directory that the prompt is attached to is the user's home directory. Finally, a dollar sign ($) is displayed. This indicates that the commands are to be run as a normal user and not as the root user. For example, a user named C. Norris is using a machine named Neptune. This user opens a terminal and then switches to the /media directory. The prompt would then be similar to the following:

cnorris@neptune /media $

Now that we have an understanding of the prompt, let's walk through some examples of entering some very basic commands, which are discussed in the following steps. Later in the article, we'll go over more complete examples; however, for now, let's take the terminal out for a spin.

Open a prompt, type pwd, and press Enter. The pwd command stands for print working directory. Its output is the complete path of the directory that the terminal is attached to. If you ever lose your way, the pwd command will save the day. Notice that the command prints the working directory and then completes. This means that it returns you right back to the prompt, ready to accept another command.

Next, try the ls command. (That's "L" and "S", both lowercase.) The name is short for list. When you execute the ls command, you should see a list of the files saved in your current working directory.
If there are no files in your working directory, you'll see no output. For a little bit of fun, try the following command: cowsay Linux Mint is Awesome This command shows that the Mint developers have a sense of humor and included the cowsay program in the default Mint installation. You can make the cow say anything you'd like, but be nice. The following screenshot shows the output of the preceding cowsay command, included in Mint for laughs: Navigating the filesystem Before we continue with more advanced terminal usage, it's important to understand how the filesystem is laid out in Linux as well as how to navigate it. First, we must clarify what exactly is meant by the term "filesystem" as it can refer to different things depending on the context. If you recall, when you installed Linux Mint, you formatted one or more partitions with a filesystem, most likely ext4. In this context, we're referring to the type of formatting applied to a hard-disk partition. There are many different filesystems available for formatting hard disk partitions, and this is true for all operating systems. However, there is another meaning to "filesystem" with regards to Linux. In the context of this article, filesystem refers to the default system of directories (also known as folders) in a Linux installation and how to navigate from one folder to another. The filesystem in an installed Linux system includes many different folders, each with its own purpose. In order to understand how to navigate between directories in a Linux filesystem, you should first have a basic understanding of what the folders are for. You can view the default directory structure in the Linux filesystem in one of the following two ways: One way is to open the Nemo file manager and click on File System on the left-hand side of the window. 
This will open a view of the default folders in Linux, as shown in the following screenshot: Additionally, you can execute the following command from your terminal emulator: ls -l / The following screenshot shows the output of the preceding command from the root of the filesystem: The first point to understand, especially if you're coming from Windows, is that there is no drive lettering in Linux. This means that there is no C drive for your operating system or D drive for your optical drive. The closest thing that the Linux filesystem has to a C: drive is a single forward slash, which represents the beginning of the filesystem. In Linux, everything is a subdirectory of /. When we executed the preceding command (ls -l /), we were asking ls for a listing of /, the beginning of the filesystem. The -l flag tells ls that we would like a long listing, with one entry per line and details such as permissions, owner, size, and modification time, rather than a compact multi-column one. Paths are written as shown in the following example, which references the Music directory under Joe's home directory: /home/joe/Music The first slash (the one before home) references the beginning of the filesystem. If a path in Linux starts with a single forward slash, the path starts at the beginning of the filesystem. In the preceding example, if we start at the beginning of the filesystem, we'll see a directory there named home. Inside the home folder, we'll see another directory named joe. Inside the joe directory, we'll find another directory named Music. The cd command is used to change from the current working directory to the one that we want to work with. Let's demonstrate this with an example. First, let's say that the prompt Joe sees in his terminal is the following: joe@Mint ~ $ From this, we can deduce that the current working directory is Joe's home directory. We know this because the ~ character is shorthand for the user's home directory. 
Let's assume that Joe types the following: pwd Then, his output will be as follows: /home/joe In his case, ~ is the same as /home/joe. Since Joe is currently in his home directory, he can see the contents of that directory by simply typing the following command: ls The Music directory that Joe wants to access would be shown in the output as its path is /home/joe/Music. To change the working directory of the terminal to /home/joe/Music, Joe can type the following: cd /home/joe/Music His prompt will change to the following: joe@Mint ~/Music $ However, the cd command does not make you type the full path. With the cd command, you can type an absolute or relative path. In the preceding cd command, we used an absolute path. An absolute path starts from the beginning of the filesystem (the single forward slash), and each directory from the beginning is completely typed out. In this example, it's unnecessary to type the full path because Joe is already in his home directory. As Music is a subdirectory of the directory he's already in, all he has to do is type the following command in order to get access to his Music directory: cd Music That's it. Without a leading forward slash, the command interpreter understands that we are referencing a directory in the current working directory. If Joe were to use /Music as a path instead, this wouldn't work because there is no Music directory at the top level of the filesystem. If Joe wants to go back one level, he can enter the following command: cd .. Typing the cd command followed by a space and two periods tells the command interpreter that we would like to move up to the level above the one where we currently are. In this case, the command would return Joe to his home directory. 
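The rules for absolute and relative paths can also be checked programmatically. The following is a minimal sketch in JavaScript (Node.js), which the terminal session itself does not require. It relies on Node's built-in path.posix.resolve, and the cd() helper is a hypothetical name introduced only for illustration. It computes the directory that a cd call would aim for, without checking that the directory exists.

```javascript
// Sketch: how an absolute or relative argument to cd is resolved against the
// current working directory. The cd() helper is hypothetical; it only
// computes the resulting path and never touches the real filesystem.
const path = require("path");

function cd(currentDir, target) {
  // path.posix.resolve ignores currentDir when target starts with "/"
  return path.posix.resolve(currentDir, target);
}

const home = "/home/joe";
const music = cd(home, "Music");           // relative path
const same = cd(home, "/home/joe/Music");  // absolute path, same destination
const topLevel = cd(home, "/Music");       // absolute path at the top level
const back = cd(music, "..");              // one level up

console.log(music, same, topLevel, back);
```

On Joe's machine, the real cd /Music would fail because that directory does not exist; the sketch only shows which location each form of the path refers to.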
Finally, as if the difference between a filesystem in the context of hard drive formatting and filesystem in the context of directory structure wasn't confusing enough, there is another key term you should know for use with Linux. This term also has multiple meanings that change depending on the context in which you use it. The word is root. The user account named root is present on all Linux systems. The root account is the Alpha and Omega of the Linux system. The root user has the most permissions of any user on the system; root could even delete the entire filesystem and everything contained within it if necessary. Therefore, it's generally discouraged to use the root account for fear of a typo destroying your entire system. However, in regards to this article, when we talk about root, we're not talking about the root user account. There are actually two other meanings to the word root in Linux in regards to the filesystem. First, you'll often hear of someone referring to the root of the filesystem. They are referring to the single forward slash that represents the beginning of the filesystem. Second, there is a directory in the root of the filesystem named root. Its path is as follows: /root Linux administrators will refer to that directory as "slash root", indicating that it is a directory called root, and it is stored in the root (beginning) of the filesystem. So, what is the /root directory? The /root directory is the home directory for the root account. In this article, we have referred to the /home directory several times. In a Linux system, each user gets their own directory underneath /home. David's home directory would be /home/david and Cindy's home directory is likely to be /home/cindy. (Using lowercase for all user names is a common practice for Linux administrators). Notice, however, that there is no /home/root. The root account is special, and it does not have a home directory in /home as normal users would have. 
/root is basically the equivalent of a home directory for root. The /root directory is not accessible to ordinary users. For example, try the following command: ls /root The ls command by itself displays the contents of the current working directory. However, if we pass a path to ls, we're telling ls that we want to list the storage of a different directory. In the preceding command, we're requesting to list the storage of the /root directory. Unfortunately, we can't. The root account does not want its directories visible to mortal users. If you execute the command, it will give you an error message indicating that permission was denied. Like many Ubuntu-based distributions, the root account in Mint is actually disabled. Even though it's disabled, the /root directory still exists and the root account can be used but not directly logged in to. The takeaway is that you cannot actually log in as root. So far, we've covered the /home and /root subdirectories of /, but what about the rest? This section of the article will be closed with a brief description of what each directory is used for. Don't worry; you don't have to memorize them all. Just use this section as reference. /bin: This stores essential commands accessible to all users. The executables for commands such as ls are stored here. /boot: This stores the configuration information for the boot loader as well as the initial ramdisk for the boot sequence. /dev: This holds the location for devices to represent pieces of hardware, such as hard drives and sound cards. /etc: This stores the configuration files used in the system. Examples include the configuration for Samba, which handles cross-platform networking, as well as the fstab file, which stores mount points for hard disks. /home: As discussed earlier in the article, each user account gets its own directory underneath this directory for storing personal files. /lib: This stores the libraries needed for other binaries. 
/media: This directory serves as a place for removable media to be mounted. If you insert media (such as a flash drive), you'll find it underneath this directory. /mnt: This directory is used for manual mount points; /media is generally used instead, and this directory still exists as a holdover from the past. /opt: Additional programs can be installed here. /proc: Within /proc, you'll find virtual files that represent processes and kernel data. /root: This is the home directory for the root account. /sbin: This consists of super user program binaries. /tmp: This is a place for temporary files. /usr: This is a directory where utilities and applications can be stored for use by all users, but it is not modified directly by users other than the root user. /var: This is a directory where continually changing files, such as printer spools and logs, are stored.
Packt
08 Jul 2015
23 min read

Why Meteor Rocks!

In this article, Isaac Strack, the author of the book Getting Started with Meteor.js JavaScript Framework - Second Edition, discusses some of the features that have contributed to the success of Meteor. Meteor is a disruptive (in a good way!) technology. It enables a new type of web application that is faster, easier to build, and takes advantage of modern techniques, such as Full Stack Reactivity, Latency Compensation, and Data On The Wire. This article explains how web applications have changed over time, why that matters, and how Meteor specifically enables modern web apps through the above-mentioned techniques. By the end of this article, you will have learned what a modern web application is; what Data On The Wire means and how it's different; how Latency Compensation can improve your app experience; and how templates and reactivity let you program the reactive way. Modern web applications Our world is changing. With continual advancements in displays, computing, and storage capacities, things that weren't even possible a few years ago are now not only possible but are critical to the success of a good application. The Web in particular has undergone significant change. The origin of the web app (client/server) From the beginning, web servers and clients have mimicked the dumb terminal approach to computing, where a server with significantly more processing power than a client performs operations on data (writing records to a database, math calculations, text searches, and so on), transforms the data and renders it (turns a database record into HTML and so on), and then serves the result to the client, where it is displayed for the user. In other words, the server does all the work, and the client acts as more of a display, or a dumb terminal. The design pattern for this is called…wait for it…the client/server design pattern. 
The diagrammatic representation of the client-server architecture is shown in the following diagram: This design pattern, borrowed from the dumb terminals and mainframes of the 60s and 70s, was the beginning of the Web as we know it and has continued to be the design pattern that we think of when we think of the Internet. The rise of the machines (MVC) Before the Web (and ever since), desktops were able to run a program such as a spreadsheet or a word processor without needing to talk to a server. This type of application could do everything it needed to, right there on the big and beefy desktop machine. During the early 90s, desktop computers got even more beefy. At the same time, the Web was coming alive, and people started having the idea that a hybrid between the beefy desktop application (a fat app) and the connected client/server application (a thin app) would produce the best of both worlds. This kind of hybrid app—quite the opposite of a dumb terminal—was called a smart app. Many business-oriented smart apps were created, but the easiest examples can be found in computer games. Massively Multiplayer Online games (MMOs), first-person shooters, and real-time strategies are smart apps where information (the data model) is passed between machines through a server. The client in this case does a lot more than just display the information. It performs most of the processing (or acts as a controller) and transforms the data into something to be displayed (the view). This design pattern is simple but very effective. It's called the Model View Controller (MVC) pattern. The model is essentially the data for an application. In the context of a smart app, the model is provided by a server. The client makes requests to the server for data and stores that data as the model. Once the client has a model, it performs actions/logic on that data and then prepares it to be displayed on the screen. 
This part of the application (talking to the server, modifying the data model, and preparing data for display) is called the controller. The controller sends commands to the view, which displays the information. The view also reports back to the controller when something happens on the screen (a button click, for example). The controller receives the feedback, performs the logic, and updates the model. Lather, rinse, repeat! Since web browsers were built to be "dumb clients", the idea of using a browser as a smart app back then was out of the question. Instead, smart apps were built on frameworks such as Microsoft .NET, Java, or Macromedia (now Adobe) Flash. As long as you had the framework installed, you could visit a web page to download/run a smart app. Sometimes, you could run the app inside the browser, and sometimes, you would download it first, but either way, you were running a new type of web app where the client application could talk to the server and share the processing workload. The browser grows up Beginning in the early 2000s, a new twist on the MVC pattern started to emerge. Developers started to realize that, for connected/enterprise "smart apps", there was actually a nested MVC pattern. The server code (controller) was performing business logic against the database (model) through the use of business objects and then sending processed/rendered data to the client application (a "view"). The client was receiving this data from the server and treating it as its own personal "model". The client would then act as a proper controller, perform logic, and send the information to the view to be displayed on the screen. So, the "view" for the server MVC was the "model" for the client MVC. As browser technologies (HTML and JavaScript) matured, it became possible to create smart apps that used the nested MVC design pattern directly inside an HTML web page. This pattern makes it possible to run a full-sized application using only JavaScript. 
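The nested MVC idea described above can be sketched in a few lines of plain JavaScript. This is only an illustration under stated assumptions: the server and client are simulated in one script, and every name in it (database, serverController, clientController, clientView) is hypothetical rather than part of any framework.

```javascript
// Sketch of nested MVC: the server's "view" (its response) becomes the
// client's "model". Both sides are simulated in one script, and all names
// are hypothetical.

// --- Server side ---
const database = [                        // server model
  { id: 1, category: "Tools", internal: "shelf 3" },
  { id: 2, category: "DVDs", internal: "shelf 7" },
];

function serverController() {
  // business logic: strip fields the client should not see
  const rows = database.map(({ id, category }) => ({ id, category }));
  return JSON.stringify(rows);            // server "view": data sent to the client
}

// --- Client side ---
const clientModel = JSON.parse(serverController());   // client model

function clientController(model) {
  // client-side logic: sort a copy of the model before display
  return [...model].sort((a, b) => a.category.localeCompare(b.category));
}

function clientView(items) {
  return items.map((item) => `<div>${item.category}</div>`).join("");
}

const html = clientView(clientController(clientModel));
console.log(html);
```

The point of the sketch is the hand-off in the middle: the string returned by serverController() is the last step of the server's MVC cycle and the first step of the client's.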
There is no longer any need to download multiple frameworks or separate apps. You can now get the same functionality from visiting a URL as you could previously by buying a packaged product. A giant Meteor appears! Meteor takes modern web apps to the next level. It enhances and builds upon the nested MVC design pattern by implementing three key features: Data On The Wire through the Distributed Data Protocol (DDP) Latency Compensation with Mini Databases Full Stack Reactivity with Blaze and Tracker Let's walk through these concepts to see why they're valuable, and then, we'll apply them to our Lending Library application. Data On The Wire The concept of Data On The Wire is very simple and in tune with the nested MVC pattern; instead of having a server process everything, render content, and then send HTML across the wire, why not just send the data across the wire and let the client decide what to do with it? This concept is implemented in Meteor using the Distributed Data Protocol, or DDP. DDP has a JSON-based syntax and sends messages similar to the REST protocol. Additions, deletions, and changes are all sent across the wire and handled by the receiving service/client/device. Since DDP uses WebSockets rather than HTTP, the data can be pushed whenever changes occur. But the true beauty of DDP lies in the generic nature of the communication. It doesn't matter what kind of system sends or receives data over DDP—it can be a server, a web service, or a client app—they all use the same protocol to communicate. This means that none of the systems know (or care) whether the other systems are clients or servers. With the exception of the browser, any system can be a server, and without exception, any server can act as a client. All the traffic looks the same and can be treated in a similar manner. In other words, the traditional concept of having a single server for a single client goes away. 
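To make the idea of additions, changes, and deletions travelling as data more concrete, here is a minimal sketch of a receiver applying DDP-style data messages to a local copy of a collection. The message shapes follow DDP's added, changed, and removed data messages; the applyMessage() function, the document ids, and the sample categories are assumptions made for illustration, and a real DDP client handles many more message types.

```javascript
// Sketch: applying DDP-style data messages to a local copy of a collection.
// applyMessage() and the sample documents are hypothetical.
function applyMessage(store, raw) {
  const message = JSON.parse(raw);        // DDP messages are JSON text
  if (!store.has(message.collection)) {
    store.set(message.collection, new Map());
  }
  const docs = store.get(message.collection);

  switch (message.msg) {
    case "added":
      docs.set(message.id, { ...message.fields });
      break;
    case "changed": {
      const doc = docs.get(message.id);
      if (!doc) break;                    // unknown document: nothing to update
      Object.assign(doc, message.fields || {});
      for (const key of message.cleared || []) delete doc[key];
      break;
    }
    case "removed":
      docs.delete(message.id);
      break;
    default:
      // other message types (connect, sub, method, ...) are ignored here
      break;
  }
  return store;
}

const store = new Map();
const wire = [
  '{"msg":"added","collection":"lists","id":"a1","fields":{"Category":"Tools"}}',
  '{"msg":"added","collection":"lists","id":"b2","fields":{"Category":"DVDs"}}',
  '{"msg":"changed","collection":"lists","id":"b2","fields":{"Category":"Movies"}}',
  '{"msg":"removed","collection":"lists","id":"a1"}',
];
wire.forEach((raw) => applyMessage(store, raw));

console.log([...store.get("lists").entries()]);
```

Nothing in applyMessage() depends on whether the sender is a server, a web service, or another client, which is the property the text highlights.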
You can hook multiple servers together, each serving a discrete purpose, or you can have a client connect to multiple servers, interacting with each one differently. Think about what you can do with a system like that: Imagine multiple systems all coming together to create, for example, a health monitoring system. Some systems are built with C++, some with Arduino, some with…well, we don't really care. They all speak DDP. They send and receive data on the wire and decide individually what to do with that data. Suddenly, very difficult and complex problems become much easier to solve. DDP has been implemented in pretty much every major programming language, allowing you true freedom to architect an enterprise application. Latency Compensation Meteor employs a very clever technique called Mini Databases. A mini database is a "lite" version of a normal database that lives in memory on the client side. Instead of the client sending requests to a server, it can make changes directly to the mini database on the client. This mini database then automatically syncs with the server (using DDP of course), which has the actual database. Out of the box, Meteor uses MongoDB and Minimongo: When the client notices a change, it first executes that change against the client-side Minimongo instance. The client then goes on its merry way and lets the Minimongo handlers communicate with the server over DDP. If the server accepts the change, it then sends out a "changed" message to all connected clients, including the one that made the change. If the server rejects the change, or if a newer change has come in from a different client, the Minimongo instance on the client is corrected, and any affected UI elements are updated as a result. None of this may seem very groundbreaking, but here's the thing: it's all asynchronous, and it's done using DDP. This means that the client doesn't have to wait until it gets a response back from the server. 
It can immediately update the UI based on what is in the Minimongo instance. What if the change was illegal or other changes have come in from the server? This is not a problem as the client is updated as soon as it gets word from the server. Now, what if you have a slow internet connection or your connection goes down temporarily? In a normal client/server environment, you couldn't make any changes, or the screen would take a while to refresh while the client waits for permission from the server. However, Meteor compensates for this. Since the changes are immediately sent to Minimongo, the UI gets updated immediately. So, if your connection is down, it won't cause a problem: All the changes you make are reflected in your UI, based on the data in Minimongo. When your connection comes back, all the queued changes are sent to the server, and the server will send authorized changes to the client. Basically, Meteor lets the client take things on faith. If there's a problem, the data coming in from the server will fix it, but for the most part, the changes you make will be ratified and broadcast by the server immediately. Coding this type of behavior in Meteor is crazy easy (although you can make it more complex and therefore more controlled if you like): lists = new Mongo.Collection("lists"); This one line declares that there is a lists data model. Both the client and server will have a version of it, but they treat their versions differently. The client will subscribe to changes announced by the server and update its model accordingly. The server will publish changes, listen to change requests from the client, and update its model (its master copy) based on these change requests. Wow, one line of code that does all that! Of course, there is more to it, but that's beyond the scope of this article, so we'll move on. 
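The apply-locally-first behavior described above can be illustrated with a small simulation. This is a sketch with hypothetical names (OptimisticStore, fakeServer), and it is not how Meteor or Minimongo is actually implemented; it only shows the order of events: the UI-facing data changes immediately, and the server's verdict later confirms or rolls back each change.

```javascript
// Sketch of latency compensation: apply a change locally first, then let the
// server's answer confirm or correct it. A simplified simulation with
// hypothetical names, not Meteor's or Minimongo's implementation.
class OptimisticStore {
  constructor() {
    this.docs = new Map();      // what the UI renders from
    this.pending = new Map();   // changes the server has not answered yet
  }

  // Called by the UI: the change is visible immediately.
  insert(id, doc) {
    this.docs.set(id, doc);
    this.pending.set(id, doc);
  }

  // Called when the server's verdict arrives (possibly much later).
  onServerResult(id, accepted, serverDoc) {
    this.pending.delete(id);
    if (accepted) {
      this.docs.set(id, serverDoc);   // the server's copy wins
    } else {
      this.docs.delete(id);           // roll back the optimistic change
    }
  }

  categories() {
    return [...this.docs.values()].map((doc) => doc.Category);
  }
}

// A stand-in server rule: reject blank category names.
function fakeServer(doc) {
  return doc.Category.trim().length > 0;
}

const store = new OptimisticStore();
store.insert("a1", { Category: "Games" });
store.insert("b2", { Category: "   " });
const beforeSync = store.categories();    // the UI already shows both entries

// Later, the connection delivers the server's answers.
for (const [id, doc] of [...store.pending]) {
  store.onServerResult(id, fakeServer(doc), doc);
}
const afterSync = store.categories();     // the rejected entry has been removed

console.log(beforeSync, afterSync);
```

While the connection is down, the pending map simply keeps growing; once answers arrive, accepted changes stay and rejected ones disappear from what the UI renders.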
To better understand Meteor data synchronization, see the Publish and subscribe section of the meteor documentation at http://docs.meteor.com/#/full/meteor_publish. Full Stack Reactivity Reactivity is integral to every part of Meteor. On the client side, Meteor has the Blaze library, which uses HTML templates and JavaScript helpers to detect changes and render the data in your UI. Whenever there is a change, the helpers re-run themselves and add, delete, and change UI elements, as appropriate, based on the structure found in the templates. These functions that re-run themselves are called reactive computations. On both the client and the server, Meteor also offers reactive computations without having to use a UI. Called the Tracker library, these helpers also detect any data changes and rerun themselves accordingly. Because both the client and the server are JavaScript-based, you can use the Tracker library anywhere. This is defined as isomorphic or full stack reactivity because you're using the same language (and in some cases the same code!) on both the client and the server. Re-running functions on data changes has a really amazing benefit for you, the programmer: you get to write code declaratively, and Meteor takes care of the reactive part automatically. Just tell Meteor how you want the data displayed, and Meteor will manage any and all data changes. This declarative style is usually accomplished through the use of templates. Templates work their magic through the use of view data bindings. Without getting too deep, a view data binding is a shared piece of data that will be displayed differently if the data changes. Let's look at a very simple data binding—one for which you don't technically need Meteor—to illustrate the point. 
Let's perform the following set of steps to understand the concept in detail: In LendLib.html, you will see an HTML-based template expression: <div id="categories-container">      {{> categories}}   </div> This expression is a placeholder for an HTML template that is found just below it: <template name="categories">    <h2 class="title">my stuff</h2>.. So, {{> categories}} is basically saying, "put whatever is in the template categories right here." And the HTML template with the matching name is providing that. If you want to see how data changes will affect the display, change the h2 tag to an h4 tag and save the change: <template name="categories">    <h4 class="title">my stuff</h4> You'll see the effect in your browser. (my stuff will become itsy bitsy.) That's view data binding at work. Change the h4 tag back to an h2 tag and save the change, unless you like the change. No judgment here...okay, maybe a little bit of judgment. It's ugly, and tiny, and hard to read. Seriously, you should change it back before someone sees it and makes fun of you! Alright, now that we know what a view data binding is, let's see how Meteor uses it. Inside the categories template in LendLib.html, you'll find even more templates: <template name="categories"> <h4 class="title">my stuff</h4> <div id="categories" class="btn-group">    {{#each lists}}      <div class="category btn btn-primary">        {{Category}}      </div>    {{/each}} </div> </template> Meteor uses a template language called Spacebars to provide instructions inside templates. These instructions are called expressions, and they let us do things like add HTML for every record in a collection, insert the values of properties, and control layouts with conditional statements. The first Spacebars expression is part of a pair and is a for-each statement. {{#each lists}} tells the interpreter to perform the action below it (in this case, it tells it to make a new div element) for each item in the lists collection. 
lists is the piece of data, and {{#each lists}} is the placeholder. Now, inside the {{#each lists}} expression, there is one more Spacebars expression: {{Category}} Since the expression is found inside the #each expression, it is considered a property. That is to say that {{Category}} is the same as saying this.Category, where this is the current item in the for-each loop. So, the placeholder is saying, "add the value of the Category property for the current record." Now, if we look in LendLib.js, we will see the reactive values (called reactive contexts) behind the templates: lists : function () { return lists.find(... Here, Meteor is declaring a template helper named lists. The helper, lists, is found inside the template helpers belonging to categories. The lists helper happens to be a function that returns all the data in the lists collection, which we defined previously. Remember this line? lists = new Mongo.Collection("lists"); This lists collection is returned by the above-mentioned helper. When there is a change to the lists collection, the helper gets updated and the template's placeholder is changed as well. Let's see this in action. On your web page pointing to http://localhost:3000, open the browser console and enter the following line: > lists.insert({Category:"Games"}); This will update the lists data collection. The template will see this change and update the HTML code/placeholder. Each of the placeholders will run one additional time for the new entry in lists, and you'll see the following screen: When the lists collection was updated, the Template.categories.lists helper detected the change and reran itself (recomputed). This changed the contents of the code meant to be displayed in the {{> categories}} placeholder. Since the contents were changed, the affected part of the template was re-run. 
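To see why the helper reran on its own, it can help to look at a stripped-down version of the mechanism. The following sketch is loosely modelled on the idea behind reactive computations; Dependency, autorun, and ReactiveList are hypothetical stand-ins written for this illustration, not Meteor's real API, and the sketch reruns computations synchronously for simplicity.

```javascript
// Sketch of a reactive computation. Every name here (Dependency, autorun,
// ReactiveList) is a hypothetical stand-in, not Meteor's real API.
let currentComputation = null;

class Dependency {
  constructor() {
    this.dependents = new Set();
  }
  depend() {
    // remember which computation is reading this data
    if (currentComputation) this.dependents.add(currentComputation);
  }
  changed() {
    // rerun every computation that read this data
    [...this.dependents].forEach((rerun) => rerun());
  }
}

function autorun(fn) {
  const computation = () => {
    const previous = currentComputation;
    currentComputation = computation;
    try {
      fn();
    } finally {
      currentComputation = previous;
    }
  };
  computation();   // first run registers the dependencies
}

class ReactiveList {
  constructor() {
    this.items = [];
    this.dep = new Dependency();
  }
  find() {
    this.dep.depend();
    return [...this.items];
  }
  insert(doc) {
    this.items.push(doc);
    this.dep.changed();
  }
}

const lists = new ReactiveList();
const renders = [];

// Plays the role of the template together with its lists helper.
autorun(() => {
  const html = lists
    .find()
    .map((doc) => `<div class="category">${doc.Category}</div>`)
    .join("");
  renders.push(html);
});

lists.insert({ Category: "Games" });
console.log(renders);
```

The function passed to autorun() never subscribes to anything explicitly. Reading the data through find() is what registers it, and insert() is what triggers the rerun.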
Now, take a minute here and think about how little we had to do to get this reactive computation to run: we simply created a template, instructing Blaze how we want the lists data collection to be displayed, and we put in a placeholder. This is simple, declarative programming at its finest! Let's create some templates We'll now see a real-life example of reactive computations and work on our Lending Library at the same time. Adding categories through the console has been a fun exercise, but it's not a long-term solution. Let's make it so that we can do that on the page instead as follows: Open LendLib.html and add a new button just before the {{#each lists}} expression: <div id="categories" class="btn-group"> <div class="category btn btn-primary" id="btnNewCat">    <span class="glyphicon glyphicon-plus"></span> </div> {{#each lists}} This will add a plus button on the page, as follows: Now, we want to change the button into a text field when we click on it. So let's build that functionality by using the reactive pattern. We will make it based on the value of a variable in the template. Add the following {{#if…else}} conditionals around our new button: <div id="categories" class="btn-group"> {{#if new_cat}} {{else}}    <div class="category btn btn-primary" id="btnNewCat">      <span class="glyphicon glyphicon-plus"></span>    </div> {{/if}} {{#each lists}} The first line, {{#if new_cat}}, checks to see whether new_cat is true or false. If it's false, the {{else}} section is triggered, and it means that we haven't yet indicated that we want to add a new category, so we should be displaying the button with the plus sign. In this case, since we haven't defined it yet, new_cat will always be false, and so the display won't change. Now, let's add the HTML code to display when we want to add a new category: {{#if new_cat}} <div class="category form-group" id="newCat">      <input type="text" id="add-category" class="form-control" value="" />    </div> {{else}} ... 
{{/if}} There's the smallest bit of CSS we need to take care of as well. Open ~/Documents/Meteor/LendLib/LendLib.css and add the following declaration: #newCat { max-width: 250px; } Okay, so now we've added an input field, which will show up when new_cat is true. The input field won't show up unless it is set to true; so, for now, it's hidden. So, how do we make new_cat equal to true? Save your changes if you haven't already done so, and open LendLib.js. First, we'll declare a Session variable, just below our Meteor.isClient check function, at the top of the file: if (Meteor.isClient) { // We are declaring the 'adding_category' flag Session.set('adding_category', false); Now, we'll declare the new template helper new_cat, which will be a function returning the value of adding_category. We need to place the new helper in the Template.categories.helpers() method, just below the declaration for lists: Template.categories.helpers({ lists: function () {    ... }, new_cat: function(){    //returns true if adding_category has been assigned    //a value of true    return Session.equals('adding_category',true); } }); Note the comma (,) on the line above new_cat. It's important that you add that comma, or your code will not execute. Save these changes, and you'll see that nothing has changed. Ta-da! In reality, this is exactly as it should be because we haven't done anything to change the value of adding_category yet. Let's do this now: First, we'll declare our click event handler, which will change the value in our Session variable. To do this, add the following highlighted code just below the Template.categories.helpers() block: Template.categories.helpers({ ... 
}); Template.categories.events({ 'click #btnNewCat': function (e, t) {    Session.set('adding_category', true);    Tracker.flush();    focusText(t.find("#add-category")); } }); Now, let's take a look at the following line of code: Template.categories.events({ This line declares that the events that follow belong to the categories template. Now, let's take a look at the next line: 'click #btnNewCat': function (e, t) { This tells us that we're listening for a click event on the HTML element with id="btnNewCat" (which we already created in LendLib.html). Session.set('adding_category', true); Tracker.flush(); focusText(t.find("#add-category")); Next, we set the Session variable adding_category to true, call Tracker.flush() so that pending reactive updates are applied right away (which puts the input box into the DOM), and then set the focus on the input box with id="add-category". There is one last thing to do, and that is to quickly add the focusText() helper function. To do this, just before the closing brace for the if (Meteor.isClient) block, add the following code: /////Generic Helper Functions///// //this function puts our cursor where it needs to be. function focusText(i) { i.focus(); i.select(); }; } //<------closing bracket for if(Meteor.isClient){} Now, when you save the changes and click on the plus button, you will see the input box: Fancy! However, it's still not useful. Let's pause for a second and reflect on what just happened: we created a conditional template in the HTML page that will either show an input box or a plus button, depending on the value of a variable. This variable is a reactive variable, called a reactive context. This means that if we change the value of the variable (like we do with the click event handler), then the view automatically updates because the new_cat helper function (a reactive computation) will rerun. Congratulations, you've just used Meteor's reactive programming model! 
To really bring this home, let's add a change to the lists collection (which is also a reactive context, remember?) and figure out a way to hide the input field when we're done. First, we need to add a listener for the keyup event. Or, to put it another way, we want to listen when the user types something in the box and hits Enter. When this happens, we want to add a category based on what the user typed. To do this, let's first declare the event handler. Just after the click handler for #btnNewCat, let's add another event handler: 'click #btnNewCat': function (e, t) {    ... }, 'keyup #add-category': function (e,t){    if (e.which === 13)    {      var catVal = String(e.target.value || "");      if (catVal)      {        lists.insert({Category:catVal});        Session.set('adding_category', false);      }    } } We add a "," character at the end of the first click handler, and then add the keyup event handler. Now, let's check each of the lines in the preceding code: This line checks to see whether we hit the Enter/Return key. if (e.which === 13) This line of code checks to see whether the input field has any value in it: var catVal = String(e.target.value || ""); if (catVal) If it does, we want to add an entry to the lists collection: lists.insert({Category:catVal}); Then, we want to hide the input box, which we can do by simply modifying the value of adding_category: Session.set('adding_category', false); There is one more thing to add and then we'll be done. When we click away from the input box, we want to hide it and bring back the plus button. We already know how to do this reactively, so let's add a quick function that changes the value of adding_category. To do this, add one more comma after the keyup event handler and insert the following event handler: 'keyup #add-category': function (e,t){ ... }, 'focusout #add-category': function(e,t){    Session.set('adding_category',false); } Save your changes, and let's see this in action! 
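The three handlers added so far can also be exercised outside the browser to check their logic. The following sketch assumes nothing about Meteor itself: fakeSession and fakeLists are hypothetical stand-ins for Session and the lists collection, and the handler bodies mirror the ones above, minus the Tracker.flush() and focusText() calls, which need a real DOM.

```javascript
// Sketch: exercising the event handlers without a browser or Meteor.
// fakeSession and fakeLists are hypothetical stand-ins for Session and the
// lists collection.
const fakeSession = {
  values: {},
  set(key, value) { this.values[key] = value; },
  equals(key, value) { return this.values[key] === value; },
};
const fakeLists = {
  docs: [],
  insert(doc) { this.docs.push(doc); },
};

const ENTER_KEY = 13;

const handlers = {
  "click #btnNewCat"() {
    fakeSession.set("adding_category", true);
  },
  "keyup #add-category"(e) {
    if (e.which !== ENTER_KEY) return;            // only react to Enter
    const catVal = String(e.target.value || "");
    if (catVal) {
      fakeLists.insert({ Category: catVal });
      fakeSession.set("adding_category", false);  // hide the input box again
    }
  },
  "focusout #add-category"() {
    fakeSession.set("adding_category", false);
  },
};

// Simulate: click the plus button, type a letter, then press Enter.
handlers["click #btnNewCat"]();
const shownAfterClick = fakeSession.equals("adding_category", true);

handlers["keyup #add-category"]({ which: 65, target: { value: "Clothes" } });
const countAfterLetter = fakeLists.docs.length;

handlers["keyup #add-category"]({ which: ENTER_KEY, target: { value: "Clothes" } });
const shownAfterEnter = fakeSession.equals("adding_category", true);

console.log(shownAfterClick, countAfterLetter, fakeLists.docs, shownAfterEnter);
```

Keeping the handler bodies this small is what makes them easy to check in isolation: each one only sets a flag or inserts a document, and the reactive layer takes care of the display.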
In your web browser on http://localhost:3000, click on the plus sign, add the word Clothes, and hit Enter. Your screen should now resemble the following screenshot: Feel free to add more categories if you like. Also, experiment by clicking on the plus button, typing something in, and then clicking away from the input field. Summary In this article, you learned about the history of web applications and saw how we've moved from a traditional client/server model to a nested MVC design pattern. You learned what smart apps are, and you also saw how Meteor has taken smart apps to the next level with Data On The Wire, Latency Compensation, and Full Stack Reactivity. You saw how Meteor uses templates and helpers to automatically update content, using reactive variables and reactive computations. Lastly, you added more functionality to the Lending Library. You made a button and an input field to add categories, and you did it all using reactive programming rather than directly editing the HTML code.

Packt
22 Oct 2009
7 min read

Blogger: Improving Your Blog with Google Analytics and Search Engine Optimization

If you've ever wondered how people find your website or how to generate more traffic, this article will help you learn more about your visitors. Knowing where they come from, which posts they like, how long they stay, and other site metrics is valuable information to have as a blogger. You would expect to pay for such a deep look into the underbelly of your blog, but Google wants to give it to you for free. Why for free? The better your site does, the more likely you are to pay for AdWords or use other Google tools. The Google Analytics online statistics application is a delicious carrot to encourage content-rich sites and better ad revenue for everyone involved.

You also want people to find your blog when they search for your topic. The painful truth is that search engines have to find your blog before it will show up in their results. Thousands of new blogs are created every day. If you want people to be able to find your blog in the increasingly crowded blogosphere, optimizing it for search engines will improve the odds.

Improving Your Blog with Google Analytics

Analytics gives you an overwhelming amount of data for measuring the success of your sites and ads. Once you've had time to analyze that data, you will want to take action to improve the performance of your blog and ads. We'll now look at how Analytics can help you make decisions about the design and content of your site.

Analyzing Navigation

The Navigation section of the Content Overview report reveals how your visitors actually navigate your blog. Visitors move around a site in ways we can't predict. Seeing how they actually navigate a site and where they entered it gives us powerful tools for diagnosing where we need to improve our blog.

Exploring the Navigation Summary

The Navigation Summary shows you the path people take through your site, including how they get there and where they go.
We can see from the following graphical representation that our visitors entered the site through the main page of the blog most of the time. After reaching that page, they went on to other pages within the site over half the time.

Entrance Paths

We can see the path that visitors take to enter our blog using the Entrance Paths report. It shows us where they entered our site, which pages they looked at, and the last page they viewed before exiting. Visitors don't always enter by the main page of a site, especially if they find the site using search engines or trackbacks. The following screenshot displays a typical entrance path. The visitor comes to the site home page, and then goes to the full page of one of the posts. It looks like our visitors are highly attracted to the recipe posts. Georgia may want to feature more posts about recipes that tie in with her available inventory.

Optimizing your Landing Page

The Landing Page reports tell you where your visitors are coming from and whether they used keywords to find you. You can view either the sources visitors used to get to your blog or the keywords. Knowing the sources will show you where to focus your marketing or advertising efforts.

Examining Entrance Sources

You can quickly see how visitors are finding your site, whether through a direct link, a search engine, locally from Blogger, or from social networking applications such as Twitter.com. In the Entrance Sources graph shown in the following screenshot, we can see that the largest share of visitors comes to the blog using a direct link. Blogger is also responsible for a large share of our visitors, over 37%. There is even a visitor drawn to the blog from Twitter.com, where Georgia has an account.

Discovering Entrance Keywords

When visitors arrive at your site using keywords, the words they use will show up on the report.
If they are using words in a pattern that does not match your site content, you may see a high bounce rate. You can use this report to redesign your landing page so that the words and phrases you use better represent the purpose of your site.

Interpreting Click Patterns

When visitors visit your site, they show their attraction to links and interactive content by clicking on them. Click Patterns are the representation of all those mouse clicks over a set time period. Using the Site Overlay reporting feature, you can see the mouse clicks represented in a graphical pattern. Much like colored pins stuck on a wall chart, they quickly reveal which areas of your site visitors clicked on the most and which links they avoided.

Understanding Site Overlay

Site Overlay shows the number of clicks for your site by laying them transparently in a graphical format on top of your site. Details with the number of clicks and goal tracking information pop up in a little box when you hover over a click graphic with your mouse. At the top of the screen are options that control the display of the Site Overlay. Clicking the Hide Overlay link will hide the overlay from view. The Displaying drop-down list lets you choose whether to view mouse Clicks on the page or goals. The date range is the last item displayed. The graphical bars shown on top of the page content indicate where visitors clicked and how many of them did so. You can quickly see which areas of the page interest your visitors the most. Based on the page clicks you see, you will have an idea of the content and advertising that is most interesting to your visitors. Site Overlay shows clicks on both the content areas and the advertisement areas of the page. It will also help you see which links are tied to goals, and whether they are enticing your visitors to click.
Optimizing Your Blog for Search Engines

We are going to take our earlier checklists and use them as guides on where to make changes to our blog. When the changes are complete, the blog will be more attractive to search engines and visitors. We will start with changes we can make "On-site", and then progress to ways we can improve search engine results with "Off-site" improvements.

Optimizing On-site

The most crucial improvements we identified earlier were around the blog settings, template, and content. We will start with the easiest fixes, then dive into the template to correct validation issues. Let's begin with the settings in our Blogger blog.

Seeding the Blog Title and Description with Keywords

When you created your blog, did you take a moment to think about what words potential visitors were likely to type in when searching for your blog? Using keywords in the title and description of your blog gives potential visitors a preview and explanation of the topics they can expect to encounter in your blog. This information is also what will display in search results when potential visitors perform a search.

Updating the Blog Title and Description

It's never too late to seed your blog title and description with keywords. We will edit the blog title and description to optimize them for search engines. Log in to your blog and navigate to Settings | Basic. We are going to replace the current title text with a phrase that more closely fits the blog. Type Organic Fruit for All into the Title field. Now, we are going to change the description of the blog. Type Organic Fruit Recipes, seasonal tips, and guides to healthy living into the description field. Scroll down to the bottom of the screen and click the Save Settings button. You can enter up to 500 characters of descriptive text.

What Just Happened?

When we changed the title and description of our blog in the Basic Settings section, Blogger saved the changes and updated the template information as well.
Now, when search engines crawl our blog, they will see richer descriptions of our blog in the blog title and blog description. The next optimization task is to verify that search engines can index our blog.

Packt
23 Oct 2009
4 min read

Zen Gift of Education

Many distributions have special releases around Christmas and New Year. I was planning to look at some of these this month, like last year's Ubuntu Christmas Edition. But instead I found a release that's useful enough to maintain all year round. ZenEdu is a Live distribution that packs a whole bunch of educational tools on top of the Slackware-based, lightweight, and zippy Zenwalk Linux.

As per Zenwalk's Wiki, ZenEdu was initiated by a user on the distro's French forum in December last year. At that time the distro contained mostly French-only educational programs. This year, several members of the Zenwalk Linux community decided to release an international edition of ZenEdu. The distro is a goldmine of open source educational software and also packs a detailed user manual, which shows the developers' serious approach to doing things properly.

The educational apps included in the distro cover a broad range of subjects. The ZenEdu ISO is about 700 MB and includes apps that'll help users with subjects like Astronomy, Mathematics, and Chemistry. Since learning is the core idea behind the distro, it goes beyond traditional curriculum subjects and also packs tools that'll teach students the basics of programming and music. Some of the tools I particularly like are Stellarium, the popular 3D planetarium; Stardict, a multi-language dictionary; ghemical, a comprehensive computational chemistry package; Little Wizard, which introduces the basics of programming to young students; and Maxima, for the manipulation of symbolic and numerical expressions, including differentiation, integration, ordinary differential equations, systems of linear equations, etc. If you want to learn music, train your ears with Solfege, and use TuxGuitar to edit and play guitar tablatures.

What sets ZenEdu apart from other educational distros is that it bundles other productivity tools as well.
This includes general-purpose applications like the IceWeasel web browser, IceDove email client, Pidgin for instant messaging, Kompozer for authoring web pages, and OpenOffice.org for word processing. Furthermore, the distro packs several other apps, which, according to the developers, were chosen based on their usefulness to students while keeping in mind the things that might interest them. These include a simple program to manage personal tasks and todo lists, a drawing program, a comic book viewer, a video editor, and a program to create a wide array of 3D content.

However, there are dozens of free software educational tools that aren't included in this CD due to size considerations. But that's no problem. Since ZenEdu is based on Zenwalk, it too can be expanded with drag-and-drop modules. To create a new customized ZenEdu Live CD, browse and download the modules of the educational apps you want and use the remastering application, isomaster, to add them to your customized ZenEdu Live CD!

The highlight of this distro, though, is the iTALC tool for teachers. iTALC, which stands for Intelligent Teaching And Learning with Computers, is a powerful cross-platform didactical tool that lets teachers view and control other computers in their network. Using iTALC, teachers can see what's going on in computer labs and take snapshots, remote-control computers to support and help students, run a demo on all students' computers in real time, send text messages to students, power-cycle and reboot computers remotely, etc.

ZenEdu has a special 'teacher' account pre-configured to run iTALC. Once logged in as that user, you can start iTALC and navigate through its interface, first adding student computers, and then controlling or monitoring them. ZenEdu's wiki page advises that if you'll be using the program regularly, you should save the 'teacher' account's iTALC directory (/home/teacher/.italc/) inside zenlive/rootcopy of the Live CD via isomaster.
This will load the iTALC configuration the next time you boot the remastered Live CD. If you'll be using iTALC regularly, you'd be better off installing ZenEdu on your hard disk. Unfortunately, ZenEdu isn't installable. It's only a Live CD, and at best can be installed onto a USB Flash stick for portability.

Most of the specialized distros I've played with tend to be too specialized. They do what they are supposed to, but nothing more. ZenEdu is different in that, in a single CD, the developers have managed to squeeze a good number of educational apps as well as everyday tools. I hope members of the Zenwalk community actively develop and maintain ZenEdu.

Some more articles by Mayank Sharma: Meet the Distro guy; Making a Complete yet Small Linux Distribution
Savia Lobo
27 Feb 2018
11 min read

How to win Kaggle competition with Apache SparkML

Note: This article is an excerpt from the book Mastering Apache Spark 2.x - Second Edition, written by Romeo Kienzler. The book introduces you to Project Tungsten and Catalyst, two of the major advancements of Apache Spark 2.x.

In today's tutorial, we will show how to take advantage of Apache SparkML to win a Kaggle competition. We'll use an archived competition offered by BOSCH, a German multinational engineering and electronics company, on production line performance data. The data for this competition represents measurements of parts as they move through Bosch's production line. Each part has a unique Id. The goal is to predict which parts will fail quality control (represented by a 'Response' = 1). For more details on the competition data, you may visit the website: https://www.kaggle.com/c/bosch-production-line-performance/data.

Data preparation

The challenge data comes in three ZIP packages, but we only use two of them. One contains categorical data, one contains continuous data, and the last one contains timestamps of measurements, which we will ignore for now. If you extract the data, you'll get three large CSV files. So the first thing that we want to do is re-encode them into parquet in order to be more space-efficient:

    // Base directory of the extracted files; defined outside convert
    // because it is used again when the parquet files are read back
    val basePath = "yourBasePath"

    def convert(filePrefix : String) = {
      var df = spark
        .read
        .option("header",true)
        .option("inferSchema", "true")
        .csv(basePath+filePrefix+".csv")
      df = df.repartition(1)
      df.write.parquet(basePath+filePrefix+".parquet")
    }

    convert("train_numeric")
    convert("train_date")
    convert("train_categorical")

First, we define a function convert that just reads the .csv file and rewrites it as a .parquet file.
As you can see, this saves a lot of space:

Now we read the files in again as DataFrames from the parquet files:

    var df_numeric = spark.read.parquet(basePath+"train_numeric.parquet")
    var df_categorical = spark.read.parquet(basePath+"train_categorical.parquet")

Here is the output of the same:

This is very high-dimensional data; therefore, we will take only a subset of the columns for this illustration:

    df_categorical.createOrReplaceTempView("dfcat")
    var dfcat = spark.sql("select Id, L0_S22_F545 from dfcat")

In the following picture, you can see the unique categorical values of that column:

Now let's do the same with the numerical dataset:

    df_numeric.createOrReplaceTempView("dfnum")
    var dfnum = spark.sql("select Id,L0_S0_F0,L0_S0_F2,L0_S0_F4,Response from dfnum")

Here is the output of the same:

Finally, we rejoin these two relations:

    var df = dfcat.join(dfnum,"Id")
    df.createOrReplaceTempView("df")

Then we have to do some NA treatment:

    var df_notnull = spark.sql("""
    select
      Response as label,
      case when L0_S22_F545 is null then 'NA' else L0_S22_F545 end as L0_S22_F545,
      case when L0_S0_F0 is null then 0.0 else L0_S0_F0 end as L0_S0_F0,
      case when L0_S0_F2 is null then 0.0 else L0_S0_F2 end as L0_S0_F2,
      case when L0_S0_F4 is null then 0.0 else L0_S0_F4 end as L0_S0_F4
    from df
    """)

Feature engineering

Now it is time to run the first transformer (which is actually an estimator). It is StringIndexer and needs to keep track of an internal mapping table between strings and indexes.
Therefore, it is not a transformer but an estimator:

    import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

    var indexer = new StringIndexer()
      .setHandleInvalid("skip")
      .setInputCol("L0_S22_F545")
      .setOutputCol("L0_S22_F545Index")

    var indexed = indexer.fit(df_notnull).transform(df_notnull)
    indexed.printSchema

As we can see clearly in the following image, an additional column called L0_S22_F545Index has been created:

Finally, let's examine some content of the newly created column and compare it with the source column. We can clearly see how the category string gets transformed into a float index:

Now we want to apply OneHotEncoder, which is a transformer, in order to generate better features for our machine learning model:

    var encoder = new OneHotEncoder()
      .setInputCol("L0_S22_F545Index")
      .setOutputCol("L0_S22_F545Vec")

    var encoded = encoder.transform(indexed)

As you can see in the following figure, the newly created column L0_S22_F545Vec contains org.apache.spark.ml.linalg.SparseVector objects, which are a compressed representation of a sparse vector:

Note: Sparse vector representations: The OneHotEncoder, like many other algorithms, returns a sparse vector of the org.apache.spark.ml.linalg.SparseVector type because, by definition, only one element of the vector can be one; the rest have to remain zero. This gives a lot of opportunity for compression, as only the positions of the non-zero elements have to be known. Apache Spark uses a sparse vector representation in the following format: (l,[p],[v]), where l stands for the length of the vector, p for the position (this can also be an array of positions), and v for the actual values (this can be an array of values). Positions are zero-based, so if we get (13,[10],[1.0]), as in our earlier example, the actual sparse vector looks like this: (0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0).
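The (l,[p],[v]) notation can be expanded by hand. The following is a plain JavaScript illustration, not Spark code; toDense is a hypothetical helper name, and it assumes zero-based positions, which is consistent with the assembled feature vector shown later in this tutorial.

```javascript
// Plain JavaScript illustration of the (l,[p],[v]) notation; not Spark code.
// toDense is a hypothetical helper name used only for this sketch.
function toDense(length, positions, values) {
  if (positions.length !== values.length) {
    throw new Error("positions and values must have the same length");
  }
  // Start with all zeros, then fill in the stored entries.
  const dense = new Array(length).fill(0.0);
  positions.forEach((position, i) => {
    if (position < 0 || position >= length) {
      throw new RangeError("position " + position + " is outside the vector");
    }
    dense[position] = values[i]; // positions are zero-based
  });
  return dense;
}

const dense = toDense(13, [10], [1.0]);
console.log(dense.length);       // 13
console.log(dense.indexOf(1.0)); // 10
```

Only one position and one value are stored for a vector of thirteen elements, which is where the compression comes from.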
So now that we are done with our feature engineering, we want to create one overall sparse vector containing all the necessary columns for our machine learner. This is done using VectorAssembler:

    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.ml.linalg.Vectors

    var vectorAssembler = new VectorAssembler()
      .setInputCols(Array("L0_S22_F545Vec", "L0_S0_F0", "L0_S0_F2","L0_S0_F4"))
      .setOutputCol("features")

    var assembled = vectorAssembler.transform(encoded)

We basically just define a list of column names and a target column, and the rest is done for us:

As the view of the features column got a bit squashed, let's inspect one instance of the feature field in more detail:

We can clearly see that we are dealing with a sparse vector of length 16 where positions 0, 13, 14, and 15 are non-zero and contain the following values: 1.0, 0.03, -0.034, and -0.197. Done! Let's create a Pipeline out of these components.

Testing the feature engineering pipeline

We build the Pipeline out of our transformers and estimators:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.PipelineModel

    //Create an array out of individual pipeline stages
    //(the stage objects themselves, not the DataFrames they produced)
    var transformers = Array(indexer,encoder,vectorAssembler)

    var pipeline = new Pipeline().setStages(transformers).fit(df_notnull)
    var transformed = pipeline.transform(df_notnull)

Note that the setStages method of Pipeline just expects an array of transformers and estimators, which we had created earlier. As parts of the Pipeline contain estimators, we have to run fit on our DataFrame first. The obtained Pipeline object takes a DataFrame in the transform method and returns the results of the transformations:

As expected, we obtain the very same DataFrame as we had while running the stages individually in a sequence.
Training the machine learning model

Now it's time to add another component to the Pipeline: the actual machine learning algorithm, RandomForest:

    import org.apache.spark.ml.classification.RandomForestClassifier

    var rf = new RandomForestClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")

    var model = new Pipeline().setStages(transformers :+ rf).fit(df_notnull)
    var result = model.transform(df_notnull)

This code is very straightforward. First, we have to instantiate our algorithm and obtain it as a reference in rf. We could have set additional parameters on the model, but we'll do this later in an automated fashion in the CrossValidation step. Then, we just add the stage to our Pipeline, fit it, and finally transform. The fit method, apart from running all upstream stages, also calls fit on the RandomForestClassifier in order to train it. The trained model is now contained within the Pipeline, and the transform method actually creates our prediction column:

As we can see, we've now obtained an additional column called prediction, which contains the output of the RandomForestClassifier model. Of course, we've only used a very limited subset of the available features/columns and have also not yet tuned the model, so we don't expect to do very well; however, let's take a look at how we can evaluate our model easily with Apache SparkML.

Model evaluation

Without evaluation, a model is worth nothing, as we don't know how accurately it performs.
Therefore, we will now use the built-in BinaryClassificationEvaluator in order to assess prediction performance with a widely used measure called areaUnderROC (going into detail here is beyond the scope of this book):

    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    val evaluator = new BinaryClassificationEvaluator()

    import org.apache.spark.ml.param.ParamMap
    var evaluatorParamMap = ParamMap(evaluator.metricName -> "areaUnderROC")
    var aucTraining = evaluator.evaluate(result, evaluatorParamMap)

As we can see, there is a built-in class called org.apache.spark.ml.evaluation.BinaryClassificationEvaluator, and there are other classes for other prediction use cases, such as RegressionEvaluator or MulticlassClassificationEvaluator. The evaluator takes a parameter map (in this case, we are telling it to use the areaUnderROC metric), and finally, the evaluate method evaluates the result:

As we can see, areaUnderROC is 0.5424418446501833. An ideal classifier would return a score of one, and random guessing would score 0.5. So we are only doing a bit better than random guesses but, as already stated, the number of features that we are looking at is fairly limited.

Note: In the previous example, we are using the areaUnderROC metric, which is used for the evaluation of binary classifiers. There is an abundance of other metrics used for different disciplines of machine learning, such as accuracy, precision, recall, and F1 score. The following provides a good overview: http://www.cs.cornell.edu/courses/cs578/2003fa/performance_measures.pdf

This areaUnderROC is in fact a very bad value. Let's see if choosing better parameters for our RandomForest model increases this a bit in the next section.
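For intuition about the areaUnderROC number, it can be read as the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes that quantity directly in plain JavaScript on made-up scores. It is an illustration under that definition, not how Spark's evaluator is implemented, and the function name and sample data are hypothetical.

```javascript
// Pairwise definition of the area under the ROC curve; illustration only.
// The function name and the sample scores are made up for this sketch.
function areaUnderROC(scores, labels) {
  let favorable = 0; // positive/negative pairs ranked correctly (ties count 0.5)
  let pairs = 0;     // all positive/negative pairs
  for (let i = 0; i < scores.length; i++) {
    if (labels[i] !== 1) continue;
    for (let j = 0; j < scores.length; j++) {
      if (labels[j] !== 0) continue;
      pairs += 1;
      if (scores[i] > scores[j]) favorable += 1;
      else if (scores[i] === scores[j]) favorable += 0.5;
    }
  }
  if (pairs === 0) {
    throw new Error("need at least one positive and one negative label");
  }
  return favorable / pairs;
}

const labels = [1, 1, 0, 0];
console.log(areaUnderROC([0.9, 0.8, 0.2, 0.1], labels)); // 1: perfect ranking
console.log(areaUnderROC([0.5, 0.5, 0.5, 0.5], labels)); // 0.5: uninformative
console.log(areaUnderROC([0.9, 0.2, 0.8, 0.1], labels)); // 0.75
```

Seen this way, a value close to 0.5 means the model's scores barely separate failing parts from good ones.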
CrossValidation and hyperparameter tuning

As explained before, a common step in machine learning is cross-validating your model using testing data against training data and also tweaking the knobs of your machine learning algorithms. Let's use Apache SparkML in order to do this for us, fully automated! First, we have to configure the parameter map and CrossValidator:

    import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

    var paramGrid = new ParamGridBuilder()
      .addGrid(rf.numTrees, 3 :: 5 :: 10 :: 30 :: 50 :: 70 :: 100 :: 150 :: Nil)
      .addGrid(rf.featureSubsetStrategy, "auto" :: "all" :: "sqrt" :: "log2" :: "onethird" :: Nil)
      .addGrid(rf.impurity, "gini" :: "entropy" :: Nil)
      .addGrid(rf.maxBins, 2 :: 5 :: 10 :: 15 :: 20 :: 25 :: 30 :: Nil)
      .addGrid(rf.maxDepth, 3 :: 5 :: 10 :: 15 :: 20 :: 25 :: 30 :: Nil)
      .build()

    var crossValidator = new CrossValidator()
      .setEstimator(new Pipeline().setStages(transformers :+ rf))
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(5)
      .setEvaluator(evaluator)

    var crossValidatorModel = crossValidator.fit(df_notnull)
    var newPredictions = crossValidatorModel.transform(df_notnull)

The org.apache.spark.ml.tuning.ParamGridBuilder is used in order to define the hyperparameter space where the CrossValidator has to search, and finally, the org.apache.spark.ml.tuning.CrossValidator takes our Pipeline, the hyperparameter space of our RandomForest classifier, and the number of folds for the CrossValidation as parameters. Now, as usual, we just need to call fit and transform on the CrossValidator, and it will basically run our Pipeline multiple times and return the model that performs best. Do you know how many different models are trained? Well, we have five folds in the CrossValidation and a five-dimensional hyperparameter space with cardinalities between two and eight, so let's do the math: 5 * 8 * 5 * 2 * 7 * 7 = 19600 times!
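The count above can be reproduced with a few lines of plain JavaScript. The per-parameter sizes are read off the grid in this tutorial; the variable names are only for illustration.

```javascript
// Number of training runs implied by the grid in this tutorial.
// Sizes are counted from the addGrid calls; variable names are illustrative.
const numFolds = 5;
const gridSizes = {
  numTrees: 8,              // 3, 5, 10, 30, 50, 70, 100, 150
  featureSubsetStrategy: 5, // auto, all, sqrt, log2, onethird
  impurity: 2,              // gini, entropy
  maxBins: 7,               // 2, 5, 10, 15, 20, 25, 30
  maxDepth: 7,              // 3, 5, 10, 15, 20, 25, 30
};

// Every combination of parameter values is one candidate model.
const combinations = Object.values(gridSizes).reduce((a, b) => a * b, 1);
// Each candidate is trained once per fold.
const trainingRuns = numFolds * combinations;

console.log(combinations); // 3920
console.log(trainingRuns); // 19600
```

Because the sizes multiply, adding even one more value to a single parameter noticeably increases the total, which is worth keeping in mind before widening the grid.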
Using the evaluator to assess the quality of the cross-validated and tuned model

Now that we've optimized our Pipeline in a fully automatic fashion, let's see how our best model can be obtained:

    var bestPipelineModel = crossValidatorModel.bestModel.asInstanceOf[PipelineModel]
    var stages = bestPipelineModel.stages

    import org.apache.spark.ml.classification.RandomForestClassificationModel
    val rfStage = stages(stages.length-1).asInstanceOf[RandomForestClassificationModel]

    rfStage.getNumTrees
    rfStage.getFeatureSubsetStrategy
    rfStage.getImpurity
    rfStage.getMaxBins
    rfStage.getMaxDepth

The crossValidatorModel.bestModel code basically returns the best Pipeline. Now we use bestPipelineModel.stages to obtain the individual stages and obtain the tuned RandomForestClassificationModel using stages(stages.length-1).asInstanceOf[RandomForestClassificationModel]. Note that stages.length-1 addresses the last stage in the Pipeline, which is our RandomForestClassifier. So now, we can basically run evaluator using the best model and see how it performs:

You might have noticed that 0.5362224872557545 is less than the 0.5424418446501833 we obtained before. So why is this the case? Actually, this time we used cross-validation, which means that the model is less likely to overfit, and therefore the score is a bit lower. So let's take a look at the parameters of the best model:

Note that we've limited the hyperparameter space, so numTrees, maxBins, and maxDepth have been limited to five, and bigger trees will most likely perform better. So feel free to play around with this code and add features, and also use a bigger hyperparameter space, say, bigger trees.

Finally, we've applied the concepts that we discussed on a real dataset from a Kaggle competition, which is a good starting point for your own machine learning project with Apache SparkML.
If you found our post useful, do check out this book Mastering Apache Spark 2.x - Second Edition to know more about advanced analytics on your Big Data with latest Apache Spark 2.x.    

Packt
31 Dec 2009
3 min read

Five Years of Ubuntu

Community

If there is any one word that could sum up Ubuntu, it would be Community. Even the definition of the word "Ubuntu" makes reference to community, and how the betterment of the individual and community are interconnected. Nearly everyone I've met through Ubuntu in the last five years cites the community as the single major reason for their use. In many aspects, Ubuntu is technically equal to its competitors, but nowhere else will you find the same level of community support. Nowhere else will you find the same level of friendship and positive atmosphere. Over the last five years I have tested many alternate Linux distributions, and I have yet to find any other community that is as accepting, or that goes out of its way to invite you into the group. The Ubuntu community has so many people actively engaged in trying to provide a positive environment; it is truly amazing. If you find yourself at an Ubuntu event, don't be surprised if you actually see hugs! And watch out, it is contagious!

The source of this positive community atmosphere is the guidance of the Ubuntu Code of Conduct. From the beginning, Ubuntu has had a Code of Conduct, and active contributing community members are expected to understand and sign the document. By outlining in clear terms what is expected of a community member, and keeping it at the forefront of members' minds, Ubuntu is able to foster an atmosphere of mutual respect. This one simple document is what sets Ubuntu apart from every other online community. Sure, you can find a community of contributors to any project, but nowhere else will you find the same respect and welcoming atmosphere that you'll find in Ubuntu.

Simplicity

Ubuntu introduced a level of simplicity to the Linux environment that hadn't been seen before. What has historically been a hobbyist operating system has been tuned and refined to the point that truly anyone can install and use it.
Before Ubuntu, a user would need to be familiar with partitioning and package selections (at minimum!) in order to install a machine. With Ubuntu, a machine can be installed with a very sane set of default tools without any technical decision making from the user. With tools such as Wubi, now included on Ubuntu installation disks, a user is able to install Ubuntu alongside an existing Windows installation and have the two peacefully coexist. Giving the user the ability to try before they buy, and to leave current installations and data intact, has also drastically improved the userbase and adoption rate.

Ubuntu was also the first to promote a single CD installation. Where most other Linux distributions were offering DVD-based installations, Ubuntu packaged a core selection onto a single CD and offered it for download. They even offered to send free CDs through the mail to anyone who requested them! It doesn't get much simpler than that! The truly amazing thing about this simplicity is that, despite being limited to a single 700MB CD, Ubuntu comes with a plethora of software. The base installation provides a comprehensive desktop environment, including a full office suite, web browser, mail client, audio and video tools, and more! This emphasis on delivering a highly refined, usable environment from the start is a very important aspect of Ubuntu.